Data Science

How To Select Multiple Columns Using Grep & R

Why you need to be using Grep when programming with R.

There’s a reason that grep is included in most if not all programming language to this day 44 years later from creation. It’s useful and simple to use. Below is an example of using grep to make selecting multiple columns in R simple and easy to read.

The dataset below has the following column names.

names(data) # Column Names
 [1] "fips"                 "state"                "county"               "metro_area"          
 [5] "population"           "med_hh_income"        "poverty_rate"         "population_lowaccess"
 [9] "lowincome_lowaccess"  "no_vehicle_lowaccess" "s_grocery"            "s_supermarket"       
[13] "s_convenience"        "s_specialty"          "s_farmers_market"     "r_fastfood"          
[17] "r_full_service"      

How can we select only the columns we need to work with?

  • metro_area
  • med_hh_income
  • poverty_rate
  • population_lowaccess
  • lowincome_lowaccess
  • no_vehicle_lowaccess
  • s_grocery
  • s_supermarket
  • s_convenience
  • s_specialty
  • s_farmers_market
  • r_fastfood
  • r_full_service

We can tell R exactly by listing each column as below

data[c("metro_area","med_hh_income", "poverty_rate", "population_lowaccess", "lowincome_lowaccess", "no_vehicle_lowaccess","s_grocery","s_supermarket","s_convenience","s_specialty","s_farmers_market", "r_fastfood", "r_full_service")]

OR

We can tell R where each column we want is.

data[c(4,6,7:17)]

First, writing out each individual column is time consuming and chances are you’re going to make a typo (I did when writing it). Second option we have to first figure out where the columns are located to then tell R. Well looking at the columns we are trying to access vs the others theirs a specific difference. All these columns have a “_” located in there name, and we can use regular expressions (grep) to select these.

data[grep("_", names(data))])

FYI… to get the column locations you can actually use…

grep("_", names(data))
[1]  4  6  7  8  9 10 11 12 13 14 15 16 17

You will rarely have a regular expression as easy at “_” to select multiple columns, a very useful resource to learn and practice is https://regexr.com

Data was obtained from https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data/

Exploring HR Employee Attrition and Performance with R

Based on IBM’s fictional data set created by their data scientists.

Introduction:
Employee Attrition is when an employee leaves a company due to normal means, (loss of customers, retirement, and resignation), and there is not someone to fill the vacancy. Can a company identify employee’s that are likely to leave a company?
A company with a high employee attrition rate is a good sign of underlying problems and can affect a company in a very negative way. One such way is the cost related to finding and training a replacement, as well as the possible strain it can put on other workers that in the meantime have to cover.

Preprocessing:
This dataset was produced by IBM and has just under 1500 observations of 31 different variables including attrition. 4 of the variables (EmployeeNumber, Over18, EmployeeCount, StandardHours) have the same value for all observations. Due to this, we can drop these since they won’t be helpful for our model. Next, the column “ï..Age” was renamed to “Age” to make calling this variable simpler. Finally, for build and testing models, the dataset was split into a training and test set at 70/30.

Initial Analysis:
Looking at the overall employee attrition rate for the entire dataset we can see it’s ~19%. Typically, a goal for a company is to keep this rate to ~10% and this dataset shows almost double that rate.

Here we show the influence of all factors on the employee attrition rate which shows the influence levels are similar. However, we can take the top factors and explore those in depth.

Top Factor Analysis Findings:

Factor Variable Importance
Total Working Years 0.6564557
Years At Company 0.6525268
Overtime 0.6505954
Years In Current Role 0.6480052
Monthly Income 0.6456590
Job Level 0.6414233

Total Working Years:

Looking at the total amount of years an employee has been in the workforce (at any job) there are two significant points to be found. First, in the initial 3 years of working, the data shows the attrition rate of 50%. This is expected as people tend to start at an entry-level job and get their first job experience before moving on. The rate drops off in the following amount of years until reaching 37 – 40 years in the job force. Here we have just under ~75% attrition rate which can be best explained as employees retiring since 37 years from 18 is 55 years old, the age people usually retire at.

Years at the company: 
The findings related to the number of years at the company and employee attrition followed the same trend as total working years did but with the rate lower for each. The reasoning behind this is most likely the same as total working years, with early on moving around. Then, staying put and finally retiring.

Overtime:
Employees that work overtime have over double the attrition rate (~25%), then those who don’t (~10). A possible reason behind this could be that some employees can get “burned out” working overtime. Possibly want to spend time outside of work and end up looking for a new job.

Monthly Income:
As expected employees with a higher monthly income were less likely to leave a company. Specifically, in the human resource and research and development departments. The sales department was interesting in that monthly income wasn’t as big a factor in attrition.

Model Building:

Gradient Boosting Model (GBM):
Using a GBM model with default parameters, the best training model came at 88%, at 150 trees. Using this model, we can create a prediction using the test data. The accuracy of this prediction was 87% which being very close to the training accuracy shows this is correct.

Interaction.depth n.Trees Accuracy Kappa
1 150 .878 .397

Classification Trees:
The classification tree built with default parameters showed a slightly lower overall accuracy. The training accuracy came to 82% and the prediction was 83%.

dt_model<- train(Attrition ~ ., data = attrition_train, method = "rpart")
cp Accuracy Kappa
0.039 .82 .24

When building a classification tree with only the top 5 factors, the accuracy fell in between the other two models at, 84% training and prediction.

dt_model1<- train(Attrition ~ TotalWorkingYears + YearsAtCompany + OverTime + YearsInCurrentRole + MonthlyIncome + JobLevel, data = attrition_train, method = "rpart")

cp Accuracy Kappa
0.0301 .84 .19

Recommendation:

As we can see from this data analysis, the biggest factor to employee attrition is the length of time in the workforce either at the same company or not. However, I would recommend looking deeper into employees that work overtime and getting their reasons for leaving. Possibly, have meetings with overtime workers and find out if they need help. For example, if they are working at their capacity and still having to work overtime then might be time and possibly even cheaper to hire extra help.
I would also recommend for the company to continue to collect this same type of data at an annual basis and run the models to find those employees that are more likely to leave. Once you have the list of employees, set up reviews and see if their’s a way to help them out or even you may catch, worker issues early on. Lastly, a further review into the sales department is warranted with the high attrition rate.

Introduction to Data Analysis with R

Using Basic Data Analysis functions on the mtcars dataset

Let’s Start

# Copying mtcars data frame to our new data frame myCars
myCars <- mtcars

Which car has the highest horsepower (hp) ? 

#find and display the car with the highest horsepower index <- which.max(myCars$hp)
# Display the car name along with the rest of the row myCars[index,]
##                mpg cyl disp hp drat  wt  qsec vs am  gear carb ## Maserati Bora  15   8  301 335 3.54 3.57 14.6  0  1    5    8

Maserati Bora has the highest horsepower at 335

Exploring miles per gallon (mpg) of the cars

# find and display the car with the highest mpg
index<-which.max(myCars$mpg)
myCars[index,]
##                 mpg cyl disp hp drat    wt qsec vs am gear carb ## Toyota Corolla 33.9   4 71.1 65 4.22 1.835 19.9  1  1    4    1
# Creating a sorted dataframe, based on mpg
highMPGcars <- myCars[ order(-myCars$mpg),]
head(highMPGcars)
mpg cyl  disp  hp drat    wt  qsec vs am gear carb ## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 ## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 ## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 ## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

Which car has the “best” combination of mpg and hp?

# Best car combination of mpg and hp, where mpg and hp must be given equal # weight
bestCombo<- myCars$hp / myCars$mpg
myCars[which.max(bestCombo),]
##                mpg cyl disp  hp drat   wt qsec vs am gear carb ## Maserati Bora  15   8  301 335 3.54 3.57 14.6  0  1    5    8

The Maserati Bora hp to mpg is ~ 22hp per gallon 

Manipulating Data Frames in R

Learn To Manipulate Data Frames Using The “mtcars” Dataset

Task 1: Create a new column to find Displacement per Cylinder 

Create a new variable (DisplacementPerCylinder), to calculate the total displacement per cylinder in cubic inches for each vehicle from the mtcars dataset.

# "str" allows you to display the internal structure of an R object
str(mtcars) 
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# As a backup we can copy the original data frame into a new one to work with
# That way if there is any issues we can go back

my_mtcars <- mtcars
# Calculate Displacement Per Cylinder by dividing the values (disp) and (cyl)

my_mtcars$DisplacementPerCylinder <- my_mtcars$disp / my_mtcars$cyl

# Report a summary of the variable
summary(my_mtcars$DisplacementPerCylinder)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.77   26.92   34.48   35.03   43.19   59.00

Task 2: Create your own data frame

Gather data from family & friends on the number of pets they have, the birth order they are in their family and the number of siblings. 

# Family/Friends ID
friendID  <- c(1, 2, 3, 4, 5)

# Number of pets they have
Pets <- c(4, 4, 2, 3, 1)

# The birth order they are in their family
Order <- c(1, 2, 2, 1, 1)

# Number of Siblings 
Siblings <- c(2, 2, 1, 2, 0)

# Binding the vectors into a data frame called myFriends
myFriends <- data.frame(friendID, + Pets, + Order, + Siblings)

# Command to report the structure of the data frame myFriends
str(myFriends)

## 'data.frame':    5 obs. of  4 variables:
##  $ friendID  : num  1 2 3 4 5
##  $ X.Pets    : num  4 4 2 3 1
##  $ X.Order   : num  1 2 2 1 1
##  $ X.Siblings: num  2 2 1 2 0
# Rename the columns to get rid of the "x." in front of the names
colnames(myFriends) <- c("FriendID", "Pets", "Order", "Siblings")
str(myFriends)
## 'data.frame':    5 obs. of  4 variables:
##  $ FriendID: num  1 2 3 4 5
##  $ Pets    : num  4 4 2 3 1
##  $ Order   : num  1 2 2 1 1
##  $ Siblings: num  2 2 1 2 0
# Listing the values of the vector friendID from the data frame myFriends
myFriends$FriendID 
## [1] 1 2 3 4 5
# Listing the values of the vector Pets from the data frame myFriends
myFriends$Pets
## [1] 4 4 2 3 1
# Listing the values of the vector Order from the data frame myFriends
myFriends$Order
## [1] 1 2 2 1 1
# Listing the values of the vector Siblings from the dataframe myFriends
myFriends$Siblings
# [1] 2 2 1 2 0
# Report a summary of the dataframe
summary(myFriends)
##     FriendID      Pets         Order        Siblings  
##  Min.   :1   Min.   :1.0   Min.   :1.0   Min.   :0.0  
##  1st Qu.:2   1st Qu.:2.0   1st Qu.:1.0   1st Qu.:1.0  
##  Median :3   Median :3.0   Median :1.0   Median :2.0  
##  Mean   :3   Mean   :2.8   Mean   :1.4   Mean   :1.4  
##  3rd Qu.:4   3rd Qu.:4.0   3rd Qu.:2.0   3rd Qu.:2.0  
##  Max.   :5   Max.   :4.0   Max.   :2.0   Max.   :2.0