Data Science

Turing In-Complete (part 1)

Before people built machines to carry out the mathematical work we now call computation, we humans were the “computers”, doing those calculations by hand. That is where the label “manually calculated” comes from. Machines have only held the title for a short period compared with how long man has existed in the current evolutionary state.

This technology goes back much farther than our most popular desktop PCs, laptops, tablets, or smartphones. Major developments in the twentieth century progressed at a very rapid pace, not with the help of extraterrestrial beings, but through the work of some very brilliant humans. Maybe you could make a case for “math” from outer space in ancient history, and you’d be technically close when you factor in how the orbits of planets and the positions of stars inspired the desire to figure out what was seen in the skies.

The abacus is the earliest device currently known for crunching numbers. The Sumerian abacus is thousands of years old and noted throughout ancient history. This isn’t what I would regard as an early computer, but it was and still is an impressive design.

The Analytical Engine, an improvement over the Difference Engine (both designed by Charles Babbage in the early 1800s), could be considered the foundation of modern computing. Ada King, Countess of Lovelace, created the first computer program for the Analytical Engine, or what would have been the first had the machine been completed. The design was completed, but not a fully functional machine. So the design for the device came before the actual machine, as did a program that could have run on it.

I always felt that this part of history was a bit murky, but within the fog, there was a spark. The point is that this was a starting point that others could build upon.

Could the Analytical Engine be categorized as the first Turing Complete machine?

If we consider all modern programming languages Turing-complete, could the Analytical Engine have run a program to solve any calculation that was previously performed by hand? In theory, possibly; in practical application, I am skeptical.

To weigh the current concern about Artificial Intelligence taking over every aspect of man’s future, in both a positive and a negative light, you should look back through its short history of advancements. Computers have come a long way, farther than their early creators fully envisioned, but it is still a very short time compared to man’s intellectual development.

Turing completeness requires a system to make decisions based on data manipulated by rule sets. Remember those “IF”, “AND”, and “GOTO” statements from BASIC (Beginner’s All-Purpose Symbolic Instruction Code)? Maybe you remember QBasic, the ’90s version. If you don’t, no problem. Just know that there was some amazing progress in computer development in the 1950s and 1960s that used instructions which could be considered Turing-complete, theoretically if not always in practice. This may not be the best way to explain this, but I think I’m in the ballpark.

I’m not disregarding Turing’s calculating machine design from the ’30s, but things started to ramp up in the ’50s.

Consider the fact that we still program in Fortran and LISP, both from the 1950s. Yes, I should also mention assembly language, which dates back to the late ’40s.

You can look back at Remington Rand’s Math-Matic (AT-3) from 1957, a compiler and programming language for the UNIVAC I. Charles Katz led the team tasked with developing the “Math-Matic” programming language under the direction of Grace Hopper, who was notable in the movement towards “machine-independent” programming languages, which helped lead to the development of high-level programming languages.

This was all done in the 1950s and 1960s, the era of big computers like the DATATRON 200 series, weighing in at over 3000 lbs and working with a word size of 10 decimal digits. All this amazing computer development would later lead to the machines we now fear. It’s remarkable to think we would later spin up the development of AI, which first required sophisticated computer code that grew out of these early systems.

The history of computers and programming languages is very interesting, and it is usually not referenced enough when we look at how much we now depend on them. Man built these machines with the intent to improve his condition, and in most cases they have. What may be getting lost through time is an appreciation of all those who contributed over the last two centuries to the existence and development of this technology.

It continues today, and it still requires some very brilliant minds to continue the advancement for the good of man. This is just the beginning. We are still in the early stages of computing, and we are still the computers.

Creating Excel Workbooks with multiple sheets in R

Create Excel Workbooks

Generally, when doing anything in R I work with .csv files; they’re fast and straightforward to use. However, there are times when I need to output a bunch of them, and having to open each one individually can be a pain for anyone. In that case, it’s much better to create a workbook in which each of the .csv files you would have created becomes a separate sheet.



Below is a simple script I use frequently that gets the job done. Also included is the initial process of creating dummy data to outline the process.

EXAMPLE CODE:

Libraries used

library(tidyverse)
library(openxlsx)

Creating example files to work with

# Make sure the output directory exists before writing to it
dir.create("Data", showWarnings = FALSE)
# data.frame() keeps Stock numeric; cbind() would turn it into text
products <- c("Monitor", "Laptop", "Keyboards", "Mice")
Stock <- c(20, 10, 25, 50)
Computer_Supplies <- data.frame(products, Stock)
products <- c("Packs of Paper", "Staples")
Stock <- c(100, 35)
Office_Supplies <- data.frame(products, Stock)
# Write the files to our directory
write.csv(Computer_Supplies, "Data/ComputerSupplies.csv", row.names = FALSE)
write.csv(Office_Supplies, "Data/OfficeSupplies.csv", row.names = FALSE)

Point to the directory your files are located in (.csv here) and read them all into a single table

# Read a file and record its file name in a new column
read_filename <- function(fname) {
  read_csv(fname, col_names = TRUE) %>%
    mutate(filename = fname)
}
# pattern is a regular expression, not a glob, so match the ".csv" extension explicitly
tbl <-
  list.files(path = "Data/",
             pattern = "\\.csv$",
             full.names = TRUE) %>%
  map_df(~read_filename(.))

Removing path from the file names

*Note: The maximum length of a worksheet’s name is 31 characters

# Keep only the file name: strip the directory (and any extra "/") and the extension,
# since "/" is not allowed in an Excel sheet name
tbl$filename <- tools::file_path_sans_ext(basename(tbl$filename))

Split the “tbl” object into individual lists

mylist <- tbl %>% split(.$filename)
names(mylist)
## [1] "ComputerSupplies" "OfficeSupplies"

Creating an Excel workbook and having each CSV file be a separate sheet

wb <- createWorkbook()
# invisible() hides the list of return values that lapply() would otherwise print
invisible(lapply(seq_along(mylist), function(i){
  addWorksheet(wb = wb, sheetName = names(mylist)[i])
  # Drop the last column (filename) before writing the sheet
  writeData(wb, sheet = i, mylist[[i]][-length(mylist[[i]])])
}))
# Save workbook
saveWorkbook(wb, "test.xlsx", overwrite = TRUE)

Reading in sheets from an Excel file

(The one we just created)

df_ComputerSupplies <- read.xlsx("test.xlsx", sheet = 1)

Loading and adding a new sheet to an already existing Excel workbook

wb <- loadWorkbook("test.xlsx")
names(wb)
## [1] "ComputerSupplies" "OfficeSupplies"
addWorksheet(wb, "New Sheet Name")
names(wb)
## [1] "ComputerSupplies" "OfficeSupplies" "New Sheet Name"

Sort

The sort command sorts a file line by line. In the C locale, lines starting with a number go first, followed by the remaining lines in alphabetical order, with uppercase letters appearing before lowercase ones. Other locales may order letters differently.

The example uses a file named “testsort”, displayed here with cat.

~/Test>cat testsort
A line 1
a line 2
8 line 3
line 4
5 line 5
~/Test>sort testsort
5 line 5
8 line 3
A line 1
a line 2
line 4
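
The ordering shown above depends on the locale. As a minimal sketch, assuming a POSIX shell and a throwaway temporary file (the contents mirror testsort; the variable names are only for illustration), setting LC_ALL=C forces plain byte order, which puts digits first, then uppercase, then lowercase:

```shell
# Recreate the example file in a temporary location
tmpfile=$(mktemp)
printf '%s\n' 'A line 1' 'a line 2' '8 line 3' 'line 4' '5 line 5' > "$tmpfile"

# LC_ALL=C makes sort compare raw bytes: digits < uppercase < lowercase
c_sorted=$(LC_ALL=C sort "$tmpfile")
printf '%s\n' "$c_sorted"

rm -f "$tmpfile"
```

In a locale such as en_US.UTF-8 the letters may be ordered differently, which is why the same file can sort differently on two machines.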

The -R option sorts by a random hash of the keys, so the order changes from run to run

~/Test>sort -R testsort
a line 2
5 line 5
A line 1
8 line 3
line 4
~/Test>sort -R testsort
5 line 5
A line 1
a line 2
line 4
8 line 3
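
Because -R orders lines by a hash of each key rather than shuffling positions, identical lines always end up next to each other. A small sketch, assuming GNU or BSD sort (the -R option is not part of POSIX) and made-up input:

```shell
# Four lines, two distinct values, deliberately interleaved
shuffled=$(printf '%s\n' 'b' 'a' 'b' 'a' | sort -R)
printf '%s\n' "$shuffled"

# uniq only collapses adjacent duplicates, so a count of 2 means
# sort -R grouped the identical lines together
distinct_runs=$(printf '%s\n' "$shuffled" | uniq | wc -l | tr -d ' ')
echo "runs of identical lines: $distinct_runs"
```

If a true shuffle is wanted, where duplicates can land anywhere, GNU coreutils also provides the shuf command.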

Egrep & Fgrep

EGREP:

The egrep command is the same as running grep -E. It searches for a pattern using extended regular expressions. Recent releases of GNU grep mark both egrep and fgrep as obsolescent in favor of grep -E and grep -F.

Terry@f:~/FinderDing>cat testsort
A line 1	
a line 2	
8 line 3	
line 4
5 line 5	
Terry@f:~/FinderDing>egrep '^[a-zA-Z]' testsort	
A line 1
a line 2
line 4

*Shows lines that start with a letter of the alphabet

Terry@f:~/FinderDing>cat html
<!DOCTYPE html>
<html>	
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
Terry@f:~/FinderDing>egrep "My|first" html
<h1>My First Heading</h1>
<p>My first paragraph.</p>

*Finds lines in the html file that contain either “My” or “first”
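
What the extended syntax buys you here is the | operator. The sketch below, which assumes a POSIX shell and a temporary file holding three lines from the html example, runs the same pattern with and without -E to show the difference:

```shell
tmpfile=$(mktemp)
printf '%s\n' '<h1>My First Heading</h1>' '<p>My first paragraph.</p>' '<body>' > "$tmpfile"

# Extended syntax: | means "or", so both content lines match
ere_count=$(grep -E -c 'My|first' "$tmpfile")

# Basic syntax: | is an ordinary character, so grep looks for the
# literal text "My|first" and finds nothing (-c still prints 0)
bre_count=$(grep -c 'My|first' "$tmpfile" || true)

echo "grep -E: $ere_count line(s), plain grep: $bre_count line(s)"
rm -f "$tmpfile"
```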

FGREP:

The fgrep command is the same as running grep -F. It searches for fixed character strings in a file, which means regular expressions can’t be used.

Terry@f:~/FinderDing>fgrep "My" html
<h1>My First Heading</h1>
<p>My first paragraph.</p>
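
The practical effect of fixed-string matching shows up when the pattern contains a regex metacharacter. A minimal sketch, assuming a POSIX shell and a made-up three-line file:

```shell
tmpfile=$(mktemp)
printf '%s\n' 'a.c' 'abc' 'xyz' > "$tmpfile"

# As a regular expression, "." matches any character: a.c and abc both match
regex_count=$(grep -c 'a.c' "$tmpfile")

# With -F the pattern is a fixed string: only the literal a.c matches
fixed_count=$(grep -F -c 'a.c' "$tmpfile")

echo "regex: $regex_count, fixed: $fixed_count"
rm -f "$tmpfile"
```

This makes grep -F a convenient choice when searching for text such as file names or IP addresses, where dots should be taken literally.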

Exploring HR Employee Attrition and Performance with R

Based on IBM’s fictional data set created by their data scientists.

Introduction:
Employee attrition is when an employee leaves a company through normal means (loss of customers, retirement, or resignation) and there is no one to fill the vacancy. Can a company identify employees who are likely to leave?
A high employee attrition rate is often a sign of underlying problems and can affect a company in a very negative way. One such effect is the cost of finding and training a replacement, as well as the strain it can put on the other workers who have to cover in the meantime.

Preprocessing:
This dataset was produced by IBM and has just under 1500 observations of 31 different variables, including attrition. Four of the variables (EmployeeNumber, Over18, EmployeeCount, StandardHours) carry no useful information: Over18, EmployeeCount, and StandardHours have the same value for every observation, and EmployeeNumber is only an identifier. Because of this, we can drop them, since they won’t be helpful for our model. Next, the column “ï..Age” was renamed to “Age” to make calling this variable simpler. Finally, for building and testing models, the dataset was split into training and test sets at 70/30.

Initial Analysis:
Looking at the overall employee attrition rate for the entire dataset, we can see it’s ~19%. Typically, a company’s goal is to keep this rate at ~10%, and this dataset shows almost double that.

Here we show the influence of all factors on the employee attrition rate which shows the influence levels are similar. However, we can take the top factors and explore those in depth.

Top Factor Analysis Findings:

Factor                   Variable Importance
Total Working Years      0.6564557
Years At Company         0.6525268
Overtime                 0.6505954
Years In Current Role    0.6480052
Monthly Income           0.6456590
Job Level                0.6414233

Total Working Years:

Looking at the total number of years an employee has been in the workforce (at any job), there are two significant points. First, in the initial 3 years of working, the data shows an attrition rate of 50%. This is expected, as people tend to start at an entry-level job and get their first job experience before moving on. The rate drops off in the following years until reaching 37 to 40 years in the workforce. Here the attrition rate is just under 75%, which is best explained by employees retiring: 37 years of work starting at age 18 puts an employee at 55, an age at which people often retire.

Years at the company: 
The findings for the number of years at the company followed the same trend as total working years, but with a lower rate at each point. The reasoning is most likely the same: employees move around early on, then stay put, and finally retire.

Overtime:
Employees who work overtime have more than double the attrition rate (~25%) of those who don’t (~10%). A possible reason is that some employees get “burned out” working overtime, want to spend more time outside of work, and end up looking for a new job.

Monthly Income:
As expected, employees with a higher monthly income were less likely to leave the company, specifically in the human resources and research and development departments. The sales department was interesting in that monthly income wasn’t as big a factor in attrition.

Model Building:

Gradient Boosting Model (GBM):
Using a GBM with default parameters, the best training model reached 88% accuracy at 150 trees. Using this model, we can create predictions for the test data. The accuracy of these predictions was 87%, which is very close to the training accuracy and suggests the model is not overfitting.

Interaction.depth  n.Trees  Accuracy  Kappa
1                  150      .878      .397

Classification Trees:
The classification tree built with default parameters showed a slightly lower overall accuracy. The training accuracy came to 82% and the prediction was 83%.

dt_model <- train(Attrition ~ ., data = attrition_train, method = "rpart")

cp      Accuracy  Kappa
0.039   .82       .24

When building a classification tree with only the top 6 factors, the accuracy fell in between the other two models, at 84% for both training and prediction.

dt_model1<- train(Attrition ~ TotalWorkingYears + YearsAtCompany + OverTime + YearsInCurrentRole + MonthlyIncome + JobLevel, data = attrition_train, method = "rpart")

cp      Accuracy  Kappa
0.0301  .84       .19

Recommendation:

As we can see from this data analysis, the biggest factor in employee attrition is the length of time in the workforce, whether at the same company or not. However, I would recommend looking deeper into employees who work overtime and getting their reasons for leaving. Consider holding meetings with overtime workers to find out if they need help. For example, if they are working at capacity and still having to work overtime, then it might be time, and possibly even cheaper, to hire extra help.
I would also recommend that the company continue to collect this same type of data on an annual basis and run the models to find the employees who are more likely to leave. Once you have the list of employees, set up reviews and see if there’s a way to help them out; you may even catch worker issues early on. Lastly, a further review of the sales department is warranted given its high attrition rate.