# Absenteeism in manufacturing {#absenteeism-MFG}
```{r absenteism-MFG, include=FALSE}
chap <- 21
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
tidy = FALSE,
out.width = '\\textwidth',
fig.height = 4,
warning = FALSE
)
options(scipen = 99, digits = 3)
# Set random number generator seed value for replicable pseudorandomness. Why 76?
# https://www.youtube.com/watch?v=xjJ7FheCkCU
set.seed(76)
```
Employee absenteeism is a significant problem for most organizations. Absenteeism excludes paid leave and occasions where an employer has granted an employee time off.
If you’re a manager, supervisor, or team leader, you’ll likely have experienced cases of absenteeism. You should know that staff will be absent from time to time, such as for illness, jury duty, or bereavement. However, when absenteeism is frequent and excessive, it becomes a problem.
Absenteeism at the workplace affects both employers and employees alike.
Some consequences for employers include:
* Reduced productivity levels.
* High administration costs.
* Increased labour costs if you hire replacement temporary workers.
* Understaffing which could lead to poor customer service.
* Poor morale among colleagues. This will be particularly prevalent if employees constantly have to fill in for absent staff and if they don’t see any repercussions for absence.
To address this issue, HR professionals are encouraged to develop a business case and to devise and implement plans for managing employee attendance in their organizations. The following topics related to managing employee attendance are presented in this article:
* The direct and indirect costs of absenteeism.
* Collecting attendance data for the organization.
* Measuring the cost of absenteeism.
* Calculating rates of absenteeism (a quick sketch of both calculations follows this list).
* Strategies aimed at addressing the organization's particular attendance issues.
* Legal issues that may be implicated in designing and implementing absence control strategies.
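For orientation, a common back-of-the-envelope calculation of an absence rate and its direct cost looks roughly like the following. This is a minimal sketch with made-up figures, not MFG data:
```{r eval=FALSE}
# Illustrative only: hypothetical figures, not MFG data
absent_hours   <- 120       # hours lost by an employee over the year
standard_hours <- 52 * 40   # 2080 standard working hours per year
hourly_cost    <- 25        # hypothetical fully loaded hourly cost

absence_rate <- absent_hours / standard_hours * 100  # rate as a percentage
direct_cost  <- absent_hours * hourly_cost           # direct cost of the lost hours

absence_rate
direct_cost
```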
## 1. Define the goal (or HR business problem/issue)
A hypothetical company MFG has decided that it needs to do something about absenteeism.
To start with, it wants answers to the following questions:
* __What is its rate of absenteeism?__
* __Does anyone have excessive absenteeism?__
* __Is it the same across the organization?__
* __Does it vary by gender?__
* __Does it vary by length of service or age?__ Its initial guess is that age and length of service may be related to absenteeism rates.
* __Can it predict next year's absenteeism?__
* __If so, how well can it predict?__
* __Can we reduce our absenteeism?__
If they can make future People Management decisions "driven" by what the data is telling them, then they will feel they have started the People Analytics journey.
## 2. Collect and Manage Data
Let us suppose this is a skunkworks project. Formal separate resources have not been identified for this initiative.
Only an initial look at recent data is possible. The HRIS system is able to provide some rudimentary information covering absences for 2015 only.
It was able to generate the following information as a CSV file (comma-separated values):
* EmployeeNumber
* Surname
* GivenName
* Gender
* City
* JobTitle
* DepartmentName
* StoreLocation
* Division
* Age
* LengthService
* AbsentHours
* BusinessUnit
Ensure all needed libraries are installed
```{r include=FALSE}
# Install any missing packages, then load the ones used throughout this chapter
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(RcmdrMisc)) install.packages("RcmdrMisc")
if(!require(car)) install.packages("car") # provides Anova(), used below
if(!require(rattle)) install.packages("rattle")
if(!require(scatterplot3d)) install.packages("scatterplot3d")
if(!require(caret)) install.packages("caret")
if(!require(modelr)) install.packages("modelr")
library(tidyverse)
library(modelr)
```
### Let's read in the data provided
```{r eval=FALSE}
MFGEmployees <- read_csv("https://hranalytics.netlify.com/data/MFGEmployees4.csv")
```
```{r read_data8, echo=FALSE, warning=FALSE, message=FALSE}
MFGEmployees <- read_csv("data/MFGEmployees4.csv")
```
The first thing we should do is check the quality of the data. Data will rarely be clean or perfect when we receive it. Questionable data should either be corrected (preferred) or deleted.
```{r}
summary(MFGEmployees)
```
The only thing that stands out initially is that Age contains some questionable values: someone who is 3 and someone who is 77. The range for the purposes of this example should be 18 to 65. Normally you would want to clean the data by obtaining the correct information and then fixing it.
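Before deleting anything, a few additional checks can help surface other data-quality problems. Here is a minimal sketch, assuming the column names listed above:
```{r eval=FALSE}
# Count missing values in every column
MFGEmployees %>%
  summarise(across(everything(), ~ sum(is.na(.))))

# Check for duplicated employee numbers
MFGEmployees %>%
  count(EmployeeNumber) %>%
  filter(n > 1)

# How many records fall outside the plausible 18 to 65 age range?
MFGEmployees %>%
  filter(Age < 18 | Age > 65) %>%
  count()
```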
### For expediency we will delete the problem records in the dataset.
```{r}
MFGEmployees<-
MFGEmployees %>%
filter(Age>=18 & Age<=65)
```
Now let's summarize again with the cleaned-up data.
```{r}
summary(MFGEmployees)
```
### Transform the data
Now calculate the absenteeism rate by dividing the absent hours by the total standard hours for the year (52 weeks/year * 40 hours/week = 2080 hours/year).
```{r}
MFGEmployees <-
MFGEmployees %>%
mutate(AbsenceRate = AbsentHours /2080*100)
str(MFGEmployees)
```
We can now see our metric AbsenceRate has been calculated and created.
### Explore The Data
Part of collecting and managing data is exploratory data analysis.
Let's start with bar graphs of some of the categorical data.
```{r}
MFGEmployees %>%
ggplot() +
aes(x=BusinessUnit) +
geom_bar() +
labs(x="Business Units",
y="Count",
title="Employee Count By Business Units") +
theme_minimal() +
coord_flip()
MFGEmployees %>%
ggplot() +
aes(x=Gender) +
geom_bar() +
labs(x="Business Units",
y="Count",
title="Employee Count By Gender") +
theme_minimal() +
coord_flip()
MFGEmployees %>%
ggplot() +
aes(x=Division) +
geom_bar() +
labs(x="Division",
y="Count",
title="Employee Count By Division") +
theme_minimal() +
coord_flip()
```
Let's get some of our questions answered through this exploratory analysis.
__First of all, what is our absenteeism rate?__
```{r}
MFGEmployees %>%
summarise(median = median(AbsenceRate), mean = mean(AbsenceRate)) # Median and mean absence rate
means <- aggregate(AbsenceRate ~ 1, MFGEmployees, mean)
ggplot() +
geom_boxplot(aes(y = AbsenceRate, x =1), data = MFGEmployees) +
coord_flip() +
geom_text(data = round(means, digits = 3), aes(y = AbsenceRate, x =1, label = AbsenceRate), check_overlap=TRUE) + #Add average absence rate as a text
theme_minimal()
```
The average absence rate is `r scales::percent(mean(MFGEmployees$AbsenceRate) / 100)`.
__Does anyone have excessive absenteeism?__
The boxplot shows the mean (printed as text), the median (the line inside the box) and the interquartile range (the box itself). Observations beyond the whiskers, i.e. more than 1.5 times the interquartile range from the box, show up as dots. So at least under that definition of outliers, some people show far more absenteeism than the vast majority of employees.
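To put employee numbers to those dots, we can flag everyone above the upper whisker, i.e. more than 1.5 times the interquartile range above the upper quartile. A minimal sketch:
```{r eval=FALSE}
# Flag employees whose absence rate lies beyond the usual boxplot whisker
upper_cutoff <- quantile(MFGEmployees$AbsenceRate, 0.75) +
  1.5 * IQR(MFGEmployees$AbsenceRate)

MFGEmployees %>%
  filter(AbsenceRate > upper_cutoff) %>%
  select(EmployeeNumber, Gender, Division, Age, AbsenceRate) %>%
  arrange(desc(AbsenceRate))
```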
__Does it vary across the organization?__
When you conduct a piece of quantitative research, you are inevitably attempting to answer a research question or hypothesis that you have set. One method of evaluating this research question is via a process called hypothesis testing, which is sometimes also referred to as significance testing.
In order to undertake hypothesis testing you need to express your research hypothesis as a null and alternative hypothesis. The null hypothesis and alternative hypothesis are statements regarding the differences or effects that occur in the population. You will use your sample to test which statement (i.e., the null hypothesis or alternative hypothesis) is most likely (although technically, you test the evidence against the null hypothesis). So, with respect to our example, the null and alternative hypothesis will reflect statements about absences among the organization.
In the following we will use ANOVA as the statistical test.
```{r}
ggplot() +
geom_boxplot(aes(y = AbsenceRate, x = Gender), data = MFGEmployees) +
coord_flip() +
theme_minimal()
AnovaModel.1 <- lm(AbsenceRate ~ Gender, data=MFGEmployees) %>%
  car::Anova() # Anova() comes from the car package
AnovaModel.1
#means
MFGEmployees %>% group_by(Gender) %>% summarise(avg=mean(AbsenceRate))
```
The output of the function is a classical ANOVA table with the following columns:

* Df = degrees of freedom
* Sum Sq = sum of squares (between groups, and residual)
* Mean Sq = mean square, i.e. variance (between groups, and residual)
* F value = the value of the F statistic, computed as (mean square between groups) / (mean square residual)
* Pr(>F) = p-value
Please note that the value 2.2e-16 actually means 2.2 X 10 ^ -16. It is just a way R prints numbers that are either too big or too small.
Since the p-value is much less than 0.05, we reject the null hypothesis and accept the alternative hypothesis, i.e. absence levels differ significantly between genders.
```{r}
ggplot() +
geom_boxplot(aes(y = AbsenceRate, x = Division), data = MFGEmployees) +
coord_flip() +
theme_minimal()
AnovaModel.2 <- lm(AbsenceRate ~ Division, data=MFGEmployees) %>%
  car::Anova()
AnovaModel.2
# means
MFGEmployees %>% group_by(Division) %>% summarise(avg=mean(AbsenceRate))
```
Absence levels also differ significantly across divisions.
```{r}
AnovaModel.3 <- lm(AbsenceRate ~ Division*Gender, data=MFGEmployees) %>%
  car::Anova()
AnovaModel.3
#means
MFGEmployees %>% group_by(Division, Gender) %>% summarise(avg=mean(AbsenceRate))
```
It also varies significantly by the interaction of gender and division.
These are just a handful of the categorical summaries we could do.
__Does AbsenceRate vary by length of service and age?__
Scatterplots and correlations help answer this.
```{r}
# basic scatterplot
MFGEmployees %>%
ggplot() +
aes(x=Age, y=AbsenceRate) +
geom_point(colour = "blue") +
theme_minimal()
cor(MFGEmployees$Age, MFGEmployees$AbsenceRate)
```
There is a strong correlation between Age and Absence Rate.
```{r}
MFGEmployees %>%
ggplot() +
aes(x=LengthService, y=AbsenceRate) +
geom_point(colour = "blue") +
geom_smooth(method='lm', se = FALSE, color='red') +
theme_minimal()
cor(MFGEmployees$LengthService, MFGEmployees$AbsenceRate)
```
There is not a strong correlation between length of service and Absence Rate.
```{r}
MFGEmployees %>%
ggplot() +
aes(x=Age, y=LengthService) +
geom_point(colour = "blue") +
geom_smooth(method='lm', se = FALSE, color='red') +
theme_minimal()
cor(MFGEmployees$Age, MFGEmployees$LengthService)
```
There is not much correlation between age and length of service either.
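A quick way to see all three pairwise correlations at once is a small correlation matrix. A minimal sketch:
```{r eval=FALSE}
# Pairwise correlations between the numeric variables
MFGEmployees %>%
  select(Age, LengthService, AbsenceRate) %>%
  cor() %>%
  round(3)
```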
So far we have done data exploration; now we are going to build a model and start to answer the more difficult questions.
## 3. Build the model
One of the questions asked in the goal-definition step was whether it is possible to __predict__ absenteeism.
Absence Rate is a continuous numeric value. In the 'Build the model' step we have to choose which models/statistical algorithms to use. Predicting a continuous numeric value suggests a couple of models that could be brought to bear: regression trees and linear regression. There are many more, but for the purposes of this article we will look at these two.
### 3.1 Regression Trees
Regression trees allow the use of both categorical and numeric variables as predictors. Let's choose the following variables as potential predictors in this analysis:
* Gender
* Department Name
* Store Location
* Division
* Age
* Length of Service
* Business Unit
__Absence Rate__ will be the 'target', i.e. the thing to be predicted.
```{r}
MYinput <- c("Gender", "DepartmentName", "StoreLocation", "Division",
"Age", "LengthService", "BusinessUnit")
MYtarget <- "AbsenceRate"
library(rpart, quietly=TRUE)
library(rattle) # for fancyRpartPlot()
set.seed(42) # fixed seed for reproducibility (crv$seed is only defined in Rattle-generated scripts)
MYrpart <- rpart(AbsenceRate ~ .,
    data=MFGEmployees[, c(MYinput, MYtarget)],
    method="anova",
    control=rpart.control(minsplit=10,
        maxdepth=10,
        usesurrogate=0,
        maxsurrogate=0))
fancyRpartPlot(MYrpart, main="Decision Tree MFGEmployees and AbsenceRate")
```
The regression tree shows that age is a big factor in determining absence rate, with gender playing a small part in one of the age ranges (>43 and <52), where males have a lower absence rate. Almost none of the categorical information other than gender appears to help the prediction.
Now let's look at linear regression as another model. In the basic form used here, linear regression accepts only numeric predictors; categorical variables can be made numeric through transformation (for example dummy coding), but that is beyond the scope of this article.
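For illustration, the usual transformation is dummy (indicator) coding, which R can generate with model.matrix(). A minimal sketch for the Gender variable:
```{r eval=FALSE}
# Dummy-code Gender: one 0/1 column per non-reference level, plus an intercept
head(model.matrix(~ Gender, data = MFGEmployees))
```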
### 3.2 Linear Regression
For the linear regression, then, we will restrict ourselves to the numeric variables:
* Age
* Length of Service
being used to predict absence rate.
```{r}
#Linear Regression Model
RegressionCurrentData <- lm(AbsenceRate~Age+LengthService, data=MFGEmployees)
summary(RegressionCurrentData)
```
The summary shows an adjusted R-squared of `r round(summary(RegressionCurrentData)$adj.r.squared, digits =3)` which means approximately `r scales::percent(summary(RegressionCurrentData)$adj.r.squared)` of the variance is accounted for by age and length of service.
Both variables are significant; for example, LengthService has a Pr(>|t|) of `r summary(RegressionCurrentData)$coefficients[3,4]`. These results use the entirety of the existing data to predict itself.
Graphically it looks like this:
```{r}
#2D plot of Age and AbsenceRate
ggplot() +
geom_point(aes(x = Age,y = AbsenceRate),data=MFGEmployees) +
geom_smooth(aes(x = Age,y = AbsenceRate),data=MFGEmployees,method = 'lm') +
theme_minimal()
#3D Scatterplot of Age and Length of Service with Absence Rate - with Coloring and Vertical Lines
# and Regression Plane
library(scatterplot3d)
s3d <-scatterplot3d(MFGEmployees$Age,MFGEmployees$LengthService,MFGEmployees$AbsenceRate, pch=16, highlight.3d=TRUE,
type="h", main="Absence Rate By Age And Length of Service")
fit <- lm(MFGEmployees$AbsenceRate ~ MFGEmployees$Age+MFGEmployees$LengthService)
s3d$plane3d(fit)
```
## 4. Evaluate and critique the model
Up until now we have concentrated on producing a couple of models. The effort so far has had one weakness: __we have used all of our data for 2015 to generate the models.__ Both can predict, but the predictions are based on data that is already known. We don't know how well they will predict on data they haven't seen yet.
To evaluate and critique the models, we need to train them on part of the data and hold out a portion to test on. We will divide the data into 10 parts, using 9 parts as training data and 1 part as testing data, and rotate which parts play which role, so that each of the 10 parts is used as training data 9 times and as testing data once.
The R "caret" library helps us do that. We will run both a regression tree and a linear regression and compare how they do against each other.
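For intuition about the partitioning, caret's createFolds() shows what ten such folds look like. A minimal sketch:
```{r eval=FALSE}
library(caret)
# Ten folds; each observation appears in exactly one fold (its test fold)
folds <- createFolds(MFGEmployees$AbsenceRate, k = 10)
lengths(folds) # roughly equal fold sizes
```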
First, the linear regression:
```{r}
library(caret)
set.seed(998)
inTraining <- createDataPartition(MFGEmployees$BusinessUnit, p = .75, list = FALSE)
training <- MFGEmployees[inTraining,]
testing <- MFGEmployees[ - inTraining,]
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv", number = 10,
repeats = 10 ## repeated ten times
)
set.seed(825)
lmFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
method = "lm",
trControl = fitControl)
lmFit1
Testingdatasetandpredictions <- testing %>% add_predictions(lmFit1, type = "raw")
Testingdatasetandpredictions$pred[Testingdatasetandpredictions$pred<0] <- 0 # Set negative predictions to zero
```
The cross-validated R-squared is about 0.688, which means that even when sampling different parts of the data in 10-fold cross-validation, the combination of age and length of service seems fairly robust so far.
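As a quick sanity check on the held-out test set, we can compare the predictions just made against the actual absence rates. A minimal sketch using the Testingdatasetandpredictions data frame created above:
```{r eval=FALSE}
# Hold-out performance of the linear regression
Testingdatasetandpredictions %>%
  summarise(RMSE     = sqrt(mean((AbsenceRate - pred)^2)),
            Rsquared = cor(AbsenceRate, pred)^2)
```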
Next, the decision tree: first with just the numeric variables.
```{r}
set.seed(825)
rpartFit1 <- train(AbsenceRate ~ Age + LengthService, data = training,
method = "rpart",
trControl = fitControl,
maxdepth = 5)
rpartFit1
```
You will notice that the decision tree with 10-fold cross-validation didn't perform as well, with an R-squared of approximately 0.60.
The second time, with the original categorical and numeric variables used:
```{r}
set.seed(825)
rpartFit2 <- train(AbsenceRate ~ Gender + DepartmentName + StoreLocation + Division + Age + LengthService + BusinessUnit, data = training,
method = "rpart",
trControl = fitControl,
maxdepth = 5)
rpartFit2
```
Here, when you include all of the originally used variables in 10-fold cross-validation, the R-squared changes little and is still around 0.60.
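caret also makes it easy to compare the resampling results of the three fits side by side. A minimal sketch, assuming the model objects created above:
```{r eval=FALSE}
# Compare cross-validated performance across the three models
resamps <- resamples(list(LinearRegression = lmFit1,
                          TreeNumeric      = rpartFit1,
                          TreeAll          = rpartFit2))
summary(resamps)
```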
So far, the linear regression is performing better.
## 5. Present Results and Document
Presenting results and documenting them is something R helps with. You may not have realized it, but the R Markdown language has been used to create the full layout of these two blog articles. HTML, PDF and Word formats can be produced.
R Markdown allows the reader to see exactly what you have been doing, so that an independent person can replicate your results and confirm what you have done. It shows the R code/commands, the statistical results and the graphics, all of which is included in the Absenteeism-Part.Rmd file.
## 6. Deploy Model
Once you have evaluated your model(s) and chosen to use them, they need to be deployed so that they can be used. At the simplest level, 'deploy' can mean using the 'predict' function in R (where applicable) in conjunction with your model.
__Can it predict next year's absenteeism?__
Let's predict the 2016 Absenteeism from the 2015 model.
If we make the simplifying assumption that nobody quits and nobody new joins, we can take the 2015 data and add 1 to age and 1 to length of service to approximate the 2016 data before 2016 arrives.
```{r}
#Apply model
#Generate 2016 data
Absence2016Data<-MFGEmployees
Absence2016Data$Age<-Absence2016Data$Age+1
Absence2016Data$LengthService<-Absence2016Data$LengthService+1
Absence2016Data <- Absence2016Data %>% add_predictions(lmFit1, type = "raw")
Absence2016Data$pred[Absence2016Data$pred<0] <- 0 # Set negative predictions to zero
```
To get a single estimate for 2016, we take the mean of the predicted absence rates.
```{r}
mean(Absence2016Data$pred)
mean(MFGEmployees$AbsenceRate)
```
The first figure above is the 2016 prediction, the second is the 2015 actual for comparison.
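The prediction can also be broken down by any grouping variable of interest. A minimal sketch by Division:
```{r eval=FALSE}
# Predicted 2016 absence rate by division, alongside the 2015 actuals
Absence2016Data %>%
  group_by(Division) %>%
  summarise(Predicted2016 = mean(pred),
            Actual2015    = mean(AbsenceRate))
```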
__If so, how well can it predict?__
As mentioned previously, about `r scales::percent(summary(RegressionCurrentData)$adj.r.squared)` of the variation is accounted for by a linear regression model using age and length of service.
__Can we reduce our absenteeism?__
On the surface, only reducing age and increasing length of service would reduce absenteeism under this model.
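To see what the model implies for particular profiles, you can feed hypothetical values into the fitted model. A minimal sketch with made-up ages and service lengths:
```{r eval=FALSE}
# What-if: predicted absence rate for a few hypothetical employee profiles
whatif <- tibble(Age = c(25, 40, 55), LengthService = c(2, 10, 20))
whatif %>% add_predictions(lmFit1, type = "raw")
```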
Obviously, absenteeism is much more complex than the rudimentary data we have collected. A serious look at this metric and problem would require more, and different kinds of, data. As mentioned before, the raw data used in this article and analysis is entirely contrived to illustrate an example.