### Loading required libraries
## sudo apt-get install r-cran-curl r-cran-openssl r-cran-xml2
## install.packages(c("Rcpp","tidyverse"))
library(tidyverse) # Pipe operator (%>%) and other commands
library(caret) # Random split of data/cross validation
library(olsrr) # Heteroscedasticity Testing (ols_test_score)
## install.packages("olsrr")
library(car) # Multicollinearity detection (vif)
library(broom) # Diagnostic Metric Table (augment)
# Loading the data set
data <- read.csv("advertising.csv", header = TRUE)
# Inspecting the first six rows of the data
head(data)
# Inspecting the last six rows of the data
tail(data)
# Getting the structure of the whole data set
str(data)
# Checking Outliers
boxplot(data)
### The above plot shows that two outliers are present in the variable "Newspaper".
# Removing Outliers
data <- data[-which(data$Newspaper %in% boxplot.stats(data$Newspaper)$out),]
# Checking outliers again
boxplot(data)
# Checking Missing Values
table(is.na(data))
### The above output shows that there are no missing values in the data set.
### A per-column check is slightly more informative, as sketched below.
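### Counting missing values per column (a quick base-R check):
colSums(is.na(data))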
# Creating scatter plot matrix
pairs(data , upper.panel = NULL)
### This output shows that:
### 1. There is little or no linear relationship between TV and Radio.
### 2. There is a low linear relationship between TV and Newspaper.
### 3. There is a moderate linear relationship between Radio and Newspaper.
### 4. TV and Sales show a strong linear relationship, Radio and Sales a weaker one, and
### Newspaper and Sales only a weak one (see also their scatter plots below).
### 5. A slight curvilinear pattern is also present between TV and Sales as well as Radio and Sales.
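### To quantify these visual impressions, we can inspect the pairwise correlations
### (a quick sketch; assumes all columns of data are numeric, as they are here):
round(cor(data), 2)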
# Scatter Plot between TV and Sales
plot(data$TV , data$Sales)
### Notice that there is a slight curvilinear relationship between TV and Sales.
# Scatter Plot between Radio and Sales
plot(data$Radio , data$Sales)
### Notice that there is a curvilinear relationship between Radio and Sales.
# Scatter Plot between Newspaper and Sales
plot(data$Newspaper , data$Sales)
### There is only a weak linear relationship between Newspaper and Sales.
# Randomly Split the data into training and test set
set.seed(1)
training.samples <- data$Sales %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
### Fitting Simple Linear Regression
# Fitting Sales ~ TV
sm1 <- lm(Sales ~ TV , data = train.data)
# Taking a look at the summary of the model
summary(sm1)
#### From the above output, notice that:
### 1. The model is statistically significant, since the p-value <<< 0.05 (see the last line of the output).
### 2. From the coefficients section, both coefficients (slope and intercept) are statistically significant, since their p-values <<< 0.05.
### 3. This model, with TV as predictor, explains approximately 81% of the variability of the target (Sales).
### 4. The residual standard error of the model is 2.28. (These figures can also be extracted programmatically, as sketched below.)
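### One option for pulling these statistics out of the fitted model is broom's glance(),
### since broom is already loaded above:
glance(sm1) %>% select(adj.r.squared, sigma, p.value)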
# Fitting Sales ~ Radio
sm2 <- lm(Sales ~ Radio , data = train.data)
# Taking a look at the summary of the model
summary(sm2)
#### From the above output, notice that:
### 1. The model is statistically significant, since the p-value << 0.05 (see the last line of the output).
### 2. From the coefficients section, both coefficients (slope and intercept) are statistically significant, since their p-values << 0.05.
### 3. This model, with Radio as predictor, explains approximately 14% of the variability of the target (Sales).
### 4. The residual standard error of the model is 4.95.
# Fitting Sales ~ Newspaper
sm3 <- lm(Sales ~ Newspaper , data = train.data)
# Taking a look at the summary of the model
summary(sm3)
#### From the above output, notice that:
### 1. The model is statistically significant, since the p-value < 0.05 (see the last line of the output).
### 2. From the coefficients section, both coefficients (slope and intercept) are statistically significant, since their p-values < 0.05.
### 3. This model, with Newspaper as predictor, explains approximately 4% of the variability of the target (Sales).
### 4. The residual standard error of the model is 5.21.
### Among the three simple linear regression models, the one with TV as predictor explains
### the most variability of the target (Sales); the side-by-side comparison below summarizes this.
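### A compact comparison of the three simple models (a sketch using broom's glance() and
### purrr's map_df(), both available from the packages loaded above):
map_df(list(TV = sm1, Radio = sm2, Newspaper = sm3), glance, .id = "predictor") %>%
  select(predictor, adj.r.squared, sigma)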
# Scatter plot with Simple Linear Regression Line
plot(train.data$TV , train.data$Sales)
# Adding Regression Line
abline(lm(train.data$Sales ~ train.data$TV) , col = "blue")
### One problem arises here: the above plot shows that it is not feasible to predict Sales
### from a single predictor alone, because Sales still varies widely around the fitted line. Moreover,
### using a single predictor completely ignores the effect of the other two predictors on Sales,
### which is unlikely to reflect reality. So, why not extend this model?
#### Fitting Multiple Linear Regression with Diagnostic Plot
#### Extending Simple Linear Regression Model using Forward Selection Method
# Fitting MLR model with predictors TV and Radio
mm1 <- lm(Sales ~ TV + Radio , data = train.data)
# Taking a look at the summary of the model
summary(mm1)
### From the above output, notice that:
### 1. The model is statistically significant, since the p-value <<< 0.05 (see the last line of the output).
### 2. From the coefficients section, all coefficients (slopes and intercept) are statistically significant, since their p-values <<< 0.05.
### 3. This model, with TV and Radio as predictors, explains approximately 90% of the variability of the target (Sales), an improvement over the model with TV alone as predictor.
### 4. The residual standard error of the model is 1.693.
### But we must test whether this improvement in Adjusted R-squared is statistically significant,
### i.e., we want to test the null hypothesis:
## H0 : The improvement in Adjusted R-squared is not statistically significant.
## H1 : The improvement in Adjusted R-squared is statistically significant.
### For this test, we use the ANOVA (Analysis of Variance) technique.
# Performing ANOVA to test the above stated null hypothesis
anova(sm1 , mm1)
### In the above output, notice the value in the last column of the second row. This value (2.2e-16)
### is the p-value for testing the null hypothesis. Since it is far below 0.05,
### we have sufficient evidence from the data to reject the null hypothesis in favour of the alternative:
### the improvement in Adjusted R-squared is statistically significant.
### Hence, adopt the model Sales ~ 0.05531 TV + 0.102164 Radio at this stage.
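### The equation above omits the intercept; the full coefficient vector can be read off directly:
coef(mm1)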
# Extending the MLR further by including the predictor Newspaper
mm2 <- lm(Sales ~ TV + Radio + Newspaper , data = train.data)
# Taking a look at the summary of the model
summary(mm2)
### From the above output, together with the previously fitted model, notice that:
### From the coefficients section, the Newspaper predictor is not statistically significant
### for the model, since its p-value (0.69) > 0.05.
### Adjusted R-squared has been reduced from 89.9% to 89.8%.
### The residual standard error has increased from 1.715 to 1.72.
### Although the model as a whole is statistically significant (p-value <<< 0.05, last line of the output),
### we have sufficient evidence from the data not to include Newspaper as a predictor; the partial
### F-test sketched below confirms this.
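### The same ANOVA technique used above gives a formal partial F-test for adding Newspaper;
### a large p-value is expected here, consistent with dropping the term:
anova(mm1 , mm2)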
#### Diagnostic Plots
### Checking Linearity Assumption
# Residual Plot
plot(mm1, 1)
### Notice from the above plot:
### The red line is approximately horizontal and linear, which indicates that the linearity assumption holds well.
### The residuals fluctuate randomly inside a band between -4 and +4, which indicates
### that the fitted model is good for prediction to some extent. Why only to some extent?
### Because points 131 and 151 may be potential outliers, since they lie far from the other points.
### But large residuals (as with points 131 and 151 here) could also indicate that
### either the variance is not constant (heteroscedasticity) or the true relationship between target and predictors is nonlinear.
### These possibilities should be investigated before the points are declared outliers.
# Score Test for Heteroscedasticity
ols_test_score(mm1)
### The last line of the above output shows a p-value greater than the significance level 0.05.
### Hence we fail to reject the null hypothesis and conclude that the variance is homogeneous, i.e., homoscedasticity holds.
# Checking effect of Auto-correlation
durbinWatsonTest(mm1)
### The above output shows a p-value (0.326) > 0.05. Hence we fail to reject the null hypothesis and
### conclude that there is no autocorrelation between the errors, i.e., the errors are uncorrelated.
#### Checking Multicollinearity
### Generally, the Variance Inflation Factor (VIF) is used to detect multicollinearity.
### As a rule of thumb, a VIF greater than 5 or 10 indicates multicollinearity.
# Detecting multicollinearity
vif(mm1)
### Note that the variance inflation factor is less than 5 (rule of thumb) for both predictors.
### Hence there is no multicollinearity between the predictors; a by-hand cross-check is sketched below.
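### The VIF can also be computed by hand: regress each predictor on the others and take
### 1 / (1 - R^2). A sketch for the TV predictor (Radio is the only other predictor in mm1;
### r2_tv is a temporary variable introduced here for illustration):
r2_tv <- summary(lm(TV ~ Radio , data = train.data))$r.squared
1 / (1 - r2_tv)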
#### Checking Normality Assumption
### The Shapiro-Wilk test is generally used to check the normality assumption.
# Checking Normality of Errors
shapiro.test(mm1$residuals)
### Normality does not hold, since the p-value < 0.05.
### Plot a histogram of the residuals to get an idea of the shape of their distribution.
# Plotting Histogram for Residuals
hist(mm1$residuals)
### We see that there is some problem in the left tail. It may be due to data points 131
### and 151, as pointed out earlier; the Q-Q plot below shows this more directly.
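### A normal Q-Q plot of the residuals (diagnostic plot 2 of the fitted lm object) makes the
### tail departure easier to judge than the histogram:
plot(mm1 , 2)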
#### Fitting Orthogonal Polynomial Regression with Diagnostic Plot
# Fitting a second-order orthogonal polynomial model in two variables to avoid multicollinearity
pm1 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 2) + TV:Radio , data = train.data)
# Taking a look at the summary of the model
summary(pm1)
### From the above output, notice that:
### The model is statistically significant, since the p-value <<< 0.05 (see the last line of the output).
### From the coefficients section, all coefficients are statistically significant, since their p-values <<< 0.05.
### This second-order orthogonal polynomial model explains 93.28% of the variability of the target (Sales), an
### improvement over the multiple linear regression model with TV and Radio as predictors.
### The residual standard error of the model is 1.403.
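### Why "orthogonal"? poly() builds basis columns that are mutually uncorrelated, unlike raw
### powers such as TV and TV^2, which are strongly correlated. A quick sketch of the difference:
round(cor(poly(train.data$TV , 2)) , 3) # off-diagonal entries ~ 0 for the orthogonal basis
cor(train.data$TV , train.data$TV^2) # raw powers: correlation close to 1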
### Checking whether this improvement in Adjusted R-squared is statistically significant:
### H0 : The improvement in Adjusted R-squared is not statistically significant.
### H1 : The improvement in Adjusted R-squared is statistically significant.
# Performing ANOVA to test the above stated null hypothesis
anova(mm1 , pm1)
### In the above output, the value (3.559e-13) is the p-value for testing the null hypothesis. Since it is far below 0.05,
### we have sufficient evidence from the data to reject the null hypothesis in favour of the alternative:
### the improvement in Adjusted R-squared is statistically significant.
### Hence, adopt the second-order orthogonal polynomial model at this stage.
# Fitting a third-order (orthogonal) polynomial model in two variables to avoid multicollinearity
pm2 <- lm(Sales ~ poly(TV , 3) + poly(Radio , 3) + TV:Radio , data = train.data)
# Taking a look at the summary of the model
summary(pm2)
### From the coefficients section of the above output, the third-order term of the TV predictor is not statistically
### significant (p-value > 0.05). Hence, do not include this term in the model.
# Fitting a third-order (orthogonal) polynomial model in two variables, but without the third-order term of TV
pm3 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 3) + TV:Radio , data = train.data)
# Taking a look at the summary of the model
summary(pm3)
### From the above output, together with the second-order orthogonal polynomial model stored in pm1, notice that:
### The model is statistically significant, since the p-value <<< 0.05 (see the last line of the output).
### From the coefficients section, all coefficients are statistically significant, since their p-values <<< 0.05.
### This third-order orthogonal polynomial model (without the third-order term of TV) explains 92.93% of the variability of
### the target (Sales), an improvement over the second-order orthogonal polynomial regression model.
### The residual standard error of the model is 1.377.
# Performing ANOVA to test the corresponding null hypothesis
anova(pm1 , pm3)
### In the above output, notice the value in the last column of the second row. This value (0.009299) is the p-value for
### testing the null hypothesis. Since it is below 0.05, we have sufficient evidence from the data to
### reject the null hypothesis in favour of the alternative:
### the improvement in Adjusted R-squared is statistically significant.
### Hence, adopt the third-order orthogonal polynomial model without the third-order term of TV at this stage.
#### Diagnostic Plots
### Checking Linearity Assumption
# Residual Plot
plot(pm3 , 1)
### Notice from the above plot:
### The red line is approximately horizontal and linear at Residuals = 0, which indicates that the linearity assumption holds well.
### The residuals fluctuate randomly inside a band between -4 and +4, which indicates that the fitted
### model is good for prediction to some extent. Why only to some extent?
### Because point 131 (note: 151 now sits with the other points) may still be a potential outlier, since it lies far from the other points.
#### Checking Homoscedasticity Assumption
# Score Test for Heteroscedasticity
ols_test_score(pm3)
### The errors have constant variance, since the p-value > 0.05.
#### Checking Auto-correlation Assumption
# Checking effect of Auto-correlation
durbinWatsonTest(pm3)
### Errors are uncorrelated.
#### Checking Normality Assumption
# Checking Normality of Errors
shapiro.test(pm3$residuals)
### Errors are normally distributed.
#### Checking Multicollinearity
# Detecting multicollinearity
vif(pm3)
### Note that all values in the last column of the above output are less than 5 (rule of thumb).
### Hence there is no multicollinearity.
#### Removing observation number 131 from the train data set
### Our only remaining option is to delete observation number 131 from the train data set, as it has a large
### residual (see the residual plot for pm3), and to check whether Adjusted R-squared improves significantly.
### If it does, keep the observation removed; otherwise, retain observation number 131.
# Creating the diagnostic metrics table for model pm3
dm <- augment(pm3)
### Notice from the above output that the table contains several diagnostic metrics, such as
### the residuals (column 9), Cook's distance (column 12), the studentized residuals (column 13), and more.
### We will use the studentized residuals to find and delete observation number 131.
# Checking the minimum and maximum values of the studentized residuals
min(dm$.std.resid)
max(dm$.std.resid)
### The minimum studentized residual is below -3 (rule of thumb), so it indicates an outlier.
### We remove that observation as follows:
# Finding the position of that observation within the train data
idx <- which.min(dm$.std.resid)
idx
### The above output gives the row position of the outlier in the train data set.
### Inspect the complete information for that row before removing it:
# Full information about the outlying row
train.data[idx, ]
### This is our target; we remove it from the train data set.
# Removing the outlying row
train.data1 <- train.data[-idx, ]
# Checking number of rows in old train data set
nrow(train.data)
# Checking number of rows in new train data set (train.data1)
nrow(train.data1)
### Now fit the same orthogonal polynomial model as stored in pm3, but
### using the data stored in train.data1:
# Fitting the same polynomial model in TV and Radio (without the third-order term of TV) on train.data1
pm4 <- lm(Sales ~ poly(TV , 2) + poly(Radio , 3) + TV:Radio , data = train.data1)
# Taking a look at the summary of the model
summary(pm4)
### From the above output, together with the polynomial model stored in pm3, notice that:
### Model pm4 is statistically significant, since the p-value <<< 0.05 (see the last line of the output).
### From the coefficients section, all coefficients are statistically significant, since their p-values <<< 0.05.
### This polynomial model, after removing the outlier, explains 93.84% of the variability of the target (Sales), an
### improvement over the polynomial regression model stored in pm3.
### The residual standard error of the model is 1.325.
### Now, quickly re-check all the other assumptions:
# Linearity Assumption
plot(pm4 ,1)
# Homoscedasticity Assumption
ols_test_score(pm4)
# Autocorrelation Assumption
durbinWatsonTest(pm4)
# Normality Assumption
shapiro.test(pm4$residuals)
# Multicollinearity assumption
vif(pm4)
### All assumptions have been satisfied now.
### Checking outliers again by creating Diagnostic metric table for model pm4
# Creating Diagnostic metric table for model pm4
dm1 <- augment(pm4)
# Checking minimum and maximum value of Studentized Residuals
min(dm1$.std.resid)
max(dm1$.std.resid)
### The above output shows that no studentized residual exceeds 3 (rule of thumb) in
### absolute value. Hence, there are no potential outliers; the direct check below confirms this.
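### Equivalently, list every row crossing the |3| threshold (this should return zero rows):
dm1 %>% filter(abs(.std.resid) > 3)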
#### Making Predictions
# Making Predictions
prediction = pm4 %>% predict(test.data)
# Checking performance by calculating R2 , RMSE and MAE
data.frame(R2   = R2(prediction, test.data$Sales),
           RMSE = RMSE(prediction, test.data$Sales),
           MAE  = MAE(prediction, test.data$Sales))
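### For context, the same test-set metrics can be computed for the simpler multiple linear
### model mm1, to see how much the polynomial terms help on held-out data (a sketch;
### pred_mm1 is a temporary variable introduced here for illustration):
pred_mm1 <- mm1 %>% predict(test.data)
data.frame(R2   = R2(pred_mm1, test.data$Sales),
           RMSE = RMSE(pred_mm1, test.data$Sales),
           MAE  = MAE(pred_mm1, test.data$Sales))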