<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Coding Machine Learning Models with R</title>
<meta charset="utf-8" />
<meta name="author" content="John Lewis" />
<link href="slides_open1_files/remark-css/default.css" rel="stylesheet" />
<link href="slides_open1_files/remark-css/default-fonts.css" rel="stylesheet" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# Coding Machine Learning Models with R
## Meet Tidymodels
### John Lewis
### 2020/04/08 (updated: 2020-08-11)
---
class:center
<img class="circle" src="images/tidymodels_hex.png" width="450"/>
####These slides and a PDF file can be found at github.com/jelewis (repository: OpenGeoHubSlides1)
---
## .center[Why Tidymodels??]
Fundamentally, tidymodels is an ecosystem of packages specifically designed with common APIs and a shared philosophy.
--
R has a consistency problem. Because its packages were written by different people following different principles, each has a slightly different interface, and keeping everything in line can be frustrating. To circumvent this problem, 'tidymodels' was created by Max Kuhn at RStudio.
--
So 'tidymodels' is an integrated, modular, extensible set of packages that implements a framework for creating predictive statistical models. It adheres to tidyverse syntax and design principles, which promote consistency and well-designed human interfaces over speed of code execution.
--
However, it has built-in capabilities for parallel execution of tasks such as resampling, cross-validation and parameter tuning. 'Tidymodels' works through the steps of basic ML modelling and implements conceptual structures that make complex, iterative workflows possible and reproducible.
*Paraphrased from: Joseph Rickert, R Views, 2020-04-21
---
class: center, middle
## Who am I?
## *John Lewis*
## McGill University Professor (retired)
<img class="circle" src="images/Lewis.jpg" width="150px"/>
#### "Github" <http://github.com/jelewis>
#### "email" <[email protected]>
<!--background-image: ![background] ("images/tidymodels_hex.png")-->
---
## What is Machine Learning or the World of AI?
<img src="images/7vw.jpg" width="80%" style="display: block; margin: auto;" />
.footnote[Source:https://vas3k.com/blog/machine_learning/]
---
<img src="images/7w1.jpg" width="90%" style="display: block; margin: auto;" />
.footnote[Source:https://vas3k.com/blog/machine_learning/]
---
class: center,middle
![alt text](images/machine_learning.png)
.footnote[Source: xkcd]
---
class: center,middle
## A wise man once said "Just because you have a bag of hammers does not make every problem a nail"
---
name: novice
class: inverse,center
# Goals of this presentation
--
## -explain key concepts of tidymodels
--
## -use tidymodels and its companion packages to produce output for a ML regression model
--
## -illustrate with R code how to program the workflow sequence for ML models
---
name: novice1
class: inverse
## Assumptions behind this talk
### -Have a working knowledge of the R language
### -Are somewhat familiar with machine learning material
---
## A word about tidymodels within the tidyverse
<img src="images/ds.png" width="95%" style="display: block; margin: auto;" />
.footnote[https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/]
---
## An extra bit of motivation for Tidymodels
### "Whether you are just starting out today or have years of experience with modeling, tidymodels offers a consistent, flexible framework for your work."
From *Max Kuhn*, originator of both 'tidymodels' and the 'caret' package in R
or
### The <b>tidymodels</b> ecosystem bundles together a set of packages that work hand in hand to solve machine-learning problems from start to end. Together with the data-wrangling facilities in the <b>tidyverse</b> and the plotting tools from <b>ggplot2</b>, this makes for a rich toolbox for every data scientist working with R. (Hansjörg Plieninger, Blog, Feb. 2020)
---
<img src="images/slide1.png" width="95%" style="display: block; margin: auto;" />
---
<img src="images/slide2.png" width="95%" style="display: block; margin: auto;" />
---
<img src="images/slide3.png" width="95%" style="display: block; margin: auto;" />
---
## Other important packages
<img src="images/slide4a.png" width="95%" />
---
class:center,middle
## To learn more about the tidymodels packages and the metapackage itself, please go to the following sites:
<https://www.tidymodels.org/>
<https://tidymodels.github.io/tidymodels/>
---
## **Getting set up**
### First we need to load some libraries: tidymodels and tidyverse.
#### load the relevant tidymodels libraries
```r
library("tidymodels")
library("tidyverse")
```
If you don’t already have the tidymodels library (or any of the other libraries) installed, then you’ll need to install it (once only) using
**install.packages("tidymodels")**
---
### Loaded Packages in the 'tidyverse' and 'tidymodels' ecosystem
####library(tidyverse)
-- Attaching packages -------------tidyverse 1.3.0 --
<br>ggplot2 3.3.2&emsp;purrr 0.3.4
<br>tibble 3.0.3&emsp;dplyr 1.0.0
<br>tidyr 1.1.0&emsp;stringr 1.4.0
<br>readr 1.3.1&emsp;forcats 0.5.0
####library(tidymodels)
-- Attaching packages --------------tidymodels 0.1.1 --
<br>broom 0.7.0&emsp;recipes 0.1.13
<br>dials 0.0.8&emsp;rsample 0.0.7
<br>infer 0.5.3&emsp;tune 0.1.1
<br>modeldata 0.0.2&emsp;workflows 0.1.2
<br>parsnip 0.1.2&emsp;yardstick 0.0.7
---
## Loading the data
We’ll be using the Ames Housing dataset, which contains 81 variables and 2,930 observations; our dependent variable/target outcome is **Sale_Price**. Obviously, in an actual analysis we would spend much more time exploring this dataset, but for the sole purpose of demonstrating the {tidymodels} workflow, we’ll just perform some preprocessing, throw a relatively simple model at the data, use cross-validation, and then tune the model.
```r
data(ames, package = "modeldata")
```
---
### Ames, Iowa
<img src="images/ames_map.png" width="90%" style="display: block; margin: auto;" />
.footnote[Source: Max Kuhn & Davis Vaughan https://github.com/rstudio-conf-2020/applied-ml/blob/master/Part_1.pdf]
---
## First look at the data
```r
nrow(ames)
```
```
## [1] 2930
```
```r
ncol(ames)
```
```
## [1] 81
```
---
### Another look: Column names
```r
head(colnames(ames), n=69)
## [1] "MS_SubClass" "MS_Zoning" "Lot_Frontage"
## [4] "Lot_Area" "Street" "Alley"
## [7] "Lot_Shape" "Land_Contour" "Utilities"
## [10] "Lot_Config" "Land_Slope" "Neighborhood"
## [13] "Condition_1" "Condition_2" "Bldg_Type"
## [16] "House_Style" "Overall_Qual" "Overall_Cond"
## [19] "Year_Built" "Year_Remod_Add" "Roof_Style"
## [22] "Roof_Matl" "Exterior_1st" "Exterior_2nd"
## [25] "Mas_Vnr_Type" "Mas_Vnr_Area" "Exter_Qual"
## [28] "Exter_Cond" "Foundation" "Bsmt_Qual"
## [31] "Bsmt_Cond" "Bsmt_Exposure" "BsmtFin_Type_1"
## [34] "BsmtFin_SF_1" "BsmtFin_Type_2" "BsmtFin_SF_2"
## [37] "Bsmt_Unf_SF" "Total_Bsmt_SF" "Heating"
## [40] "Heating_QC" "Central_Air" "Electrical"
## [43] "First_Flr_SF" "Second_Flr_SF" "Low_Qual_Fin_SF"
## [46] "Gr_Liv_Area" "Bsmt_Full_Bath" "Bsmt_Half_Bath"
## [49] "Full_Bath" "Half_Bath" "Bedroom_AbvGr"
## [52] "Kitchen_AbvGr" "Kitchen_Qual" "TotRms_AbvGrd"
## [55] "Functional" "Fireplaces" "Fireplace_Qu"
## [58] "Garage_Type" "Garage_Finish" "Garage_Cars"
## [61] "Garage_Area" "Garage_Qual" "Garage_Cond"
## [64] "Paved_Drive" "Wood_Deck_SF" "Open_Porch_SF"
## [67] "Enclosed_Porch" "Three_season_porch" "Screen_Porch"
```
---
### And another look: Data types & values
```r
glimpse(ames)
## Rows: 2,930
## Columns: 81
## $ MS_SubClass <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1...
## $ MS_Zoning <fct> Residential_Low_Density, Residential_High_Densit...
## $ Lot_Frontage <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, ...
## $ Lot_Area <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, ...
## $ Alley <fct> No_Alley_Access, No_Alley_Access, No_Alley_Acces...
## $ Lot_Shape <fct> Slightly_Irregular, Regular, Slightly_Irregular,...
## $ Land_Contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, ...
## $ Lot_Config <fct> Corner, Inside, Corner, Corner, Inside, Inside, ...
## $ Land_Slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, ...
## $ Condition_1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ Condition_2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, ...
## $ Bldg_Type <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, ...
## $ House_Style <fct> One_Story, One_Story, One_Story, One_Story, Two_...
## $ Overall_Qual <fct> Above_Average, Average, Above_Average, Good, Ave...
## $ Overall_Cond <fct> Average, Above_Average, Above_Average, Average, ...
## $ Year_Built <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, ...
## $ Year_Remod_Add <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, ...
## $ Roof_Style <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable...
## $ Roof_Matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com...
## $ Exterior_1st <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, Vin...
## $ Exterior_2nd <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, Vin...
## $ Mas_Vnr_Type <fct> Stone, None, BrkFace, None, None, BrkFace, None,...
## $ Mas_Vnr_Area <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Exter_Qual <fct> Typical, Typical, Typical, Good, Typical, Typica...
## $ Exter_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typ...
## $ Foundation <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PC...
## $ Bsmt_Qual <fct> Typical, Typical, Typical, Typical, Good, Typica...
## $ Bsmt_Cond <fct> Good, Typical, Typical, Typical, Typical, Typica...
## $ Bsmt_Exposure <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, ...
## $ BsmtFin_Type_1 <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf...
## $ BsmtFin_SF_1 <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, ...
## $ BsmtFin_Type_2 <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf...
## $ BsmtFin_SF_2 <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120...
## $ Bsmt_Unf_SF <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 9...
## $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 159...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, ...
## $ Heating_QC <fct> Fair, Typical, Typical, Excellent, Good, Excelle...
## $ Central_Air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,...
## $ First_Flr_SF <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 161...
## $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676,...
## $ Low_Qual_Fin_SF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Gr_Liv_Area <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1...
## $ Bsmt_Full_Bath <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, ...
## $ Bsmt_Half_Bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Full_Bath <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, ...
## $ Half_Bath <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, ...
## $ Bedroom_AbvGr <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, ...
## $ Kitchen_AbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Kitchen_Qual <fct> Typical, Typical, Good, Excellent, Typical, Good...
## $ TotRms_AbvGrd <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12,...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ...
## $ Fireplaces <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, ...
## $ Fireplace_Qu <fct> Good, No_Fireplace, No_Fireplace, Typical, Typic...
## $ Garage_Type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, ...
## $ Garage_Finish <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin...
## $ Garage_Cars <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, ...
## $ Garage_Area <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442...
## $ Garage_Qual <fct> Typical, Typical, Typical, Typical, Typical, Typ...
## $ Garage_Cond <fct> Typical, Typical, Typical, Typical, Typical, Typ...
## $ Paved_Drive <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Pa...
## $ Wood_Deck_SF <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157,...
## $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75...
## $ Enclosed_Porch <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Screen_Porch <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 14...
## $ Pool_Area <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Pool_QC <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_...
## $ Fence <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, M...
## $ Misc_Feature <fct> None, None, Gar2, None, None, None, None, None, ...
## $ Misc_Val <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, ...
## $ Mo_Sold <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, ...
## $ Year_Sold <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, ...
## $ Sale_Type <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD ...
## $ Sale_Condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, ...
## $ Sale_Price <int> 215000, 105000, 172000, 244000, 189900, 195500, ...
## $ Longitude <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93....
## $ Latitude <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090...
```
---
## Take a look at some data plots - an example of one potential predictor variable
.center[Living Area]
<img src="slides_open1_files/figure-html/plot1-1.png" width="360" style="display: block; margin: auto;" /><img src="slides_open1_files/figure-html/plot1-2.png" width="360" style="display: block; margin: auto;" />
.center[Living Area vs Sale Price]
---
## We could use all the variables but for this example I am only using two
<img src="slides_open1_files/figure-html/plot1a-1.png" width="360" style="display: block; margin: auto;" /><img src="slides_open1_files/figure-html/plot1a-2.png" width="360" style="display: block; margin: auto;" />
---
## The response/target variable (Sale Price) we want to predict
<img src="slides_open1_files/figure-html/plot2-1.png" width="360" style="display: block; margin: auto;" /><img src="slides_open1_files/figure-html/plot2-2.png" width="360" style="display: block; margin: auto;" />
.center[(log10 transform)]
---
### Focus variables
```
## # A tibble: 5 x 3
## Latitude Longitude Sale_Price
## <dbl> <dbl> <int>
## 1 42.1 -93.6 215000
## 2 42.1 -93.6 105000
## 3 42.1 -93.6 172000
## 4 42.1 -93.6 244000
## 5 42.1 -93.6 189900
```
---
<img src="images/aml_kuhn.png" width="90%" style="display: block; margin: auto;" />
.footnote[Source: Max Kuhn & Davis Vaughan https://github.com/rstudio-conf-2020/applied-ml/blob/master/Part_1.pdf]
---
## Now we start the modelling workflow-
--
## First we want to split the data into a training set and a test set
--
## For that we use the Tidymodels package "rsample"
---
<img src="images/predicting.006.jpeg" width="100%" style="display: block; margin: auto;" />
.footnote[Source:https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/]
---
## .center[**Data Partitioning**]
Use the package "rsample"<img src="images/rsample.png" align="right" width=7% class="title-hex">
```r
set.seed(7014) # for reproducibility
ames_split <- initial_split(ames, prop = .70)
# prop sets the proportion of data allocated to the training set
ames_split
```
```
## <Training/Validation/Total>
## <2051/879/2930>
```
```r
ames_train <- training(ames_split)
ames_cv <- vfold_cv(ames_train) # for cross-validation later
```
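The held-out test set can be extracted the same way with rsample's *testing()* accessor; a minimal sketch (not shown in the original slides):
```r
# extract the remaining 30% for final evaluation
ames_test <- testing(ames_split)
nrow(ames_test) # 879 rows, matching the split printed above
```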
---
## Next step - Define a recipe
### Recipes allow you to specify the role of each variable as an outcome or a predictor variable (using a “formula”), and any preprocessing steps you want to conduct (such as transforms, normalization, imputation, PCA, etc)
---
## Creating a recipe has two parts:
### 1. Specify the formula (recipe()): specify the outcome variable and predictor variables.
### 2. Specify preprocessing steps (step_*()): define the preprocessing steps, such as imputation, creating dummy variables, scaling, and more
---
## .center[**Preprocessing**]
Use the package "recipes"<img src="images/recipes.png" align="right" width=7% class="title-hex">
```r
mod_rec <-
recipe(Sale_Price ~ Longitude + Latitude,
data = ames_train) %>%
step_log(Sale_Price, base = 10)
```
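A recipe is only a specification until its steps are estimated. To peek at the preprocessed data outside of a workflow, the recipe can be prepped and juiced; a minimal sketch (not part of the original slides):
```r
# estimate (train) the recipe's steps on the training data
mod_rec_prepped <- prep(mod_rec, training = ames_train)
# juice() returns the preprocessed training set from a prepped recipe
head(juice(mod_rec_prepped))
```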
---
## We could have used many more variables with much more preprocessing, as this recipe shows:
```r
ames_rec <-
  recipe(Sale_Price ~ ., data = ames_train) %>%
  step_log(Sale_Price, base = 10) %>%
  step_YeoJohnson(Lot_Area, Gr_Liv_Area) %>%
  step_other(Neighborhood, threshold = .1) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_ns(Longitude, deg_free = tune("lon")) %>%
  step_ns(Latitude, deg_free = tune("lat"))
```
#### The full list of preprocessing steps available can be found here:
<https://recipes.tidymodels.org/reference/index.html>
---
###Reasons for Modifying the Data
- <h4>Some models (K-NN, SVMs, PLS, neural networks) require that the predictor variables have the same units. Centering and scaling the predictors can be used for this purpose.</h4>
--
- <h4>Other models are very sensitive to correlations between the predictors, and filters or PCA signal extraction can improve the model.</h4>
--
- <h4>Scaling the predictors using a transformation can lead to big improvements.</h4>
--
- <h4> Many models cannot cope with missing data so imputation strategies might be necessary.</h4>
--
- <h4>In some cases, the data can be encoded in a way that maximizes its effect on the model; for example, encoding a date as the day of the week can be very effective for modeling air pollution data.</h4>
.footnote[Source:Max Kuhn Workshop Presentation at the 'nyr-2020' Meeting, August 2020]
---
## .center[**Model Training & Tuning**]
## **Parsnip** offers a unified interface for the "massive" variety of models that exist in R.
## This means you only have to learn one way of specifying a model; the same specification can then generate a linear model, a random forest model, a support vector machine model, and more with only a few lines of code.
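For instance, swapping the computational engine behind a single specification is a one-line change; a hedged sketch (these particular models are not fitted anywhere in these slides):
```r
# one model type, two different underlying engines
rf_ranger <- rand_forest(mode = "regression", trees = 500) %>%
  set_engine("ranger")
rf_classic <- rand_forest(mode = "regression", trees = 500) %>%
  set_engine("randomForest")
```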
---
## Specifying the model:
## 1) *The model type*: what kind of model you want to fit, set with the function for that specific model, such as rand_forest() for random forest, logistic_reg() for logistic regression, etc.
## 2) *The engine*: the underlying package the model should come from (e.g.*ranger* for the ranger implementation of Random Forest), set using *set_engine()*.
---
## 3) *The mode*: the type of prediction - since several packages can do both classification (binary/categorical prediction) and regression (continuous prediction), set using *set_mode()*.
## 4) *The arguments*: the model parameter values (now consistently named across different models), set using *set_args()*.
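Putting the four pieces together; a sketch of one possible specification (for illustration only, not the model used later in these slides):
```r
rf_spec <-
  rand_forest() %>%                 # 1) model type
  set_engine("ranger") %>%          # 2) engine
  set_mode("regression") %>%        # 3) mode
  set_args(mtry = 3, trees = 500)   # 4) arguments
rf_spec
```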
---
<img src="images/predicting.007.jpeg" width="100%" style="display: block; margin: auto;" />
.footnote[Source:https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/]
---
## A list of models accessible via **parsnip**
### The mode of a model is related to its goal: *regression* or *classification*
### common models: boost_tree(), decision_tree(), mars(), mlp(), nearest_neighbor(), rand_forest(), svm_poly(), svm_rbf(), ...
### classification: logistic_reg(), multinom_reg()
### regression: linear_reg(), surv_reg()
Source: vignettes/articles/Models.Rmd
####For a more complete list, please see:
<https://www.tidymodels.org/find/parsnip/>
---
Use "parsnip"<img src="images/parsnip.png" align="right" width=7% class="title-hex">
####We can easily fit a 5-nearest-neighbor model with the following code:
```r
fit_knn <-
nearest_neighbor(mode = "regression", neighbors = 5) %>%
set_engine("kknn") %>%
fit(log10(Sale_Price) ~ Longitude + Latitude, data = ames_train)
fit_knn
```
```
## parsnip model object
##
## Fit time: 40ms
##
## Call:
## kknn::train.kknn(formula = log10(Sale_Price) ~ Longitude + Latitude, data = data, ks = ~5)
##
## Type of response variable: continuous
## minimal mean absolute error: 0.07027179
## Minimal mean squared error: 0.009850297
## Best kernel: optimal
## Best k: 5
```
---
### A few words about kNN - nearest neighbor model
#### k-nearest neighbor is a relatively simple supervised algorithm where each observation is predicted based on its "similarity" to its nearest neighbors. Its particular strengths are:
####1) the algorithm is easy to understand, which enables better interpretation of your results
####2) it makes no assumptions about the data, such as its distributional form
####3) it may not provide the best prediction accuracy, but it works well for a wide variety of datasets
####4) it can be very useful for preprocessing purposes
---
Use "parsnip"<img src="images/parsnip.png" align="right" width=7% class="title-hex">
### Define a kNN regression model with the tuning parameters left empty for later tuning.
```r
knn_mod <-
  # specify the model as nearest neighbor
  nearest_neighbor() %>%
  # flag two kNN tuning parameters: the number of neighbors
  # and the distance exponent
  set_args(neighbors = tune(), dist_power = tune()) %>%
  # set the R package associated with the model
  set_engine("kknn") %>%
  set_mode("regression")
```
---
## .center[**Putting it all together with workflow**]
### Now I am going to diverge from the modelling steps I showed during the introduction.
### This is to introduce a new package entitled "workflows".
#### *workflows* is a package that can bundle together your preprocessing, modeling, and postprocessing requests. For example, you don’t have to keep track of separate objects in your workspace, and model fitting can be executed using a single call to *fit()*.
---
Use "workflow"<img src="images/workflows.png" align="right" width=7% class="title-hex">
### Construct a workflow that combines your recipe and your model
```r
ml_wflow <-
workflow() %>%
#add the recipe
add_recipe(mod_rec) %>%
#add the model
add_model(knn_mod)
```
---
## .center[**Tune the hyperparameters**]
### We need to tune the model (i.e. choose the hyperparameter values that lead to the best performance) before fitting our final model. If there are no hyperparameters to tune, you can skip this step.
### In this example, we do the tuning using the 10-fold cross-validation object we created previously. We then specify candidate values for the number of neighbors and the distance exponent, and add a tuning layer to our workflow using the function *tune_grid()*, as sketched below.
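Instead of letting *tune_grid()* choose candidate values automatically (grid = 10, as used later), an explicit grid can be built with the dials package; a hedged sketch (not used in the original analysis):
```r
# a regular grid over both kNN tuning parameters
knn_grid <- grid_regular(
  neighbors(range = c(1, 15)),
  dist_power(range = c(0.1, 2)),
  levels = 5
)
knn_grid # 25 candidate combinations
```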
---
### 10 fold cross validation example
<img src="images/cv-plot.svg" width="90%" style="display: block; margin: auto;" />
.footnote[Source:Max Kuhn & Davis Vaughan https://github.com/rstudio-conf-2020/applied-ml/blob/master/Part_4.pdf]
---
<img src="images/resampling.svg" width="90%" style="display: block; margin: auto;" />
.footnote[source:<https://rviews.rstudio.com/2020/04/21/the-case-for-tidymodels/>]
---
## Use "tune" and "yardstick" package to find best model
<img src="images/yardstick.png" align="right" width=7% class="title-hex">
<img src="images/tune.png" align="right" width=7% class="title-hex">
### Objective: find the best model
```r
ml_wflow_tune <-
ml_wflow %>%
  tune_grid(resamples = ames_cv, # CV object
            grid = 10, # number of candidate parameter combinations
            metrics = metric_set(rmse)) # performance metric of interest
```
---
<img src="images/grid-plot.svg" width="75%" style="display: block; margin: auto;" />
---
class:center,middle
<img src="images/tidymodels1a.svg" style="display: block; margin: auto;" />
.footnote[<https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/>]
---
# .center[A word about yardstick]
## **yardstick** is a package to estimate how well models are working
### There are a large number of different metrics that one can use for assessment. The three main groups are:
#### 1) *class metrics*, such as accuracy, sensitivity, kappa, ...
#### 2) *probability metrics*, such as the area under the receiver operating characteristic (ROC) curve, ...
#### 3) *regression metrics*, such as root mean squared error, R², mean absolute error, and many more (see the sketch after this list)
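All yardstick metrics share one interface: a data frame of results plus truth and estimate columns. A minimal illustrative sketch (assumed toy data, not from these slides):
```r
# compute regression metrics from a small data frame of predictions
preds <- tibble(truth = c(5.1, 5.3, 5.6), estimate = c(5.0, 5.4, 5.5))
rmse(preds, truth = truth, estimate = estimate)
# several metrics at once via a metric set
multi_metrics <- metric_set(rmse, rsq, mae)
multi_metrics(preds, truth = truth, estimate = estimate)
```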
####For a complete list of metric choices, please see:
<https://tidymodels.github.io/yardstick/reference/index.html>
---
### collecting the results of the tuning
```r
res <- ml_wflow_tune %>%
collect_metrics()
```
---
## printing the results
```r
res
```
```
## # A tibble: 10 x 7
## neighbors dist_power .metric .estimator mean n std_err
## <int> <dbl> <chr> <chr> <dbl> <int> <dbl>
## 1 1 1.16 rmse standard 0.121 10 0.00519
## 2 3 0.0548 rmse standard 0.110 10 0.00507
## 3 4 1.32 rmse standard 0.102 10 0.00452
## 4 6 1.50 rmse standard 0.0988 10 0.00451
## 5 7 0.530 rmse standard 0.0979 10 0.00484
## 6 8 0.227 rmse standard 0.0993 10 0.00486
## 7 10 0.866 rmse standard 0.0980 10 0.00493
## 8 12 0.387 rmse standard 0.0985 10 0.00491
## 9 13 0.659 rmse standard 0.0982 10 0.00493
## 10 14 1.00 rmse standard 0.0989 10 0.00496
```
---
### Plot performance over iterations for cross-validation
```r
autoplot(ml_wflow_tune, metric = "rmse")
```
<img src="slides_open1_files/figure-html/iter-1.png" width="60%" style="display: block; margin: auto;" />
---
## **Finalize the workflow**
### Now we need to add a layer to our workflow that fixes the tuned parameters at the values that yielded the best results.
### We extract the best values for our performance metric (RMSE) by applying the *select_best()* function to the tune object.
---
## Select best parameters
```r
best_params <-
ml_wflow_tune %>%
select_best(metric = "rmse")
```
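To inspect the top candidates rather than only the single best one, tune also provides *show_best()*; a small sketch (not in the original slides):
```r
# top 3 parameter combinations ranked by RMSE
ml_wflow_tune %>%
  show_best(metric = "rmse", n = 3)
```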
## Finalize workflow
```r
# Now add this parameter (best_params) to the workflow using the
# finalize_workflow() function.
ames_reg_res <-
ml_wflow %>%
finalize_workflow(best_params)
```
---
<img src="images/predicting.008.jpeg" width="95%" style="display: block; margin: auto;" />
.footnote[Source:https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/]
---
## .center[**Fit the final model**]
## Now that we’ve defined our recipe and our model, and tuned the model’s parameters, we’re ready to actually fit the final model. Since all of this information is contained within the workflow object, we just apply the *last_fit()* function to our workflow and the train/test split object. This will automatically train the model specified by the workflow using the training data, and produce evaluations based on the *test* set.
---
## Fit using the entire training data
```r
ames_wfl_fit <- ames_reg_res %>%
last_fit(ames_split)
```
---
<img src="images/predicting.009.jpeg" width="95%" style="display: block; margin: auto;" />
.footnote[Source:https://alison.rbind.io/post/2019-12-23-learning-to-teach-machines-to-learn/]
---
## .center[**Test Performance**]
### Since we supplied the train/test object when we fit the workflow object, the metrics are evaluated on the test set. Now when we use the *collect_metrics()* function, it extracts the performance of the final model (since *ames_wfl_fit* now consists of a single final model) applied to the test set.
```r
test_performance <- ames_wfl_fit %>%
collect_metrics()
test_performance #print the results
```
```
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0959
## 2 rsq standard 0.703
```
---
## extract the test set predictions themselves
```r
test_predictions <- ames_wfl_fit %>%
collect_predictions()
test_predictions #print the results
```
```
## # A tibble: 879 x 4
## id .pred .row Sale_Price
## <chr> <dbl> <int> <dbl>
## 1 train/test split 5.17 3 5.24
## 2 train/test split 5.27 5 5.28
## 3 train/test split 5.27 6 5.29
## 4 train/test split 5.31 9 5.37
## 5 train/test split 5.26 11 5.25
## 6 train/test split 5.24 12 5.27
## 7 train/test split 5.22 14 5.23
## 8 train/test split 5.66 16 5.73
## 9 train/test split 5.15 24 5.17
## 10 train/test split 5.17 25 5.18
## # ... with 869 more rows
```
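The next slide plots these predictions against the observed values; a sketch of how such a plot might be drawn with ggplot2 (assumed code, not necessarily what produced the figure):
```r
test_predictions %>%
  ggplot(aes(x = Sale_Price, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Observed log10(Sale_Price)", y = "Predicted log10(Sale_Price)")
```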
---
<img src="slides_open1_files/figure-html/plotfinal-1.png" width="70%" style="display: block; margin: auto;" />
---
class: center,middle
## The model’s performance is not great, but remember that in this example we used only 2 predictors (longitude and latitude), which accounted for ~70% of the variance. In some ways, the real job of modelling house sale prices starts now!
---
## Finally, to sum up a tidymodels workflow
<img src="images/finalflow.svg" width="90%" style="display: block; margin: auto;" />
.footnote[Source: Max Kuhn & Davis Vaughan https://github.com/rstudio-conf-2020/applied-ml/blob/master/Part_3.pdf]
---
###A test: Match tasks to packages
.pull-left[- Fit a K-NN model
- Extract holidays from dates
- Make a training/test split
- Bundle a recipe and model
- Is high in vitamin A
- Compute R²
- Bin a predictor (but seriously,...don't)
]
.pull-right[
- workflows
- yardstick
- carrot
- recipes
- parsnip
- rsample
- ggvis
]
.footnote[Source:Max Kuhn Workshop Presentation at the 'nyr-2020' Meeting, August 2020]
---
## .center[For more information and code, you can find examples of these topics on my GitHub site]
- build another kNN model with different predictors; in addition, build a random forest model and a lasso model
- compare the different model results for the Ames data
- look more deeply into cross-validation techniques and other methods of tuning hyperparameters
- investigate the implications of the bias-variance tradeoff and the problem of overfitting
- discuss aspects of model performance assessment in relation to a TidyTuesday dataset (global volcano data); you can run this with the supplied Rmd file
---
# .center[Some references you might find useful]
### - Boehmke, B. and Greenwell, B., 2020: Hands-On Machine Learning with R, CRC Press, NY.
### - Burkov, A., 2019: The Hundred-Page Machine Learning Book, self-published (email: [email protected])
### - Rhys, H., 2020: Machine Learning with R, the tidyverse and mlr, Manning, Shelter Island
---
##Further Resources
#### * For further information about each of the tidymodels packages, I recommend the vignettes/articles on the respective package homepages (e.g., https://tidymodels.github.io/recipes/ or https://tune.tidymodels.org/articles/getting_started.html).
#### * Variable importance (plots) are provided by the package 'vip', which works well in combination with tidymodels packages.
#### * Recipe steps for dealing with unbalanced data are provided by the 'themis' package. There are a few more tidymodels packages that are not covered herein, like 'infer' or 'embed'. Read more about these and other packages at https://tidymodels.tidymodels.org/.
#### * Also see my slide deck "Deeper Dive into Tidymodels for Machine Learning in R" and the Rmd file (pdf too) "Multinomial classification with tidymodels and TidyTuesday volcano data" in my github account. (github.com/jelewis and repository:OpenGeoHubSlides2)
---
## In space, no one can hear you scream.
– Alien (1979)
Luckily tidymodels is a friendlier place. Ease of adoption and ease of use are fundamental design principles for the packages within the tidymodels ecosystem.
## .center[The end]
Thanks to Max Kuhn, Alison Hill, and Julia Silge for all the insightful material concerning 'Tidymodels' that they have made available on the internet. It has made my presentation possible.
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create();
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {