forked from KimmoVehkalahti/IODS-project
-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.html
1477 lines (1387 loc) · 597 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Introduction to Open Data Science - Course Project</title>
<script type="text/javascript">
window.onload = function() {
var imgs = document.getElementsByTagName('img'), i, img;
for (i = 0; i < imgs.length; i++) {
img = imgs[i];
// center an image if it is the only element of its parent
if (img.parentElement.childElementCount === 1)
img.parentElement.style.textAlign = 'center';
}
};
</script>
<!-- Styles for R syntax highlighter -->
<style type="text/css">
pre .operator,
pre .paren {
color: rgb(104, 118, 135)
}
pre .literal {
color: #990073
}
pre .number {
color: #099;
}
pre .comment {
color: #998;
font-style: italic
}
pre .keyword {
color: #900;
font-weight: bold
}
pre .identifier {
color: rgb(0, 0, 0);
}
pre .string {
color: #d14;
}
</style>
<!-- R syntax highlighter -->
<script type="text/javascript">
var hljs=new function(){function m(p){return p.replace(/&/gm,"&").replace(/</gm,"<")}function f(r,q,p){return RegExp(q,"m"+(r.cI?"i":"")+(p?"g":""))}function b(r){for(var p=0;p<r.childNodes.length;p++){var q=r.childNodes[p];if(q.nodeName=="CODE"){return q}if(!(q.nodeType==3&&q.nodeValue.match(/\s+/))){break}}}function h(t,s){var p="";for(var r=0;r<t.childNodes.length;r++){if(t.childNodes[r].nodeType==3){var q=t.childNodes[r].nodeValue;if(s){q=q.replace(/\n/g,"")}p+=q}else{if(t.childNodes[r].nodeName=="BR"){p+="\n"}else{p+=h(t.childNodes[r])}}}if(/MSIE [678]/.test(navigator.userAgent)){p=p.replace(/\r/g,"\n")}return p}function a(s){var r=s.className.split(/\s+/);r=r.concat(s.parentNode.className.split(/\s+/));for(var q=0;q<r.length;q++){var p=r[q].replace(/^language-/,"");if(e[p]){return p}}}function c(q){var p=[];(function(s,t){for(var r=0;r<s.childNodes.length;r++){if(s.childNodes[r].nodeType==3){t+=s.childNodes[r].nodeValue.length}else{if(s.childNodes[r].nodeName=="BR"){t+=1}else{if(s.childNodes[r].nodeType==1){p.push({event:"start",offset:t,node:s.childNodes[r]});t=arguments.callee(s.childNodes[r],t);p.push({event:"stop",offset:t,node:s.childNodes[r]})}}}}return t})(q,0);return p}function k(y,w,x){var q=0;var z="";var s=[];function u(){if(y.length&&w.length){if(y[0].offset!=w[0].offset){return(y[0].offset<w[0].offset)?y:w}else{return w[0].event=="start"?y:w}}else{return y.length?y:w}}function t(D){var A="<"+D.nodeName.toLowerCase();for(var B=0;B<D.attributes.length;B++){var C=D.attributes[B];A+=" "+C.nodeName.toLowerCase();if(C.value!==undefined&&C.value!==false&&C.value!==null){A+='="'+m(C.value)+'"'}}return A+">"}while(y.length||w.length){var v=u().splice(0,1)[0];z+=m(x.substr(q,v.offset-q));q=v.offset;if(v.event=="start"){z+=t(v.node);s.push(v.node)}else{if(v.event=="stop"){var p,r=s.length;do{r--;p=s[r];z+=("</"+p.nodeName.toLowerCase()+">")}while(p!=v.node);s.splice(r,1);while(r<s.length){z+=t(s[r]);r++}}}}return z+m(x.substr(q))}function j(){function q(x,y,v){if(x.compiled){return}var u;var s=[];if(x.k){x.lR=f(y,x.l||hljs.IR,true);for(var w in x.k){if(!x.k.hasOwnProperty(w)){continue}if(x.k[w] instanceof Object){u=x.k[w]}else{u=x.k;w="keyword"}for(var r in u){if(!u.hasOwnProperty(r)){continue}x.k[r]=[w,u[r]];s.push(r)}}}if(!v){if(x.bWK){x.b="\\b("+s.join("|")+")\\s"}x.bR=f(y,x.b?x.b:"\\B|\\b");if(!x.e&&!x.eW){x.e="\\B|\\b"}if(x.e){x.eR=f(y,x.e)}}if(x.i){x.iR=f(y,x.i)}if(x.r===undefined){x.r=1}if(!x.c){x.c=[]}x.compiled=true;for(var t=0;t<x.c.length;t++){if(x.c[t]=="self"){x.c[t]=x}q(x.c[t],y,false)}if(x.starts){q(x.starts,y,false)}}for(var p in e){if(!e.hasOwnProperty(p)){continue}q(e[p].dM,e[p],true)}}function d(B,C){if(!j.called){j();j.called=true}function q(r,M){for(var L=0;L<M.c.length;L++){if((M.c[L].bR.exec(r)||[null])[0]==r){return M.c[L]}}}function v(L,r){if(D[L].e&&D[L].eR.test(r)){return 1}if(D[L].eW){var M=v(L-1,r);return M?M+1:0}return 0}function w(r,L){return L.i&&L.iR.test(r)}function K(N,O){var M=[];for(var L=0;L<N.c.length;L++){M.push(N.c[L].b)}var r=D.length-1;do{if(D[r].e){M.push(D[r].e)}r--}while(D[r+1].eW);if(N.i){M.push(N.i)}return f(O,M.join("|"),true)}function p(M,L){var N=D[D.length-1];if(!N.t){N.t=K(N,E)}N.t.lastIndex=L;var r=N.t.exec(M);return r?[M.substr(L,r.index-L),r[0],false]:[M.substr(L),"",true]}function z(N,r){var L=E.cI?r[0].toLowerCase():r[0];var M=N.k[L];if(M&&M instanceof Array){return M}return false}function F(L,P){L=m(L);if(!P.k){return L}var r="";var O=0;P.lR.lastIndex=0;var M=P.lR.exec(L);while(M){r+=L.substr(O,M.index-O);var N=z(P,M);if(N){x+=N[1];r+='<span class="'+N[0]+'">'+M[0]+"</span>"}else{r+=M[0]}O=P.lR.lastIndex;M=P.lR.exec(L)}return r+L.substr(O,L.length-O)}function J(L,M){if(M.sL&&e[M.sL]){var r=d(M.sL,L);x+=r.keyword_count;return r.value}else{return F(L,M)}}function I(M,r){var L=M.cN?'<span class="'+M.cN+'">':"";if(M.rB){y+=L;M.buffer=""}else{if(M.eB){y+=m(r)+L;M.buffer=""}else{y+=L;M.buffer=r}}D.push(M);A+=M.r}function G(N,M,Q){var R=D[D.length-1];if(Q){y+=J(R.buffer+N,R);return false}var P=q(M,R);if(P){y+=J(R.buffer+N,R);I(P,M);return P.rB}var L=v(D.length-1,M);if(L){var O=R.cN?"</span>":"";if(R.rE){y+=J(R.buffer+N,R)+O}else{if(R.eE){y+=J(R.buffer+N,R)+O+m(M)}else{y+=J(R.buffer+N+M,R)+O}}while(L>1){O=D[D.length-2].cN?"</span>":"";y+=O;L--;D.length--}var r=D[D.length-1];D.length--;D[D.length-1].buffer="";if(r.starts){I(r.starts,"")}return R.rE}if(w(M,R)){throw"Illegal"}}var E=e[B];var D=[E.dM];var A=0;var x=0;var y="";try{var s,u=0;E.dM.buffer="";do{s=p(C,u);var t=G(s[0],s[1],s[2]);u+=s[0].length;if(!t){u+=s[1].length}}while(!s[2]);if(D.length>1){throw"Illegal"}return{r:A,keyword_count:x,value:y}}catch(H){if(H=="Illegal"){return{r:0,keyword_count:0,value:m(C)}}else{throw H}}}function g(t){var p={keyword_count:0,r:0,value:m(t)};var r=p;for(var q in e){if(!e.hasOwnProperty(q)){continue}var s=d(q,t);s.language=q;if(s.keyword_count+s.r>r.keyword_count+r.r){r=s}if(s.keyword_count+s.r>p.keyword_count+p.r){r=p;p=s}}if(r.language){p.second_best=r}return p}function i(r,q,p){if(q){r=r.replace(/^((<[^>]+>|\t)+)/gm,function(t,w,v,u){return w.replace(/\t/g,q)})}if(p){r=r.replace(/\n/g,"<br>")}return r}function n(t,w,r){var x=h(t,r);var v=a(t);var y,s;if(v){y=d(v,x)}else{return}var q=c(t);if(q.length){s=document.createElement("pre");s.innerHTML=y.value;y.value=k(q,c(s),x)}y.value=i(y.value,w,r);var u=t.className;if(!u.match("(\\s|^)(language-)?"+v+"(\\s|$)")){u=u?(u+" "+v):v}if(/MSIE [678]/.test(navigator.userAgent)&&t.tagName=="CODE"&&t.parentNode.tagName=="PRE"){s=t.parentNode;var p=document.createElement("div");p.innerHTML="<pre><code>"+y.value+"</code></pre>";t=p.firstChild.firstChild;p.firstChild.cN=s.cN;s.parentNode.replaceChild(p.firstChild,s)}else{t.innerHTML=y.value}t.className=u;t.result={language:v,kw:y.keyword_count,re:y.r};if(y.second_best){t.second_best={language:y.second_best.language,kw:y.second_best.keyword_count,re:y.second_best.r}}}function o(){if(o.called){return}o.called=true;var r=document.getElementsByTagName("pre");for(var p=0;p<r.length;p++){var q=b(r[p]);if(q){n(q,hljs.tabReplace)}}}function l(){if(window.addEventListener){window.addEventListener("DOMContentLoaded",o,false);window.addEventListener("load",o,false)}else{if(window.attachEvent){window.attachEvent("onload",o)}else{window.onload=o}}}var e={};this.LANGUAGES=e;this.highlight=d;this.highlightAuto=g;this.fixMarkup=i;this.highlightBlock=n;this.initHighlighting=o;this.initHighlightingOnLoad=l;this.IR="[a-zA-Z][a-zA-Z0-9_]*";this.UIR="[a-zA-Z_][a-zA-Z0-9_]*";this.NR="\\b\\d+(\\.\\d+)?";this.CNR="\\b(0[xX][a-fA-F0-9]+|(\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)";this.BNR="\\b(0b[01]+)";this.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|\\.|-|-=|/|/=|:|;|<|<<|<<=|<=|=|==|===|>|>=|>>|>>=|>>>|>>>=|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.ER="(?![\\s\\S])";this.BE={b:"\\\\.",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.inherit=function(r,s){var p={};for(var q in r){p[q]=r[q]}if(s){for(var q in s){p[q]=s[q]}}return p}}();hljs.LANGUAGES.cpp=function(){var a={keyword:{"false":1,"int":1,"float":1,"while":1,"private":1,"char":1,"catch":1,"export":1,virtual:1,operator:2,sizeof:2,dynamic_cast:2,typedef:2,const_cast:2,"const":1,struct:1,"for":1,static_cast:2,union:1,namespace:1,unsigned:1,"long":1,"throw":1,"volatile":2,"static":1,"protected":1,bool:1,template:1,mutable:1,"if":1,"public":1,friend:2,"do":1,"return":1,"goto":1,auto:1,"void":2,"enum":1,"else":1,"break":1,"new":1,extern:1,using:1,"true":1,"class":1,asm:1,"case":1,typeid:1,"short":1,reinterpret_cast:2,"default":1,"double":1,register:1,explicit:1,signed:1,typename:1,"try":1,"this":1,"switch":1,"continue":1,wchar_t:1,inline:1,"delete":1,alignof:1,char16_t:1,char32_t:1,constexpr:1,decltype:1,noexcept:1,nullptr:1,static_assert:1,thread_local:1,restrict:1,_Bool:1,complex:1},built_in:{std:1,string:1,cin:1,cout:1,cerr:1,clog:1,stringstream:1,istringstream:1,ostringstream:1,auto_ptr:1,deque:1,list:1,queue:1,stack:1,vector:1,map:1,set:1,bitset:1,multiset:1,multimap:1,unordered_set:1,unordered_map:1,unordered_multiset:1,unordered_multimap:1,array:1,shared_ptr:1}};return{dM:{k:a,i:"</",c:[hljs.CLCM,hljs.CBLCLM,hljs.QSM,{cN:"string",b:"'\\\\?.",e:"'",i:"."},{cN:"number",b:"\\b(\\d+(\\.\\d*)?|\\.\\d+)(u|U|l|L|ul|UL|f|F)"},hljs.CNM,{cN:"preprocessor",b:"#",e:"$"},{cN:"stl_container",b:"\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\s*<",e:">",k:a,r:10,c:["self"]}]}}}();hljs.LANGUAGES.r={dM:{c:[hljs.HCM,{cN:"number",b:"\\b0[xX][0-9a-fA-F]+[Li]?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+(?:[eE][+\\-]?\\d*)?L\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\b\\d+\\.(?!\\d)(?:i\\b)?",e:hljs.IMMEDIATE_RE,r:1},{cN:"number",b:"\\b\\d+(?:\\.\\d*)?(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"number",b:"\\.\\d+(?:[eE][+\\-]?\\d*)?i?\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"keyword",b:"(?:tryCatch|library|setGeneric|setGroupGeneric)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\.",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\.\\.\\d+(?![\\w.])",e:hljs.IMMEDIATE_RE,r:10},{cN:"keyword",b:"\\b(?:function)",e:hljs.IMMEDIATE_RE,r:2},{cN:"keyword",b:"(?:if|in|break|next|repeat|else|for|return|switch|while|try|stop|warning|require|attach|detach|source|setMethod|setClass)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"literal",b:"(?:NA|NA_integer_|NA_real_|NA_character_|NA_complex_)\\b",e:hljs.IMMEDIATE_RE,r:10},{cN:"literal",b:"(?:NULL|TRUE|FALSE|T|F|Inf|NaN)\\b",e:hljs.IMMEDIATE_RE,r:1},{cN:"identifier",b:"[a-zA-Z.][a-zA-Z0-9._]*\\b",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"<\\-(?!\\s*\\d)",e:hljs.IMMEDIATE_RE,r:2},{cN:"operator",b:"\\->|<\\-",e:hljs.IMMEDIATE_RE,r:1},{cN:"operator",b:"%%|~",e:hljs.IMMEDIATE_RE},{cN:"operator",b:">=|<=|==|!=|\\|\\||&&|=|\\+|\\-|\\*|/|\\^|>|<|!|&|\\||\\$|:",e:hljs.IMMEDIATE_RE,r:0},{cN:"operator",b:"%",e:"%",i:"\\n",r:1},{cN:"identifier",b:"`",e:"`",r:0},{cN:"string",b:'"',e:'"',c:[hljs.BE],r:0},{cN:"string",b:"'",e:"'",c:[hljs.BE],r:0},{cN:"paren",b:"[[({\\])}]",e:hljs.IMMEDIATE_RE,r:0}]}};
hljs.initHighlightingOnLoad();
</script>
<style type="text/css">
body, td {
font-family: sans-serif;
background-color: white;
font-size: 13px;
}
body {
max-width: 800px;
margin: auto;
padding: 1em;
line-height: 20px;
}
tt, code, pre {
font-family: 'DejaVu Sans Mono', 'Droid Sans Mono', 'Lucida Console', Consolas, Monaco, monospace;
}
h1 {
font-size:2.2em;
}
h2 {
font-size:1.8em;
}
h3 {
font-size:1.4em;
}
h4 {
font-size:1.0em;
}
h5 {
font-size:0.9em;
}
h6 {
font-size:0.8em;
}
a:visited {
color: rgb(50%, 0%, 50%);
}
pre, img {
max-width: 100%;
}
pre {
overflow-x: auto;
}
pre code {
display: block; padding: 0.5em;
}
code {
font-size: 92%;
border: 1px solid #ccc;
}
code[class] {
background-color: #F8F8F8;
}
table, td, th {
border: none;
}
blockquote {
color:#666666;
margin:0;
padding-left: 1em;
border-left: 0.5em #EEE solid;
}
hr {
height: 0px;
border-bottom: none;
border-top-width: thin;
border-top-style: dotted;
border-top-color: #999999;
}
@media print {
* {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
body {
font-size:12pt;
max-width:100%;
}
a, a:visited {
text-decoration: underline;
}
hr {
visibility: hidden;
page-break-before: always;
}
pre, blockquote {
padding-right: 1em;
page-break-inside: avoid;
}
tr, img {
page-break-inside: avoid;
}
img {
max-width: 100% !important;
}
@page :left {
margin: 15mm 20mm 15mm 10mm;
}
@page :right {
margin: 15mm 10mm 15mm 20mm;
}
p, h2, h3 {
orphans: 3; widows: 3;
}
h2, h3 {
page-break-after: avoid;
}
}
</style>
</head>
<body>
<hr />
<h1>Introduction to Open Data Science - Course Project</h1>
<h2>If you’re peer reviewing this after 13 Dec…</h2>
<p>..you might be wondering why it looks this crappy. I had major issues with knitr and had to revert to knit2html, which apparently uses an old version of Rmarkdown, and I ran horribly out of time while troubleshooting. Sorry!</p>
<hr />
<h1>Chapter 1</h1>
<p><a href="https://www.youtube.com/watch?v=bhlD1XfiEJE">Now playing</a><br />
<a href="https://github.com/voikukas/IODS-project">My GitHub IODS repository</a>\</p>
<h2><strong>How am I feeling right now?</strong><br />
5 minutes after switching from my UH computer to my personal computer, I finally got the environment working (after struggling with my work computer for 6 hours)! Feeling slightly annoyed now.</h2>
<p><strong>Impressions from R for health data science & Exercise set 1</strong><br />
I’m already slightly familiar with programming and R, so I didn’t go through everything in great detail.<br />
Many of the practical examples and recommendations, such as the one about equivalence & rounding, seem very useful and probably will have saved me from a big headache. On the other hand, I skimmed through the import data section as it had nothing new for me. Everything else will serve as a great reminder.</p>
<pre><code class="language-r"># This is a so-called "R chunk" where you can write R code.
date()
</code></pre>
<pre><code>## [1] "Tue Dec 13 02:32:39 2022"
</code></pre>
<hr />
<h1>Regression and model validation</h1>
<p><em>Describe the work you have done this week and summarize your learning.</em></p>
<p>This week, I worked with the exercise set and read through R for Health Data Science to be able to answer the stats questions. I learned something new about plotting (ggpairs is super nifty) and statistical analysis with R. I don’t want to touch SPSS ever again.</p>
<pre><code class="language-r">date()
</code></pre>
<pre><code>## [1] "Tue Dec 13 02:32:39 2022"
</code></pre>
<pre><code class="language-r">library(ggplot2)
library(GGally)
library(dplyr)
learning2014 <- read.csv("data/learning2014.csv")
</code></pre>
<h2>1.</h2>
<p>The data contain a condensed version of a full questionnaire dataset (details at <a href="https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt">https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS2-meta.txt</a>). Students in an introductory statistics course have answered a questionnaire regarding their approaches and attitudes towards learning and statistics; additionally, their exam points and other background variables have been collected.</p>
<p>Questionnaire variables:
Respondents have given their answers on a 1-5 scale.
From the raw data, certain variables have been combined (and scaled back) in order to measure the respondents’ use of the following learning approaches: deep (variable name deep), strategic (stra), and surface (surf). The variable ‘attitude’ combines questions about respondents’ attitude towards statistics.</p>
<p>Other variables:
gender
age
points (note: respondents with 0 points have been removed)</p>
<pre><code class="language-r">dim(learning2014)
</code></pre>
<pre><code>## [1] 166 7
</code></pre>
<pre><code class="language-r">head(learning2014)
</code></pre>
<pre><code>## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
</code></pre>
<p>There are 166 respondents (rows) and 7 variables (columns).</p>
<h2>2.</h2>
<pre><code class="language-r"># Checking number of female respondents
nrow(filter(learning2014, gender =="F"))
</code></pre>
<pre><code>## [1] 110
</code></pre>
<pre><code class="language-r"># Graphical overview & variable summaries
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=.3), lower = list(combo = wrap("facethist", bins = 20)))
p
</code></pre>
<pre><code>## plot: [1,1] [>--------------------------------------------------------------------] 2% est: 0s
## plot: [1,2] [==>------------------------------------------------------------------] 4% est: 2s
## plot: [1,3] [===>-----------------------------------------------------------------] 6% est: 2s
## plot: [1,4] [=====>---------------------------------------------------------------] 8% est: 3s
## plot: [1,5] [======>--------------------------------------------------------------] 10% est: 3s
## plot: [1,6] [=======>-------------------------------------------------------------] 12% est: 3s
## plot: [1,7] [=========>-----------------------------------------------------------] 14% est: 3s
## plot: [2,1] [==========>----------------------------------------------------------] 16% est: 3s
## plot: [2,2] [============>--------------------------------------------------------] 18% est: 3s
## plot: [2,3] [=============>-------------------------------------------------------] 20% est: 3s
## plot: [2,4] [==============>------------------------------------------------------] 22% est: 3s
## plot: [2,5] [================>----------------------------------------------------] 24% est: 3s
## plot: [2,6] [=================>---------------------------------------------------] 27% est: 2s
## plot: [2,7] [===================>-------------------------------------------------] 29% est: 2s
## plot: [3,1] [====================>------------------------------------------------] 31% est: 2s
## plot: [3,2] [======================>----------------------------------------------] 33% est: 2s
## plot: [3,3] [=======================>---------------------------------------------] 35% est: 2s
## plot: [3,4] [========================>--------------------------------------------] 37% est: 2s
## plot: [3,5] [==========================>------------------------------------------] 39% est: 2s
## plot: [3,6] [===========================>-----------------------------------------] 41% est: 2s
## plot: [3,7] [=============================>---------------------------------------] 43% est: 2s
## plot: [4,1] [==============================>--------------------------------------] 45% est: 2s
## plot: [4,2] [===============================>-------------------------------------] 47% est: 2s
## plot: [4,3] [=================================>-----------------------------------] 49% est: 2s
## plot: [4,4] [==================================>----------------------------------] 51% est: 2s
## plot: [4,5] [====================================>--------------------------------] 53% est: 2s
## plot: [4,6] [=====================================>-------------------------------] 55% est: 1s
## plot: [4,7] [======================================>------------------------------] 57% est: 1s
## plot: [5,1] [========================================>----------------------------] 59% est: 1s
## plot: [5,2] [=========================================>---------------------------] 61% est: 1s
## plot: [5,3] [===========================================>-------------------------] 63% est: 1s
## plot: [5,4] [============================================>------------------------] 65% est: 1s
## plot: [5,5] [=============================================>-----------------------] 67% est: 1s
## plot: [5,6] [===============================================>---------------------] 69% est: 1s
## plot: [5,7] [================================================>--------------------] 71% est: 1s
## plot: [6,1] [==================================================>------------------] 73% est: 1s
## plot: [6,2] [===================================================>-----------------] 76% est: 1s
## plot: [6,3] [=====================================================>---------------] 78% est: 1s
## plot: [6,4] [======================================================>--------------] 80% est: 1s
## plot: [6,5] [=======================================================>-------------] 82% est: 1s
## plot: [6,6] [=========================================================>-----------] 84% est: 1s
## plot: [6,7] [==========================================================>----------] 86% est: 0s
## plot: [7,1] [============================================================>--------] 88% est: 0s
## plot: [7,2] [=============================================================>-------] 90% est: 0s
## plot: [7,3] [==============================================================>------] 92% est: 0s
## plot: [7,4] [================================================================>----] 94% est: 0s
## plot: [7,5] [=================================================================>---] 96% est: 0s
## plot: [7,6] [===================================================================>-] 98% est: 0s
## plot: [7,7] [=====================================================================]100% est: 0s
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-10" /></p>
<p>There are 110 female respondents and 56 male respondents, which may bias the data towards female respondents. Thus, it makes sense to explore the data grouped by gender.<br />
By looking at the data densities grouped by gender, we can see that the distributions are somewhat similar in all variables. Largest gender differences seem to show in attitude (male respondents report a more neutral/positive attitude).</p>
<p>Most respondents are under 30 years old, but ages range from 17 to 55.
There are no significant interactions between age and other variables.</p>
<p>The strongest correlation can be found between attitude and points (.437, p < .001).
In males, there is a negative correlation between attitude and surface approach (r = -.374, p < .01). Males also show a strong negative correlation between deep and surface approaches (-.622, p < .001).</p>
<p>While there is a negative correlation between strategic and surface approaches (-.16, p < .05), it is not present in any of the genders separately.</p>
<h2>3. & 4.</h2>
<p>Chosen explanatory variables: attitude, deep, and surface approach</p>
<pre><code class="language-r">model <- lm(points ~ attitude + deep + surf, data = learning2014)
summary(model)
</code></pre>
<pre><code>##
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9168 -3.1487 0.3667 3.8326 11.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3551 4.7124 3.895 0.000143 ***
## attitude 3.4661 0.5766 6.011 1.18e-08 ***
## deep -0.9485 0.7903 -1.200 0.231815
## surf -1.0911 0.8360 -1.305 0.193669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1876
## F-statistic: 13.7 on 3 and 162 DF, p-value: 5.217e-08
</code></pre>
<p>Variables deep and surf do not have a statistically significant relationship with points; adjusting the model.</p>
<pre><code class="language-r">model <- lm(points ~ attitude + stra, data = learning2014)
summary(model)
</code></pre>
<pre><code>##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
</code></pre>
<p>We’re looking at the relationship between two variables; we’d like to see how “attitude” explains the obtained course points. Using ordinary least squares, simple linear regression fits a line to the observed data (where y is the dependent variable and x is the explanatory variable).
Residuals are the differences between observed values and the fitted line(s). The median of residuals is .56, which suggests that the data fits the model quite nicely (but further analyses show whether this is true or not).</p>
<p>The coefficient intercept estimates: when attitude+stra = 0, points are at ~9 (p < .001 at null hypothesis, so when slope is at 0). The coefficient estimate of the explanatory variables are the slopes of the regression line (p < .001 for attitude, p < .1 for stra).</p>
<p>The adjusted R-squared stands for the coefficient of determination and is an indicator of how much of the variability is explained by the explanatory variable. It is at .1951, which is not that great. Thus, attitude and stra are not the only factors explaining the scores.</p>
<h2>5.</h2>
<pre><code class="language-r"># Scatter plots
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
</code></pre>
<pre><code>## `geom_smooth()` using formula = 'y ~ x'
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-13" /></p>
<pre><code class="language-r">qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
</code></pre>
<pre><code>## `geom_smooth()` using formula = 'y ~ x'
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-13" /></p>
<pre><code class="language-r"># Diagnostic plots
par(mfrow=c(2,2))
plot(model, which=c(1,2,5))
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-13" />
A simple regression model assumes</p>
<p><em>A linear relationship between predictors and outcome</em>
The scatter plots describe a somewhat linear pattern.</p>
<p><em>Independence of residuals</em>
The observations are independent (each data point represents an individual).</p>
<p><em>Normal distribution of residuals</em>
Q-Q plot looks normally distributed.</p>
<p><em>Equal variance of residuals</em>
Residuals vs fitted plot looks balanced.</p>
<hr />
<h1>Logistic regression</h1>
<pre><code class="language-r">date()
</code></pre>
<pre><code>## [1] "Tue Dec 13 02:32:45 2022"
</code></pre>
<pre><code class="language-r">library(ggplot2)
library(GGally)
library(dplyr)
alc <- read.csv("data/alc.csv")
</code></pre>
<h2>2.</h2>
<p>The data, gathered of students in secondary education in two Portuguese schools, describe student achievement in Portuguese and mathematics. Detailed description of the data set and all attributes can be found <a href="https://archive.ics.uci.edu/ml/datasets/Student+Performance">here</a>.</p>
<p>Our version is modified from the original as follows:
Only students with both Portuguese and maths grades available have been included.</p>
<p>Averaged grades:
G1 first period grade average of Math and Portuguese (0-20)
G2 second period grade average of Math and Portuguese (0-20)
G3 third period grade average of Math and Portuguese (0-20)</p>
<p>Variables on alcohol use:
alc_use average alcohol use based on workday and weekend alcohol consumption (from 1 - very low to 5 - very high)
high_use high use of alcohol: TRUE if alc_use is > 2; otherwise FALSE</p>
<pre><code class="language-r">head(alc)
</code></pre>
<pre><code>## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian
## 1 GP F 18 U GT3 A 4 4 at_home teacher course mother
## 2 GP F 17 U GT3 T 1 1 at_home other course father
## 3 GP F 15 U LE3 T 1 1 at_home other other mother
## 4 GP F 15 U GT3 T 4 2 health services home mother
## 5 GP F 16 U GT3 T 3 3 other other home father
## 6 GP M 16 U LE3 T 4 3 services other reputation mother
## traveltime studytime schoolsup famsup activities nursery higher internet romantic famrel
## 1 2 2 yes no no yes yes no no 4
## 2 1 2 no yes no no yes yes no 5
## 3 1 2 yes no no yes yes yes no 4
## 4 1 3 no yes yes yes yes yes yes 3
## 5 1 2 no yes no yes yes no no 4
## 6 1 2 no yes yes yes yes yes no 5
## freetime goout Dalc Walc health failures paid absences G1 G2 G3 alc_use high_use
## 1 3 4 1 1 3 0 no 5 2 8 8 1.0 FALSE
## 2 3 3 1 1 3 0 no 3 7 8 8 1.0 FALSE
## 3 3 2 2 3 3 2 no 8 10 10 11 2.5 TRUE
## 4 2 2 1 1 5 0 no 1 14 14 14 1.0 FALSE
## 5 3 2 1 2 5 0 no 2 8 12 12 1.5 FALSE
## 6 4 2 1 2 5 0 no 8 14 14 14 1.5 FALSE
</code></pre>
<pre><code class="language-r">glimpse(alc)
</code></pre>
<pre><code>## Rows: 370
## Columns: 35
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", …
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F", "M", "M", "M", "F"…
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15, 16, 16, 16, 17, 16…
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U"…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE3", "GT3", "GT3", "…
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T", "T", "T", "A", "T"…
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4, 4, 4, 4, 2, 2, 2, …
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3, 3, 4, 2, 2, 4, 2, …
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "services", "other", "othe…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other", "other", "teacher",…
## $ reason <chr> "course", "course", "other", "home", "home", "reputation", "home", "home", "ho…
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mother", "mother", "mother"…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, …
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 2, 3, 1, …
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "y…
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "y…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "…
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3, 4, 5, 4, 5, 4, 1, …
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1, 4, 4, 5, 4, 3, 2, …
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3, 1, 2, 1, 4, 2, 2, …
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, …
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3, 1, 1, 3, 4, 1, 3, …
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5, 1, 5, 5, 5, 5, 5, …
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, …
## $ paid <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5, 0, 0, 1, 1, 2, 10,…
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16, 13, 10, 7, 10, 12,…
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16, 14, 12, 6, 11, 14…
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 16, 14, 12, 6, 11, 1…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0, 2.0, 1.5, 1.0, 1.5…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
</code></pre>
<p>There are 370 observations (students) and 35 attributes.</p>
<h2>3.</h2>
<p>Chosen variables and hypotheses:</p>
<ul>
<li>
<p>famrel (Quality of family relationships 1-5)
My suspicion is that poor family relationships might correlate with higher alcohol consumption, i.e., there could be a negative correlation.</p>
</li>
<li>
<p>G3 (Final grade 0-20)
Again, a possible negative correlation.</p>
</li>
<li>
<p>goout (Going out with friends 1-5)
Outgoing people might use more alcohol.</p>
</li>
<li>
<p>failures (Number of past class failures)
More failures might correlate with more alcohol consumption.</p>
</li>
</ul>
<h2>4.</h2>
<pre><code class="language-r">g1 <- ggplot(data = alc, aes(x = high_use))
g2 <- ggplot(data = alc, aes(x = alc_use))
# Plot bar plots grouped by family relations
g1 + geom_bar() + facet_wrap("famrel")
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-16" /></p>
<pre><code class="language-r"># Most respondents report great relations with their parents. In fact, there are very few respondents who report worse relations (famrel <=2) to begin with.
# Those with worse relations do not report higher alcohol use relative to those with better relations.
# It's hard to tell counts from bar plots, so let's check cross-tabulations
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
</code></pre>
<pre><code>## `summarise()` has grouped output by 'famrel'. You can override using the `.groups` argument.
</code></pre>
<pre><code>## # A tibble: 10 × 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 9
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 128
## 8 4 TRUE 52
## 9 5 FALSE 77
## 10 5 TRUE 23
</code></pre>
<pre><code class="language-r"># Here, my hypothesis seems wrong.
# Plot boxplots of grade and high use
g3 <- ggplot(alc, aes(x = high_use, y = G3))
g3 + geom_boxplot() + ylab("grade")
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-16" /></p>
<pre><code class="language-r"># Non-high users seem to have slightly larger variation in their grades, and indeed better grades. There are some outliers among high users in both ends, though.
# Plot bar plots grouped by going out
g1 + geom_bar() + facet_wrap("goout")
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-16" /></p>
<pre><code class="language-r"># Most people seem to be in the "moderately outgoing" category (goout value at 3).
# Very outgoing people are more likely to report higher use of alcohol than lower use of alcohol, which lends support to my hypothesis.
# Let's look at cross-tabulations, too
alc %>% group_by(goout, high_use) %>% summarise(count = n())
</code></pre>
<pre><code>## `summarise()` has grouped output by 'goout'. You can override using the `.groups` argument.
</code></pre>
<pre><code>## # A tibble: 10 × 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 82
## 4 2 TRUE 15
## 5 3 FALSE 97
## 6 3 TRUE 23
## 7 4 FALSE 40
## 8 4 TRUE 38
## 9 5 FALSE 21
## 10 5 TRUE 32
</code></pre>
<pre><code class="language-r"># Check cross-tabulations of failures and high use
alc %>% group_by(high_use, failures) %>% summarise(count = n())
</code></pre>
<pre><code>## `summarise()` has grouped output by 'high_use'. You can override using the `.groups` argument.
</code></pre>
<pre><code>## # A tibble: 8 × 3
## # Groups: high_use [2]
## high_use failures count
## <lgl> <int> <int>
## 1 FALSE 0 238
## 2 FALSE 1 12
## 3 FALSE 2 8
## 4 FALSE 3 1
## 5 TRUE 0 87
## 6 TRUE 1 12
## 7 TRUE 2 9
## 8 TRUE 3 3
</code></pre>
<pre><code class="language-r"># Can't really say that the data lend support to my hypothesis here: there are very few failures in total and they seem to be somewhat evenly distributed between high and non-high users. However, considering the proportions of high vs. non-high users, there are proportionally considerably more non-high users with zero failures.
</code></pre>
<h2>5.</h2>
<pre><code class="language-r">m <- glm(high_use ~ famrel + G3 + goout + failures, data = alc, family = "binomial")
# view summary
summary(m)
</code></pre>
<pre><code>##
## Call:
## glm(formula = high_use ~ famrel + G3 + goout + failures, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8046 -0.7672 -0.5733 0.9947 2.3416
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68968 0.80944 -2.087 0.0368 *
## famrel -0.37266 0.13616 -2.737 0.0062 **
## G3 -0.02464 0.04099 -0.601 0.5478
## goout 0.75188 0.12022 6.254 4e-10 ***
## failures 0.48924 0.22963 2.131 0.0331 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.05 on 365 degrees of freedom
## AIC: 398.05
##
## Number of Fisher Scoring iterations: 4
</code></pre>
<p>Going out is the strongest predictor of high alcohol use (p < .001). Family relations also have a significance level of < .01. Failures are significant at p < .05.</p>
<p>Final grade does not have a significant correlation with high use.</p>
<p>In terms of my hypotheses, famrel, goout and failures are supported. G3 is not.</p>
<pre><code class="language-r"># compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
</code></pre>
<pre><code>## Waiting for profiling to be done...
</code></pre>
<pre><code class="language-r"># print out the odds ratios with their confidence intervals
cbind(OR, CI)
</code></pre>
<pre><code>## OR 2.5 % 97.5 %
## (Intercept) 0.1845781 0.03639237 0.8771040
## famrel 0.6888964 0.52587985 0.8985892
## G3 0.9756601 0.90036973 1.0578228
## goout 2.1209922 1.68623053 2.7043570
## failures 1.6310777 1.04644221 2.5883902
</code></pre>
<p>The odds ratio here is the change in the odds of high/low alcohol use.</p>
<hr />
<h1>Clustering and classification</h1>
<pre><code class="language-r">date()
</code></pre>
<pre><code>## [1] "Tue Dec 13 02:32:47 2022"
</code></pre>
<pre><code class="language-r">#library(GGally)
#library(dplyr)
library(MASS)
library(tidyr)
library(corrplot)
library(ggplot2)
data("Boston")
</code></pre>
<h2>2.</h2>
<p>The data set describes housing value in suburbs of Boston, Massachusetts. The data set has been put together and analyzed by Harrison & Rubinfeld (1978). The raw data has been collected in ~1970 (Harrison & Rubinfeld, 1978).</p>
<p>Listing of its variables are available <a href="https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html">here</a>, for instance. Definitions of how the variables have been collected and calculated can be found in Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.</p>
<pre><code class="language-r">str(Boston)
</code></pre>
<pre><code>## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
</code></pre>
<p>There are 506 observations and 14 variables. Each observation represents one neighborhood/town.</p>
<p>In summary, the variables include median values of homes, variables describing housing, infrastructural, environmental and industrial structure, and demographic information on the neighborhood (Harrison & Rubinfeld, 1978).</p>
<pre><code class="language-r">summary(Boston)
</code></pre>
<pre><code>## crim zn indus chas nox
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 Max. :0.8710
## rm age dis rad tax
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000 Min. :187.0
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000 Median :330.0
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549 Mean :408.2
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000 Max. :711.0
## ptratio black lstat medv
## Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
## 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
## Median :19.05 Median :391.44 Median :11.36 Median :21.20
## Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
## 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00
</code></pre>
<p>The median value of housing seems to be capped at $50.000, which looks like an artificial decision.</p>
<p>Let’s visualize the data set.</p>
<h2>3.</h2>
<pre><code class="language-r"># Create scatterplot matrix
pairs(Boston)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-22" /></p>
<h3>Commenting on median value plotted against other factors:</h3>
<p>Not surprisingly, higher-value areas have lower crime rates. The density of residential areas (var zn) and whether tract bounds river (chas [this is not surprising as this is a dummy variable anyway]) do not have a clear effect on housing value. Prices seem to be lower where industrial density is very high, but low industrial density has both high and low-value areas. The situation looks fairly similar with highly polluted areas (one might ask how independent nox and indus are). As average number of rooms in units (rm) grows, so does value – again, not surprisingly.
Newest areas are not cheap, but the priciest areas are not new. Both the cheapest and the most expensive area seem to have a short distance to Boston employment centers. Rad – index of accessibility to radial highways – seems to have a very binary distribution despite its values varying from 1 to 24. I am not sure how to interpret how tax and value are related (the highest taxes seem to be in cheapest areas, which seems odd from someone coming from a progressive tax state). Pupil-teacher ratio is generally lower in more valuable neighborhoods, as one would expect. Towns with a lower proportion of African-Americans are not among the most expensive areas. There is a high correlation between the price of housing and lower percentage of “lower-status” population (i.e., male workers classified as laborers and adults with no high school education).</p>
<pre><code class="language-r"># Look at correlations
corbos = cor(Boston)
corrplot(corbos)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-23" />
The strongest correlation seems to be between radial highway accessibility and property tax rate. Looking at the scatterplot matrix, there is a highly taxed highly accessible outlier, which may confound the results.
Distance and pollution have a strong negative correlation: far-away areas have lower pollution.
Nox and indus are also strongly correlated.</p>
<h2>4.</h2>
<pre><code class="language-r"># Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
boston_scaled <- scale(Boston)
summary(boston_scaled)
</code></pre>
<pre><code>## crim zn indus chas nox
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad tax
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819 Min. :-1.3127
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225 Median :-0.4642
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596 Max. : 1.7964
## ptratio black lstat medv
## Min. :-2.7047 Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.2746 Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 1.6372 Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
</code></pre>
<p>Columns are scaled by first subtracting column values by the corresponding column mean, and then dividing the difference by standard deviation. This way, the mean for all variables is at 0, and we can compare variables in a more meaningful way.</p>
<pre><code class="language-r">boston_scaled<-as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
</code></pre>
<pre><code>## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
</code></pre>
<pre><code class="language-r"># create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))
# look at the table of the new factor crime
table(crime)
</code></pre>
<pre><code>## crime
## low med_low med_high high
## 127 126 126 127
</code></pre>
<pre><code class="language-r"># remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
</code></pre>
<h2>5.</h2>
<pre><code class="language-r"># Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
</code></pre>
<pre><code>## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2623762 0.2301980 0.2450495
##
## Group means:
## zn indus chas nox rm age dis rad
## low 1.0077966 -0.9013763 -0.08661679 -0.8911379 0.45085821 -0.8903606 0.8851931 -0.6947544
## med_low -0.1320881 -0.2552932 -0.04947434 -0.5300626 -0.16156904 -0.2934666 0.3206352 -0.5409032
## med_high -0.4033301 0.2702447 0.19334945 0.4181606 0.01553335 0.4386564 -0.4093262 -0.4520945
## high -0.4872402 1.0171737 -0.07348562 1.0794376 -0.38004812 0.8082157 -0.8388567 1.6375616
## tax ptratio black lstat medv
## low -0.7195738 -0.44659523 0.37096288 -0.75369946 0.53649222
## med_low -0.4615823 -0.01301423 0.34853814 -0.11935320 -0.03567311
## med_high -0.3154313 -0.25660396 0.01487736 0.07564177 0.09078245
## high 1.5136504 0.78011702 -0.87620173 0.85077290 -0.74897074
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.088258721 0.65837316 -0.88806054
## indus 0.093040787 -0.25358910 0.21574553
## chas -0.110019905 -0.04817162 0.06112401
## nox 0.349907616 -0.74229932 -1.31086088
## rm -0.196730284 -0.04478351 -0.22896315
## age 0.170060613 -0.29252053 -0.10113863
## dis -0.026360903 -0.21139852 0.15396380
## rad 3.699395868 0.90128195 0.13610207
## tax -0.001425383 0.09251606 0.37290789
## ptratio 0.107397040 -0.03809865 -0.17966173
## black -0.123001318 0.06853286 0.22237406
## lstat 0.222069923 -0.25010923 0.29858207
## medv 0.223316232 -0.41990591 -0.23586695
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9572 0.0322 0.0106
</code></pre>
<pre><code class="language-r"># the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2.5)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-26" /></p>
<h2>6.</h2>
<pre><code class="language-r"># Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
</code></pre>
<pre><code class="language-r"># cross tabulate the results
results <- table(correct = correct_classes, predicted = lda.pred$class)
# calculate model precision & sensitivity
table(correct = correct_classes, predicted = lda.pred$class)
</code></pre>
<pre><code>## predicted
## correct low med_low med_high high
## low 15 5 1 0
## med_low 6 12 2 0
## med_high 0 15 15 3
## high 0 0 0 28
</code></pre>
<pre><code class="language-r">1-diag(results)/colSums(results) # Precision: fraction of false positives
</code></pre>
<pre><code>## low med_low med_high high
## 0.28571429 0.62500000 0.16666667 0.09677419
</code></pre>
<pre><code class="language-r">1-diag(results)/rowSums(results) # Sensitivity: fraction of false negatives
</code></pre>
<pre><code>## low med_low med_high high
## 0.2857143 0.4000000 0.5454545 0.0000000
</code></pre>
<p>The model is best at predicting high crime areas. No areas were misclassified as high, although one high area was falsely classified as med_high. The model had more difficulties with classes med_high and med_low; many med_high classes were incorrectly classified as med_low, making 45% of med_low classifications incorrect and greatly reducing the sensitivity of med_high classification. Areas with a low crime rate were better classified, with about 80% precision and 26% sensitivity.</p>
<h2>7.</h2>
<pre><code class="language-r"># Make another scaled Boston data set
boston_scaled <- scale(Boston)
# Distances between observations
summary(dist(boston_scaled))
</code></pre>
<pre><code>## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
</code></pre>
<p>An ideal grouping has a minimal within-group sum of squares (“WGSS”) over all variables. Let’s plot these for one to ten-group solutions to see which solution has the smallest WGSS.</p>
<pre><code class="language-r">set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
df<- data.frame(wgss=c(twcss),groups=1:k_max)
ggplot(data = df, aes(x=groups, y=wgss)) + geom_line() + scale_x_continuous(breaks = 1:10)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-30" />
There are two “elbows” in the plot, at 3 and 8 groups. Let’s investigate those further.</p>
<pre><code class="language-r">set.seed(123)
# k-means clustering
km3 <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km3$cluster)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-31" /></p>
<pre><code class="language-r"># k-means clustering
km8 <- kmeans(boston_scaled, centers = 8)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km8$cluster)
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-31" />
It’s difficult to point out any single variable based on which the groups have been clustered (which is a good sign, I suppose). It’s likely that km3 makes a better model of the data as groups 4-7 actually make the model worse; km8 might just overfit the data by dividing groups into unnecessary subgroups.</p>
<hr />
<h1>Dimensionality reduction techniques</h1>
<pre><code class="language-r">date()
</code></pre>
<pre><code>## [1] "Tue Dec 13 02:32:54 2022"
</code></pre>
<pre><code class="language-r">library(GGally)
library(dplyr)
#library(MASS)
library(tidyr)
library(corrplot)
library(ggplot2)
library(FactoMineR)
human <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.txt", sep =",", header = TRUE)
</code></pre>
<h2>1.</h2>
<pre><code class="language-r"># Note: the variables and data set are described in create_human.R
summary(human)
</code></pre>
<pre><code>## Edu2.FM Labo.FM Edu.Exp Life.Exp GNI
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00 Min. : 581
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20 Median : 12040
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65 Mean : 17628
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50 Max. :123124
## Mat.Mor Ado.Birth Parli.F
## Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 49.0 Median : 33.60 Median :19.30
## Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :1100.0 Max. :204.80 Max. :57.50
</code></pre>
<pre><code class="language-r">corhum <- cor(human)
corrplot(corhum, type="upper", col=COL2(diverging="PuOr"))
</code></pre>
<p><img src="" alt="plot of chunk unnamed-chunk-33" /></p>
<pre><code class="language-r">#human_scaled <- scale(human)
</code></pre>
<p>Life expectancy and maternal mortality make the strongest negative correlation, which is not surprising because an early maternal death brings life expectancy down. Adolescent birth rate and maternal mortality are, likewise, strongly correlated. Expected years of schooling strongly correlates with life expectancy. While strong, the correlation between share of population with at least secondary education and expected years of education is not among the highest. Perhaps education caps at some stage.</p>
<pre><code class="language-r">ggpairs(human)
</code></pre>
<pre><code>## plot: [1,1] [>--------------------------------------------------------------------] 2% est: 0s
## plot: [1,2] [=>-------------------------------------------------------------------] 3% est: 1s
## plot: [1,3] [==>------------------------------------------------------------------] 5% est: 2s
## plot: [1,4] [===>-----------------------------------------------------------------] 6% est: 2s
## plot: [1,5] [====>----------------------------------------------------------------] 8% est: 2s
## plot: [1,6] [=====>---------------------------------------------------------------] 9% est: 2s
## plot: [1,7] [=======>-------------------------------------------------------------] 11% est: 2s
## plot: [1,8] [========>------------------------------------------------------------] 12% est: 2s
## plot: [2,1] [=========>-----------------------------------------------------------] 14% est: 2s
## plot: [2,2] [==========>----------------------------------------------------------] 16% est: 2s
## plot: [2,3] [===========>---------------------------------------------------------] 17% est: 2s