---
title: R for Web Analysts
css: "/css/knitr.css"
---
<h1>R for web analysts</h1>
<p>An introduction and guide to the R programming language for web analysts</p>
<p>If you are new to programming then take your time reading. There are lots
of new concepts, so don't expect to be able to skim through and still get
good value for your time.</p>
<h4>About this document</h4>
<p>Last updated on <!--rinline format(Sys.time(), "%A %d %B %Y") --></p>
<p>This document is an extension and improvement of the course materials
I originally prepared for the <a href="http://www.measurecamp.org/">MeasureCamp</a>
training session on using R.
</p>
<p>The goal of the document (and the training session) is to help web
analysts get to grips with coding using R and enable them to solve the kind
of problems that web analysts try to solve.</p>
<p>Throughout the document, code samples are presented in the following format:</p>
<!--begin.rcode message=FALSE, prompt=TRUE
2+2
end.rcode -->
<p>First the R code which can be copied or typed into the console (if you don't know
what the console is - this will be explained later) and then the output you can expect
to see when it is run.</p>
<p>There will be some code samples where no output is presented. This is either because
when the code is run there is no output to the console or because I have chosen to not
make the results of evaluating that code public.</p>
<p>I highly recommend running all the sample code you come across.</p>
<p>This document is available under a <a href="https://creativecommons.org/licenses/by/2.0/uk/">CC Attribution License</a>.
</p>
<h4>About me</h4>
<p>A little bit of context about me might help you decide whether or not my advice is
worth listening to. If your background, experiences and desired outcomes are completely
different from mine then take this document with the pinch of salt it deserves.</p>
<h5>Web analyst background</h5>
<p>I started my career working in PPC and began to make an effort to learn more about web
analytics after gaps and errors in client implementations led me to take actions that
appeared awesome but which were actually harmful.</p>
<p>My favourite example of this is when I sent tonnes of traffic from Nigeria at a lead
generation form - the form got filled in a lot (which is what was being tracked) but
the client did not make very much money from this.</p>
<p>My experience ranges across a wide range of sectors but it is very narrow when it comes
to tools. Almost all (95%+) of my work is done using Google Analytics.</p>
<h5>Programming background</h5>
<p>I studied Maths at university which, with the courses I chose, involved about 8-12 hours
of programming tuition using the languages Maple and Matlab. At the end of university I
could not use programming to produce useful business outcomes. However, I could produce a
sweet animated gif showing how a Fourier series converges.</p>
<p>It wasn't long after entering the professional world that I began to think that learning
to code might be useful. This was probably at the time when the
"everyone must learn to code" movement was getting started.</p>
<p>I began learning a language called <a href="https://www.haskell.org/">Haskell</a> because
I had studied a small amount of category theory at university which meant that I got some
of the jokes about how difficult Haskell is to learn. Those of you who know me well will
know that picking the difficult choice like this is fairly typical behaviour for me.</p>
<p>If you want to learn more about computing and maths and aren't in a massive rush to get
something useful done then I really recommend you have a good look at Haskell. I believe
that one day Haskell, or a language inspired by Haskell, will take over the world but that
day will not be soon and right now Haskell is shit for the kind of tasks web analysts typically
do.</p>
<p>The first language is the hardest to learn so once I had a bit of basic coding knowledge I
was able to start dabbling in Python and then later R. With the sort of tasks I do right now
most of my coding is in R.</p>
<h4>Why learn R?</h4>
<p>I often get asked "should I learn R?" type questions. The answer is <strong>not</strong>
an emphatic "YES" and the trade offs to consider are complicated.</p>
<h5>Learning R for analysts who are first time coders</h5>
<p>For an analyst who has no experience with programming the "should I learn R?" question
is actually two questions:
<ol>
<li>Should I learn to code?</li>
<li>Should the first language I learn be R?</li>
</ol>
</p>
<p>The answer to the first question <strong>is</strong> an emphatic yes because being able to
program enables you to use more powerful tools than you would use otherwise.</p>
<p>There is a significant opportunity cost to learning to code, especially to start with, when
doing simple things in code takes way longer than doing them in Excel. But the payoff is huge,
especially when you consider that you have the whole of the rest of your career to recoup the
investment.</p>
<p>I am going to reiterate that point because it is very important. To start with, everything will
take longer and you won't be able to do anything cool. The stuff you start off coding will be the
type of thing that would take you less than a minute in Excel. It is worth it to spend the time to
build this foundation. This is not specific to R - you will have this problem with any programming
language.</p>
<p>R is good for doing analytical tasks so it is appropriate for analysts to learn as a first
language. The code samples in this document will hopefully help speed you over the
"I can do this 10x quicker in Excel" gap.</p>
<h5>Learning R for analysts who can code in something else already</h5>
<p>Once you ignore the fact that learning new things generally makes you a better person, the case for
learning R over another language that you already know is much weaker. For me, the main strengths of R
compared to other languages I have used are:</p>
<ol>
<li>The library and language support for the type of tasks that analysts do is excellent</li>
<li>Libraries implementing cutting edge techniques straight out of academia are more likely
to be available in R than elsewhere. This sounds really cool, but if you are the sort of person
where this is <em>actually</em> a really useful feature then this guide is far too basic
for you.</li>
<li>The support for charting and visualisation is extremely good</li>
</ol>
<p>In all of these areas, particularly 1 and 3, Python is rapidly catching up with R. So if the language
you already know is Python then knowing R too will not increase the amount of cool stuff you can do by
as much as if the language you already know is PHP.</p>
<h5>Other points</h5>
<p>Another thing to think about is the blurring of the boundary between analysis and the operations
that implement the insight. It is increasingly common for a hybrid coder/analyst to define behaviours
that identify a user segment <em>and</em> to code site functionality that does something special
with this user segment. If this kind of work is your end goal and you only have time to learn one
programming language then first you should try to remove your "one programming language" constraint
and then if that doesn't work you should learn Python.</p>
<p>Web analysts frequently have responsibility for site tagging too, so it is often necessary to know
JavaScript as well.</p>
<p>The following table has a rough and ready comparison of R, Python and JavaScript for web analytical tasks:</p>
<table>
<thead>
<tr>
<th>Task</th>
<th>R</th>
<th>Python</th>
<th>JavaScript</th>
</tr>
</thead>
<tbody>
<tr>
<td>Site tagging</td>
<td>0/5</td>
<td>0/5</td>
<td>5/5</td>
</tr>
<tr>
<td>Data analysis</td>
<td>5/5</td>
<td>3.5/5 (and rapidly improving)</td>
<td>I used to say 1/5, but I haven't kept up with developments here so no idea!</td>
</tr>
<tr>
<td>Presentation and charting</td>
<td>4/5</td>
<td>2/5</td>
<td>3/5 to 5/5 depending on if you can use d3.js or not</td>
</tr>
<tr>
<td>Building website features</td>
<td>2/5</td>
<td>5/5</td>
<td>5/5</td>
</tr>
</tbody>
</table>
<p>Perhaps I will also have to do a guide on Python!</p>
<h4>Getting Started</h4>
<h5>Installing the necessary programs</h5>
<p>Firstly install R from one of these pages:
<ul>
<li><a href="http://cran.r-project.org/bin/windows/base/">Windows</a></li>
<li><a href="http://cran.r-project.org/bin/macosx">Mac</a></li>
<li>Linux users can get it from your distribution's repositories - I'm not going to hold your hand
for this</li>
</ul>
</p>
<p>That is all you <em>need</em> to do but you will probably find things way easier if you also install
a piece of software called RStudio. RStudio sits on top of R and presents a much less intimidating
user interface. I highly recommend it for beginners.</p>
<p>Install RStudio from <a href="http://www.rstudio.com/products/rstudio/download/">here</a>.
The rest of this tutorial will assume you are using RStudio.
</p>
<p>Both R and RStudio are free software both in terms of "free as in free beer" and "free as in freedom".
There is much to be said for and against the use of free software in business but my view is that for
things like programming tools free software completely dominates the non free alternatives.</p>
<p>On opening RStudio you should see a pane on the left called the console. You enter R commands in this pane
with the cursor after the ">" character. Press enter to run the command.</p>
<!--begin.rcode message=FALSE, prompt=TRUE
2+15
end.rcode-->
<p>If you get the answer 17 when you type "2+15" and then press enter then things have installed correctly.
You have got over the first hurdle. Well done.</p>
<h5>The very basics</h5>
<p>You have just seen (and replicated for yourself) an example of adding two numbers together. The other
foundational arithmetic functions work in similar ways. Try these examples out and test out your own.</p>
<!--begin.rcode prompt=TRUE
3-21.5
end.rcode-->
<!--begin.rcode prompt=TRUE
2*5
end.rcode-->
<!--begin.rcode prompt=TRUE
9/5
end.rcode-->
<p>The next set of examples is similar but slightly more complicated. I've included comments in the code
to make it easier to understand. A comment is a line that starts with a "#" character. A comment line
does nothing to change the behaviour of a program - it is only for a person reading it.</p>
<!--begin.rcode prompt=TRUE
# This line is a comment
# Comments don't return any output
end.rcode-->
<p>Note that there is no output from the above code block. On with the examples</p>
<!--begin.rcode prompt=TRUE
2 + (5*5)
end.rcode-->
<!--begin.rcode prompt=TRUE
2*pi # pi is the number pi (3.14159...)
end.rcode-->
<p>Not every arithmetical operation makes sense</p>
<!--begin.rcode prompt=TRUE
1/0
end.rcode-->
<!--begin.rcode prompt=TRUE
# Dividing the number 5 by the letter 'a'
# This makes no sense!
5/'a'
end.rcode-->
<p>In this error message the "binary operator" is division. It is a binary operation because it takes
two numbers as input - the numerator and the denominator. The "non-numeric argument" is the letter
'a'.</p>
<p>Ok, now you can throw away your pocket calculator or abacus and use R instead.</p>
<p>It is far more useful to do things to a whole series of numbers rather than one or two numbers at a
time. R has <em>excellent</em> support for this.</p>
<p>
There are two main representations of a series of elements:
<ol>
<li>A vector - all the elements must have the same type e.g. all numbers or all text</li>
<li>A list - the elements can have different types</li>
</ol>
Unless specifically mentioned otherwise you will use vectors in this tutorial.
</p>
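<p>To make the difference between the two concrete, here is a small sketch (the values are made up for
illustration). Putting a number and some text into a vector forces R to convert them to one common type,
whereas a list keeps each element as it is.</p>
<!--begin.rcode prompt=TRUE
# a vector has one type, so the number 1 is converted to the text "1"
c(1, "two")
# a list keeps the number as a number and the text as text
list(1, "two")
end.rcode-->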
<p>Create a vector with more than one element like this:</p>
<!--begin.rcode prompt=TRUE
c("This","is","a","vector","of","words")
end.rcode-->
<p>There are some functions that operate on a whole vector and return a single answer. I call these
<em>aggregate functions</em>. There will be more about functions in general later.</p>
<!--begin.rcode prompt=TRUE
# get the length of a vector (the number of elements)
length(c("This","is","a","vector","of","words"))
end.rcode-->
<!--begin.rcode prompt=TRUE
# add up all the elements in a vector
sum(c(1,2,3,4,5))
end.rcode-->
<!--begin.rcode prompt=TRUE
sum(c("This","is","a","vector","of","words"))
end.rcode-->
<p>Other functions operate on each element of the vector and return another vector as the result</p>
<!--begin.rcode prompt=TRUE
# make everything uppercase
toupper(c("This","is","a","vector","of","words"))
end.rcode-->
<!--begin.rcode prompt=TRUE
# adds 2 to each element of the vector
c(1,2,3,4,5) + 2
end.rcode-->
<p>In fact, when we were doing simple arithmetic before, we weren't just adding numbers together;
we were adding together vectors of numbers - but the vectors only had one element</p>
<!--begin.rcode prompt=TRUE
length(5)
end.rcode-->
<!--begin.rcode prompt=TRUE
# use '==' to check if two things are equal
"word" == c("word")
end.rcode-->
<p>This means we can do things like this:</p>
<!--begin.rcode prompt=TRUE
c(1,2,3) + c(pi,15,1/0)
end.rcode-->
<p>Or this:</p>
<!--begin.rcode
c(1,2,3,4) * c(1,2)
end.rcode-->
<p>In the above example, the shorter vector is duplicated until it is the same length as the longer
vector.</p>
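<p>One way to picture this is with the <em>rep</em> function, which repeats a vector. This is only a sketch
of the idea, not a description of how R does it internally.</p>
<!--begin.rcode prompt=TRUE
# repeat c(1,2) until it has four elements
rep(c(1,2), length.out=4)
# multiplying by the repeated vector gives the same result as c(1,2,3,4) * c(1,2)
c(1,2,3,4) * rep(c(1,2), length.out=4)
end.rcode-->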
<p>As always, there are some things we can try to do that just don't make sense</p>
<!--begin.rcode prompt=TRUE
c(1,2,3,4) + c(1,2,3)
end.rcode-->
<p>As you can see, you get a warning because the length of the longer vector is not a multiple of the length of the shorter one, so R isn't quite sure if you want to be doing this or not.</p>
<h5>Variables</h5>
<p>You can assign a particular result to a variable to make it easier to refer to later</p>
<p>In R people generally use "<-" to assign to a variable, but you might also see code where people
use "=". There are some subtle differences between the two methods but this is not
important right now. In this document we will use "<-".</p>
<p>This code block assigns a vector to the variable "myvector"</p>
<!--begin.rcode prompt=TRUE
myvector <- c("This","is","a","vector","of","words")
end.rcode-->
<p>You can see the value of a variable by entering it into the console</p>
<!--begin.rcode prompt=TRUE
myvector
end.rcode-->
<p>We can use all the normal vector functions on our variable as if we had typed the whole thing out
each time</p>
<!--begin.rcode prompt=TRUE
length(myvector)
end.rcode-->
<!--begin.rcode prompt=TRUE
toupper(myvector)
end.rcode-->
<p>You can also overwrite a variable by reusing the name when you assign something else</p>
<!--begin.rcode prompt=TRUE
myvector <- c(1,2,3,4)
myvector
end.rcode-->
<!--begin.rcode prompt=TRUE
sum(myvector)
end.rcode-->
<p>It will help you out a lot if you use variable names that semantically link with what they
reference. For example call a vector of daily sessions "sessions" rather than "vector".</p>
<h5>Functions</h5>
<p>You have already seen some functions like "sum" and "length".</p>
<p>To get more information on what a function does type a question mark followed by the name
of the function into the console.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# help with the sum function
?sum
end.rcode-->
<p>You don't need to be able to understand all of a function's help file to use the function.</p>
<p>Some functions have optional arguments. Here is an example:</p>
<!--begin.rcode prompt=TRUE
# create a variable called "x"
# the "NA" is used for a blank or missing value
# it is quite common to have these
x <- c(1,2,3,4,NA)
sum(x)
end.rcode-->
<p>Summing over a vector that contains an NA returns NA as the result. This makes sense when you
think that it is meaningless to add NA to a number.</p>
<!--begin.rcode prompt=TRUE
sum(x, na.rm=TRUE)
end.rcode-->
<p>Instead we use the optional argument "na.rm=TRUE" to tell the sum function to ignore the NA
values.</p>
<p>Use the function help to see a list of function arguments and what they mean (e.g. "?sum").</p>
<h6>Making your own functions</h6>
<p>Creating your own functions is a great way to save time when doing repetitive tasks.</p>
<p>Functions are created using the <em>function</em> function (!!?!). Here is a very simple example
function that takes no arguments and always returns the answer 42.</p>
<!--begin.rcode prompt=TRUE
theAnswer <- function() {
  return(42)
}
theAnswer()
end.rcode-->
<p>When copying this code into RStudio, don't copy the "+" symbols that appear at the start of
each line - they show that R is expecting more input before it starts computing. This means
you don't always have to fit all your commands on one line.</p>
<p>Here is an example that takes two arguments and adds them together:</p>
<!--begin.rcode prompt=TRUE
adder <- function(x,y) {
  return(x+y)
}
adder(32,2454)
end.rcode-->
<p>The final example inserts an element at the start of a vector:</p>
<!--begin.rcode prompt=TRUE
addToStartOfVector <- function(element,oldvector) {
  newvector <- c(element,oldvector)
  return(newvector)
}
addToStartOfVector(5,c(1,2,3,4))
end.rcode-->
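<p>You can also give your own functions optional arguments, like the "na.rm" argument of <em>sum</em>.
The following is a sketch; the function name "addTax" and the default rate of 0.2 are made up for
illustration.</p>
<!--begin.rcode prompt=TRUE
addTax <- function(amount, rate=0.2) {
  # rate is optional - it is 0.2 unless the caller supplies something else
  return(amount * (1 + rate))
}
# uses the default rate
addTax(100)
# overrides the default rate
addTax(100, rate=0.05)
end.rcode-->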
<h5>Data frames</h5>
<p>So far we have looked at vectors which contain one dimensional data. Far more common is to
have a table of two dimensional data. The most common way of working with this kind of structure
in R is called a <em>data frame</em>.</p>
<!--begin.rcode prompt=TRUE
foods <- data.frame(meal=c("Breakfast","Breakfast","Breakfast","Lunch","Dinner"),
                    food=c("Bacon","Sausage","Beans","Pork Pie","Ravioli"),
                    amount=c(2,2,387,1,35)
                    )
foods
end.rcode-->
<p>A data frame is a list of vectors. Recall that a list is a series of elements not necessarily
of the same type and you see why a data frame has to be a list of vectors rather than a vector of
vectors.</p>
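<p>You can check this for yourself. The sketch below builds a small data frame of made up page statistics
(so it does not depend on anything defined earlier) and asks R what it is. The <em>str</em> function prints
a compact summary of the structure, including the type of each column.</p>
<!--begin.rcode prompt=TRUE
pagestats <- data.frame(page=c("/home","/about","/contact"),
                        sessions=c(120,45,8))
# a data frame is a list
is.list(pagestats)
# with one element per column
length(pagestats)
# each column is a vector with its own type
str(pagestats)
end.rcode-->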
<p>You can see that when creating your own data frame the columns are just vectors with a column name.
Use the <em>names</em> function to see the column names for a data frame.</p>
<!--begin.rcode prompt=TRUE
names(foods)
end.rcode-->
<p>Use the column numbers and row numbers to select rows, columns and elements from a data frame.</p>
<!--begin.rcode prompt=TRUE
# select the third row
foods[3,]
end.rcode-->
<!--begin.rcode prompt=TRUE
# select the first column
foods[,1]
end.rcode-->
<!--begin.rcode prompt=TRUE
# see the number of beans consumed
foods[3,3]
end.rcode-->
<p>An easier way to select a column is like this:</p>
<!--begin.rcode prompt=TRUE
foods$meal
end.rcode-->
<p>Using a "$" followed by the name of the column is much easier to understand than trying to remember
which is the eighth column when you are reading old code.</p>
<p>You can also select a column like this:</p>
<!--begin.rcode prompt=TRUE
foods[['meal']]
end.rcode-->
<p>Use this method rather than the "$" method when:
<ol>
<li>Your column names are reserved words in R. For example if you had a column called "NA".</li>
<li>You are constructing the column name programmatically. For example if the column name is one of the inputs to a function.</li>
</ol>
</p>
<!--begin.rcode prompt=TRUE
getColumn <- function(data, columnName) {
  ## data$columnName won't work
  return(data[[columnName]])
}
getColumn(foods,"meal")
end.rcode-->
<p>Once you have selected a column it behaves just like a vector:</p>
<!--begin.rcode prompt=TRUE
# use "?mean" if you don't know what this function does
mean(foods$amount)
end.rcode-->
<p>You can easily add a new column to a data frame in a similar way to how you assign a variable.</p>
<!--begin.rcode prompt=TRUE
foods$cumulativeAmount <- cumsum(foods$amount)
# what does cumsum do?
# how do you find this information out?
foods
end.rcode-->
<p>Filter rows in a data frame like this:</p>
<!--begin.rcode prompt=TRUE
# note the position of the comma - easy to miss!
foods[foods$amount > 5,]
end.rcode-->
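<p>The filter works because the condition is itself a vector of TRUE and FALSE values, one per row, and R
keeps the rows where the value is TRUE. Here is a small sketch using made up data:</p>
<!--begin.rcode prompt=TRUE
pagestats <- data.frame(page=c("/home","/about","/contact"),
                        sessions=c(120,45,8))
# the condition on its own returns one TRUE or FALSE per row
pagestats$sessions > 40
# putting it before the comma keeps only the TRUE rows
pagestats[pagestats$sessions > 40,]
end.rcode-->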
<p>Filters can be combined using "&" for AND and "|" for OR.</p>
<!--begin.rcode prompt=TRUE
foods[foods$amount > 5 & foods$meal == "Breakfast",]
end.rcode-->
<!--begin.rcode prompt=TRUE
foods[foods$amount > 5 | foods$meal == "Breakfast",]
end.rcode-->
<p>To see the number of rows in a data frame use the function "nrow" rather than length.
Using length returns the length of the containing list rather than the length of one
of the vectors that contains the data in a column.
</p>
<!--begin.rcode prompt=TRUE
nrow(foods)
end.rcode-->
<p>There is also a function called "ncol" for the number of columns.</p>
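<p>Here is a short sketch comparing these functions on a made up data frame. The <em>dim</em> function,
not mentioned above, returns the number of rows and the number of columns together.</p>
<!--begin.rcode prompt=TRUE
pagestats <- data.frame(page=c("/home","/about","/contact"),
                        sessions=c(120,45,8))
nrow(pagestats)
ncol(pagestats)
# length counts the columns because a data frame is a list of columns
length(pagestats)
# rows first, then columns
dim(pagestats)
end.rcode-->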
<h5>SQLDF</h5>
<p>If you already have mad SQL skills (and non-digital analysts frequently do) then these methods
of manipulating data frames can be a bit tedious.</p>
<p>Fortunately there is a library called "sqldf" that means you can use your knowledge of SQL when
manipulating a data frame.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# first install the package
install.packages("sqldf")
end.rcode-->
<!--begin.rcode prompt=TRUE
# then load the package
library(sqldf)
end.rcode-->
<p>Now we can run SQL queries against our food data frame:</p>
<!--begin.rcode prompt=TRUE
sqldf("SELECT * FROM foods WHERE amount > 5 OR meal = 'Breakfast'")
end.rcode-->
<h5>Exercises</h5>
<p>R comes with several open data sets built in. For these exercises we will use the "mtcars" data set
which contains some statistics about certain makes/models of car.
Here is how to load it up:</p>
<!--begin.rcode prompt=TRUE
data(mtcars)
# now you have a data frame called "mtcars" with the data
head(mtcars)
end.rcode-->
<p>You can find out more about the data by typing "?mtcars" into the console.</p>
<ol>
<li>What is the mean mpg across all the cars?
<div class="answer">
<!--begin.rcode prompt=TRUE
mean(mtcars$mpg)
end.rcode-->
</div></li>
<li>What is the maximum number of cylinders seen in the data?
<div class="answer">
<!--begin.rcode prompt=TRUE
max(mtcars$cyl)
end.rcode-->
</div></li>
<li>What is the difference in the mean mpg between cars with fewer than 5 cylinders and those with 5
or more cylinders?
<div class="answer">
<!--begin.rcode prompt=TRUE
fewcylinders <- mtcars[mtcars$cyl < 5,]
manycylinders <- mtcars[mtcars$cyl >= 5,]
difference <- mean(manycylinders$mpg) - mean(fewcylinders$mpg)
abs(difference)
end.rcode-->
</div></li>
<li>What proportion of the tested cars have an automatic transmission?
<div class="answer">
<!--begin.rcode prompt=TRUE
automatic <- mtcars[mtcars$am == 0,]
nrow(automatic) / nrow(mtcars)
end.rcode-->
</div></li>
<li>Which car has the fastest quarter mile time?
<div class="answer">
<!--begin.rcode prompt=TRUE
# order the data
fastest <- mtcars[order(mtcars$qsec),]
# get the car name from the first row
row.names(fastest)[1]
end.rcode-->
</div></li>
<li>Write a function which takes an mpg value as input and returns a vector of all cars
that have a better mpg than the input value.
<div class="answer">
<!--begin.rcode prompt=TRUE
betterMpg <- function(targetmpg) {
  filteredcars <- mtcars[mtcars$mpg > targetmpg,]
  return( row.names(filteredcars) )
}
betterMpg(30)
end.rcode-->
</div></li>
</ol>
<h4>Importing data</h4>
<p>This section covers importing data into R through two methods:
<ol>
<li>CSV files</li>
<li>The Google Analytics API</li>
</ol>
</p>
<p>The first is by far the easiest so we will start with that.</p>
<h5>Importing from CSV files</h5>
<p>There is a simple function for importing from CSV files called "read.csv". The CSV file
is converted into a data frame when it is loaded into R.</p>
<!--begin.rcode prompt=TRUE
# yes - it works with urls and local files
hospitaladmissions <- read.csv("http://www.eanalytica.com/files/UK-Admissions.csv")
head(hospitaladmissions)
end.rcode-->
<p>Use "?read.csv" to see more information about this function. There are many optional
arguments that enable you to work with things like tab separated files or files without
proper column headers.</p>
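<p>As a sketch of two of those optional arguments, the example below reads tab separated data that has no
header row. To keep it self-contained the data is passed in through the "text" argument instead of a file;
the values and column names are made up.</p>
<!--begin.rcode prompt=TRUE
rawdata <- "/home\t120\n/about\t45\n/contact\t8"
# sep="\t" means the columns are separated by tabs
# header=FALSE means the first line is data, not column names
pagestats <- read.csv(text=rawdata,
                      sep="\t",
                      header=FALSE,
                      col.names=c("page","sessions"))
pagestats
end.rcode-->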
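<p>You can write CSV files using the function "write.csv". This function always uses a comma as the
separator, so to write a tab separated file use the more general "write.table" instead. The following
example writes the mtcars data frame to a tab separated file.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# \t means tab
# col.names=NA adds a blank header above the column of row names
write.table(mtcars, file="/tmp/mtcars.tsv", sep="\t", col.names=NA)
end.rcode-->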
<p>You should change the file argument to save the data to a location of your choice.</p>
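<p>If you just want to try this out without choosing a location, the <em>tempfile</em> function gives you
a throwaway file name. The sketch below writes a tab separated file with <em>write.table</em> and reads it
straight back in with <em>read.delim</em> to check that nothing was lost.</p>
<!--begin.rcode prompt=TRUE
data(mtcars)
tmp <- tempfile(fileext=".tsv")
write.table(mtcars, file=tmp, sep="\t")
# read.delim works like read.csv but expects tabs between columns
roundtrip <- read.delim(tmp)
nrow(roundtrip) == nrow(mtcars)
end.rcode-->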
<h5>Google Analytics</h5>
<p>This is where things get tricky and a lot of people start to get errors which they
can't find a way past. Take your time and proceed with close attention to detail.</p>
<p>There are four parts to getting this working:
<ol>
<li>Generating OAuth client ID and client secret to enable you to access the API</li>
<li>Installing a relevant R library</li>
<li>Generating an authorisation token using the library</li>
<li>Finally, getting some data out of Google Analytics</li>
</ol>
</p>
<p>You will need a Google Analytics account with some data to do all of this.</p>
<h6>Generating the client ID and client secret</h6>
<p>The library we will be using to access the API comes with a default token, but it is better to create your own. That way you won't use up the quota on the default token, and your code won't break because someone else has used up all the quota.</p>
<p>Follow these instructions precisely.</p>
<p>Common mistakes to be avoided include the following:
<ol>
<li>Reusing the client id and client secret from an old project of the wrong type.
It is safest to generate a fresh project for use with R.</li>
<li>Generating the wrong type of credentials.</li>
<li>Forgetting to enable the Analytics API.</li>
</ol>
</p>
<p>Here are the instructions. The design of the Google developer console changes rapidly; hopefully this is still accurate.</p>
<p>Let me know any parts that are particularly unclear and
I will illustrate with screenshots.</p>
<ol>
<li>Go to <a href="https://console.developers.google.com/project" target="_blank">https://console.developers.google.com/project</a></li>
<li>Click "create project"</li>
<li>The project name and the project id are not very important, but you should name
it something meaningful so you don't accidentally delete it at a later date. Ignore
the advanced options and click "create".</li>
<li>Wait for the project to be created. When it is created you are automatically redirected to the project page</li>
<li>Search for "analytics" and enable the "Analytics API" by selecting it from the list and clicking "Enable".</li>
<li>Click "Go to credentials"</li>
<li>Select "Other UI (e.g. Windows, CLI tool)" in the "Where will you be calling the API from?" dropdown</li>
<li>Check "User data" for the "What data will you be accessing" field. And click "What credentials do I need?" to proceed.</li>
<li>Create a client id - the default of "Other client 1" is fine.</li>
<li>You will also have to fill in some details for the OAuth consent screen. This is shown to people when they authorise your app. Only the project name and email address fields are compulsory.</li>
<li>Eventually you will be shown your Client ID. You need the Client secret as well. If you don't see the secret, click "Done" and navigate to "Credentials".</li>
<li>Click your Client ID in the list and you should see the Client ID and Client Secret.</li>
</ol>
<p>You should now see your client ID and client secret on the screen. If you also see
things like "email address" and "javascript origins" then you have generated the wrong
type of client ID and should start again from step seven.</p>
<p>Store the client id and client secret in some R variables.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
clientid <- "YOUR CLIENT ID"
clientsecret <- "YOUR CLIENT SECRET"
end.rcode-->
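<p>If you share your scripts or keep them in version control you may prefer not to type the secret into
the code. One possible approach, sketched below, is to read the values from environment variables. The
variable names "GA_CLIENT_ID" and "GA_CLIENT_SECRET" are made up for this example; they are not something
the library requires.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# Sys.getenv returns "" if the environment variable has not been set
clientid <- Sys.getenv("GA_CLIENT_ID")
clientsecret <- Sys.getenv("GA_CLIENT_SECRET")
if (clientid == "" || clientsecret == "") {
  warning("Set GA_CLIENT_ID and GA_CLIENT_SECRET before continuing")
}
end.rcode-->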
<h6>Installing R libraries</h6>
<p>An R library is a collection of functions written by someone else that they have made available
for you to use. There is a centralised collection of vetted R libraries called <a href="http://cran.r-project.org/web/packages/available_packages_by_date.html">CRAN</a>. Libraries which are available on CRAN are very easy to install.</p>
<p>The function to install a CRAN library is <em>install.packages</em>. To start with, install
the library "Rcpp" - not having an up to date version of this library causes some people
problems later in the process and installing the latest version now does no harm and can
prevent errors later on.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
install.packages("Rcpp")
end.rcode-->
<p>If you get errors about a lack of file system permissions doing this then you will have to
ask your IT department to install libraries for you.</p>
<p>There are four main choices of libraries to work with Google Analytics in R:
<ol>
<li><a href="http://cran.r-project.org/web/packages/RGA/index.html">RGA</a></li>
<li><a href="http://cran.r-project.org/web/packages/RGoogleAnalytics/">RGoogleAnalytics</a> - this
is supported by <a href="http://www.tatvic.com/">Tatvic</a></li>
<li><a href="https://github.com/skardhamar/rga">rga</a> - not available on CRAN so slightly harder
to install. There are some API techniques to reduce how often sampled data is returned; this
library makes these easier to use than the others.</li>
<li><a href="http://code.markedmondson.me/googleAnalyticsR/">googleAnalyticsR</a> - this is the option we will use here</li>
</ol>
<p>All the libraries are similar in both how they operate and what you can do with them.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
install.packages("googleAnalyticsR")
end.rcode-->
<h6>Generating a token</h6>
<p>In this section you will authorise R to access Google Analytics data and create a token file
which saves the details. This means you will not have to authorise every time and it enables
you to automate things to run on a server; just make sure the token file is on the server.</p>
<p>First load the library into R using the <em>library</em> function. Sometimes you will see
people use the function <em>require</em> instead. The difference between them is beyond the
scope of this tutorial.</p>
<!--begin.rcode prompt=TRUE
library(googleAnalyticsR)
end.rcode-->
<p>First we will use the default token just to check that everything is working.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
ga_auth()
end.rcode-->
<p>Allow the library access to Google Analytics when prompted. This creates a file called .httr-oauth in your working directory which contains your credentials. You don't need to do anything with this file; just be aware that it is there.</p>
<p>Now list your accounts to check everything is working:</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
ga_account_list()
end.rcode-->
<p>You should see a list of accounts that you have access to. This is how you find the view id (important later).</p>
<p>Now how do we do this with our own token?</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
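# credentials of the "Other" type use these option names;
# the googleAuthR.webapp.* options are meant for web apps such as Shiny
options(googleAuthR.client_id = clientid)
options(googleAuthR.client_secret = clientsecret)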
ga_auth()
end.rcode-->
<p>You should be able to get the list of accounts again:</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
ga_account_list()
end.rcode-->
<h6>Get some data</h6>
<p>The first thing to do is to figure out the view ID of the Google Analytics view you want.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
viewid <- "ID NUMBER"
end.rcode-->
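<p>One way to look the view ID up is to filter the data frame returned by
<em>ga_account_list</em>. The sketch below assumes that data frame has columns called
<em>viewName</em> and <em>viewId</em> (check with <em>names()</em> on your own result). The
helper <em>find_view_id</em> and the view name "All Web Site Data" are made up for this
example.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# find_view_id is a hypothetical helper, not part of googleAnalyticsR.
# accounts: a data frame with (assumed) columns viewName and viewId
find_view_id <- function(accounts, view_name) {
  matches <- accounts$viewId[accounts$viewName == view_name]
  # refuse to guess if the name is missing or ambiguous
  if (length(matches) != 1) {
    stop("Expected exactly one view called '", view_name,
         "' but found ", length(matches))
  }
  as.character(matches)
}

# with real data you would do something like:
# accounts <- ga_account_list()
# viewid <- find_view_id(accounts, "All Web Site Data")
end.rcode-->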
<p>We are almost ready to grab some data. But first we will install another library that makes it
easier to work with dates.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
install.packages("lubridate")
end.rcode-->
<p>Here we go!</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
library(lubridate)
yesterday <- today() - days(1)
twoyearsago <- today() - days(365*2)
sessions <- google_analytics_4(viewid,
date_range=c(twoyearsago,yesterday),
metrics="sessions",
dimensions="date")
end.rcode-->
<p>You will now have a data frame called "sessions" with the daily sessions total for the
last two years.</p>
<!--begin.rcode echo=FALSE, results='hide'
# Loading sessions in from pre-saved CSV.
# This makes generation of the document quicker
# and means that I won't accidentally leak client data
library(lubridate)
sessions <- read.csv("sessions.csv")
sessions$date <- ymd(sessions$date)
end.rcode-->
<!--begin.rcode prompt=TRUE
head(sessions)
end.rcode-->
<p>If this is working for you then you have got the R part working.</p>
<h6>Avoiding sampling</h6>
<p>A big motivation for many people using the API is to avoid sampled data. Sampling
occurs when a combination of the number of hits and the query dimensions exceeds a
threshold. The API can sometimes be used to reduce the amount of or avoid sampling
by making a series of requests and summing the results.</p>
<p>For example, if you are getting sampled data for last month's report it might be
possible to avoid the sampling by requesting data for one day at a time.</p>
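<p>A rough sketch of that day-at-a-time idea is below. The helper <em>fetch_by_day</em> is
made up for this example: it calls whatever function you give it once per day and stacks
the results into one data frame. Stacking is enough when "date" is one of the dimensions;
for other queries you would still need to sum the daily rows yourself.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# fetch_by_day is a hypothetical helper written for this example.
# fetch_one_day must be a function that takes a single Date and
# returns a data frame for that day.
fetch_by_day <- function(start, end, fetch_one_day) {
  days <- seq(as.Date(start), as.Date(end), by="day")
  # one request per day
  pieces <- lapply(days, fetch_one_day)
  # stack the daily data frames into one
  do.call(rbind, pieces)
}

# with the API it might look like this (one request per day, so slow):
# daily <- fetch_by_day(twoyearsago, yesterday, function(day) {
#   google_analytics_4(viewid,
#                      date_range=c(day, day),
#                      metrics="sessions",
#                      dimensions="date")
# })
end.rcode-->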
<p>The library has a cool "anti_sample" feature that tries to figure out how often to request data to avoid sampling. Sometimes it will download one day at a time, sometimes it will use longer periods.</p>
<p>The argument for this is "anti_sample":</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
unsampled <- google_analytics_4(viewid,
date_range=c(twoyearsago,yesterday),
metrics="sessions",
dimensions="date",
anti_sample=TRUE
)
end.rcode-->
<h6>Filters</h6>
<p>Filters are a bit more complicated in the latest version of the API. But this extra complexity makes it easier to combine predefined filters.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
twitter <- dim_filter(dimension="source",operator="EXACT",expressions="twitter")
facebook <- dim_filter(dimension="source",operator="EXACT",expressions="facebook")
events <- met_filter("totalEvents", "GREATER_THAN", 2)
unsampled <- google_analytics_4(viewid,
date_range=c(twoyearsago,yesterday),
metrics="sessions",
dimensions="date",
anti_sample=TRUE,
dim_filters = filter_clause_ga4(list(twitter, facebook), operator = "OR"),
met_filters = filter_clause_ga4(list(events))
)
end.rcode-->
<h5>Exercises</h5>
<p>There is just a short set of exercises here which are mainly based around knowing the
Google Analytics reporting API. The best resource for getting to grips with this is the
<a href="https://ga-dev-tools.appspot.com/explorer/">Query Explorer</a>.</p>
<ol>
<li>Query the API using more than one dimension (e.g. date and browser).
<div class="answer">
<!--begin.rcode prompt=TRUE, eval=FALSE
# this is just an example.
# there are many dimensions you could use
browsersessions <- google_analytics_4(viewid,
date_range=c(twoyearsago, yesterday),
dimensions=c("date","browser"),
metrics="sessions"
)
end.rcode-->
</div>
</li>
<li>Query the API using more than one metric.
<div class="answer">
<!--begin.rcode prompt=TRUE, eval=FALSE
sessionsevents <- google_analytics_4(viewid,
date_range=c(twoyearsago, yesterday),
dimensions=c("date"),
metrics=c("sessions","totalEvents")
)
end.rcode-->
</div>
</li>
<li>Query the API using your own filter. The <a href="https://developers.google.com/analytics/devguides/reporting/core/v3/segments#reference">segment reference documentation</a> might be useful here.
</li>
<li>Query the API pulling data from last month. The dates should be generated dynamically so that
you can run the exact same code when you rerun the report next month. Be careful - what happens
when you run your code in January?
</li>
</ol>
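<p>For the last exercise, one possible approach (a sketch, not the only answer) is to work
backwards from the first day of the current month using lubridate. The function name
<em>last_month_range</em> is made up for this example.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
library(lubridate)
# last_month_range is a hypothetical helper, not part of any library
last_month_range <- function(ref = today()) {
  first_of_this_month <- floor_date(ref, "month")
  # the day before the 1st of this month is the last day of last month
  end <- first_of_this_month - days(1)
  # rounding that down again gives the first day of last month
  start <- floor_date(end, "month")
  c(start, end)
}
# in January this rolls back into December of the previous year
last_month_range(ymd("2024-01-15"))
end.rcode-->
<p>The result can be passed straight to the date_range argument used in the earlier queries.</p>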
<h4>ggplot</h4>
<p>ggplot is the best platform for making non-interactive visualisations/charts in the world.</p>
<p>First, as you might expect, install the library.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
install.packages("ggplot2")
end.rcode-->
<p>First you will see a series of example charts using the sessions data frame made earlier with
data pulled from Google Analytics.</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
# load the library
library(ggplot2)
ggplot(sessions,aes(x=date,y=sessions)) + geom_line()
end.rcode-->
<p>Plot the data as translucent points and a smoothed line</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
ggplot(sessions,aes(x=date,y=sessions)) +
geom_point(alpha=0.2) +
geom_smooth()
end.rcode-->
<p>Plot a histogram of the number of daily sessions</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
ggplot(sessions,aes(x=sessions)) + geom_histogram()
end.rcode-->
<p>Change the theme to something more minimal</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
ggplot(sessions,aes(x=date,y=sessions)) +
geom_line() +
theme_minimal()
end.rcode-->
<p>Add a horizontal red line to see which days are better and worse than the mean</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
avg <- mean(sessions$sessions)
ggplot(sessions,aes(x=date,y=sessions)) +
geom_line() +
geom_hline(yintercept=avg, size=5, color="red") +
theme_minimal()
end.rcode-->
<p>Change the axis labels, make them bigger and add a chart title</p>
<!--begin.rcode prompt=TRUE, fig.height=8, fig.width=12
ggplot(sessions,aes(x=date,y=sessions)) +
geom_line() +
theme_minimal() +
xlab("Date") +
ylab("Number of sessions") +
ggtitle("Sessions per day over two years") +
theme(axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 25))
end.rcode-->
<h5>The grammar of graphics</h5>
<p>ggplot implements an idea called "the grammar of graphics" which is a way of thinking about
and describing charts.</p>
<p>Essentially there are three elements:
<ul>
<li>The data</li>
<li>Aesthetics that can be mapped to dimensions in the data</li>
<li>Layers that control how different aesthetics are expressed</li>
</ul>
</p>
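<p>As a rough illustration of those three elements, the sketch below uses the built-in
mtcars data. The data and the aesthetics are set once and then different layers are added
on top; the variable names are just for this example.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
library(ggplot2)
# the data (mtcars) and the aesthetics (weight on x, mpg on y)
base <- ggplot(mtcars, aes(x=wt, y=mpg))
# a layer that expresses the aesthetics as points
as_points <- base + geom_point()
# a different layer that expresses the same aesthetics as a line
as_line <- base + geom_line()
# layers can also be stacked on the same chart
both <- base + geom_point() + geom_line()
both
end.rcode-->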
<p>For an example, consider this bar chart:
<!--begin.rcode echo=FALSE, fig.width=12, fig.height=8
df <- data.frame(person=c("Pirate","Pirate","Fergie","Fergie","Cyclops","Cyclops"),
Attribute=c("Legs","Eyes","Legs","Eyes","Legs","Eyes"),
count=c(1,1,2,2,2,1)
)
ggplot(df, aes(x=person, y=count, fill=Attribute)) +
geom_bar(position="dodge", stat="identity") +
xlab("Person") +
ylab("Number person has") +
theme_bw() +
theme(axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20))
end.rcode-->
There are three aesthetics here:
<ol>
<li>The count of things - expressed on the y axis</li>
<li>The attribute - expressed by the color and position of the bars</li>
<li>The person - expressed by the position of the bars</li>
</ol>
</p>
<p>I feel I am not explaining this very well - this is at least partly because it
is a difficult and unusual concept. The <a href="http://www.amazon.co.uk/dp/0387981403/ref=cm_sw_su_dp?tag=ggplot2-20">ggplot book</a>
explains this in more detail and has tonnes of nice examples [the affiliate link
is not mine - it is the author's own]. For something free that focusses more on the
grammar of graphics concept and less on how ggplot implements it,
check out the <a href="http://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf">Layered Grammar of Graphics</a>
paper in PDF.</p>
<p>Anyway, the following image will hopefully make it a little bit clearer how all
this works in ggplot.</p>
<img src="/files/ggplot-exp.png"/>
<p>The following example (using the mtcars data) shows a few more aesthetics (colour and
size).</p>
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=8
ggplot(mtcars, aes(x=mpg,y=qsec, size=wt*1000, color=as.factor(cyl))) +
geom_point(alpha=0.8) +
xlab("Miles per gallon") +
ylab("Quarter mile time (s)") +
scale_color_discrete(guide = guide_legend(title = "Cylinders")) +
scale_size_continuous(guide = guide_legend(title = "Weight (lbs)")) +
theme_minimal()
end.rcode-->
<h6>Getting more help with ggplot</h6>
<p>ggplot is a big library with a very broad scope so it is often difficult to know
what is and isn't possible as well as what functions are provided for you. Thankfully,
because of ggplot's fairly unique name Googling for help works well. This is not true
for R in general.</p>
<p>The <a href="http://docs.ggplot2.org/current/">official documentation</a> is also
excellent.</p>
<h5>Exercises</h5>
<p>These exercises use built-in data sets (like mtcars) because then everyone is starting from the same
place so it is possible for me to provide solutions. I encourage you to try similar
plots on data pulled in from Google Analytics.</p>
<p>My guess is that you will find these exercises harder than the others because you have not been
introduced to all the functions you will need to complete them.</p>
<ol>
<li>Using the mtcars data draw a boxplot showing the amount of horsepower for engines with
differing numbers of cylinders.
<div class="answer">
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=8
data(mtcars)
ggplot(mtcars,aes(x=as.factor(cyl),y=hp)) + geom_boxplot()
end.rcode-->
</div>
</li>
<li>Add a point representing each car to the chart created in the previous task.
<div class="answer">
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=8
box <- ggplot(mtcars,aes(x=as.factor(cyl),y=hp)) + geom_boxplot()
# box + geom_point() will work, but geom_jitter is better
box + geom_jitter(color="blue", alpha=0.4, size=5)
end.rcode-->
</div>
</li>
<li>Using the mtcars data set make a scatter plot (geom_point()) of the power to
weight ratio against the quarter mile time.
<div class="answer">
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=8
ggplot(mtcars,aes(x=hp/wt,y=qsec)) + geom_point()
end.rcode-->
</div>
</li>
<li>Change your plot from the last task into a faceted plot, faceted on the number
of cylinders.
<div class="answer">
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=8
ggplot(mtcars,aes(x=hp/wt,y=qsec, color=as.factor(cyl))) +
geom_point(size=3, alpha=0.8) +
facet_grid(. ~ cyl) +
guides(color="none")
end.rcode-->
</div>
</li>
</ol>
<h4>Forecasting</h4>
<p>R has good library support for many methods of forecasting. I suspect
that it has the best and easiest support out of all the languages you
might use for this.</p>
<p>The <em>forecast</em> package is the one we will use in this tutorial.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
install.packages("forecast")
end.rcode-->
<p>The examples in this section use the "sessions" data frame containing daily Google
Analytics sessions generated earlier.</p>
<!--begin.rcode prompt=TRUE
head(sessions)
end.rcode-->
<p>To start to use the forecast package we must convert this data frame into a timeseries.
For the purpose of this example we are interested in forecasts that take into account
weekly seasonality (by which I mean the common phenomenon that some days of the week
are normally better than others).</p>
<!--begin.rcode prompt=TRUE
library(forecast)
# use 7 for the frequency because there are 7 observations
# per week.
sessionsts <- ts(sessions$sessions, frequency=7)
end.rcode-->
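<p>Separately from the timeseries object, a quick way to check whether data has this kind
of weekly pattern is to average the sessions for each day of the week. The sketch below
uses a small made-up data frame (<em>fake_sessions</em>) with the same two columns as
"sessions" so that it runs on its own; the numbers are invented.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
# made-up data: four weeks starting on a Monday,
# with weekends quieter than weekdays
fake_sessions <- data.frame(
  date = as.Date("2024-01-01") + 0:27,
  sessions = rep(c(120, 130, 125, 128, 110, 60, 55), times=4)
)
# day of the week as a number: 1 = Monday ... 7 = Sunday
fake_sessions$weekday <- as.integer(format(fake_sessions$date, "%u"))
# mean sessions for each day of the week
weekday_means <- aggregate(sessions ~ weekday, data=fake_sessions, FUN=mean)
weekday_means
end.rcode-->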
<p>Once you have created a timeseries you can do interesting things very easily.</p>
<!--begin.rcode prompt=TRUE, fig.width=12, fig.height=24
comp <- decompose(sessionsts)
# not interested in fancy ggplot here
# just do something simple
# the meaning of the resulting chart is explained below
plot(comp)
end.rcode-->
<p>Above you see a collection of four charts.
<ol>
<li>A plot of the raw data - nothing fancy here</li>
<li>A plot of the underlying trend - this is a moving average, so the weekly seasonality is removed and one-off spikes are damped.
It is way easier to see what is going on.</li>
<li>The weekly seasonality</li>
<li>The random bits of the data that don't fit anywhere else</li>
</ol>
This analysis makes the assumption that the results you see are the sum of the
underlying trend, the weekly seasonal factors and some randomness. These assumptions fit
a lot of metrics you see in web analytics.
</p>
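<p>The additive assumption can be checked directly. The sketch below builds a made-up
series from a known trend, weekly pattern and noise (all the numbers are invented),
decomposes it, and confirms that the three components add back up to the original values.</p>
<!--begin.rcode prompt=TRUE, eval=FALSE
set.seed(42)
n <- 7 * 20                                  # twenty weeks of daily data
fake_trend <- seq(100, 200, length.out=n)    # steady growth
fake_weekly <- rep(c(10, 12, 8, 5, 0, -15, -20), times=20)
fake_noise <- rnorm(n, sd=3)
fake_ts <- ts(fake_trend + fake_weekly + fake_noise, frequency=7)

fake_comp <- decompose(fake_ts)              # additive by default
rebuilt <- fake_comp$trend + fake_comp$seasonal + fake_comp$random
# the trend is a moving average so it is NA for the first and last few days
ok <- !is.na(rebuilt)
all.equal(as.numeric(rebuilt[ok]), as.numeric(fake_ts[ok]))
# the estimated weekly pattern, one value per day of the week
round(fake_comp$figure, 1)
end.rcode-->
<p>If a metric grows in proportion to its level rather than by a fixed amount, the
<em>type="multiplicative"</em> argument to <em>decompose</em> may be a better fit.</p>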