% !Rnw root = learnR.Rnw
\input{preamble}
Data analysis has as its end point the use of forms of data summary
that will convey, fairly and succinctly, the information that is in
the data. The fitting of a model is itself a form of data summary.
\marginnote[11pt]{Data summaries that can lead to misleading
inferences often arise from an imbalance in the data and/or a failure
to account properly for important variables or factors.}Be warned of
the opportunities that simple forms of data summary, which seem
superficially harmless, can offer for misleading inferences. These
issues affect, not just data summary per se, but all modeling. Data
analysis is a task that should be undertaken with critical
faculties fully engaged.
\subsection*{Alternative types of data objects}
\begin{trivlist}
\item[{\bf Column objects:}] These include (atomic) vectors,
factors, and dates.
\item[{\bf Date and date-time objects:}] The creation and
manipulation of date objects will be described below.
\item[{\bf Data Frames:}] These are rectangular structures.
\marginnote{A data frame is a list of column objects, all of
the same length.} Columns may be `atomic' vectors, or
factors, or other objects (such as dates) that are one-dimensional.
\item[{\bf Matrices and arrays:}] Matrices\footnote{Internally,
matrices are one long vector in which the columns follow one after
the other.} are rectangular arrays in which all elements have the
same mode. An array is a generalization of a matrix to allow
an arbitrary number of dimensions.
\item[{\bf Tables:}] A table is a specialized form of array.
\item[{\bf Lists:}] A list is a collection of objects that can be of
arbitrary class. List elements may themselves be lists. In
more technical language, lists are {\em recursive} data structures.
\item[{\bf S3 model objects:}] These are lists that have a defined
structure.
\item[{\bf S4 objects:}] These are specialized data structures with
tight control on the structure. Unlike S3 objects, they cannot be
manipulated as lists. Modeling functions in certain of the newer
packages\sidenote{These include \pkg{lme4}, the Bioconductor
packages, and the spatial analysis packages.} return S4 objects.
\end{trivlist}
\section{Manipulations with Lists, Data Frames and Arrays}
Recall that data frames are lists of columns that all have the
same length. They are thus a specialized form of list. Matrices
are two-dimensional arrays. Tables are in essence arrays that
hold numeric values.
\subsection{Tables and arrays}
The dataset \code{UCBAdmissions} is stored as a 3-dimensional
table. If we convert it to an array, very little changes: the
object loses its \code{"table"} class and becomes a plain numeric
array, which affects the way that it is handled by some functions. In
either case, what we have is a numeric vector of length 24
(= 2 $\times$ 2 $\times$ 6) that is structured to have
dimensions 2 by 2 by 6.
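The following checks this (a minimal sketch; \code{unclass()} is
one way to strip the \code{"table"} class, and \code{UCBarray} is
a name chosen here for illustration):
\begin{Schunk}
\begin{Sinput}
class(UCBAdmissions)       # "table"
UCBarray <- unclass(UCBAdmissions)  # Drop the "table" class
class(UCBarray)            # Implicit class of a 3-dimensional array
dim(UCBarray)              # 2 2 6
length(UCBarray)           # 24; the data are one long numeric vector
\end{Sinput}
\end{Schunk}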
\subsection{Conversion between data frames and tables}
The three-way table \code{UCBAdmissions} holds admission frequencies,
by Gender, for the six largest departments at the University of
California at Berkeley in 1973. For a reference to a web page that
has the details, see \code{help(UCBAdmissions)}. Type
\begin{Schunk}
\begin{Sinput}
help(UCBAdmissions) # Get details of the data
example(UCBAdmissions)
\end{Sinput}
\end{Schunk}
Note the margins of the table:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
str(UCBAdmissions)
\end{Sinput}
\begin{Soutput}
'table' num [1:2, 1:2, 1:6] 512 313 89 19 353 207 17 8 120 205 ...
- attr(*, "dimnames")=List of 3
..$ Admit : chr [1:2] "Admitted" "Rejected"
..$ Gender: chr [1:2] "Male" "Female"
..$ Dept : chr [1:6] "A" "B" "C" "D" ...
\end{Soutput}
\end{Schunk}
\end{fullwidth}
%$
In general, operations with a table or array are easiest to
conceptualise if the table is first converted to a data frame
in which the separate dimensions of the table become columns.
Thus, the \code{UCBAdmissions} table will be converted to
a data frame that has columns \code{Admit}, \code{Gender} and
\code{Dept}. Either use the \code{as.data.frame.table()}
command from base R, or use the \code{adply()} function from
the \pkg{plyr} package.
\marginnote[12pt]{As \code{UCBAdmissions} is a table (not an array),
\code{as.data.frame(UCBAdmissions)} will give the same result.}
The following uses the function \code{as.data.frame.table()} to convert
the 3-way table \code{UCBAdmissions} into a data frame in which the
margins are columns:
\begin{Schunk}
\begin{Sinput}
UCBdf <- as.data.frame.table(UCBAdmissions)
head(UCBdf, 5)
\end{Sinput}
\begin{Soutput}
Admit Gender Dept Freq
1 Admitted Male A 512
2 Rejected Male A 313
3 Admitted Female A 89
4 Rejected Female A 19
5 Admitted Male B 353
\end{Soutput}
\end{Schunk}
\begin{quote}
{\small
Alternatively, use the function \code{adply()}
from the \pkg{plyr} package that is described in Section
\ref{sec:plyr}. Here the \code{identity()} function does the
manipulation, working with all three dimensions of the array:
\begin{Schunk}
\begin{Sinput}
library(plyr)
UCBdf <- adply(.data=UCBAdmissions,
               .margins=1:3,
               .fun=identity)
names(UCBdf)[4] <- "Freq"
\end{Sinput}
\end{Schunk}
}
\end{quote}
First, calculate overall admission percentages for
females and males. The following also calculates the total accepted,
and the total number who applied:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
library(dplyr)
gpUCBgender <- dplyr::group_by(UCBdf, Gender)
AdmitRate <- dplyr::summarise(gpUCBgender,
                              Accept=sum(Freq[Admit=="Admitted"]),
                              Total=sum(Freq),
                              pcAccept=100*Accept/Total)
AdmitRate
\end{Sinput}
\begin{Soutput}
# A tibble: 2 x 4
Gender Accept Total pcAccept
<fct> <dbl> <dbl> <dbl>
1 Male 1198 2691 44.5
2 Female 557 1835 30.4
\end{Soutput}
\end{Schunk}
\end{fullwidth}
Now calculate admission rates, total number of females applying,
and total number of males applying, for each department:
\begin{Schunk}
\begin{Sinput}
gpUCBgd <- dplyr::group_by(UCBdf, Gender, Dept)
rateDept <- dplyr::summarise(gpUCBgd,
                             Total=sum(Freq),
                             pcAccept=100*sum(Freq[Admit=="Admitted"])/Total)
\end{Sinput}
\end{Schunk}
Results can conveniently be displayed as follows. First show
admission rates, for females and males separately:
\begin{Schunk}
\begin{Sinput}
xtabs(pcAccept~Gender+Dept, data=rateDept)
\end{Sinput}
\begin{Soutput}
Dept
Gender A B C D E F
Male 62.061 63.036 36.923 33.094 27.749 5.898
Female 82.407 68.000 34.064 34.933 23.919 7.038
\end{Soutput}
\end{Schunk}
Now show total numbers applying:
\begin{Schunk}
\begin{Sinput}
xtabs(Total~Gender+Dept, data=rateDept)
\end{Sinput}
\begin{Soutput}
Dept
Gender A B C D E F
Male 825 560 325 417 191 373
Female 108 25 593 375 393 341
\end{Soutput}
\end{Schunk}
\marginnote[12pt]{The overall bias arose because males favored
departments where admission rates were relatively high.}
As a fraction of those who applied, females were strongly favored
in department A, and males somewhat favored in departments C and E.
Note however that relatively many males applied to A and B, where admission
rates were high. This biased overall male rates upwards. Relatively
many females applied to C, D and F, where rates were low.
This biased the overall female rates downwards.
\subsection{Table margins}
For working directly on tables, note the function \code{margin.table()}.
The following retains margin 1 (\code{Admit}) and margin 2 (\code{Gender}),
adding over \code{Dept} (the remaining margin):
\marginnote[12pt]{Take margin 2 first, then margin 1, giving a
table whose rows correspond to levels of \margtt{Gender}.}
\begin{Schunk}
\begin{Sinput}
## Tabulate by Admit (margin 2) & Gender (margin 1)
(margin21 <- margin.table(UCBAdmissions,
                          margin=2:1))
\end{Sinput}
\begin{Soutput}
Admit
Gender Admitted Rejected
Male 1198 1493
Female 557 1278
\end{Soutput}
\end{Schunk}
Use the function \code{prop.table()} to turn this into a table
that has the proportions in each row:
\begin{Schunk}
\begin{Sinput}
prop.table(margin21, margin=1)
\end{Sinput}
\begin{Soutput}
Admit
Gender Admitted Rejected
Male 0.4452 0.5548
Female 0.3035 0.6965
\end{Soutput}
\end{Schunk}
\subsection{Categorization of continuous data}\label{ss:cat-cig}
\marginnote[11pt]{The dataset \margtt{bronchit} may alternatively be found in the \pkg{SMIR} package.}
The data frame \code{DAAGviz::bronchit}
has observations on 212 men in a sample of Cardiff (Wales, UK)
enumeration districts. Variables are \code{r} (1 if respondent
suffered from chronic bronchitis and 0 otherwise), \code{cig} (number
of cigarettes smoked per day) and \code{poll} (the smoke level in the
locality).
It will be convenient to define a function \code{props} that
calculates the proportion of the total that is in the first
(or another nominated) element of a vector:
\begin{Schunk}
\begin{Sinput}
props <- function(x, elem=1)sum(x[elem])/sum(x)
\end{Sinput}
\end{Schunk}
Now use the function \code{cut()} to classify the data into four
categories, and form tables:
\pagebreak
\marginnote[12pt]{The argument \code{breaks} can be either the number of
intervals, or it can be a vector of break points such that all data
values lie within the range of the breaks. If the smallest of the
break points equals the smallest data value, supply the argument
\code{include.lowest=TRUE}.}
\begin{Schunk}
\begin{Sinput}
library(DAAGviz)
catcig <- with(bronchit,
               cut(cig, breaks=c(0,1,10,30),
                   include.lowest=TRUE))
tab <- with(bronchit, table(r, catcig))
round(apply(tab, 2, props, elem=2), 3)
\end{Sinput}
\begin{Soutput}
[0,1] (1,10] (10,30]
0.072 0.281 0.538
\end{Soutput}
\end{Schunk}
\noindent
There is a clear increase in the risk of bronchitis with the number
of cigarettes smoked.
This categorization was purely for\marginnote{It was at one time
common practice to categorize continuous data, in order to allow
analysis methods for multi-way tables. There is a loss of
information, which can at worst be serious.} purposes of
preliminary analysis. Categorization for purposes of analysis is,
with the methodology and software that are now available, usually
undesirable. Tables that are based on categorization can nevertheless
be useful in data exploration.
\subsection{$^*$Matrix Computations}
Let \code{X} ($n$ by $p$), \code{Y} ($n$ by $p$) and \code{B} ($p$ by $k$) be
numeric matrices (for \code{solve()}, \code{X} must additionally be square).
Some of the possibilities are:
\marginnote{Note that if \margtt{t()} is used with a data
frame, a matrix is returned. If necessary, all values are coerced
to the same mode.}
\begin{Schunk}
\begin{Sinput}
X + Y # Elementwise addition
X * Y # Elementwise multiplication
X %*% B # Matrix multiplication
solve(X, Y) # Solve X B = Y for B
svd(X) # Singular value decomposition
qr(X) # QR decomposition
t(X) # Transpose of X
\end{Sinput}
\end{Schunk}
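As a small worked check on some of these operations, one might try
the following (the matrices here are made-up examples; \code{X} is
chosen to be square so that \code{solve()} applies):
\begin{Schunk}
\begin{Sinput}
X <- matrix(c(2, 1, 1, 3), nrow=2)  # A small invertible matrix
B <- matrix(c(1, 2), nrow=2)        # 2 by 1
Y <- X %*% B                        # Matrix product; Y is 2 by 1
solve(X, Y)                         # Recovers B, up to rounding error
t(X)                                # Transpose of X
\end{Sinput}
\end{Schunk}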
\marginnote{Section \ref{ss:apply} will discuss the use of
\margtt{apply()} for operations with matrices, arrays and tables.
}
Calculations with data frames that are slow and time consuming will
often be much faster if they can be formulated as matrix calculations.
This in general becomes an issue only for very large datasets,
with perhaps millions of observations. Section \ref{sec:large-dset}
has examples. For small or modest-sized datasets, convenience in
formulating the calculations is likely to be more important than
calculation efficiency.
\section{\pkg{plyr}, \pkg{dplyr} \& \pkg{reshape2} Data Manipulation}\label{sec:plyr}
The \pkg{plyr} package has functions that together:
\begin{itemize}
\item provide a systematic approach to computations that perform a
desired operation across one or more dimensions of an array, or
of a data frame, or of a list;
\item allow the user to choose whether results will be returned as an
array, or as a data frame, or as a list.
\end{itemize}
The \pkg{dplyr} package has functions for performing various summary
and other operations on data frames. For many purposes, it supersedes
the \pkg{plyr} package.
The \pkg{reshape2} package is, as its name suggests, designed for
moving between alternative data layouts.
\subsection{\pkg{plyr} }
The \pkg{plyr} package has a separate function for each of the nine
possible mappings. The first letter of the function name (one of
\txtt{a} = array, \txtt{d} = data frame, \txtt{l} = list) denotes the
class of the input object, while the second letter (the same choice of
one of three letters) denotes the class of output object that is
required. This pair of letters is then followed by \txtt{ply}.
Here is the choice of functions:
% Wed Nov 11 09:01:36 2009
\begin{center}
\begin{tabular}{rlll}
\hline
& \multicolumn{3}{c}{Class of Output Object}\\
& \txtt{a} (array) & \txtt{d} (data frame) & \txtt{l} (list) \\
\hline
Class of Input Object\\
a (array) & aa{\color{gray40} ply} & ad{\color{gray40} ply} & al{\color{gray40} ply} \\
d (data frame) & da{\color{gray40} ply} & dd{\color{gray40} ply} & dl{\color{gray40} ply} \\
l (list) & la{\color{gray40} ply} & ld{\color{gray40} ply} & ll{\color{gray40} ply} \\
\hline
\end{tabular}
\end{center}
First observe how the function \code{adply()} can be used to change from
a tabular form of representation to a data frame. The dimension names
will become columns in the data frame.
\noindent
\begin{minipage}[t]{\textwidth}
\begin{Schunk}
\begin{Sinput}
detach("package:dplyr")
library(plyr)
\end{Sinput}
\begin{Sinput}
dreamMoves <-
    matrix(c(5,3,17,85), ncol=2,
           dimnames=list("Dreamer"=c("Yes","No"),
                         "Object"=c("Yes","No")))
(dfdream <- plyr::adply(dreamMoves, 1:2,
                        .fun=identity))
\end{Sinput}
\begin{Soutput}
Dreamer Object V1
1 Yes Yes 5
2 No Yes 3
3 Yes No 17
4 No No 85
\end{Soutput}
\end{Schunk}
\end{minipage}
To get the table back, do:
\begin{Schunk}
\begin{Sinput}
plyr::daply(dfdream, 1:2, function(df)df[,3])
\end{Sinput}
\begin{Soutput}
Object
Dreamer Yes No
Yes 5 17
No 3 85
\end{Soutput}
\end{Schunk}
The following calculates sums over the first two dimensions of the
table \code{UCBAdmissions}:
\marginnote[12pt]{Here, \code{aaply()} behaves exactly like \code{apply()}.}
\begin{Schunk}
\begin{Sinput}
plyr::aaply(UCBAdmissions, 1:2, sum)
\end{Sinput}
\begin{Soutput}
Gender
Admit Male Female
Admitted 1198 557
Rejected 1493 1278
\end{Soutput}
\end{Schunk}
The following calculates, for each level of the column \code{trt}
in the data frame \code{nswdemo}, the number of values of \code{re74}
that are zero:
\begin{Schunk}
\begin{Sinput}
library(DAAG, quietly=TRUE)
plyr::daply(nswdemo, .(trt),
            function(df)sum(df[,"re74"]==0, na.rm=TRUE))
\end{Sinput}
\begin{Soutput}
0 1
195 131
\end{Soutput}
\end{Schunk}
To calculate the proportion that are zero, \marginnote{Notice the use
of the syntax \code{.(trt, black)} to identify the columns
\code{trt} and \code{black}. This is an alternative to
\code{c("trt", "black")}.} for each of control and treatment and
for each of non-black and black, do:
\begin{Schunk}
\begin{Sinput}
options(digits=3)
plyr::daply(nswdemo, .(trt, black),
            function(df)sum(df[,"re75"]==0)/nrow(df))
\end{Sinput}
\begin{Soutput}
black
trt 0 1
0 0.353 0.435
1 0.254 0.403
\end{Soutput}
\end{Schunk}
The function \code{colwise()} takes as argument a function that operates
on a column of data, returning a function that operates on all
nominated columns of a data frame.
To get information on the proportion of zeros for both of the columns
\code{re75} and \code{re78}, and for each of non-black and black, do:
\marginnote{Here, \margtt{colwise()} operates on the objects that are returned
by splitting up the data frame \margtt{nswdemo} according to levels of
\margtt{trt} and \margtt{black}. Note the use of \margtt{ddply()}, not
\margtt{daply()}.
}
\begin{Schunk}
\begin{Sinput}
plyr::ddply(nswdemo, .(trt, black),
            colwise(function(x)sum(x==0)/length(x),
                    .cols=.(re75, re78)))
\end{Sinput}
\begin{Soutput}
trt black re75 re78
1 0 0 0.353 0.1529
2 0 1 0.435 0.3412
3 1 0 0.254 0.0847
4 1 1 0.403 0.2605
\end{Soutput}
\end{Schunk}
\subsection{Use of \pkg{dplyr} with World War 1 cricketer data}
Data in the data frame \code{cricketer}, extracted by John Aggleton
(now at Univ of Cardiff), are from records of UK first class
cricketers born 1840 -- 1960. Variables are
\begin{list}{}{
\setlength{\itemsep}{1pt}
\setlength{\parsep}{1pt}}
\item[-] Year of birth
\item[-] Years of life (as of 1990)
\item[-] 1990 status (dead or alive)
\item[-] Cause of death: killed in action / accident / in bed
\item[-] Bowling hand -- right or left
\end{list}
The following creates a data frame in which the first column has the
year, the second the number of right-handers born in that year, and
the third the number of left-handers born in that year.
\marginnote[12pt]{Both \pkg{plyr} and \pkg{dplyr} have functions
\margtt{summarise()}. As in the code shown, detach \pkg{plyr}
before proceeding. Alternatively, or additionally,
specify \margtt{dplyr::summarise()} rather than \margtt{summarise()}.}
\begin{Schunk}
\begin{Sinput}
library(DAAG)
detach("package:plyr")
library(dplyr)
\end{Sinput}
\begin{Sinput}
names(cricketer)[1] <- "hand"
gpByYear <- group_by(cricketer, year)
lefrt <- dplyr::summarise(gpByYear,
                          left=sum(hand=='left'),
                          right=sum(hand=='right'))
## Check first few rows
lefrt[1:4, ]
\end{Sinput}
\begin{Soutput}
# A tibble: 4 x 3
year left right
<int> <int> <int>
1 1840 1 6
2 1841 4 16
3 1842 5 16
4 1843 3 25
\end{Soutput}
\end{Schunk}
The data frame is split by values of \code{year}. Numbers of left
and right handers are then tabulated.
\marginnote[12pt]{Note that a cricketer who was born in 1869 would be
45 in 1914, while a cricketer who was born in 1896 would be 18 in 1914.}
From the data frame \code{cricketer}, we determine the range of birth
years for players who died in World War 1. We then extract data for
all cricketers, whether dying or surviving until at least the final
year of World War 1, whose birth year was within this range of years.
The following code extracts the relevant range of birth years.
\begin{Schunk}
\begin{Sinput}
## Use subset() from base R
ww1kia <- subset(cricketer,
                 kia==1 & (year+life) %in% 1914:1918)
range(ww1kia$year)
\end{Sinput}
\begin{Soutput}
[1] 1869 1896
\end{Soutput}
\end{Schunk}
Alternatively, use \code{filter()} from \pkg{dplyr}:
\begin{Schunk}
\begin{Sinput}
ww1kia <- filter(cricketer,
                 kia==1, (year+life) %in% 1914:1918)
\end{Sinput}
\end{Schunk}
For each year of birth between 1869 and 1896, the following expresses
the number of cricketers killed in action as a fraction of the total
number of cricketers (in action or not) who were born in that year:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
## Use filter(), group_by() and summarise() from dplyr
crickChoose <- filter(cricketer,
                      year %in% (1869:1896), ((kia==1)|(year+life)>1918))
gpByYearKIA <- group_by(crickChoose, year)
crickKIAyrs <- dplyr::summarise(gpByYearKIA,
                                kia=sum(kia), all=length(year), prop=kia/all)
crickKIAyrs[1:4, ]
\end{Sinput}
\begin{Soutput}
# A tibble: 4 x 4
year kia all prop
<int> <int> <int> <dbl>
1 1869 1 37 0.0270
2 1870 2 36 0.0556
3 1871 1 45 0.0222
4 1872 0 39 0
\end{Soutput}
\end{Schunk}
\end{fullwidth}
For an introduction to \pkg{dplyr}, enter:
\begin{Schunk}
\begin{Sinput}
vignette("introduction", package="dplyr")
\end{Sinput}
\end{Schunk}
\subsection{\pkg{reshape2}: \code{melt()}, \code{acast()} \& \code{dcast()}
}\label{ss:reshape2}
The \pkg{reshape2} package has functions that move between a data
frame layout in which selected columns are unstacked, and a layout
in which they are stacked. In moving from an unstacked to a stacked
layout, column names become levels of a factor. In the move back from
stacked to unstacked, factor levels become column names.
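As a minimal illustration of the two layouts (the data frame
\code{unstacked} is a made-up example):
\begin{Schunk}
\begin{Sinput}
library(reshape2)
unstacked <- data.frame(id=1:2, x=c(1.1, 2.2), y=c(3.3, 4.4))
stacked <- melt(unstacked, id.vars="id")
## The column names 'x' and 'y' are now levels of the factor 'variable'
stacked
dcast(stacked, id ~ variable)  # Back to the unstacked layout
\end{Sinput}
\end{Schunk}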
Here is a more substantial example of the use of \code{melt()}:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
## Create dataset Crimean, for use in later calculations
library(HistData) # Nightingale is from this package
library(reshape2) # Has the function melt()
Crimean <- melt(Nightingale[,c(1,8:10)], "Date")
names(Crimean) <- c("Date", "Cause", "Deaths")
Crimean$Cause <- factor(sub("\\.rate", "", Crimean$Cause))
Crimean$Regime <- ordered(rep(c(rep('Before', 12), rep('After', 12)), 3),
                          levels=c('Before', 'After'))
formdat <- format.Date(sort(unique(Crimean$Date)), format="%d %b %y")
Crimean$Date <- ordered(format.Date(Crimean$Date,
                                    format="%b %y"), levels=formdat)
\end{Sinput}
\end{Schunk}
\end{fullwidth}
\marginnote[12pt]{The dataset \margtt{Crimean} has been included in the
\pkg{DAAGviz} package.}
The dataset is now in a suitable form for creating a Florence
Nightingale style wedge plot, in Figure \ref{col:wedgeplot}.
\subsection*{Reshaping data for Motion Chart display -- an example}
The following inputs and displays World Bank Development Indicator
data that has been included with the package \pkg{DAAGviz}:
\begin{fullwidth}
\small
\begin{Schunk}
\begin{Sinput}
## DAAGviz must be installed, need not be loaded
path2file <- system.file("datasets/wdiEx.csv", package="DAAGviz")
wdiEx <- read.csv(path2file)
print(wdiEx, row.names=FALSE)
\end{Sinput}
\begin{Soutput}
Country.Name Country.Code Indicator.Name Indicator.Code X2010 X2000
Australia AUS Labor force, total SL.TLF.TOTL.IN 1.17e+07 9.62e+06
Australia AUS Population, total SP.POP.TOTL 2.21e+07 1.92e+07
China CHN Labor force, total SL.TLF.TOTL.IN 8.12e+08 7.23e+08
China CHN Population, total SP.POP.TOTL 1.34e+09 1.26e+09
\end{Soutput}
\end{Schunk}
\end{fullwidth}
A \pkg{googleVis} Motion Chart does not make much sense for this
dataset as it stands, with data for just two countries and two years.
Motion charts are designed for showing how scatterplot relationships,
here between labor force and population, have changed over a number of
years. The dataset will however serve for demonstrating the reshaping
that is needed.
For input to Motion Charts, we want indicators to be
columns, and years to be rows. The \code{melt()} and
\code{dcast()}\sidenote{Note also \code{acast()}, which outputs
an array or a matrix.} functions from the \pkg{reshape2}
package can be used to achieve the desired result. First, create a
single column of data, indexed by classifying factors:
\begin{Schunk}
\begin{Sinput}
library(reshape2)
wdiLong <- melt(wdiEx, id.vars=c("Country.Code",
                                 "Indicator.Name"),
                measure.vars=c("X2000", "X2010"))
## More simply: wdiLong <- melt(wdiEx[, -c(2,4)])
wdiLong
\end{Sinput}
\begin{Soutput}
Country.Code Indicator.Name variable value
1 AUS Labor force, total X2000 9.62e+06
2 AUS Population, total X2000 1.92e+07
3 CHN Labor force, total X2000 7.23e+08
4 CHN Population, total X2000 1.26e+09
5 AUS Labor force, total X2010 1.17e+07
6 AUS Population, total X2010 2.21e+07
7 CHN Labor force, total X2010 8.12e+08
8 CHN Population, total X2010 1.34e+09
\end{Soutput}
\end{Schunk}
Now\marginnote{If a matrix or array is required, use \margtt{acast()}
in place of \margtt{dcast()}.}
use \code{dcast()} to `cast' the data frame into a form where the
indicator variables are columns:
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
names(wdiLong)[3] <- "Year"
wdiData <- dcast(wdiLong,
                 Country.Code+Year ~ Indicator.Name,
                 value.var="value")
wdiData
\end{Sinput}
\begin{Soutput}
Country.Code Year Labor force, total Population, total
1 AUS X2000 9.62e+06 1.92e+07
2 AUS X2010 1.17e+07 2.21e+07
3 CHN X2000 7.23e+08 1.26e+09
4 CHN X2010 8.12e+08 1.34e+09
\end{Soutput}
\end{Schunk}
\end{fullwidth}
A final step is to replace the factor \code{Year} by a variable that
has the values 2000 and 2010.
\begin{fullwidth}
\begin{Schunk}
\begin{Sinput}
wdiData <- within(wdiData, {
levels(Year) <- substring(levels(Year),2)
Year <- as.numeric(as.character(Year))
})
wdiData
\end{Sinput}
\begin{Soutput}
Country.Code Year Labor force, total Population, total
1 AUS 2000 9.62e+06 1.92e+07
2 AUS 2010 1.17e+07 2.21e+07
3 CHN 2000 7.23e+08 1.26e+09
4 CHN 2010 8.12e+08 1.34e+09
\end{Soutput}
\end{Schunk}
\end{fullwidth}
\section{Session and Workspace Management}
\subsection{Keep a record of your work}
A recommended procedure
\marginnote{Be sure to save the script file from time to time during
the session, and upon quitting the session.}
is to type commands into an editor window,
then send them across to the command line. This makes it possible
to recover work on those hopefully rare occasions when the
session aborts.
\subsection{Workspace management}
For tasks that make heavy memory demands, it may be important to
ensure that large data objects do not remain in memory once they are
no longer needed. There are two complementary strategies:
\begin{itemizz}
\item[-] Objects that cannot easily be reconstructed or copied from elsewhere,
but are not for the time being required, are conveniently saved
to an image file, using the \code{save()} function (as in the sketch
that follows this list).
\item[-] Use a separate working directory for each major project.
\end{itemizz}
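The first of these strategies might, for an object \code{bigobj}
(the object and file names here are made-up), look like this:
\begin{Schunk}
\begin{Sinput}
save(bigobj, file="bigobj.RData")  # Write bigobj to an image file
rm(bigobj)                         # Remove it from the workspace
## ... later, when it is needed again ...
load("bigobj.RData")               # Restore bigobj into the workspace
\end{Sinput}
\end{Schunk}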
Note the utility function \code{dir()}\marginnote{Use \margtt{getwd()}
to check the name and path of the current working directory. Use
\margtt{setwd()} to change to a new working directory, while leaving
the workspace contents unchanged.} (it lists the names of files, by
default in the current working directory).
Several image files (`workspaces') that have distinct names can live
in the one working directory. The image file, if any, that is called
\textbf{.RData} is the file whose contents will be loaded at the
beginning of a new session in the directory.
\paragraph{The removal of clutter:}
\marginnote[9pt]{As noted in Section
\ref{ss:saveobjs}, a good precaution can be to make an archive of
the workspace before such removal.}
Use a command of the form \code{rm(x, y, tmp)} to remove
objects (here \code{x}, \code{y}, \code{tmp}) that are no longer
required.
\paragraph{Movement of files between computers:}\label{ss:dump}
Files that are saved in the default binary save file format, as above,
can be moved between different computer systems.
\paragraph{Further possibilities -- saving objects in text form:}
An alternative to saving objects\sidenote{Dumps of S4 objects and
environments, among others, cannot currently be retrieved using
\margtt{source()}. See \margtt{help(dump)}.} in an
image file is to dump them, in a text format, as dump files, e.g.
\begin{Schunk}
\begin{Sinput}
volume <- c(351, 955, 662, 1203, 557, 460)
weight <- c(250, 840, 550, 1360, 640, 420)
dump(c("volume", "weight"), file="books.R")
\end{Sinput}
\end{Schunk}
The objects can be recreated
\sidenote{The same checks are performed on dump files as if the text had been
entered at the command line. These can slow down entry of the data or
other object. Checks on dependencies can be a problem. These can
usually be resolved by editing the R source file to change or remove
offending code.}
from this `dump' file by inputting the
lines of \textbf{books.R} one by one at the command line. This is,
in effect, what the command \code{source()} does:
\begin{Schunk}
\begin{Sinput}
source("books.R")
\end{Sinput}
\end{Schunk}
For long-term archival storage, dump (\textbf{.R}) files may be
preferable to image files. For added security, retain a printed
version. If a problem arises (from a system change, or because the
file has been corrupted), it is then possible to check through the
file line by line to find what is wrong.
\section{Computer Intensive Computations}\label{sec:large-dset}
Computations may be computer intensive because of the size of
datasets. Or the computations may themselves be time-consuming,
even for data sets that are of modest size.
Note that using all of the data for an analysis or for a plot is not
always the optimal strategy. Running calculations separately on
different subsets may afford insights that are not otherwise
available. The subsets may be randomly chosen, or they may be chosen
to reflect, e.g., differences in time or place.
Where
\marginnote{The relatively new Julia language appears to offer
spectacular improvements on both R and Python, with times that
are within a factor of 2 of the Fortran or C times. See
\url{http://julialang.org/}.}
it is necessary to look for ways to speed up computations, it is
important to profile the code to find which parts
are taking most of the time. Really big improvements will come from
implementing key parts of the calculation in C or Fortran rather than
in an application-oriented language such as R or Python. Python may
do somewhat better than R.
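The function \code{Rprof()} (in the \pkg{utils} package that is
supplied with R) is one way to do such profiling. A minimal sketch
follows; here \code{myCalc()} is a hypothetical stand-in for the
computation of interest:
\begin{Schunk}
\begin{Sinput}
Rprof("myCalc.out")          # Start profiling; write to myCalc.out
result <- myCalc()           # The (hypothetical) computation of interest
Rprof(NULL)                  # Stop profiling
summaryRprof("myCalc.out")   # Time used, broken down by function
\end{Sinput}
\end{Schunk}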
There can be big differences between the alternatives that may be
available in R for handling a calculation. Some broad guidelines will
now be provided, with examples of how differences in the handling of
calculations can affect timings.
\subsection{Considerations for computations with large datasets}
\paragraph{Consider supplying matrices in preference to data frames:}
Most of R's modeling functions (regression, smoothing, discriminant
analysis, etc.) are designed to accept data frames as input. The
computational and associated memory requirements of the steps
needed to form the matrices used for the numerical computations
can, for large datasets, generate large overheads. The matrix
computations that follow use highly optimized compiled code,
and are much more efficient than if directly implemented in R code.
\marginnote{Biological expression array
applications are among those that are commonly designed to work with
data that is in a matrix format. The matrix or matrices may be
components of a more complex data structure.}
Where it is possible to directly input the matrices that will be
required for the calculations, this can greatly reduce the time
and memory requirements.
The timings shown below, under `Use efficient coding', compare
alternative ways to find the sums of rows of a matrix \code{xy}
that was generated thus:
\begin{Schunk}
\begin{Sinput}
xy <- matrix(rnorm(5*10^7), ncol=100)
dim(xy)
\end{Sinput}
\end{Schunk}
\paragraph{Use efficient coding:}
Matrix arithmetic can be faster than the equivalent computations
that use \code{apply()}. Here are timings for some alternatives that
find the sums of rows of the matrix \code{xy} above:\\[-4pt]
% latex table generated in R 2.5.1 by xtable 1.4-6 package
% Tue Aug 21 18:37:55 2007
\marginnote{Timings are on a mid-2012 1.8 GHz Intel i5 MacBook Air
laptop with 8 gigabytes of random access memory.}
\begin{center}
\begin{tabular}{rrrr}
\hline
& user & system & elapsed \\
\hline
\code{apply(xy,1,sum)} & 0.528 & 0.087 & 0.617 \\
\code{xy \%*\% rep(1,100)} & 0.019 & 0.001 & 0.019 \\
\code{rowSums(xy)} & 0.034 & 0.001 & 0.035 \\
\hline
\end{tabular}
\end{center}
\vspace*{-9pt}
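Timings of this kind can be obtained using \code{system.time()},
along the following lines:
\begin{Schunk}
\begin{Sinput}
system.time(apply(xy, 1, sum))
system.time(xy %*% rep(1, 100))
system.time(rowSums(xy))
\end{Sinput}
\end{Schunk}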
\paragraph{The bigmemory project:} For details, go to
\url{http://www.bigmemory.org/}. The \pkg{bigmemory} package for R
``supports the creation, storage, access, and manipulation of massive
matrices''. Note also the associated packages \pkg{biganalytics}, \pkg{bigtabulate},
\pkg{synchronicity}, and \pkg{bigalgebra}.
\paragraph{The \pkg{data.table} package:}
This allows the creation
\marginnote{On 64-bit systems, massive
data sets, e.g., with tens or hundreds of millions of rows, are
possible. For such large data objects, the time saving can be huge.}
of \txtt{data.table} objects from which information can be quickly
extracted, often in a fraction of the time required for extracting the
same information from a data frame. The package has an accompanying
vignette. To display it (assuming that the package has been
installed), type
\begin{Schunk}
\begin{Sinput}
vignette("datatable-intro", package="data.table")
\end{Sinput}
\end{Schunk}
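A minimal sketch of the style of use involved (the object and column
names here are made-up):
\begin{Schunk}
\begin{Sinput}
library(data.table)
DT <- data.table(id=rep(1:5, each=2), x=rnorm(10))
setkey(DT, id)        # Sort and index by id, for fast keyed lookup
DT[J(3)]              # Extract the rows that have id == 3
DT[, mean(x), by=id]  # A grouped summary
\end{Sinput}
\end{Schunk}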
\section{Summary}
\begin{itemize}
\item[] \code{apply()} and \code{sapply()} can be useful for
manipulations with data frames and matrices. Note also the
functions \code{melt()}, \code{dcast()} and \code{acast()} from the
\pkg{reshape2} package.
\item[] Careful workspace management is important when files
are large. It pays to use separate working directories for each
different project, and to save important data objects as image files
when they are, for the time being, no longer required.
\item[] In computations with large datasets, operations that
are formally equivalent can differ greatly in their use of
computational resources.
\end{itemize}