% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'data-chunk')
@
\chapter{Base \Rlang and Extensions: Data Sharing}\label{chap:R:data:io}\label{sec:data:io}
\begin{VF}
Most programmers have seen them, and most good programmers realise they've written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs.
\VA{John Bentley}{\emph{Programming Pearls}, 1986}
\end{VF}
<<echo=FALSE>>=
# set to TRUE to execute code chunks that require internet access and instruments, respectively
eval_online_data <- TRUE
eval_yoctopuce <- TRUE
@
\section{Aims of This Chapter}
In this chapter, you will learn how to exchange data between \Rlang and some other applications. Base \Rlang and the recommended packages (installed by default) include several functions for importing and exporting data. Contributed packages provide both replacements for some of these functions and support for several additional file formats. In the present chapter, I aim to describe both data input and output, covering in detail only the most common ``foreign'' data formats (those not native to \Rlang). The function pairs \code{save()} and \code{load()}, and \code{saveRDS()} and \code{readRDS()}, which save and read data in \Rlang's native formats, are described in chapter \ref{chap:R:collective}, sections \ref{sec:data:rda} and \ref{sec:data:rds} starting on page \pageref{sec:data:rda}.
Data file formats that are foreign to \Rlang are not always well defined, making it necessary to reverse-engineer the algorithms needed to read them. These formats, even when clearly defined, may be updated by the developers of the foreign software that writes the files. Consequently, developing software to read and write files using foreign formats can easily result in long, messy, and ugly \Rlang scripts. We can also unwillingly write code that usually works but occasionally fails with specific files, or even worse, occasionally silently corrupts the imported data. The aim of this chapter is to provide guidance for finding functions for reading data encoded using foreign formats, covering both base \Rlang, including the \pkgname{foreign} package, and independently contributed packages. Such functions are well tested or validated and should be used whenever possible when importing data stored in foreign file formats.
\section{Introduction}
The first step in any data analysis with \Rlang is to input or read in the data. Available sources of data are many, and data can be stored or transmitted using various formats, based on either text or binary encodings. It is crucial that data are not altered (corrupted) when read and that, if an error does occur, it is clearly reported. Most dangerous are silent non-catastrophic errors.
The very welcome increase in awareness of the need for open availability of data makes the output of data from \Rlang into well-defined data-exchange formats another crucial step. Consequently, in many cases, an important step in data analysis is to export the data for submission to a repository, in addition to publishing the results of the analysis.
Faster internet access to data sources and cheaper random-access memory (RAM) have made it possible to efficiently work with relatively large data sets in \Rlang. Because \Rlang keeps all data in memory (RAM), there are limits to the size of the data sets that \Rlang functions can operate on. For data sets that do not fit in computer RAM, one can read selected lines from text files, use file formats like NetCDF that natively implement selective reading, or use queries to access data stored in local or remote databases.
Some contributed \Rlang packages support the import of data saved in the same formats already supported by base \Rlang, but using different compromises between reliability, ease of use, and performance. Functions in base \Rlang tend to prioritise reliability and protection from data corruption, while some contributed packages prioritise performance. Other contributed packages make it possible to import and export data stored in file formats not supported by base \Rlang functions. Some of these formats are subject-area specific while others are in widespread use. Packages supporting download and upload of data sets from specific public repositories are also available (see \url{https://ropensci.org/packages/}).
\section{Packages Used in This Chapter}
<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch10_2ed)
@
To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).
\begin{warningbox}
Several examples in this chapter make use of functions from the \pkgname{tidyverse} for data wrangling because some of the packages used to import data in ``foreign'' formats are themselves part of the \pkgname{tidyverse}.
\end{warningbox}
<<message=FALSE>>=
library(learnrbook)
library(tibble)
library(purrr)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(xlsx)
library(readODS)
library(pdftools)
library(foreign)
library(haven)
library(xml2)
library(XML)
library(ncdf4)
library(tidync)
library(lubridate)
library(jsonlite)
@
\begin{warningbox}
Some data sets used in this and other chapters are available in package \pkgname{learnrbook}. In addition to the \Rlang data objects, the package includes files saved in \emph{foreign} formats used in examples of importing data. The files can be either read from the \Rlang library, or from a copy in a local folder. In this chapter, the code examples assume the user has copied the contents of folder \code{"extdata"} of the package to the current working folder.
The files can be copied by running the two statements below, assuming the current folder is the one that will be used to run the code examples in this chapter.
<<copy-data-files>>=
pkg.path <- system.file("extdata", package = "learnrbook")
file.copy(pkg.path, ".", overwrite = TRUE, recursive = TRUE)
@
Some examples write files to disk, and the statements below ensure that the folder used in these examples exists, creating it if not found.
<<make-dir>>=
save.path <- "./data"
if (!dir.exists(save.path)) {
dir.create(save.path)
}
@
\end{warningbox}
\section{File Names and Operations}\label{sec:files:filenames}
\index{file names!portable}
\index{file operations|(}
The naming of files affects data sharing irrespective of the format used for its encoding. The main difficulty is that different operating systems have different rules governing the syntax used for file names and file paths. In many cases, like when depositing data files in a public repository, we need to ensure that file names are valid across multiple operating systems (OSs). If the script used to create the files is itself expected to be OS agnostic, queries for file names and paths in \Rlang code should not make assumptions on the naming rules or available OS commands. This is especially important when developing \Rlang packages.
\begin{warningbox}
\index{file names!script portability}
For maximum portability, file names should never contain white-space characters and should contain at most one dot. For the widest possible portability, underscores should be avoided; use dashes instead. As an example, instead of \code{my data.2019.csv}, use \code{my-data-2019.csv}.
\end{warningbox}
\Rlang provides functions which help with portability by hiding the idiosyncrasies of the different OSs from \Rlang code. In scripts, these functions should be preferred over direct calls to OS commands (i.e., avoid calls to functions \Rfunction{shell()} or \Rfunction{system()} with OS commands as arguments) whenever possible. As the algorithm needed to extract a file name from a file path is OS specific, \Rlang provides functions such as \Rfunction{basename()}, whose implementation is OS specific but which, as seen from \Rlang code, behave identically---these functions hide the differences among OSs from the user of \Rlang. The chunk below can be expected to work correctly under any OS for which \Rlang is available.
<<filenames-01>>=
basename("extdata/my-file.txt")
@
\begin{warningbox}
\index{file paths!script portability}
\index{folders|see{file paths}}
\index{file paths!parsing|(}
While in \pgrmname{Unix} and \pgrmname{Linux} folder nesting in file paths is marked with a forward slash character (\verb|/|), under \pgrmname{MS-Windows} it is marked with a backslash character (\verb|\|). Backslash (\verb|\|) is an escape character in \Rlang, interpreted as the start of an embedded special character (see section \ref{sec:calc:character} on page \pageref{sec:calc:character}), while in \Rlang a forward slash (\verb|/|) can be used for file paths under any OS, and an escaped backslash (\verb|\\|) is valid only under MS-Windows. Consequently, \verb|/| should always be preferred to \verb|\\| to ensure portability, and is the approach used in this book.
<<filenames-02>>=
basename("extdata/my-file.txt")
basename("extdata\\my-file.txt")
@
\end{warningbox}
The complementary function to \code{basename()} is \Rfunction{dirname()}, which extracts from a full file path the bare path to the containing folder.
<<filenames-03>>=
dirname("extdata/my-file.txt")
@
\index{file paths!parsing|)}
\index{working directory|(}
Functions \Rfunction{getwd()} and \Rfunction{setwd()} can be used to get the path to the current working directory and to set a directory as current, respectively.
<<filenames-05,eval=FALSE>>=
# not run
getwd()
@
Function \Rfunction{setwd()} invisibly returns the path to the working directory in use before the change, making it possible to portably restore the working directory later. Both relative paths (relative to the current working directory), as in the example, and absolute paths (given in full) are accepted as an argument. In mainstream OSs, ``\code{.}'' indicates the current directory and ``\code{..}'' the directory above the current one.
<<filenames-06,eval=FALSE>>=
# not run
oldwd <- setwd("..")
getwd()
@
The returned value is always an absolute full path, so it remains valid even if the path to the working directory changes more than once before being restored.
<<filenames-07,eval=FALSE>>=
# not run
oldwd
setwd(oldwd)
getwd()
@
\index{working directory|)}
\index{listing files or directories|(}
Function \Rfunction{list.files()} returns a character vector with the names of files and/or directories (= disk folders), portably across OSs. Function \Rfunction{list.dirs()} returns only the names of directories.
<<filenames-09>>=
head(list.files())
head(list.dirs())
@
\begin{playground}
In these functions, the default argument for parameter \code{path} is the current working directory, which under Windows, Unix, and Linux is indicated by \code{"."}. Convince yourself that this is indeed the default by calling the functions with an explicit argument. After this, play with the functions, passing as argument to \code{path} other existing and non-existent file and directory paths.
\end{playground}
\begin{playground}
Pass different arguments to parameter \code{full.names} of \Rfunction{list.files()} to obtain either a list of file paths or bare file names. Similarly, investigate how the returned list of files is affected by the argument passed to \code{all.names}.
\end{playground}
\index{listing files or directories|)}
Base \Rlang provides several functions for portably working with files, and they are listed in the help page for \code{files} and in individual help pages. Use \code{help("files")} to access the help for this ``family'' of functions. The chunk below exercises some of these functions.
<<filenames-08>>=
if (!file.exists("xxx.txt")) {
file.create("xxx.txt")
}
file.size("xxx.txt")
file.info("xxx.txt")
file.rename("xxx.txt", "zzz.txt")
file.exists("xxx.txt")
file.exists("zzz.txt")
file.remove("zzz.txt")
@
\begin{playground}
Function \Rfunction{file.path()} can be used to construct a file path from its components in a way that is portable across OSs. Look at the help page and play with the function to assemble some paths that exist in the computer you are using.
\end{playground}
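As a minimal illustration, \Rfunction{file.path()} joins its arguments with the appropriate separator, yielding a path that works under any OS.
<<filenames-10>>=
file.path("extdata", "my-file.txt")
@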
\index{file operations|)}
\section{Opening and Closing File Connections}\label{sec:io:connections}
Examples\index{file connections} in the rest of this chapter use as an argument for the \code{file} formal parameter literal paths or URLs, and complete the reading or writing operations within the call to a function. Sometimes it is necessary to read or write a text file sequentially, one row or record at a time. In such cases, it is most efficient to keep the file open between reads and close the connection only when it is no longer needed. See \code{help(connections)} for details about the various functions available and their behaviour in different OSs. The code below opens a file connection, reads one line containing the column headers, then, in a separate call to \Rfunction{readLines()}, reads the two lines or records with data, and finally closes the connection.
<<file-open-01>>=
f1 <- file("extdata/not-aligned-ASCII-UK.csv", open = "r") # open for reading
readLines(f1, n = 1)
@
<<file-open-02>>=
readLines(f1, n = 2)
close(f1)
@
When \Rpgrm is used in batch mode, the ``files'' \code{stdin}, \code{stdout} and \code{stderr} can be opened, and data read from, or written to, them. These \emph{standard} sources and sinks, so familiar to \Clang programmers, allow the use of \Rlang scripts as tools in data pipes coded as shell scripts under Unix and other OSs.
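As a minimal sketch of this approach (not run), the statements below, saved as a script and run with \pgrmname{Rscript}, would read all lines from the standard input and write a line count to the standard output.
<<stdio-01, eval=FALSE>>=
# not run; intended for use in a script run with Rscript
input <- readLines(file("stdin"))
cat(length(input), "\n")
@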
\section{Plain-Text Files}\label{sec:files:txt}
\index{importing data!text files|(}\index{file formats!plain text}
In general, text files are the most portable approach to data storage but usually also the least efficient with respect to the size of the file. Text files are composed of encoded characters. This makes them easy to edit with text editors and easy to read from programs written in most programming languages. On the other hand, how the data encoded as characters is arranged can be based on two different approaches: positional or using a specific character as a separator.
The positional approach is more concise but almost unreadable to humans as the values run into each other. Reading of data stored using a positional approach requires access to a format definition and was common in FORTRAN and COBOL at the time when punch cards were used to store data. In the case of separators, different separators are in common use. Comma-separated values (CSV) encodings use either a comma or semicolon to separate the fields or columns. Tab-separated values (TSV) use the tab, or tabulator, character as a column separator. Sometimes, whitespace is used as a separator, most commonly when all values are to be converted into \code{numeric}.
\begin{explainbox}
\textbf{Not all text files are born equal.}\index{importing data!R names} When reading text files, and \emph{foreign} binary files that may contain embedded text strings, there is potential for their misinterpretation during the import operation. One common source of problems is that column headers are to be read as \Rlang names. As discussed earlier, there are strict rules, such as avoiding spaces or special characters, if the names are to be used with the normal \Rlang syntax. On import, some functions will attempt to sanitise the names, but others will not. Most non-conforming names are still accessible in \Rlang statements, but a special syntax is needed to protect them from triggering syntax errors through their interpretation as something other than variable or function names---in \Rlang jargon, we say that they need to be quoted.
Some of the things we need to be on the watch for are:
\begin{enumerate}
\item Mismatches between the character encoding expected by the function used to read the file, and the encoding used for saving the file---usually because of different locales, i.e., language and country settings.
\item Leading or trailing (invisible) spaces present in the character values or column names---which are almost invisible when data frames are printed.
\item Wrongly guessed column classes---a typing mistake affecting a single value in a column, e.g., the wrong kind of decimal marker, can prevent the column from being recognised as numeric.
\item Mismatched decimal marker in \code{CSV} files---the marker depends on the locale (language and country settings).
\end{enumerate}
If you encounter problems after import, such as failure of extraction of data frame columns by name, use function \code{names()} to get the names printed to the console as a character vector. This is useful because character vectors are always printed with each string delimited by quotation marks, making leading and trailing spaces clearly visible. The same applies to the use of \code{levels()} with factors created from data that might have contained mistakes or whitespace.
To demonstrate some of these problems, I create a data frame with name sanitation disabled, and in the second statement with sanitation enabled. The first statement is equivalent to the default behaviour of functions in package \pkgname{readr} and the second is equivalent to the behaviour of base \Rlang functions. \pkgname{readr} prioritises the integrity of the original data while \Rlang prioritises compatibility with \Rlang's naming rules.
<<file-io-txt-00a>>=
data.frame(a = 1, "a " = 2, " a" = 3, check.names = FALSE)
data.frame(a = 1, "a " = 2, " a" = 3)
@
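Calling \code{names()} on the data frame created above with \code{check.names = FALSE} makes the problem visible: each name is printed delimited by quotation marks, revealing the leading and trailing spaces.
<<file-io-txt-00aa>>=
df_raw <- data.frame(a = 1, "a " = 2, " a" = 3, check.names = FALSE)
names(df_raw)
@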
An even more subtle case is when characters can be easily confused by the user reading the output or typing in the data: zero and o (\code{a0} vs.\ \code{aO}) or el and one (\code{al} vs.\ \code{a1}) can be difficult to distinguish in some fonts. When using encodings capable of storing many character shapes, such as Unicode, two characters with almost identical visual shapes may in some cases be encoded as different characters.
<<file-io-txt-00b>>=
data.frame(al = 1, a1 = 2, aO = 3, a0 = 4)
@
Reading\index{importing data!text encodings} data from a text file can result in very odd-looking values stored in \Rlang variables because of a mismatch in encoding, e.g., when a CSV file saved with \pgrmname{MS-Excel} is silently encoded using a 16-bit Unicode format, but read as an 8-bit Unicode-encoded file.
The hardest part of all these problems is to diagnose their origin, as function arguments and working environment options can in most cases be used to force the correct decoding of text files with diverse characteristics, origins, and vintages once one knows what is required. Function \code{tools:::showNonASCIIfile()}\qRfunction{showNonASCIIfile()} from the \Rlang \pkgname{tools} package, which is not exported, but available in recent and current (\Sexpr{paste(R.version$major, R.version$minor, sep = ".")}) versions of \Rpgrm, can be used to test files for the presence of non-ASCII characters. This function takes as an argument the path to a file, while its companion function \code{tools:::showNonASCII()}\qRfunction{showNonASCII()} takes a \code{character} vector.
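As a minimal sketch of its use (not run; the file name is arbitrary), we can write a short file containing a non-ASCII character and scan it.
<<file-io-txt-00c, eval=FALSE>>=
# not run; "demo-utf8.txt" is an arbitrary file name
writeLines(c("plain ASCII text", "caf\u00e9 au lait"), "demo-utf8.txt")
tools:::showNonASCIIfile("demo-utf8.txt")
file.remove("demo-utf8.txt")
@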
\end{explainbox}
\subsection[Base \Rlang and \pkgnameNI{utils}]{Base \Rlang and \pkgname{utils}}
\index{text files!with field markers}
Text files containing data in columns can be divided into two broad groups. Those with fixed-width fields and those with delimited fields. Fixed-width fields were especially common in the early days of \langname{FORTRAN} and \langname{COBOL} when data storage capacity was very limited. These formats are frequently capable of encoding information using fewer characters than when delimited fields are used. The best way of understanding the differences is with examples. Although in this section we exemplify the use of functions by passing a file name as an argument, URLs and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). The file will be uncompressed on the fly if its name ends in \code{.gz}.
\begin{warningbox}
Whether columns containing character strings that cannot be converted into numbers are converted into factors or remain as character strings in the returned data frame depends on the value passed to parameter \code{stringsAsFactors}. The default changed in \Rlang version 4.0.0 from \code{TRUE} to \code{FALSE}. If code is to work consistently in old and new versions of \Rlang, \code{stringsAsFactors = FALSE} has to be passed explicitly in calls to \Rfunction{read.csv()} (the approach used in the book).
\end{warningbox}
In\index{text files!CSV files}\index{text files!TSV files} the first example, a file with fields solely delimited by ``,'' is read. This is what is called comma-separated values (CSV) format that can be read and written with \Rfunction{read.csv()} and \Rfunction{write.csv()}, respectively.
The contents of file \code{not-aligned-ASCII-UK.csv} are shown below.
<<file-io-csv-00a, comment='', echo=FALSE>>=
cat(readLines("extdata/not-aligned-ASCII-UK.csv"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_csv\_a.df}, and printed.
<<file-io-csv-01>>=
from_csv_a.df <-
read.csv("extdata/not-aligned-ASCII-UK.csv", stringsAsFactors = FALSE)
@
<<file-io-csv-02>>=
from_csv_a.df
from_csv_a.df[["col4"]]
sapply(from_csv_a.df, class)
@
\begin{playground}
Read the file \code{not-aligned-ASCII-UK.csv} with function \Rfunction{read.csv2()} instead of \Rfunction{read.csv()}. Although this may look like a waste of time, the point of the exercise is for you to get familiar with \Rlang behaviour in case of such a mistake. This will help you recognise similar errors when they happen accidentally, which is quite common when files are shared.
\end{playground}
Example file \code{aligned-ASCII-UK.csv} contains comma-separated values with added whitespace to align the columns, to make it easier to read by humans.
The contents of file \code{aligned-ASCII-UK.csv} are shown below.
<<file-io-csv-03, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII-UK.csv"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_csv\_b.df}, and printed.
Although space characters are read as part of the fields, they are ignored when conversion to numeric takes place.
<<file-io-csv-03a>>=
from_csv_b.df <-
read.csv("extdata/aligned-ASCII-UK.csv", stringsAsFactors = FALSE)
@
<<file-io-csv-03aa>>=
from_csv_b.df
from_csv_b.df[["col4"]]
sapply(from_csv_b.df, class)
@
By default, column names are sanitised but whitespace in character strings is kept. Passing an additional argument changes this default so that leading and trailing whitespace are discarded. Most likely this default was chosen so that data integrity is maintained.
<<file-io-csv-05>>=
from_csv_c.df <-
read.csv("extdata/aligned-ASCII-UK.csv",
stringsAsFactors = FALSE, strip.white = TRUE)
@
<<file-io-csv-05aa>>=
from_csv_c.df
from_csv_c.df[["col4"]]
sapply(from_csv_c.df, class)
@
\begin{explainbox}
When\index{importing data!character to factor conversion} character strings are converted into factors, leading and trailing whitespace is retained in the labels of factor levels. Leading and trailing whitespace are difficult to see when data frames are printed, as shown below. This example shows the kind of problems that were frequently encountered with earlier versions of \Rlang, and that can still occur when factors are created. The recommended approach is to use the default \code{stringsAsFactors = FALSE} and do the conversion into factors in a separate step.
<<file-io-csv-03b>>=
from_csv_b.df <-
read.csv("extdata/aligned-ASCII-UK.csv", stringsAsFactors = TRUE)
@
Using \code{levels()} it can be seen that the labels of the automatically created factor levels contain leading spaces.
<<file-io-csv-04>>=
sapply(from_csv_b.df, class)
from_csv_b.df[["col4"]]
levels(from_csv_b.df[["col4"]])
@
\end{explainbox}
Decimal\index{importing data!decimal marker} points and exponential notation are allowed for floating point values. In English-speaking locales, the decimal mark is a point, while in many other locales it is a comma. The behaviour of \Rlang functions does not change when run under different locales. When a comma is used as the decimal marker, a semicolon (\verb|;|) is used as the field separator.
This case is handled by functions \Rfunction{read.csv2()} and \Rfunction{write.csv2()}. Furthermore, parameters \code{dec} and \code{sep} allow setting the decimal marker and field separator to arbitrary character strings.
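As a minimal sketch of this behaviour, the chunk below writes a small data frame using the comma as decimal marker and the semicolon as field separator, and reads the file back (the file name is arbitrary).
<<file-io-csv-06>>=
# write and read back a CSV file in the "continental European" format
write.csv2(data.frame(x = 1:3, y = c(1.5, 2.5, 3.5)),
           file = "my-file-de.csv", row.names = FALSE)
read.csv2("my-file-de.csv")
@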
Function \Rfunction{read.table()} does the actual work and functions like \Rfunction{read.csv()} only differ in the default arguments for the different parameters. By default, \Rfunction{read.table()} expects fields to be separated by whitespace (one or more spaces, tabs, new lines, or carriage return).
The contents of file \code{aligned-ASCII.txt} are shown below.
<<file-io-txt-00, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII.txt"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_txt\_b.df}, and printed.
Leading and trailing whitespace are removed because they are recognised as part of the separators. For character strings containing embedded spaces to be decoded as a single value, they need to be quoted in the file as in \code{aligned-ASCII.txt} above.
<<file-io-txt-01>>=
from_txt_b.df <-
read.table("extdata/aligned-ASCII.txt",
stringsAsFactors = FALSE, header = TRUE)
@
<<file-io-txt-02>>=
from_txt_b.df
from_txt_b.df[["col4"]]
sapply(from_txt_b.df, class)
@
\index{text files!fixed width fields}
With a fixed-width format, no delimiters are needed. Decoding is based solely on the position of the characters in the line or record. A file like this cannot be interpreted without a description of the format used for saving the data. Files containing data stored in \emph{fixed width format} can be read with function \Rfunction{read.fwf()}. Records for a single observation can be stored in a single or multiple lines. In either case, each line has fields of different but fixed known widths.
Function\index{text files!FORTRAN formatted data transfer} \Rfunction{read.fortran()} is a wrapper on \Rfunction{read.fwf()} that accepts format definitions similar to those used in \langname{FORTRAN}. One particularity of \langname{FORTRAN} \emph{formatted data transfer} is that the decimal marker can be omitted in the saved file and its position specified as part of the format definition, a trick used to make text files (or stacks of punch cards!) smaller. Modern versions of \langname{FORTRAN} support reading from and writing to other formats like those using field delimiters described above.
The contents of file \code{aligned-ASCII.fwf} are shown below.
<<file-io-fwf-00, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII.fwf"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_fwf\_a.df}, and printed. The format definition is passed as a separate character vector argument, e.g., \code{"2F3.1"} describes the format of the first two columns, \code{"I3"} describes the third column and \code{"A3"} the fourth.
<<file-io-fwf-01>>=
from_fwf_a.df <-
read.fortran("extdata/aligned-ASCII.fwf",
format = c("2F3.1", "I3", "A3"),
col.names = c("col1", "col2", "col3", "col4"))
@
<<file-io-fwf-02>>=
from_fwf_a.df
from_fwf_a.df[["col4"]]
sapply(from_fwf_a.df, class)
@
\begin{explainbox}
The file reading functions described above share with \Rfunction{read.table()} the same parameters. In addition to those described above, other frequently useful parameters are \code{skip} and \code{n}, which can be used to skip lines at the top of a file and limit the number of lines (or records) to read; \code{header}, which accepts a logical argument indicating if the fields in the first text line read should be decoded as column names rather than data; \code{na.strings}, to which can be passed a character vector with strings to be interpreted as \code{NA}; and \code{colClasses}, which provides control of the conversion of the fields to \Rlang classes and possibly skipping some columns altogether. All these parameters are described in the corresponding help pages.
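A sketch combining several of these parameters in one call is shown below (not run; the argument values are merely illustrative, assuming a four-column file).
<<file-io-csv-07, eval=FALSE>>=
# not run; argument values are illustrative only
read.csv("extdata/not-aligned-ASCII-UK.csv",
         header = TRUE,     # decode first line read as column names
         na.strings = c("NA", "missing", "-999"),
         colClasses = c("numeric", "numeric", "integer", "character"))
@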
\end{explainbox}
\begin{playground}
In reality, \Rfunction{read.csv()}, \code{read.csv2()}, and \Rfunction{read.table()} are the same function with different default arguments for several of their parameters. Study the help page, and by passing suitable arguments, make \Rfunction{read.csv()} behave like \Rfunction{read.table()}, then make \Rfunction{read.table()} behave like \Rfunction{read.csv2()}.
\end{playground}
\begin{explainbox}
A text file can be read as character strings, without attempting to decode them. This is occasionally useful, such as when the decoding is done in a script, or when one needs to print a file as is. In this case, the function used is \code{readLines()}. The returned value is a character vector in which each member string corresponds to one line or record in the file, with the end-of-line markers stripped (see example in section \ref{sec:io:connections} on page \pageref{sec:io:connections}).
\end{explainbox}
\index{importing data!text files|)}
\index{exporting data!text files|(}
The next example shows how a \emph{write} function matching one of the \emph{read} functions described above can be used to save a data frame to a text file. The \Rfunction{write.csv()} function takes as an argument a data frame, or an object that can be coerced into a data frame, converts it to character strings, and saves them to a text file. A data frame, \code{my.df}, with five rows is enough for a demonstration.
<<file-io-txt-03>>=
my.df <- data.frame(x = 1:5, y = 5:1 / 10, z = letters[1:5])
@
We write \code{my.df} to a CSV file suitable for an English language locale, and then display its contents.
<<file-io-txt-04>>=
write.csv(my.df, file = "my-file1.csv", row.names = FALSE)
file.show("my-file1.csv", pager = "console")
@
<<file-io-txt-05, comment='', echo=FALSE>>=
cat(readLines("my-file1.csv"), sep = "\n")
@
\begin{explainbox}
In most cases setting, as above, \code{row.names = FALSE} when writing a CSV file will help when it is read. Of course, if row names do contain important information, such as gene tags, you cannot skip writing the row names to the file unless you first copy these data into a column in the data frame. (Row names are stored separately as an attribute in \code{data.frame} objects, see section \ref{sec:calc:attributes} on page \pageref{sec:calc:attributes} for details.)
\end{explainbox}
\begin{playground}
Write the data frame \code{my.df} into text files with functions \Rfunction{write.csv2()} and \Rfunction{write.table()} instead of \Rfunction{write.csv()} and display the files.
\end{playground}
Function \Rfunction{cat()} takes \Rlang objects and writes them after conversion to character strings to the console or a file, inserting one or more characters as separators, by default, a space. This separator can be set through parameter \code{sep}. In our example, we set \code{sep} to a new line (entered as the escape sequence \code{"\textbackslash n"}).
<<file-io-txt-11>>=
my.lines <- c("abcd", "hello world", "123.45")
cat(my.lines, file = "my-file2.txt", sep = "\n")
file.show("my-file2.txt", pager = "console")
@
<<file-io-txt-12, comment='', echo=FALSE>>=
cat(readLines('my-file2.txt'), sep = '\n')
@
\index{exporting data!text files|)}
\subsection[\pkgnameNI{readr}]{\pkgname{readr}}\label{sec:files:readr}
\index{importing data!text files|(}
<<eval=FALSE, include=FALSE>>=
citation(package = "readr")
@
Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines functions that have different default behaviour than those native to \Rlang and that are designed to be faster in many situations. The functions from package \pkgname{readr} can sometimes wrongly decode their input, and in rare cases do so silently. The \pkgname{readr} functions guess more properties of the text file format; in most cases they succeed, which is very handy, but occasionally they fail. Automatic guessing can be overridden by passing arguments, and this is recommended for scripts that will be reused to read different files in the future. Another important advantage is that these functions read character strings formatted as dates or times directly into columns of class \code{POSIXct}. All \code{write} functions defined in \pkgname{readr} have an \code{append} parameter, which can be used to change the default behaviour of overwriting an existing file with the same name to appending the output at its end.
Although we exemplify the use of these functions by passing a file name as an argument, as is the case with \Rlang native functions, URLs, and file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). The files read are uncompressed, and those written are compressed on the fly if their name ends in \code{.gz}, \code{.bz2}, \code{.xz}, or \code{.zip}.
\begin{warningbox}
Functions ``equivalent'' to native \Rlang functions described in the previous section have names formed by replacing the dot with an underscore, e.g., \Rfunction{read\_csv()} $\approx$ \Rfunction{read.csv()}. The similarity refers to the format of the files read, but not the order, names, or roles of their formal parameters. For example, function \code{read\_table()} has a slightly different behaviour than \Rfunction{read.table()}, although they both read fields separated by whitespace. Row names are not set in the returned \Rclass{tibble}, which inherits from \Rclass{data.frame}, but is not fully compatible (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble}).
\end{warningbox}
\begin{warningbox}
Package \pkgname{readr} is under active development, and functions with the same name from different major versions are not fully compatible. Code for some examples from the first edition of the book no longer works because the updated implementation fails to recognise escaped special characters. Function \Rfunction{read\_table2()} has been renamed \Rfunction{read\_table()}.
\end{warningbox}
These functions report to the console the specifications of the columns, which is important when these are guessed from the file contents, or even only from rows near the top of the file.
<<readr-01>>=
read_csv(file = "extdata/aligned-ASCII-UK.csv", show_col_types = FALSE)
@
<<readr-02>>=
read_csv(file = "extdata/not-aligned-ASCII-UK.csv", show_col_types = FALSE)
@
Because of the misaligned fields in file \code{"not-aligned-ASCII.txt"}, in the past we needed to use \Rfunction{read\_table2()}, which allowed misalignment of fields, similarly to \Rfunction{read.table()}. This function has been renamed \Rfunction{read\_table()}, and \Rfunction{read\_table2()} has been deprecated. However, parsing of both files now fails when they are read with \Rfunction{read\_table()}, as quoted strings containing whitespace are no longer recognised (see the example above using \Rfunction{read.table()}). The examples below are not run, but are kept as they may work again in the future.
<<readr-03, eval=FALSE>>=
read_table(file = "extdata/aligned-ASCII.txt")
@
<<readr-04, eval=FALSE>>=
read_table(file = "extdata/not-aligned-ASCII.txt")
@
Function \Rfunction{read\_delim()} with space as the delimiter succeeds only with the not-aligned file, as in this file the separator is always a single space.
<<readr-05a>>=
read_delim(file = "extdata/not-aligned-ASCII.txt",
delim = " ", show_col_types = FALSE)
@
Function \Rfunction{read\_tsv()} reads files encoded with the tab character as the delimiter, and \Rfunction{read\_fwf()} reads files with fixed width fields. There is, however, no equivalent to \Rfunction{read.fortran()}, supporting implicit decimal points.
\begin{playground}
Use the ``wrong'' \code{read\_} functions to read the example files used above and/or your own files. As mentioned earlier, forcing errors will help you learn how to diagnose when such errors are caused by coding or data entry mistakes. In this case, as wrongly read data are not always accompanied by error or warning messages, carefully check the returned tibbles for misread data values.
\end{playground}
\begin{explainbox}
The functions from \Rlang's package \pkgname{utils} read the whole file before attempting to guess the class of the columns or their alignment. This is reliable but slow for text files with many lines. The functions from \pkgname{readr} read by default only the top 1000 lines (\code{guess\_max = 1000}) when guessing the format and class, assuming that the guessed properties also apply to the remaining lines of the file. This is more efficient, but rather risky. However, the functions from \Rlang's package \pkgname{utils} are faster at reading files with many fields (or columns) per line and few lines.
In earlier versions of \pkgname{readr}, a problem was the failure to correctly decode numeric values when increasingly large numbers resulted in wider fields in the lines below those used for guessing. However, at the time of writing, this case is correctly handled. A guess based on the top lines of a text file also means that when values in lines below \code{guess\_max} lines cannot be converted to numeric, the \code{numeric} column returned contains \code{NA} values. In contrast, in this situation, functions from \Rlang's package \pkgname{utils}, skip decoding and return a \code{character} column. Below, a very small value for \code{guess\_max} is used to demonstrate this behaviour with a file only a few lines in length.
<<tibble-print-20, echo=FALSE>>=
options(tibble.print_max = 6, tibble.print_min = 6)
@
<<readr-06>>=
read_table(file = "extdata/miss-aligned-ASCII.txt", show_col_types = FALSE)
@
<<readr-07>>=
read_table(file = "extdata/miss-aligned-ASCII.txt", show_col_types = FALSE,
guess_max = 3L)
@
\end{explainbox}
\index{importing data!text files|)}
<<tibble-print-21, echo=FALSE>>=
options(tibble.print_max = 3, tibble.print_min = 3)
@
\index{exporting data!text files|(}
The \code{write\_} functions from \pkgname{readr} are the counterpart to \code{write.} functions from \pkgname{utils}. In addition to the expected \Rfunction{write\_csv()}, \Rfunction{write\_csv2()}, \Rfunction{write\_tsv()} and \Rfunction{write\_delim()}, \pkgname{readr} provides functions that write \pgrmname{MS-Excel}-friendly CSV files. Function \Rfunction{write\_excel\_csv()} saves a text file with comma-separated fields suitable for import into \pgrmname{MS-Excel}.
<<readr-10>>=
write_excel_csv(my.df, file = "my-file6.csv")
file.show("my-file6.csv", pager = "console")
@
<<readr-11, comment='', echo=FALSE>>=
cat(read_lines('my-file6.csv'), sep = '\n')
@
\begin{playground}
Compare the output from \Rfunction{write\_excel\_csv()} and \Rfunction{write\_csv()}. What is the difference? Does it matter when you import the written CSV file into Excel (in the version you are using, and with the locale settings of your computer)?
\end{playground}
The pair of functions \Rfunction{read\_lines()} and \Rfunction{write\_lines()} read and write character vectors without conversion, similarly to base \Rlang \code{readLines()} and \code{writeLines()}. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} read and write the contents of a whole text file into, and from, a single character string. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} can also be used with raw vectors to read and write binary files or text files of unknown encoding.
The contents of the whole file are returned as a character vector of length one, with the embedded new line markers. We use \code{cat()} to print it so these new line characters force the start of a new print-out line.
<<readr-12>>=
one.str <- read_file(file = "extdata/miss-aligned-ASCII.txt")
length(one.str)
cat(one.str)
@
\begin{advplayground}
Use \Rfunction{write\_file()} to write a file that can be read with \Rfunction{read\_csv()}.
\end{advplayground}
\index{exporting data!text files|)}
\section{XML and HTML Files}
\index{importing data!XML and HTML files|(}
\index{file formats!HTML}\index{file formats!XML}
XML files contain text with special markup. Several modern data exchange formats are based on the \langname{XML} standard (see \url{https://www.w3.org/TR/xml/}) which uses schemas for flexibility. Schemas define specific formats, allowing reading of formats not specifically targeted during development of the read functions. Even the modern \langname{XHTML} standard used for web pages is based on such schemas, while \langname{HTML} only differs slightly in its syntax.
\subsection[\pkgnameNI{xml2}]{\pkgname{xml2}}
<<xml2-00, eval=FALSE, include=FALSE>>=
citation(package = "xml2")
@
Package \pkgname{xml2} provides functions for reading and parsing \langname{XML} and \langname{HTML} files. This is a vast subject, of which I will only give a brief example.
Function \Rfunction{read\_html()} can be used to read an HTML document, either locally or from a URL as below.
<<xml2-01>>=
web_page <- read_html("https://www.learnr-book.info/")
@
Function \Rfunction{html\_structure()} displays the structure of an HTML document (long text output not shown).
<<xml2-01a, eval=FALSE>>=
html_structure(web_page)
@
Function \Rfunction{xml\_text()} extracts the text content of a node. Function \Rfunction{xml\_find\_all()} returns all the nodes matching a search pattern, given as an XPath expression. Here, the two functions are used together to extract the text of the \code{title} node.
<<xml2-02>>=
xml_text(xml_find_all(web_page, ".//title"))
@
The functions defined in this package can be used to ``harvest'' data from web pages, but also to read data from files using formats that are defined through \langname{XML} schemas.
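As a simple sketch of such ``harvesting'', the chunk below finds all link nodes in the page read above and extracts their destinations from the \code{href} attribute with \Rfunction{xml\_attr()}.
<<xml2-03>>=
links <- xml_find_all(web_page, ".//a")
head(xml_attr(links, "href"))
@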
\index{importing data!XML and HTML files|)}
\section{GPX Files}
\index{importing data!GPX files|(}\index{file formats!GPX}
GPX\index{importing data!geographic positioning system} (GPS Exchange Format) files use an XML scheme designed for saving and exchanging data from geographic positioning systems (GPS). There is some variation in the variables saved, depending on the settings of the GPS receiver. The example data used here are from a Transmeta BT747 GPS logger. The example below reads the data into a \code{tibble} as character strings. For plotting, the character values representing numbers and dates would need to be converted to numeric and datetime (\code{POSIXct}) values, respectively. In the case of plotting tracks on a map, it is preferable to use package \pkgname{sf} to import the tracks directly from the \code{.gpx} file into a layer (use of the \Rlang pipe operator is described in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}).
<<gps-01>>=
xmlTreeParse(file = "extdata/GPSDATA.gpx", useInternalNodes = TRUE) |>
xmlRoot(x = _) |>
xmlToList(node = _) |>
_[["trk"]] |>
assign(x = "temp", value = _) |>
_[names(x = temp) == "trkseg"] |>
unlist(x = _, recursive = FALSE) |>
map_df(.x = _, .f = function(x) as_tibble(x = t(x = unlist(x = x))))
rm(temp) # cleanup
@
\begin{playground}
To understand what data transformation takes place in each statement of this pipe, start by running the first statement by itself, excluding the pipe operator, and continue adding one statement at a time, and at each step check the returned value and look out for what has changed from the previous step. Optionally you can insert a line \code{print() |>} at the point where you wish to see the data being ``piped''.
\end{playground}
\index{importing data!GPX files|)}
\section{Worksheets}\label{sec:files:worksheets}
\index{importing data!worksheets and workbooks|(}
Microsoft Office, Open Office, and Libre Office are the most frequently used suites containing programs based on the worksheet paradigm. A standardised file format for the exchange of worksheet data exists, but it does not support all the features present in native file formats. We will start by considering \pgrmname{MS-Excel}. The file format used by \pgrmname{MS-Excel} has changed significantly over the years, and old formats tend to be less well supported by available \Rlang packages; they may require the file to be updated to a more modern format with \pgrmname{MS-Excel} itself before import into \Rlang. The current format is based on XML and is relatively simple to decode, whereas older binary formats are more difficult. Worksheets contain code, as equations, in addition to the actual data. In all cases, only values entered as such, or those computed by means of the embedded equations, can be imported into \Rlang; the equations themselves cannot.
\begin{warningbox}
When directly reading from a worksheet, a column of cells with mixed type can introduce \code{NA} values. A wrongly selected cell range from the worksheet can result in missing columns or rows, if the area is too small, or in rows or columns filled with \code{NA} values, if the range includes empty cells in the worksheet. Depending on the function used, it may be possible to ignore empty cells by passing an argument.
Many problems related to the import of data from worksheets and workbooks are due to translation between two different formats that impose different restrictions on what is allowed. While in a worksheet it is allowed to set the ``format'' (as it is called in \pgrmname{Excel}, and roughly equivalent to \code{mode} in \Rlang) of individual cells, a variable (column) in an \Rlang data frame is expected to be a vector, and thus to contain members belonging to the same \code{mode} or type. For the import to work as expected, the ``format'' must be consistent, i.e., all cells in a column to be imported are marked as one of the \code{Number}, \code{Date}, \code{Time}, or \code{Text} formats, with the possible exception of a \emph{single row} of column headers with the names of the variables as \code{Text}. The default format \code{General} also works, but as it does not ensure consistency, it makes it more difficult to see format inconsistencies at a glance in Excel.
When reading a \code{CSV} file, text representing numbers will be recognised and converted, but only if the decimal point is encoded as expected from the arguments passed in the function call. So a single number with a comma instead of a dot as decimal marker (or vice versa) will in most cases result in the column not being decoded as numbers but returned as a \code{character} vector (or column) in the data frame. In the case of package \pkgname{readr}, a \code{numeric} vector containing \code{NA} values for the non-decoded text may be returned instead of a \code{character} vector, depending on whether the wrong decimal marker appears near the top or near the end of the file.
When importing data from a worksheet or workbook, my recommendation is first to check it in the original software to ensure that the cells to be imported are encoded as expected. When using a \code{CSV} as an intermediate step, it is crucial to also open this file in a plain-text editor such as the editor pane in \RStudio (or \pgrmname{Notepad} in \pgrmnameNI{Windows} or \pgrmname{Nano}, \pgrmname{Emacs}, etc., in \pgrmnameNI{Unix} and \pgrmnameNI{Linux}). Based on what field separator, decimal mark, and possibly character encoding has been used, which depends on the locale settings in the operating system of the computer and in the worksheet program, select a suitable function to call and the necessary arguments to pass to it.
\end{warningbox}
\subsection{CSV files as middlemen}
If we have access to the original software used for creating a worksheet or workbook, then exporting worksheets to text files in CSV format and importing them into \Rlang using the functions described in sections \ref{sec:files:txt} and \ref{sec:files:readr} starting on pages \pageref{sec:files:txt} and \pageref{sec:files:readr} provides a broadly compatible route for importing data---with the caveat that one must ensure that delimiters and decimal marks match the expectations of the functions used. This approach is not ideal from the perspective of having to create intermediate \code{CSV} formatted text files. A better approach is, when feasible, to import the data directly from the workbook or worksheets into \Rlang.
\subsection[\pkgnameNI{readxl}]{\pkgname{readxl}}\label{sec:files:excel}
\index{importing data!.xlsx files|(}\index{file formats!xlsx}
<<readxl-00, eval=FALSE, include=FALSE>>=
citation(package = "readxl")
@
Package \pkgname{readxl} supports reading of \pgrmname{MS-Excel} workbooks, and selecting worksheets and regions within worksheets specified in ways similar to those used by \pgrmname{MS-Excel} itself. The interface is simple, and the package easy to install. We will import a file that in \pgrmname{MS-Excel} looks like the screen capture below.
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/Book1-xlsx.png}
\end{center}
Function \Rfunction{excel\_sheets()} lists the sheets contained in the workbook.
<<readxl-01>>=
sheets <- excel_sheets("extdata/Book1.xlsx")
sheets
@
In this case, the argument passed to \code{sheet} is redundant, as there is only a single worksheet in the file. It is possible to use either the name of the sheet or a positional index (in this case \code{1} would be equivalent to \code{"my data"}). Function \Rfunction{read\_excel()} with no range specification imports the whole worksheet into a tibble, as can be expected from a package included in the \pkgname{tidyverse}.
<<readxl-02>>=
Book1.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data")
Book1.df
@
It is also possible to read a region instead of the whole worksheet.
<<readxl-03>>=
Book1_region.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data",
range = "A1:B8")
Book1_region.df
@
Of the remaining arguments, the most useful ones have the same names and play similar roles as in \pkgname{readr} (see section \ref{sec:files:readr} on page \pageref{sec:files:readr}). For example, new names for the columns can be passed as an argument to override the names in the worksheet.
<<readxl-04>>=
Book1_region.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data",
range = "A2:B8",
col_names = c("A", "B"))
Book1_region.df
@
\subsection[\pkgnameNI{xlsx}]{\pkgname{xlsx}}
<<xlsx-00, eval=FALSE, include=FALSE>>=
citation(package = "xlsx")
@
Package \pkgname{xlsx} can be more difficult to install as it uses Java functions to do the actual work. However, it is more comprehensive, with functions both for reading and writing \pgrmname{MS-Excel} worksheets and workbooks, in different formats including the older binary ones. Similarly to \pkgname{readxl}, it allows selected regions of a worksheet to be imported.
Function \Rfunction{read.xlsx()} can be used to read a worksheet, indexing it by name. The returned value is a data frame, and following the expectations of \Rlang package \pkgnameNI{utils}, character columns are \emph{no longer} converted into factors by default.
<<xlsx-01>>=
Book1_xlsx.df <- read.xlsx("extdata/Book1.xlsx",
sheetName = "my data")
Book1_xlsx.df
sapply(Book1_xlsx.df, class)
@
With function \Rfunction{write.xlsx()}, we can write data frames out to Excel worksheets and even append new worksheets to an existing workbook.
<<xlsx-05>>=
set.seed(456321)
my.data <- data.frame(x = 1:10, y = letters[1:10])
write.xlsx(my.data,
file = "extdata/my-data.xlsx",
sheetName = "first copy")
write.xlsx(my.data,
file = "extdata/my-data.xlsx",
sheetName = "second copy",
append = TRUE)
@
When opened in Excel, we get a workbook containing two worksheets, named using the arguments we passed through \code{sheetName} in the code chunk above.
% screen capture to be replaced!!
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/my-data-xlsx.png}
\end{center}
\begin{playground}
If you have some worksheet files available, import them into \Rlang to get a feel for how the organisation of the data in the worksheets affects how easy or difficult it is to import them into \Rlang.
\end{playground}
\index{importing data!.xlsx files|)}
\subsection[\pkgnameNI{readODS}]{\pkgname{readODS}}
\index{importing data!.ods files|(}\index{file formats!ODS}
Package \pkgname{readODS} provides functions for reading data saved in files that follow the \emph{Open Documents Standard}. Function \Rfunction{read\_ods()} has a similar user interface to that of \code{read\_excel()} and reads one worksheet at a time, with support only for skipping top rows and selecting ranges of columns and rows. The value returned is a tibble or, optionally, a data frame. Function \Rfunction{read\_fods()} reads flat ODS files.
<<readODS-00>>=
list_ods_sheets("extdata/Book1.ods")
@
<<readODS-01>>=
ods.df <- read_ods("extdata/Book1.ods", sheet = 1)
@
<<readODS-02>>=
ods.df
@
Functions \Rfunction{write\_ods()} and \Rfunction{write\_fods()} write a data frame into an ODS or FODS file.
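A minimal sketch of exporting is shown below (not run; the file name is arbitrary).
<<readODS-03, eval=FALSE>>=
# not run; write the tibble read above to a new ODS file
write_ods(ods.df, path = "data/my-data.ods")
@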
\index{importing data!.ods files|)}
\index{importing data!worksheets and workbooks|)}
\section{Statistical Software}\label{sec:files:stat}
\index{importing data!other statistical software|(}
There are two comprehensive packages for importing data saved by other statistical programs such as \pgrmname{SAS}, \pgrmname{Statistica}, \pgrmname{SPSS}, etc. The longtime ``standard'' is package \pkgname{foreign}, included in the \Rlang distribution, while package \pkgname{haven} is a newer contributed extension. In the case of files saved with old versions of statistical programs, functions from \pkgname{foreign} tend to be more robust than those from \pkgname{haven}.
\subsection[\pkgnameNI{foreign}]{\pkgname{foreign}}
<<foreign-00, eval=FALSE, include=FALSE>>=
citation(package = "foreign")
@
Functions\index{importing data!from SAS}\index{importing data!from Stata}\index{importing data!from SPSS}\index{importing data!from Systat} in package \pkgname{foreign} allow us to import data from files saved by several statistical analysis programs, including \pgrmname{SAS}, \pgrmname{Stata}, \pgrmname{SPSS}, \pgrmname{Systat}, and \pgrmname{Octave}, and to write data to files in the formats native to \pgrmname{SAS}, \pgrmname{Stata}, and \pgrmname{SPSS}. \Rlang documents the use of these functions in detail in the \emph{R Data Import/Export} manual. As a simple example, we use function \Rfunction{read.spss()} to read a \texttt{.sav} file, saved a few years ago with the then-current version of \pgrmname{SPSS}. Only the first six rows and seven columns of the data frame are shown, including a column with dates, which appears as numeric.
<<foreign-01>>=
my_spss.df <- read.spss(file = "extdata/my-data.sav", to.data.frame = TRUE)
my_spss.df[1:6, c(1:6, 17)]
@
A second example, this time with a simple \code{.sav} file saved 15 years ago.
<<foreign-02>>=
thiamin.df <- read.spss(file = "extdata/thiamin.sav", to.data.frame = TRUE)
head(thiamin.df)
@
Another example, this time a \pgrmname{Systat} file saved on a PC more than 20 years ago and read with \Rfunction{read.systat()}.
<<foreign-03>>=
my_systat.df <- read.systat(file = "extdata/BIRCH1.SYS")
head(my_systat.df)
@
Not all functions in \pkgname{foreign} return data frames by default, but the objects they return can be converted into data frames.
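For example, by default (\code{to.data.frame = FALSE}) \Rfunction{read.spss()} returns a named list, which can be converted afterwards. A minimal sketch, not evaluated here:
<<foreign-04, eval=FALSE>>=
thiamin.lst <- read.spss(file = "extdata/thiamin.sav")
class(thiamin.lst) # a list, not a data frame
thiamin_coerced.df <- as.data.frame(thiamin.lst)
@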
\subsection[\pkgnameNI{haven}]{\pkgname{haven}}
<<haven-00, eval=FALSE, include=FALSE>>=
citation(package = "haven")
@
Package \pkgname{haven} is less ambitious with respect to the number of formats and vintages supported, providing read and write functions for only three file formats: \pgrmname{SAS}, \pgrmname{Stata}, and \pgrmname{SPSS}. On the other hand, \pkgname{haven} provides flexible ways to convert the different labelled values that cannot be directly mapped to \Rlang modes. Its functions also decode dates and times according to the idiosyncrasies of each of these file formats. When an imported file contains labelled values, the returned \Rclass{tibble} object needs some additional attention from the user. Labelled numeric columns in \pgrmname{SPSS} are not always equivalent to factors, so their conversion to factors cannot be automated and must be done manually in a separate step.
Function \Rfunction{read\_sav()} can be used to import a \code{.sav} file saved by a recent version of \pgrmname{SPSS}. As in the previous section, we display below only the first six rows and seven columns of the data frame, including a column \code{treat} containing a labelled numeric vector and \code{harvest\_date} with dates encoded as \Rlang date values.
<<haven-01>>=
my_spss.tb <- read_sav(file = "extdata/my-data.sav")
my_spss.tb[1:6, c(1:6, 17)]
@
In this case, the dates are correctly decoded.
Next, we import an older \pgrmname{SPSS} \code{.sav} file, the same one read above with \Rfunction{read.spss()}.
<<haven-02>>=
thiamin.tb <- read_sav(file = "extdata/thiamin.sav")
thiamin.tb
@
The labelled values are not converted into factors automatically. Function \Rfunction{as\_factor()} from \pkgname{haven} does this conversion in a separate step.
<<haven-02a>>=
thiamin.tb <- as_factor(thiamin.tb)
thiamin.tb
@
\begin{playground}
Compare the values returned by different \code{read} functions when applied to the same file on disk. Use \Rfunction{names()}, \Rfunction{str()}, and \Rfunction{class()} as tools in your exploration. If you are brave, also use \Rfunction{attributes()}, \Rfunction{mode()}, \Rfunction{dim()}, \Rfunction{dimnames()}, \Rfunction{nrow()}, and \Rfunction{ncol()}.
\end{playground}
\begin{playground}
If you use or have in the past used other statistical software or a general-purpose language like \langname{Python}, look for some old files and import them into \Rlang.
\end{playground}
\index{importing data!other statistical software|)}
\section{NetCDF Files}\label{sec:dataio:netcdf}
\index{importing data!NetCDF files|(}\index{file formats!NetCDF}
In some fields, including geophysics and meteorology, \pgrmname{NetCDF} is a very common format for the exchange of data. It is also used in other contexts in which data are referenced to a grid of locations, such as data from the Affymetrix microarrays used to study gene expression. \pgrmname{NetCDF} files are binary but use a format that allows the storage of metadata describing each variable together with the data itself, in a well-organised and standardised way that is ideal for the exchange of moderately large data sets measured on a spatial or spatio-temporal grid.
\pgrmname{NetCDF} is officially described as follows:
\begin{quote}
\pgrmname{NetCDF} is a set of software libraries [from Unidata] and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
\end{quote}
That \pgrmname{NetCDF} files can be read selectively, extracting the data from individual variables, is important as it makes it possible to compute in \Rlang with data sets too big to fit in a computer's RAM. Selective reading is supported by functions from packages \pkgname{ncdf4} and \pkgname{RNetCDF}. As a consequence of this flexibility, and contrary to other data file reading operations, reading a \pgrmname{NetCDF} file is done in multiple steps---i.e., opening the file, reading metadata describing the variables and spatial grid, and finally selectively reading the data of interest.
\subsection[\pkgnameNI{ncdf4}]{\pkgname{ncdf4}}
<<ncdf4-00, eval=FALSE, include=FALSE>>=
citation(package = "ncdf4")
@
Package \pkgname{ncdf4} supports reading of files using \pgrmname{NetCDF} version 4 or earlier formats. Functions in \pkgname{ncdf4} not only allow reading and writing of these files, but also their modification.
Below, file \code{pevpr.sfc.mon.ltm.nc}, containing meteorological data, is first opened with function \Rfunction{nc\_open()}. The returned object, saved as \code{meteo\_data.nc}, contains only an index into the file contents and plays the role of a file connection; its structure is displayed with a call to \code{str()}.
<<ncdf4-01>>=
meteo_data.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
str(meteo_data.nc, max.level = 1)
@
\begin{advplayground}
Increase \code{max.level} in the call to \Rfunction{str()} above and study how the connection object stores information on the dimensions and for each data variable. You can also \code{print(meteo\_data.nc)} for a more complete printout once you have understood the structure of the object.
\end{advplayground}
The dimensions of the data array are stored as metadata; in the file used here, they map the array indexes to a grid of longitudes and latitudes, with time as a third dimension. The time values are stored as numbers and are decoded below. The variables describing the grid are read one at a time with function \Rfunction{ncvar\_get()}.
<<ncdf4-02>>=
time.vec <- ncvar_get(meteo_data.nc, "time")
head(time.vec)
longitude <- ncvar_get(meteo_data.nc, "lon")
head(longitude)
latitude <- ncvar_get(meteo_data.nc, "lat")
head(latitude)
@%
\pagebreak
The \code{time} vector contains only monthly values, as the file contains a long-term series of monthly averages, expressed as days from 1800-01-01 and corresponding to the first day of each month of year~``1''. We use package \pkgname{lubridate} for the conversion. To find the indexes for the grid point of interest, it is necessary to study the vectors \code{longitude} and \code{latitude} saved above.
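As a quick check of this decoding, the values in \code{time.vec} can be converted with functions \Rfunction{ymd()} and \Rfunction{days()} from \pkgname{lubridate}:
<<ncdf4-02a>>=
ymd("1800-01-01") + days(head(time.vec))
@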
Next, the potential evapotranspiration is read for one grid point, and used to construct a data frame, with some values recycled.
<<ncdf4-03>>=
pet.tb <-
  tibble(time = time.vec,
         month = month(ymd("1800-01-01") + days(time)),
         lon = longitude[6],
         lat = latitude[2],
         pet = ncvar_get(meteo_data.nc, "pevpr")[6, 2, ]
  )
pet.tb
@
To read data for several grid points, different approaches are available. However, the order of nesting of dimensions can make adding the dimensions as columns error prone. It is much simpler to use package \pkgnameNI{tidync} described next.
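For comparison, a minimal sketch with \pkgname{ncdf4} of reading a block of grid points in one call, using the \code{start} and \code{count} arguments of \Rfunction{ncvar\_get()} (a count of \code{-1} reads all values along that dimension; the indexes shown are only an illustration):
<<ncdf4-04>>=
pet.block <- ncvar_get(meteo_data.nc, "pevpr",
                       start = c(6, 2, 1),
                       count = c(2, 2, -1))
dim(pet.block) # lon x lat x time
@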
\subsection[\pkgnameNI{tidync}]{\pkgname{tidync}}
<<tidync-00, eval=FALSE, include=FALSE>>=
citation(package = "tidync")
@
Package \pkgname{tidync} provides functions that make it easy to extract subsets of the data from a \pgrmname{NetCDF} file. The initial steps are the same operations as in the examples for \pkgnameNI{ncdf4}.
Function \Rfunction{tidync()} is used to open the file and simultaneously activate the first grid. The returned object is saved as \code{meteo\_data.tnc}. This object is subsequently used to access the file, and when printed displays a summary of the file structure and data encoding.
<<tidync-01>>=
meteo_data.tnc <- tidync("extdata/pevpr.sfc.mon.ltm.nc")
meteo_data.tnc
@
Function \Rfunction{hyper\_dims()} returns a description of the grid for which observations are available.
<<tidync-01a>>=
hyper_dims(meteo_data.tnc)
@
Function \Rfunction{hyper\_vars()} returns a description of the observations or variables available at each grid point.
<<tidync-01b>>=
hyper_vars(meteo_data.tnc)
@
Function \Rfunction{hyper\_tibble()} extracts a subset of the data into a tibble in long (or tidy) format. The selection of the grid point is done in the same operation, in this case using \code{signif()} to test for an approximate match to the actual longitude and latitude values. The decoded dates could be added with the pipe operator (\code{|>}) and methods from \pkgname{dplyr} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise}), using functions from package \pkgname{lubridate} for the decoding (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}). This step is shown commented out below because the version of \pkgname{tidync} used fails to include variable \code{time} in the returned tibble.
<<tidync-02>>=
hyper_tibble(meteo_data.tnc,
             lon = signif(lon, 1) == 9,
             lat = signif(lat, 2) == 87)
# 'tidync' (== 0.4.0) fails to include variable time in the tibble
# |>
#   mutate(.data = _, month = month(ymd("1800-01-01") + days(time))) |>
#   select(.data = _, -time)
@
In this second example, data are extracted for all grid points along latitudes, by omitting the test on \code{lat} from the chunk above. The tibble is assembled automatically, with columns added for the active dimensions. In this case, variable \code{time} is included in the returned tibble, and the decoding of the months, commented out above, works as intended.
<<tidync-03>>=
hyper_tibble(meteo_data.tnc,
             lon = signif(lon, 1) == 9) |>
  mutate(.data = _, month = month(ymd("1800-01-01") + days(time))) |>
  select(.data = _, -time)
@
\begin{playground}
Instead of extracting data for one longitude across latitudes, extract data across longitudes for one latitude near the Equator.
\end{playground}
\index{importing data!NetCDF files|)}
\section{Remotely Located Data}\label{sec:files:remote}
\index{importing data!remote connections|(}
Many\index{importing data!remotely}\index{importing data!using URL}\index{files!downloading} of the functions described above accept a URL address in place of a file name, so files can be read remotely without having first to download and save a copy in the local file system. This can be useful, especially when file names are generated within a script. However, one should avoid, especially in the case of servers open to public access, repeatedly downloading the same file, as this unnecessarily increases network traffic and the workload on the remote server. Because of this, our first example reads a small file from my own web site. See section \ref{sec:files:txt} on page \pageref{sec:files:txt} for details on the use of these and other functions for reading text files.
<<url-01, eval=eval_online_data>>=
logger.df <-
  read.csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
            header = FALSE,
            col.names = c("time", "temperature"))
sapply(logger.df, class)
@
While functions in package \pkgname{readr} support the use of URLs, those in packages \pkgname{readxl} and \pkgname{xlsx} do not. Consequently, the file has to be first downloaded and saved locally, and subsequently imported as described in section \ref{sec:files:excel} on page \pageref{sec:files:excel}. Function \Rfunction{download.file()} in the \Rlang \pkgname{utils} package can be used to download files using URLs. It supports different modes such as binary or text for the contents, and write or append for the local file, and different methods such as \code{"internal"}, \code{"wget"}, and \code{"libcurl"}.
\begin{warningbox}
For portability, \pgrmname{MS-Excel} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}
<<url-11, eval=eval_online_data>>=
download.file("http://r4photobiology.info/learnr/my-data.xlsx",
              "data/my-data-dwn.xlsx",
              mode = "wb")
@%
\pagebreak
Functions from packages \pkgname{foreign} and \pkgname{haven}, useful for reading files saved by other statistical software, support URLs. See section \ref{sec:files:stat} on page \pageref{sec:files:stat} for more information about importing this kind of data into \Rlang. The two examples below read a file saved by \pgrmname{SPSS} located in a remote server, using these two packages.
<<url-03, eval=eval_online_data>>=
remote_thiamin.df <-
  read.spss(file = "http://r4photobiology.info/learnr/thiamin.sav",
            to.data.frame = TRUE)
head(remote_thiamin.df)
@
<<url-04, eval=eval_online_data>>=
remote_my_spss.tb <-
  read_sav(file = "http://r4photobiology.info/learnr/thiamin.sav")
remote_my_spss.tb
@
Next, we download from NOAA's server a NetCDF file with long-term means for potential evapotranspiration, the same file used above in the \pkgname{ncdf4} example. This is a moderately large file at 834~KB. In this case, it is not possible to open a connection directly to the remote NetCDF file, so it has to be downloaded first. The \code{if} statement ensures that the file is downloaded only if a local copy is missing (to refresh the local copy, simply delete the existing file). Once downloaded, the file can be opened as shown in section \ref{sec:dataio:netcdf} on page \pageref{sec:dataio:netcdf}.
<<url-05, eval=eval_online_data>>=
if (!file.exists("extdata/pevpr.sfc.mon.ltm.nc")) {
  my.url <- paste("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/",
                  "surface_gauss/pevpr.sfc.mon.ltm.nc",
                  sep = "")
  download.file(my.url,
                mode = "wb",
                destfile = "extdata/pevpr.sfc.mon.ltm.nc")
}
pet_ltm.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
@
\begin{warningbox}
For portability, \pgrmname{NetCDF} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}
\index{importing data!remote connections|)}
\begin{warningbox}
Some NetCDF file servers support the OPeNDAP protocol. With these servers, it is possible to open a file remotely and download only part of its contents. Function \Rfunction{open.nc()} from package \pkgname{RNetCDF} transparently supports OPeNDAP URLs, as sketched below.
\end{warningbox}
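A minimal sketch of this approach, using a hypothetical OPeNDAP URL and not evaluated here:
<<RNetCDF-01, eval=FALSE>>=
library(RNetCDF)
# hypothetical OPeNDAP end point, replace with a real one
remote.nc <- open.nc("https://some.server.org/thredds/dodsC/some-file.nc")
print.nc(remote.nc) # summary of dimensions and variables
# read only part of variable "pevpr": a 2 x 2 x all-times block
pevpr.block <- var.get.nc(remote.nc, "pevpr",
                          start = c(6, 2, 1),
                          count = c(2, 2, NA))
close.nc(remote.nc)
@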
\section{Databases}\label{sec:data:db}
\index{importing data!databases|(}
<<dbplyr-00a, eval=FALSE, include=FALSE>>=
citation(package = "dbplyr")
@
<<dbplyr-00b, eval=FALSE, include=FALSE>>=
citation(package = "DBI")
@
<<dbplyr-00c, eval=FALSE, include=FALSE>>=
citation(package = "RSQLite")
@
One of the advantages of using databases is that subsets of cases and variables can be retrieved, even remotely, making it possible to work in \Rlang both locally and remotely with huge data sets. One should remember that \Rlang natively keeps whole objects in RAM, and consequently, available machine memory limits the size of data sets with which it is possible to work. Package \pkgname{dbplyr} provides the tools to work with data in databases using the same verbs as when using \pkgname{dplyr} with data stored in memory (RAM) (see chapter \ref{chap:R:data}). This is an important subject, but extensive enough to be outside the scope of this book. We provide a few simple examples to show the very basics but interested readers should consult \citebooktitle{Wickham2023a} \autocite{Wickham2023a}.
The additional steps compared to using \pkgname{dplyr} start with the need to establish a connection to a local or remote database. In the example below, \Rlang package \pkgname{RSQLite} is used to create a temporary, in-memory \pgrmname{SQLite} database; \pkgname{dbplyr} back ends supporting other database systems are also available. The meteorological data from package \pkgname{learnrbook} provide the example data.
<<dbplyr-01>>=
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, weather_wk_25_2019.tb, "weather",
        temporary = FALSE,
        indexes = list(
          c("month_name", "calendar_year", "solar_time"),
          "time",
          "sun_elevation",