% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'data-chunk')
@
\chapter{Base \Rlang and Extensions: Data Sharing}\label{chap:R:data:io}\label{sec:data:io}
\begin{VF}
Most programmers have seen them, and most good programmers realise they've written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs.
\VA{John Bentley}{\emph{Programming Pearls}, 1986}
\end{VF}
<<echo=FALSE>>=
# set to TRUE to execute code chunks that require internet access and instruments, respectively
eval_online_data <- TRUE
eval_yoctopuce <- TRUE
@
\section{Aims of This Chapter}
In this chapter, you will learn how to exchange data between \Rlang and some other applications. Base \Rlang and the recommended packages (installed by default) include several functions for importing and exporting data. Contributed packages provide both replacements for some of these functions and support for several additional file formats. In the present chapter, I aim to describe both data input and output, covering in detail only the most common ``foreign'' data formats (those not native to \Rlang). The function pairs \code{save()} and \code{load()}, and \code{saveRDS()} and \code{readRDS()}, which save and read data in \Rlang's native formats, are described in chapter \ref{chap:R:collective}, sections \ref{sec:data:rda} and \ref{sec:data:rds} starting on page \pageref{sec:data:rda}.
Data file formats that are foreign to \Rlang are not always well defined, making it necessary to reverse-engineer the algorithms needed to read them. These formats, even when clearly defined, may be updated by the developers of the foreign software that writes the files. Consequently, developing software to read and write files using foreign formats can easily result in long, messy, and ugly \Rlang scripts. We can also unwillingly write code that usually works but occasionally fails with specific files, or even worse, occasionally silently corrupts the imported data. The aim of this chapter is to provide guidance for finding functions for reading data encoded using foreign formats, covering both base \Rlang, including the \pkgname{foreign} package, and independently contributed packages. Such functions are well tested or validated and should be used whenever possible when importing data stored in foreign file formats.
\section{Introduction}
The first step in any data analysis with \Rlang is to input or read in the data. Available sources of data are many, and data can be stored or transmitted using various formats, based on either text or binary encodings. It is crucial that data are not altered (corrupted) when read and that, if an error does occur, it is clearly reported. Most dangerous are silent non-catastrophic errors.
The very welcome increase in awareness of the need for open availability of data makes the output of data from \Rlang into well-defined data-exchange formats another crucial step. Consequently, in many cases, an important step in data analysis is to export the data for submission to a repository, in addition to publishing the results of the analysis.
Faster internet access to data sources and cheaper random-access memory (RAM) have made it possible to efficiently work with relatively large data sets in \Rlang. Because \Rlang keeps all data in memory (RAM), there are limits to the size of the data sets that \Rlang functions can operate on. For data sets that do not fit in computer RAM, one can read selected lines from text files, use file formats like NetCDF that natively implement selective reading, or use queries to access data stored in local or remote databases.
Some contributed \Rlang packages support the import of data saved in the same formats already supported by base \Rlang, but using different compromises between reliability, ease of use, and performance. Functions in base \Rlang tend to prioritise reliability and protection from data corruption, while some contributed packages prioritise performance. Other contributed packages make it possible to import and export data stored in file formats not supported by base \Rlang functions. Some of these formats are subject-area specific while others are in widespread use. Packages supporting download and upload of data sets from specific public repositories are also available (see \url{https://ropensci.org/packages/}).
\section{Packages Used in This Chapter}
<<eval=FALSE>>=
install.packages(learnrbook::pkgs_ch10_2ed)
@
To run the examples included in this chapter, you need first to load some packages from the library (see section \ref{sec:script:packages} on page \pageref{sec:script:packages} for details on the use of packages).
\begin{warningbox}
Several examples in this chapter make use of functions from the \pkgname{tidyverse} for data wrangling because some of the packages used to import data in ``foreign'' formats are themselves part of the \pkgname{tidyverse}.
\end{warningbox}
<<message=FALSE>>=
library(learnrbook)
library(tibble)
library(purrr)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(xlsx)
library(readODS)
library(pdftools)
library(foreign)
library(haven)
library(xml2)
library(XML)
library(ncdf4)
library(tidync)
library(lubridate)
library(jsonlite)
@
\begin{warningbox}
Some data sets used in this and other chapters are available in package \pkgname{learnrbook}. In addition to the \Rlang data objects, the package includes files saved in \emph{foreign} formats used in examples of importing data. The files can be either read from the \Rlang library, or from a copy in a local folder. In this chapter, the code examples assume the user has copied the contents of folder \code{"extdata"} of the package to the current working folder.
The files can be copied by running the two statements below, assuming the current folder is the one that will be used to run the code examples in this chapter.
<<copy-data-files>>=
pkg.path <- system.file("extdata", package = "learnrbook")
file.copy(pkg.path, ".", overwrite = TRUE, recursive = TRUE)
@
Some examples write files to disk, and the statements below ensure that the folder used in these examples exists, creating it if not found.
<<make-dir>>=
save.path <- "./data"
if (!dir.exists(save.path)) {
dir.create(save.path)
}
@
\end{warningbox}
\section{File Names and Operations}\label{sec:files:filenames}
\index{file names!portable}
\index{file operations|(}
The naming of files affects data sharing irrespective of the format used for its encoding. The main difficulty is that different operating systems have different rules governing the syntax used for file names and file paths. In many cases, like when depositing data files in a public repository, we need to ensure that file names are valid across multiple operating systems (OSs). If the script used to create the files is itself expected to be OS agnostic, queries for file names and paths in \Rlang code should not make assumptions on the naming rules or available OS commands. This is especially important when developing \Rlang packages.
\begin{warningbox}
\index{file names!script portability}
For maximum portability, file names should never contain white-space characters and should contain at most one dot. For the widest possible portability, underscores should be avoided; use dashes instead. As an example, instead of \code{my data.2019.csv}, use \code{my-data-2019.csv}.
\end{warningbox}
\Rlang provides functions which help with portability by hiding the idiosyncrasies of the different OSs from \Rlang code. In scripts, these functions should be preferred over direct calls to OS commands (i.e., avoid calls to functions \Rfunction{shell()} or \Rfunction{system()} with OS commands as arguments) whenever possible. As the algorithm needed to extract a file name from a file path is OS specific, \Rlang provides functions such as \Rfunction{basename()}, whose implementation is OS specific but which, as seen from \Rlang code, behave identically---these functions hide the differences among OSs from the user of \Rlang. The chunk below can be expected to work correctly under any OS for which \Rlang is available.
<<filenames-01>>=
basename("extdata/my-file.txt")
@
\begin{warningbox}
\index{file paths!script portability}
\index{folders|see{file paths}}
\index{file paths!parsing|(}
While in \pgrmname{Unix} and \pgrmname{Linux} folder nesting in file paths is marked with a forward slash character (\verb|/|), under \pgrmname{MS-Windows} it is marked with a backslash character (\verb|\|). Backslash (\verb|\|) is an escape character in \Rlang, interpreted as the start of an embedded special character (see section \ref{sec:calc:character} on page \pageref{sec:calc:character}), while in \Rlang a forward slash (\verb|/|) can be used for file paths under any OS, and an escaped backslash (\verb|\\|) is valid only under MS-Windows. Consequently, \verb|/| should always be preferred to \verb|\\| to ensure portability, and is the approach used in this book.
<<filenames-02>>=
basename("extdata/my-file.txt")
basename("extdata\\my-file.txt")
@
\end{warningbox}
The complementary function to \code{basename()} is \Rfunction{dirname()}, which extracts from a full file path the bare path to the containing folder.
<<filenames-03>>=
dirname("extdata/my-file.txt")
@
\index{file paths!parsing|)}
\index{working directory|(}
Functions \Rfunction{getwd()} and \Rfunction{setwd()} can be used to get the path to the current working directory and to set a directory as current, respectively.
<<filenames-05,eval=FALSE>>=
# not run
getwd()
@
Function \Rfunction{setwd()} invisibly returns the path to the working directory in use before the change, making it possible to portably restore the working directory later. Both relative paths (relative to the current working directory), as in the example, and absolute paths (given in full) are accepted as an argument. In mainstream OSs, ``\code{.}'' indicates the current directory and ``\code{..}'' the directory above the current one.
<<filenames-06,eval=FALSE>>=
# not run
oldwd <- setwd("..")
getwd()
@
The returned value is always an absolute full path, so it remains valid even if the path to the working directory changes more than once before being restored.
<<filenames-07,eval=FALSE>>=
# not run
oldwd
setwd(oldwd)
getwd()
@
\index{working directory|)}
\index{listing files or directories|(}
Function \Rfunction{list.files()} returns a character vector with the names of files and/or directories (= disk folders), portably across OSs. Function \Rfunction{list.dirs()} returns only the names of directories.
<<filenames-09>>=
head(list.files())
head(list.dirs())
@
\begin{playground}
In these functions, the default argument for parameter \code{path} is the current working directory, which under Windows, Unix, and Linux is indicated by \code{"."}. Convince yourself that this is indeed the default by calling the functions with an explicit argument. After this, play with the functions, passing as argument to \code{path} other existing and non-existent file and directory paths.
\end{playground}
\begin{playground}
Pass different arguments to parameter \code{full.names} of \Rfunction{list.files()} to obtain either a list of file paths or bare file names. Similarly, investigate how the returned list of files is affected by the argument passed to \code{all.names}.
\end{playground}
\index{listing files or directories|)}
Base \Rlang provides several functions for portably working with files, and they are listed in the help page for \code{files} and in individual help pages. Use \code{help("files")} to access the help for this ``family'' of functions. The chunk below exercises some of these functions.
<<filenames-08>>=
if (!file.exists("xxx.txt")) {
file.create("xxx.txt")
}
file.size("xxx.txt")
file.info("xxx.txt")
file.rename("xxx.txt", "zzz.txt")
file.exists("xxx.txt")
file.exists("zzz.txt")
file.remove("zzz.txt")
@
\begin{playground}
Function \Rfunction{file.path()} can be used to construct a file path from its components in a way that is portable across OSs. Look at the help page and play with the function to assemble some paths that exist in the computer you are using.
\end{playground}
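As a minimal illustration, \Rfunction{file.path()} joins its arguments with the appropriate separator, yielding a path that works under any OS.
<<filenames-10>>=
file.path("extdata", "my-file.txt")
@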
\index{file operations|)}
\section{Opening and Closing File Connections}\label{sec:io:connections}
Examples\index{file connections} in the rest of this chapter use as an argument for the \code{file} formal parameter literal paths or URLs, and complete the reading or writing operations within the call to a function. Sometimes it is necessary to read or write a text file sequentially, one row or record at a time. In such cases, it is most efficient to keep the file open between reads and close the connection only when it is no longer needed. See \code{help(connections)} for details about the various functions available and their behaviour in different OSs. The code below opens a file connection, reads one line containing the column headers, then, in a separate call to \Rfunction{readLines()}, reads the two lines or records with data, and finally closes the connection.
<<file-open-01>>=
f1 <- file("extdata/not-aligned-ASCII-UK.csv", open = "r") # open for reading
readLines(f1, n = 1)
@
<<file-open-02>>=
readLines(f1, n = 2)
close(f1)
@
When \Rpgrm is used in batch mode, the ``files'' \code{stdin}, \code{stdout} and \code{stderr} can be opened, and data read from, or written to, them. These \emph{standard} sources and sinks, so familiar to \Clang programmers, allow the use of \Rlang scripts as tools in data pipes coded as shell scripts under Unix and other OSs.
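As a minimal sketch of this approach (not run), the statements below, saved as a script and run with \pgrmname{Rscript}, would read all lines from the standard input and write a line count to the standard output.
<<stdio-01, eval=FALSE>>=
# not run; intended for use in a script run with Rscript
input <- readLines(file("stdin"))
cat(length(input), "\n")
@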
\section{Plain-Text Files}\label{sec:files:txt}
\index{importing data!text files|(}\index{file formats!plain text}
In general, text files are the most portable approach to data storage but usually also the least efficient with respect to the size of the file. Text files are composed of encoded characters. This makes them easy to edit with text editors and easy to read from programs written in most programming languages. On the other hand, how the data encoded as characters is arranged can be based on two different approaches: positional or using a specific character as a separator.
The positional approach is more concise but almost unreadable to humans as the values run into each other. Reading of data stored using a positional approach requires access to a format definition and was common in FORTRAN and COBOL at the time when punch cards were used to store data. In the case of separators, different separators are in common use. Comma-separated values (CSV) encodings use either a comma or semicolon to separate the fields or columns. Tab-separated values (TSV) use the tab, or tabulator, character as a column separator. Sometimes, whitespace is used as a separator, most commonly when all values are to be converted into \code{numeric}.
\begin{explainbox}
\textbf{Not all text files are born equal.}\index{importing data!R names} When reading text files, and \emph{foreign} binary files that may contain embedded text strings, there is potential for their misinterpretation during the import operation. One common source of problems is that column headers are to be read as \Rlang names. As discussed earlier, there are strict rules, such as avoiding spaces or special characters, if the names are to be used with the normal \Rlang syntax. On import, some functions will attempt to sanitise the names, but others will not. Most non-conforming names are still accessible in \Rlang statements, but a special syntax is needed to protect them from triggering syntax errors through their interpretation as something other than variable or function names---in \Rlang jargon, we say that they need to be quoted.
Some of the things we need to be on the watch for are:
\begin{enumerate}
\item Mismatches between the character encoding expected by the function used to read the file, and the encoding used for saving the file---usually because of different locales, i.e., language and country settings.
\item Leading or trailing (invisible) spaces present in the character values or column names---which are almost invisible when data frames are printed.
\item Wrongly guessed column classes---a typing mistake affecting a single value in a column, e.g., the wrong kind of decimal marker, can prevent the column from being recognised as numeric.
\item Mismatched decimal marker in \code{CSV} files---the marker depends on the locale (language and country settings).
\end{enumerate}
If you encounter problems after import, such as failure of extraction of data frame columns by name, use function \code{names()} to get the names printed to the console as a character vector. This is useful because character vectors are always printed with each string delimited by quotation marks, making leading and trailing spaces clearly visible. The same applies to the use of \code{levels()} with factors created from data that might have contained mistakes or whitespace.
To demonstrate some of these problems, I create a data frame with name sanitation disabled, and in the second statement with sanitation enabled. The first statement is equivalent to the default behaviour of functions in package \pkgname{readr} and the second is equivalent to the behaviour of base \Rlang functions. \pkgname{readr} prioritises the integrity of the original data while \Rlang prioritises compatibility with \Rlang's naming rules.
<<file-io-txt-00a>>=
data.frame(a = 1, "a " = 2, " a" = 3, check.names = FALSE)
data.frame(a = 1, "a " = 2, " a" = 3)
@
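Calling \code{names()} on the data frame created above with \code{check.names = FALSE} makes the problem visible: each name is printed delimited by quotation marks, revealing the leading and trailing spaces.
<<file-io-txt-00aa>>=
df_raw <- data.frame(a = 1, "a " = 2, " a" = 3, check.names = FALSE)
names(df_raw)
@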
An even more subtle case is when characters can be easily confused by the user reading the output or typing in the data: zero and o (\code{a0} vs.\ \code{aO}) or el and one (\code{al} vs.\ \code{a1}) can be difficult to distinguish in some fonts. When using encodings capable of storing many character shapes, such as Unicode, two characters with almost identical visual shapes may in some cases be encoded as different characters.
<<file-io-txt-00b>>=
data.frame(al = 1, a1 = 2, aO = 3, a0 = 4)
@
Reading\index{importing data!text encodings} data from a text file can result in very odd-looking values stored in \Rlang variables because of a mismatch in encoding, e.g., when a CSV file saved with \pgrmname{MS-Excel} is silently encoded using a 16-bit Unicode format, but read as an 8-bit Unicode-encoded file.
The hardest part of all these problems is to diagnose their origin, as function arguments and working environment options can in most cases be used to force the correct decoding of text files with diverse characteristics, origins, and vintages once one knows what is required. Function \code{tools:::showNonASCIIfile()}\qRfunction{showNonASCIIfile()} from the \Rlang \pkgname{tools} package, which is not exported, but available in recent and current (\Sexpr{paste(R.version$major, R.version$minor, sep = ".")}) versions of \Rpgrm, can be used to test files for the presence of non-ASCII characters. This function takes as an argument the path to a file, while its companion function \code{tools:::showNonASCII()}\qRfunction{showNonASCII()} takes a \code{character} vector.
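As a minimal sketch of its use (not run; the file name is arbitrary), we can write a short file containing a non-ASCII character and scan it.
<<file-io-txt-00c, eval=FALSE>>=
# not run; "demo-utf8.txt" is an arbitrary file name
writeLines(c("plain ASCII text", "caf\u00e9 au lait"), "demo-utf8.txt")
tools:::showNonASCIIfile("demo-utf8.txt")
file.remove("demo-utf8.txt")
@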
\end{explainbox}
\subsection[Base \Rlang and \pkgnameNI{utils}]{Base \Rlang and \pkgname{utils}}
\index{text files!with field markers}
Text files containing data in columns can be divided into two broad groups. Those with fixed-width fields and those with delimited fields. Fixed-width fields were especially common in the early days of \langname{FORTRAN} and \langname{COBOL} when data storage capacity was very limited. These formats are frequently capable of encoding information using fewer characters than when delimited fields are used. The best way of understanding the differences is with examples. Although in this section we exemplify the use of functions by passing a file name as an argument, URLs and open file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). The file will be uncompressed on the fly if its name ends in \code{.gz}.
\begin{warningbox}
Whether columns containing character strings that cannot be converted into numbers are converted into factors or remain as character strings in the returned data frame depends on the value passed to parameter \code{stringsAsFactors}. The default changed in \Rlang version 4.0.0 from \code{TRUE} to \code{FALSE}. If code is to work consistently in old and new versions of \Rlang, \code{stringsAsFactors = FALSE} has to be passed explicitly in calls to \Rfunction{read.csv()} (the approach used in the book).
\end{warningbox}
In\index{text files!CSV files}\index{text files!TSV files} the first example, a file with fields solely delimited by ``,'' is read. This is what is called comma-separated values (CSV) format that can be read and written with \Rfunction{read.csv()} and \Rfunction{write.csv()}, respectively.
The contents of file \code{not-aligned-ASCII-UK.csv} are shown below.
<<file-io-csv-00a, comment='', echo=FALSE>>=
cat(readLines("extdata/not-aligned-ASCII-UK.csv"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_csv\_a.df}, and printed.
<<file-io-csv-01>>=
from_csv_a.df <-
read.csv("extdata/not-aligned-ASCII-UK.csv", stringsAsFactors = FALSE)
@
<<file-io-csv-02>>=
from_csv_a.df
from_csv_a.df[["col4"]]
sapply(from_csv_a.df, class)
@
\begin{playground}
Read the file \code{not-aligned-ASCII-UK.csv} with function \Rfunction{read.csv2()} instead of \Rfunction{read.csv()}. Although this may look like a waste of time, the point of the exercise is for you to get familiar with \Rlang behaviour in case of such a mistake. This will help you recognise similar errors when they happen accidentally, which is quite common when files are shared.
\end{playground}
Example file \code{aligned-ASCII-UK.csv} contains comma-separated values with added whitespace to align the columns, to make it easier to read by humans.
The contents of file \code{aligned-ASCII-UK.csv} are shown below.
<<file-io-csv-03, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII-UK.csv"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_csv\_b.df}, and printed.
Although space characters are read as part of the fields, they are ignored when conversion to numeric takes place.
<<file-io-csv-03a>>=
from_csv_b.df <-
read.csv("extdata/aligned-ASCII-UK.csv", stringsAsFactors = FALSE)
@
<<file-io-csv-03aa>>=
from_csv_b.df
from_csv_b.df[["col4"]]
sapply(from_csv_b.df, class)
@
By default, column names are sanitised but whitespace in character strings is kept. Passing an additional argument changes this default so that leading and trailing whitespace are discarded. Most likely this default was chosen so that data integrity is maintained.
<<file-io-csv-05>>=
from_csv_c.df <-
read.csv("extdata/aligned-ASCII-UK.csv",
stringsAsFactors = FALSE, strip.white = TRUE)
@
<<file-io-csv-05aa>>=
from_csv_c.df
from_csv_c.df[["col4"]]
sapply(from_csv_c.df, class)
@
\begin{explainbox}
When\index{importing data!character to factor conversion} character strings are converted into factors, leading and trailing whitespace is retained in the labels of factor levels. Leading and trailing whitespace are difficult to see when data frames are printed, as shown below. This example shows the kind of problems that were frequently encountered with earlier versions of \Rlang, and that can still occur when factors are created. The recommended approach is to use the default \code{stringsAsFactors = FALSE} and do the conversion into factors in a separate step.
<<file-io-csv-03b>>=
from_csv_b.df <-
read.csv("extdata/aligned-ASCII-UK.csv", stringsAsFactors = TRUE)
@
Using \code{levels()} it can be seen that the labels of the automatically created factor levels contain leading spaces.
<<file-io-csv-04>>=
sapply(from_csv_b.df, class)
from_csv_b.df[["col4"]]
levels(from_csv_b.df[["col4"]])
@
\end{explainbox}
Decimal\index{importing data!decimal marker} points and exponential notation are allowed for floating point values. In English-speaking locales, the decimal mark is a point, while in many other locales it is a comma. The behaviour of \Rlang functions does not change when run under different locales. When a comma is used as the decimal marker, a semicolon (\verb|;|) is used as the field separator.
This case is handled by functions \Rfunction{read.csv2()} and \Rfunction{write.csv2()}. Furthermore, parameters \code{dec} and \code{sep} allow setting the decimal marker and field separator to arbitrary character strings.
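As a minimal sketch of this behaviour, the chunk below writes a small data frame using the comma as decimal marker and the semicolon as field separator, and reads the file back (the file name is arbitrary).
<<file-io-csv-06>>=
# write and read back a CSV file in the "continental European" format
write.csv2(data.frame(x = 1:3, y = c(1.5, 2.5, 3.5)),
           file = "my-file-de.csv", row.names = FALSE)
read.csv2("my-file-de.csv")
@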
Function \Rfunction{read.table()} does the actual work and functions like \Rfunction{read.csv()} only differ in the default arguments for the different parameters. By default, \Rfunction{read.table()} expects fields to be separated by whitespace (one or more spaces, tabs, new lines, or carriage return).
The contents of file \code{aligned-ASCII.txt} are shown below.
<<file-io-txt-00, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII.txt"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_txt\_b.df}, and printed.
Leading and trailing whitespace are removed because they are recognised as part of the separators. For character strings containing embedded spaces to be decoded as a single value, they need to be quoted in the file as in \code{aligned-ASCII.txt} above.
<<file-io-txt-01>>=
from_txt_b.df <-
read.table("extdata/aligned-ASCII.txt",
stringsAsFactors = FALSE, header = TRUE)
@
<<file-io-txt-02>>=
from_txt_b.df
from_txt_b.df[["col4"]]
sapply(from_txt_b.df, class)
@
\index{text files!fixed width fields}
With a fixed-width format, no delimiters are needed. Decoding is based solely on the position of the characters in the line or record. A file like this cannot be interpreted without a description of the format used for saving the data. Files containing data stored in \emph{fixed width format} can be read with function \Rfunction{read.fwf()}. Records for a single observation can be stored in a single or multiple lines. In either case, each line has fields of different but fixed known widths.
Function\index{text files!FORTRAN formatted data transfer} \Rfunction{read.fortran()} is a wrapper on \Rfunction{read.fwf()} that accepts format definitions similar to those used in \langname{FORTRAN}. One particularity of \langname{FORTRAN} \emph{formatted data transfer} is that the decimal marker can be omitted in the saved file and its position specified as part of the format definition, a trick used to make text files (or stacks of punch cards!) smaller. Modern versions of \langname{FORTRAN} support reading from and writing to other formats like those using field delimiters described above.
The contents of file \code{aligned-ASCII.fwf} are shown below.
<<file-io-fwf-00, comment='', echo=FALSE>>=
cat(readLines("extdata/aligned-ASCII.fwf"), sep = "\n")
@
The file is read and the returned value stored in a variable named \code{from\_fwf\_a.df}, and printed. The format definition is passed as a separate character vector argument, e.g., \code{"2F3.1"} describes the format of the first two columns, \code{"I3"} describes the third column and \code{"A3"} the fourth.
<<file-io-fwf-01>>=
from_fwf_a.df <-
read.fortran("extdata/aligned-ASCII.fwf",
format = c("2F3.1", "I3", "A3"),
col.names = c("col1", "col2", "col3", "col4"))
@
<<file-io-fwf-02>>=
from_fwf_a.df
from_fwf_a.df[["col4"]]
sapply(from_fwf_a.df, class)
@
\begin{explainbox}
The file reading functions described above share with \Rfunction{read.table()} the same parameters. In addition to those described above, other frequently useful parameters are \code{skip} and \code{n}, which can be used to skip lines at the top of a file and limit the number of lines (or records) to read; \code{header}, which accepts a logical argument indicating if the fields in the first text line read should be decoded as column names rather than data; \code{na.strings}, to which can be passed a character vector with strings to be interpreted as \code{NA}; and \code{colClasses}, which provides control of the conversion of the fields to \Rlang classes and possibly skipping some columns altogether. All these parameters are described in the corresponding help pages.
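A sketch combining several of these parameters in one call is shown below (not run; the argument values are merely illustrative, assuming a four-column file).
<<file-io-csv-07, eval=FALSE>>=
# not run; argument values are illustrative only
read.csv("extdata/not-aligned-ASCII-UK.csv",
         header = TRUE,     # decode first line read as column names
         na.strings = c("NA", "missing", "-999"),
         colClasses = c("numeric", "numeric", "integer", "character"))
@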
\end{explainbox}
\begin{playground}
In reality, \Rfunction{read.csv()}, \code{read.csv2()}, and \Rfunction{read.table()} are the same function with different default arguments for several of their parameters. Study the help page, and by passing suitable arguments, make \Rfunction{read.csv()} behave like \Rfunction{read.table()}, then make \Rfunction{read.table()} behave like \Rfunction{read.csv2()}.
\end{playground}
\begin{explainbox}
A text file can be read as character strings, without attempting to decode them. This is occasionally useful, such as when the decoding is done in a script, or when one needs to print a file as is. In this case, the function used is \code{readLines()}. The returned value is a character vector in which each member string corresponds to one line or record in the file, with the end-of-line markers stripped (see example in section \ref{sec:io:connections} on page \pageref{sec:io:connections}).
\end{explainbox}
\index{importing data!text files|)}
\index{exporting data!text files|(}
The next example shows how a \emph{write} function matching one of the \emph{read} functions described above can be used to save a data frame to a text file. The \Rfunction{write.csv()} function takes as an argument a data frame, or an object that can be coerced into a data frame, converts it to character strings, and saves them to a text file. A data frame, \code{my.df}, with five rows is enough for a demonstration.
<<file-io-txt-03>>=
my.df <- data.frame(x = 1:5, y = 5:1 / 10, z = letters[1:5])
@
We write \code{my.df} to a CSV file suitable for an English language locale, and then display its contents.
<<file-io-txt-04>>=
write.csv(my.df, file = "my-file1.csv", row.names = FALSE)
file.show("my-file1.csv", pager = "console")
@
<<file-io-txt-05, comment='', echo=FALSE>>=
cat(readLines("my-file1.csv"), sep = "\n")
@
\begin{explainbox}
In most cases setting, as above, \code{row.names = FALSE} when writing a CSV file will help when it is read. Of course, if row names do contain important information, such as gene tags, you cannot skip writing the row names to the file unless you first copy these data into a column in the data frame. (Row names are stored separately as an attribute in \code{data.frame} objects, see section \ref{sec:calc:attributes} on page \pageref{sec:calc:attributes} for details.)
\end{explainbox}
\begin{playground}
Write the data frame \code{my.df} into text files with functions \Rfunction{write.csv2()} and \Rfunction{write.table()} instead of \Rfunction{write.csv()} and display the files.
\end{playground}
Function \Rfunction{cat()} takes \Rlang objects and writes them after conversion to character strings to the console or a file, inserting one or more characters as separators, by default, a space. This separator can be set through parameter \code{sep}. In our example, we set \code{sep} to a new line (entered as the escape sequence \code{"\textbackslash n"}).
<<file-io-txt-11>>=
my.lines <- c("abcd", "hello world", "123.45")
cat(my.lines, file = "my-file2.txt", sep = "\n")
file.show("my-file2.txt", pager = "console")
@
<<file-io-txt-12, comment='', echo=FALSE>>=
cat(readLines('my-file2.txt'), sep = '\n')
@
\index{exporting data!text files|)}
\subsection[\pkgnameNI{readr}]{\pkgname{readr}}\label{sec:files:readr}
\index{importing data!text files|(}
<<eval=FALSE, include=FALSE>>=
citation(package = "readr")
@
Package \pkgname{readr} is part of the \pkgname{tidyverse} suite. It defines functions that have different default behaviour than those native to \Rlang and that are designed to be faster in many situations. The functions from package \pkgname{readr} can sometimes wrongly decode their input, and in rare cases do so silently. The \pkgname{readr} functions guess more properties of the text file format; in most cases they succeed, which is very handy, but occasionally they fail. Automatic guessing can be overridden by passing arguments, and this is recommended for scripts that will be reused to read different files in the future. Another important advantage is that these functions read character strings formatted as dates or times directly into columns of class \code{POSIXct}. All \code{write} functions defined in \pkgname{readr} have an \code{append} parameter, which can be used to change the default behaviour of overwriting an existing file with the same name to appending the output at its end.
Although we exemplify the use of these functions by passing a file name as an argument, as is the case with \Rlang native functions, URLs, and file descriptors are also accepted (see section \ref{sec:io:connections} on page \pageref{sec:io:connections}). The files read are uncompressed, and those written are compressed on the fly if their name ends in \code{.gz}, \code{.bz2}, \code{.xz}, or \code{.zip}.
\begin{warningbox}
Functions ``equivalent'' to native \Rlang functions described in the previous section have names formed by replacing the dot with an underscore, e.g., \Rfunction{read\_csv()} $\approx$ \Rfunction{read.csv()}. The similarity refers to the format of the files read, but not the order, names, or roles of their formal parameters. For example, function \code{read\_table()} has a slightly different behaviour than \Rfunction{read.table()}, although they both read fields separated by whitespace. Row names are not set in the returned \Rclass{tibble}, which inherits from \Rclass{data.frame}, but is not fully compatible (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble}).
\end{warningbox}
\begin{warningbox}
Package \pkgname{readr} is under active development, and functions with the same name from different major versions are not fully compatible. Code for some examples from the first edition of the book no longer works because the updated implementation fails to recognise escaped special characters. Function \Rfunction{read\_table2()} has been renamed \Rfunction{read\_table()}.
\end{warningbox}
These functions report to the console the specifications of the columns, which is important when these are guessed from the file contents, or even only from rows near the top of the file.
<<readr-01>>=
read_csv(file = "extdata/aligned-ASCII-UK.csv", show_col_types = FALSE)
@
<<readr-02>>=
read_csv(file = "extdata/not-aligned-ASCII-UK.csv", show_col_types = FALSE)
@
Because of the misaligned fields in file \code{"not-aligned-ASCII.txt"}, in the past we needed to use \Rfunction{read\_table2()}, which allowed misalignment of fields, similarly to \Rfunction{read.table()}. This function has been renamed \Rfunction{read\_table()}, and \Rfunction{read\_table2()} has been deprecated. However, parsing of both files now fails when they are read with \Rfunction{read\_table()}, as quoted strings containing whitespace are no longer recognised (see the example above using \Rfunction{read.table()}). The examples below are not run, but are kept as they may work again in the future.
<<readr-03, eval=FALSE>>=
read_table(file = "extdata/aligned-ASCII.txt")
@
<<readr-04, eval=FALSE>>=
read_table(file = "extdata/not-aligned-ASCII.txt")
@
Function \Rfunction{read\_delim()} with space as the delimiter succeeds only with the not-aligned file, as in this file the separator is always a single space.
<<readr-05a>>=
read_delim(file = "extdata/not-aligned-ASCII.txt",
delim = " ", show_col_types = FALSE)
@
Function \Rfunction{read\_tsv()} reads files encoded with the tab character as the delimiter, and \Rfunction{read\_fwf()} reads files with fixed width fields. There is, however, no equivalent to \Rfunction{read.fortran()}, supporting implicit decimal points.
\begin{playground}
Use the ``wrong'' \code{read\_} functions to read the example files used above and/or your own files. As mentioned earlier, forcing errors will help you learn how to diagnose when such errors are caused by coding or data entry mistakes. In this case, as wrongly read data are not always accompanied by error or warning messages, carefully check the returned tibbles for misread data values.
\end{playground}
\begin{explainbox}
The functions from \Rlang's package \pkgname{utils} read the whole file before attempting to guess the class of the columns or their alignment. This is reliable but slow for text files with many lines. The functions from \pkgname{readr} read by default only the top 1000 lines (\code{guess\_max = 1000}) when guessing the format and class, assuming that the guessed properties also apply to the remaining lines of the file. This is more efficient, but rather risky. However, the functions from \Rlang's package \pkgname{utils} are faster at reading files with many fields (or columns) per line and few lines.
In earlier versions of \pkgname{readr}, a problem was the failure to correctly decode numeric values when increasingly large numbers resulted in wider fields in the lines below those used for guessing. However, at the time of writing, this case is correctly handled. A guess based on the top lines of a text file also means that when values in lines below \code{guess\_max} lines cannot be converted to numeric, the \code{numeric} column returned contains \code{NA} values. In contrast, in this situation, functions from \Rlang's package \pkgname{utils}, skip decoding and return a \code{character} column. Below, a very small value for \code{guess\_max} is used to demonstrate this behaviour with a file only a few lines in length.
<<tibble-print-20, echo=FALSE>>=
options(tibble.print_max = 6, tibble.print_min = 6)
@
<<readr-06>>=
read_table(file = "extdata/miss-aligned-ASCII.txt", show_col_types = FALSE)
@
<<readr-07>>=
read_table(file = "extdata/miss-aligned-ASCII.txt", show_col_types = FALSE,
guess_max = 3L)
@
\end{explainbox}
\index{importing data!text files|)}
<<tibble-print-21, echo=FALSE>>=
options(tibble.print_max = 3, tibble.print_min = 3)
@
\index{exporting data!text files|(}
The \code{write\_} functions from \pkgname{readr} are the counterpart to \code{write.} functions from \pkgname{utils}. In addition to the expected \Rfunction{write\_csv()}, \Rfunction{write\_csv2()}, \Rfunction{write\_tsv()} and \Rfunction{write\_delim()}, \pkgname{readr} provides functions that write \pgrmname{MS-Excel}-friendly CSV files. Function \Rfunction{write\_excel\_csv()} saves a text file with comma-separated fields suitable for import into \pgrmname{MS-Excel}.
<<readr-10>>=
write_excel_csv(my.df, file = "my-file6.csv")
file.show("my-file6.csv", pager = "console")
@
<<readr-11, comment='', echo=FALSE>>=
cat(read_lines('my-file6.csv'), sep = '\n')
@
\begin{playground}
Compare the output from \Rfunction{write\_excel\_csv()} and \Rfunction{write\_csv()}. What is the difference? Does it matter when you import the written CSV file into Excel (in the version you are using, and with the locale settings of your computer)?
\end{playground}
The pair of functions \Rfunction{read\_lines()} and \Rfunction{write\_lines()} read and write character vectors without conversion, similarly to base \Rlang \code{readLines()} and \code{writeLines()}. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} read and write the contents of a whole text file into, and from, a single character string. Functions \Rfunction{read\_file()} and \Rfunction{write\_file()} can also be used with raw vectors to read and write binary files or text files of unknown encoding.
The contents of the whole file are returned as a character vector of length one, with the embedded new line markers. We use \code{cat()} to print it so these new line characters force the start of a new print-out line.
<<readr-12>>=
one.str <- read_file(file = "extdata/miss-aligned-ASCII.txt")
length(one.str)
cat(one.str)
@
\begin{advplayground}
Use \Rfunction{write\_file()} to write a file that can be read with \Rfunction{read\_csv()}.
\end{advplayground}
\index{exporting data!text files|)}
\section{XML and HTML Files}
\index{importing data!XML and HTML files|(}
\index{file formats!HTML}\index{file formats!XML}
XML files contain text with special markup. Several modern data exchange formats are based on the \langname{XML} standard (see \url{https://www.w3.org/TR/xml/}) which uses schemas for flexibility. Schemas define specific formats, allowing reading of formats not specifically targeted during development of the read functions. Even the modern \langname{XHTML} standard used for web pages is based on such schemas, while \langname{HTML} only differs slightly in its syntax.
\subsection[\pkgnameNI{xml2}]{\pkgname{xml2}}
<<xml2-00, eval=FALSE, include=FALSE>>=
citation(package = "xml2")
@
Package \pkgname{xml2} provides functions for reading and parsing \langname{XML} and \langname{HTML} files. This is a vast subject, of which I will only give a brief example.
Function \Rfunction{read\_html()} can be used to read an HTML document, either locally or from a URL as below.
<<xml2-01>>=
web_page <- read_html("https://www.learnr-book.info/")
@
Function \Rfunction{html\_structure()} displays the structure of an HTML document (long text output not shown).
<<xml2-01a, eval=FALSE>>=
html_structure(web_page)
@
Function \Rfunction{xml\_text()} extracts the text content of a node. Function \Rfunction{xml\_find\_all()} returns all the nodes matching a search pattern, given as an XPath expression. Here, the two functions are used together to extract the text of the \code{title} node.
<<xml2-02>>=
xml_text(xml_find_all(web_page, ".//title"))
@
The functions defined in this package can be used to ``harvest'' data from web pages, but also to read data from files using formats that are defined through \langname{XML} schemas.
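As a simple sketch of such ``harvesting'', the chunk below finds all link nodes in the page read above and extracts their destinations from the \code{href} attribute with \Rfunction{xml\_attr()}.
<<xml2-03>>=
links <- xml_find_all(web_page, ".//a")
head(xml_attr(links, "href"))
@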
\index{importing data!XML and HTML files|)}
\section{GPX Files}
\index{importing data!GPX files|(}\index{file formats!GPX}
GPX\index{importing data!geographic positioning system} (GPS Exchange Format) files use an XML scheme designed for saving and exchanging data from geographic positioning systems (GPS). There is some variation in the variables saved, depending on the settings of the GPS receiver. The example data used here are from a Transmeta BT747 GPS logger. The example below reads the data into a \code{tibble} as character strings. For plotting, the character values representing numbers and dates would need to be converted to numeric and datetime (\code{POSIXct}) values, respectively. In the case of plotting tracks on a map, it is preferable to use package \pkgname{sf} to import the tracks directly from the \code{.gpx} file into a layer (use of the \Rlang pipe operator is described in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}).
<<gps-01>>=
xmlTreeParse(file = "extdata/GPSDATA.gpx", useInternalNodes = TRUE) |>
xmlRoot(x = _) |>
xmlToList(node = _) |>
_[["trk"]] |>
assign(x = "temp", value = _) |>
_[names(x = temp) == "trkseg"] |>
unlist(x = _, recursive = FALSE) |>
map_df(.x = _, .f = function(x) as_tibble(x = t(x = unlist(x = x))))
rm(temp) # cleanup
@
\begin{playground}
To understand what data transformation takes place in each statement of this pipe, start by running the first statement by itself, excluding the pipe operator, and continue adding one statement at a time, and at each step check the returned value and look out for what has changed from the previous step. Optionally you can insert a line \code{print() |>} at the point where you wish to see the data being ``piped''.
\end{playground}
\index{importing data!GPX files|)}
\section{Worksheets}\label{sec:files:worksheets}
\index{importing data!worksheets and workbooks|(}
Microsoft Office, Open Office, and Libre Office are the most frequently used suites containing programs based on the worksheet paradigm. A standardised file format for the exchange of worksheet data exists, but it does not support all the features present in native file formats. We will start by considering \pgrmname{MS-Excel}. The file format used by \pgrmname{MS-Excel} has changed significantly over the years, and old formats tend to be less well supported by available \Rlang packages; they may require the file to be updated to a more modern format with \pgrmname{MS-Excel} itself before import into \Rlang. The current format is based on XML and is relatively simple to decode, whereas older binary formats are more difficult. Worksheets contain code, as equations, in addition to the actual data. In all cases, only values entered as such, or those computed by means of the embedded equations, can be imported into \Rlang; the equations themselves cannot.
\begin{warningbox}
When directly reading from a worksheet, a column of cells with mixed type can introduce \code{NA} values. A wrongly selected cell range from the worksheet can result in missing columns or rows, if the area is too small, or in rows or columns filled with \code{NA} values, if the range includes empty cells in the worksheet. Depending on the function used, it may be possible to ignore empty cells by passing an argument.
Many problems related to the import of data from worksheets and workbooks are due to translation between two different formats that impose different restrictions on what is allowed. While in a worksheet it is allowed to set the ``format'' (as it is called in \pgrmname{Excel}, and roughly equivalent to \code{mode} in \Rlang) of individual cells, a variable (column) in an \Rlang data frame is expected to be a vector, and thus to contain members belonging to the same \code{mode} or type. For the import to work as expected, the ``format'' must be consistent, i.e., all cells in a column to be imported are marked as one of the \code{Number}, \code{Date}, \code{Time}, or \code{Text} formats, with the possible exception of a \emph{single row} of column headers with the names of the variables as \code{Text}. The default format \code{General} also works, but as it does not ensure consistency, it makes it more difficult to see format inconsistencies at a glance in Excel.
When reading a \code{CSV} file, text representing numbers will be recognised and converted, but only if the decimal point is encoded as expected from the arguments passed in the function call. So a single number with a comma instead of a dot as decimal marker (or vice versa) will in most cases result in the column not being decoded as numbers but returned as a \code{character} vector (or column) in the data frame. In the case of package \pkgname{readr}, a \code{numeric} vector containing \code{NA} values for the non-decoded text may be returned instead of a \code{character} vector, depending on whether the wrong decimal marker appears near the top or near the end of the file.
When importing data from a worksheet or workbook, my recommendation is first to check it in the original software to ensure that the cells to be imported are encoded as expected. When using a \code{CSV} as an intermediate step, it is crucial to also open this file in a plain-text editor such as the editor pane in \RStudio (or \pgrmname{Notepad} in \pgrmnameNI{Windows} or \pgrmname{Nano}, \pgrmname{Emacs}, etc., in \pgrmnameNI{Unix} and \pgrmnameNI{Linux}). Based on what field separator, decimal mark, and possibly character encoding has been used, which depends on the locale settings in the operating system of the computer and in the worksheet program, select a suitable function to call and the necessary arguments to pass to it.
\end{warningbox}
\subsection{CSV files as middlemen}
If we have access to the original software used for creating a worksheet or workbook, then exporting worksheets to text files in CSV format and importing them into \Rlang using the functions described in sections \ref{sec:files:txt} and \ref{sec:files:readr} starting on pages \pageref{sec:files:txt} and \pageref{sec:files:readr} provides a broadly compatible route for importing data---with the caveat that one must ensure that delimiters and decimal marks match the expectations of the functions used. This approach is not ideal from the perspective of having to create intermediate \code{CSV} formatted text files. A better approach is, when feasible, to import the data directly from the workbook or worksheets into \Rlang.
\subsection[\pkgnameNI{readxl}]{\pkgname{readxl}}\label{sec:files:excel}
\index{importing data!.xlsx files|(}\index{file formats!xlsx}
<<readxl-00, eval=FALSE, include=FALSE>>=
citation(package = "readxl")
@
Package \pkgname{readxl} supports reading of \pgrmname{MS-Excel} workbooks, and selecting worksheets and regions within worksheets specified in ways similar to those used by \pgrmname{MS-Excel} itself. The interface is simple, and the package easy to install. We will import a file that in \pgrmname{MS-Excel} looks like the screen capture below.
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/Book1-xlsx.png}
\end{center}
Function \Rfunction{excel\_sheets()} lists the sheets contained in the workbook.
<<readxl-01>>=
sheets <- excel_sheets("extdata/Book1.xlsx")
sheets
@
In this case, the argument passed to \code{sheet} is redundant, as there is only a single worksheet in the file. It is possible to use either the name of the sheet or a positional index (in this case \code{1} would be equivalent to \code{"my data"}). Function \Rfunction{read\_excel()} with no range specification imports the whole worksheet into a tibble, as can be expected from a package included in the \pkgname{tidyverse}.
<<readxl-02>>=
Book1.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data")
Book1.df
@
It is also possible to read a region instead of the whole worksheet.
<<readxl-03>>=
Book1_region.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data",
range = "A1:B8")
Book1_region.df
@
Of the remaining arguments, the most useful ones have the same names and play similar roles as in \pkgname{readr} (see section \ref{sec:files:readr} on page \pageref{sec:files:readr}). For example, new names for the columns can be passed as an argument to override the names in the worksheet.
<<readxl-04>>=
Book1_region.df <- read_excel("extdata/Book1.xlsx",
sheet = "my data",
range = "A2:B8",
col_names = c("A", "B"))
Book1_region.df
@
\subsection[\pkgnameNI{xlsx}]{\pkgname{xlsx}}
<<xlsx-00, eval=FALSE, include=FALSE>>=
citation(package = "xlsx")
@
Package \pkgname{xlsx} can be more difficult to install as it uses Java functions to do the actual work. However, it is more comprehensive, with functions both for reading and writing \pgrmname{MS-Excel} worksheets and workbooks, in different formats including the older binary ones. Similarly to \pkgname{readxl}, it allows selected regions of a worksheet to be imported.
Function \Rfunction{read.xlsx()} can be used to read a worksheet, indexing it by name. The returned value is a data frame, and following the expectations of \Rlang package \pkgnameNI{utils}, character columns are \emph{no longer} converted into factors by default.
<<xlsx-01>>=
Book1_xlsx.df <- read.xlsx("extdata/Book1.xlsx",
sheetName = "my data")
Book1_xlsx.df
sapply(Book1_xlsx.df, class)
@
With function \Rfunction{write.xlsx()}, we can write data frames out to Excel worksheets and even append new worksheets to an existing workbook.
<<xlsx-05>>=
set.seed(456321)
my.data <- data.frame(x = 1:10, y = letters[1:10])
write.xlsx(my.data,
file = "extdata/my-data.xlsx",
sheetName = "first copy")
write.xlsx(my.data,
file = "extdata/my-data.xlsx",
sheetName = "second copy",
append = TRUE)
@
When opened in Excel, we get a workbook containing two worksheets, named using the arguments we passed through \code{sheetName} in the code chunk above.
% screen capture to be replaced!!
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/my-data-xlsx.png}
\end{center}
\begin{playground}
If you have some worksheet files available, import them into \Rlang to get a feel for how the organisation of the data in the worksheets affects how easy or difficult it is to import them into \Rlang.
\end{playground}
\index{importing data!.xlsx files|)}
\subsection[\pkgnameNI{readODS}]{\pkgname{readODS}}
\index{importing data!.ods files|(}\index{file formats!ODS}
Package \pkgname{readODS} provides functions for reading data saved in files that follow the \emph{Open Documents Standard}. Function \Rfunction{read\_ods()} has a similar user interface to that of \code{read\_excel()} and reads one worksheet at a time, with support only for skipping top rows and selecting ranges of columns and rows. The value returned is a tibble or, optionally, a data frame. Function \Rfunction{read\_fods()} reads flat ODS files.
<<readODS-00>>=
list_ods_sheets("extdata/Book1.ods")
@
<<readODS-01>>=
ods.df <- read_ods("extdata/Book1.ods", sheet = 1)
@
<<readODS-02>>=
ods.df
@
Functions \Rfunction{write\_ods()} and \Rfunction{write\_fods()} write a data frame into an ODS or FODS file.
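A minimal sketch of exporting is shown below (not run; the file name is arbitrary).
<<readODS-03, eval=FALSE>>=
# not run; write the tibble read above to a new ODS file
write_ods(ods.df, path = "data/my-data.ods")
@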
\index{importing data!.ods files|)}
\index{importing data!worksheets and workbooks|)}
\section{Statistical Software}\label{sec:files:stat}
\index{importing data!other statistical software|(}
There are two comprehensive packages for importing data saved by other statistical programs such as \pgrmname{SAS}, \pgrmname{Statistica}, \pgrmname{SPSS}, etc. The longtime ``standard'' is package \pkgname{foreign}, included in the \Rlang distribution, while package \pkgname{haven} is a newer contributed extension. In the case of files saved with old versions of statistical programs, functions from \pkgname{foreign} tend to be more robust than those from \pkgname{haven}.
\subsection[\pkgnameNI{foreign}]{\pkgname{foreign}}
<<foreign-00, eval=FALSE, include=FALSE>>=
citation(package = "foreign")
@
Functions\index{importing data!from SAS}\index{importing data!from Stata}\index{importing data!from SPSS}\index{importing data!from Systat} in package \pkgname{foreign} allow us to import data from files saved by several statistical analysis programs, including \pgrmname{SAS}, \pgrmname{Stata}, \pgrmname{SPSS}, \pgrmname{Systat}, and \pgrmname{Octave}, and to write data to files in the formats native to \pgrmname{SAS}, \pgrmname{Stata}, and \pgrmname{SPSS}. \Rlang documents the use of these functions in detail in the \emph{R Data Import/Export} manual. As a simple example, we use function \Rfunction{read.spss()} to read a \texttt{.sav} file, saved a few years ago with the then-current version of \pgrmname{SPSS}. Only the first six rows and seven columns of the data frame are shown, including a column with dates, which appears as numeric.
<<foreign-01>>=
my_spss.df <- read.spss(file = "extdata/my-data.sav", to.data.frame = TRUE)
my_spss.df[1:6, c(1:6, 17)]
@
A second example, this time with a simple \code{.sav} file saved 15 years ago.
<<foreign-02>>=
thiamin.df <- read.spss(file = "extdata/thiamin.sav", to.data.frame = TRUE)
head(thiamin.df)
@
Another example, this time a \pgrmname{Systat} file saved on a PC more than 20 years ago and read with \Rfunction{read.systat()}.
<<foreign-03>>=
my_systat.df <- read.systat(file = "extdata/BIRCH1.SYS")
head(my_systat.df)
@
Not all functions in \pkgname{foreign} return data frames by default, but the objects they return can be converted into data frames.
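For example, by default (\code{to.data.frame = FALSE}) \Rfunction{read.spss()} returns a named list, which can be converted afterwards. A minimal sketch, not evaluated here:
<<foreign-04, eval=FALSE>>=
thiamin.lst <- read.spss(file = "extdata/thiamin.sav")
class(thiamin.lst) # a list, not a data frame
thiamin_coerced.df <- as.data.frame(thiamin.lst)
@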
\subsection[\pkgnameNI{haven}]{\pkgname{haven}}
<<haven-00, eval=FALSE, include=FALSE>>=
citation(package = "haven")
@
Package \pkgname{haven} is less ambitious with respect to the number of formats and vintages supported, providing read and write functions for only three file formats: \pgrmname{SAS}, \pgrmname{Stata}, and \pgrmname{SPSS}. On the other hand, \pkgname{haven} provides flexible ways to convert the different labelled values that cannot be directly mapped to \Rlang modes. Its functions also decode dates and times according to the idiosyncrasies of each of these file formats. When an imported file contains labelled values, the returned \Rclass{tibble} object needs some additional attention from the user. Labelled numeric columns in \pgrmname{SPSS} are not always equivalent to factors, so their conversion to factors cannot be automated and must be done manually in a separate step.
Function \Rfunction{read\_sav()} can be used to import a \code{.sav} file saved by a recent version of \pgrmname{SPSS}. As in the previous section, we display below only the first six rows and seven columns of the data frame, including a column \code{treat} containing a labelled numeric vector and \code{harvest\_date} with dates encoded as \Rlang date values.
<<haven-01>>=
my_spss.tb <- read_sav(file = "extdata/my-data.sav")
my_spss.tb[1:6, c(1:6, 17)]
@
In this case, the dates are correctly decoded.
Next, we import an older \pgrmname{SPSS} \code{.sav} file, the same one read above with \Rfunction{read.spss()}.
<<haven-02>>=
thiamin.tb <- read_sav(file = "extdata/thiamin.sav")
thiamin.tb
@
The labelled values are not converted into factors automatically. Function \Rfunction{as\_factor()} from \pkgname{haven} does this conversion in a separate step.
<<haven-02a>>=
thiamin.tb <- as_factor(thiamin.tb)
thiamin.tb
@
\begin{playground}
Compare the values returned by different \code{read} functions when applied to the same file on disk. Use \Rfunction{names()}, \Rfunction{str()}, and \Rfunction{class()} as tools in your exploration. If you are brave, also use \Rfunction{attributes()}, \Rfunction{mode()}, \Rfunction{dim()}, \Rfunction{dimnames()}, \Rfunction{nrow()}, and \Rfunction{ncol()}.
\end{playground}
\begin{playground}
If you use or have in the past used other statistical software or a general-purpose language like \langname{Python}, look for some old files and import them into \Rlang.
\end{playground}
\index{importing data!other statistical software|)}
\section{NetCDF Files}\label{sec:dataio:netcdf}
\index{importing data!NetCDF files|(}\index{file formats!NetCDF}
In some fields, including geophysics and meteorology, \pgrmname{NetCDF} is a very common format for the exchange of data. It is also used in other contexts in which data are referenced to a grid of locations, such as data from the Affymetrix microarrays used to study gene expression. \pgrmname{NetCDF} files are binary but use a format that allows the storage of metadata describing each variable together with the data itself, in a well-organised and standardised way that is ideal for the exchange of moderately large data sets measured on a spatial or spatio-temporal grid.
\pgrmname{NetCDF} is officially described as follows:
\begin{quote}
\pgrmname{NetCDF} is a set of software libraries [from Unidata] and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
\end{quote}
That \pgrmname{NetCDF} files can be read selectively, extracting the data from individual variables, is important as it makes it possible to compute in \Rlang with data sets too big to fit in a computer's RAM. Selective reading is supported by functions from packages \pkgname{ncdf4} and \pkgname{RNetCDF}. As a consequence of this flexibility, and contrary to other data file reading operations, reading a \pgrmname{NetCDF} file is done in multiple steps---i.e., opening the file, reading metadata describing the variables and spatial grid, and finally selectively reading the data of interest.
\subsection[\pkgnameNI{ncdf4}]{\pkgname{ncdf4}}
<<ncdf4-00, eval=FALSE, include=FALSE>>=
citation(package = "ncdf4")
@
Package \pkgname{ncdf4} supports reading of files using \pgrmname{NetCDF} version 4 or earlier formats. Functions in \pkgname{ncdf4} not only allow reading and writing of these files, but also their modification.
Below, file \code{pevpr.sfc.mon.ltm.nc}, containing meteorological data, is first opened with function \Rfunction{nc\_open()}. The returned object, saved as \code{meteo\_data.nc}, contains only an index into the file contents and plays the role of a file connection; its structure is displayed with a call to \code{str()}.
<<ncdf4-01>>=
meteo_data.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
str(meteo_data.nc, max.level = 1)
@
\begin{advplayground}
Increase \code{max.level} in the call to \Rfunction{str()} above and study how the connection object stores information on the dimensions and for each data variable. You can also \code{print(meteo\_data.nc)} for a more complete printout once you have understood the structure of the object.
\end{advplayground}
The dimensions of the data array are stored as metadata; in the file used here, they map the array indexes to a grid of longitudes and latitudes, with time as a third dimension. The time values are stored as numbers and are decoded below. The variables describing the grid are read one at a time with function \Rfunction{ncvar\_get()}.
<<ncdf4-02>>=
time.vec <- ncvar_get(meteo_data.nc, "time")
head(time.vec)
longitude <- ncvar_get(meteo_data.nc, "lon")
head(longitude)
latitude <- ncvar_get(meteo_data.nc, "lat")
head(latitude)
@%
\pagebreak
The \code{time} vector contains only monthly values, as the file contains a long-term series of monthly averages, expressed as days from 1800-01-01 and corresponding to the first day of each month of year~``1''. We use package \pkgname{lubridate} for the conversion. To find the indexes for the grid point of interest, it is necessary to study the vectors \code{longitude} and \code{latitude} saved above.
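As a quick check of this decoding, the values in \code{time.vec} can be converted with functions \Rfunction{ymd()} and \Rfunction{days()} from \pkgname{lubridate}:
<<ncdf4-02a>>=
ymd("1800-01-01") + days(head(time.vec))
@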
Next, the potential evapotranspiration is read for one grid point, and used to construct a data frame, with some values recycled.
<<ncdf4-03>>=
pet.tb <-
  tibble(time = time.vec,
         month = month(ymd("1800-01-01") + days(time)),
         lon = longitude[6],
         lat = latitude[2],
         pet = ncvar_get(meteo_data.nc, "pevpr")[6, 2, ]
  )
pet.tb
@
To read data for several grid points, different approaches are available. However, the order of nesting of dimensions can make adding the dimensions as columns error prone. It is much simpler to use package \pkgnameNI{tidync} described next.
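For comparison, a minimal sketch with \pkgname{ncdf4} of reading a block of grid points in one call, using the \code{start} and \code{count} arguments of \Rfunction{ncvar\_get()} (a count of \code{-1} reads all values along that dimension; the indexes shown are only an illustration):
<<ncdf4-04>>=
pet.block <- ncvar_get(meteo_data.nc, "pevpr",
                       start = c(6, 2, 1),
                       count = c(2, 2, -1))
dim(pet.block) # lon x lat x time
@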
\subsection[\pkgnameNI{tidync}]{\pkgname{tidync}}
<<tidync-00, eval=FALSE, include=FALSE>>=
citation(package = "tidync")
@
Package \pkgname{tidync} provides functions that make it easy to extract subsets of the data from a \pgrmname{NetCDF} file. The initial steps are the same operations as in the examples for \pkgnameNI{ncdf4}.
Function \Rfunction{tidync()} is used to open the file and simultaneously activate the first grid. The returned object is saved as \code{meteo\_data.tnc}. This object is subsequently used to access the file, and when printed displays a summary of the file structure and data encoding.
<<tidync-01>>=
meteo_data.tnc <- tidync("extdata/pevpr.sfc.mon.ltm.nc")
meteo_data.tnc
@
Function \Rfunction{hyper\_dims()} returns a description of the grid for which observations are available.
<<tidync-01a>>=
hyper_dims(meteo_data.tnc)
@
Function \Rfunction{hyper\_vars()} returns a description of the observations or variables available at each grid point.
<<tidync-01b>>=
hyper_vars(meteo_data.tnc)
@
Function \Rfunction{hyper\_tibble()} extracts a subset of the data into a tibble in long (or tidy) format. The selection of the grid point is done in the same operation, in this case using \code{signif()} to test for an approximate match to the actual longitude and latitude values. The decoded dates could be added with the pipe operator (\code{|>}) and methods from \pkgname{dplyr} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise}), using functions from package \pkgname{lubridate} for the decoding (see section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}). This step is shown commented out below because the version of \pkgname{tidync} used fails to include variable \code{time} in the returned tibble.
<<tidync-02>>=
hyper_tibble(meteo_data.tnc,
             lon = signif(lon, 1) == 9,
             lat = signif(lat, 2) == 87)
# 'tidync' (== 0.4.0) fails to include variable time in the tibble
# |>
#   mutate(.data = _, month = month(ymd("1800-01-01") + days(time))) |>
#   select(.data = _, -time)
@
In this second example, data are extracted for all grid points along latitudes, by omitting the test on \code{lat} from the chunk above. The tibble is assembled automatically, with columns added for the active dimensions. In this case, variable \code{time} is included in the returned tibble, and the decoding of the months, commented out above, works as intended.
<<tidync-03>>=
hyper_tibble(meteo_data.tnc,
             lon = signif(lon, 1) == 9) |>
  mutate(.data = _, month = month(ymd("1800-01-01") + days(time))) |>
  select(.data = _, -time)
@
\begin{playground}
Instead of extracting data for one longitude across latitudes, extract data across longitudes for one latitude near the Equator.
\end{playground}
\index{importing data!NetCDF files|)}
\section{Remotely Located Data}\label{sec:files:remote}
\index{importing data!remote connections|(}
Many\index{importing data!remotely}\index{importing data!using URL}\index{files!downloading} of the functions described above accept a URL address in place of a file name, so files can be read remotely without having first to download and save a copy in the local file system. This can be useful, especially when file names are generated within a script. However, one should avoid, especially in the case of servers open to public access, repeatedly downloading the same file, as this unnecessarily increases network traffic and the workload on the remote server. Because of this, our first example reads a small file from my own web site. See section \ref{sec:files:txt} on page \pageref{sec:files:txt} for details on the use of these and other functions for reading text files.
<<url-01, eval=eval_online_data>>=
logger.df <-
  read.csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
            header = FALSE,
            col.names = c("time", "temperature"))
sapply(logger.df, class)
@
While functions in package \pkgname{readr} support the use of URLs, those in packages \pkgname{readxl} and \pkgname{xlsx} do not. Consequently, the file has to be first downloaded and saved locally, and subsequently imported as described in section \ref{sec:files:excel} on page \pageref{sec:files:excel}. Function \Rfunction{download.file()} in the \Rlang \pkgname{utils} package can be used to download files using URLs. It supports different modes such as binary or text for the contents, and write or append for the local file, and different methods such as \code{"internal"}, \code{"wget"}, and \code{"libcurl"}.
\begin{warningbox}
For portability, \pgrmname{MS-Excel} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}
<<url-11, eval=eval_online_data>>=
download.file("http://r4photobiology.info/learnr/my-data.xlsx",
              "data/my-data-dwn.xlsx",
              mode = "wb")
@%
\pagebreak
Functions from packages \pkgname{foreign} and \pkgname{haven}, useful for reading files saved by other statistical software, support URLs. See section \ref{sec:files:stat} on page \pageref{sec:files:stat} for more information about importing this kind of data into \Rlang. The two examples below read a file saved by \pgrmname{SPSS} located in a remote server, using these two packages.
<<url-03, eval=eval_online_data>>=
remote_thiamin.df <-
  read.spss(file = "http://r4photobiology.info/learnr/thiamin.sav",
            to.data.frame = TRUE)
head(remote_thiamin.df)
@
<<url-04, eval=eval_online_data>>=
remote_my_spss.tb <-
  read_sav(file = "http://r4photobiology.info/learnr/thiamin.sav")
remote_my_spss.tb
@
Next, we download from NOAA's server a NetCDF file with long-term means for potential evapotranspiration, the same file used above in the \pkgname{ncdf4} example. This is a moderately large file at 834~KB. In this case, it is not possible to open a connection directly to the remote NetCDF file, so it has to be downloaded first. The \code{if} statement ensures that the file is downloaded only if a local copy is missing (to refresh the local copy, simply delete the existing file). Once downloaded, the file can be opened as shown in section \ref{sec:dataio:netcdf} on page \pageref{sec:dataio:netcdf}.
<<url-05, eval=eval_online_data>>=
if (!file.exists("extdata/pevpr.sfc.mon.ltm.nc")) {
  my.url <- paste("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/",
                  "surface_gauss/pevpr.sfc.mon.ltm.nc",
                  sep = "")
  download.file(my.url,
                mode = "wb",
                destfile = "extdata/pevpr.sfc.mon.ltm.nc")
}
pet_ltm.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
@
\begin{warningbox}
For portability, \pgrmname{NetCDF} files should be downloaded in binary mode, setting \code{mode = "wb"}, which is required under \osname{MS-Windows}.
\end{warningbox}
\index{importing data!remote connections|)}
\begin{warningbox}
Some NetCDF file servers support the OPeNDAP protocol. With these servers, it is possible to open a file remotely and download only part of its contents. Function \Rfunction{open.nc()} from package \pkgname{RNetCDF} transparently supports OPeNDAP URLs, as sketched below.
\end{warningbox}
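A minimal sketch of this approach, using a hypothetical OPeNDAP URL and not evaluated here:
<<RNetCDF-01, eval=FALSE>>=
library(RNetCDF)
# hypothetical OPeNDAP end point, replace with a real one
remote.nc <- open.nc("https://some.server.org/thredds/dodsC/some-file.nc")
print.nc(remote.nc) # summary of dimensions and variables
# read only part of variable "pevpr": a 2 x 2 x all-times block
pevpr.block <- var.get.nc(remote.nc, "pevpr",
                          start = c(6, 2, 1),
                          count = c(2, 2, NA))
close.nc(remote.nc)
@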
\section{Databases}\label{sec:data:db}
\index{importing data!databases|(}
<<dbplyr-00a, eval=FALSE, include=FALSE>>=
citation(package = "dbplyr")
@
<<dbplyr-00b, eval=FALSE, include=FALSE>>=
citation(package = "DBI")
@
<<dbplyr-00c, eval=FALSE, include=FALSE>>=
citation(package = "RSQLite")
@
One of the advantages of using databases is that subsets of cases and variables can be retrieved, even remotely, making it possible to work in \Rlang both locally and remotely with huge data sets. One should remember that \Rlang natively keeps whole objects in RAM, and consequently, available machine memory limits the size of data sets with which it is possible to work. Package \pkgname{dbplyr} provides the tools to work with data in databases using the same verbs as when using \pkgname{dplyr} with data stored in memory (RAM) (see chapter \ref{chap:R:data}). This is an important subject, but extensive enough to be outside the scope of this book. We provide a few simple examples to show the very basics but interested readers should consult \citebooktitle{Wickham2023a} \autocite{Wickham2023a}.
The additional steps compared to using \pkgname{dplyr} start with the need to establish a connection to a local or remote database. In the example below, \Rlang package \pkgname{RSQLite} is used to create a temporary, in-memory \pgrmname{SQLite} database; \pkgname{dbplyr} back ends supporting other database systems are also available. The meteorological data from package \pkgname{learnrbook} provide the example data.
<<dbplyr-01>>=
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, weather_wk_25_2019.tb, "weather",
        temporary = FALSE,
        indexes = list(
          c("month_name", "calendar_year", "solar_time"),
          "time",
          "sun_elevation",