% !Rnw root = appendix.main.Rnw
<<echo=FALSE, cache=FALSE>>=
set_parent('r4p.main.Rnw')
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'scripts-chunk')
@
\chapter{Base \Rlang: ``Paragraphs'' and ``Essays''}\label{chap:R:scripts}
\index{scripts}
\begin{VF}
An \Rlang script is simply a text file containing (almost) the same commands that you would enter on the command line of \Rlang.
\VA{Jim Lemon}{\emph{Kickstarting R}}\nocite{LemonND}
\end{VF}
%\dictum[\href{https://cran.r-project.org/doc/contrib/Lemon-kickstart/}{Kickstarting R}]{An R script is simply a text file containing (almost) the same commands that you would enter on the command line of R.}\vskip2ex
\section{Aims of This Chapter}
For those who have mainly used graphical user interfaces, understanding why and when scripts can help in communicating a certain data analysis protocol can be revelatory. As soon as a data analysis stops being trivial, describing the steps followed through a system of menus and dialogue boxes becomes extremely tedious.
Moreover, graphical user interfaces tend to be difficult to extend or improve in a way that keeps step-by-step instructions valid across program versions and operating systems.
Many times, exactly the same sequence of commands needs to be applied to different data sets, and scripts make both implementation and validation of such a requirement easy.
In this chapter, I will walk you through the use of \Rpgrm scripts, starting from an extremely simple script.
\section{Writing Scripts}
In the \Rlang language, the closest match to a natural language essay is a script. A script is built from multiple interconnected code statements needed to complete a given task. Simple statements, equivalent to sentences, can be combined into compound statements, equivalent to natural language paragraphs. Frequently, we combine simple sequences of statements into a sequence of actions necessary to complete a task. The sequence is not necessarily linear, as branching and repetition are also available.
Scripts can vary from simple scripts containing only a few code statements, to complex scripts containing hundreds of code statements. In the rest of the present section I discuss how to write readable and reliable scripts and how to use them.
\subsection{What is a script?}\label{sec:script:what:is}
\index{scripts!definition}
A \textit{script} is a text file that contains (almost) the same commands that you would type at the \Rlang console prompt. A true script is not, for example, an MS-Word file where you have pasted or typed some \Rlang commands.
When typing commands/statements at the \Rlang console, we ``feed'' one line of text at a time. When we end the line by pressing the enter key, the line of text is interpreted and evaluated. We then type the next line of text, which gets in turn interpreted and evaluated, and so on. In a script we write nearly the same text in an editor and save multiple lines containing commands into a text file. Interpretation takes place only later, when we \emph{source} the file as a whole into \Rlang.
A script file has the following characteristics.
\begin{itemize}
\item The script is a plain text file, i.e., a file containing bytes that represent alphanumeric characters in a standardised character set like UTF-8 or ASCII.
\item The text in the file contains valid \Rlang statements (including comments) and nothing else.
\item Comments start at a \code{\#} and end at the end of the line.
\item The \Rlang statements are in the file in the order that they must be executed, and respecting the line continuation rules of \Rlang.
\item \Rlang scripts customarily have file names ending in \texttt{.r} or \texttt{.R}.
\end{itemize}
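As a minimal sketch of these characteristics (the variable name \code{a} is only illustrative), the lines below could be the complete contents of a script file. The assignment spans two lines because the first of them ends in an operator and is thus incomplete.

<<>>=
# a comment occupying a whole line
a <- 1 + 2 +  # incomplete statement, continues on the next line
  3
print(a)      # displays [1] 6
@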
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop, color = blue, fill = blue!15] {\textsl{Top (start)}};
\node (stat2) [process, color = blue, fill = blue!15, below of=start] {\code{<statement A>}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2] {\code{<statement B>}};
\node (continue) [startstop, color = blue, fill = blue!15, below of=stat3] {$\cdots$};
\node (stop) [startstop, color = blue, fill = blue!15, below of=continue] {\textsl{Bottom (end)}};
\draw [arrow, color = blue] (start) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (stat3);
\draw [arrow, color = blue] (stat3) -- (continue);
\draw [arrow, color = blue] (continue) -- (stop);
\end{tikzpicture}
\end{small}
\caption[Code statements in a script.]{Diagram of script showing sequentially evaluated code statements; \textcolor{blue}{$\cdots$} represent additional statements in the script.}\label{fig:script}
\end{figure}
The statements in the text file are read, interpreted, and evaluated sequentially, from the start to the end of the file, as represented in the diagram (Figure \ref{fig:script}).
As we will see later in the chapter, code statements can be combined into larger statements and evaluated conditionally and/or repeatedly, which allows us to control the realised sequence of evaluated statements.
In addition to being valid, it is important that scripts are also understandable to humans. Consequently, a clear writing style and consistent adherence to it are important.
It is good practice to write scripts so that they are self-contained. To make a script self-contained, one must include code to load the packages used, load or import data from files, perform the data analysis, and display and/or save the results of the analysis. Such scripts can be used to apply the same analysis algorithm to other data by reading data from a different file and/or to reproduce the same analysis at a later time using the same data. Such scripts document all steps used for the analysis.
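A self-contained script could follow the outline sketched below. This is only an illustration: the built-in data set \code{cars} stands in for data imported from a file, and the object names and the name of the output file are assumptions made for the example.

<<>>=
# 1. load the packages used (only packages distributed with R are needed here)
library(stats)
library(utils)

# 2. load or import the data (a built-in data set replaces a file in this sketch)
my.data <- datasets::cars

# 3. perform the data analysis
fit <- lm(dist ~ speed, data = my.data)

# 4. display and save the results
print(coef(fit))
out.file <- file.path(tempdir(), "cars-fit-summary.txt")
writeLines(capture.output(summary(fit)), out.file)
@

Applying the same analysis to other data would then mostly require editing the statement that imports the data, as long as the variable names match.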
<<setup-scripts, include=FALSE, cache=FALSE>>=
show.results <- FALSE
@
\subsection{How do we use a script?}\label{sec:script:using}
\index{scripts!sourcing}
A script can be ``sourced'' using function \Rfunction{source()}. If a text file called \texttt{my.first.script.r} contains the text
\begin{shaded}
\footnotesize
\begin{verbatim}
# this is my first R script
print(3 + 4)
\end{verbatim}
\end{shaded}
it can be sourced by typing at the \Rpgrm console
<<evaluate=FALSE>>=
source("my.first.script.r")
@
Execution of the statements in the file makes \Rlang display \code{[1] 7} at the console, below the command we typed in. The commands themselves are not shown (by default the sourced file is not \emph{echoed} to the console) and the results of computations are not printed unless one includes explicit \Rfunction{print()} commands in the script.
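The difference between the default behaviour and echoing can be tried with the sketch below, which first writes a small script to a temporary file (the file and variable names are arbitrary choices made for this example). Parameter \code{echo} of \Rfunction{source()} controls whether the statements are shown and, by default, also whether values are printed automatically.

<<>>=
script.file <- tempfile(fileext = ".r")
writeLines(c("# a small script", "3 + 4", "print(3 + 4)"), script.file)
source(script.file)               # only the explicit print() displays a value
source(script.file, echo = TRUE)  # statements and all their values are displayed
@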
Scripts can be run either by sourcing them into an open \Rlang session or at the operating system command prompt (see section \ref{sec:intro:using:R} on page \pageref{sec:intro:using:R}). In \RStudio, the script in the currently active editor tab can be sourced using the ``source'' button. The drop-down menu of this button has three entries: ``Source'', which runs the script quietly in the \Rlang console; ``Source with echo'', which shows the code as it is run; and ``Source as local job'', which uses a new instance of \Rlang in the background. In the last case, the \Rlang console remains free for other uses while the script is running.
When a script is \emph{sourced}, the output can be saved to a text file instead of being shown in the console. It is also easy to call \Rpgrm with the \Rlang script file as an argument directly at the operating system shell or command-interpreter prompt---and obviously also from shell scripts. The next two chunks show commands entered at the OS shell command prompt rather than at the \Rlang command prompt.
\begin{shaded}
\footnotesize
\begin{verbatim}
Rscript my.first.script.r
\end{verbatim}
\end{shaded}
You can open an operating system's \emph{shell} from the Tools menu in \RStudio, to run this command. The output will be printed to the shell console. If you would like to save the output to a file, use output redirection using the operating system's syntax.
\begin{shaded}
\footnotesize
\begin{verbatim}
Rscript my.first.script.r > my.output.txt
\end{verbatim}
\end{shaded}
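Output can also be diverted to a file from within an \Rlang session. The sketch below assumes that \code{sink()} is an acceptable way of doing this and uses temporary files with arbitrary names. The first call to \code{sink()} starts diverting output to the file and the second call, without arguments, ends the diversion.

<<eval=FALSE>>=
script.file <- tempfile(fileext = ".r")
writeLines("print(3 + 4)", script.file)
out.file <- tempfile(fileext = ".txt")
sink(out.file)       # start diverting output to the file
source(script.file)
sink()               # stop diverting output
readLines(out.file)  # the saved output: "[1] 7"
@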
While developing or debugging a script, one usually wants to run (or \emph{execute}) one or a few statements at a time. This can be done in \RStudio using the ``run'' button after either positioning the cursor in the line to be executed, or selecting the text to be run (the selected text can be part of a line, a whole line, or a group of lines, as long as it is syntactically valid). The key-shortcut Ctrl-Enter is equivalent to pressing the ``run'' button.
\subsection{How to write a script}\label{sec:script:writing}
\index{scripts!writing}
As with any type of writing, different approaches may be preferred by different \Rlang users. In general, the approach used, or mix of approaches, will also depend on how confident one is that the statements will work as expected---one already knows the best approach vs.\ one is exploring different alternatives.
Three approaches are listed below. They all can result in equally good code, but as work in progress, they differ. In the first approach, the script as a whole is likely to contain some bugs until being thoroughly tested. In the middle approach, only the most recently added statements are likely to contain bugs. In the last one, the script contains at all times only valid \Rlang code, even if incomplete. This third approach also has the advantage that code remains in the \Rpgrm console \emph{History} and can be retrieved with a delay, e.g., after comparison against an alternative statement.
\begin{description}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}
\item[If one is very familiar with similar problems,] one can create a new text file and write the whole script in the editor, testing it only afterwards. Use of this approach is uncommon.
\item[If one is moderately familiar with the problem,] one can write a script as above, but testing it, step by step, while writing it, i.e., running parts of the script before continuing with the writing. This is the approach I use most frequently.
\item[If one is mostly playing around,] one can type statements at the console prompt to try them. As every statement run at the console is saved to the ``History'', these previously entered statement(s) can be copied and pasted into the script. In this way one can build a script from statements already known to work correctly.
\end{description}
\begin{playground}
By now you should be familiar enough with \Rlang to be able to write your own script.%
\begin{enumerate}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}
\item Create a new \Rpgrm script (in \RStudio, from the File menu, leftmost ``+'' icon, or by typing ``Ctrl + Shift + N'').
\item Save the file as \texttt{my.second.script.r}.
\item Use the editor pane in \RStudio to type some \Rpgrm commands and comments.
\item \emph{Run} individual commands.
\item \emph{Source} the whole file.
\end{enumerate}
\end{playground}
\subsection{The need to be understandable to people}\label{sec:script:readability}
\index{scripts!readability}
It is not enough for program code to be understood by a computer and to return the correct answer. Both large programs and small scripts have to be readable to humans, and the intention of the code understandable. In most cases, \Rlang code will be maintained, reused, and modified over time. In many cases, this code also serves to document a given computation and to make it possible to reproduce it.
When one writes a script, it is either because one wants to document what has been done or because one plans to use it again in the future. In the first case, other persons will read it, and in the second case, one rarely remembers all the details. Thus, spending time and effort on the writing style, paying special attention to the following recommendations, is important.
\begin{itemize}
\setlength{\itemsep}{1pt}
\setlength{\parskip}{0pt}
\setlength{\parsep}{0pt}
\item Avoid the unusual. People using a certain programming language tend to use some implicit or explicit rules of style---style includes \textit{indentation} of statements and \textit{capitalisation} of variable and function names. As a minimum try to be consistent with yourself.
\item Use meaningful names for variables, and any other object. What is meaningful depends on the context. Depending on common use, a single letter may be more meaningful than a long word. However, self-explanatory names are usually better: e.g., using \code{n.rows} and \code{n.cols} is much clearer than using \code{n1} and \code{n2} when dealing with a matrix of data. Probably \code{number.of.rows} and \code{number.of.columns} would make the script verbose, and take longer to type without gaining anything in return. Sometimes, short textual explanations in comments (ignored by \Rlang) are needed to achieve readability for humans.
\item How to make the words visible in names: traditionally in \Rlang one would use dots to separate the words and use only lower case. Some years ago, it became possible to use underscores. The use of underscores is common nowadays because it is ``safer'', as in some situations a dot may have a special meaning. Names like \code{NumCols}, using ``camel case'', are only infrequently used in \Rlang programming but are frequently used in other languages like \pascallang.
\end{itemize}
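The naming recommendations above can be illustrated with a short sketch using an arbitrary matrix. All object names are invented for this example; the first two groups of statements compute the same values, but only the second one tells the reader what they mean.

<<>>=
my.matrix <- matrix(1:6, nrow = 2)
# uninformative names
n1 <- nrow(my.matrix)
n2 <- ncol(my.matrix)
# self-explanatory names, with words separated by dots
n.rows <- nrow(my.matrix)
n.cols <- ncol(my.matrix)
# words separated by an underscore
n_cells <- n.rows * n.cols
n_cells
@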
The \emph{Tidyverse style guide} for writing \Rlang code (\url{https://style.tidyverse.org/}) provides more detailed ``rules''. However, more important than strictly following a published guideline is consistency in the style used by an individual, a team of programmers or data analysts, or even the members of an organisation. In the current book, I have not followed this guide in all respects, instead following in some cases the style used in \Rlang documentation. However, I have attempted to be consistent.
\begin{playground}
Here is an example of bad style in a script. Edit the code in the chunk below so that it becomes easier to read.
<<eval=eval_playground>>=
a <- 2 # height
b <- 4 # length
C <-
a *
b
C -> variable
print(
"area: ", variable
)
@
\end{playground}
The points discussed above already help a lot. However, one can go further in achieving the goal of human readability by interspersing explanations and code ``chunks'' and using all the facilities of typesetting, even of formatted maths formulas and equations, within the listing of the script. Furthermore, by including the results of the calculations and the code itself in a typeset report built automatically one ensures that they match each other. This greatly contributes to data analysis reproducibility, which is becoming a widespread requirement both in academia and in industry.
This approach is called literate programming\index{literate programming} and was first proposed by \citeauthor{Knuth1984a} (\citeyear{Knuth1984a}) through his \pgrmname{WEB} system. In the case of \Rpgrm programming, the first support of literate programming was in \pkgname{Sweave}, which has been superseded by \pkgname{knitr} \autocite{Xie2013}. This package supports the use of \Markdown or \Latex\ \autocite{Lamport1994} as the markup language for the textual contents and also formats and applies syntax highlighting to code. \Rmarkdown is an extension to \Markdown that makes it easier to include \Rlang code in documents (see \url{http://rmarkdown.rstudio.com/}). It is the basis of \Rlang packages that support typesetting large and complex documents (\pkgname{bookdown}), web sites (\pkgname{blogdown}), package documentation web sites (\pkgname{pkgdown}), and slides for presentations \autocite{Xie2016,Xie2018}. \Quarto, which provides an enhanced version of \Rmarkdown, is implemented in \Rlang package \pkgname{quarto} together with the \Quarto program as a separate executable. The use of \pkgname{knitr} and \pkgname{quarto} is very well integrated into the \RStudio IDE.
The generation of typeset reports is outside the scope of the book, but it is an important skill to learn. It is well described in the books and web sites cited.
\subsection{Debugging scripts}\label{sec:script:debug}
\index{scripts!debugging}
The use of the word \emph{bug} to describe a problem in computer hardware and software started in 1947 when a real bug, more precisely a moth, got between the contacts of a relay in an electromechanical computer causing it to malfunction, and Grace Hopper described the first computer \emph{bug}. The use of the term bug in engineering predates the use in computer science, and consequently, the use of the word bug in computing caught on easily.
A suitable quotation from a letter written by Thomas Alva Edison in 1878 \autocite[as given by][]{Hughes2004}:
\begin{quotation}
It has been just so in all of my inventions. The first step is an intuition, and comes with a burst, then difficulties arise--this thing gives out and [it is] then that ``Bugs''---as such little faults and difficulties are called---show themselves and months of intense watching, study and labour are requisite before commercial success or failure is certainly reached.
\end{quotation}
The quoted paragraph above makes clear that only very exceptionally does any new design fully succeed. The same applies to \Rlang scripts as well as any other non-trivial piece of computer code. From this it logically follows that testing and de-bugging are fundamental steps in the development of \Rlang scripts and packages. Debugging, as an activity, is outside the scope of this book. However, clear programming style and good documentation are indispensable for efficient testing and reuse.
Even for scripts used for analysing a single data set, we need to be confident that the algorithms and their implementation are valid, and able to return correct results. This is true for scientific reports, expert reports, and any data analysis related to assessment of compliance with legislation or regulations. Of course, even in cases when we are not required to demonstrate validity, say for decision making purely internal to a private organisation, we will still want to avoid costly mistakes.
The first step in producing reliable computer code is to accept that any code that we write needs to be tested and, if possible, validated. Another important step is to make sure that input is validated within the script and a suitable error produced for bad input (including valid input values falling outside the range that can be reliably handled by the script).
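A minimal sketch of input validation within a script follows. The vector \code{x} stands in for data read from a file, and the checks and messages are illustrative assumptions rather than a complete set of tests. Function \code{stopifnot()} triggers an error when any of the conditions passed as arguments is not true.

<<>>=
x <- c(1.2, 3.4, NA, 5.6)  # hypothetical input
# fail early, with an error, if the input cannot be handled
stopifnot(is.numeric(x), length(x) > 0)
# report missing values instead of silently returning NA
message("Missing values ignored: ", sum(is.na(x)))
mean.x <- mean(x, na.rm = TRUE)
mean.x
@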
If during testing, or during normal use, a wrong value or no value is returned by a calculation (e.g., the script crashes or triggers a fatal error), debugging consists in finding the cause of the problem. The cause can be either a mistake in the implementation of an algorithm or in the algorithm itself. However, many apparent \emph{bugs} are caused by bad, or missing, code for handling of special cases, such as invalid input values, rounding errors, and division by zero, making a function or script crash instead of elegantly issuing a helpful message.
Diagnosing the source of bugs is, in most cases, like detective work. One uses hunches based on common sense and experience to try to locate the lines of code causing the problem. One follows different \emph{leads} until the case is solved. In most cases, at the very bottom, we rely on some sort of divide-and-conquer strategy. For example, we may check the value returned by intermediate calculations until we locate the earliest code statement producing a wrong value. Another common case is when some input values trigger a bug. In such cases, it is frequently best to start by testing if different ``cases'' of input lead to errors/crashes or not. Boundary input values are usually the telltale ones: for numbers, zero, negative and positive values, very large values, very small values, missing values (\code{NA}), vectors of length zero (\code{numeric()}), etc.
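As a small illustration of boundary cases, the sketch below applies one arithmetic operation, computing the reciprocal, to several of the input values listed above. The object names are arbitrary. None of these statements triggers an error, but the values returned differ in ways that later statements in a script may not be prepared to handle.

<<>>=
boundary.values <- c(zero = 0, negative = -1, large = 1e300, missing = NA)
boundary.results <- 1 / boundary.values
boundary.results
empty.result <- 1 / numeric()  # vector of length zero as input
length(empty.result)
mean(numeric())                # not a number, no error
@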
\begin{warningbox}
\textbf{Error messages} When debugging, keep in mind that in some cases a single bug can lead to a whole cascade of error messages. Do also keep in mind that typing mistakes, originating when code is entered through the keyboard, can wreak havoc in a script: usually there is little correspondence between the number of error messages and the seriousness of the bug triggering them. When several errors are triggered, start by reading the error message printed first, as later errors can be an indirect consequence of earlier ones.
\end{warningbox}
There are special tools, called debuggers, available, and they help enormously. Debuggers allow one to step through the code, executing one statement at a time, allowing inspection of the objects present in the \Rlang environment. It is even possible to execute additional statements at the \Rpgrm console, e.g., to modify the value of a variable, while execution is paused. An \Rlang debugger is available within \RStudio and also through the \Rlang console.
When writing your first scripts, you will manage perfectly well, and learn more, by running the script one line at a time and, when needed, temporarily inserting \code{print()} statements to ``look'' at how the values of variables change at each step. A debugger allows a lot more control, as one can ``step in'' and ``step out'' of function definitions, and set and unset break points where execution will stop. However, using a debugger is not as simple as using \code{print()}.
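The simpler approach can be sketched as follows, using made-up values. The calls to \code{print()} are temporary additions that show intermediate values, to be deleted once the problem has been located.

<<>>=
x <- c(4, 9, 16)
y <- sqrt(x)
print(y)  # temporary: is the intermediate value as expected?
z <- sum(y) / length(y)
print(z)  # temporary
@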
If you get stuck trying to find the cause of a bug, do extend your search both to the most trivial of possible causes, and later on to the least likely ones (such as a bug in a package installed from \CRAN or \Rlang itself). Of course, when suspecting a bug in code you have not written, it is wise to very carefully read the documentation, as the ``bug'' may be just a misunderstanding of what a certain piece of code is expected to do. Also keep in mind that as discussed on page \pageref{sec:intro:net:help}, you will be able to find online already-answered questions to many of your likely problems and doubts. For example, searching with Google for the text of an error message is usually well rewarded. Most important to remember is that bugs do pop up frequently in newly written code, and occasionally in old code. No coding is immune to them, thus, the code you write, packages you use or \Rlang itself can contain bugs.
\section{Compound Statements}\label{sec:script:compound:statement}
\index{compound code statements}\index{simple code statements}
Individual statements can be grouped into \emph{compound statements} by enclosing them in curly braces (Figure \ref{fig:compound:statement}). Conceptually, this is like putting these statements into a box that allows us to operate with them as an anonymous whole.
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.7cm]
\node (start) [startstop] {\ldots};
\node (enc) [enclosure, color = blue, fill = blue!5, below of=start, yshift=-0.75cm] {\ };
\node (stat2) [process, color = blue, fill = blue!15, below of=start] {\code{<statement A>}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2, yshift=+0.2cm] {\code{<statement B>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow, color = blue] (start) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (stat3);
\draw [arrow, color = blue] (stat3) -- (stop);
\draw [arrow, color = black] (start) -- (enc);
\draw [arrow, color = black] (enc) -- (stop);
\end{tikzpicture}
\end{small}
\caption[Compound code statement]{Diagram of a compound code statement: a grouping of statements that in some contexts behaves as a single statement. In the diagram, statements A and B have been grouped into a compound statement.}\label{fig:compound:statement}
\end{figure}
<<compound-1>>=
print("...")
{
print("A")
print("B")
}
print("...")
@
The grouping of the two middle statements above is of no consequence, as it does not alter sequential evaluation. In the example above, only side effects are of interest. In the example below, the value returned by a compound statement is that returned by the last statement evaluated within it. Individual statements can be separated by an end-of-line as above, or by a semicolon (;) as below: two statements, each of them implementing an arithmetic operation.
<<compound-2>>=
{1 + 2; 3 + 4}
@
The example above demonstrates that only the value returned by the compound statement as a whole is displayed automatically at the \Rlang console, i.e., the implicit call to \code{print()} is applied to the compound statement. Thus, even though both statements were evaluated, we only see the result returned by the second one.
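The same can be seen when the compound statement is on the \emph{rhs} of an assignment, as in this brief sketch where the name \code{b} is arbitrary: only the value returned by the last statement is stored.

<<>>=
b <- {1 + 2; 3 + 4}
b
@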
\begin{playground}
Nesting is also possible. Before running the compound statement below try to predict the value it will return, and then run the code and compare your prediction to the value returned.
<<compound-3, eval=eval_playground>>=
{1 + 2; {a <- 3 + 4; a + 1}}
@
\end{playground}
Grouping is of little use by itself. It becomes useful together with control-of-execution constructs, when defining functions, and in similar cases where we need to treat a group of code statements as if they were a single statement. We will see several examples of the use of compound statements in the current chapter and in chapter \ref{chap:R:functions} on page \pageref{chap:R:functions}.
\section{Function Calls}
\index{functions!call}
We will describe functions in detail, and how to create new ones, in chapter \ref{chap:R:functions}. We have already been using functions since chapter \ref{chap:R:as:calc}. Functions are structurally \Rlang statements, in most cases, compound statements, using formal parameters as placeholders. When one calls a function, one passes arguments for the different parameters (or placeholder names) and the (compound) statement forming the \emph{body} of the function is evaluated after ``replacing'' the placeholders by the values passed as arguments.
In the first example, we use two statements. In the first statement, $\log_{10}(100)$ is computed by calling function \code{log10()} with \code{100} as argument and the returned value is assigned to variable \code{a}. In the second statement, the value 2 is displayed as a side effect of calling \code{print()} with variable \code{a} as argument.
<<fun-calls-01>>=
a <- log10(100)
print(a)
@
The two statements in the example above can be rewritten as a single statement using a nested function call.
<<fun-calls-02>>=
print(log10(100))
@
The difference is that we avoid the explicit creation of a variable. Whether this is an advantage or not depends on whether we use variable \code{a} in later statements or not.
Statements with more levels of nesting than shown above become very difficult to read, so alternative notations can help.
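Returning to how arguments are passed in a function call, the sketch below calls \code{log()}, which has parameters named \code{x} and \code{base}. Arguments can be matched to parameters by position, by name, or by a mix of both; the three statements are meant to be equivalent.

<<>>=
log(100, 10)             # both arguments passed by position
log(100, base = 10)      # second argument passed by name
log(base = 10, x = 100)  # both passed by name, so their order is irrelevant
@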
\section{Data Pipes}\label{sec:script:pipes}
\index{pipes!base R|(}
\index{pipe operator}
\index{chaining statements with \emph{pipes}}
Pipes have been at the core of shell scripting in \osname{Unix} since early stages of its design \autocite{Kernigham1981} as well as in \osname{Linux} distributions. Within an OS, pipes are chains of small programs or ``tools'' that carry out a single well-defined task (e.g., \code{ed}, \code{sed}, \code{awk}, \code{grep}, and \code{more}). Data such as text is described as flowing from a source into a sink through a series of steps at which specific transformations take place. In \osname{Unix} and \osname{Linux} shells like \pgrmname{sh} or \pgrmname{bash}, sinks and sources are files, but in \osname{Unix} and \osname{Linux} files are an abstraction that includes all devices and connections for input or output, including physical ones such as terminals and printers.
<<pipes-r-01,engine="bash",eval=FALSE>>=
# grep reads lines from standard input (the source); more displays those matching
grep "abc" | more
@
How can \emph{pipes} exist within a single \Rlang script? When chaining functions into a pipe, data is passed between them through temporary \Rlang objects stored in memory, which are created and destroyed automatically. Conceptually, there is little difference between \osname{Unix} shell pipes and pipes in \Rlang scripts, but the implementations are different.
What do pipes achieve in \Rlang scripts? They relieve us from the responsibility of creating and deleting the temporary objects. By chaining the statements they enforce their sequential execution. Pipes usually improve the readability of scripts by allowing more concise code.
Since 2021, starting from version 4.1.0, \Rlang has had a native pipe operator (\Roperator{\textbar >}) as part of the language. Subsequently, the placeholder (\code{\_}) was implemented in version 4.2.0 and its functionality expanded in version 4.3.0. Two other implementations of pipes, which have been available for some years as \Rlang extensions in packages \pkgnameNI{magrittr} and \pkgnameNI{wrapr}, are described in chapter \ref{chap:R:data} on page \pageref{chap:R:data}.
I describe R's pipe syntax based on \Rpgrm 4.3.0. I start by showing the same operations coded using nested function calls, using explicit saving of intermediate values in temporary objects, and using the pipe operator.
Nested function calls are concise, but difficult to read when the depth of nesting increases.
<<pipes-r-02>>=
sum(sqrt(1:10))
@
Saving intermediate results explicitly results in clear but verbose code.
<<pipes-r-03>>=
data.in <- 1:10
data.tmp <- sqrt(data.in)
sum(data.tmp)
rm(data.tmp) # clean up!
@
A pipe using operator \Roperator{\textbar >} makes the data flow clear and keeps the code concise.
<<pipes-r-04>>=
1:10 |> sqrt() |> sum()
@
We can assign the result of the computation to a variable, most elegantly using the \Roperator{->} operator on the \emph{rhs} of the pipe.
<<pipes-r-04a>>=
1:10 |> sqrt() |> sum() -> my_rhs.var
my_rhs.var
@
We can also use the \Roperator{<-} operator on the \emph{lhs} of the pipe, i.e., for assignments a pipe behaves as a compound statement.
<<pipes-r-04b>>=
my_lhs.var <- 1:10 |> sqrt() |> sum()
my_lhs.var
@
Formally, the \Roperator{\textbar >} operator from base \Rlang takes two operands, just like operator \code{+} does. The value returned by the \emph{lhs} (left-hand side) operand, which can be any \Rlang expression, is passed as argument to the function-call operand on the \emph{rhs} (right-hand side). The called function must accept at least one argument. This default syntax, which implicitly passes the argument by position to the first parameter of the function, would on its own limit which functions can be used in a pipe construct. However, it is also possible to pass the piped argument explicitly by name to any parameter of the function on the \emph{rhs} using an underscore (\code{\_}) as a placeholder.
<<pipes-r-05>>=
1:10 |> sqrt(x = _) |> sum(x = _)
@
The placeholder can also be used with extraction operators.
<<pipes-r-05a>>=
1:10 |> sqrt(x = _) |> _[2:8] |> sum(x = _)
@
\begin{explainbox}
Base \Rlang functions like \Rfunction{subset()} have formal parameters in an order that is suitable for implicitly passing the piped value as an argument to their first parameter, while others like \Rfunction{assign()} do not. For example, when calling function \code{assign()} to save a value using a name available as a character string, we would like to pass the piped value as an argument to parameter \code{value} which is not the first. In such cases, we can use \code{\_} as a placeholder and pass it by name.
<<pipes-box-pipes-02>>=
obj.name <- "data.out"
1:10 |> sqrt() |> sum() |> assign(x = obj.name, value = _)
@
Alternatively, we can define a wrapper function, with the desired order for the formal parameters. This approach can be worthwhile when the same function is called repeatedly within a script.
<<pipes-box-pipes-03>>=
value_assign <- function(value, x, ...) {
  assign(x = x, value = value, ...)
}
obj.name <- "data.out"
1:10 |> sqrt() |> sum() |> value_assign(obj.name)
@
\end{explainbox}
In general, whenever we use temporary variables to store values that are passed as arguments only once, we can nest or chain the statements making the saving of intermediate results into a temporary variable implicit instead of explicit. Examples of some useful idioms follow.
Addition of computed variables to a data frame using \Rfunction{within()} (see section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}) and selecting rows with \Rfunction{subset()} (see section \ref{sec:calc:df:subset} on page \pageref{sec:calc:df:subset}) are combined in our first simple example. For clarity, we use the \code{\_} placeholder to indicate the value returned by the preceding function in the pipe.
<<pipes-r-06>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, is.large)
@
\begin{playground}
Without using the \code{\_} placeholder, but using a more compact layout, the code above becomes that shown below. Compare it to that above to work out how I simplified the code.
<<pipes-r-06aa, eval=eval_playground>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within({x4 <- x^4; is.large <- x^4 > 1000}) |>
  subset(is.large)
@
\end{playground}
Function \Rfunction{subset()} can also be used to select variables or columns from data frames and matrices.
<<pipes-r-06a>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, is.large, select = -x)
@
<<pipes-r-06b>>=
data.frame(x = 1:10, y = rnorm(10)) |>
  within(data = _,
         {
           x4 <- x^4
           is.large <- x^4 > 1000
         }) |>
  subset(x = _, select = c(y, x4))
@
<<pipes-r-07>>=
data.frame(group = factor(rep(c("T1", "T2", "Ctl"), each = 4)),
           y = rnorm(12)) |>
  subset(x = _, group %in% c("T1", "T2")) |>
  aggregate(data = _, y ~ group, mean)
@
The extraction operators are accepted on the \emph{rhs} of a pipe only starting from \Rpgrm 4.3.0. In this and later versions, \code{\_[["y"]]}, as shown below, as well as its equivalent \code{\_\$y}, can be used. Function \Rfunction{getElement()}, being a normal function, can be used as \code{getElement("y")} in situations where operators are not accepted, like on the \emph{rhs} of \Roperator{\textbar >} in older versions of \Rlang.
<<pipes-r-09>>=
data.frame(group = factor(rep(c("T1", "T2", "Ctl"), each = 4)),
           y = rnorm(12)) |>
  subset(x = _, group %in% c("T1", "T2")) |>
  aggregate(data = _, y ~ group, mean) |>
  _[["y"]]
@
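As a minimal sketch of the use of \Rfunction{getElement()} at the end of a pipe, the chunk below extracts a column without relying on extraction operators or on the placeholder. The data frame and the names \code{my.df} and \code{y.means} are made up for this illustration; as only the operator \Roperator{\textbar >} itself is used, the code is expected to run also under \Rpgrm 4.1.0.
<<pipes-r-09-getelement>>=
# hypothetical data, with fixed values so that the result is reproducible
my.df <- data.frame(group = rep(c("T1", "T2"), each = 2),
                    y = c(1, 3, 10, 20))
# getElement() is an ordinary function, so no placeholder is needed
y.means <-
  aggregate(y ~ group, data = my.df, FUN = mean) |>
  getElement("y")
y.means
@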
Additional functions designed to be used in pipes are available through packages as described in chapter \ref{chap:R:data}.
\begin{playground}
In the last three examples, in which function calls is the explicit use of the placeholder needed, and in which ones is it optional? Hint: edit the code, removing the parameter name, \code{=}, and \code{\_}, and test whether the edited code works and returns the same value as before.
\end{playground}
\index{pipes!base R|)}
\section{Conditional Evaluation}\label{sec:script:flow:control}
\index{control of execution flow}
By default, \Rlang statements in a script are evaluated (or executed) in the sequence they appear in the script \textit{listing} or text. We give the name \emph{control of execution constructs} to those special statements that allow us to alter this default sequence, by either skipping or repeatedly evaluating individual statements. The statements whose evaluation is controlled can be either simple or compound. Some of the control of execution flow statements function like \emph{ON-OFF switches} for program statements. Others allow statements to be executed repeatedly while or until a condition is met, or until all members of a list or a vector are processed.
These \emph{control of execution constructs} can also be used at the \Rlang console, but it is usually awkward to do so as they can extend over several lines of text. In simple scripts, the \emph{flow of execution} can be fixed and linear from the first to the last statement in the script. However, \emph{control of execution constructs} are a crucial part of most useful scripts. As we will see next, a compound statement can include multiple simple or nested compound statements. \Rpgrm has two types of \emph{if}\index{conditional statements} statements, non-vectorised and vectorised.
\subsection[Non-vectorised \texttt{if}, \texttt{else} and \texttt{switch}]{Non-vectorised \code{if}, \code{else} and \code{switch}}\label{sec:script:if}
\qRcontrol{if}\qRcontrol{if\ldots{}else}%
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.3cm] {\code{if (<cond.>)}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.2cm] {\code{<statement A>}};
\node (stat3) [process, below of=dec1, yshift=-0.5cm] {\code{<statement B>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{TRUE}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=west] {\code{FALSE}} (stat3);
\draw [arrow] (stat2) |- (stat3);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
\caption[Flowchart for \code{if} construct.]{Flowchart for \code{if} construct.}\label{fig:if:diagram}
\end{figure}
The \code{if} construct ``decides'', depending on a \code{logical} value, whether the next code statement is executed (if \code{TRUE}) or skipped (if \code{FALSE}) (Figure \ref{fig:if:diagram}). The flow chart shows how \code{if} works: \code{<statement A>} is either evaluated or skipped depending on the value of \code{<condition>}, while \code{<statement B>} is always evaluated.\label{flowchart:if}
The usefulness of \emph{if} statements stems from the possibility of computing the \code{logical} value used as \code{<condition>} with comparison operators (see section \ref{sec:calc:comparison} on page \pageref{sec:calc:comparison}) and logical operators (see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}).
We start with toy examples demonstrating how \emph{if} statements work. Later we will see examples closer to real use cases. Here \Rcontrol{if} controls the evaluation or not of the simple statement \code{print("Hello!")}.
\begin{explainbox}
We use the name \emph{flag} for a \code{logical} variable set manually, preferably near the top of the script. Real flags were used in railways to indicate to trains whether to stop or continue at stations and which route to follow at junctions. Use of \code{logical} flags in scripts is most useful when switching between two behaviours that depend on multiple separate statements.
\end{explainbox}
<<if-1z>>=
flag <- TRUE
if (flag) print("Hello!")
@
\begin{playground}
Play with the code above by changing the value assigned to variable \code{flag} to \code{FALSE}, \code{NA}, and \code{logical(0)}.
In the example above we use variable \code{flag} as the \emph{condition}.
Nothing in the \Rlang language prevents this condition from being a \code{logical} constant. Explain why \code{if (FALSE)} in the syntactically correct statement below is of no practical use.
<<if-1>>=
if (FALSE) print("Hello!")
@
\end{playground}
Conditional execution is much more useful than could be expected from the previous examples, because the statement whose execution is being controlled can be a compound statement of almost any length or complexity. A very simple example follows, with a compound statement containing two statements, each one a call to function \code{print()} with a different argument.
<<if-2>>=
printing <- TRUE
if (printing) {
  print("A")
  print("B")
}
@
\begin{warningbox}
The condition passed as an argument to \code{if}, enclosed in parentheses, can be anything yielding a \Rclass{logical} vector of length one. As this condition is \emph{not} vectorised, a longer vector triggers an error in \Rlang 4.2.0 and later versions, while earlier versions issued a warning and used only the first member of the vector.
\end{warningbox}
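When a test naturally yields a vector longer than one, it has to be reduced to a single \code{logical} value before being used as a condition. A minimal sketch follows, using the base \Rlang functions \code{any()} and \code{all()} and a made-up vector \code{x}.
<<if-2-any-all>>=
x <- c(-1, 2, 5)  # hypothetical values
x < 0             # a logical vector of length three, not usable as condition
if (any(x < 0)) print("at least one member is negative")
if (all(x < 10)) print("all members are smaller than 10")
@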
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.5cm] {\code{if (<cond.>) else}};
\node (stat2) [process, color = blue, fill = blue!15, left of=dec1, xshift=-3.2cm] {\code{<statement A>}};
\node (stat3) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.2cm] {\code{<statement B>}};
\node (stat4) [process, below of=dec1, yshift=-0.5cm] {\code{<statement C>}};
\node (stop) [startstop, below of=stat4] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{TRUE}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\code{FALSE}} (stat3);
\draw [arrow] (stat2) |- (stat4);
\draw [arrow] (stat3) |- (stat4);
\draw [arrow] (stat4) -- (stop);
\end{tikzpicture}
\end{small}
\caption[Flowchart for \code{if \ldots\ else} construct.]{Flowchart for \code{if \ldots\ else} construct.}\label{fig:if:else:diagram}
\end{figure}
The \code{if \ldots\ else \ldots} construct ``decides'', depending on a \code{logical} value, which of two code statements is executed (Figure \ref{fig:if:else:diagram}). The flow chart shows how it works: either \code{<statement A>} or \code{<statement B>} is evaluated and the other skipped depending on the value of \code{<condition>}, while \code{<statement C>} is always evaluated.\label{flowchart:if:else}
<<if-3>>=
a <- 10
if (a < 0) print("'a' is negative") else print("'a' is not negative")
print("This is always printed")
@
As can be seen above, the statement immediately following \code{if} is executed if the condition returns \code{TRUE} and that following \code{else} is executed if the condition returns \code{FALSE}. Statements after the conditionally executed \code{if} and \code{else} statements are always executed, independently of the value returned by the condition.
\begin{playground}
Play with the code in the chunk above by assigning different numeric vectors to \code{a}.
\end{playground}
<<auxiliary, echo=FALSE, include = FALSE, eval=TRUE>>=
show.results <- TRUE
if (show.results) eval.if.4 <- c(1:4) else eval.if.4 <- FALSE
# eval.if.4
show.results <- FALSE
if (show.results) eval.if.4 <- c(1:4) else eval.if.4 <- FALSE
#eval.if.4
@
\begin{explainbox}
Do you still remember the rules about continuation lines?
<<if-4>>=
# 1
a <- 1
if (a < 0) print("'a' is negative") else print("'a' is not negative")
@
Why does the statement below (not evaluated here) trigger an error while the one above does not?
<<if-4a, eval=FALSE>>=
# 2 (not evaluated here)
if (a < 0) print("'a' is negative")
else print("'a' is not negative")
@
How do the continuation line rules apply when we add curly braces as shown below?
<<if-4b>>=
# 3
a <- 1
if (a < 0) {
  print("'a' is negative")
} else {
  print("'a' is not negative")
}
@
In the example above, we enclosed a single statement between each pair of curly braces, but as these braces create compound statements, multiple statements could have been enclosed between each pair.
\end{explainbox}
\begin{playground}
Play with the use of conditional execution, with both simple and compound statements, and also think how to combine \code{if} and \code{else} to select among more than two options.
\end{playground}
In \Rlang, the value returned by any compound statement is the value returned by the last simple statement executed within the compound one. This means that we can assign the value returned by an \code{if} and \code{else} statement to a variable. This style is less frequently used, but occasionally can result in easier-to-understand scripts.\label{chunk:if:assignment}
<<if-4c>>=
a <- 1
my.message <-
  if (a < 0) "'a' is negative" else "'a' is not negative"
print(my.message)
@
\begin{explainbox}
If the condition statement returns a value of a class other than \code{logical}, \Rlang will attempt to convert it into a logical. This is sometimes used instead of a comparison to zero, as the conversion from \code{integer} yields \code{TRUE} for all integers except zero. The code below illustrates a rather frequently used idiom for checking if there is something available to display.
<<if-explain_conv>>=
message <- "abc"
if (length(message)) print(message)
@
\end{explainbox}
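The same idiom can be sketched for the opposite case, an empty vector, for which \code{length()} returns zero and the condition is consequently taken as \code{FALSE}. The names \code{empty.message} and \code{status} are used only for this illustration.
<<if-explain-conv-empty>>=
empty.message <- character()  # a vector of length zero
status <-
  if (length(empty.message)) "something to print" else "nothing to print"
status
@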
\begin{advplayground}
\Kern{-1}{Study the conversion rules between \Rclass{numeric} and \Rclass{logical} values, run each of the statements below, and explain the output based on how type conversions are interpreted, remembering the difference between \emph{floating-point numbers} as implemented in computers and \emph{real numbers} as defined in mathematics (see page \pageref{box:integer:float}).}
% chunk contains intentional error-triggering examples
<<if-PG-01, eval=FALSE>>=
if (0) print("hello")
if (-1) print("hello")
if (0.01) print("hello")
if (1e-300) print("hello")
if (1e-323) print("hello")
if (1e-324) print("hello")
if (1e-500) print("hello")
if (as.logical("true")) print("hello")
if (as.logical(as.numeric("1"))) print("hello")
if (as.logical("1")) print("hello")
if ("1") print("hello")
@
Hint: if you need to refresh your understanding of the type conversion rules, see section \ref{sec:calc:type:conversion} on page \pageref{sec:calc:type:conversion}.
\end{advplayground}
\begin{figure}
\centering
\begin{small}\label{flowchart:switch}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (dec1) [decision, color = blue, fill = blue!15, below of=start, yshift=-0.4cm] {\code{switch(<value>)}};
\node (stat2) [process, color = blue, fill = blue!15, below of=dec1, xshift=3.4cm] {\code{<statement A>}};
\node (stat3) [process, color = blue, fill = blue!15, below of=stat2] {\code{<statement B>}};
\node (stat4) [process, color = blue, fill = blue!15, below of=stat3] {\code{<statement C>}};
\node (stat5) [process, color = blue, fill = blue!15, below of=stat4] {\code{<statement D>}};
\node (stat6) [process, below of=stat5, xshift=3.3cm] {\code{<statement E>}};
\node (stop) [startstop, below of=stat6] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<value 1>}} (stat2);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<value 2>}} (stat3);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<value 3>}} (stat4);
\draw [arrow, color=blue] (dec1) |- node[anchor=north west] {\code{<default>}} (stat5);
\draw [arrow] (stat2) -| (stat6);
\draw [arrow] (stat3) -| (stat6);
\draw [arrow] (stat4) -| (stat6);
\draw [arrow] (stat5) -| (stat6);
\draw [arrow] (stat6) -- (stop);
\end{tikzpicture}
\end{small}
\caption{Flowchart for a \code{switch} construct with four cases.}\label{fig:switch:diagram}
\end{figure}
\Kern{-1}{In addition to \Rcontrol{if} and \Rcontrol{if\ldots{}else}, there is in \Rlang a \Rcontrol{switch()} statement (Figure \ref{fig:switch:diagram}). It can be used to select among several \emph{cases}, or alternative statements, based on an expression that returns a \code{numeric} or a \code{character} value of length one when evaluated.}
A \Rcontrol{switch()} statement returns a value, just like \code{if} does. The value passed as argument to \Rcontrol{switch()} functions as an index selecting one of the statements. The value returned by the \Rcontrol{switch()} statement is the value returned by the selected \textit{case} statement.
In the first example below, we use a \code{character} variable as the condition, named cases, and a final unlabelled case as default in case of no match. In real use, a computed value or user input would be used in place of \code{my.object}. As with the \code{logical} argument to \code{if}, the \code{character} string value passed as argument must be a vector of length one.
<<>>=
my.object <- "two"
b <- switch(my.object,
            one = 1,
            two = 1 / 2,
            four = 1 / 4,
            0
)
b
@
Multiple condition values can share the same statement.
<<>>=
my.object <- "two"
b <- switch(my.object,
            one =, uno = 1,
            two =, dos = 1 / 2,
            four =, cuatro = 1 / 4,
            0
)
b
@
\begin{playground}
Do play with the use of the switch statement. Look at the documentation for \code{switch()} using \code{help(switch)} and study the examples at the end of the help page. Explore what happens if you set \code{my.object <- "ten"}, \code{my.object <- "three"}, \code{my.object <- NA\_character\_} or \code{my.object <- character()}. Then remove the \code{, 0} as default value, and repeat.
\end{playground}
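When the expression used as a condition returns a value that is not a \code{character}, it is converted into an \code{integer} and used as an index. In this case, no names are used for the cases, which are selected by position, and there is no default: an index that does not match the position of any case results in \code{NULL} being returned invisibly. In the example below, the last case, \code{0}, is returned only when the index is \code{4}.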
<<>>=
my.number <- 2
b <- switch(my.number,
            1,
            1 / 2,
            1 / 4,
            0
)
b
@
\begin{playground}
Continue playing with the use of the switch statement. Explore what happens if you set \code{my.number <- 10}, \code{my.number <- 3}, \code{my.number <- NA}, or \code{my.number <- numeric()}. Afterwards, remove the final case \code{, 0}, and repeat.
\end{playground}
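A minimal sketch of how \Rcontrol{switch()} selects cases by position when the index is numeric is shown below, with made-up \code{character} values as cases. An index within range selects the case at that position, while an index past the last case selects none of them and \code{NULL} is returned invisibly.
<<switch-numeric-index>>=
in.range <- switch(2, "first", "second", "third")
in.range
out.of.range <- switch(5, "first", "second", "third")
is.null(out.of.range)
@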
\begin{explainbox}
The statements for the cases in a \Rcontrol{switch()} statement can be compound statements as in the case of \code{if}, and they can even be used for a side effect. The code example above can be edited to print a message when the default value is returned.
<<explain-switch-01>>=
my.object <- "ten"
b <- switch(my.object,
            one = 1,
            two = 1 / 2,
            four = 1 / 4,
            {print("No match! Using default"); 0}
)
b
@
\end{explainbox}
\begin{explainbox}
The \Rcontrol{switch()} statement can substitute for chained \code{if \ldots\ else} statements when all the conditions can be described by constant values or distinct values returned by the same test. The advantage is more concise and readable code. The equivalent of the first \Rcontrol{switch()} example above when written using \code{if \ldots\ else} becomes longer. Given how terse code using \Rcontrol{switch()} is, those not yet familiar with its use may find the more verbose style used below easier to understand. On the other hand, with numerous cases, a \Rcontrol{switch()} statement is easier to read and understand.
<<explain-switch-11>>=
my.object <- "two"
if (my.object == "one") {
  b <- 1
} else if (my.object == "two") {
  b <- 1 / 2
} else if (my.object == "four") {
  b <- 1 / 4
} else {
  b <- 0
}
b
@
\end{explainbox}
\begin{advplayground}
Consider another alternative approach, the use of a named vector to map values. In most of the examples above, the code for the cases is a constant value or an operation among constant values. Implement one of these examples using a named vector instead of a \Rcontrol{switch()} statement.
\end{advplayground}
\subsection[Vectorised \texttt{ifelse()}]{Vectorised \code{ifelse()}}
\index{vectorised ifelse}
Vectorised \emph{ifelse} is a peculiarity of the \Rlang language, but very useful for writing concise code that may execute faster than logically equivalent but not vectorised code.
Vectorised conditional execution is coded by means of \emph{function} \Rcontrol{ifelse()} (written as a single word). This function takes three arguments: a \code{logical} vector usually the result of a test (parameter \code{test}), an expression to use for \code{TRUE} cases (parameter \code{yes}), and an expression to use for \code{FALSE} cases (parameter \code{no}). At each index position along the vectors, the value included in the returned vector is taken from \code{yes} if the corresponding member of the \code{test} logical vector is \code{TRUE} and from \code{no} if the corresponding member of \code{test} is \code{FALSE}. All three arguments can be any \Rlang statement returning the required vectors.
The flow chart for \Rcontrol{ifelse()} is similar to that for \code{if \ldots\ else} shown on page \pageref{flowchart:if:else} but applied in parallel to the individual members of vectors; e.g.,\ the value of the condition at index position \code{1} controls which value will be present in the returned vector at index position \code{1}, and so on.
It is customary to pass arguments to \code{ifelse} by position. We give a first example with named arguments to clarify the use of the function.
<<ifelse-0>>=
my.test <- c(TRUE, FALSE, TRUE, TRUE)
ifelse(test = my.test, yes = 1, no = -1)
@
In practice, the most common idiom is to have as an argument passed to \code{test}, the result of a comparison calculated on the fly. As an example, the absolute values of the members of a vector are computed using \Rcontrol{ifelse()} instead of with \Rlang function \code{abs()}.
<<ifelse-0a>>=
nums <- -3:+3
ifelse(nums < 0, -nums, nums)
@
\begin{warningbox}
In the case of \Rcontrol{ifelse()}, the length of the returned value is determined by the length of the logical vector passed as an argument to its first formal parameter (named \code{test})! A frequent mistake is to use a condition that returns a \code{logical} vector of length one, expecting that it will be recycled because arguments passed to the other formal parameters (named \code{yes} and \code{no}) are longer. However, no recycling will take place, resulting in a returned value of length one, with the remaining elements of the vectors passed to \code{yes} and \code{no} being discarded. Do try this by yourself, using logical vectors of different lengths. You can start with the examples below, making sure you understand why the returned values are what they are.
<<>>=
ifelse(TRUE, 1:5, -5:-1)
ifelse(FALSE, 1:5, -5:-1)
ifelse(c(TRUE, FALSE), 1:5, -5:-1)
ifelse(c(FALSE, TRUE), 1:5, -5:-1)
ifelse(c(FALSE, TRUE), 1:5, 0)
@
\end{warningbox}
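When the test is a single \code{logical} value and the whole of one or the other vector is wanted, the non-vectorised \code{if \ldots\ else} is the construct to use. The sketch below, with made-up variable names, contrasts the two constructs using the same condition.
<<ifelse-vs-if>>=
a <- 1:5
any.large <- any(a > 3)  # a single logical value
res.if <- if (any.large) a + 1 else a - 1
res.ifelse <- ifelse(any.large, a + 1, a - 1)
length(res.if)      # same length as 'a'
length(res.ifelse)  # same length as 'any.large'
@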
\begin{playground}
Some additional examples to play with, containing a few surprises. Study the examples below until you understand why returned values are what they are. In addition, create your own examples to test other possible cases. In other words, play with the code until you fully understand how \code{ifelse()} statements work.
<<ifelse-1, eval=eval_playground>>=
a <- 1:10
ifelse(a > 5, 1, -1)
ifelse(a > 5, a + 1, a - 1)
ifelse(any(a > 5), a + 1, a - 1) # tricky
ifelse(logical(0), a + 1, a - 1) # even more tricky
ifelse(NA, a + 1, a - 1) # as expected
@
Hint: if you need to refresh your understanding of \code{logical} values and Boolean algebra see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.
\end{playground}
\begin{advplayground}
Using \Rcontrol{ifelse()}, write a single statement to combine numbers from the two vectors \code{a} and \code{b} into a result vector \code{d}, based on whether the corresponding value in vector \code{c} is the character \code{"a"} or \code{"b"}. Then print vector \code{d} to make the result visible.
<<ifelse-2, eval=eval_playground>>=
a <- -10:-1
b <- +1:10
c <- c(rep("a", 5), rep("b", 5))
# your code
@
If you do not understand how the three vectors are built, or you cannot guess the values they contain by reading the code, print them, and play with the arguments, until you understand what each parameter does. Also use \code{help(rep)} and/or \code{help(ifelse)} to access the documentation.
\end{advplayground}
\begin{advplayground}
Continuing from the playground above, test the behaviour of \Rcontrol{ifelse()} with \code{NA}, \code{NULL} and \code{logical()} passed as arguments to \code{test}. Also test the behaviour when only some members of a logical vector are not available (\code{NA}).
\end{advplayground}
\section{Iteration}
\index{loops|seealso{iteration}}
We give the name \emph{iteration} to the process of repetitive execution of a program statement---e.g., \emph{computed by iteration}. We use the same word, \emph{iteration}, to name each one of these repetitions of the execution of a statement---e.g., \emph{the second iteration}.
Iteration constructs make it possible to ``decide'' at run time the number of iterations, i.e., when execution breaks out of the loop and continues at the next statement in the script. Iteration can be used to apply the same computations to the different members of a vector or list (this section), but also to apply different functions to members of a vector, matrix, list, or data frame (section \ref{sec:R:faces:of:loops} on page \pageref{sec:R:faces:of:loops}).
In \Rlang, three types of iteration loops are available: \Rloop{for}, \Rloop{while} and \Rloop{repeat} constructs. They differ in the origin of the values they iterate over, and in the type of test used to terminate iteration. When the same algorithm can be implemented with more than one of these constructs, using the least flexible of them usually results in easier-to-understand code.
In \Rlang, explicit loops as described in this section can in some cases be replaced by calls to \emph{apply} functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) or with vectorised functions and operators (see page \pageref{par:calc:vectorised:opers}). The choice among these approaches affects readability and performance (see section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}).
\subsection[\texttt{for} loops]{\code{for} loops}
\begin{figure}
\centering
\begin{small}
\begin{tikzpicture}[node distance=1.5cm]
\node (start) [startstop] {\ldots};
\node (entry) [below of=start, color = blue, yshift=0.5cm]{$\bullet$};
\node (dec1) [decision, color = blue, fill = blue!15, below of=entry, yshift=0.3cm] {\code{for (<list>)}};
\node (stat2) [process, color = blue, fill = blue!15, right of=dec1, xshift=3.55cm] {\code{<statement A>}};
\node (stat3) [process, below of=dec1, yshift=-0.5cm] {\code{<statement B>}};
\node (stop) [startstop, below of=stat3] {\ldots};
\draw [arrow] (start) -- (dec1);
\draw [arrow, color=blue] (dec1) -- node[anchor=north] {\textsl{continue}} (stat2);
\draw [arrow, color=blue] (dec1) -- node[anchor=west] {\textsl{break}} (stat3);
\draw [arrow, color = blue] (stat2) |- (entry);
\draw [arrow, color = blue] (entry) -- (dec1);
\draw [arrow] (stat3) -- (stop);
\end{tikzpicture}
\end{small}
\caption{Flowchart for a \code{for} iteration loop.}\label{fig:for:loop:diagram}
\end{figure}
The\index{for loop}\index{iteration!for loop}\qRloop{for} most frequently used type of loop is a \code{for} loop. These loops work in \Rlang by ``walking through'' a list or vector of values to act upon (Figure \ref{fig:for:loop:diagram}). Within a \qRloop{for} loop, member values are available, sequentially, one at a time, through a variable that functions as a placeholder. The implicit test for the end of the vector or list takes place at the top of the construct before the loop statement is evaluated. The flow chart has the shape of a \emph{loop} as the execution can be directed to an earlier position in the sequence of statements, allowing the same section of code to be evaluated multiple times, each time with a new value assigned to the placeholder variable.
In the diagram above, the argument to \code{for()} is shown as \code{<list>} but it can also be a \code{vector} of any mode. Objects of most classes derived from \code{list} or from an atomic vector can also fulfil the same role. The extraction operation with a numeric index must be supported by objects of the class passed as argument.
Similarly to \code{if} constructs, only one statement is controlled by \Rloop{for}; however, this statement can be a compound statement enclosed in braces \verb|{ }| (see pages \pageref{sec:script:compound:statement} and \pageref{sec:script:if}).
<<for-00>>=
b <- 0 # variable needs to be set to a valid numeric value!
for (a in 1:5) b <- b + a
b
@
Here the statement \code{b <- b + a} is executed five times, with the placeholder variable \code{a} sequentially taking each of the values, 1, 2, 3, 4, and 5, the members of the anonymous vector \code{1:5}. The name used as a placeholder has to fulfil the same requirements as an ordinary \Rlang variable name. The list or vector following \code{in} can contain any valid \Rlang objects, as long as the code statements in the loop body can handle them.
\begin{warningbox}
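In a \code{for} loop construct, the vector or list passed as argument is evaluated once, before the first iteration. Consequently, even when it is a variable, modifying it in the code statement within the \code{for} loop does not alter the values iterated over or the number of iterations.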
\end{warningbox}
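A minimal sketch of this behaviour, with a made-up vector \code{v} and counter \code{n.iter}, follows. The loop body assigns a longer vector to \code{v} at each iteration, but the number of iterations remains that given by \code{v} as it was when the loop started.
<<for-seq-fixed>>=
v <- 1:3
n.iter <- 0
for (i in v) {
  v <- c(v, i + 10)     # 'v' grows, but the values iterated over do not change
  n.iter <- n.iter + 1
}
n.iter
v
@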
A\index{for loop!unrolled} loop can be ``unrolled'' into a linear sequence of statements. Let's work through the \code{for} loop above.
<<for-unrolled>>=
b <- 0
# start of loop
# first iteration
a <- 1
b <- b + a
# second iteration
a <- 2
b <- b + a
# third iteration
a <- 3
b <- b + a
# fourth iteration
a <- 4
b <- b + a
# fifth iteration
a <- 5
b <- b + a
# end of loop
b
@
The operation implemented in this example is a very frequent one, the sum of a vector, so base \Rlang provides a function optimised for efficiently computing it.
<<for-replaced-by-sum>>=
sum(1:5)
@
\begin{warningbox}
It is important to note that a list or vector of length zero is a valid argument to \code{for}: it triggers no error, and the statements in the loop body are simply skipped.
<<for-00a>>=
b <- 0
for (a in numeric()) b <- b + a
print(b)
@
\end{warningbox}
By printing variable \code{b} at each iteration, the partial results can be observed. Braces are needed to form a compound statement from the two simple statements so that \code{print(b)} is also executed at each iteration.
<<for-01>>=
a <- c(1, 4, 3, 6, 8)
for(x in a) {
b <- x*2
print(b)
}
@
\begin{warningbox}
The iteration constructs \Rloop{for}, \Rloop{while}, and \code{repeat} always silently return \code{NULL}, a behaviour that differs from that of \code{if}.
<<for-02>>=
b <- for(x in a) x*2
x
b
@
Thus, as shown in earlier examples of \Rloop{for} loops, computed values need to be assigned to one or more variables within the loop so that they are not lost.
\end{warningbox}
While in the examples above the code directly walked through the values in the vector, an alternative approach is to walk through a sequence of indices, using the extraction operator \Roperator{[ ]} to access the values in vectors or lists. This approach makes it possible to concurrently walk through more than one list or vector. In the example below, one member of vector \code{a} and one of \code{b} are accessed in each iteration, \code{a} providing the input and \code{b} used to store the corresponding computed value.\label{chunk:for:example}
<<for-03a>>=
b <- numeric() # an empty vector
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
}
b
@
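To illustrate walking through more than one input vector concurrently, the sketch below reads from two vectors of equal length and stores each result in a third one. The names \code{heights}, \code{widths} and \code{areas}, as well as the values, are invented for this example. Here the result vector is pre-allocated with \code{numeric(length(heights))}, an alternative to growing it from an empty vector.
<<for-two-vectors>>=
heights <- c(2, 5, 1) # hypothetical input values
widths <- c(3, 4, 6)
areas <- numeric(length(heights)) # pre-allocated result vector
for (j in seq(along.with = heights)) {
  # the same index j extracts matching members from both inputs
  areas[j] <- heights[j] * widths[j]
}
areas
@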
\begin{playground}\label{box:play:forloop}
Adding calls to \code{print()} makes visible the values taken by variables \code{i}, \code{a}, and \code{b} at each iteration. Try to understand where these values come from at each iteration, by playing with the code and modifying it.
<<for-03d, eval=eval_playground>>=
b <- numeric() # an empty vector
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
print(i)
print(a)
print(b)
}
b
@
The same approach of adding calls to \code{print()} can be used for debugging any code that does not return the expected results.
\end{playground}
Above I used \code{seq(along.with = a)} to build a numeric vector containing a sequence of the same length as vector \code{a}. Using this \emph{idiom} ensures that a vector, in this example \code{a}, with length zero will be handled correctly, with \code{numeric(0)} assigned to \code{b}.
\begin{advplayground}
Run the examples below and explain why the two approaches are equivalent only when the length of \code{A} is one or more. Find the answer by assigning vectors of different lengths to \code{A}, including a vector of length zero (using \code{A <- numeric(0)}).
<<for-04, eval=eval_playground>>=
A <- -5:5 # try assigning different numeric vectors to A
B <- numeric(length(A))
for(i in seq(along.with = A)) {
B[i] <- A[i]^2
}
B
C <- numeric(length(A))
for(i in 1:length(A)) {
C[i] <- A[i]^2
}
C
@
\end{advplayground}
\begin{explainbox}
Using \code{seq(along.with = a)}, or its equivalent \code{seq\_along(a)},\qRfunction{seq()}\qRfunction{seq\_along()} as above creates a sequence of integers in \code{i} that indexes all members of \code{a} in the ``walk-through''. There is no requirement in \Rlang for this: including only some of the valid indexes, or including them in arbitrary order, is possible if needed, although this is rarely necessary. On exit from the loop, the iterator \code{i} remains accessible and contains its value at the last iteration.
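As a sketch of these two points, the loop below visits only the odd positions of a vector, in reverse order, and the iterator is inspected after the loop has ended. The names \code{vals}, \code{idx} and \code{picked} are made up for this example.
<<for-index-subset>>=
vals <- c(10, 20, 30, 40, 50)
picked <- numeric()
for (idx in c(5, 3, 1)) { # a subset of the valid indexes, in reverse order
  picked <- c(picked, vals[idx])
}
picked
idx # still accessible, holds the value from the last iteration
@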
\end{explainbox}
Vectorisation usually results in the simplest and fastest code, as shown below (see section \ref{sec:loops:slow} on page \pageref{sec:loops:slow}). However, not all \Rloop{for} loops can be replaced by vectorised statements.
<<for-03c>>=
b <- a^2
b
@
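One case in which a loop is not easily replaced by a single vectorised expression is a recurrence, where each value depends on the value computed in the previous iteration. The sketch below, with invented names and parameter values, iterates the logistic map $x_{n+1} = r\,x_n\,(1 - x_n)$ for a few steps. It is only an illustration of the dependency between iterations, not a recommended way of modelling anything.
<<for-recurrence>>=
r <- 2 # hypothetical growth parameter
pop <- numeric(5) # pre-allocated result vector
pop[1] <- 0.25 # hypothetical starting value
for (n in seq_along(pop)[-1]) { # indexes 2 to 5
  # each new value is computed from the previous result
  pop[n] <- r * pop[n - 1] * (1 - pop[n - 1])
}
pop
@
Because \code{pop[n]} cannot be computed before \code{pop[n - 1]} is known, the iterations have to be evaluated in sequence.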
\begin{explainbox}
\Rloop{for} loops as described above, in the absence of errors, have statically predictable behaviour. The compound statement in the loop will be executed once for each member of the vector or list. Special cases may require the alteration of the normal flow of execution in the loop. Two cases are easy to deal with, one is stopping iteration early with a call to \Rloop{break()}, and another is jumping ahead to the next iteration with a call to \Rloop{next()}. The example below shows the use of these two functions: we ignore negative values contained in \code{a}, and exit or break out of the loop when the accumulated sum \code{b} exceeds 100.
<<for-05>>=
b <- 0
a <- -10:100
idxs <- seq_along(a)
for(i in idxs) {
if (a[i] < 0) next()
b <- b + a[i]
if (b > 100) break()
}