% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'calculator-chunk')
@
\chapter{Base \Rlang: ``Words'' and ``Sentences''}\label{chap:R:as:calc}
\begin{VF}
The desire to economise time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.
\VA{Howard Aiken}{\emph{Proposed automatic calculating machine}, 1937; reprinted 1964}\nocite{Aiken1964}
\end{VF}
%\dictum[Howard Aiken, \emph{Proposed automatic calculating machine}, presented to IBM in 1937]{The desire to economise time and mental effort in arithmetical computations, and to eliminate human liability to error, is probably as old as the science of arithmetic itself.}\vskip2ex
\section{Aims of This Chapter}
In my experience, for those who are not familiar with computer programming languages, the best first step in learning the \Rlang language is to use it interactively by typing textual commands at the \Rpgrm \emph{console}. This teaches not only the syntax and grammar rules, but also gives a glimpse at the advantages and flexibility of this approach to data analysis. In this chapter, I focus on the different simple values or items that can be stored and manipulated in \Rpgrm, as well as the role of computer program statements, the equivalent of ``sentences'' in natural languages.
In the first part of the chapter, you will use \Rlang to do everyday calculations that should be so easy and familiar that you will not need to think about the operations themselves. This easy start will give you a chance to focus on learning how to issue textual commands at the command prompt.
Later in the chapter, you will gradually need to focus more on the \Rlang language and its grammar and less on how commands are entered. By the end of the chapter, you will be familiar with most of the kinds of simple ``words'' used in the \Rlang language and you will be able to read and write simple \Rlang statements.
Throughout the chapter, I will occasionally show the equivalent of the \Rlang code in mathematical notation. If you are not familiar with the mathematical notation, you can safely ignore the mathematics, as long as you understand the diagrams and the \Rlang code.
\section{Natural and Computer Languages}
\index{languages!natural and computer}
Computer languages have strict rules, and the interpreters and compilers that translate these languages into machine code are unforgiving about errors. They will issue error messages, but in contrast to human readers or listeners, will not guess your intentions and continue. However, computer languages have a much smaller set of words than natural languages, such as English. If you are new to computer programming, understanding the parallels between computer and natural languages may be useful.
One can think of constant values and variables (values stored under a name) as nouns and of operators and functions as verbs. A complete command, or statement, is the equivalent of a natural language sentence: ``a comprehensible utterance''. The simple statement \code{a + 1} has three components: \code{a}, a variable; \code{+}, an operator; and \code{1}, a constant. The statement \code{sqrt(4)} has two components: a function, \code{sqrt()}, and a numerical constant, \code{4}. We say that ``to compute $\sqrt{4}$ we \emph{call} \code{sqrt()} with \code{4} as its \emph{argument}''.
Although all values manipulated in a digital computer are stored as \textit{bits} in memory, multiple interpretations are possible. Numbers, letters, logical values, etc., can be encoded into bits and decoded as long as their type or \code{mode} is known. The concept of \code{class} is not directly related to how values are encoded when stored in computer memory, but instead to how they are interpreted as part of a computer program. We can have, for example, RGB colour values stored as three numbers such as \code{0, 0, 255}, as hexadecimal numbers stored as characters such as \code{\#0000FF}, or even as fancy names stored as character strings like \code{"blue"}. We could create a \code{class} for colours using any of these representations, based on two different modes: \code{numeric} and \code{character}.
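As a minimal sketch of this idea, the same colour can be stored under each of the three representations. The variable names are arbitrary, and functions \code{col2rgb()} and \code{rgb()} are assumptions of this example: they come from package \pkgname{grDevices}, normally attached in an \Rlang session, and are used here only to convert between representations.
<<>>=
# three representations of the same colour (illustration only)
blue.rgb <- c(0, 0, 255)   # RGB components stored as numbers
blue.hex <- "#0000FF"      # hexadecimal notation stored as character
blue.name <- "blue"        # colour name stored as character
mode(blue.rgb)
mode(blue.hex)
# converting between representations
col2rgb(blue.name)
rgb(0, 0, 255, maxColorValue = 255)
@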
\section{Numeric Values and Arithmetic}\label{sec:calc:numeric}
\index{classes and modes!numeric, integer, double|(}\index{numbers and their arithmetic|(}\qRclass{numeric}\index{math operators}\index{math functions}\index{numeric values}\qRoperator{+}\qRoperator{-}\qRoperator{*}\qRoperator{/}
When working in \Rlang with arithmetic expressions, the normal mathematical precedence rules are followed and parentheses can be used to alter this order. Parentheses can be nested, but in contrast to the usual practice in mathematics, the same parenthesis symbol is used at all nesting levels.
\begin{explainbox}
Both in mathematics and programming languages \emph{operator precedence rules} determine which subexpressions are evaluated first and which later. Contrary to primitive electronic calculators, \Rlang evaluates numeric expressions containing operators according to the rules of mathematics. In the expression $1 + 2 \times 3$, the product $2 \times 3$ has precedence over the addition, and is evaluated first, yielding as the result of the whole expression, 7. Similar rules apply to other operators, even those taking as operands non-numeric values.
\end{explainbox}
The equivalent of the math expression\qRfunction{exp()}\qRfunction{cos()}\qRconst{pi}
$$
\frac{3 + e^2}{\cos \pi}
$$
is, in \Rlang, written as follows:
<<numbers-0>>=
(3 + exp(2)) / cos(pi)
@
Where constant \Rconst{pi} ($\pi = 3.1415\ldots$) and function \Rfunction{cos()} (cosine) are defined in base \Rlang. Many trigonometric and mathematical functions are available in addition to operators like \verb|+|, \verb|-|, \verb|*|, \verb|/|, and \verb|^|.
\begin{warningbox}
In \Rlang, angles are expressed in radians, thus $\cos(\pi) = -1$ and $\sin(\pi) = 0$, according to trigonometry. Degrees can be converted into radians taking into account that a full circle corresponds to $2 \times \pi$ when expressed in radians and to $360^\circ$ when expressed in degrees. Thus the sine of an angle of $45^\circ$ can be computed as follows.
<<numbers-radians>>=
sin(45/180 * pi)
@
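If such conversions are needed repeatedly, they could be wrapped in a small helper function. This is only a sketch: the name \code{deg2rad()} is hypothetical, chosen for this example, and is not a function from base \Rlang.
<<>>=
# hypothetical helper: convert an angle from degrees into radians
deg2rad <- function(degrees) {
  degrees / 180 * pi
}
deg2rad(180)       # the same value as pi
cos(deg2rad(180))  # -1
sin(deg2rad(90))   # 1
@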
\end{warningbox}
One thing to remember when translating fractions into \Rlang code is that in mathematical notation the bar of a fraction creates a grouping that alters the normal precedence of operations. In contrast, in \Rlang expressions this grouping must be explicitly signalled with additional parentheses.
If you are in doubt about how precedence rules work, you can add parentheses to make sure the order of computations is the one you intend. Redundant parentheses have no effect.
<<numbers-00>>=
1 + 2 * 3
1 + (2 * 3)
(1 + 2) * 3
@
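A few additional cases, given here as an illustrative sketch, show how precedence applies to exponentiation and the unary minus, and how operators of equal precedence are evaluated.
<<>>=
-2^2       # exponentiation is done before the unary minus: -4
(-2)^2     # parentheses change the order: 4
2^3^2      # exponentiation groups from right to left: 2^9 = 512
6 / 2 * 3  # equal precedence, evaluated from left to right: 9
@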
The number of opening (left side) and closing (right side) parentheses must be balanced, and they must be located so that each enclosed term is a valid mathematical expression, i.e., code that can be evaluated to return a value. This value takes the place of the expression enclosed in parentheses before the remainder of the expression is evaluated. For example, \code{(1 + 2) * 3} after evaluating \code{(1 + 2)} becomes \code{3 * 3} yielding \code{9}. In contrast, \code{(1 +) 2 * 3} is a syntax error as \code{1 +} is incomplete and does not yield a number.
\begin{playground}
In \emph{playgrounds} the output from running the code in \Rpgrm is not shown, as these are exercises for you to enter at the \Rpgrm console and run. In general, you should not skip them as in most cases playgrounds aim to teach or demonstrate concepts or features that I have \emph{not} included in full detail in the main text. You are strongly encouraged to \emph{play}, in other words, to create new variations of the examples and execute them to explore how \Rlang works.\qRfunction{sqrt()}\qRfunction{sin()}\qRfunction{log()}\qRfunction{log10()}\qRfunction{log2()}\qRfunction{exp()}
<<numbers-1, eval=eval_playground>>=
1 + 1
2 * 2
2 + 10 / 5
(2 + 10) / 5
10^2 + 1
sqrt(9)
@
<<numbers-1a, eval=eval_playground>>=
pi
sin(pi)
log(100)
log10(100)
log2(8)
exp(1)
@
\end{playground}
Variables\index{variables}\index{assignment} are used to store values. After we \emph{assign} a value to a variable, we can use in our code the name of the variable in place of the stored value. The ``usual'' assignment operator is \Roperator{<-}. In \Rlang, all names, including variable names, are case sensitive. Variables \code{a} and \code{A} are two different variables. Variable names can be long in \Rlang, although it is not a good idea to use very long names. Here I am using very short names, something that is usually also a very bad idea. However, in the examples in this chapter, where the stored values have no connection to the real world, simple names emphasise their abstract nature. In the chunk below, \code{vct1} and \code{vct2} are arbitrarily chosen variable names; I would have used names like \code{height.cm} or \code{outside.temperature.C} if they had been useful to convey information.
In the book, I use variable names that help recognise the kind of object stored, as this is most relevant when learning \Rlang. Here I use \code{vct1} because in \Rlang, as we will see on page \pageref{par:numeric:vectors:start}, numeric objects are always vectors, even when of length one.
<<numbers-2>>=
vct1 <- 1
vct1 + 1
vct1
vct2 <- 10
vct2 <- vct1 + vct2
vct2
@
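Case sensitivity of names, mentioned above, can be checked with a short sketch. The names used are arbitrary examples, and \code{exists()} is used only to test whether a variable with a given name is defined.
<<>>=
# names differing only in case refer to different variables
height.cm <- 180
Height.cm <- 175
height.cm
Height.cm
exists("HEIGHT.CM")  # FALSE, unless defined earlier in your session
@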
Entering the name of a variable \emph{at the \Rlang console} implicitly calls function \code{print()} displaying the stored value on the console. The same applies to any other statement entered \emph{at the \Rlang console}: \code{print()} is implicitly called with the result of executing the statement as its argument.
<<numbers-2a>>=
vct1
print(vct1)
vct1 + 1
print(vct1 + 1)
@
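One related detail, sketched here with an arbitrary variable name: the value of an assignment is returned \emph{invisibly}, so an assignment statement on its own displays nothing at the console. Enclosing the assignment in parentheses makes the value visible, and it is then printed like the result of any other statement.
<<>>=
vct.demo <- 5    # nothing is displayed
(vct.demo <- 5)  # the parentheses make the value visible
vct.demo + 1     # displayed through the implicit call to print()
@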
\begin{playground}
There are some syntactically legal assignment statements that are not very frequently used, but you should be aware that they are valid, as they will not trigger error messages and may surprise you. The most important thing is to write code consistently. The ``backwards'' assignment operator \Roperator{->} and resulting code like \code{1 -> VCT1}\index{assignment!leftwise} are valid but less frequently used. The use of the equals sign (\Roperator{=}) for assignment in place of \Roperator{<-}, although valid, is discouraged. Chaining\index{assignment!chaining} assignments, as in the first statement below, can be used to signal to the human reader that \code{VCT1}, \code{VCT2} and \code{VCT3} are being assigned the same value.
<<numbers-3, tidy=FALSE, eval=eval_playground>>=
VCT1 <- VCT2 <- VCT3 <- 0
VCT1
VCT2
VCT3
1 -> VCT1
VCT1
VCT1 = 3
VCT1
remove(VCT1, VCT2, VCT3) # cleanup
@
\end{playground}
\begin{explainbox}\label{box:integer:float}
In\index{numeric, integer and double values} \Rlang, all numbers belong to mode \Rclass{numeric} (we will discuss the concepts of \emph{mode} and \emph{class} in section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). We can query if the mode of an object is \Rclass{numeric} with function \Rfunction{is.numeric()}. The returned values are either \code{TRUE} or \code{FALSE}. These are logical values that will be discussed in section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}.
<<classes-01>>=
mode(1)
vct1 <- 1
is.numeric(vct1)
@
Because numbers can be stored in computer memory in different formats, most computing languages, including \Rlang, implement multiple types of numerical values. In most cases, \Rpgrm's \code{numeric} values can be used everywhere that a number is expected. However, in some cases, explicitly using class \Rclass{integer} to indicate that we will store or operate on whole numbers can be advantageous. \Rclass{integer} constants are identified by a trailing capital ``L'', as in \code{32L}.
<<classes-02>>=
is.numeric(1L)
is.integer(1L)
is.double(1L)
@
Real numbers are a mathematical abstraction, and do not have an exact equivalent in computers. Instead of Real numbers, computers store and operate on numbers that are restricted to a broad but finite range of values and have a finite resolution. They are called \emph{floats} (or \emph{floating-point} numbers); in \Rlang they go by the name of \Rclass{double} and can be created with the constructor \Rfunction{double()}.
<<classes-03>>=
is.numeric(1)
@
<<classes-03a>>=
is.integer(1)
is.double(1)
@
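The type of the result of a calculation depends on the operands and the operator. The sketch below uses function \code{typeof()}, which is not otherwise used in this box, to query how a value is stored.
<<>>=
typeof(1L)       # "integer"
typeof(1)        # "double"
typeof(1L + 1L)  # "integer"
typeof(1L + 1)   # "double", the integer operand is converted
typeof(4L / 2L)  # "double", division with / returns a double
as.integer(2.9)  # conversion discards the fractional part: 2
@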
\end{explainbox}
\index{vectors!introduction|(}\label{par:calc:vectors:diag}
Vectors\label{par:numeric:vectors:start} are one-dimensional in structure, of varying length and used to store similar values, e.g., numbers. They are different from the vectors, commonly used in Physics when describing directional forces, which are symbolised with an arrow as an ``accent'', such as $\overrightarrow{\mathbf{F}}$. In \Rlang numeric values and other atomic values are always \Rclass{vector}s that can contain zero, one or more elements. The diagram below exemplifies a vector containing ten elements, also called members. These elements can be extracted using integer numbers as positional indices, and manipulated as described in more detail in section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}.\vspace{1ex}
\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=codeshadecolor},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}},
row 1 column 1/.style={nodes={draw}}}]
\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
& & & & & & & & & \\};
\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-10.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\textcolor{blue}{\ \code{<name>}\strut}};
\draw (array-1-1.north)--++(90:3mm) node [above] (first) {First index};
\draw (array-1-10.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-10.east)--++(0:3mm) node [right]{Elements or \textcolor{blue}{\code{<values>}}};
\node [align=center, anchor=south] at (array-2-9.north west|-first.south) (8) {element at index 9};
\draw (8)--(box);
%
\end{tikzpicture}
\end{footnotesize}
\end{center}
Vectors, in mathematical notation, are similarly represented using positional indexes as subscripts,
\begin{equation}\label{eq:vector}
a_{1\ldots n} = a_1, a_2, \ldots, a_i, \ldots, a_n,
\end{equation}
where $a_{1\ldots n}$ is the whole vector and $a_1$ its first member. The length of $a_{1\ldots n}$ is $n$ as it contains $n$ members. In the diagram above $n = 10$.
As you have seen above, the results of calculations were printed preceded with \code{[1]}. This is the index or position in the vector of the first number (or other value) displayed at the head of the current line. As in \Rlang single values are vectors of length one, when they are printed, they are also preceded with \code{[1]}.\label{par:print:vec:index}
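The role of these bracketed indices is easier to see when a vector is too long to fit on a single line. In this small example, the operator \code{:} creates a sequence of whole numbers. How many values are displayed per line, and thus which indices appear at the start of each line, depends on the width of your console.
<<>>=
# a vector long enough to be displayed on more than one line
101:130
@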
One\label{par:calc:concatenate} can use function \Rfunction{c()} ``concatenate'' to create a vector from other vectors, including vectors of length 1, such as the \code{numeric} constants in the statements below, or even vectors of length 0. The first example shows an anonymous vector created, printed, and then automatically discarded.
<<numbers-4aann>>=
c(3, 1, 2)
@
To be able to reuse the vector, we assign it to a variable, giving a name to it. The length of a vector can be queried with function \Rfunction{length()}. Below, \Rlang code is followed by diagrams depicting the structure of the vectors created.
<<numbers-4aa>>=
vct4 <- c(3, 1, 2)
length(vct4)
vct4
@
%\begin{center}
\noindent
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]
\matrix[array] (array) {
1 & 2 & 3 \\
3 & 1 & 2 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct4\phantom{mm}}};
\draw (array-1-3.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}
<<numbers-4bb>>=
vct5 <- c(4, 5, 0)
vct5
@
\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]
\matrix[array] (array) {
1 & 2 & 3 \\
4 & 5 & 0 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct5\phantom{mm}}};
\draw (array-1-3.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}
<<numbers-4cc>>=
vct6 <- c(vct4, vct5)
vct6
@
\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]
\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6\\
3 & 1 & 2 & 4 & 5 & 0 \\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-6.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct6\phantom{mm}}};
\draw (array-1-6.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-6.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}
<<numbers-4dd>>=
vct7 <- c(vct5, vct4)
vct7
@
\noindent
%\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily,
array/.style={matrix of nodes,nodes={draw, minimum size=7mm, fill=blue!20},column sep=-\pgflinewidth, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=5mm}}}]
\matrix[array] (array) {
1 & 2 & 3 & 4 & 5 & 6\\
4 & 5 & 0 & 3 & 1 & 2\\};
%\node[draw, fill=gray, minimum size=4mm] at (array-2-9) (box) {};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-6.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\strut\code{\ vct7\phantom{mm}}};
\draw (array-1-6.east)--++(0:3mm) node [right]{\code{integer} positional indices};
\draw (array-2-6.east)--++(0:3mm) node [right]{\code{numeric} values};
%
\end{tikzpicture}
\end{footnotesize}
%\end{center}
One or more member values of a vector can be extracted using the positional indexes and the extraction operator \Roperator{[ ]}. The returned value is a new vector. Member extraction is discussed in detail in section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}.
<<numeric-extract-member>>=
vct7[3]
vct7[c(6, 2)]
@
\begin{faqbox}{How to create an empty vector?}
<<numeric-empty-faq>>=
numeric()
@
\end{faqbox}
Next, I show concatenation of two vectors of the same class, the second of them of length zero.
<<numbers-4ee>>=
c(vct7, numeric())
@
Function \code{c()} accepts as arguments two or more vectors and concatenates them, one after another. Quite frequently we may need to insert one vector in the middle of another. For this operation, \code{c()} is not useful by itself. One could use indexing combined with \code{c()}, but this is not needed as \Rlang provides a function capable of directly doing this operation. Although it can be used to ``insert'' values, it is named \code{append()}, and by default, it indeed appends one vector at the end of another.
<<numbers-4a>>=
append(vct4, vct5)
@
The output above is the same as for \code{c(vct4, vct5)}; however, \Rfunction{append()} accepts as an argument an index position after which to ``append'' its second argument. This results in an \emph{insert} operation when the index points at any position different from the end of the vector.\label{par:calc:append:end}
<<numbers-4b>>=
append(vct4, values = vct5, after = 2)
@
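For comparison, the same insertion can be obtained by combining indexing with \code{c()}, as sketched below. The variable names \code{vct.a} and \code{vct.b} are arbitrary and used only in this example.
<<>>=
vct.a <- c(3, 1, 2)
vct.b <- c(4, 5, 0)
# insert vct.b after the second member of vct.a
append(vct.a, values = vct.b, after = 2)
# the same result obtained with indexing and c()
c(vct.a[1:2], vct.b, vct.a[3])
# after = 0 places the new values at the start
append(vct.a, values = vct.b, after = 0)
@
With \code{append()} the intention is stated directly, while the version based on \code{c()} requires the indices to be worked out by hand.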
\begin{playground}\label{pg:seq:rep}
One can create sequences\index{sequence} using function \Rfunction{seq()} or the operator \Roperator{:}, or repeat values using function \Rfunction{rep()}. In this case, I leave it to the reader to work out the rules by running these and their own examples, with the help of the documentation, available through \code{help(seq)} and \code{help(rep)}.
<<numbers-5, eval=eval_playground>>=
-1:5
5:-1
seq(from = -1, to = 1, by = 0.1)
rep(-5, times = 4)
rep(1:2, length.out = 4)
@
\end{playground}
\begin{faqbox}{How to create a vector of zeros?}
<<numeric-zeros1-faq>>=
numeric(length = 10)
@
or
<<numeric-zeros2-faq>>=
rep(0, times = 10)
@
\end{faqbox}
Next,\label{par:calc:vectorised:opers} something that makes \Rlang different from most other programming languages: vectorised arithmetic\index{vectorised arithmetic}. Operators and functions that are vectorised accept, as arguments, vectors of arbitrary length, in which case the result returned is equivalent to having applied the same function or operator individually to each element of the vector.\label{par:vectorised:numeric}
<<numbers-6aa>>=
log10(100)
log10(c(10, 5, 100, 200))
@
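The equivalence described above can be checked directly. In this sketch, a single vectorised call to \code{sqrt()} returns the same values as three separate calls whose results are concatenated.
<<>>=
# one vectorised call ...
sqrt(c(1, 4, 9))
# ... is equivalent to three separate calls
c(sqrt(1), sqrt(4), sqrt(9))
@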
Function \Rfunction{sum()} accepts vectors of different lengths as input but is not vectorised, as it always returns a vector of length one as result.
<<numbers-6ab>>=
sum(100)
sum(c(10, 5, 100, 200))
@
A vectorised sum, also called a parallel sum of vectors, to differentiate it from obtaining the sum of the members of a vector, as computed above with function \Rfunction{sum()}, is the usual way in which operators like \Roperator{+} and other arithmetic operators and functions work in \Rlang.
<<numbers-6ac>>=
c(3, 1, 2) + c(1, 2, 31)
@
Vectorised\index{recycling of arguments}\index{recycling of operands} functions and operators that operate on more than one vector simultaneously, in many cases accept vectors of mismatched length as arguments or operands. When two or more vectors are of different length, these functions and operators recycle the shorter vector(s) to match the length of the longest one. The two statements below are equivalent; in the first statement, the short vector \code{1} is first recycled into \code{c(1, 1, 1)}. The operation, addition in this example, is applied to the numbers stored at the same position in the two vectors, returning a new vector.
<<numbers-6ad>>=
c(3, 1, 2) + 1
c(3, 1, 2) + c(1, 1, 1)
@
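Recycling also applies when the shorter vector has more than one member. As a sketch, the recycling can be made explicit with function \code{rep()}: in the example below the shorter operand \code{1:2} is used twice to match a vector of length four.
<<>>=
# the shorter operand is recycled twice
c(3, 1, 2, 4) + 1:2
# the same computation with the recycling made explicit
c(3, 1, 2, 4) + rep(1:2, times = 2)
@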
In the second code statement (line) below, \code{vct4} is of length \Sexpr{length(vct4)}, but the \code{numeric} constants \code{1} and \code{2} are vectors of length 1. Each of these short constant vectors is extended, by recycling (replicating) its value, into a longer vector, i.e., a vector of the same length as the longest vector in the statement, \code{vct4}.\label{par:recycling:numeric}
<<numbers-6>>=
vct4 <- c(3, 1, 2)
(vct4 + 1) * 2
vct4 * 0:1
vct4 - vct4
@
Make sure you understand what calculations are taking place in the chunk above, and also the one below. Vectorisation and vector recycling are key features of the \Rlang language.
<<numbers-6a>>=
vct8 <- rep(1, 6)
vct8
vct8 + 1:2
vct8 + 1:3
vct8 + 1:4
@
\begin{playground}
Create further variants of the statements in the code chunk above to work out when warnings or errors are issued. Does the length of the operands matter?
\end{playground}
\begin{warningbox}
Most functions defined in base \Rlang apply recycling to vectors passed as arguments to at least some of their parameters. When recycling is supported, the conditions triggering warnings or errors are consistent with those you discovered in the playground above. However, if and how recycling is applied depends on how functions have been defined. Thus, there is variation, especially, but not only, in the case of functions and operators defined in contributed extension packages. For example, package \pkgname{tibble} and some other packages in the \pkgname{tidyverse} support recycling but some boundary cases that trigger a warning in base \Rlang functions, trigger an error in functions defined in these packages. See section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} about package \pkgname{tibble}.
\end{warningbox}
\begin{explainbox}
As mentioned above, a vector can contain zero or more member values. Vectors of length zero may seem at first sight quite useless, but in practice they are very useful. They allow the handling of ``no input'' or ``nothing to do'' cases as normal cases, which in the absence of vectors of length zero would need to be treated as special cases. Constructors for \Rlang classes like \Rfunction{numeric()} return vectors of a length given by their first argument, which defaults to zero.
<<>>=
vct9 <- numeric(length = 0) # named argument
vct9
length(vct9)
@
<<>>=
numeric() # default argument
@
Vectors of length zero behave, in most cases, as expected; e.g., they can be concatenated as shown here.
<<>>=
length(c(vct4, vct9, vct5))
length(c(vct4, vct5))
@
Many functions, such as \Rlang's maths functions and operators, will accept numeric vectors of length zero as valid input, returning also a vector of length zero, issuing neither a warning nor an error message. In other words, \emph{these are valid operations} in \Rlang.
<<>>=
log(numeric(0))
5 + numeric(0)
@
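A further sketch, with an arbitrary variable name, contrasts vectorised operations, which return a vector of length zero, with a summary function such as \code{sum()}, which returns a vector of length one even when its input is empty.
<<>>=
vct.empty <- numeric()
length(log(vct.empty))  # 0
length(vct.empty * 2)   # 0
sum(vct.empty)          # 0, the sum of no values
@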
Even when of length zero, vectors do have to belong to a class acceptable for the operation: \code{5 + character(0)} is an error (\code{character} values are described in section \ref{sec:calc:character} on page \pageref{sec:calc:character}).
Passing as an argument to parameter \code{length} a value larger than zero creates a longer vector filled with zeros in the case of \Rfunction{numeric()}.
<<>>=
numeric(length = 5)
@
The length of a vector can be explicitly increased, with missing values filled automatically with \code{NA}, the marker for not available.
<<>>=
vct10 <- 1:5
length(vct10) <- 10
vct10
@
If the length is decreased, the values in the \emph{tail} of the vector are discarded.
<<>>=
vct11 <- 1:10
vct11
length(vct11) <- 5
vct11
@
\end{explainbox}
\label{par:numeric:vectors:end}\index{vectors!introduction|)}
There\index{special values!NA} are some special values available for numbers. \Rconst{NA}, meaning ``not available'', is used for missing values. \Rconst{NA} values play a very important role in the analysis of data, as frequently some observations are missing from an otherwise complete data set due to ``accidents'' during the course of an experiment or survey. It is important to understand how to interpret \Rconst{NA} values: They are placeholders for something that is unavailable, in other words, whose value is \emph{unknown}. \Rconst{NA} values propagate when used, so that numerical computations yield \Rconst{NA} when one or more of the input values is unknown.
<<numbers-8>>=
vct12 <- c(NA, 5)
vct12
vct12 + 1
@
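Propagation also affects functions that summarise a whole vector. As a sketch, with an arbitrary variable name, functions such as \code{sum()} and \code{mean()} return \code{NA} when any value is missing, unless the missing values are explicitly excluded by passing \code{na.rm = TRUE}.
<<>>=
vct.obs <- c(NA, 5, 3)
sum(vct.obs)                 # NA
mean(vct.obs)                # NA
sum(vct.obs, na.rm = TRUE)   # 8
mean(vct.obs, na.rm = TRUE)  # 4
@
Excluding \code{NA} values is a decision about the data analysis, so it has to be requested explicitly.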
Calculations\index{special values!NaN}\label{par:special:values} can also yield the following values: \Rconst{NaN}, ``not a number'', and \Rconst{Inf} and \Rconst{-Inf} for $\infty$ and $-\infty$. As you will see below, calculations yielding these values do \textbf{not} trigger errors or warnings, as they are arithmetically valid. \Rconst{Inf} and \Rconst{-Inf} are also valid numerical values for input and constants.
<<numbers-8a>>=
vct12 + Inf
Inf / vct12
-1 / 0
1 / 0
Inf / Inf
Inf + 4
-Inf * -1
@
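Because these special values are ordinary \code{numeric} values, it can be useful to test for them explicitly. The chunk below is a minimal sketch using the base \Rlang functions \Rfunction{is.nan()}, \Rfunction{is.infinite()} and \Rfunction{is.finite()}, together with \Rfunction{is.na()}; the values tested are arbitrary and chosen only for illustration.
<<>>=
is.nan(c(1, NA, NaN, Inf))       # TRUE only for NaN
is.infinite(c(1, NA, NaN, -Inf)) # TRUE only for Inf and -Inf
is.finite(c(1, NA, NaN, Inf))    # TRUE only for ordinary numbers
is.na(c(1, NA, NaN, Inf))        # TRUE for both NA and NaN
@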
\begin{playground}
\textbf{When to use vectors of length zero, and when \code{NA}s?}\index{zero length objects}\index{vectors!zero length} Make sure you understand the logic behind the different behaviour of functions and operators with respect to \code{NA} and \code{numeric()} or its equivalent \code{numeric(0)}. What do they represent? Why are \Rconst{NA}s not ignored, while vectors of length zero are?
<<numbers-PG00, eval=eval_playground>>=
123 + numeric()
123 + NA
@
\emph{Model answer:}
\Rconst{NA} values are used to signal a value that ``was lost'' or ``was expected'' but is unavailable because of some accident. A vector of length zero represents no values, but within the normal expectations. In particular, if vectors are expected to have a certain length, or if index positions along a vector are meaningful, then using \Rconst{NA} is a must.
\end{playground}
With few exceptions, operations involving one or more \Rconst{NA} values, even tests of equality, return \Rconst{NA}. In other words, when one input to a calculation is unknown, the result of the calculation is unknown. This means that a special function is needed for testing for the presence of \code{NA} values.
<<numbers-8b>>=
is.na(c(NA, 1))
@
In the example above, we can also see that \Rfunction{is.na()} is vectorised, and that it applies the test to each of the elements of the vector individually, returning the result as \code{TRUE} or \code{FALSE}.
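The need for a special function can be seen in a short sketch: comparing against \code{NA} with \code{==} returns \code{NA} for every member, while \Rfunction{is.na()} returns usable \code{logical} values that can, for example, be counted with \Rfunction{sum()}. The numbers used are made up for illustration.
<<>>=
c(3, NA, 7) == NA        # always NA, never TRUE
is.na(c(3, NA, 7))       # TRUE only for the missing value
sum(is.na(c(3, NA, 7)))  # number of missing values
@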
One\index{precision!math operations}\index{numbers!floating point} needs to be aware of the consequences of numbers in computers being almost always stored with finite precision and/or range: the expectations derived from the mathematical definition of Real numbers are not always fulfilled. See the box on page \pageref{box:floats} for an in-depth explanation.
<<numbers-9>>=
1 - 1e-20
@
When using \Rclass{integer}\index{numbers!whole}\index{numbers!integer} values these problems do not exist, as integer arithmetic is not affected by loss of precision in calculations restricted to integers. Because of the way integers are stored in the memory of computers, within the representable range, they are stored exactly. One can think of computer integers as a subset of whole numbers restricted to a certain range of values.
<<integers-1>>=
1L + 3L
1L * 3L
@
Using the ``usual'' division operator yields a floating-point \code{double} result, while the integer division operator \Roperator{\%/\%} yields an \code{integer} result, and the modulo operator \Roperator{\%\%} returns the remainder from the integer division.
<<integers-1a>>=
1L / 3L
1L %/% 3L
1L %% 3L
@
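The two operators \Roperator{\%/\%} and \Roperator{\%\%} are consistent with each other: multiplying the integer quotient by the divisor and adding the remainder recovers the dividend. The chunk below is a sketch with arbitrarily chosen values. It also shows that in \Rlang the integer quotient is rounded towards $-\infty$ and the remainder takes the sign of the divisor, which matters for negative operands.
<<>>=
7L %/% 2L
7L %% 2L
(7L %/% 2L) * 2L + 7L %% 2L    # recovers 7L
-7L %/% 2L                     # quotient rounded towards minus infinity
-7L %% 2L                      # remainder has the sign of the divisor
(-7L %/% 2L) * 2L + -7L %% 2L  # recovers -7L
@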
If an operation would create an \code{integer} value that falls outside the range representable in \Rlang, the value returned is \code{NA} (not available).
<<integers-1b>>=
1000000L * 1000000L
@
Both doubles and integers are considered numeric. In most situations, conversion is automatic and we do not need to worry about the differences between these two types of numeric values. The functions in the next chunk return \code{TRUE} or \code{FALSE}, i.e., \code{logical} values (see section \ref{sec:calc:boolean} on page \pageref{sec:calc:boolean}).\index{numbers!double}\index{numbers!integer}
<<integers-2>>=
is.numeric(1L)
is.integer(1L)
is.double(1L)
is.double(1L / 3L)
is.numeric(1L / 3L)
@
\begin{advplayground}
Study the variations of the previous example shown below, and explain why the two statements return different values. Hint: 1 is a \code{double} constant. You can use \code{is.integer()} and \code{is.double()} in your explorations.
<<integers-PG1, eval=eval_playground>>=
1 * 1000000L * 1000000L
1000000L * 1000000L * 1
@
\end{advplayground}
\begin{explainbox}
\label{box:floats} \label{par:float}
\index{integer numbers!arithmetic|(}\index{double precision numbers!arithmetic|(}
\index{floating point numbers!arithmetic|(}\index{machine arithmetic!precision|(}
\index{floats|see{floating point numbers}}\index{machine arithmetic!rounding errors}\index{Real numbers and computers}\index{integer numbers and computers}
\index{EPS ($\epsilon$)|see{machine arithmetic precision}}%
The usual way to store numerical values in computers is to reserve a fixed amount of space in memory for each value, which imposes limits on which numbers can be represented, and on the maximum precision that can be achieved. The difference between \Rclass{integer} and \Rclass{double} is explained on page \pageref{box:integer:float}. Integers, or ``whole numbers'', like \Rlang \Rclass{integer} values are always stored with the same resolution such that the smallest difference between two integer values is 1. The amount of memory available to store an individual value creates a limit for the size of the largest and smallest values that can be represented. Thus integers in \Rlang behave like Integers or whole numbers as defined in mathematics, but constrained to a restricted finite range of values. In the computing language \Clang, different types of integer numbers are available, such as \code{short} and \code{long}, which differ in the size of the space reserved for them in memory. \Rlang \Rclass{integer} type is equivalent to \code{long} in \Clang, thus the use of \code{L} for integer constant values like \code{5L}.
Floating point numbers like \Rlang \Rclass{double} values are stored in two parts: an integer \emph{significand} and an integer \emph{exponent}, each part using a fixed amount of space in memory. The relative resolution is constrained by the number of digits that can be stored in the significand while the absolute size of the largest and smallest numbers that can be represented is limited by the largest and smallest values that fit in the memory reserved for the exponent. In many computing languages, different types of floating point numbers are available, which differ in the size of the space reserved for them in memory. The properties of Real numbers as defined in mathematics differ from floating point numbers in assuming unlimited resolution and an unlimited range of representable values.
In \Rpgrm, numbers that are not integers are stored as \emph{double-precision floats}. Precision of numerical values in computers is usually symbolised by ``epsilon'' ($\epsilon$), commonly abbreviated \emph{eps}, defined as the smallest positive value of $\epsilon$ for which $1 + \epsilon \neq 1$. The finite resolution of floats can lead to unexpected results when testing for equality or inequality. Tests for equality are done with operator \code{==}. The use of this and other comparison operators is explained in section \ref{sec:calc:comparison} on page \pageref{sec:calc:comparison}.
<<comparison-5>>=
1e20 == 1 + 1e20
1 == 1 + 1e-20
0 == 1e-20
@
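A common way around this limitation is to compare values allowing for a small tolerance instead of testing for exact equality. The chunk below is a sketch of two possible approaches: base \Rlang function \Rfunction{all.equal()}, which uses a small tolerance by default, wrapped in \Rfunction{isTRUE()}, and an explicit comparison of the absolute difference against a tolerance. The tolerance \code{1e-8} is an arbitrary choice for illustration, not a universal rule.
<<>>=
0.1 + 0.2 == 0.3                   # FALSE because of rounding errors
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE, compares with a tolerance
abs((0.1 + 0.2) - 0.3) < 1e-8      # explicit, arbitrarily chosen tolerance
@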
Another way of revealing the limited precision is during conversion to \code{character}.
<<numbers-EB10>>=
format(5.123, digits = 16) # near maximum resolution
format(5.123, digits = 22) # more digits than in resolution
@
The accumulation of successive small losses of precision from multiple operations on \Rlang \code{double} values can be a problem. Thus when computations involve both very large and very small numbers, the returned value can depend on the order of the operations. In practice ordinary users rarely need to be concerned about losses in precision except when testing for equality and inequality. On the other hand, finite resolution of \code{double} numerical values can explain why sometimes returned values for equivalent computations differ, and why some computation algorithms may be preferable, and others even fail, in specific cases.
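The dependence on the order of operations can be seen in a minimal sketch. The two statements below are mathematically equivalent, but in the first one the 1 is lost when added to \code{1e20}, because it is smaller than the resolution of a \code{double} of that magnitude. The values are chosen only to make the effect obvious.
<<>>=
(1e20 + 1) - 1e20  # the 1 is lost in the addition
(1e20 - 1e20) + 1  # same values, different order
@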
As the \Rpgrm program can be used on different types of computer hardware, the actual machine limits for storing numbers in memory may vary depending on the type of processor and even the compiler used to build the \Rpgrm program executable. However, it is possible to obtain these values at run time, i.e., while \Rpgrm is being used, from the variable \code{.Machine}, which is part of the \Rlang language. Please see the help page for \code{.Machine} for a detailed and up-to-date description of the available constants. \emph{Beware that when you run the examples below, the values returned by \Rlang in your own computer can differ from those returned in the computer I have used to typeset the book as you are reading it here.}\qRconst{.Machine\$double.eps}\qRconst{.Machine\$double.neg.eps}\qRconst{.Machine\$double.max.exp}
<<machine-eps-01>>=
.Machine$double.eps
.Machine$double.neg.eps
.Machine$double.max.exp # full name; double.max works only by partial matching
.Machine$double.min.exp # full name; double.min works only by partial matching
.Machine$double.base
@
The third and fourth values are the largest and smallest exponents of the base number or \emph{radix}, \Sexpr{.Machine$double.base}, shown last, rather than the maximum and minimum size of numbers that can be handled as objects of class \Rclass{double}. The maximum size of normalised \code{double} values, given by \code{.Machine\$double.xmax}, is much larger than the maximum value of \code{integer} values, given by \code{.Machine\$integer.max}.\qRconst{.Machine\$double.min.exp}\qRconst{.Machine\$double.xmax}\qRconst{.Machine\$integer.max}
<<machine-eps-01a>>=
.Machine$double.xmax
.Machine$integer.max
@
As \Rclass{integer} values are stored in machine memory without loss of precision, epsilon is not defined for \Rclass{integer} values.
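In \Rlang not all out-of-range \code{numeric} values behave in the same way: \code{double} values too large to be represented are stored as \Rconst{-Inf} or \Rconst{Inf} and enter arithmetic as infinite values according to the mathematical rules, while those too close to zero to be represented are stored as zero. In contrast, \code{integer} arithmetic that overflows returns \code{NA} with a warning, and an out-of-range constant written with the \code{L} suffix is stored as a \code{double}, also with a warning.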
<<machine-eps-02>>=
1e1026
1e-1026
@
<<machine-eps-03, warning=TRUE>>=
2147483699L
@
In those statements in the chunk below where at least one operand is \Rclass{double}, the \Rclass{integer} operands are \emph{promoted} to \Rclass{double} before computation. A similar promotion does not take place when all operands are \Rclass{integer} values, which can result in \emph{overflow}\index{arithmetic overflow}\index{overflow|see{arithmetic overflow}}, meaning numbers that are too big to be represented as \Rclass{integer} values.
<<machine-eps-04>>=
2147483600L + 99L
2147483600L + 99
2147483600L * 2147483600L
2147483600L * 2147483600
@
The exponentiation operator \Roperator{\^{}} forces the promotion\index{type promotion}\index{arithmetic overflow!type promotion} of its arguments to \Rclass{double}, resulting in no overflow. In contrast, as seen above, the multiplication operator \Roperator{*} operates on \code{integer} values resulting in overflow.
<<machine-eps-05>>=
2147483600L * 2147483600L
2147483600L^2L
@
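When overflow is a risk, one possible approach, sketched below, is to convert one of the operands explicitly to \Rclass{double} with \Rfunction{as.numeric()}, so that the whole computation is done in \Rclass{double} arithmetic. Whether the finite resolution of \Rclass{double} values is acceptable for very large results depends on the use case.
<<>>=
as.numeric(2147483600L) * 2147483600L             # double arithmetic, no overflow
is.double(as.numeric(2147483600L) * 2147483600L)  # the result is a double
.Machine$integer.max                              # largest representable integer
@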
\index{integer numbers!arithmetic|)}\index{double precision numbers!arithmetic|)}
\index{floating point numbers!arithmetic|)}\index{machine arithmetic!precision|)}
\end{explainbox}
Both\label{par:calc:round} for display and as part of computations, we may want to decrease the number of significant digits or the number of digits after the decimal marker. Be aware that in the examples below, even if printing is being done by default, these functions return \code{numeric} values that are different from their input and can be stored and used in computations. Function \Rfunction{round()} is used to round numbers to a certain number of decimal places after or before the decimal marker, with a positive or negative value for \code{digits}, respectively. In contrast, function \Rfunction{signif()} rounds to the requested number of significant digits, i.e., ignoring the position of the decimal marker.
<<convert-3>>=
round(0.0124567, digits = 3)
signif(0.0124567, digits = 3)
round(1789.1234, digits = -1)
round(1789.1234, digits = 3)
signif(1789.1234, digits = 3)
@
<<convert-3x>>=
vct13 <- 0.12345
vct14 <- round(vct13, digits = 2)
vct13 == vct14
vct13 - vct14
vct14
@
\begin{explainbox}
Functions are described in detail in section \ref{sec:script:functions} on page \pageref{sec:script:functions}. Here I describe them briefly in relation to their use. Functions are objects containing \Rlang code that can be used to perform an operation on values passed as arguments to their parameters. They return the result of the operation as a single \Rlang object, or less frequently, as a side effect. Functions have a name like any other \Rlang object. If the name of a function is followed by parentheses \code{()} and included in a code statement, it becomes a function \emph{call} or a ``request'' for the code stored in the function object to be run. Many functions accept \Rlang objects and/or constant values as \emph{arguments} to their \emph{formal parameters}. Formal parameters are placeholder names in the code stored in the function object, or the \emph{definition} of the function. In a function call, the code in its definition is evaluated (or run) with formal parameter names taking the values passed as arguments to them.
In a function definition, formal parameters can be assigned default values, which are used if no explicit argument is passed in the call. Arguments can be passed to formal parameters by name or by position. In most cases, passing arguments by name makes the code easier to understand and more robust against coding mistakes. In the examples presented in the book, I most frequently pass arguments by name, except for the first parameter.
As \code{digits} is the second parameter of \Rfunction{round()}, its argument can also be passed by position.
<<convert-3a>>=
round(0.0124567, digits = 3)
round(0.0124567, 3)
@
When passing arguments by name, in most cases unambiguous partial matching is acceptable, but can make code difficult to read.
<<convert-3b>>=
round(0.0124567, di = 3)
@
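As a further sketch of these rules, the chunk below first relies on the default argument of \code{digits} in \Rfunction{round()}, and then passes arguments by name in an order different from that in the function definition. Function \Rfunction{seq()} is used here only as a convenient illustration of a function with several named parameters.
<<>>=
round(0.0124567)                # the default, digits = 0, is used
seq(from = 1, to = 10, by = 3)
seq(by = 3, to = 10, from = 1)  # same call with arguments in another order
@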
\end{explainbox}
Functions \Rfunction{trunc()} and \Rfunction{ceiling()} return a whole number as a new numeric value. They differ in how they handle positive and negative values, and neither of them rounds the returned value to the nearest whole number. Hint: you can use \code{help(trunc)} or \code{?trunc} at the \Rpgrm console, or the help tab of \RStudio to find out how they differ.
\begin{playground}
What does value truncation mean? Function \Rfunction{trunc()} truncates a numeric value, but it does not return an \code{integer}.
\begin{itemize}
\item Explore how \Rfunction{trunc()} and \Rfunction{ceiling()} differ. Test them both with positive and negative values.
\item \textbf{Advanced} Use function \Rfunction{abs()} and operators \Roperator{+} and \Roperator{-} to reproduce the output of \Rfunction{trunc()} and \Rfunction{ceiling()} for the different inputs.
\item Can \Rfunction{trunc()} and \Rfunction{ceiling()} be considered type conversion functions in \Rlang?
\end{itemize}
\end{playground}
\begin{explainbox}
\Rlang supports complex numbers and arithmetic operations with class \Rclass{complex}. As complex numbers rarely appear in user-written scripts, I give only one example of their use. Complex numbers, as defined in mathematics, have two parts, a real component and an imaginary one. Complex numbers can be used, for example, to describe the result of $\sqrt{-1} = 1i$.
<<numbers-complex>>=
cmp1 <- complex(real = c(-1, 1), imaginary = c(0, 0))
cmp1
cmp2 <- sqrt(cmp1)
cmp2
cmp2^2
@
\end{explainbox}
\index{classes and modes!numeric, integer, double|)}\index{numbers and their arithmetic|)}
\begin{warningbox}
Instants in time and periods of time in computers are usually encoded as classes derived from \code{integer}, and thus in \Rlang these classes are considered atomic and their objects, vectors. Some of these encodings are standardised and supported by \Rlang classes \Rclass{POSIXlt} and \Rclass{POSIXct}. Computations based on times and dates are difficult because the relationship between local time at a given location and Coordinated Universal Time (UTC) has changed with time, as well as with changes in national borders. Packages \pkgname{lubridate} and \pkgname{anytime} support operations among time-related data and conversions between character strings and time and date classes, making them easier and less error prone than when using base \Rlang functions. Thus I describe classes and operations related to dates and times in section \ref{sec:data:datetime} on page \pageref{sec:data:datetime}.
\end{warningbox}
It\index{removing objects}\index{deleting objects|see {removing objects}}\label{par:clac:remove}\label{par:calc:remove} is good to \emph{remove} from the workspace objects that are no longer needed. We use function \Rfunction{remove()} to delete objects stored in the current workspace.
Arguments passed to \Rfunction{remove()} can be bare object names as shown here.
<<>>=
an.object <- 1:4
remove(an.object) # using a bare name
@
Function \Rfunction{remove()} also accepts the names of the objects to remove as a \code{character} vector passed to its parameter \code{list}. In spite of its name, the argument must be a \code{vector} rather than a \code{list} (see section \ref{sec:calc:character} on \code{character} and section \ref{sec:calc:lists} on \code{list} on pages \pageref{sec:calc:character} and \pageref{sec:calc:lists}).
<<>>=
an.object <- 5:2
remove(list = "an.object") # using a character vector
@
Function \Rfunction{objects()} returns a \code{character} vector containing the names of all objects visible in the current environment, or by passing an argument to parameter \code{pattern}, only the objects with names matching it.
<<>>=
an.object <- 1:4
another.object <- 2
objects(pattern = "\\.object$") # a regular expression: names ending in ".object"
remove(an.object)
objects(pattern = "\\.object$")
@
In \pgrmname{RStudio}, all objects are listed in the \textbf{Environment} tab and the search box of this tab can be used to find a given object.
\begin{explainbox}
Function \Rfunction{remove()} accepts both bare names of objects as in the chunk above and \code{character} strings corresponding to object names like in \code{remove("any.object")}. However, while \Rfunction{objects()} accepts patterns to be matched to object names, \Rfunction{remove()} does not. Because of this, these two functions have to be used together for removing all objects with names that match a pattern. The pattern can be given as a regular expression (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}).
Both functions are available under short names matching those used in \osnameNI{Linux} and \osnameNI{Unix} for managing files: \Rfunction{ls()} is a synonym of \Rfunction{objects()} and \Rfunction{rm()} of \Rfunction{remove()}.
Using a simple search pattern we obtain the names of all objects with names \code{"vct1"}, \code{"vct2"}, and so on. When using a pattern to remove objects, it is good to first use \Rfunction{objects()} on its own to get a list of the objects that would be deleted by calling \Rfunction{remove()} when passing the names returned by \Rfunction{objects()} as the argument for parameter \code{list}.
<<numbers-7>>=
objects(pattern = "^vct.*")
@
The code below removes all objects with names \code{"vct1"}, \code{"vct2"}, and so on. We do this at the end of the section before reusing the same names in the code examples of the next section.
<<numbers-last>>=
remove(list = objects(pattern = "^vct[[:digit:]]+$"))
@
Similar code chunks are included at the end of each section throughout the book to ensure that code examples are self-contained by section. The chunk is shown above as an example, but kept hidden in later sections.
\end{explainbox}
\section{Character Values}\label{sec:calc:character}
\index{character strings}\index{classes and modes!character|(}\qRclass{character}
In spite of the name \code{character}, values of this mode are vectors of \emph{character strings}. Character constants are written by enclosing character strings in quotation marks, i.e., \code{"this is a character string"}. There are three types of quotation marks in the ASCII character set, double quotes \code{"}, single quotes \code{'}, and back ticks \code{`}. The first two types of quotes can be used as delimiters of \code{character} constants.
<<char-1>>=
vct1 <- "A"
vct1
vct2 <- 'A'
vct2
vct1 == vct2 # two variables holding character values, or named objects
"A" == 'A' # two constant character values, or anonymous objects
@
\begin{explainbox}
In many computer languages, vectors of characters are distinct from vectors of character strings. In these languages, character vectors store at each index position a single character, while vectors of character strings store at each index position strings of characters of various lengths, such as words or sentences. If you are familiar with \Clang or \Cpplang, you need to keep in mind that \Clang's \code{char} and \Rlang's \code{character} are not equivalent. In contrast to these other languages, in \Rlang there is no predefined class for vectors of individual characters, and character constants enclosed in double or single quotes are not different.
\end{explainbox}
Concatenating character vectors of length one does not yield a longer character string; it yields instead a longer vector of character strings.
<<char-1a>>=
vct3 <- 'ABC'
vct4 <- "bcdefg"
vct5 <- c("123", "xyz")
c(vct3, vct4, vct5)
@
Having two different delimiters available makes it possible to choose the type of quotation marks used as delimiters so that other quotation marks can be easily included in a string.
<<char-3>>=
"He said 'hello' when he came in"
'He said "hello" when he came in'
@
The\index{character string delimiters} outer quotes are not part of the string, they are ``delimiters'' used to mark the boundaries. As you can see below when \code{vct6} is printed, special characters can be represented using ``escape codes''. There are several of them, and here we will show just four: \verb|\n| for new line, \verb|\t| for tab, \verb|\"| for a quotation mark within a string, and \verb|\\| for a single backslash \verb|\|. I also show the different behaviour of \Rfunction{print()} and \Rfunction{cat()}, with \Rfunction{cat()} \emph{interpreting} the escape sequences and \Rfunction{print()} displaying them as entered.
<<char-4>>=
vct6 <- "abc\ndef\tx\"yz\"\\\tm"
print(vct6)
cat(vct6)
@
The \textit{escape codes}\index{character escape codes} are expanded only in some contexts, such as when using \Rfunction{cat()} to display text output.
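Each escape code stands for a single character in the stored string, even though it is typed as two characters. The chunk below is a minimal sketch with short made-up strings; it uses function \Rfunction{nchar()}, which returns the number of characters in a string.
<<>>=
nchar("a\tb")  # three characters: a, tab, b
nchar("\\")    # one character: a single backslash
print("a\tb")  # the escape code is displayed as entered
cat("a\tb\n")  # the escape codes are interpreted
@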
%\subsection{Character operations}\label{sec:calc:character:oper}
\begin{faqbox}{How to find the length of a character string?}
While\index{character strings!number of characters} function \code{length()} returns the number of member \code{character} strings in a vector, function \Rfunction{nchar()} returns the number of characters in each string in the vector (see below for examples).
\end{faqbox}
In the example below, function \Rfunction{nchar()} returns the number of characters in each member string.
<<char-nchar-01>>=
nchar(x = "abracadabra")
nchar(x = c("abracadabra", "workaholic", ""))
@
To convert a \code{character} string into upper case or lower case we use functions \Rfunction{toupper()} and \Rfunction{tolower()}, respectively.
<<char-toupper-01>>=
toupper(x = "aBcD")
tolower(x = "aBcD")
@
Function \Rfunction{strtrim()} trims a string to a maximum number of characters or width.
<<char-trim-01>>=
strtrim(x = "abracadabra", width = 6)
strtrim(x = "abra", width = 6)
strtrim(x = c("abracadabra", "workaholic"), 6)
strtrim(x = c("abracadabra", "workaholic"), c(6, 3))
@
\begin{faqbox}{How to wrap long character strings?}
Use \Rlang function \Rfunction{strwrap()} (see below for examples).
\end{faqbox}
Function \Rfunction{strwrap()} wraps a string to a maximum number of characters or width, by splitting it into a vector of shorter character strings. It can additionally insert a character string at the start of each of these new shorter strings.
<<char-wrap-01>>=
strwrap(x = "This is a long sentence used to show how line wrapping works.", width = 20)
@
\begin{advplayground}
Function \Rfunction{cat()} prints a character vector respecting the embedded special characters such as new line (encoded as \verb|\n| in \code{character} strings) and without issuing any additional new lines. Study the code below and the output it generates, consult the documentation of the two functions, and modify the example code until you are confident that you understand in detail how these two functions work.
<<char-wrap-02, eval=eval_playground>>=
wrapped_sentence <-
  strwrap(x = "This is a very long sentence used to show how line wrapping works.",
          width = 10,
          prefix = "\n")
print(wrapped_sentence)
cat(wrapped_sentence, "\n")
@
\end{advplayground}
\begin{faqbox}{How to create a single character string from multiple shorter strings?}
While function \code{c()} is used to concatenate \code{character} vectors into longer vectors, function \Rfunction{paste()} is used to concatenate character strings into a single longer string (see below for examples).
\end{faqbox}
Pasting together \code{character} strings has many uses, e.g., assembling informative messages to be printed, programmatically creating file names or file paths, etc. If we pass numbers, they are converted to \code{character} before pasting. The default separator is a space character, but this can be changed by passing a \code{character} string as an argument for parameter \code{sep}.
<<char-paste-01>>=
paste("n =", 3)
paste("n", 3, sep = " = ")
@
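As a sketch of one such use, the chunk below assembles file names. The prefix \code{"data"} and the extension \code{".csv"} are made up for illustration. As \Rfunction{paste()} is vectorised, the shorter arguments are reused for each member of the longest one. Function \Rfunction{paste0()} is a shorthand for \Rfunction{paste()} with \code{sep = ""}.
<<>>=
paste("data", 1:3, sep = "-")
paste0("data-", 1:3, ".csv")
paste("data-", 1:3, ".csv", sep = "")  # equivalent to the call to paste0()
@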
Pasting constants, as shown above, is of little practical use. In contrast, combining values stored in different variables is a very frequent operation when working with data. A simple use example follows. Assuming vector \code{friends} contains the names of friends and vector \code{fruits} the fruits they like to eat we can paste these values together into short sentences.
<<char-paste-02>>=
friends <- c("John ", "Yan ", "Juana ", "Mary ")
fruits <- c("apples", "lichees", "oranges", "strawberries")
paste(friends, "eats ", fruits, ".", sep = "")
@
\begin{playground}
Why was it necessary to pass \code{sep = ""} in the call to \Rfunction{paste()} in the example above? First try to predict what will happen and then remove \code{, sep = ""} from the statement above and run it to learn the answer. Try your own variations of the code until you understand the role of the separator string.
\end{playground}
We can pass an additional argument to tell that the vector resulting from the paste operation is to be collapsed into a single \code{character} string. The argument passed to collapse is used as the separator. I use here \code{cat()} so that the newline character is obeyed in the display of the single character string.
<<char-paste-03>>=
cat(paste(friends, "eats ", fruits, collapse = ".\n", sep = ""))
@
\begin{explainbox}
When the vectors are of different length, as in the last example above, the shorter ones are recycled as many times as needed, which is not always what we want. To avoid the recycling, we need to first collapse the members of the long vector \code{fruits} into a vector of length one. We can achieve this by nesting two calls to \Rfunction{paste()}, and passing an argument to \code{collapse} in the inner function call.
<<char-paste-04>>=
collapsed_fruits <- paste(fruits, collapse = ", ")
paste("My friends eat", collapsed_fruits, "and other fruits.")
@
The nesting of function calls is explained in section \ref{sec:script:pipes} on page \pageref{sec:script:pipes}. However, as the two statements above would in most cases be written as nested function calls, I add this example for reference.
<<char-paste-05>>=
paste("My friends eat", paste(fruits, collapse = ", "), "and other fruits.")
@
\end{explainbox}
Function \Rfunction{strrep()} repeats and pastes \code{character} strings into a new longer \code{character} string, while function \Rfunction{rep()} repeats character strings without pasting them together, returning a longer vector with each repeat of the string as a separate member.
<<char-strrep-01>>=
rep(x = "ABC", times = 3)
strrep(x = "ABC", times = 3)
strrep(x = "ABC", times = c(2, 4))
strrep(x = c("ABC", "X"), times = 2)
strrep(x = c("ABC", "X"), times = c(2, 5))
@
\begin{faqbox}{How to trim leading and/or trailing whitespace in character strings?}
Use function \Rfunction{trimws()} (see below for examples).
\end{faqbox}
Trimming\index{character strings!whitespace trimming} leading and trailing whitespace is a frequent operation. \Rlang function \Rfunction{trimws()} implements this operation as shown below.
<<char-str-00a>>=
trimws(x = " two words ")
trimws(x = c(" eight words and a newline at the end\n", " two words "))
@
\begin{playground}
Function \Rfunction{trimws()} has additional parameters that make it possible to select which end of the string is trimmed and which characters are considered whitespace. Use \code{help(trimws)} to access the help and study this documentation. Modify the example above so that only trailing whitespace is removed, and so that the newline character \verb!\n! is not considered whitespace, and thus not trimmed away.
\end{playground}
Within\index{character strings!position-based operations} \Rclass{character} strings, substrings can be extracted and replaced \emph{by position} using \Rfunction{substring()} or \Rfunction{substr()}.
For extraction, we can pass to \code{x} a constant as shown below or a variable.
<<char-str-01>>=
substr(x = "abracadabra", start = 5, stop = 9)
substr(x = c("abracadabra", "workaholic"), start = 5, stop = 11)
@
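Function \Rfunction{substring()} differs from \Rfunction{substr()} in the names of its position parameters, \code{first} and \code{last}, in \code{last} having a default argument that extends the extraction to the end of the string, and in recycling the string when \code{first} or \code{last} are longer vectors. The chunk below is a brief sketch of these differences, reusing the string from the example above.
<<>>=
substring("abracadabra", first = 8)                # up to the end of the string
substring("abracadabra", first = 1:3, last = 3:5)  # three substrings
@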
Replacement is done \emph{in place}, by having function \code{substr()} on the left-hand side (lhs) of the assignment operator \code{<-}. Thus, the argument passed to parameter \code{x} of \code{substr()} must in this case be a variable rather than a constant. This is a substitution character by character, not insertion, so the number of characters in the string passed as the argument to \code{x} remains unchanged, i.e., the value returned by \code{nchar()} does not change.
<<char-str-02>>=
vct7 <- c("abracadabra", "workaholic")
substr(x = vct7, start = 5, stop = 9) <- "xxx"
vct7
@
If we pass values to both \code{start} and \code{stop} then only part of the value on the \emph{rhs} of the assignment operator \code{<-} may be used.
<<char-str-03>>=
vct8 <- c("abracadabra", "workaholic")
substr(x = vct8, start = 5, stop = 6) <- "xxx"
vct8
@
\begin{playground}
Frequently, a very effective way of learning how a function behaves is to experiment. In the example below, we set \code{start} and \code{stop} delimiting more characters than those in \code{"xxx"}. In this case, is \code{"xxx"} extended,
or \code{start} or \code{stop} ignored? Run this ``toy example'' to find out the answer.
<<char-str-04, eval=eval_playground>>=
VCT1 <- c("abracadabra", "workaholic")
substr(x = VCT1, start = 5, stop = 11) <- "xxx"
VCT1
remove(VCT1) # clean up
@
\end{playground}
As\index{character strings!partial matching and substitution} in \Rlang each character value is a string comprised of zero to many characters, in addition to comparisons based on whole strings or values, partial matches among them are of interest.
To substitute part of a \code{character} string \emph{by matching a pattern}, we can use functions \Rfunction{sub()} or \Rfunction{gsub()}. The first example uses three \code{character} constants, but values stored in variables can also be passed as arguments.
<<char-regex-01>>=
sub(pattern = "ab", replacement = "AB", x = "about")
@
The difference between \Rfunction{sub()} (substitution) and \Rfunction{gsub()} (global substitution) is that the first replaces only the first match found while the second replaces all matches.
<<char-regex-02>>=
sub(pattern = "ab", replacement = "x", x = "abracadabra")
gsub(pattern = "ab", replacement = "x", x = "abracadabra")
@
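By default, the argument passed to \code{pattern} is interpreted as a regular expression, in which some characters, such as the dot, have a special meaning. The chunk below is a sketch contrasting this default with \code{fixed = TRUE}, which requests a literal match; the strings used are made up for illustration.
<<>>=
gsub(pattern = "[aeiou]", replacement = "", x = "abracadabra")  # delete all vowels
sub(pattern = ".", replacement = "-", x = "a.b")                # "." matches any character
sub(pattern = ".", replacement = "-", x = "a.b", fixed = TRUE)  # "." matches only a dot
@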
\begin{playground}
Functions \Rfunction{sub()} and \Rfunction{gsub()} accept character vectors as the argument for parameter \code{x}. Run the two statements below and study how the values returned differ.
<<char-regex-03, eval=eval_playground>>=
sub(pattern = "ab", replacement = "x", x = c("abra", "cadabra"))