# Foundations
Here we go over the basic commands in R, and how R is best used. R is a 'vectorised'
programming language - it is fastest when you perform operations on whole vectors.
Consequently, it is better
to put your data in vectors, and perform operations on those vectors, than to
use common programming tricks like loops. We will explain this in more detail later.
Details about the language might seem irrelevant if you just need to work with some
data and fit some regression models.
They are not. Knowing something about how the language works is essential for understanding
how you fit the model and what the output looks like. Thus, it is better to cover
these basics before starting to work with any data.
## Objects
The first thing you need to know is how to create objects with names. This is how
you store things in memory to use them later. In R,
this is done using a left arrow, `<-`. We are going to create a string - a sequence
of characters between double quotes - and store it with a name.
```{r}
rm(list = ls())
# Ignore the thing above for now - I will explain what that is doing later
first_string <- "hello_world!"
```
We did two things in this line of code. First, we created an
object, the string `"hello_world!"`. Then, we bound it to the name `first_string`.
A name must start with a letter (or a dot that is not followed by a number). After that,
it can contain letters, numbers, dots, and underscores.
`string_3` is admissible, but `3_string` is not. Naming an object stores it in memory,
and allows us to access it later on in the code. So now
if we want to access that string, we can type
```{r}
first_string
```
and R gives us the string.
We can manually take a look at the objects we have stored in memory.

Open RStudio
and take a look at the top right of your screen. There should be a pane called 'environment'.
This is a record of every single object you have stored in your working memory.
You should now see an object called first_string. If you
click on it, you can see what the object is.
Sometimes, we will want to remove an object from our working memory. We can
do this using `rm`, which removes an object from your environment. If, for example,
we wanted to remove `first_string` we would type
```{r}
rm(first_string)
```
Note that R does not do this automatically when we run a new script. Thus, at
the beginning of each script, we should clear the environment. We do this by
running `rm(list=ls())` as the very first line of the script. This is what we did at
the very beginning of this section. Otherwise, it is
very easy to introduce errors by accidentally using variables defined in another script
in the current script.
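As a small sketch of the above, we can check whether a name is bound to an object with `exists()`, and list every name in the environment with `ls()`. The name `scratch_value` is made up for this illustration and is not used elsewhere in these notes.

```{r}
# `scratch_value` is a hypothetical name used only for this illustration
scratch_value <- 42
exists("scratch_value")   # TRUE - the name is bound to an object
# ls() returns the names of all objects in the environment as a character vector
"scratch_value" %in% ls()
# removing the object unbinds the name
rm(scratch_value)
exists("scratch_value")   # FALSE - the name is gone
```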
## Vectors
Vectors are the most important type of object in R. Fundamentally, a vector is an
ordered set of values like strings or numbers. Vectors will be the basis of the objects
we use to store data later. We create them by putting
elements in between c(), separated by commas.
```{r}
first_vec <- c(0,1,2,3,4)
```
The elements of a vector are ordered. R stores them in the order in which we
pass them. Thus, 0 is the first element of our vector, 1 is the second and so on.
There are two types of vector: atomic vectors, and lists. The difference is that
in an atomic vector, all the elements must be of the same type. In a list, the elements
can be of different types. For ease, we will start by going over the basics of atomic
vectors. Then, we will move on to lists.
### Atomic vectors
There are four common types of atomic vector you will come across (plus two other
rare types, that we can safely ignore for now). These are logicals, doubles,
integers, and characters.
Logicals are vectors of logical values - things
that are either `TRUE` or `FALSE`.
We can create logicals based on conditions using the `==` sign. `==` returns `TRUE`
if the objects on both sides are exactly equal to each other, and `FALSE` otherwise.
```{r}
# creating some logical conditions
3==4
"cat"=="dog"
f <- "cat"
"cat"== f
```
This will come in very handy when we want to select some variables based on the
values of other variables, as we commonly do.
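We can also write `TRUE`, `FALSE` more succinctly as `T`, `F`. Be aware, though, that
`T` and `F` are ordinary names that can be reassigned, whereas `TRUE` and `FALSE` are
reserved words, so the long forms are safer in scripts.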
```{r}
# Two equivalent ways of writing logicals
(1+1==2)==TRUE
(1+1==2)==T
(1+1==3)==FALSE
(1+1==3)==F
```
Integer is what it sounds like - a vector of integers. To tell R that a number is an
integer, we put an `L` after it, as in `1L`. A number typed without the `L` is stored
as a double, even if it is whole.
We use doubles to represent numbers that may not be integers. A double can be
any type of number, including integers and decimals (or 'floating point' numbers
as programmers often call them). We can generate a vector containing a range of integers
by passing a starting integer, followed by a colon, followed by the final integer.
Finally, character vectors are vectors of strings.
```{r}
# Let's make a vector of each type
first_log_vec <- c(TRUE, FALSE, TRUE)
# the L suffix makes these integers; c(1, 2, 3) would be stored as doubles
first_int_vec <- c(1L, 2L, 3L)
int_range_vec <- 1:10
first_double_vec <- c(1.1, 2.2, 3.3)
first_char_vec <- c("my", "first", "character", "vector")
```
`typeof()` tells you the type of your vector, and `length()` tells you how
many elements are in it.
```{r}
# Looking at the type of the first vector, and the length of the second
typeof(first_log_vec)
length(first_int_vec)
```
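To make the difference between the types concrete, here is a short sketch that checks a few values with `typeof()`. The vector names are made up for this illustration.

```{r}
# hypothetical names, used only for this illustration
whole_number_double <- c(1, 2, 3)     # no suffix: stored as doubles
whole_number_int <- c(1L, 2L, 3L)     # L suffix: stored as integers

typeof(whole_number_double)   # "double"
typeof(whole_number_int)      # "integer"
typeof(1:3)                   # "integer" - the colon operator gives integers
typeof("a")                   # "character"
typeof(TRUE)                  # "logical"
```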
You can merge two atomic vectors by putting them in a larger vector.
```{r}
# creating two small vectors
vec_1 <- c(1,2,3)
vec_2 <- c(4,5,6)
# merging them into one big vector
big_vec <- c(vec_1, vec_2)
```
Putting a set of vectors together like this 'flattens' them. What we get out of
the above is not a vector containing a set of vectors, each containing numbers.
Instead, we get out a single vector that contains all of the components of the
set of vectors together in the order that we passed them.
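A quick way to see the flattening is to nest several calls to `c()` and look at the length of the result. This is only a sketch, with made-up names.

```{r}
# hypothetical names, used only for this illustration
nested_attempt <- c(c(1, 2), c(3, c(4, 5)))
# however deeply we nest the calls to c(), we get one flat vector back
length(nested_attempt)                          # 5
identical(nested_attempt, c(1, 2, 3, 4, 5))     # TRUE
```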
### Lists
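A list is a vector whose elements can be of different types. Note that we cannot
make one with `c()`. If we pass elements of different types to `c()`, R converts
them all to a single type and returns an atomic vector.
```{r}
# this is NOT a list - c() turns the number 1 into the string "1"
coerced_vec <- c("hello_world!", 1)
typeof(coerced_vec)
```
To make a list, we use `list()`.
```{r}
# making a list
first_list <- list("hello_world!", 1)
```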
Lists are very useful in practice because they can be recursive. Elements of lists
can be other lists or vectors, and R does not flatten lists like it does atomic vectors.
Lists are thus useful when we want to store a set of objects, but preserve some
internal structure. An example from econometrics, which we will see later,
is collecting standard errors for different regression specifications. When we
create a regression table, we often want to pass a list of standard errors to
tell us how precise our estimates are. Yet, we also want to keep the standard errors
of each model together so we do not mix standard errors of different models up.
A way we can achieve this is by storing the standard errors for each model in its
own vector, and then placing those vectors into a list. Each vector keeps the
standard errors from one model together, and separate from those of the other models.
We can also flatten lists if we want to by
using `unlist`. Again, we will see how this is useful later.
```{r}
# demonstrating the nested structure in lists
# imagine these are the two outputs of standard errors from our variance-covariance
# matrix
se_1 <- c(0.3, 0.4, 0.5)
se_2 <- c(0.4, 0.5, 0.6, 0.7)
# now let's create a list with a nested structure
se_list <- list(se_1, se_2)
# now if we take a look at the list, we can see the nested structure
se_list
# selecting the first or second element of the list with double square
# brackets will give us the corresponding vector.
se_list[[1]]
# Now, imagine we want to flatten the list i.e. remove the nested structure
flattened_se_list <- unlist(se_list)
flattened_se_list
```
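When pulling elements out of a list, it helps to know that single and double square brackets behave differently. The sketch below assumes a list with named elements (`model_a` and `model_b` are made-up names; naming the elements is optional).

```{r}
# hypothetical list of standard errors, with made-up element names
example_se_list <- list(model_a = c(0.3, 0.4, 0.5),
                        model_b = c(0.4, 0.5, 0.6, 0.7))

length(example_se_list)         # 2 - the list has two elements
# double brackets return the element itself, here a numeric vector
typeof(example_se_list[[1]])    # "double"
# single brackets return a shorter list that contains the element
typeof(example_se_list[1])      # "list"
# named elements can also be pulled out with $
example_se_list$model_b
# flattening gives one atomic vector with all seven numbers
length(unlist(example_se_list)) # 7
```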
### Working with vectors
As mentioned above, vectors have an order. We can use this ordering to select
elements from vectors. This is called 'slicing' (or 'subsetting'). We do this
by putting square brackets after the name of the vector, containing positions of
elements within the vector. Positions start at 1, not 0. You can select a single
element of the vector by putting a single number in the brackets that corresponds
to the position of that element in the vector. You can select the nth to (n+k)th
elements of the vector by passing `n:(n+k)`. The brackets matter: R evaluates `:`
before `+`, so `n:n+k` means `(n:n)+k`.
You can select any combination of elements by passing the positions
of the elements you want as a vector.
```{r}
# let's slice the vector we created earlier
big_vec[1]
big_vec[6]
big_vec[2:3]
big_vec[c(1, 2:3, 6)]
```
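The sketch below shows why brackets are needed when the end of a range is computed, and how a negative position drops an element. All names are made up for this illustration.

```{r}
# hypothetical names, used only for this illustration
letters_vec <- c("a", "b", "c", "d", "e", "f")
start_pos <- 2
span <- 2

# with brackets: positions 2, 3, 4
start_pos:(start_pos + span)
letters_vec[start_pos:(start_pos + span)]   # "b" "c" "d"

# without brackets R computes (2:2) + 2, which is the single position 4
start_pos:start_pos + span
letters_vec[start_pos:start_pos + span]     # "d"

# a negative position drops that element instead of selecting it
letters_vec[-1]                             # "b" "c" "d" "e" "f"
```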
We can add, subtract, multiply, and divide numbers using +, -, *, /. These also
work for vectors. They perform the operations on the vectors elementwise.
```{r}
# Using some basic mathematical operators
# on integers/doubles
1+1
1-1
1*2
1/2
# on vectors
big_vec + 1
big_vec - 1
big_vec * 2
big_vec / 2
# notice that each of these returns a vector of results, where each element is
# the result of the operation on the corresponding element of the original vector.
```
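The same operators also work between two vectors of the same length, pairing up the elements by position. A minimal sketch with made-up numbers:

```{r}
# hypothetical prices and quantities, used only for this illustration
prices <- c(2, 4, 6)
quantities <- c(10, 20, 30)

# first price times first quantity, second times second, and so on
revenue_per_good <- prices * quantities
revenue_per_good            # 20 80 180
# sum() adds up the elements of a vector
sum(revenue_per_good)       # 280
```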
In practice, you will see many vectors that contain missing values. R denotes
missing values with `NA`. R also has useful commands to find these missing values.
`is.na()` returns a logical vector of the same length as the original vector.
Each element is `TRUE` if the corresponding element of the original vector is
missing, and `FALSE` otherwise. The `any()` and `all()` commands tell us
whether any or all of the elements of a logical vector are `TRUE`, respectively. The first
returns `TRUE` if at least one of the elements is `TRUE`,
else `FALSE`. The second returns `TRUE` only if every element is `TRUE`,
else `FALSE`.
We can combine these with `is.na()`
to screen for NAs.
```{r}
# Let's look for some missing values
# here's a vector with a missing value
missing_value_vec <- c(1,2,NA,4,5)
# now let's see where that missing value is
is.na(missing_value_vec)
any(is.na(missing_value_vec) == T)
all(is.na(missing_value_vec) == T)
```
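Notice that inside `any()` and `all()`, we have specified a logical condition
using `==`. Because `is.na()` already returns a logical vector,
`any(is.na(missing_value_vec))` gives the same answer.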
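Once we have found the missing values, we usually want to count them, drop them, or tell a function to skip them. The sketch below uses a made-up vector. The `!` operator (not introduced above) flips `TRUE` to `FALSE` and vice versa, and `na.rm = TRUE` is an argument that `mean()` accepts.

```{r}
# hypothetical vector with two missing values
income_vec <- c(10, NA, 30, NA, 50)

# TRUE counts as 1 when summed, so this counts the missing values
sum(is.na(income_vec))            # 2
# positions of the missing values
which(is.na(income_vec))          # 2 4
# keep only the elements that are NOT missing
observed_income <- income_vec[!is.na(income_vec)]
observed_income                   # 10 30 50
# many summary functions return NA if any element is missing...
mean(income_vec)                  # NA
# ...unless we ask them to remove missing values first
mean(income_vec, na.rm = TRUE)    # 30
```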
Vectors can also have attributes - metadata that we attach to the vector object.
The most important of these to know are names and dimensions. R builds the objects that we
store data in, like matrices and dataframes, out of vectors with attributes. A matrix,
for example, is a vector with a dimension attribute.
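As a rough sketch of how attributes work, we can give a vector names when we create it, and we can turn a plain vector into a matrix just by setting its dimensions. The object names are made up for this illustration.

```{r}
# hypothetical named vector: each element gets a label
named_scores <- c(maths = 70, physics = 65)
names(named_scores)         # "maths" "physics"
# names can be used to select elements, just like positions
named_scores["physics"]

# a plain vector of six integers...
flat_values <- 1:6
# ...becomes a matrix with 2 rows and 3 columns once we set its dimensions
dim(flat_values) <- c(2, 3)
flat_values
is.matrix(flat_values)      # TRUE
# attributes() lists all the metadata attached to an object
attributes(flat_values)
```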
## Functions
A function is a map from some arguments to an output. In programming, a function takes
some things in, performs some operations on them, and returns a result. Functions are incredibly
common. One of the main things we want to do in R is use functions - by taking others'
from packages and making our own.
### Functions from packages
A package is a collection of functions that other people have written. Normally,
these have a theme. For example, `plm` is a package of functions for fitting
panel data models. Thus, it contains a set of interconnected functions you can
use to do lots of things with panel data.
Packages are incredibly useful. Using functions from packages is the main way of performing
operations in R. Most programmers upload their packages to the 'Comprehensive R
Archive Network', or 'CRAN' for short: https://cran.r-project.org/. All packages on
CRAN have a reference manual - a pdf file describing all of the functions that are in
the package. Many also have 'vignettes' - longer guides with examples of how you might use the package.
To use a package, you first have to download it onto your computer. To do this,
you have to run `install.packages()` with the package name between double quotes within
the brackets. Thus, to install `plm`, we run `install.packages("plm")` in the
console.
Once you have a package on your computer, you have the option to use it in
your scripts. To use a package in a script, you need to write `library()` with the
package name before the point you use any functions from the package. If we want
to use a function from `plm`, we have to write `library(plm)` in our script before
we use the function.
Once we load a package, we can look at the functions in the package using the
package explorer on the right in RStudio. Run a function from a package by typing
the name of that function, followed by its arguments in brackets.
If we have a function from a package and want to easily look up its documentation, we
can do this by typing the function name, preceded by a question mark `?`. This
brings up the help page for that function on the right hand side of the RStudio
window.
```{r}
# Some examples of finding the documentation for a specific function
# imagine we want to find the help page for the summary function in base R -
# we would run
?summary
```
The output for the inbuilt (or 'Base R') summary is as below.

Imagine here that we had multiple objects called `summary` from different packages
that we might use. We can specify which package we want to take the function from by preceding the function
with the package name and a double colon `::`. Our version of `summary` is from base R - the set of functions
that come with the R distribution itself. To call that `summary` specifically, we
can run `base::summary`.
```{r}
# Some examples of finding the documentation for a specific function
?base::summary
# now imagine we want the documentation for the group_by function from the
# package dplyr
?dplyr::group_by
# notice we can get the documentation without actually loading the package!
# To actually run the function, the package must be installed; we can then
# either load it with library(dplyr) or call the function as dplyr::group_by()
```
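The `::` form can also be used to run a function, not only to look up its documentation. The sketch below uses `stats` and `base`, two packages that ship with R itself, so nothing needs to be installed. The object names are made up for this illustration.

```{r}
# calling functions with an explicit package prefix
# sd() lives in the stats package, sum() in base
sd_result <- stats::sd(c(1, 2, 3))
sd_result                      # 1
sum_result <- base::sum(1, 2, 3)
sum_result                     # 6
```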
### Defining your own function
It is very easy to define functions in R. They are particularly useful in R because
they allow you to iterate in a vectorised way by `apply`ing a function to a vector.
We will see this later. We store functions as
objects with the name of the function as the object name. We assign to that name the keyword
`function()`. The arguments we list inside `function()` are the arguments of the function. We
enclose the body of the function in curly brackets. At the end of the function,
we specify what we want to return from the function within `return()`.
Once we have our function, we can run it by typing its name, followed by the values
of the arguments within brackets.
```{r}
# let's define our first function - squaring
# This takes a numeric argument `x` and returns its
# square
square_things <- function(x) {
  y <- x^2
  return(y)
}
# I could have also written the body in one line as return(x^2) of course
# now let's square something
square_things(22)
square_things(7)
```
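Functions can take more than one argument, and an argument can be given a default value that is used when the caller leaves it out. The function below is a made-up illustration of this. It also shows that a function built from vectorised operators works on whole vectors without any extra effort.

```{r}
# hypothetical function: raise x to a power, squaring by default
raise_to_power <- function(x, power = 2) {
  return(x^power)
}

raise_to_power(3)               # 9  - uses the default power of 2
raise_to_power(3, power = 3)    # 27 - overrides the default
# ^ works elementwise, so the function accepts a whole vector too
raise_to_power(c(1, 2, 3))      # 1 4 9
```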
Of course, we can nest functions if we want to chain some operation and write that
in an efficient way.
```{r}
square_things_add_two <- function(x) {
  y <- square_things(x) + 2
  return(y)
}
square_things_add_two(7)
```
Notice that all the objects we define within the function are 'local' - they do
not exist in our environment outside of that function. If we try to look for them
in the environment pane, they are not there. Thus, if we try to get `y`
outside of the function above, R will throw an 'object not found' error (or worse, return something else
we have called `y` that we do not want!). Objects we define outside of functions,
by contrast, are global - they exist in the whole environment. Global objects
are the ones we can see in our environment pane.
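We can check this for ourselves with `exists()`, which reports whether a name can be found from where we are standing. The function and object names below are made up for this illustration.

```{r}
# hypothetical function that creates a local object
double_it <- function(input_value) {
  # doubled_inside only exists while the function is running
  doubled_inside <- input_value * 2
  return(doubled_inside)
}

double_it(5)                 # 10
# the function itself is a global object...
exists("double_it")          # TRUE
# ...but the object created inside it is not visible out here
exists("doubled_inside")     # FALSE
```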
## Iteration
Imagine we want to do some operation to lots of things - for example squaring
a vector of numbers. Applying an operation to each of those things is called
'iteration'. We can iterate in R using `for` and `while` loops as you might have
seen in other programming languages.
```{r}
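# making a for or while loop in R
# 0:9 is the sequence 0, 1, ..., 9. Note that range(0, 10) would not work
# here - in R it only returns the minimum and maximum, c(0, 10)
for (i in 0:9) {
  print(i^2)
}
j <- 0
while (j < 10) {
  print(j^2)
  j <- j + 1
}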
```
If we want to run different code depending on whether a condition holds, we can use `if` and
`else` statements.
```{r}
# if-else
x <- 3
if (x == 3) {
  print("X is three!")
} else {
  print("X is not three :(")
}
```
R is, however, a vectorised programming language - it is designed to perform
operations directly to vectors. Looping does not do this. When we write a loop,
we take an object from the set of objects we want to iterate over, apply the
operation directly to that object. We then select the next object, apply the operation
to that object, and so on. This suggests a quicker and more robust way
to iterate - store the objects we want to iterate over as some sort of vector, and
perform the operation directly on the entire vector at once.
More precisely, we should iterate by designing a function, setting up a vectorised object,
and applying the function directly to the vectorised object. In R, we do this using the `apply`
family of functions, and it is called `apply`ing.
The four we cover here are `sapply`, `lapply`, `apply`, and `tapply`.
The most common to use are `lapply` and `sapply`.
`lapply` takes a list (or vector), a function,
and any additional arguments to pass on to the function. It applies the function to
each element of the list, given those arguments, and returns a list of
the results. `sapply` takes the same inputs and applies the function in the same
way, but simplifies the result to a vector where it can.
```{r}
# examples of iteration by applying
vec_of_numbers <- 0:10
squares_vec <- sapply(vec_of_numbers, FUN = square_things)
list_of_numbers <- as.list(0:10)
squares_list <- lapply(list_of_numbers, FUN = square_things)
# notice that we can get the same thing as above by flattening the list
squares_vec_2 <- unlist(squares_list)
```
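Additional arguments to the function are passed by name after `FUN`. The sketch below uses a made-up function, `add_offset`, to show this. It also shows that for simple arithmetic we do not need `sapply` at all, because the operators are already vectorised.

```{r}
# hypothetical function with a second argument
add_offset <- function(x, offset) {
  return(x + offset)
}

input_numbers <- 0:10
# `offset = 100` is handed on to add_offset for every element
shifted_vec <- sapply(input_numbers, FUN = add_offset, offset = 100)
shifted_vec

# for plain arithmetic, operating on the vector directly gives the same result
shifted_direct <- input_numbers + 100
all(shifted_vec == shifted_direct)    # TRUE
```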
`apply` and `tapply` are less common to use. `apply` works like `lapply`
and `sapply` above, but takes in a `matrix` (or a `data.frame`, which it converts to a
matrix) and applies the function to each row or each column. We will learn what those objects are in the next and fifth
section respectively. `tapply` takes in a vector and a factor of the same length, and
computes a function of the vector separately for each level of the factor. This can be
very useful for summarising data - for example taking means by groups.
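Here is a small sketch of both functions on made-up data. It uses `matrix()` and `factor()`, which create a matrix and a factor respectively. In `apply`, `MARGIN = 1` means 'work row by row' and `MARGIN = 2` means 'work column by column'.

```{r}
# hypothetical 2 x 3 matrix; R fills it column by column,
# so the rows are (1, 3, 5) and (2, 4, 6)
score_matrix <- matrix(1:6, nrow = 2)
row_totals <- apply(score_matrix, MARGIN = 1, FUN = sum)
row_totals                 # 9 12
col_totals <- apply(score_matrix, MARGIN = 2, FUN = sum)
col_totals                 # 3 7 11

# hypothetical wages, with a factor saying which group each wage belongs to
wages <- c(10, 20, 30, 40)
groups <- factor(c("a", "b", "a", "b"))
# mean wage within each group: a is 20, b is 30
group_means <- tapply(wages, INDEX = groups, FUN = mean)
group_means
```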
`apply`ing is often more efficient than a loop, and it is usually more concise. Of course, for small
operations, there is a trade off between run time and developer time. Often it is
easier to just write a loop. We cover some more efficient iteration methods in
the final section if you are interested to learn more.
## Style
Now you know how to write some R, it is important to know how to present it in
a nice way. Writing readable code is very important for a couple of reasons. Firstly,
others might read your code. Secondly, your future self will almost certainly
read your code to find some operation you have written before and want to do again.
In either case, we want that person to be able to find and interpret
it easily. Thus, we need to write code in a consistent and interpretable style.
The usual style in R is based on the Google style guide for R,
https://google.github.io/styleguide/Rguide.html. We base this section on the
synopsis by Hadley Wickham at http://adv-r.had.co.nz/Style.html.
Here, I quickly go over the main points. Writing readable code takes practice, but
pays huge dividends. As always with style guides, do what we say and not what we do.
### File names
File names should be meaningful - so if our script does x we should call it `do_x.R`.
If running scripts in order, prefix them with numbers, e.g. `1-do_x.R`, `2-do_y.R`.
### Object names
Object names should be lower case - so `xyz` as opposed to `Xyz` or `XYZ`. Separate
words in a name with `_` - so `x_vec` as opposed to `xvec`. Variable names should
be nouns, function names should be verbs. Names should be concise but meaningful.
Do not name your object the same as an inbuilt object - like `list`. At best it will
confuse your reader. At worst, you can overwrite the inbuilt name in your environment
and cause a load of problems for yourself.
### Syntax
Spacing should be as in English - so spaces after commas, around mathematical symbols and
so on. The exceptions to this are colons, variables within brackets - so `(x)`
not `( x )` - and cases where extra spaces make your code line up in a nicer way. Try to keep code
within 80 characters from the beginning of the line. This means that it will fit on
a single sheet of A4 paper if you need to save it to a document (say for a coding
test for a job). RStudio places
a vertical line in the editor at 80 characters that you can use to measure this.
When we use curly brackets, the closing curly bracket should go on its own line
unless it is followed by `else`.
```{r}
x <- 3
if (x == 3) {
  print("X is three!")
} else {
  print("X is not three :(")
}
```
Indent in the same block with two spaces, except when you are indenting an
argument within brackets. Then, indent to the beginning of the arguments.
```{r}
very_very_long_function <- function(a = "very_very_very_very_very_long_argument",
                                    b = "another argument") {
  return(print(a))
}
```
### Commenting
Comment your code! Caring about commenting seems silly, but is actually
very important. It is very hard to interpret someone else's or your own past
code without comments. Comment frequently using `#`. Comments should
explain what the function or line of code is doing.
```{r}
# now let's create a variable that stores the string "hello_world"
hello_world_string <- "hello_world"
# let's print it
hello_world_string
```
Break scripts into sections, and delimit these sections with `#` followed by
a name saying what the section is doing, followed by `---` up to the margin line in the
IDE. This allows you to easily find the area of the code you want.
A common thing to do is to first have a section where you read in the data, then
one containing all of your functions, and then one where you run your models (which
you may or may not break into further sections). At the beginning of a script,
add a line explaining what the code is doing and the name of the person who made
it.
```{r}
# Example section breaks for an assignment with a series of questions
# Code for assignment 1
# Your group name
rm(list=ls())
# reading in data -------------------------------------------------------------
# code for reading in our data
# functions --------------------------------------------------------------------
# whatever functions we want to use
# question 1 -------------------------------------------------------------------
# code where we compute what we need for question 1
# question 2 -------------------------------------------------------------------
# code where we compute what we need for question 2
```