---
title: "Appendix 2: Brownian Motion Simulation"
author: "Chris Bentz"
date: "March 09, 2021"
output:
  pdf_document: default
  html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Session Info
Give the session info (reduced).
```{r, echo = F}
# R version
sessionInfo()$R.version$version.string
# platform
sessionInfo()$R.version$platform
```
# Load Packages
Load the packages. If they are not yet installed on your local machine, use install.packages() to install them.
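A minimal sketch for installing any missing packages in one go (only needs to be run once per machine):
```{r, eval = FALSE}
# install any of the required packages that are not yet present
pkgs <- c("ape", "phytools", "GGally", "tidyr", "plyr", "scales", "rstatix")
install.packages(setdiff(pkgs, rownames(installed.packages())))
```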
```{r, message = FALSE}
library(ape)
library(phytools)
library(GGally)
library(tidyr)
library(plyr)
library(scales)
library(rstatix)
```
Give the package versions.
```{r, echo = F}
# version of packages loaded
installed.packages()[names(sessionInfo()$otherPkgs), "Version"]
```
# Phylogenetic Tree
We here use an Indo-European phylogenetic tree from Bentz et al. (2018), which was built using lexical items from word lists of the Automated Similarity Judgment Program (ASJP). For details on how such phylogenetic trees are built automatically, see Jäger (2018).
```{r, fig.width = 5, fig.height = 5}
# load phylogenetic tree in newick format as a text file
tree <- read.newick("https://raw.githubusercontent.com/IWMLC/complexityMeaning/main/IENewickTree.txt")
# choose the languages which are supposed to remain on the phylogeny
languages <- c("'Swedish [i-swe]'","'Dutch [i-nld]'","'Romanian [i-ron]'",
"'Italian [i-ita]'","'German [i-deu]'", "'French [i-fra]'",
"'Spanish [i-spa]'", "'English [i-eng]'")
# prune the tree by dropping all other tips
pruned.tree <- drop.tip(tree, setdiff(tree$tip.label, languages))
# plot the pruned tree
plot(pruned.tree)
```
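The tip labels retain the quoted ASJP codes (e.g. 'Swedish [i-swe]'). A hedged sketch for stripping these before plotting, assuming the quoted "Name [code]" label format above:
```{r, eval = FALSE}
# strip the surrounding quotes and ASJP codes from the tip labels,
# e.g. "'Swedish [i-swe]'" -> "Swedish"
pruned.tree$tip.label <- gsub("^'|\\s*\\[.*\\]'$", "", pruned.tree$tip.label)
plot(pruned.tree)
```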
# Simulate Brownian Motion along the Phylogenetic Tree
We here simulate the evolution of trait values (i.e. pseudo-complexity values in our case) via the process of Brownian motion along the branches of our phylogenetic tree. The number of simulations per taxon (i.e. language) is set via the argument "nsim". This is conceptualized as the number of pseudo-complexity measurements per language. For further explanation of the function fastBM(), see Revell (2016).
```{r}
# define variance of diffusion process
sig2 = 2 # this variance setting is 0.01 in Revell (2016)
# define number of BM simulations
nsim = 20
# set the seed for random number generation in order to get the same result
# when the code is re-run
set.seed(09012020)
# simulate Brownian motion on a tree with fastBM()
simulation <- fastBM(pruned.tree, mu = 0, sig2 = sig2, internal = F, nsim = nsim)
```
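For intuition, here is a minimal sketch of what Brownian motion along a single branch amounts to: the trait change over a branch of length t is drawn from a normal distribution with mean 0 and variance sig2 * t (the branch length and start value below are hypothetical):
```{r, eval = FALSE}
t <- 1.5   # hypothetical branch length
x0 <- 0    # trait value at the start of the branch
# trait value at the end of the branch: x0 plus a normal increment
x1 <- x0 + rnorm(1, mean = 0, sd = sqrt(sig2 * t))
x1
```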
# Correlation Plots
It is expected that the trait values for taxa which are close on the tree (e.g. Spanish and French, or English and Dutch) are positively correlated. This can be checked in the correlation plot below. Note, however, that the significance of the correlations depends on the number of data points simulated here.
```{r, fig.width = 8, fig.height = 8}
# t() transposes rows and columns, so that languages become columns
simulation.df <- as.data.frame(t(simulation))
cor.plot <- ggpairs(simulation.df,
lower = list(continuous = wrap("smooth_loess", size = 1)),
upper = list(continuous = wrap('cor', method = "spearman"))) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(cor.plot)
```
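To test a single pair directly rather than reading it off the plot, one could run cor.test() on two columns (a sketch; the columns are selected by position here, since at this point they still carry the quoted tip labels):
```{r, eval = FALSE}
# Spearman correlation between the first two languages in the data frame
cor.test(simulation.df[[1]], simulation.df[[2]], method = "spearman")
```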
Save figure to file.
```{r, fig.width = 8, fig.height = 8}
ggsave("Figures/BM_CorPlot.pdf", cor.plot, dpi = 300, scale = 1,
device = cairo_pdf)
```
# Visualization: Density Distributions
Plot density distributions of the BM-generated pseudo-complexity measurements by language. Individual values for each pseudo-measurement are plotted as black dots.
First, transform the data frame from wide format to long format (this is necessary for the later plotting and analyses by columns rather than rows).
```{r}
# change column names to plain language names; the assignment is by position,
# so the order must match the column order of simulation.df
colnames(simulation.df) <- c("Swedish", "English", "Dutch", "German",
                             "French", "Italian", "Spanish", "Romanian")
simulation.df.long <- gather(simulation.df, key = language, value = value,
                             Swedish:Romanian)
head(simulation.df.long)
```
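gather() is superseded in current versions of tidyr; an equivalent reshaping with the newer interface would be (a sketch, not run here):
```{r, eval = FALSE}
# same wide-to-long reshaping with tidyr's newer pivot_longer()
simulation.df.long <- pivot_longer(simulation.df, cols = Swedish:Romanian,
                                   names_to = "language", values_to = "value")
```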
Get mean, median, and standard deviation values.
```{r}
# get mean values for each language
mu <- ddply(simulation.df.long, "language", summarise, grp.mean = mean(value, na.rm = T))
# get median values for each language
med <- ddply(simulation.df.long, "language", summarise, grp.median = median(value, na.rm = T))
# get standard deviation values for each language
sdev <- ddply(simulation.df.long, "language", summarise, grp.sd = sd(value, na.rm = T))
```
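As a cross-check, the same three summaries can be computed in one pass with base R's aggregate():
```{r, eval = FALSE}
# mean, median, and sd per language in a single call
aggregate(value ~ language, data = simulation.df.long,
          FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x)))
```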
Plot density distributions with indication of median (mean) values.
```{r, fig.width = 6, fig.height = 5}
density.plot <- ggplot(simulation.df.long, aes(x = value)) +
# histograms could be added as well here (or instead of the density distributions)
# geom_histogram(aes(y = ..density..), colour = "white", fill = "light grey",
# binwidth = 0.1) +
geom_density(alpha = .2, fill = "grey", color = "darkgrey") +
geom_jitter(data = simulation.df.long, aes(x = value, y = 0),
size = 0.7, height = 0.03, width = 0) + # add some jitter to prevent overplotting
facet_wrap(~ language) +
# geom_vline(data = mu, aes(xintercept=grp.mean),
# linetype = "dotted", color = "blue") +
geom_vline(data = med, aes(xintercept = grp.median),
linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = 0), linetype = "dotted") +
labs(x = "Pseudo-Complexity Value", y = "Density") +
xlim(-2, 2) +
theme_bw()
print(density.plot)
```
Save figure to file.
```{r, fig.width = 6, fig.height = 5}
ggsave("Figures/BM_simulation_densities.pdf", density.plot, dpi = 300, scale = 1,
device = cairo_pdf)
```
# Descriptive Statistics
Give an overview of mean, median, and standard deviation values (i.e. values reflecting the location and spread of the distributions).
```{r}
stats.df <- cbind(mu, med[, 2], sdev[, 2])
colnames(stats.df) <- c("language", "mu", "med", "sdev")
stats.df.sorted <- stats.df[order(stats.df$language), ]
# round values to two decimal places, the "-1" excludes column 1
stats.df.sorted[, -1] <- round(stats.df.sorted[, -1], 2)
print(stats.df.sorted)
```
Output data frame as csv file.
```{r}
# note: the Tables/ directory must already exist
write.csv(stats.df.sorted, file = "Tables/BMsimulation_descriptiveStats.csv",
          row.names = F)
```
# Normality
The assumption that the tested data stem from a normally distributed population is often necessary for the mathematical proofs underlying standard statistical techniques. We might apply normality tests to check this assumption (e.g. Baayen 2008, p. 73), but some statisticians advise against such pre-tests, since they are often too sensitive (McDonald 2014, p. 133-136; Rasch et al. 2020, p. 67). In fact, Rasch et al. (2020, p. xi) argue, based on earlier simulation studies, that almost all standard statistical tests are fairly robust against deviations from normality. In a similar vein, Lumley et al. (2002) argue that non-normality of the data is a negligible issue for the t-test, at least for larger sample sizes, e.g. >= 100. However, especially for smaller sample sizes, it is still advisable to check for gross deviations from normality in the data. One common way of doing this is to use quantile-quantile plots, in which the points should roughly follow a straight line (Crawley 2007, p. 281).
```{r, fig.width = 3.5, fig.height = 3}
ggplot(simulation.df.long, aes(sample = value)) + stat_qq()
```
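A formal pre-test (with the caveats noted above) would be a Shapiro-Wilk test per language, e.g.:
```{r, eval = FALSE}
# Shapiro-Wilk normality test for each language separately (base R)
by(simulation.df.long$value, simulation.df.long$language, shapiro.test)
```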
# Statistical Tests
Standard t-tests can be used to assess significant differences in the means of the pseudo-complexity distributions, if we assume that the underlying population distributions are normal. Wilcoxon tests are a non-parametric alternative, i.e. they do not make assumptions about the underlying population distribution (Crawley 2007, p. 283; Baayen 2008, p. 77). We here run pairwise t-tests with the function pairwise.t.test(). By default this runs two-sample t-tests for unpaired samples; with the argument "paired = T" paired t-tests are invoked instead. We here assume that our data consist of vectors which are linked via the same measurement procedure(s), and we hence consider them "paired". A more general term is "related samples", which are defined as "two sets of data where a data point in one set has a pairwise relationship to a point in the other set of data" (Cahusac 2021, p. 56).
P-value adjustment for multiple comparisons: In the case of multiple testing, we should account for the fact that the likelihood of finding a significant result by chance increases with the number of statistical tests. One of the most conservative methods to account for this is the so-called Bonferroni correction, i.e. multiplying the p-values by the number of tests. This method assumes that the tests are independent of one another (McDonald 2014, p. 254-260). However, since we here compare, for example, French to Spanish and then French to German, there is dependence between the test results. We therefore apply the so-called Holm-Bonferroni method, which is less conservative and does not assume independence between tests (see the descriptions in the help page invoked by the command "?p.adjust").
```{r}
p.values <- pairwise.t.test(simulation.df.long$value, simulation.df.long$language,
paired = T, p.adjust.method = "holm")
p.values
```
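To illustrate how the Holm method is less conservative than plain Bonferroni, compare the two adjustments on the same set of raw p-values (illustrative values, not taken from our data):
```{r, eval = FALSE}
p.raw <- c(0.001, 0.01, 0.03, 0.04)
p.adjust(p.raw, method = "bonferroni")  # multiplies each p-value by the number of tests
p.adjust(p.raw, method = "holm")        # stepwise correction, uniformly no larger
```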
## Effect sizes
Statistical significance is only one part of the story. For instance, a difference in complexity values might be statistically significant, but so small that it is negligible for any theorizing. In fact, it is sometimes argued that effect sizes -- rather than p-values -- should be the aim of statistical inquiry (Cahusac 2021, p. 12-15). An overview of effect size measures per statistical test is given in Patil (2020). In conjunction with the t-test we here use Cohen's d (i.e. the function cohens_d() of the "rstatix" package). Note: the d estimate given by this function can be negative. However, the sign is not relevant here, only the absolute value.
```{r}
effect.sizes <- cohens_d(simulation.df.long, value ~ language, paired = T)
print(effect.sizes)
```
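For a single language pair, the paired Cohen's d reduces to the mean of the pairwise differences divided by their standard deviation, e.g. (a sketch, assuming the column names assigned above):
```{r, eval = FALSE}
# paired Cohen's d for French vs. Spanish, computed by hand
diffs <- simulation.df$French - simulation.df$Spanish
mean(diffs) / sd(diffs)
```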
## Effect size heatmap
Plot a heatmap with effect sizes to get a better overview.
```{r, fig.width = 5, fig.height = 4}
effect.sizes.plot <- ggplot(as.data.frame(effect.sizes), aes(group1, group2)) +
geom_tile(aes(fill = abs(effsize)), color = "white") +
scale_fill_gradient2(low = "light blue", mid = "light grey", high = "red",
midpoint = 0.5, limit = c(0, 2.5)) +
geom_text(aes(label = round(abs(effsize), 2))) +
labs(x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
effect.sizes.plot
```
Save figure to file.
```{r, fig.width = 5, fig.height = 4}
ggsave("Figures/BMsimulation_effectSizes.pdf", effect.sizes.plot, dpi = 300, scale = 1,
device = cairo_pdf)
```
# Interpretation
Note that the exact results of the tests above will differ every time the random vectors are re-sampled (which is avoided here by using the set.seed() function). Having said this, we do not seem to find any significant differences between the distributions of values generated along the branches of the tree by Brownian motion. In fact, it follows from the core properties of (the basic) Brownian motion model that the expected value of a trait at any time is equal to the value of the trait at time zero (Harmon 2019, p. 41). This means we should not expect to see significant differences in the average trait values (pseudo-complexities) of the languages in this simple simulation. What we do expect, on the other hand, are positive correlations between closely related languages. For instance, if a given complexity measure yields a relatively high value for Italian, we would expect it to also yield a relatively high value for French (according to our phylogeny). Such correlations can indeed be observed in our Brownian motion simulation.
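This expectation can be checked directly: the per-language means should scatter around the root value of 0 (a quick sketch):
```{r, eval = FALSE}
# per-language means of the simulated pseudo-complexity values
round(colMeans(simulation.df), 2)
```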
# References
Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction using statistics in R. Cambridge University Press.

Bentz, C., Dediu, D., Verkerk, A., and Jäger, G. (2018). The evolution of language families is shaped by the environment beyond neutral drift. Nature Human Behaviour, 2, 816–821. https://doi.org/10.1038/s41562-018-0457-6

Cahusac, P. M. B. (2021). Evidence-based statistics. John Wiley & Sons.

Crawley, M. J. (2007). The R book. John Wiley & Sons Ltd.

Harmon, L. (2019). Phylogenetic comparative methods. Online at https://lukejharmon.github.io/pcm/ (last accessed 25/11/2020).

Jäger, G. (2018). Global-scale phylogenetic linguistic inference from lexical resources. Scientific Data, 5, 180189. https://doi.org/10.1038/sdata.2018.189

Lumley, T., Diehr, P., Emerson, S., and Chen, L. (2002). The importance of the normality assumption in large public health data sets. Annual Review of Public Health, 23, 151–169.

McDonald, J. H. (2014). Handbook of biological statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. Online at http://www.biostathandbook.com

Patil, I. (2020). Test and effect size details. Online at https://cran.r-project.org/web/packages/statsExpressions/vignettes/stats_details.html

Rasch, D., Verdooren, R., and Pilz, J. (2020). Applied statistics: Theory and problem solutions with R. John Wiley & Sons Ltd.

Revell, L. (2016). Simulating Brownian motion in R. Online at http://www.phytools.org/Bariloche2016/ex/3/Simulating-BM.html (last accessed 25/11/2020).