-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmoorea_exploration.Rmd
282 lines (214 loc) · 10.5 KB
/
moorea_exploration.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
---
title: "Moorea Exercise"
author: "Owen Liu"
date: "Monday, October 12, 2015"
output: html_document
---
This is an exploratory document to both:
* explore some of the public Moorea datasets [available online](http://mcr.lternet.edu/data/), and to
* practice using RMarkdown and GitHub
```{r, echo= FALSE, include=FALSE}
library(dplyr)
library(doBy)
library(ggplot2)
library(DT)
```
## Data Organization and Sub-setting
The first dataset is a .csv of Annual Visual Fish Surveys from Moorea since 2006, with 56918 observations in 24 variables
```{r data import}
vfs <- read.csv('~/github/moorea_exploration/data/MCR_LTER_Annual_Fish_Survey_20150318.csv')
names(vfs)
```
The transects have data like date, location, and habitat type of the dive site, along with the taxonomy of the species identified and their lengths, abundance, etc.
Let's summarize some of these data and see what they look like (mostly from the doBy pkg). Data from all years pooled at first.
```{r top ten species}
## The summaryBy() function summarizes data according to a specified function
## and by the subset of variables you specify
## We could make a list of the most abundant species by all individuals for all years
abun <- summaryBy(Count~Taxonomy, data= vfs, FUN=sum, na.rm=TRUE) #summarizes the data
abun <- abun[order(-abun[,2]),] #sorts by total abundance
top_10 <- abun[1:10,] #top 10 most abundant species
top_10.names <- as.vector(top_10[,'Taxonomy']) #names (for later use)
par(mar=c(10,4,4,2))
barplot(top_10[,2],names.arg=top_10[,1],las=3, ylim=c(0,20000),
border=NA, ylab='Total invidivudals, all years',cex.names=0.8,
main='Top ten most abundant species',cex.axis=0.8)
```
## Calculating mean size by habitat
I want to look at which species may have ontogenetic habitat shifts between
fringing reef, backreef, and forereef, specifically species that use the
fringing reef as a nursery. One way to start to investigate this
is to calculate average size for each species in each habitat, and then see
if there are significant differences between habitats.
We're looking for something like this:
```{r ,echo=FALSE,warning=F,message=F}
library(png)
library(grid)
img <- readPNG('./data/hab_shift.png')
grid.raster(img)
```
where the hypothesis is that the fringing reef is a nursery, and the backreef and
forereef support larger fish.
First, let's look at all the species that have at least 10 observed individuals
in each habitat in the data (since, presumably, some species don't utilize all
habitats). In other words, let's limit our later analysis by those species that
are represented in each habitat.
```{r data cropping}
## Original dataset, counts aggregated by species, habitat, and length bin
sp_hab <- summaryBy(Count~Taxonomy + Habitat + Total_Length, data=vfs, FUN=sum,na.rm=TRUE)
knitr::kable(head(sp_hab,8))
## Function returns 1 if species is represented by at least 10 individuals in each habitat
## and returns 0 if not
rep_hab <- function(species) {
dat<-subset(sp_hab,sp_hab$Taxonomy==species) # all data for the species
fr <- subset(dat,dat$Habitat=='FR') # fringing reef
br <- subset(dat,dat$Habitat=='BA') # back reef
fo <- subset(dat,dat$Habitat=='FO') # forereef
if(sum(fr$Count.sum)>9 & sum(br$Count.sum)>9 & sum(fo$Count.sum)>9) return(1)
else return(0)
}
```
Now we can limit our full dataset to just the species that are represented in all habitats
```{r representation check}
rep_check <- sapply(sp_hab$Taxonomy,rep_hab) #check for representation
sp_rep <- sp_hab[rep_check==1,] #keep all the rows for represented species
sp_rep.names <- as.vector(unique(sp_rep$Taxonomy))
length(unique(sp_rep$Taxonomy)) #How many species?
```
We can see that there are 69 species that fit the representation criteria.
Now let's calculate the mean lengths for each species in each of the three
habitats.
```{r mean length by habitat function}
## Function for mean length by habitat
species.mean<-function(data,species,habitat) {
sp.data <- subset(data,data$Taxonomy==species & data$Habitat==habitat)
hab.count <- sum(sp.data$Count.sum)
freq <- sp.data$Count.sum/hab.count # relative frequency of a size class
hab.mean <- sum(sp.data$Total_Length*freq) # mean is the sum of all length classes times their relative frequencies
return(cbind(hab.count,hab.mean))
}
names <- as.vector(sapply(sp_rep.names,function(x) rep(x,3),USE.NAMES=F)) #expands names list so each species has 3 habitats
hab.vec <- rep(c('FR','BA','FO'),length(sp_rep.names))
## Data frame of species and their mean size by habitat
mean.lengths <- mapply(species.mean,species=names,habitat=hab.vec,MoreArgs=list(data=sp_rep))
hab.means <- data.frame(names,hab.vec,mean.lengths[1,],mean.lengths[2,])
names(hab.means)<-c('Species','Habitat','Total_Count','Mean_Length')
#DT::datatable(hab.means,caption='Species Mean Length By Habitat')
```
To run an ANOVA, we have to 'expand' the data such that each individual count is
recorded as a separate observation. Here's a quick function to do that.
```{r data expansion for ANOVA}
exp.counts<- function(data,species,habitat) {
x <- subset(data,data$Taxonomy==species & data$Habitat==habitat)
# next line replicates the total length based on how many individuals were counted in that length class
x2 <- unlist(mapply(function(x,y) rep(x,y), x$Total_Length,x$Count.sum))
x2<- as.numeric(x2) #has to be a continuous vector for later ANOVA
x3 <- data.frame(length=x2,habitat=rep(habitat,length(x2)))
return(x3) #returns the data as a column of lengths and a column of habitats, so now one
# row is one 'observation' of length
}
```
Let's try to look at one species after 'expanding' the data in this way. Randomly,
I'll use the species *Acanthurus nigricans*, the [whitecheek surgeonfish](http://www.discoverlife.org/IM/I_RR/0019/320/Acanthurus_nigricans,I_RR1947.jpg).
```{r a. nigricans example}
a_nigricans.fr <- exp.counts(sp_rep,'Acanthurus nigricans','FR') #fringing reef data expansion
a_nigricans.br <- exp.counts(sp_rep,'Acanthurus nigricans','BA') #back reef data expansion
a_nigricans.fo <- exp.counts(sp_rep,'Acanthurus nigricans','FO') #forereef data expansion
a_nigricans.all <- rbind(a_nigricans.fr, a_nigricans.br, a_nigricans.fo) #combined
boxplot(length~habitat,data=a_nigricans.all, xlab='Habitat',ylab='Total Length',
main= 'A. nigricans Total Length by Habitat')
```
Now we can do an ANOVA to test for significant difference between the three habitats
```{r a. nigricans ANOVA}
an.aov<-aov(length~habitat,data=a_nigricans.all) #the simple ANOVA for a. nigricans
summary(an.aov)
```
We can see that for this species there is a significant (p<<0.05) difference between mean
lengths by habitat! We can also look closer at pairwise comparisons by using Tukey's
Honest Significant Differences method
```{r Tukey test,echo=F}
an.tukey <- TukeyHSD(an.aov)
an.tukey$habitat
```
For all represented species...
```{r ANOVAs and Tukeys all species}
dat.all <- list() #List of data, one for each species
aov.all <- list() #List of ANOVAs, one for each species
tukey.all <- list() #List of Tukey tests, one for each species
for (i in 1:length(sp_rep.names)) {
sp <- sp_rep.names[i] #species name
fr.dat <- exp.counts(sp_rep,sp,'FR') #fringing reef data expansion
br.dat <- exp.counts(sp_rep,sp,'BA') #back reef data expansion
fo.dat <- exp.counts(sp_rep,sp,'FO') #forereef data expansion
dat.sp <- rbind(fr.dat,br.dat,fo.dat) #combined
dat.all[[sp]] <- dat.sp #add species to the list(s)
}
aov.all <- lapply(dat.all, function(x) aov(length~habitat,data=x)) #run an ANOVA
tukey.all <- lapply(aov.all,TukeyHSD) #run a Tukey test
```
Having the data in this way allows us to look up data, ANOVAs, and the Tukey
tests by species name, e.g.
```{r lookup example}
tukey.all[['Zebrasoma scopas']]
boxplot(length~habitat,data=dat.all[['Zebrasoma scopas']], xlab='Habitat',ylab='Total Length',
main= 'Z. scopas Total Length by Habitat')
```
Now we can look for only those species that have significant differences between habitats,
first using our ANOVA results
```{r significant aovs}
## a function that takes an ANOVA and tests for significance based on p-value
## returns 1 if significant, zero if not
aov.sig.check <- function (anova) {
p <- summary(anova)[[1]][['Pr(>F)']][1] #this is how to access the p-value alone
if(p<0.05) return(1)
else return(0)
}
#now apply it to our list of ANOVAs
sig_aovs.vec <- sapply(aov.all, aov.sig.check) #returns a vector
names_sig_aovs <- names(subset(aov.all,sig_aovs.vec==1)) #significant species
names_sig_aovs
sig_aov <-subset(aov.all,sig_aovs.vec==1)
sig_aov_dat <- subset(dat.all,sig_aovs.vec==1) #make sure we crop the dataset and tukey list as well
sig_aov_tukey <- subset(tukey.all,sig_aovs.vec==1)
```
This brings us down from 69 to 57 species. But this doesn't necessarily mean that these
species follow our hypothesis from above. We have to check if species get significantly
larger from fringe reef to back reef to fore reef, and we can use the results of the tukey
tests to investigate this, because it calculates significance for each step of the multiple
comparisons between habitats.
```{r tukey check}
tukey.check <- function(tukey) { #takes a tukey test as an input
diff.vec <- tukey$habitat[,'diff']
#checks that there are positive differences between BA-FR, FO-FR, and FO-BA
if(diff.vec[1]>0 & diff.vec[2]>0 &diff.vec[3]>0) return(1)
else return(0)
}
sig_tukey.vec <- sapply(sig_aov_tukey,tukey.check)
names_sig_tukey <- names(subset(sig_aov_tukey,sig_tukey.vec==1)) #significant species
names_sig_tukey
dat.final <- subset(sig_aov_dat,sig_tukey.vec==1)
aov.final <-subset(sig_aov,sig_tukey.vec==1)
tukey.final <- subset(sig_aov_tukey,sig_tukey.vec==1)
```
Only 14 species left! How manageable!
Boxplots
```{r boxplots, echo=FALSE,results='hide'}
dev.off()
x11()
par(mfrow=c(3,5))
for(i in 1:length(dat.final)) {
species=names(dat.final)[i]
boxplot(length~habitat,data=dat.all[[species]],main= species)
}
```
What have we ended up with? Let's look at these species
```{r final species list, echo=FALSE}
sp.final <- data.frame(taxonomy = names_sig_tukey)
eng.names <- c('scissortail sergeant','teardrop butterflyfish','daisy parrotfish',
'goldspot seabream','threeband pennantfish','pinktail triggerfish',
'twosaddle goatfish','bridled parrotfish','common parrotfish',
'yellowband parrotfish','streamlined spinefoot','pacific gregory',
'blunthead wrasse','sailfin tang')
sp.final$common <-eng.names
write.csv(sp.final,file='data/sp list.csv')
vfs.crop <- subset(vfs,vfs$Taxonomy%in%sp.final$taxonomy)