moorea_exploration.Rmd

---
title: "Moorea Exercise"
author: "Owen Liu"
date: "Monday, October 12, 2015"
output: html_document
---

This is an exploratory document to both:

* explore some of the public Moorea datasets [available online](http://mcr.lternet.edu/data/), and to 

* practice using RMarkdown and GitHub


```{r, echo= FALSE, include=FALSE}
library(dplyr)
library(doBy)
library(ggplot2)
library(DT)
```

## Data Organization and Sub-setting

The first dataset is a .csv of Annual Visual Fish Surveys from Moorea since 2006, with 56918 observations in 24 variables

```{r data import}
vfs <- read.csv('~/github/moorea_exploration/data/MCR_LTER_Annual_Fish_Survey_20150318.csv')
names(vfs)
```

The transects have data like date, location, and habitat type of the dive site, along with the taxonomy of the species identified and their lengths, abundance, etc.

Let's summarize some of these data and see what they look like (mostly from the doBy pkg).  Data from all years pooled at first.

```{r top ten species}
## The summaryBy() function summarizes data according to a specified function
## and by the subset of variables you specify
## We could make a list of the most abundant species by all individuals for all years
abun <- summaryBy(Count~Taxonomy, data= vfs, FUN=sum, na.rm=TRUE) #summarizes the data
abun <- abun[order(-abun[,2]),] #sorts by total abundance
top_10 <- abun[1:10,] #top 10 most abundant species
top_10.names <- as.vector(top_10[,'Taxonomy']) #names (for later use)
par(mar=c(10,4,4,2))
barplot(top_10[,2],names.arg=top_10[,1],las=3, ylim=c(0,20000),
        border=NA, ylab='Total invidivudals, all years',cex.names=0.8,
        main='Top ten most abundant species',cex.axis=0.8)

```


## Calculating mean size by habitat

I want to look at which species may have ontogenetic habitat shifts between
fringing reef, backreef, and forereef, specifically species that use the
fringing reef as a nursery. One way to start to investigate this
is to calculate average size for each species in each habitat, and then see
if there are significant differences between habitats.

We're looking for something like this:

```{r ,echo=FALSE,warning=F,message=F}
library(png)
library(grid)
img <- readPNG('./data/hab_shift.png')
grid.raster(img)
```

where the hypothesis is that the fringing reef is a nursery, and the backreef and
forereef support larger fish.

First, let's look at all the species that have at least 10 observed individuals
in each habitat in the data (since, presumably, some species don't utilize all
habitats).  In other words, let's limit our later analysis by those species that 
are represented in each habitat.

```{r data cropping}
## Original dataset, counts aggregated by species, habitat, and length bin
sp_hab <- summaryBy(Count~Taxonomy + Habitat + Total_Length, data=vfs, FUN=sum,na.rm=TRUE)

knitr::kable(head(sp_hab,8))

## Function returns 1 if species is represented by at least 10 individuals in each habitat
## and returns 0 if not
rep_hab <- function(species) {
  dat<-subset(sp_hab,sp_hab$Taxonomy==species) # all data for the species
  fr <- subset(dat,dat$Habitat=='FR') # fringing reef
  br <- subset(dat,dat$Habitat=='BA') # back reef
  fo <- subset(dat,dat$Habitat=='FO') # forereef
  if(sum(fr$Count.sum)>9 & sum(br$Count.sum)>9 & sum(fo$Count.sum)>9) return(1)
  else return(0)
}
```

Now we can limit our full dataset to just the species that are represented in all habitats

```{r representation check}
rep_check <- sapply(sp_hab$Taxonomy,rep_hab) #check for representation
sp_rep <- sp_hab[rep_check==1,] #keep all the rows for represented species
sp_rep.names <- as.vector(unique(sp_rep$Taxonomy))

length(unique(sp_rep$Taxonomy)) #How many species?
```

We can see that there are 69 species that fit the representation criteria.
Now let's calculate the mean lengths for each species in each of the three 
habitats.

```{r mean length by habitat function}
## Function for mean length by habitat

species.mean<-function(data,species,habitat) {
  sp.data <- subset(data,data$Taxonomy==species & data$Habitat==habitat)
  hab.count <- sum(sp.data$Count.sum)
  freq <- sp.data$Count.sum/hab.count # relative frequency of a size class
  hab.mean <- sum(sp.data$Total_Length*freq) # mean is the sum of all length classes times their relative frequencies
  return(cbind(hab.count,hab.mean))
}

names <- as.vector(sapply(sp_rep.names,function(x) rep(x,3),USE.NAMES=F)) #expands names list so each species has 3 habitats

hab.vec <- rep(c('FR','BA','FO'),length(sp_rep.names))

## Data frame of species and their mean size by habitat
mean.lengths <- mapply(species.mean,species=names,habitat=hab.vec,MoreArgs=list(data=sp_rep))
hab.means <- data.frame(names,hab.vec,mean.lengths[1,],mean.lengths[2,])

names(hab.means)<-c('Species','Habitat','Total_Count','Mean_Length')

#DT::datatable(hab.means,caption='Species Mean Length By Habitat')
```

To run an ANOVA, we have to 'expand' the data such that each individual count is
recorded as a separate observation.  Here's a quick function to do that.

```{r data expansion for ANOVA}
exp.counts<- function(data,species,habitat) {
  x <- subset(data,data$Taxonomy==species & data$Habitat==habitat)
  # next line replicates the total length based on how many individuals were counted in that length class
  x2 <- unlist(mapply(function(x,y) rep(x,y), x$Total_Length,x$Count.sum))
  x2<- as.numeric(x2) #has to be a continuous vector for later ANOVA
  x3 <- data.frame(length=x2,habitat=rep(habitat,length(x2)))
  return(x3) #returns the data as a column of lengths and a column of habitats, so now one
  # row is one 'observation' of length
}
```

Let's try to look at one species after 'expanding' the data in this way.  Randomly,
I'll use the species *Acanthurus nigricans*, the [whitecheek surgeonfish](http://www.discoverlife.org/IM/I_RR/0019/320/Acanthurus_nigricans,I_RR1947.jpg).

```{r a. nigricans example}
a_nigricans.fr <- exp.counts(sp_rep,'Acanthurus nigricans','FR') #fringing reef data expansion
a_nigricans.br <- exp.counts(sp_rep,'Acanthurus nigricans','BA') #back reef data expansion
a_nigricans.fo <- exp.counts(sp_rep,'Acanthurus nigricans','FO') #forereef data expansion

a_nigricans.all <- rbind(a_nigricans.fr, a_nigricans.br, a_nigricans.fo) #combined
boxplot(length~habitat,data=a_nigricans.all, xlab='Habitat',ylab='Total Length',
        main= 'A. nigricans Total Length by Habitat')
```

Now we can do an ANOVA to test for significant difference between the three habitats

```{r a. nigricans ANOVA}
an.aov<-aov(length~habitat,data=a_nigricans.all) #the simple ANOVA for a. nigricans
summary(an.aov)
```

We can see that for this species there is a significant (p<<0.05) difference between mean
lengths by habitat!  We can also look closer at pairwise comparisons by using Tukey's
Honest Significant Differences method

```{r Tukey test,echo=F}
an.tukey <- TukeyHSD(an.aov)
an.tukey$habitat
```

For all represented species...

```{r ANOVAs and Tukeys all species}
dat.all <- list() #List of data, one for each species
aov.all <- list() #List of ANOVAs, one for each species
tukey.all <- list() #List of Tukey tests, one for each species

for (i in 1:length(sp_rep.names)) {
  sp <- sp_rep.names[i] #species name
  fr.dat <- exp.counts(sp_rep,sp,'FR') #fringing reef data expansion
  br.dat <- exp.counts(sp_rep,sp,'BA') #back reef data expansion
  fo.dat <- exp.counts(sp_rep,sp,'FO') #forereef data expansion

  dat.sp <- rbind(fr.dat,br.dat,fo.dat) #combined
  dat.all[[sp]] <- dat.sp #add species to the list(s)
}

aov.all <- lapply(dat.all, function(x) aov(length~habitat,data=x)) #run an ANOVA
  
tukey.all <- lapply(aov.all,TukeyHSD) #run a Tukey test
```

Having the data in this way allows us to look up data, ANOVAs, and the Tukey
tests by species name, e.g.

```{r lookup example}
tukey.all[['Zebrasoma scopas']]

boxplot(length~habitat,data=dat.all[['Zebrasoma scopas']], xlab='Habitat',ylab='Total Length',
        main= 'Z. scopas Total Length by Habitat')
```

Now we can look for only those species that have significant differences between habitats,
first using our ANOVA results

```{r significant aovs}
## a function that takes an ANOVA and tests for significance based on p-value
## returns 1 if significant, zero if not
aov.sig.check <- function (anova) { 
  
  p <- summary(anova)[[1]][['Pr(>F)']][1] #this is how to access the p-value alone
  
  if(p<0.05) return(1)
  else return(0) 
}

#now apply it to our list of ANOVAs
sig_aovs.vec <- sapply(aov.all, aov.sig.check) #returns a vector
names_sig_aovs <- names(subset(aov.all,sig_aovs.vec==1)) #significant species
names_sig_aovs

sig_aov <-subset(aov.all,sig_aovs.vec==1)
sig_aov_dat <- subset(dat.all,sig_aovs.vec==1) #make sure we crop the dataset and tukey list as well
sig_aov_tukey <- subset(tukey.all,sig_aovs.vec==1)

```

This brings us down from 69 to 57 species.  But this doesn't necessarily mean that these
species follow our hypothesis from above.  We have to check if species get significantly
larger from fringe reef to back reef to fore reef, and we can use the results of the tukey
tests to investigate this, because it calculates significance for each step of the multiple
comparisons between habitats.

```{r tukey check} 
tukey.check <- function(tukey) { #takes a tukey test as an input
  
  diff.vec <- tukey$habitat[,'diff']
  
  #checks that there are positive differences between BA-FR, FO-FR, and FO-BA
  if(diff.vec[1]>0 & diff.vec[2]>0 &diff.vec[3]>0) return(1)
  else return(0)
  }
sig_tukey.vec <- sapply(sig_aov_tukey,tukey.check)
names_sig_tukey <- names(subset(sig_aov_tukey,sig_tukey.vec==1)) #significant species
names_sig_tukey

dat.final <- subset(sig_aov_dat,sig_tukey.vec==1)
aov.final <-subset(sig_aov,sig_tukey.vec==1)
tukey.final <- subset(sig_aov_tukey,sig_tukey.vec==1)
```

Only 14 species left! How manageable!

Boxplots

```{r boxplots, echo=FALSE,results='hide'}
dev.off()
x11()
par(mfrow=c(3,5))
for(i in 1:length(dat.final)) {
  species=names(dat.final)[i]
  boxplot(length~habitat,data=dat.all[[species]],main= species)
}
```


What have we ended up with? Let's look at these species
```{r final species list, echo=FALSE}
sp.final <- data.frame(taxonomy = names_sig_tukey)
eng.names <- c('scissortail sergeant','teardrop butterflyfish','daisy parrotfish',
               'goldspot seabream','threeband pennantfish','pinktail triggerfish',
               'twosaddle goatfish','bridled parrotfish','common parrotfish',
               'yellowband parrotfish','streamlined spinefoot','pacific gregory',
               'blunthead wrasse','sailfin tang')
sp.final$common <-eng.names
write.csv(sp.final,file='data/sp list.csv')
vfs.crop <- subset(vfs,vfs$Taxonomy%in%sp.final$taxonomy)