austraits_package.qmd

# The `austraits` package

```{r, include = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  comment = "#>",
  fig.path = "figures/"
)

library(dplyr)
library(stringr)
library(austraits)

austraits <- austraits:::austraits_5.0.0_lite
```

The `austraits` package was initially designed to aid users in accessing data from [AusTraits](https://zenodo.org/record/5112001), a curated plant trait database for the Australian flora. This package contains several core functions to explore, wrangle and visualise data. 

In 2024 the package was generalised to support all databases built using the [traits.build workflow](https://github.com/traitecoevo/traits.build), new functions were added, and existing functions were re-worked. The structure of AusTraits evolved from its release in 2021 until present and the version 3.0 of the austraits package only supports AusTraits versions from 5.0 onwards. If you are working with AusTraits version 4.2.0 or earlier, you need to install an [old version of austraits]() 

Below, we include a tutorial to illustrate how to use these functions.

**Note the examples shown us a subset of AusTraits release 5.0.0, but the code can be run using any traits.build database.**

## Getting started
`austraits` is still under development and not yet on Cran. To install the current version from GitHub: 

```{r setup, eval = FALSE}
#install.packages("remotes")
#remotes::install_github("traitecoevo/austraits", dependencies = TRUE, upgrade = "ask")

## Load the austraits package
library(austraits)
```

### Loading AusTraits database

`load_austraits` is the one `austraits` function that is specific to the AusTraits database. By default, `load_austraits` will download AusTraits to a specified path e.g. `data/austraits` and will reload it from this location in the future. You can set `update = TRUE` so the austrait versions are downloaded fresh from [Zenodo](https://zenodo.org/record/3568429). Note that `load_austraits` accepts a version number or the DOI of a particular version.

If you are new to using AusTraits we recommend you download the most recent release, while you may want to download an older version to reproduce a previous analysis.

```{r, load_data, eval = FALSE}
austraits <- load_austraits(version = "6.0.0", path = "data/austraits")
```

You can check out the different versions of Austraits and their associated DOI by using: 

```{r, eval = FALSE}
get_versions(path = "data/austraits")
```

The traits.build object is a very *long* list with various of elements. If you are not familiar with working with lists in R, we recommend having a quick look at this [tutorial](https://www.tutorialspoint.com/r/r_lists.htm). To learn more about the structure of a traits.build database, check out the [structure of the database](https://traitecoevo.github.io/austraits/articles/structure.html). 

```{r}
austraits
```

## Descriptive summaries of traits and taxa

The perfect way to begin exploring a traits.build database is to learn which traits are included and how much data exists for various traits and taxa.

Interested in a specific trait? `lookup_trait`, `lookup_location_property` and `lookup_context_property` let you find terms based on exact and partial string matches.

```{r}
lookup_trait(database = austraits, term = "leaf") %>% head()

lookup_location_property(database = austraits, term = "temperature") %>% head()

lookup_context_property(database = austraits, term = "fire") %>% head()
```

Alternatively, have a look how much data a traits.build database has for specific traits or taxa. *This function only summarises by `trait_name`, `genus` or `family`.*

```{r}
summarise_database(database = austraits, var = "trait_name") %>% head()

summarise_database(database = austraits, var =  "family") %>% head()

summarise_database(database = austraits, var =  "genus") %>% head()
```

All traits.build databases include definitions for all traits. Check out the dictionary if you want to learn more about trait's that have been output by a `lookup_` or `summarise_` query.

```{r}
austraits$definitions %>% head()

austraits$definitions[["leaf_area"]] %>% convert_list_to_df1
```


## Extracting data

In most cases, users would like to extract a subset of a traits.build database for their own research purposes.`extract_dataset` subsets by dataset(s), `extract_trait`subsets by trait, and `extract_taxa` subsets by taxon_name, genus or family. In addition, the function `extract_data` extracts data based on a specified value(s) from any column of any table within a traits.build database.

Note that the other tables and elements of the AusTraits data are extracted too, not just the main traits table, retaining the database's original structure. See `?extract_data` and `?extract_trait` for more details.

### Extracting by study

Filtering **one particular study** and assigning it to an object

```{r, extract_study}
subset_data <- extract_dataset(database = austraits, dataset_id = "Falster_2005_2")

subset_data$traits %>% head()
```

Filtering **multiple studies by two different lead authors**  and assigning it to an object

```{r, extract_studies}
subset_multi_studies <- extract_dataset(database = austraits, 
                                        dataset_id = c("Thompson_2001","Ilic_2000"))
 
subset_multi_studies$traits %>% distinct(dataset_id)
```

Filtering **multiple studies by same lead author** (e.g. Falster) and assigning it to an object.  

```{r, extract_Falster_studies}
data_falster_studies <- extract_dataset(austraits, "Falster")

data_falster_studies$traits %>% distinct(dataset_id)
```

### Extracting by taxonomic level

Filtering 
```{r extract_taxa}
# By family 
proteaceae <- extract_taxa(austraits, family = "Proteaceae")
# Checking that only taxa in Proteaceae have been extracted
proteaceae$taxa$family %>% unique()

# By genus 
acacia <- extract_taxa(austraits, genus = "Acacia")
# Checking that only taxa in Acacia have been extracted
acacia$traits$taxon_name %>% unique() %>% head()
```

### Extracting by trait

Filtering **one trait** and assigning it to an object

```{r, extract_trait}
data_wood_dens <- extract_trait(austraits, "wood_density")

head(data_wood_dens$traits)
```

Using `extract_trait` to extract data for **all traits with 'leaf' in the trait name** and assigning it to an object.

```{r}
data_leaf <- extract_trait(austraits, "leaf") 

unique(data_leaf$traits$trait_name)
```

Using `extract_trait` to extract data for **all traits with 'leaf_N' or 'leaf_P' in the trait name** and assigning it to an object.

```{r}
data_vector_extraction <- extract_trait(austraits, c("leaf_photosyn", "huber_value")) 

unique(data_vector_extraction$traits$trait_name)
```

The function `extract_data` offers the flexibility of subsetting a database based on any combination of data table, column within the table, and column value. 

The database tables you can subset on are: 'traits', 'locations', 'contexts', 'methods', 'contributors', 'taxa', and 'taxonomic_updates' 

```{r}
observations_with_soil_data <- extract_data(
  database = austraits, 
  table = "locations",
  col = "location_property",
  col_value = "soil") 

# If you are unsure about column names, check with:

names(austraits$methods)
```

Any of the extract functions can be linked together to output a more precise subset of data.

For instance, to return data on 'leaf mass per area' and 'wood density' for 'adult' plants of any species in the genus Acacia:

```{r}
Acacias_specific_traits <- austraits %>% 
  extract_taxa(genus = "Acacia") %>%
  extract_trait(trait_name = c("leaf_mass_per_area", "wood_density")) %>%
  extract_data(table = "traits", col = "life_stage", col_value = "adult")
```

## Join data from other tables and elements

Once users have extracted the data they want, they may want to merge the study metadata stored in various relational tables into the main `traits` dataframe for their analyses. For example, users may require additional taxonomic information for a phylogenetic analysis, location coordinates to plot data, or context properties to understand variation in trait values for a single taxon. This is where the `join_` functions come in. 

There are seven `join_` functions in total, each designed to append specific information from other tables and elements in the `austraits`the ancillary data tables in a `traits.build` object. Their suffixes refer to the type of information that is joined, e.g. `join_taxonomy` appends taxonomic information to the `traits` dataframe. Each function lets you select which data columns you wish to add and the output format.

```{r, join_}
# Join location coordinates
(data_leaf %>% join_location_coordinates)$traits %>% head()

# Join location properties using defaults
(data_leaf %>% join_location_properties)$traits %>% head()

# Join location properties, with each location property pertaining to soil added as a separate column.
temperature_properties <- lookup_location_property(data_leaf, "temperature")
(data_leaf %>% join_location_properties(format = "many_columns", vars = temperature_properties))$traits %>% head()

# Join context properties using defaults
(data_leaf %>% join_context_properties)$traits %>% head()

# Join context properties, with each context property pertaining to fire added as a separate column.
fire_properties <- lookup_context_property(data_leaf, "fire")
(austraits %>% join_context_properties(format = "many_columns", vars = fire_properties, include_description = TRUE))$traits %>% head()

# Join methodological information using defaults
(data_leaf %>% join_methods)$traits %>% head()

# Join taxonomic information using defaults
(data_leaf %>% join_taxa)$traits %>% head()

# Join taxonomic updates information using defaults
(data_leaf %>% join_taxonomic_updates)$traits %>% head()

# Join contributors information using defaults
(data_leaf %>% join_contributors)$traits %>% head()
```

All data tables can be merged with `flatten_database`, which calls each of the `join_` functions.

```{r}
# Flatten database using defaults
all_joined <- data_leaf %>% flatten_database()

# Flatten database also accepts a list of vectors to specify which columns to include for each table
data_leaf %>% flatten_database(
                format = "single_column_json",
                vars = list(
                  location = "all",
                  context = "all",
                  contributors = "all",
                  taxonomy = c("family"),
                  taxonomic_updates = c("aligned_name"),
                  methods = c("methods"))
)
```

## Binding extracted databases

The function `bind_databases` allows you to bind together two subsetted databases you have created by filtering (`extract_` functions) based on two critera.
```{r}
extracted_wood <- austraits %>% extract_trait("wood_density")
extracted_Rutaceae <- austraits %>% extract_taxa(family = "Rutaceae")
merged <- bind_databases(extracted_wood, extracted_Rutaceae)
```

## Visualising data by site

`plot_locations` graphically summarises where trait data were collected and how much data is available. The legend refers to the number of neighbouring points: the warmer the colour, the more data that are available for a particular location. This function only includes data that are geo-referenced. Users must first use `join_location_coordinates` to append latitude and longitude information into the trait dataframe before plotting

```{r, site_plot, fig.align = "center", fig.width=5, fig.height=5}
data_wood_dens <- data_wood_dens %>% join_location_coordinates()
plot_locations(data_wood_dens$traits)
```

## Visualising data distribution and variance

`plot_trait_distribution_beeswarm` creates histograms and [beeswarm plots](https://github.com/eclarke/ggbeeswarm) for continuous traits to help users visualise the data. Users can specify whether to create separate beeswarm plots based on any column in the traits table (e.g. `dataset_id` or `life_stage`) or at the level of `genus` or `family`.

```{r, beeswarm, fig.align = "center", fig.width=6, fig.height=7}
austraits %>% plot_trait_distribution_beeswarm("wood_density", "family")
austraits %>% plot_trait_distribution_beeswarm("wood_density", "dataset_id")
```

## Pivotting from long to wide format 

The table of traits in AusTraits comes in **long** format, where data for all trait information are denoted by two columns called `trait_name` and `value`. You can convert this to wide format, where each trait is in a separate column, using the function `trait_pivot_wider`. Note, that the informtion in the columns `unit`, `replicates`, `measurement_remarks`, and `basis_of_value` is lost when this function pivots.

```{r, pivot}
data_wide_bound <- data_falster_studies %>% # Joining multiple obs with `--`
  trait_pivot_wider()
```

If there are multiple measurements linked to the same observation_id, such as when both a minimum and maximum are recorded, this information is retained as separate rows when using `trait_pivot_wider`. Instead, you can use `bind_trait_values` first, which merges multiple entries for `value`, `value_type`, `basis_of_value`, and `replicates` into a single row, with values delimited by "--".

```{r}
bound_values <- (austraits %>% extract_dataset("ABRS_1981"))$traits %>%
  bind_trait_values()

bounded_wider <- bound_values %>% trait_pivot_wider
```

If you would like to revert the bounded trait values, you have to use `separate_trait_values`. Note, this function does not always recreate the original table as the delimitor "--" is also used for value_type `bin` and `range`, which should not necessarily be split. *This function is on the list to rework for future `austraits` versions.*

```{r}
bound_values2 <- (austraits %>% 
                    extract_dataset("ABRS_1981") %>%
                    extract_data(table = "traits", col = "value_type", col_value = c("minimum", "maximum"))
                  )$traits %>%
  bind_trait_values()

bound_values2 %>%
  separate_trait_values(., austraits$definitions)
```