
# AusTraits tutorial

## Introduction

With more than 1.8 million data records, AusTraits is Australia's [largest plant trait database](austraits_database.html#plant_database), created using the [`{traits.build}` R package](https://github.com/traitecoevo/traits.build).

This tutorial introduces:

-   [the database structure](#database_structure)

-   [`{austraits}` R package functions](#austraits_functions)

-   additional [examples of analyses](#sample_analyses) using the database

To access more information about `traits.build`, see the [`traits.build-book`](https://traitecoevo.github.io/traits.build-book/).

Or you can visit the GitHub repositories for the individual packages and data:

-   the database structure: [`traits.build`](https://github.com/traitecoevo/traits.build)

-   the database contents: [`austraits.build`](https://github.com/traitecoevo/austraits.build)

-   an R package for exploring and wrangling the data: [`austraits`](https://github.com/traitecoevo/austraits)

## Download AusTraits data

Before you begin, install and load the required packages, then source the helper functions.

```{r, eval = TRUE, message = FALSE}
# Packages need to be installed the first time you use them.
# They are commented out here, so they aren't reinstalled each time you run the code, 
# but install any packages you require the first time you run this tutorial.

#install.packages(c("readr", "tidyr", "dplyr", "stringr", "remotes"))
library(readr)
library(tidyr)
library(dplyr)
library(stringr)

#remotes::install_github("traitecoevo/austraits", dependencies = TRUE, upgrade = "ask")
#remotes::install_github("traitecoevo/traits.build", dependencies = TRUE, upgrade = "ask")

library(austraits)    # functions for exploring a traits.build database, available from its GitHub repo
# library(traits.build) # additional functions for exploring a traits.build database, available from its GitHub repo

#source("https://raw.githubusercontent.com/traitecoevo/traits.build-book/master/data/extra_functions.R")
source("data/extra_functions.R")
```

Then download (or build) the latest AusTraits database, using one of the methods described [here](austraits_database.html#access_data).

This tutorial uses the most recent AusTraits release, version `r austraits::get_versions()[["version"]][1]`.

```{r, eval = TRUE}
most_recent <- austraits::get_versions() %>%
  dplyr::pull("doi") %>%
  dplyr::first()

most_recent

austraits <- austraits::load_austraits(doi = most_recent)
```

## A first look at data {#exploring}

If you're not familiar with AusTraits, you may want to begin by exploring the breadth and depth of data within the database. The database can be explored by trait name, species, or genus using either [austraits functions](#austraits_functions) or dplyr functions.

*How many taxa have `leaf_N_per_dry_mass` data in AusTraits?*

```{r, eval = TRUE}
(austraits %>% 
  austraits::extract_trait(trait_name = "leaf_N_per_dry_mass"))$traits %>%
  dplyr::distinct(taxon_name) %>% nrow()
```

*How are these data distributed across datasets?*

```{r, eval = TRUE}
austraits::plot_trait_distribution_beeswarm(database = austraits, trait_name = "leaf_N_per_dry_mass", y_axis_category = "dataset_id")
```

*How much data exist for other nitrogen traits?*

```{r, eval = TRUE}
N_traits <- austraits::lookup_trait(austraits, "_N_")

austraits %>% 
  austraits::extract_trait(trait_name = N_traits) %>% 
  austraits::summarise_database(var = "trait_name") %>%
  dplyr::arrange(-n_taxa)
```

*How many "hydraulic" traits are in AusTraits? How much data exist for these traits?*

```{r, eval = TRUE}
hydraulic_traits <- austraits::lookup_trait(austraits, "hydraulic")

austraits %>% 
  austraits::extract_trait(trait_name = hydraulic_traits) %>%
  austraits::summarise_database(var = "trait_name") %>%
  dplyr::arrange(-n_taxa)
```

*Where have trait data for Acacia aneura been collected?*

```{r, eval = TRUE, warning = FALSE, message = FALSE}
data <-
  austraits %>%
     austraits::extract_taxa(taxon_name = "Acacia aneura") %>%
     austraits::join_location_coordinates()

data$traits %>% austraits::plot_locations("taxon_name")
```

*Where have data for Hibbertia species been collected?*

```{r, eval = TRUE, warning = FALSE,  message = FALSE}
data <-
  austraits %>%
     austraits::extract_taxa(genus = "Hibbertia") %>%
     austraits::join_location_coordinates() %>%
     austraits::join_taxa(var = "genus")

data$traits %>% austraits::plot_locations("genus")
```

## The database structure {#database_structure}

The `{traits.build}` R package is the workflow that builds AusTraits from its component datasets.

The database is output as a list, a collection of relational tables, described in detail [here](database_structure.html).

A `traits.build` data object includes both the relational data tables and additional tables documenting database metadata and a traits dictionary.

```{r, eval = TRUE}
austraits
```

### Traits table

The core AusTraits table is the traits table. It is in "long" format, with each row documenting a single trait measurement.

```{r, eval = TRUE}
austraits$traits %>% dplyr::slice(1:20)
```

The columns include:

-   core columns
    -   dataset_id
    -   taxon_name
    -   trait_name
    -   value (trait value)
-   entity metadata
    -   entity_type
    -   life_stage
-   value metadata
    -   value_type
    -   unit
    -   basis_of_value
    -   replicate
    -   basis_of_record
-   additional metadata
    -   collection_date
    -   measurement_remarks
-   identifiers for specific observations, individuals, etc.
    -   observation_id
    -   individual_id
    -   population_id
    -   repeat_measurements_id
-   identifiers that provide links to ancillary tables with additional metadata
    -   location_id
    -   treatment_context_id
    -   plot_context_id
    -   entity_context_id
    -   temporal_context_id
    -   method_context_id
    -   method_id
    -   source_id
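
To make the "long" layout concrete, the following sketch builds a toy table with a handful of the columns listed above. The values are invented for illustration and are not AusTraits data. It also assumes that `value` is stored as text, because a single column has to hold both numeric and categorical trait values.

```{r, eval = TRUE}
# Toy traits table in "long" format: invented values, NOT real AusTraits data
toy_traits <- data.frame(
  dataset_id     = c("Toy_2020", "Toy_2020", "Toy_2020", "Toy_2021"),
  taxon_name     = c("Acacia aneura", "Acacia aneura", "Banksia serrata", "Banksia serrata"),
  observation_id = c("01", "01", "02", "01"),
  trait_name     = c("leaf_area", "leaf_length", "leaf_area", "leaf_area"),
  value          = c("120", "45", "3100", "2900"),
  unit           = c("mm2", "mm", "mm2", "mm2"),
  stringsAsFactors = FALSE
)

# Each row is one measurement, so counting rows counts measurements
nrow(toy_traits)

# Number of measurements per taxon and trait
toy_counts <- table(toy_traits$taxon_name, toy_traits$trait_name)
toy_counts

# Values for all traits share one column (assumed to be text here), so
# convert to numeric only after filtering to a single numeric trait
toy_leaf_area <- toy_traits[toy_traits$trait_name == "leaf_area", ]
mean(as.numeric(toy_leaf_area$value))
```

Because every measurement is its own row, adding a new trait never requires a new column; it only adds rows.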

### Ancillary data tables

The remaining metadata accompanying each trait record is recorded across multiple relational tables.

These include: 

-   austraits\$locations 
-   austraits\$contexts 
-   austraits\$methods 
-   austraits\$taxa 
-   austraits\$taxonomic_updates 
-   austraits\$contributors 

Like the core `traits` table, each is in 'long' format.

The tables `locations`, `contexts`, `methods`, `taxa` and `taxonomic_updates` include metadata that link to individual rows within `traits` through shared columns.

A collection of `join_` functions within `austraits` join the ancillary tables to the traits table, based on columns shared across tables.

| table             | metadata in table                                                                                                                               | columns that link to austraits\$traits                                                                                                                                                 |
|------------------|--------------------------|----------------------------|
| locations         | location name, location properties, latitude, longitude                                                                                         | dataset_id, location_id                                                                                                                                                                |
| contexts          | context name, context category (method context, temporal, entity context, plot, treatment), context property                                    | dataset_id, link_id (identifier to link to: method_context_id, temporal_context_id, entity_context_id, plot_context_id, treatment_context_id), link_vals (identifier value to link to) |
| methods           | dataset description, dataset sampling strategy, trait collection method, data collectors, data curators, dataset citation, source_id & citation | dataset_id, trait_name, method_id                                                                                                                                                      |
| taxa              | genus, family, scientific name, APC/APNI taxon concept/taxon name identifiers                                                                   | taxon_name                                                                                                                                                                             |
| taxonomic_updates | original name (name submitted), aligned name (typos removed; standardised syntax), identifiers for aligned name                                 | dataset_id, taxon_name, original_name                                                                                                                                                  |
| contributors      | people who contributed data, including their ORCIDs, affiliations, roles                                                                        | dataset_id                                                                                                                                                                             |
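
The linking columns in this table can be illustrated with a small base R sketch. It is not how the `join_` functions are implemented; it only shows the idea of matching rows on shared key columns. All data are invented, and the property names `latitude (deg)` and `longitude (deg)` are assumptions made for the example.

```{r, eval = TRUE}
# Toy tables: invented values, NOT real AusTraits data
toy_traits <- data.frame(
  dataset_id  = c("Toy_2020", "Toy_2020", "Toy_2021"),
  location_id = c("01", "02", "01"),
  taxon_name  = c("Acacia aneura", "Acacia aneura", "Banksia serrata"),
  trait_name  = "leaf_area",
  value       = c("120", "135", "3100"),
  stringsAsFactors = FALSE
)
toy_locations <- data.frame(
  dataset_id        = rep(c("Toy_2020", "Toy_2020", "Toy_2021"), each = 2),
  location_id       = rep(c("01", "02", "01"), each = 2),
  location_property = rep(c("latitude (deg)", "longitude (deg)"), times = 3),
  value             = c("-25.3", "131.0", "-23.7", "133.9", "-33.7", "151.3"),
  stringsAsFactors = FALSE
)

keys <- c("dataset_id", "location_id")

# Pull one property out of the long locations table as its own column
get_property <- function(locations, property, new_name) {
  out <- locations[locations$location_property == property, c(keys, "value")]
  names(out)[names(out) == "value"] <- new_name
  out
}

toy_coords <- merge(
  get_property(toy_locations, "latitude (deg)", "latitude"),
  get_property(toy_locations, "longitude (deg)", "longitude"),
  by = keys
)

# Left join onto traits; both key columns are needed because
# location_id "01" is reused by two different datasets
toy_joined <- merge(toy_traits, toy_coords, by = keys, all.x = TRUE)
toy_joined
```

Joining on `location_id` alone would wrongly pair the `Toy_2021` measurement with the coordinates of a `Toy_2020` location, which is why the table above lists `dataset_id` among the linking columns.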

## Exploring AusTraits

With `r nrow(austraits$traits)` rows of trait values in the main traits table, knowing how to explore the contents is essential.

The R package [`{austraits}`](https://github.com/traitecoevo/austraits) offers a collection of functions to explore and wrangle AusTraits data -- or indeed any data using the traits.build format. 

An austraits package vignette is available [here](https://traitecoevo.github.io/traits.build-book/austraits_package.html).

Function categories include:

-   [**summarise and lookup functions**](#summarise_functions): These functions offer summaries by taxon name or trait, summarising taxa per trait (or other variable), datasets per trait, and observations per trait.

-   [**filtering functions**](#filtering_functions): These functions begin with the word `extract` and filter all of the relational tables simultaneously.

-   [**join functions**](#join_functions): These functions allow columns from the relational tables to be joined to the core traits table.

-   [**pivot functions**](#pivot_functions): These functions allow the traits table to be pivoted to wide format.

-   [**plotting functions**](#plotting_functions): These functions offer a means of rapidly visualising AusTraits data, either plotting collection locations on a map of Australia or plotting trait values by dataset.

### austraits.R function reference

Reference guide to: [austraits functions](https://traitecoevo.github.io/austraits/reference/index.html)

### Summarising data: data coverage {#summarise_functions}

There are two function families for summarising AusTraits data:

-   the `lookup_` functions
-   `summarise_database()`

Use the function `summarise_database` to output summaries of total records, datasets with records, and taxa with records for each family, genus or trait (`var = "family"`, `var = "genus"` or `var = "trait_name"`):

```{r, eval = TRUE}
austraits::summarise_database(database = austraits, var = "trait_name") %>% dplyr::slice(100:130)
austraits::summarise_database(database = austraits, var = "family") %>% dplyr::slice(1:20)
austraits::summarise_database(database = austraits, var = "genus") %>% dplyr::slice(1:20)
```

Since this function summarises the variable selected for **ALL** of AusTraits, you may want to first [filter](#filtering_functions) the data before summarising by "taxon_name" -- or even "trait_name".
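
The effect of filtering before summarising can be seen in a toy sketch. The helper `toy_summarise` below is hypothetical and only mimics the kind of counts described above (records, datasets and taxa per trait). Its output column names, apart from `n_taxa`, are chosen for illustration.

```{r, eval = TRUE}
# Toy traits table: invented values, NOT real AusTraits data
toy_traits <- data.frame(
  dataset_id = c("Toy_2020", "Toy_2020", "Toy_2021", "Toy_2021", "Toy_2021"),
  taxon_name = c("Acacia aneura", "Banksia serrata", "Acacia aneura",
                 "Acacia aneura", "Acacia aneura"),
  trait_name = c("leaf_area", "leaf_area", "leaf_area", "leaf_length", "leaf_length"),
  stringsAsFactors = FALSE
)

n_unique <- function(x) length(unique(x))

# Hypothetical helper: counts records, datasets and taxa for each trait
toy_summarise <- function(traits) {
  by_trait <- split(traits, traits$trait_name)
  out <- data.frame(
    trait_name = names(by_trait),
    n_records  = vapply(by_trait, nrow, integer(1)),
    n_datasets = vapply(by_trait, function(d) n_unique(d$dataset_id), integer(1)),
    n_taxa     = vapply(by_trait, function(d) n_unique(d$taxon_name), integer(1)),
    row.names  = NULL,
    stringsAsFactors = FALSE
  )
  out[order(-out$n_taxa), ]
}

# Summary across the whole toy table
toy_all <- toy_summarise(toy_traits)
toy_all

# Summary after first filtering to a single genus
toy_acacia <- toy_summarise(toy_traits[grepl("^Acacia ", toy_traits$taxon_name), ])
toy_acacia
```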

Alternatively, you can look up traits that contain a specific search term:

```{r, eval = TRUE}
austraits::lookup_trait(database = austraits, term = "leaf") %>% length()
austraits::lookup_trait(database = austraits, term = "leaf")[1:30]
austraits::lookup_trait(database = austraits, term = "_N_") 
 # elemental contents use their symbol and are *almost* always in the middle of a trait name
austraits::lookup_trait(database = austraits, term = "photo")
```

Also visit the [AusTraits Plant Dictionary](https://w3id.org/APD) to learn more about the traits included in AusTraits.
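
The choice of search term matters, as a toy sketch shows. The helper `toy_lookup` is hypothetical and assumes plain substring matching; the matching rules actually used by `lookup_trait` may differ. `N_toy_trait` is an invented name included to show why `"_N_"` and `"N_"` return different results.

```{r, eval = TRUE}
# A few trait names; "N_toy_trait" is invented for this example
toy_trait_names <- c("leaf_N_per_dry_mass", "leaf_area", "leaf_length",
                     "leaf_mass_per_area", "N_toy_trait")

# Hypothetical helper: keep the names that contain the search term
toy_lookup <- function(trait_names, term) {
  trait_names[grepl(term, trait_names, fixed = TRUE)]
}

toy_lookup(toy_trait_names, "leaf")
toy_lookup(toy_trait_names, "_N_")  # misses names that start with the symbol
toy_lookup(toy_trait_names, "N_")   # broader, also matches "N_toy_trait"
```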

You can also search the `locations` and `contexts` tables for `location_properties` and/or `context_properties` included as metadata for many trait measurements:

```{r}
austraits::lookup_location_property(database = austraits, term = "soil")
austraits::lookup_location_property(database = austraits, term = "temperature")

austraits::lookup_context_property(database = austraits, term = "season")
austraits::lookup_context_property(database = austraits, term = "fire")
```


As an example of filtering before summarising, the following returns the number of records, datasets, and taxa for nitrogen-related traits only:

```{r, eval = TRUE}
N_traits <- austraits %>% 
  austraits::extract_trait(trait_name = "_N_") %>%
  austraits::summarise_database(var = "trait_name")
```

## Wrangling AusTraits

### Filtering data {#filtering_functions}

There are four `austraits` functions that filter data: `extract_trait`, `extract_taxa`, `extract_dataset`, and `extract_data`.

Each of these functions simultaneously filters all database tables to only include trait measurements (and associated metadata) meeting the specified criteria, retaining the original database structure.

*Note, although the `extract_` functions were explicitly developed to return the original `traits.build` database structure, they will also work when the "database" is just a single table, such as if prior manipulations have separated the traits table from the rest of the database.*
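
What "filtering all tables simultaneously" means can be sketched with a toy database, represented as a list of data frames. The helper `toy_extract_trait` is hypothetical and far simpler than the package functions; it only shows the principle of filtering `traits` and then trimming the other tables to the rows still referenced.

```{r, eval = TRUE}
# Toy database: a list of tables with invented values, NOT real AusTraits data
toy_db <- list(
  traits = data.frame(
    dataset_id = c("Toy_2020", "Toy_2020", "Toy_2021"),
    taxon_name = c("Acacia aneura", "Banksia serrata", "Hibbertia scandens"),
    trait_name = c("leaf_area", "leaf_length", "leaf_area"),
    value      = c("120", "95", "410"),
    stringsAsFactors = FALSE
  ),
  taxa = data.frame(
    taxon_name = c("Acacia aneura", "Banksia serrata", "Hibbertia scandens"),
    genus      = c("Acacia", "Banksia", "Hibbertia"),
    stringsAsFactors = FALSE
  ),
  contributors = data.frame(
    dataset_id = c("Toy_2020", "Toy_2021"),
    last_name  = c("Smith", "Jones"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: filter traits, then keep only the matching rows elsewhere
toy_extract_trait <- function(db, traits_wanted) {
  db$traits <- db$traits[db$traits$trait_name %in% traits_wanted, , drop = FALSE]
  db$taxa <- db$taxa[db$taxa$taxon_name %in% db$traits$taxon_name, , drop = FALSE]
  db$contributors <- db$contributors[
    db$contributors$dataset_id %in% db$traits$dataset_id, , drop = FALSE
  ]
  db
}

toy_leaf_length <- toy_extract_trait(toy_db, "leaf_length")

names(toy_leaf_length)  # same tables as the input
toy_leaf_length$taxa    # only the taxon that still has trait data
```

Because the output has the same list structure as the input, the result can be passed straight into another filtering step.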

#### `extract_trait`, `extract_dataset` and `extract_taxa`

Three of the functions extract data based on pre-set columns:

1.    `extract_trait` filters by `trait_name`
2.    `extract_dataset` filters by `dataset_id`
3.    `extract_taxa` filters by `taxon_name`, `genus` or `family`

Search terms can either be exact or partial matches.

```{r, eval = TRUE}
leaf_mass_per_area_data <-
  austraits %>% 
     austraits::extract_trait(trait_names = c("leaf_mass_per_area"))

Westoby_2014_datasets <-
  austraits %>%
     austraits::extract_dataset("Westoby_2014")

all_Westoby_datasets <-
  austraits %>%
     austraits::extract_dataset("Westoby")

Eucalyptus_data <-
  austraits %>%
     austraits::extract_taxa(genus = "Eucalyptus")

Banksia_serrata_data <-
  austraits %>%
     austraits::extract_taxa(taxon_name = "Banksia serrata")
```

#### `extract_data`

`extract_data` filters the database on one or more values in any column of any of the seven data tables (traits, locations, contexts, methods, taxa, taxonomic_updates, contributors).

See the [database structure chapter](database_structure.html) for names and definitions of each column and, for those with controlled vocabulary, their allowed values.

Alternatively, to see the list of column names to use:

```{r, eval = TRUE}
names(austraits$traits)

names(austraits$methods)
```

And to see the list of possible values for a column:

```{r, eval = TRUE}
unique(austraits$traits$life_stage)

unique(austraits$traits$basis_of_record)

unique(austraits$locations$location_property)[1:20]

unique(austraits$contexts$context_property)[1:20]
```

The function then allows you to filter down to the components of each table that are relevant to the search criteria specified:

```{r, eval = TRUE}
field_data <- austraits %>% austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field")

field_data$traits %>% head()

data_with_soils_data <- austraits %>% austraits::extract_data(table = "locations", col = "location_property", col_value = "soil")

data_with_soils_data$traits %>% head()
data_with_soils_data$locations %>% head() # all location properties are retained for measurements for which at least one location property pertains to soil

data_contributed_by_Wright <- austraits %>% austraits::extract_data(table = "contributors", col = "last_name", col_value = "Wright")

data_contributed_by_Wright$traits %>% head()
data_contributed_by_Wright$contributors %>% head() # all contributors are retained for datasets where at least one of the contributors on the dataset has the last name "Wright"
```

Multiple `extract_` calls can be chained together to rapidly restrict the data to the desired subset:

```{r, eval = TRUE}
subset <- austraits %>%
  austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field") %>%
  austraits::extract_trait(trait_name = c("leaf_mass_per_area", "leaf_thickness", "leaf_length", "leaf_area")) %>%
  austraits::extract_taxa(genus = "Eucalyptus")

subset$traits[1:20]
```

### Joining relational tables {#join_functions}

For many research purposes you will want to join metadata from one of the relational tables to the core traits table. Eight `{austraits}` functions facilitate this by adding the columns you select from the ancillary data tables to the database's `traits` table: seven functions that each merge information from a single table (`join_...`) and one function that joins columns from all six ancillary data tables (`flatten_database`). The `join_` functions return the database with its original structure, allowing you to follow up joining with extracting and to continue joining additional columns.

#### Joining location metadata

The locations table includes information on all location properties measured, including the actual location (latitude/longitude), climatic data, soil properties, fire history, vegetation history, geologic history, etc.

The austraits function `join_location_coordinates` just adds location name, latitude, and longitude to the core traits table:

```{r, eval = TRUE}
traits_with_lat_long <- austraits %>% 
  austraits::extract_dataset(dataset_id = "Westoby") %>%
  austraits::join_location_coordinates()

traits_with_lat_long$traits %>% names()
```

The function `join_location_properties` joins other location properties to the `traits` table. It has two arguments:

1.    `vars`, specifies the location properties, via complete or partial string matches, that should be added to the traits table; defaults to "all"  

2.    `format` offers three output formats:  

    *   `many_columns` (each location property is added as a separate column)  
    *   `single_column_pretty` (all location properties compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    *   `single_column_json` (all location properties compacted into a single column, using json formatting)  
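
The difference between the three formats is easiest to see on a tiny example. The sketch below reproduces the three shapes in base R on invented data. The delimiters `==` and `;; ` and the exact JSON layout are assumptions chosen for illustration; the strings produced by `join_location_properties` may be laid out differently.

```{r, eval = TRUE}
# Toy location properties in long format: invented values
toy_props <- data.frame(
  location_id       = c("01", "01", "02"),
  location_property = c("soil type", "fire history", "soil type"),
  value             = c("sand", "burnt 2010", "clay"),
  stringsAsFactors = FALSE
)

# many_columns: one column per property, NA where a location lacks the property
toy_many <- reshape(toy_props, idvar = "location_id",
                    timevar = "location_property", direction = "wide")
toy_many

# Hypothetical helper: collapse all properties of a location into one string
collapse_by_location <- function(props, make_string) {
  vapply(split(props, props$location_id), make_string, character(1))
}

# single_column_pretty: human-readable, with assumed delimiters
toy_pretty <- collapse_by_location(toy_props, function(d) {
  paste0(d$location_property, "==", d$value, collapse = ";; ")
})
toy_pretty

# single_column_json: machine-readable, assembled by hand for this sketch
toy_json <- collapse_by_location(toy_props, function(d) {
  paste0("{", paste0('"', d$location_property, '":"', d$value, '"', collapse = ","), "}")
})
toy_json
```

The wide format is convenient for filtering on a single property, while the compact formats keep the traits table narrow when many properties are joined at once.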

Examples of joining location properties to the traits table:

```{r, eval = TRUE}
# method to add location properties that you know exist from previous database exploration; this example showcases `format = "many_columns"`
locations1 <- austraits %>%
  austraits::join_location_properties(vars = c("description", "aridity index (MAP/PET)", 
                                         "soil type", "fire history"), format = "many_columns")

locations1$traits %>% names()

locations1$traits[1:10]

# method where you first lookup location properties using the function `lookup_location_property`; this example showcases `format = "single_column_pretty"`
precipitation_properties <- lookup_location_property(database = austraits, term = "precipitation")

locations2 <- austraits %>%
  austraits::join_location_properties(vars = precipitation_properties, format = "single_column_pretty")

locations2$traits %>% names()

locations2$traits[1:10]

# method where you add all location properties; this example showcases `format = "single_column_json"`
locations3 <- austraits %>%
  austraits::join_location_properties(vars = "all", format = "single_column_json")

locations3$traits %>% names()

locations3$traits[1:10]
```

#### Joining contexts metadata

The context table documents additional context properties/ancillary data which may be useful for interpreting trait values. Context properties are divided into 5 categories: `treatment context`, `plot context`, `entity context`, `temporal context`, and `method context`.

| context category  | description                                                                                                                                                                                                     |
|-----------------|-------------------------------------------------------|
| treatment context | Context property that is an experimental manipulation, that might affect the trait values measured on an individual, population or species-level entity.                                                        |
| plot context      | Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity.                                           |
| entity context    | Context property that is information about an organismal entity (individual, population or taxon) that does not comprise a trait-centered observation but might affect the trait values measured on the entity. |
| temporal context  | Context property that is a feature of a "point in time" that might affect the trait values measured on an individual, population or species-level entity.                                                       |
| method context    | Context property that records specific information about a measurement method that is modified between measurements.                                                                                            |
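
Context properties link to the traits table less directly than locations do: as described in the ancillary tables overview, each context row names the identifier column it links to (`link_id`) and the identifier value to match (`link_vals`). The sketch below illustrates that lookup on invented data. The helper `toy_join_context` is hypothetical and assumes one `link_vals` value per context row.

```{r, eval = TRUE}
# Toy tables: invented values, NOT real AusTraits data
toy_traits <- data.frame(
  dataset_id           = "Toy_2020",
  taxon_name           = "Acacia aneura",
  trait_name           = "leaf_area",
  value                = c("120", "135", "128"),
  temporal_context_id  = c("01", "02", "02"),
  treatment_context_id = c(NA, NA, "01"),
  stringsAsFactors = FALSE
)
toy_contexts <- data.frame(
  dataset_id       = "Toy_2020",
  context_property = c("sampling season", "sampling season", "drought treatment"),
  value            = c("wet", "dry", "droughted"),
  link_id          = c("temporal_context_id", "temporal_context_id", "treatment_context_id"),
  link_vals        = c("01", "02", "01"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add one context property as a new traits column
toy_join_context <- function(traits, contexts, property) {
  ctx <- contexts[contexts$context_property == property, ]
  id_col <- unique(ctx$link_id)     # which traits column to match on
  stopifnot(length(id_col) == 1)    # assumes one link_id per property
  traits_key <- paste(traits$dataset_id, traits[[id_col]])
  context_key <- paste(ctx$dataset_id, ctx$link_vals)
  traits[[property]] <- ctx$value[match(traits_key, context_key)]
  traits
}

toy_out <- toy_join_context(toy_traits, toy_contexts, "sampling season")
toy_out <- toy_join_context(toy_out, toy_contexts, "drought treatment")
toy_out[c("value", "sampling season", "drought treatment")]
```

Measurements without a matching identifier simply receive `NA` for that property, so untreated plants remain in the table.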

The austraits function `join_context_properties` joins context properties to the `traits` table. It has three arguments:

1.    `vars`, specifies the context properties, via complete or partial string matches, that should be added to the traits table; defaults to "all"  

2.    `format` offers three output formats:  

    *   `many_columns` (each context property is added as a separate column)  
    *   `single_column_pretty` (all context properties compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    *   `single_column_json` (all context properties compacted into a single column, using json formatting)
    
3.    `include_description` is a logical argument (TRUE/FALSE) that indicates whether the context property value descriptions should be included or excluded when context data are joined; defaults to `TRUE`  

Each category of context property is added to a separate column for the compacted columns, retaining this important information about the different groupings of context properties. When `format = "many_columns"` is selected the context category is indicated in the column name.

Examples of joining context properties to the traits table:

```{r, eval = TRUE}
# method to add context properties that you know exist from previous database exploration; this example showcases `format = "many_columns"`
contexts1 <- austraits %>%
  austraits::join_context_properties(
    vars = c("sampling season", "plant sex", "leaf surface", "leaf age", "fire intensity",
             "slope position", "fire season", "drought treatment", "temperature treatment"), 
    format = "many_columns",
    include_description = TRUE
    )

contexts1$traits %>% names()

contexts1$traits[1:10]

# method where you first lookup context properties using the function `lookup_context_property`; this example showcases `format = "single_column_pretty"`
leaf_properties <- lookup_context_property(database = austraits, term = "leaf")

contexts2 <- austraits %>%
  austraits::join_context_properties(
    vars = leaf_properties,
    format = "single_column_pretty",
    include_description = TRUE
    )

contexts2$traits %>% names()

contexts2$traits[1:10]

# method where you add all context properties; this example showcases `format = "single_column_json"`
contexts3 <- austraits %>%
  austraits::join_context_properties(
    vars = "all",
    format = "single_column_json",
    include_description = FALSE
    )

contexts3$traits %>% names()

contexts3$traits[1:10]
```

#### Joining methods columns

The methods table documents a selection of metadata recorded about the entire dataset and the methods used for individual trait measurements. There is a single row of data per `dataset_id` x `trait_name` x `method_id` combination. The `method_id` column distinguishes between instances where a single trait is measured twice using two separate protocols. It is separate from `method_context_id`, which documents specific components of a method that are modified between measurements.
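
The role of `method_id` in the key can be shown with a toy example in which one dataset measures the same trait with two protocols. All values, including the method descriptions, are invented. The sketch uses base R `merge` rather than `join_methods`.

```{r, eval = TRUE}
# Toy tables: invented values, NOT real AusTraits data
toy_traits <- data.frame(
  dataset_id = "Toy_2020",
  trait_name = "leaf_area",
  value      = c("120", "135", "128"),
  method_id  = c("01", "01", "02"),
  stringsAsFactors = FALSE
)
toy_methods <- data.frame(
  dataset_id = "Toy_2020",
  trait_name = "leaf_area",
  method_id  = c("01", "02"),
  methods    = c("flatbed scanner", "length x width approximation"),
  stringsAsFactors = FALSE
)

full_key <- c("dataset_id", "trait_name", "method_id")

# The three-column key identifies exactly one methods row
anyDuplicated(toy_methods[full_key]) == 0

# Joining on the full key keeps one row per measurement
toy_joined <- merge(toy_traits, toy_methods, by = full_key, all.x = TRUE)
nrow(toy_joined)

# Dropping method_id from the key pairs every measurement with both protocols
toy_wrong <- merge(toy_traits[c("dataset_id", "trait_name", "value")],
                   toy_methods, by = c("dataset_id", "trait_name"))
nrow(toy_wrong)
```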

The austraits function `join_methods` joins columns from the methods table to the `traits` table. It has one argument:

1.    `vars` which specifies which columns from the `methods` table are joined to the `traits` table; defaults to `vars = c("methods")`

First, check the schema file embedded within AusTraits to see what information is documented in each column:  

```{r, eval = TRUE}
austraits$schema$austraits$elements$methods$elements %>% 
  austraits::convert_list_to_df1()
```

Examples using `join_methods`:  

```{r, eval = TRUE}
# join methods column only, the default
traits_with_methods <- 
  austraits %>% austraits::join_methods()

traits_with_methods$traits %>% names()

# join all methods table columns
traits_with_methods <- 
  austraits %>% austraits::join_methods(vars = "all")

traits_with_methods$traits %>% names()

# join specifically selected methods table columns
traits_with_methods <- 
  austraits %>% austraits::join_methods(vars = c("methods", "description", "source_secondary_key"))

traits_with_methods$traits %>% names()
```

#### Joining taxa

The `taxa` table documents a collection of names and identifiers for each taxon. Within AusTraits, names submitted as identifiers within a dataset might be resolved to a species, an infraspecific taxon, or sometimes just to a genus- or family-level name; the name's resolution is recorded as the `taxon_rank`. The `taxon_rank` determines which columns are filled in within the `taxa` table.

The `{austraits}` function `join_taxa` joins columns from the `taxa` table to the `traits` table. It has one argument:

1.    `vars` which specifies which columns from the `taxa` table are joined to the `traits` table; defaults to `vars = c("family", "genus", "taxon_rank", "establishment_means")`.

First, check the schema file embedded within AusTraits to see what information is documented in each column:  

```{r, eval = TRUE}
austraits$schema$austraits$elements$taxa$elements %>% 
  austraits::convert_list_to_df1()
```

Examples using `join_taxa`:  

```{r, eval = TRUE}
# join the default columns
traits_with_taxa <- 
  austraits %>% austraits::join_taxa()

traits_with_taxa$traits %>% names()

# join all taxa table columns
traits_with_taxa <- 
  austraits %>% austraits::join_taxa(vars = "all")

traits_with_taxa$traits %>% names()
```

#### Joining taxonomic updates

The taxonomic updates table documents all taxonomic changes implemented in the construction of AusTraits, including both the correction of typos and the updating of outdated synonyms to the currently accepted name.

The `{austraits}` function `join_taxonomic_updates` joins columns from the `taxonomic_updates` table to the `traits` table. It has one argument:

1.    `vars` which specifies which columns from the `taxonomic_updates` table are joined to the `traits` table; defaults to `vars = c("aligned_name")`.

First, check the schema file embedded within AusTraits to see what information is documented in each column:  

```{r, eval = TRUE}
austraits$schema$austraits$elements$taxonomic_updates$elements %>% 
  austraits::convert_list_to_df1()
```

Examples using `join_taxonomic_updates`:  

```{r, eval = TRUE}
# join the default columns
traits_with_taxonomic_updates <- 
  austraits %>% austraits::join_taxonomic_updates()

traits_with_taxonomic_updates$traits %>% names()

# join all taxonomic_updates table columns
traits_with_taxonomic_updates <- 
  austraits %>% austraits::join_taxonomic_updates(vars = "all")

traits_with_taxonomic_updates$traits %>% names()
```

#### Joining contributors

The contributors table documents basic metadata about each dataset contributor, including their name, ORCID, and role in each dataset.

The `{austraits}` function `join_contributors` joins columns from the `contributors` table to the `traits` table. It has two arguments:

1.    `vars` which specifies which columns from the `contributors` table are joined to the `traits` table; defaults to `vars = "all"`.

2.    `format` offers two output formats:  

    *   `single_column_pretty` (data in selected columns from the `contributors` table compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    *   `single_column_json` (data in selected columns from the `contributors` table compacted into a single column, using json formatting)  

First, check the schema file embedded within AusTraits to see what information is documented in each column:  

```{r, eval = TRUE}
austraits$schema$austraits$elements$contributors$elements %>% 
  austraits::convert_list_to_df1()
```

Examples using `join_contributors`:  

```{r, eval = TRUE}
# join all columns (the default)
traits_with_contributors <- 
  austraits %>% austraits::join_contributors(format = "single_column_json")

traits_with_contributors$traits %>% names()

# join select contributors columns
traits_with_contributors <- 
  austraits %>% austraits::join_contributors(
                  vars = c("last_name", "first_name", "ORCID"),
                  format = "single_column_pretty")

traits_with_contributors$traits %>% names()

traits_with_contributors$traits
```

#### Joining all data

If you want to join data from all ancillary tables onto the traits table, effectively "flattening" the relational database into a single table, it is simplest to use the `{austraits}` function `flatten_database`.

`flatten_database` calls each of the join functions, selecting `vars = "all"` as the default for each function.

It has three arguments:

1.    `vars` is a named list that specifies, for each ancillary table, the columns or properties (via complete or partial string matches) that should be added to the traits table; defaults to "all" for each table  

2.    `format` offers three output formats that apply to the functions `join_location_properties`, `join_context_properties` and `join_contributors`

    *   `many_columns` (each location or context property is added as a separate column)
    *   `single_column_pretty` (all location or context properties or all contributor columns compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    *   `single_column_json` (all location or context properties or all contributor columns compacted into a single column, using json formatting)
    
3.    `include_description` is a logical argument (TRUE/FALSE) that indicates whether the context property value descriptions should be included or excluded when context data are joined; defaults to `TRUE`; this argument is only used to parameterise `join_context_properties`

Examples using `flatten_database`:  

```{r, eval = TRUE}
# using the defaults
flat_database <- austraits %>% flatten_database()

names(flat_database)

# specifying vars for each table
flat_database <- austraits %>% flatten_database(
  vars = list(
    location = "all",
    context = "sampling season",
    contributors = c("last_name", "first_name", "ORCID"),
    taxonomy = c("family", "establishment_means"),
    taxonomic_updates = "aligned_name",
    methods = "methods"
  )
)

names(flat_database)
```

### Combining `extract_` and `join_` functions

As both the `extract` and `join` functions output a database with the original database structure they can be used sequentially to extract, then join exactly the data desired.

For instance:
```{r, eval = TRUE}
subset2 <- austraits %>%
  austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field") %>%
  austraits::extract_trait(trait_name = c("leaf_mass_per_area", "leaf_thickness", "leaf_length", "leaf_area")) %>%
  austraits::extract_taxa(genus = "Eucalyptus") %>%
  austraits::join_location_coordinates() %>%
  austraits::join_taxa(vars = c("family")) %>%
  austraits::join_context_properties(vars = "all", format = "many_columns")

names(subset2)
```

Having joined all context properties as separate columns, you can now inspect the expanded `traits` table and, for example, decide that you only want data sampled during the wet season, as documented in the column `temporal_context: sampling season`:

```{r, eval = TRUE}
unique(subset2$traits$`temporal_context: sampling season`)

subset2 <- subset2 %>%
  austraits::extract_data(table = "traits", col = "temporal_context: sampling season", col_value = "wet")
```

### Binding datasets

For some applications, you may wish to extract two different subsets of data, based on the values of different columns, then merge those extracted database subsets together, but still retain the original database structure.  

This is possible with the function `bind_databases`.

This function binds each of the relational tables, removing any duplicate entries.

For instance, you might want all measurements where *either* the `location_property` or the `context_property` references the word "fire":

```{r, eval = TRUE}
subset_a <- austraits %>%
  austraits::extract_data(table = "locations", col = "location_property", col_value = "fire")

subset_b <- austraits %>%
  austraits::extract_data(table = "contexts", col = "context_property", col_value = "fire")

subset_ab <- bind_databases(subset_a, subset_b)
```
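
Conceptually, binding table-by-table and dropping duplicates can be pictured with the toy sketch below. The helper `toy_bind` and the two toy databases are hypothetical and written only for illustration; this is not the implementation of `bind_databases`, which handles the full relational structure and its metadata.

```{r, eval = TRUE, message = FALSE}
library(dplyr)

# Two toy "databases": named lists of tables with overlapping rows
toy_a <- list(
  traits = dplyr::tibble(observation_id = c("01", "02"), value = c("1.2", "3.4")),
  locations = dplyr::tibble(location_id = "01", location_name = "Site A")
)
toy_b <- list(
  traits = dplyr::tibble(observation_id = c("02", "03"), value = c("3.4", "5.6")),
  locations = dplyr::tibble(location_id = c("01", "02"),
                            location_name = c("Site A", "Site B"))
)

# Hypothetical helper: bind tables of the same name row-wise,
# then drop exact duplicate rows
toy_bind <- function(db1, db2) {
  table_names <- intersect(names(db1), names(db2))
  bound <- lapply(table_names, function(tbl) {
    dplyr::bind_rows(db1[[tbl]], db2[[tbl]]) %>% dplyr::distinct()
  })
  stats::setNames(bound, table_names)
}

toy_ab <- toy_bind(toy_a, toy_b)
toy_ab$traits
toy_ab$locations
```

The row shared by both subsets (observation `02`, location `01`) appears only once in the bound result.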

## Summarising data: trait means, modes, etc.

The function `summarise_trait_means` that was in older `{austraits}` versions has been deprecated, as it is not appropriate for AusTraits versions > 5.0 -- that is, all databases built using `{traits.build}`. A new version is in development and will be released in 2025. 
In the meantime, if you've sourced the file `extra_functions.R`, there are a few functions that allow you to summarise trait values.

### Categorical traits

For instance, `categorical_summary` indicates how many times a specific trait value is reported for a given taxon (across all datasets):

```{r, eval = TRUE, message = FALSE}
cat_summary <- categorical_summary(austraits, "resprouting_capacity")

cat_summary
```

Alternatively, create a wider matrix with possible trait values as columns:

```{r, eval = TRUE, message = FALSE}
categorical_summary_wider <- 
  categorical_summary_by_value(austraits, "resprouting_capacity") %>%
    tidyr::pivot_wider(names_from = value_tmp, values_from = replicates)

categorical_summary_wider
```
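
If you have not sourced `extra_functions.R`, the same kind of tally can be approximated directly with `{dplyr}` and `{tidyr}`. The sketch below uses a toy traits table with made-up taxa and values; it is an illustration of the counting and pivoting steps, not the code behind `categorical_summary_by_value`.

```{r, eval = TRUE, message = FALSE}
library(dplyr)
library(tidyr)

# Toy traits table (hypothetical taxa and records)
toy_cat_traits <- dplyr::tibble(
  taxon_name = c("Species a", "Species a", "Species a", "Species b", "Species b"),
  trait_name = "resprouting_capacity",
  value = c("resprouts", "resprouts", "fire_killed", "fire_killed", "fire_killed")
)

# Count how often each value is reported for each taxon
toy_counts <- toy_cat_traits %>%
  dplyr::count(taxon_name, trait_name, value, name = "replicates")

# One column per trait value; taxa lacking a value get 0
toy_counts_wider <- toy_counts %>%
  tidyr::pivot_wider(names_from = value, values_from = replicates, values_fill = 0)

toy_counts_wider
```

With real data, `toy_cat_traits` would be replaced by a filtered `austraits$traits` table.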

### Numeric traits

One of the problems with writing functions that summarise numeric traits is that they make statistical assumptions that are hidden within the function code and might not be appropriate for your data use case.

The datasets that comprise AusTraits were collected by different people, with a different number of replicates and different entity types reported. One dataset might include 20 measurements on individuals for a trait and another might have submitted a single population-level mean derived from 5 measurements.

How do you take the mean of these trait values?

Do you want to include both data from experiments and plants growing under natural conditions? This information is recorded in the `basis_of_record` column.

One function we're developing calculates weighted group means for field and experiment-sourced data, by first grouping values at the site level, then at the taxon level. For trait data sourced from floras where trait values are documented as a minimum and maximum value, the function takes the mean of these. The two subsets of data are then merged together.

```{r, eval = TRUE, message = FALSE}
weighted <- austraits_weighted_means(austraits, c("leaf_mass_per_area", 
                                                  "leaf_length"))

weighted
```

This function may be sufficient for exploratory purposes. Alternatively, you can download the file with the function and edit the code to suit your purposes.
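
To see why the grouping order matters, consider the toy example below, where one site contributes four measurements and another contributes one. It sketches only the "site first, then taxon" idea on made-up numbers and is not the code of `austraits_weighted_means`; the column names are assumptions.

```{r, eval = TRUE, message = FALSE}
library(dplyr)

# Toy numeric measurements for one taxon at two sites (hypothetical values)
toy_lma <- dplyr::tibble(
  taxon_name = "Species a",
  location_id = c("01", "01", "01", "01", "02"),
  value = c(100, 110, 90, 100, 200)
)

# Naive mean: every measurement has equal weight, so site 01 dominates
toy_naive <- toy_lma %>%
  dplyr::group_by(taxon_name) %>%
  dplyr::summarise(mean_value = mean(value), .groups = "drop")

# Two-step mean: average within each site, then across sites
toy_two_step <- toy_lma %>%
  dplyr::group_by(taxon_name, location_id) %>%
  dplyr::summarise(site_mean = mean(value), .groups = "drop") %>%
  dplyr::group_by(taxon_name) %>%
  dplyr::summarise(mean_value = mean(site_mean), n_sites = dplyr::n(),
                   .groups = "drop")

toy_naive$mean_value     # 120
toy_two_step$mean_value  # 150
```

Neither answer is "correct" in general; which one is appropriate depends on whether you want each measurement or each site to carry equal weight.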

## Plotting data {#plotting_functions}

### Plotting trait distributions

Another way to summarise AusTraits data by trait, and determine whether AusTraits offers sufficient data coverage for a trait of choice, is to plot the distribution of trait values in AusTraits.

As seen in [`A first look at data`](#exploring), the function `austraits::plot_trait_distribution_beeswarm()` plots trait data by `dataset_id`, `genus`, `family` or indeed any column in the traits table, such as `life_stage` or `basis_of_record`:

```{r, eval = TRUE}
# How does leaf N vary by dataset?
austraits::plot_trait_distribution_beeswarm(austraits, "leaf_N_per_dry_mass", 
                                               y_axis_category = "dataset_id")

# How does leaf N vary across Banksia species?
Banksia_data <- austraits %>% extract_taxa(genus = "Banksia")

austraits::plot_trait_distribution_beeswarm(Banksia_data, "leaf_N_per_dry_mass",
                                               y_axis_category = "taxon_name")

# Does leaf mass per area shift in Eucalyptus seedlings versus adults, which is captured in `life_stage`? What about amongst Eucalypts where information about the age of the leaves was recorded, captured as the context property "leaf age"?

Euc_data <- austraits %>% extract_taxa(genus = "Eucalyptus") %>%
  austraits::join_context_properties(vars = "leaf age", format = "many_columns", include_description = FALSE)

austraits::plot_trait_distribution_beeswarm(Euc_data, "leaf_mass_per_area", 
                                               y_axis_category = "life_stage")

austraits::plot_trait_distribution_beeswarm(Euc_data, "leaf_mass_per_area", 
                                               y_axis_category = "method_context: leaf age")
```

### Plotting data distribution by location

To plot locations, begin by merging in the latitude and longitude data from `austraits$locations` using `austraits::join_location_coordinates`.

The `plot_locations` function plots the selected data, separating data into a series of plots based on the variable name selected. You can separate data based on the values of **any** column within the traits table -- including `basis_of_record`, `life_stage` and `value_type` -- or higher taxon categories (`genus`, `family`).

For instance, `austraits::plot_locations("trait_name")` will output a separate plot for each `trait_name` within the selected data.

A warning: `austraits::plot_locations()` WILL BE VERY SLOW if you request more than \~20 plots. For instance, do not attempt to generate plots for all traits simultaneously. Always first use extract/filter to just select a narrow range of traits, datasets, or taxa.
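
One way to guard against this is to count how many panels a given column would produce before plotting. The helper `count_panels` below is hypothetical (it is not part of `{austraits}`) and is shown on a toy traits table; the threshold of 20 simply mirrors the guideline above.

```{r, eval = TRUE, message = FALSE}
library(dplyr)

# Toy traits table (hypothetical records)
toy_panel_data <- dplyr::tibble(
  trait_name = c("leaf_area", "leaf_area", "leaf_length", "plant_height"),
  dataset_id = c("Dataset_1", "Dataset_2", "Dataset_1", "Dataset_3")
)

# Hypothetical helper: number of distinct values, i.e. number of plots requested
count_panels <- function(traits_table, feature) {
  dplyr::n_distinct(traits_table[[feature]])
}

n_panels <- count_panels(toy_panel_data, "trait_name")

if (n_panels > 20) {
  message("This would produce ", n_panels, " plots; consider narrowing the selection first.")
}

n_panels
```

With a real database, the same check could be applied to the traits table, e.g. `count_panels(austraits$traits, "trait_name")`.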


#### Plot locations by trait, dataset, or other column

See where Eucalyptus data have been collected, divided by `life_stage`  

```{r, eval = TRUE, message = FALSE, warning = FALSE}
Euc_data <- Euc_data %>%
  austraits::join_location_coordinates()

austraits::plot_locations(database = Euc_data, feature = "life_stage")
```

See where Banksia leaf area data have been collected, divided by `taxon_name`  

```{r, eval = TRUE, message = FALSE, warning = FALSE}
Banksia_data <- Banksia_data %>%
  austraits::join_location_coordinates() %>%
  austraits::extract_trait(trait_name = "leaf_area")

austraits::plot_locations(database = Banksia_data, feature = "taxon_name")
```

Where were the various Westoby datasets collected?

```{r, eval = TRUE, message = FALSE, warning = FALSE}
Westoby <-
  austraits %>%
     austraits::extract_dataset(dataset_id = "Westoby") %>%
     austraits::join_location_coordinates()

austraits::plot_locations(database = Westoby, feature = "dataset_id")

# Note that while the `database` argument is intended to be a relational database, this function also works with just the traits table, should you have separated it out of the relational structure.

# Westoby_traits <- Westoby$traits
# austraits::plot_locations(database = Westoby_traits, feature = "dataset_id")
```

Where were data for Acacia aneura collected?
```{r, eval = TRUE, message = FALSE, warning = FALSE}
data <-
  austraits %>%
  austraits::extract_taxa(taxon_name = "Acacia aneura") %>%
  austraits::join_location_coordinates()

data$traits <- data$traits %>% 
  dplyr::filter(!is.na(`latitude (deg)`)) 

austraits::plot_locations(data, "taxon_name")     # actually 4 taxa, because of subspecies

austraits::plot_locations(data, "dataset_id")     # 1 plot for each dataset_id
```

### More complex workflows -- some examples

#### An example looking at trait-climate gradients

A simple workflow allows one to look at [trait values across a climate gradient](traits_and_climate_example.html)

#### An example incorporating ALA distribution data

A recent tutorial posted by ALA shows how one can combine AusTraits trait data and ALA spatial occurrence data:

https://labs.ala.org.au/posts/2023-08-28_alternatives-to-box-plots/post.html

We've adapted it [here](spatial_data_example.html).

## A complexity: pivoting datasets {#pivotting_datasets}

The AusTraits tables are all in `long` format with an individual row for each trait measurement. This is the most compact way to store data and offers the flexibility of documenting diverse metadata for each trait measurement.

However, for many research uses, it may be more useful to view data in a `wide` format, with the multiple traits that comprise a single observation displayed as consecutive columns.

The `{austraits}` function `trait_pivot_wider` allows AusTraits datasets to be pivoted from `long` to `wide` format.

It is recommended to only use this function on individual datasets -- or perhaps a small selection of datasets -- as each dataset includes a different collection of traits and pivoting wider otherwise creates a very "holey" dataset.

```{r, eval = TRUE}
Farrell_2017_values <-
  austraits %>%
    austraits::extract_dataset(dataset_id = "Farrell_2017")

Farrell_2017_pivoted <- 
  Farrell_2017_values$traits %>%
    austraits::trait_pivot_wider()

Farrell_2017_pivoted
```

This example pivots "nicely" as all observations have `entity_type = individual`.

Compare this first example to the dataset `Edwards_2000` which includes individual-, population-, and species-level observations:

```{r, eval = TRUE}
Edwards_2000_values <-
  austraits %>%
    austraits::extract_dataset(dataset_id = "Edwards_2000")

Edwards_2000_pivoted <- 
  Edwards_2000_values$traits %>%
    austraits::trait_pivot_wider()

Edwards_2000_pivoted
```

The values at the individual, population and species level do not collapse together, because traits measured on different entity types have separate `observation_id` values.

One of the core identifiers assigned to data points is the `observation_id`. An observation is a collection of measurements made on a specific entity at a single point in time.

Each `observation_id` therefore corresponds to a unique combination of:

-   dataset_id
-   source_id
-   entity_type
-   taxon_name
-   population_id (location_id, plot_context_id, treatment_context_id)
-   individual_id
-   basis_of_record
-   entity_context_id
-   life_stage
-   temporal_context_id
-   collection_date
-   original_name

If a single dataset includes traits that are attributed to different entity types, they are assigned separate `observation_id` values. For instance, many datasets comprise individual-level physiological trait data and a column `growth_form`, documenting the growth form (e.g. tree, shrub, herb) of each *species*.
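
The effect on pivoting can be reproduced with a toy long table. The dataset, identifiers and values below are invented for illustration, and `tidyr::pivot_wider` stands in for `trait_pivot_wider`, which may handle additional columns differently.

```{r, eval = TRUE, message = FALSE}
library(dplyr)
library(tidyr)

# Toy long table: two individuals measured for two traits, plus one
# species-level growth form record with its own observation_id
toy_long <- dplyr::tibble(
  dataset_id = "Toy_2000",
  taxon_name = "Species a",
  entity_type = c("individual", "individual", "individual", "individual", "species"),
  observation_id = c("01", "01", "02", "02", "03"),
  trait_name = c("leaf_area", "leaf_length", "leaf_area", "leaf_length", "growth_form"),
  value = c("250", "40", "310", "45", "tree")
)

# Traits sharing an observation_id collapse onto one row; the species-level
# record stays on its own row, leaving NA in the other cells
toy_wide <- toy_long %>%
  tidyr::pivot_wider(names_from = trait_name, values_from = value)

toy_wide
```

The result has three rows rather than two: the species-level `growth_form` cannot be attached to either individual because its identifiers differ.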

We're developing a function, `merge_entity_types`, that collapses the pivoted data into a more condensed table, but this loses some of the metadata. This function is currently in the R file `extra_functions.R`.

```{r, eval = TRUE, message = FALSE, warning = FALSE}
Edwards_2000_pivoted_merged <-
  merge_entity_types("Edwards_2000")
```

-   This function will duplicate any "higher-entity" trait values (e.g. a single species-level value is filled in for all individuals or populations)

-   Metadata fields, like `entity_type` or `value_type`, are only retained if their values are identical for all measurements

```{r, eval = TRUE, message = FALSE, warning = FALSE}
Westoby_2014_pivoted_merged <-
  merge_entity_types("Westoby_2014")
```
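
The duplication of "higher-entity" values can be pictured as a join from the species-level table onto the individual-level table. The sketch below uses invented toy tables and a plain `dplyr::left_join`; it illustrates the idea only and is not the implementation of `merge_entity_types`.

```{r, eval = TRUE, message = FALSE}
library(dplyr)

# Toy individual-level observations (already pivoted wide; hypothetical values)
toy_individuals <- dplyr::tibble(
  taxon_name = c("Species a", "Species a", "Species b"),
  observation_id = c("01", "02", "03"),
  leaf_area = c("250", "310", "120")
)

# Toy species-level observations
toy_species <- dplyr::tibble(
  taxon_name = c("Species a", "Species b"),
  growth_form = c("tree", "shrub")
)

# Each species-level value is repeated for every individual of that taxon
toy_merged <- toy_individuals %>%
  dplyr::left_join(toy_species, by = "taxon_name")

toy_merged
```

Note that after such a merge the table no longer records that `growth_form` was a species-level value, which is the kind of metadata loss described above.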

## Interpreting trait names, taxon names

### Trait dictionary

The `{traits.build}` pipeline requires a trait dictionary that documents 4 pieces of information about each trait:

-   trait name 
-   trait type (categorical vs numeric) 
-   allowable trait values (for categorical traits) 
-   allowable trait range and units (for numeric traits) 

The trait dictionary embedded within AusTraits also has:

-   trait labels 
-   trait definitions 
-   definitions for all categorical trait values 

Together these clarify each "trait concept", which we define as: "a circumscribed set of trait measurements". Much like a taxon concept delimits a collection of organisms, a trait concept delimits a collection of trait values pertaining to a distinct characteristic of a specific part of an organism (cell, tissue, organ, or whole organism).

The [AusTraits Plant Dictionary (APD)](http://w3id.org/APD) offers detailed descriptions for all trait concepts included in AusTraits. With the APD, each trait is given a unique, resolvable identifier, allowing trait definitions to be reused and shared.

The trait dictionary also includes: 

-   keywords 
-   plant structure measured 
-   characteristic measured 
-   references 
-   links to the same (or similar) trait concepts in other databases and dictionaries 

### Understanding taxon names

AusTraits uses the taxon names in the Australian Plant Census (APC) and the scientific names in the Australian Plant Names Index (APNI).

The R package [`{APCalign}`](https://github.com/traitecoevo/APCalign) is used to align and update taxon names submitted to AusTraits with those in the APC/APNI.

`{APCalign}` can be installed directly from CRAN:

```{r, eval = TRUE}
#install.packages("APCalign")

library(APCalign) # Australian plant taxon alignment function, available on CRAN 
```

There are two key components to the workflow: 

1.  aligning names 

-   syntax is standardised, including for phrase names 

-   most spelling mistakes are corrected 

-   names that indicate the plant can only be identified to genus are reformatted to `genus sp. [available notes; dataset_id]` 

    1.  they are linked to an APC-accepted genus but not to an APC-accepted binomial. 

    2.  they include the dataset_id so people don't mistakenly group all `Eucalyptus sp.` as a single "species" 

2.  updating names 

-   all aligned names that are in the APC, but that have a `taxonomic status` other than `accepted`, are updated to their currently accepted name. 

Examples: 

Identical `genus sp.` inputs from disparate datasets are given unique "names": 

```{r, eval = TRUE}
austraits$traits %>%
  dplyr::filter(stringr::str_detect(original_name, "Eucalyptus sp\\.$")) %>%
  dplyr::distinct(dataset_id, taxon_name, original_name) %>%
  dplyr::filter(original_name != taxon_name) 
```
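
The reason for appending the `dataset_id` can be mimicked on toy data. The sketch below uses a simple regular expression and invented dataset names; it reproduces only the general `genus sp. [dataset_id]` pattern described above and is not how `{APCalign}` performs alignment, which involves many more rules.

```{r, eval = TRUE, message = FALSE}
library(dplyr)
library(stringr)

# Toy name submissions from two hypothetical datasets
toy_names <- dplyr::tibble(
  dataset_id = c("Dataset_1", "Dataset_2", "Dataset_2"),
  original_name = c("Eucalyptus sp.", "Eucalyptus sp.", "Eucalyptus saligna")
)

# Append the dataset_id only to names of the form "Genus sp."
toy_aligned <- toy_names %>%
  dplyr::mutate(
    aligned_name = dplyr::if_else(
      stringr::str_detect(original_name, "^[A-Z][a-z]+ sp\\.$"),
      stringr::str_c(original_name, " [", dataset_id, "]"),
      original_name
    )
  )

toy_aligned
```

The two `Eucalyptus sp.` records now carry distinct names, so grouping by name no longer treats them as one "species".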

Outdated names are updated: 

```{r, eval = TRUE}
austraits$traits %>%
  dplyr::filter(stringr::str_detect(original_name, "Dryandra")) %>%
  dplyr::distinct(taxon_name, original_name) %>%
  dplyr::filter(original_name != taxon_name) %>% dplyr::slice(1:15)
```

Phrase name syntax across datasets is aligned:

```{r, eval = TRUE}
austraits$traits %>%
  dplyr::filter(stringr::str_detect(taxon_name, "Argyrodendron sp. Whyanbeel")) %>%
  dplyr::distinct(taxon_name, original_name)
```