
# File organisation

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  results = "asis",
  echo = FALSE,
  message = FALSE,
  warning = FALSE
)

library(traits.build)

my_kable_styling <- function(...) {
  kableExtra::kable(..., booktabs = TRUE, format = "latex") %>%
    kableExtra::kable_styling(latex_options = "scale_down", font_size = 8, full_width = TRUE)
}


if(knitr::is_html_output()) {
  my_kable_styling <- util_kable_styling_html
}
```

```{r, echo=FALSE, results='hide', message=FALSE}
austraits <- austraits:::austraits_5.0.0_lite
schema <- get_schema()
```

This chapter describes the typical files you may encounter in a `traits.build` compilation. The description is based on the [`austraits.build`](https://github.com/traitecoevo/austraits.build/) compilation. 

We strongly suggest you create a standalone folder for your repository, e.g. `austraits.build`. This folder should contain all files needed to build your compilation. We're big fans of GitHub as a platform for collaboration. If you're not familiar with Git or GitHub, we suggest you check out the book [Happy Git and GitHub for the useR](https://happygitwithr.com/).


## Repository structure

The main directory of the [`austraits.build`](https://github.com/traitecoevo/austraits.build/) repository contains the following files and folders, with their purposes as indicated. Not all of these files are required for a compilation; some support extra features, such as a website. They are included here for completeness.

```{r, eval=FALSE}
dir() %>%
  create_tree_branch(title = "austraits") %>%
  writeLines()
```

**Files used for data compilation**

```
├── remake.yml/build.R    # instructions for build
├── config                # configuration files
├── data                  # raw data files
├── R                     # folder with custom R functions
├── export                # folder for output
└── scripts               # R scripts for processing files before/after build
```

**R project file**

```
├── austraits.build.Rproj     # RStudio project
```

**Files for maintaining a repo on GitHub**

```
├── README.md         # landing page
├── .github           # folder containing GitHub Actions, issue templates, code of conduct
├── LICENCE
├── NEWS.md
├── inst              # contains images that appear on the GitHub repo
```

**Additional files describing the compendium and testing the database build process**

```
├── DESCRIPTION           # compendium description
├── tests                 # tests for whether database builds
```
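
As a quick sanity check of the layout described above, the following sketch reports which of the data-compilation folders exist under a given root. The helper `check_compilation_layout()` is hypothetical (it is not part of `traits.build`), and the list of folders it looks for is simply the one shown above; a given compilation may not need all of them.

```r
# Hypothetical helper, not part of traits.build: report which of the
# compilation folders listed above exist under `root`.
check_compilation_layout <- function(root = ".",
                                     expected = c("config", "data", "R", "export", "scripts")) {
  present <- dir.exists(file.path(root, expected))
  names(present) <- expected
  present
}

# Example on a throwaway directory that contains only `config` and `data`
root <- file.path(tempdir(), "example.build")
dir.create(file.path(root, "config"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "data"), showWarnings = FALSE)

layout <- check_compilation_layout(root)
print(layout)
```

The result is a named logical vector, so `names(layout)[!layout]` lists the folders that are absent.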

## `/config` folder

The folder `config` contains four files which govern the building of the dataset.

```
config
├── metadata.yml
├── traits.yml
├── taxon_list.csv
└── unit_conversions.csv
```

### `metadata.yml`

The file `metadata.yml` documents dataset-level metadata, including a database description, authors, and funders.

### `traits.yml`

The file `traits.yml` provides the trait definitions used to compile the trait database, including allowable trait values. See [creating a trait dictionary](create_dictionary.html) for more information on the process of creating this file. A `.yml` file is a structured data file where information is presented in a hierarchical format (see [appendix for details](yaml.html)).

### `taxon_list.csv`

The file `taxon_list.csv` is our master list of taxa in the trait database. 

It includes all unique taxon names after typos have been corrected (through `taxonomic_updates`). It includes both accepted/valid taxon concepts and outdated taxonomic names. It includes names that can be identified to species as well as names that can only be resolved to a coarser taxon rank, such as genus.

There are only three required columns within `taxon_list.csv`: `aligned_name`, `taxon_name` and `taxon_rank`. In the file, `aligned_name` refers to the taxon name after any typos have been corrected, while `taxon_name` is the taxon name following updates to the currently accepted/valid taxon name (when available). The column `taxon_rank` indicates the resolution of the `taxon_name`.

However, it is best practice to include additional columns when available, including taxon identifiers for species (and infraspecific taxon concepts) or genera that align with known taxon concepts.
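
Because only three columns are required, a taxon list can be checked with a few lines of base R. The sketch below is illustrative only: `missing_taxon_columns()` is a hypothetical helper rather than a `traits.build` function, and the rows are made-up placeholders, not real taxa.

```r
# Hypothetical helper: names of required columns that are absent from `taxon_list`
missing_taxon_columns <- function(taxon_list,
                                  required = c("aligned_name", "taxon_name", "taxon_rank")) {
  setdiff(required, names(taxon_list))
}

# Made-up placeholder rows for illustration; not taken from a real taxon list
taxon_list <- data.frame(
  aligned_name = c("Genus_a species_x", "Genus_b sp."),
  taxon_name   = c("Genus_a species_y", "Genus_b"),
  taxon_rank   = c("species", "genus")
)

# All required columns present: returns an empty character vector
complete_check <- missing_taxon_columns(taxon_list)

# Dropping `taxon_rank` makes the check report it as missing
incomplete_check <- missing_taxon_columns(taxon_list[, c("aligned_name", "taxon_name")])
print(incomplete_check)
```

Any additional columns, such as taxon identifiers, pass through this check untouched, since it only looks for the required names.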

The file `taxon_list.csv` should be extended whenever a study includes taxa not previously represented in the trait database. It must be compiled outside of the `traits.build` workflow, as each compilation will draw on different taxonomic datasets, with different columns of information.

For AusTraits, the taxonomic datasets referenced are the two vascular plant lists within the National Species List (NSL): the Australian Plant Census (APC) and the Australian Plant Name Index (APNI). The workflow used by AusTraits to rebuild the taxon list is available [here](https://github.com/traitecoevo/austraits.build/blob/develop/R/build_update_taxon_list.R).


```{r, results='show'}
read_csv("data/taxon_list.csv", show_col_types = FALSE) %>%
  select(
    taxon_name, aligned_name, family, taxonomic_dataset, taxon_rank,
    aligned_name_taxonomic_status, taxon_id, scientific_name, scientific_name_id
  ) %>%
  slice(1:10) %>%
  my_kable_styling()
```

### `unit_conversions.csv`

The file `unit_conversions.csv` defines the unit conversions that are used when converting contributed trait data to common units, e.g.

```{r}
read_csv("data/unit_conversions.csv", col_types = "ccc", show_col_types = FALSE) %>%
  slice(1:10) %>%
  my_kable_styling()
```
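
To illustrate how such a table might be used, the sketch below looks up a conversion for a pair of units and applies it to some values. It is not the `traits.build` implementation: the helper `convert_units()` is hypothetical, and the column names (`unit_from`, `unit_to`, `function`) and the two rows are assumptions made for this example.

```r
# Illustrative table with three character columns. The column names and rows
# are assumptions made for this sketch, not copied from a real file.
unit_conversions <- data.frame(
  unit_from  = c("cm", "g"),
  unit_to    = c("mm", "mg"),
  `function` = c("x*10", "x*1000"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the rule for a unit pair and apply it to `x`
convert_units <- function(x, from, to, conversions) {
  rule <- conversions[conversions$unit_from == from & conversions$unit_to == to, ]
  if (nrow(rule) != 1) {
    stop("Expected exactly one conversion from '", from, "' to '", to, "'")
  }
  # Evaluate the stored expression with `x` bound to the input values
  eval(parse(text = rule[["function"]]), envir = list(x = x))
}

converted <- convert_units(c(1.5, 20), from = "cm", to = "mm", conversions = unit_conversions)
print(converted)
```

Storing each conversion as a text expression keeps the table readable and easy to extend, at the cost of evaluating text as code, so the file should only contain expressions you trust.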

## `/data` folder

The folder `data` contains the raw data from individual studies included in the trait database.

Records within the `data` folder are organised by study, with each study identified by a unique `dataset_id`. Data from each study are stored in a separate folder containing two files:

- `data.csv`: a table containing the actual trait data.
- `metadata.yml`: a file that contains study metadata (source, methods, locations, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.

The folder `data` thus contains a long list of folders, one for each study and each containing two files:

```
data
├── Angevin_2010
│   ├── data.csv
│   └── metadata.yml
├── Barlow_1981
│   ├── data.csv
│   └── metadata.yml
├── Bean_1997
│   ├── data.csv
│   └── metadata.yml
├── ....

```

where `Angevin_2010`, `Barlow_1981`, and `Bean_1997` are each a unique `dataset_id` in the final dataset.

This file can be added to within specific `traits.build` projects, as required for different dataset styles.
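
The per-study layout of the `data` folder (one folder per `dataset_id`, each holding `data.csv` and `metadata.yml`) is easy to verify with base R. The sketch below is a minimal illustration; `check_dataset_folders()` is a hypothetical helper rather than a `traits.build` function, and the example runs on a throwaway directory with empty files.

```r
# Hypothetical helper: for each study folder under `data_dir`, report whether
# `data.csv` and `metadata.yml` are present
check_dataset_folders <- function(data_dir) {
  dataset_ids <- list.dirs(data_dir, full.names = FALSE, recursive = FALSE)
  data.frame(
    dataset_id   = dataset_ids,
    has_data     = file.exists(file.path(data_dir, dataset_ids, "data.csv")),
    has_metadata = file.exists(file.path(data_dir, dataset_ids, "metadata.yml"))
  )
}

# Example on a throwaway directory; the second study is missing its metadata
data_dir <- file.path(tempdir(), "example_data")
for (id in c("Angevin_2010", "Barlow_1981")) {
  dir.create(file.path(data_dir, id), recursive = TRUE, showWarnings = FALSE)
  invisible(file.create(file.path(data_dir, id, "data.csv")))
}
invisible(file.create(file.path(data_dir, "Angevin_2010", "metadata.yml")))

status <- check_dataset_folders(data_dir)
print(status)
```

Rows where either `has_data` or `has_metadata` is `FALSE` point to study folders that are incomplete.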