# AusTraits tutorial
## Introduction
With more than 1.8 million data records, AusTraits is Australia's [largest plant trait database](austraits_database.html#plant_database), created using the [`{traits.build}` R package](https://github.com/traitecoevo/traits.build).
This tutorial introduces:
- [the database structure](#database_structure)
- [`{austraits}` R package functions](#austraits_functions)
- additional [examples of analyses](#sample_analyses) using the database
For more information about `traits.build`, see the [`traits.build-book`](https://traitecoevo.github.io/traits.build-book/).
You can also visit the GitHub repositories for the individual packages and data:
- the database structure: [`traits.build`](https://github.com/traitecoevo/traits.build)
- the database contents: [`austraits.build`](https://github.com/traitecoevo/austraits.build)
- an R package for exploring and wrangling the data: [`austraits`](https://github.com/traitecoevo/austraits)
## Download AusTraits data
Before you begin, install and load the essential packages and functions.
```{r, eval = TRUE, message = FALSE}
# Packages need to be installed the first time you use them.
# They are commented out here, so they aren't reinstalled each time you run the code,
# but install any packages you require the first time you run this tutorial.
#install.packages(c("readr", "tidyr", "dplyr", "stringr", "remotes"))
library(readr)
library(tidyr)
library(dplyr)
library(stringr)
#remotes::install_github("traitecoevo/austraits", dependencies = TRUE, upgrade = "ask")
#remotes::install_github("traitecoevo/traits.build", dependencies = TRUE, upgrade = "ask")
library(austraits) # functions for exploring a traits.build database, available at Github repo
# XX library(traits.build) # additional functions for exploring a traits.build database, available at Github repo
#source("https://raw.githubusercontent.com/traitecoevo/traits.build-book/master/data/extra_functions.R")
source("data/extra_functions.R")
```
Then download (or build) the latest AusTraits database, using one of the methods described [here](austraits_database.html#access_data).
This tutorial uses the most recent AusTraits release, version `r austraits::get_versions()[["version"]][1]`.
```{r, eval = TRUE}
most_recent <- austraits::get_versions() %>%
dplyr::pull("doi") %>%
dplyr::first()
most_recent
austraits <- austraits::load_austraits(doi = most_recent)
```
## A first look at data {#exploring}
If you're not familiar with AusTraits, you may want to begin by exploring the breadth and depth of data within the database. The database can be explored by trait name, species, or genus using either [austraits functions](#austraits_functions) or dplyr functions.
*How many taxa have `leaf_N_per_dry_mass` data in AusTraits?*
```{r, eval = TRUE}
(austraits %>%
austraits::extract_trait(trait_name = "leaf_N_per_dry_mass"))$traits %>%
dplyr::distinct(taxon_name) %>% nrow()
```
*How are these data distributed across datasets?*
```{r, eval = TRUE}
austraits::plot_trait_distribution_beeswarm(database = austraits, trait_name = "leaf_N_per_dry_mass", y_axis_category = "dataset_id")
```
*How much data exist for other nitrogen traits?*
```{r, eval = TRUE}
austraits::lookup_trait(austraits, "_N_") -> N_traits
austraits %>%
austraits::extract_trait(trait_name = N_traits) %>%
austraits::summarise_database(var = "trait_name") %>%
dplyr::arrange(-n_taxa)
```
*How many "hydraulic" traits are in AusTraits? How much data exist for these traits?*
```{r, eval = TRUE}
austraits::lookup_trait(austraits, "hydraulic") -> hydraulic_traits
austraits %>%
austraits::extract_trait(trait_name = hydraulic_traits) %>%
austraits::summarise_database(var = "trait_name") %>%
dplyr::arrange(-n_taxa)
```
*Where have trait data for Acacia aneura been collected?*
```{r, eval = TRUE, warning = FALSE, message = FALSE}
data <-
austraits %>%
austraits::extract_taxa(taxon_name = "Acacia aneura") %>%
austraits::join_location_coordinates()
data$traits %>% austraits::plot_locations("taxon_name")
```
*Where have data for Hibbertia species been collected?*
```{r, eval = TRUE, warning = FALSE, message = FALSE}
data <-
austraits %>%
austraits::extract_taxa(genus = "Hibbertia") %>%
austraits::join_location_coordinates() %>%
austraits::join_taxa(var = "genus")
data$traits %>% austraits::plot_locations("genus")
```
## The database structure {#database_structure}
The `{traits.build}` R package is the workflow that builds AusTraits from its component datasets.
The database is output as a list, a collection of relational tables, described in detail [here](database_structure.html).
A `traits.build` data object includes both the relational data tables and additional tables documenting database metadata and a traits dictionary.
```{r, eval = TRUE}
austraits
```
### Traits table
The core AusTraits table is the traits table. It is in "long" format, with each row documenting a single trait measurement.
```{r, eval = TRUE}
austraits$traits %>% dplyr::slice(1:20)
```
The columns include:
- core columns
    - dataset_id
    - taxon_name
    - trait_name
    - value (trait value)
- entity metadata
    - entity_type
    - life_stage
- value metadata
    - value_type
    - unit
    - basis_of_value
    - replicate
    - basis_of_record
- additional metadata
    - collection_date
    - measurement_remarks
- identifiers for specific observations, individuals, etc.
    - observation_id
    - individual_id
    - population_id
    - repeat_measurements_id
- identifiers that provide links to ancillary tables with additional metadata
    - location_id
    - treatment_context_id
    - plot_context_id
    - entity_context_id
    - temporal_context_id
    - method_context_id
    - method_id
    - source_id
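
To make the "long" layout concrete, the chunk below builds a small toy table with a handful of the columns listed above and summarises it. It is only a sketch: the dataset IDs, observation IDs and values are invented for illustration and are not AusTraits records, and `toy_traits` and `toy_summary` are hypothetical names.

```{r, eval = TRUE}
library(dplyr)

# Toy data only: invented IDs and values, not real AusTraits records
toy_traits <- tibble(
  dataset_id     = c("Toy_2020", "Toy_2020", "Toy_2020", "Toy_2021"),
  taxon_name     = c("Acacia aneura", "Acacia aneura", "Banksia serrata", "Banksia serrata"),
  observation_id = c("01", "01", "02", "01"),
  trait_name     = c("leaf_length", "leaf_area", "leaf_length", "leaf_length"),
  value          = c("45", "120", "90", "85"), # kept as text in this sketch
  unit           = c("mm", "mm2", "mm", "mm")
)

# In long format each row is one measurement, so counting rows per trait
# counts measurements, and counting distinct taxon names counts taxa
toy_summary <- toy_traits %>%
  group_by(trait_name) %>%
  summarise(n_records = n(), n_taxa = n_distinct(taxon_name))

toy_summary
```

Because every measurement sits in its own row, adding a new trait to a table like this adds rows rather than columns.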
### Ancillary data tables
The remaining metadata accompanying each trait record is recorded across multiple relational tables.
These include:
- austraits\$locations
- austraits\$contexts
- austraits\$methods
- austraits\$taxa
- austraits\$taxonomic_updates
- austraits\$contributors
Like the core `traits` table, each is in 'long' format.
The tables `locations`, `contexts`, `methods`, `taxa` and `taxonomic_updates` include metadata that links seamlessly to individual rows within `traits`.
A collection of `join_` functions within `austraits` join the ancillary tables to the traits table, based on columns shared across tables.
| table | metadata in table | columns that link to austraits\$traits |
|------------------|--------------------------|----------------------------|
| locations | location name, location properties, latitude, longitude | dataset_id, location_id |
| contexts | context name, context category (method context, temporal, entity context, plot, treatment), context property | dataset_id, link_id (identifier to link to: method_context_id, temporal_context_id, entity_context_id, plot_context_id, treatment_context_id), link_vals (identifier value to link to) |
| methods | dataset description, dataset sampling strategy, trait collection method, data collectors, data curators, dataset citation, source_id & citation | dataset_id, trait_name, method_id |
| taxa | genus, family, scientific name, APC/APNI taxon concept/taxon name identifiers | taxon_name |
| taxonomic_updates | original name (name submitted), aligned name (typos removed; standardised syntax), identifiers for aligned name | dataset_id, taxon_name, original_name |
| contributors | people who contributed data, including their ORCIDs, affiliations, roles | dataset_id |
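
The linking columns in the table above work like keys in any relational join. As a rough sketch of the idea (not the `austraits` implementation), the chunk below joins a toy locations table to a toy traits table using `dataset_id` and `location_id`. All table contents, and the property names `latitude` and `longitude`, are invented for illustration.

```{r, eval = TRUE}
library(dplyr)
library(tidyr)

# Toy data only: invented IDs, coordinates and values
toy_traits <- tibble(
  dataset_id  = c("Toy_2020", "Toy_2020", "Toy_2021"),
  location_id = c("01", "02", "01"),
  taxon_name  = c("Acacia aneura", "Banksia serrata", "Banksia serrata"),
  trait_name  = "leaf_length",
  value       = c("45", "90", "85")
)

# A long-format locations table: one row per location property
toy_locations <- tibble(
  dataset_id        = c("Toy_2020", "Toy_2020", "Toy_2020", "Toy_2020", "Toy_2021", "Toy_2021"),
  location_id       = c("01", "01", "02", "02", "01", "01"),
  location_property = c("latitude", "longitude", "latitude", "longitude", "latitude", "longitude"),
  value             = c("-25.3", "131.0", "-33.7", "151.1", "-37.8", "145.0")
)

# Spread the properties into columns, then join on the shared key columns
toy_coordinates <- toy_locations %>%
  pivot_wider(names_from = location_property, values_from = value)

toy_joined <- toy_traits %>%
  left_join(toy_coordinates, by = c("dataset_id", "location_id"))

toy_joined
```

Note that `location_id` alone is not enough to identify a location in this toy example, because the same `location_id` is reused across datasets; this is why both columns are used as the join key.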
## Exploring AusTraits
With `r nrow(austraits$traits)` rows of trait values in the main traits table, knowing how to explore the contents is essential.
The R package [`{austraits}`](https://github.com/traitecoevo/austraits) offers a collection of functions to explore and wrangle AusTraits data -- or indeed any data using the traits.build format.
An austraits package vignette is available [here](https://traitecoevo.github.io/traits.build-book/austraits_package.html).
Function categories include:
- [**summarise and lookup functions**](#summarise_functions): These functions offer summaries by taxon name or trait, summarising taxa per trait (or other variable), datasets per trait, and observations per trait.
- [**filtering functions**](#filtering_functions): These functions begin with the word `extract` and filter all of the relational tables simultaneously.
- [**join functions**](#join_functions): These functions allow columns from the relational tables to be joined to the core traits table.
- [**pivot functions**](#pivot_functions): These functions allow the traits table to be pivoted to wide format.
- [**plotting functions**](#plotting_functions): These functions offer a means of rapidly visualising AusTraits data, either plotting collection locations on a map of Australia or plotting trait values by dataset.
### austraits.R function reference
Reference guide to: [austraits functions](https://traitecoevo.github.io/austraits/reference/index.html)
### Summarising data: data coverage {#summarise_functions}
There are two function families for summarising AusTraits data:
- **lookup_()**\
- **summarise_database()**
Use the function `summarise_database` to output summaries of total records, datasets with records, and taxa with records, grouped by `family`, `genus` or `trait_name`:
```{r, eval = TRUE}
austraits::summarise_database(database = austraits, var = "trait_name") %>% dplyr::slice(100:130)
austraits::summarise_database(database = austraits, var = "family") %>% dplyr::slice(1:20)
austraits::summarise_database(database = austraits, var = "genus") %>% dplyr::slice(1:20)
```
Since this function summarises the variable selected for **ALL** of AusTraits, you may want to first [filter](#filtering_functions) the data before summarising by "taxon_name" -- or even "trait_name".
Alternatively, you can look up traits that contain a specific search term:
```{r, eval = TRUE}
austraits::lookup_trait(database = austraits, term = "leaf") %>% length()
austraits::lookup_trait(database = austraits, term = "leaf")[1:30]
austraits::lookup_trait(database = austraits, term = "_N_")
# elemental contents use their symbol and are *almost* always in the middle of a trait name
austraits::lookup_trait(database = austraits, term = "photo")
```
To learn more about the traits included in AusTraits, visit the [AusTraits Plant Dictionary](https://w3id.org/APD).
You can also search the `locations` and `contexts` tables for `location_properties` and/or `context_properties` included as metadata for many trait measurements:
```{r}
austraits::lookup_location_property(database = austraits, term = "soil")
austraits::lookup_location_property(database = austraits, term = "temperature")
austraits::lookup_context_property(database = austraits, term = "season")
austraits::lookup_context_property(database = austraits, term = "fire")
```
For instance, to just look at number of records, datasets, and taxa with data for nitrogen-related traits:
```{r, eval = TRUE}
N_traits <- austraits %>%
austraits::extract_trait(trait_name = "_N_") %>%
austraits::summarise_database(var = "trait_name")
```
## Wrangling AusTraits
### Filtering data {#filtering_functions}
There are four `austraits` functions that filter data: `extract_trait`, `extract_dataset`, `extract_taxa`, and `extract_data`.
Each of these functions simultaneously filters all database tables to only include trait measurements (and associated metadata) meeting the specified criteria, retaining the original database structure.
*Note, although the `extract_` functions were explicitly developed to return the original `traits.build` database structure, they will also work when the "database" is just a single table, such as if prior manipulations have separated the traits table from the rest of the database.*
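
The idea of filtering every table at once can be sketched with a toy database. The chunk below is an illustration under stated assumptions, not the `austraits` code: `toy_database` contains invented records, and `toy_extract_trait` is a hypothetical helper that filters the traits table and then trims two ancillary tables to the rows still referenced.

```{r, eval = TRUE}
library(dplyr)

# Toy data only: a list of linked tables with invented contents
toy_database <- list(
  traits = tibble(
    dataset_id = c("Toy_2020", "Toy_2020", "Toy_2021"),
    taxon_name = c("Acacia aneura", "Banksia serrata", "Banksia serrata"),
    trait_name = c("leaf_length", "leaf_area", "leaf_length"),
    value      = c("45", "300", "85")
  ),
  taxa = tibble(
    taxon_name = c("Acacia aneura", "Banksia serrata"),
    genus      = c("Acacia", "Banksia")
  ),
  contributors = tibble(
    dataset_id = c("Toy_2020", "Toy_2021"),
    last_name  = c("Example", "Sample")
  )
)

# Hypothetical helper: filter traits, then keep only the ancillary rows
# that still link to a remaining trait record
toy_extract_trait <- function(database, trait) {
  database$traits <- database$traits %>%
    filter(trait_name == trait)
  database$taxa <- database$taxa %>%
    semi_join(database$traits, by = "taxon_name")
  database$contributors <- database$contributors %>%
    semi_join(database$traits, by = "dataset_id")
  database
}

toy_leaf_area <- toy_extract_trait(toy_database, "leaf_area")
sapply(toy_leaf_area, nrow)
```

The result is still a list with the same table names, which is what allows further extract or join steps to be chained afterwards.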
#### `extract_trait`, `extract_dataset` and `extract_taxa`
Three of the functions extract data based on pre-set columns:
1. `extract_trait` filters by `trait_name`
2. `extract_dataset` filters by `dataset_id`
3. `extract_taxa` filters by `taxon_name`, `genus` or `family`
Search terms can either be exact or partial matches.
```{r, eval = TRUE}
leaf_mass_per_area_data <-
austraits %>%
austraits::extract_trait(trait_names = c("leaf_mass_per_area"))
Westoby_2014_datasets <-
austraits %>%
austraits::extract_dataset("Westoby_2014")
all_Westoby_datasets <-
austraits %>%
austraits::extract_dataset("Westoby")
Eucalyptus_data <-
austraits %>%
austraits::extract_taxa(genus = "Eucalyptus")
Banksia_serrata_data <-
austraits %>%
austraits::extract_taxa(taxon_name = "Banksia serrata")
```
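
The difference between exact and partial matching can be illustrated on a plain character vector. This sketch only demonstrates the concept and makes no claim about how `austraits` matches terms internally; apart from `Westoby_2014`, the dataset IDs are invented.

```{r, eval = TRUE}
library(stringr)

# Toy vector of dataset IDs (invented apart from the ID used above)
toy_dataset_ids <- c("Westoby_2014", "Westoby_2003", "Wright_2019")

# Exact match: only the ID that equals the search term
toy_exact <- toy_dataset_ids[toy_dataset_ids == "Westoby_2014"]

# Partial match: every ID that contains the search term
toy_partial <- toy_dataset_ids[str_detect(toy_dataset_ids, "Westoby")]

toy_exact
toy_partial
```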
#### `extract_data`
`extract_data` lets you filter the database based on one or more values in any column of any of the seven data tables (traits, locations, contexts, methods, taxa, taxonomic_updates, contributors).
See the [database structure chapter](database_structure.html) for names and definitions of each column and, for those with controlled vocabulary, their allowed values.
Alternatively, to see the list of column names available in a table:
```{r, eval = TRUE}
names(austraits$traits)
names(austraits$methods)
```
And to see the list of possible values for a column:
```{r, eval = TRUE}
unique(austraits$traits$life_stage)
unique(austraits$traits$basis_of_record)
unique(austraits$locations$location_property)[1:20]
unique(austraits$contexts$context_property)[1:20]
```
The function then allows you to filter down to the components of each table that are relevant to the search criteria specified:
```{r, eval = TRUE}
field_data <- austraits %>% austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field")
field_data$traits %>% head()
data_with_soils_data <- austraits %>% austraits::extract_data(table = "locations", col = "location_property", col_value = "soil")
data_with_soils_data$traits %>% head()
data_with_soils_data$locations %>% head() # all location properties are retained for measurements where at least one location property pertains to soil
data_contributed_by_Wright <- austraits %>% austraits::extract_data(table = "contributors", col = "last_name", col_value = "Wright")
data_contributed_by_Wright$traits %>% head()
data_contributed_by_Wright$contributors %>% head() # all contributors are retained for datasets where at least one of the contributors on the dataset has the last name "Wright"
```
Multiple `extract_` functions can be chained together to rapidly restrict the data to the desired subset:
```{r, eval = TRUE}
subset <- austraits %>%
austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field") %>%
austraits::extract_trait(trait_name = c("leaf_mass_per_area", "leaf_thickness", "leaf_length", "leaf_area")) %>%
austraits::extract_taxa(genus = "Eucalyptus")
subset$traits[1:20]
```
### Joining relational tables {#join_functions}
For many research purposes you will want to join metadata from one of the relational tables to the core traits table. Eight `{austraits}` functions facilitate this by adding the columns you select from the ancillary data tables to the database's `traits` table: seven functions that each merge information from a single table (`join_...`) and one function that joins columns from all of the ancillary data tables (`flatten_database`). The `join_` functions output the database with its original structure, allowing you to follow up joining with extracting and to continue joining additional columns.
#### Joining location metadata
The locations table includes information on all location properties measured, including the actual location (latitude/longitude), climatic data, soil properties, fire history, vegetation history, geologic history, etc.
The austraits function `join_location_coordinates` just adds location name, latitude, and longitude to the core traits table:
```{r, eval = TRUE}
traits_with_lat_long <- austraits %>%
austraits::extract_dataset(dataset_id = "Westoby") %>%
austraits::join_location_coordinates()
traits_with_lat_long$traits %>% names()
```
The function `join_location_properties` joins other location properties to the `traits` table. It has two arguments:
1. `vars` specifies the location properties, via complete or partial string matches, that should be added to the traits table; defaults to "all"
2. `format` offers three output formats:
    * `many_columns` (each location property is added as a separate column)
    * `single_column_pretty` (all location properties compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    * `single_column_json` (all location properties compacted into a single column, using JSON formatting)
Examples of joining location properties to the traits table:
```{r, eval = TRUE}
# method to add location properties that you know exist from previous database exploration; this example showcases `format = "many_columns"`
locations1 <- austraits %>%
austraits::join_location_properties(vars = c("description", "aridity index (MAP/PET)",
"soil type", "fire history"), format = "many_columns")
locations1$traits %>% names()
locations1$traits[1:10]
# method where you first look up location properties using the function `lookup_location_property`; this example showcases `format = "single_column_pretty"`
precipitation_properties <- lookup_location_property(database = austraits, term = "precipitation")
locations2 <- austraits %>%
austraits::join_location_properties(vars = precipitation_properties, format = "single_column_pretty")
locations2$traits %>% names()
locations2$traits[1:10]
# method where you add all location properties; this example showcases `format = "single_column_json"`
locations3 <- austraits %>%
austraits::join_location_properties(vars = "all", format = "single_column_json")
locations3$traits %>% names()
locations3$traits[1:10]
```
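
To see what the wide and compacted output formats mean in practice, the sketch below reshapes a toy long-format locations table both ways using `tidyr` and `dplyr`. The data are invented, and the `==` and `;;` delimiters are assumptions chosen for illustration; they are not guaranteed to match the delimiters `austraits` uses.

```{r, eval = TRUE}
library(dplyr)
library(tidyr)

# Toy data only: invented location properties and values
toy_locations <- tibble(
  dataset_id        = "Toy_2020",
  location_id       = c("01", "01", "02"),
  location_property = c("soil type", "fire history", "soil type"),
  value             = c("sand", "burnt 2015", "clay")
)

# "Many columns" style: one column per property, NA where a property is absent
toy_many_columns <- toy_locations %>%
  pivot_wider(names_from = location_property, values_from = value)

# "Single column" style: all properties for a location pasted into one string
# (delimiters here are illustrative assumptions)
toy_single_column <- toy_locations %>%
  group_by(dataset_id, location_id) %>%
  summarise(
    location_properties = paste0(location_property, "==", value, collapse = ";; "),
    .groups = "drop"
  )

toy_many_columns
toy_single_column
```

The wide form is convenient for filtering or modelling on a specific property, while the compacted form keeps the traits table narrow when many properties exist.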
#### Joining contexts metadata
The context table documents additional context properties/ancillary data which may be useful for interpreting trait values. Context properties are divided into 5 categories: `treatment context`, `plot context`, `entity context`, `temporal context`, and `method context`.
| context category | description |
|-----------------|-------------------------------------------------------|
| treatment context | Context property that is an experimental manipulation, that might affect the trait values measured on an individual, population or species-level entity. |
| plot context | Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. |
| entity context | Context property that is information about an organismal entity (individual, population or taxon) that does not comprise a trait-centered observation but might affect the trait values measured on the entity. |
| temporal context | Context property that is a feature of a "point in time" that might affect the trait values measured on an individual, population or species-level entity. |
| method context | Context property that records specific information about a measurement method that is modified between measurements. |
The austraits function `join_context_properties` joins context properties to the `traits` table. It has three arguments:
1. `vars` specifies the context properties, via complete or partial string matches, that should be added to the traits table; defaults to "all"
2. `format` offers three output formats:
    * `many_columns` (each context property is added as a separate column)
    * `single_column_pretty` (all context properties compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    * `single_column_json` (all context properties compacted into a single column, using JSON formatting)
3. `include_description` is a logical argument (`TRUE`/`FALSE`) that indicates whether the context property value descriptions should be included when context data are joined; defaults to `TRUE`
Each category of context property is added to a separate column for the compacted columns, retaining this important information about the different groupings of context properties. When `format = "many_columns"` is selected the context category is indicated in the column name.
Examples of joining context properties to the traits table:
```{r, eval = TRUE}
# method to add context properties that you know exist from previous database exploration; this example showcases `format = "many_columns"`
contexts1 <- austraits %>%
austraits::join_context_properties(
vars = c("sampling season", "plant sex", "leaf surface", "leaf age", "fire intensity",
"slope position", "fire season", "drought treatment", "temperature treatment"),
format = "many_columns",
include_description = TRUE
)
contexts1$traits %>% names()
contexts1$traits[1:10]
# method where you first look up context properties using the function `lookup_context_property`; this example showcases `format = "single_column_pretty"`
leaf_properties <- lookup_context_property(database = austraits, term = "leaf")
contexts2 <- austraits %>%
austraits::join_context_properties(
vars = leaf_properties,
format = "single_column_pretty",
include_description = TRUE
)
contexts2$traits %>% names()
contexts2$traits[1:10]
# method where you add all context properties; this example showcases `format = "single_column_json"`
contexts3 <- austraits %>%
austraits::join_context_properties(
vars = "all",
format = "single_column_json",
include_description = FALSE
)
contexts3$traits %>% names()
contexts3$traits[1:10]
```
#### Joining methods columns
The methods table documents a selection of metadata recorded about the entire dataset and the methods used for individual trait measurements. There is a single row of data per `dataset_id` x `trait_name` x `method_id` combination. `method_id` distinguishes between instances where a single trait is measured twice using two separate protocols; it is separate from `method_context_id`, which documents specific components of a method that are modified between measurements.
The austraits function `join_methods` joins columns from the methods table to the `traits` table. It has one argument:
1. `vars` which specifies which columns from the `methods` table are joined to the `traits` table; defaults to `vars = c("methods")`
First, check the schema file embedded within AusTraits to see what information is documented in each column:
```{r, eval = TRUE}
austraits$schema$austraits$elements$methods$elements %>%
austraits::convert_list_to_df1()
```
Examples using `join_methods`:
```{r, eval = TRUE}
# join methods column only, the default
traits_with_methods <-
austraits %>% austraits::join_methods()
traits_with_methods$traits %>% names()
# join all methods table columns
traits_with_methods <-
austraits %>% austraits::join_methods(vars = "all")
traits_with_methods$traits %>% names()
# join specifically selected methods table columns
traits_with_methods <-
austraits %>% austraits::join_methods(vars = c("methods", "description", "source_secondary_key"))
traits_with_methods$traits %>% names()
```
#### Joining taxa
The `taxa` table documents a collection of names and identifiers for each taxon. Within AusTraits, names submitted as identifiers within a dataset might be resolved to a species, an infraspecific taxon, or sometimes just to a genus- or family-level name; the name's resolution is recorded as the `taxon_rank`. The `taxon_rank` determines which information is filled in within the taxa table.
The `{austraits}` function `join_taxa` joins columns from the `taxa` table to the `traits` table. It has one argument:
1. `vars` which specifies which columns from the `taxa` table are joined to the `traits` table; defaults to `vars = c("family", "genus", "taxon_rank", "establishment_means")`.
First, check the schema file embedded within AusTraits to see what information is documented in each column:
```{r, eval = TRUE}
austraits$schema$austraits$elements$taxa$elements %>%
austraits::convert_list_to_df1()
```
Examples using `join_taxa`:
```{r, eval = TRUE}
# join the default columns
traits_with_taxa <-
austraits %>% austraits::join_taxa()
traits_with_taxa$traits %>% names()
# join all taxa table columns
traits_with_taxa <-
austraits %>% austraits::join_taxa(vars = "all")
traits_with_taxa$traits %>% names()
```
#### Joining taxonomic updates
The taxonomic updates table documents all taxonomic changes implemented in the construction of AusTraits, including both the correction of typos and the updating of outdated synonyms to the currently accepted name.
The `{austraits}` function `join_taxonomic_updates` joins columns from the `taxonomic_updates` table to the `traits` table. It has one argument:
1. `vars` which specifies which columns from the `taxonomic_updates` table are joined to the `traits` table; defaults to `vars = c("aligned_name")`.
First, check the schema file embedded within AusTraits to see what information is documented in each column:
```{r, eval = TRUE}
austraits$schema$austraits$elements$taxonomic_updates$elements %>%
austraits::convert_list_to_df1()
```
Examples using `join_taxonomic_updates`:
```{r, eval = TRUE}
# join the default columns
traits_with_taxonomic_updates <-
austraits %>% austraits::join_taxonomic_updates()
traits_with_taxonomic_updates$traits %>% names()
# join all taxonomic_updates columns
traits_with_taxonomic_updates <-
austraits %>% austraits::join_taxonomic_updates(vars = "all")
traits_with_taxonomic_updates$traits %>% names()
```
#### Joining contributors
The contributors table documents basic metadata about all dataset contributors, including their name, ORCID, and role for each dataset.
The `{austraits}` function `join_contributors` joins columns from the `contributors` table to the `traits` table. It has two arguments:
1. `vars` which specifies which columns from the `contributors` table are joined to the `traits` table; defaults to `vars = "all"`.
2. `format` offers two output formats:
    * `single_column_pretty` (data in selected columns from the `contributors` table compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    * `single_column_json` (data in selected columns from the `contributors` table compacted into a single column, using JSON formatting)
First, check the schema file embedded within AusTraits to see what information is documented in each column:
```{r, eval = TRUE}
austraits$schema$austraits$elements$contributors$elements %>%
austraits::convert_list_to_df1()
```
Examples using `join_contributors`:
```{r, eval = TRUE}
# join all columns (the default)
traits_with_contributors <-
austraits %>% austraits::join_contributors(format = "single_column_json")
traits_with_contributors$traits %>% names()
# join select contributors columns
traits_with_contributors <-
austraits %>% austraits::join_contributors(
vars = c("last_name", "first_name", "ORCID"),
format = "single_column_pretty")
traits_with_contributors$traits %>% names()
traits_with_contributors$traits
```
#### Joining all data
If you want to join data from all ancillary tables onto the traits table, effectively "flattening" the relational database into a single flat table, it is simplest to use the `{austraits}` function `flatten_database`.
`flatten_database` calls each of the join functions, selecting `vars = "all"` as the default for each function.
It has three arguments:
1. `vars` specifies, as a named list with one element per ancillary table, the columns or properties that should be added to the traits table; defaults to "all"
2. `format` offers three output formats that apply to the functions `join_location_properties`, `join_context_properties` and `join_contributors`
    * `many_columns` (each location or context property is added as a separate column)
    * `single_column_pretty` (all location or context properties or all contributor columns compacted into a single column delimited in a way that is easy for humans to read; this is the default)
    * `single_column_json` (all location or context properties or all contributor columns compacted into a single column, using JSON formatting)
3. `include_description` is a logical argument (`TRUE`/`FALSE`) that indicates whether the context property value descriptions should be included when context data are joined; defaults to `TRUE`; this argument is only passed to `join_context_properties`
Examples using `flatten_database`:
```{r, eval = TRUE}
# using the defaults
flat_database <- austraits %>% flatten_database()
names(flat_database)
# specifying vars for each table
flat_database <- austraits %>% flatten_database(
  vars = list(
    location = "all",
    context = "sampling_season",
    contributors = c("last_name", "first_name", "ORCID"),
    taxonomy = c("family", "establishment_means"),
    taxonomic_updates = "aligned_name",
    methods = "methods"
  )
)
names(flat_database)
```
### Combining `extract_` and `join_` functions
As both the `extract_` and `join_` functions output a database with the original database structure, they can be used sequentially to extract, then join, exactly the data desired.
For instance:
```{r, eval = TRUE}
subset2 <- austraits %>%
  austraits::extract_data(table = "traits", col = "basis_of_record", col_value = "field") %>%
  austraits::extract_trait(trait_name = c("leaf_mass_per_area", "leaf_thickness", "leaf_length", "leaf_area")) %>%
  austraits::extract_taxa(genus = "Eucalyptus") %>%
  austraits::join_location_coordinates() %>%
  austraits::join_taxa(vars = c("family")) %>%
  austraits::join_context_properties(vars = "all", format = "many_columns")
names(subset2$traits)
```
Having joined all context properties as separate columns, you may now look at the expanded `traits` table and decide that you only want data that were sampled during wet seasons, documented in the column `temporal_context: sampling season`:
```{r, eval = TRUE}
unique(subset2$traits$`temporal_context: sampling season`)
subset2 <- subset2 %>%
  austraits::extract_data(table = "traits", col = "temporal_context: sampling season", col_value = "wet")
```
### Binding datasets
For some applications, you may wish to extract two different subsets of data, based on the values of different columns, then merge those extracted database subsets together, but still retain the original database structure.
This is possible with the function `bind_databases`.
This function binds each of the relational tables, removing any duplicate entries.
For instance, you might want all measurements where *either* the `location_property` or the `context_property` references the word "fire":
```{r, eval = TRUE}
subset_a <- austraits %>%
  austraits::extract_data(table = "locations", col = "location_property", col_value = "fire")
subset_b <- austraits %>%
  austraits::extract_data(table = "contexts", col = "context_property", col_value = "fire")
subset_ab <- bind_databases(subset_a, subset_b)
```
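As a rough illustration of what "binding each table and removing duplicates" means, here is a minimal base R sketch on two invented databases. The helper `bind_toy` is hypothetical and is not how `bind_databases` is implemented; it only mimics the idea on plain lists of data frames.

```r
# Two toy "databases": lists of tables sharing the same structure (invented data)
toy_a <- list(
  traits    = data.frame(dataset_id = c("A", "B"), value = c("1.2", "3.4")),
  locations = data.frame(dataset_id = c("A", "B"), location_id = c("01", "01"))
)
toy_b <- list(
  traits    = data.frame(dataset_id = c("B", "C"), value = c("3.4", "5.6")),
  locations = data.frame(dataset_id = c("B", "C"), location_id = c("01", "02"))
)

# Hypothetical helper: stack matching tables row-wise, then drop rows that
# appear in both subsets
bind_toy <- function(x, y) {
  tables <- intersect(names(x), names(y))
  out <- lapply(tables, function(tbl) unique(rbind(x[[tbl]], y[[tbl]])))
  names(out) <- tables
  out
}

toy_ab <- bind_toy(toy_a, toy_b)
nrow(toy_ab$traits)  # 3 rows: the shared "B" row is kept once
```

The result is still a list with the same table names, which is why further `extract_` or `join_` steps can follow a bind.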
## Summarising data: trait means, modes, etc.
The function `summarise_trait_means`, available in older `{austraits}` versions, has been deprecated, as it is not appropriate for AusTraits versions > 5.0 -- that is, all databases built using `{traits.build}`. A new version is in development and will be released in 2025.
In the meantime, if you've sourced the file `extra_functions.R`, there are a few functions that allow you to summarise trait values.
### Categorical traits
For instance, `categorical_summary` indicates how many times a specific trait value is reported for a given taxon (across all datasets):
```{r, eval = TRUE, message = FALSE}
cat_summary <- categorical_summary(austraits, "resprouting_capacity")
cat_summary
```
Alternatively, create a wider matrix with possible trait values as columns:
```{r, eval = TRUE, message = FALSE}
categorical_summary_wider <-
  categorical_summary_by_value(austraits, "resprouting_capacity") %>%
  tidyr::pivot_wider(names_from = value_tmp, values_from = replicates)
categorical_summary_wider
```
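If you have not sourced `extra_functions.R`, the same kind of tally can be approximated with base R. The sketch below uses an invented traits table and is not the code behind `categorical_summary`; the column name `replicates` is borrowed from the example above.

```r
# Toy traits table (invented values)
toy_traits <- data.frame(
  taxon_name = c("Banksia serrata", "Banksia serrata", "Banksia serrata",
                 "Banksia ericifolia", "Banksia ericifolia"),
  trait_name = "resprouting_capacity",
  value      = c("resprouts", "resprouts", "fire_killed",
                 "fire_killed", "fire_killed")
)

# Long form: number of records for each taxon-by-value combination
long_counts <- as.data.frame(
  table(taxon_name = toy_traits$taxon_name, value = toy_traits$value),
  responseName = "replicates"
)
long_counts <- long_counts[long_counts$replicates > 0, ]

# Wide form: taxa as rows, possible trait values as columns
wide_counts <- table(toy_traits$taxon_name, toy_traits$value)

long_counts
wide_counts
```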
### Numeric traits
One of the problems with writing functions that summarise numeric traits is that they make statistical assumptions that are hidden within the function code and might not be appropriate for your data use case.
The datasets that comprise AusTraits were collected by different people, with a different number of replicates and different entity types reported. One dataset might include 20 measurements on individuals for a trait and another might have submitted a single population-level mean derived from 5 measurements.
How do you take the mean of these trait values?
Do you want to include both data from experiments and plants growing under natural conditions? This information is recorded in the `basis_of_record` column.
One function we're developing calculates weighted group means for field and experiment-sourced data, by first grouping values at the site level, then at the taxon level. For trait data sourced from floras where trait values are documented as a minimum and maximum value, the function takes the mean of these. The two subsets of data are then merged together.
```{r, eval = TRUE, message = FALSE}
weighted <- austraits_weighted_means(austraits, c("leaf_mass_per_area", "leaf_length"))
weighted
```
This function may be sufficient for exploratory purposes. Alternatively, you can download the file with the function and edit the code to suit your purposes.
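To see why the grouping order matters, consider this small base R sketch with invented numbers. It is not the implementation of `austraits_weighted_means`; it only contrasts a naive mean with a mean of site means, and shows the midpoint calculation for a flora-style range.

```r
# Toy field measurements (invented): site 01 is sampled far more heavily than site 02
toy <- data.frame(
  taxon_name  = "Taxon A",
  location_id = c("01", "01", "01", "01", "02"),
  value       = c(100, 110, 90, 100, 200)
)

# Naive mean: every measurement counts equally, so site 01 dominates
naive_mean <- mean(toy$value)  # 120

# Step 1: mean per taxon-by-site; step 2: mean of the site means per taxon
site_means  <- aggregate(value ~ taxon_name + location_id, data = toy, FUN = mean)
taxon_means <- aggregate(value ~ taxon_name, data = site_means, FUN = mean)
taxon_means$value  # 150

# Flora-style record reported as a minimum and maximum: take the midpoint
flora_min  <- 80
flora_max  <- 160
flora_mean <- mean(c(flora_min, flora_max))  # 120
```

Whether the site-weighted value (150) or the naive value (120) is the better summary depends on your question, which is exactly the kind of assumption worth making explicit.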
## Plotting data {#plotting_functions}
### Plotting trait distributions
Another way to summarise AusTraits data by trait, and determine whether AusTraits offers sufficient data coverage for a trait of choice, is to plot the distribution of trait values in AusTraits.
As seen in [`A first look at data`](#exploring), the function `austraits::plot_trait_distribution_beeswarm()` plots trait data by `dataset_id`, `genus`, `family` or indeed any column in the traits table, such as `life_stage` or `basis_of_record`:
```{r, eval = TRUE}
# How does leaf N vary by dataset?
austraits::plot_trait_distribution_beeswarm(austraits, "leaf_N_per_dry_mass",
  y_axis_category = "dataset_id")
# How does leaf N vary across Banksia species?
Banksia_data <- austraits %>% extract_taxa(genus = "Banksia")
austraits::plot_trait_distribution_beeswarm(Banksia_data, "leaf_N_per_dry_mass",
  y_axis_category = "taxon_name")
# Does leaf mass per area shift in Eucalyptus seedlings versus adults, which is captured in `life_stage`? What about amongst Eucalypts where information about the age of the leaves was recorded, captured as the context property "leaf age"?
Euc_data <- austraits %>% extract_taxa(genus = "Eucalyptus") %>%
  austraits::join_context_properties(vars = "leaf age", format = "many_columns", include_description = FALSE)
austraits::plot_trait_distribution_beeswarm(Euc_data, "leaf_mass_per_area",
  y_axis_category = "life_stage")
austraits::plot_trait_distribution_beeswarm(Euc_data, "leaf_mass_per_area",
  y_axis_category = "method_context: leaf age")
```
### Plotting data distribution by location
To plot locations, begin by joining the latitude and longitude data from `austraits$locations` using `austraits::join_location_coordinates`.
The `plot_locations` function plots the selected data, separating data into a series of plots based on the variable name selected. You can separate data based on the values of **any** column within the traits table -- including `basis_of_record`, `life_stage` and `value_type` -- or higher taxon categories (`genus`, `family`).
For instance, setting `feature = "trait_name"` in `austraits::plot_locations()` will output a separate plot for each `trait_name` within the selected data.
A warning: `austraits::plot_locations()` WILL BE VERY SLOW if you request more than \~20 plots. For instance, do not attempt to generate plots for all traits simultaneously. Always first use extract/filter to just select a narrow range of traits, datasets, or taxa.
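One way to guard against this is to count how many panels a feature would produce before plotting. The sketch below uses an invented traits table, and `count_panels` is a hypothetical helper, not part of `{austraits}`.

```r
# Toy traits table (invented); in practice this would be your_database$traits
toy_traits <- data.frame(
  trait_name = c("leaf_area", "leaf_area", "leaf_length", "wood_density"),
  life_stage = c("adult", "seedling", "adult", "adult")
)

# Hypothetical helper: how many panels would a given feature produce?
count_panels <- function(traits, feature) {
  length(unique(traits[[feature]]))
}

count_panels(toy_traits, "trait_name")  # 3
count_panels(toy_traits, "life_stage")  # 2

# Only go ahead when the count is manageable (threshold of ~20 taken from the warning)
ok_to_plot <- count_panels(toy_traits, "trait_name") <= 20
```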
#### Plot locations by trait, dataset, or other column
See where Eucalyptus data have been collected, divided by `life_stage`:
```{r, eval = TRUE, message = FALSE, warning = FALSE}
Euc_data <- Euc_data %>%
  austraits::join_location_coordinates()
austraits::plot_locations(database = Euc_data, feature = "life_stage")
```
See where Banksia leaf area data have been collected, divided by `taxon_name`:
```{r, eval = TRUE, message = FALSE, warning = FALSE}
Banksia_data <- Banksia_data %>%
  austraits::join_location_coordinates() %>%
  austraits::extract_trait(trait_name = "leaf_area")
austraits::plot_locations(database = Banksia_data, feature = "taxon_name")
```
Where were the various Westoby datasets collected?
```{r, eval = TRUE, message = FALSE, warning = FALSE}
Westoby <-
  austraits %>%
  austraits::extract_dataset(dataset_id = "Westoby") %>%
  austraits::join_location_coordinates()
austraits::plot_locations(database = Westoby, feature = "dataset_id")
# Note that while `database` is intended to be a relational database, this function also works with just the traits table, should you have separated it out of the relational structure.
# Westoby_traits <- Westoby$traits
# austraits::plot_locations(database = Westoby_traits, feature = "dataset_id")
```
Where were data for Acacia aneura collected?
```{r, eval = TRUE, message = FALSE, warning = FALSE}
data <-
  austraits %>%
  austraits::extract_taxa(taxon_name = "Acacia aneura") %>%
  austraits::join_location_coordinates()
data$traits <- data$traits %>%
  dplyr::filter(!is.na(`latitude (deg)`))
austraits::plot_locations(data, "taxon_name") # actually 4 taxa, because of subspecies
austraits::plot_locations(data, "dataset_id") # 1 plot for each dataset_id
```
### More complex workflows -- some examples
#### An example looking at trait-climate gradients
A simple workflow allows one to look at [trait values across a climate gradient](traits_and_climate_example.html).
#### An example incorporating ALA distribution data
A recent tutorial posted by ALA shows how one can combine AusTraits trait data and ALA spatial occurrence data:
https://labs.ala.org.au/posts/2023-08-28_alternatives-to-box-plots/post.html
We've adapted it [here](spatial_data_example.html).
## A complexity: pivoting datasets {#pivotting_datasets}
The AusTraits tables are all in `long` format with an individual row for each trait measurement. This is the most compact way to store data and offers the flexibility of documenting diverse metadata for each trait measurement.
However, for many research uses, it may be more useful to view data in a `wide` format, with the multiple traits that comprise a single observation displayed as consecutive columns.
The `{austraits}` function `trait_pivot_wider` allows AusTraits datasets to be pivoted from `long` to `wide` format.
It is recommended to only use this function on individual datasets -- or perhaps a small selection of datasets -- as each dataset includes a different collection of traits and pivoting wider otherwise creates a very "holey" dataset.
```{r, eval = TRUE}
Farrell_2017_values <-
  austraits %>%
  austraits::extract_dataset(dataset_id = "Farrell_2017")
Farrell_2017_pivoted <-
  Farrell_2017_values$traits %>%
  austraits::trait_pivot_wider()
Farrell_2017_pivoted
```
This example pivots "nicely" as all observations have `entity_type = individual`.
Compare this first example to the dataset `Edwards_2000` which includes individual-, population-, and species-level observations:
```{r, eval = TRUE}
Edwards_2000_values <-
  austraits %>%
  austraits::extract_dataset(dataset_id = "Edwards_2000")
Edwards_2000_pivoted <-
  Edwards_2000_values$traits %>%
  austraits::trait_pivot_wider()
Edwards_2000_pivoted
```
The values at the individual, population and species level do not collapse together, because traits measured on different entity types have separate `observation_id`s.
One of the core identifiers assigned to data points is the `observation_id`. An observation is a collection of measurements made on a specific entity at a single point in time.
An `observation_id` is, therefore, assigned to each unique combination of:
- dataset_id
- source_id
- entity_type
- taxon_name
- population_id (location_id, plot_context_id, treatment_context_id)
- individual_id
- basis_of_record
- entity_context_id
- life_stage
- temporal_context_id
- collection_date
- original_name
If a single dataset includes traits that are attributed to different entity types, they are assigned separate `observation_id`'s. For instance, many datasets are comprised of individual-level physiological trait data and a column `growth_form`, documenting the growth form (i.e. tree, shrub, herb, etc.) of each *species*.
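The effect can be reproduced on a tiny invented table. In this base R sketch only three of the identifying columns are used to build the identifier, for brevity, and the pivot is done with `reshape()` rather than `trait_pivot_wider`; it is an illustration of the idea, not the `{traits.build}` algorithm.

```r
# Toy long-format traits table (invented): two individual-level traits and one
# species-level trait for the same taxon
toy <- data.frame(
  dataset_id  = "Toy_2000",
  taxon_name  = "Taxon A",
  entity_type = c("individual", "individual", "species"),
  trait_name  = c("leaf_area", "leaf_length", "growth_form"),
  value       = c("12.5", "40", "shrub")
)

# Give each unique combination of identifying columns its own observation_id
key <- paste(toy$dataset_id, toy$taxon_name, toy$entity_type, sep = "|")
toy$observation_id <- sprintf("%02d", match(key, unique(key)))

# Pivot wider: one row per observation_id, one column per trait
wide <- reshape(
  toy[, c("observation_id", "entity_type", "trait_name", "value")],
  idvar = c("observation_id", "entity_type"),
  timevar = "trait_name", direction = "wide"
)
names(wide) <- sub("^value\\.", "", names(wide))
wide  # two rows: growth_form sits on its own row, with NA for the leaf traits
```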
We're developing a function, `merge_entity_types`, that collapses the pivoted data into a more condensed table, but this loses some of the metadata. This function is currently in the R file `extra_functions.R`.
```{r, eval = TRUE, message = FALSE, warning = FALSE}
Edwards_2000_pivoted_merged <-
  merge_entity_types("Edwards_2000")
```
- This function will duplicate any "higher-entity" trait values (e.g. a single species-level value is filled in for all individuals or populations).
- Metadata fields, like `entity_type` or `value_type`, are only retained if their values are identical for all measurements.
```{r, eval = TRUE, message = FALSE, warning = FALSE}
Westoby_2014_pivoted_merged <-
  merge_entity_types("Westoby_2014")
```
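The duplication of "higher-entity" values can be pictured as a simple join. The base R sketch below uses invented, already-pivoted tables and is only an approximation of the idea; it is not the code in `merge_entity_types`.

```r
# Toy pivoted tables (invented): individual-level rows and species-level rows
individuals <- data.frame(
  taxon_name = c("Taxon A", "Taxon A", "Taxon B"),
  leaf_area  = c(12.5, 14.0, 30.2)
)
species_level <- data.frame(
  taxon_name  = c("Taxon A", "Taxon B"),
  growth_form = c("shrub", "tree")
)

# Fill each species-level value in for every individual of that taxon
merged <- merge(individuals, species_level, by = "taxon_name", all.x = TRUE)
merged$growth_form  # "shrub" "shrub" "tree"
```

Note that after such a merge the `growth_form` values no longer carry their own `entity_type`, which is the metadata loss described above.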
## Interpreting trait names, taxon names
### Trait dictionary
The `{traits.build}` pipeline requires a trait dictionary that documents 4 pieces of information about each trait:
- trait name
- trait type (categorical vs numeric)
- allowable trait values (for categorical traits)
- allowable trait range and units (for numeric traits)
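As a rough picture of how these four pieces of information can be used, the base R sketch below checks values against a toy dictionary. The entries, ranges and the helper `is_valid_value` are all invented for illustration; they are not the AusTraits definitions or the `{traits.build}` validation code.

```r
# Toy dictionary (invented entries and ranges)
toy_dictionary <- list(
  leaf_length = list(type = "numeric", units = "mm", range = c(0.1, 5000)),
  growth_form = list(type = "categorical",
                     allowed_values = c("tree", "shrub", "herb"))
)

# Hypothetical check: is a value allowed for a trait under this dictionary?
is_valid_value <- function(trait_name, value, dictionary) {
  entry <- dictionary[[trait_name]]
  if (is.null(entry)) return(FALSE)  # trait not in the dictionary
  if (entry$type == "numeric") {
    x <- suppressWarnings(as.numeric(value))
    return(!is.na(x) && x >= entry$range[1] && x <= entry$range[2])
  }
  value %in% entry$allowed_values
}

is_valid_value("leaf_length", "35", toy_dictionary)     # TRUE
is_valid_value("leaf_length", "-2", toy_dictionary)     # FALSE
is_valid_value("growth_form", "shrub", toy_dictionary)  # TRUE
is_valid_value("growth_form", "bush", toy_dictionary)   # FALSE
```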
The trait dictionary embedded within AusTraits also has:
- trait labels
- trait definitions
- definitions for all categorical trait values
Together these clarify each "trait concept", which we define as: "a circumscribed set of trait measurements". Much like a taxon concept delimits a collection of organisms, a trait concept delimits a collection of trait values pertaining to a distinct characteristic of a specific part of an organism (cell, tissue, organ, or whole organism).
The [AusTraits Plant Dictionary (APD)](http://w3id.org/APD) offers detailed descriptions for all trait concepts included in AusTraits. With the APD, each trait is given a unique, resolvable identifier, allowing trait definitions to be reused and shared.
The trait dictionary also includes:
- keywords
- plant structure measured
- characteristic measured
- references
- links to the same (or similar) trait concepts in other databases and dictionaries
### Understanding taxon names
AusTraits uses the taxon names in the Australian Plant Census (APC) and the scientific names in the Australian Plant Names Index (APNI).
The R package [`{APCalign}`](https://github.com/traitecoevo/APCalign) is used to align and update taxon names submitted to AusTraits with those in the APC/APNI.
`{APCalign}` can be installed directly from CRAN:
```{r, eval = TRUE}
#install.packages("APCalign")