Remove duplicate flora data & add authorship to flora data #791

ehwenk · 2024-02-16T00:32:56Z

Two pieces of work on this branch that cause most data.csv files for datasets with flora data to be completely overwritten:

Add species-by-species authorship to all flora data. This adds a new line to all data.csv files for floras where it was possible to attribute each profile to its author. This was requested by the taxonomy community and has been mapped in as either "source_id" or "measurement_remarks". These are not curated; they are simply the "sources" or "authors" that could be automatically downloaded.
Filter the original flora data in AusTraits (datasets: ABRS_1981, NHNSW_2014, 2014_2, 2016, WAH_1998, SAH_2014, NTH_2014) using the following rules:

remove all woodiness, growth form, and life history values from the "original" flora scrapings, since we have complete trait value datasets for these traits (the most common error here is "vines that climb to tree tops" being designated as trees, but there are others)

remove all taxon_name x trait_name x dataset_id combinations that are in both the "original" and "new" scraped datasets; there are indeed updated values for a number of numeric traits, and in the ~100 profiles I've looked up where old and new differ, I found only 1 mistake in the newer versions. That said, the differences are a small minority: for trait x taxon x dataset values present in both old and new flora extractions, 98+% are identical.

retain all categorical data that is only in the "original" scrapings (except the three complete traits). I've spot checked lots of values and haven't found any errors - and other than growth form, woodiness, life history there isn't much overlap in the categorical traits scraped in the "original" and "new" flora datasets

For numeric traits, for trait x taxon x dataset combinations that are only in the "original" scrapings, I manually checked every data point (~8000 values across all floras) and manually corrected or dismissed incorrect values.
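
The first two rules above amount to a trait filter followed by an anti-join. A minimal base R sketch, assuming toy data frames `original` and `new` with made-up rows, and assuming the three complete traits are named `woodiness`, `plant_growth_form` and `life_history` (these names, and matching on taxon x trait within one flora pair, are illustrative assumptions, not the branch's actual scripts):

```r
# Toy stand-ins for the "original" and "new" extractions of one flora (made-up rows)
original <- data.frame(
  dataset_id = "ABRS_1981",
  taxon_name = c("Genus alpha", "Genus alpha", "Genus beta", "Genus beta"),
  trait_name = c("leaf_length", "woodiness", "leaf_length", "flowering_time"),
  value      = c("30", "woody", "55", "Jul"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  dataset_id = "ABRS_2022",
  taxon_name = "Genus alpha",
  trait_name = "leaf_length",
  value      = "32",
  stringsAsFactors = FALSE
)

# Assumed names for the three traits that have complete datasets elsewhere
complete_traits <- c("woodiness", "plant_growth_form", "life_history")

# Rule 1: drop the complete traits from the original scrapings
kept <- original[!original$trait_name %in% complete_traits, ]

# Rule 2: drop taxon x trait combinations that the new scrapings also cover
make_key <- function(d) paste(d$taxon_name, d$trait_name, sep = "|")
kept <- kept[!make_key(kept) %in% make_key(new), ]

# What remains exists only in the original scrapings and still needs checking
print(kept[, c("taxon_name", "trait_name", "value")])
```

In this toy case only the two "Genus beta" rows survive; in the real data those survivors are the values that were spot checked (categorical) or manually checked (numeric).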

Overall, this has removed ~100,000 data points. These are almost entirely true duplicates:

nrow(austraits_develop$traits)
[1] 1813898
nrow(austraits_removed$traits)
[1] 1706226
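
For reference, the difference between the two row counts quoted above can be worked out directly. This is only arithmetic on the numbers already shown; the variable names are illustrative:

```r
# Row counts copied from the console output above
n_develop <- 1813898
n_removed <- 1706226

# Rows dropped by the filtering, and their share of the develop build
n_dropped   <- n_develop - n_removed
pct_dropped <- round(100 * n_dropped / n_develop, 1)

print(n_dropped)
print(pct_dropped)
```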

...the problem is that the way to filter NTH_2014 vs new scrapings is by comparing austraits output, which means that other than `original_name` none of the information from the original dataset is retained (i.e. "un-doing" substitutions)

- have removed all leaf width, leaf lengths, plant heights EXCEPT if the taxa are completely missing from ABRS_2022, 2023. There are about 1000 values to manually check - well over 50% were wrong plant part or wrong number, so they are simply saved in the raw data folder

dfalster · 2024-02-18T23:10:54Z

Nice work @ehwenk. This is quite a big PR, so we'll need to look at it together sometime.

dfalster

Have reviewed this together with @ehwenk. Huge amount of work. It is hard to catch any issues without reviewing the original sources, so I am approving without going that far.

ehwenk added 19 commits November 28, 2023 11:42

filter SAH_2014

8331f19

filter SAH_2014 to only retain trait values that aren't also in SAH_2022, SAH_2023. As part of this, SAH_2014 is now in long-format.
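
Moving a wide table to long format and then keeping only values absent from the newer extractions could look roughly like the sketch below. It uses base R and made-up rows; the column names and the `newer` table are assumptions for illustration, not the contents of the SAH datasets:

```r
# Hypothetical wide-format table: one row per taxon, one column per trait
wide <- data.frame(
  taxon_name     = c("Genus alpha", "Genus beta"),
  leaf_length    = c("30-80", "60-150"),
  flowering_time = c("Jul-Oct", NA),
  stringsAsFactors = FALSE
)

# Stack each trait column into (taxon_name, trait_name, value) rows
trait_cols <- setdiff(names(wide), "taxon_name")
long <- do.call(rbind, lapply(trait_cols, function(tr) {
  data.frame(
    taxon_name = wide$taxon_name,
    trait_name = tr,
    value      = wide[[tr]],
    stringsAsFactors = FALSE
  )
}))
long <- long[!is.na(long$value), ]

# Hypothetical taxon x trait combinations already covered by newer scrapings
newer <- data.frame(
  taxon_name = "Genus alpha",
  trait_name = "leaf_length",
  stringsAsFactors = FALSE
)

# Retain only values that the newer scrapings do not provide
make_key <- function(d) paste(d$taxon_name, d$trait_name, sep = "|")
retained <- long[!make_key(long) %in% make_key(newer), ]
print(retained)
```

Long format makes this comparison a simple key lookup, since each row carries its own taxon and trait name.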

SAH_2014, correction

87d9697

Update NTH_filter_duplicate_data.R

dc061bd

WAH_1998, WAH_2016

33830d6

easy, because virtually no numeric traits, almost all trait values that were only in WAH_1998 were flowering times

most of NHNSW work

1464c83

still need to redo metadata files

update metadata files for NHNSW scrapings

b72217a

more tweaks to excluded/included old data

eef8c81

edits, while proofing old data

d442282

manually check values in NTH_2014, SAH_2014

f1bca30

remove duplicate, erroneous values

3608d6e

Update ABRS_filter_duplicate_data.R

a1a95f6

update ABRS_1981 data to remove errors, duplicates

8189673

Add authorship to ABRS datasets

b1cb796

- merge in authorship info for individual profiles, under `measurement_remarks`
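
A merge of this kind can be sketched with a lookup on taxon name. The example below is a hedged illustration in base R: the `authors` table, the `profile_author` column and the "profile author: " prefix are hypothetical, and only the `measurement_remarks` column name comes from the commit message:

```r
# Toy trait rows for one flora dataset (made-up values)
traits <- data.frame(
  taxon_name = c("Genus alpha", "Genus alpha", "Genus beta"),
  trait_name = c("leaf_length", "flowering_time", "leaf_length"),
  value      = c("30", "Jul", "55"),
  stringsAsFactors = FALSE
)

# Hypothetical per-profile authorship table; not every taxon has an author
authors <- data.frame(
  taxon_name     = "Genus alpha",
  profile_author = "A. Author",
  stringsAsFactors = FALSE
)

# Look up each row's author; match() keeps the row order of traits intact
idx <- match(traits$taxon_name, authors$taxon_name)
traits$measurement_remarks <- ifelse(
  is.na(idx),
  NA_character_,
  paste0("profile author: ", authors$profile_author[idx])
)
print(traits)
```

Using `match()` rather than `merge()` avoids reordering rows, which keeps diffs of data.csv easier to read; taxa without a known author simply get `NA`.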

add authors to SAH, NTH datasets

ee0d52f

add authors to NHNSW datasets

1768f0f

multiple bugs fixed

082e982

multiple bugs fixed now only 1.56 million rows of data.... there were lots of duplicates

add authors to RBGV datasets

76b2a2a

ehwenk requested a review from dfalster February 16, 2024 00:33

Update build.R

dbc4b22

ehwenk and others added 2 commits May 13, 2024 11:52

minor metadata edit

9727c2d

Merge branch 'develop' into remove_duplicate_flora_data

2768db4

dfalster approved these changes May 14, 2024

ehwenk merged commit 36f3855 into develop May 14, 2024
1 check passed

ehwenk deleted the remove_duplicate_flora_data branch May 14, 2024 01:57
