Remove duplicate flora data & add authorship to flora data #791

Merged
merged 22 commits into from
May 14, 2024

Collaborator

@ehwenk ehwenk commented Feb 16, 2024

Two pieces of work on this branch cause most data.csv files for datasets with flora data to be completely overwritten:

  • add species-by-species authorship to all flora data. This adds a new line to all data.csv files for floras where it was possible to add attribution for who wrote each profile. This was requested by the taxonomy community and has been mapped in as either "source_id" or "measurement_remarks". These are not curated; they are simply the "sources" or "authors" that could be automatically downloaded.
  • filter the original flora data in AusTraits (datasets: ABRS_1981, NHNSW_2014, 2014_2, 2016, WAH_1998, SAH_2014, NTH_2014) using the following rules:
  1. remove all woodiness, growth form, and life history values from the "original" flora scrapings, since we have complete trait value datasets for these traits (the most common error here is "vines that climb to tree tops" being designated as trees, but there are others)
  2. remove all taxon_name x trait_name x dataset_id combinations that are in both the "original" and "new" scraped datasets. There are indeed updated values for a number of numeric traits, and in the ~100 profiles I've looked up where old and new differ, I found only 1 mistake in the newer versions. That said, the differences are a small minority: for trait x taxon x dataset values present in both old and new flora extractions, 98+% are identical.
  3. retain all categorical data that is only in the "original" scrapings (except the three complete traits). I've spot-checked lots of values and haven't found any errors, and other than growth form, woodiness, and life history there isn't much overlap in the categorical traits scraped in the "original" and "new" flora datasets
  4. for numeric traits, for trait x taxon x dataset combinations that are only in the "original" scrapings, I manually checked every data point (~8000 values across all floras) and manually corrected or dismissed incorrect values.
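
For reviewers, rules 1 and 2 can be sketched roughly as below. This is an illustration only, not the code used on this branch: the toy values are invented, the helper `filter_original()` is hypothetical, the trait names in `complete_traits` are assumed, and matching is done on taxon_name x trait_name within one old/new flora pairing.

```r
# Toy "original" and "new" scrapings for one flora pairing (values are invented).
original <- data.frame(
  dataset_id = "SAH_2014",
  taxon_name = c("Acacia aneura", "Acacia aneura",
                 "Hakea leucoptera", "Hakea leucoptera"),
  trait_name = c("woodiness", "leaf_length", "leaf_length", "flowering_time"),
  value      = c("woody", "50", "40", "spring"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  dataset_id = "SAH_2022",
  taxon_name = "Acacia aneura",
  trait_name = "leaf_length",
  value      = "55",
  stringsAsFactors = FALSE
)

# Traits with complete datasets elsewhere (rule 1); names assumed for illustration.
complete_traits <- c("woodiness", "plant_growth_form", "life_history")

# Hypothetical helper: drop complete traits (rule 1) and any taxon x trait
# combination that also appears in the new scraping (rule 2).
filter_original <- function(original, new, complete_traits) {
  not_complete <- !(original$trait_name %in% complete_traits)
  key_original <- paste(original$taxon_name, original$trait_name, sep = " | ")
  key_new      <- paste(new$taxon_name, new$trait_name, sep = " | ")
  not_in_new   <- !(key_original %in% key_new)
  original[not_complete & not_in_new, , drop = FALSE]
}

filtered <- filter_original(original, new, complete_traits)
print(filtered)
```

In this toy case only the two Hakea leucoptera rows survive: the woodiness row is dropped by rule 1 and the Acacia aneura leaf_length row by rule 2. The rows that remain are the ones rules 3 and 4 then retain or send to manual checking.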

Overall, this has removed ~100,000 data points. These are almost entirely true duplicates:

nrow(austraits_develop$traits)
[1] 1813898
nrow(austraits_removed$traits)
[1] 1706226
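
As a quick check on the "~100,000" figure, the two row counts above can be compared directly. The variable names here are just for illustration; the counts are the ones reported above.

```r
# Row counts copied from the nrow() output above.
rows_before <- 1813898  # austraits_develop$traits
rows_after  <- 1706226  # austraits_removed$traits

rows_removed    <- rows_before - rows_after
percent_removed <- round(100 * rows_removed / rows_before, 1)

cat(rows_removed, "rows removed,", percent_removed, "% of the develop build\n")
```

That works out to 107,672 rows, or about 5.9% of the develop build.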

filter SAH_2014 to only retain trait values that aren't also in SAH_2022, SAH_2023. As part of this, SAH_2014 is now in long-format.
...the problem is that the way to filter NTH_2014 vs new scrapings is by comparing austraits output, which means that other than `original_name` none of the information from the original dataset is retained (i.e. "un-doing" substitutions)
- have removed all leaf width, leaf lengths, plant heights EXCEPT if the taxa are completely missing from ABRS_2022, 2023. There are about 1000 values to manually check - well over 50% were wrong plant part or wrong number, so they are simply saved in the raw data folder
easy, because virtually no numeric traits, almost all trait values that were only in WAH_1998 were flowering times
still need to redo metadata files
- merge in authorship info for individual profiles, under `measurement_remarks`
multiple bugs fixed

now only 1.56 million rows of data.... there were lots of duplicates
@ehwenk ehwenk requested a review from dfalster February 16, 2024 00:33
@dfalster
Member

Nice work @ehwenk. This is quite a big PR, so we'll need to look at it together sometime.

@dfalster dfalster left a comment

Have reviewed this together with @ehwenk. Huge amount of work. It is hard to catch any issues without reviewing the original sources, so I am approving without going that far.

@ehwenk ehwenk merged commit 36f3855 into develop May 14, 2024
1 check passed
@ehwenk ehwenk deleted the remove_duplicate_flora_data branch May 14, 2024 01:57