[traits.build wide table] Import as_wide_table and make part of build #19

Closed
dfalster opened this issue Jun 30, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@dfalster
Member

dfalster commented Jun 30, 2023

The austraits package has a function, as_wide_table (defined in R/as_wide_table.R), which converts the various relational tables into a single wide table (see https://github.com/traitecoevo/austraits/blob/master/R/as_wide_table.R). I suggest moving it here and making it a possible output of the build.

Benefits

A wide table is a better download target for some users, as it is

  • single table is easier for most users
  • simplifies austraits R package
  • easier to write as a standard
  • better for API

But large size

A downside of the wide table is that the file is substantially larger, e.g. for AusTraits v4.1.0

  • the current .csv output is 260 MB (most of which, 230 MB, is traits.csv)
  • by comparison, the wide table as plain .csv is 4.2 GB
@dfalster dfalster changed the title Import as_wide_table Import as_wide_table and make part of build Jun 30, 2023
@dfalster
Member Author

As noted in #20, we can drastically reduce file size by using alternative output types with better compression. Using AusTraits 4.2.1, the wide table is

  • 4.2 GB as .csv
  • 72 MB as .csv.gz
  • 26 MB as .parquet

library(dplyr); library(readr)
austraits <- readRDS("export/as_wide/austraits_wide.rds")

# Drop the id columns; readr gzips the output because of the .gz extension
austraits %>%
   select(-c("population_id", "individual_id", "temporal_id", "source_id", "location_id", "entity_context_id", "plot_id", "treatment_id", "method_id")) %>%
   write_csv("~/Downloads/austraits_wide.csv.gz")

However, if saved in parquet format, the data come in at only 24 MB

austraits %>%
   select(-c("population_id", "individual_id", "temporal_id", "source_id", "location_id", "entity_context_id", "plot_id", "treatment_id", "method_id")) %>%
   arrow::write_parquet("export/as_wide/austraits_wide.parquet")

@cboettig suggested we should not see file size as a deterrent to using the wide table, if that is preferable for some users. We can solve this with smarter compression methods.
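
The compression point can be tried on a small scale with base R alone. This is only a sketch on invented toy data (the column names and values below are made up, not taken from AusTraits), so the sizes it prints say nothing about the real wide table; it just shows a plain csv next to a gzip-compressed one written through a `gzfile()` connection.

```r
# Toy data with many repeated values, loosely mimicking a wide table.
# All names and values here are invented for illustration.
set.seed(1)
toy <- data.frame(
  taxon_name = sample(c("Banksia serrata", "Acacia longifolia"), 5000, replace = TRUE),
  trait_name = sample(c("leaf_area", "plant_height", "seed_dry_mass"), 5000, replace = TRUE),
  value = round(runif(5000), 3),
  stringsAsFactors = FALSE
)

plain_path <- tempfile(fileext = ".csv")
gz_path <- tempfile(fileext = ".csv.gz")

# Plain csv
write.csv(toy, plain_path, row.names = FALSE)

# Gzip-compressed csv, written through a gzfile() connection
con <- gzfile(gz_path, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Compare file sizes in bytes
sizes <- file.size(c(plain_path, gz_path))
names(sizes) <- c("csv", "csv.gz")
print(sizes)

# The compressed file reads back directly; read.csv() detects gzip itself
roundtrip <- read.csv(gz_path, stringsAsFactors = FALSE)
```

Because `read.csv()` opens gzip files transparently, users who only have base R can still read the compressed output.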

We can also filter before reading. It's slower to read, but saves RAM by loading only the subset you want

x <- arrow::open_dataset("export/as_wide/austraits_wide.parquet")

# Nothing is read into memory until collect(); only the matching rows are returned
y <- x %>%
  filter(taxon_name == "Banksia serrata") %>%
  collect()

We can do this on csv too

x <- arrow::open_dataset("export/as_wide/austraits_wide.csv.gz", format = "csv")

@dfalster dfalster added the enhancement New feature or request label Aug 30, 2023
@ehwenk
Collaborator

ehwenk commented Nov 16, 2023

There is a new function here that will build a single table that joins and packs all information from the relational tables into a single combined table.

Information for location properties and the various categories of context properties are packed into single columns.

I'd thought about writing a function that allowed unpacking, but am thinking that is a bad idea:

  • even though I have used a unique syntax that allows unpacking, it assumes that many symbols (including = and ;) are never used in location property or context property names, values, or descriptions. This isn't realistic.
  • the only way to unpack the information is to use R - and the entire point of the combined table is to have a single joined output for non-R users.
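
The first concern can be illustrated with a toy packing scheme. The `key=value; key=value` syntax and the helpers `pack_properties()` and `unpack_properties()` below are assumptions made up for this sketch, not the syntax or functions used by the combined table; the point is only that any delimiter-based unpacking breaks once a value contains the delimiters.

```r
# Hypothetical helper: pack a named character vector as "key=value; key=value"
pack_properties <- function(props) {
  paste(paste0(names(props), "=", props), collapse = "; ")
}

# Hypothetical helper: naive unpacking, split on ";" and then on the first "="
unpack_properties <- function(packed) {
  pairs <- trimws(strsplit(packed, ";", fixed = TRUE)[[1]])
  keys <- sub("=.*$", "", pairs)
  values <- sub("^[^=]*=", "", pairs)
  stats::setNames(values, keys)
}

# Values without delimiters survive a round trip
clean <- c(latitude = "-33.7", description = "open woodland")
clean_ok <- identical(unpack_properties(pack_properties(clean)), clean)

# A free-text value containing ";" and "=" does not
messy <- c(latitude = "-33.7", description = "burnt 2001; pH=5.5")
messy_unpacked <- unpack_properties(pack_properties(messy))
messy_ok <- identical(messy_unpacked, messy)

print(clean_ok)
print(messy_ok)
print(names(messy_unpacked))
```

In the second case the description is split in two and a spurious `pH` property appears, which is the kind of silent corruption an unpacking function would risk.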

Instead, two alternatives. In addition to the combined table output, add

  • function/output that is an unpacked combined table with many, many columns for location properties and context properties
  • new function that lets users select which location properties and context properties they want to add, through a list of values.
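
A rough sketch of the second alternative, in base R. Everything here is hypothetical: the function name `add_location_properties()`, the toy tables, and the column names (`location_id`, `location_property`, `value`) are assumptions for illustration, and a real implementation would likely need additional join keys. It adds one column per requested property and leaves all other properties out.

```r
# Hypothetical function: add only the requested location properties as columns
add_location_properties <- function(traits, locations, properties) {
  out <- traits
  for (prop in properties) {
    rows <- locations[locations$location_property == prop, c("location_id", "value")]
    # match() looks up each trait row's location; unmatched rows get NA
    out[[prop]] <- rows$value[match(out$location_id, rows$location_id)]
  }
  out
}

# Toy tables, invented for illustration
traits <- data.frame(
  taxon_name = c("Banksia serrata", "Banksia serrata", "Acacia longifolia"),
  location_id = c("01", "02", "01"),
  value = c(12.1, 10.4, 3.2),
  stringsAsFactors = FALSE
)
locations <- data.frame(
  location_id = c("01", "01", "02", "02"),
  location_property = c("latitude (deg)", "soil type", "latitude (deg)", "soil type"),
  value = c("-33.7", "sand", "-34.1", "loam"),
  stringsAsFactors = FALSE
)

# Users pass the list of properties they want; "soil type" is left out here
wide <- add_location_properties(traits, locations, c("latitude (deg)"))
print(wide)
```

Compared with a fully unpacked table, this keeps the column count under the user's control, at the cost of requiring R (or an equivalent tool) to make the selection.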

@ehwenk ehwenk changed the title Import as_wide_table and make part of build [traits.build wide table] Import as_wide_table and make part of build Jul 31, 2024
@ehwenk ehwenk added this to AusTraits Jul 31, 2024
@ehwenk ehwenk moved this to Backlog in AusTraits Jul 31, 2024
@ehwenk
Collaborator

ehwenk commented Dec 5, 2024

Completed with a combination of c5014dc and traitecoevo/austraits@02abab7

@ehwenk ehwenk closed this as completed Dec 5, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in AusTraits Dec 5, 2024