Add metadata #19

Open
roaldarbol opened this issue Sep 11, 2024 · 2 comments
Labels
enhancement New feature or request


roaldarbol commented Sep 11, 2024

I've restructured this issue in light of metadata being an option in Parquet files - anything prior is found below the line.

It turns out metadata can be saved in Parquet files, so I'd still go for Parquet as our storage format. The question now is: what metadata should we support by default?


Would be a good way to e.g. keep track of units (frame/s or dots/pixels/cm). See https://stackoverflow.com/a/68675903. Metadata could also be attached to the data frame itself (through attr()), e.g. the starting timestamp when present. That way we can always dig it out even after converting to seconds since start, so we can convert back and forth between absolute and relative time.
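A minimal sketch of that idea in base R, assuming a `units` and a `start_time` attribute (both names are made up for illustration, as is the example data):

```r
# Hypothetical tracking data: time already converted to seconds since start.
df <- data.frame(time = c(0, 0.5, 1.0), x = c(10, 12, 15))

# Attach per-column units and the absolute start time to the data frame itself.
attr(df, "units") <- c(time = "s", x = "px")
attr(df, "start_time") <- as.POSIXct("2024-09-11 12:00:00", tz = "UTC")

# Relative -> absolute time: add the offsets (in seconds) to the start time.
absolute <- attr(df, "start_time") + df$time

# Absolute -> relative time: difference to the start time, in seconds.
relative <- as.numeric(difftime(absolute, attr(df, "start_time"), units = "secs"))
```

As long as the attribute survives, the conversion is lossless in both directions.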

roaldarbol moved this to 🔬 Triage in animovement progress Sep 15, 2024
roaldarbol added the good first issue and enhancement labels and removed the good first issue label Sep 15, 2024
roaldarbol added this to the Version 1.0 milestone Sep 16, 2024

roaldarbol commented Sep 18, 2024

So attr seems not to be persistent (good thread: https://discuss.ropensci.org/t/creating-persistent-metadata-for-an-r-object-for-data-provenace/1260/2). An alternative may be adding the metadata as actual columns, following the "metadata is also data" logic. That way it shouldn't disappear (except with summarise, some pivots or joins maybe, and selects of course). Is there a good way to think about that? And a way to write metadata to a file? Maybe keep the UID and make a second data frame that links the UID to the metadata?

https://x.com/i_steves/status/1017569725340151809 also
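The UID-linked second data frame could look roughly like this. It is only a sketch in base R; the column names and values are hypothetical:

```r
# Hypothetical tracking data carrying a UID column (one UID per recording).
tracks <- data.frame(
  uid  = c("rec-001", "rec-001", "rec-002"),
  time = c(0, 0.5, 0),
  x    = c(10, 12, 3)
)

# Second data frame: one row of metadata per UID.
meta <- data.frame(
  uid           = c("rec-001", "rec-002"),
  source        = c("DLC", "SLEAP"),
  sampling_rate = c(30, 60)
)

# A summarise-style operation keeps only the grouping key and the result...
mean_x <- aggregate(x ~ uid, data = tracks, FUN = mean)

# ...but the metadata can be recovered afterwards by joining on the UID.
mean_x_meta <- merge(mean_x, meta, by = "uid")
```

The trade-off is that the UID column itself has to survive every operation, so a select that drops it would still lose the link.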

Otherwise, I'd always need to read the attributes, save them and reattach them at the end of each of my functions. That could enable logging of everything that happens to the data, e.g. which smoothing or interpolation was applied, the number of missing values, and so on. That would be cool IMO! It might require using {rhdf5} to export the data including the metadata, or saving all of it as columns with meta_ prepended?
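The read, save and reattach pattern could be wrapped once instead of repeated in every function. A rough sketch, where `with_metadata()` and the `log` attribute are hypothetical names rather than anything that exists in the package:

```r
# Hypothetical helper: run `fun` on `data`, then restore the custom attributes
# and append a label describing the operation to a "log" attribute.
with_metadata <- function(data, fun, label, ...) {
  keep <- attributes(data)
  # Structural attributes belong to the result of `fun`, so leave them alone.
  keep <- keep[setdiff(names(keep), c("names", "row.names", "class"))]

  out <- fun(data, ...)

  for (nm in names(keep)) attr(out, nm) <- keep[[nm]]
  attr(out, "log") <- c(keep[["log"]], label)
  out
}

df <- data.frame(time = c(0, 0.5, 1), x = c(10, NA, 15))
attr(df, "units") <- c(time = "s", x = "px")

# Row subsetting is one of the operations that may not carry custom attributes
# along, so it is run through the wrapper here.
cleaned <- with_metadata(
  df,
  function(d) d[!is.na(d$x), ],
  label = "dropped rows with missing x"
)
```

Each further call appends to `attr(cleaned, "log")`, which gives a simple record of what has been done to the data.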


roaldarbol commented Nov 5, 2024

It turns out metadata can be saved in Parquet files, so I'd still go for Parquet as our storage format. The question now is: what metadata should we support by default?

  • Column labels (https://larmarange.github.io/labelled/)
  • Column units (times, positions)
  • Metadata
    • UUID (generated by us or movement - or any of the tracking software if they do this)
    • Date
    • Source (DLC, SLEAP, idtracker, etc.)
    • Source version
    • Filename(s) (video names, files used to create the movement data file)
    • Sampling rate (e.g. fps for video or for mouse sensors)
    • Start time (datetime timestamp)
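To make the list above concrete, here is one possible shape for the default metadata as a plain R list. The constructor name `new_metadata()` and the field names are assumptions that simply mirror the bullets, not a settled schema:

```r
# Hypothetical constructor: every field defaults to a typed "missing" value,
# so the metadata always has the same structure even when little is known.
new_metadata <- function(uuid = NA_character_,
                         date = as.Date(NA),
                         source = NA_character_,
                         source_version = NA_character_,
                         filenames = character(),
                         sampling_rate = NA_real_,
                         start_time = as.POSIXct(NA)) {
  list(
    uuid = uuid,
    date = date,
    source = source,
    source_version = source_version,
    filenames = filenames,
    sampling_rate = sampling_rate,
    start_time = start_time
  )
}

# Example values are made up for illustration.
meta <- new_metadata(source = "DLC", filenames = "video_01.mp4", sampling_rate = 30)

df <- data.frame(time = c(0, 1, 2), x = c(10, 12, 15))
attr(df, "metadata") <- meta
```

Keeping everything in a single `metadata` attribute means only one attribute has to be preserved across operations.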

Maybe we should try to add an Operations category too, so that the operations performed on the data automatically get saved with it (e.g. how the data was filtered/smoothed, what the average confidence is, etc.). Maybe as a kind of data integrity category.


For the record, this is how metadata is added, accessed, saved and read back:

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- data.frame(col1 = 2:4, col2 = c(0.1, 0.3, 0.5))

attributes(df)$units <- c("seconds", "minutes")
attributes(df)$metadata <- list(
  one_thing = "here",
  another = "there",
  a_vector = c(1,2,3,4,5),
  a_double_vector = c(1.22,1.55)
)

attributes(df)$metadata
#> $one_thing
#> [1] "here"
#> 
#> $another
#> [1] "there"
#> 
#> $a_vector
#> [1] 1 2 3 4 5
#> 
#> $a_double_vector
#> [1] 1.22 1.55
write_parquet(df, "test.parquet")
a <- read_parquet("test.parquet", as_data_frame = TRUE)
attributes(a)
#> $names
#> [1] "col1" "col2"
#> 
#> $row.names
#> [1] 1 2 3
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"
#> 
#> $units
#> [1] "seconds" "minutes"
#> 
#> $metadata
#> $metadata$one_thing
#> [1] "here"
#> 
#> $metadata$another
#> [1] "there"
#> 
#> $metadata$a_vector
#> [1] 1 2 3 4 5
#> 
#> $metadata$a_double_vector
#> [1] 1.22 1.55

Created on 2024-11-06 with reprex v2.1.1
