Why is there no datatagr::tags_df() function? #47

Closed
avallecam opened this issue Oct 8, 2024 · 2 comments


@avallecam
Member

Is your feature request related to a problem? Please describe.

linelist::tags_df() is not comparable to datatagr::labels_df()

  • linelist::tags_df() generates an output restricted to the tagged, validated columns, which safeguards downstream analysis in the outbreak analytics pipeline
  • datatagr::labels_df() generates an output that helps me to showcase the dataset with labels

Can we have a function in datatagr that keeps the power of tagging columns, so that we get a validated set of them for safe downstream analysis? Is datatagr::make_datatagr() able to create a tagged data frame? If this has been discussed elsewhere, I am happy to read it.

The reprex below compares the features of the packages.

library(datatagr)
library(linelist)
library(labelled)
library(dplyr)

# linelist ----------------------------------------------------------------

dataset <- outbreaks::mers_korea_2015$linelist

dataset %>% 
  dplyr::as_tibble() %>% 
  linelist::make_linelist(
    location = "place_infect",
    date_onset = "dt_onset"
  ) %>% 
  linelist::validate_linelist() %>% 
  linelist::tags_df()
#> # A tibble: 162 × 2
#>    date_onset location           
#>    <date>     <fct>              
#>  1 2015-05-11 Middle East        
#>  2 2015-05-18 Outside Middle East
#>  3 2015-05-20 Outside Middle East
#>  4 2015-05-25 Outside Middle East
#>  5 2015-05-25 Outside Middle East
#>  6 2015-05-24 Outside Middle East
#>  7 2015-05-21 Outside Middle East
#>  8 2015-05-26 Outside Middle East
#>  9 NA         Outside Middle East
#> 10 2015-05-21 Outside Middle East
#> # ℹ 152 more rows

# datatagr ----------------------------------------------------------------

datatagr_out <- cars %>% 
  dplyr::as_tibble() %>% 
  # Create a datatagr object
  datatagr::make_datatagr(
    speed = 'Miles per hour'
  ) %>% 
  # Validate the data are of a specific type
  datatagr::validate_datatagr(
    speed = 'numeric'
  ) %>% 
  # extract dataframe of labelled variables
  datatagr::labels_df()

datatagr_out
#> # A tibble: 50 × 2
#>    `Miles per hour`  dist
#>               <dbl> <dbl>
#>  1                4     2
#>  2                4    10
#>  3                7     4
#>  4                7    22
#>  5                8    16
#>  6                9    10
#>  7               10    18
#>  8               10    26
#>  9               10    34
#> 10               11    17
#> # ℹ 40 more rows

# The step below is not one we would normally expect to need in an analysis pipeline

datatagr_out %>% 
  # standardize column names of a data frame
  cleanepi::standardize_column_names()
#> # A tibble: 50 × 2
#>    miles_per_hour  dist
#>             <dbl> <dbl>
#>  1              4     2
#>  2              4    10
#>  3              7     4
#>  4              7    22
#>  5              8    16
#>  6              9    10
#>  7             10    18
#>  8             10    26
#>  9             10    34
#> 10             11    17
#> # ℹ 40 more rows

# labelled ----------------------------------------------------------------

var_label(cars) <- list(
  speed = 'Miles per hour'
)

cars %>% 
  labelled::var_label()
#> $speed
#> [1] "Miles per hour"
#> 
#> $dist
#> NULL

Created on 2024-10-08 with reprex v2.1.1

Describe the solution you'd like

  • create a datatagr::tags_df() function to get tagged-only and validated-only columns for downstream analysis
  • edit datatagr::labels_df() to get only the labelled columns (motivating downstream analysis restricted to labelled and validated columns only)
  • edit datatagr::labels_df() to return standardised column names (to avoid needing cleanepi downstream), possibly with labels that stay interoperable with {labelled}
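
To make the second suggestion concrete, here is a minimal base R sketch of "labelled columns only". It assumes that labels are stored in each column's "label" attribute, which is the attribute that labelled::var_label() reads. The helper name labelled_only() is hypothetical and is not part of datatagr, and the sketch does not cover the validation step.

```r
# Hypothetical helper, NOT part of datatagr: keep only the columns that
# carry a non-NULL "label" attribute.
labelled_only <- function(x) {
  has_label <- vapply(
    x,
    function(col) !is.null(attr(col, "label", exact = TRUE)),
    logical(1)
  )
  x[, has_label, drop = FALSE]
}

# Work on a copy of the built-in cars dataset and label a single column
df <- cars
attr(df$speed, "label") <- "Miles per hour"

names(labelled_only(df))
```

With only speed labelled, the unlabelled dist column is dropped from the result.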


@chartgerink
Member

In direct response to the issue title: there is no tags_df() because the "tags" naming has been dropped throughout the package (pending the rename of the package).

The functionality that remains is indeed labels_df(), and it is good to hear feedback on how it does or does not work for you 😊 We will not be reintroducing tags_df() because the naming does not fit, but I am happy to consider integrating your second suggested change ("get only the labelled columns"). It may make sense to include only the labelled and validated columns. To make that comparison, could you add a direct comparison between linelist and datatagr on the same data?

I am not sure about your third proposed change ("get standardised column names"). Wrangling variable names into a prettier format is outside the package scope. In your example, renaming speed to miles_per_hour does not necessarily make the output of labels_df() more usable if we also retain the labels. It may make sense to drop the label attribute in labels_df() and put the label information in the variable name (snake_case formatted), but not to do both. Would you be okay with dropping the labels, and with them the interoperability with labelled, in that scenario?
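
As a rough illustration of that trade-off, the base R sketch below moves each label into the column name (snake_case) and then removes the label attribute. It assumes labels live in the "label" attribute of each column, as they do for labelled::var_label(). The helper labels_to_names() is hypothetical and only illustrates the idea; it is not how labels_df() behaves.

```r
# Hypothetical helper, NOT part of datatagr: use each label as the column
# name (snake_case) and drop the "label" attribute afterwards.
labels_to_names <- function(x) {
  for (i in seq_along(x)) {
    lab <- attr(x[[i]], "label", exact = TRUE)
    if (!is.null(lab)) {
      # Lower-case, collapse runs of non-alphanumerics to "_", trim the edges
      new_name <- gsub("[^a-z0-9]+", "_", tolower(lab))
      new_name <- gsub("^_+|_+$", "", new_name)
      names(x)[i] <- new_name
      attr(x[[i]], "label") <- NULL
    }
  }
  x
}

# Work on a copy of the built-in cars dataset and label a single column
df <- cars
attr(df$speed, "label") <- "Miles per hour"

out <- labels_to_names(df)
names(out)
```

After this step the label is no longer stored as an attribute, so labelled::var_label() would have nothing to report for the renamed column.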

@Bisaloo added this to the "Before first CRAN release" milestone on Dec 9, 2024
@chartgerink
Member

This functionality is now restored 👍
