Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design data extract and data merge #36

Open
pawelru opened this issue Apr 12, 2022 · 12 comments
Open

Design data extract and data merge #36

pawelru opened this issue Apr 12, 2022 · 12 comments

Comments

@pawelru
Copy link
Collaborator

pawelru commented Apr 12, 2022

Motivation

Data extract/data merge has been criticised by many parties mainly for opaqueness and complicated API. The main goal of the refactor should be:

  1. Remove independent filtering on the selector level, to avoid scenarios like below where ADLB is merged to ADLB. No other data-visualization tools offers such a functionality. Possible data transformation (reshape or filter) should be done once for dataset before variables selection. Independent filtering also complicates API call to initialise encoding panel. data_extract_spec sometimes covers more than hanf of the code space:

    a <- data_extract_spect("ADSL", select_spec(...))
    b <- data_extract_spec("ADLB", filter_spec(<PARAMCD>), filter_spec(<AVISIT>), select_spec(...))
    c <- data_extract_spec("ADLB", select_spec(...))
    Screenshot 2022-05-18 at 06 37 53

    Selecting variables to the module should be minimalized to specifying variables only (without data transformation). Still open question whether we allow to select from different datasets (support automatic merge) or allow to select from one dataset only (no automatic merge).

    a <- select_variables(<select from numeric vars in ADSL>)
    b <- select_variables(<AVAL from ADLB>)
    c <- select_variables(<select from factor vars in ADSL>)
  2. If we decide to keep offering automatic merge we need to provide UI element informing that there is merge and how it's performed (used keys, shape of the final ANL)

Solutions in the comments.

@pawelru pawelru added the teal label Apr 12, 2022
@gogonzo
Copy link

gogonzo commented May 18, 2022

1. Interactive metatransform

image

Data transformation settings

This solution allows app developer creating the app with selection from single dataset only but with option to make a custom merge. This solution gives app-developer possibility to specify custom ui and srv functions where inputs are bind with expression. For example see the code to initialize custom data transformation.

tbl <- list(
  expr = {
    ADTTE <- ADTTE %>%
      filter(!!adtte_filter_variable %in% !!adtte_filter_levels)

    ADRS <- ADRS %>%
      filter(!!adrs_filter_variable %in% !!adrs_filter_levels) %>%
      pivot_wider(!!adrs_pivot_by_variable, !!adrs_pivot_by_value)

    ABCD <- merge(ADSL, ADTTE, ADTRS)
  },

  ui = function(id, data = <list>) {
    div(
      selectInput("adtte_filter_variable", data$adsl),
      ...
    )
  },
  srv = function(input, output, session, datasets, ...) {
    reactive(inputs)
  }
)

Above data-transformation settings will be resulting in following ui elements which then are replacing !!<input name> in the given expression. Please note that app might not need any data-tranformation in advance, so this doesn't have to be always specified.

image

Variables selection

Above gives app developer entire freedom to make a new datasets for the module and also gives a full control over merge.
If app developer can create any dataset including merged ANL, this means that we don't need to support automatic data merge. This simplifies the variable selection which can be limited to single dataset. I call it "choices_selected" here just to indicate what it is - it's something similar to choices selected but with dataname attribute.
For example

a <- choices_selected(<select from numeric vars in ABCD>)
b <- choices_selected(<AVAL from ABCD>)
c <- choices_selected(<select from factor vars in ABCD>)

New Wireframe 1 (2)

calliing modules

Select instructions a, b, c are passed in the similar way as data_extract_spec and transform instructions are passed to the new argument (let's call it tbl here). In the ui we need module for data tranformation and for inputs selection.

tm_scatterplot_ui <- function(id, a, b, c, tbl)
  div(
    interactive_transform_ui("tbl", tbl),
    choices_selected_ui(a, b, c),
    ...
  )

In the server part we need the equivalent modules calls. Data transform module returns the expression which then needs to be evaluated somewhere (chunks). "choices_selected" (

tm_scatterplot_srv <- function(id, x, y, c, tbl) 
  # call data tranformation - return expression
  transform <- interactive_transform_srv("tbl", tbl, <data>)
 
 # evaluate expression in chunks, shinymeta or anything else
  eval(transform$expr(), env)
  
  # resolve and validate inputs selection
  choices <- choices_selected_srv(<env>, x, y, c)

  # evaluate visualization based on the inputs
  eval(
    !!choices$dataname %>%
      plot(x = !!choices$x, y = !!choices$y, c = !!choices$c),
    env
  )

Conclusions about this way

  • ✔️ Full control over data transformation
  • ✔️ No automatic merge and simplest api for "choices_selected" where each variable is selected from single dataset.
  • Is shiny and metaprogramming combines easy to handle by app-developer?
  • ❌ Burden of data transformation on the app developer. Note: can we cover most of the scenarios by wrappers?
  • ❌ No way to conditionally change some parts of expression - for example when no filter variable is selected then filter should be removed instead of having filter(NULL = NULL).
  • ❌ App developer may need to merge data in advance instead of automatic merge. Please note that app developer knows the data.
  • App user doesn't know how the merged data are created.

@gogonzo
Copy link

gogonzo commented May 18, 2022

2. Interactive metatransform 2

image

data transformation settings

This solution is almost the same as (1) with that difference that expression needs to be done in the way that app-developer doesn't assign final dataset but returns it from the expr.

tbl <- list(
  expr = {
    ADTTE <- ADTTE %>%
      filter(!!adtte_filter_variable %in% !!adtte_filter_levels)

    ADRS <- ADRS %>%
      filter(!!adrs_filter_variable %in% !!adrs_filter_levels) %>%
      pivot_wider(!!adrs_pivot_by_variable, !!adrs_pivot_by_value)

    merge(ADSL, ADTTE, ADTRS)
  },

  ui = function(id, data = <list>) {
    div(
      selectInput("adtte_filter_variable", data$adsl),
      ...
    )
  },
  srv = function(input, output, session, datasets, ...) {
    reactive(inputs)
  }
)

Variables selection

Above means that dataname used in the module is unknown by the app-developer, which implies that one doesn't need to specify where the choices coming from.

a <- <select numeric>
b <- <select numeric>
c <- <select factor>

Calling modules

ui and srv are called almost in the same way but there is dataname argument which controls which dataset is used to generate output

tm_scatterplot_ui <- function(id, x, y, c, tbl)
  div(
    interactive_transform_ui("tbl", tbl),
    choices_selected_ui(x, y, c),
    ...
  )
tm_scatterplot_srv <- function(id, x, y, c, tbl, dataname = "ANL") 
transform <- interactive_transform_srv("tbl", tbl, <data>)

# expression from transform is assigned to the name specified in the module
if (!<empty expr>) {
   eval(!!dataname <- !!transform$expr(), env)
}
choices <- choices_selected_srv(<env>$<dataname>, x, y, c)

eval(
  !!dataname %>%
    plot(x = !!choices$x, y = !!choices$y, c = !!choices$c),
  env
)

Conclusions (comparing to (1))

  • ✔️ App developer doesn't need to specify dataname
  • Module developer have a little more to do with assigning data transformation code to the .

@gogonzo
Copy link

gogonzo commented May 18, 2022

3. Split data_extract_spec

The solution is to simply split data_extract so that filter_spec and select_spec will be independent and will go to the module separately.

image

Data tranformation settings

App developer will need to specify the filters separately. Probably reshape_spec will be needed.

tbl <- list(
  filter_spec("ADTTE", vars = "PARAMCD"),
  filter_spec("ADRS", vars = "AVISIT"),
  reshape_spec("ADRS", vars = "PARAMCD")
)

New Wireframe 1 (1)

Variables selection

App developer will be able to specify possible selections from any datasets

a <- <select numeric from ADSL or ADTTE>
b <- <select numeric from ADTTE>
c <- <select factor from ADSL>

New Wireframe 1 (2)

Calling modules

Modules are called in similar way to (1) and (2) but tbl objects are completely different. Also because there will be automatic merge, we probably need UI to visualise the merge.

tm_scatterplot_ui <- function(x, y, c, tbl)
  div(
    interactive_transform_ui("tbl", tbl),
    choices_selected_ui(id, x, y, c),
    merge_choices_ui(id)
    ...
  )
tm_scatterplot_srv <- function(...) 
  # returns expression for selected filters
  transform <- interactive_transform_srv("tbl", tbl, <data>)

  # evaluate expression in env (chunks or equivalent)
  eval(transform$expr(), env)
  
  # choices same as in (1) and (2)
  choices <- choices_selected_srv(env, a, b, c)
  
  # merge returns expression
  merged <- merge_choices(choices, join_keys)
  eval(merged$expr, env)
 
  eval(
    !!merged$dataname %>%
      plot(a = !!merged$a, b = !!merged$b, c = !!merged$c),
    env
  )

Conclusions

  • ✔️ Easier of app developer to set the app
  • ✔️ Possible to visualise merge so the app user can see the way how ANL was created
  • ✔️ Lower cost of refactor
  • ❌ Limited for custom data transformation
  • ❌ Variable choices are not final choices in ANL - merge can result in change of variables which have the same names in merged datasets.
  • ❌ Filters in the encoding-panel and filters in the filter-panel is confusing. Can we put module-filters to the filter-panel but distinguish them visually from the global filters? Maybe we should have a filter-panel module specific only?

@gogonzo
Copy link

gogonzo commented May 18, 2022

Solution 4 - module filters in the filter-panel

Another solution is to have right hand-side panel containing "global" and "module specific" filters.

image

Data transformation settings

App developer would specify filters in the same way as specifies the "global" filters - by specifying a list (see teal::init filters argument). For reshape we might use similar instructions.

tbl <- list(
  filter = list(ADRS = list(PARAMCD = ...)),
  reshape = ...
)

Variables selection

a <- <select numeric from ADSL or ADRS>
b <- <select numeric from ADRS>
c <- <select factor from ADSL>

Calling modules

interactive_transform_ui won't be needed here as tbl$filter instructions will communicate with the filter-panel.

tm_scatterplot_ui <- function(x, y, c, tbl)
  div(
    choices_selected_ui(id, x, y, c),
    merge_choices_ui(id)
    ...
  )
tm_scatterplot_srv <- function(..., tbl) 
  # send filter instructions to filter-panel
  set_filter_state(tbl, module = <this module>)
  
  # choices same as in (1) and (2)
  choices <- choices_selected_srv(env, a, b, c)
  
  # merge returns expression
  merged <- merge_choices(choices, join_keys)
  eval(merged$expr, env)
 
  eval(
    !!merged$dataname %>%
      plot(a = !!merged$a, b = !!merged$b, c = !!merged$c),
    env
  )

Conclusions

  • API for setting filters the same as in filter panel
  • One place for filters
  • ❌ Difficult task to refactor filter panel to be "global" and "module" specific
  • ❌ Not clear how to solve reshape in the filter-panel?

@gogonzo

This comment was marked as off-topic.

@gogonzo
Copy link

gogonzo commented May 19, 2022

I guess we can agree to the one point. @kpagacz @mhallal1 @Polkas @nikolas-burkoff

Module should receive the instructions of possible selections for it's fixed inputs.
For example, if module needs to select a, b, c for the visualization, app developer should specify something to choose from. "choices_selected" is the proposition. The question is now, if the user could select from the same dataset or multiple datasets. This will have a consequence in having or not having the merge.

a <- choices_selected(<variables from ADSL>)
b <- choices_selected(<variables from ADSL>)
c <- choices_selected(<variables from ADSL>)
a <- list(choices_selected(<variables from ADSL>), choices_selected(<variables from ABCD>))
b <- choices_selected(<variables from ADTTE>)
c <- choices_selected(<variables from ADTTE>) 

This instructions are for the fixed module inputs so extra "filter" or "reshape" is not involved here. It's not said that this will be named a choices_selected, but the point is that module needs to receive some instructions of what are the possible choices for a, b , c.


If we agree to above that modules needs only instructions where to take the "choices" for inputs then we can resolve this:

quosure-Page-11

Feel free to modify if you thing some node is missing
https://app.diagrams.net/#G1z7RTPvBB9KYn13cN7AZmgaTE46tfcf95

@nikolas-burkoff
Copy link

nikolas-burkoff commented May 19, 2022

So here is an app which is quite painful to specify at the moment (but seems to be quite a reasonable thing to want to do, to create a cross table between ADSL and another dataset) which requires a merge between ADSL and the other dataset (as ADSL has one row per subject and the others don't)

Comments:

  1. The end user specifies the join type - this gives them good control but as this is CDISC data the join should be obvious (or at least the obvious join should be obvious? - but clearly in general data it may or may not make sense for the end user to choose this)

  2. The SRC has some information I don't need (some of A - but I assume that could be fixed?) but does create the one join I need (and handles adding the labels back etc) (B) and the module developer has access to the selected values to use in a title which is nice (C)
    image

  3. The app developer is having to specify multiple (and ordered) but they can be left to the module developer to decide

  4. The api is quite complex you are essentially telling the module you want:

x = everything but dates in ADSL
y = all factors in ADSL, ADRS, ADTTE and ADLB

and these two statements take 50 lines to express to the module (!)

I'd be interested to see how this sort of thing can be done in any new approach - whether that's a small change to an api or a larger refactor which could involve say the merge taking place outside of the module code completely.

library(scda)
library(teal.modules.general)

ADSL <- synthetic_cdisc_data("latest")$adsl
ADRS <- synthetic_cdisc_data("latest")$adrs
ADTTE <- synthetic_cdisc_data("latest")$adtte
ADLB <- synthetic_cdisc_data("latest")$adlb


f1 <- function(data) {
  idx <- !vapply(data, inherits, logical(1), c("Date", "POSIXct", "POSIXlt"))
  return(names(data)[idx])
}

f2 <- function(data) {
  idx <- vapply(data, is.factor, logical(1))
  return(names(data)[idx])
}

app <- init(
  data = cdisc_data(
    cdisc_dataset("ADSL", ADSL, code = "ADSL <- synthetic_cdisc_data(\"latest\")$adsl"),
    cdisc_dataset("ADRS", ADRS, code = "ADRS <- synthetic_cdisc_data(\"latest\")$adrs"),
    cdisc_dataset("ADTTE", ADTTE, code = "ADTTE <- synthetic_cdisc_data(\"latest\")$adtte"),
    cdisc_dataset("ADLB", ADLB, code = "ADLB <- synthetic_cdisc_data(\"latest\")$adlb"),
    check = TRUE
  ),
  modules = modules(
    tm_t_crosstable(
      label = "Cross Table",
      x = list(
        data_extract_spec(
          dataname = "ADSL",
          select = select_spec(
            label = "Select variable:",
            choices = variable_choices(ADSL, subset = f1),
            multiple = TRUE
          )
        ),
        data_extract_spec(
          dataname = "ADRS",
          select = select_spec(
            label = "Select variable:",
            choices = variable_choices(ADRS, subset = f1),
            multiple = TRUE
          )
        ),
        data_extract_spec(
          dataname = "ADTTE",
          select = select_spec(
            label = "Select variable:",
            choices = variable_choices(ADTTE, subset = f1),
            multiple = TRUE
          )
        ),
        data_extract_spec(
          dataname = "ADLB",
          select = select_spec(
            label = "Select variable:",
            choices = variable_choices(ADLB, subset = f1),
            multiple = TRUE
          )
        )
      ),
      y = data_extract_spec(
        dataname = "ADSL",
        select = select_spec(
          label = "Select variable:",
          choices = variable_choices(ADSL, subset = f2),
          selected = "SEX"
        )
      )
    )
  )
)
shinyApp(app$ui, app$server)

@gogonzo
Copy link

gogonzo commented May 19, 2022

The api is quite complex you are essentially telling the module you want:

x = everything but dates in ADSL
y = all factors in ADSL, ADRS, ADTTE and ADLB
....
I'd be interested to see how this sort of thing can be done in any new approach

Because you allow here to select from multiple datasets, it means that merge after selection is required. Available to achieve with solution 3:

x <- list(<select not dates from ADSL, ADTTE, ADRS, ...>)
y <- <select factors from ADSL>

@nikolas-burkoff your example is basically to use select_spec from data_extract which can refer to "split of data extract". Filter can be specified in some other place but doesn't have to. And question is, where it should be specified, in my opinion it should be module-specific (but doesn't have to be specified in the module)

@gogonzo
Copy link

gogonzo commented May 20, 2022

Meeting summary:

After planned meeting I had a ad hoc meeting with @nikolas-burkoff and I think I have clearer view on all comments addressed by @kpagacz and @nikolas-burkoff. Let's have some specific points in this discussion and focus on the key points of the "data-extract". I'm almost sure that above 4-propositions contain almost everything we are looking for, we probably need to shuffle them little bit.

Let's start from the most basic conclusion and go step-by-step.

  1. Custom selection in the teal_module (constant) inputs

    I'm certain that teal_module needs some fixed inputs which require some instructions from app developer about possible choices for these inputs. Let's call it "choices_selected"

    a <- choices_selected(<give me numeric variables>)
    b <- choices_selected(<give me AVAL>)
    c <- choices_selected(<give me factor variables>)
    
    tm_custom(a, b, c)
    tm_custom_srv(a, b, c, data)
    # choices a, b, c are resolved in the teal_module to update module "fixed" inputs. 
  2. How many datasets can be used in the teal_module selection
    Having above as certain we can think if the choices can be taken from one dataset or from multiple datasets. This implies whether choices should be merged or not. There is no other way to select from multiple datasets and merge. So this would look like:

    a <- choices_selected(<give me numeric variables from ADSL>)
    b <- choices_selected(<give me AVAL ADTTE>)
    c <- choices_selected(<give me factor variables ADSL>)
    
    tm_custom(a, b, c)
    tm_custom_srv(a, b, c, data, join_keys)
    # choices a, b, c are resolved in the teal_module to update module "fixed" inputs. 
    # choices are used to make a merge (probably call)

    It's debatable where merge should be done. I think merge_choices_srv should be in tm_module_srv which accepts a, b, c.
    If we wan't merge to be done before calling tm_module_srv - we need a callback which I'm not a fan of.

    Alternatively we might force module to always use "choices_selected" from the single dataset. Then the merge must be
    done somewhere because app developer have to be allowed to use variables form multiple datasets. It doesn't mean
    that the merge can be done somewhere.

    a <- choices_selected(<give me numeric variables from XXXX>)
    b <- choices_selected(<give me AVAL XXXX>)
    c <- choices_selected(<give me factor variables XXXX>)
    
    tm_custom_srv(a, b, c, data = ABCD)
  3. How detailed should be module specific data tranformation instructions
    Does teal_module need a module specific data transformation ? Not always, but often yes. Question is does it need to be interactive? I'm almost sure that it should be. This means that we need to give possibility for specifying some instructions by the app-developer - this means tbl argument is needed. tbl can be very detailed or simplified. Let's see the alternatives

    Data transformation requires very detailed instruction where app developer specifies ui and server. See Solution 1-2

    tbl <- list(expr, ui, srv)

    We can use filter_spec if we need to filter by some set of variables.

    tbl <- list(filter_spec(...), filter_spec(), reshape_spec())

    If app developer needs filter on a specific variable - then it might be good idea to allow instructions same as in filter panel (see set_filter_state or filter argument in teal::init)

    tbl <- list(filter = list(ADSL = ....))
  4. Where data_transformation should be placed?
    I think the key in the discussion was rather "Does teal_module needs data transformation?". And I think it was a good
    point and I agree that modules should accept ready-to-use data and not perform additional data tranformation inside. But
    this doesn't have to reject previous point (3). tbl still can be used but not on as a module argument.

    tbl <- <tranformation instructions>
    
    teal::init(
      tm_module("my_module", a, b, c),
        # tm_module_srv(a, b, c, data)
      ....,
      tbl = list(my_module = 
    

    Alternatively, module receives instructions for data_transformation and needs to consume this within.

    tbl <- <data tranformation>
    
    teal::init(
      tm_module("x", a, b, c, tbl = tbl)
     )

My personal view on this is as follows:

  • I agree with the idea that modules should receive ready-to-use data which means that module-specific-data-transformation should be done before calling module
  • I'm supporting idea of having choices from multiple datasets and thus do auto-merge in the teal_module

This means that teal_module (server part) would look like this:

tm_module_srv(
  a , b, c <choices selected>,
  data <list of data>,
  expr <expression used to generate the data>,
  ...
)

Alternatively if we decide to make transformation within teal_module

tm_module_srv(
  a , b, c <choices selected>,
  data <list of data>,
  tbl <data tranform instructions>
  expr <expression used to generate the data>,
  ...
)

@gogonzo gogonzo changed the title Data extract refactor Design data extract refactor May 23, 2022
@gogonzo gogonzo mentioned this issue Jun 2, 2022
13 tasks
@gogonzo gogonzo mentioned this issue Jun 22, 2022
10 tasks
@gogonzo gogonzo changed the title Design data extract refactor Simplify data extract and data merge Jul 26, 2022
@gogonzo gogonzo changed the title Simplify data extract and data merge Design data extract and data merge Jul 28, 2022
@donyunardi
Copy link

donyunardi commented Aug 11, 2022

@gogonzo thanks for summarizing the related issues for this roadmap.

It looks like most of these issues are missing SP.
Do you think this roadmap is ready for the next increment assignment or do we need to refine it and get the SP estimate?

I did notice that most issues are created in 2021 so perhaps there were discussions around this already which doesn't require SP to be assigned.

@gogonzo
Copy link

gogonzo commented Aug 12, 2022

@donyunardi some of them probably will be irrelevant if we finally redesign and refactor data_merge/data_extract. But we can refine them to at least have everything filled-up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants