Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploration for Storing Submission Data #165

Open
dan-tang-ssd opened this issue Feb 5, 2025 · 3 comments
Open

Exploration for Storing Submission Data #165

dan-tang-ssd opened this issue Feb 5, 2025 · 3 comments
Assignees

Comments

@dan-tang-ssd
Copy link
Contributor

@alex-thomson222 - Hi Alex, hope you get well soon.

We are exploring possible approaches for storing submission data.
We would like to know you comments on how do you want to retrieve submission data in R script program.

Approach 1: Current way for storing submission data in a "long" format table
Approach 2: Possible way for storing submission data in a "wide" format table.

Previous discussion can be found in issue #164 FYI.


Approach 1: Current way for storing submission data in a "long" format table

When we retrieve a ODK submission:

  • ODK form structure are stored in entities table
  • Each ODK form data item and data value is stored as one entity_values record

For example, we will have below records for a submission with a main survey section and two repeat groups.

entities records for storing ODK form structures

  • main survey
  • repeat group A
  • repeat group B

entity_values records for storing ODK data items and values

  • main survey, item 1
  • main survey, item 2
  • repeat group A, item 1
  • repeat group A, item 2
  • repeat group A, item 3
  • repeat group B, item 1
  • repeat group B, item 2
  • repeat group B, item 3
  • repeat group B, item 4

Each data item is stored as a entity_values record, the table is in a "long" format.

When we do data export from entities and entity_values tables, we need to run one query from entity_values table for each single data item.

  • main survey, 2 queries
  • repeat group A, 3 queries
  • repeat group B, 4 queries

For this approach:

  • No program change required in PHP program
  • R script program needs to re-assemble each data item and data value by one query (i.e. 4 queries for 4 data items)
  • R script will create entity_values records for storing the calculated indicator values

Approach 2: Possible way for storing submission data in a "wide" format table.

Dave and I are exploring potential option for storing and retrieving submission data in a more efficient way.

When we retrieve a ODK submission:

  • ODK form structure are stored in entities table
  • Each dataset for ODK form data items and data is stored as one entity_values record

For example, we will have below records for a submission with a main survey section and two repeat groups.

entities records for storing ODK form structures

  • main survey
  • repeat group A
  • repeat group B

entity_values records for storing dataset for ODK data items and values

  • main survey, JSON column for storing item 1 and 2
  • repeat group A, JSON column for storing item 1, 2, and 3
  • repeat group B, JSON column for storing item 1, 2, 3 and 4

For this approach:

  • We will need to update database table structures, submission retrieval program and data export program in PHP program
  • R script program can re-assemble one dataset with data items and data values by one query (i.e. 1 query for 4 data items)
  • R script will create entity_values records for storing the calculated indicator values
@alex-thomson222
Copy link

Hi @dan-tang-ssd ,

I think will probably need a call at some point just to check I am understanding and picturing these structures correctly.

But just on the basis of reading your descriptions I think approach 1 sounds simpler from an R perspective. At least if I am understanding correctly than a data item is an entity, e.g. a single farm, a single crop within a farm etc. In which case this would align with the existing structure of 1 row per farm, 1 row per crop grown etc. Which is the preferable format I would need for joining tables together so I can match IDs to codes or bring required variables up or down a level if needed for a calculation.

I'm not sure if I am picturing approach 2 correctly, so I may need a visual aid to understand how exactly this is storing the data items. But from what I think you mean, I think this approach would be extra steps in R to unpack the JSONs and probably pivot to a longer format anyway to be able to join onto other tables. So in R this might be more steps to get back to what approach 1 is already doing for me.

@dan-tang-ssd
Copy link
Contributor Author

Hi @alex-thomson222 - Thanks a lot for your reply and comment.

Yes, a picture approach would be very useful.
Please find below illustrations for databae records for visual aid.

I do agree with you.
For approach 1, R script can just get what it needs for indicator calculations.
For approahc 2, R script will need extra step to handle JSON data. (um.... maybe we can only get the required data names and data values from JSON columns...?)


Image


Image

@alex-thomson222
Copy link

Thanks that helps me understand more clearly, I think still stand with what I said before that approach 2 introduces extra complexities where as approach 1 is pretty simple and accommodating to working with the data in R.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants