Link R scripts for data processing #55

Open
dave-mills opened this issue Dec 2, 2024 · 17 comments

@dave-mills
Member

First step:

  • Calculated indicators for Agroecology module are ready; need to add database link in R to get data from database and save results back to database.
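
A minimal sketch of that round trip (read from the database, compute, write back), written in Python against an in-memory SQLite database purely for illustration. The real scripts are in R, and the platform's actual database engine is not stated in this thread. The table farm_survey_data and the columns livestock_count and fish_count are mentioned later in the thread; the indicator_results table, the animal_count column and the calculation itself are hypothetical placeholders.

```python
import sqlite3

def run_indicator_round_trip(conn):
    """Read survey rows, compute a placeholder indicator, and write it back."""
    rows = conn.execute(
        "SELECT id, livestock_count, fish_count FROM farm_survey_data ORDER BY id"
    ).fetchall()
    results = []
    for row_id, livestock, fish in rows:
        # Placeholder calculation only; the real indicators live in the R scripts.
        # Treating NULL counts as 0 is an assumption made for this sketch.
        results.append((row_id, (livestock or 0) + (fish or 0)))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indicator_results ("
        "farm_survey_data_id INTEGER PRIMARY KEY, animal_count INTEGER)"
    )
    # INSERT OR REPLACE means re-running the job does not duplicate rows.
    conn.executemany("INSERT OR REPLACE INTO indicator_results VALUES (?, ?)", results)
    conn.commit()
    return results

# Tiny in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE farm_survey_data ("
    "id INTEGER PRIMARY KEY, livestock_count INTEGER, fish_count INTEGER)"
)
conn.executemany(
    "INSERT INTO farm_survey_data VALUES (?, ?, ?)", [(1, 3, 0), (2, None, 5)]
)
saved = run_indicator_round_trip(conn)
```

Keying the results table on the submission id keeps the job safe to re-run, which matters if the scripts are later triggered automatically.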

Next:

  • Automate the process: run the R scripts when new submissions come in (or on a schedule?).
  • Include calculated indicators in data export.
@dave-mills dave-mills added this to the All features ready for testing milestone Dec 2, 2024
@dave-mills
Member Author

If the scripts are going to be run on a schedule, I would also like the option for a user to manually trigger the scripts from the front-end. It might be easier for the scripts to be run as soon as a new submission comes in, or perhaps on a delay (e.g. 10 minutes after new submissions are received), to account for the likely event that many submissions are received in a batch.
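
A rough sketch of the "run on a delay" idea, in Python for illustration only. The function name, its parameters and the force flag (standing in for a manual trigger from the front-end) are all hypothetical; only the 10-minute delay comes from the comment above.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(minutes=10)  # the delay suggested above

def should_run(last_submission_at, last_run_at, now, force=False,
               quiet_period=QUIET_PERIOD):
    """Decide whether the scripts should run now.

    force=True models a manual trigger and bypasses every other check.
    """
    if force:
        return True
    if last_submission_at is None:
        return False  # nothing has been submitted yet
    if last_run_at is not None and last_run_at >= last_submission_at:
        return False  # the latest submissions are already processed
    # Run only once the batch has gone quiet for the whole period.
    return now - last_submission_at >= quiet_period
```

Because the timer restarts with every new submission, a batch arriving over several minutes would lead to a single run rather than one run per submission.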

@alex-thomson222

@dan-tang-ssd

Just a couple of things to check database-wise to help with the R scripts and data, based on previous projects, though you may have already corrected for them.

  • That the binary columns for the options of multiple select questions are available in the database
  • That 0s are not being stored or exported as blanks as this was an issue on TAPE
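
To illustrate both checks, here is a small Python sketch. It assumes the select-multiple answer arrives as a space-separated string of choice ids (as ODK stores it). The column prefix, the helper names, and the rule that an unanswered question yields missing values rather than 0s are assumptions for this example. The choice ids are the product choices listed later in the thread.

```python
import csv
import io

PRODUCT_CHOICES = ["crops", "livestock", "fish", "trees", "honey"]

def expand_select_multiple(value, choices, prefix):
    """Turn 'crops fish' into one 0/1 column per choice."""
    if value is None:
        # Question not answered: keep it missing rather than inventing 0s.
        return {f"{prefix}_{choice}": None for choice in choices}
    selected = set(value.split())
    return {f"{prefix}_{choice}": 1 if choice in selected else 0 for choice in choices}

def to_csv(rows):
    """Export rows so that 0 is written as '0' and only None becomes blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({key: "" if val is None else val for key, val in row.items()})
    return buffer.getvalue()

binary_columns = expand_select_multiple("crops fish", PRODUCT_CHOICES, "farm_products")
```

The point of the explicit `is None` checks is that a truthiness test such as `if value:` would silently turn a real 0 into a blank.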

@dan-tang-ssd
Contributor

@alex-thomson222

> @dan-tang-ssd
>
> Just a couple things to check database wise to help with the R scripts and data based on previous projects though you may have already corrected for them.
>
> * That the binary columns for the options of multiple select questions are available in the database

Dan: We do not have any database tables created for the main survey and repeat groups yet. The table structure can be tailor-made for our requirements.

> * That 0s are not being stored or exported as blanks as this was an issue on TAPE

Dan: I will keep this in mind and make sure the issue does not happen here.

@alex-thomson222

I sent an email to you both detailing the first draft of the agroecology_scores script which can be found at

https://github.com/stats4sd/holpa-r-scripts

@alex-thomson222

The script for the key performance indicators is roughly two-thirds complete. All of the simple indicators are coded, so I have begun work on the more complex calculations.

@dave-mills you may have seen my emails to Andrea + Sarah about a couple of the indicators where the script and protocol differ, or where I think there may be errors in their code.

@dave-mills dave-mills self-assigned this Dec 16, 2024
@alex-thomson222

I have tested what I can of the agroecology scores script. I will need the following to test the rest:

  • The "products" table populated
  • Variables following income_count to be restored, as they are currently blank
  • "livestock_count" and "fish_count" to be added to the data structure

@dave-mills
Member Author

@alex-thomson222 thanks for the review and feedback. We've looked through the issues and Dan's coming up with some fixes. The big one arises because these submissions use forms with the unclosed income group, so the form structure isn't what the platform expects. We've put in a fix for that, and Dan is now working through the other issues you've highlighted.

I expect you'll be able to test again tomorrow with a refreshed set of data tables. Tomorrow I want to do a run-through of the whole system on the staging site, so we'll be able to test with properly formatted submission data at that point.

@alex-thomson222

No problem, just let me know when there is a new copy of the database on Dropbox that I can use.

Are the fixed versions of the form going to be uploaded to the same project on ODK Central so I can complete some submissions that should have complete data, especially on those variables converting to hectares or kilograms?

@dan-tang-ssd
Contributor

Thanks a lot Alex.

Some updates:

  1. farm_survey_data table: the columns after "income_count" are now populated. (Thank you Dave for the help.)
  2. farm_survey_data table: added columns "livestock_count" and "fish_count"; both are populated.
  3. products table: it is now populated. I realised that the ODK variable names in the submission content did not match those in the Data Structure Excel file, so I updated the column names to fix it.
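
A mismatch like the one in item 3 can be caught with a simple set comparison. This is a hedged Python sketch: the function name and the example variable names passed to it are hypothetical, and it assumes both sides are available as plain lists of names.

```python
def compare_variable_names(expected, submission):
    """Report names present on one side only (data structure file vs submission)."""
    expected_set, found_set = set(expected), set(submission)
    return {
        "missing_from_submission": sorted(expected_set - found_set),
        "not_in_data_structure": sorted(found_set - expected_set),
    }

# Hypothetical example: one name differs between the two sources.
report = compare_variable_names(
    expected=["crop_sales", "crop_gifts", "crop_fair_price"],
    submission=["crop_sales", "crop_gifts", "crop_fairprice"],
)
```

Running a check like this whenever the form or the data structure file changes would flag renamed variables before they show up as empty columns.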

The latest database with retrieved submissions is in the Dropbox folder below:
\SSD Dropbox\Stats4SD Projects\HOLPA - 2024\database\20241218 Database with Data Retrieved\newer_version


I will fix the "farm_survey_data_id is empty" issue today.

@alex-thomson222

With the income issue, I think we almost have it. Everything after the income group has now been populated, but the questions that are meant to be in that group (i.e. subsidy to farm_loss_text) are still empty. The indicator code relating to these questions is fairly simple, so I can keep working on other parts before needing this fixed.

For the products table, I think this isn't quite what was intended. I believe the intention was to combine the other_product_use_sales repeat group with the crop_use, livestock_use, fish_use, tree_use and honey_use groups of questions, which are not repeats but are structured similarly, to create a product-level table. Those variables are currently not in the data anywhere.

@alex-thomson222

alex-thomson222 commented Dec 18, 2024

Just a couple of things are missing from the permanent workers table to match the structure of seasonal_workers, though I assume this was on the "to do" list as only a few rows were populated:

  • The number of workers, which would be the merging of perm_labour_group_n_workers (household) and perm_labourer_numbers (hired externally). Down the line I should harmonise these two variable names to make it clearer, apologies.
  • Populating the binary for household_members
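
One possible reading of "merging" the two count variables is a coalesce: take whichever of the two is populated for a given worker row. The Python sketch below makes that assumption explicit and raises an error if both are populated, since the thread does not say what should happen in that case. The function name is hypothetical.

```python
def merged_worker_count(row):
    """Return the single populated worker count for a permanent-worker row.

    Assumes a row describes either household labour or hired labour, so at
    most one of the two source variables is filled in.
    """
    household = row.get("perm_labour_group_n_workers")
    hired = row.get("perm_labourer_numbers")
    if household is not None and hired is not None:
        raise ValueError("both worker counts are populated; merge rule is undefined")
    # Compare against None so that a real 0 is kept rather than treated as missing.
    return household if household is not None else hired

example_count = merged_worker_count(
    {"perm_labour_group_n_workers": None, "perm_labourer_numbers": 4}
)
```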

Edit: the seasonal worker table is missing seasonal_labour_months_count.

@alex-thomson222

I have tested everything I can for the performance indicators and fixed the obvious errors I made. I will now go back to the agroecology scores to check what I couldn't before.

To clarify, I am first testing that the code runs successfully without getting stuck, except where existing data table issues would need correcting first. Once I am happy with that, I will check more closely that the final numbers are as expected. I have done a little of this where it is simple to compare quickly, but will go back more thoroughly over the more complex instances.

@alex-thomson222

@dan-tang-ssd @dave-mills

I have now finished checking what I can on the agroecology scores script, and have rewritten my code for the connectivity and fairness scores, as they relate to the products table, so the code should now accommodate that structure. This can be tested when the products table is completed.

@dan-tang-ssd
Contributor

> For the products table, I think this isn't quite what was in mind for this table. I believe the intention was to combine the other_product_use_sales repeat group with crop_use, livestock_use, fish_use, tree_use and honey_use groups of questions which are not repeats but structured similarly to create a product level table. With those variables currently not in the data anywhere

@dave-mills - Um... I am a bit confused... Could you clarify where I can get the data for the products table?

image

@dave-mills
Member Author

This was the part we discussed, where the structure is flat in the form but really it's a different data level. In the 'farm_characteristics' section, the user is asked 'what did you produce on your farm?' with a select-multiple, followed by a repeat group to enter any number of other products.

Then there is a flat section of questions about each product:

  • Produced Crops
  • Livestock
  • Fish
  • Trees
  • Honey
  • Other (this is a repeat group that repeats over any 'other' products entered).

Each of these sections has basically the same set of questions (with some minor variation; there are a couple of 'trees-only' questions, see the 'notes' column in your screenshot), so we discussed with Alex the option of turning it into a new data level.

So: we get the data for this data level from those questions. There should be 1 row per selected item in farm_products, and one row per repeat entry in other_product_use_sales.

For crops, livestock, fish, trees and honey, the product_id and product_name should be taken from the choice_list, i.e.:

| id | name |
| --- | --- |
| crops | Crops (including perennial crops) |
| livestock | Livestock |
| fish | Fish |
| trees | Trees (e.g., for wood, bark, rubber) |
| honey | Honey |

And the rest of the data comes from the correct section. e.g. crops from:

crop_produce_note
crop_hh_consumption
crop_livestock_consumption
crop_sales
crop_gifts
crop_waster
crop_other_use
crop_other_use_specify
crop_use
crop_sales_buyers
crop_buyer
crop_buyer_other
crop_fair_price

and so on.
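
The reshaping described above could be sketched roughly as follows, in Python for illustration only. It assumes the submission is a dictionary, that farm_products holds a space-separated list of choice ids, and that the variable prefixes are crop, livestock, fish, tree and honey (inferred from the group names crop_use, livestock_use, fish_use, tree_use and honey_use). The field names inside the other_product_use_sales repeat (other_product_name, other_sales and so on) are hypothetical, as is the function name. Only a subset of the shared fields is shown.

```python
PRODUCT_NAMES = {
    "crops": "Crops (including perennial crops)",
    "livestock": "Livestock",
    "fish": "Fish",
    "trees": "Trees (e.g., for wood, bark, rubber)",
    "honey": "Honey",
}
# Choice id -> variable prefix (assumed from the group names in the thread).
PREFIXES = {"crops": "crop", "livestock": "livestock", "fish": "fish",
            "trees": "tree", "honey": "honey"}
# Subset of the shared per-product fields; extend with the rest of the list.
FIELDS = ["hh_consumption", "livestock_consumption", "sales", "gifts", "fair_price"]

def build_product_rows(submission):
    """Return one row per selected product plus one per 'other' repeat entry."""
    rows = []
    for product_id in (submission.get("farm_products") or "").split():
        if product_id not in PRODUCT_NAMES:
            continue  # e.g. an 'other' choice, covered by the repeat group below
        prefix = PREFIXES[product_id]
        row = {"product_id": product_id, "product_name": PRODUCT_NAMES[product_id]}
        for field in FIELDS:
            row[field] = submission.get(f"{prefix}_{field}")
        rows.append(row)
    for entry in submission.get("other_product_use_sales") or []:
        # Hypothetical field names for the repeat group entries.
        row = {"product_id": "other", "product_name": entry.get("other_product_name")}
        for field in FIELDS:
            row[field] = entry.get(f"other_{field}")
        rows.append(row)
    return rows

example_rows = build_product_rows({
    "farm_products": "crops honey other",
    "crop_sales": 40,
    "crop_gifts": 0,
    "honey_sales": 100,
    "other_product_use_sales": [{"other_product_name": "Mushrooms", "other_sales": 10}],
})
```

Fields that are absent from the submission come through as None rather than 0, so a missing answer stays distinguishable from a real zero.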

@dan-tang-ssd
Contributor

@dave-mills - Thank you Dave for your clarification. This is really helpful.

Questions:

> For crops, livestock, fish, trees and honey, the product_id and product_name should be taken from the choice_list.

I can find product_id, but where can I find product_name...?

  • There is no product_name in the submission content.
  • I also looked for it in the database table choice_list_entries but could not find it...

Screenshots:

image

image

@dave-mills
Member Author

> I can find product_id, but where can I find product_name...?

In general, the names/labels for choice list entries are in LanguageStrings, as they are 'translatable'. But I don't think we need to go that far yet, as these products aren't changing any time soon. So we can use the names as they are in the ODK form right now:

| id | name |
| --- | --- |
| crops | Crops (including perennial crops) |
| livestock | Livestock |
| fish | Fish |
| trees | Trees (e.g., for wood, bark, rubber) |
| honey | Honey |

@dan-tang-ssd dan-tang-ssd mentioned this issue Dec 20, 2024
4 tasks
@dave-mills dave-mills removed this from the All features ready for testing milestone Jan 29, 2025