Model Development - Data Generation Script #90

Closed
9 tasks done
RobotPsychologist opened this issue Oct 26, 2024 · 3 comments
Labels
modeldev Developing modeling pipelines for meal annotation task.

Comments

RobotPsychologist commented Oct 26, 2024

This issue captures both #108 and #109.

See the README for a better understanding of where files should go.

This ticket relates to #91 - Data Cleaning Script and #96 - Transformations Script.

The dataset_generator.py script should specify which settings to use for dataset creation and should generally write the resulting dataset to data/interim. It calls all of the specified data wrangling, processing, and cleaning utilities that should happen outside of sktime's API.

The data stored in data/interim is then used by the data_transformations.py functions to apply time-series-machine-learning-specific transformations, the final data processing stage before modelling. The transformed data should be stored in data/processed. For more information, see the sktime documentation on transformations.
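As a rough illustration of that hand-off, here is a minimal sketch of what a data_transformations.py step could look like. The function name, the `date`/`bgl` column names, and the choice of sktime's Imputer are assumptions for the example, not part of this spec:

```python
# data_transformations.py -- illustrative sketch only; function and column
# names and the chosen sktime transformer are assumptions.
from pathlib import Path

import pandas as pd
from sktime.transformations.series.impute import Imputer


def apply_transformations(interim_path: Path, processed_path: Path) -> pd.DataFrame:
    """Load an interim dataset, apply sktime transformations, save to data/processed."""
    df = pd.read_csv(interim_path, parse_dates=["date"], index_col="date")

    # Example sktime transformer: fill gaps in the blood glucose series.
    imputer = Imputer(method="ffill")
    df["bgl"] = imputer.fit_transform(df["bgl"])

    processed_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(processed_path)
    return df


if __name__ == "__main__":
    apply_transformations(
        Path("data/interim/example_interim.csv"),
        Path("data/processed/example_processed.csv"),
    )
```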

dataset_processing.py

Purpose: Handles data loading, saving, and file naming.

Location: 0_meal_identification/meal_identification/meal_identification/datasets/dataset_processing.py

Functions:

  • get_root_dir: finds the root directory of the project.
  • load_data: a general data loading utility function that can load data from any data directory.
  • save_data: a general data saving utility function that can store data in either data/interim or data/processed.
  • dataset_labeler: an auto-labeller that takes in the configurations from the data processing, cleaning, and generation steps to create a labelled dataset that should give the user a good understanding of how the dataset was generated (a sketch of these helpers follows).
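A rough sketch of the signatures these helpers could take; the project-marker file, directory layout, and label format are assumptions, not the final interface:

```python
# dataset_processing.py -- signature sketch only; parameter names and the
# label format are assumptions.
from pathlib import Path

import pandas as pd


def get_root_dir(marker: str = "pyproject.toml") -> Path:
    """Walk upwards from this file until a project marker file is found."""
    path = Path(__file__).resolve()
    for parent in path.parents:
        if (parent / marker).exists():
            return parent
    raise FileNotFoundError(f"Could not find project root containing {marker}")


def load_data(data_dir: str, filename: str) -> pd.DataFrame:
    """Load a CSV from any data directory (e.g. 'raw', 'interim', 'processed')."""
    return pd.read_csv(get_root_dir() / "data" / data_dir / filename)


def save_data(df: pd.DataFrame, data_dir: str, filename: str) -> Path:
    """Save a DataFrame to data/interim or data/processed."""
    out_dir = get_root_dir() / "data" / data_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    df.to_csv(out_path, index=False)
    return out_path


def dataset_labeler(config: dict) -> str:
    """Build a filename label that records how the dataset was generated."""
    return "_".join(f"{key}-{value}" for key, value in sorted(config.items()))
```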

dataset_cleaning.py

Purpose: Focuses on cleaning and preprocessing utilities for the dataset, such as handling overlaps and selecting top meals.

Location: 0_meal_identification/meal_identification/meal_identification/datasets/dataset_cleaning.py

Functions:

  • coerce_time - a function that allows for time coercion/resampling;
    • it should be designed to allow for various resampling techniques, not just the original one I developed,
    • likely compute-intensive (bottleneck), so it's important to try to optimize this one as much as possible.
  • erase_meal_overlap - a function that erases meal overlaps,
    • Often, multiple 'ANNOUNCE_MEAL' events will occur in quick succession, but for our modelling task we want to combine them into the initial meal start time.
      • This is because the combined event characterizes a period of high BGL variability.
    • This is another potential high compute bottleneck.
  • keep_top_n_carb_meals - keeps only the top-n carbohydrate meals per day; for our modelling task we will want to assess model performance under different top-carb-meal settings, typically identifying 2 or 3 meals per day (see the sketch below).
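A sketch of how these three utilities could be structured, assuming a DataFrame with a datetime index, a 'msg_type' column carrying 'ANNOUNCE_MEAL' events, and 'bgl'/'food_g' columns; all of these names and the exact overlap rule are assumptions:

```python
# dataset_cleaning.py -- illustrative sketch; column names and the overlap
# rule are assumptions, and the real implementations should be optimized.
import pandas as pd


def coerce_time(df: pd.DataFrame, freq: str = "5min") -> pd.DataFrame:
    """Resample to a regular time grid; the aggregation rules are pluggable."""
    return df.resample(freq).agg({"bgl": "mean", "food_g": "sum", "msg_type": "first"})


def erase_meal_overlap(df: pd.DataFrame, window: pd.Timedelta) -> pd.DataFrame:
    """Merge announce-meal events within `window` of a prior meal into the
    initial meal start time."""
    df = df.copy()
    meal_times = df.index[df["msg_type"] == "ANNOUNCE_MEAL"]
    last_kept = None
    for ts in meal_times:
        if last_kept is not None and ts - last_kept < window:
            # Fold the carbs into the initial meal and drop the duplicate event.
            df.loc[last_kept, "food_g"] += df.loc[ts, "food_g"]
            df.loc[ts, ["msg_type", "food_g"]] = ["", 0]
        else:
            last_kept = ts
    return df


def keep_top_n_carb_meals(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Within each day, keep only the n meals with the most carbohydrates."""
    df = df.copy()
    meals = df[df["msg_type"] == "ANNOUNCE_MEAL"]
    keep = meals.groupby(meals.index.date)["food_g"].nlargest(n).index.get_level_values(1)
    drop = meals.index.difference(keep)
    df.loc[drop, ["msg_type", "food_g"]] = ["", 0]
    return df
```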

dataset_generator.py

Purpose: Handles only the dataset creation process by leveraging functions from both dataset_processing.py and dataset_cleaning.py; it should generally write the pre-transform dataset to data/interim.

Location: 0_meal_identification/meal_identification/meal_identification/datasets/dataset_generator.py

Functions:

  • create_dataset
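A sketch of how create_dataset could wire the pieces together; the config keys, the 'date' column, and the helper signatures mirror the sketches above and are assumptions, not the final API:

```python
# dataset_generator.py -- sketch only; config keys and helper signatures
# follow the sketches above and are assumptions.
import pandas as pd

from .dataset_cleaning import coerce_time, erase_meal_overlap, keep_top_n_carb_meals
from .dataset_processing import dataset_labeler, load_data, save_data


def create_dataset(config: dict) -> pd.DataFrame:
    """Build the pre-transform dataset and write it to data/interim."""
    df = load_data("raw", config["raw_filename"])
    df = df.set_index(pd.to_datetime(df["date"])).sort_index()

    df = coerce_time(df, freq=config.get("freq", "5min"))
    df = erase_meal_overlap(df, window=pd.Timedelta(config.get("meal_window", "2h")))
    df = keep_top_n_carb_meals(df, n=config.get("top_n_meals", 3))

    # The filename label records the settings used, so the interim dataset is
    # self-describing.
    label = dataset_labeler(config)
    save_data(df, "interim", f"{label}.csv")
    return df
```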

plots.py

Purpose: Contains a variety of plotting functions that we will frequently reuse for various tasks, usually related to assessing model performance.

Location: 0_meal_identification/meal_identification/meal_identification/plots.py

Functions:

  • plot_announce_meal_histogram
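A sketch of the histogram helper, assuming the same datetime-indexed DataFrame with a 'msg_type' column as above; the hour-of-day binning is an assumption about what the plot should show:

```python
# plots.py -- sketch of a reusable plotting helper; column names and the
# hour-of-day binning are assumptions.
import matplotlib.pyplot as plt
import pandas as pd


def plot_announce_meal_histogram(df: pd.DataFrame, ax=None):
    """Histogram of the hour of day at which ANNOUNCE_MEAL events occur."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    meal_hours = df.index[df["msg_type"] == "ANNOUNCE_MEAL"].hour
    ax.hist(meal_hours, bins=24, range=(0, 24), edgecolor="black")
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Number of announced meals")
    ax.set_title("Distribution of announced meal times")
    return ax
```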
@andytubeee
Contributor

.

RobotPsychologist added a commit that referenced this issue Nov 7, 2024
Adding the dataset_processing.py file for #90.
RobotPsychologist added a commit that referenced this issue Nov 7, 2024
Adding the dataset_cleaning.py script specified in #90
@RobotPsychologist
Owner Author

@andytubeee @Tony911029

Hopefully, this is enough to get you started.

@RobotPsychologist
Owner Author

Closing this now because I think all the requirements have been fulfilled. New changes to the data generation script will either be enhancements or bug fixes.
