feat: google batch dag for l2g predictions #110

Open: wants to merge 10 commits into base: dev

@project-defiant (Collaborator) commented Feb 28, 2025

Context

This PR adds a DAG for running the l2g_prediction step, which produces:

  • predictions
  • SHAP explanations

The DAG triggers a Google Batch job for the prediction step. The number of Batch tasks is determined by the number of credible set files matched by the credible_set_glob parameter defined in the configuration: each matched file gets its own task, and each task has access to a single credible set partition.
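A minimal sketch of the file-to-task mapping described above, under stated assumptions: the bucket listing is replaced by a hard-coded list of object names, and the helper names `match_partitions` and `tasks_for` are hypothetical, not taken from this PR.

```python
from fnmatch import fnmatch


def match_partitions(object_names, credible_set_glob):
    """Return the sorted object names that match the glob pattern."""
    return sorted(name for name in object_names if fnmatch(name, credible_set_glob))


def tasks_for(partitions):
    """One task per partition: task index -> the single file that task reads."""
    return {index: path for index, path in enumerate(partitions)}


# Hypothetical listing of the input location (stands in for a real bucket listing).
listing = [
    "credible_set/part-0000.parquet",
    "credible_set/part-0001.parquet",
    "credible_set/_SUCCESS",
]

partitions = match_partitions(listing, "credible_set/*.parquet")
tasks = tasks_for(partitions)
```

Here the marker file is ignored, so two matched partitions yield two tasks.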

Implementations

  • l2g_prediction task group for running the L2G prediction step in the unified pipeline
  • Generic Google Batch job manifest operator for splitting a gentropy step into multiple Google Batch tasks depending on the input_glob
  • Allow max_task_count to be 0, which produces a single-row BatchIndex so that everything runs as a single Google Batch task
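The max_task_count behaviour could be sketched roughly as follows. Only the `max_task_count == 0` case (a single row holding every input) comes from this description. The function name `build_rows` and the round-robin rule for positive values are assumptions made for illustration and may not match the real BatchIndex.

```python
def build_rows(inputs, max_task_count):
    """Group inputs into rows, where each row corresponds to one Batch task."""
    if max_task_count < 0:
        raise ValueError("max_task_count must be >= 0")
    if max_task_count == 0:
        # From the PR description: a single row, so a single Batch task.
        return [list(inputs)]
    # ASSUMPTION (not from the PR): spread inputs round-robin over
    # at most max_task_count rows.
    n_rows = min(max_task_count, len(inputs))
    rows = [[] for _ in range(n_rows)]
    for position, item in enumerate(inputs):
        rows[position % n_rows].append(item)
    return rows


single = build_rows(["a", "b", "c"], 0)
capped = build_rows(["a", "b", "c"], 2)
```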

Concept behind the google batch tasks

The Google Batch operator CloudBatchSubmitJobOperator can run batch tasks, but it requires the job definition up front. To write that definition, we need to know in advance how many tasks the job will contain. For gentropy steps, this means listing the input bucket to determine the number of partitions in the input dataset, then assigning those partitions to tasks.
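To illustrate why the task count has to be known first, here is a hedged sketch of a job definition as a plain dict. The field names follow the Google Batch job resource (`taskGroups`, `taskCount`, `taskSpec`, `runnables`), and `BATCH_TASK_INDEX` is the environment variable Batch sets in each task. The function names, the image URI, and the `--partition-list` flag are hypothetical and not part of this PR.

```python
import os


def build_job_definition(partitions, image_uri):
    """Build a dict shaped like a Google Batch job with one task per partition."""
    if not partitions:
        raise ValueError("no input partitions found; nothing to schedule")
    return {
        "taskGroups": [
            {
                # The count must be fixed before the job is submitted.
                "taskCount": len(partitions),
                "taskSpec": {
                    "runnables": [
                        {
                            "container": {
                                "imageUri": image_uri,
                                # Hypothetical flag, for illustration only.
                                "commands": ["--partition-list", ",".join(partitions)],
                            }
                        }
                    ]
                },
            }
        ]
    }


def partition_for_this_task(partitions, environ=os.environ):
    """Inside a running task: pick the input using the index Batch exposes."""
    index = int(environ["BATCH_TASK_INDEX"])
    return partitions[index]


job = build_job_definition(["p0", "p1", "p2"], "example/image:tag")
chosen = partition_for_this_task(["p0", "p1", "p2"], {"BATCH_TASK_INDEX": "1"})
```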

To achieve this, we use two operators:

  1. BatchIndexOperator, which creates the job definition by listing the input files
  2. BatchJobOperator, which takes the job definitions built by the former and executes them in Google Batch.

To support generic Google Batch jobs, one can derive their own implementation of the Manifest Generator class, which builds the BatchIndex object within the BatchIndexOperator. See the batch subpackage docstring for more details.
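The plan-then-execute split and the pluggable generator might look roughly like this. Every name below (`BatchIndexSketch`, `ManifestGeneratorSketch`, `OneFilePerTask`, `plan`, `execute`) is a hypothetical stand-in; the real classes live in the batch subpackage and may have different interfaces. Tasks are run locally here rather than in Google Batch.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BatchIndexSketch:
    """Hypothetical stand-in: one row per Batch task, each row a list of inputs."""

    rows: list = field(default_factory=list)

    @property
    def task_count(self):
        return len(self.rows)


class ManifestGeneratorSketch(ABC):
    """Hypothetical base class: turns a file listing into an index."""

    @abstractmethod
    def generate(self, input_paths):
        ...


class OneFilePerTask(ManifestGeneratorSketch):
    """Custom generator: each input file becomes its own task."""

    def generate(self, input_paths):
        return BatchIndexSketch(rows=[[path] for path in sorted(input_paths)])


def plan(generator, input_paths):
    """Role of the first operator: build the index from the listing."""
    return generator.generate(input_paths)


def execute(index, run_task):
    """Role of the second operator: run every planned task (locally here)."""
    return [run_task(task_index, row) for task_index, row in enumerate(index.rows)]


index = plan(OneFilePerTask(), ["b.parquet", "a.parquet"])
results = execute(index, lambda task_index, row: (task_index, row[0]))
```

Swapping in a different generator subclass changes how inputs are grouped without touching the execution step.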

@project-defiant force-pushed the szsz-l2g-prediction-batch branch from 71d6842 to d73032b on February 28, 2025 16:43
@project-defiant marked this pull request as ready for review on March 3, 2025 19:54
@project-defiant requested a review from javfg on March 3, 2025 19:54