feat: google batch dag for l2g predictions #110
+154
−27
Context
This PR adds a DAG for running the `l2g_prediction` step that creates:

The DAG triggers a Google Batch job for the prediction step. The number of Google Batch tasks is determined by the number of credible set files discovered by the `credible_set_glob` parameter defined in the configuration: each file found produces one task, and each task gets access to a single credible set partition.

Implementations
- `l2g_prediction` task group for running the l2g prediction step in the unified pipeline
- `input_glob`
- `max_task_count` set to 0 produces a single-row `BatchIndex`, so all tasks run as a single Google Batch task.

Concept behind the google batch tasks
The Google Batch operator `CloudBatchSubmitJobOperator` can run batch tasks, but it requires the job definition to be defined up front. To build the job definition we need to know in advance how many tasks the job will contain. For gentropy steps this means listing the input bucket to determine the number of partitions in the input dataset, then matching those partitions to tasks.

To achieve the above we use two operators:
To support generic Google Batch jobs, one can derive a custom implementation of the Manifest Generator class that builds the `BatchIndex` object within the `BatchIndexOperator`. See the batch subpackage docstring for more details.
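As a rough illustration of the idea described in this PR (discover the input partitions, then map them to tasks), here is a minimal, self-contained sketch. It is not the code in this PR: `TaskRow` and `build_index` are hypothetical names, the listing is a plain list of object names rather than a real bucket listing, and the round-robin grouping used when there are more files than `max_task_count` is an assumption. Only two behaviours are taken from the description: one task per discovered file, and `max_task_count` of 0 collapsing everything into a single row.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class TaskRow:
    """Hypothetical index row: the input partitions handled by one batch task."""

    task_index: int
    inputs: tuple[str, ...]


def build_index(
    object_names: list[str], input_glob: str, max_task_count: int
) -> list[TaskRow]:
    """Map objects matching `input_glob` to batch tasks (illustrative only)."""
    if max_task_count < 0:
        raise ValueError("max_task_count must be >= 0")

    # Keep only the objects that match the glob, in a stable order.
    matched = sorted(name for name in object_names if fnmatch(name, input_glob))
    if not matched:
        return []

    # max_task_count == 0: a single row, so everything runs as one task.
    if max_task_count == 0:
        return [TaskRow(0, tuple(matched))]

    # Otherwise one task per file, capped at max_task_count.
    # Assumption: surplus files are spread round-robin over the tasks.
    n_tasks = min(max_task_count, len(matched))
    return [TaskRow(i, tuple(matched[i::n_tasks])) for i in range(n_tasks)]


if __name__ == "__main__":
    listing = [
        "credible_sets/part-0000.parquet",
        "credible_sets/part-0001.parquet",
        "credible_sets/part-0002.parquet",
        "credible_sets/_SUCCESS",
    ]
    for row in build_index(listing, "credible_sets/part-*.parquet", 10):
        print(row.task_index, row.inputs)
```

The length of the returned list is what would fix the task count in the job definition before the job is submitted, which is why the listing step has to run first.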