RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105
base: main
Conversation
Codecov Report
Attention: Patch coverage is below 100% on the new lines; overall project coverage drops by 0.08%.
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
- Coverage   80.61%   80.53%   -0.08%
==========================================
  Files         151      151
  Lines       19421    19455      +34
  Branches      487      487
==========================================
+ Hits        15656    15669      +13
- Misses       3584     3606      +22
+ Partials      181      180       -1
```

☔ View full report in Codecov by Sentry.
Tested the code on this pipeline:
https://github.com/harshad16/data-science-pipeline-example
It works as intended and stores all the content under the run ID for the first node's execution:
Top layer:
s3://hnalla-test/yolo-0129040610/
PRE d6265ec5-2537-4d16-b06f-524d1d50c8ff/
2025-01-28 23:06:18 6949 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz
2025-01-28 23:06:18 4793 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz
2025-01-28 23:06:18 9903 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz
2025-01-28 23:06:18 2528 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz
Run layer:
s3://hnalla-test/yolo-0129040610/d6265ec5-2537-4d16-b06f-524d1d50c8ff/
PRE data/
2025-01-28 23:07:28 350937 Part 1 - Data Cleaning.html
2025-01-28 23:07:28 43544 Part 1 - Data Cleaning.ipynb
2025-01-28 23:06:51 285368 load_data.html
2025-01-28 23:06:51 10516 load_data.ipynb
However, the 2nd node fails because it looks for data at the top level, i.e., inside the pipeline version itself.
I'm not sure whether users should be looking for data at the run-ID level or at the pipeline level.
Based on my understanding, the content can live at the pipeline level; runs are more about parameters.
On that basis, I feel some content needs to be left at the top level.
```diff
@@ -184,6 +185,10 @@ def put_file_to_object_storage(self, file_to_upload: str, object_name: Optional[
         if not object_to_upload:
             object_to_upload = file_to_upload
 
+        run_id = os.getenv("ELYRA_RUN_NAME")
```
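For context, a minimal sketch of how the run ID read here could be folded into the upload path (a sketch only, with assumed names like `cos_directory`; not the exact PR code):

```python
import os
from typing import Optional

def build_object_key(cos_directory: str, file_to_upload: str,
                     object_name: Optional[str] = None) -> str:
    """Build the COS object key, optionally nested under the run ID."""
    object_to_upload = object_name or file_to_upload
    # ELYRA_RUN_NAME is set by the pipeline processor at run time; when it
    # is absent (feature disabled), keep the current flat layout.
    run_id = os.getenv("ELYRA_RUN_NAME")
    parts = [cos_directory, run_id, os.path.basename(object_to_upload)]
    return "/".join(p for p in parts if p)
```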
As this function is called from multiple places, notice that different types of content get included as well.
For example, if a user declares an output file for a node and then uses that file in the next node by reading it directly (like here), the next node would be unable to locate it, because all the content now lives inside the run-ID folder.
Possibly, if a piece of content is declared as an output, we should keep it at the pipeline level rather than the run level?
We would have to leave out the content coming from line 211 under func process_output_file.
I tried something here: harshad16@3f2150d
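Roughly, the split could look like this (a sketch of the idea only, with a hypothetical is_pipeline_output flag; not necessarily what the linked commit does):

```python
import os
from typing import Optional

def object_prefix(cos_directory: str, is_pipeline_output: bool) -> str:
    # Declared node outputs stay at the pipeline level so downstream nodes
    # can still find them; everything else is nested under the run ID.
    run_id: Optional[str] = os.getenv("ELYRA_RUN_NAME")
    if run_id and not is_pipeline_output:
        return f"{cos_directory}/{run_id}"
    return cos_directory
```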
Result after this change:
Top level:
s3://hnalla-test/air-0129044831/
PRE 03efd3c6-7c48-433e-aa45-5d35d29f8cf7/
PRE 38e95101-54b0-41f1-93de-aa2ed5dcead0/
PRE 9f11921a-ca0a-4f79-bb42-fff38e704b98/
PRE data/
2025-01-28 23:48:39 6945 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz
2025-01-28 23:48:39 4789 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz
2025-01-28 23:48:39 9900 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz
2025-01-28 23:48:39 2525 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz
Run level:
s3://hnalla-test/air-0129044831/9f11921a-ca0a-4f79-bb42-fff38e704b98/
2025-01-28 23:49:54 365055 Part 1 - Data Cleaning.html
2025-01-28 23:49:54 116369 Part 1 - Data Cleaning.ipynb
2025-01-28 23:50:53 4203367 Part 2 - Data Analysis.html
2025-01-28 23:50:52 3923758 Part 2 - Data Analysis.ipynb
2025-01-28 23:59:07 697190 Part 3 - Time Series Forecasting.html
2025-01-28 23:59:07 422415 Part 3 - Time Series Forecasting.ipynb
2025-01-28 23:49:13 285577 load_data.html
2025-01-28 23:49:13 10603 load_data.ipynb
WDYT?
If it is all to be at the run level, then we need to modify the get_* function as well, and users will need to change their code.
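For instance, the download side could mirror the upload logic (a sketch, assuming a helper along the lines of Elyra's get_file_from_object_storage; names are illustrative):

```python
import os

def resolve_object_key(cos_directory: str, object_name: str) -> str:
    # Mirror the upload layout on the download side so nodes in the same
    # run look under the run-ID prefix for each other's files.
    run_id = os.getenv("ELYRA_RUN_NAME")
    if run_id:
        return f"{cos_directory}/{run_id}/{object_name}"
    return f"{cos_directory}/{object_name}"
```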
Thanks for the review, @harshad16!
If we keep them at the pipeline level, then these files will be overridden on the next run, won't they? (That is the issue we are trying to resolve.)
Excellent point. It is clearer to me now when to use the ID and when not to.
Tested with further sets.
/lgtm
We should document this flow in the Elyra docs as well as on the product side.
See https://issues.redhat.com/browse/RHOAIENG-16247 for more information
Description
This PR adds the option to append the run ID to Cloud Object Storage (COS) output paths. This way, output files are not overridden when a new run of the same pipeline is triggered from the Dashboard. Note that the current behavior is kept as the default to avoid breaking existing flows; users who want this feature must deliberately enable it through the new property.
If the property is not set, the output files are stored in a path like this (current behavior):
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>
If the property is set, the output files are stored in a path like this:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>
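A worked example of the two layouts, using values from the test run above (a sketch; the function name output_key is illustrative):

```python
def output_key(bucket: str, prefix: str, output_file: str,
               run_id: str | None = None) -> str:
    """Build the COS key; nest under run_id only when the property is enabled."""
    if run_id:
        return f"{bucket}/{prefix}/{run_id}/{output_file}"
    return f"{bucket}/{prefix}/{output_file}"

# Property disabled (current behavior):
#   output_key("hnalla-test", "yolo-0129040610", "load_data.ipynb")
#   -> "hnalla-test/yolo-0129040610/load_data.ipynb"
# Property enabled:
#   output_key("hnalla-test", "yolo-0129040610", "load_data.ipynb",
#              run_id="d6265ec5-2537-4d16-b06f-524d1d50c8ff")
#   -> "hnalla-test/yolo-0129040610/d6265ec5-2537-4d16-b06f-524d1d50c8ff/load_data.ipynb"
```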
Validation
Be sure to have the pipeline server, COS, and the data connection correctly set up on ODH.
Step 1
Run this code locally and connect to a running ODH or build a notebook-based image with an Elyra version that includes the changes from this PR.
Alternatively, you can use the image that is built from this code through our automation:
ghcr.io/caponetto/opendatahub-io-elyra/workbench-images:cuda-jupyter-tensorflow-ubi9-python-3.11-20250124-adbd066-sha-9c4c8ebf
(this image is based on quay.io/opendatahub/workbench-images and simply uninstalls the current ODH Elyra and installs the ODH Elyra built from this code). The following steps use this image.
Step 2
Import the new image into ODH and create a new workbench from it. Be sure to set up the following env vars so that the updated bootstrapper.py is downloaded and used:
Given recent changes not yet propagated to notebooks, you also need to set up:
Step 3
Open the newly created workbench and add a pipeline that stores files to COS. An example that stores a text file to COS can be found here.
Step 4
Keep the new property disabled and run the pipeline. It is expected that the output files are stored at:
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>
If you create a new run from the Dashboard, the output files will be overridden, which is the current behavior.
Step 5
Enable the new property and run the pipeline again.
It is expected that the output files are stored at:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>
Notice that the run ID in the folder matches the information on the Dashboard:
If you create a new run from the Dashboard, the output files from that run will be stored in a new folder.