RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

caponetto · 2025-01-24T19:33:26Z

See https://issues.redhat.com/browse/RHOAIENG-16247 for more information

Description

This PR adds the option to append the run ID to Cloud Object Storage (COS) output paths. This way, output files are not overridden when a new run is triggered from the same pipeline execution on the Dashboard. Notice that the current behavior is kept as default to avoid breaking existing flows. Users who want to use this feature must deliberately enable it through the new property.

In case the property is not set, then the output files will be stored in a path like (current behavior):
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>

In case the property is set, then the output files will be stored in a path like:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>

Validation

Be sure to have the Pipeline server, COS and connection correctly set up on ODH.

Step 1

Run this code locally and connect to a running ODH or build a notebook-based image with an Elyra version that includes the changes from this PR.

Alternativelly, you can use the image that is built from this code through our automation: ghcr.io/caponetto/opendatahub-io-elyra/workbench-images:cuda-jupyter-tensorflow-ubi9-python-3.11-20250124-adbd066-sha-9c4c8ebf (this image is based on quay.io/opendatahub/workbench-images and simply uninstalls the current ODH Elyra and installs the ODH Elyra built from this code). The following steps use this image.

Step 2

Import the new image into ODH and create a new workbench from it. Be sure to set up the following env vars so that the updated bootstrapper.py is downloaded and used:

ELYRA_FILE_BASE_PATH=/opt/app-root/bin (or any other folder)
ELYRA_GITHUB_BRANCH=RHOAIENG-16247
ELYRA_GITHUB_ORG=caponetto
ELYRA_GITHUB_REPO=opendatahub-io-elyra

Given recent changes not propagated to notebooks yet, you also need to set up:

KF_PIPELINES_SSL_SA_CERTS=/etc/pki/tls/custom-certs/ca-bundle.crt

Step 3

Open up the newly created workbench and add a pipeline that stores files to COS. An example can be found here, which stores a text file to COS.

Step 4

Keep the new property disabled and run the pipeline. It is expected that output files be stored on:
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>

If you create a new run from Dashboard, the output files will be overridden, which is the current behavior.

Step 5

Enable the new property and run the pipeline again.

It is expected that output files be stored on:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>

Notice that, the run ID in the folder matches with the information on the Dashboard:

If you create a new run from Dashboard, the output files from the run will be stored in a new folder.

codecov-commenter · 2025-01-24T19:41:07Z

Codecov Report

Attention: Patch coverage is 87.17949% with 5 lines in your changes missing coverage. Please review.

Project coverage is 80.53%. Comparing base (a74d345) to head (9c4c8eb).

Files with missing lines	Patch %	Lines
elyra/airflow/bootstrapper.py	0.00%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
- Coverage   80.61%   80.53%   -0.08%     
==========================================
  Files         151      151              
  Lines       19421    19455      +34     
  Branches      487      487              
==========================================
+ Hits        15656    15669      +13     
- Misses       3584     3606      +22     
+ Partials      181      180       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

harshad16

Tested the code on this pipeline:
https://github.com/harshad16/data-science-pipeline-example

It does work as intended and stores all the content behind run-id for 1st node execution:

Top layer

s3://hnalla-test/yolo-0129040610/
                           PRE d6265ec5-2537-4d16-b06f-524d1d50c8ff/
2025-01-28 23:06:18       6949 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz
2025-01-28 23:06:18       4793 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz
2025-01-28 23:06:18       9903 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz
2025-01-28 23:06:18       2528 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz

Run layer:

s3://hnalla-test/yolo-0129040610/d6265ec5-2537-4d16-b06f-524d1d50c8ff/
                           PRE data/
2025-01-28 23:07:28     350937 Part 1 - Data Cleaning.html
2025-01-28 23:07:28      43544 Part 1 - Data Cleaning.ipynb
2025-01-28 23:06:51     285368 load_data.html
2025-01-28 23:06:51      10516 load_data.ipynb

However the 2nd node, fails as it is looking for data in top level i.e inside the pipeline_version itself.
Not sure, if user should be looking for data at run-id level or pipeline level.
based on my understanding, the content can be pipeline level,
runs are more to do with parameters.
based on that , i feel some content need to be left at top level

harshad16 · 2025-01-29T05:24:21Z

elyra/airflow/bootstrapper.py

@@ -184,6 +185,10 @@ def put_file_to_object_storage(self, file_to_upload: str, object_name: Optional[
        if not object_to_upload:
            object_to_upload = file_to_upload

+        run_id = os.getenv("ELYRA_RUN_NAME")


As this function is called from multiple places,
notice there are different types of content also that is getting included.

for example: if a user includes a output file for a node and uses that file in next node by directly trying to read it. like here
it would be unable to locate that, as all the content is inside run-id.

Possibly, if a content is assigned as output, perhaps we should keep it on pipeline level than run ?
we have to leave out the content that is coming from line 211 under func process_output_file
tried something here: harshad16@3f2150d

result after this change:
top level

s3://hnalla-test/air-0129044831/ PRE 03efd3c6-7c48-433e-aa45-5d35d29f8cf7/ PRE 38e95101-54b0-41f1-93de-aa2ed5dcead0/ PRE 9f11921a-ca0a-4f79-bb42-fff38e704b98/ PRE data/ 2025-01-28 23:48:39 6945 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz 2025-01-28 23:48:39 4789 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz 2025-01-28 23:48:39 9900 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz 2025-01-28 23:48:39 2525 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz

run level

s3://hnalla-test/air-0129044831/9f11921a-ca0a-4f79-bb42-fff38e704b98/ 2025-01-28 23:49:54 365055 Part 1 - Data Cleaning.html 2025-01-28 23:49:54 116369 Part 1 - Data Cleaning.ipynb 2025-01-28 23:50:53 4203367 Part 2 - Data Analysis.html 2025-01-28 23:50:52 3923758 Part 2 - Data Analysis.ipynb 2025-01-28 23:59:07 697190 Part 3 - Time Series Forecasting.html 2025-01-28 23:59:07 422415 Part 3 - Time Series Forecasting.ipynb 2025-01-28 23:49:13 285577 load_data.html 2025-01-28 23:49:13 10603 load_data.ipynb

WDYT ?
if it is all to be on run level,
then we need to modify the get_ func() as well, users need to change their code.

caponetto · 2025-01-29T10:32:34Z

Thanks for the review, @harshad16!
I completely agree with you, that's precisely the reason I'm proposing this feature to be an option.
With this change, I wanted to avoid breaking existing flows and/or forcing users to change their code.

Possibly, if a content is assigned as output, perhaps we should keep it on pipeline level than run ?
we have to leave out the content that is coming from line 211 under func process_output_file
tried something here: harshad16@3f2150d

If we keep them on the pipeline level, then these files will be overridden in the next run, won't they? (which is the issue we are trying to resolve).

harshad16

Excellent point, it is more clear to me, when and when not to use the id
Test with further sets

/lgtm

we should update this flow in some docs in elyra as well as product side.

Add option to append the run ID to Cloud Object Storage output paths

9c4c8eb

caponetto requested a review from harshad16 January 24, 2025 19:33

harshad16 reviewed Jan 29, 2025

View reviewed changes

harshad16 approved these changes Jan 29, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

caponetto commented Jan 24, 2025

codecov-commenter commented Jan 24, 2025 •

edited

Loading

harshad16 left a comment

harshad16 Jan 29, 2025 •

edited

Loading

caponetto commented Jan 29, 2025

harshad16 left a comment

RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

Are you sure you want to change the base?

RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

Conversation

caponetto commented Jan 24, 2025

Description

Validation

Step 1

Step 2

Step 3

Step 4

Step 5

codecov-commenter commented Jan 24, 2025 • edited Loading

Codecov Report

harshad16 left a comment

Choose a reason for hiding this comment

harshad16 Jan 29, 2025 • edited Loading

Choose a reason for hiding this comment

caponetto commented Jan 29, 2025

harshad16 left a comment

Choose a reason for hiding this comment

codecov-commenter commented Jan 24, 2025 •

edited

Loading

harshad16 Jan 29, 2025 •

edited

Loading