Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RHOAIENG-16247: Add option to append the run ID to Cloud Object Storage output paths #105

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

caponetto
Copy link

See https://issues.redhat.com/browse/RHOAIENG-16247 for more information

Description

This PR adds the option to append the run ID to Cloud Object Storage (COS) output paths. This way, output files are not overridden when a new run is triggered from the same pipeline execution on the Dashboard. Notice that the current behavior is kept as default to avoid breaking existing flows. Users who want to use this feature must deliberately enable it through the new property.

image

In case the property is not set, then the output files will be stored in a path like (current behavior):
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>

In case the property is set, then the output files will be stored in a path like:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>

Validation

Be sure to have the Pipeline server, COS and connection correctly set up on ODH.

Step 1

Run this code locally and connect to a running ODH or build a notebook-based image with an Elyra version that includes the changes from this PR.

Alternativelly, you can use the image that is built from this code through our automation: ghcr.io/caponetto/opendatahub-io-elyra/workbench-images:cuda-jupyter-tensorflow-ubi9-python-3.11-20250124-adbd066-sha-9c4c8ebf (this image is based on quay.io/opendatahub/workbench-images and simply uninstalls the current ODH Elyra and installs the ODH Elyra built from this code). The following steps use this image.

Step 2

Import the new image into ODH and create a new workbench from it. Be sure to set up the following env vars so that the updated bootstrapper.py is downloaded and used:

ELYRA_FILE_BASE_PATH=/opt/app-root/bin (or any other folder)
ELYRA_GITHUB_BRANCH=RHOAIENG-16247
ELYRA_GITHUB_ORG=caponetto
ELYRA_GITHUB_REPO=opendatahub-io-elyra

Given recent changes not propagated to notebooks yet, you also need to set up:

KF_PIPELINES_SSL_SA_CERTS=/etc/pki/tls/custom-certs/ca-bundle.crt

Step 3

Open up the newly created workbench and add a pipeline that stores files to COS. An example can be found here, which stores a text file to COS.

Step 4

Keep the new property disabled and run the pipeline. It is expected that output files be stored on:
<bucket_name>/<pipeline_name>-<timestamp>/<output_file>

image

If you create a new run from Dashboard, the output files will be overridden, which is the current behavior.

image

Step 5

Enable the new property and run the pipeline again.

image

It is expected that output files be stored on:
<bucket_name>/<pipeline_name>-<timestamp>/<run_id>/<output_file>

image image

Notice that, the run ID in the folder matches with the information on the Dashboard:

image

If you create a new run from Dashboard, the output files from the run will be stored in a new folder.

image image

@caponetto caponetto requested a review from harshad16 January 24, 2025 19:33
@codecov-commenter
Copy link

codecov-commenter commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 87.17949% with 5 lines in your changes missing coverage. Please review.

Project coverage is 80.53%. Comparing base (a74d345) to head (9c4c8eb).

Files with missing lines Patch % Lines
elyra/airflow/bootstrapper.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
- Coverage   80.61%   80.53%   -0.08%     
==========================================
  Files         151      151              
  Lines       19421    19455      +34     
  Branches      487      487              
==========================================
+ Hits        15656    15669      +13     
- Misses       3584     3606      +22     
+ Partials      181      180       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@harshad16 harshad16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested the code on this pipeline:
https://github.com/harshad16/data-science-pipeline-example

It does work as intended and stores all the content behind run-id for 1st node execution:

Top layer

s3://hnalla-test/yolo-0129040610/
                           PRE d6265ec5-2537-4d16-b06f-524d1d50c8ff/
2025-01-28 23:06:18       6949 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz
2025-01-28 23:06:18       4793 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz
2025-01-28 23:06:18       9903 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz
2025-01-28 23:06:18       2528 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz

Run layer:

s3://hnalla-test/yolo-0129040610/d6265ec5-2537-4d16-b06f-524d1d50c8ff/
                           PRE data/
2025-01-28 23:07:28     350937 Part 1 - Data Cleaning.html
2025-01-28 23:07:28      43544 Part 1 - Data Cleaning.ipynb
2025-01-28 23:06:51     285368 load_data.html
2025-01-28 23:06:51      10516 load_data.ipynb

However the 2nd node, fails as it is looking for data in top level i.e inside the pipeline_version itself.
Not sure, if user should be looking for data at run-id level or pipeline level.
based on my understanding, the content can be pipeline level,
runs are more to do with parameters.
based on that , i feel some content need to be left at top level

@@ -184,6 +185,10 @@ def put_file_to_object_storage(self, file_to_upload: str, object_name: Optional[
if not object_to_upload:
object_to_upload = file_to_upload

run_id = os.getenv("ELYRA_RUN_NAME")
Copy link
Member

@harshad16 harshad16 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this function is called from multiple places,
notice there are different types of content also that is getting included.

for example: if a user includes a output file for a node and uses that file in next node by directly trying to read it. like here
it would be unable to locate that, as all the content is inside run-id.

Possibly, if a content is assigned as output, perhaps we should keep it on pipeline level than run ?
we have to leave out the content that is coming from line 211 under func process_output_file
tried something here: harshad16@3f2150d

result after this change:
top level

s3://hnalla-test/air-0129044831/
                           PRE 03efd3c6-7c48-433e-aa45-5d35d29f8cf7/
                           PRE 38e95101-54b0-41f1-93de-aa2ed5dcead0/
                           PRE 9f11921a-ca0a-4f79-bb42-fff38e704b98/
                           PRE data/
2025-01-28 23:48:39       6945 Part 1 - Data Cleaning-8c96e288-4461-4d7e-8e0d-353c1fdb0c8c.tar.gz
2025-01-28 23:48:39       4789 Part 2 - Data Analysis-dcf486ef-2d73-4306-a3ca-af720a1f8eb3.tar.gz
2025-01-28 23:48:39       9900 Part 3 - Time Series Forecasting-1e4b1763-337e-4f84-ae9c-a6cc79a1b7eb.tar.gz
2025-01-28 23:48:39       2525 load_data-bb889c69-b23a-484e-8fb3-e69309f38a98.tar.gz

run level

s3://hnalla-test/air-0129044831/9f11921a-ca0a-4f79-bb42-fff38e704b98/                                     
2025-01-28 23:49:54     365055 Part 1 - Data Cleaning.html
2025-01-28 23:49:54     116369 Part 1 - Data Cleaning.ipynb
2025-01-28 23:50:53    4203367 Part 2 - Data Analysis.html
2025-01-28 23:50:52    3923758 Part 2 - Data Analysis.ipynb
2025-01-28 23:59:07     697190 Part 3 - Time Series Forecasting.html
2025-01-28 23:59:07     422415 Part 3 - Time Series Forecasting.ipynb
2025-01-28 23:49:13     285577 load_data.html
2025-01-28 23:49:13      10603 load_data.ipynb

WDYT ?
if it is all to be on run level,
then we need to modify the get_ func() as well, users need to change their code.

@caponetto
Copy link
Author

Thanks for the review, @harshad16!
I completely agree with you, that's precisely the reason I'm proposing this feature to be an option.
With this change, I wanted to avoid breaking existing flows and/or forcing users to change their code.

Possibly, if a content is assigned as output, perhaps we should keep it on pipeline level than run ?
we have to leave out the content that is coming from line 211 under func process_output_file
tried something here: harshad16@3f2150d

If we keep them on the pipeline level, then these files will be overridden in the next run, won't they? (which is the issue we are trying to resolve).

Copy link
Member

@harshad16 harshad16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point, it is more clear to me, when and when not to use the id
Test with further sets

/lgtm

we should update this flow in some docs in elyra as well as product side.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants