
[frontend] UI is failing to render large pipeline DAGs with "no graph to show" error #10011

Closed
sachdevayash1910 opened this issue Sep 20, 2023 · 12 comments
Assignees
Labels
area/frontend kind/bug lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

@sachdevayash1910 commented Sep 20, 2023

  • How did you deploy Kubeflow Pipelines (KFP)?
    It was installed with the Kubeflow standard installation of the AWS distribution
  • KFP version:
    2.0.0-alpha7

What we are trying to do

We are trying to execute pipelines with 400-500 components. On average each has 10-15 inputs/outputs, but some components have close to 100 inputs and outputs. This results in a workflow size of about 700 KB, which only grows as ParallelFors execute, depending on which use case a particular pipeline is running.
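As a back-of-envelope illustration of the scale involved (numbers taken from the figures above; treating each input/output as roughly one edge is a simplifying assumption, since the real count depends on how outputs are wired):

```python
# Rough estimate of the rendered graph size for the pipelines described
# in this report. The edge model (one edge per input/output) is an
# assumption for illustration, not an exact count.
components = 450      # 400-500 components per pipeline
avg_io = 12           # ~10-15 inputs/outputs per component on average

approx_edges = components * avg_io
print(approx_edges)   # 5400 -- far beyond the 1000-edge rendering limit
                      # discussed later in this thread
```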

Expected result

The pipelines should run to completion and we should be able to see the graph

Issue we are seeing

[screenshot: the run page shows a "no graph to show" error instead of the DAG]

We had seen the same issue earlier when our pipelines were smaller. At the time we had built the ml-pipeline-ui image from the master branch, which had a fix for the issue raised via #8343.
Currently we are running the following commit of the master branch of the pipelines repo: 1bed63a31e7ac5e7ba122a6695f9aa40449a22aa
We had not run into any issues since February. However, our pipelines are now much larger and we have hit this issue again.
To be on the safe side, I tried upgrading the UI image to version 2.0.0-alpha7, but as I understand it, that version contains the same fix I had already deployed, which is why the issue was not resolved.
Would appreciate any input on how to resolve this. This is blocking us from running pipelines beyond a point.

Additional context:

Our pipelines are very large and already exceeded the workflow size limit; we started receiving `workflow is longer than maximum allowed size. compressed size 1055604 > maxSize 1048576`.
This is the same as this issue: awslabs/kubeflow-manifests#767

For this, I tried and was able to turn on the workflow offloading feature provided by Argo: https://argoproj.github.io/argo-workflows/offloading-large-workflows/

This actually allowed the pipelines to run successfully (they were stuck or failing before due to the size error), but we then started to see these UI issues.
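For reference, offloading is enabled in the Argo workflow-controller ConfigMap. A minimal sketch follows; the database connection values are placeholders, and the exact persistence keys should be taken from the Argo documentation linked above:

```yaml
# workflow-controller-configmap (sketch; placeholder DB settings)
data:
  config: |
    persistence:
      nodeStatusOffLoad: true
      postgresql:
        host: my-postgres      # placeholder
        port: 5432
        database: argo
        tableName: argo_workflows
```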
@zijianjoy @chensun Would appreciate your inputs

Impacted by this bug? Give it a 👍.

@zijianjoy (Collaborator)

@droctothorpe has a PR for this fix: #9351.

@sachdevayash1910 (Author)

Thanks @zijianjoy, I will check it out. But isn't that just to surface the error? Our pipelines aren't exceeding 1000 nodes, and yet we don't see a graph.

@zijianjoy (Collaborator)

@sachdevayash1910 You can also open the Developer console in your browser to see if there is any error shown on the console.

Alternatively, you can also share the pipeline template here so we can reproduce.

@droctothorpe (Contributor) commented Sep 21, 2023

The error also surfaces if you have 1000 connections/edges in your graph, even if you have fewer than 1000 nodes.
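The distinction can be checked directly against a compiled graph. A minimal sketch follows; the 1000-node and 1000-edge figures are the limits discussed in this thread, not documented constants:

```python
# Count nodes and edges of a DAG given as an adjacency list and flag
# graphs that exceed the rendering limits discussed in this thread.
NODE_LIMIT = 1000  # assumption: limit value taken from this conversation
EDGE_LIMIT = 1000  # assumption: limit value taken from this conversation

def exceeds_render_limits(adjacency: dict) -> bool:
    nodes = len(adjacency)
    edges = sum(len(targets) for targets in adjacency.values())
    return nodes > NODE_LIMIT or edges > EDGE_LIMIT

# 80 nodes, but 50 * 30 = 1500 edges: under the node limit, over the
# edge limit, so the graph would still fail to render.
sinks = {f"sink-{j}": [] for j in range(30)}
sources = {f"src-{i}": list(sinks) for i in range(50)}
dense = {**sources, **sinks}
print(exceeds_render_limits(dense))  # True
```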

@droctothorpe (Contributor) commented Sep 22, 2023

One option is to break your pipeline up into sub-pipelines and have the last component of the first pipeline trigger the second, and so on.

Another option is to watch pipelines in Argo Workflows, the Argo Workflows CLI, or K9s. Not as pretty as the KFP frontend though.

@TristanGreathouse

@droctothorpe @zijianjoy I work with @sachdevayash1910 and am one of the primary developers on our pipelines. We definitely have greater than 1000 edges in some of our larger DAGs, and potentially could exceed 1000 nodes depending on the use-case and how the pipelines are configured.

Why is there specifically a 1000 edge and/or node limit for the KF UI? Is there any way this can be increased or are there any plans to fix this in the future?

We can't share pipelines with our images, but if it'd be critical for debugging, we could work on a dummy pipeline with the same inputs and outputs for every component running on generic images and running minimal code to reproduce the UI issue. Would this be helpful?

@zijianjoy (Collaborator)

@TristanGreathouse If the number of edges and nodes is too large, there is a chance the web page will freeze because rendering such a large graph fails. One thing to consider is packaging a pipeline as a component, which becomes a SubDAG. This can reduce the number of nodes and edges in each render. https://www.kubeflow.org/docs/components/pipelines/v2/pipelines/pipeline-basics/#pipelines-as-components

If the UI can handle more than 1000 nodes and edges, feel free to increase the limit by creating a PR.

@TristanGreathouse

@zijianjoy Sub-DAGs in KF are something we've wanted for quite a while, so I'm very glad to see they've been released. This is definitely the preferred way to fix our problems; however, we're running into some snags testing the examples.

I tried to upload the toy pipeline from the docs. In order to compile it, I upgraded to kfp==2.3.0 (up from 1.8.21). However, when I went to upload the pipeline in the KF UI, I got the error in the screenshot below. I also attempted to upload pipelines and start runs from a template using the KFP client, but we get the following warning before our client fails to connect to the cluster. Our current KF Pipelines backend is 2.0.0-alpha.5, which comes with KF 1.6.1.

/home/inferno/miniconda/lib/python3.8/site-packages/kfp/client/client.py:158: FutureWarning: This client only works with Kubeflow Pipeline v2.0.0-beta.2 and later versions.

Do we need to install a V2 backend in order to upload and run pipelines compiled with the V2 SDK? If so, which backend version should we use? We tried to reference the docs for installation, but they just say "This page will be available soon". Any guidance on V2-compatible versioning and documentation for installation would be greatly appreciated, as we're eager to test out V2 sub-DAG functionality.

[screenshot: error shown when uploading the compiled pipeline in the KF UI]

CC: @sachdevayash1910

@noodleai

Hi all, after applying the Argo fix suggested by @sachdevayash1910 (setting `nodeStatusOffLoad: true`), larger pipelines now execute. However, these offloaded pipelines fail to render in the UI.

After some Chrome dev-tools debugging (not a UI developer by any means), I found the following:

  1. The variable `graph` is present in a non-offloading pipeline:
     [screenshot: `graph` is defined in the dev-tools console]

  2. However, it is undefined in a larger, offloading one:
     [screenshot: `graph` is undefined in the dev-tools console]

  3. Digging a bit deeper into the RunDetails.tsx file, I found that the following must be truthy: `workflow && workflow.status && workflow.status.nodes`. However, `workflow` has no `nodes` value but does have an `offloadNodeStatusVersion` entry:
     [screenshot: workflow status with `offloadNodeStatusVersion` but no `nodes`]

  4. For a non-offloading pipeline, `workflow` does have a `nodes` entry, but no `offloadNodeStatusVersion`:
     [screenshot: workflow status with `nodes` but no `offloadNodeStatusVersion`]

@zijianjoy any pointers on how we can get the nodes into `workflow` when using `nodeStatusOffLoad: true`?

@zijianjoy (Collaborator)

`nodeStatusOffLoad` is an Argo feature that we haven't supported yet. If you would like to contribute, you would need to identify where the uncompressed graph is stored:

  • If the compressed graph is stored in a Kubernetes resource, the frontend needs to detect this case and unzip the graph before rendering.
  • If it is stored in a database such as MySQL or PostgreSQL, establish an API on the frontend server that queries the target database location, so it can return the graph details to the browser.

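Based on the shapes observed earlier in this thread, the detection step could look roughly like this. This is a stdlib Python sketch for clarity (the actual frontend is TypeScript), and the field names are the ones reported above, not a documented API:

```python
# Sketch: detect whether a workflow's node statuses were offloaded by
# Argo and therefore must be fetched from the persistence database
# before the graph can be rendered. Field names follow the
# observations in this thread; this is not the actual frontend code.
def needs_offload_lookup(workflow: dict) -> bool:
    status = workflow.get("status") or {}
    return "nodes" not in status and "offloadNodeStatusVersion" in status

offloaded = {"status": {"offloadNodeStatusVersion": "abc123"}}
inline = {"status": {"nodes": {"step-1": {"phase": "Succeeded"}}}}
print(needs_offload_lookup(offloaded), needs_offload_lookup(inline))  # True False
```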
github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The github-actions bot added the lifecycle/stale label on Jan 3, 2024.

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
