[BUG] Exceeding 63 character limit for Volume names in Flyte Propeller Container Plugin #4824

Closed
2 tasks done
alexbeach-bc opened this issue Feb 2, 2024 · 5 comments
Assignees
Labels
backlogged (For internal use. Reserved for contributor team workflow.), bug (Something isn't working), exo

@alexbeach-bc

Describe the bug

I am running this tutorial

https://github.com/unionai-oss/llm-fine-tuning/tree/main/flyte_llama

I am running the command in the readme:

pyflyte -c $FLYTECTL_CONFIG run --remote \
    --copy-all \
    flyte_llama/workflows.py train_workflow \
    --config config/flyte_llama_7b_qlora_v0.json

The only code modification is that I am using a GCP registry and a self-hosted Flyte cluster in GKE. The cluster has node pools with the GPU, memory, and CPU resources required to run the workflow.

Note that the registry name is very long because of how GCP constructs its URLs:

from flytekit import ImageSpec

image_spec = ImageSpec(
    name="flyte-llama-qlora",
    apt_packages=["git"],
    registry="us-central1-docker.pkg.dev/xx-xxxxxxx-xxxxxxxxx-xxxxx/flyte",
    requirements="requirements.txt",
    python_version="3.9",
    cuda="11.7.1",
    env={"VENV": "/opt/venv"},
)

The first task succeeds, but the second task is stuck in the Queued state:
Screenshot 2024-02-01 at 4 02 24 PM

After looking at the flyte-propeller k8s logs, I see this error:

{"json":{"exec_id":"fb3bf107c6ddd4889a74","ns":"flytesnacks-development","res_ver":"8124851","routine":"worker-8","wf":"flytesnacks:development:flyte_llama.workflows.train_workflow"},"level":"error","msg":"Error when trying to reconcile workflow. Error [failed at Node[n1]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Invalid] failed to create resource, caused by: Pod \"fb3bf107c6ddd4889a74-n1-0\" is invalid: [spec.volumes[1].name: Invalid value: \"mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy\": must be no more than 63 characters, spec.containers[0].volumeMounts[1].name: Not found: \"mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy\"]]. Error Type[*errors.NodeErrorWithCause]","ts":"2024-02-02T00:25:57Z"}

How are the volume names generated? Could it be because the long registry name is used in the volume name?
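
For context on the limit quoted in the error: Kubernetes requires a volume name to be a DNS-1123 label, that is, at most 63 characters, made of lowercase alphanumerics or '-', and starting and ending with an alphanumeric. Below is a minimal sketch of that rule applied to the rejected name from the log. The helper `is_valid_volume_name` is a hypothetical name used for illustration; this is not the validation code of Kubernetes or Flyte, and it says nothing about how the name was generated.

```python
import re

# DNS-1123 label rule that Kubernetes applies to volume names:
# lowercase alphanumerics or '-', starting and ending with an alphanumeric.
DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def is_valid_volume_name(name: str) -> bool:
    """Hypothetical helper: True if `name` satisfies the DNS-1123 label rule."""
    return len(name) <= MAX_LABEL_LENGTH and DNS1123_LABEL.fullmatch(name) is not None


# Volume name copied verbatim from the propeller log above.
rejected = "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy"

print("length:", len(rejected), "limit:", MAX_LABEL_LENGTH)
print("rejected name valid:", is_valid_volume_name(rejected))
print("short name valid:", is_valid_volume_name("flyte-llama-qlora"))
```

The characters in the rejected name are all allowed; only its length breaks the rule, which matches the "must be no more than 63 characters" message.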

Expected behavior

I would expect valid Volume names to be generated and the tutorial task to complete.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
alexbeach-bc added the bug (Something isn't working) and untriaged (This issue has not yet been looked at by the Maintainers) labels Feb 2, 2024

welcome bot commented Feb 2, 2024

Thank you for opening your first issue here! 🛠

dosubot bot commented Feb 2, 2024

🤖

Hey @alexbeach-bc, great to see you back here! Hope you're doing well.

Based on the information available in the Flyte repository, it's not clear how volume names are generated in the Flyte Propeller Container Plugin. The error message you're seeing suggests that the volume name is exceeding the Kubernetes limit of 63 characters, but it's not clear from the provided context why this is happening.

The registry name you're using does not appear to play a direct role in the generation of volume names, based on the information available. The volume name is generated by the template "flyte-sandbox.persistence.minioVolumeName" as seen in the charts/flyte-sandbox/templates/storage/minio/pvc.yaml. However, without the content of this template, it's not possible to confirm this.

It's possible that there's a bug or a misconfiguration causing this issue, but without more information, it's hard to say for sure. If you could provide more details about your configuration and the exact steps you're taking, it might be possible to identify the cause of this issue.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

pvditt added the exo label Feb 2, 2024
pvditt self-assigned this Feb 2, 2024
hamersaw added the backlogged label (For internal use. Reserved for contributor team workflow.) and removed the untriaged label Feb 2, 2024
@pvditt
Contributor

pvditt commented Apr 2, 2024

@alexbeach-bc apologies for just now getting to this issue.

I reproduced the invalid volume name error by running a task whose PodSpec contains a volume name longer than 63 characters. The latest version of Flyte handles the error properly, so I don’t believe there was a regression in launching pods. That is sufficient to close the concern on Propeller’s end. I will follow up on this issue to find out how the volume name was generated.

What Flyte version are you running?

@pvditt
Contributor

pvditt commented Apr 4, 2024

closing after discussion offline

@pvditt pvditt closed this as completed Apr 4, 2024
@andylizf

encountered this again. Any suggestions?

Workflow[flytesnacks:development:.flytegen.flyte_llama.workflows.tune_batch_size_eager] failed. RuntimeExecutionError: max number of system retry attempts [11/10] exhausted. Last known status message: failed at Node[flytellamaworkflowstunebatchsizeeager]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [Invalid] failed to create resource, caused by: Pod "f522b9b0dd4044607860-f1heqpqi-0" is invalid: [spec.volumes[1].name: Invalid value: "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy": must be no more than 63 characters, spec.containers[0].volumeMounts[1].name: Not found: "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy"]
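
One generic way to keep a generated name within the 63-character limit is to truncate it and append a short hash of the full value, so that distinct long inputs still map to distinct names. The sketch below only illustrates that idea. `shorten_to_label` is a hypothetical helper, not something Flyte provides, and this thread does not confirm how Propeller derives the volume name or whether any such shortening is applied.

```python
import hashlib

MAX_LABEL_LENGTH = 63  # Kubernetes limit quoted in the error above


def shorten_to_label(name: str, max_len: int = MAX_LABEL_LENGTH) -> str:
    """Hypothetical helper: return `name` unchanged if it fits; otherwise
    return a truncated prefix plus a 10-character SHA-256 digest of the
    full name. Assumes `name` already uses only valid label characters."""
    if len(name) <= max_len:
        return name
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:10]
    # Reserve room for the '-' separator and the digest, and avoid a
    # doubled '-' if the truncation point lands on one.
    prefix = name[: max_len - len(digest) - 1].rstrip("-")
    return f"{prefix}-{digest}"


# Volume name copied verbatim from the error message above.
long_name = "mfzg3otbo4ztu32fmnzgk4dtnvqw3ylhmvzdu4ltfvswc32ufuzdumzvgy2dgmzqgyzdanryhjzwky2smv1duxy"
short_name = shorten_to_label(long_name)
print(len(short_name), short_name)
```

The mapping is deterministic, so the same input always produces the same shortened name, which matters when a volume and its volumeMount must refer to each other by name.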
