KEP-2170: Create model and dataset initializers #2303
Conversation
Signed-off-by: Andrey Velichkevich <[email protected]>
Pull Request Test Coverage Report for Build 11517758882: Details
💛 - Coveralls
Should we consider unit or e2e tests for this?
Yeah, I will open a dedicated issue for it.
Co-authored-by: Kevin Hannon <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]>
pkg/initiailizer_v2/utils/utils.py
# Get DataClass config from the environment variables.
# Env names must be equal to the DataClass parameters.
def get_config_from_env(config) -> Dict[str, str]:
Do you want some typing hints for config?
Actually, I wasn't able to find a Python type that can represent a dataclass, as described here: https://stackoverflow.com/questions/54668000/type-hint-for-an-instance-of-a-non-specific-dataclass
Any ideas on how we can add the type hint for it @kannon92 ?
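One possible answer, sketched purely as an illustration (not from the thread): since dataclasses share no common base class, a structural `typing.Protocol` matching the generated `__dataclass_fields__` attribute can act as the hint. `DataclassInstance`, `ExampleConfig`, and `field_names` are hypothetical names for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Any, ClassVar, Dict, List, Optional, Protocol


class DataclassInstance(Protocol):
    # Every dataclass carries this generated attribute,
    # so the Protocol matches dataclasses structurally.
    __dataclass_fields__: ClassVar[Dict[str, Any]]


@dataclass
class ExampleConfig:  # hypothetical config, for illustration only
    storage_uri: str
    access_token: Optional[str] = None


def field_names(config: DataclassInstance) -> List[str]:
    return [f.name for f in fields(config)]


print(field_names(ExampleConfig(storage_uri="hf://model")))
```

Static checkers such as PyLance accept dataclass instances against this Protocol without naming any concrete config class.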
Ah I see. Then it’s fine to go without.
I’m not much of a Python dev these days so I don’t know about the typing for data classes.
Shall we use something like this?
config: Union[DataClassA, DataClassB]
I tried this and it won't work. E.g. I can see this error from PyLance:
Argument of type "type[HuggingFaceDatasetConfig]" cannot be assigned to parameter "config" of type "HuggingFaceDatasetConfig" in function "get_config_from_env"
"type[type]" is not assignable to "type[HuggingFaceDatasetConfig]"
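For what it's worth, that PyLance complaint arises because the parameter is annotated as an instance while the call site passes the class itself; annotating the parameter with `Type[...]` accepts the class. A minimal sketch with a simplified stand-in config (not the PR's actual code):

```python
import os
from dataclasses import dataclass, fields
from typing import Dict, Optional, Type


@dataclass
class HuggingFaceDatasetConfig:  # simplified stand-in for the real config
    storage_uri: str
    access_token: Optional[str] = None


def get_config_from_env(
    config: Type[HuggingFaceDatasetConfig],
) -> Dict[str, Optional[str]]:
    # Env names must match the dataclass field names, upper-cased.
    return {f.name: os.getenv(f.name.upper()) for f in fields(config)}


print(sorted(get_config_from_env(HuggingFaceDatasetConfig)))
```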
from dataclasses import dataclass, fields
from typing import Dict, Optional, Union
import os

@dataclass
class HuggingFaceModelInputConfig:
    storage_uri: str
    access_token: Optional[str] = None

def get_config_from_env(config: Union[HuggingFaceModelInputConfig]) -> Dict[str, str]:
    config_from_env = {}
    for field in fields(config):
        config_from_env[field.name] = os.getenv(field.name.upper())
    return config_from_env

cf1 = get_config_from_env(HuggingFaceModelInputConfig)
print(cf1)
This is working @andreyvelich
@deepanker13 Which type checker do you use in your IDE?
My VS Code complains with the error that I added above.
I am using PyLance VSCode extension: https://github.com/microsoft/pylance-release
    repo_id=model_uri,
    local_dir=constants.VOLUME_PATH_MODEL,
    allow_patterns=["*.json", "*.safetensors", "*.model"],
    ignore_patterns=["*.msgpack", "*.h5", "*.bin"],
We can consider ignoring ".pt" and ".pth" files as well.
@saileshd1402 Do we know what files have the .pt and .pth extensions?
They store model weights similarly to .safetensors and .bin, but are outdated now. For example, in meta-llama/Llama-3.1-8B there is a folder called "/original" where consolidated.00.pth stores the same data as the ".safetensors" files, but in the outdated format. I would like to know if others think those files are important, though.
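To make the trade-off concrete, here is a hedged sketch (not part of the PR) that previews which repository files a given allow/ignore pattern pair would keep, using fnmatch-style glob matching; the file names and the extended ignore list are illustrative.

```python
from fnmatch import fnmatch
from typing import List

ALLOW = ["*.json", "*.safetensors", "*.model"]
# Adds "*.pt" and "*.pth" to the existing ignore list, per the suggestion above.
IGNORE = ["*.msgpack", "*.h5", "*.bin", "*.pt", "*.pth"]


def kept_files(files: List[str]) -> List[str]:
    # A file is downloaded only if it matches an allow pattern
    # and no ignore pattern.
    return [
        f
        for f in files
        if any(fnmatch(f, p) for p in ALLOW)
        and not any(fnmatch(f, p) for p in IGNORE)
    ]


files = ["config.json", "model.safetensors", "original/consolidated.00.pth"]
print(kept_files(files))  # the outdated .pth checkpoint is filtered out
```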
@lizzzcai do you have any thoughts on whether we can exclude .pth and .pt files from model download?
I noticed that you suggested adding the same ignore_patterns to the KServe storage initializer: kserve/kserve#3584 (comment)
Hi @andreyvelich, I don't see an issue with excluding .pth, .pt, and .bin, as safetensors should be the preferred format and it has better security compared to the others. This assumes the model being downloaded provides the safetensors format (most of the popular models should).
If you want to support multiple formats, you can check how vLLM supports it.
I see, thanks for the info!
I think in the future we should follow the vLLM approach.
@@ -30,6 +30,14 @@ jobs:
    dockerfile: cmd/training-operator.v2alpha1/Dockerfile
    platforms: linux/amd64,linux/arm64,linux/ppc64le
    tag-prefix: v2alpha1
- component-name: model-initiailizer-v2
There is a typo in multiple places: initiailizer should be initializer.
Great catch! Let me fix that.
huggingface_hub.snapshot_download(
    repo_id=dataset_uri,
    repo_type="dataset",
    local_dir=constants.VOLUME_PATH_DATASET,
To speed things up, should we set max_workers equal to the number of files being downloaded? Currently it downloads 8 files in parallel.
Do we have any benchmarks showing that setting max_workers to the number of files speeds up download time?
What if we don't have enough CPUs for all the concurrent threads?
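One way to split the difference, sketched under the assumption that downloads are I/O bound; the helper name and the cap value are hypothetical, not from the PR:

```python
def bounded_max_workers(num_files: int, cap: int = 16) -> int:
    """Pick a max_workers value: one thread per file, up to a fixed cap.

    Downloads are I/O bound, so exceeding the CPU count is usually fine,
    but an uncapped thread-per-file policy would not scale to huge repos.
    """
    return max(1, min(num_files, cap))


print(bounded_max_workers(3), bounded_max_workers(500))
```

If adopted, this could feed snapshot_download's max_workers parameter, e.g. `max_workers=bounded_max_workers(num_files)`, though whether it beats the default of 8 would still need benchmarking.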
cmd/initiailizer_v2/model/Dockerfile
WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initiailizer_v2/model/requirements.txt .
Why is the folder named cmd? It contains Dockerfiles and requirements.txt.
We want to be consistent across all Training Operator components, like we do in Katib: https://github.com/kubeflow/katib/tree/master/cmd
E.g.:
cmd - contains binaries/Dockerfiles for execution.
pkg - contains the actual backend.
logging.info(f"Config for HuggingFace dataset initiailizer: {config_dict}")
self.config = HuggingFaceDatasetConfig(**config_dict)

def download_dataset(self):
Do we have unit tests for these, or would that be taken care of in e2e? Downloading models from HF had issues previously; it would be helpful to have it tested.
Nvmd, just realised we have: #2305
Signed-off-by: Andrey Velichkevich <[email protected]>
Force-pushed fa77453 to 5c39812 (Compare)
huggingface_hub.snapshot_download(
    repo_id=model_uri,
    local_dir=constants.VOLUME_PATH_MODEL,
    allow_patterns=["*.json", "*.safetensors", "*.model"],
For "*.safetensors" and "*.model" in the allow_patterns, I would say it works in most cases.
However, a model like mistralai/Mistral-7B-Instruct-v0.3 has consolidated.safetensors and tokenizer.model.v3 (the so-called v3 format in Mistral; check their download example here):
snapshot_download(repo_id="mistralai/Mistral-7B-Instruct-v0.3", allow_patterns=["params.json", "consolidated.safetensors", "tokenizer.model.v3"], local_dir=mistral_models_path)
In this case, downloading the above Mistral model with the current allow_patterns, the downloaded size will be 29 GB (double the actual size of 14.5 GB). You probably need some logic to handle the Mistral model.
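One possible shape for that logic, sketched with a hypothetical per-repo override table (the real implementation might inspect the repo's file listing instead of hard-coding names; the Mistral patterns are taken from their download example quoted above):

```python
from typing import Dict, List

DEFAULT_ALLOW = ["*.json", "*.safetensors", "*.model"]

# Hypothetical per-repo overrides for repos where the default
# patterns would pull duplicate weight files.
ALLOW_OVERRIDES: Dict[str, List[str]] = {
    "mistralai/Mistral-7B-Instruct-v0.3": [
        "params.json",
        "consolidated.safetensors",
        "tokenizer.model.v3",
    ],
}


def allow_patterns_for(repo_id: str) -> List[str]:
    # Fall back to the generic patterns for repos without an override.
    return ALLOW_OVERRIDES.get(repo_id, DEFAULT_ALLOW)


print(allow_patterns_for("mistralai/Mistral-7B-Instruct-v0.3"))
```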
Thanks for sharing! Let me add it to the TODO list.
I wonder if we can add a comment somewhere to let users know that we only support downloading models/datasets from HuggingFace for now?
It might be more user-friendly :)
Yes, we are planning to add the supported dataset and model providers to the website.
Signed-off-by: Andrey Velichkevich <[email protected]>
Force-pushed 2cf2c5e to 2e8a518 (Compare)
Are there any other comments before we can move forward with this initial PR?
@andreyvelich: GitHub didn't allow me to assign the following users: varshaprasad96, saileshd1402. Note that only kubeflow members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Let's address other issues in a follow up.
/lgtm
/approve
@@ -0,0 +1,13 @@
FROM python:3.11-alpine
Suggested change:
- FROM python:3.11-alpine
+ FROM python:3.11-slim-bookworm
@andreyvelich Could you use the Debian image, since Alpine has a performance penalty due to musl libc? Python still depends on C code.
Sure, let me create an issue.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Fixes: #2210
I created model and dataset initializers.
Initially, we will only support HF for demo purposes.
I will create a dedicated issue to support more providers.
/assign @kubeflow/wg-training-leads @varshaprasad96 @akshaychitneni @deepanker13 @helenxie-bit @Electronic-Waste @saileshd1402 @kannon92