Hello @kjsman,
this is more a feature proposal than an actual issue. Instead of requiring the user to download and unpack the tar file containing the weights and the vocabulary from your Hugging Face Hub repository, one can make the model_loader and the Tokenizer download and cache them directly.
For the first part, it only requires replacing the torch.load(...) call here (and in the other three functions in the same file) with a call that downloads and caches the weights first.
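Something along these lines should do it (just a sketch; the repo id is a placeholder, and `hf_hub_download` from the `huggingface_hub` package handles the download-and-cache part):

```python
from huggingface_hub import hf_hub_download
import torch


def load_weights(filename, device="cpu"):
    # hf_hub_download fetches the file once and caches it locally
    # (under ~/.cache/huggingface by default); later calls reuse the cache.
    path = hf_hub_download(repo_id="<your-hub-repo-id>", filename=filename)
    return torch.load(path, map_location=device)
```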
All it takes on your side is to upload the four .pt files to the Hugging Face Hub (not in a zipped file), and that's it.
As for the tokenizer, it just takes adding a default_bpe() method / function:
```python
import os
from functools import lru_cache
from urllib.request import urlretrieve


@lru_cache()
def default_bpe():
    # Use the vocabulary file shipped next to this module if it exists.
    path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz"
    )
    if os.path.exists(path):
        return path
    # Otherwise download it once from the original CLIP repository.
    # urlretrieve returns (filename, headers); we only need the filename.
    filename, _ = urlretrieve(
        "https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz?raw=true",
        "bpe_simple_vocab_16e6.txt.gz",
    )
    return filename
```
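The Tokenizer could then use default_bpe() as its default vocabulary path, the same way the original CLIP tokenizer does, so the file is fetched lazily on first use.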
Another option, if you prefer to keep your vocab.json and merges.txt, is to upload them as well to the Hugging Face Hub (not in a tar file), or directly to GitHub like the original repository does with its vocab.
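That alternative would look roughly like this (again a sketch, with a placeholder repo id):

```python
# Fetch vocab.json / merges.txt from the Hub instead of bundling them in a tar file.
from huggingface_hub import hf_hub_download

vocab_path = hf_hub_download(repo_id="<your-hub-repo-id>", filename="vocab.json")
merges_path = hf_hub_download(repo_id="<your-hub-repo-id>", filename="merges.txt")
```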
If you like it, I will open a new PR; otherwise, please let me know if you have a better idea, or close this issue if you are not interested in this feature 😄
First of all, thank you for your idea! The notification email bounced from my inbox, so I couldn't reply quickly... 😓
I agree that we can do better for downloading/loading models, but I want to keep the data/ directory: I think it's straightforward for users who want to {look at, change, load a finetuned, finetune the} model (yeah, we don't support conversion and training now, but we might someday).
Maybe we can:

- Create a function that downloads models from the default CDN and saves them to a given path or data/
- Modify the model loader functions to do the following (see the sketch after this list):
  - take a checkpoint path as a parameter
  - if the checkpoint path is not given, try to get it from the data/ directory
  - if the model file does not exist in the data/ directory, download it
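A minimal sketch of that flow (the CDN URL, file names, and function names are all placeholders, not the project's actual API):

```python
from pathlib import Path
from urllib.request import urlretrieve

import torch

# Placeholder CDN base URL; the real default CDN is not decided yet.
DEFAULT_CDN = "https://cdn.example.com/stable-diffusion-pytorch"
DATA_DIR = Path("data")


def download_checkpoint(filename, dest_dir=DATA_DIR):
    """Download a checkpoint from the default CDN into dest_dir (data/ by default)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename
    if not target.exists():
        urlretrieve(f"{DEFAULT_CDN}/{filename}", str(target))
    return target


def load_model(checkpoint_path=None, filename="clip.pt", device="cpu"):
    # 1. Use the explicit checkpoint path if one is given.
    # 2. Otherwise look in data/, downloading the file there if it is missing.
    path = Path(checkpoint_path) if checkpoint_path else download_checkpoint(filename)
    return torch.load(path, map_location=device)
```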
I think we should use the same approach for the tokenizer. Yeah, everyone uses CLIP's default tokenizer without editing it, but:

- anyway, I think we should be consistent about all loadable data
- treating it differently (e.g. downloading on the fly, caching it somewhere) would need more code
I'll upload the checkpoint files in the near future and mention you; I might change some structures, so I'm not sure I can do it right now.
Hello @kjsman,
thanks for the answer.
I guess I will wait for the checkpoint files so that we can discuss possible enhancements more concretely, if you like 😄