Clip_back H14: making it work on an SSD < 2TB; making it fast #304
About correctness: did you get the metadata from the same source as the index?

About English only: you could change the clip back code to filter out result ids above 2B.
On Sun, Aug 13, 2023, 08:05 Chris wrote:

> Goal: My end goal is to use the clip back-end as part of an image-captioning pipeline (it will need to process many millions of images). The constraint I'm presently working under is that my SSD is only 2TB in capacity, so my intent is to use only the English metadata in conjunction with the H14 index.
>
> Hardware is as follows:
>
> CPU: 13900K
> SSD: Samsung 990 Pro 2TB
> GPU: 4090
> RAM: 96GB (@6400MHz)
>
> I have managed to get the clip back-end up and running, but I still need to figure out how to align the laion2B English-only metadata with the H-14 5B index. The guide indicates that the no-language and multi-language metadata should also be used, but I simply don't have the storage capacity for them.
>
> I converted the metadata I did get into a single pyarrow file (as the guide instructs), and the backend runs, but the results returned by front-end searches are mostly incorrect. Any guidance on this would be very much appreciated.
>
> I'm also interested in any tweaks that might improve performance, e.g. reorder_metadata_by_ivf_index (I wonder if this would address my metadata dilemma...).
>
> Thanks in advance for any help or insights.
As far as I can tell, they're the correct ones; I followed the H-14 guide.

This sounds like the perfect fix; some of my results are correct (about a third of them), so I suspect this will work. Any pointers or specifics on achieving this most succinctly? Thanks very much in any case!
My suggestion is to filter as soon as possible, right after the search, i.e. here: https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/clip_back.py#L366

Filtering the index itself is also possible, but it would require you to process all the index files and remove any ids > 2B.
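Something like the following is a minimal sketch of that post-search filter, assuming the search step hands you numpy arrays of distances and ids (the names and the exact integration point in clip_back.py are illustrative, not the actual code):

```python
import numpy as np

# Assumption from this thread: laion2B-en ids occupy the low end of the
# combined 5B index, so anything >= 2B belongs to the multi-language or
# no-language shards whose metadata isn't on disk in this setup.
ENGLISH_ID_LIMIT = 2_000_000_000

def keep_english_results(distances: np.ndarray, ids: np.ndarray):
    """Drop search results whose ids fall outside the English metadata range."""
    mask = ids < ENGLISH_ID_LIMIT
    return distances[mask], ids[mask]
```

Applied right after the knn search returns, this keeps only results the English-only metadata can resolve, at the cost of sometimes returning fewer than the requested number of results (over-fetching and filtering down can compensate).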
I narrowed down my issue a bit further after implementing your suggestion, but I'm unsure how to proceed: after setting an id threshold, it turns out that any id above roughly 100 million is misaligned in the front-end / ClipClient results. I can't be sure whether it's malformed index files or a bad metadata arrow file, but I did re-download and re-merge both of them (one at a time) as a sanity check. I've been trying many things in hopes that I missed a step, but no luck so far.

One of my suspicions is that a recent change to the download instructions may have messed something up, specifically the change to the output filenames (which was made to prevent aria2c's default names from being incorrect). The index files are in fact .index files, so perhaps saving them under .parquet names before merging them could affect the alignment somehow?

It seems my issue is with metadata alignment; the only other suspicion I have right now is that by simply not having the rest of the metadata (i.e. compiling only the English metadata), the alignment is somehow thrown off. Hoping to get your thoughts on this before I potentially re-download the index files with the correct naming scheme.

In case it helps, my 0.arrow (English metadata) file is 385.8 GB.
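For reference, the re-merge step is essentially the following. This is a minimal sketch assuming pyarrow and the metadata_*.parquet shards in the current directory; it is not necessarily the guide's exact conversion step:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

# Merge the metadata parquet shards, in filename order, into one arrow file.
# Shard order must match the id order of the index, or results misalign.
shards = sorted(Path(".").glob("metadata_*.parquet"))
schema = pq.read_schema(shards[0])

with pa.OSFile("0.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for shard in shards:
            writer.write_table(pq.read_table(shard).cast(schema))
```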
I managed to get everything working and aligned. Although I'm not positive which of the following was the fix (mostly due to my inability to systematically test, because of storage capacity limits), here is how I modified the download steps from the guide. The English metadata download (along with the other metadata downloads) becomes:

```
for i in {0000..2313}; do aria2c -x 16 https://huggingface.co/datasets/laion/laion2b-en-vit-h-14-embeddings/resolve/main/metadata/metadata_$i.parquet -o metadata_$i.parquet; done
```

Other than that, a final sanity check that may have been at play was validating that I actually had all 2314 of the English metadata files before compiling them (see the sketch below). After the first download of the datasets I stopped checking whether they were all there (I assumed aria2 was fool-proof), but on this last attempt I checked and found a few missing files.

I'll make a pull request that adds a warning advising people to check for missing files, in addition to the above changes, just to keep consistent with the database filenames.

I'm still interested in figuring out how to maximize clip back's performance for my use case, so I'll leave this issue open and eventually complete it with my findings (or perhaps ask some related questions :D)

Thanks again for your help @rom1504, I'm looking forward to using this for big data jobs!
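A minimal version of that completeness check, assuming the shards sit in the current directory and use the zero-padded names from the command above:

```python
from pathlib import Path

# The English metadata set has 2314 shards, numbered 0000 through 2313.
expected = [f"metadata_{i:04d}.parquet" for i in range(2314)]
missing = [name for name in expected if not Path(name).exists()]

if missing:
    print(f"{len(missing)} shard(s) missing, e.g. {missing[:5]}")
else:
    print("all 2314 metadata shards present")
```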