Speeding up clip-retrieval back for large number of images #213
Comments
Hey, glad you got things working locally! What kind of hardware do you have? Probably the best way to speed things up is #125, so that the reordering option will work with arrow. You may also disable safety and near-deduplication. |
I'm currently keeping the index files on EFS - could that be a source of problems? I can move them to an SSD if that would result in better performance. |
So #125 should be possible with this in the config -
Right? |
No, it needs new code I'm afraid. Yes, prefer using an SSD. |
Sorry, I'm a little confused - where exactly does the
Also, regarding speedup: would just using the original config with an SSD yield that much benefit (I believe 20 query/s is mentioned in the benchmarking section)? No GPUs required? |
reorder_metadata_by_ivf_index cannot currently help with the arrow files.
I advise 2. Regarding speed-up: the number in the readme is for a smaller index. However, it is indeed possible to get good speeds; it will need some work, though. Here are the slow things:
|
GPU won't help without batching |
I see. Thank you so much for all your help! I will look into your suggestions and try to implement at least one. One last clarification about the original post above : the benchmarking section mentions "turning off memory mapping options can also speed up requests, at the cost of high ram usage". How does this work? |
Turning off memory mapping means putting the whole index in RAM. For an 800GB index, that would mean either getting a machine with a lot of RAM or splitting across many machines. |
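For illustration, this is roughly what that difference looks like at the faiss level (a minimal sketch; clip-back exposes this through its enable_faiss_memory_mapping option, and the file name here is just an example):

```python
import faiss

# Memory-mapped: the OS pages index data in from disk on demand.
# Low RAM usage, but every query pays for disk reads.
index_mmap = faiss.read_index("image.index", faiss.IO_FLAG_MMAP)

# Fully loaded: the whole index is read into RAM up front.
# Fast queries, but an 800GB index needs roughly that much memory.
index_ram = faiss.read_index("image.index")
```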
Got it. I started by rebuilding the hdf5 collection with the reorder_metadata_by_ivf_index option. It does make responses faster, but the responses only return the 'id' & 'similarity' columns even though I set "columns_to_return": ["url", "caption", "NSFW", "id", "similarity"]. Should I reduce it down to fewer columns (like only ["url", "caption"])? Or does the reordering limit it to returning only id & similarity? |
If you get only id and similarity, it means the metadata is not used at all (that's what gets stored in hdf5/arrow/parquet). I figure you may have disabled it?
|
No, I don't think I explicitly disabled it. Is there a flag that does that? |
To enable it, you need to either use --enable_hdf5 True or use arrow, have a metadata collection that contains all items, and have no errors in the console. |
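(For reference, assuming the invocation style from the README, that would be something along the lines of `clip-retrieval back --port 1234 --indices-paths indices_paths.json --enable_hdf5 True` - treat the exact flags as an assumption and check `clip-retrieval back --help` against your installed version.)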
I do have
... and running it with |
I tried to call the MetadataService explicitly using the ids returned by the KnnService, but it doesn't return any metadata for any of the listed IDs, using the above config. However, if I switch to "use_arrow": True (and "enable_hdf5": False), the MetadataService does return the requested metadata.
I guess that's why this check - https://github.com/rom1504/clip-retrieval/blob/19c91856f9456463b00bd9162389266714f04cb7/clip_retrieval/clip_back.py#L407 - fails and I get only 'id' & 'similarity' in the output. |
Can you check if the hdf5 file has been created in the folder?
|
Yes, I see |
For testing, I'm querying it like so -

```python
payload = {
    "text": "red car",
    "modality": "image",
    "num_images": 20,
    "indice_name": "laion5B",
    "use_mclip": False,
    "deduplicate": True,
    "use_safety_model": True,
    "use_violence_detector": True,
    "aesthetic_score": "",
    "aesthetic_weight": 0.5,
}

response = requests.post(
    "http://127.0.0.1:1234/knn-service",
    data=json.dumps(payload),
)
```
|
How big is the hdf5 file? It should be around 800GB.
|
Can you try to open it manually?
|
Oh hmm. The ivf_old_to_new_mapping.npy is around 42G and the metadata_reordered is only a few KB. What could the cause for that be? |
Sounds like the reordering failed. Try to delete the file and retry.
Also, you need to be using the parquet files from https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/laion2B-multi-metadata/ (and likewise for 2B-en and 1B-nolang), since these are the ones in the same order as the embeddings and index.
Have you been using those?
|
Ahh that's the problem. I was still pointing to the |
One comment regarding the metadata parquet files - when I downloaded them, it was more manageable and informative to keep them in their respective folders (laion1B-nolang, laion2B-en & laion2B-multi) rather than dumping them all into one metadata folder. Would you consider adapting the reordering code to align with such a folder structure, rather than making the user put in extra effort (e.g. renaming the parquet files so they don't overwrite each other)? I think it would just require updating -
...to |
Yeah, absolutely. I think using something like this https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/get_file_list.py#L38 should do the trick (this is what gets used in autofaiss, so it's the right order). |
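A minimal sketch of that idea (the folder names are the ones mentioned above; the global sort is the important part, since alphabetical file order is what keeps the metadata aligned with the embeddings and index):

```python
from pathlib import Path

# Collect parquet files from the three per-dataset folders into one globally
# sorted list, mirroring embedding_reader's get_file_list behaviour.
folders = ["laion1B-nolang", "laion2B-en", "laion2B-multi"]
files = sorted(str(p) for folder in folders for p in Path(folder).rglob("*.parquet"))
```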
@varadgunjal hey, just wondering - did you have any success? |
@rom1504 I've been experimenting with this for the past 2 days and have a few notes -
|
About 1: yes, I'm certain that ivf_old_to_new_mapping.npy is on an SSD. It is ~42GB. I did benchmark it. It's similar to what I observed earlier: the index returns super quickly (a few ms) with the |
Ok. Curious what you see with arrow.
This problem of efficiently mapping an incremental id to a string is surprisingly hard. At some point I had benchmarked all the popular on-disk KV stores (leveldb, rocksdb, ...) and didn't find them faster than hdf5/arrow. However, reordering was faster for me.
I think we should maybe set up an easily reproducible benchmark script, maybe independent from this repo, so we can easily benchmark all possible solutions. It's a much simpler problem than approximate knn. There must be a solution. |
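As a starting point for such a benchmark, here is a rough sketch of timing random-access reads (the file name "metadata.hdf5" and the "url" dataset name are assumptions - adapt them to the collection being tested):

```python
import time

import h5py
import numpy as np

# Time N random metadata lookups against an hdf5 collection.
N = 1_000
with h5py.File("metadata.hdf5", "r") as f:
    dset = f["url"]
    ids = np.sort(np.random.randint(0, len(dset), size=N))  # sorted improves locality
    start = time.perf_counter()
    for i in ids:
        _ = dset[i]
    elapsed = time.perf_counter() - start
    print(f"{N} random reads in {elapsed:.2f}s ({N / elapsed:.0f} reads/s)")
```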
Out of curiosity, do you have any numbers on how much faster reordering was for you? I'm using a gp3 SSD on AWS - not sure if there's a better-suited one? BTW, for reference, the |
I wanted to also check - which metadata column does the returned id map to? I ask because, as an alternative while I'm debugging this, I was thinking I will make do with |
id maps to the line number with the metadata files sorted in alphabetical order |
If your goal is to do a lot of queries, you can definitely do a full scan of the metadata a single time instead of using random access |
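For instance, a sketch of such a single-pass scan (assuming, per the above, that id is the global line number across the alphabetically sorted metadata files; the column names are just examples):

```python
import pyarrow.parquet as pq

def full_scan(metadata_files, wanted_ids, columns=("url", "caption")):
    """One sequential pass over alphabetically sorted metadata files,
    collecting the rows whose global line number is in wanted_ids."""
    wanted = sorted(set(wanted_ids))
    results, offset = {}, 0
    for path in metadata_files:
        table = pq.read_table(path, columns=list(columns))
        n = table.num_rows
        for i in wanted:
            if offset <= i < offset + n:
                results[i] = {c: table[c][i - offset].as_py() for c in columns}
        offset += n
    return results
```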
I see. Thanks! These are line numbers of the metadata files from the-eye, right? Because I noticed they are set up differently than the ones on HF (and, as I was mentioning on Discord, have a smaller total number of samples). |
It's the ones next to the embeddings; see the table in the download section here: https://laion.ai/blog/laion-5b/ |
Ahh yes. Those are the ones at the-eye as well. Thank you! I'll run this experiment and see if it works for my use case. |
Do these ids / line numbers go from 0 to 5.85B in the order 1B-nolang, 2B-en, 2B-multi? So e.g., if I get a returned id 2370603503, I should be looking in the 2B-en metadata files, since it is greater than the ~1.2B in 1B-nolang? And the line number would be approximately 2370603503 - 1.23B? |
Yes |
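In code, that arithmetic would look something like the sketch below. The per-dataset counts here are rough placeholders (substitute the exact counts from the metadata you downloaded), and note, per the exchange just after this, that it only applies to the original, non-reordered collection:

```python
# Placeholder per-dataset sample counts, in alphabetical folder order.
COUNTS = [
    ("laion1B-nolang", 1_230_000_000),  # placeholder, ~1.23B
    ("laion2B-en", 2_320_000_000),      # placeholder
    ("laion2B-multi", 2_260_000_000),   # placeholder
]

def locate(sample_id: int):
    """Map a global id to (dataset name, line number within that dataset)."""
    offset = 0
    for name, count in COUNTS:
        if sample_id < offset + count:
            return name, sample_id - offset
        offset += count
    raise ValueError(f"id {sample_id} out of range")

# e.g. locate(2370603503) -> ("laion2B-en", 2370603503 - 1_230_000_000)
```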
This doesn't seem to hold up from my initial tests. Here's an example of what I tried, maybe you can point out where I'm wrong -
Am I going about this correctly? |
You are now using the non re-ordered collection, right? |
Reordered collection is using a completely different ordering |
Well that was super dumb of me. Thanks for pointing that out! |
I'm experimenting with retrieving a large number of images (providing num_images as 10-20k in the query). However, I notice that the response is super slow. For 2k images it took ~38s to complete. To speed it up, I tried some of the suggestions from the README -

1. Turning off memory mapping (setting enable_faiss_memory_mapping, use_arrow and enable_hdf5 to false), but then it throws an error saying RuntimeError: Error in faiss::Index* faiss::read_index(faiss::IOReader*, int) at /project/faiss/faiss/impl/index_read.cpp:527: Error: 'ret == (1)' failed: read error in /efs/data/laion-5b-index/image.index: 0 != 1 (Is a directory). Did I misunderstand "turn off memory mapping"?
2. From the Options section of clip back, I tried to set reorder_metadata_by_ivf_index to true (while keeping enable_faiss_memory_mapping and use_arrow to true as before). But this gives the following stack trace -