Speeding up clip-retrieval back for large number of images #213
Comments
Hey, glad you got things working locally! What kind of hardware do you have? Probably the best way to speed things up is #125, so that the reordering option will work with arrow. You may also disable safety and near-deduplication. |
I'm currently keeping the index files on EFS - could that be a source of problems? I can move them to an SSD if that would result in better performance. |
So #125 should be possible with this in the config -
Right? |
No, it needs new code I'm afraid. Yes, prefer using an SSD. |
Sorry, I'm a little confused - where exactly does the
Also, regarding speedup: would just using the original config with an SSD yield that much benefit (I believe 20 query/s is mentioned in the benchmarking section)? No GPUs required? |
reorder_metadata_by_ivf_index cannot currently help with the arrow files.
I advise 2. Regarding speed-up: the number in the readme is for a smaller index. However, it is indeed possible to get good speeds; it will need some work, though. Here are the slow things:
|
GPU won't help without batching |
I see. Thank you so much for all your help! I will look into your suggestions and try to implement at least one. One last clarification about the original post above : the benchmarking section mentions "turning off memory mapping options can also speed up requests, at the cost of high ram usage". How does this work? |
Turning off memory mapping means putting the whole index in RAM. For an 800GB index, that would mean either getting a machine with a lot of RAM or splitting across many machines. |
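For illustration, this is roughly what that difference looks like at the faiss level (a minimal sketch; clip-back exposes this through its enable_faiss_memory_mapping option, and the file name here is just an example):

```python
import faiss

# Memory-mapped: the OS pages index data in from disk on demand.
# Low RAM usage, but every query pays for disk reads.
index_mmap = faiss.read_index("image.index", faiss.IO_FLAG_MMAP)

# Fully loaded: the whole index is read into RAM up front.
# Fast queries, but an 800GB index needs roughly that much memory.
index_ram = faiss.read_index("image.index")
```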
Got it. I started by rebuilding the hdf5 collection with the reorder_metadata_by_ivf_index option. It does make responses faster, but the responses only return the 'id' & 'similarity' columns even though I set "columns_to_return": ["url", "caption", "NSFW", "id", "similarity"]. Should I reduce it down to fewer columns (like only ["url", "caption"])? Or does the reordering limit it to returning only id & similarity? |
If you get only id and similarity, it means the metadata is not used at all (that's what gets stored in hdf5/arrow/parquet). I figure you may have disabled it?
|
No, I don't think I explicitly disabled it. Is there a flag that does that? |
To enable it, you need to either use --enable_hdf5 True or use arrow, have a metadata collection that contains all items, and have no errors in the console. |
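(For reference, assuming the invocation style from the README, that would be something along the lines of `clip-retrieval back --port 1234 --indices-paths indices_paths.json --enable_hdf5 True` - treat the exact flags as an assumption and check `clip-retrieval back --help` against your installed version.)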
I do have
... and running it with |
I tried to call the MetadataService explicitly using the ids returned by the KnnService, but it doesn't return any metadata for any of the listed IDs, using the above config. However, if I switch to "use_arrow": True (and "enable_hdf5": False), the MetadataService does return the requested metadata.
I guess that's why this check - https://github.com/rom1504/clip-retrieval/blob/19c91856f9456463b00bd9162389266714f04cb7/clip_retrieval/clip_back.py#L407 - fails and I get only 'id' & 'similarity' in the output. |
Can you check if the hdf5 file has been created in the folder?
|
Yes, I see |
For testing, I'm querying it like so -

```python
payload = {
    "text": "red car",
    "modality": "image",
    "num_images": 20,
    "indice_name": "laion5B",
    "use_mclip": False,
    "deduplicate": True,
    "use_safety_model": True,
    "use_violence_detector": True,
    "aesthetic_score": "",
    "aesthetic_weight": 0.5,
}

response = requests.post(
    "http://127.0.0.1:1234/knn-service",
    data=json.dumps(payload),
)
```
|
How big is the hdf5 file? It should be around 800GB.
|
Can you try to open it manually?
|
Oh hmm. The ivf_old_to_new_mapping.npy is around 42G and the metadata_reordered is only a few KB. What could the cause for that be? |
Sounds like the reordering failed. Try to delete the file and retry.
Also, you need to be using the parquet files from https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion2B-multi/laion2B-multi-metadata/ (and likewise for 2B-en and 1B-nolang), since these are the ones in the same order as the embeddings and index.
Have you been using those?
|
Ahh that's the problem. I was still pointing to the |
One comment regarding the metadata parquet files - when I downloaded them, it was more manageable and informative to keep them in their respective folders (laion1B-nolang, laion2B-en & laion2B-multi) rather than dumping them all into one metadata folder. Would you consider adapting the reordering code to align with such a folder structure, rather than making the user put in extra effort (e.g. renaming the parquet files so they don't overwrite each other)? I think it would just require updating -
...to |
Yeah, absolutely. I think using something like this https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/get_file_list.py#L38 should do the trick (this is what gets used in autofaiss, so it's the right order). |
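A minimal sketch of that idea (the folder names are the ones mentioned above; the global sort is the important part, since alphabetical file order is what keeps the metadata aligned with the embeddings and index):

```python
from pathlib import Path

# Collect parquet files from the three per-dataset folders into one globally
# sorted list, mirroring embedding_reader's get_file_list behaviour.
folders = ["laion1B-nolang", "laion2B-en", "laion2B-multi"]
files = sorted(str(p) for folder in folders for p in Path(folder).rglob("*.parquet"))
```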
@varadgunjal hey, just wondering - did you have any success? |
@rom1504 I've been experimenting with this for the past 2 days and have a few notes -
|
About 1: yes, I'm certain that ivf_old_to_new_mapping.npy is on an SSD. It is ~42GB. I did benchmark it. It's similar to what I observed earlier: the index returns super quickly (a few ms) with the |
Ok. Curious what you see with arrow.
This problem of efficiently mapping an incremental id to a string is surprisingly hard. At some point I had benchmarked all the popular on-disk KV stores (leveldb, rocksdb, ...) and didn't find them faster than hdf5/arrow. However, reordering was faster for me.
I think we should maybe set up an easily reproducible benchmark script, maybe independent from this repo, so we can easily benchmark all possible solutions. It's a much simpler problem than approximate knn. There must be a solution. |
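As a starting point for such a benchmark, here is a rough sketch of timing random-access reads (the file name "metadata.hdf5" and the "url" dataset name are assumptions - adapt them to the collection being tested):

```python
import time

import h5py
import numpy as np

# Time N random metadata lookups against an hdf5 collection.
N = 1_000
with h5py.File("metadata.hdf5", "r") as f:
    dset = f["url"]
    ids = np.sort(np.random.randint(0, len(dset), size=N))  # sorted improves locality
    start = time.perf_counter()
    for i in ids:
        _ = dset[i]
    elapsed = time.perf_counter() - start
    print(f"{N} random reads in {elapsed:.2f}s ({N / elapsed:.0f} reads/s)")
```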
Out of curiosity, do you have any numbers on how much faster reordering was for you? I'm using a gp3 SSD on AWS - not sure if there's a better-suited one? BTW, for reference, the |
I wanted to also check - which metadata column does the returned id map to? I ask because, as an alternative while I'm debugging this, I was thinking I will make do with |
id maps to the line number with the metadata files sorted in alphabetical order |
If your goal is to do a lot of queries, you can definitely do a full scan of the metadata a single time instead of using random access |
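For instance, a sketch of such a single-pass scan (assuming, per the above, that id is the global line number across the alphabetically sorted metadata files; the column names are just examples):

```python
import pyarrow.parquet as pq

def full_scan(metadata_files, wanted_ids, columns=("url", "caption")):
    """One sequential pass over alphabetically sorted metadata files,
    collecting the rows whose global line number is in wanted_ids."""
    wanted = sorted(set(wanted_ids))
    results, offset = {}, 0
    for path in metadata_files:
        table = pq.read_table(path, columns=list(columns))
        n = table.num_rows
        for i in wanted:
            if offset <= i < offset + n:
                results[i] = {c: table[c][i - offset].as_py() for c in columns}
        offset += n
    return results
```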
I see. Thanks! These are line numbers of the metadata files from the-eye, right? Because I noticed they are set up differently than the ones on HF (and, as I was mentioning on Discord, have a smaller total number of samples). |
It's the ones next to the embeddings; see the table in the download section here: https://laion.ai/blog/laion-5b/ |
Ahh yes. Those are the ones at the-eye as well. Thank you! I'll run this experiment and see if it works for my use case. |
Do these ids / line numbers go from 0 to 5.85B in the order 1B-nolang, 2B-en, 2B-multi? So e.g., if I get a returned id 2370603503, I should be looking in the 2B-en metadata files, since it is greater than the ~1.2B in 1B-nolang? And the line number would be approximately 2370603503 - 1.23B? |
Yes |
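In code, that arithmetic would look something like the sketch below. The per-dataset counts here are rough placeholders (substitute the exact counts from the metadata you downloaded), and note, per the exchange just after this, that it only applies to the original, non-reordered collection:

```python
# Placeholder per-dataset sample counts, in alphabetical folder order.
COUNTS = [
    ("laion1B-nolang", 1_230_000_000),  # placeholder, ~1.23B
    ("laion2B-en", 2_320_000_000),      # placeholder
    ("laion2B-multi", 2_260_000_000),   # placeholder
]

def locate(sample_id: int):
    """Map a global id to (dataset name, line number within that dataset)."""
    offset = 0
    for name, count in COUNTS:
        if sample_id < offset + count:
            return name, sample_id - offset
        offset += count
    raise ValueError(f"id {sample_id} out of range")

# e.g. locate(2370603503) -> ("laion2B-en", 2370603503 - 1_230_000_000)
```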
This doesn't seem to hold up from my initial tests. Here's an example of what I tried, maybe you can point out where I'm wrong -
Am I going about this correctly? |
You are now using the non re-ordered collection, right? |
Reordered collection is using a completely different ordering |
Well that was super dumb of me. Thanks for pointing that out! |
I'm experimenting with retrieving a large number of images (providing num_images as 10-20k in the query). However, I notice that the response is super slow. For 2k images it took ~38s to complete. To speed it up, I tried some of the suggestions from the README -

1. Turning off memory mapping (setting enable_faiss_memory_mapping, use_arrow and enable_hdf5 to false), but then it throws an error saying RuntimeError: Error in faiss::Index* faiss::read_index(faiss::IOReader*, int) at /project/faiss/faiss/impl/index_read.cpp:527: Error: 'ret == (1)' failed: read error in /efs/data/laion-5b-index/image.index: 0 != 1 (Is a directory). Did I misunderstand "turn off memory mapping"?
2. From the Options section of clip back, I tried to set reorder_metadata_by_ivf_index to true (while keeping enable_faiss_memory_mapping and use_arrow to true as before). But this gives the following stack trace -