Switch nightly benchy to more realistic Cohere/wikipedia-22-12-en-embeddings vectors #256
I attempted to follow the ...
(Note that the nightly benchy only does indexing, so I really only need the first file.) But this apparently consumes gobs of RAM, and the Linux OOM killer killed it! Is this expected? I can run this on a beefier machine if need be (current machine has "only" 256 GB and no swap) for this one-time generation of vectors ... Maybe ...
Oooh this ...
OK, well ... With this change to do chunking into 1M-vector blocks when writing the index-time inferred vectors, I was able to run the tool! Full output below:
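A minimal sketch of the chunked-writing idea described above (hypothetical helper, not the actual luceneutil change): stream float32 vectors to disk in fixed-size blocks, so at most one block's worth of encoded bytes is buffered in RAM at a time:

```python
import struct

def write_vectors_chunked(out, vector_iter, dims, chunk_size=1_000_000):
    """Stream float32 vectors to a binary file-like object, buffering at
    most chunk_size vectors' worth of bytes in memory at a time."""
    buf = bytearray()
    count = 0
    for vec in vector_iter:
        # little-endian float32, matching a flat .vec layout
        buf += struct.pack(f"<{dims}f", *vec)
        count += 1
        if count % chunk_size == 0:
            out.write(buf)   # flush one full chunk
            buf.clear()
    if buf:
        out.write(buf)       # flush the final partial chunk
    return count
```

Writing to a real file is then just `with open(path, "wb") as out: write_vectors_chunked(out, vecs, 768)`; peak memory is bounded by `chunk_size * dims * 4` bytes regardless of the total vector count.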
It produced a large ...
Next I'll try switching to this source for nightly benchy. I'll also publish this on ...
Hmm, except, that file is too large?
It's 159 GB but should be ~77 GB? Maybe my "chunking" is buggy :)
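As a sanity check on those sizes: a flat float32 vector file should be exactly numVectors × dims × 4 bytes, so a file roughly 2× the expected size suggests each vector was written twice, or written as float64. A back-of-the-envelope sketch (the ~27M vector count is an assumption chosen to match the ~77 GB figure, not a number from the dataset docs):

```python
def expected_vec_bytes(num_vectors, dims, bytes_per_value=4):
    """Size of a flat binary vector file: one value per dimension per vector."""
    return num_vectors * dims * bytes_per_value

# assumed ~27M passages at 768 float32 dims:
size_gib = expected_vec_bytes(27_000_000, 768) / 2**30   # ~77 GiB
double_gib = expected_vec_bytes(27_000_000, 768, 8) / 2**30  # float64: ~154 GiB
```

The float64 (or doubled-write) case lands right around the observed 159 GB, which fits the "buggy chunking" theory.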
OK I think these are ...
And I think ...
Oooh this ...
OK I made this change and kicked off ...
OK hmm, scratch that; I see from the already loaded features that ...
OK! Now I think the issue is in ... So now I'm testing this:
OK the above change seemed to have worked (I just pushed it)! I now see these vector files:
Now I will try to confirm their recall seems sane, and then switch nightly to them.
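For reference, "recall seems sane" here means checking how much of the true (brute-force) top-k the approximate HNSW search recovers. A toy pure-Python sketch of that check (not the luceneutil harness; dot-product scoring assumed):

```python
def brute_force_topk(query, vectors, k):
    """Exact top-k neighbor ids by descending dot product (the ground truth)."""
    scored = sorted(range(len(vectors)),
                    key=lambda i: -sum(q * v for q, v in zip(query, vectors[i])))
    return scored[:k]

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```

In a real check you would average `recall_at_k` over many query vectors; a value far below the expected ~0.8-0.95 range for typical HNSW settings would suggest something is off with the vectors or the similarity function.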
OK I think the next wrinkle here is ... to fix ...
I think we can modify VectorDictionary to accept a ...
otherwise you could simply select some random vector every time you see a vector-type query task?? But I would expect some vectors to behave differently from others? Not sure ...
I was finally able to index/search using these Cohere vectors, and the profiler output is sort of strange. This is CPU:
and this is HEAP:
Why are we reading individual bytes so intensively? And why is lock acquisition the top HEAP object creator!?
Here's the ...
More thread context for the CPU profiling (full output quoted in the reply below). Curious that readVInt, when seeking to load a vector (?), is the 2nd hotspot?
VInts are used to encode the HNSW graph, so it looks like decoding the graph is where that is happening (via org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()).
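A hedged sketch of the VInt encoding Lucene uses (7 data bits per byte, low-order group first, high bit as a continuation flag), which also explains the byte-at-a-time `readByte` hotspot: `DataInput#readVInt` must inspect each byte before knowing whether to read another:

```python
def write_vint(value):
    """Encode a non-negative int as a Lucene-style VInt:
    7-bit groups, low-order first, high bit set on all but the last byte."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # continuation bit set
        value >>= 7
    out.append(value)
    return bytes(out)

def read_vint(buf, pos=0):
    """Decode one VInt, reading one byte at a time like DataInput#readVInt.
    Returns (value, next_position)."""
    value, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:        # high bit clear -> last byte
            return value, pos
        shift += 7
```

Small deltas (the common case for neighbor lists) fit in one byte, which is why the format is compact, but every decoded int costs at least one bounds-checked single-byte read.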
On Mon, Jun 10, 2024 at 2:12 PM Michael McCandless wrote:
More thread context for the CPU profiling:
PROFILE SUMMARY from 10264 events (total: 10264)
tests.profile.mode=cpu
tests.profile.count=50
tests.profile.stacksize=8
tests.profile.linenumbers=false
PERCENT CPU SAMPLES STACK
12.59% 1292 jdk.incubator.vector.FloatVector#reduceLanesTemplate()
at jdk.incubator.vector.Float256Vector#reduceLanes()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
at org.apache.lucene.util.VectorUtil#dotProduct()
at org.apache.lucene.index.VectorSimilarityFunction$2#compare()
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
7.16% 735 org.apache.lucene.store.DataInput#readVInt()
at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader$OffHeapHnswGraph#seek()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#graphSeek()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
4.25% 436 jdk.incubator.vector.FloatVector#lanewiseTemplate()
at jdk.incubator.vector.Float256Vector#lanewise()
at jdk.incubator.vector.Float256Vector#lanewise()
at jdk.incubator.vector.FloatVector#fma()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#fma()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody()
at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProduct()
at org.apache.lucene.util.VectorUtil#dotProduct()
4.21% 432 jdk.internal.misc.ScopedMemoryAccess#getByteInternal()
at jdk.internal.misc.ScopedMemoryAccess#getByte()
at java.lang.invoke.VarHandleSegmentAsBytes#get()
at java.lang.invoke.VarHandleGuards#guard_LJ_I()
at java.lang.foreign.MemorySegment#get()
at org.apache.lucene.store.MemorySegmentIndexInput#readByte()
at org.apache.lucene.store.DataInput#readVInt()
at org.apache.lucene.store.MemorySegmentIndexInput#readVInt()
4.08% 419 org.apache.lucene.util.SparseFixedBitSet#insertLong()
at org.apache.lucene.util.SparseFixedBitSet#getAndSet()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.84% 292 org.apache.lucene.util.LongHeap#downHeap()
at org.apache.lucene.util.LongHeap#pop()
at org.apache.lucene.util.hnsw.NeighborQueue#pop()
at org.apache.lucene.search.TopKnnCollector#topDocs()
at org.apache.lucene.search.knn.MultiLeafKnnCollector#topDocs()
at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
at org.apache.lucene.search.AbstractKnnVectorQuery#getLeafResults()
at org.apache.lucene.search.AbstractKnnVectorQuery#searchLeaf()
2.74% 281 org.apache.lucene.index.VectorSimilarityFunction$2#compare()
at org.apache.lucene.codecs.hnsw.DefaultFlatVectorScorer$FloatVectorScorer#score()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
2.58% 265 org.apache.lucene.util.compress.LZ4#decompress()
at org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor#decompress()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState#document()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#serializedDocument()
at org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader#document()
at org.apache.lucene.index.CodecReader$1#document()
at org.apache.lucene.index.BaseCompositeReader$2#document()
at org.apache.lucene.index.StoredFields#document()
2.48% 255 org.apache.lucene.util.SparseFixedBitSet#getAndSet()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.util.hnsw.HnswGraphSearcher#search()
at org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsReader#search()
at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsReader#search()
at org.apache.lucene.index.CodecReader#searchNearestVectors()
at org.apache.lucene.search.KnnFloatVectorQuery#approximateSearch()
Curious that readVInt, when seeking to load a vector (?) is 2nd hotspot?
I might be missing it, but where is the similarity defined for using the Cohere vectors? They are designed for max inner product; if we use euclidean, I would expect graph building and indexing to be poor, as we might get stuck in local minima.
The benchmark tools are hard-coded to use DOT_PRODUCT; see https://github.com/mikemccand/luceneutil/blob/main/src/main/perf/LineFileDocs.java#L454 Maybe this is why we get such poor results with Cohere?
@msokolov using ... I could maybe see ... But I would suggest we switch to max-inner-product for Cohere 768 for a true test with those vectors, as they were designed to be used.
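A toy illustration of why the similarity choice matters. Plain dot product and max-inner-product rank candidates identically, but euclidean distance can prefer different neighbors when vector norms vary, and Lucene's MAXIMUM_INNER_PRODUCT applies a scaling to keep scores positive without requiring unit-norm vectors (a sketch of the documented formulas, not Lucene's code):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_score(a, b):
    """Lucene-style EUCLIDEAN score: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

def mip_score(a, b):
    """Lucene-style MAXIMUM_INNER_PRODUCT scaling: monotonic in the raw
    inner product, but always positive."""
    d = dot(a, b)
    return 1.0 / (1.0 - d) if d < 0 else d + 1.0

query = [1.0, 0.0]
a = [0.6, 0.0]   # close to the query, small norm
b = [3.0, 4.0]   # far from the query, large norm -> big inner product
# max-inner-product prefers b; euclidean prefers a
```

With non-normalized embeddings like these Cohere vectors, the two orderings genuinely disagree, so building the graph with the wrong similarity really can degrade both recall and graph quality.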
I ran a test comparing mainline, Cohere, angular: ...
mainline, Cohere, mip: ...
#255 added realistic Cohere/wikipedia-22-12-en-embeddings 768-dim vectors to luceneutil -- let's switch over nightlies to use these vectors instead.