Create a simple vector inspector tool #298

Open
mikemccand opened this issue Sep 13, 2024 · 3 comments

@mikemccand
Owner

Too often, when generating .vec files from Cohere for benchmarking, I struggled to tell whether the written files were actually "correct".

E.g. early attempts wrote float64 instead of float32 and, horribly, if you run with a float64-encoded .vec file nothing really "goes wrong", except you get weird/bad recall. Each float64 is interpreted as two (strange) adjacent float32 values.
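
To make that failure mode concrete, here is a minimal sketch of the misread. It assumes little-endian raw floats (hence the explicit `<f8` / `<f4` dtypes); the sample values are made up for illustration.

```python
import numpy as np

# A few small values, as an embedding component might be (made-up numbers).
values = np.array([0.25, -0.5, 0.125], dtype="<f8")

# The mistake: the bytes on disk are float64, but the reader assumes float32.
misread = np.frombuffer(values.tobytes(), dtype="<f4")

print(len(values), "float64 values were written")
print(len(misread), "float32 values are read back")
print(misread)
```

Twice as many values come back, and none of them match what was written, yet nothing raises an error. That is why this kind of bug only shows up later as poor recall.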

It'd be nice to have a tool that gives a bit of transparency about a .vec file. E.g. if its size isn't an even multiple of the per-vector size (dimensions × 4 bytes for float32), something is wrong. Or if there are NaNs, something is wrong. Or if the vectors are not normalized to the unit sphere when you expected them to be, something is wrong.
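
A rough sketch of those three checks, under some assumptions: the file is raw little-endian float32 with no header, the caller supplies the dimension, and the function name `check_vec_file` and the tolerance `tol` are hypothetical choices for illustration, not an existing tool.

```python
import os
import numpy as np

def check_vec_file(path, dim, expect_unit_norm=False, tol=1e-3):
    """Return a list of human-readable problems found in a raw float32 file."""
    problems = []
    size = os.path.getsize(path)
    bytes_per_vector = dim * 4  # assumes 4-byte float32 components, no header
    if size % bytes_per_vector != 0:
        problems.append(
            f"file size {size} is not a multiple of {bytes_per_vector} bytes"
        )
        return problems  # the file cannot be split into whole vectors

    vectors = np.fromfile(path, dtype="<f4").reshape(-1, dim)

    nan_count = int(np.isnan(vectors).sum())
    if nan_count:
        problems.append(f"{nan_count} NaN values")

    if expect_unit_norm:
        norms = np.linalg.norm(vectors, axis=1)
        off = int((np.abs(norms - 1.0) > tol).sum())
        if off:
            problems.append(f"{off} of {len(vectors)} vectors are not unit length")

    return problems
```

An empty list means none of these checks fired; it does not prove the file is correct. In particular, a float64 file can still pass the size check whenever its size happens to be a multiple of the expected per-vector size.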

Maybe the tool could also print out the actual float values for a few vectors and we might use our human eyes to look for any such "anomalies" ...
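
For the eyeballing part, something like the following might do. It is only a sketch: it again assumes raw little-endian float32 with a caller-supplied dimension, and `head_vectors` / `print_head` are hypothetical names.

```python
import numpy as np

def head_vectors(path, dim, count=3):
    """Read up to `count` leading vectors of `dim` float32 components each."""
    flat = np.fromfile(path, dtype="<f4", count=count * dim)
    usable = len(flat) - len(flat) % dim  # drop a trailing partial vector
    return flat[:usable].reshape(-1, dim)

def print_head(path, dim, count=3):
    """Print each leading vector with its L2 norm for a quick visual check."""
    for i, vec in enumerate(head_vectors(path, dim, count)):
        norm = np.linalg.norm(vec)
        print(f"vector {i}: norm={norm:.4f}", np.array2string(vec, precision=4))
```

Printing the norm next to each vector makes it easy to spot at a glance whether the vectors look normalized.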

@mikemccand
Owner Author

The tool could also report some aggregate stats, like per-dimension variance, or whether all/some dimensions have negative values, etc.
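
A sketch of what such per-dimension stats could look like, assuming the vectors are already loaded as a 2-D array of shape (num_vectors, dim); `per_dimension_stats` is a hypothetical name. Variance is accumulated in float64 here to limit rounding error on large files.

```python
import numpy as np

def per_dimension_stats(vectors):
    """Aggregate stats per dimension for an array of shape (num_vectors, dim)."""
    return {
        "variance": vectors.var(axis=0, dtype=np.float64),
        "min": vectors.min(axis=0),
        "max": vectors.max(axis=0),
        "has_negative": (vectors < 0).any(axis=0),
    }

def summarize(vectors):
    """Print a short summary built from the per-dimension stats."""
    stats = per_dimension_stats(vectors)
    dim = vectors.shape[1]
    negative_dims = int(stats["has_negative"].sum())
    print(f"{negative_dims} of {dim} dimensions contain negative values")
    print("variance range:", stats["variance"].min(), "..", stats["variance"].max())
```

A dimension with zero variance, or a file where no dimension is ever negative, might be worth a second look, depending on what the embedding model is expected to produce.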

@msokolov
Collaborator

Capturing, for posterity, a tiny tool I have been using:

import sys
import numpy as np

def calculate_statistics(file):
    # Read the whole file as one flat array of float32 values
    # (vector boundaries are ignored; stats are over all components).
    np_array = np.fromfile(file, dtype=np.float32)
    percentiles = [1, 10, 50, 90, 99, 100]
    for percentile in percentiles:
        print(percentile, "Percentile = ", np.percentile(np_array, percentile))
    print("average: " + str(np.average(np_array)))
    print("stddev: " + str(np.std(np_array)))
    print("min .. max: " + str(np.min(np_array)) + " .. " + str(np.max(np_array)))

# Usage: pass the path of the .vec file as the first command-line argument.
with open(sys.argv[1], "rb") as inp:
    calculate_statistics(inp)

@mikemccand
Owner Author

Awesome! Let's start with that! I'll go merge it :) Thanks @msokolov
