Use custom binary index file format #3
Comments
Excellent job done!
I wonder how just using a more efficient serde format like
Actually, it looks like the main reason it's slow is just unbuffered IO. See #5. With that, it loads in 0.3s for me with JSON.
Oh... I was aware that reading the XML files used BufReader, but I missed the index file.
The difference doesn't really matter much. BTW, I managed to make sqlite search work in #4, and it's
I have to admit that sqlite is fast.
This is a very interesting project: a search engine on your local machine that you know the workings of and can customize. It could actually improve my productivity.
So I was interested in its performance, because it does not seem very fast.
I injected performance measurement code in various places in my branch.
This is the time to build the index of the whole docs.gl on my computer.
And this is the time to search for a common word like TEXTURE:
It is too slow for a search engine, but interestingly, most of the time is spent loading the JSON index file.
I suppose the most time-consuming part is parsing.
I see that you attempted to implement a sqlite index file. Indeed it is very fast to load, since it doesn't actually load the file when it connects to the db.
However, sqlite doesn't look particularly easy to work with. I tried to implement the search feature for sqlite, but it's too painful to implement in SQL.
So I propose a third approach - designing a custom binary file format for the specific use case.
The idea is to implement our own serializer/deserializer for a binary file like below. It skips the parsing step, so loading should be much faster.
The deserializer is like below.
We don't need serde or anything; in fact, using it may add some overhead. Dumping binary data straight to disk should be the simplest and fastest way to serialize.
I implemented it in the branch, and with this binary file:
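For readers who want something concrete, here is a minimal sketch of what such a hand-written serializer/deserializer could look like. It is not the code from the branch: the `Index` and `TermFreq` aliases, the function names, and the length-prefixed little-endian layout are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Write};

// Hypothetical in-memory index: document path -> (term -> frequency).
type TermFreq = HashMap<String, u64>;
type Index = HashMap<String, TermFreq>;

// Integers are stored as 8 little-endian bytes.
fn write_u64<W: Write>(w: &mut W, v: u64) -> io::Result<()> {
    w.write_all(&v.to_le_bytes())
}

fn read_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

// Strings are stored as a byte length followed by the UTF-8 bytes.
fn write_str<W: Write>(w: &mut W, s: &str) -> io::Result<()> {
    write_u64(w, s.len() as u64)?;
    w.write_all(s.as_bytes())
}

fn read_str<R: Read>(r: &mut R) -> io::Result<String> {
    let len = read_u64(r)? as usize;
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

// Layout: doc_count, then per document: path, term_count, (term, freq) pairs.
fn serialize_index<W: Write>(w: &mut W, index: &Index) -> io::Result<()> {
    write_u64(w, index.len() as u64)?;
    for (path, tf) in index {
        write_str(w, path)?;
        write_u64(w, tf.len() as u64)?;
        for (term, freq) in tf {
            write_str(w, term)?;
            write_u64(w, *freq)?;
        }
    }
    Ok(())
}

// Reads the same layout back; no text parsing is involved.
fn deserialize_index<R: Read>(r: &mut R) -> io::Result<Index> {
    let doc_count = read_u64(r)?;
    let mut index = Index::new();
    for _ in 0..doc_count {
        let path = read_str(r)?;
        let term_count = read_u64(r)?;
        let mut tf = TermFreq::new();
        for _ in 0..term_count {
            let term = read_str(r)?;
            let freq = read_u64(r)?;
            tf.insert(term, freq);
        }
        index.insert(path, tf);
    }
    Ok(index)
}
```

When reading from or writing to a real file, wrapping it in `std::io::BufReader` or `std::io::BufWriter` matters here, since this sketch issues many small reads and writes.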
It's still not very fast, but much better than JSON.
I think we can make it a randomly accessible binary file format and seek the file pointer to the necessary field, so that we avoid loading the whole file content. It should implement the Model trait. We will need a header at the beginning of the binary that records the file offset of each data record, such as a document. At least it seems easier than SQL with complex queries. It is like designing a custom database engine, but I think it fits the policy of this project :)
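To make the header idea more tangible, here is a rough sketch of one possible layout: a table of (path, offset) entries at the start of the file, followed by the per-document records, so that a reader can load only the header and then seek to a single record. All names (`write_indexed`, `read_header`, `read_record`) and the layout itself are assumptions for illustration; the sketch does not implement the project's Model trait.

```rust
use std::collections::HashMap;
use std::io::{self, Read, Seek, SeekFrom, Write};

// Hypothetical record type: term -> frequency for one document.
type TermFreq = HashMap<String, u64>;

fn put_u64(out: &mut Vec<u8>, v: u64) {
    out.extend_from_slice(&v.to_le_bytes());
}

fn put_str(out: &mut Vec<u8>, s: &str) {
    put_u64(out, s.len() as u64);
    out.extend_from_slice(s.as_bytes());
}

fn get_u64<R: Read>(r: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn get_str<R: Read>(r: &mut R) -> io::Result<String> {
    let len = get_u64(r)? as usize;
    let mut buf = vec![0u8; len];
    r.read_exact(&mut buf)?;
    String::from_utf8(buf).map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

// Layout: doc_count, then (path, absolute record offset) per document,
// then the records themselves.
fn write_indexed<W: Write>(w: &mut W, docs: &[(String, TermFreq)]) -> io::Result<()> {
    // Encode every record first so that its size is known.
    let records: Vec<Vec<u8>> = docs
        .iter()
        .map(|(_, tf)| {
            let mut rec = Vec::new();
            put_u64(&mut rec, tf.len() as u64);
            for (term, freq) in tf {
                put_str(&mut rec, term);
                put_u64(&mut rec, *freq);
            }
            rec
        })
        .collect();

    // Header size: count (8 bytes), plus per document the path length
    // prefix (8), the path bytes, and the offset (8).
    let header_len: u64 = 8 + docs.iter().map(|(p, _)| 16 + p.len() as u64).sum::<u64>();

    let mut header = Vec::new();
    put_u64(&mut header, docs.len() as u64);
    let mut offset = header_len;
    for ((path, _), rec) in docs.iter().zip(&records) {
        put_str(&mut header, path);
        put_u64(&mut header, offset);
        offset += rec.len() as u64;
    }
    w.write_all(&header)?;
    for rec in &records {
        w.write_all(rec)?;
    }
    Ok(())
}

// Reads only the offset table, not the records.
fn read_header<R: Read>(r: &mut R) -> io::Result<Vec<(String, u64)>> {
    let count = get_u64(r)?;
    let mut entries = Vec::new();
    for _ in 0..count {
        let path = get_str(r)?;
        let offset = get_u64(r)?;
        entries.push((path, offset));
    }
    Ok(entries)
}

// Seeks straight to one record and decodes it.
fn read_record<R: Read + Seek>(r: &mut R, offset: u64) -> io::Result<TermFreq> {
    r.seek(SeekFrom::Start(offset))?;
    let term_count = get_u64(r)?;
    let mut tf = TermFreq::new();
    for _ in 0..term_count {
        let term = get_str(r)?;
        let freq = get_u64(r)?;
        tf.insert(term, freq);
    }
    Ok(tf)
}
```

A search would still need to touch every record that might contain the query terms, so a real design would probably also want a per-term offset table; this sketch only shows the per-document case described above.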