(Optionally) don't index bookids #137
Comments
I started this work in https://github.com/Bookworm-project/BookwormDB/tree/small_index
As it currently works, you add …
About the …
A similar process can be done with the tab-separated files that are used for LOAD DATA INFILE, somewhat slower because of the IO bottleneck. A simple way forward would entail the following actions: …
However, there are a few sticking points that may complicate this. First, with … Secondly, you can't rely on the … So, to account for these, it might make sense to modify the above steps in three ways. First, initiate the flat file nword creation based on whether …
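For concreteness, here is a minimal sketch of how the per-book word counts could be summed in one pass over the tab-separated unigram files instead of inside MySQL. The glob pattern, the column order (bookid, wordid, count), and the output path are illustrative assumptions, not the project's actual layout.

```python
import csv
import glob
from collections import defaultdict

def nwords_from_flat_files(pattern="files/texts/encoded/unigrams/*.txt",
                           out_path="files/texts/nwords.txt"):
    """Sum per-book token counts from tab-separated unigram files.

    Assumes each line is `bookid <tab> wordid <tab> count`; the real
    LOAD DATA INFILE layout may differ.
    """
    totals = defaultdict(int)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for bookid, _wordid, count in csv.reader(f, delimiter="\t"):
                totals[bookid] += int(count)
    # Write one `bookid <tab> nwords` row per book.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for bookid, nwords in sorted(totals.items()):
            writer.writerow([bookid, nwords])
```

Since this is a single streaming pass over files that already exist on disk, its cost is essentially the IO bottleneck mentioned above.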
I'll trust your judgment about what's easiest. I should say that there's nothing especially desirable about creating the nwords table through a SQL query; … Rather than creating a single nwords.txt file, it might make sense to create a folder at … The nwords table is small enough (16 million rows) that there shouldn't be major performance hits to just dropping it entirely and recreating it when needed.
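In that spirit, a rough sketch of what "drop it entirely and recreate it when needed" might look like. It assumes the MySQLdb driver, a tab-separated nwords file, and an illustrative schema; none of this is taken from the actual CreateDatabase.py.

```python
import MySQLdb  # assumes the MySQLdb driver; adjust for your connector

def rebuild_nwords(db_name, nwords_path="files/texts/nwords.txt"):
    """Drop and recreate the nwords table from a flat file.

    Table schema, engine, and file path are illustrative only.
    """
    db = MySQLdb.connect(db=db_name, local_infile=1)
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS nwords")
    cursor.execute("""
        CREATE TABLE nwords (
            bookid MEDIUMINT UNSIGNED NOT NULL,
            nwords INT UNSIGNED NOT NULL,
            PRIMARY KEY (bookid)
        ) ENGINE=MyISAM
    """)
    # MySQLdb interpolates the quoted path client-side.
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE nwords "
        "FIELDS TERMINATED BY '\\t' (bookid, nwords)",
        (nwords_path,))
    db.commit()
    db.close()
```

Keeping the drop and reload in one step also means nothing has to track whether the table is stale; it is simply rebuilt whenever it is needed.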
One other point; currently …
The …
@organisciak
The bookid indices take a long time to build on sources like Hathi. They could just be deleted to reduce index creation time; that requires just, AFAIK, eliminating this line of code.
https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/CreateDatabase.py#L295
The only problem I can see is that the creation of the 'nwords' table works from that index, I believe; so the 'nwords' table might have to be created from the flat files instead. That's not a serious problem, but it is a little more work.
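For reference, the in-database route that presumably depends on that index is an aggregation of roughly the following shape; the table name master_bookcounts and its columns are guesses at the usual Bookworm schema, not the code behind the link above. Without an index leading on bookid, MySQL typically has to scan the whole unigram table and group through a temporary table, which is why the flat files become the more attractive source.

```python
# Illustrative only: the kind of GROUP BY that an index on bookid makes
# cheap, and that a dropped index turns into a full scan plus filesort.
NWORDS_FROM_INDEX = """
    INSERT INTO nwords (bookid, nwords)
    SELECT bookid, SUM(count)
    FROM master_bookcounts
    GROUP BY bookid
"""

def build_nwords_in_database(cursor):
    """Aggregate per-book token counts inside MySQL.

    `master_bookcounts` and its columns are assumptions; if the bookid
    index is made optional, this path would be replaced by summing the
    flat files directly.
    """
    cursor.execute(NWORDS_FROM_INDEX)
```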