create_unigram_book_counts and create_bigram_book_counts are redundant. Refactoring may make sense so that the updates made to one don't need to be copy-pasted. Ultimately, the functions are the same, just arguments and naming are different.
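One possible shape for the refactor is a single builder parameterized by n-gram length. This is only a sketch: the names `ngram_columns` and `create_ngram_book_counts_sql`, the `word1`/`word2` column naming, and the exact column definitions are assumptions for illustration, not the existing code.

```python
def ngram_columns(n):
    """Return the word column names for an n-gram table: word1, word2, ..."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return ["word{}".format(i) for i in range(1, n + 1)]


def create_ngram_book_counts_sql(n, table_name):
    """Build one CREATE TABLE statement for both unigrams (n=1) and bigrams (n=2)."""
    # Hypothetical column definitions; only the number of word columns varies with n.
    columns = ["{} MEDIUMINT UNSIGNED NOT NULL".format(c) for c in ngram_columns(n)]
    columns.append("bookid MEDIUMINT UNSIGNED NOT NULL")
    columns.append("count MEDIUMINT UNSIGNED NOT NULL")
    return "CREATE TABLE IF NOT EXISTS {} ({})".format(table_name, ", ".join(columns))
```

With something like this, the unigram and bigram functions would reduce to thin wrappers that call the shared builder with `n=1` or `n=2`, so a change made in one place applies to both.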
It would be good for this method to include two variables that specify the number of bytes used to store the wordids and bookids.
Just sketching it out, something like this.
def create_wordcount_table(ngrams, wordid_bytes=3, bookid_bytes=3):
    """
    wordid_bytes: 3 or 4. The number of bytes used to store wordids; 3 reduces file sizes by 25% and may speed up queries, but limits the vocabulary to 16 million words.
    bookid_bytes: 3 or 4. The number of bytes used to store bookids; 3 reduces file sizes by 25% and may speed up queries, but limits the library to 16 million documents.
    """
    # Map the byte width to the matching MySQL unsigned integer type.
    vartypes = {3: "MEDIUMINT UNSIGNED", 4: "INT UNSIGNED"}
    table_string = "TABLE word1 {}, bookid {}, count MEDIUMINT UNSIGNED".format(vartypes[wordid_bytes], vartypes[bookid_bytes])
I know of one group that has hacked at the code to allow bookid to be an INT UNSIGNED rather than a MEDIUMINT UNSIGNED, which is necessary when ingesting more than 16 million volumes. A little work still needs to be done in other places before this support is complete, but it would be nice to lay the groundwork here.
A two-byte unsigned int goes up to 65,535 and a one-byte one to 255. I can imagine a few cases where these might be useful if you're using a Bookworm to store named entities rather than actual words. But space is unlikely to be as big a deal in those cases as in the base one, so 3 and 4 are the only widths that need to be supported.
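For reference, the limits for each width can be checked with a few lines. The helper name `max_unsigned` is made up for this illustration; the type names are the standard MySQL unsigned integer types for 1 to 4 bytes.

```python
def max_unsigned(n_bytes):
    """Largest value an unsigned integer of n_bytes bytes can hold."""
    return 2 ** (8 * n_bytes) - 1


# MySQL unsigned integer types by storage size in bytes.
MYSQL_TYPES = {
    1: "TINYINT UNSIGNED",
    2: "SMALLINT UNSIGNED",
    3: "MEDIUMINT UNSIGNED",
    4: "INT UNSIGNED",
}

for n_bytes, sql_type in sorted(MYSQL_TYPES.items()):
    print("{} byte(s): {} holds up to {:,}".format(n_bytes, sql_type, max_unsigned(n_bytes)))
```

The 3-byte limit of 16,777,215 is the "16 million" ceiling mentioned above for both the vocabulary and the library.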