This bundles a couple of changes that were done around the same time.

- Adds `--index-only`, `--no-index`, and `--no-delete` flags to `bookworm prep database_wordcounts`, resolving "Allow resuming of unigram ingest" #135 (and fixing one bug that came up).
- `db.query()` supports `executemany` calls, and there is a backup process for writing CSV files to the DB from Python if `LOAD DATA INFILE` fails. Not sure when this might be useful except with a permission error. I wrote it for some benchmarking and figured it could be kept in as a failsafe.
- Support for `h5` files. This looks for a table called `unigrams` inside the file, writes a set of temporary CSVs in parallel, then uses `LOAD DATA INFILE`. I opted for H5 because it's well supported in Pandas and supports 'blosc', a fast compression algorithm. I tried to keep this code as simple as possible; it would have been easy to over-engineer it.
- Changes to `create_unigram_book_counts`, toward eventually being able to convert it to a `create_book_counts_table` method that `create_unigram_book_counts` and `create_bigram_book_counts` can both use. This relates to the discussion in "Generalize unigram and bigram ingest methods" #134. The updates above are currently specific to unigram tables, my use case, so this will allow bigram indexes to keep pace.
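
For reviewers who want a picture of the failsafe path (writing a CSV to the DB from Python when `LOAD DATA INFILE` fails), here is a minimal sketch of the idea, not the code in this PR. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere. The names `bulk_load` and `load_csv_with_fallback`, the `unigram_counts` table, and the batch size are all hypothetical.

```python
import csv
import sqlite3


def bulk_load(conn, table, csv_path):
    """Hypothetical stand-in for the fast path (LOAD DATA INFILE on MySQL).

    sqlite3 has no equivalent statement, so this always raises, which
    mimics a permission error on a real server.
    """
    raise PermissionError("bulk load not permitted")


def load_csv_with_fallback(conn, table, csv_path, n_columns, batch_size=1000):
    """Try the bulk loader first; if it fails, insert rows with executemany.

    Returns the name of the path that was actually used.
    """
    try:
        bulk_load(conn, table, csv_path)
        return "bulk"
    except Exception:
        pass  # fall through to the slower row-by-batch path

    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    # The table name is interpolated directly, so it must be trusted input.
    placeholders = ", ".join("?" * n_columns)
    sql = f"INSERT INTO {table} VALUES ({placeholders})"

    cur = conn.cursor()
    with open(csv_path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                cur.executemany(sql, batch)
                batch = []
        if batch:  # flush the final partial batch
            cur.executemany(sql, batch)
    conn.commit()
    return "executemany"
```

Batching keeps memory use flat on large count files while still sending many rows per call, which is the main reason to prefer `executemany` over one `execute` per row in a fallback like this.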