HTRC improvements #136

organisciak · 2017-04-11T21:09:44Z

This bundles several changes that were made around the same time.

Fixing the broken logging, as per "manager.py logging broken" #133.
Adding --index-only, --no-index, and --no-delete flags to bookworm prep database_wordcounts, resolving "Allow resuming of unigram ingest" #135 (and fixing one bug that came up).
Two small improvements: db.query() supports executemany calls, and there is a backup process for writing CSV files to the DB from Python if LOAD DATA INFILE fails. I'm not sure when this would be useful other than after a permission error; I wrote it for some benchmarking and figured it could be kept in as a failsafe.
Support for ingest from h5 files. This looks for a table called unigrams inside the file, writes a set of temporary CSVs in parallel, then uses LOAD DATA INFILE. I opted for H5 because it's well supported in Pandas and supports 'blosc', a fast compression algorithm. I tried to keep this code as simple as possible, since it would have been easy to over-engineer.
I started generalizing create_unigram_book_counts, with the eventual goal of a create_book_counts_table method that create_unigram_book_counts and create_bigram_book_counts can both use. This relates to the discussion in "Generalize unigram and bigram ingest methods" #134. The updates above are currently specific to unigram tables, which is my use case, so this generalization will allow bigram indexes to keep pace.
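
For readers unfamiliar with the failsafe idea mentioned above (falling back to executemany when LOAD DATA INFILE fails), here is a minimal sketch of the pattern. It is not the code from this PR. It uses sqlite3 as a stand-in so it runs without a MySQL server, and the function name, the `bulk_load` callable, the table name, and the column names (bookid, wordid, count) are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def load_rows_with_fallback(conn, csv_file, table, bulk_load=None, batch_size=1000):
    """Try a bulk loader first; if it raises, insert rows with executemany.

    `bulk_load` is a hypothetical callable standing in for LOAD DATA INFILE.
    `table` is formatted into the SQL string, so it must be a trusted name.
    Returns the name of the path that succeeded.
    """
    if bulk_load is not None:
        try:
            bulk_load(conn, csv_file, table)
            return "bulk"
        except Exception:
            csv_file.seek(0)  # rewind in case the loader consumed some input

    insert = "INSERT INTO {} (bookid, wordid, count) VALUES (?, ?, ?)".format(table)
    batch = []
    for row in csv.reader(csv_file, delimiter="\t"):
        batch.append([int(value) for value in row])
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)  # one call per batch of rows
            batch = []
    if batch:
        conn.executemany(insert, batch)  # flush the final partial batch
    conn.commit()
    return "executemany"


def failing_bulk_load(conn, csv_file, table):
    """Simulate a bulk load that is refused, e.g. by a permission error."""
    raise PermissionError("bulk load not permitted")


demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE counts (bookid INTEGER, wordid INTEGER, count INTEGER)")
demo_path = load_rows_with_fallback(
    demo_conn, io.StringIO("1\t2\t3\n1\t5\t1\n"), "counts", bulk_load=failing_bulk_load
)
```

Batching keeps memory bounded for large unigram files while still avoiding one round trip per row. With MySQL the placeholder style would be `%s` rather than `?`.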

The global logger initialization with basicConfig is removed, so any programmatic use of BookwormManager should have its own logger defined.
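
Since the global basicConfig call is gone, a script that drives BookwormManager programmatically needs to set up logging itself. The following is one possible way to do that with the standard library. The helper name and the "bookworm" logger name are assumptions for illustration, not names taken from this PR.

```python
import logging


def configure_logging(level=logging.INFO):
    """Attach a stream handler to the root logger if it has none, then set the level."""
    root = logging.getLogger()
    if not root.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        root.addHandler(handler)
    root.setLevel(level)
    return root


# Call this once in the calling script, before creating a BookwormManager.
root_logger = configure_logging(logging.DEBUG)
logging.getLogger("bookworm").debug("logging configured by the calling script")
```

Checking for existing handlers first avoids duplicate log lines when the host application has already configured logging.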

organisciak added 8 commits April 5, 2017 13:11

Fix manager.py logging, fixes #133

3a89d5f

The global logger initialization with basicConfig is removed, so any programmatic use of BookwormManager should have its own logger defined.

Unigram ingest flags, resolves #135

e290b1b

executemany support

793b0b1

h5 support

c64f9a0

Backup method if LOAD DATA INFILE fails for plain text

7b66c92

Fix incorrect debugging var

0c38aab

Generalizing code and renaming --no-close

03600f3

Correct variable typo

877e3ba

organisciak requested a review from bmschmidt April 11, 2017 21:09

bmschmidt merged commit ee7866a into master Apr 19, 2017

bmschmidt deleted the htrc_improvements branch March 21, 2019 03:28
