(Optionally) don't index bookids #137
Comments
I started this work in https://github.com/Bookworm-project/BookwormDB/tree/small_index
As it currently works, you add …
About the …
A similar process can be done with the tab-separated files that are used for LOAD DATA INFILE, somewhat slower because of the IO bottleneck. A simple way forward would entail the following actions: …
However, there are a few sticking points that may complicate this. First, with … Secondly, you can't rely on the … So, to account for these, it might make sense to modify the above steps in three ways. First, initiate the flat file nword creation based on whether …
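For concreteness, here is a minimal sketch of how the per-book word counts could be summed in one pass over the tab-separated unigram files instead of inside MySQL. The glob pattern, the column order (bookid, wordid, count), and the output path are illustrative assumptions, not the project's actual layout.

```python
import csv
import glob
from collections import defaultdict

def nwords_from_flat_files(pattern="files/texts/encoded/unigrams/*.txt",
                           out_path="files/texts/nwords.txt"):
    """Sum per-book token counts from tab-separated unigram files.

    Assumes each line is `bookid <tab> wordid <tab> count`; the real
    LOAD DATA INFILE layout may differ.
    """
    totals = defaultdict(int)
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for bookid, _wordid, count in csv.reader(f, delimiter="\t"):
                totals[bookid] += int(count)
    # Write one `bookid <tab> nwords` row per book.
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for bookid, nwords in sorted(totals.items()):
            writer.writerow([bookid, nwords])
```

Since this is a single streaming pass over files that already exist on disk, its cost is essentially the IO bottleneck mentioned above.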
I'll trust your judgment about what's easiest. I should say that there's nothing especially desirable about creating the nwords table through a SQL query; … Rather than creating a single nwords.txt file, it might make sense to create a folder at … The nwords table is small enough (16 million rows) that there shouldn't be major performance hits to just dropping it entirely and recreating it when needed.
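In that spirit, a rough sketch of what "drop it entirely and recreate it when needed" might look like. It assumes the MySQLdb driver, a tab-separated nwords file, and an illustrative schema; none of this is taken from the actual CreateDatabase.py.

```python
import MySQLdb  # assumes the MySQLdb driver; adjust for your connector

def rebuild_nwords(db_name, nwords_path="files/texts/nwords.txt"):
    """Drop and recreate the nwords table from a flat file.

    Table schema, engine, and file path are illustrative only.
    """
    db = MySQLdb.connect(db=db_name, local_infile=1)
    cursor = db.cursor()
    cursor.execute("DROP TABLE IF EXISTS nwords")
    cursor.execute("""
        CREATE TABLE nwords (
            bookid MEDIUMINT UNSIGNED NOT NULL,
            nwords INT UNSIGNED NOT NULL,
            PRIMARY KEY (bookid)
        ) ENGINE=MyISAM
    """)
    # MySQLdb interpolates the quoted path client-side.
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE nwords "
        "FIELDS TERMINATED BY '\\t' (bookid, nwords)",
        (nwords_path,))
    db.commit()
    db.close()
```

Keeping the drop and reload in one step also means nothing has to track whether the table is stale; it is simply rebuilt whenever it is needed.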
One other point; currently …
The …
@organisciak
The bookid indices take a long time to build on sources like Hathi. They could just be deleted to reduce index creation time; that requires just, AFAIK, eliminating this line of code.
https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/CreateDatabase.py#L295
The only problem I can see is that the creation of the 'nwords' table works from that index, I believe; so the 'nwords' table might have to be created from the flat files instead. That's not a serious problem, but it is a little more work.
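For reference, the in-database route that presumably depends on that index is an aggregation of roughly the following shape; the table name master_bookcounts and its columns are guesses at the usual Bookworm schema, not the code behind the link above. Without an index leading on bookid, MySQL typically has to scan the whole unigram table and group through a temporary table, which is why the flat files become the more attractive source.

```python
# Illustrative only: the kind of GROUP BY that an index on bookid makes
# cheap, and that a dropped index turns into a full scan plus filesort.
NWORDS_FROM_INDEX = """
    INSERT INTO nwords (bookid, nwords)
    SELECT bookid, SUM(count)
    FROM master_bookcounts
    GROUP BY bookid
"""

def build_nwords_in_database(cursor):
    """Aggregate per-book token counts inside MySQL.

    `master_bookcounts` and its columns are assumptions; if the bookid
    index is made optional, this path would be replaced by summing the
    flat files directly.
    """
    cursor.execute(NWORDS_FROM_INDEX)
```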