
First Pandas release.

@bmschmidt bmschmidt released this 27 Aug 17:56
· 4 commits to dev since this release

This is a major architectural update to allow further development; it includes some important performance changes and new features.

It introduces two new Python package dependencies:

  • pandas
  • numexpr (for convenience; this dependency may be dropped eventually).

Both should be easily available through easy_install or whatever else you use.

It also sets aside, for the time being, the need for temporary tables and the bookworm_scratch database described in v0.4, though they may still be revived.

Architecture changes

Parts of the core functionality of the API have been abstracted out of the SQL generating code.

All the different counttypes have been boiled down to two core types: "WordCount" and "TextCount". Each single API call now separately constructs two corpora and runs the ratios ("Words per million," and so forth) inside of Python instead of SQL.

This is done for two reasons:

  1. It allows better performance on MySQL (see below).
  2. The SQL construction engine is considerably less complicated, so re-implementing it on top of other platforms is easier.
    • Solr will be somewhat easier and can use more existing code, but will still need a few methods.
    • The meta-bookworm (an implementation that dispatches calls to other API nodes, rather than directly to a database) should be quite easy to write for most methods, although ordering search results presents issues.
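As a rough illustration of running the ratios in Python rather than SQL, the sketch below merges two small pandas frames and derives a words-per-million figure. The function name, the `year` grouping key, the column suffixes, and the sample numbers are all assumptions made for the example; only the "WordCount" and "TextCount" names come from the notes above, and this is not the release's actual code.

```python
import pandas as pd


def words_per_million(search, compare, keys):
    """Join the search corpus to the comparison corpus and compute the ratio in pandas."""
    # Both frames carry WordCount and TextCount, so pandas adds the suffixes on merge.
    merged = search.merge(compare, on=keys, suffixes=("_search", "_compare"))
    merged["WordsPerMillion"] = (
        merged["WordCount_search"] * 1_000_000 / merged["WordCount_compare"]
    )
    return merged


# Hypothetical results of the two corpus queries, grouped by year.
search = pd.DataFrame(
    {"year": [1900, 1901], "WordCount": [50, 120], "TextCount": [5, 8]}
)
compare = pd.DataFrame(
    {"year": [1900, 1901], "WordCount": [1_000_000, 2_000_000], "TextCount": [100, 150]}
)

result = words_per_million(search, compare, ["year"])
print(result[["year", "WordsPerMillion"]])
```

Because the database only has to return raw counts for each corpus, the same merge-and-divide step could in principle sit on top of any backend that can produce those two frames.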

Most of the API handling code has been bundled into a module, with the cgi-bin bits now taking up minimal space. This should make local connections (those that do not go through Apache) slightly easier.

Performance changes

Corpus creation queries are now usually cached: for large (5m+) bookworms, this can frequently speed up queries by 5-6x, getting most normal queries near a second again.
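The notes do not say how the caching is implemented. One plausible shape, shown purely as a sketch, is to key each corpus-creation result on a hash of its search limits so that a repeated request skips the database. `CorpusCache`, `fake_query`, and the limits dictionary are hypothetical names invented for this example.

```python
import hashlib
import json


class CorpusCache:
    """In-memory cache keyed on a stable hash of the search limits (illustrative only)."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable that actually hits the database
        self._store = {}
        self.misses = 0

    def get(self, limits):
        # sort_keys makes the key independent of dictionary ordering.
        raw = json.dumps(limits, sort_keys=True).encode("utf-8")
        key = hashlib.sha1(raw).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._run_query(limits)
        return self._store[key]


def fake_query(limits):
    # Stand-in for an expensive corpus-creation query.
    return [("doc1",), ("doc2",)]


cache = CorpusCache(fake_query)
first = cache.get({"year": 1900, "word": ["whale"]})
second = cache.get({"word": ["whale"], "year": 1900})  # same limits, different order
print(cache.misses)
```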

Additions

The new dispatching makes it much easier to write new methods on top of the API: as an example, I've added "Average Text Length" and TF-IDF as core summary statistics. These two may be removed at a later point.
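To give a sense of why such statistics become cheap to add once the counts live in a dataframe, here is a minimal sketch that derives an average text length from the two core count types. It assumes the statistic is simply words divided by texts, which the notes do not confirm, and the sample numbers are made up.

```python
import pandas as pd

# Hypothetical per-year counts for one corpus.
counts = pd.DataFrame(
    {"year": [1900, 1901], "WordCount": [5000, 12000], "TextCount": [5, 8]}
)

# Assumed definition: total words divided by number of texts in each group.
counts["AverageTextLength"] = counts["WordCount"] / counts["TextCount"]
print(counts)
```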

This also allows a non-canonical return method: a cPickled pandas dataframe rather than just JSON or TSV. That should be great news for anyone looking to do analysis directly in Python.
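On the client side, consuming a pickled dataframe might look roughly like the sketch below. The request itself is simulated with `pickle.dumps`, since the parameter that selects this return method is not given here. The example uses the Python 3 `pickle` module (the successor to `cPickle`), and the frame contents are invented. Unpickling runs arbitrary code, so this only makes sense against a server you trust.

```python
import pickle

import pandas as pd

# Stand-in for the bytes the API would send back; a real client would read
# these from the HTTP response body instead.
server_frame = pd.DataFrame({"year": [1900, 1901], "WordCount": [50, 120]})
payload = pickle.dumps(server_frame)

# Client side: rebuild the dataframe and work with it directly.
frame = pickle.loads(payload)
total = int(frame["WordCount"].sum())
print(total)
```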