Restoring fast feature counting #89

Closed
bmschmidt opened this issue Dec 3, 2015 · 1 comment

@bmschmidt
Member

In wrapping up all the various bookworm calls into a command-line executable over the summer, I removed the ability to ingest unigrams.

I've now restored that, but without the system calls to @organisciak's "fast_featurecounter.sh". Instead, it just calls a relocated copy of his (older?) function write_word_ids_from_feature_counts.

For rebuilds of Hathi, I'm not so worried about this: what we really want to do is not rebuild the vocabulary list at all, but instead to just use the file that we've now created.

For JSTOR DFR, the Underwood corpus, or other potential feature-count bookworms, however, we may want the faster version. I don't really know what the performance cost of the current approach is.

My preference for doing this would be as a redefinition of that function so that the external wrappers can keep working. But we could also just switch back to using the Makefile to dispatch if that is easier. (It may be, because the current version is configured to read from stdin.)
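To make the stdin-based option concrete, here is a minimal sketch of a vocabulary builder that reads feature counts from a stream. It is not the actual write_word_ids_from_feature_counts code: the function name `assign_word_ids`, the tab-separated "word, count" input format, and the output column order are all assumptions chosen for illustration.

```python
import sys
from collections import Counter


def assign_word_ids(lines, max_words=None):
    """Sum counts per word and rank words by total frequency.

    `lines` is any iterable of "word<TAB>count" records (an assumed
    format, not necessarily what bookworm emits). Returns a list of
    (word_id, word, total_count) tuples, most frequent word first.
    """
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the word itself may contain tabs.
        word, _, count = line.rpartition("\t")
        if not word or not count.isdigit():
            continue  # skip blank or malformed records
        totals[word] += int(count)
    # Descending count, then alphabetical, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_words is not None:
        ranked = ranked[:max_words]
    return [(i, w, c) for i, (w, c) in enumerate(ranked, start=1)]


def write_word_ids(instream=sys.stdin, outstream=sys.stdout):
    """Hypothetical wrapper: read counts from a stream, write an ID table."""
    for word_id, word, count in assign_word_ids(instream):
        outstream.write(f"{word_id}\t{word}\t{count}\n")
```

Because the ranking logic takes any iterable of lines, the same function could sit behind either an external wrapper that pipes to stdin or a Makefile target. Note that this version holds the whole vocabulary in memory, which is the kind of cost a shell-based sort pipeline would typically avoid on very large corpora.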

@bmschmidt
Member Author

Fold into #134
