Total Occurrence Counts for non-unigram queries not available #2
Comments
There are some more questions about how this sort of counting would look.
Any particular implementation probably shouldn't implement "WordsPerMillion" as a method; it only needs to be able to implement "WordCount" on two different sorts of search restrictions:
So if you can get the token count for all books published in 1800 somehow, that's all you need. But maybe that's not possible at all? Or only indirectly, through multiplication? There might be some way to kludge it through.
For the two questions: {"word":["book","worm"]} should return total corpus frequencies for either.
Proximity queries raise all sorts of interesting questions, which is actually one of the reasons I think it might be useful to define them into the API rather than just adopt the Solr method wholesale. I can think of two sensible ways to handle this; in reality, we probably want to go with whatever Solr makes easier.
The really hard question is the query "book book worm worm worm." If you had a special key like search_limits:{"within":{"book":5},"word":"worm"}, that would suggest to me that: "book book worm worm worm"
If instead it was defined as
I may be implementing a very limited version of the first soon on MySQL.
I'm not sure I understand. When you write "the token count for all books published in 1800", do you mean the count for a given token, count(token="worm")? Or do you mean count(all tokens)? Let's go back to "book worm". It is possible to: d) get a count of occurrences of "book" in the entire corpus (if it occurs five times in a document, each occurrence is counted). The same goes for "worm". It's not possible to get the intersection, that is, how often "book" and "worm" occur next to each other overall, at least not without some digging around inside Lucene.
So, the stats for (d) and (e) can give you a WordsPerMillion for a single term. For bigrams, trigrams, etc., the Lucene mailing list suggested ShingleFilter to me, which saves n-grams as individual tokens, so we could treat them in the same way as above. For this project, I don't think we want any indexing dependencies: I want to work with vanilla Lucene 4.0+ indices. We could maybe support fields indexed with a ShingleFilter, but I think we should simply estimate WordsPerMillion based on the other information. Do we have access to any stats that show how common one-word queries are in Bookworm or Ngrams? This might help us evaluate whether we want to kludge around with the Lucene innards.
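To make the "n-grams as individual tokens" idea concrete, here is a minimal Python sketch. It only imitates the general idea of shingling; it is not Lucene's ShingleFilter and does not reproduce its options or defaults. The function name `shingles` and the sample text are made up for illustration.

```python
from collections import Counter


def shingles(tokens, size=2):
    """Return every run of `size` adjacent tokens, joined into one string."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Illustrative input only: the five-token example discussed in this thread.
tokens = "book book worm worm worm".split()

# Once each bigram is a single token, counting it works like counting a unigram.
bigram_counts = Counter(shingles(tokens, 2))
```

The cost is that every n-gram order you want to query has to be written into the index ahead of time, which is exactly the indexing dependency being weighed here.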
Yeah, we're talking about two different issues at once. First, let me address the one I care most about, because I think it will preserve flexibility and avoid duplicating code. That is: a Solr implementation should use the existing API code and not reimplement a method like "WordsPerMillion." Instead, it should just extend the general API class, as in this SQL example. To explain a bit more: I mean that in the latest versions of the Bookworm API, if you do a simple search:
the API breaks each incoming query down into two new queries, each of which is dispatched to a SQL-specific instance: first
And second the counts for the full corpus:
And then it calculates the WordsPerMillion from the documents returned. The MySQL implementation only handles "WordCount" and "TextCount" as counttypes: all the rest are derived in the general_API.py file, which could be used for the Solr implementation as well.
The advantage of this, as I see it, is that it makes it easy to implement all sorts of esoteric but potentially useful statistics off of book and word counts, like average text length, TF-IDF, and Dunning Log-Likelihood. Rather than re-implementing those on each platform, we can just keep it simple by only implementing WordCount and TextCount. It also makes it possible to add some experiments with syntactic sugar that would be silly to implement twice: for instance, I've been experimenting with using an asterisk to indicate keys to be dropped in the grouping field as well as the search limit field. For heatmaps, that makes a useful sort of crosstab functionality possible. But it would be silly to re-implement.
The reason not to do this would be if it proves much slower to dispatch two queries to Lucene instead of one that fetches both. But we still shouldn't assume that a query will necessarily have any search term at all. One of my favorite bookworm charts is the number of books from each constituent library, which doesn't involve any word limitations at all. We should be able to do something like this on Hathi as well.
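As a rough illustration of that division of labor, here is a hedged Python sketch of the derivation step. It assumes the backend has already answered two grouped WordCount queries, one with the word limit and one without. The function name, the dictionary shapes, and the counts are all hypothetical and are not taken from general_API.py.

```python
# Hypothetical sketch: names and data shapes are illustrative only.

def words_per_million(word_counts, corpus_counts):
    """Combine two grouped WordCount results into WordsPerMillion per group.

    word_counts   -- {group_key: WordCount} with the word limit applied
    corpus_counts -- {group_key: WordCount} with the word limit dropped
    """
    result = {}
    for key, total in corpus_counts.items():
        if total == 0:
            continue  # no tokens in this group, so no rate is defined
        hits = word_counts.get(key, 0)  # groups with no matches count as zero
        result[key] = 1_000_000 * hits / total
    return result


# Made-up counts grouped by publication year.
with_word = {1800: 12, 1801: 30}
full_corpus = {1800: 4_000_000, 1801: 6_000_000, 1802: 5_000_000}

wpm = words_per_million(with_word, full_corpus)
```

Because the function only sees counts keyed by group, it does not care whether they came from MySQL or Solr; the backend only has to supply WordCount under the two sets of limits.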
Second, on this point:
You're saying, if I get it, that with Lucene, multigrams are implemented in such a way that we can't retrieve the counts for the 2-gram "book worm" without an extension. One question: even with that extension, would we be able to quickly get counts for "The United States are", or some arbitrary 14-gram, without dramatically increasing the index size? If not, that would really mean that we can implement most of the API on unigrams, but not on bigrams or higher. In practice, most queries are unigrams, but that might be because we only support bigrams. On the movie bookworm, the one with the most hits that I have logs handy for, there are 250,000 queries outside the default set for unigrams, and only 10,000 for bigrams. So it wouldn't be the end of the world to only search unigrams; on the other hand, multigram searches are the most important advantage of Solr.
To calculate WordsPerMillion, we need access to all the occurrences of the query in all the documents of the corpus.
This is possible for a single term (TermsEnum.totalTermFreq()), but as far as I can tell, not for full queries. Needs more investigation.
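For illustration, the single-term case can be mimicked on a toy in-memory corpus. This is a Python sketch of the statistic itself, not of Lucene's API; the documents and function names are made up.

```python
from collections import Counter

# Toy corpus (made up); each document is a list of tokens.
docs = [
    "the book worm ate the book".split(),
    "a worm is not a book".split(),
]

# Per-document term frequencies, roughly the information a term-level
# statistic is built from. Note that no token positions are kept here.
per_doc = [Counter(doc) for doc in docs]

total_tokens = sum(len(doc) for doc in docs)


def total_term_freq(term):
    """Occurrences of `term` summed over every document (repeats all count)."""
    return sum(counts[term] for counts in per_doc)


def doc_freq(term):
    """Number of documents that contain `term` at least once."""
    return sum(1 for counts in per_doc if counts[term] > 0)


def words_per_million(term):
    """Corpus-wide rate of `term` per million tokens."""
    return 1_000_000 * total_term_freq(term) / total_tokens
```

Since `per_doc` holds no positions, nothing in it can say how often "book" is immediately followed by "worm"; that is the gap for full queries described above.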