Let's return to the first example from the book, identifying the geographic flavor of words using Wikipedia, with actual code and more detail.
The Wikipedia corpus is large and unruly: thirty million human-edited articles and some 200 million links (TODO: verify). Here is the overall flow of the analysis:
- article → wordbag
- join on page data to get geolocation
- use pagelinks to get a larger pool of implied geolocations
- turn geolocations into quadtile keys (a sketch of this conversion follows the list)
- aggregate topics by quadtile
- take summary statistics aggregated over term and quadkey
- combine those statistics to identify terms that occur more frequently than the base rate would predict
- explore and validate the results
- filter to find strongly-flavored words, and other reductions of the data for visualization
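To make one of those steps concrete, here is a minimal Python sketch of turning a geolocation into a quadtile key using the standard web-mercator tiling scheme. The function name and the zoom level of 12 are illustrative choices, not something fixed by the pipeline above.

```python
import math

def quadkey(lon, lat, zoom=12):
    """Convert a (longitude, latitude) pair to a quadtile key string.

    At each zoom level the world is split into a 2^zoom x 2^zoom grid;
    the quadkey is the sequence of quadrant digits (0-3) from the
    coarsest level down, so nearby points share a key prefix.
    """
    lat = max(min(lat, 85.05112878), -85.05112878)   # clamp to mercator range
    n = 2 ** zoom
    tile_x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    tile_y = int((1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    digits = []
    for bit in range(zoom, 0, -1):
        digit = 0
        mask = 1 << (bit - 1)
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

# Example: Austin, TX. Aggregating by key prefix gives coarser grid cells.
print(quadkey(-97.74, 30.27))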
There are three touchstones to hit in every data exploration:

- Confirm the things you know.
- Confirm or refute the things you suspect.
- Uncover at least one thing you never suspected.
Things we know: common words like "couch" or "hair" should show no geographic flavor, while geographic features like "beach" or "mountain" should be intensely localized. Words like "town" or "street" are geographic in sense but used everywhere, so they should show only a weak flavor.

Things we suspect: taken as a whole, the terms with a strong geographic flavor should largely be cultural terms (foods, sports, and so forth); and compared to other color words, there will be larger regional variation for the terms "white" and "black", as they describe race as well as color.

You don't have to stop exploring when you find a new mystery, but no data exploration is complete until you uncover at least one.

Next, we'll choose some exemplars: familiar records to trace through each stage of the process. "Barbeque", for instance, should cover the case of a strongly flavored cultural term.
Chapter in progress — the story so far: we’ve counted the words in each document and each geographic grid region, and want to use those counts to estimate each word’s frequency in context. Picking up there…
The count of each word in a document is an imperfect estimate of the probability of seeing that word in the context of the given topic. Consider, for instance, the words that would have shown up if the article were 50% longer, or the cases where an author chose one synonym out of many equivalents. This is particularly significant for words with zero count: we want to treat "missing" terms as having occurred some small number of times, and to adjust the probabilities of all the observed terms accordingly.
Note: Minimally Invasive

It's essential to use "minimally invasive" methods to address confounding factors. What we're trying to do is expose a pattern that we believe is robust enough to shine through any occlusions in the data. Occasionally, as here, we need to directly remove some confounding factor. The naive practitioner thinks, "I will use a powerful algorithm! That's good, because powerful is better than not powerful!" No: simple and clear is better than powerful.

Suppose you were instead telling a story set in space. Somehow or another, you must address the complication of faster-than-light travel. Star Wars does this early and well: its choices ("Ships can jump to faraway points in space, but not from too close to a planet and only after calculations taking several seconds; it happens instantaneously, causing nearby stars to appear as nifty blue tracks") are made clear in a few deft lines of dialogue. A ham-handed sci-fi author instead brings in complicated machinery requiring a complicated explanation resulting in complicated dialogue. There are two obvious problems. First, the added detail makes the story less clear: it's literally not rocket science, so concentrate on heroes and the triumph over darkness, not on rocket engines. Second, writing that dialogue is wasted work. If it's enough to just have the Wookiee hit the computer with a large wrench, do that. But it's essential to appreciate that the complicated machinery also introduces extra confounding factors. Rather than a nifty special effect and a few lines shouted by a space cowboy at his hairy sidekick, your junkheap space freighter now needs an astrophysicist, a whiteboard, and a reason to have the one use the other. The story isn't just muddier, it's flawed.

We're trying to tell a story ("words have regional flavor"), but the plot requires a few essential clarifications ("low-frequency terms are imperfectly estimated"). If these patterns are robust, complicated machinery is detrimental. It confuses the audience and is more work for you; it can also bring more pattern to the data than is actually there, perverting your results. The only time you should bring in something complicated or novel is when it's a central element of your story. In that case, it's worth spending multiple scenes in which Jedi masters show and tell the mechanics and limitations of The Force.
There are two reasonable strategies: be lazy; or consult a sensible mathematician.
To be lazy, add a 'pseudocount' to each term: pretend you saw it an extra small number of times. For the common pseudocount choice of 0.5, you would treat absent terms as having been seen 0.5 times, terms observed once as having been seen 1.5 times, and so forth. Calculate probabilities using the adjusted count divided by the sum of all adjusted counts (so that they sum to 1). It's not well justified mathematically, but it is easy to code.
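As a concrete illustration of the lazy approach, here is a minimal Python sketch. The 0.5 pseudocount matches the choice above; the function name, toy counts, and tiny vocabulary are ours for illustration only.

```python
from collections import Counter

def pseudocount_probabilities(doc_counts, vocabulary, pseudocount=0.5):
    """Smooth word probabilities by pretending every term in the vocabulary
    was seen an extra `pseudocount` times in this document."""
    adjusted = {term: doc_counts.get(term, 0) + pseudocount for term in vocabulary}
    total = sum(adjusted.values())
    return {term: count / total for term, count in adjusted.items()}

# Toy example: two observed terms, two absent ones.
doc_counts = Counter({"barbeque": 3, "beach": 1})
vocabulary = {"barbeque", "beach", "mountain", "couch"}
probs = pseudocount_probabilities(doc_counts, vocabulary)
print(probs)   # absent terms get a small nonzero share; the values sum to 1
```

Note that every vocabulary term receives the pseudocount, so the share handed to absent terms shrinks as the vocabulary grows.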
Consult a mathematician: for something that is mathematically justifiable, yet still simple enough to be minimally invasive, she will recommend "Good-Turing" smoothing.
In this approach, we expand the dataset to include both the pool of counts for terms we saw and an "absent" pool of fractional counts to be shared by all the terms we didn't see. Good-Turing says to count the terms that occurred exactly once, and to guess that an equal quantity of things would have occurred once but didn't. This is handwavy, but minimally invasive; we oughtn't say too much about the things we definitionally can't say much about.
We then make the following adjustments:
- Set the total count of words in the absent pool equal to the number of terms that occur exactly once. There are of course tons of terms in this pool; we'll give each some small fractional share of an appearance.
- Specifically, treat each absent term as occupying the same share of the absent pool as it does in the whole corpus (minus this doc). So, if "banana" does not appear in the document but occurs at (TODO: value) ppm across all docs, we'll treat it as occupying the same fraction of the absent pool (with a slight correction for the absence of this doc).
- Finally, estimate the probability for each present term as its count divided by the total count in the present and absent pools. A sketch of these adjustments follows below.
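Here is a minimal Python sketch of those adjustments, using made-up counts. The function and variable names are ours; normalizing the absent terms' shares over the unseen terms only is a simplifying choice, and a production version would use the full Simple Good-Turing estimator from the Gale paper linked below.

```python
from collections import Counter

def absent_pool_probabilities(doc_counts, corpus_counts):
    """Estimate term probabilities for one document using the absent-pool
    adjustments described above (a simplified Good-Turing-style scheme)."""
    # 1. The absent pool's total mass = number of terms seen exactly once here.
    absent_pool_mass = sum(1 for c in doc_counts.values() if c == 1)

    # 2. Split that mass among unseen terms in proportion to their
    #    corpus-wide counts (normalized over the unseen terms).
    unseen = {t: corpus_counts[t] for t in corpus_counts if t not in doc_counts}
    unseen_total = sum(unseen.values()) or 1
    absent_counts = {t: absent_pool_mass * c / unseen_total for t, c in unseen.items()}

    # 3. Probabilities: each (adjusted) count over the combined mass of the
    #    present and absent pools.
    total_mass = sum(doc_counts.values()) + absent_pool_mass
    probs = {t: c / total_mass for t, c in doc_counts.items()}
    probs.update({t: c / total_mass for t, c in absent_counts.items()})
    return probs

# Made-up counts for illustration:
doc_counts = Counter({"barbeque": 3, "beach": 1, "rodeo": 1})
corpus_counts = Counter({"barbeque": 40, "beach": 200, "rodeo": 15,
                         "banana": 120, "couch": 60})
probs = absent_pool_probabilities(doc_counts, corpus_counts)
print(round(sum(probs.values()), 6))   # 1.0: present and absent terms share the mass
```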
The approach we use here can serve as a baseline for the practical art of authorship detection in legal discovery.
- http://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues
- http://www.infochimps.com/datasets/list-of-dirty-obscene-banned-and-otherwise-unacceptable-words
- Entity names within angle brackets; where possible these are drawn from Appendix D to ISO 8879:1986, Information Processing - Text & Office Systems - Standard Generalized Markup Language (SGML).
- Simple Good-Turing smoothing (Gale) - http://faculty.cs.byu.edu/~ringger/CS479/papers/Gale-SimpleGoodTuring.pdf
- NLTK collocations howto - http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html
- Stanford Named Entity Recognizer - http://nlp.stanford.edu/software/CRF-NER.shtml
- Stanford CoreNLP - http://nlp.stanford.edu/software/corenlp.shtml

  > Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, and mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
  >
  > Stanford CoreNLP integrates all our NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system, and provides model files for analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.
  >
  > The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v2 or later). Source is included. Note that this is the full GPL, which allows many free uses, but not its use in distributed proprietary software. The download is 259 MB and requires Java 1.6+.