load_data
Loads the ROCStories CSV file and iterates over the stories
- Returns a generator of stories
- a generator is like a collection/list that isn’t populated until an item is requested
- calling the `next` function on the generator gives an item if there is one left; if there are no items left, it signals that it is empty (in Python, by raising `StopIteration`)
```python
for story in load_data("~/path/to/rocstories.csv"):
    print(story)
```
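A minimal sketch of what `load_data` could look like; the column names (`storyid`, `sentence1`..`sentence5`) follow the usual ROCStories layout and are an assumption about this particular file:

```python
import csv
import os

def load_data(path):
    """Yield stories from a ROCStories CSV, one at a time (a generator).

    Assumes the usual ROCStories columns (storyid, storytitle,
    sentence1..sentence5); adjust the names if your file differs.
    """
    with open(os.path.expanduser(path), newline="") as fp:
        for row in csv.DictReader(fp):
            # Collect the five sentences into one story record.
            sentences = [row[f"sentence{i}"] for i in range(1, 6)]
            yield {"story_id": row.get("storyid"), "sentences": sentences}
```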
- Want to load up spaCy (or keep it loaded)
- Want to pass each sentence into spaCy
- store the parse as well (see the sketch below)
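A sketch of that parsing step, assuming the `en_core_web_sm` spaCy model is installed (the model choice is an assumption) and the story records from the `load_data` sketch above:

```python
import spacy

# Load the model once and keep it around -- loading is the slow part.
nlp = spacy.load("en_core_web_sm")

def parse_story(story):
    """Run spaCy over each sentence of a story and keep every parsed Doc."""
    # nlp.pipe batches the sentences, which is faster than one nlp() call each.
    return list(nlp.pipe(story["sentences"]))
```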
- Heuristic 1: first mentioned protagonist
- Heuristic 2: most frequently mentioned protagonist
- Heuristic 3: break ties in 2 with 1
- output looks like `[(entity1, 2), (entity2, 1), ...]` (see the sketch below)
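One way the counting could be implemented over the spaCy parses; using PERSON named entities as the candidate pool is an assumption, and `protagonist_candidates` is a hypothetical helper name:

```python
from collections import Counter

def protagonist_candidates(docs):
    """Rank candidate protagonists for one story's parsed sentences.

    Heuristic 2 (frequency) orders the list; heuristic 1 (earliest mention)
    breaks ties. Treating PERSON named entities as the candidate pool is an
    assumption about how protagonists are identified.
    """
    mentions = []                       # every PERSON mention, in reading order
    for doc in docs:
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                mentions.append(ent.text)

    counts = Counter(mentions)
    first_seen = {}
    for i, name in enumerate(mentions):
        first_seen.setdefault(name, i)  # index of the first mention

    # Most frequent first; earlier first mention wins ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], first_seen[kv[0]]))
```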
- Want to store and count `[(story-id, entity, VERB, relation, index)]` tuples (sketched below)
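A hedged sketch of the tuple extraction; reading the schema as "nominal token, head verb lemma, dependency label, sentence index" is an assumption, and `extract_events` is a hypothetical name:

```python
def extract_events(story_id, docs):
    """Yield (story_id, entity, verb_lemma, relation, sentence_index) tuples.

    An "event" here is any nominal token attached to a verb by a dependency
    arc (nsubj, dobj, ...); this reading of the schema is an assumption.
    """
    for index, doc in enumerate(docs):
        for tok in doc:
            if tok.pos_ in ("NOUN", "PROPN", "PRON") and tok.head.pos_ == "VERB":
                yield (story_id, tok.text, tok.head.lemma_, tok.dep_, index)
```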
- Want to find all instances where lose and find refer to the same entity
  - start off by counting all stories that have both lose and find:
    - find all story-ids where at least one verb is lose
    - find all story-ids where at least one verb is find
    - take the intersection of those two sets of story-ids
  - then count how many times lose and find refer to the same entity within the story (see the sketch below)
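A sketch of that counting pass over the extracted tuples; matching entities by exact string (no coreference) is a simplifying assumption:

```python
from collections import defaultdict

def count_lose_find(events):
    """Count stories containing both verbs, and how often they share an entity.

    `events` is an iterable of (story_id, entity, verb, relation, index)
    tuples as sketched above.
    """
    lose = defaultdict(set)   # story_id -> entities attached to "lose"
    find = defaultdict(set)   # story_id -> entities attached to "find"
    for story_id, entity, verb, _relation, _index in events:
        if verb == "lose":
            lose[story_id].add(entity)
        elif verb == "find":
            find[story_id].add(entity)

    both = set(lose) & set(find)                      # stories with both verbs
    shared = sum(1 for sid in both if lose[sid] & find[sid])
    return len(both), shared
```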
- P(e(w,d)) => the probability of an event-dependency pair
- Specifically, the verb is w and the dependency is d
$P(E1) = \frac{Count(E1)}{Count(stories)}$

$P(E1,E2) = \frac{Count(E1 \& E2)}{Count(stories)}$
- So we are writing e(w,d) as E1
- Equation 1:
  $pmi(E1,E2)=\log\frac{P(E1,E2)}{P(E1)P(E2)}$
- Equation 2:
  $P(E1, E2)=\frac{C(E1,E2)}{DENOMINATOR}$
- The denominator is supposed to be the number of all event pairs that share an entity.
- This is a little more complicated than just the number of stories (see the sketch below).
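A small sketch of how the counts could be combined, following Equations 1 and 2 above; the argument names are assumptions:

```python
import math

def pmi(count_e1, count_e2, count_pair, n_stories, n_pairs):
    """PMI of two events, following Equations 1 and 2 above.

    count_e1, count_e2 : stories containing E1 / E2
    count_pair         : event pairs (E1, E2) that share an entity
    n_stories          : total number of stories
    n_pairs            : total number of event pairs sharing an entity
    """
    p_e1 = count_e1 / n_stories
    p_e2 = count_e2 / n_stories
    p_joint = count_pair / n_pairs        # Equation 2's denominator
    return math.log(p_joint / (p_e1 * p_e2))
```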
```python
# this example is in example.py
data, probability = process_corpus("train.csv", sample=100)
print(probability.pmi("move", "dobj", "move", "dobj"))
```
- Download the model from https://github.com/mrmechko/narrative_chains/releases/download/0.0.1/all.json and save it as `all.json`
- use the following snippet:

```python
import json

import chains  # the module that provides ProbabilityTable here

with open("all.json") as fp:
    table = chains.ProbabilityTable(json.load(fp))
```
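Assuming the loaded table exposes the same `pmi` method used on `probability` above (an assumption; check the project's API), a query could then look like:

```python
# Hypothetical usage: assumes ProbabilityTable has the same pmi() signature as above.
print(table.pmi("move", "dobj", "move", "dobj"))
```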