Skip to content

Latest commit

 

History

History
67 lines (58 loc) · 2.41 KB

README.org

File metadata and controls

67 lines (58 loc) · 2.41 KB

Document

Load Data

Load the data using pandas load_data Loads in rocstories csv file and iterates over the stories
  • Returns a generator of stories
  • a generator is like a collection/list that isn’t populated until a item is requested
  • calling a next function on the generator gives an item if there is one left
  • if there are no items left, returns that it is empty
for story in load_data("~/path/to/rocstories.csv"):
    print(story)
  

Parse Data using Spacy

  • Want to load up spacy (or keep it loaded)
  • Want to pass each sentence into spacy
    • store the parse as well.

Extract Dependency Pairs

Count Dependency pairs

Protagonist detection

  • Heuristic 1: first mentioned protagonist
  • Heuristic 2: Most frequently mentioned protagonist
  • Heuristic 3: break ties in 2 with 1

How to use Coreference

  • output looks like [(entity1, 2), (entity2, 1) ... ]
  • Want to store and count [(story-id, entity, VERB, relation, index)]
  • Want to find all instance where lose and find refer to the same entity?
    • start of by counting all stories that have lose and find
      • find all story-ids where at least one verb is lose
        • find all story-ids where at least one verb is find
    • count how many times they refer to the same entity within the story

PMI

  • P(e(w,d)) => Probability of an event-dep
    • Specifically, verb is w and the dependency is d
    • $P(E1) = \frac{Count(E1)}{Count(stories)}$
    • $P(E1,E2) = \frac{Count(E1 \& E2)}{Count(stories)}$
  • So we are calling e(w,d)=E1
  • Equation 1: $pmi(E1,E2)=log\frac{P(E1,E2)}{P(E1)P(E2)}$
  • Equation 2: $P(E1, E2)=\frac{C(E1,E2)}{DENOMINATOR}$
    • The denominator is supposed to be the number of all event pairs that share an entity.
    • This is a little more complicated than just the number of stories

Load data and probability table

# this example is in example.py
data, probability = process_corpus("train.csv, sample=100)
print(probability.pmi("move", "dobj", "move", "dobj"))

Load Pretrained model instead:

  1. Download model from https://github.com/mrmechko/narrative_chains/releases/download/0.0.1/all.json and save it as all.json
  2. use the following snippet
    with open("all.json") as fp:
        table = chains.ProbabilityTable(json.load(fp))