Align surprisal

Reconciles surprisal values from a language model (LM), whose tokenization may differ from that of the target data, with the tokens found in another corpus, such as a reading-time (RT) corpus.

We have two sequences X = x_1, ..., x_N and Y = y_1, ..., y_M. Think of them as two different tokenizations of the same string. Each token of the sequence X comes with a surprisal value.

We want to align X and Y, producing one of the following for each token y of Y:

If y corresponds to one token of X, the surprisal of that token.
If y corresponds to multiple tokens of X, the summed surprisal of those tokens.
If multiple tokens of Y correspond to one token of X, a sentinel value.
If elements of X cannot be reconciled with elements of Y (for example, X = a bc d, Y = ab cd), an irreconcilable sentinel.
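
The four cases above can be illustrated with a minimal sketch. This is not the code in align_surprisal.py (which relies on the tokenizations library); it is a simplified character-counting version. The function name `align_surprisal` and the sentinel values `SPLIT` and `IRRECONCILABLE` are hypothetical, and the sketch assumes that both token lists contain only non-empty tokens and spell out exactly the same string once concatenated (for example, after the caller has stripped whitespace).

```python
from typing import List, Sequence, Union

# Hypothetical sentinel values, chosen for this sketch only.
SPLIT = "SPLIT"                    # several tokens of Y share one token of X
IRRECONCILABLE = "IRRECONCILABLE"  # token boundaries of X and Y cross


def align_surprisal(
    x_tokens: Sequence[str],
    x_surprisals: Sequence[float],
    y_tokens: Sequence[str],
) -> List[Union[float, str]]:
    """Return one value per token of Y: a surprisal or a sentinel."""
    if len(x_tokens) != len(x_surprisals):
        raise ValueError("each token of X needs exactly one surprisal")
    if any(not t for t in x_tokens) or any(not t for t in y_tokens):
        raise ValueError("empty tokens are not supported in this sketch")
    if "".join(x_tokens) != "".join(y_tokens):
        raise ValueError("X and Y must spell out the same string")

    out: List[Union[float, str]] = []
    i = j = 0
    while i < len(x_tokens):
        # Grow a block on both sides until the two sides cover the
        # same number of characters, i.e. a shared token boundary.
        i_end, j_end = i + 1, j + 1
        x_len, y_len = len(x_tokens[i]), len(y_tokens[j])
        while x_len != y_len:
            if x_len < y_len:
                x_len += len(x_tokens[i_end])
                i_end += 1
            else:
                y_len += len(y_tokens[j_end])
                j_end += 1

        n_x, n_y = i_end - i, j_end - j
        if n_y == 1:
            # One y covers one or more x: sum their surprisals.
            out.append(sum(x_surprisals[i:i_end]))
        elif n_x == 1:
            # Several y are pieces of a single x.
            out.extend([SPLIT] * n_y)
        else:
            # Boundaries cross, as in X = a bc d, Y = ab cd.
            out.extend([IRRECONCILABLE] * n_y)
        i, j = i_end, j_end
    return out
```

For instance, `align_surprisal(["a", "bc", "d"], [1.0, 1.0, 1.0], ["ab", "cd"])` marks both tokens of Y as irreconcilable, because no boundary inside X lines up with the boundary between `ab` and `cd`.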

Example usage

The file rt1000.csv contains 1000 tokens' worth of data from the Dundee corpus, and the file lm1000.csv contains surprisal values for these tokens from GPT-3, under GPT-3's tokenization.

python align_surprisal.py lm1000.csv token logprob rt1000.csv WORD > aligned.csv
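
The column name logprob suggests that the LM file stores log probabilities rather than surprisals. Surprisal is the negative log probability, so a conversion along the following lines may be useful when preparing or checking the input. This is only a sketch: the helper name `logprob_to_surprisal_bits` is hypothetical, it assumes the log probabilities are natural logarithms, and whether align_surprisal.py performs any such conversion itself is not stated here.

```python
import math


def logprob_to_surprisal_bits(logprob: float) -> float:
    """Convert a natural-log probability to surprisal in bits.

    Assumes `logprob` is ln(p); surprisal in bits is -log2(p) = -ln(p) / ln(2).
    """
    return -logprob / math.log(2)
```

Dropping the division by `math.log(2)` gives surprisal in nats instead of bits.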

Dependencies

tokenizations
pandas
rfutils: https://github.com/Futrell/rfutils
