Minor edits
tbenthompson committed Jul 7, 2023
1 parent 7059a2f commit 342846d
Showing 1 changed file with 12 additions and 14 deletions.
26 changes: 12 additions & 14 deletions posts/catalog.ipynb
@@ -171,15 +171,14 @@
"2. **First token deletion**: a dataset constructed by differencing the outputs\n",
"of Pythia-2.8B [@biderman2023pythia] between four and five token prompts. This\n",
"method highlights tokens that are extremely predictive in context.\n",
" - for example, when prompted with `\", or common table\"`, the model predicts `\" expression\"` with probability 0.37. But, if we prompt with `\" chloride, or common table\"` and the model predicts `\" salt\"` with probability 0.99. \n",
" - for example, when prompted with `\", or common table\"`, the model predicts `\" expression\"` ([CTE](https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL#Common_table_expression)) with probability 0.37. But, if we prompt with `\" chloride, or common table\"`, then the model predicts `\" salt\"` with probability 0.99. \n",
"\n",
"## The data\n",
"\n",
"In following sections we will give details on the construction and statistics\n",
"of these datasets. But before continuing, we share some interactive data previews:\n",
"In following sections we will give details on the construction and statistics of these datasets. Before continuing, we share some interactive data previews:\n",
"\n",
"- **Deletion**: the first 25000 rows of [pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4).\n",
"- **Bigrams**: the entirety of [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams), which contains bigrams with suffix probability greater than 50%\n",
"- **Bigrams**: the entirety of [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams), which contains bigrams with suffix probability greater than 50%.\n",
"- **Trigrams**: the first 25000 rows of [pile_top_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_trigrams), which contains trigrams with suffix probability greater than 50% and count greater than 1000."
]
},
@@ -220,7 +219,7 @@
"\n",
"The columns of the table below:\n",
"\n",
"- `text`: two prompts provided. The additional token of backwards context is surrounded by square brackets. The example above would be written `\"[_chloride],_or_common_table\"`.\n",
"- `text`: the two prompts provided. The additional token of backwards context is surrounded by square brackets. The example in the introduction would be written `\"[_chloride],_or_common_table\"`.\n",
"- `token_short`: the most likely next token predicted by Pythia-2.8B for the *four* token prompt.\n",
"- `token_long`: the most likely next token predicted by Pythia-2.8B for the *five* token prompt.\n",
"- `p_short`: the probability Pythia-2.8B assigns to `token_short`.\n",
@@ -324,10 +323,10 @@
"The table below shows bigram completions in The Pile sorted by the frequency of\n",
"occurence of the prefix token:\n",
"\n",
"- `token#`: the tokens of the bigram.\n",
"- `sum_count`: the number of times the first token of the bigram occurs in The Pile.\n",
"- `frac_max`: the fraction of first token appearances that are followed by the most common bigram completion. For example, 50.3% of the time the model sees `\" need\"`, the correct next token is `\" to\"`.\n",
"- `p_2.8b`: the probability Pythia-2.8B assigns to the most likely completion token when prompted with just the prefix token.\n",
"- `token#`: the tokens of the bigram.\n",
"\n",
"Note:\n",
"\n",
@@ -419,13 +418,12 @@
"source": [
"## **Trigrams**\n",
"\n",
"The table below shows trigram completions in The Pile sorted by the frequency of\n",
"occurence of the prefix bigram:\n",
"The table below shows trigram completions in The Pile sorted by the frequency of occurence of the prefix bigram:\n",
"\n",
"- `token#`: the tokens of the trigram.\n",
"- `sum_count`: the number of times the prefix bigram occurs in The Pile.\n",
"- `frac_max`: the fraction of bigram appearances that are followed by the most common third token. For example, when prompted with the tokens `[\"://\", \"www\"]`, 99.4% of the time, the next token is `\".\"`.\n",
"- `p_2.8b`: the probability Pythia-2.8B assigns to the most likely completion token when prompted with the prefix bigram.\n",
"- `token#`: the tokens of the trigram.\n",
"\n",
"Note:\n",
"\n",
@@ -531,20 +529,20 @@
"To construct bigram and trigram statistics, we process [the entire deduplicated\n",
"Pile](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated). \n",
"\n",
"We share six datasets on Huggingface. Descriptions of the datasets are available in the linked model cards:\n",
"We share six datasets on Huggingface. Descriptions of the datasets are available in the linked dataset cards:\n",
"\n",
"- [pile_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_bigrams): Raw bigram statistics:\n",
" - 479 million unique bigrams.\n",
"- [pile_bigram_prefixes](https://huggingface.co/datasets/Confirm-Labs/pile_bigram_prefixes): All bigram prefixes with their most common completion token.\n",
" - 50,054 unique bigram prefixes (equivalent to tokens/unigrams).\n",
" - 50,054 unique bigram prefixes (one row for each unique token).\n",
"- [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams): Those bigram prefixes for which the most common completion has > 50% probability. We add Pythia's probability of the most frequent completion for each Pythia model.\n",
" - 3,448 such bigram prefixes. All of these are available to browse above.\n",
" - 3,448 such bigram prefixes. All of these are available to browse on this page above.\n",
"- [pile_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_trigrams): Raw trigram statistics.\n",
" - 9.9 billion unique trigrams.\n",
"- [pile_trigram_prefixes](https://huggingface.co/datasets/Confirm-Labs/pile_trigram_prefixes): All trigram prefixes with their most common completion token.\n",
" - 479 million unique trigram prefixes (equivalent to bigrams).\n",
"- [pile_top_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_trigrams): Those trigram prefixes for which the most common completion has > 50% probability and where the prefix occurs more than 1000 times in The Pile. We add Pythia's probability of the most frequent completion for each Pythia model.\n",
" - 1,542,074 such trigram prefixes. The top 25k are available to browse above."
" - 1,542,074 such trigram prefixes. The top 25k are available to browse on this page above."
]
},
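For anyone who wants to work with these files directly, a minimal sketch for pulling one of them with the `datasets` library follows; split names and columns are inspected rather than assumed.

```python
# Minimal sketch: download one of the shared datasets and inspect it.
from datasets import load_dataset

bigrams = load_dataset("Confirm-Labs/pile_top_bigrams")
print(bigrams)                    # shows the available splits and columns
first_split = next(iter(bigrams.values()))
print(first_split[0])             # first row of the first split
```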
{
@@ -658,7 +656,7 @@
"the prompt has a large influence and results in a confident prediction. Note\n",
"that the true next token $t_{i + n + 1}$ does not factor into these criteria\n",
"and therefore the correctness of the model's predictions does not affect\n",
"whether we consider the model to successfully be completing a task.\n",
"whether we consider the model to be \"completing a task\".\n",
"\n",
"We share 1,874,497 tasks produced by prompt scanning with Pythia-2.8B for every sliding 5-token prompt in the first 112.5M tokens of the Pile. The dataset is available on Huggingface: [pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)\n"
]
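A sketch of the scan itself over a toy piece of text is below; the confidence and influence cutoffs are illustrative placeholders rather than the thresholds used for pile_scan_4, and the model-loading boilerplate mirrors the earlier sketch.

```python
# Sketch of prompt scanning: slide a 5-token window over a token sequence and
# flag positions where deleting the first token changes a confident prediction.
# The 0.5 and 0.2 cutoffs are placeholders, not the thresholds used for pile_scan_4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")
model.eval()

def top_next(ids):
    with torch.no_grad():
        probs = torch.softmax(model(ids.unsqueeze(0)).logits[0, -1], dim=-1)
    p, idx = probs.max(dim=-1)
    return int(idx), p.item()

# Toy text standing in for a stretch of The Pile.
text = "Sodium chloride, or common table salt, is an ionic compound."
ids = tok(text, return_tensors="pt").input_ids[0]

tasks = []
for i in range(len(ids) - 4):                   # every sliding 5-token prompt
    window = ids[i : i + 5]
    long_tok, p_long = top_next(window)         # full five-token prompt
    short_tok, p_short = top_next(window[1:])   # first token deleted
    if p_long > 0.5 and (long_tok != short_tok or p_long - p_short > 0.2):
        tasks.append((tok.decode(window), tok.decode(long_tok), round(p_long, 2)))

for row in tasks:
    print(row)
```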
