Minor edits
tbenthompson committed Jul 7, 2023
1 parent 7059a2f commit 342846d
Showing 1 changed file with 12 additions and 14 deletions.
26 changes: 12 additions & 14 deletions posts/catalog.ipynb
@@ -171,15 +171,14 @@
"2. **First token deletion**: a dataset constructed by differencing the outputs\n",
"of Pythia-2.8B [@biderman2023pythia] between four and five token prompts. This\n",
"method highlights tokens that are extremely predictive in context.\n",
" - for example, when prompted with `\", or common table\"`, the model predicts `\" expression\"` with probability 0.37. But, if we prompt with `\" chloride, or common table\"` and the model predicts `\" salt\"` with probability 0.99. \n",
" - for example, when prompted with `\", or common table\"`, the model predicts `\" expression\"` ([CTE](https://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL#Common_table_expression)) with probability 0.37. But, if we prompt with `\" chloride, or common table\"`, then the model predicts `\" salt\"` with probability 0.99. \n",
"\n",
"## The data\n",
"\n",
"In following sections we will give details on the construction and statistics\n",
"of these datasets. But before continuing, we share some interactive data previews:\n",
"In following sections we will give details on the construction and statistics of these datasets. Before continuing, we share some interactive data previews:\n",
"\n",
"- **Deletion**: the first 25000 rows of [pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4).\n",
"- **Bigrams**: the entirety of [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams), which contains bigrams with suffix probability greater than 50%\n",
"- **Bigrams**: the entirety of [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams), which contains bigrams with suffix probability greater than 50%.\n",
"- **Trigrams**: the first 25000 rows of [pile_top_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_trigrams), which contains trigrams with suffix probability greater than 50% and count greater than 1000."
]
},
@@ -220,7 +219,7 @@
"\n",
"The columns of the table below:\n",
"\n",
"- `text`: two prompts provided. The additional token of backwards context is surrounded by square brackets. The example above would be written `\"[_chloride],_or_common_table\"`.\n",
"- `text`: the two prompts provided. The additional token of backwards context is surrounded by square brackets. The example in the introduction would be written `\"[_chloride],_or_common_table\"`.\n",
"- `token_short`: the most likely next token predicted by Pythia-2.8B for the *four* token prompt.\n",
"- `token_long`: the most likely next token predicted by Pythia-2.8B for the *five* token prompt.\n",
"- `p_short`: the probability Pythia-2.8B assigns to `token_short`.\n",
@@ -324,10 +323,10 @@
"The table below shows bigram completions in The Pile sorted by the frequency of\n",
"occurence of the prefix token:\n",
"\n",
"- `token#`: the tokens of the bigram.\n",
"- `sum_count`: the number of times the first token of the bigram occurs in The Pile.\n",
"- `frac_max`: the fraction of first token appearances that are followed by the most common bigram completion. For example, 50.3% of the time the model sees `\" need\"`, the correct next token is `\" to\"`.\n",
"- `p_2.8b`: the probability Pythia-2.8B assigns to the most likely completion token when prompted with just the prefix token.\n",
"- `token#`: the tokens of the bigram.\n",
"\n",
"Note:\n",
"\n",
@@ -419,13 +418,12 @@
"source": [
"## **Trigrams**\n",
"\n",
"The table below shows trigram completions in The Pile sorted by the frequency of\n",
"occurence of the prefix bigram:\n",
"The table below shows trigram completions in The Pile sorted by the frequency of occurence of the prefix bigram:\n",
"\n",
"- `token#`: the tokens of the trigram.\n",
"- `sum_count`: the number of times the prefix bigram occurs in The Pile.\n",
"- `frac_max`: the fraction of bigram appearances that are followed by the most common third token. For example, when prompted with the tokens `[\"://\", \"www\"]`, 99.4% of the time, the next token is `\".\"`.\n",
"- `p_2.8b`: the probability Pythia-2.8B assigns to the most likely completion token when prompted with the prefix bigram.\n",
"- `token#`: the tokens of the trigram.\n",
"\n",
"Note:\n",
"\n",
@@ -531,20 +529,20 @@
"To construct bigram and trigram statistics, we process [the entire deduplicated\n",
"Pile](https://huggingface.co/datasets/EleutherAI/the_pile_deduplicated). \n",
"\n",
"We share six datasets on Huggingface. Descriptions of the datasets are available in the linked model cards:\n",
"We share six datasets on Huggingface. Descriptions of the datasets are available in the linked dataset cards:\n",
"\n",
"- [pile_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_bigrams): Raw bigram statistics:\n",
" - 479 million unique bigrams.\n",
"- [pile_bigram_prefixes](https://huggingface.co/datasets/Confirm-Labs/pile_bigram_prefixes): All bigram prefixes with their most common completion token.\n",
" - 50,054 unique bigram prefixes (equivalent to tokens/unigrams).\n",
" - 50,054 unique bigram prefixes (one row for each unique token).\n",
"- [pile_top_bigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_bigrams): Those bigram prefixes for which the most common completion has > 50% probability. We add Pythia's probability of the most frequent completion for each Pythia model.\n",
" - 3,448 such bigram prefixes. All of these are available to browse above.\n",
" - 3,448 such bigram prefixes. All of these are available to browse on this page above.\n",
"- [pile_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_trigrams): Raw trigram statistics.\n",
" - 9.9 billion unique trigrams.\n",
"- [pile_trigram_prefixes](https://huggingface.co/datasets/Confirm-Labs/pile_trigram_prefixes): All trigram prefixes with their most common completion token.\n",
" - 479 million unique trigram prefixes (equivalent to bigrams).\n",
"- [pile_top_trigrams](https://huggingface.co/datasets/Confirm-Labs/pile_top_trigrams): Those trigram prefixes for which the most common completion has > 50% probability and where the prefix occurs more than 1000 times in The Pile. We add Pythia's probability of the most frequent completion for each Pythia model.\n",
" - 1,542,074 such trigram prefixes. The top 25k are available to browse above."
" - 1,542,074 such trigram prefixes. The top 25k are available to browse on this page above."
]
},
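For anyone who wants to work with these files directly, a minimal sketch for pulling one of them with the `datasets` library follows; split names and columns are inspected rather than assumed.

```python
# Minimal sketch: download one of the shared datasets and inspect it.
from datasets import load_dataset

bigrams = load_dataset("Confirm-Labs/pile_top_bigrams")
print(bigrams)                    # shows the available splits and columns
first_split = next(iter(bigrams.values()))
print(first_split[0])             # first row of the first split
```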
{
@@ -658,7 +656,7 @@
"the prompt has a large influence and results in a confident prediction. Note\n",
"that the true next token $t_{i + n + 1}$ does not factor into these criteria\n",
"and therefore the correctness of the model's predictions does not affect\n",
"whether we consider the model to successfully be completing a task.\n",
"whether we consider the model to be \"completing a task\".\n",
"\n",
"We share 1,874,497 tasks produced by prompt scanning with Pythia-2.8B for every sliding 5-token prompt in the first 112.5M tokens of the Pile. The dataset is available on Huggingface: [pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)\n"
]
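A sketch of the scan itself over a toy piece of text is below; the confidence and influence cutoffs are illustrative placeholders rather than the thresholds used for pile_scan_4, and the model-loading boilerplate mirrors the earlier sketch.

```python
# Sketch of prompt scanning: slide a 5-token window over a token sequence and
# flag positions where deleting the first token changes a confident prediction.
# The 0.5 and 0.2 cutoffs are placeholders, not the thresholds used for pile_scan_4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-2.8b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-2.8b")
model.eval()

def top_next(ids):
    with torch.no_grad():
        probs = torch.softmax(model(ids.unsqueeze(0)).logits[0, -1], dim=-1)
    p, idx = probs.max(dim=-1)
    return int(idx), p.item()

# Toy text standing in for a stretch of The Pile.
text = "Sodium chloride, or common table salt, is an ionic compound."
ids = tok(text, return_tensors="pt").input_ids[0]

tasks = []
for i in range(len(ids) - 4):                   # every sliding 5-token prompt
    window = ids[i : i + 5]
    long_tok, p_long = top_next(window)         # full five-token prompt
    short_tok, p_short = top_next(window[1:])   # first token deleted
    if p_long > 0.5 and (long_tok != short_tok or p_long - p_short > 0.2):
        tasks.append((tok.decode(window), tok.decode(long_tok), round(p_long, 2)))

for row in tasks:
    print(row)
```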
