Summary of summaries (#16)
* Add Summary of Summaries

Signed-off-by: Fayvor Love <[email protected]>

* Remove errant file

Signed-off-by: Fayvor Love <[email protected]>

* remove extra space in pip installs

Signed-off-by: Fayvor Love <[email protected]>

* Reduce text sizes for Replicate/testing

Signed-off-by: Fayvor Love <[email protected]>

---------

Signed-off-by: Fayvor Love <[email protected]>
fayvor authored Oct 12, 2024
1 parent 7de6cc3 commit aedecd0
Showing 1 changed file with 136 additions and 14 deletions.
150 changes: 136 additions & 14 deletions recipes/Summarize/Summarize.ipynb
@@ -32,7 +32,8 @@
"source": [
"! pip install git+https://github.com/ibm-granite-community/granite-kitchen \\\n",
" transformers \\\n",
" torch"
" torch \\\n",
" tiktoken"
]
},
{
@@ -97,17 +98,19 @@
"# Get the contents\n",
"response = requests.get(url)\n",
"response.raise_for_status()\n",
"contents = response.text\n",
"full_contents = response.text\n",
"\n",
"# Extract the text of the book, leaving out the gutenberg boilerplate.\n",
"start_index = contents.index(\"*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\")\n",
"end_index = contents.find(\"*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\")\n",
"contents = contents[start_index:end_index]\n",
"print(\"Length of book text: {} chars\".format(len(contents)))\n",
"start_str = \"*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\"\n",
"start_index = full_contents.index(start_str) + len(start_str)\n",
"end_str = \"*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\"\n",
"end_index = full_contents.index(end_str)\n",
"book_contents = full_contents[start_index:end_index]\n",
"print(\"Length of book text: {} chars\".format(len(book_contents)))\n",
"\n",
"# We limit the text to 200k characters, which is about 57k tokens. (400k chars is ~114k tokens; 300k chars is ~86k tokens; 350k chars is ~100k tokens).\n",
"char_limit = 200000\n",
"contents = contents[:char_limit]\n",
"char_limit = 10000\n",
"contents = book_contents[:char_limit]\n",
"print(\"Length of text for summarization: {} chars\".format(len(contents)))"
]
},
@@ -122,9 +125,8 @@
"Before sending our code to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.\n",
"\n",
"Key points:\n",
"- We're using the `granite-8B-Code-instruct-128k` model, which has a context window of 128,000 tokens\n",
"- The context window includes both the input (the book text) and the output (the summary)\n",
"- Tokenization can vary between models, so we use the specific tokenizer for our chosen model\n",
"- We're using the [`granite-8B-Code-instruct-128k`](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) model, which has a context window of 128,000 tokens.\n",
"- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.\n",
"\n",
"Understanding token count helps us optimize our prompts and ensure we're using the model efficiently."
]
@@ -141,6 +143,7 @@
"\n",
"model_path = \"ibm-granite/granite-8B-Code-instruct-128k\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path)\n",
"print(\"Your model uses the tokenizer \" + type(tokenizer).__name__)\n",
"\n",
"print(f\"Your document has has {len(tokenizer(contents, return_tensors='pt')['input_ids'][0])} tokens. \")"
]
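For context, the printed token count is most useful when compared against the budget left for generation. Below is a minimal sketch of that check, reusing the `tokenizer` and `contents` defined above; the 128k window comes from the model card cited earlier, and the 10,000-token output budget mirrors the `max_tokens` passed to the model later in this notebook:

```python
# Sketch: check that the input leaves room for the requested output.
# Assumes `tokenizer` and `contents` from the cells above.
context_window = 128_000    # context size of granite-8B-Code-instruct-128k
max_output_tokens = 10_000  # mirrors max_tokens in the invoke call below

input_tokens = len(tokenizer(contents, return_tensors="pt")["input_ids"][0])
input_budget = context_window - max_output_tokens
print(f"Input uses {input_tokens} of {input_budget} available input tokens.")
if input_tokens > input_budget:
    print("The text will not fit; trim it or use the summary-of-summaries approach below.")
```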
@@ -165,15 +168,134 @@
"outputs": [],
"source": [
"prompt = f\"\"\"\n",
"Summarize the following text:\n",
"Summarize the following text from \"Walden\" by Henry David Thoreau:\n",
"{contents}\n",
"\"\"\"\n",
"\n",
"output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 10000,\n",
" \"min_tokens\": 0,\n",
" \"max_tokens\": 10000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 200, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
" \"frequency_penalty\": 0\n",
" }\n",
" )\n",
"\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary of Summaries\n",
"\n",
"Here we use an iterative summarization technique to adapt to the context length of the model."
]
},
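In outline, the technique is a simple map-reduce: summarize each chunk independently, then summarize the concatenation of those summaries. A hedged sketch of the shape of the computation, where `summarize` is a hypothetical stand-in for the prompt-plus-`model.invoke` call used in the cells that follow:

```python
# Hypothetical outline of the summary-of-summaries pattern.
# `summarize` stands in for the prompt + model.invoke call used below.
def summary_of_summaries(chunks, summarize):
    partial_summaries = [summarize(chunk.page_content) for chunk in chunks]  # map step
    return summarize("\n\n".join(partial_summaries))                         # reduce step
```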
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chunk the text\n",
"\n",
"Divide the full text into smaller passages for separate processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import TokenTextSplitter\n",
"from langchain.docstore.document import Document\n",
"\n",
"excerpt_length = 20000\n",
"doc = Document(page_content=book_contents[:excerpt_length], metadata={\"source\": \"local\"})\n",
"print(f\"The text is {len(doc.page_content)} chars\")\n",
"\n",
"# Split the documents into chunks\n",
"chunk_char_limit = 1000\n",
"text_splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=chunk_char_limit, chunk_overlap=50)\n",
"chunks = text_splitter.split_documents([doc])\n",
"print(\"Chunk count: \" + str(len(chunks)))"
]
},
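As a sanity check, one can re-tokenize each chunk and confirm it stays within the limit. A small sketch, assuming the `tokenizer`, `chunks`, and `chunk_token_limit` defined in the cell above:

```python
# Sketch: verify that every chunk fits within the token limit.
# Counts may differ slightly from chunk_size if the tokenizer adds special tokens.
token_counts = [len(tokenizer(chunk.page_content)["input_ids"]) for chunk in chunks]
print(f"Largest chunk: {max(token_counts)} tokens (limit: {chunk_token_limit})")
```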
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarize the chunks\n",
"\n",
"Here we create a separate summary of each passage. This can take a few minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summaries = []\n",
"\n",
"for i, chunk in enumerate(chunks):\n",
" prompt = f\"\"\"\n",
" Summarize the following text from \"Walden\" by Henry David Thoreau:\n",
" {chunk}\n",
" \"\"\"\n",
" output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 10000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 200, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
" \"frequency_penalty\": 0\n",
" }\n",
" )\n",
" summary = f\"Summary {i+1}:\\n{output}\\n\\n\"\n",
" summaries.append(summary)\n",
" print(summary)\n",
"\n",
"print(\"Summary count: \" + str(len(summaries)))\n"
]
},
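One caveat before the final pass: the joined summaries must themselves fit in the model's context window; if they do not, another round of chunking and summarizing is needed. A quick hedged check, reusing the `tokenizer` from above:

```python
# Sketch: confirm the combined summaries fit before the final summarization pass.
combined = "\n\n".join(summaries)
combined_tokens = len(tokenizer(combined)["input_ids"])
print(f"Combined summaries: {combined_tokens} tokens")
# If this approaches the 128k context window, chunk the summaries and repeat.
```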
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarize the Summaries\n",
"\n",
"We signal to the model that it is receiving separate summaries of passages from an original text, and to create a unified summary of that text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary_contents = \"\\n\\n\".join(summaries)\n",
"print(len(summary_contents))\n",
"\n",
"prompt = f\"\"\"\n",
"The text of \"Walden\", by Henry David Thoreau, was summarized in separate passages; those passage summaries are provided below. \n",
"\n",
"{summary_contents}\n",
"\n",
"From these summaries, compose a single lengthy, unified summary of the original text.\n",
"\"\"\"\n",
"\n",
"output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 100000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 5000, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
