Summary of summaries (#16)
* Add Summary of Summaries

Signed-off-by: Fayvor Love <[email protected]>

* Remove errant file

Signed-off-by: Fayvor Love <[email protected]>

* remove extra space in pip installs

Signed-off-by: Fayvor Love <[email protected]>

* Reduce text sizes for Replicate/testing

Signed-off-by: Fayvor Love <[email protected]>

---------

Signed-off-by: Fayvor Love <[email protected]>
fayvor authored Oct 12, 2024
1 parent 7de6cc3 commit aedecd0
Showing 1 changed file with 136 additions and 14 deletions.
150 changes: 136 additions & 14 deletions recipes/Summarize/Summarize.ipynb
@@ -32,7 +32,8 @@
"source": [
"! pip install git+https://github.com/ibm-granite-community/granite-kitchen \\\n",
" transformers \\\n",
" torch"
" torch \\\n",
" tiktoken"
]
},
{
@@ -97,17 +98,19 @@
"# Get the contents\n",
"response = requests.get(url)\n",
"response.raise_for_status()\n",
"contents = response.text\n",
"full_contents = response.text\n",
"\n",
"# Extract the text of the book, leaving out the gutenberg boilerplate.\n",
"start_index = contents.index(\"*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\")\n",
"end_index = contents.find(\"*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\")\n",
"contents = contents[start_index:end_index]\n",
"print(\"Length of book text: {} chars\".format(len(contents)))\n",
"start_str = \"*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\"\n",
"start_index = full_contents.index(start_str) + len(start_str)\n",
"end_str = \"*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***\"\n",
"end_index = full_contents.index(end_str)\n",
"book_contents = full_contents[start_index:end_index]\n",
"print(\"Length of book text: {} chars\".format(len(book_contents)))\n",
"\n",
"# We limit the text to 200k characters, which is about 57k tokens. (400k chars is ~114k tokens; 300k chars is ~86k tokens; 350k chars is ~100k tokens).\n",
"char_limit = 200000\n",
"contents = contents[:char_limit]\n",
"char_limit = 10000\n",
"contents = book_contents[:char_limit]\n",
"print(\"Length of text for summarization: {} chars\".format(len(contents)))"
]
},
@@ -122,9 +125,8 @@
"Before sending our code to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.\n",
"\n",
"Key points:\n",
"- We're using the `granite-8B-Code-instruct-128k` model, which has a context window of 128,000 tokens\n",
"- The context window includes both the input (the book text) and the output (the summary)\n",
"- Tokenization can vary between models, so we use the specific tokenizer for our chosen model\n",
"- We're using the [`granite-8B-Code-instruct-128k`](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) model, which has a context window of 128,000 tokens.\n",
"- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.\n",
"\n",
"Understanding token count helps us optimize our prompts and ensure we're using the model efficiently."
]
@@ -141,6 +143,7 @@
"\n",
"model_path = \"ibm-granite/granite-8B-Code-instruct-128k\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_path)\n",
"print(\"Your model uses the tokenizer \" + type(tokenizer).__name__)\n",
"\n",
"print(f\"Your document has has {len(tokenizer(contents, return_tensors='pt')['input_ids'][0])} tokens. \")"
]
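For context, the printed token count is most useful when compared against the budget left for generation. Below is a minimal sketch of that check, reusing the `tokenizer` and `contents` defined above; the 128k window comes from the model card cited earlier, and the 10,000-token output budget mirrors the `max_tokens` passed to the model later in this notebook:

```python
# Sketch: check that the input leaves room for the requested output.
# Assumes `tokenizer` and `contents` from the cells above.
context_window = 128_000    # context size of granite-8B-Code-instruct-128k
max_output_tokens = 10_000  # mirrors max_tokens in the invoke call below

input_tokens = len(tokenizer(contents, return_tensors="pt")["input_ids"][0])
input_budget = context_window - max_output_tokens
print(f"Input uses {input_tokens} of {input_budget} available input tokens.")
if input_tokens > input_budget:
    print("The text will not fit; trim it or use the summary-of-summaries approach below.")
```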
@@ -165,15 +168,134 @@
"outputs": [],
"source": [
"prompt = f\"\"\"\n",
"Summarize the following text:\n",
"Summarize the following text from \"Walden\" by Henry David Thoreau:\n",
"{contents}\n",
"\"\"\"\n",
"\n",
"output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 10000,\n",
" \"min_tokens\": 0,\n",
" \"max_tokens\": 10000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 200, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
" \"frequency_penalty\": 0\n",
" }\n",
" )\n",
"\n",
"print(output)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary of Summaries\n",
"\n",
"Here we use an iterative summarization technique to adapt to the context length of the model."
]
},
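In outline, the technique is a simple map-reduce: summarize each chunk independently, then summarize the concatenation of those summaries. A hedged sketch of the shape of the computation, where `summarize` is a hypothetical stand-in for the prompt-plus-`model.invoke` call used in the cells that follow:

```python
# Hypothetical outline of the summary-of-summaries pattern.
# `summarize` stands in for the prompt + model.invoke call used below.
def summary_of_summaries(chunks, summarize):
    partial_summaries = [summarize(chunk.page_content) for chunk in chunks]  # map step
    return summarize("\n\n".join(partial_summaries))                         # reduce step
```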
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chunk the text\n",
"\n",
"Divide the full text into smaller passages for separate processing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.text_splitter import TokenTextSplitter\n",
"from langchain.docstore.document import Document\n",
"\n",
"excerpt_length = 20000\n",
"doc = Document(page_content=book_contents[:excerpt_length], metadata={\"source\": \"local\"})\n",
"print(f\"The text is {len(doc.page_content)} chars\")\n",
"\n",
"# Split the documents into chunks\n",
"chunk_char_limit = 1000\n",
"text_splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=chunk_char_limit, chunk_overlap=50)\n",
"chunks = text_splitter.split_documents([doc])\n",
"print(\"Chunk count: \" + str(len(chunks)))"
]
},
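As a sanity check, one can re-tokenize each chunk and confirm it stays within the limit. A small sketch, assuming the `tokenizer`, `chunks`, and `chunk_token_limit` defined in the cell above:

```python
# Sketch: verify that every chunk fits within the token limit.
# Counts may differ slightly from chunk_size if the tokenizer adds special tokens.
token_counts = [len(tokenizer(chunk.page_content)["input_ids"]) for chunk in chunks]
print(f"Largest chunk: {max(token_counts)} tokens (limit: {chunk_token_limit})")
```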
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarize the chunks\n",
"\n",
"Here we create a separate summary of each passage. This can take a few minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summaries = []\n",
"\n",
"for i, chunk in enumerate(chunks):\n",
" prompt = f\"\"\"\n",
" Summarize the following text from \"Walden\" by Henry David Thoreau:\n",
" {chunk}\n",
" \"\"\"\n",
" output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 10000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 200, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
" \"frequency_penalty\": 0\n",
" }\n",
" )\n",
" summary = f\"Summary {i+1}:\\n{output}\\n\\n\"\n",
" summaries.append(summary)\n",
" print(summary)\n",
"\n",
"print(\"Summary count: \" + str(len(summaries)))\n"
]
},
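One caveat before the final pass: the joined summaries must themselves fit in the model's context window; if they do not, another round of chunking and summarizing is needed. A quick hedged check, reusing the `tokenizer` from above:

```python
# Sketch: confirm the combined summaries fit before the final summarization pass.
combined = "\n\n".join(summaries)
combined_tokens = len(tokenizer(combined)["input_ids"])
print(f"Combined summaries: {combined_tokens} tokens")
# If this approaches the 128k context window, chunk the summaries and repeat.
```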
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summarize the Summaries\n",
"\n",
"We signal to the model that it is receiving separate summaries of passages from an original text, and to create a unified summary of that text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"summary_contents = \"\\n\\n\".join(summaries)\n",
"print(len(summary_contents))\n",
"\n",
"prompt = f\"\"\"\n",
"The text of \"Walden\", by Henry David Thoreau, was summarized in separate passages; those passage summaries are provided below. \n",
"\n",
"{summary_contents}\n",
"\n",
"From these summaries, compose a single lengthy, unified summary of the original text.\n",
"\"\"\"\n",
"\n",
"output = model.invoke(\n",
" prompt,\n",
" model_kwargs={\n",
" \"max_tokens\": 100000, # Set the maximum number of tokens to generate as output.\n",
" \"min_tokens\": 5000, # Set the minimum number of tokens to generate as output.\n",
" \"temperature\": 0.75,\n",
" \"system_prompt\": \"You are a helpful assistant.\",\n",
" \"presence_penalty\": 0,\n",
