fix typo
lfunderburk committed Nov 21, 2023
1 parent 7fbf89a commit 0eab6e9
Showing 2 changed files with 109 additions and 6 deletions.
109 changes: 106 additions & 3 deletions ch3/jupyter-notebooks/components.ipynb
@@ -6,9 +6,9 @@
"source": [
"# Building blocks in Haystack: components and pipelines\n",
"\n",
"In the [previous notebook](data_classes.ipynb), we learned how we can store structured and unstructured data through Documents objects, as well as dataframe, ByteStream, ChatMessage and StreamingChunk objects. We also learned how to store these objects into a Document Store. In this notebook, we will explore how to store and retrieve data from a Haystack Document store. Let's take a look at its architecture.\n",
"In the [previous notebook](data_classes.ipynb), we learned how we can store structured and unstructured data through Documents objects, as well as data frame, ByteStream, ChatMessage and StreamingChunk objects. We also learned how to store these objects into a Document Store. In this notebook, we will explore they Haystack component.\n",
"\n",
"Haystack's architecture leverages components as its core elements, each performing specific functions like text processing or summarization. These components are designed to be connected into pipelines, which orchestrate the flow of data and manage task execution in a structured manner. The Pipeline class facilitates this by allowing the addition and connection of components, which must have unique input and output points for data transfer.\n",
"Haystack's components are designed to be connected into pipelines, which orchestrate the flow of data and manage task execution in a structured manner. The Pipeline class facilitates this by allowing the addition and connection of components, which must have unique input and output points for data transfer.\n",
"\n",
"Pipelines are the backbone of NLP applications in Haystack, functioning as directed graphs where nodes are components and edges dictate data flow. They ensure smooth data processing, handle errors, and support debugging through visualization tools that help developers trace and optimize the data journey.\n",
"\n",
@@ -609,12 +609,115 @@
" print(\"\\n\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Rankers"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "03fc6a0289464be28b80bc2808cffee4",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading config.json: 0%| | 0.00/794 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f5f3a35ef02e45d9a1d4eff9e0b53ea8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading pytorch_model.bin: 0%| | 0.00/90.9M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"ename": "AssertionError",
"evalue": "Torch not compiled with CUDA enabled",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAssertionError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb Cell 33\u001b[0m line \u001b[0;36m8\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb#Y112sZmlsZQ%3D%3D?line=4'>5</a>\u001b[0m ranker \u001b[39m=\u001b[39m TransformersSimilarityRanker(model_name_or_path\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mcross-encoder/ms-marco-MiniLM-L-6-v2\u001b[39m\u001b[39m\"\u001b[39m, device\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mcuda\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb#Y112sZmlsZQ%3D%3D?line=6'>7</a>\u001b[0m \u001b[39m# Warm up the model\u001b[39;00m\n\u001b[0;32m----> <a href='vscode-notebook-cell:/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb#Y112sZmlsZQ%3D%3D?line=7'>8</a>\u001b[0m ranker\u001b[39m.\u001b[39;49mwarm_up()\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb#Y112sZmlsZQ%3D%3D?line=9'>10</a>\u001b[0m \u001b[39m# Candidate documents\u001b[39;00m\n\u001b[1;32m <a href='vscode-notebook-cell:/Users/macpro/Documents/GitHub/Building-Natural-Language-Pipelines/ch3/jupyter-notebooks/components.ipynb#Y112sZmlsZQ%3D%3D?line=10'>11</a>\u001b[0m docs \u001b[39m=\u001b[39m [Document(content\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mParis\u001b[39m\u001b[39m\"\u001b[39m), Document(content\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mBerlin\u001b[39m\u001b[39m\"\u001b[39m)]\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/haystack/preview/components/rankers/transformers_similarity.py:78\u001b[0m, in \u001b[0;36mTransformersSimilarityRanker.warm_up\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel_name_or_path \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel:\n\u001b[1;32m 77\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel \u001b[39m=\u001b[39m AutoModelForSequenceClassification\u001b[39m.\u001b[39mfrom_pretrained(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel_name_or_path, token\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtoken)\n\u001b[0;32m---> 78\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mmodel\u001b[39m.\u001b[39;49mto(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdevice)\n\u001b[1;32m 79\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel\u001b[39m.\u001b[39meval()\n\u001b[1;32m 80\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtokenizer \u001b[39m=\u001b[39m AutoTokenizer\u001b[39m.\u001b[39mfrom_pretrained(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mmodel_name_or_path, token\u001b[39m=\u001b[39m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mtoken)\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/transformers/modeling_utils.py:2014\u001b[0m, in \u001b[0;36mPreTrainedModel.to\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 2009\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 2010\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m`.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is, since the\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 2011\u001b[0m \u001b[39m\"\u001b[39m\u001b[39m model has already been set to the correct devices and casted to the correct `dtype`.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 2012\u001b[0m )\n\u001b[1;32m 2013\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[0;32m-> 2014\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39msuper\u001b[39;49m()\u001b[39m.\u001b[39;49mto(\u001b[39m*\u001b[39;49margs, \u001b[39m*\u001b[39;49m\u001b[39m*\u001b[39;49mkwargs)\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:1145\u001b[0m, in \u001b[0;36mModule.to\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1141\u001b[0m \u001b[39mreturn\u001b[39;00m t\u001b[39m.\u001b[39mto(device, dtype \u001b[39mif\u001b[39;00m t\u001b[39m.\u001b[39mis_floating_point() \u001b[39mor\u001b[39;00m t\u001b[39m.\u001b[39mis_complex() \u001b[39melse\u001b[39;00m \u001b[39mNone\u001b[39;00m,\n\u001b[1;32m 1142\u001b[0m non_blocking, memory_format\u001b[39m=\u001b[39mconvert_to_format)\n\u001b[1;32m 1143\u001b[0m \u001b[39mreturn\u001b[39;00m t\u001b[39m.\u001b[39mto(device, dtype \u001b[39mif\u001b[39;00m t\u001b[39m.\u001b[39mis_floating_point() \u001b[39mor\u001b[39;00m t\u001b[39m.\u001b[39mis_complex() \u001b[39melse\u001b[39;00m \u001b[39mNone\u001b[39;00m, non_blocking)\n\u001b[0;32m-> 1145\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_apply(convert)\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:797\u001b[0m, in \u001b[0;36mModule._apply\u001b[0;34m(self, fn)\u001b[0m\n\u001b[1;32m 795\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_apply\u001b[39m(\u001b[39mself\u001b[39m, fn):\n\u001b[1;32m 796\u001b[0m \u001b[39mfor\u001b[39;00m module \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mchildren():\n\u001b[0;32m--> 797\u001b[0m module\u001b[39m.\u001b[39;49m_apply(fn)\n\u001b[1;32m 799\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mcompute_should_use_set_data\u001b[39m(tensor, tensor_applied):\n\u001b[1;32m 800\u001b[0m \u001b[39mif\u001b[39;00m torch\u001b[39m.\u001b[39m_has_compatible_shallow_copy_type(tensor, tensor_applied):\n\u001b[1;32m 801\u001b[0m \u001b[39m# If the new tensor has compatible tensor type as the existing tensor,\u001b[39;00m\n\u001b[1;32m 802\u001b[0m \u001b[39m# the current behavior is to change the tensor in-place using `.data =`,\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 807\u001b[0m \u001b[39m# global flag to let the user control whether they want the future\u001b[39;00m\n\u001b[1;32m 808\u001b[0m \u001b[39m# behavior of overwriting the existing tensor or not.\u001b[39;00m\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:797\u001b[0m, in \u001b[0;36mModule._apply\u001b[0;34m(self, fn)\u001b[0m\n\u001b[1;32m 795\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_apply\u001b[39m(\u001b[39mself\u001b[39m, fn):\n\u001b[1;32m 796\u001b[0m \u001b[39mfor\u001b[39;00m module \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mchildren():\n\u001b[0;32m--> 797\u001b[0m module\u001b[39m.\u001b[39;49m_apply(fn)\n\u001b[1;32m 799\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mcompute_should_use_set_data\u001b[39m(tensor, tensor_applied):\n\u001b[1;32m 800\u001b[0m \u001b[39mif\u001b[39;00m torch\u001b[39m.\u001b[39m_has_compatible_shallow_copy_type(tensor, tensor_applied):\n\u001b[1;32m 801\u001b[0m \u001b[39m# If the new tensor has compatible tensor type as the existing tensor,\u001b[39;00m\n\u001b[1;32m 802\u001b[0m \u001b[39m# the current behavior is to change the tensor in-place using `.data =`,\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 807\u001b[0m \u001b[39m# global flag to let the user control whether they want the future\u001b[39;00m\n\u001b[1;32m 808\u001b[0m \u001b[39m# behavior of overwriting the existing tensor or not.\u001b[39;00m\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:797\u001b[0m, in \u001b[0;36mModule._apply\u001b[0;34m(self, fn)\u001b[0m\n\u001b[1;32m 795\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_apply\u001b[39m(\u001b[39mself\u001b[39m, fn):\n\u001b[1;32m 796\u001b[0m \u001b[39mfor\u001b[39;00m module \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mchildren():\n\u001b[0;32m--> 797\u001b[0m module\u001b[39m.\u001b[39;49m_apply(fn)\n\u001b[1;32m 799\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mcompute_should_use_set_data\u001b[39m(tensor, tensor_applied):\n\u001b[1;32m 800\u001b[0m \u001b[39mif\u001b[39;00m torch\u001b[39m.\u001b[39m_has_compatible_shallow_copy_type(tensor, tensor_applied):\n\u001b[1;32m 801\u001b[0m \u001b[39m# If the new tensor has compatible tensor type as the existing tensor,\u001b[39;00m\n\u001b[1;32m 802\u001b[0m \u001b[39m# the current behavior is to change the tensor in-place using `.data =`,\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 807\u001b[0m \u001b[39m# global flag to let the user control whether they want the future\u001b[39;00m\n\u001b[1;32m 808\u001b[0m \u001b[39m# behavior of overwriting the existing tensor or not.\u001b[39;00m\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:820\u001b[0m, in \u001b[0;36mModule._apply\u001b[0;34m(self, fn)\u001b[0m\n\u001b[1;32m 816\u001b[0m \u001b[39m# Tensors stored in modules are graph leaves, and we don't want to\u001b[39;00m\n\u001b[1;32m 817\u001b[0m \u001b[39m# track autograd history of `param_applied`, so we have to use\u001b[39;00m\n\u001b[1;32m 818\u001b[0m \u001b[39m# `with torch.no_grad():`\u001b[39;00m\n\u001b[1;32m 819\u001b[0m \u001b[39mwith\u001b[39;00m torch\u001b[39m.\u001b[39mno_grad():\n\u001b[0;32m--> 820\u001b[0m param_applied \u001b[39m=\u001b[39m fn(param)\n\u001b[1;32m 821\u001b[0m should_use_set_data \u001b[39m=\u001b[39m compute_should_use_set_data(param, param_applied)\n\u001b[1;32m 822\u001b[0m \u001b[39mif\u001b[39;00m should_use_set_data:\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/nn/modules/module.py:1143\u001b[0m, in \u001b[0;36mModule.to.<locals>.convert\u001b[0;34m(t)\u001b[0m\n\u001b[1;32m 1140\u001b[0m \u001b[39mif\u001b[39;00m convert_to_format \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m \u001b[39mand\u001b[39;00m t\u001b[39m.\u001b[39mdim() \u001b[39min\u001b[39;00m (\u001b[39m4\u001b[39m, \u001b[39m5\u001b[39m):\n\u001b[1;32m 1141\u001b[0m \u001b[39mreturn\u001b[39;00m t\u001b[39m.\u001b[39mto(device, dtype \u001b[39mif\u001b[39;00m t\u001b[39m.\u001b[39mis_floating_point() \u001b[39mor\u001b[39;00m t\u001b[39m.\u001b[39mis_complex() \u001b[39melse\u001b[39;00m \u001b[39mNone\u001b[39;00m,\n\u001b[1;32m 1142\u001b[0m non_blocking, memory_format\u001b[39m=\u001b[39mconvert_to_format)\n\u001b[0;32m-> 1143\u001b[0m \u001b[39mreturn\u001b[39;00m t\u001b[39m.\u001b[39;49mto(device, dtype \u001b[39mif\u001b[39;49;00m t\u001b[39m.\u001b[39;49mis_floating_point() \u001b[39mor\u001b[39;49;00m t\u001b[39m.\u001b[39;49mis_complex() \u001b[39melse\u001b[39;49;00m \u001b[39mNone\u001b[39;49;00m, non_blocking)\n",
"File \u001b[0;32m~/anaconda3/envs/llm-pipelines/lib/python3.10/site-packages/torch/cuda/__init__.py:239\u001b[0m, in \u001b[0;36m_lazy_init\u001b[0;34m()\u001b[0m\n\u001b[1;32m 235\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mRuntimeError\u001b[39;00m(\n\u001b[1;32m 236\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mCannot re-initialize CUDA in forked subprocess. To use CUDA with \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 237\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mmultiprocessing, you must use the \u001b[39m\u001b[39m'\u001b[39m\u001b[39mspawn\u001b[39m\u001b[39m'\u001b[39m\u001b[39m start method\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 238\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mhasattr\u001b[39m(torch\u001b[39m.\u001b[39m_C, \u001b[39m'\u001b[39m\u001b[39m_cuda_getDeviceCount\u001b[39m\u001b[39m'\u001b[39m):\n\u001b[0;32m--> 239\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mAssertionError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mTorch not compiled with CUDA enabled\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[1;32m 240\u001b[0m \u001b[39mif\u001b[39;00m _cudart \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m 241\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mAssertionError\u001b[39;00m(\n\u001b[1;32m 242\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mlibcudart functions unavailable. It looks like you have a broken build?\u001b[39m\u001b[39m\"\u001b[39m)\n",
"\u001b[0;31mAssertionError\u001b[0m: Torch not compiled with CUDA enabled"
]
}
],
"source": [
"from haystack.preview import Document\n",
"from haystack.preview.components.rankers import TransformersSimilarityRanker\n",
"\n",
"# Initialize the ranker with a pre-trained model\n",
"ranker = TransformersSimilarityRanker(model_name_or_path=\"cross-encoder/ms-marco-MiniLM-L-6-v2\", device=\"cuda\")\n",
"\n",
"# Warm up the model\n",
"ranker.warm_up()\n",
"\n",
"# Candidate documents\n",
"docs = [Document(content=\"Paris\"), Document(content=\"Berlin\")]\n",
"\n",
"# Query\n",
"query = \"City in Germany\"\n",
"\n",
"# Rank the documents\n",
"output = ranker.run(query=query, documents=docs)\n",
"\n",
"# Get the ranked documents\n",
"ranked_docs = output[\"documents\"]\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"source": [
"from haystack.preview import Document\n",
"from haystack.preview.components.rankers import MetaFieldRanker\n",
"\n",
"# Initialize the ranker to sort by the \"rating\" metadata field\n",
"ranker = MetaFieldRanker(metadata_field=\"rating\")\n",
"\n",
"# Documents with metadata field \"rating\"\n",
"docs = [\n",
" Document(text=\"Paris\", metadata={\"rating\": 1.3}),\n",
" Document(text=\"Berlin\", metadata={\"rating\": 0.7}),\n",
" Document(text=\"Barcelona\", metadata={\"rating\": 2.1}),\n",
"]\n",
"\n",
"# Rank the documents\n",
"output = ranker.run(documents=docs)\n",
"\n",
"# Get the ranked documents\n",
"ranked_docs = output[\"documents\"]\n"
]
}
],
"metadata": {
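A note on the `AssertionError` captured in the Rankers cell above: it comes from hard-coding `device="cuda"` on a PyTorch build installed without CUDA support. A minimal sketch of a guarded device choice (the only assumption is that `torch` may or may not be importable on the reader's machine):

```python
# Guarded device selection: fall back to CPU when CUDA is unavailable,
# avoiding the "Torch not compiled with CUDA enabled" AssertionError
# raised by a hard-coded device="cuda".
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    # torch itself is missing; CPU is the only safe default
    device = "cpu"

print(device)
```

The resulting string can then be passed to `TransformersSimilarityRanker(model_name_or_path=..., device=device)` in place of the hard-coded `"cuda"`.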
6 changes: 3 additions & 3 deletions ch3/jupyter-notebooks/data_classes.ipynb
@@ -8,15 +8,15 @@
"\n",
"When building data pipelines, a core component involved is the use of data structures. With data structures, we can store, manipulate and manage data through code. Having a solid foundation for data structures is key to ease NLP pipeline development, particularly when an LLM is involved. With Haystack, we can leverage the following built-in data classes: \n",
"\n",
"* Haystack Documents data class \n",
"* Haystack Document data class \n",
"\n",
"* Haystack ByteStream data class \n",
"\n",
"* Haystack ChatMessage data class \n",
"\n",
"* Haystack StreaminhChunk data class \n",
"* Haystack StreamingChunk data class \n",
"\n",
"It also provides support for dataframe objects as well as dictionaries and JSON objects. \n",
"It also provides support for data frame objects as well as dictionaries and JSON objects. \n",
"\n",
"![](./images/data-structures.png)\n",
"\n",
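The ordering that the `MetaFieldRanker` cell added to `components.ipynb` produces can be approximated without Haystack installed. In this stdlib-only sketch, `Doc` and `rank_by_meta_field` are hypothetical stand-ins, not Haystack APIs:

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    # Stand-in for Haystack's Document; only the fields the sketch needs.
    content: str
    meta: dict = field(default_factory=dict)

def rank_by_meta_field(docs: list[Doc], meta_field: str) -> list[Doc]:
    # MetaFieldRanker-style ordering: documents with the highest value
    # of the chosen metadata field come first.
    return sorted(docs, key=lambda d: d.meta[meta_field], reverse=True)

docs = [
    Doc("Paris", {"rating": 1.3}),
    Doc("Berlin", {"rating": 0.7}),
    Doc("Barcelona", {"rating": 2.1}),
]
ranked = rank_by_meta_field(docs, "rating")
print([d.content for d in ranked])  # → ['Barcelona', 'Paris', 'Berlin']
```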
