diff --git a/ch3/jupyter-notebooks/components.ipynb b/ch3/jupyter-notebooks/components.ipynb index 3b38520..b80370f 100644 --- a/ch3/jupyter-notebooks/components.ipynb +++ b/ch3/jupyter-notebooks/components.ipynb @@ -6,6 +6,8 @@ "source": [ "# Building blocks in Haystack: components and pipelines\n", "\n", + "In the [previous notebook](data_classes.ipynb), we learned how to store structured and unstructured data in Document objects, as well as in dataframe, ByteStream, ChatMessage and StreamingChunk objects. We also learned how to store these objects in a Document Store. In this notebook, we will explore how to store and retrieve data from a Haystack Document Store. Let's start with a look at Haystack's architecture.\n", + "\n", "Haystack's architecture leverages components as its core elements, each performing specific functions like text processing or summarization. These components are designed to be connected into pipelines, which orchestrate the flow of data and manage task execution in a structured manner. The Pipeline class facilitates this by allowing the addition and connection of components, which must have unique input and output points for data transfer.\n", "\n", "Pipelines are the backbone of NLP applications in Haystack, functioning as directed graphs where nodes are components and edges dictate data flow. 
They ensure smooth data processing, handle errors, and support debugging through visualization tools that help developers trace and optimize the data journey.\n", diff --git a/ch3/jupyter-notebooks/data_classes.ipynb b/ch3/jupyter-notebooks/data_classes.ipynb index c366e3a..13f91d8 100644 --- a/ch3/jupyter-notebooks/data_classes.ipynb +++ b/ch3/jupyter-notebooks/data_classes.ipynb @@ -34,111 +34,11 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 46, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Help on class Document in module haystack.preview.dataclasses.document:\n", - "\n", - "class Document(builtins.object)\n", - " | Document(*args, **kwargs)\n", - " | \n", - " | Base data class containing some data to be queried.\n", - " | Can contain text snippets, tables, and file paths to images or audios.\n", - " | Documents can be sorted by score and saved to/from dictionary and JSON.\n", - " | \n", - " | :param id: Unique identifier for the document. When not set, it's generated based on the Document fields' values.\n", - " | :param content: Text of the document, if the document contains text.\n", - " | :param dataframe: Pandas dataframe with the document's content, if the document contains tabular data.\n", - " | :param blob: Binary data associated with the document, if the document has any binary data associated with it.\n", - " | :param meta: Additional custom metadata for the document. Must be JSON-serializable.\n", - " | :param score: Score of the document. 
Used for ranking, usually assigned by retrievers.\n", - " | :param embedding: Vector representation of the document.\n", - " | \n", - " | Methods defined here:\n", - " | \n", - " | __eq__(self, other)\n", - " | Compares Documents for equality.\n", - " | Two Documents are considered equals if their dictionary representation is identical.\n", - " | \n", - " | __init__(self, id: str = '', content: Optional[str] = None, dataframe: Optional[pandas.core.frame.DataFrame] = None, blob: Optional[haystack.preview.dataclasses.byte_stream.ByteStream] = None, meta: Dict[str, Any] = , score: Optional[float] = None, embedding: Optional[List[float]] = None) -> None\n", - " | Initialize self. See help(type(self)) for accurate signature.\n", - " | \n", - " | __post_init__(self)\n", - " | Generate the ID based on the init parameters.\n", - " | \n", - " | __repr__(self)\n", - " | Return repr(self).\n", - " | \n", - " | __str__(self)\n", - " | Return str(self).\n", - " | \n", - " | to_dict(self, flatten=True) -> Dict[str, Any]\n", - " | Converts Document into a dictionary.\n", - " | `dataframe` and `blob` fields are converted to JSON-serializable types.\n", - " | \n", - " | :param flatten: Whether to flatten `meta` field or not. 
Defaults to `True` to be backward-compatible with Haystack 1.x.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Class methods defined here:\n", - " | \n", - " | from_dict(data: Dict[str, Any]) -> 'Document' from haystack.preview.dataclasses.document._BackwardCompatible\n", - " | Creates a new Document object from a dictionary.\n", - " | `dataframe` and `blob` fields are converted to their original types.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Readonly properties defined here:\n", - " | \n", - " | content_type\n", - " | Returns the type of the content for the document.\n", - " | This is necessary to keep backward compatibility with 1.x.\n", - " | A ValueError will be raised if both `text` and `dataframe` fields are set\n", - " | or both are missing.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data descriptors defined here:\n", - " | \n", - " | __dict__\n", - " | dictionary for instance variables (if defined)\n", - " | \n", - " | __weakref__\n", - " | list of weak references to the object (if defined)\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data and other attributes defined here:\n", - " | \n", - " | __annotations__ = {'blob': typing.Optional[haystack.preview.dataclasse...\n", - " | \n", - " | __dataclass_fields__ = {'blob': Field(name='blob',type=typing.Optional...\n", - " | \n", - " | __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,or...\n", - " | \n", - " | __hash__ = None\n", - " | \n", - " | __match_args__ = ('id', 'content', 'dataframe', 'blob', 'meta', 'score...\n", - " | \n", - " | blob = None\n", - " | \n", - " | content = None\n", - " | \n", - " | dataframe = None\n", - " | \n", - " | embedding = None\n", - " | \n", - " | id = ''\n", - " | \n", - " | score = None\n", - "\n" - ] - } - ], + "outputs": [], 
"source": [ - "from haystack.preview.dataclasses import Document\n", - "\n", - "help(Document)" + "from haystack.preview.dataclasses import Document" ] }, { @@ -387,11 +287,11 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 47, "metadata": {}, "outputs": [], "source": [ - "from haystack.preview.dataclasses import ByteStream\n" + "from haystack.preview.dataclasses import ByteStream" ] }, { @@ -473,135 +373,98 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 48, + "metadata": {}, + "outputs": [], + "source": [ + "from haystack.preview.dataclasses import ChatMessage" + ] + }, + { + "cell_type": "code", + "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Help on class ChatMessage in module haystack.preview.dataclasses.chat_message:\n", - "\n", - "class ChatMessage(builtins.object)\n", - " | ChatMessage(content: str, role: haystack.preview.dataclasses.chat_message.ChatRole, name: Optional[str], metadata: Dict[str, Any] = ) -> None\n", - " | \n", - " | Represents a message in a LLM chat conversation.\n", - " | \n", - " | :param content: The text content of the message.\n", - " | :param role: The role of the entity sending the message.\n", - " | :param name: The name of the function being called (only applicable for role FUNCTION).\n", - " | :param metadata: Additional metadata associated with the message.\n", - " | \n", - " | Methods defined here:\n", - " | \n", - " | __eq__(self, other)\n", - " | Return self==value.\n", - " | \n", - " | __init__(self, content: str, role: haystack.preview.dataclasses.chat_message.ChatRole, name: Optional[str], metadata: Dict[str, Any] = ) -> None\n", - " | Initialize self. 
See help(type(self)) for accurate signature.\n", - " | \n", - " | __repr__(self)\n", - " | Return repr(self).\n", - " | \n", - " | is_from(self, role: haystack.preview.dataclasses.chat_message.ChatRole) -> bool\n", - " | Check if the message is from a specific role.\n", - " | \n", - " | :param role: The role to check against.\n", - " | :return: True if the message is from the specified role, False otherwise.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Class methods defined here:\n", - " | \n", - " | from_assistant(content: str, metadata: Optional[Dict[str, Any]] = None) -> 'ChatMessage' from builtins.type\n", - " | Create a message from the assistant.\n", - " | \n", - " | :param content: The text content of the message.\n", - " | :param metadata: Additional metadata associated with the message.\n", - " | :return: A new ChatMessage instance.\n", - " | \n", - " | from_function(content: str, name: str) -> 'ChatMessage' from builtins.type\n", - " | Create a message from a function call.\n", - " | \n", - " | :param content: The text content of the message.\n", - " | :param name: The name of the function being called.\n", - " | :return: A new ChatMessage instance.\n", - " | \n", - " | from_system(content: str) -> 'ChatMessage' from builtins.type\n", - " | Create a message from the system.\n", - " | \n", - " | :param content: The text content of the message.\n", - " | :return: A new ChatMessage instance.\n", - " | \n", - " | from_user(content: str) -> 'ChatMessage' from builtins.type\n", - " | Create a message from the user.\n", - " | \n", - " | :param content: The text content of the message.\n", - " | :return: A new ChatMessage instance.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data descriptors defined here:\n", - " | \n", - " | __dict__\n", - " | dictionary for instance variables (if defined)\n", - " | \n", - " | __weakref__\n", - " | list of weak 
references to the object (if defined)\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data and other attributes defined here:\n", - " | \n", - " | __annotations__ = {'content': , 'metadata': typing.Dict[s...\n", - " | \n", - " | __dataclass_fields__ = {'content': Field(name='content',type=, name=None, metadata={})\n" ] } ], "source": [ - "from haystack.preview.dataclasses import ChatMessage\n", + "# Create a message from the assistant\n", + "assistant_msg = ChatMessage.from_assistant(content=\"Hello, how can I assist you today?\")\n", "\n", - "help(ChatMessage)" + "print(assistant_msg)\n" ] }, { "cell_type": "code", - "execution_count": 15, + "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "ChatMessage(content='Hello, how can I assist you today?', role=, name=None, metadata={})\n", - "ChatMessage(content='Can you show me the weather forecast?', role=, name=None, metadata={})\n", - "ChatMessage(content='A new user has joined the chat.', role=, name=None, metadata={})\n", - "ChatMessage(content='Retrieving weather data...', role=, name='fetch_weather', metadata={})\n" + "ChatMessage(content='Can you show me the weather forecast?', role=, name=None, metadata={})\n" ] } ], "source": [ - "# Create a message from the assistant\n", - "assistant_msg = ChatMessage.from_assistant(content=\"Hello, how can I assist you today?\")\n", - "\n", "# Create a message from the user\n", "user_msg = ChatMessage.from_user(content=\"Can you show me the weather forecast?\")\n", "\n", + "print(user_msg)" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ChatMessage(content='A new user has joined the chat.', role=, name=None, metadata={})\n" + ] + } + ], + "source": [ "# Create a system message, for instance, to indicate that a user has joined the chat\n", "system_msg = 
ChatMessage.from_system(content=\"A new user has joined the chat.\")\n", "\n", + "print(system_msg)" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ChatMessage(content='Retrieving weather data...', role=, name='fetch_weather', metadata={})\n" + ] + } + ], + "source": [ "# Create a function message, for example, to execute a command to retrieve weather data\n", "function_msg = ChatMessage.from_function(content=\"Retrieving weather data...\", name=\"fetch_weather\")\n", "\n", - "print(assistant_msg)\n", - "print(user_msg)\n", - "print(system_msg)\n", - "print(function_msg)\n" + "print(function_msg)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's populate a Document object with a ChatMessage object." ] }, { @@ -664,67 +527,6 @@ "from haystack.preview.dataclasses import StreamingChunk" ] }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Help on class StreamingChunk in module haystack.preview.dataclasses.streaming_chunk:\n", - "\n", - "class StreamingChunk(builtins.object)\n", - " | StreamingChunk(content: str, metadata: Dict[str, Any] = ) -> None\n", - " | \n", - " | The StreamingChunk class encapsulates a segment of streamed content along with\n", - " | associated metadata. This structure facilitates the handling and processing of\n", - " | streamed data in a systematic manner.\n", - " | \n", - " | :param content: The content of the message chunk as a string.\n", - " | :param metadata: A dictionary containing metadata related to the message chunk.\n", - " | \n", - " | Methods defined here:\n", - " | \n", - " | __eq__(self, other)\n", - " | Return self==value.\n", - " | \n", - " | __init__(self, content: str, metadata: Dict[str, Any] = ) -> None\n", - " | Initialize self. 
See help(type(self)) for accurate signature.\n", - " | \n", - " | __repr__(self)\n", - " | Return repr(self).\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data descriptors defined here:\n", - " | \n", - " | __dict__\n", - " | dictionary for instance variables (if defined)\n", - " | \n", - " | __weakref__\n", - " | list of weak references to the object (if defined)\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data and other attributes defined here:\n", - " | \n", - " | __annotations__ = {'content': , 'metadata': typing.Dict[s...\n", - " | \n", - " | __dataclass_fields__ = {'content': Field(name='content',type= List[haystack.preview.dataclasses.document.Document]\n", - " | Retrieves documents that are most relevant to the query using BM25 algorithm.\n", - " | \n", - " | :param query: The query string.\n", - " | :param filters: A dictionary with filters to narrow down the search space.\n", - " | :param top_k: The number of top documents to retrieve. Default is 10.\n", - " | :param scale_score: Whether to scale the scores of the retrieved documents. 
Default is False.\n", - " | :return: A list of the top_k documents most relevant to the query.\n", - " | \n", - " | count_documents(self) -> int\n", - " | Returns the number of how many documents are present in the DocumentStore.\n", - " | \n", - " | delete_documents(self, document_ids: List[str]) -> None\n", - " | Deletes all documents with matching document_ids from the DocumentStore.\n", - " | Fails with `MissingDocumentError` if no document with this id is present in the DocumentStore.\n", - " | \n", - " | :param object_ids: The object_ids to delete.\n", - " | \n", - " | embedding_retrieval(self, query_embedding: List[float], filters: Optional[Dict[str, Any]] = None, top_k: int = 10, scale_score: bool = False, return_embedding: bool = False) -> List[haystack.preview.dataclasses.document.Document]\n", - " | Retrieves documents that are most similar to the query embedding using a vector similarity metric.\n", - " | \n", - " | :param query_embedding: Embedding of the query.\n", - " | :param filters: A dictionary with filters to narrow down the search space.\n", - " | :param top_k: The number of top documents to retrieve. Default is 10.\n", - " | :param scale_score: Whether to scale the scores of the retrieved Documents. Default is False.\n", - " | :param return_embedding: Whether to return the embedding of the retrieved Documents. Default is False.\n", - " | :return: A list of the top_k documents most relevant to the query.\n", - " | \n", - " | filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[haystack.preview.dataclasses.document.Document]\n", - " | Returns the documents that match the filters provided.\n", - " | \n", - " | Filters are defined as nested dictionaries. 
The keys of the dictionaries can be a logical operator (`\"$and\"`,\n", - " | `\"$or\"`, `\"$not\"`), a comparison operator (`\"$eq\"`, `$ne`, `\"$in\"`, `$nin`, `\"$gt\"`, `\"$gte\"`, `\"$lt\"`,\n", - " | `\"$lte\"`) or a metadata field name.\n", - " | \n", - " | Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata\n", - " | field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or\n", - " | (in case of `\"$in\"`) a list of values as value. If no logical operator is provided, `\"$and\"` is used as default\n", - " | operation. If no comparison operator is provided, `\"$eq\"` (or `\"$in\"` if the comparison value is a list) is used\n", - " | as default operation.\n", - " | \n", - " | Example:\n", - " | \n", - " | ```python\n", - " | filters = {\n", - " | \"$and\": {\n", - " | \"type\": {\"$eq\": \"article\"},\n", - " | \"date\": {\"$gte\": \"2015-01-01\", \"$lt\": \"2021-01-01\"},\n", - " | \"rating\": {\"$gte\": 3},\n", - " | \"$or\": {\n", - " | \"genre\": {\"$in\": [\"economy\", \"politics\"]},\n", - " | \"publisher\": {\"$eq\": \"nytimes\"}\n", - " | }\n", - " | }\n", - " | }\n", - " | # or simpler using default operators\n", - " | filters = {\n", - " | \"type\": \"article\",\n", - " | \"date\": {\"$gte\": \"2015-01-01\", \"$lt\": \"2021-01-01\"},\n", - " | \"rating\": {\"$gte\": 3},\n", - " | \"$or\": {\n", - " | \"genre\": [\"economy\", \"politics\"],\n", - " | \"publisher\": \"nytimes\"\n", - " | }\n", - " | }\n", - " | ```\n", - " | \n", - " | To use the same logical operator multiple times on the same level, logical operators can take a list of\n", - " | dictionaries as value.\n", - " | \n", - " | Example:\n", - " | \n", - " | ```python\n", - " | filters = {\n", - " | \"$or\": [\n", - " | {\n", - " | \"$and\": {\n", - " | \"Type\": \"News Paper\",\n", - " | \"Date\": {\n", - " | \"$lt\": \"2019-01-01\"\n", - " | }\n", - " | }\n", - " | },\n", - " | 
{\n", - " | \"$and\": {\n", - " | \"Type\": \"Blog Post\",\n", - " | \"Date\": {\n", - " | \"$gte\": \"2019-01-01\"\n", - " | }\n", - " | }\n", - " | }\n", - " | ]\n", - " | }\n", - " | ```\n", - " | \n", - " | :param filters: The filters to apply to the document list.\n", - " | :return: A list of Documents that match the given filters.\n", - " | \n", - " | to_dict(self) -> Dict[str, Any]\n", - " | Serializes this store to a dictionary.\n", - " | \n", - " | write_documents(self, documents: List[haystack.preview.dataclasses.document.Document], policy: haystack.preview.document_stores.protocols.DuplicatePolicy = ) -> None\n", - " | Writes (or overwrites) documents into the DocumentStore.\n", - " | \n", - " | :param documents: A list of documents.\n", - " | :param policy: Documents with the same ID count as duplicates. When duplicates are met,\n", - " | the DocumentStore can:\n", - " | - skip: keep the existing document and ignore the new one.\n", - " | - overwrite: remove the old document and write the new one.\n", - " | - fail: an error is raised.\n", - " | :raises DuplicateError: Exception trigger on duplicate document if `policy=DuplicatePolicy.FAIL`\n", - " | :return: None\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Class methods defined here:\n", - " | \n", - " | from_dict(data: Dict[str, Any]) -> 'InMemoryDocumentStore' from builtins.type\n", - " | Deserializes the store from a dictionary.\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data descriptors defined here:\n", - " | \n", - " | __dict__\n", - " | dictionary for instance variables (if defined)\n", - " | \n", - " | __weakref__\n", - " | list of weak references to the object (if defined)\n", - " | \n", - " | ----------------------------------------------------------------------\n", - " | Data and other attributes defined here:\n", - " | \n", - " | __haystack_document_store__ = True\n", - "\n" - 
] - } - ], - "source": [ - "help(InMemoryDocumentStore)" - ] - }, { "cell_type": "code", "execution_count": 34,