finalize data classes notebook
lfunderburk committed Nov 16, 2023
1 parent 2561fc1 commit 6e2594f
Showing 2 changed files with 786 additions and 228 deletions.
209 changes: 0 additions & 209 deletions ch3/jupyter-notebooks/components.ipynb
@@ -23,215 +23,6 @@
"\n",
"![](./images/haystack-components.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The `DocumentStore` class and the `Document` class\n",
"\n",
"The `DocumentStore` class is an internal Haystack component that acts as a registry for classes marked as document stores. A document store is where documents are stored and retrieved, typically as part of a pipeline that handles data for search and retrieval tasks.\n",
"\n",
"`Document` is the data structure that represents a document in Haystack. `Document` objects are stored in a `DocumentStore` and serve as the input and output of Haystack's components.\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from haystack.preview.dataclasses import Document, ByteStream\n",
"from haystack.preview.document_stores.in_memory.document_store import InMemoryDocumentStore\n",
"import pandas as pd\n",
"\n",
"# Assuming 'binary_data' is your binary data, for example, read from a file:\n",
"binary_data = b'Your binary data here'  # This could be the actual binary content, such as PDF or image data\n",
"\n",
"# Convert binary data to ByteStream object\n",
"binary_blob = ByteStream(data=binary_data, mime_type='application/pdf')  # MIME type should match your data\n",
"\n",
"# Example metadata\n",
"metadata = {\n",
"    \"source\": \"Wikipedia\",\n",
"    \"author\": \"John Doe\",\n",
"    \"date\": \"2021-07-21\",\n",
"    \"custom_field\": \"custom_value\"\n",
"}\n",
"\n",
"# Pandas dataframe for tabular data\n",
"df = pd.DataFrame.from_dict({'first_name': ['John', 'Jane'], 'last_name': ['Doe', 'Doe'], 'age': [35, 38]})\n",
"\n",
"# Create documents\n",
"documents = [\n",
"    Document(content=\"The population of Germany is 100 million people.\", id=\"1\"),\n",
"    Document(content=\"About 65 million people live in France as of today.\", id=\"2\"),\n",
"    Document(dataframe=df, id=\"3\"),\n",
"    Document(blob=binary_blob, meta={\"file_name\": \"example.pdf\", \"file_type\": \"PDF\"}, id=\"4\"),\n",
"    Document(content=\"A sample text document with metadata.\", meta=metadata, id=\"5\")\n",
"]\n",
"\n",
"# Write documents to document store\n",
"docstore = InMemoryDocumentStore()\n",
"docstore.write_documents(documents=documents)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering document information"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find all documents - no filter"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(id='1', content='The population of Germany is 100 million people.', dataframe=None, blob=None, meta={}, score=None),\n",
" Document(id='2', content='About 65 million people live in France as of today.', dataframe=None, blob=None, meta={}, score=None),\n",
" Document(id='3', content=None, dataframe= first_name last_name age\n",
" 0 John Doe 35\n",
" 1 Jane Doe 38, blob=None, meta={}, score=None),\n",
" Document(id='4', content=None, dataframe=None, blob=ByteStream(data=b'Your binary data here', metadata={}, mime_type='application/pdf'), meta={'file_name': 'example.pdf', 'file_type': 'PDF'}, score=None),\n",
" Document(id='5', content='A sample text document with metadata.', dataframe=None, blob=None, meta={'source': 'Wikipedia', 'author': 'John Doe', 'date': '2021-07-21', 'custom_field': 'custom_value'}, score=None)]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"docstore.filter_documents()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find a document by exact match on its content\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(id='1', content='The population of Germany is 100 million people.', dataframe=None, blob=None, meta={}, score=None)]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filters_exact_match = {\n",
"    \"content\": {\"$eq\": \"The population of Germany is 100 million people.\"}\n",
"}\n",
"docstore.filter_documents(filters=filters_exact_match)\n"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 35\n",
"1 38\n",
"Name: age, dtype: int64"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Build a Document around the DataFrame and read its `age` column\n",
"df_doc = Document(dataframe=df, id='3')\n",
"\n",
"df_doc.dataframe.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find documents that have no text content, such as dataframe or blob documents"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Try to filter on a DataFrame column through a dotted path; in this run nothing matched\n",
"filters_df = {\n",
"    \"dataframe.age\": {\"$eq\": 35}\n",
"}\n",
"docstore.filter_documents(filters=filters_df)\n"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Document(id='3', content=None, dataframe= first_name last_name age\n",
" 0 John Doe 35\n",
" 1 Jane Doe 38, blob=None, meta={}, score=None),\n",
" Document(id='4', content=None, dataframe=None, blob=ByteStream(data=b'Your binary data here', metadata={}, mime_type='application/pdf'), meta={'file_name': 'example.pdf', 'file_type': 'PDF'}, score=None)]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filters_none = {\n",
"    \"content\": {\"$eq\": None}\n",
"}\n",
"docstore.filter_documents(filters=filters_none)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
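The removed cells pass filters shaped as `{field: {"$eq": value}}` to `filter_documents`. The snippet below is a minimal sketch of how a filter of that shape could be evaluated, written in plain Python with no Haystack dependency. `SimpleDoc`, `resolve`, `matches`, and `filter_docs` are hypothetical names introduced here for illustration; this is not Haystack's implementation, and only the `$eq` operator seen in the notebook is handled. In this sketch a dotted path that cannot be resolved never matches; that choice happens to give an empty result for `dataframe.age`, as in the notebook output, but it makes no claim about why Haystack returned an empty list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

_MISSING = object()  # sentinel: the dotted path could not be resolved


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document; NOT Haystack's Document class."""
    id: str
    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)


def resolve(doc: Any, path: str) -> Any:
    """Follow a dotted path through attributes and dict keys."""
    current = doc
    for part in path.split("."):
        if isinstance(current, dict):
            current = current.get(part, _MISSING)
        else:
            current = getattr(current, part, _MISSING)
        if current is _MISSING:
            return _MISSING
    return current


def matches(doc: SimpleDoc, filters: Dict[str, Dict[str, Any]]) -> bool:
    """Return True when every field condition in `filters` holds for `doc`."""
    for path, condition in filters.items():
        value = resolve(doc, path)
        if value is _MISSING:
            return False  # assumption of this sketch: unknown fields never match
        for operator, expected in condition.items():
            if operator != "$eq":
                raise ValueError(f"Unsupported operator in this sketch: {operator}")
            if value != expected:
                return False
    return True


def filter_docs(
    docs: List[SimpleDoc],
    filters: Optional[Dict[str, Dict[str, Any]]] = None,
) -> List[SimpleDoc]:
    """Return all documents when no filter is given, else only the matches."""
    if not filters:
        return list(docs)
    return [doc for doc in docs if matches(doc, filters)]


docs = [
    SimpleDoc(id="1", content="The population of Germany is 100 million people."),
    SimpleDoc(id="4", meta={"file_name": "example.pdf", "file_type": "PDF"}),
    SimpleDoc(id="5", content="A sample text document with metadata.",
              meta={"source": "Wikipedia"}),
]

# Documents without text content
print([d.id for d in filter_docs(docs, {"content": {"$eq": None}})])  # -> ['4']
```

Using a sentinel rather than `None` for unresolved paths keeps "field is absent" distinct from "field is present and set to `None`", which matters for a filter such as `{"content": {"$eq": None}}`.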
