feat: add file and indexing related super components #184

Open
wants to merge 35 commits into
base: main
6af8913
add super component utils
mathislucka Jan 31, 2025
fc20b49
add super component base
mathislucka Jan 31, 2025
ca66f49
add super component
mathislucka Jan 31, 2025
251b9ae
add init
mathislucka Jan 31, 2025
948574f
add SuperComponent tests
mathislucka Jan 31, 2025
7b3009f
add type utils tests
mathislucka Jan 31, 2025
5aebc78
format
mathislucka Jan 31, 2025
144cde8
add init
mathislucka Jan 31, 2025
d08a37a
add file converter
mathislucka Jan 31, 2025
0ad8b96
add example notebook
mathislucka Jan 31, 2025
402e91b
feat: implement `DocumentIndexer` super component
abrahamy Jan 27, 2025
647c4a2
chore: add license headers
abrahamy Jan 27, 2025
0da3c59
fix: raise error on import of `AsyncPipeline` (#178)
Amnah199 Jan 27, 2025
78e1fcd
test: add unit tests for `DocumentIndexer`
abrahamy Jan 27, 2025
56412be
update indexer to take model names instead of component
abrahamy Jan 30, 2025
eca5d67
more tests
abrahamy Jan 30, 2025
301b523
fix
abrahamy Jan 30, 2025
4201998
remove unused imports
abrahamy Jan 30, 2025
159af04
refactor: use single base class
mathislucka Feb 6, 2025
6c2226a
Merge branch 'main' into feat/supercomponent_base
mathislucka Feb 6, 2025
5ce83ca
fix: format
mathislucka Feb 6, 2025
a6fca36
fix: SuperComponent is a component
mathislucka Feb 6, 2025
3f33ddd
Merge branch 'feat/supercomponent_base' into feat/ready_made_supercom…
mathislucka Feb 6, 2025
eac0142
test: MultiFileConverter
mathislucka Feb 6, 2025
4f71375
Merge branch 'refs/heads/main' into feat/ready_made_supercomponents
mathislucka Feb 7, 2025
8c13dc8
feat: add preprocessor
mathislucka Feb 11, 2025
d92fe9e
feat: update document indexer
mathislucka Feb 11, 2025
0aee13d
feat: update notebook
mathislucka Feb 11, 2025
ba932fc
lint and format
mathislucka Feb 11, 2025
cbbb445
license headers
mathislucka Feb 11, 2025
2d10891
Merge branch 'main' into feat/ready_made_supercomponents
mathislucka Feb 11, 2025
e6149d0
add test dependencies
mathislucka Feb 11, 2025
4ec386a
Merge remote-tracking branch 'origin/feat/ready_made_supercomponents'…
mathislucka Feb 11, 2025
276bbfb
ensure docx, pptx, xlsx MIME types are registered
julian-risch Feb 18, 2025
84aa2a7
Merge branch 'main' into feat/ready_made_supercomponents
julian-risch Feb 18, 2025
65 changes: 65 additions & 0 deletions examples/sample_files/sample.md
@@ -0,0 +1,65 @@
---
type: intro
date: 1.1.2023
---
```bash
pip install farm-haystack
Member

I know this is just an example document, but I can also see how it could create problems for us and confusion on Discord 🤣

Let's switch it to haystack-ai?

```
## What to build with Haystack

- **Ask questions in natural language** and find granular answers in your own documents.
- Perform **semantic search** and retrieve documents according to meaning not keywords
- Use **off-the-shelf models** or **fine-tune** them to your own domain.
- Use **user feedback** to evaluate, benchmark and continuously improve your live models.
- Leverage existing **knowledge bases** and better handle the long tail of queries that **chatbots** receive.
- **Automate processes** by automatically applying a list of questions to new documents and using the extracted answers.

![Logo](https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/logo.png)


## Core Features

- **Latest models**: Utilize all latest transformer based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
- **Modular**: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
- **Open**: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers)
- **Scalable**: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a fastAPI REST API
- **End-to-End**: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
- **Developer friendly**: Easy to debug, extend and modify.
- **Customizable**: Fine-tune models to your own domain or implement your custom DocumentStore.
- **Continuous Learning**: Collect new training data via user feedback in production & improve your models continuously

| | |
|-|-|
| :ledger: [Docs](https://haystack.deepset.ai/overview/intro) | Usage, Guides, API documentation ...|
| :beginner: [Quick Demo](https://github.com/deepset-ai/haystack/#quick-demo) | Quickly see what Haystack offers |
| :floppy_disk: [Installation](https://github.com/deepset-ai/haystack/#installation) | How to install Haystack |
| :art: [Key Components](https://github.com/deepset-ai/haystack/#key-components) | Overview of core concepts |
| :mortar_board: [Tutorials](https://github.com/deepset-ai/haystack/#tutorials) | Jupyter/Colab Notebooks & Scripts |
| :eyes: [How to use Haystack](https://github.com/deepset-ai/haystack/#how-to-use-haystack) | Basic explanation of concepts, options and usage |
| :heart: [Contributing](https://github.com/deepset-ai/haystack/#heart-contributing) | We welcome all contributions! |
| :bar_chart: [Benchmarks](https://haystack.deepset.ai/benchmarks/v0.9.0) | Speed & Accuracy of Retriever, Readers and DocumentStores |
| :telescope: [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack |
| :pray: [Slack](https://haystack.deepset.ai/community/join) | Join our community on Slack |
| :bird: [Twitter](https://twitter.com/deepset_ai) | Follow us on Twitter for news and updates |
| :newspaper: [Blog](https://medium.com/deepset-ai) | Read our articles on Medium |


## Quick Demo

The quickest way to see what Haystack offers is to start a [Docker Compose](https://docs.docker.com/compose/) demo application:

**1. Update/install Docker and Docker Compose, then launch Docker**

```
# apt-get update && apt-get install docker && apt-get install docker-compose
# service docker start
```

**2. Clone Haystack repository**

```
# git clone https://github.com/deepset-ai/haystack.git
```

### 2nd level headline for testing purposes
#### 3rd level headline for testing purposes
Binary file added examples/sample_files/sample_docx.docx
Binary file not shown.
Binary file added examples/sample_files/sample_pdf_1.pdf
Binary file not shown.
Binary file added examples/sample_files/sample_pptx.pptx
Binary file not shown.
497 changes: 497 additions & 0 deletions examples/super_components.ipynb

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions haystack_experimental/super_components/__init__.py
@@ -0,0 +1,3 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0
@@ -0,0 +1,7 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0

from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter

__all__ = ["MultiFileConverter"]
@@ -0,0 +1,169 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0

from enum import Enum
from typing import Any, Dict

from haystack import Pipeline, component, default_from_dict, default_to_dict
from haystack.components.converters import (
    CSVToDocument,
    DOCXToDocument,
    HTMLToDocument,
    JSONConverter,
    MarkdownToDocument,
    PPTXToDocument,
    PyPDFToDocument,
    TextFileToDocument,
    XLSXToDocument,
)
from haystack.components.joiners import DocumentJoiner
from haystack.components.routers import FileTypeRouter

from haystack_experimental.core.super_component import SuperComponent


class ConverterMimeType(str, Enum):
    CSV = "text/csv"
    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    HTML = "text/html"
    JSON = "application/json"
    MD = "text/markdown"
    TEXT = "text/plain"
    PDF = "application/pdf"
    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
Member

Is this going to serialize and deserialize OK? We usually use something like:

class DOCXTableFormat(Enum):
    """
    Supported formats for storing DOCX tabular data in a Document.
    """

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self):
        return self.value

    @staticmethod
    def from_str(string: str) -> "DOCXTableFormat":
        """
        Convert a string to a DOCXTableFormat enum.
        """
        enum_map = {e.value: e for e in DOCXTableFormat}
        table_format = enum_map.get(string.lower())
        if table_format is None:
            msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return table_format

Member Author

We are only using this internally, so no need to serialize.
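
For context on the question above, here is a minimal sketch of how an enum that mixes in `str` behaves with the standard `json` module. It only illustrates general Python behavior. It is not part of the PR, and the trimmed-down enum and the `mime_type` key are assumptions made for the example.

```python
import json
from enum import Enum


class ConverterMimeType(str, Enum):
    # Two members copied from the diff; the rest are omitted for brevity.
    CSV = "text/csv"
    PDF = "application/pdf"


# Because the enum mixes in `str`, the json module writes the member
# as its plain string value.
serialized = json.dumps({"mime_type": ConverterMimeType.CSV})

# Looking the value up again returns the same enum member.
restored = ConverterMimeType(json.loads(serialized)["mime_type"])

print(serialized)
print(restored is ConverterMimeType.CSV)
```

A plain `Enum` without the `str` mixin would need an explicit conversion such as the `from_str` helper quoted in the review comment.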



@component
class MultiFileConverter(SuperComponent):
    """
    A file converter that handles conversion of multiple file types.

    The MultiFileConverter handles the following file types:
    - CSV
    - DOCX
    - HTML
    - JSON
    - MD
    - TEXT
    - PDF (no OCR)
    - PPTX
    - XLSX

    Usage:
    ```
    converter = MultiFileConverter()
    converter.run(sources=["test.txt", "test.pdf"], meta={})
    ```
    """

    def __init__(  # noqa: PLR0915
        self,
        encoding: str = "utf-8",
        json_content_key: str = "content",
Member

Unclear what this is? Add init pydoc?

    ) -> None:
        """
        Create a MultiFileConverter.

        :param encoding: Encoding passed to the CSV and plain-text converters.
        :param json_content_key: Key passed to `JSONConverter` as `content_key`.
        """
        self.encoding = encoding
        self.json_content_key = json_content_key

        # initialize components
        router = FileTypeRouter(
            mime_types=[
                ConverterMimeType.CSV.value,
                ConverterMimeType.DOCX.value,
                ConverterMimeType.HTML.value,
                ConverterMimeType.JSON.value,
                ConverterMimeType.MD.value,
                ConverterMimeType.TEXT.value,
                ConverterMimeType.PDF.value,
                ConverterMimeType.PPTX.value,
                ConverterMimeType.XLSX.value,
            ],
            # Ensure common extensions are registered. Tests on Windows fail otherwise.
            additional_mimetypes={
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
                "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
            },
        )

        csv = CSVToDocument(encoding=self.encoding)
        docx = DOCXToDocument()
        html = HTMLToDocument()
        json = JSONConverter(content_key=self.json_content_key)
        md = MarkdownToDocument()
        txt = TextFileToDocument(encoding=self.encoding)
        pdf = PyPDFToDocument()
        pptx = PPTXToDocument()
        xlsx = XLSXToDocument()

        joiner = DocumentJoiner()

        # Create pipeline and add components
        pp = Pipeline()

        pp.add_component("router", router)

        pp.add_component("docx", docx)
        pp.add_component("html", html)
        pp.add_component("json", json)
        pp.add_component("md", md)
        pp.add_component("txt", txt)
        pp.add_component("pdf", pdf)
        pp.add_component("pptx", pptx)
        pp.add_component("xlsx", xlsx)
        pp.add_component("joiner", joiner)
        pp.add_component("csv", csv)

        # Route each MIME type to its converter
        pp.connect(f"router.{ConverterMimeType.CSV.value}", "csv")
        pp.connect(f"router.{ConverterMimeType.DOCX.value}", "docx")
        pp.connect(f"router.{ConverterMimeType.HTML.value}", "html")
        pp.connect(f"router.{ConverterMimeType.JSON.value}", "json")
        pp.connect(f"router.{ConverterMimeType.MD.value}", "md")
        pp.connect(f"router.{ConverterMimeType.TEXT.value}", "txt")
        pp.connect(f"router.{ConverterMimeType.PDF.value}", "pdf")
        pp.connect(f"router.{ConverterMimeType.PPTX.value}", "pptx")
        pp.connect(f"router.{ConverterMimeType.XLSX.value}", "xlsx")

        # Collect the documents from every converter in a single joiner
        pp.connect("docx.documents", "joiner.documents")
        pp.connect("html.documents", "joiner.documents")
        pp.connect("json.documents", "joiner.documents")
        pp.connect("md.documents", "joiner.documents")
        pp.connect("txt.documents", "joiner.documents")
        pp.connect("pdf.documents", "joiner.documents")
        pp.connect("pptx.documents", "joiner.documents")

        pp.connect("csv.documents", "joiner.documents")
        pp.connect("xlsx.documents", "joiner.documents")

        output_mapping = {"joiner.documents": "documents"}
        input_mapping = {
            "sources": ["router.sources"],
            "meta": ["router.meta"],
        }

        super(MultiFileConverter, self).__init__(
            pipeline=pp,
            output_mapping=output_mapping,
            input_mapping=input_mapping,
        )

    def to_dict(self) -> Dict[str, Any]:
        """
        Serialize this instance to a dictionary.
        """
        return default_to_dict(
            self,
            encoding=self.encoding,
            json_content_key=self.json_content_key,
        )

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter":
        """
        Load this instance from a dictionary.
        """
        return default_from_dict(cls, data)
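
The comment about registering common extensions can be illustrated with the standard library alone. The sketch below is an assumption about the underlying idea, not a description of how `FileTypeRouter` is implemented: it registers the three Office MIME types with `mimetypes` and then buckets paths by guessed type. The helper `group_by_mime_type` and the `"unclassified"` label are hypothetical names chosen for the example.

```python
import mimetypes
from collections import defaultdict
from typing import Dict, List

# Same MIME type -> extension pairs as the `additional_mimetypes` argument in the diff.
OFFICE_MIME_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
}

# Some platforms may not ship these mappings, so register them explicitly.
for mime_type, extension in OFFICE_MIME_TYPES.items():
    mimetypes.add_type(mime_type, extension)


def group_by_mime_type(sources: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket file paths by their guessed MIME type."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        guessed, _ = mimetypes.guess_type(source)
        # Paths with no known MIME type go into a catch-all bucket.
        buckets[guessed or "unclassified"].append(source)
    return dict(buckets)


groups = group_by_mime_type(["report.docx", "slides.pptx", "notes.unknownext"])
for mime_type, paths in groups.items():
    print(mime_type, paths)
```

With the types registered up front, the grouping no longer depends on what the operating system's MIME database happens to contain.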
9 changes: 9 additions & 0 deletions haystack_experimental/super_components/indexers/__init__.py
@@ -0,0 +1,9 @@
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]>
#
# SPDX-License-Identifier: Apache-2.0

from haystack_experimental.super_components.indexers.document_indexer import DocumentIndexer

__all__ = [
"DocumentIndexer",
]