feat: add file and indexing related super components #184
base: main
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
--- | ||
type: intro | ||
date: 1.1.2023 | ||
--- | ||
```bash | ||
pip install farm-haystack | ||
``` | ||
## What to build with Haystack | ||
|
||
- **Ask questions in natural language** and find granular answers in your own documents. | ||
- Perform **semantic search** and retrieve documents according to meaning not keywords | ||
- Use **off-the-shelf models** or **fine-tune** them to your own domain. | ||
- Use **user feedback** to evaluate, benchmark and continuously improve your live models. | ||
- Leverage existing **knowledge bases** and better handle the long tail of queries that **chatbots** receive. | ||
- **Automate processes** by automatically applying a list of questions to new documents and using the extracted answers. | ||
|
||
 | ||
|
||
|
||
## Core Features | ||
|
||
- **Latest models**: Utilize all latest transformer based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval. | ||
- **Modular**: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework. | ||
- **Open**: 100% compatible with HuggingFace's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers) | ||
- **Scalable**: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a fastAPI REST API | ||
- **End-to-End**: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ... | ||
- **Developer friendly**: Easy to debug, extend and modify. | ||
- **Customizable**: Fine-tune models to your own domain or implement your custom DocumentStore. | ||
- **Continuous Learning**: Collect new training data via user feedback in production & improve your models continuously | ||
|
||
| | | | ||
|-|-| | ||
| :ledger: [Docs](https://haystack.deepset.ai/overview/intro) | Usage, Guides, API documentation ...| | ||
| :beginner: [Quick Demo](https://github.com/deepset-ai/haystack/#quick-demo) | Quickly see what Haystack offers | | ||
| :floppy_disk: [Installation](https://github.com/deepset-ai/haystack/#installation) | How to install Haystack | | ||
| :art: [Key Components](https://github.com/deepset-ai/haystack/#key-components) | Overview of core concepts | | ||
| :mortar_board: [Tutorials](https://github.com/deepset-ai/haystack/#tutorials) | Jupyter/Colab Notebooks & Scripts | | ||
| :eyes: [How to use Haystack](https://github.com/deepset-ai/haystack/#how-to-use-haystack) | Basic explanation of concepts, options and usage | | ||
| :heart: [Contributing](https://github.com/deepset-ai/haystack/#heart-contributing) | We welcome all contributions! | | ||
| :bar_chart: [Benchmarks](https://haystack.deepset.ai/benchmarks/v0.9.0) | Speed & Accuracy of Retriever, Readers and DocumentStores | | ||
| :telescope: [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack | | ||
| :pray: [Slack](https://haystack.deepset.ai/community/join) | Join our community on Slack | | ||
| :bird: [Twitter](https://twitter.com/deepset_ai) | Follow us on Twitter for news and updates | | ||
| :newspaper: [Blog](https://medium.com/deepset-ai) | Read our articles on Medium | | ||
|
||
|
||
## Quick Demo | ||
|
||
The quickest way to see what Haystack offers is to start a [Docker Compose](https://docs.docker.com/compose/) demo application: | ||
|
||
**1. Update/install Docker and Docker Compose, then launch Docker** | ||
|
||
``` | ||
# apt-get update && apt-get install docker && apt-get install docker-compose | ||
# service docker start | ||
``` | ||
|
||
**2. Clone Haystack repository** | ||
|
||
``` | ||
# git clone https://github.com/deepset-ai/haystack.git | ||
``` | ||
|
||
### 2nd level headline for testing purposes | ||
#### 3rd level headline for testing purposes |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]> | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]> | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter | ||
|
||
__all__ = ["MultiFileConverter"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,169 @@ | ||
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]> | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
from enum import Enum | ||
from typing import Any, Dict | ||
|
||
from haystack import Pipeline, component, default_from_dict, default_to_dict | ||
from haystack.components.converters import ( | ||
    CSVToDocument, | ||
    DOCXToDocument, | ||
    HTMLToDocument, | ||
    JSONConverter, | ||
    MarkdownToDocument, | ||
    PPTXToDocument, | ||
    PyPDFToDocument, | ||
    TextFileToDocument, | ||
    XLSXToDocument, | ||
) | ||
from haystack.components.joiners import DocumentJoiner | ||
from haystack.components.routers import FileTypeRouter | ||
|
||
from haystack_experimental.core.super_component import SuperComponent | ||
|
||
|
||
class ConverterMimeType(str, Enum): | ||
    CSV = "text/csv" | ||
    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document" | ||
    HTML = "text/html" | ||
    JSON = "application/json" | ||
    MD = "text/markdown" | ||
    TEXT = "text/plain" | ||
    PDF = "application/pdf" | ||
    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation" | ||
    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" | ||
Review comment: Is this going to serialize and deserialize ok? We usually use something like:
Reply: We are only using this internally, so no need to serialize.
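On the serialization question in this thread: a minimal sketch, using only the standard library, of how a `str`-mixin enum like the one above behaves if it does pass through JSON. Only two members are copied from the diff, and the `mime_type` key is made up for illustration. This says nothing about how Haystack's own serialization helpers treat enums.

```python
import json
from enum import Enum


class ConverterMimeType(str, Enum):
    # Two members copied from the diff; the rest are omitted for brevity.
    CSV = "text/csv"
    PDF = "application/pdf"


# Because the enum also subclasses str, json.dumps emits the plain value.
encoded = json.dumps({"mime_type": ConverterMimeType.PDF})

# json.loads returns a plain str; calling the enum maps it back to the member.
decoded = ConverterMimeType(json.loads(encoded)["mime_type"])
```

Here `encoded` holds the member's value as an ordinary JSON string and `decoded` is the `ConverterMimeType.PDF` member again, so the mixin would likely round-trip if the enum ever became part of the public init parameters.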
||
|
||
|
||
@component | ||
class MultiFileConverter(SuperComponent): | ||
    """ | ||
    A file converter that handles conversion of multiple file types. | ||
|
||
    The MultiFileConverter handles the following file types: | ||
    - CSV | ||
    - DOCX | ||
    - HTML | ||
    - JSON | ||
    - MD | ||
    - TEXT | ||
    - PDF (no OCR) | ||
    - PPTX | ||
    - XLSX | ||
|
||
    Usage: | ||
    ``` | ||
    converter = MultiFileConverter() | ||
    converter.run(sources=["test.txt", "test.pdf"], meta={}) | ||
    ``` | ||
    """ | ||
|
||
    def __init__(  # noqa: PLR0915 | ||
        self, | ||
        encoding: str = "utf-8", | ||
        json_content_key: str = "content", | ||
Review comment: Unclear what this is. Add an `__init__` docstring?
||
    ) -> None: | ||
        self.encoding = encoding | ||
        self.json_content_key = json_content_key | ||
|
||
        # initialize components | ||
        router = FileTypeRouter( | ||
            mime_types=[ | ||
                ConverterMimeType.CSV.value, | ||
                ConverterMimeType.DOCX.value, | ||
                ConverterMimeType.HTML.value, | ||
                ConverterMimeType.JSON.value, | ||
                ConverterMimeType.MD.value, | ||
                ConverterMimeType.TEXT.value, | ||
                ConverterMimeType.PDF.value, | ||
                ConverterMimeType.PPTX.value, | ||
                ConverterMimeType.XLSX.value, | ||
            ], | ||
            # Ensure common extensions are registered. Tests on Windows fail otherwise. | ||
            additional_mimetypes = { | ||
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx", | ||
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx", | ||
                "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx" | ||
            } | ||
        ) | ||
|
||
        csv = CSVToDocument(encoding=self.encoding) | ||
        docx = DOCXToDocument() | ||
        html = HTMLToDocument() | ||
        json = JSONConverter(content_key=self.json_content_key) | ||
        md = MarkdownToDocument() | ||
        txt = TextFileToDocument(encoding=self.encoding) | ||
        pdf = PyPDFToDocument() | ||
        pptx = PPTXToDocument() | ||
        xlsx = XLSXToDocument() | ||
|
||
        joiner = DocumentJoiner() | ||
|
||
|
||
|
||
        # Create pipeline and add components | ||
        pp = Pipeline() | ||
|
||
        pp.add_component("router", router) | ||
|
||
        pp.add_component("docx", docx) | ||
        pp.add_component("html", html) | ||
        pp.add_component("json", json) | ||
        pp.add_component("md", md) | ||
        pp.add_component("txt", txt) | ||
        pp.add_component("pdf", pdf) | ||
        pp.add_component("pptx", pptx) | ||
        pp.add_component("xlsx", xlsx) | ||
        pp.add_component("joiner", joiner) | ||
        pp.add_component("csv", csv) | ||
|
||
        pp.connect(f"router.{ConverterMimeType.CSV.value}", "csv") | ||
        pp.connect(f"router.{ConverterMimeType.DOCX.value}", "docx") | ||
        pp.connect(f"router.{ConverterMimeType.HTML.value}", "html") | ||
        pp.connect(f"router.{ConverterMimeType.JSON.value}", "json") | ||
        pp.connect(f"router.{ConverterMimeType.MD.value}", "md") | ||
        pp.connect(f"router.{ConverterMimeType.TEXT.value}", "txt") | ||
        pp.connect(f"router.{ConverterMimeType.PDF.value}", "pdf") | ||
        pp.connect(f"router.{ConverterMimeType.PPTX.value}", "pptx") | ||
        pp.connect(f"router.{ConverterMimeType.XLSX.value}", "xlsx") | ||
|
||
        pp.connect("docx.documents", "joiner.documents") | ||
        pp.connect("html.documents", "joiner.documents") | ||
        pp.connect("json.documents", "joiner.documents") | ||
        pp.connect("md.documents", "joiner.documents") | ||
        pp.connect("txt.documents", "joiner.documents") | ||
        pp.connect("pdf.documents", "joiner.documents") | ||
        pp.connect("pptx.documents", "joiner.documents") | ||
|
||
        pp.connect("csv.documents", "joiner.documents") | ||
        pp.connect("xlsx.documents", "joiner.documents") | ||
|
||
|
||
        output_mapping = {"joiner.documents": "documents"} | ||
        input_mapping = { | ||
            "sources": ["router.sources"], | ||
            "meta": ["router.meta"] | ||
        } | ||
|
||
        super(MultiFileConverter, self).__init__( | ||
            pipeline=pp, | ||
            output_mapping=output_mapping, | ||
            input_mapping=input_mapping | ||
        ) | ||
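The wiring in `__init__` follows a fan-out/fan-in pattern: the router sends each source to the converter registered for its MIME type, and the joiner merges the resulting documents. The sketch below imitates only the routing step with Python's `mimetypes` module, including the kind of extra registration the diff adds for Office formats. It is not `FileTypeRouter`'s implementation; `register_extra_mime_types`, `route_by_mime_type`, the `unclassified` bucket, and the file names are all hypothetical.

```python
import mimetypes
from collections import defaultdict
from typing import Dict, Iterable, List, Set

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PDF_MIME = "application/pdf"

# Same idea as additional_mimetypes in the diff: MIME type -> extension.
EXTRA_MIME_TYPES: Dict[str, str] = {DOCX_MIME: ".docx"}


def register_extra_mime_types(extra: Dict[str, str]) -> None:
    """Make sure the interpreter's MIME table knows these extensions."""
    for mime_type, extension in extra.items():
        mimetypes.add_type(mime_type, extension)


def route_by_mime_type(sources: Iterable[str], supported: Set[str]) -> Dict[str, List[str]]:
    """Group file names by guessed MIME type; unknown types share one bucket."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for source in sources:
        mime_type, _encoding = mimetypes.guess_type(source)
        key = mime_type if mime_type in supported else "unclassified"
        routed[key].append(source)
    return dict(routed)


register_extra_mime_types(EXTRA_MIME_TYPES)
routed = route_by_mime_type(
    ["report.docx", "paper.pdf", "notes.unknownext"],
    supported={DOCX_MIME, PDF_MIME},
)
```

Registering the extension up front means the lookup no longer depends on what the host platform's MIME table happens to contain, which appears to be the motivation behind the Windows note in the diff.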
|
||
    def to_dict(self) -> Dict[str, Any]: | ||
        """ | ||
        Serialize this instance to a dictionary. | ||
        """ | ||
        return default_to_dict( | ||
            self, | ||
            encoding=self.encoding, | ||
            json_content_key=self.json_content_key, | ||
        ) | ||
|
||
    @classmethod | ||
    def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter": | ||
        """ | ||
        Load this instance from a dictionary. | ||
        """ | ||
        return default_from_dict(cls, data) |
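`to_dict` passes only `encoding` and `json_content_key`, so the internal pipeline is not stored and has to be rebuilt when the constructor runs again on load. A standard-library sketch of that round trip follows. `ConverterConfig` is a hypothetical stand-in for the real class, and the `type` / `init_parameters` dictionary layout is an assumption made for illustration, not a statement of what `default_to_dict` returns.

```python
from typing import Any, Dict


class ConverterConfig:
    """Hypothetical stand-in holding the same two init parameters."""

    def __init__(self, encoding: str = "utf-8", json_content_key: str = "content") -> None:
        self.encoding = encoding
        self.json_content_key = json_content_key

    def to_dict(self) -> Dict[str, Any]:
        # Store only the constructor arguments, keyed by a type identifier.
        return {
            "type": f"{type(self).__module__}.{type(self).__qualname__}",
            "init_parameters": {
                "encoding": self.encoding,
                "json_content_key": self.json_content_key,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ConverterConfig":
        # Re-running __init__ is what would rebuild any derived state,
        # such as the internal pipeline in the real component.
        return cls(**data.get("init_parameters", {}))


original = ConverterConfig(encoding="latin-1", json_content_key="text")
restored = ConverterConfig.from_dict(original.to_dict())
```

Keeping the serialized form down to init parameters means the wiring lives in one place, the constructor, at the cost of rebuilding it on every load.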
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# SPDX-FileCopyrightText: 2022-present deepset GmbH <[email protected]> | ||
# | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
from haystack_experimental.super_components.indexers.document_indexer import DocumentIndexer | ||
|
||
__all__ = [ | ||
    "DocumentIndexer", | ||
] |
Review comment: I know this is just an example document, but I can also see how it could create problems for us and confusion on Discord 🤣
Let's switch `farm-haystack` to `haystack-ai`?