deepset-ai · mathislucka · Jan 31, 2025
@@ -0,0 +1,65 @@
+---
+type: intro
+date: 1.1.2023
+---
+```bash
+pip install farm-haystack
+```
+## What to build with Haystack
+
+- **Ask questions in natural language** and find granular answers in your own documents.
+- Perform **semantic search** and retrieve documents according to meaning, not keywords.
+- Use **off-the-shelf models** or **fine-tune** them to your own domain.
+- Use **user feedback** to evaluate, benchmark and continuously improve your live models.
+- Leverage existing **knowledge bases** and better handle the long tail of queries that **chatbots** receive.
+- **Automate processes** by automatically applying a list of questions to new documents and using the extracted answers.
+
+![Logo](https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/logo.png)
+
+
+## Core Features
+
+-   **Latest models**: Use the latest transformer-based models (e.g. BERT, RoBERTa, MiniLM) for extractive QA, generative QA and document retrieval.
+-   **Modular**: Multiple choices to fit your tech stack and use case. Pick your favorite database, file converter or modeling framework.
+-   **Open**: 100% compatible with Hugging Face's model hub. Tight interfaces to other frameworks (e.g. Transformers, FARM, sentence-transformers).
+-   **Scalable**: Scale to millions of docs via retrievers, production-ready backends like Elasticsearch / FAISS and a FastAPI REST API.
+-   **End-to-End**: All tooling in one place: file conversion, cleaning, splitting, training, eval, inference, labeling ...
+-   **Developer friendly**: Easy to debug, extend and modify.
+-   **Customizable**: Fine-tune models to your own domain or implement your custom DocumentStore.
+-   **Continuous Learning**: Collect new training data via user feedback in production and improve your models continuously.
+
+|  |  |
+|-|-|
+| :ledger: [Docs](https://haystack.deepset.ai/overview/intro) | Usage, Guides, API documentation ...|
+| :beginner: [Quick Demo](https://github.com/deepset-ai/haystack/#quick-demo) | Quickly see what Haystack offers |
+| :floppy_disk: [Installation](https://github.com/deepset-ai/haystack/#installation) | How to install Haystack |
+| :art: [Key Components](https://github.com/deepset-ai/haystack/#key-components) | Overview of core concepts |
+| :mortar_board: [Tutorials](https://github.com/deepset-ai/haystack/#tutorials) | Jupyter/Colab Notebooks & Scripts |
+| :eyes: [How to use Haystack](https://github.com/deepset-ai/haystack/#how-to-use-haystack) | Basic explanation of concepts, options and usage |
+| :heart: [Contributing](https://github.com/deepset-ai/haystack/#heart-contributing) | We welcome all contributions! |
+| :bar_chart: [Benchmarks](https://haystack.deepset.ai/benchmarks/v0.9.0) | Speed & Accuracy of Retriever, Readers and DocumentStores |
+| :telescope: [Roadmap](https://haystack.deepset.ai/overview/roadmap) | Public roadmap of Haystack |
+| :pray: [Slack](https://haystack.deepset.ai/community/join) | Join our community on Slack |
+| :bird: [Twitter](https://twitter.com/deepset_ai) | Follow us on Twitter for news and updates |
+| :newspaper: [Blog](https://medium.com/deepset-ai) | Read our articles on Medium |
+
+
+## Quick Demo
+
+The quickest way to see what Haystack offers is to start a [Docker Compose](https://docs.docker.com/compose/) demo application:
+
+**1. Update/install Docker and Docker Compose, then launch Docker**
+
+```
+    # apt-get update && apt-get install docker && apt-get install docker-compose
+    # service docker start
+```
+
+**2. Clone Haystack repository**
+
+```
+    # git clone https://github.com/deepset-ai/haystack.git
+```
+
+### 2nd level headline for testing purposes
+#### 3rd level headline for testing purposes
@@ -0,0 +1,3 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
+#
+# SPDX-License-Identifier: Apache-2.0
@@ -0,0 +1,7 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from haystack_experimental.super_components.converters.multi_file_converter import MultiFileConverter
+
+__all__ = ["MultiFileConverter"]
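
For context on the export list in this `__init__.py`: Python only consults the double-underscore name `__all__` when resolving `from package import *`. A single-underscore `_all_` is an ordinary private variable and is ignored. The sketch below illustrates the difference using only the standard library. `demo_pkg`, `Helper`, and `star_import` are hypothetical names introduced for illustration and are not part of this PR.

```python
import sys
import types


def star_import(module_source: str) -> set:
    """Build a throwaway module from source and return the names `import *` exposes."""
    module = types.ModuleType("demo_pkg")  # hypothetical module name
    exec(module_source, module.__dict__)
    sys.modules["demo_pkg"] = module
    try:
        namespace = {}
        exec("from demo_pkg import *", namespace)
    finally:
        # Always unregister the temporary module again.
        del sys.modules["demo_pkg"]
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return set(namespace)


# Two public classes; only one is meant to be exported.
source = "class MultiFileConverter: ...\nclass Helper: ...\n"

with_dunder = star_import(source + '__all__ = ["MultiFileConverter"]\n')
with_typo = star_import(source + '_all_ = ["MultiFileConverter"]\n')

print(sorted(with_dunder))  # only the name listed in __all__
print(sorted(with_typo))    # every public name, because _all_ is ignored
```

Explicit imports such as `from package import MultiFileConverter` work with either spelling, so a misspelled export list tends to go unnoticed until a star-import or a documentation tool relies on it.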
@@ -0,0 +1,169 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from enum import Enum
+from typing import Any, Dict
+
+from haystack import Pipeline, component, default_from_dict, default_to_dict
+from haystack.components.converters import (
+    CSVToDocument,
+    DOCXToDocument,
+    HTMLToDocument,
+    JSONConverter,
+    MarkdownToDocument,
+    PPTXToDocument,
+    PyPDFToDocument,
+    TextFileToDocument,
+    XLSXToDocument,
+)
+from haystack.components.joiners import DocumentJoiner
+from haystack.components.routers import FileTypeRouter
+
+from haystack_experimental.core.super_component import SuperComponent
+
+
+class ConverterMimeType(str, Enum):
+    CSV = "text/csv"
+    DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
+    HTML = "text/html"
+    JSON = "application/json"
+    MD = "text/markdown"
+    TEXT = "text/plain"
+    PDF = "application/pdf"
+    PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
+    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+
+
+@component
+class MultiFileConverter(SuperComponent):
+    """
+    A file converter that handles conversion of multiple file types.
+
+    The MultiFileConverter handles the following file types:
+    - CSV
+    - DOCX
+    - HTML
+    - JSON
+    - MD
+    - TEXT
+    - PDF (no OCR)
+    - PPTX
+    - XLSX
+
+    Usage:
+    ```
+    converter = MultiFileConverter()
+    converter.run(sources=["test.txt", "test.pdf"], meta={})
+    ```
+    """
+
+    def __init__(  # noqa: PLR0915
+        self,
+        encoding: str = "utf-8",
+        json_content_key: str = "content",
+    ) -> None:
+        self.encoding = encoding
+        self.json_content_key = json_content_key
+
+        # initialize components
+        router = FileTypeRouter(
+            mime_types=[
+                ConverterMimeType.CSV.value,
+                ConverterMimeType.DOCX.value,
+                ConverterMimeType.HTML.value,
+                ConverterMimeType.JSON.value,
+                ConverterMimeType.MD.value,
+                ConverterMimeType.TEXT.value,
+                ConverterMimeType.PDF.value,
+                ConverterMimeType.PPTX.value,
+                ConverterMimeType.XLSX.value,
+            ],
+            # Ensure common extensions are registered. Tests on Windows fail otherwise.
+            additional_mimetypes={
+                ConverterMimeType.DOCX.value: ".docx",
+                ConverterMimeType.XLSX.value: ".xlsx",
+                ConverterMimeType.PPTX.value: ".pptx",
+            },
+        )
+
+        csv = CSVToDocument(encoding=self.encoding)
+        docx = DOCXToDocument()
+        html = HTMLToDocument()
+        json = JSONConverter(content_key=self.json_content_key)
+        md = MarkdownToDocument()
+        txt = TextFileToDocument(encoding=self.encoding)
+        pdf = PyPDFToDocument()
+        pptx = PPTXToDocument()
+        xlsx = XLSXToDocument()
+
+        joiner = DocumentJoiner()
+
+
+
+        # Create pipeline and add components
+        pp = Pipeline()
+
+        pp.add_component("router", router)
+
+        pp.add_component("docx", docx)
+        pp.add_component("html", html)
+        pp.add_component("json", json)
+        pp.add_component("md", md)
+        pp.add_component("txt", txt)
+        pp.add_component("pdf", pdf)
+        pp.add_component("pptx", pptx)
+        pp.add_component("xlsx", xlsx)
+        pp.add_component("joiner", joiner)
+        pp.add_component("csv", csv)
+
+        pp.connect(f"router.{ConverterMimeType.CSV.value}", "csv")
+        pp.connect(f"router.{ConverterMimeType.DOCX.value}", "docx")
+        pp.connect(f"router.{ConverterMimeType.HTML.value}", "html")
+        pp.connect(f"router.{ConverterMimeType.JSON.value}", "json")
+        pp.connect(f"router.{ConverterMimeType.MD.value}", "md")
+        pp.connect(f"router.{ConverterMimeType.TEXT.value}", "txt")
+        pp.connect(f"router.{ConverterMimeType.PDF.value}", "pdf")
+        pp.connect(f"router.{ConverterMimeType.PPTX.value}", "pptx")
+        pp.connect(f"router.{ConverterMimeType.XLSX.value}", "xlsx")
+
+        pp.connect("docx.documents", "joiner.documents")
+        pp.connect("html.documents", "joiner.documents")
+        pp.connect("json.documents", "joiner.documents")
+        pp.connect("md.documents", "joiner.documents")
+        pp.connect("txt.documents", "joiner.documents")
+        pp.connect("pdf.documents", "joiner.documents")
+        pp.connect("pptx.documents", "joiner.documents")
+
+        pp.connect("csv.documents", "joiner.documents")
+        pp.connect("xlsx.documents", "joiner.documents")
+
+
+        output_mapping = {"joiner.documents": "documents"}
+        input_mapping = {
+            "sources": ["router.sources"],
+            "meta": ["router.meta"]
+        }
+
+        super(MultiFileConverter, self).__init__(
+            pipeline=pp,
+            output_mapping=output_mapping,
+            input_mapping=input_mapping
+        )
+
+    def to_dict(self) -> Dict[str, Any]:
+        """
+        Serialize this instance to a dictionary.
+        """
+        return default_to_dict(
+            self,
+            encoding=self.encoding,
+            json_content_key=self.json_content_key,
+        )
+
+    @classmethod
+    def from_dict(cls, data: Dict[str, Any]) -> "MultiFileConverter":
+        """
+        Load this instance from a dictionary.
+        """
+        return default_from_dict(cls, data)
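
The comment on `additional_mimetypes` in the hunk above notes that tests fail on Windows unless the Office extensions are registered. A plausible reason is that extension-to-MIME lookups depend on the platform's MIME registry, which may not know `.docx`, `.xlsx`, or `.pptx`. The sketch below shows the underlying idea with Python's standard `mimetypes` module. It assumes, without confirmation from this diff, that the router resolves file types through that registry. `OFFICE_MIME_TYPES` and `register_mime_types` are hypothetical names used only for illustration.

```python
import mimetypes

# Hypothetical mapping for illustration, mirroring the MIME types used in the diff.
OFFICE_MIME_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": ".xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": ".pptx",
}


def register_mime_types(mapping: dict) -> None:
    """Register each MIME type/extension pair so lookups no longer depend on the OS."""
    for mime_type, extension in mapping.items():
        mimetypes.add_type(mime_type, extension)


register_mime_types(OFFICE_MIME_TYPES)

# After registration, the guess is the same on every platform.
guessed_type, _encoding = mimetypes.guess_type("report.docx")
print(guessed_type)
```

Registering the types explicitly makes the lookup deterministic, at the cost of mutating process-wide state in the `mimetypes` module.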
@@ -0,0 +1,9 @@
+# SPDX-FileCopyrightText: 2022-present deepset GmbH <info@deepset.ai>
+#
+# SPDX-License-Identifier: Apache-2.0
+
+from haystack_experimental.super_components.indexers.document_indexer import DocumentIndexer
+
+__all__ = [
+    "DocumentIndexer",
+]