diff --git a/logos/unstructured.svg b/logos/unstructured.svg new file mode 100644 index 00000000..35502d03 --- /dev/null +++ b/logos/unstructured.svg @@ -0,0 +1,391 @@ + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/unstructured-file-converter.md b/unstructured-file-converter.md new file mode 100644 index 00000000..3e9e40dd --- /dev/null +++ b/unstructured-file-converter.md @@ -0,0 +1,67 @@ +--- +layout: integration +name: Unstructured File Converter +description: Component to easily convert files and directories into Documents using the Unstructured API +authors: + - name: deepset + socials: + github: deepset-ai + twitter: deepset_ai + linkedin: deepset-ai +pypi: https://pypi.org/project/unstructured-fileconverter-haystack/ +repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/unstructured/fileconverter +type: Custom Node +report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues +logo: /logos/unstructured.svg +version: Haystack 2.0 +--- + +Component for the Haystack (2.x) LLM framework to easily convert files and directories into Documents using the Unstructured API. + +**[Unstructured](https://unstructured-io.github.io/unstructured/index.html)** provides a series of tools to do **ETL for LLMs**. This component calls the Unstructured API that simply extracts text and other information from a vast range of file formats. See [supported file types](https://unstructured-io.github.io/unstructured/api.html#supported-file-types). + +## Installation + +```bash +pip install unstructured-fileconverter-haystack +``` + +### Hosted API +If you plan to use the hosted version of the Unstructured API, you just need the **(free) Unsctructured API key**. You can get it by signing up [here](https://unstructured.io/api-key). + +### Local API (Docker) +If you want to run your own local instance of the Unstructured API, you need Docker and you can find instructions [here](https://unstructured-io.github.io/unstructured/api.html#using-docker-images). + +In short, this should work: +```bash +docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0 +``` + +## Usage + +### In isolation +```python +import os +from unstructured_fileconverter_haystack import UnstructuredFileConverter + +converter = UnstructuredFileConverter(api_key="UNSTRUCTURED_API_KEY") +documents = converter.run(paths = ["a/file/path.pdf", "a/directory/path"])["documents"] +``` + +### In a Haystack Pipeline +```python +import os +from haystack.preview import Pipeline +from haystack.preview.components.writers import DocumentWriter +from haystack.preview.document_stores import InMemoryDocumentStore +from unstructured_fileconverter_haystack import UnstructuredFileConverter + +document_store = InMemoryDocumentStore() + +indexing = Pipeline() +indexing.add_component("converter", UnstructuredFileConverter(api_key="UNSTRUCTURED_API_KEY")) +indexing.add_component("writer", DocumentWriter(document_store)) +indexing.connect("converter", "writer") + +indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}}) +``` \ No newline at end of file