diff --git a/integrations/unstructured-file-converter.md b/integrations/unstructured-file-converter.md
index d8eece0d..ef874ff2 100644
--- a/integrations/unstructured-file-converter.md
+++ b/integrations/unstructured-file-converter.md
@@ -14,37 +14,64 @@ type: Data Ingestion
 report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
 logo: /logos/unstructured.svg
 version: Haystack 2.0
+toc: true
 ---
+- [Overview](#overview)
+- [Installation](#installation)
+- [Usage](#usage)
+  - [Connecting to the Unstructured API](#connecting-to-the-unstructured-api)
+    - [Hosted API](#hosted-api)
+    - [Local API (Docker)](#local-api-docker)
+  - [Running Unstructured File Converter](#running-unstructured-file-converter)
+    - [In isolation](#in-isolation)
+    - [In a Haystack Pipeline](#in-a-haystack-pipeline)
-Component for the Haystack (2.x) LLM framework to easily convert files and directories into Documents using the Unstructured API.
-**[Unstructured](https://unstructured-io.github.io/unstructured/index.html)** provides a series of tools to do **ETL for LLMs**. This component calls the Unstructured API that simply extracts text and other information from a vast range of file formats. See [supported file types](https://docs.unstructured.io/api-reference/api-services/overview#supported-file-types).
+
+## Overview
+Component for the Haystack (2.x) LLM framework to convert files and directories into Documents using the Unstructured API.
+
+**[Unstructured](https://unstructured-io.github.io/unstructured/index.html)** provides ETL tools for LLMs, extracting text and other information from various file formats. See [supported file types](https://docs.unstructured.io/api-reference/api-services/overview#supported-file-types) for more details.
 ## Installation
+To install the [Unstructured File Converter](https://docs.haystack.deepset.ai/docs/unstructuredfileconverter), run:
 ```bash
 pip install unstructured-fileconverter-haystack
 ```
-### Hosted API
-If you plan to use the hosted version of the Unstructured API, you just need the **(free) Unstructured API key**. You can get it by signing up [here](https://unstructured.io/api-key-free).
+## Usage
+
+### Connecting to the Unstructured API
+#### Hosted API
+
+The Unstructured API is available in both free and paid versions: Unstructured Serverless API or Free Unstructured API.
-### Local API (Docker)
-If you want to run your own local instance of the Unstructured API, you need Docker and you can find instructions [here](https://unstructured-io.github.io/unstructured/api.html#using-docker-images).
+For the Free Unstructured API, the API URL is `https://api.unstructured.io/general/v0/general`. For the Unstructured Serverless API, find your unique API URL in your Unstructured account.
-In short, this should work:
+Note that the API keys for free and paid versions are not interchangeable.
+
+Set the Unstructured API key as an environment variable:
 ```bash
-docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
+export UNSTRUCTURED_API_KEY=your_api_key
 ```
-## Usage
+#### Local API (Docker)
+You can run a local instance of the Unstructured API using Docker:
-If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as an environment variable `UNSTRUCTURED_API_KEY`:
 ```bash
-export UNSTRUCTURED_API_KEY=your_api_key
+docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
+```
+
+When initializing the component, specify the localhost URL:
+```python
+from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
+
+converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")
 ```
-### In isolation
+### Running Unstructured File Converter
+#### In isolation
 ```python
 import os
 from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
@@ -53,7 +80,7 @@ converter = UnstructuredFileConverter()
 documents = converter.run(paths = ["a/file/path.pdf", "a/directory/path"])["documents"]
 ```
-### In a Haystack Pipeline
+#### In a Haystack Pipeline
 ```python
 import os
 from haystack import Pipeline
@@ -69,4 +96,4 @@ indexing.add_component("writer", DocumentWriter(document_store))
 indexing.connect("converter", "writer")
 indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})
-```
+```
\ No newline at end of file
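
Note on the new "Connecting to the Unstructured API" section: the page now describes three possible endpoints (free hosted, Serverless, local Docker), so readers may want one place that decides which URL to pass as `api_url`. The following is a minimal sketch of that decision using only the standard library. The helper `resolve_api_url` and the `UNSTRUCTURED_API_URL` variable are hypothetical names introduced here for illustration and are not part of the integration; the two URLs are the ones quoted in the diff.

```python
import os

# URLs quoted in the diff above.
FREE_API_URL = "https://api.unstructured.io/general/v0/general"
LOCAL_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_url(env=None):
    """Return an API URL: explicit override, else hosted if a key is set, else local.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Hypothetical override (assumption): lets Serverless users supply the
    # unique URL from their Unstructured account. The integration itself
    # does not read this variable.
    override = env.get("UNSTRUCTURED_API_URL")
    if override:
        return override
    # With an API key present, assume the free hosted endpoint.
    if env.get("UNSTRUCTURED_API_KEY"):
        return FREE_API_URL
    # No key and no override: fall back to the local Docker instance.
    return LOCAL_API_URL
```

The result could then be passed along the lines of `UnstructuredFileConverter(api_url=resolve_api_url())`, mirroring the `api_url=` argument shown in the Local API example.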