update unstructured (#249)
anakin87 authored Aug 2, 2024
1 parent c408a1d commit e3a4916
Showing 1 changed file with 41 additions and 14 deletions.
55 changes: 41 additions & 14 deletions integrations/unstructured-file-converter.md
@@ -14,37 +14,64 @@ type: Data Ingestion
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/unstructured.svg
version: Haystack 2.0
toc: true
---
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)
- [Connecting to the Unstructured API](#connecting-to-the-unstructured-api)
- [Hosted API](#hosted-api)
- [Local API (Docker)](#local-api-docker)
- [Running Unstructured File Converter](#running-unstructured-file-converter)
- [In isolation](#in-isolation)
- [In a Haystack Pipeline](#in-a-haystack-pipeline)

Component for the Haystack (2.x) LLM framework to easily convert files and directories into Documents using the Unstructured API.

**[Unstructured](https://unstructured-io.github.io/unstructured/index.html)** provides a series of tools to do **ETL for LLMs**. This component calls the Unstructured API that simply extracts text and other information from a vast range of file formats. See [supported file types](https://docs.unstructured.io/api-reference/api-services/overview#supported-file-types).

## Overview
Component for the Haystack (2.x) LLM framework to convert files and directories into Documents using the Unstructured API.

**[Unstructured](https://unstructured-io.github.io/unstructured/index.html)** provides ETL tools for LLMs, extracting text and other information from various file formats. See [supported file types](https://docs.unstructured.io/api-reference/api-services/overview#supported-file-types) for more details.

## Installation
To install the [Unstructured File Converter](https://docs.haystack.deepset.ai/docs/unstructuredfileconverter), run:

```bash
pip install unstructured-fileconverter-haystack
```

### Hosted API
If you plan to use the hosted version of the Unstructured API, you just need the **(free) Unstructured API key**. You can get it by signing up [here](https://unstructured.io/api-key-free).
## Usage

### Connecting to the Unstructured API
#### Hosted API

The hosted Unstructured API is available in two versions: the paid Unstructured Serverless API and the Free Unstructured API.

### Local API (Docker)
If you want to run your own local instance of the Unstructured API, you need Docker and you can find instructions [here](https://unstructured-io.github.io/unstructured/api.html#using-docker-images).
For the Free Unstructured API, the API URL is `https://api.unstructured.io/general/v0/general`. For the Unstructured Serverless API, find your unique API URL in your Unstructured account.

In short, this should work:
Note that API keys for the free and paid versions are not interchangeable.

Set the Unstructured API key as an environment variable:
```bash
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
export UNSTRUCTURED_API_KEY=your_api_key
```
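
As a minimal sketch, assuming you want to fail early when the key has not been exported, a script could check the environment before creating the converter. `require_api_key` is a hypothetical helper for illustration, not part of the integration; only the `UNSTRUCTURED_API_KEY` variable name comes from the text above.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Unstructured API key, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of the integration.
    """
    # Default to the real process environment; a mapping can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("UNSTRUCTURED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "UNSTRUCTURED_API_KEY is not set; export it before using the hosted API"
        )
    return key
```

Calling `require_api_key()` at the top of a script turns a missing key into an immediate, readable error instead of a failed API request later on.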

## Usage
#### Local API (Docker)
You can run a local instance of the Unstructured API using Docker:

If you plan to use the hosted version of the Unstructured API, set the Unstructured API key as an environment variable `UNSTRUCTURED_API_KEY`:
```bash
export UNSTRUCTURED_API_KEY=your_api_key
docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0
```

When initializing the component, specify the localhost URL:
```python
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

converter = UnstructuredFileConverter(api_url="http://localhost:8000/general/v0/general")
```
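
If the container is published on a different host port, the URL changes accordingly. The following sketch builds it from a host and port; `local_api_url` and `API_PATH` are hypothetical names introduced here, and the path `/general/v0/general` is simply taken from the URL above.

```python
from urllib.parse import urlunsplit

# Path served by the Unstructured API, copied from the URL shown above.
API_PATH = "/general/v0/general"


def local_api_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of a locally running Unstructured API.

    Hypothetical helper; `port` is the host side of `docker run -p <host>:8000`.
    """
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return urlunsplit(("http", f"{host}:{port}", API_PATH, "", ""))
```

The result can then be passed as `api_url` when initializing the converter, for example `UnstructuredFileConverter(api_url=local_api_url(port=9000))` for a container started with `-p 9000:8000`.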

### In isolation
### Running Unstructured File Converter
#### In isolation
```python
import os
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter
@@ -53,7 +80,7 @@ converter = UnstructuredFileConverter()
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
```
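
Because `paths` mixes files and directories, it can help to check the list before sending anything to the API. This is a sketch only: `split_existing_paths` is a hypothetical helper, and how the converter itself treats missing paths is not described here.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def split_existing_paths(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into those that exist (files or directories) and those that do not.

    Hypothetical helper for illustration; not part of the integration.
    """
    existing: List[str] = []
    missing: List[str] = []
    for p in paths:
        # Path.exists() is true for both regular files and directories.
        (existing if Path(p).exists() else missing).append(p)
    return existing, missing
```

One possible use is to pass only the `existing` list to `converter.run(paths=...)` and log the `missing` entries.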

### In a Haystack Pipeline
#### In a Haystack Pipeline
```python
import os
from haystack import Pipeline
@@ -69,4 +96,4 @@ indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"paths": ["a/file/path.pdf", "a/directory/path"]}})
```
