docs[minor]: Update unstructured doc loader (#6344)
* docs[minor]: Update unstructured doc loader

* cr
bracesproul authored Aug 2, 2024
1 parent 866e8e0 commit 27d1d6f
Showing 2 changed files with 243 additions and 32 deletions.
@@ -0,0 +1,243 @@
{
"cells": [
{
"cell_type": "raw",
"metadata": {
"vscode": {
"languageId": "raw"
}
},
"source": [
"---\n",
"sidebar_label: Unstructured\n",
"sidebar_class_name: node-only\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# UnstructuredLoader\n",
"\n",
"```{=mdx}\n",
"\n",
":::tip Compatibility\n",
"\n",
"Only available on Node.js.\n",
"\n",
":::\n",
"\n",
"```\n",
"\n",
"This notebook provides a quick overview for getting started with `UnstructuredLoader` [document loaders](/docs/concepts/#document-loaders). For detailed documentation of all `UnstructuredLoader` features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html).\n",
"\n",
"## Overview\n",
"### Integration details\n",
"\n",
"| Class | Package | Compatibility | Local | [PY support](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) | \n",
"| :--- | :--- | :---: | :---: | :---: |\n",
"| [UnstructuredLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_unstructured.html) | Node-only | ✅ | ✅ |\n",
"\n",
"## Setup\n",
"\n",
"To access `UnstructuredLoader` document loader you'll need to install the `@langchain/community` integration package, and create an Unstructured account and get an API key.\n",
"\n",
"### Local\n",
"\n",
"You can run Unstructured locally in your computer using Docker. To do so, you need to have Docker installed. You can find the instructions to install Docker [here](https://docs.docker.com/get-docker/).\n",
"\n",
"```bash\n",
"docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n",
"```\n",
"\n",
"### Credentials\n",
"\n",
"Head to [unstructured.io](https://unstructured.io/api-key-hosted) to sign up to Unstructured and generate an API key. Once you've done this set the `UNSTRUCTURED_API_KEY` environment variable:\n",
"\n",
"```bash\n",
"export UNSTRUCTURED_API_KEY=\"your-api-key\"\n",
"```\n",
"\n",
"### Installation\n",
"\n",
"The LangChain UnstructuredLoader integration lives in the `@langchain/community` package:\n",
"\n",
"```{=mdx}\n",
"import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
"import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
"\n",
"<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
"\n",
"<Npm2Yarn>\n",
" @langchain/community\n",
"</Npm2Yarn>\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instantiation\n",
"\n",
"Now we can instantiate our model object and load documents:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\"\n",
"\n",
"const loader = new UnstructuredLoader(\"../../../../../../examples/src/document_loaders/example_data/notion.md\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document {\n",
" pageContent: '# Testing the notion markdownloader',\n",
" metadata: {\n",
" filename: 'notion.md',\n",
" languages: [ 'eng' ],\n",
" filetype: 'text/plain',\n",
" category: 'NarrativeText'\n",
" },\n",
" id: undefined\n",
"}\n"
]
}
],
"source": [
"const docs = await loader.load()\n",
"docs[0]"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" filename: 'notion.md',\n",
" languages: [ 'eng' ],\n",
" filetype: 'text/plain',\n",
" category: 'NarrativeText'\n",
"}\n"
]
}
],
"source": [
"console.log(docs[0].metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Directories\n",
"\n",
"You can also load all of the files in the directory using [`UnstructuredDirectoryLoader`](https://v02.api.js.langchain.com/classes/langchain_document_loaders_fs_unstructured.UnstructuredDirectoryLoader.html), which inherits from [`DirectoryLoader`](/docs/integrations/document_loaders/file_loaders/directory):\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt\n",
"Unknown file type: test.mp3\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"directoryDocs.length: 247\n",
"Document {\n",
" pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System',\n",
" metadata: {\n",
" filetype: 'application/pdf',\n",
" languages: [ 'eng' ],\n",
" page_number: 1,\n",
" filename: 'bitcoin.pdf',\n",
" category: 'Title'\n",
" },\n",
" id: undefined\n",
"}\n"
]
}
],
"source": [
"import { UnstructuredDirectoryLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n",
"\n",
"const directoryLoader = new UnstructuredDirectoryLoader(\n",
" \"../../../../../../examples/src/document_loaders/example_data/\",\n",
" {}\n",
");\n",
"const directoryDocs = await directoryLoader.load();\n",
"console.log(\"directoryDocs.length: \", directoryDocs.length);\n",
"console.log(directoryDocs[0])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## API reference\n",
"\n",
"For detailed documentation of all UnstructuredLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "TypeScript",
"language": "typescript",
"name": "tslab"
},
"language_info": {
"codemirror_mode": {
"mode": "typescript",
"name": "javascript",
"typescript": true
},
"file_extension": ".ts",
"mimetype": "text/typescript",
"name": "typescript",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

This file was deleted.
