docs[minor]: Update unstructured doc loader (#6344)
* docs[minor]: Update unstructured doc loader
* cr
1 parent 866e8e0, commit 27d1d6f
Showing 2 changed files with 243 additions and 32 deletions.
243 changes: 243 additions & 0 deletions
docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,243 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "raw", | ||
"metadata": { | ||
"vscode": { | ||
"languageId": "raw" | ||
} | ||
}, | ||
"source": [ | ||
"---\n", | ||
"sidebar_label: Unstructured\n", | ||
"sidebar_class_name: node-only\n", | ||
"---" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# UnstructuredLoader\n", | ||
"\n", | ||
"```{=mdx}\n", | ||
"\n", | ||
":::tip Compatibility\n", | ||
"\n", | ||
"Only available on Node.js.\n", | ||
"\n", | ||
":::\n", | ||
"\n", | ||
"```\n", | ||
"\n", | ||
"This notebook provides a quick overview for getting started with the `UnstructuredLoader` [document loader](/docs/concepts/#document-loaders). For detailed documentation of all `UnstructuredLoader` features and configurations, head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html).\n", | ||
"\n", | ||
"## Overview\n", | ||
"### Integration details\n", | ||
"\n", | ||
"| Class | Package | Compatibility | Local | [PY support](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) | \n", | ||
"| :--- | :--- | :---: | :---: | :---: |\n", | ||
"| [UnstructuredLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_fs_unstructured.html) | Node-only | ✅ | ✅ |\n", | ||
"\n", | ||
"## Setup\n", | ||
"\n", | ||
"To use the `UnstructuredLoader` document loader, you'll need to install the `@langchain/community` integration package, create an Unstructured account, and get an API key.\n", | ||
"\n", | ||
"### Local\n", | ||
"\n", | ||
"You can run Unstructured locally on your computer using Docker. To do so, you need to have Docker installed; see the [Docker installation instructions](https://docs.docker.com/get-docker/).\n", | ||
"\n", | ||
"```bash\n", | ||
"docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n", | ||
"```\n", | ||
"\n", | ||
"### Credentials\n", | ||
"\n", | ||
"Head to [unstructured.io](https://unstructured.io/api-key-hosted) to sign up for Unstructured and generate an API key. Once you've done this, set the `UNSTRUCTURED_API_KEY` environment variable:\n", | ||
"\n", | ||
"```bash\n", | ||
"export UNSTRUCTURED_API_KEY=\"your-api-key\"\n", | ||
"```\n", | ||
"\n", | ||
"### Installation\n", | ||
"\n", | ||
"The LangChain UnstructuredLoader integration lives in the `@langchain/community` package:\n", | ||
"\n", | ||
"```{=mdx}\n", | ||
"import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", | ||
"import Npm2Yarn from \"@theme/Npm2Yarn\";\n", | ||
"\n", | ||
"<IntegrationInstallTooltip></IntegrationInstallTooltip>\n", | ||
"\n", | ||
"<Npm2Yarn>\n", | ||
" @langchain/community\n", | ||
"</Npm2Yarn>\n", | ||
"\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Instantiation\n", | ||
"\n", | ||
"Now we can instantiate our loader object and load documents:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\"\n", | ||
"\n", | ||
"const loader = new UnstructuredLoader(\"../../../../../../examples/src/document_loaders/example_data/notion.md\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Document {\n", | ||
" pageContent: '# Testing the notion markdownloader',\n", | ||
" metadata: {\n", | ||
" filename: 'notion.md',\n", | ||
" languages: [ 'eng' ],\n", | ||
" filetype: 'text/plain',\n", | ||
" category: 'NarrativeText'\n", | ||
" },\n", | ||
" id: undefined\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"const docs = await loader.load()\n", | ||
"docs[0]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"{\n", | ||
" filename: 'notion.md',\n", | ||
" languages: [ 'eng' ],\n", | ||
" filetype: 'text/plain',\n", | ||
" category: 'NarrativeText'\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"console.log(docs[0].metadata)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Directories\n", | ||
"\n", | ||
"You can also load all of the files in a directory using [`UnstructuredDirectoryLoader`](https://v02.api.js.langchain.com/classes/langchain_document_loaders_fs_unstructured.UnstructuredDirectoryLoader.html), which inherits from [`DirectoryLoader`](/docs/integrations/document_loaders/file_loaders/directory):\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"Unknown file type: Star_Wars_The_Clone_Wars_S06E07_Crisis_at_the_Heart.srt\n", | ||
"Unknown file type: test.mp3\n" | ||
] | ||
}, | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"directoryDocs.length: 247\n", | ||
"Document {\n", | ||
" pageContent: 'Bitcoin: A Peer-to-Peer Electronic Cash System',\n", | ||
" metadata: {\n", | ||
" filetype: 'application/pdf',\n", | ||
" languages: [ 'eng' ],\n", | ||
" page_number: 1,\n", | ||
" filename: 'bitcoin.pdf',\n", | ||
" category: 'Title'\n", | ||
" },\n", | ||
" id: undefined\n", | ||
"}\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import { UnstructuredDirectoryLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n", | ||
"\n", | ||
"const directoryLoader = new UnstructuredDirectoryLoader(\n", | ||
" \"../../../../../../examples/src/document_loaders/example_data/\",\n", | ||
" {}\n", | ||
");\n", | ||
"const directoryDocs = await directoryLoader.load();\n", | ||
"console.log(\"directoryDocs.length: \", directoryDocs.length);\n", | ||
"console.log(directoryDocs[0])\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## API reference\n", | ||
"\n", | ||
"For detailed documentation of all `UnstructuredLoader` features and configurations, head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_fs_unstructured.UnstructuredLoader.html" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "TypeScript", | ||
"language": "typescript", | ||
"name": "tslab" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"mode": "typescript", | ||
"name": "javascript", | ||
"typescript": true | ||
}, | ||
"file_extension": ".ts", | ||
"mimetype": "text/typescript", | ||
"name": "typescript", | ||
"version": "3.7.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
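Review note on the Setup section: the notebook starts a local container on port 8000 and separately sets `UNSTRUCTURED_API_KEY`, but it does not show how a loader is pointed at one or the other. The sketch below is one way that choice might be written down. It assumes the loader's options object accepts `apiKey` and `apiUrl` fields, that the local container serves the `/general/v0/general` path, and that the local container needs no key; `resolveUnstructuredOptions`, `UnstructuredConnection`, and `UNSTRUCTURED_API_URL` are hypothetical names that do not appear in this commit.

```typescript
// Hypothetical helper: not part of @langchain/community or this commit.
// It only builds a plain options object; it does not contact any server.
interface UnstructuredConnection {
  apiKey?: string;
  apiUrl?: string;
}

// Assumed endpoint of the container started by the `docker run` command
// in the notebook (port 8000). The path is an assumption.
const LOCAL_API_URL = "http://localhost:8000/general/v0/general";

function resolveUnstructuredOptions(
  env: Record<string, string | undefined>,
  useLocal: boolean
): UnstructuredConnection {
  if (useLocal) {
    // Local container: assumed to need only a URL, no API key.
    // UNSTRUCTURED_API_URL is a hypothetical override variable.
    return { apiUrl: env["UNSTRUCTURED_API_URL"] ?? LOCAL_API_URL };
  }
  // Hosted API: read the key the notebook exports in its Credentials section.
  const apiKey = env["UNSTRUCTURED_API_KEY"];
  if (!apiKey) {
    throw new Error("UNSTRUCTURED_API_KEY is not set");
  }
  return { apiKey };
}

// Possible usage, assuming the constructor takes options as its second argument:
// const loader = new UnstructuredLoader(filePath, resolveUnstructuredOptions(process.env, true));
```

Keeping the decision in one small function makes it easy to fail fast when the key is missing, rather than discovering the problem on the first `load()` call.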
32 changes: 0 additions & 32 deletions
docs/core_docs/docs/integrations/document_loaders/file_loaders/unstructured.mdx
This file was deleted.
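Review note on the notebook's Load and Directories outputs: each returned document carries a `metadata.category` value (`NarrativeText` and `Title` in the printed samples), and the directory run reports 247 documents. A short sketch of how a reader might summarize such a result by category follows. `LoadedDoc` and `countByCategory` are hypothetical names, and `LoadedDoc` is a simplified stand-in for the library's document type; the two sample entries reuse values shown in the notebook output.

```typescript
// Simplified stand-in for the documents returned by loader.load().
interface LoadedDoc {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Count documents per metadata.category; entries without a string
// category are grouped under "Unknown".
function countByCategory(docs: LoadedDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const doc of docs) {
    const raw = doc.metadata["category"];
    const category = typeof raw === "string" ? raw : "Unknown";
    counts.set(category, (counts.get(category) ?? 0) + 1);
  }
  return counts;
}

// Sample data copied from the outputs printed in the notebook.
const sampleDocs: LoadedDoc[] = [
  {
    pageContent: "# Testing the notion markdownloader",
    metadata: { filename: "notion.md", category: "NarrativeText" },
  },
  {
    pageContent: "Bitcoin: A Peer-to-Peer Electronic Cash System",
    metadata: { filename: "bitcoin.pdf", category: "Title" },
  },
];

const categoryCounts = countByCategory(sampleDocs);
console.log(categoryCounts);
```

With a real result, passing `directoryDocs` instead of `sampleDocs` would show how the 247 elements split across categories.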