From 2781a34f88b30d5300a202a230c42e37248102fe Mon Sep 17 00:00:00 2001 From: Brace Sproul Date: Thu, 1 Aug 2024 14:04:07 -0700 Subject: [PATCH] docs[minor]: Update recursive url loader docs (#6322) * docs[minor]: Update recursive url loader docs * delete old page * chore: lint files * cr --- .../web_loaders/recursive_url_loader.ipynb | 449 ++++++++++++++++++ .../web_loaders/recursive_url_loader.mdx | 67 --- .../src/cli/docs/document_loaders.ts | 11 +- 3 files changed, 451 insertions(+), 76 deletions(-) create mode 100644 docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.ipynb delete mode 100644 docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.mdx diff --git a/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.ipynb b/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.ipynb new file mode 100644 index 000000000000..ec13013b245c --- /dev/null +++ b/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.ipynb @@ -0,0 +1,449 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "sidebar_label: RecursiveUrlLoader\n", + "sidebar_class_name: node-only\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# RecursiveUrlLoader\n", + "\n", + "```{=mdx}\n", + "\n", + ":::tip Compatibility\n", + "\n", + "Only available on Node.js.\n", + "\n", + ":::\n", + "\n", + "```\n", + "\n", + "This notebook provides a quick overview for getting started with [RecursiveUrlLoader](/docs/integrations/document_loaders/). 
For detailed documentation of all RecursiveUrlLoader features and configurations, head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html).\n", + "\n", + "## Overview\n", + "### Integration details\n", + "\n", + "| Class | Package | Local | Serializable | PY support |\n", + "| :--- | :--- | :---: | :---: | :---: |\n", + "| [RecursiveUrlLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_web_recursive_url.html) | ✅ | beta | ❌ | \n", + "### Loader features\n", + "| Source | Web Loader | Node Envs Only |\n", + "| :---: | :---: | :---: | \n", + "| RecursiveUrlLoader | ✅ | ✅ | \n", + "\n", + "When loading content from a website, we may want to load all URLs on a page.\n", + "\n", + "For example, let's look at the [LangChain.js introduction](/docs/introduction) docs.\n", + "\n", + "This has many interesting child pages that we may want to load, split, and later retrieve in bulk.\n", + "\n", + "The challenge is traversing the tree of child pages and assembling a list!\n", + "\n", + "We do this using the `RecursiveUrlLoader`.\n", + "\n", + "This also gives us the flexibility to exclude some children, customize the extractor, and more.\n", + "\n", + "## Setup\n", + "\n", + "To access the `RecursiveUrlLoader` document loader, you'll need to install the `@langchain/community` integration and the [`jsdom`](https://www.npmjs.com/package/jsdom) package.\n", + "\n", + "### Credentials\n", + "\n", + "If you want to get automated tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:\n", + "\n", + "```bash\n", + "# export LANGCHAIN_TRACING_V2=\"true\"\n", + "# export LANGCHAIN_API_KEY=\"your-api-key\"\n", + "```\n", + "\n", + "### Installation\n", + 
"\n", + "The LangChain RecursiveUrlLoader integration lives in the `@langchain/community` package:\n", + "\n", + "```{=mdx}\n", + "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n", + "import Npm2Yarn from \"@theme/Npm2Yarn\";\n", + "\n", + "\n", + "\n", + "\n", + " @langchain/community jsdom\n", + "\n", + "\n", + "We also suggest adding a package like [`html-to-text`](https://www.npmjs.com/package/html-to-text) or\n", + "[`@mozilla/readability`](https://www.npmjs.com/package/@mozilla/readability) for extracting the raw text from the page.\n", + "\n", + "\n", + " html-to-text\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Instantiation\n", + "\n", + "Now we can instantiate our model object and load documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import { RecursiveUrlLoader } from \"@langchain/community/document_loaders/web/recursive_url\"\n", + "import { compile } from \"html-to-text\";\n", + "\n", + "const compiledConvert = compile({ wordwrap: 130 }); // returns (text: string) => string;\n", + "\n", + "const loader = new RecursiveUrlLoader(\"https://langchain.com/\", {\n", + " extractor: compiledConvert,\n", + " maxDepth: 1,\n", + " excludeDirs: [\"/docs/api/\"],\n", + "})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Load" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " pageContent: '\\n' +\n", + " '/\\n' +\n", + " 'Products\\n' +\n", + " '\\n' +\n", + " 'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]\\n' +\n", + " 'Methods\\n' +\n", + " '\\n' +\n", + " 'Retrieval [/retrieval]Agents [/agents]Evaluation [/evaluation]\\n' +\n", + " 'Resources\\n' +\n", + " '\\n' +\n", + " 'Blog [https://blog.langchain.dev/]Case Studies 
[/case-studies]Use Case Inspiration [/use-cases]Experts [/experts]Changelog\\n' +\n", + " '[https://changelog.langchain.com/]\\n' +\n", + " 'Docs\\n' +\n", + " '\\n' +\n", + " 'LangChain Docs [https://python.langchain.com/v0.2/docs/introduction/]LangSmith Docs [https://docs.smith.langchain.com/]\\n' +\n", + " 'Company\\n' +\n", + " '\\n' +\n", + " 'About [/about]Careers [/careers]\\n' +\n", + " 'Pricing [/pricing]\\n' +\n", + " 'Get a demo [/contact-sales]\\n' +\n", + " 'Sign up [https://smith.langchain.com/]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'LangChain’s suite of products supports developers along each step of the LLM application lifecycle.\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'APPLICATIONS THAT CAN REASON. POWERED BY LANGCHAIN.\\n' +\n", + " '\\n' +\n", + " 'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'FROM STARTUPS TO GLOBAL ENTERPRISES,\\n' +\n", + " 'AMBITIOUS BUILDERS CHOOSE\\n' +\n", + " 'LANGCHAIN PRODUCTS.\\n' +\n", + " '\\n' +\n", + " 
'[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c22746faa78338532_logo_Ally.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c08e67bb7eefba4c2_logo_Rakuten.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c576fdde32d03c1a0_logo_Elastic.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c6d5592036dae24e5_logo_BCG.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f19528c3557c2c19c3086_the-home-depot-2%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cbcf6473519b06d84_logo_IDEO.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7cb5f96dcc100ee3b7_logo_Zapier.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6606183e52d49bc369acc76c_mdy_logo_rgb_moodysblue.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c8ad7db6ed6ec611e_logo_Adyen.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c737d50036a62768b_logo_Infor.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f59d98444a5f98aabe21c_acxiom-vector-logo-2022%201.png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c09a158ffeaab0bd2_logo_Replit.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c9d2b23d292a0cab0_logo_Retool.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c44e67a3d0a996bf3_logo_Databricks.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667f5a1299d6ba453c78a849_image%20(19).png][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ca3b7c63af578816bafcc3_logo_Instacart.svg][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/665dc1dabc940168384d9596_podium%20logo.svg]\\n' +\n", + " '\\n' +\n", + " 'Build\\n' +\n", + " '\\n' +\n", + " 'LangChain is a framework to build with LLMs by chaining interoperable components. 
LangGraph is the framework for building\\n' +\n", + " 'controllable agentic workflows.\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'Run\\n' +\n", + " '\\n' +\n", + " 'Deploy your LLM applications at scale with LangGraph Cloud, our infrastructure purpose-built for agents.\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'Manage\\n' +\n", + " '\\n' +\n", + " \"Debug, collaborate, test, and monitor your LLM app in LangSmith - whether it's built with a LangChain framework or not. \\n\" +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'BUILD YOUR APP WITH LANGCHAIN\\n' +\n", + " '\\n' +\n", + " 'Build context-aware, reasoning applications with LangChain’s flexible framework that leverages your company’s data and APIs.\\n' +\n", + " 'Future-proof your application by making vendor optionality part of your LLM infrastructure design.\\n' +\n", + " '\\n' +\n", + " 'Learn more about LangChain\\n' +\n", + " '\\n' +\n", + " '[/langchain]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'RUN AT SCALE WITH LANGGRAPH CLOUD\\n' +\n", + " '\\n' +\n", + " 'Deploy your LangGraph app with LangGraph Cloud for fault-tolerant scalability - including support for async background jobs,\\n' +\n", + " 'built-in persistence, and distributed task queues.\\n' +\n", + " '\\n' +\n", + " 'Learn more about LangGraph\\n' +\n", + " '\\n' +\n", + " '[/langgraph]\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667c6d7284e58f4743a430e6_Langgraph%20UI-home-2.webp]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'MANAGE LLM PERFORMANCE WITH LANGSMITH\\n' +\n", + " '\\n' +\n", + " 'Ship faster with LangSmith’s debug, test, deploy, and monitoring workflows. 
Don’t rely on “vibes” – add engineering rigor to your\\n' +\n", + " 'LLM-development workflow, whether you’re building with LangChain or not.\\n' +\n", + " '\\n' +\n", + " 'Learn more about LangSmith\\n' +\n", + " '\\n' +\n", + " '[/langsmith]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'HEAR FROM OUR HAPPY CUSTOMERS\\n' +\n", + " '\\n' +\n", + " 'LangChain, LangGraph, and LangSmith help teams of all sizes, across all industries - from ambitious startups to established\\n' +\n", + " 'enterprises.\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aee06d9826765c897_Retool_logo%201.png]\\n' +\n", + " '\\n' +\n", + " '“LangSmith helped us improve the accuracy and performance of Retool’s fine-tuned models. Not only did we deliver a better product\\n' +\n", + " 'by iterating with LangSmith, but we’re shipping new AI features to our users in a fraction of the time it would have taken without\\n' +\n", + " 'it.”\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308abdd2dbbdde5a94a1_Jamie%20Cuffe.png]\\n' +\n", + " 'Jamie Cuffe\\n' +\n", + " 'Head of Self-Serve and New Products\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a04d37cf7d3eb1341_Rakuten_Global_Brand_Logo.png]\\n' +\n", + " '\\n' +\n", + " '“By combining the benefits of LangSmith and standing on the shoulders of a gigantic open-source community, we’re able to identify\\n' +\n", + " 'the right approaches of using LLMs in an enterprise-setting faster.”\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a8b6137d44c621cb4_Yusuke%20Kaji.png]\\n' +\n", + " 'Yusuke Kaji\\n' +\n", + " 'General Manager of AI\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308aea1371b447cc4af9_elastic-ar21.png]\\n' +\n", + " '\\n' +\n", + " '“Working with LangChain and LangSmith on the Elastic AI Assistant had a significant 
positive impact on the overall pace and\\n' +\n", + " 'quality of the development and shipping experience. We couldn’t have achieved  the product experience delivered to our customers\\n' +\n", + " 'without LangChain, and we couldn’t have done it at the same pace without LangSmith.”\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c5308a4095d5a871de7479_James%20Spiteri.png]\\n' +\n", + " 'James Spiteri\\n' +\n", + " 'Director of Security Products\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c530539f4824b828357352_Logo_de_Fintual%201.png]\\n' +\n", + " '\\n' +\n", + " '“As soon as we heard about LangSmith, we moved our entire development stack onto it. We could have built evaluation, testing and\\n' +\n", + " 'monitoring tools in house, but with LangSmith it took us 10x less time to get a 1000x better tool.”\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c53058acbff86f4c2dcee2_jose%20pena.png]\\n' +\n", + " 'Jose Peña\\n' +\n", + " 'Senior Manager\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'THE REFERENCE ARCHITECTURE ENTERPRISES ADOPT FOR SUCCESS.\\n' +\n", + " '\\n' +\n", + " 'LangChain’s suite of products can be used independently or stacked together for multiplicative impact – guiding you through\\n' +\n", + " 'building, running, and managing your LLM apps.\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/6695b116b0b60c78fd4ef462_15.07.24%20-Updated%20stack%20diagram%20-%20lightfor%20website-3.webp][https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/667d392696fc0bc3e17a6d04_New%20LC%20stack%20-%20light-2.webp]\\n' +\n", + " '15M+\\n' +\n", + " 'Monthly Downloads\\n' +\n", + " '100K+\\n' +\n", + " 'Apps Powered\\n' +\n", + " '75K+\\n' +\n", + " 'GitHub Stars\\n' +\n", + " '3K+\\n' +\n", + " 'Contributors\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'THE 
BIGGEST DEVELOPER COMMUNITY IN GENAI\\n' +\n", + " '\\n' +\n", + " 'Learn alongside the 1M+ developers who are pushing the industry forward.\\n' +\n", + " '\\n' +\n", + " 'Explore LangChain\\n' +\n", + " '\\n' +\n", + " '[/langchain]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'GET STARTED WITH THE LANGSMITH PLATFORM TODAY\\n' +\n", + " '\\n' +\n", + " 'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65ccf12801bc39bf912a58f3_Home%20C.webp]\\n' +\n", + " '\\n' +\n", + " 'Teams building with LangChain are driving operational efficiency, increasing discovery & personalization, and delivering premium\\n' +\n", + " 'products that generate revenue.\\n' +\n", + " '\\n' +\n", + " 'Discover Use Cases\\n' +\n", + " '\\n' +\n", + " '[/use-cases]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'GET INSPIRED BY COMPANIES WHO HAVE DONE IT.\\n' +\n", + " '\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd7ee85507bdf350399c3_Ally_Financial%201.svg]\\n' +\n", + " 'Financial Services\\n' +\n", + " '\\n' +\n", + " '[https://blog.langchain.dev/ally-financial-collaborates-with-langchain-to-deliver-critical-coding-module-to-mask-personal-identifying-information-in-a-compliant-and-safe-manner/]\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65bcd8b3ae4dc901daa3037a_Adyen_Corporate_Logo%201.svg]\\n' +\n", + " 'FinTech\\n' +\n", + " '\\n' +\n", + " '[https://blog.langchain.dev/llms-accelerate-adyens-support-team-through-smart-ticket-routing-and-support-agent-copilot/]\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c534b3fa387379c0f4ebff_elastic-ar21%20(1).png]\\n' +\n", + " 'Technology\\n' +\n", + " '\\n' +\n", + " '[https://blog.langchain.dev/langchain-partners-with-elastic-to-launch-the-elastic-ai-assistant/]\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'LANGSMITH IS THE ENTERPRISE DEVOPS 
PLATFORM BUILT FOR LLMS.\\n' +\n", + " '\\n' +\n", + " 'Explore LangSmith\\n' +\n", + " '\\n' +\n", + " '[/langsmith]\\n' +\n", + " 'Gain visibility to make trade offs between cost, latency, and quality.\\n' +\n", + " 'Increase developer productivity.\\n' +\n", + " 'Eliminate manual, error-prone testing.\\n' +\n", + " 'Reduce hallucinations and improve reliability.\\n' +\n", + " 'Enterprise deployment options to keep data secure.\\n' +\n", + " '\\n' +\n", + " '\\n' +\n", + " 'READY TO START SHIPPING 
RELIABLE GENAI APPS FASTER?\\n' +\n", + " '\\n' +\n", + " 'Get started with LangChain, LangGraph, and LangSmith to enhance your LLM app development, from prototype to production.\\n' +\n", + " '\\n' +\n", + " 'Get a demo [/contact-sales]Sign up for free [https://smith.langchain.com/]\\n' +\n", + " 'Products\\n' +\n", + " 'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]Agents [/agents]Evaluation [/evaluation]Retrieval [/retrieval]\\n' +\n", + " 'Resources\\n' +\n", + " 'Python Docs [https://python.langchain.com/]JS/TS Docs [https://js.langchain.com/docs/get_started/introduction/]GitHub\\n' +\n", + " '[https://github.com/langchain-ai]Integrations [https://python.langchain.com/v0.2/docs/integrations/platforms/]Templates\\n' +\n", + " '[https://templates.langchain.com/]Changelog [https://changelog.langchain.com/]LangSmith Trust Portal\\n' +\n", + " '[https://trust.langchain.com/]\\n' +\n", + " 'Company\\n' +\n", + " 'About [/about]Blog [https://blog.langchain.dev/]Twitter [https://twitter.com/LangChainAI]LinkedIn\\n' +\n", + " '[https://www.linkedin.com/company/langchain/]YouTube [https://www.youtube.com/@LangChain]Community [/join-community]Marketing\\n' +\n", + " 'Assets [https://drive.google.com/drive/folders/17xybjzmVBdsQA-VxouuGLxF6bDsHDe80?usp=sharing]\\n' +\n", + " 'Sign up for our newsletter to stay up to date\\n' +\n", + " 'Thank you! Your submission has been received!\\n' +\n", + " 'Oops! Something went wrong while submitting the form.\\n' +\n", + " '[https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/65c6a38f9c53ec71f5fc73de_langchain-word.svg]\\n' +\n", + " 'All systems operational\\n' +\n", + " '[https://status.smith.langchain.com/]Privacy Policy [/'... 
111 more characters,\n", + " metadata: {\n", + " source: 'https://langchain.com/',\n", + " title: 'LangChain',\n", + " description: 'LangChain’s suite of products supports developers along each step of their development journey.',\n", + " language: 'en'\n", + " }\n", + "}\n" + ] + } + ], + "source": [ + "const docs = await loader.load()\n", + "docs[0]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " source: 'https://langchain.com/',\n", + " title: 'LangChain',\n", + " description: 'LangChain’s suite of products supports developers along each step of their development journey.',\n", + " language: 'en'\n", + "}\n" + ] + } + ], + "source": [ + "console.log(docs[0].metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Options\n", + "\n", + "```typescript\n", + "interface Options {\n", + " excludeDirs?: string[]; // webpage directories to exclude.\n", + " extractor?: (text: string) => string; // a function to extract the text of the document from the webpage. By default, it returns the page as it is. It is recommended to use tools like html-to-text to extract the text.\n", + " maxDepth?: number; // the maximum depth to crawl. By default, it is set to 2. To crawl the whole website, set it to a sufficiently large number.\n", + " timeout?: number; // the timeout for each request, in milliseconds. By default, it is set to 10000 (10 seconds).\n", + " preventOutside?: boolean; // whether to prevent crawling outside the root url. 
By default, it is set to true.\n", + " callerOptions?: AsyncCallerConstructorParams; // the options to call the AsyncCaller for example setting max concurrency (default is 64)\n", + "}\n", + "```\n", + "\n", + "However, since it's hard to perform a perfect filter, you may still see some irrelevant results in the results. You can perform a filter on the returned documents by yourself, if it's needed. Most of the time, the returned results are good enough." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## API reference\n", + "\n", + "For detailed documentation of all RecursiveUrlLoader features and configurations head to the API reference: https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "TypeScript", + "language": "typescript", + "name": "tslab" + }, + "language_info": { + "codemirror_mode": { + "mode": "typescript", + "name": "javascript", + "typescript": true + }, + "file_extension": ".ts", + "mimetype": "text/typescript", + "name": "typescript", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.mdx b/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.mdx deleted file mode 100644 index ddcb358c3056..000000000000 --- a/docs/core_docs/docs/integrations/document_loaders/web_loaders/recursive_url_loader.mdx +++ /dev/null @@ -1,67 +0,0 @@ ---- -sidebar_class_name: node-only -hide_table_of_contents: true ---- - -# Recursive URL Loader - -When loading content from a website, we may want to process load all URLs on a page. - -For example, let's look at the [LangChain.js introduction](/docs/introduction) docs. 
- -This has many interesting child pages that we may want to load, split, and later retrieve in bulk. - -The challenge is traversing the tree of child pages and assembling a list! - -We do this using the RecursiveUrlLoader. - -This also gives us the flexibility to exclude some children, customize the extractor, and more. - -## Setup - -To get started, you'll need to install the [`jsdom`](https://www.npmjs.com/package/jsdom) package: - -```bash npm2yarn -npm i jsdom -``` - -We also suggest adding a package like [`html-to-text`](https://www.npmjs.com/package/html-to-text) or -[`@mozilla/readability`](https://www.npmjs.com/package/@mozilla/readability) for extracting the raw text from the page. - -```bash npm2yarn -npm i html-to-text -``` - -## Usage - -```typescript -import { compile } from "html-to-text"; -import { RecursiveUrlLoader } from "@langchain/community/document_loaders/web/recursive_url"; - -const url = "/docs/introduction"; - -const compiledConvert = compile({ wordwrap: 130 }); // returns (text: string) => string; - -const loader = new RecursiveUrlLoader(url, { - extractor: compiledConvert, - maxDepth: 1, - excludeDirs: ["/docs/api/"], -}); - -const docs = await loader.load(); -``` - -## Options - -```typescript -interface Options { - excludeDirs?: string[]; // webpage directories to exclude. - extractor?: (text: string) => string; // a function to extract the text of the document from the webpage, by default it returns the page as it is. It is recommended to use tools like html-to-text to extract the text. By default, it just returns the page as it is. - maxDepth?: number; // the maximum depth to crawl. By default, it is set to 2. If you need to crawl the whole website, set it to a number that is large enough would simply do the job. - timeout?: number; // the timeout for each request, in the unit of seconds. By default, it is set to 10000 (10 seconds). - preventOutside?: boolean; // whether to prevent crawling outside the root url. 
By default, it is set to true. - callerOptions?: AsyncCallerConstructorParams; // the options to call the AsyncCaller for example setting max concurrency (default is 64) -} -``` - -However, since it's hard to perform a perfect filter, you may still see some irrelevant results in the results. You can perform a filter on the returned documents by yourself, if it's needed. Most of the time, the returned results are good enough. diff --git a/libs/langchain-scripts/src/cli/docs/document_loaders.ts b/libs/langchain-scripts/src/cli/docs/document_loaders.ts index 0ffbfde2da31..09498fd52d30 100644 --- a/libs/langchain-scripts/src/cli/docs/document_loaders.ts +++ b/libs/langchain-scripts/src/cli/docs/document_loaders.ts @@ -36,15 +36,8 @@ const INTEGRATIONS_DOCS_PATH = path.resolve( "../../docs/core_docs/docs/integrations/document_loaders" ); -const NODE_ONLY_TOOLTIP = `\`\`\`{=mdx} - -:::tip Compatibility - -Only available on Node.js. - -::: - -\`\`\``; +const NODE_ONLY_TOOLTIP = + "```{=mdx}\n\n:::tip Compatibility\n\nOnly available on Node.js.\n\n:::\n\n```\n"; const NODE_ONLY_SIDEBAR_BADGE = `sidebar_class_name: node-only`; const fetchAPIRefUrl = async (url: string): Promise => {