Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

langchain[minor]: Firecrawl Document Loader #5180

Merged
merged 18 commits into from
Apr 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
hide_table_of_contents: true
---

# Firecrawl

This guide shows how to use [Firecrawl](https://firecrawl.dev) with LangChain to load web data into an LLM-ready format using Firecrawl.

## Overview

[FireCrawl](https://firecrawl.dev) crawls and convert any website into LLM-ready data. It crawls all accessible subpages and give you clean markdown and metadata for each. No sitemap required.

FireCrawl handles complex tasks such as reverse proxies, caching, rate limits, and content blocked by JavaScript. Built by the [mendable.ai](https://mendable.ai) team.

This guide shows how to scrap and crawl entire websites and load them using the `FireCrawlLoader` in LangChain.

## Setup

Sign up and get your free [FireCrawl API key](https://firecrawl.dev) to start. FireCrawl offers 100 free credits to get you started, and it's [open-source](https://github.com/mendableai/firecrawl) in case you want to self-host.

## Usage

Here's an example of how to use the `FireCrawlLoader` to load web search results:

Firecrawl offers 2 modes: `scrape` and `crawl`. In `scrape` mode, Firecrawl will only scrape the page you provide. In `crawl` mode, Firecrawl will crawl the entire website.

import CodeBlock from "@theme/CodeBlock";
import Example from "@examples/document_loaders/firecrawl.ts";

```bash npm2yarn
npm install @mendableai/firecrawl-js
```

<CodeBlock language="typescript">{Example}</CodeBlock>

### Additional Parameters

For `params` you can pass any of the params according to the [Firecrawl documentation](https://docs.firecrawl.dev).
13 changes: 13 additions & 0 deletions examples/src/document_loaders/firecrawl.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
import { FireCrawlLoader } from "langchain/document_loaders/web/firecrawl";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! 👋 I've reviewed the code and noticed that the added lines explicitly access an environment variable using process.env. I've flagged this for your review to ensure it aligns with our best practices for handling environment variables. Let me know if you need further assistance with this.


const loader = new FireCrawlLoader({
url: "https://firecrawl.dev", // The URL to scrape
apiKey: process.env.FIRECRAWL_API_KEY, // Optional, defaults to `FIRECRAWL_API_KEY` in your env.
mode: "scrape", // The mode to run the crawler in. Can be "scrape" for single urls or "crawl" for all accessible subpages
params: {
// optional parameters based on Firecrawl API docs
// For API documentation, visit https://docs.firecrawl.dev
},
});

const docs = await loader.load();
4 changes: 4 additions & 0 deletions langchain/.gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -566,6 +566,10 @@ document_loaders/web/figma.cjs
document_loaders/web/figma.js
document_loaders/web/figma.d.ts
document_loaders/web/figma.d.cts
document_loaders/web/firecrawl.cjs
document_loaders/web/firecrawl.js
document_loaders/web/firecrawl.d.ts
document_loaders/web/firecrawl.d.cts
document_loaders/web/github.cjs
document_loaders/web/github.js
document_loaders/web/github.d.ts
Expand Down
2 changes: 2 additions & 0 deletions langchain/langchain.config.js
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,7 @@ export const config = {
"document_loaders/web/hn": "document_loaders/web/hn",
"document_loaders/web/imsdb": "document_loaders/web/imsdb",
"document_loaders/web/figma": "document_loaders/web/figma",
"document_loaders/web/firecrawl": "document_loaders/web/firecrawl",
"document_loaders/web/github": "document_loaders/web/github",
"document_loaders/web/notiondb": "document_loaders/web/notiondb",
"document_loaders/web/notionapi": "document_loaders/web/notionapi",
Expand Down Expand Up @@ -637,6 +638,7 @@ export const config = {
"document_loaders/web/hn",
"document_loaders/web/imsdb",
"document_loaders/web/figma",
"document_loaders/web/firecrawl",
"document_loaders/web/github",
"document_loaders/web/pdf",
"document_loaders/web/notiondb",
Expand Down
18 changes: 18 additions & 0 deletions langchain/package.json
Original file line number Diff line number Diff line change
Expand Up @@ -578,6 +578,10 @@
"document_loaders/web/figma.js",
"document_loaders/web/figma.d.ts",
"document_loaders/web/figma.d.cts",
"document_loaders/web/firecrawl.cjs",
"document_loaders/web/firecrawl.js",
"document_loaders/web/firecrawl.d.ts",
"document_loaders/web/firecrawl.d.cts",
"document_loaders/web/github.cjs",
"document_loaders/web/github.js",
"document_loaders/web/github.d.ts",
Expand Down Expand Up @@ -1230,6 +1234,7 @@
"@google-cloud/storage": "^7.7.0",
"@jest/globals": "^29.5.0",
"@langchain/scripts": "~0.0",
"@mendable/firecrawl-js": "^0.0.13",
"@notionhq/client": "^2.2.10",
"@pinecone-database/pinecone": "^1.1.0",
"@supabase/supabase-js": "^2.10.0",
Expand Down Expand Up @@ -1314,6 +1319,7 @@
"@gomomento/sdk-web": "^1.51.1",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that this PR adds a new hard dependency with the addition of "@mendable/firecrawl-js". I've flagged this for your review as it's important to ensure the impact of this change on the project's dependencies. Keep up the great work!

"@google-ai/generativelanguage": "^0.2.1",
"@google-cloud/storage": "^6.10.1 || ^7.7.0",
"@mendable/firecrawl-js": "^0.0.13",
"@notionhq/client": "^2.2.10",
"@pinecone-database/pinecone": "*",
"@supabase/supabase-js": "^2.10.0",
Expand Down Expand Up @@ -1386,6 +1392,9 @@
"@google-cloud/storage": {
"optional": true
},
"@mendable/firecrawl-js": {
"optional": true
},
"@notionhq/client": {
"optional": true
},
Expand Down Expand Up @@ -2826,6 +2835,15 @@
"import": "./document_loaders/web/figma.js",
"require": "./document_loaders/web/figma.cjs"
},
"./document_loaders/web/firecrawl": {
"types": {
"import": "./document_loaders/web/firecrawl.d.ts",
"require": "./document_loaders/web/firecrawl.d.cts",
"default": "./document_loaders/web/firecrawl.d.ts"
},
"import": "./document_loaders/web/firecrawl.js",
"require": "./document_loaders/web/firecrawl.cjs"
},
"./document_loaders/web/github": {
"types": {
"import": "./document_loaders/web/github.d.ts",
Expand Down
34 changes: 34 additions & 0 deletions langchain/src/document_loaders/tests/firecrawl.int.test.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
/* eslint-disable no-process-env */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that the recent changes in the FireCrawlLoader tests are accessing environment variables directly. I've flagged this for your review to ensure proper handling of sensitive information. Let me know if you need any further assistance with this.

/* eslint-disable @typescript-eslint/no-non-null-assertion */
import { test } from "@jest/globals";
import { Document } from "@langchain/core/documents";
import { FireCrawlLoader } from "../web/firecrawl.js";

test("Test FireCrawlLoader load method with scrape mode", async () => {
const loader = new FireCrawlLoader({
url: "https://firecrawl.dev",
apiKey: process.env.FIRECRAWL_API_KEY,
mode: "scrape",
});

const documents = await loader.load();
expect(documents).toHaveLength(1);
const document = documents[0];
expect(document).toBeInstanceOf(Document);
expect(document.pageContent).toBeTruthy();
expect(document.metadata).toBeTruthy();
});

test("Test FireCrawlLoader load method with crawl mode", async () => {
const loader = new FireCrawlLoader({
url: "https://firecrawl.dev",
apiKey: process.env.FIRECRAWL_API_KEY,
mode: "crawl",
});

const documents = await loader.load();
const document = documents[0];
expect(document).toBeInstanceOf(Document);
expect(document.pageContent).toBeTruthy();
expect(document.metadata).toBeTruthy();
}, 15000);
108 changes: 108 additions & 0 deletions langchain/src/document_loaders/web/firecrawl.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
import FirecrawlApp from "@mendable/firecrawl-js";
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there! I noticed that the recent code changes explicitly access an environment variable using getEnvironmentVariable. I've flagged this for your review to ensure it aligns with the intended functionality. Keep up the great work!

import { Document, type DocumentInterface } from "@langchain/core/documents";
import { getEnvironmentVariable } from "@langchain/core/utils/env";
import { BaseDocumentLoader } from "../base.js";

/**
* Interface representing the parameters for the Firecrawl loader. It
* includes properties such as the URL to scrape or crawl and the API key.
*/
interface FirecrawlLoaderParameters {
/**
* URL to scrape or crawl
*/
url: string;

/**
* API key for Firecrawl. If not provided, the default value is the value of the FIRECRAWL_API_KEY environment variable.
*/
apiKey?: string;

/**
* Mode of operation. Can be either "crawl" or "scrape". If not provided, the default value is "crawl".
*/
mode?: "crawl" | "scrape";
params?: Record<string, unknown>;
}
interface FirecrawlDocument {
markdown: string;
metadata: Record<string, unknown>;
}

/**
* Class representing a document loader for loading data from
* Firecrawl (firecrawl.dev). It extends the BaseDocumentLoader class.
* @example
* ```typescript
* const loader = new FireCrawlLoader({
* url: "{url}",
* apiKey: "{apiKey}",
* mode: "crawl"
* });
* const docs = await loader.load();
* ```
*/
export class FireCrawlLoader extends BaseDocumentLoader {
private apiKey: string;

private url: string;

private mode: "crawl" | "scrape";

private params?: Record<string, unknown>;

constructor(loaderParams: FirecrawlLoaderParameters) {
super();
const {
apiKey = getEnvironmentVariable("FIRECRAWL_API_KEY"),
url,
mode = "crawl",
params,
} = loaderParams;
if (!apiKey) {
throw new Error(
"Firecrawl API key not set. You can set it as FIRECRAWL_API_KEY in your .env file, or pass it to Firecrawl."
);
}

this.apiKey = apiKey;
this.url = url;
this.mode = mode;
this.params = params;
}

/**
* Loads the data from the Firecrawl.
* @returns An array of Documents representing the retrieved data.
* @throws An error if the data could not be loaded.
*/
public async load(): Promise<DocumentInterface[]> {
const app = new FirecrawlApp({ apiKey: this.apiKey });
let firecrawlDocs: FirecrawlDocument[];

if (this.mode === "scrape") {
const response = await app.scrapeUrl(this.url, this.params);
if (!response.success) {
throw new Error(
`Firecrawl: Failed to scrape URL. Error: ${response.error}`
);
}
firecrawlDocs = [response.data as FirecrawlDocument];
} else if (this.mode === "crawl") {
const response = await app.crawlUrl(this.url, this.params, true);
firecrawlDocs = response as FirecrawlDocument[];
} else {
throw new Error(
`Unrecognized mode '${this.mode}'. Expected one of 'crawl', 'scrape'.`
);
}

return firecrawlDocs.map(
(doc) =>
new Document({
pageContent: doc.markdown || "",
metadata: doc.metadata || {},
})
);
}
}
1 change: 1 addition & 0 deletions langchain/src/load/import_constants.ts
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ export const optionalImportEntrypoints: string[] = [
"langchain/document_loaders/web/hn",
"langchain/document_loaders/web/imsdb",
"langchain/document_loaders/web/figma",
"langchain/document_loaders/web/firecrawl",
"langchain/document_loaders/web/github",
"langchain/document_loaders/web/notiondb",
"langchain/document_loaders/web/notionapi",
Expand Down
16 changes: 15 additions & 1 deletion yarn.lock
Original file line number Diff line number Diff line change
Expand Up @@ -10092,6 +10092,16 @@ __metadata:
languageName: node
linkType: hard

"@mendable/firecrawl-js@npm:^0.0.13":
version: 0.0.13
resolution: "@mendable/firecrawl-js@npm:0.0.13"
dependencies:
axios: ^1.6.8
dotenv: ^16.4.5
checksum: 9dbc0b6e5d300bb9ef9f45cebd5c0026ac468863984cdc73a57ed6fdf888eaead5f9e2325c6848d03897c72cab195fffb4ce7d832e39696a11216bc53b417b6d
languageName: node
linkType: hard

"@mistralai/mistralai@npm:^0.1.3":
version: 0.1.3
resolution: "@mistralai/mistralai@npm:0.1.3"
Expand Down Expand Up @@ -16954,7 +16964,7 @@ __metadata:
languageName: node
linkType: hard

"axios@npm:^1.6.2":
"axios@npm:^1.6.2, axios@npm:^1.6.8":
version: 1.6.8
resolution: "axios@npm:1.6.8"
dependencies:
Expand Down Expand Up @@ -26670,6 +26680,7 @@ __metadata:
"@langchain/openai": ~0.0.28
"@langchain/scripts": ~0.0
"@langchain/textsplitters": ~0.0.0
"@mendable/firecrawl-js": ^0.0.13
"@notionhq/client": ^2.2.10
"@pinecone-database/pinecone": ^1.1.0
"@supabase/supabase-js": ^2.10.0
Expand Down Expand Up @@ -26766,6 +26777,7 @@ __metadata:
"@gomomento/sdk-web": ^1.51.1
"@google-ai/generativelanguage": ^0.2.1
"@google-cloud/storage": ^6.10.1 || ^7.7.0
"@mendable/firecrawl-js": ^0.0.13
"@notionhq/client": ^2.2.10
"@pinecone-database/pinecone": "*"
"@supabase/supabase-js": ^2.10.0
Expand Down Expand Up @@ -26827,6 +26839,8 @@ __metadata:
optional: true
"@google-cloud/storage":
optional: true
"@mendable/firecrawl-js":
optional: true
"@notionhq/client":
optional: true
"@pinecone-database/pinecone":
Expand Down
Loading