feat(community): Port ArxivRetriever to LangChainJS #7250

Merged
merged 20 commits into from
Dec 24, 2024
Changes from 8 commits
78 changes: 78 additions & 0 deletions docs/core_docs/docs/integrations/retrievers/arxiv-retriever.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# ArxivRetriever in LangChain.js (Docs)

---

## Overview

The `ArxivRetriever` queries arXiv for academic articles. It can return either the full text of each paper (downloaded and parsed from the PDF) or only the paper's summary.

---

## Features
- Query Flexibility: Search using natural language queries or specific arXiv IDs.
- Full-Document Retrieval: Option to fetch and parse PDFs.
- Summaries as Documents: Retrieve summaries for faster results.
- Customizable Options: Configure maximum results and output format.

---
## Installation

Ensure the following dependencies are installed:
- `axios` for making HTTP requests
- `pdf-parse` for parsing PDFs
- `fast-xml-parser` for parsing XML responses from the arXiv API

```bash
npm install axios pdf-parse fast-xml-parser
```
---

## Getting started

#### Import the retriever
```typescript
import { ArxivRetriever } from "@langchain/community/retrievers/arxiv";
```

#### Instantiate the retriever
```typescript
const retriever = new ArxivRetriever({
  getFullDocuments: false, // Set to true to fetch full documents (PDFs)
  maxSearchResults: 5, // Maximum number of results to retrieve
});
```
---

## Class: ArxivRetriever

### Parameters

| Name | Type | Default | Description |
|-------------------|-----------|---------|------------------------------------------------------|
| `getFullDocuments` | `boolean` | `false` | Whether to fetch full documents (PDFs) instead of summaries. |
| `maxSearchResults` | `number` | `10` | Maximum number of results to fetch from arXiv. |
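
The defaults in the table come from nullish coalescing (`??`) in the constructor. The following is a minimal standalone sketch of that behavior; `ArxivOptionsSketch` and `resolveArxivOptions` are hypothetical names used only for illustration and are not part of the library.

```typescript
// Hypothetical mirror of the option handling in the ArxivRetriever constructor.
type ArxivOptionsSketch = {
  getFullDocuments?: boolean;
  maxSearchResults?: number;
};

function resolveArxivOptions(
  options: ArxivOptionsSketch = {}
): Required<ArxivOptionsSketch> {
  return {
    // `??` falls back only on null or undefined, so explicit values are kept.
    getFullDocuments: options.getFullDocuments ?? false,
    maxSearchResults: options.maxSearchResults ?? 10,
  };
}

console.log(resolveArxivOptions());
console.log(resolveArxivOptions({ getFullDocuments: true, maxSearchResults: 2 }));
```

Because `??` is used rather than `||`, an explicit `getFullDocuments: false` or a numeric `maxSearchResults` is never overridden by the default.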



### Methods

### `invoke(query: string): Promise<Document[]>`

Use the `invoke` method to search arXiv for relevant articles. The query can be natural language or one or more arXiv IDs.

#### Parameters

| Name | Type | Description |
|--------|----------|----------------------------------------|
| `query` | `string` | A natural language query or arXiv ID. |
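
How the retriever tells an ID lookup from a free-text search is handled in a utility module that this page does not show. As a rough sketch only, one way to make that distinction is to test whether every whitespace-separated token looks like a new-style arXiv identifier. `looksLikeArxivIdList` and the regular expression are assumptions for illustration, and old-style IDs such as `hep-th/9901001` are deliberately not covered.

```typescript
// Assumption: new-style arXiv identifiers, e.g. 1605.08386 or 2103.03404v2.
const NEW_STYLE_ARXIV_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;

// Hypothetical helper: true when the query consists only of arXiv IDs.
function looksLikeArxivIdList(query: string): boolean {
  const tokens = query
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  return (
    tokens.length > 0 && tokens.every((token) => NEW_STYLE_ARXIV_ID.test(token))
  );
}

console.log(looksLikeArxivIdList("1605.08386 2103.03404"));
console.log(looksLikeArxivIdList("What is the ImageBind model?"));
```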

#### Returns
A `Promise` that resolves to an array of LangChain `Document` instances.

#### Example
```typescript
const documents = await retriever.invoke("quantum computing");
documents.forEach((doc) => {
  console.log("Title:", doc.metadata.title);
  // Summary text, or parsed PDF text when getFullDocuments is true
  console.log("Content:", doc.pageContent);
});
```
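
Full-text results can be long, so it may help to print a short preview instead of the whole `pageContent`. The sketch below is one possible approach; `DocLike` is a minimal stand-in for the LangChain `Document` shape so the snippet runs without dependencies, `previewDoc` is a hypothetical helper, and the sample data is made up. It does not assume that `metadata.title` is always present.

```typescript
// Minimal stand-in for the LangChain Document shape (assumption for this sketch).
type DocLike = {
  pageContent: string;
  metadata: Record<string, unknown>;
};

// Hypothetical helper: build a one-line preview of a retrieved document.
function previewDoc(doc: DocLike, maxChars = 80): string {
  const rawTitle = doc.metadata["title"];
  const title = typeof rawTitle === "string" ? rawTitle : "(untitled)";
  // Collapse runs of whitespace so parsed PDF text stays on one line.
  const text = doc.pageContent.replace(/\s+/g, " ").trim();
  const snippet =
    text.length > maxChars ? `${text.slice(0, maxChars)}...` : text;
  return `${title}: ${snippet}`;
}

// Made-up sample data, not real arXiv output.
const sample: DocLike = {
  pageContent: "We present  an approach\nto learning a joint embedding.",
  metadata: { title: "Example paper" },
};
console.log(previewDoc(sample));
```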
67 changes: 67 additions & 0 deletions examples/src/retrievers/arxiv.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
import { ArxivRetriever } from "@langchain/community/retrievers/arxiv";

export const run = async () => {
  /*
    Direct look up by arXiv ID, for full texts
  */

  const queryId = "1605.08386 2103.03404";
  const retrieverById = new ArxivRetriever({
    getFullDocuments: true,
    maxSearchResults: 5
  });
  const documentsById = await retrieverById.invoke(queryId);
  console.log(documentsById);

  /*
    [
      Document
      {
        pageContent,
        metadata:
        {
          author,
          id,
          published,
          source,
          updated,
          url
        }
      },
      Document
      {
        pageContent,
        metadata
      }
    ]
  */

  /*
    Search with natural language query, for summaries
  */

  const queryNat = "What is the ImageBind model?";
  const retrieverByNat = new ArxivRetriever(
    {
      getFullDocuments: false,
      maxSearchResults: 2
    }
  );
  const documentsByQuery = await retrieverByNat.invoke(queryNat);
  console.log(documentsByQuery);

  /*
    [
      Document
      {
        pageContent,
        metadata
      },
      Document
      {
        pageContent,
        metadata
      }
    ]
  */
};
4 changes: 4 additions & 0 deletions libs/langchain-community/.gitignore
Original file line number Diff line number Diff line change
@@ -610,6 +610,10 @@ retrievers/amazon_knowledge_base.cjs
retrievers/amazon_knowledge_base.js
retrievers/amazon_knowledge_base.d.ts
retrievers/amazon_knowledge_base.d.cts
retrievers/arxiv.cjs
retrievers/arxiv.js
retrievers/arxiv.d.ts
retrievers/arxiv.d.cts
retrievers/bm25.cjs
retrievers/bm25.js
retrievers/bm25.d.ts
1 change: 1 addition & 0 deletions libs/langchain-community/langchain.config.js
Original file line number Diff line number Diff line change
@@ -193,6 +193,7 @@ export const config = {
// retrievers
"retrievers/amazon_kendra": "retrievers/amazon_kendra",
"retrievers/amazon_knowledge_base": "retrievers/amazon_knowledge_base",
"retrievers/arxiv": "retrievers/arxiv",
"retrievers/bm25": "retrievers/bm25",
"retrievers/chaindesk": "retrievers/chaindesk",
"retrievers/databerry": "retrievers/databerry",
13 changes: 13 additions & 0 deletions libs/langchain-community/package.json
Original file line number Diff line number Diff line change
@@ -2085,6 +2085,15 @@
"import": "./retrievers/amazon_knowledge_base.js",
"require": "./retrievers/amazon_knowledge_base.cjs"
},
"./retrievers/arxiv": {
"types": {
"import": "./retrievers/arxiv.d.ts",
"require": "./retrievers/arxiv.d.cts",
"default": "./retrievers/arxiv.d.ts"
},
"import": "./retrievers/arxiv.js",
"require": "./retrievers/arxiv.cjs"
},
"./retrievers/bm25": {
"types": {
"import": "./retrievers/bm25.d.ts",
@@ -3673,6 +3682,10 @@
"retrievers/amazon_knowledge_base.js",
"retrievers/amazon_knowledge_base.d.ts",
"retrievers/amazon_knowledge_base.d.cts",
"retrievers/arxiv.cjs",
"retrievers/arxiv.js",
"retrievers/arxiv.d.ts",
"retrievers/arxiv.d.cts",
"retrievers/bm25.cjs",
"retrievers/bm25.js",
"retrievers/bm25.d.ts",
1 change: 1 addition & 0 deletions libs/langchain-community/src/load/import_map.ts
Original file line number Diff line number Diff line change
@@ -54,6 +54,7 @@ export * as chat_models__moonshot from "../chat_models/moonshot.js";
export * as chat_models__ollama from "../chat_models/ollama.js";
export * as chat_models__togetherai from "../chat_models/togetherai.js";
export * as chat_models__yandex from "../chat_models/yandex.js";
export * as retrievers__arxiv from "../retrievers/arxiv.js";
export * as retrievers__bm25 from "../retrievers/bm25.js";
export * as retrievers__chaindesk from "../retrievers/chaindesk.js";
export * as retrievers__databerry from "../retrievers/databerry.js";
45 changes: 45 additions & 0 deletions libs/langchain-community/src/retrievers/arxiv.ts
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
import { BaseRetriever, BaseRetrieverInput } from "@langchain/core/retrievers";
import { Document } from "@langchain/core/documents";
import { searchArxiv, loadDocsFromResults, getDocsFromSummaries } from "../utils/arxiv.js";

export type ArxivRetrieverOptions = {
  getFullDocuments?: boolean;
  maxSearchResults?: number;
} & BaseRetrieverInput;

/**
 * A retriever that searches arXiv for relevant articles based on a query.
 * It can retrieve either full documents (PDFs) or just summaries.
 */
export class ArxivRetriever extends BaseRetriever {
  static lc_name() {
    return "ArxivRetriever";
  }

  lc_namespace = ["langchain", "retrievers", "arxiv_retriever"];

  getFullDocuments: boolean;
  maxSearchResults: number;

  constructor(options: ArxivRetrieverOptions = {}) {
    super(options);
    this.getFullDocuments = options.getFullDocuments ?? false;
    this.maxSearchResults = options.maxSearchResults ?? 10;
  }

  async _getRelevantDocuments(query: string): Promise<Document[]> {
    try {
      const results = await searchArxiv(query, this.maxSearchResults);

      if (this.getFullDocuments) {
        // Fetch and parse PDFs to get full documents
        return await loadDocsFromResults(results);
      } else {
        // Use summaries as documents
        return getDocsFromSummaries(results);
      }
    } catch (error) {
      // Keep the underlying failure reason instead of discarding it
      const reason = error instanceof Error ? error.message : String(error);
      throw new Error(`Error retrieving documents from arXiv: ${reason}`);
    }
  }
}
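
The `searchArxiv` helper imported above lives in `../utils/arxiv.js`, which is not shown in this view, so its request logic is unknown here. As a hedged sketch of what such a request could look like, the snippet below builds a query URL for the public arXiv API using its `search_query`, `id_list`, `start`, and `max_results` parameters. `buildArxivQueryUrl`, `ArxivQuerySketch`, and the endpoint constant are illustrative assumptions, not the PR's actual implementation.

```typescript
// Assumption: the public arXiv API query endpoint; the PR's helper may differ.
const ARXIV_API_ENDPOINT = "https://export.arxiv.org/api/query";

// Hypothetical query shape: either free text or an explicit list of IDs.
type ArxivQuerySketch =
  | { kind: "search"; text: string }
  | { kind: "ids"; ids: string[] };

function buildArxivQueryUrl(
  query: ArxivQuerySketch,
  maxResults: number
): string {
  const params = new URLSearchParams();
  if (query.kind === "ids") {
    // The arXiv API takes a comma-separated list of identifiers.
    params.set("id_list", query.ids.join(","));
  } else {
    // The `all:` prefix searches across all metadata fields.
    params.set("search_query", `all:${query.text}`);
  }
  params.set("start", "0");
  params.set("max_results", String(maxResults));
  return `${ARXIV_API_ENDPOINT}?${params.toString()}`;
}

console.log(buildArxivQueryUrl({ kind: "search", text: "quantum computing" }, 5));
console.log(
  buildArxivQueryUrl({ kind: "ids", ids: ["1605.08386", "2103.03404"] }, 5)
);
```

`URLSearchParams` handles the percent-encoding, which avoids hand-building query strings from user input.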