feat(community): Port ArxivRetriever to LangChainJS #7250
Merged: jacoblee93 merged 20 commits into langchain-ai:main from AntonioFerreras:arxiv-retriever on Dec 24, 2024.
The diff below shows changes from 8 of the 20 commits.
Commits (20):
08b7b87 Merge pull request #1 from langchain-ai/main (AntonioFerreras)
8f140d7 create ArxivRetriever, arxiv utils file, and config updates (AntonioFerreras)
b4d4a69 Documentation for Arxiv-Retriever (pdhruvin25)
7835e96 Merge branch 'main' into arxiv-retriever (pdhruvin25)
5b8958f Edit the documentation for arXIV (pdhruvin25)
47dcac0 Create integration test for Arxiv-Retriever (Googlogogo)
f00deda Update integration test for arxiv retriever (Googlogogo)
e52a6e1 Add example usage file for arxiv retriever (boni-teppanyaki)
caa109c Updated file to use fetch() instead of axios.get() (Googlogogo)
55eb739 Final changes to docs (pdhruvin25)
3ae9fc9 Update arxiv-retriever.mdx (jacoblee93)
d169dde Merge branch 'main' of github.com:langchain-ai/langchainjs into 7250 (jacoblee93)
58931bf Format, rename, fix docs (jacoblee93)
20cd43c Rename (jacoblee93)
a630c44 Fix (jacoblee93)
d8a5f75 Merge branch 'main' of github.com:langchain-ai/langchainjs into 7250 (jacoblee93)
7c4f09f Add optional dep (jacoblee93)
b640c39 Lint (jacoblee93)
7e49ac2 Fix (jacoblee93)
25a96f6 Fix docs (jacoblee93)
78 changes: 78 additions & 0 deletions
docs/core_docs/docs/integrations/retrievers/arxiv-retriever.mdx
@@ -0,0 +1,78 @@
# ArxivRetriever in LangChain.js (Docs)
---

## Overview

The `arXiv Retriever` allows users to query the arXiv database for academic articles. It supports both full-document retrieval (PDF parsing) and summary-based retrieval.

---

## Features
- Query Flexibility: Search using natural language queries or specific arXiv IDs.
- Full-Document Retrieval: Option to fetch and parse PDFs.
- Summaries as Documents: Retrieve summaries for faster results.
- Customizable Options: Configure maximum results and output format.

---
## Installation

Ensure the following dependencies are installed:
- `axios` for making HTTP requests
- `pdf-parse` for parsing PDFs
- `fast-xml-parser` for parsing XML responses from the arXiv API

```bash
npm install axios pdf-parse fast-xml-parser
```
---

## Getting started

#### Import the path
```typescript
import { ArxivRetriever } from "langchain-community/retrievers/arxiv.js";
```

#### Instantiate the retriever
```typescript
const retriever = new ArxivRetriever({
  getFullDocuments: false, // Set to true to fetch full documents (PDFs)
  maxSearchResults: 5, // Maximum number of results to retrieve
});
```
---

## Class: ArxivRetriever

### Parameters

| Name | Type | Default | Description |
|-------------------|-----------|---------|------------------------------------------------------|
| `getFullDocuments` | `boolean` | `false` | Whether to fetch full documents (PDFs) instead of summaries. |
| `maxSearchResults` | `number` | `10` | Maximum number of results to fetch from arXiv. |
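
The defaults in the table can be illustrated with a minimal sketch. It assumes the constructor resolves omitted options with nullish coalescing; `ArxivOptionsSketch` and `resolveOptions` are hypothetical names used only for this illustration and are not part of the library.

```typescript
// Hypothetical stand-in for the options type; it mirrors the two documented fields.
type ArxivOptionsSketch = {
  getFullDocuments?: boolean;
  maxSearchResults?: number;
};

// Resolve omitted options to the documented defaults using nullish coalescing.
function resolveOptions(
  options: ArxivOptionsSketch = {}
): Required<ArxivOptionsSketch> {
  return {
    getFullDocuments: options.getFullDocuments ?? false,
    maxSearchResults: options.maxSearchResults ?? 10,
  };
}

// No options: both defaults apply.
console.log(resolveOptions());
// Explicit values are kept; `??` only replaces null or undefined.
console.log(resolveOptions({ getFullDocuments: true, maxSearchResults: 2 }));
```

With `??`, only `undefined` or `null` falls back to the default, so any value the caller passes explicitly is preserved.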

### Methods

### `invoke(query: string): Promise<Document[]>`

Use the `invoke` method to search arXiv for relevant articles. You can use either natural language queries or specific arXiv IDs.

#### Parameters

| Name | Type | Description |
|--------|----------|----------------------------------------|
| `query` | `string` | A natural language query, or one or more space-separated arXiv IDs. |

#### Returns
A `Promise` that resolves to an array of LangChain `Document` instances.

#### Example
```typescript
const documents = await retriever.invoke("quantum computing");
documents.forEach(doc => {
  console.log("Title:", doc.metadata.title);
  console.log("Content:", doc.pageContent); // Summary text, or parsed PDF text when getFullDocuments is true
});
```
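
Because `invoke` resolves to documents that carry `pageContent` and `metadata`, a caller will often reduce them to a printable list. The following is a minimal sketch under stated assumptions: `DocLike` is a local stand-in for the LangChain `Document` type, `formatResults` is a hypothetical helper, and the `title` metadata key is taken from the example above rather than confirmed against the implementation.

```typescript
// Local stand-in for LangChain's Document; only the fields used here.
type DocLike = {
  pageContent: string;
  metadata: Record<string, unknown>;
};

// Hypothetical helper: one line per document, with a truncated content preview.
function formatResults(docs: DocLike[], previewLength = 80): string[] {
  return docs.map((doc, index) => {
    // Fall back to a placeholder if the metadata has no string title.
    const rawTitle = doc.metadata["title"];
    const title = typeof rawTitle === "string" ? rawTitle : "(untitled)";
    // Collapse newlines and repeated spaces so each result stays on one line.
    const preview = doc.pageContent.slice(0, previewLength).replace(/\s+/g, " ");
    return `${index + 1}. ${title}: ${preview}`;
  });
}

// Usage with hand-written sample data (not real arXiv output).
const sampleDocs: DocLike[] = [
  { pageContent: "A short summary\nof the paper.", metadata: { title: "Sample paper" } },
  { pageContent: "Another summary.", metadata: {} },
];
console.log(formatResults(sampleDocs).join("\n"));
```

Truncating the preview is mainly useful when `getFullDocuments` is `true`, since parsed PDF text can be very long.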
@@ -0,0 +1,67 @@
import { ArxivRetriever } from "../../../libs/langchain-community/src/retrievers/arxiv.js";
jacoblee93 marked this conversation as resolved.

export const run = async () => {
  /*
  Direct look up by arXiv ID, for full texts
  */

  const queryId = "1605.08386 2103.03404";
  const retrieverById = new ArxivRetriever({
    getFullDocuments: true,
    maxSearchResults: 5
  });
  const documentsById = await retrieverById.invoke(queryId);
  console.log(documentsById);

  /*
  [
    Document
    {
      pageContent,
      metadata:
      {
        author,
        id,
        published,
        source,
        updated,
        url
      }
    },
    Document
    {
      pageContent,
      metadata
    }
  ]
  */

  /*
  Search with natural language query, for summaries
  */

  const queryNat = "What is the ImageBind model?";
  const retrieverByNat = new ArxivRetriever(
    {
      getFullDocuments: false,
      maxSearchResults: 2
    }
  );
  const documentsByQuery = await retrieverByNat.invoke(queryNat);
  console.log(documentsByQuery);

  /*
  [
    Document
    {
      pageContent,
      metadata
    },
    Document
    {
      pageContent,
      metadata
    }
  ]
  */
};
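
The example passes space-separated arXiv IDs in one call and a free-text question in the other. How the utility code tells the two apart is not shown in this diff, so the sketch below is only one plausible approach: `parseArxivIds` is a hypothetical helper, and it recognizes only new-style identifiers such as `1605.08386` or `2103.03404v2`.

```typescript
// New-style arXiv identifiers: four digits, a dot, four or five digits, optional version.
// Old-style identifiers (for example hep-th/9901001) are deliberately not handled here.
const NEW_STYLE_ARXIV_ID = /^\d{4}\.\d{4,5}(v\d+)?$/;

// Hypothetical helper: returns the IDs if every whitespace-separated token is an ID,
// otherwise null to signal that the query should be treated as free text.
function parseArxivIds(query: string): string[] | null {
  const tokens = query
    .trim()
    .split(/\s+/)
    .filter((token) => token.length > 0);
  if (tokens.length === 0) return null;
  return tokens.every((token) => NEW_STYLE_ARXIV_ID.test(token)) ? tokens : null;
}

// The two queries from the example above.
console.log(parseArxivIds("1605.08386 2103.03404"));
console.log(parseArxivIds("What is the ImageBind model?"));
```

Requiring every token to match keeps a query such as "papers like 1605.08386" on the free-text path, which seems the safer default for a sketch.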
@@ -0,0 +1,45 @@
import { BaseRetriever, BaseRetrieverInput } from "@langchain/core/retrievers";
import { Document } from "@langchain/core/documents";
import { searchArxiv, loadDocsFromResults, getDocsFromSummaries } from '../utils/arxiv.js';

export type ArxivRetrieverOptions = {
  getFullDocuments?: boolean;
  maxSearchResults?: number;
} & BaseRetrieverInput;

/**
 * A retriever that searches arXiv for relevant articles based on a query.
 * It can retrieve either full documents (PDFs) or just summaries.
 */
export class ArxivRetriever extends BaseRetriever {
  static lc_name() {
    return "ArxivRetriever";
  }

  lc_namespace = ["langchain", "retrievers", "arxiv_retriever"];

  getFullDocuments: boolean;
  maxSearchResults: number;

  constructor(options: ArxivRetrieverOptions = {}) {
    super(options);
    this.getFullDocuments = options.getFullDocuments ?? false;
    this.maxSearchResults = options.maxSearchResults ?? 10;
  }

  async _getRelevantDocuments(query: string): Promise<Document[]> {
    try {
      const results = await searchArxiv(query, this.maxSearchResults);

      if (this.getFullDocuments) {
        // Fetch and parse PDFs to get full documents
        return await loadDocsFromResults(results);
      } else {
        // Use summaries as documents
        return getDocsFromSummaries(results);
      }
    } catch (error) {
      throw new Error(`Error retrieving documents from arXiv.`);
    }
  }
}
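
The retriever delegates the network call to `searchArxiv`, whose body is not part of the diff shown here. As a rough sketch of what such a helper might request, the code below builds a URL for the public arXiv API query endpoint using its `search_query`, `id_list`, `start`, and `max_results` parameters. The `ArxivRequestSketch` type and `buildArxivUrl` function are hypothetical and are not taken from the PR.

```typescript
// Public arXiv API query endpoint.
const ARXIV_API_URL = "https://export.arxiv.org/api/query";

// Hypothetical request description; the real helper's parameters are not shown in this diff.
type ArxivRequestSketch =
  | { kind: "ids"; ids: string[]; maxResults: number }
  | { kind: "text"; text: string; maxResults: number };

// Build the query URL without performing any network request.
function buildArxivUrl(request: ArxivRequestSketch): string {
  const params = new URLSearchParams();
  if (request.kind === "ids") {
    // The arXiv API accepts a comma-separated list of identifiers.
    params.set("id_list", request.ids.join(","));
  } else {
    // The "all:" prefix searches across all metadata fields.
    params.set("search_query", `all:${request.text}`);
  }
  params.set("start", "0");
  params.set("max_results", String(request.maxResults));
  return `${ARXIV_API_URL}?${params.toString()}`;
}

console.log(buildArxivUrl({ kind: "ids", ids: ["1605.08386", "2103.03404"], maxResults: 5 }));
console.log(buildArxivUrl({ kind: "text", text: "quantum computing", maxResults: 2 }));
```

Keeping URL construction separate from the fetch makes the request easy to inspect and test without touching the network. The API responds with an Atom XML feed, which would still need to be parsed into results.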
Can you use this template for docs?
https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-scripts/src/cli/docs/templates/retrievers.ipynb