Feat: Change ingest-data.ts to handle langchain limitations for large… #7

Open · wants to merge 1 commit into main
scripts/ingest-data.ts: 36 changes (26 additions, 10 deletions)
@@ -8,9 +8,29 @@ import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
 /* Name of directory to retrieve files from. You can change this as required */
 const directoryPath = 'Notion_DB';
 
+async function ingestData(index: any, docs: any[], embeddings: any, chunkSize: number) {
+  for (let i = 0; i < docs.length; i += chunkSize) {
+    const chunk = docs.slice(i, i + chunkSize);
+    try {
+      await PineconeStore.fromDocuments(
+        index,
+        chunk,
+        embeddings,
+        'text',
+        PINECONE_NAME_SPACE, // optional namespace for your vectors
+      );
+      console.log(`Successfully ingested chunk ${i / chunkSize + 1}`);
+    } catch (error) {
+      console.error(`Error ingesting chunk ${i / chunkSize + 1}:`, error);
+      // Surface the failure to the caller instead of continuing with later chunks
+      throw new Error('Failed to ingest your data');
+    }
+  }
+}
+
 export const run = async () => {
   try {
-    /*load raw docs from the markdown files in the directory */
+    /* Load raw docs from the markdown files in the directory */
     const rawDocs = await processMarkDownFiles(directoryPath);
 
     /* Split text into chunks */
@@ -23,16 +43,12 @@ export const run = async () => {
     console.log('split docs', docs);
 
     console.log('creating vector store...');
-    /*create and store the embeddings in the vectorStore*/
+    /* Create and store the embeddings in the vectorStore */
     const embeddings = new OpenAIEmbeddings();
-    const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name
-    await PineconeStore.fromDocuments(
-      index,
-      docs,
-      embeddings,
-      'text',
-      PINECONE_NAME_SPACE, //optional namespace for your vectors
-    );
+    const index = pinecone.Index(PINECONE_INDEX_NAME); // change to your own index name
+
+    await ingestData(index, docs, embeddings, 1000);
+
   } catch (error) {
     console.log('error', error);
     throw new Error('Failed to ingest your data');
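
The diff above batches the documents before calling PineconeStore.fromDocuments, so no single request carries the entire document set; that is the large-ingest limitation the PR title refers to. The same batching pattern can be sketched in isolation in plain TypeScript; note that ingestInBatches and processChunk below are hypothetical names used only for illustration and are not part of this PR.

// Minimal sketch of the batching pattern used in this PR, independent of
// langchain and Pinecone. `ingestInBatches` and `processChunk` are
// placeholder names; the PR applies the same loop directly around
// PineconeStore.fromDocuments.
async function ingestInBatches<T>(
  items: T[],
  chunkSize: number,
  processChunk: (chunk: T[], batchNumber: number) => Promise<void>,
): Promise<void> {
  for (let i = 0; i < items.length; i += chunkSize) {
    const batchNumber = i / chunkSize + 1;
    const chunk = items.slice(i, i + chunkSize);
    try {
      // Each batch is sent sequentially so no single request has to hold
      // every document at once.
      await processChunk(chunk, batchNumber);
      console.log(`Successfully ingested chunk ${batchNumber}`);
    } catch (error) {
      console.error(`Error ingesting chunk ${batchNumber}:`, error);
      throw new Error('Failed to ingest your data');
    }
  }
}

// Example usage (assumes `docs`, `index`, and `embeddings` are set up as in
// the PR; the callback body mirrors the PR's actual call):
// await ingestInBatches(docs, 1000, async (chunk) => {
//   await PineconeStore.fromDocuments(index, chunk, embeddings, 'text', PINECONE_NAME_SPACE);
// });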