Populate Database

This repository contains the code for populating the Pinecone database with the Dicoding proprietary dataset.

The create_search_index.py script is used to create an index for semantic search on the Dicoding discussion forum and modules.

The data/search.json file contains the data used to create the index, in the following format:

{
    "document_id": "unique id of the document, a string",
    "title": "title of the document, a string",
    "content_display": "content of the document, a string comprising the title and content",
    "target_embedding": "embedding of the document with carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384, a list of floats"
}
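
Before indexing, it can help to check that a record matches the format above. The following is a minimal sketch, not part of this repository: the function name `validate_search_record` is hypothetical, and the expected dimension of 384 is an assumption inferred from the "H384" suffix of the model name.

```python
from numbers import Real

# Assumption: 384 dimensions, inferred from the "H384" suffix of the model name.
EXPECTED_DIM = 384

STRING_FIELDS = ("document_id", "title", "content_display")


def validate_search_record(record, expected_dim=EXPECTED_DIM):
    """Return a list of problems found in one search record (empty if valid)."""
    problems = []
    # The three text fields must all be present and be strings.
    for field in STRING_FIELDS:
        if not isinstance(record.get(field), str):
            problems.append(f"{field} must be a string")
    # The embedding must be a list of numbers with the expected length.
    embedding = record.get("target_embedding")
    if not isinstance(embedding, list):
        problems.append("target_embedding must be a list of floats")
    elif len(embedding) != expected_dim:
        problems.append(
            f"target_embedding has {len(embedding)} values, expected {expected_dim}"
        )
    elif not all(isinstance(x, Real) and not isinstance(x, bool) for x in embedding):
        problems.append("target_embedding must contain only numbers")
    return problems
```

An empty list means the record has every field with the documented type. Pass a different `expected_dim` if your embeddings have another size.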

The create_auto_tag_index.py script is used to create an index for auto-tagging on the Dicoding discussion forum.

The data/auto_tag.json file contains the data used to create the index, in the following format:

{
    "context": "content of the document, a string comprising the title and content",
    "tags": "tags of the document, a list of strings",
    "target_embedding": "embedding of the document with carlesoctav/multi-qa-en-id-mMiniLMv2-L6-H384, a list of floats"
}
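
Records in this format can be mapped to the id/values/metadata shape that vector databases such as Pinecone accept for upserts. The sketch below shows one possible mapping and is not taken from create_auto_tag_index.py. Because the format has no ID field, the SHA-1 based ID is a hypothetical choice, as are the helper names `to_vector` and `batched`.

```python
import hashlib
from itertools import islice


def to_vector(record):
    """Map one auto-tag record to an {"id", "values", "metadata"} dict."""
    # Hypothetical ID scheme: the format has no ID field, so hash the context.
    vector_id = hashlib.sha1(record["context"].encode("utf-8")).hexdigest()
    return {
        "id": vector_id,
        "values": record["target_embedding"],
        "metadata": {"context": record["context"], "tags": record["tags"]},
    }


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch
```

Each batch could then be sent to the index, for example with the Pinecone client's `index.upsert(vectors=batch)`. Hashing the context keeps IDs stable across runs, so re-running the load would overwrite existing vectors instead of duplicating them.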

If your data has no target_embedding field, you can obtain the embeddings from the embedding-endpoint in another repository.
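
As a rough sketch of that backfill step, the function below adds target_embedding to records that lack it. The interface of the embedding-endpoint is not described here, so `embed` is a hypothetical callable that you supply: it takes a string and returns a sequence of floats, for instance by wrapping an HTTP request to the endpoint. The name `fill_missing_embeddings` is also hypothetical.

```python
def fill_missing_embeddings(records, embed, text_field="content_display"):
    """Add target_embedding to records that lack one; return how many were filled.

    `embed` is a caller-supplied callable (hypothetical) mapping text to a
    sequence of floats. Records are modified in place.
    """
    filled = 0
    for record in records:
        # Skip records that already carry a non-empty embedding.
        if record.get("target_embedding"):
            continue
        record["target_embedding"] = list(embed(record[text_field]))
        filled += 1
    return filled
```

For data/auto_tag.json records, pass `text_field="context"`, since that format stores the text under a different key.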