Skip to content

Latest commit

 

History

History
214 lines (173 loc) · 12.7 KB

how_to_add_projects_to_kb.md

File metadata and controls

214 lines (173 loc) · 12.7 KB

i# Question 1 I am building a knowledge base that will have the following metadata:

  1. section (For instance, AGB question, General Project question, Information for customers)
  2. Question. Has a question, that customers can ask
  3. Answer. An answer to the question.

For instance: General Project Questions;How high is the fixed interest rate for investors in the 'The Five' project and when does it start? What is the minimum term for the interest rate, even in the event of early repayment by the borrower?;The fixed interest rate is 5.6% p.a. and starts immediately upon receipt of the money. The minimum term is approximately 25 months until May 27, 2024, with at least 22 months of interest guaranteed even in the event of early repayment by February 27, 2024.

my task is to create CSV file for 100 projects to index them. I cannot understand how I can reference to projects in index? How can retrieval system distinguish all of them?

For instance, if a customer asks "What is the interest rate for the Project X", how can the retrieval system fetch information that is related exactly to the Project X, not for the Project Y? Which fields should I add to the index and how can I fetch data?

Answer 1

To create an efficient retrieval system for your knowledge base where questions are indexed by specific projects, you need to ensure that each entry in your database has enough metadata to distinguish between projects. Here's how you can approach it:

1. Add Project Metadata

You should add a field to represent each project specifically. This could be a unique identifier for each project or a project name. This will allow the retrieval system to differentiate between questions related to "Project X" and those related to "Project Y". Here's what your additional metadata could look like:

  • Project Name (or ID): A unique field to represent each project (e.g., "Project X" or a unique project identifier). This is essential for distinguishing between multiple projects.
  • Project Category/Type (Optional): If projects fall into categories or types, you can add this as an optional metadata field (e.g., "Residential", "Commercial", etc.).

2. Example CSV Format with Added Fields

Here’s an example of how your CSV format could be structured:

Project Name;Project ID;Section;Question;Answer
The Five;101;General Project Questions;How high is the fixed interest rate for investors in the 'The Five' project and when does it start? What is the minimum term for the interest rate, even in the event of early repayment by the borrower?;The fixed interest rate is 5.6% p.a. and starts immediately upon receipt of the money. The minimum term is approximately 25 months until May 27, 2024, with at least 22 months of interest guaranteed even in the event of early repayment by February 27, 2024.
Skyline Tower;102;AGB Questions;What happens if the borrower misses payments for the Skyline Tower project?;If the borrower misses a payment, the investment contract stipulates certain penalties, which vary depending on the duration of the delay and the project terms.

In this CSV format:

  • Project Name: The name of the project (e.g., "The Five", "Skyline Tower").
  • Project ID: A unique identifier for each project (e.g., 101, 102). This can be helpful for internal referencing.
  • Section: The section/category of the knowledge base question (e.g., "AGB Questions", "General Project Questions").
  • Question: The actual question related to the project.
  • Answer: The response to the question.

3. Fields to Add to the Index

The index should have the following fields to make sure the retrieval system can easily distinguish between projects:

  • Project Name or Project ID: This will be the primary reference to retrieve project-specific data. If the user’s query mentions "Project X", the system can directly retrieve all knowledge base entries with that project name or ID.
  • Section: This helps categorize the type of question, which may also help refine retrieval (e.g., "General Questions", "Financial Information", etc.).
  • Question: The question itself will be indexed for text-based retrieval.
  • Answer: The answer to the question will be retrieved and displayed based on the query.

4. How Retrieval Can Work

When a user asks a question like "What is the interest rate for Project X?":

  • Keyword Matching: The system can extract keywords such as "interest rate" and "Project X".
  • Project-Specific Filtering: The system should first filter by the Project Name/ID (e.g., "Project X") to ensure only data related to that project is considered.
  • Section and Question Matching: Within the filtered project-specific data, it can further refine the results by looking for keywords related to the question (e.g., "interest rate").

5. Fetch Data by Using a Full-Text Search Engine

  • Search by Project ID or Name: When indexing the data, ensure that the Project Name or Project ID is part of the query used for the search.
  • Full-Text Search for Questions: Use a full-text search engine (like Elasticsearch, MeiliSearch, or even PostgreSQL with the tsvector type) that allows you to search within the question/answer content as well.
  • Rank the Results: Rank the results based on relevance to the user query (for example, based on how well the project and question match the keywords in the query).

6. Further Improvements

  • Synonyms or Tags: Consider adding a "tags" or "keywords" column to capture synonymous terms or relevant keywords that might differ from project to project but still relate to the same topic.
  • Fallback Mechanism: If a project-specific question isn’t found, the retrieval system can fall back on more general answers (i.e., project-agnostic data).

Final CSV Format (with Recommended Fields):

Project Name;Project ID;Section;Question;Answer;Tags;Created_At
The Five;101;General Project Questions;How high is the fixed interest rate for investors in the 'The Five' project and when does it start?;The fixed interest rate is 5.6% p.a. and starts immediately upon receipt of the money.;interest rate, fixed rate, start date;2023-09-21
Skyline Tower;102;Financial Questions;What happens if the borrower misses payments for the Skyline Tower project?;If the borrower misses a payment, the contract stipulates certain penalties, which vary depending on the duration of the delay.;missed payment, penalties, late payment;2023-09-21

This approach will help ensure the retrieval system can distinguish and retrieve the correct project-specific data based on the user’s query.

Question 2

Based on your data, suggest the index for the ElasticSearch.

Answer 2

For indexing data in Elasticsearch for a knowledge base that contains information about different projects, the index needs to be structured to support efficient retrieval of project-specific questions and answers. Given the data structure you outlined, the Elasticsearch index can be optimized by ensuring that certain fields are both filterable (like Project Name or Project ID) and searchable (like Question and Answer).

Here’s how you can define the index mapping for your Elasticsearch index:

1. Index Mapping

The index mapping is a schema that defines how the fields in your documents should be indexed and searched. The following is a recommended mapping for your knowledge base:

{
  "mappings": {
    "properties": {
      "project_name": {
        "type": "keyword"
      },
      "project_id": {
        "type": "keyword"
      },
      "section": {
        "type": "keyword"
      },
      "question": {
        "type": "text",
        "analyzer": "standard"
      },
      "answer": {
        "type": "text",
        "analyzer": "standard"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}

2. Explanation of Fields in the Mapping

  • project_name (keyword): The name of the project. It is indexed as a keyword field, meaning it will be used for exact matching. This is essential for filtering answers related to a specific project.

  • project_id (keyword): A unique identifier for the project. Like project_name, this is also a keyword and is used for exact matching. For internal use, this might be a more reliable identifier.

  • section (keyword): The section of the knowledge base (e.g., "AGB Questions", "General Project Questions"). This is treated as a keyword, since sections are typically predefined categories that can be filtered.

  • question (text): This is indexed as a text field with the standard analyzer, which means it will be tokenized for full-text search. It allows users to perform searches based on the content of the question (e.g., "interest rate for Project X").

  • answer (text): Like the question field, this is also indexed as a text field with full-text search capabilities, which allows the retrieval of relevant answers based on user queries.

  • tags (keyword): Tags or keywords that are related to the question/answer pair. These can help further narrow down the search and provide more context (e.g., "interest rate", "missed payment"). These are indexed as exact match keywords.

  • created_at (date): A date field to track when the record was added. This is helpful for filtering or sorting based on recency (e.g., showing the most recent answers first).

3. Sample Document for Elasticsearch

Here’s an example of a document in Elasticsearch that matches your data structure:

{
  "project_name": "The Five",
  "project_id": "101",
  "section": "General Project Questions",
  "question": "How high is the fixed interest rate for investors in the 'The Five' project and when does it start? What is the minimum term for the interest rate, even in the event of early repayment by the borrower?",
  "answer": "The fixed interest rate is 5.6% p.a. and starts immediately upon receipt of the money. The minimum term is approximately 25 months until May 27, 2024, with at least 22 months of interest guaranteed even in the event of early repayment by February 27, 2024.",
  "tags": ["interest rate", "fixed rate", "start date"],
  "created_at": "2023-09-21T00:00:00Z"
}

4. Search Scenarios

Here’s how Elasticsearch would handle different types of queries based on the index:

A. Project-Specific Query (Exact Match)

If the user asks "What is the interest rate for Project X?":

  • You can filter by project_name or project_id to ensure that only documents related to "Project X" are considered.
  • Then, use a full-text search on the question field for terms like "interest rate".
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "project_name": "The Five" } }
      ],
      "must": [
        { "match": { "question": "interest rate" } }
      ]
    }
  }
}

B. General Search Without Project Filtering

If the user asks a more general question like "What is the interest rate for projects?":

  • You can perform a full-text search across the question and answer fields, without applying any project filters. This will return results from multiple projects.
{
  "query": {
    "multi_match": {
      "query": "interest rate",
      "fields": ["question", "answer"]
    }
  }
}

C. Filter by Section

If the user asks for questions from a specific section, like "AGB Questions":

  • You can filter based on the section field to retrieve questions from the relevant category.
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "section": "AGB Questions" } }
      ],
      "must": [
        { "match": { "question": "borrower" } }
      ]
    }
  }
}

5. Additional Optimizations

  • Synonyms: You can configure a synonym analyzer in Elasticsearch to handle different terminology for similar concepts (e.g., "rate", "interest rate", "yield").
  • Boosting: You can apply boosting to certain fields to give more weight to the project_name or question fields when they match the user query.
  • Autocomplete: Implementing an autocomplete feature based on the question field can help improve the user experience by suggesting questions as the user types their query.

6. Conclusion

To build a robust retrieval system in Elasticsearch:

  • Use project_name or project_id for exact matching when filtering by project.
  • Use question and answer fields as text for full-text search capabilities.
  • Add tags and section to further enhance search filtering and categorization.

This structure ensures that your system can efficiently differentiate between projects and return highly relevant results for the user’s query.