
Enhancing LLM Interactions with Semantic Caching: A Python Implementation

In the rapidly evolving field of artificial intelligence, Large Language Models (LLMs) have become a cornerstone of natural language processing. However, as powerful as these models are, they come with challenges such as high latency and operational costs. This article explores a Python implementation that addresses these issues by introducing a semantic cache powered by a vector database.

Background

Large Language Models (LLMs)

Large Language Models, such as OpenAI's GPT series, are AI models trained on vast amounts of text data. They can understand and generate human-like text, answer questions, and perform various language-related tasks. While incredibly powerful, LLMs often require significant computational resources and can have high response times, especially for complex queries.

Vector Databases

Vector databases are specialized databases designed to store and query high-dimensional vectors efficiently. In natural language processing, these vectors often represent the semantic meaning of text. Vector databases enable fast similarity searches, making them ideal for implementing semantic caches.
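
To make "similarity search" concrete: the comparison underneath is usually a distance metric such as cosine similarity between embedding vectors. A minimal, self-contained illustration using toy three-dimensional vectors in place of real (much higher-dimensional) text embeddings:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Cosine similarity: close to 1.0 for vectors pointing the same way, lower otherwise."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy vectors standing in for real text embeddings.
    reset_password_a = np.array([0.9, 0.1, 0.3])     # "How do I reset my password?"
    reset_password_b = np.array([0.85, 0.15, 0.25])  # "What's the way to reset a password?"
    unrelated        = np.array([0.1, 0.9, 0.2])     # "What is the capital of France?"

    print(cosine_similarity(reset_password_a, reset_password_b))  # ~0.99 (semantically close)
    print(cosine_similarity(reset_password_a, unrelated))         # ~0.27 (semantically distant)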

Semantic Caching

Semantic caching is a technique that stores and retrieves information based on meaning rather than exact matches. In the context of LLMs, it involves caching responses to queries and retrieving them for semantically similar future queries. This approach can significantly reduce response times and computational load for frequently asked or similar questions.

The Python Implementation

The Python code demonstrates a practical application of these concepts, creating an interactive chat interface that uses OpenAI's GPT model with a semantic cache powered by the Qdrant vector database.

Key Components

  1. OpenAI API Integration: The code uses OpenAI's API to generate embeddings (vector representations of text) and to get responses from the GPT model.

  2. Qdrant Vector Database: Qdrant is used to store and retrieve cached responses based on the semantic similarity of queries.

  3. Semantic Caching Logic: The implementation includes functions to check the cache for similar queries, add new query-response pairs to the cache, and retrieve cached responses when appropriate, as sketched below.
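
As a rough sketch of how these pieces might be wired together (the collection name, embedding model, and vector size here are assumptions for illustration, not values taken from the repository's code):

    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams

    COLLECTION = "semantic_cache"            # assumed collection name
    EMBED_MODEL = "text-embedding-3-small"   # assumed embedding model (1536 dimensions)
    VECTOR_SIZE = 1536

    openai_client = OpenAI()                            # reads OPENAI_API_KEY from the environment
    qdrant = QdrantClient(host="localhost", port=6333)  # the Dockerized Qdrant instance

    # Create the cache collection once, using cosine distance for similarity search.
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
        )

    def embed(text: str) -> list[float]:
        """Turn a query into an embedding vector via OpenAI's embeddings endpoint."""
        return openai_client.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding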

How It Works

  1. When a user inputs a query, the system first generates an embedding for the query using OpenAI's embedding model.

  2. This embedding is used to search the Qdrant database for similar previous queries.

  3. If a sufficiently similar query is found (based on a configurable similarity threshold), the cached response is returned, saving time and API calls.

  4. If no similar query is found, the system sends the query to OpenAI's GPT model and caches the response for future use (see the sketch after this list).

  5. The cache can be cleared at any time, allowing for fresh starts or updates to the knowledge base.
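
Continuing the setup sketch above, the lookup-or-generate flow might look roughly like this (the 0.9 threshold and the chat model name are placeholders, not values from the actual scripts):

    import uuid

    from qdrant_client.models import PointStruct

    # Reuses embed(), qdrant, openai_client and COLLECTION from the setup sketch above.
    SIMILARITY_THRESHOLD = 0.9   # assumed value; tune for your use case
    CHAT_MODEL = "gpt-4o-mini"   # placeholder model name

    def ask(query: str) -> str:
        # Steps 1-2: embed the query and look for the most similar cached query.
        vector = embed(query)
        hits = qdrant.search(collection_name=COLLECTION, query_vector=vector, limit=1)

        # Step 3: cache hit -- return the stored answer without calling the LLM.
        if hits and hits[0].score >= SIMILARITY_THRESHOLD:
            return hits[0].payload["answer"]

        # Step 4: cache miss -- ask the LLM, then store the new query/answer pair.
        response = openai_client.chat.completions.create(
            model=CHAT_MODEL,
            messages=[{"role": "user", "content": query}],
        )
        answer = response.choices[0].message.content
        qdrant.upsert(
            collection_name=COLLECTION,
            points=[PointStruct(id=str(uuid.uuid4()), vector=vector,
                                payload={"question": query, "answer": answer})],
        )
        return answer

    # Step 5: clearing the cache is just dropping the collection (it is recreated on the next run).
    def clear_cache() -> None:
        qdrant.delete_collection(collection_name=COLLECTION)

With this in place, the first "How do I reset my password?" goes to the LLM, while a later "What's the way to reset a password?" can be answered straight from the cache.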

Benefits and Applications

This implementation offers several advantages:

  1. Reduced Latency: By retrieving cached responses for similar queries, the system can provide near-instantaneous responses for frequently asked questions.

  2. Cost Efficiency: Fewer API calls to the LLM service mean lower operational costs, especially for high-volume applications.

  3. Improved User Experience: Faster response times lead to a more fluid and responsive chat interface.

  4. Scalability: The use of a vector database allows the system to efficiently handle a large number of cached responses.

Potential Use Cases

This semantic caching system could be beneficial in various scenarios:

  • Customer Support Chatbots
  • Educational Q&A Systems
  • Internal Knowledge Bases for Organizations
  • Personal AI Assistants

Conclusion

By combining the power of LLMs with the efficiency of semantic caching and vector databases, this Python implementation showcases a practical approach to building more responsive and cost-effective AI-powered chat systems. As LLMs continue to evolve and become more integral to various applications, techniques like semantic caching will play a crucial role in optimizing their performance and accessibility.

This code serves as a starting point for developers looking to implement similar systems, offering a balance between the advanced capabilities of LLMs and the practical considerations of real-world applications.

Getting Started

Follow these instructions to set up and run the semantic cache implementation:

  1. Clone the repository:

    git clone https://github.com/TechcodexIO/simple-semantic-cache.git
    cd simple-semantic-cache
    
  2. Run the Docker Compose file to start the Qdrant vector database:

    docker-compose up -d
    
  3. Install the required Python libraries:

    pip install qdrant-client openai python-dotenv fastembed
    
  4. Copy the env.example file to .env and adjust it with your credentials and server information:

    cp env.example .env
    

    Edit the .env file and add your OpenAI API key, adjust Qdrant settings if necessary, and set the collection names for both implementations.

  5. To use OpenAI embeddings, run the llmcache.py script:

    python llmcache.py
    
  6. To use FastEmbed for embeddings, run the llmfastembedcache.py script:

    python llmfastembedcache.py
    

Both scripts will start an interactive chat interface. Type your queries and see the semantic cache in action. Type 'quit' to end the conversation or 'clear cache' to reset the semantic cache.
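
The FastEmbed variant replaces the OpenAI embedding calls with embeddings computed locally on your machine. A minimal sketch of what that looks like with the fastembed library (the model name below is the library's common default, not necessarily the one llmfastembedcache.py uses):

    from fastembed import TextEmbedding

    # Downloads a small ONNX model on first use and then runs locally (no API call per embedding).
    embedder = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")  # assumed model; 384-dimensional vectors

    def embed_local(text: str) -> list[float]:
        # embed() takes an iterable of documents and yields one numpy vector per document.
        return list(next(iter(embedder.embed([text]))))

Because locally computed vectors have a different dimensionality than OpenAI's, the two scripts keep their data apart, which is presumably why .env defines a separate collection name for each implementation.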

Remember to keep your API keys and credentials secure and never share them publicly.
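
For reference, the values configured in step 4 are typically consumed through python-dotenv (one of the packages installed above). A small sketch with illustrative variable names; check env.example for the actual keys used by the scripts:

    import os

    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from .env into the process environment

    # Variable names below are illustrative, not necessarily those defined in env.example.
    openai_api_key = os.getenv("OPENAI_API_KEY")
    qdrant_host = os.getenv("QDRANT_HOST", "localhost")
    qdrant_port = int(os.getenv("QDRANT_PORT", "6333"))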

Qdrant Vector Database

This project uses Qdrant as the vector database for storing and retrieving embeddings. Qdrant is running in a Docker container as part of the project setup.

Accessing the Qdrant Dashboard

To access the Qdrant dashboard:

  1. Ensure the Docker containers are running (docker-compose up -d); they should already be running from the earlier setup steps.
  2. Open a web browser and navigate to http://localhost:6333/dashboard

The dashboard provides a visual interface for managing collections, viewing metrics, and performing basic operations on your vector database.
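
Besides the dashboard, you can inspect the same collections programmatically with the qdrant-client package installed earlier, assuming the default local host and port:

    from qdrant_client import QdrantClient

    client = QdrantClient(host="localhost", port=6333)

    # List the cache collections created by the scripts and how many points each one holds.
    for collection in client.get_collections().collections:
        info = client.get_collection(collection.name)
        print(collection.name, info.points_count)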
