Ingester & Retriever: Add support for Weaviate #64

Merged
merged 57 commits into from
May 8, 2024
Changes from 53 commits
Commits
403f118
Add new pipeline DTOs
MichaelOwenDyer Feb 14, 2024
e7c74f2
Apply autoformatter
MichaelOwenDyer Feb 14, 2024
26e86ac
Have DTOs extend BaseModel
MichaelOwenDyer Feb 15, 2024
6997315
Add data package
MichaelOwenDyer Feb 15, 2024
128ea40
update retrieval interface and requirements
yassinsws Feb 19, 2024
0818109
Merge branch 'main' into feature/datastore
yassinsws Feb 20, 2024
70ed83f
Use cloud cluster for weaviate for now for the hackathon
yassinsws Feb 21, 2024
7acf809
Merge remote-tracking branch 'origin/feature/datastore' into feature/…
yassinsws Feb 21, 2024
2c0793a
fix splitting function.
yassinsws Feb 21, 2024
b4cb05d
Add content_service, data ingester and vector repository subsystems
yassinsws Feb 22, 2024
2f3882f
Merge branch 'main' into feature/datastore
yassinsws Feb 22, 2024
05490f2
fix lintin
yassinsws Feb 22, 2024
e08aac6
Merge remote-tracking branch 'origin/feature/datastore' into feature/…
yassinsws Feb 22, 2024
a29a44b
add a return statement to unzip
yassinsws Feb 22, 2024
0c96395
Add image recognition for Ollama, GPT4V and image generation for Dall-E
Hialus Mar 6, 2024
e9874b9
Solved requirements problem ( removed olama for now as weaviate needs…
yassinsws Mar 17, 2024
3a186c9
fixed requirements file
yassinsws Mar 17, 2024
379550b
fixed message interpretation function in the llm class
yassinsws Mar 17, 2024
0f6e576
Added detail parameter to image_interpretation model
yassinsws Mar 18, 2024
a4186c3
renamed pyris_image to iris_image
yassinsws Mar 18, 2024
9f2848e
Update app/content_service/Ingestion/lectures_ingestion.py
yassinsws Mar 18, 2024
224a701
Update app/content_service/Retrieval/abstract_retrieval.py
yassinsws Mar 18, 2024
93a2f44
Update app/content_service/Ingestion/repository_ingestion.py
yassinsws Mar 18, 2024
bca6377
Update app/content_service/Ingestion/lectures_ingestion.py
yassinsws Mar 18, 2024
6e9525d
Update app/content_service/Retrieval/lecture_retrieval.py
yassinsws Mar 18, 2024
bc97236
erase old lecture download files
yassinsws Mar 18, 2024
b0291b1
Add a function to get lectures from Artemis
yassinsws Mar 18, 2024
7211386
Update app/content_service/get_lecture_from_artemis.py
yassinsws Mar 24, 2024
0f57336
black
yassinsws Mar 31, 2024
22a96ab
Added method to delete objects and collections from the data base, ad…
yassinsws Apr 1, 2024
53edf86
Fix Linters
yassinsws Apr 7, 2024
57b0d72
Update app/content_service/Retrieval/abstract_retrieval.py
yassinsws Apr 7, 2024
b4acb1d
Solve merge Conflict and update Pr
yassinsws Apr 25, 2024
58ac585
Fix Requirements, ollama should be deleted because it's using an old …
yassinsws Apr 25, 2024
1ca6b8e
Merge remote-tracking branch 'origin/main' into feature/Ingestion_pip…
yassinsws Apr 25, 2024
6c60225
Update code
yassinsws May 3, 2024
a9c77c1
Update code
yassinsws May 3, 2024
69c791a
Flake8
yassinsws May 3, 2024
c7f53ee
Update and merge main with datastore PR
yassinsws May 3, 2024
aa247b8
Erase drafts of lecture_ingestion and repository_ingestion, because i…
yassinsws May 3, 2024
4dd3b3d
refractor code
yassinsws May 3, 2024
f06e884
refractor code
yassinsws May 3, 2024
008a9e5
refractor code
yassinsws May 3, 2024
4bd9cd2
implement request changes
yassinsws May 4, 2024
7021ba5
implement request changes
yassinsws May 5, 2024
bc75592
modify lecute_unit_dto
yassinsws May 5, 2024
0ac2712
make class into enum
yassinsws May 5, 2024
b50ea25
make class into enum
yassinsws May 5, 2024
586aa1e
Merge branch 'main' into feature/datastore
yassinsws May 5, 2024
4699fed
Erase content_service
yassinsws May 5, 2024
ea32c7b
Erase content_service
yassinsws May 5, 2024
b0e6f1d
fix lecture_schema
yassinsws May 5, 2024
1b477ff
replace import all classes only with the classes needed
yassinsws May 7, 2024
dca1493
replace import all classes only with the classes needed
yassinsws May 7, 2024
fd11add
Merge branch 'main' into feature/datastore
yassinsws May 7, 2024
3930890
Update requirements.txt
yassinsws May 7, 2024
5a70b5c
rename db to database
yassinsws May 7, 2024
2 changes: 2 additions & 0 deletions .env.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
WEAVIATE_HOST=
WEAVIATE_PORT=
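The two variables above are left empty in the example file. A minimal sketch of how they might be read at startup follows. The helper name `read_weaviate_settings` and the fallback values `localhost` and `8080` are assumptions made for illustration, not values taken from this PR.

```python
import os
from typing import Mapping, Tuple


def read_weaviate_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return (host, port) from WEAVIATE_HOST / WEAVIATE_PORT.

    The fallbacks below are hypothetical defaults chosen for this sketch.
    """
    host = env.get("WEAVIATE_HOST") or "localhost"
    port_text = env.get("WEAVIATE_PORT") or "8080"
    try:
        port = int(port_text)
    except ValueError as exc:
        raise ValueError(
            f"WEAVIATE_PORT must be an integer, got {port_text!r}"
        ) from exc
    return host, port
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.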
15 changes: 8 additions & 7 deletions app/domain/data/lecture_unit_dto.py
@@ -1,12 +1,13 @@
-from datetime import datetime
-from typing import Optional
-
 from pydantic import BaseModel, Field


 class LectureUnitDTO(BaseModel):
-    id: int
+    to_update: bool = Field(alias="toUpdate")
+    pdf_file_base64: str = Field(alias="pdfFile")
+    lecture_unit_id: int = Field(alias="lectureUnitId")
+    lecture_unit_name: str = Field(alias="lectureUnitName")
     lecture_id: int = Field(alias="lectureId")
-    release_date: Optional[datetime] = Field(alias="releaseDate", default=None)
-    name: Optional[str] = None
-    attachment_version: int = Field(alias="attachmentVersion")
+    lecture_name: str = Field(alias="lectureName")
+    course_id: int = Field(alias="courseId")
+    course_name: str = Field(alias="courseName")
+    course_description: str = Field(alias="courseDescription")
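The `pdf_file_base64` field (alias `pdfFile`) suggests that the lecture PDF arrives as a base64 string. A minimal sketch of decoding and sanity-checking such a value follows. The helper name `decode_pdf` is hypothetical, and the check for the `%PDF-` file signature is an assumption about what a consumer might want, not something shown in this PR.

```python
import base64
import binascii


def decode_pdf(pdf_file_base64: str) -> bytes:
    """Decode a base64 string and verify that it starts like a PDF file."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        data = base64.b64decode(pdf_file_base64, validate=True)
    except binascii.Error as exc:
        raise ValueError("pdfFile is not valid base64") from exc
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded pdfFile does not look like a PDF")
    return data
```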
Empty file added app/ingestion/__init__.py
Empty file.
29 changes: 29 additions & 0 deletions app/ingestion/abstract_ingestion.py
@@ -0,0 +1,29 @@
from abc import ABC, abstractmethod
from typing import List, Dict


class AbstractIngestion(ABC):
    """
    Abstract class for ingesting repositories into a database.
    """

    @abstractmethod
    def chunk_data(self, path: str) -> List[Dict[str, str]]:
        """
        Abstract method to chunk code files in the root directory.
        """
        pass

    @abstractmethod
    def ingest(self, path: str) -> bool:
        """
        Abstract method to ingest repositories into the database.
        """
        pass

    @abstractmethod
    def update(self, path: str):
        """
        Abstract method to update a repository in the database.
        """
        pass
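To illustrate how the interface above could be implemented, here is a minimal sketch. `InMemoryIngestion`, its `chunk_size` parameter, and the `filepath`/`content` dictionary keys are hypothetical names invented for this example. The PR's concrete ingestion classes are not shown in this diff, and a plain list stands in for the vector database.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class AbstractIngestion(ABC):
    """Same interface as in the diff, repeated so the sketch runs on its own."""

    @abstractmethod
    def chunk_data(self, path: str) -> List[Dict[str, str]]: ...

    @abstractmethod
    def ingest(self, path: str) -> bool: ...

    @abstractmethod
    def update(self, path: str): ...


class InMemoryIngestion(AbstractIngestion):
    """Hypothetical implementation that keeps chunks in a list."""

    def __init__(self, chunk_size: int = 200):
        self.chunk_size = chunk_size
        self.store: List[Dict[str, str]] = []

    def chunk_data(self, path: str) -> List[Dict[str, str]]:
        chunks = []
        for file in sorted(Path(path).rglob("*")):
            if not file.is_file():
                continue
            text = file.read_text(encoding="utf-8", errors="ignore")
            # Split the file into fixed-size character windows.
            for start in range(0, len(text), self.chunk_size):
                chunks.append(
                    {
                        "filepath": str(file),
                        "content": text[start : start + self.chunk_size],
                    }
                )
        return chunks

    def ingest(self, path: str) -> bool:
        chunks = self.chunk_data(path)
        self.store.extend(chunks)
        return bool(chunks)

    def update(self, path: str):
        root = Path(path)
        # Drop chunks that came from files under `path`, then ingest again.
        self.store = [
            c for c in self.store if root not in Path(c["filepath"]).parents
        ]
        self.ingest(path)
```

Fixed-size character windows are the simplest possible chunking strategy; a real pipeline would more likely split on syntactic or semantic boundaries.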
2 changes: 1 addition & 1 deletion app/pipeline/chat/tutor_chat_pipeline.py
@@ -182,7 +182,7 @@ def _add_student_repository_to_prompt(
         for file in selected_files:
             if file in student_repository:
                 self.prompt += SystemMessagePromptTemplate.from_template(
-                    f"For reference, we have access to the student's '{file}' file:"
+                    f"For reference, we have access to the student's '{file}' file: "
                 )
                 self.prompt += HumanMessagePromptTemplate.from_template(
                     student_repository[file].replace("{", "{{").replace("}", "}}")
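The `.replace("{", "{{").replace("}", "}}")` call in the context lines of this hunk doubles every brace so that file contents are not mistaken for template placeholders. The sketch below shows the same idea with Python's built-in `str.format` rather than the prompt template classes used in the pipeline; `escape_braces` is a hypothetical helper name.

```python
def escape_braces(text: str) -> str:
    """Double every brace so format-style templating treats it literally."""
    return text.replace("{", "{{").replace("}", "}}")


source = "int main() { return 0; }"

# Without escaping, format() would treat the text between the braces
# as a replacement field and fail to resolve it.
template = "Student file: " + escape_braces(source)
rendered = template.format()
```

After formatting, the doubled braces collapse back to single ones, so `rendered` contains the student's code unchanged.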
Empty file added app/retrieval/__init__.py
Empty file.
15 changes: 15 additions & 0 deletions app/retrieval/abstract_retrieval.py
@@ -0,0 +1,15 @@
from abc import ABC, abstractmethod
from typing import List


class AbstractRetrieval(ABC):
    """
    Abstract class for retrieving data from a database.
    """

    @abstractmethod
    def retrieve(self, path: str, hybrid_factor: float, result_limit: int) -> List[str]:
        """
        Abstract method to retrieve data from the database.
        """
        pass
43 changes: 43 additions & 0 deletions app/retrieval/lecture_retrieval.py
@@ -0,0 +1,43 @@
from abc import ABC
from typing import List, Optional

from weaviate import WeaviateClient
from weaviate.classes.query import Filter

from app.retrieval.abstract_retrieval import AbstractRetrieval
from app.vector_database.lecture_schema import init_lecture_schema, LectureSchema


class LectureRetrieval(AbstractRetrieval, ABC):
    """
    Class for retrieving lecture data from the database.
    """

    def __init__(self, client: WeaviateClient):
        self.collection = init_lecture_schema(client)

    def retrieve(
        self,
        user_message: str,
        hybrid_factor: float,
        result_limit: int,
        lecture_id: Optional[int] = None,
        message_vector: Optional[List[float]] = None,
    ) -> List[str]:
        response = self.collection.query.hybrid(
            query=user_message,
            filters=(
                Filter.by_property(LectureSchema.LECTURE_ID.value).equal(lecture_id)
                if lecture_id is not None
                else None
            ),
            alpha=hybrid_factor,
            vector=message_vector,
            return_properties=[
                LectureSchema.PAGE_TEXT_CONTENT.value,
                LectureSchema.PAGE_IMAGE_DESCRIPTION.value,
                LectureSchema.COURSE_NAME.value,
            ],
            limit=result_limit,
        )
        return response
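`hybrid_factor` is passed to Weaviate as `alpha`, which weights vector search against keyword search: `alpha=1` is pure vector search and `alpha=0` is pure keyword search. The sketch below illustrates that weighting on scores that are already normalized to the range 0 to 1. It is a simplified illustration, not Weaviate's actual fusion algorithm, and `blend_scores` and the sample scores are hypothetical.

```python
from typing import Dict, List


def blend_scores(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    alpha: float,
    limit: int,
) -> List[str]:
    """Rank ids by alpha * vector score + (1 - alpha) * keyword score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ids = set(vector_scores) | set(keyword_scores)
    combined = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    # Sort by score (descending), then by id so that ties are deterministic.
    ranked = sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))
    return ranked[:limit]


vector_scores = {"slide-1": 0.9, "slide-2": 0.2}
keyword_scores = {"slide-2": 1.0, "slide-3": 0.5}
```

With these sample scores, `alpha=1.0` ranks `slide-1` first because only the vector scores count, while `alpha=0.0` ranks `slide-2` first because only the keyword scores count.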
45 changes: 45 additions & 0 deletions app/retrieval/repositories_retrieval.py
@@ -0,0 +1,45 @@
from typing import List, Optional

from weaviate import WeaviateClient
from weaviate.classes.query import Filter

from app.retrieval.abstract_retrieval import AbstractRetrieval
from app.vector_database.repository_schema import (
    init_repository_schema,
    RepositorySchema,
)


class RepositoryRetrieval(AbstractRetrieval):
    """
    Class for retrieving repository code from the vector database.
    """

    def __init__(self, client: WeaviateClient):
        self.collection = init_repository_schema(client)

    def retrieve(
        self,
        user_message: str,
        result_limit: int,
        repository_id: Optional[int] = None,
    ) -> List[str]:
        response = self.collection.query.near_text(
            query=user_message,
            filters=(
                Filter.by_property(RepositorySchema.REPOSITORY_ID.value).equal(
                    repository_id
                )
                if repository_id is not None
                else None
            ),
            return_properties=[
                RepositorySchema.REPOSITORY_ID.value,
                RepositorySchema.COURSE_ID.value,
                RepositorySchema.CONTENT.value,
                RepositorySchema.EXERCISE_ID.value,
                RepositorySchema.FILEPATH.value,
            ],
            limit=result_limit,
        )
        return response
Empty file.
44 changes: 44 additions & 0 deletions app/vector_database/db.py
@@ -0,0 +1,44 @@
import logging
import os
import weaviate
from app.vector_database.lecture_schema import init_lecture_schema
from app.vector_database.repository_schema import init_repository_schema
import weaviate.classes as wvc

logger = logging.getLogger(__name__)


class VectorDatabase:
    """
    Class to interact with the Weaviate vector database
    """

    def __init__(self):
        self.client = weaviate.connect_to_wcs(
            cluster_url=os.getenv("WEAVIATE_CLUSTER_URL"),
            auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WEAVIATE_AUTH_KEY")),
        )
        self.repositories = init_repository_schema(self.client)
        self.lectures = init_lecture_schema(self.client)

    def __del__(self):
        self.client.close()

    def delete_collection(self, collection_name):
        """
        Delete a collection from the database
        """
        if self.client.collections.exists(collection_name):
            if self.client.collections.delete(collection_name):
                logger.info(f"Collection {collection_name} deleted")
            else:
                logger.error(f"Collection {collection_name} failed to delete")

    def delete_object(self, collection_name, property_name, object_property):
        """
        Delete an object from the collection inside the database
        """
        collection = self.client.collections.get(collection_name)
        collection.data.delete_many(
            where=wvc.query.Filter.by_property(property_name).equal(object_property)
        )
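`VectorDatabase` closes the client in `__del__`, which Python does not guarantee to run at a predictable time. One alternative is a context manager that always closes the connection when the block ends. The sketch below uses a stand-in `FakeClient` class so that it runs without a Weaviate server; `FakeClient` and `open_client` are hypothetical names, and the only assumption about the real client is that it exposes `close()`, as used in the diff.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class FakeClient:
    """Stand-in for a database client; it only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextmanager
def open_client(connect: Callable[[], FakeClient]) -> Iterator[FakeClient]:
    """Yield a connected client and close it even if the body raises."""
    client = connect()
    try:
        yield client
    finally:
        client.close()
```

A caller would write `with open_client(FakeClient) as client: ...`, and the connection is released as soon as the block exits, whether normally or through an exception.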
97 changes: 97 additions & 0 deletions app/vector_database/lecture_schema.py
@@ -0,0 +1,97 @@
from enum import Enum

from weaviate.classes.config import Property
from weaviate import WeaviateClient
from weaviate.collections import Collection
from weaviate.collections.classes.config import Configure, VectorDistances, DataType


class LectureSchema(Enum):
    """
    Schema for the lecture slides
    """

    COLLECTION_NAME = "LectureSlides"
    COURSE_NAME = "course_name"
    COURSE_DESCRIPTION = "course_description"
    COURSE_ID = "course_id"
    LECTURE_ID = "lecture_id"
    LECTURE_NAME = "lecture_name"
    LECTURE_UNIT_ID = "lecture_unit_id"
    LECTURE_UNIT_NAME = "lecture_unit_name"
    PAGE_TEXT_CONTENT = "page_text_content"
    PAGE_IMAGE_DESCRIPTION = "page_image_explanation"
    PAGE_BASE64 = "page_base64"
    PAGE_NUMBER = "page_number"


def init_lecture_schema(client: WeaviateClient) -> Collection:
    """
    Initialize the schema for the lecture slides
    """
    if client.collections.exists(LectureSchema.COLLECTION_NAME.value):
        return client.collections.get(LectureSchema.COLLECTION_NAME.value)
    return client.collections.create(
        name=LectureSchema.COLLECTION_NAME.value,
        vectorizer_config=Configure.Vectorizer.none(),
        vector_index_config=Configure.VectorIndex.hnsw(
            distance_metric=VectorDistances.COSINE
        ),
        properties=[
            Property(
                name=LectureSchema.COURSE_ID.value,
                description="The ID of the course",
                data_type=DataType.INT,
            ),
            Property(
                name=LectureSchema.COURSE_NAME.value,
                description="The name of the course",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.COURSE_DESCRIPTION.value,
                description="The description of the course",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.LECTURE_ID.value,
                description="The ID of the lecture",
                data_type=DataType.INT,
            ),
            Property(
                name=LectureSchema.LECTURE_NAME.value,
                description="The name of the lecture",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.LECTURE_UNIT_ID.value,
                description="The ID of the lecture unit",
                data_type=DataType.INT,
            ),
            Property(
                name=LectureSchema.LECTURE_UNIT_NAME.value,
                description="The name of the lecture unit",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.PAGE_TEXT_CONTENT.value,
                description="The original text content from the slide",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.PAGE_IMAGE_DESCRIPTION.value,
                description="The description of the slide if the slide contains an image",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.PAGE_BASE64.value,
                description="The base64 encoded image of the slide if the slide contains an image",
                data_type=DataType.TEXT,
            ),
            Property(
                name=LectureSchema.PAGE_NUMBER.value,
                description="The page number of the slide",
                data_type=DataType.INT,
            ),
        ],
    )
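Because the property names live in an `Enum`, code that writes to the collection can build its payload from `.value` instead of repeating string literals. A minimal sketch follows, assuming a hypothetical helper `build_slide_properties` and made-up sample values. Only a subset of the enum is repeated here so that the block runs on its own.

```python
from enum import Enum
from typing import Any, Dict


class LectureSchema(Enum):
    """Subset of the schema enum from the diff, repeated for this sketch."""

    COLLECTION_NAME = "LectureSlides"
    LECTURE_ID = "lecture_id"
    PAGE_TEXT_CONTENT = "page_text_content"
    PAGE_NUMBER = "page_number"


def build_slide_properties(
    lecture_id: int, page_number: int, text: str
) -> Dict[str, Any]:
    """Map slide data onto the property names defined by the schema enum."""
    return {
        LectureSchema.LECTURE_ID.value: lecture_id,
        LectureSchema.PAGE_NUMBER.value: page_number,
        LectureSchema.PAGE_TEXT_CONTENT.value: text,
    }


properties = build_slide_properties(
    lecture_id=7, page_number=3, text="Sorting algorithms"
)
```

Renaming a property then requires a change in one place only, and a misspelled member name fails at import time rather than silently writing to the wrong property.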
60 changes: 60 additions & 0 deletions app/vector_database/repository_schema.py
@@ -0,0 +1,60 @@
from enum import Enum
from weaviate.classes.config import Property
from weaviate import WeaviateClient
from weaviate.collections import Collection
from weaviate.collections.classes.config import Configure, VectorDistances, DataType


class RepositorySchema(Enum):
    """
    Schema for the student repository
    """

    COLLECTION_NAME = "StudentRepository"
    CONTENT = "content"
    COURSE_ID = "course_id"
    EXERCISE_ID = "exercise_id"
    REPOSITORY_ID = "repository_id"
    FILEPATH = "filepath"


def init_repository_schema(client: WeaviateClient) -> Collection:
    """
    Initialize the schema for the student repository
    """
    if client.collections.exists(RepositorySchema.COLLECTION_NAME.value):
        return client.collections.get(RepositorySchema.COLLECTION_NAME.value)
    return client.collections.create(
        name=RepositorySchema.COLLECTION_NAME.value,
        vectorizer_config=Configure.Vectorizer.none(),
        vector_index_config=Configure.VectorIndex.hnsw(
            distance_metric=VectorDistances.COSINE
        ),
        properties=[
            Property(
                name=RepositorySchema.CONTENT.value,
                description="The content of this chunk of code",
                data_type=DataType.TEXT,
            ),
            Property(
                name=RepositorySchema.COURSE_ID.value,
                description="The ID of the course",
                data_type=DataType.INT,
            ),
            Property(
                name=RepositorySchema.EXERCISE_ID.value,
                description="The ID of the exercise",
                data_type=DataType.INT,
            ),
            Property(
                name=RepositorySchema.REPOSITORY_ID.value,
                description="The ID of the repository",
                data_type=DataType.INT,
            ),
            Property(
                name=RepositorySchema.FILEPATH.value,
                description="The filepath of the code",
                data_type=DataType.TEXT,
            ),
        ],
    )
6 changes: 4 additions & 2 deletions requirements.txt
@@ -6,6 +6,8 @@ ollama==0.1.9
openai==1.23.6
pre-commit==3.7.0
pydantic==2.7.1
PyMuPDF==1.23.22
PyYAML==6.0.1
uvicorn==0.29.0
requests~=2.31.0
requests~=2.31.0
uvicorn==0.27.1
weaviate-client==4.5.4