Add classifier functionality #1

Ayush-iitkgp · 2024-05-29T19:20:28Z

Added the following functionalities:

Created new Django models and improved the existing models to define new relationships and constraints.
Improved the existing GET {project_id}/threads endpoint to also return the label for each thread.
Added an asynchronous job to categorize a new thread based on its first message.
Added an asynchronous job to backfill the existing threads with a category.
Created endpoints to create, modify, and delete the labels for each project.
Implemented an endpoint that lets the user correct the category of a conversation if they don't agree with it.
Wrote tests for the functionality implemented.

bauefikapa

Hi @Ayush-iitkgp, thanks for creating this submission! I left some comments because I thought you might get the most out of it this way, but there is absolutely no need to do any more work. I will get back to you separately by email.

bauefikapa · 2024-06-06T07:57:04Z

org/models.py

@@ -86,6 +87,12 @@ class Project(AbstractBaseModel, AbstractProjectDependentModel):

    project_name = models.CharField(max_length=200)  # Airbyte
    product_name = models.CharField(max_length=100)  # for prompting: Airbyte
+    labels = ArrayField(


Why did you opt for storing the possible labels here instead of in their own table?

This was a design decision I made. I have explained the reasoning behind it here: https://docs.google.com/document/d/1wfz5er6BDpF2VjKPFLFruwUGQ10WdhOpRBn1Dm2hpQ4/edit

Basically, I traded storage for computation, since computation is generally more expensive than storage. I am happy to discuss it further :)

What exact computation do you mean?

There are two approaches to implementing the categories functionality in the existing code:

The first approach is to define a new entity called Label that has a many-to-many relationship with Project and a one-to-many relationship with Thread. As far as I understand, this is the approach you prefer.

The second approach is to introduce a new column called Labels (an array of strings) in the Project table and a new column called Label (a string) in the Thread table, together with a constraint that the Label of a thread can only take one of the values defined in its Project. This is the approach I took.
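The constraint described in the second approach can be illustrated in plain Python. This is only a minimal sketch: the function name `validate_thread_label` and its arguments are hypothetical, and the PR's actual mechanism for enforcing the constraint is not shown in this thread.

```python
from typing import Optional, Sequence


def validate_thread_label(
    label: Optional[str], project_labels: Sequence[str]
) -> Optional[str]:
    """Return the label if the project allows it, otherwise raise ValueError.

    None is accepted because the label column in the diff is nullable,
    i.e. a thread may not have been categorized yet.
    """
    if label is None:
        return None
    if label not in project_labels:
        raise ValueError(
            f"{label!r} is not one of the project's labels: {list(project_labels)}"
        )
    return label
```

A check like this would run wherever a thread's label is written, for example in the classification job and in the endpoint that lets a user correct a label.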

Let's weigh the pros and cons of each approach, using the GET /{project_id}/threads endpoint as the example.

With the first approach, retrieving the category of each thread requires a left join with the Label table. Assuming an average of 100 threads per project, that means 100 left joins when fetching the data for GET /{project_id}/threads. Since joins are computationally expensive, this would slow the endpoint down. On the other hand, the database is normalized, so storage is optimized. This approach is therefore computationally expensive but inexpensive storage-wise.

With the second approach, which I took, the labels are already stored in the thread table, so no join is performed when retrieving labels for GET /{project_id}/threads. In this case, however, the database is denormalized: the label column of the thread table holds duplicate values, so the database size (and hence storage) is larger than with the first approach. This approach is more expensive storage-wise but computationally inexpensive.

I had to decide between optimizing the system for storage or for computation. I chose to optimize for computation, and hence I took the second approach.
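To make the two query shapes under discussion concrete, here is a small self-contained sketch using SQLite from the Python standard library rather than the project's Django models. The table names (`label`, `thread_fk`, `thread_char`) are hypothetical, and the first schema is simplified to a `project_id` column on the label table instead of a many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Approach 1 (normalized): labels live in their own table.
    CREATE TABLE label (id INTEGER PRIMARY KEY, project_id INTEGER, name TEXT);
    CREATE TABLE thread_fk (
        id INTEGER PRIMARY KEY,
        project_id INTEGER,
        label_id INTEGER REFERENCES label(id)
    );
    -- Approach 2 (denormalized): the label text is stored on the thread.
    CREATE TABLE thread_char (id INTEGER PRIMARY KEY, project_id INTEGER, label TEXT);
    """
)
conn.executemany("INSERT INTO label VALUES (?, ?, ?)", [(1, 1, "bug"), (2, 1, "billing")])
conn.executemany("INSERT INTO thread_fk VALUES (?, ?, ?)", [(1, 1, 1), (2, 1, 2), (3, 1, None)])
conn.executemany(
    "INSERT INTO thread_char VALUES (?, ?, ?)",
    [(1, 1, "bug"), (2, 1, "billing"), (3, 1, None)],
)


def labels_normalized(project_id: int) -> list:
    """One query with a LEFT JOIN so unlabeled threads are still returned."""
    return conn.execute(
        "SELECT t.id, l.name FROM thread_fk AS t "
        "LEFT JOIN label AS l ON l.id = t.label_id "
        "WHERE t.project_id = ? ORDER BY t.id",
        (project_id,),
    ).fetchall()


def labels_denormalized(project_id: int) -> list:
    """One query with no join; the label is read straight off the thread row."""
    return conn.execute(
        "SELECT id, label FROM thread_char WHERE project_id = ? ORDER BY id",
        (project_id,),
    ).fetchall()
```

Both functions return the same rows for a project. In the normalized version the join is written once in the query and the database applies it across the project's threads; whether that extra lookup matters at the scale of roughly 100 threads is something to measure rather than assume.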

Does it make sense now @bauefikapa?

I am aware that there are no good or bad decisions in system design, only trade-offs. If you are not satisfied with the solution, do let me know and I would be happy to refactor it :)

bauefikapa · 2024-06-06T08:02:46Z

org/views/label.py

+    permission_classes = [HasProjectAPIKey]
+    serializer_class = ProjectSerializer
+
+    def post(self, request: Request, project_id: Union[uuid.UUID, str]) -> Response:


You could essentially save yourself all of this code if the labels were stored in their own database table. All of this functionality comes built in with the ModelViewSet.

bauefikapa · 2024-06-06T08:04:46Z

org/urls.py

 router = DefaultRouter()

 urlpatterns = [
    path("v1/", include(router.urls)),
+    path(
+        "v1/projects/<uuid:project_id>/labels", ProjectLabelView.as_view(), name="label"


Does it make sense to expose this in the org app or in the query app?

Good question.
I would need some more information about the directory structure of the project and the reasoning behind the naming of the different Django apps to form an informed opinion :)

bauefikapa · 2024-06-06T08:07:07Z

query/models.py

@@ -20,6 +21,7 @@ class Thread(AbstractBaseModel, AbstractProjectDependentModel):
    project = models.ForeignKey(
        Project, on_delete=models.CASCADE, related_name="threads"
    )
+    label = models.CharField(max_length=100, default=None, null=True, blank=True)


Why did you choose this to be just a char field and not a foreign key to label stored in its own table?

Same answer as above :)

bauefikapa · 2024-06-06T08:10:16Z

tests/query/tasks/test_classify_thread.py

+pytestmark = pytest.mark.django_db
+
+
+def test_classify_thread(question_answer) -> None:


Does this test actually work? Normally you cannot run anything in Celery during tests.

Yes, it works.
Try running pytest tests/query/tasks/test_classify_thread.py from the terminal

I am not creating a task but rather executing it synchronously here:
classify_thread.apply(throw=True, kwargs={"thread_id": str(thread.id)})

bauefikapa · 2024-06-06T08:25:42Z

query/models.py

+    A model representing a LabelReview, which has a 1:1 relationship with Thread.
+    """
+
+    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)


I am a little unsure why you would have a model like 'LabelReview' but then also have a label field on the 'Thread'

This model is used when a user reviews the label and corrects it.
We save an instance of this model to record who the user was, what the old label was, what the new label is, and what comment the user provided when changing the label.

It is essentially a log of the user's action when they correct the label of a thread.
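The idea of logging a correction can be sketched in plain Python. This is a stand-in for illustration, not the Django `LabelReview` model from the PR: the names `LabelReviewEntry` and `correct_label`, the dict-based thread, and the list-based log are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class LabelReviewEntry:
    """Immutable record of one user correction (hypothetical stand-in)."""

    thread_id: str
    reviewer: str
    old_label: Optional[str]
    new_label: str
    comment: str = ""
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def correct_label(
    thread: dict,
    new_label: str,
    reviewer: str,
    comment: str,
    log: List[LabelReviewEntry],
) -> LabelReviewEntry:
    """Apply the user's correction to the thread and record it in the log."""
    entry = LabelReviewEntry(
        thread_id=thread["id"],
        reviewer=reviewer,
        old_label=thread.get("label"),  # capture the value before overwriting it
        new_label=new_label,
        comment=comment,
    )
    thread["label"] = new_label
    log.append(entry)
    return entry
```

Here the thread keeps its current label while the entry preserves the previous one, which is the arrangement the reviewer questions in the reply that follows.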

Does it make sense now @bauefikapa?

Yeah, that makes sense to have in general, I think, but then you don't need another char field on 'Thread'.

Ayush-iitkgp · 2024-06-06T16:29:55Z

Hi @Ayush-iitkgp thanks for creating this submission! I left some comments because I thought this way you might get the most out of it but absolutely no need to do any more work. I will get back to you separately by email

Hi @bauefikapa thanks for the thorough review. I replied to some of your comments. I am happy to discuss my solution and the reasoning behind it in detail over a call (if you wish). I look forward to your email :)

Have a nice evening!

ayush-ruhr and others added 18 commits May 24, 2024 00:27

Add .venv to git ignore

02ead3d

Fix docker file configuration and exposed db port to the system

35d9da6

Added git pre-commit configurations to the project

7ce77e5

Add python pre-commit config

d61bc09

max line length is 120 characters

97cf44e

fix pre-commit configuration

146760c

Added new models, modified the existing models, added a new classifie…

b401e17

…r agent class and added migration script

Added link to the problem statement in the README

bb48539

Fix the flow diagram path in the problem_statement.md file

e9582d1

Removed Label model and added new fields in the project and thread mo…

67a7059

…dels. and also include the migration script

Added label to the body of the GET {project_id}/threads endpoint

9353dac

Added label as the response to the GET {project_id}/threads endpoint …

3f4da30

…and added test for it

First working version of the test

3cc6177

Added the functionality to label a new conversation asynchronously an…

d9faaab

…d added test for it

Add test for the contraint on the thread

21efb03

First set up for the backfilling job

5a3c8b8

Run the backfill job every 10 minutes

08282ba

Add endpoints for creating, updating and deleting labels

356d3b7

Ayush-iitkgp force-pushed the dev branch from 85edb42 to 356d3b7 on June 4, 2024 23:07

Ayush-iitkgp added 11 commits June 5, 2024 01:31

task are working now

538a507

Added to todo list in readme

c54dbe3

improve readme

40dbeec

improve readme

34a51a0

Add latest postman collection

3251734

Add tests for thread classification service and thread backfill service

812dd1a

Add github action for test

c42aacf

Add github action for test

8005911

use line lenght of 120 characters in the github action

68c1dae

add endpoint for updting the thread label by the user

f1554e9

update readme

fd38026

Ayush-iitkgp added 3 commits June 6, 2024 01:49

take 1000 instead of 100 threads each 10 minutes for backfilling

cd71ddc

take 1000 instead of 100 threads each 10 minutes for backfilling

4466a07

update readme

46de42a

Ayush-iitkgp requested a review from bauefikapa June 5, 2024 23:51

update readme

1f9b1aa

bauefikapa reviewed Jun 6, 2024

Ayush-iitkgp added 2 commits June 14, 2024 12:17

moved update-label to ThreadView

9c85dcf

new endpoint to update_labels of a thread

8afebb3
