Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add classifier functionality #1

Open
wants to merge 35 commits into
base: main
Choose a base branch
from
Open

Add classifier functionality #1

wants to merge 35 commits into from

Conversation

Ayush-iitkgp
Copy link
Owner

@Ayush-iitkgp Ayush-iitkgp commented May 29, 2024

Added the following functionalities:

  1. created new Django models and improved the existing models to define new relationships and constraints.
  2. improved the existing GET {project_id}/threads endpoint to also return the label for each thread
  3. Asynchronous job to categorize a new thread based on the first message
  4. Asynchronous job to backfill the existing threads with a category
  5. created endpoints to create, modify, and delete the labels for each project
  6. implemented an endpoint to allow the user to correct the category of a conversation if they don’t agree with it.
  7. Wrote tests for the functionalities implemented.

@Ayush-iitkgp Ayush-iitkgp requested a review from bauefikapa June 5, 2024 23:51
Copy link
Collaborator

@bauefikapa bauefikapa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ayush-iitkgp thanks for creating this submission! I left some comments because I thought this way you might get the most out of it but absolutely no need to do any more work. I will get back to you seperately by email

@@ -86,6 +87,12 @@ class Project(AbstractBaseModel, AbstractProjectDependentModel):

project_name = models.CharField(max_length=200) # Airbyte
product_name = models.CharField(max_length=100) # for prompting: Airbyte
labels = ArrayField(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you opt for storing the possible labels here instead of in their own table?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a design decision I made. I have given a logic behind it here: https://docs.google.com/document/d/1wfz5er6BDpF2VjKPFLFruwUGQ10WdhOpRBn1Dm2hpQ4/edit

Basically, I have sacrificed storage for the computation since in general computation is more expensive than storage. I am happy to discuss it further :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exact computation do you mean?

Copy link
Owner Author

@Ayush-iitkgp Ayush-iitkgp Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are 2 approaches to implement the categories functionality in the existing code:

  1. The first approach is to define a new entity called Label that has many-to-many relationship with Project and one-to-many relationship with Thread. As far as I understand, it is preferred by you.
  2. The second approach is to introduce a new column called Labels (which is an array of string) in the Project table and a new column called Label (which is a string) in the Thread table. Also, introduce a constraint that Label in the thread can take one of the values defined in the Project. This is the approach I took.

Let's weigh the pros and cons of each approach:

  1. Let's look at the GET /{project_id}/threads endpoint.
  2. Let's have a look at the database operations when using the first approach, for each thread we would have to perform a left join with the Label table to retrieve the category of each thread. Let's say there are on average 100 threads per project, we will have to perform 100 left joins when getting data for GET /{project_id}/threads endpoint. Since joins are a computationally expensive operation, this would slow down the GET /{project_id}/threads endpoint. But the database is normalized so the storage is optimized. Hence, this approach is computationally expensive but inexpensive storage-wise.
  3. Let's look at the second approach that I took, the labels are already stored in the thread table hence no join operation will be performed when retrieving labels during the execution of the GET /{project_id}/threads endpoint. But in this case, the database is de-normalized which means we would have duplicate values in the label column of the thread table hence database size (hence storage) will be more than the first approach. This approach is storage-wise more expensive but computationally inexpensive.

Now, I had to decide between optimizing the system for the storage or the computation, I decided to optimize the system for the computation and hence I took the second approach.

Does it make sense now @bauefikapa?

I am aware that there are no good or bad decisions in the system design but only trade-offs. If you are not satisfied with the solution, do let me know I would be happy to refractor it :)

permission_classes = [HasProjectAPIKey]
serializer_class = ProjectSerializer

def post(self, request: Request, project_id: Union[uuid.UUID, str]) -> Response:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could essentially save yourself all of this code, if the labels where stored in their own database table. All of this functionality comes built in with the ModelViewSet

router = DefaultRouter()

urlpatterns = [
path("v1/", include(router.urls)),
path(
"v1/projects/<uuid:project_id>/labels", ProjectLabelView.as_view(), name="label"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to expose this in the org app or in the query app?

Copy link
Owner Author

@Ayush-iitkgp Ayush-iitkgp Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question.
I would need some more information about the directory structure of the project and the reasoning behind naming the different Django apps to make an informed opinion :)

@@ -20,6 +21,7 @@ class Thread(AbstractBaseModel, AbstractProjectDependentModel):
project = models.ForeignKey(
Project, on_delete=models.CASCADE, related_name="threads"
)
label = models.CharField(max_length=100, default=None, null=True, blank=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose this to be just a char field and not a foreign key to label stored in its own table?

Copy link
Owner Author

@Ayush-iitkgp Ayush-iitkgp Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same answer as above :)

pytestmark = pytest.mark.django_db


def test_classify_thread(question_answer) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test actually work? Normally you can not run anything in celery for the tests

Copy link
Owner Author

@Ayush-iitkgp Ayush-iitkgp Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it works.
Try running pytest tests/query/tasks/test_classify_thread.py from the terminal

I am not creating a task but rather executing it synchronously here:
classify_thread.apply(throw=True, kwargs={"thread_id": str(thread.id)})

query/urls.py Outdated Show resolved Hide resolved
query/tasks.py Show resolved Hide resolved
A model representing a LabelReview, which has a 1:1 relationship with Thread.
"""

id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little unsure why you would have a model like 'LabelReview' but then also have a label field on the 'Thread'

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This model is used when a user reviews the label and corrects it.
We save an instance of this model so as to record who was the user, what was the old label, what is the new label and what was the comment provided by the user when changing the label.

it is kind of a log of the user action when they correct the label of a thread.

Does it make sense now @bauefikapa?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense to have in general I think but then you don't need another char field on 'Thread'

@Ayush-iitkgp
Copy link
Owner Author

Hi @Ayush-iitkgp thanks for creating this submission! I left some comments because I thought this way you might get the most out of it but absolutely no need to do any more work. I will get back to you separately by email

Hi @bauefikapa thanks for the thorough review. I replied to some of your comments. I am happy to discuss my solution and the reasoning behind it in detail over a call (if you wish). I look forward to your email :)

Have a nice evening!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants