-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add classifier functionality #1
base: main
Are you sure you want to change the base?
Conversation
…r agent class and added migration script
…dels. and also include the migration script
…and added test for it
…d added test for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ayush-iitkgp thanks for creating this submission! I left some comments because I thought this way you might get the most out of it but absolutely no need to do any more work. I will get back to you seperately by email
@@ -86,6 +87,12 @@ class Project(AbstractBaseModel, AbstractProjectDependentModel): | |||
|
|||
project_name = models.CharField(max_length=200) # Airbyte | |||
product_name = models.CharField(max_length=100) # for prompting: Airbyte | |||
labels = ArrayField( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you opt for storing the possible labels here instead of in their own table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a design decision I made. I have given a logic behind it here: https://docs.google.com/document/d/1wfz5er6BDpF2VjKPFLFruwUGQ10WdhOpRBn1Dm2hpQ4/edit
Basically, I have sacrificed storage for the computation since in general computation is more expensive than storage. I am happy to discuss it further :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exact computation do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So there are 2 approaches to implement the categories functionality in the existing code:
- The first approach is to define a new entity called Label that has many-to-many relationship with Project and one-to-many relationship with Thread. As far as I understand, it is preferred by you.
- The second approach is to introduce a new column called Labels (which is an array of string) in the Project table and a new column called Label (which is a string) in the Thread table. Also, introduce a constraint that Label in the thread can take one of the values defined in the Project. This is the approach I took.
Let's weigh the pros and cons of each approach:
- Let's look at the GET /{project_id}/threads endpoint.
- Let's have a look at the database operations when using the first approach, for each thread we would have to perform a left join with the Label table to retrieve the category of each thread. Let's say there are on average 100 threads per project, we will have to perform 100 left joins when getting data for GET /{project_id}/threads endpoint. Since joins are a computationally expensive operation, this would slow down the GET /{project_id}/threads endpoint. But the database is normalized so the storage is optimized. Hence, this approach is computationally expensive but inexpensive storage-wise.
- Let's look at the second approach that I took, the labels are already stored in the thread table hence no join operation will be performed when retrieving labels during the execution of the GET /{project_id}/threads endpoint. But in this case, the database is de-normalized which means we would have duplicate values in the label column of the thread table hence database size (hence storage) will be more than the first approach. This approach is storage-wise more expensive but computationally inexpensive.
Now, I had to decide between optimizing the system for the storage or the computation, I decided to optimize the system for the computation and hence I took the second approach.
Does it make sense now @bauefikapa?
I am aware that there are no good or bad decisions in the system design but only trade-offs. If you are not satisfied with the solution, do let me know I would be happy to refractor it :)
permission_classes = [HasProjectAPIKey] | ||
serializer_class = ProjectSerializer | ||
|
||
def post(self, request: Request, project_id: Union[uuid.UUID, str]) -> Response: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could essentially save yourself all of this code, if the labels where stored in their own database table. All of this functionality comes built in with the ModelViewSet
router = DefaultRouter() | ||
|
||
urlpatterns = [ | ||
path("v1/", include(router.urls)), | ||
path( | ||
"v1/projects/<uuid:project_id>/labels", ProjectLabelView.as_view(), name="label" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to expose this in the org app or in the query app?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question.
I would need some more information about the directory structure of the project and the reasoning behind naming the different Django apps to make an informed opinion :)
@@ -20,6 +21,7 @@ class Thread(AbstractBaseModel, AbstractProjectDependentModel): | |||
project = models.ForeignKey( | |||
Project, on_delete=models.CASCADE, related_name="threads" | |||
) | |||
label = models.CharField(max_length=100, default=None, null=True, blank=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you choose this to be just a char field and not a foreign key to label stored in its own table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same answer as above :)
pytestmark = pytest.mark.django_db | ||
|
||
|
||
def test_classify_thread(question_answer) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test actually work? Normally you can not run anything in celery for the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it works.
Try running pytest tests/query/tasks/test_classify_thread.py
from the terminal
I am not creating a task but rather executing it synchronously here:
classify_thread.apply(throw=True, kwargs={"thread_id": str(thread.id)})
A model representing a LabelReview, which has a 1:1 relationship with Thread. | ||
""" | ||
|
||
id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little unsure why you would have a model like 'LabelReview' but then also have a label field on the 'Thread'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This model is used when a user reviews the label and corrects it.
We save an instance of this model so as to record who was the user, what was the old label, what is the new label and what was the comment provided by the user when changing the label.
it is kind of a log of the user action when they correct the label of a thread.
Does it make sense now @bauefikapa?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense to have in general I think but then you don't need another char field on 'Thread'
Hi @bauefikapa thanks for the thorough review. I replied to some of your comments. I am happy to discuss my solution and the reasoning behind it in detail over a call (if you wish). I look forward to your email :) Have a nice evening! |
Added the following functionalities: