
[Meta] Gather issues observed with ML-bot and find ways to improve it #256

Open
softvision-oana-arbuzov opened this issue Mar 10, 2022 · 9 comments

@softvision-oana-arbuzov

This task was created to gather the issues observed with the ML-bot's behavior and to come up with ways to improve it.

softvision-oana-arbuzov commented Mar 10, 2022

@softvision-raul-bucata : We've observed that the ML bot changes the milestone to ml-autoclosed for what seem to be valid issues.
We've gathered a list and we'll continue updating it.

If the ML-BOT needs adjustment, we can manually triage the issues that seem valid, and move them to the relevant milestone until the adjustment is made.
Please let us know if there is anything we can do to help.

softvision-oana-arbuzov commented Mar 10, 2022

@ksy36's reply:
While this is expected to some extent, I do see that the number of potentially valid issues that it recently closed is not desirable.

Right now the criterion for an issue to be considered "valid" is whether it was moved to the "needsdiagnosis" milestone. The ML model takes into account the content of such issues (domain, tested in a different browser, UA, description, etc.) and, based on that, classifies incoming issues as valid or invalid. Since we have a lot more "invalid" issues than "valid" ones, and that gap grows over time, I think this shift is bound to happen.
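
As a minimal sketch of the setup described above (not the actual bugbug pipeline; the field names and scikit-learn components are illustrative assumptions):

```python
# Minimal sketch of the valid/invalid classification described above.
# The field names and model choice are assumptions for illustration;
# the real bugbug model is more involved.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def label_from_milestones(milestones):
    """Current rule: an issue counts as "valid" if it reached needsdiagnosis."""
    return 1 if "needsdiagnosis" in milestones else 0


def issue_to_text(issue):
    """Concatenate the content fields the model looks at."""
    return " ".join([
        issue.get("domain", ""),
        issue.get("tested_other_browser", ""),
        issue.get("ua", ""),
        issue.get("description", ""),
    ])


def train(issues):
    texts = [issue_to_text(i) for i in issues]
    labels = [label_from_milestones(i["milestones"]) for i in issues]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model
```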

I've been looking at ml-autoclosed issues from time to time and moving some of them to needstriage/needsdiagnosis for the model to learn. But a lot of issues that seem valid are not reproducible, so even if an issue is reopened, it does not necessarily contribute to the "valid" pool.

We can try to improve the current rate. I've experimented with training a new model with these changes, using the issues that you sent, and it seems to make the predictions more accurate:

  1. Consider issues that are in the "moved" milestone to be "valid".
    Some issues in this milestone never reached the "needsdiagnosis" milestone and were moved straight to Bugzilla (mostly ETP issues, I think). The content of these issues can likely be considered valid.

  2. Consider issues that were moved from "ml-autoclosed" to "accepted" to be valid (regardless of whether they reached "needsdiagnosis" or not).
    We can try to teach the model that way and see how the metrics change over time (see the sketch after this list).
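
For illustration, here is a minimal sketch that combines the original "needsdiagnosis" rule with changes 1 and 2 above. Representing an issue as a chronological list of milestone names is an assumption, not the actual bugbug data model:

```python
# Sketch of the adjusted training labels from points 1 and 2 above.
# Assumes each issue is a chronological list of milestone names; this
# is an illustrative data shape, not the real bugbug one.
def is_valid_for_training(milestone_history):
    # Original rule: the issue reached "needsdiagnosis".
    if "needsdiagnosis" in milestone_history:
        return True
    # Change 1: issues that went straight to the "moved" milestone
    # (e.g. ETP issues moved directly to Bugzilla) count as valid.
    if "moved" in milestone_history:
        return True
    # Change 2: issues reopened from "ml-autoclosed" into "accepted"
    # count as valid, even if they never reached "needsdiagnosis".
    for prev, nxt in zip(milestone_history, milestone_history[1:]):
        if prev == "ml-autoclosed" and nxt == "accepted":
            return True
    return False
```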

> If the ML-BOT needs adjustment, we can manually triage the issues that seem valid, and move them to the relevant milestone until the adjustment is made.

So yeah, that would be awesome, if you have some time. If you notice an issue that seems reasonable and potentially valid, you could move it to the "accepted" milestone, as you did with the issues on that list. It doesn't need to be done on a daily or weekly basis; maybe every other week or so.

I'll make these 2 changes to the model and will keep reopening issues that seem potentially valid as well.

I could also experiment with the model further to make it more accurate. One thing worth mentioning: if we receive a lot of duplicates for a certain issue, they never end up in needsdiagnosis. At some point there are so many duplicates that the model considers most future duplicates "invalid", since the weight is so much heavier on the "invalid" side and there is only one "valid" issue (imgur.com is an example of that). While the duplicates are technically "valid", they are not actionable, so the fact that they're automatically closed works for us.
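
As a toy illustration of that duplicate-weight effect (made-up data, not actual webcompat issues):

```python
# Toy illustration of the duplicate-weight effect described above: with
# many near-identical "invalid" duplicates for one domain and a single
# "valid" report, a simple text classifier will call the next duplicate
# invalid. The example data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

invalid_dupes = ["imgur.com images do not load"] * 50
valid_report = ["imgur.com images do not load in Firefox but work in Chrome"]

texts = invalid_dupes + valid_report
labels = [0] * len(invalid_dupes) + [1]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Almost all of the probability mass lands on the "invalid" class.
print(model.predict_proba(["imgur.com images do not load"]))
```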

softvision-oana-arbuzov changed the title from "[Meta] Gather issues observe with ML-bot" to "[Meta] Gather issues observed with ML-bot and find ways to improve it" on Mar 10, 2022

ksy36 commented Mar 10, 2022

Maybe we could add an additional column to track whether an issue ended up in an actionable milestone (needsdiagnosis or moved). I've tried to add it in the spreadsheet, but I don't have edit access.

I think this is the only such issue as of today:
https://github.com/webcompat/web-bugs-private/issues/48890

And these 3 might be valid, but we can't test them, as a special account is needed:
https://github.com/webcompat/web-bugs-private/issues/48974
https://github.com/webcompat/web-bugs-private/issues/48919
https://github.com/webcompat/web-bugs-private/issues/48914

@softvision-oana-arbuzov

I've updated the list with the created date and status.


ksy36 commented Apr 2, 2022

I've opened webcompat/webcompat.com#3685 to temporarily add automatic labelling for gathering statistics:

  • bugbug-reopened for issues that have been reopened after the ML bot closed them;
  • bugbug-valid for issues that have the bugbug-reopened label AND have been moved to the needsdiagnosis or moved milestone (see the sketch below).
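
A rough sketch of those two rules (the real change is in webcompat/webcompat.com#3685; the data shape and helper below are hypothetical):

```python
# Hypothetical sketch of the labelling rules described above; the actual
# implementation lives in webcompat/webcompat.com#3685 and may differ.
ACTIONABLE_MILESTONES = {"needsdiagnosis", "moved"}


def labels_to_add(issue):
    """Return the extra labels to apply to an issue the ML bot once closed."""
    existing = set(issue["labels"])
    new_labels = set()
    if issue["reopened_after_ml_close"]:
        new_labels.add("bugbug-reopened")
    if ("bugbug-reopened" in existing | new_labels
            and issue["milestone"] in ACTIONABLE_MILESTONES):
        new_labels.add("bugbug-valid")
    return new_labels - existing
```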

I'll add the same labels to the issues that are currently on the list once the change is deployed. That way we won't need to manually update the list once all labels are added, as this data will be available on GitHub and in the Elasticsearch db.


ksy36 commented Apr 6, 2022

Deployed webcompat/webcompat.com#3685, so it's adding the labels now. Also added the labels to the issues from the list: https://github.com/webcompat/web-bugs/issues?q=is%3Aissue+is%3Aopen+label%3Abugbug-reopened


ksy36 commented Jun 7, 2022

I've built a graph in Kibana to visualize the bugbug-reopened and bugbug-valid labels, where the legend is based on the following:

  • closed as invalid: issues with the bugbug-probability-high label
  • issues that looked valid: issues with the bugbug-reopened label
  • true valid issues: issues with the bugbug-valid label
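
As an illustration, a hypothetical Elasticsearch aggregation that could back such a graph with weekly counts per legend entry. The index and field names ("labels", "created_at") are assumptions, not the actual webcompat schema:

```python
# Hypothetical Elasticsearch query for weekly counts of each legend entry.
# Index and field names are assumptions; the dict can be sent with an
# Elasticsearch client or pasted (as JSON) into Kibana Dev Tools.
query = {
    "size": 0,
    "aggs": {
        "per_week": {
            "date_histogram": {"field": "created_at", "calendar_interval": "week"},
            "aggs": {
                "legend": {
                    "filters": {
                        "filters": {
                            "closed as invalid": {"term": {"labels": "bugbug-probability-high"}},
                            "looked valid": {"term": {"labels": "bugbug-reopened"}},
                            "true valid": {"term": {"labels": "bugbug-valid"}},
                        }
                    }
                }
            },
        }
    },
}
```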

[screenshot: Kibana graph of bugbug label counts, 2022-06-06]

Even though there can be a lot of issues that looked valid, the percentage of "true" valid issues is quite low, with 2.52% being the highest:

[screenshot: percentage breakdown, 2022-06-06]

So "true" valid issues is what we should pay attention to and around 1-2% missed issues is expected.

To get accurate results we need to keep reopening closed issues, so I've been doing that for the past week and will continue until we have 3-4 weeks of data. It could be that the first improvement to the model was already enough to lower the number of "true" valid issues that are closed as invalid.


karlcow commented Jun 7, 2022

That's really cool.


ksy36 commented Jul 16, 2022

An update here: I've been reopening issues for the past 6 weeks, and the percentage of missed issues is within the expected range, with 1.82% being the highest:

[screenshot: percentage breakdown, 2022-07-13]

We could potentially increase the confidence threshold (from 97% to 98 or 99%), which might improve accuracy a bit, but it would increase the number of issues that need to be triaged manually. With the current rate of missed issues, I think we have the optimal balance.
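
To illustrate the trade-off, a small sketch with made-up probabilities; the routing logic is an assumption about how the bot gates auto-closing on the model's confidence:

```python
# Sketch of the confidence-threshold trade-off: a higher threshold
# auto-closes fewer issues (fewer misses) but sends more issues to
# manual triage. The probabilities below are invented.
def route(invalid_probability, threshold=0.97):
    """Auto-close only when the model is very sure the issue is invalid."""
    return "ml-autoclosed" if invalid_probability >= threshold else "needstriage"


predictions = [0.995, 0.985, 0.975, 0.96, 0.90]
for t in (0.97, 0.98, 0.99):
    closed = sum(route(p, t) == "ml-autoclosed" for p in predictions)
    print(f"threshold={t}: {closed} auto-closed, {len(predictions) - closed} to manual triage")
```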

There are also a few things that I can experiment with:

  • Extracting features from the issue body and seeing if that improves the metrics (see the sketch after this list)

  • Considering issues that are only reproducible in release, and not in Nightly, as valid
    I've noticed that there is a subset of issues that we close as wontfix, since they are reproducible only in release and not in Nightly. The content of such issues could be considered valid, so it's worth experimenting with including them in the valid class when training the model.
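
A sketch of both experiment ideas, with hypothetical field names and heuristics:

```python
# Hypothetical sketches for the two experiments above; the field names,
# regexes, and heuristics are assumptions for illustration.
import re


def body_features(body):
    """Hand-crafted features that could be extracted from the issue body."""
    text = body.lower()
    return {
        "has_url": bool(re.search(r"https?://", text)),
        "has_steps": "steps to reproduce" in text,
        "mentions_nightly": "nightly" in text,
        "mentions_release": "release" in text,
        "length": len(body),
    }


def release_only_counts_as_valid(issue):
    """Treat wontfix issues that reproduce only in release as 'valid' content."""
    return (issue["resolution"] == "wontfix"
            and issue["reproducible_in_release"]
            and not issue["reproducible_in_nightly"])
```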
