feat: Add StatisticalEvaluator component #6982

Conversation
@silvanocerza, can someone else with a better background look at this one? I can as well, but it'll take some time.
@vblagoje, it would be cool if you could take a look in any case, even if you don't have enough context. If it's not thorough, that's ok; you'll at least familiarize yourself with this part. Then we can have another set of eyes with more context if necessary. 👍
return default_from_dict(cls, data)

@component.output_types(result=float)
def run(self, predictions: List[str]) -> Dict[str, Any]:
Yeah, I also noticed the comment about the length of these predictions. Why not add a check here for zero length, so we can omit the checks in all metrics?
As of now I only moved the metrics that we already had in the old API. Others will be added in future PRs, and I'm not sure whether those should return the same values as F1 and Exact Match when the length is zero. That's the only reason I've done it like this.
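For reference, a minimal sketch of the kind of guard being discussed, based on the run() signature quoted above; the self._labels attribute name and the 0.0 fallback value are assumptions, not necessarily what the PR does:

```python
# Fragment of StatisticalEvaluator.run() -- a sketch, not the PR's actual code.
@component.output_types(result=float)
def run(self, predictions: List[str]) -> Dict[str, Any]:
    # A single zero-length check here lets the individual metric functions skip it.
    # Returning 0.0 for empty input is an assumption.
    if len(predictions) == 0:
        return {"result": 0.0}
    return {"result": self._metric_function(self._labels, predictions)}
```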
Seems good to go, left some minor comments
Force-pushed from 36cd3ed to 5ad3db2
Pull Request Test Coverage Report for Build 7903381466
💛 - Coveralls
@@ -1,3 +1,4 @@
from .sas_evaluator import SASEvaluator
This should go into haystack.components.evaluators
- Exact Match: Measures the proportion of cases where prediction is identical to the expected label.
"""

class Metric(Enum):
Let's extract this and move it out to the top-level namespace (components.evaluators). We should call it StatisticalMetrics (to disambiguate it from the others).
F1 = "F1" | ||
EM = "Exact Match" |
We should probably stick to snake_case like we do elsewhere.
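Putting this comment and the extraction suggestion above together, the result could look roughly like the sketch below; the module path, final class name (the comments use both StatisticalMetrics and StatisticalMetric), and member names are assumptions:

```python
# e.g. haystack/components/evaluators/statistical_metric.py -- path is an assumption
from enum import Enum


class StatisticalMetric(Enum):
    """Statistical metrics supported by StatisticalEvaluator."""

    F1 = "f1"
    EM = "exact_match"
```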
def __init__(
    self,
    labels: List[str],
This should be an input instead. (cf https://github.com/deepset-ai/haystack/pull/6980/files#r1489849342)
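If labels becomes a run-time input as suggested, the run() signature could look roughly like this (a sketch; the _metric_function attribute follows the diff, everything else is assumed):

```python
@component.output_types(result=float)
def run(self, labels: List[str], predictions: List[str]) -> Dict[str, Any]:
    # labels are supplied per invocation instead of being fixed in __init__
    return {"result": self._metric_function(labels, predictions)}
```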
def __init__(
    self,
    labels: List[str],
    metric: Metric,
Let's make this a Union[str, StatisticalMetric] and add a from_str function to the latter.
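A sketch of what the suggested from_str helper and the Union-typed parameter could look like, extending the enum sketched above; the exact names and error handling are assumptions:

```python
# Inside the StatisticalMetric enum sketched above:
@classmethod
def from_str(cls, string: str) -> "StatisticalMetric":
    # Map a plain string such as "f1" onto the corresponding enum member.
    mapping = {m.value: m for m in cls}
    if string not in mapping:
        raise ValueError(f"Unknown statistical metric '{string}'")
    return mapping[string]


# In StatisticalEvaluator.__init__, the parameter could then be coerced
# (other constructor arguments omitted here for brevity):
def __init__(self, metric: Union[str, StatisticalMetric]):
    self._metric = StatisticalMetric.from_str(metric) if isinstance(metric, str) else metric
```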
    regexes_to_ignore: Optional[List[str]] = None,
    ignore_case: bool = False,
    ignore_punctuation: bool = False,
    ignore_numbers: bool = False,
return {"result": self._metric_function(labels, predictions)} | ||
|
||
def _f1(self, labels: List[str], predictions: List[str]): |
Can be a @staticmethod.
return np_mean(scores)

def _exact_match(self, labels: List[str], predictions: List[str]) -> float:
Can be a @staticmethod.
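For illustration, one of these as a @staticmethod; the body is a plausible exact-match implementation using the np_mean alias visible in the diff, not necessarily the PR's exact code:

```python
@staticmethod
def _exact_match(labels: List[str], predictions: List[str]) -> float:
    # No instance state is needed, so there is no self parameter.
    scores = [float(label == prediction) for label, prediction in zip(labels, predictions)]
    return np_mean(scores)
```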
Related Issues
Part of #6903
Proposed Changes:
Add a new StatisticalEvaluator component. It can be used to evaluate different statistical metrics on answers returned by LLMs. As of now it only supports the F1 and Exact Match metrics, since I just migrated them from the previous API.
Ideally, future PRs should also add Recall, Mean Reciprocal Rank, and Mean Average Precision.
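For context, a hypothetical usage sketch based on the signatures visible in this diff (constructor takes labels and a metric, run() takes predictions); the import path follows the review suggestion and the string-valued metric assumes the Union[str, ...] suggestion is adopted, so neither is confirmed here:

```python
from haystack.components.evaluators import StatisticalEvaluator  # import path assumed

evaluator = StatisticalEvaluator(
    labels=["Berlin", "Paris"],  # expected answers (kept in __init__ as in this diff)
    metric="exact_match",        # string form assumes the Union[str, StatisticalMetric] suggestion
)
result = evaluator.run(predictions=["Berlin", "Lyon"])
print(result["result"])  # a single float, e.g. 0.5 for one exact match out of two
```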
How did you test it?
I migrated the existing tests from test_eval_f1.py and test_eval_em.py to use the new API.
Notes for the reviewer
I didn't delete the old eval API for the time being. A later PR will purge it after we move everything to the new one.
I'll add documentation configs in a later PR too.
Depends on #6980
Checklist
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.