
feat: Add StatisticalEvaluator component #6982

Merged: 4 commits from statistical-evaluator into main on Feb 14, 2024

Conversation

silvanocerza
Contributor

Related Issues

Part of #6903

Proposed Changes:

Add a new StatisticalEvaluator component. It can be used to compute different statistical metrics on answers returned by LLMs.

As of now it only supports the F1 and Exact Match metrics, as I just migrated it from the previous API.
Ideally, future PRs should also add Recall, Mean Reciprocal Rank, and Mean Average Precision.
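
For orientation, here is a minimal usage sketch based on the signatures visible in this diff; the import path and the exact location of the Metric enum are assumptions, not the final API:

# Hedged sketch; the import path and enum location are assumptions.
from haystack.components.evaluators.statistical_evaluator import Metric, StatisticalEvaluator

evaluator = StatisticalEvaluator(
    labels=["Berlin", "Paris"],  # expected answers
    metric=Metric.F1,            # or Metric.EM for Exact Match
)
result = evaluator.run(predictions=["Berlin", "Rome"])
print(result["result"])  # a single float score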

How did you test it?

I migrated existing tests from test_eval_f1.py and test_eval_em.py to use the new API.

Notes for the reviewer

I didn't delete the old eval API for the time being. A later PR will purge it after we move everything to the new one.

I'll add documentation configs in a later PR too.

Depends on #6980

Checklist

@silvanocerza silvanocerza self-assigned this Feb 13, 2024
@silvanocerza silvanocerza requested review from a team as code owners February 13, 2024 12:05
@silvanocerza silvanocerza requested review from dfokina and vblagoje and removed request for a team February 13, 2024 12:05
@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Feb 13, 2024
@vblagoje
Member

@silvanocerza, can someone else with a better background look at this one? I can as well, but it'll take some time...

@silvanocerza
Contributor Author

@vblagoje it would be cool if you could take a look in any case, even if you don't have enough context. If it's not thorough that's ok, you'll at least familiarize yourself with this part.

Then we can get another set of eyes with more context if necessary. 👍

return default_from_dict(cls, data)

@component.output_types(result=float)
def run(self, predictions: List[str]) -> Dict[str, Any]:
Member


Yeah, I also noticed the comment about the length of these predictions. Why not add a check here for zero length, so we can omit the checks in all the metrics?

Contributor Author


As of now I've only moved the metrics we already had in the old API. Others will be added in future PRs, and I'm unsure whether those will return the same values as F1 and Exact Match when the length is zero. That's the only reason I've done it like this.
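
For illustration, a hedged sketch of the guard being discussed, slotting into the run method shown in the diff; storing the labels on the instance and returning 0.0 for empty input are assumptions, not decisions made in this PR:

@component.output_types(result=float)
def run(self, predictions: List[str]) -> Dict[str, Any]:
    labels = self._labels  # assumption: labels are stored at construction time
    # One zero-length check here would let the individual metrics skip their own;
    # the 0.0 fallback is an assumption and may not fit every future metric.
    if len(predictions) == 0:
        return {"result": 0.0}
    return {"result": self._metric_function(labels, predictions)}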

Member

@vblagoje vblagoje left a comment


Seems good to go, left some minor comments

Base automatically changed from sas-evaluator to main February 14, 2024 15:16
@github-actions github-actions bot added the 2.x (Related to Haystack v2.0) label Feb 14, 2024
@coveralls
Collaborator

Pull Request Test Coverage Report for Build 7903381466

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 10 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.05%) to 88.951%

Files with coverage reduction:
  • evaluation/eval.py: 10 new missed lines (65.63%)

Totals coverage status:
  • Change from base build 7903358782: +0.05%
  • Covered lines: 4943
  • Relevant lines: 5557

💛 - Coveralls

@silvanocerza silvanocerza merged commit 36ab23d into main Feb 14, 2024
22 checks passed
@silvanocerza silvanocerza deleted the statistical-evaluator branch February 14, 2024 15:48
@@ -1,3 +1,4 @@
from .sas_evaluator import SASEvaluator
Contributor


This should go into haystack.components.evaluators

- Exact Match: Measures the proportion of cases where prediction is identical to the expected label.
"""

class Metric(Enum):
Contributor


Let's extract this and move it out to the top-level namespace (components.evaluators). We should call it StatisticalMetrics (to disambiguate it from the others).

Comment on lines +30 to +31
F1 = "F1"
EM = "Exact Match"
Contributor


We should probably stick to snake_case like we do elsewhere.
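
Taken together with the earlier suggestion to extract the enum, a hedged sketch of what this could look like; the singular name follows the later review comment, and the exact values are assumptions:

from enum import Enum

# Sketched at the top level (components.evaluators) rather than nested in the component.
class StatisticalMetric(Enum):
    F1 = "f1"
    EM = "exact_match"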



def __init__(
self,
labels: List[str],
metric: Metric,
Contributor


Let's make this a Union[str, StatisticalMetric] and add a from_str function to the latter.
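
A hedged sketch of how the from_str helper and the Union handling in __init__ could look; everything beyond the names mentioned in this comment is an assumption:

from enum import Enum
from typing import Union

class StatisticalMetric(Enum):
    F1 = "f1"
    EM = "exact_match"

    @classmethod
    def from_str(cls, string: str) -> "StatisticalMetric":
        # Map a plain string such as "f1" or "exact_match" to the enum member.
        try:
            return cls(string)
        except ValueError as exc:
            raise ValueError(f"Unknown statistical metric '{string}'") from exc

def _coerce_metric(metric: Union[str, StatisticalMetric]) -> StatisticalMetric:
    # What __init__ could do with a Union[str, StatisticalMetric] argument.
    return StatisticalMetric.from_str(metric) if isinstance(metric, str) else metric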

Comment on lines +37 to +40
regexes_to_ignore: Optional[List[str]] = None,
ignore_case: bool = False,
ignore_punctuation: bool = False,
ignore_numbers: bool = False,


return {"result": self._metric_function(labels, predictions)}

def _f1(self, labels: List[str], predictions: List[str]):
Contributor


Can be a @staticmethod.


return np_mean(scores)

def _exact_match(self, labels: List[str], predictions: List[str]) -> float:
Contributor


Can be a @staticmethod.
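
For illustration, a hedged sketch of the @staticmethod form suggested for both helpers; the exact-match body shown here is an assumption based on the metric's description, not the code in this PR:

@staticmethod
def _exact_match(labels: List[str], predictions: List[str]) -> float:
    # Same signature as above, just without self.
    # Assumed body: proportion of predictions identical to their expected label.
    matches = sum(prediction == label for label, prediction in zip(labels, predictions))
    return matches / len(predictions)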

Labels: 2.x (Related to Haystack v2.0), topic:tests, type:documentation (Improvements on the docs)