
feat: add CUREv1 retrieval dataset #1459

Merged

Conversation

@dbuades (Contributor) commented on Nov 14, 2024


CUREv1

Over the past year, we’ve worked closely with medical professionals to develop this dataset, which we’re now sharing with the community to support research in point-of-care information retrieval, a critical daily task for many practitioners.

CUREv1 is a cross-lingual retrieval dataset organized into:

  • Ten splits covering different medical disciplines:

    • Dentistry and Oral Health
    • Dermatology
    • Gastroenterology
    • Genetics
    • Neuroscience and Neurology
    • Orthopedic Surgery
    • Otorhinolaryngology
    • Plastic Surgery
    • Psychiatry and Psychology
    • Pulmonology
  • One monolingual setting and two cross-lingual settings:

    • English-to-English
    • Spanish-to-English
    • French-to-English

Each split, in each language setting, comprises 200 natural-language queries formulated by healthcare professionals, capturing the information needs that arise when they consult academic literature in their daily work.

The corpus is constructed from an index of English passages extracted from biomedical academic articles. Each passage is then marked as Highly Relevant, Partially Relevant, or Not Relevant with respect to each query.

For more details, please check the Dataset Card on the Hub 🤗
A preprint detailing the curation process and providing an extended rationale will soon be published on arXiv, along with pre-embedded indexes for several of the evaluated models!
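
For convenience, here is a minimal sketch of how the new task could be evaluated with the mteb Python API once this PR is merged. The task name matches the one registered in this PR; the model choice is just an illustration, and the helper names (`get_tasks`, `get_model`) assume a recent mteb version:

```python
import mteb

# Select the CUREv1 task added in this PR.
tasks = mteb.get_tasks(tasks=["CUREv1"])

# Any embedding model works; multilingual-e5-small is one of the two
# models used for the checklist results below.
model = mteb.get_model("intfloat/multilingual-e5-small")

# Run the evaluation and write per-task scores to the output folder.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```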


MTEB(Medical)

At the same time, we take the opportunity to introduce a specialized benchmark that groups MTEB tasks relevant to the medical domain. Initially, we’ve included the following tasks, but we welcome any suggestions for additional tasks you think may be valuable:

  • CUREv1
  • NFCorpus
  • TRECCOVID
  • TRECCOVID-PL
  • SciFact
  • SciFact-PL
  • MedicalQARetrieval
  • PublicHealthQA
  • MedrxivClusteringP2P.v2
  • MedrxivClusteringS2S.v2
  • CmedqaRetrieval
  • CMedQAv2-reranking

We have also computed results for these tasks across 18 open-source models. We can upload them to the results repo or somewhere else; please point us in the right direction, as there seems to be a lot of activity around the new leaderboard! 💪
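
As a sketch of how the full benchmark could then be run end to end, assuming the `get_benchmark` helper available in recent mteb versions and the "MTEB(Medical)" name registered in mteb/benchmarks/benchmarks.py in this PR:

```python
import mteb

# Look up the benchmark registered in mteb/benchmarks/benchmarks.py.
benchmark = mteb.get_benchmark("MTEB(Medical)")

# A Benchmark can be passed directly as the task selection.
model = mteb.get_model(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results")
```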


Adding datasets checklist

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models achieve near-perfect scores) nor random (both models achieve near-random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform(); see the sketch after this checklist.
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
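
For reference, a minimal sketch of the subsampling hook mentioned in the checklist, assuming the stratified_subsampling helper defined on mteb's AbsTask base classes; the task class here is hypothetical and only illustrates where the hook lives:

```python
from mteb.abstasks import AbsTaskClassification


class MyMedicalTask(AbsTaskClassification):  # hypothetical example task
    def dataset_transform(self) -> None:
        # Subsample oversized splits (mteb's default cap is 2048 examples),
        # stratifying on the label column so class balance is preserved.
        self.dataset = self.stratified_subsampling(
            self.dataset, seed=self.seed, splits=["test"]
        )
```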

---------

Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
@KennethEnevoldsen (Contributor) left a comment:

Great PR!!

> We have also computed results for these tasks across 18 open-source models. We can upload them to the results repo or somewhere else; please point us in the right direction, as there seems to be a lot of activity around the new leaderboard!

Please do!

mteb/tasks/Retrieval/multilingual/CUREv1Retrieval.py (review thread resolved)
mteb/benchmarks/benchmarks.py (review thread resolved)
@isaac-chung (Collaborator) commented:

Thanks for the PR! The only thing left to point out is that it'd be great if the PR description could be updated to reflect the changes made above. Otherwise I think this is good to merge. If you end up promoting this on socials, let us know :)

@isaac-chung merged commit 1cc6c9e into embeddings-benchmark:main on Nov 21, 2024 (10 checks passed).
@dbuades (Contributor, Author) commented on Nov 21, 2024

> Thanks for the PR! The only thing left to point out is that it'd be great if the PR description could be updated to reflect the changes made above. Otherwise I think this is good to merge. If you end up promoting this on socials, let us know :)

Thanks, @isaac-chung ! Sorry I missed your last comment. I’ve updated the PR description retroactively and am currently running the 18 models on all the newly added tasks in the benchmark. Once that’s done, I’ll open a PR with the results.

As for promoting the work, do you have any specific ideas in mind? I was planning to post something on LinkedIn next week, which I can also share here. Additionally, we’re preparing a preprint that we’ll be uploading to arXiv soon. Maybe we could use that opportunity to co-write something for the HF blog? We can discuss the angle, but I believe it could be really interesting!

@isaac-chung (Collaborator) commented:

Thanks, @dbuades! Those all sound good, and I'm happy to share/repost what you have. An HF blog would be good as well - happy to collaborate there!

@dbuades (Contributor, Author) commented on Nov 22, 2024

> Thanks, @dbuades! Those all sound good, and I'm happy to share/repost what you have. An HF blog would be good as well - happy to collaborate there!

Perfect! I'll keep you posted next week.
