feat: Multilingual retrieval loader #1473

Merged
merged 2 commits into from
Nov 19, 2024

Collaborator

@Samoed Samoed commented Nov 19, 2024

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

I've added a multilingual retrieval loader with the following structure:

  • {config}-corpus
  • {config}-queries
    ...

However, no existing datasets currently fit this structure. When I uploaded datasets previously, I used the corpus-{config} naming, but more of the other datasets start with the config. Those datasets have variations that make them incompatible with this loader for direct use, so this is mainly for future use.
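
For illustration, the config-first layout described above can be sketched as follows. This is a minimal sketch, not the code from this PR: `subset_name`, `load_all`, and the in-memory `fake_hub` are hypothetical names, and only the two parts named in the description (corpus and queries) are covered, since the rest of the list is elided.

```python
from typing import Callable, Dict, Iterable

# Only the two parts named in the description; the rest of that list is elided.
PARTS = ("corpus", "queries")


def subset_name(config: str, part: str) -> str:
    """Build a config-first subset name such as 'en-corpus'."""
    return f"{config}-{part}"


def load_all(
    configs: Iterable[str], load_subset: Callable[[str], object]
) -> Dict[str, Dict[str, object]]:
    """Load every part of every config through a caller-supplied function."""
    return {
        config: {part: load_subset(subset_name(config, part)) for part in PARTS}
        for config in configs
    }


# In-memory stand-in for a real dataset hub, so the sketch runs offline.
fake_hub = {
    "en-corpus": ["doc a", "doc b"],
    "en-queries": ["query 1"],
    "de-corpus": ["Dokument a"],
    "de-queries": ["Anfrage 1"],
}
loaded = load_all(["en", "de"], fake_hub.__getitem__)
```

Passing the loading function in as an argument keeps the naming rule separate from whatever backend actually fetches the data.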

@Samoed Samoed requested a review from orionw November 19, 2024 12:12
@Samoed Samoed marked this pull request as ready for review November 19, 2024 15:30
Contributor

@orionw orionw left a comment

I like it! It seems like it should make things easier. Have you checked that it works with some of the multilingual datasets? If you have, then it seems good to merge to me.

@Samoed
Collaborator Author

Samoed commented Nov 19, 2024

I tested this on CodeSearchCCRetrieval, which uses corpus-{config}. It worked when I swapped the corpus name around for the test. However, I couldn't find any datasets that could be switched to this loader, because their naming schemes differ. This PR won't break anything; it's an attempt to standardize for future use.
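
The name swap mentioned above amounts to moving the part prefix to the end. A minimal sketch, assuming only the corpus and queries parts and a hypothetical config called `python` (the helper `to_config_first` is illustrative and is not part of this PR):

```python
# Parts assumed for illustration; the real datasets may use more.
KNOWN_PARTS = ("corpus", "queries")


def to_config_first(name: str) -> str:
    """Turn 'corpus-{config}' into '{config}-corpus'; leave other names unchanged."""
    for part in KNOWN_PARTS:
        prefix = part + "-"
        if name.startswith(prefix):
            config = name[len(prefix):]
            return f"{config}-{part}"
    return name


renamed = [to_config_first(n) for n in ("corpus-python", "queries-python", "python-corpus")]
```

Names that already follow the config-first scheme pass through untouched, so the helper can be applied to a mixed list.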

@KennethEnevoldsen
Contributor

Looks good. It might be nice to add a push_to_hub method that pushes a dataset in a standard format (that would make it easy to convert datasets).
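
One possible shape for such a helper is sketched below. Everything here is an assumption for illustration: `push_in_standard_format` is a hypothetical name, and the actual upload is delegated to a `push` callback so the sketch runs without network access. With the Hugging Face datasets library, that callback could plausibly wrap `Dataset.push_to_hub(repo_id, config_name=name)`, but that wiring is not shown or verified here.

```python
from typing import Callable, Dict, List


def push_in_standard_format(
    repo_id: str,
    data: Dict[str, Dict[str, List[str]]],
    push: Callable[[str, str, List[str]], None],
) -> List[str]:
    """Push each part of each config under a '{config}-{part}' subset name.

    Returns the subset names in the order they were pushed.
    """
    pushed = []
    for config, parts in data.items():
        for part, rows in parts.items():
            name = f"{config}-{part}"
            push(repo_id, name, rows)
            pushed.append(name)
    return pushed


# Record calls instead of uploading, so the sketch stays offline.
calls = []
names = push_in_standard_format(
    "user/demo",
    {"en": {"corpus": ["doc"], "queries": ["q"]}},
    lambda repo, name, rows: calls.append((repo, name, len(rows))),
)
```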

@KennethEnevoldsen KennethEnevoldsen merged commit a27de33 into v2.0.0 Nov 19, 2024
10 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the multilingual_retrieval_loader branch November 19, 2024 16:22
3 participants