docs(datasets) Add recommended fl datasets docs #4556

Merged
merged 5 commits into from
Nov 22, 2024
163 changes: 163 additions & 0 deletions datasets/doc/source/recommended-fl-datasets.rst
@@ -0,0 +1,163 @@
Recommended FL Datasets
=======================

This page lists datasets recommended for federated learning research. All of them can be used with Flower Datasets (``flwr-datasets``).

.. note::

   All datasets from the Hugging Face Hub can be used with our library. This page presents only a selection of datasets that you might find useful.

For more information about any dataset, visit its page by clicking the dataset name.
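
As a minimal sketch of how any dataset name from the tables below might be used with ``flwr-datasets``: the helper name ``load_client_partition`` and its parameters are hypothetical and chosen for illustration, and the sketch assumes that ``flwr-datasets`` is installed and that an integer partitioner value requests that many IID partitions.

```python
def load_client_partition(dataset_name, num_clients, client_id, split="train"):
    """Return one client's share of `split` from a Hugging Face Hub dataset.

    Hypothetical helper for illustration; not part of flwr-datasets.
    """
    if not 0 <= client_id < num_clients:
        raise ValueError(
            f"client_id must be in [0, {num_clients}), got {client_id}"
        )

    # Imported lazily so the helper can be defined without flwr-datasets installed.
    from flwr_datasets import FederatedDataset

    # Assumption: an integer value asks for that many IID partitions of the split.
    fds = FederatedDataset(
        dataset=dataset_name,
        partitioners={split: num_clients},
    )
    return fds.load_partition(client_id, split)


# Example call (downloads the dataset on first use):
# partition = load_client_partition("ylecun/mnist", num_clients=10, client_id=0)
```

Any ``Name`` entry in the tables can be passed as ``dataset_name``. Datasets that list only a ``train`` split would need a held-out set created separately.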

Image Datasets
--------------

.. list-table:: Image Datasets
   :widths: 40 40 20
   :header-rows: 1

   * - Name
     - Size
     - Image Shape
   * - `ylecun/mnist <https://huggingface.co/datasets/ylecun/mnist>`_
     - train 60k;
       test 10k
     - 28x28
   * - `uoft-cs/cifar10 <https://huggingface.co/datasets/uoft-cs/cifar10>`_
     - train 50k;
       test 10k
     - 32x32x3
   * - `uoft-cs/cifar100 <https://huggingface.co/datasets/uoft-cs/cifar100>`_
     - train 50k;
       test 10k
     - 32x32x3
   * - `zalando-datasets/fashion_mnist <https://huggingface.co/datasets/zalando-datasets/fashion_mnist>`_
     - train 60k;
       test 10k
     - 28x28
   * - `flwrlabs/femnist <https://huggingface.co/datasets/flwrlabs/femnist>`_
     - train 814k
     - 28x28
   * - `zh-plus/tiny-imagenet <https://huggingface.co/datasets/zh-plus/tiny-imagenet>`_
     - train 100k;
       valid 10k
     - 64x64x3
   * - `flwrlabs/usps <https://huggingface.co/datasets/flwrlabs/usps>`_
     - train 7.3k;
       test 2k
     - 16x16
   * - `flwrlabs/pacs <https://huggingface.co/datasets/flwrlabs/pacs>`_
     - train 10k
     - 227x227
   * - `flwrlabs/cinic10 <https://huggingface.co/datasets/flwrlabs/cinic10>`_
     - train 90k;
       valid 90k;
       test 90k
     - 32x32x3
   * - `flwrlabs/caltech101 <https://huggingface.co/datasets/flwrlabs/caltech101>`_
     - train 8.7k
     - varies
   * - `flwrlabs/office-home <https://huggingface.co/datasets/flwrlabs/office-home>`_
     - train 15.6k
     - varies
   * - `flwrlabs/fed-isic2019 <https://huggingface.co/datasets/flwrlabs/fed-isic2019>`_
     - train 18.6k;
       test 4.7k
     - varies
   * - `ufldl-stanford/svhn <https://huggingface.co/datasets/ufldl-stanford/svhn>`_
     - train 73.3k;
       test 26k;
       extra 531k
     - 32x32x3
   * - `sasha/dog-food <https://huggingface.co/datasets/sasha/dog-food>`_
     - train 2.1k;
       test 0.9k
     - varies
   * - `Mike0307/MNIST-M <https://huggingface.co/datasets/Mike0307/MNIST-M>`_
     - train 59k;
       test 9k
     - 32x32

Audio Datasets
--------------

.. list-table:: Audio Datasets
   :widths: 35 30 15
   :header-rows: 1

   * - Name
     - Size
     - Subset
   * - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
     - train 64.7k
     - v0.01
   * - `google/speech_commands <https://huggingface.co/datasets/google/speech_commands>`_
     - train 105.8k
     - v0.02
   * - `flwrlabs/ambient-acoustic-context <https://huggingface.co/datasets/flwrlabs/ambient-acoustic-context>`_
     - train 70.3k
     -
   * - `fixie-ai/common_voice_17_0 <https://huggingface.co/datasets/fixie-ai/common_voice_17_0>`_
     - varies
     - 14 versions
   * - `fixie-ai/librispeech_asr <https://huggingface.co/datasets/fixie-ai/librispeech_asr>`_
     - varies
     - clean/other
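
Where the table lists a subset, that value would typically be passed through the ``subset`` argument of ``FederatedDataset``. The sketch below illustrates this for ``google/speech_commands``; the helper names are hypothetical, and depending on the installed ``datasets`` version, additional loading arguments may be required for this dataset.

```python
# Subset names taken from the "Subset" column of the audio table.
SPEECH_COMMANDS_SUBSETS = ("v0.01", "v0.02")


def speech_commands_kwargs(subset, num_clients):
    """Build FederatedDataset keyword arguments (hypothetical helper)."""
    if subset not in SPEECH_COMMANDS_SUBSETS:
        raise ValueError(f"unknown subset {subset!r}")
    return {
        "dataset": "google/speech_commands",
        "subset": subset,
        # Assumption: an integer requests that many IID partitions of "train".
        "partitioners": {"train": num_clients},
    }


def load_speech_commands(subset, num_clients):
    """Create the federated dataset; requires flwr-datasets to be installed."""
    # Imported lazily so the helpers above work without flwr-datasets.
    from flwr_datasets import FederatedDataset

    return FederatedDataset(**speech_commands_kwargs(subset, num_clients))


# Example call (downloads the dataset on first use):
# fds = load_speech_commands("v0.02", num_clients=100)
# partition = fds.load_partition(0, "train")
```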

Tabular Datasets
----------------

.. list-table:: Tabular Datasets
   :widths: 35 30
   :header-rows: 1

   * - Name
     - Size
   * - `scikit-learn/adult-census-income <https://huggingface.co/datasets/scikit-learn/adult-census-income>`_
     - train 32.6k
   * - `jlh/uci-mushrooms <https://huggingface.co/datasets/jlh/uci-mushrooms>`_
     - train 8.1k
   * - `scikit-learn/iris <https://huggingface.co/datasets/scikit-learn/iris>`_
     - train 150

Text Datasets
-------------

.. list-table:: Text Datasets
   :widths: 40 30 30
   :header-rows: 1

   * - Name
     - Size
     - Category
   * - `sentiment140 <https://huggingface.co/datasets/sentiment140>`_
     - train 1.6M;
       test 0.5k
     - Sentiment
   * - `google-research-datasets/mbpp <https://huggingface.co/datasets/google-research-datasets/mbpp>`_
     - full 974; sanitized 427
     - General
   * - `openai/openai_humaneval <https://huggingface.co/datasets/openai/openai_humaneval>`_
     - test 164
     - General
   * - `lukaemon/mmlu <https://huggingface.co/datasets/lukaemon/mmlu>`_
     - varies
     - General
   * - `takala/financial_phrasebank <https://huggingface.co/datasets/takala/financial_phrasebank>`_
     - train 4.8k
     - Financial
   * - `pauri32/fiqa-2018 <https://huggingface.co/datasets/pauri32/fiqa-2018>`_
     - train 0.9k; validation 0.1k; test 0.2k
     - Financial
   * - `zeroshot/twitter-financial-news-sentiment <https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment>`_
     - train 9.5k; validation 2.4k
     - Financial
   * - `bigbio/pubmed_qa <https://huggingface.co/datasets/bigbio/pubmed_qa>`_
     - train 2M; validation 11k
     - Medical
   * - `openlifescienceai/medmcqa <https://huggingface.co/datasets/openlifescienceai/medmcqa>`_
     - train 183k; validation 4.3k; test 6.2k
     - Medical
   * - `bigbio/med_qa <https://huggingface.co/datasets/bigbio/med_qa>`_
     - train 10.1k; test 1.3k; validation 1.3k
     - Medical