A curated list of datasets that can be directly used or adapted for various table understanding tasks.
Note that some of the datasets provide only metadata and annotations, without the source files. Additionally, several datasets have download links that are no longer active. Since the authors may resolve this issue in the future, these datasets are still included in the list.
The repository will be continuously updated ✏️. If you find this resource useful for your research, just ⭐️ it and stay tuned!

Dataset | Source | Task(s) | Size | Modality |
---|---|---|---|---|
PubTables-1M | | | 947.64K tables | Image |
SciGen | | | 1.3K table-text description pairs | Text |
ComTQA | | | 1.5K tables and 9K QA pairs | Image |
DocGenome | | | 3K table-LaTeX pairs | Image |
numericNLG | | | 1.3K text-table pairs | Text |
SEM-TAB-FACTS | | | 3K tables | Text |
TAT-QA | | | 2K hybrid contexts (tables and text) and 16.5K QA pairs | Text |
WikiBio | | | 728.32K biographies | Text |
ToTTo | | | 120K table-text pairs | Text |
TabFact | | | 16K tables and 118K statements | Text |
TableBench | | | 3.6K tables and 886 QA pairs | Text |
TableInstruct | | | 3.6K tables and 20K QA pairs | Text |
FinQA | | | 8.2K QA pairs | Text |
LogicNLG | | | 7.3K tables | Text |
TabIS | | | 61K tables | Text |
DataBench | | | 56K tables | Text |
GitTables | | | 1M tables | Text |
AxCell: Segmented Tables | | | 1.9K tables | Text |
WDC Web Table Corpus 2012 | | | 147M tables | Text |
WDC Web Table Corpus 2015 | | | 233M tables | Text |
T2D | | | 1.7K tables | Text |
T2Dv2 | | | 779 tables | Text |
WikiTables | | | 1.6M tables | Text |
WikiTableQuestions | | | 2.1K tables and 22K QA pairs | Text |
WikiSQL | | | 24.2K tables | Text |
Spider 1.0 | | | N/A | Text |
OTT-QA | | | 400K tables | Text |
HybridQA | | | 13K tables and 70K QA pairs | Text |
FEVEROUS | | | 87K claims | Text |
TableBank | | | 417K tables | Image |
PubTabNet | | | 568K tables | Image |
PubLayNet | | | 94K pages with tables and 113K tables | Image |
FinTabNet | | | 89K pages and 112.8K tables | Text |
WTW | | | 14.5K tables | Image |
SciTSR | | | 15K tables | Image |
TNCR | | | 6.6K images and 9.4K tables | Image |
DeepFigures | | | 1.4M tables | Text |
WikiTableSet | | | 5M tables | Image |
Tab2Know | | | 73K tables | Image and text |
Logic2Text | | | 5.6K tables and 10.8K (logical form, description) pairs | Text |
SQA | | | 17.5K QA pairs | Text |
FeTaQA | | | 10.3K (table, question, answer, table cells) pairs | Text |
ICDAR 2019 cTDaR | | | N/A | Image |
SportsTables | | | 1.1K tables | Text |
SemTab2019 | | | 14.9K tables | Text |
Tough Tables (2T) | | | 180 tables | Text |
SemTab2020 | | | 131.4K tables | Text |
HardTables | | | N/A | Text |
BiodivTab | | | N/A | Text |
BioTable | | | N/A | Text |
SemTab2021 | | | 9.1K tables | Text |
SemTab2022 | | | N/A | Text |
NumDB | | | 389 tables | Text |
MammoTab | | | 980K tables | Text |
SOTAB | | | 107K tables | Text |
Wikary | | | 32K tables | Text |
HiTab | | | 3.5K tables | Text |
INFOTABS | | | 2.3K tables and 23.7K premise-hypothesis pairs | Text |
Rotowire | | | 4.8K summaries | Text |
SBNation | | | 10.9K summaries | Text |
AIT-QA | | | 515 questions and 116 tables | Text |
TabMWP | | | 37.6K tables | Image and text |
PubHealthTab | | | 1.9K claim-table pairs | Text |
MMTab | | | 202K tables | Image |
CTE | | | 75K pages and 35K tables | Image |
TD4CLTabs | | | 13.3K tables | Image |
SheetCopilot | | | 28 workbooks and 221 tasks | Text |
MultiModalQA | | | 700K tables | Text |
Table-GPT | | | 67.2K tables | Text |
arxivDIGESTables | | | 2.228K tables | Text |
SciTaT | | | 13.808K questions associated with 8.907K papers | Text |

Feel free to create a pull request or to open an issue if you would like to add other awesome datasets.