---
layout: spider
permalink: spider
redirect_from: /seq2sql/spider
---
*(Spider 1.0 logo)*

# Yale Semantic Parsing and Text-to-SQL Challenge

## What is Spider?


Spider is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables, covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in the train and test sets. To do well on it, systems must generalize not only to new SQL queries but also to new database schemas.

Why do we call it "Spider"? Because our dataset is complex and cross-domain, like a spider crawling across multiple complex nests (databases with many foreign keys).

Links: XLANG Lab for Building LLM/VLM Agents · Spider Paper (EMNLP 2018) · Spider Post

Related work from the XLANG Lab: Spider 2.0 Text-to-SQL (2024) · Spider2-V (2024) · OSWorld (2024) · DS-1000 Challenge (ICML 2023) · Binder Framework (ICLR 2023) · UnifiedSKG Framework (EMNLP 2022) · SParC Challenge (ACL 2019) · CoSQL Challenge (EMNLP 2019)

## News

- 11/12/2024 We have released the Spider 2.0 full paper, data, and code. Follow the guideline to submit your scores to the leaderboard.
- 08/28/2024 The early access version of Spider 2.0 (a more realistic and challenging text-to-SQL task) is now available! As this is a preliminary release, there may be errors. Your feedback would be invaluable in refining the dataset!
- 07/15/2024 Spider 2.0-vision (Benchmarking Multimodal Agents on Automating Data Science and Engineering Workflows) is out! Spider 2.0-SQL (much more realistic and challenging than Spider 1.0!) will be released in August.
- 02/05/2024 We no longer accept submissions for Spider 1.0 evaluations or update its leaderboard. The test set of Spider 1.0 has already been released (see the Spider dataset link below). Look forward to the release of Spider 2.0, a more realistic and challenging benchmark in the era of LLMs, expected this June. Stay tuned!
- 08/10/2023 Please check out XLANG Lab for Building LLM/VLM Agents!
- 05/27/2023 Please check out Dr.Spider, a robustness evaluation benchmark based on Spider, from AWS AI Lab for studying robustness in semantic parsing!
- 11/20/2022 Please check out our recent work DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. Examples, data, and code are available on the DS-1000 project site!
- 10/18/2022 Please check out our recent work Binder, a simple but SOTA neural-symbolic framework built on GPT-3 Codex and a SQL/Python interpreter. It injects GPT-3 Codex prompt API calls into programming languages! The demo, code, paper, and video are available on the Binder project site!
- 01/18/2022 Please check out our recent work UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. We open-sourced simple but strong/SOTA models for 21 tasks including text-to-SQL! The code is in the UnifiedSKG repo!
- 03/11/2021 Please check out a nice work from Google Research (including new Spider splits) for studying compositional generalization in semantic parsing!
- 11/15/2020 We now use Test Suite Accuracy as the official evaluation metric for Spider, SParC, and CoSQL. The evaluation code is available here. Note that test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
- 08/03/2020 Corrected "column_name" and "column_name_original" mismatches in two DBs ("scholar" and "formula_1") in tables.json, and reparsed the SQL queries (this only affects models, e.g. RATSQL, that use our parsed SQL as input). Please download the Spider dataset from this page again.
- 06/07/2020 We corrected some annotation errors and label mismatches (not errors) in the Spider dev and test sets (~4% of dev examples updated; click here for more details). Please download the Spider dataset from this page again.
- 01/16/2020 For value prediction (needed to compute execution accuracy), your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (which is available), or 3) generate numbers (e.g. the 3 in "LIMIT 3").
- 09/24/2019 (Min et al., EMNLP 2019) translated Spider to Chinese! Check out the Chinese challenge page.
- 05/17/2019 Our paper SParC: Cross-Domain Semantic Parsing in Context, with Salesforce Research, was accepted to ACL 2019! It introduces the context-dependent version of the Spider challenge: SParC!
- 05/17/2019 Please report any annotation errors here; we really appreciate your help and will update the data release this summer!
- 01/14/2019 The submission tutorial is out!
- 12/17/2018 We updated 7 SQLite database files (issue 14). Please download the Spider dataset from this page again.
- 10/25/2018 The evaluation script and results were updated (issue 5). Please download the latest versions of the script and papers. Also, please follow the instructions in issue 3 to generate the latest SQL parsing results (a bug was fixed).

## Why Spider?

*(spider chart comparing Spider 1.0 with prior text-to-SQL datasets)*

As the spider chart above shows, Spider 1.0 is distinct from most previous semantic parsing tasks:

- ATIS, Geo, Academic: each contains only a single database with a limited number of SQL queries, and the exact same SQL queries appear in the train and test splits.
- WikiSQL: the numbers of SQL queries and tables are significantly larger, but all SQL queries are simple, and each database is only a single table without any foreign keys.

Spider 1.0 spans the largest area in the chart, making it the first complex and cross-domain semantic parsing and text-to-SQL dataset! Read more in the blog post.

## Getting Started

The data is split into training, development, and test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license): Spider Dataset

Details of the baseline models and the evaluation script can be found on the Spider GitHub page.
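Each released JSON file (e.g. `train_spider.json`, `dev.json`) is a list of records that pair a natural-language question with its gold SQL query and target database. A minimal sketch of handling one such record, using the release's documented field names (`db_id`, `question`, `query`); the concrete question/query pair below is only illustrative:

```python
import json

# One record in the shape of the released files (train_spider.json / dev.json).
record = json.loads("""
{
  "db_id": "department_management",
  "question": "How many heads of the departments are older than 56 ?",
  "query": "SELECT count(*) FROM head WHERE age > 56"
}
""")

def describe(example):
    # Summarize one example: target database, question, and gold SQL.
    return f"[{example['db_id']}] {example['question']} -> {example['query']}"

print(describe(record))
```

In the real dataset, `db_id` names a SQLite database under the release's `database/` directory, and `tables.json` describes each database's schema.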

## Data Examples

Some examples look like the following:

*(example questions and their corresponding SQL queries)*

## Have Questions or Want to Contribute?

Ask us questions on our GitHub issues page, or contact Tao Yu, Rui Zhang, or Michihiro Yasunaga.

We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.

## Acknowledgement

We thank Graham Neubig, Tianze Shi, Catherine Finegan-Dollak, and the anonymous reviewers for their valuable comments on this project. We also thank Pranav Rajpurkar for giving us permission to build this website based on SQuAD.

Our team at the summit of East Rock Park in New Haven (the pose spells "NLseq2SQL"):


## Leaderboard - Execution with Values

Our current models do not predict any value in SQL conditions, so we do not provide execution accuracies for them. However, we encourage you to provide them in future submissions. For value prediction, your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (which is available), or 3) generate numbers (e.g. the 3 in "LIMIT 3"). Note: test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
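The three value sources above can be sketched as a toy candidate generator. This is not the evaluation code; `candidate_values` and its heuristics (quoted spans, substring retrieval, integer literals) are our own illustrative assumptions:

```python
import re

def candidate_values(question, db_cells):
    """Toy sketch of the three value sources: copy, retrieve, generate."""
    # 1) copy from the question inputs: quoted spans
    copied = re.findall(r'"([^"]+)"', question)
    # 2) retrieve from database content: cells mentioned in the question
    retrieved = [c for c in db_cells if str(c).lower() in question.lower()]
    # 3) generate numbers: literal integers (e.g. the 3 in "LIMIT 3")
    numbers = [int(n) for n in re.findall(r"\b\d+\b", question)]
    return copied + retrieved + numbers

print(candidate_values('List the 3 oldest heads from "Texas"', ["Texas", "Ohio"]))
```

A real system would rank these candidates and slot them into the predicted SQL conditions; here we only enumerate them.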

| Rank | Date | Model | Institution | Reference | Test |
|------|------|-------|-------------|-----------|------|
| 1 | Nov 2, 2023 | MiniSeek | Anonymous | Code and paper coming soon | 91.2 |
| 1 | Aug 20, 2023 | DAIL-SQL + GPT-4 + Self-Consistency | Alibaba Group | (Gao and Wang et al., 2023) code | 86.6 |
| 2 | Aug 9, 2023 | DAIL-SQL + GPT-4 | Alibaba Group | (Gao and Wang et al., 2023) code | 86.2 |
| 3 | Oct 17, 2023 | DPG-SQL + GPT-4 + Self-Correction | Anonymous | Code and paper coming soon | 85.6 |
| 4 | Apr 21, 2023 | DIN-SQL + GPT-4 | University of Alberta | (Pourreza et al., 2023) code | 85.3 |
| 5 | Jul 5, 2023 | Hindsight Chain of Thought with GPT-4 | Anonymous | Code and paper coming soon | 83.9 |
| 6 | Jun 1, 2023 | C3 + ChatGPT + Zero-Shot | Zhejiang University & Hundsun | (Dong et al., 2023) code | 82.3 |
| 7 | Jul 5, 2023 | Hindsight Chain of Thought with GPT-4 and Instructions | Anonymous | Code and paper coming soon | 80.8 |
| 8 | Feb 7, 2023 | RESDSQL-3B + NatSQL (DB content used) | Renmin University of China | (Li et al., AAAI 2023) code | 79.9 |
| 9 | Nov 21, 2022 | SeaD + PQL (DB content used) | Anonymous | | 78.5 |
| 10 | Apr 21, 2023 | DIN-SQL + CodeX | University of Alberta | (Pourreza et al., 2023) code | 78.2 |
| 11 | Aug 10, 2023 | T5-3B+NatSQL+Token Preprocessing (DB content used) | George Mason University & MIT | (Rai et al., ACL 2023) code | 78.0 |
| 12 | Sep 14, 2022 | CatSQL + GraPPa (DB content used) | Anonymous | | 78.0 |
| 13 | Sep 13, 2022 | Graphix-3B+PICARD (DB content used) | Alibaba DAMO & HKU STAR & SIAT | (Li et al., AAAI 2023) code | 77.6 |
| 14 | Sep 1, 2022 | SHiP+PICARD (DB content used) | AWS AI Labs | (Zhao et al., 2022) | 76.6 |
| 15 | Apr 4, 2023 | RASAT + NatSQL + Reranker (DB content used) | Anonymous | Paper coming soon | 76.5 |
| 16 | Dec 15, 2022 | N-best List Rerankers + PICARD (DB content used) | Alexa AI | (Zeng et al., IEEE SLT 2023) | 75.9 |
| 17 | Jun 4, 2022 | RASAT+PICARD (DB content used) | SJTU LUMIA & Netmind.AI | (Qi et al., EMNLP 2022) code | 75.5 |
| 18 | May 8, 2022 | T5-SR (DB content used) | Anonymous | | 75.2 |
| 19 | Aug 12, 2022 | RESDSQL+T5-1.1-lm100k-xl (DB content used) | Anonymous | | 75.1 |
| 20 | Jul 14, 2021 | T5-3B+PICARD (DB content used) | Element AI, a ServiceNow company | (Scholak et al., EMNLP 2021) code | 75.1 |
| 21 | Aug 12, 2022 | RESDSQL+T5-1.1-lm100k-large (DB content used) | Anonymous | | 74.8 |
| 22 | May 18, 2022 | SeaD + SP (DB content used) | Anonymous | | 74.1 |
| 23 | May 4, 2021 | RATSQL+GAP+NatSQL (DB content used) | Queen Mary University of London | (Gan et al., EMNLP Findings 2021) code | 73.3 |
| 24 | Aug 10, 2021 | T5-Base+NatSQL+Token Preprocessing (DB content used) | George Mason University & MIT | (Rai et al., ACL 2023) code | 71.1 |
| 25 | Mar 10, 2021 | SmBoP + GraPPa (DB content used) | Tel-Aviv University & Allen Institute for AI | (Rubin and Berant, NAACL 2021) code | 71.1 |
| 26 | Aug 5, 2021 | RaSaP + ELECTRA (DB content used) | Ant Group, ZhiXiaoBao & Ada | (Huang et al., 2021) | 70.0 |
| 27 | Nov 24, 2020 | BRIDGE v2 + BERT (ensemble) (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 68.3 |
| 28 | Jan 16, 2021 | COMBINE (DB content used) | Novelis.io Research | (Youssef et al., 2021) | 68.2 |
| 29 | Jul 22, 2022 | T5QL-Base (DB content used) | Anonymous | | 66.8 |
| 30 | Nov 24, 2020 | BRIDGE v2 + BERT (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 64.3 |
| 31 | May 30, 2020 | AuxNet + BART (DB content used) | Anonymous | | 62.6 |
| 32 | May 30, 2020 | BRIDGE + BERT (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 59.9 |
| 33 | May 20, 2020 | GAZP + BERT (DB content used) | University of Washington & Facebook AI Research | (Zhong et al., EMNLP 2020) | 53.5 |

## Leaderboard - Exact Set Match without Values

For the exact matching evaluation, instead of simply comparing the predicted and gold SQL query strings, we decompose each SQL query into several clauses and conduct a set comparison within each clause. Please refer to the paper and the GitHub page for more details. Note: test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
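The idea of clause-level set comparison can be sketched as follows. The official evaluation script parses SQL into a structured form; this keyword-splitting version and the helper names (`clause_sets`, `exact_set_match`) are only our simplified illustration:

```python
import re

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def clause_sets(sql):
    # Split on clause keywords, then treat each clause's comma-separated
    # items as a set, so item order within a clause does not matter.
    parts = re.split("(" + "|".join(CLAUSES) + ")", sql, flags=re.IGNORECASE)
    # parts alternates: ['', 'SELECT', ' a , b ', 'FROM', ' t ', ...]
    return {kw.upper(): {item.strip() for item in body.split(",")}
            for kw, body in zip(parts[1::2], parts[2::2])}

def exact_set_match(predicted, gold):
    return clause_sets(predicted) == clause_sets(gold)

# Column order inside a clause does not affect the set comparison:
print(exact_set_match("SELECT a , b FROM t", "SELECT b , a FROM t"))  # True
```

Unlike plain string equality, this scores `SELECT a , b` and `SELECT b , a` as the same clause; the real metric additionally normalizes nested structures, aliases, and values.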

| Rank | Date | Model | Institution | Reference | Dev | Test |
|------|------|-------|-------------|-----------|-----|------|
| 1 | Nov 2, 2023 | MiniSeek | Anonymous | Code and paper coming soon | 80.3 | 81.5 |
| 1 | Sep 13, 2022 | Graphix-3B + PICARD (DB content used) | Alibaba DAMO & HKU STAR & SIAT | (Li et al., AAAI 2023) code | 77.1 | 74.0 |
| 2 | Sep 14, 2022 | CatSQL + GraPPa (DB content used) | Anonymous | | 78.6 | 73.9 |
| 3 | Sep 1, 2022 | SHiP + PICARD (DB content used) | AWS AI Labs | (Zhao et al., 2022) | 77.2 | 73.1 |
| 4 | May 23, 2022 | G³R + LGESQL + ELECTRA (DB content used) | Southeast University & Tencent Cloud Xiaowei | (Xiang et al., ACL Findings 2023) | 78.1 | 72.9 |
| 6 | Aug 12, 2022 | RESDSQL+T5-1.1-lm100k-xl (DB content used) | Anonymous | | 78.1 | 72.4 |
| 6 | May 8, 2022 | T5-SR (DB content used) | Anonymous | | 77.2 | 72.4 |
| 7 | Dec 15, 2022 | N-best List Rerankers + PICARD (DB content used) | Alexa AI | (Zeng et al., IEEE SLT 2023) | 76.4 | 72.2 |
| 8 | Sep 1, 2021 | S²SQL + ELECTRA (DB content used) | Alibaba DAMO | (Hui et al., ACL Findings 2022) code | 76.4 | 72.1 |
| 9 | Feb 7, 2023 | RESDSQL-3B + NatSQL (DB content used) | Renmin University of China | (Li et al., AAAI 2023) code | 80.5 | 72.0 |
| 10 | Jun 1, 2021 | LGESQL + ELECTRA (DB content used) | SJTU X-LANCE Lab & AISpeech | (Cao et al., ACL 2021) code | 75.1 | 72.0 |
| 11 | Jul 14, 2021 | T5-3B+PICARD (DB content used) | Element AI, a ServiceNow company | (Scholak et al., EMNLP 2021) code | 75.5 | 71.9 |
| 12 | Aug 12, 2022 | RESDSQL+T5-1.1-lm100k-large (DB content used) | Anonymous | | 76.6 | 71.1 |
| 13 | Jun 4, 2022 | RASAT+PICARD (DB content used) | SJTU LUMIA & Netmind.AI | (Qi et al., EMNLP 2022) code | 75.3 | 70.9 |
| 14 | Nov 19, 2020 | DT-Fixup SQL-SP + RoBERTa (DB content used) | Borealis AI | (Xu et al., ACL 2021) code | 75.0 | 70.9 |
| 15 | Nov 19, 2020 | RAT-SQL + GraPPa + Adv (DB content used) | Anonymous | | 75.5 | 70.5 |
| 16 | Oct 18, 2021 | RATSQL++ + ELECTRA (DB content used) | Anonymous | | 75.7 | 70.3 |
| 17 | Nov 19, 2020 | SADGA + GAP (DB content used) | DMIR Lab | (Cai and Yuan et al., NeurIPS 2021) code | 73.1 | 70.1 |
| 18 | Dec 25, 2020 | RATSQL + GraPPa + GP (DB content used) | OCFT Gamma Big Data Lab | (Zhao et al., 2021) | 72.8 | 69.8 |
| 19 | Sep 8, 2020 | RATSQL + GAP (DB content used) | University of Waterloo & AWS AI Labs | (Shi et al., AAAI 2021) code | 71.8 | 69.7 |
| 20 | Apr 4, 2023 | RASAT + NatSQL + Reranker (DB content used) | Anonymous | Paper coming soon | 73.6 | 69.6 |
| 20 | Aug 18, 2020 | RATSQL + GraPPa (DB content used) | Yale & Salesforce Research | (Yu et al., ICLR 2021) code | 73.4 | 69.6 |
| 22 | Mar 10, 2021 | SmBoP + GraPPa (DB content used) | Tel-Aviv University & Allen Institute for AI | (Rubin and Berant, NAACL 2021) code | 74.7 | 69.5 |
| 23 | Aug 5, 2021 | RaSaP + ELECTRA (DB content used) | Ant Group, ZhiXiaoBao & Ada | (Huang et al., 2021) | 74.7 | 69.0 |
| 24 | May 4, 2021 | RATSQL+GAP+NatSQL (DB content used) | Queen Mary University of London | (Gan et al., EMNLP Findings 2021) code | - | 68.7 |
| 25 | Nov 20, 2020 | RAT-SQL + STRUG (DB content used) | Microsoft Research & OSU | (Deng et al., NAACL 2021) | 72.6 | 68.4 |
| 26 | Jun 1, 2021 | LGESQL + BERT (DB content used) | SJTU X-LANCE Lab & AISpeech | (Cao et al., ACL 2021) code | 74.1 | 68.3 |
| 27 | Jan 16, 2021 | COMBINE (DB content used) | Novelis.io Research | (Youssef et al., 2021) | 71.4 | 67.7 |
| 28 | Nov 24, 2020 | BRIDGE v2 + BERT (ensemble) (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 71.1 | 67.5 |
| 29 | Sep 8, 2020 | ShadowGNN + RoBERTa (DB content used) | SJTU X-LANCE Lab & AISpeech | (Chen et al., NAACL 2021) | 72.3 | 66.1 |
| 30 | Jul 22, 2022 | T5QL-Base (DB content used) | Anonymous | | 69.3 | 65.9 |
| 31 | May 02, 2020 | RATSQL v3 + BERT (DB content used) | Microsoft Research | (Wang and Shin et al., ACL 2020) code | 69.7 | 65.6 |
| 32 | Dec 7, 2020 | DuoRAT + BERT (DB content used) | Anonymous | | - | 65.4 |
| 33 | Sep 8, 2020 | YCSQL + BERT (DB content used) | Anonymous | | - | 65.3 |
| 34 | Jan 29, 2021 | ETA + BERT (DB content used) | Microsoft Research Asia | (Liu et al., ACL Findings 2021) | 70.8 | 65.3 |
| 35 | Nov 24, 2020 | BRIDGE v2 + BERT (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 70.0 | 65.0 |
| 36 | Oct 17, 2023 | DPG-SQL + GPT-4 + Self-Correction | Anonymous | Code and paper coming soon | - | 64.7 |
| 37 | Sep 8, 2020 | GP-RATSQL + BERT (DB content used) | Anonymous | | - | 64.5 |
| 38 | Nov 25, 2020 | RATSQL-HPFT + BERT (DB content used) | Anonymous | | - | 64.4 |
| 39 | Feb 2, 2021 | LGESQL + GLOVE (DB content used) | SJTU X-LANCE Lab & AISpeech | (Cao et al., ACL 2021) code | 67.6 | 62.8 |
| 40 | May 31, 2020 | AuxNet + BART (DB content used) | Anonymous | | 70.0 | 61.9 |
| 41 | Dec 13, 2019 | RATSQL v2 + BERT (DB content used) | Microsoft Research | (Wang and Shin et al., ACL 2020) code | 65.8 | 61.9 |
| 42 | May 31, 2020 | AuxNet + BART | Anonymous | | 68.0 | 61.3 |
| 43 | Feb 18, 2020 | RYANSQL v2 + BERT | Kakao Enterprise | (Choi et al., 2020) | 70.6 | 60.6 |
| 44 | Oct 19, 2020 | SmBoP + BART | Tel-Aviv University & Allen Institute for AI | (Rubin and Berant, 2020) | 66.0 | 60.5 |
| 45 | Dec 18, 2019 | IRNet++ + XLNet (DB content used) | Anonymous | | 65.5 | 60.1 |
| 46 | Apr 21, 2023 | DIN-SQL + GPT-4 | University of Alberta | (Pourreza et al., 2023) code | 60.1 | 60.0 |
| 47 | May 30, 2020 | BRIDGE + BERT (DB content used) | Salesforce Research | (Lin et al., EMNLP Findings 2020) code | 65.5 | 59.2 |
| 48 | Nov 12, 2019 | RYANSQL + BERT | Kakao Enterprise | (Choi et al., 2020) | 66.6 | 58.2 |
| 49 | Dec 13, 2019 | RATSQL v2 (DB content used) | Microsoft Research | (Wang and Shin et al., ACL 2020) code | 62.7 | 57.2 |
| 50 | Apr 21, 2023 | DIN-SQL + CodeX | University of Alberta | (Pourreza et al., 2023) code | 57.2 | 57.0 |
| 51 | Dec 13, 2019 | SLSQL + BERT + Data Annotation | National University of Singapore | (Lei and Wang et al., EMNLP 2020) code | 60.8 | 55.7 |
| 52 | Dec 13, 2019 | EditSQL+LSL + BERT | Anonymous | | 57.9 | 55.2 |
| 53 | Jun 24, 2019 | IRNet v2 + BERT | Microsoft Research Asia | | 63.9 | 55.0 |
| 54 | Sep 20, 2019 | GIRN + BERT | Anonymous | | 60.2 | 54.8 |
| 55 | May 19, 2019 | IRNet + BERT | Microsoft Research Asia | (Guo and Zhan et al., ACL 2019) code | 61.9 | 54.7 |
| 56 | Nov 4, 2019 | GNN + Bertrand-DR | Got It R&D | (Kelkar et al., 2020) code | 57.9 | 54.6 |
| 57 | Apr 8, 2020 | CNSQL | Anonymous | | 58.0 | 54.0 |
| 58 | Sep 19, 2019 | RATSQL | Anonymous | | 60.6 | 53.7 |
| 59 | Sep 1, 2019 | EditSQL + BERT | Yale University & Salesforce Research | (Zhang et al., EMNLP 2019) code | 57.6 | 53.4 |
| 60 | May 21, 2020 | GAZP + BERT | University of Washington & Facebook AI Research | (Zhong et al., EMNLP 2020) | - | 53.3 |
| 61 | May 21, 2020 | NatSQL v3 | Anonymous | | - | 53.2 |
| 62 | May 28, 2020 | IRNET + GNN | Anonymous | | - | 49.6 |
| 63 | Jun 24, 2019 | IRNet v2 | Microsoft Research Asia | | 55.4 | 48.5 |
| 64 | Aug 30, 2019 | Global-GNN (DB content used) | Tel-Aviv University & Allen Institute for AI | (Bogin et al., EMNLP 2019) code | 52.7 | 47.4 |
| 65 | Dec 13, 2019 | LSL | Anonymous | | 56.8 | 47.0 |
| 66 | Apr 5, 2020 | GraphSQL | Anonymous | | 52.8 | 46.9 |
| 67 | May 19, 2019 | IRNet | Microsoft Research Asia | (Guo and Zhan et al., ACL 2019) code | 53.2 | 46.7 |
| 68 | Mar 17, 2020 | SG-IRNet | Anonymous | | - | 46.6 |
| 69 | Dec 13, 2019 | NatSQL v2 | Anonymous | | 52.0 | 46.4 |
| 70 | Jun 11, 2019 | HSRNet | Anonymous | | 51.5 | 45.6 |
| 71 | Jun 12, 2019 | CFGN | Anonymous | | 48.7 | 44.1 |
| 72 | Aug 31, 2019 | NatSQL | Anonymous | | 52.9 | 42.5 |
| 73 | May 16, 2019 | GNN | Tel-Aviv University & Allen Institute for AI | (Bogin et al., ACL 2019) code | 40.7 | 39.4 |
| 74 | Feb 25, 2019 | SASeq | Anonymous | | 40.8 | 37.4 |
| 75 | May 30, 2019 | GrammarSQL | Allen Institute for AI | (Lin et al., 2019) | 34.8 | 33.8 |
| 76 | Sep 1, 2019 | EditSQL | Yale University & Salesforce Research | (Zhang et al., EMNLP 2019) code | 36.4 | 32.9 |
| 77 | Dec 13, 2019 | GuideSQL | Anonymous | | 36.8 | 31.5 |
| 78 | Sep 20, 2018 | SyntaxSQLNet + augment | Yale University | (Yu et al., EMNLP 2018) code | 24.8 | 27.2 |
| 79 | Apr 18, 2019 | RCSQL | SAP Labs Korea | (Lee, EMNLP 2019) | 28.5 | 24.3 |
| 80 | Sep 20, 2018 | SyntaxSQLNet | Yale University | (Yu et al., EMNLP 2018) code | 18.9 | 19.7 |
| 81 | Sep 20, 2018 | SQLNet | Shanghai Jiao Tong University (modified by Yale) | (Xu et al., 2018) code | 10.9 | 12.4 |
| 82 | Sep 20, 2018 | TypeSQL | Yale University | (Yu et al., NAACL 2018) code | 8.0 | 8.2 |
| 83 | Sep 20, 2018 | Seq2Seq + attention | University of Edinburgh (modified by Yale) | (Dong and Lapata, ACL 2016) code | 1.8 | 4.8 |
Other papers that used Spider (evaluated on the dev set but not the test set):
  1. (Wang et al., KDD 2022), Alibaba DAMO
  2. (Ma et al., VLDB 2022), HKUST
  3. (Qin et al., COLING 2022), Alibaba DAMO
  4. (Usta et al., VLDB 2021), Bilkent University
  5. (Min et al., EMNLP 2019), Westlake University, Spider in Chinese
  6. (Yao et al., EMNLP 2019), OSU & Facebook AI Research
  7. (Shaw et al., ACL 2019), Google
  8. (Shin et al., NeurIPS 2019), UC Berkeley & MSR
  9. (Weir et al., SIGMOD 2019), Brown University & TU Darmstadt
  10. (Baik et al., ICDE 2019), U of Michigan & IBM