Eye tracking
An eye tracker measures eye positions and eye movements. Gaze patterns provide an indirect measure of the cognitive processes that occur during reading. Many eye-tracking corpora are available in multiple languages and can be used for NLP purposes.
This collection contains corpora in the following languages:
- Chinese
- Danish
- Dutch
- English
- French
- German
- Hindi
- Persian
- Portuguese (Brazilian)
- Russian
- Spanish
- Swedish
- Turkish
- Multilingual
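
Several corpora in this collection provide word-level reading measures such as first fixation duration, single fixation duration, gaze duration, go-past time, total reading time, and skipping. As a rough illustration of how such measures are commonly derived from a fixation sequence, here is a minimal Python sketch. The `Fixation` class, its field names, and the `reading_measures` function are hypothetical and are not taken from any of the corpora; the definitions follow common conventions in the reading literature, and individual corpora may define or name these measures differently, so check each dataset's documentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Fixation:
    # Hypothetical record layout; real corpora use their own column names.
    word_index: int   # 0-based index of the fixated word in the sentence
    duration_ms: int  # fixation duration in milliseconds


def reading_measures(fixations: List[Fixation], n_words: int) -> List[Dict[str, Optional[int]]]:
    """Compute common word-level measures from a time-ordered fixation list."""
    results = []
    for w in range(n_words):
        # Total reading time: sum of all fixations on the word.
        total = sum(f.duration_ms for f in fixations if f.word_index == w)

        # First-pass entry: first fixation on w that occurs before any
        # fixation on a word to the right of w.
        start = None
        for i, f in enumerate(fixations):
            if f.word_index > w:
                break
            if f.word_index == w:
                start = i
                break

        if start is None:
            # Word was skipped during first pass (it may still be read later).
            results.append({"skipped": True, "ffd": None, "sfd": None,
                            "gaze": None, "go_past": None, "total": total})
            continue

        # Gaze duration: consecutive fixations on w from first entry
        # until the gaze leaves the word in either direction.
        gaze, n_first_pass, i = 0, 0, start
        while i < len(fixations) and fixations[i].word_index == w:
            gaze += fixations[i].duration_ms
            n_first_pass += 1
            i += 1

        # Go-past time: all fixations from first entry until a word to the
        # right of w is fixated, including regressions to earlier words.
        go_past, j = 0, start
        while j < len(fixations) and fixations[j].word_index <= w:
            go_past += fixations[j].duration_ms
            j += 1

        ffd = fixations[start].duration_ms
        results.append({"skipped": False,
                        "ffd": ffd,                                  # first fixation duration
                        "sfd": ffd if n_first_pass == 1 else None,   # single fixation duration
                        "gaze": gaze,
                        "go_past": go_past,
                        "total": total})
    return results


# Toy example: one reader, a four-word sentence.
example = [Fixation(0, 200), Fixation(1, 180), Fixation(1, 120),
           Fixation(0, 150), Fixation(1, 100), Fixation(3, 250),
           Fixation(2, 210)]
for index, row in enumerate(reading_measures(example, n_words=4)):
    print(index, row)
```

Skipping probability, as reported in some corpora, would then be the proportion of readers for whom `skipped` is true on a given word.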
Stimulus: 300 one-line single sentences and 7 multiline passages in simplified Chinese
Participants: 98 native speakers
Data: https://osf.io/7uq3j/
Reference: Wu & Kit (2023)
Stimulus: 8,551 Chinese words in full sentences
Participants: 1,718 students, native speakers
Data: https://osf.io/94wue/
Reference: Zhang et al. (2022)
Stimulus: Novel "The Mysterious Affair at Styles" ("斯泰尔斯庄园奇案" in Chinese) written by Agatha Christie; contains L1 Chinese reading and L2 English reading
Participants: 30 participants
Data: https://osf.io/pmvhd/?view_only=77def2827a514254957cc846e14826cf
Reference: Sui et al. (2022)
Stimulus: 150 sentences
Participants: 60 participants
Data: https://osf.io/vr3k8/
Reference: Pan et al. (2021)
Stimulus: 100 articles from public news websites
Participants: 50 participants
Data: https://github.com/MMLabTHUSZ/ADEGBTS
Reference: Yi et al. (2020)
Stimulus: 90 sentences
Participants: 35
Provided features: first fixation duration, single fixation duration, gaze duration, total fixation duration, skipping probability
Data: https://osf.io/e2ws6/
Reference: Zang et al. (2018)
Stimulus: 15 questions and 60 answer documents
Participants: 29
Provided features: fixations and saccades
Data: http://www.thuir.cn/group/~YQLiu/
Reference: Li et al. (2018)
Stimulus: 20 speech manuscripts
Participants: 22 native speakers, 19 native speakers with dyslexia, 10 Danish L2 speakers of various levels
Provided features: first fixation duration, mean fixation duration, first pass duration, go-past time, total reading time, landing position, number of fixations, mean saccade duration, peak saccade velocity
Data: https://osf.io/ud8s5/
Reference: Hollenstein et al. (2022)
Also includes EEG data.
Stimulus: 200 Dutch sentences from the SONAR-500 Dutch corpus (book section)
Participants: 37
Provided features: raw eye-tracking data and preprocessed eye-tracking data at the fixation, word, and trial levels
Data: https://data.ru.nl/collections/ru/cls/eeg_et_sentence_reading_dsc_556?0
Reference: Frank & Aumeistere (2023)
Also has an English part.
Stimulus: novel by Agatha Christie
Participants: 19 bilingual, 14 monolingual readers
Provided features: first fixation duration, single fixation duration, go-past time, total reading time, gaze duration
Data: http://expsy.ugent.be/downloads/geco/
Reference: Cop et al. (2017)
Stimulus: 3 existing Dutch short stories (2143, 2659, and 2988 words)
Participants: 102
Data: https://osf.io/qgx26/
Reference: Mak & Willems (2019)
Stimulus: Novel "The Mysterious Affair at Styles" ("斯泰尔斯庄园奇案" in Chinese) written by Agatha Christie; contains L1 Chinese reading and L2 English reading
Participants: 30 participants
Data: https://osf.io/pmvhd/?view_only=77def2827a514254957cc846e14826cf
Reference: Sui et al. (2022)
Stimulus: Sentences from the Wall Street Journal
Participants: 69 native English speakers and 296 English learners
Data: https://github.com/berzak/celer (access restricted due to licensing of the reading materials)
Reference: Berzak et al. (2022)
Stimulus: 3990 question and image pairs (tailored towards visual question answering), tagged and balanced by reasoning type and difficulty.
Participants: 49 participants
Data: https://perceptualui.org/publications/sood21_conll/
Reference: Sood et al. (2021)
Stimulus: four SAT passages for reading comprehension from practice tests
Participants: 95 undergraduate students
Data: https://github.com/ahnchive/SB-SAT
Reference: Ahn et al. (2020)
Stimulus: 32 documents (around 200-250 words each) containing movie plot synopses from the MovieQA dataset
Participants: 23 English native speakers
Data: https://perceptualui.org/research/datasets/MQA-RC/
Reference: Sood et al. (2020)
Simultaneous eye tracking and EEG recordings.
Stimulus: Sentences from Wikipedia and the Stanford Sentiment Treebank, normal reading and annotation
Participants: 12
Provided features: number of fixations, first fixation duration, single fixation duration, go-past time, total reading time, gaze duration, pupil size
Data: https://osf.io/q3zws/
Reference: Hollenstein et al. (2018)
Stimulus: 700 sentences from Wikipedia, normal reading and annotation
Participants: 18
Provided features: number of fixations, first fixation duration, single fixation duration, go-past time, total reading time, gaze duration, pupil size
Data: https://osf.io/2urht/
Reference: Hollenstein et al. (2020)
Also includes a Dutch part.
Stimulus: novel by Agatha Christie
Participants: 19 bilingual, 14 monolingual readers
Provided features: first fixation duration, single fixation duration, go-past time, total reading time, gaze duration
Data: http://expsy.ugent.be/downloads/geco/
Reference: Cop et al. (2017)
Stimulus: 40 passages of text
Participants: 48
Data: https://osf.io/4qtnf/
Reference: Parker et al. (2017)
Stimulus: online news articles, popular science magazines, and public-domain works of fiction
Participants: 84
Data: https://osf.io/sjefs/
Reference: Luke & Christianson (2016)
Stimulus: 27 individual texts from various domains (4,658 words in total)
Participants: 14-20 (includes data from participants with and without autism)
Data: https://github.com/anomymous1/ASD-Data/
Provided features: Time to 1st View (sec), Time Viewed (sec), Time Viewed (%), Fixations (#), Revisits (#). Texts also include readability scores and comprehension questions.
Reference: Yaneva (2016)
The Center for Indian Language Technology (CFILT) offers six eye-tracking datasets recorded specifically for NLP purposes.
Data: http://www.cfilt.iitb.ac.in/cognitive-nlp/
Stimulus: 48 essays selected from the ASAP AEG dataset
Participants: 8 fluent English speakers
Reference: Mathias et al. (2020)
Stimulus: 30 texts from different sources (Wikipedia, Simple Wikipedia, news articles)
Participants: 20 fluent English speakers
Reference: Mathias et al. (2018)
Stimulus: sentences from Wikipedia and Simple Wikipedia
Participants: 16
Reference: Mishra et al. (2017)
Stimulus: Twitter or Amazon movie reviews
Participants: 7
Reference: Mishra et al. (2016)
Stimulus: MUC-6 dataset
Participants: 14
Reference: Cheri et al. (2016)
Stimulus: movie reviews from a movie corpus and from Twitter
Participants: 5
Reference: Joshi et al. (2014)
Stimulus: 205 sentences
Participants: 43
Data: https://link.springer.com/article/10.3758/s13428-012-0313-y#SupplementaryMaterial
Reference: Frank et al. (2013)
This dataset also includes self-paced reading times.
Stimulus: newspaper articles
Participants: 10
Data: can be provided by Alan Kennedy upon request.
Reference: Kennedy et al. (2003)
Also includes a French part.
Stimulus: newspaper articles
Participants: 10
Data: can be provided by Alan Kennedy upon request.
Reference: Kennedy et al. (2003)
Also includes an English part.
Stimulus: Scientific texts read by experts and non-experts
Participants: 75
Data: https://osf.io/dn5hp/
Reference: Jäger et al. (2021)
Stimulus: German language lessons using the web-based Duolingo
Participants: 22 participants (either native English speakers or fluent in English)
Data: https://figshare.com/s/688e387fbfdc000f4e90
Reference: Notaro et al. (2018)
This dataset also contains EEG and mouse-movement metrics.
Stimulus: 144 sentences
Participants: 33
Provided features: predictability estimates
Data: http://read.psych.uni-potsdam.de/
Reference: Kliegl et al. (2004)
Stimulus: 153 sentences from the Hindi-Urdu treebank
Participants: 30
Provided features: lexical features, first fixation duration, total fixation time, first-pass reading time, regression path duration, etc.
Data: https://osf.io/dh54b/
Reference: Husain et al. (2015)
Stimulus: 136 sentences
Participants: 40
Provided features: FFD, FFP, SFD, FPRT, RBRT, TFT, RPD, CRPD, RRT, RRTP, RRTR, RBRC, TRC, LPRT
Data: http://www.ling.uni-potsdam.de/~vasishth/code/SafaviEtAl2016DataCode.zip
Reference: Safavi et al. (2016)
Also contains self-paced reading times.
Stimulus: 50 short paragraphs of various genres
Participants: 37
Data: https://osf.io/9jxg3/
Reference: Leal et al. (forthcoming)
Stimulus: 144 sentences
Participants: 96
Data: https://osf.io/x5q2r/
Reference: Laurinavichyute et al. (2018)
Stimulus: 264 sentences including various syntactic phenomena
Participants: 76
Data: https://github.com/bnicenboim/papers/tree/master/NicenboimEtAl2015.%20Working%20memory%20differences%20in%20long-distance%20dependency%20resolution
Reference: Nicenboim et al. (2015)
This study also contains a self-paced reading experiment.
Stimulus: 48 sentence pairs where the first sentence included a noun referring to a person (e.g., sister, hairdresser, person) and the second included a pronoun referring to the noun.
Participants: 120 participants
Data: https://figshare.com/articles/dataset/Open_data_Are_new_gender-neutral_pronouns_difficult_to_process_in_reading_The_case_of_hen_in_Swedish/13143158/1
Reference: Vergoossen et al. (2020)
Stimulus: 192 short texts, each composed of 1-3 sentences
Participants: 215
Provided features: total fixation duration, gaze duration, first fixation duration, and more
Data: https://osf.io/w53cz/
Reference: Acartürk et al. (2023)
Stimulus: 12 short texts about general domain topics, native speakers reading in their own language
Participants: 580 readers of 13 languages (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Korean, Norwegian, Russian, Spanish, and Turkish)
Data: https://osf.io/3527a/
Reference: Siegelman et al. (2022)
Stimulus: 12 short texts about general domain topics, L2 speakers reading in English
Participants: 543 readers of 12 languages (Dutch, English, Estonian, Finnish, German, Greek, Hebrew, Italian, Norwegian, Russian, Spanish, and Turkish)
Data: https://osf.io/q9h43/
Reference: Kuperman et al. (2022)