The codebook specifies the data types, possible values, and other information for each column in the data files.
- Word features
- Stimuli and comprehension questions
- Items
- Areas of interest (AOI)
- Dependency trees
- Fixations
- Scanpaths
- Reading measures
- Reading measures merged
- Scanpaths merged
- Roi to word mapping
- Participants
TODO: insert short text about this section in this file
Please find the files under this link: Word features
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
word | | string | Words as they appear in the stimuli texts. Words are split at white-space. | 0 | nan | nan |
word_with_punct | | | The word as it appears in the text, including punctuation. | 0 | nan | nan |
word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
word_index_in_sent | 1-51 | Integer | The index of the word in the sentence. Indexing starts at 1. | 0 | nan | nan |
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
word_limit_char_indices | no stats? | | Specifies the limits of each word in character indices. Format: [word_start],[word_end]. For example: 3,7 means a word starts at character index 3 in the text and ends at character index 7. The properties of the character indices are specified in char_index_in_text. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_domain | biology: 954, physics: 941 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
text_domain_numeric | 0: 954, 1: 941 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
word_length | 2-33 | Integer | Word length in number of characters, including symbols such as hyphens but excluding sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters). | 0 | nan | nan |
STTS_punctuation_before | nan: 1883, $(: 12 | Categorical | If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 1883 | nan | nan |
STTS_punctuation_after | nan: 1689, | Categorical | If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 1689 | nan | nan |
is_in_quote | 0: 1881, 1: 14 | Categorical | Whether or not the word is part of an expression in quotes. | 0 | nan | Manually tagged |
is_in_parentheses | 0: 1890, 1: 5 | Categorical | Whether or not the word is part of a phrase in parentheses. | 0 | nan | Manually tagged |
is_clause_beginning | 0: 1796, 1: 99 | Categorical | Whether or not the word is the beginning of a clause. | 0 | nan | Manually tagged |
is_sent_beginning | 0: 1798, 1: 97 | Categorical | Whether or not the word is the beginning of a new sentence. | 0 | nan | Manually tagged |
is_clause_end | | | Whether or not the word is the end of a clause. | 0 | nan | nan |
is_sent_end | | | Whether or not the word is the end of a sentence. | 0 | nan | nan |
is_abbreviation | 0: 1890, 1: 5 | Categorical | Whether or not the entire word is an abbreviation. | 0 | nan | Manually tagged |
is_expert_technical_term | 0: 1740, 1: 155 | Categorical | 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". | 0 | nan | Manually tagged |
is_general_technical_term | 0: 1646, 1: 249 | Categorical | 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" | 0 | nan | nan |
contains_symbol | 0: 1887, 1: 8 | Categorical | Whether or not the word contains a symbol. E.g.: β-D-Glucose | 0 | nan | nan |
contains_hyphen | 0: 1866, 1: 29 | Categorical | Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words with the tag TRUNC (compositional first element) are excluded: in "Sekundär- und Tertiärstrukturen", "Sekundär-" does not count as having a hyphen. | 0 | nan | nan |
contains_abbreviation | 0: 1883, 1: 12 | Categorical | Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. | 0 | nan | nan |
STTS_PoS_tag | ADJA: 154, ADJD: 53, ADV: 73, APPR: 184, APPRART: 48, APZR: 1, ART: 276, CARD: 9, KOKOM: 17, KON: 66, KOUI: 6, KOUS: 16, NE: 4, NN: 515, PAV: 18, PDAT: 16, PDS: 7, PIAT: 5, PIDAT: 9, PIS: 10, PPER: 25, PPOSAT: 7, PRELAT: 6, PRELS: 29, PRF: 25, PTKA: 1, PTKNEG: 4, PTKVZ: 13, PTKZU: 10, PWAV: 1, TRUNC: 5, VAFIN: 73, VAINF: 8, VMFIN: 25, VMINF: 1, VVFIN: 102, VVINF: 33, VVIZU: 2, VVPP: 38 | Categorical | Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. | 0 | nan | Manually tagged |
type | | string | The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. | 4 | nan | dlexDB |
type_length_chars | 2.0-33.0 | Integer | The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. | 1 | nan | nan |
PoS_tag | adja: 162, adjd: 54, adv: 91, appr: 182, apprart: 48, art: 280, card: 9, kokom: 17, kon: 63, koui: 5, kous: 16, ne: 7, nn: 508, pdat: 16, pds: 7, piat: 5, pidat: 2, pis: 14, pper: 24, pposat: 7, prelat: 6, prels: 24, prf: 25, ptka: 1, ptkneg: 4, ptkvz: 15, ptkzu: 10, pwav: 1, trunc: 5, vafin: 73, vainf: 8, vmfin: 24, vminf: 1, vvfin: 103, vvinf: 33, vvizu: 2, vvpp: 38, xy: 5 | Categorical | Part-of-speech tag as defined by the dlexDB query. | 0 | nan | dlexDB |
lemma | | string | nan | 4 | nan | dlexDB |
lemma_length_chars | 1.0-32.0 | Integer | nan | 3 | nan | dlexDB |
syllables | | string | nan | 25 | nan | dlexDB |
type_length_syllables | 1.0-14.0 | Integer | nan | 24 | nan | dlexDB |
annotated_type_frequency_normalized | min: 0.00817507899599, max: 24738.5901996, mean: 3889.8532, std: 6967.089 | Float | The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. | 127 | nan | dlexDB |
type_frequency_normalized | min: 0.00817507899599, max: 26530.3631386, mean: 4409.2283, std: 7712.5287 | Float | nan | 115 | nan | dlexDB |
lemma_frequency_normalized | min: 0.00817507899599, max: 80100.3069113, mean: 13063.8057, std: 25247.1898 | Float | nan | 115 | nan | dlexDB |
familiarity_normalized | min: 0.0, max: 26530.3631386, mean: 4074.0362, std: 7634.0602 | Float | nan | 117 | nan | dlexDB |
regularity_normalized | min: 0.0, max: 2123.30585022, mean: 37.6119, std: 123.3575 | Float | nan | 116 | nan | dlexDB |
document_frequency_normalized | min: 0.126068429944, max: 9372.80956103, mean: 3073.6225, std: 3377.4549 | Float | nan | 116 | nan | dlexDB |
sentence_frequency_normalized | min: 0.0155184320176, max: 30912.3596552, mean: 6119.8019, std: 9642.457 | Float | nan | 116 | nan | dlexDB |
cumulative_syllable_corpus_frequency_normalized | min: 1.40611358731, max: 125126.524676, mean: 16825.508, std: 15793.39 | Float | nan | 116 | nan | dlexDB |
cumulative_syllable_lexicon_frequency_normalized | min: 0.428085856899, max: 218985.607753, mean: 23221.2613, std: 31879.0143 | Float | nan | 119 | nan | dlexDB |
cumulative_character_corpus_frequency_normalized | min: 15533.2550482, max: 7810554.20193, mean: 1917789.2641, std: 1253328.3202 | Float | nan | 116 | nan | dlexDB |
cumulative_character_lexicon_frequency_normalized | min: 47003.8270876, max: 18380479.713, mean: 4265792.357, std: 2812004.0938 | Float | nan | 116 | nan | dlexDB |
cumulative_character_bigram_corpus_frequency_normalized | min: 5138.64210483, max: 1322150.62097, mean: 363265.3368, std: 217175.5613 | Float | nan | 116 | nan | dlexDB |
cumulative_character_bigram_lexicon_frequency_normalized | min: 12677.7626521, max: 2788357.77704, mean: 590209.5889, std: 442407.5129 | Float | nan | 116 | nan | dlexDB |
cumulative_character_trigram_corpus_frequency_normalized | min: 4358.04468689, max: 603427.130456, mean: 227949.9158, std: 122856.9432 | Float | nan | 116 | nan | dlexDB |
cumulative_character_trigram_lexicon_frequency_normalized | min: 11942.3111499, max: 899592.89035, mean: 237804.6839, std: 171696.6712 | Float | nan | 116 | nan | dlexDB |
initial_letter_frequency_normalized | min: 199.202149895, max: 110461.430317, mean: 38381.0963, std: 33346.9984 | Float | nan | 116 | nan | dlexDB |
initial_bigram_frequency_normalized | min: 1.57779024623, max: 53801.2331077, mean: 12768.0203, std: 14670.9631 | Float | nan | 116 | nan | dlexDB |
initial_trigram_frequency_normalized | min: -0.00817507899599, max: 29048.3692201, mean: 5888.4981, std: 8949.4325 | Float | nan | 116 | nan | dlexDB |
avg_cond_prob_in_bigrams | min: 1.2e-07, max: 0.5006180465, mean: 0.0451, std: 0.0448 | Float | The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. | 116 | nan | dlexDB |
avg_cond_prob_in_trigrams | min: 3.153e-06, max: 25.0, mean: 0.2526, std: 0.6009 | Float | The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. | 116 | nan | dlexDB |
neighbors_coltheart_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2248.7136, std: 7540.5582 | Float | nan | 116 | nan | dlexDB |
neighbors_coltheart_higher_freq_count_normalized | min: 0.0, max: 8.13363128109, mean: 0.2077, std: 0.5007 | Float | nan | 116 | nan | dlexDB |
neighbors_coltheart_all_cum_freq_normalized | min: 0.0, max: 49782.1108458, mean: 5076.6032, std: 10127.1033 | Float | nan | 116 | nan | dlexDB |
neighbors_coltheart_all_count_normalized | min: 0.0, max: 47.5175301158, mean: 15.7971, std: 14.4153 | Float | nan | 116 | nan | dlexDB |
neighbors_levenshtein_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2879.4346, std: 7921.0448 | Float | nan | 116 | nan | dlexDB |
neighbors_levenshtein_higher_freq_count_normalized | min: 0.0, max: 11.9864039932, mean: 0.3277, std: 0.6576 | Float | nan | 116 | nan | dlexDB |
neighbors_levenshtein_all_cum_freq_normalized | min: 0.0, max: 54875.2749862, mean: 6722.366, std: 11598.2601 | Float | nan | 116 | nan | dlexDB |
neighbors_levenshtein_all_count_normalized | min: 0.0, max: 75.7711966712, mean: 24.6418, std: 22.5295 | Float | nan | 116 | nan | dlexDB |
sent_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the text as context. | 0 | nan | nan |
sent_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the text as context. | 0 | nan | nan |
sent_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the text as context. | 0 | nan | nan |
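
Two of the encodings in the table above can be decoded mechanically. The following is a minimal sketch, not part of the corpus code: the helper names `parse_word_limits` and `text_id_to_numeric` are hypothetical. Whether the end index in `word_limit_char_indices` is inclusive is not stated here, so the sketch only parses the two numbers and does not slice any text.

```python
from typing import Tuple

# Order taken from the text_id_numeric description: 0=b0, ..., 5=b5, 6=p0, ..., 11=p5.
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5", "p0", "p1", "p2", "p3", "p4", "p5"]


def parse_word_limits(value: str) -> Tuple[int, int]:
    """Split a '[word_start],[word_end]' string into two integer character indices."""
    start, end = value.split(",")
    return int(start), int(end)


def text_id_to_numeric(text_id: str) -> int:
    """Map a text_id such as 'p0' to its text_id_numeric value."""
    return TEXT_IDS.index(text_id)


print(parse_word_limits("3,7"))
print(text_id_to_numeric("p0"))
```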
TODO: insert short text about this section in this file
Please find the file under this link: Stimuli including comprehension questions
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_domain | biology: 6, physics: 6 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
text_domain_numeric | 0: 6, 1: 6 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
source | | | The source of the stimulus text. | 0 | nan | nan |
headline | | string | The header of the respective stimulus text. | 0 | nan | nan |
tq_1 | | string | Text question 1. | 0 | nan | nan |
tq_1_option1 | | string | Option 1 for text question 1. | 0 | nan | nan |
tq_1_option2 | | string | Option 2 for text question 1. | 0 | nan | nan |
tq_1_option3 | | string | Option 3 for text question 1. | 0 | nan | nan |
tq_1_option4 | | string | Option 4 for text question 1. | 0 | nan | nan |
tq_2 | | string | Text question 2. | 0 | nan | nan |
tq_2_option1 | | string | Option 1 for text question 2. | 0 | nan | nan |
tq_2_option2 | | string | Option 2 for text question 2. | 0 | nan | nan |
tq_2_option3 | | string | Option 3 for text question 2. | 0 | nan | nan |
tq_2_option4 | | string | Option 4 for text question 2. | 0 | nan | nan |
tq_3 | | string | Text question 3. | 0 | nan | nan |
tq_3_option1 | | string | Option 1 for text question 3. | 0 | nan | nan |
tq_3_option2 | | string | Option 2 for text question 3. | 0 | nan | nan |
tq_3_option3 | | string | Option 3 for text question 3. | 0 | nan | nan |
tq_3_option4 | | string | Option 4 for text question 3. | 0 | nan | nan |
bq_1 | | string | Background question 1. | 0 | nan | nan |
bq_1_option1 | | string | Option 1 for background question 1. | 0 | nan | nan |
bq_1_option2 | | string | Option 2 for background question 1. | 0 | nan | nan |
bq_1_option3 | | string | Option 3 for background question 1. | 0 | nan | nan |
bq_1_option4 | | string | Option 4 for background question 1. | 0 | nan | nan |
bq_2 | | string | Background question 2. | 0 | nan | nan |
bq_2_option1 | | string | Option 1 for background question 2. | 0 | nan | nan |
bq_2_option2 | | string | Option 2 for background question 2. | 0 | nan | nan |
bq_2_option3 | | string | Option 3 for background question 2. | 0 | nan | nan |
bq_2_option4 | | string | Option 4 for background question 2. | 0 | nan | nan |
bq_3 | | string | Background question 3. | 0 | nan | nan |
bq_3_option1 | | string | Option 1 for background question 3. | 0 | nan | nan |
bq_3_option2 | | string | Option 2 for background question 3. | 0 | nan | nan |
bq_3_option3 | | string | Option 3 for background question 3. | 0 | nan | nan |
bq_3_option4 | | string | Option 4 for background question 3. | 0 | nan | nan |
correct_ans_tq_1 | 1-4 | Integer | The index of the correct answer for text question 1, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_1_option2" is the correct answer to this question. | 0 | nan | nan |
correct_ans_tq_2 | 1-4 | Integer | The index of the correct answer for text question 2, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_2_option2" is the correct answer to this question. | 0 | nan | nan |
correct_ans_tq_3 | 1-4 | Integer | The index of the correct answer for text question 3, given as the option number of that question in this file. For example, 2 means that the answer in the column "tq_3_option2" is the correct answer to this question. | 0 | nan | nan |
correct_ans_bq_1 | 1-4 | Integer | The index of the correct answer for background question 1, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_1_option2" is the correct answer to this question. | 0 | nan | nan |
correct_ans_bq_2 | 1-4 | Integer | The index of the correct answer for background question 2, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_2_option2" is the correct answer to this question. | 0 | nan | nan |
correct_ans_bq_3 | 1-4 | Integer | The index of the correct answer for background question 3, given as the option number of that question in this file. For example, 2 means that the answer in the column "bq_3_option2" is the correct answer to this question. | 0 | nan | nan |
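
The relation between the `correct_ans_*` columns and the option columns can be sketched as a simple lookup. This is an illustration under assumptions: the function name `correct_answer_text` is hypothetical, a row is represented as a plain dictionary, and the example values are made up rather than taken from the stimuli file.

```python
from typing import Any, Mapping


def correct_answer_text(row: Mapping[str, Any], question: str) -> str:
    """Return the text of the correct option for a question such as 'tq_1' or 'bq_3'."""
    option_number = int(row[f"correct_ans_{question}"])
    return str(row[f"{question}_option{option_number}"])


# Made-up row for illustration only; real rows come from the stimuli file.
example_row = {
    "tq_1": "Example question?",
    "tq_1_option1": "first",
    "tq_1_option2": "second",
    "tq_1_option3": "third",
    "tq_1_option4": "fourth",
    "correct_ans_tq_1": 2,
}

print(correct_answer_text(example_row, "tq_1"))  # prints "second"
```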
TODO: insert short text about this section in this file
Please find the file under this link: Items
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
version | 0-119 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
text_domain | biology: 720, physics: 720 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
order_bq_1_ans | no stats? | | The order in which the answers for background question 1 were presented. | 0 | nan | nan |
order_bq_2_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
order_bq_3_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
order_tq_1_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
order_tq_2_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
order_tq_3_ans | no stats? | | See description of order_bq_1_ans | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
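
The `trial` column determines the order in which the texts of a given version were read. The sketch below recovers that order, assuming the items file holds one row per combination of `version` and `text_id`. The function name `reading_order` is hypothetical and the rows are made up and shortened to three texts.

```python
from typing import Any, Dict, List, Sequence


def reading_order(items_rows: Sequence[Dict[str, Any]], version: int) -> List[str]:
    """Return the text_ids of one version, sorted by trial number."""
    rows = [row for row in items_rows if row["version"] == version]
    return [row["text_id"] for row in sorted(rows, key=lambda row: row["trial"])]


# Made-up rows for illustration only.
example_items = [
    {"version": 0, "text_id": "b0", "trial": 2},
    {"version": 0, "text_id": "p3", "trial": 1},
    {"version": 0, "text_id": "b4", "trial": 3},
    {"version": 1, "text_id": "b0", "trial": 1},
]

print(reading_order(example_items, version=0))  # prints ['p3', 'b0', 'b4']
```

In this toy data, text b0 has trial number 2 in version 0, so it is read as the second text.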
TODO: insert short text about this section in this file
Please find the files under this link: AOI
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
aoi_type | | | The shape of the area of interest. In this corpus, all aois are rectangles around the characters. | 0 | nan | nan |
aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | nan |
start_x | 80-1622 | Integer | The x-coordinate in pixels of the top left corner of the aoi rectangle. | 0 | nan | nan |
start_y | 21-920 | Integer | The y-coordinate in pixels of the top left corner of the aoi rectangle. | 0 | nan | nan |
end_x | 92-1634 | Integer | The x-coordinate in pixels of the bottom right corner of the aoi rectangle. | 0 | nan | nan |
end_y | 99-998 | Integer | The y-coordinate in pixels of the bottom right corner of the aoi rectangle. | 0 | nan | nan |
character | | string | Character as text. | 0 | nan | nan |
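
Since each AOI is a rectangle given by its top-left and bottom-right corners, mapping a gaze position to a character amounts to a point-in-rectangle test. The following is a sketch under assumptions: the names `Aoi` and `find_aoi` are hypothetical, the coordinates are made up, and treating the right and bottom edges as exclusive is a choice made here so that adjacent rectangles do not overlap, not something this codebook specifies.

```python
from typing import NamedTuple, Optional, Sequence


class Aoi(NamedTuple):
    aoi: int          # character index in the text
    start_x: int      # top-left corner, pixels
    start_y: int
    end_x: int        # bottom-right corner, pixels
    end_y: int
    character: str


def find_aoi(x: float, y: float, aois: Sequence[Aoi]) -> Optional[Aoi]:
    """Return the first AOI whose rectangle contains (x, y), or None."""
    for a in aois:
        # Assumption: left/top edges inclusive, right/bottom edges exclusive.
        if a.start_x <= x < a.end_x and a.start_y <= y < a.end_y:
            return a
    return None


# Made-up AOIs for illustration only.
example_aois = [
    Aoi(1, 80, 21, 92, 99, "D"),
    Aoi(2, 92, 21, 104, 99, "i"),
]

hit = find_aoi(95, 50, example_aois)
print(hit.character if hit else None)  # prints "i"
```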
TODO: insert short text about this section in this file
Please find the file under this link: Dependency trees
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
sentence | | string | The sentence in the text. | 0 | nan | nan |
dependency_tree | | string | The dependency tree of the sentence in the text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
TODO: insert short text about this section in this file
Please find the files under this link: Fixations
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
fixation_index | 1-1469 | Integer | The index of the fixation in temporal order. | 0 | nan | nan |
text_domain | bio: 203667, biology: 1032, physics: 199721 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
fixation_duration | 2-4474 | Integer | The duration of the fixation in milliseconds. | 0 | nan | nan |
next_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that follows a fixation in milliseconds. | 46 | nan | nan |
previous_saccade_duration | nan-nan | Integer | The duration of the saccade that precedes a fixation in milliseconds. | 515 | nan | nan |
version | 0-105 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
line | 1-12 | Integer | The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. | 0 | nan | nan |
aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | nan |
char_index_in_line | 1-100 | Integer | Index of a character in the line. Indexing starts at 1. | 0 | nan | nan |
original_fixation_index | 1-1478 | Integer | The index of the uncorrected fixation. | 0 | nan | nan |
is_fixation_adjusted | False: 382202, True: 22218 | Categorical | Whether or not the fixation has been adjusted manually. | 0 | nan | Manually tagged. |
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
TODO: insert short text about this section in this file
Please find the files under this link: Scanpaths
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
fixation_index | 1-1469 | Integer | The index of the fixation in temporal order. | 0 | nan | nan |
text_domain | bio: 4682, biology: 200017, physics: 199721 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
fixation_duration | 2-4474 | Integer | The duration of the fixation in milliseconds. | 0 | nan | nan |
next_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that follows a fixation in milliseconds. | 46 | nan | nan |
previous_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that precedes a fixation in milliseconds. | 515 | nan | nan |
version | 0-105 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
line | 1-12 | Integer | The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. | 0 | nan | nan |
aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | nan |
char_index_in_line | 1-100 | Integer | Index of a character in the line. Indexing starts at 1. | 0 | nan | nan |
original_fixation_index | 1-1478 | Integer | The index of the uncorrected fixation. | 0 | nan | nan |
is_fixation_adjusted | False: 382202, True: 22218 | Categorical | Whether or not the fixation has been adjusted manually. | 0 | nan | Manually tagged. |
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
char_index_in_text | 1-1121 | Integer | Index of a character in the text. Indexing starts at 1. | 0 | nan | nan |
word | | string | Words as they appear in the stimuli texts. Words are split at white-space. | 0 | nan | nan |
character | | string | Character as text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_domain_numeric | 0: 204699, 1: 199721 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
reader_domain_numeric | 0: 223158, 1: 181262 | Categorical | Numerical encoding of the reader domain; 0=biology, 1=physics. | 0 | nan | nan |
expert_status_numeric | 0: 154333, 1: 250087 | Categorical | Numerical value of expert_status; 0=beginner, 1=expert. | 0 | nan | nan |
expert_reading_label_numeric | 0: 290883, 1: 113537 | Categorical | Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading | 0 | nan | nan |
expert_reading_label | expert_reading: 113537, non-expert_reading: 290883 | Categorical | Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert) | 0 | nan | nan |
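
The condition given for `expert_reading_label` (text domain equals reader domain, and the reader is an expert) can be written out from the numeric columns. This is a minimal sketch of that rule as described above; the function names are hypothetical and this is not claimed to be the code that produced the column.

```python
def expert_reading_label_numeric(
    text_domain_numeric: int,
    reader_domain_numeric: int,
    expert_status_numeric: int,
) -> int:
    """1 if the reader is an expert (1) in the domain of the text, else 0."""
    same_domain = text_domain_numeric == reader_domain_numeric
    is_expert = expert_status_numeric == 1
    return int(same_domain and is_expert)


def expert_reading_label(numeric_label: int) -> str:
    """Map the numeric label to the string used in expert_reading_label."""
    return "expert_reading" if numeric_label == 1 else "non-expert_reading"


# Physics expert (reader domain 1, expert status 1) reading a physics text (1).
label = expert_reading_label_numeric(1, 1, 1)
print(label, expert_reading_label(label))  # prints "1 expert_reading"
```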
TODO: insert short text about this section in this file
Please find the files under this link: Reading measures
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
word_index_in_sent | 1-51 | Integer | The index of the word in the sentence. Indexing starts at 1. | 0 | nan | nan |
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
FFD | min: 0, max: 2144, mean: 166.4158, std: 132.8433 | Float | First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. | 0 | nan | nan |
SFD | min: 0, max: 2144, mean: 118.8309, std: 135.573 | Float | Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). | 0 | nan | nan |
FD | min: 0, max: 2144, mean: 203.5219, std: 116.9324 | Float | First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). | 0 | nan | nan |
FPRT | min: 0, max: 9649, mean: 247.1511, std: 298.6889 | Float | First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). | 0 | nan | nan |
FRT | min: 0, max: 9649, mean: 291.8272, std: 288.631 | Float | First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). | 0 | nan | nan |
TFT | min: 0, max: 25314, mean: 632.8199, std: 720.3975 | Float | Total-fixation time: sum of the durations of all fixations on a word (FPRT+RRT). | 0 | nan | nan |
RRT | min: 0, max: 23902, mean: 385.6688, std: 597.5206 | Float | Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). | 0 | nan | nan |
RPD_inc | min: 0, max: 318898, mean: 632.8199, std: 3881.7376 | Float | Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). | 0 | nan | nan |
RPD_exc | min: 0, max: 315640, mean: 342.295, std: 3815.3786 | Float | Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). | 0 | nan | nan |
RBRT | min: 0, max: 10675, mean: 290.5249, std: 358.8929 | Float | Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). | 0 | nan | nan |
Fix | 0: 14182, 1: 127943 | Categorical | Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). | 0 | nan | nan |
FPF | 0: 38408, 1: 103717 | Categorical | First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. | 0 | nan | nan |
RR | 0: 48283, 1: 93842 | Categorical | Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). | 0 | nan | nan |
FPReg | 0: 119060, 1: 23065 | Categorical | First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). | 0 | nan | nan |
TRC_out | 0-15 | Integer | Total count of outgoing regressions: total number of regressive saccades initiated from this word. | 0 | nan | nan |
TRC_in | 0-12 | Integer | Total count of incoming regressions: total number of regressive saccades landing on this word. | 0 | nan | nan |
LP | 0-28 | Integer | Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. | 0 | nan | nan |
SL_in | -162-156 | Integer | Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. | 0 | nan | nan |
SL_out | -179-63 | Integer | Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. | 0 | nan | nan |
TFC | | | The total fixation count on the word. | 0 | nan | nan |
text_domain_numeric | 0: 71550, 1: 70575 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
gender_numeric | 0.0: 66325, 1.0: 73905, nan: 1895 | Categorical | Numerical value of gender; 0=male, 1=female. | 1895 | nan | nan |
reader_domain_numeric | 0: 81485, 1: 60640 | Categorical | Numerical encoding of the reader domain; 0=biology, 1=physics. | 0 | nan | nan |
expert_status_numeric | 0: 53060, 1: 89065 | Categorical | Numerical value of expert_status; 0=beginner, 1=expert. | 0 | nan | nan |
domain_expert_status_numeric | 0: 30320, 1: 51165, 2: 22740, 3: 37900 | Categorical | Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. | 0 | nan | nan |
expert_reading_label_numeric | 0: 97547, 1: 44578 | Categorical | Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading | 0 | nan | nan |
expert_reading_label | expert_reading: 44578, non-expert_reading: 97547 | Categorical | Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert) | 0 | nan | nan |
age | min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809 | Float | Reader's age. | 3790 | nan | nan |
mean_acc_bq | min: 0.0, max: 1.0, mean: 0.6487, std: 0.3076 | Float | The mean accuracy of all text questions for one text read by one reader. | 1958 | nan | nan |
mean_acc_tq | min: 0.0, max: 1.0, mean: 0.3939, std: 0.3158 | Float | The mean accuracy of all background questions for one text read by one reader. | 1958 | nan | nan |
acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3922, std: 0.4883 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3619, std: 0.4805 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4277, std: 0.4947 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6469, std: 0.4779 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6428, std: 0.4792 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6563, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
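
Several descriptions in the table above state identities between measures (for example TFT = FPRT + RRT, or RR = sign(RRT)). The following minimal sketch shows how a single row could be checked against them. The helper name `check_reading_measures` and the example values are hypothetical, and whether every row in the released files satisfies each identity exactly is not verified here.

```python
import math


def sign(value: float) -> int:
    """Return 1 for a positive value and 0 otherwise (durations are never negative)."""
    return 1 if value > 0 else 0


def check_reading_measures(row: dict) -> list[str]:
    """Return the identities from the codebook that the given row violates."""
    violations = []
    if not math.isclose(row["TFT"], row["FPRT"] + row["RRT"]):
        violations.append("TFT = FPRT + RRT")
    if not math.isclose(row["RPD_inc"], row["RPD_exc"] + row["RBRT"]):
        violations.append("RPD_inc = RPD_exc + RBRT")
    if row["Fix"] != (1 if row["FPF"] == 1 or row["RR"] == 1 else 0):
        violations.append("Fix = FPF or RR")
    if row["RR"] != sign(row["RRT"]):
        violations.append("RR = sign(RRT)")
    if row["FPReg"] != sign(row["RPD_exc"]):
        violations.append("FPReg = sign(RPD_exc)")
    return violations


# Hypothetical row, for illustration only (durations in milliseconds).
consistent_row = {
    "FPRT": 200.0, "RRT": 150.0, "TFT": 350.0,
    "RPD_exc": 120.0, "RBRT": 200.0, "RPD_inc": 320.0,
    "FPF": 1, "RR": 1, "Fix": 1, "FPReg": 1,
}
inconsistent_row = dict(consistent_row, TFT=400.0)

print(check_reading_measures(consistent_row))    # no violations
print(check_reading_measures(inconsistent_row))  # reports the TFT identity
```

Returning the violated identities by name, rather than a single boolean, makes it easier to see which definition a suspicious row disagrees with.
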
TODO: insert short text about this section in this file
Please find the files under this link: Reading measures merged
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
word | | string | Words as they appear in the stimuli texts. Words are split at white-space. | 0 | nan | nan |
word_with_punct | | | The word as it appears in the text, including punctuation. | 0 | nan | nan |
word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
word_index_in_sent | 1-51 | Integer | The index of the word in the sentence. Indexing starts at 1. | 0 | nan | nan |
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_domain | biology: 71550, physics: 70575 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
word_length | 2-33 | Integer | Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (i.e., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters). | 0 | nan | nan |
STTS_punctuation_before | 0.0: 70800, 0: 70425, $(: 900 | Categorical | If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 0 | nan | nan |
STTS_punctuation_after | | Categorical | If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 0 | nan | nan |
is_in_quote | 0: 141075, 1: 1050 | Categorical | Whether or not the word is part of an expression in quotes. | 0 | nan | Manually tagged |
is_in_parentheses | 0: 141750, 1: 375 | Categorical | Whether or not the word is part of a phrase in parentheses. | 0 | nan | Manually tagged |
is_clause_beginning | 0: 134700, 1: 7425 | Categorical | Whether or not the word is the beginning of a clause. | 0 | nan | Manually tagged |
is_sent_beginning | 0: 134850, 1: 7275 | Categorical | Whether or not the word is the beginning of a new sentence. | 0 | nan | Manually tagged |
is_clause_end | | | Whether or not the word is the end of a clause. | 0 | nan | nan |
is_sent_end | | | Whether or not the word is the end of a sentence. | 0 | nan | nan |
is_abbreviation | 0: 141750, 1: 375 | Categorical | Whether or not the entire word is an abbreviation. | 0 | nan | Manually tagged |
is_expert_technical_term | 0: 130500, 1: 11625 | Categorical | 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". | 0 | nan | Manually tagged |
is_general_technical_term | 0: 123450, 1: 18675 | Categorical | 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" | 0 | nan | nan |
contains_symbol | 0: 141525, 1: 600 | Categorical | Whether or not the word contains a symbol. E.g.: β-D-Glucose | 0 | nan | nan |
contains_hyphen | 0: 139950, 1: 2175 | Categorical | Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words with the tag TRUNC (compositional first element) do not count as containing a hyphen: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is 0. | 0 | nan | nan |
contains_abbreviation | 0: 141225, 1: 900 | Categorical | Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. | 0 | nan | nan |
STTS_PoS_tag | ADJA: 11550, ADJD: 3975, ADV: 5475, APPR: 13800, APPRART: 3600, APZR: 75, ART: 20700, CARD: 675, KOKOM: 1275, KON: 4950, KOUI: 450, KOUS: 1200, NE: 300, NN: 38625, PAV: 1350, PDAT: 1200, PDS: 525, PIAT: 375, PIDAT: 675, PIS: 750, PPER: 1875, PPOSAT: 525, PRELAT: 450, PRELS: 2175, PRF: 1875, PTKA: 75, PTKNEG: 300, PTKVZ: 975, PTKZU: 750, PWAV: 75, TRUNC: 375, VAFIN: 5475, VAINF: 600, VMFIN: 1875, VMINF: 75, VVFIN: 7650, VVINF: 2475, VVIZU: 150, VVPP: 2850 | Categorical | Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. | 0 | nan | Manually tagged |
type | | string | The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. | 0 | nan | dlexDB |
type_length_chars | 0.0-33.0 | Integer | The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. | 0 | nan | nan |
PoS_tag | adja: 12150, adjd: 4050, adv: 6825, appr: 13650, apprart: 3600, art: 21000, card: 675, kokom: 1275, kon: 4725, koui: 375, kous: 1200, ne: 525, nn: 38100, pdat: 1200, pds: 525, piat: 375, pidat: 150, pis: 1050, pper: 1800, pposat: 525, prelat: 450, prels: 1800, prf: 1875, ptka: 75, ptkneg: 300, ptkvz: 1125, ptkzu: 750, pwav: 75, trunc: 375, vafin: 5475, vainf: 600, vmfin: 1800, vminf: 75, vvfin: 7725, vvinf: 2475, vvizu: 150, vvpp: 2850, xy: 375 | Categorical | Part-of-speech tag as defined by the dlexDB query. | 0 | nan | dlexDB |
lemma | | string | nan | 0 | nan | dlexDB |
lemma_length_chars | 0.0-32.0 | Integer | nan | 0 | nan | dlexDB |
syllables | | string | nan | 0 | nan | dlexDB |
type_length_syllables | 0.0-14.0 | Integer | nan | 0 | nan | dlexDB |
annotated_type_frequency_normalized | min: 0.0, max: 24738.5901996, mean: 3629.1612, std: 6797.6492 | Float | The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. | 0 | nan | dlexDB |
type_frequency_normalized | min: 0.0, max: 26530.3631386, mean: 4141.6498, std: 7546.5578 | Float | nan | 0 | nan | dlexDB |
lemma_frequency_normalized | min: 0.0, max: 80100.3069113, mean: 12271.0154, std: 24660.3797 | Float | nan | 0 | nan | dlexDB |
familiarity_normalized | min: 0.0, max: 26530.3631386, mean: 3822.4994, std: 7457.3314 | Float | nan | 0 | nan | dlexDB |
regularity_normalized | min: 0.0, max: 2123.30585022, mean: 35.3095, std: 119.8288 | Float | nan | 0 | nan | dlexDB |
document_frequency_normalized | min: 0.0, max: 9372.80956103, mean: 2885.4746, std: 3353.4877 | Float | nan | 0 | nan | dlexDB |
sentence_frequency_normalized | min: 0.0, max: 30912.3596552, mean: 5745.1861, std: 9454.5921 | Float | nan | 0 | nan | dlexDB |
cumulative_syllable_corpus_frequency_normalized | min: 0.0, max: 125126.524676, mean: 15795.556, std: 15820.9152 | Float | nan | 0 | nan | dlexDB |
cumulative_syllable_lexicon_frequency_normalized | min: 0.0, max: 218985.607753, mean: 21763.0396, std: 31363.3366 | Float | nan | 0 | nan | dlexDB |
cumulative_character_corpus_frequency_normalized | min: 0.0, max: 7810554.20193, mean: 1800394.2485, std: 1298158.5605 | Float | nan | 0 | nan | dlexDB |
cumulative_character_lexicon_frequency_normalized | min: 0.0, max: 18380479.713, mean: 4004667.3367, std: 2909455.8454 | Float | nan | 0 | nan | dlexDB |
cumulative_character_bigram_corpus_frequency_normalized | min: 0.0, max: 1322150.62097, mean: 341028.5141, std: 227677.2532 | Float | nan | 0 | nan | dlexDB |
cumulative_character_bigram_lexicon_frequency_normalized | min: 0.0, max: 2788357.77704, mean: 554080.6642, std: 451286.9101 | Float | nan | 0 | nan | dlexDB |
cumulative_character_trigram_corpus_frequency_normalized | min: 0.0, max: 603427.130456, mean: 213996.2534, std: 130950.6249 | Float | nan | 0 | nan | dlexDB |
cumulative_character_trigram_lexicon_frequency_normalized | min: 0.0, max: 899592.89035, mean: 223247.7744, std: 175811.3775 | Float | nan | 0 | nan | dlexDB |
initial_letter_frequency_normalized | min: 0.0, max: 110461.430317, mean: 36031.6466, std: 33586.1123 | Float | nan | 0 | nan | dlexDB |
initial_bigram_frequency_normalized | min: 0.0, max: 53801.2331077, mean: 11986.4422, std: 14536.7787 | Float | nan | 0 | nan | dlexDB |
initial_trigram_frequency_normalized | min: -0.00817507899599, max: 29048.3692201, mean: 5528.0412, std: 8782.9659 | Float | nan | 0 | nan | dlexDB |
avg_cond_prob_in_bigrams | min: 0.0, max: 0.5006180465, mean: 0.0423, std: 0.0447 | Float | The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. | 0 | nan | dlexDB |
avg_cond_prob_in_trigrams | min: 0.0, max: 25.0, mean: 0.2371, std: 0.5852 | Float | The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. | 0 | nan | dlexDB |
neighbors_coltheart_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2111.0615, std: 7323.9586 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_higher_freq_count_normalized | min: 0.0, max: 8.13363128109, mean: 0.195, std: 0.4875 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_all_cum_freq_normalized | min: 0.0, max: 49782.1108458, mean: 4765.8454, std: 9884.7277 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_all_count_normalized | min: 0.0, max: 47.5175301158, mean: 14.8301, std: 14.4676 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 2703.1737, std: 7703.635 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_higher_freq_count_normalized | min: 0.0, max: 11.9864039932, mean: 0.3077, std: 0.6418 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_all_cum_freq_normalized | min: 0.0, max: 54875.2749862, mean: 6310.865, std: 11349.5391 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_all_count_normalized | min: 0.0, max: 75.7711966712, mean: 23.1334, std: 22.6083 | Float | nan | 0 | nan | dlexDB |
sent_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the text as context. | 0 | nan | nan |
sent_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the text as context. | 0 | nan | nan |
sent_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the text as context. | 0 | nan | nan |
FFD | min: 0, max: 2144, mean: 166.4158, std: 132.8433 | Float | First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. | 0 | nan | nan |
SFD | min: 0, max: 2144, mean: 118.8309, std: 135.573 | Float | Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). | 0 | nan | nan |
FD | min: 0, max: 2144, mean: 203.5219, std: 116.9324 | Float | First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). | 0 | nan | nan |
FPRT | min: 0, max: 9649, mean: 247.1511, std: 298.6889 | Float | First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). | 0 | nan | nan |
FRT | min: 0, max: 9649, mean: 291.8272, std: 288.631 | Float | First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). | 0 | nan | nan |
TFT | min: 0, max: 25314, mean: 632.8199, std: 720.3975 | Float | Total-fixation time: sum of the durations of all fixations on a word (FPRT+RRT). | 0 | nan | nan |
TFC | | | The total fixation count on the word. | 0 | nan | nan |
RRT | min: 0, max: 23902, mean: 385.6688, std: 597.5206 | Float | Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). | 0 | nan | nan |
RPD_inc | min: 0, max: 318898, mean: 632.8199, std: 3881.7376 | Float | Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). | 0 | nan | nan |
RPD_exc | min: 0, max: 315640, mean: 342.295, std: 3815.3786 | Float | Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). | 0 | nan | nan |
RBRT | min: 0, max: 10675, mean: 290.5249, std: 358.8929 | Float | Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). | 0 | nan | nan |
Fix | 0: 14182, 1: 127943 | Categorical | Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). | 0 | nan | nan |
FPF | 0: 38408, 1: 103717 | Categorical | First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. | 0 | nan | nan |
RR | 0: 48283, 1: 93842 | Categorical | Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). | 0 | nan | nan |
FPReg | 0: 119060, 1: 23065 | Categorical | First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). | 0 | nan | nan |
TRC_out | 0-15 | Integer | Total count of outgoing regressions: total number of regressive saccades initiated from this word. | 0 | nan | nan |
TRC_in | 0-12 | Integer | Total count of incoming regressions: total number of regressive saccades landing on this word. | 0 | nan | nan |
LP | 0-28 | Integer | Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. | 0 | nan | nan |
SL_in | -162-156 | Integer | Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. | 0 | nan | nan |
SL_out | -179-63 | Integer | Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. | 0 | nan | nan |
acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3922, std: 0.4883 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3619, std: 0.4805 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4277, std: 0.4947 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6469, std: 0.4779 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6428, std: 0.4792 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6563, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 1958 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
mean_acc_tq | min: 0.0, max: 1.0, mean: 0.3939, std: 0.3158 | Float | The mean accuracy of all background questions for one text read by one reader. | 1958 | nan | nan |
mean_acc_bq | min: 0.0, max: 1.0, mean: 0.6487, std: 0.3076 | Float | The mean accuracy of all text questions for one text read by one reader. | 1958 | nan | nan |
text_domain_numeric | 0: 142125 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
gender_numeric | 0.0: 66325, 1.0: 73905, nan: 1895 | Categorical | Numerical value of gender; 0=male, 1=female. | 1895 | nan | nan |
reader_domain_numeric | 0: 81485, 1: 60640 | Categorical | Numerical encoding of the reader domain; 0=biology, 1=physics. | 0 | nan | nan |
age | min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809 | Float | Reader's age. | 3790 | nan | nan |
expert_status_numeric | 0: 53060, 1: 89065 | Categorical | Numerical value of expert_status; 0=beginner, 1=expert. | 0 | nan | nan |
domain_expert_status_numeric | 0: 30320, 1: 51165, 2: 22740, 3: 37900 | Categorical | Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. | 0 | nan | nan |
expert_reading_label_numeric | 0: 90960, 1: 51165 | Categorical | Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading | 0 | nan | nan |
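
The numeric encodings listed above (text_id_numeric, text_domain_numeric, domain_expert_status_numeric) can be reproduced with a small lookup. This is a sketch under assumptions: the function names are hypothetical, deriving the text domain from the first letter of text_id (b for biology, p for physics) is inferred from the identifiers rather than stated in the codebook, and the formula for domain_expert_status_numeric is only a compact way of writing the four listed codes.

```python
# Order taken from the text_id_numeric description: 0=b0, ..., 5=b5, 6=p0, ..., 11=p5.
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5", "p0", "p1", "p2", "p3", "p4", "p5"]
TEXT_ID_TO_NUMERIC = {text_id: index for index, text_id in enumerate(TEXT_IDS)}


def text_domain_numeric(text_id: str) -> int:
    """0=biology, 1=physics; assumes the prefix 'b' or 'p' marks the domain."""
    return 0 if text_id.startswith("b") else 1


def domain_expert_status_numeric(reader_domain_numeric: int, expert_status_numeric: int) -> int:
    """0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert."""
    return 2 * reader_domain_numeric + expert_status_numeric


print(TEXT_ID_TO_NUMERIC["p0"])            # 6
print(text_domain_numeric("p0"))           # 1
print(domain_expert_status_numeric(1, 0))  # 2 (physics-beginner)
```
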
TODO: insert short text about this section in this file
Please find the files under this link: Scanpaths merged
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
fixation_index | 1-1469 | Integer | The index of the fixation in temporal order. | 0 | nan | nan |
text_domain | bio: 4682, biology: 200017, physics: 199721 | Categorical | The domain of the stimulus text. | 0 | nan | nan |
trial | 1-12 | Integer | Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text. | 0 | nan | nan |
acc_bq_1 | min: 0.0, max: 1.0, mean: 0.3869, std: 0.487 | Float | The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_2 | min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789 | Float | The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_bq_3 | min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938 | Float | The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_1 | min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729 | Float | The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_2 | min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821 | Float | The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
acc_tq_3 | min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749 | Float | The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1. | 5785 | For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements). | nan |
fixation_duration | 2-4474 | Integer | The duration of the fixation in milliseconds. | 0 | nan | nan |
next_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that follows a fixation in milliseconds. | 46 | nan | nan |
previous_saccade_duration | 1.0-9491.0 | Integer | The duration of the saccade that precedes a fixation in milliseconds. | 515 | nan | nan |
version | 0-105 | Integer | Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv. | 0 | nan | nan |
line | 1-12 | Integer | The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1. | 0 | nan | nan |
aoi | 1-1121 | Integer | The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated. | 0 | nan | nan |
char_index_in_line | 1-100 | Integer | Index of a character in the line. Indexing starts at 1. | 0 | nan | nan |
original_fixation_index | 1-1478 | Integer | The index of the uncorrected fixation. | 0 | nan | nan |
is_fixation_adjusted | False: 382202, True: 22218 | Categorical | Whether or not the fixation has been adjusted manually. | 0 | nan | Manually tagged. |
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
sent_index_in_text | 1-12 | Integer | The index of a sentence in the respective text. Indexing starts at 1. | 0 | nan | nan |
char_index_in_text | 1-1121 | Integer | Index of a character in the text. Indexing starts at 1. | 0 | nan | nan |
word | | string | Words as they appear in the stimuli texts. Words are split at white-space. | 0 | nan | nan |
character | | string | Character as text. | 0 | nan | nan |
text_id_numeric | 0-11 | Integer | Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5 | 0 | nan | nan |
text_domain_numeric | 0: 204699, 1: 199721 | Categorical | Numerical value of text_domain; 0=biology, 1=physics. | 0 | nan | nan |
reader_domain_numeric | 0: 223158, 1: 181262 | Categorical | Numerical encoding of the reader domain; 0=biology, 1=physics. | 0 | nan | nan |
expert_status_numeric | 0: 154333, 1: 250087 | Categorical | Numerical value of expert_status; 0=beginner, 1=expert. | 0 | nan | nan |
expert_reading_label_numeric | 0: 290883, 1: 113537 | Categorical | Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading | 0 | nan | nan |
expert_reading_label | expert_reading: 113537, non-expert_reading: 290883 | Categorical | Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert) | 0 | nan | nan |
word_with_punct | | | The word as it appears in the text, including punctuation. | 96 | nan | nan |
word_index_in_sent | 1-51 | Integer | The index of the word in the sentence. Indexing starts at 1. | 0 | nan | nan |
word_length | 2-33 | Integer | Word length is defined as the number of characters, including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters). | 0 | nan | nan |
STTS_punctuation_before | 0.0: 211108, 0: 189407, $(: 3905 | Categorical | If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 0 | nan | nan |
STTS_punctuation_after | | Categorical | If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here. | 0 | nan | nan |
is_in_quote | 0: 399715, 1: 4705 | Categorical | Whether or not the word is part of an expression in quotes. | 0 | nan | Manually tagged |
is_in_parentheses | 0: 403155, 1: 1265 | Categorical | Whether or not the word is part of a phrase in parentheses. | 0 | nan | Manually tagged |
is_clause_beginning | 0: 388232, 1: 16188 | Categorical | Whether or not the word is the beginning of a clause. | 0 | nan | Manually tagged |
is_sent_beginning | 0: 386681, 1: 17739 | Categorical | Whether or not the word is the beginning of a new sentence. | 0 | nan | Manually tagged |
is_clause_end | | | Whether or not the word is the end of a clause. | 0 | nan | nan |
is_sent_end | | | Whether or not the word is the end of a sentence. | 0 | nan | nan |
is_abbreviation | 0: 403478, 1: 942 | Categorical | Whether or not the entire word is an abbreviation. | 0 | nan | Manually tagged |
is_expert_technical_term | 0: 332354, 1: 72066 | Categorical | 1 if the word is a technical term that is not generally understandable. E.g.: "Agarose". | 0 | nan | Manually tagged |
is_general_technical_term | 0: 325333, 1: 79087 | Categorical | 1 if the word is a technical term that is generally understandable. E.g.: "elektrisch" | 0 | nan | nan |
contains_symbol | 0: 400458, 1: 3962 | Categorical | Whether or not the word contains a symbol. E.g.: β-D-Glucose | 0 | nan | nan |
contains_hyphen | 0: 388149, 1: 16271 | Categorical | Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (truncated first element of a compound) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is not counted as containing a hyphen. | 0 | nan | nan |
contains_abbreviation | 0: 399423, 1: 4997 | Categorical | Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA. | 0 | nan | nan |
STTS_PoS_tag | ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317 | Categorical | Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information. | 0 | nan | Manually tagged |
type | | string | The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name. | 0 | nan | dlexDB |
type_length_chars | 0.0-33.0 | Integer | The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted. | 0 | nan | nan |
PoS_tag | adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746 | Categorical | Part-of-speech tag as defined by the dlexDB query. | 0 | nan | dlexDB |
lemma | | string | nan | 0 | nan | dlexDB |
lemma_length_chars | 0.0-32.0 | Integer | nan | 0 | nan | dlexDB |
syllables | | string | nan | 0 | nan | dlexDB |
type_length_syllables | 0.0-14.0 | Integer | nan | 0 | nan | dlexDB |
annotated_type_frequency_normalized | min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006 | Float | The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma. | 0 | nan | dlexDB |
type_frequency_normalized | min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187 | Float | nan | 0 | nan | dlexDB |
lemma_frequency_normalized | min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428 | Float | nan | 0 | nan | dlexDB |
familiarity_normalized | min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592 | Float | nan | 0 | nan | dlexDB |
regularity_normalized | min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046 | Float | nan | 0 | nan | dlexDB |
document_frequency_normalized | min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626 | Float | nan | 0 | nan | dlexDB |
sentence_frequency_normalized | min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037 | Float | nan | 0 | nan | dlexDB |
cumulative_syllable_corpus_frequency_normalized | min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528 | Float | nan | 0 | nan | dlexDB |
cumulative_syllable_lexicon_frequency_normalized | min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628 | Float | nan | 0 | nan | dlexDB |
cumulative_character_corpus_frequency_normalized | min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916 | Float | nan | 0 | nan | dlexDB |
cumulative_character_lexicon_frequency_normalized | min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404 | Float | nan | 0 | nan | dlexDB |
cumulative_character_bigram_corpus_frequency_normalized | min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388 | Float | nan | 0 | nan | dlexDB |
cumulative_character_bigram_lexicon_frequency_normalized | min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742 | Float | nan | 0 | nan | dlexDB |
cumulative_character_trigram_corpus_frequency_normalized | min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012 | Float | nan | 0 | nan | dlexDB |
cumulative_character_trigram_lexicon_frequency_normalized | min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416 | Float | nan | 0 | nan | dlexDB |
initial_letter_frequency_normalized | min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167 | Float | nan | 0 | nan | dlexDB |
initial_bigram_frequency_normalized | min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638 | Float | nan | 0 | nan | dlexDB |
initial_trigram_frequency_normalized | min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224 | Float | nan | 0 | nan | dlexDB |
avg_cond_prob_in_bigrams | min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466 | Float | The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information. | 0 | nan | dlexDB |
avg_cond_prob_in_trigrams | min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814 | Float | The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information. | 0 | nan | dlexDB |
neighbors_coltheart_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_higher_freq_count_normalized | min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_all_cum_freq_normalized | min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321 | Float | nan | 0 | nan | dlexDB |
neighbors_coltheart_all_count_normalized | min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_higher_freq_cum_freq_normalized | min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_higher_freq_count_normalized | min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_all_cum_freq_normalized | min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647 | Float | nan | 0 | nan | dlexDB |
neighbors_levenshtein_all_count_normalized | min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383 | Float | nan | 0 | nan | dlexDB |
sent_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-base | | | Surprisal value extracted from a language model (GerPT2-base) with the text as context. | 0 | nan | nan |
sent_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the sentence as context. | 0 | nan | nan |
text_surprisal_gpt2-large | | | Surprisal value extracted from a language model (GerPT2-large) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-7b | | | Surprisal value extracted from a language model (LeoLM-7b) with the text as context. | 0 | nan | nan |
sent_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context. | 0 | nan | nan |
text_surprisal_llama-13b | | | Surprisal value extracted from a language model (LeoLM-13b) with the text as context. | 0 | nan | nan |
sent_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the sentence as context. | 0 | nan | nan |
text_surprisal_bert-base | | | Surprisal value extracted from a language model (BERT-base) with the text as context. | 0 | nan | nan |
FFD | min: 0, max: 2144, mean: 195.9741, std: 124.5597 | Float | First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0. | 0 | nan | nan |
SFD | min: 0, max: 2144, mean: 107.9483, std: 134.474 | Float | Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation). | 0 | nan | nan |
FD | min: 0, max: 2144, mean: 226.9857, std: 103.7904 | Float | First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass). | 0 | nan | nan |
FPRT | min: 0, max: 9649, mean: 408.9247, std: 526.0428 | Float | First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass). | 0 | nan | nan |
FRT | min: 0, max: 9649, mean: 456.8788, std: 518.1388 | Float | First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT if the word was fixated in the first-pass). | 0 | nan | nan |
TFT | min: 0, max: 25314, mean: 1333.0163, std: 1428.494 | Float | Total-fixation time: sum of the durations of all fixations on a word (FPRT+RRT). | 0 | nan | nan |
TFC | | | The total fixation count on the word. | 0 | nan | nan |
RRT | min: 0, max: 23902, mean: 924.0916, std: 1240.0587 | Float | Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT). | 0 | nan | nan |
RPD_inc | min: 0, max: 318898, mean: 1076.7946, std: 5339.73 | Float | Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT). | 0 | nan | nan |
RPD_exc | min: 0, max: 315640, mean: 557.5849, std: 5209.143 | Float | Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT). | 0 | nan | nan |
RBRT | min: 0, max: 10675, mean: 519.2098, std: 638.9024 | Float | Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc). | 0 | nan | nan |
Fix | 0: 110, 1: 404310 | Categorical | Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR). | 0 | nan | nan |
FPF | 0: 56838, 1: 347582 | Categorical | First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0. | 0 | nan | nan |
RR | 0: 48241, 1: 356179 | Categorical | Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)). | 0 | nan | nan |
FPReg | 0: 308156, 1: 96264 | Categorical | First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)). | 0 | nan | nan |
TRC_out | 0-15 | Integer | Total count of outgoing regressions: total number of regressive saccades initiated from this word. | 0 | nan | nan |
TRC_in | 0-12 | Integer | Total count of incoming regressions: total number of regressive saccades landing on this word. | 0 | nan | nan |
LP | 1-28 | Integer | Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character. | 0 | nan | nan |
SL_in | -162-156 | Integer | Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression. | 0 | nan | nan |
SL_out | -179-63 | Integer | Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated. | 0 | nan | nan |
mean_acc_tq | min: 0.0, max: 1.0, mean: 0.3883, std: 0.3144 | Float | The mean accuracy of all text questions for one text read by one reader. | 5785 | nan | nan |
mean_acc_bq | min: 0.0, max: 1.0, mean: 0.6505, std: 0.3052 | Float | The mean accuracy of all background questions for one text read by one reader. | 5785 | nan | nan |
gender_numeric | 0.0: 187536, 1.0: 212874, nan: 4010 | Categorical | Numerical value of gender; 0=male, 1=female. | 4010 | nan | nan |
age | min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436 | Float | Reader's age. | 8459 | nan | nan |
domain_expert_status_numeric | 0: 89325, 1: 133833, 2: 65008, 3: 116254 | Categorical | Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. | 0 | nan | nan |
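
Several reading measures above are described as combinations of other columns (TFT as FPRT+RRT, RBRT as RPD_inc-RPD_exc, RR as sign(RRT), FPReg as sign(RPD_exc), Fix as FPF or RR). The following is a minimal Python sketch of those relations, which may be useful as a consistency check after loading the data. The function name `derive_measures` and the example values are hypothetical and not taken from the data files.

```python
def derive_measures(fprt, rrt, rpd_inc, rpd_exc, fpf):
    """Recompute derived columns from the relations stated in the codebook."""
    return {
        "TFT": fprt + rrt,                 # TFT = FPRT + RRT
        "RBRT": rpd_inc - rpd_exc,         # RBRT = RPD_inc - RPD_exc
        "RR": int(rrt > 0),                # RR = sign(RRT)
        "FPReg": int(rpd_exc > 0),         # FPReg = sign(RPD_exc)
        "Fix": int(bool(fpf) or rrt > 0),  # Fix = FPF or RR
    }


# Hypothetical values for a single word (milliseconds), for illustration only.
derived = derive_measures(fprt=220, rrt=180, rpd_inc=520, rpd_exc=300, fpf=1)
print(derived)
```

Comparing the recomputed values against the stored columns row by row is one way to spot rows that were altered or corrupted during processing.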
TODO: insert short text about this section in this file
Please find the file under this link: aoi to word mapping
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
text_id | b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5 | | Unique identifier given to each stimulus text. | 0 | nan | nan |
word_index_in_text | 1-180 | Integer | The index of the word in the text. Indexing starts at 1. | 0 | nan | nan |
char_index_in_text | 1-1121 | Integer | Index of a character in the text. Indexing starts at 1. | 0 | nan | nan |
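
Because the `aoi` column of the fixation data is a character index in the text, this mapping can be used to look up the word that a fixated character belongs to. Below is a minimal Python sketch of such a lookup. It assumes the mapping is stored as a tab-separated file with the three columns listed above; the sample rows and the helper name `build_char_to_word` are hypothetical.

```python
import csv
import io

# Hypothetical excerpt with the three documented columns (illustrative values only).
SAMPLE_TSV = (
    "text_id\tword_index_in_text\tchar_index_in_text\n"
    "b0\t1\t1\n"
    "b0\t1\t2\n"
    "b0\t1\t3\n"
    "b0\t2\t4\n"
    "b0\t2\t5\n"
)


def build_char_to_word(tsv_file):
    """Map (text_id, char_index_in_text) to word_index_in_text."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {
        (row["text_id"], int(row["char_index_in_text"])): int(row["word_index_in_text"])
        for row in reader
    }


lookup = build_char_to_word(io.StringIO(SAMPLE_TSV))

# A fixation on text b0 with aoi == 4 falls on the second word in this sample.
word_index = lookup[("b0", 4)]
print(word_index)
```

With the real file, an open file handle can be passed in place of the `io.StringIO` object.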
TODO: insert short text about this section in this file
Please find the file under this link: Participant information
Column name | Possible values | Value type | Description | Num missing values | Missing value description | Source |
---|---|---|---|---|---|---|
reader_id | 0-105 | Integer | The unique identifier given to each reader. Reader IDs start at 0. | 0 | nan | nan |
reader_domain | biology: 43, physics: 32 | Categorical | The area of expertise of the reader. All readers are students whose major is either physics or biology. | 0 | nan | nan |
reader_domain_numeric | 0: 43, 1: 32 | Categorical | Numerical encoding of the reader domain; 0=biology, 1=physics. | 0 | nan | nan |
expert_status | beginner: 28, expert: 47 | Categorical | Reader's expert status. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners. | 0 | nan | nan |
expert_status_numeric | 0: 28, 1: 47 | Categorical | Numerical value of expert_status; 0=beginner, 1=expert. | 0 | nan | nan |
domain_expert_status | biology-beginner: 16, biology-expert: 27, physics-beginner: 12, physics-expert: 20 | Categorical | The combination of the readers' major (reader_domain) and their expertise (expert_status). | 0 | nan | nan |
domain_expert_status_numeric | 0: 16, 1: 27, 2: 12, 3: 20 | Categorical | Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert. | 0 | nan | nan |
glasses | no: 54, yes: 20, nan: 1 | Categorical | Whether or not the reader wore glasses. | 1 | nan | nan |
age | min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098 | Float | Reader's age. | 2 | nan | nan |
handedness | right: 68, left: 6, nan: 1 | Categorical | Reader's handedness. | 1 | nan | nan |
hours_sleep | min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138 | Float | The hours of sleep of the participant before the experiment. | 1 | nan | nan |
alcohol | no: 71, yes: 3, nan: 1 | Categorical | Whether or not the participant consumed alcohol in the 24 hours before the start of the experiment. | 1 | nan | nan |
gender | female: 39, male: 35, nan: 1 | Categorical | Reader's gender. | 1 | nan | nan |
gender_numeric | 0.0: 35, 1.0: 39, nan: 1 | Categorical | Numerical value of gender; 0=male, 1=female. | 1 | nan | nan |
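
The numeric columns in this table follow the encodings given in the descriptions. As a minimal Python sketch, the encodings and the expert-reading label (described in the merged tables as text_domain == reader_domain and reader is expert) could be reproduced as follows. The function names are hypothetical, and the dictionaries only restate the codes listed in this codebook.

```python
# Encodings as documented in the codebook.
DOMAIN_NUMERIC = {"biology": 0, "physics": 1}
EXPERT_NUMERIC = {"beginner": 0, "expert": 1}
DOMAIN_EXPERT_NUMERIC = {
    "biology-beginner": 0,
    "biology-expert": 1,
    "physics-beginner": 2,
    "physics-expert": 3,
}


def encode_reader(reader_domain, expert_status):
    """Return the numeric columns for one reader."""
    domain_expert_status = f"{reader_domain}-{expert_status}"
    return {
        "reader_domain_numeric": DOMAIN_NUMERIC[reader_domain],
        "expert_status_numeric": EXPERT_NUMERIC[expert_status],
        "domain_expert_status": domain_expert_status,
        "domain_expert_status_numeric": DOMAIN_EXPERT_NUMERIC[domain_expert_status],
    }


def expert_reading_label_numeric(text_domain, reader_domain, expert_status):
    """1 if an expert reads a text from their own domain, otherwise 0."""
    return int(text_domain == reader_domain and expert_status == "expert")


encoded = encode_reader("physics", "expert")
print(encoded)
print(expert_reading_label_numeric("biology", "physics", "expert"))
```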