Codebook

The codebook specifies the data types, possible values, and other information for each column in the data files.

Word features

TODO: insert short text about this section in this file

Please find the files under this link: Word features

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
word_with_punct			The word as it appears in the text, including punctuation.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
word_limit_char_indices	no stats?		Specifies the limits of each word in character indices. Format: [word_start],[word_end]. For example: 3,7 means a word starts at character index 3 in the text and ends at character index 7. The properties of the character indices are specified in char_index_in_text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	nan
text_domain	biology: 954, physics: 941	Categorical	The domain of the stimulus text.	0	nan	nan
text_domain_numeric	0: 954, 1: 941	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	nan
word_length	2-33	Integer	Word length in characters, including symbols such as hyphens but excluding sentence-final punctuation (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	nan: 1883, $(: 12	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	1883	nan	nan
STTS_punctuation_after	nan: 1689, $.: 101, $,: 93, $(: 10, $($,: 2	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	1689	nan	nan
is_in_quote	0: 1881, 1: 14	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 1890, 1: 5	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 1796, 1: 99	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 1798, 1: 97	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end			Whether or not the word is the end of a clause.	0	nan	nan
is_sent_end			Whether or not the word is the end of a sentence.	0	nan	nan
is_abbreviation	0: 1890, 1: 5	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 1740, 1: 155	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 1646, 1: 249	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	nan
contains_symbol	0: 1887, 1: 8	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	nan
contains_hyphen	0: 1866, 1: 29	Categorical	Whether or not the word contains a hyphen. E.g.: 1 for DNA-Fragment. Words tagged TRUNC (first element of a compound) do not count as containing a hyphen: in "Sekundär- und Tertiärstrukturen", "Sekundär-" is 0.	0	nan	nan
contains_abbreviation	0: 1883, 1: 12	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	nan
STTS_PoS_tag	ADJA: 154, ADJD: 53, ADV: 73, APPR: 184, APPRART: 48, APZR: 1, ART: 276, CARD: 9, KOKOM: 17, KON: 66, KOUI: 6, KOUS: 16, NE: 4, NN: 515, PAV: 18, PDAT: 16, PDS: 7, PIAT: 5, PIDAT: 9, PIS: 10, PPER: 25, PPOSAT: 7, PRELAT: 6, PRELS: 29, PRF: 25, PTKA: 1, PTKNEG: 4, PTKVZ: 13, PTKZU: 10, PWAV: 1, TRUNC: 5, VAFIN: 73, VAINF: 8, VMFIN: 25, VMINF: 1, VVFIN: 102, VVINF: 33, VVIZU: 2, VVPP: 38	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	4	nan	dlexDB
type_length_chars	2.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	1	nan	nan
PoS_tag	adja: 162, adjd: 54, adv: 91, appr: 182, apprart: 48, art: 280, card: 9, kokom: 17, kon: 63, koui: 5, kous: 16, ne: 7, nn: 508, pdat: 16, pds: 7, piat: 5, pidat: 2, pis: 14, pper: 24, pposat: 7, prelat: 6, prels: 24, prf: 25, ptka: 1, ptkneg: 4, ptkvz: 15, ptkzu: 10, pwav: 1, trunc: 5, vafin: 73, vainf: 8, vmfin: 24, vminf: 1, vvfin: 103, vvinf: 33, vvizu: 2, vvpp: 38, xy: 5	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	4	nan	dlexDB
lemma_length_chars	1.0-32.0	Integer	nan	3	nan	dlexDB
syllables		string	nan	25	nan	dlexDB
type_length_syllables	1.0-14.0	Integer	nan	24	nan	dlexDB
annotated_type_frequency_normalized	min: 0.00817507899599, max: 24738.5901996, mean: 3889.8532, std: 6967.089	Float	The number of occurrences of an annotated type in the corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	127	nan	dlexDB
type_frequency_normalized	min: 0.00817507899599, max: 26530.3631386, mean: 4409.2283, std: 7712.5287	Float	nan	115	nan	dlexDB
lemma_frequency_normalized	min: 0.00817507899599, max: 80100.3069113, mean: 13063.8057, std: 25247.1898	Float	nan	115	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 4074.0362, std: 7634.0602	Float	nan	117	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 37.6119, std: 123.3575	Float	nan	116	nan	dlexDB
document_frequency_normalized	min: 0.126068429944, max: 9372.80956103, mean: 3073.6225, std: 3377.4549	Float	nan	116	nan	dlexDB
sentence_frequency_normalized	min: 0.0155184320176, max: 30912.3596552, mean: 6119.8019, std: 9642.457	Float	nan	116	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 1.40611358731, max: 125126.524676, mean: 16825.508, std: 15793.39	Float	nan	116	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.428085856899, max: 218985.607753, mean: 23221.2613, std: 31879.0143	Float	nan	119	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 15533.2550482, max: 7810554.20193, mean: 1917789.2641, std: 1253328.3202	Float	nan	116	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 47003.8270876, max: 18380479.713, mean: 4265792.357, std: 2812004.0938	Float	nan	116	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 5138.64210483, max: 1322150.62097, mean: 363265.3368, std: 217175.5613	Float	nan	116	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 12677.7626521, max: 2788357.77704, mean: 590209.5889, std: 442407.5129	Float	nan	116	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 4358.04468689, max: 603427.130456, mean: 227949.9158, std: 122856.9432	Float	nan	116	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 11942.3111499, max: 899592.89035, mean: 237804.6839, std: 171696.6712	Float	nan	116	nan	dlexDB
initial_letter_frequency_normalized	min: 199.202149895, max: 110461.430317, mean: 38381.0963, std: 33346.9984	Float	nan	116	nan	dlexDB
initial_bigram_frequency_normalized	min: 1.57779024623, max: 53801.2331077, mean: 12768.0203, std: 14670.9631	Float	nan	116	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 5888.4981, std: 8949.4325	Float	nan	116	nan	dlexDB
avg_cond_prob_in_bigrams	min: 1.2e-07, max: 0.5006180465, mean: 0.0451, std: 0.0448	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	116	nan	dlexDB
avg_cond_prob_in_trigrams	min: 3.153e-06, max: 25.0, mean: 0.2526, std: 0.6009	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	116	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2248.7136, std: 7540.5582	Float	nan	116	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.2077, std: 0.5007	Float	nan	116	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 5076.6032, std: 10127.1033	Float	nan	116	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 15.7971, std: 14.4153	Float	nan	116	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2879.4346, std: 7921.0448	Float	nan	116	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.3277, std: 0.6576	Float	nan	116	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 6722.366, std: 11598.2601	Float	nan	116	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 24.6418, std: 22.5295	Float	nan	116	nan	dlexDB
sent_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	nan
sent_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	nan
sent_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	nan
text_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	nan
sent_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	nan
text_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	nan
sent_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	nan
text_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	nan
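
The two encodings described above, the word_limit_char_indices format and the text_id to text_id_numeric mapping, can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the files; the names parse_word_limits and TEXT_ID_TO_NUMERIC are hypothetical. The codebook does not say whether the end index is inclusive, so the sketch only parses the two numbers and does not slice any text.

```python
# Order of the stimulus texts as listed for text_id_numeric (0=b0 ... 11=p5).
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5",
            "p0", "p1", "p2", "p3", "p4", "p5"]

# Hypothetical lookup table: text_id -> text_id_numeric.
TEXT_ID_TO_NUMERIC = {text_id: i for i, text_id in enumerate(TEXT_IDS)}


def parse_word_limits(value):
    """Split a '[word_start],[word_end]' string into two integers."""
    start, end = value.split(",")
    return int(start), int(end)


# The codebook's own example: a word spanning character indices 3 to 7.
word_start, word_end = parse_word_limits("3,7")
```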

Stimuli and comprehension questions

TODO: insert short text about this section in this file

Please find the file under this link: Stimuli including comprehension questions

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	nan	nan
text_domain	biology: 6, physics: 6	Categorical	The domain of the stimulus text.	nan	nan
text_domain_numeric	0: 6, 1: 6	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	nan	nan
source			The source of the stimulus text.	nan	nan
headline		string	The header of the respective stimulus text.	nan	nan
tq_1		string	Text question 1.	nan	nan
tq_1_option1		string	Option 1 for text question 1.	nan	nan
tq_1_option2		string	Option 2 for text question 1.	nan	nan
tq_1_option3		string	Option 3 for text question 1.	nan	nan
tq_1_option4		string	Option 4 for text question 1.	nan	nan
tq_2		string	Text question 2.	nan	nan
tq_2_option1		string	Option 1 for text question 2.	nan	nan
tq_2_option2		string	Option 2 for text question 2.	nan	nan
tq_2_option3		string	Option 3 for text question 2.	nan	nan
tq_2_option4		string	Option 4 for text question 2.	nan	nan
tq_3		string	Text question 3.	nan	nan
tq_3_option1		string	Option 1 for text question 3.	nan	nan
tq_3_option2		string	Option 2 for text question 3.	nan	nan
tq_3_option3		string	Option 3 for text question 3.	nan	nan
tq_3_option4		string	Option 4 for text question 3.	nan	nan
bq_1		string	Background question 1.	nan	nan
bq_1_option1		string	Option 1 for background question 1.	nan	nan
bq_1_option2		string	Option 2 for background question 1.	nan	nan
bq_1_option3		string	Option 3 for background question 1.	nan	nan
bq_1_option4		string	Option 4 for background question 1.	nan	nan
bq_2		string	Background question 2.	nan	nan
bq_2_option1		string	Option 1 for background question 2.	nan	nan
bq_2_option2		string	Option 2 for background question 2.	nan	nan
bq_2_option3		string	Option 3 for background question 2.	nan	nan
bq_2_option4		string	Option 4 for background question 2.	nan	nan
bq_3		string	Background question 3.	nan	nan
bq_3_option1		string	Option 1 for background question 3.	nan	nan
bq_3_option2		string	Option 2 for background question 3.	nan	nan
bq_3_option3		string	Option 3 for background question 3.	nan	nan
bq_3_option4		string	Option 4 for background question 3.	nan	nan
correct_ans_tq_1	1-4	Integer	The index of the correct answer for text question 1, specified as the option number of the question in this file. For example: 2 means that the answer in the column "tq_1_option2" is the correct answer to this question.	nan	nan
correct_ans_tq_2	1-4	Integer	The index of the correct answer for text question 2, specified as the option number of the question in this file. For example: 2 means that the answer in the column "tq_2_option2" is the correct answer to this question.	nan	nan
correct_ans_tq_3	1-4	Integer	The index of the correct answer for text question 3, specified as the option number of the question in this file. For example: 2 means that the answer in the column "tq_3_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_1	1-4	Integer	The index of the correct answer for background question 1, specified as the option number of the question in this file. For example: 2 means that the answer in the column "bq_1_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_2	1-4	Integer	The index of the correct answer for background question 2, specified as the option number of the question in this file. For example: 2 means that the answer in the column "bq_2_option2" is the correct answer to this question.	nan	nan
correct_ans_bq_3	1-4	Integer	The index of the correct answer for background question 3, specified as the option number of the question in this file. For example: 2 means that the answer in the column "bq_3_option2" is the correct answer to this question.	nan	nan
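
The correct_ans_* columns point at one of the four option columns of the same question. A minimal sketch of that lookup, assuming each row of the file has been read into a dict keyed by column name (for example with csv.DictReader); the function name correct_answer_text and the example row are hypothetical.

```python
def correct_answer_text(row, question):
    """Return the text of the correct option for a question such as 'tq_1' or 'bq_3'."""
    index = int(row["correct_ans_" + question])
    if not 1 <= index <= 4:
        raise ValueError("option index out of range: %d" % index)
    # e.g. question 'tq_1' with index 2 -> column 'tq_1_option2'
    return row["%s_option%d" % (question, index)]


# Hypothetical row containing only the columns needed for text question 1.
example_row = {
    "tq_1_option1": "answer A",
    "tq_1_option2": "answer B",
    "tq_1_option3": "answer C",
    "tq_1_option4": "answer D",
    "correct_ans_tq_1": "2",
}
correct = correct_answer_text(example_row, "tq_1")
```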

Items

TODO: insert short text about this section in this file

Please find the file under this link: Items

Column name	Possible values	Value type	Description	Missing value description	Source
version	0-119	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
text_domain	biology: 720, physics: 720	Categorical	The domain of the stimulus text.	nan	nan
order_bq_1_ans	no stats?		The order in which the answers for background question 1 were presented.	nan	nan
order_bq_2_ans	no stats?		See description of order_bq_1_ans	nan	nan
order_bq_3_ans	no stats?		See description of order_bq_1_ans	nan	nan
order_tq_1_ans	no stats?		See description of order_bq_1_ans	nan	nan
order_tq_2_ans	no stats?		See description of order_bq_1_ans	nan	nan
order_tq_3_ans	no stats?		See description of order_bq_1_ans	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	nan	nan
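
Since trial gives the position of a text within a session, the presentation order of the texts for one version can be recovered by sorting that version's rows by trial. A minimal sketch, assuming the rows of items.tsv have been read into dicts keyed by column name; the function name presentation_order and the example rows are hypothetical.

```python
def presentation_order(item_rows, version):
    """Return the text_ids of one version, ordered by trial number."""
    rows = [r for r in item_rows if int(r["version"]) == version]
    rows.sort(key=lambda r: int(r["trial"]))
    return [r["text_id"] for r in rows]


# Hypothetical excerpt with three texts per version instead of all twelve.
example_items = [
    {"version": "0", "text_id": "b0", "trial": "2"},
    {"version": "0", "text_id": "p3", "trial": "1"},
    {"version": "0", "text_id": "b4", "trial": "3"},
    {"version": "1", "text_id": "b0", "trial": "1"},
]
order_v0 = presentation_order(example_items, 0)
```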

Areas of interest (AOI)

TODO: insert short text about this section in this file

Please find the files under this link: AOI

Column name	Possible values	Value type	Description	Missing value description	Source
aoi_type			The shape of the area of interest. In this corpus, all aois are rectangles around the characters.	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	nan	nan
start_x	80-1622	Integer	The x-coordinate in pixels of the top left corner of the aoi rectangle.	nan	nan
start_y	21-920	Integer	The y-coordinate in pixels of the top left corner of the aoi rectangle.	nan	nan
end_x	92-1634	Integer	The x-coordinate in pixels of the bottom right corner of the aoi rectangle.	nan	nan
end_y	99-998	Integer	The y-coordinate in pixels of the bottom right corner of the aoi rectangle.	nan	nan
character		string	Character as text.	nan	nan
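
Because every aoi is a rectangle given by its top left and bottom right corners, mapping a gaze position to a character is a point-in-rectangle test. The sketch below is an illustration under stated assumptions, not the procedure used to create the corpus: the function name find_aoi and the example rectangles are hypothetical, and the start edges are treated as inclusive and the end edges as exclusive so that a point on a shared edge matches only one rectangle.

```python
def find_aoi(aoi_rows, x, y):
    """Return the aoi (character index) whose rectangle contains (x, y), or None."""
    for row in aoi_rows:
        inside_x = row["start_x"] <= x < row["end_x"]
        inside_y = row["start_y"] <= y < row["end_y"]
        if inside_x and inside_y:
            return row["aoi"]
    return None


# Two hypothetical neighbouring character boxes on the first line.
example_aois = [
    {"aoi": 1, "start_x": 80, "start_y": 21, "end_x": 92, "end_y": 99},
    {"aoi": 2, "start_x": 92, "start_y": 21, "end_x": 104, "end_y": 99},
]
hit = find_aoi(example_aois, 85, 50)
```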

Dependency trees

TODO: insert short text about this section in this file

Please find the file under this link: Dependency trees

Column name	Possible values	Value type	Description	Missing value description	Source
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	nan	nan
sentence		string	The sentence in the text.	nan	nan
dependency_tree		string	The dependency tree of the sentence in the text.	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan

Fixations

TODO: insert short text about this section in this file

Please find the files under this link: Fixations

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	nan
text_domain	bio: 203667, biology: 1032, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of the answer to background question 1. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of the answer to background question 2. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of the answer to background question 3. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of the answer to text question 1. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of the answer to text question 2. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of the answer to text question 3. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	nan
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	nan
previous_saccade_duration	nan-nan	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	nan
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	nan
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	nan
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
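
The text_domain counts listed above contain both bio and biology. Their sum (203667 + 1032 = 204699) equals the count given for text_domain_numeric 0 in the Scanpaths table, which suggests that both labels denote the biology domain. Under that assumption, a small sketch for harmonising the label, or for deriving the domain from the first letter of text_id instead; the names normalize_text_domain and domain_from_text_id are hypothetical.

```python
# Assumption: 'bio' is an alternative spelling of 'biology' in text_domain.
DOMAIN_ALIASES = {"bio": "biology"}

# The text_ids b0-b5 map to biology and p0-p5 to physics.
DOMAIN_BY_PREFIX = {"b": "biology", "p": "physics"}


def normalize_text_domain(label):
    """Map alternative spellings of a domain label to one canonical label."""
    return DOMAIN_ALIASES.get(label, label)


def domain_from_text_id(text_id):
    """Derive the text domain from the first letter of the text_id."""
    return DOMAIN_BY_PREFIX[text_id[0]]


labels = [normalize_text_domain(v) for v in ["bio", "biology", "physics"]]
```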

Scanpaths

TODO: insert short text about this section in this file

Please find the files under this link: Scanpaths

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	nan
text_domain	bio: 4682, biology: 200017, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of the answer to background question 1. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of the answer to background question 2. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of the answer to background question 3. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of the answer to text question 1. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of the answer to text question 2. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of the answer to text question 3. An answer is either incorrect or correct, so the value is either 0 or 1.	5785	For participants 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3), the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	nan
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	nan
previous_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	nan
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in the items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	nan
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	nan
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged.
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	0	nan	nan
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
character		string	Character as text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	nan
text_domain_numeric	0: 204699, 1: 199721	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	nan
reader_domain_numeric	0: 223158, 1: 181262	Categorical	Numerical encoding of the reader domain; 0=biology, 1=physics.	0	nan	nan
expert_status_numeric	0: 154333, 1: 250087	Categorical	Numerical value of expert_status; 0=beginner, 1=expert.	0	nan	nan
expert_reading_label_numeric	0: 290883, 1: 113537	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	nan
expert_reading_label	expert_reading: 113537, non-expert_reading: 290883	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert)	0	nan	nan
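
The definition of expert_reading_label (text_domain == reader_domain and reader is expert) can be written directly in terms of the numeric columns. A minimal sketch of that rule; the function names are hypothetical, and the code restates the definition above rather than reproducing the corpus pipeline.

```python
def expert_reading_label_numeric(text_domain_numeric,
                                 reader_domain_numeric,
                                 expert_status_numeric):
    """Return 1 for an expert reading a text from their own domain, else 0."""
    same_domain = text_domain_numeric == reader_domain_numeric
    is_expert = expert_status_numeric == 1
    return int(same_domain and is_expert)


def expert_reading_label(text_domain_numeric,
                         reader_domain_numeric,
                         expert_status_numeric):
    """Return the string label that corresponds to the numeric encoding."""
    value = expert_reading_label_numeric(
        text_domain_numeric, reader_domain_numeric, expert_status_numeric
    )
    return "expert_reading" if value == 1 else "non-expert_reading"


# A physics expert (reader_domain 1, expert_status 1) reading a physics text (1).
label = expert_reading_label(1, 1, 1)
```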

Reading measures

TODO: insert short text about this section in this file

Please find the files under this link: Reading measures

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
FFD	min: 0, max: 2144, mean: 166.4158, std: 132.8433	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	nan
SFD	min: 0, max: 2144, mean: 118.8309, std: 135.573	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	nan
FD	min: 0, max: 2144, mean: 203.5219, std: 116.9324	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	nan
FPRT	min: 0, max: 9649, mean: 247.1511, std: 298.6889	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	nan
FRT	min: 0, max: 9649, mean: 291.8272, std: 288.631	Float	First-reading time: sum of the durations of all fixations from the first fixation on the word (regardless of whether that fixation occurs in first-pass reading) until the word is left for the first time (equals FPRT if the word was fixated in the first-pass).	0	nan	nan
TFT	min: 0, max: 25314, mean: 632.8199, std: 720.3975	Float	Total-fixation time: sum of all fixations on a word (FPRT+RRT).	0	nan	nan
RRT	min: 0, max: 23902, mean: 385.6688, std: 597.5206	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	nan
RPD_inc	min: 0, max: 318898, mean: 632.8199, std: 3881.7376	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	nan
RPD_exc	min: 0, max: 315640, mean: 342.295, std: 3815.3786	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	nan
RBRT	min: 0, max: 10675, mean: 290.5249, std: 358.8929	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	nan
Fix	0: 14182, 1: 127943	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	nan
FPF	0: 38408, 1: 103717	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	nan
RR	0: 48283, 1: 93842	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	nan
FPReg	0: 119060, 1: 23065	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	nan
TRC_out	0-15	Integer	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	nan
TRC_in	0-12	Integer	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	nan
LP	0-28	Integer	Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character.	0	nan	nan
SL_in	-162 to 156	Integer	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	nan
SL_out	-179 to 63	Integer	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	nan
TFC			The total fixation count on the word.	0	nan	nan
text_domain_numeric	0: 71550, 1: 70575	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	nan
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
gender_numeric	0.0: 66325, 1.0: 73905, nan: 1895	Categorical	Numerical value of gender; 0=male, 1=female.	1895	nan	nan
reader_domain_numeric	0: 81485, 1: 60640	Categorical	Numerical encoding of the reader domain; 0=biology, 1=physics.	0	nan	nan
expert_status_numeric	0: 53060, 1: 89065	Categorical	Numerical value of expert_status; 0=beginner, 1=expert.	0	nan	nan
domain_expert_status_numeric	0: 30320, 1: 51165, 2: 22740, 3: 37900	Categorical	Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	nan
expert_reading_label_numeric	0: 97547, 1: 44578	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	nan
expert_reading_label	expert_reading: 44578, non-expert_reading: 97547	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert)	0	nan	nan
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809	Float	Reader's age.	3790	nan	nan
mean_acc_bq	min: 0.0, max: 1.0, mean: 0.6487, std: 0.3076	Float	The mean accuracy of all text questions for one text read by one reader.	1958	nan	nan
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.3939, std: 0.3158	Float	The mean accuracy of all background questions for one text read by one reader.	1958	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3922, std: 0.4883	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3619, std: 0.4805	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4277, std: 0.4947	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6469, std: 0.4779	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6428, std: 0.4792	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6563, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
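
Several reading measures in the table above are defined in terms of other columns (RPD_inc = RPD_exc + RBRT, Fix = FPF or RR, FPReg = sign(RPD_exc)). The following is a minimal sketch of how these relationships could be checked for a single row. It assumes the row is available as a dictionary keyed by column name; the function name and the example values are hypothetical and are not taken from the data files.

```python
import math


def check_reading_measures(row):
    """Return the codebook relationships that a single row violates.

    `row` is assumed to be a dict keyed by the column names of this codebook.
    """
    violations = []
    # Inclusive regression-path duration is the exclusive one plus RBRT.
    if not math.isclose(row["RPD_inc"], row["RPD_exc"] + row["RBRT"]):
        violations.append("RPD_inc = RPD_exc + RBRT")
    # A word counts as fixated if it was fixated in first-pass or re-read.
    if row["Fix"] != int(row["FPF"] == 1 or row["RR"] == 1):
        violations.append("Fix = FPF or RR")
    # FPReg is 1 exactly when the exclusive regression-path duration is positive.
    if row["FPReg"] != int(row["RPD_exc"] > 0):
        violations.append("FPReg = sign(RPD_exc)")
    return violations


# Made-up example rows for illustration only.
consistent = {"RPD_inc": 500, "RPD_exc": 200, "RBRT": 300,
              "Fix": 1, "FPF": 1, "RR": 0, "FPReg": 1}
inconsistent = dict(consistent, RPD_inc=450, Fix=0)

print(check_reading_measures(consistent))
print(check_reading_measures(inconsistent))
```

Durations are compared with math.isclose because the columns are listed as Float; an exact comparison would also work if the values are stored as whole milliseconds.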

Merged: fixations, participant info, reading measures and word features

TODO: insert short text about this section in this file

Please find the files under this link: Reading measures merged

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
word_with_punct			The word as it appears in the text, including punctuation.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	nan
text_domain	biology: 71550, physics: 70575	Categorical	The domain of the stimulus text.	0	nan	nan
word_length	2-33	Integer	Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	0.0: 70800, 0: 70425, $(: 900	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	nan
STTS_punctuation_after	$(: 750, $($,: 150, $,: 6975, $.: 7575, 0: 126675	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	nan
is_in_quote	0: 141075, 1: 1050	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 141750, 1: 375	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 134700, 1: 7425	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 134850, 1: 7275	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end			Whether or not the word is the end of a clause.	0	nan	nan
is_sent_end			Whether or not the word is the end of a sentence.	0	nan	nan
is_abbreviation	0: 141750, 1: 375	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 130500, 1: 11625	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 123450, 1: 18675	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	nan
contains_symbol	0: 141525, 1: 600	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	nan
contains_hyphen	0: 139950, 1: 2175	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (compositional first elements) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" does not count as having a hyphen.	0	nan	nan
contains_abbreviation	0: 141225, 1: 900	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	nan
STTS_PoS_tag	ADJA: 11550, ADJD: 3975, ADV: 5475, APPR: 13800, APPRART: 3600, APZR: 75, ART: 20700, CARD: 675, KOKOM: 1275, KON: 4950, KOUI: 450, KOUS: 1200, NE: 300, NN: 38625, PAV: 1350, PDAT: 1200, PDS: 525, PIAT: 375, PIDAT: 675, PIS: 750, PPER: 1875, PPOSAT: 525, PRELAT: 450, PRELS: 2175, PRF: 1875, PTKA: 75, PTKNEG: 300, PTKVZ: 975, PTKZU: 750, PWAV: 75, TRUNC: 375, VAFIN: 5475, VAINF: 600, VMFIN: 1875, VMINF: 75, VVFIN: 7650, VVINF: 2475, VVIZU: 150, VVPP: 2850	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	0	nan	dlexDB
type_length_chars	0.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	0	nan	nan
PoS_tag	adja: 12150, adjd: 4050, adv: 6825, appr: 13650, apprart: 3600, art: 21000, card: 675, kokom: 1275, kon: 4725, koui: 375, kous: 1200, ne: 525, nn: 38100, pdat: 1200, pds: 525, piat: 375, pidat: 150, pis: 1050, pper: 1800, pposat: 525, prelat: 450, prels: 1800, prf: 1875, ptka: 75, ptkneg: 300, ptkvz: 1125, ptkzu: 750, pwav: 75, trunc: 375, vafin: 5475, vainf: 600, vmfin: 1800, vminf: 75, vvfin: 7725, vvinf: 2475, vvizu: 150, vvpp: 2850, xy: 375	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	0	nan	dlexDB
lemma_length_chars	0.0-32.0	Integer	nan	0	nan	dlexDB
syllables		string	nan	0	nan	dlexDB
type_length_syllables	0.0-14.0	Integer	nan	0	nan	dlexDB
annotated_type_frequency_normalized	min: 0.0, max: 24738.5901996, mean: 3629.1612, std: 6797.6492	Float	The number of occurrences of an annotated type in the corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	0	nan	dlexDB
type_frequency_normalized	min: 0.0, max: 26530.3631386, mean: 4141.6498, std: 7546.5578	Float	nan	0	nan	dlexDB
lemma_frequency_normalized	min: 0.0, max: 80100.3069113, mean: 12271.0154, std: 24660.3797	Float	nan	0	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 3822.4994, std: 7457.3314	Float	nan	0	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 35.3095, std: 119.8288	Float	nan	0	nan	dlexDB
document_frequency_normalized	min: 0.0, max: 9372.80956103, mean: 2885.4746, std: 3353.4877	Float	nan	0	nan	dlexDB
sentence_frequency_normalized	min: 0.0, max: 30912.3596552, mean: 5745.1861, std: 9454.5921	Float	nan	0	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 0.0, max: 125126.524676, mean: 15795.556, std: 15820.9152	Float	nan	0	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.0, max: 218985.607753, mean: 21763.0396, std: 31363.3366	Float	nan	0	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 0.0, max: 7810554.20193, mean: 1800394.2485, std: 1298158.5605	Float	nan	0	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 0.0, max: 18380479.713, mean: 4004667.3367, std: 2909455.8454	Float	nan	0	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 0.0, max: 1322150.62097, mean: 341028.5141, std: 227677.2532	Float	nan	0	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 0.0, max: 2788357.77704, mean: 554080.6642, std: 451286.9101	Float	nan	0	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 0.0, max: 603427.130456, mean: 213996.2534, std: 130950.6249	Float	nan	0	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 0.0, max: 899592.89035, mean: 223247.7744, std: 175811.3775	Float	nan	0	nan	dlexDB
initial_letter_frequency_normalized	min: 0.0, max: 110461.430317, mean: 36031.6466, std: 33586.1123	Float	nan	0	nan	dlexDB
initial_bigram_frequency_normalized	min: 0.0, max: 53801.2331077, mean: 11986.4422, std: 14536.7787	Float	nan	0	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 5528.0412, std: 8782.9659	Float	nan	0	nan	dlexDB
avg_cond_prob_in_bigrams	min: 0.0, max: 0.5006180465, mean: 0.0423, std: 0.0447	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
avg_cond_prob_in_trigrams	min: 0.0, max: 25.0, mean: 0.2371, std: 0.5852	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2111.0615, std: 7323.9586	Float	nan	0	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.195, std: 0.4875	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 4765.8454, std: 9884.7277	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 14.8301, std: 14.4676	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 2703.1737, std: 7703.635	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.3077, std: 0.6418	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 6310.865, std: 11349.5391	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 23.1334, std: 22.6083	Float	nan	0	nan	dlexDB
sent_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	nan
sent_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	nan
sent_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	nan
text_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	nan
sent_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	nan
text_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	nan
sent_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	nan
text_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	nan
FFD	min: 0, max: 2144, mean: 166.4158, std: 132.8433	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	nan
SFD	min: 0, max: 2144, mean: 118.8309, std: 135.573	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	nan
FD	min: 0, max: 2144, mean: 203.5219, std: 116.9324	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	nan
FPRT	min: 0, max: 9649, mean: 247.1511, std: 298.6889	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	nan
FRT	min: 0, max: 9649, mean: 291.8272, std: 288.631	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass).	0	nan	nan
TFT	min: 0, max: 25314, mean: 632.8199, std: 720.3975	Float	Total-fixation time: sum of the durations of all fixations on a word (FPRT+RRT).	0	nan	nan
TFC			The total fixation count on the word.	0	nan	nan
RRT	min: 0, max: 23902, mean: 385.6688, std: 597.5206	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	nan
RPD_inc	min: 0, max: 318898, mean: 632.8199, std: 3881.7376	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	nan
RPD_exc	min: 0, max: 315640, mean: 342.295, std: 3815.3786	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	nan
RBRT	min: 0, max: 10675, mean: 290.5249, std: 358.8929	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	nan
Fix	0: 14182, 1: 127943	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	nan
FPF	0: 38408, 1: 103717	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	nan
RR	0: 48283, 1: 93842	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	nan
FPReg	0: 119060, 1: 23065	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	nan
TRC_out	0-15	Integer	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	nan
TRC_in	0-12	Integer	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	nan
LP	0-28	Integer	Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character.	0	nan	nan
SL_in	-162 to 156	Integer	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	nan
SL_out	-179 to 63	Integer	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3922, std: 0.4883	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3619, std: 0.4805	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4277, std: 0.4947	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6469, std: 0.4779	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6428, std: 0.4792	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6563, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	1958	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.3939, std: 0.3158	Float	The mean accuracy of all background questions for one text read by one reader.	1958	nan	nan
mean_acc_bq	min: 0.0, max: 1.0, mean: 0.6487, std: 0.3076	Float	The mean accuracy of all text questions for one text read by one reader.	1958	nan	nan
text_domain_numeric	0: 142125	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
gender_numeric	0.0: 66325, 1.0: 73905, nan: 1895	Categorical	Numerical value of gender; 0=male, 1=female.	1895	nan	nan
reader_domain_numeric	0: 81485, 1: 60640	Categorical	Numerical encoding of the reader domain; 0=biology, 1=physics.	0	nan	nan
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.1809	Float	Reader's age.	3790	nan	nan
expert_status_numeric	0: 53060, 1: 89065	Categorical	Numerical value of expert_status; 0=beginner, 1=expert.	0	nan	nan
domain_expert_status_numeric	0: 30320, 1: 51165, 2: 22740, 3: 37900	Categorical	Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	nan
expert_reading_label_numeric	0: 90960, 1: 51165	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	nan
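
The numeric encodings described in the table above (text_id_numeric and expert_reading_label_numeric) can be reproduced from their stated definitions. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes that the inputs already use the numeric codes listed in this codebook (0=biology, 1=physics; 0=beginner, 1=expert).

```python
# Text identifiers in the order given by text_id_numeric (0=b0, ..., 11=p5).
TEXT_IDS = ["b0", "b1", "b2", "b3", "b4", "b5",
            "p0", "p1", "p2", "p3", "p4", "p5"]
TEXT_ID_TO_NUMERIC = {text_id: index for index, text_id in enumerate(TEXT_IDS)}


def expert_reading_label_numeric(text_domain_numeric,
                                 reader_domain_numeric,
                                 expert_status_numeric):
    """Return 1 for expert_reading, 0 for non-expert_reading.

    Follows the codebook definition: the text domain equals the reader
    domain and the reader is an expert.
    """
    same_domain = text_domain_numeric == reader_domain_numeric
    is_expert = expert_status_numeric == 1
    return int(same_domain and is_expert)


# A biology expert reading a biology text is an expert reading.
print(expert_reading_label_numeric(0, 0, 1))
# A physics expert reading a biology text is not.
print(expert_reading_label_numeric(0, 1, 1))
print(TEXT_ID_TO_NUMERIC["p0"])
```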

Merged: scanpaths, participant info, reading measures and word features

TODO: insert short text about this section in this file

Please find the files under this link: Scanpaths merged

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
fixation_index	1-1469	Integer	The index of the fixation in temporal order.	0	nan	nan
text_domain	bio: 4682, biology: 200017, physics: 199721	Categorical	The domain of the stimulus text.	0	nan	nan
trial	1-12	Integer	Each participant reads all 12 texts, the order of which follows their trial number. If text b0 has trial number 2 for participant 5, this participant read text b0 as the second text.	0	nan	nan
acc_bq_1	min: 0.0, max: 1.0, mean: 0.3869, std: 0.487	Float	The accuracy of background question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_2	min: 0.0, max: 1.0, mean: 0.3564, std: 0.4789	Float	The accuracy of background question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_bq_3	min: 0.0, max: 1.0, mean: 0.4217, std: 0.4938	Float	The accuracy of background question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_1	min: 0.0, max: 1.0, mean: 0.6625, std: 0.4729	Float	The accuracy of text question 1. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_2	min: 0.0, max: 1.0, mean: 0.6326, std: 0.4821	Float	The accuracy of text question 2. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
acc_tq_3	min: 0.0, max: 1.0, mean: 0.6564, std: 0.4749	Float	The accuracy of text question 3. The answer can be either true or false, so the value is either 0 or 1.	5785	For participant 1 (p0), 31 (p0, b1, b5), 32 (p0), 61 (p1), 62 (b0, b1, b3, b5, 04) and 90 (b3) the accuracies for certain trials are missing due to hardware problems (missing measurements).	nan
fixation_duration	2-4474	Integer	The duration of the fixation in milliseconds.	0	nan	nan
next_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that follows a fixation in milliseconds.	46	nan	nan
previous_saccade_duration	1.0-9491.0	Integer	The duration of the saccade that precedes a fixation in milliseconds.	515	nan	nan
version	0-105	Integer	Specifies the version of the items. In each version, the order of the stimuli and the order of the answer options for each question differ. The specifics of each version can be found in items.tsv.	0	nan	nan
line	1-12	Integer	The texts were presented on the screen in multiple lines. Specifies the line of the respective row; indexing starts at 1.	0	nan	nan
aoi	1-1121	Integer	The region of interest specified as character index in the text (see char_index_in_text). Defines which character has been fixated.	0	nan	nan
char_index_in_line	1-100	Integer	Index of a character in the line. Indexing starts at 1.	0	nan	nan
original_fixation_index	1-1478	Integer	The index of the uncorrected fixation.	0	nan	nan
is_fixation_adjusted	False: 382202, True: 22218	Categorical	Whether or not the fixation has been adjusted manually.	0	nan	Manually tagged
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	0	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	0	nan	nan
sent_index_in_text	1-12	Integer	The index of a sentence in the respective text. Indexing starts at 1.	0	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	0	nan	nan
word		string	Words as they appear in the stimuli texts. Words are split at white-space.	0	nan	nan
character		string	Character as text.	0	nan	nan
text_id_numeric	0-11	Integer	Numerical value of text_id; 0=b0, 1=b1, 2=b2, 3=b3, 4=b4, 5=b5, 6=p0, 7=p1, 8=p2, 9=p3, 10=p4, 11=p5	0	nan	nan
text_domain_numeric	0: 204699, 1: 199721	Categorical	Numerical value of text_domain; 0=biology, 1=physics.	0	nan	nan
reader_domain_numeric	0: 223158, 1: 181262	Categorical	Numerical encoding of the reader domain; 0=biology, 1=physics.	0	nan	nan
expert_status_numeric	0: 154333, 1: 250087	Categorical	Numerical value of expert_status; 0=beginner, 1=expert.	0	nan	nan
expert_reading_label_numeric	0: 290883, 1: 113537	Categorical	Numeric encoding of whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert). 1=expert_reading, 0=non-expert_reading	0	nan	nan
expert_reading_label	expert_reading: 113537, non-expert_reading: 290883	Categorical	Whether the reader is an expert in the text domain (i.e. text_domain == reader_domain and reader is expert)	0	nan	nan
word_with_punct			The word as it appears in the text, including punctuation.	96	nan	nan
word_index_in_sent	1-51	Integer	The index of the word in the sentence. Indexing starts at 1.	0	nan	nan
word_length	2-33	Integer	Word length is defined in number of characters including symbols like hyphens but without sentence punctuation at the end (e.g., z.B. = 4 characters; DNA-Kette = 9 characters; eats. = 4 characters).	0	nan	nan
STTS_punctuation_before	0.0: 211108, 0: 189407, $(: 3905	Categorical	If a word is preceded by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	nan
STTS_punctuation_after	$(: 3260, $($,: 573, $,: 22559, $.: 25794, 0: 352234	Categorical	If a word is followed by a punctuation mark, the STTS-PoS-tag of the punctuation mark is added here.	0	nan	nan
is_in_quote	0: 399715, 1: 4705	Categorical	Whether or not the word is part of an expression in quotes.	0	nan	Manually tagged
is_in_parentheses	0: 403155, 1: 1265	Categorical	Whether or not the word is part of a phrase in parentheses.	0	nan	Manually tagged
is_clause_beginning	0: 388232, 1: 16188	Categorical	Whether or not the word is the beginning of a clause.	0	nan	Manually tagged
is_sent_beginning	0: 386681, 1: 17739	Categorical	Whether or not the word is the beginning of a new sentence.	0	nan	Manually tagged
is_clause_end			Whether or not the word is the end of a clause.	0	nan	nan
is_sent_end			Whether or not the word is the end of a sentence.	0	nan	nan
is_abbreviation	0: 403478, 1: 942	Categorical	Whether or not the entire word is an abbreviation.	0	nan	Manually tagged
is_expert_technical_term	0: 332354, 1: 72066	Categorical	1 if the word is a technical term that is not generally understandable. E.g.: "Agarose".	0	nan	Manually tagged
is_general_technical_term	0: 325333, 1: 79087	Categorical	1 if the word is a technical term that is generally understandable. E.g.: "elektrisch"	0	nan	nan
contains_symbol	0: 400458, 1: 3962	Categorical	Whether or not the word contains a symbol. E.g.: β-D-Glucose	0	nan	nan
contains_hyphen	0: 388149, 1: 16271	Categorical	Whether or not the word contains a hyphen, e.g. 1 for DNA-Fragment. Words tagged TRUNC (compositional first elements) do not count: in "Sekundär- und Tertiärstrukturen", "Sekundär-" does not count as having a hyphen.	0	nan	nan
contains_abbreviation	0: 399423, 1: 4997	Categorical	Whether or not the word contains an abbreviation. 0 for words that are only an abbreviation. See is_abbreviation. E.g. 1 for DNA-Fragment, 0 for DNA.	0	nan	nan
STTS_PoS_tag	ADJA: 51041, ADJD: 12714, ADV: 12236, APPR: 22470, APPRART: 5566, APZR: 91, ART: 37340, CARD: 1594, KOKOM: 2428, KON: 5798, KOUI: 654, KOUS: 2521, NE: 955, NN: 162980, PAV: 3444, PDAT: 3292, PDS: 1374, PIAT: 791, PIDAT: 1653, PIS: 1322, PPER: 2511, PPOSAT: 1360, PRELAT: 1302, PRELS: 4193, PRF: 3606, PTKA: 97, PTKNEG: 687, PTKVZ: 1490, PTKZU: 583, PWAV: 76, TRUNC: 1137, VAFIN: 10340, VAINF: 1206, VMFIN: 3953, VMINF: 153, VVFIN: 23854, VVINF: 7713, VVIZU: 578, VVPP: 9317	Categorical	Part-of-speech tags according to the STTS-tagset. See stimuli/ANNOTATION.MD for more information.	0	nan	Manually tagged
type		string	The orthographical representation of a word as found in the corpus; this data is case sensitive, i.e. there is a distinction between name and Name.	0	nan	dlexDB
type_length_chars	0.0-33.0	Integer	The length of the type of a word in characters. See the description of word_length for a definition of how characters are counted.	0	nan	nan
PoS_tag	adja: 53330, adjd: 12226, adv: 15728, appr: 22193, apprart: 5566, art: 37918, card: 1594, kokom: 2428, kon: 5405, koui: 559, kous: 2521, ne: 1386, nn: 160585, pdat: 3292, pds: 1374, piat: 791, pidat: 352, pis: 2063, pper: 2434, pposat: 1360, prelat: 1302, prels: 4076, prf: 3606, ptka: 97, ptkneg: 687, ptkvz: 1891, ptkzu: 583, pwav: 76, trunc: 1137, vafin: 10340, vainf: 1206, vmfin: 3829, vminf: 153, vvfin: 23978, vvinf: 7713, vvizu: 578, vvpp: 9317, xy: 746	Categorical	Part-of-speech tag as defined by the dlexDB query.	0	nan	dlexDB
lemma		string	nan	0	nan	dlexDB
lemma_length_chars	0.0-32.0	Integer	nan	0	nan	dlexDB
syllables		string	nan	0	nan	dlexDB
type_length_syllables	0.0-14.0	Integer	nan	0	nan	dlexDB
annotated_type_frequency_normalized	min: 0.0, max: 24738.5901996, mean: 1950.9055, std: 5185.3006	Float	The number of occurrences of an annotated type in corpus. An annotated type is a unique combination of a type, its part-of-speech tag and its lemma.	0	nan	dlexDB
type_frequency_normalized	min: 0.0, max: 26530.3631386, mean: 2247.4523, std: 5847.2187	Float	nan	0	nan	dlexDB
lemma_frequency_normalized	min: 0.0, max: 80100.3069113, mean: 7203.2409, std: 19769.4428	Float	nan	0	nan	dlexDB
familiarity_normalized	min: 0.0, max: 26530.3631386, mean: 2191.7786, std: 5759.2592	Float	nan	0	nan	dlexDB
regularity_normalized	min: 0.0, max: 2123.30585022, mean: 46.8657, std: 137.5046	Float	nan	0	nan	dlexDB
document_frequency_normalized	min: 0.0, max: 9372.80956103, mean: 1684.1043, std: 2829.0626	Float	nan	0	nan	dlexDB
sentence_frequency_normalized	min: 0.0, max: 30912.3596552, mean: 3137.4539, std: 7374.8037	Float	nan	0	nan	dlexDB
cumulative_syllable_corpus_frequency_normalized	min: 0.0, max: 125126.524676, mean: 15768.7784, std: 17529.5528	Float	nan	0	nan	dlexDB
cumulative_syllable_lexicon_frequency_normalized	min: 0.0, max: 218985.607753, mean: 27232.3183, std: 36883.9628	Float	nan	0	nan	dlexDB
cumulative_character_corpus_frequency_normalized	min: 0.0, max: 7810554.20193, mean: 2053804.334, std: 1596380.3916	Float	nan	0	nan	dlexDB
cumulative_character_lexicon_frequency_normalized	min: 0.0, max: 18380479.713, mean: 4612580.9638, std: 3597155.0404	Float	nan	0	nan	dlexDB
cumulative_character_bigram_corpus_frequency_normalized	min: 0.0, max: 1322150.62097, mean: 356831.454, std: 269772.388	Float	nan	0	nan	dlexDB
cumulative_character_bigram_lexicon_frequency_normalized	min: 0.0, max: 2788357.77704, mean: 629626.1651, std: 539088.9742	Float	nan	0	nan	dlexDB
cumulative_character_trigram_corpus_frequency_normalized	min: 0.0, max: 603427.130456, mean: 200341.8076, std: 144122.7012	Float	nan	0	nan	dlexDB
cumulative_character_trigram_lexicon_frequency_normalized	min: 0.0, max: 899592.89035, mean: 236423.2776, std: 199573.1416	Float	nan	0	nan	dlexDB
initial_letter_frequency_normalized	min: 0.0, max: 110461.430317, mean: 28045.0077, std: 30618.9167	Float	nan	0	nan	dlexDB
initial_bigram_frequency_normalized	min: 0.0, max: 53801.2331077, mean: 8706.0335, std: 12743.2638	Float	nan	0	nan	dlexDB
initial_trigram_frequency_normalized	min: -0.00817507899599, max: 29048.3692201, mean: 3754.6304, std: 7393.1224	Float	nan	0	nan	dlexDB
avg_cond_prob_in_bigrams	min: 0.0, max: 0.5006180465, mean: 0.0313, std: 0.0466	Float	The conditional probability of the bigram, given the occurrence of its first component. In other words, how likely it is for the second component to follow directly after the first. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
avg_cond_prob_in_trigrams	min: 0.0, max: 25.0, mean: 0.2251, std: 0.8814	Float	The conditional probability of the trigram, given the occurrence of its initial bigram. In other words, how likely it is for the third component to follow directly after the initial pair. Here, this measure is computed on the basis of the annotated type information.	0	nan	dlexDB
neighbors_coltheart_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1276.643, std: 5775.4034	Float	nan	0	nan	dlexDB
neighbors_coltheart_higher_freq_count_normalized	min: 0.0, max: 8.13363128109, mean: 0.1556, std: 0.4321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_cum_freq_normalized	min: 0.0, max: 49782.1108458, mean: 2794.1781, std: 7982.6321	Float	nan	0	nan	dlexDB
neighbors_coltheart_all_count_normalized	min: 0.0, max: 47.5175301158, mean: 9.0448, std: 12.679	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_cum_freq_normalized	min: 0.0, max: 44055.247282, mean: 1683.6273, std: 6153.8504	Float	nan	0	nan	dlexDB
neighbors_levenshtein_higher_freq_count_normalized	min: 0.0, max: 11.9864039932, mean: 0.2681, std: 0.5814	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_cum_freq_normalized	min: 0.0, max: 54875.2749862, mean: 3761.4734, std: 9299.5647	Float	nan	0	nan	dlexDB
neighbors_levenshtein_all_count_normalized	min: 0.0, max: 75.7711966712, mean: 14.1417, std: 19.6383	Float	nan	0	nan	dlexDB
sent_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-base			Surprisal value extracted from a language model (GerPT2-base) with the text as context.	0	nan	nan
sent_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the sentence as context.	0	nan	nan
text_surprisal_gpt2-large			Surprisal value extracted from a language model (GerPT2-large) with the text as context.	0	nan	nan
sent_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the sentence as context.	0	nan	nan
text_surprisal_llama-7b			Surprisal value extracted from a language model (LeoLM-7b) with the text as context.	0	nan	nan
sent_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the sentence as context.	0	nan	nan
text_surprisal_llama-13b			Surprisal value extracted from a language model (LeoLM-13b) with the text as context.	0	nan	nan
sent_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the sentence as context.	0	nan	nan
text_surprisal_bert-base			Surprisal value extracted from a language model (BERT-base) with the text as context.	0	nan	nan
FFD	min: 0, max: 2144, mean: 195.9741, std: 124.5597	Float	First-fixation duration: duration of the first fixation on a word if this word is fixated in first-pass reading, otherwise 0.	0	nan	nan
SFD	min: 0, max: 2144, mean: 107.9483, std: 134.474	Float	Single-fixation duration: duration of the only first-pass fixation on a word, 0 if the word was skipped or more than one fixation occurred in the first-pass (equals FFD in case of a single first-pass fixation).	0	nan	nan
FD	min: 0, max: 2144, mean: 226.9857, std: 103.7904	Float	First duration: duration of the first fixation on a word (identical to FFD if not skipped in the first-pass).	0	nan	nan
FPRT	min: 0, max: 9649, mean: 408.9247, std: 526.0428	Float	First-pass reading time: sum of the durations of all first-pass fixations on a word (0 if the word was skipped in the first-pass).	0	nan	nan
FRT	min: 0, max: 9649, mean: 456.8788, std: 518.1388	Float	First-reading time: sum of the durations of all fixations from first fixating the word (regardless of whether the first fixation occurs in first-pass reading) until leaving the word for the first time (equals FPRT in case the word was fixated in the first-pass).	0	nan	nan
TFT	min: 0, max: 25314, mean: 1333.0163, std: 1428.494	Float	Total-fixation time: sum of the durations of all fixations on a word (FPRT+RRT).	0	nan	nan
TFC			The total fixation count on the word.	0	nan	nan
RRT	min: 0, max: 23902, mean: 924.0916, std: 1240.0587	Float	Re-reading time: sum of the durations of all fixations on a word that do not belong to the first-pass (TFT-FPRT).	0	nan	nan
RPD_inc	min: 0, max: 318898, mean: 1076.7946, std: 5339.73	Float	Inclusive regression-path duration: Sum of all fixation durations starting from the first first-pass fixation on a word until fixation on a word to the right of this word (including all regressive fixations on previous words), 0 if the word was not fixated in the first-pass (RPD_exc+RBRT).	0	nan	nan
RPD_exc	min: 0, max: 315640, mean: 557.5849, std: 5209.143	Float	Exclusive regression-path duration: Sum of all fixation durations after initiating a first-pass regression from a word until fixating a word to the right of this word, without counting fixations on the word itself (RPD_inc-RBRT).	0	nan	nan
RBRT	min: 0, max: 10675, mean: 519.2098, std: 638.9024	Float	Right-bounded reading time: Sum of all fixation durations on a word until a word to the right of this word is fixated (RPD_inc-RPD_exc).	0	nan	nan
Fix	0: 110, 1: 404310	Categorical	Fixation: 1 if the word was fixated, otherwise 0 (FPF or RR).	0	nan	nan
FPF	0: 56838, 1: 347582	Categorical	First-pass fixation: 1 if the word was fixated in the first-pass, otherwise 0.	0	nan	nan
RR	0: 48241, 1: 356179	Categorical	Re-reading: 1 if the word was fixated after the first-pass reading, otherwise 0 (sign(RRT)).	0	nan	nan
FPReg	0: 308156, 1: 96264	Categorical	First-pass regression: 1 if a regression was initiated in the first-pass reading of the word, otherwise 0 (sign(RPD_exc)).	0	nan	nan
TRC_out	0-15	Integer	Total count of outgoing regressions: total number of regressive saccades initiated from this word.	0	nan	nan
TRC_in	0-12	Integer	Total count of incoming regressions: total number of regressive saccades landing on this word.	0	nan	nan
LP	1-28	Integer	Landing position: position of the first saccade on the word expressed by ordinal position of the fixated character.	0	nan	nan
SL_in	-162-156	Integer	Incoming saccade length: length of the saccade that leads to first fixation on a word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression.	0	nan	nan
SL_out	-179-63	Integer	Outgoing saccade length: length of the first saccade that leaves the word in number of words; positive sign if the saccade is a progressive one, negative sign if it is a regression; 0 if the word is never fixated.	0	nan	nan
mean_acc_tq	min: 0.0, max: 1.0, mean: 0.3883, std: 0.3144	Float	The mean accuracy of all text questions for one text read by one reader.	5785	nan	nan
mean_acc_bq	min: 0.0, max: 1.0, mean: 0.6505, std: 0.3052	Float	The mean accuracy of all background questions for one text read by one reader.	5785	nan	nan
gender_numeric	0.0: 187536, 1.0: 212874, nan: 4010	Categorical	Numerical value of gender; 0=male, 1=female.	4010	nan	nan
age	min: 18.0, max: 41.0, mean: 24.0283, std: 4.1436	Float	Reader's age.	8459	nan	nan
domain_expert_status_numeric	0: 89325, 1: 133833, 2: 65008, 3: 116254	Categorical	Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	nan
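
Several descriptions above state relations between the reading measures in parentheses (for example, TFT is given as FPRT+RRT). As a minimal sketch, these relations could be checked per row as shown here. The helper names `sign` and `check_reading_measures` and the example values are hypothetical and not part of the data set; `sign` is assumed to map any positive duration to 1 and zero to 0. Whether every row of the released files satisfies each relation has not been verified here.

```python
def sign(x):
    """Return 1 for a positive value, otherwise 0 (assumption: durations are never negative)."""
    return 1 if x > 0 else 0


def check_reading_measures(row):
    """Return the names of the stated relations that a row violates.

    `row` is a dict keyed by the column names of the table above.
    """
    checks = {
        "TFT = FPRT + RRT": row["TFT"] == row["FPRT"] + row["RRT"],
        "RPD_inc = RPD_exc + RBRT": row["RPD_inc"] == row["RPD_exc"] + row["RBRT"],
        "Fix = FPF or RR": row["Fix"] == int(bool(row["FPF"]) or bool(row["RR"])),
        "RR = sign(RRT)": row["RR"] == sign(row["RRT"]),
        "FPReg = sign(RPD_exc)": row["FPReg"] == sign(row["RPD_exc"]),
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical example row; the values are invented for illustration only.
example = {
    "FPRT": 210, "RRT": 180, "TFT": 390,
    "RPD_exc": 150, "RBRT": 210, "RPD_inc": 360,
    "FPF": 1, "RR": 1, "Fix": 1, "FPReg": 1,
}
print(check_reading_measures(example))  # prints []
```

A non-empty result lists the relations that do not hold for that row, which can help spot rows that need a closer look before analysis.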

AOI to word mapping

TODO: insert short text about this section in this file

Please find the file under this link: aoi to word mapping

Column name	Possible values	Value type	Description	Missing value description	Source
text_id	b0, b1, b2, b3, b4, b5, p0, p1, p2, p3, p4, p5		Unique identifier given to each stimulus text.	nan	nan
word_index_in_text	1-180	Integer	The index of the word in the text. Indexing starts at 1.	nan	nan
char_index_in_text	1-1121	Integer	Index of a character in the text. Indexing starts at 1.	nan	nan
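
The mapping can be used to recover the character span of each word. The sketch below assumes that the file contains one row per character and that the span of a word runs from its smallest to its largest character index (both inclusive); neither assumption is confirmed by this codebook. The function name `word_char_limits` and the example rows are hypothetical.

```python
def word_char_limits(mapping_rows):
    """Collect (first, last) character index per (text_id, word_index_in_text)."""
    limits = {}
    for row in mapping_rows:
        key = (row["text_id"], row["word_index_in_text"])
        idx = row["char_index_in_text"]
        start, end = limits.get(key, (idx, idx))
        limits[key] = (min(start, idx), max(end, idx))
    return limits


# Hypothetical rows: a three-character word followed by a two-character word.
rows = [
    {"text_id": "b0", "word_index_in_text": 1, "char_index_in_text": 1},
    {"text_id": "b0", "word_index_in_text": 1, "char_index_in_text": 2},
    {"text_id": "b0", "word_index_in_text": 1, "char_index_in_text": 3},
    {"text_id": "b0", "word_index_in_text": 2, "char_index_in_text": 5},
    {"text_id": "b0", "word_index_in_text": 2, "char_index_in_text": 6},
]
limits = word_char_limits(rows)
print(limits[("b0", 1)])  # prints (1, 3)
```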

Participants

TODO: insert short text about this section in this file

Please find the file under this link: Participant information

Column name	Possible values	Value type	Description	Num missing values	Missing value description	Source
reader_id	0-105	Integer	The unique identifier given to each reader. Reader IDs start at 0.	0	nan	nan
reader_domain	biology: 43, physics: 32	Categorical	The area of expertise of the reader. All readers are students whose major is either physics or biology.	0	nan	nan
reader_domain_numeric	0: 43, 1: 32	Categorical	Numerical encoding of the reader domain; 0=biology, 1=physics.	0	nan	nan
expert_status	beginner: 28, expert: 47	Categorical	Reader's expert status. Readers are considered experts if they are either MSc or PhD students. 1st semester BSc students are considered beginners.	0	nan	nan
expert_status_numeric	0: 28, 1: 47	Categorical	Numerical value of expert_status; 0=beginner, 1=expert.	0	nan	nan
domain_expert_status	biology-beginner: 16, biology-expert: 27, physics-beginner: 12, physics-expert: 20	Categorical	The combination of the readers' major (reader_domain) and their expertise (expert_status).	0	nan	nan
domain_expert_status_numeric	0: 16, 1: 27, 2: 12, 3: 20	Categorical	Numerical value of domain_expert_status; 0=biology-beginner, 1=biology-expert, 2=physics-beginner, 3=physics-expert.	0	nan	nan
glasses	no: 54, yes: 20, nan: 1	Categorical	Whether or not the reader is wearing glasses.	1	nan	nan
age	min: 18.0, max: 41.0, mean: 24.1644, std: 4.2098	Float	Reader's age.	2	nan	nan
handedness	right: 68, left: 6, nan: 1	Categorical	Reader's handedness.	1	nan	nan
hours_sleep	min: 0.0, max: 11.0, mean: 7.2095, std: 1.3138	Float	The hours of sleep of the participant before the experiment.	1	nan	nan
alcohol	no: 71, yes: 3, nan: 1	Categorical	Whether or not a participant consumed alcohol within 24 hours before the experiment start.	1	nan	nan
gender	female: 39, male: 35, nan: 1	Categorical	Reader's gender.	1	nan	nan
gender_numeric	0.0: 35, 1.0: 39, nan: 1	Categorical	Numerical value of gender; 0=male, 1=female.	1	nan	nan
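
The numeric encodings above are consistent with each other: with 0=biology, 1=physics and 0=beginner, 1=expert, the value of domain_expert_status_numeric equals 2 * reader_domain_numeric + expert_status_numeric. The following sketch decodes the combined code on that basis; the function name `decode_domain_expert_status` is hypothetical and only the mappings are taken from the table.

```python
# Mappings as listed in the table above.
READER_DOMAIN = {0: "biology", 1: "physics"}
EXPERT_STATUS = {0: "beginner", 1: "expert"}


def decode_domain_expert_status(code):
    """Turn a domain_expert_status_numeric code (0-3) into its label."""
    domain_code, status_code = divmod(code, 2)
    return f"{READER_DOMAIN[domain_code]}-{EXPERT_STATUS[status_code]}"


for code in range(4):
    print(code, decode_domain_expert_status(code))
```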
