CITATION.cff
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
  Dolma: an Open Corpus of Three Trillion Tokens for
  Language Model Pretraining Research
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - family-names: Soldaini
    given-names: Luca
    email: [email protected]
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0001-6998-9863'
  - family-names: Kinney
    given-names: Rodney
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Bhagia
    given-names: Akshita
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Schwenk
    given-names: Dustin
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Atkinson
    given-names: David
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Authur
    given-names: Russell
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Bogin
    given-names: Ben
    email: [email protected]
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Chandu
    given-names: Khyathi
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Dumas
    given-names: Jennifer
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Elazar
    given-names: Yanai
    email: [email protected]
    affiliation: 'Allen Institute For AI, University of Washington'
  - family-names: Hofmann
    given-names: Valentin
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Jha
    given-names: Ananya Harsh
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Kumar
    given-names: Sachin
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Lucy
    given-names: Li
    email: [email protected]
    affiliation: 'University of California, Berkeley, Allen Institute For AI'
  - family-names: Lyu
    given-names: Xinxi
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Lambert
    given-names: Nathan
    email: [email protected]
    affiliation: Allen Institute For AI
    orcid: 'https://orcid.org/0000-0002-9997-6817'
  - family-names: Magnusson
    given-names: Ian
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Morrison
    given-names: Jacob
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Muennighoff
    given-names: Niklas
    email: [email protected]
  - family-names: Naik
    given-names: Aakanksha
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Nam
    given-names: Crystal
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Peters
    given-names: Matthew E
    affiliation: Spiffy AI
    email: [email protected]
  - family-names: Ravichander
    given-names: Abhilasha
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Richardson
    given-names: Kyle
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Shen
    given-names: Shannon Zejiang
    email: [email protected]
    affiliation: Massachusetts Institute of Technology
  - family-names: Strubell
    given-names: Emma
    email: [email protected]
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
    orcid: 'https://orcid.org/0000-0003-2798-0726'
  - family-names: Subramani
    given-names: Nishant
    email: [email protected]
    affiliation: 'Carnegie Mellon University, Allen Institute For AI'
  - family-names: Tafjord
    given-names: Oyvind
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Walsh
    given-names: Pete
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Zettlemoyer
    given-names: Luke
    email: [email protected]
    affiliation: University of Washington
    orcid: 'https://orcid.org/0009-0008-8296-0764'
  - family-names: Smith
    given-names: Noah A
    email: [email protected]
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-2310-6380'
  - family-names: Hajishirzi
    given-names: Hannaneh
    email: [email protected]
    affiliation: 'Allen Institute For AI, University of Washington'
    orcid: 'https://orcid.org/0000-0002-1055-6657'
  - family-names: Beltagy
    given-names: Iz
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Groeneveld
    given-names: Dirk
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Dodge
    given-names: Jesse
    email: [email protected]
    affiliation: Allen Institute For AI
  - family-names: Lo
    given-names: Kyle
    email: [email protected]
    affiliation: Allen Institute For AI
identifiers:
  - type: url
    value: 'https://arxiv.org/abs/2402.00159'
    description: arXiv
  - type: url
    value: 'https://huggingface.co/datasets/allenai/dolma'
    description: Dataset
repository-code: 'https://github.com/allenai/dolma'
url: 'https://github.com/allenai/dolma'
abstract: >
  Language models have become a critical technology to
  tackling a wide range of natural language processing
  tasks, yet many details about how the best-performing
  language models were developed are not reported. In
  particular, information about their pretraining corpora is
  seldom discussed: commercial language models rarely
  provide any information about their data; even open models
  rarely release datasets they are trained on, or an exact
  recipe to reproduce them. As a result, it is challenging
  to conduct certain threads of language modeling research,
  such as understanding how training data impacts model
  capabilities and shapes their limitations. To facilitate
  open research on language model pretraining, we release
  Dolma, a three trillion tokens English corpus, built from
  a diverse mixture of web content, scientific papers, code,
  public-domain books, social media, and encyclopedic
  materials. In addition, we open source our data curation
  toolkit to enable further experimentation and reproduction
  of our work. In this report, we document Dolma, including
  its design principles, details about its construction, and
  a summary of its contents. We interleave this report with
  analyses and experimental results from training language
  models on intermediate states of Dolma to share what we
  have learned about important data curation practices,
  including the role of content or quality filters,
  deduplication, and multi-source mixing. Dolma has been
  used to train OLMo, a state-of-the-art, open language
  model and framework designed to build and study the
  science of language modeling.
license: Apache-2.0