-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathindex.html
261 lines (259 loc) · 13.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
<!DOCTYPE html>
<html>
<head>
<title>Dataset Project</title>
<link href='https://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="https://caltechlibrary.github.io/css/site.css">
</head>
<body>
<header>
<a href="http://library.caltech.edu" title="link to Caltech Library Homepage"><img src="https://caltechlibrary.github.io/assets/liblogo.gif" alt="Caltech Library logo"></a>
</header>
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="index.html">README</a></li>
<li><a href="LICENSE">LICENSE</a></li>
<li><a href="install.html">INSTALL</a></li>
<li><a href="user_manual.html">User Manual</a></li>
<li><a href="about.html">About</a></li>
<li><a href="search.html">Search</a></li>
<li><a href="https://github.com/caltechlibrary/dataset">GitHub</a></li>
</ul>
</nav>
<section>
<h1 id="dataset-project">Dataset Project</h1>
<p><a href="https://data.caltech.edu/badge/latestdoi/79394591"><img
src="https://data.caltech.edu/badge/79394591.svg" alt="DOI" /></a></p>
<p><a href="https://www.repostatus.org/#active"><img
src="https://www.repostatus.org/badges/latest/active.svg"
alt="Project Status: Active – The project has reached a stable, usable state and is being actively developed." /></a></p>
<p>The Dataset Project provides tools for working with collections of
JSON documents stored on the local file system in a pairtree or in a SQL
database supporting JSON columns. Two primary tools are provided by the
project – a command line interface (dataset) and a <a
href="https://en.wikipedia.org/wiki/Representational_state_transfer">RESTful</a>
web service (datasetd).</p>
<h2 id="dataset-a-command-line-tool">dataset, a command line tool</h2>
<p><a href="doc/dataset.html">dataset</a> is a command line tool for
working with collections of <a
href="https://en.wikipedia.org/wiki/JSON">JSON</a> documents.
Collections can be stored on the file system in a pairtree directory
structure or stored in a SQL database that supports JSON columns
(currently SQLite3 or MySQL 8 are supported). Collections using the file
system store the JSON documents in a <a
href="https://datatracker.ietf.org/doc/html/draft-kunze-pairtree-01">pairtree</a>.
The JSON documents are plain UTF-8 source. This means the objects can be
accessed with common <a
href="https://en.wikipedia.org/wiki/Unix">Unix</a> text processing tools
as well as most programming languages.</p>
<p>The <strong>dataset</strong> command line tool supports common data
management operations such as initialization of collections; document
creation, reading, updating and deleting; listing keys of JSON objects
in the collection; and associating non-JSON documents (attachments) with
specific JSON documents in the collection.</p>
<h3 id="enhanced-features-include">enhanced features include</h3>
<ul>
<li>aggregate objects into data <a
href="docs/frame.html">frames</a></li>
<li>generate sample sets of keys and objects</li>
<li>clone a collection</li>
<li>clone a collection into training and test samples</li>
</ul>
<p>See <a href="how-to/getting-started-with-dataset.html">Getting
started with dataset</a> for a tour and tutorial.</p>
<h2 id="datasetd-dataset-as-a-web-service">datasetd, dataset as a web
service</h2>
<p><a href="doc/datasetd.html">datasetd</a> is a RESTful web service
implementation of the <em>dataset</em> command line program. It features
a sub-set of capability found in the command line tool. This allows
dataset collections to be integrated safely into web applications or
used concurrently by multiple processes. It achieves this by storing the
dataset collection in a SQL database using JSON columns.</p>
<h2 id="design-choices">Design choices</h2>
<p><em>dataset</em> and <em>datasetd</em> are intended to be simple
tools for managing collections JSON object documents in a predictable
structured way.</p>
<p><em>dataset</em> is guided by the idea that you should be able to
work with JSON documents as easily as you can any plain text document on
the Unix command line. <em>dataset</em> is intended to be simple to use
with minimal setup (e.g. <code>dataset init mycollection.ds</code>
creates a new collection called ‘mycollection.ds’).</p>
<ul>
<li><em>dataset</em> and <em>datasetd</em> store JSON object documents
in collections.
<ul>
<li>Storage of the JSON documents may be either in a pairtree on disk or
in a SQL database using JSON columns (e.g. SQLite3 or MySQL 8)</li>
<li>dataset collections are made up of a directory containing a
collection.json and codemeta.json files.</li>
<li>collection.json metadata file describing the collection,
e.g. storage type, name, description, if versioning is enabled</li>
<li>codemeta.json is a <a href="https://codemeta.github.io">codemeta</a>
file describing the nature of the collection, e.g. authors, description,
funding</li>
<li>collection objects are accessed by their key, a unique identifier
made of lower case alpha numeric characters</li>
<li>collection names are usually lowered case and usually have a
<code>.ds</code> extension for easy identification</li>
</ul></li>
</ul>
<p><em>datatset</em> collection storage options - <a
href="https://datatracker.ietf.org/doc/html/draft-kunze-pairtree-01">pairtree</a>
is the default disk organization of a dataset collection - the pairtree
path is always lowercase - non-JSON attachments can be associated with a
JSON document and found in a directories organized by semver (semantic
version number) - versioned JSON documents are created along side the
current JSON document but are named using both their key and semver -
SQL store stores JSON documents in a JSON column - SQLite3 and MySQL 8
are the current SQL databases support - A “DSN URI” is used to identify
and gain access to the SQL database - The DSN URI maybe passed through
the environment</p>
<p><em>datasetd</em> is a web service - is intended as a back end web
service run on localhost - by default it runs on localhost port 8485 -
supports collections that use the SQL storage engine - <strong>should
never be used as a public facing web service</strong> - there are no
user level access mechanisms - anyone with access to the web service end
point has access to the dataset collection content</p>
<p>The choice of plain UTF-8 is intended to help future proof reading
dataset collections. Care has been taken to keep <em>dataset</em> simple
enough and light weight enough that it will run on a machine as small as
a Raspberry Pi Zero while being equally comfortable on a more resource
rich server or desktop environment. <em>dataset</em> can be re-implement
in any programming language supporting file input and output, common
string operations and along with JSON encoding and decoding functions.
The current implementation is in the Go language.</p>
<h2 id="features">Features</h2>
<p><a href="docs/dataset.html">dataset</a> supports - Initialize a new
dataset collection - Define metadata about the collection using a
codemeta.json file - Define a keys file holding a list of allocated keys
in the collection - Creates a pairtree for object storage - Codemeta
file support for describing the collection contents - Simple JSON object
versioning - Listing <a href="docs/keys.html">Keys</a> in a collection -
Object level actions - <a href="docs/create.html">create</a> - <a
href="docs/read.html">read</a> - <a href="docs/update.html">update</a> -
<a href="docs/delete.html">delete</a> - <a
href="docs/keys.html">keys</a> - <a href="docs/has-key.html">has-key</a>
- <a href="docs/sample.html">sample</a> - <a
href="docs/clone.html">clone</a> - <a
href="docs/clone-sample.html">clone-sample</a> - Documents as
attachments - <a href="docs/attachments.html">attachments</a> (list) -
<a href="docs/attach.html">attach</a> (create/update) - <a
href="docs/retrieve.html">retrieve</a> (read) - <a
href="docs/prune.html">prune</a> (delete) - The ability to create data
<a href="docs/frame.html">frames</a> from while collections or based on
keys lists - frames are defined using a list of keys and a lost <a
href="docs/dotpath.html">dot paths</a> describing what is to be pulled
out of a stored JSON objects and into the frame - frame level actions -
frames, list the frame names in the collection - frame, define a frame,
does not overwrite an existing frame with the same name - frame-def,
show the frame definition (in case we need it for some reason) -
frame-keys, return a list of keys in the frame - frame-objects, return a
list of objects in the frame - refresh, using the current frame
definition reload all the objects in the frame given a key list -
reframe, replace the frame definition then reload the objects in the
frame using the existing key list - has-frame, check to see if a frame
exists - delete-frame remove the frame</p>
<p><a href="docs/datasetd.html">datasetd</a> supports</p>
<ul>
<li>List <a href="docs/collections-endpoint.html">collections</a>
available from the web service</li>
<li>List a <a href="collection-endpoint.html">collection</a>’s
metadata</li>
<li>List a collection’s <a href="docs/keys-endpoint.html">Keys</a></li>
<li>Object level actions
<ul>
<li><a href="docs/create-endpoint.html">create</a></li>
<li><a href="docs/read-endpoint.html">read</a></li>
<li><a href="docs/update-endpoint.html">update</a></li>
<li><a href="docs/delete-endpoint.html">delete</a></li>
<li>Documents as attachments
<ul>
<li><a href="docs/attach-endpoint.html">attach</a></li>
<li><a href="docs/retrieve-endpoint.html">retrieve</a></li>
<li><a href="docs/prune-endpoint.html">prune</a></li>
</ul></li>
</ul></li>
<li>The ability to create data <a href="docs/frame.html">frames</a> from
collections or based on keys lists and dot paths to form a new object
<ul>
<li><a href="docs/dotpath.html">dot paths</a> describing what is to be
pulled out of a stored JSON objects</li>
</ul></li>
</ul>
<p>Both <em>dataset</em> and <em>datasetd</em> maybe useful for general
data science applications needing JSON object management or in
implementing repository systems in research libraries and archives.</p>
<h2 id="limitations-of-dataset-and-datasetd">Limitations of
<em>dataset</em> and <em>datasetd</em></h2>
<p><em>dataset</em> has many limitations, some are listed below</p>
<ul>
<li>the pairtree implementation it is not a multi-process, multi-user
data store</li>
<li>it is not a general purpose database system</li>
<li>it stores all keys in lower case in order to deal with file systems
that are not case sensitive, compatibility needed by a pairtree</li>
<li>it stores collection names as lower case to deal with file systems
that are not case sensitive</li>
<li>it does not have a built-in query language, search or sorting</li>
<li>it should NOT be used for sensitive or secret information</li>
</ul>
<p><em>datasetd</em> is a simple web service intended to run on
“localhost:8485”.</p>
<ul>
<li>it does not include support for authentication</li>
<li>it does not support a query language, search or sorting</li>
<li>it does not support access control by users or roles</li>
<li>it does not provide auto key generation</li>
<li>it limits the size of JSON documents stored to the size supported by
with host SQL JSON columns</li>
<li>it limits the size of attached files to less than 250 MiB</li>
<li>it does not support partial JSON record updates or retrieval</li>
<li>it does not provide an interactive Web UI for working with dataset
collections</li>
<li>it does not support HTTPS or “at rest” encryption</li>
<li>it should NOT be used for sensitive or secret information</li>
</ul>
<h2 id="read-next">Read next …</h2>
<ul>
<li>About the <a href="docs/dataset.html">dataset</a> command</li>
<li>About <a href="docs/datasetd.html">datasetd</a> web service</li>
<li><a href="INSTALL.html">Installation</a></li>
<li><a href="LICENSE">License</a></li>
<li><a href="CONTRIBUTING.html">Contributing</a></li>
<li><a href="CODE_OF_CONDUCT.html">Code of conduct</a></li>
<li>Explore <em>dataset</em> and <em>datasetd</em>
<ul>
<li><a href="how-to/getting-started-with-dataset.html"
title="Python examples as well as command line">Getting Started with
Dataset</a></li>
<li><a href="how-to/">How To</a> guides</li>
<li><a href="docs/">Reference Documentation</a>.</li>
<li><a href="docs/topics.html">Topics</a></li>
</ul></li>
</ul>
<h2 id="authors-and-history">Authors and history</h2>
<ul>
<li>R. S. Doiel</li>
<li>Tommy Morrell</li>
</ul>
<h2 id="releases">Releases</h2>
<p>Compiled versions are provided for Linux (x86), Mac OS X (x86 and
M1), Windows 11 (x86) and Raspberry Pi OS (ARM7).</p>
<p><a
href="https://github.com/caltechlibrary/dataset/releases">github.com/caltechlibrary/dataset/releases</a></p>
<h2 id="related-projects">Related projects</h2>
<p>You can use <em>dataset</em> from Python via the <a
href="https://github.com/caltechlibrary/py_dataset">py_dataset</a>
package. You can use <em>dataset</em> from Deno+TypeScript by running
datasetd and access it with <a
href="https://github.com/caltechlibraray/ts_dataset">ts_dataset</a>.</p>
</section>
<footer>
<span>© 2022 <a href="https://www.library.caltech.edu/copyright">Caltech Library</a></span>
<address>1200 E California Blvd, Mail Code 1-32, Pasadena, CA 91125-3200</address>
<span><a href="mailto:[email protected]">Email Us</a></span>
<span>Phone: <a href="tel:+1-626-395-3405">(626)395-3405</a></span>
</footer>
</body>
</html>