---
title: "1) Organization"
link-external-newwindow: true
---
## General notes
Research data is valuable for researchers and forms the basis for their research. Therefore, it is advisable to structure the data well to save time and effort in the daily handling of research data. In this part of the workshop, we will look closer at organizational aspects of data management, mainly the folder structure, file and folder naming, and file formats.
A clear and consistent folder structure and folder and file naming convention are important for making your data [**f**indable](https://www.go-fair.org/fair-principles/) and [**i**nteroperable](https://www.go-fair.org/fair-principles/). You should think about it beforehand in order to avoid inconsistencies or the need to rename large amounts of data.
Your structure and your naming conventions should be intuitive. However, we recommend explicitly describing them (typically in a [README file](documentation.qmd#readme-file)) because they may not be that intuitive for others or your future self ("why did I do it like this?").
In the following sections, you'll find some input on the organizational aspects you should consider. Note that not all of them may apply to each dataset. Besides the tasks, we'll provide some general hints and rules. Some rules only apply to some use cases, and sometimes, there are good arguments for not sticking to every rule. However, in such cases you should know (and potentially document) why you decide differently.
## Folder structure
At the start of your research project, you have to decide how to arrange your files and folders. This decision depends on the structure of your data and documentation.
Organizational choices may involve trade-offs, such as the number of files per folder versus folder depth, intuitive names versus strict naming conventions, and structuring by processing level, access permissions, file size, or other criteria.
::: {.callout-tip icon="true" title="Task 1.1: (\~ 5 minutes)"}
Look at the folder structure (but not at the details of the folder or file names yet, which will be the next task).
- Is the folder structure intuitive and logical (what is done, how, and why)?
- Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
- How many files are stored per folder, and how deeply are they nested?
- Discuss: What would you leave as it is, what would you change, or what are the alternatives?
In case there are no folders, you may discuss whether it would make sense to add folders.
:::
::: {.callout-warning icon="true" title="Avoid too long file paths"}
Depending on the operating system, the total path length has an upper limit, e.g. 255 characters. Exceeding this limit will cause errors. Also note that the path of the copy may be even longer than your original path if you synchronize or back up your data, which can cause your sync or backup job to fail.
Therefore, try to keep your full path clearly below such upper limits.
- Bad example: `X:/Projects/Microscopy_Project/Microscopy_Projects_2024/October_2024/RawData_October2024/Microscopy_RawData_Image003.tif`
- Better: `X:/Projects/Microscopy/2024-10/RawData/Image003.tif`
:::
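A minimal Python sketch for spotting overly long paths before a sync or backup job trips over them. The function name `find_long_paths` and the default limit of 200 characters are illustrative assumptions; pick a limit that stays clearly below the limits of the systems you use.

```python
import os

def find_long_paths(root, limit=200):
    """Return all file and folder paths below `root` that exceed `limit` characters."""
    too_long = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full_path = os.path.join(dirpath, name)
            if len(full_path) > limit:
                too_long.append(full_path)
    return sorted(too_long)

# Usage (hypothetical folder name):
# for path in find_long_paths("Projects", limit=200):
#     print(len(path), path)
```

Note that the length is measured for the path as passed in. If the data will be copied to a location with a longer prefix, lower the limit accordingly.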
::: {.callout-note icon="true" title="Further hints on folder structure"}
* Avoid deeply nested folder structures: SubSubSubSubSubFolders can be pretty inconvenient.
* Avoid too many files or subfolders within one folder:\
It can be quite inconvenient to look through dozens of heterogeneous file names. In case of clearly structured file names (e.g. numbered files like `Image003.tif` or `Plot01_Part03.tab`), a larger number of elements per folder can also be fine. However, for huge amounts of files (several thousand), the performance of the file explorer may decrease.
* In case different project members should have different access restrictions to files, this could also be considered in your folder structure.
:::
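To get a quick impression of nesting depth and folder size, a short script can walk the folder tree. This is only a sketch; the function name `folder_stats` is a hypothetical helper and not part of any tool mentioned in this workshop.

```python
import os

def folder_stats(root):
    """Return (max_depth, max_entries) for the tree below `root`.

    max_depth:   deepest folder level below `root` (0 means no subfolders)
    max_entries: largest number of files plus subfolders in a single folder
    """
    root = os.path.abspath(root)
    max_depth = 0
    max_entries = 0
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else len(rel.split(os.sep))
        max_depth = max(max_depth, depth)
        max_entries = max(max_entries, len(dirnames) + len(filenames))
    return max_depth, max_entries

# Usage (hypothetical folder name):
# depth, entries = folder_stats("DatasetA")
```

Which values are acceptable depends on the dataset; the numbers are only a starting point for the discussion in Task 1.1.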
::: {.callout-note collapse="true" icon="false" title="Examples"}
Example for structuring a dataset: organized by file type[^1]
```
+ DatasetA
    + Data
        + Processed
        + Raw
    + Results
        + Figure1.tif
        + Figure2.tif
```
Example for structuring a dataset: organized by analysis[^1]
```
+ DatasetB
    + Figure1
        + RawData
        + Results
            + Figure1.tif
    + Figure2
        + RawData
        + Results
            + Figure2.tif
```
[^1]: adapted from [https://datadryad.org/stash/best_practices#organize](https://datadryad.org/stash/best_practices#organize)
Example for a project folder structure[^2]:
```
+ Project_Folder
+ 1_Project_Management
+ Finance
+ Proposals
+ Reports
+ 2_Ethics_and_Governance
+ Consent_Forms
+ Ethical_Approvals
+ 3_Dissemination
+ Presentations
+ Publications
+ Publicity
+ Experiment_01
+ Data
+ Data_Analysis
+ Inputs
+ Outputs
```
[^2]: adapted from Suse Prejawa (2021, [https://hdl.handle.net/21.11116/0000-0008-662A-7](https://hdl.handle.net/21.11116/0000-0008-662A-7))
Example for a project folder structure[^3]:
```
\\file.mpic.de\projects\ExampleProject\
+ GeneralOverview # General documentation of the project
+ Meetings # Meeting notes, presentations
+ INST # Instruments
+ Instrument1 # One folder per instrument
+ Doc # Documentation for this instrument
+ L_0 # Raw data
+ L_2 # Processed/analyzed data on original resolution
+ Product1 # One folder per data product
+ Code # Code used for creating the data of this product
+ Data # Data files of this product
+ Doc # Documentation for this data product
+ L_3 # Gridded data products
+ Product1 # One folder per data product (e.g. hourly averages)
+ Code # Code used for creating the data of this product
+ Data # Data files of this product
+ Doc # Documentation for this data product
+ Labbook # Labbook (photos of paper logbook or exports from ELN)
```
[^3]: adapted from a template used at the Max Planck Institute for Chemistry for measurement projects/campaigns (e.g. with a research aircraft)
:::
::: {.callout-important collapse="true" title="Solution: Example 1"}
> - Is the folder structure intuitive and logical (what is done, how, and why)?
> - Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
> - How many files are stored per folder, and how deeply are they nested?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
>
> In case there are no folders, you may discuss whether it would make sense to add folders.
The dataset has 42 files, but no folder structure. Folders are not needed here, because all files (except for the README file) are of the same type, just for different months. However, one could make one subfolder per year.
:::
::: {.callout-important collapse="true" title="Solution: Example 2"}
> - Is the folder structure intuitive and logical (what is done, how, and why)?
> - Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
> - How many files are stored per folder, and how deeply are they nested?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
>
> In case there are no folders, you may discuss whether it would make sense to add folders.
The dataset contains 6 files, without folder structure. However, 2 of them are of type 'tar.gz', which contain compressed ASCII files. The content is described in the README file.
The tar.gz files do not contain many files either, so no further folder structure is needed.
:::
::: {.callout-important collapse="true" title="Solution: Example 3"}
> - Is the folder structure intuitive and logical (what is done, how, and why)?
> - Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
> - How many files are stored per folder, and how deeply are they nested?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
The following notes relate to the content of `OSF Storage`.
- Yes, files are grouped into data, code (scripts) etc.
- The content is described in the README file, but not completely.
- There are up to 8 files per folder, one folder level.
:::
::: {.callout-important collapse="true" title="Solution: Example 4"}
> - Is the folder structure intuitive and logical (what is done, how, and why)?
> - Is it explicitly described? Where can you find this information (metadata of repository or in a README file)?
> - How many files are stored per folder, and how deeply are they nested?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
On dataset-level, there is only one zip file, no folders. However, within the zip file, there is a folder structure:
- Yes, intuitive structure: separation between data tables and scripts, ...
- Explicitly described in README file.
- Around 3 to 4 folder levels. Folder `Data/Group` has 40 subfolders with 9 files each.
:::
## File and folder names
In the next section, we will explore best practices for file and folder naming to create a clear and organized data structure. File or folder names have the following primary purposes:
* Always: Uniquely identify the file or folder (within a folder),
* Often: Give information about its content, e.g. `README.txt`, `MeetingProtocol.docx`, `Temperature_RawData.tab`,
* Sometimes: Enable logical order when sorting alphabetically, e.g. `1_RawData`, `2_PreProcessed`, `3_Processed`, `4_Combined`.
Generally, the same rules apply to the naming of folders and files. A name should make it easy to pick out the desired file among all the other files in the folder. Therefore, names should be concise and, where possible, intuitive. For instance, a file named `XYZ123` might not be immediately clear, so it's important to explain its purpose somewhere, typically in a [README file](documentation.qmd#readme-file). Well-structured folders have clear naming conventions, which are explicitly described.
::: {.callout-tip icon="true" title="Task 1.2: (\~ 10 minutes)"}
- What naming convention is used in this dataset? Is it intuitive and logical? Is it explicitly described?
- Are the names meaningful? Are there misleading names?
- In case of multiple files: Do they appear in a logical order when sorted alphabetically?
- Are there problematic characters like spaces, non-ASCII characters, etc.?
- What about the length of the names?
- Discuss: What would you leave as it is, what would you change, or what are the alternatives?
:::
::: {.callout-warning icon="true" title="Do not use bad characters."}
Depending on the operating system and application, some characters are forbidden or may lead to problems and, thus, should be avoided.
- Very bad: Any non-ASCII character, e.g., `öäüßµαδ°±•€→☺É`
- Bad: Any whitespace character, e.g. `File 1.txt`. They can cause problems, e.g., in some batch tasks, in particular, if one forgets to surround the name with quotes. Furthermore, double or multiple spaces and spaces at the beginning of the name are not clearly visible.
- Forbidden in Windows: `\/:*?"<>|`
- Also not recommended: `,;()[]{}` etc.
To summarize: You should only use Latin letters A-Z, a-z, digits 0-9, underscore, hyphen and dot, i.e. the following characters: `ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-.`
Furthermore, the dot should only be used in file names, and there only once before the file extension, e.g. "Notes.txt". Some programs use a dot or underscore as the first character for special file types, e.g. `_quarto.yml` or `.git`; thus, a leading dot or underscore should be avoided for regular data files.
:::
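The rules above can be checked automatically. The following Python sketch is one possible, deliberately strict reading of them: the pattern `SAFE_NAME` and the helper `is_safe_name` are illustrative assumptions, not an established standard.

```python
import re

# Letters, digits, underscore and hyphen, optionally followed by ONE dot and an extension.
# The first character must be a letter or digit (so no leading dot or underscore).
SAFE_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*(\.[A-Za-z0-9]+)?")

def is_safe_name(name):
    """Return True if a single file or folder name follows the rules above."""
    return SAFE_NAME.fullmatch(name) is not None

for name in ["Notes.txt", "File 1.txt", "_quarto.yml"]:
    print(name, is_safe_name(name))
```

Because the dot is allowed only once, a name such as `archive.tar.gz` is rejected as well. Whether that is desired depends on your own convention, which is one more reason to document it.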
::: {.callout-warning icon="true" title="No 'hello.txt' and 'Hello.txt' in same folder."}
Ensure that subfolders and files have unique names within a folder, even in case-insensitive ways. For example, do not put two files named `hello.txt` and `Hello.txt` in the same folder.
This note is particularly relevant for Linux users, because Linux file systems usually allow both files in the same folder. Windows, however, treats the two names as identical by default and does not allow that. Thus, sharing such a folder between users of different operating systems would cause problems.
:::
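A short Python sketch for finding such name clashes in a list of names. The helper `case_collisions` is a hypothetical name chosen for this example.

```python
from collections import defaultdict

def case_collisions(names):
    """Group names that differ only in upper/lower case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

print(case_collisions(["hello.txt", "Hello.txt", "README.md"]))
# [['Hello.txt', 'hello.txt']]
```

To check a real folder, pass the result of `os.listdir(folder)` to the function. This is mainly useful on systems where such clashes can exist at all, e.g. on Linux.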
::: {.callout-note icon="true" title="Excursion: Ordering and timestamps"}
A naming convention can enable a logical order of the file or folder names when sorting them alphabetically. Here, we provide some tips:
- When names include numbers, leading zeros are often helpful:
    - Ordering with "0":\
      `Scan01.csv`, `Scan02.csv`, `Scan03.csv`, `Scan04.csv`, `Scan05.csv`, `Scan06.csv`,\
      `Scan07.csv`, `Scan08.csv`, `Scan09.csv`, `Scan10.csv`, `Scan11.csv`, `Scan12.csv`
    - Ordering without:\
      `Scan1.csv`, `Scan10.csv`, `Scan11.csv`, `Scan12.csv`, `Scan2.csv`, `Scan3.csv`,\
      `Scan4.csv`, `Scan5.csv`, `Scan6.csv`, `Scan7.csv`, `Scan8.csv`, `Scan9.csv`
- Timestamps should always be given with leading zeros and 'from big to small', i.e. year, month, day of month, hour, minute, second. This recommendation complies with the international format [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html) (e.g. "2024-07-31", "20240731T2313").
    - Very bad: `13Jan2024`, `21April2021`, `3Dec2025`
    - Also bad: `03122025`, `13012024`, `21042021`
    - Good: `2021-04-21`, `2024-01-13`, `2025-12-03`
    - Also ok: `20210421`, `20240113`, `20251203`
    - Including time of day: `20210421T0345`, `20240113T1730`, `20251203T1900` for 03:45, 17:30, 19:00
:::
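Both tips can be tried out in a few lines of Python. This is a sketch only; `timestamp_for_filename` is a hypothetical helper name.

```python
from datetime import datetime

# Plain alphabetical sorting: leading zeros keep numbered files in numeric order.
unpadded = sorted(f"Scan{i}.csv" for i in range(1, 13))
padded = sorted(f"Scan{i:02d}.csv" for i in range(1, 13))
print(unpadded[:3])  # ['Scan1.csv', 'Scan10.csv', 'Scan11.csv']
print(padded[:3])    # ['Scan01.csv', 'Scan02.csv', 'Scan03.csv']

def timestamp_for_filename(moment):
    """Format a datetime 'from big to small', without characters that are problematic in names."""
    return moment.strftime("%Y%m%dT%H%M")

print(timestamp_for_filename(datetime(2024, 1, 13, 17, 30)))  # 20240113T1730
```

The compact form without colons is used here because the colon is among the characters that are forbidden in Windows file names.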
::: {.callout-note icon="true" title="Further good practice for file naming"}
- Include relevant information in the file name. However, don't misuse a file name as a way to store all your metadata.
- Avoid overly long names (a maximum of 32 characters is suggested). Mind also the previous note about the full path length.
- Avoid moving or renaming folders or files. This is especially relevant when you or others have referred to the file by using its file name or path.
- Generate a [README file](documentation.qmd#readme-file) explaining file nomenclature (including the meaning of acronyms or abbreviations), file organization and versioning. Store this file on top of the folder structure for easy accessibility.
There are different possibilities to indicate logical units in a name without using a whitespace:
- Kebab-case: `The-quick-brown-fox-jumps-over-the-lazy-dog.txt`
- CamelCase: `TheQuickBrownFoxJumpsOverTheLazyDog.txt`
- Snake_case: `The_quick_brown_fox_jumps_over_the_lazy_dog.txt`
Compromises often have to be made, such as including relevant information versus avoiding long names. Note that folder names with a precise and narrow meaning may become outdated when further content is filled in over time.
Because of that, [persistent identifiers (PID)](publication.qmd#persistent-identifier) typically avoid including semantic information, e.g. `doi:10.17617/3.1STIJV`.
:::
::: {.callout-note icon="true" title="Excursion: Versioning"}
Documents may evolve over time. File versioning lets you revert to earlier versions if needed and keep track of changes, including documentation of the underlying rationale and the people involved.
Version control can be done either manually by using naming conventions or by using a version control system like [Git](https://git-scm.com/). The following hints apply to manual version control, meaning that you store both the current and previous versions in your file system.
- Versions should be numbered consecutively, e.g. `Handbook_v3.pdf`. Major changes (v1, v2, v3, ...) can be distinguished from minor ones (v1-1, v1-2, v1-3 or 1a, 1b, 1c). You may use leading zeros if you expect more than nine versions.
- Alternatively, a date or timestamp could indicate the version, e.g. `Handbook_v20240725.pdf`.
- You may use qualifiers such as "raw" or "processed" for data or "draft" or "internal" for documents. However, note that terms such as "final", "final2", "final-revised", "final-changed_again", and "final_ready" can be confusing. In other words: Avoid the word "final" in file names.
- Document your versioning convention, e.g. what you mean by major or minor changes.
- Document the essential changes you have made between the versions.
For further reading: GitHub recommends version names like '1.3.2' for the releases of software products; for details, see [Semantic Versioning 2.0.0](https://semver.org/).
:::
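For manual versioning with consecutive numbers, a small helper can derive the next file name so that the convention is applied consistently. The sketch below assumes names of the form `<name>_v<number>.<extension>`, as in `Handbook_v3.pdf`; the pattern and the function `next_major_version` are illustrative assumptions and do not cover minor versions or date-based versions.

```python
import re

# Assumed convention: <stem>_v<digits>.<extension>, e.g. Handbook_v3.pdf
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<major>\d+)(?P<ext>\.[A-Za-z0-9]+)$")

def next_major_version(filename):
    """Return the file name with its version number increased by one,
    keeping the width of any leading zeros."""
    match = VERSION_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"No version suffix like '_v3' found in {filename!r}")
    major = match.group("major")
    bumped = str(int(major) + 1).zfill(len(major))
    return f"{match.group('stem')}_v{bumped}{match.group('ext')}"

print(next_major_version("Handbook_v3.pdf"))   # Handbook_v4.pdf
print(next_major_version("Handbook_v09.pdf"))  # Handbook_v10.pdf
```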
::: {.callout-important collapse="true" title="Solution: Example 1"}
> - What is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?
> - Are the names meaningful? Are there misleading names?
> - In case of multiple files: Do they appear in a logical order when sorted alphabetically?
> - Are there problematic characters like spaces, non-ASCII characters, etc.?
> - What about the length of the names?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
- Files consist of prefix `amb_hourly_qc_wc4.4_cal6.0_`, followed by year and month (e.g. `2014_08`), followed by `_core-params.csv`. Prefix might be intuitive for researchers of that field, but it is not explicitly described.
- Probably the prefix has some meaning, but as it is not explicitly stated in the README, we can only speculate.
- Yes, files are sorted according to the month of measurement.
- Yes: File name `AMB hourly, readme.rtf` contains spaces and a comma. The other file names contain several dots (should only be one dot, namely before the file extension `csv`).
- Length is not problematic, but longer than needed.
- Replace `AMB hourly, readme.rtf` by `README.rtf`. The other file names can be shortened, e.g. `HourlyCoreParams_2014_08.csv`. And if the file name prefix `amb_hourly_qc_wc4.4_cal6.0_` contains relevant information, this should be explicitly given in the metadata or README file.
:::
::: {.callout-important collapse="true" title="Solution: Example 2"}
> - What is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?
> - Are the names meaningful? Are there misleading names?
> - In case of multiple files: Do they appear in a logical order when sorted alphabetically?
> - Are there problematic characters like spaces, non-ASCII characters, etc.?
> - What about the length of the names?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
- The dataset contains only 6 files, so there is no real naming convention, and none is needed. The non-intuitive parts `wos` and `fo` are explained in the README file (namely "Web of Science data" and "Faculty Opinions data"). The files inside the tar.gz files seem to follow some convention, and their content is explicitly mentioned in the README file.
- Probably yes. Anyhow, their content is mentioned in the README file.
- Only a few files, thus the order is not important.
- No problematic characters found.
- Length of the names: OK
:::
::: {.callout-important collapse="true" title="Solution: Example 3"}
> - What is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?
> - Are the names meaningful? Are there misleading names?
> - In case of multiple files: Do they appear in a logical order when sorted alphabetically?
> - Are there problematic characters like spaces, non-ASCII characters, etc.?
> - What about the length of the names?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
The following notes relate to the content of `OSF Storage`.
- Not so clear, but not many files, thus no clear conventions needed. However, the files in folder `result` are lacking an explanation in the README file, and their names are not very intuitive.
- Meaning not clear for all files.
- Only a few files, thus the order is not important.
- No problematic characters found in `OSF Storage`.
- Length of the names: OK
:::
::: {.callout-important collapse="true" title="Solution: Example 4"}
> - What is the convention of the naming used in this dataset? Is it intuitive and logical? Is it explicitly described?
> - Are the names meaningful? Are there misleading names?
> - In case of multiple files: Do they appear in a logical order when sorted alphabetically?
> - Are there problematic characters like spaces, non-ASCII characters, etc.?
> - What about the length of the names?
> - Discuss: What would you leave as it is, what would you change, or what are the alternatives?
- The subfolders of `Data/Group` and `Data/Solo` have names like `01_09_2022__10_13_33`, which seem to refer to a date and maybe a time of day.
- Yes, names are meaningful and intuitive.
- Subfolders are not in a chronological order, because the date is given in a disadvantageous format (e.g. `01_09_2022`) - better would be `2022_09_01` or `2022-09-01`.
- Yes: Folder name `Stan model code` contains spaces.
- Length of the names: OK
- If folder name `01_09_2022__10_13_33` stands for timestamp 2022-09-01T10:13:33, then it could be renamed to `20220901T101333` or `2022-09-01_101333`.
:::
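The renaming suggested for Example 4 can be sketched in Python. The sketch assumes, as the solution does, that `01_09_2022__10_13_33` means day, month, year, then hour, minute, second; the function name `to_sortable_timestamp` is hypothetical.

```python
from datetime import datetime

def to_sortable_timestamp(name):
    """Convert a name like '01_09_2022__10_13_33'
    (assumed: day_month_year__hour_minute_second) to '20220901T101333'."""
    moment = datetime.strptime(name, "%d_%m_%Y__%H_%M_%S")
    return moment.strftime("%Y%m%dT%H%M%S")

print(to_sortable_timestamp("01_09_2022__10_13_33"))  # 20220901T101333
```

Names in the new form sort chronologically when sorted alphabetically. Before renaming published data, keep in mind the earlier advice to avoid renaming files or folders that others may already refer to.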
## File formats
A file format has to be chosen when storing information in a file. It forms the backbone of your data and is usually indicated by the file extension (e.g. .txt). To keep your data [interoperable](https://www.go-fair.org/fair-principles/), the format needs a clear structure. This makes your data easy to read with many software products (e.g., out-of-the-box solutions or by writing a small script). Clear documentation of the file format should be publicly available. If all these conditions are met, the chance is high that the file can still be read in the future, making it suitable for long-term preservation, which is one of our main goals when managing data. Therefore, open file formats are recommended, while proprietary formats should be avoided.
Ideally, when choosing a suitable format, you'll consider the following properties:
- Readable by humans with a simple editor
- Readable with many programs
- Easy to understand, low complexity
- Small (storage space)
- Quick to read (performance)
However, usually compromises have to be made. For example, binary files are generally more performant than csv files and thus more suitable during the active research process. At the same time, csv is a well-established format for long-term preservation and is easier for humans to read.
::: {.callout-tip icon="true" title="Task 1.3: (\~ 10 minutes)"}
- Are the files stored in an open or a proprietary format?
- Is the file format used "future-proof", e.g., suitable for long-term archiving?
- How easy is it to open the file (regarding available programs and file size)?
- How complex are the files? What is their internal structure?
- What about performance and file size?
- How easy is it to understand the file as humans?
- Are they machine-readable and standardized? How easy is it to write a script to read the files?
- Which alternative formats exist?
:::
::: {.callout-warning icon="true" title="Avoid proprietary formats."}
Often, proprietary formats have intentionally no proper documentation as the company behind the system wants to keep their business information behind closed doors. The companies sometimes even use technical protection mechanisms, making the file format readable only by commercial software. This reduces the interoperability and reusability of the files and, in the worst case, makes them unreadable in the long term. (Imagine the company that provided the software and file format no longer exists.) Furthermore, the files might contain hidden (potentially sensitive) information. Thus, such formats should be avoided.
:::
::: {.callout-note icon="true" title="Examples of recommended formats"}
In the following list, you'll find some formats which are widely used, well-documented and readable with several programs.
- For documentation:
    - Plain text (.txt)
    - HTML, XHTML, Markdown
    - PDF (PDF/A-1)
    - maybe: Rich Text Format (.rtf), Open Document Text (.odt), docx, ...
- Tabular data:
    - Comma-separated values (.csv)
    - Tab-delimited (.tab)
    - maybe: Open Document Spreadsheet (.ods), xlsx, ...
- Nested data:
    - JSON
    - XML
- Further formats:
    - NetCDF, HDF5, ...
    - png, jpg, ...
Notes:
- **PDF**: PDF was developed by Adobe Inc. and was originally a proprietary format; several versions exist. Nevertheless, the format is widely used today. For archival purposes, a PDF/A version is the best choice. PDF is best suited for fixed documentation. However, editing PDF files or extracting data from them takes a lot of work.
- **Spreadsheet files**: Spreadsheets may look nice, particularly when formatted in a colourful way. But this can cause problems for machine-readability. In particular, we do not recommend conveying relevant information solely through formatting. You can take this as a rule of thumb: Spreadsheet files like .xlsx or .ods are not reliably machine-readable.
:::
::: {.callout-note collapse="true" icon="true" title="Excursion: Premium format ASCII"}
A gold standard for storing digital information is an ASCII file. In an ASCII file, each byte represents one visible character (except for the white spaces and control characters like tab stop and linebreaks).
Therefore, ASCII files can be read or opened by any text editor or data-processing software, even with programs like Excel, Word, Wordpad or web browsers (only possibly limited regarding the file size).
**Characters beyond ASCII**:
An ASCII file can only contain the following visible characters: ``!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~``
Otherwise, it is not an ASCII file.
For some years, the Unicode-based file format "UTF-8" has been available, which can represent many characters beyond the ASCII characters, like "ü", "€", and even some smilies ☺. Nowadays, UTF-8 is supported by many editors and browsers. The good thing about UTF-8 is that as long as a UTF-8 file contains only ASCII characters, the UTF-8 file is automatically an ASCII file. In other words, an ASCII file is a super-interoperable UTF-8 file.
:::
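Whether a file is pure ASCII can be checked with a few lines of Python. This is a minimal sketch; `is_ascii_file` is a hypothetical helper that reads the whole file into memory, so it is not meant for very large files.

```python
def is_ascii_file(path):
    """Return True if every byte in the file is an ASCII character (value below 128)."""
    with open(path, "rb") as handle:
        return handle.read().isascii()

text = "Temperature;21.5"
# Pure ASCII text yields identical bytes in both encodings.
assert text.encode("ascii") == text.encode("utf-8")

# A non-ASCII character such as "ü" (written here as an escape) takes two bytes in UTF-8.
print(len("\u00fc".encode("utf-8")))  # 2
```

The equality check illustrates the statement above: as long as only ASCII characters occur, the UTF-8 and ASCII byte sequences are identical.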
::: {.callout-important collapse="true" title="Solution: Example 1"}
> - Are the files stored in an open or a proprietary format?
> - Is the file format used "future-proof", e.g., suitable for long-term archiving?
> - How easy is it to open the file (regarding available programs and file size)?
> - How complex are the files? What is their internal structure?
> - What about performance and file size?
> - How easy is it to understand the file as humans?
> - Are they machine-readable and standardized? How easy is it to write a script to read the files?
> - Which alternative formats exist?
- Files are ASCII files, thus open.
- Yes, ASCII is suitable for long-term archiving.
- Easy to open, e.g. with text editor.
- Files have tabular shape.
- OK, file sizes are below 1 MB.
- Shape: easy to understand, meaning of the columns given in README file.
- Most data analysis programs have import functions for csv. The quotes in the first column might be cumbersome for some import routines.
- Tab-separated files, spreadsheet files, etc.
:::
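Quoted fields, as mentioned for the first column, are handled by many CSV readers. A minimal sketch with Python's standard `csv` module follows; the sample content is made up for illustration and does not reproduce the actual files of Example 1.

```python
import csv
import io

# Hypothetical sample: quoted text fields, unquoted number.
sample = '"timestamp","temperature"\n"2014-08-01 00:00",21.5\n'

rows = list(csv.reader(io.StringIO(sample)))
print(rows[0])  # ['timestamp', 'temperature']
print(rows[1])  # ['2014-08-01 00:00', '21.5']
```

The reader strips the quotes and returns every field as a string, so numeric columns still need to be converted explicitly.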
::: {.callout-important collapse="true" title="Solution: Example 2"}
> - Are the files stored in an open or a proprietary format?
> - Is the file format used "future-proof", e.g., suitable for long-term archiving?
> - How easy is it to open the file (regarding available programs and file size)?
> - How complex are the files? What is their internal structure?
> - What about performance and file size?
> - How easy is it to understand the file as humans?
> - Are they machine-readable and standardized? How easy is it to write a script to read the files?
> - Which alternative formats exist?
- The small files are ASCII or UTF-8 files, thus open. The tar.gz files are compressed TAR-files, thus also in an open format.
- Yes, ASCII is definitely suitable for long-term archiving. Also tar.gz files are widely used and can thus be considered suitable for long-term archiving.
- The tar.gz files need specific software for extraction, which is freely available, but may not be installed everywhere, and not everyone is familiar with it. Thus it is commendable that the extraction is described in the README file. However, the file size of several GB can be problematic for users having a slow internet connection. And unpacked, the largest file is more than 26 GB, more than the RAM size of many computers.
- The data files (inside the tar.gz) are not complex, just tables.
- Due to compression, the file size is reduced for storage and download. However, the tables contain many digits, probably more than needed. Reducing them would decrease file size. Binary files instead of ASCII files would need less time for loading.
- Shape: easy to understand; for the meaning of the columns, see the README file.
- Most data analysis programs have import functions for csv.
- Binary files like [HDF](https://de.wikipedia.org/wiki/Hierarchical_Data_Format), which could enhance performance.
:::
::: {.callout-important collapse="true" title="Solution: Example 3"}
> - Are the files stored in an open or a proprietary format?
> - Is the file format used "future-proof", e.g., suitable for long-term archiving?
> - How easy is it to open the file (regarding available programs and file size)?
> - How complex are the files? What is their internal structure?
> - What about performance and file size?
> - How easy is it to understand the file as humans?
> - Are they machine-readable and standardized? How easy is it to write a script to read the files?
> - Which alternative formats exist?
The following notes relate to the content of "OSF Storage".
- Most files are in an open format: ASCII tables, JSON files, R scripts. But what are "nii.gz" files in folder "results" - maybe zipped [NIfTI](https://en.wikipedia.org/wiki/Neuroimaging_Informatics_Technology_Initiative) files?
- Yes for ASCII tables and JSON files; maybe yes for nii.gz files.
- ASCII tables and JSON files: easy to open with every text editor, special software or libraries needed for nii.gz.
- Files in folder `data` are tables (csv) or Codebooks (in JSON format) describing those.
- OK, because the files are not very large.
- ASCII tables and JSON files are easy to understand by humans; nii.gz needs suitable software.
- Most data analysis programs have import functions for csv; JSON import functions are also available for several programs.
- For csv tables: tab-separated files, spreadsheet files, etc.; for JSON: XML.
:::
::: {.callout-important collapse="true" title="Solution: Example 4"}
> - Are the files stored in an open or a proprietary format?
> - Is the file format used "future-proof", e.g., suitable for long-term archiving?
> - How easy is it to open the file (regarding available programs and file size)?
> - How complex are the files? What is their internal structure?
> - What about performance and file size?
> - How easy is it to understand the file as humans?
> - Are they machine-readable and standardized? How easy is it to write a script to read the files?
> - Which alternative formats exist?
- Files are stored as ASCII tables or plain text files, which are open formats.
- Yes, suitable for long-term archiving.
- Easy, readable with text editor.
- Data files are ASCII tables.
- Due to compression, the file size is reduced for storage and download. Binary files instead of ASCII files would need less time for loading.
- The format is easy to understand by humans, but the columns are not explicitly described.
- Most data analysis programs have import functions for semicolon-separated tables.
- Binary files like [HDF](https://de.wikipedia.org/wiki/Hierarchical_Data_Format) could be used (cf. the note above on performance).
:::
### Special file types: tabular text file (optional)
Please note that the task in this section is optional. You can go through this section if you still have some time left during the workshop or read it afterwards.
Tabular text files store data in a structured format, where each row represents a record and each column represents a field, with the fields separated by a designated column separator. Even after deciding to store tabular data in text files (i.e. files which can be opened in any editor), there are various conventions to choose from:
- Column separator: typically tab or comma, sometimes space or semicolon
- Numeric values: handling of missing values (e.g. "NA", "", etc.)
- Representation of timestamps, e.g. "2024-08-01T08:59"
- Header lines with meta information?
- Encoding: ASCII or UTF-8 is recommended
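
As a minimal sketch of how these choices can be made explicit when writing a file, the following Python snippet uses only the standard library. The records and the column names `Time` and `Temp` are hypothetical, and the chosen conventions (comma separator, `NA` for missing values, ISO 8601 timestamps in UTC) are one possible set, not the only valid one.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical records; None marks a missing measurement.
records = [
    (datetime(2024, 8, 1, 8, 59, tzinfo=timezone.utc), 21.4),
    (datetime(2024, 8, 1, 9, 59, tzinfo=timezone.utc), None),
]


def format_row(timestamp, value):
    """Apply the conventions: ISO 8601 time in UTC, 'NA' for missing values."""
    time_text = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    value_text = "NA" if value is None else f"{value:.1f}"
    return [time_text, value_text]


# io.StringIO stands in for a real file opened with
# open(path, "w", encoding="utf-8", newline="").
out = io.StringIO()
writer = csv.writer(out, delimiter=",", lineterminator="\n")
writer.writerow(["Time", "Temp"])
for timestamp, value in records:
    writer.writerow(format_row(timestamp, value))

print(out.getvalue())
```

Whatever conventions are chosen, they should also be documented, either in header lines or in a README file.
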
::: {.callout-tip icon="true" title="Task 1.4: (\~ 5 minutes)"}
- How is the file encoded (e.g. ASCII, UTF-8)?
- Numbers: What about their precision (enough or too much)?
- Special numbers: Do special numbers like "NA", "", "N/A", "999", "0" occur? Is their meaning documented?
- Time: Which format is used for the date and time of day? Which time zone is used?
- Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
- Is the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?
:::
::: {.callout-note collapse="true" icon="false" title="Example"}
First, you will find an example of a very bad file, followed by an improved version.
File `Measured last month.txt`:
```
date, time,sensor,sensor
03/07/24 12.00 AM,17.3
03/07/24 1.00 AM,16.9
03/07/24 2.00 AM,16.7
03/07/24 3.00 AM,16.4
03/07/24 4.00 AM,16.2
03/07/24 5.00 AM,15.9
03/07/24 6.00 AM
03/07/24 7.00 AM
03/07/24 8.00 AM
03/07/24 9.00 AM,16.5
03/07/24 10.00 AM,17.0,7.2
03/07/24 11.00 AM,17.6,4.6
03/07/24 12.00 PM,18.0
03/07/24 1.00 PM,18.5
```
We gathered some comments on that file:
- First, we notice that the file name is bad. It contains spaces, and "last month" is not a meaningful name (which month is the current one?).
- The file is not a proper csv file because it does not have a proper tabular shape:
    - The header line indicates that we have 4 columns. Looking at the data, one can assume that there is one comma too many, as the date and time of day are stored in one column.
    - Further, the data rows contain at most two commas, i.e. at most three columns, and most rows contain fewer. This does not match the header. Therefore, we can assume that some values are missing.
    - The header line contains the word "sensor" twice. Thus, the column names are not unique.
- The time column is horrible:
    - The date is given in an ambiguous format: you do not know whether it is 03 July 2024, 07 March 2024 or 24 July 2003, and the century is not given either (1924? 24 AD?). You should use the international format [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601), here "2024-03-07" for 7 March 2024.
    - Time is given on a 12-hour clock with "AM" and "PM". Hint: **Never use "AM" or "PM"** in a scientific context. Always use the 24-hour clock!
    - The hour is noted without a leading zero.
    - Between hour and minute, a dot is used. It would be better to use a colon, e.g. "00:00".
- Important information is missing (but might be given in a separate README file or Codebook), e.g.:
    - Time zone?
    - What is the file about?
    - Why are values missing?
    - Unit?
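
If you inherit such a file, the timestamps can be converted once the intended date format is known. The following is a minimal Python sketch; it assumes that the dates are month/day/two-digit year (an assumption that the file itself cannot confirm) and that the default C locale is active, so that `%p` matches "AM" and "PM".

```python
from datetime import datetime


def to_iso(raw):
    """Convert e.g. '03/07/24 1.00 PM' to '2024-03-07T13:00'.

    Assumption: the date is month/day/two-digit year.
    """
    parsed = datetime.strptime(raw, "%m/%d/%y %I.%M %p")
    return parsed.strftime("%Y-%m-%dT%H:%M")


for raw in ["03/07/24 12.00 AM", "03/07/24 9.00 AM", "03/07/24 12.00 PM", "03/07/24 1.00 PM"]:
    print(raw, "->", to_iso(raw))
```

Note that the conversion cannot recover the time zone; that information has to come from the person who recorded the data.
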
And here is a better version:
File `Temp_Rain_202403.csv`:
```
# Averaged temperature and precipitation of Ex_Emplum station
#
# File created on 2024-04-22 by Schlaubi Schlumpf.
# This file contains temperature and precipitation measured at the fictitious weather station 'Ex_Emplum' at 55.432 degrees North, 55.678 degrees East.
# Raw data have been averaged over 1 hour.
# NA indicates missing values due to measurement interruption or instrument malfunction.
# Column description:
# - Time: Start time of the 1-hour interval, given as UTC, in ISO 8601 format 'YYYY-MM-DDThh:mm'.
# - Temp: Temperature at 2 m above ground level, averaged over the 1-hour interval, in degrees Celsius. The error of the given value is expected to be below 0.3 degrees Celsius.
# - Rain: Precipitation height accumulated within the 1-hour interval, in mm. The error of the given value is expected to be below 0.5 mm.
#
Time,Temp,Rain
2024-03-07T00:00,17.3,0
2024-03-07T01:00,16.9,0
2024-03-07T02:00,16.7,0
2024-03-07T03:00,16.4,0
2024-03-07T04:00,16.2,0
2024-03-07T05:00,15.9,0
2024-03-07T06:00,NA,NA
2024-03-07T07:00,NA,NA
2024-03-07T08:00,NA,NA
2024-03-07T09:00,16.5,0
2024-03-07T10:00,17.0,7.2
2024-03-07T11:00,17.6,4.6
2024-03-07T12:00,18.0,0
2024-03-07T13:00,18.5,0
```
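
A file structured like this is also easy to read with a script. The following minimal Python sketch uses a shortened copy of the improved file and only the standard library; the helper names `parse_number` and `read_table` are made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Shortened copy of the improved file.
text = """\
# Averaged temperature and precipitation of Ex_Emplum station
# NA indicates missing values due to measurement interruption or instrument malfunction.
Time,Temp,Rain
2024-03-07T05:00,15.9,0
2024-03-07T06:00,NA,NA
2024-03-07T10:00,17.0,7.2
"""


def parse_number(cell):
    """Return a float, or None for the documented missing-value marker 'NA'."""
    return None if cell == "NA" else float(cell)


def read_table(lines):
    """Skip '#' header lines, then parse the remaining lines as CSV."""
    data_lines = (line for line in lines if not line.startswith("#"))
    rows = []
    for record in csv.DictReader(data_lines):
        # The header states that times are given as UTC, so attach that time zone.
        time = datetime.strptime(record["Time"], "%Y-%m-%dT%H:%M").replace(tzinfo=timezone.utc)
        rows.append({
            "Time": time,
            "Temp": parse_number(record["Temp"]),
            "Rain": parse_number(record["Rain"]),
        })
    return rows


rows = read_table(io.StringIO(text))
valid_temps = [row["Temp"] for row in rows if row["Temp"] is not None]
print(len(rows), "rows,", len(valid_temps), "with a temperature value")
```

Because the header documents the missing-value marker, the time zone and the timestamp format, the script does not need to guess any of them.
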
:::
::: {.callout-important collapse="true" title="Solution: Example 1"}
> - How is the file encoded (e.g. ASCII, UTF-8)?
> - Numbers: What about their precision (enough or too much)?
> - Special numbers: Do special numbers like "NA", "", "N/A", "999", "0" occur? Is their meaning documented?
> - Time: Which format is used for the date and time of day? Which time zone is used?
> - Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
> - Is the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?
- ASCII
- Has many digits, e.g. `986.223944276841`.
- No information about missing values found in README file. But file `amb_hourly_qc_wc4.4_cal6.0_2017_03_core-params.csv` contains `NA`.
- Time conforms to ISO 8601, except that a space is used between date and time of day, e.g. `2017-03-23 09:30:00`. The README file states: "All times given in GMT".
- Comma as column separator, whitespace only between date and time of day, no missing columns found.
- Not self-explaining but mentioned in README file, also the units.
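
A short Python sketch of how such a timestamp can be read and normalised: `datetime.fromisoformat` accepts the space between date and time of day. Treating "GMT" from the README file as UTC is an assumption made here.

```python
from datetime import datetime, timezone

raw = "2017-03-23 09:30:00"  # example value quoted above

# fromisoformat accepts a space as separator between date and time of day.
naive = datetime.fromisoformat(raw)
# Assumption: "GMT" in the README file is treated as UTC.
aware = naive.replace(tzinfo=timezone.utc)
print(aware.isoformat())
```
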
:::
::: {.callout-important collapse="true" title="Solution: Example 2"}
> - How is the file encoded (e.g. ASCII, UTF-8)?
> - Numbers: What about their precision (enough or too much)?
> - Special numbers: Do special numbers like "NA", "", "N/A", "999", "0" occur? Is their meaning documented?
> - Time: Which format is used for the date and time of day? Which time zone is used?
> - Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
> - Is the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?
- UTF-8, except for the tar.gz files. The files inside the tar.gz archives are pure ASCII files.
- Probably more digits than needed, e.g. `-13.333333333333336`. Considering the file size, shortening them could be worthwhile.
- No information about missing values found in README file. But `NA` found in several files.
- Time: There seems to be no time column.
- Tables: Comma as column separator, no missing columns or whitespaces found.
- Not self-explaining but columns mentioned in README file.
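
Shortening the digits could be sketched as follows in Python. The row content besides the quoted value and the choice of three decimals are hypothetical; the appropriate number of digits depends on the precision of the underlying data.

```python
def shorten(cell, decimals=3):
    """Round a numeric cell to a fixed number of decimals; leave other cells (e.g. 'NA') unchanged."""
    try:
        return f"{float(cell):.{decimals}f}"
    except ValueError:
        return cell


row = ["-13.333333333333336", "NA", "0.25"]  # hypothetical row
short_row = [shorten(cell) for cell in row]
print(",".join(row))
print(",".join(short_row))
```
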
:::
::: {.callout-important collapse="true" title="Solution: Example 3"}
> - How is the file encoded (e.g. ASCII, UTF-8)?
> - Numbers: What about their precision (enough or too much)?
> - Special numbers: Do special numbers like "NA", "", "N/A", "999", "0" occur? Is their meaning documented?
> - Time: Which format is used for the date and time of day? Which time zone is used?
> - Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
> - Is the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?
The following notes relate to the content of `OSF Storage`.
- Encoding: ASCII files (except for the nii.gz files)
- Numbers: e.g. `0.878519` - looks reasonable
- No information about missing values found in README file. But `NA` found in several files.
- Time: There seems to be no time column.
- Tables: Comma as column separator, no missing columns or whitespaces found.
- Content of the table is explained in JSON file (Codebook).
:::
::: {.callout-important collapse="true" title="Solution: Example 4"}
> - How is the file encoded (e.g. ASCII, UTF-8)?
> - Numbers: What about their precision (enough or too much)?
> - Special numbers: Do special numbers like "NA", "", "N/A", "999", "0" occur? Is their meaning documented?
> - Time: Which format is used for the date and time of day? Which time zone is used?
> - Tables: What do you generally notice (regarding the choice of separator, whitespaces, missing columns)?
> - Is the content of the table self-explaining? Is there a column description? What units were used? Is there detailed information elsewhere?
- Data tables are ASCII files.
- Numbers: e.g. `73.12958` - looks reasonable
- Special numbers or contents: Some columns contain parentheses - what is their meaning?
- Time: Time column with seconds(?) since start time?
- Tables: Semicolon as column separator, also semicolon after last column.
- The README file says "Variable names should be quite descriptive, but please get in touch in case anything is unclear", but not all columns are easy to understand.
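
The semicolon after the last column creates an empty extra field in every line, which an import script has to handle. A minimal Python sketch, using a hypothetical excerpt that only mimics this layout (the column names are made up):

```python
import csv
import io

# Hypothetical excerpt: semicolon-separated, with a semicolon after the last column.
text = "time;value;\n0;73.12958;\n1;72.5;\n"

reader = csv.reader(io.StringIO(text), delimiter=";")
header = next(reader)
# The trailing semicolon produces an empty last field; drop it from the header.
if header and header[-1] == "":
    header = header[:-1]
# zip stops at the end of the shortened header, so the empty field is ignored in each row.
rows = [dict(zip(header, fields)) for fields in reader]
print(rows)
```
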
:::
## References
Examples and notes have been adapted from: Onboarding into Research Data Management, Franke et al. 2024, [https://hdl.handle.net/21.11116/0000-000E-194D-1](https://hdl.handle.net/21.11116/0000-000E-194D-1), file "FDM-Onboarding-2024_CPT-Slides.pdf" pages 44-51, 56-59.