
Commit

Merge pull request #556 from JonathanGregory/issue141-2
Clarification that text-valued variables and attributes can be Unicode strings or UTF-8 `char` arrays
JonathanGregory authored Nov 23, 2024
2 parents 351c6e5 + 9e809ce commit 4268011
Showing 3 changed files with 16 additions and 7 deletions.
18 changes: 12 additions & 6 deletions ch02.adoc
@@ -12,18 +12,24 @@ NetCDF files should have the file name extension "**`.nc`**".

// TODO: Check, should this be a bullet list?
Data variables must be one of the following data types: **`string`**, **`char`**, **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, **`unsigned int64`**, **`float`** or **`real`**, and **`double`** (which are all the link:$$https://docs.unidata.ucar.edu/nug/current/md_types.html$$[netCDF external data types] supported by netCDF-4).
The **`string`** type is only available in files using the netCDF version 4 (netCDF-4) format.
The **`string`** type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.
The **`char`** and **`string`** types are not intended for numeric data.
One byte numeric data should be stored using the **`byte`** or **`unsigned byte`** data types.
It is possible to treat the **`byte`** and **`short`** types as unsigned by using the NUG convention of indicating the unsigned range using the **`valid_min`**, **`valid_max`**, or **`valid_range`** attributes.
In many situations, any integer type may be used.
When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.

Strings in variables may be represented one of two ways - as atomic strings or as character arrays.
An n-dimensional array of strings may be implemented as a variable of type **`string`** with _n_ dimensions, or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
A text string can be stored either in a variable-length **`string`** or in a fixed-length **`char`** array.
In both cases, text strings must be represented in Unicode Normalization Form C (NFC, link:$$https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf$$[section 3.11] and link:$$https://unicode.org/reports/tr15$$[Annex 15] of the Unicode standard) and encoded according to UTF-8.
A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their NFC UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).

Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used.
However, if the text string is stored in a **`char`** variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.

An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled.
A **`string`** variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.
The CDL example below shows one variable of each type.

[[char-and-string-variables-ex]]
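The fixed-length **`char`** layout described in this hunk can be sketched in Python. This is only an illustration under stated assumptions: the helper names `pack_char_rows` and `unpack_char_row` and the `MONTHS` list are hypothetical and are not part of CF, the NUG, or any netCDF library. The sketch normalizes each string to NFC, encodes it as UTF-8, and pads every row with trailing NULL bytes to the byte length of the longest encoded string. Note that the width is counted in bytes, which equals the character count only for ASCII text.

```python
import unicodedata

# Hypothetical input: the month names used in the text above.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

def pack_char_rows(strings, pad=b"\x00"):
    """Return (rows, width): NFC UTF-8 bytes, right-padded to a common byte length."""
    encoded = [unicodedata.normalize("NFC", s).encode("utf-8") for s in strings]
    # The last (most rapidly varying) dimension must hold the longest
    # encoded string, so its length is measured in bytes, not characters.
    width = max(len(b) for b in encoded)
    return [b.ljust(width, pad) for b in encoded], width

def unpack_char_row(row):
    """Strip trailing NULL or space padding from one row and decode it."""
    return row.rstrip(b"\x00 ").decode("utf-8")

rows, width = pack_char_rows(MONTHS)
print(width)                     # length of the string dimension
print(rows[4])                   # padded bytes for "May"
print(unpack_char_row(rows[4]))  # round trip back to text
```

Because trailing padding may be either NULL or space characters, a reader written this way cannot distinguish padding from a string that genuinely ends in a space; a variable of type **`string`** avoids that ambiguity.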
4 changes: 3 additions & 1 deletion conformance.adoc
@@ -27,7 +27,9 @@ See https://github.com/ugrid-conventions/ugrid-conventions for the UGRID conform

*Requirements:*

* CF attributes that take string values must be 1D character arrays or single atomic strings.
* Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.

* Any attribute of variable-length string type must be a scalar (not an array).

[[section-1]]

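The text-encoding requirement added to the conformance document could be checked along the following lines. This is a minimal sketch, not a CF-endorsed checker: the function name `is_cf_text` is hypothetical, and it assumes the caller has already obtained the raw bytes of an attribute or of one row of a **`char`** variable. It relies on `unicodedata.is_normalized`, which is available in Python 3.8 and later.

```python
import unicodedata

def is_cf_text(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 whose decoded text is in NFC."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so the requirement cannot be met.
        return False
    return unicodedata.is_normalized("NFC", text)

print(is_cf_text(b"September"))                # ASCII-only text
print(is_cf_text("e\u0301".encode("utf-8")))   # "e" followed by a combining acute accent
print(is_cf_text(b"\xff"))                     # a byte that cannot occur in UTF-8
```

The sketch does not address the second requirement, that an attribute of variable-length string type be a scalar, since checking that depends on the library used to read the file.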
1 change: 1 addition & 0 deletions history.adoc
@@ -7,6 +7,7 @@

=== Working version (most recent first)

* {issues}141[Issue #141]: Clarification that text may be stored in variables and attributes as either vlen strings or char arrays, and must be represented in Unicode Normalization Form C and encoded according to UTF-8.
* {issues}367[Issue #367]: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
* {issues}403[Issue #403]: Metadata to encode quantization properties
* {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".