diff --git a/ch02.adoc b/ch02.adoc
index a0e569bf..7f9d83cb 100644
--- a/ch02.adoc
+++ b/ch02.adoc
@@ -12,18 +12,24 @@ NetCDF files should have the file name extension "**`.nc`**".
 
 // TODO: Check, should this be a bullet list?
 Data variables must be one of the following data types: **`string`**, **`char`**, **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, **`unsigned int64`**, **`float`** or **`real`**, and **`double`** (which are all the link:$$https://docs.unidata.ucar.edu/nug/current/md_types.html$$[netCDF external data types] supported by netCDF-4).
-The **`string`** type is only available in files using the netCDF version 4 (netCDF-4) format.
+The **`string`** type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.
 The **`char`** and **`string`** types are not intended for numeric data.
 One byte numeric data should be stored using the **`byte`** or **`unsigned byte`** data types.
 It is possible to treat the **`byte`** and **`short`** types as unsigned by using the NUG convention of indicating the unsigned range using the **`valid_min`**, **`valid_max`**, or **`valid_range`** attributes.
 In many situations, any integer type may be used.
 When the phrase "integer type" is used in this document, it should be understood to mean **`byte`**, **`unsigned byte`**, **`short`**, **`unsigned short`**, **`int`**, **`unsigned int`**, **`int64`**, or **`unsigned int64`**.
 
-Strings in variables may be represented one of two ways - as atomic strings or as character arrays.
-An n-dimensional array of strings may be implemented as a variable of type **`string`** with _n_ dimensions, or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
-For example, a character array variable of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
-The other strings, such as "May", should be padded with trailing NULL or space characters so that every array element is filled.
-If the atomic string option is chosen, each element of the variable can be assigned a string with a different length.
+A text string can be stored either in a variable-length **`string`** or in a fixed-length **`char`** array.
+In both cases, text strings must be represented in Unicode Normalization Form C (NFC, link:$$https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf$$[section 3.11] and link:$$https://unicode.org/reports/tr15$$[Annex 15] of the Unicode standard) and encoded according to UTF-8.
+A text string consisting only of ASCII characters is guaranteed to conform with this requirement, because the ASCII characters are a subset of Unicode, and their NFC UTF-8 encodings are the same as their one-byte ASCII codes (decimal 0-127, hexadecimal `00`-`7F`).
+
+Before version 1.12, CF did not require UTF-8 encoding, and did not provide or endorse any convention to record what encoding was used.
+However, if the text string is stored in a **`char`** variable, the encoding might be recorded by the **`_Encoding`** attribute, although this is not a CF or NUG convention.
+
+An __n__-dimensional array of strings may be implemented as a variable or an attribute of type **`string`** with _n_ dimensions (only _n_=1 is allowed for an attribute) or as a variable of type **`char`** with _n_+1 dimensions, where the most rapidly varying dimension (the last dimension in CDL order) is large enough to contain the longest string in the variable.
+For example, a **`char`** variable containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name.
+The other strings, such as "May", would be padded with trailing NULL or space characters so that every array element is filled.
+A **`string`** variable to store the same information would be dimensioned (12), with each element of the array containing a string of the appropriate length.
 The CDL example below shows one variable of each type.
 
 [[char-and-string-variables-ex]]
diff --git a/conformance.adoc b/conformance.adoc
index 4e490ab9..b1f28878 100644
--- a/conformance.adoc
+++ b/conformance.adoc
@@ -27,7 +27,9 @@ See https://github.com/ugrid-conventions/ugrid-conventions for the UGRID conform
 
 *Requirements:*
 
-* CF attributes that take string values must be 1D character arrays or single atomic strings.
+* Any text stored in a CF attribute or variable must be represented in Unicode Normalization Form C and encoded in UTF-8.
+
+* Any attribute of variable-length string type must be a scalar (not an array).
 
 [[section-1]]
 
diff --git a/history.adoc b/history.adoc
index bfca341c..f1812116 100644
--- a/history.adoc
+++ b/history.adoc
@@ -7,6 +7,7 @@
 
 === Working version (most recent first)
 
+* {issues}141[Issue #141]: Clarification that text may be stored in variables and attributes as either vlen strings or char arrays, and must be represented in Unicode Normalization Form C and encoded according to UTF-8.
 * {issues}367[Issue #367]: Remove the AMIP and GRIB columns from the standard name table format defined by Appendix B.
 * {issues}403[Issue #403]: Metadata to encode quantization properties
 * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)".
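
The text rules introduced by this change can be illustrated with a short Python sketch. It is not part of the patch or of CF itself. The helper names `to_nfc_utf8` and `pad_to_char_rows` are hypothetical, and the sketch assumes that the string-length dimension of a `char` variable is counted in UTF-8 bytes and that NULL is the chosen pad character (the text allows NULL or space).

```python
import unicodedata

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def to_nfc_utf8(text: str) -> bytes:
    """Normalize text to Unicode NFC, then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


def pad_to_char_rows(strings, pad=b"\x00"):
    """Encode each string and right-pad it to a common byte width.

    Returns (rows, width), mirroring a char variable with dimensions
    (len(strings), width). Assumption: width is measured in encoded bytes.
    """
    encoded = [to_nfc_utf8(s) for s in strings]
    width = max(len(e) for e in encoded)
    return [e.ljust(width, pad) for e in encoded], width


rows, width = pad_to_char_rows(MONTHS)
print(len(rows), width)  # 12 9, set by "September"
print(rows[4])           # b'May\x00\x00\x00\x00\x00\x00'

# ASCII-only text is already NFC, and its UTF-8 bytes equal its ASCII codes.
print(to_nfc_utf8("May") == "May".encode("ascii"))  # True

# "e" followed by a combining acute accent (U+0301) composes to U+00E9 under NFC.
print(to_nfc_utf8("e\u0301"))  # b'\xc3\xa9'
```

For non-ASCII text, the encoded byte length can exceed the character count, so under the byte-width assumption above the last dimension of a `char` variable would be sized from the encoded bytes rather than from the number of characters.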