Improve text around handling strings #1480

Open
tacaswell opened this issue Sep 29, 2024 · 1 comment

@tacaswell
Contributor

from https://manual.nexusformat.org/nxdl-types.html#data-types-allowed-in-nxdl-specifications

NX_CHAR:
The preferred string representation is UTF-8. Both fixed-length strings and variable-length strings are valid. String arrays cannot be used where only a string is expected (title, start_time, end_time, NX_class attribute,…). Fields or attributes requiring the use of string arrays will be clearly marked as such (like the NXdata attribute auxiliary_signals). This is the default field type.

At the NeXus level we should decide whether "NX_CHAR" is "sequence-of-8-bit-chars-good-luck-with-encoding" à la C or "sequence of Unicode code points" à la strings in modern programming languages.

If it is the second, then we should drop the sentence. If it is the first, we should at least change the language to say "encoding" (rather than "representation"), possibly change it to "when using HDF5, use the UTF-8 encoding", or still consider dropping it.

For reference the h5py docs on strings: https://docs.h5py.org/en/stable/strings.html#strings and notes on encoding https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
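
To make the two readings concrete, here is a minimal sketch in plain Python (no HDF5 involved; the variable names are only illustrative). The same text has one length as a sequence of code points and a different length as bytes, depending on the encoding chosen:

```python
text = "你好世界"

# Second reading: a string is a sequence of Unicode code points.
code_points = [ord(ch) for ch in text]

# First reading: a string is a sequence of bytes that only has meaning
# once you know which encoding produced it.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(len(text))         # 4 code points
print(len(utf8_bytes))   # 12 bytes in UTF-8
print(len(utf16_bytes))  # 8 bytes in UTF-16 (little-endian, no BOM)

# Decoding the bytes with the wrong encoding does not fail here,
# it just produces different text.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake == text)  # False
```

If NX_CHAR means the second reading, only `text` matters to NeXus and the byte layout is left to the file format.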


I think we should go with the second option and assert that the details of the encoding are the business of the underlying file format, not of NeXus proper (any more than we would pull endianness up into NeXus). In the case of XML, the whole file has an encoding (which should be declared at the top!), and HDF5 (and h5py) can also handle this:

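import h5py
a = '你好世界'
# Open an HDF5 file via h5py.File; the builtin open() returns a plain file object
with h5py.File('/tmp/test.h5', 'w') as f:
    # A list of Python str is stored as variable-length UTF-8 strings
    f['a'] = [a, a+a, 'bob']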

which, if we poke at the file, gives:

In [10]: f = h5py.File('/tmp/test.h5')

In [11]: f['a'].dtype
Out[11]: dtype('O')

In [12]: f['a'].asstr()[:]
Out[12]: array(['你好世界', '你好世界你好世界', 'bob'], dtype=object)

In [13]: f['a'][:]
Out[13]: 
array([b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c',
       b'\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c\xe4\xbd\xa0\xe5\xa5\xbd\xe4\xb8\x96\xe7\x95\x8c',
       b'bob'], dtype=object)

This works because h5py records the encoding in the HDF5 file:

In [24]: from h5py import h5t

In [25]: h5t.check_string_dtype(f['a'].dtype)
Out[25]: string_info(encoding='utf-8', length=None)

and you can see it in the h5dump output:

@ h5dump /tmp/test.h5
HDF5 "/tmp/test.h5" {
GROUP "/" {
   DATASET "a" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 3 ) / ( 3 ) }
      DATA {
      (0): "\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614",
      (1): "\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614\37777777744\37777777675\37777777640\37777777745\37777777645\37777777675\37777777744\37777777670\37777777626\37777777747\37777777625\37777777614",
      (2): "bob"
      }
   }
}
}

This shows that h5py does this with what I believe are standard HDF5 tools, so I would expect it to be available from any language.

@tacaswell
Contributor Author

https://docs.hdfgroup.org/archive/support/HDF5/doc/RM/RM_H5T.html#Datatype-SetCset is the upstream HDF5 documentation, which says this is available from 1.8 onward and that the only supported encodings are ASCII and UTF-8.
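
The same restriction is visible from h5py without touching a file. The following is a small sketch, assuming h5py 3.x, where `h5py.string_dtype` and `h5py.check_string_dtype` are the helpers for building and inspecting string dtypes; the variable names are only illustrative:

```python
import h5py

# Variable-length UTF-8 strings.
vlen_utf8 = h5py.string_dtype(encoding="utf-8")
# Fixed-length ASCII strings; for fixed-length types the length is in bytes.
fixed_ascii = h5py.string_dtype(encoding="ascii", length=10)

utf8_info = h5py.check_string_dtype(vlen_utf8)
ascii_info = h5py.check_string_dtype(fixed_ascii)
print(utf8_info)
print(ascii_info)

# Any other character set is rejected when the dtype is built,
# before anything is written to a file.
try:
    h5py.string_dtype(encoding="latin-1")
    latin1_rejected = False
except ValueError:
    latin1_rejected = True
print(latin1_rejected)
```

Both fixed-length and variable-length forms carry the character set, which matches the NX_CHAR text saying that either is valid.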

@tacaswell changed the title from "Improve text around handling screens" to "Improve text around handling strings" on Sep 29, 2024