Codepoints are stored UTF-8-encoded.
All multibyte integers are little-endian.
8 or 12 bytes | ... |
---|---|
Header | Data |
6 bytes | 1 byte | 1 byte | 4 bytes? |
---|---|---|---|
magic number | version | flags | optional crc |
- `magic number`: `"UTS#46"` (`0x55 0x54 0x53 0x23 0x34 0x36`).
- `version`: format version (1 byte; currently `0x01`).
- `flags`: see Flags below.
- `optional crc`: a CRC32 of the data section, present only if `flags` has the `has crc` bit set.
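As an illustration of the header layout, here is a minimal Python sketch of a header parser. The function name `parse_header` and the constant `HAS_CRC_MASK` are hypothetical, and the position of the `has crc` bit (bit 3) is an assumption, since the layout only states that such a bit exists in `flags`.

```python
import struct

MAGIC = b"UTS#46"      # 0x55 0x54 0x53 0x23 0x34 0x36
HAS_CRC_MASK = 0x08    # ASSUMPTION: "has crc" is bit 3 of the flags byte

def parse_header(buf: bytes):
    """Return (version, flags, crc_or_None, offset_of_data_section)."""
    if len(buf) < 8:
        raise ValueError("truncated header")
    if buf[:6] != MAGIC:
        raise ValueError("bad magic number")
    version, flags = buf[6], buf[7]
    crc, offset = None, 8
    if flags & HAS_CRC_MASK:
        if len(buf) < 12:
            raise ValueError("truncated header: missing crc")
        # Multibyte integers are little-endian, so the CRC32 is read as "<I".
        (crc,) = struct.unpack_from("<I", buf, 8)
        offset = 12
    return version, flags, crc, offset
```

The returned offset is 8 without a CRC and 12 with one, matching the "8 or 12 bytes" header size.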
7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|
unused | has crc | compression |
- `has crc`: if set, there will be a CRC32 of the data section at the end of the header.
- `compression`: compression mode of the data. Currently identical to NSData’s compression constants + 1:
  - 0: no compression
  - 1: LZFSE
  - 2: LZ4
  - 3: LZMA
  - 4: ZLIB
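The flags byte could be decoded roughly as follows. This is a sketch under stated assumptions: the compression values come from the list above, but the exact bit positions (`has crc` in bit 3, `compression` in bits 2 to 0) are assumed, and the names `Compression` and `decode_flags` are hypothetical.

```python
from enum import IntEnum

class Compression(IntEnum):
    NONE = 0
    LZFSE = 1
    LZ4 = 2
    LZMA = 3
    ZLIB = 4

HAS_CRC_BIT = 3          # ASSUMPTION: bit position of "has crc"
COMPRESSION_MASK = 0x07  # ASSUMPTION: compression occupies bits 2..0

def decode_flags(flags: int):
    """Split a flags byte into (has_crc, compression)."""
    has_crc = bool((flags >> HAS_CRC_BIT) & 1)
    # Raises ValueError for values outside 0..4.
    compression = Compression(flags & COMPRESSION_MASK)
    return has_crc, compression
```

Five compression values need at least three bits, which is why the sketch masks with `0x07`.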
The data section is a (possibly compressed; see Flags) collection of data blocks of the format:

`[marker][section data] ...`
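One way to split an uncompressed data section into blocks is sketched below. It assumes that a block extends until the next marker byte; this is plausible because the marker values (`0xFC` to `0xFF`) cannot occur in UTF-8 encoded codepoints, but the format description does not state it explicitly. `split_blocks` is a hypothetical name.

```python
MARKERS = {
    0xFF: "characterMap",
    0xFE: "ignoredCharacters",
    0xFD: "disallowedCharacters",
    0xFC: "joiningTypes",
}

def split_blocks(data: bytes):
    """Return a list of (marker_name, section_bytes) pairs.

    ASSUMPTION: each block runs up to the next marker byte (or end of data).
    """
    blocks = []
    start = None  # index of the marker that opened the current block
    for i, byte in enumerate(data):
        if byte in MARKERS:
            if start is not None:
                blocks.append((MARKERS[data[start]], data[start + 1:i]))
            start = i
        elif start is None:
            raise ValueError("data section must begin with a marker")
    if start is not None:
        blocks.append((MARKERS[data[start]], data[start + 1:]))
    return blocks
```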
Section data formats:
If marker is `characterMap` (`0xFF`):

`[codepoint][mapped-codepoint ...][null] ...`
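A `characterMap` section might be decoded as in the following sketch, which treats each null-terminated record as one source codepoint followed by zero or more mapped codepoints. The function name `decode_character_map` is hypothetical, and the sketch assumes U+0000 never appears as a source or mapped codepoint.

```python
def decode_character_map(section: bytes) -> dict:
    """Map each source character to its (possibly empty) replacement string."""
    mapping = {}
    for record in section.split(b"\x00"):
        if not record:
            continue  # empty piece after the final null terminator
        text = record.decode("utf-8")
        # First codepoint is the key; the remainder is the mapped sequence.
        mapping[text[0]] = text[1:]
    return mapping
```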
If marker is `ignoredCharacters` (`0xFE`) or `disallowedCharacters` (`0xFD`):

`[codepoint-range] ...`
If marker is `joiningTypes` (`0xFC`):

`[type][[codepoint-range] ...]`

where `type` is one of `C`, `D`, `L`, `R`, or `T`.
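A `joiningTypes` section could be read along these lines. The sketch assumes that `type` is stored as a single ASCII byte and that each block carries exactly one type followed by its ranges; neither detail is spelled out above. `decode_joining_types` is a hypothetical name.

```python
def decode_joining_types(section: bytes):
    """Return (joining_type, [(first, last), ...]) for one joiningTypes block."""
    if not section:
        raise ValueError("empty joiningTypes section")
    joining_type = chr(section[0])  # ASSUMPTION: one ASCII byte
    if joining_type not in "CDLRT":
        raise ValueError(f"unknown joining type {joining_type!r}")
    scalars = [ord(c) for c in section[1:].decode("utf-8")]
    if len(scalars) % 2:
        raise ValueError("codepoint ranges must come in pairs")
    # Pair consecutive codepoints into closed (first, last) ranges.
    return joining_type, list(zip(scalars[0::2], scalars[1::2]))
```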
`codepoint-range`: two codepoints, marking the first and last codepoints of a closed range. Single-codepoint ranges have the same start and end codepoint.
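To make the range encoding concrete, the sketch below decodes a run of codepoint ranges (as used by `ignoredCharacters` and `disallowedCharacters`) and tests membership. It assumes the two codepoints of a range are stored back to back with no separator; `decode_ranges` and `contains` are hypothetical names.

```python
def decode_ranges(section: bytes):
    """Decode UTF-8 bytes into a list of closed (first, last) codepoint ranges."""
    scalars = [ord(c) for c in section.decode("utf-8")]
    if len(scalars) % 2:
        raise ValueError("codepoint ranges must come in pairs")
    ranges = list(zip(scalars[0::2], scalars[1::2]))
    for first, last in ranges:
        if first > last:
            raise ValueError("range start exceeds range end")
    return ranges

def contains(ranges, codepoint: int) -> bool:
    """True if codepoint lies in any range; both endpoints are inclusive."""
    return any(first <= codepoint <= last for first, last in ranges)
```

A single codepoint such as U+00AD is therefore written twice, once as the start and once as the end of its range.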