-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement PackedPatternsV1 with packing and unpacking #5580
Conversation
Still needs databake and serde impls, but the packing part is ready for review/landing. For human-readable Serde, I was thinking of serializing something resembling the builder; maybe I will just derive Serde on that and use it directly. |
The individual commits are not that useful; most of the changes are in just the one new file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reviewed builder, still reviewing others
/// | ||
/// The LMS value determines which pattern index is used for the first column: | ||
/// | ||
/// | LMS Value | Long Index | Medium Index | Short Index | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's add a column that contains the entries "LMS", "L, MS", "LM, S", and "L, M, S" since it's easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 6b30bd8
/// | Sc | S + 6 | 18-20 | Sa | | ||
/// | ||
/// As a result, if there are no variants, bits 2 and higher will be all zero, | ||
/// making the header int suitable for varint packing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: varint packing, as used by postcard and other optimized binary serialization formats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 4e94347
/// | Lb | S + 1 | 3-5 | La | | ||
/// | Mb | S + 2 | 6-8 | Ma | | ||
/// | Sb | S + 3 | 9-11 | Sa | | ||
/// | Lc | S + 4 | 12-14 | La | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: so inheritence is always from a? I thought c inherits from b first?
we could mark Lc as inheriting from "Lb, then La" etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed both variants to inherit from standard because it was easier to implement, and in practice, if both variant0 and variant1 are present, they will definitely be different from each other. The main inheritance that matters is variant0 from standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that's convincing.
/// The variants are currently used by year formatting: | ||
/// | ||
/// - Standard: Year, which could be partial precision (2-digit Gregorain) | ||
/// - Variant 0: Full Year, which is always full precision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Not a fan of calling them variant0 variant1 in code here; can we primarily use descriptive names and in the docs mention that this is also "variant0" and "variant1"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your suggestion to call them full_year
and with_era
? Because I wanted these to be generalizable enough to other cases that might want to use this mechanism in the future. For instance, maybe some locale will want to use it for day periods or something ("6 in the evening", "7 in the evening", "8pm", "9pm").
{ | ||
if let Some(pattern) = pattern { | ||
if pattern != fallback { | ||
*chunk = match elements.iter().position(|p| p == pattern) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: potentially use a BTreeMap<pattern, index>
to speed this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No Ord
impls available and I don't particularly want to add them; they would peculate everywhere (20+ inner types including all fields, etc). And we can't move the first 1-3 elements on the vec, anyway. So the linear search seems alright.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay for now. Worth checking how slow this codegen is.
|
||
// Check to see if we need to switch to Q=1 mode | ||
#[allow(clippy::unwrap_used)] // the array is nonempty | ||
if chunks.iter().max().unwrap() >= &0x8 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think we need to use hex here. Also minor style nit would be to have the *
on the LHS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically, I prefer writing in hex when I'm emphasizing that I care more about the bit representation than the numeric value. "8" means "the number 8" and "0x8" means "an integer with the fourth bit on and all others off"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I guess so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it a constant in 6b30bd8
|
||
/// A builder for a [`PackedPatternsV1`]. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct PackedPatternsBuilder<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: datagen-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in the next PR, I use this type for human-readable deserialization, not just datagen.
} | ||
|
||
// Check to see if we need to switch to Q=1 mode | ||
#[allow(clippy::unwrap_used)] // the array is nonempty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: if we make this datagen-only, the comment could say "the array is nonempty and this is datagen-only"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the comment in c01686e
I wish we would have array::max
one of these days: rust-lang/rust#78504
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also use one of the workarounds in that issue to avoid the unwrap
It is impossible for an IndexN array to need more than a length integer of size N, anyway, the max index is always `>=` the length. Part of #5523 Builds on #5593 We could in theory just have a `VZVFormatCombo<Index, Len>` type that allows free selection, however I'm trying to keep this minimal. Overall the main use case for that is picking things like "a small array of ;argely-sized elements" and we could just expose Index16Len8 for that. I can see that being useful for things like #5580, though it also feels like a data microoptimization. The "total" lines in fingerprints.csv are interspersed in giant diffs, and this basically only gets a max of 2-6 byte wins per data, but the overall data size went down by 200KB. Not amazing, not terrible. ```rust [18:26:22] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ $ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 23501369 [18:26:08] मanishearth@manishearth-glaptop2 ~/dev/icu4x/provider/data ^_^ $ rg total | awk '{ gsub(/B,/, "", $3); s +=$3} END{print s}' 23391499 ```
Part of #257
Replaces #5521