Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make LineBreakPropertyTable and RuleBreakPropertyTable serializable #1638

Closed
aethanyc opened this issue Feb 25, 2022 · 3 comments · Fixed by #1652
Closed

Make LineBreakPropertyTable and RuleBreakPropertyTable serializable #1638

aethanyc opened this issue Feb 25, 2022 · 3 comments · Fixed by #1652
Assignees
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@aethanyc
Copy link
Contributor

This issue blocks #1373.

Both LineBreakPropertyTable and RuleBreakPropertyTable are behind the provider_serde feature, but the feature is not defined in segmenter crate's Cargo.toml. Therefore, I made the following change to Cargo.toml here.

 [features]
-default = []
+default = ["provider_serde"]
+provider_serde = []

Then I got the following build error.

error[E0277]: the trait bound `[[u8; 1024]; 128]: Serialize` is not satisfied
   --> experimental/segmenter/src/provider.rs:43:14
    |
43  |     Borrowed(&'data [[u8; 1024]; 128]),
    |              ^ the trait `Serialize` is not implemented for `[[u8; 1024]; 128]`
    |
    = help: the following implementations were found:
              <[T; 0] as Serialize>
              <[T; 10] as Serialize>
              <[T; 11] as Serialize>
              <[T; 12] as Serialize>
            and 30 others
    = note: required because of the requirements on the impl of `Serialize` for `&[[u8; 1024]; 128]`

It seems a large array (or a 2d array) is not serializable, at least it's not implemented in serde per this serde issue.

The 2d array is a serialized form of this properties_map array in build.rs, representing a mapping from codepoint to a break state. I feel this is similar to an enumerate Unicode property. For example, each slot inLineBreakPropertyTable stores a line breaking state, which is synthesized from the Line_Break property plus other conditions from the line breaking rule. (@makotokato please correct me if I misunderstand it.)

So we can either:

  1. Continue to use 2d array, and make them serializable.
  2. Store the codepoint -> break state mapping in a CodePointTrie. @echeran Is this possible in ICU4X? This sounds like I'm asking for a CodePointTrie builder.

cc @sffc

@aethanyc aethanyc added T-core Type: Required functionality C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) labels Feb 25, 2022
@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 1, 2022

After reading this VarZeroVec example, I think we can useVarZeroVec<'data, ZeroSlice<u8>> instead of a 2d array to make RuleBreakPropertyTable serializable.

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Mar 1, 2022
@sffc
Copy link
Member

sffc commented Mar 1, 2022

The easiest solution that should avoid performance penalties is to use a flat ZeroVec<u8> and treat it as a matrix in row major order.

VarZeroVec is not the right answer because it is designed for variable width data, but our data is fixed width. VZV incurs a performance penalty because it needs to read from an index table first before it knows the final offset of the data.

Can we also solve this at the same time we solve the data deduplication?

@aethanyc
Copy link
Contributor Author

aethanyc commented Mar 2, 2022

Thanks for the suggestion!

Can we also solve this at the same time we solve the data deduplication?

I've made RuleBreakPropertyTable a ZeroVec<u8> in #1652, but the PR is large enough. File #1653 as a followup.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Mar 3, 2022
@aethanyc aethanyc added this to the 2022 Q1 0.6 Sprint D milestone Mar 3, 2022
@aethanyc aethanyc self-assigned this Mar 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-segmentation Component: Segmentation S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants