String DSL should support valid UTF-8 #54

dcapwell · 2019-02-15T18:09:10Z

I find that if I use basicMultilingualPlaneAlphabet from the string DSL, I get back strings that are not valid UTF-8. To get a generator of valid UTF-8 strings, I have the following in my code:

public static final Gen<String> UTF_8_GEN =
        SourceDSL.strings()
                .basicMultilingualPlaneAlphabet()
                .ofLengthBetween(0, 1024)
                .map(s -> new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));

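This conversion to bytes and back gets rid of all non-valid code points: anything that cannot be encoded as UTF-8 is replaced with '?' by String.getBytes rather than removed.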
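A minimal sketch of what that round trip does, using only the JDK. The class name Utf8RoundTrip and the sample strings are made up for illustration. The second string contains an unpaired high surrogate, which has no UTF-8 encoding, so String.getBytes substitutes '?' for it.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    /** Encodes to UTF-8 bytes and decodes again, like the map step of the generator. */
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "h\u00e9llo";  // every char has a UTF-8 encoding
        String lone = "a\uD800b";     // contains an unpaired high surrogate

        System.out.println(roundTrip(valid).equals(valid)); // true
        System.out.println(roundTrip(lone).equals(lone));   // false
        System.out.println(roundTrip(lone));                // a?b
    }
}
```

The output of the round trip always survives a second round trip, which is why mapping the generator this way yields strings that serialize without loss.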

jlink · 2019-02-16T09:18:54Z

I'm not quite sure what you mean by "I get back invalid UTF-8". As far as I understand it, Java uses UTF-16 to encode strings internally, and you would specify a charset only when translating to bytes.

dcapwell · 2019-03-09T02:21:29Z

Sorry for not replying for a long time.

I have a lot of use cases which deal with serialization, so I want to make sure UTF-8 strings are serialized and deserialized without loss; most of my code assumes that the original string is valid UTF-8. What I find when I use the code above is that deserializing the string returns a different value, so the two strings are no longer equal under .equals(o).

Looking up the UTF-8 code points, I see the max defined UTF-8 value is 99k but StringDSL defines 65k. I could totally be reading everything wrong (I use UTF-8, I don't know the spec at all =D), but that would imply to me that I should always get back UTF-8 chars; yet for some reason the string comes back as invalid UTF-8 and the Charset will drop some chars.

My common use case is to deal with UTF-8 strings so I tend to define the generator above in every project.

jlink · 2019-03-16T15:07:12Z

I dug a bit deeper into the problem and checked for which code points a round-trip conversion does not produce the same chars. The smallest one I found was 0xD800, which is the beginning of an area where Unicode currently has no defined characters (see https://unicode-table.com).
So the phenomenon will be the same when using UTF-16, for example.
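The check described above can be sketched with the JDK alone. This is an illustration and not the code that was actually run; RoundTripScan and its method names are hypothetical. It tests each BMP char on its own, so a surrogate half is always unpaired here. A valid surrogate pair inside a longer string would survive the round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripScan {
    /** True if encoding the single char and decoding it again yields the same string. */
    static boolean roundTrips(char c, Charset charset) {
        String s = String.valueOf(c);
        return new String(s.getBytes(charset), charset).equals(s);
    }

    /** Smallest BMP char that does not survive the round trip, or -1 if all do. */
    static int firstFailure(Charset charset) {
        for (int c = 0; c <= 0xFFFF; c++) {
            if (!roundTrips((char) c, charset)) return c;
        }
        return -1;
    }

    /** Number of BMP chars that do not survive the round trip. */
    static int failureCount(Charset charset) {
        int n = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (!roundTrips((char) c, charset)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.printf("first failure: 0x%04X%n", firstFailure(StandardCharsets.UTF_8));
        System.out.println("failures: " + failureCount(StandardCharsets.UTF_8));
    }
}
```

For UTF-8 this reports 0xD800 as the first failure and 2048 failures in total, which is exactly the surrogate range 0xD800 to 0xDFFF. Passing a different Charset allows the same check for other encodings.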

So, maybe a better approach than providing a specialised UTF-8 generator could be to (optionally) filter out all code points that have no defined character in Unicode, e.g. like this:

SourceDSL.strings()
        .basicMultilingualPlaneAlphabet()
        .ofLengthBetween(0, 1024)
        .acceptOnlyValid("utf-8")
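acceptOnlyValid does not exist yet, so here is a rough sketch of the check it could perform, written against the JDK only. EncodableFilter and encodableIn are hypothetical names. The sketch filters whole strings rather than individual code points, relying on CharsetEncoder.canEncode, which returns false for a string containing an unpaired surrogate.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;

class EncodableFilter {
    /** Hypothetical helper: accepts only strings the given charset can encode. */
    static Predicate<String> encodableIn(Charset charset) {
        // A fresh encoder per test, because CharsetEncoder is stateful and not thread-safe.
        return s -> charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        Predicate<String> validUtf8 = encodableIn(StandardCharsets.UTF_8);

        System.out.println(validUtf8.test("gr\u00fc\u00df"));  // true: plain BMP text
        System.out.println(validUtf8.test("\uD83D\uDE00"));    // true: a valid surrogate pair
        System.out.println(validUtf8.test("a\uD800b"));        // false: unpaired surrogate
    }
}
```

Such a predicate could be handed to whatever filtering combinator the generator type offers. Which combinator that would be is an assumption on my part and not something confirmed in this thread. Filtering rejects whole strings, so for long strings over the full BMP it may discard many candidates; replacing offending chars, as the map-based workaround does, avoids that.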
