String DSL should support valid UTF-8 #54

Open
dcapwell opened this issue Feb 15, 2019 · 3 comments

Comments

@dcapwell

I find that if I use basicMultilingualPlaneAlphabet from the string DSL, I get back strings that are not valid UTF-8. To get a generator of valid UTF-8 strings, I have the following in my code:

public static final Gen<String> UTF_8_GEN =
        SourceDSL.strings()
                .basicMultilingualPlaneAlphabet()
                .ofLengthBetween(0, 1024)
                .map(s -> new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));

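This conversion to bytes and back replaces every code unit that cannot be encoded as UTF-8, such as an unpaired surrogate, with a '?' character, so only valid strings come out of the generator.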
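To make the effect of that round trip concrete, here is a minimal standalone sketch using only the JDK (no QuickTheories). The class name and the sample strings are made up for illustration; the `roundTrip` method mirrors the `map` step above.

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes again, as in the generator's map step.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String valid = "h\u00e9llo";   // every char is encodable
        String lonely = "a\uD800b";    // unpaired high surrogate in the middle

        System.out.println(roundTrip(valid).equals(valid));   // true
        System.out.println(roundTrip(lonely).equals(lonely)); // false
        System.out.println(roundTrip(lonely));                // a?b
    }
}
```

The string keeps its length in this case, but it is no longer equal to the original, which is the kind of mismatch a serialization test would trip over.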

@jlink
Contributor

jlink commented Feb 16, 2019

I'm not quite sure what you mean by "I get back invalid UTF-8". As far as I understand it, Java uses UTF-16 to encode strings internally, and you would specify a charset only when translating to bytes.

@dcapwell
Author

dcapwell commented Mar 9, 2019

Sorry for not replying for a long time.

I have a lot of use cases that deal with serialization, so I want to make sure UTF-8 strings are serialized and deserialized without loss; most of my code assumes that the original string is valid UTF-8. What I find when I use the plain basicMultilingualPlaneAlphabet generator is that deserializing the string returns a different value, so the two strings are no longer .equals(o).

Looking up the UTF-8 code points, I see the max defined UTF-8 value is 99k but StringDSL defines 65k. I could totally be reading everything wrong (I use UTF-8, I don't know the spec at all =D), but that would imply to me that I should always get back valid UTF-8 chars; yet for some reason the string comes back as invalid UTF-8 and the Charset will drop some chars.
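For reference, the limits involved can be read directly from the JDK's own constants. This is a small sketch with a made-up class name; it only prints values from `java.lang.Character` and says nothing about how StringDSL picks its range internally.

```java
final class CodePointRanges {
    // Number of char values that are surrogate code units rather than characters.
    static int surrogateUnitCount() {
        return Character.MAX_SURROGATE - Character.MIN_SURROGATE + 1;
    }

    public static void main(String[] args) {
        System.out.printf("largest BMP value:  0x%X%n", (int) Character.MAX_VALUE);
        System.out.printf("largest code point: 0x%X%n", Character.MAX_CODE_POINT);
        System.out.printf("surrogate block:    0x%X..0x%X (%d values)%n",
                (int) Character.MIN_SURROGATE,
                (int) Character.MAX_SURROGATE,
                surrogateUnitCount());
    }
}
```

The 65k figure corresponds to the top of the Basic Multilingual Plane (0xFFFF). The surrogate block 0xD800..0xDFFF lies inside that range, so a generator that draws uniformly from the BMP can produce chars that are not characters on their own.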

My common use case is to deal with UTF-8 strings so I tend to define the generator above in every project.

@jlink
Contributor

jlink commented Mar 16, 2019

I dug a bit deeper into the problem and checked which code points do not survive a round-trip conversion to bytes and back. The smallest one I found was 0xD800, which is the beginning of an area where Unicode currently has no defined characters (see https://unicode-table.com).
So the phenomenon will be the same when using UTF-16, for example.
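The check described above can be reproduced with a short loop over all single-char strings. This is a sketch of one way to do it, not necessarily how the check was actually run; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FirstLossyChar {
    // Returns the smallest char value that changes when encoded with cs
    // and decoded again, or -1 if every single char round-trips.
    static int firstLossy(Charset cs) {
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            String s = String.valueOf((char) c);
            if (!new String(s.getBytes(cs), cs).equals(s)) {
                return c;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.printf("UTF-8:  0x%X%n", firstLossy(StandardCharsets.UTF_8));
        System.out.printf("UTF-16: 0x%X%n", firstLossy(StandardCharsets.UTF_16));
    }
}
```

Both lines should report 0xD800, the first surrogate code unit, which matches the observation that the effect is not specific to UTF-8.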

So maybe a better approach than providing a specialised UTF-8 generator would be to (optionally) filter out all code points that have no defined character in Unicode, e.g. like this:

SourceDSL.strings()
        .basicMultilingualPlaneAlphabet()
        .ofLengthBetween(0, 1024)
        .acceptOnlyValid("utf-8")
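Note that `acceptOnlyValid` is only a proposal in this thread. As a rough sketch of what such a filter might rest on, the JDK's `CharsetEncoder.canEncode` can tell whether a string is encodable without replacement. The class and both helper methods below are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodableFilter {
    // True when the whole string can be encoded by cs without replacement.
    // Encoders are stateful, so a fresh one is created for each call.
    static boolean isEncodable(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    // Alternative to rejecting: drop unpaired surrogates, keep valid pairs.
    static String stripUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            // codePointAt returns the lone surrogate itself when it is unpaired.
            if (len == 2 || !Character.isSurrogate((char) cp)) {
                sb.appendCodePoint(cp);
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("abc", StandardCharsets.UTF_8));      // true
        System.out.println(isEncodable("a\uD800b", StandardCharsets.UTF_8)); // false
        System.out.println(stripUnpairedSurrogates("a\uD800b"));             // ab
    }
}
```

A predicate like `isEncodable` rejects whole strings, while `stripUnpairedSurrogates` repairs them; which one fits better depends on whether the generator should preserve the requested length distribution.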
