Fix Unicode encoding issues with detailed_message #817

Merged
imjoehaines merged 1 commit into next from fix-unicode-detailed-message
Mar 13, 2024

Conversation

imjoehaines
Contributor

@imjoehaines imjoehaines commented Mar 11, 2024

Goal

Ruby 3.2's Exception#detailed_message method returns a string whose bytes are UTF-8 but whose String#encoding is ASCII_8BIT. This causes issues when we later convert the string to UTF-8 (for sending as JSON): the conversion treats each non-ASCII byte as an undefined binary byte rather than as part of a UTF-8 sequence, so every such byte is replaced:

irb(main):001> a = Exception.new("Обичам те\n大好き")
=> #<Exception:"Обичам те\n大好き">
irb(main):002> a.detailed_message
=> "\xD0\x9E\xD0\xB1\xD0\xB8\xD1\x87\xD0\xB0\xD0\xBC \xD1\x82\xD0\xB5 (Exception)\n\xE5\xA4\xA7\xE5\xA5\xBD\xE3\x81\x8D"
irb(main):003> a.detailed_message.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004> a.detailed_message.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
=> "������������ ���� (Exception)\n���������"

If the detailed message is forced to UTF-8 then it works as expected:

irb(main):005> b = a.detailed_message.force_encoding(Encoding::UTF_8)
=> "Обичам те (Exception)\n大好き"

This can then be sent as JSON correctly.
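
The behaviour described above could be wrapped in a small helper. This is a minimal sketch and not the code merged in this PR: the name `utf8_safe` is hypothetical, and the choice to fall back to a replacing `encode` when the bytes are not valid UTF-8 is an assumption.

```ruby
# Hypothetical helper (not taken from this PR): return a UTF-8 copy of
# `message` that is safe to serialise.
def utf8_safe(message)
  if message.encoding == Encoding::ASCII_8BIT
    # Relabel a copy of the bytes as UTF-8 without changing them.
    candidate = message.dup.force_encoding(Encoding::UTF_8)
    return candidate if candidate.valid_encoding?
  end

  # Assumed fallback: transcode and replace anything that can't be converted.
  message.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Simulate the binary-tagged string that detailed_message returns.
binary = "Обичам те (Exception)\n大好き".b
fixed = utf8_safe(binary)

puts fixed.encoding
puts fixed
```

Checking `valid_encoding?` before returning means a genuinely binary message still goes through the replacing conversion instead of being mislabelled as UTF-8.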

You can compare the bytes in this string with the "ASCII-8BIT" encoded string above and they match exactly (see footnote 1):

irb(main):006> b.bytes.map { |byte| byte.to_s(16) }.map(&:upcase)
=> ["D0", "9E", "D0", "B1", "D0", "B8", "D1", "87", "D0", "B0", "D0", "BC", "20", "D1", "82", "D0", "B5", "20", "28", "45", "78", "63", "65", "70", "74", "69", "6F", "6E", "29", "A", "E5", "A4", "A7", "E5", "A5", "BD", "E3", "81", "8D"]

The bytes in the middle ("20" through "A") are the ASCII text " (Exception)\n", which is why that part is displayed literally in the ASCII-8BIT output:

irb(main):017> ["20", "28", "45", "78", "63", "65", "70", "74", "69", "6F", "6E", "29", "A"].map { |x| x.to_i(16) }.pack("C*")
=> " (Exception)\n"
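
The difference between relabelling and transcoding can be checked directly. The sketch below uses `String#b` to stand in for the binary-tagged string that detailed_message returns (an assumption made so the example runs without an exception object); the variable names are illustrative only.

```ruby
text = "大好き"

# Same bytes as `text`, but tagged ASCII-8BIT.
binary = text.b

# force_encoding only changes the encoding label on the copy.
relabelled = binary.dup.force_encoding(Encoding::UTF_8)

# encode converts byte by byte, replacing each undefined byte.
transcoded = binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

puts relabelled.bytes == binary.bytes   # true: the bytes are untouched
puts transcoded.bytes == binary.bytes   # false: every byte was replaced
puts relabelled == text                 # true: the original text is recovered
```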

Testing

  • Existing tests pass
  • New tests with UTF-8, UTF-16 & Shift JIS encoded messages
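
The new tests themselves are not shown here. As a rough sketch of how fixtures in those three encodings might be built, the snippet below creates one exception per encoding and checks that each message converts back to the same UTF-8 text. Using `Encoding::UTF_16LE` for "UTF-16" and the sample text are assumptions.

```ruby
original = "大好き"

# One encoding per case named in the list above (UTF-16LE is an assumed variant).
encodings = [Encoding::UTF_8, Encoding::UTF_16LE, Encoding::Shift_JIS]

# Build an exception whose message is stored in each encoding.
errors = encodings.map { |encoding| RuntimeError.new(original.encode(encoding)) }

# Each message should transcode back to the original UTF-8 text.
round_trips = errors.map do |error|
  error.message.encode(Encoding::UTF_8) == original
end

puts round_trips.all?
```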

Footnotes

  1. This is expected, since force_encoding doesn't change the underlying bytes, but decoding them back into the original input proves that the string really is UTF-8.

@imjoehaines imjoehaines marked this pull request as ready for review March 11, 2024 13:07
@imjoehaines imjoehaines changed the base branch from master to next March 13, 2024 09:19
@imjoehaines imjoehaines merged commit 27beed6 into next Mar 13, 2024
137 checks passed
@imjoehaines imjoehaines deleted the fix-unicode-detailed-message branch March 13, 2024 09:19
@imjoehaines imjoehaines mentioned this pull request Mar 25, 2024