Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function #36713

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yigithankarabulut
Copy link

Description

Truncate_all will slice the string up to the given length. If truncating at exactly the length results in a broken UTF-8 encoding, it'll truncate before where the last UTF-8 character started.

Link to tracking issue

Fixes #36017

Testing

Two UTF-8 characters were added to the end of the string and the limit was adjusted to match them and the truncation process was tested.

Documentation

Updated pkg/ottl/ottfuncs/README.md with description.

Copy link

linux-foundation-easycla bot commented Dec 7, 2024

CLA Signed

  • ✅login: yigithankarabulut / (7db6139)

The committers listed above are authorized under a signed CLA.

Comment on lines +51 to +60
truncatedStr := stringVal[:limit]
for !utf8.ValidString(truncatedStr) {
limit--
if limit == 0 {
value.SetStr("")
return true
}
truncatedStr = stringVal[:limit]
}
value.SetStr(truncatedStr)
Copy link
Contributor

@jade-guiton-dd jade-guiton-dd Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since utf8.ValidString requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:

Suggested change
truncatedStr := stringVal[:limit]
for !utf8.ValidString(truncatedStr) {
limit--
if limit == 0 {
value.SetStr("")
return true
}
truncatedStr = stringVal[:limit]
}
value.SetStr(truncatedStr)
for limit > 0 && !utf8.RuneStart(stringVal[limit]) {
limit--
}
value.SetStr(stringVal[:limit])

(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comment. The way you wrote is faster and simpler, so it can be preferred, of course, if we assume that the string is generally UTF-8 compatible. In the other way, if we consider that a UTF-8 character is a maximum of 4 bytes, we can also validate the string by going back at most 3 times. The choice is yours.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Dec 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[pkg/ottl] truncate_all function corrupts UTF-8 encoding
3 participants