[pkg/ottl/ottlfuncs] Added utf8 support to truncate_all function #36713

yigithankarabulut · 2024-12-07T16:45:23Z

Description

truncate_all slices each string value to the given length in bytes. If cutting at exactly that length would split a multi-byte UTF-8 character, it truncates before the start of that character instead, so the result is still valid UTF-8.
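To make the described behavior concrete, here is a minimal standalone sketch that mirrors the validity-check loop from this PR. `truncateValid` is a hypothetical helper name, not the function in `func_truncate_all.go`, and treating a negative limit as a no-op is an assumption made only so the sketch is safe to run.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateValid is a hypothetical helper for illustration only.
// It cuts s to at most limit bytes, then backs off one byte at a time
// until the remaining prefix is valid UTF-8.
func truncateValid(s string, limit int) string {
	// Assumption: a negative limit leaves the string untouched.
	if limit < 0 || len(s) <= limit {
		return s
	}
	for limit > 0 && !utf8.ValidString(s[:limit]) {
		limit--
	}
	return s[:limit]
}

func main() {
	s := "héllo" // 'é' is two bytes (0xC3 0xA9), so s is 6 bytes long

	fmt.Printf("%q\n", s[:2])               // naive byte cut splits 'é': "h\xc3"
	fmt.Printf("%q\n", truncateValid(s, 2)) // backs off to before 'é': "h"
	fmt.Printf("%q\n", truncateValid(s, 3)) // 'é' fits entirely: "hé"
}
```

The limit counts bytes, not characters, which is why a limit of 2 keeps only one character here.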

Link to tracking issue

Fixes #36017

Testing

Two multi-byte UTF-8 characters were added to the end of the test string, and the limit was adjusted so that the cut falls on them. The truncation was then tested against that input.

Documentation

Updated pkg/ottl/ottlfuncs/README.md with a description of the new behavior.

Signed-off-by: Yigithan Karabulut <[email protected]>

linux-foundation-easycla · 2024-12-07T16:45:27Z

✅login: yigithankarabulut / (7db6139)

The committers listed above are authorized under a signed CLA.

jade-guiton-dd · 2024-12-10T14:03:54Z

pkg/ottl/ottlfuncs/func_truncate_all.go

+				truncatedStr := stringVal[:limit]
+				for !utf8.ValidString(truncatedStr) {
+					limit--
+					if limit == 0 {
+						value.SetStr("")
+						return true
+					}
+					truncatedStr = stringVal[:limit]
+				}
+				value.SetStr(truncatedStr)


Since utf8.ValidString requires creating a slice and checking all of it every loop, I think a simpler and more efficient solution is to check if the byte after the slice is a valid rune start byte:

Suggested change

-				truncatedStr := stringVal[:limit]
-				for !utf8.ValidString(truncatedStr) {
-					limit--
-					if limit == 0 {
-						value.SetStr("")
-						return true
-					}
-					truncatedStr = stringVal[:limit]
-				}
-				value.SetStr(truncatedStr)
+				for limit > 0 && !utf8.RuneStart(stringVal[limit]) {
+					limit--
+				}
+				value.SetStr(stringVal[:limit])

(Neither solution works all that well if the string is not valid UTF-8 in the first place, and both will gladly cut in the middle of multi-codepoint grapheme clusters, but I'm assuming these edge cases are not much of a concern compared to outputting invalid UTF-8.)
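The edge cases mentioned above can be reproduced with a small standalone sketch of the suggested loop. `truncateRuneStart` is a hypothetical wrapper name, and the negative-limit guard is an assumption added so the sketch cannot panic. The sample inputs are made up for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRuneStart is a hypothetical wrapper around the suggested loop.
func truncateRuneStart(s string, limit int) string {
	// Assumption: a negative limit leaves the string untouched.
	if limit < 0 || len(s) <= limit {
		return s
	}
	// s[limit] is in bounds here because len(s) > limit.
	// Step back while the first dropped byte is a continuation byte.
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit--
	}
	return s[:limit]
}

func main() {
	// Valid input: the cut moves to before the two-byte 'é'.
	fmt.Printf("%q\n", truncateRuneStart("héllo", 2)) // "h"

	// Grapheme cluster: 'e' followed by U+0301 (combining acute accent).
	// The output is valid UTF-8, but the accent is cut off.
	fmt.Printf("%q\n", truncateRuneStart("e\u0301x", 1)) // "e"

	// Invalid input: a stray lead byte 0xC3 just before the cut survives.
	out := truncateRuneStart("ab\xc3cd", 3)
	fmt.Printf("%q valid=%v\n", out, utf8.ValidString(out)) // "ab\xc3" valid=false

	// Invalid input: a run of continuation bytes drags the cut back to 0.
	fmt.Printf("%q\n", truncateRuneStart("a\x80\x80\x80", 3)) // ""
}
```

Each check inspects a single byte, so the loop does constant work per step regardless of how long the string is.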

Thanks for your comment. Your approach is faster and simpler, so it is preferable as long as we can assume the input string is valid UTF-8. Alternatively, since a UTF-8 character is at most 4 bytes long, we could keep the validity check but step back at most 3 times. The choice is yours.
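One possible reading of the bounded variant, as a rough sketch: try the cut at `limit` and at up to three bytes before it, and accept the first prefix that validates. `truncateBounded` is a hypothetical name, and falling back to a plain byte cut when none of the four prefixes is valid is an assumption, not something settled in this thread.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBounded is a hypothetical helper for illustration only.
// It validates at most utf8.UTFMax (4) candidate prefixes.
func truncateBounded(s string, limit int) string {
	// Assumption: a negative limit leaves the string untouched.
	if limit < 0 || len(s) <= limit {
		return s
	}
	// A split character leaves at most 3 dangling bytes, so going back
	// 0, 1, 2, or 3 bytes is enough when the input itself is valid UTF-8.
	for back := 0; back < utf8.UTFMax && back <= limit; back++ {
		if prefix := s[:limit-back]; utf8.ValidString(prefix) {
			return prefix
		}
	}
	// Assumption: no valid prefix nearby means the input was not valid
	// UTF-8, so fall back to a byte-level cut.
	return s[:limit]
}

func main() {
	fmt.Printf("%q\n", truncateBounded("a€b", 3))  // '€' is 3 bytes: "a"
	fmt.Printf("%q\n", truncateBounded("a€b", 4))  // '€' fits: "a€"
	fmt.Printf("%q\n", truncateBounded("😀x", 3)) // 4-byte emoji is dropped: ""
	fmt.Printf("%q\n", truncateBounded("\xff\xff\xff\xff\xff\xff", 5))
}
```

Each `utf8.ValidString` call still scans the whole prefix, so this caps the number of scans at four rather than making each one cheaper.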

I think the simplest way to handle invalid UTF-8 input here would be to validate the whole string once; if it succeeds, proceed with the loop; otherwise, just cut at the byte level since it's not UTF-8 in the first place. What do you think?
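As a sketch of that proposal (not the merged implementation): validate the input once, run the `utf8.RuneStart` loop only when it is valid, and otherwise cut at the byte level. `truncateChecked` is a hypothetical name and the negative-limit guard is an assumption.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateChecked is a hypothetical helper for illustration only.
func truncateChecked(s string, limit int) string {
	// Assumption: a negative limit leaves the string untouched.
	if limit < 0 || len(s) <= limit {
		return s
	}
	// Not UTF-8 in the first place: cut at the byte level.
	if !utf8.ValidString(s) {
		return s[:limit]
	}
	// Valid UTF-8: step back to the nearest rune boundary.
	for limit > 0 && !utf8.RuneStart(s[limit]) {
		limit--
	}
	return s[:limit]
}

func main() {
	// Valid input is cut on a rune boundary.
	fmt.Printf("%q\n", truncateChecked("héllo", 2)) // "h"

	// Invalid input keeps exactly limit bytes instead of collapsing to "".
	fmt.Printf("%q\n", truncateChecked("a\x80\x80\x80", 3)) // "a\x80\x80"
}
```

The trade-off is one full scan of the original string per truncation, including the part beyond the limit.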

github-actions · 2024-12-27T05:21:01Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

added utf8 support to truncate_all function

7db6139

Signed-off-by: Yigithan Karabulut <[email protected]>

yigithankarabulut requested review from TylerHelmuth, bogdandrutu, evan-bradley and a team as code owners December 7, 2024 16:45

github-actions bot assigned mx-psi Dec 7, 2024

github-actions bot added the pkg/ottl label Dec 7, 2024

github-actions bot requested a review from kentquirk December 7, 2024 16:45

jade-guiton-dd reviewed Dec 10, 2024

github-actions bot added the Stale label Dec 27, 2024

yigithankarabulut replied in the review thread Dec 11, 2024

jade-guiton-dd replied in the review thread Dec 12, 2024
