Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize performance of initcap function (~2x faster) #13691

Merged
merged 2 commits into from
Dec 12, 2024

Conversation

tlm365
Copy link
Contributor

@tlm365 tlm365 commented Dec 8, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

Eliminate the intermediate Vec<char> and directly push into a String.

What changes are included in this PR?

Update the initcap function, benchmark included.

Are these changes tested?

Yes, existing unit tests.

Are there any user-facing changes?

No.

*BENCHMARK RESULT

Compiling datafusion-functions v43.0.0 (/home/tailm/repos/github/datafusion/datafusion/functions)
    Finished `bench` profile [optimized] target(s) in 1m 02s
     Running benches/initcap.rs (target/release/deps/initcap-408d405a24fbbea1)
Gnuplot not found, using plotters backend
initcap string view shorter than 12 [size=1024]
                        time:   [34.326 µs 34.353 µs 34.383 µs]
                        change: [-48.954% -48.796% -48.631%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 26 outliers among 100 measurements (26.00%)
  4 (4.00%) low severe
  9 (9.00%) low mild
  5 (5.00%) high mild
  8 (8.00%) high severe

initcap string view longer than 12 [size=1024]
                        time:   [52.105 µs 52.151 µs 52.201 µs]
                        change: [-54.232% -54.111% -53.998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe

initcap string [size=1024]
                        time:   [53.143 µs 53.324 µs 53.587 µs]
                        change: [-53.164% -52.956% -52.775%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

initcap string view shorter than 12 [size=4096]
                        time:   [135.91 µs 136.08 µs 136.27 µs]
                        change: [-51.523% -51.418% -51.301%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

initcap string view longer than 12 [size=4096]
                        time:   [202.91 µs 203.35 µs 204.06 µs]
                        change: [-53.743% -53.636% -53.540%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

initcap string [size=4096]
                        time:   [205.59 µs 205.78 µs 206.00 µs]
                        change: [-53.462% -53.306% -53.164%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

@tlm365 tlm365 changed the title Optimize performance of initcap (~2x faster) Optimize performance of initcap function (~2x faster) Dec 8, 2024
@tlm365
Copy link
Contributor Author

tlm365 commented Dec 8, 2024

I'm not sure if this is an intended implementation or not. But seems like it would be better if we return StringViewArray instead of StringArray here 🤔.

@tlm365 tlm365 marked this pull request as ready for review December 8, 2024 19:55
char_vector.push(c.to_ascii_lowercase());
fn initcap_string(input: Option<&str>) -> Option<String> {
input.map(|s| {
let mut result = String::with_capacity(s.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW we could likely make this function much faster still by using one of the StringArrayWriters directly (it would save this allocation entirely)

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch thanks @tlm365

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

Comment on lines +140 to 148
for c in s.chars() {
let transformed = if prev_is_alphanumeric {
c.to_ascii_lowercase()
} else {
char_vector.push(c.to_ascii_uppercase());
}
previous_character_letter_or_number =
c.is_ascii_uppercase() || c.is_ascii_lowercase() || c.is_ascii_digit();
c.to_ascii_uppercase()
};
result.push(transformed);
prev_is_alphanumeric = c.is_ascii_alphanumeric();
}
Copy link
Member

@Weijun-H Weijun-H Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious which implementation is faster 🤔

        if s.is_empty() {
            return result;
        } else {
            let (upper, lower) = s.split_at(1);
            result.push_str(&upper.to_ascii_uppercase());
            result.push_str(&lower.to_ascii_lowercase());
        }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's not correct and/or possible with utf8?
Seems this answer (last snippet) https://stackoverflow.com/a/38406885 might be close?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's not correct and/or possible with utf8?
Seems this answer (last snippet) https://stackoverflow.com/a/38406885 might be close?

@Dandandan you're right. I tested it with some unit tests for special characters (unicode), this PR's implementation doesn't work and the old implementation doesn't work either. But I think it makes sense at the moment since the initcap function is in datafusion::function::string not datafusion::function::unicode.

Perhaps, it would be better if we update the documentation that initcap is not supported for unicode characters yet and try to handle it if necessary in another PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a PR to update the docs: #13749

@alamb alamb merged commit 320e4d6 into apache:main Dec 12, 2024
28 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 12, 2024

Thanks again @tlm365 @jayzhan211 @Dandandan @Weijun-H and @Dandandan (quite a distinguished list of contributors!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants