Documenting hashing technique #618

lemire · 2024-03-20T19:52:22Z

As remarked by @the-moisrex in issue 617, Premature optimiztion, the hashing technique to identify the protocol is unnecessary if the protocol string is predictable. By simply testing all six possible strings, starting from the most probable, it is likely that you can often get better performance in many practical cases.

It is a trade-off between best average or best-case performance and worst-case performance. A naive approach, in the worst case, may need to do six different string comparisons which could be potentially expensive. We made the choice to go with an option that is slightly less optimal in the case where we can predict, at compile time, the protocol, but has the benefit of having well understood worst-case performance. It is a good old "O(1)" performance vs "O(N)" performance where N is not too large.

This trade-off can be revisited at any time. It is worth documenting it in the code, however. This is what this PR does.

See also : #619

the-moisrex · 2024-03-20T20:03:32Z

I'm just gonna leave this here; it's something from my own implementation but I've changed mine because I needed to ignore tabs and newlines.

You should not be aiming for the worse case scenario; which, the worse case scenario of this algorithm is not as bad as you think it is. If you have 10000 character of scheme characters, worse case is not 10000. Worse case would be sum of all of the special schemes which is just 21.

I have a benchmark for this as well.

        /**
         * @return 0 if unknown, otherwise return the port
         */
        [[nodiscard]] constexpr stl::uint16_t known_port() const noexcept {
            // NOLINTBEGIN(*-magic-numbers)
            switch (this->size()) {
                case 2:
                    if (this->operator[](0) == 'w' && this->operator[](1) == 's') {
                        return 80U;
                    }
                    break;
                case 3:
                    if (*this == "wss") {
                        return 443U;
                    }
                    if (*this == "ftp") {
                        return 21U;
                    }
                    break;
                case 4:
                    if (*this == "http") {
                        return 80U;
                    }
                    break;
                case 5:
                    if (*this == "https") {
                        return 443U;
                    }
                    break;
                default: break;
            }
            return 0U;
            // NOLINTEND(*-magic-numbers)
        }

lemire · 2024-03-20T20:24:06Z

@the-moisrex I understand that these functions may be bottlenecks in your applications. In ada, with the current code, identifying the protocol or or the special port, does not even register when profiling. It may be possible to gain 1% or 2% in some tests but we can rule out significant gains from the profiling data alone, at least in the tests that we have done so far.

We are still open to revisiting design choices, but this particular component of the code is unlikely to become a priority.

I should point out that the code routine you offer is a form of hashing: you use the key length as the hash function and then branch out.

There is evidently a wide range of design choices, and I invite you to pursue your investigations in your projects.

include/ada/scheme-inl.h

Co-authored-by: Yagiz Nizipli <[email protected]>

lemire added 2 commits March 20, 2024 15:41

doc: documenting hashing technique

9f9923d

putting the comment once

dff9e36

lemire requested a review from anonrig March 20, 2024 19:52

lemire mentioned this pull request Mar 20, 2024

Premature optimiztion #617

Closed

lemire added 4 commits March 20, 2024 15:54

linting

c64f634

better wording

5df2c5e

cutting long sentence

6833e69

cutting another long line

c053bbe

ada-url locked as resolved and limited conversation to collaborators Mar 20, 2024

anonrig approved these changes Mar 20, 2024

View reviewed changes

include/ada/scheme-inl.h Show resolved Hide resolved

Update include/ada/scheme-inl.h

611c35b

Co-authored-by: Yagiz Nizipli <[email protected]>

lemire merged commit 7dc9e67 into main Mar 21, 2024
21 checks passed

lemire deleted the documenting_hashing_technique branch March 21, 2024 13:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documenting hashing technique #618

Documenting hashing technique #618

lemire commented Mar 20, 2024 •

edited

Loading

the-moisrex commented Mar 20, 2024

lemire commented Mar 20, 2024 •

edited

Loading

Documenting hashing technique #618

Documenting hashing technique #618

Conversation

lemire commented Mar 20, 2024 • edited Loading

the-moisrex commented Mar 20, 2024

lemire commented Mar 20, 2024 • edited Loading

lemire commented Mar 20, 2024 •

edited

Loading

lemire commented Mar 20, 2024 •

edited

Loading