introduce a new method (set_host_to_base_host) to replace set_host when set_host is unnecessary #762

Merged: 3 commits, Oct 16, 2024

@lemire lemire commented Oct 16, 2024

In our main parsing function, we repeatedly call set_host. This is a relatively expensive setter function.

It is only a wrapper around set_host_or_hostname<false>, but that function is not cheap...

bool url_aggregator::set_host(const std::string_view input) {
  return set_host_or_hostname<false>(input);
}

Let us look at it:

template <bool override_hostname>
bool url_aggregator::set_host_or_hostname(const std::string_view input) {
  ada_log("url_aggregator::set_host_or_hostname ", input);
  ADA_ASSERT_TRUE(validate());
  ADA_ASSERT_TRUE(!helpers::overlaps(input, buffer));
  if (has_opaque_path) {
    return false;
  }

  std::string previous_host(get_hostname());
  uint32_t previous_port = components.port;

  size_t host_end_pos = input.find('#');
  std::string _host(input.data(), host_end_pos != std::string_view::npos
                                      ? host_end_pos
                                      : input.size());
  helpers::remove_ascii_tab_or_newline(_host);
  std::string_view new_host(_host);

  // If url's scheme is "file", then set state to file host state, instead of
  // host state.
  if (type != ada::scheme::type::FILE) {
    std::string_view host_view(_host.data(), _host.length());
    auto [location, found_colon] =
        helpers::get_host_delimiter_location(is_special(), host_view);

    // Otherwise, if c is U+003A (:) and insideBrackets is false, then:
    // Note: the 'found_colon' value is true if and only if a colon was
    // encountered while not inside brackets.
    if (found_colon) {
      if constexpr (override_hostname) {
        return false;
      }
      std::string_view sub_buffer = new_host.substr(location + 1);
      if (!sub_buffer.empty()) {
        set_port(sub_buffer);
      }
    }
    // If url is special and host_view is the empty string, validation error,
    // return failure. Otherwise, if state override is given, host_view is the
    // empty string, and either url includes credentials or url's port is
    // non-null, return.
    else if (host_view.empty() &&
             (is_special() || has_credentials() || has_port())) {
      return false;
    }

    // Let host be the result of host parsing host_view with url is not special.
    if (host_view.empty() && !is_special()) {
      if (has_hostname()) {
        clear_hostname();  // easy!
      } else if (has_dash_dot()) {
        add_authority_slashes_if_needed();
        delete_dash_dot();
      }
      return true;
    }

    bool succeeded = parse_host(host_view);
    if (!succeeded) {
      update_base_hostname(previous_host);
      update_base_port(previous_port);
    } else if (has_dash_dot()) {
      // Should remove dash_dot from pathname
      delete_dash_dot();
    }
    return succeeded;
  }

  size_t location = new_host.find_first_of("/\\?");
  if (location != std::string_view::npos) {
    new_host.remove_suffix(new_host.length() - location);
  }

  if (new_host.empty()) {
    // Set url's host to the empty string.
    clear_hostname();
  } else {
    // Let host be the result of host parsing buffer with url is not special.
    if (!parse_host(new_host)) {
      update_base_hostname(previous_host);
      update_base_port(previous_port);
      return false;
    }

    // If host is "localhost", then set host to the empty string.
    if (helpers::substring(buffer, components.host_start,
                           components.host_end) == "localhost") {
      clear_hostname();
    }
  }
  ADA_ASSERT_TRUE(validate());
  return true;
}

This function in turn will call parse_host.

ada_really_inline bool url_aggregator::parse_host(std::string_view input) {
  ada_log("url_aggregator:parse_host \"", input, "\" [", input.size(),
          " bytes]");
  ADA_ASSERT_TRUE(validate());
  ADA_ASSERT_TRUE(!helpers::overlaps(input, buffer));
  if (input.empty()) {
    return is_valid = false;
  }  // technically unnecessary.
  // If input starts with U+005B ([), then:
  if (input[0] == '[') {
    // If input does not end with U+005D (]), validation error, return failure.
    if (input.back() != ']') {
      return is_valid = false;
    }
    ada_log("parse_host ipv6");

    // Return the result of IPv6 parsing input with its leading U+005B ([) and
    // trailing U+005D (]) removed.
    input.remove_prefix(1);
    input.remove_suffix(1);
    return parse_ipv6(input);
  }

  // If isNotSpecial is true, then return the result of opaque-host parsing
  // input.
  if (!is_special()) {
    return parse_opaque_host(input);
  }
  // Let domain be the result of running UTF-8 decode without BOM on the
  // percent-decoding of input. Let asciiDomain be the result of running domain
  // to ASCII with domain and false. The most common case is an ASCII input, in
  // which case we do not need to call the expensive 'to_ascii' if a few
  // conditions are met: no '%' and no 'xn-' subsequence.

  // Often, the input does not contain any forbidden code points, and no upper
  // case ASCII letter, then we can just copy it to the buffer. We want to
  // optimize for such a common case.
  uint8_t is_forbidden_or_upper =
      unicode::contains_forbidden_domain_code_point_or_upper(input.data(),
                                                             input.size());
  // Minor optimization opportunity:
  // contains_forbidden_domain_code_point_or_upper could be extend to check for
  // the presence of characters that cannot appear in the ipv4 address and we
  // could also check whether x and n and - are present, and so we could skip
  // some of the checks below. However, the gains are likely to be small, and
  // the code would be more complex.
  if (is_forbidden_or_upper == 0 &&
      input.find("xn-") == std::string_view::npos) {
    // fast path
    update_base_hostname(input);
    if (checkers::is_ipv4(get_hostname())) {
      ada_log("parse_host fast path ipv4");
      return parse_ipv4(get_hostname(), true);
    }
    ada_log("parse_host fast path ", get_hostname());
    return true;
  }
  // We have encountered at least one forbidden code point or the input contains
  // 'xn-' (case insensitive), so we need to call 'to_ascii' to perform the full
  // conversion.

  ada_log("parse_host calling to_ascii");
  std::optional<std::string> host = std::string(get_hostname());
  is_valid = ada::unicode::to_ascii(host, input, input.find('%'));
  if (!is_valid) {
    ada_log("parse_host to_ascii returns false");
    return is_valid = false;
  }
  ada_log("parse_host to_ascii succeeded ", *host, " [", host->size(),
          " bytes]");

  if (std::any_of(host.value().begin(), host.value().end(),
                  ada::unicode::is_forbidden_domain_code_point)) {
    return is_valid = false;
  }

  // If asciiDomain ends in a number, then return the result of IPv4 parsing
  // asciiDomain.
  if (checkers::is_ipv4(host.value())) {
    ada_log("parse_host got ipv4 ", *host);
    return parse_ipv4(host.value(), false);
  }

  update_base_hostname(host.value());
  ADA_ASSERT_TRUE(validate());
  return true;
}
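
To make the fast-path condition above concrete, here is a minimal sketch of the kind of check involved. It is not ada's implementation: the real code uses the table-driven unicode::contains_forbidden_domain_code_point_or_upper, whereas the names needs_slow_path and can_take_fast_path below are hypothetical. The list of forbidden domain code points follows the WHATWG URL standard, and treating every non-ASCII byte as requiring the slow path is an assumption of this sketch.

```cpp
#include <string_view>

// Simplified stand-in, NOT ada's implementation: true if the byte is an
// upper-case ASCII letter, a forbidden domain code point (WHATWG URL
// standard), or a non-ASCII byte.
constexpr bool needs_slow_path(unsigned char c) {
  if (c >= 0x80) return true;               // non-ASCII (assumption: slow path)
  if (c <= 0x20 || c == 0x7F) return true;  // C0 controls, space, DELETE
  if (c >= 'A' && c <= 'Z') return true;    // would need lower-casing
  switch (c) {
    case '#': case '%': case '/': case ':': case '<': case '>': case '?':
    case '@': case '[': case '\\': case ']': case '^': case '|':
      return true;
    default:
      return false;
  }
}

// Hypothetical helper mirroring the condition guarding the fast path in
// parse_host: no forbidden or upper-case byte, and no "xn-" subsequence.
constexpr bool can_take_fast_path(std::string_view host) {
  for (unsigned char c : host) {
    if (needs_slow_path(c)) return false;
  }
  return host.find("xn-") == std::string_view::npos;
}
```

Under these assumptions, a host such as example.com passes the check and can be copied as is, while Example.com, a%20b.com, or any xn-- label falls through to the to_ascii path.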

We are replacing all of that with:

void url_aggregator::set_host_to_base_host(const std::string_view input) noexcept {
  ada_log("url_aggregator::set_host_to_base_host ", input);
  ADA_ASSERT_TRUE(validate());
  ADA_ASSERT_TRUE(!helpers::overlaps(input, buffer));
  if (type != ada::scheme::type::FILE) {
    // Let host be the result of host parsing host_view with url is not special.
    if (input.empty() && !is_special()) {
      if (has_hostname()) {
        clear_hostname();
      } else if (has_dash_dot()) {
        add_authority_slashes_if_needed();
        delete_dash_dot();
      }
      return;
    }
  }
  update_base_hostname(input);
  ADA_ASSERT_TRUE(validate());
  return;
}

This tiny function is obviously cheaper. :-)

Its most expensive call is update_base_hostname, which is quite cheap:

inline void url_aggregator::update_base_hostname(const std::string_view input) {
  ada_log("url_aggregator::update_base_hostname ", input, " [", input.size(),
          " bytes], buffer is '", buffer, "' [", buffer.size(), " bytes]");
  ADA_ASSERT_TRUE(validate());
  ADA_ASSERT_TRUE(!helpers::overlaps(input, buffer));

  // This next line is required for when parsing a URL like `foo://`
  add_authority_slashes_if_needed();

  bool has_credentials = components.protocol_end + 2 < components.host_start;
  uint32_t new_difference =
      replace_and_resize(components.host_start, components.host_end, input);

  if (has_credentials) {
    buffer.insert(components.host_start, "@");
    new_difference++;
  }
  components.host_end += new_difference;
  components.pathname_start += new_difference;
  if (components.search_start != url_components::omitted) {
    components.search_start += new_difference;
  }
  if (components.hash_start != url_components::omitted) {
    components.hash_start += new_difference;
  }
  ADA_ASSERT_TRUE(validate());
}
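
The reason update_base_hostname is cheap is that it amounts to one in-place replacement in the buffer followed by a few offset adjustments. The toy model below sketches that idea only. The toy_url type and its members are assumptions made for illustration; it ignores credentials, ports, search and hash, and it is not ada's url_aggregator.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Toy model (hypothetical, not ada's url_aggregator): one flat buffer plus
// offsets marking where the host and the pathname live.
struct toy_url {
  std::string buffer;
  uint32_t host_start;      // index of the first host byte
  uint32_t host_end;        // one past the last host byte
  uint32_t pathname_start;  // index of the first pathname byte

  std::string_view host() const {
    return std::string_view(buffer).substr(host_start, host_end - host_start);
  }
  std::string_view pathname() const {
    return std::string_view(buffer).substr(pathname_start);
  }

  // Replace the host in place, then shift every later offset by the change
  // in length. No parsing or validation happens here.
  void replace_host(std::string_view new_host) {
    assert(host_start <= host_end && host_end <= pathname_start &&
           pathname_start <= buffer.size());
    const uint32_t old_size = host_end - host_start;
    buffer.replace(host_start, old_size, new_host);
    // Unsigned arithmetic wraps around, so a shorter host yields a
    // "negative" difference that still adjusts the offsets correctly.
    const uint32_t difference = uint32_t(new_host.size()) - old_size;
    host_end += difference;
    pathname_start += difference;
  }
};
```

For instance, starting from the buffer https://example.com/path with host_start = 8 and host_end = pathname_start = 19, calling replace_host("a.org") leaves the buffer as https://a.org/path and moves both host_end and pathname_start to 13.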

So let us look at a little benchmark. The change does not affect cases where we parse a single URL; it is only relevant when we have a base URL. We have one benchmark for this (wpt_bench).

Apple system:

Before:

BasicBench_AdaURL_url_aggregator     183060 ns       182722 ns         3897 speed=112.652M/s time/byte=8.87687ns time/url=214.462ns url/s=4.66283M/s

After:

BasicBench_AdaURL_url_aggregator     178301 ns       177918 ns         3894 speed=115.694M/s time/byte=8.64351ns time/url=208.824ns url/s=4.78872M/s

Linux system:

Before:

BasicBench_AdaURL_url_aggregator     267213 ns       266918 ns         2621 GHz=3.19318 cycle/byte=39.09 cycles/url=944.399 instructions/byte=104.67 instructions/cycle=2.67766 instructions/ns=8.55026 instructions/url=2.52878k ns/url=295.755 speed=77.1174M/s time/byte=12.9672ns time/url=313.284ns url/s=3.192M/s

After:

BasicBench_AdaURL_url_aggregator     265053 ns       264735 ns         2666 GHz=3.1932 cycle/byte=37.3268 cycles/url=901.802 instructions/byte=101.74 instructions/cycle=2.72567 instructions/ns=8.70359 instructions/url=2.45801k ns/url=282.413 speed=77.7533M/s time/byte=12.8612ns time/url=310.722ns url/s=3.21831M/s
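
To put the raw counters above in relative terms, the small sketch below computes the percentage reduction for a few of them. The helper name percent_reduction is hypothetical; the inputs are copied from the benchmark lines above (instructions/url=2.52878k is read as 2528.78). With these figures, time per URL drops by roughly 2.6% on the Apple system and 0.8% on the Linux system, while on Linux instructions per URL drop by about 2.8% and cycles per URL by about 4.5%.

```cpp
// Relative reduction, in percent, when a cost metric goes from `before`
// to `after` (hypothetical helper for this comparison).
constexpr double percent_reduction(double before, double after) {
  return 100.0 * (before - after) / before;
}

// Figures copied from the wpt_bench output (Before / After).
constexpr double apple_time_per_url = percent_reduction(214.462, 208.824);
constexpr double linux_time_per_url = percent_reduction(313.284, 310.722);
constexpr double linux_instructions_per_url =
    percent_reduction(2528.78, 2458.01);
constexpr double linux_cycles_per_url = percent_reduction(944.399, 901.802);
```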

Conclusion

This seems like a clear (if small) win when parsing URLs with a base.

@lemire lemire requested a review from anonrig October 16, 2024 19:09
@@ -1109,6 +1109,28 @@ inline std::ostream &operator<<(std::ostream &out,
    const ada::url_aggregator &u) {
  return out << u.to_string();
}

void url_aggregator::set_host_to_base_host(
Member

Functions with the set_ prefix are public, and the update_ ones are private. Can we change the function name and add @private as a comment so it won't be included in the documentation?

Member Author

I changed the name. I don't see where we use @private. The function is already private.

Member

Here's an example:

* @private

@private just hides it from doxygen.

Member Author

Yes, but in that example, we are marking private something that is public in C++.
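
The distinction discussed in this thread can be sketched as follows. The class and member names are hypothetical and not taken from ada. The Doxygen @private command marks a documented member as private for documentation purposes, so with Doxygen's default EXTRACT_PRIVATE = NO it is left out of the generated pages, yet it does not change C++ access control.

```cpp
// Hypothetical class, for illustration only.
class example_url {
 public:
  /// Returns the stored port.
  constexpr int get_port() const { return port_; }

  /**
   * @private
   * Public in C++ (any caller may use it), but tagged so that Doxygen
   * treats it as private and omits it under the default configuration.
   */
  constexpr void update_port_unchecked(int port) { port_ = port; }

 private:
  // Already private in C++: Doxygen skips it by default, so adding
  // @private here would make no difference.
  int port_ = 0;
};
```

In this sketch update_port_unchecked remains callable from any code even though it is hidden from the documentation, which is the situation the @private tag is meant for; a member that is already in a private section gains nothing from it.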

@lemire lemire merged commit 63cb72a into main Oct 16, 2024
28 checks passed
@lemire lemire deleted the set_host_to_base_host branch October 16, 2024 19:38