parser.allocate will reallocate buffers - call allocate only to change depth #79

Open
TysonAndre opened this issue Oct 2, 2022 · 0 comments · May be fixed by #81

https://github.com/simdjson/simdjson/blob/master/doc/dom.md#reusing-the-parser-for-maximum-efficiency

If you're using simdjson to parse multiple documents, or in a loop, you should make a parser once and reuse it. The simdjson library will allocate and retain internal buffers between parses, keeping buffers hot in cache and keeping memory allocation and initialization to a minimum. In this manner, you can parse terabytes of JSON data without doing any new allocation.

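The class simdjson::dom::parser provides only set_max_depth() and allocate(), not set_capacity(). Calling allocate() with a capacity or depth that differs from the current one reallocates the parser's buffers, so to change just the max depth, call allocate() only when the depth has actually changed, which should be infrequent.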

  • parser::parse_into_document calls ensure_capacity already, and ensure_capacity calls allocate if needed
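
To make the difference concrete, here is a minimal sketch that counts buffer reallocations under both calling patterns. It does not use simdjson itself: MockParser is a hypothetical stand-in whose allocate() copies only the early-return check from the parser::allocate source quoted later in this issue, and whose ensure_capacity() is a simplified guess at "grow only when too small". The helper names reallocations_naive and reallocations_guarded are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for simdjson::dom::parser. It allocates nothing and
// only counts how often the real parser would have replaced its buffers.
struct MockParser {
    std::size_t capacity = 0;
    std::size_t max_depth = 0;
    bool has_buffers = false;
    int reallocations = 0;

    // Same early return as the quoted parser::allocate: nothing happens only
    // when both the capacity and the depth are unchanged.
    void allocate(std::size_t new_capacity, std::size_t new_max_depth) {
        if (has_buffers && new_capacity == capacity && new_max_depth == max_depth) {
            return;
        }
        ++reallocations;
        has_buffers = true;
        capacity = new_capacity;
        max_depth = new_max_depth;
    }

    // Simplified assumption about ensure_capacity: grow only if too small.
    void ensure_capacity(std::size_t desired) {
        if (!has_buffers || desired > capacity) {
            allocate(desired, max_depth);
        }
    }
};

// Pattern to avoid: request the exact document length before every parse.
inline int reallocations_naive(const std::vector<std::size_t>& lengths, std::size_t depth) {
    MockParser parser;
    for (std::size_t len : lengths) {
        parser.allocate(len, depth);
    }
    return parser.reallocations;
}

// Proposed pattern: call allocate() only when the depth changes, and let
// ensure_capacity() grow the buffers when a document is larger than before.
inline int reallocations_guarded(const std::vector<std::size_t>& lengths, std::size_t depth) {
    MockParser parser;
    for (std::size_t len : lengths) {
        if (!parser.has_buffers || depth != parser.max_depth) {
            parser.allocate(std::max(parser.capacity, len), depth);
        }
        parser.ensure_capacity(len);
        assert(parser.capacity >= len);
    }
    return parser.reallocations;
}
```

For document lengths 100, 50, 200, 50, 100 at a constant depth, the naive pattern replaces the buffers on every call in this model, while the guarded pattern does so only for the first document and for the one that is larger than anything seen before.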

Related to #73

Note that simdjson never needs a capacity beyond the range of a uint32, and it rejects requests for larger capacities:

/** The maximum document size supported by simdjson. */
constexpr size_t SIMDJSON_MAXSIZE_BYTES = 0xFFFFFFFF;
simdjson_warn_unused simdjson_inline error_code parser::allocate(size_t new_capacity, size_t new_max_depth) noexcept {
  if (new_capacity > max_capacity()) { return CAPACITY; }
  if (string_buf && new_capacity == capacity() && new_max_depth == max_depth()) { return SUCCESS; }

  // string_capacity copied from document::allocate
  _capacity = 0;
  size_t string_capacity = SIMDJSON_ROUNDUP_N(5 * new_capacity / 3 + SIMDJSON_PADDING, 64);
  string_buf.reset(new (std::nothrow) uint8_t[string_capacity]);
#if SIMDJSON_DEVELOPMENT_CHECKS
  start_positions.reset(new (std::nothrow) token_position[new_max_depth]);
#endif
  if (implementation) {
    SIMDJSON_TRY( implementation->set_capacity(new_capacity) );
    SIMDJSON_TRY( implementation->set_max_depth(new_max_depth) );
  } else {
    SIMDJSON_TRY( simdjson::get_active_implementation()->create_dom_parser_implementation(new_capacity, new_max_depth, implementation) );
  }
  _capacity = new_capacity;
  _max_depth = new_max_depth;
  return SUCCESS;
}
TysonAndre added a commit to TysonAndre/simdjson_php that referenced this issue Oct 2, 2022
Closes crazyxman#80 - simdjson_is_valid() and other PHP functions would
previously return false when out of memory

- Related to crazyxman#60 - other php apis (using emalloc instead) will also emit
  fatal errors when out of memory and end the process.

Closes crazyxman#79 - reuse buffers for strings less than 1000000 bytes and
100000 depth. (Assumes the depth rarely changes in callers)