Incremental parsing is ineffective when a new tag is opened #23

marijnh · 2021-02-17T08:59:44Z

Due to the way changes in the external scanner state prevent reuse of nodes created with another state, if you open a new tag (which is a common editing operation) somewhere in a big document, everything coming after that tag will be re-parsed. (Same with removing or renaming an opening tag.) This is somewhat similar to the situation when you open a block comment marker, but there the structure of the content actually changes—here you will in most situations just end up creating the exact same nodes.

But not in all situations, I guess, due to the affront to context-freedom that is implicitly closed elements. So I guess this behavior is defensible. But it seems unfortunate. Is it a conscious design decision? Have there been any attempts at workarounds? (Like maybe a mechanism to make dependence on scanner state more fine-grained and disabling the state compare for nodes that have no scanner-state-dependent tokens in them—I think that could be made to work for the HTML case, but it may not be worthwhile in any other situation. I noticed the scanner state approach fits the Python scanner very well—there it encodes exactly the thing you need, without wasting any useful opportunities for reuse.)

As always, feel free to close as 'out of scope'.

Crude benchmark

let Parser = require("tree-sitter")
let p = new Parser
p.setLanguage(require("tree-sitter-html"))

function time(name, f) {
  f() // warmup
  for (let t0 = Date.now(), count = 1, t;; count++) {
    f()
    if ((t = Date.now() - t0) > 1000) {
      console.log(name, (count / (t / 1000)).toFixed(2) + "/s")
      break
    }
  }
}

let doc = "<html>\n  <body>\n" + "    <p>Lots of <span>content</span> here</p>\n".repeat(10000)

time("Parse from scratch", () => p.parse(doc))

let ast2 = p.parse(doc)
let doc2 = doc.slice(0, 15) + "<div>" + doc.slice(15)
ast2.edit({
  startIndex: 15,
  oldEndIndex: 15,
  newEndIndex: 20,
  startPosition: {row: 1, column: 8},
  oldEndPosition: {row: 1, column: 8},
  newEndPosition: {row: 1, column: 13},
})

time("Opened new tag", () => p.parse(doc2, ast2))

let ast3 = p.parse(doc)
let doc3 = doc.slice(0, 15) + "<div></div>" + doc.slice(15)
ast3.edit({
  startIndex: 15,
  oldEndIndex: 21,
  newEndIndex: 20,
  startPosition: {row: 1, column: 8},
  oldEndPosition: {row: 1, column: 8},
  newEndPosition: {row: 1, column: 19},
})

time("Inserted new tag", () => p.parse(doc3, ast3))

// On my machine:
// Parse from scratch 9.36/s
// Opened new tag 9.31/s
// Inserted new tag 49.65/s

The text was updated successfully, but these errors were encountered:

maxbrunsfeld · 2021-02-19T00:36:12Z

This is a great point @marijnh. I don't think I'd ever considered it. I'd love to avoid the performance pitfall, but it does seem like a bit of work.

To recap Tree-sitter's current behavior - We discard a subtree if its preceding external scanner state is different from the preceding external scanner state on our parse stack.

I think we could generalize this in the following way. When deciding whether or not to reuse a subtree (assuming that it is reusable in all other respects):

If its preceding external scanner state is equal to the preceding external scanner state on our parse stack, then reuse the node (like we do today).
Otherwise, check if the node leaves the external scanner state unchanged. That is, check if the last external token within the node has the same external scanner state as the node's preceding external scanner state. This would often be true in the case of HTML, because in the process of parsing the node, we would have popped all of the same tags that we had pushed onto the parse stack.
If this equality ☝️ does not hold (the node has changed the externals scanner state), then discard the node (or at least do not reuse it in its entirety).
But if the equality does hold, then we perform a custom, language-specific compatibility check between the previous external scanner state on our parse stack, and the node's preceding external scanner state. If the two states are found to be compatible, then we can reuse the node.

When pushing this node onto the parse stack, we would need to add some logic that ensures that we don't allow this node to affect the stored "preceding external scanner state" value associated with our current parse stack.

I think the compatibility check could have a signature like this:

bool tree_sitter_html_external_scanner_compare(
  const void *scanner,
  const char *current_state,
  const char *state_before_this_subtree
);

maxbrunsfeld · 2021-02-19T00:40:44Z

In the case of HTML, the compare function could check that every open tag in current_state is also present in state_before_this_subtree. This would ensure that none of the tags in the subtree would cause a currently-opened tag to be implicitly closed.

Do you think this would be enough to get good incremental performance, or is the reuse logic too restrictive? Or would we need to provide more information about the subtree to the compare function?

marijnh · 2021-02-19T08:20:59Z

Thanks for the thoughtful answer. In Lezer (where I play decidedly more loose and fast with incremental parsing safety in an attempt to keep tree size and code complexity small) I am now using a thing where a tokenizer can turn off these state checks for node reuse entirely, and I haven't been able to find a counterexample where incremental parsing of HTML actually fails with the check turned off. But that may just mean I haven't looked hard enough. The rules you propose sound more robust.

gpetrov · 2024-10-08T08:15:21Z

@maxbrunsfeld any new developments on this? Seems like an urgent issue specially in html parsing. We also need a quicker, notification about the really changed html nodes

maxbrunsfeld mentioned this issue Feb 20, 2021

Tree-sitter 1.0 Checklist tree-sitter/tree-sitter#930

Open

30 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incremental parsing is ineffective when a new tag is opened #23

Incremental parsing is ineffective when a new tag is opened #23

marijnh commented Feb 17, 2021 •

edited

Loading

maxbrunsfeld commented Feb 19, 2021 •

edited

Loading

maxbrunsfeld commented Feb 19, 2021

marijnh commented Feb 19, 2021 •

edited

Loading

gpetrov commented Oct 8, 2024

Incremental parsing is ineffective when a new tag is opened #23

Incremental parsing is ineffective when a new tag is opened #23

Comments

marijnh commented Feb 17, 2021 • edited Loading

maxbrunsfeld commented Feb 19, 2021 • edited Loading

maxbrunsfeld commented Feb 19, 2021

marijnh commented Feb 19, 2021 • edited Loading

gpetrov commented Oct 8, 2024

marijnh commented Feb 17, 2021 •

edited

Loading

maxbrunsfeld commented Feb 19, 2021 •

edited

Loading

marijnh commented Feb 19, 2021 •

edited

Loading