-
-
Notifications
You must be signed in to change notification settings - Fork 554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce cmake_node_convert_type
#522
Conversation
(I opened a thread asking about this at https://talk.commonmark.org/t/parsing-document-and-nesting-in-larger-ast/4583/2 before I decided to just make it myself.) |
Patch taken from commonmark/cmark#522
@nwellnhof do you have an opinion about this? |
I wouldn't want to allow violations of the tree structure just for convenience. Conceptually, a document node should always be the root node.
This doesn't seem like a good reason. You can simply write a helper function to extract children, or track subtrees in your own data structure, or use the "custom" node type or whatever. |
Well, one way to deal with things conceptually is to rename the document node to "no op" node or something, and then it makes sense that it can appear anywhere. If that is not desirable, then I would like to go with the custom node solution by being able to convert an existing document node to a root node. I don't like having to extra all the children for what should be a simple append-child embed, but converting the node first and them embedding it feels fine. (FWIW I come from functional programming land where having special root notes is considered somewhat suspicious. Usually any type of node in a tree can appear anywhere in that tree, and if there has to be something special on the outside, it is a different type, not just a different data constructor. @jgm should know what I am talking about :) ) |
But why? You can do that with a helper function in ~10 lines of C. |
Patch taken from commonmark/cmark#522
Patch taken from commonmark/cmark#522
@nwellnhof Because I believe embedding a subtree should be an O(1) operation. |
Typically, you only append a subtree once and the result has to be processed at some point. So it doesn't matter whether appending is O(1) or O(n). If you really need to append subtrees more than once and performance is important, you can move its children into a |
Whether or not there a practical problem for O(n) vs O(1) here, I still fine making embedding a subtree more complicated than it needs to be quite ugly. I looked at Lines 56 to 85 in 76cbc2d
|
It simply turned out this way and nobody complained in the last ~10 years. All DOM-like APIs that I know have a special root node. But libcmark really is different in this regard. The document node doesn't hold any additional data. From a quick look, it almost is a "no-op" node. The only exception seems to be the XML serializer. I still wouldn't change the API just to cater to the preferences of a single user. If you really want to push this through, please make sure that the serializers continue to work as expected and add some tests in |
So just to be clear, if all these things happens:
Then you would be willing to reconsider? Sounds good to me! I'll gladly take a stab at 1-3. |
Of course, it would be quite reasonable to include different data in the document node (e.g. metadata), and maybe we'd want to do this in some future iteration. So I'd be against renaming this as a NOOP node. (Speaking of functional programming, Pandoc's document model has a distinguished top-level node. And as @nwellnhof says, it's quite normal for there to be a distinguished top-level node in hierarchical models of documents.) I'd say: just go with the 10 lines of C and be done with it. The alternative is a number of changes to this project that will take work, might have unanticipated consequences, could constrain future changes to the document model, and don't have a large benefit. |
@jgm Oh, oops, I just pushed the the other changes before seeing this.
Well that is why I said
Per https://github.com/jgm/pandoc-types/blob/master/src/Text/Pandoc/Definition.hs#L274 Pandoc has 3 separate types C having just a single "node" type and not those 3 is understandable, but it does leave a bit ambiguous "types" vs "data constructors". If you think the right way to think of the C node type is the union of all 3 conceptual types, and so the doc node has invariants just the same way as the "block" and "inline" nodes do, then I would suggest that I contribute a function to convert a documentation node to one of the other types of nodes that also is allowed to have block children. This keeps the operation O(1) and easy, whereas with the existing interface it has to be O(n). |
(In addition to a |
How theoretical is this concern about O(1) vs O(N)? |
Entirely theoretical, I freely admit. It's all aesthetics for me about making things that can be simple and cheap actually be simple and cheap. |
I'm just not convinced it's worth it. Any changes of this scale have unanticipated consequences and may produce unanticipated bugs or constraint future development, and this software has a lot of users. So there's a high bar for making changes, and I don't think considerations of aesthetics are going to get you across it. |
I hope the alternative plan of just having the new operations is a little easier to reason about, since there are no data structure invariant changes? Also, I just realized actually the concat/move operator is enough, because one can create a new node, move all the children, and then delete the old new, so we don't need to worry about how converting the type effects the union fields. This last observation makes me quite pleased that new version should be demonstrably safe ;). |
Ah no, my concat thing isn't so special because updating the parent points is still O(n) :(. Still can do the node conversion however. |
0df1baa
to
ef073ca
Compare
cmake_node_convert_type
OK I pushed the version. The first two commits are pure refactors, factoring out new private static functions, and does not change the interface. The second commit introduces The 3rd commit is the my same subdocument test, but now using |
ef073ca
to
13158e4
Compare
OK I am quite happy with the There are 4 "classes" (of node types):
and then None + Any adjoined top and bottom for subtyping purposes. From those classes the correspondence to e.g. Pandoc types should be pretty clear (document root -> And finally, those classes make the If this first commit looks safe in isolation, then I think the other three "fall out" very simply. |
// Success path | ||
node->type = type; | ||
S_cmark_node_init(node); | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some node types, you have to free old content before reinitializing the node, see cmark_node_free
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is now fixed.
I have been hinting strongly that I am unlikely to merge this, but let me make that more explicit. Changing the node type is a tricky operation. Different nodes have different data associated with them, for example. I think you're trying to cover all the bases in your patch, but it ends up being a complex patch that might well introduce bugs. Given that there is no pressing need for this change, I'm inclined to be conservative and not make it. In addition, I think that there are drawbacks to integrating an existing document by creating a custom block node and putting the document's children there. That creates a distinction between a document created this way and a document that would have been created from scratch, but do we want that distinction? |
In doing so: - Factor our `S_compute_allowed_child` - Create a new (private for now) notion of a "node_class", which separats the node types into buckets based on invariants. - Combine `is_block` and `is_inline` into `S_node_class` This will be used in the new `cmake_node_convert_type`.
This will be used in the new `cmake_node_convert_type`.
This will be used in the new `cmake_node_convert_type`.
13158e4
to
f8e59f3
Compare
I understand that you are leaning against :). I am trying to make new revisions in hopes that they actually address concerns, not wear you down or something.
What I am trying to do is in the latest versions is have a bunch of no-op refactors such that the marginal complexity of the new function is quite low. Right now, it is just this: diff --git a/src/node.c b/src/node.c
index e5c83db..da43a2d 100644
--- a/src/node.c
+++ b/src/node.c
@@ -175,6 +175,23 @@ cmark_node *cmark_node_new(cmark_node_type type) {
return cmark_node_new_with_mem(type, &DEFAULT_MEM_ALLOCATOR);
}
+int cmake_node_convert_type(cmark_node *node, cmark_node_type type) {
+ // Only need to validate children if we have any
+ if (node->first_child) {
+ enum node_class
+ old = S_compute_allowed_child(node->type),
+ new = S_compute_allowed_child(type);
+ if (!S_node_class_le(old, new))
+ return 0;
+ }
+
+ // Success path
+ S_cmark_node_deinit(node);
+ node->type = type;
+ S_cmark_node_init(node);
+ return 1;
+}
+
// Free a cmark_node list and any children.
static void S_free_nodes(cmark_node *e) {
cmark_mem *mem = e->mem; which I think is quite good! (Much better than the version where you last left a comment.)
And I tried to only reuse existing allocation / deallocation code (the latter just now in response to #522 (comment)) to make sure I am not breaking any new ground. I could not refactor any new code, and just have pure additions. Maybe that would result in a simpler patch, but that would be quite scary to me as now a bunch of logic is duplicated! If you're really sure there is no possible version of this you would accept. We can just close the PR. |
@nwellnhof do you have opinions about whether this should go forward? |
FWIW, I just thought of one more variation, which is a "parse into" parsing variant where instead of the parser creating the root node, it adds children to whatever root node you give it. With this, node type conversion is not necessary to embed parsed documents in a larger. |
The variation is interesting. It would be easy to factor that out of our existing function. I guess one question is whether there's any utility in being able to reassign block node types besides your original use case. If not, then either your variation or a function that just converts a document into a custom node could do the trick, with less complexity than the above patch. |
I am not sure there is. I realized "document root" and "list item" are both no op nodes. (I think those are extrinsic definitions of a list of nodes based on their location in the AST, rather than intrinsic definition of a "real" node.) In my test (and my external usage of this) I was converting a doc root to list tem (no op -> no op), but if you don't want do that, e.g. are adding a new section with heading, you are out of luck as no-op nodes don't go elsewhere --- you might have nothing to convert to note to but a custom node. (Pandoc, by allowing you to alias So given the "intrusive" design of the C where the I'll close this :) #524 I pushed just before you commented :) |
This is useful when constructing ASTs that contain a mix of computed and parsed data. "sub documents" that are parsed have a single root document node, and rather than having to extract its children, those can be placed as-is in a larger parse tree.