Web languages toolchain normalization #60
sashafirsov started this conversation in Ideas
Replies: 1 comment
LLVM perhaps has the answer on schema definition semantics and lingo.
Because the HTML pages of most web applications are the product of multiple transformations mixing various programming techniques and languages, parsing, validation, transpiling, source maps, syntax highlighting, and IDE suggestions are essential to keep the complexity of mixed content under control.
Source maps
To beat the complexity problem, the web apps industry has tried different approaches without much success, such as moving to a single JS language and dumping the separation of concerns (business logic, styling, HTML as structure).
Why "without success" if the whole world is using it? Because the basic needs of software developer productivity were missed. The ability to debug usually starts from source maps, i.e. mapping the final native DOM, CSS, and JS back to the source. It happens that such mapping is completely absent for HTML, messed up or missing for CSS, and only JS has a relatively reliable source-to-code mapping mechanism.
On the template-rendering side the source maps are lost as well: neither XSLT nor JSX can be traced back from the browser's DOM inspector.
The React plugin with virtual DOM is a half-baked solution, as it is also missing source maps.
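For comparison, this is roughly what JS tooling already has and HTML/CSS lack. A minimal Source Map v3 object is sketched below per the source map specification; the `mappings` value here is illustrative, not a meaningful decoding.

```typescript
// Minimal Source Map v3 shape. The "mappings" field is a base64-VLQ
// encoded list of segments, each mapping a generated position back to a
// source file/line/column; the string below is only a placeholder.
const sourceMap = {
  version: 3,
  file: "bundle.js",
  sourceRoot: "",
  sources: ["src/app.ts"],
  names: ["render"],
  mappings: "AAAA,SAASA",
};

// A debugger walks "sources"/"mappings" to jump from generated JS back
// to the original file -- exactly the facility HTML and CSS lack today.
console.log(sourceMap.sources[0]); // "src/app.ts"
```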
Debugging
While changes to the DOM can be tracked in dev tools, doing so at the source level is not possible because there is no source map for DOM. The same applies to CSS.
XSLT, which is native to the browser, has no debugging support in the browser, and even in isolation it is barely available in IDEs.
IDE suggestions/syntax highlight/validation
These have limited support, and a new framework usually starts without any. As soon as there is more than one transformation, the mixed content loses the embedded-scope language support.
data as source
This is missed completely. The final DOM is populated with data that has no visibility into where it originated.
For JSON/XML, the data has the same complexity as the HTML/template itself when it comes to generating a source map. The same applies to JS object transformation chains.
Problem and its Root Cause
The problem is cross-concern and cross-transformation transparency.
Why it is a growing concern?
The use of 3rd-party libraries beyond the standard base set is increasing drastically: partially due to the growth of open source popularity, partially because embedding 3rd-party content and code has become mainstream. A business specialized in a particular feature usually delivers better quality and pricing, but that means its own lifecycle and an insulated development stack. With the rise of AI-assisted and AI-generated code volume, the effect will be an order of magnitude larger.
At such a scale, the transparency of transformations becomes not just a Developer Experience (DX) issue but a critical cybersecurity concern: even with the sources and data available, their effect on the final web page DOM and behavior is not transparent and hence cannot be rationally validated.
HTML parsing
There is no standard definition of HTML expressed as a document schema. Instead, it is an algorithm with tons of if-else branches; look into Chromium or the NU validator (the FF parser) for a perfect example of how not to code. An attempt to normalize it via an XML schema definition was rejected by the standards body because it does not reflect the algorithm-based approach.
Why it is a growing concern?
Because it is a reason for browser vendors to reject web stack improvements, and because it makes the parser an enormously complex implementation problem. The WebKit folks suggest not writing a parser in toolchains but using the browser's
innerHTML
capabilities 🤯. Besides the synchronous nature and the absence of streaming (the whole string has to be passed at once), bringing a browser executable into the build toolchain of quite niche-oriented environments is not realistic. As of now such toolchains are either Node(JS), Java, or .NET, and none of them tolerates an external process for an otherwise routine inline operation. The primary obstacle to changing HTML is not the complexity of feature implementation; it is the fragility of an algorithm that reacts so unpredictably to changes. Think of it: you cannot change a feature because of the parser. Yuck.
The Declarative Custom Elements and fully functional templating in HTML proposals are now blocked because of vendors' unwillingness to change the parser. And that is understandable. The root cause is the missing normalized document schema and a reference parser implementation with a streaming interface and multithreaded processing.
XML stack
Is in a better position than HTML, a little. It does have schema definitions, but those are not sufficient to describe the full object model relations. And when it comes to producing HTML from XML via XSLT transformation, the same problems as above arise: HTML does not have a schema.
Dynamic schema
XSLT can alias template calls by utilizing the same concept as DCE: the custom tag. But dynamically defined custom tags, in both XSLT and HTML, cannot be expressed by current schema definition protocols.
toolchain - validators, suggestions, transpilers
These share the parser as a common part. Without a declarative schema definition, these tools have no ability to use a generic parser. As the compliant parser algorithm exists only in Java (as part of the NU Validator, compiled to C in Firefox) or in C++ in WebKit, the tools have to rely on limited-compatibility parsing libraries that are also far from generic streaming/async/multithreading conventions, hence DX-unfriendly and unreliable.
Essential for transformation transparency architectural elements
Embedded content types
HTML is the "structure" part of a document, but HTML is not the only structure that is legitimate on a web page. SVG and MathML became part of the HTML5 tag set, and there are hundreds of other content types actively used by developers. Many are squeezed into HTML as SCRIPT with a custom
type
, some as XML comments, others as encoded strings in attributes. Each has its own language and parser. For some magic reason HTML5 does not allow a "data island" with its own parser, neither an abstract one nor the XML one. By legitimizing a parser switch for a particular content type, the need for those zillions of underdeveloped formats would vanish. And with a schema allowed to be associated with a content type, its validation, transpiling with source maps, streaming parsing, etc. would be enabled, which would improve the quality, performance, and DX of custom content types.
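The parser-switch idea above can be sketched as a registry keyed by content type. Everything here is an invented illustration, not an existing API; the naive CSV parser in particular ignores quoting rules.

```typescript
// Hypothetical sketch: a registry that switches parsers by content type,
// the way legitimized "data islands" could inside an HTML document.
type Parser = (src: string) => unknown;

const parsers = new Map<string, Parser>([
  ["application/json", JSON.parse],
  // naive CSV split, for illustration only
  ["text/csv", (s) => s.trim().split("\n").map((line) => line.split(","))],
]);

function parseIsland(contentType: string, src: string): unknown {
  const parse = parsers.get(contentType);
  if (!parse) throw new Error(`no parser registered for ${contentType}`);
  return parse(src);
}

const rows = parseIsland("text/csv", "a,b\n1,2") as string[][];
```

Associating a schema with each registered content type would then give every island validation and source-map support for free.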
reference instead of copy
Neither HTML nor classic XML schema has a generic concept of a resource reference. An image embedded into the page at the HTML level cannot be reused in another part; the same applies to a template. That is insane in terms of resource allocation, memory, and processing caching.
The frameworks that render DOM utilize "memoization", but the browser DOM simply does not have an entity that could be the subject of caching and reuse.
attributes
One purpose of attributes is to pass parameters to the implementation. But there is no generic mechanism to pass anything other than string content. By enabling ATTRIBUTE as a tag, with generic parsing as for any other tag with its own content type (see Embedded content types above), this gap in using HTML as a web application structure would be closed.
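The string-only limitation is visible in what authors do today: serialize structured data into an attribute on the way in and parse it on the way out. A tiny sketch:

```typescript
// Today's workaround for passing structured data through a string-only
// attribute: JSON round-tripping. A first-class ATTRIBUTE tag with its
// own content type would make this trip (and its escaping pitfalls)
// unnecessary.
const config = { columns: 3, sortable: true };
const attrValue = JSON.stringify(config); // ends up in data-config="..."
const restored = JSON.parse(attrValue) as typeof config;
```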
dom parts
As HTML is used as a template source, its parts need to be exposed as valid HTML markup constructs in order to be DX-friendly. HTML has no straightforward way to surround a piece of text or a sub-DOM without visual impact: markup is either subject to box-model rendering and styling (DIV or SPAN) or self-enclosed content with unavoidable side effects.
Legitimizing PART as HTML markup would solve the problem of transformation transparency: inspecting a PART could then jump to the transformation source chain. The same applies to PART as legitimate markup within an ATTRIBUTE.
Solution
Dynamic AST Schema
With declarative XML syntax, it can serve as the source of truth for schema-based parsers. The cross-language toolchain would not be an issue, as the generic parsing algorithm would be available for transpiling into any programming language.
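As a hedged sketch only: every element, attribute, and namespace name below is invented to show what a declarative, dynamically extensible schema might look like, not a proposed standard syntax.

```xml
<!-- hypothetical syntax: a schema fragment declaring an element, its
     content model, and dynamically registered custom tags -->
<ast-schema xmlns="urn:example:ast-schema">
  <element name="button" content="phrasing">
    <attribute name="disabled" type="boolean"/>
  </element>
  <!-- custom tags registered at parse time, beyond what static
       schema languages can express today -->
  <dynamic-element pattern="*-*" register="runtime"/>
</ast-schema>
```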
Generic parser
A schema-based parser with a streaming, async, and multithreaded interface would serve as the base both for the browser sources and for the whole toolchain, from validators to IDE plugins.
Its algorithms should provide the abilities outlined below.
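The streaming shape of such a parser can be sketched as a chunk-fed tokenizer. This toy version is synchronous and nowhere near spec-compliant (a real one would be async, schema-driven, and emit source positions); it only shows that token boundaries need not align with input chunk boundaries.

```typescript
// Hedged sketch of a streaming tokenizer interface, not the HTML5 algorithm.
type Token =
  | { kind: "open"; name: string }
  | { kind: "close"; name: string }
  | { kind: "text"; value: string };

function* tokenize(chunks: Iterable<string>): Generator<Token> {
  let buf = "";
  for (const chunk of chunks) {
    buf += chunk; // input arrives in arbitrary pieces, as from a network stream
    while (true) {
      if (buf.startsWith("<")) {
        const end = buf.indexOf(">");
        if (end === -1) break; // tag split across chunks: wait for more input
        const tag = buf.slice(1, end);
        buf = buf.slice(end + 1);
        if (tag.startsWith("/")) yield { kind: "close", name: tag.slice(1) };
        else yield { kind: "open", name: tag.split(/\s/)[0] };
      } else {
        const lt = buf.indexOf("<");
        if (lt === -1) break; // text may continue in the next chunk
        yield { kind: "text", value: buf.slice(0, lt) };
        buf = buf.slice(lt);
      }
    }
  }
  if (buf) yield { kind: "text", value: buf };
}

// Tag and text boundaries do not align with chunk boundaries:
const tokens = [...tokenize(["<p>Hel", "lo</p", ">"])];
```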
mixed content AST interface
The content types currently in active use include JS language variations such as TypeScript, as well as various data types (animations, 3D, etc.).
A generic AST with an API similar to TS's would cover most languages, from HTML to JS and XPath.
With parsers preserving source-map transparency and the AST available as the parsing result, cross-concern validations would become a natural part of any tool in the toolset: for example, XPath validation against the source XML and XSLT, or CSS against the template-generated DOM. The tools would also enable cross-concern optimizations such as bundling, obfuscation, and elimination of dependencies unused at run time ("tree shaking").
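One such cross-concern check can be sketched in a few lines. The emitted-class set below is a stand-in for what a source-mapped AST of the templates would actually provide:

```typescript
// Toy cross-concern check: flag CSS class selectors that no template
// ever emits, making them candidates for "tree shaking".
const emittedClasses = new Set(["card", "card-title"]); // from template ASTs
const cssClassSelectors = [".card", ".card-title", ".legacy-btn"]; // from CSS AST

const unusedSelectors = cssClassSelectors.filter(
  (sel) => !emittedClasses.has(sel.slice(1)), // drop the leading "."
);
// unusedSelectors -> [".legacy-btn"]
```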
Summary
HTML is not markup for text formatting anymore. It is the structural backbone of complex web applications whose pieces come from many vendors with their own life cycles and dev toolchains. To support this complexity, the web apps ecosystem needs the schema, parser, and AST elements outlined above.
Disclaimer
Most of what is proposed is backward compatible with current HTML5 syntax, and also forward compatible with the proposals for self-closing tags and </> as a closing tag, and with alternative syntaxes for JSON, XSLT, and binary formats.
The changes on the browser side would require replacing the parser but would be backward compatible in the rendered DOM. The API would be optimized to use streaming and multithreading, which should positively affect performance and memory consumption.
With the standards changes, the HTML validator (currently not a standard) and the HTML schema would become formal standard definitions and serve as the source of truth for TS typings, C/Rust interfaces, etc.