Theoretical underpinning? #54

Closed
lassik opened this issue Aug 10, 2018 · 3 comments


lassik commented Aug 10, 2018

Greetings,

I'm interested in the concept of a universal parser suited to all programming languages and am trying to understand the approach you're taking with this project. I went through the code and issues and couldn't find any obvious references to parsing algorithms from the computer science literature.

In particular, I found the following PEG (Parsing Expression Grammar)-based papers today that seem to touch on the problem of universal parsing:

Are you incorporating anything from approaches such as LL/LR or PEG, or creating your own approach from scratch? In either case, may I ask about the rationale?


prettydiff commented Aug 12, 2018

@lassik Thank you for your interest.

I am not formally educated in computer science. As a result, the basic strategic concept I employ is original.

Here is how I arrived at this point:

  1. My first parser was a rough XML parser. I needed to beautify markup languages in the browser and wasn't happy with the other JavaScript tools available at the time, so I wrote my own, but it wasn't especially good.
  2. At that time I was using a very old version of JS Beautify for parsing and beautification of JavaScript. JS Beautify was not a formal parser and had some limitations on complex structures. I wrote custom enhancements to overcome its noticeable shortcomings; these were hacks. When I was stuck in the prison-like environment of North Fort Hood during pre-mobilization for my third military deployment, I had all the time in the world, so I wrote my own formal JavaScript parser. This was around November 2012.
  3. When I started writing that JavaScript parser, I took what I had learned from working with JS Beautify and from reading the code of JSLint and JSMin and created a simple table-like structure for representing a language (see the sketch after this list). It was quick to write and quick to read.
  4. Using what I had learned from writing that JavaScript parser I wrote a new markup and CSS parser to create table-like structures similar to what I created for the JavaScript parser, but each of these parsers produced unique structures for the needs of their languages.
  5. From the beginning, CSS and JavaScript tools were dependencies of the markup parser, because HTML can contain islands of CSS and JavaScript code embedded directly that the markup parser cannot understand on its own. Then Facebook released the JSX language in support of a JavaScript framework. JSX allows islands of XML to be embedded in JavaScript, so the JavaScript and XML parsers had to become co-dependent, and I had to be smarter about the relationship between these parsers, recursively so. This was around November 2014.
  6. Last year I wanted to extract the parsers and isolate them from the rest of my code, but I encountered a couple of problems that I had to solve.
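To make step 3 concrete, here is a minimal sketch of what I mean by a table-like structure. The field names and the trivial tokenizer here are illustrations of my own for this comment, not the framework's actual schema:

```js
// A minimal sketch of a table-like parse structure: two parallel arrays,
// one holding raw token text and one holding a type classification.
// Each array index is one row of the table, so the whole program can be
// scanned linearly instead of walking an object tree.
function parseSketch(input) {
    const data = {
        token: [], // the raw text of each token
        types: []  // a classification for each token
    };
    // A deliberately naive tokenizer: split on whitespace and classify.
    for (const word of input.split(/\s+/).filter(Boolean)) {
        data.token.push(word);
        data.types.push(/^[A-Za-z_$][\w$]*$/.test(word) ? "word" : "operator");
    }
    return data;
}

console.log(parseSketch("a = b + c"));
// { token: ["a", "=", "b", "+", "c"],
//   types: ["word", "operator", "word", "operator", "word"] }
```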

Problems to solve

  1. I needed a universal format. Before, each of my parsers had a similar table-like structure, but they were very different in the kinds of data they held and the quantity of fields they contained.
  2. I needed recursively parsed output. Before, I had formalized recursive access between the markup and JavaScript parsers, but they would each parse their respective inputs and return formatted strings as outputs. If the parsers were to be truly isolated, they would have to return parsed data as output instead of formatted code strings.
  3. I needed to define a universal structure. Normally parsers produce an abstract syntax tree. Object trees are painful to traverse and hard to analyze from a single global perspective. Table-like structures had served me very well in the past, from both performance and analytical perspectives. The benefit of an AST, though, is that the structure of the parsed code is easily understandable, since the returned AST represents that code's structure in its shape. I had already solved this problem in my JavaScript parser by simply pointing to a code token's starting structure with a reference location (an array index) and identifying the type of structure the current token resides in. I was easily able to adapt this structure identification to other parsing considerations, but I needed it applied in a universal manner, so that a language island inside a larger host language understands it is an island of a larger external structure no differently than it understands its own internal structures (see the sketch after this list).
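Here is a sketch of the structure identification described in problem 3. The begin and stack fields are hypothetical names for this illustration: begin holds the array index of the token that opened the current structure, and stack names the type of that structure. A language island resolves its parent structure the same way any internal structure does:

```js
// Hypothetical sketch: a JavaScript island inside markup. Every token
// inside the island points back at the token that opened it (index 0),
// so island membership is a plain index lookup, not a tree traversal.
const data = {
    token: ["<script>", "a",      "=",        "1",      "</script>"],
    types: ["start",    "word",   "operator", "number", "end"],
    begin: [-1,         0,        0,          0,        0],
    stack: ["global",   "script", "script",   "script", "script"]
};

const i = 2; // the "=" token
console.log(data.stack[i], data.token[data.begin[i]]);
// script <script>
```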

Those three problems are all I solved for. So far the concept appears solid in practice. It is sufficient to parse any language, and even a document of nested, unrelated languages.

Proof

To prove this concept actually applies universally, I still need more testing.

  1. So far I have only tested this against web technology languages. This is sufficient to test vertical extensibility of embedded languages. I have also written a GitHub-flavored Markdown parser for the framework, but I need to support other languages unrelated to web technologies to ensure I am capturing all the concerns of formally describing potentially any language.
  2. So far I have only used the framework to analyze code for formatting, like beautification and minification. The framework describes the code sufficiently to support those forms of analysis. In order to actually say the framework universally parses languages, it needs to support all manner of analysis, which means I need to apply the framework to other, unrelated problems.

Effort

The architecture and output structure have not changed since I released this framework last year. I have made a few modifications to the lexers as I discover defects in their output. I need to spend more time really proving this idea to validate it against its claimed goals. So far this year I have instead spent the bulk of my time writing Pretty Diff version 3. Once I release Pretty Diff version 3, I can circle back to this application in a more dedicated capacity. I am close: Pretty Diff v3 is largely written and is now just awaiting more formal testing against various code samples to ensure it isn't breaking people's code.

I have been fortunate to have had a lot of time in the past year to work on these things, and I will continue to have wide availability as I am about to embark upon my fourth military deployment. It is just a matter of being disciplined and testing code for hours every day in a vacuum. If Sequoyah could do it, I really don't have any excuses.

Did I answer your question?

Please let me know if I have failed to address your curiosity.


lassik commented Aug 15, 2018

Thank you for this. This is the best reply I've ever seen on GitHub. Incredibly thorough and candid. You definitely answered my questions and then some. You are also admirably committed to your work.

I skimmed a bunch more papers and web pages and came away with the impression that the field is very complex and there basically isn't a generally agreed-upon best way to do anything. Many different algorithms can be made to work, but they all come with tradeoffs relative to the others. ANTLR seems to be betting on LL/LR parsing. Most other recent tools seem to be using some variant of PEG, augmented to address its weaknesses. PEG has less ambiguity than LR but may produce worse error messages for users and consume more memory. Hand-rolled recursive-descent parsers may be slower than ones produced by parser generators, but they are unambiguous and give precise control over everything. GCC famously switched from a generated LR parser to a hand-rolled parser for its C++ frontend some time ago in order to produce better error messages.

Parsing embedded languages (e.g. JavaScript inside HTML) doesn't seem to have been studied much in academia, and I didn't find any theory about how to label parser-generated AST nodes in the most useful manner. Given how inconclusive the entire field is, I guess your approach is as good as any of the preceding ones. I hope it proves fruitful as the language repertoire expands, and I wish you the best of luck in pursuing it.
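As a toy illustration of that last tradeoff (my own sketch, not code from any of the tools above), a hand-rolled recursive-descent parser makes every error message an explicit decision:

```js
// A miniature recursive-descent parser for expressions like "1+(2-3)".
// Grammar (illustrative): expr -> term (("+" | "-") term)*
//                         term -> NUMBER | "(" expr ")"
function parseExpr(src) {
    let pos = 0;
    const peek = () => src[pos];
    const fail = (msg) => { throw new SyntaxError(msg + " at position " + pos); };

    function expr() {
        let value = term();
        while (peek() === "+" || peek() === "-") {
            const op = src[pos++];
            const rhs = term();
            value = op === "+" ? value + rhs : value - rhs;
        }
        return value;
    }
    function term() {
        if (peek() === "(") {
            pos++; // consume "("
            const value = expr();
            if (peek() !== ")") fail('expected ")"'); // precise, hand-written error
            pos++; // consume ")"
            return value;
        }
        const start = pos;
        while (/[0-9]/.test(peek() || "")) pos++;
        if (start === pos) fail("expected a number");
        return Number(src.slice(start, pos));
    }

    const result = expr();
    if (pos !== src.length) fail("unexpected trailing input");
    return result;
}

console.log(parseExpr("1+(2-3)")); // 0
```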

Personally, my ultimate interest re: parsing is trying to replace text editing with structural code editing. A universal parser would be the trickiest part in such a system (nobody wants to convert their source code to binary files overnight, so the editor would have to be able to read and write text files using traditional syntax). Since nobody seems to have discovered the "one true parsing framework", a virtual machine for running parsers in a sandbox and a language for writing them might be worth investigating (one general enough to support multiple approaches). OMeta and Ohm aim close to this. The VM could be ported to multiple host languages to re-use the same parser and AST definitions in all of them. But honestly, I'll probably never have time to pursue these goals beyond crude proof-of-concept level.

P.S. Very cool to learn about Sequoyah. Led an amazing life.

prettydiff commented

What I believe is cool about the approach I am attempting is that it is potentially language agnostic; the conforming output format is what matters most. If somebody wants to contribute a C language lexer to the parse framework and they are most comfortable writing in C, then their contribution should be in C.

At this time all of the framework parts are centrally managed; it truly is a framework in the application sense. I would need to move to a separately managed but centrally dispatched model, whereby the necessary framework parts are distributed among the language lexers in addition to the parsed output. I have opened an issue, #55, to think through this.
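As a rough sketch of what that separately managed but centrally dispatched model could mean (every name here is hypothetical, not the framework's actual API):

```js
// Hypothetical sketch: a central dispatcher that language lexers register
// with. Each lexer appends rows to the shared table-like output and can
// call back into the dispatcher when it finds an embedded language island.
const framework = {
    lexers: {},
    data: { token: [], types: [], stack: [] }, // shared table-like output
    register(language, lexer) {
        this.lexers[language] = lexer;
    },
    parse(language, source) {
        const lexer = this.lexers[language];
        if (lexer === undefined) {
            throw new Error("no lexer registered for " + language);
        }
        lexer(source, this.data, this); // lexers receive the dispatcher too
        return this.data;
    }
};

// A contributed lexer only has to honor the output format; it could just
// as well be compiled from C or any other language that can append rows.
framework.register("css", function (source, data, dispatcher) {
    data.token.push(source.trim());
    data.types.push("css_stub");
    data.stack.push("global");
});

console.log(framework.parse("css", "a { color: red; }"));
```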
