Theoretical underpinning? #54
@lassik Thank you for your interest. I am not formally educated in computer science. As a result, the basic strategic concept I employ is original. Here is how I got here:
Problems to solve
Those three problems are all I set out to solve. So far the concept appears solid in practice: it is sufficient to parse any language, and even a document of nested, unrelated languages.

Proof

To prove that this concept actually applies universally I still need more testing.
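As a rough illustration of the nested-languages claim above, here is a minimal sketch of one way a document could be handed off between per-language lexers. The lexer names, record shape, and dispatch logic are hypothetical and are not drawn from this framework:

```typescript
// Hypothetical sketch of handing a document off between per-language lexers at
// an embedded-language boundary (a <script> tag inside markup). Names, record
// shape, and dispatch logic are illustrative only, not this project's API.

interface ParsedToken {
    lexer: string;   // which lexer produced the token
    token: string;   // raw token text
    types: string;   // token classification
}

function lexScript(source: string, output: ParsedToken[]): void {
    // Placeholder script lexer: whitespace-delimited tokens only.
    for (const token of source.split(/\s+/).filter(t => t !== "")) {
        output.push({ lexer: "script", token, types: "word" });
    }
}

function lexMarkup(source: string, output: ParsedToken[]): void {
    let inScript = false;
    // Split on tags but keep them, so both tags and content are tokenized.
    for (const part of source.split(/(<\/?[^>]+>)/).filter(p => p !== "")) {
        if (/^<script/i.test(part)) inScript = true;
        if (/^<\/script/i.test(part)) inScript = false;
        if (part.startsWith("<")) {
            output.push({ lexer: "markup", token: part, types: "tag" });
        } else if (inScript) {
            lexScript(part, output); // hand off to the embedded-language lexer
        } else {
            output.push({ lexer: "markup", token: part, types: "content" });
        }
    }
}

const result: ParsedToken[] = [];
lexMarkup("<p>hi</p><script>let x = 1;</script>", result);
console.log(result); // one flat token list, each record tagged with its source lexer
```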
Effort

The architecture and output structure have not changed since I released this framework last year. I have made a few modifications to the lexers as I have discovered defects in their output. I need to spend more time really proving this idea in order to validate it against its claimed goals. So far this year I have instead spent the bulk of my time writing Pretty Diff version 3. Once I release Pretty Diff version 3 I can circle back to this application in a more dedicated capacity. I am close: Pretty Diff v3 is largely written and is now just awaiting more formal testing against various code samples to ensure it isn't breaking people's code.

I have been fortunate to have had a lot of time in the past year to work on these things, and I will continue to have wide availability as I am about to embark upon my 4th military deployment. It is just a matter of being disciplined and testing code for hours every day in a vacuum. If Sequoyah could do it, I really don't have any excuses.

Did I answer your question? Please let me know if I have failed to address your curiosity.
Thank you for this. This is the best reply I've ever seen on GitHub. Incredibly thorough and candid. You definitely answered my questions and then some. You are also admirably committed to your work.

I skimmed a bunch more papers and web pages and came away with the impression that the field is very complex and there basically isn't a generally agreed-upon best way to do anything. Many different algorithms can be made to work but they all come with tradeoffs relative to the others. ANTLR seems to be betting on LL/LR parsing. Most other recent tools seem to be using some variant of PEG, augmented to address its weaknesses. PEG has less ambiguity than LR but may produce worse error messages for users and consume more memory. Hand-rolled recursive-descent parsers may be slower than the ones produced by parser generators, but are unambiguous and have precise control over everything. GCC famously switched from a generated LR parser to a hand-rolled parser for its C++ frontend some time ago in order to produce better error messages. Parsing embedded languages (e.g. JavaScript inside HTML) doesn't seem to have been studied much in academia. I didn't find any theory about how to label parser-generated AST nodes in the most useful manner.

Given how inconclusive the entire field is, I guess your approach is as good as any of the preceding ones. I hope it proves fruitful as the language repertoire expands and wish you the best of luck in pursuing it.

Personally, my ultimate interest re: parsing is trying to replace text editing with structural code editing. A universal parser would be the trickiest part in such a system (nobody wants to convert their source code to binary files overnight, so the editor would have to be able to read and write text files using traditional syntax). Since nobody seems to have discovered the "one true parsing framework", a virtual machine for running parsers in a sandbox and a language for writing them might be worth investigating (one general enough to support multiple approaches). OMeta and Ohm aim close to this. The VM could be ported to multiple host languages to re-use the same parser and AST definitions in all of them. But honestly, I'll probably never have time to pursue these goals beyond crude proof-of-concept level.

P.S. Very cool to learn about Sequoyah. Led an amazing life.
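Picking up the point above about hand-rolled recursive-descent parsers, here is a minimal generic sketch of one, showing the kind of precise, hand-tailored error reporting such parsers allow. It is a standalone illustration and is not related to this project's code:

```typescript
// Minimal sketch of a hand-rolled recursive-descent parser for arithmetic
// expressions. The author controls exactly how errors are phrased and located,
// which is the advantage mentioned above in connection with GCC.

type Expr = number | { op: "+" | "*"; left: Expr; right: Expr };

function parse(input: string): Expr {
    let pos = 0;
    const peek = () => input[pos];
    const fail = (expected: string): never => {
        // Hand-tailored message: exactly what was expected, and where.
        throw new SyntaxError(
            `expected ${expected} at position ${pos}, found ${JSON.stringify(peek() ?? "end of input")}`
        );
    };
    // Grammar: expr := term ("+" term)* ; term := factor ("*" factor)* ;
    //          factor := number | "(" expr ")"
    const expr = (): Expr => {
        let node = term();
        while (peek() === "+") { pos += 1; node = { op: "+", left: node, right: term() }; }
        return node;
    };
    const term = (): Expr => {
        let node = factor();
        while (peek() === "*") { pos += 1; node = { op: "*", left: node, right: factor() }; }
        return node;
    };
    const factor = (): Expr => {
        if (peek() === "(") {
            pos += 1;
            const node = expr();
            if (peek() !== ")") fail("')'");
            pos += 1;
            return node;
        }
        const start = pos;
        while (pos < input.length && /[0-9]/.test(input[pos])) pos += 1;
        if (pos === start) return fail("a number or '('");
        return Number(input.slice(start, pos));
    };
    const result = expr();
    if (pos !== input.length) fail("end of input");
    return result;
}

console.log(JSON.stringify(parse("2+3*(4+5)"))); // operator precedence falls out of the grammar
```

The tradeoff is the one described above: everything, including precedence and error recovery, must be written and maintained by hand rather than generated from a grammar.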
What I believe is cool about the approach I am attempting is that it is potentially language agnostic. It is the conforming output format that matters most. If somebody wants to contribute a C language lexer to the parse framework and they are most comfortable writing in C, then their contribution should be in C. At this time all of the framework parts are centrally managed; it truly is a framework in the application sense. I would need to move to a separately managed but centrally dispatched model, whereby the necessary framework parts are distributed among the language lexers in addition to the parsed output. I have opened an issue, #55, to think through this.
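As a rough illustration of how a lexer written in another language could still feed a conforming output format, here is a minimal sketch. The record fields and the `ingest` function are assumptions made for illustration, not the project's actual schema or API:

```typescript
// Hypothetical sketch of a "conforming output format": a lexer written in any
// language (C, for example) runs as an external tool and prints records as
// JSON, and the framework only validates and ingests them. Field names and the
// ingest function are illustrative, not this project's actual schema.

interface LexedRecord {
    begin: number;   // index of the record that opens the enclosing structure
    lexer: string;   // name of the lexer that produced the record
    lines: number;   // count of newlines preceding the token
    stack: string;   // name of the enclosing structure
    token: string;   // raw token text
    types: string;   // token classification
}

function ingest(json: string): LexedRecord[] {
    const data: unknown = JSON.parse(json);
    if (!Array.isArray(data)) throw new Error("lexer output must be a JSON array");
    return data.map((item, index) => {
        const record = item as LexedRecord;
        if (typeof record.token !== "string" || typeof record.lexer !== "string") {
            throw new Error(`record ${index} does not conform to the shared format`);
        }
        return record;
    });
}

// Output a hypothetical C lexer might print to stdout; the framework consumes
// it without caring what language produced it.
const sample = `[{"begin":0,"lexer":"c","lines":0,"stack":"global","token":"int","types":"type"}]`;
console.log(ingest(sample));
```

The design point is that the shared contract lives in the data, not in any one implementation language.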
Greetings,
I'm interested in the concept of a universal parser suited to all programming languages, and I'm trying to understand the approach you're taking with this project. I went through the code and issues and couldn't find any obvious references to parsing algorithms from the computer science literature.
In particular, I found the following PEG (Parsing Expression Grammar)-based papers today that seem to touch on the problem of universal parsing:
Are you incorporating anything from approaches such as LL/LR or PEG, or creating your own approach from scratch? In either case, may I ask about the rationale?