Can whitespace be ignored? #10

Closed
rogersm opened this issue May 31, 2020 · 2 comments

@rogersm

rogersm commented May 31, 2020

I'm migrating a fairly big parser from Python. Is there any way of ignoring a certain terminal?

Being able to have the parser ignore whitespace would be extremely useful.

@scymtym
Owner

scymtym commented May 31, 2020

Being able to have the parser ignore whitespace would be extremely useful.

There is nothing built in, because packrat parsers like Esrap typically operate without the otherwise common separation into a lexer stage and a parser stage. I typically address the problem by defining two variants of most rules: one that assumes that leading and trailing whitespace has been handled, and a second one that invokes the first rule but skips over any trailing whitespace.

I started to write a chapter for the manual about this topic which follows below. I hope that helps.

Parsers often consist of a lexer component in addition to the actual parser. The lexer breaks the input into tokens, dealing with whitespace and comments. The parser then consumes this token stream. This approach has the advantage that the parser component does not have to deal with "details" such as whitespace and comments, which are usually rather uniform across the whole grammar, but there are downsides as well. For example, the Python grammar depends on whitespace, and a lexer stage prevents later stages from accessing comments. There are also context-sensitive tokens such as >> in the C++ grammar.

Esrap does not suffer from the downsides of having a separate lexer component. This comes, however, at the price of having to deal with things like whitespace and comments in the actual grammar. Fortunately, a macro along the following lines combined with a convention for writing rules can make the inconvenience almost disappear:

(defmacro deftoken (name expression &body options)
  (let ((name/skippable (alexandria:symbolicate name '#:/s))
        (name/maybe-skippable (alexandria:symbolicate name '#:/?s)))
    `(progn
       (esrap:defrule ,name ,expression ,@options)
       (esrap:defrule ,name/skippable
           (and ,name skippable)
         (:function first))
       (esrap:defrule ,name/maybe-skippable
           (and ,name (esrap:? skippable))
         (:function first)))))

This can be used as follows:

(esrap:defrule whitespace
    (+ (or #\Space #\Tab #\Newline))
  (:constant nil))

(esrap:defrule comment
    (and "/*" (* (not "*/")) "*/")
  (:function second)
  (:text t))

(esrap:defrule skippable
  (+ (or whitespace comment)))

(deftoken type
    (and (alpha-char-p character) (* (alphanumericp character)))
  (:text t))

(deftoken identifier
    (and (alpha-char-p character) (* (alphanumericp character)))
  (:text t))

(deftoken equals
    #\=
  (:text t))

(deftoken value
    (+ (digit-char-p character))
  (:text t)
  (:function parse-integer))

(deftoken variable
    (and type/s identifier/?s equals/?s value))

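Note: the final token of a rule expression should not accept trailing skippables. This is necessary to make the rule usable in different contexts. For rules defined with deftoken, a caller that does want to skip trailing whitespace can use the generated /s or /?s variant instead.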
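To illustrate the note above, here is a minimal, self-contained sketch. It is not part of the planned manual chapter. It assumes Quicklisp is available for loading Esrap and Alexandria; the package name token-demo, the rules name and assignment, and the helper parse-prefix are hypothetical names chosen for this example. The deftoken macro is repeated from above so the block can be loaded on its own.

```lisp
;; Assumption: Quicklisp is installed and can load both systems.
(ql:quickload '(:esrap :alexandria) :silent t)

(defpackage #:token-demo (:use #:cl))
(in-package #:token-demo)

;; Same macro as above: defines NAME, NAME/S and NAME/?S.
(defmacro deftoken (name expression &body options)
  (let ((name/skippable (alexandria:symbolicate name '#:/s))
        (name/maybe-skippable (alexandria:symbolicate name '#:/?s)))
    `(progn
       (esrap:defrule ,name ,expression ,@options)
       (esrap:defrule ,name/skippable
           (and ,name skippable)
         (:function first))
       (esrap:defrule ,name/maybe-skippable
           (and ,name (esrap:? skippable))
         (:function first)))))

;; Only whitespace is skippable in this reduced grammar.
(esrap:defrule skippable
    (+ (or #\Space #\Tab #\Newline))
  (:constant nil))

(deftoken name
    (+ (alpha-char-p character))
  (:text t))

(deftoken equals
    #\=)

(deftoken value
    (+ (digit-char-p character))
  (:text t)
  (:function parse-integer))

;; The final token is plain VALUE, not VALUE/?S, so ASSIGNMENT itself
;; never consumes trailing whitespace.
(deftoken assignment
    (and name/?s equals/?s value)
  (:destructure (name equals value)
    (declare (ignore equals))
    (list name value)))

;; Hypothetical helper: parse as much of TEXT as RULE matches.
;; With :JUNK-ALLOWED T, the second return value is the index of the
;; first unconsumed character, or NIL if all of TEXT was consumed.
(defun parse-prefix (rule text)
  (esrap:parse rule text :junk-allowed t))
```

Calling (parse-prefix 'assignment "a = 1  ") should leave the two trailing spaces unconsumed, while (parse-prefix 'assignment/?s "a = 1  ") should consume the whole string. The decision about trailing whitespace is thus made by whoever references the rule, not by the rule itself.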

Note that comments can still be captured when they occur at "interesting" locations (e.g. documentation comments):

(deftoken keyword-class
    "class")

(esrap:defrule class
    (and keyword-class/s identifier/?s #\{ "..." #\}))

(esrap:defrule compilation-unit
    (* (or whitespace comment variable class))
  (:lambda (top-level-nodes)
    ;; Remove whitespace results. Could for example associate comment
    ;; nodes to class or function nodes following them in the input
    ;; text.
    (remove nil top-level-nodes)))

In this grammar, top-level comments can be captured and e.g. associated with class or function nodes but "token rules" lead to whitespace and comments being ignored everywhere else.

(esrap:parse 'compilation-unit "class foo {...}
/*comment and a newline*/

int a=1 /*comment*/
int     b           =         5")
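As a rough sketch of the "associate comment nodes" idea mentioned in the code comment above, the list returned by compilation-unit could be post-processed like this. The function name attach-comments and the (:documented ...) node shape are hypothetical and not part of Esrap or parser.common-rules. The sketch assumes that comment results are strings (as produced by the comment rule with (:text t)) and that all other top-level results are lists.

```lisp
(defun attach-comments (nodes)
  "Pair each comment string in NODES with the node directly following it.
Returns a fresh list in which such pairs appear as
(:documented COMMENT NODE); all other elements are kept as they are."
  (let ((result '())
        (pending nil))                  ; comment waiting for a node
    (dolist (node nodes)
      (cond ((stringp node)
             ;; Two comments in a row: keep the earlier one unattached.
             (when pending
               (push pending result))
             (setf pending node))
            (pending
             ;; A non-comment node right after a comment: pair them.
             (push (list :documented pending node) result)
             (setf pending nil))
            (t
             (push node result))))
    ;; A trailing comment has nothing to attach to.
    (when pending
      (push pending result))
    (nreverse result)))
```

This only works because compilation-unit already removes the NIL results of the whitespace rule, so a comment and the node after it end up adjacent in the list.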

The parser.common-rules library provides a deftoken macro.

@scymtym scymtym pinned this issue May 31, 2020
@rogersm
Author

rogersm commented Jun 1, 2020

Many thanks for the excellent reply. I completely missed the parser.common-rules. I was thinking of modifying the library, but with your deftoken I will be able to work out a solution.

@rogersm rogersm closed this as completed Jun 1, 2020