Can whitespace be ignored? #10

rogersm · 2020-05-31T18:52:20Z

I'm migrating a fairly big parser from Python. Is there any way of ignoring a certain terminal?

Being able to have the parser ignore whitespace would be extremely useful.

scymtym · 2020-05-31T20:38:52Z

Being able to have the parser ignore whitespace would be extremely useful.

There is nothing built in, because packrat parsers like Esrap typically operate without the otherwise common separation into a lexer and parser stage. I typically address the problem by defining two variants of most rules: one that assumes that leading and trailing whitespace has been handled and a second one that invokes the first rule but skips over any trailing whitespace.

I started to write a chapter for the manual about this topic, which follows below. I hope that helps.

Parsers often consist of a lexer component in addition to the actual parser. The lexer breaks the input into tokens, dealing with whitespace and comments along the way. The parser then consumes this token stream. This approach has the advantage that the parser component does not have to deal with "details" such as whitespace and comments, which are usually rather uniform across the whole grammar, but there are downsides as well. For example, the Python grammar depends on whitespace, and a lexer stage prevents later stages from accessing comments. There are also context-sensitive tokens such as >> in the C++ grammar.

Esrap does not suffer from the downsides of having a separate lexer component. This comes, however, at the price of having to deal with things like whitespace and comments in the actual grammar. Fortunately, a macro along the following lines combined with a convention for writing rules can make the inconvenience almost disappear:

(defmacro deftoken (name expression &body options)
  (let ((name/skippable (alexandria:symbolicate name '#:/s))
        (name/maybe-skippable (alexandria:symbolicate name '#:/?s)))
    `(progn
       (esrap:defrule ,name ,expression ,@options)
       (esrap:defrule ,name/skippable
           (and ,name skippable)
         (:function first))
       (esrap:defrule ,name/maybe-skippable
           (and ,name (esrap:? skippable))
         (:function first)))))
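
To make the generated rules concrete, here is roughly what a call such as (deftoken equals #\= (:text t)) expands to. This is a sketch derived by hand from the macro above, assuming Esrap is loaded (e.g. via Quicklisp) and that a rule named skippable is defined before any parsing takes place:

```lisp
;; Hand-written expansion of (deftoken equals #\= (:text t)).
;; Assumes Esrap is loaded, e.g. (ql:quickload :esrap), and that a
;; SKIPPABLE rule exists by the time a parse is attempted.

;; The plain token: matches "=" and nothing else.
(esrap:defrule equals
    #\=
  (:text t))

;; NAME/S: the token followed by at least one skippable.
(esrap:defrule equals/s
    (and equals skippable)
  (:function first))

;; NAME/?S: the token optionally followed by skippables.
(esrap:defrule equals/?s
    (and equals (esrap:? skippable))
  (:function first))
```

In both suffixed variants, (:function first) discards the skippable part, so all three rules produce the same result as the plain token.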

The deftoken macro can be used as follows:

(esrap:defrule whitespace
    (+ (or #\Space #\Tab #\Newline))
  (:constant nil))

(esrap:defrule comment
    (and "/*" (* (not "*/")) "*/")
  (:function second)
  (:text t))

(esrap:defrule skippable
  (+ (or whitespace comment)))

(deftoken type
    (and (alpha-char-p character) (* (alphanumericp character)))
  (:text t))

(deftoken identifier
    (and (alpha-char-p character) (* (alphanumericp character)))
  (:text t))

(deftoken equals
    #\=
  (:text t))

(deftoken value
    (+ (digit-char-p character))
  (:text t)
  (:function parse-integer))

(deftoken variable
    (and type/s identifier/?s equals/?s value))

Note: the final token of a rule expression should not expect trailing skippables (in the variable rule above, the expression ends in plain value rather than value/s or value/?s). This is necessary to make the rule usable in different contexts.
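
As an illustration of that note, the sketch below reuses variable inside a larger rule. It assumes the deftoken macro and the rules defined above have been loaded; the statement rule and the semicolon terminator are hypothetical and not part of the grammar in this thread:

```lisp
;; Hypothetical rule for illustration only: a variable definition
;; terminated by a semicolon. VARIABLE/?S exists because VARIABLE
;; was defined with DEFTOKEN.
(esrap:defrule statement
    (and variable/?s #\;)
  (:function first))

;; Whitespace and comments between the value and the semicolon are
;; consumed by VARIABLE/?S, not by the VALUE rule inside VARIABLE.
(esrap:parse 'statement "int a = 1 /* answer */ ;")

;; The same rule also accepts input without any skippables there.
(esrap:parse 'statement "int a = 1;")
```

Because variable itself ends in plain value, the enclosing rule is free to choose between variable, variable/s and variable/?s depending on what may follow.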

Note that comments can still be captured when they occur at "interesting" locations (e.g. documentation comments):

(deftoken keyword-class
    "class")

(esrap:defrule class
    (and keyword-class/s identifier/?s #\{ "..." #\}))

(esrap:defrule compilation-unit
    (* (or whitespace comment variable class))
  (:lambda (top-level-nodes)
    ;; Remove whitespace results. Could for example associate comment
    ;; nodes to class or function nodes following them in the input
    ;; text.
    (remove nil top-level-nodes)))

In this grammar, top-level comments can be captured and e.g. associated with class or function nodes but "token rules" lead to whitespace and comments being ignored everywhere else.

(esrap:parse 'compilation-unit "class foo {...}
/*comment and a newline*/

int a=1 /*comment*/
int     b           =         5")

The parser.common-rules library provides a deftoken macro.

rogersm · 2020-06-01T08:52:57Z

Many thanks for the excellent reply. I completely missed the parser.common-rules. I was thinking of modifying the library, but with your deftoken I will be able to work out a solution.

scymtym added the question label May 31, 2020

scymtym pinned this issue May 31, 2020

rogersm closed this as completed Jun 1, 2020
