-
Notifications
You must be signed in to change notification settings - Fork 0
/
c-parser.scrbl
287 lines (232 loc) · 8.85 KB
/
c-parser.scrbl
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
#lang scribble/lp2
@(require scribble-code-examples)
@(require scribble-code-examples)
@(require scriblib/footnote)
@tt{yacc} is a tool for generating LALR@note{Look ahead, left-to-right,
rightmost deriviation} parsers. The original tool was written in C,
and generated C source code, but we will be using a port of @tt{yacc} implemented
in the racket @tt{br-parser-tools} library.
With classic @tt{yacc} implementations, the generated code
is convoluted and difficult to understand (as most generated code is).
Since we're once using a Racket implementation, the parser generation step will be
hidden in a macro, so our parser source is more readable.
Let's build a parser to continue parsing our C grammar example from before with
@tt{yacc}. We'll assume a lexer with tokens like the ones we defined in the @tt{lex}
example.
@tt{yacc} grammars are constructed out of "rules." Unlike regular expression
rules, @tt{yacc} rules can reference themselves recursively. This
allows for more powerful parsing rules.
Let's start with adding a simple rule to our parser called @tt{number} that
parses numeric values.
First we'll include the necessary libraries
@chunk[<includes>
(require
br-parser-tools/yacc
br-parser-tools/lex
(prefix-in : br-parser-tools/lex-sre))
(require "c-lexer.scrbl")
]
Then, we'll write the @tt{numbers} rule.
@(define ev (make-code-eval #:lang "racket"))
@(ev '(require br-parser-tools/yacc))
@(ev '(require br-parser-tools/lex))
@(ev '(require "c-lexer.scrbl"))
@(ev '(require "c-parser.scrbl"))
@code-examples[#:lang "racket" #:context #'here #:eval ev]{
(define num-parser
(parser
(error
(lambda (tok-ok? tok-name tok-value)
(error "parsing failed, unexpected " tok-name tok-value)))
(start number)
(tokens keyword-tokens value-tokens)
(end EOF)
(grammar
(number ((INT) $1)
((DOUBLE) $1)))))}
This is a fully functioning parser! It doesn't parse much, and it only
constructs a tree of one node, but it is capable of parsing valid "programs."
An example to demonstrate what it can do:
@code-examples[#:lang "at-exp racket" #:context #'here #:eval ev]|{
(define test-input (open-input-string "2"))
(num-parser (lambda () (c-lexer test-input)))
(define test-input (open-input-string "a string of text to make it fail"))
(num-parser (lambda () (c-lexer test-input)))
}|
A @tt{number} can be an @tt{INT} token or a @tt{DOUBLE} token. Since we have
both @tt{INT} and @tt{DOUBLE} listed as numbers, the rule will match either token
when it appears in the token stream.
The variables with the '$' prefix (@tt{$1} in this case) reference the i@superscript{th}
variable in the match list. For instance, @tt{((A B C) (list $1 $3))}, would
return a list with the @tt{A} and @tt{C} tokens (i.e. the first and third).
We'll use these simple patterns to build the rest of our parser.
For a full-fledged parser we have to define a few things first. The overall
structure is shown below:
@chunk[<*>
<includes>
(provide c-parser)
(define c-parser
(parser
(tokens keyword-tokens value-tokens)
(error
(lambda (tok-ok? tok-name tok-value)
(error "parsing failed, unexpected" tok-name tok-value)))
(start func-def)
(end EOF)
<precs>
(grammar
<literal>
<expr-list>
<expr>
<var-decl>
<func-def>
<arg-list>
<stmt>
<maybe-else>
<stmt-list>)))]
To avoid ambiguities when parsing infix operators, we'll define operator precedence next.
@chunk[<precs>
(precs
(right =)
(left - +)
(left * /)) ]
Now we can write the individual grammar rules. String, character, numeric,
and other hardcoded values in a program are often called "literals." Since
they're easy to parse, we'll start with them.
@chunk[<literal>
(literal
((INT) $1)
((DOUBLE) $1)
((CHAR) $1)
((STRING) $1))]
Since the token already contains the literal value, we don't need to do
any manipulation and can return the token directly with @tt{$1}.
Not every language differentiates between expressions and statements
as C does. Scheme for instance has no grammatical structure for
statements. Since C has statments, our parser must differentiate between
expressions and statements, so if a program attempts to use a statement
in the context of an expression, the parser will fail.
In our parser, we'll handle variable declaration statements, return
statements, and if statements. We will define these rules later.
Now we'll write a rule to parse expressions.
@chunk[<expr>
(expr
((ID = expr) (list 'assign $1 $3))
((literal) $1)
((ID) $1)
<binops>
<function-call>)]
The assignment rule is slightly more sophisticated than the rules shown
before. First, it's recursive which allows us to write more powerful
rules. Second, when we construct the returned results, we do a small
transformation to make it a list where the first element is the symbol
"assign." We tag the first element in the list to make differentiation between
different node types easier.
Next we define our infix operations (once again using simple tagged lists to make the
nodes differentiable).
@chunk[<binops>
((expr + expr) (list '+ $1 $3))
((expr - expr) (list '- $1 $3))
((expr * expr) (list '* $1 $3))
((expr / expr) (list '/ $1 $3))
((expr < expr) (list '< $1 $3))
((expr > expr) (list '> $1 $3))
((expr += expr) (list '+= $1 $3))
((expr -= expr) (list '-= $1 $3))
]
These grammar rules employ left-recursion, since it is allowed in @tt{yacc}.
This is because @tt{yacc} is a bottom-up parser, so left-recursion doesn't
need to be eliminated as it does in a top-down parser.
To finish our expression rules, we add a rule for parsing function calls.
@chunk[<function-call>
((ID LPAREN expr-list RPAREN) (cons $1 $3))
]
Since we don't know how many expressions will show up in an expression
list (for a function call, for instance), we must parse until the end of the
list.
@chunk[<expr-list>
(expr-list
((expr COMMA expr-list) (cons $1 $3))
((expr) (list $1))
(() (list)))]
By matching an expression @italic{and} a comma, followed by a recursive
match of the expression list again, we can determine when to stop. When
the token after the next expression is @italic{not} a comma, the parser
terminates with one of the last two rules.
Now we must define rules for to parse statements. Note that an expression
followed by a semi-colon is also a statement.
@chunk[<stmt>
(stmt
<if-stmt>
<return-stmt>
((var-decl SEMI) $1)
((expr SEMI) $1)
)]
Variable declaration simply parses two identifiers. The first is the type
of the variable and the second is its name. We'll use this @tt{var-decl}
later on for parameter definitions in our @tt{func-def} rule, so it has
its own rule.
@chunk[<var-decl>
(var-decl
((ID ID) (list 'decl $1 $2)))]
Return statements are straightforward as well - they're simply the return
keyword followed by an expression and a semicolon to delimit the end of the statement.
@chunk[<return-stmt>
((return expr SEMI) (list 'return $2))
]
@tt{if} statements are composed of many tokens, and have an optional else statement.
@chunk[<if-stmt>
((if LPAREN expr RPAREN LBRACE stmt-list RBRACE maybe-else)
(append (list 'if $3 $6) $8))]
We'll include a rule called @tt{maybe-else} which will return an empty list if the
else statement doesn't exist.
@chunk[<maybe-else>
(maybe-else
((else LBRACE stmt-list RBRACE) (list 'else $3))
(() '()))]
When @tt{maybe-else} rule returns an empty list the @tt{if} parser
rule will have nothing to @tt{append}. When there is, an else statement, the
@tt{'else} will be tagged on the end, along with the block of statements
(from the @tt{stmt-list} rule).
@chunk[<stmt-list>
(stmt-list
(() (list))
((stmt stmt-list) (cons $1 $2)))]
The last significant rule is the top-level rule that utilizes all the
above rules to construct function definitions. To parse a full-scale
C program, our starting rule would need to be a list of function
definitions, we'll keep it simple for now by just parsing one.
@chunk[<func-def>
(func-def
((ID ID LPAREN arg-list RPAREN LBRACE stmt-list RBRACE)
(list 'func-def $1 $2 $4 $7)))]
The final rule is the argument list helper which expects a list of @tt{var-decl}s that
we defined earlier.
@chunk[<arg-list>
(arg-list
((var-decl COMMA arg-list) (cons $1 $3))
((var-decl) (list $1))
(() (list)))]
Now we can parse C-ish looking programs! For example,
@code-examples[#:lang "at-exp racket" #:context #'here #:eval ev]|{
(with-input-from-string
"int main() {
int x;
x = rand(0, 3);
int i;
i = 2;
if(i < x) {
printf(\"my number is %d\", x);
}
}"
(lambda ()
(c-parser (lambda () (c-lexer (current-input-port))))))
}|
@code-examples[#:lang "at-exp racket" #:context #'here #:eval ev]|{
(with-input-from-string
"int add(int a, int b) {
return a + b;
}"
(lambda ()
(c-parser (lambda () (c-lexer (current-input-port))))))
}|