-
Notifications
You must be signed in to change notification settings - Fork 13
Qi Compiler Sync Aug 11 2023
Qi Compiler Sync Aug 11 2023
Adjacent meetings: Previous | Up | Next [None]
We continued exploring the stream fusion compiler optimization and were able to achieve "ignition," where the performance was able to break even and exceed Racket's (by almost 5x!).
Last time, we prototyped the stream fusion optimization to see if it could result in faster performance for sequences of functional transformations in Qi (map
, filter
, fold
, zip
, etc.). We discovered that this optimization relies on several other optimizations already being present in the compiler, many of which are absent in Racket. Nevertheless, we wanted to continue down this path to uncover precisely what those optimizations would be, to understand what it would take for stream fusion to be performant, and whether that would be worth it. To do this, we manually implemented various compiler optimizations one at a time, comparing against baseline performance each time, and as of last week, we were yet to break even with Racket performance.
Alongside our attempts to achieve stream fusion, last time, we also implemented the filter
→ map
benchmark using a single foldr
operation, which reached 2x of Racket's performance. As the computation is done here in a single step (i.e. our goal even with stream fusion), we assumed that this would be the upper bound on stream fusion performance, and that, once we were able to reach it by manually implementing optimizations, that would give us the sequence of compiler optimizations that would be necessary for the more general stream fusion approach to be equally performant.
We discussed Racket's Futures as one possible way to improve performance. We felt that doing this without changing the semantics of the language in terms of the order of operations would be tricky to pull off, but there may be bounded cases where it would be usable. This is likely something for the "future" though, once we have a basic and functional optimizing compiler.
Last time, we'd inlined most of the function calls, but proceeded to other optimizations without fully inlining everything. In particular, the stream->list
function call had not been inlined. So we first tried inlining this too:
(define (filter-map lst)
(define (next-list->stream state)
(cond [(null? state) 'done]
[else (list 'yield (car state) (cdr state))]))
(define (next-filter-stream state)
(match (next-list->stream state)
['done 'done]
[(cons 'skip new-state) (cons 'skip new-state)]
[(list 'yield value new-state)
(if (odd? value)
(list 'yield value new-state)
(cons 'skip new-state))]))
(define (next-map-stream state)
(match (next-filter-stream state)
['done 'done]
[(cons 'skip new-state) (cons 'skip new-state)]
[(list 'yield value new-state)
(list 'yield (sqr value) new-state)]))
(let ([next next-map-stream]
[state lst])
(let loop ([state state])
(match (next state)
['done null]
[(cons 'skip state)
(loop state)]
[(list 'yield value state)
(cons value
(loop state))]))))
Benchmarking showed:
$ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"
Running competitive benchmarks...
Running Racket benchmarks...
filter-map:
171 ms
filter-map (large list):
648 ms
Running Qi benchmarks...
filter-map:
153 ms
filter-map (large list):
838 ms
Performance relative to baseline:
((filter-map 1) (filter-map (large list) 1))
So, we had just about broken even with Racket.
The stream implementation above returns values like (list 'yield value new-state)
and (cons 'skip new-state)
. As these are cons pairs, they incur a cost of allocation on each recursion. Instead, we considered returning and passing around multiple values -- a fixed number of them -- so that they can be passed in directly as arguments and thereby not incur an allocation cost. This would also make it easier for the Chez Scheme compiler to optimize in the "CP0" optimization pass. We didn't have time to try out this approach live, but implemented it after the meeting:
(define (filter-map lst)
(define-inline (next-list->stream state)
(cond [(null? state) (values 'done #f #f)]
[else (values 'yield (car state) (cdr state))]))
(define-inline (next-filter-stream state)
(call-with-values
(λ ()
(next-list->stream state))
(λ (type value new-state)
(case type
[(done) (values 'done #f #f)]
[(skip) (values 'skip #f new-state)]
[(yield)
(if (odd? value)
(values 'yield value new-state)
(values 'skip #f new-state))]))))
(define-inline (next-map-stream state)
(call-with-values
(λ ()
(next-filter-stream state))
(λ (type value new-state)
(case type
[(done) (values 'done #f #f)]
[(skip) (values 'skip #f new-state)]
[(yield)
(values 'yield (sqr value) new-state)]))))
(let ([next next-map-stream]
[state lst])
(let loop ([state state])
(call-with-values
(λ ()
(next state))
(λ (type value new-state)
(case type
[(done) null]
[(skip)
(loop new-state)]
[(yield)
(cons value
(loop new-state))]))))))
We also use define-inline
here, which is a hook into the compilation process that allow us to explicitly indicate functions that should be inlined at the points where they are invoked. It seems safe to do this for all internal functions used by the filter-map
implementation (and also within the Qi compiler, in the future).
Benchmarking shows:
nonlocal $ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"
Running competitive benchmarks...
Running Racket benchmarks...
filter-map:
171 ms
filter-map (large list):
568 ms
Running Qi benchmarks...
filter-map:
75 ms
filter-map (large list):
506 ms
Performance relative to baseline:
((filter-map (large list) 1) (filter-map 0.44))
... a speedup of about 2-3x!
We then reimplemented the basic stream data type and interfaces in a continuation passing style. This approach works by directly invoking the next
functions for subsequent streams in the functional sequence, without explicitly constructing a stream data type anywhere. This employs lambdas, named lets, and lots of currying to get around the need to allocate data, and we assumed on this basis that it would compile down to simple lambda
and letrec
forms, which would be easily optimized by the underlying Chez Scheme compiler, and, similar to the "multiple values" approach, avoid the allocation cost of cons
.
(begin-encourage-inline
(define-inline (cstream->list next)
(λ (state)
(let loop ([state state])
((next (λ () null)
(λ (state) (loop state))
(λ (value state)
(cons value (loop state))))
state))))
(define-inline (list->cstream-next done skip yield)
(λ (state)
(cond [(null? state) (done)]
[else (yield (car state) (cdr state))])))
(define-inline ((map-cstream-next f next) done skip yield)
(next done
skip
(λ (value state)
(yield (f value) state))))
(define-inline ((filter-cstream-next f next) done skip yield)
(next done
skip
(λ (value state)
(if (f value)
(yield value state)
(skip state))))))
;; except for cstream->list, it's all CPS with tail recursion
(define (filter-map lst)
((cstream->list
(map-cstream-next sqr
(filter-cstream-next odd?
list->cstream-next)))
lst))
This also uses another compiler hook, begin-encourage-inline
, for good measure.
Benchmarking showed:
nonlocal $ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"
Running competitive benchmarks...
Running Racket benchmarks...
filter-map:
161 ms
filter-map (large list):
567 ms
Running Qi benchmarks...
filter-map:
44 ms
filter-map (large list):
242 ms
Performance relative to baseline:
((filter-map (large list) 0.43) (filter-map 0.27))
... a speedup of 3-5x!
This is possible because Racket (and Chez Scheme) are good at treating "lambdas that do not escape" as being "equivalent to a goto." I'm not entirely sure what this means, but the paper Lambda: The Ultimate GOTO by Guy Steele explains this, apparently. The idea seems to be that lambdas in general can be rewritten procedurally by the compiler, and, evidently, Racket / Chez Scheme are good at this particular optimization.
By "playing to Racket's strengths" and leveraging chains of lambdas, explicitly employing compiler hooks provided by Racket, and avoiding allocation, it seems we are able to make stream fusion feasible and get fairly significant gains without triggering the "chain reaction" that we feared last time. So, it looks like we will be able to implement this optimization in the Qi compiler after all.
(Some of these are carried over from last time)
- Add another nonlocal benchmark:
sum (map square (upto 1 n))
(example from St-Amour's writeup). - Implement stream fusion in the compiler for all of the cases described in St-Amour's writeup (i.e.
unfoldr
foldr
map
filter
foldl
zip
average
). - Add additional nonlocal benchmarks to validate all cases optimized by stream fusion.
- Add docs for
syntax-spec
and put it on the package index. - Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in
try
).
Dominik, Marc, Michael, Sid
Home | Developer's Guide | Calendar | Events | Projects | Meeting Notes