Skip to content

Qi Compiler Sync Aug 11 2023

Siddhartha Kasivajhula edited this page Aug 11, 2023 · 7 revisions

Ignition!

Qi Compiler Sync Aug 11 2023

Adjacent meetings: Previous | Up | Next [None]

Summary

We continued exploring the stream fusion compiler optimization and were able to achieve "ignition," where the performance was able to break even and exceed Racket's (by almost 5x!).

Background

Last time, we prototyped the stream fusion optimization to see if it could result in faster performance for sequences of functional transformations in Qi (map, filter, fold, zip, etc.). We discovered that this optimization relies on several other optimizations already being present in the compiler, many of which are absent in Racket. Nevertheless, we wanted to continue down this path to uncover precisely what those optimizations would be, to understand what it would take for stream fusion to be performant, and whether that would be worth it. To do this, we manually implemented various compiler optimizations one at a time, comparing against baseline performance each time, and as of last week, we were yet to break even with Racket performance.

An Expected Upper Bound

Alongside our attempts to achieve stream fusion, last time, we also implemented the filtermap benchmark using a single foldr operation, which reached 2x of Racket's performance. As the computation is done here in a single step (i.e. our goal even with stream fusion), we assumed that this would be the upper bound on stream fusion performance, and that, once we were able to reach it by manually implementing optimizations, that would give us the sequence of compiler optimizations that would be necessary for the more general stream fusion approach to be equally performant.

Approaches

Futures?

We discussed Racket's Futures as one possible way to improve performance. We felt that doing this without changing the semantics of the language in terms of the order of operations would be tricky to pull off, but there may be bounded cases where it would be usable. This is likely something for the "future" though, once we have a basic and functional optimizing compiler.

More Inlining

Last time, we'd inlined most of the function calls, but proceeded to other optimizations without fully inlining everything. In particular, the stream->list function call had not been inlined. So we first tried inlining this too:

(define (filter-map lst)
  (define (next-list->stream state)
    (cond [(null? state) 'done]
          [else (list 'yield (car state) (cdr state))]))
  (define (next-filter-stream state)
    (match (next-list->stream state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (if (odd? value)
           (list 'yield value new-state)
           (cons 'skip new-state))]))
  (define (next-map-stream state)
    (match (next-filter-stream state)
      ['done 'done]
      [(cons 'skip new-state) (cons 'skip new-state)]
      [(list 'yield value new-state)
       (list 'yield (sqr value) new-state)]))
  (let ([next next-map-stream]
        [state lst])
    (let loop ([state state])
      (match (next state)
        ['done null]
        [(cons 'skip state)
         (loop state)]
        [(list 'yield value state)
         (cons value
               (loop state))]))))

Benchmarking showed:

$ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"

Running competitive benchmarks...

Running Racket benchmarks...
filter-map:
171 ms
filter-map (large list):
648 ms

Running Qi benchmarks...
filter-map:
153 ms
filter-map (large list):
838 ms

Performance relative to baseline:
((filter-map 1) (filter-map (large list) 1))

So, we had just about broken even with Racket.

Leveraging multiple values

The stream implementation above returns values like (list 'yield value new-state) and (cons 'skip new-state). As these are cons pairs, they incur a cost of allocation on each recursion. Instead, we considered returning and passing around multiple values -- a fixed number of them -- so that they can be passed in directly as arguments and thereby not incur an allocation cost. This would also make it easier for the Chez Scheme compiler to optimize in the "CP0" optimization pass. We didn't have time to try out this approach live, but implemented it after the meeting:

(define (filter-map lst)
  (define-inline (next-list->stream state)
    (cond [(null? state) (values 'done #f #f)]
          [else (values 'yield (car state) (cdr state))]))
  (define-inline (next-filter-stream state)
    (call-with-values
     (λ ()
       (next-list->stream state))
     (λ (type value new-state)
       (case type
         [(done) (values 'done #f #f)]
         [(skip) (values 'skip #f new-state)]
         [(yield)
          (if (odd? value)
              (values 'yield value new-state)
              (values 'skip #f new-state))]))))
  (define-inline (next-map-stream state)
    (call-with-values
     (λ ()
       (next-filter-stream state))
     (λ (type value new-state)
       (case type
         [(done) (values 'done #f #f)]
         [(skip) (values 'skip #f new-state)]
         [(yield)
          (values 'yield (sqr value) new-state)]))))
  (let ([next next-map-stream]
        [state lst])
    (let loop ([state state])
      (call-with-values
       (λ ()
         (next state))
       (λ (type value new-state)
         (case type
           [(done) null]
           [(skip)
            (loop new-state)]
           [(yield)
            (cons value
                  (loop new-state))]))))))

We also use define-inline here, which is a hook into the compilation process that allow us to explicitly indicate functions that should be inlined at the points where they are invoked. It seems safe to do this for all internal functions used by the filter-map implementation (and also within the Qi compiler, in the future).

Benchmarking shows:

nonlocal $ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"

Running competitive benchmarks...

Running Racket benchmarks...
filter-map:
171 ms
filter-map (large list):
568 ms

Running Qi benchmarks...
filter-map:
75 ms
filter-map (large list):
506 ms

Performance relative to baseline:
((filter-map (large list) 1) (filter-map 0.44))

... a speedup of about 2-3x!

Continuation-passing style

We then reimplemented the basic stream data type and interfaces in a continuation passing style. This approach works by directly invoking the next functions for subsequent streams in the functional sequence, without explicitly constructing a stream data type anywhere. This employs lambdas, named lets, and lots of currying to get around the need to allocate data, and we assumed on this basis that it would compile down to simple lambda and letrec forms, which would be easily optimized by the underlying Chez Scheme compiler, and, similar to the "multiple values" approach, avoid the allocation cost of cons.

(begin-encourage-inline
  (define-inline (cstream->list next)
    (λ (state)
      (let loop ([state state])
        ((next (λ () null)
               (λ (state) (loop state))
               (λ (value state)
                 (cons value (loop state))))
         state))))

  (define-inline (list->cstream-next done skip yield)
    (λ (state)
      (cond [(null? state) (done)]
            [else (yield (car state) (cdr state))])))

  (define-inline ((map-cstream-next f next) done skip yield)
    (next done
          skip
          (λ (value state)
            (yield (f value) state))))

  (define-inline ((filter-cstream-next f next) done skip yield)
    (next done
          skip
          (λ (value state)
            (if (f value)
                (yield value state)
                (skip state))))))

;; except for cstream->list, it's all CPS with tail recursion
(define (filter-map lst)
  ((cstream->list
    (map-cstream-next sqr
                      (filter-cstream-next odd?
                                           list->cstream-next)))
   lst))

This also uses another compiler hook, begin-encourage-inline, for good measure.

Benchmarking showed:

nonlocal $ ./report-competitive.rkt -s "filter-map" -s "filter-map (large list)"

Running competitive benchmarks...

Running Racket benchmarks...
filter-map:
161 ms
filter-map (large list):
567 ms

Running Qi benchmarks...
filter-map:
44 ms
filter-map (large list):
242 ms

Performance relative to baseline:
((filter-map (large list) 0.43) (filter-map 0.27))

... a speedup of 3-5x!

This is possible because Racket (and Chez Scheme) are good at treating "lambdas that do not escape" as being "equivalent to a goto." I'm not entirely sure what this means, but the paper Lambda: The Ultimate GOTO by Guy Steele explains this, apparently. The idea seems to be that lambdas in general can be rewritten procedurally by the compiler, and, evidently, Racket / Chez Scheme are good at this particular optimization.

Outlook

By "playing to Racket's strengths" and leveraging chains of lambdas, explicitly employing compiler hooks provided by Racket, and avoiding allocation, it seems we are able to make stream fusion feasible and get fairly significant gains without triggering the "chain reaction" that we feared last time. So, it looks like we will be able to implement this optimization in the Qi compiler after all.

Next Steps

(Some of these are carried over from last time)

  • Add another nonlocal benchmark: sum (map square (upto 1 n)) (example from St-Amour's writeup).
  • Implement stream fusion in the compiler for all of the cases described in St-Amour's writeup (i.e. unfoldr foldr map filter foldl zip average).
  • Add additional nonlocal benchmarks to validate all cases optimized by stream fusion.
  • Add docs for syntax-spec and put it on the package index.
  • Review other bindings cases that we could support next (e.g. selective binding, packing values, and binding exceptions in try).

Attendees

Dominik, Marc, Michael, Sid

Clone this wiki locally