Fix accidentally quadratic JPEG parsing #215
Conversation
They are useful for applications that just want to search for these signatures, or implement PNG logic slightly differently.
The cabal file requires a much larger `binary` version than this backwards compat code is for.
* remove `go` boilerplate (`f ... = go where go = ...` without any recursive call to `go`)
* simplify `parse*` functions by returning `Maybe`, instead of taking a list to which the result is prepended (that conflated the concerns of parsing with using the parsed results)
* make it more explicit that for some parsing errors, JuicyPixels just skips the corresponding `JpgFrame`s
* improve some variable names
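The `go` boilerplate pattern being removed can be sketched like this (hypothetical functions for illustration, not actual JuicyPixels code):

```haskell
import Data.Word (Word8)

-- Before: a `go` helper that never calls itself recursively is pure
-- indirection and adds nothing.
isMarkerByteBefore :: Word8 -> Bool
isMarkerByteBefore w = go
  where go = w == 0xFF

-- After: inline the body; behaviour is identical.
isMarkerByteAfter :: Word8 -> Bool
isMarkerByteAfter w = w == 0xFF
```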
Cabal 3.2 complains:

> Warning: 'ghc-prof-options: -prof' is not necessary and will lead to problems when used on a library. Use the configure flag --enable-library-profiling
Switch to Generics for some existing NFData instances. Also add `Eq` for result comparisons.
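The Generics-based derivation pattern looks roughly like this (`JpgSample` is a hypothetical type for illustration; the real instances are on the JPEG types in the library):

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Control.DeepSeq (NFData, deepseq)

-- Hypothetical stand-in for one of the library's types. Deriving
-- `Generic` lets `NFData` use its Generics-based default `rnf`,
-- and `Eq` enables result comparisons in tests.
data JpgSample = JpgSample !Int !Int
  deriving (Eq, Show, Generic, NFData)
```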
This makes reading large JPEGs (e.g. 110 MB) 1000x faster.

* `Data.ByteString.Lazy.index` is O(chunks), not O(1). The default chunk size is 32 KB. Thus, calling that `L.index` in a `+1` loop, as `extractScanContent` did, caused accidentally quadratic runtime.
  * This could be easily observed by putting a `trace` of `n` into the loop, and observing how printouts get slower over time.
* Further, the (ab)use of `getRemainingLazyBytes` from `binary`'s `Data.Binary.Get` module resulted in weird semi-lazy behaviour that was both incorrect (not setting the parser offset correctly, generating misleading error offsets and messages) and slow (encouraging this loop over lazy ByteStrings instead of just using the `Get` parser as intended).
* I thoroughly documented the quirks of the existing `parseFrames` function in its haddocks, and subsequently renamed it to `parseFramesSemiLazy`.
* This commit fixes the above issues by replacing `extractScanContent` by a normal `Get`-based parser.
* Two variants of that fixed parser are added: `parseECS` and `parseECS_simple`. (ECS is the proper name for the "scan content" bytes according to the spec.) The simple one is the straightforward translation, the other one a higher-performance implementation that is only ~20% slower than a non-lazy ByteString based loop. Both variants are faster than the original implementation because they are linear, not accidentally quadratic.
* `parseFrames` is replaced by a strict implementation that uses the new, correct `parseECS`. Compared to the previous `parseFrames` (and current `parseFramesSemiLazy`) it also fixes the remaining issues I found:
  * It did not correctly check the 0xFF value of the frame marker, instead just skipping 1 byte.
  * It ignored any unparseable garbage at the end of the JPEG, even including truncated (and thus not properly framed) data.

  I added both as TODOs for the retained legacy implementation.
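The accidentally-quadratic pattern described above can be sketched as follows (assumed names, not JuicyPixels code):

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as L

-- Find the first 0xFF byte in a lazy ByteString. `L.index` re-walks the
-- chunk list from the front on every call, so the `+1` loop below costs
-- O(chunks) per step and is accidentally quadratic overall -- the same
-- shape of bug as the old `extractScanContent` loop.
findFFQuadratic :: L.ByteString -> Maybe Int64
findFFQuadratic bs = go 0
  where
    len = L.length bs                      -- also an O(chunks) traversal
    go n | n >= len             = Nothing
         | L.index bs n == 0xFF = Just n   -- O(chunks) on each iteration
         | otherwise            = go (n + 1)

-- Linear alternative: scan each chunk exactly once.
findFFLinear :: L.ByteString -> Maybe Int64
findFFLinear = L.elemIndex 0xFF
```

Both return the same result; only the asymptotics differ, which is why the slowdown is invisible on small files and dramatic on 110 MB ones.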
* `parseFrames` was previously unexported from this `.Internal` module, so this rename has no backwards incompatibility implications.
* Backwards compatibility is maintained because the `instance Binary JpgImage` continues to use `parseFramesSemiLazy`, for which I've taken care to preserve its existing quirky laziness semantics. I kept it this way for now because, due to lack of comments, it is unclear to me whether this quirky lazy behaviour was intended or a complete accident.

Beyond that:

* Further exports are added. This allows users to write their own equivalent of `parseFrames`, for example one that searches for the first scan header to determine the image dimensions without parsing the whole JPG.
* These fixes make the library compatible with sane IO approaches, such as streaming parsing with `binary-conduit` or other uses of `binary`'s incremental parser input interface.
* An equivalence test for `parseECS` with `parseECS_simple` is added.
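For reference, `binary`'s incremental input interface, which correct offset handling makes usable, can be driven like this (`runChunked` is a hypothetical helper, not part of JuicyPixels or `binary`):

```haskell
import Data.Binary.Get
import qualified Data.ByteString as B

-- Feed input to a `Get` parser chunk by chunk, as a streaming consumer
-- would, instead of requiring the whole file in memory at once.
runChunked :: Get a -> [B.ByteString] -> Either String a
runChunked g = go (runGetIncremental g)
  where
    go (Done _ _ x)     _      = Right x
    go (Fail _ off err) _      = Left (show off ++ ": " ++ err)
    go (Partial k)      (c:cs) = go (k (Just c)) cs
    go (Partial k)      []     = go (k Nothing) []  -- signal end of input
```

A parser that silently consumes "the remaining lazy bytes" cannot be driven this way, which is why the old `getRemainingLazyBytes`-based code defeated this interface.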
I'll trust you on that; I don't have the tooling to check it.
@Twinside Here's an easy way to measure, e.g. in GHCi:

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary
import Data.Binary.Get
import qualified Codec.Picture.Jpg.Internal.Types as JPG
:set -XTypeApplications
:set +s
L.readFile "large110MB.jpg" >>= \bs -> return $ case runGetOrFail (get @JPG.JpgImage) bs of { Left (_rest, offset, err) -> Left ("ERROR", offset, err) ; Right (_rest, offset, jpgImage) -> Right (offset, jpgImage `deepseq` ()) }
```

The previous implementation prints (by means of
Doing the same with the new implementation (
So for this case, it is 2000x faster. For
This is still quite fast, but 2.5x slower than For the claim
I simply made a copy of the existing function:

```haskell
extractScanContentStrict :: L.ByteString -> (L.ByteString, L.ByteString)
extractScanContentStrict str_lazy = aux 0
  where
    !maxi = fromIntegral $ B.length str - 1
    !str = L.toStrict str_lazy
    aux !n
      | n >= maxi = (L.fromStrict str, L.empty)
      | v == 0xFF && vNext /= 0 && not isReset =
          let (a, b) = B.splitAt n str in (L.fromStrict a, L.fromStrict b)
      | otherwise = aux (n + 1)
      where
        v = {- (if n `mod` 1000000 == 0 then trace (" n = " ++ show n) else id) -} str `B.index` n
        vNext = str `B.index` (n + 1)
        isReset = 0xD0 <= vNext && vNext <= 0xD7
```

For that I obtained:
In the commit I made a claim about compatibility with streaming parsing via `binary-conduit`; for example:

```haskell
import Conduit
import Data.Conduit.Serialization.Binary -- from `binary-conduit`

-- Only reads a small part of the file:
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getWord8)

-- This reads the entire file via the conduit (which is not lazy IO):
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getRemainingLazyByteString)
```
There are other usages of `L.index`. For example:

```
git grep '\.index' | grep -v '\bB\.index\b'
src/Codec/Picture/HDR.hs:    where at n = L.index str . fromIntegral $ idx + n
src/Codec/Picture/HDR.hs:    | otherwise = pure $ L.index inputData (fromIntegral idx)
src/Codec/Picture/Png.hs:    PixelRGBA8 r g b $ Lb.index transpBuffer (fromIntegral ix)
```
@Twinside It would be nice if you could tell me if the semi-lazy behaviour of JPEG parsing was accidental or intentional. If it was accidental, we could consider it a bug, and perhaps switch the implementation of the JPEG parser away from the quirky to the strict one (which so far I haven't done).
I will investigate.
…n value. Fixes regression reported in Twinside#215 (comment).

Also fix follow-up error that `parseFrames`, `parseFramesSemiLazy` and `parseToFirstFrameHeader` incorrectly tried to parse beyond the `JpgEndOfImage`.

Jpg: Fix `parseToFirstFrameHeader`
This should fix it: PR #216
merged! I'll push an update on hackage "soon©"
Makes JPEG parsing 1000x faster for large pictures (I tried with a 110 MB one).

It also fixes usage of the `binary` package's `Get` parser API that caused the `Get`-based functions to not correctly maintain the parser offset (how many bytes were consumed).

Please see the individual commit messages, especially the one of the commit titled `Jpg: Fix quadratic JPEG parsing`, for full details. Other preparatory refactoring commits are also included.