Fix accidentally quadratic JPEG parsing #215
Conversation
They are useful for applications that just want to search for these signatures, or implement PNG logic slightly differently.
The cabal file requires a much larger `binary` version than this backwards compat code is for.
* remove `go` boilerplate (`f ... = go where go = ...` without any recursive call to `go`)
* simplify `parse*` functions by returning `Maybe`, instead of taking a list to which the result is prepended (that conflated the concerns of parsing with using the parsed results)
* make it more explicit that for some parsing errors, JuicyPixels just skips the corresponding `JpgFrame`s
* improve some variable names
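The `go` boilerplate pattern being removed can be sketched like this (hypothetical functions for illustration, not actual JuicyPixels code):

```haskell
import Data.Word (Word8)

-- Before: a `go` helper that never calls itself recursively is pure
-- indirection and adds nothing.
isMarkerByteBefore :: Word8 -> Bool
isMarkerByteBefore w = go
  where go = w == 0xFF

-- After: inline the body; behaviour is identical.
isMarkerByteAfter :: Word8 -> Bool
isMarkerByteAfter w = w == 0xFF
```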
Cabal 3.2 complains:

> Warning: 'ghc-prof-options: -prof' is not necessary and will lead to problems when used on a library. Use the configure flag --enable-library-profiling
Switch to Generics for some existing NFData instances. Also add `Eq` for result comparisons.
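The Generics-based derivation pattern looks roughly like this (`JpgSample` is a hypothetical type for illustration; the real instances are on the JPEG types in the library):

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import GHC.Generics (Generic)
import Control.DeepSeq (NFData, deepseq)

-- Hypothetical stand-in for one of the library's types. Deriving
-- `Generic` lets `NFData` use its Generics-based default `rnf`,
-- and `Eq` enables result comparisons in tests.
data JpgSample = JpgSample !Int !Int
  deriving (Eq, Show, Generic, NFData)
```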
This makes reading large JPEGs (e.g. 110 MB) 1000x faster.

* `Data.ByteString.Lazy.index` is O(chunks), not O(1). The default chunk size is 32 KB. Thus, calling that `L.index` in a `+1` loop, as `extractScanContent` did, caused accidentally quadratic runtime.
  * This could be easily observed by putting a `trace` of `n` into the loop, and observing how printouts get slower over time.
* Further, the (ab)use of `getRemainingLazyBytes` from `binary`'s `Data.Binary.Get` module resulted in weird semi-lazy behaviour that was both incorrect (not setting the parser offset correctly, generating misleading error offsets and messages) and slow (encouraging this loop over lazy ByteStrings instead of just using the `Get` parser as intended).
* I thoroughly documented the quirks of the existing `parseFrames` function in its haddocks, and subsequently renamed it to `parseFramesSemiLazy`.
* This commit fixes the above issues by replacing `extractScanContent` by a normal `Get`-based parser.
* Two variants of that fixed parser are added: `parseECS` and `parseECS_simple`. (ECS is the proper name for the "scan content" bytes according to the spec.) The simple one is the straightforward translation, the other one a higher-performance implementation that is only ~20% slower than a non-lazy ByteString based loop. Both variants are faster than the original implementation because they are linear, not accidentally quadratic.
* `parseFrames` is replaced by a strict implementation that uses the new, correct `parseECS`. Compared to the previous `parseFrames` (and current `parseFramesSemiLazy`) it also fixes the remaining issues I found:
  * It did not correctly check the 0xFF value of the frame marker, instead just skipping 1 byte.
  * It ignored any unparseable garbage at the end of the JPEG, even including truncated (and thus not properly framed) data.

  I added both as TODOs for the retained legacy implementation.
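The accidentally-quadratic pattern described above can be sketched as follows (assumed names, not JuicyPixels code):

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as L

-- Find the first 0xFF byte in a lazy ByteString. `L.index` re-walks the
-- chunk list from the front on every call, so the `+1` loop below costs
-- O(chunks) per step and is accidentally quadratic overall -- the same
-- shape of bug as the old `extractScanContent` loop.
findFFQuadratic :: L.ByteString -> Maybe Int64
findFFQuadratic bs = go 0
  where
    len = L.length bs                      -- also an O(chunks) traversal
    go n | n >= len             = Nothing
         | L.index bs n == 0xFF = Just n   -- O(chunks) on each iteration
         | otherwise            = go (n + 1)

-- Linear alternative: scan each chunk exactly once.
findFFLinear :: L.ByteString -> Maybe Int64
findFFLinear = L.elemIndex 0xFF
```

Both return the same result; only the asymptotics differ, which is why the slowdown is invisible on small files and dramatic on 110 MB ones.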
* `parseFrames` was previously unexported from this `.Internal` module, so this rename has no backwards incompatibility implications.
* Backwards compatibility is maintained because the `instance Binary JpgImage` continues to use `parseFramesSemiLazy`, for which I've taken care to preserve its existing quirky laziness semantics. I kept it this way for now because, due to lack of comments, it is unclear to me whether this quirky lazy behaviour was intended or a complete accident.

Beyond that:

* Further exports are added. This allows users to write their own equivalent of `parseFrames`, for example one that searches for the first scan header to determine the image dimensions without parsing the whole JPG.
* These fixes make the library compatible with sane IO approaches, such as streaming parsing with `binary-conduit` or other uses of `binary`'s incremental parser input interface.
* An equivalence test for `parseECS` with `parseECS_simple` is added.
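For reference, `binary`'s incremental input interface, which correct offset handling makes usable, can be driven like this (`runChunked` is a hypothetical helper, not part of JuicyPixels or `binary`):

```haskell
import Data.Binary.Get
import qualified Data.ByteString as B

-- Feed input to a `Get` parser chunk by chunk, as a streaming consumer
-- would, instead of requiring the whole file in memory at once.
runChunked :: Get a -> [B.ByteString] -> Either String a
runChunked g = go (runGetIncremental g)
  where
    go (Done _ _ x)     _      = Right x
    go (Fail _ off err) _      = Left (show off ++ ": " ++ err)
    go (Partial k)      (c:cs) = go (k (Just c)) cs
    go (Partial k)      []     = go (k Nothing) []  -- signal end of input
```

A parser that silently consumes "the remaining lazy bytes" cannot be driven this way, which is why the old `getRemainingLazyBytes`-based code defeated this interface.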
I'll trust you on that; I don't have the tooling to check it.
@Twinside Here's an easy way to measure, e.g. in GHCi:

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary
import Data.Binary.Get
import qualified Codec.Picture.Jpg.Internal.Types as JPG
:set -XTypeApplications
:set +s
L.readFile "large110MB.jpg" >>= \bs -> return $ case runGetOrFail (get @JPG.JpgImage) bs of { Left (_rest, offset, err) -> Left ("ERROR", offset, err) ; Right (_rest, offset, jpgImage) -> Right (offset, jpgImage `deepseq` ()) }
```

The previous implementation prints (by means of
Doing the same with the new implementation (
So for this case, it is 2000x faster. For
This is still quite fast, but 2.5x slower than For the claim
I simply made a copy of the existing function:

```haskell
extractScanContentStrict :: L.ByteString -> (L.ByteString, L.ByteString)
extractScanContentStrict str_lazy = aux 0
  where
    !maxi = fromIntegral $ B.length str - 1
    !str = L.toStrict str_lazy
    aux !n
      | n >= maxi = (L.fromStrict str, L.empty)
      | v == 0xFF && vNext /= 0 && not isReset =
          let (a, b) = B.splitAt n str in (L.fromStrict a, L.fromStrict b)
      | otherwise = aux (n + 1)
      where
        v = {- (if n `mod` 1000000 == 0 then trace (" n = " ++ show n) else id) -} str `B.index` n
        vNext = str `B.index` (n + 1)
        isReset = 0xD0 <= vNext && vNext <= 0xD7
```

For that I obtained:
In the commit I made a claim about compatibility with streaming parsing via `binary-conduit`; for example:

```haskell
import Conduit
import Data.Conduit.Serialization.Binary -- from `binary-conduit`

-- Only reads a small part of the file:
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getWord8)

-- This reads the entire file via the conduit (which is not lazy IO):
runConduitRes $ sourceFile "bigfile.bin" .| sinkGet ((\a rest -> a) <$> getWord8 <*> getRemainingLazyByteString)
```
There are other usages of `L.index`. For example:

```
git grep '\.index' | grep -v '\bB\.index\b'
src/Codec/Picture/HDR.hs:    where at n = L.index str . fromIntegral $ idx + n
src/Codec/Picture/HDR.hs:    | otherwise = pure $ L.index inputData (fromIntegral idx)
src/Codec/Picture/Png.hs:    PixelRGBA8 r g b $ Lb.index transpBuffer (fromIntegral ix)
```
@Twinside It would be nice if you could tell me if the semi-lazy behaviour of JPEG parsing was accidental or intentional. If it was accidental, we could consider it a bug, and perhaps switch the implementation of the JPEG parser away from the quirky to the strict one (which so far I haven't done).
I will investigate.
…n value. Fixes regression reported in Twinside#215 (comment).

Also fix follow-up error that `parseFrames`, `parseFramesSemiLazy` and `parseToFirstFrameHeader` incorrectly tried to parse beyond the `JpgEndOfImage`.

Jpg: Fix `parseToFirstFrameHeader`
This should fix it: PR #216
merged! I'll push an update on hackage "soon©"
Makes JPEG parsing 1000x faster for large pictures (I tried with a 110 MB one).

It also fixes usage of the `binary` package's `Get` parser API that caused the `Get`-based functions to not correctly maintain the parser offset (how many bytes were consumed).

Please see the individual commit messages, especially the one of the commit titled `Jpg: Fix quadratic JPEG parsing`, for full details. Other preparatory refactoring commits are also included.