Hardening against adversarial input #85
Comments
Hello, thank you for opening this issue. We were aware that hardening was a peripheral concern for parquet-go; the target environment has always been one where the input can be trusted, i.e. reading back files that were written with this package. We can do more fuzzing to weed out the kind of bugs you've found, but I think you've hit the nail on the head here in identifying how tough hardening is.

Re. uncompressed size: I know snappy writes the uncompressed length in the first few bytes, and we track the compressed/uncompressed sizes of column chunks (which are composed of pages, which in turn contain one or more compressed blocks), so it should be theoretically possible to extract this information before reading the whole file.

I'll create some tickets to track the fuzzing and uncompressed-size-estimate efforts.
Yeah, broadly there are two classes:

A) Unvalidated input, like the crashers the fuzzer is finding.

B) Excessive memory use. I am far from an expert in Parquet, and it seems you have at least done a good deal of work in this department. This is of course much harder to track down, since you need to check every allocation based on sizes from input data.

Decompressed sizes can be checked. Assuming you have a maxSize limit, snappy can report the claimed decoded length up front:

```go
// using github.com/golang/snappy
n, err := snappy.DecodedLen(block)
if err != nil {
	return nil, err
}
if n > maxSize {
	return nil, errors.New("max block size exceeded")
}
```

Streaming decompressors like gzip have no such header, so the output can instead be bounded with a limited reader that fails hard rather than returning EOF:

```go
import (
	"bytes"
	"compress/gzip"
	"errors"
	"io"
	"io/ioutil"
)

// limitReader returns a Reader that reads from r
// but stops with an error after n bytes.
func limitReader(r io.Reader, n int64) io.Reader { return &limitedReader{r, n} }

// A limitedReader reads from R but limits the amount of
// data returned to just N bytes. Each call to Read
// updates N to reflect the new amount remaining.
// Unlike io.LimitedReader, it returns an error rather than EOF
// once the budget is exhausted.
type limitedReader struct {
	R io.Reader // underlying reader
	N int64     // max bytes remaining
}

func (l *limitedReader) Read(p []byte) (n int, err error) {
	if l.N <= 0 {
		return 0, errors.New("max block size exceeded")
	}
	if int64(len(p)) > l.N {
		p = p[0:l.N]
	}
	n, err = l.R.Read(p)
	l.N -= int64(n)
	return
}

func (gzipCompressor) DecompressBlock(block []byte) ([]byte, error) {
	buf := bytes.NewReader(block)
	r, err := gzip.NewReader(buf)
	if err != nil {
		return nil, err
	}
	ret, err := ioutil.ReadAll(limitReader(r, maxSize))
	if err != nil {
		return nil, err
	}
	return ret, r.Close()
}
```

Maybe you want to extend the limitReader approach to the other codecs as well. But again, decompression is just one of the places where memory can "explode" AFAICT.
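Beyond decompression, the comment above notes that every allocation driven by input-supplied sizes needs checking. A common guard is to reject an element count before calling make() when the elements could not possibly fit in the bytes remaining in the file. This is a minimal sketch under that assumption; checkedCount and its parameters are hypothetical, not parquet-go API.

```go
package main

import (
	"errors"
	"fmt"
)

// checkedCount validates an element count read from untrusted input
// before it is used to size an allocation. A count is rejected if the
// elements could not fit in the bytes remaining in the input, which
// stops a tiny file from requesting a multi-gigabyte slice.
func checkedCount(count uint64, remainingBytes int64, minElemSize int64) (int, error) {
	if remainingBytes < 0 || minElemSize <= 0 {
		return 0, errors.New("invalid bounds")
	}
	if count > uint64(remainingBytes)/uint64(minElemSize) {
		return 0, errors.New("element count exceeds remaining input size")
	}
	return int(count), nil
}

func main() {
	// 10 bytes of input cannot back 2^40 four-byte values.
	_, err := checkedCount(1<<40, 10, 4)
	fmt.Println(err != nil)

	// 2 four-byte values fit in 10 bytes.
	n, err := checkedCount(2, 10, 4)
	fmt.Println(n, err)
}
```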
@klauspost thanks for the report! I'm kinda surprised to see such obvious crashers, as we've put effort into fuzz testing in the past. I was able to reproduce this rather quickly, and all 5 crashers (so far) seem to stem from lack of input validation in code auto-generated by thrift. This makes me think that potentially all Go implementations of parquet suffer from the same or similar issues. I'll keep fuzz-testing this and try to fix the thrift code generator.
FYI: apache/thrift#2582 seems to fix the thrift-related crashers so far.
We've merged the fixes for the Thrift-related crashes: #86
@klauspost You can use the new v0.11.0 tag, which includes the above fixes: https://github.com/fraugster/parquet-go/releases/tag/v0.11.0
Confirmed, I now only see OOM crashes, but it can't really run very long before it dies.
Describe the bug
This library provides no protection against untrusted inputs.
It is trivial to write inputs that will cause a server to OOM or crash from another issue.
This makes it impossible to use this package unless you have full trust in the users uploading input.
In our MinIO server we are therefore forced to disable Parquet parsing in S3 Select, since the server may be running in an environment where the users are untrusted.
Adding limits and safety to this package would greatly extend its usefulness. I know from experience this can be hard, and there are a lot of paths to cover, but fuzz tests can usually guide you.
I made a fuzz test to test basic functionality. Even without a seed corpus it crashes within a minute, and letting it run a bit longer runs the OS out of resources as well. Ideally I would like to specify limits in terms of memory usage so we can make reasonable assumptions of how much memory each Reader will take at max.
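A harness along these lines typically wraps the parser so that panics from malformed input (index out of range, failed huge allocations, and so on) surface as recordable errors rather than killing the process. This is a generic sketch, not the fuzz test from this issue; safeParse and the crashy parser are hypothetical.

```go
package main

import "fmt"

// safeParse runs a parser over untrusted input and converts panics
// into ordinary errors, so a harness can log each crasher and keep
// going instead of terminating on the first bad input.
func safeParse(data []byte, parse func([]byte) error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("parser panicked: %v", r)
		}
	}()
	return parse(data)
}

func main() {
	// A stand-in parser that panics on short input, the way an
	// unvalidated length field can cause an out-of-range access.
	crashy := func(b []byte) error {
		_ = b[100]
		return nil
	}
	err := safeParse([]byte{1, 2, 3}, crashy)
	fmt.Println(err != nil)
}
```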
Unit test to reproduce
Pre 1.18 fuzz test:
Build+execute with:
This will start crashing within a minute.
Example crash:
I am not looking for a solution for the specific crash, but rather the class of crashes that can be triggered with malicious user inputs.
parquet-go specific details
Version: 0.10.0
Likely.