Fuzzing DuckDB with AFL++
Implemented fuzz tests:
- fuzz test the csv reader: function `read_csv()`
- fuzz test the json reader: function `read_json()`
- fuzz test the parquet reader: function `read_parquet()`
- fuzz test attaching duckdb storage files
- fuzz test processing write-ahead log (WAL) files
AFL++ is a fuzzer that comes with its own compiler. (Also see fuzzing_in_depth).
Fuzzing DuckDB with AFL++ consists of the following steps. Note that there are make commands to run most of these steps; they are listed further down.
- Create a target executable by compiling the duckdb library and the wrapper source code (which contains the main function) with the `afl-clang-fast++` compiler.
- Provide an input corpus with typical inputs (valid or invalid). Depending on the fuzz scenario, this can be:
  - data input files: csv, json, parquet - taken from `duckdb/data`
  - duckdb database files - created by script `create_duckdb_file_corpus.sh`
  - duckdb wal files (write-ahead log) - created by script `create_wal_file_corpus.sh`

  Note: for the 'multi-param' fuzzers, the input corpus needs to be pre-processed. See: Appendix A - encoding arguments to corpus files.
- Fuzzing itself

  `afl++` will call the target executable with various inputs based on the corpus, to see if it can make it crash. The steps are as follows (a simplified sketch of this flow is given after this list):
  - AFL++ fuzzer
    - creates (binary) test data based on the corpus, introducing semi-random mutations
    - repeatedly calls the target executable, each run providing different test data as input (stdin)
  - Target executable
    - reads the test data from stdin
    - if applicable, pre-processes the incoming data; depending on the fuzz target this may include:
      - fixing checksums, magic bytes, and size information, to prevent simple rejection scenarios
      - decoding prepended argument information (the argument information is not part of the file that needs to be ingested by duckdb)
    - stores the data that needs to be ingested in a temporary file or pipe
    - calls the duckdb library function (with the decoded arguments) to process the data
  - DuckDB library function
    - tries to ingest and process the test data; in case of bugs, this might crash the target executable
  - AFL++ fuzzer
    - keeps track of crashing scenarios and stores them as fuzz results
- Inspect fuzz results
  - The fuzz results need to be inspected to see if there are any inputs that resulted in 'crashes' or 'hangs'.
  - The fuzz results are copied from the container to this repository, into a new directory `fuzz_results`.
  - Note that many generated inputs will be invalid, but will give a graceful error. These cases are OK, so they will not go into the `crashes` or `hangs` subdirectories. Scenarios with bugs will be in the `crashes` or `hangs` subdirectories.
- Reproduce crashes

  Note: the `fuzz_results` directory contains the original inputs, before they were processed by the target executable! The reproduction depends on the fuzzer:
  - For csv, json, and parquet inputs without any additional arguments (`base_fuzzer` and `pipe_fuzzer`), the crash cases should be directly reproducible by importing these files into duckdb:

    ```
    $ duckdb -c "SELECT * FROM read_csv('my_csv_file')"
    $ cat my_csv_file | duckdb -c "SELECT * FROM read_csv('/dev/stdin')"
    $ duckdb -c "SELECT * FROM read_json('my_json_file')"
    $ cat my_json_file | duckdb -c "SELECT * FROM read_json('/dev/stdin')"
    $ duckdb -c "SELECT * FROM read_parquet('my_parquet_file')"
    ```
  - For `csv_single_param_fuzzer`, the crashes should be reproduced with script `test_csv_reader_single_param.py`. The reason is that the first byte contains the parameter scenario info and is not part of the actual csv input.
  - For `multi_param_fuzzer`, the crash cases should first be decoded with script `decode_multi_param_files.py`; see step 5 of Appendix A - encoding arguments to corpus files. After the crashes (or hangs) are decoded, you can use script `create_sqllogic_for_file_readers.py` to generate the corresponding sqllogic tests, to verify during debugging whether a potential bugfix was effective.
  - For duckdb file inputs (`duckdb_file_fuzzer`), the input files from AFL++ should be post-processed with script `fix_duckdb_file.py`. Afterwards, they can be reproduced by opening the duckdb file with duckdb:

    ```
    $ duckdb my_duckdb_file
    $ duckdb -c "ATTACH 'my_duckdb_file' AS tmp_db (READ_ONLY); use tmp_db; show tables;"
    ```
    To reproduce the entire output folder `fuzz_results/duckdb_file_fuzzer/default/crashes`, use script `test_duckdb_output.sh`.
  - For wal inputs (`wal_fuzzer`), the AFL++ input files should be post-processed with script `fix_wal_file.py`, or in bulk with `fix_wal_files.sh`. A corresponding database file is required to process the fixed wal file. Note that the database file is constant (`base_db`), while the wal file is different for each run of the fuzzer. To reproduce a crash, rename the crashing wal file to `base_db.wal`, place it next to `base_db`, and open `base_db` with duckdb. To create the `base_db`:

    ```
    source ./scripts/corpus_creation/create_base_db.sh
    create_base_db
    ```
    To reproduce the entire output folder `fuzz_results/wal_fuzzer/default/crashes`, first run `fix_wal_files.sh`, followed by `wal_replay.sh`.
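For illustration, the per-run flow of a target executable can be summarized with a minimal Python sketch. This is a hypothetical stand-in using the `duckdb` Python package; the real harnesses are C++ wrappers compiled with `afl-clang-fast++`:

```python
import sys
import tempfile

import duckdb

# read the (possibly mutated) test data that AFL++ provides on stdin
raw = sys.stdin.buffer.read()

# store the data to be ingested in a temporary file
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# call the duckdb reader; invalid input that raises a graceful duckdb
# error is fine - only a hard crash of the process counts as a finding
try:
    duckdb.sql(f"SELECT * FROM read_csv('{path}')").fetchall()
except duckdb.Error:
    pass
```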
Fuzz duckdb with afl++ by executing the following steps consecutively.
- Install Docker Desktop.
- Create an afl++ container and clone the latest version of duckdb into it. Compiling duckdb and running the fuzzer happens in this container.

  ```
  make afl-up
  ```
- Compile the executables that can be fuzz-tested:

  ```
  make compile-fuzzers
  ```
- Run one or more fuzz tests (see the `Makefile` for the corpus selection and fuzz options):

  ```
  make fuzz_csv_base
  make fuzz_csv_single_param
  make fuzz_csv_multi_param
  make fuzz_csv_pipe
  make fuzz_json_base
  make fuzz_json_multi_param
  make fuzz_json_pipe
  make fuzz_parquet_base
  make fuzz_parquet_multi_param
  make fuzz_duckdb_file
  make fuzz_wal_file
  ```
  Note: these make targets also create or select the required corpus.
- Inspect the fuzz results. See above, and the listing sketch after these steps.
- If there are 'crashes' or 'hangs', create reproducible cases to file an issue. See above.
- Clean up. The container keeps running unless explicitly stopped; don't skip this step, and verify with `docker ps`.

  ```
  make afl-down
  ```
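To get a quick overview of what a fuzzing session produced, the result directories can also be listed programmatically. A minimal sketch, assuming the `fuzz_results` layout described above and AFL++'s `id:...` naming of findings:

```python
from pathlib import Path

# AFL++ stores findings under <fuzzer>/default/crashes and <fuzzer>/default/hangs
for bucket in ("crashes", "hangs"):
    for case in sorted(Path("fuzz_results").glob(f"*/default/{bucket}/id:*")):
        print(f"{bucket}: {case} ({case.stat().st_size} bytes)")
```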
The fuzzing settings are currently hardcoded in the `fuzz-*` targets in the `Makefile`. To see all options (when the container is running):

```
make man-page
```
Normally, fuzz executables are compiled inside the AFL++ container with the `afl-clang-fast++` compiler.
Building the fuzz executables locally can be useful for development and debugging. Crash cases found by the fuzzer should be reproducible with the duckdb CLI, but if they are not, there is the option to debug with locally built fuzz executables.
Steps:
- Compile DuckDB source code

  Check out the duckdb source code (`duckdb/duckdb`) and compile it with the following flags:

  ```
  cd /my_path/duckdb
  make GEN=ninja BUILD_JSON=1 CRASH_ON_ASSERT=1
  ```
- Compile fuzz-executables

  Compile the fuzz-executables with the normal `clang++` compiler instead of `afl-clang-fast++`:

  ```
  make DUCKDB_LOCAL_DIR=/my_path/duckdb compile-fuzzers-local
  ```
  Alternatively, change the `DUCKDB_LOCAL_DIR` variable in the `Makefile`, so you can just use `make compile-fuzzers-local`.
- Run the fuzz-executables; feed the content of a file to the executable, for example (a bulk-replay sketch is given after the notes below):

  ```
  < test.csv ./build/csv_base_fuzzer
  ```
Notes:
- for `base_fuzzer` and `pipe_fuzzer` (ingesting csv/json/parquet data): no special considerations
- for `csv_single_param_fuzzer`: note that the first byte is used to determine the parameter for `read_csv()` and is not considered part of the data file.
- for `multi_param_fuzzer`: the executable assumes the input contains prepended argument info. Data from `fuzz_results` can be fed directly to the executable, since it is based on a corpus that also contains prepended argument info. Files created by `create_multi_param_corpus.py` can also be used directly.
- for `duckdb_file_fuzzer`: any duckdb file (created by afl++ or not) can be fed directly to the executable. Note that `duckdb_file_fuzzer` also executes a fixup script.
- for `wal_fuzzer`: any wal file (created by afl++ or not) can be fed directly to the executable; however, a `base_db` file also needs to be present. Note that `wal_fuzzer` also executes a fixup script.

  ```
  # create base_db in build dir
  source ./scripts/corpus_creation/create_base_db.sh
  cd build
  create_base_db
  cd ..
  # create valid wal files
  ./scripts/corpus_creation/create_wal_file_corpus.sh
  # run wal_fuzzer
  < ./corpus/walfiles/create1.wal ./build/wal_fuzzer
  ```
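To replay a whole directory of findings against a locally built fuzz-executable, a small driver script can help. A sketch, where the executable name and the `fuzz_results` subdirectory are examples following the layout described above:

```python
import subprocess
from pathlib import Path

FUZZER = "./build/csv_base_fuzzer"  # locally built with clang++
CRASHES = Path("fuzz_results/csv_base_fuzzer/default/crashes")

for case in sorted(CRASHES.glob("id:*")):
    with case.open("rb") as stdin:
        try:
            result = subprocess.run([FUZZER], stdin=stdin,
                                    capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            print(f"{case.name}: hang (timeout)")
            continue
    # a negative return code means the process died on a signal, i.e. a real crash
    if result.returncode < 0:
        print(f"{case.name}: killed by signal {-result.returncode}")
```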
Appendix A - encoding arguments to corpus files
To effectively fuzz the file readers, we not only want to test with a variety of csv/json/parquet files, but also want to run with different arguments like `ignore_errors=true`, `skip=42`, etc.
The AFL++ fuzzer, however, doesn't natively support this multi-dimensionality, since it produces a single byte stream as input for the fuzz target.
A way to circumvent this is by assuming that the first part of the byte input is encoded argument information, rather than the csv/json/parquet content itself.
By prepending the corpus files with encoded argument information, the fuzzer will create similar inputs for the fuzz executable.
Steps:

Run a make command (it will do steps 1-4 below, but not step 5):

```
make fuzz_csv_multi_param
make fuzz_json_multi_param
make fuzz_parquet_multi_param
```
1. Create a json file that lists the base corpus data together with the arguments.
   - script: `create_multi_param_corpus_info.py`
   - example:

     ```json
     [
         {
             "id": 1,
             "data_file": "data/csv/rejects/multiple_errors/invalid_utf_more.csv",
             "arguments": {
                 "store_rejects": "true",
                 "header": "1",
                 "max_line_size": "40",
                 "auto_detect": "false",
                 "columns": "{'name': 'VARCHAR', 'age': 'INTEGER', 'current_day': 'DATE', 'barks': 'INTEGER'}"
             }
         },
         {
             "id": 2,
             "data_file": "data/csv/auto/int_bol.csv",
             "arguments": {
                 "dtypes": "['varchar', 'varchar', 'varchar']"
             }
         }
     ]
     ```
2. Encode and prepend the additional arguments to the data files to create the corpus.
   - script: `create_multi_param_corpus.py`
   - See this article for the main idea.
   - The following encoding is used (a Python sketch is given after these steps):
     - single header byte: 1 byte (unsigned char) with the number of arguments
     - followed by an encoding per argument:
       - 1 byte: param_name (enum)
       - 1 byte: length of argument value (max 255) -> N
       - N bytes: argument value
     - N values per data type:
       - BOOLEAN: N=1 (odd values decode to true, even values to false)
       - INTEGER: N=8 (8 byte signed integer)
       - DOUBLE: N=8 (8 byte double precision float)
       - VARCHAR: N=[0-255] depending on length of argument value
3. The fuzzer generates inputs based on the prepended corpus files. Therefore it might mutate both the leading bytes with the argument info and the remainder with the actual input file. Since the parameter names are stored as an enum, the parameters used in the function call might change, as well as their values. The fuzzer might also change the N values; if an N value is incompatible with the argument data type, fallback values are used.
4. The target executable decodes the prepended argument bytes and trims them from the input data. The duckdb function is called with the decoded argument string.
5. To reproduce crashes found this way, use `decode_multi_param_files.py`. This recreates the input files in their original format, along with the argument string that caused the crash when reading them (stored in file `_REPRODUCTIONS.json`). Optionally, you can use `create_sqllogic_for_file_readers.py` to create sqllogic tests for every crash case. Alternatively, the reproducible scenarios in `_REPRODUCTIONS.json` can be executed manually, or by scripts like:
   - `test_csv_reader_with_args.py`
   - `test_json_reader_with_args.py`
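As an illustration of the encoding in step 2, here is a minimal Python sketch that builds such a prefix and prepends it to a data file. The parameter enum values below are made up; the real mapping is defined by `create_multi_param_corpus.py` and the fuzz executables:

```python
import struct

# hypothetical parameter enums; the real values are defined by the harness
PARAM_DELIM, PARAM_SKIP, PARAM_HEADER = 7, 12, 3

def encode_argument(param: int, payload: bytes) -> bytes:
    """One argument: 1 byte param enum, 1 byte length N, then N payload bytes."""
    assert len(payload) <= 255
    return bytes([param, len(payload)]) + payload

arguments = [
    encode_argument(PARAM_DELIM, b";"),                  # VARCHAR: N = len(value)
    encode_argument(PARAM_SKIP, struct.pack("<q", 42)),  # INTEGER: N = 8 (endianness assumed)
    encode_argument(PARAM_HEADER, b"\x01"),              # BOOLEAN: N = 1, odd decodes to true
]

# single header byte with the number of arguments, then the encoded
# arguments, then the unmodified csv/json/parquet file
with open("data.csv", "rb") as src, open("corpus_entry.bin", "wb") as dst:
    dst.write(bytes([len(arguments)]) + b"".join(arguments) + src.read())
```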