Downfile can be used to serialize any data from Python in a controlled manner. The data is stored as a set of components in a ZIP file, and each component is stored in a standard format, such as JSON (for dictionaries etc.) or Feather (for Pandas DataFrames).
To serialize or deserialize new types, methods can be registered using setuptools `entry_points`.
Since a (de)serializer has to be written manually for each type, it does not have the same security and compatibility issues that Pickle has, but instead comes with a slightly higher development overhead.
Example usage:

    >>> import pandas as pd
    >>> import downfile
    >>> data = {"bar": pd.DataFrame({"foo": [1, 2, 3]}), "fie": "hello"}
    >>> downfile.dump(data, "test.down")
    >>> data2 = downfile.parse("test.down")
    >>> data2
    {'bar':    foo
    0    1
    1    2
    2    3, 'fie': 'hello'}
Downfile has built-in support for the following data types; a round-trip sketch mixing several of them follows the list:

- Python base types: int, float, bool, str, dict, list
- Python builtin exceptions
- Pandas DataFrames
- Numpy arrays
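A minimal sketch of such a mixed round trip (the file name and values are illustrative, not taken from the library's documentation):

    import datetime
    import numpy as np
    import pandas as pd
    import downfile

    data = {
        "config": {"threshold": 0.5, "enabled": True},           # base types
        "samples": np.array([1.0, 2.0, 3.0]),                    # numpy array
        "table": pd.DataFrame({"x": [1, 2], "y": ["a", "b"]}),   # DataFrame
        "created": datetime.datetime(2023, 1, 1, 12, 0),         # datetime handler
        "error": ValueError("example"),                          # builtin exception
    }

    downfile.dump(data, "mixed.down")
    restored = downfile.parse("mixed.down")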
To add support for additional types, register your own handlers. In the `setup.py` of your own package (say `mypippackage`), add:

    entry_points = {
        'downfile.dumpers': [
            'somepackage.somemodule.DataType=mypippackage.mymodule:dumper',
        ],
        'downfile.parsers': [
            'mypippackage.myformat=mypippackage.mymodule:parser',
        ]}
Then, in `mypippackage.mymodule`, provide the following two methods:

    def dumper(file, obj):
        # Here `mypippackage.myformat` is the filename extension.
        # If the file format has a standard extension, such as `.png`, `.csv` etc.,
        # you might want to use that here instead.
        name = file.new_file("mypippackage.myformat")
        with file.open_buffered(name, "w") as f:
            someFunctionToWriteObjToFile(obj, f)
        # Here `mypippackage.myformat` is the key used to find `parser` in `setup.py` later.
        return {"__jsonclass__": ["mypippackage.myformat", [name]]}

    def parser(file, obj):
        name = obj["__jsonclass__"][1][0]
        with file.open_buffered(name, "r") as f:
            return someFunctionToReadObjFromFile(f)
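As a concrete (if simplified) sketch, here is what a pair of handlers for Python's built-in `set` could look like, storing the set as a JSON array in its own file. The module name `mypippackage.sets`, the format name `mypippackage.set`, and the assumption that built-in types are registered under their fully qualified name (`builtins.set`) are all illustrative, not part of downfile itself:

    # mypippackage/sets.py
    import json

    def dump_set(file, obj):
        # Store the set as a JSON array in its own file inside the Downfile.
        # Assumes the set's elements are themselves JSON-serializable.
        name = file.new_file("mypippackage.set")
        with file.open_buffered(name, "w") as f:
            json.dump(list(obj), f)
        return {"__jsonclass__": ["mypippackage.set", [name]]}

    def parse_set(file, obj):
        name = obj["__jsonclass__"][1][0]
        with file.open_buffered(name, "r") as f:
            return set(json.load(f))

The corresponding entry points would then be `builtins.set=mypippackage.sets:dump_set` under `downfile.dumpers` and `mypippackage.set=mypippackage.sets:parse_set` under `downfile.parsers`.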
`mypippackage.myformat` can be any string that is reasonably unique, typically the file extension used by the file format you're using for serialization. However, it is good practice to include the pip package name of your package, so that people can easily find out which packages are missing when they fail to parse a file!
If you're familiar with JSON RPC class hinting, you're probably wondering if dumper really has to write a file, or if it could just return some JSONifyable data. And the answer is nope, it doesn't need to write a file. If you're curious about serializing small objects, check out the datetime handler.
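A minimal sketch of such a file-less handler, here for Python's `complex` type (the format name `mypippackage.complex` and the registration key `builtins.complex` are illustrative assumptions):

    def dump_complex(file, obj):
        # No file is written; the class hint itself carries all the data.
        return {"__jsonclass__": ["mypippackage.complex", [obj.real, obj.imag]]}

    def parse_complex(file, obj):
        real, imag = obj["__jsonclass__"][1]
        return complex(real, imag)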
To recursively encode some component value of the data you're encoding, you can use `downfile.formats.format_json.to_json_string(downfile, v)`. This will encode the value `v` to a JSON string and return it. The returned JSON will use the same class hinting structure used in the main JSON file to serialize any complex type to external files.
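For instance, a dumper for a container whose elements may themselves be complex types could encode each element recursively. This sketch assumes the first argument to `to_json_string` is the `Downfile` instance passed to your dumper; decoding the inner JSON strings on the parser side is not covered here:

    import json
    import downfile.formats.format_json

    def dump_container(file, obj):
        name = file.new_file("mypippackage.container")
        # Each element becomes a JSON string using downfile's class hinting,
        # so elements like DataFrames still end up in their own files as usual.
        encoded = [downfile.formats.format_json.to_json_string(file, v) for v in obj]
        with file.open_buffered(name, "w") as f:
            json.dump(encoded, f)
        return {"__jsonclass__": ["mypippackage.container", [name]]}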
The `file` argument to `dumper`/`parser` above is an instance of `downfile.Downfile`, which is a subclass of `zipfile.ZipFile` that implements a few extra methods: `new_file(extension)` returns a new unique filename, and `open_buffered(filename, mode="r"|"w")` works like `open()`, but uses a temporary file so that multiple files can be opened concurrently (`zipfile.ZipFile.open()` does not support this).
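This matters when a single dumper needs several member files open at once, which plain `zipfile.ZipFile.open()` would refuse; a sketch (format and extension names are illustrative):

    import json

    def dump_pair(file, obj):
        data_name = file.new_file("mypippackage.data")
        meta_name = file.new_file("mypippackage.meta")
        # Both member files are open at the same time, which works because
        # open_buffered stages each one in a temporary file.
        with file.open_buffered(data_name, "w") as data_f, \
             file.open_buffered(meta_name, "w") as meta_f:
            json.dump(obj["data"], data_f)
            json.dump(obj["meta"], meta_f)
        return {"__jsonclass__": ["mypippackage.pair", [data_name, meta_name]]}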
The Downfile format itself follows a few rules:

- A Downfile is a zip file
- A Downfile must contain a JSON file named `0.json`
- This JSON file must contain an object with a key `root`
  - The content of the `root` key is considered the content of the entire Downfile.
- Any file inside a Downfile can reference additional files inside the Downfile using relative paths
- Any JSON file inside a Downfile can use JSON RPC 1.0 class hinting
- A class hint of `{"__jsonclass__": ["mypippackage.myformat", ["filename.ext"]]}` must be used for data that is stored in a separate file inside the Downfile
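Since a Downfile is just a zip archive, the layout above can be inspected with nothing but the standard library; a sketch using the `test.down` file from the example at the top (member names will vary):

    import json
    import zipfile

    with zipfile.ZipFile("test.down") as zf:
        print(zf.namelist())        # '0.json' plus e.g. a Feather file for the DataFrame
        with zf.open("0.json") as f:
            doc = json.load(f)
        # The payload sits under "root"; complex values appear as
        # {"__jsonclass__": [format, [filename, ...]]} hints that point
        # at other members of the archive.
        print(doc["root"])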
The built-in handlers use the following encodings.

Datetime and date objects:

- Uses the class hints `{"__jsonclass__": ["datetime.datetime", ["%Y-%m-%d %H:%M:%S"]]}` and `{"__jsonclass__": ["datetime.date", ["%Y-%m-%d"]]}`
- Does not store any external file
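So a hinted datetime can be decoded even without downfile; a minimal sketch, assuming the hint's single argument is the timestamp rendered with the format shown above:

    import datetime

    hint = {"__jsonclass__": ["datetime.datetime", ["2024-06-01 12:30:00"]]}
    value = datetime.datetime.strptime(hint["__jsonclass__"][1][0],
                                       "%Y-%m-%d %H:%M:%S")

Note that the format has one-second resolution, so sub-second precision would not survive a round trip.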
Exceptions:

- Uses the class hint `{"__jsonclass__": ["exception"]}`
  - The `args` property of the class hint object holds the exception arguments
  - The `type` property of the class hint object holds a list of string names for all classes in the inheritance list of the exception, most specific first. The names are prefixed with their respective module name.
- Does not store any external file
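As a hand-written illustration of that structure (not output copied from the library), a `KeyError("missing")` would be hinted roughly like this:

    hint = {
        "__jsonclass__": ["exception"],
        "args": ["missing"],
        # Most specific class first, each prefixed with its module name;
        # the exact depth of the inheritance list may differ.
        "type": [
            "builtins.KeyError",
            "builtins.LookupError",
            "builtins.Exception",
            "builtins.BaseException",
        ],
    }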
Pandas DataFrames:

- Uses the class hint `{"__jsonclass__": ["feather", [name]]}`
- Stored as a Feather file
- Any object column will have its cell values encoded as JSON with the same class hinting used for the main JSON file.
- To allow for more complex columns and indices (e.g. multilevel or numeric columns) not supported by the Feather format, the columns and index can optionally be converted to dataframes and stored separately (using the same method)
  - The index is stored in a property `index` on the class hint object, and its name in a property `index_name`
  - The columns are stored in a property `columns` on the class hint object, and its name in a property `columns_name`
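For example, a frame with an object column and a MultiIndex should survive a round trip; a sketch (the separate storage of index and columns described above happens internally):

    import datetime
    import pandas as pd
    import downfile

    df = pd.DataFrame(
        {
            "value": [1.5, 2.5],
            # datetime.date values give an object column, so each cell is
            # JSON-encoded with the usual class hinting.
            "seen": [datetime.date(2024, 1, 1), datetime.date(2024, 1, 2)],
        },
        index=pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["grp", "n"]),
    )

    downfile.dump({"frame": df}, "frame.down")
    restored = downfile.parse("frame.down")["frame"]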
Numpy arrays:

- Uses the class hint `{"__jsonclass__": ["npy", [name]]}`
- Stored as an NPY file
- If the dtype is `object`, values are encoded as JSON with the same class hinting used for the main JSON file, meaning that numpy's pickle-based encoder is never triggered.
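For example, an object-dtype array with mixed contents should round-trip without pickle ever being involved; a sketch:

    import datetime
    import numpy as np
    import downfile

    arr = np.array([datetime.date(2024, 1, 1), "mixed", 42], dtype=object)
    downfile.dump({"arr": arr}, "values.down")
    restored = downfile.parse("values.down")["arr"]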