How to work with large data models #49
Replies: 6 comments 8 replies
-
Just to note, MF6 distinguishes array and list data. Arrays are always homogeneous and often (but not always) represent some variable living on features of the grid; I guess xarray would only work where this is true? Lists can be homogeneous (elements all having the same record type) or heterogeneous (elements can be any of several record types). The former are amenable to numpy, pandas, providing columns separately, etc.; the latter not so much. Period data can be heterogeneous for some packages and so will not always be representable in a tabular format. Also, for homogeneous arrays, maybe we could consider type hinting input context class attributes, e.g. with explicit array types (a rough sketch follows).
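To make that concrete, here is a minimal sketch of what typed input context attributes might look like, assuming attrs-style classes; the class and field names (`DisInput`, `WelRecord`, `EvtRecord`, `PeriodInput`) are hypothetical and not existing flopy or MF6 API:

```python
from typing import List, Union

import numpy as np
import numpy.typing as npt
from attrs import define


@define
class WelRecord:
    # one homogeneous record type: (layer, row, col, q)
    layer: int
    row: int
    col: int
    q: float


@define
class EvtRecord:
    # a second record type, to illustrate a heterogeneous list
    layer: int
    row: int
    col: int
    surface: float
    rate: float


@define
class DisInput:
    # homogeneous array data living on the grid: a natural fit for typed arrays / xarray
    botm: npt.NDArray[np.float64]


@define
class PeriodInput:
    # heterogeneous period data: a union of record types, not easily tabular
    records: List[Union[WelRecord, EvtRecord]]
```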
-
We've discussed/prototyped a modflow-specific duck array type, supporting e.g. an optional multiplication factor, convenient layer access, and optimized storage for constant arrays. I wonder now if xarray can do at least the first two: layer as a labeled dimension and/or coordinate, and the factor stored as metadata and accessed/applied at write time.
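A small sketch of how that could look with a plain `xr.DataArray`, assuming the factor is kept in `attrs` and only applied by a (hypothetical) writer:

```python
import numpy as np
import xarray as xr

# "layer" as a labeled dimension with a coordinate; the factor kept as metadata
hk = xr.DataArray(
    np.ones((3, 10, 10)),
    dims=("layer", "row", "col"),
    coords={"layer": [1, 2, 3]},
    attrs={"factor": 0.01},  # metadata only; not baked into the stored values
)

hk_layer2 = hk.sel(layer=2)  # convenient layer access by label; attrs are carried along

def write_layer(da: xr.DataArray, path: str) -> None:
    # hypothetical writer: the factor is applied only when producing MF6 input text
    np.savetxt(path, da.values * da.attrs.get("factor", 1.0))

write_layer(hk_layer2, "hk_layer2.txt")
```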
-
Some questions we might be able to answer with prototypes in the near term:
*These will require some benchmarking/profiling/performance testing, comparing flopy3 legacy IO with numpy/pandas/xarray IO.
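A bare-bones sketch of the kind of benchmark harness this implies, assuming we simply time repeated writes of the same array; the array shape and file names are arbitrary, and a real comparison would call the actual flopy3 legacy writers and an xarray path alongside these:

```python
import timeit

import numpy as np
import pandas as pd

data = np.random.default_rng(0).random((50, 200, 200))  # (layer, row, col), ~16 MB

def write_numpy():
    np.save("arr.npy", data)

def write_pandas():
    pd.DataFrame(data.reshape(data.shape[0], -1)).to_csv("arr.csv", index=False)

for writer in (write_numpy, write_pandas):
    seconds = timeit.timeit(writer, number=3) / 3
    print(f"{writer.__name__}: {seconds:.2f} s per write")
```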
-
A question: should we send all array data through xarray/dask and do lazy/chunked operations on it, or operate on data below a certain size in memory?
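One possible answer, sketched: route arrays above a size threshold through dask-backed xarray and keep small ones as plain in-memory numpy. The threshold, chunking, and helper name are arbitrary assumptions, not an agreed design:

```python
import dask.array as da
import numpy as np
import xarray as xr

SIZE_THRESHOLD_BYTES = 64 * 1024 * 1024  # 64 MB; an arbitrary cutoff

def as_model_array(values: np.ndarray, dims: tuple) -> xr.DataArray:
    """Wrap large arrays in dask so operations stay lazy/chunked; keep small ones in memory."""
    if values.nbytes > SIZE_THRESHOLD_BYTES:
        values = da.from_array(values, chunks="auto")
    return xr.DataArray(values, dims=dims)

small = as_model_array(np.zeros((10, 10)), ("row", "col"))                   # plain numpy
large = as_model_array(np.zeros((20, 1000, 1000)), ("layer", "row", "col"))  # dask-backed
```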
-
We were discussing how xarray and cattrs would work together, as in PR #62. cattrs always returns a dictionary of some sort, so that it can be passed into packages like pyyaml or tomlkit, and it provides the conversion for datetimes and encoding. But cattrs is not made for file IO: converting everything into one big string would pull it all into memory at once. It would be better to use something like jinja for file writing and just pass it the actual Python instance of the simulation/model/package. We can write filter functions that handle the dask chunking while writing large datasets from a netCDF file / lazy dask array. This is something we still need to prototype.
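A first sketch of that idea, assuming a jinja2 template that receives the package object directly and a filter that walks a dask array block by block; the template text, the filter names, and the `Package` stand-in class are made up for illustration and are not existing flopy or MF6 API:

```python
import dask.array as da
import numpy as np
from jinja2 import Environment

def iter_blocks(arr, rows=1000):
    # yield row blocks; slices of a dask array are only computed here, one block at a time
    for start in range(0, arr.shape[0], rows):
        yield np.asarray(arr[start:start + rows])

def block_to_text(block):
    return "\n".join(" ".join(f"{value:.6g}" for value in row) for row in block)

env = Environment()
env.filters["blocks"] = iter_blocks
env.filters["astext"] = block_to_text

template = env.from_string(
    "BEGIN GRIDDATA\n"
    "  TOP\n"
    "{% for block in package.top | blocks %}{{ block | astext }}\n{% endfor %}"
    "END GRIDDATA\n"
)

class Package:
    # stand-in for an actual simulation/model/package instance with a lazy array attribute
    top = da.random.random((5000, 100), chunks=(1000, 100))

# stream() + dump() write the file incrementally, so no single giant string is built in memory
template.stream(package=Package()).dump("griddata.txt")
```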
-
One of the bigger questions we still have is how we are going to handle large datasets, which also depends on the type of data (grid based / array based).
We should investigate how we can work with large data models where the data is chunked, preferably on disk. We see the benefits of using a package like dask; packages such as xarray and uxarray can be backed by dask arrays and interoperate with numpy. We are wondering whether a package like uxarray also supports our more complex DISU model (a fully unstructured 3D grid).
We also have to consider how we are going to let users provide data to the flopy package. That can be via indexed arrays, but it can also be via a grid mask where NaNs are filled in wherever no data is provided (both forms are sketched below). Depending on the sparsity of the data that is typically used, we can decide which form we want users to provide via the API.
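The two input styles mentioned above, sketched side by side; whether flopy should accept one, the other, or both is exactly the open question, and the array shapes and column names are illustrative only:

```python
import numpy as np
import pandas as pd

# 1) indexed ("list") input: only the cells that have data, as a table (0-based indices here)
indexed = pd.DataFrame(
    {"layer": [0, 0, 1], "row": [3, 7, 3], "col": [4, 4, 9], "rate": [-150.0, -80.0, -60.0]}
)

# 2) grid-mask ("array") input: a full dense grid with NaN wherever no data is provided
dense = np.full((3, 10, 10), np.nan)
dense[0, 3, 4] = -150.0
dense[0, 7, 4] = -80.0
dense[1, 3, 9] = -60.0

# the dense form can be reduced back to the indexed form when writing MF6 input
layer, row, col = np.nonzero(~np.isnan(dense))
recovered = pd.DataFrame(
    {"layer": layer, "row": row, "col": col, "rate": dense[layer, row, col]}
)
```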
Packages of interest:
xarray supports on-disk formats such as zarr and netCDF: https://docs.xarray.dev/en/stable/api.html#dataset-methods. Letting users open these data formats would only give us a reference to the data, and perhaps let us write into it if needed. Conversion to the MF6 input format can happen at the moment the data actually needs to be written, and not earlier, which is more efficient.
If MF6 supported netCDF input directly, no conversion would be needed at all, saving even more time.
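A sketch of that deferred-conversion idea: open the on-disk dataset lazily and only materialize values when the MF6 input actually has to be written. The file name, the "botm" variable, and the "layer" coordinate are assumptions, and this requires dask plus a netCDF engine to be installed:

```python
import numpy as np
import xarray as xr

# lazy, dask-backed reference to the on-disk data; nothing is loaded yet
ds = xr.open_dataset("model_input.nc", chunks={"layer": 1})

def write_mf6_array(da: xr.DataArray, path: str) -> None:
    # conversion to MF6 text happens here, one layer at a time, not at load time
    with open(path, "w") as f:
        for layer in da["layer"].values:
            block = da.sel(layer=layer).values  # data is actually read only at this point
            np.savetxt(f, block)

write_mf6_array(ds["botm"], "botm.txt")
```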