prototype attrs data model #24

Merged 3 commits on Sep 1, 2024

docs/dev/sdd.md: 100 additions & 101 deletions

- [Objects](#objects)
- [Counting the ways](#counting-the-ways)
- [Context resolution](#context-resolution)
- [Dictionary mimicry](#dictionary-mimicry)
- [Impedance mismatch](#impedance-mismatch)
- [Parameters](#parameters)
- [Arrays](#arrays)
- [Tables](#tables)
- [Lazies](#lazies)
- [Signals](#signals)
- [Units](#units)
- [Code generation](#code-generation)
- [IO](#io)
- [Overview](#overview-1)
- [Decorators](#decorators)
- [Converters](#converters)
- [Codecs](#codecs)
- [Parsers](#parsers)

like plotting, aggregations, input/output, etc.
IO, for instance, is at the boundary of an application and should only affect
the object model in rare instances.

2. ...

## Overview

FloPy can provide a generic framework for hydrologic
models.

FloPy can consist of plugins, each defining a wrapper
for a given hydrologic program. Programs are expected
to provide an unambiguous input specification.

FloPy can provide a basic set of building blocks with
which a program's input parameters can be defined and
configured. This will consist of parameters in nested
contexts. (This is made more rigorous below.)

Provided an input specification for a program, FloPy
can generate an object-oriented Python interface for
it. This will consist of an **object model** (input
data model) and **IO module** (data access layer).

Once these exist, they *are* the specification —
specification documents should be derivable from them
in reverse.

FloPy will provide a **plugin runtime** which accepts
a program selection and an input configuration.

report its progress, and make its results available.
### Runtime

FloPy will provide a plugin runtime whose purpose is to
wrap and run arbitrary hydrologic programs. **Simulation**
is the fundamental abstraction: we could consider the
simulation a *plan for how to execute a program*.

This is at odds with the standard terminology in MODFLOW 6,
where a simulation means the runtime itself. FloPy, as
an interface to programs, could reasonably call the thing
that becomes the simulation the simulation; this seems a
benign (and maybe even appropriate) effacement of a
meaningful distinction, for reasons of precedent and
familiarity.

A distinct abstraction could represent the "task" that
runs the program. A third could represent its output.
The latter should be derivable from the simulation, if
results are available in a given workspace, so results
can still be retrieved easily in a subsequent session,
or by someone else given the workspace contents.

Runs could have an autogenerated GUID and an optional
name. Anonymous runs' names could default to the GUID.
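
For illustration, a minimal sketch (the `Run` class and its
fields are hypothetical, not an existing FloPy API):

```python
# Sketch only: `Run` is hypothetical, not an existing FloPy class.
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Run:
    # Autogenerated GUID identifying the run.
    guid: str = field(default_factory=lambda: str(uuid4()))
    # Optional name; anonymous runs fall back to the GUID.
    name: str | None = None

    def __post_init__(self):
        if self.name is None:
            self.name = self.guid
```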

Scheduling seems like it may benefit from asynchrony.
While programs should ideally make maximal use of the
resources provided to them, one might want to run more
than one single-threaded program at once, without the

An awaitable (coroutine-based) API, returning futures
instead of blocking, could allow an arbitrary number
of concurrent runs.

If this is pursued, a synchronous alternative should
be provided which runs programs directly, as is done
now.
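
A rough sketch of how the awaitable API and its synchronous
fallback might look; the function names and the use of `mf6`
as a default executable are purely illustrative:

```python
# Sketch only: names and structure are illustrative, not an existing FloPy API.
import asyncio
from pathlib import Path


async def run_async(workspace: Path, exe: str = "mf6") -> int:
    """Launch the program without blocking; return its exit code."""
    proc = await asyncio.create_subprocess_exec(exe, cwd=workspace)
    return await proc.wait()


def run(workspace: Path, exe: str = "mf6") -> int:
    """Synchronous alternative: block until the program finishes."""
    return asyncio.run(run_async(workspace, exe))


async def run_many(workspaces: list[Path]) -> list[int]:
    """Schedule an arbitrary number of concurrent runs."""
    return await asyncio.gather(*(run_async(ws) for ws in workspaces))
```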

### Plugins

provide access to model results.

We want:

- an intuitive, consistent, & expressive interface
  to a broad range of programs

- a small core codebase and a largely autogenerated
  user-facing input data model

- an unsurprising and uncomplicated core framework
  accessible to new contributors

- more consistent (and fewer) points of entry

- easy access to a program's input specification

- easy access to a simulation's input configuration

- hierarchical namespacing and context resolution

- automatic enforcement of program invariants

...and more.

The latter aim to make it easier to give a class a nice

`dataclasses` is derived from an older project called
[`attrs`](https://www.attrs.org/en/stable/) which has
some extra powers, including hooks for validation and
transformation, introspection tools, and more.

Since `attrs` solves several of our problems at once,
we aim to prototype the core object model on it.
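
To make this concrete, a minimal sketch of an `attrs`-based
parameter container with a validation hook and a discoverable
spec (the class and field names are invented for illustration):

```python
# Sketch only: the class and fields are illustrative, not part of FloPy.
from attrs import define, field, fields
from attrs.validators import gt, instance_of


@define
class Options:
    # Validators run on construction and enforce simple invariants.
    save_flows: bool = field(default=False, validator=instance_of(bool))
    budget_file: str | None = field(default=None)
    hclose: float = field(default=1e-4, validator=[instance_of(float), gt(0.0)])


# The parameter spec is discoverable via introspection.
for f in fields(Options):
    print(f.name, f.type, f.default)
```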

#### Context resolution

context. This will support hierarchical addressing, as
is used for the MF6 memory manager. It may also inform
certain user-facing operations (for instance, a method
may work differently if a component is independent vs
an element in a simulation). This should also help to
provide nice string representations.

It will be convenient to ask a component: what are you
attached to? The component should be able to produce a
tree showing its own position in context.

Parent pointers might be implemented as weak references
to avoid memory leaks; e.g., if a component is removed
from a simulation and the simulation is discarded, then
we want the garbage collector to be free to collect it,
with a finalization callback to set parent references
to `None`.
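
A sketch of the weak-reference idea, assuming a hypothetical
`Component` class:

```python
# Sketch only: a hypothetical component with a weakly referenced parent.
import weakref


class Component:
    def __init__(self, name, parent=None):
        self.name = name
        self._parent = None
        if parent is not None:
            # Hold a weak reference so the parent can be collected freely.
            self._parent = weakref.ref(parent)
            # When the parent is collected, reset the back-pointer.
            weakref.finalize(parent, self._clear_parent)

    def _clear_parent(self):
        self._parent = None

    @property
    def parent(self):
        return self._parent() if self._parent is not None else None
```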

#### Dictionary mimicry

The dictionary is a ubiquitous data container, useful
for e.g. passing keyword arguments, and for potential

extra column `file_name` or similar, which identifies
the DFN file and the component it specifies).

From this, FloPy must generate a nested object model.
This means distinguishing scalars from composites in
MF6's case; in general, it requires mapping an input
specification of arbitrary structure and content to a
program-agnostic data model.

### Parameters

A **parameter** is a program input variable.

A parameter is a leaf in the **context tree**. The
simulation is the root.

A parameter is a primitive value or a **composite**
of such.

Primitive parameters are **scalar** (int, float, bool,
string, path), **array-like**, or **tabular**.

> [!NOTE]
> Ideally a data model would be dependency-agnostic,
> but we view NumPy and Pandas as de facto standard
> libraries and accept them as array/table primitives.
> If there is ever a need to provide arrays/tables
> of our own, we could take inspiration from
> [astropy](https://github.com/astropy/astropy).

Composite parameters are **record** and **union**
(product and sum, respectively) types, as well as
**lists** of primitives or records. A record is a
named and ordered tuple of primitives.

A record's parameters must all be scalars, except
for its last parameter, which may be a sequence of
scalars (such a record could be called *variadic*;
it is a value constructor with unspecified arity).

> [!NOTE]
> A record is a `Dict` for practical purposes. It
> will need implementing as an `attrs`-based class,
> though, so its parameter spec is discoverable
> upon import.

A list may constrain its elements to parameters of
a single scalar or record type, or may hold unions
of such.

> [!NOTE]
> On this view, an MF6 keystring is a `typing.Union`
> of records and a period block is a list of `Union`s
> of records.

A context is a map of parameters. So is a record;
the operative difference is that records cannot
contain nested contexts. A context is a non-leaf
node in the tree which can contain both parameters
and other contexts.

We envision a nested hierarchy of `attrs`-based
classes, all acting like dictionaries, making up
the context tree. These will include composites:
strongly typed records and unions will be more
convenient to work with.

So, FloPy can define a parameter as:

```python
from pathlib import Path
from typing import Dict, List

from numpy.typing import ArrayLike
from pandas import DataFrame

Scalar = bool | int | float | str | Path
Record = Dict[str, Scalar | List[Scalar]]
List = List[Scalar | Record]  # rebinds the name `List` (shadowing typing.List)
Array = ArrayLike
Table = DataFrame

Param = Scalar | Record | List | Array | Table
```

This is proposed as a general foundation onto which
it should be possible to map input specifications
for a wide range of programs, not only MODFLOW 6.
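
For illustration, a sketch of how MF6-style records and
keystrings might map onto this foundation (the record
definitions below are invented, not actual MF6 components):

```python
# Sketch only: these record and keystring definitions are invented
# for illustration; they are not actual MF6 component definitions.
from typing import List, Union

from attrs import define, field


@define
class Delay:
    # A record: a named, ordered tuple of scalars.
    pname: str
    lag: float


@define
class Aux:
    # A variadic record: the trailing parameter is a sequence of scalars.
    name: str
    values: List[float] = field(factory=list)


# A keystring maps to a union of records...
Setting = Union[Delay, Aux]

# ...and a period block to a list of such unions.
Period = List[Setting]
```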

#### Arrays

We can store parameter specification information
in the `DataFrame.attrs` property or by way of
[custom accessors](https://pandas.pydata.org/pandas-docs/stable/development/extending.html#registering-custom-accessors).
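
For example, a sketch of the custom-accessor route (the
accessor name and metadata keys are hypothetical):

```python
# Sketch only: accessor name and metadata keys are hypothetical.
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("flopy")
class FloPyAccessor:
    def __init__(self, df: pd.DataFrame):
        self._df = df

    @property
    def spec(self) -> dict:
        # Parameter specification stashed in DataFrame.attrs.
        return self._df.attrs.get("spec", {})


df = pd.DataFrame({"k": [1.0, 2.0]})
df.attrs["spec"] = {"k": {"type": "double", "block": "griddata"}}
print(df.flopy.spec)
```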

#### Lazies

We recognize a distinction between two types of
parameter: configuration and data. This isn't

```mermaid
sequenceDiagram
    MF6Parser-->DFN: defines grammar
```

#### Decorators

A small set of class decorators could provide unified access to
IO for object model classes. Alternatively these could be mixins.
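
A sketch of what such a decorator might look like; the
decorator and the attached `load`/`dump` methods are
hypothetical:

```python
# Sketch only: a hypothetical class decorator attaching IO methods.
import json
from pathlib import Path

from attrs import asdict, define


def io(cls):
    """Attach simple load/dump methods to an attrs-based class."""

    def dump(self, path: Path) -> None:
        Path(path).write_text(json.dumps(asdict(self)))

    def load(cls_, path: Path):
        return cls_(**json.loads(Path(path).read_text()))

    cls.dump = dump
    cls.load = classmethod(load)
    return cls


@io
@define
class Options:
    save_flows: bool = False
```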