From 9e09242d36af5e0ff2542714627e5cd7b6b9ea1d Mon Sep 17 00:00:00 2001 From: Tom Clark <> Date: Fri, 13 Sep 2019 13:00:53 +0100 Subject: [PATCH 1/2] DOC Extended data framework section --- docs/source/examples.rst | 1 + docs/source/schema.rst | 295 +++++++++++++++++++- docs/source/schema_other_considerations.rst | 88 ++++++ docs/source/version_history.rst | 1 + 4 files changed, 376 insertions(+), 9 deletions(-) create mode 100644 docs/source/schema_other_considerations.rst diff --git a/docs/source/examples.rst b/docs/source/examples.rst index aa4e44d..20cb638 100644 --- a/docs/source/examples.rst +++ b/docs/source/examples.rst @@ -9,6 +9,7 @@ copied straight from the unit test cases, so you can always check there to see h .. _example_schema: + Example Schema ============== diff --git a/docs/source/schema.rst b/docs/source/schema.rst index a35cb46..7930c92 100644 --- a/docs/source/schema.rst +++ b/docs/source/schema.rst @@ -6,7 +6,9 @@ Schema This is the core of **twined**, whose whole purpose is to provide and use schemas for digital twins.. + .. _requirements: + Requirements of digital twin schema =================================== @@ -19,8 +21,8 @@ A *schema* defines a digital twin, and has multiple roles. It: If this weren't enough, the schema: #. Must be trustable (i.e. a schema from an untrusted, corrupt or malicious third party should be safe to at least read) -#. Must be machine-readable -#. Must be human-readable +#. Must be machine-readable *and machine-understandable* [1]_ +#. Must be human-readable *and human-understandable* [1]_ #. Must be searchable/indexable Fortunately for digital twin developers, many of these requirements have already been seen for data interchange formats @@ -29,18 +31,293 @@ developed for the web. 
**twined** uses ``JSON`` and ``JSONSchema`` to interchang If you're not already familiar with ``JSONSchema`` (or wish to know why **twined** uses ``JSON`` over the seemingly more appropriate ``XML`` standard), see :ref:`introducing_json_schema`. +.. toctree:: + :maxdepth: 0 + :hidden: + + schema_introducing_json + + +.. _data_framework: + +Data framework +============== + +We cannot simply expect many developers to create digital twins with some schema, then to be able to connect them all +together - even if those schemas are all fully valid (*readable*). **twined** makes things slightly more specific. + +**twined** has an opinionated view on how incoming data is organised. This results in a top-level schema that is +extremely prescriptive (*understandable*), allowing digital twins to be introspected and connected. + + +.. _data_types: + +Data types +---------- + +Let us review the classes of data i/o undertaken by a digital twin: + +.. tabs:: + + .. group-tab:: Config + + **Configuration data (input)** + + Control parameters relating to what the twin should do, or how it should operate. For example, should a twin produce + output images as low resolution PNGs or as SVGs? How many iterations of a fluid flow solver should be used? What is + the acceptable error level on a classifier algorithm? + + *These values should always have defaults.* + + .. group-tab:: Values + + **Value data (input, output)** + + Raw values passed directly to/from a twin. For example current rotor speed, or forecast wind direction. + + Values might be passed at instantiation of a twin (typical application-like process) or via a socket. + + *These values should never have defaults.* + + .. group-tab:: Files + + **File data (input, output)** + + Twins frequently operate on file content - eg files on disc or objects in a cloud data store. For example, + groups of ``.csv`` files can contain data to train a machine learning algorithm. 
There are four subclasses of file i/o + that may be undertaken by digital twins: + + #. Input file (read) - eg to read input data from a csv file + #. Temporary file (read-write, disposable) - eg to save intermediate results to disk, reducing memory use + #. Cache file (read-write, persistent) - eg to save a trained classifier for later use in prediction + #. Output file (write) - eg to write postprocessed csv data ready for the next twin, or save generated images etc. + + .. group-tab:: External + + **External service data (input, output)** + + A digital twin might: + - GET/POST data from/to an external API, + - query/update a database. + + Such data exchange may not be controllable by **twined** (which is intended to operate at the boundaries of the + twin) unless the resulting data is returned from the twin and must therefore be schema-compliant. + + .. group-tab:: Credentials + + **Credentials (input)** + + In order to: + - GET/POST data from/to an API, + - query a database, or + - connect to a socket (for receiving Values or emitting Values, Monitors or Logs) + + a digital twin must have *access* to it. API keys, database URIs, etc must be supplied to the digital twin but + treated with best practice with respect to security considerations. + + *Credentials should never be hard-coded into application code, always passed in* + + .. group-tab:: Monitors/Logs + + There are two kinds of monitoring data required from a digital twin. + + **Monitor data (output)** + + Values for health and progress monitoring of the twin, for example percentage progress, iteration number and + status - perhaps even residuals graphs for a converging calculation. Broadly speaking, this should be user-facing + information. + + *This kind of monitoring data can be in a suitable form for display on a dashboard* + + **Log data (output)** + + Logged statements, typically in iostream form, produced by the twin (e.g. 
via Python's ``logging`` module) must be + capturable as an output for debugging and monitoring purposes. Broadly speaking, this should be developer-facing + information. + + +.. _data_descriptions: + +Data descriptions +----------------- + +Here, we show how each of these data classes is described by **twined**. + -.. _specifying_a_framework: -Specifying a framework -====================== +.. tabs:: -We cannot simply expect many developers to create digital twins with a ``JSONSchema`` then to be able to connect them all -together. **twined** makes things slightly more specific. + .. group-tab:: Config + + **Configuration data** + + Configuration data is supplied as a simple object, which of course can be nested (although we don't encourage deep + nesting). The following is a totally hypothetical configuration... + + .. code-block:: javascript + + { + "max_iterations": 0, + "compute_vectors": true, + "cache_mode": "extended", + "initial_conditions": { + "intensity": 0.0, + "direction": 0.0 + } + } + + .. group-tab:: Values + + **Value data (input, output)** + + For Values data, a twin will accept and/or respond with raw JSON (this could originate over a socket, be read from + a file or API depending exactly on the twin) containing variables of importance: + + .. code-block:: javascript + + { + "rotor_speed": 13.2, + "wind_direction": 179.4 + } + + .. group-tab:: Files + + **File data (input, output)** + + Files are not streamed directly to the digital twin (this would require extreme bandwidth in whatever system is + orchestrating all the twins). Instead, files should be made available on the local storage system; i.e. a volume + mounted to whatever container or VM the digital twin runs in. + + Groups of files are described by a ``manifest``, where a manifest is (in essence) a catalogue of files in a + dataset. + + A digital twin might receive multiple manifests, if it uses multiple datasets. 
For example, it could use a 3D + point cloud LiDAR dataset, and a meteorological dataset. + + .. code-block:: javascript + + { + "manifests": [ + { + "type": "dataset", + "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe", // UUID of the manifest + "files": [ + { + "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86", // UUID of that file + "sha1": "askjnkdfoisdnfkjnkjsnd", // for quality control to check correctness of file contents + "name": "Lidar - 4 to 10 Dec.csv", + "path": "local/file/path/to/folder/containing/it/", + "type": "csv", + "metadata": { + }, + "size_bytes": 59684813, + "tags": "lidar, helpful, information, like, sequence:1" // Searchable, parsable and filterable + }, + { + "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86", + "name": "Lidar - 11 to 18 Dec.csv", + "path": "local/file/path/to/folder/containing/it/", + "type": "csv", + "metadata": { + }, + "size_bytes": 59684813, + "tags": "lidar, helpful, information, like, sequence:2" // Searchable, parsable and filterable + }, + { + "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86", + "name": "Lidar report.pdf", + "path": "local/file/path/to/folder/containing/it/", + "type": "pdf", + "metadata": { + }, + "size_bytes": 484813, + "tags": "report" // Searchable, parsable and filterable + } + ] + }, + { + // ... another dataset manifest ... + } + ] + } + + .. NOTE:: + + Tagging syntax is extremely powerful. Below, you'll see how this enables a digital twin to specify things like: + + *"Uh, so I need an ordered sequence of files, that are CSV files, and are tagged as lidar."* + + This allows **twined** to check that the input files contain what is needed, enables quick and easy + extraction of subgroups or particular sequences of files within a dataset, and enables management systems + to map candidate datasets to twins that might be used to process them. + + + .. group-tab:: External + + **External service data (input, output)** + + There's nothing for **twined** to do here! 
+ + If the purpose of the twin (and this is a common scenario!) is simply + to fetch data from some service then return it as values from the twin, that's perfect. But it's + the twin developer's job to do the fetchin', not ours ;) + + However, fetching from your API or database might require some credentials. See the following tab for help with + that. + + .. group-tab:: Credentials + + **Credentials (input)** + + Credentials should be securely managed by whatever system is managing the twin, then made accessible to the twin + in the form of environment variables: + + .. code-block:: bash + + SERVICE_API_KEY=someLongTokenTHatYouProbablyHaveToPayTheThirdPartyProviderLoadsOfMoneyFor + + **twined** helps by providing a small shim to check for their presence and bring these environment variables + into your configuration. + + .. ATTENTION:: + + Do you trust the twin code? If you insert credentials to your own database into a digital twin + provided by a third party, you'd better be very sure that twin isn't going to scrape all that data out then send + it elsewhere! + + Alternatively, if you're building a twin requiring such credentials, it's your responsibility to give the end + users confidence that you're not abusing their access. + + There'll be a lot more discussion on these issues, but it's outside the scope of **twined** - all we do here is + make sure a twin has the credentials it requires. + + .. group-tab:: Monitors/Logs + + **Monitor data (output)** + + **Log data (output)** + + +.. ATTENTION:: + *What's the difference between Configuration and Values data? Isn't it the same?* + + No. Configuration data is supplied to a twin to initialise it, and always has defaults. Values data is ingested by a + twin, maybe at startup but maybe also later (if the twin is working like a live server). In complex cases, which + Values are required may also depend on the Configuration of the twin! 
+ + Values data can also be returned from a twin whereas configuration data is not. + + Don't get hung up on this yet - in simple (most) cases, they are effectively the same. For a twin which is run as a + straightforward analysis, both the Configuration and Values are processed at startup. + + + +.. Footnotes: + +.. [1] *Understandable* essentially means that, once read, the machine or human knows what it actually means and what to do with it. .. toctree:: :maxdepth: 0 :hidden: - schema_introducing_json - + schema_other_considerations diff --git a/docs/source/schema_other_considerations.rst b/docs/source/schema_other_considerations.rst new file mode 100644 index 0000000..b07aeed --- /dev/null +++ b/docs/source/schema_other_considerations.rst @@ -0,0 +1,88 @@ +.. _other_considerations: + +==================== +Other Considerations +==================== + +A variety of thoughts that arose whilst architecting **twined**. + +.. _bash_style_stdio: + +Bash-style stdio +---------------- + +Some thought was given to using a very old-school-unix approach to piping data between twins, via stdout. + +Whilst attractive (as being a wildly fast way of piping data between twins on the same machine) it was felt this +was insufficiently general, eg: + + - where twins don't exist on the same machine or container, making it cumbersome to engineer common iostreams + - where slight differences between different shells might lead to incompatibilities or changes in behaviour + +And also unfriendly, eg: + + - engineers or scientists unfamiliar with subtleties of bash shell scripting encounter difficulty piping data around + - difficult to build friendly web based tools to introspect the data and configuration + - bound to be headaches on windows platforms, even though windows now supports bash + - easy to corrupt using third party libraries (e.g. which print to stdout) + + +.. 
_Units: + +Units +----- + +Being used (mostly) for engineering and scientific analysis, it was tempting to add in a specified sub-schema for units. +For example, mandating that where values can be given in units, they be specified in a certain way, like: + +.. code-block:: javascript + + { + "wind_speed": { + "value": 10.2, + "units": "mph" + } + } + +or (more succinct): + +.. code-block:: javascript + + { + "wind_speed": 10.2, + "wind_speed_units": "mph" + } + +It's still extremely tempting to provide this facility, or at least provide some way of specifying in the schema +what units a value should be provided in. Thinking about it but don't have time right now. +If anybody wants to start crafting a PR with an extension or update to **twined** that facilitates this, please raise an +issue to start progressing it. + + +.. _variable_style: + +Variable Style +-------------- + +A pre-emptive stamp on the whinging... + +Note that in the ``JSON`` descriptions above, all variables are named in ``snake_case`` rather than ``camelCase``. This +decision, more likely than even Brexit to divide opinions, is based on: + - The reservation of snake case for the schema spec has the subtle advantage that in future, we might be able to use + camelCase within the spec to denote class types in some useful way, just like in python. Not sure yet; just mulling. + - The :ref:`requirements` mention human-readability as a must; + `this paper `_ + suggests a 20% slower comprehension of camel case than snake. + - The languages we anticipate being most popular for building twins seem to trend toward snake case (eg + `python `_, `c++ `_) + although to be fair we might've woefully misjudged which languages start emerging. + - We're starting in Python so are taking a lead from PEP8, which is bar none the most successful style guide on the + planet, because it got everybody on the same page really early on. 
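Choosing ``snake_case`` for the schema need not shut out code that already uses ``camelCase``, because renaming keys is mechanical. What follows is only a minimal sketch in Python, using the standard library alone. The helper names ``camel_to_snake`` and ``snake_keys`` are hypothetical and are not part of **twined**; the sample keys are invented for illustration.

```python
import re


# Hypothetical helpers for illustration only; not part of twined's API.
def camel_to_snake(name: str) -> str:
    """Convert a camelCase or PascalCase key to snake_case."""
    # Put an underscore between a lowercase letter or digit and a following capital.
    partial = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalised word (e.g. "HTTPServer").
    partial = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", partial)
    return partial.lower()


def snake_keys(data):
    """Recursively rename the keys of nested dicts and lists to snake_case."""
    if isinstance(data, dict):
        return {camel_to_snake(key): snake_keys(value) for key, value in data.items()}
    if isinstance(data, list):
        return [snake_keys(item) for item in data]
    # Scalars (numbers, strings, booleans, None) are returned unchanged.
    return data


# Invented sample data with camelCase keys.
legacy = {"rotorSpeed": 13.2, "initialConditions": {"windDirection": 179.4}}
converted = snake_keys(legacy)
```

Only keys are renamed; values, including string values that happen to contain capitals, are left untouched. A mature library is likely a better choice than this sketch for production use.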
+ +If existing code that you're dropping in uses camelCase, please don't file that as an issue... converting property +names automatically after schema validation generation is trivial, there are tons of libraries (like +`humps `_) to do it. + +We'd also consider a pull request for a built-in utility converting `to `_ and +`from <>`_ that does this following validation and prior to returning results. Suggest your proposed approach on the +issues board. diff --git a/docs/source/version_history.rst b/docs/source/version_history.rst index 0bcdbd0..00d3c1c 100644 --- a/docs/source/version_history.rst +++ b/docs/source/version_history.rst @@ -14,6 +14,7 @@ open-source the framework we developed to connect applications and digital twins .. _version_0.0.x: + 0.0.x ===== From dc5b3f1bd098dd5f3d0d81fa22fada3e4ab26c4b Mon Sep 17 00:00:00 2001 From: Tom Clark <> Date: Fri, 13 Sep 2019 13:01:52 +0100 Subject: [PATCH 2/2] VER Version bump for the extended documentation --- setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/setup.py b/setup.py index 6a8ceb4..8c7d9c6 100644 --- a/setup.py +++ b/setup.py @@ -15,7 +15,7 @@ setup( name='twined', - version='0.0.3', + version='0.0.4', py_modules=[], install_requires=[], url='https://www.github.com/octue/twined',