From 9e09242d36af5e0ff2542714627e5cd7b6b9ea1d Mon Sep 17 00:00:00 2001
From: Tom Clark <>
Date: Fri, 13 Sep 2019 13:00:53 +0100
Subject: [PATCH 1/2] DOC Extended data framework section

---
 docs/source/examples.rst                    |   1 +
 docs/source/schema.rst                      | 295 +++++++++++++++++++-
 docs/source/schema_other_considerations.rst |  88 ++++++
 docs/source/version_history.rst             |   1 +
 4 files changed, 376 insertions(+), 9 deletions(-)
 create mode 100644 docs/source/schema_other_considerations.rst

diff --git a/docs/source/examples.rst b/docs/source/examples.rst
index aa4e44d..20cb638 100644
--- a/docs/source/examples.rst
+++ b/docs/source/examples.rst
@@ -9,6 +9,7 @@ copied straight from the unit test cases, so you can always check there to see h
 
 
 .. _example_schema:
+
 Example Schema
 ==============
 
diff --git a/docs/source/schema.rst b/docs/source/schema.rst
index a35cb46..7930c92 100644
--- a/docs/source/schema.rst
+++ b/docs/source/schema.rst
@@ -6,7 +6,9 @@ Schema
 
 This is the core of **twined**, whose whole purpose is to provide and use schemas for digital twins..
 
+
 .. _requirements:
+
 Requirements of digital twin schema
 ===================================
 
@@ -19,8 +21,8 @@ A *schema* defines a digital twin, and has multiple roles. It:
 If this weren't enough, the schema:
 
 #. Must be trustable (i.e. a schema from an untrusted, corrupt or malicious third party should be safe to at least read)
-#. Must be machine-readable
-#. Must be human-readable
+#. Must be machine-readable *and machine-understandable* [1]_
+#. Must be human-readable *and human-understandable* [1]_
 #. Must be searchable/indexable
 
 Fortunately for digital twin developers, many of these requirements have already been seen for data interchange formats
@@ -29,18 +31,293 @@ developed for the web. **twined** uses ``JSON`` and ``JSONSchema`` to interchang
 If you're not already familiar with ``JSONSchema`` (or wish to know why **twined** uses ``JSON`` over the seemingly more
 appropriate ``XML`` standard), see :ref:`introducing_json_schema`.
 
+.. toctree::
+   :maxdepth: 0
+   :hidden:
+
+   schema_introducing_json
+
+
+.. _data_framework:
+
+Data framework
+==============
+
+We cannot simply expect many developers to create digital twins with some schema, then to be able to connect them all
+together - even if those schemas are all fully valid (*readable*). **twined** makes things slightly more specific.
+
+**twined** has an opinionated view on how incoming data is organised. This results in a top-level schema that is
+extremely prescriptive (*understandable*), allowing digital twins to be introspected and connected.
+
+
+.. _data_types:
+
+Data types
+----------
+
+Let us review the classes of data i/o undertaken by a digital twin:
+
+.. tabs::
+
+   .. group-tab:: Config
+
+      **Configuration data (input)**
+
+      Control parameters relating to what the twin should do, or how it should operate. For example, should a twin produce
+      output images as low resolution PNGs or as SVGs? How many iterations of a fluid flow solver should be used? What is
+      the acceptable error level on a classifier algorithm?
+
+      *These values should always have defaults.*
+
+   .. group-tab:: Values
+
+      **Value data (input, output)**
+
+      Raw values passed directly to/from a twin. For example current rotor speed, or forecast wind direction.
+
+      Values might be passed at instantiation of a twin (typical application-like process) or via a socket.
+
+      *These values should never have defaults.*
+
+   .. group-tab:: Files
+
+      **File data (input, output)**
+
+      Twins frequently operate on file content - eg files on disk or objects in a cloud data store. For example,
+      groups of ``.csv`` files can contain data to train a machine learning algorithm. There are four subclasses of file i/o
+      that may be undertaken by digital twins:
+
+      #. Input file (read) - eg to read input data from a csv file
+      #. Temporary file (read-write, disposable) - eg to save intermediate results to disk, reducing memory use
+      #. Cache file (read-write, persistent) - eg to save a trained classifier for later use in prediction
+      #. Output file (write) - eg to write postprocessed csv data ready for the next twin, or save generated images etc.
+
+   .. group-tab:: External
+
+      **External service data (input, output)**
+
+      A digital twin might:
+         - GET/POST data from/to an external API,
+         - query/update a database.
+
+      Such data exchange may not be controllable by **twined** (which is intended to operate at the boundaries of the
+      twin) unless the resulting data is returned from the twin, in which case it must be schema-compliant.
+
+   .. group-tab:: Credentials
+
+      **Credentials (input)**
+
+      In order to:
+         - GET/POST data from/to an API,
+         - query a database, or
+         - connect to a socket (for receiving Values or emitting Values, Monitors or Logs)
+
+      a digital twin must have *access* to it. API keys, database URIs, etc. must be supplied to the digital twin, but
+      handled in line with security best practice.
+
+      *Credentials should never be hard-coded into application code; they should always be passed in.*
+
+   .. group-tab:: Monitors/Logs
+
+      There are two kinds of monitoring data required from a digital twin.
+
+      **Monitor data (output)**
+
+      Values for health and progress monitoring of the twin, for example percentage progress, iteration number and
+      status - perhaps even residuals graphs for a converging calculation. Broadly speaking, this should be user-facing
+      information.
+
+      *This kind of monitoring data can be in a suitable form for display on a dashboard*
+
+      **Log data (output)**
+
+      Logged statements, typically in iostream form, produced by the twin (e.g. via python's ``logging`` module) must be
+      capturable as an output for debugging and monitoring purposes. Broadly speaking, this should be developer-facing
+      information.
+
+
+.. _data_descriptions:
+
+Data descriptions
+-----------------
+
+Here, we show how **twined** describes each of these data classes.
+
 
-.. _specifying_a_framework:
-Specifying a framework
-======================
+.. tabs::
 
-We cannot simply expect many developers to create digital twins with a ``JSONSchema`` then to be able to connect them all
-together. **twined** makes things slightly more specific.
+   .. group-tab:: Config
+
+      **Configuration data**
+
+      Configuration data is supplied as a simple object, which of course can be nested (although we don't encourage deep
+      nesting). The following is a totally hypothetical configuration...
+
+      .. code-block:: javascript
+
+         {
+             "max_iterations": 0,
+             "compute_vectors": true,
+             "cache_mode": "extended",
+             "initial_conditions": {
+                 "intensity": 0.0,
+                 "direction": 0.0
+             }
+         }
+
+   .. group-tab:: Values
+
+      **Value data (input, output)**
+
+      For Values data, a twin will accept and/or respond with raw JSON (this could originate over a socket, be read from
+      a file or API depending exactly on the twin) containing variables of importance:
+
+      .. code-block:: javascript
+
+         {
+             "rotor_speed": 13.2,
+             "wind_direction": 179.4
+         }
+
+   .. group-tab:: Files
+
+      **File data (input, output)**
+
+      Files are not streamed directly to the digital twin (this would require extreme bandwidth in whatever system is
+      orchestrating all the twins). Instead, files should be made available on the local storage system; i.e. a volume
+      mounted to whatever container or VM the digital twin runs in.
+
+      Groups of files are described by a ``manifest``, where a manifest is (in essence) a catalogue of files in a
+      dataset.
+
+      A digital twin might receive multiple manifests, if it uses multiple datasets. For example, it could use a 3D
+      point cloud LiDAR dataset, and a meteorological dataset.
+
+      .. code-block:: javascript
+
+         {
+             "manifests": [
+                 {
+                     "type": "dataset",
+                     "id": "3c15c2ba-6a32-87e0-11e9-3baa66a632fe",  // UUID of the manifest
+                     "files": [
+                         {
+                             "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",  // UUID of that file
+                             "sha1": "askjnkdfoisdnfkjnkjsnd",  // for quality control to check correctness of file contents
+                             "name": "Lidar - 4 to 10 Dec.csv",
+                             "path": "local/file/path/to/folder/containing/it/",
+                             "type": "csv",
+                             "metadata": {
+                             },
+                             "size_bytes": 59684813,
+                             "tags": "lidar, helpful, information, like, sequence:1",  // Searchable, parsable and filterable
+                         },
+                         {
+                             "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
+                             "name": "Lidar - 11 to 18 Dec.csv",
+                             "path": "local/file/path/to/folder/containing/it/",
+                             "type": "csv",
+                             "metadata": {
+                             },
+                             "size_bytes": 59684813,
+                             "tags": "lidar, helpful, information, like, sequence:2",  // Searchable, parsable and filterable
+                         },
+                         {
+                             "id": "abff07bc-7c19-4ed5-be6d-a6546eae8e86",
+                             "name": "Lidar report.pdf",
+                             "path": "local/file/path/to/folder/containing/it/",
+                             "type": "pdf",
+                             "metadata": {
+                             },
+                             "size_bytes": 484813,
+                             "tags": "report",  // Searchable, parsable and filterable
+                         }
+                     ]
+                 },
+                 {
+                     // ... another dataset manifest ...
+                 }
+             ]
+         }
+
+      .. NOTE::
+
+         Tagging syntax is extremely powerful. Below, you'll see how this enables a digital twin to specify things like:
+
+         *"Uh, so I need an ordered sequence of files, that are CSV files, and are tagged as lidar."*
+
+         This allows **twined** to check that the input files contain what is needed, enables quick and easy
+         extraction of subgroups or particular sequences of files within a dataset, and enables management systems
+         to map candidate datasets to twins that might be used to process them.
+
+
+   .. group-tab:: External
+
+      **External service data (input, output)**
+
+      There's nothing for **twined** to do here!
+
+      If the purpose of the twin (and this is a common scenario!) is simply
+      to fetch data from some service then return it as values from the twin, that's perfect. But it's
+      the twin developer's job to do the fetchin', not ours ;)
+
+      However, fetching from your API or database might require some credentials. See the following tab for help with
+      that.
+
+   .. group-tab:: Credentials
+
+      **Credentials (input)**
+
+      Credentials should be securely managed by whatever system is managing the twin, then made accessible to the twin
+      in the form of environment variables:
+
+      .. code-block:: bash
+
+         SERVICE_API_KEY=someLongTokenThatYouProbablyHaveToPayTheThirdPartyProviderLoadsOfMoneyFor
+
+      **twined** helps by providing a small shim to check for their presence and bring these environment variables
+      into your configuration.
+
+      .. ATTENTION::
+
+         Do you trust the twin code? If you insert credentials to your own database into a digital twin
+         provided by a third party, you better be very sure that twin isn't going to scrape all that data out then send
+         it elsewhere!
+
+         Alternatively, if you're building a twin requiring such credentials, it's your responsibility to give the end
+         users confidence that you're not abusing their access.
+
+         There'll be a lot more discussion on these issues, but it's outside the scope of **twined** - all we do here is
+         make sure a twin has the credentials it requires.
+
+   .. group-tab:: Monitors/Logs
+
+      **Monitor data (output)**
+
+      **Log data (output)**
+
+
+.. ATTENTION::
+    *What's the difference between Configuration and Values data? Isn't it the same?*
+
+    No. Configuration data is supplied to a twin to initialise it, and always has defaults. Values data is ingested by a
+    twin, maybe at startup but maybe also later (if the twin is working like a live server). In complex cases, which
+    Values are required may also depend on the Configuration of the twin!
+
+    Values data can also be returned from a twin whereas configuration data is not.
+
+    Don't get hung up on this yet - in simple (most) cases, they are effectively the same. For a twin which is run as a
+    straightforward analysis, both the Configuration and Values are processed at startup.
+
+
+
+.. Footnotes:
+
+.. [1] *Understandable* essentially means that, once read, the machine or human knows what it actually means and what to do with it.
 
 
 .. toctree::
    :maxdepth: 0
    :hidden:
 
-   schema_introducing_json
-
+   schema_other_considerations
diff --git a/docs/source/schema_other_considerations.rst b/docs/source/schema_other_considerations.rst
new file mode 100644
index 0000000..b07aeed
--- /dev/null
+++ b/docs/source/schema_other_considerations.rst
@@ -0,0 +1,88 @@
+.. _other_considerations:
+
+====================
+Other Considerations
+====================
+
+A variety of thoughts that arose whilst architecting **twined**.
+
+.. _bash_style_stdio:
+
+Bash-style stdio
+----------------
+
+Some thought was given to using a very old-school-unix approach to piping data between twins, via stdout.
+
+Whilst attractive (being a wildly fast way of piping data between twins on the same machine), this was felt to be
+insufficiently general, eg:
+
+ - where twins don't exist on the same machine or container, making it cumbersome to engineer common iostreams
+ - where slight differences between different shells might lead to incompatibilities or changes in behaviour
+
+And also unfriendly, eg:
+
+ - engineers or scientists unfamiliar with subtleties of bash shell scripting encounter difficulty piping data around
+ - difficult to build friendly web based tools to introspect the data and configuration
+ - bound to be headaches on windows platforms, even though windows now supports bash
+ - easy to corrupt using third party libraries (e.g. which print to stdout)
+
+
+.. _Units:
+
+Units
+-----
+
+As **twined** is used (mostly) for engineering and scientific analysis, it was tempting to add a specified sub-schema
+for units. For example, mandating that where values can be given in units, they be specified in a certain way, like:
+
+.. code-block:: javascript
+
+   {
+       "wind_speed": {
+           "value": 10.2,
+           "units": "mph"
+       }
+   }
+
+or (more succinct):
+
+.. code-block:: javascript
+
+   {
+       "wind_speed": 10.2,
+       "wind_speed_units": "mph"
+   }
+
+It's still extremely tempting to provide this facility, or at least some way of specifying in the schema
+what units a value should be provided in. We're thinking about it, but don't have time right now.
+If anybody wants to start crafting a PR with an extension or update to **twined** that facilitates this, please raise an
+issue to start progressing it.
+
+
+.. _variable_style:
+
+Variable Style
+--------------
+
+A preemptive stamp on the whinging...
+
+Note that in the ``JSON`` descriptions above, all variables are named in ``snake_case`` rather than ``camelCase``. This decision, more likely than even Brexit to divide opinions, is based on:
+
+  - The reservation of snake case for the schema spec has the subtle advantage that in future, we might be able to use
+    camelCase within the spec to denote class types in some useful way, just like in python. Not sure yet; just mulling.
+  - The :ref:`requirements` mention human-readability as a must;
+    `this paper <https://ieeexplore.ieee.org/document/5521745?tp=&arnumber=5521745&url=http:%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5521745>`_
+    suggests a 20% slower comprehension of camel case than snake.
+  - The languages we anticipate being most popular for building twins seem to trend toward snake case (eg
+    `python <https://www.python.org/dev/peps/pep-0008/>`_, `c++ <https://google.github.io/styleguide/cppguide.html>`_)
+    although to be fair we might've woefully misjudged which languages start emerging.
+  - We're starting in Python so are taking a lead from PEP8, which is bar none the most successful style guide on the
+    planet, because it got everybody on the same page really early on.
+
+If existing code that you're dropping in uses camelCase, please don't file that as an issue... converting property
+names automatically after schema validation is trivial; there are tons of libraries (like
+`humps <https://humps.readthedocs.io/en/latest/>`_) to do it.
+
+We'd also consider a pull request for a built-in utility that converts `to <https://pypi.org/project/camelcase/>`_ and
+from camelCase, following validation and prior to returning results. Suggest your proposed approach on the
+issues board.
diff --git a/docs/source/version_history.rst b/docs/source/version_history.rst
index 0bcdbd0..00d3c1c 100644
--- a/docs/source/version_history.rst
+++ b/docs/source/version_history.rst
@@ -14,6 +14,7 @@ open-source the framework we developed to connect applications and digital twins
 
 
 .. _version_0.0.x:
+
 0.0.x
 =====
 

From dc5b3f1bd098dd5f3d0d81fa22fada3e4ab26c4b Mon Sep 17 00:00:00 2001
From: Tom Clark <>
Date: Fri, 13 Sep 2019 13:01:52 +0100
Subject: [PATCH 2/2] VER Version bump for the extended documentation

---
 setup.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.py b/setup.py
index 6a8ceb4..8c7d9c6 100644
--- a/setup.py
+++ b/setup.py
@@ -15,7 +15,7 @@
 
 setup(
     name='twined',
-    version='0.0.3',
+    version='0.0.4',
     py_modules=[],
     install_requires=[],
     url='https://www.github.com/octue/twined',