#2091: Create a documentation folder for Enceladus 3 (#2101)
#2091: Create a documentation folder for Enceladus 3
* new folder for v3.0.0 documentation
* version 3.0.0 in the list of versions
* new data files appropriate to the version
* typo fix
* link to Spark 3 documentation
* CODEOWNERS
benedeki authored Aug 2, 2022
1 parent e03acf7 commit 781deb0
Showing 25 changed files with 2,128 additions and 2 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
* @lokm01 @benedeki @DzMakatun @Zejnilovic @dk1844 @AdrianOlosutean @zakera786
* @lokm01 @benedeki @DzMakatun @Zejnilovic @dk1844 @lsulak @zakera786
2 changes: 1 addition & 1 deletion README.md
@@ -14,7 +14,7 @@ $> bundle exec jekyll serve
# => Now browse to http://localhost:4000
```

### Run convinience scripts
### Run convenience scripts

#### Generate new docs
```ruby
81 changes: 81 additions & 0 deletions _data/configuration_3_0_0.yml
@@ -0,0 +1,81 @@
---
- name: conformance.allowOriginalColumnsMutability
  options:
    - name: boolean
      description: "Allows modifying/dropping columns of the original input (default is <i>false</i>)"
- name: conformance.autoclean.standardized.hdfs.folder
  options:
    - name: boolean
      description: 'Automatically delete the standardized data folder after a successful run of a Conformance job <sup><a href="#note1">*</a></sup>'
- name: control.info.validation
  options:
    - name: <i>strict</i>
      description: The job will fail on failed <code>_INFO</code> file validation.
    - name: <i>warning</i>
      description: "(default) A warning message will be displayed on failed validation,
        but the job will go on."
    - name: <i>none</i>
      description: No validation is done.
- name: enceladus.recordId.generation.strategy
  options:
    - name: <i>uuid</i>
      description: "(default) An <code>enceladus_record_id</code> column will be added and will contain
        a UUID <code>String</code> for each row."
    - name: <i>stableHashId</i>
      description: "An <code>enceladus_record_id</code> column will be added and populated with an
        always-the-same <code>Int</code> hash (Murmur3-based, for testing)."
    - name: <i>none</i>
      description: No column will be added to the output.
- name: max.processing.partition.size
  options:
    - name: non-negative long integer
      description: 'Maximum size (in bytes) of a processing partition, which influences the size of the written Parquet files.
        <b>NB! Experimental - sizes might still not fulfill the requested limits</b>'
- name: menas.rest.uri
  options:
    - name: string with URLs
      description: 'Comma-separated list of URLs where Menas will be looked for. E.g.:
        <code>http://example.com/menas1,http://domain.com:8080/menas2</code>'
- name: menas.rest.retryCount
  options:
    - name: non-negative integer
      description: Each of the <code>menas.rest.uri</code> URLs can be tried multiple times for fault tolerance.
- name: menas.rest.availability.setup
  options:
    - name: <i>roundrobin</i>
      description: "(default) Starts from a random URL in the <code>menas.rest.uri</code> list; if it fails, the next
        one is tried, wrapping around to the first entry until all have been tried."
    - name: <i>fallback</i>
      description: "Always starts from the first URL, and only if it fails is the second one tried, etc."
- name: min.processing.partition.size
  options:
    - name: non-negative long integer
      description: 'Minimum size (in bytes) of a processing partition, which influences the size of the written Parquet files.
        <b>NB! Experimental - sizes might still not fulfill the requested limits</b>'
- name: standardization.defaultTimestampTimeZone.default
  options:
    - name: string with any valid time zone name
      description: The time zone for normalization of timestamps that don't have their own time zone, either in the data
        itself or in metadata. If left empty, the system time zone will be used.
- name: standardization.defaultTimestampTimeZone.[rawFormat]
  options:
    - name: string with any valid time zone name
      description: Same as <code>standardization.defaultTimestampTimeZone.default</code> above, but applies only to
        the specific input raw format, in which case it takes precedence over
        <code>standardization.defaultTimestampTimeZone.default</code>.
- name: standardization.defaultDateTimeZone.default
  options:
    - name: string with any valid time zone name
      description: The time zone for normalization of dates that don't have their own time zone, either in the data itself
        or in metadata, in case they need it. Most probably this should be left undefined.
- name: standardization.defaultDateTimeZone.[rawFormat]
  options:
    - name: string with any valid time zone name
      description: Same as <code>standardization.defaultDateTimeZone.default</code> above, but applies only to
        the specific input raw format, in which case it takes precedence over
        <code>standardization.defaultDateTimeZone.default</code>.
- name: timezone
  options:
    - name: string with any valid time zone name
      description: The time zone the Spark application will operate in. It is strongly recommended
        to keep the default <i>UTC</i>.
46 changes: 46 additions & 0 deletions _data/menas-configuration_3_0_0.yml
@@ -0,0 +1,46 @@
---
- name: javax.net.ssl.keyStore
  options:
    - name: string path to JKS file
      description: 'KeyStore file containing records of private keys used to connect to a secure schema registry.
        E.g.: <code>/path/to/keystore.jks</code>'
- name: javax.net.ssl.keyStorePassword
  options:
    - name: string
      description: 'Password for the file referenced in <code><a href="#javax.net.ssl.keyStore">javax.net.ssl.keyStore</a></code>. E.g.:
        <code>password1234</code>'
- name: javax.net.ssl.trustStore
  options:
    - name: string path to JKS file
      description: 'TrustStore file containing records of trusted certificates used to connect to a secure schema registry.
        E.g.: <code>/path/to/truststore.jks</code> <sup><a href="#note2">*</a></sup>'
- name: javax.net.ssl.trustStorePassword
  options:
    - name: string
      description: 'Password for the file referenced in <code><a href="#javax.net.ssl.trustStore">javax.net.ssl.trustStore</a></code>. E.g.:
        <code>password123</code>'
- name: menas.auth.admin.role
  options:
    - name: string
      description: 'Specifies the admin role required to perform property definition create and update operations.'
- name: menas.auth.roles.regex
  options:
    - name: string - regular expression
      description: 'Regular expression specifying which user roles to include in the JWT. E.g.:
        <code>^menas_</code>. If the expression filters out the admin role (<code><a href="#menas.auth.admin.role">menas.auth.admin.role</a></code>), the account won''t be recognized as admin.'
- name: menas.auth.ad.server
  options:
    - name: string - space-separated AD server domains
      description: 'ActiveDirectory server domain(s) - multiple values are supported as fallback options.
        The DN (e.g. <code>dc=example,dc=com</code>) should not be included, as this is supplied in <code>menas.auth.ldap.search.base</code>.
        Example: <code>menas.auth.ad.server=ldaps://first.ldap.here ldaps://second.ldap.here ldaps://third.ldap.here</code> (notice no quotes)'
- name: menas.schemaRegistry.baseUrl
  options:
    - name: string with URL
      description: 'Base URL of the (secure) schema registry. E.g.:
        <code>https://localhost:8081</code> <sup><a href="#note1">*</a></sup>'
- name: menas.schemaRegistry.warnUnsecured
  options:
    - name: boolean
      description: 'If set, the application will issue a warning in case the <code>javax.net.ssl.*</code> settings are missing
        or incorrect. Default: <code>True</code>'
9 changes: 9 additions & 0 deletions _data/selected-plugins-configuration_3_0_0.yml
@@ -0,0 +1,9 @@
---
- name: atum.hdfs.info.file.permissions
  options:
    - name: string with FS permissions
      description: 'Desired FS permissions for the Atum <code>_INFO</code> file. Default: <code>644</code>.'
- name: spline.hdfs.file.permissions
  options:
    - name: string with FS permissions
      description: "Desired FS permissions for Spline's <code>_LINEAGE</code> file. Default: <code>644</code>."
1 change: 1 addition & 0 deletions _data/versions.yaml
@@ -1,2 +1,3 @@
- '1.0.0'
- '2.0.0'
- '3.0.0'
7 changes: 7 additions & 0 deletions _docs/3.0.0/build-process.md
@@ -0,0 +1,7 @@
---
layout: docs
title: Build Process
version: '3.0.0'
categories:
- '3.0.0'
---
33 changes: 33 additions & 0 deletions _docs/3.0.0/components.md
@@ -0,0 +1,33 @@
---
layout: docs
title: Components
version: '3.0.0'
categories:
- '3.0.0'
---

### Menas

Menas is the UI component of the Enceladus project. It is used to define datasets and schemas representing your data. Using a dataset definition you specify where the data is, where it should land, and which conformance rules (if any) should be applied. A schema defines how the data will look (column names, types) after standardization.

[More...]({{ site.baseurl }}/docs/{{ page.version }}/components/menas)

### SparkJobs

Enceladus consists of two Spark jobs. One is Standardization, which aligns data types and formats, and the second is Conformance, which then applies conformance rules to the data.

#### Standardization

Standardization is used to transform almost any data format into a standardized, strongly typed Parquet format, so the data can be used/viewed with unified tools.

#### Conformance

Conformance is used to apply conformance rules (mapping, negation, casting, etc.) to the data. Conformance rules are additional transformations of the data.

### Plugins

[More...]({{ site.baseurl }}/docs/{{ page.version }}/plugins)

### Built-in Plugins

[More...]({{ site.baseurl }}/docs/{{ page.version }}/plugins-built-in)
40 changes: 40 additions & 0 deletions _docs/3.0.0/components/menas.md
@@ -0,0 +1,40 @@
---
layout: docs
title: Components - Menas
version: '3.0.0'
categories:
- '3.0.0'
- components
---
## API

### Monitoring endpoints

All `/admin` endpoints except `/admin/health` require authentication (and will require strict permissions once [Authorization]({{ site.github.issues_url }}/30) is implemented)
* `GET /admin` - list of all monitoring endpoints
* `GET /admin/heapdump` - downloads a heapdump of the application
* `GET /admin/threaddump` - returns a thread dump of the application
* `GET /admin/loggers` - list of all the application loggers and their log levels
* `POST /admin/loggers/{logger}` - change the log level of a logger at runtime
* `GET /admin/health` - get a detailed status report of the application's health:
```json
{
  "status": "UP",
  "details": {
    "HDFSConnection": {
      "status": "UP"
    },
    "MongoDBConnection": {
      "status": "UP"
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 1000240963584,
        "free": 766613557248,
        "threshold": 10485760
      }
    }
  }
}
```
20 changes: 20 additions & 0 deletions _docs/3.0.0/deployment.md
@@ -0,0 +1,20 @@
---
layout: docs
title: Deployment
version: '3.0.0'
categories:
- '3.0.0'
---

## Menas

### Prerequisites for deploying Menas

- Tomcat 8.5+ to deploy the war to
- `HADOOP_CONF_DIR` environment variable. This variable should point to a folder containing Hadoop configuration files (`core-site.xml`, `hdfs-site.xml` and `yarn-site.xml`). These are used to query the HDFS for folder locations.
- MongoDB 4.0+ used as storage
- _OPTIONAL_ [Spline 0.3.X](https://absaoss.github.io/spline/0.3.html) for viewing the lineage from Menas. Even without Spline in Menas, Standardization and Conformance will log lineage to Mongo.

### Deploying Menas

The easiest way to deploy Menas is to copy the `menas-VERSION.war` to `$TOMCAT_HOME/webapps`. This will create the `<tomcat IP>/menas-VERSION` path on your local server.
66 changes: 66 additions & 0 deletions _docs/3.0.0/plugins-built-in.md
@@ -0,0 +1,66 @@
---
layout: docs
title: Built-in Plugins
version: '3.0.0'
categories:
- '3.0.0'
---
<!-- toc -->
- [What are built-in plugins](#what-are-built-in-plugins)
- [Existing built-in plugins](#existing-built-in-plugins)
- [KafkaInfoPlugin](#kafkainfoplugin)
- [KafkaErrorSenderPlugin](#kafkaerrorsenderplugin)
<!-- tocstop -->

## What are built-in plugins

Built-in plugins provide some additional but relatively elementary functionality, and also serve as an example of how
plugins are written. Unlike externally created plugins, they are automatically included in the `SparkJobs.jar` file and
therefore don't need to be added using the `--jars` option.

## Existing built-in plugins

The plugin class name is specified for Standardization and Conformance separately since some plugins need to run only
during execution of one of these jobs. Plugin class name keys have numeric suffixes (`.1` in this example). The numeric
suffix specifies the order in which plugins are invoked. It should always start with `1` and be incremented by 1 without
gaps.

### KafkaInfoPlugin

The purpose of this plugin is to send control measurements to a Kafka topic each time a checkpoint is reached or job
status is changed. This can help to monitor production issues and react to errors as quickly as possible.
Control measurements are sent in `Avro` format and the schema is automatically registered in a schema registry.

This plugin is a built-in one. In order to enable it, you need to provide the following configuration settings in
`application.conf`:

```
standardization.plugin.control.metrics.1=za.co.absa.enceladus.plugins.builtin.controlinfo.mq.kafka.KafkaInfoPlugin
conformance.plugin.control.metrics.1=za.co.absa.enceladus.plugins.builtin.controlinfo.mq.kafka.KafkaInfoPlugin
kafka.schema.registry.url="http://127.0.0.1:8081"
kafka.bootstrap.servers="127.0.0.1:9092"
kafka.info.metrics.client.id="controlInfo"
kafka.info.metrics.topic.name="control.info"
# Optional security settings
#kafka.security.protocol="SASL_SSL"
#kafka.sasl.mechanism="GSSAPI"
# Optional Schema Registry Security Parameters
#kafka.schema.registry.basic.auth.credentials.source=USER_INFO
#kafka.schema.registry.basic.auth.user.info=user:password
```

### KafkaErrorSenderPlugin

The purpose of this plugin is to send errors to a Kafka topic.

This plugin is a built-in one. In order to enable it, you need to provide the following configuration settings in
`application.conf`:

```
standardization.plugin.postprocessor.1=za.co.absa.enceladus.plugins.builtin.errorsender.mq.kafka.KafkaErrorSenderPlugin
conformance.plugin.postprocessor.1=za.co.absa.enceladus.plugins.builtin.errorsender.mq.kafka.KafkaErrorSenderPlugin
kafka.schema.registry.url=
kafka.bootstrap.servers=
kafka.error.client.id=
kafka.error.topic.name=
```
37 changes: 37 additions & 0 deletions _docs/3.0.0/plugins.md
@@ -0,0 +1,37 @@
---
layout: docs
title: Plugins
version: '3.0.0'
categories:
- '3.0.0'
---

**Standardization** and **Conformance** support plugins that allow executing additional actions at certain times of the computation.

A plugin can be developed externally. In this case, in order to use the plugin, its jar needs to be supplied to
`spark-submit` using the `--jars` option. You can also use built-in plugins by enabling them in `application.conf`
or by passing configuration information directly to `spark-submit`.

It works as follows: a plugin factory (a class that implements `PluginFactory`) overrides the
`apply` method. Standardization and Conformance invoke this method when the job starts, providing a configuration that
includes all settings from `application.conf` plus the settings passed to the JVM via `spark-submit`. The factory then
instantiates a plugin and returns it to the caller. If the factory throws an exception, the Spark application
(Standardization or Conformance) is stopped. If the factory returns `null`, an error is logged by the application,
but it continues to run.
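
As an illustration only, here is a minimal Scala sketch of the factory mechanism with simplified, assumed signatures - the real `PluginFactory` trait and configuration type in Enceladus may differ:

```scala
// A sketch, not the actual Enceladus API: the configuration type, the traits and
// the "logging.plugin.prefix" key below are assumptions made for illustration.
trait Plugin {
  def close(): Unit
}

trait PluginFactory[T <: Plugin] {
  // Invoked by Standardization/Conformance at job start with the merged
  // configuration (application.conf plus options passed via spark-submit).
  def apply(config: Map[String, String]): T
}

// Hypothetical plugin that simply logs messages with a configurable prefix.
class LoggingPlugin(prefix: String) extends Plugin {
  def log(message: String): Unit = println(s"[$prefix] $message")
  override def close(): Unit = ()
}

object LoggingPluginFactory extends PluginFactory[LoggingPlugin] {
  override def apply(config: Map[String, String]): LoggingPlugin = {
    // Throwing here would stop the Spark job; returning null would only log an error.
    val prefix = config.getOrElse("logging.plugin.prefix", "enceladus")
    new LoggingPlugin(prefix)
  }
}
```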

There is one type of plugin supported for now:

## Control Metrics Plugins

_Control metrics plugins_ allow execution of additional actions any time a checkpoint is created
or the job status changes. In order to write such a plugin for Enceladus you need to implement the `ControlMetricsPlugin` and
`ControlMetricsPluginFactory` interfaces.

Control metrics plugins are invoked each time a job status changes (e.g. from `running` to `succeeded`) or when a checkpoint
is reached. A `Checkpoint` is an [Atum][atum] concept to ensure accuracy and completeness of data.
A checkpoint is created at the end of Standardization and Conformance, and after each conformance rule
configured to create control measurements. At this point the `onCheckpoint()` callback is called with an instance of control
measurements. It is up to the plugin to decide what to do at that point. All exceptions thrown from a plugin will be
logged, but the Spark application will continue to run.
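
For illustration, a minimal Scala sketch of a control metrics plugin follows; the `Measurement` and `Checkpoint` types and the callback signatures below are simplified stand-ins for the richer Atum model, so the real `ControlMetricsPlugin` interface may differ:

```scala
// A sketch only: these types and signatures are simplified assumptions,
// not the real Enceladus/Atum model.
final case class Measurement(controlName: String, controlValue: String)
final case class Checkpoint(name: String, measurements: Seq[Measurement])

trait ControlMetricsPlugin {
  def onCheckpoint(checkpoint: Checkpoint): Unit
  def onJobStatusChange(newStatus: String): Unit
}

// Hypothetical plugin that prints every checkpoint; any exception thrown here
// would be logged by the job, which then continues to run.
class ConsoleControlMetricsPlugin extends ControlMetricsPlugin {
  override def onCheckpoint(checkpoint: Checkpoint): Unit = {
    val summary = checkpoint.measurements
      .map(m => s"${m.controlName}=${m.controlValue}")
      .mkString(", ")
    println(s"Checkpoint '${checkpoint.name}': $summary")
  }

  override def onJobStatusChange(newStatus: String): Unit =
    println(s"Job status changed to: $newStatus")
}
```

Such a plugin would then be constructed by its corresponding factory, as described in the previous section.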

[atum]: https://github.com/AbsaOSS/atum