Merge branch 'develop'
kupferk committed Mar 1, 2022
2 parents ba5a982 + be00e69 commit 80a9ec4
Showing 442 changed files with 12,909 additions and 2,520 deletions.
2 changes: 1 addition & 1 deletion BUILDING.md
@@ -55,7 +55,7 @@ appropriate build profiles, you can easily create a custom build.
Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
is still supported to some extent, but requires some extra care. You will need the Hadoop WinUtils installed. You can
download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto
your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these
your machine. Do not forget to set the `HADOOP_HOME` or `PATH` environment variable to the installation directory of these
utils!

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,17 @@
# Version 0.22.0 - 2022-03-01

* Add new `sqlserver` relation
* Implement new documentation subsystem
* Change default build to Spark 3.2.1 and Hadoop 3.3.1
* Add new `drop` target for removing tables (sketched below)
* Speed up project loading by reusing Jackson mapper
* Implement new `jdbc` metric sink
* Implement schema cache in Executor to speed up documentation and similar tasks
* Add new config variables `flowman.execution.mapping.schemaCache` and `flowman.execution.relation.schemaCache`
* Add new config variable `flowman.default.target.verifyPolicy` to ignore empty tables during VERIFY phase
* Implement initial support for indexes in JDBC relations
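
Two of the additions above, the `drop` target and the new schema cache variables, can be sketched as hedged YAML. Field names and example values are assumptions based on Flowman's usual YAML conventions, not taken verbatim from this release:

```yaml
# Hypothetical snippet; target and relation names are illustrative
targets:
  drop_obsolete_table:
    kind: drop                      # new target kind for removing tables
    relation: some_obsolete_table   # assumed field, by analogy with other relation-based targets

config:
  # Assumed boolean values for the new schema cache switches
  - flowman.execution.mapping.schemaCache=true
  - flowman.execution.relation.schemaCache=true
```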


# Version 0.21.2 - 2022-02-14

* Fix importing projects
110 changes: 110 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,110 @@
# Contributing to Flowman

You want to contribute to Flowman? Welcome! Please read this document to understand what you can do:
* [Report an Issue](#report-an-issue)
* [Contribute Documentation](#contribute-documentation)
* [Contribute Code](#contribute-code)


## Report an Issue

If you find a bug - behavior of Flowman code contradicting your expectation - you are welcome to report it.
We can only handle well-reported, actual bugs, so please follow the guidelines below.

Once you have familiarized yourself with the guidelines, you can go to the [GitHub issue tracker for Flowman](https://github.com/dimajix/flowman/issues/new) to report the issue.

### Quick Checklist for Bug Reports

Issue report checklist:
* Real, current bug
* No duplicate
* Reproducible
* Good summary
* Well-documented
* Minimal example

### Issue handling process

When an issue is reported, a committer will look at it and either confirm it as a real issue, close it if it is not an issue, or ask for more details.

An issue that is about a real bug is closed as soon as the fix is committed.

### Usage of Labels

GitHub offers labels to categorize issues. We suggest the following labels:

Labels for issue categories:
* bug: this issue is a bug in the code
* feature: this issue is a request for a new functionality or an enhancement request
* environment: this issue relates to supporting a specific runtime environment (Cloudera, a specific Spark/Hadoop version, etc.)

Status of open issues:
* help wanted: the feature request is approved and you are invited to contribute

Status/resolution of closed issues:
* wontfix: while acknowledged to be an issue, a fix cannot or will not be provided

### Issue Reporting Disclaimer

We want to improve the quality of Flowman and good bug reports are welcome! But our capacity is limited, thus we reserve the right to close or to not process insufficient bug reports in favor of those which are very cleanly documented and easy to reproduce. Even though we would like to solve each well-documented issue, there is always the chance that it will not happen - remember: Flowman is Open Source and comes without warranty.

Bug report analysis support is very welcome! (e.g. pre-analysis or proposing solutions)



## Contribute Documentation

Flowman has many features, but unfortunately not all of them are well documented, so this is an area where we highly welcome contributions from users. The documentation is contained in the "doc" subdirectory within the source code repository. This implies that when you want to contribute documentation, you have to follow the same procedure as for contributing code.



## Contribute Code

You are welcome to contribute code to Flowman in order to fix bugs or to implement new features.

There are three important things to know:

1. You must be aware of the Apache License (which describes contributions) and **agree to the Contributors License Agreement**. This is common practice in all major Open Source projects.
For company contributors special rules apply. See the respective section below for details.
2. Please ensure your contribution adopts Flowman's **code style, quality, and product standards**. The respective section below gives more details on the coding guidelines.
3. **Not all proposed contributions can be accepted**. Some features may, for example, fit better into a third-party plugin. The code must fit the overall direction of Flowman and really improve it. The more effort you invest, the more you should clarify in advance whether the contribution fits: the best way is to open an issue to discuss the feature you plan to implement (making it clear that you intend to contribute).

### Contributor License Agreement

When you contribute (code, documentation, or anything else), you have to be aware that your contribution is covered by the same [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) that is applied to Flowman itself.

In particular, you need to agree to the [Flowman Contributors License Agreement](https://cla-assistant.io/dimajix/flowman), stating that you have the right to contribute and agree to put your contribution under the license of this project.
CLA assistant will ask you to confirm that.

This applies to all contributors, including those contributing on behalf of a company.
If you agree to its content, simply follow the link posted by the CLA assistant as a comment on the pull request, review the CLA, and accept it on the following screen.
CLA assistant will save this decision for upcoming contributions and will notify you if there is any change to the CLA in the meantime.

### Contribution Content Guidelines

These are some rules we try to follow:

- Apply a clean coding style adapted to the surrounding code, even though we are aware the existing code is not fully clean
- Use 4 spaces for indentation
- Follow the variable naming conventions used in the surrounding files (camelCase)
- No `println` - use SLF4J logging instead (see the sketch after this list)
- Comment your code where it gets non-trivial
- Write a unit test
- Do not make any incompatible changes; in particular, do not change or remove existing properties from YAML specs
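
A minimal sketch of the logging rule in Scala (the class name and message are hypothetical; only the SLF4J usage pattern matters):

```scala
import org.slf4j.LoggerFactory

class MyCustomTarget {
    // Use an SLF4J logger instead of println for any diagnostic output
    private val logger = LoggerFactory.getLogger(classOf[MyCustomTarget])

    def execute(): Unit = {
        logger.info("Executing custom target")
    }
}
```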

### How to contribute - the Process

1. Make sure the change would be welcome (e.g. a bugfix or a useful feature); best do so by proposing it in a GitHub issue
2. Fork the Flowman repository, create a branch, and make your change
3. Commit and push your changes on that branch
4. If your change fixes an issue reported at GitHub, add the following line to the commit message:
- ```Fixes #(issueNumber)```
5. Create a pull request with the following information:
   - Describe the problem you fix with this change.
   - Describe the effect that this change has from a user's point of view. App crashes and lockups are convincing examples, but not all bugs are that obvious, so less visible effects should be mentioned in the text as well.
   - Describe the technical details of what you changed. It is important to describe the change in an understandable way so the reviewer is able to verify that the code behaves as you intend it to.
6. Follow the link posted by the CLA assistant to your pull request and accept it, as described in detail above.
7. Wait for our code review and approval, possibly enhancing your change on request
- Note that the Flowman developers also have their regular duties, so depending on the required effort for reviewing, testing and clarification this may take a while
8. Once the change has been approved, we will inform you in a comment
9. We will close the pull request; feel free to delete the now obsolete branch
23 changes: 10 additions & 13 deletions QUICKSTART.md
@@ -16,7 +16,7 @@ Fortunately, Apache Spark is rather simple to install locally on your machine:

### Download & Install Spark

As of this writing, the latest release of Flowman is 0.20.0 and is available prebuilt for Spark 3.1.2 on the Spark
As of this writing, the latest release of Flowman is 0.22.0 and is available prebuilt for Spark 3.2.1 on the Spark
homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it.

```shell
@@ -25,8 +25,8 @@
mkdir playground
cd playground

# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.1.2-bin-hadoop3.2 spark
curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.2.1-bin-hadoop3.2 spark
```
The Spark package already contains Hadoop, so with this single download you have both installed and integrated with each other.

@@ -35,19 +35,20 @@
If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of
DLLs required for the Hadoop libraries to work. You can get a copy at https://github.com/kontext-tech/winutils.
Once you have downloaded the appropriate version, you need to place the DLLs into a directory `$HADOOP_HOME/bin`, where
`HADOOP_HOME` refers to some location on your Windows PC. You also need to set the following environment variables:
`HADOOP_HOME` refers to some arbitrary location of your choice on your Windows PC. You also need to set the following
environment variables:
* `HADOOP_HOME` should point to the parent directory of the `bin` directory
* `PATH` should also contain `$HADOOP_HOME/bin`
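
For example, from a classic Command Prompt the variables could be set like this (a sketch only, assuming the WinUtils binaries were unpacked to the arbitrary location `C:\hadoop\bin`):

```shell
:: Example only - assumes the WinUtils DLLs were unpacked to C:\hadoop\bin
setx HADOOP_HOME "C:\hadoop"
setx PATH "%PATH%;C:\hadoop\bin"
```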


## 1.2 Install Flowman

You can find prebuilt Flowman packages on the corresponding release page on GitHub. For this quickstart, we chose
`flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz` which nicely fits to the Spark package we just downloaded before.
`flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz` which nicely fits to the Spark package we just downloaded before.

```shell
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/0.20.0/flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz | tar xvzf -
curl -L https://github.com/dimajix/flowman/releases/download/0.22.0/flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz | tar xvzf -

# Create a nice link
ln -snf flowman-0.20.0 flowman
@@ -81,13 +82,9 @@ That’s all we need to run the Flowman example.

# 2. Flowman Shell

The example data is stored in a S3 bucket provided by myself. In order to access the data, you need to provide valid
AWS credentials in your environment:

```shell
$ export AWS_ACCESS_KEY_ID=<your aws access key>
$ export AWS_SECRET_ACCESS_KEY=<your aws secret key>
```
The example data is stored in an S3 bucket provided by myself. Since the data is publicly available and the project is
configured to use anonymous AWS authentication, you do not need to provide any AWS credentials (you do not even
need to have an account on AWS).
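
For reference, anonymous S3 access is typically enabled through the Hadoop S3A credentials provider. A hedged sketch of what such a configuration looks like (the property and class come from Hadoop's S3A connector; whether the example project sets it exactly like this is an assumption):

```yaml
config:
  # Anonymous (unauthenticated) access to the public S3 bucket
  - spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
```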

## 2.1 Start interactive Flowman shell

50 changes: 30 additions & 20 deletions README.md
@@ -21,11 +21,11 @@ keep all aspects (like transformations and schema information) in a single place
* Semantics of a build tool like Maven - just for data instead of applications
* Declarative syntax in YAML files
* Data model management (Create, Migrate and Destroy Hive tables, JDBC tables or file based storage)
* Generation of meaningful documentation
* Flexible expression language
* Jobs for managing build targets (like copying files or uploading data via sftp)
* Automatic data dependency management within the execution of individual jobs
* Rich set of execution metrics
* Meaningful logging output
* Meaningful logging output & rich set of execution metrics
* Powerful yet simple command line tools
* Extendable via Plugins

@@ -38,28 +38,21 @@ You can find the official homepage at [Flowman.io](https://flowman.io)

# Installation

You can either grab an appropriate pre-build package at https://github.com/dimajix/flowman/releases or you
can build your own version via Maven with

mvn clean install

Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.

You can grab an appropriate pre-built package at [GitHub](https://github.com/dimajix/flowman/releases)

## Installing the Packed Distribution

The packed distribution file is called `flowman-{version}-bin.tar.gz` and can be extracted at any
location using

tar xvzf flowman-{version}-bin.tar.gz

```shell
tar xvzf flowman-{version}-bin.tar.gz
```

## Apache Spark

Flowman does not bring its own Spark libraries, but relies on a correctly installed Spark distribution. You can
download appropriate packages directly from [the Spark Homepage](https://spark.apache.org).


## Hadoop Utils for Windows

If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of
@@ -70,7 +63,6 @@ Once you downloaded the appropriate version, you need to place the DLLs into a d
* `PATH` should also contain `$HADOOP_HOME/bin`



# Command Line Utils

The primary tool provided by Flowman is called `flowexec` and is located in the `bin` folder of the
@@ -80,19 +72,37 @@ installation directory.

The `flowexec` tool has several subcommands for working with objects and projects. The general pattern
looks as follows

flowexec [generic options] <cmd> <subcommand> [specific options and arguments]
```shell
flowexec [generic options] <cmd> <subcommand> [specific options and arguments]
```

For working with `flowexec`, either your current working directory needs to contain a Flowman
project with a file `project.yml` or you need to specify the path to a valid project via

flowexec -f /path/to/project/folder <cmd>
```shell
flowexec -f /path/to/project/folder <cmd>
```
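
For instance, running the build lifecycle of a job could look like this (the `job build` subcommand and the job name `main` follow the general pattern above, but treat the exact invocation as an illustrative assumption):

```shell
flowexec -f /path/to/project/folder job build main
```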

## Interactive Shell

With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be
started via

flowshell -f <project>
```shell
flowshell -f <project>
```

Within the shell, you can interactively build targets and inspect intermediate mappings.
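
A session might look as follows; the commands in the comments are assumptions intended to convey the flavor of the shell, not verified transcripts:

```shell
flowshell -f /path/to/project/folder
# Hypothetical commands inside the shell:
#   mapping list        - list the mappings of the project
#   mapping show <name> - inspect the records produced by a mapping
#   job build main      - run the build lifecycle of a job
#   exit                - leave the shell
```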


# Building

You can build your own version via Maven with
```shell
mvn clean install
```
Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.
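
If you need a package for a different Spark/Hadoop combination, the build profiles described in BUILDING.md can be selected on the command line. A hedged example (the `spark-3.2` profile id appears in the docker pom.xml diff further down; `hadoop-3.3` is assumed to exist analogously):

```shell
mvn clean install -Pspark-3.2 -Phadoop-3.3
```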


# Contributing

You want to contribute to Flowman? Welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) to understand what you can
do.
24 changes: 23 additions & 1 deletion docker/conf/default-namespace.yml
@@ -13,6 +13,28 @@ connections:
    username: $System.getenv('FLOWMAN_LOGDB_USER', '')
    password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '')

# This adds a hook for creating an execution log in a file
hooks:
  kind: report
  location: ${project.basedir}/generated-report.txt
  metrics:
    # Define common labels for all metrics
    labels:
      project: ${project.name}
    metrics:
      # Collect everything
      - selector:
          name: .*
        labels:
          category: ${category}
          kind: ${kind}
          name: ${name}

# This configures where metrics should be written to. Since we cannot assume a working Prometheus push gateway, we
# simply print them onto the console
metrics:
  - kind: console

config:
- spark.sql.warehouse.dir=/opt/flowman/hive/warehouse
- spark.hadoop.hive.metastore.uris=
@@ -21,7 +43,7 @@

store:
  kind: file
  location: /opt/flowman/examples
  location: $System.getenv('FLOWMAN_HOME')/examples

plugins:
- flowman-aws
20 changes: 20 additions & 0 deletions docker/conf/history-server.yml
@@ -0,0 +1,20 @@
# The following definition provides a "run history" stored in a database. If nothing else is specified, the database
# is stored locally as a Derby database. If you do not want to use the history, you can simply remove the whole
# 'history' block from this file.
history:
  kind: jdbc
  connection: flowman_state
  retries: 3
  timeout: 1000

connections:
  flowman_state:
    driver: $System.getenv('FLOWMAN_LOGDB_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver')
    url: $System.getenv('FLOWMAN_LOGDB_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true'))
    username: $System.getenv('FLOWMAN_LOGDB_USER', '')
    password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '')

plugins:
- flowman-mariadb
- flowman-mysql
- flowman-mssqlserver
18 changes: 16 additions & 2 deletions docker/pom.xml
@@ -10,10 +10,14 @@
    <parent>
        <groupId>com.dimajix.flowman</groupId>
        <artifactId>flowman-root</artifactId>
        <version>0.21.2</version>
        <version>0.22.0</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <properties>
        <spark-hadoop-archive.version>${hadoop-api.version}</spark-hadoop-archive.version>
    </properties>

    <profiles>
        <profile>
            <id>CDH-6.3</id>
@@ -27,6 +31,16 @@
                <dockerfile.skip>true</dockerfile.skip>
            </properties>
        </profile>
        <profile>
            <id>spark-3.2</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <!-- The Spark 3.2 archives continue to have a wrong file name -->
                <spark-hadoop-archive.version>3.2</spark-hadoop-archive.version>
            </properties>
        </profile>
    </profiles>

    <build>
@@ -93,7 +107,7 @@
                    <pullNewerImage>false</pullNewerImage>
                    <buildArgs>
                        <BUILD_SPARK_VERSION>${spark.version}</BUILD_SPARK_VERSION>
                        <BUILD_HADOOP_VERSION>${hadoop-api.version}</BUILD_HADOOP_VERSION>
                        <BUILD_HADOOP_VERSION>${spark-hadoop-archive.version}</BUILD_HADOOP_VERSION>
                        <DIST_FILE>flowman-dist-${flowman.dist.label}-bin.tar.gz</DIST_FILE>
                        <http_proxy>${env.http_proxy}</http_proxy>
                        <https_proxy>${env.https_proxy}</https_proxy>
