Merge branch 'develop'
kupferk committed Jun 2, 2021
2 parents 70a0dba + 7fa979c commit 1bac111
Showing 278 changed files with 20,897 additions and 1,146 deletions.
21 changes: 0 additions & 21 deletions .gitlab-ci.yml
@@ -60,16 +60,6 @@ build-default:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

# List additional build variants (some of them will be built on pushes)
build-hadoop2.6-spark2.3:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip'
  artifacts:
    name: "flowman-dist-hadoop2.6-spark2.3"
    paths:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-hadoop2.6-spark2.4:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.4 -Ddockerfile.skip'
@@ -133,17 +123,6 @@ build-hadoop3.2-spark3.1:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-cdh5.15:
  stage: build
  except:
    - pushes
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-5.15 -Ddockerfile.skip'
  artifacts:
    name: "flowman-dist-cdh5.15"
    paths:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-cdh6.3:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-6.3 -Ddockerfile.skip'
12 changes: 0 additions & 12 deletions .travis.yml
@@ -19,14 +19,6 @@ jobs:
      jdk: openjdk8
      script: mvn clean install

    - name: Hadoop 2.6 with Spark 2.3
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip

    - name: Hadoop 2.7 with Spark 2.3
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.7 -Pspark-2.3 -Ddockerfile.skip

    - name: Hadoop 2.6 with Spark 2.4
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.6 -Pspark-2.4
@@ -51,10 +43,6 @@ jobs:
      jdk: openjdk8
      script: mvn clean install -Phadoop-3.2 -Pspark-3.1

    - name: CDH 5.15
      jdk: openjdk8
      script: mvn clean install -PCDH-5.15 -Ddockerfile.skip

    - name: CDH 6.3
      jdk: openjdk8
      script: mvn clean install -PCDH-6.3 -Ddockerfile.skip
61 changes: 21 additions & 40 deletions BUILDING.md
@@ -3,7 +3,18 @@
The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine.
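
The CI snippets elsewhere in this commit suggest the Docker image can be skipped when Docker is unavailable; a minimal sketch (the `-Ddockerfile.skip` flag appears throughout this page, the exact combination below is an assumption):

```shell
# Build everything except the Docker image (assumed to work per the CI scripts above)
mvn clean install -Ddockerfile.skip
```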

## Build with Maven
## Prerequisites

You need the following tools installed on your machine:
* JDK 1.8 or later. If you build a variant with Scala 2.11, you have to use JDK 1.8 (and not anything newer like
  Java 11). This mainly affects builds with Spark 2.x.
* Apache Maven (install via package manager or download from https://maven.apache.org/download.cgi)
* npm (install via package manager or download from https://www.npmjs.com/get-npm)
* Windows users also need Hadoop winutils installed. Those can be retrieved from https://github.com/cdarlint/winutils.
  See some additional details for building on Windows below.
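
A quick sanity check of the toolchain could look like this (a minimal sketch; the version expectation comes from the list above):

```shell
java -version   # expect 1.8 when building Spark 2.x / Scala 2.11 variants
mvn --version
npm --version
```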


## Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
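
The command itself sits in the collapsed lines of this hunk; judging from the CI configuration elsewhere in this commit, it is presumably just:

```shell
mvn clean install
```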

@@ -22,9 +33,11 @@ in a complex environment with Kerberos. You can find the `tar.gz` file in the di

## Build on Windows

Although you can normally build Flowman on Windows, you will need the Hadoop WinUtils installed. You can download
the binaries from https://github.com/steveloughran/winutils and install an appropriate version somewhere onto your
machine. Do not forget to set the HADOOP_HOME environment variable to the installation directory of these utils!
Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
is still supported to some extent, but requires some extra care. You will need the Hadoop WinUtils installed. You can
download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto
your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these
utils!
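
For example, in a Windows command prompt this could look like the following sketch (the install path `C:\hadoop-winutils` is a placeholder, not something this page prescribes):

```shell
set HADOOP_HOME=C:\hadoop-winutils
set PATH=%PATH%;%HADOOP_HOME%\bin
```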

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
some unittests may fail and Docker images might not be usable. This can be done by setting the git configuration
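
The concrete value is hidden in the collapsed lines of this hunk; the standard git option for this behaviour is `core.autocrlf`, for example:

```shell
git config --global core.autocrlf input
```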
@@ -46,24 +59,23 @@ the `master` branch really builds clean with all unittests passing on Linux.

## Build for Custom Spark / Hadoop Version

Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
But of course you can also build for a different version by either using a profile

```shell
mvn install -Pspark2.3 -Phadoop2.7 -DskipTests
mvn install -Pspark-2.4 -Phadoop-2.7 -DskipTests
```

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

```shell
mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
mvn install -Dspark.version=2.4.3 -Dhadoop.version=2.7.3
```
Note that using profiles is the preferred way, as this guarantees that all dependencies are also selected
in the matching version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* spark-3.1
@@ -73,37 +85,12 @@ using the correct version. The following profiles are available:
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

## Building for Open Source Hadoop and Spark

### Spark 2.3 and Hadoop 2.6:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.6
```

### Spark 2.3 and Hadoop 2.7:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.7
```

### Spark 2.3 and Hadoop 2.8:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.8
```

### Spark 2.3 and Hadoop 2.9:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.9
```

### Spark 2.4 and Hadoop 2.6:

```shell
@@ -148,13 +135,7 @@ mvn clean install -Pspark-3.1 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

```shell
mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
```

Or for Cloudera 6.3
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.

```shell
mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
# Version 0.17.0 - 2021-06-02

* New Flowman Kernel and Flowman Studio application prototypes
* New ParallelExecutor
* Fix before/after dependencies in `count` target
* Default build is now Spark 3.1 + Hadoop 3.2
* Remove build profiles for Spark 2.3 and CDH 5.15
* Add MS SQL Server plugin containing JDBC driver
* Speed up file listing for `file` relations
* Use Spark JobGroups
* Better support for running Flowman on Windows with appropriate batch scripts


# Version 0.16.0 - 2021-04-23

* Add logo to Flowman Shell
6 changes: 6 additions & 0 deletions NOTICE
@@ -66,6 +66,12 @@ MariaDB Java Client
* HOMEPAGE:
* https://mariadb.com

MSSQL JDBC Client
* LICENSE
* license/LICENSE-mssql-jdbc.txt
* HOMEPAGE:
* https://github.com/Microsoft/mssql-jdbc

Apache Derby
* LICENSE
* license/LICENSE-derby.txt (Apache 2.0 License)
9 changes: 2 additions & 7 deletions build-release.sh
@@ -15,15 +15,10 @@ build_profile() {

build_profile hadoop-2.6 spark-2.3
build_profile hadoop-2.6 spark-2.4
build_profile hadoop-2.7 spark-2.3
build_profile hadoop-2.7 spark-2.4
build_profile hadoop-2.8 spark-2.3
build_profile hadoop-2.8 spark-2.4
build_profile hadoop-2.9 spark-2.3
build_profile hadoop-2.9 spark-2.4
build_profile hadoop-2.9 spark-3.0
build_profile hadoop-3.1 spark-3.0
build_profile hadoop-2.7 spark-3.0
build_profile hadoop-3.2 spark-3.0
build_profile hadoop-2.7 spark-3.1
build_profile hadoop-3.2 spark-3.1
build_profile CDH-5.15
build_profile CDH-6.3
4 changes: 2 additions & 2 deletions docker/pom.xml
@@ -10,8 +10,8 @@
    <parent>
        <groupId>com.dimajix.flowman</groupId>
        <artifactId>flowman-root</artifactId>
        <version>0.16.0</version>
        <relativePath>..</relativePath>
        <version>0.17.0</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <properties>
30 changes: 4 additions & 26 deletions docs/building.md
@@ -60,20 +60,19 @@ You might also want to skip unittests (the HBase plugin is currently failing und

### Build for Custom Spark / Hadoop Version

Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
But of course you can also build for a different version by either using a profile

    mvn install -Pspark2.2 -Phadoop2.7 -DskipTests
    mvn install -Pspark-2.4 -Phadoop-2.7 -DskipTests

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

    mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
    mvn install -Dspark.version=2.4.1 -Dhadoop.version=2.7.3

Note that using profiles is the preferred way, as this guarantees that all dependencies are also selected
in the matching version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* spark-3.1
@@ -83,29 +82,12 @@ using the correct version. The following profiles are available:
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

### Building for Open Source Hadoop and Spark

Spark 2.3 and Hadoop 2.6:

    mvn clean install -Pspark-2.3 -Phadoop-2.6

Spark 2.3 and Hadoop 2.7:

    mvn clean install -Pspark-2.3 -Phadoop-2.7

Spark 2.3 and Hadoop 2.8:

    mvn clean install -Pspark-2.3 -Phadoop-2.8

Spark 2.3 and Hadoop 2.9:

    mvn clean install -Pspark-2.3 -Phadoop-2.9

Spark 2.4 and Hadoop 2.6:

    mvn clean install -Pspark-2.4 -Phadoop-2.6
@@ -137,11 +119,7 @@ Spark 3.1 and Hadoop 3.2

### Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

    mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests

Or for Cloudera 6.3
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.

    mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests

6 changes: 5 additions & 1 deletion docs/config.md
@@ -31,7 +31,11 @@ the existence of targets to decide if a rebuild is required.

- `flowman.execution.executor.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleExecutor`)*
Configure the executor to use. The default `SimpleExecutor` will process all targets in the correct order
sequentially.
sequentially. The alternative implementation `com.dimajix.flowman.execution.ParallelExecutor` will run multiple
targets in parallel (if they do not depend on each other); see the configuration sketch below.

- `flowman.execution.executor.parallelism` *(type: int)* *(default: 4)*
The number of targets to be executed in parallel, when the `ParallelExecutor` is used.

- `flowman.execution.scheduler.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleScheduler`)*
Configure the scheduler to use. The default `SimpleScheduler` will sort all targets according to their dependency.
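
As a sketch, enabling the parallel executor in a namespace configuration could look like this (the two keys are documented above; the surrounding `config:` file layout is an assumption):

```yaml
# Hypothetical namespace configuration enabling the ParallelExecutor;
# only the two property names are given by this page.
config:
  - flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
  - flowman.execution.executor.parallelism=8
```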
22 changes: 20 additions & 2 deletions docs/spec/mapping/mock.md
@@ -15,14 +15,32 @@
```yaml
mappings:
  empty_mapping:
  some_other_mapping:
    kind: mock
    mapping: some_mapping
    records:
      - [1,2,"some_string",""]
      - [2,null,"cat","black"]
```
```yaml
mappings:
  some_mapping:
    kind: mock
    mapping: some_mapping
    records:
      - Campaign ID: DIR_36919
        LineItemID ID: DIR_260390
        SiteID ID: 23374
        CreativeID ID: 292668
        PlacementID ID: 108460
      - Campaign ID: DIR_36919
        LineItemID ID: DIR_260390
        SiteID ID: 23374
        CreativeID ID: 292668
        PlacementID ID: 108460
```
## Fields
* `kind` **(mandatory)** *(type: string)*: `mock`

@@ -39,7 +57,7 @@
* `MEMORY_AND_DISK_SER`

* `mapping` **(optional)** *(type: string)*:
Specifies the name of the mapping to be mocked. If no name is given, the a mapping with the same name will be
Specifies the name of the mapping to be mocked. If no name is given, then a mapping with the same name will be
mocked. Note that this will only work when used as an override mapping in test cases, otherwise an infinite loop
would be created by the mapping referencing itself.

21 changes: 17 additions & 4 deletions docs/spec/mapping/values.md
@@ -18,8 +18,8 @@ mappings:
- name: str_col
type: string
records:
- [1,"some_string"]
- [2,"cat"]
- [1,"some_string"]
- [2,"cat"]
```
```yaml
@@ -30,8 +30,21 @@
int_col: integer
str_col: string
records:
- [1,"some_string"]
- [2,"cat"]
- [1,"some_string"]
- [2,"cat"]
```
```yaml
mappings:
  fake_input:
    kind: values
    columns:
      int_col: integer
      str_col: string
    records:
      - int_col: 1
        str_col: "some_string"
      - str_col: "cat"
```