Merge branch 'develop'

kupferk committed Sep 10, 2020
2 parents 6a9ce9d + bf80d81 commit 174e4b3
Showing 361 changed files with 7,434 additions and 2,500 deletions.
5 changes: 5 additions & 0 deletions .travis.yml
@@ -8,6 +8,11 @@ cache:
services:
- docker

deploy:
  provider: releases
  file: flowman-dist/target/flowman-dist-*-bin.tar.gz*
  overwrite: true
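
For reference, a complete `releases` provider stanza usually also needs an encrypted API token, a glob flag for wildcard file names, and a tag filter. A hedged sketch of those surrounding fields (they are not shown in this hunk and are assumptions, not part of the commit):

    deploy:
      provider: releases
      api_key:
        secure: "<encrypted GitHub token>"
      file_glob: true
      file: flowman-dist/target/flowman-dist-*-bin.tar.gz*
      overwrite: true
      skip_cleanup: true
      on:
        tags: true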

jobs:
include:
- name: Default Build
97 changes: 56 additions & 41 deletions BUILDING.md
@@ -3,11 +3,17 @@
 The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
 is installed on the build machine.
 
-# Main Artifacts
+## Build with Maven
 
-The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a
-runnable version of Flowman for direct installation in cases where Docker is not available or when you
-want to run Flowman in a complex environment with Kerberos.
+Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
+
+    mvn clean install
+
+## Main Artifacts
+
+The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
+version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
+in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`.
 
 
 # Custom Builds
@@ -56,60 +62,69 @@ using the correct version. The following profiles are available:
 * CDH-5.15
 * CDH-6.3
 
 With these profiles it is easy to build Flowman to match your environment.
 
-## Building for Cloudera
+## Building for Open Source Hadoop and Spark
 
-The Maven project also contains preconfigured profiles for Cloudera.
+Spark 2.3 and Hadoop 2.6:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.6
+
+Spark 2.3 and Hadoop 2.7:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.7
 
-    mvn install -Pspark-2.3 -PCDH-5.15 -DskipTests
+Spark 2.3 and Hadoop 2.8:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.8
 
-## Skipping Docker Image
+Spark 2.3 and Hadoop 2.9:
 
-Part of the build also is a Docker image. Since you might not want to use it, because you are using different base
-images, you can skip the building of the Docker image via `-Ddockerfile.skip`
+    mvn clean install -Pspark-2.3 -Phadoop-2.9
 
-# Releasing
+Spark 2.4 and Hadoop 2.6:
 
-## Releasing
+    mvn clean install -Pspark-2.4 -Phadoop-2.6
 
-When making a release, the gitflow maven plugin should be used for managing versions
+Spark 2.4 and Hadoop 2.7:
 
-    mvn gitflow:release
+    mvn clean install -Pspark-2.4 -Phadoop-2.7
 
-## Deploying to Central Repository
+Spark 2.4 and Hadoop 2.8:
 
-Both snapshot and release versions can be deployed to Sonatype, which in turn is mirrored by the Maven Central
-Repository.
+    mvn clean install -Pspark-2.4 -Phadoop-2.8
 
-    mvn deploy -Dgpg.skip=false
+Spark 2.4 and Hadoop 2.9:
 
-The deployment has to be committed via
+    mvn clean install -Pspark-2.4 -Phadoop-2.9
 
-    mvn nexus-staging:close -DstagingRepositoryId=comdimajixflowman-1001
+Spark 3.0 and Hadoop 3.1:
+
+    mvn clean install -Pspark-3.0 -Phadoop-3.1
+
+Spark 3.0 and Hadoop 3.2:
+
+    mvn clean install -Pspark-3.0 -Phadoop-3.2
+
+## Building for Cloudera
+
+The Maven project also contains preconfigured profiles for Cloudera.
+
+    mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
 
-Or the staging data can be removed via
+Or for Cloudera 6.3:
 
-    mvn nexus-staging:drop
+    mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
 
-## Deploying to Custom Repository
+## Skipping Docker Image
 
-You can also deploy to a different repository by setting the following properties
-* `deployment.repository.id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
-* `deployment.repository.snapshot-id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
-* `deployment.repository.server` - the url of the server as used by the nexus-staging-maven-plugin
-* `deployment.repository.url` - the url of the default release repsotiory
-* `deployment.repository.snapshot-url` - the url of the snapshot repository
+Part of the build is also a Docker image. Since you might not want to use it, because you are using different base
+images, you can skip building the Docker image via `-Ddockerfile.skip`.
 
-Per default, Flowman uses the staging mechanism provided by the nexus-staging-maven-plugin. This this is not what you
-want, you can simply disable the Plugin via `skipTests`
+## Building Documentation
 
-With these settings you can deploy to a different (local) repository, for example
+Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
 
-    mvn deploy \
-        -Pspark-2.3 \
-        -PCDH-5.15 \
-        -Ddeployment.repository.snapshot-url=https://nexus-snapshots.my-company.net/repository/snapshots \
-        -Ddeployment.repository.snapshot-id=nexus-snapshots \
-        -DskipStaging \
-        -DskipTests
+    cd docs
+    make html
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,17 @@
# Version 0.14.0

* Fix AWS plugin for Hadoop 3.x
* Improve setup of logging
* Shade Velocity for better interoperability with Spark 3
* Add new web hook facility in namespaces and jobs
* Existing targets will not be overwritten anymore by default. Either use the `--force` command line option, or set
the configuration property `flowman.execution.target.forceDirty` to `true` for the old behaviour.
* Add new command line option `--keep-going`
* Implement new `com.dimajix.spark.io.DeferredFileCommitProtocol` which can be used by setting the Spark configuration
  parameter `spark.sql.sources.commitProtocolClass` (see the example after this list)
* Add new `flowshell` application
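
The new commit protocol mentioned above can be enabled through any of the usual Spark configuration channels, for example in `spark-defaults.conf` (a sketch based on the bullet point, not taken from this commit):

    spark.sql.sources.commitProtocolClass  com.dimajix.spark.io.DeferredFileCommitProtocol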


# Version 0.13.1 - 2020-07-14

* Code improvements
44 changes: 44 additions & 0 deletions RELEASING.md
@@ -0,0 +1,44 @@
# Releasing

## Releasing

When making a release, the gitflow maven plugin should be used for managing versions

mvn gitflow:release

## Deploying to Central Repository

Both snapshot and release versions can be deployed to Sonatype, which in turn is mirrored by the Maven Central
Repository.

mvn deploy -Dgpg.skip=false
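
Signing requires a GPG key in your local keyring; a specific key can be selected with the standard maven-gpg-plugin property (the key ID below is a placeholder):

    mvn deploy -Dgpg.skip=false -Dgpg.keyname=<key-id>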

The deployment has to be committed via

mvn nexus-staging:close -DstagingRepositoryId=comdimajixflowman-1001

Or the staging data can be removed via

mvn nexus-staging:drop

## Deploying to Custom Repository

You can also deploy to a different repository by setting the following properties:
* `deployment.repository.id` - contains the ID of the release repository. This should match an entry in your settings.xml for authentication (see the sketch below)
* `deployment.repository.snapshot-id` - contains the ID of the snapshot repository. This should match an entry in your settings.xml for authentication
* `deployment.repository.server` - the URL of the server as used by the nexus-staging-maven-plugin
* `deployment.repository.url` - the URL of the default release repository
* `deployment.repository.snapshot-url` - the URL of the snapshot repository
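
The repository IDs above need matching credentials in your Maven `settings.xml`. A minimal sketch, assuming a snapshot repository with the ID `nexus-snapshots` (the ID and credentials are placeholders):

    <settings>
      <servers>
        <server>
          <id>nexus-snapshots</id>
          <username>deployment</username>
          <password><!-- placeholder --></password>
        </server>
      </servers>
    </settings>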

By default, Flowman uses the staging mechanism provided by the nexus-staging-maven-plugin. If this is not what you
want, you can simply disable the plugin via `skipStaging`.

With these settings you can deploy to a different (local) repository, for example

mvn deploy \
-Pspark-2.3 \
-PCDH-5.15 \
-Ddeployment.repository.snapshot-url=https://nexus-snapshots.my-company.net/repository/snapshots \
-Ddeployment.repository.snapshot-id=nexus-snapshots \
-DskipStaging \
-DskipTests
6 changes: 4 additions & 2 deletions docker/Dockerfile
@@ -1,6 +1,8 @@
 FROM ${docker.base-image.repository}:${docker.base-image.version}
 MAINTAINER [email protected]
 
+ARG DIST_FILE
+
 USER root
 
 ENV FLOMAN_HOME=/opt/flowman
@@ -12,9 +14,9 @@ COPY libexec/ /opt/docker/libexec/
 
 
 # Copy and install Repository
-COPY flowman-dist-${project.version}-bin.tar.gz /tmp/repo/
+COPY $DIST_FILE /tmp/repo/flowman-dist.tar.gz
 COPY conf/ /tmp/repo/conf
-RUN tar -C /opt --owner=root --group=root -xzf /tmp/repo/flowman-dist-${project.version}-bin.tar.gz && \
+RUN tar -C /opt --owner=root --group=root -xzf /tmp/repo/flowman-dist.tar.gz && \
     ln -s /opt/flowman* /opt/flowman && \
     cp -a /tmp/repo/conf/* /opt/flowman/conf && \
     chown -R root:root /opt/flowman* && \
10 changes: 7 additions & 3 deletions docker/pom.xml
@@ -10,11 +10,12 @@
     <parent>
         <groupId>com.dimajix.flowman</groupId>
         <artifactId>flowman-root</artifactId>
-        <version>0.13.1</version>
+        <version>0.14.0</version>
         <relativePath>..</relativePath>
     </parent>
 
     <properties>
+        <dist.tag>${project.version}-${hadoop.dist}-spark${spark-api.version}-hadoop${hadoop-api.version}</dist.tag>
         <docker.base-image.repository>dimajix/spark</docker.base-image.repository>
         <docker.base-image.version>${spark.version}</docker.base-image.version>
     </properties>
@@ -52,7 +53,7 @@
             <resource>
                 <directory>../flowman-dist/target</directory>
                 <includes>
-                    <include>flowman-dist-${project.version}-bin.tar.gz</include>
+                    <include>flowman-dist-${dist.tag}-bin.tar.gz</include>
                 </includes>
                 <filtering>false</filtering>
             </resource>
@@ -94,8 +95,11 @@
                     <repository>dimajix/flowman</repository>
                     <contextDirectory>target/build</contextDirectory>
                     <useMavenSettingsForAuth>true</useMavenSettingsForAuth>
-                    <tag>${project.version}</tag>
+                    <tag>${dist.tag}</tag>
                     <pullNewerImage>false</pullNewerImage>
+                    <buildArgs>
+                        <DIST_FILE>flowman-dist-${dist.tag}-bin.tar.gz</DIST_FILE>
+                    </buildArgs>
                 </configuration>
             </plugin>
         </plugins>
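
The new `buildArgs` section passes the distribution file name into the Dockerfile's `ARG DIST_FILE`. Outside of Maven, the equivalent manual invocation would look roughly as follows (the file name and tag are illustrative placeholders for the values derived from `${dist.tag}`):

    docker build \
        --build-arg DIST_FILE=flowman-dist-<dist.tag>-bin.tar.gz \
        -t dimajix/flowman:<dist.tag> \
        target/build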
134 changes: 134 additions & 0 deletions docs/building.md
@@ -0,0 +1,134 @@
# Building Flowman

Since Flowman depends on libraries like Spark and Hadoop, which are commonly provided by a platform environment like
Cloudera or EMR, you currently need to build Flowman yourself to match the correct versions. Prebuilt Flowman
distributions are planned, but not available yet.

The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine - building the Docker image can be disabled (see below).

## Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install

## Main Artifacts

The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`.
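
For example, a manual installation from the tarball could look like this (the exact file name depends on the selected build profiles, so the version below is a placeholder):

    tar -C /opt -xzf flowman-dist/target/flowman-dist-<version>-bin.tar.gz
    ln -s /opt/flowman-<version> /opt/flowman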


# Custom Builds

## Build on Windows

You can build Flowman on Windows, but you will need the Hadoop WinUtils installed. You can download
the binaries from https://github.com/steveloughran/winutils and install an appropriate version somewhere onto your
machine. Do not forget to set the HADOOP_HOME environment variable to the installation directory of these utils!
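
For example, on the Windows command line (the installation path below is only an assumption):

    setx HADOOP_HOME "C:\hadoop\winutils"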

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
some unit tests may fail and Docker images might not be usable. This can be done by setting the git configuration
value "core.autocrlf" to "input":

git config --global core.autocrlf input

You might also want to skip the unit tests (the HBase plugin is currently failing under Windows):

mvn clean install -DskipTests


## Build for Custom Spark / Hadoop Version

By default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
But of course you can also build for a different version, either by using a profile

    mvn install -Pspark-2.3 -Phadoop-2.7 -DskipTests

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

    mvn install -Dspark.version=2.3.1 -Dhadoop.version=2.7.3

Note that using profiles is the preferred way, as this guarantees that dependencies are also selected
in the correct version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* hadoop-2.6
* hadoop-2.7
* hadoop-2.8
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

## Building for Open Source Hadoop and Spark

Spark 2.3 and Hadoop 2.6:

mvn clean install -Pspark-2.3 -Phadoop-2.6

Spark 2.3 and Hadoop 2.7:

mvn clean install -Pspark-2.3 -Phadoop-2.7

Spark 2.3 and Hadoop 2.8:

mvn clean install -Pspark-2.3 -Phadoop-2.8

Spark 2.3 and Hadoop 2.9:

mvn clean install -Pspark-2.3 -Phadoop-2.9

Spark 2.4 and Hadoop 2.6:

mvn clean install -Pspark-2.4 -Phadoop-2.6

Spark 2.4 and Hadoop 2.7:

mvn clean install -Pspark-2.4 -Phadoop-2.7

Spark 2.4 and Hadoop 2.8:

mvn clean install -Pspark-2.4 -Phadoop-2.8

Spark 2.4 and Hadoop 2.9:

mvn clean install -Pspark-2.4 -Phadoop-2.9

Spark 3.0 and Hadoop 3.1:

mvn clean install -Pspark-3.0 -Phadoop-3.1

Spark 3.0 and Hadoop 3.2:

mvn clean install -Pspark-3.0 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests

Or for Cloudera 6.3:

mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests


## Skipping Docker Image

Part of the build is also a Docker image. Since you might not want to use it, because you are using different base
images, you can skip building the Docker image via `-Ddockerfile.skip`.
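
For example:

    mvn clean install -Ddockerfile.skip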

## Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.

cd docs
make html
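
Note that this step requires a local Sphinx installation with Markdown support; a sketch of the setup, assuming the `recommonmark` extension is used (this package choice is an assumption, not confirmed by this commit):

    pip install sphinx recommonmark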