Merge branch 'develop'

kupferk committed Sep 10, 2020
2 parents 6a9ce9d + bf80d81 commit 174e4b3
Showing 361 changed files with 7,434 additions and 2,500 deletions.
5 changes: 5 additions & 0 deletions .travis.yml
@@ -8,6 +8,11 @@ cache:
services:
- docker

deploy:
  provider: releases
  file: flowman-dist/target/flowman-dist-*-bin.tar.gz*
  overwrite: true
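
For reference, a complete `releases` provider stanza usually also needs an encrypted API token, a glob flag for wildcard file names, and a tag filter. A hedged sketch of those surrounding fields (they are not shown in this hunk and are assumptions, not part of the commit):

    deploy:
      provider: releases
      api_key:
        secure: "<encrypted GitHub token>"
      file_glob: true
      file: flowman-dist/target/flowman-dist-*-bin.tar.gz*
      overwrite: true
      skip_cleanup: true
      on:
        tags: true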

jobs:
include:
- name: Default Build
97 changes: 56 additions & 41 deletions BUILDING.md
@@ -3,11 +3,17 @@
 The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
 is installed on the build machine.
 
-# Main Artifacts
+## Build with Maven
 
-The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a
-runnable version of Flowman for direct installation in cases where Docker is not available or when you
-want to run Flowman in a complex environment with Kerberos.
+Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
+
+    mvn clean install
+
+## Main Artifacts
+
+The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
+version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
+in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`.
 
 
 # Custom Builds
@@ -56,60 +62,69 @@ using the correct version. The following profiles are available:
 * CDH-5.15
 * CDH-6.3
 
 With these profiles it is easy to build Flowman to match your environment.
 
-## Building for Cloudera
+## Building for Open Source Hadoop and Spark
 
-The Maven project also contains preconfigured profiles for Cloudera.
+Spark 2.3 and Hadoop 2.6:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.6
+
+Spark 2.3 and Hadoop 2.7:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.7
 
-    mvn install -Pspark-2.3 -PCDH-5.15 -DskipTests
+Spark 2.3 and Hadoop 2.8:
+
+    mvn clean install -Pspark-2.3 -Phadoop-2.8
 
-## Skipping Docker Image
+Spark 2.3 and Hadoop 2.9:
 
-Part of the build also is a Docker image. Since you might not want to use it, because you are using different base
-images, you can skip the building of the Docker image via `-Ddockerfile.skip`
+    mvn clean install -Pspark-2.3 -Phadoop-2.9
 
-# Releasing
+Spark 2.4 and Hadoop 2.6:
 
-## Releasing
+    mvn clean install -Pspark-2.4 -Phadoop-2.6
 
-When making a release, the gitflow maven plugin should be used for managing versions
+Spark 2.4 and Hadoop 2.7:
 
-    mvn gitflow:release
+    mvn clean install -Pspark-2.4 -Phadoop-2.7
 
-## Deploying to Central Repository
+Spark 2.4 and Hadoop 2.8:
 
-Both snapshot and release versions can be deployed to Sonatype, which in turn is mirrored by the Maven Central
-Repository.
+    mvn clean install -Pspark-2.4 -Phadoop-2.8
 
-    mvn deploy -Dgpg.skip=false
+Spark 2.4 and Hadoop 2.9:
 
-The deployment has to be committed via
+    mvn clean install -Pspark-2.4 -Phadoop-2.9
 
-    mvn nexus-staging:close -DstagingRepositoryId=comdimajixflowman-1001
+Spark 3.0 and Hadoop 3.1:
+
+    mvn clean install -Pspark-3.0 -Phadoop-3.1
+
+Spark 3.0 and Hadoop 3.2:
+
+    mvn clean install -Pspark-3.0 -Phadoop-3.2
+
+## Building for Cloudera
+
+The Maven project also contains preconfigured profiles for Cloudera.
+
+    mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
 
-Or the staging data can be removed via
+Or for Cloudera 6.3:
 
-    mvn nexus-staging:drop
+    mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
 
-## Deploying to Custom Repository
+## Skipping Docker Image
 
-You can also deploy to a different repository by setting the following properties
-* `deployment.repository.id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
-* `deployment.repository.snapshot-id` - contains the ID of the repository. This should match any entry in your settings.xml for authentication
-* `deployment.repository.server` - the url of the server as used by the nexus-staging-maven-plugin
-* `deployment.repository.url` - the url of the default release repsotiory
-* `deployment.repository.snapshot-url` - the url of the snapshot repository
+Part of the build is also a Docker image. Since you might not want to use it, because you are using different base
+images, you can skip building the Docker image via `-Ddockerfile.skip`.
 
-Per default, Flowman uses the staging mechanism provided by the nexus-staging-maven-plugin. This this is not what you
-want, you can simply disable the Plugin via `skipTests`
+## Building Documentation
 
-With these settings you can deploy to a different (local) repository, for example
+Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.
 
-    mvn deploy \
-        -Pspark-2.3 \
-        -PCDH-5.15 \
-        -Ddeployment.repository.snapshot-url=https://nexus-snapshots.my-company.net/repository/snapshots \
-        -Ddeployment.repository.snapshot-id=nexus-snapshots \
-        -DskipStaging \
-        -DskipTests
+    cd docs
+    make html
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,17 @@
# Version 0.14.0

* Fix AWS plugin for Hadoop 3.x
* Improve setup of logging
* Shade Velocity for better interoperability with Spark 3
* Add new web hook facility in namespaces and jobs
* Existing targets will not be overwritten anymore by default. Either use the `--force` command line option, or set
the configuration property `flowman.execution.target.forceDirty` to `true` for the old behaviour.
* Add new command line option `--keep-going`
* Implement new `com.dimajix.spark.io.DeferredFileCommitProtocol` which can be used by setting the Spark configuration
  parameter `spark.sql.sources.commitProtocolClass` (see the example after this list)
* Add new `flowshell` application
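
The new commit protocol mentioned above can be enabled through any of the usual Spark configuration channels, for example in `spark-defaults.conf` (a sketch based on the bullet point, not taken from this commit):

    spark.sql.sources.commitProtocolClass  com.dimajix.spark.io.DeferredFileCommitProtocol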


# Version 0.13.1 - 2020-07-14

* Code improvements
44 changes: 44 additions & 0 deletions RELEASING.md
@@ -0,0 +1,44 @@
# Releasing

## Releasing

When making a release, the gitflow maven plugin should be used for managing versions

mvn gitflow:release

## Deploying to Central Repository

Both snapshot and release versions can be deployed to Sonatype, which in turn is mirrored by the Maven Central
Repository.

mvn deploy -Dgpg.skip=false
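
Signing requires a GPG key in your local keyring; a specific key can be selected with the standard maven-gpg-plugin property (the key ID below is a placeholder):

    mvn deploy -Dgpg.skip=false -Dgpg.keyname=<key-id>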

The deployment has to be committed via

mvn nexus-staging:close -DstagingRepositoryId=comdimajixflowman-1001

Or the staging data can be removed via

mvn nexus-staging:drop

## Deploying to Custom Repository

You can also deploy to a different repository by setting the following properties:
* `deployment.repository.id` - contains the ID of the release repository. This should match an entry in your settings.xml for authentication (see the sketch below)
* `deployment.repository.snapshot-id` - contains the ID of the snapshot repository. This should match an entry in your settings.xml for authentication
* `deployment.repository.server` - the URL of the server as used by the nexus-staging-maven-plugin
* `deployment.repository.url` - the URL of the default release repository
* `deployment.repository.snapshot-url` - the URL of the snapshot repository
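
The repository IDs above need matching credentials in your Maven `settings.xml`. A minimal sketch, assuming a snapshot repository with the ID `nexus-snapshots` (the ID and credentials are placeholders):

    <settings>
      <servers>
        <server>
          <id>nexus-snapshots</id>
          <username>deployment</username>
          <password><!-- placeholder --></password>
        </server>
      </servers>
    </settings>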

By default, Flowman uses the staging mechanism provided by the nexus-staging-maven-plugin. If this is not what you
want, you can simply disable the plugin via `skipStaging`.

With these settings you can deploy to a different (local) repository, for example

mvn deploy \
-Pspark-2.3 \
-PCDH-5.15 \
-Ddeployment.repository.snapshot-url=https://nexus-snapshots.my-company.net/repository/snapshots \
-Ddeployment.repository.snapshot-id=nexus-snapshots \
-DskipStaging \
-DskipTests
6 changes: 4 additions & 2 deletions docker/Dockerfile
@@ -1,6 +1,8 @@
 FROM ${docker.base-image.repository}:${docker.base-image.version}
 MAINTAINER [email protected]
 
+ARG DIST_FILE
+
 USER root
 
 ENV FLOMAN_HOME=/opt/flowman
@@ -12,9 +14,9 @@ COPY libexec/ /opt/docker/libexec/
 
 
 # Copy and install Repository
-COPY flowman-dist-${project.version}-bin.tar.gz /tmp/repo/
+COPY $DIST_FILE /tmp/repo/flowman-dist.tar.gz
 COPY conf/ /tmp/repo/conf
-RUN tar -C /opt --owner=root --group=root -xzf /tmp/repo/flowman-dist-${project.version}-bin.tar.gz && \
+RUN tar -C /opt --owner=root --group=root -xzf /tmp/repo/flowman-dist.tar.gz && \
     ln -s /opt/flowman* /opt/flowman && \
     cp -a /tmp/repo/conf/* /opt/flowman/conf && \
     chown -R root:root /opt/flowman* && \
10 changes: 7 additions & 3 deletions docker/pom.xml
@@ -10,11 +10,12 @@
     <parent>
         <groupId>com.dimajix.flowman</groupId>
         <artifactId>flowman-root</artifactId>
-        <version>0.13.1</version>
+        <version>0.14.0</version>
         <relativePath>..</relativePath>
     </parent>
 
     <properties>
+        <dist.tag>${project.version}-${hadoop.dist}-spark${spark-api.version}-hadoop${hadoop-api.version}</dist.tag>
         <docker.base-image.repository>dimajix/spark</docker.base-image.repository>
         <docker.base-image.version>${spark.version}</docker.base-image.version>
     </properties>
@@ -52,7 +53,7 @@
             <resource>
                 <directory>../flowman-dist/target</directory>
                 <includes>
-                    <include>flowman-dist-${project.version}-bin.tar.gz</include>
+                    <include>flowman-dist-${dist.tag}-bin.tar.gz</include>
                 </includes>
                 <filtering>false</filtering>
             </resource>
@@ -94,8 +95,11 @@
                     <repository>dimajix/flowman</repository>
                     <contextDirectory>target/build</contextDirectory>
                     <useMavenSettingsForAuth>true</useMavenSettingsForAuth>
-                    <tag>${project.version}</tag>
+                    <tag>${dist.tag}</tag>
                     <pullNewerImage>false</pullNewerImage>
+                    <buildArgs>
+                        <DIST_FILE>flowman-dist-${dist.tag}-bin.tar.gz</DIST_FILE>
+                    </buildArgs>
                 </configuration>
             </plugin>
         </plugins>
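
The new `buildArgs` section passes the distribution file name into the Dockerfile's `ARG DIST_FILE`. Outside of Maven, the equivalent manual invocation would look roughly as follows (the file name and tag are illustrative placeholders for the values derived from `${dist.tag}`):

    docker build \
        --build-arg DIST_FILE=flowman-dist-<dist.tag>-bin.tar.gz \
        -t dimajix/flowman:<dist.tag> \
        target/build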
134 changes: 134 additions & 0 deletions docs/building.md
@@ -0,0 +1,134 @@
# Building Flowman

Since Flowman depends on libraries like Spark and Hadoop, which are commonly provided by a platform environment like
Cloudera or EMR, you currently need to build Flowman yourself to match the correct versions. Prebuilt Flowman
distributions are planned, but not available yet.

The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine - building the Docker image can be disabled (see below).

## Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install

## Main Artifacts

The main artifacts will be a Docker image 'dimajix/flowman' and additionally a tar.gz file containing a runnable
version of Flowman for direct installation in cases where Docker is not available or when you want to run Flowman
in a complex environment with Kerberos. You can find the `tar.gz` file in the directory `flowman-dist/target`.
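
For example, a manual installation from the tarball could look like this (the exact file name depends on the selected build profiles, so the version below is a placeholder):

    tar -C /opt -xzf flowman-dist/target/flowman-dist-<version>-bin.tar.gz
    ln -s /opt/flowman-<version> /opt/flowman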


# Custom Builds

## Build on Windows

You can build Flowman on Windows, but you will need the Hadoop WinUtils installed. You can download
the binaries from https://github.com/steveloughran/winutils and install an appropriate version somewhere onto your
machine. Do not forget to set the HADOOP_HOME environment variable to the installation directory of these utils!
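
For example, on the Windows command line (the installation path below is only an assumption):

    setx HADOOP_HOME "C:\hadoop\winutils"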

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
some unit tests may fail and Docker images might not be usable. This can be done by setting the git configuration
value "core.autocrlf" to "input":

git config --global core.autocrlf input

You might also want to skip the unit tests (the HBase plugin is currently failing under Windows):

mvn clean install -DskipTests


## Build for Custom Spark / Hadoop Version

By default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
But of course you can also build for a different version, either by using a profile

    mvn install -Pspark-2.3 -Phadoop-2.7 -DskipTests

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

    mvn install -Dspark.version=2.3.1 -Dhadoop.version=2.7.3

Note that using profiles is the preferred way, as this guarantees that dependencies are also selected
in the correct version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* hadoop-2.6
* hadoop-2.7
* hadoop-2.8
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

## Building for Open Source Hadoop and Spark

Spark 2.3 and Hadoop 2.6:

mvn clean install -Pspark-2.3 -Phadoop-2.6

Spark 2.3 and Hadoop 2.7:

mvn clean install -Pspark-2.3 -Phadoop-2.7

Spark 2.3 and Hadoop 2.8:

mvn clean install -Pspark-2.3 -Phadoop-2.8

Spark 2.3 and Hadoop 2.9:

mvn clean install -Pspark-2.3 -Phadoop-2.9

Spark 2.4 and Hadoop 2.6:

mvn clean install -Pspark-2.4 -Phadoop-2.6

Spark 2.4 and Hadoop 2.7:

mvn clean install -Pspark-2.4 -Phadoop-2.7

Spark 2.4 and Hadoop 2.8:

mvn clean install -Pspark-2.4 -Phadoop-2.8

Spark 2.4 and Hadoop 2.9:

mvn clean install -Pspark-2.4 -Phadoop-2.9

Spark 3.0 and Hadoop 3.1:

mvn clean install -Pspark-3.0 -Phadoop-3.1

Spark 3.0 and Hadoop 3.2:

mvn clean install -Pspark-3.0 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests

Or for Cloudera 6.3:

mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests


## Skipping Docker Image

Part of the build is also a Docker image. Since you might not want to use it, because you are using different base
images, you can skip building the Docker image via `-Ddockerfile.skip`.
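
For example:

    mvn clean install -Ddockerfile.skip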

## Building Documentation

Flowman also contains Markdown documentation which is processed by Sphinx to generate the online HTML documentation.

cd docs
make html
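
Note that this step requires a local Sphinx installation with Markdown support; a sketch of the setup, assuming the `recommonmark` extension is used (this package choice is an assumption, not confirmed by this commit):

    pip install sphinx recommonmark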