Merge branch 'develop'
kupferk committed Jun 2, 2021
2 parents 70a0dba + 7fa979c commit 1bac111
Showing 278 changed files with 20,897 additions and 1,146 deletions.
21 changes: 0 additions & 21 deletions .gitlab-ci.yml
@@ -60,16 +60,6 @@ build-default:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

# List additional build variants (some of them will be built on pushes)
build-hadoop2.6-spark2.3:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip'
  artifacts:
    name: "flowman-dist-hadoop2.6-spark2.3"
    paths:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-hadoop2.6-spark2.4:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -Phadoop-2.6 -Pspark-2.4 -Ddockerfile.skip'
@@ -133,17 +123,6 @@ build-hadoop3.2-spark3.1:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-cdh5.15:
  stage: build
  except:
    - pushes
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-5.15 -Ddockerfile.skip'
  artifacts:
    name: "flowman-dist-cdh5.15"
    paths:
      - flowman-dist/target/flowman-dist-*-bin.tar.gz
    expire_in: 5 days

build-cdh6.3:
  stage: build
  script: 'mvn ${MAVEN_CLI_OPTS} clean package -PCDH-6.3 -Ddockerfile.skip'
12 changes: 0 additions & 12 deletions .travis.yml
@@ -19,14 +19,6 @@ jobs:
      jdk: openjdk8
      script: mvn clean install

    - name: Hadoop 2.6 with Spark 2.3
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.6 -Pspark-2.3 -Ddockerfile.skip

    - name: Hadoop 2.7 with Spark 2.3
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.7 -Pspark-2.3 -Ddockerfile.skip

    - name: Hadoop 2.6 with Spark 2.4
      jdk: openjdk8
      script: mvn clean install -Phadoop-2.6 -Pspark-2.4
@@ -51,10 +43,6 @@ jobs:
      jdk: openjdk8
      script: mvn clean install -Phadoop-3.2 -Pspark-3.1

    - name: CDH 5.15
      jdk: openjdk8
      script: mvn clean install -PCDH-5.15 -Ddockerfile.skip

    - name: CDH 6.3
      jdk: openjdk8
      script: mvn clean install -PCDH-6.3 -Ddockerfile.skip
61 changes: 21 additions & 40 deletions BUILDING.md
@@ -3,7 +3,18 @@
The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine.
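
The CI snippets elsewhere in this commit suggest the Docker image can be skipped when Docker is unavailable; a minimal sketch (the `-Ddockerfile.skip` flag appears throughout this page, the exact combination below is an assumption):

```shell
# Build everything except the Docker image (assumed to work per the CI scripts above)
mvn clean install -Ddockerfile.skip
```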

## Build with Maven
## Prerequisites

You need the following tools installed on your machine:
* JDK 1.8 or later. If you build a variant with Scala 2.11, you have to use JDK 1.8 (and not anything newer like
  Java 11). This mainly affects builds with Spark 2.x.
* Apache Maven (install via package manager or download from https://maven.apache.org/download.cgi)
* npm (install via package manager or download from https://www.npmjs.com/get-npm)
* Windows users also need Hadoop winutils installed. Those can be retrieved from https://github.com/cdarlint/winutils.
  See some additional details for building on Windows below.
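
A quick sanity check of the toolchain could look like this (a minimal sketch; the version expectation comes from the list above):

```shell
java -version   # expect 1.8 when building Spark 2.x / Scala 2.11 variants
mvn --version
npm --version
```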


## Build with Maven

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as
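
The command itself sits in the collapsed lines of this hunk; judging from the CI configuration elsewhere in this commit, it is presumably just:

```shell
mvn clean install
```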

@@ -22,9 +33,11 @@ in a complex environment with Kerberos. You can find the `tar.gz` file in the di

## Build on Windows

Although you can normally build Flowman on Windows, you will need the Hadoop WinUtils installed. You can download
the binaries from https://github.com/steveloughran/winutils and install an appropriate version somewhere onto your
machine. Do not forget to set the HADOOP_HOME environment variable to the installation directory of these utils!
Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
is still supported to some extent, but requires some extra care. You will need the Hadoop WinUtils installed. You can
download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto
your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these
utils!
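
For example, in a Windows command prompt this could look like the following sketch (the install path `C:\hadoop-winutils` is a placeholder, not something this page prescribes):

```shell
set HADOOP_HOME=C:\hadoop-winutils
set PATH=%PATH%;%HADOOP_HOME%\bin
```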

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
some unittests may fail and Docker images might not be usable. This can be done by setting the git configuration
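
The concrete value is hidden in the collapsed lines of this hunk; the standard git option for this behaviour is `core.autocrlf`, for example:

```shell
git config --global core.autocrlf input
```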
@@ -46,24 +59,23 @@ the `master` branch really builds clean with all unittests passing on Linux.

## Build for Custom Spark / Hadoop Version

Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
But of course you can also build for a different version by either using a profile

```shell
mvn install -Pspark2.3 -Phadoop2.7 -DskipTests
mvn install -Pspark-2.4 -Phadoop-2.7 -DskipTests
```

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

```shell
mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
mvn install -Dspark.version=2.4.3 -Dhadoop.version=2.7.3
```
Note that using profiles is the preferred way, as this guarantees that all dependencies are also selected
in the matching version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* spark-3.1
@@ -73,37 +85,12 @@ using the correct version. The following profiles are available:
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

## Building for Open Source Hadoop and Spark

### Spark 2.3 and Hadoop 2.6:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.6
```

### Spark 2.3 and Hadoop 2.7:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.7
```

### Spark 2.3 and Hadoop 2.8:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.8
```

### Spark 2.3 and Hadoop 2.9:

```shell
mvn clean install -Pspark-2.3 -Phadoop-2.9
```

### Spark 2.4 and Hadoop 2.6:

```shell
@@ -148,13 +135,7 @@ mvn clean install -Pspark-3.1 -Phadoop-3.2

## Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

```shell
mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests
```

Or for Cloudera 6.3
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.

```shell
mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,16 @@
# Version 0.17.0 - 2021-06-02

* New Flowman Kernel and Flowman Studio application prototypes
* New ParallelExecutor
* Fix before/after dependencies in `count` target
* Default build is now Spark 3.1 + Hadoop 3.2
* Remove build profiles for Spark 2.3 and CDH 5.15
* Add MS SQL Server plugin containing JDBC driver
* Speed up file listing for `file` relations
* Use Spark JobGroups
* Better support for running Flowman on Windows with appropriate batch scripts


# Version 0.16.0 - 2021-04-23

* Add logo to Flowman Shell
6 changes: 6 additions & 0 deletions NOTICE
@@ -66,6 +66,12 @@ MariaDB Java Client
* HOMEPAGE:
* https://mariadb.com

MSSQL JDBC Client
* LICENSE
* license/LICENSE-mssql-jdbc.txt
* HOMEPAGE:
* https://github.com/Microsoft/mssql-jdbc

Apache Derby
* LICENSE
* license/LICENSE-derby.txt (Apache 2.0 License)
9 changes: 2 additions & 7 deletions build-release.sh
@@ -15,15 +15,10 @@ build_profile() {

build_profile hadoop-2.6 spark-2.3
build_profile hadoop-2.6 spark-2.4
build_profile hadoop-2.7 spark-2.3
build_profile hadoop-2.7 spark-2.4
build_profile hadoop-2.8 spark-2.3
build_profile hadoop-2.8 spark-2.4
build_profile hadoop-2.9 spark-2.3
build_profile hadoop-2.9 spark-2.4
build_profile hadoop-2.9 spark-3.0
build_profile hadoop-3.1 spark-3.0
build_profile hadoop-2.7 spark-3.0
build_profile hadoop-3.2 spark-3.0
build_profile hadoop-2.7 spark-3.1
build_profile hadoop-3.2 spark-3.1
build_profile CDH-5.15
build_profile CDH-6.3
4 changes: 2 additions & 2 deletions docker/pom.xml
@@ -10,8 +10,8 @@
    <parent>
        <groupId>com.dimajix.flowman</groupId>
        <artifactId>flowman-root</artifactId>
        <version>0.16.0</version>
        <relativePath>..</relativePath>
        <version>0.17.0</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <properties>
30 changes: 4 additions & 26 deletions docs/building.md
@@ -60,20 +60,19 @@ You might also want to skip unittests (the HBase plugin is currently failing und

### Build for Custom Spark / Hadoop Version

Per default, Flowman will be built for fairly recent versions of Spark (2.4.5 as of this writing) and Hadoop (2.8.5).
Per default, Flowman will be built for fairly recent versions of Spark (3.0.2 as of this writing) and Hadoop (3.2.0).
But of course you can also build for a different version by either using a profile

    mvn install -Pspark2.2 -Phadoop2.7 -DskipTests
    mvn install -Pspark-2.4 -Phadoop-2.7 -DskipTests

This will always select the latest bugfix version within the minor version. You can also specify versions explicitly
as follows:

    mvn install -Dspark.version=2.2.1 -Dhadoop.version=2.7.3
    mvn install -Dspark.version=2.4.1 -Dhadoop.version=2.7.3

Note that using profiles is the preferred way, as this guarantees that all dependencies are also selected
in the matching version. The following profiles are available:

* spark-2.3
* spark-2.4
* spark-3.0
* spark-3.1
@@ -83,29 +82,12 @@ using the correct version. The following profiles are available:
* hadoop-2.9
* hadoop-3.1
* hadoop-3.2
* CDH-5.15
* CDH-6.3

With these profiles it is easy to build Flowman to match your environment.

### Building for Open Source Hadoop and Spark

Spark 2.3 and Hadoop 2.6:

    mvn clean install -Pspark-2.3 -Phadoop-2.6

Spark 2.3 and Hadoop 2.7:

    mvn clean install -Pspark-2.3 -Phadoop-2.7

Spark 2.3 and Hadoop 2.8:

    mvn clean install -Pspark-2.3 -Phadoop-2.8

Spark 2.3 and Hadoop 2.9:

    mvn clean install -Pspark-2.3 -Phadoop-2.9

Spark 2.4 and Hadoop 2.6:

    mvn clean install -Pspark-2.4 -Phadoop-2.6
@@ -137,11 +119,7 @@ Spark 3.1 and Hadoop 3.2

### Building for Cloudera

The Maven project also contains preconfigured profiles for Cloudera.

    mvn clean install -Pspark-2.3 -PCDH-5.15 -DskipTests

Or for Cloudera 6.3
The Maven project also contains preconfigured profiles for Cloudera CDH 6.3.

    mvn clean install -Pspark-2.4 -PCDH-6.3 -DskipTests

6 changes: 5 additions & 1 deletion docs/config.md
@@ -31,7 +31,11 @@ the existence of targets to decide if a rebuild is required.

- `flowman.execution.executor.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleExecutor`)*
Configure the executor to use. The default `SimpleExecutor` will process all targets in the correct order
sequentially.
sequentially. The alternative implementation `com.dimajix.flowman.execution.ParallelExecutor` will run multiple
targets in parallel (if they do not depend on each other); see the configuration sketch below.

- `flowman.execution.executor.parallelism` *(type: int)* *(default: 4)*
The number of targets to be executed in parallel, when the `ParallelExecutor` is used.

- `flowman.execution.scheduler.class` *(type: class)* *(default: `com.dimajix.flowman.execution.SimpleScheduler`)*
Configure the scheduler to use. The default `SimpleScheduler` will sort all targets according to their dependency.
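
As a sketch, enabling the parallel executor in a namespace configuration could look like this (the two keys are documented above; the surrounding `config:` file layout is an assumption):

```yaml
# Hypothetical namespace configuration enabling the ParallelExecutor;
# only the two property names are given by this page.
config:
  - flowman.execution.executor.class=com.dimajix.flowman.execution.ParallelExecutor
  - flowman.execution.executor.parallelism=8
```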
22 changes: 20 additions & 2 deletions docs/spec/mapping/mock.md
@@ -15,14 +15,32 @@
```yaml
mappings:
  empty_mapping:
  some_other_mapping:
    kind: mock
    mapping: some_mapping
    records:
      - [1,2,"some_string",""]
      - [2,null,"cat","black"]
```
```yaml
mappings:
  some_mapping:
    kind: mock
    mapping: some_mapping
    records:
      - Campaign ID: DIR_36919
        LineItemID ID: DIR_260390
        SiteID ID: 23374
        CreativeID ID: 292668
        PlacementID ID: 108460
      - Campaign ID: DIR_36919
        LineItemID ID: DIR_260390
        SiteID ID: 23374
        CreativeID ID: 292668
        PlacementID ID: 108460
```
## Fields
* `kind` **(mandatory)** *(type: string)*: `mock`

@@ -39,7 +57,7 @@
* `MEMORY_AND_DISK_SER`

* `mapping` **(optional)** *(type: string)*:
Specifies the name of the mapping to be mocked. If no name is given, the a mapping with the same name will be
Specifies the name of the mapping to be mocked. If no name is given, then a mapping with the same name will be
mocked. Note that this will only work when used as an override mapping in test cases, otherwise an infinite loop
would be created by the mapping referencing itself.

21 changes: 17 additions & 4 deletions docs/spec/mapping/values.md
@@ -18,8 +18,8 @@ mappings:
- name: str_col
type: string
records:
- [1,"some_string"]
- [2,"cat"]
- [1,"some_string"]
- [2,"cat"]
```
```yaml
@@ -30,8 +30,21 @@
int_col: integer
str_col: string
records:
- [1,"some_string"]
- [2,"cat"]
- [1,"some_string"]
- [2,"cat"]
```
```yaml
mappings:
  fake_input:
    kind: values
    columns:
      int_col: integer
      str_col: string
    records:
      - int_col: 1
        str_col: "some_string"
      - str_col: "cat"
```