Merge branch 'develop'
kupferk committed Mar 1, 2022
2 parents ba5a982 + be00e69 commit 80a9ec4
Showing 442 changed files with 12,909 additions and 2,520 deletions.
2 changes: 1 addition & 1 deletion BUILDING.md
@@ -55,7 +55,7 @@ appropriate build profiles, you can easily create a custom build.
Although you can normally build Flowman on Windows, it is recommended to use Linux instead. Nevertheless, Windows
is still supported to some extent, but requires some extra care. You will need the Hadoop WinUtils installed. You can
download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto
your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these
your machine. Do not forget to set the `HADOOP_HOME` or `PATH` environment variable to the installation directory of these
utils!

You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise
14 changes: 14 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,17 @@
# Version 0.22.0 - 2022-03-01

* Add new `sqlserver` relation
* Implement new documentation subsystem
* Change default build to Spark 3.2.1 and Hadoop 3.3.1
* Add new `drop` target for removing tables (sketched below)
* Speed up project loading by reusing Jackson mapper
* Implement new `jdbc` metric sink
* Implement schema cache in Executor to speed up documentation and similar tasks
* Add new config variables `flowman.execution.mapping.schemaCache` and `flowman.execution.relation.schemaCache`
* Add new config variable `flowman.default.target.verifyPolicy` to ignore empty tables during VERIFY phase
* Implement initial support for indexes in JDBC relations
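
Two of the additions above, the `drop` target and the new schema cache variables, can be sketched as hedged YAML. Field names and example values are assumptions based on Flowman's usual YAML conventions, not taken verbatim from this release:

```yaml
# Hypothetical snippet; target and relation names are illustrative
targets:
  drop_obsolete_table:
    kind: drop                      # new target kind for removing tables
    relation: some_obsolete_table   # assumed field, by analogy with other relation-based targets

config:
  # Assumed boolean values for the new schema cache switches
  - flowman.execution.mapping.schemaCache=true
  - flowman.execution.relation.schemaCache=true
```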


# Version 0.21.2 - 2022-02-14

* Fix importing projects
110 changes: 110 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,110 @@
# Contributing to Flowman

You want to contribute to Flowman? Welcome! Please read this document to understand what you can do:
* [Report an Issue](#report-an-issue)
* [Contribute Documentation](#contribute-documentation)
* [Contribute Code](#contribute-code)


## Report an Issue

If you find a bug - behavior of Flowman code contradicting your expectation - you are welcome to report it.
We can only handle well-reported, actual bugs, so please follow the guidelines below.

Once you have familiarized yourself with the guidelines, you can go to the [GitHub issue tracker for Flowman](https://github.com/dimajix/flowman/issues/new) to report the issue.

### Quick Checklist for Bug Reports

Issue report checklist:
* Real, current bug
* No duplicate
* Reproducible
* Good summary
* Well-documented
* Minimal example

### Issue handling process

When an issue is reported, a committer will look at it and either confirm it as a real issue, close it if it is not an issue, or ask for more details.

An issue that is about a real bug is closed as soon as the fix is committed.

### Usage of Labels

GitHub offers labels to categorize issues. We suggest the following labels:

Labels for issue categories:
* bug: this issue is a bug in the code
* feature: this issue is a request for a new functionality or an enhancement request
* environment: this issue relates to supporting a specific runtime environment (Cloudera, a specific Spark/Hadoop version, etc.)

Status of open issues:
* help wanted: the feature request is approved and you are invited to contribute

Status/resolution of closed issues:
* wontfix: while acknowledged to be an issue, a fix cannot or will not be provided

### Issue Reporting Disclaimer

We want to improve the quality of Flowman and good bug reports are welcome! But our capacity is limited, thus we reserve the right to close or to not process insufficient bug reports in favor of those which are very cleanly documented and easy to reproduce. Even though we would like to solve each well-documented issue, there is always the chance that it will not happen - remember: Flowman is Open Source and comes without warranty.

Bug report analysis support is very welcome! (e.g. pre-analysis or proposing solutions)



## Contribute Documentation

Flowman has many features, but unfortunately not all of them are well documented, so this is an area where we highly welcome contributions from users. The documentation is contained in the "doc" subdirectory within the source code repository. This implies that when you want to contribute documentation, you have to follow the same procedure as for contributing code.



## Contribute Code

You are welcome to contribute code to Flowman in order to fix bugs or to implement new features.

There are three important things to know:

1. You must be aware of the Apache License (which describes contributions) and **agree to the Contributors License Agreement**. This is common practice in all major Open Source projects.
For company contributors special rules apply. See the respective section below for details.
2. Please ensure your contribution adopts Flowman's **code style, quality, and product standards**. The respective section below gives more details on the coding guidelines.
3. **Not all proposed contributions can be accepted**. Some features may, for example, fit better into a third-party plugin. The code must fit the overall direction of Flowman and really improve it. The more effort you invest, the more you should clarify in advance whether the contribution fits: the best way is to open an issue to discuss the feature you plan to implement (making it clear that you intend to contribute).

### Contributor License Agreement

When you contribute (code, documentation, or anything else), you have to be aware that your contribution is covered by the same [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) that is applied to Flowman itself.

In particular, you need to agree to the [Flowman Contributors License Agreement](https://cla-assistant.io/dimajix/flowman), stating that you have the right to contribute and agree to put your contribution under the license of this project.
CLA assistant will ask you to confirm that.

This applies to all contributors, including those contributing on behalf of a company.
If you agree to its content, simply follow the link posted by the CLA assistant as a comment on the pull request, review the CLA, and accept it on the following screen.
CLA assistant will save this decision for upcoming contributions and will notify you if there is any change to the CLA in the meantime.

### Contribution Content Guidelines

These are some rules we try to follow:

- Apply a clean coding style adapted to the surrounding code, even though we are aware the existing code is not fully clean
- Use 4 spaces for indentation
- Follow the variable naming conventions used in the surrounding files (camelCase)
- No `println` - use SLF4J logging instead (see the sketch after this list)
- Comment your code where it gets non-trivial
- Write a unit test
- Do not make any incompatible changes; in particular, do not change or remove existing properties from YAML specs
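
A minimal sketch of the logging rule in Scala (the class name and message are hypothetical; only the SLF4J usage pattern matters):

```scala
import org.slf4j.LoggerFactory

class MyCustomTarget {
    // Use an SLF4J logger instead of println for any diagnostic output
    private val logger = LoggerFactory.getLogger(classOf[MyCustomTarget])

    def execute(): Unit = {
        logger.info("Executing custom target")
    }
}
```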

### How to contribute - the Process

1. Make sure the change would be welcome (e.g. a bugfix or a useful feature); best do so by proposing it in a GitHub issue
2. Fork the Flowman repository, create a branch, and make your change
3. Commit and push your changes on that branch
4. If your change fixes an issue reported at GitHub, add the following line to the commit message:
- ```Fixes #(issueNumber)```
5. Create a pull request with the following information:
   - Describe the problem you fix with this change.
   - Describe the effect that this change has from a user's point of view. App crashes and lockups are convincing examples, but not all bugs are that obvious, so less visible effects should be mentioned in the text as well.
   - Describe the technical details of what you changed. It is important to describe the change in an understandable way so the reviewer is able to verify that the code behaves as you intend it to.
6. Follow the link posted by the CLA assistant to your pull request and accept it, as described in detail above.
7. Wait for our code review and approval, possibly enhancing your change on request
- Note that the Flowman developers also have their regular duties, so depending on the required effort for reviewing, testing and clarification this may take a while
8. Once the change has been approved, we will inform you in a comment
9. We will close the pull request; feel free to delete the now obsolete branch
23 changes: 10 additions & 13 deletions QUICKSTART.md
@@ -16,7 +16,7 @@ Fortunately, Apache Spark is rather simple to install locally on your machine:

### Download & Install Spark

As of this writing, the latest release of Flowman is 0.20.0 and is available prebuilt for Spark 3.1.2 on the Spark
As of this writing, the latest release of Flowman is 0.22.0 and is available prebuilt for Spark 3.2.1 on the Spark
homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it.

```shell
@@ -25,8 +25,8 @@
mkdir playground
cd playground

# Download and unpack Spark & Hadoop
curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.1.2-bin-hadoop3.2 spark
curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -
# Create a nice link
ln -snf spark-3.2.1-bin-hadoop3.2 spark
```
The Spark package already contains Hadoop, so with this single download you have both installed and integrated with each other.

@@ -35,19 +35,20 @@
If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of
DLLs required for the Hadoop libraries to work. You can get a copy at https://github.com/kontext-tech/winutils.
Once you have downloaded the appropriate version, you need to place the DLLs into a directory `$HADOOP_HOME/bin`, where
`HADOOP_HOME` refers to some location on your Windows PC. You also need to set the following environment variables:
`HADOOP_HOME` refers to some arbitrary location of your choice on your Windows PC. You also need to set the following
environment variables:
* `HADOOP_HOME` should point to the parent directory of the `bin` directory
* `PATH` should also contain `$HADOOP_HOME/bin`
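
For example, from a classic Command Prompt the variables could be set like this (a sketch only, assuming the WinUtils binaries were unpacked to the arbitrary location `C:\hadoop\bin`):

```shell
:: Example only - assumes the WinUtils DLLs were unpacked to C:\hadoop\bin
setx HADOOP_HOME "C:\hadoop"
setx PATH "%PATH%;C:\hadoop\bin"
```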


## 1.2 Install Flowman

You can find prebuilt Flowman packages on the corresponding release page on GitHub. For this quickstart, we chose
`flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz` which nicely fits to the Spark package we just downloaded before.
`flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz` which nicely fits to the Spark package we just downloaded before.

```shell
# Download and unpack Flowman
curl -L https://github.com/dimajix/flowman/releases/download/0.20.0/flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz | tar xvzf -
curl -L https://github.com/dimajix/flowman/releases/download/0.22.0/flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz | tar xvzf -

# Create a nice link
ln -snf flowman-0.20.0 flowman
@@ -81,13 +82,9 @@ That’s all we need to run the Flowman example.

# 2. Flowman Shell

The example data is stored in a S3 bucket provided by myself. In order to access the data, you need to provide valid
AWS credentials in your environment:

```shell
$ export AWS_ACCESS_KEY_ID=<your aws access key>
$ export AWS_SECRET_ACCESS_KEY=<your aws secret key>
```
The example data is stored in an S3 bucket provided by myself. Since the data is publicly available and the project is
configured to use anonymous AWS authentication, you do not need to provide any AWS credentials (you do not even
need to have an account on AWS).
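
For reference, anonymous S3 access is typically enabled through the Hadoop S3A credentials provider. A hedged sketch of what such a configuration looks like (the property and class come from Hadoop's S3A connector; whether the example project sets it exactly like this is an assumption):

```yaml
config:
  # Anonymous (unauthenticated) access to the public S3 bucket
  - spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider
```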

## 2.1 Start interactive Flowman shell

50 changes: 30 additions & 20 deletions README.md
@@ -21,11 +21,11 @@ keep all aspects (like transformations and schema information) in a single place
* Semantics of a build tool like Maven - just for data instead of applications
* Declarative syntax in YAML files
* Data model management (Create, Migrate and Destroy Hive tables, JDBC tables or file based storage)
* Generation of meaningful documentation
* Flexible expression language
* Jobs for managing build targets (like copying files or uploading data via sftp)
* Automatic data dependency management within the execution of individual jobs
* Rich set of execution metrics
* Meaningful logging output
* Meaningful logging output & rich set of execution metrics
* Powerful yet simple command line tools
* Extendable via Plugins

@@ -38,28 +38,21 @@ You can find the official homepage at [Flowman.io](https://flowman.io)

# Installation

You can either grab an appropriate pre-build package at https://github.com/dimajix/flowman/releases or you
can build your own version via Maven with

mvn clean install

Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.

You can grab an appropriate pre-built package at [GitHub](https://github.com/dimajix/flowman/releases)

## Installing the Packed Distribution

The packed distribution file is called `flowman-{version}-bin.tar.gz` and can be extracted at any
location using

tar xvzf flowman-{version}-bin.tar.gz

```shell
tar xvzf flowman-{version}-bin.tar.gz
```

## Apache Spark

Flowman does not bring its own Spark libraries, but relies on a correctly installed Spark distribution. You can
download appropriate packages directly from [the Spark Homepage](https://spark.apache.org).


## Hadoop Utils for Windows

If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of
@@ -70,7 +63,6 @@ Once you downloaded the appropriate version, you need to place the DLLs into a d
* `PATH` should also contain `$HADOOP_HOME/bin`



# Command Line Utils

The primary tool provided by Flowman is called `flowexec` and is located in the `bin` folder of the
@@ -80,19 +72,37 @@ installation directory.

The `flowexec` tool has several subcommands for working with objects and projects. The general pattern
looks as follows

flowexec [generic options] <cmd> <subcommand> [specific options and arguments]
```shell
flowexec [generic options] <cmd> <subcommand> [specific options and arguments]
```

For working with `flowexec`, either your current working directory needs to contain a Flowman
project with a file `project.yml` or you need to specify the path to a valid project via

flowexec -f /path/to/project/folder <cmd>
```shell
flowexec -f /path/to/project/folder <cmd>
```
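
For instance, running the build lifecycle of a job could look like this (the `job build` subcommand and the job name `main` follow the general pattern above, but treat the exact invocation as an illustrative assumption):

```shell
flowexec -f /path/to/project/folder job build main
```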

## Interactive Shell

With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be
started via

flowshell -f <project>
```shell
flowshell -f <project>
```

Within the shell, you can interactively build targets and inspect intermediate mappings.
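
A session might look as follows; the commands in the comments are assumptions intended to convey the flavor of the shell, not verified transcripts:

```shell
flowshell -f /path/to/project/folder
# Hypothetical commands inside the shell:
#   mapping list        - list the mappings of the project
#   mapping show <name> - inspect the records produced by a mapping
#   job build main      - run the build lifecycle of a job
#   exit                - leave the shell
```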


# Building

You can build your own version via Maven with
```shell
mvn clean install
```
Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles.
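
If you need a package for a different Spark/Hadoop combination, the build profiles described in BUILDING.md can be selected on the command line. A hedged example (the `spark-3.2` profile id appears in the docker pom.xml diff further down; `hadoop-3.3` is assumed to exist analogously):

```shell
mvn clean install -Pspark-3.2 -Phadoop-3.3
```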


# Contributing

You want to contribute to Flowman? Welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) to understand what you can
do.
24 changes: 23 additions & 1 deletion docker/conf/default-namespace.yml
@@ -13,6 +13,28 @@ connections:
    username: $System.getenv('FLOWMAN_LOGDB_USER', '')
    password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '')

# This adds a hook for creating an execution log in a file
hooks:
  kind: report
  location: ${project.basedir}/generated-report.txt
  metrics:
    # Define common labels for all metrics
    labels:
      project: ${project.name}
    metrics:
      # Collect everything
      - selector:
          name: .*
        labels:
          category: ${category}
          kind: ${kind}
          name: ${name}

# This configures where metrics should be written to. Since we cannot assume a working Prometheus push gateway, we
# simply print them onto the console
metrics:
  - kind: console

config:
- spark.sql.warehouse.dir=/opt/flowman/hive/warehouse
- spark.hadoop.hive.metastore.uris=
@@ -21,7 +43,7 @@

store:
  kind: file
  location: /opt/flowman/examples
  location: $System.getenv('FLOWMAN_HOME')/examples

plugins:
- flowman-aws
20 changes: 20 additions & 0 deletions docker/conf/history-server.yml
@@ -0,0 +1,20 @@
# The following definition provides a "run history" stored in a database. If nothing else is specified, the database
# is stored locally as a Derby database. If you do not want to use the history, you can simply remove the whole
# 'history' block from this file.
history:
  kind: jdbc
  connection: flowman_state
  retries: 3
  timeout: 1000

connections:
  flowman_state:
    driver: $System.getenv('FLOWMAN_LOGDB_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver')
    url: $System.getenv('FLOWMAN_LOGDB_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true'))
    username: $System.getenv('FLOWMAN_LOGDB_USER', '')
    password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '')

plugins:
- flowman-mariadb
- flowman-mysql
- flowman-mssqlserver
18 changes: 16 additions & 2 deletions docker/pom.xml
@@ -10,10 +10,14 @@
    <parent>
        <groupId>com.dimajix.flowman</groupId>
        <artifactId>flowman-root</artifactId>
        <version>0.21.2</version>
        <version>0.22.0</version>
        <relativePath>../pom.xml</relativePath>
    </parent>

    <properties>
        <spark-hadoop-archive.version>${hadoop-api.version}</spark-hadoop-archive.version>
    </properties>

    <profiles>
        <profile>
            <id>CDH-6.3</id>
@@ -27,6 +31,16 @@
                <dockerfile.skip>true</dockerfile.skip>
            </properties>
        </profile>
        <profile>
            <id>spark-3.2</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <properties>
                <!-- The Spark 3.2 archives continue to have a wrong file name -->
                <spark-hadoop-archive.version>3.2</spark-hadoop-archive.version>
            </properties>
        </profile>
    </profiles>

    <build>
@@ -93,7 +107,7 @@
                    <pullNewerImage>false</pullNewerImage>
                    <buildArgs>
                        <BUILD_SPARK_VERSION>${spark.version}</BUILD_SPARK_VERSION>
                        <BUILD_HADOOP_VERSION>${hadoop-api.version}</BUILD_HADOOP_VERSION>
                        <BUILD_HADOOP_VERSION>${spark-hadoop-archive.version}</BUILD_HADOOP_VERSION>
                        <DIST_FILE>flowman-dist-${flowman.dist.label}-bin.tar.gz</DIST_FILE>
                        <http_proxy>${env.http_proxy}</http_proxy>
                        <https_proxy>${env.https_proxy}</https_proxy>
