Merge branch 'develop'
kupferk committed Jan 26, 2022
2 parents c139042 + a0e9203 commit 3033035
Showing 221 changed files with 4,943 additions and 1,390 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,12 @@
# Version 0.20.1
# Version 0.21.0 - 2022-01-26

* Fix wrong dependencies in Swagger plugin
* Implement basic schema inference for local CSV files
* Implement new `stack` mapping
* Improve error messages of local CSV parser


# Version 0.20.1 - 2022-01-06

* Implement detection of dependencies introduced by schema

2 changes: 1 addition & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.20.1</version>
<version>0.21.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

18 changes: 10 additions & 8 deletions docs/building.md
@@ -5,29 +5,31 @@ Cloudera or EMR, you currently need to build Flowman yourself to match the corr

## Download Prebuilt Distribution

As an alternative to building Flowman yourself, prebuilt Flowman distributions are are provided on
The simplest way to get started with Flowman is to download a prebuilt distribution, which is provided at
[GitHub](https://github.com/dimajix/flowman/releases). This probably is the simplest way to grab a working Flowman
package. Note that for each release, there are different packages being provided, for different Spark and Hadoop
versions. The naming is very simple:
```
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 and for Spark 3.0 and Hadoop 3.2 would be
package of Flowman 0.20.1 and for Spark 3.1 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
flowman-dist-0.20.1-oss-spark31-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
https://github.com/dimajix/flowman/releases/download/0.20.1/flowman-dist-0.20.1-oss-spark3.1-hadoop3.2-bin.tar.gz
```

The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine - building the Docker image can be disabled (see below).

## Build with Maven

When you decide against downloading a prebuilt Flowman distribution, you can simply built it yourself with Maven.
When you decide against downloading a prebuilt Flowman distribution, you can simply build it yourself with Maven. As
a prerequisite, you need
* Java (1.8 for Spark <= 2.4 and 11 for Spark >= 3.0)
* Apache Maven
* On Windows: Hadoop libraries

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install
221 changes: 221 additions & 0 deletions docs/concepts.md
@@ -0,0 +1,221 @@
# Core Concepts

Flowman is a *data build tool* which uses a declarative syntax to specify what needs to be built. The main difference
from classical build tools like `make` or `maven` is that Flowman builds *data* instead of *applications* or *libraries*.
Flowman borrows many features from classical build tools, like support for [build phases](lifecycle.md), automatic
dependency detection, clean console output and more.

But how can we instruct Flowman to build data? The input and output data are specified in declarative YAML files,
together with all transformations applied along the way from reading to writing data. At the core of these YAML
files are the following entity types:

## Relations

[*Relations*](spec/relation/index.md) specify *physical manifestations of data* in external systems. A relation may
refer to any data source (or sink), like a table or view in a MySQL database, a table in Hive, files on a
distributed filesystem like HDFS, or files stored in an object store like S3.

A relation can serve as a data source, a data sink, or both (this is when automatic dependency management
comes into play, which is required to determine the correct build order). Each relation typically has some important
properties, like its *schema* (i.e. the columns, including name and type) and its location (be it a directory in a shared
file system or a URL to connect to). Of course, the available properties depend on the specific *kind* of relation.

### Examples
For example, a table in Hive can be specified as follows:
```yaml
relations:
  parquet_relation:
    kind: hiveTable
    database: default
    table: financial_transactions
    # Specify the physical location where the data files should be stored at. If you leave this out, the Hive
    # default location will be used
    location: /warehouse/default/financial_transactions
    # Specify the file format to use
    format: parquet
    # Add partition column
    partitions:
      - name: business_date
        type: string
    # Specify a schema, which is mandatory for write operations
    schema:
      kind: inline
      fields:
        - name: id
          type: string
        - name: amount
          type: double
```
And a table in a MySQL database can be specified as:
```yaml
relations:
  frontend_users:
    kind: jdbc
    # Specify the name of the connection to use
    connection:
      kind: jdbc
      driver: "com.mysql.cj.jdbc.Driver"
      url: "jdbc:mysql://mysql-crm.acme.com/crm_main"
      username: "flowman"
      password: "super_secret"
    # Specify the table
    table: "users"
```
Or you can easily access files in S3 via:
```yaml
relations:
  csv_export:
    kind: file
    # Specify the file format to use
    format: "csv"
    # Specify the base directory where all data is stored. This location does not include the partition pattern
    location: "s3://acme.com/export/weather/csv"
    # Set format specific options
    options:
      delimiter: ","
      quote: "\""
      escape: "\\"
      header: "true"
      compression: "gzip"
    # Specify an optional schema here. It is always recommended to explicitly specify a schema for every relation
    # and not just let data flow from a mapping into a target.
    schema:
      kind: embedded
      fields:
        - name: country
          type: STRING
        - name: min_wind_speed
          type: FLOAT
        - name: max_wind_speed
          type: FLOAT
```
## Mappings

The next important entity in Flowman is the [*mapping*](spec/mapping/index.md), which describes a *data
transformation* (and, as a special but very important case, *reading data*). Mappings can use the results of other
mappings as their input and thereby build a complex flow of data transformations. Internally, all these
transformations are executed using Apache Spark.

There are all kinds of mappings available, like simple [filter](spec/mapping/filter.md) mappings,
[aggregate](spec/mapping/aggregate.md) mappings and the very powerful generic [sql](spec/mapping/sql.md) mapping.
Again, each mapping is described by a specific set of properties depending on the selected kind.

### Examples
The example below shows how to access a relation called `facts_table` (which is not shown here). It will read a
single *partition* of data, which is commonly done for incrementally processing only newly arrived data.
```yaml
mappings:
  facts_all:
    kind: readRelation
    relation: facts_table
    partitions:
      year:
        start: $start_year
        end: $end_year
```

The following example is a simple filter mapping, which is equivalent to a `WHERE` clause in traditional SQL. It applies
the filter to the output of the incoming `facts_all` mapping defined above.
```yaml
mappings:
  facts_special:
    kind: filter
    input: facts_all
    condition: "special_flag = TRUE"
```

You can also perform arbitrary SQL queries (in Spark SQL) by using the `sql` mapping:
```yaml
mappings:
  people_union:
    kind: sql
    sql: "
      SELECT
        first_name,
        last_name
      FROM
        people_internal
      UNION ALL
      SELECT
        first_name,
        last_name
      FROM
        people_external
      "
```


## Targets

Now we have the two entity types *mapping* and *relation*, and we have already seen how to read from a relation using
the `readRelation` mapping. But how can we store the result of a flow of transformations back into some relation? This
is where *build targets* come into play. They connect the output of a mapping with a relation and tell Flowman
to write the results of a mapping into a relation. These targets are the entities which will be *built* by Flowman and
which support a lifecycle starting from creating a relation, migrating it to the newest schema, filling it with data,
verifying it, and so on.

Again, Flowman provides many types of build targets, but the most important one is the `relation` build target.

### Examples

The following example writes the output of the mapping `stations_mapping` into a relation called `stations_table`.
Again, the example only writes into a single partition, in order to incrementally process only new data.
```yaml
targets:
  stations:
    kind: relation
    mapping: stations_mapping
    relation: stations_table
    partition:
      processing_date: "${processing_date}"
```


## Jobs

While targets contain all the information for building the data, Flowman uses an additional entity called *job*,
which simply bundles multiple targets so that they are built together. The idea is that while your project may
contain many targets, you often want to group them such that only specific targets are built together.

This grouping is done via a job in Flowman, which mainly contains a list of targets to be built. Additionally, a job
allows you to specify build parameters, which need to be provided on the command line. A typical example would be a
date which selects only a subset of the available data for processing.

### Examples

The following example defines a job called `main` with the two build targets `stations` and `weather`. Moreover, the job
defines a mandatory parameter called `processing_date`, which can be referenced as a variable in all entities.
```yaml
jobs:
  main:
    description: "Processes all outputs"
    parameters:
      - name: processing_date
        type: string
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
    environment:
      - start_ts=$processing_date
      - end_ts=$Date.parse($processing_date).plusDays(1)
    targets:
      - stations
      - weather
```
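
When the job is executed (for example with Flowman's `flowexec` command line tool), the value for `processing_date` is
supplied as a `processing_date=<value>` argument on the command line, and the derived environment variables `start_ts`
and `end_ts` are computed from it.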

## Additional entities
While these four types (relations, mappings, targets and jobs) form the basis of every Flowman project, there are some
additional entities like [tests](spec/test/index.md), [connections](spec/connection/index.md) and more. You can find an
overview of all entity types in the [project specification documentation](spec/index.md).
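
As a small illustrative sketch (the connection and relation names are made up, and the connection settings are copied
from the JDBC examples above), a connection can be declared once and then referenced by name from a `jdbc` relation:
```yaml
connections:
  mysql-db:
    kind: jdbc
    driver: "com.mysql.cj.jdbc.Driver"
    url: "jdbc:mysql://mysql-crm.acme.com/crm_main"
    username: "flowman"
    password: "super_secret"

relations:
  frontend_users:
    kind: jdbc
    # Reference the connection declared above by its name
    connection: mysql-db
    table: "users"
```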


## Lifecycle

Flowman sees data as artifacts with a common lifecycle, from creation until deletion. The lifecycle itself consists of
multiple different *build phases* (such as CREATE, BUILD, VERIFY, TRUNCATE and DESTROY), each of them representing one
stage of the whole lifecycle. Each target supports at least one of these build phases, which means that the target
performs some action during that phase. The specific phases depend on the target type. See
[lifecycles and phases](lifecycle.md) for more detailed information.
3 changes: 2 additions & 1 deletion docs/index.md
@@ -92,7 +92,7 @@ Flowman also provides optional plugins which extend functionality. You can find
:glob:
quickstart
building
concepts
installation
lifecycle
spec/index
@@ -101,4 +101,5 @@ Flowman also provides optional plugins which extend functionality. You can find
cookbook/index
plugins/index
config
building
```
11 changes: 6 additions & 5 deletions docs/installation.md
@@ -7,7 +7,7 @@ be provided by your platform. This approach ensures that the existing Spark and
with all patches and extensions available on your platform. Specifically this means that Flowman requires the following
components present on your system:

* Java 1.8
* Java 11 (or 1.8 when using Spark 2.4)
* Apache Spark with a matching minor version
* Apache Hadoop with a matching minor version

@@ -25,13 +25,13 @@ versions. The naming is very simple:
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 and for Spark 3.0 and Hadoop 3.2 would be
package of Flowman 0.20.1 and for Spark 3.1 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
flowman-dist-0.20.1-oss-spark31-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
https://github.com/dimajix/flowman/releases/download/0.20.1/flowman-dist-0.20.1-oss-spark3.1-hadoop3.2-bin.tar.gz
```


@@ -77,7 +77,8 @@ tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
├── flowman-example
├── flowman-impala
├── flowman-kafka
└── flowman-mariadb
├── flowman-mariadb
└── flowman-...
```

* The `bin` directory contains the Flowman executables
13 changes: 13 additions & 0 deletions docs/spec/connection/jdbc.md
@@ -10,12 +10,25 @@ environment:

connections:
  mysql-db:
    kind: jdbc
    driver: "$mysql_db_driver"
    url: "$mysql_db_url"
    username: "$mysql_db_username"
    password: "$mysql_db_password"
```
## Fields
* `kind` **(mandatory)** *(string)*: `jdbc`

* `driver` **(mandatory)** *(string)*

* `url` **(mandatory)** *(string)*

* `username` **(optional)** *(string)*

* `password` **(optional)** *(string)*

* `properties` **(optional)** *(map)*


## Description
6 changes: 4 additions & 2 deletions docs/spec/mapping/filter.md
@@ -1,8 +1,10 @@

# Filter Mapping

The `filter` mapping is one of the simplest mappings and applies a row filter to all incoming records. This is equivalent
to a `WHERE` or `HAVING` condition in a classical SQL statement.

## Example
```
```yaml
mappings:
  facts_special:
    kind: filter
2 changes: 1 addition & 1 deletion docs/spec/mapping/sort.md
@@ -5,7 +5,7 @@ downstream mappings may destroy the sort order again, so this should be the last
before you save a result using an output operation.

## Example
```
```yaml
mappings:
  stations_sorted:
    kind: sort
2 changes: 1 addition & 1 deletion docs/spec/mapping/sql.md
@@ -2,7 +2,7 @@
The `sql` mapping allows you to execute any SQL transformation containing Spark SQL code.

## Example
```
```yaml
mappings:
  people_union:
    kind: sql