Merge branch 'develop'
kupferk committed Jan 26, 2022
2 parents c139042 + a0e9203 commit 3033035
Showing 221 changed files with 4,943 additions and 1,390 deletions.
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,12 @@
# Version 0.20.1
# Version 0.21.0 - 2022-01-26

* Fix wrong dependencies in Swagger plugin
* Implement basic schema inference for local CSV files
* Implement new `stack` mapping
* Improve error messages of local CSV parser


# Version 0.20.1 - 2022-01-06

* Implement detection of dependencies introduced by schema

2 changes: 1 addition & 1 deletion docker/pom.xml
@@ -10,7 +10,7 @@
<parent>
<groupId>com.dimajix.flowman</groupId>
<artifactId>flowman-root</artifactId>
<version>0.20.1</version>
<version>0.21.0</version>
<relativePath>../pom.xml</relativePath>
</parent>

18 changes: 10 additions & 8 deletions docs/building.md
@@ -5,29 +5,31 @@ Cloudera or EMR, you currently need to build Flowman yourself to match the corr

## Download Prebuilt Distribution

As an alternative to building Flowman yourself, prebuilt Flowman distributions are are provided on
The simplest way to get started with Flowman is to download a prebuilt distribution, which is provided at
[GitHub](https://github.com/dimajix/flowman/releases). This probably is the simplest way to grab a working Flowman
package. Note that for each release, there are different packages being provided, for different Spark and Hadoop
versions. The naming is very simple:
```
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 and for Spark 3.0 and Hadoop 3.2 would be
package of Flowman 0.20.1 and for Spark 3.1 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
flowman-dist-0.20.1-oss-spark31-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
https://github.com/dimajix/flowman/releases/download/0.20.1/flowman-dist-0.20.1-oss-spark3.1-hadoop3.2-bin.tar.gz
```

The whole project is built using Maven. The build also includes a Docker image, which requires that Docker
is installed on the build machine - building the Docker image can be disabled (see below).

## Build with Maven

When you decide against downloading a prebuilt Flowman distribution, you can simply built it yourself with Maven.
When you decide against downloading a prebuilt Flowman distribution, you can simply build it yourself with Maven. As
a prerequisite, you need
* Java (1.8 for Spark <= 2.4 and 11 for Spark >= 3.0)
* Apache Maven
* On Windows: Hadoop libraries

Building Flowman with the default settings (i.e. Hadoop and Spark version) is as easy as

mvn clean install
221 changes: 221 additions & 0 deletions docs/concepts.md
@@ -0,0 +1,221 @@
# Core Concepts

Flowman is a *data build tool* which uses a declarative syntax to specify what needs to be built. The main difference
from classical build tools like `make` or `maven` is that Flowman builds *data* instead of *applications* or *libraries*.
Flowman borrows many features from classical build tools, like support for [build phases](lifecycle.md), automatic
dependency detection, clean console output and more.

But how can we instruct Flowman to build data? The input and output data are specified in declarative YAML files,
together with all transformations applied along the way from reading to writing data. At the core of these YAML
files are the following entity types:

## Relations

[*Relations*](spec/relation/index.md) specify *physical manifestations of data* in external systems. A relation may
refer to any data source (or sink), like a table or view in a MySQL database, a table in Hive, files on a
distributed filesystem like HDFS, or files stored in an object store like S3.

A relation can serve as a data source, a data sink, or both (this is when automatic dependency management
comes into play, which is required to determine the correct build order). Each relation typically has some important
properties, like its *schema* (i.e. the columns, including name and type) and its location (be it a directory in a shared
file system or a URL to connect to). Of course, the available properties depend on the specific *kind* of relation.

### Examples
For example, a table in Hive can be specified as follows:
```yaml
relations:
  parquet_relation:
    kind: hiveTable
    database: default
    table: financial_transactions
    # Specify the physical location where the data files should be stored at. If you leave this out, the Hive
    # default location will be used
    location: /warehouse/default/financial_transactions
    # Specify the file format to use
    format: parquet
    # Add partition column
    partitions:
      - name: business_date
        type: string
    # Specify a schema, which is mandatory for write operations
    schema:
      kind: inline
      fields:
        - name: id
          type: string
        - name: amount
          type: double
```
And a table in a MySQL database can be specified as:
```yaml
relations:
  frontend_users:
    kind: jdbc
    # Specify the name of the connection to use
    connection:
      kind: jdbc
      driver: "com.mysql.cj.jdbc.Driver"
      url: "jdbc:mysql://mysql-crm.acme.com/crm_main"
      username: "flowman"
      password: "super_secret"
    # Specify the table
    table: "users"
```
Or you can easily access files in S3 via:
```yaml
relations:
  csv_export:
    kind: file
    # Specify the file format to use
    format: "csv"
    # Specify the base directory where all data is stored. This location does not include the partition pattern
    location: "s3://acme.com/export/weather/csv"
    # Set format specific options
    options:
      delimiter: ","
      quote: "\""
      escape: "\\"
      header: "true"
      compression: "gzip"
    # Specify an optional schema here. It is always recommended to explicitly specify a schema for every relation
    # and not just let data flow from a mapping into a target.
    schema:
      kind: embedded
      fields:
        - name: country
          type: STRING
        - name: min_wind_speed
          type: FLOAT
        - name: max_wind_speed
          type: FLOAT
```
## Mappings

The next important entity in Flowman is the [*mapping*](spec/mapping/index.md), which describes a *data
transformation* (and, as a special but very important case, *reading data*). Mappings can use the results of other
mappings as their input and thereby build a complex flow of data transformations. Internally, all these
transformations are executed using Apache Spark.

There are all kinds of mappings available, like simple [filter](spec/mapping/filter.md) mappings,
[aggregate](spec/mapping/aggregate.md) mappings and the very powerful generic [sql](spec/mapping/sql.md) mapping.
Again, each mapping is described by a specific set of properties depending on the selected kind.

### Examples
The example below shows how to access a relation called `facts_table` (which is not shown here). It will read a
single *partition* of data, which is commonly done for incrementally processing only newly arrived data.
```yaml
mappings:
  facts_all:
    kind: readRelation
    relation: facts_table
    partitions:
      year:
        start: $start_year
        end: $end_year
```

The following example is a simple filter mapping, which is equivalent to a `WHERE` clause in traditional SQL. It applies
the filter to the output of the incoming `facts_all` mapping defined above.
```yaml
mappings:
  facts_special:
    kind: filter
    input: facts_all
    condition: "special_flag = TRUE"
```

You can also perform arbitrary SQL queries (in Spark SQL) by using the `sql` mapping:
```yaml
mappings:
  people_union:
    kind: sql
    sql: "
      SELECT
        first_name,
        last_name
      FROM
        people_internal
      UNION ALL
      SELECT
        first_name,
        last_name
      FROM
        people_external
      "
```


## Targets

Now we have the two entity types *mapping* and *relation*, and we have already seen how to read from a relation using
the `readRelation` mapping. But how can we store the result of a flow of transformations back into some relation? This
is where *build targets* come into play. They connect the output of a mapping with a relation and tell Flowman
to write the results of a mapping into a relation. These targets are the entities which will be *built* by Flowman and
which support a lifecycle starting from creating a relation, migrating it to the newest schema, filling it with data,
verifying it, and so on.

Again, Flowman provides many types of build targets, but the most important one is the `relation` build target.

### Examples

The following example writes the output of the mapping `stations_mapping` into a relation called `stations_table`.
Again, the example only writes into a single partition, in order to incrementally process only new data.
```yaml
targets:
  stations:
    kind: relation
    mapping: stations_mapping
    relation: stations_table
    partition:
      processing_date: "${processing_date}"
```


## Jobs

While targets contain all the information for building the data, Flowman uses an additional entity called *job*,
which simply bundles multiple targets so that they are built together. The idea is that while your project may
contain many targets, you often want to group them such that only specific targets are built together.

This grouping is done via a job in Flowman, which mainly contains a list of targets to be built. Additionally, a job
allows you to specify build parameters, which need to be provided on the command line. A typical example would be a
date which selects only a subset of the available data for processing.

### Examples

The following example defines a job called `main` with the two build targets `stations` and `weather`. Moreover, the job
defines a mandatory parameter called `processing_date`, which can be referenced as a variable in all entities.
```yaml
jobs:
  main:
    description: "Processes all outputs"
    parameters:
      - name: processing_date
        type: string
        description: "Specifies the date in yyyy-MM-dd for which the job will be run"
    environment:
      - start_ts=$processing_date
      - end_ts=$Date.parse($processing_date).plusDays(1)
    targets:
      - stations
      - weather
```
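
When the job is executed (for example with Flowman's `flowexec` command line tool), the value for `processing_date` is
supplied as a `processing_date=<value>` argument on the command line, and the derived environment variables `start_ts`
and `end_ts` are computed from it.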

## Additional entities
While these four types (relations, mappings, targets and jobs) form the basis of every Flowman project, there are some
additional entities like [tests](spec/test/index.md), [connections](spec/connection/index.md) and more. You can find an
overview of all entity types in the [project specification documentation](spec/index.md).
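
As a small illustrative sketch (the connection and relation names are made up, and the connection settings are copied
from the JDBC examples above), a connection can be declared once and then referenced by name from a `jdbc` relation:
```yaml
connections:
  mysql-db:
    kind: jdbc
    driver: "com.mysql.cj.jdbc.Driver"
    url: "jdbc:mysql://mysql-crm.acme.com/crm_main"
    username: "flowman"
    password: "super_secret"

relations:
  frontend_users:
    kind: jdbc
    # Reference the connection declared above by its name
    connection: mysql-db
    table: "users"
```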


## Lifecycle

Flowman sees data as artifacts with a common lifecycle, from creation until deletion. The lifecycle itself consists of
multiple different *build phases* (such as CREATE, BUILD, VERIFY, TRUNCATE and DESTROY), each of them representing one
stage of the whole lifecycle. Each target supports at least one of these build phases, which means that the target
performs some action during that phase. The specific phases depend on the target type. See
[lifecycles and phases](lifecycle.md) for more detailed information.
3 changes: 2 additions & 1 deletion docs/index.md
@@ -92,7 +92,7 @@ Flowman also provides optional plugins which extend functionality. You can find
:glob:
quickstart
building
concepts
installation
lifecycle
spec/index
@@ -101,4 +101,5 @@ Flowman also provides optional plugins which extend functionality. You can find
cookbook/index
plugins/index
config
building
```
11 changes: 6 additions & 5 deletions docs/installation.md
@@ -7,7 +7,7 @@ be provided by your platform. This approach ensures that the existing Spark and
with all patches and extensions available on your platform. Specifically this means that Flowman requires the following
components present on your system:

* Java 1.8
* Java 11 (or 1.8 when using Spark 2.4)
* Apache Spark with a matching minor version
* Apache Hadoop with a matching minor version

@@ -25,13 +25,13 @@ versions. The naming is very simple:
flowman-dist-<version>-oss-spark<spark-version>-hadoop<hadoop-version>-bin.tar.gz
```
You simply have to use the package which fits the Spark and Hadoop versions of your environment. For example, the
package of Flowman 0.14.1 and for Spark 3.0 and Hadoop 3.2 would be
package of Flowman 0.20.1 and for Spark 3.1 and Hadoop 3.2 would be
```
flowman-dist-0.14.1-oss-spark30-hadoop32-bin.tar.gz
flowman-dist-0.20.1-oss-spark31-hadoop32-bin.tar.gz
```
and the full URL then would be
```
https://github.com/dimajix/flowman/releases/download/0.14.1/flowman-dist-0.14.1-oss-spark3.0-hadoop3.2-bin.tar.gz
https://github.com/dimajix/flowman/releases/download/0.20.1/flowman-dist-0.20.1-oss-spark3.1-hadoop3.2-bin.tar.gz
```


@@ -77,7 +77,8 @@ tar xvzf flowman-dist-X.Y.Z-bin.tar.gz
├── flowman-example
├── flowman-impala
├── flowman-kafka
└── flowman-mariadb
├── flowman-mariadb
└── flowman-...
```

* The `bin` directory contains the Flowman executables
13 changes: 13 additions & 0 deletions docs/spec/connection/jdbc.md
@@ -10,12 +10,25 @@ environment:

connections:
  mysql-db:
    kind: jdbc
    driver: "$mysql_db_driver"
    url: "$mysql_db_url"
    username: "$mysql_db_username"
    password: "$mysql_db_password"
```
## Fields
* `kind` **(mandatory)** *(string)*: `jdbc`

* `driver` **(mandatory)** *(string)*

* `url` **(mandatory)** *(string)*

* `username` **(optional)** *(string)*

* `password` **(optional)** *(string)*

* `properties` **(optional)** *(map)*


## Description
6 changes: 4 additions & 2 deletions docs/spec/mapping/filter.md
@@ -1,8 +1,10 @@

# Filter Mapping

The `filter` mapping is one of the simplest mappings and applies a row filter to all incoming records. This is equivalent
to a `WHERE` or `HAVING` condition in a classical SQL statement.

## Example
```
```yaml
mappings:
  facts_special:
    kind: filter
2 changes: 1 addition & 1 deletion docs/spec/mapping/sort.md
@@ -5,7 +5,7 @@ downstream mappings may destroy the sort order again, so this should be the last
before you save a result using an output operation.

## Example
```
```yaml
mappings:
  stations_sorted:
    kind: sort
2 changes: 1 addition & 1 deletion docs/spec/mapping/sql.md
@@ -2,7 +2,7 @@
The `sql` mapping allows you to execute any SQL transformation containing Spark SQL code.

## Example
```
```yaml
mappings:
  people_union:
    kind: sql