diff --git a/BUILDING.md b/BUILDING.md index 1257721df..30a3f4914 100644 --- a/BUILDING.md +++ b/BUILDING.md @@ -55,7 +55,7 @@ appropriate build profiles, you can easily create a custom build. Although you can normally build Flowman on Windows, it is recommended to use Linux instead. But nevertheless Windows is still supported to some extend, but requires some extra care. You will need the Hadoop WinUtils installed. You can download the binaries from https://github.com/cdarlint/winutils and install an appropriate version somewhere onto -your machine. Do not forget to set the HADOOP_HOME or PATH environment variable to the installation directory of these +your machine. Do not forget to set the `HADOOP_HOME` or `PATH` environment variable to the installation directory of these utils! You should also configure git such that all files are checked out using "LF" endings instead of "CRLF", otherwise diff --git a/CHANGELOG.md b/CHANGELOG.md index b506f1b1b..293482aa5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,3 +1,17 @@ +# Version 0.22.0 - 2022-03-01 + +* Add new `sqlserver` relation +* Implement new documentation subsystem +* Change default build to Spark 3.2.1 and Hadoop 3.3.1 +* Add new `drop` target for removing tables +* Speed up project loading by reusing Jackson mapper +* Implement new `jdbc` metric sink +* Implement schema cache in Executor to speed up documentation and similar tasks +* Add new config variables `flowman.execution.mapping.schemaCache` and `flowman.execution.relation.schemaCache` +* Add new config variable `flowman.default.target.verifyPolicy` to ignore empty tables during VERIFY phase +* Implement initial support for indexes in JDBC relations + + # Version 0.21.2 - 2022-02-14 * Fix importing projects diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 000000000..68d860396 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,110 @@ +# Contributing to Flowman + +You want to contribute to Flowman? Welcome! Please read this document to understand what you can do: + * [Report an Issue](#report-an-issue) + * [Contribute Documentation](#contribute-documentation) + * [Contribute Code](#contribute-code) + + +## Report an Issue + +If you find a bug - behavior of Flowman code contradicting your expectation - you are welcome to report it. +We can only handle well-reported, actual bugs, so please follow the guidelines below. + +Once you have familiarized with the guidelines, you can go to the [GitHub issue tracker for Flowman](https://github.com/dimajix/flowman/issues/new) to report the issue. + +### Quick Checklist for Bug Reports + +Issue report checklist: + * Real, current bug + * No duplicate + * Reproducible + * Good summary + * Well-documented + * Minimal example + +### Issue handling process + +When an issue is reported, a committer will look at it and either confirm it as a real issue, close it if it is not an issue, or ask for more details. + +An issue that is about a real bug is closed as soon as the fix is committed. + +### Usage of Labels + +GitHub offers labels to categorize issues. 
We suggest the following labels: + +Labels for issue categories: + * bug: this issue is a bug in the code + * feature: this issue is a request for a new functionality or an enhancement request + * environment: this issue relates to supporting a specific runtime environment (Cloudera, specific Spark/Hadoop version, etc) + +Status of open issues: + * help wanted: the feature request is approved and you are invited to contribute + +Status/resolution of closed issues: + * wontfix: while acknowledged to be an issue, a fix cannot or will not be provided + +### Issue Reporting Disclaimer + +We want to improve the quality of Flowman and good bug reports are welcome! But our capacity is limited, thus we reserve the right to close or to not process insufficient bug reports in favor of those which are very cleanly documented and easy to reproduce. Even though we would like to solve each well-documented issue, there is always the chance that it will not happen - remember: Flowman is Open Source and comes without warranty. + +Bug report analysis support is very welcome! (e.g. pre-analysis or proposing solutions) + + + +## Contribute Documentation + +Flowman has many features implemented, unfortunately not all of them are well documented. So this is an area where we highly welcome contributions from users in order to improve the documentation. The documentation is contained in the "doc" subdirectory within the source code repository. This implies that when you want to contribute documentation, you have to follow the same procedure as for contributing code. + + + +## Contribute Code + +You are welcome to contribute code to Flowman in order to fix bugs or to implement new features. + +There are three important things to know: + +1. You must be aware of the Apache License (which describes contributions) and **agree to the Contributors License Agreement**. This is common practice in all major Open Source projects. + For company contributors special rules apply. See the respective section below for details. +2. Please ensure your contribution adopts Flowmans **code style, quality, and product standards**. The respective section below gives more details on the coding guidelines. +3. **Not all proposed contributions can be accepted**. Some features may e.g. just fit a third-party plugin better. The code must fit the overall direction of Flowman and really improve it. The more effort you invest, the better you should clarify in advance whether the contribution fits: the best way would be to just open an issue to discuss the feature you plan to implement (make it clear you intend to contribute). + +### Contributor License Agreement + +When you contribute (code, documentation, or anything else), you have to be aware that your contribution is covered by the same [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0) that is applied to Flowman itself. + +In particular, you need to agree to the [Flowman Contributors License Agreement](https://cla-assistant.io/dimajix/flowman), stating that you have the right and are okay to put your contribution under the license of this project. +CLA assistant will ask you to confirm that. + +This applies to all contributors, including those contributing on behalf of a company. +If you agree to its content, you simply have to click on the link posted by the CLA assistant as a comment to the pull request. Click it to check the CLA, then accept it on the following screen if you agree to it. 
+CLA assistant will save this decision for upcoming contributions and will notify you if there is any change to the CLA in the meantime. + +### Contribution Content Guidelines + +These are some rules we try to follow: + +- Apply a clean coding style adapted to the surrounding code, even though we are aware the existing code is not fully clean +- Use (4)spaces for indentation +- Use variable naming conventions like in the other files you are seeing (camelcase) +- No println - use SLF4J logging instead +- Comment your code where it gets non-trivial +- Write a unit test +- Do not do any incompatible changes, especially do not change or remove existing properties from YAML specs + +### How to contribute - the Process + +1. Make sure the change would be welcome (e.g. a bugfix or a useful feature); best do so by proposing it in a GitHub issue +2. Create a branch forking the flowman repository and do your change +3. Commit and push your changes on that branch +4. If your change fixes an issue reported at GitHub, add the following line to the commit message: + - ```Fixes #(issueNumber)``` +5. Create a Pull Request with the following information + - Describe the problem you fix with this change. + - Describe the effect that this change has from a user's point of view. App crashes and lockups are pretty convincing for example, but not all bugs are that obvious and should be mentioned in the text. + - Describe the technical details of what you changed. It is important to describe the change in a most understandable way so the reviewer is able to verify that the code is behaving as you intend it to. +6. Follow the link posted by the CLA assistant to your pull request and accept it, as described in detail above. +7. Wait for our code review and approval, possibly enhancing your change on request + - Note that the Flowman developers also have their regular duties, so depending on the required effort for reviewing, testing and clarification this may take a while +8. Once the change has been approved we will inform you in a comment +9. We will close the pull request, feel free to delete the now obsolete branch diff --git a/QUICKSTART.md b/QUICKSTART.md index 071de70f3..5f127d115 100644 --- a/QUICKSTART.md +++ b/QUICKSTART.md @@ -16,7 +16,7 @@ Fortunately, Apache Spark is rather simple to install locally on your machine: ### Download & Install Spark -As of this writing, the latest release of Flowman is 0.20.0 and is available prebuilt for Spark 3.1.2 on the Spark +As of this writing, the latest release of Flowman is 0.22.0 and is available prebuilt for Spark 3.2.1 on the Spark homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it. ```shell @@ -25,8 +25,8 @@ mkdir playground cd playground # Download and unpack Spark & Hadoop -curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link -ln -snf spark-3.1.2-bin-hadoop3.2 spark +curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link +ln -snf spark-3.2.1-bin-hadoop3.2 spark ``` The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other. @@ -35,7 +35,8 @@ The Spark package already contains Hadoop, so with this single download you alre If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of DLLs required for the Hadoop libraries to be working. 
You can get a copy at https://github.com/kontext-tech/winutils . Once you downloaded the appropriate version, you need to place the DLLs into a directory `$HADOOP_HOME/bin`, where -`HADOOP_HOME` refers to some location on your Windows PC. You also need to set the following environment variables: +`HADOOP_HOME` refers to some arbitrary location of your choice on your Windows PC. You also need to set the following +environment variables: * `HADOOP_HOME` should point to the parent directory of the `bin` directory * `PATH` should also contain `$HADOOP_HOME/bin` @@ -43,11 +44,11 @@ Once you downloaded the appropriate version, you need to place the DLLs into a d ## 1.2 Install Flowman You find prebuilt Flowman packages on the corresponding release page on GitHub. For this quickstart, we chose -`flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz` which nicely fits to the Spark package we just downloaded before. +`flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz` which nicely fits to the Spark package we just downloaded before. ```shell # Download and unpack Flowman -curl -L https://github.com/dimajix/flowman/releases/download/0.20.0/flowman-dist-0.20.0-oss-spark3.1-hadoop3.2-bin.tar.gz | tar xvzf - +curl -L https://github.com/dimajix/flowman/releases/download/0.22.0/flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz | tar xvzf - # Create a nice link ln -snf flowman-0.20.0 flowman @@ -81,13 +82,9 @@ That’s all we need to run the Flowman example. # 2. Flowman Shell -The example data is stored in a S3 bucket provided by myself. In order to access the data, you need to provide valid -AWS credentials in your environment: - -```shell -$ export AWS_ACCESS_KEY_ID= -$ export AWS_SECRET_ACCESS_KEY= -``` +The example data is stored in a S3 bucket provided by myself. Since the data is publicly available and the project is +configured to use anonymous AWS authentication, you do not need to provide your AWS credentials (you even do not +even need to have an account on AWS) ## 2.1 Start interactive Flowman shell diff --git a/README.md b/README.md index 676cbbd61..5a580a064 100644 --- a/README.md +++ b/README.md @@ -21,11 +21,11 @@ keep all aspects (like transformations and schema information) in a single place * Semantics of a build tool like Maven - just for data instead for applications * Declarative syntax in YAML files * Data model management (Create, Migrate and Destroy Hive tables, JDBC tables or file based storage) +* Generation of meaningful documentation * Flexible expression language * Jobs for managing build targets (like copying files or uploading data via sftp) * Automatic data dependency management within the execution of individual jobs -* Rich set of execution metrics -* Meaningful logging output +* Meaningful logging output & rich set of execution metrics * Powerful yet simple command line tools * Extendable via Plugins @@ -38,28 +38,21 @@ You can find the official homepage at [Flowman.io](https://flowman.io) # Installation -You can either grab an appropriate pre-build package at https://github.com/dimajix/flowman/releases or you -can build your own version via Maven with - - mvn clean install - -Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles. 
- +You can either grab an appropriate pre-build package at [GitHub](https://github.com/dimajix/flowman/releases) ## Installing the Packed Distribution The packed distribution file is called `flowman-{version}-bin.tar.gz` and can be extracted at any location using - - tar xvzf flowman-{version}-bin.tar.gz - +```shell +tar xvzf flowman-{version}-bin.tar.gz +``` ## Apache Spark Flowman does not bring its own Spark libraries, but relies on a correctly installed Spark distribution. You can download appropriate packages directly from [https://spark.apache.org](the Spark Homepage). - ## Hadoop Utils for Windows If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of @@ -70,7 +63,6 @@ Once you downloaded the appropriate version, you need to place the DLLs into a d * `PATH` should also contain `$HADOOP_HOME/bin` - # Command Line Utils The primary tool provided by Flowman is called `flowexec` and is located in the `bin` folder of the @@ -80,19 +72,37 @@ installation directory. The `flowexec` tool has several subcommands for working with objects and projects. The general pattern looks as follows - - flowexec [generic options] [specific options and arguments] +```shell +flowexec [generic options] [specific options and arguments] +``` For working with `flowexec`, either your current working directory needs to contain a Flowman project with a file `project.yml` or you need to specify the path to a valid project via - - flowexec -f /path/to/project/folder +```shell +flowexec -f /path/to/project/folder +``` ## Interactive Shell With version 0.14.0, Flowman also introduced a new interactive shell for executing data flows. The shell can be started via - - flowshell -f +```shell +flowshell -f +``` Within the shell, you can interactively build targets and inspect intermediate mappings. + + +# Building + +You can build your own version via Maven with +```shell +mvn clean install +``` +Please also read [BUILDING.md](BUILDING.md) for detailed instructions, specifically on build profiles. + + +# Contributing + +You want to contribute to Flowman? Welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) to understand what you can +do. diff --git a/docker/conf/default-namespace.yml b/docker/conf/default-namespace.yml index 06363ec75..0f2cfcf0d 100644 --- a/docker/conf/default-namespace.yml +++ b/docker/conf/default-namespace.yml @@ -13,6 +13,28 @@ connections: username: $System.getenv('FLOWMAN_LOGDB_USER', '') password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '') +# This adds a hook for creating an execution log in a file +hooks: + kind: report + location: ${project.basedir}/generated-report.txt + metrics: + # Define common labels for all metrics + labels: + project: ${project.name} + metrics: + # Collect everything + - selector: + name: .* + labels: + category: ${category} + kind: ${kind} + name: ${name} + +# This configures where metrics should be written to. 
Since we cannot assume a working Prometheus push gateway, we +# simply print them onto the console +metrics: + - kind: console + config: - spark.sql.warehouse.dir=/opt/flowman/hive/warehouse - spark.hadoop.hive.metastore.uris= @@ -21,7 +43,7 @@ config: store: kind: file - location: /opt/flowman/examples + location: $System.getenv('FLOWMAN_HOME')/examples plugins: - flowman-aws diff --git a/docker/conf/history-server.yml b/docker/conf/history-server.yml new file mode 100644 index 000000000..6e2b18221 --- /dev/null +++ b/docker/conf/history-server.yml @@ -0,0 +1,20 @@ +# The following definition provides a "run history" stored in a database. If nothing else is specified, the database +# is stored locally as a Derby database. If you do not want to use the history, you can simply remove the whole +# 'history' block from this file. +history: + kind: jdbc + connection: flowman_state + retries: 3 + timeout: 1000 + +connections: + flowman_state: + driver: $System.getenv('FLOWMAN_LOGDB_DRIVER', 'org.apache.derby.jdbc.EmbeddedDriver') + url: $System.getenv('FLOWMAN_LOGDB_URL', $String.concat('jdbc:derby:', $System.getenv('FLOWMAN_HOME'), '/logdb;create=true')) + username: $System.getenv('FLOWMAN_LOGDB_USER', '') + password: $System.getenv('FLOWMAN_LOGDB_PASSWORD', '') + +plugins: + - flowman-mariadb + - flowman-mysql + - flowman-mssqlserver diff --git a/docker/pom.xml b/docker/pom.xml index 50a88de41..edbac5d93 100644 --- a/docker/pom.xml +++ b/docker/pom.xml @@ -10,10 +10,14 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml + + ${hadoop-api.version} + + CDH-6.3 @@ -27,6 +31,16 @@ true + + spark-3.2 + + true + + + + 3.2 + + @@ -93,7 +107,7 @@ false ${spark.version} - ${hadoop-api.version} + ${spark-hadoop-archive.version} flowman-dist-${flowman.dist.label}-bin.tar.gz ${env.http_proxy} ${env.https_proxy} diff --git a/docs/cli/flowexec.md b/docs/cli/flowexec.md index be9be66d5..193f5bb57 100644 --- a/docs/cli/flowexec.md +++ b/docs/cli/flowexec.md @@ -29,7 +29,7 @@ or for inspecting individual entities. ## Project Commands -The most important command group is for executing a specific lifecycle or a individual phase for the whole project. +The most important command group is for executing a specific lifecycle or an individual phase for the whole project. ```shell script flowexec project ``` @@ -49,6 +49,21 @@ individual targets with `-d`. the whole lifecycle for `verify` includes the phases `create` and `build` and these phases would be executed before `verify`. If this is not what you want, then use the option `-nl` +### Examples +In order to build a project (i.e. run `VALIDATE`, `CREATE` and `BUILD` execution phases) stored in the subdirectory +`examples/weather` which defines an (optional) parameter `year`, simply run + +```shell +flowexec -f examples/weather project build year=2018 +``` + +If you only want to execute the `BUILD` phase and skip the first two other phases, then you need to add the +command line option `-nl` to skip the lifecycle: + +```shell +flowexec -f examples/weather project build year=2018 -nl +``` + ## Job Commands Similar to the project commands, individual jobs with different names than `main` can be executed. @@ -79,6 +94,22 @@ This will execute the whole job by executing the desired lifecycle for the `main the whole lifecycle for `verify` includes the phases `create` and `build` and these phases would be executed before `verify`. If this is not what you want, then use the option `-nl` + +### Examples +In order to build (i.e. 
run `VALIDATE`, `CREATE` and `BUILD` execution phases) the `main` job of a project stored +in the subdirectory `examples/weather` which defines an (optional) parameter `year`, simply run + +```shell +flowexec -f examples/weather job build main year=2018 +``` + +If you only want to execute the `BUILD` phase and skip the first two other phases, then you need to add the +command line option `-nl` to skip the lifecycle: + +```shell +flowexec -f examples/weather job build main year=2018 -nl +``` + The following example will only execute the `BUILD` phase of the job `daily`, which defines a parameter `processing_datetime` with type datetiem. The job will be executed for the whole date range from 2021-06-01 until 2021-08-10 with a step size of one day. Flowman will execute up to four jobs in parallel (`-j 4`). diff --git a/docs/cli/flowshell.md b/docs/cli/flowshell.md index 1a54f288a..95501036c 100644 --- a/docs/cli/flowshell.md +++ b/docs/cli/flowshell.md @@ -36,7 +36,9 @@ Some additional commands in `flowshell` which are not available via `flowexec` a Start the Flowman shell for your project via - flowshell -f /path/to/your/project +```shell +flowshell -f /path/to/your/project +``` Now you can list all jobs via diff --git a/docs/conf.py b/docs/conf.py index c79f24d29..1c1898c26 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -61,9 +61,9 @@ # built documents. # # The short X.Y version. -version = '0.20' +version = '0.22' # The full version, including alpha/beta/rc tags. -release = '0.20.0' +release = '0.22.0' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. diff --git a/docs/config.md b/docs/config.md index 671aba603..bceb77af1 100644 --- a/docs/config.md +++ b/docs/config.md @@ -52,6 +52,16 @@ The number of mappings to be processed in parallel. Increasing this number may h relations are read from and their initial setup is slow (for example due to slow directory listings). With the default value of 1, the parallelism is completely disabled and a non-threaded code path is used instead. +- `flowman.execution.mapping.schemaCache` *(type: boolean)* *(default: true)* (since Flowman 0.22.0) +Turn on/off caching of schema information of mappings. Caching this information (which is enabled per default) can + speed up schema inference, which is used for `mapping` schemas and when creating the documentation of mappings. Turning + off the cache is mainly for debugging purposes. + +- `flowman.execution.relation.schemaCache` *(type: boolean)* *(default: true)* (since Flowman 0.22.0) +Turn on/off caching of schema information of relations. Caching this information (which is enabled per default) can + speed up schema inference, which is used for `relation` schemas and when creating the documentation of relations and. + mappings. Turning off the cache is mainly for debugging purposes. + - `flowman.execution.scheduler.class` *(type: class)* *(default: `com.dimajix.flowman.execution.DependencyScheduler`)* (since Flowman 0.16.0) Configure the scheduler to use, which essentially decides which target to build next. - The default `DependencyScheduler` will sort all targets according to their dependency. @@ -99,6 +109,14 @@ Sets the strategy to use how tables should be migrated. Possible values are: actual defined columns. Per default Flowman will add/remove columns to/from records such that they match the current physical layout. See [relations](spec/relation/index.md) for possible options and more details. 
+- `flowman.default.target.verifyPolicy` *(type: string)* *(default:`EMPTY_AS_FAILURE`)* (since Flowman 0.22.0)
+Defines the default target policy that is used during the `VERIFY` execution phase. The setting controls how Flowman
+interprets an empty table. Normally you'd expect that all target tables contain records, but this might not always
+be the case, for example when the source tables are already empty. Possible values are
+  - *`EMPTY_AS_FAILURE`*: Flowman will report an empty target table as an error in the `VERIFY` phase.
+  - *`EMPTY_AS_SUCCESS`*: Flowman will ignore empty tables, but still check for existence in the `VERIFY` phase.
+  - *`EMPTY_AS_SUCCESS_WITH_ERRORS`*: An empty output table is handled as partially successful.
+
 - `flowman.default.target.outputMode` *(type: string)* *(default:`OVERWRITE`)*
 Sets the default target output mode. Possible values are
   - *`OVERWRITE`*: Will overwrite existing data. Only supported in batch output.
diff --git a/docs/cookbook/data-qualioty.md b/docs/cookbook/data-qualioty.md
deleted file mode 100644
index 24a73182d..000000000
--- a/docs/cookbook/data-qualioty.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Data Quality
-
-Data quality is an important topic, which is also addressed in Flowman. The special [measure target](../spec/target/measure.md)
-provides some means to collect some important metrics from data and provide the results as metrics. These in turn
-can be [published to Prometheus](metrics.md) or other metric collectors.
-
-
-## Example
-
-```yaml
-targets:
-  measures:
-    kind: measure
-    measures:
-      record_stats:
-        kind: sql
-        query: "
-          SELECT
-            COUNT(*) AS record_count
-            SUM(column IS NULL) AS column_sum
-          FROM some_mapping"
-```
-
-This example will publish two metrics, `record_count` and `column_sum`, which then can be sent to a
-[metric sink](../spec/metric) configured in the [namespace](../spec/namespace.md).
diff --git a/docs/cookbook/data-quality.md b/docs/cookbook/data-quality.md
new file mode 100644
index 000000000..df600145e
--- /dev/null
+++ b/docs/cookbook/data-quality.md
@@ -0,0 +1,96 @@
+# Data Quality Checks
+
+Data quality is an important topic, which is also addressed in Flowman in multiple, complementary ways.
+
+
+## Verification and Validation
+
+First you might want to add some [validate](../spec/target/validate.md) and [verify](../spec/target/verify.md) targets
+to your job. The `validate` target will be executed before the `CREATE` phase and is well suited for performing some tests
+on the source data. If these tests fail, you may either emit a simple warning or stop the build altogether in a failed
+state (which is the default behaviour).
+
+The `verify` target will be executed in the `VERIFY` phase after the `BUILD` phase and is well suited for conducting
+data quality tests after the build itself has finished. Again, a failing `verify` target may either only generate a
+warning, or may fail the build.
+
+### Example
+
+```yaml
+targets:
+  validate_input:
+    kind: validate
+    mode: failFast
+    assertions:
+      assert_primary_key:
+        kind: sql
+        tests:
+          - query: "SELECT id,count(*) FROM source GROUP BY id HAVING count(*) > 1"
+            expected: []
+
+      assert_measurement_count:
+        kind: sql
+        tests:
+          - query: "SELECT COUNT(*) FROM measurements_extracted"
+            expected: 2
+```
+
+
+## Data Quality Checks as Documentation
+
+With the new [documentation framework](../documenting/index.md), Flowman adds the possibility not only to document
+mappings and relations, but also to add test cases.
These will be executed as part of the documentation (which is
+generated with an independent command with [`flowexec`](../cli/flowexec.md)).
+
+
+## Data Quality Metrics
+In addition to the `validate` and `verify` targets, Flowman also offers a special [measure target](../spec/target/measure.md).
+This target provides the means to collect important metrics from data and provide the results as metrics. These
+in turn can be [published to Prometheus](metrics.md) or other metric collectors.
+
+
+### Example
+
+```yaml
+targets:
+  measures:
+    kind: measure
+    measures:
+      record_stats:
+        kind: sql
+        query: "
+          SELECT
+            COUNT(*) AS record_count,
+            SUM(column IS NULL) AS column_sum
+          FROM some_mapping"
+```
+
+This example will publish two metrics, `record_count` and `column_sum`, which then can be sent to a
+[metric sink](../spec/metric) configured in the [namespace](../spec/namespace.md).
+
+
+## When to Use What
+All three approaches are complementary and can be used together. It all depends on what you want to achieve.
+
+### Checking Pre- and Post-Conditions
+If you want to verify that certain pre- or post-conditions in the source or output data are met, then the
+[`validate`](../spec/target/validate.md) and [`verify`](../spec/target/verify.md) targets should be used. They
+will perform arbitrary tests either before the `CREATE` and `BUILD` phase (in case of the `validate` target) or after
+the `BUILD` phase (in case of the `verify` target). In case any of the tests fail, the whole build will fail and not
+proceed with any processing. This approach can be used to only start the data transformations when input data is clean and
+matches your expectations.
+
+### Continuous Monitoring of Data Quality
+If you want to set up continuous monitoring of your data quality (either input or output or both), then the
+[`measure` target](../spec/target/measure.md) is the right choice. It will collect arbitrary numerical metrics from
+the data and publish them to a metrics sink like Prometheus. Typically, metric collectors are used in conjunction with
+a dashboard (like Grafana), which then can be used to display the whole history of these metrics over time. This way
+you can see if data quality improves or gets worse, and many of these tools also allow you to set up alarms when
+some threshold is reached.
+
+### Documenting Expectations with Reality Check
+Finally, the whole documentation subsystem is the right tool for specifying your expectations of data quality and
+having these expectations automatically checked against the real data. In combination with continuous monitoring this can
+help to better understand what might be going wrong. In contrast to pre/post-condition checking, a failed check in
+the documentation will not fail the build - it will simply be marked as failed in the documentation, but that's all
+that will happen.
diff --git a/docs/cookbook/docker.md b/docs/cookbook/docker.md
new file mode 100644
index 000000000..6b759d222
--- /dev/null
+++ b/docs/cookbook/docker.md
@@ -0,0 +1,21 @@
+# Running Flowman in Docker
+
+Flowman can also be run inside Docker, especially when working in local mode (i.e. without a cluster). It is also
+possible to run Flowman in Docker in Spark distributed processing mode, but this requires more configuration options
+to forward all required ports etc.
+
+## Running Locally
+
+We publish Flowman Docker images on [Docker Hub](https://hub.docker.com/repository/docker/dimajix/flowman),
+which are good enough for local work.
You can easily start a Flowman session in Docker as follows: + +```shell +docker run --rm -ti dimajix/flowman:0.21.0-oss-spark3.1-hadoop3.2 bash +``` + +Then once the Docker image has started you will be presented with a bash prompt. Then you can easily build the +weather example of Flowman via +```shell +cd /opt/flowman +flowexec -f examples/weather job build main +``` diff --git a/docs/cookbook/impala.md b/docs/cookbook/impala.md index 5b55d508f..a2f710ee5 100644 --- a/docs/cookbook/impala.md +++ b/docs/cookbook/impala.md @@ -1,4 +1,4 @@ -# Impala +# Updating Impala Metadata Impala is another "SQL on Hadoop" execution engine mainly developed and backed up by Cloudera. Impala allows you to access data stored in Hadoop and registered in the Hive metastore, just like Hive itself, but often at a significantly diff --git a/docs/cookbook/kerberos.md b/docs/cookbook/kerberos.md index a98a37271..8df93045e 100644 --- a/docs/cookbook/kerberos.md +++ b/docs/cookbook/kerberos.md @@ -1,6 +1,6 @@ -# Kerberos +# Using Kerberos Authentication -Of course you can also run Flowman in a Kerberos environment, as long as the components you use actually support +Of course, you can also run Flowman in a Kerberos environment, as long as the components you use actually support Kerberos. This includes Spark, Hadoop and Kafka. ## Configuring Kerberos @@ -14,7 +14,7 @@ KRB_PRINCIPAL={{KRB_PRINCIPAL}}@MY-REALM.NET KRB_KEYTAB=$FLOWMAN_CONF_DIR/{{KRB_PRINCIPAL}}.keytab ``` -Of course this way, Flowman will always use the same Kerberos principal for all projects. Currently there is no other +Of course this way, Flowman will always use the same Kerberos principal for all projects. Currently, there is no other way, since Spark and Hadoop need to have the Kerberos principal set at startup. But you can simply use different config directories and switch between them by setting the `FLOWMAN_CONF_DIR` environment variable. diff --git a/docs/cookbook/metrics.md b/docs/cookbook/metrics.md index 84892d3c0..764c81bc1 100644 --- a/docs/cookbook/metrics.md +++ b/docs/cookbook/metrics.md @@ -21,13 +21,23 @@ jobs: targets: - my_target - my_other_target + # The following section configures the metric board, which selects the Flowman metrics of interest and also + # maps the Flowman metric names to possibly different names metrics: + # Define labels which are attached to all published metrics below labels: force: ${force} status: ${status} phase: ${phase} datetime: ${processing_datetime} metrics: + # Collect everything + - selector: + name: .* + labels: + category: ${category} + kind: ${kind} + name: ${name} # This metric contains the number of records per output. It will search all metrics called # `target_records` and export them as `flowman_output_records`. It will also label each metric with # the name of each Flowman build target (in case you have multiple targets) diff --git a/docs/cookbook/override-jars.md b/docs/cookbook/override-jars.md new file mode 100644 index 000000000..9a0201ab3 --- /dev/null +++ b/docs/cookbook/override-jars.md @@ -0,0 +1,25 @@ +# Force Spark to specific jar version + +A common problem with Spark and specifically with many Hadoop environments (like Cloudera) are mismatches between +application jar versions and jars provided by the runtime environment. Flowman is built with carefully set dependency +version to match those of each supported runtime environment. But sometimes this might not be enough. 
+ +For example Cloudera ships with a rather old JDBC driver for MS SQL Server / Azure SQL Server which is not compatible +with the `sqlserver` relation type provided by the [MS SQL Server plugin](../plugins/mssqlserver.md). This will result +in `ClassNotFound` or `MethodNotFound` exceptions during execution. But it is still +possible to force Spark to use the newer JDBC driver by changing some config options. + + +## Configuration + +You need to add the following lines to your custom `flowman-env.sh` file which is stored in the `conf` subdirectory: + +```shell +# Add MS SQL JDBC Driver. Normally this is handled by the plugin mechanism, but Cloudera already provides some +# old version of the JDBC driver, and this is the only place where we can force to use our JDBC driver +SPARK_JARS="$FLOWMAN_HOME/plugins/flowman-mssqlserver/mssql-jdbc-9.2.1.jre8.jar" +SPARK_OPTS="--conf spark.executor.extraClassPath=mssql-jdbc-9.2.1.jre8.jar" +``` +The first line will explicitly add the plugin jar to the list of jars as passed to `spark-submit`. But this is still +not enough, we also have to set `spark.executor.extraClassPath` which will *prepend* the specified jars to the +classpath of the executor. diff --git a/docs/cookbook/sharing.md b/docs/cookbook/sharing.md new file mode 100644 index 000000000..c0d68f94c --- /dev/null +++ b/docs/cookbook/sharing.md @@ -0,0 +1,97 @@ +# Sharing Entities between Projects + +In bigger projects, it makes sense to organize data transformations like Flowman projects into separate subprojects, +so they can be maintained independently by possibly different teams. A classical example would be to have a different +Flowman project per source system (let it be your CRM system, your financial transaction processing system etc). +In a data lake environment, you probably want to implement independent Flowman projects to perform the first +technical transformations for each of these source systems. Then in the next layer, you want to create a more +complex and integrated data model built on top of these independent models. + +In such scenarios, you want to share some common entity definitions between these projects, for example the Flowman +project for building the integrated data model may want to reuse the relations from the other projects. + +Flowman well supports these scenarios by the concept of imports. + +## Example + +First you define a project which exports entities. Actually you might need to do nothing, since importing a project +will make all entities available to the importing side. But maybe your project also requires some variables to be +set, like the processing date. 
Typically, you would include such variables as job parameters:
+```yaml
+# Project A, which contains shared resources
+jobs:
+  # Define a base job with common environment variables
+  base:
+    parameters:
+      - name: processing_datetime
+        type: timestamp
+        description: "Specifies the datetime in yyyy-MM-ddTHH:mm:ss.Z for which the result will be generated"
+      - name: processing_duration
+        type: duration
+        description: "Specifies the processing duration (either P1D or PT1H)"
+    environment:
+      - start_ts=$processing_datetime
+      - end_ts=${Timestamp.add(${processing_datetime}, ${processing_duration})}
+      - start_unixtime=${Timestamp.parse($start_ts).toEpochSeconds()}
+      - end_unixtime=${Timestamp.parse($end_ts).toEpochSeconds()}
+
+  # Define a specific job for daily processing
+  daily:
+    extends: base
+    parameters:
+      - name: processing_datetime
+        type: timestamp
+    environment:
+      - processing_duration=P1D
+
+  # Define a specific job for hourly processing
+  hourly:
+    extends: base
+    parameters:
+      - name: processing_datetime
+    environment:
+      - processing_duration=PT1H
+```
+
+Another project may want to access resources from project A, but within the context of one of its jobs. This
+can be achieved by declaring the dependency in an `imports` section within the project manifest:
+
+```yaml
+# project.yml of another project
+name: raw-exporter
+
+imports:
+  # Import with no job and no (or default) parameters
+  - project: project_b
+
+  # Import project with specified job context
+  - project: project_a
+    # The job may even be a variable, so different job contexts can be imported
+    job: $period
+    arguments:
+      processing_datetime: $processing_datetime
+```
+Then you can easily access entities from `project_a` and `project_b` as follows:
+
+```yaml
+mappings:
+  # You can access all entities from different projects by using the project name followed by a slash ("/")
+  sap_transactions:
+    kind: filter
+    input: project_b/transactions
+    condition: "transaction_code = 8100"
+
+relations:
+  ad_impressions:
+    kind: alias
+    input: project_a/ad_impressions
+
+jobs:
+  main:
+    parameters:
+      - name: processing_datetime
+        type: timestamp
+    environment:
+      # Set the variable $period, so it will be used to import the correct job
+      - period=daily
+```
diff --git a/docs/cookbook/validation.md b/docs/cookbook/validation.md
index 2e6505b34..f449e9cde 100644
--- a/docs/cookbook/validation.md
+++ b/docs/cookbook/validation.md
@@ -1,4 +1,4 @@
-# Validations
+# Pre-build Validations
 
 In many cases, you'd like to perform some validation of input data before you start processing. For example when
 joining data, you often assume some uniqueness constraint on the join key in some tables or mappings. If that
@@ -48,3 +48,16 @@ The example above will validate assumptions on `some_table` mapping, which reads
 All `validate` targets are executed during the [VALIDATE](../lifecycle.md) phase, which is executed before any other
 build phase. If one of these targets fails, Flowman will stop execution and return an error. This helps to prevent
 building invalid data.
+
+
+## Verification
+
+In addition to *validating* data quality before a Flowman job starts its main work in the `CREATE` and
+`BUILD` phase, Flowman also provides the ability to *verify* the results of all data transformations after
+the `BUILD` execution phase, namely in the `VERIFY` phase.
In order to implement a verification, you simply need
+to use a [verify](../spec/target/verify.md) target, which works precisely like the `validate` target with the only
+difference that it is executed after the `BUILD` phase.
+
+Note that when you are concerned about the quality of the data produced by your Flowman job, the `verify` target
+is only one of multiple possibilities to implement meaningful checks. Read more in the
+[data quality cookbook](data-quality.md) about available options.
diff --git a/docs/cookbook/windows.md b/docs/cookbook/windows.md
new file mode 100644
index 000000000..6b8718db8
--- /dev/null
+++ b/docs/cookbook/windows.md
@@ -0,0 +1,33 @@
+# Running Flowman on Windows
+
+Flowman is best run on Linux, especially for production usage. Windows support is at best experimental and will
+probably never be within the focus of the project. Nevertheless, there are also some options for running Flowman on
+Windows, with the main purpose of giving developers a way to create and test projects on their local machines.
+
+The main difficulty in supporting Windows comes from two aspects:
+* Windows doesn't support `bash`, therefore all scripts have been rewritten to run on Windows
+* Hadoop and Spark require some special *Hadoop WinUtils* libraries to be installed
+
+
+## Installing using WinUtils
+The first natural option is to install Flowman directly on your Windows machine. Of course, this also requires a
+working Apache Spark installation. You can download an appropriate version from the
+[Apache Spark homepage](https://spark.apache.org).
+
+Next, you are required to install the *Hadoop Winutils*, which is a set of DLLs required for the Hadoop libraries to
+be working. You can get a copy at https://github.com/kontext-tech/winutils .
+Once you downloaded the appropriate version, you need to place the DLLs into a directory `$HADOOP_HOME/bin`, where
+`HADOOP_HOME` refers to some arbitrary location of your choice on your Windows PC. You also need to set the following
+environment variables:
+* `HADOOP_HOME` should point to the parent directory of the `bin` directory
+* `PATH` should also contain `$HADOOP_HOME/bin`
+
+
+## Using Docker
+A simpler way to run Flowman on Windows is to use a Docker image available on
+[Docker Hub](https://hub.docker.com/repository/docker/dimajix/flowman).
+
+
+## Using WSL
+And of course, you can also simply install a Linux distro of your choice via WSL and then normally
+[install Flowman](../installation.md) within WSL.
diff --git a/docs/documenting/checks.md b/docs/documenting/checks.md
new file mode 100644
index 000000000..8c21af297
--- /dev/null
+++ b/docs/documenting/checks.md
@@ -0,0 +1,106 @@
+# Checking Model Properties
+
+In addition to providing pure descriptions of model entities, the documentation framework in Flowman also provides
+the ability to specify model properties (like unique values in a column, not null, etc.). These properties will not only
+be part of the documentation, they will also be verified as part of generating the documentation.
+ + +## Example + +```yaml +relations: + measurements: + kind: file + format: parquet + location: "$basedir/measurements/" + partitions: + - name: year + type: integer + granularity: 1 + # We prefer to use the inferred schema of the mapping that is written into the relation + schema: + kind: mapping + mapping: measurements_extracted + + documentation: + description: "This model contains all individual measurements" + columns: + - name: year + description: "The year of the measurement, used for partitioning the data" + checks: + - kind: notNull + - name: usaf + checks: + - kind: notNull + - name: wban + checks: + - kind: notNull + - name: air_temperature_qual + checks: + - kind: notNull + - kind: values + values: [0,1,2,3,4,5,6,7,8,9] + - name: air_temperature + checks: + - kind: expression + expression: "air_temperature >= -100 OR air_temperature_qual <> 1" + - kind: expression + expression: "air_temperature <= 100 OR air_temperature_qual <> 1" + # Schema tests, which might involve multiple columns + checks: + kind: foreignKey + relation: stations + columns: + - usaf + - wban + references: + - usaf + - wban +``` + +## Available Column Checks + +Flowman implements a couple of different check types on a per column basis. + +### Not NULL + +One simple but yet important test is to check if a column does not contain any `NULL` values + +* `kind` **(mandatory)** *(string)*: `notNull` + + +### Unique Values + +Another important test is to check for unique values in a column. Note that this test will exclude `NULL` values, +so in many cases you might want to specify both `notNUll` and `unique`. + +* `kind` **(mandatory)** *(string)*: `unique` + + +### Specific Values + +In order to test if a column only contains specific values, you can use the `values` test. Note that this test will +exclude records with `NULL` values in the column, so in many cases you might want to specify both `notNUll` and `values`. + +* `kind` **(mandatory)** *(string)*: `values` +* `values` **(mandatory)** *(list:string)*: List of admissible values + + +### Range of Values + +Especially when working with numerical data, you might also want to check their range. This can be implemented by using +the `range` test. Note that this test will exclude records with `NULL` values in the column, so in many cases you might +want to specify both `notNUll` and `range`. + +* `kind` **(mandatory)** *(string)*: `range` +* `lower` **(mandatory)** *(string)*: Lower value (inclusive) +* `upper` **(mandatory)** *(string)*: Upper value (inclusive) + + +### SQL Expression + +A very flexible test is provided with the SQL expression test. This test allows you to specify any simple SQL expression +(which may also use different columns), which should evaluate to `TRUE` for all records passing the test. + +* `kind` **(mandatory)** *(string)*: `expression` +* `expression` **(mandatory)** *(string)*: Boolean SQL Expression diff --git a/docs/documenting/config.md b/docs/documenting/config.md new file mode 100644 index 000000000..6526b3133 --- /dev/null +++ b/docs/documenting/config.md @@ -0,0 +1,74 @@ +# Configuring the Documentation + +Flowman has a sound default for generating documentation for relations, mappings and targets. But you might want +to explicitly influence the way for what and how documentation is generated. This can be easily done by supplying +a `documentation.yml` file at the root level of your project (so it would be a sibling of the `project.yml` file). 
+ + +## Example + +```yaml +collectors: + # Collect documentation of relations + - kind: relations + # Collect documentation of mappings + - kind: mappings + # Collect documentation of build targets + - kind: targets + # Execute all checks + - kind: checks + +generators: + # Create an output file in the project directory + - kind: file + location: ${project.basedir}/doc + # This will exclude all mappings + excludeMappings: ".*" + excludeRelations: + # You can either specify a name or regular expression (without the project) + - "stations_raw" + # Or can also explicitly specify a name with the project. Note that the entries actually are regular expressions + - ".*/measurements_raw" +``` + +## Collectors + +Flowman uses so called *collectors* which create an internal model of the documentation from the core entities like +relations, mappings and build targets. The default configuration uses the four collectors `relations`, `mappings`, +`targets` and `checks`, with each of them being responsible for one entity type and the last one will execute all +data quality checks. If you really do not require documentation for one of these targets, you may want to simply +remove the corresponding collector from that list. + + +## File Generator Fields + +The generator is used for generating the documentation. You can configure multiple generators for creating multiple +differently configured documentations. + +* `kind` **(mandatory)** *(type: string)*: `file` + +* `location` **(mandatory)** *(type: string)*: Specifies the output location + +* `includeMappings` **(optional)** *(type: list:regex)* *(default: ".*")*: +List of regular expressions which mappings to include. Per default all mappings will be included in the output. +The list of filters will be applied before the `excludeMappings` filter list. + +* `excludeMappings` **(optional)** *(type: list:regex)* + List of regular expressions which mappings to exclude. Per default no mapping will be excluded in the output. + The list of filters will be applied after the `includeMappings` filter list. + +* `includeTargets` **(optional)** *(type: list:regex)* *(default: ".*")*: + List of regular expressions which targets to include. Per default all targets will be included in the output. + The list of filters will be applied before the `excludeTargets` filter list. + +* `excludeTargets` **(optional)** *(type: list:regex)* + List of regular expressions which targets to exclude. Per default no target will be excluded in the output. + The list of filters will be applied after the `includeTargets` filter list. + +* `includeRelations` **(optional)** *(type: list:regex)* *(default: ".*")*: + List of regular expressions which relations to include. Per default all relations will be included in the output. + The list of filters will be applied before the `excludeRelations` filter list. + +* `excludeRelations` **(optional)** *(type: list:regex)* + List of regular expressions which relations to exclude. Per default no relation will be excluded in the output. + The list of filters will be applied after the `includeRelations` filter list. diff --git a/docs/documenting/index.md b/docs/documenting/index.md new file mode 100644 index 000000000..7af0837df --- /dev/null +++ b/docs/documenting/index.md @@ -0,0 +1,70 @@ +# Documenting with Flowman + +Flowman supports to automatically generate a documentation of your project. The documentation can either include all +major entities like mappings, relations and targets. 
Or you may want to focus only on some aspects like the relations +which is useful for providing a documentation of the data model. + +```eval_rst +.. toctree:: + :maxdepth: 1 + :glob: + + * +``` + +![Flowman Documentation](../images/flowman-documentation.png) + +### Providing Descriptions + +Although Flowman will generate many valuable documentation bits by inspecting the project, the most important entities +(relations, mappings and targets) also provide the ability to manually and explicitly add documentation to them. This +documentation will override any automatically inferred information. + + +### Generating Documentation via Command Line + +Generating the documentation is as easy as running [flowexec](../cli/flowexec.md) as follows: + +```shell +flowexec -f my_project_directory documentation generate +``` + +Since generating documentation also requires a job context (which may contain additional parameters and environment +variables), you can also explicitly specify the job which is used for instantiating all entities like relations, +mappings and targets as follows: + +```shell +flowexec -f my_project_directory documentation generate +``` +If no job is specified, Flowman will use the `main` job + + +### Generating Documentation via Build Target + +The section above descirbes how to explicitly generate the project documentation by invoking +`flowexec documentation generate`. As an alternative, Flowman offers a [document](../spec/target/document.md) +targets, which allows one to generate the documentation during the `VERIFY` phase (after the `BUILD` phase has +finished) of a normal Flowman project. + +This can be easily configured as follows + +```yaml +targets: + # This target will create a documentation in the VERIFY phase + doc: + kind: documentation + # We do not specify any additional configuration, so the project's documentation.yml file will be used +``` + +Then you only need to add that build target `doc` to your job as follows: + +```yaml +jobs: + main: + targets: + # List all targets which should be built as part of the `main` job + - measurements + - ... + # Finally add the "doc" job for generating the documentation + - doc +``` diff --git a/docs/documenting/mappings.md b/docs/documenting/mappings.md new file mode 100644 index 000000000..b60893407 --- /dev/null +++ b/docs/documenting/mappings.md @@ -0,0 +1,51 @@ +# Documenting Mappings + +As with other entities, Flowman tries to automatically infer a meaningful documentation for mappings, especially +for the schema of all mappings outputs. But of course this is not always possible especially when mappings perform +complex transformations such that a single output column depends on multiple input columns. Probably the most +complex example is the [SQL](../spec/mapping/sql.md) mapping which allows to implement most complex transformations. + +In order to mitigate this issue, you can explicitly provide additional documentation for mappings via the +`documentation` tag, which is supported by all mappings. 
+ +## Example + +```yaml +mappings: + # Extract multiple columns from the raw measurements data using SQL SUBSTR functions + measurements_extracted: + kind: select + input: measurements_raw + columns: + usaf: "SUBSTR(raw_data,5,6)" + wban: "SUBSTR(raw_data,11,5)" + date: "TO_DATE(SUBSTR(raw_data,16,8), 'yyyyMMdd')" + time: "SUBSTR(raw_data,24,4)" + air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10" + air_temperature_qual: "SUBSTR(raw_data,93,1)" + + documentation: + columns: + - name: usaf + description: "The USAF (US Air Force) id of the weather station" + - name: wban + description: "The WBAN id of the weather station" + - name: date + description: "The date when the measurement was made" + - name: time + description: "The time when the measurement was made" + - name: report_type + description: "The report type of the measurement" + - name: air_temperature + description: "The air temperature in degree Celsius" + - name: air_temperature_qual + description: "The quality indicator of the air temperature. 1 means trustworthy quality." +``` + +## Fields + +* `description` **(optional)** *(type: string)*: A description of the mapping + +* `columns` **(optional)** *(type: schema)*: A documentation of the output schema. Note that Flowman will inspect +the schema of the mapping itself and only overlay the provided documentation. Only fields found in the original +output schema will be documented, so you cannot add fields to the documentation which actually do not exist. diff --git a/docs/documenting/relations.md b/docs/documenting/relations.md new file mode 100644 index 000000000..98de36456 --- /dev/null +++ b/docs/documenting/relations.md @@ -0,0 +1,42 @@ +# Documenting Relations + +As with other entities, Flowman tries to automatically infer a meaningful documentation for mappings, especially +for the schema of a relation. In order to do so, Flowman will query the original data source and look up any +metadata (for example Flowman will pick up column descriptions in the Hive Metastore). + +In order to provide additiona information, you can explicitly provide additional documentation for mappings via the +`documentation` tag, which is supported by all mappings. + +## Example + +```yaml +relations: + aggregates: + kind: file + format: parquet + location: "$basedir/aggregates/" + partitions: + - name: year + type: integer + granularity: 1 + + documentation: + description: "The table contains all aggregated measurements" + columns: + - name: country + description: "Country of the weather station" + - name: min_temperature + description: "Minimum air temperature per year in degrees Celsius" + - name: max_temperature + description: "Maximum air temperature per year in degrees Celsius" + - name: avg_temperature + description: "Average air temperature per year in degrees Celsius" +``` + +## Fields + +* `description` **(optional)** *(type: string)*: A description of the mapping + +* `columns` **(optional)** *(type: schema)*: A documentation of the output schema. Note that Flowman will inspect + the schema of the mapping itself and only overlay the provided documentation. Only fields found in the original + output schema will be documented, so you cannot add fields to the documentation which actually do not exist. diff --git a/docs/documenting/targets.md b/docs/documenting/targets.md new file mode 100644 index 000000000..d65c6589e --- /dev/null +++ b/docs/documenting/targets.md @@ -0,0 +1,17 @@ +# Documenting Targets + +Flowman also supports documenting build targets. 
+ +## Example + +```yaml +targets: + stations: + kind: relation + description: "Write stations" + mapping: stations_raw + relation: stations + + documentation: + description: "This build target is used to write the weather stations" +``` diff --git a/docs/images/flowman-documentation.png b/docs/images/flowman-documentation.png new file mode 100644 index 000000000..d1866bcb1 Binary files /dev/null and b/docs/images/flowman-documentation.png differ diff --git a/docs/index.md b/docs/index.md index bbddbb24b..fefaf6c06 100644 --- a/docs/index.md +++ b/docs/index.md @@ -96,6 +96,8 @@ Flowman also provides optional plugins which extend functionality. You can find installation lifecycle spec/index + testing/index + documenting/index cli/index history-server/index cookbook/index diff --git a/docs/installation.md b/docs/installation.md index 97cf222d0..38076c894 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,6 +1,6 @@ # Installation Guide -## Requirements +## 1. Requirements Flowman brings many dependencies with the installation archive, but everything related to Hadoop or Spark needs to be provided by your platform. This approach ensures that the existing Spark and Hadoop installation is used together @@ -14,10 +14,44 @@ components present on your system: Note that Flowman can be built for different Hadoop and Spark versions, and the major and minor version of the build needs to match the ones of your platform +### Download & Install Spark -## Downloading Flowman +As of this writing, the latest release of Flowman is 0.22.0 and is available prebuilt for Spark 3.2.1 on the Spark +homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it. -Since version 0.14.1, prebuilt releases are provided on the [FLowman Homepage](https://flowman.io) or on +```shell +# Create a nice playground which doesn't mess up your system +mkdir playground +cd playground + +# Download and unpack Spark & Hadoop +curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link +ln -snf spark-3.2.1-bin-hadoop3.2 spark +``` +The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other. + +Once you have installed Spark, you should set the environment variable `SPARK_HOME`, so Flowman can find it +```shell +export SPARK_HOME= +``` +It might be a good idea to add a corresponding line to your `.bashrc` or `.profile`. + +### Download & Install Hadoop Utils for Windows + +If you are trying to run the application on Windows, you also need the *Hadoop Winutils*, which is a set of +DLLs required for the Hadoop libraries to be working. You can get a copy at https://github.com/kontext-tech/winutils . +Once you downloaded the appropriate version, you need to place the DLLs into a directory `$HADOOP_HOME/bin`, where +`HADOOP_HOME` refers to some arbitrary location of your choice on your Windows PC. You also need to set the following +environment variables: +* `HADOOP_HOME` should point to the parent directory of the `bin` directory +* `PATH` should also contain `$HADOOP_HOME/bin` + +The documentation contains a [dedicated section for Windows users](cookbook/windows.md) + + +## 2. Downloading Flowman + +Since version 0.14.1, prebuilt releases are provided on the [Flowman Homepage](https://flowman.io) or on [GitHub](https://github.com/dimajix/flowman/releases). This probably is the simplest way to grab a working Flowman package. 
Note that for each release, there are different packages being provided, for different Spark and Hadoop versions. The naming is very simple: @@ -35,15 +69,14 @@ https://github.com/dimajix/flowman/releases/download/0.20.1/flowman-dist-0.20.1- ``` - -## Building Flowman +### Building Flowman As an alternative to downloading a pre-built distribution of Flowman, you might also want to [build Flowman](building.md) yourself in order to match your environment. A task which is not difficult for someone who has basic experience with Maven. -## Local Installation +## 3. Local Installation Flowman is distributed as a `tar.gz` file, which simply needs to be extracted at some location on your computer or server. This can be done via @@ -57,8 +90,6 @@ tar xvzf flowman-dist-X.Y.Z-bin.tar.gz ├── bin ├── conf ├── examples -│   ├── plugin-example -│   │   └── job │   ├── sftp-upload │   │   ├── config │   │   ├── data @@ -89,7 +120,7 @@ tar xvzf flowman-dist-X.Y.Z-bin.tar.gz * The `examples` directory contains some example projects -## Configuration +## 4. Configuration (optional) You probably need to perform some basic global configuration of Flowman. The relevant files are stored in the `conf` directory. @@ -187,7 +218,7 @@ plugins: ### `default-namespace.yml` On top of the very global settings, Flowman also supports so called *namespaces*. Each project is executed within the -context of one namespace, if nothing else is specified the *defautlt namespace*. Each namespace contains some +context of one namespace, if nothing else is specified the *default namespace*. Each namespace contains some configuration, such that different namespaces might represent different tenants or different staging environments. #### Example @@ -229,9 +260,27 @@ store: ``` -## Running in a Kerberized Environment +## 5. Running Flowman + +Now when you have installed Spark and Flowman, you can easily start Flowman via +```shell +cd +export SPARK_HOME= + +bin/flowshell -f examples/weather +``` + + +## 6. Related Topics + +### Running Flowman on Windows +Please have a look at [Running Flowman on Windows](cookbook/windows.md) for detailed information. + + +### Running in a Kerberized Environment Please have a look at [Kerberos](cookbook/kerberos.md) for detailed information. -## Deploying with Docker -It is also possible to run Flowman inside Docker. This simply requires a Docker image with a working Spark and -Hadoop installation such that Flowman can be installed inside the image just as it is installed locally. + +### Running in Docker +It is also possible to run Flowman inside Docker. We now also provide some images at +[Docker Hub](https://hub.docker.com/repository/docker/dimajix/flowman) diff --git a/docs/lifecycle.md b/docs/lifecycle.md index b590542cd..89f68fc8a 100644 --- a/docs/lifecycle.md +++ b/docs/lifecycle.md @@ -5,7 +5,7 @@ multiple different phases, each of them representing one stage of the whole life ## Lifecycle Phases -The full lifecycle consists out of specific phases, as follows: +The full lifecycle consists out of specific execution phases, as follows: 1. **VALIDATE**. This first phase is used for validation and any error will stop the next steps. A validation step might for example @@ -29,7 +29,7 @@ some specific user defined tests that compare data. If verification fails, the b tables (i.e. it will delete data), but it will keep tables alive. 6. **DESTROY**. -The final phase *destroy* is used to phyiscally remove relations including their data. 
This will also remove table
+The final phase *destroy* is used to physically remove relations including their data. This will also remove table
 definitions, views and directories. It performs the opposite operation than the *create* phase.
 
diff --git a/docs/plugins/aws.md b/docs/plugins/aws.md
index 65a8385ee..24de19a17 100644
--- a/docs/plugins/aws.md
+++ b/docs/plugins/aws.md
@@ -1 +1,13 @@
 # AWS Plugin
+
+The AWS plugin does not provide new entity types to Flowman, but provides compatibility with the S3 object
+store, so that it can be used as a data source or sink via the `s3a` file system.
+
+
+## Activation
+
+The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md)
+```yaml
+plugins:
+  - flowman-aws
+```
diff --git a/docs/plugins/azure.md b/docs/plugins/azure.md
index f1da18c0c..b4b0c23f2 100644
--- a/docs/plugins/azure.md
+++ b/docs/plugins/azure.md
@@ -2,3 +2,12 @@
 
 The Azure plugin mainly provides the ADLS (Azure DataLake Filesystem) and ABS (Azure Blob Filesystem) to be used
 as the storage layer.
+
+
+## Activation
+
+The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md)
+```yaml
+plugins:
+  - flowman-azure
+```
diff --git a/docs/plugins/delta.md b/docs/plugins/delta.md
index a60e251a5..795c69afd 100644
--- a/docs/plugins/delta.md
+++ b/docs/plugins/delta.md
@@ -12,3 +12,12 @@ move to Spark 3.0+.
 * [`deltaTable` relation](../spec/relation/deltaTable.md)
 * [`deltaFile` relation](../spec/relation/deltaFile.md)
 * ['deltaVacuum' target](../spec/target/deltaVacuum.md)
+
+
+## Activation
+
+The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md)
+```yaml
+plugins:
+  - flowman-delta
+```
diff --git a/docs/plugins/json.md b/docs/plugins/json.md
index 878164a32..0aff7043a 100644
--- a/docs/plugins/json.md
+++ b/docs/plugins/json.md
@@ -1,4 +1,16 @@
 # JSON Plugin
 
+The JSON plugin provides compatibility with JSON schema definition files.
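As a small sketch of what this enables, a relation can reference a JSON schema definition file via the `json` schema
kind provided by this plugin (the schema file name below is hypothetical, and the `file` property is assumed to work
analogously to the other schema kinds):

```yaml
relations:
  users:
    kind: file
    format: json
    location: "$basedir/users/"
    schema:
      # Schema kind provided by the JSON plugin (hypothetical schema file)
      kind: json
      file: "${project.basedir}/schema/users-schema.json"
```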
+ + ## Provided Entities * [`json` schema](../spec/schema/json.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-json +``` diff --git a/docs/plugins/kafka.md b/docs/plugins/kafka.md index 94cf05d3d..b8d72849b 100644 --- a/docs/plugins/kafka.md +++ b/docs/plugins/kafka.md @@ -2,3 +2,12 @@ ## Provided Entities * [`kafka` relation](../spec/relation/kafka.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-kafka +``` diff --git a/docs/plugins/mariadb.md b/docs/plugins/mariadb.md index 28b2a6e43..346c4c5de 100644 --- a/docs/plugins/mariadb.md +++ b/docs/plugins/mariadb.md @@ -1 +1,12 @@ # MariaDB Plugin + +The MariaDB plugin mainly provides a JDBC driver to access MariaDB databases via the [JDBC relation](../spec/relation/jdbc.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-mariadb +``` diff --git a/docs/plugins/mssql.md b/docs/plugins/mssql.md deleted file mode 100644 index e456e271f..000000000 --- a/docs/plugins/mssql.md +++ /dev/null @@ -1 +0,0 @@ -# MS SQL Server Plugin diff --git a/docs/plugins/mssqlserver.md b/docs/plugins/mssqlserver.md new file mode 100644 index 000000000..9237d7a57 --- /dev/null +++ b/docs/plugins/mssqlserver.md @@ -0,0 +1,19 @@ +# MS SQL Server Plugin + +The MS SQL Server plugin provides a JDBC driver to access MS SQL Server and Azure SQL Server databases via +the [JDBC relation](../spec/relation/jdbc.md). Moreover, it also provides a specialized +[`sqlserver` relation](../spec/relation/sqlserver.md) which uses bulk copy to speed up writing process and it +also uses temp tables to encapsulate the whole data upload within a transaction. + + +## Provided Entities +* [`sqlserver` relation](../spec/relation/sqlserver.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-mssqlserver +``` diff --git a/docs/plugins/mysql.md b/docs/plugins/mysql.md index 17a308daa..802afa606 100644 --- a/docs/plugins/mysql.md +++ b/docs/plugins/mysql.md @@ -1 +1,12 @@ # MySQL Plugin + +The MySQL plugin mainly provides a JDBC driver to access MySQL databases via the [JDBC relation](../spec/relation/jdbc.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-mysql +``` diff --git a/docs/plugins/openapi.md b/docs/plugins/openapi.md index 5e8a8ba28..ca5222993 100644 --- a/docs/plugins/openapi.md +++ b/docs/plugins/openapi.md @@ -1,4 +1,16 @@ # OpenAPI Plugin +The OpenAPI plugin provides compatibility with OpenAPI schema definition files. + + ## Provided Entities * [`openApi` schema](../spec/schema/open-api.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-openapi +``` diff --git a/docs/plugins/swagger.md b/docs/plugins/swagger.md index 623194122..a2b390371 100644 --- a/docs/plugins/swagger.md +++ b/docs/plugins/swagger.md @@ -1,4 +1,16 @@ # Swagger Plugin +The Swagger plugin provides compatibility with Swagger schema definition files. 
+ + ## Provided Entities * [`swagger` schema](../spec/schema/swagger.md) + + +## Activation + +The plugin can be easily activated by adding the following section to the [default-namespace.yml](../spec/namespace.md) +```yaml +plugins: + - flowman-swagger +``` diff --git a/docs/quickstart.md b/docs/quickstart.md index 0a429f9d9..9603d53bc 100644 --- a/docs/quickstart.md +++ b/docs/quickstart.md @@ -14,7 +14,7 @@ Fortunately, Spark is rather simple to install locally on your machine: ### Download & Install Spark -As of this writing, the latest release of Flowman is 0.18.0 and is available prebuilt for Spark 3.1.2 on the Spark +As of this writing, the latest release of Flowman is 0.22.0 and is available prebuilt for Spark 3.2.1 on the Spark homepage. So we download the appropriate Spark distribution from the Apache archive and unpack it. ```shell @@ -22,8 +22,8 @@ homepage. So we download the appropriate Spark distribution from the Apache arch mkdir playground cd playground# Download and unpack Spark & Hadoop -curl -L https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link -ln -snf spark-3.1.2-bin-hadoop3.2 spark +curl -L https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz | tar xvzf -# Create a nice link +ln -snf spark-3.2.1-bin-hadoop3.2 spark ``` The Spark package already contains Hadoop, so with this single download you already have both installed and integrated with each other. @@ -31,12 +31,12 @@ The Spark package already contains Hadoop, so with this single download you alre ## 2. Install Flowman You find prebuilt Flowman packages on the corresponding release page on GitHub. For this quickstart, we chose -`flowman-dist-0.17.0-oss-spark3.0-hadoop3.2-bin.tar.gz` which nicely fits to the Spark package we just downloaded before. +`flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz` which nicely fits to the Spark package we just downloaded before. ```shell # Download and unpack Flowman -curl -L https://github.com/dimajix/flowman/releases/download/0.17.0/flowman-dist-0.18.0-oss-spark3.1-hadoop3.2-bin.tar.gz | tar xvzf -# Create a nice link -ln -snf flowman-0.18.0 flowman +curl -L https://github.com/dimajix/flowman/releases/download/0.22.0/flowman-dist-0.22.0-oss-spark3.2-hadoop3.3-bin.tar.gz | tar xvzf -# Create a nice link +ln -snf flowman-0.22.0 flowman ``` ### Flowman Configuration @@ -68,13 +68,9 @@ That’s all we need to run the Flowman example. ## 3. Flowman Shell -The example data is stored in a S3 bucket provided by myself. In order to access the data, you should provide valid -AWS credentials in your environment (not needed any more, since the example uses anonymous authentication): - -```shell -$ export AWS_ACCESS_KEY_ID= -$ export AWS_SECRET_ACCESS_KEY= -``` +The example data is stored in a S3 bucket provided by myself. 
Since the data is publicly available and the project is
+configured to use anonymous AWS authentication, you do not need to provide your AWS credentials (you do not even
+need to have an AWS account).
 
 ### Start interactive Flowman shell
 
diff --git a/docs/spec/assertion/schema.md b/docs/spec/assertion/schema.md
index 1d21bce01..3ba9ac637 100644
--- a/docs/spec/assertion/schema.md
+++ b/docs/spec/assertion/schema.md
@@ -17,7 +17,7 @@ columns:
 ```yaml
 kind: schema
 mapping: some_mapping
-ignoreNullable: false
+ignoreNullability: false
 schema:
   kind: inline
   fields:
@@ -35,7 +35,7 @@ targets:
       assertions:
         - kind: schema
           mapping: some_mapping
-          ignoreNullable: true
+          ignoreNullability: true
          ignoreCase: false
          ignoreOrder: true
          schema:
diff --git a/docs/spec/hooks/report.md b/docs/spec/hooks/report.md
index c9c26db71..bd3dd2215 100644
--- a/docs/spec/hooks/report.md
+++ b/docs/spec/hooks/report.md
@@ -1,10 +1,40 @@
 # Report Hook
 
+The `report` hook will create a textual report file containing information on the execution. As with all hooks, it can
+be added either on the namespace level or on the job level.
+
 ## Example
 
 ```yaml
 job:
   main:
     hooks:
       - kind: report
-        location: file:///tmp/my-report.txt
+        location: ${project.basedir}/generated-report.txt
+        metrics:
+          # Define common labels for all metrics
+          labels:
+            project: ${project.name}
+          metrics:
+            # This metric contains the number of records per output
+            - name: output_records
+              selector:
+                name: target_records
+                labels:
+                  category: target
+              labels:
+                target: ${name}
+            # This metric contains the processing time per output
+            - name: output_time
+              selector:
+                name: target_runtime
+                labels:
+                  category: target
+              labels:
+                target: ${name}
+            # This metric contains the overall processing time
+            - name: processing_time
+              selector:
+                name: job_runtime
+                labels:
+                  category: job
 ```
diff --git a/docs/spec/index.md b/docs/spec/index.md
index 602af6927..4ef18e91f 100644
--- a/docs/spec/index.md
+++ b/docs/spec/index.md
@@ -1,4 +1,4 @@
-# Project Specification
+# **Reference Documentation**
 
 Flowman uses so called *flow specifications* which contain all details of data transformations
 and additional processing steps. Flow specifications are provided by the user as (potentially
diff --git a/docs/spec/metric/jdbc.md b/docs/spec/metric/jdbc.md
new file mode 100644
index 000000000..11be79bbf
--- /dev/null
+++ b/docs/spec/metric/jdbc.md
@@ -0,0 +1,46 @@
+# JDBC Metric Sink
+
+The `jdbc` metric sink is a very simple sink which stores all collected metrics in a relational database. It is
+still highly recommended to set up proper monitoring with Prometheus or another supported and established
+monitoring system instead of relying on a relational database.
+
+
+## Example
+
+```yaml
+metrics:
+  # Also add console metric sink (this is optional, but recommended)
+  - kind: console
+  # Now configure the JDBC metric sink
+  - kind: jdbc
+    # Specify labels on a commit level
+    labels:
+      project: ${project.name}
+      version: ${project.version}
+    connection:
+      kind: jdbc
+      driver: "com.mysql.cj.jdbc.Driver"
+      url: "jdbc:mysql://mysql-01.ffm.dimajix.net/dimajix_flowman"
+      username: "flowman-metrics"
+      password: "my-secret-password"
+```
+
+
+## Fields
+
+* `kind` **(mandatory)** *(string)*: `jdbc`
+
+* `connection` **(mandatory)** *(string/connection)*: Either the name of a [`jdbc` connection](../connection/jdbc.md)
+or a directly embedded JDBC connection (like in the example).
+ +* `commitTable` **(optional)** *(string)* *(default: flowman_metric_commits)*: The name of the table which will +get one entry ("commit") per publication of metrics. + +* `commitLabelTable` **(optional)** *(string)* *(default: flowman_metric_commit_labels)*: The name of the table which will + contain the labels of each commit. + +* `metricTable` **(optional)** *(string)* *(default: flowman_metrics)*: The name of the table which will contain +the metrics. + +* `metricLabelTable` **(optional)** *(string)* *(default: flowman_metric_labels)*: The name of the table which will +contain the labels of each metric. diff --git a/docs/spec/metric/prometheus.md b/docs/spec/metric/prometheus.md index 094ae51ab..1b81d8270 100644 --- a/docs/spec/metric/prometheus.md +++ b/docs/spec/metric/prometheus.md @@ -1,5 +1,9 @@ # Prometheus Metric Sink +The `prometheus` metric sink allows you to publish collected metrics to a Prometheus push gateway. This then can +be scraped by a Prometheus server. + + ## Example The following example configures a prometheus sink in a namespace. You would need to include this snippet for example in the `default-namespace.yml` in the Flowman configuration directory @@ -18,3 +22,10 @@ metrics: ``` ## Fields + +* `kind` **(mandatory)** *(string)*: `prometheus` + +* `url` **(mandatory)** *(string)*: Specifies the URL of the prometheus push gateway + +* `labels` **(optional)** *(map)*: Specifies an additional set of labels to be pushed to prometheus. This set +of labels will determine the path in Prometheus push gateway, under which all metrics will be atomically published. diff --git a/docs/spec/project.md b/docs/spec/project.md index 9cea1c64a..68c839b13 100644 --- a/docs/spec/project.md +++ b/docs/spec/project.md @@ -1,12 +1,15 @@ # Projects +The specification of all relations, data transformations and build targets is done within Flowman projects. Each +project has a top level project descriptor which mainly contains some meta information like project name and +version and a list of subdirectories, which contain the entity definitions. + ## Project Specification -Flowman always requires a *Project* top level file containing general information (like a -projects name and version) and directories where to look for specifications. The project -file should be named `project.yml`, this way `flowexec` will directly pick it up when only -the directory is given on the command line. +Flowman always requires a *Project* top level file containing general information (like a projects name and version) +and directories where to look for specifications. The project file should be named `project.yml`, this way `flowexec` +and `flowshell` will directly pick it up when only the directory is given on the command line. A typical `project.yml` file looks as follows: @@ -21,6 +24,13 @@ modules: - mapping - target - job + +imports: + - project: other_project + + - project: commons + arguments: + processing_date: $processing_date ``` ## Fields @@ -28,21 +38,24 @@ modules: Each project supports the following fields: * `name` **(mandatory)** *(string)* -The name of the overall project. This field will be used in a later Flowman version for -sharing mappings and relations between different projects. +The name of the overall project. This field is used by Flowman for sharing mappings and relations between different +projects. 
 * `version` **(optional)** *(string)*
-The version currently is not used by Flowman, but can be used for the end-user to help keeping
-track of which version of a project is currently being used.
+The version is currently not used by Flowman, but can be used by the end-user to help keep track of which version
+of a project is currently being used.
 
 * `description` **(optional)** *(string)*
 A description of the overall project. Can be any text, is not used by Flowman otherwise
 
-* `modules` **(mandatory)** *(list)*
-The `modules` secion contains a list of *subdirectories* or *filenames* where Flowman should
-search for more YAML specification files. This helps to organize complex projects into
-different modules and/or aspects. The directory and file names are relative to the project
-file itself.
+* `modules` **(mandatory)** *(list:string)*
+The `modules` section contains a list of *subdirectories* or *filenames* where Flowman should search for more YAML
+specification files. This helps to organize complex projects into different modules and/or aspects. The directory and
+file names are relative to the project file itself.
+
+* `imports` **(optional)** *(list:import)*
+Within the `imports` section you can specify different projects to be imported and made available for referencing
+their entities.
 
 
 ## Proposed Directory Layout
diff --git a/docs/spec/relation/deltaFile.md b/docs/spec/relation/deltaFile.md
index 90bf11f55..bab75bfad 100644
--- a/docs/spec/relation/deltaFile.md
+++ b/docs/spec/relation/deltaFile.md
@@ -99,7 +99,8 @@ The `deltaFile` relation supports the following output modes in a [`relation` ta
 | `overwrite_dynamic` | no | - |
 | `append` | yes | Append new records to the existing table |
 | `update` | yes | Updates existing records, either using `mergeKey` or the primary key of the specified `schema` |
-| `merge` | no | - |
+
+In addition, the `deltaFile` relation also supports complex merge operations in a [`merge` target](../target/merge.md).
 
 ### Stream Writing
 In addition to batch writing, the Delta file relation also supports stream writing via the
diff --git a/docs/spec/relation/deltaTable.md b/docs/spec/relation/deltaTable.md
index aa01e2f80..563ce8c03 100644
--- a/docs/spec/relation/deltaTable.md
+++ b/docs/spec/relation/deltaTable.md
@@ -114,7 +114,8 @@ The `deltaTable` relation supports the following output modes in a [`relation` t
 | `overwrite_dynamic` | no | - |
 | `append` | yes | Append new records to the existing table |
 | `update` | yes | Updates existing records, either using `mergeKey` or the primary key of the specified `schema` |
-| `merge` | no | - |
+
+In addition, the `deltaTable` relation also supports complex merge operations in a [`merge` target](../target/merge.md).
 
 ### Stream Writing
 In addition to batch writing, the Delta table relation also supports stream writing via the
diff --git a/docs/spec/relation/file.md b/docs/spec/relation/file.md
index 8f5393227..3fec6f44f 100644
--- a/docs/spec/relation/file.md
+++ b/docs/spec/relation/file.md
@@ -12,7 +12,9 @@ relations:
     format: "csv"
     # Specify the base directory where all data is stored. This location does not include the partition pattern
     location: "${export_dir}"
-    # Specify the pattern how to identify files and/or partitions. This pattern is relative to the `location`
+    # You could specify the pattern used to identify files and/or partitions. This pattern is relative to the `location`.
+ # Actually, it is highly recommended NOT to explicitly specify a partition pattern for outgoing relations + # and let Spark generate this according to the Hive standard. pattern: "${export_pattern}" # Set format specific options options: @@ -67,14 +69,18 @@ relations: column has a name and a type and optionally a granularity. Normally the partition columns are separate from the schema, but you *may* also include the partition column in the schema, although this is not considered to be best practice. But it turns out to be quite useful in combination with dynamically writing to multiple partitions. - + * `pattern` **(optional)** *(string)* *(default: empty)*: This field specifies the directory and/or file name pattern to access specific partitions. Please see the section [Partitioning](#Partitioning) below. +## Automatic Migrations +The `file` relation does not support any automatic migration like adding/removing columns. + + ## Schema Conversion -The file relation fully supports automatic schema conversion on input and output operations as described in the +The `file` relation fully supports automatic schema conversion on input and output operations as described in the corresponding section of [relations](index.md). @@ -91,7 +97,6 @@ The `file` relation supports the following output modes in a [`relation` target] | `overwrite_dynamic` | yes | Overwrite only partitions dynamically determined by the data itself | | `append` | yes | Append new records to the existing files | | `update` | no | - | -| `merge` | no | - | ### Stream Writing In addition to batch writing, the file relation also supports stream writing via the @@ -124,7 +129,10 @@ in all situations where only schema information is required. ### Partitioning -Flowman also supports partitioning, i.e. written to different sub directories. +Flowman also supports partitioning, i.e. written to different subdirectories. You can explicitly specify a *partition +pattern* via the `pattern` field, but it is highly recommended to NOT explicitly set this field and let Spark manage +partitions itself. This way Spark can infer partition values from directory names and will also list directories more +efficiently. ### Writing to Dynamic Partitions diff --git a/docs/spec/relation/generic.md b/docs/spec/relation/generic.md index b81a1a2d7..3b71c35f4 100644 --- a/docs/spec/relation/generic.md +++ b/docs/spec/relation/generic.md @@ -34,3 +34,7 @@ relations: * `format` **(optional)** *(string)* *(default: empty)*: Specifies the name of the Spark data source format to use. + + +## Automatic Migrations +The `generic` relation does not support any automatic migration like adding/removing columns. diff --git a/docs/spec/relation/hiveTable.md b/docs/spec/relation/hiveTable.md index 029651abe..aaeea8d29 100644 --- a/docs/spec/relation/hiveTable.md +++ b/docs/spec/relation/hiveTable.md @@ -154,7 +154,6 @@ The `hive` relation supports the following output modes in a [`relation` target] | `overwrite_dynamic` | yes | Overwrite only the partitions dynamically inferred from the data. 
| | `append` | yes | Append new records to the existing table | | `update` | no | - | -| `merge` | no | - | ## Remarks diff --git a/docs/spec/relation/hiveUnionTable.md b/docs/spec/relation/hiveUnionTable.md index 10a13a2de..9aefdfdea 100644 --- a/docs/spec/relation/hiveUnionTable.md +++ b/docs/spec/relation/hiveUnionTable.md @@ -116,14 +116,13 @@ following changes to a data schema are supported ## Output Modes The `hiveUnionTable` relation supports the following output modes in a [`relation` target](../target/relation.md): -|Output Mode |Supported | Comments| ---- | --- | --- -|`errorIfExists`|yes|Throw an error if the Hive table already exists| -|`ignoreIfExists`|yes|Do nothing if the Hive table already exists| -|`overwrite`|yes|Overwrite the whole table or the specified partitions| -|`append`|yes|Append new records to the existing table| -|`update`|no|-| -|`merge`|no|-| +| Output Mode | Supported | Comments | +|------------------|-----------|-------------------------------------------------------| +| `errorIfExists` | yes | Throw an error if the Hive table already exists | +| `ignoreIfExists` | yes | Do nothing if the Hive table already exists | +| `overwrite` | yes | Overwrite the whole table or the specified partitions | +| `append` | yes | Append new records to the existing table | +| `update` | no | - | ## Remarks diff --git a/docs/spec/relation/jdbcTable.md b/docs/spec/relation/jdbcTable.md new file mode 100644 index 000000000..0d04491a3 --- /dev/null +++ b/docs/spec/relation/jdbcTable.md @@ -0,0 +1,175 @@ +# JDBC Table Relations + +The `jdbcTable` relation allows you to access databases using a JDBC driver. Note that you need to put an appropriate JDBC +driver onto the classpath of Flowman. This can be done by using an appropriate plugin. + + +## Example + +```yaml +# First specify a connection. This can be used by multiple JDBC relations +connections: + frontend: + kind: jdbc + driver: "$frontend_db_driver" + url: "$frontend_db_url" + username: "$frontend_db_username" + password: "$frontend_db_password" + +relations: + frontend_users: + kind: jdbcTable + # Specify the name of the connection to use + connection: frontend + # Specify the table + table: "users" + schema: + kind: avro + file: "${project.basedir}/schema/users.avsc" + primaryKey: + - user_id + indexes: + - name: "users_idx0" + columns: [user_first_name, user_last_name] +``` +It is also possible to directly embed the connection as follows: +```yaml +relations: + frontend_users: + kind: jdbcTable + # Specify the name of the connection to use + connection: + kind: jdbc + driver: "$frontend_db_driver" + url: "$frontend_db_url" + username: "$frontend_db_username" + password: "$frontend_db_password" + # Specify the table + table: "users" +``` +For most cases, it is recommended not to embed the connection, since this prevents reusing the same connection in +multiple places. 
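As a small sketch of such reuse, two relations can simply reference the same named connection (the second table name
is made up for illustration):

```yaml
relations:
  frontend_users:
    kind: jdbcTable
    connection: frontend
    table: "users"

  # A second relation reusing the very same connection definition (hypothetical table)
  frontend_orders:
    kind: jdbcTable
    connection: frontend
    table: "orders"
```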
+ +It is also possible to access the results of an arbitrary SQL query, which is executed inside the target database: +```yaml +relations: + lineitem: + kind: jdbc + connection: frontend + query: " + SELECT + CONCAT('DIR_', li.id) AS lineitem, + li.campaign_id AS campaign, + IF(c.demand_type_system = 1, 'S', IF(li.demand_type_system = 1, 'S', 'D')) AS demand_type + FROM + line_item AS li + INNER JOIN + campaign c + ON c.id = li.campaign_id + " + schema: + kind: embedded + fields: + - name: lineitem + type: string + - name: campaign + type: long + - name: demand_type + type: string +``` +The schema is still optional in this case, but it will help [mocking](mock.md) the relation for unittests. + + +## Fields + * `kind` **(mandatory)** *(type: string)*: `jdbcTable` or `jdbc` + + * `schema` **(optional)** *(type: schema)* *(default: empty)*: + Explicitly specifies the schema of the JDBC source. Alternatively Flowman will automatically + try to infer the schema. + + * `primaryKey` **(optional)** *(type: list)* *(default: empty)*: +List of columns which form the primary key. This will be used when Flowman creates the table, and this will also be used +as the fallback for merge/upsert operations, when no `mergeKey` and no explicit merge condition is specified. + + * `mergeKey` **(optional)** *(type: list)* *(default: empty)*: + List of columns which will be used as default condition for merge and upsert operations. The main difference to + `primaryKey` is that these columns will not be used as a primary key for creating the table. + + * `description` **(optional)** *(type: string)* *(default: empty)*: + A description of the relation. This is purely for informational purpose. + + * `connection` **(mandatory)** *(type: string)*: + The *connection* field specifies the name of a [Connection](../connection/index.md) + object which has to be defined elsewhere. + + * `database` **(optional)** *(type: string)* *(default: empty)*: + Defines the Hive database where the table is defined. When no database is specified, the + table is accessed without any specific qualification, meaning that the default database + will be used or the one specified in the connection. + + * `table` **(optional)** *(type: string)*: + Specifies the name of the table in the relational database. You either need to specify this `table` property +or the `query` property. + + * `query` **(optional)** *(type: string)*: +As an alternative to directly accessing a table, you can also specify an SQL query which will be executed by the +database for retrieving data. Of course, then only read operations are possible. You either need to specify this +`query` property or the `table` property. + + * `properties` **(optional)** *(type: map:string)* *(default: empty)*: + Specifies any additional properties passed to the JDBC connection. Note that both the JDBC + relation and the JDBC connection can define properties. So it is advisable to define all + common properties in the connection and more table specific properties in the relation. + The connection properties are applied first, then the relation properties. This means that + a relation property can overwrite a connection property if it has the same name. + + * `indexes` **(optional)** *(type: list:index)* *(default: empty)*: + Specifies a list of database indexes to be created. Each index has the properties `name`, `columns` and `unique`. 
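For example, a small sketch of the `indexes` property as described above, shown with the indentation used inside a
relation definition (the column names are hypothetical):

```yaml
    indexes:
      # Ordinary index spanning two columns
      - name: "users_idx0"
        columns: [user_first_name, user_last_name]
      # Unique index on a single column
      - name: "users_idx1"
        columns: [user_email]
        unique: true
```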
+ + +## Automatic Migrations +Flowman supports some automatic migrations, specifically with the migration strategies `ALTER`, `ALTER_REPLACE` +and `REPLACE` (those can be set via the global config variable `flowman.default.relation.migrationStrategy`, +see [configuration](../../config.md) for more details). + +The migration strategy `ALTER` supports the following alterations for JDBC relations: +* Changing nullability +* Adding new columns +* Dropping columns +* Changing the column type +* Adding / dropping indexes +* Changing the primary key + +Note that although Flowman will try to apply these changes, not all SQL databases support all of these changes in +all variations. Therefore, it may well be the case, that the SQL database will fail performing these changes. If +the migration strategy is set to `ALTER_REPLACE`, then Flowman will fall back to trying to replace the whole table +altogether on *any* non-recoverable exception during migration. + + +## Schema Conversion +The JDBC relation fully supports automatic schema conversion on input and output operations as described in the +corresponding section of [relations](index.md). + + +## Output Modes +The `jdbcTable` relation supports the following output modes in a [`relation` target](../target/relation.md): + +| Output Mode | Supported | Comments | +|---------------------|-----------|--------------------------------------------------------------| +| `errorIfExists` | yes | Throw an error if the JDBC table already exists | +| `ignoreIfExists` | yes | Do nothing if the JDBC table already exists | +| `overwrite` | yes | Overwrite the whole table or the specified partitions | +| `overwrite_dynamic` | no | - | +| `append` | yes | Append new records to the existing table | +| `update` | yes | Perform upsert operations using the merge key or primary key | + +In addition, the `jdbcTable` relation also supports complex merge operations in a [`merge` target](../target/merge.md). + + +## Remarks + +Note that Flowman will rely on schema inference in some important situations, like [mocking](mock.md) and generally +for describing the schema of a relation. This might create unwanted connections to the physical data source, +particular in case of self-contained tests. To prevent Flowman from creating a connection to the physical data +source, you simply need to explicitly specify a schema, which will then be used instead of the physical schema +in all situations where only schema information is required. diff --git a/docs/spec/relation/kafka.md b/docs/spec/relation/kafka.md index 51aa1bc3c..7193de879 100644 --- a/docs/spec/relation/kafka.md +++ b/docs/spec/relation/kafka.md @@ -61,22 +61,21 @@ List of Kafka bootstrap servers to contact. 
This list does not need to be exhaus ### Batch Writing The `kafa` relation supports the following output modes in a [`relation` target](../target/relation.md): -|Output Mode |Supported | Comments| ---- | --- | --- -|`errorIfExists`|yes|Throw an error if the Kafka topic already exists| -|`ignoreIfExists`|yes|Do nothing if the Kafka topic already exists| -|`overwrite`|no|-| -|`overwrite_dynamic`|no|-| -|`append`|yes|Append new records to the existing Kafka topic| -|`update`|no|-| -|`merge`|no|-| +| Output Mode | Supported | Comments | +|---------------------|-----------|--------------------------------------------------| +| `errorIfExists` | yes | Throw an error if the Kafka topic already exists | +| `ignoreIfExists` | yes | Do nothing if the Kafka topic already exists | +| `overwrite` | no | - | +| `overwrite_dynamic` | no | - | +| `append` | yes | Append new records to the existing Kafka topic | +| `update` | no | - | ### Stream Writing In addition to batch writing, the Kafka relation also supports stream writing via the [`stream` target](../target/stream.md) with the following semantics: -|Output Mode |Supported | Comments| ---- | --- | --- -|`append`|yes|Append new records from the streaming process once they don't change any more| -|`update`|yes|Append records every time they are updated| -|`complete`|no|-| +| Output Mode | Supported | Comments | +|-------------|-----------|-------------------------------------------------------------------------------| +| `append` | yes | Append new records from the streaming process once they don't change any more | +| `update` | yes | Append records every time they are updated | +| `complete` | no | - | diff --git a/docs/spec/relation/local.md b/docs/spec/relation/local.md index ed1ad8da9..0cbbfe449 100644 --- a/docs/spec/relation/local.md +++ b/docs/spec/relation/local.md @@ -62,8 +62,19 @@ whole lifecycle of the directory for you. This means that * The directory specified in `location` will be truncated or individual partitions will be dropped during `clean` phase * The directory specified in `location` tables will be removed during `destroy` phase + +## Automatic Migrations +The `local` relation does not support any automatic migration like adding/removing columns. + + ## Supported File Format +The `local` relation only supports a very limited set of file formats (currently only `CSV` files) + ### CSV + ## Partitioning + +The `local` relation also supports partitioning by storing different partitions in separate files or subdirectories. +You need to explicitly specify a *partition pattern* via the `pattern` field. diff --git a/docs/spec/relation/jdbc.md b/docs/spec/relation/sqlserver.md similarity index 73% rename from docs/spec/relation/jdbc.md rename to docs/spec/relation/sqlserver.md index a35b8c897..dea3cb7ed 100644 --- a/docs/spec/relation/jdbc.md +++ b/docs/spec/relation/sqlserver.md @@ -1,24 +1,31 @@ -# JDBC Relations +# SQL Server Relations -The JDBC relation allows you to access databases using a JDBC driver. Note that you need to put an appropriate JDBC -driver onto the classpath of Flowman. This can be done by using an appropriate plugin. +The SQL Server relation allows you to access MS SQL Server and Azure SQL databases using a JDBC driver. It uses the +`spark-sql-connector` from Microsoft to speed up processing. 
The `sqlserver` relation will also make use of a +global temporary table as an intermediate staging target and then atomically replace the contents of the target +table with the contents of the temp table within a single transaction. + + +## Plugin + +This relation type is provided as part of the [`flowman-mssql` plugin](../../plugins/mssql.md), which needs to be enabled +in your `namespace.yml` file. See [namespace documentation](../namespace.md) for more information for configuring plugins. ## Example ```yaml -# First specify a connection. This can be used by multiple JDBC relations +# First specify a connection. This can be used by multiple SQL Server relations connections: frontend: kind: jdbc - driver: "$frontend_db_driver" url: "$frontend_db_url" username: "$frontend_db_username" password: "$frontend_db_password" relations: frontend_users: - kind: jdbc + kind: sqlserver # Specify the name of the connection to use connection: frontend # Specify the table @@ -28,16 +35,18 @@ relations: file: "${project.basedir}/schema/users.avsc" primaryKey: - user_id + indexes: + - name: "users_idx0" + columns: [user_first_name, user_last_name] ``` It is also possible to directly embed the connection as follows: ```yaml relations: frontend_users: - kind: jdbc + kind: sqlserver # Specify the name of the connection to use connection: kind: jdbc - driver: "$frontend_db_driver" url: "$frontend_db_url" username: "$frontend_db_username" password: "$frontend_db_password" @@ -67,13 +76,12 @@ as the fallback for merge/upsert operations, when no `mergeKey` and no explicit A description of the relation. This is purely for informational purpose. * `connection` **(mandatory)** *(type: string)*: - The *connection* field specifies the name of a [Connection](../connection/index.md) + The *connection* field specifies the name of a [JDBC Connection](../connection/jdbc.md) object which has to be defined elsewhere. * `database` **(optional)** *(type: string)* *(default: empty)*: - Defines the Hive database where the table is defined. When no database is specified, the - table is accessed without any specific qualification, meaning that the default database - will be used or the one specified in the connection. + Defines the Hive database where the table is defined. When no database is specified, the table is accessed without any +specific qualification, meaning that the default database will be used or the one specified in the connection. * `table` **(mandatory)** *(type: string)*: Specifies the name of the table in the relational database. @@ -85,6 +93,9 @@ as the fallback for merge/upsert operations, when no `mergeKey` and no explicit The connection properties are applied first, then the relation properties. This means that a relation property can overwrite a connection property if it has the same name. + * `indexes` **(optional)** *(type: list:index)* *(default: empty)*: + Specifies a list of database indexes to be created. Each index has the properties `name`, `columns` and `unique`. + ## Automatic Migrations Flowman supports some automatic migrations, specifically with the migration strategies `ALTER`, `ALTER_REPLACE` @@ -96,9 +107,11 @@ The migration strategy `ALTER` supports the following alterations for JDBC relat * Adding new columns * Dropping columns * Changing the column type +* Adding / dropping indexes +* Changing the primary key Note that although Flowman will try to apply these changes, not all SQL databases support all of these changes in -all variations. 
Therefore it may well be the case, that the SQL database will fail performing these changes. If +all variations. Therefore, it may well be the case, that the SQL database will fail performing these changes. If the migration strategy is set to `ALTER_REPLACE`, then Flowman will fall back to trying to replace the whole table altogether on *any* non-recoverable exception during migration. @@ -109,7 +122,7 @@ corresponding section of [relations](index.md). ## Output Modes -The `jdbc` relation supports the following output modes in a [`relation` target](../target/relation.md): +The `sqlserver` relation supports the following output modes in a [`relation` target](../target/relation.md): | Output Mode | Supported | Comments | |---------------------|-----------|-------------------------------------------------------| @@ -118,8 +131,9 @@ The `jdbc` relation supports the following output modes in a [`relation` target] | `overwrite` | yes | Overwrite the whole table or the specified partitions | | `overwrite_dynamic` | no | - | | `append` | yes | Append new records to the existing table | -| `update` | no | - | -| `merge` | no | - | +| `update` | yes | - | + +In addition, the `sqlserver` relation also supports complex merge operations in a [`merge` target](../target/merge.md). ## Remarks diff --git a/docs/spec/schema/embedded.md b/docs/spec/schema/embedded.md index ea9a5a23e..474e4821f 100644 --- a/docs/spec/schema/embedded.md +++ b/docs/spec/schema/embedded.md @@ -28,4 +28,70 @@ relations: ## Fields * `kind` **(mandatory)** *(type: string)*: `embedded` + * `fields` **(mandatory)** *(type: list:field)*: Contains all fields + + +## Field properties + +* `name` **(mandatory)** *(type: string)*: specifies the name of the column +* `type` **(mandatory)** *(type: data type)*: specifies the data type of the column +* `nullable` **(optional)** *(type: boolean)* *(default: true)* +* `description` **(optional)** *(type: string)* +* `default` **(optional)** *(type: string)* Specifies a default value +* `format` **(optional)** *(type: string)* Some relations or file formats may support different formats for example +for storing dates + + +## Data Types + +The following simple data types are supported by Flowman + +* `string`, `text` - text and strings of arbitrary length +* `binary` - binary data of arbitrary length +* `tinyint`, `byte` - 8 bit signed numbers +* `smallint`, `short` - 16 bit signed numbers +* `int`, `integer` - 32 bit signed numbers +* `bigint`, `long` - 64 bit signed numbers +* `boolean` - true or false +* `float` - 32 bit floating point number +* `double` - 64 bit floating point number +* `decimal(a,b)` +* `varchar(n)` - text with up to `n`characters. Note that this data type is only supported for specifying input or +output data types. Internally Spark and therefore Flowman convert these columns to a `string` column of arbitrary length. +* `char(n)` - text with exactly `n`characters. Note that this data type is only supported for specifying input or + output data types. Internally Spark and therefore Flowman convert these columns to a `string` column of arbitrary length. 
+* `date` - date type +* `timestamp` - timestamp type (date and time) +* `duration` - duration type + +In addition to those simple data types the following complex types are supported: + +* `struct` for creating nested data types +```yaml +name: some_struct +type: + kind: struct + fields: + - name: some_field + type: int + - name: some_other_field + type: string +``` + +* `map` +```yaml +name: keyValue +type: + kind: map + keyType: string + valueType: int +``` + +* `array` for storing arrays of sub elements + ```yaml +name: names +type: + kind: array + elementType: string +``` diff --git a/docs/spec/target/blackhole.md b/docs/spec/target/blackhole.md index ac37a385a..b9e4cc274 100644 --- a/docs/spec/target/blackhole.md +++ b/docs/spec/target/blackhole.md @@ -17,9 +17,14 @@ targets: * `kind` **(mandatory)** *(type: string)*: `blackhole` +* `description` **(optional)** *(type: string)*: +Optional descriptive text of the build target + * `mapping` **(mandatory)** *(type: string)*: Specifies the name of the mapping output to be materialized -## Supported Phases +## Supported Execution Phases * `BUILD` - In the build phase, all records of the specified mapping will be materialized + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/compare.md b/docs/spec/target/compare.md index 3ec0cf5b2..610722d10 100644 --- a/docs/spec/target/compare.md +++ b/docs/spec/target/compare.md @@ -24,13 +24,20 @@ targets: ## Fields * `kind` **(mandatory)** *(type: string)*: `relation` + +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `actual` **(mandatory)** *(type: dataset)*: Specifies the data set containing the actual data. Often you will either use a relation written to by Flowman or a mapping. + * `expected` **(mandatory)** *(type: dataset)*: Specifies the data set containing the expected data. In most cases you probably will use a file data set referencing some predefined results -## Supported Phases +## Supported Execution Phases * `VERIFY` - Comparison will be performed in the *verify* build phase. If the comparison fails, the build will stop with an error + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/console.md b/docs/spec/target/console.md index bc7df3cb3..2b7bbe69a 100644 --- a/docs/spec/target/console.md +++ b/docs/spec/target/console.md @@ -16,10 +16,15 @@ targets: ## Fields * `kind` **(mandatory)** *(type: string)*: `console` +* `description` **(optional)** *(type: string)*: +Optional descriptive text of the build target * `input` **(mandatory)** *(type: dataset)*: Specified the [dataset](../dataset/index.md) containing the records to be dumped * `limit` **(optional)** *(type: integer)* *(default: 100)*: Specified the number of records to be displayed -## Supported Phases + +## Supported Execution Phases * `BUILD` - The target will only be executed in the *build* phase + +Read more about [execution phases](../../lifecycle.md). 
diff --git a/docs/spec/target/copy-file.md b/docs/spec/target/copy-file.md index a50d813fd..9adf75541 100644 --- a/docs/spec/target/copy-file.md +++ b/docs/spec/target/copy-file.md @@ -4,12 +4,17 @@ * `kind` **(mandatory)** *(type: string)*: `copyFile` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `source` **(mandatory)** *(type: string)*: * `target` **(mandatory)** *(type: string)*: -## Supported Phases +## Supported Execution Phases * `BUILD` * `VERIFY` * `TRUNCATE` * `DESTROY` + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/copy.md b/docs/spec/target/copy.md index 6f682e1a1..39edfd6d5 100644 --- a/docs/spec/target/copy.md +++ b/docs/spec/target/copy.md @@ -27,6 +27,9 @@ targets: * `kind` **(mandatory)** *(type: string)*: `copy` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `source` **(mandatory)** *(type: dataset)*: Specifies the source data set to be copied from. @@ -50,8 +53,10 @@ and output file will contain approximately the same number of records. The defau Flowman config variable `floman.default.target.rebalance`. -## Supported Phases +## Supported Execution Phases * `BUILD` - The *build* phase will perform the copy operation * `VERIFY` - The *verify* phase will ensure that the target exists * `TRUNCATE` - The *truncate* phase will remove the target * `DESTROY` - The *destroy* phase will remove the target + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/count.md b/docs/spec/target/count.md index 9a5496404..42b8e57ae 100644 --- a/docs/spec/target/count.md +++ b/docs/spec/target/count.md @@ -10,9 +10,13 @@ targets: ## Fields * `kind` **(mandatory)** *(string)*: `count` + * `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target * `mapping` **(mandatory)** *(string)*: Specifies the name of the input mapping to be counted -## Supported Phases -* `BUILD` +## Supported Execution Phases +* `BUILD` - Counting records of a mapping will be executed as part of the `BUILD` phase + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/delete-file.md b/docs/spec/target/delete-file.md index 75cc2215f..c4e066366 100644 --- a/docs/spec/target/delete-file.md +++ b/docs/spec/target/delete-file.md @@ -13,8 +13,13 @@ targets: * `kind` **(mandatory)** *(type: string)*: `deleteFile` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `location` **(mandatory)** *(type: string)*: -## Supported Phases +## Supported Execution Phases * `BUILD` - This will remove the specified location + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/deltaVacuum.md b/docs/spec/target/delta-vacuum.md similarity index 90% rename from docs/spec/target/deltaVacuum.md rename to docs/spec/target/delta-vacuum.md index 96efd545b..8fe9b0855 100644 --- a/docs/spec/target/deltaVacuum.md +++ b/docs/spec/target/delta-vacuum.md @@ -37,6 +37,9 @@ targets: * `kind` **(mandatory)** *(type: string)*: `deleteFile` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `relation` **(mandatory)** *(type: string or relation)*: Either the name of a `deltaTable` or `deltaFile` relation or alternatively an embedded delta relation @@ -55,5 +58,7 @@ targets: will be performed. 
-## Supported Phases +## Supported Execution Phases * `BUILD` - This will execute the vacuum operation + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/document.md b/docs/spec/target/document.md new file mode 100644 index 000000000..474b643e1 --- /dev/null +++ b/docs/spec/target/document.md @@ -0,0 +1,61 @@ +# Document Target + +The `document` (or equivalently `documentation`) target is used to build a documentation of the current project. +You can find more details about that feature in the [documentation section](../../documenting/index.md). You can either +generate the project documentation via `flowexec documentation generate`, or you also generate the documentation via +this special target, which will be executed as part of the `VERIFY` phsae (after the `BUILD` phase has finished). + +## Example + +```yaml +targets: + documentation: + kind: documentation + collectors: + # Collect documentation of relations + - kind: relations + # Collect documentation of mappings + - kind: mappings + # Collect documentation of build targets + - kind: targets + # Execute all tests + - kind: tests + + generators: + # Create an output file in the project directory + - kind: file + location: ${project.basedir}/generated-documentation + template: html + excludeRelations: + # You can either specify a name (without the project) + - "stations_raw" + # Or can also explicitly specify a name with the project + - ".*/measurements_raw" +``` + +## Fields + +* `kind` **(mandatory)** *(type: string)*: `documentation` or `document` + +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + +* `collectors` **(optional)** *(type: list:collector)*: + List of documentation collectors + +* `generators` **(optional)** *(type: list:generator)*: + List of documentation generators + + +## Configuration + +When no explicit configuration is provided via `generators` or `collectors`, then Flowman will use the +[documentation configuration](../../documenting/config.md) provided in `documentation.yml`. If that file does not +exist, Flowman will fall back to some default configuration, which creates a html based documentation in a +subdirectory `generated-documentation` within the projects base directory. + + +## Supported Execution Phases +* `VERIFY` - This will generate the documentation + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/drop.md b/docs/spec/target/drop.md new file mode 100644 index 000000000..121e7e2c4 --- /dev/null +++ b/docs/spec/target/drop.md @@ -0,0 +1,64 @@ +# Drop Relation Target + +The `drop` target is used for dropping relations, i.e. dropping tables in relational database, +dropping tables in Hive or removing output directories. The target can be used for cleaning up tables which are not +used any more. + +## Example + +```yaml +targets: + drop_stations: + kind: drop + relation: stations + +relations: + stations: + kind: file + format: parquet + location: "$basedir/stations/" + schema: + kind: avro + file: "${project.basedir}/schema/stations.avsc" +``` + +You can also directly specify the relation inside the target definition. This saves you +from having to create a separate relation definition in the `relations` section. This is only recommended, if you +do not access the target relation otherwise, such that a shared definition would not provide any benefit. 
+```yaml +targets: + drop_stations: + kind: drop + relation: + kind: file + name: stations-relation + format: parquet + location: "$basedir/stations/" + schema: + kind: avro + file: "${project.basedir}/schema/stations.avsc" +``` + +## Fields + +* `kind` **(mandatory)** *(type: string)*: `drop` + +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + +* `relation` **(mandatory)** *(type: string)*: +Specifies the name of the relation to drop, or alternatively directly embeds the relation. + + +## Description + +The `drop` target will drop a relation and all its contents. It will be executed both during the `CREATE` phase and +during the `DESTROY` phase. + + +## Supported Execution Phases +* `CREATE` - This will drop the target relation or migrate it to the newest schema (if possible). +* `VERIFY` - This will verify that the target relation does not exist any more +* `DESTROY` - This will also drop the relation itself and all its content. + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/file.md b/docs/spec/target/file.md index a68daf86d..accc58526 100644 --- a/docs/spec/target/file.md +++ b/docs/spec/target/file.md @@ -26,6 +26,9 @@ targets: * `kind` **(mandatory)** *(type: string)*: `file` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `mapping` **(optional)** *(type: string)*: Specifies the name of the input mapping to be written @@ -53,11 +56,13 @@ Flowman config variable `floman.default.target.rebalance`. ## Supported Phases -* `CREATE` -* `BUILD` -* `VERIFY` -* `TRUNCATE` -* `DESTROY` +* `CREATE` - creates the target directory +* `BUILD` - build the target files containing records +* `VERIFY` - verifies that the target file exists +* `TRUNCATE` - removes the target file, but keeps the directory +* `DESTROY` - recursively removes the target directory and all files inside + +Read more about [execution phases](../../lifecycle.md). ## Provided Metrics diff --git a/docs/spec/target/hive-database.md b/docs/spec/target/hive-database.md index ccc0ecb15..885439c80 100644 --- a/docs/spec/target/hive-database.md +++ b/docs/spec/target/hive-database.md @@ -12,7 +12,20 @@ targets: database: "my_database" ``` -## Supported Phases +## Fields + +* `kind` **(mandatory)** *(type: string)*: `hiveDatabase` + +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + +* `database` **(mandatory)** *(type: string)*: + Name of the Hive database to be created + + +## Supported Execution Phases * `CREATE` - Ensures that the specified Hive database exists and creates one if it is not found * `VERIFY` - Verifies that the specified Hive database exists * `DESTROY` - Drops the Hive database + +Read more about [execution phases](../../lifecycle.md). 
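As a sketch of how this target is typically used, the database created here can then host Hive table relations
defined elsewhere in the project (the relation below is illustrative; see the `hiveTable` relation documentation for
its full set of fields):

```yaml
targets:
  my_database:
    kind: hiveDatabase
    database: "my_database"

relations:
  # A Hive table relation that will be created inside the database above
  my_table:
    kind: hiveTable
    database: "my_database"
    table: "my_table"
```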
diff --git a/docs/spec/target/local.md b/docs/spec/target/local.md index 1069be816..8e9c92dde 100644 --- a/docs/spec/target/local.md +++ b/docs/spec/target/local.md @@ -18,6 +18,8 @@ targets: ## Fields * `kind` **(mandatory)** *(string)*: `local` + * `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target * `mapping` **(mandatory)** *(string)*: Specifies the name of the input mapping to be counted * `filename` **(mandatory)** *(string)*: @@ -30,11 +32,13 @@ targets: * `columns` **(optional)** *(list)* *(default: [])*: -## Supported Phases -* `BUILD` -* `VERIFY` -* `TRUNCATE` -* `DESTROY` +## Supported Execution Phases +* `BUILD` - build the target files containing records +* `VERIFY` - verifies that the target file exists +* `TRUNCATE` - removes the target file +* `DESTROY` - removes the target file, equivalent to `TRUNCATE` + +Read more about [execution phases](../../lifecycle.md). ## Provided Metrics diff --git a/docs/spec/target/measure.md b/docs/spec/target/measure.md index 4f992afcd..3ad049359 100644 --- a/docs/spec/target/measure.md +++ b/docs/spec/target/measure.md @@ -21,3 +21,19 @@ targets: This example will provide two metrics, `record_count` and `column_sum`, which then can be sent to a [metric sink](../metric) configured in the [namespace](../namespace.md). + + +## Provided Metrics +All metrics defined as named columns are exported with the following labels: + - `name` - The name of the measure (i.e. `record_stats` above) + - `category` - Always set to `measure` + - `kind` - Always set to `sql` + - `namespace` - Name of the namespace (typically `default`) + - `project` - Name of the project + - `version` - Version of the project + + +## Supported Execution Phases +* `VERIFY` - The evaluation of all measures will only be performed in the `VERIFY` phase + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/merge-files.md b/docs/spec/target/merge-files.md index a512d96af..9334ebcec 100644 --- a/docs/spec/target/merge-files.md +++ b/docs/spec/target/merge-files.md @@ -16,13 +16,17 @@ targets: ## Fields * `kind` **(mandatory)** *(string)*: `mergeFiles` + * `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target * `source` **(mandatory)** *(string)*: Source directory containing all files to be concatenated * `target` **(optional)** *(string)*: Name of single target file * `overwrite` **(optional)** *(boolean)* *(default: true)*: -## Supported Phases +## Supported Execution Phases * `BUILD` * `VERIFY` * `TRUNCATE` * `DESTROY` + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/merge.md b/docs/spec/target/merge.md index cada0694e..c0ca7da1c 100644 --- a/docs/spec/target/merge.md +++ b/docs/spec/target/merge.md @@ -69,6 +69,9 @@ relations: * `kind` **(mandatory)** *(type: string)*: `merge` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `mapping` **(optional)** *(type: string)*: Specifies the name of the input mapping to be written @@ -99,7 +102,7 @@ relations: Flowman config variable `floman.default.target.rebalance`. -## Supported Phases +## Supported Execution Phases * `CREATE` - This will create the target relation or migrate it to the newest schema (if possible). * `BUILD` - This will write the output of the specified mapping into the relation. If no mapping is specified, nothing will be done. 
@@ -108,6 +111,8 @@ relations: if the relation refers to a Hive table) * `DESTROY` - This drops the relation itself and all its content. +Read more about [execution phases](../../lifecycle.md). + ## Provided Metrics The relation target also provides some metric containing the number of records written: diff --git a/docs/spec/target/null.md b/docs/spec/target/null.md index a66a24c68..56c5cf6ec 100644 --- a/docs/spec/target/null.md +++ b/docs/spec/target/null.md @@ -11,10 +11,21 @@ targets: kind: null ``` -## Supported Phases + +## Fields + +* `kind` **(mandatory)** *(type: string)*: `null` + +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + + +## Supported Execution Phases * `CREATE` * `MIGRATE` * `BUILD` * `VERIFY` * `TRUNCATE` * `DESTROY` + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/relation.md b/docs/spec/target/relation.md index 1d81657c6..4c13f0360 100644 --- a/docs/spec/target/relation.md +++ b/docs/spec/target/relation.md @@ -16,7 +16,7 @@ targets: parallelism: 32 rebalance: true partition: - processing_date: "${processing_date}" + year: "${processing_date}" relations: stations: @@ -26,11 +26,15 @@ relations: schema: kind: avro file: "${project.basedir}/schema/stations.avsc" + partitions: + - name: year + type: integer + granularity: 1 ``` Since Flowman 0.18.0, you can also directly specify the relation inside the target definition. This saves you -from having to create a separate relation definition in the `relations` section. This is only recommeneded, if you -do not access the target relation otherwise, such that a shared definition would not provide any benefir. +from having to create a separate relation definition in the `relations` section. This is only recommended, if you +do not access the target relation otherwise, such that a shared definition would not provide any benefit. ```yaml targets: stations: @@ -44,22 +48,29 @@ targets: schema: kind: avro file: "${project.basedir}/schema/stations.avsc" + partitions: + - name: year + type: integer + granularity: 1 mode: overwrite parallelism: 32 rebalance: true partition: - processing_date: "${processing_date}" + year: "${processing_date}" ``` ## Fields * `kind` **(mandatory)** *(type: string)*: `relation` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `mapping` **(optional)** *(type: string)*: Specifies the name of the input mapping to be written * `relation` **(mandatory)** *(type: string)*: -Specifies the name of the relation to write to +Specifies the name of the relation to write to, or alternatively directly embeds the relation. * `mode` **(optional)** *(type: string)* *(default=overwrite)*: Specifies the behavior when data or table or partition already exists. Options include: @@ -102,7 +113,7 @@ the relation during the `CREATE`, `TRUNCATE` and `DESTROY` phase. In this case, target. -## Supported Phases +## Supported Execution Phases * `CREATE` - This will create the target relation or migrate it to the newest schema (if possible). * `BUILD` - This will write the output of the specified mapping into the relation. If no mapping is specified, nothing will be done. @@ -111,6 +122,8 @@ target. if the relation refers to a Hive table) * `DESTROY` - This drops the relation itself and all its content. +Read more about [execution phases](../../lifecycle.md). 
+ ## Provided Metrics The relation target also provides some metric containing the number of records written: diff --git a/docs/spec/target/sftp-upload.md b/docs/spec/target/sftp-upload.md index df2d206a9..cdb9a0780 100644 --- a/docs/spec/target/sftp-upload.md +++ b/docs/spec/target/sftp-upload.md @@ -47,8 +47,9 @@ jobs: ## Fields * `kind` **(mandatory)** *(type: string)*: `sftp-upload` + * `description` **(optional)** *(type: string)*: -A textual description of the task. +A textual description of the build target. * `source` **(mandatory)** *(type: string)*: Specifies the source location in the Hadoop compatible filesystem. This may be either a single @@ -75,3 +76,9 @@ Set to `true` in order to overwrite existing files on the SFTP server. Otherwise file will result in an error. ## Description + + +## Supported Execution Phases +* `BUILD` - This will upload the specified file via SFTP + +Read more about [execution phases](../../lifecycle.md). diff --git a/docs/spec/target/stream.md b/docs/spec/target/stream.md index 0ee5e82df..379927cc1 100644 --- a/docs/spec/target/stream.md +++ b/docs/spec/target/stream.md @@ -41,6 +41,9 @@ targets: * `kind` **(mandatory)** *(type: string)*: `stream` +* `description` **(optional)** *(type: string)*: + Optional descriptive text of the build target + * `mapping` **(optional)** *(type: string)*: Specifies the name of the input mapping to be read from diff --git a/docs/spec/target/template.md b/docs/spec/target/template.md index 8bcda8e0c..6d6f0e926 100644 --- a/docs/spec/target/template.md +++ b/docs/spec/target/template.md @@ -15,3 +15,7 @@ targets: environment: - table=fee ``` + +## Supported Execution Phases + +The supported execution phases are determined by the referenced target. diff --git a/docs/spec/target/truncate.md b/docs/spec/target/truncate.md index fe8ac2004..5b4070cc5 100644 --- a/docs/spec/target/truncate.md +++ b/docs/spec/target/truncate.md @@ -2,7 +2,7 @@ The `truncate` target is used to truncate a relation or individual partitions of a relation. Truncating means that the relation itself is not removed, but the contents are deleted (either all records or individual partitions). -Note that the `truncate` target is executed as part of the `BUILD`phase, which might be surprising. +Note that the `truncate` target is executed both as part of the `BUILD` and `TRUNCATE` phases, which might be surprising. ## Example @@ -15,19 +15,62 @@ targets: year: start: $start_year end: $end_year + +relations: + stations: + kind: file + format: parquet + location: "$basedir/stations/" + schema: + kind: avro + file: "${project.basedir}/schema/stations.avsc" + partitions: + - name: year + type: integer + granularity: 1 +``` + +Since Flowman 0.22.0, you can also directly specify the relation inside the target definition. This saves you +from having to create a separate relation definition in the `relations` section. This is only recommended, if you +do not access the target relation otherwise, such that a shared definition would not provide any benefit. 
+```yaml
+targets:
+  truncate_stations:
+    kind: truncate
+    partitions:
+      year:
+        start: $start_year
+        end: $end_year
+    relation:
+      kind: file
+      name: stations-relation
+      format: parquet
+      location: "$basedir/stations/"
+      schema:
+        kind: avro
+        file: "${project.basedir}/schema/stations.avsc"
+      partitions:
+        - name: year
+          type: integer
+          granularity: 1
 ```
 
 ## Fields
 
 * `kind` **(mandatory)** *(type: string)*: `truncate`
 
+* `description` **(optional)** *(type: string)*:
+  Optional descriptive text of the build target
+
 * `relation` **(mandatory)** *(type: string)*:
-  Specifies the name of the relation to truncate.
+  Specifies the name of the relation to truncate, or alternatively directly embeds the relation.
 
 * `partitions` **(optional)** *(type: map:partition)*:
 Specifies the partition (or multiple partitions) to truncate.
 
 
-## Supported Phases
+## Supported Execution Phases
 * `BUILD` - This will truncate the specified relation.
 * `VERIFY` - This will verify that the relation (and any specified partition) actually contains no data.
+* `TRUNCATE` - This will truncate the specified relation.
+
+Read more about [execution phases](../../lifecycle.md).
diff --git a/docs/spec/target/validate.md b/docs/spec/target/validate.md
index 7f69b1c0a..d7240292d 100644
--- a/docs/spec/target/validate.md
+++ b/docs/spec/target/validate.md
@@ -29,6 +29,9 @@ targets:
 
 * `kind` **(mandatory)** *(type: string)*: `validate`
 
+* `description` **(optional)** *(type: string)*:
+  Optional descriptive text of the build target
+
 * `assertions` **(optional)** *(type: map:assertion)*:
 Map of [assertions](../assertion/index.md) to be executed. The validation is marked as *failed* if a single
 assertion fails.
@@ -37,9 +40,11 @@ targets:
 Specify how to proceed in case individual assertions fail. Possible values are `failFast`, `failAtEnd` and
 `failNever`
 
-## Supported Phases
+## Supported Execution Phases
 * `VALIDATE` - The specified assertions will be run in the `VALIDATE` phase before the `CREATE` and `BUILD` phases.
 
+Read more about [execution phases](../../lifecycle.md).
+
 
 ## Remarks
diff --git a/docs/spec/target/verify.md b/docs/spec/target/verify.md
index 279591aac..4a8156eca 100644
--- a/docs/spec/target/verify.md
+++ b/docs/spec/target/verify.md
@@ -31,6 +31,9 @@ targets:
 
 * `kind` **(mandatory)** *(type: string)*: `verify`
 
+* `description` **(optional)** *(type: string)*:
+  Optional descriptive text of the build target
+
 * `assertions` **(optional)** *(type: map:assertion)*:
 Map of [assertions](../assertion/index.md) to be executed. The verification is marked as *failed* if a single
 assertion fails.
@@ -39,9 +42,11 @@ targets:
 Specify how to proceed in case individual assertions fail. Possible values are `failFast`, `failAtEnd` and
 `failNever`
 
-## Supported Phases
+## Supported Execution Phases
 * `VERIDY` - The specified assertions will be run in the `VERIFY` phase after the `CREATE` and `BUILD` phases.
 
+Read more about [execution phases](../../lifecycle.md).
+
 
 ## Remarks
diff --git a/docs/spec/template/connection.md b/docs/spec/template/connection.md
index 9f2c333a2..81eb4b872 100644
--- a/docs/spec/template/connection.md
+++ b/docs/spec/template/connection.md
@@ -5,6 +5,7 @@
 # All template definitions (independent of their kind) go into the templates section
 templates:
   default_connection:
+    # The template is a connection template
     kind: connection
     parameters:
       - name: dir
@@ -32,14 +33,14 @@ relations:
       dir: /opt/flowman/derby_new
     table: "advertiser_setting"
     schema:
-        kind: embedded
-        fields:
-          - name: id
-            type: Integer
-          - name: business_rule_id
-            type: Integer
-          - name: rtb_advertiser_id
-            type: Integer
+      kind: embedded
+      fields:
+        - name: id
+          type: Integer
+        - name: business_rule_id
+          type: Integer
+        - name: rtb_advertiser_id
+          type: Integer
 
   rel_2:
     kind: jdbc
@@ -49,5 +50,15 @@ relations:
         * FROM line_item li
       "
-
 ```
+
+Once a connection template is defined, you can create instances of the template at any place where a connection can be
+specified. You need to use the special syntax `template/` when creating an instance of the template.
+The template instance then can also contain values for all parameters defined in the template.
+
+
+## Fields
+
+* `kind` **(mandatory)** *(type: string)*: `connection`
+* `parameters` **(optional)** *(type: list[parameter])*: List of parameter definitions.
+* `template` **(mandatory)** *(type: connection)*: The actual definition of a connection.
diff --git a/docs/spec/template/index.md b/docs/spec/template/index.md
index 7bfdef7ea..bc10a4256 100644
--- a/docs/spec/template/index.md
+++ b/docs/spec/template/index.md
@@ -12,17 +12,24 @@ There are some differences when creating an instance of a template as opposed to
 ## Example
 
 ```yaml
+# The 'templates' section contains template definitions for relations, mappings, connections and more
 templates:
+  # Define a new template called "key_value"
   key_value:
+    # The template is a mapping template
     kind: mapping
+    # Specify a list of template parameters, which can then be provided during instantiation
     parameters:
       - name: key
         type: string
       - name: value
         type: int
         default: 12
+    # Now comes the template definition itself.
     template:
+      # Specify the kind within the "mapping" entity class
       kind: values
+      # The following settings are all specific to the "values" mapping kind
      records:
        - ["$key",$value]
      schema:
@@ -33,10 +40,17 @@ templates:
        - name: value_column
          type: integer
 
+# Now we can use the "key_value" template in the "mappings" section
 mappings:
+  # First instance
   mapping_1:
+    # You need to prefix the template name with "template/"
     kind: template/key_value
+    # Provide a value for the "key" parameter.
+    # The "value" parameter has a default value, so it doesn't need to be provided
     key: some_value
+
+  # Second instance
   mapping_2:
     kind: template/key_value
     key: some_other_value
diff --git a/docs/spec/template/mapping.md b/docs/spec/template/mapping.md
index b924c5c0a..f8637e645 100644
--- a/docs/spec/template/mapping.md
+++ b/docs/spec/template/mapping.md
@@ -5,7 +5,9 @@
 # All template definitions (independent of their kind) go into the templates section
 templates:
   key_value:
+    # The template is a mapping template
     kind: mapping
+    # Specify a list of template parameters, which can then be provided during instantiation
     parameters:
       - name: key
         type: string
@@ -28,9 +30,12 @@ templates:
 # Now you can create instances of the template in the corresponding entity section or at any other place where
 # a mapping is allowed
 mappings:
+  # First instance
   mapping_1:
     kind: template/key_value
     key: some_value
+
+  # Second instance
   mapping_2:
     kind: template/key_value
     key: some_other_value
diff --git a/docs/spec/template/relation.md b/docs/spec/template/relation.md
index dc4cc1e0b..0022d15a7 100644
--- a/docs/spec/template/relation.md
+++ b/docs/spec/template/relation.md
@@ -5,7 +5,9 @@
 # All template definitions (independent of their kind) go into the templates section
 templates:
   key_value:
+    # The template is a relation template
     kind: relation
+    # Specify a list of template parameters, which can then be provided during instantiation
     parameters:
       - name: key
         type: string
@@ -28,9 +30,12 @@ templates:
 # Now you can create instances of the template in the corresponding entity section or at any other place where
 # a relation is allowed
 relation:
+  # First instance
   source_1:
     kind: template/key_value
     key: some_value
+
+  # Second instance
   source_2:
     kind: template/key_value
     key: some_other_value
@@ -41,6 +46,7 @@ Once a relation template is defined, you can create instances of the template at
 specified. You need to use the special syntax `template/` when creating an instance of the template.
 The template instance then can also contain values for all parameters defined in the template.
 
+
 ## Fields
 
 * `kind` **(mandatory)** *(type: string)*: `relation`
diff --git a/docs/cookbook/testing.md b/docs/testing/index.md
similarity index 84%
rename from docs/cookbook/testing.md
rename to docs/testing/index.md
index fd2c4687d..45d5a4b2f 100644
--- a/docs/cookbook/testing.md
+++ b/docs/testing/index.md
@@ -1,4 +1,4 @@
-# Testing
+# Testing with Flowman
 
 Testing data pipelines often turns out to be a difficult undertaking, since the pipeline relies on external data
 sources which need to be mocked. Fortunately, Flowman natively supports writing `tests`, which in turn support simple
@@ -77,3 +77,11 @@ The easiest way to execute tests is to use the [Flowman Shell](../cli/flowshell.
 Flowman now also includes a `flowman-testing` library which allows one to write lightweight unittests using either
 Scala or Java. The library provides some simple test runner for executing tests and jobs specified as usual in YAML
 files.
+
+
+## Data Quality Tests
+
+The testing framework above is meant for implementing unittests (i.e. self-contained tests without any dependencies on
+external systems like databases or additional files). If you want to assess the data quality of either the source tables
+or the generated tables, you may want to have a look at [documenting with Flowman](../documenting/index.md) and
+the [validate](../spec/target/validate.md) and [verify](../spec/target/verify.md) targets.
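To make the second option more concrete, the sketch below attaches a data quality check to the regular lifecycle via a `verify` target as described in `docs/spec/target/verify.md` above. The target and assertion names are made up, and the exact fields of the `sql` assertion (`query`, `expected`) are assumptions here that should be checked against the assertion documentation:

```yaml
targets:
  verify_measurements:
    kind: verify
    assertions:
      # Hypothetical assertion: there must be no duplicate measurements per station and date
      no_duplicate_measurements:
        kind: sql
        query: "
          SELECT usaf, wban, date, COUNT(*) AS cnt
          FROM measurements
          GROUP BY usaf, wban, date
          HAVING COUNT(*) > 1"
        # An empty result is expected, i.e. no duplicate rows
        expected: []
```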
diff --git a/examples/weather/.gitignore b/examples/weather/.gitignore new file mode 100644 index 000000000..86cf85d0b --- /dev/null +++ b/examples/weather/.gitignore @@ -0,0 +1,2 @@ +generated-documentation +generated-report.txt diff --git a/examples/weather/documentation.yml b/examples/weather/documentation.yml new file mode 100644 index 000000000..72ec3bf34 --- /dev/null +++ b/examples/weather/documentation.yml @@ -0,0 +1,20 @@ +collectors: + # Collect documentation of relations + - kind: relations + # Collect documentation of mappings + - kind: mappings + # Collect documentation of build targets + - kind: targets + # Execute all checks + - kind: checks + +generators: + # Create an output file in the project directory + - kind: file + location: ${project.basedir}/generated-documentation + template: html + excludeRelations: + # You can either specify a name (without the project) + - "stations_raw" + # Or can also explicitly specify a name with the project + - ".*/measurements_raw" diff --git a/examples/weather/job/main.yml b/examples/weather/job/main.yml index 5424d77e6..8a16cb298 100644 --- a/examples/weather/job/main.yml +++ b/examples/weather/job/main.yml @@ -13,3 +13,19 @@ jobs: - stations - aggregates - validate_stations_raw + # Collect some measures which will be published as metrics + - metrics + # Generate documentation + - documentation + + # Define metrics to be published while running this job + metrics: + labels: + project: "${project.name}" + metrics: + - selector: + name: ".*" + labels: + category: "$category" + kind: "$kind" + name: "$name" diff --git a/examples/weather/mapping/aggregates.yml b/examples/weather/mapping/aggregates.yml index a6a63b88c..cbefd0ddd 100644 --- a/examples/weather/mapping/aggregates.yml +++ b/examples/weather/mapping/aggregates.yml @@ -12,3 +12,29 @@ mappings: min_temperature: "MIN(air_temperature)" max_temperature: "MAX(air_temperature)" avg_temperature: "AVG(air_temperature)" + + documentation: + description: "This mapping calculates the aggregated metrics per year and per country" + columns: + - name: country + checks: + - kind: notNull + - kind: unique + - name: min_wind_speed + description: Minimum wind speed + checks: + - kind: expression + expression: "min_wind_speed >= 0" + - name: max_wind_speed + description: Maximum wind speed + checks: + - kind: expression + expression: "max_wind_speed <= 60" + - name: min_temperature + checks: + - kind: expression + expression: "min_temperature >= -100" + - name: max_temperature + checks: + - kind: expression + expression: "max_temperature <= 100" diff --git a/examples/weather/mapping/measurements.yml b/examples/weather/mapping/measurements.yml index 2d59b83a1..5fdbc9eab 100644 --- a/examples/weather/mapping/measurements.yml +++ b/examples/weather/mapping/measurements.yml @@ -5,27 +5,52 @@ mappings: relation: measurements_raw partitions: year: $year - columns: - raw_data: String # Extract multiple columns from the raw measurements data using SQL SUBSTR functions measurements_extracted: kind: select input: measurements_raw columns: - usaf: "SUBSTR(raw_data,5,6)" - wban: "SUBSTR(raw_data,11,5)" + usaf: "CAST(SUBSTR(raw_data,5,6) AS INT)" + wban: "CAST(SUBSTR(raw_data,11,5) AS INT)" date: "TO_DATE(SUBSTR(raw_data,16,8), 'yyyyMMdd')" time: "SUBSTR(raw_data,24,4)" report_type: "SUBSTR(raw_data,42,5)" - wind_direction: "SUBSTR(raw_data,61,3)" + wind_direction: "CAST(SUBSTR(raw_data,61,3) AS INT)" wind_direction_qual: "SUBSTR(raw_data,64,1)" wind_observation: "SUBSTR(raw_data,65,1)" - wind_speed: 
"CAST(SUBSTR(raw_data,66,4) AS FLOAT)/10" + wind_speed: "CAST(CAST(SUBSTR(raw_data,66,4) AS FLOAT)/10 AS FLOAT)" wind_speed_qual: "SUBSTR(raw_data,70,1)" - air_temperature: "CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10" + air_temperature: "CAST(CAST(SUBSTR(raw_data,88,5) AS FLOAT)/10 AS FLOAT)" air_temperature_qual: "SUBSTR(raw_data,93,1)" + documentation: + columns: + - name: usaf + description: "The USAF (US Air Force) id of the weather station" + - name: wban + description: "The WBAN id of the weather station" + - name: date + description: "The date when the measurement was made" + - name: time + description: "The time when the measurement was made" + - name: report_type + description: "The report type of the measurement" + - name: wind_direction + description: "The direction from where the wind blows in degrees" + - name: wind_direction_qual + description: "The quality indicator of the wind direction. 1 means trustworthy quality." + - name: wind_observation + description: "" + - name: wind_speed + description: "The wind speed in m/s" + - name: wind_speed_qual + description: "The quality indicator of the wind speed. 1 means trustworthy quality." + - name: air_temperature + description: "The air temperature in degree Celsius" + - name: air_temperature_qual + description: "The quality indicator of the air temperature. 1 means trustworthy quality." + # This mapping refers to the processed data stored as Parquet on the local filesystem measurements: diff --git a/examples/weather/model/aggregates.yml b/examples/weather/model/aggregates.yml index aba6ea933..bdd95f36e 100644 --- a/examples/weather/model/aggregates.yml +++ b/examples/weather/model/aggregates.yml @@ -5,8 +5,10 @@ relations: format: parquet # Specify the base directory where all data is stored. This location does not include the partition pattern location: "$basedir/aggregates/" - # Specify the pattern how to identify files and/or partitions. This pattern is relative to the `location` - pattern: "${year}" + # You could specify the pattern how to identify files and/or partitions. This pattern is relative to the `location`. + # Actually, it is highly recommended NOT to explicitly specify a partition pattern for outgoing relations + # and let Spark generate this according to the Hive standard. 
+ #pattern: "${year}" # Add partition column, which can be used in the `pattern` partitions: - name: year @@ -31,3 +33,30 @@ relations: type: FLOAT - name: avg_temperature type: FLOAT + + documentation: + description: "The aggregate table contains min/max temperature value per year and country" + columns: + - name: country + checks: + - kind: notNull + - name: year + checks: + - kind: notNull + - name: min_wind_speed + checks: + - kind: expression + expression: "min_wind_speed >= 0" + - name: min_temperature + checks: + - kind: expression + expression: "min_temperature >= -100" + - name: max_temperature + checks: + - kind: expression + expression: "max_temperature <= 100" + checks: + kind: primaryKey + columns: + - country + - year diff --git a/examples/weather/model/measurements-raw.yml b/examples/weather/model/measurements-raw.yml index d114cc6b4..c0d4f28e7 100644 --- a/examples/weather/model/measurements-raw.yml +++ b/examples/weather/model/measurements-raw.yml @@ -8,6 +8,7 @@ relations: - name: year type: integer granularity: 1 + description: "The year when the measurement was made" schema: kind: embedded fields: diff --git a/examples/weather/model/measurements.yml b/examples/weather/model/measurements.yml index 8e069aa1a..369af728b 100644 --- a/examples/weather/model/measurements.yml +++ b/examples/weather/model/measurements.yml @@ -3,11 +3,63 @@ relations: kind: file format: parquet location: "$basedir/measurements/" - pattern: "${year}" partitions: - name: year type: integer granularity: 1 + # The following schema would use an explicitly specified schema + #schema: + # kind: avro + # file: "${project.basedir}/schema/measurements.avsc" + + # We prefer to use the inferred schema of the mapping that is written into the relation schema: - kind: avro - file: "${project.basedir}/schema/measurements.avsc" + kind: mapping + mapping: measurements_extracted + + documentation: + description: "This model contains all individual measurements" + # This section contains additional documentation to the columns, including some simple test cases + columns: + - name: year + description: "The year of the measurement, used for partitioning the data" + checks: + - kind: notNull + - kind: range + lower: 1901 + upper: 2022 + - name: usaf + checks: + - kind: notNull + - name: wban + checks: + - kind: notNull + - name: date + checks: + - kind: notNull + - name: time + checks: + - kind: notNull + - name: wind_direction_qual + checks: + - kind: notNull + - name: wind_direction + checks: + - kind: notNull + - kind: expression + expression: "(wind_direction >= 0 AND wind_direction <= 360) OR wind_direction_qual <> 1" + - name: air_temperature_qual + checks: + - kind: notNull + - kind: values + values: [0,1,2,3,4,5,6,7,8,9] + # Schema tests, which might involve multiple columns + checks: + kind: foreignKey + relation: stations + columns: + - usaf + - wban + references: + - usaf + - wban diff --git a/examples/weather/model/stations.yml b/examples/weather/model/stations.yml index c7f21f0d7..830ef88d2 100644 --- a/examples/weather/model/stations.yml +++ b/examples/weather/model/stations.yml @@ -1,8 +1,16 @@ relations: stations: kind: file + description: "The 'stations' table contains meta data on all weather stations" format: parquet location: "$basedir/stations/" schema: kind: avro file: "${project.basedir}/schema/stations.avsc" + + documentation: + checks: + kind: primaryKey + columns: + - usaf + - wban diff --git a/examples/weather/project.yml b/examples/weather/project.yml index 45979cd4e..ee1536669 100644 --- 
a/examples/weather/project.yml +++ b/examples/weather/project.yml @@ -1,6 +1,13 @@ name: "weather" version: "1.0" +description: " + This is a simple but very comprehensive example project for Flowman using publicly available weather data. + The project will demonstrate many features of Flowman, like reading and writing data, performing data transformations, + joining, filtering and aggregations. The project will also create a meaningful documentation containing data quality + tests. + " +# The following modules simply contain a list of subdirectories containing the specification files modules: - model - mapping diff --git a/examples/weather/schema/measurements.avsc b/examples/weather/schema/measurements.avsc index c81633a53..ced3fc4fc 100644 --- a/examples/weather/schema/measurements.avsc +++ b/examples/weather/schema/measurements.avsc @@ -5,23 +5,27 @@ "fields": [ { "name": "usaf", - "type": "int" + "type": "int", + "doc": "USAF station id" }, { "name": "wban", - "type": "int" + "type": "int", + "doc": "WBAN station id" }, { "name": "date", - "type": { "type": "int", "logicalType": "date" } + "type": { "type": "int", "logicalType": "date" }, + "doc": "The date when the measurement was made" }, { "name": "time", - "type": "string" + "type": "string", + "doc": "The time when the measurement was made" }, { "name": "wind_direction", - "type": [ "string", "null" ] + "type": [ "int", "null" ] }, { "name": "wind_direction_qual", diff --git a/examples/weather/schema/stations.avsc b/examples/weather/schema/stations.avsc index 13f208e46..6a7cddd56 100644 --- a/examples/weather/schema/stations.avsc +++ b/examples/weather/schema/stations.avsc @@ -5,47 +5,58 @@ "fields": [ { "name": "usaf", - "type": "int" + "type": "int", + "doc": "USAF station id" }, { "name": "wban", - "type": "int" + "type": "int", + "doc": "WBAN station id" }, { "name": "name", - "type": [ "string", "null" ] + "type": [ "string", "null" ], + "doc": "An optional name for the weather station" }, { "name": "country", - "type": [ "string", "null" ] + "type": [ "string", "null" ], + "doc": "The country the weather station belongs to" }, { "name": "state", - "type": [ "string", "null" ] + "type": [ "string", "null" ], + "doc": "Optional state within the country the weather station belongs to" }, { "name": "icao", - "type": [ "string", "null" ] + "type": [ "string", "null" ], + "doc": "" }, { "name": "latitude", - "type": [ "float", "null" ] + "type": [ "float", "null" ], + "doc": "The latitude of the geo location of the weather station" }, { "name": "longitude", - "type": [ "float", "null" ] + "type": [ "float", "null" ], + "doc": "The longitude of the geo location of the weather station" }, { "name": "elevation", - "type": [ "float", "null" ] + "type": [ "float", "null" ], + "doc": "The elevation above sea level in meters of the weather station" }, { "name": "date_begin", - "type": [ { "type": "int", "logicalType": "date" }, "null" ] + "type": [ { "type": "int", "logicalType": "date" }, "null" ], + "doc": "The date when the weather station went into service" }, { "name": "date_end", - "type": [ { "type": "int", "logicalType": "date" }, "null" ] + "type": [ { "type": "int", "logicalType": "date" }, "null" ], + "doc": "The date when the weather station went out of service" } ] } diff --git a/examples/weather/target/aggregates.yml b/examples/weather/target/aggregates.yml index a1109f2ab..0cf9a007c 100644 --- a/examples/weather/target/aggregates.yml +++ b/examples/weather/target/aggregates.yml @@ -1,6 +1,7 @@ targets: aggregates: kind: 
relation + description: "Write aggregated measurements per year" mapping: aggregates relation: aggregates partition: diff --git a/examples/weather/target/documentation.yml b/examples/weather/target/documentation.yml new file mode 100644 index 000000000..7f0949506 --- /dev/null +++ b/examples/weather/target/documentation.yml @@ -0,0 +1,5 @@ +targets: + # This target will create a documentation in the VERIFY phase + documentation: + kind: documentation + # We do not specify any additional configuration, so the project's documentation.yml file will be used diff --git a/examples/weather/target/measurements.yml b/examples/weather/target/measurements.yml index 73e05bad0..cfdb09314 100644 --- a/examples/weather/target/measurements.yml +++ b/examples/weather/target/measurements.yml @@ -1,6 +1,7 @@ targets: measurements: kind: relation + description: "Write extracted measurements per year" mapping: measurements_extracted relation: measurements partition: diff --git a/examples/weather/target/metrics.yml b/examples/weather/target/metrics.yml new file mode 100644 index 000000000..48350b6f5 --- /dev/null +++ b/examples/weather/target/metrics.yml @@ -0,0 +1,24 @@ +targets: + metrics: + kind: measure + description: "Collect relevant metrics from measurements, to be published to a metrics collector" + measures: + measurement_metrics: + kind: sql + # The following SQL will provide the following metrics: + # - valid_wind_direction + # - invalid_wind_direction + # - valid_wind_speed + # - invalid_wind_speed + # - valid_air_temperature + # - invalid_air_temperature + query: " + SELECT + SUM(IF(wind_direction_qual = '1', 1, 0)) AS valid_wind_direction, + SUM(IF(wind_direction_qual <> '1', 1, 0)) AS invalid_wind_direction, + SUM(IF(wind_speed_qual = '1', 1, 0)) AS valid_wind_speed, + SUM(IF(wind_speed_qual <> '1', 1, 0)) AS invalid_wind_speed, + SUM(IF(air_temperature_qual = '1', 1, 0)) AS valid_air_temperature, + SUM(IF(air_temperature_qual <> '1', 1, 0)) AS invalid_air_temperature + FROM measurements + " diff --git a/examples/weather/target/stations.yml b/examples/weather/target/stations.yml index ad32188b5..294889278 100644 --- a/examples/weather/target/stations.yml +++ b/examples/weather/target/stations.yml @@ -1,5 +1,9 @@ targets: stations: kind: relation + description: "Write stations" mapping: stations_raw relation: stations + + documentation: + description: "This build target is used to write the weather stations" diff --git a/flowman-client/pom.xml b/flowman-client/pom.xml index 323a4e213..c7afe386f 100644 --- a/flowman-client/pom.xml +++ b/flowman-client/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml @@ -51,7 +51,7 @@ - *:*:sources + *:sources diff --git a/flowman-common/pom.xml b/flowman-common/pom.xml index 22ea1b313..91c7961c5 100644 --- a/flowman-common/pom.xml +++ b/flowman-common/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-common/src/main/scala/com/dimajix/common/SynchronizedMap.scala b/flowman-common/src/main/scala/com/dimajix/common/SynchronizedMap.scala index b63510137..733e0b31c 100644 --- a/flowman-common/src/main/scala/com/dimajix/common/SynchronizedMap.scala +++ b/flowman-common/src/main/scala/com/dimajix/common/SynchronizedMap.scala @@ -91,6 +91,16 @@ case class SynchronizedMap[K,V](impl:mutable.Map[K,V]) { } } + /** + * Remove a value from the map + * @param key + */ + def remove(key: K) : Unit = { + synchronized { + impl.remove(key) + } + } + /** Retrieves the value which is associated 
with the given key. This * method invokes the `default` method of the map if there is no mapping * from the given key to a value. Unless overridden, the `default` method throws a @@ -147,11 +157,21 @@ case class SynchronizedMap[K,V](impl:mutable.Map[K,V]) { toSeq.iterator } + /** Collects all keys of this map in an Set. + * + * @return the keys of this map as a Set. + */ + def keys : Set[K] = { + synchronized { + impl.keySet.toSet + } + } + /** Collects all values of this map in an iterable collection. * - * @return the values of this map as an iterable. + * @return the values of this map as a Sequence. */ - def values: Iterable[V] = { + def values: Seq[V] = { synchronized { Seq(impl.values.toSeq:_*) } diff --git a/flowman-core/pom.xml b/flowman-core/pom.xml index 288e68fac..4202ffd3a 100644 --- a/flowman-core/pom.xml +++ b/flowman-core/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ColumnCheckExecutor b/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ColumnCheckExecutor new file mode 100644 index 000000000..6b3e56004 --- /dev/null +++ b/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ColumnCheckExecutor @@ -0,0 +1 @@ +com.dimajix.flowman.documentation.DefaultColumnCheckExecutor diff --git a/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.SchemaCheckExecutor b/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.SchemaCheckExecutor new file mode 100644 index 000000000..85fb4a1a8 --- /dev/null +++ b/flowman-core/src/main/resources/META-INF/services/com.dimajix.flowman.spi.SchemaCheckExecutor @@ -0,0 +1 @@ +com.dimajix.flowman.documentation.DefaultSchemaCheckExecutor diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/HiveCatalog.scala b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/HiveCatalog.scala index e14f20220..f5ee93c97 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/HiveCatalog.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/HiveCatalog.scala @@ -23,7 +23,6 @@ import scala.collection.mutable import org.apache.hadoop.fs.Path import org.apache.spark.sql.SparkSession import org.apache.spark.sql.SparkShim -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.DatabaseAlreadyExistsException import org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException import org.apache.spark.sql.catalyst.analysis.NoSuchPartitionException @@ -32,18 +31,15 @@ import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.catalog.CatalogTable import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition import org.apache.spark.sql.catalyst.catalog.CatalogTableType -import org.apache.spark.sql.catalyst.plans.logical.AnalysisOnlyCommand import org.apache.spark.sql.execution.command.AlterTableAddColumnsCommand import org.apache.spark.sql.execution.command.AlterTableAddPartitionCommand import org.apache.spark.sql.execution.command.AlterTableChangeColumnCommand import org.apache.spark.sql.execution.command.AlterTableDropPartitionCommand import org.apache.spark.sql.execution.command.AlterTableSetLocationCommand -import org.apache.spark.sql.execution.command.AlterViewAsCommand import org.apache.spark.sql.execution.command.AnalyzePartitionCommand import org.apache.spark.sql.execution.command.AnalyzeTableCommand import 
org.apache.spark.sql.execution.command.CreateDatabaseCommand import org.apache.spark.sql.execution.command.CreateTableCommand -import org.apache.spark.sql.execution.command.CreateViewCommand import org.apache.spark.sql.execution.command.DropDatabaseCommand import org.apache.spark.sql.execution.command.DropTableCommand import org.apache.spark.sql.hive.HiveClientShim @@ -141,7 +137,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex val dbName = formatDatabaseName(database) catalog.externalCatalog .listTables(dbName) - .map(name =>TableIdentifier(name, Some(database))) + .map(name =>TableIdentifier(name, Seq(database))) } /** @@ -154,7 +150,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex val dbName = formatDatabaseName(database) catalog.externalCatalog .listTables(dbName, pattern) - .map(name =>TableIdentifier(name, Some(database))) + .map(name =>TableIdentifier(name, Seq(database))) } /** @@ -167,7 +163,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex def createTable(table:CatalogTable, ignoreIfExists:Boolean) : Unit = { require(table != null) - val exists = tableExists(table.identifier) + val exists = tableExists(TableIdentifier.of(table)) if (!ignoreIfExists && exists) { throw new TableAlreadyExistsException(table.identifier.database.getOrElse(""), table.identifier.table) } @@ -208,7 +204,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex } if (config.flowmanConf.hiveAnalyzeTable) { - val cmd = AnalyzeTableCommand(table, false) + val cmd = AnalyzeTableCommand(table.toSpark, false) cmd.run(spark) } @@ -225,7 +221,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex * @return */ def tableExists(name:TableIdentifier) : Boolean = { - catalog.tableExists(name) + catalog.tableExists(name.toSpark) } /** @@ -240,7 +236,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex val db = formatDatabaseName(name.database.getOrElse(catalog.getCurrentDatabase)) val table = formatTableName(name.table) requireDbExists(db) - requireTableExists(TableIdentifier(table, Some(db))) + requireTableExists(TableIdentifier(table, Seq(db))) catalog.externalCatalog.getTable(db, table) } @@ -279,7 +275,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex // Delete all partitions if (catalogTable.partitionSchema != null && catalogTable.partitionSchema.fields.nonEmpty) { - catalog.listPartitions(table).foreach { p => + catalog.listPartitions(table.toSpark).foreach { p => val location = new Path(p.location) val fs = location.getFileSystem(hadoopConf) FileUtils.deleteLocation(fs, location) @@ -287,7 +283,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex } // Delete table itself - val cmd = DropTableCommand(table, ignoreIfNotExists, false, true) + val cmd = DropTableCommand(table.toSpark, ignoreIfNotExists, false, true) cmd.run(spark) // Delete location to cleanup any remaining files @@ -315,7 +311,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex // First drop partitions if (catalogTable.partitionSchema != null && catalogTable.partitionSchema.fields.nonEmpty) { - dropPartitions(table, catalog.listPartitions(table).map(p => PartitionSpec(p.spec))) + dropPartitions(table, catalog.listPartitions(table.toSpark).map(p => PartitionSpec(p.spec))) } // Then cleanup directory from any remainders @@ -352,18 +348,18 @@ final 
class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex logger.info(s"Updating nullability of column ${u.column} to ${u.nullable} in Hive table '$table'") val field = tableColumns.getOrElse(u.column.toLowerCase(Locale.ROOT), throw new IllegalArgumentException(s"Table column ${u.column} does not exist in table $table")) .copy(nullable = u.nullable) - val cmd = AlterTableChangeColumnCommand(table, u.column, field) + val cmd = AlterTableChangeColumnCommand(table.toSpark, u.column, field) cmd.run(spark) case u:UpdateColumnComment => logger.info(s"Updating comment of column ${u.column} in Hive table '$table'") val field = tableColumns.getOrElse(u.column.toLowerCase(Locale.ROOT), throw new IllegalArgumentException(s"Table column ${u.column} does not exist in table $table")) .withComment(u.comment.getOrElse("")) - val cmd = AlterTableChangeColumnCommand(table, u.column, field) + val cmd = AlterTableChangeColumnCommand(table.toSpark, u.column, field) cmd.run(spark) case x:TableChange => throw new UnsupportedOperationException(s"Unsupported table change $x for Hive table $table") } - val cmd = AlterTableAddColumnsCommand(table, colsToAdd) + val cmd = AlterTableAddColumnsCommand(table.toSpark, colsToAdd) cmd.run(spark) externalCatalogs.foreach(_.alterTable(catalogTable)) @@ -384,7 +380,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex val catalogTable = getTable(table) require(catalogTable.tableType != CatalogTableType.VIEW) - val cmd = AlterTableAddColumnsCommand(table, colsToAdd) + val cmd = AlterTableAddColumnsCommand(table.toSpark, colsToAdd) cmd.run(spark) externalCatalogs.foreach(_.alterTable(catalogTable)) @@ -401,7 +397,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex def partitionExists(table:TableIdentifier, partition:PartitionSpec) : Boolean = { require(table != null) require(partition != null) - catalog.listPartitions(table, Some(partition.mapValues(_.toString).toMap).filter(_.nonEmpty)).nonEmpty + catalog.listPartitions(table.toSpark, Some(partition.mapValues(_.toString).toMap).filter(_.nonEmpty)).nonEmpty } /** @@ -412,7 +408,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex @throws[NoSuchTableException] @throws[NoSuchPartitionException] def getPartition(table: TableIdentifier, partition:PartitionSpec): CatalogTablePartition = { - catalog.getPartition(table, partition.mapValues(_.toString).toMap) + catalog.getPartition(table.toSpark, partition.mapValues(_.toString).toMap) } /** @@ -458,14 +454,14 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex logger.info(s"Adding partition ${partition.spec} to table $table at '$location'") val sparkPartition = partition.mapValues(_.toString).toMap - val cmd = AlterTableAddPartitionCommand(table, Seq((sparkPartition, Some(location.toString))), false) + val cmd = AlterTableAddPartitionCommand(table.toSpark, Seq((sparkPartition, Some(location.toString))), false) cmd.run(spark) analyzePartition(table, sparkPartition) externalCatalogs.foreach { ec => val catalogTable = getTable(table) - val catalogPartition = catalog.getPartition(table, sparkPartition) + val catalogPartition = catalog.getPartition(table.toSpark, sparkPartition) ec.addPartition(catalogTable, catalogPartition) } } @@ -486,7 +482,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex val sparkPartition = partition.mapValues(_.toString).toMap if (partitionExists(table, partition)) { 
logger.info(s"Replacing partition ${partition.spec} of table $table with location '$location'") - val cmd = AlterTableSetLocationCommand(table, Some(sparkPartition), location.toString) + val cmd = AlterTableSetLocationCommand(table.toSpark, Some(sparkPartition), location.toString) cmd.run(spark) refreshPartition(table, partition) @@ -519,14 +515,14 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex externalCatalogs.foreach { ec => val catalogTable = getTable(table) - val catalogPartition = catalog.getPartition(table, sparkPartition) + val catalogPartition = catalog.getPartition(table.toSpark, sparkPartition) ec.alterPartition(catalogTable, catalogPartition) } } private def analyzePartition(table:TableIdentifier, sparkPartition:Map[String,String]) : Unit = { def doIt(): Unit = { - val cmd = AnalyzePartitionCommand(table, sparkPartition.map { case (k, v) => k -> Some(v) }, false) + val cmd = AnalyzePartitionCommand(table.toSpark, sparkPartition.map { case (k, v) => k -> Some(v) }, false) cmd.run(spark) } @@ -563,7 +559,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex externalCatalogs.foreach { ec => val sparkPartition = partition.mapValues(_.toString).toMap val catalogTable = getTable(table) - val catalogPartition = catalog.getPartition(table, sparkPartition) + val catalogPartition = catalog.getPartition(table.toSpark, sparkPartition) ec.truncatePartition(catalogTable, catalogPartition) } } @@ -602,7 +598,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex // Convert to Spark partitions val sparkPartitions = dropPartitions.map(_.mapValues(_.toString).toMap) // Convert to external catalog partitions which can be reused in the last step - val catalogPartitions = sparkPartitions.map(catalog.getPartition(table, _)).filter(_ != null) + val catalogPartitions = sparkPartitions.map(catalog.getPartition(table.toSpark, _)).filter(_ != null) logger.info(s"Dropping partitions ${dropPartitions.map(_.spec).mkString(",")} from Hive table $table") catalogPartitions.foreach { partition => @@ -612,7 +608,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex } // Note that "purge" is not supported with Hive < 1.2 - val cmd = AlterTableDropPartitionCommand(table, sparkPartitions, ignoreIfNotExists, purge = false, retainData = false) + val cmd = AlterTableDropPartitionCommand(table.toSpark, sparkPartitions, ignoreIfNotExists, purge = false, retainData = false) cmd.run(spark) externalCatalogs.foreach { ec => @@ -635,12 +631,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex logger.info(s"Creating Hive view $table") val plan = spark.sql(select).queryExecution.analyzed - //@annotation.nowarn // Disable warning about unreachable code for Spark 3.2 - val cmd = CreateViewCommand(table, Nil, None, Map(), Some(select), plan, false, false, SparkShim.PersistedView) match { - // Workaround for providing compatibility with Spark 3.2 and older versions - case ac:AnalysisOnlyCommand => ac.markAsAnalyzed().asInstanceOf[CreateViewCommand] - case c:CreateViewCommand => c - } + val cmd = SparkShim.createView(table.toSpark, select, plan, false, false) cmd.run(spark) // Publish view to external catalog @@ -656,13 +647,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex logger.info(s"Redefining Hive view $table") val plan = spark.sql(select).queryExecution.analyzed - //@annotation.nowarn // Disable warning about unreachable code 
for Spark 3.2 - val cmd = AlterViewAsCommand(table, select, plan) match { - // Workaround for providing compatibility with Spark 3.2 and older versions - case ac:AnalysisOnlyCommand => ac.markAsAnalyzed().asInstanceOf[AlterViewAsCommand] - case c:AlterViewAsCommand => c - } - + val cmd = SparkShim.alterView(table.toSpark, select, plan) cmd.run(spark) // Publish view to external catalog @@ -689,7 +674,7 @@ final class HiveCatalog(val spark:SparkSession, val config:Configuration, val ex require(catalogTable.tableType == CatalogTableType.VIEW) // Delete table itself - val cmd = DropTableCommand(table, ignoreIfNotExists, true, false) + val cmd = DropTableCommand(table.toSpark, ignoreIfNotExists, true, false) cmd.run(spark) // Remove table from external catalog diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableChange.scala b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableChange.scala index 72a6f2f0a..e7fc65f79 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableChange.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableChange.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -26,8 +26,9 @@ import com.dimajix.flowman.types.SchemaUtils.coerce import com.dimajix.flowman.types.StructType -abstract sealed class TableChange +abstract sealed class TableChange extends Product with Serializable abstract sealed class ColumnChange extends TableChange +abstract sealed class IndexChange extends TableChange object TableChange { case class ReplaceTable(schema:StructType) extends TableChange @@ -38,6 +39,11 @@ object TableChange { case class UpdateColumnType(column:String, dataType:FieldType) extends ColumnChange case class UpdateColumnComment(column:String, comment:Option[String]) extends ColumnChange + case class CreatePrimaryKey(columns:Seq[String]) extends IndexChange + case class DropPrimaryKey() extends IndexChange + case class CreateIndex(name:String, columns:Seq[String], unique:Boolean) extends IndexChange + case class DropIndex(name:String) extends IndexChange + /** * Creates a Sequence of [[TableChange]] objects, which will transform a source schema into a target schema. * The specified [[MigrationPolicy]] is used to decide on a per-column basis, if a migration is required. 
@@ -46,11 +52,33 @@ object TableChange { * @param migrationPolicy * @return */ - def migrate(sourceSchema:StructType, targetSchema:StructType, migrationPolicy:MigrationPolicy) : Seq[TableChange] = { - val targetFields = targetSchema.fields.map(f => (f.name.toLowerCase(Locale.ROOT), f)) + def migrate(sourceTable:TableDefinition, targetTable:TableDefinition, migrationPolicy:MigrationPolicy) : Seq[TableChange] = { + val normalizedSource = sourceTable.normalize() + val normalizedTarget = targetTable.normalize() + + // Check which Indexes need to be dropped + val dropIndexes = sourceTable.indexes.flatMap { src => + targetTable.indexes.find(_.name.toLowerCase(Locale.ROOT) == src.name.toLowerCase(Locale.ROOT)) match { + case None => + Some(DropIndex(src.name)) + case Some(tgt) => + if (src.normalize() != tgt.normalize()) + Some(DropIndex(src.name)) + else None + } + } + + // Check if primary key needs to be dropped + val dropPk = if(normalizedSource.primaryKey.nonEmpty && normalizedSource.primaryKey != normalizedTarget.primaryKey) + Some(DropPrimaryKey()) + else + None + + val targetFields = targetTable.columns.map(f => (f.name.toLowerCase(Locale.ROOT), f)) val targetFieldsByName = targetFields.toMap - val sourceFieldsByName = sourceSchema.fields.map(f => (f.name.toLowerCase(Locale.ROOT), f)).toMap + val sourceFieldsByName = sourceTable.columns.map(f => (f.name.toLowerCase(Locale.ROOT), f)).toMap + // Check which fields need to be dropped val dropFields = (sourceFieldsByName.keySet -- targetFieldsByName.keySet).toSeq.flatMap { fieldName => if (migrationPolicy == MigrationPolicy.STRICT) Some(DropColumn(sourceFieldsByName(fieldName).name)) @@ -58,6 +86,7 @@ object TableChange { None } + // Infer column changes val changeFields = targetFields.flatMap { case(tgtName,tgtField) => sourceFieldsByName.get(tgtName) match { case None => Seq(AddColumn(tgtField)) @@ -86,23 +115,64 @@ object TableChange { } } - dropFields ++ changeFields + // Create new PK + val createPk = if (normalizedTarget.primaryKey.nonEmpty && normalizedTarget.primaryKey != normalizedSource.primaryKey) + Some(CreatePrimaryKey(targetTable.primaryKey)) + else + None + + // Create new indexes + val addIndexes = targetTable.indexes.flatMap { tgt => + sourceTable.indexes.find(_.name.toLowerCase(Locale.ROOT) == tgt.name.toLowerCase(Locale.ROOT)) match { + case None => + Some(CreateIndex(tgt.name, tgt.columns, tgt.unique)) + case Some(src) => + if (src.normalize() != tgt.normalize()) + Some(CreateIndex(tgt.name, tgt.columns, tgt.unique)) + else + None + } + } + + dropIndexes ++ dropPk ++ dropFields ++ changeFields ++ createPk ++ addIndexes } - def requiresMigration(sourceSchema:StructType, targetSchema:StructType, migrationPolicy:MigrationPolicy) : Boolean = { - // Ensure that current real Hive schema is compatible with specified schema - migrationPolicy match { + /** + * Performs a check if a migration is required + * @param sourceTable + * @param targetTable + * @param migrationPolicy + * @return + */ + def requiresMigration(sourceTable:TableDefinition, targetTable:TableDefinition, migrationPolicy:MigrationPolicy) : Boolean = { + val normalizedSource = sourceTable.normalize() + val normalizedTarget = targetTable.normalize() + + // Check if PK needs change + val pkChanges = normalizedSource.primaryKey != normalizedTarget.primaryKey + + // Check if indices need change + val dropIndexes = !normalizedSource.indexes.forall(src => + normalizedTarget.indexes.contains(src) + ) + val addIndexes = !normalizedTarget.indexes.forall(tgt => + 
normalizedSource.indexes.contains(tgt) + ) + + // Ensure that current real schema is compatible with specified schema + val columnChanges = migrationPolicy match { case MigrationPolicy.RELAXED => - val sourceFields = sourceSchema.fields.map(f => (f.name.toLowerCase(Locale.ROOT), f)).toMap - targetSchema.fields.exists { tgt => + val sourceFields = sourceTable.columns.map(f => (f.name.toLowerCase(Locale.ROOT), f)).toMap + targetTable.columns.exists { tgt => !sourceFields.get(tgt.name.toLowerCase(Locale.ROOT)) .exists(src => SchemaUtils.isCompatible(tgt, src)) } case MigrationPolicy.STRICT => - SchemaUtils.normalize(sourceSchema) - val sourceFields = SchemaUtils.normalize(sourceSchema).fields.sortBy(_.name) - val targetFields = SchemaUtils.normalize(targetSchema).fields.sortBy(_.name) + val sourceFields = SchemaUtils.normalize(sourceTable.columns).sortBy(_.name) + val targetFields = SchemaUtils.normalize(targetTable.columns).sortBy(_.name) sourceFields != targetFields } + + pkChanges || dropIndexes || addIndexes || columnChanges } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableDefinition.scala b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableDefinition.scala new file mode 100644 index 000000000..fc4b2d19d --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableDefinition.scala @@ -0,0 +1,61 @@ +/* + * Copyright 2018-2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.catalog + +import java.util.Locale + +import org.apache.spark.sql.catalyst.catalog.CatalogTable + +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.StructType + + +object TableDefinition { + def ofTable(table:CatalogTable) : TableDefinition = { + val id = table.identifier + val schema = com.dimajix.flowman.types.StructType.of(table.dataSchema) + TableDefinition(TableIdentifier(id.table, id.database.toSeq), schema.fields) + } +} +final case class TableDefinition( + identifier: TableIdentifier, + columns: Seq[Field] = Seq.empty, + comment: Option[String] = None, + primaryKey: Seq[String] = Seq.empty, + indexes: Seq[TableIndex] = Seq.empty +) { + def schema : StructType = StructType(columns) + + def normalize() : TableDefinition = copy( + columns = columns.map(f => f.copy(name = f.name.toLowerCase(Locale.ROOT))), + primaryKey = primaryKey.map(_.toLowerCase(Locale.ROOT)).sorted, + indexes = indexes.map(_.normalize()) + ) + +} + + +final case class TableIndex( + name: String, + columns: Seq[String], + unique:Boolean = false +) { + def normalize() : TableIndex = copy( + name = name.toLowerCase(Locale.ROOT), + columns = columns.map(_.toLowerCase(Locale.ROOT)).sorted + ) +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableIdentifier.scala b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableIdentifier.scala new file mode 100644 index 000000000..8deec49cf --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/catalog/TableIdentifier.scala @@ -0,0 +1,60 @@ +/* + * Copyright 2018-2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
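// Illustrative sketch, not part of the original change: how the index- and primary-key-aware
// migrate() above is expected to behave for two catalog tables. TableDefinition, TableIdentifier
// and TableIndex are the new classes from com.dimajix.flowman.catalog; the imports for
// TableChange and MigrationPolicy are omitted because their packages are not visible in this excerpt.
import com.dimajix.flowman.catalog.{TableDefinition, TableIdentifier, TableIndex}

val source = TableDefinition(
    TableIdentifier("sales", Some("crm")),
    primaryKey = Seq("id"),
    indexes = Seq(TableIndex("idx_customer", Seq("customer_id")))
)
// Same primary key, but the index now also covers the transaction date
val target = source.copy(
    indexes = Seq(TableIndex("idx_customer", Seq("customer_id", "tx_date")))
)

// Since the normalized index definitions differ while the primary key is unchanged, this should
// yield DropIndex("idx_customer") followed by CreateIndex("idx_customer", Seq("customer_id", "tx_date"), unique = false)
val changes = TableChange.migrate(source, target, MigrationPolicy.RELAXED)

// The new TableIdentifier also takes care of quoting, e.g. `crm`.`sales`
println(TableIdentifier("sales", Some("crm")).quotedString)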
+ */ + +package com.dimajix.flowman.catalog + +import org.apache.spark.sql.catalyst.catalog.CatalogTable + + +object TableIdentifier { + def apply(table: String, db:Option[String]) : TableIdentifier = TableIdentifier(table, db.toSeq) + + def empty : TableIdentifier = TableIdentifier("", Seq.empty) + + def of(table:CatalogTable) : TableIdentifier = { + of(table.identifier) + } + def of(id: org.apache.spark.sql.catalyst.TableIdentifier) : TableIdentifier = { + TableIdentifier(id.table, id.database.toSeq) + } +} +final case class TableIdentifier( + table: String, + space: Seq[String] = Seq.empty +) { + private def quoteIdentifier(name: String): String = s"`${name.replace("`", "``")}`" + + def quotedString: String = { + val replacedId = quoteIdentifier(table) + val replacedSpace = space.map(quoteIdentifier) + + if (replacedSpace.nonEmpty) s"${replacedSpace.mkString(".")}.$replacedId" else replacedId + } + + def unquotedString: String = { + if (space.nonEmpty) s"${space.mkString(".")}.$table" else table + } + + def toSpark : org.apache.spark.sql.catalyst.TableIdentifier = { + org.apache.spark.sql.catalyst.TableIdentifier(table, database) + } + + def quotedDatabase : Option[String] = if (space.nonEmpty) Some(space.map(quoteIdentifier).mkString(".")) else None + def unquotedDatabase : Option[String] = if (space.nonEmpty) Some(space.mkString(".")) else None + def database : Option[String] = unquotedDatabase + + override def toString: String = quotedString +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/config/FlowmanConf.scala b/flowman-core/src/main/scala/com/dimajix/flowman/config/FlowmanConf.scala index b0c49a130..0ced47940 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/config/FlowmanConf.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/config/FlowmanConf.scala @@ -31,6 +31,7 @@ import com.dimajix.flowman.execution.OutputMode import com.dimajix.flowman.execution.SimpleExecutor import com.dimajix.flowman.execution.DependencyScheduler import com.dimajix.flowman.execution.Scheduler +import com.dimajix.flowman.model.VerifyPolicy import com.dimajix.flowman.transforms.ColumnMismatchStrategy import com.dimajix.flowman.transforms.TypeMismatchStrategy import com.dimajix.spark.features @@ -101,6 +102,14 @@ object FlowmanConf { .doc("Parallelism of mapping instantiation") .intConf .createWithDefault(1) + val EXECUTION_MAPPING_SCHEMA_CACHE = buildConf("flowman.execution.mapping.schemaCache") + .doc("Cache schema information of mapping instances") + .booleanConf + .createWithDefault(true) + val EXECUTION_RELATION_SCHEMA_CACHE = buildConf("flowman.execution.relation.schemaCache") + .doc("Cache schema information of relation instances") + .booleanConf + .createWithDefault(true) val DEFAULT_RELATION_MIGRATION_POLICY = buildConf("flowman.default.relation.migrationPolicy") .doc("Default migration policy. Allowed values are 'relaxed' and 'strict'") @@ -128,6 +137,10 @@ object FlowmanConf { .stringConf .createWithDefault(TypeMismatchStrategy.CAST_ALWAYS.toString) + val DEFAULT_TARGET_VERIFY_POLICY = buildConf("flowman.default.target.verifyPolicy") + .doc("Policy for verifying a target. 
Accepted verify policies are 'empty_as_success', 'empty_as_failure' and 'empty_as_success_with_errors'.") + .stringConf + .createWithDefault(VerifyPolicy.EMPTY_AS_FAILURE.toString) val DEFAULT_TARGET_OUTPUT_MODE = buildConf("flowman.default.target.outputMode") .doc("Default output mode of targets") .stringConf diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Category.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Category.scala new file mode 100644 index 000000000..ec005a106 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Category.scala @@ -0,0 +1,46 @@ +/* + * Copyright 2018-2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import java.util.Locale + + +sealed abstract class Category extends Product with Serializable { + def lower : String = toString.toLowerCase(Locale.ROOT) + def upper : String = toString.toUpperCase(Locale.ROOT) +} + +object Category { + case object PROJECT extends Category + case object COLUMN extends Category + case object MAPPING extends Category + case object RELATION extends Category + case object SCHEMA extends Category + case object TARGET extends Category + + def ofString(category:String) : Category = { + category.toLowerCase(Locale.ROOT) match { + case "column" => COLUMN + case "mapping" => MAPPING + case "project" => PROJECT + case "relation" => RELATION + case "schema" => SCHEMA + case "target" => TARGET + case _ => throw new IllegalArgumentException(s"No such category $category") + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckCollector.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckCollector.scala new file mode 100644 index 000000000..86a165090 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckCollector.scala @@ -0,0 +1,64 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
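// Illustrative sketch, not part of the original change: round-tripping the new documentation
// Category enum defined above.
import com.dimajix.flowman.documentation.Category

val cat = Category.ofString("relation")
assert(cat == Category.RELATION)
assert(cat.lower == "relation" && cat.upper == "RELATION")
// Unknown names fail fast:
// Category.ofString("widget")  // throws IllegalArgumentException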
+ */ + +package com.dimajix.flowman.documentation + +import org.slf4j.LoggerFactory + +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph + + +class CheckCollector extends Collector { + private val logger = LoggerFactory.getLogger(getClass) + + /** + * This will execute all checks and change the documentation accordingly + * @param execution + * @param graph + * @param documentation + * @return + */ + override def collect(execution: Execution, graph: Graph, documentation: ProjectDoc): ProjectDoc = { + val resolver = new ReferenceResolver(graph) + val executor = new CheckExecutor(execution) + val mappings = documentation.mappings.map { m => + resolver.resolve(m.reference) match { + case None => + // This should not happen - but who knows... + logger.warn(s"Cannot find mapping for document reference '${m.reference.toString}'") + m + case Some(mapping) => + executor.executeTests(mapping, m) + } + } + val relations = documentation.relations.map { r => + resolver.resolve(r.reference) match { + case None => + // This should not happen - but who knows... + logger.warn(s"Cannot find relation for document reference '${r.reference.toString}'") + r + case Some(relation) => + executor.executeTests(relation, r) + } + } + + documentation.copy( + mappings = mappings, + relations = relations + ) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckExecutor.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckExecutor.scala new file mode 100644 index 000000000..62db0ff3e --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckExecutor.scala @@ -0,0 +1,173 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.documentation + +import scala.util.control.NonFatal + +import org.apache.spark.sql.DataFrame +import org.slf4j.LoggerFactory + +import com.dimajix.common.ExceptionUtils.reasons +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.spi.ColumnCheckExecutor +import com.dimajix.flowman.spi.SchemaCheckExecutor + + +class CheckExecutor(execution: Execution) { + private val logger = LoggerFactory.getLogger(getClass) + private val columnTestExecutors = ColumnCheckExecutor.executors + private val schemaTestExecutors = SchemaCheckExecutor.executors + + /** + * Executes all checks for a relation as defined within the documentation + * @param relation + * @param doc + * @return + */ + def executeTests(relation:Relation, doc:RelationDoc) : RelationDoc = { + val schemaDoc = doc.schema.map { schema => + if (containsTests(schema)) { + logger.info(s"Conducting checks on relation '${relation.identifier}'") + try { + val df = relation.read(execution, doc.partitions) + runSchemaTests(relation.context, df, schema) + } catch { + case NonFatal(ex) => + logger.warn(s"Error executing checks for relation '${relation.identifier}': ${reasons(ex)}") + failSchemaTests(schema) + } + } + else { + schema + } + } + doc.copy(schema=schemaDoc) + } + + /** + * Executes all checks for a mapping as defined within the documentation + * @param relation + * @param doc + * @return + */ + def executeTests(mapping:Mapping, doc:MappingDoc) : MappingDoc = { + val outputs = doc.outputs.map { output => + val schema = output.schema.map { schema => + if (containsTests(schema)) { + logger.info(s"Conducting checks on mapping '${mapping.identifier}'") + try { + val df = execution.instantiate(mapping, output.name) + runSchemaTests(mapping.context, df, schema) + } catch { + case NonFatal(ex) => + logger.warn(s"Error executing checks for mapping '${mapping.identifier}': ${reasons(ex)}") + failSchemaTests(schema) + } + } + else { + schema + } + } + output.copy(schema=schema) + } + doc.copy(outputs=outputs) + } + + private def containsTests(doc:SchemaDoc) : Boolean = { + doc.checks.nonEmpty || containsTests(doc.columns) + } + private def containsTests(docs:Seq[ColumnDoc]) : Boolean = { + docs.exists(col => col.checks.nonEmpty || containsTests(col.children)) + } + + private def failSchemaTests(schema:SchemaDoc) : SchemaDoc = { + val columns = failColumnTests(schema.columns) + val tests = schema.checks.map { test => + val result = CheckResult(Some(test.reference), status = CheckStatus.ERROR) + test.withResult(result) + } + schema.copy(columns=columns, checks=tests) + } + private def failColumnTests(columns:Seq[ColumnDoc]) : Seq[ColumnDoc] = { + columns.map(col => failColumnTests(col)) + } + private def failColumnTests(column:ColumnDoc) : ColumnDoc = { + val tests = column.checks.map { test => + val result = CheckResult(Some(test.reference), status = CheckStatus.ERROR) + test.withResult(result) + } + val children = failColumnTests(column.children) + column.copy(children=children, checks=tests) + } + + private def runSchemaTests(context:Context, df:DataFrame, schema:SchemaDoc) : SchemaDoc = { + val columns = runColumnTests(context, df, schema.columns) + val tests = schema.checks.map { test => + logger.info(s" - Executing schema test '${test.name}'") + val result = + try { + val result = schemaTestExecutors.flatMap(_.execute(execution, context, df, test)).headOption 
+ result match { + case None => + logger.warn(s"Could not find appropriate test executor for testing schema") + CheckResult(Some(test.reference), status = CheckStatus.NOT_RUN) + case Some(result) => + result.reparent(test.reference) + } + } catch { + case NonFatal(ex) => + logger.warn(s"Error executing column test: ${reasons(ex)}") + CheckResult(Some(test.reference), status = CheckStatus.ERROR) + + } + test.withResult(result) + } + + schema.copy(columns=columns, checks=tests) + } + private def runColumnTests(context:Context, df:DataFrame, columns:Seq[ColumnDoc], path:String = "") : Seq[ColumnDoc] = { + columns.map(col => runColumnTests(context, df, col, path)) + } + private def runColumnTests(context:Context, df:DataFrame, column:ColumnDoc, path:String) : ColumnDoc = { + val columnPath = path + column.name + val tests = column.checks.map { test => + logger.info(s" - Executing test '${test.name}' on column ${columnPath}") + val result = + try { + val result = columnTestExecutors.flatMap(_.execute(execution, context, df, columnPath, test)).headOption + result match { + case None => + logger.warn(s"Could not find appropriate test executor for testing column $columnPath") + CheckResult(Some(test.reference), status = CheckStatus.NOT_RUN) + case Some(result) => + result.reparent(test.reference) + } + } catch { + case NonFatal(ex) => + logger.warn(s"Error executing column test: ${reasons(ex)}") + CheckResult(Some(test.reference), status = CheckStatus.ERROR) + + } + test.withResult(result) + } + val children = runColumnTests(context, df, column.children, path + column.name + ".") + column.copy(children=children, checks=tests) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckResult.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckResult.scala new file mode 100644 index 000000000..721d15725 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/CheckResult.scala @@ -0,0 +1,83 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
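// Illustrative sketch, not part of the original change: running the documented checks of a
// single relation, mirroring what the collector above does per entity. `execution`, `relation`
// and `relationDoc` are assumed to be in scope.
import com.dimajix.flowman.documentation.CheckExecutor

val executor = new CheckExecutor(execution)
val checkedDoc = executor.executeTests(relation, relationDoc)  // RelationDoc with check results attached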
+ */ + +package com.dimajix.flowman.documentation + + +sealed abstract class CheckStatus extends Product with Serializable { + def success : Boolean + def failure : Boolean + def run : Boolean +} + +object CheckStatus { + final case object FAILED extends CheckStatus { + def success : Boolean = false + def failure : Boolean = true + def run : Boolean = true + } + final case object SUCCESS extends CheckStatus { + def success : Boolean = true + def failure : Boolean = false + def run : Boolean = true + } + final case object ERROR extends CheckStatus { + def success : Boolean = false + def failure : Boolean = true + def run : Boolean = true + } + final case object NOT_RUN extends CheckStatus { + def success : Boolean = false + def failure : Boolean = false + def run : Boolean = false + } +} + + +final case class CheckResultReference( + parent:Option[Reference] +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/result" + case None => "" + } + } + override def kind : String = "check_result" +} + + +final case class CheckResult( + parent:Some[Reference], + status:CheckStatus, + description:Option[String] = None, + details:Option[Fragment] = None +) extends Fragment { + override def reference: CheckResultReference = CheckResultReference(parent) + override def fragments: Seq[Fragment] = details.toSeq + + override def reparent(parent:Reference) : CheckResult = { + val ref = CheckResultReference(Some(parent)) + copy( + parent = Some(parent), + details = details.map(_.reparent(ref)) + ) + } + + def success : Boolean = status.success + def failure : Boolean = status.failure + def run : Boolean = status.run +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/TableDefinition.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Collector.scala similarity index 63% rename from flowman-core/src/main/scala/com/dimajix/flowman/jdbc/TableDefinition.scala rename to flowman-core/src/main/scala/com/dimajix/flowman/documentation/Collector.scala index 4c69f6871..87c330215 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/TableDefinition.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Collector.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,17 +14,12 @@ * limitations under the License. 
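// Illustrative sketch, not part of the original change: summarising a list of check results
// via the CheckStatus flags defined above.
import com.dimajix.flowman.documentation.CheckResult

def summarize(results: Seq[CheckResult]): String = {
    val passed = results.count(_.success)
    val failed = results.count(_.failure)
    val notRun = results.count(r => !r.run)
    s"$passed passed, $failed failed, $notRun not run"
}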
*/ -package com.dimajix.flowman.jdbc +package com.dimajix.flowman.documentation -import org.apache.spark.sql.catalyst.TableIdentifier +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph -import com.dimajix.flowman.types.Field - -case class TableDefinition( - identifier: TableIdentifier, - fields: Seq[Field], - comment: Option[String] = None, - primaryKey: Seq[String] = Seq() -) { +abstract class Collector { + def collect(execution: Execution, graph:Graph, documentation:ProjectDoc) : ProjectDoc } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnCheck.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnCheck.scala new file mode 100644 index 000000000..a69240b1f --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnCheck.scala @@ -0,0 +1,202 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import org.apache.spark.sql.Column +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.expr +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.types.BooleanType + +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.spi.ColumnCheckExecutor + + +final case class ColumnCheckReference( + override val parent:Option[Reference] +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/check" + case None => "" + } + } + override def kind : String = "column_check" +} + + +abstract class ColumnCheck extends Fragment with Product with Serializable { + def name : String + def result : Option[CheckResult] + def withResult(result:CheckResult) : ColumnCheck + + override def reparent(parent: Reference): ColumnCheck + + override def parent: Option[Reference] + override def reference: ColumnCheckReference = ColumnCheckReference(parent) + override def fragments: Seq[Fragment] = result.toSeq +} + + +final case class NotNullColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name : String = "IS NOT NULL" + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class UniqueColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name : String = "HAS UNIQUE VALUES" + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): 
UniqueColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class RangeColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + lower:Any, + upper:Any, + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name : String = s"IS BETWEEN $lower AND $upper" + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): RangeColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class ValuesColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + values: Seq[Any] = Seq(), + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name : String = s"IS IN (${values.mkString(",")})" + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ValuesColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class ForeignKeyColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + relation: Option[RelationIdentifier] = None, + mapping: Option[MappingOutputIdentifier] = None, + column: Option[String] = None, + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name : String = { + val otherEntity = relation.map(_.toString).orElse(mapping.map(_.toString)).getOrElse("") + val otherColumn = column.getOrElse("") + s"FOREIGN KEY REFERENCES ${otherEntity} (${otherColumn})" + } + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ForeignKeyColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class ExpressionColumnCheck( + parent:Option[Reference], + description: Option[String] = None, + expression: String, + result:Option[CheckResult] = None +) extends ColumnCheck { + override def name: String = expression + override def withResult(result: CheckResult): ColumnCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ExpressionColumnCheck = { + val ref = ColumnCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + + +class DefaultColumnCheckExecutor extends ColumnCheckExecutor { + override def execute(execution: Execution, context:Context, df: DataFrame, column:String, check: ColumnCheck): Option[CheckResult] = { + check match { + case _: NotNullColumnCheck => + executePredicateTest(df, check, df(column).isNotNull) + + case _: UniqueColumnCheck => + val agg = df.filter(df(column).isNotNull).groupBy(df(column)).count() + val result = agg.groupBy(agg(agg.columns(1)) > 1).count().collect() + val numSuccess = result.find(_.getBoolean(0) == false).map(_.getLong(1)).getOrElse(0L) + val numFailed = result.find(_.getBoolean(0) == true).map(_.getLong(1)).getOrElse(0L) + val status = if (numFailed > 0) CheckStatus.FAILED else CheckStatus.SUCCESS + val description = s"$numSuccess values are unique, $numFailed values are non-unique" + Some(CheckResult(Some(check.reference), status, Some(description))) + + case v: ValuesColumnCheck => + val dt = df.schema(column).dataType + val values = v.values.map(v => lit(v).cast(dt)) + 
executePredicateTest(df.filter(df(column).isNotNull), check, df(column).isin(values:_*)) + + case v: RangeColumnCheck => + val dt = df.schema(column).dataType + val lower = lit(v.lower).cast(dt) + val upper = lit(v.upper).cast(dt) + executePredicateTest(df.filter(df(column).isNotNull), check, df(column).between(lower, upper)) + + case v: ExpressionColumnCheck => + executePredicateTest(df, check, expr(v.expression).cast(BooleanType)) + + case f:ForeignKeyColumnCheck => + val otherDf = + f.relation.map { rel => + val relation = context.getRelation(rel) + relation.read(execution) + }.orElse(f.mapping.map { map=> + val mapping = context.getMapping(map.mapping) + execution.instantiate(mapping, map.output) + }).getOrElse(throw new IllegalArgumentException(s"Need either mapping or relation in foreignKey test of column '$column' in check ${check.reference.toString}")) + val otherColumn = f.column.getOrElse(column) + val joined = df.join(otherDf, df(column) === otherDf(otherColumn), "left") + executePredicateTest(joined.filter(df(column).isNotNull), check,otherDf(otherColumn).isNotNull) + + case _ => None + } + } + + private def executePredicateTest(df: DataFrame, test:ColumnCheck, predicate:Column) : Option[CheckResult] = { + val result = df.groupBy(predicate).count().collect() + val numSuccess = result.find(_.getBoolean(0) == true).map(_.getLong(1)).getOrElse(0L) + val numFailed = result.find(_.getBoolean(0) == false).map(_.getLong(1)).getOrElse(0L) + val status = if (numFailed > 0) CheckStatus.FAILED else CheckStatus.SUCCESS + val description = s"$numSuccess records passed, $numFailed records failed" + Some(CheckResult(Some(test.reference), status, Some(description))) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnDoc.scala new file mode 100644 index 000000000..bfb0ee7aa --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ColumnDoc.scala @@ -0,0 +1,118 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import com.dimajix.common.MapIgnoreCase +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.NullType + + + +final case class ColumnReference( + override val parent:Option[Reference], + name:String +) extends Reference { + override def toString: String = { + parent match { + case Some(col:ColumnReference) => col.toString + "." + name + case Some(ref) => ref.toString + "/column=" + name + case None => name + } + } + override def kind : String = "column" + + def sql : String = { + parent match { + case Some(schema:SchemaReference) => schema.sql + "." + name + case Some(col:ColumnReference) => col.sql + "." 
+ name + case _ => name + } + } +} + + +object ColumnDoc { + def merge(thisCols:Seq[ColumnDoc], otherCols:Seq[ColumnDoc]) :Seq[ColumnDoc] = { + val thisColsByName = MapIgnoreCase(thisCols.map(c => c.name -> c)) + val otherColsByName = MapIgnoreCase(otherCols.map(c => c.name -> c)) + val mergedColumns = thisCols.map { column => + column.merge(otherColsByName.get(column.name)) + } + mergedColumns ++ otherCols.filter(c => !thisColsByName.contains(c.name)) + } +} +final case class ColumnDoc( + parent:Option[Reference], + field:Field, + children:Seq[ColumnDoc] = Seq(), + checks:Seq[ColumnCheck] = Seq() +) extends EntityDoc { + override def reference: ColumnReference = ColumnReference(parent, name) + override def fragments: Seq[Fragment] = children + override def reparent(parent: Reference): ColumnDoc = { + val ref = ColumnReference(Some(parent), name) + copy( + parent = Some(parent), + children = children.map(_.reparent(ref)), + checks = checks.map(_.reparent(ref)) + ) + } + + def name : String = field.name + def description : Option[String] = field.description + def nullable : Boolean = field.nullable + def typeName : String = field.typeName + def sqlType : String = field.sqlType + def sparkType : String = field.sparkType.sql + def catalogType : String = field.catalogType.sql + + /** + * Merge this schema documentation with another column documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[ColumnDoc]) : ColumnDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another column documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:ColumnDoc) : ColumnDoc = { + val childs = + if (this.children.nonEmpty && other.children.nonEmpty) + ColumnDoc.merge(children, other.children) + else + this.children ++ other.children + val desc = other.description.orElse(description) + val tsts = checks ++ other.checks + val ftyp = if (field.ftype == NullType) other.field.ftype else field.ftype + val nll = if (field.ftype == NullType) other.field.nullable else field.nullable + val fld = field.copy(ftype=ftyp, nullable=nll, description=desc) + copy(field=fld, children=childs, checks=tsts) + } + + /** + * Enriches a Flowman [[Field]] with documentation + */ + def enrich(field:Field) : Field = { + val desc = description.filter(_.nonEmpty).orElse(field.description) + field.copy(description = desc) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Documenter.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Documenter.scala new file mode 100644 index 000000000..9edc26527 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Documenter.scala @@ -0,0 +1,122 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
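// Illustrative sketch, not part of the original change: the human-readable names which the
// column checks defined above render into the documentation.
import com.dimajix.flowman.documentation.{ExpressionColumnCheck, NotNullColumnCheck, RangeColumnCheck, ValuesColumnCheck}

NotNullColumnCheck(None).name                                  // "IS NOT NULL"
RangeColumnCheck(None, lower = 0, upper = 100).name            // "IS BETWEEN 0 AND 100"
ValuesColumnCheck(None, values = Seq("NEW", "SHIPPED")).name   // "IS IN (NEW,SHIPPED)"
ExpressionColumnCheck(None, expression = "price >= 0").name    // "price >= 0"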
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import java.util.ServiceLoader + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.Path +import org.slf4j.LoggerFactory + +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.hadoop.File +import com.dimajix.flowman.model.Job +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.spi.DocumenterReader + + +object Documenter { + private lazy val loader = ServiceLoader.load(classOf[DocumenterReader]).iterator().asScala.toSeq + private lazy val defaultDocumenter = { + val collectors = Seq( + new RelationCollector(), + new MappingCollector(), + new TargetCollector(), + new CheckCollector() + ) + Documenter( + collectors=collectors + ) + } + + class Reader { + private val logger = LoggerFactory.getLogger(classOf[Documenter]) + private var format = "yaml" + + def default() : Documenter = defaultDocumenter + + def format(fmt:String) : Reader = { + format = fmt + this + } + + /** + * Loads a single file or a whole directory (non recursibely) + * + * @param file + * @return + */ + def file(file:File) : Prototype[Documenter] = { + if (!file.isAbsolute()) { + this.file(file.absolute) + } + else { + logger.info(s"Reading documenter from ${file.toString}") + reader.file(file) + } + } + + def string(text:String) : Prototype[Documenter] = { + reader.string(text) + } + + private def reader : DocumenterReader = { + loader.find(_.supports(format)) + .getOrElse(throw new IllegalArgumentException(s"Module format '$format' not supported'")) + } + } + + def read = new Reader +} + + +final case class Documenter( + collectors:Seq[Collector] = Seq(), + generators:Seq[Generator] = Seq() +) { + def execute(session:Session, job:Job, args:Map[String,Any]) : Unit = { + val runner = session.runner + runner.withExecution(isolated=true) { execution => + runner.withJobContext(job, args, Some(execution)) { (context, arguments) => + execute(context, execution, job.project.get) + } + } + } + def execute(context:Context, execution: Execution, project:Project) : Unit = { + // 1. Get Project documentation + val projectDoc = ProjectDoc( + project.name, + version = project.version, + description = project.description + ) + + // 2. Apply all other collectors + val graph = Graph.ofProject(context, project, Phase.BUILD) + val finalDoc = collectors.foldLeft(projectDoc)((doc, collector) => collector.collect(execution, graph, doc)) + + // 3. 
Generate documentation + generators.foreach { gen => + gen.generate(context, execution, finalDoc) + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/EntityDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/EntityDoc.scala new file mode 100644 index 000000000..b9551bc3d --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/EntityDoc.scala @@ -0,0 +1,21 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +abstract class EntityDoc extends Fragment { + +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Fragment.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Fragment.scala new file mode 100644 index 000000000..e0fb2bab2 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Fragment.scala @@ -0,0 +1,58 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + + +/** + * A [[Fragment]] represents a piece of documentation. The full documentation then is a tree structure build from many + * fragments. + */ +abstract class Fragment { + /** + * Optional textual description of the fragment to be shown in the documentation + * @return + */ + def description : Option[String] + + /** + * A resolvable reference to the fragment itself + * @return + */ + def reference : Reference + + /** + * A reference to the parent of this [[Fragment]] + * @return + */ + def parent : Option[Reference] + + /** + * List of child fragments + * @return + */ + def fragments : Seq[Fragment] + + def reparent(parent:Reference) : Fragment + + def resolve(path:Seq[Reference]) : Option[Fragment] = { + path match { + case head :: tail => + fragments.find(_.reference == head).flatMap(_.resolve(tail)) + case Nil => Some(this) + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Generator.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Generator.scala new file mode 100644 index 000000000..0b4d1d331 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Generator.scala @@ -0,0 +1,30 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
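// Illustrative sketch, not part of the original change: building the documentation model with
// the default Documenter wired above. Note that the default instance only defines collectors and
// no generators, so a generator has to be configured to actually write any output.
// `session` and `job` are assumed to be in scope.
import com.dimajix.flowman.documentation.Documenter

val documenter = Documenter.read.default()
documenter.execute(session, job, Map.empty[String, Any])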
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution + + +abstract class Generator { + def generate(context:Context, execution: Execution, documentation: ProjectDoc) : Unit +} + + +abstract class BaseGenerator extends Generator { + +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingCollector.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingCollector.scala new file mode 100644 index 000000000..ef244994b --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingCollector.scala @@ -0,0 +1,114 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import scala.util.control.NonFatal + +import org.slf4j.LoggerFactory + +import com.dimajix.common.ExceptionUtils.reasons +import com.dimajix.common.IdentityHashMap +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.graph.MappingRef +import com.dimajix.flowman.graph.ReadRelation +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.types.StructType + + +class MappingCollector extends Collector { + private val logger = LoggerFactory.getLogger(getClass) + + override def collect(execution: Execution, graph: Graph, documentation: ProjectDoc): ProjectDoc = { + val mappings = IdentityHashMap[Mapping, MappingDoc]() + val parent = documentation.reference + + def getMappingDoc(node:MappingRef) : MappingDoc = { + val mapping = node.mapping + mappings.getOrElseUpdate(mapping, genDoc(node)) + } + def getOutputDoc(mapping:Mapping, output:String) : Option[MappingOutputDoc] = { + val doc = mappings.getOrElseUpdate(mapping, genDoc(graph.mapping(mapping))) + doc.outputs.find(_.identifier.output == output) + } + def genDoc(node:MappingRef) : MappingDoc = { + val mapping = node.mapping + logger.info(s"Collecting documentation for mapping '${mapping.identifier}'") + + // Collect fundamental basis information + val inputs = mapping.inputs.flatMap { in => + val inmap = mapping.context.getMapping(in.mapping) + getOutputDoc(inmap, in.output).map(in -> _) + }.toMap + val doc = document(execution, parent, mapping, inputs) + + // Add additional inputs from non-mapping entities + val incoming = node.incoming.collect { + // TODO: The following logic is not correct in case of embedded relations. 
We would need an IdentityHashMap instead + case ReadRelation(input, _, _) => documentation.relations.find(_.identifier == input.relation.identifier).map(_.reference) + }.flatten + doc.copy(inputs=doc.inputs ++ incoming) + } + + val docs = graph.mappings.map(mapping => getMappingDoc(mapping)) + + documentation.copy(mappings=docs) + } + + /** + * Generates a documentation for this mapping + * @param execution + * @param parent + * @param inputs + * @return + */ + private def document(execution: Execution, parent:Reference, mapping:Mapping, inputs:Map[MappingOutputIdentifier,MappingOutputDoc]) : MappingDoc = { + val inputSchemas = inputs.map(kv => kv._1 -> kv._2.schema.map(_.toStruct).getOrElse(StructType(Seq()))) + val doc = MappingDoc( + Some(parent), + mapping.identifier, + None, + inputs.map(_._2.reference).toSeq + ) + val ref = doc.reference + + val outputs = try { + // Do not use Execution.describe because that wouldn't use our hand-crafted input documentation + val schemas = mapping.describe(execution, inputSchemas) + schemas.map { case(output,schema) => + val doc = MappingOutputDoc( + Some(ref), + MappingOutputIdentifier(mapping.identifier, output) + ) + val schemaDoc = SchemaDoc.ofStruct(doc.reference, schema) + doc.copy(schema = Some(schemaDoc)) + } + } catch { + case NonFatal(ex) => + logger.warn(s"Error while inferring schema description of mapping '${mapping.identifier}': ${reasons(ex)}") + mapping.outputs.map { output => + MappingOutputDoc( + Some(ref), + MappingOutputIdentifier(mapping.identifier, output) + ) + } + } + + doc.copy(outputs=outputs.toSeq).merge(mapping.documentation) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingDoc.scala new file mode 100644 index 000000000..870927aaa --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/MappingDoc.scala @@ -0,0 +1,190 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
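// Illustrative sketch, not part of the original change: invoking a single collector directly
// instead of going through the Documenter. `execution`, `context` and `project` are assumed
// to be in scope.
import com.dimajix.flowman.documentation.{MappingCollector, ProjectDoc}
import com.dimajix.flowman.execution.Phase
import com.dimajix.flowman.graph.Graph

val graph = Graph.ofProject(context, project, Phase.BUILD)
val emptyDoc = ProjectDoc(project.name, version = project.version, description = project.description)
val withMappings = new MappingCollector().collect(execution, graph, emptyDoc)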
+ */ + +package com.dimajix.flowman.documentation + +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.MappingOutputIdentifier + + +final case class MappingOutputReference( + override val parent:Option[Reference], + name:String +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/output=" + name + case None => name + } + } + override def kind : String = "mapping_output" + + def sql : String = { + parent match { + case Some(MappingReference(Some(ProjectReference(project)), mapping)) => s"$project/[$mapping:$name]" + case Some(p:MappingReference) => s"[${p.sql}:$name]" + case _ => s"[:$name]" + } + } +} + + +final case class MappingOutputDoc( + parent:Some[Reference], + identifier: MappingOutputIdentifier, + description: Option[String] = None, + schema:Option[SchemaDoc] = None +) extends Fragment { + override def reference: Reference = MappingOutputReference(parent, identifier.output) + override def fragments: Seq[Fragment] = schema.toSeq + override def reparent(parent: Reference): MappingOutputDoc = { + val ref = MappingOutputReference(Some(parent), identifier.output) + copy( + parent=Some(parent), + schema=schema.map(_.reparent(ref)) + ) + } + + /** + * Returns the name of the project of the mapping of this output + * @return + */ + def project : Option[String] = identifier.project + + /** + * Returns the mapping identifier of this output + * @return + */ + def mapping : MappingIdentifier = identifier.mapping + + /** + * Returns the name of the output + * @return + */ + def name : String = identifier.output + + /** + * Merge this schema documentation with another mapping documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[MappingOutputDoc]) : MappingOutputDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another mapping documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. 
+ * @param other + */ + def merge(other:MappingOutputDoc) : MappingOutputDoc = { + val id = if (identifier.mapping.isEmpty) other.identifier else identifier + val desc = other.description.orElse(this.description) + val schm = schema.map(_.merge(other.schema)).orElse(other.schema) + val result = copy(identifier=id, description=desc, schema=schm) + parent.orElse(other.parent) + .map(result.reparent) + .getOrElse(result) + } +} + + +object MappingReference { + def of(parent:Reference, identifier:MappingIdentifier) : MappingReference = { + identifier.project match { + case None => MappingReference(Some(parent), identifier.name) + case Some(project) => MappingReference(Some(ProjectReference(project)), identifier.name) + } + } +} +final case class MappingReference( + override val parent:Option[Reference] = None, + name:String +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/mapping=" + name + case None => name + } + } + override def kind: String = "mapping" + + def sql : String = { + parent match { + case Some(ProjectReference(project)) => project + "/" + name + case _ => name + } + } +} + + +final case class MappingDoc( + parent:Option[Reference] = None, + identifier:MappingIdentifier, + description:Option[String] = None, + inputs:Seq[Reference] = Seq.empty, + outputs:Seq[MappingOutputDoc] = Seq.empty +) extends EntityDoc { + override def reference: MappingReference = MappingReference(parent, identifier.name) + override def fragments: Seq[Fragment] = outputs + override def reparent(parent: Reference): MappingDoc = { + val ref = MappingReference(Some(parent), identifier.name) + copy( + parent=Some(parent), + outputs=outputs.map(_.reparent(ref)) + ) + } + + /** + * Returns the name of the project of this mapping + * @return + */ + def project : Option[String] = identifier.project + + /** + * Returns the name of this mapping + * @return + */ + def name : String = identifier.name + + /** + * Merge this schema documentation with another mapping documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[MappingDoc]) : MappingDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another mapping documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. 
+ * @param other + */ + def merge(other:MappingDoc) : MappingDoc = { + val id = if (identifier.isEmpty) other.identifier else identifier + val desc = other.description.orElse(this.description) + val in = inputs.toSet ++ other.inputs.toSet + val out = outputs.map { out => + out.merge(other.outputs.find(_.identifier.output == out.identifier.output)) + } ++ + other.outputs.filter(out => !outputs.exists(_.identifier.output == out.identifier.output)) + val result = copy(identifier=id, description=desc, inputs=in.toSeq, outputs=out) + parent.orElse(other.parent) + .map(result.reparent) + .getOrElse(result) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ProjectDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ProjectDoc.scala new file mode 100644 index 000000000..80cbccc49 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ProjectDoc.scala @@ -0,0 +1,63 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + + +final case class ProjectReference( + name:String +) extends Reference { + override def toString: String = "/project=" + name + override def parent: Option[Reference] = None + override def kind : String = "reference" +} + + +final case class ProjectDoc( + name: String, + version: Option[String] = None, + description: Option[String] = None, + targets:Seq[TargetDoc] = Seq.empty, + relations:Seq[RelationDoc] = Seq.empty, + mappings:Seq[MappingDoc] = Seq.empty +) extends EntityDoc { + override def reference: Reference = ProjectReference(name) + override def parent: Option[Reference] = None + override def fragments: Seq[Fragment] = (targets ++ relations ++ mappings).toSeq + + override def resolve(path:Seq[Reference]) : Option[Fragment] = { + if (path.isEmpty) + Some(this) + else + None + } + + override def reparent(parent: Reference): ProjectDoc = ??? + + def resolve(ref:Reference) : Option[Fragment] = { + ref.path match { + case head :: tail => + if (head != reference) + None + else if (tail.isEmpty) + Some(this) + else + fragments.find(_.reference == tail.head).flatMap(_.resolve(tail.tail)) + case Nil => + None + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Reference.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Reference.scala new file mode 100644 index 000000000..c2089174e --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/Reference.scala @@ -0,0 +1,24 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
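// Illustrative sketch, not part of the original change: the "overlay" merge used throughout the
// documentation model - attributes of the argument win while the receiver keeps its own parent.
// Assumes MappingIdentifier("orders") accepts a plain mapping name.
import com.dimajix.flowman.documentation.MappingDoc
import com.dimajix.flowman.model.MappingIdentifier

val inferred = MappingDoc(identifier = MappingIdentifier("orders"))
val authored = MappingDoc(identifier = MappingIdentifier("orders"), description = Some("Deduplicated orders"))
val merged = inferred.merge(authored)
// merged.description == Some("Deduplicated orders")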
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + + +abstract class Reference extends Product with Serializable { + def kind : String + def parent : Option[Reference] + def path : Seq[Reference] = parent.toSeq.flatMap(_.path) :+ this +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ReferenceResolver.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ReferenceResolver.scala new file mode 100644 index 000000000..abc921911 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/ReferenceResolver.scala @@ -0,0 +1,60 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier + + +class ReferenceResolver(graph:Graph) { + /** + * Resolve a mapping via its documentation reference in the graph + * @param graph + * @param ref + * @return + */ + def resolve(ref:MappingReference) : Option[Mapping] = { + ref.parent match { + case None => + graph.mappings.find(m => m.name == ref.name).map(_.mapping) + case Some(ProjectReference(project)) => + val id = MappingIdentifier(ref.name, project) + graph.mappings.find(m => m.identifier == id).map(_.mapping) + case _ => None + } + } + + /** + * Resolve a relation via its documentation reference in the graph + * @param graph + * @param ref + * @return + */ + def resolve(ref:RelationReference) : Option[Relation] = { + ref.parent match { + case None => + graph.relations.find(m => m.name == ref.name).map(_.relation) + case Some(ProjectReference(project)) => + val id = RelationIdentifier(ref.name, project) + graph.relations.find(m => m.identifier == id).map(_.relation) + case _ => None + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationCollector.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationCollector.scala new file mode 100644 index 000000000..72a81d982 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationCollector.scala @@ -0,0 +1,167 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
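// Illustrative sketch, not part of the original change: resolving a documentation reference back
// to the underlying model entity via the build graph, as the CheckCollector above does.
// `graph` is assumed to be a Graph built via Graph.ofProject(context, project, Phase.BUILD).
import com.dimajix.flowman.documentation.{MappingReference, ReferenceResolver}

val resolver = new ReferenceResolver(graph)
val mapping = resolver.resolve(MappingReference(None, "orders"))  // Option[Mapping], None if not in the graph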
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import scala.util.Failure +import scala.util.Success +import scala.util.Try + +import org.slf4j.LoggerFactory + +import com.dimajix.common.ExceptionUtils.reasons +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.graph.InputMapping +import com.dimajix.flowman.graph.MappingRef +import com.dimajix.flowman.graph.ReadRelation +import com.dimajix.flowman.graph.RelationRef +import com.dimajix.flowman.graph.WriteRelation +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.types.FieldValue + + +class RelationCollector extends Collector { + private val logger = LoggerFactory.getLogger(getClass) + + override def collect(execution: Execution, graph: Graph, documentation: ProjectDoc): ProjectDoc = { + val parent = documentation.reference + val docs = graph.relations.map(t => document(execution, parent, t)) + documentation.copy(relations = docs) + } + + /** + * Create a documentation for the relation. + * @param execution + * @param parent + * @return + */ + private def document(execution:Execution, parent:Reference, node:RelationRef) : RelationDoc = { + val relation = node.relation + logger.info(s"Collecting documentation for relation '${relation.identifier}'") + + val inputs = node.incoming.flatMap { + case write:WriteRelation => + write.input.incoming.flatMap { + case map: InputMapping => + val mapref = MappingReference.of(parent, map.mapping.identifier) + val outref = MappingOutputReference(Some(mapref), map.pin) + Some(outref) + case _ => None + } + case _ => Seq() + } + val inputPartitions = node.outgoing.flatMap { + case read:ReadRelation => + logger.debug(s"read partition ${relation.identifier}: ${read.input.identifier} ${read.partitions}") + Some(read.partitions) + case _ => None + } + val outputPartitions = node.incoming.flatMap { + case write:WriteRelation => + logger.debug(s"write partition ${relation.identifier}: ${write.output.identifier} ${write.partition}") + Some(write.partition) + case _ => None + } + + // Recursively collect all sources from upstream mappings + def collectMappingSources(map:MappingRef) : Seq[ResourceIdentifier] = { + val direct = map.mapping.requires.toSeq + val indirect = map.incoming.flatMap { + case map:InputMapping => + collectMappingSources(map.mapping) + case _ => Seq.empty + } + (direct ++ indirect).distinct + } + + val sources = node.incoming.flatMap { + case write:WriteRelation => + write.input.incoming.flatMap { + case map: InputMapping => + collectMappingSources(map.mapping) + case rel: ReadRelation => + rel.input.relation.provides.toSeq + case _ => Seq.empty + } + case _ => Seq.empty + }.distinct + + val partitions = (inputPartitions ++ outputPartitions).foldLeft(Map.empty[String,FieldValue])((a,b) => a ++ b) + + val doc = RelationDoc( + Some(parent), + relation.identifier, + description = relation.description, + None, + inputs, + relation.provides.toSeq, + relation.requires.toSeq, + sources, + partitions + ) + val ref = doc.reference + + val schema = relation.schema.map { schema => + val 
fieldsDoc = SchemaDoc.ofFields(parent, schema.fields) + SchemaDoc( + Some(ref), + description = schema.description, + columns = fieldsDoc.columns + ) + }.orElse { + // Try to infer schema from input + getInputSchema(execution, ref, node) + } + val mergedSchema = { + Try { + SchemaDoc.ofStruct(ref, execution.describe(relation, partitions)) + } match { + case Success(desc) => + Some(desc.merge(schema)) + case Failure(ex) => + logger.warn(s"Error while inferring schema description of relation '${relation.identifier}': ${reasons(ex)}") + schema + } + } + + doc.copy(schema = mergedSchema).merge(relation.documentation) + } + + private def getInputSchema(execution:Execution, parent:Reference, node:RelationRef) : Option[SchemaDoc] = { + // Try to infer schema from input + val schema = node.incoming.flatMap { + case write:WriteRelation => + write.input.incoming.flatMap { + case map: InputMapping => + Try { + val mapout = map.input + execution.describe(mapout.mapping.mapping, mapout.output) + }.toOption + case _ => None + } + case _ => Seq() + }.headOption + + schema.map { schema => + val fieldsDoc = SchemaDoc.ofFields(parent.parent.get, schema.fields) + SchemaDoc( + Some(parent), + columns = fieldsDoc.columns + ) + } + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationDoc.scala new file mode 100644 index 000000000..331b5c7de --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/RelationDoc.scala @@ -0,0 +1,100 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.documentation + +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.types.FieldValue + + +object RelationReference { + def of(parent:Reference, identifier:RelationIdentifier) : RelationReference = { + identifier.project match { + case None => RelationReference(Some(parent), identifier.name) + case Some(project) => RelationReference(Some(ProjectReference(project)), identifier.name) + } + } +} +final case class RelationReference( + parent:Option[Reference], + name:String +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/relation=" + name + case None => name + } + } + override def kind : String = "relation" + + def sql : String = { + parent match { + case Some(ProjectReference(project)) => project + "/" + name + case _ => name + } + } +} + + +final case class RelationDoc( + parent:Option[Reference], + identifier:RelationIdentifier, + description:Option[String] = None, + schema:Option[SchemaDoc] = None, + inputs:Seq[Reference] = Seq(), + provides:Seq[ResourceIdentifier] = Seq(), + requires:Seq[ResourceIdentifier] = Seq(), + sources:Seq[ResourceIdentifier] = Seq(), + partitions:Map[String,FieldValue] = Map() +) extends EntityDoc { + override def reference: RelationReference = RelationReference(parent, identifier.name) + override def fragments: Seq[Fragment] = schema.toSeq + override def reparent(parent: Reference): RelationDoc = { + val ref = RelationReference(Some(parent), identifier.name) + copy( + parent = Some(parent), + schema = schema.map(_.reparent(ref)) + ) + } + + /** + * Merge this schema documentation with another relation documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[RelationDoc]) : RelationDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another relation documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. 
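+ *
+ * A minimal illustrative sketch (hypothetical values, not part of this change):
+ * {{{
+ *   val inferred = RelationDoc(None, RelationIdentifier("customers"))
+ *   val declared = RelationDoc(None, RelationIdentifier("customers"), description = Some("CRM customer master data"))
+ *   // description is taken from `declared`, while provides/requires/sources are merged as set unions
+ *   val merged = inferred.merge(declared)
+ * }}}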
+ * @param other + */ + def merge(other:RelationDoc) : RelationDoc = { + val id = if (identifier.isEmpty) other.identifier else identifier + val desc = other.description.orElse(this.description) + val schm = schema.map(_.merge(other.schema)).orElse(other.schema) + val prov = provides.toSet ++ other.provides.toSet + val reqs = requires.toSet ++ other.requires.toSet + val srcs = sources.toSet ++ other.sources.toSet + val result = copy(identifier=id, description=desc, schema=schm, provides=prov.toSeq, requires=reqs.toSeq, sources=srcs.toSeq) + parent.orElse(other.parent) + .map(result.reparent) + .getOrElse(result) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaCheck.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaCheck.scala new file mode 100644 index 000000000..225921d66 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaCheck.scala @@ -0,0 +1,151 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import org.apache.spark.sql.Column +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions.expr +import org.apache.spark.sql.types.BooleanType + +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.spi.SchemaCheckExecutor + + +final case class SchemaCheckReference( + override val parent:Option[Reference] +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/check" + case None => "" + } + } + override def kind : String = "schema_check" +} + + +abstract class SchemaCheck extends Fragment with Product with Serializable { + def name : String + def result : Option[CheckResult] + def withResult(result:CheckResult) : SchemaCheck + + override def parent: Option[Reference] + override def reference: SchemaCheckReference = SchemaCheckReference(parent) + override def fragments: Seq[Fragment] = result.toSeq + override def reparent(parent: Reference): SchemaCheck +} + +final case class PrimaryKeySchemaCheck( + parent:Option[Reference], + description: Option[String] = None, + columns:Seq[String] = Seq.empty, + result:Option[CheckResult] = None +) extends SchemaCheck { + override def name : String = s"PRIMARY KEY (${columns.mkString(",")})" + override def withResult(result: CheckResult): SchemaCheck = copy(result=Some(result)) + override def reparent(parent: Reference): PrimaryKeySchemaCheck = { + val ref = SchemaCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class ForeignKeySchemaCheck( + parent:Option[Reference], + description: Option[String] = None, + columns: Seq[String] = Seq.empty, + relation: Option[RelationIdentifier] = None, + mapping: Option[MappingOutputIdentifier] = None, + 
references: Seq[String] = Seq.empty, + result:Option[CheckResult] = None +) extends SchemaCheck { + override def name : String = { + val otherEntity = relation.map(_.toString).orElse(mapping.map(_.toString)).getOrElse("") + val otherColumns = if (references.isEmpty) columns else references + s"FOREIGN KEY (${columns.mkString(",")}) REFERENCES ${otherEntity}(${otherColumns.mkString(",")})" + } + override def withResult(result: CheckResult): SchemaCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ForeignKeySchemaCheck = { + val ref = SchemaCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + +final case class ExpressionSchemaCheck( + parent:Option[Reference], + description: Option[String] = None, + expression: String, + result:Option[CheckResult] = None +) extends SchemaCheck { + override def name: String = expression + override def withResult(result: CheckResult): SchemaCheck = copy(result=Some(result)) + override def reparent(parent: Reference): ExpressionSchemaCheck = { + val ref = SchemaCheckReference(Some(parent)) + copy(parent=Some(parent), result=result.map(_.reparent(ref))) + } +} + + +class DefaultSchemaCheckExecutor extends SchemaCheckExecutor { + override def execute(execution: Execution, context:Context, df: DataFrame, check: SchemaCheck): Option[CheckResult] = { + check match { + case p:PrimaryKeySchemaCheck => + val cols = p.columns.map(df(_)) + val agg = df.filter(cols.map(_.isNotNull).reduce(_ || _)).groupBy(cols:_*).count() + val result = agg.groupBy(agg(agg.columns(cols.length)) > 1).count().collect() + val numSuccess = result.find(_.getBoolean(0) == false).map(_.getLong(1)).getOrElse(0L) + val numFailed = result.find(_.getBoolean(0) == true).map(_.getLong(1)).getOrElse(0L) + val status = if (numFailed > 0) CheckStatus.FAILED else CheckStatus.SUCCESS + val description = s"$numSuccess keys are unique, $numFailed keys are non-unique" + Some(CheckResult(Some(check.reference), status, Some(description))) + + case f:ForeignKeySchemaCheck => + val otherDf = + f.relation.map { rel => + val relation = context.getRelation(rel) + relation.read(execution) + }.orElse(f.mapping.map { map=> + val mapping = context.getMapping(map.mapping) + execution.instantiate(mapping, map.output) + }).getOrElse(throw new IllegalArgumentException(s"Need either mapping or relation in foreignKey test ${check.reference.toString}")) + val cols = f.columns.map(df(_)) + val otherCols = + if (f.references.nonEmpty) + f.references.map(otherDf(_)) + else + f.columns.map(otherDf(_)) + val joined = df.join(otherDf, cols.zip(otherCols).map(lr => lr._1 === lr._2).reduce(_ && _), "left") + executePredicateTest(joined, check, otherCols.map(_.isNotNull).reduce(_ || _)) + + case e:ExpressionSchemaCheck => + executePredicateTest(df, check, expr(e.expression).cast(BooleanType)) + + case _ => None + } + } + + private def executePredicateTest(df: DataFrame, test:SchemaCheck, predicate:Column) : Option[CheckResult] = { + val result = df.groupBy(predicate).count().collect() + val numSuccess = result.find(_.getBoolean(0) == true).map(_.getLong(1)).getOrElse(0L) + val numFailed = result.find(_.getBoolean(0) == false).map(_.getLong(1)).getOrElse(0L) + val status = if (numFailed > 0) CheckStatus.FAILED else CheckStatus.SUCCESS + val description = s"$numSuccess records passed, $numFailed records failed" + Some(CheckResult(Some(test.reference), status, Some(description))) + } +} diff --git 
a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaDoc.scala new file mode 100644 index 000000000..60b0eec35 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/SchemaDoc.scala @@ -0,0 +1,142 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import scala.annotation.tailrec + +import com.dimajix.common.MapIgnoreCase +import com.dimajix.flowman.types.ArrayType +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.FieldType +import com.dimajix.flowman.types.MapType +import com.dimajix.flowman.types.StructType + + +final case class SchemaReference( + override val parent:Option[Reference] +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/schema" + case None => "schema" + } + } + override def kind : String = "schema" + + def sql : String = { + parent match { + case Some(rel:RelationReference) => rel.sql + case Some(map:MappingOutputReference) => map.sql + case _ => "" + } + } +} + + +object SchemaDoc { + def ofStruct(parent:Reference, struct:StructType) : SchemaDoc = ofFields(parent, struct.fields) + def ofFields(parent:Reference, fields:Seq[Field]) : SchemaDoc = { + val doc = SchemaDoc(Some(parent), None, Seq(), Seq()) + + def genColumns(parent:Reference, fields:Seq[Field]) : Seq[ColumnDoc] = { + fields.map(f => genColumn(parent, f)) + } + @tailrec + def genChildren(parent:Reference, ftype:FieldType) : Seq[ColumnDoc] = { + ftype match { + case s:StructType => + genColumns(parent, s.fields) + case m:MapType => + genChildren(parent, m.valueType) + case a:ArrayType => + genChildren(parent, a.elementType) + case _ => + Seq() + } + + } + def genColumn(parent:Reference, field:Field) : ColumnDoc = { + val doc = ColumnDoc(Some(parent), field, Seq(), Seq()) + val children = genChildren(doc.reference, field.ftype) + doc.copy(children = children) + } + val columns = genColumns(doc.reference, fields) + doc.copy(columns = columns) + } +} + + +final case class SchemaDoc( + parent:Option[Reference], + description:Option[String] = None, + columns:Seq[ColumnDoc] = Seq(), + checks:Seq[SchemaCheck] = Seq() +) extends EntityDoc { + override def reference: SchemaReference = SchemaReference(parent) + override def fragments: Seq[Fragment] = columns ++ checks + override def reparent(parent: Reference): SchemaDoc = { + val ref = SchemaReference(Some(parent)) + copy( + parent = Some(parent), + columns = columns.map(_.reparent(ref)), + checks = checks.map(_.reparent(ref)) + ) + } + + /** + * Convert this schema documentation to a Flowman struct + */ + def toStruct : StructType = StructType(columns.map(_.field)) + + /** + * Merge this schema documentation with another schema documentation. 
Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[SchemaDoc]) : SchemaDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another schema documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:SchemaDoc) : SchemaDoc = { + val desc = other.description.orElse(this.description) + val tsts = checks ++ other.checks + val cols = ColumnDoc.merge(columns, other.columns) + val result = copy(description=desc, columns=cols, checks=tsts) + parent.orElse(other.parent) + .map(result.reparent) + .getOrElse(result) + } + + /** + * Enrich a Flowman struct with information from schema documentation + * @param schema + * @return + */ + def enrich(schema:StructType) : StructType = { + def enrichStruct(columns:Seq[ColumnDoc], struct:StructType) : StructType = { + val columnsByName = MapIgnoreCase(columns.map(c => c.name -> c)) + val fields = struct.fields.map(f => columnsByName.get(f.name).map(_.enrich(f)).getOrElse(f)) + struct.copy(fields = fields) + } + enrichStruct(columns, schema) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetCollector.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetCollector.scala new file mode 100644 index 000000000..9e666c68f --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetCollector.scala @@ -0,0 +1,86 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.documentation + +import org.slf4j.LoggerFactory + +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.graph.InputMapping +import com.dimajix.flowman.graph.ReadRelation +import com.dimajix.flowman.graph.TargetRef +import com.dimajix.flowman.graph.WriteRelation +import com.dimajix.flowman.model.Target + + +class TargetCollector extends Collector { + private val logger = LoggerFactory.getLogger(getClass) + + override def collect(execution: Execution, graph: Graph, documentation: ProjectDoc): ProjectDoc = { + val parent = documentation.reference + val docs = graph.targets.map(t => document(execution, parent, t)) + documentation.copy(targets = docs) + } + + /** + * Creates a documentation of this target + * @param execution + * @param parent + * @return + */ + private def document(execution: Execution, parent:Reference, node:TargetRef) : TargetDoc = { + val target = node.target + logger.info(s"Collecting documentation for target '${target.identifier}'") + + val inputs = node.incoming.flatMap { + case map: InputMapping => + val mapref = MappingReference.of(parent, map.mapping.identifier) + val outref = MappingOutputReference(Some(mapref), map.pin) + Some(outref) + case read: ReadRelation => + val relref = RelationReference.of(parent, read.input.identifier) + Some(relref) + case _ => None + } + val outputs = node.outgoing.flatMap { + case write:WriteRelation => + val relref = RelationReference.of(parent, write.output.identifier) + Some(relref) + case _ => None + } + + val doc = TargetDoc( + Some(parent), + target.identifier, + description = target.description, + inputs = inputs, + outputs = outputs + ) + val ref = doc.reference + + val phaseDocs = target.phases.toSeq.map { p => + TargetPhaseDoc( + Some(ref), + p, + provides = target.provides(p).toSeq, + requires = target.requires(p).toSeq + ) + } + + doc.copy(phases=phaseDocs).merge(target.documentation) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetDoc.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetDoc.scala new file mode 100644 index 000000000..fb0a20a9f --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/TargetDoc.scala @@ -0,0 +1,109 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.documentation + +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.model.TargetIdentifier + + +final case class TargetPhaseReference( + override val parent:Option[Reference], + phase:Phase +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/phase=" + phase.upper + case None => phase.upper + } + } + override def kind : String = "target_phase" +} + + +final case class TargetPhaseDoc( + parent:Option[Reference], + phase:Phase, + description:Option[String] = None, + provides:Seq[ResourceIdentifier] = Seq.empty, + requires:Seq[ResourceIdentifier] = Seq.empty +) extends Fragment { + override def reference: Reference = TargetPhaseReference(parent, phase) + override def fragments: Seq[Fragment] = Seq() + override def reparent(parent: Reference): TargetPhaseDoc = { + copy(parent = Some(parent)) + } +} + + +final case class TargetReference( + override val parent:Option[Reference], + name:String +) extends Reference { + override def toString: String = { + parent match { + case Some(ref) => ref.toString + "/target=" + name + case None => name + } + } + override def kind: String = "target" +} + + +final case class TargetDoc( + parent:Option[Reference], + identifier:TargetIdentifier, + description:Option[String] = None, + phases:Seq[TargetPhaseDoc] = Seq.empty, + inputs:Seq[Reference] = Seq.empty, + outputs:Seq[Reference] = Seq.empty +) extends EntityDoc { + override def reference: TargetReference = TargetReference(parent, identifier.name) + override def fragments: Seq[Fragment] = phases + override def reparent(parent: Reference): TargetDoc = { + val ref = TargetReference(Some(parent), identifier.name) + copy( + parent = Some(parent), + phases = phases.map(_.reparent(ref)) + ) + } + + /** + * Merge this schema documentation with another target documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. + * @param other + */ + def merge(other:Option[TargetDoc]) : TargetDoc = other.map(merge).getOrElse(this) + + /** + * Merge this schema documentation with another target documentation. Note that while documentation attributes + * of [[other]] have a higher priority than those of the instance itself, the parent of itself has higher priority + * than the one of [[other]]. This allows for a simply information overlay mechanism. 
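+ *
+ * A minimal illustrative sketch (hypothetical values, not part of this change):
+ * {{{
+ *   val collected = TargetDoc(Some(projectRef), TargetIdentifier("documentation"))
+ *   val declared  = TargetDoc(None, TargetIdentifier("documentation"), description = Some("Build the project documentation"))
+ *   // description comes from `declared`, inputs and outputs are unioned, and the parent stays `projectRef`
+ *   val merged = collected.merge(declared)
+ * }}}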
+ * @param other + */ + def merge(other:TargetDoc) : TargetDoc = { + val id = if (identifier.isEmpty) other.identifier else identifier + val desc = other.description.orElse(this.description) + val in = inputs.toSet ++ other.inputs.toSet + val out = outputs.toSet ++ other.outputs.toSet + val result = copy(identifier=id, description=desc, inputs=in.toSeq, outputs=out.toSeq) + parent.orElse(other.parent) + .map(result.reparent) + .getOrElse(result) + } +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/documentation/velocity.scala b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/velocity.scala new file mode 100644 index 000000000..7dfb71496 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/documentation/velocity.scala @@ -0,0 +1,192 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import scala.collection.JavaConverters._ + +import com.dimajix.flowman.model.ResourceIdentifierWrapper + + +final case class ReferenceWrapper(reference:Reference) { + override def toString: String = reference.toString + + def getParent() : ReferenceWrapper = reference.parent.map(ReferenceWrapper).orNull + def getKind() : String = reference.kind + def getSql() : String = reference match { + case m:MappingReference => m.sql + case m:MappingOutputReference => m.sql + case m:RelationReference => m.sql + case m:ColumnReference => m.sql + case m:SchemaReference => m.sql + case _ => "" + } +} + + +class FragmentWrapper(fragment:Fragment) { + def getReference() : ReferenceWrapper = ReferenceWrapper(fragment.reference) + def getParent() : ReferenceWrapper = fragment.parent.map(ReferenceWrapper).orNull + def getDescription() : String = fragment.description.getOrElse("") +} + + +final case class CheckResultWrapper(result:CheckResult) extends FragmentWrapper(result) { + override def toString: String = result.status.toString + + def getStatus() : String = result.status.toString + def getSuccess() : Boolean = result.success + def getFailure() : Boolean = result.failure +} + + +final case class ColumnCheckWrapper(check:ColumnCheck) extends FragmentWrapper(check) { + override def toString: String = check.name + + def getName() : String = check.name + def getResult() : CheckResultWrapper = check.result.map(CheckResultWrapper).orNull + def getStatus() : String = check.result.map(_.status.toString).getOrElse("NOT_RUN") + def getSuccess() : Boolean = check.result.exists(_.success) + def getFailure() : Boolean = check.result.exists(_.failure) +} + + +final case class ColumnDocWrapper(column:ColumnDoc) extends FragmentWrapper(column) { + override def toString: String = column.name + + def getName() : String = column.name + def getNullable() : Boolean = column.nullable + def getType() : String = column.typeName + def getSqlType() : String = column.sqlType + def getSparkType() : String = column.sparkType + def getCatalogType() : String = column.catalogType + def getColumns() : 
java.util.List[ColumnDocWrapper] = column.children.map(ColumnDocWrapper).asJava + def getChecks() : java.util.List[ColumnCheckWrapper] = column.checks.map(ColumnCheckWrapper).asJava +} + + +final case class SchemaCheckWrapper(check:SchemaCheck) extends FragmentWrapper(check) { + override def toString: String = check.name + + def getName() : String = check.name + def getResult() : CheckResultWrapper = check.result.map(CheckResultWrapper).orNull + def getStatus() : String = check.result.map(_.status.toString).getOrElse("NOT_RUN") + def getSuccess() : Boolean = check.result.exists(_.success) + def getFailure() : Boolean = check.result.exists(_.failure) +} + + +final case class SchemaDocWrapper(schema:SchemaDoc) extends FragmentWrapper(schema) { + def getColumns() : java.util.List[ColumnDocWrapper] = schema.columns.map(ColumnDocWrapper).asJava + def getChecks() : java.util.List[SchemaCheckWrapper] = schema.checks.map(SchemaCheckWrapper).asJava +} + + +final case class MappingOutputDocWrapper(output:MappingOutputDoc) extends FragmentWrapper(output) { + override def toString: String = output.identifier.toString + + def getIdentifier() : String = output.identifier.toString + def getProject() : String = output.identifier.project.getOrElse("") + def getName() : String = output.identifier.output + def getMapping() : String = output.identifier.name + def getOutput() : String = output.identifier.output + def getSchema() : SchemaDocWrapper = output.schema.map(SchemaDocWrapper).orNull +} + + +final case class MappingDocWrapper(mapping:MappingDoc) extends FragmentWrapper(mapping) { + override def toString: String = mapping.identifier.toString + + def getIdentifier() : String = mapping.identifier.toString + def getProject() : String = mapping.identifier.project.getOrElse("") + def getName() : String = mapping.identifier.name + def getInputs() : java.util.List[ReferenceWrapper] = mapping.inputs.map(ReferenceWrapper).asJava + def getOutputs() : java.util.List[MappingOutputDocWrapper] = mapping.outputs.map(MappingOutputDocWrapper).asJava +} + + +final case class RelationDocWrapper(relation:RelationDoc) extends FragmentWrapper(relation) { + override def toString: String = relation.identifier.toString + + def getIdentifier() : String = relation.identifier.toString + def getProject() : String = relation.identifier.project.getOrElse("") + def getName() : String = relation.identifier.name + def getSchema() : SchemaDocWrapper = relation.schema.map(SchemaDocWrapper).orNull + def getInputs() : java.util.List[ReferenceWrapper] = relation.inputs.map(ReferenceWrapper).asJava + def getResources() : java.util.List[ResourceIdentifierWrapper] = relation.provides.map(ResourceIdentifierWrapper).asJava + def getDependencies() : java.util.List[ResourceIdentifierWrapper] = relation.requires.map(ResourceIdentifierWrapper).asJava + def getSources() : java.util.List[ResourceIdentifierWrapper] = relation.sources.map(ResourceIdentifierWrapper).asJava +} + + +final case class TargetPhaseDocWrapper(phase:TargetPhaseDoc) extends FragmentWrapper(phase) { + override def toString: String = phase.phase.upper + + def getName() : String = phase.phase.upper +} + + +final case class TargetDocWrapper(target:TargetDoc) extends FragmentWrapper(target) { + override def toString: String = target.identifier.toString + + def getIdentifier() : String = target.identifier.toString + def getProject() : String = target.identifier.project.getOrElse("") + def getName() : String = target.identifier.name + def getPhases() : 
java.util.List[TargetPhaseDocWrapper] = target.phases.map(TargetPhaseDocWrapper).asJava + + def getOutputs() : java.util.List[ReferenceWrapper] = target.outputs.map(ReferenceWrapper).asJava + def getInputs() : java.util.List[ReferenceWrapper] = target.inputs.map(ReferenceWrapper).asJava +} + + +final case class ProjectDocWrapper(project:ProjectDoc) extends FragmentWrapper(project) { + override def toString: String = project.name + + def getName() : String = project.name + def getVersion() : String = project.version.getOrElse("") + + def resolve(reference:ReferenceWrapper) : FragmentWrapper = { + project.resolve(reference.reference).map { + case m:MappingDoc => MappingDocWrapper(m) + case o:MappingOutputDoc => MappingOutputDocWrapper(o) + case r:RelationDoc => RelationDocWrapper(r) + case t:TargetDoc => TargetDocWrapper(t) + case p:TargetPhaseDoc => TargetPhaseDocWrapper(p) + case s:SchemaDoc => SchemaDocWrapper(s) + case s:SchemaCheck => SchemaCheckWrapper(s) + case t:CheckResult => CheckResultWrapper(t) + case c:ColumnDoc => ColumnDocWrapper(c) + case t:ColumnCheck => ColumnCheckWrapper(t) + case f:Fragment => new FragmentWrapper(f) + }.orNull + } + + def getMappings() : java.util.List[MappingDocWrapper] = + project.mappings + .sortBy(_.identifier.toString) + .map(MappingDocWrapper) + .asJava + def getRelations() : java.util.List[RelationDocWrapper] = + project.relations + .sortBy(_.identifier.toString) + .map(RelationDocWrapper) + .asJava + def getTargets() : java.util.List[TargetDocWrapper] = + project.targets + .sortBy(_.identifier.toString) + .map(TargetDocWrapper) + .asJava +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/AbstractExecution.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/AbstractExecution.scala index 1649940bb..aed2fc310 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/AbstractExecution.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/AbstractExecution.scala @@ -20,15 +20,9 @@ import java.time.Instant import scala.util.control.NonFatal -import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.SparkSession import org.slf4j.LoggerFactory -import com.dimajix.flowman.catalog.HiveCatalog -import com.dimajix.flowman.config.FlowmanConf -import com.dimajix.flowman.hadoop.FileSystem import com.dimajix.flowman.metric.MetricBoard -import com.dimajix.flowman.metric.MetricSystem import com.dimajix.flowman.metric.withWallTime import com.dimajix.flowman.model.Assertion import com.dimajix.flowman.model.AssertionResult @@ -37,13 +31,10 @@ import com.dimajix.flowman.model.JobDigest import com.dimajix.flowman.model.JobLifecycle import com.dimajix.flowman.model.JobResult import com.dimajix.flowman.model.LifecycleResult -import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.Measure import com.dimajix.flowman.model.MeasureResult -import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetResult -import com.dimajix.flowman.types.StructType import com.dimajix.flowman.util.withShutdownHook diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/CachingExecution.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/CachingExecution.scala index 1e6bf0c73..85e3154e8 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/CachingExecution.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/CachingExecution.scala @@ -39,7 +39,9 @@ import 
com.dimajix.flowman.common.ThreadUtils import com.dimajix.flowman.config.FlowmanConf import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.StructType @@ -55,6 +57,8 @@ abstract class CachingExecution(parent:Option[Execution], isolated:Boolean) exte } } private lazy val parallelism = flowmanConf.getConf(FlowmanConf.EXECUTION_MAPPING_PARALLELISM) + private lazy val useMappingSchemaCache = flowmanConf.getConf(FlowmanConf.EXECUTION_MAPPING_SCHEMA_CACHE) + private lazy val useRelationSchemaCache = flowmanConf.getConf(FlowmanConf.EXECUTION_RELATION_SCHEMA_CACHE) private val frameCache:SynchronizedMap[Mapping,Map[String,DataFrame]] = { parent match { @@ -74,15 +78,24 @@ abstract class CachingExecution(parent:Option[Execution], isolated:Boolean) exte } } - private val schemaCache:SynchronizedMap[Mapping,TrieMap[String,StructType]] = { + private val mappingSchemaCache:SynchronizedMap[Mapping,TrieMap[String,StructType]] = { parent match { case Some(ce:CachingExecution) if !isolated => - ce.schemaCache + ce.mappingSchemaCache case _ => SynchronizedMap(IdentityHashMap[Mapping,TrieMap[String,StructType]]()) } } + private val relationSchemaCache:SynchronizedMap[Relation,StructType] = { + parent match { + case Some(ce:CachingExecution) if !isolated => + ce.relationSchemaCache + case _ => + SynchronizedMap(IdentityHashMap[Relation,StructType]()) + } + } + private val resources:mutable.ListBuffer[(ResourceIdentifier,() => Unit)] = { parent match { case Some(ce: CachingExecution) if !isolated => @@ -129,11 +142,16 @@ abstract class CachingExecution(parent:Option[Execution], isolated:Boolean) exte * @return */ override def describe(mapping:Mapping, output:String) : StructType = { - schemaCache.getOrElseUpdate(mapping, TrieMap()) - .getOrElseUpdate(output, createSchema(mapping, output)) + if (useMappingSchemaCache) { + mappingSchemaCache.getOrElseUpdate(mapping, TrieMap()) + .getOrElseUpdate(output, describeMapping(mapping, output)) + } + else { + describeMapping(mapping, output) + } } - private def createSchema(mapping:Mapping, output:String) : StructType = { + private def describeMapping(mapping:Mapping, output:String) : StructType = { if (!mapping.outputs.contains(output)) throw new NoSuchMappingOutputException(mapping.identifier, output) val context = mapping.context @@ -166,6 +184,36 @@ abstract class CachingExecution(parent:Option[Execution], isolated:Boolean) exte } } + /** + * Returns the schema for a specific relation + * @param relation + * @param partitions + * @return + */ + override def describe(relation:Relation, partitions:Map[String,FieldValue] = Map()) : StructType = { + if (useRelationSchemaCache) { + relationSchemaCache.getOrElseUpdate(relation, describeRelation(relation, partitions)) + } + else { + describeRelation(relation, partitions) + } + } + + private def describeRelation(relation:Relation, partitions:Map[String,FieldValue] = Map()) : StructType = { + try { + logger.info(s"Describing relation '${relation.identifier}'") + listeners.foreach { l => + Try { + l._1.describeRelation(this, relation, l._2) + } + } + relation.describe(this, partitions) + } + catch { + case NonFatal(e) => throw new DescribeRelationFailedException(relation.identifier, e) + } + } + /** * Registers a refresh function associated with a [[ResourceIdentifier]] * @param key @@ -186,6 +234,12 @@ abstract 
class CachingExecution(parent:Option[Execution], isolated:Boolean) exte resources.filter(kv => kv._1.contains(key) || key.contains(kv._1)).foreach(_._2()) } parent.foreach(_.refreshResource(key)) + + // Invalidate schema caches + relationSchemaCache.toSeq + .map(_._1) + .filter(_.provides.exists(_.contains(key))) + .foreach(relationSchemaCache.impl.remove) } /** @@ -201,7 +255,8 @@ abstract class CachingExecution(parent:Option[Execution], isolated:Boolean) exte if (!sharedCache) { frameCache.values.foreach(_.values.foreach(_.unpersist(true))) frameCache.clear() - schemaCache.clear() + mappingSchemaCache.clear() + relationSchemaCache.clear() resources.clear() } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Execution.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Execution.scala index 75189efa9..079bb9041 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Execution.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Execution.scala @@ -35,9 +35,11 @@ import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingOutputIdentifier import com.dimajix.flowman.model.Measure import com.dimajix.flowman.model.MeasureResult +import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetResult +import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.StructType @@ -186,6 +188,14 @@ abstract class Execution { mapping.describe(this, deps) } + /** + * Returns the schema for a specific relation + * @param relation + * @param partitions + * @return + */ + def describe(relation:Relation, partitions:Map[String,FieldValue] = Map()) : StructType + /** * Registers a refresh function associated with a [[ResourceIdentifier]] * @param key diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/ExecutionListener.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/ExecutionListener.scala index f9c90d907..2c3f6ba78 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/ExecutionListener.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/ExecutionListener.scala @@ -26,6 +26,7 @@ import com.dimajix.flowman.model.LifecycleResult import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.Measure import com.dimajix.flowman.model.MeasureResult +import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetDigest import com.dimajix.flowman.model.TargetResult @@ -126,6 +127,14 @@ trait ExecutionListener { * @param parent */ def describeMapping(execution: Execution, mapping:Mapping, parent:Option[Token]) : Unit + + /** + * Informs the listener that a specific relation is about to be described + * @param execution + * @param relation + * @param parent + */ + def describeRelation(execution: Execution, relation:Relation, parent:Option[Token]) : Unit } @@ -142,4 +151,5 @@ abstract class AbstractExecutionListener extends ExecutionListener { override def finishMeasure(execution:Execution, token: MeasureToken, result: MeasureResult): Unit = {} override def instantiateMapping(execution: Execution, mapping:Mapping, parent:Option[Token]) : Unit = {} override def describeMapping(execution: Execution, mapping:Mapping, parent:Option[Token]) : Unit = {} + override def describeRelation(execution: Execution, relation:Relation, parent:Option[Token]) : Unit = {} } diff --git 
a/flowman-core/src/main/scala/com/dimajix/flowman/execution/MonitorExecution.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/MonitorExecution.scala index 5a7dcfb5f..736281f33 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/MonitorExecution.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/MonitorExecution.scala @@ -25,7 +25,9 @@ import com.dimajix.flowman.hadoop.FileSystem import com.dimajix.flowman.metric.MetricBoard import com.dimajix.flowman.metric.MetricSystem import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.StructType @@ -98,6 +100,14 @@ final class MonitorExecution(parent:Execution, override val listeners:Seq[(Execu */ override def describe(mapping: Mapping, output: String): StructType = parent.describe(mapping, output) + /** + * Returns the schema for a specific relation + * @param relation + * @param partitions + * @return + */ + override def describe(relation:Relation, partitions:Map[String,FieldValue] = Map()) : StructType = parent.describe(relation, partitions) + override def addResource(key:ResourceIdentifier)(refresh: => Unit) : Unit = parent.addResource(key)(refresh) override def refreshResource(key:ResourceIdentifier) : Unit = parent.refreshResource(key) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/RootContext.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/RootContext.scala index 9a3f9e9c7..84023c833 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/RootContext.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/RootContext.scala @@ -16,15 +16,15 @@ package com.dimajix.flowman.execution +import scala.collection.concurrent.TrieMap import scala.collection.mutable +import scala.util.control.NonFatal import org.apache.hadoop.conf.{Configuration => HadoopConf} import org.apache.spark.SparkConf import org.slf4j.LoggerFactory -import com.dimajix.flowman.config.Configuration import com.dimajix.flowman.config.FlowmanConf -import com.dimajix.flowman.execution.ProjectContext.Builder import com.dimajix.flowman.hadoop.FileSystem import com.dimajix.flowman.model.Connection import com.dimajix.flowman.model.ConnectionIdentifier @@ -36,11 +36,11 @@ import com.dimajix.flowman.model.Namespace import com.dimajix.flowman.model.NamespaceWrapper import com.dimajix.flowman.model.Profile import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetIdentifier -import com.dimajix.flowman.model.Prototype import com.dimajix.flowman.model.Template import com.dimajix.flowman.model.TemplateIdentifier import com.dimajix.flowman.model.Test @@ -125,7 +125,8 @@ final class RootContext private[execution]( _env + ("namespace" -> (NamespaceWrapper(_namespace) -> SettingLevel.SCOPE_OVERRIDE.level)), _config ) { - private val _children: mutable.Map[String, Context] = mutable.Map() + private val _children: TrieMap[String, Context] = TrieMap() + private val _imports:TrieMap[String,(Context,Project.Import)] = TrieMap() private lazy val _fs = FileSystem(hadoopConf) private lazy val _exec = _execution match { case Some(execution) => execution @@ -169,10 +170,12 @@ final class RootContext private[execution]( override def 
getMapping(identifier: MappingIdentifier, allowOverrides:Boolean=true): Mapping = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchMappingException(identifier) - val child = getProjectContext(identifier.project.get) - child.getMapping(identifier, allowOverrides) + identifier.project match { + case None => throw new NoSuchMappingException(identifier) + case Some(project) => + val child = getProjectContext(project) + child.getMapping(identifier, allowOverrides) + } } /** @@ -184,10 +187,12 @@ final class RootContext private[execution]( override def getRelation(identifier: RelationIdentifier, allowOverrides:Boolean=true): Relation = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchRelationException(identifier) - val child = getProjectContext(identifier.project.get) - child.getRelation(identifier, allowOverrides) + identifier.project match { + case None => throw new NoSuchRelationException (identifier) + case Some(project) => + val child = getProjectContext (project) + child.getRelation (identifier, allowOverrides) + } } /** @@ -199,10 +204,12 @@ final class RootContext private[execution]( override def getTarget(identifier: TargetIdentifier): Target = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchTargetException(identifier) - val child = getProjectContext(identifier.project.get) - child.getTarget(identifier) + identifier.project match { + case None => throw new NoSuchTargetException(identifier) + case Some(project) => + val child = getProjectContext(project) + child.getTarget(identifier) + } } /** @@ -214,20 +221,20 @@ final class RootContext private[execution]( override def getConnection(identifier:ConnectionIdentifier) : Connection = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) { - connections.getOrElseUpdate(identifier.name, - extraConnections.get(identifier.name) - .orElse( - namespace - .flatMap(_.connections.get(identifier.name)) - ) - .map(_.instantiate(this)) - .getOrElse(throw new NoSuchConnectionException(identifier)) - ) - } - else { - val child = getProjectContext(identifier.project.get) - child.getConnection(identifier) + identifier.project match { + case None => + connections.getOrElseUpdate(identifier.name, + extraConnections.get(identifier.name) + .orElse( + namespace + .flatMap(_.connections.get(identifier.name)) + ) + .map(_.instantiate(this)) + .getOrElse(throw new NoSuchConnectionException(identifier)) + ) + case Some(project) => + val child = getProjectContext(project) + child.getConnection(identifier) } } @@ -240,10 +247,12 @@ final class RootContext private[execution]( override def getJob(identifier: JobIdentifier): Job = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchJobException(identifier) - val child = getProjectContext(identifier.project.get) - child.getJob(identifier) + identifier.project match { + case None => throw new NoSuchJobException (identifier) + case Some(project) => + val child = getProjectContext (project) + child.getJob (identifier) + } } /** @@ -255,10 +264,12 @@ final class RootContext private[execution]( override def getTest(identifier: TestIdentifier): Test = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchTestException(identifier) - val child = getProjectContext(identifier.project.get) - child.getTest(identifier) + 
identifier.project match { + case None => throw new NoSuchTestException(identifier) + case Some(project) => + val child = getProjectContext(project) + child.getTest(identifier) + } } /** @@ -270,10 +281,24 @@ final class RootContext private[execution]( override def getTemplate(identifier: TemplateIdentifier): Template[_] = { require(identifier != null && identifier.nonEmpty) - if (identifier.project.isEmpty) - throw new NoSuchTemplateException(identifier) - val child = getProjectContext(identifier.project.get) - child.getTemplate(identifier) + identifier.project match { + case None => throw new NoSuchTemplateException(identifier) + case Some(project) => + val child = getProjectContext(project) + child.getTemplate(identifier) + } + } + + /** + * Returns the context for a specific project. This will either return an existing context or create a new + * one if it does not exist yet. + * + * @param project + * @return + */ + def getProjectContext(project:Project) : Context = { + require(project != null) + _children.getOrElseUpdate(project.name, createProjectContext(project)) } /** @@ -286,10 +311,6 @@ final class RootContext private[execution]( require(projectName != null && projectName.nonEmpty) _children.getOrElseUpdate(projectName, createProjectContext(loadProject(projectName))) } - def getProjectContext(project:Project) : Context = { - require(project != null) - _children.getOrElseUpdate(project.name, createProjectContext(project)) - } private def createProjectContext(project: Project) : Context = { val builder = ProjectContext.builder(this, project) @@ -299,15 +320,50 @@ final class RootContext private[execution]( } } + // We need to instantiate the projects job within its context, so we create a very temporary context + def getImportJob(name:String) : Job = { + try { + val projectContext = ProjectContext.builder(this, project) + .withEnvironment(project.environment, SettingLevel.PROJECT_SETTING) + .build() + projectContext.getJob(JobIdentifier(name)) + } catch { + case NonFatal(ex) => + throw new IllegalArgumentException(s"Cannot instantiate job '$name' to apply import settings for project ${project.name}", ex) + } + } + + // Apply any import setting + _imports.get(project.name).foreach { case(context,imprt) => + val job = context.evaluate(imprt.job) match { + case Some(name) => + Some(getImportJob(name)) + case None => + if (project.jobs.contains("main")) + Some(getImportJob("main")) + else None + } + job.foreach { job => + val args = job.arguments(context.evaluate(imprt.arguments)) + builder.withEnvironment(args, SettingLevel.SCOPE_OVERRIDE) + builder.withEnvironment(job.environment, SettingLevel.JOB_OVERRIDE) + } + } + // Apply overrides builder.overrideMappings(overrideMappings.filter(_._1.project.contains(project.name)).map(kv => (kv._1.name, kv._2))) builder.overrideRelations(overrideRelations.filter(_._1.project.contains(project.name)).map(kv => (kv._1.name, kv._2))) - val context = builder.withEnvironment(project.environment) + val context = builder + .withEnvironment(project.environment, SettingLevel.PROJECT_SETTING) .withConfig(project.config) .build() - _children.update(project.name, context) + // Store imports, together with context + project.imports.foreach { im => + _imports.update(im.project, (context, im)) + } + context } private def loadProject(name: String): Project = { diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Runner.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Runner.scala index 549c99b83..9fca652c3 100644 --- 
a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Runner.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Runner.scala @@ -51,6 +51,8 @@ import com.dimajix.flowman.model.TargetResult import com.dimajix.flowman.model.Test import com.dimajix.flowman.model.TestWrapper import com.dimajix.flowman.spi.LogFilter +import com.dimajix.flowman.types.FieldType +import com.dimajix.flowman.types.LongType import com.dimajix.flowman.util.ConsoleColors._ import com.dimajix.spark.SparkUtils.withJobGroup @@ -97,25 +99,6 @@ private[execution] sealed class RunnerImpl { result } - def withExecution[T](parent:Execution, isolated:Boolean=false)(fn:Execution => T) : T = { - val execution : Execution = new ScopedExecution(parent, isolated) - val result = fn(execution) - - // Wait for any running background operation, and do not perform a cleanup - val ops = execution.operations - val activeOps = ops.listActive() - if (activeOps.nonEmpty) { - logger.info("Some background operations are still active:") - activeOps.foreach(o => logger.info(s" - s${o.name}")) - logger.info("Waiting for termination...") - ops.awaitTermination() - } - - // Finally release any resources - execution.cleanup() - result - } - protected val lineSize = 109 protected val separator = boldWhite(StringUtils.repeat('-', lineSize)) protected val doubleSeparator = boldWhite(StringUtils.repeat('=', lineSize)) @@ -237,7 +220,6 @@ private[execution] sealed class RunnerImpl { private[execution] final class JobRunnerImpl(runner:Runner) extends RunnerImpl { private val stateStore = runner.stateStore private val stateStoreListener = new StateStoreAdaptorListener(stateStore) - private val parentExecution = runner.parentExecution /** * Executes a single job using the given execution and a map of parameters. 
The Runner may decide not to @@ -257,7 +239,7 @@ private[execution] final class JobRunnerImpl(runner:Runner) extends RunnerImpl { val startTime = Instant.now() val isolated2 = isolated || job.parameters.nonEmpty || job.environment.nonEmpty - withExecution(parentExecution, isolated2) { execution => + runner.withExecution(isolated2) { execution => runner.withJobContext(job, args, Some(execution), force, dryRun, isolated2) { (context, arguments) => val title = s"lifecycle for job '${job.identifier}' ${arguments.map(kv => kv._1 + "=" + kv._2).mkString(", ")}" val listeners = if (!dryRun) stateStoreListener +: (runner.hooks ++ job.hooks).map(_.instantiate(context)) else Seq() @@ -452,10 +434,8 @@ private[execution] final class JobRunnerImpl(runner:Runner) extends RunnerImpl { * @param runner */ private[execution] final class TestRunnerImpl(runner:Runner) extends RunnerImpl { - private val parentExecution = runner.parentExecution - def executeTest(test:Test, keepGoing:Boolean=false, dryRun:Boolean=false) : Status = { - withExecution(parentExecution, true) { execution => + runner.withExecution(true) { execution => runner.withTestContext(test, Some(execution), dryRun) { context => val title = s"Running test '${test.identifier}'" logTitle(title) @@ -635,7 +615,7 @@ final class Runner( * @param phases * @return */ - def executeTargets(targets:Seq[Target], phases:Seq[Phase], force:Boolean, keepGoing:Boolean=false, dryRun:Boolean=false, isolated:Boolean=true) : Status = { + def executeTargets(targets:Seq[Target], phases:Seq[Phase], jobName:String="execute-target", force:Boolean, keepGoing:Boolean=false, dryRun:Boolean=false, isolated:Boolean=true) : Status = { if (targets.nonEmpty) { val context = targets.head.context @@ -645,17 +625,37 @@ final class Runner( .withTargets(targets.map(tgt => (tgt.name, Prototype.of(tgt))).toMap) .build() val job = Job.builder(jobContext) - .setName("execute-target-" + Clock.systemUTC().millis()) + .setName(jobName) .setTargets(targets.map(_.identifier)) + .setParameters(Seq(Job.Parameter("execution_ts", LongType))) .build() - executeJob(job, phases, force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=isolated) + executeJob(job, phases, args=Map("execution_ts" -> Clock.systemUTC().millis()), force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=isolated) } else { Status.SUCCESS } } + def withExecution[T](isolated:Boolean=false)(fn:Execution => T) : T = { + val execution : Execution = new ScopedExecution(parentExecution, isolated) + val result = fn(execution) + + // Wait for any running background operation, and do not perform a cleanup + val ops = execution.operations + val activeOps = ops.listActive() + if (activeOps.nonEmpty) { + logger.info("Some background operations are still active:") + activeOps.foreach(o => logger.info(s" - s${o.name}")) + logger.info("Waiting for termination...") + ops.awaitTermination() + } + + // Finally release any resources + execution.cleanup() + result + } + /** * Provides a context for the given job. This will apply all environment variables of the job and add * additional variables like a `force` flag. 
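// Sketch: how the reworked Runner API above might be used. The `session`, `targets` and
// `mapping` values are hypothetical; this snippet is illustrative and not part of the patch.
val runner = session.runner

// executeTargets now wraps ad-hoc targets into a transient job that carries an
// `execution_ts` parameter instead of a timestamp-suffixed job name.
runner.executeTargets(targets, Seq(Phase.BUILD), force = true)

// withExecution moved from RunnerImpl onto Runner itself; it waits for pending background
// operations and cleans up the scoped execution once the block returns.
runner.withExecution(isolated = true) { execution =>
  execution.describe(mapping, "main")
}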
diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Session.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Session.scala index 66066baca..8ae2eafe5 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/Session.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/Session.scala @@ -27,6 +27,7 @@ import org.slf4j.LoggerFactory import com.dimajix.flowman.catalog.HiveCatalog import com.dimajix.flowman.config.Configuration import com.dimajix.flowman.config.FlowmanConf +import com.dimajix.flowman.documentation.Documenter import com.dimajix.flowman.execution.Session.builder import com.dimajix.flowman.hadoop.FileSystem import com.dimajix.flowman.history.NullStateStore diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/exceptions.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/exceptions.scala index be67e6b43..39a132a6d 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/exceptions.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/exceptions.scala @@ -65,6 +65,8 @@ class DescribeMappingFailedException(val mapping:MappingIdentifier, cause:Throwa extends ExecutionException(s"Describing mapping $mapping failed", cause) class InstantiateMappingFailedException(val mapping:MappingIdentifier, cause:Throwable = None.orNull) extends ExecutionException(s"Instantiating mapping $mapping failed", cause) +class DescribeRelationFailedException(val relation:RelationIdentifier, cause:Throwable = None.orNull) + extends ExecutionException(s"Describing relation $relation failed", cause) class ValidationFailedException(val target:TargetIdentifier, cause:Throwable = None.orNull) extends ExecutionException(s"Validation of target $target failed", cause) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/execution/migration.scala b/flowman-core/src/main/scala/com/dimajix/flowman/execution/migration.scala index 4e8d94c96..78fc459c9 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/execution/migration.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/execution/migration.scala @@ -20,7 +20,6 @@ import java.util.Locale sealed abstract class MigrationPolicy extends Product with Serializable - object MigrationPolicy { case object RELAXED extends MigrationPolicy case object STRICT extends MigrationPolicy @@ -37,7 +36,6 @@ object MigrationPolicy { sealed abstract class MigrationStrategy extends Product with Serializable - object MigrationStrategy { case object NEVER extends MigrationStrategy case object FAIL extends MigrationStrategy @@ -50,7 +48,7 @@ object MigrationStrategy { case "never" => MigrationStrategy.NEVER case "fail" => MigrationStrategy.FAIL case "alter" => MigrationStrategy.ALTER - case "alter_replace" => MigrationStrategy.ALTER_REPLACE + case "alter_replace"|"alterreplace" => MigrationStrategy.ALTER_REPLACE case "replace" => MigrationStrategy.REPLACE case _ => throw new IllegalArgumentException(s"Unknown migration strategy: '$mode'. 
" + "Accepted migration strategy are 'NEVER', 'FAIL', 'ALTER', 'ALTER_REPLACE' and 'REPLACE'.") diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Category.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Category.scala index 4c358271c..668c170bf 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Category.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Category.scala @@ -26,6 +26,7 @@ sealed abstract class Category extends Product with Serializable { object Category { case object MAPPING extends Category + case object MAPPING_OUTPUT extends Category case object MAPPING_COLUMN extends Category case object RELATION extends Category case object RELATION_COLUMN extends Category @@ -34,6 +35,7 @@ object Category { def ofString(category:String) : Category = { category.toLowerCase(Locale.ROOT) match { case "mapping" => MAPPING + case "mapping_output" => MAPPING_OUTPUT case "mapping_column" => MAPPING_COLUMN case "relation" => RELATION case "relation_column" => RELATION_COLUMN diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Graph.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Graph.scala index 64c967600..20a4dfec7 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Graph.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Graph.scala @@ -16,6 +16,8 @@ package com.dimajix.flowman.graph +import scala.annotation.tailrec + import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.NoSuchMappingException import com.dimajix.flowman.execution.NoSuchRelationException @@ -32,6 +34,10 @@ import com.dimajix.flowman.model.TargetIdentifier object Graph { + def empty(context:Context) : Graph = { + Graph(context, Seq.empty, Seq.empty, Seq.empty) + } + /** * Creates a Graph from a given project. The [[Context]] required for lookups and instantiation is retrieved from * the given [[Session]] @@ -75,7 +81,18 @@ final case class Graph( relations:Seq[RelationRef], targets:Seq[TargetRef] ) { - def nodes : Seq[Node] = mappings ++ relations ++ targets + def project : Option[Project] = context.project + + def nodes : Seq[Node] = { + def collectChildren(nodes:Seq[Node]) : Seq[Node] = { + val children = nodes.flatMap(_.children) + val next = if (children.nonEmpty) collectChildren(children) else Seq.empty + nodes ++ next + } + + val roots = mappings ++ relations ++ targets + collectChildren(roots) + } def edges : Seq[Edge] = nodes.flatMap(_.outgoing) /** diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/GraphBuilder.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/GraphBuilder.scala index 9a2186669..e96dba514 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/GraphBuilder.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/GraphBuilder.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -105,7 +105,8 @@ final class GraphBuilder(context:Context, phase:Phase) { } else { // Create new node and *first* put it into map of known mappings - val node = MappingRef(nextNodeId(), mapping) + val outputs = mapping.outputs.toSeq.map(o => MappingOutput(nextNodeId(), null, o)) + val node = MappingRef(nextNodeId(), mapping, outputs) mappings.put(mapping, node) // Now recursively run the linking process on the newly created node val linker = Linker(this, mapping.context, node) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Linker.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Linker.scala index a12e8fe74..e8cfb14c0 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/Linker.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/Linker.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,30 +17,60 @@ package com.dimajix.flowman.graph import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.IdentifierRelationReference +import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.Reference +import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.ValueRelationReference import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.SingleValue final case class Linker private[graph](builder:GraphBuilder, context:Context, node:Node) { + def input(mapping: Mapping, output:String) : Linker = { + val in = builder.refMapping(mapping) + val out = in.outputs.find(_.output == output) + .getOrElse(throw new IllegalArgumentException(s"Mapping '${mapping.identifier}' doesn't provide output '$output'")) + val edge = InputMapping(out, node) + link(edge) + } def input(mapping: MappingIdentifier, output:String) : Linker = { val instance = context.getMapping(mapping) - val in = builder.refMapping(instance) - val edge = InputMapping(in, node, output) + input(instance, output) + } + + def read(relation: Reference[Relation], partitions:Map[String,FieldValue]) : Linker = { + relation match { + case ref:ValueRelationReference => read(ref.value, partitions) + case ref:IdentifierRelationReference => read(ref.identifier, partitions) + } + } + def read(relation: Relation, partitions:Map[String,FieldValue]) : Linker = { + val in = builder.refRelation(relation) + val edge = ReadRelation(in, node, partitions) link(edge) } def read(relation: RelationIdentifier, partitions:Map[String,FieldValue]) : Linker = { val instance = context.getRelation(relation) - val in = builder.refRelation(instance) - val edge = ReadRelation(in, node, partitions) + read(instance, partitions) + } + + def write(relation: Reference[Relation], partitions:Map[String,SingleValue]) : Linker = { + relation match { + case ref:ValueRelationReference => write(ref.value, partitions) + case ref:IdentifierRelationReference => write(ref.identifier, partitions) + } + } + def write(relation: Relation, partition:Map[String,SingleValue]) : Linker = { + val out = builder.refRelation(relation) + val edge = WriteRelation(node, out, partition) link(edge) } def write(relation: RelationIdentifier, partition:Map[String,SingleValue]) : Linker = { val instance = context.getRelation(relation) - val out = builder.refRelation(instance) - val edge = 
WriteRelation(node, out, partition) - link(edge) + write(instance, partition) } /** diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/edges.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/edges.scala index 065192c28..a7b69434d 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/edges.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/edges.scala @@ -34,9 +34,11 @@ final case class ReadRelation(override val input:RelationRef, override val outpu def resources : Set[ResourceIdentifier] = input.relation.resources(partitions) } -final case class InputMapping(override val input:MappingRef,override val output:Node,pin:String="main") extends Edge { +final case class InputMapping(override val input:MappingOutput,override val output:Node) extends Edge { override def action: Action = Action.INPUT - override def label: String = s"${action.upper} from ${input.label} output '$pin'" + override def label: String = s"${action.upper} from ${input.mapping.label} output '${input.output}'" + def mapping : MappingRef = input.mapping + def pin : String = input.output } final case class WriteRelation(override val input:Node, override val output:RelationRef, partition:Map[String,SingleValue] = Map()) extends Edge { diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/graph/nodes.scala b/flowman-core/src/main/scala/com/dimajix/flowman/graph/nodes.scala index c6d47a98c..405eb9787 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/graph/nodes.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/graph/nodes.scala @@ -20,16 +20,17 @@ import scala.collection.mutable import com.dimajix.flowman.execution.Phase import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target +import com.dimajix.flowman.model.TargetIdentifier sealed abstract class Node extends Product with Serializable { private[graph] val inEdges = mutable.Buffer[Edge]() private[graph] val outEdges = mutable.Buffer[Edge]() - private[graph] val _parent : Option[Node] = None - private[graph] val _children = mutable.Seq[Node]() /** Unique node ID, generated by GraphBuilder */ val id : Int @@ -39,9 +40,7 @@ sealed abstract class Node extends Product with Serializable { def category : Category def kind : String def name : String - - def provides : Set[ResourceIdentifier] - def requires : Set[ResourceIdentifier] + def project : Option[String] /** * List of incoming edges, i.e. the upstream nodes which provide input data @@ -55,18 +54,23 @@ sealed abstract class Node extends Product with Serializable { */ def outgoing : Seq[Edge] = outEdges + /** + * Returns upstream resources + */ + def upstream : Seq[Edge] = incoming + /** * Child nodes providing more detail. For example a "Mapping" node might contain detail information on individual * columns, which would be logical children of the mapping. * @return */ - def children : Seq[Node] = _children + def children : Seq[Node] = Seq.empty /** * Optional parent node. 
For example a "Column" node might be a child of a "Mapping" node * @return */ - def parent : Option[Node] = _parent + def parent : Option[Node] = None /** * Create a nice string representation of the upstream dependency tree @@ -87,7 +91,9 @@ sealed abstract class Node extends Product with Serializable { Iterator() } } - val trees = incoming.map { child => + + // Do not use incoming edges, but upstream edges instead - this mainly makes sense for MappingOutputs + val trees = upstream.map { child => child.label + "\n" + child.input.upstreamTreeRec } val headChildren = trees.dropRight(1) @@ -99,39 +105,55 @@ sealed abstract class Node extends Product with Serializable { } } -final case class MappingRef(id:Int, mapping:Mapping) extends Node { +final case class MappingRef(id:Int, mapping:Mapping, outputs:Seq[MappingOutput]) extends Node { + require(outputs.forall(_.mapping == null)) + outputs.foreach(_.mapping = this) + override def category: Category = Category.MAPPING override def kind: String = mapping.kind override def name: String = mapping.name - override def provides : Set[ResourceIdentifier] = Set() - override def requires : Set[ResourceIdentifier] = mapping.requires + override def project: Option[String] = mapping.project.map(_.name) + override def children: Seq[Node] = outputs + def requires : Set[ResourceIdentifier] = mapping.requires + def identifier : MappingIdentifier = mapping.identifier } final case class TargetRef(id:Int, target:Target, phase:Phase) extends Node { override def category: Category = Category.TARGET override def kind: String = target.kind override def name: String = target.name - override def provides : Set[ResourceIdentifier] = target.provides(phase) - override def requires : Set[ResourceIdentifier] = target.requires(phase) + override def project: Option[String] = target.project.map(_.name) + def provides : Set[ResourceIdentifier] = target.provides(phase) + def requires : Set[ResourceIdentifier] = target.requires(phase) + def identifier : TargetIdentifier = target.identifier } final case class RelationRef(id:Int, relation:Relation) extends Node { override def category: Category = Category.RELATION override def kind: String = relation.kind override def name: String = relation.name - override def provides : Set[ResourceIdentifier] = relation.provides - override def requires : Set[ResourceIdentifier] = relation.requires + override def project: Option[String] = relation.project.map(_.name) + def provides : Set[ResourceIdentifier] = relation.provides + def requires : Set[ResourceIdentifier] = relation.requires + def identifier : RelationIdentifier = relation.identifier +} +final case class MappingOutput(id:Int, var mapping: MappingRef, output:String) extends Node { + override def toString: String = s"MappingOutput($id, ${mapping.id}, $output)" + override def category: Category = Category.MAPPING_OUTPUT + override def kind: String = "mapping_output" + override def parent: Option[Node] = Some(mapping) + override def name: String = mapping.name + "." + output + override def project: Option[String] = mapping.project + override def upstream : Seq[Edge] = mapping.incoming } final case class MappingColumn(id:Int, mapping: Mapping, output:String, column:String) extends Node { override def category: Category = Category.MAPPING_COLUMN override def kind: String = "mapping_column" override def name: String = mapping.name + "." + output + "." 
+ column - override def provides : Set[ResourceIdentifier] = Set() - override def requires : Set[ResourceIdentifier] = Set() + override def project: Option[String] = mapping.project.map(_.name) } final case class RelationColumn(id:Int, relation: Relation, column:String) extends Node { override def category: Category = Category.RELATION_COLUMN override def kind: String = "relation_column" override def name: String = relation.name + "." + column - override def provides : Set[ResourceIdentifier] = Set() - override def requires : Set[ResourceIdentifier] = Set() + override def project: Option[String] = relation.project.map(_.name) } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/hadoop/FileSystem.scala b/flowman-core/src/main/scala/com/dimajix/flowman/hadoop/FileSystem.scala index 845c219b0..5eb818fe7 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/hadoop/FileSystem.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/hadoop/FileSystem.scala @@ -19,6 +19,7 @@ package com.dimajix.flowman.hadoop import java.net.URI import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.LocalFileSystem import org.apache.hadoop.fs.Path @@ -29,7 +30,10 @@ import org.apache.hadoop.fs.Path case class FileSystem(conf:Configuration) { private val localFs = org.apache.hadoop.fs.FileSystem.getLocal(conf) - def file(path:Path) : File = File(path.getFileSystem(conf), path) + def file(path:Path) : File = { + val fs = path.getFileSystem(conf) + File(fs, path) + } def file(path:String) : File = file(new Path(path)) def file(path:URI) : File = file(new Path(path)) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateRepository.scala b/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateRepository.scala index 2e2a017f7..e69179a9b 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateRepository.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateRepository.scala @@ -22,8 +22,8 @@ import java.time.ZonedDateTime import java.util.Locale import java.util.Properties -import scala.collection.mutable import scala.concurrent.Await +import scala.concurrent.Future import scala.concurrent.duration.Duration import scala.language.higherKinds import scala.util.control.NonFatal @@ -154,10 +154,10 @@ private[history] class JdbcStateRepository(connection: JdbcStateStore.Connection private lazy val db = { val url = connection.url val driver = connection.driver - val user = connection.user - val password = connection.password val props = new Properties() connection.properties.foreach(kv => props.setProperty(kv._1, kv._2)) + connection.user.foreach(props.setProperty("user", _)) + connection.password.foreach(props.setProperty("password", _)) logger.debug(s"Connecting via JDBC to $url with driver $driver") val executor = slick.util.AsyncExecutor( name="Flowman.default", @@ -165,7 +165,8 @@ private[history] class JdbcStateRepository(connection: JdbcStateStore.Connection maxThreads = 20, queueSize = 1000, maxConnections = 20) - Database.forURL(url, driver=driver, user=user.orNull, password=password.orNull, prop=props, executor=executor) + // Do not set username and password, since a bug in Slick would discard all other connection properties + Database.forURL(url, driver=driver, prop=props, executor=executor) } val jobRuns = TableQuery[JobRuns] @@ -381,7 +382,7 @@ private[history] class JdbcStateRepository(connection: JdbcStateStore.Connection Await.result(query, Duration.Inf) } catch { - case NonFatal(ex) => 
logger.error("Cannot connect to JDBC history database", ex) + case NonFatal(ex) => logger.error(s"Cannot create tables of JDBC history database: ${ex.getMessage}") } } @@ -442,15 +443,19 @@ private[history] class JdbcStateRepository(connection: JdbcStateStore.Connection } def insertJobMetrics(jobId:Long, metrics:Seq[Measurement]) : Unit = { - metrics.foreach { m => + implicit val ec = db.executor.executionContext + + val result = metrics.map { m => val jobMetric = JobMetric(0, jobId, m.name, new Timestamp(m.ts.toInstant.toEpochMilli), m.value) val jmQuery = (jobMetrics returning jobMetrics.map(_.id) into((jm,id) => jm.copy(id=id))) += jobMetric - val jmResult = Await.result(db.run(jmQuery), Duration.Inf) - - val labels = m.labels.map(l => JobMetricLabel(jmResult.id, l._1, l._2)) - val mlQuery = jobMetricLabels ++= labels - Await.result(db.run(mlQuery), Duration.Inf) + db.run(jmQuery).flatMap { metric => + val labels = m.labels.map(l => JobMetricLabel(metric.id, l._1, l._2)) + val mlQuery = jobMetricLabels ++= labels + db.run(mlQuery) + } } + + Await.result(Future.sequence(result), Duration.Inf) } def getJobMetrics(jobId:Long) : Seq[Measurement] = { diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateStore.scala b/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateStore.scala index 4dd2eb835..8a37b8f31 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateStore.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/history/JdbcStateStore.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -21,26 +21,17 @@ import java.sql.SQLRecoverableException import java.sql.SQLTransientException import java.sql.Timestamp import java.time.Clock -import java.time.ZoneId import javax.xml.bind.DatatypeConverter import org.slf4j.LoggerFactory -import slick.jdbc.DerbyProfile -import slick.jdbc.H2Profile -import slick.jdbc.MySQLProfile -import slick.jdbc.PostgresProfile -import slick.jdbc.SQLServerProfile -import slick.jdbc.SQLiteProfile - -import com.dimajix.flowman.execution.Phase + import com.dimajix.flowman.execution.Status import com.dimajix.flowman.graph.GraphBuilder import com.dimajix.flowman.history.JdbcStateRepository.JobRun import com.dimajix.flowman.history.JdbcStateRepository.TargetRun import com.dimajix.flowman.history.JdbcStateStore.JdbcJobToken import com.dimajix.flowman.history.JdbcStateStore.JdbcTargetToken -import com.dimajix.flowman.metric.GaugeMetric -import com.dimajix.flowman.metric.Metric +import com.dimajix.flowman.jdbc.JdbcUtils import com.dimajix.flowman.model.Job import com.dimajix.flowman.model.JobDigest import com.dimajix.flowman.model.JobResult @@ -98,7 +89,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t None ) logger.debug(s"Checking last state of '${run.phase}' job '${run.name}' in history database") - withSession { repository => + withRepository { repository => repository.getJobState(run) } } @@ -109,7 +100,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @return */ override def getJobMetrics(jobId:String) : Seq[Measurement] = { - withSession { repository => + withRepository { repository => repository.getJobMetrics(jobId.toLong) } } @@ -121,7 +112,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @return 
*/ override def getJobGraph(jobId: String): Option[Graph] = { - withSession { repository => + withRepository { repository => repository.getJobGraph(jobId.toLong) } } @@ -133,7 +124,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @return */ override def getJobEnvironment(jobId: String): Map[String, String] = { - withSession { repository => + withRepository { repository => repository.getJobEnvironment(jobId.toLong) } } @@ -167,7 +158,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t } logger.debug(s"Start '${digest.phase}' job '${run.name}' in history database") - val run2 = withSession { repository => + val run2 = withRepository { repository => repository.insertJobRun(run, digest.args, env) } @@ -187,7 +178,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t val now = new Timestamp(Clock.systemDefaultZone().instant().toEpochMilli) val graph = Graph.ofGraph(jdbcToken.graph.build()) - withSession{ repository => + withRepository{ repository => repository.setJobStatus(run.copy(end_ts = Some(now), status=status.upper, error=result.exception.map(_.toString))) repository.insertJobMetrics(run.id, metrics) repository.insertJobGraph(run.id, graph) @@ -215,13 +206,13 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t None ) logger.debug(s"Checking state of target '${run.name}' in history database") - withSession { repository => + withRepository { repository => repository.getTargetState(run, target.partitions) } } def getTargetState(targetId: String): TargetState = { - withSession { repository => + withRepository { repository => repository.getTargetState(targetId.toLong) } } @@ -251,7 +242,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t ) logger.debug(s"Start '${digest.phase}' target '${run.name}' in history database") - val run2 = withSession { repository => + val run2 = withRepository { repository => repository.insertTargetRun(run, digest.partitions) } JdbcTargetToken(run2, parentRun) @@ -269,7 +260,7 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t logger.info(s"Mark '${run.phase}' target '${run.name}' as $status in history database") val now = new Timestamp(Clock.systemDefaultZone().instant().toEpochMilli) - withSession{ repository => + withRepository{ repository => repository.setTargetStatus(run.copy(end_ts = Some(now), status=status.upper, error=result.exception.map(_.toString))) } @@ -289,21 +280,21 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @return */ override def findJobs(query:JobQuery, order:Seq[JobOrder]=Seq(), limit:Int=10000, offset:Int=0) : Seq[JobState] = { - withSession { repository => + withRepository { repository => repository.findJobs(query, order, limit, offset) } } override def countJobs(query: JobQuery): Int = { - withSession { repository => + withRepository { repository => repository.countJobs(query) } } override def countJobs(query: JobQuery, grouping: JobColumn): Map[String, Int] = { - withSession { repository => + withRepository { repository => repository.countJobs(query, grouping).toMap } } @@ -317,25 +308,25 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @return */ override def findTargets(query:TargetQuery, order:Seq[TargetOrder]=Seq(), limit:Int=10000, offset:Int=0) : Seq[TargetState] = { - withSession { repository => + withRepository { repository => repository.findTargets(query, 
order, limit, offset) } } override def countTargets(query: TargetQuery): Int = { - withSession { repository => + withRepository { repository => repository.countTargets(query) } } override def countTargets(query: TargetQuery, grouping: TargetColumn): Map[String, Int] = { - withSession { repository => + withRepository { repository => repository.countTargets(query, grouping).toMap } } override def findJobMetrics(jobQuery: JobQuery, groupings: Seq[String]): Seq[MetricSeries] = { - withSession { repository => + withRepository { repository => repository.findMetrics(jobQuery, groupings) } } @@ -362,13 +353,13 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t * @tparam T * @return */ - private def withSession[T](query: JdbcStateRepository => T) : T = { + private def withRepository[T](query: JdbcStateRepository => T) : T = { def retry[T](n:Int)(fn: => T) : T = { try { fn } catch { case e @(_:SQLRecoverableException|_:SQLTransientException) if n > 1 => { - logger.error("Retrying after error while executing SQL: {}", e.getMessage) + logger.warn("Retrying after error while executing SQL: {}", e.getMessage) Thread.sleep(timeout) retry(n - 1)(fn) } @@ -376,42 +367,20 @@ case class JdbcStateStore(connection:JdbcStateStore.Connection, retries:Int=3, t } retry(retries) { - val repository = newRepository() + ensureTables() query(repository) } } private var tablesCreated:Boolean = false + private lazy val repository = new JdbcStateRepository(connection, JdbcUtils.getProfile(connection.driver)) - private def newRepository() : JdbcStateRepository = { - // Get Connection - val derbyPattern = """.*\.derby\..*""".r - val sqlitePattern = """.*\.sqlite\..*""".r - val h2Pattern = """.*\.h2\..*""".r - val mariadbPattern = """.*\.mariadb\..*""".r - val mysqlPattern = """.*\.mysql\..*""".r - val postgresqlPattern = """.*\.postgresql\..*""".r - val sqlserverPattern = """.*\.sqlserver\..*""".r - val profile = connection.driver match { - case derbyPattern() => DerbyProfile - case sqlitePattern() => SQLiteProfile - case h2Pattern() => H2Profile - case mysqlPattern() => MySQLProfile - case mariadbPattern() => MySQLProfile - case postgresqlPattern() => PostgresProfile - case sqlserverPattern() => SQLServerProfile - case _ => throw new UnsupportedOperationException(s"Database with driver ${connection.driver} is not supported") - } - - val repository = new JdbcStateRepository(connection, profile) - + private def ensureTables() : Unit = { // Create Database if not exists if (!tablesCreated) { repository.create() tablesCreated = true } - - repository } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/history/StateStore.scala b/flowman-core/src/main/scala/com/dimajix/flowman/history/StateStore.scala index a91257454..74de90524 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/history/StateStore.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/history/StateStore.scala @@ -34,6 +34,7 @@ import com.dimajix.flowman.model.TargetResult abstract class JobToken abstract class TargetToken + abstract class StateStore { /** * Returns the state of a job, or None if no information is available diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/history/graph.scala b/flowman-core/src/main/scala/com/dimajix/flowman/history/graph.scala index 86e1236c0..3503b3d5e 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/history/graph.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/history/graph.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya 
Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -128,18 +128,19 @@ object Graph { */ def ofGraph(graph:g.Graph) : Graph = { val builder = Graph.builder() - val nodesById = graph.nodes.map { + val nodesById = graph.nodes.flatMap { case target:g.TargetRef => val provides = target.provides.map(r => Resource(r.category, r.name, r.partition)).toSeq val requires = target.requires.map(r => Resource(r.category, r.name, r.partition)).toSeq - target.id -> builder.newTargetNode(target.name, target.kind, provides, requires) + Some(target.id -> builder.newTargetNode(target.name, target.kind, provides, requires)) case mapping:g.MappingRef => val requires = mapping.requires.map(r => Resource(r.category, r.name, r.partition)).toSeq - mapping.id -> builder.newMappingNode(mapping.name, mapping.kind, requires) + Some(mapping.id -> builder.newMappingNode(mapping.name, mapping.kind, requires)) case relation:g.RelationRef => val provides = relation.provides.map(r => Resource(r.category, r.name, r.partition)).toSeq val requires = relation.requires.map(r => Resource(r.category, r.name, r.partition)).toSeq - relation.id -> builder.newRelationNode(relation.name, relation.kind, provides, requires) + Some(relation.id -> builder.newRelationNode(relation.name, relation.kind, provides, requires)) + case _ => None }.toMap val relationsById = graph.nodes.collect { @@ -155,10 +156,10 @@ object Graph { val partitionFields = MapIgnoreCase(relation.partitions.map(p => p.name -> p)) val p = read.partitions.map { case(k,v) => (k -> partitionFields(k).interpolate(v).map(_.toString).toSeq) } builder.addEdge(ReadRelation(in, out, p)) - case input:g.InputMapping => - val in = nodesById(input.input.id).asInstanceOf[MappingNode] - val out = nodesById(input.output.id) - builder.addEdge(InputMapping(in, out, input.pin)) + case map:g.InputMapping => + val in = nodesById(map.mapping.id).asInstanceOf[MappingNode] + val out = nodesById(map.output.id) + builder.addEdge(InputMapping(in, out, map.pin)) case write:g.WriteRelation => val in = nodesById(write.input.id) val out = nodesById(write.output.id).asInstanceOf[RelationNode] diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/BaseDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/BaseDialect.scala index 241e22490..ea6c868a1 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/BaseDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/BaseDialect.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -23,21 +23,26 @@ import java.util.Locale import org.apache.commons.lang3.StringUtils import org.apache.spark.sql.Column -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.jdbc.JdbcType import org.apache.spark.sql.types.StructType -import com.dimajix.common.MapIgnoreCase import com.dimajix.common.SetIgnoreCase import com.dimajix.flowman.catalog.PartitionSpec import com.dimajix.flowman.catalog.TableChange import com.dimajix.flowman.catalog.TableChange.AddColumn +import com.dimajix.flowman.catalog.TableChange.CreateIndex +import com.dimajix.flowman.catalog.TableChange.CreatePrimaryKey import com.dimajix.flowman.catalog.TableChange.DropColumn +import com.dimajix.flowman.catalog.TableChange.DropIndex +import com.dimajix.flowman.catalog.TableChange.DropPrimaryKey import com.dimajix.flowman.catalog.TableChange.UpdateColumnComment import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.InsertClause import com.dimajix.flowman.execution.MergeClause @@ -161,8 +166,8 @@ abstract class BaseDialect extends SqlDialect { * @return */ override def quote(table:TableIdentifier) : String = { - if (table.database.isDefined) - quoteIdentifier(table.database.get) + "." + quoteIdentifier(table.table) + if (table.space.nonEmpty) + table.space.map(quoteIdentifier).mkString(".") + "." + quoteIdentifier(table.table) else quoteIdentifier(table.table) } @@ -203,6 +208,10 @@ abstract class BaseDialect extends SqlDialect { case _:UpdateColumnNullability => true case _:UpdateColumnType => true case _:UpdateColumnComment => true + case _:CreateIndex => true + case _:DropIndex => true + case _:CreatePrimaryKey => true + case _:DropPrimaryKey => true case x:TableChange => throw new UnsupportedOperationException(s"Table change ${x} not supported") } } @@ -228,7 +237,7 @@ class BaseStatements(dialect: SqlDialect) extends SqlStatements { override def createTable(table: TableDefinition): String = { // Column definitions - val columns = table.fields.map { field => + val columns = table.columns.map { field => val name = dialect.quoteIdentifier(field.name) val typ = dialect.getJdbcType(field.ftype).databaseTypeDefinition val nullable = if (field.nullable) "" @@ -336,6 +345,22 @@ class BaseStatements(dialect: SqlDialect) extends SqlStatements { newExpr.sql } + + override def dropPrimaryKey(table: TableIdentifier): String = ??? + + override def addPrimaryKey(table: TableIdentifier, columns: Seq[String]): String = ??? 
+ + override def dropIndex(table: TableIdentifier, indexName: String): String = { + s"DROP INDEX ${dialect.quoteIdentifier(indexName)}" + } + + override def createIndex(table: TableIdentifier, index: TableIndex): String = { + // Column definitions + val columns = index.columns.map(dialect.quoteIdentifier) + val unique = if (index.unique) "UNIQUE" else "" + + s"CREATE $unique INDEX ${dialect.quoteIdentifier(index.name)} ON ${dialect.quote(table)} (${columns.mkString(",")})" + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/DerbyDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/DerbyDialect.scala index 19a6aaac8..23c8f293f 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/DerbyDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/DerbyDialect.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,15 +16,11 @@ package com.dimajix.flowman.jdbc -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.jdbc.JdbcType import com.dimajix.flowman.catalog.TableChange -import com.dimajix.flowman.catalog.TableChange.AddColumn -import com.dimajix.flowman.catalog.TableChange.DropColumn -import com.dimajix.flowman.catalog.TableChange.UpdateColumnComment -import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.types.BooleanType import com.dimajix.flowman.types.ByteType import com.dimajix.flowman.types.DecimalType @@ -45,8 +41,8 @@ object DerbyDialect extends BaseDialect { * @return */ override def quote(table:TableIdentifier) : String = { - if (table.database.isDefined) - table.database.get + "." + table.table + if (table.space.nonEmpty) + table.space.mkString(".") + "." + table.table else table.table } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/HiveDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/HiveDialect.scala index 333edd190..f9a897396 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/HiveDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/HiveDialect.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -20,6 +20,11 @@ package com.dimajix.flowman.jdbc object HiveDialect extends BaseDialect { override def canHandle(url : String): Boolean = url.startsWith("jdbc:hive") + def quote(table:org.apache.spark.sql.catalyst.TableIdentifier): String = { + table.database.map(db => quoteIdentifier(db) + "." + quoteIdentifier(table.table)) + .getOrElse(quoteIdentifier(table.table)) + } + /** * Quotes the identifier. This is used to put quotes around the identifier in case the column * name is a reserved keyword, or in case it contains characters that require quotes (e.g. space). 
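
For orientation, the index DDL newly generated by `BaseStatements.createIndex` above can be driven through the existing `SqlDialects` lookup. This is a rough sketch with made-up table, index and URL names; the exact identifier quoting depends on the concrete dialect.

```scala
import com.dimajix.flowman.catalog.{TableIdentifier, TableIndex}
import com.dimajix.flowman.jdbc.SqlDialects

// Illustrative names only
val dialect = SqlDialects.get("jdbc:sqlserver://localhost;databaseName=test")
val table   = TableIdentifier("sales", Some("dbo"))
val index   = TableIndex("idx_sales_customer", Seq("customer_id", "order_date"), unique = false)

// Produces a statement roughly of the form
//   CREATE INDEX idx_sales_customer ON dbo.sales (customer_id,order_date)
// with dialect-specific quoting applied to all identifiers.
val createSql = dialect.statement.createIndex(table, index)
```
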
diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/JdbcUtils.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/JdbcUtils.scala index 576493b60..d2965bc1a 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/JdbcUtils.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/JdbcUtils.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,6 +17,7 @@ package com.dimajix.flowman.jdbc import java.sql.Connection +import java.sql.DatabaseMetaData import java.sql.PreparedStatement import java.sql.ResultSet import java.sql.ResultSetMetaData @@ -29,19 +30,32 @@ import scala.util.Try import org.apache.spark.sql.Column import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.createConnectionFactory import org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.savePartition import org.apache.spark.sql.jdbc.JdbcDialects import org.slf4j.LoggerFactory +import slick.jdbc.DerbyProfile +import slick.jdbc.H2Profile +import slick.jdbc.JdbcProfile +import slick.jdbc.MySQLProfile +import slick.jdbc.PostgresProfile +import slick.jdbc.SQLServerProfile +import slick.jdbc.SQLiteProfile import com.dimajix.flowman.catalog.TableChange import com.dimajix.flowman.catalog.TableChange.AddColumn +import com.dimajix.flowman.catalog.TableChange.CreateIndex +import com.dimajix.flowman.catalog.TableChange.CreatePrimaryKey import com.dimajix.flowman.catalog.TableChange.DropColumn +import com.dimajix.flowman.catalog.TableChange.DropIndex +import com.dimajix.flowman.catalog.TableChange.DropPrimaryKey import com.dimajix.flowman.catalog.TableChange.UpdateColumnComment import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.MergeClause import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.StructType @@ -71,6 +85,23 @@ object JdbcUtils { factory() } + def withTransaction[T](con:java.sql.Connection)(fn: => T) : T = { + val oldMode = con.getAutoCommit + con.setAutoCommit(false) + try { + val result = fn + con.commit() + result + } catch { + case ex:SQLException => + logger.error(s"SQL transaction failed, rolling back: ${ex.getMessage}") + con.rollback() + throw ex + } finally { + con.setAutoCommit(oldMode) + } + } + def withStatement[T](conn:Connection, options: JDBCOptions)(fn:Statement => T) : T = { val statement = conn.createStatement() try { @@ -132,6 +163,78 @@ object JdbcUtils { } } + /** + * Returns the table definition of a table + * @param conn + * @param table + * @param options + * @return + */ + def getTable(conn: Connection, table:TableIdentifier, options: JDBCOptions) : TableDefinition = { + val meta = conn.getMetaData + val realTable = resolveTable(meta, table) + + val currentSchema = getSchema(conn, table, options) + val pk = getPrimaryKey(meta, realTable) + val idxs = getIndexes(meta, realTable) + // Remove primary key + .filter { idx => + idx.normalize().columns != pk.map(_.toLowerCase(Locale.ROOT)).sorted + } + + 
TableDefinition(table, currentSchema.fields, primaryKey=pk, indexes=idxs) + } + + private def getPrimaryKey(meta: DatabaseMetaData, table:TableIdentifier) : Seq[String] = { + val pkrs = meta.getPrimaryKeys(null, table.database.orNull, table.table) + val pk = mutable.ListBuffer[(Short,String)]() + while(pkrs.next()) { + val col = pkrs.getString(4) + val seq = pkrs.getShort(5) + // val name = pkrs.getString(6) + pk.append((seq,col)) + } + pkrs.close() + pk.sortBy(_._1).map(_._2) + } + + private def getIndexes(meta: DatabaseMetaData, table:TableIdentifier) : Seq[TableIndex] = { + val idxrs = meta.getIndexInfo(null, table.database.orNull, table.table, false, true) + val idxcols = mutable.ListBuffer[(String, String, Boolean)]() + while(idxrs.next()) { + val unique = !idxrs.getBoolean(4) + val name = idxrs.getString(6) // May be null for statistics + val col = idxrs.getString(9) + idxcols.append((name, col, unique)) + } + idxrs.close() + + idxcols.filter(_._1 != null) + .groupBy(_._1).map { case(name,cols) => + TableIndex(name, cols.map(_._2), cols.foldLeft(false)(_ || _._3)) + }.toSeq + } + + /** + * Resolves the table name, even if upper/lower case does not match + * @param conn + * @param table + * @return + */ + private def resolveTable(meta: DatabaseMetaData, table:TableIdentifier) : TableIdentifier = { + val tblrs = meta.getTables(null, table.database.orNull, null, Array("TABLE")) + var name = table.table + val db = table.database + while(tblrs.next()) { + val thisName = tblrs.getString(3) + if (name.toLowerCase(Locale.ROOT) == thisName.toLowerCase(Locale.ROOT)) + name = thisName + } + tblrs.close() + + TableIdentifier(name, db) + } + /** * Returns the schema if the table already exists in the JDBC database. */ @@ -168,7 +271,13 @@ object JdbcUtils { val dialect = SqlDialects.get(options.url) withStatement(conn, dialect.statement.schema(table), options) { statement => - getJdbcSchemaImpl(statement.executeQuery()) + val rs = statement.executeQuery() + try { + getJdbcSchemaImpl(rs) + } + finally { + rs.close() + } } } @@ -215,9 +324,11 @@ object JdbcUtils { */ def createTable(conn:Connection, table:TableDefinition, options: JDBCOptions) : Unit = { val dialect = SqlDialects.get(options.url) - val sql = dialect.statement.createTable(table) + val tableSql = dialect.statement.createTable(table) + val indexSql = table.indexes.map(idx => dialect.statement.createIndex(table.identifier, idx)) withStatement(conn, options) { statement => - statement.executeUpdate(sql) + statement.executeUpdate(tableSql) + indexSql.foreach(statement.executeUpdate) } } @@ -230,6 +341,7 @@ object JdbcUtils { def dropTable(conn:Connection, table:TableIdentifier, options: JDBCOptions) : Unit = { val dialect = SqlDialects.get(options.url) withStatement(conn, options) { statement => + // TODO: Drop indices(?) 
statement.executeUpdate(s"DROP TABLE ${dialect.quote(table)}") } } @@ -248,6 +360,35 @@ object JdbcUtils { } } + /** + * Adds an index to an existing table + * @param conn + * @param table + * @param index + * @param options + */ + def createIndex(conn:Connection, table:TableIdentifier, index:TableIndex, options: JDBCOptions) : Unit = { + val dialect = SqlDialects.get(options.url) + val indexSql = dialect.statement.createIndex(table, index) + withStatement(conn, options) { statement => + statement.executeUpdate(indexSql) + } + } + + /** + * Drops an index from an existing table + * @param conn + * @param indexName + * @param options + */ + def dropIndex(conn:Connection, table:TableIdentifier, indexName:String, options: JDBCOptions) : Unit = { + val dialect = SqlDialects.get(options.url) + val indexSql = dialect.statement.dropIndex(table, indexName) + withStatement(conn, options) { statement => + statement.executeUpdate(indexSql) + } + } + /** * Applies a list of [[TableChange]] to an existing table. Will throw an exception if one of the operations * is not supported or if the table does not exist. @@ -288,6 +429,18 @@ object JdbcUtils { case u:UpdateColumnComment => logger.info(s"Updating comment of column ${u.column} in JDBC table $table") None + case idx:CreateIndex => + logger.info(s"Adding index ${idx.name} to JDBC table $table on columns ${idx.columns.mkString(",")}") + Some(statements.createIndex(table, TableIndex(idx.name, idx.columns, idx.unique))) + case idx:DropIndex => + logger.info(s"Dropping index ${idx.name} from JDBC table $table") + Some(statements.dropIndex(table, idx.name)) + case pk:CreatePrimaryKey => + logger.info(s"Creating primary key for JDBC table $table on columns ${pk.columns.mkString(",")}") + Some(statements.addPrimaryKey(table, pk.columns)) + case pk:DropPrimaryKey => + logger.info(s"Removing primary key from JDBC table $table}") + Some(statements.dropPrimaryKey(table)) case chg:TableChange => throw new SQLException(s"Unsupported table change $chg for JDBC table $table") } @@ -324,4 +477,24 @@ object JdbcUtils { getConnection, quotedTarget, iterator, sourceSchema, insertStmt, batchSize, sparkDialect, isolationLevel, options) } } + + def getProfile(driver:String) : JdbcProfile = { + val derbyPattern = """.*\.derby\..*""".r + val sqlitePattern = """.*\.sqlite\..*""".r + val h2Pattern = """.*\.h2\..*""".r + val mariadbPattern = """.*\.mariadb\..*""".r + val mysqlPattern = """.*\.mysql\..*""".r + val postgresqlPattern = """.*\.postgresql\..*""".r + val sqlserverPattern = """.*\.sqlserver\..*""".r + driver match { + case derbyPattern() => DerbyProfile + case sqlitePattern() => SQLiteProfile + case h2Pattern() => H2Profile + case mysqlPattern() => MySQLProfile + case mariadbPattern() => MySQLProfile + case postgresqlPattern() => PostgresProfile + case sqlserverPattern() => SQLServerProfile + case _ => throw new UnsupportedOperationException(s"Database with driver ${driver} is not supported") + } + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MsSqlServerDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MsSqlServerDialect.scala index 9b63d07cd..0324fdb3f 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MsSqlServerDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MsSqlServerDialect.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file 
except in compliance with the License. @@ -16,15 +16,11 @@ package com.dimajix.flowman.jdbc -import java.sql.SQLFeatureNotSupportedException import java.util.Locale -import org.apache.spark.sql.Column -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.jdbc.JdbcType -import org.apache.spark.sql.types.StructType -import com.dimajix.flowman.execution.MergeClause +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.types.BinaryType import com.dimajix.flowman.types.BooleanType import com.dimajix.flowman.types.FieldType @@ -105,4 +101,8 @@ class MsSqlServerStatements(dialect: BaseDialect) extends BaseStatements(dialect val nullable = if (isNullable) "NULL" else "NOT NULL" s"ALTER TABLE ${dialect.quote(table)} ALTER COLUMN ${dialect.quoteIdentifier(columnName)} $dataType $nullable" } + + override def dropIndex(table: TableIdentifier, indexName: String): String = { + s"DROP INDEX ${dialect.quote(table)}.${dialect.quoteIdentifier(indexName)}" + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MySQLDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MySQLDialect.scala index 0e0cf414c..4581a21fb 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MySQLDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/MySQLDialect.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -19,8 +19,7 @@ package com.dimajix.flowman.jdbc import java.sql.SQLFeatureNotSupportedException import java.sql.Types -import org.apache.spark.sql.catalyst.TableIdentifier - +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.types.FieldType import com.dimajix.flowman.types.LongType import com.dimajix.flowman.types.BooleanType @@ -81,4 +80,8 @@ class MySQLStatements(dialect: BaseDialect) extends BaseStatements(dialect) { override def updateColumnNullability(table: TableIdentifier, columnName: String, dataType:String, isNullable: Boolean): String = { throw new SQLFeatureNotSupportedException(s"UpdateColumnNullability is not supported") } + + override def dropIndex(table: TableIdentifier, indexName: String): String = { + s"DROP INDEX ${dialect.quoteIdentifier(indexName)} ON ${dialect.quote(table)}" + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlDialect.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlDialect.scala index e8d75268f..efb0a8f73 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlDialect.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlDialect.scala @@ -16,11 +16,10 @@ package com.dimajix.flowman.jdbc -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.jdbc.JdbcType -import org.apache.spark.sql.types.DataType import com.dimajix.flowman.catalog.TableChange +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.types.FieldType diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlStatements.scala b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlStatements.scala index 178b91878..c42b6630e 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlStatements.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/jdbc/SqlStatements.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 
2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,9 +17,11 @@ package com.dimajix.flowman.jdbc import org.apache.spark.sql.Column -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.types.StructType +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.MergeClause @@ -57,5 +59,11 @@ abstract class SqlStatements { def updateColumnType(table: TableIdentifier, columnName: String, newDataType: String): String def updateColumnNullability(table: TableIdentifier, columnName: String, dataType: String, isNullable: Boolean): String + def dropPrimaryKey(table: TableIdentifier) : String + def addPrimaryKey(table: TableIdentifier, columns:Seq[String]) : String + + def dropIndex(table: TableIdentifier, indexName: String) : String + def createIndex(table: TableIdentifier, index:TableIndex) : String + def merge(table: TableIdentifier, targetAlias:String, targetSchema:Option[StructType], sourceAlias:String, sourceSchema:StructType, condition:Column, clauses:Seq[MergeClause]) : String } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/ConsoleMetricSink.scala b/flowman-core/src/main/scala/com/dimajix/flowman/metric/ConsoleMetricSink.scala index f90819a21..43fc30c25 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/ConsoleMetricSink.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/metric/ConsoleMetricSink.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -22,7 +22,7 @@ import com.dimajix.flowman.execution.Status class ConsoleMetricSink extends AbstractMetricSink { override def commit(board:MetricBoard, status:Status): Unit = { println("Collected metrics") - board.metrics(catalog(board), status).foreach{ metric => + board.metrics(catalog(board), status).sortBy(_.name).foreach{ metric => val name = metric.name val labels = metric.labels.map(kv => kv._1 + "=" + kv._2) metric match { diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricBoard.scala b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricBoard.scala index 4670a523f..860e7cae5 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricBoard.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricBoard.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019-2020 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
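
The `dropIndex` statement declared in `SqlStatements` above is rendered differently by the dialects touched in this patch (base dialect, MS SQL Server and MySQL). A small illustrative sketch, with hypothetical JDBC URLs and names:

```scala
import com.dimajix.flowman.catalog.TableIdentifier
import com.dimajix.flowman.jdbc.SqlDialects

val table = TableIdentifier("sales", Some("dbo"))

// MySQL statements render: DROP INDEX <index> ON <table>
val mysqlDrop = SqlDialects.get("jdbc:mysql://localhost/test").statement.dropIndex(table, "idx_customer")

// MS SQL Server statements render: DROP INDEX <table>.<index>
val mssqlDrop = SqlDialects.get("jdbc:sqlserver://localhost").statement.dropIndex(table, "idx_customer")

// The BaseDialect fallback renders just: DROP INDEX <index>
```
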
@@ -16,6 +16,8 @@ package com.dimajix.flowman.metric +import scala.util.matching.Regex + import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Status @@ -54,7 +56,8 @@ final case class MetricBoard( selections.flatMap { sel => // Relabeling should happen has late as possible, since some values might be dynamic def relabel(metric:Metric) : Metric = metric match { - case gauge:GaugeMetric => FixedGaugeMetric(sel.name, env.evaluate(labels ++ sel.labels, gauge.labels + ("status" -> status)), gauge.value) + // Remove "project" from gauge.labels + case gauge:GaugeMetric => FixedGaugeMetric(sel.name.getOrElse(gauge.name), env.evaluate(labels ++ sel.labels, gauge.labels - "project" + ("status" -> status)), gauge.value) case _ => throw new IllegalArgumentException(s"Metric of type ${metric.getClass} not supported") } @@ -67,7 +70,7 @@ final case class MetricBoard( /** * A MetricSelection represents a possibly dynamic set of Metrics to be published inside a MetricBoard */ -final case class MetricSelection(name:String, selector:Selector, labels:Map[String,String]) { +final case class MetricSelection(name:Option[String] = None, selector:Selector, labels:Map[String,String] = Map()) { /** * Returns all metrics identified by this selection. This operation may be expensive, since the set of metrics may be * dynamic and change over time @@ -83,8 +86,18 @@ final case class MetricSelection(name:String, selector:Selector, labels:Map[Stri def bundles(implicit catalog:MetricCatalog) : Seq[MetricBundle] = catalog.findBundle(selector) } - +object Selector { + def apply(labels:Map[String,String]) : Selector = { + new Selector(None, labels.map { case(k,v) => k -> v.r } ) + } + def apply(name:String) : Selector = { + new Selector(Some(name.r), Map.empty ) + } + def apply(name:String, labels:Map[String,String]) : Selector = { + new Selector(Some(name.r), labels.map { case(k,v) => k -> v.r } ) + } +} final case class Selector( - name:Option[String] = None, - labels:Map[String,String] = Map() + name:Option[Regex] = None, + labels:Map[String,Regex] = Map() ) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricSystem.scala b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricSystem.scala index 0a150c7f1..70518d529 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricSystem.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MetricSystem.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
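
With the `Selector` rework above, metric selections now match names and labels as regular expressions, and `MetricSelection` carries an optional replacement name. A small sketch of how selections might be declared (metric names and label values are illustrative):

```scala
import com.dimajix.flowman.metric.{MetricSelection, Selector}

// Select every metric whose name matches "target_.*"; since no name is given,
// each selected metric keeps its original name when republished.
val allTargetMetrics = MetricSelection(selector = Selector("target_.*"))

// Select by name and label pattern, and republish under a fixed name with an extra label.
val buildRecords = MetricSelection(
  name     = Some("records_written"),
  selector = Selector("target_records", Map("phase" -> "BUILD|VERIFY")),
  labels   = Map("datasource" -> "crm")
)
```
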
@@ -16,6 +16,11 @@ package com.dimajix.flowman.metric +import scala.util.control.NonFatal +import scala.util.matching.Regex + +import org.slf4j.LoggerFactory + import com.dimajix.common.IdentityHashSet import com.dimajix.common.SynchronizedSet import com.dimajix.flowman.execution.Status @@ -53,6 +58,7 @@ trait MetricCatalog { class MetricSystem extends MetricCatalog { + private val logger = LoggerFactory.getLogger(getClass) private val metricBundles : SynchronizedSet[MetricBundle] = SynchronizedSet(IdentityHashSet()) private val metricBoards : SynchronizedSet[MetricBoard] = SynchronizedSet(IdentityHashSet()) private val metricSinks : SynchronizedSet[MetricSink] = SynchronizedSet(IdentityHashSet()) @@ -81,12 +87,12 @@ class MetricSystem extends MetricCatalog { metricBundles.remove(bundle) } - def getOrCreateBundle[T <: MetricBundle](query:Selector)(creator: => T) : T = { - metricBundles.find(bundle => query.name.forall(_ == bundle.name) && bundle.labels == query.labels) + def getOrCreateBundle[T <: MetricBundle](name:String, labels:Map[String,String])(creator: => T) : T = { + metricBundles.find(bundle => name == bundle.name && bundle.labels == labels) .map(_.asInstanceOf[T]) .getOrElse{ val bundle = creator - if (!query.name.forall(_ == bundle.name) || query.labels != bundle.labels) + if (name != bundle.name || labels != bundle.labels) throw new IllegalArgumentException("Newly created bundle needs to match query") addBundle(bundle) bundle @@ -132,7 +138,15 @@ class MetricSystem extends MetricCatalog { def commitBoard(board:MetricBoard, status:Status) : Unit = { if (!metricBoards.contains(board)) throw new IllegalArgumentException("MetricBoard not registered") - metricSinks.foreach(_.commit(board, status)) + + metricSinks.foreach { sink => + try { + sink.commit(board, status) + } + catch { + case NonFatal(ex) => logger.warn(s"Error while committing metrics to sink: ${ex.getMessage}") + } + } } /** @@ -181,13 +195,13 @@ class MetricSystem extends MetricCatalog { // Matches bundle labels to query. Only existing labels need to match def matchBundle(bundle:MetricBundle) : Boolean = { val labels = bundle.labels - selector.name.forall(_ == bundle.name) && - labels.keySet.intersect(selector.labels.keySet).forall(key => selector.labels(key) == labels(key)) + selector.name.forall(_.unapplySeq(bundle.name).nonEmpty) && + labels.keySet.intersect(selector.labels.keySet).forall(key => selector.labels(key).unapplySeq(labels(key)).nonEmpty) } // Matches metric labels to query. 
All labels need to match - def matchMetric(metric:Metric, query:Map[String,String]) : Boolean = { + def matchMetric(metric:Metric, query:Map[String,Regex]) : Boolean = { val labels = metric.labels - query.forall(kv => labels.get(kv._1).contains(kv._2)) + query.forall(kv => labels.get(kv._1).exists(v => kv._2.unapplySeq(v).nonEmpty)) } // Query a bundle and return all matching metrics within that bundle def queryBundle(bundle:MetricBundle) : Seq[Metric] = { @@ -214,8 +228,8 @@ class MetricSystem extends MetricCatalog { def matchBundle(bundle:MetricBundle) : Boolean = { val labels = bundle.labels - selector.name.forall(_ == bundle.name) && - selector.labels.forall(kv => labels.get(kv._1).contains(kv._2)) + selector.name.forall(_.unapplySeq(bundle.name).nonEmpty) && + selector.labels.forall(kv => labels.get(kv._1).exists(v => kv._2.unapplySeq(v).nonEmpty)) } metricBundles.toSeq diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MultiMetricBundle.scala b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MultiMetricBundle.scala index 47069ad55..1a286d9ee 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/MultiMetricBundle.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/metric/MultiMetricBundle.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -31,12 +31,12 @@ final case class MultiMetricBundle(override val name:String, override val labels bundleMetrics.remove(metric) } - def getOrCreateMetric[T <: Metric](query:Selector)(creator: => T) : T = { - bundleMetrics.find(metric => query.name.forall(_ == metric.name) && metric.labels == query.labels) + def getOrCreateMetric[T <: Metric](name:String, labels:Map[String,String])(creator: => T) : T = { + bundleMetrics.find(metric => name == metric.name && metric.labels == labels) .map(_.asInstanceOf[T]) .getOrElse{ val metric = creator - if (!query.name.forall(_ == metric.name) || query.labels != metric.labels) + if (name != metric.name || labels != metric.labels) throw new IllegalArgumentException("Newly created metric needs to match query") addMetric(metric) metric diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/package.scala b/flowman-core/src/main/scala/com/dimajix/flowman/metric/package.scala index e5a022f42..f0b300aab 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/package.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/metric/package.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019-2021 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
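Because Regex.unapplySeq only succeeds on a full match, a selector name such as "target_.*" matches the metric name "target_records", while a bare "target" would not. A lookup sketch, assuming bundles and metrics have already been registered:

import com.dimajix.flowman.metric.MetricSystem
import com.dimajix.flowman.metric.Selector

val metricSystem = new MetricSystem
// ... bundles and metrics are registered elsewhere ...
val matches = metricSystem.findMetric(Selector("target_.*", Map("phase" -> "BUILD")))
matches.foreach(metric => println(s"${metric.name}: ${metric.labels}"))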
@@ -63,11 +63,11 @@ package object metric { // Create and register bundle val metricName = metadata.category + "_runtime" val bundleLabels = metadata.asMap + ("phase" -> phase.toString) - val bundle = registry.getOrCreateBundle(Selector(Some(metricName), bundleLabels))(MultiMetricBundle(metricName, bundleLabels)) + val bundle = registry.getOrCreateBundle(metricName, bundleLabels)(MultiMetricBundle(metricName, bundleLabels)) // Create and register metric val metricLabels = bundleLabels ++ Map("name" -> metadata.name) ++ metadata.labels - val metric = bundle.getOrCreateMetric(Selector(Some(metricName), metricLabels))(WallTimeMetric(metricName, metricLabels)) + val metric = bundle.getOrCreateMetric(metricName, metricLabels)(WallTimeMetric(metricName, metricLabels)) metric.reset() // Execute function itself, and catch any exception diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Hook.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Hook.scala index de254697d..3ad1a2855 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Hook.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Hook.scala @@ -147,4 +147,5 @@ abstract class BaseHook extends AbstractInstance with Hook { override def finishMeasure(execution:Execution, token:MeasureToken, result:MeasureResult) : Unit = {} override def instantiateMapping(execution: Execution, mapping:Mapping, parent:Option[Token]) : Unit = {} override def describeMapping(execution: Execution, mapping:Mapping, parent:Option[Token]) : Unit = {} + override def describeRelation(execution: Execution, relation:Relation, parent:Option[Token]) : Unit = {} } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Mapping.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Mapping.scala index a783e7bf3..309045e6e 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Mapping.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Mapping.scala @@ -19,6 +19,7 @@ package com.dimajix.flowman.model import org.apache.spark.sql.DataFrame import org.apache.spark.storage.StorageLevel +import com.dimajix.flowman.documentation.MappingDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.NoSuchMappingOutputException @@ -35,7 +36,8 @@ object Mapping { Metadata(context, name, Category.MAPPING, kind), false, false, - StorageLevel.NONE + StorageLevel.NONE, + None ) } } @@ -44,7 +46,8 @@ object Mapping { metadata:Metadata, broadcast:Boolean, checkpoint:Boolean, - cache:StorageLevel + cache:StorageLevel, + documentation:Option[MappingDoc] ) extends Instance.Properties[Properties] { override val namespace : Option[Namespace] = context.namespace override val project : Option[Project] = context.project @@ -70,6 +73,12 @@ trait Mapping extends Instance { */ def identifier : MappingIdentifier + /** + * Returns a (static) documentation of this mapping + * @return + */ + def documentation : Option[MappingDoc] + /** * This method should return true, if the resulting dataframe should be broadcast for map-side joins * @return @@ -99,7 +108,7 @@ trait Mapping extends Instance { * Returns the dependencies (i.e. names of tables in the Dataflow model) * @return */ - def inputs : Seq[MappingOutputIdentifier] + def inputs : Set[MappingOutputIdentifier] /** * Lists all outputs of this mapping. 
Every mapping should have one "main" output, which is the default output @@ -107,7 +116,7 @@ trait Mapping extends Instance { * recommended. * @return */ - def outputs : Seq[String] + def outputs : Set[String] /** * Creates an output identifier for the primary output @@ -169,6 +178,12 @@ abstract class BaseMapping extends AbstractInstance with Mapping { */ override def identifier : MappingIdentifier = instanceProperties.identifier + /** + * Returns a (static) documentation of this mapping + * @return + */ + override def documentation : Option[MappingDoc] = instanceProperties.documentation + /** * This method should return true, if the resulting dataframe should be broadcast for map-side joins * @return @@ -198,7 +213,7 @@ abstract class BaseMapping extends AbstractInstance with Mapping { * Lists all outputs of this mapping. Every mapping should have one "main" output * @return */ - override def outputs : Seq[String] = Seq("main") + override def outputs : Set[String] = Set("main") /** * Creates an output identifier for the primary output @@ -238,7 +253,10 @@ abstract class BaseMapping extends AbstractInstance with Mapping { val results = execute(execution, replacements) // Extract schemas - results.map { case (name,df) => name -> StructType.of(df.schema)} + val schemas = results.map { case (name,df) => name -> StructType.of(df.schema)} + + // Apply documentation + applyDocumentation(schemas) } /** @@ -264,4 +282,24 @@ abstract class BaseMapping extends AbstractInstance with Mapping { linker.input(in.mapping, in.output) ) } + + /** + * Applies optional documentation to the result of a [[describe]] + * @param schemas + * @return + */ + protected def applyDocumentation(schemas:Map[String,StructType]) : Map[String,StructType] = { + val outputDoc = documentation.map(_.outputs.map(o => o.identifier.output -> o).toMap).getOrElse(Map()) + schemas.map { case (output,schema) => + output -> outputDoc.get(output) + .flatMap(_.schema.map(_.enrich(schema))) + .getOrElse(schema) + } + } + + protected def applyDocumentation(output:String, schema:StructType) : StructType = { + documentation.flatMap(_.outputs.find(_.identifier.output == output)) + .flatMap(_.schema.map(_.enrich(schema))) + .getOrElse(schema) + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Metadata.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Metadata.scala index 4b99e475f..513172d5d 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Metadata.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Metadata.scala @@ -40,6 +40,9 @@ final case class Metadata( kind: String, labels: Map[String,String] = Map() ) { + require(name != null) + require(category != null && category.nonEmpty) + require(kind != null) def asMap : Map[String,String] = { Map( "name" -> name, diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Module.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Module.scala index f65298559..b4ace46b7 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Module.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Module.scala @@ -70,10 +70,10 @@ object Module { } private def readFile(file:File) : Module = { - if (file.isDirectory) { + if (file.isDirectory()) { logger.info(s"Reading all module files in directory ${file.toString}") file.list() - .filter(_.isFile) + .filter(_.isFile()) .map(f => loadFile(f)) .foldLeft(Module())((l,r) => l.merge(r)) } diff --git 
a/flowman-core/src/main/scala/com/dimajix/flowman/model/Project.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Project.scala index 21eff1e85..bb921a54c 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Project.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Project.scala @@ -26,9 +26,16 @@ import com.dimajix.flowman.hadoop.File import com.dimajix.flowman.spi.ProjectReader + object Project { private lazy val loader = ServiceLoader.load(classOf[ProjectReader]).iterator().asScala.toSeq + case class Import( + project:String, + job:Option[String] = None, + arguments:Map[String,String] = Map() + ) + class Reader { private val logger = LoggerFactory.getLogger(classOf[Reader]) private var format = "yaml" @@ -46,10 +53,12 @@ object Project { */ def file(file: File): Project = { if (!file.isAbsolute()) { - readFile(file.absolute) + this.file(file.absolute) } else { - readFile(file) + logger.info(s"Reading project from $file") + val spec = reader.file(file) + loadModules(spec, spec.basedir.getOrElse(file)) } } @@ -63,17 +72,9 @@ object Project { if (!file.isAbsolute()) { manifest(file.absolute) } - else if (file.isDirectory) { - logger.info(s"Reading project manifest in directory $file") - manifest(file / "project.yml") - } else { logger.info(s"Reading project manifest from $file") - val project = reader.file(file) - project.copy( - filename = Some(file.absolute), - basedir = Some(file.absolute.parent) - ) + reader.file(file) } } @@ -81,22 +82,6 @@ object Project { reader.string(text) } - private def readFile(file: File): Project = { - if (file.isDirectory) { - logger.info(s"Reading project in directory $file") - this.file(file / "project.yml") - } - else { - logger.info(s"Reading project from $file") - val spec = reader.file(file) - val project = loadModules(spec, file.parent) - project.copy( - filename = Some(file.absolute), - basedir = Some(file.absolute.parent) - ) - } - } - private def loadModules(project: Project, directory: File): Project = { val module = project.modules .map(f => Module.read.file(directory / f)) @@ -138,7 +123,9 @@ final case class Project( config : Map[String,String] = Map(), environment : Map[String,String] = Map(), + imports: Seq[Project.Import] = Seq(), profiles : Map[String,Profile] = Map(), + relations : Map[String,Prototype[Relation]] = Map(), connections : Map[String,Prototype[Connection]] = Map(), mappings : Map[String,Prototype[Mapping]] = Map(), diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Reference.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Reference.scala index 5954784ee..e2ab090fb 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Reference.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Reference.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -19,12 +19,13 @@ package com.dimajix.flowman.model import com.dimajix.flowman.execution.Context -abstract class Reference[T] { +sealed abstract class Reference[T] { val value:T def name:String def identifier:Identifier[T] } + object RelationReference { def apply(context:Context, prototype:Prototype[Relation]) : ValueRelationReference = ValueRelationReference(context, prototype) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Relation.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Relation.scala index e4d5324db..f7a8bedb8 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Relation.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Relation.scala @@ -34,6 +34,7 @@ import com.dimajix.common.MapIgnoreCase import com.dimajix.common.SetIgnoreCase import com.dimajix.common.Trilean import com.dimajix.flowman.config.FlowmanConf +import com.dimajix.flowman.documentation.RelationDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.MergeClause @@ -56,6 +57,7 @@ object Relation { Properties( context, Metadata(context, name, Category.RELATION, kind), + None, None ) } @@ -63,7 +65,8 @@ object Relation { final case class Properties( context:Context, metadata:Metadata, - description:Option[String] + description:Option[String], + documentation:Option[RelationDoc] ) extends Instance.Properties[Properties] { override val namespace : Option[Namespace] = context.namespace @@ -99,6 +102,12 @@ trait Relation extends Instance { */ def description : Option[String] + /** + * Returns a (static) documentation of this relation + * @return + */ + def documentation : Option[RelationDoc] + /** * Returns the list of all resources which will be created by this relation. This method mainly refers to the * CREATE and DESTROY execution phase. 
@@ -149,7 +158,7 @@ trait Relation extends Instance { * @param execution * @return */ - def describe(execution:Execution) : StructType + def describe(execution:Execution, partitions:Map[String,FieldValue] = Map()) : StructType /** * Reads data from the relation, possibly from specific partitions @@ -276,6 +285,12 @@ abstract class BaseRelation extends AbstractInstance with Relation { */ override def description : Option[String] = instanceProperties.description + /** + * Returns a (static) documentation of this relation + * @return + */ + override def documentation : Option[RelationDoc] = instanceProperties.documentation + /** * Returns the schema of the relation, excluding partition columns * @return @@ -305,24 +320,26 @@ abstract class BaseRelation extends AbstractInstance with Relation { * @param execution * @return */ - override def describe(execution:Execution) : StructType = { - val partitions = SetIgnoreCase(this.partitions.map(_.name)) - if (!fields.forall(f => partitions.contains(f.name))) { + override def describe(execution:Execution, partitions:Map[String,FieldValue] = Map()) : StructType = { + val partitionNames = SetIgnoreCase(this.partitions.map(_.name)) + val result = if (!fields.forall(f => partitionNames.contains(f.name))) { // Use given fields if relation contains valid list of fields in addition to the partition columns StructType(fields) } else { // Otherwise let Spark infer the schema - val df = read(execution) + val df = read(execution, partitions) StructType.of(df.schema) } + + applyDocumentation(result) } /** * Creates all known links for building a descriptive graph of the whole data flow * Params: linker - The linker object to use for creating new edges */ - def link(linker:Linker) : Unit = {} + override def link(linker:Linker) : Unit = {} /** * Creates a DataFrameReader which is already configured with the schema @@ -494,6 +511,12 @@ abstract class BaseRelation extends AbstractInstance with Relation { } .getOrElse(df) } + + protected def applyDocumentation(schema:StructType) : StructType = { + documentation + .flatMap(_.schema.map(_.enrich(schema))) + .getOrElse(schema) + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/ResourceIdentifier.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/ResourceIdentifier.scala index 45357eff7..fa3a07555 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/ResourceIdentifier.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/ResourceIdentifier.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
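With the extended describe signature a caller can narrow schema inference to selected partitions, and any schema carried by the relation's documentation is merged into the result. A usage sketch, assuming `relation` and `execution` are in scope, that the partition column is named "processing_date" (hypothetical), and that the usual SingleValue field value type is used:

import com.dimajix.flowman.types.SingleValue

// Infer the schema from a single partition only; RelationDoc information (if any) enriches the result
val schema = relation.describe(execution, Map("processing_date" -> SingleValue("2022-03-01")))
schema.fields.foreach(field => println(field))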
@@ -24,6 +24,7 @@ import scala.annotation.tailrec import org.apache.hadoop.fs.Path +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.hadoop.GlobPattern @@ -36,26 +37,34 @@ object ResourceIdentifier { GlobbingResourceIdentifier("local", new Path(file.toURI.getPath).toString) def ofHiveDatabase(database:String): RegexResourceIdentifier = RegexResourceIdentifier("hiveDatabase", database) + def ofHiveTable(table:TableIdentifier): RegexResourceIdentifier = + ofHiveTable(table.table, table.space.headOption) def ofHiveTable(table:String): RegexResourceIdentifier = RegexResourceIdentifier("hiveTable", table) def ofHiveTable(table:String, database:Option[String]): RegexResourceIdentifier = RegexResourceIdentifier("hiveTable", fqTable(table, database)) + def ofHivePartition(table:TableIdentifier, partition:Map[String,Any]): RegexResourceIdentifier = + ofHivePartition(table.table, table.space.headOption, partition) def ofHivePartition(table:String, partition:Map[String,Any]): RegexResourceIdentifier = RegexResourceIdentifier("hiveTablePartition", table, partition.map { case(k,v) => k -> v.toString }) def ofHivePartition(table:String, database:Option[String], partition:Map[String,Any]): RegexResourceIdentifier = RegexResourceIdentifier("hiveTablePartition", fqTable(table, database), partition.map { case(k,v) => k -> v.toString }) def ofJdbcDatabase(database:String): RegexResourceIdentifier = RegexResourceIdentifier("jdbcDatabase", database) + def ofJdbcTable(table:TableIdentifier): RegexResourceIdentifier = + ofJdbcTable(table.table, table.space.headOption) def ofJdbcTable(table:String, database:Option[String]): RegexResourceIdentifier = RegexResourceIdentifier("jdbcTable", fqTable(table, database)) def ofJdbcQuery(query:String): SimpleResourceIdentifier = SimpleResourceIdentifier("jdbcQuery", "") + def ofJdbcTablePartition(table:TableIdentifier, partition:Map[String,Any]): RegexResourceIdentifier = + ofJdbcTablePartition(table.table, table.space.headOption, partition) def ofJdbcTablePartition(table:String, database:Option[String], partition:Map[String,Any]): RegexResourceIdentifier = RegexResourceIdentifier("jdbcTablePartition", fqTable(table, database), partition.map { case(k,v) => k -> v.toString }) def ofURL(url:URL): RegexResourceIdentifier = RegexResourceIdentifier("url", url.toString) - private def fqTable(table:String, database:Option[String]) : String = database.map(_ + ".").getOrElse("") + table + private def fqTable(table:String, database:Option[String]) : String = database.filter(_.nonEmpty).map(_ + ".").getOrElse("") + table } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Schema.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Schema.scala index 36582bfa8..5818d798d 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Schema.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Schema.scala @@ -86,9 +86,13 @@ trait Schema extends Instance { */ def sparkSchema : org.apache.spark.sql.types.StructType + /** + * Returns a Spark schema useable for Catalog entries. This Schema may include VARCHAR(n) and CHAR(n) entries + * @return + */ def catalogSchema : org.apache.spark.sql.types.StructType - /** + /** * Provides a human readable string representation of the schema */ def printTree() : Unit = { @@ -118,6 +122,10 @@ abstract class BaseSchema extends AbstractInstance with Schema { org.apache.spark.sql.types.StructType(fields.map(_.sparkField)) } + /** + * Returns a Spark schema useable for Catalog entries. 
This Schema may include VARCHAR(n) and CHAR(n) entries + * @return + */ override def catalogSchema : org.apache.spark.sql.types.StructType = { org.apache.spark.sql.types.StructType(fields.map(_.catalogField)) } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/Target.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/Target.scala index 8a8238c9b..afa0d3b62 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/Target.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/Target.scala @@ -16,10 +16,13 @@ package com.dimajix.flowman.model +import java.util.Locale + import org.apache.spark.sql.DataFrame import com.dimajix.common.Trilean import com.dimajix.common.Unknown +import com.dimajix.flowman.documentation.TargetDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.Phase @@ -28,6 +31,25 @@ import com.dimajix.flowman.metric.LongAccumulatorMetric import com.dimajix.flowman.metric.Selector import com.dimajix.spark.sql.functions.count_records + +sealed abstract class VerifyPolicy extends Product with Serializable +object VerifyPolicy { + case object EMPTY_AS_SUCCESS extends VerifyPolicy + case object EMPTY_AS_FAILURE extends VerifyPolicy + case object EMPTY_AS_SUCCESS_WITH_ERRORS extends VerifyPolicy + + def ofString(mode:String) : VerifyPolicy = { + mode.toLowerCase(Locale.ROOT) match { + case "empty_as_success" => VerifyPolicy.EMPTY_AS_SUCCESS + case "empty_as_failure" => VerifyPolicy.EMPTY_AS_FAILURE + case "empty_as_success_with_errors" => VerifyPolicy.EMPTY_AS_SUCCESS_WITH_ERRORS + case _ => throw new IllegalArgumentException(s"Unknown verify policy: '$mode'. " + + "Accepted verify policies are 'empty_as_success', 'empty_as_failure' and 'empty_as_success_with_errors'.") + } + } +} + + /** * * @param namespace @@ -60,7 +82,9 @@ object Target { context, Metadata(context, name, Category.TARGET, kind), Seq(), - Seq() + Seq(), + None, + None ) } } @@ -68,7 +92,9 @@ object Target { context:Context, metadata:Metadata, before: Seq[TargetIdentifier], - after: Seq[TargetIdentifier] + after: Seq[TargetIdentifier], + description:Option[String], + documentation: Option[TargetDoc] ) extends Instance.Properties[Properties] { override val namespace : Option[Namespace] = context.namespace override val project : Option[Project] = context.project @@ -94,6 +120,18 @@ trait Target extends Instance { */ def identifier : TargetIdentifier + /** + * Returns a description of the build target + * @return + */ + def description : Option[String] + + /** + * Returns a (static) documentation of this target + * @return + */ + def documentation : Option[TargetDoc] + /** * Returns an instance representing this target with the context * @return @@ -169,6 +207,20 @@ abstract class BaseTarget extends AbstractInstance with Target { */ override def identifier : TargetIdentifier = instanceProperties.identifier + /** + * Returns a description of the build target + * + * @return + */ + override def description: Option[String] = instanceProperties.description + + /** + * Returns a (static) documentation of this target + * + * @return + */ + override def documentation : Option[TargetDoc] = instanceProperties.documentation + /** * Returns an instance representing this target with the context * @return @@ -359,7 +411,7 @@ abstract class BaseTarget extends AbstractInstance with Target { protected def countRecords(execution:Execution, df:DataFrame, phase:Phase=Phase.BUILD) : DataFrame = { val labels = 
metadata.asMap + ("phase" -> phase.upper) - val counter = execution.metricSystem.findMetric(Selector(Some("target_records"), labels)) + val counter = execution.metricSystem.findMetric(Selector("target_records", labels)) .headOption .map(_.asInstanceOf[LongAccumulatorMetric].counter) .getOrElse { diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/result.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/result.scala index b76fd0f0d..cfb0fde7f 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/result.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/result.scala @@ -302,6 +302,14 @@ final case class TargetResult( override def category : Category = target.category override def kind : String = target.kind override def description: Option[String] = None + + def withoutTime : TargetResult = { + val ts = Instant.ofEpochSecond(0) + copy( + startTime=ts, + endTime=ts + ) + } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/model/templating.scala b/flowman-core/src/main/scala/com/dimajix/flowman/model/velocity.scala similarity index 96% rename from flowman-core/src/main/scala/com/dimajix/flowman/model/templating.scala rename to flowman-core/src/main/scala/com/dimajix/flowman/model/velocity.scala index f2c1d0a18..847d85e02 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/model/templating.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/model/velocity.scala @@ -171,3 +171,11 @@ final case class MeasureResultWrapper(result:MeasureResult) extends ResultWrappe final case class AssertionTestResultWrapper(result:AssertionTestResult) extends ResultWrapper(result) { } + + +final case class ResourceIdentifierWrapper(resource:ResourceIdentifier) { + override def toString: String = resource.category + ":" + resource.name + + def getCategory() : String = resource.category + def getName() : String = resource.name +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/spi/ColumnCheckExecutor.scala b/flowman-core/src/main/scala/com/dimajix/flowman/spi/ColumnCheckExecutor.scala new file mode 100644 index 000000000..61c0e0da3 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/spi/ColumnCheckExecutor.scala @@ -0,0 +1,50 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spi + +import java.util.ServiceLoader + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame + +import com.dimajix.flowman.documentation.ColumnCheck +import com.dimajix.flowman.documentation.CheckResult +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.graph.Graph + + +object ColumnCheckExecutor { + def executors : Seq[ColumnCheckExecutor] = { + val loader = ServiceLoader.load(classOf[ColumnCheckExecutor]) + loader.iterator().asScala.toSeq + } +} + +trait ColumnCheckExecutor { + /** + * Executes a column check + * @param execution - execution to use + * @param context - context that can be used for resource lookups like relations or mappings + * @param df - DataFrame containing the output to check + * @param column - Path of the column to check + * @param test - Test to execute + * @return + */ + def execute(execution: Execution, context:Context, df: DataFrame, column:String, test: ColumnCheck): Option[CheckResult] +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/spi/DocumenterReader.scala b/flowman-core/src/main/scala/com/dimajix/flowman/spi/DocumenterReader.scala new file mode 100644 index 000000000..94ff08658 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/spi/DocumenterReader.scala @@ -0,0 +1,61 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spi + +import java.io.IOException + +import com.dimajix.flowman.documentation.Documenter +import com.dimajix.flowman.hadoop.File +import com.dimajix.flowman.model.Prototype + + +abstract class DocumenterReader { + /** + * Returns the human readable name of the documenter file format + * @return + */ + def name: String + + /** + * Returns the internally used short name of the documenter file format + * @return + */ + def format: String + + /** + * Returns true if a given format is supported by this reader + * @param format + * @return + */ + def supports(format: String): Boolean = this.format == format + + /** + * Loads a [[Documenter]] from the given file + * @param file + * @return + */ + @throws[IOException] + def file(file: File): Prototype[Documenter] + + /** + * Loads a [[Documenter]] from the given String + * @param file + * @return + */ + @throws[IOException] + def string(text: String): Prototype[Documenter] +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/spi/SchemaCheckExecutor.scala b/flowman-core/src/main/scala/com/dimajix/flowman/spi/SchemaCheckExecutor.scala new file mode 100644 index 000000000..f8535d869 --- /dev/null +++ b/flowman-core/src/main/scala/com/dimajix/flowman/spi/SchemaCheckExecutor.scala @@ -0,0 +1,40 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spi + +import java.util.ServiceLoader + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.DataFrame + +import com.dimajix.flowman.documentation.SchemaCheck +import com.dimajix.flowman.documentation.CheckResult +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution + + +object SchemaCheckExecutor { + def executors : Seq[SchemaCheckExecutor] = { + val loader = ServiceLoader.load(classOf[SchemaCheckExecutor]) + loader.iterator().asScala.toSeq + } +} + +trait SchemaCheckExecutor { + def execute(execution: Execution, context:Context, df:DataFrame, test:SchemaCheck) : Option[CheckResult] +} diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/ProjectTransformer.scala b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/ProjectTransformer.scala index 19726a05d..6ad1857d9 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/ProjectTransformer.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/ProjectTransformer.scala @@ -46,7 +46,9 @@ final case class ProjectTransformer( val tree = ColumnTree.ofSchema(df.schema) def col(spec:ProjectTransformer.Column) = { - val input = tree.find(spec.column).get.mkValue() + val input = tree.find(spec.column) + .getOrElse(throw new NoSuchColumnException(spec.column.toString)) + .mkValue() val typed = spec.dtype match { case None => input case Some(ft) => input.cast(ft.sparkType) @@ -71,7 +73,9 @@ final case class ProjectTransformer( val tree = SchemaTree.ofSchema(schema) def col(spec:ProjectTransformer.Column) = { - val input = tree.find(spec.column).get.mkValue() + val input = tree.find(spec.column) + .getOrElse(throw new NoSuchColumnException(spec.column.toString)) + .mkValue() val typed = spec.dtype match { case None => input case Some(ft) => input.copy(ftype = ft) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/SchemaEnforcer.scala b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/SchemaEnforcer.scala index 9a906b83d..7f105db1c 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/SchemaEnforcer.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/SchemaEnforcer.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
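Both check-executor SPIs are discovered through the standard Java ServiceLoader, so a plugin only needs to implement the trait and list the implementation class under META-INF/services/com.dimajix.flowman.spi.ColumnCheckExecutor. A minimal sketch; the actual check evaluation is left out:

import org.apache.spark.sql.DataFrame

import com.dimajix.flowman.documentation.CheckResult
import com.dimajix.flowman.documentation.ColumnCheck
import com.dimajix.flowman.execution.Context
import com.dimajix.flowman.execution.Execution
import com.dimajix.flowman.spi.ColumnCheckExecutor

class NoopColumnCheckExecutor extends ColumnCheckExecutor {
    // Sketch only: report no result for this check; a real executor would inspect the
    // concrete check type and evaluate it against the given DataFrame column
    override def execute(execution: Execution, context: Context, df: DataFrame, column: String, test: ColumnCheck): Option[CheckResult] =
        None
}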
@@ -39,7 +39,7 @@ import com.dimajix.flowman.util.SchemaUtils.coerce import com.dimajix.spark.sql.functions.nullable_struct -sealed abstract class ColumnMismatchStrategy +sealed abstract class ColumnMismatchStrategy extends Product with Serializable object ColumnMismatchStrategy { case object IGNORE extends ColumnMismatchStrategy case object ERROR extends ColumnMismatchStrategy @@ -65,7 +65,7 @@ object ColumnMismatchStrategy { } -sealed abstract class TypeMismatchStrategy +sealed abstract class TypeMismatchStrategy extends Product with Serializable object TypeMismatchStrategy { case object IGNORE extends TypeMismatchStrategy case object ERROR extends TypeMismatchStrategy diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/exceptions.scala b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/exceptions.scala index cf259f34b..961192fd5 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/transforms/exceptions.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/transforms/exceptions.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -18,9 +18,7 @@ package com.dimajix.flowman.transforms class AnalysisException(val message: String,val cause: Option[Throwable] = None) - extends Exception(message, cause.orNull) { - -} + extends IllegalArgumentException(message, cause.orNull) class NoSuchColumnException(column:String) extends AnalysisException(s"Column '$column' not found") diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/types/Field.scala b/flowman-core/src/main/scala/com/dimajix/flowman/types/Field.scala index 3a37ff78a..249290615 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/types/Field.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/types/Field.scala @@ -165,7 +165,8 @@ class Field { val format = this.format.map(", format=" + _).getOrElse("") val default = this.default.map(", default=" + _).getOrElse("") val size = this.size.map(", size=" + _).getOrElse("") - s"Field($name, $ftype, $nullable$format$size$default})" + val desc = this.description.map(", description=\"" + _ + "\"").getOrElse("") + s"Field($name, $ftype, $nullable$format$size$default$desc)" } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaUtils.scala b/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaUtils.scala index ddb96ea1a..76c75219d 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaUtils.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaUtils.scala @@ -30,7 +30,10 @@ object SchemaUtils { * @return */ def normalize(schema:StructType) : StructType = { - com.dimajix.flowman.types.StructType(schema.fields.map(normalize)) + com.dimajix.flowman.types.StructType(normalize(schema.fields)) + } + def normalize(fields:Seq[Field]) : Seq[Field] = { + fields.map(normalize) } private def normalize(field:Field) : Field = { Field(field.name.toLowerCase(Locale.ROOT), normalize(field.ftype), field.nullable, description=field.description) diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaWriter.scala b/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaWriter.scala index 3f4fa72d0..f21a98a63 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaWriter.scala +++ b/flowman-core/src/main/scala/com/dimajix/flowman/types/SchemaWriter.scala @@ -1,5
+1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -59,8 +59,12 @@ class SchemaWriter(fields:Seq[Field]) { // Manually convert string to UTF-8 and use write, since writeUTF apparently would write a BOM val bytes = Charset.forName("UTF-8").encode(schema) val output = file.create(true) - output.write(bytes.array(), bytes.arrayOffset(), bytes.limit()) - output.close() + try { + output.write(bytes.array(), bytes.arrayOffset(), bytes.limit()) + } + finally { + output.close() + } } private var format: String = "" diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableChangeTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableChangeTest.scala index e94ee5e52..1cd47e068 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableChangeTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableChangeTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -20,7 +20,11 @@ import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers import com.dimajix.flowman.catalog.TableChange.AddColumn +import com.dimajix.flowman.catalog.TableChange.CreateIndex +import com.dimajix.flowman.catalog.TableChange.CreatePrimaryKey import com.dimajix.flowman.catalog.TableChange.DropColumn +import com.dimajix.flowman.catalog.TableChange.DropIndex +import com.dimajix.flowman.catalog.TableChange.DropPrimaryKey import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType import com.dimajix.flowman.execution.MigrationPolicy @@ -28,191 +32,263 @@ import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.IntegerType import com.dimajix.flowman.types.LongType import com.dimajix.flowman.types.StringType -import com.dimajix.flowman.types.StructType import com.dimajix.flowman.types.VarcharType class TableChangeTest extends AnyFlatSpec with Matchers { "TableChange.requiresMigration" should "accept same schemas in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), MigrationPolicy.STRICT ) should be (false) } it should "not accept dropped columns in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), - StructType(Seq(Field("f1", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType))), MigrationPolicy.STRICT ) should be (true) } it should "not accept added columns in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType))), - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", 
StringType))), MigrationPolicy.STRICT ) should be (true) } it should "not accept changed data types in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", IntegerType), Field("f2", StringType))), - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", IntegerType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), MigrationPolicy.STRICT ) should be (true) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), - StructType(Seq(Field("f1", IntegerType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", IntegerType), Field("f2", StringType))), MigrationPolicy.STRICT ) should be (true) TableChange.requiresMigration( - StructType(Seq(Field("f1", LongType), Field("f2", StringType))), - StructType(Seq(Field("f1", IntegerType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", IntegerType), Field("f2", StringType))), MigrationPolicy.STRICT ) should be (true) TableChange.requiresMigration( - StructType(Seq(Field("f1", VarcharType(10)), Field("f2", StringType))), - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", VarcharType(10)), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), MigrationPolicy.STRICT ) should be (true) } it should "not accept changed nullability in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, true))), - StructType(Seq(Field("f1", StringType, false))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, true))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, false))), MigrationPolicy.STRICT ) should be (true) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, false))), - StructType(Seq(Field("f1", StringType, true))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, false))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, true))), MigrationPolicy.STRICT ) should be (true) } it should "accept changed comments in strict mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = Some("lala")))), - StructType(Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = Some("lala")))), - StructType(Seq(Field("f1", StringType, description = Some("lolo")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lolo")))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = None))), - StructType(Seq(Field("f1", StringType, description = Some("lolo")))), + TableDefinition(TableIdentifier(""), 
Seq(Field("f1", StringType, description = None))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lolo")))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = Some("lala")))), - StructType(Seq(Field("f1", StringType, description = None))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = None))), MigrationPolicy.RELAXED ) should be (false) } it should "accept same schemas in relaxed mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), - StructType(Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (false) } it should "handle data type in relaxed mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", LongType), Field("f2", StringType))), - StructType(Seq(Field("f1", IntegerType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", IntegerType), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", IntegerType), Field("f2", StringType))), - StructType(Seq(Field("f1", LongType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", IntegerType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (true) } it should "accept removed columns in relaxed mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", LongType), Field("f2", StringType))), - StructType(Seq(Field("f1", LongType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType))), MigrationPolicy.RELAXED ) should be (false) } it should "not accept added columns in relaxed mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", LongType))), - StructType(Seq(Field("f1", LongType), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", LongType), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (true) } it should "accept changed comments in relaxed mode" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = Some("lala")))), - StructType(Seq(Field("F1", StringType, description = Some("lolo")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType, description = Some("lolo")))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = None))), - StructType(Seq(Field("F1", StringType, description = Some("lolo")))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = None))), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType, description = Some("lolo")))), 
MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, description = Some("lala")))), - StructType(Seq(Field("F1", StringType, description = None))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, description = Some("lala")))), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType, description = None))), MigrationPolicy.RELAXED ) should be (false) } it should "handle changed nullability" in { TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, true), Field("f2", StringType))), - StructType(Seq(Field("F1", StringType, false), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, true), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType, false), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (false) TableChange.requiresMigration( - StructType(Seq(Field("f1", StringType, false), Field("f2", StringType))), - StructType(Seq(Field("F1", StringType, true), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType, false), Field("f2", StringType))), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType, true), Field("f2", StringType))), MigrationPolicy.RELAXED ) should be (true) } + it should "handle changed primary key" in { + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + MigrationPolicy.RELAXED + ) should be (false) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType), Field("f2", StringType)), primaryKey=Seq("f2", "f1")), + MigrationPolicy.RELAXED + ) should be (false) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType)), primaryKey=Seq("f1")), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType), Field("f2", StringType)), primaryKey=Seq()), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(Field("f1", StringType), Field("f2", StringType)), primaryKey=Seq()), + TableDefinition(TableIdentifier(""), Seq(Field("F1", StringType), Field("f2", StringType)), primaryKey=Seq("f1", "f2")), + MigrationPolicy.RELAXED + ) should be (true) + } + + it should "handle changed index" in { + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + MigrationPolicy.RELAXED + ) should be (false) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("C1")))), + MigrationPolicy.RELAXED + ) 
should be (false) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("NAME", Seq("C1")))), + MigrationPolicy.RELAXED + ) should be (false) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq()), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("NAME", Seq("C1")))), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq()), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("other", Seq("C1")))), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("C1","c2")))), + MigrationPolicy.RELAXED + ) should be (true) + TableChange.requiresMigration( + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("c2","c1")))), + TableDefinition(TableIdentifier(""), Seq(), indexes=Seq(TableIndex("name", Seq("C1","c2")))), + MigrationPolicy.RELAXED + ) should be (false) + } + "TableChange.migrate" should "work in strict mode" in { - val changes = TableChange.migrate( - StructType(Seq( + val oldTable = TableDefinition(TableIdentifier(""), + Seq( Field("f1", StringType, true), Field("f2", LongType), Field("f3", StringType), Field("f4", StringType), Field("f6", StringType, false) - )), - StructType(Seq( + ) + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( Field("F1", StringType, false), Field("F2", StringType), Field("F3", LongType), Field("F5", StringType), Field("F6", StringType, true) - )), - MigrationPolicy.STRICT + ) ) + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.STRICT) changes should be (Seq( DropColumn("f4"), @@ -225,23 +301,25 @@ class TableChangeTest extends AnyFlatSpec with Matchers { } it should "work in relaxed mode" in { - val changes = TableChange.migrate( - StructType(Seq( + val oldTable = TableDefinition(TableIdentifier(""), + Seq( Field("f1", StringType, true), Field("f2", LongType), Field("f3", StringType), Field("f4", StringType), Field("f6", StringType, false) - )), - StructType(Seq( + ) + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( Field("F1", StringType, false), Field("F2", StringType), Field("F3", LongType), Field("F5", StringType), Field("F6", StringType, true) - )), - MigrationPolicy.RELAXED + ) ) + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) changes should be (Seq( UpdateColumnType("f2", StringType), @@ -249,4 +327,151 @@ class TableChangeTest extends AnyFlatSpec with Matchers { UpdateColumnNullability("f6", true) )) } + + it should "do nothing on unchanged PK" in { + val oldTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq("f1", "f2") + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( + Field("F1", StringType), + Field("F2", LongType), + Field("F3", StringType) + ), + primaryKey = Seq("F2", "f1") + ) 
+ val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq()) + } + + it should "add PK" in { + val oldTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq() + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( + Field("F1", StringType), + Field("F2", LongType), + Field("F3", StringType) + ), + primaryKey = Seq("f1", "f2") + ) + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq( + CreatePrimaryKey(Seq("f1", "f2")) + )) + } + + it should "drop PK" in { + val oldTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq("f1", "f2") + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq() + ) + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq( + DropPrimaryKey() + )) + } + + it should "drop/add PK" in { + val oldTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq("f1", "f2") + ) + val newTable = TableDefinition(TableIdentifier(""), + Seq( + Field("f1", StringType), + Field("f2", LongType), + Field("f3", StringType) + ), + primaryKey = Seq("f2") + ) + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq( + DropPrimaryKey(), + CreatePrimaryKey(Seq("f2")) + )) + } + + it should "do nothing on an unchanged index" in { + val oldTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("name", Seq("col1", "col2"))) + ) + val newTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("NAME", Seq("col2", "COL1"))) + ) + + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq.empty) + } + + it should "add an index" in { + val oldTable = TableDefinition(TableIdentifier(""), + indexes = Seq() + ) + val newTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("NAME", Seq("col2", "COL1"))) + ) + + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq(CreateIndex("NAME", Seq("col2", "COL1"), false))) + } + + it should "drop an index" in { + val oldTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("name", Seq("col1", "col2"))) + ) + val newTable = TableDefinition(TableIdentifier(""), + indexes = Seq() + ) + + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq(DropIndex("name"))) + } + + it should "drop/add an index" in { + val oldTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("name", Seq("col1", "col3"))) + ) + val newTable = TableDefinition(TableIdentifier(""), + indexes = Seq(TableIndex("NAME", Seq("col2", "COL1"))) + ) + + val changes = TableChange.migrate(oldTable, newTable, MigrationPolicy.RELAXED) + + changes should be (Seq(DropIndex("name"), CreateIndex("NAME", Seq("col2", "COL1"), false))) + } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableIdentifierTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableIdentifierTest.scala new file mode 100644 index 000000000..a58d4b5e0 --- 
/dev/null
+++ b/flowman-core/src/test/scala/com/dimajix/flowman/catalog/TableIdentifierTest.scala
@@ -0,0 +1,56 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.dimajix.flowman.catalog
+
+import org.scalatest.flatspec.AnyFlatSpec
+import org.scalatest.matchers.should.Matchers
+
+
+class TableIdentifierTest extends AnyFlatSpec with Matchers {
+    "The TableIdentifier" should "work without a namespace" in {
+        val id = TableIdentifier("some_table")
+        id.toString should be ("`some_table`")
+        id.quotedString should be ("`some_table`")
+        id.unquotedString should be ("some_table")
+        id.database should be (None)
+        id.quotedDatabase should be (None)
+        id.unquotedDatabase should be (None)
+        id.toSpark should be (org.apache.spark.sql.catalyst.TableIdentifier("some_table", None))
+    }
+
+    it should "work with a single namespace" in {
+        val id = TableIdentifier("some_table", Some("db"))
+        id.toString should be ("`db`.`some_table`")
+        id.quotedString should be ("`db`.`some_table`")
+        id.unquotedString should be ("db.some_table")
+        id.database should be (Some("db"))
+        id.quotedDatabase should be (Some("`db`"))
+        id.unquotedDatabase should be (Some("db"))
+        id.toSpark should be (org.apache.spark.sql.catalyst.TableIdentifier("some_table", Some("db")))
+    }
+
+    it should "work with a nested namespace" in {
+        val id = TableIdentifier("some_table", Seq("db","ns"))
+        id.toString should be ("`db`.`ns`.`some_table`")
+        id.quotedString should be ("`db`.`ns`.`some_table`")
+        id.unquotedString should be ("db.ns.some_table")
+        id.database should be (Some("db.ns"))
+        id.quotedDatabase should be (Some("`db`.`ns`"))
+        id.unquotedDatabase should be (Some("db.ns"))
+        id.toSpark should be (org.apache.spark.sql.catalyst.TableIdentifier("some_table", Some("db.ns")))
+    }
+}
diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnCheckTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnCheckTest.scala
new file mode 100644
index 000000000..56484affa
--- /dev/null
+++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnCheckTest.scala
@@ -0,0 +1,255 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package com.dimajix.flowman.documentation + +import org.apache.spark.storage.StorageLevel +import org.scalamock.scalatest.MockFactory +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.spark.testing.LocalSparkSession + + +class ColumnCheckTest extends AnyFlatSpec with Matchers with MockFactory with LocalSparkSession { + "A NotNullColumnCheck" should "be executable" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq((Some(1),2), (None,3))) + + val test = NotNullColumnCheck(None) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_3", test)) + } + + "A UniqueColumnCheck" should "be executable" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,3), + (None,3,4), + (None,3,5) + )) + + val test = UniqueColumnCheck(None) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 values are unique, 0 values are non-unique")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 values are unique, 1 values are non-unique")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("3 values are unique, 0 values are non-unique")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_4", test)) + } + + "A ValuesColumnCheck" should "be executable" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = ValuesColumnCheck(None, values=Seq(1,2)) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + 
result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_4", test)) + } + + it should "use correct data types" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = ValuesColumnCheck(None, values=Seq(1,2)) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_4", test)) + } + + "A RangeColumnCheck" should "be executable" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = RangeColumnCheck(None, lower=1, upper=2) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_4", test)) + } + + it should "use correct data types" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = RangeColumnCheck(None, lower="1.0", upper="2.2") + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + } + + "An ExpressionColumnCheck" should "succeed" in { + val session = Session.builder() + 
.withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = ExpressionColumnCheck(None, expression="_2 > _3") + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + val result4 = testExecutor.execute(execution, context, df, "_4", test) + result4 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + } + + it should "fail" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test = ExpressionColumnCheck(None, expression="_2 < _3") + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("0 records passed, 2 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("0 records passed, 2 records failed")))) + val result4 = testExecutor.execute(execution, context, df, "_4", test) + result4 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("0 records passed, 2 records failed")))) + } + + "A ForeignKeyColumnCheck" should "work" in { + val mappingSpec = mock[Prototype[Mapping]] + val mapping = mock[Mapping] + + val session = Session.builder() + .withSparkSession(spark) + .build() + val project = Project( + name = "project", + mappings = Map("mapping" -> mappingSpec) + ) + val context = session.getContext(project) + val execution = session.execution + + val testExecutor = new DefaultColumnCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),1,1), + (None,2,3) + )) + val otherDf = spark.createDataFrame(Seq( + (1,1), + (2,2) + )) + + (mappingSpec.instantiate _).expects(*).returns(mapping) + (mapping.context _).expects().returns(context) + (mapping.inputs _).expects().returns(Set()) + (mapping.outputs _).expects().atLeastOnce().returns(Set("main")) + (mapping.broadcast _).expects().returns(false) + (mapping.cache _).expects().returns(StorageLevel.NONE) + (mapping.checkpoint _).expects().returns(false) + (mapping.identifier _).expects().returns(MappingIdentifier("project/mapping")) + (mapping.execute _).expects(*,*).returns(Map("main" -> otherDf)) + + val test = ForeignKeyColumnCheck(None, mapping=Some(MappingOutputIdentifier("mapping")), column=Some("_1")) + val result1 = testExecutor.execute(execution, context, df, "_1", test) + result1 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("1 records passed, 0 records failed")))) + val result2 = testExecutor.execute(execution, context, df, "_2", test) + result2 should be (Some(CheckResult(Some(test.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records 
failed")))) + val result3 = testExecutor.execute(execution, context, df, "_3", test) + result3 should be (Some(CheckResult(Some(test.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, "_4", test)) + } +} diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnDocTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnDocTest.scala new file mode 100644 index 000000000..b36db9d37 --- /dev/null +++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ColumnDocTest.scala @@ -0,0 +1,234 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.documentation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.types.DoubleType +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.NullType +import com.dimajix.flowman.types.StringType + + +class ColumnDocTest extends AnyFlatSpec with Matchers { + "A ColumnDoc" should "support merge" in { + val doc1 = ColumnDoc( + None, + Field("col1", NullType, description = Some("Some desc 1")), + children = Seq( + ColumnDoc(None, Field("child1", StringType, description = Some("Some child desc 1"))), + ColumnDoc(None, Field("child2", StringType, description = Some("Some child desc 1"))) + ) + ) + val doc2 = ColumnDoc( + None, + Field("col2", DoubleType, description = Some("Some desc 2")), + children = Seq( + ColumnDoc(None, Field("child2", NullType, description = Some("Some override child desc 1"))), + ColumnDoc(None, Field("child3", NullType, description = Some("Some override child desc 1"))) + ) + ) + + val result = doc1.merge(doc2) + + result should be (ColumnDoc( + None, + Field("col1", DoubleType, description = Some("Some desc 2")), + children = Seq( + ColumnDoc(None, Field("child1", StringType, description = Some("Some child desc 1"))), + ColumnDoc(None, Field("child2", StringType, description = Some("Some override child desc 1"))), + ColumnDoc(None, Field("child3", NullType, description = Some("Some override child desc 1"))) + ) + )) + } + + it should "support reparent" in { + val doc1 = ColumnDoc( + None, + Field("col1", NullType, description = Some("Some desc 1")), + children = Seq( + ColumnDoc(None, Field("child1", StringType, description = Some("Some child desc 1"))), + ColumnDoc(None, Field("child2", StringType, description = Some("Some child desc 1"))) + ) + ) + val parent = SchemaDoc(None) + + val result = doc1.reparent(parent.reference) + + result should be (ColumnDoc( + Some(parent.reference), + Field("col1", NullType, description = Some("Some desc 1")), + children = Seq( + ColumnDoc(Some(ColumnReference(Some(parent.reference), "col1")), 
Field("child1", StringType, description = Some("Some child desc 1"))), + ColumnDoc(Some(ColumnReference(Some(parent.reference), "col1")), Field("child2", StringType, description = Some("Some child desc 1"))) + ) + )) + } + + it should "support sql with no parent" in { + val doc = ColumnDoc( + None, + Field("col1", NullType, description = Some("Some desc 1")) + ) + val doc2 = doc.copy(children = Seq( + ColumnDoc(Some(doc.reference), Field("child1", StringType)), + ColumnDoc(Some(doc.reference), Field("child2", StringType)) + )) + + doc2.reference.sql should be ("col1") + doc2.children(0).reference.sql should be ("col1.child1") + doc2.children(1).reference.sql should be ("col1.child2") + } + + it should "support sql with a relation parent" in { + val doc0 = RelationDoc( + None, + RelationIdentifier("project/rel1") + ) + val doc1 = SchemaDoc( + Some(doc0.reference) + ) + val doc2 = ColumnDoc( + Some(doc1.reference), + Field("col1", NullType, description = Some("Some desc 1")) + ) + val doc2p = doc2.copy(children = Seq( + ColumnDoc(Some(doc2.reference), Field("child1", StringType)), + ColumnDoc(Some(doc2.reference), Field("child2", StringType)) + )) + val doc1p = doc1.copy( + columns = Seq(doc2p) + ) + val doc0p = doc0.copy( + schema = Some(doc1p) + ) + + doc0p.schema.get.columns(0).reference.sql should be ("rel1.col1") + doc0p.schema.get.columns(0).children(0).reference.sql should be ("rel1.col1.child1") + doc0p.schema.get.columns(0).children(1).reference.sql should be ("rel1.col1.child2") + } + + it should "support sql with a relation parent and a project" in { + val doc0 = ProjectDoc("project") + val doc1 = RelationDoc( + Some(doc0.reference), + RelationIdentifier("project/rel1") + ) + val doc2 = SchemaDoc( + Some(doc1.reference) + ) + val doc3 = ColumnDoc( + Some(doc2.reference), + Field("col1", NullType, description = Some("Some desc 1")) + ) + val doc3p = doc3.copy(children = Seq( + ColumnDoc(Some(doc3.reference), Field("child1", StringType)), + ColumnDoc(Some(doc3.reference), Field("child2", StringType)) + )) + val doc2p = doc2.copy( + columns = Seq(doc3p) + ) + val doc1p = doc1.copy( + schema = Some(doc2p) + ) + val doc0p = doc0.copy( + relations = Seq(doc1p) + ) + + doc0p.relations(0).schema.get.columns(0).reference.sql should be ("project/rel1.col1") + doc0p.relations(0).schema.get.columns(0).children(0).reference.sql should be ("project/rel1.col1.child1") + doc0p.relations(0).schema.get.columns(0).children(1).reference.sql should be ("project/rel1.col1.child2") + } + + it should "support sql with a mapping parent and a no project" in { + val doc1 = MappingDoc( + None, + MappingIdentifier("project/map1") + ) + val doc2 = MappingOutputDoc( + Some(doc1.reference), + MappingOutputIdentifier("project/map1:lala") + ) + val doc3 = SchemaDoc( + Some(doc2.reference) + ) + val doc4 = ColumnDoc( + Some(doc3.reference), + Field("col1", NullType, description = Some("Some desc 1")) + ) + val doc4p = doc4.copy(children = Seq( + ColumnDoc(Some(doc4.reference), Field("child1", StringType)), + ColumnDoc(Some(doc4.reference), Field("child2", StringType)) + )) + val doc3p = doc3.copy( + columns = Seq(doc4p) + ) + val doc2p = doc2.copy( + schema = Some(doc3p) + ) + val doc1p = doc1.copy( + outputs = Seq(doc2p) + ) + + doc1p.outputs(0).schema.get.columns(0).reference.sql should be ("[map1:lala].col1") + doc1p.outputs(0).schema.get.columns(0).children(0).reference.sql should be ("[map1:lala].col1.child1") + doc1p.outputs(0).schema.get.columns(0).children(1).reference.sql should be 
("[map1:lala].col1.child2") + } + + it should "support sql with a mapping parent and a project" in { + val doc0 = ProjectDoc("project") + val doc1 = MappingDoc( + Some(doc0.reference), + MappingIdentifier("project/map1") + ) + val doc2 = MappingOutputDoc( + Some(doc1.reference), + MappingOutputIdentifier("project/map1:lala") + ) + val doc3 = SchemaDoc( + Some(doc2.reference) + ) + val doc4 = ColumnDoc( + Some(doc3.reference), + Field("col1", NullType, description = Some("Some desc 1")) + ) + val doc4p = doc4.copy(children = Seq( + ColumnDoc(Some(doc4.reference), Field("child1", StringType)), + ColumnDoc(Some(doc4.reference), Field("child2", StringType)) + )) + val doc3p = doc3.copy( + columns = Seq(doc4p) + ) + val doc2p = doc2.copy( + schema = Some(doc3p) + ) + val doc1p = doc1.copy( + outputs = Seq(doc2p) + ) + val doc0p = doc0.copy( + mappings = Seq(doc1p) + ) + + doc0p.mappings(0).outputs(0).schema.get.columns(0).reference.sql should be ("project/[map1:lala].col1") + doc0p.mappings(0).outputs(0).schema.get.columns(0).children(0).reference.sql should be ("project/[map1:lala].col1.child1") + doc0p.mappings(0).outputs(0).schema.get.columns(0).children(1).reference.sql should be ("project/[map1:lala].col1.child2") + } +} diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/MappingCollectorTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/MappingCollectorTest.scala new file mode 100644 index 000000000..014f2df21 --- /dev/null +++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/MappingCollectorTest.scala @@ -0,0 +1,122 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+
+package com.dimajix.flowman.documentation
+
+import org.scalamock.scalatest.MockFactory
+import org.scalatest.flatspec.AnyFlatSpec
+import org.scalatest.matchers.should.Matchers
+
+import com.dimajix.flowman.execution.Execution
+import com.dimajix.flowman.execution.Phase
+import com.dimajix.flowman.execution.Session
+import com.dimajix.flowman.graph.Graph
+import com.dimajix.flowman.graph.Linker
+import com.dimajix.flowman.model.Mapping
+import com.dimajix.flowman.model.MappingIdentifier
+import com.dimajix.flowman.model.MappingOutputIdentifier
+import com.dimajix.flowman.model.Project
+import com.dimajix.flowman.model.Prototype
+import com.dimajix.flowman.model.Relation
+import com.dimajix.flowman.model.RelationIdentifier
+import com.dimajix.flowman.model.Target
+import com.dimajix.flowman.types.SingleValue
+import com.dimajix.flowman.types.StructType
+
+
+class MappingCollectorTest extends AnyFlatSpec with Matchers with MockFactory {
+    "MappingCollector.collect" should "work" in {
+        val mapping1 = mock[Mapping]
+        val mappingTemplate1 = mock[Prototype[Mapping]]
+        val mapping2 = mock[Mapping]
+        val mappingTemplate2 = mock[Prototype[Mapping]]
+        val sourceRelation = mock[Relation]
+        val sourceRelationTemplate = mock[Prototype[Relation]]
+
+        val project = Project(
+            name = "project",
+            mappings = Map(
+                "m1" -> mappingTemplate1,
+                "m2" -> mappingTemplate2
+            ),
+            relations = Map(
+                "src" -> sourceRelationTemplate
+            )
+        )
+        val session = Session.builder().disableSpark().build()
+        val context = session.getContext(project)
+        val execution = session.execution
+
+        (mappingTemplate1.instantiate _).expects(context).returns(mapping1)
+        (mapping1.context _).expects().returns(context)
+        (mapping1.outputs _).expects().returns(Set("main"))
+        (mapping1.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.input(MappingIdentifier("m2"), "main")))
+
+        (mappingTemplate2.instantiate _).expects(context).returns(mapping2)
+        (mapping2.context _).expects().returns(context)
+        (mapping2.outputs _).expects().returns(Set("main"))
+        (mapping2.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.read(RelationIdentifier("src"), Map("pcol"-> SingleValue("part1")))))
+
+        (sourceRelationTemplate.instantiate _).expects(context).returns(sourceRelation)
+        (sourceRelation.context _).expects().returns(context)
+        (sourceRelation.link _).expects(*).returns(Unit)
+
+        val graph = Graph.ofProject(session, project, Phase.BUILD)
+
+        (mapping1.identifier _).expects().atLeastOnce().returns(MappingIdentifier("project/m1"))
+        (mapping1.inputs _).expects().returns(Set(MappingOutputIdentifier("project/m2")))
+        (mapping1.describe: (Execution,Map[MappingOutputIdentifier,StructType]) => Map[String,StructType] ).expects(*,*).returns(Map("main" -> StructType(Seq())))
+        (mapping1.documentation _).expects().returns(None)
+        (mapping1.context _).expects().returns(context)
+        (mapping2.identifier _).expects().atLeastOnce().returns(MappingIdentifier("project/m2"))
+        (mapping2.inputs _).expects().returns(Set())
+        (mapping2.describe: (Execution,Map[MappingOutputIdentifier,StructType]) => Map[String,StructType] ).expects(*,*).returns(Map("main" -> StructType(Seq())))
+        (mapping2.documentation _).expects().returns(None)
+
+        val collector = new MappingCollector()
+        val projectDoc = collector.collect(execution, graph, ProjectDoc(project.name))
+
+        val mapping1Doc = projectDoc.mappings.find(_.identifier == RelationIdentifier("project/m1"))
+        val mapping2Doc = projectDoc.mappings.find(_.identifier == RelationIdentifier("project/m2"))
+
+        mapping1Doc should be (Some(MappingDoc(
+            parent = Some(ProjectReference("project")),
+            identifier = MappingIdentifier("project/m1"),
+            inputs = Seq(MappingOutputReference(Some(MappingReference(Some(ProjectReference("project")), "m2")), "main")),
+            outputs = Seq(
+                MappingOutputDoc(
+                    parent = Some(MappingReference(Some(ProjectReference("project")), "m1")),
+                    identifier = MappingOutputIdentifier("project/m1:main"),
+                    schema = Some(SchemaDoc(
+                        parent = Some(MappingOutputReference(Some(MappingReference(Some(ProjectReference("project")), "m1")), "main"))
+                    ))
+                ))
+        )))
+        mapping2Doc should be (Some(MappingDoc(
+            parent = Some(ProjectReference("project")),
+            identifier = MappingIdentifier("project/m2"),
+            inputs = Seq(),
+            outputs = Seq(
+                MappingOutputDoc(
+                    parent = Some(MappingReference(Some(ProjectReference("project")), "m2")),
+                    identifier = MappingOutputIdentifier("project/m2:main"),
+                    schema = Some(SchemaDoc(
+                        parent = Some(MappingOutputReference(Some(MappingReference(Some(ProjectReference("project")), "m2")), "main"))
+                    ))
+                ))
+        )))
+    }
+}
diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ProjectDocTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ProjectDocTest.scala
new file mode 100644
index 000000000..8fa324fd6
--- /dev/null
+++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/ProjectDocTest.scala
@@ -0,0 +1,56 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.dimajix.flowman.documentation
+
+import org.scalatest.flatspec.AnyFlatSpec
+import org.scalatest.matchers.should.Matchers
+
+import com.dimajix.flowman.model.MappingIdentifier
+import com.dimajix.flowman.model.MappingOutputIdentifier
+
+
+class ProjectDocTest extends AnyFlatSpec with Matchers {
+    "A ProjectDoc" should "support resolving" in {
+        val project = ProjectDoc(
+            name = "project"
+        )
+        val projectRef = project.reference
+        val mapping = MappingDoc(
+            parent = Some(projectRef),
+            identifier = MappingIdentifier("project/m1")
+        )
+        val mappingRef = mapping.reference
+        val output = MappingOutputDoc(
+            parent = Some(mappingRef),
+            identifier = MappingOutputIdentifier("project/m1:main")
+        )
+        val outputRef = output.reference
+        val schema = SchemaDoc(
+            parent = Some(outputRef)
+        )
+        val schemaRef = schema.reference
+
+        val finalOutput = output.copy(schema = Some(schema))
+        val finalMapping = mapping.copy(outputs = Seq(finalOutput))
+        val finalProject = project.copy(mappings = Seq(finalMapping))
+
+        finalProject.resolve(projectRef) should be (Some(finalProject))
+        finalProject.resolve(mappingRef) should be (Some(finalMapping))
+        finalProject.resolve(outputRef) should be (Some(finalOutput))
+        finalProject.resolve(schemaRef) should be (Some(schema))
+    }
+}
diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/RelationCollectorTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/RelationCollectorTest.scala
new file mode 100644
index 000000000..fb7cd4c14
--- /dev/null
+++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/RelationCollectorTest.scala
@@ -0,0 +1,144 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package com.dimajix.flowman.documentation + +import org.scalamock.scalatest.MockFactory +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.graph.Graph +import com.dimajix.flowman.graph.Linker +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.Target +import com.dimajix.flowman.types.SingleValue +import com.dimajix.flowman.types.StructType + + +class RelationCollectorTest extends AnyFlatSpec with Matchers with MockFactory { + "RelationCollector.collect" should "work" in { + val mapping1 = mock[Mapping] + val mappingTemplate1 = mock[Prototype[Mapping]] + val mapping2 = mock[Mapping] + val mappingTemplate2 = mock[Prototype[Mapping]] + val sourceRelation = mock[Relation] + val sourceRelationTemplate = mock[Prototype[Relation]] + val targetRelation = mock[Relation] + val targetRelationTemplate = mock[Prototype[Relation]] + val target = mock[Target] + val targetTemplate = mock[Prototype[Target]] + + val project = Project( + name = "project", + mappings = Map( + "m1" -> mappingTemplate1, + "m2" -> mappingTemplate2 + ), + targets = Map( + "t" -> targetTemplate + ), + relations = Map( + "src" -> sourceRelationTemplate, + "tgt" -> targetRelationTemplate + ) + ) + val session = Session.builder().disableSpark().build() + val context = session.getContext(project) + val execution = session.execution + + (mappingTemplate1.instantiate _).expects(context).returns(mapping1) + (mapping1.context _).expects().returns(context) + (mapping1.outputs _).expects().returns(Set("main")) + (mapping1.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.input(MappingIdentifier("m2"), "main"))) + + (mappingTemplate2.instantiate _).expects(context).returns(mapping2) + (mapping2.context _).expects().returns(context) + (mapping2.outputs _).expects().returns(Set("main")) + (mapping2.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.read(RelationIdentifier("src"), Map("pcol"-> SingleValue("part1"))))) + + (sourceRelationTemplate.instantiate _).expects(context).returns(sourceRelation) + (sourceRelation.context _).expects().returns(context) + (sourceRelation.link _).expects(*).returns(Unit) + + (targetRelationTemplate.instantiate _).expects(context).returns(targetRelation) + (targetRelation.context _).expects().returns(context) + (targetRelation.link _).expects(*).returns(Unit) + + (targetTemplate.instantiate _).expects(context).returns(target) + (target.context _).expects().returns(context) + (target.link _).expects(*,*).onCall((l:Linker, _:Phase) => Some(1).foreach { _ => + l.input(MappingIdentifier("m1"), "main") + l.write(RelationIdentifier("tgt"), Map("outcol"-> SingleValue("part1"))) + }) + + val graph = Graph.ofProject(session, project, Phase.BUILD) + + (mapping1.identifier _).expects().atLeastOnce().returns(MappingIdentifier("project/m1")) + //(mapping2.identifier _).expects().atLeastOnce().returns(MappingIdentifier("project/m2")) + (mapping1.requires _).expects().returns(Set()) + (mapping2.requires _).expects().returns(Set()) + + (sourceRelation.identifier _).expects().atLeastOnce().returns(RelationIdentifier("project/src")) + (sourceRelation.description 
_).expects().atLeastOnce().returns(Some("source relation")) + (sourceRelation.documentation _).expects().returns(None) + (sourceRelation.provides _).expects().returns(Set()) + (sourceRelation.requires _).expects().returns(Set()) + (sourceRelation.schema _).expects().returns(None) + (sourceRelation.describe _).expects(*,Map("pcol"-> SingleValue("part1"))).returns(StructType(Seq())) + + (targetRelation.identifier _).expects().atLeastOnce().returns(RelationIdentifier("project/tgt")) + (targetRelation.description _).expects().atLeastOnce().returns(Some("target relation")) + (targetRelation.documentation _).expects().returns(None) + (targetRelation.provides _).expects().returns(Set()) + (targetRelation.requires _).expects().returns(Set()) + (targetRelation.schema _).expects().returns(None) + (targetRelation.describe _).expects(*,Map("outcol"-> SingleValue("part1"))).returns(StructType(Seq())) + + val collector = new RelationCollector() + val projectDoc = collector.collect(execution, graph, ProjectDoc(project.name)) + + val sourceRelationDoc = projectDoc.relations.find(_.identifier == RelationIdentifier("project/src")) + val targetRelationDoc = projectDoc.relations.find(_.identifier == RelationIdentifier("project/tgt")) + + sourceRelationDoc should be (Some(RelationDoc( + parent = Some(ProjectReference("project")), + identifier = RelationIdentifier("project/src"), + description = Some("source relation"), + schema = Some(SchemaDoc( + parent = Some(RelationReference(Some(ProjectReference("project")), "src")) + )), + partitions = Map("pcol" -> SingleValue("part1")) + ))) + + targetRelationDoc should be (Some(RelationDoc( + parent = Some(ProjectReference("project")), + identifier = RelationIdentifier("project/tgt"), + description = Some("target relation"), + schema = Some(SchemaDoc( + parent = Some(RelationReference(Some(ProjectReference("project")), "tgt")) + )), + inputs = Seq(MappingOutputReference(Some(MappingReference(Some(ProjectReference("project")), "m1")), "main")), + partitions = Map("outcol" -> SingleValue("part1")) + ))) + } +} diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/documentation/SchemaCheckTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/SchemaCheckTest.scala new file mode 100644 index 000000000..d9663f438 --- /dev/null +++ b/flowman-core/src/test/scala/com/dimajix/flowman/documentation/SchemaCheckTest.scala @@ -0,0 +1,129 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.documentation + +import org.apache.spark.storage.StorageLevel +import org.scalamock.scalatest.MockFactory +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.spark.testing.LocalSparkSession + + +class SchemaCheckTest extends AnyFlatSpec with Matchers with MockFactory with LocalSparkSession { + "A PrimaryKeySchemaCheck" should "be executable" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultSchemaCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,3), + (None,3,4), + (None,3,5) + )) + + val test1 = PrimaryKeySchemaCheck(None, columns=Seq("_1","_3")) + val result1 = testExecutor.execute(execution, context, df, test1) + result1 should be (Some(CheckResult(Some(test1.reference), CheckStatus.SUCCESS, description=Some("3 keys are unique, 0 keys are non-unique")))) + + val test2 = PrimaryKeySchemaCheck(None, columns=Seq("_1","_2")) + val result2 = testExecutor.execute(execution, context, df, test2) + result2 should be (Some(CheckResult(Some(test1.reference), CheckStatus.FAILED, description=Some("1 keys are unique, 1 keys are non-unique")))) + } + + "An ExpressionSchemaCheck" should "work" in { + val session = Session.builder() + .withSparkSession(spark) + .build() + val execution = session.execution + val context = session.context + val testExecutor = new DefaultSchemaCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),2,1), + (None,3,2) + )) + + val test1 = ExpressionSchemaCheck(None, expression="_2 > _3") + val result1 = testExecutor.execute(execution, context, df, test1) + result1 should be (Some(CheckResult(Some(test1.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + + val test2 = ExpressionSchemaCheck(None, expression="_2 < _3") + val result2 = testExecutor.execute(execution, context, df, test2) + result2 should be (Some(CheckResult(Some(test1.reference), CheckStatus.FAILED, description=Some("0 records passed, 2 records failed")))) + } + + "A ForeignKeySchemaCheck" should "work" in { + val mappingSpec = mock[Prototype[Mapping]] + val mapping = mock[Mapping] + + val session = Session.builder() + .withSparkSession(spark) + .build() + val project = Project( + name = "project", + mappings = Map("mapping" -> mappingSpec) + ) + val context = session.getContext(project) + val execution = session.execution + + val testExecutor = new DefaultSchemaCheckExecutor + + val df = spark.createDataFrame(Seq( + (Some(1),1,1), + (None,2,3) + )) + val otherDf = spark.createDataFrame(Seq( + (1,1), + (2,2) + )) + + (mappingSpec.instantiate _).expects(*).returns(mapping) + (mapping.context _).expects().returns(context) + (mapping.inputs _).expects().returns(Set()) + (mapping.outputs _).expects().atLeastOnce().returns(Set("main")) + (mapping.broadcast _).expects().returns(false) + (mapping.cache _).expects().returns(StorageLevel.NONE) + (mapping.checkpoint _).expects().returns(false) + (mapping.identifier _).expects().returns(MappingIdentifier("project/mapping")) + (mapping.execute _).expects(*,*).returns(Map("main" -> otherDf)) + + val test1 = 
ForeignKeySchemaCheck(None, mapping=Some(MappingOutputIdentifier("mapping")), columns=Seq("_1")) + val result1 = testExecutor.execute(execution, context, df, test1) + result1 should be (Some(CheckResult(Some(test1.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + + val test2 = ForeignKeySchemaCheck(None, mapping=Some(MappingOutputIdentifier("mapping")), columns=Seq("_3"), references=Seq("_2")) + val result2 = testExecutor.execute(execution, context, df, test2) + result2 should be (Some(CheckResult(Some(test1.reference), CheckStatus.FAILED, description=Some("1 records passed, 1 records failed")))) + + val test3 = ForeignKeySchemaCheck(None, mapping=Some(MappingOutputIdentifier("mapping")), columns=Seq("_2")) + val result3 = testExecutor.execute(execution, context, df, test3) + result3 should be (Some(CheckResult(Some(test3.reference), CheckStatus.SUCCESS, description=Some("2 records passed, 0 records failed")))) + + val test4 = ForeignKeySchemaCheck(None, mapping=Some(MappingOutputIdentifier("mapping")), columns=Seq("_2"), references=Seq("_3")) + an[Exception] should be thrownBy(testExecutor.execute(execution, context, df, test4)) + } +} diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/execution/MappingUtilsTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/execution/MappingUtilsTest.scala index 00f6ab993..f8873d0be 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/execution/MappingUtilsTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/execution/MappingUtilsTest.scala @@ -35,7 +35,7 @@ object MappingUtilsTest { case class DummyMapping( override val context: Context, override val name: String, - override val inputs: Seq[MappingOutputIdentifier], + override val inputs: Set[MappingOutputIdentifier], override val requires: Set[ResourceIdentifier] ) extends BaseMapping { protected override def instanceProperties: Mapping.Properties = Mapping.Properties(context, name) @@ -47,7 +47,7 @@ object MappingUtilsTest { inputs: Seq[MappingOutputIdentifier], requires: Set[ResourceIdentifier] ) extends Prototype[Mapping] { - override def instantiate(context: Context): Mapping = DummyMapping(context, name, inputs, requires) + override def instantiate(context: Context): Mapping = DummyMapping(context, name, inputs.toSet, requires) } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootContextTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootContextTest.scala index 4c49dd655..0c3aa98f9 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootContextTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootContextTest.scala @@ -22,6 +22,7 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.flowman.model.Connection import com.dimajix.flowman.model.ConnectionIdentifier +import com.dimajix.flowman.model.Job import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingIdentifier import com.dimajix.flowman.model.Namespace @@ -30,6 +31,7 @@ import com.dimajix.flowman.model.Project import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.types.StringType class RootContextTest extends AnyFlatSpec with Matchers with MockFactory { @@ -238,4 +240,52 @@ class RootContextTest extends AnyFlatSpec with Matchers with MockFactory { rootContext.getRelation(RelationIdentifier("my_project/m2")) should be 
(overrideRelation) rootContext.getRelation(RelationIdentifier("my_project/m2"), false) should be (projectRelation2) } + + it should "support importing projects" in { + val session = Session.builder() + .disableSpark() + .build() + val rootContext = RootContext.builder(session.context) + .build() + + val project1 = Project( + name = "project1", + imports = Seq( + Project.Import(project="project2"), + Project.Import(project="project3"), + Project.Import(project="project4", job=Some("job"), arguments=Map("arg1" -> "val1")) + ) + ) + val project1Ctx = rootContext.getProjectContext(project1) + project1Ctx.evaluate("$project") should be ("project1") + + val project2 = Project( + name = "project2", + environment = Map("env1" -> "val1") + ) + val project2Ctx = rootContext.getProjectContext(project2) + project2Ctx.evaluate("$project") should be ("project2") + project2Ctx.evaluate("$env1") should be ("val1") + + val project3JobGen = mock[Prototype[Job]] + (project3JobGen.instantiate _).expects(*).onCall((ctx:Context) => Job.builder(ctx).setName("main").addEnvironment("jobenv", "jobval").build()) + val project3 = Project( + name = "project3", + jobs = Map("main" -> project3JobGen) + ) + val project3Ctx = rootContext.getProjectContext(project3) + project3Ctx.evaluate("$project") should be ("project3") + project3Ctx.evaluate("$jobenv") should be ("jobval") + + val project4JobGen = mock[Prototype[Job]] + (project4JobGen.instantiate _).expects(*).onCall((ctx:Context) => Job.builder(ctx).setName("job").addParameter("arg1", StringType).addParameter("arg2", StringType, value=Some("default")).build()) + val project4 = Project( + name = "project4", + jobs = Map("job" -> project4JobGen) + ) + val project4Ctx = rootContext.getProjectContext(project4) + project4Ctx.evaluate("$project") should be ("project4") + project4Ctx.evaluate("$arg1") should be ("val1") + project4Ctx.evaluate("$arg2") should be ("default") + } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootExecutionTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootExecutionTest.scala index f2ceb1127..e0e47a5e9 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootExecutionTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RootExecutionTest.scala @@ -39,7 +39,7 @@ import com.dimajix.spark.testing.LocalSparkSession object RootExecutionTest { case class TestMapping( instanceProperties: Mapping.Properties, - inputs:Seq[MappingOutputIdentifier] + inputs:Set[MappingOutputIdentifier] ) extends BaseMapping { override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = { Map("main" -> input.values.head) @@ -49,7 +49,7 @@ object RootExecutionTest { case class RangeMapping( instanceProperties: Mapping.Properties ) extends BaseMapping { - override def inputs: Seq[MappingOutputIdentifier] = Seq() + override def inputs: Set[MappingOutputIdentifier] = Set.empty override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = { val spark = execution.spark @@ -61,7 +61,7 @@ object RootExecutionTest { override def instantiate(context: Context): Mapping = { TestMapping( Mapping.Properties(context, name), - inputs.map(i => MappingOutputIdentifier(i)) + inputs.map(i => MappingOutputIdentifier(i)).toSet ) } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerJobTest.scala 
b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerJobTest.scala index 307f0d6cd..5683858ec 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerJobTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerJobTest.scala @@ -403,7 +403,7 @@ class RunnerJobTest extends AnyFlatSpec with MockFactory with Matchers with Loca name = "default", targets = Map( "t0" -> genTarget("t0", true, Yes, produces=Set(ResourceIdentifier.ofHivePartition("some_table", Map("p1" -> "123")))), - "t1" -> genTarget("t1", true, No, requires=Set(ResourceIdentifier.ofHivePartition("some_table", Map()))), + "t1" -> genTarget("t1", true, No, requires=Set(ResourceIdentifier.ofHivePartition("some_table", Map.empty[String,Any]))), "t2" -> genTarget("t2", false, No) ) ) diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerTestTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerTestTest.scala index 27d5a0cc7..f23855045 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerTestTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/execution/RunnerTestTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -193,8 +193,8 @@ class RunnerTestTest extends AnyFlatSpec with MockFactory with Matchers with Loc overrideMappingContext = ctx overrideMapping } - (overrideMapping.inputs _).expects().atLeastOnce().returns(Seq()) - (overrideMapping.outputs _).expects().atLeastOnce().returns(Seq("main")) + (overrideMapping.inputs _).expects().atLeastOnce().returns(Set()) + (overrideMapping.outputs _).expects().atLeastOnce().returns(Set("main")) (overrideMapping.identifier _).expects().atLeastOnce().returns(MappingIdentifier("map")) (overrideMapping.context _).expects().onCall(() => overrideMappingContext) (overrideMapping.broadcast _).expects().returns(false) @@ -312,8 +312,8 @@ class RunnerTestTest extends AnyFlatSpec with MockFactory with Matchers with Loc mappingContext = ctx mapping } - (mapping.inputs _).expects().atLeastOnce().returns(Seq()) - (mapping.outputs _).expects().atLeastOnce().returns(Seq("main")) + (mapping.inputs _).expects().atLeastOnce().returns(Set()) + (mapping.outputs _).expects().atLeastOnce().returns(Set("main")) (mapping.identifier _).expects().atLeastOnce().returns(MappingIdentifier("map")) (mapping.context _).expects().onCall(() => mappingContext) (mapping.broadcast _).expects().returns(false) diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphBuilderTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphBuilderTest.scala index 0f4da01b6..d2223ebb1 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphBuilderTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphBuilderTest.scala @@ -49,11 +49,13 @@ class GraphBuilderTest extends AnyFlatSpec with Matchers with MockFactory { (mappingTemplate1.instantiate _).expects(context).returns(mapping1) (mapping1.context _).expects().returns(context) + (mapping1.outputs _).expects().returns(Set("main")) (mapping1.kind _).expects().returns("m1_kind") (mapping1.name _).expects().atLeastOnce().returns("m1") (mapping1.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.input(MappingIdentifier("m2"), "main"))) (mappingTemplate2.instantiate 
_).expects(context).returns(mapping2) (mapping2.context _).expects().returns(context) + (mapping2.outputs _).expects().returns(Set("main")) (mapping2.kind _).expects().returns("m2_kind") (mapping2.name _).expects().atLeastOnce().returns("m2") (mapping2.link _).expects(*).returns(Unit) @@ -66,13 +68,14 @@ class GraphBuilderTest extends AnyFlatSpec with Matchers with MockFactory { val ref1 = nodes.find(_.name == "m1").head.asInstanceOf[MappingRef] val ref2 = nodes.find(_.name == "m2").head.asInstanceOf[MappingRef] + val out2main = ref2.outputs.head ref1.category should be (Category.MAPPING) ref1.kind should be ("m1_kind") ref1.name should be ("m1") ref1.mapping should be (mapping1) ref1.incoming should be (Seq( - InputMapping(ref2, ref1, "main") + InputMapping(out2main, ref1) )) ref1.outgoing should be (Seq()) @@ -81,8 +84,9 @@ class GraphBuilderTest extends AnyFlatSpec with Matchers with MockFactory { ref2.name should be ("m2") ref2.mapping should be (mapping2) ref2.incoming should be (Seq()) - ref2.outgoing should be (Seq( - InputMapping(ref2, ref1, "main") + ref2.outgoing should be (Seq()) + ref2.outputs.head.outgoing should be (Seq( + InputMapping(out2main, ref1) )) } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphTest.scala index 272b60fd6..f97013888 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/graph/GraphTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -25,11 +25,13 @@ import com.dimajix.flowman.execution.Session import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingIdentifier import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetIdentifier -import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.types.FieldValue +import com.dimajix.flowman.types.SingleValue class GraphTest extends AnyFlatSpec with Matchers with MockFactory { @@ -64,13 +66,15 @@ class GraphTest extends AnyFlatSpec with Matchers with MockFactory { (mappingTemplate1.instantiate _).expects(context).returns(mapping1) (mapping1.context _).expects().returns(context) + (mapping1.outputs _).expects().returns(Set("main")) (mapping1.name _).expects().atLeastOnce().returns("m1") (mapping1.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.input(MappingIdentifier("m2"), "main"))) (mappingTemplate2.instantiate _).expects(context).returns(mapping2) (mapping2.context _).expects().returns(context) + (mapping2.outputs _).expects().returns(Set("main")) (mapping2.name _).expects().atLeastOnce().returns("m2") - (mapping2.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.read(RelationIdentifier("src"), Map()))) + (mapping2.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.read(RelationIdentifier("src"), Map.empty[String,FieldValue]))) (sourceRelationTemplate.instantiate _).expects(context).returns(sourceRelation) (sourceRelation.context _).expects().returns(context) @@ -87,13 +91,13 @@ class GraphTest extends AnyFlatSpec with Matchers with MockFactory { (target.name 
_).expects().atLeastOnce().returns("t") (target.link _).expects(*,*).onCall((l:Linker, _:Phase) => Some(1).foreach { _ => l.input(MappingIdentifier("m1"), "main") - l.write(RelationIdentifier("tgt"), Map()) + l.write(RelationIdentifier("tgt"), Map.empty[String,SingleValue]) }) val graph = Graph.ofProject(session, project, Phase.BUILD) val nodes = graph.nodes - nodes.size should be (5) + nodes.size should be (7) nodes.find(_.name == "m1") should not be (None) nodes.find(_.name == "m1").get shouldBe a[MappingRef] nodes.find(_.name == "m2") should not be (None) @@ -116,6 +120,8 @@ class GraphTest extends AnyFlatSpec with Matchers with MockFactory { maps.find(_.name == "m3") should be (None) val m1 = maps.find(_.name == "m1").get val m2 = maps.find(_.name == "m2").get + val out1main = m1.outputs.head + val out2main = m2.outputs.head val tgts = graph.targets tgts.size should be (1) @@ -126,13 +132,15 @@ class GraphTest extends AnyFlatSpec with Matchers with MockFactory { val src = rels.find(_.name == "src").get val tgt = rels.find(_.name == "tgt").get - m1.incoming should be (Seq(InputMapping(m2, m1, "main"))) - m1.outgoing should be (Seq(InputMapping(m1, t, "main"))) + m1.incoming should be (Seq(InputMapping(out2main, m1))) + m1.outgoing should be (Seq()) + m1.outputs.head.outgoing should be (Seq(InputMapping(out1main, t))) m2.incoming should be (Seq(ReadRelation(src, m2, Map()))) - m2.outgoing should be (Seq(InputMapping(m2, m1, "main"))) + m2.outgoing should be (Seq()) + m2.outputs.head.outgoing should be (Seq(InputMapping(out2main, m1))) src.incoming should be (Seq()) src.outgoing should be (Seq(ReadRelation(src, m2, Map()))) - t.incoming should be (Seq(InputMapping(m1, t, "main"))) + t.incoming should be (Seq(InputMapping(out1main, t))) t.outgoing should be (Seq(WriteRelation(t, tgt, Map()))) tgt.incoming should be (Seq(WriteRelation(t, tgt, Map()))) tgt.outgoing should be (Seq()) diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/graph/NodeTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/graph/NodeTest.scala index 0b43cfa03..8f54f1589 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/graph/NodeTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/graph/NodeTest.scala @@ -54,22 +54,22 @@ class NodeTest extends AnyFlatSpec with Matchers with MockFactory { (tgtRelation.name _).expects().atLeastOnce().returns("facts") val srcRelationNode = RelationRef(1, srcRelation) - val readMappingNode = MappingRef(2, readMapping) - val mapping1Node = MappingRef(3, mapping1) - val mapping2Node = MappingRef(4, mapping2) - val mapping3Node = MappingRef(5, mapping3) - val unionMappingNode = MappingRef(6, unionMapping) - val targetNode = TargetRef(7, target, Phase.BUILD) - val tgtRelationNode = RelationRef(8, tgtRelation) + val readMappingNode = MappingRef(2, readMapping, Seq(MappingOutput(3, null, "main"))) + val mapping1Node = MappingRef(4, mapping1, Seq(MappingOutput(5, null, "main"))) + val mapping2Node = MappingRef(6, mapping2, Seq(MappingOutput(7, null, "main"))) + val mapping3Node = MappingRef(8, mapping3, Seq(MappingOutput(9, null, "main"))) + val unionMappingNode = MappingRef(10, unionMapping, Seq(MappingOutput(11, null, "main"))) + val targetNode = TargetRef(12, target, Phase.BUILD) + val tgtRelationNode = RelationRef(13, tgtRelation) tgtRelationNode.inEdges.append(WriteRelation(targetNode, tgtRelationNode)) - targetNode.inEdges.append(InputMapping(unionMappingNode, targetNode)) - unionMappingNode.inEdges.append(InputMapping(mapping1Node, unionMappingNode)) 
- unionMappingNode.inEdges.append(InputMapping(mapping2Node, unionMappingNode)) - unionMappingNode.inEdges.append(InputMapping(mapping3Node, unionMappingNode)) - mapping1Node.inEdges.append(InputMapping(readMappingNode, mapping1Node)) - mapping2Node.inEdges.append(InputMapping(readMappingNode, mapping2Node)) - mapping3Node.inEdges.append(InputMapping(readMappingNode, mapping3Node)) + targetNode.inEdges.append(InputMapping(unionMappingNode.outputs.head, targetNode)) + unionMappingNode.inEdges.append(InputMapping(mapping1Node.outputs.head, unionMappingNode)) + unionMappingNode.inEdges.append(InputMapping(mapping2Node.outputs.head, unionMappingNode)) + unionMappingNode.inEdges.append(InputMapping(mapping3Node.outputs.head, unionMappingNode)) + mapping1Node.inEdges.append(InputMapping(readMappingNode.outputs.head, mapping1Node)) + mapping2Node.inEdges.append(InputMapping(readMappingNode.outputs.head, mapping2Node)) + mapping3Node.inEdges.append(InputMapping(readMappingNode.outputs.head, mapping3Node)) readMappingNode.inEdges.append(ReadRelation(srcRelationNode, readMappingNode)) println(tgtRelationNode.upstreamDependencyTree) diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/history/GraphTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/history/GraphTest.scala new file mode 100644 index 000000000..ea2dc8ca2 --- /dev/null +++ b/flowman-core/src/test/scala/com/dimajix/flowman/history/GraphTest.scala @@ -0,0 +1,167 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.history + +import org.scalamock.scalatest.MockFactory +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.graph.Category +import com.dimajix.flowman.graph.Linker +import com.dimajix.flowman.model.Mapping +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.PartitionField +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.Target +import com.dimajix.flowman.types.SingleValue +import com.dimajix.flowman.types.StringType +import com.dimajix.flowman.types.StructType +import com.dimajix.flowman.{graph => g} + + +class GraphTest extends AnyFlatSpec with Matchers with MockFactory { + "Graph.ofGraph" should "work" in { + val mapping1 = mock[Mapping] + val mappingTemplate1 = mock[Prototype[Mapping]] + val mapping2 = mock[Mapping] + val mappingTemplate2 = mock[Prototype[Mapping]] + val sourceRelation = mock[Relation] + val sourceRelationTemplate = mock[Prototype[Relation]] + val targetRelation = mock[Relation] + val targetRelationTemplate = mock[Prototype[Relation]] + val target = mock[Target] + val targetTemplate = mock[Prototype[Target]] + + val project = Project( + name = "project", + mappings = Map( + "m1" -> mappingTemplate1, + "m2" -> mappingTemplate2 + ), + targets = Map( + "t" -> targetTemplate + ), + relations = Map( + "src" -> sourceRelationTemplate, + "tgt" -> targetRelationTemplate + ) + ) + val session = Session.builder().disableSpark().build() + val context = session.getContext(project) + val execution = session.execution + + (mappingTemplate1.instantiate _).expects(context).returns(mapping1) + (mapping1.context _).expects().returns(context) + (mapping1.outputs _).expects().returns(Set("main")) + (mapping1.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.input(MappingIdentifier("m2"), "main"))) + + (mappingTemplate2.instantiate _).expects(context).returns(mapping2) + (mapping2.context _).expects().returns(context) + (mapping2.outputs _).expects().returns(Set("main")) + (mapping2.link _).expects(*).onCall((l:Linker) => Some(1).foreach(_ => l.read(RelationIdentifier("src"), Map("pcol"-> SingleValue("part1"))))) + + (sourceRelationTemplate.instantiate _).expects(context).returns(sourceRelation) + (sourceRelation.context _).expects().returns(context) + (sourceRelation.link _).expects(*).returns(Unit) + + (targetRelationTemplate.instantiate _).expects(context).returns(targetRelation) + (targetRelation.context _).expects().returns(context) + (targetRelation.link _).expects(*).returns(Unit) + + (targetTemplate.instantiate _).expects(context).returns(target) + (target.context _).expects().returns(context) + (target.link _).expects(*,*).onCall((l:Linker, _:Phase) => Some(1).foreach { _ => + l.input(MappingIdentifier("m1"), "main") + l.write(RelationIdentifier("tgt"), Map("outcol"-> SingleValue("part1"))) + }) + + val graph = g.Graph.ofProject(session, project, Phase.BUILD) + + (mapping1.name _).expects().atLeastOnce().returns("m1") + (mapping1.kind _).expects().atLeastOnce().returns("m1_kind") + (mapping1.requires _).expects().returns(Set()) + (mapping2.name _).expects().atLeastOnce().returns("m2") + (mapping2.kind _).expects().atLeastOnce().returns("m2_kind") + (mapping2.requires 
_).expects().returns(Set()) + + (sourceRelation.name _).expects().atLeastOnce().returns("src") + (sourceRelation.kind _).expects().atLeastOnce().returns("src_kind") + (sourceRelation.provides _).expects().returns(Set()) + (sourceRelation.requires _).expects().returns(Set()) + (sourceRelation.partitions _).expects().returns(Seq(PartitionField("pcol", StringType))) + + (targetRelation.name _).expects().atLeastOnce().returns("tgt") + (targetRelation.kind _).expects().atLeastOnce().returns("tgt_kind") + (targetRelation.provides _).expects().returns(Set()) + (targetRelation.requires _).expects().returns(Set()) + + (target.provides _).expects(*).returns(Set.empty) + (target.requires _).expects(*).returns(Set.empty) + (target.name _).expects().returns("tgt1") + (target.kind _).expects().returns("tgt1_kind") + + val hgraph = Graph.ofGraph(graph) + val srcRelNode = hgraph.nodes.find(_.name == "src").get + val tgtRelNode = hgraph.nodes.find(_.name == "tgt").get + val m1Node = hgraph.nodes.find(_.name == "m1").get + val m2Node = hgraph.nodes.find(_.name == "m2").get + val tgtNode = hgraph.nodes.find(_.name == "tgt1").get + + srcRelNode.name should be ("src") + srcRelNode.category should be (Category.RELATION) + srcRelNode.kind should be ("src_kind") + srcRelNode.incoming should be (Seq.empty) + srcRelNode.outgoing.head.input should be (srcRelNode) + srcRelNode.outgoing.head.output should be (m2Node) + + m2Node.name should be ("m2") + m2Node.category should be (Category.MAPPING) + m2Node.kind should be ("m2_kind") + m2Node.incoming.head.input should be (srcRelNode) + m2Node.incoming.head.output should be (m2Node) + m2Node.outgoing.head.input should be (m2Node) + m2Node.outgoing.head.output should be (m1Node) + + m1Node.name should be ("m1") + m1Node.category should be (Category.MAPPING) + m1Node.kind should be ("m1_kind") + m1Node.incoming.head.input should be (m2Node) + m1Node.incoming.head.output should be (m1Node) + m1Node.outgoing.head.input should be (m1Node) + m1Node.outgoing.head.output should be (tgtNode) + + tgtNode.name should be ("tgt1") + tgtNode.category should be (Category.TARGET) + tgtNode.kind should be ("tgt1_kind") + tgtNode.incoming.head.input should be (m1Node) + tgtNode.incoming.head.output should be (tgtNode) + tgtNode.outgoing.head.input should be (tgtNode) + tgtNode.outgoing.head.output should be (tgtRelNode) + + tgtRelNode.name should be ("tgt") + tgtRelNode.category should be (Category.RELATION) + tgtRelNode.kind should be ("tgt_kind") + tgtRelNode.outgoing should be (Seq.empty) + tgtRelNode.incoming.head.input should be (tgtNode) + tgtRelNode.incoming.head.output should be (tgtRelNode) + } +} diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/BaseDialectTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/BaseDialectTest.scala index dccc92aef..afc25d3dc 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/BaseDialectTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/BaseDialectTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
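The graph test changes above (GraphBuilderTest, GraphTest, NodeTest and the new history GraphTest) reflect a reworked lineage model: a mapping node no longer connects directly to its consumers. Each `MappingRef` now carries `MappingOutput` child nodes, and `InputMapping` edges run from a mapping output to the consuming node. A minimal sketch of navigating such a graph, using only the API visible in these tests; the `inspect` helper and the node name `"m1"` are illustrative:

```scala
import com.dimajix.flowman.execution.Phase
import com.dimajix.flowman.execution.Session
import com.dimajix.flowman.graph.Graph
import com.dimajix.flowman.graph.MappingRef
import com.dimajix.flowman.model.Project

def inspect(project: Project): Unit = {
  val session = Session.builder().disableSpark().build()
  val graph = Graph.ofProject(session, project, Phase.BUILD)

  // Mapping nodes now expose their outputs as separate child nodes
  val m1 = graph.nodes.find(_.name == "m1").get.asInstanceOf[MappingRef]

  // Edges into the mapping still arrive at the MappingRef itself ...
  println(m1.incoming)
  // ... while edges to downstream consumers now start at a MappingOutput,
  // so what used to be m1.outgoing is reached via the output nodes
  println(m1.outputs.head.outgoing)
}
```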
@@ -16,7 +16,6 @@ package com.dimajix.flowman.jdbc -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.functions.expr import org.apache.spark.sql.types.IntegerType import org.apache.spark.sql.types.StringType @@ -25,7 +24,10 @@ import org.apache.spark.sql.types.StructType import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.catalog import com.dimajix.flowman.catalog.PartitionSpec +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.InsertClause import com.dimajix.flowman.execution.UpdateClause @@ -79,7 +81,7 @@ class BaseDialectTest extends AnyFlatSpec with Matchers { it should "provide CREATE statements with PK" in { val dialect = NoopDialect val table = TableIdentifier("table_1", Some("my_db")) - val tableDefinition = TableDefinition( + val tableDefinition = catalog.TableDefinition( table, Seq( Field("id", com.dimajix.flowman.types.IntegerType, nullable = false), diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/DerbyJdbcTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/DerbyJdbcTest.scala index ab4a6ecef..d4d3c919d 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/DerbyJdbcTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/DerbyJdbcTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -18,14 +18,17 @@ package com.dimajix.flowman.jdbc import java.nio.file.Path -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.IntegerType import com.dimajix.flowman.types.StringType +import com.dimajix.flowman.types.VarcharType import com.dimajix.spark.testing.LocalTempDir @@ -43,21 +46,39 @@ class DerbyJdbcTest extends AnyFlatSpec with Matchers with LocalTempDir { "A Derby Table" should "be creatable" in { val options = new JDBCOptions(url, "table_001", Map(JDBCOptions.JDBC_DRIVER_CLASS -> driver)) val conn = JdbcUtils.createConnection(options) - val table = TableDefinition( + val table1 = TableDefinition( TableIdentifier("table_001"), Seq( Field("Id", IntegerType, nullable=false), - Field("str_field", StringType), + Field("str_field", VarcharType(32)), Field("int_field", IntegerType) ), - None, - Seq("Id") + primaryKey = Seq("Id"), + indexes = Seq( + TableIndex("table_001_idx1", Seq("str_field", "int_field")) + ) ) - JdbcUtils.tableExists(conn, table.identifier, options) should be (false) - JdbcUtils.createTable(conn, table, options) - JdbcUtils.tableExists(conn, table.identifier, options) should be (true) - JdbcUtils.dropTable(conn, table.identifier, options) - JdbcUtils.tableExists(conn, table.identifier, options) should be (false) + + //==== CREATE ================================================================================================ + JdbcUtils.tableExists(conn, table1.identifier, options) should be (false) + 
JdbcUtils.createTable(conn, table1, options) + JdbcUtils.tableExists(conn, table1.identifier, options) should be (true) + + JdbcUtils.getTable(conn, table1.identifier, options) should be (table1) + + //==== DROP INDEX ============================================================================================ + val table2 = table1.copy(indexes = Seq.empty) + JdbcUtils.dropIndex(conn, table1.identifier, "table_001_idx1", options) + JdbcUtils.getTable(conn, table1.identifier, options) should be (table2) + + //==== CREATE INDEX ============================================================================================ + val table3 = table2.copy(indexes = Seq(TableIndex("table_001_idx1", Seq("str_field", "Id")))) + JdbcUtils.createIndex(conn, table3.identifier, table3.indexes.head, options) + JdbcUtils.getTable(conn, table3.identifier, options) should be (table3) + + //==== DROP ================================================================================================== + JdbcUtils.dropTable(conn, table1.identifier, options) + JdbcUtils.tableExists(conn, table1.identifier, options) should be (false) conn.close() } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/H2JdbcTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/H2JdbcTest.scala index 183e13790..006f3db3d 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/H2JdbcTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/H2JdbcTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -20,7 +20,6 @@ import java.nio.file.Path import java.util.Properties import org.apache.spark.sql.Row -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions import org.apache.spark.sql.functions.col import org.apache.spark.sql.functions.expr @@ -29,12 +28,17 @@ import org.apache.spark.sql.types.StructField import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.catalog +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.InsertClause import com.dimajix.flowman.execution.UpdateClause import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.IntegerType import com.dimajix.flowman.types.StringType +import com.dimajix.flowman.types.VarcharType import com.dimajix.spark.sql.DataFrameBuilder import com.dimajix.spark.testing.LocalSparkSession @@ -57,15 +61,33 @@ class H2JdbcTest extends AnyFlatSpec with Matchers with LocalSparkSession { TableIdentifier("table_001"), Seq( Field("Id", IntegerType, nullable=false), - Field("str_field", StringType), + Field("str_field", VarcharType(32)), Field("int_field", IntegerType) ), - None, - Seq("iD") + primaryKey = Seq("iD"), + indexes = Seq( + TableIndex("table_001_idx1", Seq("str_field", "int_field")) + ) ) + + //==== CREATE ================================================================================================ JdbcUtils.tableExists(conn, table.identifier, options) should be (false) JdbcUtils.createTable(conn, table, options) JdbcUtils.tableExists(conn, table.identifier, options) should be (true) + + JdbcUtils.getTable(conn, table.identifier, 
options).normalize() should be (table.normalize()) + + //==== DROP INDEX ============================================================================================ + val table2 = table.copy(indexes = Seq.empty) + JdbcUtils.dropIndex(conn, table.identifier, "table_001_idx1", options) + JdbcUtils.getTable(conn, table.identifier, options).normalize() should be (table2.normalize()) + + //==== CREATE INDEX ============================================================================================ + val table3 = table2.copy(indexes = Seq(TableIndex("table_001_idx1", Seq("str_field", "Id")))) + JdbcUtils.createIndex(conn, table3.identifier, table3.indexes.head, options) + JdbcUtils.getTable(conn, table3.identifier, options).normalize() should be (table3.normalize()) + + //==== DROP ================================================================================================== JdbcUtils.dropTable(conn, table.identifier, options) JdbcUtils.tableExists(conn, table.identifier, options) should be (false) conn.close() @@ -74,7 +96,7 @@ class H2JdbcTest extends AnyFlatSpec with Matchers with LocalSparkSession { "JdbcUtils.mergeTable()" should "work with complex clauses" in { val options = new JDBCOptions(url, "table_002", Map(JDBCOptions.JDBC_DRIVER_CLASS -> driver)) val conn = JdbcUtils.createConnection(options) - val table = TableDefinition( + val table = catalog.TableDefinition( TableIdentifier("table_001"), Seq( Field("id", IntegerType), @@ -171,7 +193,7 @@ class H2JdbcTest extends AnyFlatSpec with Matchers with LocalSparkSession { it should "work with trivial clauses" in { val options = new JDBCOptions(url, "table_002", Map(JDBCOptions.JDBC_DRIVER_CLASS -> driver)) val conn = JdbcUtils.createConnection(options) - val table = TableDefinition( + val table = catalog.TableDefinition( TableIdentifier("table_001"), Seq( Field("id", IntegerType), diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/JdbcUtilsTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/JdbcUtilsTest.scala index d046d1863..84b6dc650 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/JdbcUtilsTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/jdbc/JdbcUtilsTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
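The Derby and H2 tests above exercise the new index support for JDBC relations: `TableIdentifier` and `TableDefinition` now come from `com.dimajix.flowman.catalog` instead of Spark's catalyst package, `TableDefinition` gained `primaryKey` and `indexes` fields, and `JdbcUtils` can create and drop individual indexes. A rough sketch of that flow, assuming an H2 in-memory database; the `url` and `driver` values are placeholders, not taken from this patch:

```scala
import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions

import com.dimajix.flowman.catalog.TableDefinition
import com.dimajix.flowman.catalog.TableIdentifier
import com.dimajix.flowman.catalog.TableIndex
import com.dimajix.flowman.jdbc.JdbcUtils
import com.dimajix.flowman.types.Field
import com.dimajix.flowman.types.IntegerType
import com.dimajix.flowman.types.VarcharType

// Placeholder connection settings (assumption, not part of the patch)
val url = "jdbc:h2:mem:flowman_example"
val driver = "org.h2.Driver"
val options = new JDBCOptions(url, "table_001", Map(JDBCOptions.JDBC_DRIVER_CLASS -> driver))
val conn = JdbcUtils.createConnection(options)

val table = TableDefinition(
  TableIdentifier("table_001"),
  Seq(
    Field("id", IntegerType, nullable = false),
    Field("str_field", VarcharType(32)),
    Field("int_field", IntegerType)
  ),
  primaryKey = Seq("id"),
  indexes = Seq(TableIndex("table_001_idx1", Seq("str_field", "int_field")))
)

// Create the table together with its primary key and index
JdbcUtils.createTable(conn, table, options)

// Read back the physical definition (includes the index)
val current = JdbcUtils.getTable(conn, table.identifier, options)

// Indexes can also be dropped and (re)created individually
JdbcUtils.dropIndex(conn, table.identifier, "table_001_idx1", options)
JdbcUtils.createIndex(conn, table.identifier, table.indexes.head, options)

JdbcUtils.dropTable(conn, table.identifier, options)
conn.close()
```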
@@ -18,12 +18,14 @@ package com.dimajix.flowman.jdbc import java.nio.file.Path -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.catalog import com.dimajix.flowman.catalog.TableChange +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.types.BooleanType import com.dimajix.flowman.types.Field @@ -73,7 +75,7 @@ class JdbcUtilsTest extends AnyFlatSpec with Matchers with LocalTempDir { "JdbcUtils.alterTable()" should "work" in { val options = new JDBCOptions(url, "table_002", Map(JDBCOptions.JDBC_DRIVER_CLASS -> driver)) val conn = JdbcUtils.createConnection(options) - val table = TableDefinition( + val table = catalog.TableDefinition( TableIdentifier("table_001"), Seq( Field("str_field", VarcharType(20)), @@ -91,7 +93,9 @@ class JdbcUtilsTest extends AnyFlatSpec with Matchers with LocalTempDir { Field("BOOL_FIELD", BooleanType) )) - val migrations = TableChange.migrate(curSchema, newSchema, MigrationPolicy.STRICT) + val curTable = TableDefinition(TableIdentifier(""), curSchema.fields) + val newTable = TableDefinition(TableIdentifier(""), newSchema.fields) + val migrations = TableChange.migrate(curTable, newTable, MigrationPolicy.STRICT) JdbcUtils.alterTable(conn, table.identifier, migrations, options) JdbcUtils.getSchema(conn, table.identifier, options) should be (newSchema) diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricBoardTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricBoardTest.scala index ef13b061d..60dc6eb94 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricBoardTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricBoardTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019-2020 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -39,8 +39,8 @@ class MetricBoardTest extends AnyFlatSpec with Matchers { registry.addBundle(CounterAccumulatorMetricBundle("some_metric", Map("raw_label" -> "raw_value"), accumulator1, "sublabel")) val selections = Seq( MetricSelection( - "m1", - Selector(Some("some_metric"), + Some("m1"), + Selector("some_metric", Map("raw_label" -> "raw_value", "sublabel" -> "a") ), Map("rl" -> "$raw_label", "sl" -> "$sublabel", "ev" -> "$env_var") diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricSystemTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricSystemTest.scala index 110d1015f..8c3c589df 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricSystemTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/metric/MetricSystemTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
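The metric test changes around this point show the reworked `Selector`: `name` is now a plain `String` rather than an `Option[String]`, and both metric names and label values are matched as regular expressions. A small usage sketch, assuming these classes live in `com.dimajix.flowman.metric`, the package of the tests changed here:

```scala
import com.dimajix.flowman.metric.MetricSystem
import com.dimajix.flowman.metric.Selector

val registry = new MetricSystem
// (bundles would normally be registered via registry.addBundle(...))

// Selector.name is a plain String now, and names as well as label values
// are matched as regular expressions
registry.findBundle(Selector(name = "some_metric_.*"))            // by name pattern
registry.findBundle(Selector(labels = Map("label" -> "acc.*")))   // by label pattern
registry.findMetric(Selector(name = "some_metric_1", labels = Map("label" -> "acc1")))
```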
@@ -81,16 +81,16 @@ class MetricSystemTest extends AnyFlatSpec with Matchers { r7.size should be (1) r7.forall(m => m.labels("label") == "acc1" && m.labels("sublabel") == "a") should be (true) - val r8 = registry.findMetric(Selector(name=Some("no_such_metric"))) + val r8 = registry.findMetric(Selector(name="no_such_metric")) r8.size should be (0) - val r9 = registry.findMetric(Selector(name=Some("some_metric_1"))) + val r9 = registry.findMetric(Selector(name="some_metric_1")) r9.size should be (2) - val r10 = registry.findMetric(Selector(name=Some("some_metric_1"), labels=Map("label" -> "acc1"))) + val r10 = registry.findMetric(Selector(name="some_metric_1", labels=Map("label" -> "acc1"))) r10.size should be (2) - val r11 = registry.findMetric(Selector(name=Some("some_metric_1"), labels=Map("label" -> "acc2"))) + val r11 = registry.findMetric(Selector(name="some_metric_1", labels=Map("label" -> "acc2"))) r11.size should be (0) } @@ -98,17 +98,22 @@ class MetricSystemTest extends AnyFlatSpec with Matchers { val registry = new MetricSystem val accumulator1 = new CounterAccumulator() - registry.addBundle(new CounterAccumulatorMetricBundle("some_metric_1", Map("label" -> "acc1"), accumulator1, "sublabel")) + registry.addBundle(CounterAccumulatorMetricBundle("some_metric_1", Map("label" -> "acc1"), accumulator1, "sublabel")) val accumulator2 = new CounterAccumulator() - registry.addBundle(new CounterAccumulatorMetricBundle("some_metric_2", Map("label" -> "acc2"), accumulator2, "sublabel")) + registry.addBundle(CounterAccumulatorMetricBundle("some_metric_2", Map("label" -> "acc2"), accumulator2, "sublabel")) registry.findBundle(Selector()).size should be (2) registry.findBundle(Selector(labels=Map("label" -> "acc2"))).size should be (1) registry.findBundle(Selector(labels=Map("label" -> "acc3"))).size should be (0) - registry.findBundle(Selector(name=Some("no_such_metric"))).size should be (0) - registry.findBundle(Selector(name=Some("some_metric_1"))).size should be (1) - registry.findBundle(Selector(name=Some("some_metric_1"), labels=Map("label" -> "acc1"))).size should be (1) - registry.findBundle(Selector(name=Some("some_metric_1"), labels=Map("label" -> "acc2"))).size should be (0) + registry.findBundle(Selector(name="no_such_metric")).size should be (0) + registry.findBundle(Selector(name="some_metric_1")).size should be (1) + registry.findBundle(Selector(name="some_metric_1", labels=Map("label" -> "acc1"))).size should be (1) + registry.findBundle(Selector(name="some_metric_1", labels=Map("label" -> "acc2"))).size should be (0) + + registry.findBundle(Selector(labels=Map("label" -> ".*2"))).size should be (1) + registry.findBundle(Selector(labels=Map("label" -> ".*3"))).size should be (0) + registry.findBundle(Selector(name="no_such_.*")).size should be (0) + registry.findBundle(Selector(name="some_metric_.*")).size should be (2) } } diff --git a/flowman-core/src/test/scala/com/dimajix/flowman/model/MappingTest.scala b/flowman-core/src/test/scala/com/dimajix/flowman/model/MappingTest.scala index 32f34ab43..91843bcd8 100644 --- a/flowman-core/src/test/scala/com/dimajix/flowman/model/MappingTest.scala +++ b/flowman-core/src/test/scala/com/dimajix/flowman/model/MappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
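`Mapping.inputs` (and `Mapping.outputs`) now return a `Set` instead of a `Seq`, which is why the `DummyMapping` in the hunks below changes its override signature. A minimal custom mapping following that pattern; the class name `PassthroughMapping` and its body are illustrative, only the signatures are taken from the test below:

```scala
import org.apache.spark.sql.DataFrame

import com.dimajix.flowman.execution.Execution
import com.dimajix.flowman.model.BaseMapping
import com.dimajix.flowman.model.Mapping
import com.dimajix.flowman.model.MappingOutputIdentifier

// Minimal custom mapping after the API change: `inputs` is Set-valued now
class PassthroughMapping(props: Mapping.Properties, ins: Set[MappingOutputIdentifier]) extends BaseMapping {
  protected override def instanceProperties: Mapping.Properties = props

  override def inputs: Set[MappingOutputIdentifier] = ins

  override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = {
    // Simply forward the first input as the "main" output
    Map("main" -> input.head._2)
  }
}
```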
@@ -38,10 +38,10 @@ import com.dimajix.spark.testing.LocalSparkSession object MappingTest { - class DummyMapping(props:Mapping.Properties, ins:Seq[MappingOutputIdentifier]) extends BaseMapping { + class DummyMapping(props:Mapping.Properties, ins:Set[MappingOutputIdentifier]) extends BaseMapping { protected override def instanceProperties: Mapping.Properties = props - override def inputs: Seq[MappingOutputIdentifier] = ins + override def inputs: Set[MappingOutputIdentifier] = ins override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = { val df = input.head._2.groupBy("id").agg(f.sum("val")) @@ -60,7 +60,7 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val mapping = new DummyMapping( Mapping.Properties(context, "m1"), - Seq() + Set() ) mapping.metadata should be (Metadata( @@ -83,7 +83,7 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val mapping = new DummyMapping( Mapping.Properties(context, "m1"), - Seq() + Set() ) mapping.output("main") should be (MappingOutputIdentifier("project/m1:main")) an[NoSuchMappingOutputException] should be thrownBy(mapping.output("no_such_output")) @@ -95,7 +95,7 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val mapping = new DummyMapping( Mapping.Properties(context, "m1"), - Seq() + Set() ) mapping.output("main") should be (MappingOutputIdentifier("m1:main")) an[NoSuchMappingOutputException] should be thrownBy(mapping.output("no_such_output")) @@ -108,7 +108,7 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val mapping = new DummyMapping( Mapping.Properties(context, "m1"), - Seq(MappingOutputIdentifier("input:main")) + Set(MappingOutputIdentifier("input:main")) ) val inputSchema = StructType(Seq( @@ -140,11 +140,11 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val mapping1 = new DummyMapping( Mapping.Properties(context, "m1"), - Seq(MappingOutputIdentifier("m2")) + Set(MappingOutputIdentifier("m2")) ) val mapping2 = new DummyMapping( Mapping.Properties(context, "m2"), - Seq() + Set() ) //(mappingTemplate1.instantiate _).expects(context).returns(mapping1) (mappingTemplate2.instantiate _).expects(context).returns(mapping2) @@ -152,17 +152,20 @@ class MappingTest extends AnyFlatSpec with Matchers with MockFactory with LocalS val graphBuilder = new GraphBuilder(context, Phase.BUILD) val ref1 = graphBuilder.refMapping(mapping1) val ref2 = graphBuilder.refMapping(mapping2) + val out11 = ref1.outputs.head + val out21 = ref2.outputs.head ref1.mapping should be (mapping1) ref1.incoming should be (Seq( - InputMapping(ref2, ref1, "main") + InputMapping(out21, ref1) )) ref1.outgoing should be (Seq()) ref2.mapping should be (mapping2) ref2.incoming should be (Seq()) - ref2.outgoing should be (Seq( - InputMapping(ref2, ref1, "main") + ref2.outgoing should be (Seq()) + ref2.outputs.head.outgoing should be (Seq( + InputMapping(out21, ref1) )) } } diff --git a/flowman-dist/conf/default-namespace.yml.template b/flowman-dist/conf/default-namespace.yml.template index 0b67358c8..a71a36954 100644 --- a/flowman-dist/conf/default-namespace.yml.template +++ b/flowman-dist/conf/default-namespace.yml.template @@ -18,10 +18,61 @@ connections: password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '') +# This adds a hook for creating an execution log in a file +hooks: + kind: report + location: ${project.basedir}/generated-report.txt + metrics: + # 
Define common labels for all metrics + labels: + project: ${project.name} + metrics: + # Collect everything + - selector: + name: .* + labels: + category: ${category} + kind: ${kind} + name: ${name} + # This metric contains the number of records per output + - name: output_records + selector: + name: target_records + labels: + category: target + labels: + target: ${name} + # This metric contains the processing time per output + - name: output_time + selector: + name: target_runtime + labels: + category: target + labels: + target: ${name} + # This metric contains the overall processing time + - name: processing_time + selector: + name: job_runtime + labels: + category: job + + # This configures where metrics should be written to. Since we cannot assume a working Prometheus push gateway, we # simply print them onto the console metrics: - kind: console + - kind: console + # Optionally add a JDBC metric sink + #- kind: jdbc + # labels: + # project: ${project.name} + # version: ${project.version} + # connection: + # kind: jdbc + # url: jdbc:sqlserver://localhost:1433;databaseName=flowman_metrics + # driver: "com.microsoft.sqlserver.jdbc.SQLServerDriver" + # username: "sa" + # password: "yourStrong(!)Password" # This section contains global configuration properties. These still can be overwritten within projects or profiles diff --git a/flowman-dist/conf/flowman-env.sh.template b/flowman-dist/conf/flowman-env.sh.template index ca28cace3..904a49360 100644 --- a/flowman-dist/conf/flowman-env.sh.template +++ b/flowman-dist/conf/flowman-env.sh.template @@ -53,6 +53,9 @@ SPARK_DRIVER_MEMORY="3G" # #SPARK_SUBMIT= +# Add some more jars to spark-submit. This can be a comma-separated list of jars. +# +#SPARK_JARS= # Apply any proxy settings from the system environment # diff --git a/flowman-dist/conf/history-server.yml.template b/flowman-dist/conf/history-server.yml.template index 90df9b7ce..eab638f25 100644 --- a/flowman-dist/conf/history-server.yml.template +++ b/flowman-dist/conf/history-server.yml.template @@ -16,10 +16,6 @@ connections: password: $System.getenv('FLOWMAN_HISTORY_PASSWORD', '') -# This section contains global configuration properties. -config: - - # This section enables plugins. You may want to remove plugins which are of no use for you. 
plugins: - flowman-mariadb diff --git a/flowman-dist/libexec/flowman-common.sh b/flowman-dist/libexec/flowman-common.sh index c57dcb969..cd73013cb 100644 --- a/flowman-dist/libexec/flowman-common.sh +++ b/flowman-dist/libexec/flowman-common.sh @@ -19,6 +19,7 @@ fi # Set basic Spark options : ${SPARK_SUBMIT:="$SPARK_HOME"/bin/spark-submit} : ${SPARK_OPTS:=""} +: ${SPARK_JARS:=""} : ${SPARK_DRIVER_JAVA_OPTS:="-server"} : ${SPARK_EXECUTOR_JAVA_OPTS:="-server"} @@ -84,11 +85,17 @@ flowman_lib() { spark_submit() { + if [ "$SPARK_JARS" != "" ]; then + extra_jars=",$SPARK_JARS" + else + extra_jars="" + fi + $SPARK_SUBMIT \ --driver-java-options "$SPARK_DRIVER_JAVA_OPTS" \ --conf spark.execution.extraJavaOptions="$SPARK_EXECUTOR_JAVA_OPTS" \ --class $3 \ $SPARK_OPTS \ - --jars "$(flowman_lib $2)" \ + --jars "$(flowman_lib $2)$extra_jars" \ $FLOWMAN_HOME/lib/$1 "${@:4}" } diff --git a/flowman-dist/pom.xml b/flowman-dist/pom.xml index 4137f863b..20ee89fa8 100644 --- a/flowman-dist/pom.xml +++ b/flowman-dist/pom.xml @@ -10,7 +10,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-dsl/pom.xml b/flowman-dsl/pom.xml index 87e30d096..2c841db2b 100644 --- a/flowman-dsl/pom.xml +++ b/flowman-dsl/pom.xml @@ -9,7 +9,7 @@ flowman-root com.dimajix.flowman - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveTable.scala b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveTable.scala index 848214c49..a6e35e9ed 100644 --- a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveTable.scala +++ b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveTable.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -18,6 +18,7 @@ package com.dimajix.flowman.dsl.relation import org.apache.hadoop.fs.Path +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.dsl.RelationGen import com.dimajix.flowman.model.PartitionField import com.dimajix.flowman.model.Relation @@ -45,10 +46,9 @@ case class HiveTable( override def apply(props:Relation.Properties) : HiveTableRelation = { HiveTableRelation( props, - database = database, schema = schema.map(s => s.instantiate(props.context)), partitions = partitions, - table = table, + table = TableIdentifier(table, database.toSeq), external = external, location = location, format = format, diff --git a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveUnionTable.scala b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveUnionTable.scala index 515de80f4..6429193c7 100644 --- a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveUnionTable.scala +++ b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveUnionTable.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
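The DSL relation generators around this point (HiveTable above, HiveUnionTable and HiveView below) no longer pass database and table/view names separately; they build a single `com.dimajix.flowman.catalog.TableIdentifier` instead. A brief sketch of the constructor forms that appear in these hunks; the overload accepting a `Seq` is inferred from `database.toSeq` and may differ in detail:

```scala
import com.dimajix.flowman.catalog.TableIdentifier

// Table name only - the database stays unspecified
val t1 = TableIdentifier("some_table")

// Database given as an Option[String], as in the relation specs
val t2 = TableIdentifier("some_table", Some("my_db"))

// Database given as namespace components, as in TableIdentifier(table, database.toSeq)
val t3 = TableIdentifier("some_table", Seq("my_db"))
```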
@@ -18,6 +18,7 @@ package com.dimajix.flowman.dsl.relation import org.apache.hadoop.fs.Path +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.dsl.RelationGen import com.dimajix.flowman.model.PartitionField import com.dimajix.flowman.model.Relation @@ -49,11 +50,9 @@ case class HiveUnionTable( props, schema.map(_.instantiate(context)), partitions, - tableDatabase, - tablePrefix, + TableIdentifier(tablePrefix, tableDatabase), locationPrefix, - viewDatabase, - view, + TableIdentifier(view, viewDatabase), external, format, options, diff --git a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveView.scala b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveView.scala index e5b0376cd..02d0155f1 100644 --- a/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveView.scala +++ b/flowman-dsl/src/main/scala/com/dimajix/flowman/dsl/relation/HiveView.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,6 +16,7 @@ package com.dimajix.flowman.dsl.relation +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.dsl.RelationGen import com.dimajix.flowman.model.MappingOutputIdentifier import com.dimajix.flowman.model.PartitionField @@ -33,8 +34,7 @@ case class HiveView( override def apply(props: Relation.Properties): HiveViewRelation = { HiveViewRelation( props, - database, - view, + TableIdentifier(view, database), partitions, sql, mapping diff --git a/flowman-hub/pom.xml b/flowman-hub/pom.xml index 568f1d468..0b8ff9350 100644 --- a/flowman-hub/pom.xml +++ b/flowman-hub/pom.xml @@ -9,7 +9,7 @@ flowman-root com.dimajix.flowman - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-parent/pom.xml b/flowman-parent/pom.xml index a6ad72b8a..da6aa9be0 100644 --- a/flowman-parent/pom.xml +++ b/flowman-parent/pom.xml @@ -10,7 +10,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml @@ -63,7 +63,7 @@ true org.codehaus.mojo build-helper-maven-plugin - 3.2.0 + 3.3.0 true @@ -90,7 +90,7 @@ true org.apache.maven.plugins maven-compiler-plugin - 3.8.1 + 3.9.0 ${maven.compiler.source} ${maven.compiler.target} @@ -100,7 +100,7 @@ true net.alchim31.maven scala-maven-plugin - 4.5.3 + 4.5.6 ${scala.version} ${scala.api_version} @@ -193,7 +193,7 @@ true org.codehaus.mojo versions-maven-plugin - 2.8.1 + 2.9.0 true @@ -240,7 +240,19 @@ true org.apache.maven.plugins maven-site-plugin - 3.9.1 + 3.10.0 + + + true + org.apache.maven.plugins + maven-project-info-reports-plugin + 3.2.1 + + + true + org.apache.maven.plugins + maven-help-plugin + 3.2.0 diff --git a/flowman-plugins/aws/pom.xml b/flowman-plugins/aws/pom.xml index c87f38e26..776177c92 100644 --- a/flowman-plugins/aws/pom.xml +++ b/flowman-plugins/aws/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/azure/pom.xml b/flowman-plugins/azure/pom.xml index 1db822ec0..b473e1976 100644 --- a/flowman-plugins/azure/pom.xml +++ b/flowman-plugins/azure/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/delta/pom.xml b/flowman-plugins/delta/pom.xml index 2aeff40c6..946135545 100644 --- a/flowman-plugins/delta/pom.xml +++ b/flowman-plugins/delta/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml @@ -17,7 +17,7 @@ 
flowman-delta ${project.version} ${project.build.finalName}.jar - 1.0.0 + 1.1.0 diff --git a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaFileRelation.scala b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaFileRelation.scala index 6778570e7..0acec4fba 100644 --- a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaFileRelation.scala +++ b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaFileRelation.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -22,10 +22,8 @@ import java.nio.file.FileAlreadyExistsException import com.fasterxml.jackson.annotation.JsonProperty import io.delta.tables.DeltaTable import org.apache.hadoop.fs.Path -import org.apache.spark.sql.Column import org.apache.spark.sql.DataFrame import org.apache.spark.sql.delta.catalog.DeltaTableV2 -import org.apache.spark.sql.functions.col import org.apache.spark.sql.streaming.StreamingQuery import org.apache.spark.sql.streaming.Trigger import org.apache.spark.sql.types.StructType @@ -37,9 +35,10 @@ import com.dimajix.common.Trilean import com.dimajix.common.Yes import com.dimajix.flowman.catalog.PartitionSpec import com.dimajix.flowman.catalog.TableChange +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution -import com.dimajix.flowman.execution.MergeClause import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.MigrationStrategy import com.dimajix.flowman.execution.OutputMode @@ -223,7 +222,9 @@ case class DeltaFileRelation( val table = deltaCatalogTable(execution) val sourceSchema = com.dimajix.flowman.types.StructType.of(table.schema()) val targetSchema = com.dimajix.flowman.types.SchemaUtils.replaceCharVarchar(fullSchema.get) - !TableChange.requiresMigration(sourceSchema, targetSchema, migrationPolicy) + val sourceTable = TableDefinition(TableIdentifier.empty, sourceSchema.fields) + val targetTable = TableDefinition(TableIdentifier.empty, targetSchema.fields) + !TableChange.requiresMigration(sourceTable, targetTable, migrationPolicy) } else { true @@ -288,6 +289,8 @@ case class DeltaFileRelation( properties, description ) + + provides.foreach(execution.refreshResource) } } @@ -344,6 +347,7 @@ case class DeltaFileRelation( else { logger.info(s"Destroying Delta file relation '$identifier' by deleting directory '$location'") fs.delete(location, true) + provides.foreach(execution.refreshResource) } } diff --git a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaRelation.scala b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaRelation.scala index 238d498f6..03cd6be33 100644 --- a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaRelation.scala +++ b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaRelation.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
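`TableChange.requiresMigration` and `TableChange.migrate` now compare whole `TableDefinition`s rather than bare schemas, which is what the Delta relation changes below adapt to (the same pattern already appears in JdbcUtilsTest above). A compact sketch, assuming the signatures visible in this patch; the example column names are made up:

```scala
import com.dimajix.flowman.catalog.TableChange
import com.dimajix.flowman.catalog.TableDefinition
import com.dimajix.flowman.catalog.TableIdentifier
import com.dimajix.flowman.execution.MigrationPolicy
import com.dimajix.flowman.types.Field
import com.dimajix.flowman.types.IntegerType
import com.dimajix.flowman.types.StructType
import com.dimajix.flowman.types.VarcharType

// Current and desired schemas, wrapped into TableDefinitions as the new API expects.
// TableIdentifier.empty is used when only the column list matters.
val currentSchema = StructType(Seq(Field("id", IntegerType), Field("name", VarcharType(20))))
val desiredSchema = StructType(Seq(Field("id", IntegerType), Field("name", VarcharType(40))))

val currentTable = TableDefinition(TableIdentifier.empty, currentSchema.fields)
val desiredTable = TableDefinition(TableIdentifier.empty, desiredSchema.fields)

// Check whether a migration is needed, then compute the individual changes,
// which can be applied e.g. via JdbcUtils.alterTable(conn, identifier, migrations, options)
if (TableChange.requiresMigration(currentTable, desiredTable, MigrationPolicy.STRICT)) {
  val migrations = TableChange.migrate(currentTable, desiredTable, MigrationPolicy.STRICT)
  migrations.foreach(println)
}
```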
@@ -33,12 +33,14 @@ import org.apache.spark.sql.types.StructType import org.slf4j.LoggerFactory import com.dimajix.common.SetIgnoreCase +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.catalog.TableChange import com.dimajix.flowman.catalog.TableChange.AddColumn import com.dimajix.flowman.catalog.TableChange.DropColumn import com.dimajix.flowman.catalog.TableChange.UpdateColumnComment import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType +import com.dimajix.flowman.catalog.TableDefinition import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.MergeClause import com.dimajix.flowman.execution.MigrationFailedException @@ -124,29 +126,32 @@ abstract class DeltaRelation(options: Map[String,String], mergeKey: Seq[String]) val table = deltaCatalogTable(execution) val sourceSchema = com.dimajix.flowman.types.StructType.of(table.schema()) val targetSchema = com.dimajix.flowman.types.SchemaUtils.replaceCharVarchar(fullSchema.get) + val sourceTable = TableDefinition(TableIdentifier.empty, sourceSchema.fields) + val targetTable = TableDefinition(TableIdentifier.empty, targetSchema.fields) - val requiresMigration = TableChange.requiresMigration(sourceSchema, targetSchema, migrationPolicy) + val requiresMigration = TableChange.requiresMigration(sourceTable, targetTable, migrationPolicy) if (requiresMigration) { - doMigration(execution, table, sourceSchema, targetSchema, migrationPolicy, migrationStrategy) + doMigration(execution, table, sourceTable, targetTable, migrationPolicy, migrationStrategy) + provides.foreach(execution.refreshResource) } } - private def doMigration(execution: Execution, table:DeltaTableV2, currentSchema:com.dimajix.flowman.types.StructType, targetSchema:com.dimajix.flowman.types.StructType, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { + private def doMigration(execution: Execution, table:DeltaTableV2, currentTable:TableDefinition, targetTable:TableDefinition, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for Delta relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.warn(s"Migration required for Delta relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate Delta relation '$identifier', since migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.error(s"Cannot migrate Delta relation '$identifier', since migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER => - val migrations = TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.exists(m => !supported(m))) { - logger.error(s"Cannot migrate Delta relation '$identifier', since that would require unsupported changes.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.error(s"Cannot migrate Delta relation '$identifier', 
since that would require unsupported changes.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) } alter(migrations) case MigrationStrategy.ALTER_REPLACE => - val migrations = TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.forall(m => supported(m))) { alter(migrations) } @@ -158,7 +163,7 @@ abstract class DeltaRelation(options: Map[String,String], mergeKey: Seq[String]) } def alter(migrations:Seq[TableChange]) : Unit = { - logger.info(s"Migrating Delta relation '$identifier'. New schema:\n${targetSchema.treeString}") + logger.info(s"Migrating Delta relation '$identifier'. New schema:\n${targetTable.schema.treeString}") try { val spark = execution.spark diff --git a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaTableRelation.scala b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaTableRelation.scala index 6ab742ba7..86e715bb8 100644 --- a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaTableRelation.scala +++ b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/relation/DeltaTableRelation.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -19,9 +19,7 @@ package com.dimajix.flowman.spec.relation import com.fasterxml.jackson.annotation.JsonProperty import io.delta.tables.DeltaTable import org.apache.hadoop.fs.Path -import org.apache.spark.sql.Column import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.catalog.CatalogTableType import org.apache.spark.sql.delta.catalog.DeltaTableV2 @@ -36,9 +34,10 @@ import com.dimajix.common.Trilean import com.dimajix.common.Yes import com.dimajix.flowman.catalog.PartitionSpec import com.dimajix.flowman.catalog.TableChange +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution -import com.dimajix.flowman.execution.MergeClause import com.dimajix.flowman.execution.MigrationFailedException import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.MigrationStrategy @@ -59,8 +58,7 @@ case class DeltaTableRelation( override val instanceProperties:Relation.Properties, override val schema:Option[Schema] = None, override val partitions: Seq[PartitionField] = Seq(), - database: String, - table: String, + table: TableIdentifier, location: Option[Path] = None, options: Map[String,String] = Map(), properties: Map[String, String] = Map(), @@ -68,17 +66,13 @@ case class DeltaTableRelation( ) extends DeltaRelation(options, mergeKey) { private val logger = LoggerFactory.getLogger(classOf[DeltaTableRelation]) - private lazy val tableIdentifier: TableIdentifier = { - TableIdentifier(table, Some(database)) - } - /** * Returns the list of all resources which will be created by this relation. 
* * @return */ override def provides: Set[ResourceIdentifier] = { - Set(ResourceIdentifier.ofHiveTable(table, Some(database))) + Set(ResourceIdentifier.ofHiveTable(table)) } /** @@ -87,7 +81,7 @@ case class DeltaTableRelation( * @return */ override def requires: Set[ResourceIdentifier] = { - Set(ResourceIdentifier.ofHiveDatabase(database)) + table.space.headOption.map(ResourceIdentifier.ofHiveDatabase).toSet } /** @@ -104,7 +98,7 @@ case class DeltaTableRelation( requireValidPartitionKeys(partition) val allPartitions = PartitionSchema(this.partitions).interpolate(partition) - allPartitions.map(p =>ResourceIdentifier.ofHivePartition(table, Some(database), p.toMap)).toSet + allPartitions.map(p =>ResourceIdentifier.ofHivePartition(table, p.toMap)).toSet } /** @@ -116,11 +110,11 @@ case class DeltaTableRelation( * @return */ override def read(execution: Execution, partitions: Map[String, FieldValue]): DataFrame = { - logger.info(s"Reading Delta relation '$identifier' from table $tableIdentifier using partition values $partitions") + logger.info(s"Reading Delta relation '$identifier' from table $table using partition values $partitions") val tableDf = execution.spark.read .options(options) - .table(tableIdentifier.quotedString) + .table(table.quotedString) val filteredDf = filterPartition(tableDf, partitions) applyInputSchema(execution, filteredDf) @@ -138,7 +132,7 @@ case class DeltaTableRelation( val partitionSpec = PartitionSchema(partitions).spec(partition) - logger.info(s"Writing Delta relation '$identifier' to table $tableIdentifier partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode'") + logger.info(s"Writing Delta relation '$identifier' to table $table partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode'") val extDf = applyOutputSchema(execution, addPartition(df, partition)) mode match { @@ -147,7 +141,7 @@ case class DeltaTableRelation( case _ => doWrite(extDf, partitionSpec, mode) } - execution.catalog.refreshTable(tableIdentifier) + execution.catalog.refreshTable(table) } private def doWrite(df: DataFrame, partitionSpec: PartitionSpec, mode: OutputMode) : Unit = { val writer = @@ -163,12 +157,12 @@ case class DeltaTableRelation( .format("delta") .options(options) .mode(mode.batchMode) - .insertInto(tableIdentifier.quotedString) + .insertInto(table.quotedString) } private def doUpdate(df: DataFrame, partitionSpec: PartitionSpec) : Unit = { val withinPartitionKeyColumns = if (mergeKey.nonEmpty) mergeKey else schema.map(_.primaryKey).getOrElse(Seq()) val keyColumns = SetIgnoreCase(partitions.map(_.name)) -- partitionSpec.keys ++ withinPartitionKeyColumns - val table = DeltaTable.forName(df.sparkSession, tableIdentifier.quotedString) + val table = DeltaTable.forName(df.sparkSession, this.table.quotedString) DeltaUtils.upsert(table, df, keyColumns, partitionSpec) } @@ -180,8 +174,8 @@ case class DeltaTableRelation( * @return */ override def readStream(execution: Execution): DataFrame = { - logger.info(s"Streaming from Delta table relation '$identifier' at $tableIdentifier") - val location = DeltaUtils.getLocation(execution, tableIdentifier) + logger.info(s"Streaming from Delta table relation '$identifier' at $table") + val location = DeltaUtils.getLocation(execution, table.toSpark) readStreamFrom(execution, location) } @@ -193,8 +187,8 @@ case class DeltaTableRelation( * @return */ override def writeStream(execution: Execution, df: DataFrame, mode: OutputMode, trigger: Trigger, checkpointLocation: Path): StreamingQuery = { - logger.info(s"Streaming 
to Delta table relation '$identifier' $tableIdentifier") - val location = DeltaUtils.getLocation(execution, tableIdentifier) + logger.info(s"Streaming to Delta table relation '$identifier' $table") + val location = DeltaUtils.getLocation(execution, table.toSpark) writeStreamTo(execution, df, location, mode, trigger, checkpointLocation) } @@ -207,7 +201,7 @@ case class DeltaTableRelation( * @return */ override def exists(execution: Execution): Trilean = { - execution.catalog.tableExists(tableIdentifier) + execution.catalog.tableExists(table) } @@ -220,8 +214,8 @@ case class DeltaTableRelation( */ override def conforms(execution: Execution, migrationPolicy: MigrationPolicy): Trilean = { val catalog = execution.catalog - if (catalog.tableExists(tableIdentifier)) { - val table = catalog.getTable(tableIdentifier) + if (catalog.tableExists(table)) { + val table = catalog.getTable(this.table) if (table.tableType == CatalogTableType.VIEW) { false } @@ -229,7 +223,9 @@ case class DeltaTableRelation( val table = deltaCatalogTable(execution) val sourceSchema = com.dimajix.flowman.types.StructType.of(table.schema()) val targetSchema = com.dimajix.flowman.types.SchemaUtils.replaceCharVarchar(fullSchema.get) - !TableChange.requiresMigration(sourceSchema, targetSchema, migrationPolicy) + val sourceTable = TableDefinition(this.table, sourceSchema.fields) + val targetTable = TableDefinition(this.table, targetSchema.fields) + !TableChange.requiresMigration(sourceTable, targetTable, migrationPolicy) } else { true @@ -256,15 +252,15 @@ case class DeltaTableRelation( requireValidPartitionKeys(partition) val catalog = execution.catalog - if (!catalog.tableExists(tableIdentifier)) { + if (!catalog.tableExists(table)) { false } else if (partitions.nonEmpty) { val partitionSpec = PartitionSchema(partitions).spec(partition) - DeltaUtils.isLoaded(execution, tableIdentifier, partitionSpec) + DeltaUtils.isLoaded(execution, table.toSpark, partitionSpec) } else { - val location = catalog.getTableLocation(tableIdentifier) + val location = catalog.getTableLocation(table) DeltaUtils.isLoaded(execution, location) } } @@ -279,23 +275,25 @@ case class DeltaTableRelation( val tableExists = exists(execution) == Yes if (!ifNotExists || !tableExists) { val sparkSchema = HiveTableRelation.cleanupSchema(StructType(fields.map(_.catalogField))) - logger.info(s"Creating Delta table relation '$identifier' with table $tableIdentifier and schema\n${sparkSchema.treeString}") + logger.info(s"Creating Delta table relation '$identifier' with table $table and schema\n${sparkSchema.treeString}") if (schema.isEmpty) { throw new UnspecifiedSchemaException(identifier) } if (tableExists) - throw new TableAlreadyExistsException(database, table) + throw new TableAlreadyExistsException(table.database.getOrElse(""), table.table) DeltaUtils.createTable( execution, - Some(tableIdentifier), + Some(table.toSpark), location, sparkSchema, partitions, properties, description ) + + provides.foreach(execution.refreshResource) } } @@ -309,15 +307,15 @@ case class DeltaTableRelation( requireValidPartitionKeys(partitions) if (partitions.nonEmpty) { - val deltaTable = DeltaTable.forName(execution.spark, tableIdentifier.quotedString) + val deltaTable = DeltaTable.forName(execution.spark, table.quotedString) PartitionSchema(this.partitions).interpolate(partitions).foreach { p => deltaTable.delete(p.predicate) } deltaTable.vacuum() } else { - logger.info(s"Truncating Delta table relation '$identifier' by truncating table $tableIdentifier") - val deltaTable = 
DeltaTable.forName(execution.spark, tableIdentifier.quotedString) + logger.info(s"Truncating Delta table relation '$identifier' by truncating table $table") + val deltaTable = DeltaTable.forName(execution.spark, table.quotedString) deltaTable.delete() deltaTable.vacuum() } @@ -333,9 +331,10 @@ case class DeltaTableRelation( require(execution != null) val catalog = execution.catalog - if (!ifExists || catalog.tableExists(tableIdentifier)) { - logger.info(s"Destroying Delta table relation '$identifier' by dropping table $tableIdentifier") - catalog.dropTable(tableIdentifier) + if (!ifExists || catalog.tableExists(table)) { + logger.info(s"Destroying Delta table relation '$identifier' by dropping table $table") + catalog.dropTable(table) + provides.foreach(execution.refreshResource) } } @@ -348,18 +347,18 @@ case class DeltaTableRelation( require(execution != null) val catalog = execution.catalog - if (catalog.tableExists(tableIdentifier)) { - val table = catalog.getTable(tableIdentifier) + if (catalog.tableExists(table)) { + val table = catalog.getTable(this.table) if (table.tableType == CatalogTableType.VIEW) { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for HiveTable relation '$identifier' from VIEW to a TABLE $tableIdentifier, but migrations are disabled.") + logger.warn(s"Migration required for HiveTable relation '$identifier' from VIEW to a TABLE ${this.table}, but migrations are disabled.") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation HiveTable '$identifier' from VIEW to a TABLE $tableIdentifier, since migrations are disabled.") + logger.error(s"Cannot migrate relation HiveTable '$identifier' from VIEW to a TABLE ${this.table}, since migrations are disabled.") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER|MigrationStrategy.ALTER_REPLACE|MigrationStrategy.REPLACE => - logger.warn(s"TABLE target $tableIdentifier is currently a VIEW, dropping...") - catalog.dropView(tableIdentifier, false) + logger.warn(s"TABLE target ${this.table} is currently a VIEW, dropping...") + catalog.dropView(this.table, false) create(execution, false) } } @@ -370,27 +369,26 @@ case class DeltaTableRelation( } override protected def deltaTable(execution: Execution) : DeltaTable = { - DeltaTable.forName(execution.spark, tableIdentifier.quotedString) + DeltaTable.forName(execution.spark, table.quotedString) } override protected def deltaCatalogTable(execution: Execution): DeltaTableV2 = { val catalog = execution.catalog - val table = catalog.getTable(tableIdentifier) + val table = catalog.getTable(this.table) DeltaTableV2( execution.spark, new Path(table.location), catalogTable = Some(table), - tableIdentifier = Some(tableIdentifier.toString()) + tableIdentifier = Some(this.table.toString()) ) } } - @RelationType(kind="deltaTable") class DeltaTableRelationSpec extends RelationSpec with SchemaRelationSpec with PartitionedRelationSpec { - @JsonProperty(value = "database", required = false) private var database: String = "default" + @JsonProperty(value = "database", required = false) private var database: Option[String] = Some("default") @JsonProperty(value = "table", required = true) private var table: String = "" @JsonProperty(value = "location", required = false) private var location: Option[String] = None @JsonProperty(value = "options", required=false) private var options:Map[String,String] = Map() @@ -402,8 +400,7 @@ class DeltaTableRelationSpec extends RelationSpec with SchemaRelationSpec with P
instanceProperties(context), schema.map(_.instantiate(context)), partitions.map(_.instantiate(context)), - context.evaluate(database), - context.evaluate(table), + TableIdentifier(context.evaluate(table), context.evaluate(database)), context.evaluate(location).map(p => new Path(p)), context.evaluate(options), context.evaluate(properties), diff --git a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/target/DeltaVacuumTarget.scala b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/target/DeltaVacuumTarget.scala index 8c8a587fc..9ccc9c092 100644 --- a/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/target/DeltaVacuumTarget.scala +++ b/flowman-plugins/delta/src/main/scala/com/dimajix/flowman/spec/target/DeltaVacuumTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -21,11 +21,9 @@ import java.time.Duration import com.fasterxml.jackson.annotation.JsonProperty import io.delta.tables.DeltaTable import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.delta.DeltaLog import org.apache.spark.sql.functions import org.apache.spark.sql.functions.col -import org.apache.spark.sql.functions.count import org.apache.spark.sql.functions.lit import org.slf4j.LoggerFactory @@ -99,7 +97,7 @@ case class DeltaVacuumTarget( */ override protected def build(execution: Execution): Unit = { val deltaTable = relation.value match { - case table:DeltaTableRelation => DeltaTable.forName(execution.spark, TableIdentifier(table.table, Some(table.database)).toString()) + case table:DeltaTableRelation => DeltaTable.forName(execution.spark, table.table.toString()) case files:DeltaFileRelation => DeltaTable.forPath(execution.spark, files.location.toString) case rel:Relation => throw new IllegalArgumentException(s"DeltaVacuumTarget only supports relations of type deltaTable and deltaFiles, but it was given relation '${rel.identifier}' of kind '${rel.kind}'") } @@ -120,14 +118,14 @@ case class DeltaVacuumTarget( */ override def link(linker: Linker, phase:Phase): Unit = { if (phase == Phase.BUILD) { - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) } } private def compact(deltaTable:DeltaTable) : Unit = { val spark = deltaTable.toDF.sparkSession val deltaLog = relation.value match { - case table:DeltaTableRelation => DeltaLog.forTable(spark, TableIdentifier(table.table, Some(table.database))) + case table:DeltaTableRelation => DeltaLog.forTable(spark, table.table.toSpark) case files:DeltaFileRelation => DeltaLog.forTable(spark, files.location.toString) case rel:Relation => throw new IllegalArgumentException(s"DeltaVacuumTarget only supports relations of type deltaTable and deltaFiles, but it was given relation '${rel.identifier}' of kind '${rel.kind}'") } @@ -149,7 +147,7 @@ case class DeltaVacuumTarget( filter.map(writer.option("replaceWhere", _)) relation.value match { - case table:DeltaTableRelation => writer.insertInto(TableIdentifier(table.table, Some(table.database)).toString()) + case table:DeltaTableRelation => writer.insertInto(table.table.toString()) case files:DeltaFileRelation => writer.save(files.location.toString) case rel:Relation => throw new IllegalArgumentException(s"DeltaVacuumTarget only supports relations of type deltaTable and deltaFiles, but it was given relation 
'${rel.identifier}' of kind '${rel.kind}'") } diff --git a/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/relation/DeltaTableRelationTest.scala b/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/relation/DeltaTableRelationTest.scala index 443bbaeb6..b96369133 100644 --- a/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/relation/DeltaTableRelationTest.scala +++ b/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/relation/DeltaTableRelationTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -24,7 +24,6 @@ import io.delta.sql.DeltaSparkSessionExtension import org.apache.hadoop.fs.Path import org.apache.spark.sql.Row import org.apache.spark.sql.SparkSession -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.catalog.CatalogTableType @@ -40,6 +39,7 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.InsertClause import com.dimajix.flowman.execution.MigrationFailedException @@ -50,14 +50,12 @@ import com.dimajix.flowman.execution.Session import com.dimajix.flowman.execution.UpdateClause import com.dimajix.flowman.model.PartitionField import com.dimajix.flowman.model.Relation -import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.Schema import com.dimajix.flowman.spec.ObjectMapper import com.dimajix.flowman.spec.schema.EmbeddedSchema import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.SingleValue import com.dimajix.flowman.{types => ftypes} -import com.dimajix.spark.sql.SchemaUtils import com.dimajix.spark.sql.streaming.StreamingUtils import com.dimajix.spark.testing.LocalSparkSession import com.dimajix.spark.testing.QueryTest @@ -86,8 +84,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = relationSpec.instantiate(session.context).asInstanceOf[DeltaTableRelation] relation.description should be (Some("Some Delta Table")) relation.partitions should be (Seq()) - relation.database should be ("some_db") - relation.table should be ("some_table") + relation.table should be (TableIdentifier("some_table", Some("some_db"))) relation.location should be (Some(new Path("hdfs://ns/some/path"))) relation.options should be (Map()) relation.properties should be (Map()) @@ -107,8 +104,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe Field("int_col", ftypes.IntegerType) )) ), - database = "default", - table = "delta_table" + table = TableIdentifier("delta_table", Some("default")) ) relation.fields should be (Seq( @@ -141,7 +137,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("delta_table", Some("default"))) - table_1.identifier should be (TableIdentifier("delta_table", Some("default"))) + table_1.identifier should be (TableIdentifier("delta_table", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) table_1.schema should be 
(StructType(Seq())) table_1.dataSchema should be (StructType(Seq())) @@ -216,8 +212,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val location = new File(tempDir, "delta/default/lala2") val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -254,7 +249,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("delta_table2", Some("default"))) - table_1.identifier should be (TableIdentifier("delta_table2", Some("default"))) + table_1.identifier should be (TableIdentifier("delta_table2", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.EXTERNAL) table_1.schema should be (StructType(Seq())) table_1.dataSchema should be (StructType(Seq())) @@ -373,8 +368,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -479,8 +473,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val location = new File(tempDir, "delta/default/lala2") val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -575,8 +568,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -632,8 +624,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -737,8 +728,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation0 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -767,8 +757,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe // == Check ================================================================================================= val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), partitions = Seq( PartitionField("part", ftypes.StringType) ) @@ -892,8 +881,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with 
Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -967,8 +955,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table = TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1157,8 +1144,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val location = new File(tempDir, "delta/default/lala3") val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table2", + table =TableIdentifier("delta_table2", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1240,8 +1226,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val rel_1 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table =TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1252,8 +1237,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe ) val rel_2 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1278,7 +1262,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("delta_table", Some("default"))) - table_1.identifier should be (TableIdentifier("delta_table", Some("default"))) + table_1.identifier should be (TableIdentifier("delta_table", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) table_1.schema should be (StructType(Seq())) table_1.dataSchema should be (StructType(Seq())) @@ -1354,8 +1338,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val rel_1 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1366,8 +1349,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe ) val rel_2 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1447,8 +1429,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val rel_1 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = 
Seq( @@ -1459,8 +1440,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe ) val rel_2 = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1549,8 +1529,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "streaming_test", + table = TableIdentifier("streaming_test", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( @@ -1610,8 +1589,7 @@ class DeltaTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSe val relation = DeltaTableRelation( Relation.Properties(context, "delta_relation"), - database = "default", - table = "streaming_test", + table = TableIdentifier("streaming_test", Some("default")), schema = Some(EmbeddedSchema( Schema.Properties(context, "delta_schema"), fields = Seq( diff --git a/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/target/DeltaVacuumTargetTest.scala b/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/target/DeltaVacuumTargetTest.scala index b1963db40..7434ffafa 100644 --- a/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/target/DeltaVacuumTargetTest.scala +++ b/flowman-plugins/delta/src/test/scala/com/dimajix/flowman/spec/target/DeltaVacuumTargetTest.scala @@ -23,14 +23,17 @@ import io.delta.sql.DeltaSparkSessionExtension import org.apache.hadoop.fs.Path import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions.col +import org.apache.spark.sql.{types => stypes} import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Unknown import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Phase import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.model.PartitionField import com.dimajix.flowman.model.Prototype import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.Schema @@ -44,9 +47,6 @@ import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.IntegerType import com.dimajix.flowman.types.StringType import com.dimajix.spark.testing.LocalSparkSession -import org.apache.spark.sql.{types => stypes} - -import com.dimajix.flowman.model.PartitionField class DeltaVacuumTargetTest extends AnyFlatSpec with Matchers with LocalSparkSession { @@ -149,8 +149,7 @@ class DeltaVacuumTargetTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("int_col", IntegerType) ) )), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), location = Some(new Path(location.toURI)) ) @@ -193,8 +192,7 @@ class DeltaVacuumTargetTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("int_col", IntegerType) ) )), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", Some("default")), location = Some(new Path(location.toURI)) ) @@ -252,8 +250,7 @@ class DeltaVacuumTargetTest extends AnyFlatSpec with Matchers with LocalSparkSes partitions = Seq( PartitionField("part", StringType) ), - database = "default", - table = "delta_table", + table = TableIdentifier("delta_table", 
Some("default")), location = Some(new Path(location.toURI)) ) diff --git a/flowman-plugins/impala/pom.xml b/flowman-plugins/impala/pom.xml index 2cfd8bef3..ad6ed5ff9 100644 --- a/flowman-plugins/impala/pom.xml +++ b/flowman-plugins/impala/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/json/pom.xml b/flowman-plugins/json/pom.xml index 4ed4c1b1c..e642b9643 100644 --- a/flowman-plugins/json/pom.xml +++ b/flowman-plugins/json/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/kafka/pom.xml b/flowman-plugins/kafka/pom.xml index bf8bc2360..ff2103606 100644 --- a/flowman-plugins/kafka/pom.xml +++ b/flowman-plugins/kafka/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/kafka/src/main/scala/com/dimajix/flowman/spec/relation/KafkaRelation.scala b/flowman-plugins/kafka/src/main/scala/com/dimajix/flowman/spec/relation/KafkaRelation.scala index b3a760990..ce443d7f7 100644 --- a/flowman-plugins/kafka/src/main/scala/com/dimajix/flowman/spec/relation/KafkaRelation.scala +++ b/flowman-plugins/kafka/src/main/scala/com/dimajix/flowman/spec/relation/KafkaRelation.scala @@ -121,8 +121,10 @@ case class KafkaRelation( * @param execution * @return */ - override def describe(execution: Execution): types.StructType = { - types.StructType(fields) + override def describe(execution: Execution, partitions:Map[String,FieldValue] = Map()): types.StructType = { + val result = types.StructType(fields) + + applyDocumentation(result) } /** diff --git a/flowman-plugins/mariadb/pom.xml b/flowman-plugins/mariadb/pom.xml index 0922788e8..4dd28aaf4 100644 --- a/flowman-plugins/mariadb/pom.xml +++ b/flowman-plugins/mariadb/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/mssqlserver/pom.xml b/flowman-plugins/mssqlserver/pom.xml index 9377b16a2..fb7313ff3 100644 --- a/flowman-plugins/mssqlserver/pom.xml +++ b/flowman-plugins/mssqlserver/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml @@ -18,8 +18,55 @@ ${project.version} ${project.build.finalName}.jar 9.2.1.jre8 + 1.2.0 + _${scala.api_version} + + + CDH-6.3 + + 1.0.2 + + + + + CDP-7.1 + + 1.0.2 + + + + + spark-2.4 + + 1.0.2 + + + + + spark-3.0 + + 1.1.0 + _${scala.api_version} + + + + spark-3.1 + + 1.2.0 + _${scala.api_version} + + + + spark-3.2 + + 1.2.0 + _${scala.api_version} + + + + @@ -28,6 +75,14 @@ + + net.alchim31.maven + scala-maven-plugin + + + org.scalatest + scalatest-maven-plugin + org.apache.maven.plugins maven-assembly-plugin @@ -38,8 +93,38 @@ com.dimajix.flowman - flowman-core - provided + flowman-spec + + + + com.dimajix.flowman + flowman-spark-testing + test + + + + org.apache.spark + spark-core_${scala.api_version} + + + + org.apache.spark + spark-sql_${scala.api_version} + + + + org.apache.hadoop + hadoop-client + + + + org.apache.spark + spark-hive_${scala.api_version} + + + + com.fasterxml.jackson.dataformat + jackson-dataformat-yaml @@ -47,7 +132,11 @@ mssql-jdbc ${mssqlserver-java-client.version} + + + com.microsoft.azure + spark-mssql-connector${spark-mssql-connector.suffix} + ${spark-mssql-connector.version} + - - diff --git a/flowman-plugins/mssqlserver/src/main/assembly/assembly.xml b/flowman-plugins/mssqlserver/src/main/assembly/assembly.xml index 9dc35b6db..89fea5c4e 100644 --- a/flowman-plugins/mssqlserver/src/main/assembly/assembly.xml 
+++ b/flowman-plugins/mssqlserver/src/main/assembly/assembly.xml @@ -22,11 +22,17 @@ plugins/${plugin.name} - false - false + true + true false runtime true + + com.dimajix.flowman:flowman-spec + org.scala-lang.modules:scala-collection-compat_${scala.api_version} + org.apache.hadoop:hadoop-client-api + org.apache.hadoop:hadoop-client-runtime + diff --git a/flowman-plugins/mssqlserver/src/main/resources/plugin.yml b/flowman-plugins/mssqlserver/src/main/resources/plugin.yml index e2f82485b..da571a151 100644 --- a/flowman-plugins/mssqlserver/src/main/resources/plugin.yml +++ b/flowman-plugins/mssqlserver/src/main/resources/plugin.yml @@ -3,4 +3,6 @@ description: ${project.name} version: ${plugin.version} isolation: false jars: + - ${plugin.jar} - mssql-jdbc-${mssqlserver-java-client.version}.jar + - spark-mssql-connector${spark-mssql-connector.suffix}-${spark-mssql-connector.version}.jar diff --git a/flowman-plugins/mssqlserver/src/main/scala/com/dimajix/flowman/spec/relation/SqlServerRelation.scala b/flowman-plugins/mssqlserver/src/main/scala/com/dimajix/flowman/spec/relation/SqlServerRelation.scala new file mode 100644 index 000000000..65dac09ac --- /dev/null +++ b/flowman-plugins/mssqlserver/src/main/scala/com/dimajix/flowman/spec/relation/SqlServerRelation.scala @@ -0,0 +1,160 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.relation + +import scala.collection.mutable + +import com.fasterxml.jackson.annotation.JsonProperty +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.SaveMode +import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions + +import com.dimajix.flowman.catalog +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.jdbc.JdbcUtils +import com.dimajix.flowman.jdbc.SqlDialects +import com.dimajix.flowman.model.Connection +import com.dimajix.flowman.model.PartitionField +import com.dimajix.flowman.model.Reference +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.Schema +import com.dimajix.flowman.spec.annotation.RelationType +import com.dimajix.flowman.spec.connection.ConnectionReferenceSpec +import com.dimajix.flowman.spec.connection.JdbcConnection +import com.dimajix.flowman.types.StructType + + +case class SqlServerRelation( + override val instanceProperties:Relation.Properties, + override val schema:Option[Schema] = None, + override val partitions: Seq[PartitionField] = Seq.empty, + connection: Reference[Connection], + properties: Map[String,String] = Map.empty, + table: Option[TableIdentifier] = None, + query: Option[String] = None, + mergeKey: Seq[String] = Seq.empty, + primaryKey: Seq[String] = Seq.empty, + indexes: Seq[TableIndex] = Seq.empty +) extends JdbcRelationBase(instanceProperties, schema, partitions, connection, properties, table, query, mergeKey, primaryKey, indexes) { + private val tempTableIdentifier = TableIdentifier(s"##${tableIdentifier.table}_temp_staging") + + override protected def doOverwriteAll(execution: Execution, df:DataFrame) : Unit = { + withConnection { (con, options) => + createTempTable(con, options, StructType.of(df.schema)) + logger.info(s"Writing new data into temporary staging table '${tempTableIdentifier}'") + appendTable(execution, df, tempTableIdentifier) + + withTransaction(con) { + withStatement(con, options) { case (statement, options) => + val dialect = SqlDialects.get(options.url) + logger.info(s"Truncating table '${tableIdentifier}'") + statement.executeUpdate(s"TRUNCATE TABLE ${dialect.quote(tableIdentifier)}") + logger.info(s"Copying data from temporary staging table '${tempTableIdentifier}' into table '${tableIdentifier}'") + statement.executeUpdate(s"INSERT INTO ${dialect.quote(tableIdentifier)} SELECT * FROM ${dialect.quote(tempTableIdentifier)}") + logger.info(s"Dropping temporary staging table '${tempTableIdentifier}'") + statement.executeUpdate(s"DROP TABLE ${dialect.quote(tempTableIdentifier)}") + } + } + } + } + override protected def doAppend(execution: Execution, df:DataFrame): Unit = { + withConnection { (con, options) => + createTempTable(con, options, StructType.of(df.schema)) + logger.info(s"Writing new data into temporary staging table '${tempTableIdentifier}'") + appendTable(execution, df, tempTableIdentifier) + + withTransaction(con) { + withStatement(con, options) { case (statement, options) => + val dialect = SqlDialects.get(options.url) + logger.info(s"Copying data from temporary staging table '${tempTableIdentifier}' into table '${tableIdentifier}'") + statement.executeUpdate(s"INSERT INTO ${dialect.quote(tableIdentifier)} SELECT * FROM ${dialect.quote(tempTableIdentifier)}") + logger.info(s"Dropping temporary staging table '${tempTableIdentifier}'") + 
statement.executeUpdate(s"DROP TABLE ${dialect.quote(tempTableIdentifier)}") + } + } + } + } + + private def appendTable(execution: Execution, df:DataFrame, table:TableIdentifier): Unit = { + val (_,props) = createConnectionProperties() + this.writer(execution, df, "com.microsoft.sqlserver.jdbc.spark", Map(), SaveMode.Append) + .options(props ++ Map("tableLock" -> "true", "mssqlIsolationLevel" -> "READ_UNCOMMITTED")) + .option(JDBCOptions.JDBC_TABLE_NAME, table.unquotedString) + .save() + } + private def createTempTable(con:java.sql.Connection,options: JDBCOptions, schema:StructType) : Unit = { + logger.info(s"Creating temporary staging table '${tempTableIdentifier}' with schema\n${schema.treeString}") + + // First drop temp table if it already exists + withStatement(con, options) { case (statement, options) => + val dialect = SqlDialects.get(options.url) + statement.executeUpdate(s"DROP TABLE IF EXISTS ${dialect.quote(tempTableIdentifier)}") + } + + // Create temp table with specified schema, but without any primary key or indices + val table = catalog.TableDefinition( + tempTableIdentifier, + schema.fields + ) + JdbcUtils.createTable(con, table, options) + } + + override protected def createConnectionProperties() : (String,Map[String,String]) = { + val connection = this.connection.value.asInstanceOf[JdbcConnection] + val props = mutable.Map[String,String]() + props.put(JDBCOptions.JDBC_URL, connection.url) + props.put(JDBCOptions.JDBC_DRIVER_CLASS, "com.microsoft.sqlserver.jdbc.SQLServerDriver") + connection.username.foreach(props.put("user", _)) + connection.password.foreach(props.put("password", _)) + + connection.properties.foreach(kv => props.put(kv._1, kv._2)) + properties.foreach(kv => props.put(kv._1, kv._2)) + + (connection.url,props.toMap) + } +} + + + +@RelationType(kind="sqlserver") +class SqlServerRelationSpec extends RelationSpec with PartitionedRelationSpec with SchemaRelationSpec with IndexedRelationSpec { + @JsonProperty(value = "connection", required = true) private var connection: ConnectionReferenceSpec = _ + @JsonProperty(value = "properties", required = false) private var properties: Map[String, String] = Map.empty + @JsonProperty(value = "database", required = false) private var database: Option[String] = None + @JsonProperty(value = "table", required = false) private var table: Option[String] = None + @JsonProperty(value = "query", required = false) private var query: Option[String] = None + @JsonProperty(value = "mergeKey", required = false) private var mergeKey: Seq[String] = Seq.empty + @JsonProperty(value = "primaryKey", required = false) private var primaryKey: Seq[String] = Seq.empty + + override def instantiate(context: Context): SqlServerRelation = { + new SqlServerRelation( + instanceProperties(context), + schema.map(_.instantiate(context)), + partitions.map(_.instantiate(context)), + connection.instantiate(context), + context.evaluate(properties), + context.evaluate(table).map(t => TableIdentifier(t, context.evaluate(database))), + context.evaluate(query), + mergeKey.map(context.evaluate), + primaryKey.map(context.evaluate), + indexes.map(_.instantiate(context)) + ) + } +} diff --git a/flowman-plugins/mssqlserver/src/test/scala/com/dimajix/flowman/spec/relation/SqlServerRelationTest.scala b/flowman-plugins/mssqlserver/src/test/scala/com/dimajix/flowman/spec/relation/SqlServerRelationTest.scala new file mode 100644 index 000000000..d2ee55a74 --- /dev/null +++ 
b/flowman-plugins/mssqlserver/src/test/scala/com/dimajix/flowman/spec/relation/SqlServerRelationTest.scala @@ -0,0 +1,73 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.relation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.model.ConnectionIdentifier +import com.dimajix.flowman.model.Schema +import com.dimajix.flowman.model.ValueConnectionReference +import com.dimajix.flowman.spec.ObjectMapper +import com.dimajix.flowman.spec.schema.EmbeddedSchema +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.IntegerType +import com.dimajix.flowman.types.StringType + + +class SqlServerRelationTest extends AnyFlatSpec with Matchers { + "The JdbcRelation" should "support embedding the connection" in { + val spec = + s""" + |kind: sqlserver + |name: some_relation + |description: "This is a test table" + |connection: + | kind: jdbc + | name: some_connection + | url: some_url + |table: lala_001 + |schema: + | kind: inline + | fields: + | - name: str_col + | type: string + | - name: int_col + | type: integer + """.stripMargin + + val relationSpec = ObjectMapper.parse[RelationSpec](spec).asInstanceOf[SqlServerRelationSpec] + + val session = Session.builder().disableSpark().build() + val context = session.context + + val relation = relationSpec.instantiate(context) + relation shouldBe a[SqlServerRelation] + relation.name should be ("some_relation") + relation.schema should be (Some(EmbeddedSchema( + Schema.Properties(context, name="embedded", kind="inline"), + fields = Seq( + Field("str_col", StringType), + Field("int_col", IntegerType) + ) + ))) + relation.connection shouldBe a[ValueConnectionReference] + relation.connection.identifier should be (ConnectionIdentifier("some_connection")) + relation.connection.name should be ("some_connection") + } +} diff --git a/flowman-plugins/mysql/pom.xml b/flowman-plugins/mysql/pom.xml index 521d687a5..dd867d237 100644 --- a/flowman-plugins/mysql/pom.xml +++ b/flowman-plugins/mysql/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/openapi/pom.xml b/flowman-plugins/openapi/pom.xml index 14388c033..f17d9413f 100644 --- a/flowman-plugins/openapi/pom.xml +++ b/flowman-plugins/openapi/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-plugins/swagger/pom.xml b/flowman-plugins/swagger/pom.xml index 5facc043d..a4bf2afc7 100644 --- a/flowman-plugins/swagger/pom.xml +++ b/flowman-plugins/swagger/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../../pom.xml diff --git a/flowman-scalatest-compat/pom.xml b/flowman-scalatest-compat/pom.xml index 5aacd7233..b1c03e057 100644 --- a/flowman-scalatest-compat/pom.xml +++ b/flowman-scalatest-compat/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root 
- 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-server-ui/package-lock.json b/flowman-server-ui/package-lock.json index a7435d279..520f418ed 100644 --- a/flowman-server-ui/package-lock.json +++ b/flowman-server-ui/package-lock.json @@ -2473,9 +2473,9 @@ } }, "node_modules/@vue/component-compiler-utils": { - "version": "3.2.2", - "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.2.2.tgz", - "integrity": "sha512-rAYMLmgMuqJFWAOb3Awjqqv5X3Q3hVr4jH/kgrFJpiU0j3a90tnNBplqbj+snzrgZhC9W128z+dtgMifOiMfJg==", + "version": "3.3.0", + "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.3.0.tgz", + "integrity": "sha512-97sfH2mYNU+2PzGrmK2haqffDpVASuib9/w2/noxiFi31Z54hW+q3izKQXXQZSNhtiUpAI36uSuYepeBe4wpHQ==", "dev": true, "dependencies": { "consolidate": "^0.15.1", @@ -2488,7 +2488,7 @@ "vue-template-es2015-compiler": "^1.9.0" }, "optionalDependencies": { - "prettier": "^1.18.2" + "prettier": "^1.18.2 || ^2.0.0" } }, "node_modules/@vue/component-compiler-utils/node_modules/hash-sum": { @@ -7287,11 +7287,22 @@ } }, "node_modules/follow-redirects": { - "version": "1.14.4", - "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.4.tgz", - "integrity": "sha512-zwGkiSXC1MUJG/qmeIFH2HBJx9u0V46QGUe3YR1fXG8bXQxq7fLj0RjLZQ5nubr9qNJUZrH+xUcwXEoXNpfS+g==", + "version": "1.14.8", + "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.8.tgz", + "integrity": "sha512-1x0S9UVJHsQprFcEC/qnNzBLcIxsjAV905f/UkQxbclCsoTWlacCNOpQa/anodLl2uaEKFhfWOvM2Qg77+15zA==", + "funding": [ + { + "type": "individual", + "url": "https://github.com/sponsors/RubenVerborgh" + } + ], "engines": { "node": ">=4.0" + }, + "peerDependenciesMeta": { + "debug": { + "optional": true + } } }, "node_modules/for-in": { @@ -9263,9 +9274,9 @@ "dev": true }, "node_modules/json-schema": { - "version": "0.2.3", - "resolved": "https://registry.npmjs.org/json-schema/-/json-schema-0.2.3.tgz", - "integrity": "sha1-tIDIkuWaLwWVTOcnvT8qTogvnhM=", + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/json-schema/-/json-schema-0.4.0.tgz", + "integrity": "sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA==", "dev": true }, "node_modules/json-schema-traverse": { @@ -9317,18 +9328,18 @@ } }, "node_modules/jsprim": { - "version": "1.4.1", - "resolved": "https://registry.npmjs.org/jsprim/-/jsprim-1.4.1.tgz", - "integrity": "sha1-MT5mvB5cwG5Di8G3SZwuXFastqI=", + "version": "1.4.2", + "resolved": "https://registry.npmjs.org/jsprim/-/jsprim-1.4.2.tgz", + "integrity": "sha512-P2bSOMAc/ciLz6DzgjVlGJP9+BrJWu5UDGK70C2iweC5QBIeFf0ZXRvGjEj2uYgrY2MkAAhsSWHDWlFtEroZWw==", "dev": true, - "engines": [ - "node >=0.6.0" - ], "dependencies": { "assert-plus": "1.0.0", "extsprintf": "1.3.0", - "json-schema": "0.2.3", + "json-schema": "0.4.0", "verror": "1.10.0" + }, + "engines": { + "node": ">=0.6.0" } }, "node_modules/killable": { @@ -13332,9 +13343,9 @@ "dev": true }, "node_modules/selfsigned": { - "version": "1.10.11", - "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.11.tgz", - "integrity": "sha512-aVmbPOfViZqOZPgRBT0+3u4yZFHpmnIghLMlAcb5/xhp5ZtB/RVnKhz5vl2M32CLXAqR4kha9zfhNg0Lf/sxKA==", + "version": "1.10.14", + "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.14.tgz", + "integrity": "sha512-lkjaiAye+wBZDCBsu5BGi0XiLRxeUlsGod5ZP924CRSEoGuZAw/f7y9RKu28rwTfiHVhdavhB0qH0INV6P1lEA==", "dev": true, "dependencies": { "node-forge": "^0.10.0" @@ 
-13576,9 +13587,9 @@ "dev": true }, "node_modules/shelljs": { - "version": "0.8.4", - "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.4.tgz", - "integrity": "sha512-7gk3UZ9kOfPLIAbslLzyWeGiEqx9e3rxwZM0KE6EL8GlGwjym9Mrlx5/p33bWTu9YG6vcS4MBxYZDHYr5lr8BQ==", + "version": "0.8.5", + "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.5.tgz", + "integrity": "sha512-TiwcRcrkhHvbrZbnRcFYMLl30Dfov3HKqzp5tO5b4pt6G/SezKcYhmDg15zXVBswHmctSAQKznqNW2LO5tTDow==", "dev": true, "dependencies": { "glob": "^7.0.0", @@ -15244,9 +15255,9 @@ } }, "node_modules/url-parse": { - "version": "1.5.3", - "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.3.tgz", - "integrity": "sha512-IIORyIQD9rvj0A4CLWsHkBBJuNqWpFQe224b6j9t/ABmquIS0qDU2pY6kl6AuOrL5OkCXHMCFNe1jBcuAggjvQ==", + "version": "1.5.10", + "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.10.tgz", + "integrity": "sha512-WypcfiRhfeUP9vvF0j6rw0J3hrWrw6iZv3+22h6iRMJ/8z1Tj6XfLP4DsUix5MhMPnXpiHDoKyoZ/bdCkwBCiQ==", "dev": true, "dependencies": { "querystringify": "^2.1.1", @@ -18596,9 +18607,9 @@ } }, "@vue/component-compiler-utils": { - "version": "3.2.2", - "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.2.2.tgz", - "integrity": "sha512-rAYMLmgMuqJFWAOb3Awjqqv5X3Q3hVr4jH/kgrFJpiU0j3a90tnNBplqbj+snzrgZhC9W128z+dtgMifOiMfJg==", + "version": "3.3.0", + "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.3.0.tgz", + "integrity": "sha512-97sfH2mYNU+2PzGrmK2haqffDpVASuib9/w2/noxiFi31Z54hW+q3izKQXXQZSNhtiUpAI36uSuYepeBe4wpHQ==", "dev": true, "requires": { "consolidate": "^0.15.1", @@ -18607,7 +18618,7 @@ "merge-source-map": "^1.1.0", "postcss": "^7.0.36", "postcss-selector-parser": "^6.0.2", - "prettier": "^1.18.2", + "prettier": "^1.18.2 || ^2.0.0", "source-map": "~0.6.1", "vue-template-es2015-compiler": "^1.9.0" }, @@ -22677,9 +22688,9 @@ } }, "follow-redirects": { - "version": "1.14.4", - "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.4.tgz", - "integrity": "sha512-zwGkiSXC1MUJG/qmeIFH2HBJx9u0V46QGUe3YR1fXG8bXQxq7fLj0RjLZQ5nubr9qNJUZrH+xUcwXEoXNpfS+g==" + "version": "1.14.8", + "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.8.tgz", + "integrity": "sha512-1x0S9UVJHsQprFcEC/qnNzBLcIxsjAV905f/UkQxbclCsoTWlacCNOpQa/anodLl2uaEKFhfWOvM2Qg77+15zA==" }, "for-in": { "version": "1.0.2", @@ -24278,9 +24289,9 @@ "dev": true }, "json-schema": { - "version": "0.2.3", - "resolved": "https://registry.npmjs.org/json-schema/-/json-schema-0.2.3.tgz", - "integrity": "sha1-tIDIkuWaLwWVTOcnvT8qTogvnhM=", + "version": "0.4.0", + "resolved": "https://registry.npmjs.org/json-schema/-/json-schema-0.4.0.tgz", + "integrity": "sha512-es94M3nTIfsEPisRafak+HDLfHXnKBhV3vU5eqPcS3flIWqcxJWgXHXiey3YrpaNsanY5ei1VoYEbOzijuq9BA==", "dev": true }, "json-schema-traverse": { @@ -24326,14 +24337,14 @@ } }, "jsprim": { - "version": "1.4.1", - "resolved": "https://registry.npmjs.org/jsprim/-/jsprim-1.4.1.tgz", - "integrity": "sha1-MT5mvB5cwG5Di8G3SZwuXFastqI=", + "version": "1.4.2", + "resolved": "https://registry.npmjs.org/jsprim/-/jsprim-1.4.2.tgz", + "integrity": "sha512-P2bSOMAc/ciLz6DzgjVlGJP9+BrJWu5UDGK70C2iweC5QBIeFf0ZXRvGjEj2uYgrY2MkAAhsSWHDWlFtEroZWw==", "dev": true, "requires": { "assert-plus": "1.0.0", "extsprintf": "1.3.0", - "json-schema": "0.2.3", + "json-schema": "0.4.0", "verror": "1.10.0" } }, @@ -27664,9 +27675,9 @@ "dev": true }, "selfsigned": { - "version": "1.10.11", - 
"resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.11.tgz", - "integrity": "sha512-aVmbPOfViZqOZPgRBT0+3u4yZFHpmnIghLMlAcb5/xhp5ZtB/RVnKhz5vl2M32CLXAqR4kha9zfhNg0Lf/sxKA==", + "version": "1.10.14", + "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.14.tgz", + "integrity": "sha512-lkjaiAye+wBZDCBsu5BGi0XiLRxeUlsGod5ZP924CRSEoGuZAw/f7y9RKu28rwTfiHVhdavhB0qH0INV6P1lEA==", "dev": true, "requires": { "node-forge": "^0.10.0" @@ -27880,9 +27891,9 @@ "dev": true }, "shelljs": { - "version": "0.8.4", - "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.4.tgz", - "integrity": "sha512-7gk3UZ9kOfPLIAbslLzyWeGiEqx9e3rxwZM0KE6EL8GlGwjym9Mrlx5/p33bWTu9YG6vcS4MBxYZDHYr5lr8BQ==", + "version": "0.8.5", + "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.5.tgz", + "integrity": "sha512-TiwcRcrkhHvbrZbnRcFYMLl30Dfov3HKqzp5tO5b4pt6G/SezKcYhmDg15zXVBswHmctSAQKznqNW2LO5tTDow==", "dev": true, "requires": { "glob": "^7.0.0", @@ -29287,9 +29298,9 @@ } }, "url-parse": { - "version": "1.5.3", - "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.3.tgz", - "integrity": "sha512-IIORyIQD9rvj0A4CLWsHkBBJuNqWpFQe224b6j9t/ABmquIS0qDU2pY6kl6AuOrL5OkCXHMCFNe1jBcuAggjvQ==", + "version": "1.5.10", + "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.10.tgz", + "integrity": "sha512-WypcfiRhfeUP9vvF0j6rw0J3hrWrw6iZv3+22h6iRMJ/8z1Tj6XfLP4DsUix5MhMPnXpiHDoKyoZ/bdCkwBCiQ==", "dev": true, "requires": { "querystringify": "^2.1.1", diff --git a/flowman-server-ui/pom.xml b/flowman-server-ui/pom.xml index ad92131a9..caf8dbec3 100644 --- a/flowman-server-ui/pom.xml +++ b/flowman-server-ui/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-server-ui/src/components/JobDetails.vue b/flowman-server-ui/src/components/JobDetails.vue index 5f3e73396..ac6ff9c0b 100644 --- a/flowman-server-ui/src/components/JobDetails.vue +++ b/flowman-server-ui/src/components/JobDetails.vue @@ -2,7 +2,7 @@ gavel - Job '{{properties.project}}/{{properties.name}}' {{ properties.phase }} id {{job}} status {{properties.status}} + Job '{{ details.project }}/{{ details.job }}' {{ details.phase }} id {{ job }} status {{ details.status }} @@ -80,7 +80,7 @@ {{ p[0] }} : {{ p[1] }} @@ -120,6 +120,8 @@ import EnvironmentTable from '@/components/EnvironmentTable.vue' import MetricTable from '@/components/MetricTable.vue' import moment from "moment"; +let hash = require('object-hash'); + export default { name: 'JobDetails', components: {Status,EnvironmentTable,MetricTable}, @@ -130,7 +132,7 @@ export default { data () { return { - properties: {}, + details: {}, metrics: [], targets: [], environment: [] @@ -144,18 +146,7 @@ export default { methods: { refresh() { this.$api.getJobDetails(this.job).then(response => { - this.properties = { - namespace: response.namespace, - project: response.project, - name: response.job, - args: response.args, - phase: response.phase, - status: response.status, - startDt: response.startDateTime, - endDt: response.endDateTime, - parameters: response.args, - metrics: response.metrics - } + this.details = response this.metrics = response.metrics }) @@ -175,6 +166,10 @@ export default { }, duration(dt) { return moment.duration(dt).humanize() + }, + + hash(obj) { + return hash(obj) } } } diff --git a/flowman-server-ui/src/components/MetricTable.vue b/flowman-server-ui/src/components/MetricTable.vue index 3d57d8f6a..b64b684f7 100644 --- a/flowman-server-ui/src/components/MetricTable.vue +++ 
b/flowman-server-ui/src/components/MetricTable.vue @@ -17,13 +17,13 @@ {{ item[1].name }} {{ p[0] }} : {{ p[1] }} @@ -36,11 +36,19 @@ diff --git a/flowman-server-ui/src/components/TargetDetails.vue b/flowman-server-ui/src/components/TargetDetails.vue index 9fdee0c91..ca37c592e 100644 --- a/flowman-server-ui/src/components/TargetDetails.vue +++ b/flowman-server-ui/src/components/TargetDetails.vue @@ -2,7 +2,7 @@ gavel - Target '{{details.project}}/{{details.name}}' {{ details.phase }} id {{target}} status {{details.status}} + Target '{{details.project}}/{{details.target}}' {{ details.phase }} id {{target}} status {{details.status}} Jobs - + - + @@ -22,10 +22,10 @@ Targets - + - + diff --git a/flowman-server/pom.xml b/flowman-server/pom.xml index 1cdadeb8c..53c141516 100644 --- a/flowman-server/pom.xml +++ b/flowman-server/pom.xml @@ -9,7 +9,7 @@ flowman-root com.dimajix.flowman - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-spark-extensions/pom.xml b/flowman-spark-extensions/pom.xml index efcde82ba..8c7a00250 100644 --- a/flowman-spark-extensions/pom.xml +++ b/flowman-spark-extensions/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-spark-extensions/src/main/scala/com/dimajix/spark/sql/sources/sequencefile/SequenceFileOptions.scala b/flowman-spark-extensions/src/main/scala/com/dimajix/spark/sql/sources/sequencefile/SequenceFileOptions.scala index abea493ca..c2414b4cb 100644 --- a/flowman-spark-extensions/src/main/scala/com/dimajix/spark/sql/sources/sequencefile/SequenceFileOptions.scala +++ b/flowman-spark-extensions/src/main/scala/com/dimajix/spark/sql/sources/sequencefile/SequenceFileOptions.scala @@ -93,7 +93,7 @@ object WritableConverter { case StringType => WritableConverter( classOf[Text], classOf[String], - (w:Writable) => UTF8String.fromBytes(w.asInstanceOf[Text].getBytes), + (w:Writable) => UTF8String.fromBytes(w.asInstanceOf[Text].copyBytes()), (row:InternalRow) => if (row.isNullAt(idx)) new Text() else new Text(row.getString(idx)), () => new Text() ) diff --git a/flowman-spark-extensions/src/main/spark-2.4/org/apache/spark/sql/SparkShim.scala b/flowman-spark-extensions/src/main/spark-2.4/org/apache/spark/sql/SparkShim.scala index 7d99bf199..cba359fab 100644 --- a/flowman-spark-extensions/src/main/spark-2.4/org/apache/spark/sql/SparkShim.scala +++ b/flowman-spark-extensions/src/main/spark-2.4/org/apache/spark/sql/SparkShim.scala @@ -34,6 +34,8 @@ import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.SQLExecution import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.columnar.InMemoryRelation +import org.apache.spark.sql.execution.command.AlterViewAsCommand +import org.apache.spark.sql.execution.command.CreateViewCommand import org.apache.spark.sql.execution.command.ViewType import org.apache.spark.sql.execution.datasources.DataSource import org.apache.spark.sql.execution.datasources.FileFormat @@ -97,6 +99,13 @@ object SparkShim { def functionRegistry(spark:SparkSession) : FunctionRegistry = spark.sessionState.functionRegistry + def createView(table:TableIdentifier, select:String, plan:LogicalPlan, allowExisting:Boolean, replace:Boolean) : CreateViewCommand = { + CreateViewCommand(table, Nil, None, Map(), Some(select), plan, allowExisting, replace, SparkShim.PersistedView) + } + def alterView(table:TableIdentifier, select:String, plan:LogicalPlan) : AlterViewAsCommand = { + AlterViewAsCommand(table, select, plan) + } + val LocalTempView : ViewType = 
org.apache.spark.sql.execution.command.LocalTempView val GlobalTempView : ViewType = org.apache.spark.sql.execution.command.GlobalTempView val PersistedView : ViewType = org.apache.spark.sql.execution.command.PersistedView diff --git a/flowman-spark-extensions/src/main/spark-3.0/org/apache/spark/sql/SparkShim.scala b/flowman-spark-extensions/src/main/spark-3.0/org/apache/spark/sql/SparkShim.scala index 9d0946771..ef2978cd6 100644 --- a/flowman-spark-extensions/src/main/spark-3.0/org/apache/spark/sql/SparkShim.scala +++ b/flowman-spark-extensions/src/main/spark-3.0/org/apache/spark/sql/SparkShim.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -37,6 +37,8 @@ import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.SQLExecution import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.columnar.InMemoryRelation +import org.apache.spark.sql.execution.command.AlterViewAsCommand +import org.apache.spark.sql.execution.command.CreateViewCommand import org.apache.spark.sql.execution.datasources.DataSource import org.apache.spark.sql.execution.datasources.FileFormat import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2 @@ -101,6 +103,13 @@ object SparkShim { def functionRegistry(spark:SparkSession) : FunctionRegistry = spark.sessionState.functionRegistry + def createView(table:TableIdentifier, select:String, plan:LogicalPlan, allowExisting:Boolean, replace:Boolean) : CreateViewCommand = { + CreateViewCommand(table, Nil, None, Map(), Some(select), plan, allowExisting, replace, SparkShim.PersistedView) + } + def alterView(table:TableIdentifier, select:String, plan:LogicalPlan) : AlterViewAsCommand = { + AlterViewAsCommand(table, select, plan) + } + val LocalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.LocalTempView val GlobalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.GlobalTempView val PersistedView : ViewType = org.apache.spark.sql.catalyst.analysis.PersistedView diff --git a/flowman-spark-extensions/src/main/spark-3.1/org/apache/spark/sql/SparkShim.scala b/flowman-spark-extensions/src/main/spark-3.1/org/apache/spark/sql/SparkShim.scala index 35307a2ff..b51715c4b 100644 --- a/flowman-spark-extensions/src/main/spark-3.1/org/apache/spark/sql/SparkShim.scala +++ b/flowman-spark-extensions/src/main/spark-3.1/org/apache/spark/sql/SparkShim.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -23,6 +23,7 @@ import java.util.TimeZone import org.apache.spark.SparkConf import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.internal.config.ConfigEntry +import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.analysis.ViewType import org.apache.spark.sql.catalyst.expressions.Expression @@ -32,6 +33,8 @@ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan import org.apache.spark.sql.catalyst.util.IntervalUtils import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.SQLExecution +import org.apache.spark.sql.execution.command.AlterViewAsCommand +import org.apache.spark.sql.execution.command.CreateViewCommand import org.apache.spark.sql.execution.datasources.DataSource import org.apache.spark.sql.execution.datasources.FileFormat import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2 @@ -101,6 +104,13 @@ object SparkShim { def functionRegistry(spark:SparkSession) : FunctionRegistry = spark.sessionState.functionRegistry + def createView(table:TableIdentifier, select:String, plan:LogicalPlan, allowExisting:Boolean, replace:Boolean) : CreateViewCommand = { + CreateViewCommand(table, Nil, None, Map(), Some(select), plan, allowExisting, replace, SparkShim.PersistedView) + } + def alterView(table:TableIdentifier, select:String, plan:LogicalPlan) : AlterViewAsCommand = { + AlterViewAsCommand(table, select, plan) + } + val LocalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.LocalTempView val GlobalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.GlobalTempView val PersistedView : ViewType = org.apache.spark.sql.catalyst.analysis.PersistedView diff --git a/flowman-spark-extensions/src/main/spark-3.2/org/apache/spark/sql/SparkShim.scala b/flowman-spark-extensions/src/main/spark-3.2/org/apache/spark/sql/SparkShim.scala index 5c877f668..539d1dc6e 100644 --- a/flowman-spark-extensions/src/main/spark-3.2/org/apache/spark/sql/SparkShim.scala +++ b/flowman-spark-extensions/src/main/spark-3.2/org/apache/spark/sql/SparkShim.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -42,6 +42,8 @@ import org.apache.spark.sql.execution.QueryExecution import org.apache.spark.sql.execution.SQLExecution import org.apache.spark.sql.execution.SparkPlan import org.apache.spark.sql.execution.columnar.InMemoryRelation +import org.apache.spark.sql.execution.command.AlterViewAsCommand +import org.apache.spark.sql.execution.command.CreateViewCommand import org.apache.spark.sql.execution.datasources.DataSource import org.apache.spark.sql.execution.datasources.FileFormat import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2 @@ -109,6 +111,13 @@ object SparkShim { def functionRegistry(spark:SparkSession) : FunctionRegistry = spark.sessionState.functionRegistry + def createView(table:TableIdentifier, select:String, plan:LogicalPlan, allowExisting:Boolean, replace:Boolean) : CreateViewCommand = { + CreateViewCommand(table, Nil, None, Map(), Some(select), plan, allowExisting, replace, SparkShim.PersistedView, isAnalyzed=true) + } + def alterView(table:TableIdentifier, select:String, plan:LogicalPlan) : AlterViewAsCommand = { + AlterViewAsCommand(table, select, plan, isAnalyzed=true) + } + val LocalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.LocalTempView val GlobalTempView : ViewType = org.apache.spark.sql.catalyst.analysis.GlobalTempView val PersistedView : ViewType = org.apache.spark.sql.catalyst.analysis.PersistedView diff --git a/flowman-spark-testing/pom.xml b/flowman-spark-testing/pom.xml index f4161ab68..307a9b514 100644 --- a/flowman-spark-testing/pom.xml +++ b/flowman-spark-testing/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-spec/pom.xml b/flowman-spec/pom.xml index cebf911bb..03f1e4599 100644 --- a/flowman-spec/pom.xml +++ b/flowman-spec/pom.xml @@ -9,7 +9,7 @@ flowman-root com.dimajix.flowman - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/ColumnCheckType.java b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/ColumnCheckType.java new file mode 100644 index 000000000..67ae63442 --- /dev/null +++ b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/ColumnCheckType.java @@ -0,0 +1,37 @@ +/* + * Copyright 2021 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.annotation; + +import java.lang.annotation.ElementType; +import java.lang.annotation.Retention; +import java.lang.annotation.RetentionPolicy; +import java.lang.annotation.Target; + + +/** + * This annotation marks a specific class as a [[ColumnCheck]] to be used in a data flow spec. The specific ColumnCheck itself has + * to derive from the ColumnCheck class + */ +@Retention(RetentionPolicy.RUNTIME) +@Target({ElementType.TYPE}) +public @interface ColumnCheckType { + /** + * Specifies the kind of the column check which is used in data flow specifications. 
+ * @return + */ + String kind(); +} diff --git a/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/GeneratorType.java b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/GeneratorType.java new file mode 100644 index 000000000..ad6dbe071 --- /dev/null +++ b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/GeneratorType.java @@ -0,0 +1,37 @@ +/* + * Copyright 2021 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.annotation; + +import java.lang.annotation.ElementType; +import java.lang.annotation.Retention; +import java.lang.annotation.RetentionPolicy; +import java.lang.annotation.Target; + + +/** + * This annotation marks a specific class as a [[Generator]] to be used in a data flow spec. The specific Generator itself has + * to derive from the Generator class + */ +@Retention(RetentionPolicy.RUNTIME) +@Target({ElementType.TYPE}) +public @interface GeneratorType { + /** + * Specifies the kind of the relation which is used in data flow specifications. + * @return + */ + String kind(); +} diff --git a/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/SchemaCheckType.java b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/SchemaCheckType.java new file mode 100644 index 000000000..b87c4807a --- /dev/null +++ b/flowman-spec/src/main/java/com/dimajix/flowman/spec/annotation/SchemaCheckType.java @@ -0,0 +1,37 @@ +/* + * Copyright 2021 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.annotation; + +import java.lang.annotation.ElementType; +import java.lang.annotation.Retention; +import java.lang.annotation.RetentionPolicy; +import java.lang.annotation.Target; + + +/** + * This annotation marks a specific class as a [[SchemaCheck]] to be used in a data flow spec. The specific SchemaCheck itself has + * to derive from the SchemaCheck class + */ +@Retention(RetentionPolicy.RUNTIME) +@Target({ElementType.TYPE}) +public @interface SchemaCheckType { + /** + * Specifies the kind of the schema test which is used in data flow specifications. 
+ * @return + */ + String kind(); +} diff --git a/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ClassAnnotationHandler b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ClassAnnotationHandler index 32c433818..684781b65 100644 --- a/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ClassAnnotationHandler +++ b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.ClassAnnotationHandler @@ -11,3 +11,6 @@ com.dimajix.flowman.spec.storage.StoreSpecAnnotationHandler com.dimajix.flowman.spec.target.TargetSpecAnnotationHandler com.dimajix.flowman.spec.assertion.AssertionSpecAnnotationHandler com.dimajix.flowman.spec.storage.ParcelSpecAnnotationHandler +com.dimajix.flowman.spec.documentation.GeneratorSpecAnnotationHandler +com.dimajix.flowman.spec.documentation.ColumnCheckSpecAnnotationHandler +com.dimajix.flowman.spec.documentation.SchemaCheckSpecAnnotationHandler diff --git a/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.DocumenterReader b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.DocumenterReader new file mode 100644 index 000000000..efbfcd79b --- /dev/null +++ b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.DocumenterReader @@ -0,0 +1 @@ +com.dimajix.flowman.spec.YamlDocumenterReader diff --git a/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.PluginListener b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.PluginListener new file mode 100644 index 000000000..c4ccdcf7d --- /dev/null +++ b/flowman-spec/src/main/resources/META-INF/services/com.dimajix.flowman.spi.PluginListener @@ -0,0 +1 @@ +com.dimajix.flowman.spec.ObjectMapperPluginListener diff --git a/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/project.vtl b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/project.vtl new file mode 100644 index 000000000..45aecb410 --- /dev/null +++ b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/project.vtl @@ -0,0 +1,383 @@ + + + + + Flowman Project '${project.name}' version ${project.version} + + + + +#macro(testStatus $check) +#if(${check.success})#elseif(${check.failure})#else#end${check.status} +#end + +#macro(schema $schema) + + + + + + + + + + + + #if($schema) + #foreach($column in ${schema.columns}) + + + + + + + + #end + #end + +
[html/project.vtl body, added lines +1..+383: the surrounding HTML/CSS markup is not recoverable from this excerpt, so only the Velocity directives and cell texts remain. Recoverable structure: #macro(testStatus), #macro(schema), #macro(references) and #macro(resources) helpers; the schema macro renders a table headed "Column Name | Data Type | Constraints | Description | Quality Checks" with one row per column (name, catalog type, NOT NULL flag, description, and per-column checks with name, status and result description) plus an optional schema-level table headed "Quality Check | Result | Remarks"; the page consists of a header (project name, version, description, generation timestamp), an index of Mappings, Relations and Targets, and one section per mapping (description, inputs, outputs with schemas), per relation (description, physical resources, sources, direct inputs, schema) and per target (description, inputs, outputs, execution phases).]
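The `.vtl` documentation templates summarized above (and the plain-text variant that follows) are Velocity templates that are handed a `project` documentation object. Purely to illustrate that mechanism — the real rendering is done by the `TemplateGenerator` base class referenced further down in this diff, whose implementation is not part of this excerpt, and the helper below is an assumption:

```scala
import java.io.StringWriter

import org.apache.velocity.VelocityContext
import org.apache.velocity.app.VelocityEngine

object VtlRenderSketch {
  /** Render a Velocity template string against a documentation object exposed as 'project'. */
  def render(templateText: String, projectDoc: AnyRef): String = {
    val engine = new VelocityEngine()
    engine.init()
    val context = new VelocityContext()
    context.put("project", projectDoc)  // templates read ${project.name}, ${project.mappings}, ...
    // Note: the HTML template also calls ${Timestamp.now()}, which needs an extra tool in the context.
    val writer = new StringWriter()
    engine.evaluate(context, writer, "project.vtl", templateText)
    writer.toString
  }
}
```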
+#end +#end + diff --git a/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/template.properties b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/template.properties new file mode 100644 index 000000000..9e0fa5bbd --- /dev/null +++ b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/html/template.properties @@ -0,0 +1,2 @@ +template.project.input=project.vtl +template.project.output=project.html diff --git a/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/project.vtl b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/project.vtl new file mode 100644 index 000000000..edcbd7b40 --- /dev/null +++ b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/project.vtl @@ -0,0 +1,67 @@ +Project: ${project.name} version ${project.version} +Description: ${project.description} + + +=========== Mappings: ================================================================================================= +#foreach($mapping in ${project.mappings}) +Mapping '${mapping}' (${mapping.reference}) + Description: ${mapping.description} + Inputs: + #foreach($input in ${mapping.inputs}) + - ${input.kind}: ${input} + #end + Outputs: + #foreach($output in ${mapping.outputs}) + - '${output.name}': + #foreach($column in ${output.schema.columns}) + ${column.name} ${column.catalogType} #if(!$column.nullable)NOT NULL #end- ${column.description} + #foreach($test in $column.tests) + Test: '${test.name}' => ${test.result.status} + #end + #end + #end + +#end + + +=========== Relations: ================================================================================================ +#foreach($relation in ${project.relations}) +Relation '${relation}' (${relation.reference}) + Description: ${relation.description} + Resources: + #foreach($resource in ${relation.resources}) + - ${resource.category} : ${resource.name} + #end + Inputs: + #foreach($input in ${relation.inputs}) + - ${input} + #end + Schema: + #foreach($column in ${relation.schema.columns}) + ${column.name} ${column.catalogType} #if(!$column.nullable)NOT NULL #end- ${column.description} + #foreach($test in $column.tests) + Test: '${test.name}' => ${test.result.status} + #end + #end + +#end + + +=========== Targets: ================================================================================================== +#foreach($target in ${project.targets}) +Target '${target}' (${target.reference}) + Description: ${target.description} + Inputs: + #foreach($input in ${target.inputs}) + - ${input} + #end + Outputs: + #foreach($output in ${target.outputs}) + - ${output} + #end + Phases: + #foreach($phase in ${target.phases}) + - ${phase.name} ${phase.description} + #end + +#end diff --git a/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/template.properties b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/template.properties new file mode 100644 index 000000000..ea4cee436 --- /dev/null +++ b/flowman-spec/src/main/resources/com/dimajix/flowman/documentation/text/template.properties @@ -0,0 +1,2 @@ +template.project.input=project.vtl +template.project.output=project.txt diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Module.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ModuleSpec.scala similarity index 98% rename from flowman-spec/src/main/scala/com/dimajix/flowman/spec/Module.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/ModuleSpec.scala index 
9b44a8ee2..64cb4192a 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Module.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ModuleSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Namespace.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/NamespaceSpec.scala similarity index 100% rename from flowman-spec/src/main/scala/com/dimajix/flowman/spec/Namespace.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/NamespaceSpec.scala diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ObjectMapper.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ObjectMapper.scala index 27969c3d0..9d2ca895a 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ObjectMapper.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ObjectMapper.scala @@ -19,10 +19,14 @@ package com.dimajix.flowman.spec import com.fasterxml.jackson.databind.jsontype.NamedType import com.fasterxml.jackson.databind.{ObjectMapper => JacksonMapper} +import com.dimajix.flowman.plugin.Plugin import com.dimajix.flowman.spec.assertion.AssertionSpec import com.dimajix.flowman.spec.catalog.CatalogSpec import com.dimajix.flowman.spec.connection.ConnectionSpec import com.dimajix.flowman.spec.dataset.DatasetSpec +import com.dimajix.flowman.spec.documentation.ColumnCheckSpec +import com.dimajix.flowman.spec.documentation.GeneratorSpec +import com.dimajix.flowman.spec.documentation.SchemaCheckSpec import com.dimajix.flowman.spec.history.HistorySpec import com.dimajix.flowman.spec.mapping.MappingSpec import com.dimajix.flowman.spec.measure.MeasureSpec @@ -32,6 +36,7 @@ import com.dimajix.flowman.spec.schema.SchemaSpec import com.dimajix.flowman.spec.storage.ParcelSpec import com.dimajix.flowman.spec.target.TargetSpec import com.dimajix.flowman.spi.ClassAnnotationScanner +import com.dimajix.flowman.spi.PluginListener import com.dimajix.flowman.util.{ObjectMapper => CoreObjectMapper} @@ -40,41 +45,62 @@ import com.dimajix.flowman.util.{ObjectMapper => CoreObjectMapper} * extensions and can directly be used for reading flowman specification files */ object ObjectMapper extends CoreObjectMapper { + private var _mapper:JacksonMapper = null + /** * Create a new Jackson ObjectMapper * @return */ - override def mapper : JacksonMapper = { - // Ensure that all extensions are loaded - ClassAnnotationScanner.load() + override def mapper : JacksonMapper = synchronized { + // Implement a stupidly simple cache + if (_mapper == null) { + // Ensure that all extensions are loaded + ClassAnnotationScanner.load() + + val stateStoreTypes = HistorySpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val catalogTypes = CatalogSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val monitorTypes = HistorySpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val relationTypes = RelationSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val mappingTypes = MappingSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val targetTypes = TargetSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val schemaTypes = SchemaSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val connectionTypes = ConnectionSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val assertionTypes = 
AssertionSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val measureTypes = MeasureSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val datasetTypes = DatasetSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val metricSinkTypes = MetricSinkSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val parcelTypes = ParcelSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val generatorTypes = GeneratorSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val columnTestTypes = ColumnCheckSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val schemaTestTypes = SchemaCheckSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) + val mapper = super.mapper + mapper.registerSubtypes(stateStoreTypes: _*) + mapper.registerSubtypes(catalogTypes: _*) + mapper.registerSubtypes(monitorTypes: _*) + mapper.registerSubtypes(relationTypes: _*) + mapper.registerSubtypes(mappingTypes: _*) + mapper.registerSubtypes(targetTypes: _*) + mapper.registerSubtypes(schemaTypes: _*) + mapper.registerSubtypes(connectionTypes: _*) + mapper.registerSubtypes(assertionTypes: _*) + mapper.registerSubtypes(measureTypes: _*) + mapper.registerSubtypes(datasetTypes: _*) + mapper.registerSubtypes(metricSinkTypes: _*) + mapper.registerSubtypes(parcelTypes: _*) + mapper.registerSubtypes(generatorTypes: _*) + mapper.registerSubtypes(columnTestTypes: _*) + mapper.registerSubtypes(schemaTestTypes: _*) + _mapper = mapper + } + _mapper + } - val stateStoreTypes = HistorySpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val catalogTypes = CatalogSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val monitorTypes = HistorySpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val relationTypes = RelationSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val mappingTypes = MappingSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val targetTypes = TargetSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val schemaTypes = SchemaSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val connectionTypes = ConnectionSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val assertionTypes = AssertionSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val measureTypes = MeasureSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val datasetTypes = DatasetSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val metricSinkTypes = MetricSinkSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val parcelTypes = ParcelSpec.subtypes.map(kv => new NamedType(kv._2, kv._1)) - val mapper = super.mapper - mapper.registerSubtypes(stateStoreTypes:_*) - mapper.registerSubtypes(catalogTypes:_*) - mapper.registerSubtypes(monitorTypes:_*) - mapper.registerSubtypes(relationTypes:_*) - mapper.registerSubtypes(mappingTypes:_*) - mapper.registerSubtypes(targetTypes:_*) - mapper.registerSubtypes(schemaTypes:_*) - mapper.registerSubtypes(connectionTypes:_*) - mapper.registerSubtypes(assertionTypes:_*) - mapper.registerSubtypes(measureTypes:_*) - mapper.registerSubtypes(datasetTypes:_*) - mapper.registerSubtypes(metricSinkTypes:_*) - mapper.registerSubtypes(parcelTypes:_*) - mapper + def invalidate(): Unit = synchronized { + _mapper = null } } + + +class ObjectMapperPluginListener extends PluginListener { + override def pluginLoaded(plugin: Plugin, classLoader: ClassLoader): Unit = ObjectMapper.invalidate() +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Profile.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProfileSpec.scala similarity index 97% rename from 
flowman-spec/src/main/scala/com/dimajix/flowman/spec/Profile.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProfileSpec.scala index d55bccd21..633999c76 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Profile.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProfileSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Project.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProjectSpec.scala similarity index 56% rename from flowman-spec/src/main/scala/com/dimajix/flowman/spec/Project.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProjectSpec.scala index 52d53c0b3..8c11f20fd 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Project.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/ProjectSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -19,21 +19,38 @@ package com.dimajix.flowman.spec import com.fasterxml.jackson.annotation.JsonProperty import com.dimajix.flowman.model.Project - - +import com.dimajix.flowman.spec.ProjectSpec.ImportSpec + + +object ProjectSpec { + final class ImportSpec { + @JsonProperty(value = "project", required = true) private var project: String = "" + @JsonProperty(value = "job", required = false) private var job: Option[String] = None + @JsonProperty(value = "arguments", required = false) private var arguments: Map[String,String] = Map() + def instantiate(): Project.Import = { + Project.Import( + project, + job, + arguments + ) + } + } +} final class ProjectSpec { @JsonProperty(value="name", required = true) private var name: String = "" @JsonProperty(value="description", required = false) private var description: Option[String] = None @JsonProperty(value="version", required = false) private var version: Option[String] = None - @JsonProperty(value="modules", required = true) private[spec] var modules: Seq[String] = Seq() + @JsonProperty(value="modules", required = true) private var modules: Seq[String] = Seq() + @JsonProperty(value="imports", required = true) private var imports: Seq[ImportSpec] = Seq() def instantiate(): Project = { Project( name=name, description=description, version=version, - modules=modules + modules=modules, + imports=imports.map(_.instantiate()) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Spec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Spec.scala index f62951613..2ceadecfc 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Spec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/Spec.scala @@ -62,7 +62,6 @@ final class MetadataSpec { abstract class NamedSpec[T] extends Spec[T] { - @JsonProperty(value="kind", required = true) protected var kind: String = _ @JsonProperty(value="name", required = false) protected[spec] var name:String = "" @JsonProperty(value="metadata", required=false) protected var metadata:Option[MetadataSpec] = None diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlDocumenterReader.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlDocumenterReader.scala 
new file mode 100644 index 000000000..a74e1e108 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlDocumenterReader.scala @@ -0,0 +1,67 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec + +import com.dimajix.flowman.documentation.Documenter +import com.dimajix.flowman.hadoop.File +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.spec.documentation.DocumenterSpec +import com.dimajix.flowman.spi.DocumenterReader + + +class YamlDocumenterReader extends DocumenterReader { + /** + * Returns the human readable name of the documenter file format + * + * @return + */ + override def name: String = "yaml documenter settings reader" + + /** + * Returns the internally used short name of the documenter file format + * + * @return + */ + override def format: String = "yaml" + + override def supports(format: String): Boolean = format == "yaml" || format == "yml" + + /** + * Loads a [[Documenter]] from the given file + * + * @param file + * @return + */ + override def file(file: File): Prototype[Documenter] = { + if (file.isDirectory()) { + this.file(file / "documentation.yml") + } + else { + ObjectMapper.read[DocumenterSpec](file) + } + } + + /** + * Loads a [[Documenter]] from the given String + * + * @param file + * @return + */ + override def string(text: String): Prototype[Documenter] = { + ObjectMapper.parse[DocumenterSpec](text) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlModuleReader.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlModuleReader.scala index 2c1c26824..31ab6fd16 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlModuleReader.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlModuleReader.scala @@ -19,6 +19,9 @@ package com.dimajix.flowman.spec import java.io.IOException import java.io.InputStream +import com.fasterxml.jackson.core.JsonProcessingException +import com.fasterxml.jackson.databind.JsonMappingException + import com.dimajix.flowman.hadoop.File import com.dimajix.flowman.model.Module import com.dimajix.flowman.spi.ModuleReader @@ -39,16 +42,22 @@ class YamlModuleReader extends ModuleReader { * @return */ @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def file(file:File) : Module = { ObjectMapper.read[ModuleSpec](file).instantiate() } @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def stream(stream:InputStream) : Module = { ObjectMapper.read[ModuleSpec](stream).instantiate() } @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def string(text:String) : Module = { ObjectMapper.parse[ModuleSpec](text).instantiate() } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlNamespaceReader.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlNamespaceReader.scala index 
ea11fbe95..3b9092706 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlNamespaceReader.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlNamespaceReader.scala @@ -20,6 +20,9 @@ import java.io.File import java.io.IOException import java.io.InputStream +import com.fasterxml.jackson.core.JsonProcessingException +import com.fasterxml.jackson.databind.JsonMappingException + import com.dimajix.flowman.model.Namespace import com.dimajix.flowman.spi.NamespaceReader @@ -39,16 +42,22 @@ class YamlNamespaceReader extends NamespaceReader { * @return */ @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def file(file:File) : Namespace = { ObjectMapper.read[NamespaceSpec](file).instantiate() } @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def stream(stream:InputStream) : Namespace = { ObjectMapper.read[NamespaceSpec](stream).instantiate() } @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def string(text:String) : Namespace = { ObjectMapper.parse[NamespaceSpec](text).instantiate() } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlProjectReader.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlProjectReader.scala index 9ccab626e..3e729f9cc 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlProjectReader.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/YamlProjectReader.scala @@ -18,13 +18,15 @@ package com.dimajix.flowman.spec import java.io.IOException +import com.fasterxml.jackson.core.JsonProcessingException +import com.fasterxml.jackson.databind.JsonMappingException + import com.dimajix.flowman.hadoop.File import com.dimajix.flowman.model.Project import com.dimajix.flowman.spi.ProjectReader class YamlProjectReader extends ProjectReader { - override def name: String = "yaml project reader" override def format: String = "yaml" @@ -38,11 +40,21 @@ class YamlProjectReader extends ProjectReader { * @return */ @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def file(file:File) : Project = { - ObjectMapper.read[ProjectSpec](file).instantiate() + if (file.isDirectory()) { + this.file(file / "project.yml") + } + else { + val prj = ObjectMapper.read[ProjectSpec](file).instantiate() + prj.copy(basedir = Some(file.parent.absolute), filename = Some(file)) + } } @throws[IOException] + @throws[JsonProcessingException] + @throws[JsonMappingException] override def string(text:String) : Project = { ObjectMapper.parse[ProjectSpec](text).instantiate() } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/assertion/AssertionSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/assertion/AssertionSpec.scala index 5f7671923..a2a3f4b82 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/assertion/AssertionSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/assertion/AssertionSpec.scala @@ -47,6 +47,7 @@ object AssertionSpec extends TypeRegistry[AssertionSpec] { new JsonSubTypes.Type(name = "uniqueKey", value = classOf[UniqueKeyAssertionSpec]) )) abstract class AssertionSpec extends NamedSpec[Assertion] { + @JsonProperty(value="kind", required = true) protected var kind: String = _ @JsonProperty(value="description", required = false) private var description: Option[String] = None override def instantiate(context: Context): Assertion diff --git 
a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/connection/ConnectionSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/connection/ConnectionSpec.scala index 711ca1293..e07284754 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/connection/ConnectionSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/connection/ConnectionSpec.scala @@ -16,6 +16,7 @@ package com.dimajix.flowman.spec.connection +import com.fasterxml.jackson.annotation.JsonProperty import com.fasterxml.jackson.annotation.JsonSubTypes import com.fasterxml.jackson.annotation.JsonTypeInfo import com.fasterxml.jackson.databind.annotation.JsonTypeResolver @@ -46,6 +47,8 @@ object ConnectionSpec extends TypeRegistry[ConnectionSpec] { new JsonSubTypes.Type(name = "sftp", value = classOf[SshConnectionSpec]) )) abstract class ConnectionSpec extends NamedSpec[Connection] { + @JsonProperty(value="kind", required = true) protected var kind: String = "jdbc" + /** * Creates an instance of this specification and performs the interpolation of all variables * diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/dataset/RelationDataset.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/dataset/RelationDataset.scala index 783461031..211a93eee 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/dataset/RelationDataset.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/dataset/RelationDataset.scala @@ -119,8 +119,8 @@ case class RelationDataset( * @return */ override def describe(execution:Execution) : Option[StructType] = { - val instance = relation.value - Some(instance.describe(execution)) + val schema = execution.describe(relation.value, partition) + Some(schema) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/CollectorSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/CollectorSpec.scala new file mode 100644 index 000000000..9af9d8a50 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/CollectorSpec.scala @@ -0,0 +1,64 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonSubTypes +import com.fasterxml.jackson.annotation.JsonTypeInfo + +import com.dimajix.flowman.documentation.Collector +import com.dimajix.flowman.documentation.MappingCollector +import com.dimajix.flowman.documentation.RelationCollector +import com.dimajix.flowman.documentation.TargetCollector +import com.dimajix.flowman.documentation.CheckCollector +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.spec.Spec + + +@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") +@JsonSubTypes(value = Array( + new JsonSubTypes.Type(name = "mappings", value = classOf[MappingCollectorSpec]), + new JsonSubTypes.Type(name = "relations", value = classOf[RelationCollectorSpec]), + new JsonSubTypes.Type(name = "targets", value = classOf[TargetCollectorSpec]), + new JsonSubTypes.Type(name = "checks", value = classOf[CheckCollectorSpec]) +)) +abstract class CollectorSpec extends Spec[Collector] { + override def instantiate(context: Context): Collector +} + +final class MappingCollectorSpec extends CollectorSpec { + override def instantiate(context: Context): MappingCollector = { + new MappingCollector() + } +} + +final class RelationCollectorSpec extends CollectorSpec { + override def instantiate(context: Context): RelationCollector = { + new RelationCollector() + } +} + +final class TargetCollectorSpec extends CollectorSpec { + override def instantiate(context: Context): TargetCollector = { + new TargetCollector() + } +} + +final class CheckCollectorSpec extends CollectorSpec { + override def instantiate(context: Context): CheckCollector = { + new CheckCollector() + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnCheckSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnCheckSpec.scala new file mode 100644 index 000000000..c7addfd20 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnCheckSpec.scala @@ -0,0 +1,109 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty +import com.fasterxml.jackson.annotation.JsonSubTypes +import com.fasterxml.jackson.annotation.JsonTypeInfo + +import com.dimajix.common.TypeRegistry +import com.dimajix.flowman.documentation.ColumnReference +import com.dimajix.flowman.documentation.ColumnCheck +import com.dimajix.flowman.documentation.ExpressionColumnCheck +import com.dimajix.flowman.documentation.ForeignKeyColumnCheck +import com.dimajix.flowman.documentation.NotNullColumnCheck +import com.dimajix.flowman.documentation.RangeColumnCheck +import com.dimajix.flowman.documentation.UniqueColumnCheck +import com.dimajix.flowman.documentation.ValuesColumnCheck +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.spec.annotation.ColumnCheckType +import com.dimajix.flowman.spi.ClassAnnotationHandler + + +object ColumnCheckSpec extends TypeRegistry[ColumnCheckSpec] { +} + + +@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") +@JsonSubTypes(value = Array( + new JsonSubTypes.Type(name = "expression", value = classOf[ExpressionColumnCheckSpec]), + new JsonSubTypes.Type(name = "foreignKey", value = classOf[ForeignKeyColumnCheckSpec]), + new JsonSubTypes.Type(name = "notNull", value = classOf[NotNullColumnCheckSpec]), + new JsonSubTypes.Type(name = "unique", value = classOf[UniqueColumnCheckSpec]), + new JsonSubTypes.Type(name = "range", value = classOf[RangeColumnCheckSpec]), + new JsonSubTypes.Type(name = "values", value = classOf[ValuesColumnCheckSpec]) +)) +abstract class ColumnCheckSpec { + def instantiate(context: Context, parent:ColumnReference): ColumnCheck +} + + +class ColumnCheckSpecAnnotationHandler extends ClassAnnotationHandler { + override def annotation: Class[_] = classOf[ColumnCheckType] + + override def register(clazz: Class[_]): Unit = + ColumnCheckSpec.register(clazz.getAnnotation(classOf[ColumnCheckType]).kind(), clazz.asInstanceOf[Class[_ <: ColumnCheckSpec]]) +} + + +class NotNullColumnCheckSpec extends ColumnCheckSpec { + override def instantiate(context: Context, parent:ColumnReference): NotNullColumnCheck = NotNullColumnCheck(Some(parent)) +} +class UniqueColumnCheckSpec extends ColumnCheckSpec { + override def instantiate(context: Context, parent:ColumnReference): UniqueColumnCheck = UniqueColumnCheck(Some(parent)) +} +class RangeColumnCheckSpec extends ColumnCheckSpec { + @JsonProperty(value="lower", required=true) private var lower:String = "" + @JsonProperty(value="upper", required=true) private var upper:String = "" + + override def instantiate(context: Context, parent:ColumnReference): RangeColumnCheck = RangeColumnCheck( + Some(parent), + None, + context.evaluate(lower), + context.evaluate(upper) + ) +} +class ValuesColumnCheckSpec extends ColumnCheckSpec { + @JsonProperty(value="values", required=false) private var values:Seq[String] = Seq() + + override def instantiate(context: Context, parent:ColumnReference): ValuesColumnCheck = ValuesColumnCheck( + Some(parent), + values=values.map(context.evaluate) + ) +} +class ExpressionColumnCheckSpec extends ColumnCheckSpec { + @JsonProperty(value="expression", required=true) private var expression:String = _ + + override def instantiate(context: Context, parent:ColumnReference): ExpressionColumnCheck = ExpressionColumnCheck( + Some(parent), + expression=context.evaluate(expression) + ) +} +class 
ForeignKeyColumnCheckSpec extends ColumnCheckSpec { + @JsonProperty(value="mapping", required=false) private var mapping:Option[String] = None + @JsonProperty(value="relation", required=false) private var relation:Option[String] = None + @JsonProperty(value="column", required=false) private var column:Option[String] = None + + override def instantiate(context: Context, parent:ColumnReference): ForeignKeyColumnCheck = ForeignKeyColumnCheck( + Some(parent), + relation=context.evaluate(relation).map(RelationIdentifier(_)), + mapping=context.evaluate(mapping).map(MappingOutputIdentifier(_)), + column=context.evaluate(column) + ) +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnDocSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnDocSpec.scala new file mode 100644 index 000000000..391a92f43 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/ColumnDocSpec.scala @@ -0,0 +1,51 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.documentation.ColumnDoc +import com.dimajix.flowman.documentation.Reference +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.NullType + + +class ColumnDocSpec { + @JsonProperty(value="name", required=true) private var name:String = _ + @JsonProperty(value="description", required=false) private var description:Option[String] = None + @JsonProperty(value="columns", required=false) private var columns:Seq[ColumnDocSpec] = Seq() + @JsonProperty(value="checks", required=false) private var checks:Seq[ColumnCheckSpec] = Seq() + + def instantiate(context: Context, parent:Reference): ColumnDoc = { + val doc = ColumnDoc( + Some(parent), + Field(context.evaluate(name), NullType, description=context.evaluate(description)), + Seq(), + Seq() + ) + def ref = doc.reference + + val cols = columns.map(_.instantiate(context, ref)) + val tests = this.checks.map(_.instantiate(context, ref)) + + doc.copy( + children = cols, + checks = tests + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterLoader.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterLoader.scala new file mode 100644 index 000000000..9ff1d727a --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterLoader.scala @@ -0,0 +1,48 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import org.apache.hadoop.fs.Path + +import com.dimajix.flowman.documentation.Documenter +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.Project + + +object DocumenterLoader { + def load(context: Context, project: Project): Documenter = { + project.basedir.flatMap { basedir => + val docpath = basedir / "documentation.yml" + if (docpath.isFile()) { + val file = Documenter.read.file(docpath) + Some(file.instantiate(context)) + } + else { + Some(defaultDocumenter((basedir / "generated-documentation").path)) + } + }.getOrElse { + defaultDocumenter(new Path("/tmp/flowman/generated-documentation")) + } + } + + private def defaultDocumenter(outputDir: Path): Documenter = { + val generators = Seq( + new FileGenerator(outputDir) + ) + Documenter.read.default().copy(generators = generators) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSinkSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterSpec.scala similarity index 54% rename from flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSinkSpec.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterSpec.scala index 6865ad221..391e1c026 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSinkSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/DocumenterSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,23 +14,23 @@ * limitations under the License. 
*/ -package com.dimajix.flowman.spec.metric +package com.dimajix.flowman.spec.documentation import com.fasterxml.jackson.annotation.JsonProperty +import com.dimajix.flowman.documentation.Documenter import com.dimajix.flowman.execution.Context -import com.dimajix.flowman.metric.MetricSink -import com.dimajix.flowman.metric.PrometheusMetricSink +import com.dimajix.flowman.spec.Spec -class PrometheusMetricSinkSpec extends MetricSinkSpec { - @JsonProperty(value = "url", required = true) private var url:String = "" - @JsonProperty(value = "labels", required = false) private var labels:Map[String,String] = Map() +final class DocumenterSpec extends Spec[Documenter] { + @JsonProperty(value="collectors") private var collectors: Seq[CollectorSpec] = Seq() + @JsonProperty(value="generators") private var generators: Seq[GeneratorSpec] = Seq() - override def instantiate(context: Context): MetricSink = { - new PrometheusMetricSink( - context.evaluate(url), - labels + def instantiate(context:Context): Documenter = { + Documenter( + collectors.map(_.instantiate(context)), + generators.map(_.instantiate(context)) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/FileGenerator.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/FileGenerator.scala new file mode 100644 index 000000000..ccce6e754 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/FileGenerator.scala @@ -0,0 +1,117 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import java.io.StringReader +import java.net.URL +import java.nio.charset.Charset +import java.util.Properties + +import scala.collection.JavaConverters._ +import scala.util.matching.Regex + +import com.fasterxml.jackson.annotation.JsonProperty +import com.google.common.io.Resources +import org.apache.hadoop.fs.Path +import org.slf4j.LoggerFactory + +import com.dimajix.flowman.documentation.Generator +import com.dimajix.flowman.documentation.ProjectDoc +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.hadoop.File + + +object FileGenerator { + val textTemplate : URL = Resources.getResource(classOf[FileGenerator], "/com/dimajix/flowman/documentation/text") + val htmlTemplate : URL = Resources.getResource(classOf[FileGenerator], "/com/dimajix/flowman/documentation/html") + val defaultTemplate : URL = htmlTemplate +} + + +case class FileGenerator( + location:Path, + template:URL = FileGenerator.defaultTemplate, + includeRelations:Seq[Regex] = Seq.empty, + excludeRelations:Seq[Regex] = Seq.empty, + includeMappings:Seq[Regex] = Seq.empty, + excludeMappings:Seq[Regex] = Seq.empty, + includeTargets:Seq[Regex] = Seq.empty, + excludeTargets:Seq[Regex] = Seq.empty +) extends TemplateGenerator(template, includeRelations, excludeRelations, includeMappings, excludeMappings, includeTargets, excludeTargets) { + private val logger = LoggerFactory.getLogger(classOf[FileGenerator]) + + protected override def generateInternal(context:Context, execution: Execution, documentation: ProjectDoc): Unit = { + val props = new Properties() + props.load(new StringReader(loadResource("template.properties"))) + + val fs = execution.fs + val outputDir = fs.file(location) + + // Cleanup any existing output directory + if (outputDir.isDirectory()) { + outputDir.list().foreach(_.delete(true)) + } + else if (outputDir.isFile()) { + outputDir.isFile() + } + outputDir.mkdirs() + + generateProjectFile(context, documentation, outputDir, props.asScala.toMap) + } + + private def generateProjectFile(context:Context, documentation: ProjectDoc, outputDir:File, properties: Map[String,String]) : Unit= { + val in = properties.getOrElse("template.project.input", "project.vtl") + val out = properties("template.project.output") + + val projectDoc = renderProject(context, documentation, in) + writeFile(outputDir / out, projectDoc) + } + + private def writeFile(file:File, content:String) : Unit = { + logger.info(s"Writing documentation file '${file.toString}'") + val out = file.create(true) + try { + // Manually convert string to UTF-8 and use write, since writeUTF apparently would write a BOM + val bytes = Charset.forName("UTF-8").encode(content) + out.write(bytes.array(), bytes.arrayOffset(), bytes.limit()) + } + finally { + out.close() + } + } +} + + +class FileGeneratorSpec extends TemplateGeneratorSpec { + @JsonProperty(value="location", required=true) private var location:String = _ + @JsonProperty(value="template", required=false) private var template:String = FileGenerator.defaultTemplate.toString + + override def instantiate(context: Context): Generator = { + val url = getTemplateUrl(context) + FileGenerator( + new Path(context.evaluate(location)), + url, + includeRelations = includeRelations.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r), + excludeRelations = excludeRelations.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r), + includeMappings = 
includeMappings.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r), + excludeMappings = excludeMappings.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r), + includeTargets = includeTargets.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r), + excludeTargets = excludeTargets.map(context.evaluate).map(_.trim).filter(_.nonEmpty).map(_.r) + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/GeneratorSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/GeneratorSpec.scala new file mode 100644 index 000000000..cbb6395d3 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/GeneratorSpec.scala @@ -0,0 +1,48 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonSubTypes +import com.fasterxml.jackson.annotation.JsonTypeInfo + +import com.dimajix.common.TypeRegistry +import com.dimajix.flowman.documentation.Generator +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.spec.Spec +import com.dimajix.flowman.spec.annotation.GeneratorType +import com.dimajix.flowman.spi.ClassAnnotationHandler + + +object GeneratorSpec extends TypeRegistry[GeneratorSpec] { +} + + +@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") +@JsonSubTypes(value = Array( + new JsonSubTypes.Type(name = "file", value = classOf[FileGeneratorSpec]) +)) +abstract class GeneratorSpec extends Spec[Generator] { + def instantiate(context:Context): Generator +} + + +class GeneratorSpecAnnotationHandler extends ClassAnnotationHandler { + override def annotation: Class[_] = classOf[GeneratorType] + + override def register(clazz: Class[_]): Unit = + GeneratorSpec.register(clazz.getAnnotation(classOf[GeneratorType]).kind(), clazz.asInstanceOf[Class[_ <: GeneratorSpec]]) +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/MappingDocSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/MappingDocSpec.scala new file mode 100644 index 000000000..55606f17b --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/MappingDocSpec.scala @@ -0,0 +1,119 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.documentation.MappingDoc +import com.dimajix.flowman.documentation.MappingOutputDoc +import com.dimajix.flowman.documentation.MappingReference +import com.dimajix.flowman.documentation.SchemaDoc +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.MappingIdentifier +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.spec.Spec + + +class MappingOutputDocSpec { + @JsonProperty(value="description", required=false) private var description:Option[String] = None + @JsonProperty(value="columns", required=false) private var columns:Seq[ColumnDocSpec] = Seq() + @JsonProperty(value="checks", required=false) private var checks:Seq[SchemaCheckSpec] = Seq() + + def instantiate(context: Context, parent:MappingReference, name:String): MappingOutputDoc = { + val doc = MappingOutputDoc( + Some(parent), + MappingOutputIdentifier.empty.copy(output=name), + context.evaluate(description), + None + ) + val ref = doc.reference + + val schema = + if (columns.nonEmpty || checks.nonEmpty) { + val schema = SchemaDoc( + Some(ref), + None, + Seq(), + Seq() + ) + val ref2 = schema.reference + val cols = columns.map(_.instantiate(context, ref2)) + val tests = this.checks.map(_.instantiate(context, ref2)) + Some(schema.copy( + columns=cols, + checks=tests + )) + } + else { + None + } + + doc.copy( + schema = schema + ) + } +} + + +class MappingDocSpec extends Spec[MappingDoc] { + @JsonProperty(value="description", required=false) private var description:Option[String] = None + @JsonProperty(value="outputs", required=false) private var outputs:Map[String,MappingOutputDocSpec] = Map() + @JsonProperty(value="columns", required=false) private var columns:Seq[ColumnDocSpec] = Seq() + @JsonProperty(value="tests", required=false) private var tests:Seq[SchemaCheckSpec] = Seq() + + def instantiate(context: Context): MappingDoc = { + val doc = MappingDoc( + None, + MappingIdentifier.empty, + description = context.evaluate(description) + ) + val ref = doc.reference + + val output = + if (columns.nonEmpty || tests.nonEmpty) { + val output = MappingOutputDoc( + Some(ref), + MappingOutputIdentifier.empty.copy(output="main") + ) + val ref2 = output.reference + + val schema = SchemaDoc( + Some(ref2) + ) + val ref3 = schema.reference + val cols = columns.map(_.instantiate(context, ref3)) + val tsts = tests.map(_.instantiate(context, ref3)) + Some( + output.copy( + schema = Some(schema.copy( + columns=cols, + checks=tsts + )) + ) + ) + } + else { + None + } + + val outputs = this.outputs.map { case(name,output) => + output.instantiate(context, ref, name) + } ++ output.toSeq + + doc.copy(outputs=outputs.toSeq) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/RelationDocSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/RelationDocSpec.scala new file mode 100644 index 000000000..967c02db0 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/RelationDocSpec.scala @@ -0,0 +1,62 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.documentation.RelationDoc +import com.dimajix.flowman.documentation.SchemaDoc +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.spec.Spec + + +class RelationDocSpec extends Spec[RelationDoc] { + @JsonProperty(value="description", required=false) private var description:Option[String] = None + @JsonProperty(value="columns", required=false) private var columns:Seq[ColumnDocSpec] = Seq() + @JsonProperty(value="checks", required=false) private var checks:Seq[SchemaCheckSpec] = Seq() + + override def instantiate(context: Context): RelationDoc = { + val doc = RelationDoc( + None, + RelationIdentifier.empty, + description = context.evaluate(description) + ) + val ref = doc.reference + + val schema = + if (columns.nonEmpty || checks.nonEmpty) { + val schema = SchemaDoc( + Some(ref) + ) + val ref2 = schema.reference + val cols = columns.map(_.instantiate(context, ref2)) + val tests = this.checks.map(_.instantiate(context, ref2)) + Some(schema.copy( + columns=cols, + checks=tests + )) + } + else { + None + } + + doc.copy( + schema = schema + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaCheckSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaCheckSpec.scala new file mode 100644 index 000000000..8060b7c9b --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaCheckSpec.scala @@ -0,0 +1,88 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty +import com.fasterxml.jackson.annotation.JsonSubTypes +import com.fasterxml.jackson.annotation.JsonTypeInfo + +import com.dimajix.common.TypeRegistry +import com.dimajix.flowman.documentation.ExpressionSchemaCheck +import com.dimajix.flowman.documentation.ForeignKeySchemaCheck +import com.dimajix.flowman.documentation.PrimaryKeySchemaCheck +import com.dimajix.flowman.documentation.SchemaReference +import com.dimajix.flowman.documentation.SchemaCheck +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.MappingOutputIdentifier +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.spec.annotation.SchemaCheckType +import com.dimajix.flowman.spi.ClassAnnotationHandler + + +object SchemaCheckSpec extends TypeRegistry[SchemaCheckSpec] { +} + + +@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") +@JsonSubTypes(value = Array( + new JsonSubTypes.Type(name = "expression", value = classOf[ExpressionSchemaCheckSpec]), + new JsonSubTypes.Type(name = "foreignKey", value = classOf[ForeignKeySchemaCheckSpec]), + new JsonSubTypes.Type(name = "primaryKey", value = classOf[PrimaryKeySchemaCheckSpec]) +)) +abstract class SchemaCheckSpec { + def instantiate(context: Context, parent:SchemaReference): SchemaCheck +} + + +class SchemaCheckSpecAnnotationHandler extends ClassAnnotationHandler { + override def annotation: Class[_] = classOf[SchemaCheckType] + + override def register(clazz: Class[_]): Unit = + SchemaCheckSpec.register(clazz.getAnnotation(classOf[SchemaCheckType]).kind(), clazz.asInstanceOf[Class[_ <: SchemaCheckSpec]]) +} + + +class PrimaryKeySchemaCheckSpec extends SchemaCheckSpec { + @JsonProperty(value="columns", required=false) private var columns:Seq[String] = Seq.empty + + override def instantiate(context: Context, parent:SchemaReference): PrimaryKeySchemaCheck = PrimaryKeySchemaCheck( + Some(parent), + columns = columns.map(context.evaluate) + ) +} +class ExpressionSchemaCheckSpec extends SchemaCheckSpec { + @JsonProperty(value="expression", required=true) private var expression:String = _ + + override def instantiate(context: Context, parent:SchemaReference): ExpressionSchemaCheck = ExpressionSchemaCheck( + Some(parent), + expression = context.evaluate(expression) + ) +} +class ForeignKeySchemaCheckSpec extends SchemaCheckSpec { + @JsonProperty(value="mapping", required=false) private var mapping:Option[String] = None + @JsonProperty(value="relation", required=false) private var relation:Option[String] = None + @JsonProperty(value="columns", required=false) private var columns:Seq[String] = Seq.empty + @JsonProperty(value="references", required=false) private var references:Seq[String] = Seq.empty + + override def instantiate(context: Context, parent:SchemaReference): ForeignKeySchemaCheck = ForeignKeySchemaCheck( + Some(parent), + columns=columns.map(context.evaluate), + relation=context.evaluate(relation).map(RelationIdentifier(_)), + mapping=context.evaluate(mapping).map(MappingOutputIdentifier(_)), + references=references.map(context.evaluate) + ) +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaDocSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaDocSpec.scala new file mode 100644 index 000000000..948ae4010 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/SchemaDocSpec.scala @@ -0,0 +1,45 @@ +/* + * 
Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.documentation.Reference +import com.dimajix.flowman.documentation.SchemaDoc +import com.dimajix.flowman.execution.Context + + +class SchemaDocSpec { + @JsonProperty(value="description", required=false) private var description:Option[String] = None + @JsonProperty(value="columns", required=false) private var columns:Seq[ColumnDocSpec] = Seq() + @JsonProperty(value="checks", required=false) private var checks:Seq[SchemaCheckSpec] = Seq() + + def instantiate(context: Context, parent:Reference): SchemaDoc = { + val doc = SchemaDoc( + Some(parent), + description = context.evaluate(description) + ) + val ref = doc.reference + + val cols = columns.map(_.instantiate(context, ref)) + val tests = this.checks.map(_.instantiate(context, ref)) + doc.copy( + columns = cols, + checks = tests + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TargetDocSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TargetDocSpec.scala new file mode 100644 index 000000000..c31084f7d --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TargetDocSpec.scala @@ -0,0 +1,38 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.documentation.TargetDoc +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.model.TargetIdentifier +import com.dimajix.flowman.spec.Spec + + +class TargetDocSpec extends Spec[TargetDoc] { + @JsonProperty(value="description", required=false) private var description:Option[String] = None + + override def instantiate(context: Context): TargetDoc = { + val doc = TargetDoc( + None, + TargetIdentifier.empty, + description = context.evaluate(description) + ) + doc + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TemplateGenerator.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TemplateGenerator.scala new file mode 100644 index 000000000..bc53c09c3 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/documentation/TemplateGenerator.scala @@ -0,0 +1,126 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.documentation + +import java.net.URL +import java.nio.charset.Charset + +import scala.util.matching.Regex + +import com.fasterxml.jackson.annotation.JsonProperty +import com.google.common.io.Resources +import org.apache.hadoop.fs.Path + +import com.dimajix.flowman.documentation.BaseGenerator +import com.dimajix.flowman.documentation.Generator +import com.dimajix.flowman.documentation.MappingDoc +import com.dimajix.flowman.documentation.MappingDocWrapper +import com.dimajix.flowman.documentation.ProjectDoc +import com.dimajix.flowman.documentation.ProjectDocWrapper +import com.dimajix.flowman.documentation.RelationDoc +import com.dimajix.flowman.documentation.RelationDocWrapper +import com.dimajix.flowman.documentation.TargetDoc +import com.dimajix.flowman.documentation.TargetDocWrapper +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.model.Identifier + + +abstract class TemplateGenerator( + template:URL, + includeRelations:Seq[Regex] = Seq(".*".r), + excludeRelations:Seq[Regex] = Seq.empty, + includeMappings:Seq[Regex] = Seq(".*".r), + excludeMappings:Seq[Regex] = Seq.empty, + includeTargets:Seq[Regex] = Seq(".*".r), + excludeTargets:Seq[Regex] = Seq.empty +) extends BaseGenerator { + override def generate(context:Context, execution: Execution, documentation: ProjectDoc): Unit = { + def checkRegex(id:Identifier[_], regex:Regex) : Boolean = { + regex.unapplySeq(id.toString).nonEmpty || regex.unapplySeq(id.name).nonEmpty + } + // Apply all filters + val relations = documentation.relations.filter { relation => + includeRelations.exists(regex => checkRegex(relation.identifier, regex)) && + !excludeRelations.exists(regex => checkRegex(relation.identifier, regex)) + } + val mappings = documentation.mappings.filter { mapping => + includeMappings.exists(regex => checkRegex(mapping.identifier, regex)) && 
+ !excludeMappings.exists(regex => checkRegex(mapping.identifier, regex)) + } + val targets = documentation.targets.filter { target => + includeTargets.exists(regex => checkRegex(target.identifier, regex)) && + !excludeTargets.exists(regex => checkRegex(target.identifier, regex)) + } + val doc = documentation.copy( + relations = relations, + mappings = mappings, + targets = targets + ) + + generateInternal(context:Context, execution: Execution, doc) + } + + protected def generateInternal(context:Context, execution: Execution, documentation: ProjectDoc): Unit + + protected def renderProject(context:Context, documentation: ProjectDoc, template:String="project.vtl") : String = { + val temp = loadResource(template) + context.evaluate(temp, Map("project" -> ProjectDocWrapper(documentation))) + } + protected def renderRelation(context:Context, documentation: RelationDoc, template:String="relation.vtl") : String = { + val temp = loadResource(template) + context.evaluate(temp, Map("relation" -> RelationDocWrapper(documentation))) + } + protected def renderMapping(context:Context, documentation: MappingDoc, template:String="mapping.vtl") : String = { + val temp = loadResource(template) + context.evaluate(temp, Map("mapping" -> MappingDocWrapper(documentation))) + } + protected def renderTarget(context:Context, documentation: TargetDoc, template:String="target.vtl") : String = { + val temp = loadResource(template) + context.evaluate(temp, Map("target" -> TargetDocWrapper(documentation))) + } + + protected def loadResource(name: String): String = { + val path = template.getPath + val url = + if (path.endsWith("/")) + new URL(template.toString + name) + else + new URL(template.toString + "/" + name) + Resources.toString(url, Charset.forName("UTF-8")) + } +} + + +abstract class TemplateGeneratorSpec extends GeneratorSpec { + @JsonProperty(value="template", required=false) private var template:String = FileGenerator.defaultTemplate.toString + @JsonProperty(value="includeRelations", required=false) protected var includeRelations:Seq[String] = Seq(".*") + @JsonProperty(value="excludeRelations", required=false) protected var excludeRelations:Seq[String] = Seq.empty + @JsonProperty(value="includeMappings", required=false) protected var includeMappings:Seq[String] = Seq(".*") + @JsonProperty(value="excludeMappings", required=false) protected var excludeMappings:Seq[String] = Seq.empty + @JsonProperty(value="includeTargets", required=false) protected var includeTargets:Seq[String] = Seq(".*") + @JsonProperty(value="excludeTargets", required=false) protected var excludeTargets:Seq[String] = Seq.empty + + protected def getTemplateUrl(context: Context): URL = { + context.evaluate(template) match { + case "text" => FileGenerator.textTemplate + case "html" => FileGenerator.htmlTemplate + case str => new URL(str) + } + } + +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/HookSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/HookSpec.scala index a468debb8..677f3ac94 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/HookSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/HookSpec.scala @@ -34,7 +34,7 @@ object HookSpec extends TypeRegistry[HookSpec] { } -@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") +@JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind", visible=true) @JsonSubTypes(value = Array( new JsonSubTypes.Type(name = "simpleReport", value = classOf[SimpleReportHookSpec]), new JsonSubTypes.Type(name = 
"report", value = classOf[ReportHookSpec]), diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/ReportHook.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/ReportHook.scala index eac55bbc0..55364da17 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/ReportHook.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/hook/ReportHook.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -192,7 +192,7 @@ case class ReportHook( None } - // Register custom metrics board + // Reset metrics of custom metrics board without adding it metrics.foreach { board => board.reset(execution.metricSystem) } @@ -235,7 +235,7 @@ case class ReportHook( "phase" -> result.instance.phase.toString, "status" -> result.status.toString, "result" -> JobResultWrapper(result), - "metrics" -> (boardMetrics ++ sinkMetrics).asJava + "metrics" -> (boardMetrics ++ sinkMetrics).sortBy(_.getName()).asJava ) val text = context.evaluate(jobFinishVtl, vars) p.print(text) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/job/JobSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/job/JobSpec.scala index 82496851d..a41f7da75 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/job/JobSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/job/JobSpec.scala @@ -94,7 +94,7 @@ final class JobSpec extends NamedSpec[Job] { val name = context.evaluate(this.name) Job.Properties( context, - metadata.map(_.instantiate(context, name, Category.JOB, kind)).getOrElse(Metadata(context, name, Category.JOB, kind)), + metadata.map(_.instantiate(context, name, Category.JOB, "job")).getOrElse(Metadata(context, name, Category.JOB, "job")), description.map(context.evaluate) ) } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AggregateMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AggregateMapping.scala index 3fc572aef..13e511066 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AggregateMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AggregateMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -68,8 +68,8 @@ case class AggregateMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } } @@ -92,7 +92,7 @@ class AggregateMappingSpec extends MappingSpec { instanceProperties(context), MappingOutputIdentifier.parse(context.evaluate(input)), dimensions.map(context.evaluate), - context.evaluate(aggregations), + ListMap(aggregations.toSeq.map { case(k,v) => k -> context.evaluate(v) }:_*), context.evaluate(filter), if (partitions.isEmpty) 0 else context.evaluate(partitions).toInt ) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AliasMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AliasMapping.scala index ad62e024b..7b45c2011 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AliasMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AliasMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -36,8 +36,8 @@ case class AliasMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -61,7 +61,10 @@ case class AliasMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AssembleMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AssembleMapping.scala index edc422b33..bd01b91b1 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AssembleMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/AssembleMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -126,8 +126,8 @@ case class AssembleMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -164,7 +164,9 @@ case class AssembleMapping( val asm = assembler val result = asm.reassemble(schema) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } private def assembler : Assembler = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/CoalesceMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/CoalesceMapping.scala index a4dcfa448..6b1a3346d 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/CoalesceMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/CoalesceMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -54,8 +54,8 @@ case class CoalesceMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -68,7 +68,10 @@ case class CoalesceMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ConformMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ConformMapping.scala index 6e8014e23..45caf0b83 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ConformMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ConformMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -47,8 +47,8 @@ extends BaseMapping { * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs: Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -89,7 +89,9 @@ extends BaseMapping { // Apply all transformations in order val result = transforms.foldLeft(schema)((df,xfs) => xfs.transform(df)) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } private def transforms : Seq[Transformer] = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DeduplicateMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DeduplicateMapping.scala index 89f711efe..311e334e3 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DeduplicateMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DeduplicateMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -38,8 +38,8 @@ case class DeduplicateMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -81,7 +81,10 @@ case class DeduplicateMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DistinctMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DistinctMapping.scala index 6dbed2076..74487d88a 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DistinctMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DistinctMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -37,8 +37,8 @@ case class DistinctMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -71,7 +71,10 @@ case class DistinctMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DropMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DropMapping.scala index f2f505a6d..81c055aee 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DropMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/DropMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -40,8 +40,8 @@ case class DropMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs: Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -56,13 +56,14 @@ case class DropMapping( require(deps != null) val df = deps(input) - val asm = assembler - val result = asm.reassemble(df) - // Apply optional filter - val filteredResult = filter.map(result.filter).getOrElse(result) + // Apply optional filter, before dropping columns! + val filtered = filter.map(df.filter).getOrElse(df) + + val asm = assembler + val result = asm.reassemble(filtered) - Map("main" -> filteredResult) + Map("main" -> result) } /** @@ -78,12 +79,14 @@ case class DropMapping( val asm = assembler val result = asm.reassemble(schema) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } private def assembler : Assembler = { val builder = Assembler.builder() - .columns(_.drop(columns)) + .columns(_.drop(columns)) builder.build() } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExplodeMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExplodeMapping.scala index fe84aeba4..d638602d7 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExplodeMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExplodeMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -53,15 +53,15 @@ case class ExplodeMapping( flatten: Boolean = false, naming: CaseFormat = CaseFormat.SNAKE_CASE ) extends BaseMapping { - override def outputs: Seq[String] = Seq("main", "explode") + override def outputs: Set[String] = Set("main", "explode") /** * Returns the dependencies (i.e. 
names of tables in the Dataflow model) * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs: Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -123,7 +123,10 @@ case class ExplodeMapping( lift.transform(exploded) val result = flat.transform(lifted) - Map("main" -> result, "explode" -> exploded) + val schemas = Map("main" -> result, "explode" -> exploded) + + // Apply documentation + applyDocumentation(schemas) } private def explode = ExplodeTransformer(array, outerColumns.keep, outerColumns.drop, outerColumns.rename) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtendMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtendMapping.scala index 67147dc5c..215b9cb7c 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtendMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtendMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -39,8 +39,8 @@ case class ExtendMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMapping.scala index 2fd3ff757..72b2e1a8d 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMapping.scala @@ -53,8 +53,8 @@ case class ExtractJsonMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } @@ -63,7 +63,7 @@ case class ExtractJsonMapping( * * @return */ - override def outputs: Seq[String] = Seq("main", "error") + override def outputs: Set[String] = Set("main", "error") /** * Executes this MappingType and returns a corresponding DataFrame @@ -129,10 +129,13 @@ case class ExtractJsonMapping( val mainSchema = ftypes.StructType(schema.map(_.fields).getOrElse(Seq())) val errorSchema = ftypes.StructType(Seq(Field("record", ftypes.StringType, false))) - Map( + val schemas = Map( "main" -> mainSchema, "error" -> errorSchema ) + + // Apply documentation + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FilterMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FilterMapping.scala index d48124927..6ac27e2a0 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FilterMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FilterMapping.scala @@ -37,8 +37,8 @@ case class FilterMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -69,7 +69,9 @@ case class FilterMapping( val result = input(this.input) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FlattenMapping.scala 
b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FlattenMapping.scala index bf24e541c..625b699d0 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FlattenMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/FlattenMapping.scala @@ -40,8 +40,8 @@ case class FlattenMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -81,7 +81,9 @@ case class FlattenMapping( val result = xfs.transform(schema) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/GroupedAggregateMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/GroupedAggregateMapping.scala index 7607f0646..c75df85b2 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/GroupedAggregateMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/GroupedAggregateMapping.scala @@ -67,15 +67,15 @@ case class GroupedAggregateMapping( * recommended. * @return */ - override def outputs: Seq[String] = groups.keys.toSeq :+ "cache" + override def outputs: Set[String] = groups.keys.toSet + "cache" /** * Returns the dependencies of this mapping, which is exactly one input table * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/HistorizeMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/HistorizeMapping.scala index 766826e50..9f837d37b 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/HistorizeMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/HistorizeMapping.scala @@ -47,8 +47,8 @@ case class HistorizeMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -124,7 +124,9 @@ case class HistorizeMapping( ) } - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/JoinMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/JoinMapping.scala index 240b77b93..560df3ae1 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/JoinMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/JoinMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -40,8 +40,8 @@ case class JoinMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - input + override def inputs : Set[MappingOutputIdentifier] = { + input.toSet } /** @@ -56,10 +56,10 @@ case class JoinMapping( require(tables != null) val result = if (condition.nonEmpty) { - require(inputs.size == 2, "Joining using an condition only supports exactly two inputs") + require(input.size == 2, "Joining using an condition only supports exactly two inputs") - val left = inputs(0) - val right = inputs(1) + val left = input(0) + val right = input(1) val leftDf = tables(left).as(left.name) val rightDf = tables(right).as(right.name) leftDf.join(rightDf, expr(condition), mode) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MappingSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MappingSpec.scala index 9d4d6e2f7..65b889ce4 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MappingSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MappingSpec.scala @@ -29,6 +29,7 @@ import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.Metadata import com.dimajix.flowman.spec.NamedSpec import com.dimajix.flowman.spec.annotation.MappingType +import com.dimajix.flowman.spec.documentation.MappingDocSpec import com.dimajix.flowman.spec.template.CustomTypeResolverBuilder import com.dimajix.flowman.spi.ClassAnnotationHandler @@ -89,9 +90,11 @@ object MappingSpec extends TypeRegistry[MappingSpec] { new JsonSubTypes.Type(name = "values", value = classOf[ValuesMappingSpec]) )) abstract class MappingSpec extends NamedSpec[Mapping] { - @JsonProperty("broadcast") protected var broadcast:String = "false" - @JsonProperty("checkpoint") protected var checkpoint:String = "false" - @JsonProperty("cache") protected var cache:String = "NONE" + @JsonProperty(value="kind", required = true) protected var kind: String = _ + @JsonProperty(value="broadcast", required = false) protected var broadcast:String = "false" + @JsonProperty(value="checkpoint", required = false) protected var checkpoint:String = "false" + @JsonProperty(value="cache", required = false) protected var cache:String = "NONE" + @JsonProperty(value="documentation", required = false) private var documentation: Option[MappingDocSpec] = None /** * Creates an instance of this specification and performs the interpolation of all variables @@ -113,7 +116,8 @@ abstract class MappingSpec extends NamedSpec[Mapping] { metadata.map(_.instantiate(context, name, Category.MAPPING, kind)).getOrElse(Metadata(context, name, Category.MAPPING, kind)), context.evaluate(broadcast).toBoolean, context.evaluate(checkpoint).toBoolean, - StorageLevel.fromString(context.evaluate(cache)) + StorageLevel.fromString(context.evaluate(cache)), + documentation.map(_.instantiate(context)) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MockMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MockMapping.scala index 4f1b29447..913e5ab9a 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MockMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/MockMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -46,7 +46,23 @@ case class MockMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = Seq() + override def inputs: Set[MappingOutputIdentifier] = Set.empty + + /** + * Creates an output identifier for the primary output + * + * @return + */ + override def output: MappingOutputIdentifier = { + MappingOutputIdentifier(identifier, mocked.output.output) + } + + /** + * Lists all outputs of this mapping. Every mapping should have one "main" output + * + * @return + */ + override def outputs: Set[String] = mocked.outputs /** * Executes this Mapping and returns a corresponding map of DataFrames per output @@ -74,23 +90,6 @@ case class MockMapping( } } - - /** - * Creates an output identifier for the primary output - * - * @return - */ - override def output: MappingOutputIdentifier = { - MappingOutputIdentifier(identifier, mocked.output.output) - } - - /** - * Lists all outputs of this mapping. Every mapping should have one "main" output - * - * @return - */ - override def outputs: Seq[String] = mocked.outputs - /** * Returns the schema as produced by this mapping, relative to the given input schema. The map might not contain * schema information for all outputs, if the schema cannot be inferred. @@ -99,7 +98,10 @@ case class MockMapping( * @return */ override def describe(execution: Execution, input: Map[MappingOutputIdentifier, StructType]): Map[String, StructType] = { - mocked.outputs.map(out => out -> describe(execution, Map(), out)).toMap + val schemas = mocked.outputs.map(out => out -> describe(execution, Map(), out)).toMap + + // Apply documentation + applyDocumentation(schemas) } /** @@ -114,7 +116,8 @@ case class MockMapping( require(input != null) require(output != null && output.nonEmpty) - execution.describe(mocked, output) + val schema = execution.describe(mocked, output) + applyDocumentation(output, schema) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/NullMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/NullMapping.scala index e00827c65..7cde7e32c 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/NullMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/NullMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -55,7 +55,7 @@ case class NullMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = Seq() + override def inputs: Set[MappingOutputIdentifier] = Set.empty /** * Executes this Mapping and returns a corresponding map of DataFrames per output @@ -79,7 +79,9 @@ case class NullMapping( * @return */ override def describe(execution: Execution, input: Map[MappingOutputIdentifier, StructType]): Map[String, StructType] = { - Map("main" -> effectiveSchema) + // Apply documentation + val schemas = Map("main" -> effectiveSchema) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProjectMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProjectMapping.scala index 32a76cd8b..0698db051 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProjectMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProjectMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -50,8 +50,8 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -86,7 +86,9 @@ extends BaseMapping { val schema = input(this.input) val result = xfs.transform(schema) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } private def xfs : ProjectTransformer = ProjectTransformer(columns) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProvidedMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProvidedMapping.scala index c1c755291..7f54f89b9 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProvidedMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ProvidedMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -36,9 +36,7 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq() - } + override def inputs : Set[MappingOutputIdentifier] = Set.empty /** * Instantiates the specified table, which must be available in the Spark session diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RankMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RankMapping.scala index b7e33a0cb..1cf05dd50 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RankMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RankMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -61,8 +61,8 @@ case class RankMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -103,7 +103,10 @@ case class RankMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadHiveMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadHiveMapping.scala index 017729db6..49cec00ab 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadHiveMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadHiveMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,12 +16,17 @@ package com.dimajix.flowman.spec.mapping +import scala.collection.immutable.ListMap + import com.fasterxml.jackson.annotation.JsonProperty +import com.fasterxml.jackson.databind.annotation.JsonDeserialize import org.apache.spark import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.slf4j.LoggerFactory +import com.dimajix.jackson.ListMapDeserializer + +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.model.BaseMapping @@ -36,24 +41,21 @@ import com.dimajix.spark.sql.SchemaUtils case class ReadHiveMapping( instanceProperties:Mapping.Properties, - database: Option[String] = None, - table: String, + table: TableIdentifier, columns:Seq[Field] = Seq(), filter:Option[String] = None ) extends BaseMapping { private val logger = LoggerFactory.getLogger(classOf[ReadHiveMapping]) - def tableIdentifier: TableIdentifier = new TableIdentifier(table, database) - /** * Returns a list of physical resources required by this mapping. This list will only be non-empty for mappings * which actually read from physical data. 
* @return */ override def requires : Set[ResourceIdentifier] = { - Set(ResourceIdentifier.ofHiveTable(table, database)) ++ - database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet + Set(ResourceIdentifier.ofHiveTable(table)) ++ + table.database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet } /** @@ -61,9 +63,7 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq() - } + override def inputs : Set[MappingOutputIdentifier] = Set.empty /** * Executes this Transform by reading from the specified source and returns a corresponding DataFrame @@ -77,10 +77,10 @@ extends BaseMapping { require(input != null) val schema = if (columns.nonEmpty) Some(spark.sql.types.StructType(columns.map(_.sparkField))) else None - logger.info(s"Reading Hive table $tableIdentifier with filter '${filter.getOrElse("")}'") + logger.info(s"Reading Hive table $table with filter '${filter.getOrElse("")}'") val reader = execution.spark.read - val tableDf = reader.table(tableIdentifier.unquotedString) + val tableDf = reader.table(table.unquotedString) val df = SchemaUtils.applySchema(tableDf, schema) // Apply optional filter @@ -98,16 +98,18 @@ extends BaseMapping { require(execution != null) require(input != null) - val schema = if (columns.nonEmpty) { + val result = if (columns.nonEmpty) { // Use user specified schema StructType(columns) } else { - val tableDf = execution.spark.read.table(tableIdentifier.unquotedString) + val tableDf = execution.spark.read.table(table.unquotedString) StructType.of(tableDf.schema) } - Map("main" -> schema) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } @@ -115,7 +117,8 @@ extends BaseMapping { class ReadHiveMappingSpec extends MappingSpec { @JsonProperty(value = "database", required = false) private var database: Option[String] = None @JsonProperty(value = "table", required = true) private var table: String = "" - @JsonProperty(value = "columns", required=false) private var columns:Map[String,String] = Map() + @JsonDeserialize(using = classOf[ListMapDeserializer]) // Old Jackson in old Spark doesn't support ListMap + @JsonProperty(value = "columns", required=false) private var columns:ListMap[String,String] = ListMap() @JsonProperty(value = "filter", required=false) private var filter:Option[String] = None /** @@ -126,9 +129,8 @@ class ReadHiveMappingSpec extends MappingSpec { override def instantiate(context: Context): ReadHiveMapping = { ReadHiveMapping( instanceProperties(context), - context.evaluate(database), - context.evaluate(table), - context.evaluate(columns).map { case(name,typ) => Field(name, FieldType.of(typ))}.toSeq, + TableIdentifier(context.evaluate(table), context.evaluate(database)), + columns.toSeq.map { case(name,typ) => Field(name, FieldType.of(typ))}, context.evaluate(filter) ) } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadRelationMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadRelationMapping.scala index 95f397c71..5cf5e3dce 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadRelationMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadRelationMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -66,9 +66,7 @@ case class ReadRelationMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq() - } + override def inputs : Set[MappingOutputIdentifier] = Set.empty /** * Executes this Transform by reading from the specified source and returns a corresponding DataFrame @@ -105,16 +103,18 @@ case class ReadRelationMapping( require(execution != null) require(input != null) - val schema = if (columns.nonEmpty) { + val result = if (columns.nonEmpty) { // Use user specified schema StructType(columns) } else { val relation = this.relation.value - relation.describe(execution) + execution.describe(relation, partitions) } - Map("main" -> schema) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } /** @@ -122,7 +122,7 @@ case class ReadRelationMapping( * Params: linker - The linker object to use for creating new edges */ override def link(linker: Linker): Unit = { - linker.read(relation.identifier, partitions) + linker.read(relation, partitions) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadStreamMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadStreamMapping.scala index 1a0eb5ed4..9068ed7f3 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadStreamMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ReadStreamMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -34,6 +34,7 @@ import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.spec.relation.RelationReferenceSpec import com.dimajix.flowman.types.Field import com.dimajix.flowman.types.FieldType +import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.StructType import com.dimajix.spark.sql.SchemaUtils @@ -61,9 +62,7 @@ case class ReadStreamMapping ( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq() - } + override def inputs : Set[MappingOutputIdentifier] = Set.empty /** * Executes this Transform by reading from the specified source and returns a corresponding DataFrame @@ -97,16 +96,18 @@ case class ReadStreamMapping ( require(execution != null) require(input != null) - val schema = if (columns.nonEmpty) { + val result = if (columns.nonEmpty) { // Use user specified schema StructType(columns) } else { val relation = this.relation.value - relation.describe(execution) + execution.describe(relation) } - Map("main" -> schema) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } /** @@ -114,7 +115,7 @@ case class ReadStreamMapping ( * Params: linker - The linker object to use for creating new edges */ override def link(linker: Linker): Unit = { - linker.read(relation.identifier, Map()) + linker.read(relation, Map.empty[String,FieldValue]) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RebalanceMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RebalanceMapping.scala index 64fd789cf..5460cf469 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RebalanceMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RebalanceMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * 
Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -37,8 +37,8 @@ case class RebalanceMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -68,7 +68,10 @@ case class RebalanceMapping( require(input != null) val result = input(this.input) - Map("main" -> result) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RecursiveSqlMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RecursiveSqlMapping.scala index 17ada2fff..b5b7858a2 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RecursiveSqlMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RecursiveSqlMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -59,11 +59,10 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { + override def inputs : Set[MappingOutputIdentifier] = { SqlParser.resolveDependencies(statement) .filter(_.toLowerCase(Locale.ROOT) != "__this__") .map(MappingOutputIdentifier.parse) - .toSeq } /** @@ -142,7 +141,9 @@ extends BaseMapping { firstDf(spark, statement) } - Map("main" -> StructType.of(result.schema)) + // Apply documentation + val schemas = Map("main" -> StructType.of(result.schema)) + applyDocumentation(schemas) } private def statement : String = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RepartitionMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RepartitionMapping.scala index a8bc9b7c7..b13b6d308 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RepartitionMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/RepartitionMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -40,8 +40,8 @@ case class RepartitionMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -83,7 +83,9 @@ case class RepartitionMapping( val result = input(this.input) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SchemaMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SchemaMapping.scala index a18aac474..b04128eb0 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SchemaMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SchemaMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -53,8 +53,8 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -100,7 +100,9 @@ extends BaseMapping { StructType(columns) } - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SelectMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SelectMapping.scala index 005f0ddd6..e2334d886 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SelectMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SelectMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -44,8 +44,8 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -87,7 +87,7 @@ class SelectMappingSpec extends MappingSpec { SelectMapping( instanceProperties(context), MappingOutputIdentifier(context.evaluate(input)), - context.evaluate(columns).toSeq, + columns.toSeq.map { case(k,v) => k -> context.evaluate(v) }, context.evaluate(filter) ) } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SortMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SortMapping.scala index 6fe5a3c3a..856b6e7c6 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SortMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SortMapping.scala @@ -38,8 +38,8 @@ case class SortMapping( * Returns the dependencies (i.e. names of tables in the Dataflow model) * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -79,7 +79,9 @@ case class SortMapping( val result = input(this.input) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SqlMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SqlMapping.scala index 99ce1bb29..a9f4e82ca 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SqlMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/SqlMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -64,8 +64,8 @@ extends BaseMapping { * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - SqlParser.resolveDependencies(statement).map(MappingOutputIdentifier.parse).toSeq + override def inputs : Set[MappingOutputIdentifier] = { + SqlParser.resolveDependencies(statement).map(MappingOutputIdentifier.parse) } private def statement : String = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/StackMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/StackMapping.scala index 4cde6f19b..73d5fe4a2 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/StackMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/StackMapping.scala @@ -50,8 +50,8 @@ case class StackMapping( * Returns the dependencies (i.e. names of tables in the Dataflow model) * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -86,7 +86,9 @@ case class StackMapping( val result = xfs.transform(schema) val assembledResult = asm.map(_.reassemble(result)).getOrElse(result) - Map("main" -> assembledResult) + // Apply documentation + val schemas = Map("main" -> assembledResult) + applyDocumentation(schemas) } private lazy val xfs : StackTransformer = @@ -138,7 +140,7 @@ class StackMappingSpec extends MappingSpec { MappingOutputIdentifier(context.evaluate(input)), context.evaluate(nameColumn), context.evaluate(valueColumn), - ListMap(context.evaluate(stackColumns).toSeq:_*), + ListMap(stackColumns.toSeq.map {case(k,v) => k -> context.evaluate(v) }:_*), context.evaluate(dropNulls).toBoolean, keepColumns.map(context.evaluate), dropColumns.map(context.evaluate), diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TemplateMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TemplateMapping.scala index 88343c99f..31452389b 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TemplateMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TemplateMapping.scala @@ -20,6 +20,7 @@ import com.fasterxml.jackson.annotation.JsonProperty import org.apache.spark.sql.DataFrame import com.dimajix.flowman.common.ParserUtils.splitSettings +import com.dimajix.flowman.documentation.MappingDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.ScopeContext @@ -55,6 +56,12 @@ case class TemplateMapping( } } + /** + * Returns a (static) documentation of this mapping + * + * @return + */ + override def documentation: Option[MappingDoc] = mappingInstance.documentation.map(_.merge(instanceProperties.documentation)) /** * Returns a list of physical resources required by this mapping. 
This list will only be non-empty for mappings @@ -71,7 +78,7 @@ case class TemplateMapping( * * @return */ - override def outputs : Seq[String] = { + override def outputs : Set[String] = { mappingInstance.outputs } @@ -80,7 +87,7 @@ case class TemplateMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = { + override def inputs: Set[MappingOutputIdentifier] = { mappingInstance.inputs } @@ -110,7 +117,8 @@ case class TemplateMapping( require(execution != null) require(input != null) - mappingInstance.describe(execution, input) + val schemas = mappingInstance.describe(execution, input) + applyDocumentation(schemas) } /** @@ -124,7 +132,8 @@ case class TemplateMapping( require(input != null) require(output != null && output.nonEmpty) - mappingInstance.describe(execution, input, output) + val schema = mappingInstance.describe(execution, input, output) + applyDocumentation(output, schema) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TransitiveChildrenMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TransitiveChildrenMapping.scala index a9d7a8c79..35238551f 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TransitiveChildrenMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/TransitiveChildrenMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -42,7 +42,7 @@ case class TransitiveChildrenMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = Seq(input) + override def inputs: Set[MappingOutputIdentifier] = Set(input) /** * Executes this MappingType and returns a corresponding DataFrame @@ -132,7 +132,9 @@ case class TransitiveChildrenMapping( childColumns.map(n => fieldsByName(n.toLowerCase(Locale.ROOT))) val result = StructType(columns) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnionMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnionMapping.scala index d70d1beae..5d5ceb0a6 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnionMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnionMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
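// [Editor's note, not part of the patch] The SelectMappingSpec and StackMappingSpec changes above
// evaluate only the *values* of the column map, entry by entry, instead of evaluating the whole
// map at once. A self-contained sketch (evaluate() below is a stand-in for templating, not
// Flowman's context.evaluate) showing that this leaves the declared keys untouched and preserves
// their order when the result is rebuilt as a ListMap:
import scala.collection.immutable.ListMap

object EvaluatePerEntryDemo {
    def evaluate(value: String): String = value.replace("$prefix", "stg") // stand-in for templating

    def main(args: Array[String]): Unit = {
        val columns = ListMap("id" -> "$prefix_id", "name" -> "$prefix_name", "ts" -> "$prefix_ts")
        // Rebuild the ListMap from the evaluated entries - keys and their order are preserved
        val evaluated = ListMap(columns.toSeq.map { case (k, v) => k -> evaluate(v) }: _*)
        println(evaluated.keys.mkString(",")) // id,name,ts
        println(evaluated("id"))              // stg_id
    }
}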
@@ -43,8 +43,8 @@ case class UnionMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - input + override def inputs : Set[MappingOutputIdentifier] = { + input.toSet } /** @@ -58,7 +58,7 @@ case class UnionMapping( require(execution != null) require(tables != null) - val dfs = inputs.map(tables(_)) + val dfs = input.map(tables(_)) // Now create a union of all tables val union = @@ -101,7 +101,9 @@ case class UnionMapping( xfs.transformSchemas(schemas) } - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnitMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnitMapping.scala index e8c86fd97..f48b0a8f1 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnitMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnitMapping.scala @@ -62,10 +62,10 @@ case class UnitMapping( * Return all outputs provided by this unit * @return */ - override def outputs: Seq[String] = { + override def outputs: Set[String] = { mappingInstances .filter(_._2.outputs.contains("main")) - .keys.toSeq + .keySet } /** @@ -73,14 +73,14 @@ case class UnitMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = { + override def inputs: Set[MappingOutputIdentifier] = { // For all mappings, find only external dependencies. val ownMappings = mappingInstances.keySet mappingInstances.values .filter(_.outputs.contains("main")) .flatMap(_.inputs) .filter(dep => dep.project.nonEmpty || !ownMappings.contains(dep.name)) - .toSeq + .toSet } /** @@ -106,11 +106,13 @@ case class UnitMapping( require(execution != null) require(input != null) - mappingInstances + val schemas = mappingInstances .filter(_._2.outputs.contains("main")) .keys .map(name => name -> describe(execution, input, name)) .toMap + + applyDocumentation(schemas) } /** @@ -138,11 +140,12 @@ case class UnitMapping( .toMap } - mappingInstances + val schema = mappingInstances .filter(_._2.outputs.contains("main")) .get(output) .map(mapping => describe(mapping, "main")) .getOrElse(throw new NoSuchElementException(s"Cannot find output '$output' in unit mapping '$identifier'")) + applyDocumentation(output, schema) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnpackJsonMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnpackJsonMapping.scala index 659a27138..9663b7a5a 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnpackJsonMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UnpackJsonMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
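// [Editor's note, not part of the patch] UnitMapping.inputs above only reports *external*
// dependencies: anything produced by a mapping inside the unit itself is filtered out. A
// self-contained sketch (illustrative names, not Flowman's API) of that filter logic:
object ExternalInputsDemo {
    final case class OutputId(project: Option[String], name: String)

    def main(args: Array[String]): Unit = {
        val ownMappings = Set("raw", "cleaned")
        val allInputs = Seq(
            OutputId(None, "raw"),                  // produced inside the unit -> dropped
            OutputId(None, "lookup_table"),         // not produced inside the unit -> kept
            OutputId(Some("other_project"), "raw")  // references another project -> kept
        )
        val external = allInputs
            .filter(dep => dep.project.nonEmpty || !ownMappings.contains(dep.name))
            .toSet
        println(external.size) // 2
    }
}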
@@ -55,8 +55,8 @@ case class UnpackJsonMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input) } /** @@ -105,7 +105,9 @@ case class UnpackJsonMapping( val fields = schema.fields ++ columns.map(c => Field(Option(c.alias).getOrElse(c.name), StructType(c.schema.fields))) val result = StructType(fields) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UpsertMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UpsertMapping.scala index fe1c204fd..12ff8c45f 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UpsertMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/UpsertMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -40,8 +40,8 @@ case class UpsertMapping( * * @return */ - override def inputs : Seq[MappingOutputIdentifier] = { - Seq(input, updates) + override def inputs : Set[MappingOutputIdentifier] = { + Set(input, updates) } /** @@ -88,7 +88,9 @@ case class UpsertMapping( val result = input(this.input) - Map("main" -> result) + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ValuesMapping.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ValuesMapping.scala index 3142e9e35..31766aad6 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ValuesMapping.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/mapping/ValuesMapping.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -53,40 +53,39 @@ case class ValuesMapping( * * @return */ - override def inputs: Seq[MappingOutputIdentifier] = Seq() + override def inputs: Set[MappingOutputIdentifier] = Set.empty /** - * Executes this Mapping and returns a corresponding map of DataFrames per output + * Creates an output identifier for the primary output * - * @param execution - * @param input * @return */ - override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = { - val recordsSchema = StructType(schema.map(_.fields).getOrElse(columns)) - val sparkSchema = recordsSchema.sparkType - - val values = records.map(_.toArray(recordsSchema)) - val df = DataFrameBuilder.ofStringValues(execution.spark, values, sparkSchema) - Map("main" -> df) + override def output: MappingOutputIdentifier = { + MappingOutputIdentifier(identifier, "main") } - /** - * Creates an output identifier for the primary output + * Lists all outputs of this mapping. Every mapping should have one "main" output * * @return */ - override def output: MappingOutputIdentifier = { - MappingOutputIdentifier(identifier, "main") - } + override def outputs: Set[String] = Set("main") /** - * Lists all outputs of this mapping. 
Every mapping should have one "main" output + * Executes this Mapping and returns a corresponding map of DataFrames per output * + * @param execution + * @param input * @return */ - override def outputs: Seq[String] = Seq("main") + override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = { + val recordsSchema = StructType(schema.map(_.fields).getOrElse(columns)) + val sparkSchema = recordsSchema.sparkType + + val values = records.map(_.toArray(recordsSchema)) + val df = DataFrameBuilder.ofStringValues(execution.spark, values, sparkSchema) + Map("main" -> df) + } /** * Returns the schema as produced by this mapping, relative to the given input schema. The map might not contain @@ -96,7 +95,11 @@ case class ValuesMapping( * @return */ override def describe(execution: Execution, input: Map[MappingOutputIdentifier, StructType]): Map[String, StructType] = { - Map("main" -> StructType(schema.map(_.fields).getOrElse(columns))) + val result = StructType(schema.map(_.fields).getOrElse(columns)) + + // Apply documentation + val schemas = Map("main" -> result) + applyDocumentation(schemas) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/measure/MeasureSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/measure/MeasureSpec.scala index 86c403173..730f55613 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/measure/MeasureSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/measure/MeasureSpec.scala @@ -43,6 +43,7 @@ object MeasureSpec extends TypeRegistry[MeasureSpec] { new JsonSubTypes.Type(name = "sql", value = classOf[SqlMeasureSpec]) )) abstract class MeasureSpec extends NamedSpec[Measure] { + @JsonProperty(value="kind", required = true) protected var kind: String = _ @JsonProperty(value="description", required = false) private var description: Option[String] = None override def instantiate(context: Context): Measure diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricRepository.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricRepository.scala new file mode 100644 index 000000000..fff558cc0 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricRepository.scala @@ -0,0 +1,193 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.metric + +import java.sql.Timestamp +import java.time.Instant +import java.util.Locale +import java.util.Properties + +import scala.concurrent.Await +import scala.concurrent.Future +import scala.concurrent.duration.Duration +import scala.language.higherKinds +import scala.util.Success +import scala.util.control.NonFatal + +import org.slf4j.LoggerFactory +import slick.jdbc.JdbcProfile + +import com.dimajix.flowman.metric.GaugeMetric +import com.dimajix.flowman.spec.connection.JdbcConnection +import com.dimajix.flowman.spec.metric.JdbcMetricRepository.Commit +import com.dimajix.flowman.spec.metric.JdbcMetricRepository.CommitLabel +import com.dimajix.flowman.spec.metric.JdbcMetricRepository.Measurement +import com.dimajix.flowman.spec.metric.JdbcMetricRepository.MetricLabel + + + +private[metric] object JdbcMetricRepository { + case class Commit( + id:Long, + ts:Timestamp + ) + case class CommitLabel( + commit_id:Long, + name:String, + value:String + ) + case class Measurement( + id:Long, + commit_id:Long, + name:String, + ts:Timestamp, + value:Double + ) + case class MetricLabel( + metric_id:Long, + name:String, + value:String + ) +} + + +private[metric] class JdbcMetricRepository( + connection: JdbcConnection, + val profile: JdbcProfile, + commitTable: String = "flowman_metric_commits", + commitLabelTable: String = "flowman_metric_commit_labels", + metricTable: String = "flowman_metrics", + metricLabelTable: String = "flowman_metric_labels" +) { + private val logger = LoggerFactory.getLogger(getClass) + + import profile.api._ + + private lazy val db = { + val url = connection.url + val driver = connection.driver + val props = new Properties() + connection.properties.foreach(kv => props.setProperty(kv._1, kv._2)) + connection.username.foreach(props.setProperty("user", _)) + connection.password.foreach(props.setProperty("password", _)) + logger.debug(s"Connecting via JDBC to $url with driver $driver") + val executor = slick.util.AsyncExecutor( + name="Flowman.jdbc_metric_sink", + minThreads = 20, + maxThreads = 20, + queueSize = 1000, + maxConnections = 20) + // Do not set username and password, since a bug in Slick would discard all other connection properties + Database.forURL(url, driver=driver, prop=props, executor=executor) + } + + class Commits(tag: Tag) extends Table[Commit](tag, commitTable) { + def id = column[Long]("id", O.PrimaryKey, O.AutoInc) + def ts = column[Timestamp]("ts") + + def * = (id, ts) <> (Commit.tupled, Commit.unapply) + } + class CommitLabels(tag: Tag) extends Table[CommitLabel](tag, commitLabelTable) { + def commit_id = column[Long]("commit_id") + def name = column[String]("name", O.Length(64)) + def value = column[String]("value", O.Length(64)) + + def pk = primaryKey(commitLabelTable + "_pk", (commit_id, name)) + def commit = foreignKey(commitLabelTable + "_fk", commit_id, commits)(_.id, onUpdate=ForeignKeyAction.Restrict, onDelete=ForeignKeyAction.Cascade) + def idx = index(commitLabelTable + "_idx", (name, value), unique = false) + + def * = (commit_id, name, value) <> (CommitLabel.tupled, CommitLabel.unapply) + } + class Metrics(tag: Tag) extends Table[Measurement](tag, metricTable) { + def id = column[Long]("id", O.PrimaryKey, O.AutoInc) + def commit_id = column[Long]("commit_id") + def name = column[String]("name", O.Length(64)) + def ts = column[Timestamp]("ts") + def value = column[Double]("value") + + def commit = foreignKey(metricTable + "_fk", commit_id, commits)(_.id, onUpdate=ForeignKeyAction.Restrict, 
onDelete=ForeignKeyAction.Cascade) + + def * = (id, commit_id, name, ts, value) <> (Measurement.tupled, Measurement.unapply) + } + class MetricLabels(tag: Tag) extends Table[MetricLabel](tag, metricLabelTable) { + def metric_id = column[Long]("metric_id") + def name = column[String]("name", O.Length(64)) + def value = column[String]("value", O.Length(64)) + + def pk = primaryKey(metricLabelTable + "_pk", (metric_id, name)) + def metric = foreignKey(metricLabelTable + "_fk", metric_id, metrics)(_.id, onUpdate=ForeignKeyAction.Restrict, onDelete=ForeignKeyAction.Cascade) + def idx = index(metricLabelTable + "_idx", (name, value), unique = false) + + def * = (metric_id, name, value) <> (MetricLabel.tupled, MetricLabel.unapply) + } + + val commits = TableQuery[Commits] + val commitLabels = TableQuery[CommitLabels] + val metrics = TableQuery[Metrics] + val metricLabels = TableQuery[MetricLabels] + + + def create() : Unit = { + import scala.concurrent.ExecutionContext.Implicits.global + val tables = Seq( + commits, + commitLabels, + metrics, + metricLabels + ) + + try { + val existing = db.run(profile.defaultTables) + val query = existing.flatMap(v => { + val names = v.map(mt => mt.name.name.toLowerCase(Locale.ROOT)) + val createIfNotExist = tables + .filter(table => !names.contains(table.baseTableRow.tableName.toLowerCase(Locale.ROOT))) + .map(_.schema.create) + db.run(DBIO.sequence(createIfNotExist)) + }) + Await.result(query, Duration.Inf) + } + catch { + case NonFatal(ex) => logger.error(s"Cannot connect to JDBC metric database to create tables: ${ex.getMessage}") + } + } + + def commit(metrics:Seq[GaugeMetric], labels:Map[String,String]) : Unit = { + implicit val ec = db.executor.executionContext + val ts = Timestamp.from(Instant.now()) + + val cmQuery = (commits returning commits.map(_.id) into((jm, id) => jm.copy(id=id))) += Commit(0, ts) + val commit = db.run(cmQuery).flatMap { commit => + val lbls = labels.map(l => CommitLabel(commit.id, l._1, l._2)) + val clQuery = commitLabels ++= lbls + db.run(clQuery).flatMap(_ => Future.successful(commit)) + } + + val result = commit.flatMap { commit => + Future.sequence(metrics.map { m => + val metrics = this.metrics + val mtQuery = (metrics returning metrics.map(_.id) into ((jm, id) => jm.copy(id = id))) += Measurement(0, commit.id, m.name, ts, m.value) + db.run(mtQuery).flatMap { metric => + val lbls = m.labels.map(l => MetricLabel(metric.id, l._1, l._2)) + val mlQuery = metricLabels ++= lbls + db.run(mlQuery) + } + }) + } + Await.result(result, Duration.Inf) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricSink.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricSink.scala new file mode 100644 index 000000000..0ed4bbf6c --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/JdbcMetricSink.scala @@ -0,0 +1,129 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
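// [Editor's note, not part of the patch] The repository above persists metrics in four tables:
// one row per run in the commit table (id, ts), its labels in the commit-label table keyed by
// (commit_id, name), one row per gauge metric in the metric table (id, commit_id, name, ts, value),
// and the per-metric labels keyed by (metric_id, name). The label tables reference their parent
// rows via foreign keys with ON DELETE CASCADE, so removing a commit also removes its labels.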
+ */
+
+package com.dimajix.flowman.spec.metric
+
+import java.sql.SQLRecoverableException
+import java.sql.SQLTransientException
+
+import com.fasterxml.jackson.annotation.JsonProperty
+import org.slf4j.LoggerFactory
+
+import com.dimajix.flowman.execution.Context
+import com.dimajix.flowman.execution.Status
+import com.dimajix.flowman.jdbc.JdbcUtils
+import com.dimajix.flowman.metric.AbstractMetricSink
+import com.dimajix.flowman.metric.GaugeMetric
+import com.dimajix.flowman.metric.MetricBoard
+import com.dimajix.flowman.metric.MetricSink
+import com.dimajix.flowman.model.Connection
+import com.dimajix.flowman.model.Reference
+import com.dimajix.flowman.spec.connection.ConnectionReferenceSpec
+import com.dimajix.flowman.spec.connection.JdbcConnection
+
+
+class JdbcMetricSink(
+    connection: Reference[Connection],
+    labels: Map[String,String] = Map(),
+    commitTable: String = "flowman_metric_commits",
+    commitLabelTable: String = "flowman_metric_commit_labels",
+    metricTable: String = "flowman_metrics",
+    metricLabelTable: String = "flowman_metric_labels"
+) extends AbstractMetricSink {
+    private val logger = LoggerFactory.getLogger(getClass)
+    private val retries:Int = 3
+    private val timeout:Int = 1000
+
+    override def commit(board:MetricBoard, status:Status): Unit = {
+        logger.info(s"Committing execution metrics to JDBC at '${jdbcConnection.url}'")
+        val rawLabels = this.labels
+        val labels = rawLabels.map { case(k,v) => k -> board.context.evaluate(v, Map("status" -> status.toString)) }
+
+        val metrics = board.metrics(catalog(board), status).collect {
+            case metric:GaugeMetric => metric
+        }
+
+        withRepository { session =>
+            session.commit(metrics, labels)
+        }
+    }
+
+    /**
+     * Performs a task with a JDBC session, automatically applying retries and timeouts
+     *
+     * @param query
+     * @tparam T
+     * @return
+     */
+    private def withRepository[T](query: JdbcMetricRepository => T) : T = {
+        def retry[T](n:Int)(fn: => T) : T = {
+            try {
+                fn
+            } catch {
+                case e @(_:SQLRecoverableException|_:SQLTransientException) if n > 1 => {
+                    logger.warn("Retrying after error while executing SQL: {}", e.getMessage)
+                    Thread.sleep(timeout)
+                    retry(n - 1)(fn)
+                }
+            }
+        }
+
+        retry(retries) {
+            ensureTables()
+            query(repository)
+        }
+    }
+
+    private lazy val jdbcConnection = connection.value.asInstanceOf[JdbcConnection]
+    private lazy val repository = new JdbcMetricRepository(
+        jdbcConnection,
+        JdbcUtils.getProfile(jdbcConnection.driver),
+        commitTable,
+        commitLabelTable,
+        metricTable,
+        metricLabelTable
+    )
+
+    private var tablesCreated:Boolean = false
+    private def ensureTables() : Unit = {
+        // Create the metric tables if they do not exist yet
+        if (!tablesCreated) {
+            repository.create()
+            tablesCreated = true
+        }
+    }
+}
+
+
+class JdbcMetricSinkSpec extends MetricSinkSpec {
+    @JsonProperty(value = "connection", required = true) private var connection:ConnectionReferenceSpec = _
+    @JsonProperty(value = "labels", required = false) private var labels:Map[String,String] = Map.empty
+    @JsonProperty(value = "commitTable", required = false) private var commitTable:String = "flowman_metric_commits"
+    @JsonProperty(value = "commitLabelTable", required = false) private var commitLabelTable:String = "flowman_metric_commit_labels"
+    @JsonProperty(value = "metricTable", required = false) private var metricTable:String = "flowman_metrics"
+    @JsonProperty(value = "metricLabelTable", required = false) private var metricLabelTable:String = "flowman_metric_labels"
+
+    override def instantiate(context: Context): MetricSink = {
new JdbcMetricSink( + connection.instantiate(context), + labels, + context.evaluate(commitTable), + context.evaluate(commitLabelTable), + context.evaluate(metricTable), + context.evaluate(metricLabelTable) + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSinkSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSinkSpec.scala index 3735cf67d..73772f272 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSinkSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSinkSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -33,6 +33,7 @@ object MetricSinkSpec extends TypeRegistry[MetricSinkSpec] { @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "kind") @JsonSubTypes(value = Array( new JsonSubTypes.Type(name = "console", value = classOf[ConsoleMetricSinkSpec]), + new JsonSubTypes.Type(name = "jdbc", value = classOf[JdbcMetricSinkSpec]), new JsonSubTypes.Type(name = "null", value = classOf[NullMetricSinkSpec]), new JsonSubTypes.Type(name = "prometheus", value = classOf[PrometheusMetricSinkSpec]) )) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSpec.scala index 10d6f91bc..4cbe762d1 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/MetricSpec.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019-2020 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -26,7 +26,7 @@ import com.dimajix.flowman.spec.Spec class MetricSpec extends Spec[MetricSelection] { - @JsonProperty(value = "name", required = true) var name:String = _ + @JsonProperty(value = "name", required = true) var name:Option[String] = None @JsonProperty(value = "labels", required = false) var labels:Map[String,String] = Map() @JsonProperty(value = "selector", required = true) var selector:SelectorSpec = _ @@ -46,8 +46,8 @@ class SelectorSpec extends Spec[Selector] { def instantiate(context: Context): Selector = { Selector( - name.map(context.evaluate), - context.evaluate(labels) + name.map(context.evaluate).map(_.r), + context.evaluate(labels).map { case(k,v) => k -> v.r } ) } } diff --git a/flowman-core/src/main/scala/com/dimajix/flowman/metric/PrometheusMetricSink.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSink.scala similarity index 77% rename from flowman-core/src/main/scala/com/dimajix/flowman/metric/PrometheusMetricSink.scala rename to flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSink.scala index 1709c333e..49d093ac5 100644 --- a/flowman-core/src/main/scala/com/dimajix/flowman/metric/PrometheusMetricSink.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/metric/PrometheusMetricSink.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -14,13 +14,14 @@ * limitations under the License. 
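// [Editor's note, not part of the patch] With the MetricSpec/SelectorSpec change above, selector
// names and label values are now compiled into regular expressions (via .r), so a single selector
// can match a whole family of metrics instead of one exact name. A self-contained illustration
// using plain Scala regexes (metric names below are made up for the example):
object RegexSelectorDemo {
    def main(args: Array[String]): Unit = {
        val namePattern = "target_records_.*".r
        def matches(metricName: String): Boolean =
            namePattern.pattern.matcher(metricName).matches()

        println(matches("target_records_written")) // true
        println(matches("job_runtime_seconds"))    // false
    }
}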
*/ -package com.dimajix.flowman.metric +package com.dimajix.flowman.spec.metric import java.io.IOException import java.net.URI import scala.util.control.NonFatal +import com.fasterxml.jackson.annotation.JsonProperty import org.apache.http.HttpResponse import org.apache.http.client.HttpResponseException import org.apache.http.client.ResponseHandler @@ -29,7 +30,12 @@ import org.apache.http.entity.StringEntity import org.apache.http.impl.client.HttpClients import org.slf4j.LoggerFactory +import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Status +import com.dimajix.flowman.metric.AbstractMetricSink +import com.dimajix.flowman.metric.GaugeMetric +import com.dimajix.flowman.metric.MetricBoard +import com.dimajix.flowman.metric.MetricSink class PrometheusMetricSink( @@ -47,7 +53,8 @@ extends AbstractMetricSink { val labels = rawLabels.map(l => l._1 -> board.context.evaluate(l._2, Map("status" -> status.toString))) val path = labels.map(kv => kv._1 + "/" + kv._2).mkString("/") val url = new URI(this.url).resolve("/metrics/" + path) - logger.info(s"Publishing all metrics to Prometheus at $url") + + logger.info(s"Committing metrics to Prometheus at '$url'") /* # TYPE some_metric counter @@ -89,9 +96,9 @@ extends AbstractMetricSink { } catch { case ex:HttpResponseException => - logger.warn(s"Got error response ${ex.getStatusCode} from Prometheus at $url: ${ex.toString}. Payload was:\n$payload") + logger.warn(s"Got error response ${ex.getStatusCode} from Prometheus at '$url': ${ex.getMessage}. Payload was:\n$payload") case NonFatal(ex) => - logger.warn(s"Cannot publishing metrics to Prometheus at $url: ${ex.toString}") + logger.warn(s"Error while publishing metrics to Prometheus at '$url': ${ex.getMessage}") } finally { httpClient.close() @@ -102,3 +109,16 @@ extends AbstractMetricSink { str.replace("\"","\\\"").replace("\n","").trim } } + + +class PrometheusMetricSinkSpec extends MetricSinkSpec { + @JsonProperty(value = "url", required = true) private var url:String = "" + @JsonProperty(value = "labels", required = false) private var labels:Map[String,String] = Map() + + override def instantiate(context: Context): MetricSink = { + new PrometheusMetricSink( + context.evaluate(url), + labels + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/FileRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/FileRelation.scala index e5bda9971..c68b818f6 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/FileRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/FileRelation.scala @@ -197,8 +197,9 @@ case class FileRelation( appendPartitionColumns(df1) } private def readSpark(execution:Execution, partitions:Map[String,FieldValue]) : DataFrame = { - val df = this.reader(execution, format, options) - .load(qualifiedLocation.toString) + val reader = this.reader(execution, format, options) + val reader1 = if (execution.fs.file(qualifiedLocation).isDirectory()) reader.option("basePath", qualifiedLocation.toString) else reader + val df = reader1.load(qualifiedLocation.toString) // Filter partitions val parts = MapIgnoreCase(this.partitions.map(p => p.name -> p)) @@ -224,7 +225,7 @@ case class FileRelation( else doWriteStaticPartitions(execution, df, partition, mode) - execution.refreshResource(ResourceIdentifier.ofFile(qualifiedLocation)) + provides.foreach(execution.refreshResource) } private def doWriteDynamicPartitions(execution:Execution, df:DataFrame, mode:OutputMode) : 
Unit = { val outputPath = qualifiedLocation @@ -411,6 +412,8 @@ case class FileRelation( throw new FileSystemException(qualifiedLocation.toString, "", "Cannot create directory.") } } + + provides.foreach(execution.refreshResource) } /** @@ -468,6 +471,8 @@ case class FileRelation( val fs = collector.fs fs.delete(qualifiedLocation, true) } + + provides.foreach(execution.refreshResource) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveRelation.scala index 45c7bd33b..8f162c066 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveRelation.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,10 +17,10 @@ package com.dimajix.flowman.spec.relation import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.slf4j.Logger import com.dimajix.common.Trilean +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.model.BaseRelation import com.dimajix.flowman.model.PartitionedRelation @@ -30,9 +30,7 @@ import com.dimajix.flowman.types.FieldValue abstract class HiveRelation extends BaseRelation with PartitionedRelation { protected val logger:Logger - def database: Option[String] - def table: String - def tableIdentifier: TableIdentifier = new TableIdentifier(table, database) + def table: TableIdentifier /** * Reads data from the relation, possibly from specific partitions @@ -46,10 +44,10 @@ abstract class HiveRelation extends BaseRelation with PartitionedRelation { require(execution != null) require(partitions != null) - logger.info(s"Reading Hive relation '$identifier' from table $tableIdentifier using partition values $partitions") + logger.info(s"Reading Hive relation '$identifier' from table $table using partition values $partitions") val reader = execution.spark.read - val tableDf = reader.table(tableIdentifier.unquotedString) + val tableDf = reader.table(table.unquotedString) val filteredDf = filterPartition(tableDf, partitions) applyInputSchema(execution, filteredDf) @@ -64,6 +62,6 @@ abstract class HiveRelation extends BaseRelation with PartitionedRelation { require(execution != null) val catalog = execution.catalog - catalog.tableExists(tableIdentifier) + catalog.tableExists(table) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveTableRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveTableRelation.scala index fc9bd6cf1..e75a93e0f 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveTableRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveTableRelation.scala @@ -24,7 +24,6 @@ import com.fasterxml.jackson.annotation.JsonProperty import org.apache.hadoop.fs.Path import org.apache.spark.sql.DataFrame import org.apache.spark.sql.SparkShim -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.PartitionAlreadyExistsException import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat @@ -49,6 +48,8 @@ import 
com.dimajix.flowman.catalog.TableChange.DropColumn import com.dimajix.flowman.catalog.TableChange.UpdateColumnComment import com.dimajix.flowman.catalog.TableChange.UpdateColumnNullability import com.dimajix.flowman.catalog.TableChange.UpdateColumnType +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.MigrationFailedException @@ -95,8 +96,7 @@ case class HiveTableRelation( override val instanceProperties:Relation.Properties, override val schema:Option[Schema] = None, override val partitions: Seq[PartitionField] = Seq(), - override val database: Option[String] = None, - override val table: String, + override val table: TableIdentifier, external: Boolean = false, location: Option[Path] = None, format: Option[String] = None, @@ -124,7 +124,7 @@ case class HiveTableRelation( // Only return Hive table partitions! val allPartitions = PartitionSchema(this.partitions).interpolate(partition) - allPartitions.map(p => ResourceIdentifier.ofHivePartition(table, database, p.toMap)).toSet + allPartitions.map(p => ResourceIdentifier.ofHivePartition(table, p.toMap)).toSet } /** @@ -133,7 +133,7 @@ case class HiveTableRelation( * @return */ override def provides : Set[ResourceIdentifier] = Set( - ResourceIdentifier.ofHiveTable(table, database) + ResourceIdentifier.ofHiveTable(table) ) /** @@ -142,7 +142,7 @@ case class HiveTableRelation( * @return */ override def requires : Set[ResourceIdentifier] = { - database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ super.requires + table.database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ super.requires } /** @@ -168,6 +168,8 @@ case class HiveTableRelation( writeSpark(execution, df, partitionSpec, mode) else throw new IllegalArgumentException("Hive relations only support write modes 'hive' and 'spark'") + + provides.foreach(execution.refreshResource) } /** @@ -184,7 +186,7 @@ case class HiveTableRelation( require(partitionSpec != null) require(mode != null) - logger.info(s"Writing Hive relation '$identifier' to table $tableIdentifier partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode' using Hive insert") + logger.info(s"Writing Hive relation '$identifier' to table $table partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode' using Hive insert") // Apply output schema before writing to Hive val outputDf = { @@ -197,10 +199,10 @@ case class HiveTableRelation( def loaded() : Boolean = { val catalog = execution.catalog if (partitionSpec.nonEmpty) { - catalog.partitionExists(tableIdentifier, partitionSpec) + catalog.partitionExists(table, partitionSpec) } else { - val location = catalog.getTableLocation(tableIdentifier) + val location = catalog.getTableLocation(table) val fs = location.getFileSystem(execution.hadoopConf) FileUtils.isValidHiveData(fs, location) } @@ -216,9 +218,9 @@ case class HiveTableRelation( case OutputMode.ERROR_IF_EXISTS => if (loaded()) { if (partitionSpec.nonEmpty) - throw new PartitionAlreadyExistsException(database.getOrElse(""), table, partitionSpec.mapValues(_.toString).toMap) + throw new PartitionAlreadyExistsException(table.database.getOrElse(""), table.table, partitionSpec.mapValues(_.toString).toMap) else - throw new TableAlreadyExistsException(database.getOrElse(""), table) + throw new TableAlreadyExistsException(table.database.getOrElse(""), table.table) } writeHiveTable(execution, outputDf, 
partitionSpec, mode) case _ => @@ -231,7 +233,7 @@ case class HiveTableRelation( val catalog = execution.catalog if (partitionSpec.nonEmpty) { - val hiveTable = catalog.getTable(TableIdentifier(table, database)) + val hiveTable = catalog.getTable(table) val query = df.queryExecution.logical val overwrite = mode == OutputMode.OVERWRITE || mode == OutputMode.OVERWRITE_DYNAMIC @@ -247,20 +249,20 @@ case class HiveTableRelation( SparkShim.withNewExecutionId(spark, qe)(qe.toRdd) // Finally refresh Hive partition - catalog.refreshPartition(tableIdentifier, partitionSpec) + catalog.refreshPartition(table, partitionSpec) } else { // If OVERWRITE is specified, remove all partitions for partitioned tables if (partitions.nonEmpty && mode == OutputMode.OVERWRITE) { - catalog.truncateTable(tableIdentifier) + catalog.truncateTable(table) } val writer = df.write .mode(mode.batchMode) .options(options) format.foreach(writer.format) - writer.insertInto(tableIdentifier.unquotedString) + writer.insertInto(table.unquotedString) - execution.catalog.refreshTable(tableIdentifier) + execution.catalog.refreshTable(table) } } @@ -279,7 +281,7 @@ case class HiveTableRelation( require(partitionSpec != null) require(mode != null) - logger.info(s"Writing Hive relation '$identifier' to table $tableIdentifier partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode' using direct write") + logger.info(s"Writing Hive relation '$identifier' to table $table partition ${HiveDialect.expr.partition(partitionSpec)} with mode '$mode' using direct write") if (location.isEmpty) throw new IllegalArgumentException("Hive table relation requires 'location' for direct write mode") @@ -300,10 +302,10 @@ case class HiveTableRelation( // Finally add Hive partition val catalog = execution.catalog if (partitionSpec.nonEmpty) { - catalog.addOrReplacePartition(tableIdentifier, partitionSpec, outputPath) + catalog.addOrReplacePartition(table, partitionSpec, outputPath) } else { - catalog.refreshTable(tableIdentifier) + catalog.refreshTable(table) } } @@ -324,13 +326,13 @@ case class HiveTableRelation( if (partitions.nonEmpty) { val partitionSchema = PartitionSchema(this.partitions) partitionSchema.interpolate(partitions).foreach { spec => - logger.info(s"Truncating Hive relation '$identifier' by truncating table $tableIdentifier partition ${HiveDialect.expr.partition(spec)}") - catalog.dropPartition(tableIdentifier, spec) + logger.info(s"Truncating Hive relation '$identifier' by truncating table $table partition ${HiveDialect.expr.partition(spec)}") + catalog.dropPartition(table, spec) } } else { - logger.info(s"Truncating Hive relation '$identifier' by truncating table $tableIdentifier") - catalog.truncateTable(tableIdentifier) + logger.info(s"Truncating Hive relation '$identifier' by truncating table $table") + catalog.truncateTable(table) } } @@ -343,14 +345,14 @@ case class HiveTableRelation( */ override def conforms(execution: Execution, migrationPolicy: MigrationPolicy): Trilean = { val catalog = execution.catalog - if (catalog.tableExists(tableIdentifier)) { + if (catalog.tableExists(table)) { if (schema.nonEmpty) { - val table = catalog.getTable(tableIdentifier) + val table = catalog.getTable(this.table) if (table.tableType == CatalogTableType.VIEW) { false } else { - val sourceSchema = com.dimajix.flowman.types.StructType.of(table.dataSchema) + val sourceTable = TableDefinition.ofTable(table) val targetSchema = { val dataSchema = com.dimajix.flowman.types.StructType(schema.get.fields) if (hiveVarcharSupported) @@ -358,8 
+360,9 @@ case class HiveTableRelation( else SchemaUtils.replaceCharVarchar(dataSchema) } + val targetTable = TableDefinition(this.table, targetSchema.fields) - !TableChange.requiresMigration(sourceSchema, targetSchema, migrationPolicy) + !TableChange.requiresMigration(sourceTable, targetTable, migrationPolicy) } } else { @@ -390,12 +393,12 @@ case class HiveTableRelation( if (partitions.nonEmpty) { val schema = PartitionSchema(partitions) val partitionSpec = schema.spec(partition) - catalog.tableExists(tableIdentifier) && - catalog.partitionExists(tableIdentifier, partitionSpec) + catalog.tableExists(table) && + catalog.partitionExists(table, partitionSpec) } else { - if (catalog.tableExists(tableIdentifier)) { - val location = catalog.getTableLocation(tableIdentifier) + if (catalog.tableExists(table)) { + val location = catalog.getTableLocation(table) val fs = location.getFileSystem(execution.hadoopConf) FileUtils.isValidHiveData(fs, location) } @@ -415,7 +418,7 @@ case class HiveTableRelation( if (!ifNotExists || exists(execution) == No) { val catalogSchema = HiveTableRelation.cleanupSchema(StructType(fields.map(_.catalogField))) - logger.info(s"Creating Hive table relation '$identifier' with table $tableIdentifier and schema\n${catalogSchema.treeString}") + logger.info(s"Creating Hive table relation '$identifier' with table $table and schema\n${catalogSchema.treeString}") if (schema.isEmpty) { throw new UnspecifiedSchemaException(identifier) } @@ -439,7 +442,7 @@ case class HiveTableRelation( outputFormat = s.outputFormat, serde = s.serde) case None => - throw new IllegalArgumentException(s"File format '$format' not supported in Hive relation '$identifier' while creating hive table $tableIdentifier") + throw new IllegalArgumentException(s"File format '$format' not supported in Hive relation '$identifier' while creating hive table $table") } } else { @@ -462,7 +465,7 @@ case class HiveTableRelation( // Configure catalog table by assembling all options val catalogTable = CatalogTable( - identifier = tableIdentifier, + identifier = table.toSpark, tableType = if (external) CatalogTableType.EXTERNAL @@ -486,6 +489,7 @@ case class HiveTableRelation( // Create table val catalog = execution.catalog catalog.createTable(catalogTable, false) + provides.foreach(execution.refreshResource) } } @@ -498,9 +502,10 @@ case class HiveTableRelation( require(execution != null) val catalog = execution.catalog - if (!ifExists || catalog.tableExists(tableIdentifier)) { - logger.info(s"Destroying Hive table relation '$identifier' by dropping table $tableIdentifier") - catalog.dropTable(tableIdentifier) + if (!ifExists || catalog.tableExists(table)) { + logger.info(s"Destroying Hive table relation '$identifier' by dropping table $table") + catalog.dropTable(table) + provides.foreach(execution.refreshResource) } } @@ -512,23 +517,24 @@ case class HiveTableRelation( require(execution != null) val catalog = execution.catalog - if (schema.nonEmpty && catalog.tableExists(tableIdentifier)) { - val table = catalog.getTable(tableIdentifier) + if (schema.nonEmpty && catalog.tableExists(table)) { + val table = catalog.getTable(this.table) if (table.tableType == CatalogTableType.VIEW) { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for HiveTable relation '$identifier' from VIEW to a TABLE $tableIdentifier, but migrations are disabled.") + logger.warn(s"Migration required for HiveTable relation '$identifier' from VIEW to a TABLE ${this.table}, but migrations are disabled.") 
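// [Editor's note, not part of the patch] A second pattern that repeats across the relation classes
// in this patch: every operation that creates, writes, migrates, truncates or destroys physical
// data now calls provides.foreach(execution.refreshResource), presumably so that anything the
// execution has cached about the affected resources (such as cached schemas) is invalidated.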
case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation HiveTable '$identifier' from VIEW to a TABLE $tableIdentifier, since migrations are disabled.") + logger.error(s"Cannot migrate relation HiveTable '$identifier' from VIEW to a TABLE ${this.table}, since migrations are disabled.") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER|MigrationStrategy.ALTER_REPLACE|MigrationStrategy.REPLACE => - logger.warn(s"TABLE target $tableIdentifier is currently a VIEW, dropping...") - catalog.dropView(tableIdentifier, false) + logger.warn(s"TABLE target ${this.table} is currently a VIEW, dropping...") + catalog.dropView(this.table, false) create(execution, false) + provides.foreach(execution.refreshResource) } } else { - val sourceSchema = com.dimajix.flowman.types.StructType.of(table.dataSchema) + val sourceTable = TableDefinition.ofTable(table) val targetSchema = { val dataSchema = com.dimajix.flowman.types.StructType(schema.get.fields) if (hiveVarcharSupported) @@ -536,31 +542,33 @@ case class HiveTableRelation( else SchemaUtils.replaceCharVarchar(dataSchema) } + val targetTable = TableDefinition(this.table, targetSchema.fields) - val requiresMigration = TableChange.requiresMigration(sourceSchema, targetSchema, migrationPolicy) + val requiresMigration = TableChange.requiresMigration(sourceTable, targetTable, migrationPolicy) if (requiresMigration) { - doMigration(execution, sourceSchema, targetSchema, migrationPolicy, migrationStrategy) + doMigration(execution, sourceTable, targetTable, migrationPolicy, migrationStrategy) + provides.foreach(execution.refreshResource) } } } } - private def doMigration(execution: Execution, currentSchema:com.dimajix.flowman.types.StructType, targetSchema:com.dimajix.flowman.types.StructType, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { + private def doMigration(execution: Execution, currentTable:TableDefinition, targetTable:TableDefinition, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for HiveTable relation '$identifier' of Hive table $tableIdentifier, but migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.warn(s"Migration required for HiveTable relation '$identifier' of Hive table $table, but migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation HiveTable '$identifier' of Hive table $tableIdentifier, since migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.error(s"Cannot migrate relation HiveTable '$identifier' of Hive table $table, since migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER => - val migrations = TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.exists(m => !supported(m))) { - logger.error(s"Cannot migrate relation HiveTable '$identifier' of Hive table $tableIdentifier, since that would require unsupported changes.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + 
logger.error(s"Cannot migrate relation HiveTable '$identifier' of Hive table $table, since that would require unsupported changes.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) } alter(migrations) case MigrationStrategy.ALTER_REPLACE => - val migrations = TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.forall(m => supported(m))) { alter(migrations) } @@ -572,13 +580,13 @@ case class HiveTableRelation( } def alter(migrations:Seq[TableChange]) : Unit = { - logger.info(s"Migrating HiveTable relation '$identifier', this will alter the Hive table $tableIdentifier. New schema:\n${targetSchema.treeString}") + logger.info(s"Migrating HiveTable relation '$identifier', this will alter the Hive table $table. New schema:\n${targetTable.schema.treeString}") if (migrations.isEmpty) { logger.warn("Empty list of migrations - nothing to do") } try { - execution.catalog.alterTable(tableIdentifier, migrations) + execution.catalog.alterTable(table, migrations) } catch { case NonFatal(ex) => throw new MigrationFailedException(identifier, ex) @@ -586,7 +594,7 @@ case class HiveTableRelation( } def recreate() : Unit = { - logger.info(s"Migrating HiveTable relation '$identifier', this will drop/create the Hive table $tableIdentifier.") + logger.info(s"Migrating HiveTable relation '$identifier', this will drop/create the Hive table $table.") try { destroy(execution, true) create(execution, true) @@ -610,7 +618,7 @@ case class HiveTableRelation( override protected def outputSchema(execution:Execution) : Option[StructType] = { // We specifically use the existing physical Hive schema - val currentSchema = execution.catalog.getTable(tableIdentifier).dataSchema + val currentSchema = execution.catalog.getTable(table).dataSchema // If a schema is explicitly specified, we use that one to back-merge VarChar(n) and Char(n). 
This // is mainly required for Spark < 3.1, which cannot correctly handle VARCHAR and CHAR types in Hive @@ -689,8 +697,7 @@ class HiveTableRelationSpec extends RelationSpec with SchemaRelationSpec with Pa instanceProperties(context), schema.map(_.instantiate(context)), partitions.map(_.instantiate(context)), - context.evaluate(database), - context.evaluate(table), + TableIdentifier(context.evaluate(table), context.evaluate(database).toSeq), context.evaluate(external).toBoolean, context.evaluate(location).map(p => new Path(p)), context.evaluate(format), diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelation.scala index c16cd7ee9..013aca5f8 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelation.scala @@ -19,7 +19,6 @@ package com.dimajix.flowman.spec.relation import com.fasterxml.jackson.annotation.JsonProperty import org.apache.hadoop.fs.Path import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.catalog.CatalogTable import org.apache.spark.sql.types.StructField import org.apache.spark.sql.types.StructType @@ -30,6 +29,7 @@ import com.dimajix.common.No import com.dimajix.common.SetIgnoreCase import com.dimajix.common.Trilean import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.ExecutionException @@ -48,7 +48,6 @@ import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Schema import com.dimajix.flowman.model.SchemaRelation -import com.dimajix.flowman.spec.schema.EmbeddedSchema import com.dimajix.flowman.transforms.SchemaEnforcer import com.dimajix.flowman.transforms.UnionTransformer import com.dimajix.flowman.types.FieldValue @@ -75,11 +74,9 @@ case class HiveUnionTableRelation( override val instanceProperties:Relation.Properties, override val schema:Option[Schema] = None, override val partitions: Seq[PartitionField] = Seq(), - tableDatabase: Option[String] = None, - tablePrefix: String, + tablePrefix: TableIdentifier, locationPrefix: Option[Path] = None, - viewDatabase: Option[String] = None, - view: String, + view: TableIdentifier, external: Boolean = false, format: Option[String] = None, options: Map[String,String] = Map(), @@ -91,15 +88,22 @@ case class HiveUnionTableRelation( ) extends BaseRelation with SchemaRelation with PartitionedRelation { private val logger = LoggerFactory.getLogger(classOf[HiveUnionTableRelation]) - def viewIdentifier: TableIdentifier = TableIdentifier(view, viewDatabase) - def tableIdentifier(version:Int) : TableIdentifier = { - TableIdentifier(tablePrefix + "_" + version.toString, tableDatabase) + private lazy val tableRegex : TableIdentifier = { + TableIdentifier(tablePrefix.table + "_[0-9]+", tablePrefix.database.orElse(view.database)) } + private lazy val viewIdentifier : TableIdentifier = { + TableIdentifier(view.table, view.database.orElse(tablePrefix.database)) + } + private def tableIdentifier(version:Int) : TableIdentifier = { + TableIdentifier(tablePrefix.table + "_" + version.toString, tablePrefix.database.orElse(view.database)) + } + + private def resolve(execution: Execution, 
table:TableIdentifier) : TableIdentifier = TableIdentifier(table.table, table.database.orElse(view.database).orElse(Some(execution.catalog.currentDatabase))) private def listTables(executor: Execution) : Seq[TableIdentifier] = { val catalog = executor.catalog - val regex = (TableIdentifier(tablePrefix, tableDatabase.orElse(Some(catalog.currentDatabase))).unquotedString + "_[0-9]+").r - catalog.listTables(tableDatabase.getOrElse(catalog.currentDatabase), tablePrefix + "_*") + val regex = resolve(executor, tableRegex).unquotedString.r + catalog.listTables(tablePrefix.database.getOrElse(catalog.currentDatabase), tablePrefix.table + "_*") .filter { table => table.unquotedString match { case regex() => true @@ -110,7 +114,7 @@ case class HiveUnionTableRelation( private def tableRelation(version:Int) : HiveTableRelation = tableRelation( - TableIdentifier(tablePrefix + "_" + version.toString, tableDatabase), + TableIdentifier(tablePrefix.table + "_" + version.toString, tablePrefix.database.orElse(view.database)), locationPrefix.map(p => new Path(p.toString + "_" + version.toString)) ) @@ -118,8 +122,7 @@ case class HiveUnionTableRelation( instanceProperties, schema, partitions, - tableIdentifier.database, - tableIdentifier.table, + tableIdentifier, external, location, format, @@ -135,7 +138,6 @@ case class HiveUnionTableRelation( private def viewRelationFromSql(sql:String) : HiveViewRelation = { HiveViewRelation( instanceProperties, - viewDatabase, view, partitions, Some(sql), @@ -158,8 +160,8 @@ case class HiveUnionTableRelation( * @return */ override def provides: Set[ResourceIdentifier] = Set( - ResourceIdentifier.ofHiveTable(tablePrefix + "_[0-9]+", tableDatabase), - ResourceIdentifier.ofHiveTable(view, viewDatabase.orElse(tableDatabase)) + ResourceIdentifier.ofHiveTable(tableRegex), + ResourceIdentifier.ofHiveTable(viewIdentifier) ) /** @@ -168,8 +170,8 @@ case class HiveUnionTableRelation( * @return */ override def requires: Set[ResourceIdentifier] = { - tableDatabase.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ - viewDatabase.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ + tablePrefix.database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ + viewIdentifier.database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet ++ super.requires } @@ -188,7 +190,7 @@ case class HiveUnionTableRelation( // Only return Hive table partitions! 
val allPartitions = PartitionSchema(this.partitions).interpolate(partition) - allPartitions.map(p => ResourceIdentifier.ofHivePartition(tablePrefix + "_[0-9]+", tableDatabase, p.toMap)).toSet + allPartitions.map(p => ResourceIdentifier.ofHivePartition(tableRegex, p.toMap)).toSet } @@ -416,7 +418,7 @@ case class HiveUnionTableRelation( // Create initial view val spark = execution.spark - val df = spark.read.table(hiveTableRelation.tableIdentifier.unquotedString) + val df = spark.read.table(hiveTableRelation.table.unquotedString) val sql = new SqlBuilder(df).toSQL val hiveViewRelation = viewRelationFromSql(sql) hiveViewRelation.create(execution, ifNotExists) @@ -516,7 +518,7 @@ case class HiveUnionTableRelation( private def doMigrateAlterTable(execution:Execution, table:CatalogTable, rawMissingFields:Seq[StructField], migrationStrategy:MigrationStrategy) : Unit = { doMigrate(migrationStrategy) { val catalog = execution.catalog - val id = table.identifier + val id = TableIdentifier.of(table.identifier) val targetSchema = table.dataSchema val missingFields = HiveTableRelation.cleanupFields(rawMissingFields) val newSchema = StructType(targetSchema.fields ++ missingFields) @@ -541,7 +543,7 @@ case class HiveUnionTableRelation( class HiveUnionTableRelationSpec extends RelationSpec with SchemaRelationSpec with PartitionedRelationSpec { @JsonProperty(value = "tableDatabase", required = false) private var tableDatabase: Option[String] = None - @JsonProperty(value = "tablePrefix", required = true) private var tablePrefix: String = "" + @JsonProperty(value = "tablePrefix", required = true) private var tablePrefix: String = "zz" @JsonProperty(value = "locationPrefix", required = false) private var locationPrefix: Option[String] = None @JsonProperty(value = "viewDatabase", required = false) private var viewDatabase: Option[String] = None @JsonProperty(value = "view", required = true) private var view: String = "" @@ -564,11 +566,9 @@ class HiveUnionTableRelationSpec extends RelationSpec with SchemaRelationSpec wi instanceProperties(context), schema.map(_.instantiate(context)), partitions.map(_.instantiate(context)), - context.evaluate(tableDatabase), - context.evaluate(tablePrefix), + TableIdentifier(context.evaluate(tablePrefix), context.evaluate(tableDatabase)), context.evaluate(locationPrefix).map(p => new Path(context.evaluate(p))), - context.evaluate(viewDatabase), - context.evaluate(view), + TableIdentifier(context.evaluate(view), context.evaluate(viewDatabase)), context.evaluate(external).toBoolean, context.evaluate(format), context.evaluate(options), diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveViewRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveViewRelation.scala index d96ef6c17..0e10b1370 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveViewRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/HiveViewRelation.scala @@ -23,6 +23,7 @@ import org.slf4j.LoggerFactory import com.dimajix.common.Trilean import com.dimajix.flowman.catalog.HiveCatalog +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.MappingUtils @@ -44,8 +45,7 @@ import com.dimajix.spark.sql.catalyst.SqlBuilder case class HiveViewRelation( override val instanceProperties:Relation.Properties, - override val database: Option[String], - override val table: String, + override val 
table: TableIdentifier, override val partitions: Seq[PartitionField] = Seq(), sql: Option[String] = None, mapping: Option[MappingOutputIdentifier] = None @@ -63,7 +63,7 @@ case class HiveViewRelation( mapping.map(m => MappingUtils.requires(context, m.mapping)) .orElse( // Only return Hive Table Partitions! - sql.map(s => SqlParser.resolveDependencies(s).map(t => ResourceIdentifier.ofHivePartition(t, Map()).asInstanceOf[ResourceIdentifier])) + sql.map(s => SqlParser.resolveDependencies(s).map(t => ResourceIdentifier.ofHivePartition(t, Map.empty[String,Any]).asInstanceOf[ResourceIdentifier])) ) .getOrElse(Set()) } @@ -74,7 +74,7 @@ case class HiveViewRelation( * @return */ override def provides : Set[ResourceIdentifier] = Set( - ResourceIdentifier.ofHiveTable(table, database) + ResourceIdentifier.ofHiveTable(table) ) /** @@ -83,7 +83,7 @@ case class HiveViewRelation( * @return */ override def requires : Set[ResourceIdentifier] = { - val db = database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet + val db = table.database.map(db => ResourceIdentifier.ofHiveDatabase(db)).toSet val other = mapping.map {m => MappingUtils.requires(context, m.mapping) // Replace all Hive partitions with Hive tables @@ -120,13 +120,13 @@ case class HiveViewRelation( */ override def conforms(execution: Execution, migrationPolicy: MigrationPolicy): Trilean = { val catalog = execution.catalog - if (catalog.tableExists(tableIdentifier)) { + if (catalog.tableExists(table)) { val newSelect = getSelect(execution) - val curTable = catalog.getTable(tableIdentifier) + val curTable = catalog.getTable(table) // Check if current table is a VIEW or a table if (curTable.tableType == CatalogTableType.VIEW) { // Check that both SQL and schema are correct - val curTable = catalog.getTable(tableIdentifier) + val curTable = catalog.getTable(table) val curSchema = SchemaUtils.normalize(curTable.schema) val newSchema = SchemaUtils.normalize(catalog.spark.sql(newSelect).schema) curTable.viewText.get == newSelect && curSchema == newSchema @@ -161,9 +161,10 @@ case class HiveViewRelation( override def create(execution:Execution, ifNotExists:Boolean=false) : Unit = { val select = getSelect(execution) val catalog = execution.catalog - if (!ifNotExists || !catalog.tableExists(tableIdentifier)) { - logger.info(s"Creating Hive view relation '$identifier' with VIEW $tableIdentifier") - catalog.createView(tableIdentifier, select, ifNotExists) + if (!ifNotExists || !catalog.tableExists(table)) { + logger.info(s"Creating Hive view relation '$identifier' with VIEW $table") + catalog.createView(table, select, ifNotExists) + provides.foreach(execution.refreshResource) } } @@ -174,9 +175,9 @@ case class HiveViewRelation( */ override def migrate(execution:Execution, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { val catalog = execution.catalog - if (catalog.tableExists(tableIdentifier)) { + if (catalog.tableExists(table)) { val newSelect = getSelect(execution) - val curTable = catalog.getTable(tableIdentifier) + val curTable = catalog.getTable(table) // Check if current table is a VIEW or a table if (curTable.tableType == CatalogTableType.VIEW) { migrateFromView(catalog, newSelect, migrationStrategy) @@ -184,23 +185,24 @@ case class HiveViewRelation( else { migrateFromTable(catalog, newSelect, migrationStrategy) } + provides.foreach(execution.refreshResource) } } private def migrateFromView(catalog:HiveCatalog, newSelect:String, migrationStrategy:MigrationStrategy) : Unit = { - val curTable = 
catalog.getTable(tableIdentifier) + val curTable = catalog.getTable(table) val curSchema = SchemaUtils.normalize(curTable.schema) val newSchema = SchemaUtils.normalize(catalog.spark.sql(newSelect).schema) if (curTable.viewText.get != newSelect || curSchema != newSchema) { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for HiveView relation '$identifier' of Hive view $tableIdentifier, but migrations are disabled.") + logger.warn(s"Migration required for HiveView relation '$identifier' of Hive view $table, but migrations are disabled.") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation HiveView '$identifier' of Hive view $tableIdentifier, since migrations are disabled.") + logger.error(s"Cannot migrate relation HiveView '$identifier' of Hive view $table, since migrations are disabled.") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER|MigrationStrategy.ALTER_REPLACE|MigrationStrategy.REPLACE => - logger.info(s"Migrating HiveView relation '$identifier' with VIEW $tableIdentifier") - catalog.alterView(tableIdentifier, newSelect) + logger.info(s"Migrating HiveView relation '$identifier' with VIEW $table") + catalog.alterView(table, newSelect) } } } @@ -208,14 +210,14 @@ case class HiveViewRelation( private def migrateFromTable(catalog:HiveCatalog, newSelect:String, migrationStrategy:MigrationStrategy) : Unit = { migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for HiveView relation '$identifier' from TABLE to a VIEW $tableIdentifier, but migrations are disabled.") + logger.warn(s"Migration required for HiveView relation '$identifier' from TABLE to a VIEW $table, but migrations are disabled.") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation HiveView '$identifier' from TABLE to a VIEW $tableIdentifier, since migrations are disabled.") + logger.error(s"Cannot migrate relation HiveView '$identifier' from TABLE to a VIEW $table, since migrations are disabled.") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER|MigrationStrategy.ALTER_REPLACE|MigrationStrategy.REPLACE => - logger.info(s"Migrating HiveView relation '$identifier' from TABLE to a VIEW $tableIdentifier") - catalog.dropTable(tableIdentifier, false) - catalog.createView(tableIdentifier, newSelect, false) + logger.info(s"Migrating HiveView relation '$identifier' from TABLE to a VIEW $table") + catalog.dropTable(table, false) + catalog.createView(table, newSelect, false) } } @@ -225,9 +227,10 @@ case class HiveViewRelation( */ override def destroy(execution:Execution, ifExists:Boolean=false) : Unit = { val catalog = execution.catalog - if (!ifExists || catalog.tableExists(tableIdentifier)) { - logger.info(s"Destroying Hive view relation '$identifier' with VIEW $tableIdentifier") - catalog.dropView(tableIdentifier) + if (!ifExists || catalog.tableExists(table)) { + logger.info(s"Destroying Hive view relation '$identifier' with VIEW $table") + catalog.dropView(table) + provides.foreach(execution.refreshResource) } } @@ -235,7 +238,7 @@ case class HiveViewRelation( val select = sql.orElse(mapping.map(id => buildMappingSql(executor, id))) .getOrElse(throw new IllegalArgumentException("HiveView either requires explicit SQL SELECT statement or mapping")) - logger.debug(s"Hive SQL SELECT statement for VIEW $tableIdentifier: $select") + logger.debug(s"Hive SQL SELECT statement for VIEW $table: $select") select } @@ -263,8 +266,7 @@ class HiveViewRelationSpec 
extends RelationSpec with PartitionedRelationSpec{ override def instantiate(context: Context): HiveViewRelation = { HiveViewRelation( instanceProperties(context), - context.evaluate(database), - context.evaluate(view), + TableIdentifier(context.evaluate(view), context.evaluate(database)), partitions.map(_.instantiate(context)), context.evaluate(sql), context.evaluate(mapping).map(MappingOutputIdentifier.parse) diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/IndexedRelationSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/IndexedRelationSpec.scala new file mode 100644 index 000000000..86fbb9417 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/IndexedRelationSpec.scala @@ -0,0 +1,42 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.relation + +import com.fasterxml.jackson.annotation.JsonProperty + +import com.dimajix.flowman.catalog.TableIndex +import com.dimajix.flowman.execution.Context + + +class IndexSpec { + @JsonProperty(value = "name", required = true) protected var name: String = _ + @JsonProperty(value = "columns", required = true) protected var columns: Seq[String] = Seq.empty + @JsonProperty(value = "unique", required = true) protected var unique: String = "false" + + def instantiate(context:Context) : TableIndex = { + TableIndex( + context.evaluate(name), + columns.map(context.evaluate), + context.evaluate(unique).toBoolean + ) + } +} + + +trait IndexedRelationSpec { this: RelationSpec => + @JsonProperty(value = "indexes", required = false) protected var indexes: Seq[IndexSpec] = Seq.empty +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/JdbcRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/JdbcRelation.scala index 70f46b4d8..c55bf4428 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/JdbcRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/JdbcRelation.scala @@ -29,19 +29,21 @@ import com.fasterxml.jackson.annotation.JsonProperty import org.apache.spark.sql.Column import org.apache.spark.sql.DataFrame import org.apache.spark.sql.SaveMode -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.PartitionAlreadyExistsException -import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions import org.apache.spark.sql.functions.col import org.apache.spark.sql.types.StructType +import org.slf4j.Logger import org.slf4j.LoggerFactory import com.dimajix.common.SetIgnoreCase import com.dimajix.common.Trilean import com.dimajix.flowman.catalog.TableChange +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier 
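The new index support added via `IndexedRelationSpec` reduces to a small description per index: a name, the indexed columns, and a uniqueness flag. As a rough illustration of what such a description corresponds to on the database side, the sketch below renders a simplified index definition into generic ANSI-style DDL; `TableIndexSketch` and `createIndexSql` are illustrative stand-ins only, since in the patch itself table and index creation is delegated to `JdbcUtils.createTable` and the SQL dialects:

```scala
// Simplified stand-in for an index definition: name, columns, unique flag.
final case class TableIndexSketch(name: String, columns: Seq[String], unique: Boolean = false)

object IndexDdlSketch {
  /** Render a simplified index definition into a generic CREATE INDEX statement. */
  def createIndexSql(table: String, index: TableIndexSketch): String = {
    val uniqueKeyword = if (index.unique) "UNIQUE " else ""
    s"CREATE ${uniqueKeyword}INDEX ${index.name} ON $table (${index.columns.mkString(", ")})"
  }

  def main(args: Array[String]): Unit = {
    val idx = TableIndexSketch("idx_customer_name", Seq("last_name", "first_name"))
    // prints: CREATE INDEX idx_customer_name ON sales.customers (last_name, first_name)
    println(createIndexSql("sales.customers", idx))
  }
}
```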
+import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.Execution @@ -56,7 +58,6 @@ import com.dimajix.flowman.execution.UpdateClause import com.dimajix.flowman.jdbc.JdbcUtils import com.dimajix.flowman.jdbc.SqlDialect import com.dimajix.flowman.jdbc.SqlDialects -import com.dimajix.flowman.jdbc.TableDefinition import com.dimajix.flowman.model.BaseRelation import com.dimajix.flowman.model.Connection import com.dimajix.flowman.model.PartitionField @@ -72,24 +73,44 @@ import com.dimajix.flowman.spec.connection.JdbcConnection import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.SingleValue import com.dimajix.flowman.types.{StructType => FlowmanStructType} -import com.dimajix.spark.sql.SchemaUtils -case class JdbcRelation( +class JdbcRelationBase( override val instanceProperties:Relation.Properties, override val schema:Option[Schema] = None, - override val partitions: Seq[PartitionField] = Seq(), + override val partitions: Seq[PartitionField] = Seq.empty, connection: Reference[Connection], - properties: Map[String,String] = Map(), - database: Option[String] = None, - table: Option[String] = None, + properties: Map[String,String] = Map.empty, + table: Option[TableIdentifier] = None, query: Option[String] = None, - mergeKey: Seq[String] = Seq(), - primaryKey: Seq[String] = Seq() + mergeKey: Seq[String] = Seq.empty, + primaryKey: Seq[String] = Seq.empty, + indexes: Seq[TableIndex] = Seq.empty ) extends BaseRelation with PartitionedRelation with SchemaRelation { - private val logger = LoggerFactory.getLogger(classOf[JdbcRelation]) + protected val logger: Logger = LoggerFactory.getLogger(getClass) + protected val tableIdentifier: TableIdentifier = table.getOrElse(TableIdentifier.empty) + protected lazy val tableDefinition: Option[TableDefinition] = { + schema.map { schema => + val pk = if (primaryKey.nonEmpty) primaryKey else schema.primaryKey + + // Make Primary key columns not-nullable + val pkSet = SetIgnoreCase(pk) + val columns = fullSchema.get.fields.map { f => + if (pkSet.contains(f.name)) + f.copy(nullable=false) + else + f + } - def tableIdentifier : TableIdentifier = TableIdentifier(table.getOrElse(""), database) + TableDefinition( + tableIdentifier, + columns, + schema.description, + pk, + indexes + ) + } + } if (query.nonEmpty && table.nonEmpty) throw new IllegalArgumentException(s"JDBC relation '$identifier' cannot have both a table and a SQL query defined") @@ -105,7 +126,7 @@ case class JdbcRelation( override def provides: Set[ResourceIdentifier] = { // Only return a resource if a table is defined, which implies that this relation can be used for creating // and destroying JDBC tables - table.map(t => ResourceIdentifier.ofJdbcTable(t, database)).toSet + table.map(t => ResourceIdentifier.ofJdbcTable(t)).toSet } /** @@ -116,7 +137,7 @@ case class JdbcRelation( override def requires: Set[ResourceIdentifier] = { // Only return a resource if a table is defined, which implies that this relation can be used for creating // and destroying JDBC tables - database.map(db => ResourceIdentifier.ofJdbcDatabase(db)).toSet ++ super.requires + table.flatMap(_.database.map(db => ResourceIdentifier.ofJdbcDatabase(db))).toSet ++ super.requires } /** @@ -137,7 +158,7 @@ case class JdbcRelation( } else { val allPartitions = PartitionSchema(this.partitions).interpolate(partitions) - allPartitions.map(p => ResourceIdentifier.ofJdbcTablePartition(table.get, 
database, p.toMap)).toSet + allPartitions.map(p => ResourceIdentifier.ofJdbcTablePartition(tableIdentifier, p.toMap)).toSet } } @@ -147,8 +168,8 @@ case class JdbcRelation( * @param execution * @return */ - override def describe(execution:Execution) : FlowmanStructType = { - if (schema.nonEmpty) { + override def describe(execution:Execution, partitions:Map[String,FieldValue] = Map()) : FlowmanStructType = { + val result = if (schema.nonEmpty) { FlowmanStructType(fields) } else { @@ -156,6 +177,8 @@ case class JdbcRelation( JdbcUtils.getSchema(con, tableIdentifier, options) } } + + applyDocumentation(result) } /** @@ -168,7 +191,7 @@ case class JdbcRelation( require(partitions != null) // Get Connection - val (_,props) = createProperties() + val (_,props) = createConnectionProperties() // Read from database. We do not use this.reader, because Spark JDBC sources do not support explicit schemas val reader = execution.spark.read @@ -218,43 +241,59 @@ case class JdbcRelation( // Write partition into DataBase mode match { case OutputMode.OVERWRITE if partition.isEmpty => - withConnection { (con, options) => - JdbcUtils.truncateTable(con, tableIdentifier, options) - } - doWrite(execution, dfExt) + doOverwriteAll(execution, dfExt) case OutputMode.OVERWRITE => - withStatement { (statement, options) => - val dialect = SqlDialects.get(options.url) - val condition = partitionCondition(dialect, partition) - val query = "DELETE FROM " + dialect.quote(tableIdentifier) + " WHERE " + condition - statement.executeUpdate(query) - } - doWrite(execution, dfExt) + doOverwritePartition(execution, dfExt, partition) case OutputMode.APPEND => - doWrite(execution, dfExt) + doAppend(execution, dfExt) case OutputMode.IGNORE_IF_EXISTS => if (!checkPartition(partition)) { - doWrite(execution, dfExt) + doAppend(execution, dfExt) } case OutputMode.ERROR_IF_EXISTS => if (!checkPartition(partition)) { - doWrite(execution, dfExt) + doAppend(execution, dfExt) } else { - throw new PartitionAlreadyExistsException(database.getOrElse(""), table.get, partition.mapValues(_.value)) + throw new PartitionAlreadyExistsException(tableIdentifier.database.getOrElse(""), tableIdentifier.table, partition.mapValues(_.value)) } + case OutputMode.UPDATE => + doUpdate(execution, df) case _ => throw new IllegalArgumentException(s"Unsupported save mode: '$mode'. 
" + - "Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 'errorifexists'.") + "Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 'update', 'errorifexists'.") + } + } + protected def doOverwriteAll(execution: Execution, df:DataFrame) : Unit = { + withConnection { (con, options) => + JdbcUtils.truncateTable(con, tableIdentifier, options) } + doAppend(execution, df) } - private def doWrite(execution: Execution, df:DataFrame): Unit = { - val (_,props) = createProperties() + protected def doOverwritePartition(execution: Execution, df:DataFrame, partition:Map[String,SingleValue]) : Unit = { + withStatement { (statement, options) => + val dialect = SqlDialects.get(options.url) + val condition = partitionCondition(dialect, partition) + val query = "DELETE FROM " + dialect.quote(tableIdentifier) + " WHERE " + condition + statement.executeUpdate(query) + } + doAppend(execution, df) + } + protected def doAppend(execution: Execution, df:DataFrame): Unit = { + val (_,props) = createConnectionProperties() this.writer(execution, df, "jdbc", Map(), SaveMode.Append) .options(props) .option(JDBCOptions.JDBC_TABLE_NAME, tableIdentifier.unquotedString) .save() } + protected def doUpdate(execution: Execution, df:DataFrame): Unit = { + val mergeCondition = this.mergeCondition + val clauses = Seq( + InsertClause(), + UpdateClause() + ) + doMerge(execution, df, mergeCondition, clauses) + } /** * Performs a merge operation. Either you need to specify a [[mergeKey]], or the relation needs to provide some @@ -270,30 +309,33 @@ case class JdbcRelation( if (query.nonEmpty) throw new UnsupportedOperationException(s"Cannot write into JDBC relation '$identifier' which is defined by an SQL query") - val mergeCondition = - condition.getOrElse { - val withinPartitionKeyColumns = - if (mergeKey.nonEmpty) - mergeKey - else if (primaryKey.nonEmpty) - primaryKey - else if (schema.exists(_.primaryKey.nonEmpty)) - schema.map(_.primaryKey).get - else - throw new IllegalArgumentException(s"Merging JDBC relation '$identifier' requires primary key in schema, explicit merge key or merge condition") - (SetIgnoreCase(partitions.map(_.name)) ++ withinPartitionKeyColumns) - .toSeq - .map(c => col("source." + c) === col("target." 
+ c)) - .reduce(_ && _) - } - - val sourceColumns = collectColumns(mergeCondition.expr, "source") ++ clauses.flatMap(c => collectColumns(df.schema, c, "source")) + val mergeCondition = condition.getOrElse(this.mergeCondition) + doMerge(execution, df, mergeCondition, clauses) + } + protected def doMerge(execution: Execution, df: DataFrame, condition:Column, clauses: Seq[MergeClause]) : Unit = { + val sourceColumns = collectColumns(condition.expr, "source") ++ clauses.flatMap(c => collectColumns(df.schema, c, "source")) val sourceDf = df.select(sourceColumns.toSeq.map(col):_*) - val (url, props) = createProperties() + val (url, props) = createConnectionProperties() val options = new JDBCOptions(url, tableIdentifier.unquotedString, props) val targetSchema = outputSchema(execution) - JdbcUtils.mergeTable(tableIdentifier, "target", targetSchema, sourceDf, "source", mergeCondition, clauses, options) + JdbcUtils.mergeTable(tableIdentifier, "target", targetSchema, sourceDf, "source", condition, clauses, options) + } + + protected def mergeCondition : Column = { + val withinPartitionKeyColumns = + if (mergeKey.nonEmpty) + mergeKey + else if (primaryKey.nonEmpty) + primaryKey + else if (schema.exists(_.primaryKey.nonEmpty)) + schema.map(_.primaryKey).get + else + throw new IllegalArgumentException(s"Merging JDBC relation '$identifier' requires primary key in schema, explicit merge key or merge condition") + (SetIgnoreCase(partitions.map(_.name)) ++ withinPartitionKeyColumns) + .toSeq + .map(c => col("source." + c) === col("target." + c)) + .reduce(_ && _) } /** @@ -359,13 +401,11 @@ case class JdbcRelation( else { withConnection { (con, options) => if (JdbcUtils.tableExists(con, tableIdentifier, options)) { - if (schema.nonEmpty) { - val targetSchema = fullSchema.get - val currentSchema = JdbcUtils.getSchema(con, tableIdentifier, options) - !TableChange.requiresMigration(currentSchema, targetSchema, migrationPolicy) - } - else { - true + tableDefinition match { + case Some(targetTable) => + val currentTable = JdbcUtils.getTable(con, tableIdentifier, options) + !TableChange.requiresMigration(currentTable, targetTable, migrationPolicy) + case None => true } } else { @@ -411,23 +451,22 @@ case class JdbcRelation( withConnection{ (con,options) => if (!ifNotExists || !JdbcUtils.tableExists(con, tableIdentifier, options)) { doCreate(con, options) + provides.foreach(execution.refreshResource) } } } - private def doCreate(con:java.sql.Connection, options:JDBCOptions): Unit = { - logger.info(s"Creating JDBC relation '$identifier', this will create JDBC table $tableIdentifier with schema\n${this.schema.map(_.treeString).orNull}") - if (this.schema.isEmpty) { - throw new UnspecifiedSchemaException(identifier) + protected def doCreate(con:java.sql.Connection, options:JDBCOptions): Unit = { + val pk = tableDefinition.filter(_.primaryKey.nonEmpty).map(t => s"\n Primary key ${t.primaryKey.mkString(",")}").getOrElse("") + val idx = tableDefinition.map(t => t.indexes.map(i => s"\n Index '${i.name}' on ${i.columns.mkString(",")}").foldLeft("")(_ + _)).getOrElse("") + logger.info(s"Creating JDBC relation '$identifier', this will create JDBC table $tableIdentifier with schema\n${schema.map(_.treeString).orNull}$pk$idx") + + tableDefinition match { + case Some(table) => + JdbcUtils.createTable(con, table, options) + case None => + throw new UnspecifiedSchemaException(identifier) } - val schema = this.schema.get - val table = TableDefinition( - tableIdentifier, - schema.fields ++ partitions.map(_.field), - 
schema.description, - if (primaryKey.nonEmpty) primaryKey else schema.primaryKey - ) - JdbcUtils.createTable(con, table, options) } /** @@ -444,6 +483,7 @@ case class JdbcRelation( withConnection{ (con,options) => if (!ifExists || JdbcUtils.tableExists(con, tableIdentifier, options)) { JdbcUtils.dropTable(con, tableIdentifier, options) + provides.foreach(execution.refreshResource) } } } @@ -453,38 +493,39 @@ case class JdbcRelation( throw new UnsupportedOperationException(s"Cannot migrate JDBC relation '$identifier' which is defined by an SQL query") // Only try migration if schema is explicitly specified - if (schema.isDefined) { + tableDefinition.foreach { targetTable => withConnection { (con, options) => if (JdbcUtils.tableExists(con, tableIdentifier, options)) { - val targetSchema = fullSchema.get - val currentSchema = JdbcUtils.getSchema(con, tableIdentifier, options) - if (TableChange.requiresMigration(currentSchema, targetSchema, migrationPolicy)) { - doMigration(currentSchema, targetSchema, migrationPolicy, migrationStrategy) + val currentTable = JdbcUtils.getTable(con, tableIdentifier, options) + + if (TableChange.requiresMigration(currentTable, targetTable, migrationPolicy)) { + doMigration(currentTable, targetTable, migrationPolicy, migrationStrategy) + provides.foreach(execution.refreshResource) } } } } } - private def doMigration(currentSchema:FlowmanStructType, targetSchema:FlowmanStructType, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { + private def doMigration(currentTable:TableDefinition, targetTable:TableDefinition, migrationPolicy:MigrationPolicy, migrationStrategy:MigrationStrategy) : Unit = { withConnection { (con, options) => migrationStrategy match { case MigrationStrategy.NEVER => - logger.warn(s"Migration required for relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.warn(s"Migration required for relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") case MigrationStrategy.FAIL => - logger.error(s"Cannot migrate relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.error(s"Cannot migrate relation '$identifier', but migrations are disabled.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) case MigrationStrategy.ALTER => val dialect = SqlDialects.get(options.url) - val migrations = TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.exists(m => !dialect.supportsChange(tableIdentifier, m))) { - logger.error(s"Cannot migrate relation JDBC relation '$identifier' of table $tableIdentifier, since that would require unsupported changes.\nCurrent schema:\n${currentSchema.treeString}New schema:\n${targetSchema.treeString}") + logger.error(s"Cannot migrate relation JDBC relation '$identifier' of table $tableIdentifier, since that would require unsupported changes.\nCurrent schema:\n${currentTable.schema.treeString}New schema:\n${targetTable.schema.treeString}") throw new MigrationFailedException(identifier) } alter(migrations, con, options) case MigrationStrategy.ALTER_REPLACE => val dialect = SqlDialects.get(options.url) - val migrations = 
TableChange.migrate(currentSchema, targetSchema, migrationPolicy) + val migrations = TableChange.migrate(currentTable, targetTable, migrationPolicy) if (migrations.forall(m => dialect.supportsChange(tableIdentifier, m))) { try { alter(migrations, con, options) @@ -504,7 +545,7 @@ case class JdbcRelation( } def alter(migrations:Seq[TableChange], con:java.sql.Connection, options:JDBCOptions) : Unit = { - logger.info(s"Migrating JDBC relation '$identifier', this will alter JDBC table $tableIdentifier. New schema:\n${targetSchema.treeString}") + logger.info(s"Migrating JDBC relation '$identifier', this will alter JDBC table $tableIdentifier. New schema:\n${targetTable.schema.treeString}") if (migrations.isEmpty) logger.warn("Empty list of migrations - nothing to do") @@ -518,7 +559,7 @@ case class JdbcRelation( def recreate(con:java.sql.Connection, options:JDBCOptions) : Unit = { try { - logger.info(s"Migrating JDBC relation '$identifier', this will recreate JDBC table $tableIdentifier. New schema:\n${targetSchema.treeString}") + logger.info(s"Migrating JDBC relation '$identifier', this will recreate JDBC table $tableIdentifier. New schema:\n${targetTable.schema.treeString}") JdbcUtils.dropTable(con, tableIdentifier, options) doCreate(con, options) } @@ -553,7 +594,7 @@ case class JdbcRelation( } } - private def createProperties() : (String,Map[String,String]) = { + protected def createConnectionProperties() : (String,Map[String,String]) = { val connection = this.connection.value.asInstanceOf[JdbcConnection] val props = mutable.Map[String,String]() props.put(JDBCOptions.JDBC_URL, connection.url) @@ -567,8 +608,8 @@ case class JdbcRelation( (connection.url,props.toMap) } - private def withConnection[T](fn:(java.sql.Connection,JDBCOptions) => T) : T = { - val (url,props) = createProperties() + protected def withConnection[T](fn:(java.sql.Connection,JDBCOptions) => T) : T = { + val (url,props) = createConnectionProperties() logger.debug(s"Connecting to jdbc source at $url") val options = new JDBCOptions(url, tableIdentifier.unquotedString, props) @@ -588,16 +629,24 @@ case class JdbcRelation( } } - private def withStatement[T](fn:(Statement,JDBCOptions) => T) : T = { + protected def withTransaction[T](con:java.sql.Connection)(fn: => T) : T = { + JdbcUtils.withTransaction(con)(fn) + } + + protected def withStatement[T](fn:(Statement,JDBCOptions) => T) : T = { withConnection { (con, options) => - val statement = con.createStatement() - try { - statement.setQueryTimeout(JdbcUtils.queryTimeout(options)) - fn(statement, options) - } - finally { - statement.close() - } + withStatement(con,options)(fn) + } + } + + protected def withStatement[T](con:java.sql.Connection,options:JDBCOptions)(fn:(Statement,JDBCOptions) => T) : T = { + val statement = con.createStatement() + try { + statement.setQueryTimeout(JdbcUtils.queryTimeout(options)) + fn(statement, options) + } + finally { + statement.close() } } @@ -649,16 +698,40 @@ case class JdbcRelation( } +case class JdbcRelation( + override val instanceProperties:Relation.Properties, + override val schema:Option[Schema] = None, + override val partitions: Seq[PartitionField] = Seq.empty, + connection: Reference[Connection], + properties: Map[String,String] = Map.empty, + table: Option[TableIdentifier] = None, + query: Option[String] = None, + mergeKey: Seq[String] = Seq.empty, + primaryKey: Seq[String] = Seq.empty, + indexes: Seq[TableIndex] = Seq.empty +) extends JdbcRelationBase( + instanceProperties, + schema, + partitions, + connection, + properties, + 
table, + query, + mergeKey, + primaryKey, + indexes +) { +} -class JdbcRelationSpec extends RelationSpec with PartitionedRelationSpec with SchemaRelationSpec { +class JdbcRelationSpec extends RelationSpec with PartitionedRelationSpec with SchemaRelationSpec with IndexedRelationSpec { @JsonProperty(value = "connection", required = true) private var connection: ConnectionReferenceSpec = _ - @JsonProperty(value = "properties", required = false) private var properties: Map[String, String] = Map() + @JsonProperty(value = "properties", required = false) private var properties: Map[String, String] = Map.empty @JsonProperty(value = "database", required = false) private var database: Option[String] = None @JsonProperty(value = "table", required = false) private var table: Option[String] = None @JsonProperty(value = "query", required = false) private var query: Option[String] = None - @JsonProperty(value = "mergeKey", required = false) private var mergeKey: Seq[String] = Seq() - @JsonProperty(value = "primaryKey", required = false) private var primaryKey: Seq[String] = Seq() + @JsonProperty(value = "mergeKey", required = false) private var mergeKey: Seq[String] = Seq.empty + @JsonProperty(value = "primaryKey", required = false) private var primaryKey: Seq[String] = Seq.empty /** * Creates the instance of the specified Relation with all variable interpolation being performed @@ -672,11 +745,11 @@ class JdbcRelationSpec extends RelationSpec with PartitionedRelationSpec with Sc partitions.map(_.instantiate(context)), connection.instantiate(context), context.evaluate(properties), - database.map(context.evaluate).filter(_.nonEmpty), - table.map(context.evaluate).filter(_.nonEmpty), - query.map(context.evaluate).filter(_.nonEmpty), + context.evaluate(table).map(t => TableIdentifier(t, context.evaluate(database))), + context.evaluate(query), mergeKey.map(context.evaluate), - primaryKey.map(context.evaluate) + primaryKey.map(context.evaluate), + indexes.map(_.instantiate(context)) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/LocalRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/LocalRelation.scala index 135c72ac5..24421c9b7 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/LocalRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/LocalRelation.scala @@ -176,6 +176,8 @@ extends BaseRelation with SchemaRelation with PartitionedRelation { writer.format(format) .mode(mode.batchMode) .save(outputFile) + + provides.foreach(execution.refreshResource) } /** @@ -276,6 +278,7 @@ extends BaseRelation with SchemaRelation with PartitionedRelation { else { logger.info(s"Creating local directory '$localDirectory' for local file relation") path.mkdirs() + provides.foreach(execution.refreshResource) } } @@ -312,6 +315,7 @@ extends BaseRelation with SchemaRelation with PartitionedRelation { } delete(root) + provides.foreach(execution.refreshResource) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/MockRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/MockRelation.scala index a562c5e44..82caf628e 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/MockRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/MockRelation.scala @@ -225,8 +225,10 @@ case class MockRelation( * @param execution * @return */ - override def describe(execution: Execution): types.StructType = { - 
types.StructType(mocked.fields) + override def describe(execution: Execution, partitions:Map[String,FieldValue] = Map()): types.StructType = { + val result = types.StructType(mocked.fields) + + applyDocumentation(result) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/NullRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/NullRelation.scala index b2385fc5b..80ab0a4e8 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/NullRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/NullRelation.scala @@ -166,8 +166,10 @@ case class NullRelation( * @param execution * @return */ - override def describe(execution:Execution) : types.StructType = { - types.StructType(fields) + override def describe(execution:Execution, partitions:Map[String,FieldValue] = Map()) : types.StructType = { + val result = types.StructType(fields) + + applyDocumentation(result) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/RelationSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/RelationSpec.scala index 091705e8f..6e9484c51 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/RelationSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/RelationSpec.scala @@ -29,6 +29,7 @@ import com.dimajix.flowman.model.Metadata import com.dimajix.flowman.model.Relation import com.dimajix.flowman.spec.NamedSpec import com.dimajix.flowman.spec.annotation.RelationType +import com.dimajix.flowman.spec.documentation.RelationDocSpec import com.dimajix.flowman.spec.template.CustomTypeResolverBuilder import com.dimajix.flowman.spi.ClassAnnotationHandler @@ -51,6 +52,7 @@ object RelationSpec extends TypeRegistry[RelationSpec] { new JsonSubTypes.Type(name = "hiveUnionTable", value = classOf[HiveUnionTableRelationSpec]), new JsonSubTypes.Type(name = "hiveView", value = classOf[HiveViewRelationSpec]), new JsonSubTypes.Type(name = "jdbc", value = classOf[JdbcRelationSpec]), + new JsonSubTypes.Type(name = "jdbcTable", value = classOf[JdbcRelationSpec]), new JsonSubTypes.Type(name = "local", value = classOf[LocalRelationSpec]), new JsonSubTypes.Type(name = "mock", value = classOf[MockRelationSpec]), new JsonSubTypes.Type(name = "null", value = classOf[NullRelationSpec]), @@ -61,7 +63,9 @@ object RelationSpec extends TypeRegistry[RelationSpec] { new JsonSubTypes.Type(name = "view", value = classOf[HiveViewRelationSpec]) )) abstract class RelationSpec extends NamedSpec[Relation] { + @JsonProperty(value="kind", required = true) protected var kind: String = _ @JsonProperty(value="description", required = false) private var description: Option[String] = None + @JsonProperty(value="documentation", required = false) private var documentation: Option[RelationDocSpec] = None override def instantiate(context:Context) : Relation @@ -76,7 +80,8 @@ abstract class RelationSpec extends NamedSpec[Relation] { Relation.Properties( context, metadata.map(_.instantiate(context, name, Category.RELATION, kind)).getOrElse(Metadata(context, name, Category.RELATION, kind)), - description.map(context.evaluate) + context.evaluate(description), + documentation.map(_.instantiate(context)) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/TemplateRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/TemplateRelation.scala index 3a81b74c8..79929d37a 100644 --- 
a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/TemplateRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/TemplateRelation.scala @@ -22,6 +22,7 @@ import org.apache.spark.sql.DataFrame import com.dimajix.common.Trilean import com.dimajix.flowman.common.ParserUtils.splitSettings +import com.dimajix.flowman.documentation.RelationDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.MergeClause @@ -93,6 +94,12 @@ case class TemplateRelation( */ override def description : Option[String] = relationInstance.description + /** + * Returns a (static) documentation of this relation + * @return + */ + override def documentation : Option[RelationDoc] = relationInstance.documentation.map(_.merge(instanceProperties.documentation)) + /** * Returns the schema of the relation * @return diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/ValuesRelation.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/ValuesRelation.scala index b3fb9c560..e4dd90758 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/ValuesRelation.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/relation/ValuesRelation.scala @@ -213,8 +213,10 @@ case class ValuesRelation( * @param execution * @return */ - override def describe(execution: Execution): types.StructType = { - types.StructType(effectiveSchema.fields) + override def describe(execution: Execution, partitions:Map[String,FieldValue] = Map()): types.StructType = { + val result = types.StructType(effectiveSchema.fields) + + applyDocumentation(result) } /** diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/RelationSchema.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/RelationSchema.scala index 3d04da6d3..77f7536b5 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/RelationSchema.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/RelationSchema.scala @@ -45,7 +45,7 @@ case class RelationSchema( case Some(schema) => schema.fields ++ rel.partitions.map(_.field) case None => val execution = context.execution - rel.describe(execution).fields + execution.describe(rel).fields } } private lazy val cachedDescription = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/SchemaSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/SchemaSpec.scala index ba9e7b160..353fda254 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/SchemaSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/schema/SchemaSpec.scala @@ -51,7 +51,7 @@ object SchemaSpec extends TypeRegistry[SchemaSpec] { new JsonSubTypes.Type(name = "union", value = classOf[UnionSchemaSpec]) )) abstract class SchemaSpec extends Spec[Schema] { - @JsonProperty(value="kind", required = true) protected var kind: String = _ + @JsonProperty(value="kind", required = true) protected var kind: String = "inline" override def instantiate(context:Context) : Schema diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DocumentTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DocumentTarget.scala new file mode 100644 index 000000000..811a7ecfa --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DocumentTarget.scala @@ -0,0 +1,114 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, 
Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.target + +import com.fasterxml.jackson.annotation.JsonProperty +import org.slf4j.LoggerFactory + +import com.dimajix.common.No +import com.dimajix.common.Trilean +import com.dimajix.common.Yes +import com.dimajix.flowman.documentation.Collector +import com.dimajix.flowman.documentation.Documenter +import com.dimajix.flowman.documentation.Generator +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.model.BaseTarget +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.model.Target +import com.dimajix.flowman.spec.documentation.CollectorSpec +import com.dimajix.flowman.spec.documentation.DocumenterLoader +import com.dimajix.flowman.spec.documentation.GeneratorSpec + + +case class DocumentTarget( + instanceProperties:Target.Properties, + collectors:Seq[Collector] = Seq(), + generators:Seq[Generator] = Seq() +) extends BaseTarget { + private val logger = LoggerFactory.getLogger(getClass) + + /** + * Returns all phases which are implemented by this target in the execute method + * @return + */ + override def phases : Set[Phase] = Set(Phase.VERIFY) + + /** + * Returns a list of physical resources required by this target + * @return + */ + override def requires(phase: Phase) : Set[ResourceIdentifier] = Set() + + /** + * Returns the state of the target, specifically of any artifacts produces. If this method return [[Yes]], + * then an [[execute]] should update the output, such that the target is not 'dirty' any more. 
+ * @param execution + * @param phase + * @return + */ + override def dirty(execution: Execution, phase: Phase) : Trilean = { + phase match { + case Phase.VERIFY => Yes + case _ => No + } + } + + /** + * Build the documentation target + * + * @param execution + */ + override def verify(execution:Execution) : Unit = { + require(execution != null) + + project match { + case Some(project) => + document(execution, project) + case None => + logger.warn("Cannot generator documentation without project") + } + } + + private def document(execution:Execution, project:Project) : Unit = { + val documenter = + if (collectors.isEmpty && generators.isEmpty) { + DocumenterLoader.load(context, project) + } + else { + Documenter(collectors, generators) + } + + documenter.execute(context, execution, project) + } +} + + +class DocumentTargetSpec extends TargetSpec { + @JsonProperty(value="collectors") private var collectors: Seq[CollectorSpec] = Seq() + @JsonProperty(value="generators") private var generators: Seq[GeneratorSpec] = Seq() + + override def instantiate(context: Context): DocumentTarget = { + DocumentTarget( + instanceProperties(context), + collectors.map(_.instantiate(context)), + generators.map(_.instantiate(context)) + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DropTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DropTarget.scala new file mode 100644 index 000000000..f07c78388 --- /dev/null +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/DropTarget.scala @@ -0,0 +1,167 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.target + +import com.fasterxml.jackson.annotation.JsonProperty +import org.slf4j.LoggerFactory + +import com.dimajix.common.No +import com.dimajix.common.Trilean +import com.dimajix.common.Yes +import com.dimajix.flowman.execution.Context +import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.VerificationFailedException +import com.dimajix.flowman.graph.Linker +import com.dimajix.flowman.model.BaseTarget +import com.dimajix.flowman.model.Reference +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.RelationReference +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.model.Target +import com.dimajix.flowman.model.TargetDigest +import com.dimajix.flowman.spec.relation.RelationReferenceSpec +import com.dimajix.flowman.types.SingleValue + + +object DropTarget { + def apply(context: Context, relation: RelationIdentifier) : DropTarget = { + new DropTarget( + Target.Properties(context, relation.name, "relation"), + RelationReference(context, relation) + ) + } +} +case class DropTarget( + instanceProperties: Target.Properties, + relation: Reference[Relation] +) extends BaseTarget { + private val logger = LoggerFactory.getLogger(classOf[RelationTarget]) + + /** + * Returns all phases which are implemented by this target in the execute method + * @return + */ + override def phases : Set[Phase] = { + Set(Phase.CREATE, Phase.VERIFY, Phase.DESTROY) + } + + /** + * Returns a list of physical resources produced by this target + * @return + */ + override def provides(phase: Phase) : Set[ResourceIdentifier] = { + Set() + } + + /** + * Returns a list of physical resources required by this target + * @return + */ + override def requires(phase: Phase) : Set[ResourceIdentifier] = { + phase match { + case Phase.CREATE|Phase.DESTROY => relation.value.provides ++ relation.value.requires + case _ => Set() + } + } + + /** + * Returns the state of the target, specifically of any artifacts produces. If this method return [[Yes]], + * then an [[execute]] should update the output, such that the target is not 'dirty' any more. 
+ * + * @param execution + * @param phase + * @return + */ + override def dirty(execution: Execution, phase: Phase): Trilean = { + val rel = relation.value + + phase match { + case Phase.CREATE => + rel.exists(execution) != No + case Phase.VERIFY => Yes + case Phase.DESTROY => + rel.exists(execution) != No + case _ => No + } + } + + /** + * Creates all known links for building a descriptive graph of the whole data flow + * Params: linker - The linker object to use for creating new edges + */ + override def link(linker: Linker, phase:Phase): Unit = { + phase match { + case Phase.CREATE|Phase.DESTROY => + linker.write(relation, Map.empty[String,SingleValue]) + case _ => + } + } + + /** + * Drop the relation and all data contained + * + * @param executor + */ + override def create(execution: Execution) : Unit = { + require(execution != null) + + logger.info(s"Destroying relation '${relation.identifier}'") + val rel = relation.value + rel.destroy(execution, true) + } + + /** + * Verifies that the relation does not exist any more + * + * @param execution + */ + override def verify(execution: Execution) : Unit = { + require(execution != null) + + val rel = relation.value + if (rel.exists(execution) == Yes) { + logger.error(s"Verification of target '$identifier' failed - relation '${relation.identifier}' still exists") + throw new VerificationFailedException(identifier) + } + } + + /** + * Destroys both the logical relation and the physical data + * @param executor + */ + override def destroy(execution: Execution) : Unit = { + require(execution != null) + + logger.info(s"Destroying relation '${relation.identifier}'") + val rel = relation.value + rel.destroy(execution, true) + } +} + + +class DropTargetSpec extends TargetSpec { + @JsonProperty(value="relation", required=true) private var relation:RelationReferenceSpec = _ + + override def instantiate(context: Context): DropTarget = { + DropTarget( + instanceProperties(context), + relation.instantiate(context) + ) + } +} diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MeasureTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MeasureTarget.scala index ff17cd869..cb47c86fe 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MeasureTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MeasureTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
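To summarize the contract of the new `drop` target defined above: both the CREATE and DESTROY phases remove the relation together with its data, while VERIFY fails as long as the relation still exists. A compact sketch of that contract with simplified interfaces (the trait and exception here are illustrative, not the Flowman model classes):

```scala
object DropTargetSketch {
  // Minimal stand-in for a relation that can be checked and removed.
  trait Relation {
    def exists(): Boolean
    def destroy(ifExists: Boolean): Unit
  }

  final class VerificationFailedException(msg: String) extends RuntimeException(msg)

  // CREATE and DESTROY behave identically: remove the relation if it is present.
  def create(rel: Relation): Unit  = rel.destroy(ifExists = true)
  def destroy(rel: Relation): Unit = rel.destroy(ifExists = true)

  // VERIFY succeeds only once the relation is really gone.
  def verify(rel: Relation): Unit =
    if (rel.exists())
      throw new VerificationFailedException("relation still exists after drop")
}
```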
@@ -132,7 +132,7 @@ case class MeasureTarget( // Publish result as metrics val metrics = execution.metricSystem result.flatMap(_.measurements).foreach { measurement => - val gauge = metrics.findMetric(Selector(Some(measurement.name), measurement.labels)) + val gauge = metrics.findMetric(Selector(measurement.name, measurement.labels)) .headOption .map(_.asInstanceOf[SettableGaugeMetric]) .getOrElse { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MergeTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MergeTarget.scala index 7ed2d8d15..7f7fc59b0 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MergeTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/MergeTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -54,6 +54,7 @@ import com.dimajix.flowman.model.Target import com.dimajix.flowman.spec.relation.IdentifierRelationReferenceSpec import com.dimajix.flowman.spec.relation.RelationReferenceSpec import com.dimajix.flowman.spec.target.MergeTargetSpec.MergeClauseSpec +import com.dimajix.flowman.types.SingleValue object MergeTarget { @@ -155,12 +156,12 @@ case class MergeTarget( override def link(linker: Linker, phase:Phase): Unit = { phase match { case Phase.CREATE|Phase.DESTROY => - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case Phase.BUILD => linker.input(mapping.mapping, mapping.output) - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case Phase.TRUNCATE => - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case _ => } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/NullTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/NullTarget.scala index 0b950ec0a..440168c09 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/NullTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/NullTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -68,6 +68,7 @@ object NullTargetSpec { val spec = new NullTargetSpec spec.name = name spec.partition = partition + spec.kind = "null" spec } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/RelationTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/RelationTarget.scala index bafdb424c..8fd744dac 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/RelationTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/RelationTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
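The metric publishing in `MeasureTarget` follows a find-or-create pattern: look up an existing gauge by measurement name and labels, register a new one if none is found, then set its value. Since the creation branch is cut off in the hunk above, the following is only a rough, self-contained sketch of that pattern with made-up types, not the Flowman metric system:

```scala
import scala.collection.mutable

// Illustrative stand-ins for a measurement and a settable gauge metric.
final case class Measurement(name: String, labels: Map[String, String], value: Double)
final class SettableGauge(var value: Double)

object MetricPublishSketch {
  private val registry = mutable.Map.empty[(String, Map[String, String]), SettableGauge]

  /** Find an existing gauge for (name, labels) or create one, then update its value. */
  def publish(m: Measurement): Unit = {
    val gauge = registry.getOrElseUpdate((m.name, m.labels), new SettableGauge(0.0))
    gauge.value = m.value
  }
}
```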
@@ -16,6 +16,12 @@ package com.dimajix.flowman.spec.target +import java.time.Instant + +import scala.util.Failure +import scala.util.Success +import scala.util.Try + import com.fasterxml.jackson.annotation.JsonProperty import org.slf4j.LoggerFactory @@ -23,6 +29,7 @@ import com.dimajix.common.No import com.dimajix.common.Trilean import com.dimajix.common.Unknown import com.dimajix.common.Yes +import com.dimajix.flowman.config.FlowmanConf import com.dimajix.flowman.config.FlowmanConf.DEFAULT_RELATION_MIGRATION_POLICY import com.dimajix.flowman.config.FlowmanConf.DEFAULT_RELATION_MIGRATION_STRATEGY import com.dimajix.flowman.config.FlowmanConf.DEFAULT_TARGET_OUTPUT_MODE @@ -35,6 +42,7 @@ import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.MigrationStrategy import com.dimajix.flowman.execution.OutputMode import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.Status import com.dimajix.flowman.execution.VerificationFailedException import com.dimajix.flowman.graph.Linker import com.dimajix.flowman.model.BaseTarget @@ -46,6 +54,8 @@ import com.dimajix.flowman.model.RelationReference import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetDigest +import com.dimajix.flowman.model.TargetResult +import com.dimajix.flowman.model.VerifyPolicy import com.dimajix.flowman.spec.relation.IdentifierRelationReferenceSpec import com.dimajix.flowman.spec.relation.RelationReferenceSpec import com.dimajix.flowman.types.SingleValue @@ -200,14 +210,14 @@ case class RelationTarget( override def link(linker: Linker, phase:Phase): Unit = { phase match { case Phase.CREATE|Phase.DESTROY => - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case Phase.BUILD if (mapping.nonEmpty) => val partition = this.partition.mapValues(v => SingleValue(v)) linker.input(mapping.mapping, mapping.output) - linker.write(relation.identifier, partition) + linker.write(relation, partition) case Phase.TRUNCATE => val partition = this.partition.mapValues(v => SingleValue(v)) - linker.write(relation.identifier, partition) + linker.write(relation, partition) case _ => } } @@ -266,16 +276,42 @@ case class RelationTarget( /** * Performs a verification of the build step or possibly other checks. 
* - * @param executor + * @param execution */ - override def verify(executor: Execution) : Unit = { - require(executor != null) + override def verify2(execution: Execution) : TargetResult = { + require(execution != null) - val partition = this.partition.mapValues(v => SingleValue(v)) - val rel = relation.value - if (rel.loaded(executor, partition) == No) { - logger.error(s"Verification of target '$identifier' failed - partition $partition of relation '${relation.identifier}' does not exist") - throw new VerificationFailedException(identifier) + val startTime = Instant.now() + Try { + val partition = this.partition.mapValues(v => SingleValue(v)) + val rel = relation.value + if (rel.loaded(execution, partition) == No) { + val policy = VerifyPolicy.ofString(execution.flowmanConf.getConf(FlowmanConf.DEFAULT_TARGET_VERIFY_POLICY)) + policy match { + case VerifyPolicy.EMPTY_AS_FAILURE => + logger.error(s"Verification of target '$identifier' failed - partition $partition of relation '${relation.identifier}' does not exist") + throw new VerificationFailedException(identifier) + case VerifyPolicy.EMPTY_AS_SUCCESS|VerifyPolicy.EMPTY_AS_SUCCESS_WITH_ERRORS => + if (rel.exists(execution) != No) { + logger.warn(s"Verification of target '$identifier' failed - partition $partition of relation '${relation.identifier}' does not exist. Ignoring.") + if (policy == VerifyPolicy.EMPTY_AS_SUCCESS_WITH_ERRORS) + Status.SUCCESS_WITH_ERRORS + else + Status.SUCCESS + } + else { + logger.error(s"Verification of target '$identifier' failed - relation '${relation.identifier}' does not exist") + throw new VerificationFailedException(identifier) + } + } + } + else { + Status.SUCCESS + } + } + match { + case Success(status) => TargetResult(this, Phase.VERIFY, status, startTime) + case Failure(ex) => TargetResult(this, Phase.VERIFY, ex, startTime) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/StreamTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/StreamTarget.scala index 914057cfd..d7f29a34e 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/StreamTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/StreamTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
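The VERIFY behaviour introduced here (and configured through `flowman.default.target.verifyPolicy`) boils down to a small decision table: a loaded partition always verifies, an empty-but-existing relation fails only under the strict policy, and a relation that does not exist at all always fails. A self-contained sketch of that decision logic with simplified enumerations (not the Flowman `VerifyPolicy` or `Status` types):

```scala
object VerifyPolicySketch {
  sealed trait Policy
  case object EmptyAsFailure           extends Policy
  case object EmptyAsSuccess           extends Policy
  case object EmptyAsSuccessWithErrors extends Policy

  sealed trait Status
  case object Success           extends Status
  case object SuccessWithErrors extends Status
  case object Failed            extends Status

  /** Decide the VERIFY outcome from the partition state, the relation state and the policy. */
  def verify(partitionLoaded: Boolean, relationExists: Boolean, policy: Policy): Status = {
    if (partitionLoaded) {
      Success
    }
    else policy match {
      case EmptyAsFailure           => Failed            // empty partition is treated as an error
      case _ if !relationExists     => Failed            // the relation itself is missing: always an error
      case EmptyAsSuccess           => Success
      case EmptyAsSuccessWithErrors => SuccessWithErrors
    }
  }
}
```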
@@ -136,9 +136,9 @@ case class StreamTarget( phase match { case Phase.BUILD => linker.input(mapping.mapping, mapping.output) - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case Phase.TRUNCATE|Phase.DESTROY => - linker.write(relation.identifier, Map()) + linker.write(relation, Map.empty[String,SingleValue]) case _ => } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TargetSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TargetSpec.scala index dedd38218..1e86d53e6 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TargetSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TargetSpec.scala @@ -29,6 +29,7 @@ import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetIdentifier import com.dimajix.flowman.spec.NamedSpec import com.dimajix.flowman.spec.annotation.TargetType +import com.dimajix.flowman.spec.documentation.TargetDocSpec import com.dimajix.flowman.spec.template.CustomTypeResolverBuilder import com.dimajix.flowman.spi.ClassAnnotationHandler @@ -48,6 +49,9 @@ object TargetSpec extends TypeRegistry[TargetSpec] { new JsonSubTypes.Type(name = "copyFile", value = classOf[CopyFileTargetSpec]), new JsonSubTypes.Type(name = "count", value = classOf[CountTargetSpec]), new JsonSubTypes.Type(name = "deleteFile", value = classOf[DeleteFileTargetSpec]), + new JsonSubTypes.Type(name = "document", value = classOf[DocumentTargetSpec]), + new JsonSubTypes.Type(name = "documentation", value = classOf[DocumentTargetSpec]), + new JsonSubTypes.Type(name = "drop", value = classOf[DropTargetSpec]), new JsonSubTypes.Type(name = "file", value = classOf[FileTargetSpec]), new JsonSubTypes.Type(name = "getFile", value = classOf[GetFileTargetSpec]), new JsonSubTypes.Type(name = "hiveDatabase", value = classOf[HiveDatabaseTargetSpec]), @@ -67,8 +71,11 @@ object TargetSpec extends TypeRegistry[TargetSpec] { new JsonSubTypes.Type(name = "verify", value = classOf[VerifyTargetSpec]) )) abstract class TargetSpec extends NamedSpec[Target] { + @JsonProperty(value = "kind", required=true) protected var kind: String = _ @JsonProperty(value = "before", required=false) protected[spec] var before:Seq[String] = Seq() @JsonProperty(value = "after", required=false) protected[spec] var after:Seq[String] = Seq() + @JsonProperty(value="description", required = false) private var description: Option[String] = None + @JsonProperty(value = "documentation", required=false) private var documentation: Option[TargetDocSpec] = None override def instantiate(context: Context): Target @@ -84,7 +91,9 @@ abstract class TargetSpec extends NamedSpec[Target] { context, metadata.map(_.instantiate(context, name, Category.TARGET, kind)).getOrElse(Metadata(context, name, Category.TARGET, kind)), before.map(context.evaluate).map(TargetIdentifier.parse), - after.map(context.evaluate).map(TargetIdentifier.parse) + after.map(context.evaluate).map(TargetIdentifier.parse), + context.evaluate(description), + documentation.map(_.instantiate(context)) ) } } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TemplateTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TemplateTarget.scala index 497086c6f..2527d1235 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TemplateTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TemplateTarget.scala @@ -20,6 +20,7 @@ import 
com.fasterxml.jackson.annotation.JsonProperty import com.dimajix.common.Trilean import com.dimajix.flowman.common.ParserUtils.splitSettings +import com.dimajix.flowman.documentation.TargetDoc import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution import com.dimajix.flowman.execution.Phase @@ -55,6 +56,20 @@ case class TemplateTarget( } } + /** + * Returns a description of the build target + * + * @return + */ + override def description: Option[String] = instanceProperties.description.orElse(targetInstance.description) + + /** + * Returns a (static) documentation of this target + * + * @return + */ + override def documentation : Option[TargetDoc] = targetInstance.documentation.map(_.merge(instanceProperties.documentation)) + /** * Returns an instance representing this target with the context * @return diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TruncateTarget.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TruncateTarget.scala index addc76a51..f58d48250 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TruncateTarget.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/target/TruncateTarget.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -22,27 +22,51 @@ import org.slf4j.LoggerFactory import com.dimajix.common.No import com.dimajix.common.Trilean import com.dimajix.common.Yes +import com.dimajix.flowman.config.FlowmanConf.DEFAULT_TARGET_OUTPUT_MODE +import com.dimajix.flowman.config.FlowmanConf.DEFAULT_TARGET_PARALLELISM +import com.dimajix.flowman.config.FlowmanConf.DEFAULT_TARGET_REBALANCE import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Execution +import com.dimajix.flowman.execution.OutputMode import com.dimajix.flowman.execution.Phase import com.dimajix.flowman.execution.VerificationFailedException import com.dimajix.flowman.graph.Linker import com.dimajix.flowman.model.BaseTarget +import com.dimajix.flowman.model.MappingOutputIdentifier import com.dimajix.flowman.model.PartitionSchema +import com.dimajix.flowman.model.Reference import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.RelationReference import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetDigest +import com.dimajix.flowman.spec.relation.RelationReferenceSpec import com.dimajix.flowman.types.ArrayValue import com.dimajix.flowman.types.FieldValue import com.dimajix.flowman.types.RangeValue import com.dimajix.flowman.types.SingleValue +object TruncateTarget { + def apply(context: Context, relation: RelationIdentifier) : TruncateTarget = { + new TruncateTarget( + Target.Properties(context, relation.name, "relation"), + RelationReference(context, relation), + Map() + ) + } + def apply(context: Context, relation: RelationIdentifier, partitions:Map[String,FieldValue]) : TruncateTarget = { + new TruncateTarget( + Target.Properties(context, relation.name, "relation"), + RelationReference(context, relation), + partitions + ) + } +} case class TruncateTarget( instanceProperties: Target.Properties, - relation: RelationIdentifier, + relation: Reference[Relation], partitions:Map[String,FieldValue] = Map() ) extends BaseTarget { private val 
logger = LoggerFactory.getLogger(classOf[RelationTarget]) @@ -57,6 +81,7 @@ case class TruncateTarget( project.map(_.name).getOrElse(""), name, phase, + // TODO: Maybe here should be a partition or a list of partitions.... Map() ) } @@ -66,7 +91,7 @@ case class TruncateTarget( * @return */ override def phases : Set[Phase] = { - Set(Phase.BUILD, Phase.VERIFY) + Set(Phase.BUILD, Phase.VERIFY, Phase.TRUNCATE) } /** @@ -75,8 +100,8 @@ case class TruncateTarget( */ override def provides(phase: Phase) : Set[ResourceIdentifier] = { phase match { - case Phase.BUILD => - val rel = context.getRelation(relation) + case Phase.BUILD|Phase.TRUNCATE => + val rel = relation.value rel.provides ++ rel.resources(partitions) case _ => Set() } @@ -87,10 +112,10 @@ case class TruncateTarget( * @return */ override def requires(phase: Phase) : Set[ResourceIdentifier] = { - val rel = context.getRelation(relation) - phase match { - case Phase.BUILD => rel.provides ++ rel.requires + case Phase.BUILD|Phase.TRUNCATE => + val rel = relation.value + rel.provides ++ rel.requires case _ => Set() } } @@ -106,14 +131,11 @@ case class TruncateTarget( */ override def dirty(execution: Execution, phase: Phase): Trilean = { phase match { - case Phase.VALIDATE => No - case Phase.CREATE => No - case Phase.BUILD => - val rel = context.getRelation(relation) + case Phase.BUILD|Phase.TRUNCATE => + val rel = relation.value resolvedPartitions(rel).foldLeft(No:Trilean)((l,p) => l || rel.loaded(execution, p)) case Phase.VERIFY => Yes - case Phase.TRUNCATE => No - case Phase.DESTROY => No + case _ => No } } @@ -122,9 +144,11 @@ case class TruncateTarget( * Params: linker - The linker object to use for creating new edges */ override def link(linker: Linker, phase:Phase): Unit = { - if (phase == Phase.BUILD) { - val rel = context.getRelation(relation) - resolvedPartitions(rel).foreach(p => linker.write(relation, p)) + phase match { + case Phase.BUILD|Phase.TRUNCATE => + val rel = relation.value + resolvedPartitions(rel).foreach(p => linker.write(rel, p)) + case _ => } } @@ -136,7 +160,7 @@ case class TruncateTarget( override def build(execution:Execution) : Unit = { require(execution != null) - val rel = context.getRelation(relation) + val rel = relation.value rel.truncate(execution, partitions) } @@ -148,7 +172,7 @@ case class TruncateTarget( override def verify(execution: Execution) : Unit = { require(execution != null) - val rel = context.getRelation(relation) + val rel = relation.value resolvedPartitions(rel) .find(p => rel.loaded(execution, p) == Yes) .foreach { partition => @@ -160,6 +184,18 @@ case class TruncateTarget( } } + /** + * Builds the target using the given input tables + * + * @param execution + */ + override def truncate(execution:Execution) : Unit = { + require(execution != null) + + val rel = relation.value + rel.truncate(execution, partitions) + } + private def resolvedPartitions(relation:Relation) : Iterable[Map[String,SingleValue]] = { if (this.partitions.isEmpty) { Seq(Map()) @@ -174,7 +210,7 @@ case class TruncateTarget( class TruncateTargetSpec extends TargetSpec { - @JsonProperty(value = "relation", required = true) private var relation:String = _ + @JsonProperty(value="relation", required=true) private var relation:RelationReferenceSpec = _ @JsonProperty(value = "partitions", required=false) private var partitions:Map[String,FieldValue] = Map() /** @@ -190,7 +226,7 @@ class TruncateTargetSpec extends TargetSpec { } TruncateTarget( instanceProperties(context), - RelationIdentifier(context.evaluate(relation)), + 
relation.instantiate(context), partitions ) } diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/template/TemplateSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/template/TemplateSpec.scala index df88f3385..a167b07b7 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/template/TemplateSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/template/TemplateSpec.scala @@ -83,6 +83,7 @@ object TemplateSpec { new JsonSubTypes.Type(name = "target", value = classOf[TargetTemplateSpec]) )) abstract class TemplateSpec extends NamedSpec[Template[_]] { + @JsonProperty(value="kind", required = true) protected var kind: String = _ @JsonProperty(value="parameters", required=false) protected var parameters : Seq[TemplateSpec.Parameter] = Seq() protected def instanceProperties(context:Context) : Template.Properties = { diff --git a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/test/TestSpec.scala b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/test/TestSpec.scala index 3ba4bf846..b45b8a3b4 100644 --- a/flowman-spec/src/main/scala/com/dimajix/flowman/spec/test/TestSpec.scala +++ b/flowman-spec/src/main/scala/com/dimajix/flowman/spec/test/TestSpec.scala @@ -80,7 +80,7 @@ class TestSpec extends NamedSpec[Test] { val name = context.evaluate(this.name) Test.Properties( context, - metadata.map(_.instantiate(context, name, Category.TEST, kind)).getOrElse(Metadata(context, name, Category.TEST, kind)), + metadata.map(_.instantiate(context, name, Category.TEST, "test")).getOrElse(Metadata(context, name, Category.TEST, "test")), description.map(context.evaluate) ) } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/ProjectTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/ProjectTest.scala index 4ee37e925..d7306ce89 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/ProjectTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/ProjectTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2020 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -32,12 +32,21 @@ class ProjectTest extends AnyFlatSpec with Matchers { """ |name: test |version: 1.0 + | + |imports: + | - project: common + | job: some_job + | arguments: + | some_arg: $lala """.stripMargin val project = Project.read.string(spec) project.name should be ("test") project.version should be (Some("1.0")) project.filename should be (None) project.basedir should be (None) + project.imports should be (Seq( + Project.Import("common", job=Some("some_job"), arguments=Map("some_arg" -> "$lala")) + )) } it should "be readable from a file" in { diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/MappingDatasetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/MappingDatasetTest.scala index d9f0cb2f5..c4dda06d9 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/MappingDatasetTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/MappingDatasetTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -46,7 +46,7 @@ object MappingDatasetTest { ) extends BaseMapping { protected override def instanceProperties: Mapping.Properties = Mapping.Properties(context, name) - override def inputs: Seq[MappingOutputIdentifier] = Seq() + override def inputs: Set[MappingOutputIdentifier] = Set.empty override def execute(execution: Execution, input: Map[MappingOutputIdentifier, DataFrame]): Map[String, DataFrame] = Map("main" -> execution.spark.emptyDataFrame) override def describe(execution: Execution, input: Map[MappingOutputIdentifier, StructType]): Map[String, StructType] = Map("main"-> new StructType()) } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/RelationDatasetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/RelationDatasetTest.scala index fa6f9f89f..734aeb7b3 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/RelationDatasetTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/dataset/RelationDatasetTest.scala @@ -108,7 +108,8 @@ class RelationDatasetTest extends AnyFlatSpec with Matchers with MockFactory wit (relation.write _).expects(executor,spark.emptyDataFrame,*,OutputMode.APPEND).returns(Unit) dataset.write(executor, spark.emptyDataFrame, OutputMode.APPEND) - (relation.describe _).expects(executor).returns(new StructType()) + (relation.identifier _).expects().returns(RelationIdentifier("relation")) + (relation.describe _).expects(executor, *).returns(new StructType()) dataset.describe(executor) should be (Some(new StructType())) } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/ColumnCheckTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/ColumnCheckTest.scala new file mode 100644 index 000000000..81eee97b3 --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/ColumnCheckTest.scala @@ -0,0 +1,103 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.documentation.ColumnReference +import com.dimajix.flowman.documentation.ExpressionColumnCheck +import com.dimajix.flowman.documentation.RangeColumnCheck +import com.dimajix.flowman.documentation.UniqueColumnCheck +import com.dimajix.flowman.documentation.ValuesColumnCheck +import com.dimajix.flowman.execution.RootContext +import com.dimajix.flowman.spec.ObjectMapper + + +class ColumnCheckTest extends AnyFlatSpec with Matchers { + "A ColumnCheck" should "be deserializable" in { + val yaml = + """ + |kind: unique + """.stripMargin + + val spec = ObjectMapper.parse[ColumnCheckSpec](yaml) + spec shouldBe a[UniqueColumnCheckSpec] + + val context = RootContext.builder().build() + val test = spec.instantiate(context, ColumnReference(None, "col0")) + test should be (UniqueColumnCheck( + Some(ColumnReference(None, "col0")) + )) + } + + "A RangeColumnCheck" should "be deserializable" in { + val yaml = + """ + |kind: range + |lower: 7 + |upper: 23 + """.stripMargin + + val spec = ObjectMapper.parse[ColumnCheckSpec](yaml) + spec shouldBe a[RangeColumnCheckSpec] + + val context = RootContext.builder().build() + val test = spec.instantiate(context, ColumnReference(None, "col0")) + test should be (RangeColumnCheck( + Some(ColumnReference(None, "col0")), + lower="7", + upper="23" + )) + } + + "A ValuesColumnCheck" should "be deserializable" in { + val yaml = + """ + |kind: values + |values: ['a', 12, null] + """.stripMargin + + val spec = ObjectMapper.parse[ColumnCheckSpec](yaml) + spec shouldBe a[ValuesColumnCheckSpec] + + val context = RootContext.builder().build() + val test = spec.instantiate(context, ColumnReference(None, "col0")) + test should be (ValuesColumnCheck( + Some(ColumnReference(None, "col0")), + values = Seq("a", "12", null) + )) + } + + "A ExpressionColumnCheck" should "be deserializable" in { + val yaml = + """ + |kind: expression + |expression: "col1 < col2" + """.stripMargin + + val spec = ObjectMapper.parse[ColumnCheckSpec](yaml) + spec shouldBe a[ExpressionColumnCheckSpec] + + val context = RootContext.builder().build() + val test = spec.instantiate(context, ColumnReference(None, "col0")) + test should be (ExpressionColumnCheck( + Some(ColumnReference(None, "col0")), + expression = "col1 < col2" + )) + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/DocumenterTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/DocumenterTest.scala new file mode 100644 index 000000000..58740320b --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/DocumenterTest.scala @@ -0,0 +1,54 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.spec.ObjectMapper + + +class DocumenterTest extends AnyFlatSpec with Matchers { + "A DocumenterSpec" should "be parsable" in { + val yaml = + """ + |collectors: + | # Collect documentation of relations + | - kind: relations + | # Collect documentation of mappings + | - kind: mappings + | # Collect documentation of build targets + | - kind: targets + | # Execute all tests + | - kind: checks + | + |generators: + | # Create an output file in the project directory + | - kind: file + | location: ${project.basedir}/generated-documentation + | template: html + | excludeRelations: + | # You can either specify a name (without the project) + | - "stations_raw" + | # Or can also explicitly specify a name with the project + | - ".*/measurements_raw" + |""".stripMargin + + val spec = ObjectMapper.parse[DocumenterSpec](yaml) + spec shouldBe a[DocumenterSpec] + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/MappingDocTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/MappingDocTest.scala new file mode 100644 index 000000000..a761bacaa --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/MappingDocTest.scala @@ -0,0 +1,72 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.documentation.ColumnReference +import com.dimajix.flowman.documentation.NotNullColumnCheck +import com.dimajix.flowman.execution.RootContext +import com.dimajix.flowman.spec.ObjectMapper + + +class MappingDocTest extends AnyFlatSpec with Matchers { + "A MappingDocSpec" should "be deserializable" in { + val yaml = + """ + |description: "This is a mapping" + |columns: + | - name: col_a + | description: "This is column a" + | checks: + | - kind: notNull + |outputs: + | other: + | description: "This is an additional output" + | columns: + | - name: col_x + | description: "Column of other output" + |""".stripMargin + + val spec = ObjectMapper.parse[MappingDocSpec](yaml) + + val context = RootContext.builder().build() + val mapping = spec.instantiate(context) + + mapping.description should be (Some("This is a mapping")) + + val main = mapping.outputs.find(_.name == "main").get + main.description should be (None) + val mainSchema = main.schema.get + mainSchema.columns.size should be (1) + mainSchema.columns(0).name should be ("col_a") + mainSchema.columns(0).description should be (Some("This is column a")) + mainSchema.columns(0).checks.size should be (1) + mainSchema.columns(0).checks(0) shouldBe a[NotNullColumnCheck] + mainSchema.checks.size should be (0) + + val other = mapping.outputs.find(_.name == "other").get + other.description should be (Some("This is an additional output")) + val otherSchema = other.schema.get + otherSchema.columns.size should be (1) + otherSchema.columns(0).name should be ("col_x") + otherSchema.columns(0).description should be (Some("Column of other output")) + otherSchema.columns(0).checks.size should be (0) + otherSchema.checks.size should be (0) + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/RelationDocTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/RelationDocTest.scala new file mode 100644 index 000000000..cf2293ca5 --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/documentation/RelationDocTest.scala @@ -0,0 +1,61 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package com.dimajix.flowman.spec.documentation + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.documentation.NotNullColumnCheck +import com.dimajix.flowman.execution.RootContext +import com.dimajix.flowman.spec.ObjectMapper + + +class RelationDocTest extends AnyFlatSpec with Matchers { + "A RelationDocSpec" should "be deserializable" in { + val yaml = + """ + |description: "This is a mapping" + |columns: + | - name: col_a + | description: "This is column a" + | checks: + | - kind: notNull + | - name: col_x + | description: "Column of other output" + | columns: + | - name: sub_col + |""".stripMargin + + val spec = ObjectMapper.parse[RelationDocSpec](yaml) + + val context = RootContext.builder().build() + val relation = spec.instantiate(context) + + relation.description should be (Some("This is a mapping")) + + val mainSchema = relation.schema.get + mainSchema.columns.size should be (2) + mainSchema.columns(0).name should be ("col_a") + mainSchema.columns(0).description should be (Some("This is column a")) + mainSchema.columns(0).checks.size should be (1) + mainSchema.columns(0).checks(0) shouldBe a[NotNullColumnCheck] + mainSchema.columns(1).name should be ("col_x") + mainSchema.columns(1).description should be (Some("Column of other output")) + mainSchema.columns(1).checks.size should be (0) + mainSchema.checks.size should be (0) + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/history/JdbcStateStoreTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/history/JdbcStateStoreTest.scala index 1226e4ba6..21fe5b63b 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/history/JdbcStateStoreTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/history/JdbcStateStoreTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -61,4 +61,17 @@ class JdbcStateStoreTest extends AnyFlatSpec with Matchers with BeforeAndAfter { val monitor = ObjectMapper.parse[HistorySpec](spec) monitor shouldBe a[JdbcHistorySpec] } + + it should "be parseable with embedded connection" in { + val spec = + """ + |kind: jdbc + |connection: + | kind: jdbc + | url: some_url + """.stripMargin + + val monitor = ObjectMapper.parse[HistorySpec](spec) + monitor shouldBe a[JdbcHistorySpec] + } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AggregateMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AggregateMappingTest.scala index 8008940c8..91c5aa010 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AggregateMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AggregateMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -61,10 +61,10 @@ class AggregateMappingTest extends AnyFlatSpec with Matchers with LocalSparkSess ) xfs.input should be (MappingOutputIdentifier("myview")) - xfs.outputs should be (Seq("main")) + xfs.outputs should be (Set("main")) xfs.dimensions should be (Array("_1", "_2")) xfs.aggregations should be (Map("agg3" -> "sum(_3)", "agg4" -> "sum(_4)", "agg5" -> "sum(_4)", "agg6" -> "sum(_4)", "agg7" -> "sum(_4)")) - xfs.inputs should be (Seq(MappingOutputIdentifier("myview"))) + xfs.inputs should be (Set(MappingOutputIdentifier("myview"))) val df2 = xfs.execute(executor, Map(MappingOutputIdentifier("myview") -> df))("main") .orderBy("_1", "_2") @@ -107,10 +107,10 @@ class AggregateMappingTest extends AnyFlatSpec with Matchers with LocalSparkSess ) xfs.input should be (MappingOutputIdentifier("myview")) - xfs.outputs should be (Seq("main")) + xfs.outputs should be (Set("main")) xfs.dimensions should be (Seq("_1 AS dim1", "upper(_2) AS dim2")) xfs.aggregations should be (Map("agg3" -> "sum(_3)")) - xfs.inputs should be (Seq(MappingOutputIdentifier("myview"))) + xfs.inputs should be (Set(MappingOutputIdentifier("myview"))) val df2 = xfs.execute(executor, Map(MappingOutputIdentifier("myview") -> df))("main") .orderBy("dim1", "dim2") diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AliasMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AliasMappingTest.scala index 7af32d4fc..45cc474b8 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AliasMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/AliasMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -59,7 +59,8 @@ class AliasMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession val inputDf = spark.emptyDataFrame mapping.input should be (MappingOutputIdentifier("input_df:output_2")) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set(MappingOutputIdentifier("input_df:output_2"))) + mapping.outputs should be (Set("main")) val result = mapping.execute(executor, Map(MappingOutputIdentifier("input_df:output_2") -> inputDf))("main") result.count() should be (0) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/CoalesceMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/CoalesceMappingTest.scala index 10f12d1ba..55e7ff927 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/CoalesceMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/CoalesceMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -51,7 +51,8 @@ class CoalesceMappingTest extends AnyFlatSpec with Matchers with LocalSparkSessi val typedInstance = instance.asInstanceOf[CoalesceMapping] typedInstance.input should be (MappingOutputIdentifier("some_mapping")) - typedInstance.outputs should be (Seq("main")) + typedInstance.inputs should be (Set(MappingOutputIdentifier("some_mapping"))) + typedInstance.outputs should be (Set("main")) typedInstance.partitions should be (1) } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtendMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtendMappingTest.scala index 47bd9ec2b..2b66a5d7a 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtendMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtendMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -44,8 +44,8 @@ class ExtendMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession Map("new_f" -> "2*_2") ) xfs.input should be (MappingOutputIdentifier("myview")) + xfs.inputs should be (Set(MappingOutputIdentifier("myview"))) xfs.columns should be (Map("new_f" -> "2*_2")) - xfs.inputs should be (Seq(MappingOutputIdentifier("myview"))) val result = xfs.execute(executor, Map(MappingOutputIdentifier("myview") -> df))("main") .orderBy("_1").collect() diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMappingTest.scala index 3c32b71d2..c1d41ec7d 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ExtractJsonMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -122,7 +122,7 @@ class ExtractJsonMappingTest extends AnyFlatSpec with Matchers with LocalSparkSe ) val mapping = context.getMapping(MappingIdentifier("m0")) - mapping.outputs should be (Seq("main", "error")) + mapping.outputs should be (Set("main", "error")) val result = mapping.execute(executor, Map(MappingOutputIdentifier("p0") -> input))("main") result.count() should be (2) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/FilterMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/FilterMappingTest.scala index 9ffb2d0a5..c663a07e7 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/FilterMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/FilterMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -51,7 +51,8 @@ class FilterMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession val filter = instance.asInstanceOf[FilterMapping] filter.input should be (MappingOutputIdentifier("some_mapping")) - filter.outputs should be (Seq("main")) + filter.inputs should be (Set(MappingOutputIdentifier("some_mapping"))) + filter.outputs should be (Set("main")) filter.condition should be ("value < 50") } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/HistorizeMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/HistorizeMappingTest.scala index 1399f6c9e..b938dc0d6 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/HistorizeMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/HistorizeMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -67,12 +67,12 @@ class HistorizeMappingTest extends AnyFlatSpec with Matchers with LocalSparkSess "valid_to" ) mapping.input should be (MappingOutputIdentifier("df1")) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"))) + mapping.outputs should be (Set("main")) mapping.keyColumns should be (Seq("id" )) mapping.timeColumn should be ("ts") mapping.validFromColumn should be ("valid_from") mapping.validToColumn should be ("valid_to") - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"))) val expectedSchema = StructType(Seq( StructField("a", ArrayType(LongType)), @@ -123,12 +123,12 @@ class HistorizeMappingTest extends AnyFlatSpec with Matchers with LocalSparkSess InsertPosition.BEGINNING ) mapping.input should be (MappingOutputIdentifier("df1")) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"))) + mapping.outputs should be (Set("main")) mapping.keyColumns should be (Seq("id" )) mapping.timeColumn should be ("ts") mapping.validFromColumn should be ("valid_from") mapping.validToColumn should be ("valid_to") - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"))) val expectedSchema = StructType(Seq( StructField("valid_from", LongType), diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/JoinMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/JoinMappingTest.scala index 73b6bf5ce..ceac86ccb 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/JoinMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/JoinMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -67,9 +67,8 @@ class JoinMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession{ Seq("key"), mode="left" ) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) mapping.columns should be (Seq("key" )) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) val resultDf = mapping.execute(executor, Map(MappingOutputIdentifier("df1") -> df1, MappingOutputIdentifier("df2") -> df2))("main") .orderBy("key") @@ -120,9 +119,8 @@ class JoinMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession{ condition="df1.key = df2.key", mode="left" ) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) mapping.condition should be ("df1.key = df2.key") - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) val resultDf = mapping.execute(executor, Map(MappingOutputIdentifier("df1") -> df1, MappingOutputIdentifier("df2") -> df2))("main") .orderBy("df1.key") @@ -154,8 +152,7 @@ class JoinMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession{ mapping shouldBe a[JoinMappingSpec] val join = mapping.instantiate(session.context).asInstanceOf[JoinMapping] - join.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) + join.inputs should be (Set(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) join.condition should be ("df1.key = df2.key") - join.inputs should be (Seq(MappingOutputIdentifier("df1"), MappingOutputIdentifier("df2"))) } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/MockMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/MockMappingTest.scala index 8a5d18de6..ec4902c96 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/MockMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/MockMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -70,7 +70,7 @@ class MockMappingTest extends AnyFlatSpec with Matchers with MockFactory with Lo mapping.kind should be ("mock") mapping.mapping should be (MappingIdentifier("empty")) mapping.output should be (MappingOutputIdentifier("project/mock:main")) - mapping.outputs should be (Seq("main")) + mapping.outputs should be (Set("main")) mapping.records should be (Seq( ArrayRecord("a","12","3"), ArrayRecord("cat","","7"), @@ -113,14 +113,14 @@ class MockMappingTest extends AnyFlatSpec with Matchers with MockFactory with Lo mapping.category should be (Category.MAPPING) (baseMappingTemplate.instantiate _).expects(context).returns(baseMapping) - (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Seq("other", "error")) - mapping.outputs should be (Seq("other", "error")) + (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Set("other", "error")) + mapping.outputs should be (Set("other", "error")) (baseMapping.output _).expects().returns(MappingOutputIdentifier("base", "other", Some(project.name))) mapping.output should be (MappingOutputIdentifier("my_project/mock:other")) (baseMapping.context _).expects().anyNumberOfTimes().returns(context) - (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Seq()) + (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Set()) (baseMapping.identifier _).expects().anyNumberOfTimes().returns(MappingIdentifier("my_project/base")) (baseMapping.describe:(Execution,Map[MappingOutputIdentifier,StructType],String) => StructType).expects(executor,*,"other") .anyNumberOfTimes().returns(otherSchema) @@ -177,14 +177,14 @@ class MockMappingTest extends AnyFlatSpec with Matchers with MockFactory with Lo val mapping = context.getMapping(MappingIdentifier("mock")) (baseMappingTemplate.instantiate _).expects(context).returns(baseMapping) - (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Seq("main")) - mapping.outputs should be (Seq("main")) + (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Set("main")) + mapping.outputs should be (Set("main")) (baseMapping.output _).expects().returns(MappingOutputIdentifier("mock", "main", Some(project.name))) mapping.output should be (MappingOutputIdentifier("my_project/mock:main")) (baseMapping.context _).expects().anyNumberOfTimes().returns(context) - (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Seq()) + (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Set()) (baseMapping.identifier _).expects().anyNumberOfTimes().returns(MappingIdentifier("my_project/base")) (baseMapping.describe:(Execution,Map[MappingOutputIdentifier,StructType],String) => StructType).expects(executor,*,"main") .anyNumberOfTimes().returns(schema) @@ -233,8 +233,8 @@ class MockMappingTest extends AnyFlatSpec with Matchers with MockFactory with Lo (baseMappingTemplate.instantiate _).expects(context).returns(baseMapping) (baseMapping.context _).expects().anyNumberOfTimes().returns(context) - (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Seq("main")) - (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Seq()) + (baseMapping.outputs _).expects().anyNumberOfTimes().returns(Set("main")) + (baseMapping.inputs _).expects().anyNumberOfTimes().returns(Set()) (baseMapping.identifier _).expects().anyNumberOfTimes().returns(MappingIdentifier("my_project/base")) (baseMapping.describe:(Execution,Map[MappingOutputIdentifier,StructType],String) => StructType).expects(executor,*,"main") .anyNumberOfTimes().returns(schema) diff --git 
a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/NullMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/NullMappingTest.scala index ec4b35d32..a65ff3c56 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/NullMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/NullMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -71,7 +71,8 @@ class NullMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { )) mapping1.schema should be (None) mapping1.output should be (MappingOutputIdentifier("project/empty1:main")) - mapping1.outputs should be (Seq("main")) + mapping1.outputs should be (Set("main")) + mapping1.inputs should be (Set.empty) } it should "create empty DataFrames with specified columns" in { @@ -90,7 +91,7 @@ class NullMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping.category should be (Category.MAPPING) //mapping.kind should be ("null") - mapping.outputs should be (Seq("main")) + mapping.outputs should be (Set("main")) mapping.output should be (MappingOutputIdentifier("empty")) mapping.describe(executor, Map()) should be (Map( @@ -131,7 +132,7 @@ class NullMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping.category should be (Category.MAPPING) //mapping.kind should be ("null") - mapping.outputs should be (Seq("main")) + mapping.outputs should be (Set("main")) mapping.output should be (MappingOutputIdentifier("empty")) mapping.describe(executor, Map()) should be (Map( diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ProjectMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ProjectMappingTest.scala index 31e614136..e1a4286a1 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ProjectMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ProjectMappingTest.scala @@ -52,7 +52,7 @@ class ProjectMappingTest extends AnyFlatSpec with Matchers with LocalSparkSessio mapping.input should be (MappingOutputIdentifier("myview")) mapping.columns should be (Seq(ProjectTransformer.Column(Path("_2")))) - mapping.inputs should be (Seq(MappingOutputIdentifier("myview"))) + mapping.inputs should be (Set(MappingOutputIdentifier("myview"))) val result = mapping.execute(executor, Map(MappingOutputIdentifier("myview") -> df))("main") .orderBy("_2").collect() diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RankMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RankMappingTest.scala index c71dd843a..e8bba2f68 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RankMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RankMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -71,7 +71,7 @@ class RankMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping.input should be (MappingOutputIdentifier("df1")) mapping.keyColumns should be (Seq("id" )) mapping.versionColumns should be (Seq("ts")) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"))) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"))) val result = mapping.execute(executor, Map(MappingOutputIdentifier("df1") -> df))("main") result.schema should be (df.schema) @@ -112,7 +112,7 @@ class RankMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping.input should be (MappingOutputIdentifier("df1")) mapping.keyColumns should be (Seq("id" )) mapping.versionColumns should be (Seq("ts")) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"))) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"))) val result = mapping.execute(executor, Map(MappingOutputIdentifier("df1") -> df))("main") result.schema should be (df.schema) @@ -177,7 +177,7 @@ class RankMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping.input should be (MappingOutputIdentifier("df1")) mapping.keyColumns should be (Seq("id._1" )) mapping.versionColumns should be (Seq("ts._2")) - mapping.inputs should be (Seq(MappingOutputIdentifier("df1"))) + mapping.inputs should be (Set(MappingOutputIdentifier("df1"))) val result = mapping.execute(executor, Map(MappingOutputIdentifier("df1") -> df))("main") result.schema should be (df.schema) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadHiveTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadHiveTest.scala index 2da406ebf..829f7ac18 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadHiveTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadHiveTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -24,6 +24,7 @@ import org.apache.spark.sql.types.StructType import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.Session import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingIdentifier @@ -57,8 +58,7 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { mapping shouldBe a[ReadHiveMapping] val rrm = mapping.asInstanceOf[ReadHiveMapping] - rrm.database should be (Some("default")) - rrm.table should be ("t0") + rrm.table should be (TableIdentifier("t0", Some("default"))) rrm.filter should be (Some("landing_date > 123")) } @@ -69,8 +69,7 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0007", + table = TableIdentifier("lala_0007", Some("default")), format = Some("parquet"), schema = Some(EmbeddedSchema( Schema.Properties(context), @@ -84,15 +83,14 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { val mapping = ReadHiveMapping( Mapping.Properties(context, "readHive"), - Some("default"), - "lala_0007" + TableIdentifier("lala_0007", Some("default")) ) mapping.requires should be (Set( ResourceIdentifier.ofHiveTable("lala_0007", Some("default")), ResourceIdentifier.ofHiveDatabase("default") )) - mapping.inputs should be (Seq()) + mapping.inputs should be (Set()) mapping.describe(execution, Map()) should be (Map( "main" -> ftypes.StructType(Seq( Field("str_col", ftypes.StringType), @@ -116,8 +114,7 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0007", + table = TableIdentifier("lala_0007", Some("default")), format = Some("parquet"), schema = Some(EmbeddedSchema( Schema.Properties(context), @@ -131,8 +128,7 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { val mapping = ReadHiveMapping( Mapping.Properties(context, "readHive"), - Some("default"), - "lala_0007", + table = TableIdentifier("lala_0007", Some("default")), columns = Seq( Field("int_col", ftypes.DoubleType) ) @@ -142,7 +138,7 @@ class ReadHiveTest extends AnyFlatSpec with Matchers with LocalSparkSession { ResourceIdentifier.ofHiveTable("lala_0007", Some("default")), ResourceIdentifier.ofHiveDatabase("default") )) - mapping.inputs should be (Seq()) + mapping.inputs should be (Set()) mapping.describe(execution, Map()) should be (Map( "main" -> ftypes.StructType(Seq( Field("int_col", ftypes.DoubleType) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadRelationTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadRelationTest.scala index 243f4be32..48dd3c9e2 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ReadRelationTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -19,12 +19,15 @@ package com.dimajix.flowman.spec.mapping import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers +import com.dimajix.flowman.execution.Phase import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.graph.Graph import com.dimajix.flowman.model.IdentifierRelationReference import com.dimajix.flowman.model.MappingIdentifier import com.dimajix.flowman.model.Module import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.ValueRelationReference +import com.dimajix.flowman.spec.relation.ValuesRelation import com.dimajix.flowman.types.SingleValue @@ -83,5 +86,62 @@ class ReadRelationTest extends AnyFlatSpec with Matchers { rrm.relation shouldBe a[ValueRelationReference] rrm.relation.identifier should be (RelationIdentifier("embedded", "project")) rrm.relation.name should be ("embedded") + + // Check execution graph + val graph = Graph.ofProject(context, project, Phase.BUILD) + graph.relations.size should be (1) + graph.mappings.size should be (1) + + val relNode = graph.relations.head + relNode.relation should be (rrm.relation.value) + relNode.parent should be (None) + + val mapNode = graph.mappings.head + mapNode.mapping should be (rrm) + mapNode.parent should be (None) + } + + it should "support create an appropriate graph" in { + val spec = + """ + |mappings: + | t0: + | kind: readRelation + | relation: + | name: embedded + | kind: values + | records: + | - ["key",12] + | schema: + | kind: embedded + | fields: + | - name: key_column + | type: string + | - name: value_column + | type: integer + """.stripMargin + + val project = Module.read.string(spec).toProject("project") + val session = Session.builder().withProject(project).disableSpark().build() + val context = session.getContext(project) + val mapping = context.getMapping(MappingIdentifier("t0")) + + mapping shouldBe a[ReadRelationMapping] + val relation = mapping.asInstanceOf[ReadRelationMapping].relation.value + relation shouldBe a[ValuesRelation] + + // Check execution graph + val graph = Graph.ofProject(context, project, Phase.BUILD) + graph.nodes.size should be (3) // 1 mapping + 1 mapping output + 1 relation + graph.relations.size should be (1) + graph.mappings.size should be (1) + + val relNode = graph.relations.head + relNode.relation should be (relation) + relNode.parent should be (None) + + val mapNode = graph.mappings.head + mapNode.mapping should be (mapping) + mapNode.parent should be (None) } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RebalanceMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RebalanceMappingTest.scala index 4ccb81b83..87717d6c4 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RebalanceMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RebalanceMappingTest.scala @@ -51,7 +51,7 @@ class RebalanceMappingTest extends AnyFlatSpec with Matchers with LocalSparkSess val typedInstance = instance.asInstanceOf[RebalanceMapping] typedInstance.input should be (MappingOutputIdentifier("some_mapping")) - typedInstance.outputs should be (Seq("main")) + typedInstance.outputs should be (Set("main")) typedInstance.partitions should be (2) } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RepartitionMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RepartitionMappingTest.scala index 860e633b9..04702dfaf 100644 --- 
a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RepartitionMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/RepartitionMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -56,7 +56,8 @@ class RepartitionMappingTest extends AnyFlatSpec with Matchers with LocalSparkSe val typedInstance = instance.asInstanceOf[RepartitionMapping] typedInstance.input should be (MappingOutputIdentifier("some_mapping")) - typedInstance.outputs should be (Seq("main")) + typedInstance.inputs should be (Set(MappingOutputIdentifier("some_mapping"))) + typedInstance.outputs should be (Set("main")) typedInstance.partitions should be (2) typedInstance.columns should be (Seq("col_1", "col_2")) typedInstance.sort should be (true) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SchemaMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SchemaMappingTest.scala index b493a1098..848a93798 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SchemaMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SchemaMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -59,7 +59,7 @@ class SchemaMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession project.mappings.contains("t1") should be (true) val mapping = context.getMapping(MappingIdentifier("t1")).asInstanceOf[SchemaMapping] - mapping.inputs should be (Seq(MappingOutputIdentifier("t0"))) + mapping.inputs should be (Set(MappingOutputIdentifier("t0"))) mapping.output should be (MappingOutputIdentifier("project/t1:main")) mapping.identifier should be (MappingIdentifier("project/t1")) mapping.schema should be (None) @@ -90,7 +90,7 @@ class SchemaMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession mapping.input should be (MappingOutputIdentifier("myview")) mapping.columns should be (Seq(Field("_2", FieldType.of("int")))) - mapping.inputs should be (Seq(MappingOutputIdentifier("myview"))) + mapping.inputs should be (Set(MappingOutputIdentifier("myview"))) mapping.output should be (MappingOutputIdentifier("map:main")) mapping.identifier should be (MappingIdentifier("map")) @@ -128,8 +128,8 @@ class SchemaMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession ) mapping.input should be (MappingOutputIdentifier("myview")) - mapping.inputs should be (Seq(MappingOutputIdentifier("myview"))) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set(MappingOutputIdentifier("myview"))) + mapping.outputs should be (Set("main")) val result = mapping.execute(executor, Map(MappingOutputIdentifier("myview") -> df))("main") .orderBy("_2") diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SortMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SortMappingTest.scala index b57a93dc6..2a9b95fcf 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SortMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SortMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya 
Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -54,7 +54,8 @@ class SortMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val typedInstance = instance.asInstanceOf[SortMapping] typedInstance.input should be (MappingOutputIdentifier("some_mapping")) - typedInstance.outputs should be (Seq("main")) + typedInstance.inputs should be (Set(MappingOutputIdentifier("some_mapping"))) + typedInstance.outputs should be (Set("main")) typedInstance.columns should be (Seq( "c1" -> SortOrder(Ascending, NullsFirst), "c2" -> SortOrder(Descending, NullsFirst) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SqlMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SqlMappingTest.scala index edd10d436..6905ccede 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SqlMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/SqlMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -98,7 +98,7 @@ class SqlMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val session = Session.builder().withSparkSession(spark).build() val context = session.getContext(project) val mapping = context.getMapping(MappingIdentifier("t1")) - mapping.inputs should be (Seq(MappingOutputIdentifier("t0"))) + mapping.inputs should be (Set(MappingOutputIdentifier("t0"))) } it should "also be correct with subqueries" in { @@ -143,7 +143,7 @@ class SqlMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val session = Session.builder().withSparkSession(spark).build() val context = session.getContext(project) val mapping = context.getMapping(MappingIdentifier("t1")) - mapping.inputs.map(_.name).sorted should be (Seq("other_table", "some_table", "some_table_archive")) + mapping.inputs.map(_.name) should be (Set("other_table", "some_table", "some_table_archive")) } it should "execute the SQL query" in { diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/TemplateMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/TemplateMappingTest.scala index d7055577e..6c5a07b77 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/TemplateMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/TemplateMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2019 Kaya Kupferschmidt + * Copyright 2019-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -56,6 +56,6 @@ class TemplateMappingTest extends AnyFlatSpec with Matchers { mapping shouldBe a[TemplateMapping] mapping.name should be ("template") - mapping.inputs should be (Seq(MappingOutputIdentifier("lala"))) + mapping.inputs should be (Set(MappingOutputIdentifier("lala"))) } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UnitMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UnitMappingTest.scala index 09cdebec7..b3a7cde5c 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UnitMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UnitMappingTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -78,14 +78,14 @@ class UnitMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val executor = session.execution val instance0 = context.getMapping(MappingIdentifier("instance_0")) - instance0.inputs should be (Seq()) - instance0.outputs should be (Seq("input")) + instance0.inputs should be (Set()) + instance0.outputs should be (Set("input")) val df0 = executor.instantiate(instance0, "input") df0.collect() should be (inputDf0.collect()) val instance1 = context.getMapping(MappingIdentifier("instance_1")) - instance1.inputs should be (Seq()) - instance1.outputs should be (Seq("input")) + instance1.inputs should be (Set()) + instance1.outputs should be (Set("input")) val df1 = executor.instantiate(instance1, "input") df1.collect() should be (inputDf1.collect()) } @@ -118,8 +118,8 @@ class UnitMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val executor = session.execution val unit = context.getMapping(MappingIdentifier("macro")) - unit.inputs should be (Seq(MappingOutputIdentifier("outside"))) - unit.outputs.sorted should be (Seq("inside", "output")) + unit.inputs should be (Set(MappingOutputIdentifier("outside"))) + unit.outputs should be (Set("inside", "output")) val df_inside = executor.instantiate(unit, "inside") df_inside.collect() should be (inputDf0.collect()) @@ -151,8 +151,8 @@ class UnitMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession { val executor = session.execution val instance0 = context.getMapping(MappingIdentifier("alias")) - instance0.inputs should be (Seq(MappingOutputIdentifier("macro:input"))) - instance0.outputs should be (Seq("main")) + instance0.inputs should be (Set(MappingOutputIdentifier("macro:input"))) + instance0.outputs should be (Set("main")) val df0 = executor.instantiate(instance0, "main") df0.collect() should be (inputDf0.collect()) } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UpsertMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UpsertMappingTest.scala index f108a521b..066acc3fa 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UpsertMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/UpsertMappingTest.scala @@ -201,7 +201,7 @@ class UpsertMappingTest extends AnyFlatSpec with Matchers with LocalSparkSession mapping shouldBe an[UpsertMappingSpec] val updateMapping = mapping.instantiate(session.context).asInstanceOf[UpsertMapping] - updateMapping.inputs should be (Seq(MappingOutputIdentifier("t0"),MappingOutputIdentifier("t1"))) + updateMapping.inputs should be 
(Set(MappingOutputIdentifier("t0"),MappingOutputIdentifier("t1"))) updateMapping.input should be (MappingOutputIdentifier("t0")) updateMapping.updates should be (MappingOutputIdentifier("t1")) updateMapping.keyColumns should be (Seq("id")) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ValuesMappingTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ValuesMappingTest.scala index 6b3582a87..ee78b1c4f 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ValuesMappingTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/mapping/ValuesMappingTest.scala @@ -72,7 +72,7 @@ class ValuesMappingTest extends AnyFlatSpec with Matchers with MockFactory with mapping.kind should be ("values") mapping.identifier should be (MappingIdentifier("project/fake")) mapping.output should be (MappingOutputIdentifier("project/fake:main")) - mapping.outputs should be (Seq("main")) + mapping.outputs should be (Set("main")) mapping.records should be (Seq( ArrayRecord("a","12","3"), ArrayRecord("cat","","7"), @@ -108,8 +108,9 @@ class ValuesMappingTest extends AnyFlatSpec with Matchers with MockFactory with mapping.category should be (Category.MAPPING) mapping.kind should be ("values") mapping.identifier should be (MappingIdentifier("project/fake")) + mapping.inputs should be (Set()) mapping.output should be (MappingOutputIdentifier("project/fake:main")) - mapping.outputs should be (Seq("main")) + mapping.outputs should be (Set("main")) mapping.columns should be (Seq( Field("str_col", StringType), Field("int_col", IntegerType), @@ -158,8 +159,8 @@ class ValuesMappingTest extends AnyFlatSpec with Matchers with MockFactory with (mappingTemplate.instantiate _).expects(context).returns(mockMapping) val mapping = context.getMapping(MappingIdentifier("const")) - mapping.inputs should be (Seq()) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set()) + mapping.outputs should be (Set("main")) mapping.describe(executor, Map()) should be (Map("main" -> schema)) mapping.describe(executor, Map(), "main") should be (schema) @@ -203,8 +204,8 @@ class ValuesMappingTest extends AnyFlatSpec with Matchers with MockFactory with (mappingTemplate.instantiate _).expects(context).returns(mockMapping) val mapping = context.getMapping(MappingIdentifier("const")) - mapping.inputs should be (Seq()) - mapping.outputs should be (Seq("main")) + mapping.inputs should be (Set()) + mapping.outputs should be (Set("main")) mapping.describe(executor, Map()) should be (Map("main" -> schema)) mapping.describe(executor, Map(), "main") should be (schema) diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/metric/JdbcMetricSinkTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/metric/JdbcMetricSinkTest.scala new file mode 100644 index 000000000..416cf4c90 --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/metric/JdbcMetricSinkTest.scala @@ -0,0 +1,134 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.metric + +import java.nio.file.Files +import java.nio.file.Path + +import org.scalamock.scalatest.MockFactory +import org.scalatest.BeforeAndAfter +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.execution.RootContext +import com.dimajix.flowman.execution.Status +import com.dimajix.flowman.metric.FixedGaugeMetric +import com.dimajix.flowman.metric.MetricBoard +import com.dimajix.flowman.metric.MetricSelection +import com.dimajix.flowman.metric.MetricSystem +import com.dimajix.flowman.metric.Selector +import com.dimajix.flowman.model.Connection +import com.dimajix.flowman.model.ConnectionReference +import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.spec.ObjectMapper +import com.dimajix.flowman.spec.connection.JdbcConnection + + +class JdbcMetricSinkTest extends AnyFlatSpec with Matchers with MockFactory with BeforeAndAfter { + var tempDir:Path = _ + + before { + tempDir = Files.createTempDirectory("jdbc_metric_test") + } + after { + tempDir.toFile.listFiles().foreach(_.delete()) + tempDir.toFile.delete() + } + + "The JdbcMetricSink" should "be parsable" in { + val spec = + """ + |kind: jdbc + |connection: metrics + """.stripMargin + + val monitor = ObjectMapper.parse[MetricSinkSpec](spec) + monitor shouldBe a[JdbcMetricSinkSpec] + } + + it should "be parsable with an embedded connection" in { + val spec = + """ + |kind: jdbc + |connection: + | kind: jdbc + | url: some_url + """.stripMargin + + val monitor = ObjectMapper.parse[MetricSinkSpec](spec) + monitor shouldBe a[JdbcMetricSinkSpec] + } + + it should "work" in { + val db = tempDir.resolve("mydb") + val project = Project("prj1") + val context = RootContext.builder().build().getProjectContext(project) + + val connection = JdbcConnection( + Connection.Properties(context), + url = "jdbc:derby:" + db + ";create=true", + driver = "org.apache.derby.jdbc.EmbeddedDriver" + ) + val connectionPrototype = mock[Prototype[Connection]] + (connectionPrototype.instantiate _).expects(context).returns(connection) + + val sink = new JdbcMetricSink( + ConnectionReference.apply(context, connectionPrototype), + Map("project" -> s"${project.name}") + ) + + val metricSystem = new MetricSystem + val metricBoard = MetricBoard(context, + Map("board_label" -> "v1"), + Seq(MetricSelection(selector=Selector(".*"), labels=Map("target" -> "$target", "status" -> "$status"))) + ) + + metricSystem.addMetric(FixedGaugeMetric("metric1", labels=Map("target" -> "p1", "metric_label" -> "v2"), 23.0)) + + sink.addBoard(metricBoard, metricSystem) + sink.commit(metricBoard, Status.SUCCESS) + sink.commit(metricBoard, Status.SUCCESS) + } + + it should "throw on non-existing database" in { + val db = tempDir.resolve("mydb2") + val context = RootContext.builder().build() + + val connection = JdbcConnection( + Connection.Properties(context), + url = "jdbc:derby:" + db + ";create=false", + driver = "org.apache.derby.jdbc.EmbeddedDriver" + ) + val connectionPrototype = mock[Prototype[Connection]] + (connectionPrototype.instantiate _).expects(context).returns(connection) + + val sink = new JdbcMetricSink( + ConnectionReference.apply(context, connectionPrototype) + ) + + val metricSystem = new MetricSystem + val metricBoard = MetricBoard(context, + Map("board_label" -> "v1"), + 
Seq(MetricSelection(selector=Selector(".*"), labels=Map("target" -> "$target", "status" -> "$status"))) + ) + + sink.addBoard(metricBoard, metricSystem) + an[Exception] should be thrownBy(sink.commit(metricBoard, Status.SUCCESS)) + an[Exception] should be thrownBy(sink.commit(metricBoard, Status.SUCCESS)) + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveTableRelationTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveTableRelationTest.scala index be2db60d6..a51b1a2bd 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveTableRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveTableRelationTest.scala @@ -21,7 +21,6 @@ import java.io.File import org.apache.hadoop.fs.Path import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.Row -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException import org.apache.spark.sql.catalyst.analysis.PartitionAlreadyExistsException import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException @@ -40,6 +39,7 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.MigrationFailedException import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.MigrationStrategy @@ -112,7 +112,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0001", Some("default"))) table.provider should be (Some("hive")) table.comment should be(Some("This is a test table")) - table.identifier should be (TableIdentifier("lala_0001", Some("default"))) + table.identifier should be (TableIdentifier("lala_0001", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -194,7 +194,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0002", Some("default"))) table.provider should be (Some("hive")) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0002", Some("default"))) + table.identifier should be (TableIdentifier("lala_0002", Some("default")).toSpark) table.tableType should be (CatalogTableType.EXTERNAL) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -262,7 +262,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0003", Some("default"))) table.provider should be (Some("hive")) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0003", Some("default"))) + table.identifier should be (TableIdentifier("lala_0003", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -326,7 +326,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0004", Some("default"))) table.provider should be (Some("hive")) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0004", Some("default"))) + table.identifier should be (TableIdentifier("lala_0004", 
Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -385,7 +385,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0005", Some("default"))) table.provider should be (Some("hive")) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0005", Some("default"))) + table.identifier should be (TableIdentifier("lala_0005", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -409,8 +409,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0006", + table = TableIdentifier("lala_0006", Some("default")), format = Some("parquet"), schema = Some(EmbeddedSchema( Schema.Properties(context), @@ -427,7 +426,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // == Check ================================================================================================= val table = session.catalog.getTable(TableIdentifier("lala_0006", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0006", Some("default"))) + table.identifier should be (TableIdentifier("lala_0006", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -453,8 +452,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0007", + table = TableIdentifier("lala_0007", Some("default")), format = Some("avro"), schema = Some(EmbeddedSchema( Schema.Properties(context), @@ -471,7 +469,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // == Check ================================================================================================= val table = session.catalog.getTable(TableIdentifier("lala_0007", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0007", Some("default"))) + table.identifier should be (TableIdentifier("lala_0007", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -496,8 +494,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0007", + table = TableIdentifier("lala_0007", Some("default")), format = Some("textfile"), rowFormat = Some("org.apache.hadoop.hive.serde2.OpenCSVSerde"), serdeProperties = Map( @@ -519,7 +516,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // == Check ================================================================================================= val table = session.catalog.getTable(TableIdentifier("lala_0007", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0007", Some("default"))) + table.identifier should be (TableIdentifier("lala_0007", Some("default")).toSpark) table.tableType 
should be (CatalogTableType.MANAGED) SchemaUtils.dropMetadata(table.schema) should be (StructType(Seq( StructField("str_col", StringType), @@ -545,8 +542,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0008", + table = TableIdentifier("lala_0008", Some("default")), rowFormat = Some("org.apache.hadoop.hive.serde2.avro.AvroSerDe"), schema = Some(EmbeddedSchema( Schema.Properties(context), @@ -563,7 +559,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // == Check ================================================================================================= val table = session.catalog.getTable(TableIdentifier("lala_0008", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0008", Some("default"))) + table.identifier should be (TableIdentifier("lala_0008", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -588,8 +584,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val relation = HiveTableRelation( Relation.Properties(context, "t0"), - database = Some("default"), - table = "lala_0009", + table = TableIdentifier("lala_0009", Some("default")), rowFormat = Some("org.apache.hadoop.hive.serde2.avro.AvroSerDe"), inputFormat = Some("org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"), outputFormat = Some("org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"), @@ -608,7 +603,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // == Check ================================================================================================= val table = session.catalog.getTable(TableIdentifier("lala_0009", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0009", Some("default"))) + table.identifier should be (TableIdentifier("lala_0009", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -865,7 +860,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val table = session.catalog.getTable(TableIdentifier("lala_0012", Some("default"))) table.comment should be(None) - table.identifier should be (TableIdentifier("lala_0012", Some("default"))) + table.identifier should be (TableIdentifier("lala_0012", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType( StructField("str_col", StringType) :: @@ -1024,8 +1019,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("f3", com.dimajix.flowman.types.StringType) ) )), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Create ================================================================================================ @@ -1111,8 +1105,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes partitions = Seq( PartitionField("part", com.dimajix.flowman.types.StringType) ), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Create 
================================================================================================ @@ -1234,8 +1227,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes partitions = Seq( PartitionField("part", com.dimajix.flowman.types.StringType) ), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Create ================================================================================================= @@ -1340,8 +1332,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes partitions = Seq( PartitionField("part", com.dimajix.flowman.types.StringType) ), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Inspect =============================================================================================== @@ -1471,7 +1462,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // Inspect Hive table val table = session.catalog.getTable(TableIdentifier("lala", Some("default"))) - table.identifier should be (TableIdentifier("lala", Some("default"))) + table.identifier should be (TableIdentifier("lala", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -1531,8 +1522,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("f3", com.dimajix.flowman.types.StringType) ) )), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Create ================================================================================================== @@ -1544,15 +1534,15 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("some_table", Some("default"))) - table_1.identifier should be (TableIdentifier("some_table", Some("default"))) + table_1.identifier should be (TableIdentifier("some_table", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) if (hiveVarcharSupported) { - table_1.schema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_1.schema) should be(StructType(Seq( StructField("f1", VarcharType(4)), StructField("f2", CharType(4)), StructField("f3", StringType) ))) - table_1.dataSchema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_1.dataSchema) should be(StructType(Seq( StructField("f1", VarcharType(4)), StructField("f2", CharType(4)), StructField("f3", StringType) @@ -1656,7 +1646,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) - table_1.identifier should be (TableIdentifier("lala", Some("default"))) + table_1.identifier should be (TableIdentifier("lala", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) table_1.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -1691,16 +1681,16 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes // Inspect Hive table val table_2 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) - table_2.identifier should be (TableIdentifier("lala", Some("default"))) + table_2.identifier should be (TableIdentifier("lala", Some("default")).toSpark) table_2.tableType 
should be (CatalogTableType.MANAGED) if (hiveVarcharSupported) { - table_2.schema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_2.schema) should be(StructType(Seq( StructField("str_col", StringType), StructField("int_col", IntegerType), StructField("char_col", VarcharType(10)), StructField("partition_col", StringType, nullable = false) ))) - table_2.dataSchema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_2.dataSchema) should be(StructType(Seq( StructField("str_col", StringType), StructField("int_col", IntegerType), StructField("char_col", VarcharType(10)) @@ -1781,8 +1771,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("f3", com.dimajix.flowman.types.IntegerType) ) )), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) val relation_2 = HiveTableRelation( Relation.Properties(context, "rel_2"), @@ -1794,8 +1783,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("f4", com.dimajix.flowman.types.LongType) ) )), - table = "some_table", - database = Some("default") + table = TableIdentifier("some_table", Some("default")) ) // == Create =================================================================== @@ -1804,7 +1792,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes relation_1.conforms(execution, MigrationPolicy.RELAXED) should be (Yes) relation_1.conforms(execution, MigrationPolicy.STRICT) should be (Yes) session.catalog.tableExists(TableIdentifier("some_table", Some("default"))) should be (true) - session.catalog.getTable(relation_1.tableIdentifier).schema should be (StructType(Seq( + session.catalog.getTable(relation_1.table).schema should be (StructType(Seq( StructField("f1", StringType), StructField("f2", IntegerType), StructField("f3", IntegerType) @@ -1816,7 +1804,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes relation_1.migrate(execution, MigrationPolicy.RELAXED, MigrationStrategy.ALTER) relation_1.migrate(execution, MigrationPolicy.RELAXED, MigrationStrategy.ALTER_REPLACE) relation_1.migrate(execution, MigrationPolicy.RELAXED, MigrationStrategy.REPLACE) - session.catalog.getTable(relation_1.tableIdentifier).schema should be (StructType(Seq( + session.catalog.getTable(relation_1.table).schema should be (StructType(Seq( StructField("f1", StringType), StructField("f2", IntegerType), StructField("f3", IntegerType) @@ -1849,7 +1837,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes relation_2.conforms(execution, MigrationPolicy.RELAXED) should be (Yes) relation_2.conforms(execution, MigrationPolicy.STRICT) should be (No) - session.catalog.getTable(relation_2.tableIdentifier).schema should be (StructType(Seq( + session.catalog.getTable(relation_2.table).schema should be (StructType(Seq( StructField("f1", StringType), StructField("f2", IntegerType), StructField("f3", IntegerType), @@ -1864,7 +1852,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes relation_1.conforms(execution, MigrationPolicy.STRICT) should be (No) relation_2.conforms(execution, MigrationPolicy.RELAXED) should be (Yes) relation_2.conforms(execution, MigrationPolicy.STRICT) should be (Yes) - session.catalog.getTable(relation_2.tableIdentifier).schema should be (StructType(Seq( + session.catalog.getTable(relation_2.table).schema should be (StructType(Seq( StructField("f1", StringType), StructField("f2", ShortType), 
StructField("f4", LongType) @@ -2037,8 +2025,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes val view = HiveViewRelation( Relation.Properties(context), - database = Some("default"), - table = "table_or_view", + table = TableIdentifier("table_or_view", Some("default")), mapping = Some(MappingOutputIdentifier("t0")) ) val table = HiveTableRelation( @@ -2051,8 +2038,7 @@ class HiveTableRelationTest extends AnyFlatSpec with Matchers with LocalSparkSes Field("f3", com.dimajix.flowman.types.IntegerType) ) )), - table = "table_or_view", - database = Some("default") + table = TableIdentifier("table_or_view", Some("default")) ) // == Create VIEW ============================================================================================ diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelationTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelationTest.scala index f63854aa5..f30fd079a 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveUnionTableRelationTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -17,7 +17,6 @@ package com.dimajix.flowman.spec.relation import org.apache.spark.sql.Row -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException import org.apache.spark.sql.catalyst.analysis.PartitionAlreadyExistsException import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException @@ -37,6 +36,7 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.MigrationStrategy import com.dimajix.flowman.execution.OutputMode @@ -118,7 +118,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view = session.catalog.getTable(TableIdentifier("lala", Some("default"))) view.provider should be (None) view.comment should be (None) - view.identifier should be (TableIdentifier("lala", Some("default"))) + view.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view.tableType should be (CatalogTableType.VIEW) if (hiveVarcharSupported) { SchemaUtils.dropMetadata(view.schema) should be(StructType(Seq( @@ -143,10 +143,10 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val table = session.catalog.getTable(TableIdentifier("lala_1", Some("default"))) table.provider should be (Some("hive")) table.comment should be(Some("This is a test table")) - table.identifier should be (TableIdentifier("lala_1", Some("default"))) + table.identifier should be (TableIdentifier("lala_1", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) if (hiveVarcharSupported) { - table.schema should be(StructType(Seq( + SchemaUtils.dropMetadata(table.schema) should be(StructType(Seq( StructField("str_col", StringType), StructField("int_col", IntegerType), StructField("char_col", CharType(10)), @@ -352,7 +352,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view = 
session.catalog.getTable(TableIdentifier("lala", Some("default"))) view.provider should be (None) view.comment should be (None) - view.identifier should be (TableIdentifier("lala", Some("default"))) + view.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view.tableType should be (CatalogTableType.VIEW) view.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -366,7 +366,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val table = session.catalog.getTable(TableIdentifier("lala_1", Some("default"))) table.provider should be (Some("hive")) table.comment should be(Some("This is a test table")) - table.identifier should be (TableIdentifier("lala_1", Some("default"))) + table.identifier should be (TableIdentifier("lala_1", Some("default")).toSpark) table.tableType should be (CatalogTableType.MANAGED) table.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -442,8 +442,8 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa Field("f3", com.dimajix.flowman.types.StringType) ) )), - tablePrefix = "zz_", - view = "some_union_table_122" + tablePrefix = TableIdentifier("zz_"), + view = TableIdentifier("some_union_table_122") ) // == Create ================================================================================================ @@ -529,8 +529,8 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa partitions = Seq( PartitionField("part", com.dimajix.flowman.types.StringType) ), - tablePrefix = "zz_", - view = "some_union_table_123" + tablePrefix = TableIdentifier("zz_"), + view = TableIdentifier("some_union_table_123") ) // == Create ================================================================================================ @@ -701,7 +701,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view_1 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) view_1.provider should be (None) view_1.comment should be (None) - view_1.identifier should be (TableIdentifier("lala", Some("default"))) + view_1.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view_1.tableType should be (CatalogTableType.VIEW) view_1.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -713,7 +713,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("lala_1", Some("default"))) - table_1.identifier should be (TableIdentifier("lala_1", Some("default"))) + table_1.identifier should be (TableIdentifier("lala_1", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) table_1.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -754,10 +754,10 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view_2 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) view_2.provider should be (None) view_2.comment should be (None) - view_2.identifier should be (TableIdentifier("lala", Some("default"))) + view_2.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view_2.tableType should be (CatalogTableType.VIEW) if (hiveVarcharSupported) { - view_2.schema should be(StructType(Seq( + SchemaUtils.dropMetadata(view_2.schema) should be(StructType(Seq( StructField("str_col", StringType), StructField("char_col", CharType(10)), StructField("int_col", IntegerType), @@ 
-777,16 +777,16 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa // Inspect Hive table val table_2 = session.catalog.getTable(TableIdentifier("lala_1", Some("default"))) - table_2.identifier should be (TableIdentifier("lala_1", Some("default"))) + table_2.identifier should be (TableIdentifier("lala_1", Some("default")).toSpark) table_2.tableType should be (CatalogTableType.MANAGED) if (hiveVarcharSupported) { - table_2.schema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_2.schema) should be(StructType(Seq( StructField("str_col", StringType), StructField("int_col", IntegerType), StructField("char_col", CharType(10)), StructField("partition_col", StringType, nullable = false) ))) - table_2.dataSchema should be(StructType(Seq( + SchemaUtils.dropMetadata(table_2.dataSchema) should be(StructType(Seq( StructField("str_col", StringType), StructField("int_col", IntegerType), StructField("char_col", CharType(10)) @@ -922,7 +922,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view_1 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) view_1.provider should be (None) view_1.comment should be (None) - view_1.identifier should be (TableIdentifier("lala", Some("default"))) + view_1.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view_1.tableType should be (CatalogTableType.VIEW) view_1.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -934,7 +934,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa // Inspect Hive table val table_1 = session.catalog.getTable(TableIdentifier("lala_1", Some("default"))) - table_1.identifier should be (TableIdentifier("lala_1", Some("default"))) + table_1.identifier should be (TableIdentifier("lala_1", Some("default")).toSpark) table_1.tableType should be (CatalogTableType.MANAGED) table_1.schema should be (StructType(Seq( StructField("str_col", StringType), @@ -975,7 +975,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa val view_2 = session.catalog.getTable(TableIdentifier("lala", Some("default"))) view_2.provider should be (None) view_2.comment should be (None) - view_2.identifier should be (TableIdentifier("lala", Some("default"))) + view_2.identifier should be (TableIdentifier("lala", Some("default")).toSpark) view_2.tableType should be (CatalogTableType.VIEW) view_2.schema should be (StructType( StructField("str_col", StringType) :: @@ -988,7 +988,7 @@ class HiveUnionTableRelationTest extends AnyFlatSpec with Matchers with LocalSpa // Inspect Hive table val table_2 = session.catalog.getTable(TableIdentifier("lala_2", Some("default"))) - table_2.identifier should be (TableIdentifier("lala_2", Some("default"))) + table_2.identifier should be (TableIdentifier("lala_2", Some("default")).toSpark) table_2.tableType should be (CatalogTableType.MANAGED) table_2.schema should be (StructType(Seq( StructField("str_col", StringType), diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveViewRelationTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveViewRelationTest.scala index 80ad1b782..f2d300019 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveViewRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/HiveViewRelationTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2019 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed 
under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -16,7 +16,6 @@ package com.dimajix.flowman.spec.relation -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.NoSuchTableException import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException import org.apache.spark.sql.catalyst.catalog.CatalogTableType @@ -25,6 +24,7 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableIdentifier import com.dimajix.flowman.execution.MigrationPolicy import com.dimajix.flowman.execution.Session import com.dimajix.flowman.model.MappingOutputIdentifier @@ -72,11 +72,8 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess val relation = HiveViewRelation( Relation.Properties(context), - Some("default"), - "v0", - Seq(), - None, - Some(MappingOutputIdentifier("t0")) + table = TableIdentifier("v0", Some("default")), + mapping = Some(MappingOutputIdentifier("t0")) ) relation.provides should be (Set(ResourceIdentifier.ofHiveTable("v0", Some("default")))) @@ -160,11 +157,8 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess val relation = HiveViewRelation( Relation.Properties(context), - Some("default"), - "v0", - Seq(), - None, - Some(MappingOutputIdentifier("union")) + table = TableIdentifier("v0", Some("default")), + mapping = Some(MappingOutputIdentifier("union")) ) relation.provides should be (Set(ResourceIdentifier.ofHiveTable("v0", Some("default")))) @@ -235,8 +229,7 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess val view = HiveViewRelation( Relation.Properties(context), - database = Some("default"), - table = "table_or_view", + table = TableIdentifier("table_or_view", Some("default")), mapping = Some(MappingOutputIdentifier("t0")) ) val table = HiveTableRelation( @@ -249,8 +242,7 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess Field("f3", com.dimajix.flowman.types.IntegerType) ) )), - table = "table_or_view", - database = Some("default") + table = TableIdentifier("table_or_view", Some("default")) ) // == Create TABLE ============================================================================================ @@ -326,8 +318,7 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess val view = HiveViewRelation( Relation.Properties(context), - database = Some("default"), - table = "view", + table = TableIdentifier("view", Some("default")), sql = Some("SELECT * FROM table") ) val table = HiveTableRelation( @@ -340,8 +331,7 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess Field("f3", com.dimajix.flowman.types.IntegerType) ) )), - table = "table", - database = Some("default") + table = TableIdentifier("table", Some("default")) ) val table2 = HiveTableRelation( Relation.Properties(context, "rel_1"), @@ -353,8 +343,7 @@ class HiveViewRelationTest extends AnyFlatSpec with Matchers with LocalSparkSess Field("f4", com.dimajix.flowman.types.IntegerType) ) )), - table = "table", - database = Some("default") + table =TableIdentifier("table", Some("default")) ) // == Create TABLE ============================================================================================ diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/JdbcRelationTest.scala 
b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/JdbcRelationTest.scala index ebb507999..aa64dbe60 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/JdbcRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/JdbcRelationTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -23,11 +23,9 @@ import java.sql.Statement import java.util.Properties import scala.collection.JavaConverters._ -import scala.collection.mutable import scala.util.control.NonFatal import org.apache.spark.sql.Row -import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper import org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions @@ -40,6 +38,9 @@ import org.scalatest.matchers.should.Matchers import com.dimajix.common.No import com.dimajix.common.Yes +import com.dimajix.flowman.catalog.TableDefinition +import com.dimajix.flowman.catalog.TableIdentifier +import com.dimajix.flowman.catalog.TableIndex import com.dimajix.flowman.execution.DeleteClause import com.dimajix.flowman.execution.InsertClause import com.dimajix.flowman.execution.MigrationFailedException @@ -58,15 +59,16 @@ import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Schema import com.dimajix.flowman.model.ValueConnectionReference import com.dimajix.flowman.spec.ObjectMapper -import com.dimajix.flowman.spec.connection.JdbcConnection import com.dimajix.flowman.spec.schema.EmbeddedSchema import com.dimajix.flowman.types.DateType import com.dimajix.flowman.types.DoubleType import com.dimajix.flowman.types.Field +import com.dimajix.flowman.types.FloatType import com.dimajix.flowman.types.IntegerType import com.dimajix.flowman.types.SingleValue import com.dimajix.flowman.types.StringType import com.dimajix.flowman.types.StructType +import com.dimajix.flowman.types.VarcharType import com.dimajix.spark.sql.DataFrameBuilder import com.dimajix.spark.testing.LocalSparkSession @@ -115,6 +117,16 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession | type: string | - name: int_col | type: integer + | - name: float_col + | type: float + | primaryKey: + | - int_col + |indexes: + | - name: idx0 + | columns: [str_col, int_col] + | unique: false + |primaryKey: + | - str_col """.stripMargin val relationSpec = ObjectMapper.parse[RelationSpec](spec).asInstanceOf[JdbcRelationSpec] @@ -127,12 +139,16 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession Schema.Properties(context, name="embedded", kind="inline"), fields = Seq( Field("str_col", StringType), - Field("int_col", IntegerType) - ) + Field("int_col", IntegerType), + Field("float_col", FloatType) + ), + primaryKey = Seq("int_col") ))) relation.connection shouldBe a[ValueConnectionReference] relation.connection.identifier should be (ConnectionIdentifier("some_connection")) relation.connection.name should be ("some_connection") + relation.indexes should be (Seq(TableIndex("idx0", Seq("str_col", "int_col")))) + relation.primaryKey should be (Seq("str_col")) } it should "support the full lifecycle" in { @@ -824,6 +840,105 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession relation.loaded(execution, 
Map()) should be (No) } + it should "support upsert operations" in { + val db = tempDir.toPath.resolve("mydb") + val url = "jdbc:h2:" + db + val driver = "org.h2.Driver" + + val spec = + s""" + |connections: + | c0: + | kind: jdbc + | driver: $driver + | url: $url + |relations: + | t0: + | kind: jdbc + | description: "This is a test table" + | connection: c0 + | table: lala_001 + | schema: + | kind: inline + | fields: + | - name: id + | type: integer + | - name: name + | type: string + | - name: sex + | type: string + | primaryKey: ID + |""".stripMargin + val project = Module.read.string(spec).toProject("project") + + val session = Session.builder().withSparkSession(spark).build() + val execution = session.execution + val context = session.getContext(project) + + val relation = context.getRelation(RelationIdentifier("t0")) + + // == Create ================================================================================================== + relation.exists(execution) should be (No) + relation.loaded(execution, Map()) should be (No) + relation.create(execution) + relation.exists(execution) should be (Yes) + relation.read(execution).count() should be (0) + + // ===== Write Table ========================================================================================== + val tableSchema = org.apache.spark.sql.types.StructType(Seq( + StructField("id", org.apache.spark.sql.types.IntegerType), + StructField("name", org.apache.spark.sql.types.StringType), + StructField("sex", org.apache.spark.sql.types.StringType) + )) + val df0 = DataFrameBuilder.ofRows( + spark, + Seq( + Row(10, "Alice", "male"), + Row(20, "Bob", "male") + ), + tableSchema + ) + relation.write(execution, df0, mode=OutputMode.APPEND) + relation.exists(execution) should be (Yes) + relation.loaded(execution, Map()) should be (Yes) + + // ===== Read Table =========================================================================================== + val df1 = relation.read(execution) + df1.sort(col("id")).collect() should be (Seq( + Row(10, "Alice", "male"), + Row(20, "Bob", "male") + )) + + // ===== Merge Table ========================================================================================== + val updateSchema = org.apache.spark.sql.types.StructType(Seq( + StructField("id", org.apache.spark.sql.types.IntegerType), + StructField("name", org.apache.spark.sql.types.StringType), + StructField("sex", org.apache.spark.sql.types.StringType) + )) + val df2 = DataFrameBuilder.ofRows( + spark, + Seq( + Row(10, "Alice", "female"), + Row(50, "Debora", "female") + ), + updateSchema + ) + relation.write(execution, df2, mode=OutputMode.UPDATE) + + // ===== Read Table =========================================================================================== + val df3 = relation.read(execution) + df3.sort(col("id")).collect() should be (Seq( + Row(10, "Alice", "female"), + Row(20, "Bob", "male"), + Row(50, "Debora", "female") + )) + + // == Destroy ================================================================================================= + relation.destroy(execution) + relation.exists(execution) should be (No) + relation.loaded(execution, Map()) should be (No) + } + it should "support SQL queries" in { val db = tempDir.toPath.resolve("mydb") val url = "jdbc:derby:" + db + ";create=true" @@ -853,7 +968,7 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession ) )), connection = ConnectionReference(context, ConnectionIdentifier("c0")), - table = Some("lala_004") + table = Some(TableIdentifier("lala_004")) ) 
val relation_t1 = JdbcRelation( Relation.Properties(context, "t1"), @@ -924,7 +1039,7 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession ) )), connection = ConnectionReference(context, ConnectionIdentifier("c0")), - table = Some("lala_005") + table = Some(TableIdentifier("lala_005")) ) val rel1 = JdbcRelation( Relation.Properties(context, "t1"), @@ -936,7 +1051,7 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession ) )), connection = ConnectionReference(context, ConnectionIdentifier("c0")), - table = Some("lala_005") + table = Some(TableIdentifier("lala_005")) ) // == Create ================================================================================================= @@ -1004,6 +1119,132 @@ class JdbcRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession rel1.conforms(execution, MigrationPolicy.STRICT) should be (No) } + it should "support a primary key" in { + val db = tempDir.toPath.resolve("mydb") + val url = "jdbc:derby:" + db + ";create=true" + val driver = "org.apache.derby.jdbc.EmbeddedDriver" + + val spec = + s""" + |connections: + | c0: + | kind: jdbc + | driver: $driver + | url: $url + |""".stripMargin + val project = Module.read.string(spec).toProject("project") + + val session = Session.builder().withSparkSession(spark).build() + val execution = session.execution + val context = session.getContext(project) + + val rel0 = JdbcRelation( + Relation.Properties(context, "t0"), + schema = Some(EmbeddedSchema( + Schema.Properties(context), + fields = Seq( + Field("str_col", StringType), + Field("int_col", IntegerType), + Field("varchar_col", VarcharType(32)) + ) + )), + connection = ConnectionReference(context, ConnectionIdentifier("c0")), + table = Some(TableIdentifier("lala_005")), + primaryKey = Seq("int_col", "varchar_col") + ) + + // == Create ================================================================================================== + rel0.exists(execution) should be (No) + rel0.conforms(execution, MigrationPolicy.RELAXED) should be (No) + rel0.conforms(execution, MigrationPolicy.STRICT) should be (No) + rel0.create(execution) + rel0.exists(execution) should be (Yes) + rel0.conforms(execution, MigrationPolicy.RELAXED) should be (Yes) + rel0.conforms(execution, MigrationPolicy.STRICT) should be (Yes) + + // == Inspect ================================================================================================= + withConnection(url, "lala_005") { (con, options) => + JdbcUtils.getTable(con, TableIdentifier("lala_005"), options) + } should be ( + TableDefinition( + TableIdentifier("lala_005"), + columns = Seq( + Field("str_col", StringType), + Field("int_col", IntegerType, nullable=false), + Field("varchar_col", VarcharType(32), nullable=false) + ), + primaryKey = Seq("int_col", "varchar_col") + )) + + // == Destroy ================================================================================================= + rel0.exists(execution) should be (Yes) + rel0.destroy(execution) + rel0.exists(execution) should be (No) + } + + it should "support indexes" in { + val db = tempDir.toPath.resolve("mydb") + val url = "jdbc:derby:" + db + ";create=true" + val driver = "org.apache.derby.jdbc.EmbeddedDriver" + + val spec = + s""" + |connections: + | c0: + | kind: jdbc + | driver: $driver + | url: $url + |""".stripMargin + val project = Module.read.string(spec).toProject("project") + + val session = Session.builder().withSparkSession(spark).build() + val execution = session.execution + val context = 
session.getContext(project) + + val rel0 = JdbcRelation( + Relation.Properties(context, "t0"), + schema = Some(EmbeddedSchema( + Schema.Properties(context), + fields = Seq( + Field("str_col", StringType), + Field("int_col", IntegerType), + Field("varchar_col", VarcharType(32)) + ) + )), + connection = ConnectionReference(context, ConnectionIdentifier("c0")), + table = Some(TableIdentifier("lala_005")), + indexes = Seq(TableIndex("idx0",Seq("int_col", "varchar_col"))) + ) + + // == Create ================================================================================================== + rel0.exists(execution) should be (No) + rel0.conforms(execution, MigrationPolicy.RELAXED) should be (No) + rel0.conforms(execution, MigrationPolicy.STRICT) should be (No) + rel0.create(execution) + rel0.exists(execution) should be (Yes) + rel0.conforms(execution, MigrationPolicy.RELAXED) should be (Yes) + rel0.conforms(execution, MigrationPolicy.STRICT) should be (Yes) + + // == Inspect ================================================================================================= + withConnection(url, "lala_005") { (con, options) => + JdbcUtils.getTable(con, TableIdentifier("lala_005"), options) + } should be ( + TableDefinition( + TableIdentifier("lala_005"), + columns = Seq( + Field("str_col", StringType), + Field("int_col", IntegerType), + Field("varchar_col", VarcharType(32)) + ), + indexes = Seq(TableIndex("idx0",Seq("int_col", "varchar_col"))) + )) + + // == Destroy ================================================================================================= + rel0.exists(execution) should be (Yes) + rel0.destroy(execution) + rel0.exists(execution) should be (No) + } + private def withConnection[T](url:String, table:String)(fn:(Connection,JDBCOptions) => T) : T = { val props = Map( JDBCOptions.JDBC_URL -> url, diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/LocalRelationTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/LocalRelationTest.scala index 869ffdd1f..2a3839c71 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/LocalRelationTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/relation/LocalRelationTest.scala @@ -433,4 +433,8 @@ class LocalRelationTest extends AnyFlatSpec with Matchers with LocalSparkSession relation.loaded(execution, Map()) should be (No) relation.loaded(execution, Map("p2" -> SingleValue("2"))) should be (No) } + + it should "support using partitions without a pattern" in { + // TODO + } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/schema/SchemaTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/schema/SchemaTest.scala index 0174cd74d..b66a1c394 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/schema/SchemaTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/schema/SchemaTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. 
@@ -16,21 +16,15 @@ package com.dimajix.flowman.spec.schema -import com.fasterxml.jackson.databind.ObjectMapper -import com.fasterxml.jackson.dataformat.yaml.YAMLFactory -import com.fasterxml.jackson.module.scala.DefaultScalaModule import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers import com.dimajix.flowman.execution.RootContext +import com.dimajix.flowman.spec.ObjectMapper class SchemaTest extends AnyFlatSpec with Matchers { - lazy val mapper = { - val mapper = new ObjectMapper(new YAMLFactory()) - mapper.registerModule(DefaultScalaModule) - mapper - } + lazy val mapper = ObjectMapper.mapper "A Schema" should "default to the embedded schema" in { val spec = diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DocumentTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DocumentTargetTest.scala new file mode 100644 index 000000000..afec1de0d --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DocumentTargetTest.scala @@ -0,0 +1,59 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.target + +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.flowman.model.Module + + +class DocumentTargetTest extends AnyFlatSpec with Matchers { + "A DocumentTarget" should "be parseable" in { + val spec = + """ + |targets: + | docu: + | kind: document + | collectors: + | # Collect documentation of relations + | - kind: relations + | # Collect documentation of mappings + | - kind: mappings + | # Collect documentation of build targets + | - kind: targets + | # Execute all checks + | - kind: checks + | + | generators: + | # Create an output file in the project directory + | - kind: file + | location: ${project.basedir}/generated-documentation + | template: html + | excludeRelations: + | # You can either specify a name (without the project) + | - "stations_raw" + | # Or can also explicitly specify a name with the project + | - ".*/measurements_raw" + |""".stripMargin + + val module = Module.read.string(spec) + val target = module.targets("docu") + target shouldBe an[DocumentTargetSpec] + } + +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DropTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DropTargetTest.scala new file mode 100644 index 000000000..3ae1d2801 --- /dev/null +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/DropTargetTest.scala @@ -0,0 +1,115 @@ +/* + * Copyright 2022 Kaya Kupferschmidt + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. 
+ * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package com.dimajix.flowman.spec.target + +import org.scalamock.scalatest.MockFactory +import org.scalatest.flatspec.AnyFlatSpec +import org.scalatest.matchers.should.Matchers + +import com.dimajix.common.No +import com.dimajix.common.Yes +import com.dimajix.flowman.execution.Phase +import com.dimajix.flowman.execution.ScopeContext +import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.execution.Status +import com.dimajix.flowman.execution.VerificationFailedException +import com.dimajix.flowman.model.IdentifierRelationReference +import com.dimajix.flowman.model.Prototype +import com.dimajix.flowman.model.Relation +import com.dimajix.flowman.model.RelationIdentifier +import com.dimajix.flowman.model.ResourceIdentifier +import com.dimajix.flowman.spec.ObjectMapper +import com.dimajix.flowman.types.FieldValue +import com.dimajix.spark.testing.LocalSparkSession + + +class DropTargetTest extends AnyFlatSpec with Matchers with MockFactory with LocalSparkSession { + "The DropTarget" should "be parseable" in { + val spec = + """ + |kind: drop + |relation: some_relation + |""".stripMargin + + val session = Session.builder().disableSpark().build() + val context = session.context + + val targetSpec = ObjectMapper.parse[TargetSpec](spec) + val target = targetSpec.instantiate(context).asInstanceOf[DropTarget] + + target.relation should be (IdentifierRelationReference(context, "some_relation")) + } + + it should "work" in { + val session = Session.builder().withSparkSession(spark).build() + val execution = session.execution + + val relationTemplate = mock[Prototype[Relation]] + val relation = mock[Relation] + val context = ScopeContext.builder(session.context) + .withRelations(Map("some_relation" -> relationTemplate)) + .build() + val target = DropTarget( + context, + RelationIdentifier("some_relation") + ) + + (relationTemplate.instantiate _).expects(*).returns(relation) + + target.phases should be (Set(Phase.CREATE, Phase.VERIFY, Phase.DESTROY)) + + target.provides(Phase.VALIDATE) should be (Set()) + target.provides(Phase.CREATE) should be (Set()) + target.provides(Phase.BUILD) should be (Set()) + target.provides(Phase.VERIFY) should be (Set()) + target.provides(Phase.DESTROY) should be (Set()) + + target.requires(Phase.VALIDATE) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.requires(Phase.CREATE) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) + target.requires(Phase.BUILD) should be (Set()) + target.requires(Phase.VERIFY) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.requires(Phase.DESTROY) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) + + (relation.exists _).expects(execution).returns(Yes) + target.dirty(execution, 
Phase.CREATE) should be (Yes) + target.dirty(execution, Phase.VERIFY) should be (Yes) + (relation.exists _).expects(execution).returns(Yes) + target.dirty(execution, Phase.DESTROY) should be (Yes) + + (relation.exists _).expects(execution).returns(Yes) + target.execute(execution, Phase.VERIFY).exception.get shouldBe a[VerificationFailedException] + + (relation.destroy _).expects(execution, true) + target.execute(execution, Phase.CREATE).status should be (Status.SUCCESS) + + (relation.exists _).expects(execution).returns(Yes) + target.execute(execution, Phase.VERIFY).status should be (Status.FAILED) + + (relation.exists _).expects(execution).returns(No) + target.execute(execution, Phase.VERIFY).status should be (Status.SUCCESS) + + (relation.exists _).expects(execution).returns(No) + target.dirty(execution, Phase.CREATE) should be (No) + + (relation.exists _).expects(execution).returns(No) + target.dirty(execution, Phase.DESTROY) should be (No) + } +} diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MeasureTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MeasureTargetTest.scala index 08e0221d9..fc2f919c0 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MeasureTargetTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MeasureTargetTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -102,12 +102,12 @@ class MeasureTargetTest extends AnyFlatSpec with Matchers with MockFactory { measureResult.measurements should be (Seq(Measurement("m1", Map("name" -> "a1", "category" -> "measure", "kind" -> "sql", "phase" -> "VERIFY"), 23))) val metrics = execution.metricSystem - metrics.findMetric(Selector(Some("m1"), Map("name" -> "a1", "category" -> "measure", "kind" -> "sql", "phase" -> "VERIFY"))).size should be (1) - metrics.findMetric(Selector(Some("m1"), Map("name" -> "a1", "category" -> "measure", "kind" -> "sql" ))).size should be (1) - metrics.findMetric(Selector(Some("m1"), Map("name" -> "a1", "category" -> "measure"))).size should be (1) - metrics.findMetric(Selector(Some("m1"), Map("name" -> "a1"))).size should be (1) - metrics.findMetric(Selector(Some("m1"), Map())).size should be (1) - val gauges = metrics.findMetric(Selector(Some("m1"), Map("name" -> "a1", "category" -> "measure", "kind" -> "sql", "phase" -> "VERIFY"))) + metrics.findMetric(Selector("m1", Map("name" -> "a1", "category" -> "measure", "kind" -> "sql", "phase" -> "VERIFY"))).size should be (1) + metrics.findMetric(Selector("m1", Map("name" -> "a1", "category" -> "measure", "kind" -> "sql" ))).size should be (1) + metrics.findMetric(Selector("m1", Map("name" -> "a1", "category" -> "measure"))).size should be (1) + metrics.findMetric(Selector("m1", Map("name" -> "a1"))).size should be (1) + metrics.findMetric(Selector("m1")).size should be (1) + val gauges = metrics.findMetric(Selector("m1", Map("name" -> "a1", "category" -> "measure", "kind" -> "sql", "phase" -> "VERIFY"))) gauges.head.asInstanceOf[GaugeMetric].value should be (23.0) } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MergeTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MergeTargetTest.scala index d4493382c..4f4d9ef27 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MergeTargetTest.scala +++ 
b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/MergeTargetTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -180,7 +180,7 @@ class MergeTargetTest extends AnyFlatSpec with Matchers with LocalSparkSession { target.execute(executor, Phase.BUILD) val metric = executor.metricSystem - .findMetric(Selector(Some("target_records"), target.metadata.asMap)) + .findMetric(Selector("target_records", target.metadata.asMap)) .head .asInstanceOf[GaugeMetric] diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/RelationTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/RelationTargetTest.scala index a6a463e78..88939e9ff 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/RelationTargetTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/RelationTargetTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2018-2021 Kaya Kupferschmidt + * Copyright 2018-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -23,6 +23,7 @@ import java.util.UUID import org.apache.hadoop.fs.Path import org.apache.spark.sql.Row import org.apache.spark.sql.types.StructType +import org.scalamock.scalatest.MockFactory import org.scalatest.flatspec.AnyFlatSpec import org.scalatest.matchers.should.Matchers @@ -32,17 +33,20 @@ import com.dimajix.common.Yes import com.dimajix.flowman.execution.Context import com.dimajix.flowman.execution.Phase import com.dimajix.flowman.execution.Session +import com.dimajix.flowman.execution.Status import com.dimajix.flowman.metric.GaugeMetric import com.dimajix.flowman.metric.Selector import com.dimajix.flowman.model.Mapping import com.dimajix.flowman.model.MappingOutputIdentifier import com.dimajix.flowman.model.Module import com.dimajix.flowman.model.Project +import com.dimajix.flowman.model.Prototype import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier import com.dimajix.flowman.model.ResourceIdentifier import com.dimajix.flowman.model.Target import com.dimajix.flowman.model.TargetIdentifier +import com.dimajix.flowman.model.TargetResult import com.dimajix.flowman.spec.ObjectMapper import com.dimajix.flowman.spec.dataset.DatasetSpec import com.dimajix.flowman.spec.dataset.RelationDatasetSpec @@ -51,7 +55,7 @@ import com.dimajix.flowman.spec.relation.NullRelation import com.dimajix.spark.testing.LocalSparkSession -class RelationTargetTest extends AnyFlatSpec with Matchers with LocalSparkSession { +class RelationTargetTest extends AnyFlatSpec with Matchers with MockFactory with LocalSparkSession { "The RelationTarget" should "support embedded relations" in { val spec = """ @@ -305,7 +309,7 @@ class RelationTargetTest extends AnyFlatSpec with Matchers with LocalSparkSessio target.execute(executor, Phase.BUILD) val metric = executor.metricSystem - .findMetric(Selector(Some("target_records"), target.metadata.asMap)) + .findMetric(Selector("target_records", target.metadata.asMap)) .head .asInstanceOf[GaugeMetric] @@ -315,4 +319,121 @@ class RelationTargetTest extends AnyFlatSpec with Matchers with LocalSparkSessio target.execute(executor, Phase.BUILD) metric.value should be (4) } + + it should "behave correctly with VerifyPolicy=EMPTY_AS_FAILURE" in { + val 
relationGen = mock[Prototype[Relation]] + val relation = mock[Relation] + val project = Project( + name = "test", + relations = Map("relation" -> relationGen) + ) + + val session = Session.builder() + .withSparkSession(spark) + .withProject(project) + .withConfig("flowman.default.target.verifyPolicy","empty_as_failure") + .build() + val executor = session.execution + val context = session.getContext(project) + + val target = RelationTarget( + context, + RelationIdentifier("relation"), + MappingOutputIdentifier("mapping") + ) + (relationGen.instantiate _).expects(context).returns(relation) + + (relation.loaded _).expects(*,*).returns(Yes) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(Unknown) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(No) + target.execute(executor, Phase.VERIFY).status should be(Status.FAILED) + } + + it should "behave correctly with VerifyPolicy=EMPTY_AS_SUCCESS" in { + val relationGen = mock[Prototype[Relation]] + val relation = mock[Relation] + val project = Project( + name = "test", + relations = Map("relation" -> relationGen) + ) + + val session = Session.builder() + .withSparkSession(spark) + .withProject(project) + .withConfig("flowman.default.target.verifyPolicy","EMPTY_AS_SUCCESS") + .build() + val executor = session.execution + val context = session.getContext(project) + + val target = RelationTarget( + context, + RelationIdentifier("relation"), + MappingOutputIdentifier("mapping") + ) + (relationGen.instantiate _).expects(context).returns(relation) + + (relation.loaded _).expects(*,*).returns(Yes) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(Unknown) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(Yes) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(Unknown) + target.execute(executor, Phase.VERIFY).status should be(Status.SUCCESS) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(No) + target.execute(executor, Phase.VERIFY).status should be(Status.FAILED) + } + + it should "behave correctly with VerifyPolicy=EMPTY_AS_SUCCESS_WITH_ERRORS" in { + val relationGen = mock[Prototype[Relation]] + val relation = mock[Relation] + val project = Project( + name = "test", + relations = Map("relation" -> relationGen) + ) + + val session = Session.builder() + .withSparkSession(spark) + .withProject(project) + .withConfig("flowman.default.target.verifyPolicy","EMPTY_AS_SUCCESS_WITH_ERRORS") + .build() + val executor = session.execution + val context = session.getContext(project) + + val target = RelationTarget( + context, + RelationIdentifier("relation"), + MappingOutputIdentifier("mapping") + ) + (relationGen.instantiate _).expects(context).returns(relation) + + (relation.loaded _).expects(*,*).returns(Yes) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) 
+ + (relation.loaded _).expects(*,*).returns(Unknown) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS).withoutTime) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(Yes) + target.execute(executor, Phase.VERIFY).withoutTime should be(TargetResult(target, Phase.VERIFY, Status.SUCCESS_WITH_ERRORS).withoutTime) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(Unknown) + target.execute(executor, Phase.VERIFY).status should be(Status.SUCCESS_WITH_ERRORS) + + (relation.loaded _).expects(*,*).returns(No) + (relation.exists _).expects(*).returns(No) + target.execute(executor, Phase.VERIFY).status should be(Status.FAILED) + } } diff --git a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/TruncateTargetTest.scala b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/TruncateTargetTest.scala index 9471fc693..ca0515b28 100644 --- a/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/TruncateTargetTest.scala +++ b/flowman-spec/src/test/scala/com/dimajix/flowman/spec/target/TruncateTargetTest.scala @@ -1,5 +1,5 @@ /* - * Copyright 2021 Kaya Kupferschmidt + * Copyright 2021-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -27,6 +27,7 @@ import com.dimajix.flowman.execution.ScopeContext import com.dimajix.flowman.execution.Session import com.dimajix.flowman.execution.Status import com.dimajix.flowman.execution.VerificationFailedException +import com.dimajix.flowman.model.IdentifierRelationReference import com.dimajix.flowman.model.PartitionField import com.dimajix.flowman.model.Relation import com.dimajix.flowman.model.RelationIdentifier @@ -61,7 +62,7 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with val targetSpec = ObjectMapper.parse[TargetSpec](spec) val target = targetSpec.instantiate(context).asInstanceOf[TruncateTarget] - target.relation should be (RelationIdentifier("some_relation")) + target.relation should be (IdentifierRelationReference(context, "some_relation")) target.partitions should be (Map( "p1" -> SingleValue("1234"), "p2" -> RangeValue("a", "x") @@ -78,7 +79,7 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with .withRelations(Map("some_relation" -> relationTemplate)) .build() val target = TruncateTarget( - Target.Properties(context), + context, RelationIdentifier("some_relation"), Map( "p1" -> SingleValue("1234"), @@ -88,30 +89,42 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with (relationTemplate.instantiate _).expects(*).returns(relation) - target.phases should be (Set(Phase.BUILD, Phase.VERIFY)) + target.phases should be (Set(Phase.BUILD, Phase.VERIFY, Phase.TRUNCATE)) + target.provides(Phase.VALIDATE) should be (Set()) + target.provides(Phase.CREATE) should be (Set()) (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) (relation.resources _).expects(Map("p1" -> SingleValue("1234"),"p2" -> RangeValue("1", "3"))).returns(Set( ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "1")), ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "2")) )) - - target.provides(Phase.VALIDATE) should be (Set()) - target.provides(Phase.CREATE) should be (Set()) target.provides(Phase.BUILD) should be (Set( 
ResourceIdentifier.ofHiveTable("some_table"), ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "1")), ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "2")) )) target.provides(Phase.VERIFY) should be (Set()) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + (relation.resources _).expects(Map("p1" -> SingleValue("1234"),"p2" -> RangeValue("1", "3"))).returns(Set( + ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "1")), + ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "2")) + )) + target.provides(Phase.TRUNCATE) should be (Set( + ResourceIdentifier.ofHiveTable("some_table"), + ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "1")), + ResourceIdentifier.ofHivePartition("some_table", Some("db"), Map("p1" -> "1234", "p2" -> "2")) + )) target.provides(Phase.DESTROY) should be (Set()) - (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) - (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.VALIDATE) should be (Set()) target.requires(Phase.CREATE) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.BUILD) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.VERIFY) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.requires(Phase.TRUNCATE) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.DESTROY) should be (Set()) (relation.partitions _).expects().returns(Seq(PartitionField("p1", StringType), PartitionField("p2", IntegerType))) @@ -119,6 +132,10 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with (relation.loaded _).expects(execution, Map("p1" -> SingleValue("1234"),"p2" -> SingleValue("2"))).returns(No) target.dirty(execution, Phase.BUILD) should be (Yes) target.dirty(execution, Phase.VERIFY) should be (Yes) + (relation.partitions _).expects().returns(Seq(PartitionField("p1", StringType), PartitionField("p2", IntegerType))) + (relation.loaded _).expects(execution, Map("p1" -> SingleValue("1234"),"p2" -> SingleValue("1"))).returns(Yes) + (relation.loaded _).expects(execution, Map("p1" -> SingleValue("1234"),"p2" -> SingleValue("2"))).returns(No) + target.dirty(execution, Phase.TRUNCATE) should be (Yes) (relation.partitions _).expects().returns(Seq(PartitionField("p1", StringType), PartitionField("p2", IntegerType))) (relation.loaded _).expects(execution, Map("p1" -> SingleValue("1234"),"p2" -> SingleValue("1"))).returns(No) @@ -149,34 +166,41 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with .withRelations(Map("some_relation" -> relationTemplate)) .build() val target = TruncateTarget( - Target.Properties(context), + context, RelationIdentifier("some_relation") ) (relationTemplate.instantiate _).expects(*).returns(relation) - target.phases should be (Set(Phase.BUILD, Phase.VERIFY)) - - (relation.provides 
_).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) - (relation.resources _).expects(Map.empty[String,FieldValue]).returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.phases should be (Set(Phase.BUILD, Phase.VERIFY, Phase.TRUNCATE)) target.provides(Phase.VALIDATE) should be (Set()) target.provides(Phase.CREATE) should be (Set()) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + (relation.resources _).expects(Map.empty[String,FieldValue]).returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) target.provides(Phase.BUILD) should be (Set(ResourceIdentifier.ofHiveTable("some_table"))) target.provides(Phase.VERIFY) should be (Set()) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + (relation.resources _).expects(Map.empty[String,FieldValue]).returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.provides(Phase.TRUNCATE) should be (Set(ResourceIdentifier.ofHiveTable("some_table"))) target.provides(Phase.DESTROY) should be (Set()) - (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) - (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.VALIDATE) should be (Set()) target.requires(Phase.CREATE) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.BUILD) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.VERIFY) should be (Set()) + (relation.requires _).expects().returns(Set(ResourceIdentifier.ofHiveDatabase("db"))) + (relation.provides _).expects().returns(Set(ResourceIdentifier.ofHiveTable("some_table"))) + target.requires(Phase.TRUNCATE) should be (Set(ResourceIdentifier.ofHiveDatabase("db"), ResourceIdentifier.ofHiveTable("some_table"))) target.requires(Phase.DESTROY) should be (Set()) (relation.loaded _).expects(execution, Map.empty[String,SingleValue]).returns(Yes) target.dirty(execution, Phase.BUILD) should be (Yes) target.dirty(execution, Phase.VERIFY) should be (Yes) + (relation.loaded _).expects(execution, Map.empty[String,SingleValue]).returns(Yes) + target.dirty(execution, Phase.TRUNCATE) should be (Yes) (relation.loaded _).expects(execution, Map.empty[String,SingleValue]).returns(Yes) target.execute(execution, Phase.VERIFY).exception.get shouldBe a[VerificationFailedException] @@ -189,5 +213,8 @@ class TruncateTargetTest extends AnyFlatSpec with Matchers with MockFactory with (relation.loaded _).expects(execution, Map.empty[String,SingleValue]).returns(No) target.dirty(execution, Phase.BUILD) should be (No) + + (relation.loaded _).expects(execution, Map.empty[String,SingleValue]).returns(No) + target.dirty(execution, Phase.TRUNCATE) should be (No) } } diff --git a/flowman-studio-ui/package-lock.json b/flowman-studio-ui/package-lock.json index b555fbcc6..aff032287 100644 --- a/flowman-studio-ui/package-lock.json +++ b/flowman-studio-ui/package-lock.json @@ -5,6 +5,7 @@ "requires": true, "packages": { "": { + "name": "flowman-studio-ui", "version": "0.1.0", "dependencies": { "axios": "^0.21.4", @@ -2375,7 +2376,6 @@ "thread-loader": "^2.1.3", "url-loader": "^2.2.0", "vue-loader": "^15.9.2", - "vue-loader-v16": "npm:vue-loader@^16.1.0", "vue-style-loader": "^4.1.2", "webpack": "^4.0.0", "webpack-bundle-analyzer": 
"^3.8.0", @@ -2465,9 +2465,9 @@ } }, "node_modules/@vue/component-compiler-utils": { - "version": "3.2.2", - "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.2.2.tgz", - "integrity": "sha512-rAYMLmgMuqJFWAOb3Awjqqv5X3Q3hVr4jH/kgrFJpiU0j3a90tnNBplqbj+snzrgZhC9W128z+dtgMifOiMfJg==", + "version": "3.3.0", + "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.3.0.tgz", + "integrity": "sha512-97sfH2mYNU+2PzGrmK2haqffDpVASuib9/w2/noxiFi31Z54hW+q3izKQXXQZSNhtiUpAI36uSuYepeBe4wpHQ==", "dev": true, "dependencies": { "consolidate": "^0.15.1", @@ -2476,12 +2476,11 @@ "merge-source-map": "^1.1.0", "postcss": "^7.0.36", "postcss-selector-parser": "^6.0.2", - "prettier": "^1.18.2", "source-map": "~0.6.1", "vue-template-es2015-compiler": "^1.9.0" }, "optionalDependencies": { - "prettier": "^1.18.2" + "prettier": "^1.18.2 || ^2.0.0" } }, "node_modules/@vue/component-compiler-utils/node_modules/hash-sum": { @@ -4032,7 +4031,6 @@ "dependencies": { "anymatch": "~3.1.2", "braces": "~3.0.2", - "fsevents": "~2.3.2", "glob-parent": "~5.1.2", "is-binary-path": "~2.1.0", "is-glob": "~4.0.1", @@ -6297,13 +6295,13 @@ } }, "node_modules/eslint-module-utils": { - "version": "2.6.2", - "resolved": "https://registry.npmjs.org/eslint-module-utils/-/eslint-module-utils-2.6.2.tgz", - "integrity": "sha512-QG8pcgThYOuqxupd06oYTZoNOGaUdTY1PqK+oS6ElF6vs4pBdk/aYxFVQQXzcrAqp9m7cl7lb2ubazX+g16k2Q==", + "version": "2.7.3", + "resolved": "https://registry.npmjs.org/eslint-module-utils/-/eslint-module-utils-2.7.3.tgz", + "integrity": "sha512-088JEC7O3lDZM9xGe0RerkOMd0EjFl+Yvd1jPWIkMT5u3H9+HC34mWWPnqPrN13gieT9pBOO+Qt07Nb/6TresQ==", "dev": true, "dependencies": { "debug": "^3.2.7", - "pkg-dir": "^2.0.0" + "find-up": "^2.1.0" }, "engines": { "node": ">=4" @@ -6385,18 +6383,6 @@ "node": ">=4" } }, - "node_modules/eslint-module-utils/node_modules/pkg-dir": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/pkg-dir/-/pkg-dir-2.0.0.tgz", - "integrity": "sha1-9tXREJ4Z1j7fQo4L1X4Sd3YVM0s=", - "dev": true, - "dependencies": { - "find-up": "^2.1.0" - }, - "engines": { - "node": ">=4" - } - }, "node_modules/eslint-plugin-es": { "version": "3.0.1", "resolved": "https://registry.npmjs.org/eslint-plugin-es/-/eslint-plugin-es-3.0.1.tgz", @@ -6444,32 +6430,30 @@ } }, "node_modules/eslint-plugin-import": { - "version": "2.24.2", - "resolved": "https://registry.npmjs.org/eslint-plugin-import/-/eslint-plugin-import-2.24.2.tgz", - "integrity": "sha512-hNVtyhiEtZmpsabL4neEj+6M5DCLgpYyG9nzJY8lZQeQXEn5UPW1DpUdsMHMXsq98dbNm7nt1w9ZMSVpfJdi8Q==", + "version": "2.25.4", + "resolved": "https://registry.npmjs.org/eslint-plugin-import/-/eslint-plugin-import-2.25.4.tgz", + "integrity": "sha512-/KJBASVFxpu0xg1kIBn9AUa8hQVnszpwgE7Ld0lKAlx7Ie87yzEzCgSkekt+le/YVhiaosO4Y14GDAOc41nfxA==", "dev": true, "dependencies": { - "array-includes": "^3.1.3", - "array.prototype.flat": "^1.2.4", + "array-includes": "^3.1.4", + "array.prototype.flat": "^1.2.5", "debug": "^2.6.9", "doctrine": "^2.1.0", "eslint-import-resolver-node": "^0.3.6", - "eslint-module-utils": "^2.6.2", - "find-up": "^2.0.0", + "eslint-module-utils": "^2.7.2", "has": "^1.0.3", - "is-core-module": "^2.6.0", + "is-core-module": "^2.8.0", + "is-glob": "^4.0.3", "minimatch": "^3.0.4", - "object.values": "^1.1.4", - "pkg-up": "^2.0.0", - "read-pkg-up": "^3.0.0", + "object.values": "^1.1.5", "resolve": "^1.20.0", - "tsconfig-paths": "^3.11.0" + "tsconfig-paths": "^3.12.0" }, "engines": { "node": ">=4" }, 
"peerDependencies": { - "eslint": "^2 || ^3 || ^4 || ^5 || ^6 || ^7.2.0" + "eslint": "^2 || ^3 || ^4 || ^5 || ^6 || ^7.2.0 || ^8" } }, "node_modules/eslint-plugin-import/node_modules/debug": { @@ -6493,79 +6477,12 @@ "node": ">=0.10.0" } }, - "node_modules/eslint-plugin-import/node_modules/find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "dependencies": { - "locate-path": "^2.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/eslint-plugin-import/node_modules/locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "dependencies": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, "node_modules/eslint-plugin-import/node_modules/ms": { "version": "2.0.0", "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=", "dev": true }, - "node_modules/eslint-plugin-import/node_modules/p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "dependencies": { - "p-try": "^1.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/eslint-plugin-import/node_modules/p-locate": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "dependencies": { - "p-limit": "^1.1.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/eslint-plugin-import/node_modules/p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true, - "engines": { - "node": ">=4" - } - }, - "node_modules/eslint-plugin-import/node_modules/path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - "dev": true, - "engines": { - "node": ">=4" - } - }, "node_modules/eslint-plugin-node": { "version": "11.1.0", "resolved": "https://registry.npmjs.org/eslint-plugin-node/-/eslint-plugin-node-11.1.0.tgz", @@ -7475,9 +7392,9 @@ } }, "node_modules/follow-redirects": { - "version": "1.14.4", - "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.4.tgz", - "integrity": "sha512-zwGkiSXC1MUJG/qmeIFH2HBJx9u0V46QGUe3YR1fXG8bXQxq7fLj0RjLZQ5nubr9qNJUZrH+xUcwXEoXNpfS+g==", + "version": "1.14.8", + "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.8.tgz", + "integrity": "sha512-1x0S9UVJHsQprFcEC/qnNzBLcIxsjAV905f/UkQxbclCsoTWlacCNOpQa/anodLl2uaEKFhfWOvM2Qg77+15zA==", "funding": [ { "type": "individual", @@ -9111,9 +9028,9 @@ } }, "node_modules/is-core-module": { - "version": "2.7.0", - "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.7.0.tgz", - "integrity": "sha512-ByY+tjCciCr+9nLryBYcSD50EOGWt95c7tIsKTG1J2ixKKXPvF7Ej3AVd+UfDydAJom3biBGDBALaO79ktwgEQ==", + "version": "2.8.1", + "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.8.1.tgz", + "integrity": "sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA==", "dev": true, 
"dependencies": { "has": "^1.0.3" @@ -9634,9 +9551,6 @@ "resolved": "https://registry.npmjs.org/jsonfile/-/jsonfile-4.0.0.tgz", "integrity": "sha1-h3Gq4HmbZAdrdmQPygWPnBDjPss=", "dev": true, - "dependencies": { - "graceful-fs": "^4.1.6" - }, "optionalDependencies": { "graceful-fs": "^4.1.6" } @@ -9718,43 +9632,6 @@ "integrity": "sha1-HADHQ7QzzQpOgHWPe2SldEDZ/wA=", "dev": true }, - "node_modules/load-json-file": { - "version": "4.0.0", - "resolved": "https://registry.npmjs.org/load-json-file/-/load-json-file-4.0.0.tgz", - "integrity": "sha1-L19Fq5HjMhYjT9U62rZo607AmTs=", - "dev": true, - "dependencies": { - "graceful-fs": "^4.1.2", - "parse-json": "^4.0.0", - "pify": "^3.0.0", - "strip-bom": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/load-json-file/node_modules/parse-json": { - "version": "4.0.0", - "resolved": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz", - "integrity": "sha1-vjX1Qlvh9/bHRxhPmKeIy5lHfuA=", - "dev": true, - "dependencies": { - "error-ex": "^1.3.1", - "json-parse-better-errors": "^1.0.1" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/load-json-file/node_modules/pify": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/pify/-/pify-3.0.0.tgz", - "integrity": "sha1-5aSs0sEB/fPZpNB/DbxNtJ3SgXY=", - "dev": true, - "engines": { - "node": ">=4" - } - }, "node_modules/loader-fs-cache": { "version": "1.0.3", "resolved": "https://registry.npmjs.org/loader-fs-cache/-/loader-fs-cache-1.0.3.tgz", @@ -11723,85 +11600,6 @@ "node": ">=8" } }, - "node_modules/pkg-up": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/pkg-up/-/pkg-up-2.0.0.tgz", - "integrity": "sha1-yBmscoBZpGHKscOImivjxJoATX8=", - "dev": true, - "dependencies": { - "find-up": "^2.1.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "dependencies": { - "locate-path": "^2.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "dependencies": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "dependencies": { - "p-try": "^1.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/p-locate": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "dependencies": { - "p-limit": "^1.1.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true, - "engines": { - "node": ">=4" - } - }, - "node_modules/pkg-up/node_modules/path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - 
"dev": true, - "engines": { - "node": ">=4" - } - }, "node_modules/pnp-webpack-plugin": { "version": "1.7.0", "resolved": "https://registry.npmjs.org/pnp-webpack-plugin/-/pnp-webpack-plugin-1.7.0.tgz", @@ -12789,100 +12587,6 @@ "node": ">=8" } }, - "node_modules/read-pkg-up": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/read-pkg-up/-/read-pkg-up-3.0.0.tgz", - "integrity": "sha1-PtSWaF26D4/hGNBpHcUfSh/5bwc=", - "dev": true, - "dependencies": { - "find-up": "^2.0.0", - "read-pkg": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "dependencies": { - "locate-path": "^2.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "dependencies": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "dependencies": { - "p-try": "^1.0.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/p-locate": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "dependencies": { - "p-limit": "^1.1.0" - }, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - "dev": true, - "engines": { - "node": ">=4" - } - }, - "node_modules/read-pkg-up/node_modules/read-pkg": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/read-pkg/-/read-pkg-3.0.0.tgz", - "integrity": "sha1-nLxoaXj+5l0WwA4rGcI3/Pbjg4k=", - "dev": true, - "dependencies": { - "load-json-file": "^4.0.0", - "normalize-package-data": "^2.3.2", - "path-type": "^3.0.0" - }, - "engines": { - "node": ">=4" - } - }, "node_modules/readable-stream": { "version": "2.3.7", "resolved": "https://registry.npmjs.org/readable-stream/-/readable-stream-2.3.7.tgz", @@ -13599,9 +13303,9 @@ } }, "node_modules/sass-loader": { - "version": "10.2.0", - "resolved": "https://registry.npmjs.org/sass-loader/-/sass-loader-10.2.0.tgz", - "integrity": "sha512-kUceLzC1gIHz0zNJPpqRsJyisWatGYNFRmv2CKZK2/ngMJgLqxTbXwe/hJ85luyvZkgqU3VlJ33UVF2T/0g6mw==", + "version": "10.2.1", + "resolved": "https://registry.npmjs.org/sass-loader/-/sass-loader-10.2.1.tgz", + "integrity": "sha512-RRvWl+3K2LSMezIsd008ErK4rk6CulIMSwrcc2aZvjymUgKo/vjXGp1rSWmfTUX7bblEOz8tst4wBwWtCGBqKA==", "dev": true, "dependencies": { "klona": "^2.0.4", @@ -13753,9 +13457,9 @@ "dev": true }, "node_modules/selfsigned": { - "version": "1.10.11", - "resolved": 
"https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.11.tgz", - "integrity": "sha512-aVmbPOfViZqOZPgRBT0+3u4yZFHpmnIghLMlAcb5/xhp5ZtB/RVnKhz5vl2M32CLXAqR4kha9zfhNg0Lf/sxKA==", + "version": "1.10.14", + "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.14.tgz", + "integrity": "sha512-lkjaiAye+wBZDCBsu5BGi0XiLRxeUlsGod5ZP924CRSEoGuZAw/f7y9RKu28rwTfiHVhdavhB0qH0INV6P1lEA==", "dev": true, "dependencies": { "node-forge": "^0.10.0" @@ -13997,9 +13701,9 @@ "dev": true }, "node_modules/shelljs": { - "version": "0.8.4", - "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.4.tgz", - "integrity": "sha512-7gk3UZ9kOfPLIAbslLzyWeGiEqx9e3rxwZM0KE6EL8GlGwjym9Mrlx5/p33bWTu9YG6vcS4MBxYZDHYr5lr8BQ==", + "version": "0.8.5", + "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.5.tgz", + "integrity": "sha512-TiwcRcrkhHvbrZbnRcFYMLl30Dfov3HKqzp5tO5b4pt6G/SezKcYhmDg15zXVBswHmctSAQKznqNW2LO5tTDow==", "dev": true, "dependencies": { "glob": "^7.0.0", @@ -15262,9 +14966,9 @@ } }, "node_modules/tsconfig-paths": { - "version": "3.11.0", - "resolved": "https://registry.npmjs.org/tsconfig-paths/-/tsconfig-paths-3.11.0.tgz", - "integrity": "sha512-7ecdYDnIdmv639mmDwslG6KQg1Z9STTz1j7Gcz0xa+nshh/gKDAHcPxRbWOsA3SPp0tXP2leTcY9Kw+NAkfZzA==", + "version": "3.12.0", + "resolved": "https://registry.npmjs.org/tsconfig-paths/-/tsconfig-paths-3.12.0.tgz", + "integrity": "sha512-e5adrnOYT6zqVnWqZu7i/BQ3BnhzvGbjEjejFXO20lKIKpwTaupkCPgEfv4GZK1IBciJUEhYs3J3p75FdaTFVg==", "dev": true, "dependencies": { "@types/json5": "^0.0.29", @@ -15624,9 +15328,9 @@ } }, "node_modules/url-parse": { - "version": "1.5.3", - "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.3.tgz", - "integrity": "sha512-IIORyIQD9rvj0A4CLWsHkBBJuNqWpFQe224b6j9t/ABmquIS0qDU2pY6kl6AuOrL5OkCXHMCFNe1jBcuAggjvQ==", + "version": "1.5.10", + "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.10.tgz", + "integrity": "sha512-WypcfiRhfeUP9vvF0j6rw0J3hrWrw6iZv3+22h6iRMJ/8z1Tj6XfLP4DsUix5MhMPnXpiHDoKyoZ/bdCkwBCiQ==", "dev": true, "dependencies": { "querystringify": "^2.1.1", @@ -16162,10 +15866,8 @@ "integrity": "sha512-9P3MWk6SrKjHsGkLT2KHXdQ/9SNkyoJbabxnKOoJepsvJjJG8uYTR3yTPxPQvNDI3w4Nz1xnE0TLHK4RIVe/MQ==", "dev": true, "dependencies": { - "chokidar": "^3.4.1", "graceful-fs": "^4.1.2", - "neo-async": "^2.5.0", - "watchpack-chokidar2": "^2.0.1" + "neo-async": "^2.5.0" }, "optionalDependencies": { "chokidar": "^3.4.1", @@ -16227,7 +15929,6 @@ "anymatch": "^2.0.0", "async-each": "^1.0.1", "braces": "^2.3.2", - "fsevents": "^1.2.7", "glob-parent": "^3.1.0", "inherits": "^2.0.3", "is-binary-path": "^1.0.0", @@ -16564,7 +16265,6 @@ "anymatch": "^2.0.0", "async-each": "^1.0.1", "braces": "^2.3.2", - "fsevents": "^1.2.7", "glob-parent": "^3.1.0", "inherits": "^2.0.3", "is-binary-path": "^1.0.0", @@ -19059,9 +18759,9 @@ } }, "@vue/component-compiler-utils": { - "version": "3.2.2", - "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.2.2.tgz", - "integrity": "sha512-rAYMLmgMuqJFWAOb3Awjqqv5X3Q3hVr4jH/kgrFJpiU0j3a90tnNBplqbj+snzrgZhC9W128z+dtgMifOiMfJg==", + "version": "3.3.0", + "resolved": "https://registry.npmjs.org/@vue/component-compiler-utils/-/component-compiler-utils-3.3.0.tgz", + "integrity": "sha512-97sfH2mYNU+2PzGrmK2haqffDpVASuib9/w2/noxiFi31Z54hW+q3izKQXXQZSNhtiUpAI36uSuYepeBe4wpHQ==", "dev": true, "requires": { "consolidate": "^0.15.1", @@ -19070,7 +18770,7 @@ "merge-source-map": "^1.1.0", "postcss": "^7.0.36", "postcss-selector-parser": "^6.0.2", - 
"prettier": "^1.18.2", + "prettier": "^1.18.2 || ^2.0.0", "source-map": "~0.6.1", "vue-template-es2015-compiler": "^1.9.0" }, @@ -22244,13 +21944,13 @@ } }, "eslint-module-utils": { - "version": "2.6.2", - "resolved": "https://registry.npmjs.org/eslint-module-utils/-/eslint-module-utils-2.6.2.tgz", - "integrity": "sha512-QG8pcgThYOuqxupd06oYTZoNOGaUdTY1PqK+oS6ElF6vs4pBdk/aYxFVQQXzcrAqp9m7cl7lb2ubazX+g16k2Q==", + "version": "2.7.3", + "resolved": "https://registry.npmjs.org/eslint-module-utils/-/eslint-module-utils-2.7.3.tgz", + "integrity": "sha512-088JEC7O3lDZM9xGe0RerkOMd0EjFl+Yvd1jPWIkMT5u3H9+HC34mWWPnqPrN13gieT9pBOO+Qt07Nb/6TresQ==", "dev": true, "requires": { "debug": "^3.2.7", - "pkg-dir": "^2.0.0" + "find-up": "^2.1.0" }, "dependencies": { "debug": { @@ -22310,15 +22010,6 @@ "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", "dev": true - }, - "pkg-dir": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/pkg-dir/-/pkg-dir-2.0.0.tgz", - "integrity": "sha1-9tXREJ4Z1j7fQo4L1X4Sd3YVM0s=", - "dev": true, - "requires": { - "find-up": "^2.1.0" - } } } }, @@ -22350,26 +22041,24 @@ } }, "eslint-plugin-import": { - "version": "2.24.2", - "resolved": "https://registry.npmjs.org/eslint-plugin-import/-/eslint-plugin-import-2.24.2.tgz", - "integrity": "sha512-hNVtyhiEtZmpsabL4neEj+6M5DCLgpYyG9nzJY8lZQeQXEn5UPW1DpUdsMHMXsq98dbNm7nt1w9ZMSVpfJdi8Q==", + "version": "2.25.4", + "resolved": "https://registry.npmjs.org/eslint-plugin-import/-/eslint-plugin-import-2.25.4.tgz", + "integrity": "sha512-/KJBASVFxpu0xg1kIBn9AUa8hQVnszpwgE7Ld0lKAlx7Ie87yzEzCgSkekt+le/YVhiaosO4Y14GDAOc41nfxA==", "dev": true, "requires": { - "array-includes": "^3.1.3", - "array.prototype.flat": "^1.2.4", + "array-includes": "^3.1.4", + "array.prototype.flat": "^1.2.5", "debug": "^2.6.9", "doctrine": "^2.1.0", "eslint-import-resolver-node": "^0.3.6", - "eslint-module-utils": "^2.6.2", - "find-up": "^2.0.0", + "eslint-module-utils": "^2.7.2", "has": "^1.0.3", - "is-core-module": "^2.6.0", + "is-core-module": "^2.8.0", + "is-glob": "^4.0.3", "minimatch": "^3.0.4", - "object.values": "^1.1.4", - "pkg-up": "^2.0.0", - "read-pkg-up": "^3.0.0", + "object.values": "^1.1.5", "resolve": "^1.20.0", - "tsconfig-paths": "^3.11.0" + "tsconfig-paths": "^3.12.0" }, "dependencies": { "debug": { @@ -22390,60 +22079,11 @@ "esutils": "^2.0.2" } }, - "find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "requires": { - "locate-path": "^2.0.0" - } - }, - "locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "requires": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - } - }, "ms": { "version": "2.0.0", "resolved": "https://registry.npmjs.org/ms/-/ms-2.0.0.tgz", "integrity": "sha1-VgiurfwAvmwpAd9fmGF4jeDVl8g=", "dev": true - }, - "p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "requires": { - "p-try": "^1.0.0" - } - }, - "p-locate": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "requires": { - 
"p-limit": "^1.1.0" - } - }, - "p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true - }, - "path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - "dev": true } } }, @@ -23113,9 +22753,9 @@ } }, "follow-redirects": { - "version": "1.14.4", - "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.4.tgz", - "integrity": "sha512-zwGkiSXC1MUJG/qmeIFH2HBJx9u0V46QGUe3YR1fXG8bXQxq7fLj0RjLZQ5nubr9qNJUZrH+xUcwXEoXNpfS+g==" + "version": "1.14.8", + "resolved": "https://registry.npmjs.org/follow-redirects/-/follow-redirects-1.14.8.tgz", + "integrity": "sha512-1x0S9UVJHsQprFcEC/qnNzBLcIxsjAV905f/UkQxbclCsoTWlacCNOpQa/anodLl2uaEKFhfWOvM2Qg77+15zA==" }, "for-in": { "version": "1.0.2", @@ -24364,9 +24004,9 @@ } }, "is-core-module": { - "version": "2.7.0", - "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.7.0.tgz", - "integrity": "sha512-ByY+tjCciCr+9nLryBYcSD50EOGWt95c7tIsKTG1J2ixKKXPvF7Ej3AVd+UfDydAJom3biBGDBALaO79ktwgEQ==", + "version": "2.8.1", + "resolved": "https://registry.npmjs.org/is-core-module/-/is-core-module-2.8.1.tgz", + "integrity": "sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA==", "dev": true, "requires": { "has": "^1.0.3" @@ -24821,36 +24461,6 @@ "integrity": "sha1-HADHQ7QzzQpOgHWPe2SldEDZ/wA=", "dev": true }, - "load-json-file": { - "version": "4.0.0", - "resolved": "https://registry.npmjs.org/load-json-file/-/load-json-file-4.0.0.tgz", - "integrity": "sha1-L19Fq5HjMhYjT9U62rZo607AmTs=", - "dev": true, - "requires": { - "graceful-fs": "^4.1.2", - "parse-json": "^4.0.0", - "pify": "^3.0.0", - "strip-bom": "^3.0.0" - }, - "dependencies": { - "parse-json": { - "version": "4.0.0", - "resolved": "https://registry.npmjs.org/parse-json/-/parse-json-4.0.0.tgz", - "integrity": "sha1-vjX1Qlvh9/bHRxhPmKeIy5lHfuA=", - "dev": true, - "requires": { - "error-ex": "^1.3.1", - "json-parse-better-errors": "^1.0.1" - } - }, - "pify": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/pify/-/pify-3.0.0.tgz", - "integrity": "sha1-5aSs0sEB/fPZpNB/DbxNtJ3SgXY=", - "dev": true - } - } - }, "loader-fs-cache": { "version": "1.0.3", "resolved": "https://registry.npmjs.org/loader-fs-cache/-/loader-fs-cache-1.0.3.tgz", @@ -26424,66 +26034,6 @@ "find-up": "^4.0.0" } }, - "pkg-up": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/pkg-up/-/pkg-up-2.0.0.tgz", - "integrity": "sha1-yBmscoBZpGHKscOImivjxJoATX8=", - "dev": true, - "requires": { - "find-up": "^2.1.0" - }, - "dependencies": { - "find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "requires": { - "locate-path": "^2.0.0" - } - }, - "locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "requires": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - } - }, - "p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "requires": { - "p-try": "^1.0.0" - } - }, - "p-locate": { - 
"version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "requires": { - "p-limit": "^1.1.0" - } - }, - "p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true - }, - "path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - "dev": true - } - } - }, "pnp-webpack-plugin": { "version": "1.7.0", "resolved": "https://registry.npmjs.org/pnp-webpack-plugin/-/pnp-webpack-plugin-1.7.0.tgz", @@ -27335,78 +26885,6 @@ "type-fest": "^0.6.0" } }, - "read-pkg-up": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/read-pkg-up/-/read-pkg-up-3.0.0.tgz", - "integrity": "sha1-PtSWaF26D4/hGNBpHcUfSh/5bwc=", - "dev": true, - "requires": { - "find-up": "^2.0.0", - "read-pkg": "^3.0.0" - }, - "dependencies": { - "find-up": { - "version": "2.1.0", - "resolved": "https://registry.npmjs.org/find-up/-/find-up-2.1.0.tgz", - "integrity": "sha1-RdG35QbHF93UgndaK3eSCjwMV6c=", - "dev": true, - "requires": { - "locate-path": "^2.0.0" - } - }, - "locate-path": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/locate-path/-/locate-path-2.0.0.tgz", - "integrity": "sha1-K1aLJl7slExtnA3pw9u7ygNUzY4=", - "dev": true, - "requires": { - "p-locate": "^2.0.0", - "path-exists": "^3.0.0" - } - }, - "p-limit": { - "version": "1.3.0", - "resolved": "https://registry.npmjs.org/p-limit/-/p-limit-1.3.0.tgz", - "integrity": "sha512-vvcXsLAJ9Dr5rQOPk7toZQZJApBl2K4J6dANSsEuh6QI41JYcsS/qhTGa9ErIUUgK3WNQoJYvylxvjqmiqEA9Q==", - "dev": true, - "requires": { - "p-try": "^1.0.0" - } - }, - "p-locate": { - "version": "2.0.0", - "resolved": "https://registry.npmjs.org/p-locate/-/p-locate-2.0.0.tgz", - "integrity": "sha1-IKAQOyIqcMj9OcwuWAaA893l7EM=", - "dev": true, - "requires": { - "p-limit": "^1.1.0" - } - }, - "p-try": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/p-try/-/p-try-1.0.0.tgz", - "integrity": "sha1-y8ec26+P1CKOE/Yh8rGiN8GyB7M=", - "dev": true - }, - "path-exists": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/path-exists/-/path-exists-3.0.0.tgz", - "integrity": "sha1-zg6+ql94yxiSXqfYENe1mwEP1RU=", - "dev": true - }, - "read-pkg": { - "version": "3.0.0", - "resolved": "https://registry.npmjs.org/read-pkg/-/read-pkg-3.0.0.tgz", - "integrity": "sha1-nLxoaXj+5l0WwA4rGcI3/Pbjg4k=", - "dev": true, - "requires": { - "load-json-file": "^4.0.0", - "normalize-package-data": "^2.3.2", - "path-type": "^3.0.0" - } - } - } - }, "readable-stream": { "version": "2.3.7", "resolved": "https://registry.npmjs.org/readable-stream/-/readable-stream-2.3.7.tgz", @@ -27985,9 +27463,9 @@ } }, "sass-loader": { - "version": "10.2.0", - "resolved": "https://registry.npmjs.org/sass-loader/-/sass-loader-10.2.0.tgz", - "integrity": "sha512-kUceLzC1gIHz0zNJPpqRsJyisWatGYNFRmv2CKZK2/ngMJgLqxTbXwe/hJ85luyvZkgqU3VlJ33UVF2T/0g6mw==", + "version": "10.2.1", + "resolved": "https://registry.npmjs.org/sass-loader/-/sass-loader-10.2.1.tgz", + "integrity": "sha512-RRvWl+3K2LSMezIsd008ErK4rk6CulIMSwrcc2aZvjymUgKo/vjXGp1rSWmfTUX7bblEOz8tst4wBwWtCGBqKA==", "dev": true, "requires": { "klona": "^2.0.4", @@ -28090,9 +27568,9 @@ "dev": true }, "selfsigned": { - "version": "1.10.11", - "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.11.tgz", - "integrity": 
"sha512-aVmbPOfViZqOZPgRBT0+3u4yZFHpmnIghLMlAcb5/xhp5ZtB/RVnKhz5vl2M32CLXAqR4kha9zfhNg0Lf/sxKA==", + "version": "1.10.14", + "resolved": "https://registry.npmjs.org/selfsigned/-/selfsigned-1.10.14.tgz", + "integrity": "sha512-lkjaiAye+wBZDCBsu5BGi0XiLRxeUlsGod5ZP924CRSEoGuZAw/f7y9RKu28rwTfiHVhdavhB0qH0INV6P1lEA==", "dev": true, "requires": { "node-forge": "^0.10.0" @@ -28306,9 +27784,9 @@ "dev": true }, "shelljs": { - "version": "0.8.4", - "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.4.tgz", - "integrity": "sha512-7gk3UZ9kOfPLIAbslLzyWeGiEqx9e3rxwZM0KE6EL8GlGwjym9Mrlx5/p33bWTu9YG6vcS4MBxYZDHYr5lr8BQ==", + "version": "0.8.5", + "resolved": "https://registry.npmjs.org/shelljs/-/shelljs-0.8.5.tgz", + "integrity": "sha512-TiwcRcrkhHvbrZbnRcFYMLl30Dfov3HKqzp5tO5b4pt6G/SezKcYhmDg15zXVBswHmctSAQKznqNW2LO5tTDow==", "dev": true, "requires": { "glob": "^7.0.0", @@ -29348,9 +28826,9 @@ "dev": true }, "tsconfig-paths": { - "version": "3.11.0", - "resolved": "https://registry.npmjs.org/tsconfig-paths/-/tsconfig-paths-3.11.0.tgz", - "integrity": "sha512-7ecdYDnIdmv639mmDwslG6KQg1Z9STTz1j7Gcz0xa+nshh/gKDAHcPxRbWOsA3SPp0tXP2leTcY9Kw+NAkfZzA==", + "version": "3.12.0", + "resolved": "https://registry.npmjs.org/tsconfig-paths/-/tsconfig-paths-3.12.0.tgz", + "integrity": "sha512-e5adrnOYT6zqVnWqZu7i/BQ3BnhzvGbjEjejFXO20lKIKpwTaupkCPgEfv4GZK1IBciJUEhYs3J3p75FdaTFVg==", "dev": true, "requires": { "@types/json5": "^0.0.29", @@ -29649,9 +29127,9 @@ } }, "url-parse": { - "version": "1.5.3", - "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.3.tgz", - "integrity": "sha512-IIORyIQD9rvj0A4CLWsHkBBJuNqWpFQe224b6j9t/ABmquIS0qDU2pY6kl6AuOrL5OkCXHMCFNe1jBcuAggjvQ==", + "version": "1.5.10", + "resolved": "https://registry.npmjs.org/url-parse/-/url-parse-1.5.10.tgz", + "integrity": "sha512-WypcfiRhfeUP9vvF0j6rw0J3hrWrw6iZv3+22h6iRMJ/8z1Tj6XfLP4DsUix5MhMPnXpiHDoKyoZ/bdCkwBCiQ==", "dev": true, "requires": { "querystringify": "^2.1.1", diff --git a/flowman-studio-ui/pom.xml b/flowman-studio-ui/pom.xml index 264fdf1ec..df95bfe81 100644 --- a/flowman-studio-ui/pom.xml +++ b/flowman-studio-ui/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-studio/pom.xml b/flowman-studio/pom.xml index 562841449..a79842ad7 100644 --- a/flowman-studio/pom.xml +++ b/flowman-studio/pom.xml @@ -9,7 +9,7 @@ flowman-root com.dimajix.flowman - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-studio/src/main/scala/com/dimajix/flowman/studio/model/Converter.scala b/flowman-studio/src/main/scala/com/dimajix/flowman/studio/model/Converter.scala index f0fd83e9f..8f91ed516 100644 --- a/flowman-studio/src/main/scala/com/dimajix/flowman/studio/model/Converter.scala +++ b/flowman-studio/src/main/scala/com/dimajix/flowman/studio/model/Converter.scala @@ -84,8 +84,8 @@ object Converter { mapping.broadcast, mapping.cache.description, mapping.checkpoint, - mapping.inputs.map(_.toString), - mapping.outputs, + mapping.inputs.toSeq.map(_.toString), + mapping.outputs.toSeq, mapping.metadata.labels ) } diff --git a/flowman-testing/pom.xml b/flowman-testing/pom.xml index c4bfe5ee5..d94806b71 100644 --- a/flowman-testing/pom.xml +++ b/flowman-testing/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git a/flowman-tools/pom.xml b/flowman-tools/pom.xml index 82907e8c5..495e992cf 100644 --- a/flowman-tools/pom.xml +++ b/flowman-tools/pom.xml @@ -9,7 +9,7 @@ com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 ../pom.xml diff --git 
a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/Tool.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/Tool.scala
index ad18b365f..2f4beb84d 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/Tool.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/Tool.scala
@@ -101,12 +101,13 @@ class Tool {
         // Create Flowman Session, which also includes a Spark Session
         val builder = Session.builder()
             .withNamespace(namespace)
-            .withProject(project.orNull)
             .withConfig(allConfigs)
             .withEnvironment(additionalEnvironment)
             .withProfiles(profiles)
             .withJars(plugins.jars.map(_.toString))
+        project.foreach(builder.withProject)
+
         if (sparkName.nonEmpty)
             builder.withSparkName(sparkName)
         if (sparkMaster.nonEmpty)
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Arguments.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Arguments.scala
index c74558ae9..afee7b4f9 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Arguments.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Arguments.scala
@@ -1,5 +1,5 @@
 /*
- * Copyright 2018-2021 Kaya Kupferschmidt
+ * Copyright 2018-2022 Kaya Kupferschmidt
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -27,6 +27,7 @@ import org.kohsuke.args4j.spi.SubCommand
 import org.kohsuke.args4j.spi.SubCommandHandler
 import org.kohsuke.args4j.spi.SubCommands
 
+import com.dimajix.flowman.tools.exec.documentation.DocumentationCommand
 import com.dimajix.flowman.tools.exec.history.HistoryCommand
 import com.dimajix.flowman.tools.exec.info.InfoCommand
 import com.dimajix.flowman.tools.exec.job.JobCommand
@@ -64,6 +65,7 @@ class Arguments(args:Array[String]) {
 
     @Argument(required=false,index=0,metaVar="",usage="the object to work with",handler=classOf[SubCommandHandler])
     @SubCommands(Array(
+        new SubCommand(name="documentation",impl=classOf[DocumentationCommand]),
         new SubCommand(name="history",impl=classOf[HistoryCommand]),
         new SubCommand(name="info",impl=classOf[InfoCommand]),
         new SubCommand(name="job",impl=classOf[JobCommand]),
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Driver.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Driver.scala
index 46b8138a5..a357306a4 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Driver.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/Driver.scala
@@ -16,6 +16,9 @@
 
 package com.dimajix.flowman.tools.exec
 
+import java.time.Duration
+import java.time.Instant
+
 import scala.util.Failure
 import scala.util.Success
 import scala.util.Try
@@ -112,6 +115,7 @@ class Driver(options:Arguments) extends Tool {
         else {
             // Create Flowman Session, which also includes a Spark Session
             val project = loadProject(new Path(options.projectFile))
+            val config = splitSettings(options.config)
             val environment = splitSettings(options.environment)
 
             val session = createSession(
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/DocumentationCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/DocumentationCommand.scala
new file mode 100644
index 000000000..a95ab3546
--- /dev/null
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/DocumentationCommand.scala
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.dimajix.flowman.tools.exec.documentation
+
+import org.kohsuke.args4j.Argument
+import org.kohsuke.args4j.spi.SubCommand
+import org.kohsuke.args4j.spi.SubCommandHandler
+import org.kohsuke.args4j.spi.SubCommands
+
+import com.dimajix.flowman.tools.exec.Command
+import com.dimajix.flowman.tools.exec.NestedCommand
+
+
+class DocumentationCommand extends NestedCommand {
+    @Argument(required=true,index=0,metaVar="",usage="the subcommand to run",handler=classOf[SubCommandHandler])
+    @SubCommands(Array(
+        new SubCommand(name="generate",impl=classOf[GenerateCommand])
+    ))
+    override var command:Command = _
+}
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/GenerateCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/GenerateCommand.scala
new file mode 100644
index 000000000..f82cd3d38
--- /dev/null
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/documentation/GenerateCommand.scala
@@ -0,0 +1,72 @@
+/*
+ * Copyright 2022 Kaya Kupferschmidt
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.dimajix.flowman.tools.exec.documentation
+
+import scala.util.Failure
+import scala.util.Success
+import scala.util.Try
+import scala.util.control.NonFatal
+
+import org.kohsuke.args4j.Argument
+import org.slf4j.LoggerFactory
+
+import com.dimajix.common.ExceptionUtils.reasons
+import com.dimajix.flowman.common.ParserUtils.splitSettings
+import com.dimajix.flowman.execution.Context
+import com.dimajix.flowman.execution.Session
+import com.dimajix.flowman.execution.Status
+import com.dimajix.flowman.model.Job
+import com.dimajix.flowman.model.JobIdentifier
+import com.dimajix.flowman.model.Project
+import com.dimajix.flowman.spec.documentation.DocumenterLoader
+import com.dimajix.flowman.tools.exec.Command
+
+
+class GenerateCommand extends Command {
+    private val logger = LoggerFactory.getLogger(getClass)
+
+    @Argument(index=0, required=false, usage = "specifies job to document", metaVar = "")
+    var job: String = "main"
+    @Argument(index=1, required=false, usage = "specifies job parameters", metaVar = "=")
+    var args: Array[String] = Array()
+
+    override def execute(session: Session, project: Project, context:Context) : Status = {
+        val args = splitSettings(this.args).toMap
+        Try {
+            context.getJob(JobIdentifier(job))
+        }
+        match {
+            case Failure(e) =>
+                logger.error(s"Error instantiating job '$job': ${reasons(e)}")
+                Status.FAILED
+            case Success(job) =>
+                generateDoc(session, project, job, job.arguments(args))
+        }
+    }
+
+    private def generateDoc(session: Session, project:Project, job:Job, args:Map[String,Any]) : Status = {
+        val documenter = DocumenterLoader.load(job.context, project)
+        try {
+            documenter.execute(session, job, args)
+            Status.SUCCESS
+        } catch {
+            case NonFatal(ex) =>
+                logger.error("Cannot generate documentation: ", ex)
+                Status.FAILED
+        }
+    }
+}
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/DescribeCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/DescribeCommand.scala
index 0d08cbb2f..35c3e3445 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/DescribeCommand.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/DescribeCommand.scala
@@ -22,6 +22,7 @@ import org.kohsuke.args4j.Argument
 import org.kohsuke.args4j.Option
 import org.slf4j.LoggerFactory
 
+import com.dimajix.flowman.common.ParserUtils
 import com.dimajix.flowman.execution.Context
 import com.dimajix.flowman.execution.NoSuchRelationException
 import com.dimajix.flowman.execution.Session
@@ -29,6 +30,7 @@ import com.dimajix.flowman.execution.Status
 import com.dimajix.flowman.model.Project
 import com.dimajix.flowman.model.RelationIdentifier
 import com.dimajix.flowman.tools.exec.Command
+import com.dimajix.flowman.types.SingleValue
 
 
 class DescribeCommand {
@@ -38,19 +40,22 @@ class DescribeCommand extends Command {
     var useSpark: Boolean = false
     @Argument(usage = "specifies the relation to describe", metaVar = "", required = true)
     var relation: String = ""
+    @Option(name="-p", aliases=Array("--partition"), usage = "specify partition to work on, as partition1=value1,partition2=value2")
+    var partition: String = ""
 
     override def execute(session: Session, project: Project, context:Context) : Status = {
         try {
             val identifier = RelationIdentifier(this.relation)
             val relation = context.getRelation(identifier)
+            val partition = ParserUtils.parseDelimitedKeyValues(this.partition).map { case(k,v) => (k,SingleValue(v)) }
 
             if (useSpark) {
-                val df = relation.read(session.execution, Map())
+                val df = relation.read(session.execution, partition)
                 df.printSchema()
             }
             else {
                 val execution = session.execution
-                val schema = relation.describe(execution)
+                val schema = execution.describe(relation, partition)
                 schema.printTree()
             }
 
             Status.SUCCESS
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/PhaseCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/PhaseCommand.scala
index c407fad77..515a182b0 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/PhaseCommand.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/model/PhaseCommand.scala
@@ -59,15 +59,12 @@ class PhaseCommand(phase:Phase) extends Command {
                 project.relations.keys.toSeq
         val partition = ParserUtils.parseDelimitedKeyValues(this.partition)
         val targets = toRun.map { rel =>
-            // Create Properties without a project. Otherwise the lookup of the relation will fail, since its identifier
-            // will refer to the project. And since the relation are not part of the project, this is also really correct
-            val name = rel + "-" + Clock.systemUTC().millis()
-            val props = Target.Properties(context.root, name, "relation")
+            val props = Target.Properties(context.root, rel, "relation")
             RelationTarget(props, RelationIdentifier(rel, project.name), MappingOutputIdentifier.empty, partition)
         }
 
         val runner = session.runner
-        runner.executeTargets(targets, Seq(phase), force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=false)
+        runner.executeTargets(targets, Seq(phase), jobName="cli-tools", force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=false)
     }
 }
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/project/ProjectCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/project/ProjectCommand.scala
index fbf31b1d8..8bb84d816 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/project/ProjectCommand.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/project/ProjectCommand.scala
@@ -21,8 +21,6 @@ import org.kohsuke.args4j.spi.SubCommand
 import org.kohsuke.args4j.spi.SubCommandHandler
 import org.kohsuke.args4j.spi.SubCommands
 
-import com.dimajix.flowman.execution.Session
-import com.dimajix.flowman.model.Project
 import com.dimajix.flowman.tools.exec.Command
 import com.dimajix.flowman.tools.exec.NestedCommand
 
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/target/PhaseCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/target/PhaseCommand.scala
index ee4a61c3d..1c6104738 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/target/PhaseCommand.scala
+++ b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/exec/target/PhaseCommand.scala
@@ -62,7 +62,7 @@ class PhaseCommand(phase:Phase) extends Command {
                 context.getTarget(TargetIdentifier(t))
             }
 
         val runner = session.runner
-        runner.executeTargets(allTargets, lifecycle, force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=false)
+        runner.executeTargets(allTargets, lifecycle, jobName="cli-tools", force=force, keepGoing=keepGoing, dryRun=dryRun, isolated=false)
     }
 }
diff --git a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/shell/ParsedCommand.scala b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/shell/ParsedCommand.scala
index 06cc46e44..023de2735 100644
--- a/flowman-tools/src/main/scala/com/dimajix/flowman/tools/shell/ParsedCommand.scala
+++ 
b/flowman-tools/src/main/scala/com/dimajix/flowman/tools/shell/ParsedCommand.scala @@ -1,5 +1,5 @@ /* - * Copyright 2020-2021 Kaya Kupferschmidt + * Copyright 2020-2022 Kaya Kupferschmidt * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. @@ -23,6 +23,7 @@ import org.kohsuke.args4j.spi.SubCommands import com.dimajix.flowman.tools.exec.Command import com.dimajix.flowman.tools.exec.VersionCommand +import com.dimajix.flowman.tools.exec.documentation.DocumentationCommand import com.dimajix.flowman.tools.exec.info.InfoCommand import com.dimajix.flowman.tools.exec.mapping.MappingCommand import com.dimajix.flowman.tools.exec.model.ModelCommand @@ -38,6 +39,7 @@ import com.dimajix.flowman.tools.exec.history.HistoryCommand class ParsedCommand { @Argument(required=false,index=0,metaVar="",usage="the object to work with",handler=classOf[SubCommandHandler]) @SubCommands(Array( + new SubCommand(name="documentation",impl=classOf[DocumentationCommand]), new SubCommand(name="eval",impl=classOf[EvaluateCommand]), new SubCommand(name="exit",impl=classOf[ExitCommand]), new SubCommand(name="history",impl=classOf[HistoryCommand]), diff --git a/pom.xml b/pom.xml index fd035143f..368e356c0 100644 --- a/pom.xml +++ b/pom.xml @@ -5,7 +5,7 @@ 4.0.0 com.dimajix.flowman flowman-root - 0.21.2 + 0.22.0 pom Flowman root pom A Spark based ETL tool @@ -57,13 +57,12 @@ 2.4.0 1.9.13 4.0.0 - 10.12.1.1 + 2.1.1 2.1.210 1.2.17 1.1 5.1.0 2.5.0 - 2.10.5 2.2.4 3.5.2 4.0.4 @@ -83,44 +82,45 @@ 1.6 3.9.9.Final - - 3.2.0 - 3.2 - 2.3 + + 3.3.1 + 3.3 + 2.4.2 1.9.3 - 1.11 + 1.15 - - 2.12.10 + + 2.12.15 2.12 - 3.2.5 + 3.2.9 3.2 1.2.0 1.1.2 - 2.1.1 - 3.1.2 - 3.1 - 4.1.51.Final - 4.8-1 - 1.24 - 2.10.0 - 2.10 - 2.10.0 + 3.2.1 + 3.2 + 1.1.8.4 + 4.1.68.Final + 4.8 + 1.27 + 2.12.3 + 2.12 + 2.12.3 2.8 - 2.6.0 - 3.5.7 - 1.8.2 - 3.7.0-M5 + 2.8.0 + 10.14.2.0 + 3.6.2 + 1.10.2 + 3.7.0-M11 14.0.1 1.7.30 - 4.5.6 - 4.4.12 - 4.1.1 - 1.1.8.2 - 2.4 + 4.5.13 + 4.4.14 + 4.2.0 + 2.10.10 3.2.2 - 1.20 - 3.9 + 1.21 + 2.8.0 + 3.12.0 ${project.version} @@ -373,15 +373,15 @@ 2.12.15 2.12 - 3.2.5 + 3.2.9 3.2 1.2.0 1.1.2 - 3.2.0 + 3.2.1 3.2 1.1.8.4 4.1.68.Final - 4.8-1 + 4.8 1.27 2.12.3 2.12 @@ -396,7 +396,7 @@ 1.7.30 4.5.13 4.4.14 - 4.1.1 + 4.2.0 2.10.10 3.2.2 1.21 @@ -600,12 +600,12 @@ org.apache.maven.plugins maven-site-plugin - 3.9.1 + 3.10.0 org.apache.maven.plugins maven-project-info-reports-plugin - 3.1.1 + 3.2.1 org.apache.maven.plugins @@ -629,7 +629,7 @@ true org.codehaus.mojo build-helper-maven-plugin - 3.2.0 + 3.3.0 true @@ -657,7 +657,7 @@ true org.apache.maven.plugins maven-compiler-plugin - 3.8.1 + 3.9.0 ${maven.compiler.source} ${maven.compiler.target} @@ -677,7 +677,7 @@ true net.alchim31.maven scala-maven-plugin - 4.5.4 + 4.5.6 ${scala.version} ${scala.api_version} @@ -858,7 +858,7 @@ true org.codehaus.mojo versions-maven-plugin - 2.8.1 + 2.9.0 true @@ -898,12 +898,12 @@ org.apache.maven.plugins maven-project-info-reports-plugin - 3.1.1 + 3.2.1 net.alchim31.maven scala-maven-plugin - 4.5.3 + 4.5.6 -Xms64m @@ -914,7 +914,7 @@ org.scoverage scoverage-maven-plugin - 1.4.1 + 1.4.11 false