5.0.0

@github-actions github-actions released this 18 Oct 09:38
· 171 commits to master since this release

This release brings the first GCP-supported applications in the RDB Loader family: Snowflake Loader and Transformer Pubsub.

It also includes bug fixes for Databricks Loader and Transformer Kinesis.

GCP support on Snowflake Loader and Transformer Pubsub

Since their inception, RDB Loader applications have been developed to run on AWS. Making it possible to run them on GCP has been on our roadmap for a long time. This release paves the way by integrating GCP services into Snowflake Loader, so that it can run entirely on GCP services. We have also created Transformer Pubsub, the GCP counterpart of the transformer. With these additions, it is possible to load Snowplow data from a GCP pipeline into Snowflake.

At the moment, Transformer Pubsub can't output in Parquet format. Adding Parquet support is on our roadmap too, and it will make it possible to run Databricks Loader on GCP.

How to start loading into Snowflake on GCP

First, you need to deploy Transformer Pubsub. A minimal configuration file for Transformer Pubsub looks like the following:

{
  # Name of the Pubsub subscription with the enriched events
  "input": {
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  # Path to transformed archive
  "output": {
    "path": "gs://bucket/transformed/"
  }
  # Name of the Pubsub topic used to communicate with Loader
  "queue": {
    "topic": "projects/project-id/topics/topic-id"
  }
}

You can find the configuration reference to prepare the configuration file and instructions to deploy the application in the docs.
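
A mistyped resource name in this file is easy to miss until the application fails at startup. As a rough sketch (not part of RDB Loader), the hypothetical helpers `pubsub_name` and `transformer_config` below build the three sections shown above and check that the subscription and topic use the fully qualified `projects/<project>/<kind>/<id>` form. The `gs://` check is an assumption based on the example path. The output is plain JSON, which is also valid HOCON.

```python
import json
import re

# Fully qualified Pub/Sub names look like projects/<project>/<kind>/<id>.
_RESOURCE = re.compile(r"^projects/[^/]+/(subscriptions|topics)/[^/]+$")


def pubsub_name(kind: str, value: str) -> str:
    """Return value unchanged if it is a fully qualified name of the given kind."""
    match = _RESOURCE.match(value)
    if match is None or match.group(1) != kind:
        raise ValueError(f"expected projects/<project>/{kind}/<id>, got {value!r}")
    return value


def transformer_config(subscription: str, output_path: str, topic: str) -> dict:
    """Build the minimal three-section config (hypothetical helper)."""
    # Assumption: on GCP the transformed archive lives in a GCS bucket.
    if not output_path.startswith("gs://"):
        raise ValueError(f"output path must be a gs:// URI, got {output_path!r}")
    return {
        "input": {"subscription": pubsub_name("subscriptions", subscription)},
        "output": {"path": output_path},
        "queue": {"topic": pubsub_name("topics", topic)},
    }


if __name__ == "__main__":
    config = transformer_config(
        "projects/project-id/subscriptions/subscription-id",
        "gs://bucket/transformed/",
        "projects/project-id/topics/topic-id",
    )
    print(json.dumps(config, indent=2))
```

Swapping the subscription and topic arguments by mistake raises a `ValueError` instead of producing a config that only fails once deployed.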

Then, for the Snowflake Loader part you'll need to:

  • set up the necessary Snowflake resources
  • prepare configuration files for the loader
  • deploy the Snowflake Loader app

The important part of the Snowflake Loader config is that Pubsub must be used as the message queue:

  ...
  "messageQueue": {
    "type": "pubsub"
    "subscription": "projects/project-id/subscriptions/subscription-id"
  }
  ...

Full documentation for Snowflake Loader can be found here.
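
Since the transformer's `queue.topic` is described as the channel used to communicate with the loader, the loader's subscription presumably needs to be attached to that same topic. The sketch below is a hypothetical pre-deployment check, not something shipped with RDB Loader. It assumes both configs are already parsed into dictionaries, that `messageQueue` sits at the top level of the loader config, and that you have looked up which topic the subscription is attached to (for example in the GCP console).

```python
from typing import List


def check_queue_wiring(transformer_cfg: dict, loader_cfg: dict, attached_topic: str) -> List[str]:
    """Return human-readable problems; an empty list means the configs line up.

    attached_topic is the topic the loader's subscription is attached to,
    looked up separately by the caller.
    """
    problems = []
    # Assumption: messageQueue sits at the top level of the loader config.
    queue = loader_cfg.get("messageQueue", {})
    if queue.get("type") != "pubsub":
        problems.append(f"messageQueue.type is {queue.get('type')!r}, expected 'pubsub'")
    if not queue.get("subscription"):
        problems.append("messageQueue.subscription is missing")
    # Assumption: the loader consumes what the transformer publishes to queue.topic.
    expected_topic = transformer_cfg.get("queue", {}).get("topic")
    if attached_topic != expected_topic:
        problems.append(
            f"subscription is attached to {attached_topic!r}, "
            f"but the transformer publishes to {expected_topic!r}"
        )
    return problems


if __name__ == "__main__":
    transformer = {"queue": {"topic": "projects/project-id/topics/topic-id"}}
    loader = {
        "messageQueue": {
            "type": "pubsub",
            "subscription": "projects/project-id/subscriptions/subscription-id",
        }
    }
    for problem in check_queue_wiring(transformer, loader, "projects/project-id/topics/topic-id"):
        print(problem)
```

Returning a list of problems rather than raising on the first one lets you report every mismatch in a single pass.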

Bug fixes on Databricks Loader and Transformer Kinesis

  • A bug was reported in Databricks Loader when loading a batch that contains multiple parquet files with different schemas, where an optional column exists in only some of the files. This issue is fixed in version 5.0.0. Thanks drphrozen for reporting the issue and submitting a PR!

  • It was reported that Transformer Kinesis throws an exception when the Kinesis stream shard count is increased. This issue is fixed in version 5.0.0. Thanks sdbeans for reporting the issue!

Adding telemetry to loader apps and Transformer Pubsub

At Snowplow, we are trying to improve our products every day, and understanding what is popular helps us focus our development effort in the right place. Therefore, we are adding telemetry to the loader apps and Transformer Pubsub. The telemetry sends heartbeats with some minimal meta-information about the application.

You can help us by providing userProvidedId in the config file:

"telemetry" {
  "userProvidedId": "myCompany"
}

Telemetry can be deactivated by putting the following section in the configuration file:

"telemetry": {
  "disable": true
}

More information about telemetry in the RDB Loader project can be found here.

Upgrading to 5.0.0

If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 5.0.0 is as simple as pulling the newest Docker images. No changes are needed to your configuration files.

docker pull snowplow/transformer-kinesis:5.0.0
docker pull snowplow/rdb-loader-redshift:5.0.0
docker pull snowplow/rdb-loader-snowflake:5.0.0
docker pull snowplow/rdb-loader-databricks:5.0.0

The Snowplow docs site has a full guide to running the RDB Loader.