4.1.0
Concurrent streaming transformers for horizontal scaling
Before version 4.1.0, only a single instance of the streaming transformer could run at any one time. Running multiple instances concurrently triggered a race condition, which is described in detail in a previous Discourse thread. The old setup worked well for low-volume pipelines, but it meant the streaming solution could not easily scale to higher volumes.
In version 4.1.0 we worked around the problem by changing the directory names in S3 to contain a UUID unique to each running transformer. Before version 4.1.0, an output directory might be called run=2022-05-01-00-00-14, but in 4.1.0 it might be called run=2022-05-01-00-00-14-b4cac3e5-9948-40e3-bd68-38abcf01cdf9. Directory names for the batch transformer are not affected.
With this simple change, you can now safely scale out your streaming transformer to have multiple instances running in parallel.
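The naming scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the transformer's actual implementation: the helpers run_dir_name and parse_run_dir and the RUN_DIR pattern are hypothetical names, and the sketch assumes each instance generates its UUID once and appends it to the timestamp-based name.

```python
import re
import uuid
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical pattern: a timestamp, then an optional UUID suffix.
# Names without the suffix are the pre-4.1.0 (and batch transformer) form.
RUN_DIR = re.compile(
    r"^run=(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})"
    r"(?:-([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}))?$"
)

TS_FORMAT = "%Y-%m-%d-%H-%M-%S"


def run_dir_name(window_start: datetime, instance_id: Optional[uuid.UUID] = None) -> str:
    """Build a run directory name, appending the instance UUID when one is given."""
    base = "run=" + window_start.strftime(TS_FORMAT)
    return base if instance_id is None else f"{base}-{instance_id}"


def parse_run_dir(name: str) -> Optional[Tuple[datetime, Optional[uuid.UUID]]]:
    """Parse a run directory name, with or without a UUID suffix.

    Returns None when the name does not look like a run directory.
    """
    match = RUN_DIR.match(name)
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), TS_FORMAT)
    suffix = uuid.UUID(match.group(2)) if match.group(2) else None
    return timestamp, suffix


# Two instances writing the same time window get distinct directories.
window = datetime(2022, 5, 1, 0, 0, 14)
instance_a, instance_b = uuid.uuid4(), uuid.uuid4()
assert run_dir_name(window, instance_a) != run_dir_name(window, instance_b)
```

Because the suffix is optional when parsing, anything that lists run directories (folder monitoring, for example) can handle both the old and the new names.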
Databricks loader supports generated columns
If you load into Databricks, a great way to set up your table is to partition based on the date of the event using a generated column:
CREATE TABLE IF NOT EXISTS snowplow.events (
  app_id                VARCHAR(255),
  collector_tstamp      TIMESTAMP NOT NULL,
  event_name            VARCHAR(1000),
  -- Lots of other fields go here

  -- Collector timestamp date for partitioning
  collector_tstamp_date DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);
This partitioning strategy is very efficient for analytic queries that filter by collector_tstamp. The Snowplow/Databricks dbt web model works particularly well with this partitioning scheme.
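The reasoning behind that efficiency is that a range filter on collector_tstamp implies a range on the derived date, so only the matching date partitions need to be read. The Python sketch below illustrates that reasoning only. partitions_for_range is a hypothetical helper, and in practice the pruning is done by the query engine, not by user code.

```python
from datetime import date, datetime, timedelta
from typing import List


def partitions_for_range(start: datetime, end: datetime) -> List[date]:
    """List the collector_tstamp_date values that can hold rows
    with start <= collector_tstamp <= end."""
    if end < start:
        return []
    days = (end.date() - start.date()).days
    return [start.date() + timedelta(days=i) for i in range(days + 1)]


# A two-hour window that crosses midnight touches only two date partitions.
touched = partitions_for_range(
    datetime(2022, 5, 1, 23, 0, 0),
    datetime(2022, 5, 2, 1, 0, 0),
)
assert touched == [date(2022, 5, 1), date(2022, 5, 2)]
```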
In RDB Loader version 4.1.0 we made a small change to the Databricks loader to account for these generated columns.
Upgrading to 4.1.0
If you are already using a recent version of RDB Loader (3.0.0 or higher), then upgrading to 4.1.0 is as simple as pulling the newest Docker images. No changes are needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.1.0
docker pull snowplow/rdb-loader-redshift:4.1.0
docker pull snowplow/rdb-loader-snowflake:4.1.0
docker pull snowplow/rdb-loader-databricks:4.1.0
The Snowplow docs site has a full guide to running the RDB Loader.
Changelog
- Databricks loader: Support for generated columns (#951)
- Loader: Use explicit schema name everywhere (#952)
- Loader: Jars cannot load jsch (#942)
- Snowflake loader: region and account configuration fields should be optional (#947)
- Loader: Include the SQLState when logging a SQLException (#941)
- Loader: Handle run directories with UUID suffix in folder monitoring (#949)
- Add UUID to streaming transformer directory structure (#945)