4.3.0
This release brings some important bug fixes, especially around table migrations. Table migrations are an important feature of RDB Loader: if you update an Iglu schema (e.g. from version 1-0-0 to 1-0-1) then the loader automatically alters the target table to accommodate the newer events. However, we discovered a number of edge cases where migrations did not work as expected.
Redshift loader: Fix to alter max length of text fields
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 3.0.0 and does not affect older versions.
If you update an Iglu schema by raising the maxLength setting for a string field, then RDB Loader should respond by altering the table, e.g. from VARCHAR(10) to VARCHAR(20). Because of this bug, RDB Loader did not attempt to alter the column length; it would instead attempt to load the newer events into the table without running the migrations. You might be affected by this bug if you have recently updated an Iglu schema by raising the max length of a field. If you think you have been affected by this bug, we suggest you check your entity tables and manually alter the table if needed:
ALTER TABLE <SHREDDED_TABLE_NAME> ALTER COLUMN <EXTENDED_COLUMN> TYPE VARCHAR(<NEW_SIZE>);
The bug is fixed in this new 4.3.0 release. Once you upgrade to 4.3.0, RDB Loader will be prepared to correctly migrate your table in response to future field length changes.
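If you have many entity tables to check, the comparison can be scripted. The following is a minimal sketch, not part of RDB Loader: the function name `alter_statements` and both input dictionaries are hypothetical, and it assumes you have already collected the maxLength values from your latest schema and the current column widths from your warehouse metadata.

```python
def alter_statements(table, schema_max_lengths, column_lengths):
    """Build ALTER statements for columns narrower than the schema's maxLength.

    table              -- fully qualified table name (hypothetical input)
    schema_max_lengths -- {column: maxLength from the latest Iglu schema}
    column_lengths     -- {column: current VARCHAR width in the warehouse}
    """
    statements = []
    for column, max_length in sorted(schema_max_lengths.items()):
        current = column_lengths.get(column)
        # Only widen columns that exist and are too narrow;
        # columns missing from the table are left alone.
        if current is not None and current < max_length:
            statements.append(
                f"ALTER TABLE {table} ALTER COLUMN {column} "
                f"TYPE VARCHAR({max_length});"
            )
    return statements


if __name__ == "__main__":
    for stmt in alter_statements(
        "atomic.com_acme_event_1",   # example table name, an assumption
        {"name": 20, "id": 10},      # maxLength values from the schema
        {"name": 10, "id": 10},      # widths currently in the table
    ):
        print(stmt)
```

The sketch only generates statements; review them before running anything against your warehouse.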
Redshift loader: Fix to recover from failed migrations
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.2.0.
If a table migration is immediately followed by a batch which cannot be loaded for any reason, then a table could be left in an inconsistent state where a migration was partially applied. If this ever happened, then RDB Loader could get stuck on successive loads with error messages like:
Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000
With this new 4.3.0 release, the inconsistent state is still reachable (due to Redshift limitations), but the loader is robust enough to recover from it.
Redshift loader: Fix migrations for batches with multiple versions of the same schema
This bug could affect your pipeline if you load into Redshift with RDB Loader. The bug was introduced in version 1.3.0.
It is possible and completely allowed for a batch of events to contain multiple versions of the same schema, e.g. both 1-0-0 and 1-0-1. However, because of this bug, the loader was in danger of trying to perform table migrations twice. This could result in the same error message as in the previous case:
Invalid operation: cannot alter column "CCCCCCC" of relation "XXXXXXX", target column size should be different; - SqlState: 0A000
or, depending on the schema migration, the following one:
Invalid operation: cannot add column "CCCCCCC" of relation "XXXXXXX", already exists; - SqlState: 0A000
This is fixed in the 4.3.0 release, and now the loader will not enter this failure state if a batch contains multiple versions of the same schema.
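To illustrate the idea of migrating each table only once per batch, here is a minimal sketch. It is not the loader's actual implementation: the function names are hypothetical, and it assumes one target table per vendor, name and model, with schemas identified by Iglu URIs of the form iglu:vendor/name/format/MODEL-REVISION-ADDITION.

```python
def parse_iglu_uri(uri):
    """Split an Iglu URI into (vendor, name, (model, revision, addition))."""
    if not uri.startswith("iglu:"):
        raise ValueError(f"not an Iglu URI: {uri}")
    vendor, name, _format, version = uri[len("iglu:"):].split("/")
    model, revision, addition = (int(part) for part in version.split("-"))
    return vendor, name, (model, revision, addition)


def latest_per_table(uris):
    """Keep only the highest schema version seen for each target table.

    Returns {(vendor, name, model): (model, revision, addition)}, so a
    migration can be planned once per table even when a batch contains
    several versions of the same schema.
    """
    latest = {}
    for uri in uris:
        vendor, name, version = parse_iglu_uri(uri)
        key = (vendor, name, version[0])
        # Tuples compare element by element, so 1-0-1 > 1-0-0.
        if key not in latest or version > latest[key]:
            latest[key] = version
    return latest


if __name__ == "__main__":
    batch = [
        "iglu:com.acme/event/jsonschema/1-0-0",
        "iglu:com.acme/event/jsonschema/1-0-1",
    ]
    print(latest_per_table(batch))
```

With both 1-0-0 and 1-0-1 in the batch, only 1-0-1 remains as the migration target for that table.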
Snowflake loader: Configure folder monitoring without a stage, while doing loading with a stage
This is a new feature you can benefit from if you load into Snowflake with RDB Loader. The Snowflake loader allows two alternative methods for authentication between the warehouse and the S3 bucket: either using Snowflake storage integration, or using temporary credentials generated with AWS STS. Previously, you were forced to pick the same method for loading events and for folder monitoring. With this change, it is possible to use the storage integration for loading events, but temporary credentials for folder monitoring. This is beneficial if you want the faster load times from using a storage integration, but do not want to go through the overhead of setting up a storage integration just for folder monitoring.
Take a look at the GitHub issue for more details on how to configure the loader to use the different authentication methods.
Snowflake and Databricks loaders: Fix inserting timestamps with wrong timezone to manifest table
This is a low-impact bug that is not expected to have any detrimental effect on loading. It could affect your pipeline if you load into Snowflake or Databricks, and if your warehouse is set to have a non-UTC timezone by default.
This bug affects the manifest table, which is the table the loader uses to track which batches have been loaded already. Because of this bug, timestamps in the manifest table were stored using the default timezone of the warehouse, not UTC. This bug could only affect you in the unlikely case you use the manifest table for some other purpose.
Starting with the 4.3.0 release, the loaders insert these timestamps in the UTC timezone.
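If you do read the manifest table for other purposes, the general technique is to normalise timestamps to UTC explicitly instead of relying on a session or warehouse default. A minimal sketch of that idea follows; the helper `to_utc_string` is hypothetical and is not taken from the loader's code.

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(ts):
    """Render a timezone-aware timestamp as a UTC string.

    Naive timestamps are rejected, because their meaning would depend
    on whichever default timezone happens to be in effect.
    """
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: timezone is ambiguous")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # A timestamp recorded in a UTC+2 zone is converted before use.
    local = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=2)))
    print(to_utc_string(local))
```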
Upgrading to 4.3.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.3.0 is as simple as pulling the newest Docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.3.0
docker pull snowplow/rdb-loader-redshift:4.3.0
docker pull snowplow/rdb-loader-snowflake:4.3.0
docker pull snowplow/rdb-loader-databricks:4.3.0
The Snowplow docs site has a full guide to running the RDB Loader.