Skip to content

Latest commit

 

History

History
98 lines (74 loc) · 5.21 KB

README.md

File metadata and controls

98 lines (74 loc) · 5.21 KB

DBR CLOUD POC

Interpreting guidewire CDA as delta table: As a technology company, Guidewire offers an industry platform for property and casualty insurance carriers worldwide. Through different products and services under their insurance suite, they provide users with the operation capabilities required to acquire, process and settle claims, maintains policies, support underwriting and adjustment processes. Databricks on the other hand provides users with analytical capabilities (from basic reporting to complex ML solutions) through their lakehouse for Insurance. By combining both platforms together, P&C insurance companies have now the ability to start integrating advanced analytics capabilities (AI/ML) into their core business processes, enriching customer information with alternative data (e.g. weather data) but equally reconciling and reporting critical information at enterprise scale.


Approach

Guidewire supports data access to analytical environment via their Cloud Data Access offering (CDA). Storing files as individual parquet files under different timestamps and schema evolution is unfortunately making processing difficult for end users. Instead of processing files individually, why wouldn't we generate the delta log manifest files to only read information we need, when we need it, without having to download, process and reconcile complex information? This is the principle behind this initiative. The generated delta table will not be materialized (data will not be physically moving) but act as a shallow clone to guidewire data.

More specifically, we will process all Guidewire tables independently, in parallel (i.e. as a spark job), where each task will consist in only listing parquet files and folders and generating delta log accordingly. From an end user standpoint, guidewire will look as a Delta table and be processed as such, reducing processing time from days to seconds (since we do not have to download and process each file through many spark jobs).

As the data is now on Delta lake (physically materialized or not), one can benefit from all the downstream capabilities of delta lake, "subscribing" to changes via auto-loader capabilities, delta live table (DLT) or even delta sharing, accelerating time to insights from days to minutes.

A word of caution

As this model follows a shallow clone approach, it is recommended to only grant read permission to end user since a VACCUM operation on the generated delta would possibly result in data loss on the guidewire S3 bucket. We highly recommend organization not exposing this raw dataset to end users but rather create a SILVER version with materialized data for consumption. Note that an OPTIMIZE command will result in materialization of latest delta snapshot with optimized parquet files. Only the relevant files will be physically downloaded from original S3 to destination table.

Usage

import com.databricks.labs.guidewire.Guidewire
val manifestUri = "s3://bucket/key/manifest.json"
val databasePath = "/path/to/delta/database"
Guidewire.index(manifestUri, databasePath)

This command will run on a data increment by default, loading our previous checkpoints stored as a delta table under ${databasePath}/_checkpoints. Should you need to reindex the whole of guidewire data, please provide optional savemode parameter as follows

import org.apache.spark.sql.SaveMode
Guidewire.index(manifestUri, databasePath, saveMode = SaveMode.Overwrite)

Following a 'shallow clone' pattern, Guidewire files will not be stored but referenced from a delta location that can be defined as an external table.

CREATE DATABASE IF NOT EXISTS guidewire;
CREATE EXTERNAL TABLE IF NOT EXISTS guidewire.policy_holders LOCATION '/path/to/delta/database/policy_holders';

Finally, we can query guidewire data and access all its different versions at different timestamps.

SELECT * FROM guidewire.policy_holders
VERSION AS OF 2

Install

mvn clean package -Pshaded

Following maven standard, add profile shaded to generate a standalone jar file with all dependencies included. This jar can be installed on a databricks environment accordingly.

Authors

[email protected]