Challenge 02 - Standardizing on Silver

< Previous Challenge - Home - Next Challenge >

Introduction

In this challenge we will begin to transform the data we loaded "as is" into the Bronze layer into more standardized, integrated datasets, which are then loaded into the Silver layer. The goal of this challenge is to merge the tables and bring over only a subset of the fields. This does not need to be production-level quality, and you do not need to determine every field that could be a match; bring over just enough fields to build an intelligent report, where the definition of "intelligent" is at your discretion. Duplicate names may occur, but duplicate IDs must not: each Delta table needs a unique key.

Description

Some of the things we aim to achieve within the Silver layer are:

  • Data Lake Storage Formats
    In the Bronze layer, data is usually stored in its native source format, such as CSV or TXT for file-based sources. In the Silver layer, we convert all the different data sources into a file format better suited to the Data Lake, such as Parquet or Delta.
  • Domain Separation
    An important point to keep in mind is that in the Bronze layer the data is often organized by source system or source object name, whereas in the Silver layer it is organized by domain, such as Customer or Product.
  • Consolidation
    We also consolidate the disparate data structures on the source side into a more conformed structure. For example, we could have 10 different sources bringing in various kinds of Customer data, but in the Silver layer we keep one consistent structure for our Customer data and mold each source accordingly (see the second sketch after this list).
  • Data Type Standardization
    As part of that consolidation, we standardize the data types in our Silver layer structures and cast the source data accordingly.
  • Formatting
    Continuing the standardization theme, we make sure our Silver layer data does not mix formatting styles: enforcing NULLs where needed, trimming blank spaces, aligning decimal precision across sources, and so on (see the first sketch after this list).
  • Partitioning and Sharding
    Although optional for this challenge, this is highly recommended for production-focused use cases. Physically arranging the data appropriately ensures efficient access during ad-hoc analysis and when pulling data into the Gold layer.
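
To make the above concrete, here is a minimal PySpark sketch of a Bronze-to-Silver load covering format conversion, data type standardization, formatting cleanup, and partitioning. The paths and column names (`customer_id`, `customer_name`, `credit_limit`, `email`, `country`) are hypothetical placeholders, not part of the challenge datasets:

```python
from pyspark.sql import functions as F

# Hypothetical Bronze and Silver paths -- substitute your own storage account.
bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/sales/customers.csv"
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/customer"

# `spark` is the ambient SparkSession in Databricks/Synapse notebooks.
# Read the raw CSV exactly as it was landed in the Bronze layer.
df = spark.read.option("header", True).csv(bronze_path)

# Standardize data types and clean up formatting.
df = (
    df.withColumn("customer_id", F.col("customer_id").cast("long"))
      .withColumn("customer_name", F.trim(F.col("customer_name")))
      .withColumn("credit_limit", F.col("credit_limit").cast("decimal(18,2)"))
      # Enforce NULLs where the source used empty strings.
      .withColumn("email", F.when(F.col("email") == "", None).otherwise(F.col("email")))
)

# Write to the Silver layer as Delta, partitioned for efficient downstream access.
df.write.format("delta").mode("overwrite").partitionBy("country").save(silver_path)
```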
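
And a second sketch for the consolidation step: two hypothetical source tables (CRM and ERP, with invented column names) are conformed to one Customer structure and given a key that is unique across both sources:

```python
from pyspark.sql import functions as F

# Conform each hypothetical source to the same column names and types.
crm = (
    spark.read.format("delta").load("<bronze-crm-path>")
         .select(
             F.col("CustID").cast("long").alias("source_id"),
             F.lit("CRM").alias("source_system"),
             F.trim(F.col("FullName")).alias("customer_name"),
         )
)
erp = (
    spark.read.format("delta").load("<bronze-erp-path>")
         .select(
             F.col("customer_no").cast("long").alias("source_id"),
             F.lit("ERP").alias("source_system"),
             F.trim(F.col("name")).alias("customer_name"),
         )
)

# Union into one conformed Customer structure and drop exact duplicates.
customers = crm.unionByName(erp).dropDuplicates(["source_system", "source_id"])

# Duplicate names are acceptable; duplicate keys are not, so derive a key
# that is unique across both sources.
customers = customers.withColumn(
    "customer_key", F.xxhash64("source_system", "source_id")
)

customers.write.format("delta").mode("overwrite").save("<silver-customer-path>")
```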

The above steps can be achieved using Azure Databricks or Azure Synapse. A more visual guide to the process is shown below.

[Diagram: the Bronze-to-Silver standardization process]

Business Case

Now that we know what we need to do, it’s important to understand why we are doing this.
From an organizational standpoint, the Silver layer serves three main purposes:

  • Enrichment
    As discussed above, the Silver layer is often where most of the effort is spent wrangling the data into a meaningful state. Multiple sources are brought together, the data is refined, and it is stored in the most efficient way possible so that the following layers can access it effectively.
  • Discovery
    Since the data in the Silver layer is cleaner and more organized, this is generally where most data discovery happens. Data engineers, data scientists, data analysts, and sometimes even business users dig into the data directly to gain more context on what is available in the higher layers. The Silver layer strikes a balance between the long-term data retention of Bronze and the cleaner, more organized data of Gold.
  • Reusability
    The Silver layer is also where reusable data sets are created and stored, so that different teams building different higher layers (data marts or warehouses) can reuse the same data sets without having to recreate them.

Success Criteria

To complete this challenge successfully, you should be able to:

  • Validate that all the data has been transformed and loaded into the Silver layer in the appropriate storage hierarchies, based on the discussion above.
  • Demonstrate that a primary key has been established for each table, with unique values in that table.
  • Showcase the data copied into the Silver layer and articulate the transformations performed and the reasons for your choices.
  • Showcase how this data can also be accessed for ad-hoc analysis and reporting, via both Synapse and Databricks.
  • Showcase how to utilize Delta Lake versioning and time travel (a minimal sketch follows this list).
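
For the last criterion, a minimal sketch of Delta Lake versioning and time travel, assuming a hypothetical Silver table path and an illustrative timestamp:

```python
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/customer"

# Inspect the table's commit history (one row per write/merge/optimize).
spark.sql(f"DESCRIBE HISTORY delta.`{silver_path}`").show(truncate=False)

# Read the table as of an earlier version number...
v0 = spark.read.format("delta").option("versionAsOf", 0).load(silver_path)

# ...or as of a point in time.
snapshot = (
    spark.read.format("delta")
         .option("timestampAsOf", "2024-01-01")
         .load(silver_path)
)
```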

Learning Resources

The following links may be useful in achieving the success criteria listed above.

Bonus Challenges

  • Brainstorm how Azure Databricks and Azure Synapse can work together to deliver this use case more efficiently.
  • As part of your pipeline, find a way to archive the Raw Zone files that have already been loaded into the Silver Zone by moving them to a separate area (a minimal sketch follows).
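
For the second bonus item, one possible starting point is a Databricks-specific sketch using dbutils to move already-loaded files into an archive area; the paths are hypothetical placeholders:

```python
# dbutils is available in Databricks notebooks; substitute your own paths.
raw_dir = "abfss://bronze@<storage-account>.dfs.core.windows.net/sales/"
archive_dir = "abfss://archive@<storage-account>.dfs.core.windows.net/sales/"

for f in dbutils.fs.ls(raw_dir):
    if f.name.endswith(".csv"):
        # mv performs a copy followed by a delete of the source file.
        dbutils.fs.mv(f.path, archive_dir + f.name)
```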