Skip to content

7. Tutorial: Generate a datavault using BETL

Bas van den Berg edited this page Dec 19, 2018 · 35 revisions

Complete code can be found at the bottom of this page.

The Nasa Facilities example

I'm a big fan of learning by example, so I will show you the working of BETL by loading a 'random' sample dataset coming from NASA: NASA Facilities sample data.

I've saved the dataset as excel on my file system NASA_Facilities.xlsx. We are going to load this data into SQL server staging first and then we can decide how to transform this data into our datawarehouse.

Create staging database

We apply the multiple schema pattern (MS) (use different schemas for different source systems).

  1. Create a database called My_Staging
  2. Create a schema called NF (Nasa Facilities) in My_Staging
  3. Create a database called Nasa (This will contain the source data. In the real world of course we wouldn't create a database for this, but in this tutorial we prefer to be able to drop My_Staging and we don't want to reload the CSV).
use My_Staging 
-- make sure that you have a target staging database 
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'My_Staging')
  create database My_Staging
GO
use My_Staging
GO
IF NOT EXISTS ( SELECT schema_id FROM My_Staging.sys.schemas WHERE name = 'NF')
  exec('create schema [NF] AUTHORIZATION [dbo]') 
GO
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'Nasa')
  create database Nasa

Load dataset into My_Staging

Since BETL is currently not adapted to loading CSV files, you will have to do this yourself. (using import / export wizard). You can use the data from NASA or here:

Choose " as text qualifier. And choose Advanced->suggest types in order to determine data types. After this step I assume that you have a table called [Nasa].[dbo].[NASA_Facilities]

Transform the data

Some of the things we want to do are:

  1. Convert datatypes.
  2. Normalise the data in 3NF
  3. Define natural keys
  4. Historization ( detect changes)
  5. Synonyms (duplicates)
  6. Apply data warehouse terminology (rename columns).

First we will create a view for every hub and satellite. Using this view we can implement normalisation (2), data conversion (1) and dwh naming (6).

In our example we have three entities: research centers, research facilities and contacts. In our staging database we create 2 views: NF.Facility and NF.Center:

-- this is the source
select * FROM [Nasa].[dbo].[NASA_Facilities]
-- create source data view in staging database
use My_Staging 
GO
if object_id('NF.stgh_Facility') is null 
exec ( '
create view NF.stgh_Facility as 
SELECT 
    convert(varchar(255), [Facility]) facility_key
    ,convert(varchar(255), [Center]) center_key 
    ,[Occupied] occupy_date
    ,convert(varchar(50), [Status])  facility_status
    ,convert(varchar(255), [URL_Link]) url
    ,[Record_Date] record_date
    ,[Last_Update] last_update_date
    ,convert(varchar(100), [Country]) country_key
    ,convert(varchar(255),[Contact]) contact_key
    ,convert(varchar(20),[Phone]) facility_phone
    ,convert(varchar(100),[Location]) location
    ,convert(varchar(100),[City]) city
    ,convert(varchar(25),[State]) state_key
    ,convert(varchar(25),[Zipcode]) zipcode
    , ''NF'' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
')
  • stgh is the default prefix for a datavault hub staging view or table. (you can use your own prefixes of course, or not use prefixes at all). ** Note that we have create a cross database dependency here, which is not a good practice in general but acceptable since we are importing data into our staging area. A slightly better approach would be to use a linked server so that the database name and location is configurable. And this way you can also connect to external database servers.
-- For the entity Center we will create a similar view.
use My_Staging 
GO
if object_id('NF.stgh_Center') is null 
exec ( '
create view NF.stgh_Center as 
SELECT distinct
    convert(varchar(255), [Center]) center_key 
     , ''NF'' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
') 
  • Note that we use distinct here because center_key is a natural primary key and it would not be unique without distinct.
  1. Detect natural keys

In the BETL object tree, each column has a column type which tells us whether it's a natural primary, foreign key, attribute or etl meta data. When the column type is unknown, it is guessed by using the column name and ordinal position. E.g. natural keys are suffixed with _key. A natural primary key is an important identifier of an entity and may be used for change detection. Note that columns that end with _date can be automatically converted into the date datetype (when date_datatype_based_on_suffix property is enabled and this is default enabled for localhost). (for datetime use _dt suffix ).

--Lets see what betl will guess for column types
exec betl.dbo.refresh 'NF.stgh_Facility'
exec betl.dbo.info 'NF.stgh_Facility'```

refresh_stgh_facility

What we see here is

  • Object tree details for the NF.stgh_Facility view
  • The column details of NF.stgh_Facility
  • The properties related to NF.stgh_Facility As you can see all columns that are suffixed with _key are seen as natural foreign key. Not all meta data can be guessed. Next we will set the natural primary key.
-- set facility_key to natural primary key (id 100, look in static.Column_type). 
exec betl.dbo.set_column_type 'NF.stgh_Facility', 'facility_key', 100 
  • Note that we omitted the database name, because NF.stgh_Facility is unique for identifying the object.
-- first make sure that the columns of stgh_Center are in the meta data. 
exec betl.dbo.refresh 'NF.stgh_Center'
exec betl.dbo.info 'NF.stgh_Center'
-- then set the nat_pkey
exec betl.dbo.set_column_type 'NF.stgh_Center', 'center_key', 100

Another thing that needs to be set is the target schema ( where should this data go to? ). We do this for the entire My_Staging database.

-- Make sure that the target database exists (RDV stands for raw datavault):
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'My_RDV')
    CREATE DATABASE [My_RDV]
GO
exec betl.dbo.set_target_schema '[My_Staging]', '[My_RDV].dbo'

-- we want betl to drop /create target tables
-- This normally set on a dev environment and not test, acceptance or production. 
exec betl.dbo.setp 'recreate_tables', 1, '[My_RDV].dbo' 

Hubs

The prefix stgh sets the template to 8 ( Datavault hub & sat (CDC and delete detection) ). This is defined in dbo.Prefix.

-- lets create the hubs
exec betl.dbo.push '[My_Staging].NF.stgh_Center'
exec betl.dbo.push '[My_Staging].NF.stgh_Facility'

push__stgh_Center

Links

First create a view in staging

-- create a view for the link between facility and center
if object_id('NF.stgl_Facility_center') is null 
exec ( '
CREATE view [NF].[stgl_Facility_center] as 
SELECT 
    convert(varchar(255), [Facility]) facility_key
    ,convert(varchar(255), [Center]) center_key 
	    , 'NF' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
') 
  • Note that I normally don't make links in a raw datavault, but only in an integrated datavault (the next step), but I prefer to limit the scope of this tutorial.
-- push the link 
-- the first two lines are optional
exec betl.dbo.refresh '[My_Staging].[NF].[stgl_Facility_center]'
exec betl.dbo.info '[My_Staging].[NF].[stgl_Facility_center]'
exec betl.dbo.push '[My_Staging].[NF].[stgl_Facility_center]'

This is what your rdv should look like:

nf_rdv

Complete code

use My_Staging 
-- make sure that you have a target staging database 
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'My_Staging')
  create database My_Staging
GO
use My_Staging
GO
IF NOT EXISTS ( SELECT schema_id FROM My_Staging.sys.schemas WHERE name = 'NF')
  exec('create schema [NF] AUTHORIZATION [dbo]') 
GO
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'Nasa')
  create database Nasa

-- this is the source
select * FROM [Nasa].[dbo].[NASA_Facilities]
-- create source data view in staging database
use My_Staging 
GO
if object_id('NF.stgh_Facility') is null 
exec ( '
create view NF.stgh_Facility as 
SELECT 
    convert(varchar(255), [Facility]) facility_key
    ,convert(varchar(255), [Center]) center_key 
    ,[Occupied] occupy_date
    ,convert(varchar(50), [Status])  facility_status
    ,convert(varchar(255), [URL_Link]) url
    ,[Record_Date] record_date
    ,[Last_Update] last_update_date
    ,convert(varchar(100), [Country]) country_key
    ,convert(varchar(255),[Contact]) contact_key
    ,convert(varchar(20),[Phone]) facility_phone
    ,convert(varchar(100),[Location]) location
    ,convert(varchar(100),[City]) city
    ,convert(varchar(25),[State]) state_key
    ,convert(varchar(25),[Zipcode]) zipcode
    , ''NF'' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
')


-- For the entity Center we will create a similar view.
use My_Staging 
GO
if object_id('NF.stgh_Center') is null 
exec ( '
create view NF.stgh_Center as 
SELECT distinct
    convert(varchar(255), [Center]) center_key 
     , ''NF'' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
') 

--Lets see what betl will guess for column types
exec betl.dbo.refresh 'NF.stgh_Facility'
exec betl.dbo.info 'NF.stgh_Facility'

-- set facility_key to natural primary key (id 100, look in static.Column_type). 
exec betl.dbo.set_column_type 'NF.stgh_Facility', 'facility_key', 100 


-- first make sure that the columns of stgh_Center are in the meta data. 
exec betl.dbo.refresh 'NF.stgh_Center'
exec betl.dbo.info 'NF.stgh_Center'

-- then set the nat_pkey
exec betl.dbo.set_column_type 'NF.stgh_Center', 'center_key', 100



-- Make sure that the target database exists (RDV stands for raw datavault):
IF NOT EXISTS (SELECT * FROM sys.databases WHERE name = 'My_RDV')
    CREATE DATABASE [My_RDV]
GO
exec betl.dbo.set_target_schema '[My_Staging]', '[My_RDV].dbo'

-- we want betl to drop /create target tables
-- This normally set on a DEV environment and not test, acceptance or production. 
exec betl.dbo.setp 'recreate_tables', 1, '[My_RDV].dbo' 

-- lets create the hubs
exec betl.dbo.push '[My_Staging].NF.stgh_Center'
exec betl.dbo.push '[My_Staging].NF.stgh_Facility'

-- create a view for the link between facility and center
if object_id('NF.stgl_Facility_center') is null 
exec ( '
CREATE view [NF].[stgl_Facility_center] as 
SELECT 
    convert(varchar(255), [Facility]) facility_key
    ,convert(varchar(255), [Center]) center_key 
	    , ''NF'' etl_data_source
FROM [Nasa].[dbo].[NASA_Facilities]
') 

-- push the link 
-- the first two lines are optional
exec betl.dbo.refresh '[My_Staging].[NF].[stgl_Facility_center]'
exec betl.dbo.info '[My_Staging].[NF].[stgl_Facility_center]'
exec betl.dbo.push '[My_Staging].[NF].[stgl_Facility_center]'