While the gold layer is highly refined and ready for analytics, it may not be in the exact shape and scope a specific use case needs. For this reason, it is generally accepted that use-case-specific variations of the data may be stored as needed.
This lab demonstrates the concept by creating a new file that adds an aggregation on top of the gold data. In practice, however, this consumption layer could live in a transactional database used by an application, in another analytical storage repository, behind an API, or in any other technology.
It is recommended to align your enterprise on the principle that all consumption layers read from the same published gold data. This increases trust in the data and improves auditability.
In this module, we create a new aggregate file of the employees gold data grouped by region.
| # | Section |
|---|---|
| 1 | Create the data flow |
| 2 | Create the pipeline |
- In the factory resources pane, select the plus icon to open the new resource menu. Select Data flow from the Data flow menu.
- If the Data flow debug slider is off, click it to the on position. In the Turn on data flow debug panel that appears, select the ir-vnetwork-medium-60min Integration Runtime and 4 hours for the Debug time to live. Then, click Ok. Immediately proceed to the next step.
- In the General panel under Properties, add a Name and Description.

  | Attribute | Value |
  |---|---|
  | Name | df_medallion_consumption |
  | Description | Create example consumption file |
- Click the Add Source down caret (v) and select Add Source.
- In the Source settings tab of the newly added source, update the below attributes:

  | Attribute | Value |
  |---|---|
  | Name | gold |
  | Description | Gold layer data (Delta format) |
  | Source type | Inline |
  | Inline dataset type | Delta |
  | Linked service | ls_adls_irvnetmedium |
  | Sampling | Disable |
- In the Source options tab of the source named gold, update the below attributes:

  | Attribute | Value |
  |---|---|
  | Folder path | File system: publish, Folder path: employees_gold_general |
  | Compression type | snappy |
  | Compression level | Fastest |
  | Time travel | Disable |
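For intuition, this inline Delta source is roughly equivalent to a Spark read of the gold folder. The sketch below is illustrative only; it assumes an existing Spark session with Delta support, and the storage account name is a placeholder for your lab environment.

```python
# Illustrative sketch only: reading the same gold Delta data with PySpark.
# "<account>" is a placeholder for your lab storage account name.
gold_path = "abfss://publish@<account>.dfs.core.windows.net/employees_gold_general"

# An inline Delta source corresponds to a Delta-format read in Spark.
gold_df = spark.read.format("delta").load(gold_path)

# Comparable to importing the schema on the Projection tab.
gold_df.printSchema()
```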
- In the Projection tab of the source named gold, click the Import schema button. Note that if the data flow debug cluster is not ready, this button will be greyed out. The debug cluster being ready is indicated by a green check next to the slider.

  The Import Schema panel will appear. Leave all values blank and click Import.
- In the Data preview tab of the source named gold, click the Refresh button.
- Click the plus icon connected to the gold source and select Aggregate from the transformations menu. Update the below attributes:

  | Attribute | Value |
  |---|---|
  | Output stream name | aggregateByRegion |
  | Description | Aggregating data by Region |
  | Group by Columns | Columns: Region, Name as: Region |

  Click the Group by / Aggregates slider to the right (Aggregates). Click the Open expression builder button.

  Use the data flow expression builder to create the following columns.

  | Column name | Expression |
  |---|---|
  | MaxLongetivity | year(currentDate()) - min(YearofJoining) |
  | MinLongetivity | year(currentDate()) - max(YearofJoining) |
  | MedianLongetivity | avg((year(currentDate()) - (YearofJoining))) |
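To make the logic concrete: the earlier an employee joined, the longer their tenure, so the maximum longevity is derived from min(YearofJoining) and the minimum from max(YearofJoining). The following is a rough PySpark sketch of the same aggregation, assuming the gold data is loaded as gold_df with Region and YearofJoining columns (as in the earlier sketch).

```python
from pyspark.sql import functions as F

# Sketch of the aggregateByRegion transformation, under the assumptions above.
longevity = F.year(F.current_date()) - F.col("YearofJoining")

agg_df = gold_df.groupBy("Region").agg(
    # Earliest join year => longest tenure.
    (F.year(F.current_date()) - F.min("YearofJoining")).alias("MaxLongetivity"),
    # Latest join year => shortest tenure.
    (F.year(F.current_date()) - F.max("YearofJoining")).alias("MinLongetivity"),
    # Like the lab expression, this is an average rather than a true median.
    F.avg(longevity).alias("MedianLongetivity"),
)
```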
- In the Data preview tab of the aggregateByRegion transformation, click the Refresh button. Ensure the refresh succeeds.
- Click the plus icon connected to the aggregateByRegion transformation and select Sink from the transformations menu. Update the below attributes:

  | Attribute | Value |
  |---|---|
  | Output stream name | sink |
  | Description | Sink employee aggregates by region CSV. |
  | Sink type | Inline |
  | Inline dataset type | Delimited Text |
  | Linked Service | ls_adls_irvnetmedium |
- On the sink Settings tab, update the below attributes:

  | Attribute | Value |
  |---|---|
  | Folder path | Container: publish, File path: employees_project_a |
  | First row as header | checked |
  | Quote All | checked |
  | Clear the folder | checked |
  | Unmask | Octal: 771 |
- On the sink Optimize tab, select Single partition to ensure you get a single CSV file as output. Without this selection, the data flow would create multiple CSV files, as is typical for Spark clusters with multiple workers.
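For intuition, Single partition corresponds to collapsing the DataFrame to one partition before writing, along the lines of this illustrative PySpark sketch (the output path is a placeholder, and agg_df is the aggregate from the earlier sketch):

```python
# Illustrative sketch only: one partition => one CSV part file.
out_path = "abfss://publish@<account>.dfs.core.windows.net/employees_project_a"

(agg_df.coalesce(1)                   # Single partition
    .write.mode("overwrite")          # roughly "Clear the folder"
    .option("header", True)           # "First row as header"
    .option("quoteAll", True)         # "Quote All"
    .csv(out_path))
```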
- In the Data preview tab of the sink, click the Refresh button. Ensure the refresh succeeds.
- Click the Publish all button, then click the Publish button.

Once the data flow is developed, you can create the pipeline that invokes it.
- In the factory resources pane, select the plus icon to open the new resource menu. Select Pipeline.
- In the General panel under Properties, add a Name and Description.

  | Attribute | Value |
  |---|---|
  | Name | pl_medallion_consumption |
  | Description | Example consumption of gold data |
- From the Activities panel, open the Move & transform accordion and drag the Data flow activity onto the canvas. Complete the below attributes in the General tab:

  | Attribute | Value |
  |---|---|
  | Name | Run consumption data flow |
  | Timeout | 0.00:30:00 |

  The timeout uses the D.HH:MM:SS timespan format, so 0.00:30:00 is 30 minutes.
- On the Settings tab of the activity named Run consumption data flow, update the below attributes:

  | Attribute | Value |
  |---|---|
  | Data flow | df_medallion_consumption |
  | Run on (Azure IR) | ir-vnetwork-medium-60min |
  | Logging level | None |
- In the Azure Data Factory Studio, click the Debug button.
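A debug run executes the pipeline against the debug cluster without publishing it. As an aside, once published, the same pipeline could also be triggered from code; the following is a hedged sketch using the azure-mgmt-datafactory SDK, where the subscription, resource group, and factory names are placeholders for your lab environment.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholders for your lab environment.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Trigger the published pipeline (comparable to "Trigger now" in the Studio).
run = adf.pipelines.create_run(resource_group, factory_name, "pl_medallion_consumption")
print("Run ID:", run.run_id)
```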
- Review the debug outcomes.
- In the Azure Storage Account lab resource named dfmdf< Random string for your lab environment resources >adls, ensure the publish/employees_project_a directory has data files.
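If you prefer to verify from code rather than the Azure portal, here is an illustrative sketch using the Azure Data Lake Storage SDK for Python; the account URL is a placeholder for your lab environment, and it assumes your identity has read access to the container.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder: substitute your lab storage account name.
account_url = "https://dfmdf<random-string>adls.dfs.core.windows.net"

service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())
fs = service.get_file_system_client("publish")

# List the files the data flow sink produced.
for path in fs.get_paths(path="employees_project_a"):
    print(path.name, path.content_length)
```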
You have created an example of what the consumption pattern might look like on top of the gold layer, where that consumption also requires a batch file output. Note that consumption of gold data assets can take many forms.