A basic data movement can be accomplished in two ways in Azure Data Factory. This module will demonstrate both:
- Pipeline copy
- Mapping data flows copy

| # | Section |
| --- | --- |
| 1 | Stage data in the data lake |
| 2 | Pipeline copy |
| 3 | Mapping data flows copy |
Data was staged as part of the setup for this lab. See staging instructions for more information.
-
Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis menu, and click the New pipeline menu item.
-
In the Properties pane of the new pipeline, add a Name and Description.

| Attribute | Example Value |
| --- | --- |
| Name | pl_simple_copy |
| Description | Use the copy activity to copy data from one location to another. |
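
Everything done here through the Studio UI can also be scripted. Below is a minimal sketch of the same step using the azure-mgmt-datafactory Python SDK; this is not part of the lab, the subscription, resource group, and factory names are placeholders, and the exact keyword arguments can differ slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource

# Placeholders -- substitute your own subscription, resource group, and factory.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
RG, FACTORY = "<resource-group>", "<data-factory-name>"

# Create an empty pipeline with the same name and description as in the table above.
# The Copy data activity is added in the later steps.
pipeline = PipelineResource(
    description="Use the copy activity to copy data from one location to another.",
    activities=[],
)
adf.pipelines.create_or_update(RG, FACTORY, "pl_simple_copy", pipeline)
```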
-
From the Activities pane, under Move & transform, click and drag the Copy data activity into the pipeline area. Then, on the Source tab of the activity properties, click the + New button.
-
In the New dataset pane, find Azure Data Lake Storage Gen 2 and click the Continue button.
-
Click the Binary option and click the Continue button.
-
Enter the properties and click the OK button. The New dataset pane will close.

| Attribute | Value |
| --- | --- |
| Name | ds_irazure_adls_binary |
| Linked Service | ls_adls_irazure |
-
Click the Open button from the Source tab of the Copy data activity properties.
-
On the Parameters tab of the new dataset, click the + New button and add three parameters.

| Name | Type | Default value |
| --- | --- | --- |
| container | String | leave blank |
| directory | String | leave blank |
| filename | String | leave blank |
-
On the Connection tab of the new dataset, hover over each of the three File path fields and click the Add dynamic content link that appears under it. Assign the parameters to their matching fields: `container` to the first field, then `directory`, then `filename`. This dataset takes parameters down to the file name (as opposed to stopping at the directory level).
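
For reference, here is a hedged sketch of the same parameterized dataset using the azure-mgmt-datafactory SDK (reusing the `adf` client, `RG`, and `FACTORY` placeholders from the earlier sketch); the expression syntax mirrors what Add dynamic content generates in the UI.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobFSLocation, BinaryDataset, DatasetResource,
    LinkedServiceReference, ParameterSpecification,
)

# adf, RG, and FACTORY are the client and placeholders from the earlier sketch.
binary_ds = BinaryDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="ls_adls_irazure"
    ),
    parameters={
        "container": ParameterSpecification(type="String"),
        "directory": ParameterSpecification(type="String"),
        "filename": ParameterSpecification(type="String"),
    },
    # Each part of the File path is bound to a dataset parameter via an expression,
    # just like the Add dynamic content links in the UI.
    location=AzureBlobFSLocation(
        file_system={"value": "@dataset().container", "type": "Expression"},
        folder_path={"value": "@dataset().directory", "type": "Expression"},
        file_name={"value": "@dataset().filename", "type": "Expression"},
    ),
)
adf.datasets.create_or_update(
    RG, FACTORY, "ds_irazure_adls_binary", DatasetResource(properties=binary_ds)
)
```

The directory-level clone created in the next steps is the same definition minus the filename parameter and the file_name expression.
-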
Click the `ds_irazure_adls_binary` dataset ellipsis menu and select Clone.
-
On the Properties pane of the cloned dataset, replace the text "copy1" in the Name with `directory`. The name of the dataset should now be `ds_irazure_adls_binary_directory`.
-
Hover over the file name part of the File path and click the garbage bin icon. On the Parameters tab, check the filename parameter and click the Delete button.
-
Click the `pl_simple_copy` pipeline from the list of Pipelines. On the Source tab of the Copy data activity, enter the following values.

| Attribute | Value |
| --- | --- |
| Source dataset | ds_irazure_adls_binary |
| Dataset properties / container | inbound |
| Dataset properties / directory | x (placeholder only; the value is overwritten because Wildcard file path is selected as the File path type) |
| Dataset properties / filename | x (placeholder only; the value is overwritten because Wildcard file path is selected as the File path type) |
| File path type | Wildcard file path |
| Wildcard paths / directory | nyx_taxi_sample |
| Wildcard paths / filename | `*.parquet` |
-
On the Sink tab of the Copy data activity, enter the following values.

| Attribute | Value |
| --- | --- |
| Sink dataset | ds_irazure_adls_binary_directory |
| Dataset properties / container | publish |
| Dataset properties / directory | nyx_taxi_sample_pipeline |
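
The fully configured Copy data activity can also be sketched with the SDK (same `adf` client and placeholders as before). Note that the wildcard settings live on the source's store settings rather than on the dataset, which is why the dataset's directory and filename values are only placeholders.

```python
from azure.mgmt.datafactory.models import (
    AzureBlobFSReadSettings, AzureBlobFSWriteSettings, BinarySink, BinarySource,
    CopyActivity, DatasetReference, PipelineResource,
)

# adf, RG, and FACTORY are the client and placeholders from the earlier sketch.
copy_activity = CopyActivity(
    name="Copy data1",
    inputs=[DatasetReference(
        type="DatasetReference",
        reference_name="ds_irazure_adls_binary",
        parameters={"container": "inbound", "directory": "x", "filename": "x"},
    )],
    outputs=[DatasetReference(
        type="DatasetReference",
        reference_name="ds_irazure_adls_binary_directory",
        parameters={"container": "publish", "directory": "nyx_taxi_sample_pipeline"},
    )],
    # Wildcard file path: the directory/filename placeholders above are ignored.
    source=BinarySource(store_settings=AzureBlobFSReadSettings(
        recursive=True,
        wildcard_folder_path="nyx_taxi_sample",
        wildcard_file_name="*.parquet",
    )),
    sink=BinarySink(store_settings=AzureBlobFSWriteSettings()),
)
adf.pipelines.create_or_update(
    RG, FACTORY, "pl_simple_copy",
    PipelineResource(
        description="Use the copy activity to copy data from one location to another.",
        activities=[copy_activity],
    ),
)
```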
-
Click the Debug button and ensure your copy succeeds!
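
Debug runs are a Studio feature; after you publish, you can get an equivalent check by triggering and polling a run through the SDK (same client and placeholders as before). This is only a sketch, not a substitute for the Debug button in this lab.

```python
import time

# Trigger the published pipeline and poll until it finishes.
run = adf.pipelines.create_run(RG, FACTORY, "pl_simple_copy", parameters={})
while True:
    status = adf.pipeline_runs.get(RG, FACTORY, run.run_id).status
    print("pl_simple_copy:", status)
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(15)
```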
-
Click the Publish all button, then click the Publish button.
-
Close all open tabs.
-
Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis menu, and click the New pipeline menu item. In the Properties pane of the new pipeline, add a Name and Description.

| Attribute | Value |
| --- | --- |
| Name | pl_simple_copy_df |
| Description | Use the copy activity to copy data from one location to another using mapping data flows. |
-
From the Activities pane, under Move & transform, click and drag the Data flow activity into the pipeline area. Then, on the Settings tab, click the + New button.
-
On the Properties pane, enter the following values.

| Attribute | Value |
| --- | --- |
| Name | df_simple_copy |
| Description | Data flow that performs a simple copy. |
-
Click the Data flow debug toggle from within the data flows working area. Select the Integration runtime `ir-vnetwork-medium-60min`. Select `4 hours` from the Debug time to live option so that your debug Apache Spark cluster will be available during the next 4 hours of the lab. Click OK.
-
Click the Add Source option from the down caret menu.
-
Complete the Source settings.

| Attribute | Value |
| --- | --- |
| Name | source |
| Source type | Inline |
| Inline dataset type | Parquet |
| Linked service | ls_adls_irvnetmedium |
| Sampling | Enable |
-
On the Source options tab, click the Wildcard File mode option and click the Browse button.
-
An error may appear indicating that interactive authoring is disabled. Interactive authoring must be enabled on the managed virtual network integration runtime before the debug session can browse the data lake. Click the Edit interactive authoring link.
-
On the Edit integration runtime pane, on the Virtual network tab, click Enable under Interactive authoring. Then, click Apply.
-
When the runtime has been edited to allow interactive authoring, the Browse pane will indicate Interactive authoring enabled. Click Retry.
-
Use the Browse pane to navigate to `inbound` and click OK. The File system and Wildcard paths container values will show `inbound`.
-
Enter `nyx_taxi_sample/*.parquet` in the Wildcard paths field.
-
On the Projection tab, click Import schema.
-
On the Data preview tab, click Refresh to see a preview of the data. Data flow data previews work together with sampling of the source data to keep development responsive and to let the Azure Data Factory user interface show the data at each stage of the transformation.
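
Outside of the Studio preview, you can spot-check the same sample locally. Below is a small sketch using pyarrow with the adlfs filesystem; neither library is part of the lab, the storage account name and key are placeholders, and it assumes your client can reach the storage account.

```python
import pyarrow.dataset as ds
from adlfs import AzureBlobFileSystem

# Placeholders -- use your lab storage account and a key or credential you hold.
fs = AzureBlobFileSystem(account_name="<storage-account>", account_key="<key>")

# Point at the same source folder the data flow reads and pull a few rows.
taxi = ds.dataset("inbound/nyx_taxi_sample", filesystem=fs, format="parquet")
print(taxi.head(10).to_pandas())
```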
-
Click the + button on the bottom right corner of the `source` activity and then select the Sink option.
-
Enter the following values in the Sink tab.

| Attribute | Value |
| --- | --- |
| Name | sink |
| Sink type | Inline |
| Inline dataset type | Parquet |
| Linked service | ls_adls_irvnetmedium |
-
On the Settings tab, click the Browse button, browse to `publish/nyx_taxi_sample_dataflow`, and click OK.
-
Set the Umask settings as shown below.

| Attribute | Value |
| --- | --- |
| Owner | R + W + X |
| Group | R + W + X |
| Others | X |
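
As an aside, the Umask grid above corresponds to POSIX-style permission bits applied to the files the sink writes; the combination shown works out to octal 771, as the quick sketch below illustrates.

```python
# Owner R+W+X, Group R+W+X, Others X  ->  0o771
owner, group, others = 0o700, 0o070, 0o001
print(oct(owner | group | others))  # 0o771
```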
-
On the Data preview tab, click Refresh to see a preview of the data.
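
To confirm the sink actually wrote files, you can list the target folder with the azure-storage-file-datalake SDK. This is a hedged sketch, not a lab step; the storage account name is a placeholder and it assumes your identity has read access to the container.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
publish = service.get_file_system_client("publish")

# List the parquet files the data flow sink produced.
for item in publish.get_paths(path="nyx_taxi_sample_dataflow"):
    print(item.name, item.content_length)
```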
-
Click the `pl_simple_copy_df` pipeline from the list of Pipelines. Click the Debug button and ensure your copy succeeds!
-
Click the Publish all button, then click the Publish button.
-
Close all open tabs.
You have now completed this module. You performed a simple copy using both the Pipelines and the Data flows features. The pipeline copy used the Azure integration runtime, which does not use Spark, while the data flow pipeline you created used an integration runtime with a managed virtual network and a medium-sized Apache Spark cluster.