
Module 03 - Two Ways to do a Basic Copy

< Previous Module - Home - Next Module >

📢 Introduction

A basic data movement can be accomplished in two ways in Azure Data Factory. This module will demonstrate both:

  • Pipeline copy
  • Mapping data flows copy

📑 Table of Contents

# Section
1 Stage data in the data lake
2 Pipeline copy
3 Mapping data flows copy

1. Stage data in the data lake

Data was staged as part of the setup for this lab. See staging instructions for more information.

2. Pipeline Copy

  1. Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis (…) menu, and click the New pipeline menu item.

    new pipeline

  2. In the Properties pane of the new pipeline, add a Name and Description.

    Attribute   | Example Value
    Name        | pl_simple_copy
    Description | Use the copy activity to copy data from one location to another.
  3. From the Activities pane, under Move & transform, click and drag the Copy data activity into the pipeline area. Then, on the Source tab of the activity properties, click the + New button.

    New dataset

  4. In the New dataset pane, find Azure Data Lake Storage Gen 2 and click the Continue button.

  5. Click the Binary option and click the Continue button.

  6. Enter the properties and click the OK button. The New dataset pane will close.

    Attribute      | Value
    Name           | ds_irazure_adls_binary
    Linked Service | ls_adls_irazure
  7. Click the Open button from the Source tab of the Copy data activity properties.

    New dataset adls 2

  8. On the Parameters tab of the new dataset, click the + New button and add three parameters.

    Name      | Type   | Default value
    container | String | leave blank
    directory | String | leave blank
    filename  | String | leave blank

    New dataset adls 3

  9. On the Connection tab of the new dataset, hover over each of the three File path fields and click the Add dynamic content link that appears under it. Add the container parameter to the first field, the directory parameter to the second, and the filename parameter to the third. This dataset is parameterized down to the file name (as opposed to the directory level). A sketch of the resulting dataset definition appears after this list.

    New dataset adls 4 New dataset adls 5

  10. Open the ellipsis (…) menu of the ds_irazure_adls_binary dataset and select Clone.

    Clone dataset

  11. On the Properties pane of the cloned dataset, replace the text "copy1" in the Name with directory. The name of the dataset should now be ds_irazure_adls_binary_directory.

  12. Hover over the file name part of the File path and click the garbage bin icon. On the Parameters tab, check the filename parameter and click the Delete button.

    New dataset adls 6 New dataset adls 7

  13. Click the pl_simple_copy pipeline from the list of Pipelines. On the Source tab of the Copy data activity, enter the following values.

    Attribute                      | Value
    Source dataset                 | ds_irazure_adls_binary
    Dataset properties / container | inbound
    Dataset properties / directory | x (placeholder only; overwritten because Wildcard file path is selected as the File path type)
    Dataset properties / filename  | x (placeholder only; overwritten because Wildcard file path is selected as the File path type)
    File path type                 | Wildcard file path
    Wildcard paths / directory     | nyx_taxi_sample
    Wildcard paths / filename      | *.parquet

    Create new pipeline copy source

  14. On the Sink tab of the Copy data activity, enter the following values. A sketch of the resulting pipeline definition appears after this list.

    Attribute                      | Value
    Sink dataset                   | ds_irazure_adls_binary_directory
    Dataset properties / container | publish
    Dataset properties / directory | nyx_taxi_sample_pipeline

    Create new pipeline copy sink

  15. Click the Debug button and ensure your copy succeeds!

    Pipeline succeeded

  16. Click the Publish all button, then click the Publish button.

    Publish all

  17. Close all open tabs.
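
For reference, below is a minimal sketch of the JSON that the Studio generates behind the scenes for the parameterized Binary dataset, expressed as a Python dictionary so it can be printed and compared against the dataset's code view. Property names such as AzureBlobFSLocation and the Expression wrappers are approximations of what Data Factory emits and may differ slightly between versions; treat this as an illustration of how the three dataset parameters feed the File path, not as a definition to paste verbatim.

```python
import json

# Approximate definition of ds_irazure_adls_binary: a Binary dataset on
# ADLS Gen2 whose container/directory/filename come from dataset parameters.
# Property names are an assumption based on typical Data Factory JSON.
ds_irazure_adls_binary = {
    "name": "ds_irazure_adls_binary",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "ls_adls_irazure",
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "container": {"type": "String"},
            "directory": {"type": "String"},
            "filename": {"type": "String"},
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                # Each part of the File path is bound to a dataset parameter
                # via an Add dynamic content expression.
                "fileSystem": {"value": "@dataset().container", "type": "Expression"},
                "folderPath": {"value": "@dataset().directory", "type": "Expression"},
                "fileName": {"value": "@dataset().filename", "type": "Expression"},
            }
        },
    },
}

print(json.dumps(ds_irazure_adls_binary, indent=2))
```

The cloned ds_irazure_adls_binary_directory dataset looks the same, minus the filename parameter and the fileName entry in the location.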
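Similarly, here is a rough sketch of pl_simple_copy as it might appear in the pipeline's code view. It shows why the directory and filename dataset parameters only need placeholder values: the wildcard settings live on the Copy activity's source store settings. The activity name and store-settings type names (BinarySource, AzureBlobFSReadSettings, and so on) are assumptions, not values confirmed by this lab.

```python
import json

# Approximate definition of pl_simple_copy: one Copy activity that reads
# *.parquet files under inbound/nyx_taxi_sample via a wildcard and writes
# them to publish/nyx_taxi_sample_pipeline.
pl_simple_copy = {
    "name": "pl_simple_copy",
    "properties": {
        "description": "Use the copy activity to copy data from one location to another.",
        "activities": [
            {
                "name": "Copy data1",  # hypothetical default activity name
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BinarySource",
                        "storeSettings": {
                            "type": "AzureBlobFSReadSettings",
                            "recursive": True,
                            # The wildcard file path overrides the directory and
                            # filename placeholders passed to the source dataset.
                            "wildcardFolderPath": "nyx_taxi_sample",
                            "wildcardFileName": "*.parquet",
                        },
                    },
                    "sink": {
                        "type": "BinarySink",
                        "storeSettings": {"type": "AzureBlobFSWriteSettings"},
                    },
                },
                "inputs": [
                    {
                        "referenceName": "ds_irazure_adls_binary",
                        "type": "DatasetReference",
                        "parameters": {"container": "inbound", "directory": "x", "filename": "x"},
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "ds_irazure_adls_binary_directory",
                        "type": "DatasetReference",
                        "parameters": {"container": "publish", "directory": "nyx_taxi_sample_pipeline"},
                    }
                ],
            }
        ],
    },
}

print(json.dumps(pl_simple_copy, indent=2))
```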

3. Mapping Data Flows Copy

  1. Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis (…) menu, and click the New pipeline menu item. In the Properties pane of the new pipeline, add a Name and Description.

    Attribute   | Value
    Name        | pl_simple_copy_df
    Description | Use the copy activity to copy data from one location to another using mapping data flows.
  2. From the Activities pane, under Move & transform, click and drag the Data flow activity into the pipeline area. Then, on the Settings tab, click the + New button. A sketch of the resulting pipeline definition appears after this list.

    Create pipeline with data flows Create data flow

  3. On the Properties pane, enter the following values.

    Attribute   | Value
    Name        | df_simple_copy
    Description | Data flow that performs a simple copy.
  4. Click the Data flow debug toggle from within the data flows working area. Select the Integration runtime ir-vnetwork-medium-60min. Select 4 hours from the Debug time to live option so that your debug Apache Spark cluster remains available for the next 4 hours of the lab. Click OK.

    Create data flow debug cluster

  5. Click the Add Source option from the down caret menu as shown below.

    DF Add Source

  6. Complete the Source settings.

    Attribute           | Value
    Name                | source
    Source type         | Inline
    Inline dataset type | Parquet
    Linked service      | ls_adls_irvnetmedium
    Sampling            | Enable

    DF Source Settings

  7. On the Source options tab, click the Wildcard File mode option and click the Browse button.

    1. An error may appear indicating that interactive authoring is disabled. Interactive authoring must be enabled on the integration runtime before you can browse the storage account. Click the Edit interactive authoring link.

      DF Interactive authoring

    2. On the Edit integration runtime pane, on the Virtual network tab, click Enable under Interactive authoring. Then, click Apply.

      DF Interactive authoring

    3. When the runtime is edited to allow interactive authoring, the Browse pane will indicate Interactive authoring enabled. Click Retry.

      DF Interactive authoring

  8. Use the Browse pane to navigate to inbound and click OK. The File system and Wildcard paths container values will show inbound.

    DF Source browse

  9. Enter nyx_taxi_sample/*.parquet in the Wildcard paths field as shown below.

    DF Source options path

  10. On the Projection tab, click Import schema.

    DF Source options path

  11. On the Data preview tab, click Refresh to see a preview of the data. Data flow data previews, used together with sampling of the source data, make development easier and let the Azure Data Factory user interface show the data mid-transformation.

    DF Source preview

  12. Click the + button on the bottom right corner of the source activity and then select the Sink option.

    DF Sink Add

  13. Enter the following values in the Sink tab.

    Attribute           | Value
    Name                | sink
    Sink type           | Inline
    Inline dataset type | Parquet
    Linked service      | ls_adls_irvnetmedium
  14. On the Settings tab, click the Browse button, browse to publish/nyx_taxi_sample_dataflow and click OK.

    DF Sink Settings

  15. Set the Umask settings as shown below.

    Attribute | Value
    Owner     | R + W + X
    Group     | R + W + X
    Others    | X
  16. On the Data preview tab, click Refresh to see a preview of the data.

    DF Sink Refresh

  17. Click the pl_simple_copy_df pipeline from the list of Pipelines. Click the Debug button and ensure your copy succeeds!

    DF Debug

    • If you get a message prompting you to select whether to use the activity runtime or the debug session, click Continue with debug session to ensure you are using the Spark capability of data flows.

      DF Debug

  18. Click the Publish all button, then click the Publish button.

    Publish all

  19. Close all open tabs.
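
For comparison with the pipeline copy, below is a minimal sketch of what pl_simple_copy_df might look like in its code view: the pipeline itself only holds an Execute data flow activity that points at df_simple_copy and the managed virtual network integration runtime, while the source, wildcard, and sink logic live inside the data flow definition. The activity name and reference type strings are assumptions based on typical Data Factory JSON, not values confirmed by this lab.

```python
import json

# Approximate definition of pl_simple_copy_df: the pipeline delegates all of
# the copy logic to the df_simple_copy mapping data flow, which runs on the
# managed virtual network integration runtime (an Apache Spark cluster).
pl_simple_copy_df = {
    "name": "pl_simple_copy_df",
    "properties": {
        "description": "Use the copy activity to copy data from one location to another using mapping data flows.",
        "activities": [
            {
                "name": "Data flow1",  # hypothetical default activity name
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "df_simple_copy",
                        "type": "DataFlowReference",
                    },
                    "integrationRuntime": {
                        "referenceName": "ir-vnetwork-medium-60min",
                        "type": "IntegrationRuntimeReference",
                    },
                },
            }
        ],
    },
}

print(json.dumps(pl_simple_copy_df, indent=2))
```

The source and sink you configured (Inline Parquet over ls_adls_irvnetmedium, the inbound nyx_taxi_sample/*.parquet wildcard, and the publish/nyx_taxi_sample_dataflow folder) are stored in the data flow's own definition rather than in the pipeline.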

🎉 Summary

You have now completed this module. You performed a simple copy using both the Pipelines and the Data flows features. The pipeline copy used Azure IR compute, which does not use Spark, while the data flows pipeline you created used an integration runtime with a managed virtual network and a medium-sized Apache Spark cluster.

Continue >