
Module 03 - Two Ways to do a Basic Copy

< Previous Module - Home - Next Module >

📢 Introduction

A basic data movement can be accomplished in two ways in Azure Data Factory. This module will demonstrate both:

  • Pipeline copy
  • Mapping data flows copy

📑 Table of Contents

# Section
1 Stage data in the data lake
2 Pipeline copy
3 Mapping data flows copy

1. Stage data in the data lake

Data was staged as part of the setup for this lab. See staging instructions for more information.

2. Pipeline Copy

  1. Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis (…) menu, and click the New pipeline menu item.

    new pipeline

  2. In the Properties pane of the new pipeline, add a Name and Description.

    Attribute   | Example Value
    Name        | pl_simple_copy
    Description | Use the copy activity to copy data from one location to another.
  3. From the Activities pane, under Move & transform, click and drag the Copy data activity into the pipeline area. Then, on the Source tab of the activity properties, click the + New button.

    New dataset

  4. In the New dataset pane, find Azure Data Lake Storage Gen 2 and click the Continue button.

  5. Click the Binary option and click the Continue button.

  6. Enter the properties and click the OK button. The New dataset pane will close.

    Attribute      | Value
    Name           | ds_irazure_adls_binary
    Linked Service | ls_adls_irazure
  7. Click the Open button from the Source tab of the Copy data activity properties.

    New dataset adls 2

  8. On the Parameters tab of the new dataset, click the + New button and add three parameters.

    Name      | Type   | Default value
    container | String | leave blank
    directory | String | leave blank
    filename  | String | leave blank

    New dataset adls 3

  9. On the Connection tab of the new dataset, hover over each of the three File path fields and click the Add dynamic content link that appears under it. Add the container parameter to the first field, the directory parameter to the second, and the filename parameter to the third. This dataset is parameterized down to the file name (as opposed to the directory level). A sketch of the resulting dataset definition appears after this list.

    New dataset adls 4 New dataset adls 5

  10. Open the ellipsis (…) menu of the ds_irazure_adls_binary dataset and select Clone.

    Clone dataset

  11. On the Properties pane of the cloned dataset, replace the text "copy1" in the Name with directory. The name of the dataset should now be ds_irazure_adls_binary_directory.

  12. Hover over the file name part of the File path and click the garbage bin icon. On the Parameters tab, check the filename parameter and click the Delete button.

    New dataset adls 6 New dataset adls 7

  13. Click the pl_simple_copy pipeline from the list of Pipelines. On the Source tab of the Copy data activity, enter the following values.

    Attribute                      | Value
    Source dataset                 | ds_irazure_adls_binary
    Dataset properties / container | inbound
    Dataset properties / directory | x (placeholder only; overwritten because Wildcard file path is selected as the File path type)
    Dataset properties / filename  | x (placeholder only; overwritten because Wildcard file path is selected as the File path type)
    File path type                 | Wildcard file path
    Wildcard paths / directory     | nyx_taxi_sample
    Wildcard paths / filename      | *.parquet

    Create new pipeline copy source

  14. On the Sink tab of the Copy data activity, enter the following values. A sketch of the resulting pipeline definition appears after this list.

    Attribute                      | Value
    Sink dataset                   | ds_irazure_adls_binary_directory
    Dataset properties / container | publish
    Dataset properties / directory | nyx_taxi_sample_pipeline

    Create new pipeline copy sink

  15. Click the Debug button and ensure your copy succeeds!

    Pipeline succeeded

  16. Click the Publish all button, then click the Publish button.

    Publish all

  17. Close all open tabs.
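
For reference, below is a minimal sketch of the JSON that the Studio generates behind the scenes for the parameterized Binary dataset, expressed as a Python dictionary so it can be printed and compared against the dataset's code view. Property names such as AzureBlobFSLocation and the Expression wrappers are approximations of what Data Factory emits and may differ slightly between versions; treat this as an illustration of how the three dataset parameters feed the File path, not as a definition to paste verbatim.

```python
import json

# Approximate definition of ds_irazure_adls_binary: a Binary dataset on
# ADLS Gen2 whose container/directory/filename come from dataset parameters.
# Property names are an assumption based on typical Data Factory JSON.
ds_irazure_adls_binary = {
    "name": "ds_irazure_adls_binary",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "ls_adls_irazure",
            "type": "LinkedServiceReference",
        },
        "parameters": {
            "container": {"type": "String"},
            "directory": {"type": "String"},
            "filename": {"type": "String"},
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                # Each part of the File path is bound to a dataset parameter
                # via an Add dynamic content expression.
                "fileSystem": {"value": "@dataset().container", "type": "Expression"},
                "folderPath": {"value": "@dataset().directory", "type": "Expression"},
                "fileName": {"value": "@dataset().filename", "type": "Expression"},
            }
        },
    },
}

print(json.dumps(ds_irazure_adls_binary, indent=2))
```

The cloned ds_irazure_adls_binary_directory dataset looks the same, minus the filename parameter and the fileName entry in the location.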
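Similarly, here is a rough sketch of pl_simple_copy as it might appear in the pipeline's code view. It shows why the directory and filename dataset parameters only need placeholder values: the wildcard settings live on the Copy activity's source store settings. The activity name and store-settings type names (BinarySource, AzureBlobFSReadSettings, and so on) are assumptions, not values confirmed by this lab.

```python
import json

# Approximate definition of pl_simple_copy: one Copy activity that reads
# *.parquet files under inbound/nyx_taxi_sample via a wildcard and writes
# them to publish/nyx_taxi_sample_pipeline.
pl_simple_copy = {
    "name": "pl_simple_copy",
    "properties": {
        "description": "Use the copy activity to copy data from one location to another.",
        "activities": [
            {
                "name": "Copy data1",  # hypothetical default activity name
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "BinarySource",
                        "storeSettings": {
                            "type": "AzureBlobFSReadSettings",
                            "recursive": True,
                            # The wildcard file path overrides the directory and
                            # filename placeholders passed to the source dataset.
                            "wildcardFolderPath": "nyx_taxi_sample",
                            "wildcardFileName": "*.parquet",
                        },
                    },
                    "sink": {
                        "type": "BinarySink",
                        "storeSettings": {"type": "AzureBlobFSWriteSettings"},
                    },
                },
                "inputs": [
                    {
                        "referenceName": "ds_irazure_adls_binary",
                        "type": "DatasetReference",
                        "parameters": {"container": "inbound", "directory": "x", "filename": "x"},
                    }
                ],
                "outputs": [
                    {
                        "referenceName": "ds_irazure_adls_binary_directory",
                        "type": "DatasetReference",
                        "parameters": {"container": "publish", "directory": "nyx_taxi_sample_pipeline"},
                    }
                ],
            }
        ],
    },
}

print(json.dumps(pl_simple_copy, indent=2))
```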

3. Mapping Data Flows Copy

  1. Within the Data Factory Studio, select the Author tab from the leftmost pane. Open the Pipeline Actions ellipsis (…) menu, and click the New pipeline menu item. In the Properties pane of the new pipeline, add a Name and Description.

    Attribute   | Value
    Name        | pl_simple_copy_df
    Description | Use the copy activity to copy data from one location to another using mapping data flows.
  2. From the Activities pane, under Move & transform, click and drag the Data flow activity into the pipeline area. Then, on the Settings tab, click the + New button. A sketch of the resulting pipeline definition appears after this list.

    Create pipeline with data flows Create data flow

  3. On the Properties pane, enter the following values.

    Attribute   | Value
    Name        | df_simple_copy
    Description | Data flow that performs a simple copy.
  4. Click the Data flow debug toggle from within the data flows working area. Select the Integration runtime ir-vnetwork-medium-60min. Select 4 hours from the Debug time to live option so that your debug Apache Spark cluster remains available for the next 4 hours of the lab. Click OK.

    Create data flow debug cluster

  5. Click the Add Source option from the down caret menu as shown below.

    DF Add Source

  6. Complete the Source settings.

    Attribute           | Value
    Name                | source
    Source type         | Inline
    Inline dataset type | Parquet
    Linked service      | ls_adls_irvnetmedium
    Sampling            | Enable

    DF Source Settings

  7. On the Source options tab, click the Wildcard File mode option and click the Browse button.

    1. An error may appear indicating that interactive authoring is disabled. Interactive authoring must be enabled on the integration runtime before you can browse the storage account. Click the Edit interactive authoring link.

      DF Interactive authoring

    2. On the Edit integration runtime pane, on the Virtual network tab, click Enable under Interactive authoring. Then, click Apply.

      DF Interactive authoring

    3. When the runtime is edited to allow interactive authoring, the Browse pane will indicate Interactive authoring enabled. Click Retry.

      DF Interactive authoring

  8. Use the Browse pane to navigate to inbound and click OK. The File system and Wildcard paths container values will show inbound.

    DF Source browse

  9. Enter nyx_taxi_sample/*.parquet in the Wildcard paths field as shown below.

    DF Source options path

  10. On the Projection tab, click Import schema.

    DF Source options path

  11. On the Data preview tab, click Refresh to see a preview of the data. Data flow data previews, used together with sampling of the source data, make development easier and let the Azure Data Factory user interface show the data mid-transformation.

    DF Source preview

  12. Click the + button on the bottom right corner of the source activity and then select the Sink option.

    DF Sink Add

  13. Enter the following values in the Sink tab.

    Attribute           | Value
    Name                | sink
    Sink type           | Inline
    Inline dataset type | Parquet
    Linked service      | ls_adls_irvnetmedium
  14. On the Settings tab, click the Browse button, browse to publish/nyx_taxi_sample_dataflow and click OK.

    DF Sink Settings

  15. Set the Umask settings as shown below.

    Attribute | Value
    Owner     | R + W + X
    Group     | R + W + X
    Others    | X
  16. On the Data preview tab, click Refresh to see a preview of the data.

    DF Sink Refresh

  17. Click the pl_simple_copy_df pipeline from the list of Pipelines. Click the Debug button and ensure your copy succeeds!

    DF Debug

    • If you get a message prompting you to select whether to use the activity runtime or the debug session, click Continue with debug session to ensure you are using the Spark capability of data flows.

      DF Debug

  18. Click the Publish all button, then click the Publish button.

    Publish all

  19. Close all open tabs.
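
For comparison with the pipeline copy, below is a minimal sketch of what pl_simple_copy_df might look like in its code view: the pipeline itself only holds an Execute data flow activity that points at df_simple_copy and the managed virtual network integration runtime, while the source, wildcard, and sink logic live inside the data flow definition. The activity name and reference type strings are assumptions based on typical Data Factory JSON, not values confirmed by this lab.

```python
import json

# Approximate definition of pl_simple_copy_df: the pipeline delegates all of
# the copy logic to the df_simple_copy mapping data flow, which runs on the
# managed virtual network integration runtime (an Apache Spark cluster).
pl_simple_copy_df = {
    "name": "pl_simple_copy_df",
    "properties": {
        "description": "Use the copy activity to copy data from one location to another using mapping data flows.",
        "activities": [
            {
                "name": "Data flow1",  # hypothetical default activity name
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataflow": {
                        "referenceName": "df_simple_copy",
                        "type": "DataFlowReference",
                    },
                    "integrationRuntime": {
                        "referenceName": "ir-vnetwork-medium-60min",
                        "type": "IntegrationRuntimeReference",
                    },
                },
            }
        ],
    },
}

print(json.dumps(pl_simple_copy_df, indent=2))
```

The source and sink you configured (Inline Parquet over ls_adls_irvnetmedium, the inbound nyx_taxi_sample/*.parquet wildcard, and the publish/nyx_taxi_sample_dataflow folder) are stored in the data flow's own definition rather than in the pipeline.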

🎉 Summary

You have now completed this module. You performed a simple copy using both the Pipelines and the Data flows features. The pipeline copy used Azure IR compute, which does not use Spark, while the data flows pipeline you created used an integration runtime with a managed virtual network and a medium-sized Apache Spark cluster.

Continue >