
Dataproc Templates (Java - Spark)

Please refer to the Dataproc Templates (Java - Spark) README for more information.

...

Requirements

  • Java 8
  • Maven 3

Running Templates

The Dataproc Templates (Java - Spark) support both serverless and cluster modes. Serverless mode is the default. To run on an existing Dataproc cluster instead, set the environment variables described under Cluster Mode.

Serverless Mode (Default)

Submits the job to Dataproc Serverless using the batches submit spark command.

Cluster Mode

Submits the job to a Dataproc Standard cluster using the jobs submit spark command.

To run the templates on an existing cluster, you must specify the JOB_TYPE=CLUSTER and CLUSTER=<full clusterId> environment variables. For example:

export JOB_TYPE=CLUSTER
export CLUSTER=${DATAPROC_CLUSTER_NAME}

Note: Certain templates may require a newer Dataproc image version. Before running a template, make sure your cluster's Dataproc image version provides the dependency versions listed in pom.xml.

Some HBase templates that require a custom image to execute are not yet supported in CLUSTER mode.
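As a rough illustration of how the two modes differ, the sketch below maps a JOB_TYPE value to the gcloud subcommand named above. The helper name submit_command and the SERVERLESS label for the default mode are assumptions made for this example; it is not the logic used by bin/start.sh.

```shell
# Hypothetical helper (not part of bin/start.sh): print the gcloud subcommand
# that corresponds to a JOB_TYPE value. "SERVERLESS" is an assumed label for
# the default mode.
submit_command() {
  job_type="${1:-SERVERLESS}"
  case "$job_type" in
    CLUSTER)
      if [ -z "${CLUSTER:-}" ]; then
        echo "CLUSTER must be set when JOB_TYPE=CLUSTER" >&2
        return 1
      fi
      echo "gcloud dataproc jobs submit spark --cluster=${CLUSTER}"
      ;;
    SERVERLESS)
      echo "gcloud dataproc batches submit spark"
      ;;
    *)
      echo "unknown JOB_TYPE: $job_type" >&2
      return 1
      ;;
  esac
}

CLUSTER=my-cluster        # placeholder cluster name
submit_command CLUSTER    # cluster mode
submit_command            # default (serverless) mode
```

The function only prints the command, so it is safe to run without a Google Cloud project.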

Submit templates

  1. Format Code [Optional]

    From either the root directory or v2/ directory, run:

    mvn spotless:apply

    This will format the code and add a license header. To verify that the code is formatted correctly, run:

    mvn spotless:check

    Run the commands from v2/ if your changes are under v2/; otherwise run them from the root directory.
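The directory rule above can be sketched as a small helper. The name spotless_dir and the example file paths are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: given the path of a changed file (relative to the root
# directory), print the directory to run the spotless commands from.
spotless_dir() {
  case "$1" in
    v2/*) echo "v2" ;;
    *)    echo "." ;;
  esac
}

# Example paths are placeholders, not real files in the repository.
echo "cd $(spotless_dir v2/src/Example.java) && mvn spotless:apply"
echo "cd $(spotless_dir src/Example.java) && mvn spotless:apply"
```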

  2. Building the Project

    Build the entire project with Maven:

    mvn clean install
  3. Executing a Template File

    Once the template is staged on Google Cloud Storage, it can be executed using the gcloud CLI tool.

    To stage and execute the template, use the start.sh script. It takes:

    • Environment variables that specify where and how to deploy the templates

    • Additional options for gcloud dataproc jobs submit spark or gcloud beta dataproc batches submit spark

    • Template options, such as the required --template option, which selects the template to run, and --templateProperty options, which pass in properties at runtime (as an alternative to setting them in src/main/resources/template.properties).

    • Another common template property is log.level, an optional parameter that sets the log level of the Spark context and defaults to INFO. Possible values are the Spark log levels: ["ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"]

      --templateProperty log.level=ERROR
    • Usage syntax:

      start.sh [submit-spark-options] -- --template templateName [--templateProperty key=value] [extra-template-options]

      For example:

      # Set required environment variables.
      export PROJECT=my-gcp-project
      export REGION=gcp-region
      export GCS_STAGING_LOCATION=gs://my-bucket/temp
      # Set optional environment variables.
      export SUBNET=projects/<gcp-project>/regions/<region>/subnetworks/test-subnet1
      # ID of Dataproc cluster running permanent history server to access historic logs.
      export HISTORY_SERVER_CLUSTER=projects/<gcp-project>/regions/<region>/clusters/<cluster>
      
      # The submit spark options must be separated with a "--" from the template options
      bin/start.sh \
      --properties=<spark.something.key>=<value> \
      --version=... \
      -- \
      --template <TEMPLATE TYPE> \
      --templateProperty <key>=<value>
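To make the role of the "--" separator and the log.level choices concrete, here is a minimal sketch. The function names split_args and is_valid_log_level are hypothetical, and the code does not reproduce how bin/start.sh parses its arguments; spark.executor.instances is used only as a sample Spark property.

```shell
# Hypothetical sketch: split a start.sh-style argument list at the first "--".
# Everything before it is a submit-spark option, everything after it is a
# template option. Options are joined with spaces, so arguments that contain
# spaces are not preserved; this is for illustration only.
split_args() {
  SUBMIT_OPTS=""
  TEMPLATE_OPTS=""
  seen_separator=false
  for arg in "$@"; do
    if [ "$seen_separator" = false ] && [ "$arg" = "--" ]; then
      seen_separator=true
    elif [ "$seen_separator" = true ]; then
      TEMPLATE_OPTS="${TEMPLATE_OPTS:+$TEMPLATE_OPTS }$arg"
    else
      SUBMIT_OPTS="${SUBMIT_OPTS:+$SUBMIT_OPTS }$arg"
    fi
  done
}

# Hypothetical check: succeed only for the log levels listed above.
is_valid_log_level() {
  case "$1" in
    ALL|DEBUG|ERROR|FATAL|INFO|OFF|TRACE|WARN) return 0 ;;
    *) return 1 ;;
  esac
}

split_args --properties=spark.executor.instances=2 \
  -- --template GCSTOBIGQUERY --templateProperty log.level=ERROR
echo "submit options:   $SUBMIT_OPTS"
echo "template options: $TEMPLATE_OPTS"

if is_valid_log_level ERROR; then
  echo "log.level=ERROR is one of the listed levels"
fi
```

A check like is_valid_log_level could catch a mistyped level before a job is submitted.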
    1. Executing Hive to GCS template

      Detailed instructions at README.md

      bin/start.sh \
      --properties=spark.hadoop.hive.metastore.uris=thrift://hostname/ip:9083 \
      -- --template HIVETOGCS
    2. Executing Hive to BigQuery template

      Detailed instructions at README.md

      bin/start.sh \
      --properties=spark.hadoop.hive.metastore.uris=thrift://hostname/ip:9083 \
      -- --template HIVETOBIGQUERY
    3. Executing Spanner to GCS template.

      Detailed instructions at README.md

      bin/start.sh -- --template SPANNERTOGCS
    4. Executing PubSub to BigQuery template.

      bin/start.sh -- --template PUBSUBTOBQ
    5. Executing PubSub to GCS template.

      bin/start.sh -- --template PUBSUBTOGCS
    6. Executing GCS to BigQuery template.

      bin/start.sh -- --template GCSTOBIGQUERY
    7. Executing BigQuery to GCS template.

      bin/start.sh -- --template BIGQUERYTOGCS
    8. Executing General template.

      Detailed instructions at README.md

      bin/start.sh --files="gs://bucket/path/config.yaml" \
      -- --template GENERAL --config config.yaml

      With, for example, config.yaml:

      input:
        shakespeare:
          format: bigquery
          options:
            table: "bigquery-public-data:samples.shakespeare"
      query:
        wordcount:
          sql: "SELECT word, sum(word_count) cnt FROM shakespeare GROUP BY word ORDER BY cnt DESC"
      output:
        wordcount:
          format: csv
          options:
            header: true
            path: gs://bucket/output/wordcount/
          mode: Overwrite
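In the General template example, --files takes the full Cloud Storage URI while --config takes only the file name. The dry-run sketch below shows one way the two values could be kept in sync. It only prints the commands. The run helper, the bucket path, and the use of gcloud storage cp to upload the file are assumptions, not steps documented here.

```shell
# Hypothetical dry-run sketch: print the commands instead of running them.
# The bucket path and the upload step are assumptions for illustration.
CONFIG_URI="gs://bucket/path/config.yaml"
# Derive the name passed to --config from the URI passed to --files.
CONFIG_NAME="$(basename "$CONFIG_URI")"

run() {
  # Print the command line; swap the echo for "$@" to execute it for real.
  echo "+ $*"
}

run gcloud storage cp "$CONFIG_NAME" "$CONFIG_URI"
run bin/start.sh --files="$CONFIG_URI" -- --template GENERAL --config "$CONFIG_NAME"
```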