diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/bigquery/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/bigquery/README.md
index 373bc3339..bafa9f9f9 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/bigquery/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/bigquery/README.md
@@ -1,4 +1,4 @@
-## BigQuery To GCS
+## BigQuery To Cloud Storage
 
 General Execution:
 
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/databases/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/databases/README.md
index c13ee03b9..28fc320f2 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/databases/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/databases/README.md
@@ -1,4 +1,4 @@
-## Executing Spanner to GCS template
+## Executing Spanner to Cloud Storage template
 
 General Execution:
 
@@ -30,7 +30,7 @@ Update`spanner.gcs.input.table.id` property as follows:
 ```
 "spanner.gcs.input.table.id=(select name, age, phone from employee where designation = 'engineer')"
 ```
-There are two optional properties as well with "Spanner to GCS" Template. Please find below the details :-
+There are two optional properties as well with "Spanner to Cloud Storage" Template. Please find below the details :-
 
 ```
 --templateProperty spanner.gcs.temp.table='temporary_view_name'
@@ -41,7 +41,7 @@ The only thing needs to keep in mind is that, the name of the Spark temporary vi
 
 **NOTE** It is required to surround your custom query with parenthesis and parameter name with double quotes.
 
-## Executing Cassandra to GCS Template
+## Executing Cassandra to Cloud Storage Template
 
 ### General Execution
 ```
@@ -146,7 +146,7 @@ You can replace the ```casscon``` with your catalog name if it is passed. This i
 
 Make sure that either ```cassandratobq.input.query``` or both ```cassandratobq.input.keyspace``` and ```cassandratobq.input.table``` is provided. Setting or not setting all three properties at the same time will throw an error.
 
-## Executing Redshift to GCS template
+## Executing Redshift to Cloud Storage template
 
 General Execution:
 
@@ -171,16 +171,16 @@ bin/start.sh \
 --templateProperty redshift.gcs.output.mode=
 ```
 
-There are two optional properties as well with "Redshift to GCS" Template. Please find below the details :-
+There are two optional properties as well with "Redshift to Cloud Storage" Template. Please find below the details :-
 
 ```
 --templateProperty redshift.gcs.temp.table='temporary_view_name'
 --templateProperty redshift.gcs.temp.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations while loading data into GCS.
+These properties are responsible for applying some spark sql transformations while loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
-## Executing Mongo to GCS template
+## Executing Mongo to Cloud Storage template
 
 Template for exporting a MongoDB Collection to files in Google Cloud Storage. It supports writing JSON, CSV, Parquet and Avro formats.
 
@@ -212,8 +212,8 @@ Arguments:
 * `templateProperty mongo.gcs.input.uri`: MongoDB Connection String as an Input URI (format: `mongodb://host_name:port_no`)
 * `templateProperty mongo.gcs.input.database`: MongoDB Database Name (format: Database_name)
 * `templateProperty mongo.gcs.input.collection`: MongoDB Input Collection Name (format: Collection_name)
-* `templateProperty mongo.gcs.output.format`: GCS Output File Format (one of: avro,parquet,csv,json)
-* `templateProperty mongo.gcs.output.location`: GCS Location to put Output Files (format: `gs://BUCKET/...`)
+* `templateProperty mongo.gcs.output.format`: Cloud Storage Output File Format (one of: avro,parquet,csv,json)
+* `templateProperty mongo.gcs.output.location`: Cloud Storage Location to put Output Files (format: `gs://BUCKET/...`)
 * `templateProperty mongo.gcs.output.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)
 
 Example Submission:
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/dataplex/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/dataplex/README.md
index abdf8b2cb..91c667a86 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/dataplex/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/dataplex/README.md
@@ -1,10 +1,10 @@
-## Dataplex GCS to BigQuery
+## Dataplex Cloud Storage to BigQuery
 
-This template will incrementally move data from a Dataplex GCS tables to BigQuery.
-It will identify new partitions in Dataplex GCS and load them to BigQuery.
+This template will incrementally move data from a Dataplex Cloud Storage tables to BigQuery.
+It will identify new partitions in Dataplex Cloud Storage and load them to BigQuery.
 
-Note: if the Dataplex GCS table has no partitions, the whole table will be read
-from GCS and the target BQ table will be overwritten.
+Note: if the Dataplex Cloud Storage table has no partitions, the whole table will be read
+from Cloud Storage and the target BQ table will be overwritten.
 
 ### General Execution:
 
@@ -46,9 +46,9 @@ gcloud dataplex tasks create \
 SQL file should be located
 
 `dataplex.gcs.bq.target.dataset` name of the target BigQuery dataset where the
-Dataplex GCS asset will be migrated to
+Dataplex Cloud Storage asset will be migrated to
 
-`gcs.bigquery.temp.bucket.name` the GCS bucket that temporarily holds the data
+`gcs.bigquery.temp.bucket.name` the Cloud Storage bucket that temporarily holds the data
 before it is loaded to BigQuery
 
 `dataplex.gcs.bq.save.mode` specifies how to handle existing data in BigQuery
@@ -71,7 +71,7 @@ over any other property or argument specifying target output of the data.
 
 ### Arguments
 
-`--dataplexEntity` Dataplex GCS table to load in BigQuery \
+`--dataplexEntity` Dataplex Cloud Storage table to load in BigQuery \
 Example: `--dataplexEntityList "projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id_1}"`
 
 `--partitionField` if field is specified together with `partitionType`, the
@@ -87,7 +87,7 @@ argument is not specified the name of the entity will be used as table name
 
 Optionally a custom SQL can be provided to filter the data that will be copied
 to BigQuery. \
-The template will read from a GCS file with the custom sql string.
+The template will read from a Cloud Storage file with the custom sql string.
 
 The path to this file must be provided with the option `--customSqlGcsPath`.
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/gcs/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/gcs/README.md
index 9b55b2fe7..7b283deeb 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/gcs/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/gcs/README.md
@@ -1,4 +1,4 @@
-## 1. GCS To BigQuery
+## 1. Cloud Storage To BigQuery
 
 General Execution:
 
@@ -19,7 +19,7 @@ bin/start.sh \
 --templateProperty gcs.bigquery.temp.bucket.name=
 ```
 
-There are two optional properties as well with "GCS to BigQuery" Template. Please find below the details :-
+There are two optional properties as well with "Cloud Storage to BigQuery" Template. Please find below the details :-
 
 ```
 --templateProperty gcs.bigquery.temp.table='temporary_view_name'
@@ -28,7 +28,7 @@ There are two optional properties as well with "GCS to BigQuery" Template. Pleas
 These properties are responsible for applying some spark sql transformations while loading data into BigQuery.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
-## 2. GCS To BigTable
+## 2. Cloud Storage To BigTable
 
 General Execution:
 
@@ -63,7 +63,7 @@ bin/start.sh \
 (Please note that the table in Bigtable should exist with the above given column family before executing the template)
 ```
 
-## 3. GCS to Spanner
+## 3. Cloud Storage to Spanner
 ```
 GCP_PROJECT= \
 REGION= \
@@ -82,7 +82,7 @@ bin/start.sh \
 
 ```
 
-## 4. GCS to JDBC
+## 4. Cloud Storage to JDBC
 ```
 Please download the JDBC Driver of respective database and copy it to GCS bucket location.
 
@@ -129,7 +129,7 @@ bin/start.sh \
 
 ```
 
-## 5. GCS to GCS
+## 5. Cloud Storage to Cloud Storage
 
 ```
 GCP_PROJECT= \
@@ -161,9 +161,9 @@ bin/start.sh \
 
 ```
 
-## 6. GCS To Mongo:
+## 6. Cloud Storage To Mongo:
 
-Download the following MongoDb connectors and copy it to GCS bucket location:
+Download the following MongoDb connectors and copy it to Cloud Storage bucket location:
 * [MongoDb Spark Connector](https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector)
 * [MongoDb Java Driver](https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver)
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/general/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/general/README.md
index 218fc636d..4d929d51f 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/general/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/general/README.md
@@ -63,7 +63,7 @@ export GCS_STAGING_LOCATION=
 export SUBNET=projects//regions//subnetworks/
 ```
 
-Create a config file and upload it to GCS:
+Create a config file and upload it to Cloud Storage:
 
 ```
 gsutil cp config.yaml gs://bucket/path/config.yaml
@@ -101,7 +101,7 @@ bin/start.sh \
 
 # Example config files:
 
-GCS to BigQuery config.yaml
+Cloud Storage to BigQuery config.yaml
 
 ```yaml
 input:
@@ -116,7 +116,7 @@ output:
   mode: Overwrite
 ```
 
-BigQuery to GCS config.yaml
+BigQuery to Cloud Storage config.yaml
 
 ```yaml
 input:
@@ -131,7 +131,7 @@ output:
   mode: Overwrite
 ```
 
-GCS Avro to GCS CSV
+Cloud Storage Avro to Cloud Storage CSV
 
 ```yaml
 input:
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/hbase/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/hbase/README.md
index 04b16ddf6..7e0d61679 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/hbase/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/hbase/README.md
@@ -1,4 +1,4 @@
-## 1. HBase To GCS
+## 1. HBase To Cloud Storage
 
 ### Required JAR files
 Some HBase dependencies are required to be passed when submitting the job. These dependencies are automatically set by script when CATALOG environment variable is set for hbase table configuration. If not,
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/hive/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/hive/README.md
index 0a9152d1e..2c8f2d9c0 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/hive/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/hive/README.md
@@ -33,7 +33,7 @@ These properties are responsible for applying some spark sql transformations bef
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
-## 2. Hive To GCS
+## 2. Hive To Cloud Storage
 
 General Execution:
 
 ```
@@ -53,25 +53,25 @@ bin/start.sh \
 ### Configurable Parameters
 Update Following properties in [template.properties](../../../../../../../resources/template.properties) file:
 ```
-## GCS output path.
+## Cloud Storage output path.
 hive.gcs.output.path=
 ## Name of hive input table.
 hive.input.table=
 ## Hive input db name.
 hive.input.db=
-## Optional - GCS output format. avro/csv/parquet/json/orc, defaults to avro.
+## Optional - Cloud Storage output format. avro/csv/parquet/json/orc, defaults to avro.
 hive.gcs.output.format=avro
 ## Optional, column to partition hive data.
 hive.partition.col=
-## Optional: Write mode to gcs append/overwrite/errorifexists/ignore, defaults to overwrite
+## Optional: Write mode to Cloud Storage append/overwrite/errorifexists/ignore, defaults to overwrite
 hive.gcs.save.mode=overwrite
 ```
 
-There are two optional properties as well with "Hive to GCS" Template. Please find below the details :-
+There are two optional properties as well with "Hive to Cloud Storage" Template. Please find below the details :-
 
 ```
 --templateProperty hive.gcs.temp.table='temporary_view_name'
 --templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations before loading data into GCS.
+These properties are responsible for applying some spark sql transformations before loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/jdbc/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/jdbc/README.md
index c265f48df..280f8471c 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/jdbc/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/jdbc/README.md
@@ -19,7 +19,7 @@ Following databases are supported via Spark JDBC by default:
 ## Required JAR files
 
 These templates requires the JDBC jar file to be available in the Dataproc cluster.
-User has to download the required jar file and host it inside a GCS Bucket, so that it could be referred during the execution of code.
+User has to download the required jar file and host it inside a Cloud Storage Bucket, so that it could be referred during the execution of code.
 
 wget command to download JDBC jar file is as follows :-
 
@@ -40,7 +40,7 @@ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre
 wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar
 ```
 
-Once the jar file gets downloaded, please upload the file into a GCS Bucket and export the below variable
+Once the jar file gets downloaded, please upload the file into a Cloud Storage Bucket and export the below variable
 
 ```
 export JARS=
@@ -116,7 +116,7 @@ bin/start.sh \
 --templateProperty jdbctobq.sql.upperBound= \
 --templateProperty jdbctobq.sql.numPartitions= \
 --templateProperty jdbctobq.write.mode= \
---templateProperty jdbctobq.temp.gcs.bucket=
+--templateProperty jdbctobq.temp.gcs.bucket=
 ```
 
 **Note**: Following is example JDBC URL for MySQL database:
@@ -152,10 +152,12 @@ jdbctobq.jdbc.sessioninitstatement="BEGIN DBMS_APPLICATION_INFO.SET_MODULE('Data
 
 ***
 
-## 2. JDBC To GCS
+## 2. JDBC To Cloud Storage
 
 Note - Add dependency jars specific to database in JARS variable.
+
+ Example: export JARS=gs:///mysql-connector-java.jar
 
 General Execution
 
@@ -232,13 +234,13 @@ Example execution:
 --templateProperty jdbctogcs.write.mode=OVERWRITE \
 --templateProperty 'jdbctogcs.sql=SELECT * FROM MyCloudSQLDB.table1'
 
-There are two optional properties as well with "JDBC to GCS" Template. Please find below the details :-
+There are two optional properties as well with "JDBC to Cloud Storage" Template. Please find below the details :-
 
 ```
 --templateProperty jdbctogcs.temp.table='temporary_view_name'
 --templateProperty jdbctogcs.temp.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations before loading data into GCS.
+These properties are responsible for applying some spark sql transformations before loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
 ***
@@ -335,7 +337,7 @@ There are two optional properties as well with "JDBC to SPANNER" Template. Pleas
 --templateProperty jdbctospanner.temp.table='temporary_view_name'
 --templateProperty jdbctospanner.temp.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations before loading data into GCS.
+These properties are responsible for applying some spark sql transformations before loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
 # 4. JDBC To JDBC
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/kafka/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/kafka/README.md
index f86d696f9..0120188e7 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/kafka/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/kafka/README.md
@@ -40,11 +40,11 @@ kafka.bq.dataset=
 # BigQuery output table
 kafka.bq.table=
 
-# GCS bucket name, for storing temporary files
-kafka.bq.temp.gcs.bucket=
+# Cloud Storage bucket name, for storing temporary files
+kafka.bq.temp.gcs.bucket=
 
-# GCS location for maintaining checkpoint
-kafka.bq.checkpoint.location=
+# Cloud Storage location for maintaining checkpoint
+kafka.bq.checkpoint.location=
 
 # Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
 kafka.bq.starting.offset=
@@ -128,7 +128,7 @@ bin/start.sh \
 
 ```
 
-## 2. Kafka To GCS
+## 2. Kafka To Cloud Storage
 
 General Execution:
 
@@ -208,7 +208,7 @@ kafka.pubsub.input.topic=
 # PubSub topic
 kafka.pubsub.output.topic=
 
-# GCS location for maintaining checkpoint
+# Cloud Storage location for maintaining checkpoint
 kafka.pubsub.checkpoint.location=
 
 # Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/pubsub/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/pubsub/README.md
index b75b03b53..42726eb49 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/pubsub/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/pubsub/README.md
@@ -42,7 +42,7 @@ pubsub.bq.output.table=
 ## Number of records to be written per message to BigQuery
 pubsub.bq.batch.size=1000
 ```
-## 2. Pub/Sub To GCS
+## 2. Pub/Sub To Cloud Storage
 
 General Execution:
 
@@ -70,22 +70,22 @@ bin/start.sh \
 
 Following properties are available in commandline or [template.properties](../../../../../../../resources/template.properties) file:
 ```
-# PubSub to GCS
+# PubSub to Cloud Storage
 ## Project that contains the input Pub/Sub subscription to be read
 pubsubtogcs.input.project.id=yadavaja-sandbox
 ## PubSub subscription name
 pubsubtogcs.input.subscription=
 ## Stream timeout, for how long the subscription will be read
 pubsubtogcs.timeout.ms=60000
-## Streaming duration, how often wil writes to GCS be triggered
+## Streaming duration, how often wil writes to Cloud Storage be triggered
 pubsubtogcs.streaming.duration.seconds=15
 ## Number of streams that will read from Pub/Sub subscription in parallel
 pubsubtogcs.total.receivers=5
-## GCS bucket URL
+## Cloud Storage bucket URL
 pubsubtogcs.gcs.bucket.name=
-## Number of records to be written per message to GCS
+## Number of records to be written per message to Cloud Storage
 pubsubtogcs.batch.size=1000
-## PubSub to GCS supported formats are: AVRO, JSON
+## PubSub to Cloud Storage supported formats are: AVRO, JSON
 pubsubtogcs.gcs.output.data.format=
 ```
 ## 3. Pub/Sub To BigTable
diff --git a/java/src/main/java/com/google/cloud/dataproc/templates/snowflake/README.md b/java/src/main/java/com/google/cloud/dataproc/templates/snowflake/README.md
index 8c445fe71..9ca8b37e6 100644
--- a/java/src/main/java/com/google/cloud/dataproc/templates/snowflake/README.md
+++ b/java/src/main/java/com/google/cloud/dataproc/templates/snowflake/README.md
@@ -1,4 +1,4 @@
-## 1. Snowflake To GCS
+## 1. Snowflake To Cloud Storage
 
 General Execution:
 
@@ -57,17 +57,17 @@ snowflake.gcs.table=
 # Optional property: Snowflake select query
 snowflake.gcs.query=
 
-# GCS output location. Format: gs:///
+# Cloud Storage output location. Format: gs:///
 snowflake.gcs.output.location=
 
-# GCS ouput file format. Accepted values: csv, avro, orc, json or parquet
+# Cloud Storage ouput file format. Accepted values: csv, avro, orc, json or parquet
 snowflake.gcs.output.format=
 
-# Optional property: GCS ouput write mode. Accepted values: Overwrite, ErrorIfExists, Append or Ignore
+# Optional property: Cloud Storage ouput write mode. Accepted values: Overwrite, ErrorIfExists, Append or Ignore
 snowflake.gcs.output.mode=overwrite
 Note: If not specified explicitly through execution command, the default value is Overwrite.
 
-# Optional property: GCS output data partiton by column name
+# Optional property: Cloud Storage output data partiton by column name
 snowflake.gcs.output.partitionColumn=
 ```
 
diff --git a/python/dataproc_templates/cassandra/README.md b/python/dataproc_templates/cassandra/README.md
index 1dab83479..85e765bb8 100644
--- a/python/dataproc_templates/cassandra/README.md
+++ b/python/dataproc_templates/cassandra/README.md
@@ -6,7 +6,7 @@ Template for exporting a Cassandra table to Bigquery
 ## Arguments
 * `cassandratobq.input.host`: Cassandra Host IP
 * `cassandratobq.bigquery.location`: Dataset and Table name
-* `cassandratobq.temp.gcs.location`: Temp GCS location for staging
+* `cassandratobq.temp.gcs.location`: Temp Cloud Storage location for staging
 #### Optional Arguments
 * `cassandratobq.input.keyspace`: Input keyspace name for cassandra
 * `cassandratobq.input.table`: Input table name of cassandra
@@ -84,9 +84,9 @@ export JARS="gs://jar-bucket/spark-cassandra-connector-assembly_2.12-3.2.0.jar,g
 ```
 
-## Cassandra to GCS
+## Cassandra to Cloud Storage
 
-Template for exporting a Cassandra table to GCS
+Template for exporting a Cassandra table to Cloud Storage
 
 ## Arguments
 
@@ -94,8 +94,8 @@ Template for exporting a Cassandra table to GCS
 * `cassandratogcs.input.table`: Input table name of cassandra (not required when query is present)
 * `cassandratogcs.input.host`: Cassandra Host IP
 * `cassandratogcs.output.format`: Output File Format
-* `cassandratogcs.output.path`: GCS Bucket Path
-* `cassandratogcs.output.savemode`: Output mode of Cassandra to GCS
+* `cassandratogcs.output.path`: Cloud Storage Bucket Path
+* `cassandratogcs.output.savemode`: Output mode of Cassandra to Cloud Storage
 #### Optional Arguments
 * `cassandratobq.input.query`: Customised query for selective migration
 * `cassandratogcs.input.catalog.name`: Connection name, defaults to casscon
diff --git a/python/dataproc_templates/gcs/README.md b/python/dataproc_templates/gcs/README.md
index c1b0fa7dc..2713e44c8 100644
--- a/python/dataproc_templates/gcs/README.md
+++ b/python/dataproc_templates/gcs/README.md
@@ -1,4 +1,4 @@
-# GCS To BigQuery
+# Cloud Storage To BigQuery
 
 Template for reading files from Cloud Storage and writing them to a BigQuery table. It supports reading JSON, CSV, Parquet, Avro and Delta formats.
 
@@ -183,7 +183,7 @@ export JARS="gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar, \
 --text.bigquery.temp.bucket.name=""
 ```
 
-# GCS To GCS - SQL Transformation
+# Cloud Storage To Cloud Storage - SQL Transformation
 
 Template for reading files from Cloud Storage, applying data transformations using Spark SQL and then writing the transformed data back to Cloud Storage. It supports reading and writing JSON, CSV, Parquet and Avro formats. Additionally, it can read Delta format.
 
diff --git a/python/dataproc_templates/hbase/README.md b/python/dataproc_templates/hbase/README.md
index 539a9fda5..9dfc8ae47 100644
--- a/python/dataproc_templates/hbase/README.md
+++ b/python/dataproc_templates/hbase/README.md
@@ -1,4 +1,4 @@
-## Hbase To GCS
+## Hbase To Cloud Storage
 
 Template for reading files from Hbase and writing to Google Cloud Storage. It supports writing in JSON, CSV, Parquet and Avro formats.
 
@@ -49,7 +49,7 @@ Template for reading files from Hbase and writing to Google Cloud Storage. It su
 Some HBase dependencies are required to be passed when submitting the job. In order to avoid additional manual steps, startup script has **automated this process**. Just by setting **CATALOG** environment variable, script will automatically download and pass required dependency. Example: ```export CATALOG=``` . Or else manual steps has to be followed as discussed below-:
 
 These dependencies need to be passed by using the --jars flag, or, in the case of Dataproc Templates, using the JARS environment variable.
-Some dependencies (jars) must be downloaded from [MVN Repository](https://mvnrepository.com/) and stored your GCS bucket (create one to store the dependencies).
+Some dependencies (jars) must be downloaded from [MVN Repository](https://mvnrepository.com/) and stored your Cloud Storage bucket (create one to store the dependencies).
 
 - **[Apache HBase Spark Connector](https://mvnrepository.com/artifact/org.apache.hbase.connectors.spark/hbase-spark) dependencies (already mounted in Dataproc Serverless, so you refer to them using file://):**
   - file:///usr/lib/spark/external/hbase-spark-protocol-shaded.jar
@@ -67,7 +67,7 @@ Some dependencies (jars) must be downloaded from [MVN Repository](https://mvnrep
 
 ## Arguments
 
-* `hbase.gcs.output.location`: GCS location for output files (format: `gs:///...`)
+* `hbase.gcs.output.location`: Cloud Storage location for output files (format: `gs:///...`)
 * `hbase.gcs.output.format`: Output file format (one of: avro,parquet,csv,json)
 * `hbase.gcs.output.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists)(Defaults to append)
 * `hbase.gcs.catalog.json`: Catalog schema file for Hbase table
diff --git a/python/dataproc_templates/hive/README.md b/python/dataproc_templates/hive/README.md
index 436ba7474..834e07060 100644
--- a/python/dataproc_templates/hive/README.md
+++ b/python/dataproc_templates/hive/README.md
@@ -77,7 +77,7 @@ These properties are responsible for applying some spark sql transformations bef
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
-# Hive To GCS
+# Hive To Cloud Storage
 
 Template for reading data from Hive and writing to a Cloud Storage location. It supports reading from Hive table.
 
@@ -214,5 +214,5 @@ There are two optional properties as well with "Hive to GCS" Template. Please fi
 --templateProperty hive.gcs.temp.view.name='temporary_view_name'
 --templateProperty hive.gcs.sql.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations before loading data into GCS.
+These properties are responsible for applying some spark sql transformations before loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
diff --git a/python/dataproc_templates/jdbc/README.md b/python/dataproc_templates/jdbc/README.md
index f7d8ff4da..c945f0373 100644
--- a/python/dataproc_templates/jdbc/README.md
+++ b/python/dataproc_templates/jdbc/README.md
@@ -11,7 +11,7 @@ Following databases are supported via Spark JDBC by default:
 ## Required JAR files
 
 These templates requires the JDBC jar file to be available in the Dataproc cluster.
-User has to download the required jar file and host it inside a GCS Bucket, so that it could be referred during the execution of code.
+User has to download the required jar file and host it inside a Cloud Storage Bucket, so that it could be referred during the execution of code.
 
 wget command to download JDBC jar file is as follows :-
 
@@ -32,7 +32,7 @@ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre
 wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar
 ```
 
-Once the jar file gets downloaded, please upload the file into a GCS Bucket and export the below variable
+Once the jar file gets downloaded, please upload the file into a Cloud Storage Bucket and export the below variable
 
 ```
 export JARS=
@@ -310,7 +310,7 @@ These properties are responsible for applying some spark sql transformations bef
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
-# 2. JDBC To GCS
+# 2. JDBC To Cloud Storage
 
 Template for reading data from JDBC table and writing into files in Google Cloud Storage. It supports reading partition tabels and supports writing in JSON, CSV, Parquet and Avro formats.
 
@@ -319,7 +319,7 @@ Template for reading data from JDBC table and writing into files in Google Cloud
 * `jdbctogcs.input.driver`: JDBC input driver name
 * `jdbctogcs.input.table`: JDBC input table name
 * `jdbctogcs.input.sql.query`: JDBC input SQL query
-* `jdbctogcs.output.location`: GCS location for output files (format: `gs://BUCKET/...`)
+* `jdbctogcs.output.location`: Cloud Storage location for output files (format: `gs://BUCKET/...`)
 * `jdbctogcs.output.format`: Output file format (one of: avro,parquet,csv,json)
 * `jdbctogcs.input.partitioncolumn` (Optional): JDBC input table partition column name
 * `jdbctogcs.input.lowerbound` (Optional): JDBC input table partition column lower bound which is used to decide the partition stride
@@ -497,7 +497,7 @@ export GCS_STAGING_LOCATION=gs://my-gcp-proj/staging
 export SUBNET=projects/my-gcp-proj/regions/us-central1/subnetworks/default
 export JARS="gs://my-gcp-proj/jars/mysql-connector-java-8.0.29.jar,gs://my-gcp-proj/jars/postgresql-42.2.6.jar,gs://my-gcp-proj/jars/mssql-jdbc-6.4.0.jre8.jar"
 ```
-* MySQL to GCS
+* MySQL to Cloud Storage
 ```
 ./bin/start.sh \
 -- --template=JDBCTOGCS \
@@ -514,7 +514,7 @@ export JARS="gs://my-gcp-proj/jars/mysql-connector-java-8.0.29.jar,gs://my-gcp-p
 --jdbctogcs.output.partitioncolumn="department_id"
 ```
 
-* PostgreSQL to GCS
+* PostgreSQL to Cloud Storage
 ```
 ./bin/start.sh \
 -- --template=JDBCTOGCS \
@@ -531,7 +531,7 @@ export JARS="gs://my-gcp-proj/jars/mysql-connector-java-8.0.29.jar,gs://my-gcp-p
 --jdbctogcs.output.partitioncolumn="department_id"
 ```
 
-* Microsoft SQL Server to GCS
+* Microsoft SQL Server to Cloud Storage
 ```
 ./bin/start.sh \
 -- --template=JDBCTOGCS \
@@ -548,7 +548,7 @@ export JARS="gs://my-gcp-proj/jars/mysql-connector-java-8.0.29.jar,gs://my-gcp-p
 --jdbctogcs.output.partitioncolumn="department_id"
 ```
 
-* Oracle to GCS
+* Oracle to Cloud Storage
 ```
 ./bin/start.sh \
 -- --template=JDBCTOGCS \
@@ -567,13 +567,13 @@ export JARS="gs://my-gcp-proj/jars/mysql-connector-java-8.0.29.jar,gs://my-gcp-p
 --jdbctogcs.output.partitioncolumn="department_id"
 ```
 
-There are two optional properties as well with "JDBC to GCS" Template. Please find below the details :-
+There are two optional properties as well with "JDBC to Cloud Storage" Template. Please find below the details :-
 
 ```
 --templateProperty jdbctogcs.temp.view.name='temporary_view_name'
 --templateProperty jdbctogcs.temp.sql.query='select * from global_temp.temporary_view_name'
 ```
-These properties are responsible for applying some spark sql transformations before loading data into GCS.
+These properties are responsible for applying some spark sql transformations before loading data into Cloud Storage.
 The only thing needs to keep in mind is that, the name of the Spark temporary view and the name of table in the query should match exactly. Otherwise, there would be an error as:- "Table or view not found:"
 
 # 3. JDBC To BigQuery
diff --git a/python/dataproc_templates/mongo/README.md b/python/dataproc_templates/mongo/README.md
index 12c20f047..638681367 100644
--- a/python/dataproc_templates/mongo/README.md
+++ b/python/dataproc_templates/mongo/README.md
@@ -1,4 +1,4 @@
-## Mongo to GCS
+## Mongo to Cloud Storage
 
 Template for exporting a MongoDB Collection to files in Google Cloud Storage. It supports writing JSON, CSV, Parquet and Avro formats.
 
@@ -9,8 +9,8 @@ It uses the [MongoDB Spark Connector](https://www.mongodb.com/products/spark-con
 - `mongo.gcs.input.uri`: MongoDB Connection String as an Input URI (format: `mongodb://host_name:port_no`)
 - `mongo.gcs.input.database`: MongoDB Database Name (format: Database_name)
 - `mongo.gcs.input.collection`: MongoDB Input Collection Name (format: Collection_name)
-- `mongo.gcs.output.format`: GCS Output File Format (one of: avro,parquet,csv,json)
-- `mongo.gcs.output.location`: GCS Location to put Output Files (format: `gs://BUCKET/...`)
+- `mongo.gcs.output.format`: Cloud Storage Output File Format (one of: avro,parquet,csv,json)
+- `mongo.gcs.output.location`: Cloud Storage Location to put Output Files (format: `gs://BUCKET/...`)
 - `mongo.gcs.output.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)
 
 #### Optional Arguments
@@ -155,7 +155,7 @@ This template has been tested with the following versions of the above mentioned
 - `mongo.bq.input.collection`: MongoDB Input Collection Name (format: Collection_name)
 - `mongo.bq.output.dataset`: BigQuery dataset id (format: Dataset_id)
 - `mongo.bq.output.table`: BigQuery table name (format: Table_name)
-- `mongo.bq.temp.bucket.name`: GCS bucket name to store temporary files (format: Bucket_name)
+- `mongo.bq.temp.bucket.name`: Cloud Storage bucket name to store temporary files (format: Bucket_name)
 
 #### Optional Arguments
 
diff --git a/python/dataproc_templates/pubsublite/README.md b/python/dataproc_templates/pubsublite/README.md
index d5ff75514..28257cedb 100644
--- a/python/dataproc_templates/pubsublite/README.md
+++ b/python/dataproc_templates/pubsublite/README.md
@@ -1,4 +1,4 @@
-# Pub/Sub Lite to GCS
+# Pub/Sub Lite to Cloud Storage
 
 Template for reading files from Pub/Sub Lite and writing them to Google Cloud Storage.
 
@@ -6,9 +6,9 @@ Template for reading files from Pub/Sub Lite and writing them to Google Cloud St
 
 * `pubsublite.to.gcs.input.subscription.url`: PubSubLite Input Subscription Url
 * `pubsublite.to.gcs.write.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists)(Defaults to append)
-* `pubsublite.to.gcs.output.location`: GCS Location to put Output Files (format: `gs://BUCKET/...`)
-* `pubsublite.to.gcs.checkpoint.location`: GCS Checkpoint Folder Location
-* `pubsublite.to.gcs.output.format`: GCS Output File Format (one of: avro,parquet,csv,json) (Defaults to json)
+* `pubsublite.to.gcs.output.location`: Cloud Storage Location to put Output Files (format: `gs://BUCKET/...`)
+* `pubsublite.to.gcs.checkpoint.location`: Cloud Storage Checkpoint Folder Location
+* `pubsublite.to.gcs.output.format`: Cloud Storage Output File Format (one of: avro,parquet,csv,json) (Defaults to json)
 * `pubsublite.to.gcs.timeout`: Time for which the subscription will be read (measured in seconds)
 * `pubsublite.to.gcs.processing.time`: Time at which the query will be triggered to process input data (measured in seconds) (format: `"1 second"`)
diff --git a/python/dataproc_templates/redshift/README.md b/python/dataproc_templates/redshift/README.md
index dc99b01a7..c74ce79d6 100644
--- a/python/dataproc_templates/redshift/README.md
+++ b/python/dataproc_templates/redshift/README.md
@@ -1,13 +1,13 @@
-# Redshift To GCS
+# Redshift To Cloud Storage
 
-Template for reading data from Redshift table and writing into files in Google Cloud Storage. It supports reading partition tabels and supports writing in JSON, CSV, Parquet and Avro formats.
+Template for reading data from Redshift table and writing into files in Google Cloud Storage. It supports reading partition tables and supports writing in JSON, CSV, Parquet and Avro formats.
 
 # Prerequisites
 
 ## Required JAR files
 
 These templates requires the jar file to be available in the Dataproc cluster.
-User has to download the required jar file and host it inside a GCS Bucket, so that it could be referred during the execution of code.
+User has to download the required jar file and host it inside a Cloud Storage Bucket, so that it could be referred during the execution of code.
 
 * spark-redshift.jar
 ```
@@ -26,7 +26,7 @@ wget https://repo1.maven.org/maven2/com/amazon/redshift/redshift-jdbc42/2.1.0.9/
 wget https://repo1.maven.org/maven2/com/eclipsesource/minimal-json/minimal-json/0.9.5/minimal-json-0.9.5.jar
 ```
 
-Once the jar file gets downloaded, please upload the file into a GCS Bucket and export the below variable
+Once the jar file gets downloaded, please upload the file into a Cloud Storage Bucket and export the below variable
 
 ```
 export JARS=
@@ -71,7 +71,7 @@ redshifttogcs.input.table="employees"
 * `redshifttogcs.input.url`: Redshift JDBC input URL
 * `redshifttogcs.s3.tempdir`: S3 temporary bucket location
 * `redshifttogcs.input.table`: Redshift input table name
-* `redshifttogcs.output.location`: GCS location for output files (format: `gs://BUCKET/...`)
+* `redshifttogcs.output.location`: Cloud Storage location for output files (format: `gs://BUCKET/...`)
 * `redshifttogcs.output.format`: Output file format (one of: avro,parquet,csv,json)
 * `redshifttogcs.iam.rolearn` : IAM Role with S3 Access
 * `redshifttogcs.s3.accesskey` : AWS Access Key for S3 Access
diff --git a/python/dataproc_templates/s3/README.md b/python/dataproc_templates/s3/README.md
index c7db5e8b0..4fa81bac3 100644
--- a/python/dataproc_templates/s3/README.md
+++ b/python/dataproc_templates/s3/README.md
@@ -12,7 +12,7 @@ It uses the [Spark BigQuery connector](https://cloud.google.com/dataproc-serverl
 * `s3.bq.input.format` : Input file format in Amazon S3 bucket (one of : avro, parquet, csv, json)
 * `s3.bq.output.dataset.name` : BigQuery dataset for the output table
 * `s3.bq.output.table.name` : BigQuery output table name
-* `s3.bq.temp.bucket.name` : Pre existing GCS bucket name where temporary files are staged
+* `s3.bq.temp.bucket.name` : Pre existing Cloud Storage bucket name where temporary files are staged
 * `s3.bq.output.mode` : (Optional) Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)
 
 ## Usage
diff --git a/python/dataproc_templates/snowflake/README.md b/python/dataproc_templates/snowflake/README.md
index 32ce5e4ae..b30e33a60 100644
--- a/python/dataproc_templates/snowflake/README.md
+++ b/python/dataproc_templates/snowflake/README.md
@@ -1,4 +1,4 @@
-## 1. Snowflake To GCS
+## 1. Snowflake To Cloud Storage
 
 Template for reading data from a Snowflake table or custom query and writing to Google Cloud Storage. It supports writing JSON, CSV, Parquet and Avro formats.
 
@@ -143,7 +143,7 @@ optional arguments:
 1. Snowflake Connector for Spark : [Maven Repo Download Link](https://mvnrepository.com/artifact/net.snowflake/spark-snowflake)
 2. Snowflake JDBC Driver : [Maven Repo Download Link](https://mvnrepository.com/artifact/net.snowflake/snowflake-jdbc)
 Please ensure that jdbc driver version is compatible with the snowflake-spark connector version.
-Download the above mentioned jars and place them in a GCS bucket.
+Download the above mentioned jars and place them in a Cloud Storage bucket.
 
 ### Example submission
 ```