Updating GCS to Cloud Storage in Java templates
violetautumn committed Nov 16, 2023
1 parent 12b8563 commit 8bcb678
Showing 11 changed files with 64 additions and 62 deletions.
@@ -1,4 +1,4 @@
## BigQuery To GCS
## BigQuery To Cloud Storage

General Execution:

@@ -1,4 +1,4 @@
## Executing Spanner to GCS template
## Executing Spanner to Cloud Storage template

General Execution:

@@ -30,7 +30,7 @@ Update the `spanner.gcs.input.table.id` property as follows:
```
"spanner.gcs.input.table.id=(select name, age, phone from employee where designation = 'engineer')"
```
There are also two optional properties for the "Spanner to GCS" template, detailed below:
There are also two optional properties for the "Spanner to Cloud Storage" template, detailed below:

```
--templateProperty spanner.gcs.temp.table='temporary_view_name'
@@ -41,7 +41,7 @@ Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly.
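For illustration, a hypothetical pair of these properties where the view name and the table referenced in the query line up (the view name is a placeholder, and `spanner.gcs.temp.query` is assumed to follow the same pattern as the temp-query properties of the other templates):

```
--templateProperty spanner.gcs.temp.table='employee_view' \
--templateProperty spanner.gcs.temp.query='select name, age from global_temp.employee_view'
```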

**NOTE** It is required to surround your custom query with parentheses and the parameter with double quotes.

## Executing Cassandra to GCS Template
## Executing Cassandra to Cloud Storage Template
### General Execution

```
@@ -146,7 +146,7 @@ You can replace ```casscon``` with your catalog name if it is passed.
Make sure that either ```cassandratobq.input.query``` or both ```cassandratobq.input.keyspace``` and ```cassandratobq.input.table``` are provided. Setting all three properties at the same time, or none of them, will throw an error.
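As a hypothetical illustration of the two valid combinations (keyspace, table, and query values are placeholders):

```
# Option 1: provide a full query
--templateProperty cassandratobq.input.query="select * from mykeyspace.mytable"

# Option 2: provide keyspace and table instead
--templateProperty cassandratobq.input.keyspace=mykeyspace \
--templateProperty cassandratobq.input.table=mytable
```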


## Executing Redshift to GCS template
## Executing Redshift to Cloud Storage template

General Execution:

@@ -171,16 +171,16 @@ bin/start.sh \
--templateProperty redshift.gcs.output.mode=<Output-GCS-Save-mode>
```

There are also two optional properties for the "Redshift to GCS" template, detailed below:
There are also two optional properties for the "Redshift to Cloud Storage" template, detailed below:

```
--templateProperty redshift.gcs.temp.table='temporary_view_name'
--templateProperty redshift.gcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations while loading data into GCS.
These properties are responsible for applying Spark SQL transformations while loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".

## Executing Mongo to GCS template
## Executing Mongo to Cloud Storage template

Template for exporting a MongoDB Collection to files in Google Cloud Storage. It supports writing JSON, CSV, Parquet and Avro formats.

@@ -212,8 +212,8 @@ Arguments:
* `templateProperty mongo.gcs.input.uri`: MongoDB Connection String as an Input URI (format: `mongodb://host_name:port_no`)
* `templateProperty mongo.gcs.input.database`: MongoDB Database Name (format: Database_name)
* `templateProperty mongo.gcs.input.collection`: MongoDB Input Collection Name (format: Collection_name)
* `templateProperty mongo.gcs.output.format`: GCS Output File Format (one of: avro,parquet,csv,json)
* `templateProperty mongo.gcs.output.location`: GCS Location to put Output Files (format: `gs://BUCKET/...`)
* `templateProperty mongo.gcs.output.format`: Cloud Storage Output File Format (one of: avro,parquet,csv,json)
* `templateProperty mongo.gcs.output.location`: Cloud Storage Location to put Output Files (format: `gs://BUCKET/...`)
* `templateProperty mongo.gcs.output.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)

Example Submission:
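A minimal sketch with placeholder connection values, showing only the template properties listed above; the launcher arguments that select the template are assumed to be the same as in the general execution and are omitted here:

```
bin/start.sh \
--templateProperty mongo.gcs.input.uri=mongodb://10.0.0.57:27017 \
--templateProperty mongo.gcs.input.database=demo \
--templateProperty mongo.gcs.input.collection=analysis \
--templateProperty mongo.gcs.output.format=avro \
--templateProperty mongo.gcs.output.location=gs://my-bucket/mongo/output \
--templateProperty mongo.gcs.output.mode=overwrite
```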
@@ -1,10 +1,10 @@
## Dataplex GCS to BigQuery
## Dataplex Cloud Storage to BigQuery

This template will incrementally move data from Dataplex GCS tables to BigQuery.
It will identify new partitions in Dataplex GCS and load them to BigQuery.
This template will incrementally move data from Dataplex Cloud Storage tables to BigQuery.
It will identify new partitions in Dataplex Cloud Storage and load them to BigQuery.

Note: if the Dataplex GCS table has no partitions, the whole table will be read
from GCS and the target BQ table will be overwritten.
Note: if the Dataplex Cloud Storage table has no partitions, the whole table will be read
from Cloud Storage and the target BQ table will be overwritten.

### General Execution:

@@ -46,9 +46,9 @@ gcloud dataplex tasks create <task-id> \
SQL file should be located

`dataplex.gcs.bq.target.dataset` name of the target BigQuery dataset where the
Dataplex GCS asset will be migrated to
Dataplex Cloud Storage asset will be migrated to

`gcs.bigquery.temp.bucket.name` the GCS bucket that temporarily holds the data
`gcs.bigquery.temp.bucket.name` the Cloud Storage bucket that temporarily holds the data
before it is loaded to BigQuery

`dataplex.gcs.bq.save.mode` specifies how to handle existing data in BigQuery
@@ -71,7 +71,7 @@ over any other property or argument specifying target output of the data.


### Arguments
`--dataplexEntity` Dataplex GCS table to load in BigQuery \
`--dataplexEntity` Dataplex Cloud Storage table to load in BigQuery \
Example: `--dataplexEntityList "projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id_1}"`

`--partitionField` if field is specified together with `partitionType`, the
@@ -87,7 +87,7 @@ argument is not specified the name of the entity will be used as table name

Optionally, a custom SQL query can be provided to filter the data that will be copied
to BigQuery. \
The template will read the custom SQL string from a GCS file.
The template will read the custom SQL string from a Cloud Storage file.

The path to this file must be provided with the option `--customSqlGcsPath`.
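As a hypothetical illustration, the SQL string can be staged in Cloud Storage with `gsutil` and referenced through that option (bucket, path, and query are placeholders):

```
echo "select * from my_entity where ingest_date > '2023-01-01'" > custom.sql
gsutil cp custom.sql gs://my-bucket/dataplex/custom.sql

# then pass the path to the task:
#   --customSqlGcsPath "gs://my-bucket/dataplex/custom.sql"
```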

@@ -1,4 +1,4 @@
## 1. GCS To BigQuery
## 1. Cloud Storage To BigQuery

General Execution:

@@ -19,7 +19,7 @@ bin/start.sh \
--templateProperty gcs.bigquery.temp.bucket.name=<bigquery temp bucket name>
```

There are also two optional properties for the "GCS to BigQuery" template, detailed below:
There are also two optional properties for the "Cloud Storage to BigQuery" template, detailed below:

```
--templateProperty gcs.bigquery.temp.table='temporary_view_name'
@@ -28,7 +28,7 @@ There are also two optional properties for the "GCS to BigQuery" template, detailed below:
These properties are responsible for applying Spark SQL transformations while loading data into BigQuery.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".

## 2. GCS To BigTable
## 2. Cloud Storage To BigTable

General Execution:

@@ -63,7 +63,7 @@ bin/start.sh \
(Please note that the table in Bigtable should exist with the above-mentioned column family before executing the template.)
```

## 3. GCS to Spanner
## 3. Cloud Storage to Spanner
```
GCP_PROJECT=<gcp-project-id> \
REGION=<region> \
@@ -82,7 +82,7 @@ bin/start.sh \
```


## 4. GCS to JDBC
## 4. Cloud Storage to JDBC

```
Please download the JDBC driver for the respective database and copy it to a GCS bucket location.
@@ -129,7 +129,7 @@ bin/start.sh \
```

## 5. GCS to GCS
## 5. Cloud Storage to Cloud Storage

```
GCP_PROJECT=<gcp-project-id> \
@@ -161,9 +161,9 @@ bin/start.sh \
```

## 6. GCS To Mongo:
## 6. Cloud Storage To Mongo:

Download the following MongoDB connectors and copy them to a GCS bucket location:
Download the following MongoDB connectors and copy them to a Cloud Storage bucket location:
* [MongoDb Spark Connector](https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector)
* [MongoDb Java Driver](https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver)
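A hypothetical staging step, assuming the jars have already been downloaded locally; the bucket path, jar names, and versions are placeholders, and multiple jars are assumed to be passed as a comma-separated list:

```
gsutil cp mongo-spark-connector_<scala-version>-<version>.jar mongo-java-driver-<version>.jar gs://my-bucket/jars/
export JARS="gs://my-bucket/jars/mongo-spark-connector_<scala-version>-<version>.jar,gs://my-bucket/jars/mongo-java-driver-<version>.jar"
```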

@@ -63,7 +63,7 @@ export GCS_STAGING_LOCATION=<gcs path>
export SUBNET=projects/<project id>/regions/<region name>/subnetworks/<subnetwork name>
```

Create a config file and upload it to GCS:
Create a config file and upload it to Cloud Storage:

```
gsutil cp config.yaml gs://bucket/path/config.yaml
@@ -101,7 +101,7 @@ bin/start.sh \

# Example config files:

GCS to BigQuery config.yaml
Cloud Storage to BigQuery config.yaml

```yaml
input:
@@ -116,7 +116,7 @@ output:
mode: Overwrite
```
BigQuery to GCS config.yaml
BigQuery to Cloud Storage config.yaml
```yaml
input:
@@ -131,7 +131,7 @@ output:
mode: Overwrite
```
GCS Avro to GCS CSV
Cloud Storage Avro to Cloud Storage CSV
```yaml
input:
@@ -1,4 +1,4 @@
## 1. HBase To GCS
## 1. HBase To Cloud Storage
### Required JAR files

Some HBase dependencies are required to be passed when submitting the job. These dependencies are set automatically by the script when the CATALOG environment variable is set for the HBase table configuration. If not,
@@ -33,7 +33,7 @@ These properties are responsible for applying Spark SQL transformations before loading the data.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".


## 2. Hive To GCS
## 2. Hive To Cloud Storage
General Execution:

```
@@ -53,25 +53,25 @@ bin/start.sh \
### Configurable Parameters
Update the following properties in the [template.properties](../../../../../../../resources/template.properties) file:
```
## GCS output path.
## Cloud Storage output path.
hive.gcs.output.path=<gcs-output-path>
## Name of hive input table.
hive.input.table=<hive-input-table>
## Hive input db name.
hive.input.db=<hive-output-db>
## Optional - GCS output format. avro/csv/parquet/json/orc, defaults to avro.
## Optional - Cloud Storage output format. avro/csv/parquet/json/orc, defaults to avro.
hive.gcs.output.format=avro
## Optional, column to partition hive data.
hive.partition.col=<hive-partition-col>
## Optional: Write mode to gcs append/overwrite/errorifexists/ignore, defaults to overwrite
## Optional: Write mode to Cloud Storage append/overwrite/errorifexists/ignore, defaults to overwrite
hive.gcs.save.mode=overwrite
```
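For instance, a hypothetical filled-in fragment of template.properties (bucket, table, and database names are placeholders):

```
hive.gcs.output.path=gs://my-bucket/hive-export/
hive.input.table=employees
hive.input.db=analytics
hive.gcs.output.format=parquet
hive.gcs.save.mode=overwrite
```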

There are also two optional properties for the "Hive to GCS" template, detailed below:
There are also two optional properties for the "Hive to Cloud Storage" template, detailed below:

```
--templateProperty hive.gcs.temp.table='temporary_view_name'
--templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
@@ -19,7 +19,7 @@ The following databases are supported via Spark JDBC by default:
## Required JAR files

These templates require the JDBC jar file to be available in the Dataproc cluster.
The user has to download the required jar file and host it in a GCS bucket so that it can be referenced during execution.
The user has to download the required jar file and host it in a Cloud Storage bucket so that it can be referenced during execution.

The wget command to download the JDBC jar file is as follows:

@@ -40,7 +40,7 @@ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre
wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar
```

Once the jar file is downloaded, please upload it to a GCS bucket and export the variable below:
Once the jar file is downloaded, please upload it to a Cloud Storage bucket and export the variable below:

```
export JARS=<gcs-bucket-location-containing-jar-file>
@@ -116,7 +116,7 @@ bin/start.sh \
--templateProperty jdbctobq.sql.upperBound=<optional-partition-end-value> \
--templateProperty jdbctobq.sql.numPartitions=<optional-partition--number> \
--templateProperty jdbctobq.write.mode=<Append|Overwrite|ErrorIfExists|Ignore> \
--templateProperty jdbctobq.temp.gcs.bucket=<temp gcs bucket name>
--templateProperty jdbctobq.temp.gcs.bucket=<temp cloud storage bucket name>
```
**Note**: The following is an example JDBC URL for a MySQL database:
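One commonly used form, shown here as a sketch with host, port, database, and credentials as placeholders:

```
jdbc:mysql://<hostname>:<port>/<dbname>?user=<username>&password=<password>
```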
@@ -152,10 +152,12 @@ jdbctobq.jdbc.sessioninitstatement="BEGIN DBMS_APPLICATION_INFO.SET_MODULE('Data
***
## 2. JDBC To GCS
## 2. JDBC To Cloud Storage
Note - Add the dependency jars specific to the database to the JARS variable.
Example: export JARS=gs://<bucket_name>/mysql-connector-java.jar
General Execution
@@ -232,13 +234,13 @@ Example execution:
--templateProperty jdbctogcs.write.mode=OVERWRITE \
--templateProperty 'jdbctogcs.sql=SELECT * FROM MyCloudSQLDB.table1'
There are also two optional properties for the "JDBC to GCS" template, detailed below:
There are also two optional properties for the "JDBC to Cloud Storage" template, detailed below:
```
--templateProperty jdbctogcs.temp.table='temporary_view_name'
--templateProperty jdbctogcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
***
@@ -335,7 +337,7 @@ There are also two optional properties for the "JDBC to SPANNER" template, detailed below:
--templateProperty jdbctospanner.temp.table='temporary_view_name'
--templateProperty jdbctospanner.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
# 4. JDBC To JDBC
@@ -40,11 +40,11 @@ kafka.bq.dataset=<output bigquery dataset>
# BigQuery output table
kafka.bq.table=<output bigquery table>
# GCS bucket name, for storing temporary files
kafka.bq.temp.gcs.bucket=<gcs bucket name>
# Cloud Storage bucket name, for storing temporary files
kafka.bq.temp.gcs.bucket=<cloud storage bucket name>
# GCS location for maintaining checkpoint
kafka.bq.checkpoint.location=<gcs bucket location maintains checkpoint>
# Cloud Storage location for maintaining checkpoint
kafka.bq.checkpoint.location=<cloud storage bucket location maintains checkpoint>
# Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
kafka.bq.starting.offset=<kafka-starting-offset>
@@ -128,7 +128,7 @@ bin/start.sh \
```
## 2. Kafka To GCS
## 2. Kafka To Cloud Storage
General Execution:
@@ -208,7 +208,7 @@ kafka.pubsub.input.topic=
# PubSub topic
kafka.pubsub.output.topic=

# GCS location for maintaining checkpoint
# Cloud Storage location for maintaining checkpoint
kafka.pubsub.checkpoint.location=

# Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
@@ -42,7 +42,7 @@ pubsub.bq.output.table=<bq output table>
## Number of records to be written per message to BigQuery
pubsub.bq.batch.size=1000
```
## 2. Pub/Sub To GCS
## 2. Pub/Sub To Cloud Storage

General Execution:

@@ -70,22 +70,22 @@ The following properties are available on the command line or in the [template.properties](../../../../../../../resources/template.properties) file:
Following properties are available in commandline or [template.properties](../../../../../../../resources/template.properties) file:

```
# PubSub to GCS
# PubSub to Cloud Storage
## Project that contains the input Pub/Sub subscription to be read
pubsubtogcs.input.project.id=yadavaja-sandbox
## PubSub subscription name
pubsubtogcs.input.subscription=
## Stream timeout, for how long the subscription will be read
pubsubtogcs.timeout.ms=60000
## Streaming duration, how often will writes to GCS be triggered
## Streaming duration, how often will writes to Cloud Storage be triggered
pubsubtogcs.streaming.duration.seconds=15
## Number of streams that will read from Pub/Sub subscription in parallel
pubsubtogcs.total.receivers=5
## GCS bucket URL
## Cloud Storage bucket URL
pubsubtogcs.gcs.bucket.name=
## Number of records to be written per message to GCS
## Number of records to be written per message to Cloud Storage
pubsubtogcs.batch.size=1000
## PubSub to GCS supported formats are: AVRO, JSON
## PubSub to Cloud Storage supported formats are: AVRO, JSON
pubsubtogcs.gcs.output.data.format=
```
## 3. Pub/Sub To BigTable