Updating GCS to Cloud Storage in Java templates
violetautumn committed Nov 16, 2023
1 parent 12b8563 commit 8bcb678
Showing 11 changed files with 64 additions and 62 deletions.
@@ -1,4 +1,4 @@
## BigQuery To GCS
## BigQuery To Cloud Storage

General Execution:

@@ -1,4 +1,4 @@
## Executing Spanner to GCS template
## Executing Spanner to Cloud Storage template

General Execution:

@@ -30,7 +30,7 @@ Update the `spanner.gcs.input.table.id` property as follows:
```
"spanner.gcs.input.table.id=(select name, age, phone from employee where designation = 'engineer')"
```
There are also two optional properties for the "Spanner to GCS" template, detailed below:
There are also two optional properties for the "Spanner to Cloud Storage" template, detailed below:

```
--templateProperty spanner.gcs.temp.table='temporary_view_name'
@@ -41,7 +41,7 @@ Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly.
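For illustration, a hypothetical pair of these properties where the view name and the table referenced in the query line up (the view name is a placeholder, and `spanner.gcs.temp.query` is assumed to follow the same pattern as the temp-query properties of the other templates):

```
--templateProperty spanner.gcs.temp.table='employee_view' \
--templateProperty spanner.gcs.temp.query='select name, age from global_temp.employee_view'
```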

**NOTE** It is required to surround your custom query with parentheses and the parameter with double quotes.

## Executing Cassandra to GCS Template
## Executing Cassandra to Cloud Storage Template
### General Execution

```
@@ -146,7 +146,7 @@ You can replace ```casscon``` with your catalog name if it is passed.
Make sure that either ```cassandratobq.input.query``` or both ```cassandratobq.input.keyspace``` and ```cassandratobq.input.table``` are provided. Setting all three properties at the same time, or none of them, will throw an error.
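As a hypothetical illustration of the two valid combinations (keyspace, table, and query values are placeholders):

```
# Option 1: provide a full query
--templateProperty cassandratobq.input.query="select * from mykeyspace.mytable"

# Option 2: provide keyspace and table instead
--templateProperty cassandratobq.input.keyspace=mykeyspace \
--templateProperty cassandratobq.input.table=mytable
```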


## Executing Redshift to GCS template
## Executing Redshift to Cloud Storage template

General Execution:

@@ -171,16 +171,16 @@ bin/start.sh \
--templateProperty redshift.gcs.output.mode=<Output-GCS-Save-mode>
```

There are also two optional properties for the "Redshift to GCS" template, detailed below:
There are also two optional properties for the "Redshift to Cloud Storage" template, detailed below:

```
--templateProperty redshift.gcs.temp.table='temporary_view_name'
--templateProperty redshift.gcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations while loading data into GCS.
These properties are responsible for applying Spark SQL transformations while loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".

## Executing Mongo to GCS template
## Executing Mongo to Cloud Storage template

Template for exporting a MongoDB Collection to files in Google Cloud Storage. It supports writing JSON, CSV, Parquet and Avro formats.

@@ -212,8 +212,8 @@ Arguments:
* `templateProperty mongo.gcs.input.uri`: MongoDB Connection String as an Input URI (format: `mongodb://host_name:port_no`)
* `templateProperty mongo.gcs.input.database`: MongoDB Database Name (format: Database_name)
* `templateProperty mongo.gcs.input.collection`: MongoDB Input Collection Name (format: Collection_name)
* `templateProperty mongo.gcs.output.format`: GCS Output File Format (one of: avro,parquet,csv,json)
* `templateProperty mongo.gcs.output.location`: GCS Location to put Output Files (format: `gs://BUCKET/...`)
* `templateProperty mongo.gcs.output.format`: Cloud Storage Output File Format (one of: avro,parquet,csv,json)
* `templateProperty mongo.gcs.output.location`: Cloud Storage Location to put Output Files (format: `gs://BUCKET/...`)
* `templateProperty mongo.gcs.output.mode`: Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)

Example Submission:
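A minimal sketch with placeholder connection values, showing only the template properties listed above; the launcher arguments that select the template are assumed to be the same as in the general execution and are omitted here:

```
bin/start.sh \
--templateProperty mongo.gcs.input.uri=mongodb://10.0.0.57:27017 \
--templateProperty mongo.gcs.input.database=demo \
--templateProperty mongo.gcs.input.collection=analysis \
--templateProperty mongo.gcs.output.format=avro \
--templateProperty mongo.gcs.output.location=gs://my-bucket/mongo/output \
--templateProperty mongo.gcs.output.mode=overwrite
```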
@@ -1,10 +1,10 @@
## Dataplex GCS to BigQuery
## Dataplex Cloud Storage to BigQuery

This template will incrementally move data from Dataplex GCS tables to BigQuery.
It will identify new partitions in Dataplex GCS and load them to BigQuery.
This template will incrementally move data from Dataplex Cloud Storage tables to BigQuery.
It will identify new partitions in Dataplex Cloud Storage and load them to BigQuery.

Note: if the Dataplex GCS table has no partitions, the whole table will be read
from GCS and the target BQ table will be overwritten.
Note: if the Dataplex Cloud Storage table has no partitions, the whole table will be read
from Cloud Storage and the target BQ table will be overwritten.

### General Execution:

@@ -46,9 +46,9 @@ gcloud dataplex tasks create <task-id> \
SQL file should be located

`dataplex.gcs.bq.target.dataset` name of the target BigQuery dataset where the
Dataplex GCS asset will be migrated to
Dataplex Cloud Storage asset will be migrated to

`gcs.bigquery.temp.bucket.name` the GCS bucket that temporarily holds the data
`gcs.bigquery.temp.bucket.name` the Cloud Storage bucket that temporarily holds the data
before it is loaded to BigQuery

`dataplex.gcs.bq.save.mode` specifies how to handle existing data in BigQuery
@@ -71,7 +71,7 @@ over any other property or argument specifying target output of the data.


### Arguments
`--dataplexEntity` Dataplex GCS table to load in BigQuery \
`--dataplexEntity` Dataplex Cloud Storage table to load in BigQuery \
Example: `--dataplexEntityList "projects/{project_number}/locations/{location_id}/lakes/{lake_id}/zones/{zone_id}/entities/{entity_id_1}"`

`--partitionField` if field is specified together with `partitionType`, the
@@ -87,7 +87,7 @@ argument is not specified the name of the entity will be used as table name

Optionally, a custom SQL query can be provided to filter the data that will be copied
to BigQuery. \
The template will read the custom SQL string from a GCS file.
The template will read the custom SQL string from a Cloud Storage file.

The path to this file must be provided with the option `--customSqlGcsPath`.
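As a hypothetical illustration, the SQL string can be staged in Cloud Storage with `gsutil` and referenced through that option (bucket, path, and query are placeholders):

```
echo "select * from my_entity where ingest_date > '2023-01-01'" > custom.sql
gsutil cp custom.sql gs://my-bucket/dataplex/custom.sql

# then pass the path to the task:
#   --customSqlGcsPath "gs://my-bucket/dataplex/custom.sql"
```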

@@ -1,4 +1,4 @@
## 1. GCS To BigQuery
## 1. Cloud Storage To BigQuery

General Execution:

@@ -19,7 +19,7 @@ bin/start.sh \
--templateProperty gcs.bigquery.temp.bucket.name=<bigquery temp bucket name>
```

There are also two optional properties for the "GCS to BigQuery" template, detailed below:
There are also two optional properties for the "Cloud Storage to BigQuery" template, detailed below:

```
--templateProperty gcs.bigquery.temp.table='temporary_view_name'
@@ -28,7 +28,7 @@ There are also two optional properties for the "GCS to BigQuery" template, detailed below:
These properties are responsible for applying Spark SQL transformations while loading data into BigQuery.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".

## 2. GCS To BigTable
## 2. Cloud Storage To BigTable

General Execution:

@@ -63,7 +63,7 @@ bin/start.sh \
(Please note that the table in Bigtable should exist with the above-mentioned column family before executing the template.)
```

## 3. GCS to Spanner
## 3. Cloud Storage to Spanner
```
GCP_PROJECT=<gcp-project-id> \
REGION=<region> \
@@ -82,7 +82,7 @@ bin/start.sh \
```


## 4. GCS to JDBC
## 4. Cloud Storage to JDBC

```
Please download the JDBC driver for the respective database and copy it to a GCS bucket location.
@@ -129,7 +129,7 @@ bin/start.sh \
```

## 5. GCS to GCS
## 5. Cloud Storage to Cloud Storage

```
GCP_PROJECT=<gcp-project-id> \
@@ -161,9 +161,9 @@ bin/start.sh \
```

## 6. GCS To Mongo:
## 6. Cloud Storage To Mongo:

Download the following MongoDB connectors and copy them to a GCS bucket location:
Download the following MongoDB connectors and copy them to a Cloud Storage bucket location:
* [MongoDb Spark Connector](https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector)
* [MongoDb Java Driver](https://mvnrepository.com/artifact/org.mongodb/mongo-java-driver)
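A hypothetical staging step, assuming the jars have already been downloaded locally; the bucket path, jar names, and versions are placeholders, and multiple jars are assumed to be passed as a comma-separated list:

```
gsutil cp mongo-spark-connector_<scala-version>-<version>.jar mongo-java-driver-<version>.jar gs://my-bucket/jars/
export JARS="gs://my-bucket/jars/mongo-spark-connector_<scala-version>-<version>.jar,gs://my-bucket/jars/mongo-java-driver-<version>.jar"
```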

@@ -63,7 +63,7 @@ export GCS_STAGING_LOCATION=<gcs path>
export SUBNET=projects/<project id>/regions/<region name>/subnetworks/<subnetwork name>
```

Create a config file and upload it to GCS:
Create a config file and upload it to Cloud Storage:

```
gsutil cp config.yaml gs://bucket/path/config.yaml
@@ -101,7 +101,7 @@ bin/start.sh \

# Example config files:

GCS to BigQuery config.yaml
Cloud Storage to BigQuery config.yaml

```yaml
input:
@@ -116,7 +116,7 @@ output:
mode: Overwrite
```
BigQuery to GCS config.yaml
BigQuery to Cloud Storage config.yaml
```yaml
input:
@@ -131,7 +131,7 @@ output:
mode: Overwrite
```
GCS Avro to GCS CSV
Cloud Storage Avro to Cloud Storage CSV
```yaml
input:
@@ -1,4 +1,4 @@
## 1. HBase To GCS
## 1. HBase To Cloud Storage
### Required JAR files

Some HBase dependencies are required to be passed when submitting the job. These dependencies are set automatically by the script when the CATALOG environment variable is set for the HBase table configuration. If not,
@@ -33,7 +33,7 @@ These properties are responsible for applying Spark SQL transformations before loading the data.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".


## 2. Hive To GCS
## 2. Hive To Cloud Storage
General Execution:

```
@@ -53,25 +53,25 @@ bin/start.sh \
### Configurable Parameters
Update the following properties in the [template.properties](../../../../../../../resources/template.properties) file:
```
## GCS output path.
## Cloud Storage output path.
hive.gcs.output.path=<gcs-output-path>
## Name of hive input table.
hive.input.table=<hive-input-table>
## Hive input db name.
hive.input.db=<hive-output-db>
## Optional - GCS output format. avro/csv/parquet/json/orc, defaults to avro.
## Optional - Cloud Storage output format. avro/csv/parquet/json/orc, defaults to avro.
hive.gcs.output.format=avro
## Optional, column to partition hive data.
hive.partition.col=<hive-partition-col>
## Optional: Write mode to gcs append/overwrite/errorifexists/ignore, defaults to overwrite
## Optional: Write mode to Cloud Storage append/overwrite/errorifexists/ignore, defaults to overwrite
hive.gcs.save.mode=overwrite
```
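For instance, a hypothetical filled-in fragment of template.properties (bucket, table, and database names are placeholders):

```
hive.gcs.output.path=gs://my-bucket/hive-export/
hive.input.table=employees
hive.input.db=analytics
hive.gcs.output.format=parquet
hive.gcs.save.mode=overwrite
```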

There are also two optional properties for the "Hive to GCS" template, detailed below:
There are also two optional properties for the "Hive to Cloud Storage" template, detailed below:

```
--templateProperty hive.gcs.temp.table='temporary_view_name'
--templateProperty hive.gcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
@@ -19,7 +19,7 @@ The following databases are supported via Spark JDBC by default:
## Required JAR files

These templates require the JDBC jar file to be available in the Dataproc cluster.
The user has to download the required jar file and host it in a GCS bucket so that it can be referenced during execution.
The user has to download the required jar file and host it in a Cloud Storage bucket so that it can be referenced during execution.

The wget command to download the JDBC jar file is as follows:

@@ -40,7 +40,7 @@ wget https://repo1.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre
wget https://repo1.maven.org/maven2/com/oracle/database/jdbc/ojdbc8/21.7.0.0/ojdbc8-21.7.0.0.jar
```

Once the jar file is downloaded, please upload it to a GCS bucket and export the variable below:
Once the jar file is downloaded, please upload it to a Cloud Storage bucket and export the variable below:

```
export JARS=<gcs-bucket-location-containing-jar-file>
@@ -116,7 +116,7 @@ bin/start.sh \
--templateProperty jdbctobq.sql.upperBound=<optional-partition-end-value> \
--templateProperty jdbctobq.sql.numPartitions=<optional-partition--number> \
--templateProperty jdbctobq.write.mode=<Append|Overwrite|ErrorIfExists|Ignore> \
--templateProperty jdbctobq.temp.gcs.bucket=<temp gcs bucket name>
--templateProperty jdbctobq.temp.gcs.bucket=<temp cloud storage bucket name>
```
**Note**: The following is an example JDBC URL for a MySQL database:
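One commonly used form, shown here as a sketch with host, port, database, and credentials as placeholders:

```
jdbc:mysql://<hostname>:<port>/<dbname>?user=<username>&password=<password>
```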
@@ -152,10 +152,12 @@ jdbctobq.jdbc.sessioninitstatement="BEGIN DBMS_APPLICATION_INFO.SET_MODULE('Data
***
## 2. JDBC To GCS
## 2. JDBC To Cloud Storage
Note - Add the dependency jars specific to the database to the JARS variable.
Example: export JARS=gs://<bucket_name>/mysql-connector-java.jar
General Execution
@@ -232,13 +234,13 @@ Example execution:
--templateProperty jdbctogcs.write.mode=OVERWRITE \
--templateProperty 'jdbctogcs.sql=SELECT * FROM MyCloudSQLDB.table1'
There are also two optional properties for the "JDBC to GCS" template, detailed below:
There are also two optional properties for the "JDBC to Cloud Storage" template, detailed below:
```
--templateProperty jdbctogcs.temp.table='temporary_view_name'
--templateProperty jdbctogcs.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
***
@@ -335,7 +337,7 @@ There are also two optional properties for the "JDBC to SPANNER" template, detailed below:
--templateProperty jdbctospanner.temp.table='temporary_view_name'
--templateProperty jdbctospanner.temp.query='select * from global_temp.temporary_view_name'
```
These properties are responsible for applying Spark SQL transformations before loading data into GCS.
These properties are responsible for applying Spark SQL transformations before loading data into Cloud Storage.
Keep in mind that the name of the Spark temporary view and the name of the table in the query must match exactly; otherwise, you will see an error such as "Table or view not found:".
# 4. JDBC To JDBC
@@ -40,11 +40,11 @@ kafka.bq.dataset=<output bigquery dataset>
# BigQuery output table
kafka.bq.table=<output bigquery table>
# GCS bucket name, for storing temporary files
kafka.bq.temp.gcs.bucket=<gcs bucket name>
# Cloud Storage bucket name, for storing temporary files
kafka.bq.temp.gcs.bucket=<cloud storage bucket name>
# GCS location for maintaining checkpoint
kafka.bq.checkpoint.location=<gcs bucket location maintains checkpoint>
# Cloud Storage location for maintaining checkpoint
kafka.bq.checkpoint.location=<cloud storage bucket location maintains checkpoint>
# Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
kafka.bq.starting.offset=<kafka-starting-offset>
@@ -128,7 +128,7 @@ bin/start.sh \
```
## 2. Kafka To GCS
## 2. Kafka To Cloud Storage
General Execution:
@@ -208,7 +208,7 @@ kafka.pubsub.input.topic=
# PubSub topic
kafka.pubsub.output.topic=

# GCS location for maintaining checkpoint
# Cloud Storage location for maintaining checkpoint
kafka.pubsub.checkpoint.location=

# Offset to start reading from. Accepted values: "earliest", "latest" (streaming only), or json string """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}} """
@@ -42,7 +42,7 @@ pubsub.bq.output.table=<bq output table>
## Number of records to be written per message to BigQuery
pubsub.bq.batch.size=1000
```
## 2. Pub/Sub To GCS
## 2. Pub/Sub To Cloud Storage

General Execution:

@@ -70,22 +70,22 @@ The following properties are available on the command line or in the [template.properties](../../../../../../../resources/template.properties) file:
Following properties are available in commandline or [template.properties](../../../../../../../resources/template.properties) file:

```
# PubSub to GCS
# PubSub to Cloud Storage
## Project that contains the input Pub/Sub subscription to be read
pubsubtogcs.input.project.id=yadavaja-sandbox
## PubSub subscription name
pubsubtogcs.input.subscription=
## Stream timeout, for how long the subscription will be read
pubsubtogcs.timeout.ms=60000
## Streaming duration, how often will writes to GCS be triggered
## Streaming duration, how often will writes to Cloud Storage be triggered
pubsubtogcs.streaming.duration.seconds=15
## Number of streams that will read from Pub/Sub subscription in parallel
pubsubtogcs.total.receivers=5
## GCS bucket URL
## Cloud Storage bucket URL
pubsubtogcs.gcs.bucket.name=
## Number of records to be written per message to GCS
## Number of records to be written per message to Cloud Storage
pubsubtogcs.batch.size=1000
## PubSub to GCS supported formats are: AVRO, JSON
## PubSub to Cloud Storage supported formats are: AVRO, JSON
pubsubtogcs.gcs.output.data.format=
```
## 3. Pub/Sub To BigTable