Skip to content

vikasrathee-cs/dataproc-jdbc-connector

 
 

Repository files navigation

JDBC Connector for Apache Hive on Cloud Dataproc

DataprocDriver provides a secure, easily accessible way to connect to Apache Hive on Dataproc cluster from everywhere via Java Database Connection through Component Gateway.

Supported Java Version: 1.8

Before you begin

In order to use this library, you first need to go through the following steps:

  1. Set up a Cloud project
  2. Setup Authentication
    Note that we recommend using service account login rather than end user login.
  3. Make sure Component Gateway is enabled for the cluster(s) you are trying to connect to within the project
  4. Make sure Hive is running in HTTP mode for the cluster(s)

For Step 4 above, we have provided an example initialization action file at gs://hive-http-mode-init-action/hive-http-config.sh that would configure Hive to be running in HTTP mode during cluster creation.

How to use Dataproc Driver

  1. Clone this repo

    git clone https://github.com/GoogleCloudDataproc/dataproc-jdbc-connector.git
    cd dataproc-jdbc-connector
  2. Build Dataproc Driver JAR

    ./gradlew -p jdbc-driver shadowJar

    Note that this step might take around 45 minutes.
    Compiled Dataproc Driver JAR will be at dataproc-jdbc-connector/jdbc-driver/build/libs/jdbc-driver-1.0-SNAPSHOT-all.jar.

  3. Build example-client JAR

    ./gradlew -p example-client shadowJar

Connection URL format

  • Dataproc Driver for Hive accepts JDBC URL string with prefix jdbc:dataproc://hive/

  • Client can pass in database name or leave it as empty to use default database
    jdbc:dataproc://hive/; or jdbc:dataproc://hive/dbName;

  • projectId and region are required parameters. To find your project and region, refer back to Set up a Cloud project in the Before you begin section above
    jdbc:dataproc://hive/;projectId={pid};region={region}

  • Cluster specific parameters: clusterName or clusterPoolLabel
    Client can set parameters to pick a particular cluster by setting either one of clusterName or clusterPoolLabel
    Note that clusterName has a higher priority than clusterPoolLabel, if you pass in both parameters, the DataprocDriver will only look for the cluster by name.

    • clusterName is an optional parameter that allows client to specify the name of the cluster to connect to
      jdbc:dataproc://hive/;projectId={pid};region={region};clusterName={cluster-name}
    • clusterPoolLabel is an optional parameter that supports submitting to a cluster pool
      Client can pass in cluster pool labels and the Dataproc Driver will pick a healthy cluster (cluster with status.state = ACTIVE) within the pool to connect to. Please do not put status.state as one of the cluster pool label.

    Labels can be specified in the format clusterPoolLabel=label1=value1[:label2=value2]

    Example:

    jdbc:dataproc://hive/;projectId={pid};region={region};clusteroPoolLabel=com=google:team=dataproc`
  • DataprocDriver also accepts other semicolon separated list of session variables, Hive configuration variables or Hive variables that Hive supports

    jdbc:dataproc://hive/{dbName};projectId={pid};region={region};clusterName={name};sess_var_list?hive_conf_list#hive_var_list

Connecting to Dataproc through Component Gateway

DataprocDriver uses Google OAuth 2.0 APIs for authentication and authorization.

To connect to Dataproc cluster through Component Gateway, the Dataproc JDBC Driver will include an authentication token. For security reasons, it puts the token in the Proxy-Authorization:Bearer header.

To get the access token set in Before you begin step 2 section above, DataprocDriver will use the Application Default Credentials.

Use with Beeline

# update the HADOOP_CLASSPATH to include the Dataproc JDBC Driver JARs
export HADOOP_CLASSPATH=`hadoop classpath`:{path-to-driver-jar}jdbc-driver-1.0-SNAPSHOT-all.jar

# tell beeline the class name for the driver using “-d” option
beeline -d "com.google.cloud.dataproc.jdbc.DataprocDriver" -u "jdbc:dataproc://hive/;projectId={pid};region={region};{other-parameters}"

Example:

export HADOOP_CLASSPATH=`hadoop classpath`:/usr/local/home/Downloads/dataproc-jdbc-connector/jdbc-driver/build/libs/jdbc-driver-1.0-SNAPSHOT-all.jar
beeline -d "com.google.cloud.dataproc.jdbc.DataprocDriver" -u "jdbc:dataproc://hive/;projectId=demo-dataproc;region=us-central1;clusterName=demo-cluster"

Use with example client

We have provided an example client -- an example usage of the JDBC driver from Java that connects to Hive using our Dataproc Driver.

Connection connection = DriverManager.getConnection("jdbc:dataproc://hive/default;projectId=pid;region=us-central1;clusterName=my-cluster");
try (Statement stmt = connection.createStatement()) {
  ResultSet res = stmt.executeQuery("SHOW TABLES");
  while (res.hasNext()) {
    System.out.println(res.getString(1));
  }
}

To run the example client:

# build example client JAR
./gradlew -p example-client shadowJar

# run the JAR, it will prompt you to enter the JDBC URL string
java -jar example-client/build/libs/example-client-1.0-SNAPSHOT-all.jar

Running the tests

  • To run all unit tests, use ./gradlew test

  • To run specific unit test, use

    ./gradlew test --tests="com.google.cloud.dataproc.jdbc.{test name}"
  • To run system test, use

    ./gradlew systemTest --tests="com.google.cloud.dataproc.jdbc.DataprocSystemTest" -DprojectId="{projectId}" -Dregion="{region}" -Duser="{user}"

    Note that projectId and region are the parameters for the specific project you are working on; user can be an arbitrary username, used as a prefix for the cluster name to create, delete and avoid cluster conflict during testing.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Java 100.0%