***********************************************************************************
** Repo is no longer maintained. **
** Use PXF instead. **
***********************************************************************************
README file for GPLink
########################################################################################
Author: Jon Roberts
Email: [email protected]
########################################################################################
GPLink links JDBC connections to Greenplum External Tables.
Data is automatically cleansed of embedded carriage returns, newlines, and null
characters. Escape characters are retained by double escaping, and embedded pipes
are retained by escaping.
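The cleansing rules can be sketched with standard shell tools. This is only an illustration of the transformation described above, not gplink's actual implementation (embedded newlines within a field are likewise stripped):

```shell
# Illustrative only: strip carriage returns and nulls, double each
# backslash, then escape each pipe (the column delimiter).
printf 'one\\two|three\r\n' | tr -d '\r\000' | sed -e 's/\\/\\\\/g' -e 's/|/\\|/g'
```

The input field one\two|three arrives at the external table as one\\two\|three.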
########################################################################################
#Installation:
########################################################################################
- gplink must be installed on a server that is accessible by all nodes of Greenplum. A
dedicated ETL server or the standby master are good candidates for hosting gplink.
1. Download the latest version from PivotalGuru.com
2. Unzip <version>.zip
3. Source gplink_path.sh and add this to your .bashrc file
4. Edit connections/gplink.properties with correct Greenplum connection information
5. Download 3rd party JDBC drivers and place them in $GPLINK_HOME/jar
6. Define source configurations in $GPLINK_HOME/connections/
7. Define external table names and columns in $GPLINK_HOME/tables/
8. Define SQL statements to execute in the source in $GPLINK_HOME/sql/
9. Create the External Table with gpltable
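For step 3, the line added to your .bashrc might look like the following (the install path is an illustrative assumption):

```
# added to ~/.bashrc so every login shell gets the gplink environment
. /home/gpadmin/gplink/gplink_path.sh
```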
########################################################################################
#Creating External Tables
########################################################################################
gpltable -s <source_config> -t <target_config> -f <sql> -a <source_table>
example:
gpltable -s sqlserver.properties -t $GPLINK_HOME/connections/gplink.properties \
-f $GPLINK_HOME/sql/sqlserver_example.sql \
-a $GPLINK_HOME/tables/public.sqlserver_example.sql
########################################################################################
#Dropping External Tables
########################################################################################
gpldrop -t <target_config> -n <table_name>
example:
gpldrop -t $GPLINK_HOME/connections/gplink.properties -n public.test
########################################################################################
#Start the gpfdist processes
########################################################################################
gplstart -t <target_config>
example:
gplstart -t $GPLINK_HOME/connections/gplink.properties
Note: this is useful when the host is restarted and you need to start all of the gpfdist
processes needed by gplink External Tables.
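Since the gpfdist processes do not survive a reboot, one way to automate this (an assumption, not something gplink ships with; the path is illustrative) is a crontab @reboot entry:

```
@reboot . /home/gpadmin/gplink/gplink_path.sh && gplstart -t $GPLINK_HOME/connections/gplink.properties
```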
########################################################################################
#Debugging
########################################################################################
export GPLINK_DEBUG=true
Turn off debugging:
export GPLINK_DEBUG=
Note: this will show all debug messages from gplstart, gpltable, and gpldrop.
########################################################################################
#Getting data
########################################################################################
The External Table references gpfdist, which in turn executes ext_gpldata. This
performs basic parsing of the URL and then calls gpldata. You never need to call
ext_gpldata directly, but you can call gpldata directly to debug connections and SQL
statements prior to creating a table with gpltable.
Usage is gpldata -s <source_config> -f <sql>
example:
gpldata -s $GPLINK_HOME/connections/sqlserver.properties \
-f $GPLINK_HOME/sql/sqlserver_example.sql
########################################################################################
#Known working JDBC connections
########################################################################################
Examples are in the connections directory. Here are some notes on each connection type
that has been tested.
1. SQL Server
connectionUrl=jdbc:sqlserver://jonnywin;CODEPAGE=65001;responseBuffering=adaptive;selectMethod=cursor;
You will want to use CODEPAGE 65001 which connects to SQL Server in UTF8 character set.
This allows the JDBC driver to translate the native character set to UTF8 which is used
by Greenplum.
responseBuffering=adaptive greatly improves the speed of exporting data. Be sure to
use this.
selectMethod=cursor is needed to tell SQL Server how the data will be fetched.
readCommitted=true performs a read consistent query in the database. This can cause a
problem if you aren't using Read Committed Snapshot Isolation and may instead prefer
to use a dirty read.
userName=sa
password=sa
All testing has been done with SQL Server authentication.
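Putting these properties together, a complete sqlserver.properties might look like the following (the host and credentials are illustrative; the driver class is Microsoft's standard SQL Server JDBC driver):

```
connectionUrl=jdbc:sqlserver://jonnywin;CODEPAGE=65001;responseBuffering=adaptive;selectMethod=cursor;
classForName=com.microsoft.sqlserver.jdbc.SQLServerDriver
readCommitted=true
userName=sa
password=sa
```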
2. Oracle
connectionUrl=jdbc:oracle:thin:@//jonnywin:1521/XE is an example where the thin driver
is used to connect to the XE instance on port 1521.
extraProps=defaultRowPrefetch=2000
Be sure to use this! By default, Oracle fetches only 10 rows at a time, which makes
exporting slow. Fetching more rows per round trip improves speed, but you may need
to increase the memory settings.
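Putting this together, a complete oracle.properties might look like the following (the host, service name, and credentials are illustrative assumptions; the driver class is the standard Oracle thin driver):

```
connectionUrl=jdbc:oracle:thin:@//jonnywin:1521/XE
classForName=oracle.jdbc.OracleDriver
extraProps=defaultRowPrefetch=2000
userName=system
password=oracle
```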
3. DB2
This is being used by customers but I don't have the details at this time.
4. Teradata
connectionUrl=jdbc:teradata://td_server/database=EDW,TYPE=FASTEXPORT,charset=UTF8
- You want to use FASTEXPORT for better performance.
- UTF8 is specified so the JDBC driver does the character set translation
- You will need both tdgssconfig.jar and terajdbc4.jar JDBC drivers from Teradata.
- If LDAP authentication is needed, use LOGMECH=LDAP in the connectionURL:
connectionUrl=jdbc:teradata://td_server/LOGMECH=LDAP,database=EDW,TYPE=FASTEXPORT,charset=UTF8
Set classForName to the Teradata driver class:
classForName=com.teradata.jdbc.TeraDriver
5. Hive
Several jar files are needed for Hive to work. Download the following jar files
from your Hadoop cluster and place them in the jar directory.
I tested with a Hortonworks cluster with Hive 1.2.1.2.4.
/usr/hdp/2.4.2.0-258/hive/lib/hive-jdbc.jar
/usr/hdp/2.4.2.0-258/hadoop/client/hadoop-common.jar
/usr/hdp/2.4.2.0-258/hadoop/client/log4j.jar
/usr/hdp/2.4.2.0-258/hadoop/client/slf4j-api.jar
/usr/hdp/2.4.2.0-258/hadoop/client/slf4j-log4j12.jar
Older versions of Hive may have duplicate SLF4J bindings and fail to work
properly. The error message is, "Class path contains multiple SLF4J bindings". If
you get this, remove the slf4j-log4j12.jar file from the jar/ directory, source the
gplink_path.sh file, and then try again.
Testing has been done with a cluster that isn't secure. Refer to Hive JDBC
documentation on how to configure your JDBC connection with a secure login.
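A minimal hive.properties for an unsecured cluster might look like this (the host, port, and database are illustrative assumptions; the driver class is the standard Hive JDBC driver):

```
connectionUrl=jdbc:hive2://hdp-master:10000/default
classForName=org.apache.hive.jdbc.HiveDriver
userName=hive
password=
```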
6. SAP Hana
No special configuration is needed.
classForName=com.sap.db.jdbc.Driver
readCommitted=true