Biforce Brand

Goals

Main

  • Retrieve and transform data from Caliber for Spark analytics and OLAP
  • Retrieve and transform data from Caliber for storage in data warehouse

Iteration

  • Implement warehousing process
  • Implement incremental importing
  • Design a new data flow to accommodate warehousing
  • Implement Hive instead of a MySQL database as the ODS for transforming Spark input data
  • Have the Oozie team convert all Sqoop jobs into workflows for automation

Implementation

Information Source: The Apache Sqoop User Guide.

Incremental Imports

To achieve incremental importing, Sqoop jobs were saved to a metastore. We used a PostgreSQL EC2 instance from AWS as a remote metastore, which allowed creation, deletion, and execution of saved jobs to be tested from different machines.

WARNING: The Sqoop metastore currently does not allow concurrent writing, which means that concurrent creation or execution of saved Sqoop jobs will fail. Keep this in mind when testing from multiple machines or when creating Oozie workflows that run actions in parallel.

Known Issue: The current iteration of incremental import only uses the append mode of the --incremental flag. This means that the jobs only retrieve new entries from Caliber, not updated ones (for example, when a trainee's status changes).

Furthermore, the current Sqoop commands connect to the metastore via a connection string that also holds the username and password. For security purposes, the username and password have been removed from the connection string before pushing to Git.
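
For illustration, a saved incremental job stored in the remote metastore might look roughly like the following. This is a minimal sketch, not the actual command: the metastore URL, Caliber JDBC URL, table name, check column, and target directory are all placeholders, the real credentials are intentionally omitted as noted above, and it assumes the caliber.password.alias credential described in the next section is available through the Hadoop credential provider.

# Create a saved job in the remote metastore; the embedded import uses
# append-mode incremental loading keyed on an ID column.
sqoop job --create caliber_batch_import \
  --meta-connect "jdbc:postgresql://<metastore-host>:5432/<metastore-db>" \
  -- import \
  --connect "<caliber-jdbc-url>" \
  --username <username> --password-alias caliber.password.alias \
  --table <caliber-table> \
  --incremental append --check-column <id-column> --last-value 0 \
  --target-dir /user/root/caliber/<caliber-table>

# Execute the saved job from any machine that can reach the metastore.
sqoop job --meta-connect "jdbc:postgresql://<metastore-host>:5432/<metastore-db>" --exec caliber_batch_import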

OLAP vs Warehouse

For the purposes of this page, OLAP refers to data used by the Spark job, which is what generates the reports. Warehouse refers to data that is meant to go directly into the data warehouse.

The schema used when transforming data for Spark Analytics (OLAP) was the same as Caliber's, but the introduction of the data warehouse required that we follow its different schema as well. To do this, two groups of Sqoop import jobs were created: OLAP and Warehouse.

WARNING: The Sqoop import jobs require that a password alias titled caliber.password.alias be created inside /user/root/caliber.password.jceks in HDFS.

To create the alias, use the following command and enter the password when prompted:

hadoop credential create caliber.password.alias -provider jceks://hdfs/user/root/caliber.password.jceks
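
Once the alias exists, a Sqoop command can reference it by pointing at the credential provider and using --password-alias in place of a plain-text password. The sketch below is only an illustration; the connection string and table are placeholders, not the actual values used.

# Reference the stored alias instead of passing the password on the command line.
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/caliber.password.jceks \
  --connect "<caliber-jdbc-url>" \
  --username <username> \
  --password-alias caliber.password.alias \
  --table <caliber-table>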

OLAP ETL

This process was changed from the previous iteration to use Hive instead of a MySQL database to transform data.

Pros

  • Reduces the number of steps to load and transform data
  • Automatic table creation and loading

Cons

  • Uses MapReduce when performing transformations, which adds overhead for smaller datasets (though it can be useful when dealing with large amounts of data)

Procedure

  1. Create the database and Spark table in Hive using the commands:
   CREATE DATABASE IF NOT EXISTS BIFORCE_STAGING;
   USE BIFORCE_STAGING;
   CREATE TABLE IF NOT EXISTS SPARK_DATA (ROWNUM INT, TEST_TYPE INT, RAW_SCORE DECIMAL(3, 0), SCORE DECIMAL(5, 2), TEST_PERIOD INT, TEST_CATEGORY INT, TRAINER_ID INT, BATCH_ID INT, GROUP_TYPE STRING, BATTERY_ID INT, BATTERY_STATUS INT);
  2. Store the Sqoop import jobs for the Caliber tables in the remote metastore.
  3. Run the Sqoop jobs from a local terminal to import the selected Caliber tables into Hive. The connection, username, and password may vary. Refer to the Oozie hive-import workflows for the Sqoop commands (a rough sketch of such a command is shown after this list).
  4. Use a Hive query to join all tables into one workable table for the Spark team. See the README for an example query.
  5. Run the command below in Hive to export the Spark table to HDFS for the Spark team to use. Note the directory path.
   insert overwrite directory 'user/hadoop/biforce/Spark_Data' row format delimited fields terminated by ',' select * from spark_data;
  6. Hand the spark_data file to the Spark team for analysis.
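
For reference, a Hive-bound import (step 3) might look roughly like the following. This is a hedged sketch rather than the actual command from the Oozie hive-import workflows; the Caliber JDBC URL, username, and table names are placeholders, and it assumes the caliber.password.alias credential from the warning above.

# Import a Caliber table straight into the BIFORCE_STAGING Hive database.
sqoop import \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/root/caliber.password.jceks \
  --connect "<caliber-jdbc-url>" \
  --username <username> --password-alias caliber.password.alias \
  --table <caliber-table> \
  --hive-import --hive-database BIFORCE_STAGING --hive-table <hive-table>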

Warehouse ETL

Sqoop jobs have been created that import data incrementally from Caliber directly into S3. See the Oozie warehouse-import workflows to view the Sqoop commands.
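
As a rough illustration of the shape of such a job (not the actual command; the bucket, table, and check column are placeholders, and writing to s3a:// assumes the cluster is configured with S3 credentials):

# Incremental append import from Caliber written directly to an S3 target directory.
sqoop import \
  --connect "<caliber-jdbc-url>" \
  --username <username> --password-alias caliber.password.alias \
  --table <caliber-table> \
  --incremental append --check-column <id-column> --last-value 0 \
  --target-dir s3a://<warehouse-bucket>/<caliber-table>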
