
Apache Drill Data Provider

Pradeeban Kathiravelu edited this page Aug 24, 2018 · 2 revisions

Bindaas includes a data provider that integrates Apache Drill as a data source.

Configuring Apache Drill and Data Sources for Bindaas

If your Drill instance is configured with JPam for authentication, operating system users also act as Drill users, as defined in the configuration of that instance.

Because the Drill driver is based on JDBC, the Drill JDBC URL has the familiar JDBC form. The user name and password, however, are optional for the Drill provider. If your Drill instance is not configured with JPam, leave the username and password entries blank when you define the data source in the Data Provider Creation step shown in the above screencast. For a stand-alone Drill instance, an example Drill JDBC URL is jdbc:drill:drillbit=localhost:31010.
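As a minimal sketch of the URL form described above, the following Python helper assembles a Drill JDBC URL together with its connection properties. The function name `drill_connection` and the `user`/`password` property keys are assumptions made for illustration and are not part of Bindaas; the credentials are simply left out when Drill runs without JPam.

```python
def drill_connection(host="localhost", port=31010, username="", password=""):
    """Return (jdbc_url, properties) for a direct drillbit connection.

    Hypothetical helper: it only mirrors the URL form shown above.
    """
    url = f"jdbc:drill:drillbit={host}:{port}"
    props = {}
    # Credentials are optional; include them only when authentication is enabled.
    if username:
        props["user"] = username
        props["password"] = password
    return url, props


url, props = drill_connection()
print(url)    # jdbc:drill:drillbit=localhost:31010
print(props)  # {}
```

With the defaults, the helper reproduces the stand-alone URL from the text and an empty set of properties, which corresponds to leaving the username and password entries blank.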

Drill as a Data Source

Although Apache Drill is not a database server itself, it connects to various data sources and lets users run SQL queries over the structured and unstructured data stored in them. It therefore acts as a layer between the data sources and an enterprise application such as Bindaas, offering a distributed, homogeneous interface to the heterogeneous data sources that are configured. With the appropriate storage plugins configured in Drill, Bindaas can query a variety of data sources, including HDFS, Hive, HBase, S3, MongoDB, and MySQL. Through the Drill data provider of Bindaas, a single query can read from several data sources and return the combined result to the user.

Integrate Multiple Data Sources Through Drill

Drill gives Bindaas additional capabilities for data integration and for efficient, dynamic queries. This post discusses a use case in which Drill queries data from multiple data sources at once. It focuses on data distributed across multiple MongoDB instances, but the same approach works with other SQL and NoSQL data sources. We define multiple storage plugins for the same kind of data server and integrate the data from these sources dynamically for each query. Following the same architecture, data can also be queried from heterogeneous data sources and returned as a single result. The configuration and execution are described from the Apache Drill side.

Querying Disjoint Data Sources Through Drill

Instead of using MongoDB as a single or clustered data store, we may partition the data across independent MongoDB instances that are hosted remotely. We can then use Drill's UNION operator to combine the results.

Why do we need to do this?

  1. The data may already be partitioned across different sources.
  2. With domain knowledge, we may be able to partition the data better than a generic scheme would.
  3. Even with naive partitioning, Drill scales and performs well.
  4. It raises interesting research questions, such as whether exploiting data locality can produce better and faster results than a clustered or distributed Mongo deployment.

Be warned that some data structures hurt Drill's performance, for example nested complex schemas such as multi-dimensional arrays. We have previously discussed a workaround for this. In this post, we walk through the simplest example of querying partitioned data through Drill.

Define the Mongo Storage Plugins

For each Mongo server, define a separate storage plugin in Drill.

Multiple definitions of the Mongo storage plugin, pointing to various Mongo deployments

For example, mongo3 above is defined as follows at http://localhost:8047/storage/mongo3:

{
  "type": "mongo",
  "connection": "mongodb://184.72.102.246:27017/",
  "enabled": true
}
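When there are many Mongo servers, the plugin definitions can be generated rather than typed by hand. The following is a minimal sketch in Python; the helper name `mongo_plugin_config`, the plugin names `mongo` and `mongo2`, and their hosts are assumptions for illustration. Only the `mongo3` entry comes from the definition above.

```python
import json


def mongo_plugin_config(host, port=27017):
    """Build a Drill storage plugin definition for one MongoDB server."""
    return {
        "type": "mongo",
        "connection": f"mongodb://{host}:{port}/",
        "enabled": True,
    }


# Hypothetical plugin names and hosts; only mongo3 is taken from the example.
servers = {
    "mongo": "localhost",
    "mongo2": "mongo2.example.org",
    "mongo3": "184.72.102.246",
}

plugins = {name: mongo_plugin_config(host) for name, host in servers.items()}

# Prints the same JSON document as the mongo3 definition shown above.
print(json.dumps(plugins["mongo3"], indent=2))
```

Each generated definition can then be pasted into the corresponding storage plugin page of the Drill web console.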

Query Through the Query Browser

Querying multiple Mongo deployments and combining the results with UNION ALL

select last_name as id from mongo.employee.empinfo
union all
select first_name as id from mongo2.employee.empinfo
union all
select first_name as id from mongo3.employee.empinfo
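The query above follows a regular pattern, with one SELECT per storage plugin joined by UNION ALL, so it can be generated for any number of partitions. The sketch below assumes a hypothetical helper `union_all_query` and, unlike the query above, selects the same column from every plugin.

```python
def union_all_query(plugins, column, table="employee.empinfo", alias="id"):
    """Build one UNION ALL query that reads `column` from every plugin."""
    selects = [
        f"select {column} as {alias} from {plugin}.{table}"
        for plugin in plugins
    ]
    # One SELECT per storage plugin, joined by UNION ALL.
    return "\nunion all\n".join(selects)


query = union_all_query(["mongo", "mongo2", "mongo3"], "first_name")
print(query)
```

Adding another Mongo deployment then only requires defining its storage plugin in Drill and appending the plugin name to the list.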

 

 

 

You can now execute this query and retrieve the results. It can be run directly from Drill as well as from Bindaas. Depending on the nature of the query and on how the data is partitioned and scaled, you may see performance benefits from the data partitioning and distributed execution.
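Besides the query browser, one way to run a query directly against Drill is its REST API. The following Python sketch prepares such a request. It assumes that the Drill web console listens on http://localhost:8047 and that queries are posted to the `/query.json` endpoint; the function names are hypothetical, and `run_query` needs a running Drill instance, so it is defined here but not called.

```python
import json
import urllib.request

DRILL_URL = "http://localhost:8047"  # assumed: Drill web console address


def build_query_request(sql, base_url=DRILL_URL):
    """Prepare (but do not send) a POST to Drill's /query.json endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url=DRILL_URL):
    """Send the query and return the result rows; requires a running Drill."""
    with urllib.request.urlopen(build_query_request(sql, base_url)) as resp:
        return json.load(resp)["rows"]


request = build_query_request(
    "select first_name as id from mongo3.employee.empinfo"
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the sketch inspectable without a live Drill instance.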
