
Enhance input datasource capabilities #101

Open

mkuthan opened this issue Mar 23, 2017 · 4 comments

Comments

@mkuthan commented Mar 23, 2017

Right now druid-spark-batch reads data using sc.textFile from the given locations. This is an important limitation if the data is stored in a format like Parquet (or any other data format supported by Spark).

Would you consider enhancing this tool to support an arbitrary Spark SQL expression for defining the input data? You would get for free:


  • support for any data format supported by Spark
  • support for any UDF supported by Spark for data pre-processing
  • support for joins before ingestion
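
To make the idea concrete, here is a minimal sketch of what that could look like. The inputSql parameter and the loadInput helper are hypothetical, not part of druid-spark-batch today:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch: the task definition carries an arbitrary Spark SQL
// expression instead of a list of text files to pass to sc.textFile.
def loadInput(spark: SparkSession, inputSql: String): DataFrame = {
  // Register whichever sources the query needs; Parquet is just one example.
  spark.read.parquet("hdfs:///events/2017/03/23").createOrReplaceTempView("events")
  spark.read.parquet("hdfs:///dim/users").createOrReplaceTempView("users")

  // The user-supplied SQL can apply UDFs, filters and joins before ingestion.
  spark.sql(inputSql)
}
```

An inputSql such as SELECT e.ts, e.dim, u.country, e.metric FROM events e JOIN users u ON e.user_id = u.id would then cover all three bullets above.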
@drcrallen (Contributor) commented Mar 23, 2017

I think it would make some sense to have a DataSupplierFactory (or a similar, more Spark-y name) passed in the task definition: one of its implementations would effectively do sc.textFile(dataFiles mkString ",") while other implementations do other things, and then the chain at

val baseData

would feed off the factory.

Such an improvement would be quite handy.
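
For instance, a rough sketch of that factory; the trait and class names are made up here, and the actual task wiring around baseData is not shown:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical trait: each implementation knows how to produce the raw input rows.
trait DataSupplierFactory extends Serializable {
  def supply(sc: SparkContext): RDD[String]
}

// Default implementation, matching today's behaviour.
class TextFileSupplier(dataFiles: Seq[String]) extends DataSupplierFactory {
  override def supply(sc: SparkContext): RDD[String] =
    sc.textFile(dataFiles mkString ",")
}

// Elsewhere in the task, the existing chain would feed off the factory:
// val baseData = supplierFactory.supply(sc)
```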

@Gauravshah

Wouldn't it be easier to make it accept a DataFrame, so we can do all the pre-processing before sending it to spark-batch?
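
Roughly, assuming the task exposed a hypothetical ingestDataFrame entry point (it does not today), usage would look like:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("pre-process").getOrCreate()

// Arbitrary pre-processing (formats, UDFs, joins) happens in the caller...
val events = spark.read.parquet("hdfs:///events")
val users  = spark.read.parquet("hdfs:///dim/users")
val prepared: DataFrame = events.join(users, "user_id")
  .selectExpr("ts", "dim", "country", "metric")

// ...and only the resulting DataFrame is handed to druid-spark-batch.
// druidSparkBatch.ingestDataFrame(prepared)  // hypothetical entry point
```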

@drcrallen (Contributor)

@Gauravshah similar to #10?

@Gauravshah

👍
