
Enhance input datasource capabilities #101

Open

mkuthan opened this issue Mar 23, 2017 · 4 comments

Comments

@mkuthan commented Mar 23, 2017

Right now druid-spark-batch reads data using sc.textFile from the given locations. This is an important limitation if the data is stored in a format like Parquet (or any other data format supported by Spark).

Would you consider enhancing this tool to support an arbitrary Spark SQL expression for defining the input data? You would get for free:


  • support for any data format supported by Spark
  • support for any UDF supported by Spark for data pre-processing
  • support for joins before ingestion
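
To make the idea concrete, here is a minimal sketch of what that could look like. The inputSql parameter and the loadInput helper are hypothetical, not part of druid-spark-batch today:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch: the task definition carries an arbitrary Spark SQL
// expression instead of a list of text files to pass to sc.textFile.
def loadInput(spark: SparkSession, inputSql: String): DataFrame = {
  // Register whichever sources the query needs; Parquet is just one example.
  spark.read.parquet("hdfs:///events/2017/03/23").createOrReplaceTempView("events")
  spark.read.parquet("hdfs:///dim/users").createOrReplaceTempView("users")

  // The user-supplied SQL can apply UDFs, filters and joins before ingestion.
  spark.sql(inputSql)
}
```

An inputSql such as SELECT e.ts, e.dim, u.country, e.metric FROM events e JOIN users u ON e.user_id = u.id would then cover all three bullets above.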
@drcrallen (Contributor) commented Mar 23, 2017

I think it would make some sense to have a DataSupplierFactory (or a similar, more Spark-y name) passed in the task definition: one of its implementations would effectively do sc.textFile(dataFiles mkString ",") while other implementations do other things, and then the chain at

val baseData

would feed off the factory.

Such an improvement would be quite handy.
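
For instance, a rough sketch of that factory; the trait and class names are made up here, and the actual task wiring around baseData is not shown:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical trait: each implementation knows how to produce the raw input rows.
trait DataSupplierFactory extends Serializable {
  def supply(sc: SparkContext): RDD[String]
}

// Default implementation, matching today's behaviour.
class TextFileSupplier(dataFiles: Seq[String]) extends DataSupplierFactory {
  override def supply(sc: SparkContext): RDD[String] =
    sc.textFile(dataFiles mkString ",")
}

// Elsewhere in the task, the existing chain would feed off the factory:
// val baseData = supplierFactory.supply(sc)
```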

@Gauravshah

Wouldn't it be easier to make it accept a DataFrame, so we can do all the pre-processing before sending it to spark-batch?
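
Roughly, assuming the task exposed a hypothetical ingestDataFrame entry point (it does not today), usage would look like:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("pre-process").getOrCreate()

// Arbitrary pre-processing (formats, UDFs, joins) happens in the caller...
val events = spark.read.parquet("hdfs:///events")
val users  = spark.read.parquet("hdfs:///dim/users")
val prepared: DataFrame = events.join(users, "user_id")
  .selectExpr("ts", "dim", "country", "metric")

// ...and only the resulting DataFrame is handed to druid-spark-batch.
// druidSparkBatch.ingestDataFrame(prepared)  // hypothetical entry point
```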

@drcrallen (Contributor)

@Gauravshah similar to #10?

@Gauravshah

👍
