Spark DataSource support #99

rabejens · 2018-11-25T22:53:23Z

Not so long ago I discovered a nifty Spark feature: the Data Source API. There is an article on Hackernoon about it.

Basically, you create a class called DefaultSource that mixes in RelationProvider and SchemaRelationProvider. Its createRelation methods return an object of type BaseRelation with TableScan, in which you specify a Spark schema and a buildScan method that returns an RDD[Row] matching that schema. Spark automagically converts this into a DataFrame when you do something like:

val df = spark.
  read.
  format("com.example.foo.bar").
  load("hdfs://path/to/my/data")

where the DefaultSource class resides in the package com.example.foo.bar.
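
To make the shape of this concrete, here is a minimal sketch of such a source. It is only an illustration, not our actual reader: the name MeasurementRelation, the two-column default schema and the "sensor;value" text format are all made up for this example. Only the Spark traits and the "path" option that load(...) fills in come from Spark itself.

```scala
package com.example.foo.bar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider, TableScan}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Spark looks up a class named DefaultSource in the package given to format(...).
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Used when the caller does not supply a schema.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, MeasurementRelation.defaultSchema)

  // Used when the caller supplies a schema via .schema(...).
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    // load("...") hands the location over as the "path" option.
    val path = parameters.getOrElse(
      "path", throw new IllegalArgumentException("'path' must be specified"))
    new MeasurementRelation(path, schema)(sqlContext)
  }
}

// Hypothetical defaults for this sketch.
object MeasurementRelation {
  val defaultSchema: StructType = StructType(Seq(
    StructField("sensor", StringType, nullable = false),
    StructField("value", DoubleType, nullable = false)))
}

// Hypothetical relation: reads text lines of the assumed form "sensor;value".
class MeasurementRelation(path: String, userSchema: StructType)(
    @transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = userSchema

  // The rows produced here must line up with `schema`; this parser only
  // fits the two-column default schema and does not handle malformed lines.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map { line =>
      val parts = line.split(";", 2)
      Row(parts(0), parts(1).toDouble)
    }
}
```

With a class like this on the classpath, the read call above should return a DataFrame with the columns sensor and value. A real reader for binary formats would replace the textFile call with its own parsing logic.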

With this, I hooked up all the reading logic for our special data formats (binary or text-based measurement data that is not always readable with the default CSV data source).

It would be really nice to have a data source in Seahorse where you specify the package of the DefaultSource class and the URL of the data as usual, and the data is then pulled in via this mechanism.
