Not so long ago I discovered a nifty Spark feature: Spark's Data Source. You can read this article on Hackernoon about it.
Basically, you create a class called DefaultSource which mixes in RelationProvider and SchemaRelationProvider, and whose createRelation methods return an object of type BaseRelation with TableScan. This allows you to specify a Spark schema and a method that returns an RDD[Row] matching that schema, which is automagically converted to a DataFrame when you do something like `spark.read.format("com.example.foo.bar").load(path)`, where the DefaultSource class resides in the package com.example.foo.bar.
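To make the mechanism described above concrete, here is a minimal sketch against the Spark SQL `org.apache.spark.sql.sources` API. The relation class `MeasurementRelation`, the two-column schema, and the `name;value` line format are hypothetical and only stand in for our actual readers; treat this as an illustration under those assumptions, not as our production code.

```scala
package com.example.foo.bar

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider, TableScan}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package passed to format(...).
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the user does not supply a schema: fall back to a default one.
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    createRelation(sqlContext, parameters, MeasurementRelation.defaultSchema)

  // Called when the user supplies a schema via .schema(...).
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      schema: StructType): BaseRelation = {
    // load(path) passes the location in the "path" option.
    val path = parameters.getOrElse("path", sys.error("'path' must be specified"))
    new MeasurementRelation(path, schema)(sqlContext)
  }
}

object MeasurementRelation {
  // Hypothetical schema: a sensor name and a numeric reading.
  val defaultSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("value", DoubleType, nullable = false)))
}

// Hypothetical relation: reads "name;value" text lines and turns them into Rows.
class MeasurementRelation(path: String, userSchema: StructType)(
    @transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType = userSchema

  // Full table scan: every line of the input becomes one Row.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map { line =>
      val Array(name, value) = line.split(";", 2)
      Row(name, value.toDouble)
    }
}

// Usage from application code (the path is a placeholder):
//   val df = spark.read.format("com.example.foo.bar").load("/path/to/data")
```

The rows produced by `buildScan` have to line up with the declared schema, so a caller who passes a custom schema is also responsible for it matching what the reader actually emits.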
With this, I hooked up all the reading logic for our special data formats (binary or text-based measurement data that is not always readable with the default CSV data source).
It would be really nice to have a data source in Seahorse where you can specify the package of the DefaultSource class and the URL of the data as usual, and where the data is then pulled in via this mechanism.