[Long term] Look into Supersonic query API #11

velvia · 2015-09-03T16:59:39Z

https://slack-files.com/files-pri-safe/T03BMF0R2-F0A3LCQ3C/api-presentation_1_.pdf?c=1441299236-4641d956f1354dd200dd184c1f1fc76fc59b9d2c

samklr · 2015-09-28T14:54:31Z

Link expired?

velvia · 2015-09-28T17:19:16Z

@samklr try this?

https://slack-files.com/files-pri-safe/T03BMF0R2-F0AFBB892/jethrodata_white_paper.pdf?c=1441927242-e4340a9d9477dca46000bf030eb89fddb468fd58

darkjh · 2015-09-28T18:09:14Z

@velvia still expired

samklr · 2015-09-28T21:47:22Z

Lol. Still expired ...

velvia · 2015-10-02T06:01:05Z

I finally found a live link, though I'm not sure how much longer this one will stay up either. Download the PDF while you can.
https://code.google.com/p/supersonic/downloads/list

velvia · 2016-01-10T06:03:01Z

So, Supersonic is C++. There is also Apache Drill, but that might be C++ too.

velvia · 2016-01-13T23:21:28Z

I think the best bet in the short term is playing with Spark's Catalyst optimizer to get columnar, or at least vector-wise, execution. Here is a video:

http://blog.madhukaraphatak.com/anatomy-of-spark-dataframe-api/

Some thoughts:

- We could introduce an extra physical planner stage that does vector computation before passing the result to the normal Aggregate* steps. However, we don't want to receive an RDD[InternalRow], but rather an RDD[Segment].
- We could introduce something called "aggregation / expression pushdown", at first specific to the Filo data source only, that pushes down the columnar expressions as well as the aggregation and grouping expressions. Then the Filo data source could do computations on each segment and return an RDD[Row], hopefully with far fewer rows, for Spark to compute.
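
To make the pushdown idea concrete, here is a minimal plain-Scala sketch with no Spark dependency. The `Segment` case class, `SegmentPushdown`, and both function names are hypothetical stand-ins, not Filo or Spark APIs. The sketch assumes a SUM ... GROUP BY query: each segment produces a small partial aggregate, and only those partials would need to be handed back to Spark for the final merge.

```scala
// Hypothetical stand-in for a columnar segment: two parallel column vectors.
final case class Segment(keys: Array[String], values: Array[Double])

object SegmentPushdown {
  // Partial aggregate, SUM(value) GROUP BY key, computed inside one segment.
  def partialSum(segment: Segment): Map[String, Double] = {
    val acc = scala.collection.mutable.Map.empty[String, Double]
    var i = 0
    while (i < segment.keys.length) {
      val k = segment.keys(i)
      acc(k) = acc.getOrElse(k, 0.0) + segment.values(i)
      i += 1
    }
    acc.toMap
  }

  // Final merge of the per-segment partials, standing in for the work left to Spark.
  def merge(partials: Seq[Map[String, Double]]): Map[String, Double] =
    partials
      .flatMap(_.toSeq)
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).sum }
}
```

With two segments holding five input rows in total, the data source would return at most one row per distinct key per segment, which is where the "far fewer rows" would come from.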

velvia · 2016-01-19T21:54:37Z

More notes on where in the Spark codebase to look for SQL Optimizer stages (Spark 1.5.x):

Overall query execution flow: SQLContext#QueryExecution inner class
Step 1: SQL (or DataFrame DSL) is converted to a LogicalPlan tied to a new DataFrame instance (see LogicalPlan.scala, and DataFrame.logicalPlan)
Step 2: org.apache.spark.sql.catalyst.analysis.Analyzer goes over LogicalPlan, resolves references, produces another LogicalPlan
Step 3: Spark calls the CacheManager to determine if cached tables should be used --> withCachedData LogicalPlan
Step 4: org.apache.spark.sql.catalyst.optimizer.Optimizer optimizes the LogicalPlan
Step 5: SparkPlanner uses various SparkStrategies to convert the LogicalPlan into a SparkPlan.
- These are all in the org.apache.spark.sql.execution package
- For Joins, see SparkStrategies.{LeftSemiJoin, CanBroadcast, EquiJoinSelection}.
- See the DataSourceStrategy for how pushdown predicates are implemented
Step 6: The SparkPlan's execute() method is called, which returns an RDD[InternalRow]
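
The intermediate plans from the steps above can be printed from a DataFrame's `queryExecution`. This is a rough sketch assuming Spark 1.5.x on the classpath and a local master; the object name, app name, and sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object InspectPlans {
  def main(args: Array[String]): Unit = {
    // Local two-thread master; app name and data are illustrative only.
    val conf = new SparkConf().setAppName("inspect-plans").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).toDF("key", "value")
    val query = df.groupBy("key").sum("value")

    val qe = query.queryExecution
    println(qe.analyzed)        // Step 2: LogicalPlan after the Analyzer
    println(qe.withCachedData)  // Step 3: LogicalPlan after the CacheManager
    println(qe.optimizedPlan)   // Step 4: LogicalPlan after the Optimizer
    println(qe.sparkPlan)       // Step 5: SparkPlan chosen by the SparkPlanner

    // Step 6: executing the plan yields an RDD[InternalRow].
    println(qe.toRdd.count())

    sc.stop()
  }
}
```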

Custom execution strategies can be inserted via the SQLContext.experimental variable (its extraStrategies field).
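
As a sketch of that hook, assuming the Spark 1.5.x API: `SegmentAggregationStrategy` and `InstallStrategy` are hypothetical names, and the strategy below is a placeholder that never matches anything, so planning falls through to Spark's built-in strategies.

```scala
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy. A real one would match an aggregate over a
// Filo-backed relation and return a custom SparkPlan for it.
object SegmentAggregationStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // Returning Nil tells the planner that this strategy does not apply.
    case _ => Nil
  }
}

object InstallStrategy {
  // Appends the strategy without discarding any that are already registered.
  def install(sqlContext: SQLContext): Unit = {
    sqlContext.experimental.extraStrategies =
      sqlContext.experimental.extraStrategies :+ SegmentAggregationStrategy
  }
}
```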

Changing the optimizer steps might require a custom optimizer and a custom SQLContext/QueryExecution class.

velvia · 2016-02-10T23:29:05Z

A current Spark ticket for pushing down aggregations into DataSources:

https://issues.apache.org/jira/browse/SPARK-12449

See Santiago's comment right above mine for links to how Druid, Magellan, HBase and other folks are modifying Spark Catalyst plans to get aggregation done on the server side.

velvia added the Architecture label Sep 3, 2015
