Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

增加 properties map 到 PostQueryExecutionForKylin #70

Open
Aaaaaaron opened this issue Oct 25, 2019 · 0 comments
Open

增加 properties map 到 PostQueryExecutionForKylin #70

Aaaaaaron opened this issue Oct 25, 2019 · 0 comments

Comments

@Aaaaaaron
Copy link

Aaaaaaron commented Oct 25, 2019

KE issue: https://github.com/Kyligence/KAP/issues/14933

lxian pushed a commit to lxian/spark that referenced this issue Sep 24, 2021
…ence index for optimization

### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=Kyligence#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=Kyligence#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=Kyligence#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes apache#33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
chenzhx pushed a commit to chenzhx/spark that referenced this issue Oct 20, 2021
…ence index for optimization

### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=Kyligence#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=Kyligence#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=Kyligence#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes apache#33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 93cec49)
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant