StatisticsV2: statistics framework initial redesign for Datafusion #57

Draft
wants to merge 14 commits into
base: apache_main

@Fly-Style commented on Jan 8, 2025

Rationale for this change

https://synnada.notion.site/Redesigning-and-Enhancing-the-Statistics-Framework-in-Datafusion-16bf46d2dab180448272dbbd1d1f7cea

What changes are included in this PR?

This patch introduces a Statistics V2 framework with the following main points:

  • an enum-based structure that supports multiple distribution types, initially:
    • Uniform distribution (a range)
    • Gaussian distribution (also known as normal)
    • Exponential distribution
    • Bernoulli distribution (holds a probability; used as the resulting distribution of comparison operators)
    • Unknown distribution, which stands in for any distribution not represented above
  • tree-based statistics calculation, divided into bottom-up evaluation and top-down propagation.
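As a rough illustration of the enum-based design described above, the distribution types could be modeled as follows. This is a minimal sketch, not the PR's actual code: the type name `Distribution`, its field names, and the `mean` helper are all hypothetical.

```rust
/// Hypothetical sketch of an enum-based distribution type. The names and
/// fields are assumptions for illustration, not the definitions in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum Distribution {
    /// Uniform over the closed range [low, high].
    Uniform { low: f64, high: f64 },
    /// Gaussian (normal) with the given mean and variance.
    Gaussian { mean: f64, variance: f64 },
    /// Exponential with rate parameter lambda > 0.
    Exponential { rate: f64 },
    /// Bernoulli with success probability p in [0, 1].
    Bernoulli { p: f64 },
    /// Any distribution not represented by the variants above.
    Unknown,
}

impl Distribution {
    /// Returns the mean when it is available in closed form.
    pub fn mean(&self) -> Option<f64> {
        match self {
            Distribution::Uniform { low, high } => Some((low + high) / 2.0),
            Distribution::Gaussian { mean, .. } => Some(*mean),
            Distribution::Exponential { rate } => Some(1.0 / rate),
            Distribution::Bernoulli { p } => Some(*p),
            Distribution::Unknown => None,
        }
    }
}
```

With an enum, each operator can pattern-match on the pair of input variants and fall back to `Unknown` when no closed-form result is known.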

Tables of statistics evaluation and propagation rules for PhysicalExprs:

Definitions

UF = Uniform, UN = Unknown, EXP = Exponential, GSS = Gaussian, BRN = Bernoulli

Binary arithmetic operators, evaluation.

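| Input 1 | Input 2 | (+) | (-) | (*) | (/) |
|-------------------|-------------------|-----------------|-----|-----|-----|
| **UF** [a,b] | **UF** [c,d] | UN | UN | UN | UN |
| **UF** [a,b] | **GSS** N(μ,σ²) | UN | UN | UN | UN |
| **UF** [a,b] | **EXP** Exp(λ) | UN | UN | UN | UN |
| **UF** [a,b] | **BRN** | UN | UN | UN | UN |
| **UF** [a,b] | **UN** | UN | UN | UN | UN |
| **GSS** N(μ₁,σ₁²) | **GSS** N(μ₂,σ₂²) | GSS | GSS | GSS | UN |
| **GSS** N(μ,σ²) | **EXP** Exp(λ) | UN | UN | UN | UN |
| **GSS** N(μ,σ²) | **BRN** | UN | UN | UN | UN |
| **GSS** N(μ,σ²) | **UN** | UN | UN | UN | UN |
| **EXP** Exp(λ₁) | **EXP** Exp(λ₂) | UN (gamma dist) | UN | UN | UN |
| **EXP** Exp(λ) | **BRN** | UN | UN | UN | UN |
| **EXP** Exp(λ) | **UN** | UN | UN | UN | UN |
| **BRN** | **BRN** | UN | UN | UN | UN |
| **BRN** | **UN** | UN | UN | UN | UN |
| **UN** | **UN** | UN | UN | UN | UN |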
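The one row of the arithmetic table that stays within a known family is GSS with GSS. A minimal sketch of the (+) and (-) cases follows, assuming the two inputs are independent, so that means add or subtract and variances add. The names `Dist`, `Op`, and `evaluate` are hypothetical, and the (*) and (/) columns are left out because the table does not give their parameters.

```rust
/// Hypothetical, trimmed-down distribution type for this sketch only.
#[derive(Debug, Clone, PartialEq)]
pub enum Dist {
    Uniform { low: f64, high: f64 },
    Gaussian { mean: f64, variance: f64 },
    Unknown,
}

/// The two operators covered by this sketch.
#[derive(Debug, Clone, Copy)]
pub enum Op {
    Plus,
    Minus,
}

/// Bottom-up evaluation of `lhs op rhs`.
/// Assumption: the inputs are independent random variables.
pub fn evaluate(op: Op, lhs: &Dist, rhs: &Dist) -> Dist {
    match (lhs, rhs) {
        (
            Dist::Gaussian { mean: m1, variance: v1 },
            Dist::Gaussian { mean: m2, variance: v2 },
        ) => {
            // Means combine with the operator; variances always add
            // for independent Gaussians.
            let mean = match op {
                Op::Plus => m1 + m2,
                Op::Minus => m1 - m2,
            };
            Dist::Gaussian { mean, variance: v1 + v2 }
        }
        // Every other combination in the table evaluates to Unknown.
        _ => Dist::Unknown,
    }
}
```

For example, subtracting N(3, 9) from N(1, 4) gives N(-2, 13) under this sketch, while any pair involving a Uniform input falls through to `Unknown`.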

Comparison operators, evaluation.

| Input 1 | Input 2 | Comparison (=, ≠) | Inequality (>, >=) | Inequality (<, <=) |
|-------------------|-------------------|-------------------|--------------------|--------------------|
| **UF** [a,b] | **UF** [c,d] | BRN | BRN | BRN |
| **GSS** N(μ₁,σ₁²) | **GSS** N(μ₂,σ₂²) | UN | UN (Φ-func) | UN (Φ-func) |
| **EXP** Exp(λ₁) | **EXP** Exp(λ₂) | UN | UN | UN |
| **BRN** | **BRN** | BRN | BRN | BRN |
| **UF** [a,b] | **GSS** N(μ,σ²) | UN | UN | UN |
| **UF** [a,b] | **EXP** Exp(λ) | UN | UN | UN |
| **UF** [a,b] | **BRN** | UN? | UN? | UN? |
| **UF** [a,b] | **UN** | BRN? | BRN? | BRN? |
| **GSS** N(μ,σ²) | **EXP** Exp(λ) | UN | UN | UN |
| **GSS** N(μ,σ²) | **BRN** | BRN?/UN | BRN?/UN | BRN?/UN |
| **GSS** N(μ,σ²) | **UN** | UN | UN | UN |
| **EXP** Exp(λ) | **UN** | UN, estimate | UN, estimate | UN, estimate |
| **EXP** Exp(λ) | **BRN** | UN, estimate | UN, estimate | UN, estimate |
| **UN** | **BRN** | BRN/UN | BRN/UN | BRN/UN |
| **UN** | **UN** | BRN/UN | BRN/UN | BRN/UN |

Unit tests are included as well as integration tests.
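To make the UF/UF row of the comparison table concrete, here is one way a `<` comparison between two uniform inputs could produce a Bernoulli probability. It is a sketch under stated assumptions, not the PR's implementation: both inputs are treated as independent, continuous uniform variables on non-degenerate ranges, and the function names are hypothetical. It uses P(X < Y) = (G(d) - G(c)) / (d - c), where G is the integral of the CDF of X.

```rust
/// Integral from -infinity to `y` of the CDF of a continuous uniform
/// variable on [a, b] (requires a < b).
fn cdf_integral(y: f64, a: f64, b: f64) -> f64 {
    if y <= a {
        0.0
    } else if y <= b {
        (y - a).powi(2) / (2.0 * (b - a))
    } else {
        (b - a) / 2.0 + (y - b)
    }
}

/// Hypothetical helper: probability that X < Y for independent
/// X ~ U[x.0, x.1] and Y ~ U[y.0, y.1]. The result is the parameter p of
/// the resulting Bernoulli distribution. Returns None for empty or
/// single-point ranges, which this sketch does not handle.
pub fn prob_less_than(x: (f64, f64), y: (f64, f64)) -> Option<f64> {
    let (a, b) = x;
    let (c, d) = y;
    if !(a < b && c < d) {
        return None;
    }
    // Average the CDF of X over the range of Y.
    Some((cdf_integral(d, a, b) - cdf_integral(c, a, b)) / (d - c))
}
```

Disjoint ranges give a probability of exactly 0 or 1, and identical ranges give 0.5. Integer-valued columns and the equality operators would need separate treatment.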

@Fly-Style changed the title from "StatisticsV2: statistics framework redesign for Datafusion" to "StatisticsV2: statistics framework initial redesign for Datafusion" on Jan 16, 2025