Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Generalized chi-squared distribution #1806

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

heliosdrm
Copy link
Contributor

This extends the collection of univariate continuous distributions with the Generalized chi-squared. All required methods are added, and some of the recommended ones, as indicated in https://juliastats.org/Distributions.jl/stable/extends/#Univariate-Distribution

Some notes:

  1. I started considering Abhranil Das' code for Matlab (MIT-licensed) as reference, although progressively departed from it, so that eventually there is little in the Julia code that can be attributed to him. I have kept a comment where I mention the inspiration to define the cdf function, using Davies' algorithm to integrate the characteristic function. But I don't know if a more explicit attribution in the code or in the license should be made.
  2. There are several auxiliary functions that should not be part of the API, and I have encapsulated all of them in the module GChisqComputations, after the example of the Chernoff distribution.
  3. That module includes two functions to calculate the characteristic function: cf_explicit that implements the explicit formula, and cf_inherit that composes the results of cf(::Normal) and cf(::NoncentralChisq). Both are equivalent, and the code of the second is clearer, but cf(::GeneralizedChisq) calls the first one, because in a few tests that I have done, it is slightly (but not much) faster.
  4. cdf and pdf are calculated by integration, using the Gil-Pelaez theorem (the algorithm for the CDF is given in Davies' paper, and the algorithm for the PDF has been indirectly derived from it). The integration is done using QuadGK.jl, which was already in the dependencies of Distributions.jl. The default tolerances of QuadGK.quadgk made the calculation of cdf at the median very slow, because there the integral is zero. Since in the ends of the distribution the integral is ±π/2, which is in the order of magnitude of the unit, I have used a fixed absolute tolerance of eps(one(T)) for the integration in the calculation of the CDF. I have not, however, changed the defaults for the calculation of the PDF.
  5. quantile is calculated using Distributions.quantile_newton. However, there is no analytic solution to find the mode, which is the recommended starting point, so this is searched using the following strategy: (1) start at the mean of the distribution, and use it if it meets the convergence criteria; (2) otherwise define a bracket between that point and another at one standard deviation from the mean, in the direction towards the target probability (and if the bracket does not contain the target probability, extend it until it does); (3) bisect the bracket until one of the ends meets the convergence criteria, and then use that as the starting point. This seems to work fine, but it's a self-made algorithm, and perhaps there is a more efficient one that might be used instead.

@codecov-commenter
Copy link

codecov-commenter commented Nov 28, 2023

Codecov Report

Attention: Patch coverage is 96.71053% with 5 lines in your changes missing coverage. Please review.

Project coverage is 86.19%. Comparing base (7e232ca) to head (3ab5b06).
Report is 43 commits behind head on master.

Files with missing lines Patch % Lines
src/univariate/continuous/generalizedchisq.jl 96.71% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1806      +/-   ##
==========================================
+ Coverage   86.00%   86.19%   +0.18%     
==========================================
  Files         144      145       +1     
  Lines        8646     8798     +152     
==========================================
+ Hits         7436     7583     +147     
- Misses       1210     1215       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mahaa2
Copy link

mahaa2 commented Mar 16, 2024

When this will be added to the main branch ?

@mahaa2
Copy link

mahaa2 commented Mar 25, 2024

Hi,

I have played around with this extension.
As a Statistician, I strongly believe it is a very useful and very important distribution to have in main branch of the package Distribution.jl.

In general it seems to work very well and I will use for future research.

However, I notice that while doing my own tests the quadgk.jl package/function used inside the file generalizedchisq.jl
appears to fail accusing the error (line 260 of that file)

"
DomainError with 0.9999999999999964:
integrand produced NaN in the interval (0.9999999999999929, 1.0)
" ,

in the following explained cases.

Lets say that X ~ N_(µ, Σ) with particular vector dimension D = 2, 3, 4.
Then we now know that Y = q0 + q1'X + X'QX ~GChisq(w, ν, λ, m, s)
(the notation in the GChisq package is different, but also irrelevant here).

The parameter values (w, ν, λ, m, s) = T(q0, q1, Q, µ, Σ) are obtained accordingly with the transformation, say T,
following the paper (I present below my own code to replicate the experiments in Julia).

One can just run the below code and check the results. The error also seems to be exactly the same one appearing in here.

mutable struct parGChisq
    w::Vector{<: Real}
    ν::Vector{<: Real}
    λ::Vector{<: Real}
    m::Real
    s::Real
end

function mapGChisqpar(
    q0::Real,
    q1::Vector{<: Real},
    Q ::AbstractMatrix,
    µ ::Vector{<: Real},
    Σ ::AbstractMatrix,
    ch::Bool = false)::parGChisq
    
    """
     if   X ~ N(µ, Σ)
     then Υ = q0 + q1' X + X'QX ~ GChisq(w, ν, λ, m, s)
     
     see https://arxiv.org/pdf/2012.14331.pdf

     the option with ch = true
     does not work. why ? 
    """ 

    # check dimensionality match
    dimQ = size(Q, 1)
    dimq = length(q1)
    dimΣ = size(Σ, 1)
    dimµ = length(µ)

    dimQ !== dimq ? @error("dims mismatch Q, q1") : nothing
    dimΣ !== dimq ? @error("dims mismatch Σ, q1") : nothing
    dimΣ !== dimµ ? @error("dims mismatch Σ, µ") : nothing

    # check matrix conditions
    !issymmetric(Q) ? @error("Q is not symmetric") : nothing
    !isposdef(Σ)    ? @error("Σ is not PD") : nothing

    # take the square root matrix 
    S = !ch ? sqrt(Σ) : cholesky(Σ).L

    # standartize the space
    t_Q² = !ch ? Symmetric(S * Q * S) : Symmetric(S * Q * S')
    t_q1 = S * (2.0 * Q * µ + q1)
    t_q0 = q0 + dot(q1, µ) + dot(µ, Q, µ) 

    # get the generalized Chi-squared parameters
    d, R = eigen(t_Q²)
    d .= trunc.(d, digits = 9)
    # d[abs.(d) .< eps()] .= 0.0

    # compute b terms
    b = R' * t_q1

    # unique nonzeros eigenvalues
    # the unique function returns an ordered set
    iz = d .!= 0.0
    w  = unique(d[iz])
    ew = !isempty(w) 

    # warn if there are only zero eigenvalues
    ew ? nothing : @warn("Y will follow a Gaussian distribution") 

    # total dof of each nonzero eigenvalue
    ν = ew ? [sum(d .== i) for i in w] : Float64[]

    # total non-centrality for each nonzero eigenvalue
    λ = ew ? [sum(b[d .== x].^2.0) for x in w] ./ (4.0.*w.^2.0) : Float64[]
    
    # Gaussian parameters
    m = t_q0 - dot(w, λ)
    s = norm(b[.!iz]) 

    return parGChisq(w, ν, λ, m, s)
end


using Distributions, 
      LinearAlgebra,
      Plots


# tests
# dimension
D = 3 # 2 or 4

# set parameter of the const + linear + quadratic form
q0 = randn()
q1 = rand(D)
Q  = Diagonal(randn(D))
µ  = randn(D)
Σ  = rand(Wishart(D, Matrix(I, D, D)), 1)[];

# set the correspondent parameters of the GChisq
th = mapGChisqpar(q0, q1, Q, µ, Σ)

# set the GChisq distribution
π = GeneralizedChisq(th.w, th.ν, th.λ, th.m, th.s)

# test
@time pdf(π, mean(π))

@AdaemmerP
Copy link

Adding the generalized chi-squared distribution would be highly appreciated!

@heliosdrm
Copy link
Contributor Author

If there is anything blocking the merging of this PR that could be solved, I'd love to help. Regarding the issue reported by @mahaa2, maybe it is related to what I had mentioned:

The default tolerances of QuadGK.quadgk made the calculation of cdf at the median very slow, because there the integral is zero. Since in the ends of the distribution the integral is ±π/2, which is in the order of magnitude of the unit, I have used a fixed absolute tolerance of eps(one(T)) for the integration in the calculation of the CDF. I have not, however, changed the defaults for the calculation of the PDF.

I see that this problem also happens in other functions of the package, as kldivergence (#1443), and in that case @devmotion comments that it's "safer to throw an error than to return a possibly incorrect value". So I'm not sure what should be done in this case: (a) set another tolerance default for the PDF, (b) document the issue around median values, (c) devise a workaround to change the starting point of the integration if this situation is detected, or (d) just leave it as it is.

Let aside that, there were some CI errors for nightly, but those also happened to other unrelated PR when this one was submitted, so I think they are not related to real issues in the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants