Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add fisher z test #7

Merged
merged 15 commits into from
Apr 19, 2023
Merged

[ENH] Add fisher z test #7

merged 15 commits into from
Apr 19, 2023

Conversation

adam2392
Copy link
Collaborator

@adam2392 adam2392 commented Apr 11, 2023

Towards #5

Changes proposed in this pull request:

  • Adds partial correlation test
  • Setsup initial API design
  • Includes sphinx docs

@adam2392 adam2392 requested a review from bloebp April 11, 2023 14:26
Copy link
Member

@bloebp bloebp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't be part of this PR, but since it is still in a setup period, can we fix the build issues?

pywhy_stats/fisher_z_test.py Outdated Show resolved Hide resolved
tests/test_fisherz_test.py Outdated Show resolved Hide resolved
@bloebp
Copy link
Member

bloebp commented Apr 11, 2023

Can you also extend
https://github.com/py-why/pywhy-stats/blob/main/pywhy_stats/independence.py#L15
accordingly? For now, the AUTO would default to the fisher test.

Regarding the discussions before, lets try to introduce some pattern for the naming. Here for instance:
fisherz_indep and fisherz_condindep (maybe we can discuss a better naming pattern).

@adam2392
Copy link
Collaborator Author

Regarding the discussions before, lets try to introduce some pattern for the naming. Here for instance: fisherz_indep and fisherz_condindep (maybe we can discuss a better naming pattern).

Quick question regarding this now that I'm trying this out. FisherZ test that I have rn nicely implements this with the condition_on being optional. In the two function case, I actually have to split up the function.

What if we have a decorator pattern internally, or tags pattern, so that way a function can be marked where condition_on is supported or not?

@bloebp
Copy link
Member

bloebp commented Apr 11, 2023

Regarding the discussions before, lets try to introduce some pattern for the naming. Here for instance: fisherz_indep and fisherz_condindep (maybe we can discuss a better naming pattern).

Quick question regarding this now that I'm trying this out. FisherZ test that I have rn nicely implements this with the condition_on being optional. In the two function case, I actually have to split up the function.

What if we have a decorator pattern internally, or tags pattern, so that way a function can be marked where condition_on is supported or not?

I was more thinking of a very simple solution like:

def fisherz_indep(X, Y, ...):
   return fisherz(X,Y, ...)

def fisherz_condindep(X, Y, condition_on, ...):
   return fisherz(X,Y, condition_on,...)

The original fisherz implementation is still in there, no need to modify or split that.

Just so that one can always expect the pattern:
module_indep and module_conindep

We can of course still discuss the "indep" and "conindep" name.

@adam2392
Copy link
Collaborator Author

adam2392 commented Apr 11, 2023

I like the design for a few reasons:

  1. It is clean to read fisherz.ind and fisherz.condind
  2. consistent API

I have a few issues tho and hopefully we can figure out a way around them now before committing:

  1. we cannot type check high-level independence_test
  2. how to properly document the API? We either have a file-level documentation for the whole set of functions inside a file (similar to networkx), or we document each function separately. The cons of documenting each function separately is that there's a lot of repetition when looking at *.ind vs *.condind. It's pretty much exactly the same. The pros of documenting each separately are that we might have a bunch of tests without the *.condind ability.

Some thoughts on an alternative API design proposal

I'm trying to see if the following minor change would help us get the best of both worlds? If instead fisherz(...) was just a function that handles independence and optionally conditional independence, then we could have the following:

class Methods(Enum):
    """Methods for independence testing."""

    AUTO = 0

    # this is no longer a module, but a function
    FISHERZ = fisherz

def independence_test(
    X: ArrayLike,
    Y: ArrayLike,
    condition_on: Optional[ArrayLike] = None,
    method=Methods.AUTO,
    **kwargs,
) -> PValueResult:
    """Perform a (conditional) independence test to determine whether X and Y are independent.

    The test may be conditioned on an optional set of variables. This is, test whether ``X _||_ Y |
    condition_on``, where the null hypothesis is that X and Y are independent.

    Parameters
    ----------
    X : ArrayLike, shape (n_samples, n_features_x)
        Data matrix for X.
    Y : ArrayLike, shape (n_samples, n_features_y)
        Data matrix for Y.
    condition_on : ArrayLike or None, shape (n_samples, n_features_z), optional
        Data matrix for the conditioning variables. If None is given, an unconditional test
        is performed.
    method : Methods, optional
        Independence test method from the Methods enum. Default is Methods.AUTO, which will
        automatically select an appropriate method.
    **kwargs : dict or None, optional
        Additional keyword arguments to be passed to the specific test method

    Returns
    -------
    result : PValueResult
        An instance of the PValueResult data class, containing the p-value, test statistic,
        and any additional information related to the independence test.

    See Also
    --------
    fisherz : Fisher's Z test for independence
    """
    method_func: Callable[[ArrayLike, ArrayLike, Optional[ArrayLike]], PValueResult]
    if method == Methods.AUTO:
        method_func = Methods.FISHERZ
    else:
        method_func = method

    # do some error-checking
    if method_func = Methods.FisherZ:
          pval = scipy.stats.normaltest(X, Y, ...)
          if pval < 0.05: raise warning(your data is not all normal but you're trying to use partial correlation);

    return method_func(X, Y, condition_on, method, **kwargs)

Then, this independence_test(...) is just a place where we can do high-level checks when possible and provide an easy interface with default options. It is also simplified without the if/else check.

fisherz is a one-stop shop for documentation for partial correlation, rather than a two-stop shop with fisherz.ind and fisherz.condind.

All hypothesis tests that do not support condition_on would just raise a runtime error.

Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
@bloebp
Copy link
Member

bloebp commented Apr 11, 2023

For the ind and condind methods, that's right, we would have some (significant) overlap for the documentation. I would probably leave the general explanation on the file-level and just have a shorter one in the actual methods, like "Applies fisher z test. See blub for more details." and then only list the parameters (where copy and paste for them is fine). This is definitely not optimal, but I guess one would rather look on the documentation site to get more details of a method instead of reading the documentation in the python file. Just the parameters or "unusual" things should be flagged there that are not typical for the type of test.

Regarding the high level API, using callables there would be fine for me too, although we most likely still end up having if/elifs if we do want to perform certain checks or preprocessings depending on the methods. Also, having Callables there makes it a bit more difficult to pass kwargs and have a consistent typing at the same time. If you just call the function directly, you don't run into that issue. Otherwise, I really like where this is going with the additional test in case of fisherz for instance. This high level API is really meant for users who want to throw some data in a method and get a p-value. They could also fully configure the tests there, but it is more meant as a simplified helper function that mostly relies on the default parameters of the underlying method. For more advanced users, they would directly go to e.g. the fisher test and there you have the full API documentation.

Regarding the type checking of the high level API, I wouldn't bother. Just pass it to the methods, which should raise an error accordingly. One thing that is also possible, which we e.g. do in DoWhy, you just make sure that you transform all columns to something numeric (e.g., apply one hot encoding to categorical data).

robertness
robertness previously approved these changes Apr 13, 2023
Copy link
Collaborator

@robertness robertness left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cut and dry.

@sampan501
Copy link

Curious why you call Z in the conditional tests as condition_on and not just Z?

@bloebp
Copy link
Member

bloebp commented Apr 13, 2023

Curious why you call Z in the conditional tests as condition_on and not just Z?

To make it more explicit that these are the variables we condition on. It also reads clearer as independence(X, Y, condition_on=Z)

Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
@adam2392
Copy link
Collaborator Author

Kay, I made some fixes and an attempt to document the "file" instead of functions. I honestly have never done that before, so we prolly need to make sure this looks right so we don't have documentation issues later on.

@bloebp do you have a suggestion on how to format the documentation, so it's relatively nice and easy to read?

Signed-off-by: Adam Li <[email protected]>
Signed-off-by: Adam Li <[email protected]>
@codecov-commenter
Copy link

codecov-commenter commented Apr 15, 2023

Codecov Report

Merging #7 (a6ca007) into main (d2667af) will increase coverage by 75.40%.
The diff coverage is 69.38%.

@@            Coverage Diff             @@
##            main       #7       +/-   ##
==========================================
+ Coverage   0.00%   75.40%   +75.40%     
==========================================
  Files          3        4        +1     
  Lines         18       61       +43     
  Branches       0        9        +9     
==========================================
+ Hits           0       46       +46     
+ Misses        18       14        -4     
- Partials       0        1        +1     
Impacted Files Coverage Δ
pywhy_stats/independence.py 50.00% <39.13%> (+50.00%) ⬆️
pywhy_stats/fisherz.py 95.65% <95.65%> (ø)
pywhy_stats/pvalue_result.py 100.00% <100.00%> (ø)

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@adam2392
Copy link
Collaborator Author

Signed-off-by: Adam Li <[email protected]>
@adam2392 adam2392 requested a review from bloebp April 15, 2023 23:00
@adam2392 adam2392 requested a review from robertness April 15, 2023 23:01
@adam2392 adam2392 enabled auto-merge (squash) April 15, 2023 23:02
Signed-off-by: Adam Li <[email protected]>
Copy link
Member

@bloebp bloebp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just some minor comments

pywhy_stats/fisherz.py Outdated Show resolved Hide resolved
pywhy_stats/fisherz.py Outdated Show resolved Hide resolved
pywhy_stats/fisherz.py Show resolved Hide resolved
pywhy_stats/independence.py Outdated Show resolved Hide resolved
pywhy_stats/independence.py Outdated Show resolved Hide resolved
tests/test_fisherz_test.py Show resolved Hide resolved
Signed-off-by: Adam Li <[email protected]>
@adam2392 adam2392 requested a review from bloebp April 19, 2023 13:39
@adam2392
Copy link
Collaborator Author

Addressed comments, I will let you merge if happy @bloebp

@adam2392 adam2392 merged commit c45bbb8 into py-why:main Apr 19, 2023
@adam2392 adam2392 deleted the fisherz branch April 19, 2023 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants