[ENH] Add fisher z test (#7)
Towards #5 

Changes proposed in this pull request:
- Adds partial correlation test
- Sets up initial API design
- Includes sphinx docs

---------

Signed-off-by: Adam Li <[email protected]>
adam2392 authored Apr 19, 2023
1 parent d2667af commit c45bbb8
Showing 15 changed files with 330 additions and 95 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -63,7 +63,7 @@ jobs:
sudo apt install libspatialindex-dev xdg-utils
- python/install-packages:
pkg-manager: poetry
args: "-E graph_func -E viz --with docs"
args: "--with docs"
cache-version: "v1" # change to clear cache
- run:
name: Check poetry package versions
@@ -145,12 +145,12 @@ jobs:
- run:
name: make linkcheck
command: |
make -C doc linkcheck
poetry run make -C doc linkcheck
- run:
name: make linkcheck-grep
when: always
command: |
make -C doc linkcheck-grep
poetry run make -C doc linkcheck-grep
- store_artifacts:
path: doc/_build/linkcheck
destination: linkcheck
1 change: 0 additions & 1 deletion doc/_templates/autosummary/class.rst
@@ -4,7 +4,6 @@
.. currentmodule:: {{ module }}

.. autoclass:: {{ objname }}
:special-members: __contains__,__getitem__,__iter__,__len__,__add__,__sub__,__mul__,__div__,__neg__,__hash__
:members:

.. include:: {{module}}.{{objname}}.examples
40 changes: 37 additions & 3 deletions doc/api.rst
@@ -22,10 +22,44 @@ Pyhy-Stats experimentally provides an interface for conditional independence
testing and conditional discrepancy testing (also known as k-sample conditional
independence testing).

Conditional Independence Testing
================================
High-level Independence Testing
===============================

The easiest way to run a (conditional) independence test is to use the
:py:func:`independence_test` function. This function takes inputs and
will try to automatically pick the appropriate test based on the input.

Note: this is only meant for beginners, and the result should be interpreted
with caution, as the ability to automatically choose the optimal test is limited.
Using the wrong test for the type of data and assumptions at hand typically
reduces statistical power.

.. currentmodule:: pywhy_stats
.. autosummary::
   :toctree: generated/

   independence_test
   Methods


All independence tests return a ``PValueResult`` object, which
contains the p-value, the test statistic, and optionally additional information.

.. currentmodule:: pywhy_stats.pvalue_result
.. autosummary::
   :toctree: generated/

   PValueResult

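For intuition, such a result container could be sketched as a simple dataclass. The field names below are guesses based on the description above, not the library's actual definition:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PValueResult:
    """Result container; field names are assumed from the description above."""
    pvalue: float
    statistic: Optional[float] = None
    additional_information: dict = field(default_factory=dict)

res = PValueResult(pvalue=0.04, statistic=2.05)
print(res.pvalue < 0.05)  # → True
```
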
(Conditional) Independence Testing
==================================

Testing for conditional independence among variables is a core part
of many data analysis procedures.

TBD...

.. currentmodule:: pywhy_stats
.. autosummary::
   :toctree: generated/

   fisherz

51 changes: 14 additions & 37 deletions doc/conditional_independence.rst
@@ -4,7 +4,7 @@
Independence
============

.. currentmodule:: pywhy_stats.ci
.. currentmodule:: pywhy_stats

Probabilistic independence between two random variables holds when the realization of one
variable does not affect the distribution of the other. It is a fundamental notion
@@ -42,10 +42,10 @@ with certain assumptions on the underlying data distribution.

Conditional Mutual Information
------------------------------
Conditional mutual information (CMI) is a general formulation of CI, where CMI is defined as:

.. math::

    \int \log \frac{p(x, y | z)}{p(x | z) p(y | z)}

As we can see, CMI is equal to 0, if and only if :math:`p(x, y | z) = p(x | z) p(y | z)`, which
is exactly the definition of CI. CMI is completely non-parametric and thus requires no assumptions
@@ -70,24 +70,18 @@ various proposals in the literature for estimating CMI, which we summarize here:
one can use variants of Random Forests to generate adaptive nearest-neighbor estimates in high-dimensions
or on manifolds, such that the KSG estimator is still powerful.

.. autosummary::
   :toctree: generated/

   CMITest

<TBD>

- The Classifier Divergence approach estimates CMI using a classification model.

.. autosummary::
   :toctree: generated/

   ClassifierCMITest

<TBD>

- Direct posterior estimates can be implemented with a classification model by directly
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.

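As a rough illustration of the KSG-style k-NN estimation mentioned above, a minimal CMI estimator for 1-D variables might look like this. It is a sketch only, with no hyperparameter tuning, adaptive neighborhoods, or bias correction:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_cmi(x, y, z, k=5):
    """KSG-style k-NN estimate of I(X; Y | Z) for 1-D inputs (illustrative)."""
    n = len(x)
    z = z.reshape(n, -1)
    xz = np.column_stack([x, z])
    yz = np.column_stack([y, z])
    xyz = np.column_stack([x, y, z])
    # Chebyshev distance to the k-th nearest neighbor in the joint space
    dists, _ = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)
    eps = dists[:, -1] - 1e-12
    # count strict neighbors within eps in each marginal subspace (minus self)
    n_xz = cKDTree(xz).query_ball_point(xz, eps, p=np.inf, return_length=True) - 1
    n_yz = cKDTree(yz).query_ball_point(yz, eps, p=np.inf, return_length=True) - 1
    n_z = cKDTree(z).query_ball_point(z, eps, p=np.inf, return_length=True) - 1
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)
print(abs(ksg_cmi(x, y, z)) < 0.4)  # x and y are independent given z → True
```
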
Partial (Pearson) Correlation
-----------------------------
:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
--------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
@@ -96,7 +90,7 @@ which may be unrealistic in certain datasets.
.. autosummary::
   :toctree: generated/

   FisherZCITest
   fisherz

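For intuition, a self-contained sketch of the Fisher z-transform partial correlation test might look as follows. This is not the ``fisherz`` implementation itself, just the standard recipe: regress out the conditioning set, correlate the residuals, and apply the z-transform:

```python
import numpy as np
from scipy.stats import norm

def fisherz_pvalue(x, y, z=None):
    """Partial-correlation CI test via Fisher's z-transform (Gaussian data)."""
    n = len(x)
    dim_z = 0
    rx, ry = x, y
    if z is not None:
        Z = np.column_stack([np.ones(n), z])  # add an intercept column
        rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        dim_z = Z.shape[1] - 1
    r = np.corrcoef(rx, ry)[0, 1]
    zstat = 0.5 * np.log((1 + r) / (1 - r))   # Fisher z-transform
    stat = np.sqrt(n - dim_z - 3) * abs(zstat)
    return 2 * (1 - norm.cdf(stat))           # two-sided p-value

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)
# x and y are marginally dependent (through z), so the unconditional test rejects
print(fisherz_pvalue(x, y) < 0.05)  # → True
```
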
Discrete, Categorical and Binary Data
-------------------------------------
@@ -105,10 +99,6 @@ class of tests will construct a contingency table based on the number of levels
each discrete variable. An exponential amount of data is needed for increasing levels
for a discrete variable.

.. autosummary::
   :toctree: generated/

   GSquareCITest

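To illustrate contingency-table testing, one can run a G-test (log-likelihood ratio) with ``scipy.stats.chi2_contingency``. This is an analogue of the approach described above, not the package's own implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# counts for two binary variables; rows index one variable, columns the other
table = np.array([[30, 10],
                  [10, 30]])
# lambda_="log-likelihood" gives the G-test rather than Pearson's chi-square
g_stat, pvalue, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
print(dof, pvalue < 0.05)  # → 1 True
```
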
Kernel-Approaches
-----------------
@@ -118,10 +108,6 @@ that computes a test statistic from kernels of the data and uses permutation tes
generate samples from the null distribution :footcite:`Zhang2011`, which are then used to
estimate a pvalue.

.. autosummary::
   :toctree: generated/

   KernelCITest

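The unconditional analogue of this idea, an HSIC permutation test, shows the mechanics in a few lines. The conditional version additionally regresses out the conditioning set in kernel space, which is omitted from this sketch:

```python
import numpy as np

def _rbf_gram(a, sigma=1.0):
    """RBF kernel Gram matrix for a 1-D sample."""
    return np.exp(-np.subtract.outer(a, a) ** 2 / (2 * sigma ** 2))

def hsic_perm_test(x, y, n_perm=200, seed=0):
    """HSIC independence test with a permutation null (unconditional sketch)."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ _rbf_gram(x) @ H
    L = _rbf_gram(y)
    stat = np.sum(Kc * L) / n ** 2        # biased HSIC estimate
    rng = np.random.default_rng(seed)
    # permute y's Gram matrix to draw samples from the null distribution
    null = np.array([
        np.sum(Kc * L[np.ix_(p, p)]) / n ** 2
        for p in (rng.permutation(n) for _ in range(n_perm))
    ])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = x + 0.1 * rng.standard_normal(200)
print(hsic_perm_test(x, y) < 0.05)  # strong dependence → True
```
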
Classifier-based Approaches
---------------------------
@@ -142,16 +128,11 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generate
conditionally independent dataset.


.. autosummary::
   :toctree: generated/

   ClassifierCITest

=======================
Conditional Discrepancy
=======================

.. currentmodule:: pywhy_stats.cd
.. currentmodule:: pywhy_stats

Conditional discrepancy (CD) is another form of conditional invariance that may be exhibited by data. The
general question is whether or not the following two distributions are equal:
@@ -181,10 +162,6 @@ that computes a test statistic from kernels of the data and uses a weighted perm
based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Park2021conditional`, which are then used to estimate a pvalue.

.. autosummary::
   :toctree: generated/

   KernelCDTest

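A plain (unweighted) kernel two-sample test conveys the basic mechanics of kernel-based discrepancy testing; the propensity-score weighting described above is omitted from this sketch:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate with an RBF kernel."""
    def k(a, b):
        return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_perm_test(x, y, n_perm=200, seed=0):
    """Two-sample test: permute the pooled samples to form the null."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    stat, nx = mmd2(x, y), len(x)
    null = np.array([
        mmd2(pooled[p[:nx]], pooled[p[nx:]])
        for p in (rng.permutation(len(pooled)) for _ in range(n_perm))
    ])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
a = rng.standard_normal(150)
b = rng.standard_normal(150) + 1.5   # shifted mean → different distribution
print(mmd_perm_test(a, b) < 0.05)  # → True
```
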
Bregman-Divergences
-------------------
@@ -193,7 +170,7 @@ that computes a test statistic from estimated Von-Neumann divergences of the dat
weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.

.. autosummary::
   :toctree: generated/

   BregmanCDTest

==========
References
==========
.. footbibliography::
21 changes: 16 additions & 5 deletions doc/conf.py
@@ -3,11 +3,13 @@
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
from __future__ import annotations

import os
import sys
from datetime import datetime

import numpy.typing
import sphinx_gallery # noqa: F401
from sphinx_gallery.sorting import ExampleTitleSortKey

@@ -70,7 +72,11 @@
autosummary_generate = True

autodoc_default_options = {"inherited-members": None}
autodoc_typehints = "signature"

# whether to expand type hints in function/class signatures
autodoc_typehints = "none"

add_module_names = False

# -- numpydoc
# Below is needed to prevent errors
@@ -109,9 +115,6 @@
"dictionary",
"no",
"attributes",
# numpy
"ScalarType",
"ArrayLike",
# shapes
"n_times",
"obj",
@@ -123,7 +126,6 @@
"n_samples",
"n_variables",
"n_classes",
"NDArray",
"n_samples_X",
"n_samples_Y",
"n_features_x",
@@ -141,11 +143,20 @@
"pgmpy.models.BayesianNetwork": "pgmpy.models.BayesianNetwork",
# joblib
"joblib.Parallel": "joblib.Parallel",
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
# from __future__ import annotations
# autodoc_type_aliases = {
# 'Iterable': 'Iterable',
# 'ArrayLike': 'ArrayLike'
# }
default_role = "literal"

# Tell myst-parser to assign header anchors for h1-h3.
3 changes: 1 addition & 2 deletions doc/index.rst
@@ -25,15 +25,14 @@ Contents
Reference API<api>
Simple Examples<use>
User Guide<user_guide>
tutorials/index
whats_new

.. toctree::
:hidden:
:caption: Development

License <https://raw.githubusercontent.com/py-why/pywhy-stats/main/LICENSE>
Contributing <https://github.com/py-why/pywhy-stats/main/CONTRIBUTING.md>
Contributing <https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md>

Team
----
22 changes: 13 additions & 9 deletions doc/installation.md
@@ -3,26 +3,30 @@ Installation

**pywhy-stats** supports Python >= 3.8.

## Installing with ``pip``, or ``poetry``.
Installing with ``pip``, or ``poetry``
--------------------------------------

**pywhy-stats** is available [on PyPI](https://pypi.org/project/pywhy-stats/). Just run

pip install pywhy-stats
>>> pip install pywhy-stats

# or via poetry (recommended)
poetry add pywhy-stats
>>> # or via poetry (recommended)
>>> poetry add pywhy-stats

## Installing from source
Installing from source
----------------------

To install **pywhy-stats** from source, first clone [the repository](https://github.com/py-why/pywhy-stats):

git clone https://github.com/py-why/pywhy-stats.git
cd pywhy-stats

>>> git clone https://github.com/py-why/pywhy-stats.git
>>> cd pywhy-stats

Then run installation via poetry (recommended)

poetry install

>>> poetry install

or via pip

pip install -e .
>>> pip install -e .
2 changes: 1 addition & 1 deletion doc/use.rst
@@ -1,7 +1,7 @@
:orphan:

Examples and Tutorials using pywhy-stats
=======================================
========================================

To be able to effectively use pywhy-stats, you can look at some of the basic examples here
to learn everything you need from concepts to explicit code examples.
1 change: 1 addition & 0 deletions doc/whats_new/v0.1.rst
@@ -26,6 +26,7 @@ Version 0.1
Changelog
---------

- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)


Code and Documentation Contributors