[ENH] Add fisher z test (#7)
Towards #5 

Changes proposed in this pull request:
- Adds partial correlation test
- Sets up initial API design
- Includes sphinx docs

---------

Signed-off-by: Adam Li <[email protected]>
adam2392 authored Apr 19, 2023
1 parent d2667af commit c45bbb8
Showing 15 changed files with 330 additions and 95 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -63,7 +63,7 @@ jobs:
sudo apt install libspatialindex-dev xdg-utils
- python/install-packages:
pkg-manager: poetry
args: "-E graph_func -E viz --with docs"
args: "--with docs"
cache-version: "v1" # change to clear cache
- run:
name: Check poetry package versions
@@ -145,12 +145,12 @@ jobs:
- run:
name: make linkcheck
command: |
make -C doc linkcheck
poetry run make -C doc linkcheck
- run:
name: make linkcheck-grep
when: always
command: |
make -C doc linkcheck-grep
poetry run make -C doc linkcheck-grep
- store_artifacts:
path: doc/_build/linkcheck
destination: linkcheck
1 change: 0 additions & 1 deletion doc/_templates/autosummary/class.rst
@@ -4,7 +4,6 @@
.. currentmodule:: {{ module }}

.. autoclass:: {{ objname }}
:special-members: __contains__,__getitem__,__iter__,__len__,__add__,__sub__,__mul__,__div__,__neg__,__hash__
:members:

.. include:: {{module}}.{{objname}}.examples
40 changes: 37 additions & 3 deletions doc/api.rst
@@ -22,10 +22,44 @@ Pyhy-Stats experimentally provides an interface for conditional independence
testing and conditional discrepancy testing (also known as k-sample conditional
independence testing).

Conditional Independence Testing
================================
High-level Independence Testing
===============================

The easiest way to run a (conditional) independence test is to use the
:py:func:`independence_test` function. This function takes inputs and
will try to automatically pick the appropriate test based on the input.

Note: this is only meant for beginners, and the result should be interpreted
with caution, as the ability to automatically choose the optimal test is limited.
Using the wrong test for the type of data and assumptions at hand typically
reduces statistical power.

.. currentmodule:: pywhy_stats
.. autosummary::
   :toctree: generated/

   independence_test
   Methods


All independence tests return a ``PValueResult`` object, which
contains the p-value, the test statistic, and optionally additional information.

.. currentmodule:: pywhy_stats.pvalue_result
.. autosummary::
   :toctree: generated/

   PValueResult

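For intuition, such a result container could be sketched as a simple dataclass. The field names below are guesses based on the description above, not the library's actual definition:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PValueResult:
    """Result container; field names are assumed from the description above."""
    pvalue: float
    statistic: Optional[float] = None
    additional_information: dict = field(default_factory=dict)

res = PValueResult(pvalue=0.04, statistic=2.05)
print(res.pvalue < 0.05)  # → True
```
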
(Conditional) Independence Testing
==================================

Testing for conditional independence among variables is a core part
of many data analysis procedures.

TBD...

.. currentmodule:: pywhy_stats
.. autosummary::
   :toctree: generated/

   fisherz

51 changes: 14 additions & 37 deletions doc/conditional_independence.rst
@@ -4,7 +4,7 @@
Independence
============

.. currentmodule:: pywhy_stats.ci
.. currentmodule:: pywhy_stats

Probabilistic independence between two random variables holds when the realization of one
variable does not affect the distribution of the other. It is a fundamental notion
@@ -42,10 +42,10 @@ with certain assumptions on the underlying data distribution.

Conditional Mutual Information
------------------------------
Conditional mutual information (CMI) is a general formulation of CI, where CMI is defined as:

.. math::

    \int \log \frac{p(x, y | z)}{p(x | z) p(y | z)}

As we can see, CMI is equal to 0, if and only if :math:`p(x, y | z) = p(x | z) p(y | z)`, which
is exactly the definition of CI. CMI is completely non-parametric and thus requires no assumptions
@@ -70,24 +70,18 @@ various proposals in the literature for estimating CMI, which we summarize here:
one can use variants of Random Forests to generate adaptive nearest-neighbor estimates in high-dimensions
or on manifolds, such that the KSG estimator is still powerful.

.. autosummary::
   :toctree: generated/

   CMITest

<TBD>

- The Classifier Divergence approach estimates CMI using a classification model.

.. autosummary::
   :toctree: generated/

   ClassifierCMITest

<TBD>

- Direct posterior estimates can be implemented with a classification model by directly
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.

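As a rough illustration of the KSG-style k-NN estimation mentioned above, a minimal CMI estimator for 1-D variables might look like this. It is a sketch only, with no hyperparameter tuning, adaptive neighborhoods, or bias correction:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_cmi(x, y, z, k=5):
    """KSG-style k-NN estimate of I(X; Y | Z) for 1-D inputs (illustrative)."""
    n = len(x)
    z = z.reshape(n, -1)
    xz = np.column_stack([x, z])
    yz = np.column_stack([y, z])
    xyz = np.column_stack([x, y, z])
    # Chebyshev distance to the k-th nearest neighbor in the joint space
    dists, _ = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)
    eps = dists[:, -1] - 1e-12
    # count strict neighbors within eps in each marginal subspace (minus self)
    n_xz = cKDTree(xz).query_ball_point(xz, eps, p=np.inf, return_length=True) - 1
    n_yz = cKDTree(yz).query_ball_point(yz, eps, p=np.inf, return_length=True) - 1
    n_z = cKDTree(z).query_ball_point(z, eps, p=np.inf, return_length=True) - 1
    return digamma(k) - np.mean(digamma(n_xz + 1) + digamma(n_yz + 1) - digamma(n_z + 1))

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)
print(abs(ksg_cmi(x, y, z)) < 0.4)  # x and y are independent given z → True
```
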
Partial (Pearson) Correlation
-----------------------------
:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
--------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
@@ -96,7 +90,7 @@ which may be unrealistic in certain datasets.
.. autosummary::
   :toctree: generated/

   FisherZCITest
   fisherz

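For intuition, a self-contained sketch of the Fisher z-transform partial correlation test might look as follows. This is not the ``fisherz`` implementation itself, just the standard recipe: regress out the conditioning set, correlate the residuals, and apply the z-transform:

```python
import numpy as np
from scipy.stats import norm

def fisherz_pvalue(x, y, z=None):
    """Partial-correlation CI test via Fisher's z-transform (Gaussian data)."""
    n = len(x)
    dim_z = 0
    rx, ry = x, y
    if z is not None:
        Z = np.column_stack([np.ones(n), z])  # add an intercept column
        rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
        dim_z = Z.shape[1] - 1
    r = np.corrcoef(rx, ry)[0, 1]
    zstat = 0.5 * np.log((1 + r) / (1 - r))   # Fisher z-transform
    stat = np.sqrt(n - dim_z - 3) * abs(zstat)
    return 2 * (1 - norm.cdf(stat))           # two-sided p-value

rng = np.random.default_rng(0)
z = rng.standard_normal(500)
x = z + rng.standard_normal(500)
y = z + rng.standard_normal(500)
# x and y are marginally dependent (through z), so the unconditional test rejects
print(fisherz_pvalue(x, y) < 0.05)  # → True
```
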
Discrete, Categorical and Binary Data
-------------------------------------
@@ -105,10 +99,6 @@ class of tests will construct a contingency table based on the number of levels
each discrete variable. An exponential amount of data is needed for increasing levels
for a discrete variable.

.. autosummary::
   :toctree: generated/

   GSquareCITest

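To illustrate contingency-table testing, one can run a G-test (log-likelihood ratio) with ``scipy.stats.chi2_contingency``. This is an analogue of the approach described above, not the package's own implementation:

```python
import numpy as np
from scipy.stats import chi2_contingency

# counts for two binary variables; rows index one variable, columns the other
table = np.array([[30, 10],
                  [10, 30]])
# lambda_="log-likelihood" gives the G-test rather than Pearson's chi-square
g_stat, pvalue, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
print(dof, pvalue < 0.05)  # → 1 True
```
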
Kernel-Approaches
-----------------
@@ -118,10 +108,6 @@ that computes a test statistic from kernels of the data and uses permutation tes
generate samples from the null distribution :footcite:`Zhang2011`, which are then used to
estimate a pvalue.

.. autosummary::
   :toctree: generated/

   KernelCITest

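The unconditional analogue of this idea, an HSIC permutation test, shows the mechanics in a few lines. The conditional version additionally regresses out the conditioning set in kernel space, which is omitted from this sketch:

```python
import numpy as np

def _rbf_gram(a, sigma=1.0):
    """RBF kernel Gram matrix for a 1-D sample."""
    return np.exp(-np.subtract.outer(a, a) ** 2 / (2 * sigma ** 2))

def hsic_perm_test(x, y, n_perm=200, seed=0):
    """HSIC independence test with a permutation null (unconditional sketch)."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ _rbf_gram(x) @ H
    L = _rbf_gram(y)
    stat = np.sum(Kc * L) / n ** 2        # biased HSIC estimate
    rng = np.random.default_rng(seed)
    # permute y's Gram matrix to draw samples from the null distribution
    null = np.array([
        np.sum(Kc * L[np.ix_(p, p)]) / n ** 2
        for p in (rng.permutation(n) for _ in range(n_perm))
    ])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = x + 0.1 * rng.standard_normal(200)
print(hsic_perm_test(x, y) < 0.05)  # strong dependence → True
```
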
Classifier-based Approaches
---------------------------
@@ -142,16 +128,11 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generate
conditionally independent dataset.


.. autosummary::
   :toctree: generated/

   ClassifierCITest

=======================
Conditional Discrepancy
=======================

.. currentmodule:: pywhy_stats.cd
.. currentmodule:: pywhy_stats

Conditional discrepancy (CD) is another form of conditional invariance that may be exhibited by data. The
general question is whether or not the following two distributions are equal:
@@ -181,10 +162,6 @@ that computes a test statistic from kernels of the data and uses a weighted perm
based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Park2021conditional`, which are then used to estimate a pvalue.

.. autosummary::
   :toctree: generated/

   KernelCDTest

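A plain (unweighted) kernel two-sample test conveys the basic mechanics of kernel-based discrepancy testing; the propensity-score weighting described above is omitted from this sketch:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased squared-MMD estimate with an RBF kernel."""
    def k(a, b):
        return np.exp(-np.subtract.outer(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def mmd_perm_test(x, y, n_perm=200, seed=0):
    """Two-sample test: permute the pooled samples to form the null."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    stat, nx = mmd2(x, y), len(x)
    null = np.array([
        mmd2(pooled[p[:nx]], pooled[p[nx:]])
        for p in (rng.permutation(len(pooled)) for _ in range(n_perm))
    ])
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

rng = np.random.default_rng(0)
a = rng.standard_normal(150)
b = rng.standard_normal(150) + 1.5   # shifted mean → different distribution
print(mmd_perm_test(a, b) < 0.05)  # → True
```
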
Bregman-Divergences
-------------------
@@ -193,7 +170,7 @@ that computes a test statistic from estimated Von-Neumann divergences of the dat
weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.

.. autosummary::
   :toctree: generated/

   BregmanCDTest

==========
References
==========
.. footbibliography::
21 changes: 16 additions & 5 deletions doc/conf.py
@@ -3,11 +3,13 @@
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
from __future__ import annotations

import os
import sys
from datetime import datetime

import numpy.typing
import sphinx_gallery # noqa: F401
from sphinx_gallery.sorting import ExampleTitleSortKey

@@ -70,7 +72,11 @@
autosummary_generate = True

autodoc_default_options = {"inherited-members": None}
autodoc_typehints = "signature"

# whether to expand type hints in function/class signatures
autodoc_typehints = "none"

add_module_names = False

# -- numpydoc
# Below is needed to prevent errors
@@ -109,9 +115,6 @@
"dictionary",
"no",
"attributes",
# numpy
"ScalarType",
"ArrayLike",
# shapes
"n_times",
"obj",
@@ -123,7 +126,6 @@
"n_samples",
"n_variables",
"n_classes",
"NDArray",
"n_samples_X",
"n_samples_Y",
"n_features_x",
@@ -141,11 +143,20 @@
"pgmpy.models.BayesianNetwork": "pgmpy.models.BayesianNetwork",
# joblib
"joblib.Parallel": "joblib.Parallel",
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
# from __future__ import annotations
# autodoc_type_aliases = {
# 'Iterable': 'Iterable',
# 'ArrayLike': 'ArrayLike'
# }
default_role = "literal"

# Tell myst-parser to assign header anchors for h1-h3.
3 changes: 1 addition & 2 deletions doc/index.rst
@@ -25,15 +25,14 @@ Contents
Reference API<api>
Simple Examples<use>
User Guide<user_guide>
tutorials/index
whats_new

.. toctree::
:hidden:
:caption: Development

License <https://raw.githubusercontent.com/py-why/pywhy-stats/main/LICENSE>
Contributing <https://github.com/py-why/pywhy-stats/main/CONTRIBUTING.md>
Contributing <https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md>

Team
----
22 changes: 13 additions & 9 deletions doc/installation.md
@@ -3,26 +3,30 @@ Installation

**pywhy-stats** supports Python >= 3.8.

## Installing with ``pip``, or ``poetry``.
Installing with ``pip``, or ``poetry``
--------------------------------------

**pywhy-stats** is available [on PyPI](https://pypi.org/project/pywhy-stats/). Just run

pip install pywhy-stats
>>> pip install pywhy-stats

# or via poetry (recommended)
poetry add pywhy-stats
>>> # or via poetry (recommended)
>>> poetry add pywhy-stats

## Installing from source
Installing from source
----------------------

To install **pywhy-stats** from source, first clone [the repository](https://github.com/py-why/pywhy-stats):

git clone https://github.com/py-why/pywhy-stats.git
cd pywhy-stats

>>> git clone https://github.com/py-why/pywhy-stats.git
>>> cd pywhy-stats

Then run installation via poetry (recommended)

poetry install

>>> poetry install

or via pip

pip install -e .
>>> pip install -e .
2 changes: 1 addition & 1 deletion doc/use.rst
@@ -1,7 +1,7 @@
:orphan:

Examples and Tutorials using pywhy-stats
=======================================
========================================

To be able to effectively use pywhy-stats, you can look at some of the basic examples here
to learn everything you need from concepts to explicit code examples.
1 change: 1 addition & 0 deletions doc/whats_new/v0.1.rst
@@ -26,6 +26,7 @@ Version 0.1
Changelog
---------

- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)


Code and Documentation Contributors