first version of joss paper, scipy implementation geometric distribution
sebastianherreramonterrosa committed Oct 4, 2024
1 parent 3ae4f00 commit 8d64800
Showing 14 changed files with 248 additions and 123 deletions.
226 changes: 113 additions & 113 deletions examples/fit_accelerate.ipynb

Large diffs are not rendered by default.

51 changes: 51 additions & 0 deletions paper/paper.bib
@@ -0,0 +1,51 @@
@article{marsaglia2004evaluating,
title = {Evaluating the {A}nderson-{D}arling distribution},
author = {Marsaglia, George and Marsaglia, John},
journal = {Journal of Statistical Software},
volume = {9},
pages = {1--5},
year = {2004}
}

@book{walck1996hand,
title = {Hand-book on statistical distributions for experimentalists},
author = {Walck, Christian and others},
year = {1996},
publisher = {Stockholms universitet}
}

@article{george2011estimation,
title = {Estimation of parameters of Johnson's system of distributions},
author = {George, Florence and Ramachandran, KM},
journal = {Journal of Modern Applied Statistical Methods},
volume = {10},
pages = {494--504},
year = {2011}
}

@article{sinclair1988approximations,
title = {Approximations to the distribution function of the {A}nderson-{D}arling test statistic},
author = {Sinclair, CD and Spurr, BD},
journal = {Journal of the American Statistical Association},
volume = {83},
number = {404},
pages = {1190--1191},
year = {1988},
publisher = {Taylor \& Francis}
}

@book{mclaughlin2001compendium,
title = {A compendium of common probability distributions},
author = {McLaughlin, Michael P},
year = {2001},
publisher = {Michael P. McLaughlin}
}

@article{lewis1961distribution,
title = {Distribution of the Anderson-Darling statistic},
author = {Lewis, Peter AW},
journal = {The Annals of Mathematical Statistics},
pages = {1118--1124},
year = {1961},
publisher = {JSTOR}
}
73 changes: 73 additions & 0 deletions paper/paper.md
@@ -0,0 +1,73 @@
---
title: "Phitter: A Python Library for Probability Distribution Fitting and Analysis"
tags:
- Python
- Statistics
- probability distributions
- data analysis
- machine learning
- simulation
- monte carlo
authors:
- name: Sebastián José Herrera Monterrosa
orcid: 0009-0002-2766-642X
affiliation: 1
affiliations:
- name: Pontificia Universidad Javeriana
index: 1
date: 26 March 2024
bibliography: paper.bib
---

# Summary

Phitter is a Python library that analyzes datasets and identifies the analytical probability distributions that best represent them. It provides tools for fitting and analyzing over 80 probability distributions, both continuous and discrete. Phitter implements three goodness-of-fit tests (Chi-Square, Kolmogorov-Smirnov, and Anderson-Darling) and offers interactive visualizations to support the analysis. For each selected distribution, Phitter provides a standard modeling guide and detailed spreadsheets that outline how to use the distribution in fields such as data science, operations research, and artificial intelligence.

# Statement of Need

In the fields of data science, statistics, and machine learning, understanding the underlying probability distributions of datasets is crucial for accurate modeling and prediction. However, identifying the most appropriate distribution for a given dataset can be a complex and time-consuming task. Phitter addresses this need by providing a user-friendly, efficient, and comprehensive tool for probability distribution fitting and analysis.

Phitter stands out from existing tools by offering:

1. A wide range of over 80 probability distributions, including both continuous and discrete options.
2. Implementation of multiple goodness-of-fit tests (Chi-Square, Kolmogorov-Smirnov, and Anderson-Darling).
3. Interactive visualizations for better understanding and interpretation of results.
4. Accelerated fitting capabilities for large datasets (over 100K samples).
5. Detailed modeling guides and spreadsheets for practical application in various fields.
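
To make the second item concrete, the following minimal sketch computes a Kolmogorov-Smirnov statistic for a sample against a hypothesized exponential distribution, using only the Python standard library. It illustrates the test statistic itself and is not Phitter's implementation; the helper names `exponential_cdf` and `ks_statistic`, the rate values, and the synthetic sample are all assumptions made for this example.

```python
import math
import random


def exponential_cdf(x: float, rate: float) -> float:
    """CDF of an exponential distribution (hypothetical helper, not Phitter's API)."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


def ks_statistic(data: list[float], cdf) -> float:
    """Largest vertical distance between the empirical CDF and a model CDF."""
    ordered = sorted(data)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        model = cdf(x)
        # The ECDF jumps from i/n to (i + 1)/n at x, so check both sides of the step.
        distance = max(distance, (i + 1) / n - model, model - i / n)
    return distance


# Synthetic data drawn from an exponential distribution with rate 2.0.
rng = random.Random(42)
sample = [rng.expovariate(2.0) for _ in range(1000)]

good_fit = ks_statistic(sample, lambda x: exponential_cdf(x, rate=2.0))
poor_fit = ks_statistic(sample, lambda x: exponential_cdf(x, rate=0.5))
print(f"KS statistic, rate=2.0: {good_fit:.4f}")
print(f"KS statistic, rate=0.5: {poor_fit:.4f}")
```

A smaller statistic indicates a closer match between the data and the candidate distribution, which is the kind of quantity a fitting tool can use to rank candidates.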

# Features and Functionality

Phitter offers a range of features designed to streamline the process of probability distribution analysis:

- **Flexible Fitting**: Users can fit both continuous and discrete distributions to their data.
- **Customizable Analysis**: Options to specify the number of bins, confidence level, and distributions to fit.
- **Parallel Processing**: Support for multi-threaded fitting to improve performance.
- **Comprehensive Output**: Detailed summaries of fitted distributions, including parameters, test statistics, and rankings.
- **Visualization Tools**: Functions to plot histograms, PDFs, ECDFs, and Q-Q plots for visual analysis.
- **Distribution Utilities**: Methods to work with individual distributions, including CDF, PDF, PPF, and sampling functions.
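
As a rough illustration of how the CDF, PDF, PPF, and sampling functions named in the last item relate to one another, the sketch below implements them for an exponential distribution with the standard library. The class `ExponentialSketch` is a hypothetical stand-in written for this example, not one of Phitter's distribution classes, and its method signatures are assumptions.

```python
import math
import random
from typing import Optional


class ExponentialSketch:
    """Hypothetical stand-in for a distribution object; not Phitter's own class."""

    def __init__(self, rate: float) -> None:
        self.rate = rate

    def cdf(self, x: float) -> float:
        return 1.0 - math.exp(-self.rate * x) if x >= 0 else 0.0

    def pdf(self, x: float) -> float:
        return self.rate * math.exp(-self.rate * x) if x >= 0 else 0.0

    def ppf(self, u: float) -> float:
        # Inverse of the CDF: solve u = 1 - exp(-rate * x) for x.
        return -math.log(1.0 - u) / self.rate

    def sample(self, n: int, seed: Optional[int] = None) -> list[float]:
        # Inverse transform sampling: push uniform draws through the PPF.
        rng = random.Random(seed)
        return [self.ppf(rng.random()) for _ in range(n)]


dist = ExponentialSketch(rate=2.0)
median = dist.ppf(0.5)
draws = dist.sample(5, seed=7)
print(median, dist.cdf(median), draws)
```

The PPF is the inverse of the CDF, so `dist.cdf(dist.ppf(u))` recovers `u`, and sampling reduces to evaluating the PPF at uniform random numbers.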

# Implementation and Usage

Phitter is implemented in Python and is available via PyPI. It requires Python 3.9 or higher. The library can be easily installed using pip:

```
pip install phitter
```

Basic usage involves creating a `PHITTER` object with a dataset and calling the `fit()` method:

```python
import phitter

data = [...] # Your dataset
phi = phitter.PHITTER(data)
phi.fit()
```

More advanced usage lets users customize the fit (number of bins, confidence level, and the set of distributions to fit) and work with individual distributions directly.

# Conclusion

Phitter provides researchers, data scientists, and statisticians with a powerful tool for probability distribution analysis. By offering a comprehensive set of distributions, multiple goodness-of-fit tests, and interactive visualizations, Phitter simplifies the process of identifying and working with probability distributions in various data-driven fields.

# References
2 changes: 1 addition & 1 deletion phitter/__init__.py
@@ -1,4 +1,4 @@
- __version__ = "0.0.8"
+ __version__ = "0.7.1"

from .main import PHITTER
from phitter import continuous
10 changes: 5 additions & 5 deletions phitter/discrete/discrete_distributions/geometric.py
@@ -43,16 +43,16 @@ def cdf(self, x: int | numpy.ndarray) -> float | numpy.ndarray:
"""
Cumulative distribution function
"""
- result = 1 - (1 - self.p) ** numpy.floor(x)
- # result = scipy.stats.geom.cdf(x, self.p)
+ # result = 1 - (1 - self.p) ** numpy.floor(x)
+ result = scipy.stats.geom.cdf(x, self.p)
return result

def pmf(self, x: int | numpy.ndarray) -> float | numpy.ndarray:
"""
Probability mass function
"""
- result = self.p * (1 - self.p) ** (x - 1)
- # result = scipy.stats.geom.pmf(x, self.p)
+ # result = self.p * (1 - self.p) ** (x - 1)
+ result = scipy.stats.geom.pmf(x, self.p)
return result

def ppf(self, u: float | numpy.ndarray) -> float | numpy.ndarray:
@@ -67,7 +67,7 @@ def sample(self, n: int, seed: int | None = None) -> numpy.ndarray:
Sample of n elements of distribution
"""
if seed:
- numpy.random.seed(0)
+ numpy.random.seed(seed)
return self.ppf(numpy.random.rand(n))

def non_central_moments(self, k: int) -> float | None:
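
Note on the geometric hunks: the commit swaps the hand-written closed forms for `scipy.stats.geom`, which uses the same parametrization (support starting at 1). The standard-library sketch below checks that the closed-form CDF from the pre-commit code equals the running sum of the PMF, and shows a seeding check written as `seed is not None`, since the `if seed:` guard visible in the diff would skip seeding when `seed` is 0. All function names here are hypothetical, and the `ppf` formula is the standard inverse for this parametrization, assumed for illustration because the commit does not show `ppf`.

```python
import math
import random
from typing import Optional


def geometric_pmf(x: int, p: float) -> float:
    """P(X = x) for the number of trials up to the first success, x = 1, 2, ..."""
    return p * (1.0 - p) ** (x - 1)


def geometric_cdf(x: float, p: float) -> float:
    """P(X <= x), using the closed form from the pre-commit code."""
    return 1.0 - (1.0 - p) ** math.floor(x)


def geometric_ppf(u: float, p: float) -> int:
    """Smallest integer x with CDF(x) >= u (assumed; the commit does not show ppf)."""
    return max(1, math.ceil(math.log(1.0 - u) / math.log(1.0 - p)))


def geometric_sample(n: int, p: float, seed: Optional[int] = None) -> list[int]:
    # `seed is not None` keeps seed=0 reproducible; a bare `if seed:` would skip it.
    rng = random.Random(seed) if seed is not None else random.Random()
    return [geometric_ppf(rng.random(), p) for _ in range(n)]


p = 0.3
running_sum = 0.0
for x in range(1, 11):
    running_sum += geometric_pmf(x, p)
    # The CDF at an integer x is the sum of the PMF over 1..x.
    assert math.isclose(running_sum, geometric_cdf(x, p))

first = geometric_sample(5, p, seed=0)
second = geometric_sample(5, p, seed=0)
print(first == second)
```

Because the closed forms and the SciPy calls describe the same distribution, the change should not alter results for integer inputs; the sketch only verifies the closed-form side of that claim.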
5 changes: 3 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "phitter"
- version = "0.0.8"
+ version = "0.7.1"
description = "Find the best probability distribution for your dataset"
authors = [{name = "Sebastián José Herrera Monterrosa", email = "[email protected]"}]
readme = "README.md"
@@ -36,7 +36,8 @@ dependencies = [
"scipy>=1.1.0",
"plotly>=5.14.0",
"kaleido>=0.2.1",
- "matplotlib>=3.3"
+ "matplotlib>=3.3",
+ "pandas>=1.5.0"
]

[project.urls]
@@ -2,7 +2,7 @@
"cells": [
{
"cell_type": "code",
- "execution_count": 5,
+ "execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -11,7 +11,7 @@
},
{
"cell_type": "code",
- "execution_count": 6,
+ "execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
