Commit: update docs

marcoalopez committed May 4, 2020
1 parent 678618a commit bc40078
Showing 9 changed files with 82 additions and 194 deletions.
5 changes: 3 additions & 2 deletions DOCS/_Plot_module.md
@@ -28,7 +28,8 @@ The method returns a plot, the number of classes and bin size of the histogram,
def distribution(data,
plot=('hist', 'kde'),
avg=('amean', 'gmean', 'median', 'mode'),
-binsize='auto', bandwidth='silverman'):
+binsize='auto',
+bandwidth='silverman'):
""" Return a plot with the ditribution of (apparent or actual) grain sizes
in a dataset.
@@ -200,6 +201,6 @@ KDE bandwidth = 0.1
=======================================
```

-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/new_normalized_median.png?raw=true)

Note that in this case, the method returns the normalized inter-quartile range (IQR) rather than the normalized standard deviation. Also, note that the kernel density estimate appears smoother, resembling an almost perfect normal distribution.
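For readers who want to reproduce this measure by hand, the normalized IQR can be sketched as follows. This is a minimal sketch on a synthetic lognormal sample; the variable names are illustrative and not those of the GrainSizeTools internals:

```python
import numpy as np

# synthetic lognormal sample standing in for a real grain size dataset
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)

# normalize the diameters by the median, then take the inter-quartile range
normalized = sample / np.median(sample)
q75, q25 = np.percentile(normalized, [75, 25])
norm_iqr = q75 - q25

print(norm_iqr)
```

Because every value is divided by the median, the median of the normalized sample is exactly 1 and the IQR becomes a scale-free measure of spread.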
119 changes: 56 additions & 63 deletions DOCS/_describe.md
@@ -9,18 +9,18 @@ dataset = pd.read_csv(filepath, sep='\t')

# estimate equivalent circular diameters (ECDs)
dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
-dataset
+dataset.head()
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_newcol.png?raw=true)

```python
-# Set the population properties
+# Set the population properties for the toy dataset
scale = np.log(20) # set sample geometric mean to 20
shape = np.log(1.5) # set the lognormal shape to 1.5

# generate a random lognormal population of size 500
-np.random.seed(seed=1) # this is to generate always the same population for reproducibility
+np.random.seed(seed=1) # this is for reproducibility
toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)
```
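As a sanity check (not part of the original workflow), we can verify that the synthetic population actually has the requested properties by back-computing the geometric mean and the multiplicative standard deviation (MSD) from the logs:

```python
import numpy as np

scale = np.log(20)   # geometric mean of 20
shape = np.log(1.5)  # lognormal shape (MSD) of 1.5

np.random.seed(seed=1)
toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)

# back-compute the parameters from the sample
gmean = np.exp(np.mean(np.log(toy_dataset)))  # geometric mean, ~20
msd = np.exp(np.std(np.log(toy_dataset)))     # multiplicative std, ~1.5

print(gmean, msd)
```

With a finite sample of 500 grains the recovered values will not be exactly 20 and 1.5, but they should be close, which is what `summarize()` reports later in this section.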

@@ -73,9 +73,7 @@ By default, the `summarize()` function returns:
- The shape of the lognormal distribution using the multiplicative standard deviation (MSD)
- A Shapiro-Wilk test warning indicating when the data deviates from normal and/or lognormal (when p-value < 0.05).
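The warning logic can be sketched with SciPy's `shapiro` test. This is an illustrative approximation of the check on a synthetic sample, not necessarily the exact implementation inside `summarize()`:

```python
import numpy as np
from scipy.stats import shapiro

# synthetic lognormal sample standing in for the data
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)

# Shapiro-Wilk on the raw data tests normality; on the logs, lognormality
_, p_normal = shapiro(sample)
_, p_lognormal = shapiro(np.log(sample))

if p_normal < 0.05:
    print('warning: data deviates from a normal distribution')
if p_lognormal < 0.05:
    print('warning: data deviates from a lognormal distribution')
```

For a strongly skewed lognormal sample like this one, the test on the raw data rejects normality, while the test on the log-transformed data typically does not.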

-Note that here the Shapiro-Wilk test warning tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic dataset, 20 and 1.5 respectively.
-
-Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:
+In the example above, the Shapiro-Wilk test tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic random dataset, 20 and 1.5 respectively. Now, let's do the same using the dataset that comes from a real rock; for this, we have to pass the column with the diameters:

```python
summarize(dataset['diameters'])
@@ -117,69 +115,64 @@ Lognormality test: 0.99, 0.03 (test statistic, p-value)
============================================================================
```

-Leaving aside the difference in numbers, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically choose the optimal confidence interval method depending on distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).
+Leaving aside the different numbers, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically chooses the optimal confidence interval method depending on the distribution features. The decision tree flowchart below shows how the optimal confidence interval estimation method is chosen, based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).

![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/avg_map.png?raw=true)

The CLT method applies in this case because the grain size distribution is not sufficiently lognormal-like (note the Shapiro-Wilk test warning with a p-value < 0.05), which might otherwise cause an inaccurate estimate of the arithmetic mean confidence interval.
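For reference, a CLT-based confidence interval for the arithmetic mean can be sketched as below. This is the generic textbook construction using Student's t critical value on a synthetic sample, not necessarily the exact formula used by the script:

```python
import numpy as np
from scipy.stats import t


def clt_ci(data, ci_level=0.95):
    """Two-sided CI for the arithmetic mean via the central limit theorem."""
    data = np.asarray(data)
    n = data.size
    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(n)            # standard error of the mean
    t_crit = t.ppf(0.5 + ci_level / 2, df=n - 1)   # two-sided critical value
    return mean - t_crit * sem, mean + t_crit * sem


rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)
low, high = clt_ci(sample)
print(low, high)
```

The interval is symmetric about the arithmetic mean, which is precisely why it is only advisable when the distribution is not too skewed.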

Now, let's focus on the different options of the ``summarize()`` function.

```
Signature:
summarize(
data,
avg=('amean', 'gmean', 'median', 'mode'),
ci_level=0.95,
bandwidth='silverman',
precision=0.1,
)
Docstring:
Estimate different grain size statistics. This includes different means,
the median, the frequency peak grain size via KDE, the confidence intervals
using different methods, and the distribution features.
Parameters
----------
data : array_like
the size of the grains
avg : string, tuple or list; optional
the averages to be estimated
| Types:
| 'amean' - arithmetic mean
| 'gmean' - geometric mean
| 'median' - median
| 'mode' - the kernel-based frequency peak of the distribution
ci_level : scalar between 0 and 1; optional
the certainty of the confidence interval (default = 0.95)
bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
the method to estimate the bandwidth or a scalar directly defining the
bandwidth. It uses the Silverman plug-in method by default.
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
Default is 0.1. Note that this has nothing to do with the
confidence intervals
Call functions
--------------
- amean, gmean, median, and freq_peak (from averages)
Examples
--------
>>> summarize(dataset['diameters'])
>>> summarize(dataset['diameters'], ci_level=0.99)
>>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
Returns
-------
None
File: c:\users\marco\documents\github\grainsizetools\grain_size_tools\grainsizetools_script.py
Type: function
```python
def summarize(data,
avg=('amean', 'gmean', 'median', 'mode'),
ci_level=0.95,
bandwidth='silverman',
precision=0.1):
""" Estimate different grain size statistics. This includes different means,
the median, the frequency peak grain size via KDE, the confidence intervals
using different methods, and the distribution features.
Parameters
----------
data : array_like
the size of the grains
avg : string, tuple or list; optional
the averages to be estimated
| Types:
| 'amean' - arithmetic mean
| 'gmean' - geometric mean
| 'median' - median
| 'mode' - the kernel-based frequency peak of the distribution
ci_level : scalar between 0 and 1; optional
the certainty of the confidence interval (default = 0.95)
bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
the method to estimate the bandwidth or a scalar directly defining the
bandwidth. It uses the Silverman plug-in method by default.
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
Default is 0.1. Note that this is not related to the confidence
intervals
Call functions
--------------
- amean, gmean, median, and freq_peak (from averages)
Examples
--------
>>> summarize(dataset['diameters'])
>>> summarize(dataset['diameters'], ci_level=0.99)
>>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
Returns
-------
None
"""
```


11 changes: 6 additions & 5 deletions DOCS/_first_steps.md
@@ -3,7 +3,7 @@
Installing Python for data science
-------------

-GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5+ or higher and the Python scientific libraries [*Numpy*](http://www.numpy.org/ ) [*Scipy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required the scientific packages (> 5 GB disk space). In case you have a limited space in your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. For both cases you have versions for Windows, MacOS and Linux.
+GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5 or higher and the Python scientific libraries [*NumPy*](http://www.numpy.org/ ), [*SciPy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required scientific packages (> 5 GB disk space). If you have limited space on your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. In both cases there are versions for Windows, MacOS and Linux.

Anaconda Python Distribution: https://www.anaconda.com/distribution/

@@ -201,7 +201,7 @@ Let's first see how the data set looks like. Instead of calling the variable (as
dataset.head() # returns 5 rows by default, you can define any number within the parenthesis
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_head5.png?raw=true)

The example dataset has 11 different columns (one without a name). To interact with one of the columns we must call its name in square brackets with the name in quotes as follows:

@@ -233,7 +233,7 @@ dataset = dataset.drop(' ', axis=1)
dataset.head(3)
```

-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_head3.png?raw=true)

If you want to remove more than one column pass a list of columns instead as in the example below:

@@ -256,7 +256,7 @@ dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
dataset.head()
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_diameters.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_newcol.png?raw=true)

You can see a new column named `diameters`.

@@ -285,9 +285,10 @@ dataset.info() # display info of the DataFrame
dataset.shape # (rows, columns); shape is an attribute, not a method
dataset.count() # number of non-null values

+# Data cleaning
dataset.dropna() # remove missing values from the data

-# writing to disk
+# Writing to disk
dataset.to_csv(filename) # save as csv file, the filename must be within quotes
dataset.to_excel(filename) # save as excel file
```
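As a quick check of the read/write operations above, here is a minimal round trip on a hypothetical toy DataFrame; an in-memory buffer stands in for the filename used in the cheat sheet:

```python
from io import StringIO

import numpy as np
import pandas as pd

# hypothetical toy DataFrame standing in for the real dataset
dataset = pd.DataFrame({'Area': [314.16, 78.54, 1256.64]})
dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)

buffer = StringIO()                  # stands in for a filename on disk
dataset.to_csv(buffer, index=False)  # write as csv
buffer.seek(0)
roundtrip = pd.read_csv(buffer)      # read it back

print(roundtrip.shape)  # note: shape is an attribute, not a method
```

The round trip preserves both columns, and the first `Area` of ~314.16 maps back to a diameter of ~20, matching the ECD formula used throughout the docs.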
Binary file removed FIGURES/dataframe_diameters.png
Binary file added FIGURES/dataframe_output_head3.png
Binary file added FIGURES/dataframe_output_head5.png
Binary file added FIGURES/dataframe_output_newcol.png
4 changes: 2 additions & 2 deletions grain_size_tools/GrainSizeTools_script.py
@@ -116,8 +116,8 @@ def summarize(data, avg=('amean', 'gmean', 'median', 'mode'), ci_level=0.95,
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
-Default is 0.1. Note that this has nothing to do with the
-confidence intervals
+Default is 0.1. Note that this is not related to the confidence
+intervals
Call functions
--------------

0 comments on commit bc40078
