Commit: update docs

marcoalopez committed May 4, 2020
1 parent 678618a commit bc40078
Showing 9 changed files with 82 additions and 194 deletions.
5 changes: 3 additions & 2 deletions DOCS/_Plot_module.md
@@ -28,7 +28,8 @@ The method returns a plot, the number of classes and bin size of the histogram,
def distribution(data,
plot=('hist', 'kde'),
avg=('amean', 'gmean', 'median', 'mode'),
-binsize='auto', bandwidth='silverman'):
+binsize='auto',
+bandwidth='silverman'):
""" Return a plot with the ditribution of (apparent or actual) grain sizes
in a dataset.
@@ -200,6 +201,6 @@ KDE bandwidth = 0.1
=======================================
```

-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/new_normalized_median.png?raw=true)

Note that in this case, the method returns the normalized inter-quartile range (IQR) rather than the normalized standard deviation. Also, note that the kernel density estimate appears smoother, resembling an almost perfect normal distribution.
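For readers who want to reproduce this measure by hand, the normalized IQR can be sketched as follows. This is a minimal sketch on a synthetic lognormal sample; the variable names are illustrative and not those of the GrainSizeTools internals:

```python
import numpy as np

# synthetic lognormal sample standing in for a real grain size dataset
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)

# normalize the diameters by the median, then take the inter-quartile range
normalized = sample / np.median(sample)
q75, q25 = np.percentile(normalized, [75, 25])
norm_iqr = q75 - q25

print(norm_iqr)
```

Because every value is divided by the median, the median of the normalized sample is exactly 1 and the IQR becomes a scale-free measure of spread.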
119 changes: 56 additions & 63 deletions DOCS/_describe.md
@@ -9,18 +9,18 @@ dataset = pd.read_csv(filepath, sep='\t')

# estimate equivalent circular diameters (ECDs)
dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
-dataset
+dataset.head()
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_newcol.png?raw=true)

```python
-# Set the population properties
+# Set the population properties for the toy dataset
scale = np.log(20) # set sample geometric mean to 20
shape = np.log(1.5) # set the lognormal shape to 1.5

# generate a random lognormal population of size 500
-np.random.seed(seed=1) # this is to generate always the same population for reproducibility
+np.random.seed(seed=1) # this is for reproducibility
toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)
```
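As a sanity check (not part of the original workflow), we can verify that the synthetic population actually has the requested properties by back-computing the geometric mean and the multiplicative standard deviation (MSD) from the logs:

```python
import numpy as np

scale = np.log(20)   # geometric mean of 20
shape = np.log(1.5)  # lognormal shape (MSD) of 1.5

np.random.seed(seed=1)
toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)

# back-compute the parameters from the sample
gmean = np.exp(np.mean(np.log(toy_dataset)))  # geometric mean, ~20
msd = np.exp(np.std(np.log(toy_dataset)))     # multiplicative std, ~1.5

print(gmean, msd)
```

With a finite sample of 500 grains the recovered values will not be exactly 20 and 1.5, but they should be close, which is what `summarize()` reports later in this section.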

@@ -73,9 +73,7 @@ By default, the `summarize()` function returns:
- The shape of the lognormal distribution using the multiplicative standard deviation (MSD)
- A Shapiro-Wilk test warning indicating when the data deviates from normal and/or lognormal (when p-value < 0.05).
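The warning logic can be sketched with SciPy's `shapiro` test. This is an illustrative approximation of the check on a synthetic sample, not necessarily the exact implementation inside `summarize()`:

```python
import numpy as np
from scipy.stats import shapiro

# synthetic lognormal sample standing in for the data
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)

# Shapiro-Wilk on the raw data tests normality; on the logs, lognormality
_, p_normal = shapiro(sample)
_, p_lognormal = shapiro(np.log(sample))

if p_normal < 0.05:
    print('warning: data deviates from a normal distribution')
if p_lognormal < 0.05:
    print('warning: data deviates from a lognormal distribution')
```

For a strongly skewed lognormal sample like this one, the test on the raw data rejects normality, while the test on the log-transformed data typically does not.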

-Note that here the Shapiro-Wilk test warning tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic dataset, 20 and 1.5 respectively.
-
-Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:
+In the example above, the Shapiro-Wilk test tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic random dataset, 20 and 1.5 respectively. Now, let's do the same using the dataset that comes from a real rock; for this, we have to pass the column with the diameters:

```python
summarize(dataset['diameters'])
@@ -117,69 +115,64 @@ Lognormality test: 0.99, 0.03 (test statistic, p-value)
============================================================================
```

-Leaving aside the difference in numbers, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically choose the optimal confidence interval method depending on distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).
+Leaving aside the different numbers, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically chooses the optimal confidence interval method depending on the distribution features. The decision tree flowchart below shows how the optimal confidence interval estimation method is chosen, based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).

![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/avg_map.png?raw=true)

The CLT method applies in this case because the grain size distribution is not sufficiently lognormal-like (note the Shapiro-Wilk test warning with a p-value < 0.05), which might otherwise cause an inaccurate estimate of the arithmetic mean confidence interval.
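For reference, a CLT-based confidence interval for the arithmetic mean can be sketched as below. This is the generic textbook construction using Student's t critical value on a synthetic sample, not necessarily the exact formula used by the script:

```python
import numpy as np
from scipy.stats import t


def clt_ci(data, ci_level=0.95):
    """Two-sided CI for the arithmetic mean via the central limit theorem."""
    data = np.asarray(data)
    n = data.size
    mean = data.mean()
    sem = data.std(ddof=1) / np.sqrt(n)            # standard error of the mean
    t_crit = t.ppf(0.5 + ci_level / 2, df=n - 1)   # two-sided critical value
    return mean - t_crit * sem, mean + t_crit * sem


rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)
low, high = clt_ci(sample)
print(low, high)
```

The interval is symmetric about the arithmetic mean, which is precisely why it is only advisable when the distribution is not too skewed.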

Now, let's focus on the different options of the ``summarize()`` function.

```
Signature:
summarize(
data,
avg=('amean', 'gmean', 'median', 'mode'),
ci_level=0.95,
bandwidth='silverman',
precision=0.1,
)
Docstring:
Estimate different grain size statistics. This includes different means,
the median, the frequency peak grain size via KDE, the confidence intervals
using different methods, and the distribution features.
Parameters
----------
data : array_like
the size of the grains
avg : string, tuple or list; optional
the averages to be estimated
| Types:
| 'amean' - arithmetic mean
| 'gmean' - geometric mean
| 'median' - median
| 'mode' - the kernel-based frequency peak of the distribution
ci_level : scalar between 0 and 1; optional
the certainty of the confidence interval (default = 0.95)
bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
the method to estimate the bandwidth or a scalar directly defining the
bandwidth. It uses the Silverman plug-in method by default.
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
Default is 0.1. Note that this has nothing to do with the
confidence intervals
Call functions
--------------
- amean, gmean, median, and freq_peak (from averages)
Examples
--------
>>> summarize(dataset['diameters'])
>>> summarize(dataset['diameters'], ci_level=0.99)
>>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
Returns
-------
None
File: c:\users\marco\documents\github\grainsizetools\grain_size_tools\grainsizetools_script.py
Type: function
```python
def summarize(data,
avg=('amean', 'gmean', 'median', 'mode'),
ci_level=0.95,
bandwidth='silverman',
precision=0.1):
""" Estimate different grain size statistics. This includes different means,
the median, the frequency peak grain size via KDE, the confidence intervals
using different methods, and the distribution features.
Parameters
----------
data : array_like
the size of the grains
avg : string, tuple or list; optional
the averages to be estimated
| Types:
| 'amean' - arithmetic mean
| 'gmean' - geometric mean
| 'median' - median
| 'mode' - the kernel-based frequency peak of the distribution
ci_level : scalar between 0 and 1; optional
the certainty of the confidence interval (default = 0.95)
bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
the method to estimate the bandwidth or a scalar directly defining the
bandwidth. It uses the Silverman plug-in method by default.
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
Default is 0.1. Note that this is not related to the confidence
intervals
Call functions
--------------
- amean, gmean, median, and freq_peak (from averages)
Examples
--------
>>> summarize(dataset['diameters'])
>>> summarize(dataset['diameters'], ci_level=0.99)
>>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
Returns
-------
None
"""
```


11 changes: 6 additions & 5 deletions DOCS/_first_steps.md
@@ -3,7 +3,7 @@
Installing Python for data science
-------------

-GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5+ or higher and the Python scientific libraries [*Numpy*](http://www.numpy.org/ ) [*Scipy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required the scientific packages (> 5 GB disk space). In case you have a limited space in your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. For both cases you have versions for Windows, MacOS and Linux.
+GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5 or higher and the Python scientific libraries [*NumPy*](http://www.numpy.org/ ), [*SciPy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required scientific packages (> 5 GB disk space). If you have limited space on your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. In both cases there are versions for Windows, MacOS and Linux.

Anaconda Python Distribution: https://www.anaconda.com/distribution/

@@ -201,7 +201,7 @@ Let's first see how the data set looks like. Instead of calling the variable (as
dataset.head() # returns 5 rows by default, you can define any number within the parenthesis
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_head5.png?raw=true)

The example dataset has 11 different columns (one without a name). To interact with one of the columns we must call its name in square brackets with the name in quotes as follows:

@@ -233,7 +233,7 @@ dataset = dataset.drop(' ', axis=1)
dataset.head(3)
```

-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_head3.png?raw=true)

If you want to remove more than one column pass a list of columns instead as in the example below:

@@ -256,7 +256,7 @@ dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
dataset.head()
```

-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_diameters.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_newcol.png?raw=true)

You can see a new column named `diameters`.

@@ -285,9 +285,10 @@ dataset.info() # display info of the DataFrame
dataset.shape # (rows, columns); shape is an attribute, not a method
dataset.count() # number of non-null values

+# Data cleaning
dataset.dropna() # remove missing values from the data

-# writing to disk
+# Writing to disk
dataset.to_csv(filename) # save as csv file, the filename must be within quotes
dataset.to_excel(filename) # save as excel file
```
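As a quick check of the read/write operations above, here is a minimal round trip on a hypothetical toy DataFrame; an in-memory buffer stands in for the filename used in the cheat sheet:

```python
from io import StringIO

import numpy as np
import pandas as pd

# hypothetical toy DataFrame standing in for the real dataset
dataset = pd.DataFrame({'Area': [314.16, 78.54, 1256.64]})
dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)

buffer = StringIO()                  # stands in for a filename on disk
dataset.to_csv(buffer, index=False)  # write as csv
buffer.seek(0)
roundtrip = pd.read_csv(buffer)      # read it back

print(roundtrip.shape)  # note: shape is an attribute, not a method
```

The round trip preserves both columns, and the first `Area` of ~314.16 maps back to a diameter of ~20, matching the ECD formula used throughout the docs.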
Binary file removed FIGURES/dataframe_diameters.png
Binary file added FIGURES/dataframe_output_head3.png
Binary file added FIGURES/dataframe_output_head5.png
Binary file added FIGURES/dataframe_output_newcol.png
4 changes: 2 additions & 2 deletions grain_size_tools/GrainSizeTools_script.py
@@ -116,8 +116,8 @@ def summarize(data, avg=('amean', 'gmean', 'median', 'mode'), ci_level=0.95,
precision : positive scalar or None; optional
the maximum precision expected for the "peak" kde-based estimator.
-Default is 0.1. Note that this has nothing to do with the
-confidence intervals
+Default is 0.1. Note that this is not related to the confidence
+intervals
Call functions
--------------

0 comments on commit bc40078
