diff --git a/DOCS/_Plot_module.md b/DOCS/_Plot_module.md
index ce7f1ef..af29cbc 100644
--- a/DOCS/_Plot_module.md
+++ b/DOCS/_Plot_module.md
@@ -28,7 +28,8 @@ The method returns a plot, the number of classes and bin size of the histogram,
 def distribution(data,
                  plot=('hist', 'kde'),
                  avg=('amean', 'gmean', 'median', 'mode'),
-                 binsize='auto', bandwidth='silverman'):
+                 binsize='auto',
+                 bandwidth='silverman'):
     """ Return a plot with the ditribution of (apparent or actual) grain
     sizes in a dataset.
@@ -200,6 +201,6 @@ KDE bandwidth = 0.1
 =======================================
 ```
 
-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/new_normalized_median.png?raw=true)
 
 Note that in this case, the method returns the normalized inter-quartile range (IQR) rather than the normalized standard deviation. Also, note that the kernel density estimate appears smoother resembling an almost perfect normal distribution.
\ No newline at end of file
diff --git a/DOCS/_describe.md b/DOCS/_describe.md
index 9a60608..2352e4a 100644
--- a/DOCS/_describe.md
+++ b/DOCS/_describe.md
@@ -9,18 +9,18 @@ dataset = pd.read_csv(filepath, sep='\t')
 
 # estimate equivalent circular diameters (ECDs)
 dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
-dataset
+dataset.head()
 ```
 
-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_newcol.png?raw=true)
 
 ```python
-# Set the population properties
+# Set the population properties for the toy dataset
 scale = np.log(20)  # set sample geometric mean to 20
 shape = np.log(1.5)  # set the lognormal shape to 1.5
 
 # generate a random lognormal population of size 500
-np.random.seed(seed=1)  # this is to generate always the same population for reproducibility
+np.random.seed(seed=1)  # this is for reproducibility
 toy_dataset = np.random.lognormal(mean=scale, sigma=shape, size=500)
 ```
 
@@ -73,9 +73,7 @@ By default, the `summarize()` function returns:
 - The shape of the lognormal distribution using the multiplicative standard deviation (MSD)
 - A Shapiro-Wilk test warning indicating when the data deviates from normal and/or lognormal (when p-value < 0.05).
 
-Note that here the Shapiro-Wilk test warning tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic dataset, 20 and 1.5 respectively.
-
-Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:
+In the example above, the Shapiro-Wilk test tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic random dataset, 20 and 1.5 respectively. Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:
 
 ```python
 summarize(dataset['diameters'])
@@ -117,7 +115,7 @@ Lognormality test: 0.99, 0.03 (test statistic, p-value)
 ============================================================================
 ```
 
-Leaving aside the difference in numbers, there are some subtle differences compared to the results obtained with the toy dataset.
First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically choose the optimal confidence interval method depending on distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).
+Leaving the actual numbers aside, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically chooses the optimal confidence interval method depending on the distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042).
 
 ![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/avg_map.png?raw=true)
 
@@ -125,61 +123,56 @@ The reason why the CLT method applies in this case is that the grain size distri
 
 Now, let's focus on the different options of the ``summarize()`` method.
 
-```
-Signature:
-summarize(
-    data,
-    avg=('amean', 'gmean', 'median', 'mode'),
-    ci_level=0.95,
-    bandwidth='silverman',
-    precision=0.1,
-)
-Docstring:
-Estimate different grain size statistics. This includes different means,
-the median, the frequency peak grain size via KDE, the confidence intervals
-using different methods, and the distribution features.
-
-Parameters
-----------
-data : array_like
-    the size of the grains
-
-avg : string, tuple or list; optional
-    the averages to be estimated
-
-    | Types:
-    | 'amean' - arithmetic mean
-    | 'gmean' - geometric mean
-    | 'median' - median
-    | 'mode' - the kernel-based frequency peak of the distribution
-
-ci_level : scalar between 0 and 1; optional
-    the certainty of the confidence interval (default = 0.95)
-
-bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
-    the method to estimate the bandwidth or a scalar directly defining the
-    bandwidth. It uses the Silverman plug-in method by default.
-
-precision : positive scalar or None; optional
-    the maximum precision expected for the "peak" kde-based estimator.
-    Default is 0.1. Note that this has nothing to do with the
-    confidence intervals
-
-Call functions
---------------
-- amean, gmean, median, and freq_peak (from averages)
-
-Examples
---------
->>> summarize(dataset['diameters'])
->>> summarize(dataset['diameters'], ci_level=0.99)
->>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
-
-Returns
--------
-None
-File: c:\users\marco\documents\github\grainsizetools\grain_size_tools\grainsizetools_script.py
-Type: function
+```python
+def summarize(data,
+              avg=('amean', 'gmean', 'median', 'mode'),
+              ci_level=0.95,
+              bandwidth='silverman',
+              precision=0.1):
+    """ Estimate different grain size statistics. This includes different means,
+    the median, the frequency peak grain size via KDE, the confidence intervals
+    using different methods, and the distribution features.
+
+    Parameters
+    ----------
+    data : array_like
+        the size of the grains
+
+    avg : string, tuple or list; optional
+        the averages to be estimated
+
+        | Types:
+        | 'amean' - arithmetic mean
+        | 'gmean' - geometric mean
+        | 'median' - median
+        | 'mode' - the kernel-based frequency peak of the distribution
+
+    ci_level : scalar between 0 and 1; optional
+        the certainty of the confidence interval (default = 0.95)
+
+    bandwidth : string {'silverman' or 'scott'} or positive scalar; optional
+        the method to estimate the bandwidth or a scalar directly defining the
+        bandwidth. It uses the Silverman plug-in method by default.
+
+    precision : positive scalar or None; optional
+        the maximum precision expected for the "peak" kde-based estimator.
+        Default is 0.1. Note that this is not related to the confidence
+        intervals.
+
+    Call functions
+    --------------
+    - amean, gmean, median, and freq_peak (from averages)
+
+    Examples
+    --------
+    >>> summarize(dataset['diameters'])
+    >>> summarize(dataset['diameters'], ci_level=0.99)
+    >>> summarize(np.log(dataset['diameters']), avg=('amean', 'median', 'mode'))
+
+    Returns
+    -------
+    None
+    """
 ```
diff --git a/DOCS/_first_steps.md b/DOCS/_first_steps.md
index d5ed27e..ed2de19 100644
--- a/DOCS/_first_steps.md
+++ b/DOCS/_first_steps.md
@@ -3,7 +3,7 @@ Installing Python for data science
 -------------
 
-GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5+ or higher and the Python scientific libraries [*Numpy*](http://www.numpy.org/ ) [*Scipy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required the scientific packages (> 5 GB disk space). In case you have a limited space in your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. For both cases you have versions for Windows, MacOS and Linux.
+GrainSizeTools script requires [Python](https://www.python.org/ ) 3.5 or higher and the Python scientific libraries [*NumPy*](http://www.numpy.org/ ), [*SciPy*](http://www.scipy.org/ ), [*Pandas*](http://pandas.pydata.org ) and [*Matplotlib*](http://matplotlib.org/ ). If you have no previous experience with Python, I recommend downloading and installing the [Anaconda Python distribution](https://www.anaconda.com/distribution/ ) (Python 3.x version), as it includes all the required scientific packages (> 5 GB of disk space). If you have limited space on your hard disk, there is a distribution named [miniconda](http://conda.pydata.org/miniconda.html ) that only installs the Python packages you actually need. In both cases, there are versions for Windows, macOS and Linux.
 
 Anaconda Python Distribution: https://www.anaconda.com/distribution/
 
@@ -201,7 +201,7 @@ Let's first see how the data set looks like. Instead of calling the variable (as
 dataset.head() # returns 5 rows by default, you can define any number within the parenthesis
 ```
 
-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_head5.png?raw=true)
 
 The example dataset has 11 different columns (one without a name).
To interact with one of the columns we must call its name in square brackets with the name in quotes as follows:
 
@@ -233,7 +233,7 @@ dataset = dataset.drop(' ', axis=1)
 dataset.head(3)
 ```
 
-![]()
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_head3.png?raw=true)
 
 If you want to remove more than one column pass a list of columns instead as in the example below:
 
@@ -256,7 +256,7 @@ dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)
 dataset.head()
 ```
 
-![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_diameters.png?raw=true)
+![](https://github.com/marcoalopez/GrainSizeTools/blob/master/FIGURES/dataframe_output_newcol.png?raw=true)
 
 You can see a new column named diameters.
 
@@ -285,9 +285,10 @@ dataset.info() # display info of the DataFrame
 dataset.shape() # (rows, columns)
 dataset.count() # number of non-null values
 
+# Data cleaning
 dataset.dropna() # remove missing values from the data
 
-# writing to disk
+# Writing to disk
 dataset.to_csv(filename) # save as csv file, the filename must be within quotes
 dataset.to_excel(filename) # save as excel file
 ```
diff --git a/FIGURES/dataframe_diameters.png b/FIGURES/dataframe_diameters.png
deleted file mode 100644
index a8e8943..0000000
Binary files a/FIGURES/dataframe_diameters.png and /dev/null differ
diff --git a/FIGURES/dataframe_output_head3.png b/FIGURES/dataframe_output_head3.png
new file mode 100644
index 0000000..d82a2ab
Binary files /dev/null and b/FIGURES/dataframe_output_head3.png differ
diff --git a/FIGURES/dataframe_output_head5.png b/FIGURES/dataframe_output_head5.png
new file mode 100644
index 0000000..97b1415
Binary files /dev/null and b/FIGURES/dataframe_output_head5.png differ
diff --git a/FIGURES/dataframe_output_newcol.png b/FIGURES/dataframe_output_newcol.png
new file mode 100644
index 0000000..38f608e
Binary files /dev/null and b/FIGURES/dataframe_output_newcol.png differ
diff --git a/grain_size_tools/GrainSizeTools_script.py b/grain_size_tools/GrainSizeTools_script.py
index bbd76d7..df1eab7 100644
--- a/grain_size_tools/GrainSizeTools_script.py
+++ b/grain_size_tools/GrainSizeTools_script.py
@@ -116,8 +116,8 @@ def summarize(data, avg=('amean', 'gmean', 'median', 'mode'), ci_level=0.95,
     precision : positive scalar or None; optional
         the maximum precision expected for the "peak" kde-based estimator.
-        Default is 0.1. Note that this has nothing to do with the
-        confidence intervals
+        Default is 0.1. Note that this is not related to the confidence
+        intervals.
 
     Call functions
     --------------
diff --git a/grain_size_tools/example_notebooks/grain_size_description.ipynb b/grain_size_tools/example_notebooks/grain_size_description.ipynb
index 7d788ec..0facaed 100644
--- a/grain_size_tools/example_notebooks/grain_size_description.ipynb
+++ b/grain_size_tools/example_notebooks/grain_size_description.ipynb
@@ -159,129 +159,24 @@
        "
2661 rows × 12 columns
\n", "" ], "text/plain": [ - " Area Circ. Feret FeretX FeretY FeretAngle MinFeret \\\n", - "0 1 157.25 0.680 18.062 1535.0 0.5 131.634 13.500 \n", - "1 2 2059.75 0.771 62.097 753.5 16.5 165.069 46.697 \n", - "2 3 1961.50 0.842 57.871 727.0 65.0 71.878 46.923 \n", - "3 4 5428.50 0.709 114.657 1494.5 83.5 19.620 63.449 \n", - "4 5 374.00 0.699 29.262 2328.0 34.0 33.147 16.000 \n", - "... ... ... ... ... ... ... ... ... \n", - "2656 2657 452.50 0.789 28.504 1368.0 1565.5 127.875 22.500 \n", - "2657 2658 1081.25 0.756 47.909 1349.5 1569.5 108.246 31.363 \n", - "2658 2659 513.50 0.720 32.962 1373.0 1586.0 112.286 20.496 \n", - "2659 2660 277.75 0.627 29.436 1316.0 1601.5 159.102 17.002 \n", - "2660 2661 725.00 0.748 39.437 1335.5 1615.5 129.341 28.025 \n", - "\n", - " AR Round Solidity diameters \n", - "0 1.101 0.908 0.937 14.149803 \n", - "1 1.314 0.761 0.972 51.210889 \n", - "2 1.139 0.878 0.972 49.974587 \n", - "3 1.896 0.528 0.947 83.137121 \n", - "4 1.515 0.660 0.970 21.821815 \n", - "... ... ... ... ... \n", - "2656 1.235 0.810 0.960 24.002935 \n", - "2657 1.446 0.692 0.960 37.103777 \n", - "2658 1.493 0.670 0.953 25.569679 \n", - "2659 1.727 0.579 0.920 18.805379 \n", - "2660 1.351 0.740 0.960 30.382539 \n", + " Area Circ. Feret FeretX FeretY FeretAngle MinFeret AR \\\n", + "0 1 157.25 0.680 18.062 1535.0 0.5 131.634 13.500 1.101 \n", + "1 2 2059.75 0.771 62.097 753.5 16.5 165.069 46.697 1.314 \n", + "2 3 1961.50 0.842 57.871 727.0 65.0 71.878 46.923 1.139 \n", + "3 4 5428.50 0.709 114.657 1494.5 83.5 19.620 63.449 1.896 \n", + "4 5 374.00 0.699 29.262 2328.0 34.0 33.147 16.000 1.515 \n", "\n", - "[2661 rows x 12 columns]" + " Round Solidity diameters \n", + "0 0.908 0.937 14.149803 \n", + "1 0.761 0.972 51.210889 \n", + "2 0.878 0.972 49.974587 \n", + "3 0.528 0.947 83.137121 \n", + "4 0.660 0.970 21.821815 " ] }, "execution_count": 2, @@ -296,7 +191,7 @@ "\n", "# estimate equivalent circular diameters (ECDs)\n", "dataset['diameters'] = 2 * np.sqrt(dataset['Area'] / np.pi)\n", - "dataset" + "dataset.head()" ] }, { @@ -391,9 +286,7 @@ "- The shape of the lognormal distribution using the multiplicative standard deviation (MSD)\n", "- A Shapiro-Wilk test warning indicating when the data deviates from normal and/or lognormal (when p-value < 0.05).\n", "\n", - "Note that here the Shapiro-Wilk test warning tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic dataset, 20 and 1.5 respectively.\n", - "\n", - "Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:" + "In the example above, the Shapiro-Wilk test tells us that the distribution is not normally distributed, which is to be expected since we know that this is a lognormal distribution. Note that the geometric mean and the lognormal shape are very close to the values used to generate the synthetic random dataset, 20 and 1.5 respectively. Now, let's do the same using the dataset that comes from a real rock, for this, we have to pass the column with the diameters:" ] }, { @@ -450,7 +343,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Leaving aside the difference in numbers, there are some subtle differences compared to the results obtained with the toy dataset. 
First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically choose the optimal confidence interval method depending on distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042)."
+    "Leaving the actual numbers aside, there are some subtle differences compared to the results obtained with the toy dataset. First, the confidence interval method for the arithmetic mean is no longer the modified Cox (mCox) but the one based on the central limit theorem (CLT) advised by the [ASTM](https://en.wikipedia.org/wiki/ASTM_International). As previously noted, the function ```summarize()``` automatically chooses the optimal confidence interval method depending on the distribution features. We show below the decision tree flowchart for choosing the optimal confidence interval estimation method, which is based on [Lopez-Sanchez (2020)](https://doi.org/10.1016/j.jsg.2020.104042)."
   ]
  },
 {
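For reference, here is a minimal usage sketch of the workflow that the updated `summarize()` docstring above describes; it is an illustration, not part of the changeset itself. The import line is an assumption made so the snippet is self-contained: it presumes `GrainSizeTools_script.py` is on the Python path and can be imported as a module, whereas the documentation normally loads the script by running it directly.

```python
# Minimal sketch of the documented workflow (illustrative, not part of the diff).
# Assumption: GrainSizeTools_script.py is importable from the working directory;
# the docs usually load it by executing the script instead of importing it.
import numpy as np

from GrainSizeTools_script import summarize

# Synthetic lognormal population as in the docs: geometric mean ~20, shape ~1.5
np.random.seed(seed=1)  # fix the seed for reproducibility
toy_dataset = np.random.lognormal(mean=np.log(20), sigma=np.log(1.5), size=500)

# Default call: prints the arithmetic and geometric means, median, KDE-based
# mode, their confidence intervals, and the distribution shape statistics
summarize(toy_dataset)

# Same data with a 99% confidence level and a finer precision for the KDE peak
summarize(toy_dataset, ci_level=0.99, precision=0.05)
```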