diff --git a/DESCRIPTION b/DESCRIPTION index cd6d9e6..19399d9 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,7 +1,7 @@ Package: mdatools Title: Multivariate Data Analysis for Chemometrics -Version: 0.8.0 -Date: 2016-10-07 +Version: 0.8.1 +Date: 2016-10-30 Author: Sergey Kucheryavskiy Maintainer: Sergey Kucheryavskiy Description: Package implements projection based methods for preprocessing, diff --git a/NEWS b/NEWS index 300271f..4175b92 100755 --- a/NEWS +++ b/NEWS @@ -1,3 +1,9 @@ +v.0.8.1 +======= +* fixed a bug in PCA when explained variance was calculated incorrectly for data with excluded rows +* fixed several issues with SIMCA (cross-validation) and SIMCAM (Cooman's plot) +* added a chapter about SIMCA to the tutorial + v.0.8.0 ======= * tutorial has been moved from GitBook to Bookdown and fully rewritten diff --git a/NEWS.md b/NEWS.md index 300271f..4175b92 100755 --- a/NEWS.md +++ b/NEWS.md @@ -1,3 +1,9 @@ +v.0.8.1 +======= +* fixed a bug in PCA when explained variance was calculated incorrectly for data with excluded rows +* fixed several issues with SIMCA (cross-validation) and SIMCAM (Cooman's plot) +* added a chapter about SIMCA to the tutorial + v.0.8.0 ======= * tutorial has been moved from GitBook to Bookdown and fully rewritten diff --git a/R/simcam.R b/R/simcam.R index 8691be7..fa5873c 100755 --- a/R/simcam.R +++ b/R/simcam.R @@ -166,7 +166,6 @@ predict.simcam = function(object, x, c.ref = NULL, cv = F, ...) { pred.res[[i]] = predict(object$models[[i]], x, c.ref) c.pred[, , i] = pred.res[[i]]$c.pred[, object$models[[i]]$ncomp.selected, ] } - dimnames(c.pred) = list(rownames(x), paste('Comp'), object$classnames) c.pred = mda.setattr(c.pred, attrs, 'row') attr(c.pred, 'name') = 'SIMCAM predictions' diff --git a/docs/_main_files/figure-html/unnamed-chunk-43-1.png b/docs/_main_files/figure-html/unnamed-chunk-43-1.png index 52ed672..2d0d327 100644 Binary files a/docs/_main_files/figure-html/unnamed-chunk-43-1.png and b/docs/_main_files/figure-html/unnamed-chunk-43-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-5-1.png b/docs/_main_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 0000000..d539b78 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-61-1.png b/docs/_main_files/figure-html/unnamed-chunk-61-1.png index d85c05a..b83f13e 100644 Binary files a/docs/_main_files/figure-html/unnamed-chunk-61-1.png and b/docs/_main_files/figure-html/unnamed-chunk-61-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-7-1.png b/docs/_main_files/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..d91a2d3 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-8-1.png b/docs/_main_files/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 0000000..6d22954 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-88-1.png b/docs/_main_files/figure-html/unnamed-chunk-88-1.png new file mode 100644 index 0000000..caa592f Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-88-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-89-1.png b/docs/_main_files/figure-html/unnamed-chunk-89-1.png new file mode 100644 index 0000000..d539b78 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-89-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-91-1.png b/docs/_main_files/figure-html/unnamed-chunk-91-1.png new file mode 100644 index 0000000..d91a2d3 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-91-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-92-1.png b/docs/_main_files/figure-html/unnamed-chunk-92-1.png new file mode 100644 index 0000000..6d22954 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-92-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-95-1.png b/docs/_main_files/figure-html/unnamed-chunk-95-1.png new file mode 100644 index 0000000..55d3eed Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-95-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-97-1.png b/docs/_main_files/figure-html/unnamed-chunk-97-1.png new file mode 100644 index 0000000..1e6c325 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-97-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-98-1.png b/docs/_main_files/figure-html/unnamed-chunk-98-1.png new file mode 100644 index 0000000..184b26c Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-98-1.png differ diff --git a/docs/_main_files/figure-html/unnamed-chunk-99-1.png b/docs/_main_files/figure-html/unnamed-chunk-99-1.png new file mode 100644 index 0000000..c16cc72 Binary files /dev/null and b/docs/_main_files/figure-html/unnamed-chunk-99-1.png differ diff --git a/docs/attributes-and-factors.html b/docs/attributes-and-factors.html index 7c581a9..fb89c9f 100644 --- a/docs/attributes-and-factors.html +++ b/docs/attributes-and-factors.html @@ -25,7 +25,7 @@ - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+ + +
+
+ +
+
+

Calibration and validation

+

The model calibration is similar to PCA, but there are several additional arguments, which are important for classification. First of all it is a class name. Class name is a string, which can be used later e.g. for identifying class members for testing. The second important argument is a level of significance, alpha. This parameter is used for calculation of statistical limits and can be considered as probability for false negatives. The default value is 0.05.

+

In this chapter as well as for describing other classification methods we will use a famous Iris dataset, available in R. The dataset includes 150 measurements of three Iris species: Setosa, Virginica and Versicola. The measurements are length and width of petals and sepals in cm. Use ?iris for more details.

+

Let’s get the data and split it to calibration and test sets.

+
data(iris)
+head(iris)
+
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+## 1          5.1         3.5          1.4         0.2  setosa
+## 2          4.9         3.0          1.4         0.2  setosa
+## 3          4.7         3.2          1.3         0.2  setosa
+## 4          4.6         3.1          1.5         0.2  setosa
+## 5          5.0         3.6          1.4         0.2  setosa
+## 6          5.4         3.9          1.7         0.4  setosa
+
# generate indices for calibration set
+idx = seq(1, nrow(iris), by = 2)
+
+# split the values
+X.c = iris[idx, 1:4]
+c.c = iris[idx, 5, drop = F]
+
+X.t = iris[-idx, 1:4]
+c.t = iris[-idx, 5, drop = F]
+

Now, because for calibration we need only objects belonging to a class, we will split the X.c into three matrices — one for each species. The data is ordered by the species, so it can be done relatively easy by taking every 25 rows.

+
X.set = X.c[1:25, ]
+X.ver = X.c[26:50, ]
+X.vir = X.c[51:75, ]
+

Let’s start with creating a model for class Versicolor and exploring available statistics and plots. We will use full cross-validation to validate the results.

+
library(mdatools)
+m = simca(X.ver, 'versicolor', ncomp = 3, cv = 1)
+summary(m)
+
## 
+## SIMCA model for class "versicolor" summary
+## 
+## Info: 
+## Significance level (alpha): 0.05
+## Selected number of components: 3
+## 
+##        Expvar Cumexpvar Sens (cal) Expvar (cv) Sens (cv)
+## Comp 1  76.44     76.44       0.96       71.85      0.88
+## Comp 2  13.93     90.37       0.92       13.91      0.84
+## Comp 3   8.45     98.82       0.92       12.24      0.84
+

Let’s look at plots and start with summary plot.

+
plot(m)
+

+

The plot is very similar to what we seen for PCA model, the only difference is that it shows modelling power instead of loadings. Modelling power is a measure of contribution of each variable to the model and varies from 0 to 1. Usually variables with modelling power below 0.1 are considered as irrelevant.

+

Let’s give a closer look at the residuals plot with different values for alpha (we will keep number of components equal to three in all cases).

+
m1 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.01)
+m2 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.05)
+m3 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.10)
+m4 = simca(X.ver, 'versicolor', ncomp = 3, cv = 1, alpha = 0.15)
+
+par(mfrow = c(2, 2))
+plotResiduals(m1)
+plotResiduals(m2)
+plotResiduals(m3)
+plotResiduals(m4)
+

+

As you can see, using alpha = 0.01 reduced number of false negatives to zero, as the acceptance limits became larger, while alpha = 0.15 gives a lot of incorrectly rejected class members. It must be noted, that decreasing alpha will also lead to a larger number of false positives, which we can not see in this case.

+
+

Predictions and validation with a test set

+

When model is ready one can test it using a new test set with know classes. In this case we will use objects from all three species and be able to see how good the model performs on strangers (and calculate the specificity). In order to do that we will provide both the matrix with predictors, X.t, and a vector with names of the classes for corresponding objects/rows (c.t). The values with known classes in this case can be:

+
    +
  • a vector with text values (names)
  • +
  • a factor using the names as labels
  • +
  • a vector with logical values (TRUE for class members and FALSE for strangers)
  • +
+

In our case we have a factor. Instead of creating a new model and providing the values as test set we will make predictions instead.

+
res = predict(m, X.t, c.t)
+summary(res)
+
## 
+## Summary for SIMCA one-class classification result
+## 
+## Class name: versicolor
+## Number of selected components: 3
+## 
+##        Expvar Cumexpvar TP FP TN FN Spec Sens
+## Comp 1  64.27     64.27 23  5 45  2 0.90 0.92
+## Comp 2   1.67     65.95 24  3 47  1 0.94 0.96
+## Comp 3  32.45     98.40 22  3 47  3 0.94 0.88
+

In this case we see a more detailed statistics with true/false positives and negatives, specificity and sensitivity. The performance statistics can be also shown as plots.

+
par(mfrow = c(2, 2))
+plotSpecificity(res)
+plotSensitivity(res)
+plotMisclassified(res)
+plotPerformance(res)
+

+

The classification results can be shown both graphically and numerically. Here is a prediction plot for the results.

+
par(mfrow = c(2, 1))
+plotPredictions(res)
+plotPredictions(res, ncomp = 2)
+

+

So we can see that for the model with three components we have no false positives (specificity = 1) and one false negative (sensitivity = 24/25 = 0.96). You can also show the predictions as a matrix with -1 and +1 using method showPredictions() or get the array with predicted class values directly as it is shown in the example below (for first 10 rows, different number of components and the first classification variable).

+
show(res$c.pred[31:40, 1:3, 1])
+
##    Comp 1 Comp 2 Comp 3
+## 62      1      1      1
+## 64      1      1      1
+## 66      1      1      1
+## 68      1      1     -1
+## 70      1      1      1
+## 72      1      1      1
+## 74      1      1     -1
+## 76      1      1      1
+## 78      1      1      1
+## 80      1      1      1
+
+
+
+ +
+
+
+ + + + + + + + + + + + + + + + + diff --git a/docs/datasets-and-plots.html b/docs/datasets-and-plots.html index cce8827..c25f48d 100644 --- a/docs/datasets-and-plots.html +++ b/docs/datasets-and-plots.html @@ -25,7 +25,7 @@ - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+ + +
+
+ +
+
+

Multiclass classification

+

Several SIMCA models can be combined to a special object simcam, which is used to make a multiclass classification. Besides this, it also allows calculating distance between individual models and a discrimination power — importance of variables to discriminate between any two classes. Let’s see how it works.

+

First we create three single-class SIMCA models with individual settings, such as number of optimal components and alpha.

+
m.set = simca(X.set, 'setosa', 3, alpha = 0.01)
+m.set = selectCompNum(m.set, 1)
+
+m.vir = simca(X.vir, 'virginica', 3)
+m.vir = selectCompNum(m.vir, 2)
+
+m.ver = simca(X.ver, 'versicola', 3)
+m.ver = selectCompNum(m.ver, 1)
+

Then we combine the models into a SIMCAM model object. Summary will show the performance on calibration set, which is a combination of calibration sets for each of the individual models

+
m = simcam(list(m.set, m.vir, m.ver))
+summary(m)
+
## 
+## SIMCA multiple classes classification (class simcam)
+## Nmber of classes: 3
+## Info: 
+## 
+## SIMCA model for class "setosa" summary
+## 
+## Info: 
+## Significance level (alpha): 0.01
+## Selected number of components: 1
+## 
+##        Expvar Cumexpvar Sens (cal)
+## Comp 1  73.51     73.51          1
+## Comp 2  14.24     87.76          1
+## Comp 3  10.44     98.20          1
+## 
+## SIMCA model for class "virginica" summary
+## 
+## Info: 
+## Significance level (alpha): 0.05
+## Selected number of components: 2
+## 
+##        Expvar Cumexpvar Sens (cal)
+## Comp 1  76.16     76.16       0.88
+## Comp 2  14.94     91.10       1.00
+## Comp 3   6.09     97.20       0.96
+## 
+## SIMCA model for class "versicola" summary
+## 
+## Info: 
+## Significance level (alpha): 0.05
+## Selected number of components: 1
+## 
+##        Expvar Cumexpvar Sens (cal)
+## Comp 1  76.44     76.44       0.96
+## Comp 2  13.93     90.37       0.92
+## Comp 3   8.45     98.82       0.92
+

Now we apply the combined model to the test set and look at the predictions.

+
res = predict(m, X.t, c.t)
+plotPredictions(res)
+

+

In this case the predictions are shown only for the number of components each model found optimal. The names of classes along y-axis are the individual models. Similarly we can show the predicted values.

+
show(res$c.pred[20:30, 1, 1:3])
+
##    setosa virginica versicola
+## 40      1        -1        -1
+## 42     -1        -1        -1
+## 44      1        -1        -1
+## 46      1        -1        -1
+## 48      1        -1        -1
+## 50      1        -1        -1
+## 52     -1        -1         1
+## 54     -1        -1         1
+## 56     -1         1         1
+## 58     -1        -1        -1
+## 60     -1        -1         1
+

There are three additional plots available for multiclass SIMCA model. First of all it is a distance between a selected model and the others.

+
par(mfrow = c(1, 2))
+plotModelDistance(m, 1)
+plotModelDistance(m, 2)
+

+

The second plot is a discrimination power, mentioned in the beginning of the section.

+
par(mfrow = c(1, 2))
+plotDiscriminationPower(m, c(1, 3), show.labels = T)
+plotDiscriminationPower(m, c(2, 3), show.labels = T)
+

+

And, finally, a Cooman’s plot showing an orthogonal distance from objects to two selected classes/models.

+
par(mfrow = c(1, 2))
+plotCooman(m, c(1, 3), show.labels = T)
+plotCooman(m, c(2, 3), show.labels = T)
+

+ +
+ +
+ +
+
+
+ + + + + + + + + + + + + + + + + diff --git a/docs/overview.html b/docs/overview.html index ce1153d..468a799 100644 --- a/docs/overview.html +++ b/docs/overview.html @@ -25,7 +25,7 @@ - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+ + +
+
+ +
+
+

SIMCA classification

+

SIMCA (Soft Independent Modelling of Class Analogy) is a simple one-class classification method mainly based on PCA. The general idea is to create a PCA model using data for samples/objects belonging to a class and classify new objects based on how good the model can fit them. The decision is made using two residual distances — \(Q\) (squared residual distance from an object to its projection to PCA space) and \(T^2\) - distance between the projection of the object and origin of PC space. The \(T^2\) distance is calculated for normalized scores.

+

The first distance (also known as “orthogonal distance”) shows how good the new object following the same trend as the other objects from the class used to create the model, while the second (also known as “score distance”) tells how extreme is it. Both distances may have certain statistical limits, which can be used to cut-off the strangers and accept class members with a pre-define expected ratio of false negatives (\(\alpha\)). There are several ways to calculate the limits, for example, see this paper. In mdatools so far we use a simplest way, suggested by Svante Wold, where the limit for orthogonal distance is calculated using F-distribution and the score distance is computed using Hotelling T2 distribution. More methods will be available in future releases.

+

The classification performance is assessed using true/false positives and negatives and statistics, showing the ability of a classification model to recognize class members (sensitivity or true positive rate) and how good the model is for identifying strangers (specificity or true negative rate). In addition to that, model also calculates a percent of misclassified objects. All statistics are calculated for calibration and validation (if any) results, but one must be aware that specificity can not be computed without objects not belonging to the class and, therefore, calibration and cross-validation results in SIMCA do not have specificity values.

+

It must be also noted that any SIMCA model or result is also a PCA object and all plots, methods, statistics, available for PCA, can be used for SIMCA objects as well.

+
+
+ +
+
+
+ + + + + + + + + + + + + + + + + diff --git a/docs/simple-plots.html b/docs/simple-plots.html index 92181b2..597e9c4 100644 --- a/docs/simple-plots.html +++ b/docs/simple-plots.html @@ -25,7 +25,7 @@ - + @@ -107,7 +107,7 @@ +
  • SIMCA classification
  • @@ -274,8 +280,8 @@

    Variable selection

    - - + + diff --git a/docs/what-is-new.html b/docs/what-is-new.html index b8febce..9f37217 100644 --- a/docs/what-is-new.html +++ b/docs/what-is-new.html @@ -25,7 +25,7 @@ - +