Skip to content

Commit

Permalink
create new vignettes for functions impPCA and xgboostImpute
Browse files Browse the repository at this point in the history
  • Loading branch information
BirgitKarlhuber committed Jul 17, 2024
1 parent 4013d12 commit 013cc58
Show file tree
Hide file tree
Showing 2 changed files with 230 additions and 0 deletions.
89 changes: 89 additions & 0 deletions vignettes/impPCA.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
---
title: "Imputation Method based on Iterative EM PCA"
author: "Birgit Karlhuber"
date: "2024-07-08"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Imputation Method based on Iterative EM PCA}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 7,
fig.height=4,
fig.align = "center"
)
```


This vignette showcases the function `impPCA()`, which can be used to impute missing values with the help of a greedy algorithm for EM-PCA including robust methods.

### Data

The following example demonstrates the functionality of `impPCA()` using a subset of `sleep`. The columns have been selected deliberately to include some interactions between the missing values

```{r, message=FALSE}
library(VIM)
data(iris)
df <- iris[1:30,c(1,4)]
colnames(df) <- c("S.Length", "P.Width") # select two numerical variables
df_na <- df
# randomly produce some missing values in the column P.Width
set.seed(1)
nbr_missing <- 10
index_na <- sample(nrow(df_na),size = nbr_missing,replace = T)
index_na <- index_na[!duplicated(index_na)]
df_na[index_na,2] <- NA
w <- is.na(df_na$`P.Width`)
aggr(df_na)
```


## Imputation

By setting method to "mcd" the robust estimation is used (instead of the default "classical"). With boot=FALSE imputed data set would be a data.frame else (boot=TRUE) it is a list where each list element contains a data.frame.

```{r, message=FALSE, results='hide'}
imputed <- impPCA(df_na, method = "mcd", boot=TRUE)[[1]]
aggr(imputed)
```

The plot shows that all missing values of the variable `P.Width` were imputed by the `impPCA()` function.


## Performance of method

In the next plot the non-missing data points, the ones set to missing and the associated imputed values are visualized. Furthermore the **MAPE** short for **M**ean **A**bsolute **P**ercentage **E**rror and the **NRMSE** short for **N**ormalized **R**oot **M**ean **S**quared **E**rror are visualized in the plot.

```{r, message=FALSE, fig.height=5}
# create plot
plot(`P.Width` ~ `S.Length`, data = df, type = "n", ylab = "P.Width", xlab="S.Length")
mtext(text = "impPCA robust", side = 3)
points(df$`S.Length`[!w], df$`P.Width`[!w])
points(df$`S.Length`[w], df$`P.Width`[w], col = "grey", pch = 17)
points(imputed$`S.Length`[w], imputed$`P.Width`[w], col = "red", pch = 20, cex = 1.4)
segments(x0 = df$`S.Length`[w], x1 = imputed$`S.Length`[w], y0 = df$`P.Width`[w],
y1 = imputed$`P.Width`[w], lty = 2, col = "grey")
legend("topleft", legend = c("non-missings", "set to missing", "imputed values"),
pch = c(1,17,20), col = c("black","grey","red"), cex = 0.7)
mape <- round(100* 1/sum(is.na(df_na$`P.Width`)) * sum(abs((df$`P.Width` -
imputed$`P.Width`) / df$`P.Width`)), 2)
s2 <- var(df$`P.Width`)
nrmse <- round(sqrt(1/sum(is.na(df_na$`P.Width`)) * sum(abs((df$`P.Width` -
imputed$`P.Width`) / s2))), 2)
text(x = 5.6, y = 0.16, labels = paste("MAPE =", mape))
text(x = 5.6, y = 0.12, labels = paste("NRMSE =", nrmse))
```

141 changes: 141 additions & 0 deletions vignettes/xgboostImpute.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
---
title: "Imputation Method based on Random Forest Model"
author: "Birgit Karlhuber"
date: "2024-07-08"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Imputation Method based on Random Forest Model}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---


```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 7,
fig.height=4,
fig.align = "center"
)
```


This vignette showcases the function `xgboostImpute()`, which can be used to impute missing values based on a random forest model using `[xgboost::xgboost()].

### Data

The following example demonstrates the functionality of `xgboostImpute()` using a subset of `sleep`. The columns have been selected deliberately to include some interactions between the missing values

```{r, message=FALSE}
library(VIM)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")] # dataset with missings
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
str(dataset)
```


## Imputation

In order to invoke the imputation methods, a formula is used to specify which
variables are to be estimated and which variables should be used as regressors.First `Dream` will be imputed based on `BodyWgt`.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(formula=Dream~BodyWgt,data = dataset)
aggr(imp_xgboost, delimiter = "_imp")
```

The plot shows that all missing values of the variable `Dream` were imputed by the `xgboostImpute()` function.


## Diagnosing the result

As we can see in the next plot, the correlation structure of `Dream` and
`BodyWgt` is preserved by the imputation method.

```{r, fig.height=5}
imp_xgboost[, c("Dream", "BodyWgt", "Dream_imp")] |>
marginplot(delimiter = "_imp")
```


## Imputing multiple variables

To impute several variables at once, the formula can be specified with more than one column name on the
left hand side.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(Dream+NonD+Span~BodyWgt,data=dataset)
aggr(imp_xgboost, delimiter = "_imp")
```


## Performance of method

In order to validate the performance of `xgboostImpute()` the `iris` dataset is used. Firstly, some values are randomly set to `NA`.

```{r}
library(reactable)
data(iris)
df <- iris
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row=sample(nrow(iris),size = nbr_missing,replace = T),
col=sample(ncol(iris)-1,size = nbr_missing,replace = T))
y<-y[!duplicated(y),]
df[as.matrix(y)]<-NA
aggr(df)
sapply(df, function(x)sum(is.na(x)))
```

We can see that there are missings in all variables and some observations reveal missing values on several points. In the next step we perform a multiple variable imputation and `Species` serves as a regressor.

```{r, message=FALSE}
imp_xgboost <- xgboostImpute(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_xgboost, delimiter = "imp")
```

The plot indicates that all missing values have been imputed by the `xgboostImpute()` algorithm. The following table displays the rounded first five results of the imputation for all variables.

```{r echo=F,warning=F}
results <- cbind("TRUE1" = as.numeric(iris[as.matrix(y[which(y$col==1),])]),
"IMPUTED1" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==1),])]),2),
"TRUE2" = as.numeric(iris[as.matrix(y[which(y$col==2),])]),
"IMPUTED2" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==2),])]),2),
"TRUE3" = as.numeric(iris[as.matrix(y[which(y$col==3),])]),
"IMPUTED3" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==3),])]),2),
"TRUE4" = as.numeric(iris[as.matrix(y[which(y$col==4),])]),
"IMPUTED4" = round(as.numeric(imp_xgboost[as.matrix(y[which(y$col==4),])]),2))[1:5,]
reactable(results, columns = list(
TRUE1 = colDef(name = "True"),
IMPUTED1 = colDef(name = "Imputed"),
TRUE2 = colDef(name = "True"),
IMPUTED2 = colDef(name = "Imputed"),
TRUE3 = colDef(name = "True"),
IMPUTED3 = colDef(name = "Imputed"),
TRUE4 = colDef(name = "True"),
IMPUTED4 = colDef(name = "Imputed")
),
columnGroups = list(
colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4"))
),
striped = TRUE,
highlight = TRUE,
bordered = TRUE
)
```

0 comments on commit 013cc58

Please sign in to comment.