---
title: "NHANES"
author:
  - name: Frank Harrell
    affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine<br>MSCI Biostatistics II
date: last-modified
format:
  html:
    self-contained: true
    number-sections: true
    number-depth: 3
    anchor-sections: true
    smooth-scroll: true
    theme: journal
    toc: true
    toc-depth: 3
    toc-title: Contents
    toc-location: left
    code-link: false
    code-tools: true
    code-fold: show
    code-block-bg: "#f1f3f5"
    code-block-border-left: "#31BAE9"
    reference-location: margin
    fig-cap-location: margin
execute:
  warning: false
  message: false
major: data reduction
---
```{r setup}
require(rms)
require(qreport) # for maketabs
options(prType='html', grType='plotly') # plotly makes certain graphics interactive
```
# Problem: How Are Demographics Associated with Height?
* Dataset: `nhgh`
* N observations: 6795
* Response: height
* Predictors: Race/ethnicity, sex, age
# Data Import
```{r import}
getHdata(nhgh)   # fetch the dataset from hbiostat.org/data
d <- upData(nhgh, keep=.q(sex, age, re, ht, SCr, gh,
                          dx, tx, wt, bmi, leg, arml, armc, waist, tri, sub))
contents(d)
```
# Descriptive Statistics
```{r results='asis'}
sparkline::sparkline(0) # initialize javascript for describe
des <- describe(d)
maketabs(print(des, 'both'), wide=TRUE)
```
# Plot Raw Data
There are many tied values of `ht`, so use hexagonal binning to depict frequency counts of coincident points. Use `ggplot2` faceting to stratify by `re` and `sex`. Make the plot taller than usual with the chunk markup `#| fig.height: 8.5` (8.5 inches). [If you have trouble getting this to run, try installing the `hexbin` package.]{.aside}
```{r}
#| fig.height: 8.5
cuts <- c(1:5, 10, 15, 20, 30)   # cutpoints for cell N
ggplot(d, aes(x=age, y=ht)) +
  facet_grid(re ~ sex) +
  stat_binhex(aes(alpha=after_stat(count),
                  color=after_stat(cut2(count, cuts=cuts))), bins=80) +
  xlab(hlab(age)) + ylab(hlab(ht)) +
  guides(alpha='none', fill='none', color=guide_legend(title='Frequency'))
```
# Linear Model
* No assumption of linearity for continuous predictors
* Allow for age $\times$ sex interaction (different shapes of age relationship by sex)
```{r}
dd <- datadist(d); options(datadist='dd')
# x=TRUE needed for studentized residuals
f <- ols(ht ~ re + sex * rcs(age, 4), data=d, x=TRUE)
f
latex(f)
```
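A standard follow-up for `rms` fits (not shown in the original code) is `anova`, which gives joint Wald tests for the nonlinear age terms and the age $\times$ sex interaction:
```{r}
# Joint Wald tests for all effects, nonlinear components, and the interaction
anova(f)
```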
# Diagnostics
## Residuals
```{r}
r <- resid(f, type='studentized')
ggplot(mapping=aes(fitted(f), r)) + geom_point(alpha=.2) +
  xlab('Fitted Values') + ylab('Studentized Residuals')
ggplot(mapping=aes(sample=r)) + geom_qq() + stat_qq_line() +
  xlab('Normal Quantiles') + ylab('Studentized Residuals')
```
* What is your opinion about which model assumptions hold and which don't?
* How might the right hand side of the model not fit adequately?
## Influence Statistics
Find observations that, if removed, would change a parameter estimate by more than 0.15 standardized (by standard error) units.[Because of the large sample size there were no overly influential observations at standard cutoffs such as 0.4; the cutoff was relaxed to 0.15 for illustration.]{.aside} `which.influence` in `rms` recognizes that some predictors have multiple $\beta$s associated with them (e.g., splined predictors, or categorical predictors such as `re` with more than 2 levels). It finds observations that are influential on *any* of a predictor's parameters.
```{r}
w <- which.influence(f, 0.15)
w
show.influence(w, d)
```
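Under the hood, `which.influence` examines the DFBETAS matrix; here is a sketch of inspecting it directly (the 0.15 cutoff matches the call above):
```{r}
# DFBETAS: standardized change in each coefficient estimate when an
# observation is deleted (requires x=TRUE in the ols call)
db <- resid(f, type='dfbetas')
range(db)                            # most extreme standardized changes
sum(apply(abs(db) > 0.15, 1, any))   # observations influential on any coefficient
```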
# Data Reduction
**Why do data reduction?**
If we hope to construct stable models, there is a limit to how complex they can be: the type of outcome and the number of observations limit model complexity. [Some of this material was provided by Tom Stewart]{.aside}
**Approach**
Determine the "effective sample size" and a parameter budget. Live within the parameter budget, allocating parameters to predictors based on research goals and strength of association with the outcome.
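A rough illustration of the budgeting step (a sketch only; the heuristic of roughly one candidate parameter per 15 observations for a continuous response is an assumption added here, not part of the original exercise):
```{r}
# Hypothetical parameter budget using the m = n / 15 heuristic
# for a continuous response
n <- nrow(d)    # observations currently in the analysis dataset
floor(n / 15)   # approximate number of candidate parameters affordable
```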
**Challenge**
Lots of predictors with a limited parameter budget.
**Strategies Demonstrated**
* Generalized Spearman $\rho^2$ (predictive potential)
* Variable clustering
* Redundancy analysis
* Principal components
## Data Subsetting
Include persons age 21 and older who have never been diagnosed or treated for diabetes.
```{r}
d <- subset(d, age >= 21 & dx == 0 & tx == 0)
cat(nrow(d), 'subjects qualified\n')
```
## Predictive Potential of Continuous Variables
```{r}
#| fig.height: 3.5
s <- spearman2(gh ~ age + bmi + ht + wt + leg + arml + armc +
waist + tri + sub, data=d)
s
plot(s)
```
## Variable Clustering
* Shows possibility of combining some variables
* Occasionally leads to removal of some variables
* Helps with collinearities and redundancies, lowering the effective number of parameters needing estimation
```{r}
v <- varclus(~ age + bmi + ht + wt + leg + arml + armc + waist +
tri + sub + re + sex,
data=d) # function is in Hmisc
plot(v)
```
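The dendrogram is driven by a similarity matrix, by default squared Spearman rank correlations; a corner of it can be inspected through the `sim` component of the result (a quick look, not required for the exercise):
```{r}
# Squared Spearman correlations used as similarities by varclus
round(v$sim[1:5, 1:5], 2)
```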
* Interpret what you see for continuous variables
* Interpret what you see for categorical variables
## Redundancy Analysis
* Try to predict each predictor from all the other predictors (splined if continuous; all indicator variables less one if categorical)
* Note that `bmi`, `ht`, and `wt` would have been perfectly predicted (any one from the other two; $R^{2} = 1.0$) had they been logged, as checked after the output below
```{r}
redun(~ age + bmi + ht + wt + leg + arml + armc + waist + tri + sub + re + sex,
data=d) # function is in Hmisc
```
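To verify the log-scale redundancy (a quick check added here, assuming `ht` is in cm and `wt` in kg so that $\log(\text{bmi}) = \log(\text{wt}) - 2\log(\text{ht}/100)$, which should hold up to rounding in the recorded `bmi`):
```{r}
# bmi = wt / (ht / 100)^2, so the log identity holds up to rounding
with(d, quantile(abs(log(bmi) - (log(wt) - 2 * log(ht / 100))),
                 c(.5, .9, 1), na.rm=TRUE))
```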
## Principal Components
* A way to find linear combinations of predictors that capture much of the information in the individual predictors
* Is masked to $Y$, so it is *unsupervised learning* and does not lead to overfitting
* Consider the body size measurements only, and remove BMI for the moment because it is perfectly computed from height and weight (were logs taken on all 3 variables)
* PCs are linear combinations of the original variables
* The first PC (PC$_1$) explains the maximum variance, subject to a scaling constraint
* The second PC (PC$_2$) explains the next largest amount of variance, subject to being uncorrelated with PC$_1$
```{r}
#| fig.height: 4
pc <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
data=d, sw=TRUE)$scores
pc1 <- pc[, 1] # vector of PC1
pc2 <- pc[, 2] # PC2
```
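As a cross-check (a sketch using base R's `prcomp` on standardized complete cases, not part of the original exercise), successive PCs are uncorrelated by construction:
```{r}
# Cross-check with base R: prcomp on the standardized variables
# (complete cases only; prcomp does not accept NAs)
X <- d[, .q(age, ht, wt, leg, arml, armc, waist, tri, sub)]
X <- X[complete.cases(X), ]
p <- prcomp(X, scale.=TRUE)
round(cor(p$x[, 1:3]), 10)   # first 3 PC scores are mutually uncorrelated
```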
* How many components would you choose?
* To use PC-based data reduction in outcome prediction, substitute `pc1`, `pc2`, ... for some of the constituent variables.
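For example, a minimal sketch (assuming, as the comparison loop below does, that the PC score vectors align with the rows of `d`):
```{r}
# Replace the nine measurement variables with their first two PCs,
# keeping race/ethnicity and sex as ordinary predictors
g <- orm(gh ~ pc1 + pc2 + re + sex, data=d)
g
```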
The `princmp` function can also compute sparse PCs if you have installed the `pcaPP` package. This sets many of the loadings exactly to zero, effectively doing variable clustering and principal components simultaneously.
```{r}
spc <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
data=d, sw=TRUE, method='sparse')$scores
```
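To see which variables the sparse method dropped from each component, one could refit with the same call and print the whole `princmp` object instead of extracting only the scores (a sketch):
```{r}
# Same call as above, keeping the full object so that the sparse
# (mostly zero) loadings are displayed
sp <- princmp(~ age + ht + wt + leg + arml + armc + waist + tri + sub,
              data=d, sw=TRUE, method='sparse')
print(sp)
```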
* Let's see how regular and sparse PCs predict glycohemoglobin in an ordinal model
```{r results='asis'}
w <- vector('list', 10)
# Models using the first k regular PCs, then the first k sparse PCs
for(k in 1:5) w[[k]]     <- orm(gh ~ pc[,  1:k], data=d)
for(k in 1:5) w[[k + 5]] <- orm(gh ~ spc[, 1:k], data=d)
nam <- c(paste0('PC1-', 1:5), paste0('Sparse PC1-', 1:5))   # PC1-k: first k PCs
names(w) <- nam
maketabs(w)
```
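A compact numeric comparison of the ten fits (a sketch using the pseudo $R^2$ stored in each `orm` fit's `stats` component):
```{r}
# Pseudo R^2 for each of the ten ordinal models
round(sapply(w, function(m) unname(m$stats['R2'])), 3)
```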
# Computing Environment
```{r echo=FALSE,results='asis'}
markupSpecs$html$session()
```