Skip to content

This repository accompanies “Shape-Based Clustering of Daily Weigh-In Trajectories using Dynamic Time Warping”. The aim of the paper was to use Dynamic Time Warping, a shape-based clustering method, to cluster binary trajectories and evaluate patterns.

Notifications You must be signed in to change notification settings

samantha-bothwell/BinaryShapeClust

Repository files navigation

Binary Shape-Based Clustering

This repository is accompanies “Pattern-Based Clustering of Daily Weigh-In Trajectories using Dynamic Time Warping” (DOI : 10.1111/biom.13773). The aim of the paper was to use Dynamic Time Warping, a shape-based clutering method, to cluster binary trajectories and evaluate patterns. Study data is not made publically available, however simulated data to resemble the study data is used and shared in this repository.

We first load necessary libraries and source files that clean the study data that is used to simulate clusters. The simulated dataset can be accessed in the Data/DataSimulated folder.

### Libraries 
library(tidyverse)
library(ggplot2)
library(dtwclust) # cluster time series with dynamic time warping
library(ecodist) # distance function for jaccard
library(kableExtra)

### Source files
source("Code/01_CleanStudyData.R")
source("Code/02_SimulateClusters.R")

We plot the binary adherence of each participant stratified by their true cluster assignment. In the plot below, black indicates a day in which the participant did not weigh in while yellow indicates a day in which the participant did weigh in. The plot is sorted by overall % adherence, with lowest adherence on the bottom.

#### Sort by % adherence
source("Code/ClusterHeatmap.R")

ggplot(sim_plot, aes(x = day, y = participant_id, fill = weighed_in)) +
    geom_tile() + ggtitle("True Data Clusters and Mean Adherence %") + 
    theme(plot.title = element_text(hjust = 0.5)) + 
    scale_fill_viridis(discrete = TRUE, option="A") +
    facet_wrap(~cluster_assignment, scales = "free", ncol = 5)

We use the Euclidan, Jaccard, and DTW distances in the paper to cluster on our binary data and a rolling probability window of the binary data.

### Using binary data
jac_dist <- distance(adherence_mat, method = 'jaccard')
dtw_dist <- dist(adherence_mat, method = 'dtw')
euc_dist <- distance(adherence_mat, method = 'euclidean')

The RollAvg.R code file provides a roll_avg function that requires the binary adherence matrix and a rolling window size. In this example we calculate a 14-day rolling probability window.

### Using continuous data
source("Code/RollAvg.R")
roll_mat = roll_avg(adherence_mat, window = 14) # window size can be adjusted

jac_dist_roll <- distance(roll_mat, method = 'jaccard')
dtw_dist_roll <- dist(roll_mat, method = 'dtw')
euc_dist_roll <- distance(roll_mat, method = 'euclidean')

In the paper, we cluster under the average, single, complete, and Ward linkages. In this example, we cluster using the average linkage with the DTW distance using a 14-day rolling probability window. We define 5 clusters and the dendrogram below visualizes how the clusters are determined.

### Method can be specified as 'average', 'single', 'complete', or 'Ward'
### Provide the distance matrix you want to cluster
clust <- hclust(dtw_dist_roll, method = "average")

### Specify the number of clusters 
cut <- cutree(clust, k = 5)

### Visualize dendrogram
plot(clust)
rect.hclust(clust, k = 5, border = 2:6)

With our binary heatmap adherences, we can assess how well the clustering algorithm worked.

### Add simulated clustering labels
hclus <- stats::cutree(clust, k = 5) %>% 
  as.data.frame(.) %>%
  dplyr::rename(.,cluster_group = .) %>%
  tibble::rownames_to_column("type_col")

hcdata <- ggdendro::dendro_data(clust)
names_order <- hcdata$labels$label

### Plot - plotting the binary data heat map based on the rolling average clustering
data.frame(t(adherence_mat[,-c(1:13)])) %>%
  dplyr::mutate(index = 1:352) %>%
  dplyr::rename_all(funs(stringr::str_replace_all(., "X", ""))) %>% 
  tidyr::gather(key = type_col,value = value, -index) %>%
  dplyr::full_join(., hclus, by = "type_col") %>% 
  mutate(type_col = factor(type_col, levels = as.character(names_order)), 
         weighed_in = factor(value, levels = 0:1, labels = c("no", "yes"))) %>% 
  ggplot(aes(x = index, y = type_col, fill = weighed_in)) +
  geom_tile() +
  scale_fill_viridis(discrete = TRUE, option="A") +
  facet_wrap(~cluster_group, ncol = 5, scales = "free") + 
  guides(fill=FALSE) + 
  theme_bw() + ylab("Subject") +
  theme(strip.background = element_blank(), strip.text = element_blank())

It looks like the correct subjects were sorted into the correct clusters. To verify, we can use external vailidation to assess if the true cluster assignment matches the assigned one under dynamic time warping. All of the listed indices are optimal if close to 1 except for VI (or Variation of Information).

#### Source Validation Indices which will calculate validation indices based on user input
source('Code/ValidationIndices.R')

# Rename cluster labels for validation 
sim_clust$cluster_label <- ifelse(sim_clust$cluster_label == "low", 1, 
    ifelse(sim_clust$cluster_label == "dropout", 2, 
      ifelse(sim_clust$cluster_label == "vacation", 3, 
        ifelse(sim_clust$cluster_label == "med_adherence", 4, 5))))

# Provide true clusters for external validation
external_validation = validation(cut_matrix = cut, true_clusters = sim_clust$cluster_label)

kable(external_validation)

ARI

RI

J

FM

VI

0.9526602

0.9816162

0.9325926

0.9657083

0.0562365

If we do not know the true cluster assingment, we can use internal validation indices instead. The Validation Indices pdf provides detail about how each validation index is calculated.

# Provide original matrix and distance for internal validation
internal_validation = validation(matrix = roll_mat, distance = dtw_dist_roll, cut_matrix = cut)

kable(internal_validation)

Silhouette

Dunn

DaviesBouldin

Calinhara

0.5918745

0.2641509

0.9151558

35.95916

About

This repository accompanies “Shape-Based Clustering of Daily Weigh-In Trajectories using Dynamic Time Warping”. The aim of the paper was to use Dynamic Time Warping, a shape-based clustering method, to cluster binary trajectories and evaluate patterns.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages