---
output: github_document
references:
- id: R-partition
type: article-journal
author:
- family: Millstein
given: Joshua
- family: Battaglin
given: Francesca
- family: Barrett
given: Malcolm
- family: Cao
given: Shu
- family: Zhang
given: Wu
- family: Stintzing
given: Sebastian
- family: Heinemann
given: Volker
- family: Lenz
given: Heinz-Josef
issued:
- year: 2020
title: 'Partition: A surjective mapping approach for dimensionality reduction'
title-short: Partition
container-title: Bioinformatics
page: 676-681
volume: '36'
issue: '3'
URL: 'https://doi.org/10.1093/bioinformatics/btz661'
params:
invalidate_cache: false
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 320
)
```
<!-- badges: start -->
[![R-CMD-check](https://github.com/USCbiostats/partition/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/USCbiostats/partition/actions/workflows/R-CMD-check.yaml)
[![Coverage status](https://codecov.io/gh/USCbiostats/partition/branch/master/graph/badge.svg)](https://app.codecov.io/github/USCbiostats/partition?branch=master)
[![CRAN status](https://www.r-pkg.org/badges/version-ago/partition)](https://cran.r-project.org/package=partition)
[![JOSS](https://joss.theoj.org/papers/10.21105/joss.01991/status.svg)](https://doi.org/10.21105/joss.01991)
[![DOI](https://zenodo.org/badge/178615892.svg)](https://zenodo.org/badge/latestdoi/178615892)
[![USC IMAGE](https://raw.githubusercontent.com/USCbiostats/badges/master/tommy-image-badge.svg)](https://image.usc.edu)
<!-- badges: end -->
# partition
partition is a fast and flexible framework for agglomerative partitioning. partition uses an approach called Direct-Measure-Reduce to create new variables that maintain the user-specified minimum level of information. Each reduced variable is also interpretable: each original variable maps to one and only one variable in the reduced data set. partition is flexible, as well: how variables are selected to reduce, how information loss is measured, and the way data is reduced can all be customized.
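To make the Direct-Measure-Reduce idea concrete, here is a minimal base R sketch of a single step. It is an illustration under stated assumptions, not the package's implementation: the helper names (`direct_pair()`, `measure_pair()`, `reduce_pair()`, `dmr_step()`) are hypothetical, the toy data are simulated by hand, and the information metric is a minimum R-squared stand-in rather than the default ICC.

```r
# One Direct-Measure-Reduce step in base R (illustrative only; all helper
# names below are hypothetical and are not part of the partition package).
set.seed(1234)
n <- 100
shared <- rnorm(n)
# Three noisy copies of a shared signal plus one unrelated variable
dat <- data.frame(
  x1 = shared + rnorm(n, sd = 0.5),
  x2 = shared + rnorm(n, sd = 0.5),
  x3 = shared + rnorm(n, sd = 0.5),
  z  = rnorm(n)
)

# Direct: choose the most strongly correlated pair as the candidate
direct_pair <- function(dat) {
  corrs <- abs(cor(dat))
  corrs[lower.tri(corrs, diag = TRUE)] <- NA
  idx <- which(corrs == max(corrs, na.rm = TRUE), arr.ind = TRUE)[1, ]
  names(dat)[idx]
}

# Reduce: summarize the candidate as the mean of its standardized columns
reduce_pair <- function(dat, pair) {
  rowMeans(scale(dat[pair]))
}

# Measure: minimum R-squared between the summary and each original variable
# (an assumed stand-in metric, used here because it is easy to compute)
measure_pair <- function(dat, pair) {
  min(cor(dat[pair], reduce_pair(dat, pair))^2)
}

dmr_step <- function(dat, threshold) {
  pair <- direct_pair(dat)
  information <- measure_pair(dat, pair)
  # Reject the reduction if it would retain too little information
  if (information < threshold) return(dat)
  out <- dat[setdiff(names(dat), pair)]
  out[[paste(pair, collapse = "_")]] <- reduce_pair(dat, pair)
  out
}

reduced_once <- dmr_step(dat, threshold = 0.6)
names(reduced_once)
```

Repeating such a step until no candidate meets the threshold gives an agglomerative procedure in which every original column ends up in exactly one reduced column. Setting `threshold = 1` in this sketch rejects the reduction and returns the data unchanged.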
## Installation
You can install partition from CRAN with:
``` r
install.packages("partition")
```
Or you can install the development version of partition from GitHub with:
``` r
# install.packages("remotes")
remotes::install_github("USCbiostats/partition")
```
## Example
```{r example}
library(partition)
set.seed(1234)
df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 100)
# don't accept reductions where information < .6
prt <- partition(df, threshold = .6)
prt
# return reduced data
partition_scores(prt)
# access mapping keys
mapping_key(prt)
unnest_mappings(prt)
# use a lower information threshold with the k-means partitioner
partition(df, threshold = .5, partitioner = part_kmeans())
# use a custom partitioner
part_icc_rowmeans <- replace_partitioner(
  part_icc,
  reduce = as_reducer(rowMeans)
)
partition(df, threshold = .6, partitioner = part_icc_rowmeans)
```
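The same pattern used for the custom reducer above can, as far as the package's `as_measure()` helper is concerned, be applied to the information metric. The sketch below is a hedged example rather than a recommended configuration: `mean_pairwise_corr()` and `part_icc_meancorr` are hypothetical names, the metric is chosen only for simplicity, and it assumes that the function passed to `as_measure()` receives the candidate columns and returns a single number. The partition call is guarded so the base R part runs even when the package is not installed.

```r
# Hypothetical measure: mean pairwise correlation among the candidate variables
mean_pairwise_corr <- function(.data) {
  corrs <- stats::cor(.data)
  mean(corrs[upper.tri(corrs)])
}

# Sanity check in base R: two perfectly correlated columns
toy <- data.frame(a = 1:10, b = 2 * (1:10))
toy_metric <- mean_pairwise_corr(toy)
toy_metric

# Plug the measure into the default ICC partitioner (assumes partition is installed)
if (requireNamespace("partition", quietly = TRUE)) {
  set.seed(1234)
  df <- partition::simulate_block_data(
    c(3, 4, 5),
    lower_corr = .4,
    upper_corr = .6,
    n = 100
  )
  part_icc_meancorr <- partition::replace_partitioner(
    partition::part_icc,
    measure = partition::as_measure(mean_pairwise_corr)
  )
  partition::partition(df, threshold = .4, partitioner = part_icc_meancorr)
}
```

Only the measure is swapped here; the director and reducer of the ICC partitioner are left as they are, so the threshold is now interpreted on the scale of the new metric.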
partition also provides several ways to visualize partitions and permutation tests; these functions all start with `plot_*()`. Each returns a ggplot object, so the plots can be extended using ggplot2.
```{r stacked_area_chart, dpi = 320}
plot_stacked_area_clusters(df) +
  ggplot2::theme_minimal(14)
```
## Performance
partition has been meticulously benchmarked and profiled to improve performance, and key sections are written in C++ or use C++-based packages. Using a data frame with 1 million rows on a 2017 MacBook Pro with 16 GB RAM, here's how each of the built-in partitioners performs:
```{r benchmarks1, eval = FALSE}
large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)
basic_benchmarks <- microbenchmark::microbenchmark(
  icc = partition(large_df, .3),
  kmeans = partition(large_df, .3, partitioner = part_kmeans()),
  minr2 = partition(large_df, .3, partitioner = part_minr2()),
  pc1 = partition(large_df, .3, partitioner = part_pc1()),
  stdmi = partition(large_df, .3, partitioner = part_stdmi())
)
```
```{r secret_benchmarks1, echo = FALSE, warning=FALSE, message=FALSE}
library(microbenchmark)
library(ggplot2)
if (params$invalidate_cache) {
  large_df <- simulate_block_data(c(3, 4, 5), lower_corr = .4, upper_corr = .6, n = 1e6)
  basic_benchmarks <- microbenchmark::microbenchmark(
    icc = partition(large_df, .3),
    kmeans = partition(large_df, .3, partitioner = part_kmeans()),
    minr2 = partition(large_df, .3, partitioner = part_minr2()),
    pc1 = partition(large_df, .3, partitioner = part_pc1()),
    stdmi = partition(large_df, .3, partitioner = part_stdmi())
  )
  readr::write_rds(basic_benchmarks, "basic_benchmarks.rds")
} else {
  basic_benchmarks <- readr::read_rds("basic_benchmarks.rds")
}
basic_benchmarks$expr <- forcats::fct_reorder(basic_benchmarks$expr, basic_benchmarks$time)
ggplot2::autoplot(basic_benchmarks) %+%
  ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
  ggplot2::theme_minimal()
```
## ICC vs K-Means
As the number of features (columns) in the data set grows beyond the number of observations (rows), the default ICC method scales more linearly than K-Means-based methods. While K-Means is often faster at lower dimensions, it becomes slower as the features outnumber the observations. For example, using three data sets with increasing numbers of columns, K-Means starts as the fastest and gets increasingly slower, although in this case it is still comparable to ICC:
```{r benchmarks2, eval = FALSE}
narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)
icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
  icc_narrow = partition(narrow_df, .3),
  icc_wide = partition(wide_df, .3),
  icc_wider = partition(wider_df, .3),
  kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
  kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
  kmeans_wider = partition(wider_df, .3, partitioner = part_kmeans())
)
```
```{r secret_benchmarks2, echo = FALSE, warning=FALSE, message=FALSE}
if (params$invalidate_cache) {
  narrow_df <- simulate_block_data(3:5, lower_corr = .4, upper_corr = .6, n = 100)
  wide_df <- simulate_block_data(rep(3:10, 2), lower_corr = .4, upper_corr = .6, n = 100)
  wider_df <- simulate_block_data(rep(3:20, 4), lower_corr = .4, upper_corr = .6, n = 100)
  icc_kmeans_benchmarks <- microbenchmark::microbenchmark(
    icc_narrow = partition(narrow_df, .3),
    icc_wide = partition(wide_df, .3),
    icc_wider = partition(wider_df, .3),
    kmeans_narrow = partition(narrow_df, .3, partitioner = part_kmeans()),
    kmeans_wide = partition(wide_df, .3, partitioner = part_kmeans()),
    kmeans_wider = partition(wider_df, .3, partitioner = part_kmeans())
  )
  readr::write_rds(icc_kmeans_benchmarks, "icc_kmeans_benchmarks.rds")
} else {
  icc_kmeans_benchmarks <- readr::read_rds("icc_kmeans_benchmarks.rds")
}
icc_kmeans_benchmarks$type <- stringr::str_extract(icc_kmeans_benchmarks$expr, "icc|kmeans")
ggplot2::autoplot(icc_kmeans_benchmarks) %+%
  ggplot2::stat_ydensity(color = "#0072B2", fill = "#0072B2BF") +
  ggplot2::facet_wrap(~type, ncol = 1, scales = "free_y") +
  ggplot2::theme_minimal()
```
For more information, see [our paper in Bioinformatics](https://doi.org/10.1093/bioinformatics/btz661), which discusses these issues in more depth [@R-partition].
## Contributing
Please read the [Contributor Guidelines](https://github.com/USCbiostats/partition/blob/master/.github/CONTRIBUTING.md) prior to submitting a pull request to partition. Also note that this project is released with a [Contributor Code of Conduct](https://github.com/USCbiostats/partition/blob/master/.github/CODE_OF_CONDUCT.md). By participating in this project you agree to abide by its terms.
## References