---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
fig.path = "man/figures/README-"
)
```
The `rtree` package offers fast Euclidean within-distance checks and
KNN calculations for points in 2D space.
It offers significant speed-ups over naive implementations
by relying on the [R-tree data structure](https://en.wikipedia.org/wiki/R-tree)
implemented by the
[Boost geometry](https://www.boost.org/doc/libs/1_75_0/libs/geometry/doc/html/geometry/spatial_indexes/introduction.html)
library.
`rtree` was inspired by
[this](https://gallery.rcpp.org/articles/Rtree-examples/)
example in the Rcpp gallery.
## Installation
### From CRAN
```{r eval=FALSE}
install.packages("rtree")
```
### Development version
```{r eval=FALSE}
# install.packages("remotes") # Install if needed
remotes::install_github("hunzikp/rtree")
```
Note: As of version 0.2.0, `rtree` requires R version 4.0.0 or higher.
This is because version 1.75 of `boost::geometry` requires C++14, which is not
well supported on Windows for R versions before 4.0.0.
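If you are unsure whether your installation meets this requirement, you can check
the R version directly:
```{r rversion, eval=FALSE}
getRversion() >= "4.0.0"
```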
## Usage
Say we have two large sets of points, A and B, stored as 2-column matrices of
Cartesian coordinates:
```{r simulate}
## Simulate point coordinates
set.seed(0)
A_n <- 10^4
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^4
B <- cbind(runif(B_n), runif(B_n))
colnames(A) <- colnames(B) <- c('x', 'y')
```
### Within-Distance Calculation
For each point of set $A$, $a_i$, we want to know all points of set $B$ that
are within distance $d$ of $a_i$.
To compute this, we first create an R-Tree index on $B$:
```{r index}
library(rtree)
## Set index
B_rtree <- RTree(B)
```
The `RTree()` function creates an S3 object of class `RTree`,
```{r class}
inherits(B_rtree, 'RTree')
```
which essentially just points to a C++ object of class `RTreeCpp`.
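If you are curious about the wrapper, `str()` reveals its (minimal) structure;
the exact contents are package internals and may change between versions:
```{r str}
str(B_rtree)
```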
Using the `RTree` object, we can now perform our query efficiently:
```{r within}
## Within distance calculation
d <- 0.05
wd_ls <- withinDistance(B_rtree, A, d)
```
`wd_ls` is a list of length `nrow(A)`...
```{r check}
nrow(A)==length(wd_ls)
```
...where the $i$th list element contains the row indices of the points
in $B$ that are within distance $d$ of point $a_i$:
```{r check2}
print(wd_ls[[1]])
```
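As an additional check, we can compare the result for $a_1$ against a brute-force
distance computation (a minimal sketch; it assumes both approaches treat points
exactly on the boundary the same way):
```{r wd_bruteforce}
## Brute-force within-distance check for a_1, compared as a set of row indices
a_1 <- A[1, ]
dists <- sqrt((B[, "x"] - a_1["x"])^2 + (B[, "y"] - a_1["y"])^2)
setequal(which(dists <= d), wd_ls[[1]])
```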
We can also check the sanity of the result visually:
```{r checkplot, fig.cap = "Within distance sanity check.", message=FALSE, warning=FALSE}
## Plot points in B within distance d of point a_1
a_1 <- A[1,] # Get coords of a_1
plot(a_1[1], a_1[2], xlim=c(a_1[1]-d, a_1[1]+d), ylim=c(a_1[2]-d, a_1[2]+d),
col='black', asp=1, pch=20, xlab='x', ylab='y') # Plot a_1
points(B[,1], B[,2], col='grey') # Plot B in grey
symbols(a_1[1], a_1[2], circles=d, add=TRUE, inches=FALSE) # Draw circle of radius d
b_wd <- B[wd_ls[[1]],] # Get relevant points in B
points(b_wd[,1], b_wd[,2], col='red', pch=20) # Plot relevant points in red
```
### Nearest Neighbor Calculation
For each point of set $A$, $a_i$, we want to know the $k$ points in $B$
closest to $a_i$.
Reusing the `RTree` object created above, we perform the KNN computation...
```{r knn}
## KNN calculation
k <- 10L
knn_ls <- knn(B_rtree, A, k)
```
...which returns a list in the same format as above, except that
each element of `knn_ls` has length exactly $k$.
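We can verify this directly with base R:
```{r knn_lengths}
## Every element should contain exactly k neighbor indices
all(lengths(knn_ls) == k)
```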
Again, we can plot the result as a visual sanity check:
```{r checkplot2, fig.cap = "KNN sanity check.", message=FALSE, warning=FALSE}
## Plot the k nearest neighbors of point a_1
a_1 <- A[1,] # Get coords of a_1
plot(a_1[1], a_1[2], xlim=c(a_1[1]-d, a_1[1]+d), ylim=c(a_1[2]-d, a_1[2]+d),
col='black', asp=1, pch=20, xlab='x', ylab='y') # Plot a_1
points(B[,1], B[,2], col='grey') # Plot B in grey
b_knn <- B[knn_ls[[1]],] # Get relevant points in B
points(b_knn[,1], b_knn[,2], col='red', pch=20) # Plot relevant points in red
```
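As before, we can cross-check the neighbors of $a_1$ against a brute-force
computation. Since the ordering of the returned indices is not necessarily the
same, we compare them as sets (a minimal sketch, ignoring the possibility of
exact distance ties):
```{r knn_bruteforce}
## Brute-force k nearest neighbors of a_1, compared as a set of row indices
a_1 <- A[1, ]
dists <- sqrt((B[, "x"] - a_1["x"])^2 + (B[, "y"] - a_1["y"])^2)
setequal(order(dists)[1:k], knn_ls[[1]])
```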
## Benchmarking
### Within-Distance Benchmarks
We first compare the within-distance functionality to the `gWithinDistance()`
function offered in [rgeos](https://cran.r-project.org/package=rgeos)
(version `r packageVersion('rgeos')`).
```{r wd_bench, fig.cap = "", message=FALSE, warning=FALSE}
## Load packages
library(sp)
library(rgeos)
library(rbenchmark)
## Simulate data
set.seed(0)
A_n <- 10^3
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^3
B <- cbind(runif(B_n), runif(B_n))
d <- 0.05
## Encapsulate wd operations in functions, then benchmark
rgeos.wd <- function() {
wd_mat <- gWithinDistance(spgeom1=SpatialPoints(A), spgeom2=SpatialPoints(B),
dist=d, byid=TRUE)
}
rtree.wd <- function() {
wd_ls <- withinDistance(RTree(B), A, d)
}
bm.wd <- benchmark(rtree=rtree.wd(),
rgeos=rgeos.wd(),
replications=10,
columns=c("test", "replications", "elapsed", "relative"))
## Print output
print(bm.wd)
## Plot
barplot(bm.wd$relative, names.arg=bm.wd$test,
ylab="Relative Time Elapsed", cex.main=1.5)
mtext("within distance", line=3, cex=1.5, font=2)
speedup <- round(bm.wd$relative[bm.wd$test=="rgeos"], 1)
mtext(paste("rtree ", speedup, "x faster than rgeos", sep=""),
line=1.5, cex=1.25)
```
### KNN Benchmarks
Next we compare the KNN functionality with the k-d tree based KNN implementation
offered in the [FNN](https://cran.r-project.org/package=FNN)
package (version `r packageVersion('FNN')`).
We don't benchmark against a linear-search KNN implementation, which would be
far slower; a sketch of what such an implementation might look like is shown
below for reference.
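The sketch is illustrative only (the function name `naive_knn` is ours, not part
of the package) and is not evaluated when this document is built:
```{r naive_knn, eval=FALSE}
## Naive linear-search KNN: scans all of B for every query point
naive_knn <- function(B, query, k) {
  dists <- sqrt((B[, 1] - query[1])^2 + (B[, 2] - query[2])^2)
  order(dists)[1:k]
}
```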
```{r knn_bench, fig.cap = "", message=FALSE, warning=FALSE}
## Load packages
library(FNN)
## Simulate data
set.seed(0)
A_n <- 10^4
A <- cbind(runif(A_n), runif(A_n))
B_n <- 10^4
B <- cbind(runif(B_n), runif(B_n))
k <- 100L
## Encapsulate knn operations in functions, then benchmark
kdtree.knn <- function() {
nn.idx <- get.knnx(data=B, query=A, k=k, algorithm=c("kd_tree"))
}
rtree.knn <- function() {
nn_ls <- rtree::knn(RTree(B), A, k)
}
bm.knn <- benchmark(rtree=rtree.knn(),
kdtree=kdtree.knn(),
replications=10,
columns=c("test", "replications", "elapsed", "relative"))
## Print output
print(bm.knn)
## Plot
barplot(bm.knn$relative, names.arg=bm.knn$test,
ylab="Relative Time Elapsed", cex.main=1.5)
mtext("KNN", line=3, cex=1.5, font=2)
speedup <- round(bm.knn$relative[bm.knn$test=="kdtree"], 1)
mtext(paste("rtree ", speedup, "x faster than FNN (kd-tree)", sep=""),
line=1.5, cex=1.25)
```