-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance improvement on layers without positional scales #4990
Conversation
d9bd02d
to
a9e5d00
Compare
Hi Sergio, Thank you for your PRs! |
If you want you can spend some time going through #4989. It's a long issue, so I will summarize it here. GoalGiven a matrix: x_small <- c(3,5,7,9)
y_small <- c(100,200,300)
# The matrix values are usually "random numbers" acquired from an instrument
# Here I choose them so they will have a nice scale when they are log2 transformed:
mat_small <- matrix(
2^seq(from = 1, to = 64, length.out = length(x_small)*length(y_small)),
nrow = length(x_small),
ncol = length(y_small)
)
dimnames(mat_small) <- list(x = x_small, y = y_small)
mat_small
## y
## x 100 200 300
## 3 2.0000 1.575265e+07 1.240729e+14
## 5 105.9524 8.345155e+08 6.572914e+15
## 7 5612.9576 4.420947e+10 3.482081e+17
## 9 297353.2218 2.342050e+12 1.844674e+19 Create a plot that looks like: Straightforward solutionlibrary(ggplot2)
df <- reshape2::melt(mat_small, value.name = "intensity")
gplt <- ggplot(df) +
geom_raster(aes(x = x, y = y, fill = intensity)) +
scale_fill_gradient(trans = "log2")
gplt This solution scales very badly with the size of the matrix. On a 4k x 3k matrix, this plot takes ~45 seconds to complete, while it can be done in 5 seconds with a more efficient strategy and some optimizations. A new geom:I profiled the #' Raster a matrix as a rectangle, efficiently
#'
#'
#' @param matrix The matrix we want to render in the plot
#' @param xmin,xmax,ymin,ymax Coordinates where the corners of the matrix will
#' be centered By default they are taken from rownames (x) and colnames (y) respectively.
#' @param interpolate If `TRUE`, interpolate linearly, if `FALSE` (the default) don't interpolate.
#' @param flip_cols,flip_rows Flip the rows and columns of the matrix. By default we flip the columns.
#' @inheritParams ggplot2::geom_raster
#'
#' @export
geom_matrix_raster <- function(matrix, xmin = NULL, xmax = NULL, ymin = NULL, ymax = NULL,
interpolate = FALSE,
flip_cols = TRUE,
flip_rows = FALSE,
show.legend = NA,
inherit.aes = TRUE)
{
data <- data.frame(values = c(matrix))
mapping <- aes(fill = .data$values)
if (is.null(xmin)) {
xmin <- as.numeric(rownames(matrix)[1L])
}
if (is.null(xmax)) {
xmax <- as.numeric(rownames(matrix)[nrow(matrix)])
}
if (is.null(ymin)) {
ymin <- as.numeric(colnames(matrix)[1L])
}
if (is.null(ymax)) {
ymax <- as.numeric(colnames(matrix)[ncol(matrix)])
}
if (nrow(matrix) > 1L) {
x_step <- (xmax - xmin)/(nrow(matrix) - 1L)
} else {
x_step <- 1
}
if (ncol(matrix) > 1L) {
y_step <- (ymax - ymin)/(ncol(matrix) - 1L)
} else {
y_step <- 1
}
# we return two layers, one blank to create the axes and handle limits, another
# rastering the matrix.
corners <- data.frame(
x = c(xmin - x_step/2, xmax + x_step/2),
y = c(ymin - y_step/2, ymax + y_step/2)
)
corners_xy <- corners
x_y_names <- names(dimnames(matrix))
if (is.null(x_y_names)) {
x_y_names <- c("rows", "columns")
}
colnames(corners) <- x_y_names
x_name <- rlang::sym(x_y_names[1L])
y_name <- rlang::sym(x_y_names[2L])
list(
layer(
data = corners, mapping = aes(x=!!x_name, y=!!y_name), stat = StatIdentity, geom = GeomBlank,
position = PositionIdentity, show.legend = show.legend, inherit.aes = inherit.aes,
params = list(), check.aes = FALSE
),
layer(
data = data,
mapping = mapping,
stat = StatIdentity,
geom = GeomMatrixRaster,
position = PositionIdentity,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list2(
mat = matrix,
matrix_nrows = nrow(matrix),
matrix_ncols = ncol(matrix),
corners = corners_xy,
flip_cols = flip_cols,
flip_rows = flip_rows,
interpolate = interpolate
)
)
)
}
GeomMatrixRaster <- ggproto(
"GeomMatrixRaster", Geom,
non_missing_aes = c("fill"),
required_aes = c("fill"),
default_aes = aes(fill = "grey35"),
draw_panel = function(self, data, panel_params, coord, mat, matrix_nrows, matrix_ncols,
corners, flip_cols, flip_rows, interpolate) {
if (!inherits(coord, "CoordCartesian")) {
rlang::abort(c(
"GeomMatrixRaster only works with coord_cartesian"
))
}
corners <- coord$transform(corners, panel_params)
if (inherits(coord, "CoordFlip")) {
byrow <- TRUE
mat_nr <- matrix_ncols
mat_nc <- matrix_nrows
nr_dim <- c(matrix_nrows, matrix_ncols)
} else {
byrow <- FALSE
mat_nr <- matrix_nrows
mat_nc <- matrix_ncols
nr_dim <- c(matrix_ncols, matrix_nrows)
}
x_rng <- range(corners$x, na.rm = TRUE)
y_rng <- range(corners$y, na.rm = TRUE)
mat <- matrix(
farver::encode_native(data$fill),
nrow = mat_nr,
ncol = mat_nc,
byrow = byrow
)
if (flip_cols) {
rev_cols <- seq.int(mat_nc, 1L, by = -1L)
mat <- mat[, rev_cols, drop = FALSE]
}
if (flip_rows) {
rev_rows <- seq.int(mat_nr, 1L, by = -1L)
mat <- mat[rev_rows, drop = FALSE]
}
nr <- structure(
mat,
dim = nr_dim,
class = "nativeRaster",
channels = 4L
)
rasterGrob(nr, x_rng[1], y_rng[1],
diff(x_rng), diff(y_rng), default.units = "native",
just = c("left","bottom"), interpolate = interpolate)
},
draw_key = draw_key_rect
) With that geom defined, you can build the plot: gplt <- ggplot() +
geom_matrix_raster(matrix = mat_small) +
scale_fill_gradient(trans = "log2")
gplt This pull request kicks inMy geom does not use the x,y coordinates, so in the modified function from this PR, You may think 2.5 out of 45 seconds is not a huge deal, but since my goal is to bring those 45 down to 6 seconds, 2.5 out of 6 matters to me. There is a really long thread I wrote about this (could have been some blog posts I guess) on the performance analysis, ways to do the matrix and my profiling results here: But it's long, so maybe it is worth taking it slowly. I would be happy to present/discuss things in over a call if you like |
Thanks for your response! I did read through that issue, but due to its length it became a bit of a needle in a haystack to find out why exactly this PR would be needed. My gut tells me that this PR is alright, but tenders to a niche problem (which is totally fine). Would you mind adding the news bullet, so that the macOS check can clear? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This looks good to me. Might be bit before the actual merge happens though; I heard on the grapevines that there will be a hotfix release for unrelated issues at some point.
Thanks for the review. Looking forward to the merge! |
Hi @zeehio, I'm sorry I didn't merge this early but there were two hotfix releases in quick succession and forget to merge this PR in somewhere. Meanwhile, the git tree has moved on and has made the changes in this PR incompatible due to a large file renaming effort. Is there any way I could persuade you to merge in the latest changes with this PR? |
Yes sure no problem! I'll do it this week. |
The rebase was straightforward and without conflicts. The only "issue" was the news entry. Feel free to merge :-) |
By the way, this pull request is one of a series aiming to improve ggplot2 performance. The rest of the patches now depend on some improvements to the farver package. The farver package is a ggplot2 dependency that takes care of colour conversion between formats (e.g. hsl->rgb). When
I don't know your availability to support/review/merge code there, but I'd love to get some feedback on those pull requests, if you find some time |
I'm not really involved with the {farver} package, it's mostly Thomas who is involved there, I think. AFAIK, he's usually focussing on one thing at a time, so it might be a while before the cycle reaches {farver} again. My C++ experience is very limited, so I'd be unable to estimate the merit of the PRs. In any case, thank you for syncing this PR and contributing! |
Layers without positional scales do not need to use
match_id
, however it is always computed.On layers with 1E7 data points, these
match()
calls become 2.5 seconds out of the 18 seconds it currently takes me to build the plot. I intend to drive those 18 seconds to 6 seconds or less.I have (maybe too extensively) documented my progress at:
Looking forward to your feedback
Proposed NEWS entry (not added to avoid trivial conflicts):