Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvement on layers without positional scales #4990

Merged
merged 2 commits into from
May 10, 2023

Conversation

zeehio
Copy link
Contributor

@zeehio zeehio commented Sep 16, 2022

Layers without positional scales do not need to use match_id, however it is always computed.

On layers with 1E7 data points, these match() calls become 2.5 seconds out of the 18 seconds it currently takes me to build the plot. I intend to drive those 18 seconds to 6 seconds or less.

I have (maybe too extensively) documented my progress at:

Looking forward to your feedback

Proposed NEWS entry (not added to avoid trivial conflicts):

* Improve performance of layers without positional scales (@zeehio, #4990)

@teunbrand
Copy link
Collaborator

Hi Sergio,

Thank you for your PRs!
Please forgive my ignorance here, but in what type of scenario would these improvements kick in? I'm having a bit of trouble imagining where this would apply.

@zeehio
Copy link
Contributor Author

zeehio commented Dec 24, 2022

If you want you can spend some time going through #4989. It's a long issue, so I will summarize it here.

Goal

Given a matrix:

x_small <- c(3,5,7,9)
y_small <- c(100,200,300)
# The matrix values are usually "random numbers" acquired from an instrument
# Here I choose them so they will have a nice scale when they are log2 transformed:
mat_small <- matrix(
  2^seq(from = 1, to = 64, length.out = length(x_small)*length(y_small)),
  nrow = length(x_small),
  ncol = length(y_small)
)
dimnames(mat_small) <- list(x = x_small, y = y_small)
mat_small
##    y
## x           100          200          300
##   3      2.0000 1.575265e+07 1.240729e+14
##   5    105.9524 8.345155e+08 6.572914e+15
##   7   5612.9576 4.420947e+10 3.482081e+17
##   9 297353.2218 2.342050e+12 1.844674e+19

Create a plot that looks like:

imatge

Straightforward solution

library(ggplot2)
df <- reshape2::melt(mat_small, value.name = "intensity")
gplt <- ggplot(df) +
  geom_raster(aes(x = x, y = y, fill = intensity)) +
  scale_fill_gradient(trans = "log2")
gplt

This solution scales very badly with the size of the matrix. On a 4k x 3k matrix, this plot takes ~45 seconds to complete, while it can be done in 5 seconds with a more efficient strategy and some optimizations.

A new geom:

I profiled the geom_raster() code and I measured where those 45 seconds went. I realized that for this problem I did not need to know all the x and y values, but rather only the xmin,xmax,ymin and ymax. I wrote a geom that takes the matrix and the corner positions, transforming and filling the values efficiently:

#' Raster a matrix as a rectangle, efficiently
#'
#'
#' @param matrix The matrix we want to render in the plot
#' @param xmin,xmax,ymin,ymax Coordinates where the corners of the matrix will
#'  be centered By default they are taken from rownames (x) and colnames (y) respectively.
#' @param interpolate If `TRUE`, interpolate linearly, if `FALSE` (the default) don't interpolate.
#' @param flip_cols,flip_rows Flip the rows and columns of the matrix. By default we flip the columns.
#' @inheritParams ggplot2::geom_raster
#'
#' @export
geom_matrix_raster <- function(matrix, xmin = NULL, xmax = NULL, ymin = NULL, ymax = NULL,
                               interpolate = FALSE,
                               flip_cols = TRUE,
                               flip_rows = FALSE,
                               show.legend = NA,
                               inherit.aes = TRUE)
{
  data <- data.frame(values = c(matrix))
  mapping <- aes(fill = .data$values)

  if (is.null(xmin)) {
    xmin <- as.numeric(rownames(matrix)[1L])
  }
  if (is.null(xmax)) {
    xmax <- as.numeric(rownames(matrix)[nrow(matrix)])
  }
  if (is.null(ymin)) {
    ymin <- as.numeric(colnames(matrix)[1L])
  }
  if (is.null(ymax)) {
    ymax <- as.numeric(colnames(matrix)[ncol(matrix)])
  }

  if (nrow(matrix) > 1L) {
    x_step <- (xmax - xmin)/(nrow(matrix) - 1L)
  } else {
    x_step <- 1
  }
  if (ncol(matrix) > 1L) {
    y_step <- (ymax - ymin)/(ncol(matrix) - 1L)
  } else {
    y_step <- 1
  }


  # we return two layers, one blank to create the axes and handle limits, another
  # rastering the matrix.
  corners <- data.frame(
    x = c(xmin - x_step/2, xmax + x_step/2),
    y = c(ymin - y_step/2, ymax + y_step/2)
  )
  corners_xy <- corners
  x_y_names <- names(dimnames(matrix))
  if (is.null(x_y_names)) {
    x_y_names <- c("rows", "columns")
  }
  colnames(corners) <- x_y_names
  x_name <- rlang::sym(x_y_names[1L])
  y_name <- rlang::sym(x_y_names[2L])

  list(
    layer(
      data = corners, mapping = aes(x=!!x_name, y=!!y_name), stat = StatIdentity, geom = GeomBlank,
      position = PositionIdentity, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(), check.aes = FALSE
    ),
    layer(
      data = data,
      mapping = mapping,
      stat = StatIdentity,
      geom = GeomMatrixRaster,
      position = PositionIdentity,
      show.legend = show.legend,
      inherit.aes = inherit.aes,
      params = list2(
        mat = matrix,
        matrix_nrows = nrow(matrix),
        matrix_ncols = ncol(matrix),
        corners = corners_xy,
        flip_cols = flip_cols,
        flip_rows = flip_rows,
        interpolate = interpolate
      )
    )
  )
}

GeomMatrixRaster <- ggproto(
  "GeomMatrixRaster", Geom,
  non_missing_aes = c("fill"),
  required_aes = c("fill"),
  default_aes = aes(fill = "grey35"),

  draw_panel = function(self, data, panel_params, coord, mat, matrix_nrows, matrix_ncols,
                        corners, flip_cols, flip_rows, interpolate) {
    if (!inherits(coord, "CoordCartesian")) {
      rlang::abort(c(
        "GeomMatrixRaster only works with coord_cartesian"
      ))
    }
    corners <- coord$transform(corners, panel_params)
    if (inherits(coord, "CoordFlip")) {
      byrow <- TRUE
      mat_nr <- matrix_ncols
      mat_nc <- matrix_nrows
      nr_dim <- c(matrix_nrows, matrix_ncols)
    } else {
      byrow <- FALSE
      mat_nr <- matrix_nrows
      mat_nc <- matrix_ncols
      nr_dim <- c(matrix_ncols, matrix_nrows)
    }
    x_rng <- range(corners$x, na.rm = TRUE)
    y_rng <- range(corners$y, na.rm = TRUE)

    mat <- matrix(
      farver::encode_native(data$fill),
      nrow = mat_nr,
      ncol = mat_nc,
      byrow = byrow
    )

    if (flip_cols) {
      rev_cols <- seq.int(mat_nc, 1L, by = -1L)
      mat <- mat[, rev_cols, drop = FALSE]
    }
    if (flip_rows) {
      rev_rows <- seq.int(mat_nr, 1L, by = -1L)
      mat <- mat[rev_rows, drop = FALSE]
    }

    nr <- structure(
      mat,
      dim = nr_dim,
      class = "nativeRaster",
      channels = 4L
    )

    rasterGrob(nr, x_rng[1], y_rng[1],
               diff(x_rng), diff(y_rng), default.units = "native",
               just = c("left","bottom"), interpolate = interpolate)
  },
  draw_key = draw_key_rect
)

With that geom defined, you can build the plot:

gplt <- ggplot() +
      geom_matrix_raster(matrix = mat_small) +
      scale_fill_gradient(trans = "log2")
    gplt

This pull request kicks in

My geom does not use the x,y coordinates, so in the modified function from this PR, x_vars and y_vars have length zero. However, I am still spending 2.5 seconds (out of 45) calculating match_id <- match(...), even if train_scales doesn't use match_id afterwards.

You may think 2.5 out of 45 seconds is not a huge deal, but since my goal is to bring those 45 down to 6 seconds, 2.5 out of 6 matters to me.

There is a really long thread I wrote about this (could have been some blog posts I guess) on the performance analysis, ways to do the matrix and my profiling results here:

But it's long, so maybe it is worth taking it slowly. I would be happy to present/discuss things in over a call if you like

@teunbrand
Copy link
Collaborator

Thanks for your response! I did read through that issue, but due to its length it became a bit of a needle in a haystack to find out why exactly this PR would be needed. My gut tells me that this PR is alright, but tenders to a niche problem (which is totally fine). Would you mind adding the news bullet, so that the macOS check can clear?

Copy link
Collaborator

@teunbrand teunbrand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks good to me. Might be bit before the actual merge happens though; I heard on the grapevines that there will be a hotfix release for unrelated issues at some point.

@zeehio
Copy link
Contributor Author

zeehio commented Dec 24, 2022

Thanks for the review. Looking forward to the merge!

@teunbrand
Copy link
Collaborator

Hi @zeehio,

I'm sorry I didn't merge this early but there were two hotfix releases in quick succession and forget to merge this PR in somewhere. Meanwhile, the git tree has moved on and has made the changes in this PR incompatible due to a large file renaming effort. Is there any way I could persuade you to merge in the latest changes with this PR?

@zeehio
Copy link
Contributor Author

zeehio commented May 7, 2023

Yes sure no problem! I'll do it this week.

@zeehio zeehio force-pushed the perf-pos-scales branch from 5da22fc to 53d4afc Compare May 10, 2023 04:14
@zeehio
Copy link
Contributor Author

zeehio commented May 10, 2023

The rebase was straightforward and without conflicts. The only "issue" was the news entry.

Feel free to merge :-)

@zeehio
Copy link
Contributor Author

zeehio commented May 10, 2023

By the way, this pull request is one of a series aiming to improve ggplot2 performance. The rest of the patches now depend on some improvements to the farver package.

The farver package is a ggplot2 dependency that takes care of colour conversion between formats (e.g. hsl->rgb). When geom_raster is used on a large dataset, this colour conversion currently becomes a bottleneck. Here are some of the implemented performance improvements waiting for review:

  • A colour vector can be represented in farver through a matrix, where each row is a colour and each column a channel. However, it is very common (in ggplot2) to compute channels separately, with each channel in a different vector. Instead of cbind the channel vectors into a matrix, and feed the matrix into farver, encode_colour accepts a list besides a matrix thomasp85/farver#36 enhances farver so it also accepts a list of channel vectors.
  • R has nativeRaster objects that allow really fast drawing. See the https://github.com/coolbutuseless/nara package for examples. nativeRaster objects use the "native colour format" where colours are 32-bit integers, instead of character strings. While farver supports converting character strings into the native color format, the conversion is slow since it goes through the intermediate matrix format described above (colours x channels). PR Performance: encode_native() does not need an intermediate character vector thomasp85/farver#37 skips the intermediate matrix format when converting string colours to integers, making the conversion 10x faster.

I don't know your availability to support/review/merge code there, but I'd love to get some feedback on those pull requests, if you find some time

@teunbrand
Copy link
Collaborator

I'm not really involved with the {farver} package, it's mostly Thomas who is involved there, I think. AFAIK, he's usually focussing on one thing at a time, so it might be a while before the cycle reaches {farver} again. My C++ experience is very limited, so I'd be unable to estimate the merit of the PRs.

In any case, thank you for syncing this PR and contributing!

@teunbrand teunbrand merged commit 533f635 into tidyverse:main May 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants