CRAN comments addressed
hrbrmstr committed Jul 25, 2019
1 parent 0aaa3c5 commit c233760
Showing 13 changed files with 124 additions and 93 deletions.
26 changes: 13 additions & 13 deletions DESCRIPTION
@@ -1,27 +1,27 @@
Package: tdigest
Type: Package
Title: Wicked Fast, Accurate Quantiles Using 't-Digests'
Title: Wicked Fast, Accurate Quantiles Using t-Digests
Version: 0.3.0
Date: 2019-07-21
Date: 2019-07-25
Authors@R: c(
person("Bob", "Rudis", email = "[email protected]", role = c("aut", "cre"),
comment = c(ORCID = "0000-0001-5670-2640")),
person("Ted", "Dunning", role = "aut",
comment = "t-Digest algorithm; <https://github.com/tdunning/t-digest/>"),
person("ajwerner", "", role = "aut",
person("Andrew", "Werner", role = "aut",
comment = "Original C+ code; <https://github.com/ajwerner/tdigest>")
)
Maintainer: Bob Rudis <[email protected]>
Description: The 't-digest' construction algorithm uses a variant of 1-dimensional
'k-means' clustering to produce a very compact data structure that allows
accurate estimation of quantiles. This 't-digest' data structure can be used
Description: The t-Digest construction algorithm, by
Dunning et al., (2019) <arXiv:1902.04023v1>, uses a variant of 1-dimensional
k-means clustering to produce a very compact data structure that allows
accurate estimation of quantiles. This t-Digest data structure can be used
to estimate quantiles, compute other rank statistics or even to estimate
related measures like trimmed means. The advantage of the 't-digest' over
previous digests for this purpose is that the 't-digest' handles data with
full floating point resolution. With small changes, the 't-digest' can handle
values from any ordered set for which we can compute something akin to a mean.
The accuracy of quantile estimates produced by 't-digests' can be orders of
magnitude more accurate than those produced by previous digest algorithms.
related measures like trimmed means. The advantage of the t-Digest over
previous digests for this purpose is that the t-Digest handles data with
full floating point resolution. The accuracy of quantile estimates produced
by t-Digests can be orders of magnitude more accurate than those produced
by previous digest algorithms. Methods are provided to create and update
t-Digests and retrieve quantiles from the accumulated distributions.
URL: https://gitlab.com/hrbrmstr/tdigest
BugReports: https://gitlab.com/hrbrmstr/tdigest/issues
Copyright: file inst/COPYRIGHTS
71 changes: 38 additions & 33 deletions R/create.R
@@ -1,31 +1,31 @@
#' Create a new t-digest histogram from a vector
#' Create a new t-Digest histogram from a vector
#'
#' The t-digest construction algorithm uses a variant of 1-dimensional
#' The t-Digest construction algorithm, by Dunning et al., uses a variant of 1-dimensional
#' k-means clustering to produce a very compact data structure that allows
#' accurate estimation of quantiles. This t-digest data structure can be used
#' accurate estimation of quantiles. This t-Digest data structure can be used
#' to estimate quantiles, compute other rank statistics or even to estimate
#' related measures like trimmed means. The advantage of the t-digest over
#' previous digests for this purpose is that the t-digest handles data with
#' full floating point resolution. With small changes, the t-digest can handle
#' values from any ordered set for which we can compute something akin to a mean.
#' The accuracy of quantile estimates produced by t-digests can be orders of
#' magnitude more accurate than those produced by previous digest algorithms.
#' related measures like trimmed means. The advantage of the t-Digest over
#' previous digests for this purpose is that the t-Digest handles data with
#' full floating point resolution. The accuracy of quantile estimates produced
#' by t-Digests can be orders of magnitude more accurate than those produced
#' by previous digest algorithms. Methods are provided to create and update
#' t-Digests and retrieve quantiles from the accumulated distributions.
#'
#' @param vec vector (will be converted to `double` if not already double). NOTE that this
#' is ALTREP-aware and will not materialize the passed-in object in order to
#' add the values to the t-Digest.
#' @param vec vector (will be converted to `double` if not already double).
#' NOTE that this is ALTREP-aware and will not materialize the passed-in
#' object in order to add the values to the t-Digest.
#' @param compression the input compression value; should be >= 1.0; this
#' will control how aggressively the TDigest compresses data together.
#' will control how aggressively the t-Digest compresses data together.
#' The original t-Digest paper suggests using a value of 100 for a good
#' balance between precision and efficiency. It will land at very small
#' (think like 1e-6 percentile points) errors at extreme points in the
#' distribution, and compression ratios of around 500 for large data sets
#' (~1 million datapoints). Defaults to 100.
#' @export
#' @return a tdigest object
#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
#' @return a `tdigest` object
#' @references [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023)
#' @importFrom stats quantile
#' @useDynLib tdigest, .registration = TRUE
#' @export
#' @examples
#' set.seed(1492)
#' x <- sample(0:100, 1000000, replace = TRUE)
@@ -37,13 +37,13 @@ tdigest <- function(vec, compression=100) {
.Call("Rtdig", vec=vec, compression=compression)
}

#' Calculate sample quantiles from a t-digest
#' Calculate sample quantiles from a t-Digest
#'
#' @param td t-digest object
#' @param td t-Digest object
#' @param probs numeric vector of probabilities with values in range 0:1
#' @export
#' @return a numeric vector
#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
#' @return a `numeric` vector containing the requested quantile values
#' @references <https://raw.githubusercontent.com/tdunning/t-Digest/master/docs/t-Digest-paper/histo.pdf>
#' @examples
#' set.seed(1492)
#' x <- sample(0:100, 1000000, replace = TRUE)
@@ -66,7 +66,7 @@ quantile.tdigest <- function(x, probs = seq(0, 1, 0.25), ...) {
}

#' @rdname tdigest
#' @param x t-tigest object
#' @param x `tdigest` object
#' @param ... unused
#' @keywords internal
#' @export
@@ -87,26 +87,28 @@ print.tdigest <- function(x, ...) {
#' Allocate a new histogram
#'
#' @param compression the input compression value; should be >= 1.0; this
#' will control how aggressively the TDigest compresses data together.
#' will control how aggressively the t-Digest compresses data together.
#' The original t-Digest paper suggests using a value of 100 for a good
#' balance between precision and efficiency. It will land at very small
#' (think like 1e-6 percentile points) errors at extreme points in the
#' distribution, and compression ratios of around 500 for large data sets
#' (~1 million datapoints). Defaults to 100.
#' @export
#' @return a tdigest object
#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
#' @return a `tdigest` object
#' @references <https://raw.githubusercontent.com/tdunning/t-Digest/master/docs/t-Digest-paper/histo.pdf>
#' @examples
#' td <- td_create(10)
td_create <- function(compression=100) {
stopifnot(compression >= 1.0)
compression <- as.double(compression[1])
.Call("Rtd_create", compression=compression, PACKAGE="tdigest")
}

#' Total items contained in the t-digest
#' Total items contained in the t-Digest
#'
#' @param td t-digest object
#' @param td t-Digest object
#' @export
#' @return `double` containing the size of the t-Digest
#' @examples
#' td <- td_create(10)
#' td_add(td, 0, 1)
@@ -118,11 +120,12 @@ td_total_count <- function(td) {
.Call("Rtd_total_count", td=td, PACKAGE="tdigest")
}

#' Add a value to the t-digest with the specified count
#' Add a value to the t-Digest with the specified count
#'
#' @param td t-digest object
#' @param td t-Digest object
#' @param val value
#' @param count count
#' @return the original, updated `tdigest` object
#' @export
#' @examples
#' td <- td_create(10)
@@ -138,9 +141,10 @@ td_add <- function(td, val, count) {

#' Return the value at the specified quantile
#'
#' @param td t-digest object
#' @param td t-Digest object
#' @param q quantile (range 0:1)
#' @export
#' @return the computed quantile (`double`)
#' @examples
#' td <- td_create(10)
#'
@@ -160,8 +164,9 @@ td_value_at <- function(td, q) {

#' Return the quantile of the value
#'
#' @param td t-digest object
#' @param td t-Digest object
#' @param val value
#' @return the computed quantile (`double`)
#' @export
td_quantile_of <- function(td, val) {
stopifnot(inherits(td, "tdigest"))
@@ -170,11 +175,11 @@ td_quantile_of <- function(td, val) {
.Call("Rtd_quantile_of", tdig=td, val=val, PACKAGE="tdigest")
}

#' Merge one t-digest into another
#' Merge one t-Digest into another
#'
#' @param from,into t-digests
#' @param from,into t-Digests
#' @return `into`
#' @return a tdigest object
#' @return a `tdigest` object
#' @export
td_merge <- function(from, into) {
stopifnot(inherits(from, "tdigest"))
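
The roxygen blocks in this file document an incremental API (`td_create()`, `td_add()`, `td_total_count()`, `td_value_at()`, `td_quantile_of()`, `td_merge()`). As a rough sketch of how those pieces might fit together, assuming the `tdigest` package from this commit is installed and that the signatures match the ones shown in the diff, two digests can be filled separately and then merged. The variable names and sample values are illustrative only.

```r
# Sketch only: assumes the tdigest package from this commit is installed.
library(tdigest)

# Two independent digests, e.g. built on separate shards of a stream.
a <- td_create(compression = 100)
b <- td_create(compression = 100)

# td_add(td, val, count) updates the digest in place; each value gets count 1.
for (v in c(1, 2, 3, 4, 5)) td_add(a, v, 1)
for (v in c(6, 7, 8, 9, 10)) td_add(b, v, 1)

# Fold `a` into `b`; per the roxygen docs above, the return value is `into`.
merged <- td_merge(from = a, into = b)

n    <- td_total_count(merged)     # total count held by the merged digest
med  <- td_value_at(merged, 0.5)   # estimated value at the median
rank <- td_quantile_of(merged, 5)  # estimated quantile of the value 5
```

Because `td_value_at()` and `td_quantile_of()` return estimates, it is safer to compare them against a tolerance than to test for exact equality.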
15 changes: 12 additions & 3 deletions README.Rmd
@@ -19,9 +19,18 @@ Wicked Fast, Accurate Quantiles Using 't-Digests'

## Description

The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-digest over previous digests for this purpose is that the t-digest handles data with full floating point resolution. With small changes, the t-digest can handle values from any ordered set for which we can compute something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by previous digest algorithms.

See [the original paper by Ted Dunning](https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf) for more details on t-Digests.
The t-Digest construction algorithm uses a variant of 1-dimensional
k-means clustering to produce a very compact data structure that allows
accurate estimation of quantiles. This t-Digest data structure can be used
to estimate quantiles, compute other rank statistics or even to estimate
related measures like trimmed means. The advantage of the t-Digest over
previous digests for this purpose is that the t-Digest handles data with
full floating point resolution. The accuracy of quantile estimates produced
by t-Digests can be orders of magnitude more accurate than those produced
by previous digest algorithms. Methods are provided to create and update
t-Digests and retrieve quantiles from the accumulated distributions.

See [the original paper by Ted Dunning & Otmar Ertl](https://arxiv.org/abs/1902.04023) for more details on t-Digests.
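
A minimal sketch of the workflow the description refers to, assuming the `tdigest` package is installed and that `tdigest()` and the `quantile()` method for `tdigest` objects behave as documented in `R/create.R` in this commit. The sample data and the chosen probabilities are illustrative, not taken from the package.

```r
# Sketch only: assumes the tdigest package from this commit is installed.
library(tdigest)

set.seed(1492)
x <- rnorm(100000)  # illustrative data, already of type double

# Build a digest from the whole vector using the documented default compression.
td <- tdigest(x, compression = 100)

# Estimated quantiles from the digest next to exact sample quantiles.
probs <- c(0.01, 0.5, 0.99)
est   <- quantile(td, probs = probs)
exact <- quantile(x, probs = probs, names = FALSE)

print(data.frame(probs = probs, estimate = as.numeric(est), exact = exact))
```

The digest keeps a compact summary rather than the raw values, so the two columns should be close but are not expected to match exactly.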

## What's Inside The Tin

13 changes: 10 additions & 3 deletions inst/COPYRIGHTS
@@ -1,8 +1,15 @@
The R code and src/tdigest-main.c, src/init.c are MIT-licensed by the package author.
The R code and src/tdigest-main.c, src/init.c are MIT-licensed
by the package author.

src/tdigest.h, src/tdigest.c are MIT-licensed & Copyright (c) 2018 ajwerner [REF: https://github.com/ajwerner/tdigestc; license below copied from that repository]
src/tdigest.h, src/tdigest.c are MIT-licensed & Copyright (c) 2018
Andrew Werner [REF: https://github.com/ajwerner/tdigestc; license below
copied from that repository]

The original t-Digest implementation and algorithm are have the following license:
The t-Digest original algorithm and Java implementation
(which is the base of Andrew Werner's re-implementation) is
Copyright (c) 2015 Ted Dunning and uses the same Apache License 2.0, below.

------------

Apache License
Version 2.0, January 2004
9 changes: 6 additions & 3 deletions man/td_add.Rd

6 changes: 3 additions & 3 deletions man/td_create.Rd

8 changes: 4 additions & 4 deletions man/td_merge.Rd

5 changes: 4 additions & 1 deletion man/td_quantile_of.Rd

9 changes: 6 additions & 3 deletions man/td_total_count.Rd

5 changes: 4 additions & 1 deletion man/td_value_at.Rd
