CRAN comments addressed

hrbrmstr · Jul 25, 2019 · c233760
1 parent 0aaa3c5

Showing 13 changed files with 124 additions and 93 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,27 +1,27 @@
 Package: tdigest
 Type: Package
-Title: Wicked Fast, Accurate Quantiles Using 't-Digests'
+Title: Wicked Fast, Accurate Quantiles Using t-Digests
 Version: 0.3.0
-Date: 2019-07-21
+Date: 2019-07-25
 Authors@R: c(
     person("Bob", "Rudis", email = "[email protected]", role = c("aut", "cre"), 
            comment = c(ORCID = "0000-0001-5670-2640")),
     person("Ted", "Dunning", role = "aut",
            comment = "t-Digest algorithm; <https://github.com/tdunning/t-digest/>"),
-    person("ajwerner", "", role = "aut", 
+    person("Andrew", "Werner", role = "aut", 
            comment = "Original C+ code; <https://github.com/ajwerner/tdigest>")
   )
-Maintainer: Bob Rudis <[email protected]>
-Description: The 't-digest' construction algorithm uses a variant of 1-dimensional 
-    'k-means' clustering to produce a very compact data structure that allows 
-    accurate estimation of quantiles. This 't-digest' data structure can be used 
+Description: The t-Digest construction algorithm, by 
+    Dunning et al., (2019) <arXiv:1902.04023v1>, uses a variant of 1-dimensional 
+    k-means clustering to produce a very compact data structure that allows 
+    accurate estimation of quantiles. This t-Digest data structure can be used 
     to estimate quantiles, compute other rank statistics or even to estimate 
-    related measures like trimmed means. The advantage of the 't-digest' over 
-    previous digests for this purpose is that the 't-digest' handles data with 
-    full floating point resolution. With small changes, the 't-digest' can handle 
-    values from any ordered set for which we can compute something akin to a mean. 
-    The accuracy of quantile estimates produced by 't-digests' can be orders of 
-    magnitude more accurate than those produced by previous digest algorithms.
+    related measures like trimmed means. The advantage of the t-Digest over 
+    previous digests for this purpose is that the t-Digest handles data with 
+    full floating point resolution. The accuracy of quantile estimates produced 
+    by t-Digests can be orders of magnitude more accurate than those produced 
+    by previous digest algorithms. Methods are provided to create and update 
+    t-Digests and retrieve quantiles from the accumulated distributions.
 URL: https://gitlab.com/hrbrmstr/tdigest
 BugReports: https://gitlab.com/hrbrmstr/tdigest/issues
 Copyright: file inst/COPYRIGHTS

diff --git a/R/create.R b/R/create.R
@@ -1,31 +1,31 @@
-#' Create a new t-digest histogram from a vector
+#' Create a new t-Digest histogram from a vector
 #'
-#' The t-digest construction algorithm uses a variant of 1-dimensional
+#' The t-Digest construction algorithm, by Dunning et al., uses a variant of 1-dimensional
 #' k-means clustering to produce a very compact data structure that allows
-#' accurate estimation of quantiles. This t-digest data structure can be used
+#' accurate estimation of quantiles. This t-Digest data structure can be used
 #' to estimate quantiles, compute other rank statistics or even to estimate
-#' related measures like trimmed means. The advantage of the t-digest over
-#' previous digests for this purpose is that the t-digest handles data with
-#' full floating point resolution. With small changes, the t-digest can handle
-#' values from any ordered set for which we can compute something akin to a mean.
-#' The accuracy of quantile estimates produced by t-digests can be orders of
-#' magnitude more accurate than those produced by previous digest algorithms.
+#' related measures like trimmed means. The advantage of the t-Digest over
+#' previous digests for this purpose is that the t-Digest handles data with
+#' full floating point resolution. The accuracy of quantile estimates produced
+#' by t-Digests can be orders of magnitude more accurate than those produced
+#' by previous digest algorithms. Methods are provided to create and update
+#' t-Digests and retrieve quantiles from the accumulated distributions.
 #'
-#' @param vec vector (will be converted to `double` if not already double). NOTE that this
-#'        is ALTREP-aware and will not materialize the passed-in object in order to
-#'        add the values to the t-Digest.
+#' @param vec vector (will be converted to `double` if not already double).
+#'        NOTE that this is ALTREP-aware and will not materialize the passed-in
+#'        object in order to add the values to the t-Digest.
 #' @param compression the input compression value; should be >= 1.0; this
-#'        will control how aggressively the TDigest compresses data together.
+#'        will control how aggressively the t-Digest compresses data together.
 #'        The original t-Digest paper suggests using a value of 100 for a good
 #'        balance between precision and efficiency. It will land at very small
 #'        (think like 1e-6 percentile points) errors at extreme points in the
 #'        distribution, and compression ratios of around 500 for large data sets
 #'        (~1 million datapoints). Defaults to 100.
-#' @export
-#' @return a tdigest object
-#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
+#' @return a `tdigest` object
+#' @references [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023)
 #' @importFrom stats quantile
 #' @useDynLib tdigest, .registration = TRUE
+#' @export
 #' @examples
 #' set.seed(1492)
 #' x <- sample(0:100, 1000000, replace = TRUE)
@@ -37,13 +37,13 @@ tdigest <- function(vec, compression=100) {
   .Call("Rtdig", vec=vec, compression=compression)
 }
 
-#' Calculate sample quantiles from a t-digest
+#' Calculate sample quantiles from a t-Digest
 #'
-#' @param td t-digest object
+#' @param td t-Digest object
 #' @param probs numeric vector of probabilities with values in range 0:1
 #' @export
-#' @return a numeric vector
-#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
+#' @return a `numeric` vector containing the requested quantile values
+#' @references <https://raw.githubusercontent.com/tdunning/t-Digest/master/docs/t-Digest-paper/histo.pdf>
 #' @examples
 #' set.seed(1492)
 #' x <- sample(0:100, 1000000, replace = TRUE)
@@ -66,7 +66,7 @@ quantile.tdigest <- function(x, probs = seq(0, 1, 0.25), ...) {
 }
 
 #' @rdname tdigest
-#' @param x t-tigest object
+#' @param x `tdigest` object
 #' @param ... unused
 #' @keywords internal
 #' @export
@@ -87,26 +87,28 @@ print.tdigest <- function(x, ...) {
 #' Allocate a new histogram
 #'
 #' @param compression the input compression value; should be >= 1.0; this
-#'        will control how aggressively the TDigest compresses data together.
+#'        will control how aggressively the t-Digest compresses data together.
 #'        The original t-Digest paper suggests using a value of 100 for a good
 #'        balance between precision and efficiency. It will land at very small
 #'        (think like 1e-6 percentile points) errors at extreme points in the
 #'        distribution, and compression ratios of around 500 for large data sets
 #'        (~1 million datapoints). Defaults to 100.
 #' @export
-#' @return a tdigest object
-#' @references <https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf>
+#' @return a `tdigest` object
+#' @references <https://raw.githubusercontent.com/tdunning/t-Digest/master/docs/t-Digest-paper/histo.pdf>
 #' @examples
 #' td <- td_create(10)
 td_create <- function(compression=100) {
+  stopifnot(compression >= 1.0)
   compression <- as.double(compression[1])
   .Call("Rtd_create", compression=compression, PACKAGE="tdigest")
 }
 
-#' Total items contained in the t-digest
+#' Total items contained in the t-Digest
 #'
-#' @param td t-digest object
+#' @param td t-Digest object
 #' @export
+#' @return `double` containing the size of the t-Digest
 #' @examples
 #' td <- td_create(10)
 #' td_add(td, 0, 1)
@@ -118,11 +120,12 @@ td_total_count <- function(td) {
   .Call("Rtd_total_count", td=td, PACKAGE="tdigest")
 }
 
-#' Add a value to the t-digest with the specified count
+#' Add a value to the t-Digest with the specified count
 #'
-#' @param td t-digest object
+#' @param td t-Digest object
 #' @param val value
 #' @param count count
+#' @return the original, updated `tdigest` object
 #' @export
 #' @examples
 #' td <- td_create(10)
@@ -138,9 +141,10 @@ td_add <- function(td, val, count) {
 
 #' Return the value at the specified quantile
 #'
-#' @param td t-digest object
+#' @param td t-Digest object
 #' @param q quantile (range 0:1)
 #' @export
+#' @return the computed quantile (`double`)
 #' @examples
 #' td <- td_create(10)
 #'
@@ -160,8 +164,9 @@ td_value_at <- function(td, q) {
 
 #' Return the quantile of the value
 #'
-#' @param td t-digest object
+#' @param td t-Digest object
 #' @param val value
+#' @return the computed quantile (`double`)
 #' @export
 td_quantile_of <- function(td, val) {
   stopifnot(inherits(td, "tdigest"))
@@ -170,11 +175,11 @@ td_quantile_of <- function(td, val) {
   .Call("Rtd_quantile_of", tdig=td, val=val, PACKAGE="tdigest")
 }
 
-#' Merge one t-digest into another
+#' Merge one t-Digest into another
 #'
-#' @param from,into t-digests
+#' @param from,into t-Digests
 #' @return `into`
-#' @return a tdigest object
+#' @return a `tdigest` object
 #' @export
 td_merge <- function(from, into) {
   stopifnot(inherits(from, "tdigest"))

diff --git a/README.Rmd b/README.Rmd
@@ -19,9 +19,18 @@ Wicked Fast, Accurate Quantiles Using 't-Digests'
 
 ## Description
 
-The t-digest construction algorithm uses a variant of 1-dimensional k-means clustering to produce a very compact data structure that allows accurate estimation of quantiles. This t-digest data structure can be used to estimate quantiles, compute other rank statistics or even to estimate related measures like trimmed means. The advantage of the t-digest over previous digests for this purpose is that the t-digest handles data with full floating point resolution. With small changes, the t-digest can handle values from any ordered set for which we can compute something akin to a mean. The accuracy of quantile estimates produced by t-digests can be orders of magnitude more accurate than those produced by previous digest algorithms.
-
-See [the original paper by Ted Dunning](https://raw.githubusercontent.com/tdunning/t-digest/master/docs/t-digest-paper/histo.pdf) for more details on t-Digests.
+The t-Digest construction algorithm uses a variant of 1-dimensional 
+k-means clustering to produce a very compact data structure that allows 
+accurate estimation of quantiles. This t-Digest data structure can be used 
+to estimate quantiles, compute other rank statistics or even to estimate 
+related measures like trimmed means. The advantage of the t-Digest over 
+previous digests for this purpose is that the t-Digest handles data with 
+full floating point resolution. The accuracy of quantile estimates produced 
+by t-Digests can be orders of magnitude more accurate than those produced 
+by previous digest algorithms. Methods are provided to create and update 
+t-Digests and retrieve quantiles from the accumulated distributions.
+
+See [the original paper by Ted Dunning & Otmar Ertl](https://arxiv.org/abs/1902.04023) for more details on t-Digests.
 
 ## What's Inside The Tin
 

diff --git a/inst/COPYRIGHTS b/inst/COPYRIGHTS
@@ -1,8 +1,15 @@
-The R code and src/tdigest-main.c, src/init.c are MIT-licensed by the package author.
+The R code and src/tdigest-main.c, src/init.c are MIT-licensed
+by the package author.
 
-src/tdigest.h, src/tdigest.c are MIT-licensed & Copyright (c) 2018 ajwerner [REF: https://github.com/ajwerner/tdigestc; license below copied from that repository]
+src/tdigest.h, src/tdigest.c are MIT-licensed & Copyright (c) 2018
+Andrew Werner [REF: https://github.com/ajwerner/tdigestc; license below
+copied from that repository]
 
-The original t-Digest implementation and algorithm are  have the following license:
+The t-Digest original algorithm and Java implementation
+(which is the base of Andrew Werner's re-implementation) is
+Copyright (c) 2015 Ted Dunning and uses the same Apache License 2.0, below.
+
+------------
 
                                 Apache License
                            Version 2.0, January 2004

diff --git a/man/td_add.Rd b/man/td_add.Rd
diff --git a/man/td_create.Rd b/man/td_create.Rd
diff --git a/man/td_merge.Rd b/man/td_merge.Rd
diff --git a/man/td_quantile_of.Rd b/man/td_quantile_of.Rd
diff --git a/man/td_total_count.Rd b/man/td_total_count.Rd
diff --git a/man/td_value_at.Rd b/man/td_value_at.Rd