Is there any way to change the algorithm or approach to handle deletion of data points? I was reading through the algorithm, but I haven't yet gotten a sense of whether introducing deletion would affect its optimization.
Hey there ... sorry to be slow responding.
There are two issues with deletions:
a) There is no known way to ensure that the digest invariant is preserved. As a key example, if you deleted a bunch of data from the left half of a normal distribution, you would be left with a really big centroid on the left edge. You can try to guess how to split centroids, but the point of a centroid is that it loses information, and the point of the digest invariant is that this loss is non-critical for estimating tails. If you delete, you may have to split, and splitting accurately is impossible without that lost information. It is conceivable that you could keep a second digest containing the distribution of deleted points, but I don't think that preserves the accuracy that you want.
b) In practice, people keep digests of relatively short time periods (typically 5-minute intervals) and then combine these short intervals with a merge when necessary, rather than keeping long intervals and subtracting. This makes deletion much less necessary (a sketch of this windowed approach follows below).
So given that there is no known way to do it accurately and people don't
seem to need it, we haven't ever tried to do this.
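As a minimal sketch of the windowed approach in (b), assuming the Java t-digest library (com.tdunning.math.stats): the WindowedDigests class name, the 5-minute bucketing, and the compression value of 100 are illustrative choices, not part of the library. Merging at query time by re-adding each window's centroids is one simple way to combine digests.

```java
import com.tdunning.math.stats.Centroid;
import com.tdunning.math.stats.TDigest;

import java.util.Map;
import java.util.TreeMap;

// Keep one digest per 5-minute window instead of one long-lived digest.
// "Deleting" old data then means dropping whole windows; queries merge
// only the windows that are still of interest.
public class WindowedDigests {
    private static final long WINDOW_MILLIS = 5 * 60 * 1000L;
    private final Map<Long, TDigest> windows = new TreeMap<>();

    public void add(long timestampMillis, double value) {
        long bucket = timestampMillis / WINDOW_MILLIS;
        windows.computeIfAbsent(bucket, b -> TDigest.createMergingDigest(100))
               .add(value);
    }

    // Drop windows that end before the cutoff -- the only "deletion" needed.
    public void expireBefore(long cutoffMillis) {
        windows.keySet().removeIf(bucket -> (bucket + 1) * WINDOW_MILLIS <= cutoffMillis);
    }

    // Merge the surviving windows into a fresh digest and query it.
    public double quantile(double q) {
        TDigest merged = TDigest.createMergingDigest(100);
        for (TDigest d : windows.values()) {
            for (Centroid c : d.centroids()) {
                merged.add(c.mean(), c.count());
            }
        }
        return merged.quantile(q);
    }
}
```

Because merging only ever combines centroids (it never tries to split them), the accuracy guarantees of the digest invariant are preserved, which is exactly what subtraction cannot offer.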