This repository has been archived by the owner on Aug 14, 2024. It is now read-only.

Update transaction clustering docs (#1047)
jjbayer authored Oct 6, 2023
1 parent 6e688cf commit a479e65
Showing 1 changed file with 2 additions and 15 deletions.
17 changes: 2 additions & 15 deletions src/docs/transaction-clustering/index.mdx
@@ -135,22 +135,9 @@ Nodes with large weights would then be encoded into the replacement rules as exc

 We decided against this because encoding exceptions into rules would bloat project configs on the wire and in Relay.

-### False Positives
+### Identifiers not scrubbed

-The discovery of replacement rules is a best-effort approach: no matter how many rules the clusterer discovers, a project can always
+The discovery of replacement rules is a best-effort approach: no matter how many rules the clusterer discovers, it might not detect all patterns. A project can always
 introduce a new feature that brings more high-cardinality transactions, and it takes time until the clusterer discovers a new rule for those.

-At the same time, the algorithm is blind to low-cardinality transactions that do not contain identifiers at all. For example, if a transaction
-like `/settings` has type `url`, neither the pattern-based nor the rule-based approach detect any identifiers.
-
-In order to prevent these false negatives, as of [this PR](https://github.com/getsentry/relay/pull/1960) we mark _every_ URL transaction as low-cardinality as long as there
-is _some_ scrubbing rule (even if it does not match), or we found an identifier pattern. In other words, we sacrifice [precision](https://en.wikipedia.org/wiki/Precision_and_recall) for the sake of [recall](https://en.wikipedia.org/wiki/Precision_and_recall).
-
-| category | description |
-| -------------- | ------------------------------------------------------------------------------------ |
-| true positive | We scrubbed all identifiers (if any) and label the transaction as `sanitized` |
-| false positive | We miss an identifier, but still label as `sanitized` |
-| true negative | We keep the transaction labeled as `url` and it contains identifiers |
-| false negative | We keep the transaction labeled as `url` even though it does not contain identifiers |
-
 The consequence of this is again potentially high cardinality in our metrics ingestion and storage, up to the point where we might hit the [cardinality limiter](https://github.com/getsentry/sentry/blob/9af20330c84b971882be9837d4e43e148af5a126/src/sentry/ratelimits/cardinality.py#L93-L95).

1 comment on commit a479e65

@vercel vercel bot commented on a479e65 Oct 6, 2023

Successfully deployed to the following URLs:

develop – ./

develop.sentry.dev
develop-git-master.sentry.dev
