
feat(workflow_engine): Add can_process_data_packet method to StatefulDetector #89953


Open

wants to merge 1 commit into master

Conversation

wedamija
Member

@wedamija wedamija commented Apr 18, 2025

We have various preconditions that are enforced by the uptime results consumer, like https://github.com/getsentry/sentry/blob/d86b8323384ecbdfae7167c9374e31da6eb7fdda/src/sentry/uptime/consumers/results_consumer.py#L207-L211

We need a way to include these as part of the abstraction. Adding this now, and will implement it for `UptimeDetector`.

We might also need `can_process_data_packet_group` for any checks that are group specific. We could also put the de-dupe checks there, in the base class.
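
For illustration, here's a rough sketch of how such a hook could hang together. This is a simplified stand-in rather than the actual code in `stateful.py`; the `DataPacket` fields, the `process_data_packet` wiring, and the `UptimeDetector` override below are assumptions, not the real workflow_engine interfaces.

```python
# Simplified sketch only -- not the real StatefulDetector in
# src/sentry/workflow_engine/handlers/detector/stateful.py.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class DataPacket(Generic[T]):
    source_id: str
    packet: T


class StatefulDetector(Generic[T]):
    def can_process_data_packet(self, data_packet: DataPacket[T]) -> bool:
        # Default: every packet is eligible. Subclasses override this to
        # enforce product-specific preconditions before evaluation runs.
        return True

    def process_data_packet(self, data_packet: DataPacket[T]) -> None:
        if not self.can_process_data_packet(data_packet):
            # Skip evaluation entirely, mirroring the consumer's early return.
            return
        self.evaluate(data_packet)

    def evaluate(self, data_packet: DataPacket[T]) -> None:
        print(f"evaluating packet from {data_packet.source_id}")


class UptimeDetector(StatefulDetector[dict]):
    def can_process_data_packet(self, data_packet: DataPacket[dict]) -> bool:
        # e.g. only process results for regions we're actively checking,
        # in the spirit of check_and_update_regions in the results consumer.
        return data_packet.packet.get("region") in {"us-east", "eu-west"}


if __name__ == "__main__":
    detector = UptimeDetector()
    detector.process_data_packet(DataPacket("sub-1", {"region": "us-east"}))   # evaluated
    detector.process_data_packet(DataPacket("sub-2", {"region": "ap-south"}))  # skipped by the precondition
```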

@wedamija wedamija requested a review from a team April 18, 2025 22:20
@wedamija wedamija requested review from a team as code owners April 18, 2025 22:20
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 18, 2025
@wedamija
Member Author

@saponifi3d I know you've been talking about making things more composition-based - I think it'd be easy enough to switch this over in that direction when we get closer to it.

@saponifi3d
Contributor

saponifi3d commented Apr 18, 2025

Could we just wait on invoking the `process_detector` method in the uptime code until it's ready to be processed instead? I'm not quite sure why we want this to live on a detector vs in the product logic -- mind walking through an example of it? (I clicked on that link and 😵'd -- EOD Friday haha)

@wedamija
Member Author

> Could we just wait on invoking the `process_detector` method in the uptime code until it's ready to be processed instead? I'm not quite sure why we want this to live on a detector vs in the product logic -- mind walking through an example of it?

I hadn't thought about that - I guess I assumed we want to move most logic into the detector. If that's not the case, I can just focus on moving this function into the detector, which is basically where all the threshold detection for firing an incident happens:

```python
def handle_result_for_project_active_mode(
    self,
    project_subscription: ProjectUptimeSubscription,
    result: CheckResult,
    metric_tags: dict[str, str],
):
    redis = _get_cluster()
    delete_status = (
        CHECKSTATUS_FAILURE if result["status"] == CHECKSTATUS_SUCCESS else CHECKSTATUS_SUCCESS
    )
    # Delete any consecutive results we have for the opposing status, since we received this status
    redis.delete(build_active_consecutive_status_key(project_subscription, delete_status))
    if (
        project_subscription.uptime_status == UptimeStatus.OK
        and result["status"] == CHECKSTATUS_FAILURE
    ):
        if not self.has_reached_status_threshold(
            project_subscription, result["status"], metric_tags
        ):
            return
        issue_creation_flag_enabled = features.has(
            "organizations:uptime-create-issues",
            project_subscription.project.organization,
        )
        # Do not create uptime issue occurrences for restricted host providers
        restricted_host_provider_ids = options.get(
            "uptime.restrict-issue-creation-by-hosting-provider-id"
        )
        host_provider_id = project_subscription.uptime_subscription.host_provider_id
        issue_creation_restricted_by_provider = host_provider_id in restricted_host_provider_ids
        if issue_creation_restricted_by_provider:
            metrics.incr(
                "uptime.result_processor.restricted_by_provider",
                sample_rate=1.0,
                tags={
                    "host_provider_id": host_provider_id,
                    **metric_tags,
                },
            )
        if issue_creation_flag_enabled and not issue_creation_restricted_by_provider:
            create_issue_platform_occurrence(result, project_subscription)
            metrics.incr(
                "uptime.result_processor.active.sent_occurrence",
                tags=metric_tags,
                sample_rate=1.0,
            )
            logger.info(
                "uptime_active_sent_occurrence",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
        # TODO(epurkhiser): Dual until we're only reading the uptime_status
        # from the uptime_subscription.
        project_subscription.update(
            uptime_status=UptimeStatus.FAILED, uptime_status_update_date=django_timezone.now()
        )
        project_subscription.uptime_subscription.update(
            uptime_status=UptimeStatus.FAILED, uptime_status_update_date=django_timezone.now()
        )
    elif (
        project_subscription.uptime_status == UptimeStatus.FAILED
        and result["status"] == CHECKSTATUS_SUCCESS
    ):
        if not self.has_reached_status_threshold(
            project_subscription, result["status"], metric_tags
        ):
            return
        if features.has(
            "organizations:uptime-create-issues", project_subscription.project.organization
        ):
            resolve_uptime_issue(project_subscription)
            metrics.incr(
                "uptime.result_processor.active.resolved",
                sample_rate=1.0,
                tags=metric_tags,
            )
            logger.info(
                "uptime_active_resolved",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
        # TODO(epurkhiser): Dual until we're only reading the uptime_status
        # from the uptime_subscription.
        project_subscription.update(
            uptime_status=UptimeStatus.OK, uptime_status_update_date=django_timezone.now()
        )
        project_subscription.uptime_subscription.update(
            uptime_status=UptimeStatus.OK, uptime_status_update_date=django_timezone.now()
        )
```

Our consumer has a separate branch of code that does detector-like things:

```python
def handle_result_for_project_auto_onboarding_mode(
    self,
    project_subscription: ProjectUptimeSubscription,
    result: CheckResult,
    metric_tags: dict[str, str],
):
    if result["status"] == CHECKSTATUS_FAILURE:
        redis = _get_cluster()
        key = build_onboarding_failure_key(project_subscription)
        pipeline = redis.pipeline()
        pipeline.incr(key)
        pipeline.expire(key, ONBOARDING_FAILURE_REDIS_TTL)
        failure_count = pipeline.execute()[0]
        if failure_count >= ONBOARDING_FAILURE_THRESHOLD:
            # If we've hit too many failures during the onboarding period we stop monitoring
            delete_project_uptime_subscription(project_subscription)
            # Mark the url as failed so that we don't attempt to auto-detect it for a while
            set_failed_url(project_subscription.uptime_subscription.url)
            redis.delete(key)
            status_reason = "unknown"
            if result["status_reason"]:
                status_reason = result["status_reason"]["type"]
            metrics.incr(
                "uptime.result_processor.autodetection.failed_onboarding",
                tags={"failure_reason": status_reason, **metric_tags},
                sample_rate=1.0,
            )
            logger.info(
                "uptime_onboarding_failed",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
    elif result["status"] == CHECKSTATUS_SUCCESS:
        assert project_subscription.date_added is not None
        scheduled_check_time = datetime.fromtimestamp(
            result["scheduled_check_time_ms"] / 1000, timezone.utc
        )
        if scheduled_check_time - ONBOARDING_MONITOR_PERIOD > project_subscription.date_added:
            # If we've had mostly successes throughout the onboarding period then we can
            # graduate the subscription to active.
            update_project_uptime_subscription(
                project_subscription,
                interval_seconds=int(
                    AUTO_DETECTED_ACTIVE_SUBSCRIPTION_INTERVAL.total_seconds()
                ),
                mode=ProjectUptimeSubscriptionMode.AUTO_DETECTED_ACTIVE,
            )
            metrics.incr(
                "uptime.result_processor.autodetection.graduated_onboarding",
                sample_rate=1.0,
                tags=metric_tags,
            )
            logger.info(
                "uptime_onboarding_graduated",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
```

The point of this is that it validates thresholds the same way a detector would, but we never create an occurrence. Instead, it monitors a URL for a while before we either move it into active mode or delete it. We would need access to the thresholds from the Detector here, but we could apply that logic outside of the detector.
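
To make that last point concrete, here's a rough sketch of what "read the thresholds from the Detector but apply the logic outside of it" could look like. None of these names (`UptimeDetectorConfig`, `failure_threshold`, `OnboardingResultHandler`) exist in the uptime code today; they're illustrative assumptions.

```python
# Hypothetical sketch: the detector owns its threshold configuration, and the
# onboarding path reuses that number without ever asking the detector to
# create an occurrence.
from dataclasses import dataclass


@dataclass
class UptimeDetectorConfig:
    # Consecutive failures required before the detector would fire an incident.
    failure_threshold: int = 3


class OnboardingResultHandler:
    """Counts onboarding failures against the detector's threshold, but only
    decides whether to keep monitoring the URL or drop the subscription."""

    def __init__(self, config: UptimeDetectorConfig):
        self.config = config
        self.consecutive_failures = 0

    def handle_failure(self) -> str:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.config.failure_threshold:
            # Same threshold the detector would use to open an incident,
            # but here we just stop monitoring instead of creating an occurrence.
            return "delete_subscription"
        return "keep_monitoring"


if __name__ == "__main__":
    handler = OnboardingResultHandler(UptimeDetectorConfig(failure_threshold=3))
    print([handler.handle_failure() for _ in range(3)])
    # ['keep_monitoring', 'keep_monitoring', 'delete_subscription']
```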


codecov bot commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ntry/workflow_engine/handlers/detector/stateful.py | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master   #89953   +/-   ##
=======================================
  Coverage   87.64%   87.64%
=======================================
  Files       10250    10250
  Lines      577029   577034    +5
  Branches    22719    22719
=======================================
+ Hits       505725   505740   +15
+ Misses      70861    70851   -10
  Partials      443      443
```

Member

@ceorourke ceorourke left a comment


Change makes sense to me -- is there a reason this sat? Like, was the idea abandoned?
