
feat(workflow_engine): Add can_process_data_packet method to StatefulDetector #89953


Open

wants to merge 1 commit into master

Conversation

wedamija
Member

@wedamija wedamija commented Apr 18, 2025

We have various preconditions that are enforced by the uptime results consumer, like https://github.com/getsentry/sentry/blob/d86b8323384ecbdfae7167c9374e31da6eb7fdda/src/sentry/uptime/consumers/results_consumer.py#L207-L211

We need a way to include these as part of the abstraction. Adding this now, and will implement it for `UptimeDetector`.

We might also need `can_process_data_packet_group` for any checks that are group specific. We could also put the de-dupe checks there, in the base class.
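
For illustration, here's a rough sketch of how such a hook could hang together. This is a simplified stand-in rather than the actual code in `stateful.py`; the `DataPacket` fields, the `process_data_packet` wiring, and the `UptimeDetector` override below are assumptions, not the real workflow_engine interfaces.

```python
# Simplified sketch only -- not the real StatefulDetector in
# src/sentry/workflow_engine/handlers/detector/stateful.py.
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class DataPacket(Generic[T]):
    source_id: str
    packet: T


class StatefulDetector(Generic[T]):
    def can_process_data_packet(self, data_packet: DataPacket[T]) -> bool:
        # Default: every packet is eligible. Subclasses override this to
        # enforce product-specific preconditions before evaluation runs.
        return True

    def process_data_packet(self, data_packet: DataPacket[T]) -> None:
        if not self.can_process_data_packet(data_packet):
            # Skip evaluation entirely, mirroring the consumer's early return.
            return
        self.evaluate(data_packet)

    def evaluate(self, data_packet: DataPacket[T]) -> None:
        print(f"evaluating packet from {data_packet.source_id}")


class UptimeDetector(StatefulDetector[dict]):
    def can_process_data_packet(self, data_packet: DataPacket[dict]) -> bool:
        # e.g. only process results for regions we're actively checking,
        # in the spirit of check_and_update_regions in the results consumer.
        return data_packet.packet.get("region") in {"us-east", "eu-west"}


if __name__ == "__main__":
    detector = UptimeDetector()
    detector.process_data_packet(DataPacket("sub-1", {"region": "us-east"}))   # evaluated
    detector.process_data_packet(DataPacket("sub-2", {"region": "ap-south"}))  # skipped by the precondition
```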

@wedamija wedamija requested a review from a team April 18, 2025 22:20
@wedamija wedamija requested review from a team as code owners April 18, 2025 22:20
@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 18, 2025
@wedamija
Member Author

@saponifi3d I know you've been talking about making things more composition-based - I think it'd be easy enough to switch this over in that direction when we get closer to it.

@saponifi3d
Contributor

saponifi3d commented Apr 18, 2025

Could we just wait on invoking the `process_detector` method in the uptime code until it's ready to be processed instead? I'm not quite sure why we want this to live on a detector vs in the product logic -- mind walking through an example of it? (I clicked on that link and 😵'd -- EOD Friday haha)

@wedamija
Member Author

> Could we just wait on invoking the `process_detector` method in the uptime code until it's ready to be processed instead? I'm not quite sure why we want this to live on a detector vs in the product logic -- mind walking through an example of it?

I hadn't thought about that - I guess I assumed we want to move most logic into the detector. If that's not the case, I can just focus on moving this function into the detector, which is basically where all the threshold detection for firing an incident happens:

```python
def handle_result_for_project_active_mode(
    self,
    project_subscription: ProjectUptimeSubscription,
    result: CheckResult,
    metric_tags: dict[str, str],
):
    redis = _get_cluster()
    delete_status = (
        CHECKSTATUS_FAILURE if result["status"] == CHECKSTATUS_SUCCESS else CHECKSTATUS_SUCCESS
    )
    # Delete any consecutive results we have for the opposing status, since we received this status
    redis.delete(build_active_consecutive_status_key(project_subscription, delete_status))
    if (
        project_subscription.uptime_status == UptimeStatus.OK
        and result["status"] == CHECKSTATUS_FAILURE
    ):
        if not self.has_reached_status_threshold(
            project_subscription, result["status"], metric_tags
        ):
            return
        issue_creation_flag_enabled = features.has(
            "organizations:uptime-create-issues",
            project_subscription.project.organization,
        )
        # Do not create uptime issue occurrences for restricted host providers
        restricted_host_provider_ids = options.get(
            "uptime.restrict-issue-creation-by-hosting-provider-id"
        )
        host_provider_id = project_subscription.uptime_subscription.host_provider_id
        issue_creation_restricted_by_provider = host_provider_id in restricted_host_provider_ids
        if issue_creation_restricted_by_provider:
            metrics.incr(
                "uptime.result_processor.restricted_by_provider",
                sample_rate=1.0,
                tags={
                    "host_provider_id": host_provider_id,
                    **metric_tags,
                },
            )
        if issue_creation_flag_enabled and not issue_creation_restricted_by_provider:
            create_issue_platform_occurrence(result, project_subscription)
            metrics.incr(
                "uptime.result_processor.active.sent_occurrence",
                tags=metric_tags,
                sample_rate=1.0,
            )
            logger.info(
                "uptime_active_sent_occurrence",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
        # TODO(epurkhiser): Dual until we're only reading the uptime_status
        # from the uptime_subscription.
        project_subscription.update(
            uptime_status=UptimeStatus.FAILED, uptime_status_update_date=django_timezone.now()
        )
        project_subscription.uptime_subscription.update(
            uptime_status=UptimeStatus.FAILED, uptime_status_update_date=django_timezone.now()
        )
    elif (
        project_subscription.uptime_status == UptimeStatus.FAILED
        and result["status"] == CHECKSTATUS_SUCCESS
    ):
        if not self.has_reached_status_threshold(
            project_subscription, result["status"], metric_tags
        ):
            return
        if features.has(
            "organizations:uptime-create-issues", project_subscription.project.organization
        ):
            resolve_uptime_issue(project_subscription)
            metrics.incr(
                "uptime.result_processor.active.resolved",
                sample_rate=1.0,
                tags=metric_tags,
            )
            logger.info(
                "uptime_active_resolved",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
        # TODO(epurkhiser): Dual until we're only reading the uptime_status
        # from the uptime_subscription.
        project_subscription.update(
            uptime_status=UptimeStatus.OK, uptime_status_update_date=django_timezone.now()
        )
        project_subscription.uptime_subscription.update(
            uptime_status=UptimeStatus.OK, uptime_status_update_date=django_timezone.now()
        )
```

Our consumer has a separate branch of code that does detector-like things:

```python
def handle_result_for_project_auto_onboarding_mode(
    self,
    project_subscription: ProjectUptimeSubscription,
    result: CheckResult,
    metric_tags: dict[str, str],
):
    if result["status"] == CHECKSTATUS_FAILURE:
        redis = _get_cluster()
        key = build_onboarding_failure_key(project_subscription)
        pipeline = redis.pipeline()
        pipeline.incr(key)
        pipeline.expire(key, ONBOARDING_FAILURE_REDIS_TTL)
        failure_count = pipeline.execute()[0]
        if failure_count >= ONBOARDING_FAILURE_THRESHOLD:
            # If we've hit too many failures during the onboarding period we stop monitoring
            delete_project_uptime_subscription(project_subscription)
            # Mark the url as failed so that we don't attempt to auto-detect it for a while
            set_failed_url(project_subscription.uptime_subscription.url)
            redis.delete(key)
            status_reason = "unknown"
            if result["status_reason"]:
                status_reason = result["status_reason"]["type"]
            metrics.incr(
                "uptime.result_processor.autodetection.failed_onboarding",
                tags={"failure_reason": status_reason, **metric_tags},
                sample_rate=1.0,
            )
            logger.info(
                "uptime_onboarding_failed",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
    elif result["status"] == CHECKSTATUS_SUCCESS:
        assert project_subscription.date_added is not None
        scheduled_check_time = datetime.fromtimestamp(
            result["scheduled_check_time_ms"] / 1000, timezone.utc
        )
        if scheduled_check_time - ONBOARDING_MONITOR_PERIOD > project_subscription.date_added:
            # If we've had mostly successes throughout the onboarding period then we can
            # graduate the subscription to active.
            update_project_uptime_subscription(
                project_subscription,
                interval_seconds=int(
                    AUTO_DETECTED_ACTIVE_SUBSCRIPTION_INTERVAL.total_seconds()
                ),
                mode=ProjectUptimeSubscriptionMode.AUTO_DETECTED_ACTIVE,
            )
            metrics.incr(
                "uptime.result_processor.autodetection.graduated_onboarding",
                sample_rate=1.0,
                tags=metric_tags,
            )
            logger.info(
                "uptime_onboarding_graduated",
                extra={
                    "project_id": project_subscription.project_id,
                    "url": project_subscription.uptime_subscription.url,
                    **result,
                },
            )
```

The point of this is that it validates thresholds the same way a detector would, but we never create an occurrence. Instead, it monitors a URL for a while before we either move it into active mode or delete it. We would need access to the thresholds from the Detector here, but we could apply that logic outside of the detector.
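
To make that last point concrete, here's a rough sketch of what "read the thresholds from the Detector but apply the logic outside of it" could look like. None of these names (`UptimeDetectorConfig`, `failure_threshold`, `OnboardingResultHandler`) exist in the uptime code today; they're illustrative assumptions.

```python
# Hypothetical sketch: the detector owns its threshold configuration, and the
# onboarding path reuses that number without ever asking the detector to
# create an occurrence.
from dataclasses import dataclass


@dataclass
class UptimeDetectorConfig:
    # Consecutive failures required before the detector would fire an incident.
    failure_threshold: int = 3


class OnboardingResultHandler:
    """Counts onboarding failures against the detector's threshold, but only
    decides whether to keep monitoring the URL or drop the subscription."""

    def __init__(self, config: UptimeDetectorConfig):
        self.config = config
        self.consecutive_failures = 0

    def handle_failure(self) -> str:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.config.failure_threshold:
            # Same threshold the detector would use to open an incident,
            # but here we just stop monitoring instead of creating an occurrence.
            return "delete_subscription"
        return "keep_monitoring"


if __name__ == "__main__":
    handler = OnboardingResultHandler(UptimeDetectorConfig(failure_threshold=3))
    print([handler.handle_failure() for _ in range(3)])
    # ['keep_monitoring', 'keep_monitoring', 'delete_subscription']
```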


codecov bot commented Apr 18, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ntry/workflow_engine/handlers/detector/stateful.py | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master   #89953   +/-   ##
=======================================
  Coverage   87.64%   87.64%
=======================================
  Files       10250    10250
  Lines      577029   577034    +5
  Branches    22719    22719
=======================================
+ Hits       505725   505740   +15
+ Misses      70861    70851   -10
  Partials      443      443
```

Member

@ceorourke ceorourke left a comment


Change makes sense to me -- is there a reason this sat? Like, was the idea abandoned?
