Online deployment bugs #301
Great catches (and only discoverable thanks to the verbose logging). Curious about a couple of things:
The scitoken validation is something that's done by gracedb, unfortunately. There seems to be some caching, as the subsequent validations take less time, but scitokens expire after an hour (and I'm generating a new one every half-hour to be safe), so the full validation needs to happen relatively often. In a non-MDC setting, I'd imagine it would need to happen every time. Yeah, we need to wait
Ahh got it, that adds up. It's great how easily we're able to diagnose these bottlenecks.
As I think about it some more, it's kind of strange that the scitoken validation should take so long. @mcoughlin, is this something that the low latency group is aware of? Maybe we're doing something wrong on our end? Adding 10 seconds of latency to perform authorization seems not great.
This 10 seconds obviously doesn't happen every time, considering we have many events with sub-10-second latency. Are there any hints about the conditions under which this added latency occurs?
Not that I can see right off the bat, but let me scrape the logs and see if there are any patterns to it.
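One low-tech way to look for patterns, assuming the log lines begin with ISO-format timestamps (an assumption about the log format, not something taken from the actual files), is to rank the gaps between consecutive lines:

```python
# Sketch only: rank the largest gaps between consecutive log lines to see
# where latency accumulates. The leading-timestamp format is an assumption,
# not taken from the actual deployment logs.
import re
from datetime import datetime

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:\.\d+)?)")

def largest_gaps(path, top=10):
    stamped = []
    with open(path) as f:
        for line in f:
            match = TIMESTAMP.match(line)
            if match is not None:
                stamped.append((datetime.fromisoformat(match.group(1)), line.strip()))

    # pair each line with the next one and compute the elapsed time between them
    gaps = [
        ((t1 - t0).total_seconds(), l0, l1)
        for (t0, l0), (t1, l1) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, reverse=True)[:top]

for gap, before, after in largest_gaps(
    "/home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T02:30:00.log"
):
    print(f"{gap:7.2f}s between:\n    {before}\n    {after}")
```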
Just as an example, for our lowest latency event, the validation steps seem to be the same, but happen much faster:
Ah, wait, at least some of the time here is coming from creating PE/p_astro. Those
Got it - probably worth doing this asynchronously now? |
Definitely. Still, in the second example in the top comment, there's a 10-second gap between "Submitting event" and the first scitoken validation line for the first event, and that's not coming from us (all that occurs is here). So I think it's also worth looking at how much time authorization is costing us.
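For what it's worth, here's a minimal sketch of what deferring the PE/p_astro uploads to a background worker could look like; `create_event`, `create_pe`, and `create_p_astro` are placeholder names, not the deployment's actual functions:

```python
# Sketch only: keep event creation synchronous (it gates latency), and push
# the slower PE/p_astro uploads onto a background worker so they no longer
# sit between detections. All function names here are placeholders.
import logging
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # single worker keeps uploads ordered

def submit(client, event):
    response = client.create_event(event)  # placeholder for the GraceDB submission
    graceid = response["graceid"]

    # defer the follow-up data products; a failure is logged instead of
    # blocking the next detection
    future = executor.submit(upload_followup, client, graceid, event)
    future.add_done_callback(_log_failure)
    return graceid

def upload_followup(client, graceid, event):
    client.create_pe(graceid, event)       # placeholder
    client.create_p_astro(graceid, event)  # placeholder

def _log_failure(future):
    exc = future.exception()
    if exc is not None:
        logging.error("follow-up upload failed", exc_info=exc)
```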
I'm compiling a list of bugs in our online deployment that I've found while looking at our MDC results. There have been only two so far, but I imagine others will be found.
1. The `check_refractory` function checks whether the current time is at least `refractory_period` seconds after the previous detection time, but it ought to compare the new detection time to the previous detection time. Otherwise, if event submission takes a while for some reason, we can have situations like the one below, where we submit two events that are essentially at the same time (taken from `/home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T02:30:00.log`); a sketch of the intended comparison follows this list.
2. The `reset_t0` function can reset `t0` to a time prior to the frame that failed. Coupled with this, we don't reset the snapshotter after missing a frame file. This means that it's possible for an event to be detected, a frame to be missed, `t0` reset to before the event, and the event to be detected again. For example, in `/home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T11:25:49.log`:
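To make the first bug concrete, here is a minimal sketch of the comparison `check_refractory` ought to make; the signature and names are illustrative rather than the deployment's actual API:

```python
# Sketch only: compare the new detection time against the previous
# detection time, rather than against the wall clock at submission.
def check_refractory(detection_time, last_detection_time, refractory_period):
    """Return True if the new detection may be submitted."""
    if last_detection_time is None:
        return True
    # Using time.time() here instead of detection_time is what lets two
    # essentially simultaneous events through when submission is slow.
    return detection_time - last_detection_time >= refractory_period
```

And one possible, purely illustrative guard for the second bug: never reset `t0` to earlier than the frame that failed, and reset the snapshotter at the same time so stale state can't re-trigger on an event that was already submitted:

```python
# Sketch only: when a frame file is missed, avoid rewinding t0 past the
# failed frame, and clear the snapshotter so buffered data from before the
# gap can't produce a second detection of the same event. Names and the
# snapshotter's reset hook are hypothetical, not the deployment's API.
def reset_t0(proposed_t0, failed_frame_start, snapshotter):
    new_t0 = max(proposed_t0, failed_frame_start)
    snapshotter.reset()  # hypothetical hook; clears state built before the gap
    return new_t0
```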