Osquerybeat: Long running queries timeout at RPC level, failing subsequent queries. #36622
Comments
I’ve put together sample code using osquery and the osquery-go client library that reproduces the problem with long-running queries.
We probably would have to do two things:
I forwarded this to the osquery/osquery-go developers to get their feedback on this issue.
Stumbled upon one more caveat when reproducing this issue with longer intervals. It looks like intervals longer than 16 seconds can't be used with osquery. You still get the result row, though, just without the response_code, bytes and round_trip_time fields set. This is not relevant to the original issue; it only limits this approach for testing the current behavior of osquery-go -> osquery.
Did more testing of osquery-go/osquery behavior with long-running queries. There are a couple more things worth mentioning here for future generations.
It might make sense to improve visibility for users. Currently, for each query, osquerybeat captures the number of results returned as part of the fleet action result document. It would be nice to also capture the query execution time and possibly CPU and memory utilization. Maybe we could surface some per-query stats in the UI, based on historical data, to inform users of potentially expensive queries.
Description
A user reported an issue with running live queries after executing the pre-packaged Windows query
unsigned_dlls_on_system_folders_vt_windows_elastic
The subsequent queries fail with errors:
The user reported that this issue was reproducible with version 8.8.2 of the agent as well as with 8.10.x.
In some cases users reported that they could run other queries on other tables afterwards; in other cases an osquerybeat/agent restart was required before osquery recovered and could process queries again.
Preliminary research
This specific query is fairly expensive. When the user was asked to run the query with the osquery shell on that same machine, it took 3 minutes to complete. The current default timeout at the osquery RPC level is 1 minute: https://github.com/elastic/beats/blob/main/x-pack/osquerybeat/beater/osquerybeat.go#L47. As far as I remember, this is applied to all I/O operations at the thrift RPC level.
Pending further investigation the current theory is:
The unsigned_dlls_on_system_folders_vt_windows_elastic query request is sent from osquerybeat to osqueryd over thrift RPC (using the https://github.com/osquery/osquery-go lib); see https://github.com/apache/thrift/blob/master/lib/go/thrift/client.go#L65.
The overlapping requests/responses result in subsequent errors.
I reached out to the osquery developers to see if they have any advice. As far as I remember, the timeout at the RPC level currently doesn't stop the query execution.
Possible ways to address the issue