
Osquerybeat: Long running queries timeout at RPC level, failing subsequent queries. #36622

Closed
aleksmaus opened this issue Sep 19, 2023 · 3 comments · Fixed by #36722

Comments

@aleksmaus
Member

aleksmaus commented Sep 19, 2023

Description

A user reported an issue with running live queries after executing the pre-packaged Windows query unsigned_dlls_on_system_folders_vt_windows_elastic

SELECT datetime(btime, 'unixepoch', 'UTC') as CreationTimeUTC, datetime(mtime, 'unixepoch', 'UTC') as ModificationTimeUTC,
    concat('https://www.virustotal.com/gui/file/', sha1) AS VtLink, filename, issuer_name, original_program_name, file.path,
    result, size, subject_name, uid FROM file
JOIN authenticode ON file.path = authenticode.path
JOIN hash ON file.path = hash.path
WHERE (file.path LIKE 'C:\%.dll' OR file.path LIKE 'C:\Windows\%.dll' OR
       file.path LIKE 'C:\Windows\System32\%.dll' OR file.path LIKE 'C:\Windows\SysWow64\%.dll') AND result != "trusted"
Failed to execute query, err: osquery failed: *osquery.ExtensionManagerQueryArgs.sql (1) field write error: i/o timeout

The subsequent queries fail with errors:

osquery failed: query: out of order sequence response
osquery failed: The pipe is being closed.
osquery failed: *osquery.ExtensionManagerQueryArgs.sql (1) field write error: i/o timeout

The user reported that this issue was reproducible with agent version 8.8.2 as well as 8.10.x.

In some cases users reported that they could still run queries against other tables afterwards; in other cases an osquerybeat/agent restart was required before osquery could recover and process queries again.

Preliminary research

This specific query is fairly expensive: when the user ran it in the osquery shell on the same machine, it took 3 minutes to complete. The current default timeout at the osquery RPC level is 1 minute https://github.com/elastic/beats/blob/main/x-pack/osquerybeat/beater/osquerybeat.go#L47. As far as I remember, this timeout applies to all I/O operations at the thrift RPC level.

Pending further investigation the current theory is:

  1. The unsigned_dlls_on_system_folders_vt_windows_elastic query request is sent from osquerybeat to osqueryd over thrift RPC (using https://github.com/osquery/osquery-go lib)
  2. The request times out at the transport level, while osqueryd process is still running the expensive query
  3. Subsequent requests can error out with "out of order sequence response", because the thrift transport implementation relies on matching the request/response seqId:
    https://github.com/apache/thrift/blob/master/lib/go/thrift/client.go#L65
    The overlapping request/responses result in subsequent errors.

I reached out to the osquery developers to see if they have any advice. At this point, as far as I remember, a timeout at the RPC level does not stop the query execution inside osqueryd.

Possible ways to address the issue

  1. Increase the timeout for Thrift transport (possibly make it configurable).
  2. Allow queries to specify a timeout per query. This needs research, but the per-query context timeout probably still has to be shorter than the transport timeout.
  3. Research whether we can improve the error handling and retry in certain failure cases where possible.
@aleksmaus
Member Author

I’ve put together sample code with osquery and the osquery-go client library that demonstrates the problem with long running queries:
https://github.com/aleksmaus/osqlong/blob/main/README.md

[213.291µs] Execute query: select * from curl where url='http://localhost:8080/?sleep=30s'
[500.556833ms] Execute query: select * from curl where url='http://localhost:8080/?sleep=5s'
[705.269333ms] Failed query: select * from curl where url='http://localhost:8080/?sleep=5s', err: timeout after 200ms
[10.001679916s] Failed query: select * from curl where url='http://localhost:8080/?sleep=30s', err: read unix ->/Users/[redacted]/.osquery/shell.em: i/o timeout
[10.00192125s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[16.005457916s] Failed query:  select * from curl where url='http://localhost:8080/?sleep=0s', err: query: out of order sequence response
[16.005506083s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[26.033969291s] Failed query:  select * from curl where url='http://localhost:8080/?sleep=0s', err: read unix ->/Users/[redacted]/.osquery/shell.em: i/o timeout
[26.034065958s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[map[bytes:5 method:GET response_code:200 result:Done
 round_trip_time:5304 url:http://localhost:8080/?sleep=0s user_agent:osquery]]

We probably would have to do two things:

  1. increase the osquery client timeout, since it takes priority over anything we might do with the per-query context timeout
  2. introduce a retry for queries that fail only because an earlier long query timed out and poisoned the transport, causing the next few queries to fail.

I forwarded this to osquery/osquery-go developers in order to get their feedback on this issue.

@aleksmaus
Member Author

aleksmaus commented Sep 21, 2023

Stumbled upon one more caveat when reproducing this issue with longer intervals.

It looks like intervals longer than 16 seconds can't be used with the osquery curl table; the 16 seconds are hardcoded in osquery:
https://github.com/osquery/osquery/blob/master/osquery/remote/transports/tls.cpp#L100

You still get the result row, just without the response_code, bytes, and round_trip_time fields set.

This is not relevant to the original issue; it just limits this approach for testing the current osquery-go->osquery behavior.

@aleksmaus
Member Author

aleksmaus commented Sep 22, 2023

Did more testing of osquery-go/osquery behavior with long running queries. There are a couple more things worth mentioning here for future reference.

  1. It looks like creating a fresh instance of the osquery-go client eliminates the problems with subsequent queries after a long query times out. The same thrift RPC "connection" can't be reused until the long query finishes running inside osquery.
  2. Creating multiple instances of the osquery-go client concurrently, in multiple goroutines, seems to work OK. I don't know what the limit is at the moment; the side effect could be higher resource utilization.
  3. Since a timeout at the transport level doesn't stop query execution within the osquery process itself, it's possible to flood osquery with queries that spike its CPU and memory, and nothing prevents that at the moment. In my tests, after timing out on dozens of long running queries, osquery memory spiked to 1GB and CPU utilization was at 90%. osquery doesn't release the resources until it finishes running the queries, even if they have already timed out from the client's perspective.

It might make sense to improve visibility for users. Currently, for each query, osquerybeat captures the number of results returned as part of the fleet action result document. It would be nice to also capture the query execution time and possibly CPU and memory utilization. Maybe we could surface per-query stats in the UI, based on historical data, to warn users about potentially expensive queries.
