
Osquerybeat: Long running queries timeout at RPC level, failing subsequent queries. #36622

Closed
aleksmaus opened this issue Sep 19, 2023 · 3 comments · Fixed by #36722

Comments

@aleksmaus
Member

aleksmaus commented Sep 19, 2023

Description

A user reported an issue with running live queries after executing the pre-packaged Windows query unsigned_dlls_on_system_folders_vt_windows_elastic

SELECT datetime(btime, 'unixepoch', 'UTC') as CreationTimeUTC, datetime(mtime, 'unixepoch', 'UTC') as ModificationTimeUTC,
    concat('https://www.virustotal.com/gui/file/', sha1) AS VtLink, filename, issuer_name, original_program_name, file.path,
    result, size, subject_name, uid FROM file
JOIN authenticode ON file.path = authenticode.path
JOIN hash ON file.path = hash.path
WHERE (file.path LIKE 'C:\%.dll' OR file.path LIKE 'C:\Windows\%.dll' OR
       file.path LIKE 'C:\Windows\System32\%.dll' OR file.path LIKE 'C:\Windows\SysWow64\%.dll') AND result != "trusted"
Failed to execute query, err: osquery failed: *osquery.ExtensionManagerQueryArgs.sql (1) field write error: i/o timeout

The subsequent queries fail with errors:

osquery failed: query: out of order sequence response
osquery failed: The pipe is being closed.
osquery failed: *osquery.ExtensionManagerQueryArgs.sql (1) field write error: i/o timeout

The user reported that this issue was reproducible with agent version 8.8.2 as well as 8.10.x.

In some cases users reported that they could still run queries against other tables afterwards; in other cases an osquerybeat/agent restart was required before osquery could recover and process queries again.

Preliminary research

This specific query is fairly expensive: when the user ran it in the osquery shell on the same machine, it took 3 minutes to complete. The current default timeout at the osquery RPC level is 1 minute https://github.com/elastic/beats/blob/main/x-pack/osquerybeat/beater/osquerybeat.go#L47. As far as I remember, this timeout applies to all I/O operations at the thrift RPC level.

Pending further investigation the current theory is:

  1. The unsigned_dlls_on_system_folders_vt_windows_elastic query request is sent from osquerybeat to osqueryd over thrift RPC (using https://github.com/osquery/osquery-go lib)
  2. The request times out at the transport level, while osqueryd process is still running the expensive query
  3. Subsequent requests can error out with "out of order sequence response", because the thrift transport implementation relies on matching the request/response seqId:
    https://github.com/apache/thrift/blob/master/lib/go/thrift/client.go#L65
    The overlapping request/responses result in subsequent errors.

I reached out to the osquery developers to see if they have any advice. At this point, as far as I remember, a timeout at the RPC level does not stop the query execution inside osqueryd.

Possible ways to address the issue

  1. Increase the timeout for Thrift transport (possibly make it configurable).
  2. Allow queries to specify a timeout per query. This needs research, but the per-query context timeout probably still has to be shorter than the transport timeout.
  3. Research whether we can improve the error handling and retry in certain failure cases where possible.
@aleksmaus
Member Author

I’ve put together sample code with osquery and the osquery-go client library that demonstrates the problem with long running queries:
https://github.com/aleksmaus/osqlong/blob/main/README.md

[213.291µs] Execute query: select * from curl where url='http://localhost:8080/?sleep=30s'
[500.556833ms] Execute query: select * from curl where url='http://localhost:8080/?sleep=5s'
[705.269333ms] Failed query: select * from curl where url='http://localhost:8080/?sleep=5s', err: timeout after 200ms
[10.001679916s] Failed query: select * from curl where url='http://localhost:8080/?sleep=30s', err: read unix ->/Users/[redacted]/.osquery/shell.em: i/o timeout
[10.00192125s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[16.005457916s] Failed query:  select * from curl where url='http://localhost:8080/?sleep=0s', err: query: out of order sequence response
[16.005506083s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[26.033969291s] Failed query:  select * from curl where url='http://localhost:8080/?sleep=0s', err: read unix ->/Users/[redacted]/.osquery/shell.em: i/o timeout
[26.034065958s] Execute query: select * from curl where url='http://localhost:8080/?sleep=0s'
[map[bytes:5 method:GET response_code:200 result:Done
 round_trip_time:5304 url:http://localhost:8080/?sleep=0s user_agent:osquery]]

We probably would have to do two things:

  1. increase the osquery client timeout, since it takes priority over anything we might do with the per-query context timeout
  2. introduce a retry for queries that fail only because an earlier long query timed out and poisoned the transport, causing the next few queries to fail.

I forwarded this to osquery/osquery-go developers in order to get their feedback on this issue.

@aleksmaus
Member Author

aleksmaus commented Sep 21, 2023

Stumbled upon one more caveat when reproducing this issue with longer intervals.

It looks like intervals longer than 16 seconds can't be used with the osquery curl table; the 16 seconds are hardcoded in osquery:
https://github.com/osquery/osquery/blob/master/osquery/remote/transports/tls.cpp#L100

You still get the result row, just without the response_code, bytes, and round_trip_time fields set.

This is not relevant to the original issue; it just limits this approach for testing the current osquery-go->osquery behavior.

@aleksmaus
Member Author

aleksmaus commented Sep 22, 2023

Did more testing of osquery-go/osquery behavior with long running queries. There are a couple more things worth mentioning here for future reference.

  1. It looks like creating a fresh instance of the osquery-go client eliminates the problems with subsequent queries after a long query times out. The same thrift RPC "connection" can't be reused until the long query finishes running inside osquery.
  2. Creating multiple instances of the osquery-go client concurrently, in multiple goroutines, seems to work OK. I don't know what the limit is at the moment; the side effect could be higher resource utilization.
  3. Since a timeout at the transport level doesn't stop query execution within the osquery process itself, it's possible to flood osquery with queries that spike its CPU and memory, and nothing prevents that at the moment. In my tests, after timing out on dozens of long running queries, osquery memory spiked to 1GB and CPU utilization was at 90%. osquery doesn't release the resources until it finishes running the queries, even if they have already timed out from the client's perspective.

It might make sense to improve visibility for users. Currently, for each query, osquerybeat captures the number of results returned as part of the fleet action result document. It would be nice to also capture the query execution time and possibly CPU and memory utilization. Maybe we could surface per-query stats in the UI, based on historical data, to warn users about potentially expensive queries.
