Support adaptive refresh in Searcher Managers. #14443

vigyasharma · 2025-04-05T22:39:32Z

In segment based replication systems, a large replication payload (checkpoint) can induce heavy page faults, cause thrashing for in-flight search requests, and affect overall search performance.

A potential way to handle these bursts is to leverage multiple commit points in the Lucene index. Instead of refreshing to the latest commit for a large replication payload, searchers can intelligently select a commit point that they can safely absorb. By stepping through multiple such points, searchers can eventually get to the latest commit without incurring too many page faults.

This change lets users define a commit selection strategy, controlling which commit the searcher manager refreshes on. Addresses #14219

Usage:
To incrementally refresh through multiple commit points until the searcher is current with its directory:

Define a commit selection strategy using the RefreshCommitSupplier interface.
Update searcher managers with this strategy via setRefreshCommitSupplier().
Invoke maybeRefresh() or maybeRefreshBlocking() in a loop until isSearcherCurrent() returns true.
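
The refresh loop described above can be sketched with a toy model. This is only an illustration under stated assumptions: ToySearcherManager, CommitSelector, and StepwiseRefreshDemo are hypothetical stand-ins written for this example, not Lucene classes, and commits are modeled as bare generation numbers. The selector plays the role that the description assigns to RefreshCommitSupplier.

```java
import java.util.List;

/** Toy stand-in for a searcher manager (NOT a Lucene class). Commits are generation numbers. */
final class ToySearcherManager {
    /** Hypothetical strategy: pick the generation to refresh to, given the current one. */
    interface CommitSelector {
        long select(long currentGen, List<Long> availableGens);
    }

    private final List<Long> commits; // available commit generations, ascending
    private long currentGen;
    // Default behaviour: jump straight to the latest commit.
    private CommitSelector selector = (cur, gens) -> gens.get(gens.size() - 1);

    ToySearcherManager(List<Long> commits, long startGen) {
        this.commits = commits;
        this.currentGen = startGen;
    }

    void setCommitSelector(CommitSelector selector) {
        this.selector = selector;
    }

    /** True once the searcher is on the newest available commit. */
    boolean isSearcherCurrent() {
        return currentGen == commits.get(commits.size() - 1);
    }

    /** Moves to the commit chosen by the selector; returns false if nothing changed. */
    boolean maybeRefresh() {
        long target = selector.select(currentGen, commits);
        if (target <= currentGen) {
            return false;
        }
        currentGen = target;
        return true;
    }
}

final class StepwiseRefreshDemo {
    /** Example selector: advance to the next newer commit, one hop at a time. */
    static long nextNewer(long currentGen, List<Long> gens) {
        for (long g : gens) {
            if (g > currentGen) {
                return g;
            }
        }
        return currentGen;
    }

    /** Refreshes in a loop until the searcher is current; returns the number of hops taken. */
    static int refreshUntilCurrent(ToySearcherManager mgr) {
        int hops = 0;
        while (!mgr.isSearcherCurrent()) {
            if (!mgr.maybeRefresh()) {
                // Guard against a selector that never advances, which would loop forever.
                throw new IllegalStateException("selector made no progress");
            }
            hops++;
        }
        return hops;
    }
}
```

With commits 1 through 4 and a searcher starting on commit 1, the nextNewer selector reaches the latest commit in three hops, while the default selector would get there in one.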

jpountz · 2025-04-07T12:17:37Z

Thanks for tackling this!

To incrementally refresh through multiple commit points until searcher is current with its directory:

[...]
Invoke maybeRefresh() or maybeRefreshBlocking in a loop until isSearcherCurrent() returns true.

Is this the way we anticipate this to be used? I had imagined that the application would not change the way it refreshes and would still call it on a schedule, but commit more frequently and retain multiple commits. E.g. commit every 30 seconds, retain commits for 300 seconds and refresh every 120 seconds (these numbers are just for the sake of the example). So every 120 seconds, SearcherManager would pick the most recent commit that differs by less than X GB (configurable based on the amount of thrashing that the app can sustain between consecutive point-in-time views of the index) from the current point-in-time reader, or the commit that differs by the least amount of data if there is no such commit (typically the oldest commit). Most of the time, SearcherManager would pick the newest commit point, but under heavy merging it may decide to lag behind the latest commit point a bit for the sake of smoothing out page cache thrashing.
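
The selection rule described in this comment can be sketched as follows. This is a minimal illustration, not the PR's implementation: ToyCommit and BudgetedCommitPicker are hypothetical names, and the delta between two commits is assumed to be the total size of files present in the candidate but absent from the current commit (a simplification that relies on index files being write-once).

```java
import java.util.List;
import java.util.Map;

/** Toy model of a commit (NOT a Lucene type): generation plus file name -> size in bytes. */
record ToyCommit(long generation, Map<String, Long> fileSizes) {}

final class BudgetedCommitPicker {
    /** Bytes the searcher would have to newly load: files in candidate that current lacks. */
    static long deltaBytes(ToyCommit current, ToyCommit candidate) {
        long sum = 0;
        for (Map.Entry<String, Long> e : candidate.fileSizes().entrySet()) {
            if (!current.fileSizes().containsKey(e.getKey())) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    /**
     * Picks the newest commit whose delta fits within budgetBytes. If none fits,
     * falls back to the newer commit with the smallest delta. Returns current
     * when there is no newer commit at all.
     */
    static ToyCommit pick(ToyCommit current, List<ToyCommit> commits, long budgetBytes) {
        ToyCommit newestWithinBudget = null;
        ToyCommit smallestDelta = null;
        long smallest = Long.MAX_VALUE;
        for (ToyCommit c : commits) {
            if (c.generation() <= current.generation()) {
                continue; // only consider commits newer than the current view
            }
            long d = deltaBytes(current, c);
            if (d <= budgetBytes
                    && (newestWithinBudget == null
                        || c.generation() > newestWithinBudget.generation())) {
                newestWithinBudget = c;
            }
            if (d < smallest) {
                smallest = d;
                smallestDelta = c;
            }
        }
        if (newestWithinBudget != null) {
            return newestWithinBudget;
        }
        return smallestDelta != null ? smallestDelta : current;
    }
}
```

A generous budget makes the picker behave like a plain refresh to the latest commit; a tight budget makes it lag behind, for example when a large merge has rewritten most of the files.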

vigyasharma · 2025-04-07T23:22:46Z

every 120 seconds, SearcherManager would pick the most recent commit that differs by less than X GB

This is indeed how we anticipate it being used. In NRT-style segment-replicated setups, if searchers refresh more often than the replication frequency, they will eventually catch up to the latest commit. I mentioned the while loop for cases where users want to wait and verify that their searchers are current.

The PR of course supports both patterns; I'll update the description to reflect that as well.

jpountz · 2025-04-08T06:21:42Z

if searchers refresh more often than replication frequency

OK, I think I misunderstood how it would be used. I had assumed that commits would always get replicated immediately, but you are suggesting that replications are infrequent and bring several commits at once, giving replica nodes time to smoothly absorb the delta.

jpountz · 2025-04-10T13:39:56Z

Sorry I'm still a bit confused: how is this approach better than just committing more frequently, replicating commits as soon as they are created, and refreshing searchers as soon as commits are replicated?

msokolov · 2025-04-11T23:41:07Z

Sorry I'm still a bit confused: how is this approach better than just committing more frequently, replicating commits as soon as they are created, and refreshing searchers as soon as commits are replicated?

One scenario of interest is when replication becomes delayed, which is expected with cross-datacenter replication, for example. In that case commit points may pile up, even to the extent of completely replacing the entire index. In such a case we'd like to be able to recover without undue impact to searchers.

vigyasharma · 2025-04-12T03:53:48Z

just committing more frequently, replicating commits as soon as they are created, and refreshing searchers as soon as commits are replicated?

This is more or less the setup we have today at Amazon Product Search. We have separate indexing and search fleets that use S3 as a sink. Some fleets replicate across AWS data centers. I believe this is a common architecture; e.g., DoorDash seems to have a similar search architecture.

However, as Mike mentioned, these commits go over network hops and are vulnerable to networking lags. Our searchers periodically pull the latest commit from S3 and refresh. If replication is delayed, searchers can skip a few commits to pull the latest one available. This latest commit can have a very high delta relative to what searchers are currently on.

With adaptive refresh, we are experimenting with making searchers pull the last N commits and refresh on the newest commit that they can safely absorb. At the extreme, if the entire index has changed, it will be no different than refreshing on the latest commit. But for moderate delay windows, we could find "bite sized" hops for searchers to catch up safely.

vigyasharma · 2025-04-12T03:54:32Z

Another scenario where adaptive refresh might be useful is with heterogeneous search fleets. Searchers with less memory would benefit from stepping through smaller commit deltas, while high-memory searchers can jump ahead.

vigyasharma added 17 commits April 5, 2025 14:33

add refresh commit supplier to Searcher, Taxonomy, and Reader managers

2fdf80c

add license

f6d0eb5

pass directory reader and let impl. list out commits

9f7387c

fix doc string

29fd8eb

tidy

48c8fea

docstring;

61609fb

start UT

09a6c29

test next commit in searcher manager

91e03a7

unused var remove

0157a66

tidy

b4e5661

add step wise commit test

cea5fd5

move util methods to SearcherManager

bb0e0c5

add searcher taxo mgr tests

f1e4e48

test taxo refresh after searcher

7b72cf9

restore ReaderManager to main branch version

7981503

simplify Searcher Taxonomy manager

bb73dd4

tidy

20e48af

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 5, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 5, 2025

github-actions bot added module:core/search module:facet labels Apr 5, 2025

vigyasharma changed the title ~~Support incremental refresh in Searcher Managers.~~ Support adaptive refresh in Searcher Managers. Apr 8, 2025

vigyasharma added 2 commits April 8, 2025 17:14

always refresh taxonomy on to the latest commit

e4d0012

fix test

d12d63d
