This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Issue #179] Incrementally load search data #180

Merged 3 commits on Sep 13, 2024

chouinar (Collaborator) commented Aug 16, 2024

Summary

Fixes #179

Time to review: 10 mins

Changes proposed

Updated the load search data task to partially support incrementally loading + deleting records in the search index rather than just fully remaking it.

Various changes to the search utilities to support this work

Context for reviewers

Technically this doesn't fully support a true incremental load as it updates every record rather than just the ones with changes. I think the logic necessary to detect changes both deserves its own ticket, and may evolve when we later support indexing files to OpenSearch, so I think it makes sense to hold off on that for now.
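For readers skimming the approach, here is a minimal sketch of the "rewrite every record, delete what is gone" pass described above. It is not the code in this PR: the function name, the `opportunity_id` field, and the index name are hypothetical, and only the action shapes follow the bulk API's `index`/`delete` format.

```python
from typing import Any, Iterable


def build_bulk_actions(
    index_name: str,
    records: Iterable[dict[str, Any]],
    existing_ids: set[int],
    id_field: str = "opportunity_id",  # hypothetical field name
) -> list[dict[str, Any]]:
    """Build bulk action lines: index every record, delete stale ids."""
    actions: list[dict[str, Any]] = []
    seen_ids: set[int] = set()

    for record in records:
        record_id = record[id_field]
        seen_ids.add(record_id)
        # "index" creates the document or overwrites an existing one, so
        # every record is rewritten whether or not it actually changed.
        actions.append({"index": {"_index": index_name, "_id": record_id}})
        actions.append(record)

    # Anything still in the index but missing from the source is deleted.
    for stale_id in sorted(existing_ids - seen_ids):
        actions.append({"delete": {"_index": index_name, "_id": stale_id}})

    return actions
```

Skipping unchanged records would mean filtering `records` before this step, which is the change detection deferred to a later ticket.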

@Rwolfe-Nava Rwolfe-Nava left a comment

Looks good to me so far. Would love to go over in some more detail when you return

@chouinar chouinar marked this pull request as ready for review August 27, 2024 16:21
@chouinar chouinar requested a review from jamesbursa as a code owner August 27, 2024 16:21
@chouinar chouinar requested a review from Rwolfe-Nava August 27, 2024 16:21

mdragon commented Sep 13, 2024

> Technically this doesn't fully support a true incremental load as it updates every record rather than just the ones with changes. I think the logic necessary to detect changes both deserves its own ticket, and may evolve when we later support indexing files to OpenSearch, so I think it makes sense to hold off on that for now.

While it's not strictly ideal, I've seen good results with Elastic (and so, I think it's safe to assume, OpenSearch) turning updates with no new data into "no-ops" at the search layer. Obviously, in higher-volume data situations we might still want to use coarse methods to limit what we send to search, but I've always taken a better-safe-than-sorry approach and let the search code figure out when something might have been "updated" but not "changed" in terms of what is indexed.
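As a rough illustration of the no-op behavior: Elasticsearch's update API accepts a partial `doc` and, with `detect_noop` enabled (its default there), reports `"result": "noop"` when nothing would change. The sketch below only builds the request body and tallies results; the helper names are hypothetical, and whether OpenSearch behaves identically is an assumption that should be verified against a real cluster.

```python
from typing import Any


def build_update_body(doc: dict[str, Any]) -> dict[str, Any]:
    """Body for a partial-document update request."""
    # detect_noop asks the engine to skip the write when merging `doc`
    # into the stored document would leave it unchanged.
    return {"doc": doc, "detect_noop": True}


def count_results(responses: list[dict[str, Any]]) -> dict[str, int]:
    """Tally the "result" field of update responses (e.g. updated, noop)."""
    counts: dict[str, int] = {}
    for response in responses:
        result = response.get("result", "unknown")
        counts[result] = counts.get(result, 0) + 1
    return counts
```

Logging the tally after a load would show how many of the "update everything" writes were actually no-ops.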

chouinar (Collaborator, Author) commented

> > Technically this doesn't fully support a true incremental load as it updates every record rather than just the ones with changes. I think the logic necessary to detect changes both deserves its own ticket, and may evolve when we later support indexing files to OpenSearch, so I think it makes sense to hold off on that for now.
>
> While it's not strictly ideal, I've seen good results with Elastic (and so, I think it's safe to assume, OpenSearch) turning updates with no new data into "no-ops" at the search layer. Obviously, in higher-volume data situations we might still want to use coarse methods to limit what we send to search, but I've always taken a better-safe-than-sorry approach and let the search code figure out when something might have been "updated" but not "changed" in terms of what is indexed.

I do see mention of noop in the Elasticsearch docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html but not in the OpenSearch docs: https://opensearch.org/docs/latest/api-reference/document-apis/update-document/

I wonder if you could reindex and merge two indices together into one (let's say, an opportunity index and an opportunity attachment index), and use that?

@chouinar chouinar merged commit 2854d43 into main Sep 13, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/179-incremental-search-load branch September 13, 2024 17:10

mdragon commented Sep 13, 2024

> I wonder if you could reindex and merge two indices together into one (let's say, an opportunity index and an opportunity attachment index), and use that?

You can definitely assign the same alias to multiple indexes (again, at least in Elastic), and queries against the alias will then run across both, though I'm not sure how well this works in practice.

In the past I used aliases to let the index be swapped out under a running system. On a monthly data update cycle we would push data out to the DB, build a full new index named for the month ("search-sept"), and then, once the new index was fully built, flip the alias from "search-aug" to "search-sept". This was a good way to make a best effort at syncing data changes to the index continuously while still having a regular checkpoint where, no matter what, we'd know the search was up to date. Our data drove the monthly timeline; you could do this weekly, daily, or even hourly, depending on how expensive the data pull is to fully index.
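A minimal sketch of that alias flip, assuming the standard `_aliases` request format, where a `remove` and an `add` in a single request are applied together. The helper name and the alias name `search` are hypothetical; the index names are taken from the example above.

```python
from typing import Any


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict[str, Any]:
    """Body for a POST to _aliases that repoints `alias` in one request."""
    return {
        "actions": [
            # Detach the alias from last cycle's index...
            {"remove": {"index": old_index, "alias": alias}},
            # ...and attach it to the freshly built one.
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Example: flip the (hypothetical) "search" alias to the new month's index.
swap_body = build_alias_swap("search", "search-aug", "search-sept")
```

Leaving out the `remove` action would instead keep the alias attached to both indexes, which is the multi-index case mentioned above.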

acouch pushed a commit that referenced this pull request Sep 18, 2024
Fixes HHS#2038

Updated the load search data task to partially support incrementally
loading + deleting records in the search index rather than just fully
remaking it.

Various changes to the search utilities to support this work

Technically this doesn't fully support a true incremental load as it
updates every record rather than just the ones with changes. I think the
logic necessary to detect changes both deserves its own ticket, and may
evolve when we later support indexing files to OpenSearch, so I think it
makes sense to hold off on that for now.
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Fixes #2038
Successfully merging this pull request may close these issues.

[Task]: Modify data import to search index to also work incrementally