This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Issue #10] Populate the search index from the opportunity tables #47

Merged
chouinar merged 8 commits into main from chouinar/10-populate-search-data on May 22, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #10

Time to review: 10 mins

Changes proposed

Set up a script to populate the search index by loading opportunities from the DB, converting them to JSON, loading them into a new index, and then aliasing that index.

Several utilities were created to simplify working with the OpenSearch client (a wrapper that handles configuration and common usage patterns).
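
For illustration, here is roughly what such a wrapper might look like with opensearch-py; the config shape and defaults are assumptions, not the code in this PR:

```python
# Hypothetical sketch of a thin OpenSearch client wrapper; the host/port
# defaults are illustrative, not the PR's real configuration.
from dataclasses import dataclass

from opensearchpy import OpenSearch


@dataclass
class SearchClientConfig:
    host: str = "localhost"
    port: int = 9200


def get_client(config: SearchClientConfig | None = None) -> OpenSearch:
    # Centralizing construction means scripts never assemble connection
    # settings themselves.
    config = config or SearchClientConfig()
    return OpenSearch(hosts=[{"host": config.host, "port": config.port}])
```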

Context for reviewers

Iterating over the opportunities and doing something with them is a common pattern in several of our scripts, so nothing is really different there.

The meaningful implementation is how we handle creating and aliasing the index. In OpenSearch you can give any index an alias (including putting multiple indexes behind the same alias). The approach is pretty simple:

  • Create an index
  • Load opportunities into the index
  • Atomically swap the index backing the `opportunity-index-alias`
  • Delete the old indexes, if any exist

This approach means that our search endpoint just needs to query the alias, and we can keep making new indexes and swapping them out behind the scenes. Because we could remake the index every few minutes, if we ever need to re-configure things like the number of shards, or any other index-creation configuration, we just update that in this script and wait for it to run again.
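
As a rough sketch of that flow using opensearch-py (not the PR's actual code; the client setup and the timestamped index name are assumptions):

```python
# Minimal sketch of the create/load/swap flow, assuming a local cluster.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

ALIAS = "opportunity-index-alias"
new_index = "opportunity-index-2024-05-22"  # hypothetical generated name

client.indices.create(index=new_index)
# ... bulk-load the opportunity records into new_index here ...

# Find whatever indexes currently back the alias (none on the first run).
old_indexes = (
    list(client.indices.get_alias(name=ALIAS).keys())
    if client.indices.exists_alias(name=ALIAS)
    else []
)

# A single _aliases call applies all actions atomically: the alias starts
# pointing at the new index and the old ones are dropped in one operation.
actions = [{"add": {"index": new_index, "alias": ALIAS}}]
actions += [{"remove_index": {"index": idx}} for idx in old_indexes]
client.indices.update_aliases(body={"actions": actions})
```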

Additional information

I ran this locally after loading 83250 records, and it took about 61s.

You can run this locally yourself by doing:

```sh
make init
make db-seed-local
poetry run flask load-search-data load-opportunity-data
```

If you'd like to see the data, you can test it out on http://localhost:5601/app/dev_tools#/console - here is an example query that searches for the word `research` across a few fields and filters to just forecasted/posted opportunities.

```json
GET opportunity-index-alias/_search
{
  "size": 25,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "research",
            "default_operator": "AND",
            "fields": ["agency.keyword^16", "opportunity_title^2", "opportunity_number^12", "summary.summary_description", "opportunity_assistance_listings.assistance_listing_number^10", "opportunity_assistance_listings.program_title^4"]
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "opportunity_status": [
              "forecasted",
              "posted"
            ]
          }
        }
      ]
    }
  }
}
```
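
For reference, the same query can also be run programmatically. This is a sketch using opensearch-py against the local cluster above, with the field list trimmed for brevity; it is not code from this PR.

```python
# Sketch: issue the example query through opensearch-py against the alias.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="opportunity-index-alias",
    body={
        "size": 25,
        "from": 0,
        "query": {
            "bool": {
                "must": [
                    {
                        "simple_query_string": {
                            "query": "research",
                            "default_operator": "AND",
                            # Trimmed field list; the full boosted list is above.
                            "fields": ["opportunity_title^2", "summary.summary_description"],
                        }
                    }
                ],
                "filter": [
                    {"terms": {"opportunity_status": ["forecasted", "posted"]}}
                ],
            }
        },
    },
)
print(response["hits"]["total"])
```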

@chouinar chouinar requested a review from acouch May 21, 2024 19:52
@chouinar chouinar requested a review from jamesbursa as a code owner May 21, 2024 19:52
```python
schema = OpportunitySchema()  # TODO - switch to the v1 version when that is merged
json_records = []

for record in records:
```
Member

(thought) This could be a pain point in the future depending on how large the records get, but looks great for now.

Collaborator Author

It's much faster than you'd think. Locally I ran ~87,000 records through this script in about 61 seconds, and that includes querying the DB, joining across all the tables, doing this iteration to jsonify, and then doing the bulk inserts.

The batching (which is done by the DB queries at 5000 records per batch) makes it scale pretty uneventfully.

I imagine on an actual cluster it'll be a bit slower as we'll probably have more nodes, but I don't see why this would ever go beyond ~5 minutes in a run, and there are a few quick optimizations available if needed (e.g., removing the index refresh from every bulk insert).
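
For context, a rough sketch of one such bulk insert with opensearch-py; the document-id field and batch plumbing are assumptions rather than the PR's exact code:

```python
# Sketch of one bulk request per DB batch (~5000 records), assuming each
# record dict carries an "opportunity_id" to use as the document id.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def index_batch(json_records: list[dict], index_name: str) -> None:
    actions = [
        {"_index": index_name, "_id": r["opportunity_id"], "_source": r}
        for r in json_records
    ]
    # helpers.bulk sends the whole batch in a single request; skipping a
    # forced refresh per insert is the quick optimization mentioned above.
    helpers.bulk(client, actions)
```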

acouch previously approved these changes May 22, 2024
@acouch acouch left a comment

Also exciting. We have relevancy!

[screenshot]

Base automatically changed from chouinar/9-setup-search-index to main May 22, 2024 17:58
@chouinar chouinar dismissed acouch’s stale review May 22, 2024 17:58

The base branch was changed.

@chouinar chouinar requested a review from acouch May 22, 2024 18:33
@chouinar chouinar merged commit 879e743 into main May 22, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/10-populate-search-data branch May 22, 2024 20:15
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Successfully merging this pull request may close these issues:

[Task]: Setup scripts to populate the search index from the DB