This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Issue #10] Populate the search index from the opportunity tables #47

Merged
chouinar merged 8 commits into main from chouinar/10-populate-search-data on May 22, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #10

Time to review: 10 mins

Changes proposed

Set up a script to populate the search index by loading opportunities from the DB, converting them to JSON, loading them into a new index, and then aliasing that index.

Several utilities were created to simplify working with the OpenSearch client (a wrapper that handles configuration and common usage patterns).
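
For illustration, here is roughly what such a wrapper might look like with opensearch-py; the config shape and defaults are assumptions, not the code in this PR:

```python
# Hypothetical sketch of a thin OpenSearch client wrapper; the host/port
# defaults are illustrative, not the PR's real configuration.
from dataclasses import dataclass

from opensearchpy import OpenSearch


@dataclass
class SearchClientConfig:
    host: str = "localhost"
    port: int = 9200


def get_client(config: SearchClientConfig | None = None) -> OpenSearch:
    # Centralizing construction means scripts never assemble connection
    # settings themselves.
    config = config or SearchClientConfig()
    return OpenSearch(hosts=[{"host": config.host, "port": config.port}])
```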

Context for reviewers

Iterating over the opportunities and doing something with them is a common pattern in several of our scripts, so nothing is really different there.

The meaningful implementation is how we handle creating and aliasing the index. In OpenSearch you can give any index an alias (including putting multiple indexes behind the same alias). The approach is pretty simple:

  • Create an index
  • Load opportunities into the index
  • Atomically swap the index backing the `opportunity-index-alias`
  • Delete the old indexes, if any exist

This approach means that our search endpoint just needs to query the alias, and we can keep making new indexes and swapping them out behind the scenes. Because we could remake the index every few minutes, if we ever need to re-configure things like the number of shards, or any other index-creation configuration, we just update that in this script and wait for it to run again.
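
As a rough sketch of that flow using opensearch-py (not the PR's actual code; the client setup and the timestamped index name are assumptions):

```python
# Minimal sketch of the create/load/swap flow, assuming a local cluster.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

ALIAS = "opportunity-index-alias"
new_index = "opportunity-index-2024-05-22"  # hypothetical generated name

client.indices.create(index=new_index)
# ... bulk-load the opportunity records into new_index here ...

# Find whatever indexes currently back the alias (none on the first run).
old_indexes = (
    list(client.indices.get_alias(name=ALIAS).keys())
    if client.indices.exists_alias(name=ALIAS)
    else []
)

# A single _aliases call applies all actions atomically: the alias starts
# pointing at the new index and the old ones are dropped in one operation.
actions = [{"add": {"index": new_index, "alias": ALIAS}}]
actions += [{"remove_index": {"index": idx}} for idx in old_indexes]
client.indices.update_aliases(body={"actions": actions})
```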

Additional information

I ran this locally after loading 83250 records, and it took about 61s.

You can run this locally yourself by doing:

```sh
make init
make db-seed-local
poetry run flask load-search-data load-opportunity-data
```

If you'd like to see the data, you can test it out on http://localhost:5601/app/dev_tools#/console - here is an example query that searches for the word `research` across a few fields and filters to just forecasted/posted opportunities.

```json
GET opportunity-index-alias/_search
{
  "size": 25,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "research",
            "default_operator": "AND",
            "fields": ["agency.keyword^16", "opportunity_title^2", "opportunity_number^12", "summary.summary_description", "opportunity_assistance_listings.assistance_listing_number^10", "opportunity_assistance_listings.program_title^4"]
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "opportunity_status": [
              "forecasted",
              "posted"
            ]
          }
        }
      ]
    }
  }
}
```
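
For reference, the same query can also be run programmatically. This is a sketch using opensearch-py against the local cluster above, with the field list trimmed for brevity; it is not code from this PR.

```python
# Sketch: issue the example query through opensearch-py against the alias.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="opportunity-index-alias",
    body={
        "size": 25,
        "from": 0,
        "query": {
            "bool": {
                "must": [
                    {
                        "simple_query_string": {
                            "query": "research",
                            "default_operator": "AND",
                            # Trimmed field list; the full boosted list is above.
                            "fields": ["opportunity_title^2", "summary.summary_description"],
                        }
                    }
                ],
                "filter": [
                    {"terms": {"opportunity_status": ["forecasted", "posted"]}}
                ],
            }
        },
    },
)
print(response["hits"]["total"])
```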

@chouinar chouinar requested a review from acouch May 21, 2024 19:52
@chouinar chouinar requested a review from jamesbursa as a code owner May 21, 2024 19:52
```python
schema = OpportunitySchema()  # TODO - switch to the v1 version when that is merged
json_records = []

for record in records:
```
Member

(thought) This could be a pain point in the future depending on how large the records get, but looks great for now.

Collaborator Author

It's much faster than you'd think. Locally I ran ~87,000 records through this script in about 61 seconds, and that includes querying the DB, joining across all the tables, doing this iteration to jsonify, and then doing the bulk inserts.

The batching (which is done by the DB queries at 5000 records per batch) makes it scale pretty uneventfully.

I imagine on an actual cluster it'll be a bit slower as we'll probably have more nodes, but I don't see why this would ever go beyond ~5 minutes in a run, and there are a few quick optimizations available if needed (e.g., removing the index refresh from every bulk insert).
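
For context, a rough sketch of one such bulk insert with opensearch-py; the document-id field and batch plumbing are assumptions rather than the PR's exact code:

```python
# Sketch of one bulk request per DB batch (~5000 records), assuming each
# record dict carries an "opportunity_id" to use as the document id.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def index_batch(json_records: list[dict], index_name: str) -> None:
    actions = [
        {"_index": index_name, "_id": r["opportunity_id"], "_source": r}
        for r in json_records
    ]
    # helpers.bulk sends the whole batch in a single request; skipping a
    # forced refresh per insert is the quick optimization mentioned above.
    helpers.bulk(client, actions)
```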

acouch previously approved these changes May 22, 2024
@acouch acouch left a comment

Also exciting. We have relevancy!

[screenshot]

Base automatically changed from chouinar/9-setup-search-index to main May 22, 2024 17:58
@chouinar chouinar dismissed acouch’s stale review May 22, 2024 17:58

The base branch was changed.

@chouinar chouinar requested a review from acouch May 22, 2024 18:33
@chouinar chouinar merged commit 879e743 into main May 22, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/10-populate-search-data branch May 22, 2024 20:15
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Successfully merging this pull request may close these issues:

[Task]: Setup scripts to populate the search index from the DB