[Issue #10] Populate the search index from the opportunity tables #47
Conversation
schema = OpportunitySchema() # TODO - switch to the v1 version when that is merged
json_records = []

for record in records:
(thought) This could be a pain point in the future depending on how large the records get, but looks great for now.
It's much faster than you'd think. Locally I ran ~87000 records through this script in about 61 seconds. And that includes querying the DB, joining across all the tables, doing this iteration to jsonify, and then doing the bulk inserts.
The batching (which is done by the DB queries at 5000 records per batch) makes it scale pretty uneventfully.
I imagine on an actual cluster it'll be a bit slower as we'll probably have more nodes, but I don't see why this would ever go beyond ~5 minutes in a run, and there are a few quick optimizations we could make if it does (e.g. removing the index refresh from every bulk insert).
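For context, a minimal sketch of the batching pattern being discussed, assuming the `opensearch-py` client and a Marshmallow-style schema; `batched`, `load_records`, and the `opportunity_id` field are illustrative names, not the actual implementation:

```python
from itertools import islice

from opensearchpy import OpenSearch
from opensearchpy.helpers import bulk


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


def load_records(client: OpenSearch, index_name: str, records, schema, batch_size=5000):
    for chunk in batched(records, batch_size):
        actions = [
            # _id assumes a unique key like opportunity_id (illustrative)
            {"_index": index_name, "_id": r.opportunity_id, "_source": schema.dump(r)}
            for r in chunk
        ]
        # refresh=False skips the per-request index refresh -- the quick
        # optimization mentioned above; refresh once at the end instead.
        bulk(client, actions, refresh=False)
    client.indices.refresh(index=index_name)
```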
Summary
Fixes #10
Time to review: 10 mins
Changes proposed
Set up a script to populate the search index by loading opportunities from the DB, jsonify'ing them, loading them into a new index, and then aliasing that index.
Several utilities were created to simplify working with the OpenSearch client (a wrapper for setting up configuration / common patterns).
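To give a flavor of what the wrapper handles, here is a rough sketch; the env-var names and defaults are illustrative assumptions, not the actual configuration:

```python
import os

from opensearchpy import OpenSearch


def get_opensearch_client() -> OpenSearch:
    # Pull connection settings from the environment so local dev and
    # deployed clusters can share the same code path.
    return OpenSearch(
        hosts=[
            {
                "host": os.getenv("SEARCH_HOST", "localhost"),
                "port": int(os.getenv("SEARCH_PORT", "9200")),
            }
        ],
        use_ssl=os.getenv("SEARCH_USE_SSL", "false") == "true",
        verify_certs=False,  # local dev default; tighten for real clusters
    )
```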
Context for reviewers
Iterating over the opportunities and doing something with them is a common pattern in several of our scripts, so nothing is really different there.
The meaningful implementation is how we handle creating and aliasing the index. In OpenSearch you can give any index an alias (including putting multiple indexes behind the same alias). The approach is pretty simple:
* Create an index
* Load opportunities into the index
* Atomically swap the index backing the `opportunity-index-alias`
* Delete the old indexes, if any exist
This approach means that our search endpoint just needs to query the alias, and we can keep making new indexes and swapping them out behind the scenes. Because we could remake the index every few minutes, if we ever need to re-configure things like the number of shards, or any other index-creation configuration, we just update that in this script and wait for it to run again.
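A minimal sketch of that atomic swap, assuming the `opensearch-py` client; `opportunity-index-alias` is from this PR, while `swap_alias` and `new_index` are illustrative names:

```python
from opensearchpy import OpenSearch

ALIAS = "opportunity-index-alias"


def swap_alias(client: OpenSearch, new_index: str) -> None:
    # Find whatever indexes currently back the alias (none on the first run).
    existing = (
        set(client.indices.get_alias(name=ALIAS).keys())
        if client.indices.exists_alias(name=ALIAS)
        else set()
    )
    # A single update_aliases call applies all actions atomically, so the
    # alias never points at zero indexes mid-swap.
    client.indices.update_aliases(
        body={
            "actions": [
                {"remove": {"index": idx, "alias": ALIAS}} for idx in existing
            ]
            + [{"add": {"index": new_index, "alias": ALIAS}}]
        }
    )
    # Clean up the old indexes now that the alias points at the new one.
    for idx in existing:
        client.indices.delete(index=idx)
```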
Additional information
I ran this locally after loading `83250` records, and it took about 61s. You can run this locally yourself by doing:
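```sh
make init
make db-seed-local
poetry run flask load-search-data load-opportunity-data
```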
If you'd like to see the data, you can test it out on http://localhost:5601/app/dev_tools#/console - here is an example query that filters by the word `research` across a few fields and filters to just forecasted/posted.
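```json
GET opportunity-index-alias/_search
{
  "size": 25,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "research",
            "default_operator": "AND",
            "fields": [
              "agency.keyword^16",
              "opportunity_title^2",
              "opportunity_number^12",
              "summary.summary_description",
              "opportunity_assistance_listings.assistance_listing_number^10",
              "opportunity_assistance_listings.program_title^4"
            ]
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "opportunity_status": ["forecasted", "posted"]
          }
        }
      ]
    }
  }
}
```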