[Issue #14] Setup utils for creating requests and parsing responses from search #54
Conversation
# which might be fine generally, but we work with a lot of acronyms
# and should verify that doesn't cause any issues.
# see: https://opensearch.org/docs/latest/analyzers/token-filters/index/
"filter": {"custom_stemmer": {"type": "snowball", "name": "english"}},
This looks like a good default to start with. At some point there should be a content strategy or the like for the indexing options. For acronyms we can also create a huge list of synonyms.
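If we go the synonym route, OpenSearch supports a synonym token filter alongside the stemmer; a rough sketch of what an acronym entry might look like (the terms here are made up for illustration):

"filter": {
    "acronym_synonyms": {
        "type": "synonym",
        # each string lists terms that should match each other
        "synonyms": ["nofo, notice of funding opportunity"],
    }
}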
Yeah, the exact configuration we'll want is likely something we'll sort out with testing. I got this one from some basic testing of opportunity titles; I realized I needed something more than the default when my tests returned different results for `King` and `Kings`.
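For reference, a minimal sketch (using opensearch-py, with a hypothetical index name) of the analysis settings being discussed: whitespace tokenization plus the snowball stemmer, so `King` and `Kings` reduce to the same token and dashed terms stay intact.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="opportunity-index",  # hypothetical name
    body={
        "settings": {
            "analysis": {
                "analyzer": {
                    "custom_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",  # don't split on dashes
                        "filter": ["lowercase", "custom_stemmer"],
                    }
                },
                "filter": {
                    "custom_stemmer": {"type": "snowball", "name": "english"}
                },
            }
        }
    },
)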
""" | ||
|
||
""" | ||
The hits object looks like: |
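For anyone following along, the hits object in an OpenSearch response is shaped roughly like this (field values are illustrative):

response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "opportunity-index",
                "_id": "abc123",
                "_score": 1.0,
                "_source": {"opportunity_title": "Example opportunity"},
            }
        ],
    }
}

# hence the double-indexing needed to pull out the actual records
values = [record["_source"] for record in response["hits"]["hits"]]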
Appreciate the extra examples here.
Helps with understanding the parsing logic. I put these in first when implementing the logic so I knew what I was working with.
Looks great. Curious why you chose the simple query over the regular query, but I ran out of time to dig further. Can discuss in Slack or whenevs.
Based on the recommendation in OpenSearch's docs: https://opensearch.org/docs/latest/query-dsl/full-text/query-string/
The thing is, because the type of query is itself a query param, we could always support more: the simple version as the default, with maybe the option for a user to select an "advanced query format" in the UI.
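To sketch that point: since the query type is just part of the request body, swapping in the advanced syntax later is a small change (the field names here are assumed):

def build_query(text: str, advanced_syntax: bool = False) -> dict:
    # "query_string" supports the full syntax (AND/OR, grouping, etc.);
    # "simple_query_string" is the more forgiving default
    query_type = "query_string" if advanced_syntax else "simple_query_string"
    return {"query": {query_type: {"query": text, "fields": ["opportunity_title"]}}}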
Summary
Fixes #14
Time to review: 10 mins
Changes proposed
Created utilities for creating requests to OpenSearch and parsing the responses into more manageable objects.
Added some logic for configuring how we create indexes to use a different tokenizer + stemmer, as the defaults aren't great.
Context for reviewers
The search queries we need to create aren't that complex, but they're pretty large, deeply nested objects. To help with this, I've built a few generic utilities for creating the requests: a builder takes in each component of the search request separately. This way, when the API gets built out next, the search logic is really just taking our requests and passing the details to the factories, which is pretty trivial.
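As an illustration (the class and method names here are hypothetical, not the actual implementation), the builder pattern looks roughly like:

from typing import Any

class SearchRequestBuilder:
    def __init__(self) -> None:
        self._request: dict[str, Any] = {}

    def simple_query(self, text: str, fields: list[str]) -> "SearchRequestBuilder":
        # each component of the request is passed in separately...
        self._request["query"] = {
            "simple_query_string": {"query": text, "fields": fields}
        }
        return self

    def page_size(self, size: int) -> "SearchRequestBuilder":
        self._request["size"] = size
        return self

    def build(self) -> dict[str, Any]:
        # ...and the large nested request object is assembled at the end
        return self._request

request = SearchRequestBuilder().simple_query("research", ["opportunity_title"]).page_size(25).build()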
Responses are a bit less complex, they're just very nested, and adding a simple wrapper around them helps any usage of the search client do a bit less indexing into dictionaries (e.g. to access the response objects I was doing `values = [record["_source"] for record in response["hits"]["hits"]]`, which is fun). That logic just safely handles parsing the responses in a very generic manner.
Note that in both cases there are many things we don't handle yet (a ton of other request params, for example); we can add those as needed. I just focused on the ones we need right now.
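For illustration, a minimal sketch of the kind of response wrapper described above (the class and field names are hypothetical):

from dataclasses import dataclass
from typing import Any

@dataclass
class SearchResponse:
    total_records: int
    records: list[dict[str, Any]]

    @classmethod
    def from_opensearch_response(cls, response: dict[str, Any]) -> "SearchResponse":
        # safely pull the records out of the deeply nested response
        hits = response.get("hits", {})
        return cls(
            total_records=hits.get("total", {}).get("value", 0),
            records=[record["_source"] for record in hits.get("hits", [])],
        )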
One unrelated change made it in here: adjusting how analysis is done on an index. In short, the default tokenization/stemming of words was clunky for our use case. For example, the default stemmer treats `king` and `kings` as separate words. I adjusted the stemmer to use the snowball stemmer (https://snowballstem.org/), which seems to work a bit better, although we should definitely investigate this further. I also changed the tokenization to split on whitespace, as before it would separate on dashes, and a lot of terms in our system have dashes (opportunity number and agency most prominently).
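A quick way to spot-check the analyzer behavior (e.g. that dashed opportunity numbers survive tokenization) is the analyze API; the index and analyzer names here are the hypothetical ones from the sketch earlier:

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
tokens = client.indices.analyze(
    index="opportunity-index",
    body={"analyzer": "custom_analyzer", "text": "ABC-24-001 Kings"},
)
# with whitespace tokenization + snowball stemming we'd expect tokens
# like ["abc-24-001", "king"] instead of the dash-split, unstemmed defaults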
Additional information
Since this logic is potentially going to be shared across many components (if we ever build out more search indexes), I tried to document it pretty thoroughly, with links to a lot of documentation.