This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Issue #14] Setup utils for creating requests and parsing responses from search #54

Merged
merged 12 commits into main from chouinar/14-req-resp-tools on Jun 25, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #14

Time to review: 10 mins

Changes proposed

Created utilities for building requests to OpenSearch and parsing the responses into more manageable objects.

Added logic for configuring index creation to use a different tokenizer + stemmer, as the defaults aren't great.

Context for reviewers

The search queries we need to create aren't that complex, but they're pretty large, deeply nested objects. To help with this, I've built a few generic utilities that construct a request via a builder: you pass in each component of the search request, and it assembles the full query. This way, when the API gets built out next, the search logic is really just taking our requests and passing the details to the factories, which is pretty trivial.
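A minimal sketch of the idea, assuming a builder along these lines (the class and method names are hypothetical, not this PR's actual API):

```python
from typing import Any


class SearchRequestBuilder:
    """Accumulates the pieces of a search and assembles the nested request body."""

    def __init__(self) -> None:
        self._query: dict[str, Any] | None = None
        self._filters: list[dict[str, Any]] = []
        self._sort: list[dict[str, Any]] = []
        self._size = 25
        self._offset = 0

    def simple_query(self, text: str, fields: list[str]) -> "SearchRequestBuilder":
        self._query = {"simple_query_string": {"query": text, "fields": fields}}
        return self

    def filter_terms(self, field: str, values: list[str]) -> "SearchRequestBuilder":
        self._filters.append({"terms": {field: values}})
        return self

    def sort_by(self, field: str, descending: bool = True) -> "SearchRequestBuilder":
        self._sort.append({field: {"order": "desc" if descending else "asc"}})
        return self

    def paginate(self, size: int, offset: int) -> "SearchRequestBuilder":
        self._size, self._offset = size, offset
        return self

    def build(self) -> dict[str, Any]:
        # Only wrap things in a bool query when there is something to combine.
        body: dict[str, Any] = {"size": self._size, "from": self._offset}
        clauses: dict[str, Any] = {}
        if self._query is not None:
            clauses["must"] = [self._query]
        if self._filters:
            clauses["filter"] = self._filters
        if clauses:
            body["query"] = {"bool": clauses}
        if self._sort:
            body["sort"] = self._sort
        return body


# The API layer then just forwards user-supplied params into the builder:
request_body = (
    SearchRequestBuilder()
    .simple_query("research grant", ["opportunity_title^2", "summary"])
    .filter_terms("agency", ["HHS"])
    .sort_by("post_date")
    .paginate(size=25, offset=0)
    .build()
)
```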

Responses are a bit less complex, just very nested, and adding a simple wrapper around them means any usage of the search client needs to do a bit less indexing into dictionaries (e.g., to access the response objects I was doing `values = [record["_source"] for record in response["hits"]["hits"]]`, which is fun). That logic just safely handles parsing the responses in a very generic manner.
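The wrapper can be as small as a dataclass that does that dictionary-indexing once, up front (again a sketch; the names are illustrative):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SearchResponse:
    total_records: int             # hits.total.value from the raw response
    records: list[dict[str, Any]]  # the _source document from each hit
    raw: dict[str, Any]            # keep the full response for anything else

    @classmethod
    def from_opensearch(cls, response: dict[str, Any]) -> "SearchResponse":
        hits = response.get("hits", {})
        return cls(
            total_records=hits.get("total", {}).get("value", 0),
            records=[hit["_source"] for hit in hits.get("hits", [])],
            raw=response,
        )
```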

Note that in both cases there are many things we don't handle yet (a ton of other request params, for example); we can add those as needed. I just focused on the ones we need right now.


One unrelated change made it in here: adjusting how analysis is done on an index. In short, the default tokenization/stemming of words was clunky for our use case. For example, the default stemmer treats `king` and `kings` as separate words. I adjusted the stemmer to use the [snowball stemmer](https://snowballstem.org/), which seems to work a bit better, although we should definitely investigate this further. I also changed the tokenization to split on whitespace; before, it would also split on dashes, and a lot of terms in our system have dashes (opportunity number and agency pretty prominently).
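Roughly, the index settings end up looking like this (the `custom_stemmer` filter matches the diff excerpt below; the analyzer wiring around it is an illustrative reconstruction):

```python
# Analysis settings: whitespace tokenizer + snowball stemmer.
ANALYSIS_SETTINGS = {
    "analysis": {
        "analyzer": {
            # Naming the analyzer "default" applies it to all text fields.
            "default": {
                "type": "custom",
                # Split only on whitespace so dashed terms (opportunity
                # numbers, agency codes) survive as a single token.
                "tokenizer": "whitespace",
                "filter": ["lowercase", "custom_stemmer"],
            }
        },
        # Snowball (english) stems "kings" -> "king" so both match.
        "filter": {"custom_stemmer": {"type": "snowball", "name": "english"}},
    }
}
```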

Additional information

Since this logic is potentially going to be shared across many components (if we ever build out more search indexes), I tried to document it pretty thoroughly, with links to a lot of the relevant documentation.

@chouinar chouinar requested a review from acouch May 28, 2024 15:58
@chouinar chouinar requested a review from jamesbursa as a code owner May 28, 2024 15:58
# which might be fine generally, but we work with a lot of acronyms
# and should verify that doesn't cause any issues.
# see: https://opensearch.org/docs/latest/analyzers/token-filters/index/
"filter": {"custom_stemmer": {"type": "snowball", "name": "english"}},
Member

This looks like a good default to start with. At some point there should be a content strategy or the like for the indexing options. For acronyms we can also create a huge list of synonyms.

Collaborator Author

Yeah, the exact configuration we'll want is likely something we'll sort out with testing. I just got this one from some basic testing of opportunity titles; I realized I needed something more than the default when my tests returned different results for King and Kings.

"""

"""
The hits object looks like:
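For context, a search response generally has this shape (illustrative values, not the diff's actual example):

```python
example_response = {
    "took": 5,           # query time in milliseconds
    "timed_out": False,
    "hits": {
        "total": {"value": 2, "relation": "eq"},  # total matching documents
        "max_score": 1.2,
        "hits": [
            {
                "_index": "opportunity-index",
                "_id": "42",
                "_score": 1.2,
                "_source": {"opportunity_title": "Example research grant"},
            },
        ],
    },
}
```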
Member

Appreciate the extra examples here.

Collaborator Author

They help with understanding the parsing logic. I put these examples in first when implementing so I knew what I was working with.

Member

@acouch acouch left a comment

Looks great. Curious why you chose the simple query vs the query string query, but I ran out of time to dig further. Can discuss in Slack or whenevs.

@chouinar
Collaborator Author

> Looks great. Curious why you chose the simple query vs the query string query, but I ran out of time to dig further. Can discuss in Slack or whenevs.

Based on the recommendation in OpenSearch's docs: https://opensearch.org/docs/latest/query-dsl/full-text/query-string/

> Query string query has a strict syntax and returns an error in case of invalid syntax. Therefore, it does not work well for search box applications.
> For a less strict alternative, consider using simple_query_string.
> If you don't need query syntax support, use the match query.

The thing is, because the type of query is a query param itself, we could always support more: the default being the simple version, but maybe allowing a user to select an "advanced query format" in the UI.
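Since the query type is just a key in the request body, the swap is tiny. A hypothetical sketch:

```python
from typing import Any


def text_query(text: str, fields: list[str], advanced: bool = False) -> dict[str, Any]:
    # simple_query_string tolerates invalid syntax; query_string rejects it,
    # which suits an opt-in "advanced query format" mode.
    query_type = "query_string" if advanced else "simple_query_string"
    return {"query": {query_type: {"query": text, "fields": fields}}}
```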

@chouinar chouinar merged commit 4ed48da into main Jun 25, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/14-req-resp-tools branch June 25, 2024 14:15
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Development

Successfully merging this pull request may close these issues.

[Task]: Setup utility for converting our requests into Opensearch requests, and parsing the responses