This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[Issue #14] Setup utils for creating requests and parsing responses from search #54

Merged
merged 12 commits into main from chouinar/14-req-resp-tools on Jun 25, 2024

Conversation

chouinar
Collaborator

Summary

Fixes #14

Time to review: 10 mins

Changes proposed

Created utilities for building requests to OpenSearch and parsing the responses into more manageable objects.

Added logic for configuring index creation to use a different tokenizer + stemmer, as the defaults aren't great.

Context for reviewers

The search queries we need to create aren't that complex, but they're pretty large, deeply nested objects. To help with this, I've built a few generic utilities that construct a request via a builder: you pass in each component of the search request, and it assembles the full query. This way, when the API gets built out next, the search logic is really just taking our requests and passing the details to the factories, which is pretty trivial.
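A minimal sketch of the idea, assuming a builder along these lines (the class and method names are hypothetical, not this PR's actual API):

```python
from typing import Any


class SearchRequestBuilder:
    """Accumulates the pieces of a search and assembles the nested request body."""

    def __init__(self) -> None:
        self._query: dict[str, Any] | None = None
        self._filters: list[dict[str, Any]] = []
        self._sort: list[dict[str, Any]] = []
        self._size = 25
        self._offset = 0

    def simple_query(self, text: str, fields: list[str]) -> "SearchRequestBuilder":
        self._query = {"simple_query_string": {"query": text, "fields": fields}}
        return self

    def filter_terms(self, field: str, values: list[str]) -> "SearchRequestBuilder":
        self._filters.append({"terms": {field: values}})
        return self

    def sort_by(self, field: str, descending: bool = True) -> "SearchRequestBuilder":
        self._sort.append({field: {"order": "desc" if descending else "asc"}})
        return self

    def paginate(self, size: int, offset: int) -> "SearchRequestBuilder":
        self._size, self._offset = size, offset
        return self

    def build(self) -> dict[str, Any]:
        # Only wrap things in a bool query when there is something to combine.
        body: dict[str, Any] = {"size": self._size, "from": self._offset}
        clauses: dict[str, Any] = {}
        if self._query is not None:
            clauses["must"] = [self._query]
        if self._filters:
            clauses["filter"] = self._filters
        if clauses:
            body["query"] = {"bool": clauses}
        if self._sort:
            body["sort"] = self._sort
        return body


# The API layer then just forwards user-supplied params into the builder:
request_body = (
    SearchRequestBuilder()
    .simple_query("research grant", ["opportunity_title^2", "summary"])
    .filter_terms("agency", ["HHS"])
    .sort_by("post_date")
    .paginate(size=25, offset=0)
    .build()
)
```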

Responses are a bit less complex, just very nested, and adding a simple wrapper around them means any usage of the search client needs to do a bit less indexing into dictionaries (e.g., to access the response objects I was doing `values = [record["_source"] for record in response["hits"]["hits"]]`, which is fun). That logic just safely handles parsing the responses in a very generic manner.
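The wrapper can be as small as a dataclass that does that dictionary-indexing once, up front (again a sketch; the names are illustrative):

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SearchResponse:
    total_records: int             # hits.total.value from the raw response
    records: list[dict[str, Any]]  # the _source document from each hit
    raw: dict[str, Any]            # keep the full response for anything else

    @classmethod
    def from_opensearch(cls, response: dict[str, Any]) -> "SearchResponse":
        hits = response.get("hits", {})
        return cls(
            total_records=hits.get("total", {}).get("value", 0),
            records=[hit["_source"] for hit in hits.get("hits", [])],
            raw=response,
        )
```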

Note that in both cases there are many things we don't handle yet (a ton of other request params, for example); we can add those as needed. I just focused on the ones we need right now.


One unrelated change made it in here: adjusting how analysis is done on an index. In short, the default tokenization/stemming of words was clunky for our use case. For example, the default stemmer treats `king` and `kings` as separate words. I adjusted the stemmer to use the [snowball stemmer](https://snowballstem.org/), which seems to work a bit better, although we should definitely investigate this further. I also changed the tokenization to split on whitespace; before, it would also split on dashes, and a lot of terms in our system have dashes (opportunity number and agency pretty prominently).
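Roughly, the index settings end up looking like this (the `custom_stemmer` filter matches the diff excerpt below; the analyzer wiring around it is an illustrative reconstruction):

```python
# Analysis settings: whitespace tokenizer + snowball stemmer.
ANALYSIS_SETTINGS = {
    "analysis": {
        "analyzer": {
            # Naming the analyzer "default" applies it to all text fields.
            "default": {
                "type": "custom",
                # Split only on whitespace so dashed terms (opportunity
                # numbers, agency codes) survive as a single token.
                "tokenizer": "whitespace",
                "filter": ["lowercase", "custom_stemmer"],
            }
        },
        # Snowball (english) stems "kings" -> "king" so both match.
        "filter": {"custom_stemmer": {"type": "snowball", "name": "english"}},
    }
}
```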

Additional information

Since this logic is potentially going to be shared across many components (if we ever build out more search indexes), I tried to document it pretty thoroughly, with links to a lot of the relevant documentation.

@chouinar chouinar requested a review from acouch May 28, 2024 15:58
@chouinar chouinar requested a review from jamesbursa as a code owner May 28, 2024 15:58
# which might be fine generally, but we work with a lot of acronyms
# and should verify that doesn't cause any issues.
# see: https://opensearch.org/docs/latest/analyzers/token-filters/index/
"filter": {"custom_stemmer": {"type": "snowball", "name": "english"}},
Member

This looks like a good default to start with. At some point there should be a content strategy or the like for the indexing options. For acronyms we can also create a huge list of synonyms.

Collaborator Author

Yeah, the exact configuration we'll want is likely something we'll sort out with testing. I just got this one from some basic testing of opportunity titles; I realized I needed something more than the default when my tests returned different results for King and Kings.

"""

"""
The hits object looks like:
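For context, a search response generally has this shape (illustrative values, not the diff's actual example):

```python
example_response = {
    "took": 5,           # query time in milliseconds
    "timed_out": False,
    "hits": {
        "total": {"value": 2, "relation": "eq"},  # total matching documents
        "max_score": 1.2,
        "hits": [
            {
                "_index": "opportunity-index",
                "_id": "42",
                "_score": 1.2,
                "_source": {"opportunity_title": "Example research grant"},
            },
        ],
    },
}
```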
Member

Appreciate the extra examples here.

Collaborator Author

They help with understanding the parsing logic. I put these examples in first when implementing so I knew what I was working with.

Member

@acouch acouch left a comment

Looks great. Curious why you chose the simple query vs the query string query, but I ran out of time to dig further. Can discuss in Slack or whenevs.

@chouinar
Collaborator Author

> Looks great. Curious why you chose the simple query vs the query string query, but I ran out of time to dig further. Can discuss in Slack or whenevs.

Based on the recommendation in OpenSearch's docs: https://opensearch.org/docs/latest/query-dsl/full-text/query-string/

> Query string query has a strict syntax and returns an error in case of invalid syntax. Therefore, it does not work well for search box applications.
> For a less strict alternative, consider using simple_query_string.
> If you don't need query syntax support, use the match query.

The thing is, because the type of query is a query param itself, we could always support more: the default being the simple version, but maybe allowing a user to select an "advanced query format" in the UI.
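Since the query type is just a key in the request body, the swap is tiny. A hypothetical sketch:

```python
from typing import Any


def text_query(text: str, fields: list[str], advanced: bool = False) -> dict[str, Any]:
    # simple_query_string tolerates invalid syntax; query_string rejects it,
    # which suits an opt-in "advanced query format" mode.
    query_type = "query_string" if advanced else "simple_query_string"
    return {"query": {query_type: {"query": text, "fields": fields}}}
```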

@chouinar chouinar merged commit 4ed48da into main Jun 25, 2024
8 checks passed
@chouinar chouinar deleted the chouinar/14-req-resp-tools branch June 25, 2024 14:15
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit that referenced this pull request Sep 18, 2024
acouch pushed a commit to HHS/simpler-grants-gov that referenced this pull request Sep 18, 2024
Development

Successfully merging this pull request may close these issues.

[Task]: Setup utility for converting our requests into Opensearch requests, and parsing the responses