feat: ADR for incremental algolia indexing #773

Conversation
First, the existing indexing process begins with executing catalog queries against `search/all` to determine which
courses exist and belong to which catalogs. In order for incremental updates to work, we first need to provide the
opposite semantic and instead be able to determine catalog membership from a given course (rather than courses from a
given catalog). We can make use of the new `apps.catalog.filters` python implementation, which can take a catalog query
and a content metadata record and determine whether the record matches the query.
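Assuming a hypothetical entry point in `apps.catalog.filters` (the name `does_query_match_content` below is illustrative, not taken from the ADR), the reversed lookup might look like this sketch:

```python
# Sketch: determine catalog membership from a single course, rather
# than courses from a given catalog.
from enterprise_catalog.apps.catalog.models import CatalogQuery
from enterprise_catalog.apps.catalog import filters


def catalog_queries_for_content(content_metadata):
    """Return every CatalogQuery whose filter matches this one record."""
    return [
        catalog_query
        for catalog_query in CatalogQuery.objects.all()
        if filters.does_query_match_content(  # hypothetical name
            catalog_query.content_filter,
            content_metadata.json_metadata,
        )
    ]
```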
[question] How will this process be aware of courses that exist upstream in course-discovery, but don't yet exist in enterprise-catalog (e.g., a newly added course)? Based on this, it would seem enterprise-catalog would need to be aware of all course metadata from course-discovery, even if it's not tied to any enterprise catalogs?
Realizing that this is a comment on an older iteration of this ADR; however, to answer your question: the catalog service would be aware of new content both by hooking up to post-save signals and by querying all content. For each piece of content it has been notified about, the service will run its own query association filtering logic after content ingestion.
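A minimal sketch of that post-save wiring, assuming a hypothetical per-record celery task named `index_content`:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from enterprise_catalog.apps.catalog.models import ContentMetadata
from .tasks import index_content  # hypothetical per-record task


@receiver(post_save, sender=ContentMetadata)
def enqueue_incremental_index(sender, instance, **kwargs):
    # Defer the Algolia work to celery so the save stays fast; the task
    # re-runs query association filtering for just this record.
    index_content.delay(instance.content_key)
```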
lgtm
For clarity's sake: @johnnagro started this work before he left 2U, and I'm taking up the task of finishing and publishing it.
Thanks for the pull request, @johnnagro! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. Once you've signed the CLA, please allow 1 business day for it to be processed. After this time, you can re-run the CLA check by adding a comment here that you have signed it. If the problem persists, you can tag the
halfway done
- Support all current metadata types, though not necessarily all of them on day 1
- Support multiple methods of triggering: event bus, on-demand from django admin, on a schedule, from the existing
  update_content_metadata job, etc.
- Invocation of the new indexing process should not rely on separate processes run synchronously beforehand.
perhaps s/synchronously/serially ?
the content's metadata (`modified_at`) must be bumped from what's previously stored. Secondly, the content must have
associations with queries within the service.
If the content must have associations with queries in order to kick off an indexing task, what happens when the content had associations before, but then those associations were removed? The end result is that there are no associations, but we still need to kick off an indexing task to de-index the content, right?
That's a good point - we would want to kick off the process for a piece of content should it lose/gain any number of associated queries. We need to run the individual indexing task of a course IFF:
1. The content metadata changes in any way
2. Any association between the content and a customer catalog is removed
3. Any association between the content and a customer catalog is created

We will need to make sure this is done and evaluated once we go to index; a sketch of the predicate follows.
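A sketch of that predicate with hypothetical names; the key detail is that an emptied association set still triggers the task, so the record can be de-indexed rather than skipped:

```python
def should_run_indexing_task(metadata_changed, old_query_ids, new_query_ids):
    """
    Run the per-course indexing task IFF the metadata changed or any
    catalog query association was created or removed.
    """
    return metadata_changed or old_query_ids != new_query_ids


# All associations removed -> still True, so the task runs and can
# de-index the record instead of skipping it.
assert should_run_indexing_task(False, {1, 2}, set())
```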
Agreed, somehow this should be represented in the ADR, since right now it calls for pseudocode which only kicks off indexing for a course IFF the content has associations.
Incremental updates, through the act of saving individual records, will need to be triggered by something - such as
polling of updated content from Course Discovery, consumption of event-bus events, and/or triggering based on a nightly
Course Discovery crawl or a Django Admin button. However, it is not the responsibility of the indexer, nor of this ADR,
to determine when those events should occur; in fact, the indexing process should be able to handle updates from any
source of content metadata record updates.
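One way to read that requirement is a single idempotent celery task that every trigger funnels into; a sketch, with `index_content` again being a hypothetical name:

```python
from celery import shared_task


@shared_task
def index_content(content_key):
    """Recompute associations for one record and push it to Algolia."""
    ...


# Every trigger reduces to enqueueing the same task.
def handle_event_bus_message(message):
    # Event-bus consumer; the message shape is assumed.
    index_content.delay(message['content_key'])


def reindex_selected(modeladmin, request, queryset):
    # Django admin action over selected ContentMetadata rows.
    for record in queryset:
        index_content.delay(record.content_key)
```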
In a previous paragraph, a solution utilizing a ContentMetadata post_save() hook to trigger a celery task was proposed. Is that a valid solution for triggering incremental index updates? If so, why is it not listed in this paragraph as a solution? Likewise, why aren't solutions in this paragraph listed in the above paragraph?
If the two paragraphs are duplicative, I recommend consolidating them into one.
all done!
contributing factors to the long run time of the ``update_content_metadata`` task. Additionally, housing
our own filtering logic will allow us to maintain and tweak/improve the functionality should we want additional
features.
Are there any concerns about local filtering logic in enterprise-catalog (apps.catalog.filters) diverging from how course-discovery does it? How do we keep two black boxes in sync? Do we even need to?
I would argue that the filters out of the gate should match our discovery counterpart, and that we would need rigorous tests to ensure our in-house filters result in the same subsets of content. From there, however, one of the benefits is that we get control of how the filters are administered and can change their behavior to fit our needs and odd situations. No more need to go ask the phoenix team about one-off odd behaviors; instead we can just adjust it ourselves ¯\_(ツ)_/¯
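A sketch of such a parity test, assuming a hypothetical helper that runs the same catalog query against course-discovery's `search/all` endpoint, plus the hypothetical filter entry point from the earlier sketch:

```python
from enterprise_catalog.apps.catalog.models import ContentMetadata
from enterprise_catalog.apps.catalog import filters


def test_local_filters_match_discovery(catalog_query):
    # Hypothetical helper: ask course-discovery which keys match.
    discovery_keys = set(
        fetch_matching_keys_from_search_all(catalog_query.content_filter)
    )
    # Run the in-house filter over every stored record.
    local_keys = {
        record.content_key
        for record in ContentMetadata.objects.all()
        if filters.does_query_match_content(
            catalog_query.content_filter, record.json_metadata,
        )
    }
    assert local_keys == discovery_keys
```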
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also provide us with more flexibility for including non-course-discovery content in catalogs, because we will
no longer rely on a query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in
the catalog service, regardless of their source.
We'll also have to do something when catalog queries are created or edited, so that the search index is updated to reflect any catalog <-> metadata relationships that are created/updated due to those queries being changed.
This is an excellent point - in a similar vein to tying content record updates to a catalog query's `contentmetadata_set` being updated, we'd probably need to tie a query being updated to a process of recalculating its `contentmetadata_set` and then kicking off incremental indexing processes for all affected content records.
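A sketch of that flow, reusing the hypothetical names from the earlier sketches: recompute the query's `contentmetadata_set` and enqueue indexing for the symmetric difference of the old and new sets:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from enterprise_catalog.apps.catalog.models import CatalogQuery, ContentMetadata
from enterprise_catalog.apps.catalog import filters


@receiver(post_save, sender=CatalogQuery)
def reindex_on_query_change(sender, instance, **kwargs):
    old_keys = set(
        instance.contentmetadata_set.values_list('content_key', flat=True)
    )
    new_keys = {
        record.content_key
        for record in ContentMetadata.objects.all()
        if filters.does_query_match_content(  # hypothetical name
            instance.content_filter, record.json_metadata,
        )
    }
    # Records that entered or left the set are exactly the ones whose
    # index entries are now stale.
    for content_key in old_keys ^ new_keys:
        index_content.delay(content_key)  # hypothetical task
```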
``update_content_metadata`` tasks and can eventually replace old infrastructure. The first method will be a bulk
job similar to the current ``update_content_metadata`` task to query external sources of content and update any records
should they mismatch, using `apps.catalog.filters` to determine the query-content association sets. And second, an event
A bulk job like this, though, means that you're going to run your filter functions in proportion to (|# of queries| x |# of content metadata records|), which means you're going to run them tens of millions of times if we have thousands of queries and tens of thousands of content records.
signal receiver which will process any individual content update events that are received. The intention is for the
majority of updates in the catalog service to happen at the moment the content is updated in its external source and
the signal is fired, with the bulk job later cleaning up and verifying should something go wrong.
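A sketch of that bulk reconciliation pass under the same assumed names (the `catalog_queries` reverse relation, `catalog_queries_for_content`, and `index_content` are all assumptions from the earlier sketches); as the review comment above notes, the inner membership check runs once per (query, record) pair:

```python
from celery import shared_task

from enterprise_catalog.apps.catalog.models import ContentMetadata


@shared_task
def bulk_reconcile_index():
    for record in ContentMetadata.objects.iterator():
        stored_ids = set(record.catalog_queries.values_list('id', flat=True))
        computed_ids = {
            query.id for query in catalog_queries_for_content(record)
        }
        # Re-index only on mismatch, keeping the bulk job a cheap
        # verification/cleanup pass behind the signal receiver.
        if stored_ids != computed_ids:
            index_content.delay(record.content_key)
```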
It's a good idea, but keep in mind: to keep content/catalog relationships up-to-date, an update of a single metadata record means we'll have to run the filter logic against every catalog query, because a change to content metadata could mean that a query should now include the content, or that the query should now not include the content.
Description
A proposed ADR for incremental indexing of content metadata and catalogs into Algolia.