feat: ADR for incremental algolia indexing
1 parent 8c5b50d · commit b1a0d0e
Showing 2 changed files with 102 additions and 0 deletions.
Incremental Algolia Indexing
============================

Status
------
Draft

Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and Course Catalog
database. This index is rebuilt at least nightly from a compendium of content records, resulting in a wholesale
replacement of the prior Algolia index. The job is time consuming and memory intensive. It also relies heavily on
separate but required processes responsible for retrieving filtered subsets of content from external sources of
truth, primarily Course Discovery, where synchronous tasks must be run regularly and in a specific order. The result
is a brittle system: each run is either entirely successful or entirely unsuccessful.

Solution Approach
-----------------
The goals include:

- Implement new tasks that run alongside and augment the existing indexer until we are able to cut over entirely.
- Support all current metadata types, though not necessarily all of them on day one.
- Support multiple methods of triggering: event bus, on-demand from Django admin, on a schedule, from the existing
  update_content_metadata job, etc.
- Ensure invocation of the new indexing process is not reliant on separate processes run synchronously beforehand.
- Achieve a higher parallelization factor, i.e. one content item per celery task worker, with no task group
  coordination required (see the sketch after this list).
- Provide a content-oriented method of determining content catalog membership that is not reliant on external
  services.
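
As a rough sketch of the parallelization goal, each content item maps to exactly one celery task, with no group or
chord coordination; the task and helper names here are hypothetical, not the service's actual API.

.. code-block:: python

    # Illustrative sketch only: one content item per celery task worker.
    # Task and helper names are hypothetical, not the service's real API.
    from celery import shared_task


    @shared_task
    def index_content(content_key):
        """Index a single piece of content into Algolia, independent of any bulk job."""
        # ... build and push the Algolia object-shards for this one item ...


    def enqueue_incremental_indexing(content_keys):
        # One task per content item; a failure affects (and retries) only
        # that one item, with no group/chord coordination required.
        for content_key in content_keys:
            index_content.delay(content_key)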

Decision
--------
We want to follow updates to content with individual, incremental updates to Algolia. To do this, we will both create
new functionality and reuse some existing functionality of our Algolia indexing infrastructure.

----------------------------------

First, the existing indexing process begins by executing catalog queries against `search/all` to determine which
courses exist and which catalogs they belong to. For incremental updates to work, we first need to provide the
opposite semantic: the ability to determine catalog membership from a given course (rather than courses from a given
catalog). We can make use of the new `apps.catalog.filters` python implementation, which can take a catalog query and
a piece of content metadata and determine whether the content matches the query (without the use of course
discovery).

----------------------------------
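
A minimal sketch of that reversed semantic, assuming ``apps.catalog.filters`` exposes a matcher roughly shaped like
``does_query_match_content(content_filter, json_metadata)``; the exact function name, signature, and model fields
below are assumptions for illustration.

.. code-block:: python

    # Sketch: determine catalog membership *from* a piece of content.
    # The matcher name/signature and model fields are assumptions.
    from enterprise_catalog.apps.catalog import filters
    from enterprise_catalog.apps.catalog.models import CatalogQuery


    def catalog_queries_matching_content(content_metadata):
        """Return every catalog query whose filter matches this content record."""
        return [
            catalog_query
            for catalog_query in CatalogQuery.objects.all()
            if filters.does_query_match_content(
                catalog_query.content_filter,    # the query's filter definition
                content_metadata.json_metadata,  # the content record being tested
            )
        ]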

Second, we must address the way in which, and the moments when, we choose to invoke the indexing process. Previously,
the bulk indexing logic relied on a completely separate task completing synchronously: in order to bulk index,
content records needed to be bulk updated. The update_content_metadata job's purpose is twofold: first, to ingest
content metadata from external service providers and standardize its format and enterprise representation; and
second, to build associations between those metadata records and customer catalogs by way of catalog query inclusion.
Once this information is entirely read and saved within the catalog service, the system is ready to snapshot the
state of content in the form of Algolia objects and entirely rebuild and replace our Algolia index.

This "first A, then B" approach to wholesale rebuilding our indices is time and resource intensive, as well as
brittle and prone to outages. The system is also slow to fix should a partial or full failure occur, because
everything must be rerun in a specific order.

To remediate these symptoms, content records will be indexed on an individual object-shard/content metadata object
basis, at the moment a record is saved to the ContentMetadata table. Tying the indexing process to the model's
``post_save()`` signal decouples the task from any other time consuming bulk job. To combat redundant or unneeded
requests, the record will be evaluated on two levels before an indexing task is kicked off: first, the content's
metadata timestamp (modified_at) must be bumped from what was previously stored; second, the content must have
associations with catalog queries within the service.
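
A minimal sketch of those two guards wired to the ``post_save`` signal; the field, helper, and task names are
assumptions layered on standard Django signal machinery.

.. code-block:: python

    # Sketch: evaluate both guard conditions before kicking off an indexing
    # task. Field, helper, and task names are illustrative assumptions.
    from django.db.models.signals import post_save
    from django.dispatch import receiver

    from enterprise_catalog.apps.catalog.models import ContentMetadata


    def modified_at_was_bumped(instance):
        """Hypothetical guard: is the incoming modified_at newer than what was stored?"""
        previous = getattr(instance, '_previous_modified_at', None)
        return previous is None or instance.modified_at > previous


    @receiver(post_save, sender=ContentMetadata)
    def trigger_incremental_indexing(sender, instance, **kwargs):
        if not modified_at_was_bumped(instance):
            return  # metadata unchanged; skip redundant indexing
        if not instance.catalog_queries.exists():
            return  # content is associated with no catalog queries
        # index_content is the hypothetical per-item celery task sketched
        # under "Solution Approach".
        index_content.delay(instance.content_key)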

In order to incrementally update the Algolia index, we need to introduce the ability to replace individual
object-shard documents in the index (today we replace the whole index). This can be implemented by creating methods
to determine which Algolia object-shards exist for a piece of content. Once we have the relevant IDs, we can
determine whether a create, update, or delete is required, and we can repurpose the existing processes that bulk
construct our Algolia objects, applied on an individual basis. For simplicity's sake, an update will likely be a
delete followed by the creation of new objects.
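
A hedged sketch of that delete-then-create flow, using standard Algolia Python client calls; the shard-lookup and
object-construction helpers stand in for existing (and to-be-written) internals.

.. code-block:: python

    # Sketch: replace only the object-shards that belong to one content item.
    # The helper functions are hypothetical stand-ins; the client calls are
    # the standard algoliasearch (v2-style) Python API.
    from algoliasearch.search_client import SearchClient

    client = SearchClient.create('APP_ID', 'API_KEY')
    index = client.init_index('enterprise-catalog')


    def reindex_single_content(content_metadata):
        # Determine which object-shard documents currently exist for this
        # content (hypothetical lookup helper).
        existing_ids = find_shard_object_ids(index, content_metadata.content_key)
        # For simplicity's sake, an update is a delete followed by a create.
        index.delete_objects(existing_ids)
        # Repurpose the existing bulk object-construction logic, applied to a
        # single record (hypothetical helper name).
        new_objects = build_algolia_objects([content_metadata])
        index.save_objects(new_objects)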

Incremental updates, through the act of saving individual records, will need to be triggered by something, such as
polling Course Discovery for updated content, consuming event-bus events, and/or a nightly Course Discovery crawl or
a Django Admin button. However, it is not the responsibility of the indexer, nor of this ADR, to determine when those
events should occur; the indexing process should be able to handle any source of content metadata record updates.
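
For illustration, every trigger source reduces to the same contract of saving a record. Assuming a payload shape
like the one below, a consumer could be as small as:

.. code-block:: python

    # Sketch: an event-bus consumer (or poller, or admin action) only needs
    # to upsert the record; post_save drives the indexing from there.
    # The payload shape and field names are assumptions.
    from enterprise_catalog.apps.catalog.models import ContentMetadata


    def handle_content_updated(payload):
        ContentMetadata.objects.update_or_create(
            content_key=payload['content_key'],
            defaults={'json_metadata': payload['metadata']},
        )
        # Intentionally no direct call into the indexer: saving the record
        # is the whole contract between trigger sources and the indexer.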

Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also give us more flexibility to include non-course-discovery content in catalogs, because we will no longer rely on
a query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in the catalog
service, regardless of their source.

Alternatives Considered
-----------------------
No alternatives were considered.

docs/decisions/0010-incremental-content-metadata-updating.rst (14 additions, 0 deletions)

Incremental Content Metadata Updating
=====================================

Status
------
Draft

Context
-------
The Enterprise Catalog Service implicitly relies on external services as the sources of truth for content surfaced
to organizations within the suite of enterprise offerings. For the most part, this external source of truth has been
assumed to be the course-discovery service.