feat: ADR for incremental algolia indexing #773

Conversation
First, the existing indexing process begins with executing catalog queries against `search/all` to determine which
courses exist and belong to which catalogs. In order for incremental updates to work, we first need to provide the
opposite semantic and instead be able to determine catalog membership from a given course (rather than courses from a
given catalog). We can make use of the new `apps.catalog.filters` python implementation, which can take a catalog query
and a content metadata record and determine whether the record matches the query.
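Assuming a hypothetical entry point in `apps.catalog.filters` (the name `does_query_match_content` below is illustrative, not taken from the ADR), the reversed lookup might look like this sketch:

```python
# Sketch: determine catalog membership from a single course, rather
# than courses from a given catalog.
from enterprise_catalog.apps.catalog.models import CatalogQuery
from enterprise_catalog.apps.catalog import filters


def catalog_queries_for_content(content_metadata):
    """Return every CatalogQuery whose filter matches this one record."""
    return [
        catalog_query
        for catalog_query in CatalogQuery.objects.all()
        if filters.does_query_match_content(  # hypothetical name
            catalog_query.content_filter,
            content_metadata.json_metadata,
        )
    ]
```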
[question] How will this process be aware of courses that exist upstream in course-discovery, but don't yet exist in enterprise-catalog (e.g., a newly added course)? Based on this, it would seem enterprise-catalog would need to be aware of all course metadata from course-discovery, even if it's not tied to any enterprise catalogs?
Realizing that this is a comment on an older iteration of this ADR; however, to answer your question: the catalog service would be aware of new content both by hooking up to post-save signals and by querying all content. For each piece of content it has been notified about, the service will run its own query association filtering logic after content ingestion.
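A minimal sketch of that post-save wiring, assuming a hypothetical per-record celery task named `index_content`:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from enterprise_catalog.apps.catalog.models import ContentMetadata
from .tasks import index_content  # hypothetical per-record task


@receiver(post_save, sender=ContentMetadata)
def enqueue_incremental_index(sender, instance, **kwargs):
    # Defer the Algolia work to celery so the save stays fast; the task
    # re-runs query association filtering for just this record.
    index_content.delay(instance.content_key)
```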
lgtm
For clarity's sake: @johnnagro started this work before he left 2U, and I'm taking up the task of finishing and publishing it.
Thanks for the pull request, @johnnagro! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. Once you've signed the CLA, please allow 1 business day for it to be processed. After this time, you can re-run the CLA check by adding a comment here that you have signed it. If the problem persists, you can tag the
halfway done
- Support all current metadata types, though not necessarily all of them on day 1
- Support multiple methods of triggering: event bus, on-demand from django admin, on a schedule, from the existing
  update_content_metadata job, etc.
- Invocation of the new indexing process should not rely on separate processes run synchronously beforehand.
perhaps s/synchronously/serially ?
the content's metadata (`modified_at`) must be bumped from what's previously stored. Secondly, the content must have
associations with queries within the service.
If the content must have associations with queries in order to kick off an indexing task, what happens when the content had associations before, but then those associations were removed? The end result is that there are no associations, but we still need to kick off an indexing task to de-index the content, right?
That's a good point - we would want to kick off the process for a piece of content should it lose/gain any number of associated queries. We need to run the individual indexing task of a course IFF:
1. The content metadata changes in any way
2. Any association between the content and a customer catalog is removed
3. Any association between the content and a customer catalog is created

We will need to make sure this is done and evaluated once we go to index; a sketch of the predicate follows.
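A sketch of that predicate with hypothetical names; the key detail is that an emptied association set still triggers the task, so the record can be de-indexed rather than skipped:

```python
def should_run_indexing_task(metadata_changed, old_query_ids, new_query_ids):
    """
    Run the per-course indexing task IFF the metadata changed or any
    catalog query association was created or removed.
    """
    return metadata_changed or old_query_ids != new_query_ids


# All associations removed -> still True, so the task runs and can
# de-index the record instead of skipping it.
assert should_run_indexing_task(False, {1, 2}, set())
```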
Agreed, somehow this should be represented in the ADR, since right now it calls for pseudocode which only kicks off indexing for a course IFF the content has associations.
Incremental updates, through the act of saving individual records, will need to be triggered by something - such as
polling of updated content from Course Discovery, consumption of event-bus events, and/or triggering based on a nightly
Course Discovery crawl or a Django Admin button. However, it is not the responsibility of the indexer, nor of this ADR,
to determine when those events should occur; in fact, the indexing process should be able to handle updates from any
source of content metadata record updates.
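One way to read that requirement is a single idempotent celery task that every trigger funnels into; a sketch, with `index_content` again being a hypothetical name:

```python
from celery import shared_task


@shared_task
def index_content(content_key):
    """Recompute associations for one record and push it to Algolia."""
    ...


# Every trigger reduces to enqueueing the same task.
def handle_event_bus_message(message):
    # Event-bus consumer; the message shape is assumed.
    index_content.delay(message['content_key'])


def reindex_selected(modeladmin, request, queryset):
    # Django admin action over selected ContentMetadata rows.
    for record in queryset:
        index_content.delay(record.content_key)
```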
In a previous paragraph, a solution utilizing a ContentMetadata post_save() hook to trigger a celery task was proposed. Is that a valid solution for triggering incremental index updates? If so, why is it not listed in this paragraph as a solution? Likewise, why aren't solutions in this paragraph listed in the above paragraph?
If the two paragraphs are duplicative, I recommend consolidating them into one.
all done!
contributing factors to the long run time of the ``update_content_metadata`` task. Additionally, housing
our own filtering logic will allow us to maintain and tweak/improve the functionality should we want additional
features.
Are there any concerns about local filtering logic in enterprise-catalog (apps.catalog.filters) diverging from how course-discovery does it? How do we keep two black boxes in sync? Do we even need to?
I would argue that the filters out of the gate should match our discovery counterpart, and that we would need rigorous tests to ensure our in-house filters result in the same subsets of content. From there, however, one of the benefits is that we get control of how the filters are administered and can change their behavior to fit our needs and odd situations. No more need to go ask the phoenix team about one-off odd behaviors; instead we can just adjust it ourselves ¯\_(ツ)_/¯
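A sketch of such a parity test, assuming a hypothetical helper that runs the same catalog query against course-discovery's `search/all` endpoint, plus the hypothetical filter entry point from the earlier sketch:

```python
from enterprise_catalog.apps.catalog.models import ContentMetadata
from enterprise_catalog.apps.catalog import filters


def test_local_filters_match_discovery(catalog_query):
    # Hypothetical helper: ask course-discovery which keys match.
    discovery_keys = set(
        fetch_matching_keys_from_search_all(catalog_query.content_filter)
    )
    # Run the in-house filter over every stored record.
    local_keys = {
        record.content_key
        for record in ContentMetadata.objects.all()
        if filters.does_query_match_content(
            catalog_query.content_filter, record.json_metadata,
        )
    }
    assert local_keys == discovery_keys
```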
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer resources. It will
also provide us with more flexibility for including non-course-discovery content in catalogs, because we will
no longer rely on a query to course-discovery's `search/all` endpoint and will instead rely on the metadata records in
the catalog service, regardless of their source.
We'll also have to do something when catalog queries are created or edited, so that the search index is updated to reflect any catalog <-> metadata relationships that are created/updated due to those queries being changed.
This is an excellent point - in a similar vein to tying content record updates to a catalog query's `contentmetadata_set` being updated, we'd probably need to tie a query being updated to a process of recalculating its `contentmetadata_set` and then kicking off incremental indexing processes for all affected content records.
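A sketch of that flow, reusing the hypothetical names from the earlier sketches: recompute the query's `contentmetadata_set` and enqueue indexing for the symmetric difference of the old and new sets:

```python
from django.db.models.signals import post_save
from django.dispatch import receiver

from enterprise_catalog.apps.catalog.models import CatalogQuery, ContentMetadata
from enterprise_catalog.apps.catalog import filters


@receiver(post_save, sender=CatalogQuery)
def reindex_on_query_change(sender, instance, **kwargs):
    old_keys = set(
        instance.contentmetadata_set.values_list('content_key', flat=True)
    )
    new_keys = {
        record.content_key
        for record in ContentMetadata.objects.all()
        if filters.does_query_match_content(  # hypothetical name
            instance.content_filter, record.json_metadata,
        )
    }
    # Records that entered or left the set are exactly the ones whose
    # index entries are now stale.
    for content_key in old_keys ^ new_keys:
        index_content.delay(content_key)  # hypothetical task
```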
``update_content_metadata`` tasks and can eventually replace old infrastructure. The first method will be a bulk
job similar to the current ``update_content_metadata`` task to query external sources of content and update any records
should they mismatch, using `apps.catalog.filters` to determine the query-content association sets. And second, an event
A bulk job like this, though, means that you're going to run your filter functions in proportion to (|# of queries| x |# of content metadata records|), which means you're going to run them tens of millions of times if we have thousands of queries and tens of thousands of content records.
signal receiver which will process any individual content update events that are received. The intention is for the
majority of updates in the catalog service to happen at the moment the content is updated in its external source and
the signal is fired, with the bulk job later cleaning up and verifying should something go wrong.
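A sketch of that bulk reconciliation pass under the same assumed names (the `catalog_queries` reverse relation, `catalog_queries_for_content`, and `index_content` are all assumptions from the earlier sketches); as the review comment above notes, the inner membership check runs once per (query, record) pair:

```python
from celery import shared_task

from enterprise_catalog.apps.catalog.models import ContentMetadata


@shared_task
def bulk_reconcile_index():
    for record in ContentMetadata.objects.iterator():
        stored_ids = set(record.catalog_queries.values_list('id', flat=True))
        computed_ids = {
            query.id for query in catalog_queries_for_content(record)
        }
        # Re-index only on mismatch, keeping the bulk job a cheap
        # verification/cleanup pass behind the signal receiver.
        if stored_ids != computed_ids:
            index_content.delay(record.content_key)
```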
It's a good idea, but keep in mind: to keep content/catalog relationships up-to-date, an update of a single metadata record means we'll have to run the filter logic against every catalog query, because a change to content metadata could mean that a query should now include the content, or that the query should now not include the content.
Description
A proposed ADR for incremental indexing of content metadata and catalogs into Algolia.