-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index Studio content using Meilisearch [FC-0040] #34310
Index Studio content using Meilisearch [FC-0040] #34310
Conversation
Thanks for the pull request, @bradenmacdonald! Please note that it may take us up to several weeks or months to complete a review and merge your PR. Feel free to add as much of the following information to the ticket as you can:
All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here. Please let us know once your PR is ready for our review and all tests are green. |
This PR brings me such joy. 😄 |
2af93c5
to
f57dbfa
Compare
f57dbfa
to
c53963e
Compare
107d814
to
0c43e21
Compare
0c43e21
to
6be622e
Compare
@rpenido I included those changes - thanks! I also added a test for the document format and fixed a couple minor bugs that that uncovered. |
41a88a8
to
69879e0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work here @bradenmacdonald!
I think this is ready for upstream review.
Co-authored-by: Rômulo Penido <romulo@opencraft.com>
Hi @bradenmacdonald! Working with this code today, I realized we are not indexing the root content (course and library), only its blocks. I just want to make sure that this is intended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions, a couple of small requests. I think the ADR looks great.
cms/envs/common.py
Outdated
@@ -1776,6 +1776,9 @@ | |||
'openedx_tagging.core.tagging.apps.TaggingConfig', | |||
'openedx.core.djangoapps.content_tagging', | |||
|
|||
# Search | |||
'openedx.core.djangoapps.content.search.apps.ContentSearchConfig', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Someone (I think maybe @kdmccormick) pointed out to me that you don't need to explicitly put the app config if there's only one.
integer or a string composed of alphanumeric characters (a-z A-Z 0-9), | ||
hyphens (-) and underscores (_). Since our opaque keys don't meet this | ||
requirement, we transform them to a similar slug ID string that does. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question (not request)]: Would there be any advantage in using PublishableEntity's primary key or UUID for this, once course storage uses Learning Core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I don't really like having to generate this arbitrary ID here. We don't really use it for anything, so it's not a big deal, but it would be nicer to use an ID that's the same as other parts of the system, which would be the case with PublishableEntity's ID/UUID. So I'd happily change this to use that once it's available for courseware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to that effect.
edx-platform/openedx/core/djangoapps/content/search/documents.py
Lines 72 to 73 in ec0c4a8
In the future, with Learning Core's data models in place for courseware, | |
we could use PublishableEntity's primary key / UUID instead. |
""" | ||
# The slugified key _may_ not be unique so we append a hashed string to make it unique: | ||
key_bin = str(usage_key).encode() | ||
suffix = hashlib.sha1(key_bin).hexdigest()[:7] # When we use Python 3.9+, should add usedforsecurity=False here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Optional/nit]: blake2b is just as fast (or slightly faster) than sha1 and lets you choose a digest size.
Values for the 'type' field on each doc in the search index | ||
""" | ||
course_block = "course_block" | ||
library_block = "library_block" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Searching within files and uploads is out of scope, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't thought about Files & Uploads tbh. It's probably a good idea to add them at some point, but definitely not in this PR and probably not for Redwood either.
# Meilisearch primary key. String. | ||
id = "id" | ||
usage_key = "usage_key" | ||
type = "type" # DocType.course_block or DocType.library_block (see below) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: What's the process for adding a field later? For instance, when Units are introduced as a searchable thing, would we run a data migration that adds a field with a default value of "component"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to change any of the index settings:
- which attributes can be used for filtering/faceted search
- which attributes are used for keyword search
- The fact that the "usage_key" is the distinct attribute
...then it's best to do it via reindex_studio
management command, which is the only method I've implemented for doing so.
The Meilisearch docs recommend this approach:
Updating [index settings] will re-index all documents in the index, which can take some time. We recommend updating your index settings first and then adding documents as this reduces RAM consumption.
But otherwise just adding new fields or values can be done any time. If it's something like a new value for the type
field, or a new field that doesn't need to be used for filtering or keyword search, or it's a dictionary value within one of the existing fields (e.g. a new set of keys and values in the content
field from some new XBlock's index_dictionary) - then it can be done anytime, no migration needed, no performance issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted via a comment:
edx-platform/openedx/core/djangoapps/content/search/documents.py
Lines 52 to 54 in ec0c4a8
# Note: new fields or values can be added at any time, but if they need to be indexed for filtering or keyword | |
# search, the index configuration will need to be changed, which is only done as part of the 'reindex_studio' | |
# command (changing those settings on an large active index is not recommended). |
for level in range(4): | ||
new_value = " > ".join(parts[0:level + 2]) | ||
if f"level{level}" not in result: | ||
result[f"level{level}"] = [new_value] | ||
elif new_value not in result[f"level{level}"]: | ||
result[f"level{level}"].append(new_value) | ||
if len(parts) == level + 2: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment explaining why "4", and more generally what's going on in this block of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done:
edx-platform/openedx/core/djangoapps/content/search/documents.py
Lines 178 to 194 in ec0c4a8
# Now we build each level (tags.level0, tags.level1, etc.) as applicable. | |
# We have a hard-coded limit of 4 levels of tags for now (see Fields.tags above). | |
# A tag like "Difficulty: Hard" will only result in one level (tags.level0) | |
# But a tag like "Location: North America > Canada > Vancouver" would result in three levels (tags.level0: | |
# "North America", tags.level1: "North America > Canada", tags.level2: "North America > Canada > Vancouver") | |
# See the comments above on "Field.tags" for an explanation of why we use this format (basically it's the format | |
# required by the Instantsearch frontend). | |
for level in range(4): | |
# We use '>' as a separator because it's the default for the Instantsearch frontend library, and our | |
# preferred separator (\t) used in the database is ignored by Meilisearch since it's whitespace. | |
new_value = " > ".join(parts[0:level + 2]) | |
if f"level{level}" not in result: | |
result[f"level{level}"] = [new_value] | |
elif new_value not in result[f"level{level}"]: | |
result[f"level{level}"].append(new_value) | |
if len(parts) == level + 2: | |
break # We have all the levels for this tag now (e.g. parts=["Difficulty", "Hard"] -> need level0 only) |
block = xblock_api.load_block(metadata.usage_key, user=None) | ||
except Exception as err: # pylint: disable=broad-except | ||
log.exception(f"Failed to load XBlock {metadata.usage_key}: {err}") | ||
# Even though we couldn't load the block, we can still include basic data about it in the index, from 'metadata' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want to index it if this is the case though? Wouldn't this XBlock be in a broken state anyhow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and I didn't like having this alternate code path anyways; not very DRY. Removed.
openedx/core/djangoapps/content/search/management/commands/reindex_studio.py
Outdated
Show resolved
Hide resolved
docs.append(doc) # pylint: disable=cell-var-from-loop | ||
self.recurse_children(block, add_with_children) # pylint: disable=cell-var-from-loop | ||
|
||
self.recurse_children(course, add_with_children) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fetching children at each step like this in split mongo is likely to be painfully slow, and we can't prefetch them with depth=None
on the get_course call without exploding memory. So I think the best way to do this might be to iterate the CourseOverview ids to get the course keys, and then do a separate get_course(course_key, depth=None)
to prefetch the children.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added get_course(..., depth=None)
in 1ff4b75 . Is that sufficient, or do you want me to change to loading the IDs from CourseOverview too? (I couldn't see a public API for "get all courses" in the CourseOverview API so I just left it for now.)
This didn't make any noticeable difference on my devstack but presumably that's because I don't have [large] enough courses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fine for now, thank you.
|
||
See also cms/djangoapps/contentstore/management/commands/reindex_course.py which | ||
indexes LMS (published) courses in ElasticSearch. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what standard to hold this module to. As a command intended for developers to be able to bootstrap and test local dev envs with search functionality, this is fine. I might ask for small features like being able to specify a specific library or course. As a migration, I'd have more requests w.r.t. scaling, logging, debug output, etc.
Could you maybe indicate that it's experimental and not safe to run in production envs here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command creates a brand new index, configures it, populates it, then swaps it to become the active index. It wouldn't make sense to run it for a single course, because it would erase all other courses from the resulting index. But it could make sense to have an option to skip populating it with content (for initial setup on huge instances), and another command that would reindex a single course/library (into the active index, without creating a new index), to give more control.
I have indicated it's experimental and it now requires the --experimental
flag.
edx-platform/openedx/core/djangoapps/content/search/management/commands/reindex_studio.py
Lines 33 to 48 in ec0c4a8
This is experimental and not recommended for production use. | |
""" | |
def add_arguments(self, parser): | |
parser.add_argument('--experimental', action='store_true') | |
parser.set_defaults(experimental=False) | |
def handle(self, *args, **options): | |
""" | |
Build a new search index for Studio, containing content from courses and libraries | |
""" | |
if not options["experimental"]: | |
raise CommandError( | |
"This command is experimental and not recommended for production. " | |
"Use the --experimental argument to acknowledge and run it." | |
) |
Thanks for the great questions @ormsbee. I've addressed all of them I think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to squash and merge to me.
@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future. |
2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production. |
2U Release Notice: This PR has been deployed to the edX production environment. |
1 similar comment
2U Release Notice: This PR has been deployed to the edX production environment. |
2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production. |
2U Release Notice: This PR has been deployed to the edX production environment. |
1 similar comment
2U Release Notice: This PR has been deployed to the edX production environment. |
This is a huge step forward for Open edX. Awesome work Braden! |
Description
This is a discovery prototype to learn about Meilisearch and how it could be used with Open edX.
The goal is to have this available for Studio search in Redwood, but disabled by default. Interested users can enable it and give us feedback, particularly on the use of Meilisearch vs. ElasticSearch.
Supporting information
The goal would be to implement this planned Studio courseware search UX using Meilisearch:
For now, this is a
proof of conceptinitial backend implementation that indexes draft (Studio) content from courses and libraries (v2).Screenshot
Here is a rudimentary search UI in the course authoring MFE searching courseware using this Meilisearch index:
This shows the resulting index in the UI built in to Meilisearch:
And here is an example search result with typo tolerance:
Testing instructions
tutor dev run cms bash
and./manage.py cms reindex_libraries
tutor-contrib-meilisearch
README for how to get the API key to log in)Faceted Search Results (backend)
You can see faceted search using tags using the frontend demo. But you can also see the data on the backend if you want: From the CMS django shell (
./manage.py cms shell
), run this code to see an example faceted search:result:
Not Implemented / TODO
Deadline
We're hoping to include a beta version of this (off by default) in Redwood.
Private ref: FAL-3689