Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index Studio content using Meilisearch [FC-0040] #34310

Merged

Conversation

bradenmacdonald
Copy link
Contributor

@bradenmacdonald bradenmacdonald commented Feb 28, 2024

Description

This is a discovery prototype to learn about Meilisearch and how it could be used with Open edX.

The goal is to have this available for Studio search in Redwood, but disabled by default. Interested users can enable it and give us feedback, particularly on the use of Meilisearch vs. ElasticSearch.

Supporting information

The goal would be to implement this planned Studio courseware search UX using Meilisearch:

Screenshot 2024-02-28 at 2 08 52 PM

For now, this is a proof of concept initial backend implementation that indexes draft (Studio) content from courses and libraries (v2).

Screenshot

Here is a rudimentary search UI in the course authoring MFE searching courseware using this Meilisearch index:

Screenshot 2024-03-05 at 12 15 29 PM

This shows the resulting index in the UI built in to Meilisearch:

meilisearch-example

And here is an example search result with typo tolerance:

Screenshot 2024-02-28 at 2 12 54 PM

Testing instructions

  1. Install tutor-contrib-meilisearch onto a Tutor Nightly Devstack.
  2. Run tutor dev run cms bash and ./manage.py cms reindex_libraries
  3. View the resulting index at http://meilisearch.local.edly.io:7700/ (see tutor-contrib-meilisearch README for how to get the API key to log in)
  4. Or set up this PR in the course authoring MFE and follow it's instructions to see the basic search UI in the MFE.

Faceted Search Results (backend)

You can see faceted search using tags using the frontend demo. But you can also see the data on the backend if you want: From the CMS django shell (./manage.py cms shell), run this code to see an example faceted search:

from django.conf import settings
import meilisearch
client = meilisearch.Client(settings.MEILISEARCH_URL, settings.MEILISEARCH_API_KEY)
index_name = settings.MEILISEARCH_INDEX_PREFIX + "studio_content"

client.index(index_name).search("hyperspace", {"facets": ["block_type", "tags"]})

result:

 'facetDistribution': {'block_type': {'html': 2,
   'problem': 2,
   'vertical': 1,
   'video': 1},
  'tags': {},
  'tags.taxonomy': {'ESDC Skills and Competencies': 1, 'Lightcast Open Skills Taxonomy': 3}
  'tags.level0': {'ESDC Skills and Competencies > Knowledge': 1,
   'Lightcast Open Skills Taxonomy > Administration': 2,
   'Lightcast Open Skills Taxonomy > Engineering': 1},
  'tags.level1': {'ESDC Skills and Competencies > Knowledge > Physical Sciences Sub-Category': 1,
   'Lightcast Open Skills Taxonomy > Administration > Administrative Support': 1,
   'Lightcast Open Skills Taxonomy > Administration > Dictation': 1,
   'Lightcast Open Skills Taxonomy > Administration > Document Management': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory': 1},
  'tags.level2': {'ESDC Skills and Competencies > Knowledge > Physical Sciences Sub-Category > Physical Sciences': 1,
   'Lightcast Open Skills Taxonomy > Administration > Administrative Support > Administrative Functions': 1,
   'Lightcast Open Skills Taxonomy > Administration > Dictation > Transcribing': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Aerospace Engineering': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Space Exploration': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Space Flight': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Spacecraft Propulsion': 1},
  'tags.level3': {},
},

Not Implemented / TODO

  • Add the course/library name to the index
  • Add the hierarchy to each block - the breadcrumbs of what course/section/subsection/unit it's in
  • Add tests
  • Add ADR
  • Restrict the search results based on what permissions the user has. Most of the code to do this is already in there; already the frontend is using a user-specific API key that is setup with restrictions on what it can search. We just need to populate those restrictions with a list of orgs/courses/libraries they can see.
    • Will do in a future PR. For now, this is disabled by default and if enabled, is only available to global staff users.

Deadline

We're hoping to include a beta version of this (off by default) in Redwood.

Private ref: FAL-3689

@openedx-webhooks
Copy link

Thanks for the pull request, @bradenmacdonald! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

  • supporting documentation
  • Open edX discussion forum threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Feb 28, 2024
@ormsbee
Copy link
Contributor

ormsbee commented Feb 29, 2024

This PR brings me such joy. 😄

@bradenmacdonald bradenmacdonald changed the title Proof of Concept: index content libraries using Meilisearch [FC-0040] Proof of Concept: index Studio content using Meilisearch [FC-0040] Feb 29, 2024
@bradenmacdonald bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 2af93c5 to f57dbfa Compare March 15, 2024 19:22
@bradenmacdonald bradenmacdonald changed the title Proof of Concept: index Studio content using Meilisearch [FC-0040] Index Studio content using Meilisearch [FC-0040] Mar 15, 2024
@bradenmacdonald bradenmacdonald force-pushed the braden/meilisearch-libraries branch from f57dbfa to c53963e Compare March 18, 2024 01:31
@bradenmacdonald bradenmacdonald force-pushed the braden/meilisearch-libraries branch 2 times, most recently from 107d814 to 0c43e21 Compare March 18, 2024 02:52
@bradenmacdonald bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 0c43e21 to 6be622e Compare March 18, 2024 02:54
@bradenmacdonald
Copy link
Contributor Author

@rpenido I included those changes - thanks! I also added a test for the document format and fixed a couple minor bugs that that uncovered.

@bradenmacdonald bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 41a88a8 to 69879e0 Compare March 19, 2024 02:34
@rpenido rpenido self-requested a review March 19, 2024 17:08
Copy link
Contributor

@rpenido rpenido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here @bradenmacdonald!

I think this is ready for upstream review.

openedx/core/djangoapps/content/search/documents.py Outdated Show resolved Hide resolved
@rpenido
Copy link
Contributor

rpenido commented Mar 20, 2024

Hi @bradenmacdonald! Working with this code today, I realized we are not indexing the root content (course and library), only its blocks. I just want to make sure that this is intended.

Copy link
Contributor

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions, a couple of small requests. I think the ADR looks great.

@@ -1776,6 +1776,9 @@
'openedx_tagging.core.tagging.apps.TaggingConfig',
'openedx.core.djangoapps.content_tagging',

# Search
'openedx.core.djangoapps.content.search.apps.ContentSearchConfig',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Someone (I think maybe @kdmccormick) pointed out to me that you don't need to explicitly put the app config if there's only one.

integer or a string composed of alphanumeric characters (a-z A-Z 0-9),
hyphens (-) and underscores (_). Since our opaque keys don't meet this
requirement, we transform them to a similar slug ID string that does.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question (not request)]: Would there be any advantage in using PublishableEntity's primary key or UUID for this, once course storage uses Learning Core?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I don't really like having to generate this arbitrary ID here. We don't really use it for anything, so it's not a big deal, but it would be nicer to use an ID that's the same as other parts of the system, which would be the case with PublishableEntity's ID/UUID. So I'd happily change this to use that once it's available for courseware.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to that effect.

In the future, with Learning Core's data models in place for courseware,
we could use PublishableEntity's primary key / UUID instead.

"""
# The slugified key _may_ not be unique so we append a hashed string to make it unique:
key_bin = str(usage_key).encode()
suffix = hashlib.sha1(key_bin).hexdigest()[:7] # When we use Python 3.9+, should add usedforsecurity=False here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional/nit]: blake2b is just as fast (or slightly faster) than sha1 and lets you choose a digest size.

Values for the 'type' field on each doc in the search index
"""
course_block = "course_block"
library_block = "library_block"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Searching within files and uploads is out of scope, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't thought about Files & Uploads tbh. It's probably a good idea to add them at some point, but definitely not in this PR and probably not for Redwood either.

# Meilisearch primary key. String.
id = "id"
usage_key = "usage_key"
type = "type" # DocType.course_block or DocType.library_block (see below)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: What's the process for adding a field later? For instance, when Units are introduced as a searchable thing, would we run a data migration that adds a field with a default value of "component"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to change any of the index settings:

  • which attributes can be used for filtering/faceted search
  • which attributes are used for keyword search
  • The fact that the "usage_key" is the distinct attribute

...then it's best to do it via reindex_studio management command, which is the only method I've implemented for doing so.

The Meilisearch docs recommend this approach:

Updating [index settings] will re-index all documents in the index, which can take some time. We recommend updating your index settings first and then adding documents as this reduces RAM consumption.

But otherwise just adding new fields or values can be done any time. If it's something like a new value for the type field, or a new field that doesn't need to be used for filtering or keyword search, or it's a dictionary value within one of the existing fields (e.g. a new set of keys and values in the content field from some new XBlock's index_dictionary) - then it can be done anytime, no migration needed, no performance issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted via a comment:

# Note: new fields or values can be added at any time, but if they need to be indexed for filtering or keyword
# search, the index configuration will need to be changed, which is only done as part of the 'reindex_studio'
# command (changing those settings on an large active index is not recommended).

Comment on lines 163 to 170
for level in range(4):
new_value = " > ".join(parts[0:level + 2])
if f"level{level}" not in result:
result[f"level{level}"] = [new_value]
elif new_value not in result[f"level{level}"]:
result[f"level{level}"].append(new_value)
if len(parts) == level + 2:
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment explaining why "4", and more generally what's going on in this block of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

# Now we build each level (tags.level0, tags.level1, etc.) as applicable.
# We have a hard-coded limit of 4 levels of tags for now (see Fields.tags above).
# A tag like "Difficulty: Hard" will only result in one level (tags.level0)
# But a tag like "Location: North America > Canada > Vancouver" would result in three levels (tags.level0:
# "North America", tags.level1: "North America > Canada", tags.level2: "North America > Canada > Vancouver")
# See the comments above on "Field.tags" for an explanation of why we use this format (basically it's the format
# required by the Instantsearch frontend).
for level in range(4):
# We use '>' as a separator because it's the default for the Instantsearch frontend library, and our
# preferred separator (\t) used in the database is ignored by Meilisearch since it's whitespace.
new_value = " > ".join(parts[0:level + 2])
if f"level{level}" not in result:
result[f"level{level}"] = [new_value]
elif new_value not in result[f"level{level}"]:
result[f"level{level}"].append(new_value)
if len(parts) == level + 2:
break # We have all the levels for this tag now (e.g. parts=["Difficulty", "Hard"] -> need level0 only)

block = xblock_api.load_block(metadata.usage_key, user=None)
except Exception as err: # pylint: disable=broad-except
log.exception(f"Failed to load XBlock {metadata.usage_key}: {err}")
# Even though we couldn't load the block, we can still include basic data about it in the index, from 'metadata'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to index it if this is the case though? Wouldn't this XBlock be in a broken state anyhow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, and I didn't like having this alternate code path anyways; not very DRY. Removed.

docs.append(doc) # pylint: disable=cell-var-from-loop
self.recurse_children(block, add_with_children) # pylint: disable=cell-var-from-loop

self.recurse_children(course, add_with_children)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fetching children at each step like this in split mongo is likely to be painfully slow, and we can't prefetch them with depth=None on the get_course call without exploding memory. So I think the best way to do this might be to iterate the CourseOverview ids to get the course keys, and then do a separate get_course(course_key, depth=None) to prefetch the children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added get_course(..., depth=None) in 1ff4b75 . Is that sufficient, or do you want me to change to loading the IDs from CourseOverview too? (I couldn't see a public API for "get all courses" in the CourseOverview API so I just left it for now.)

This didn't make any noticeable difference on my devstack but presumably that's because I don't have [large] enough courses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fine for now, thank you.


See also cms/djangoapps/contentstore/management/commands/reindex_course.py which
indexes LMS (published) courses in ElasticSearch.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what standard to hold this module to. As a command intended for developers to be able to bootstrap and test local dev envs with search functionality, this is fine. I might ask for small features like being able to specify a specific library or course. As a migration, I'd have more requests w.r.t. scaling, logging, debug output, etc.

Could you maybe indicate that it's experimental and not safe to run in production envs here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command creates a brand new index, configures it, populates it, then swaps it to become the active index. It wouldn't make sense to run it for a single course, because it would erase all other courses from the resulting index. But it could make sense to have an option to skip populating it with content (for initial setup on huge instances), and another command that would reindex a single course/library (into the active index, without creating a new index), to give more control.

I have indicated it's experimental and it now requires the --experimental flag.

This is experimental and not recommended for production use.
"""
def add_arguments(self, parser):
parser.add_argument('--experimental', action='store_true')
parser.set_defaults(experimental=False)
def handle(self, *args, **options):
"""
Build a new search index for Studio, containing content from courses and libraries
"""
if not options["experimental"]:
raise CommandError(
"This command is experimental and not recommended for production. "
"Use the --experimental argument to acknowledge and run it."
)

@bradenmacdonald
Copy link
Contributor Author

Thanks for the great questions @ormsbee. I've addressed all of them I think!

Copy link
Contributor

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to squash and merge to me.

@bradenmacdonald bradenmacdonald merged commit f663739 into openedx:master Mar 22, 2024
67 checks passed
@openedx-webhooks
Copy link

@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

1 similar comment
@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

1 similar comment
@edx-pipeline-bot
Copy link
Contributor

2U Release Notice: This PR has been deployed to the edX production environment.

@regisb
Copy link
Contributor

regisb commented Mar 25, 2024

This is a huge step forward for Open edX. Awesome work Braden!

@bradenmacdonald bradenmacdonald deleted the braden/meilisearch-libraries branch March 27, 2024 18:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

6 participants