Index Studio content using Meilisearch [FC-0040] #34310

bradenmacdonald · 2024-02-28T22:13:11Z

Description

This is a discovery prototype to learn about Meilisearch and how it could be used with Open edX.

The goal is to have this available for Studio search in Redwood, but disabled by default. Interested users can enable it and give us feedback, particularly on the use of Meilisearch vs. ElasticSearch.

Supporting information

The goal would be to implement this planned Studio courseware search UX using Meilisearch:

For now, this is a ~~proof of concept~~ initial backend implementation that indexes draft (Studio) content from courses and libraries (v2).

Screenshot

Here is a rudimentary search UI in the course authoring MFE searching courseware using this Meilisearch index:

This shows the resulting index in the UI built in to Meilisearch:

And here is an example search result with typo tolerance:

Testing instructions

Install tutor-contrib-meilisearch onto a Tutor Nightly Devstack.
Run tutor dev run cms bash and ./manage.py cms reindex_libraries
View the resulting index at http://meilisearch.local.edly.io:7700/ (see tutor-contrib-meilisearch README for how to get the API key to log in)
Or set up this PR in the course authoring MFE and follow its instructions to see the basic search UI in the MFE.

Faceted Search Results (backend)

The frontend demo shows faceted search by tags, but you can also inspect the data on the backend. From the CMS Django shell (./manage.py cms shell), run this code to see an example faceted search:

from django.conf import settings
import meilisearch
client = meilisearch.Client(settings.MEILISEARCH_URL, settings.MEILISEARCH_API_KEY)
index_name = settings.MEILISEARCH_INDEX_PREFIX + "studio_content"

client.index(index_name).search("hyperspace", {"facets": ["block_type", "tags"]})

result:

 'facetDistribution': {'block_type': {'html': 2,
   'problem': 2,
   'vertical': 1,
   'video': 1},
  'tags': {},
  'tags.taxonomy': {'ESDC Skills and Competencies': 1, 'Lightcast Open Skills Taxonomy': 3},
  'tags.level0': {'ESDC Skills and Competencies > Knowledge': 1,
   'Lightcast Open Skills Taxonomy > Administration': 2,
   'Lightcast Open Skills Taxonomy > Engineering': 1},
  'tags.level1': {'ESDC Skills and Competencies > Knowledge > Physical Sciences Sub-Category': 1,
   'Lightcast Open Skills Taxonomy > Administration > Administrative Support': 1,
   'Lightcast Open Skills Taxonomy > Administration > Dictation': 1,
   'Lightcast Open Skills Taxonomy > Administration > Document Management': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory': 1},
  'tags.level2': {'ESDC Skills and Competencies > Knowledge > Physical Sciences Sub-Category > Physical Sciences': 1,
   'Lightcast Open Skills Taxonomy > Administration > Administrative Support > Administrative Functions': 1,
   'Lightcast Open Skills Taxonomy > Administration > Dictation > Transcribing': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Aerospace Engineering': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Space Exploration': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Space Flight': 1,
   'Lightcast Open Skills Taxonomy > Engineering > Aerospace Engineering Subcategory > Spacecraft Propulsion': 1},
  'tags.level3': {},
},
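
To narrow a search to one of the facet values shown above, a filter can be passed alongside the facets option. The sketch below only builds the search options and does not contact Meilisearch. The helper names build_facet_search_params and quote_filter_value are hypothetical, and it assumes the filtered attributes are configured as filterable in the index.

```python
def quote_filter_value(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_facet_search_params(facets, selected):
    """
    Build the options dict for a faceted search (hypothetical helper).

    facets:   attribute names to return facet counts for
    selected: mapping of attribute name -> value the results must match
    """
    filters = [f"{attr} = {quote_filter_value(value)}" for attr, value in selected.items()]
    params = {"facets": list(facets)}
    if filters:
        # Meilisearch combines the entries of a filter list with AND.
        params["filter"] = filters
    return params


params = build_facet_search_params(
    ["block_type", "tags"],
    {
        "block_type": "problem",
        "tags.level0": "Lightcast Open Skills Taxonomy > Administration",
    },
)
print(params)
```

The resulting dict could then be passed as the second argument to client.index(index_name).search("hyperspace", params) in the shell session above.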

Not Implemented / TODO

Add the course/library name to the index
Add the hierarchy to each block - the breadcrumbs of what course/section/subsection/unit it's in
Add tests
Add ADR
Restrict the search results based on what permissions the user has. Most of the code to do this is already in there; the frontend is already using a user-specific API key that is set up with restrictions on what it can search. We just need to populate those restrictions with a list of orgs/courses/libraries they can see.
- Will do in a future PR. For now, this is disabled by default and if enabled, is only available to global staff users.

Deadline

We're hoping to include a beta version of this (off by default) in Redwood.

Private ref: FAL-3689

openedx-webhooks · 2024-02-28T22:13:16Z

Thanks for the pull request, @bradenmacdonald! Please note that it may take us up to several weeks or months to complete a review and merge your PR.

Feel free to add as much of the following information to the ticket as you can:

supporting documentation
Open edX discussion forum threads
timeline information ("this must be merged by XX date", and why that is)
partner information ("this is a course on edx.org")
any other information that can help Product understand the context for the PR

All technical communication about the code itself will be done via the GitHub pull request interface. As a reminder, our process documentation is here.

Please let us know once your PR is ready for our review and all tests are green.

ormsbee · 2024-02-29T00:16:41Z

This PR brings me such joy. 😄

bradenmacdonald · 2024-03-18T22:02:50Z

@rpenido I included those changes - thanks! I also added a test for the document format and fixed a couple minor bugs that that uncovered.

rpenido

Great work here @bradenmacdonald!

I think this is ready for upstream review.

openedx/core/djangoapps/content/search/documents.py

Co-authored-by: Rômulo Penido <[email protected]>

rpenido · 2024-03-20T22:08:05Z

Hi @bradenmacdonald! Working with this code today, I realized we are not indexing the root content (course and library), only its blocks. I just want to make sure that this is intended.

ormsbee

Mostly questions, a couple of small requests. I think the ADR looks great.

ormsbee · 2024-03-20T18:05:52Z

cms/envs/common.py

@@ -1776,6 +1776,9 @@
    'openedx_tagging.core.tagging.apps.TaggingConfig',
    'openedx.core.djangoapps.content_tagging',

+    # Search
+    'openedx.core.djangoapps.content.search.apps.ContentSearchConfig',


Nit: Someone (I think maybe @kdmccormick) pointed out to me that you don't need to explicitly put the app config if there's only one.

ormsbee · 2024-03-21T03:10:28Z

openedx/core/djangoapps/content/search/documents.py

+    integer or a string composed of alphanumeric characters (a-z A-Z 0-9),
+    hyphens (-) and underscores (_). Since our opaque keys don't meet this
+    requirement, we transform them to a similar slug ID string that does.
+    """


[Question (not request)]: Would there be any advantage in using PublishableEntity's primary key or UUID for this, once course storage uses Learning Core?

Yes. I don't really like having to generate this arbitrary ID here. We don't really use it for anything, so it's not a big deal, but it would be nicer to use an ID that's the same as other parts of the system, which would be the case with PublishableEntity's ID/UUID. So I'd happily change this to use that once it's available for courseware.

Added a comment to that effect.

edx-platform/openedx/core/djangoapps/content/search/documents.py

Lines 72 to 73 in ec0c4a8

    In the future, with Learning Core's data models in place for courseware,
    we could use PublishableEntity's primary key / UUID instead.

ormsbee · 2024-03-21T03:30:13Z

openedx/core/djangoapps/content/search/documents.py

+    """
+    # The slugified key _may_ not be unique so we append a hashed string to make it unique:
+    key_bin = str(usage_key).encode()
+    suffix = hashlib.sha1(key_bin).hexdigest()[:7]  # When we use Python 3.9+, should add usedforsecurity=False here.


[Optional/nit]: blake2b is just as fast (or slightly faster) than sha1 and lets you choose a digest size.
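
The ID scheme discussed in this thread (a slug of the opaque key plus a short hash suffix for uniqueness) can be sketched as follows, using blake2b as suggested. This is an illustration rather than the code in documents.py: the function name meili_id_from_key, the regex-based slug, and the 4-byte digest size are all assumptions.

```python
import hashlib
import re


def meili_id_from_key(usage_key: str) -> str:
    """Return an ID made only of a-z, A-Z, 0-9, '-' and '_' (hypothetical helper)."""
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", usage_key).strip("-")
    # Different keys can collapse to the same slug, so append a short digest
    # of the original key. blake2b lets us choose the digest size directly.
    suffix = hashlib.blake2b(usage_key.encode(), digest_size=4).hexdigest()
    return f"{slug}-{suffix}"


# These two made-up keys differ only in '+' vs '-', so their slugs are identical
# and only the hash suffix tells them apart.
id_a = meili_id_from_key("block-v1:Org+Course+Run+type@html+block@intro")
id_b = meili_id_from_key("block-v1:Org+Course-Run+type@html+block@intro")
print(id_a)
print(id_b)
```

Because the suffix is derived only from the key, the same key always maps to the same ID across runs.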

ormsbee · 2024-03-21T03:34:08Z

openedx/core/djangoapps/content/search/documents.py

+    Values for the 'type' field on each doc in the search index
+    """
+    course_block = "course_block"
+    library_block = "library_block"


Searching within files and uploads is out of scope, right?

I hadn't thought about Files & Uploads tbh. It's probably a good idea to add them at some point, but definitely not in this PR and probably not for Redwood either.

ormsbee · 2024-03-21T04:11:38Z

openedx/core/djangoapps/content/search/documents.py

+    # Meilisearch primary key. String.
+    id = "id"
+    usage_key = "usage_key"
+    type = "type"  # DocType.course_block or DocType.library_block (see below)


Question: What's the process for adding a field later? For instance, when Units are introduced as a searchable thing, would we run a data migration that adds a field with a default value of "component"?

If you want to change any of the index settings:

which attributes can be used for filtering/faceted search

which attributes are used for keyword search

The fact that the "usage_key" is the distinct attribute

...then it's best to do it via reindex_studio management command, which is the only method I've implemented for doing so.

The Meilisearch docs recommend this approach:

Updating [index settings] will re-index all documents in the index, which can take some time. We recommend updating your index settings first and then adding documents as this reduces RAM consumption.

But otherwise just adding new fields or values can be done any time. If it's something like a new value for the type field, or a new field that doesn't need to be used for filtering or keyword search, or it's a dictionary value within one of the existing fields (e.g. a new set of keys and values in the content field from some new XBlock's index_dictionary) - then it can be done anytime, no migration needed, no performance issue.

Noted via a comment:

edx-platform/openedx/core/djangoapps/content/search/documents.py

Lines 52 to 54 in ec0c4a8

# Note: new fields or values can be added at any time, but if they need to be indexed for filtering or keyword
# search, the index configuration will need to be changed, which is only done as part of the 'reindex_studio'
# command (changing those settings on an large active index is not recommended).
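
The distinction drawn in this thread can be sketched as a settings payload plus a check for whether a new field forces a settings change. filterableAttributes, searchableAttributes and distinctAttribute are Meilisearch setting names; the helper names and the attribute lists below are assumptions for illustration, not the configuration used in this PR. The check also assumes searchableAttributes is an explicit list rather than the "*" wildcard.

```python
def build_index_settings(filterable, searchable, distinct_attribute):
    """Assemble an index settings payload (hypothetical helper)."""
    return {
        "filterableAttributes": list(filterable),
        "searchableAttributes": list(searchable),
        "distinctAttribute": distinct_attribute,
    }


def requires_settings_change(settings, field, *, filterable=False, searchable=False):
    """Return True if adding `field` in the given role(s) means the settings must change."""
    if filterable and field not in settings["filterableAttributes"]:
        return True
    if searchable and field not in settings["searchableAttributes"]:
        return True
    return False


settings = build_index_settings(
    filterable=["block_type", "tags", "type"],   # assumed attribute names
    searchable=["display_name", "content"],      # assumed attribute names
    distinct_attribute="usage_key",
)

# A field that is only stored and returned needs no settings change...
print(requires_settings_change(settings, "new_note"))
# ...but one that must support filtering does, which means a full reindex.
print(requires_settings_change(settings, "org", filterable=True))
```

With the meilisearch Python client, a payload like this would typically be applied with client.index(index_name).update_settings(settings) before documents are added.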

ormsbee · 2024-03-21T14:56:29Z

openedx/core/djangoapps/content/search/documents.py

+        for level in range(4):
+            new_value = " > ".join(parts[0:level + 2])
+            if f"level{level}" not in result:
+                result[f"level{level}"] = [new_value]
+            elif new_value not in result[f"level{level}"]:
+                result[f"level{level}"].append(new_value)
+            if len(parts) == level + 2:
+                break


Please add a comment explaining why "4", and more generally what's going on in this block of code.

Done:

edx-platform/openedx/core/djangoapps/content/search/documents.py

Lines 178 to 194 in ec0c4a8

        # Now we build each level (tags.level0, tags.level1, etc.) as applicable.
        # We have a hard-coded limit of 4 levels of tags for now (see Fields.tags above).
        # A tag like "Difficulty: Hard" will only result in one level (tags.level0)
        # But a tag like "Location: North America > Canada > Vancouver" would result in three levels (tags.level0:
        # "North America", tags.level1: "North America > Canada", tags.level2: "North America > Canada > Vancouver")
        # See the comments above on "Field.tags" for an explanation of why we use this format (basically it's the format
        # required by the Instantsearch frontend).
        for level in range(4):
            # We use '>' as a separator because it's the default for the Instantsearch frontend library, and our
            # preferred separator (\t) used in the database is ignored by Meilisearch since it's whitespace.
            new_value = " > ".join(parts[0:level + 2])
            if f"level{level}" not in result:
                result[f"level{level}"] = [new_value]
            elif new_value not in result[f"level{level}"]:
                result[f"level{level}"].append(new_value)
            if len(parts) == level + 2:
                break  # We have all the levels for this tag now (e.g. parts=["Difficulty", "Hard"] -> need level0 only)
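
A standalone version of the loop above makes the resulting structure easier to see. It assumes each tag arrives as a list of parts whose first element is the taxonomy name, which matches the facet output in the PR description; the function name build_tag_levels and the sample tags are hypothetical.

```python
def build_tag_levels(tag_paths, max_levels=4):
    """
    Build the levelN lists from tag paths (hypothetical helper).

    Each path is [taxonomy, tag, child tag, ...]; level N holds the path
    joined down to depth N, using " > " as the separator.
    """
    result = {}
    for parts in tag_paths:
        for level in range(max_levels):
            new_value = " > ".join(parts[0:level + 2])
            values = result.setdefault(f"level{level}", [])
            if new_value not in values:
                values.append(new_value)
            if len(parts) == level + 2:
                break  # This tag has no deeper levels.
    return result


levels = build_tag_levels([
    ["Difficulty", "Hard"],
    ["Location", "North America", "Canada", "Vancouver"],
    ["Location", "North America", "Canada", "Toronto"],
])
for name, values in levels.items():
    print(name, values)
```

The two Location tags share their first two levels, so those values appear only once, and only level2 gains a second entry.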

ormsbee · 2024-03-21T14:58:44Z

openedx/core/djangoapps/content/search/documents.py

+        block = xblock_api.load_block(metadata.usage_key, user=None)
+    except Exception as err:  # pylint: disable=broad-except
+        log.exception(f"Failed to load XBlock {metadata.usage_key}: {err}")
+        # Even though we couldn't load the block, we can still include basic data about it in the index, from 'metadata'


Would we want to index it if this is the case though? Wouldn't this XBlock be in a broken state anyhow?

Yeah, and I didn't like having this alternate code path anyways; not very DRY. Removed.

ormsbee · 2024-03-21T15:09:27Z

openedx/core/djangoapps/content/search/management/commands/reindex_studio.py

+                    docs.append(doc)  # pylint: disable=cell-var-from-loop
+                    self.recurse_children(block, add_with_children)  # pylint: disable=cell-var-from-loop
+
+                self.recurse_children(course, add_with_children)


Fetching children at each step like this in split mongo is likely to be painfully slow, and we can't prefetch them with depth=None on the get_course call without exploding memory. So I think the best way to do this might be to iterate the CourseOverview ids to get the course keys, and then do a separate get_course(course_key, depth=None) to prefetch the children.

I added get_course(..., depth=None) in 1ff4b75 . Is that sufficient, or do you want me to change to loading the IDs from CourseOverview too? (I couldn't see a public API for "get all courses" in the CourseOverview API so I just left it for now.)

This didn't make any noticeable difference on my devstack but presumably that's because I don't have [large] enough courses.

I think that's fine for now, thank you.
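
Once the whole course has been loaded up front, collecting one document per block is an in-memory tree walk. The sketch below shows that walk on a hypothetical Block stand-in; it is not the command's recurse_children method, and the usage keys are placeholders. The root itself is skipped, so only its descendants are visited.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    """Hypothetical stand-in for an already-loaded XBlock."""
    usage_key: str
    children: List["Block"] = field(default_factory=list)


def walk_descendants(root: Block, visit: Callable[[Block], None]) -> None:
    """Call visit() on every descendant of root, depth-first, parents before children."""
    stack = list(reversed(root.children))
    while stack:
        block = stack.pop()
        visit(block)
        # Push children in reverse so they are popped in their original order.
        stack.extend(reversed(block.children))


course = Block("course", [
    Block("section1", [Block("unit1", [Block("html1"), Block("problem1")])]),
    Block("section2"),
])

docs = []
walk_descendants(course, lambda block: docs.append({"usage_key": block.usage_key}))
print([doc["usage_key"] for doc in docs])
```

An explicit stack is used here instead of recursion so that deeply nested content cannot hit Python's recursion limit.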

ormsbee · 2024-03-21T15:11:49Z

openedx/core/djangoapps/content/search/management/commands/reindex_studio.py

+
+See also cms/djangoapps/contentstore/management/commands/reindex_course.py which
+indexes LMS (published) courses in ElasticSearch.
+"""


I'm not sure what standard to hold this module to. As a command intended for developers to be able to bootstrap and test local dev envs with search functionality, this is fine. I might ask for small features like being able to specify a specific library or course. As a migration, I'd have more requests w.r.t. scaling, logging, debug output, etc.

Could you maybe indicate that it's experimental and not safe to run in production envs here?

This command creates a brand new index, configures it, populates it, then swaps it to become the active index. It wouldn't make sense to run it for a single course, because it would erase all other courses from the resulting index. But it could make sense to have an option to skip populating it with content (for initial setup on huge instances), and another command that would reindex a single course/library (into the active index, without creating a new index), to give more control.
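
The build-then-swap flow described in this reply can be sketched with an in-memory stand-in for the search server. InMemorySearchServer and rebuild_index are hypothetical names and none of this calls Meilisearch; the point is only to show why rebuilding with a single course's documents would drop everything else from the active index.

```python
class InMemorySearchServer:
    """Hypothetical stand-in for a search server: maps index name -> list of docs."""

    def __init__(self):
        self.indexes = {}

    def create_index(self, name):
        self.indexes[name] = []

    def add_documents(self, name, docs):
        self.indexes[name].extend(docs)

    def swap_indexes(self, name_a, name_b):
        self.indexes[name_a], self.indexes[name_b] = self.indexes[name_b], self.indexes[name_a]

    def delete_index(self, name):
        del self.indexes[name]


def rebuild_index(server, active_name, docs):
    """Populate a fresh index, swap it in as the active one, then drop the old data."""
    temp_name = active_name + "_new"
    server.create_index(temp_name)
    server.add_documents(temp_name, docs)
    if active_name not in server.indexes:
        server.create_index(active_name)  # Both indexes must exist before swapping.
    server.swap_indexes(active_name, temp_name)
    server.delete_index(temp_name)  # After the swap this holds the old documents.


server = InMemorySearchServer()
server.create_index("studio_content")
server.add_documents("studio_content", [{"id": "course-a-block"}, {"id": "course-b-block"}])

# Rebuilding with only course A's documents leaves course B out of the active index.
rebuild_index(server, "studio_content", [{"id": "course-a-block"}])
print(server.indexes)
```

Searches keep hitting the old index until the swap, so readers never see a half-populated index.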

I have indicated it's experimental and it now requires the --experimental flag.

edx-platform/openedx/core/djangoapps/content/search/management/commands/reindex_studio.py

Lines 33 to 48 in ec0c4a8

    This is experimental and not recommended for production use.
    """

    def add_arguments(self, parser):
        parser.add_argument('--experimental', action='store_true')
        parser.set_defaults(experimental=False)

    def handle(self, *args, **options):
        """
        Build a new search index for Studio, containing content from courses and libraries
        """
        if not options["experimental"]:
            raise CommandError(
                "This command is experimental and not recommended for production. "
                "Use the --experimental argument to acknowledge and run it."
            )

bradenmacdonald · 2024-03-21T19:11:44Z

Thanks for the great questions @ormsbee. I've addressed all of them I think!

ormsbee

Looks good to squash and merge to me.

openedx-webhooks · 2024-03-22T17:08:34Z

@bradenmacdonald 🎉 Your pull request was merged! Please take a moment to answer a two question survey so we can improve your experience in the future.

edx-pipeline-bot · 2024-03-22T18:21:16Z

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

edx-pipeline-bot · 2024-03-22T18:39:30Z

2U Release Notice: This PR has been deployed to the edX production environment.

edx-pipeline-bot · 2024-03-22T19:18:54Z

2U Release Notice: This PR has been deployed to the edX staging environment in preparation for a release to production.

edx-pipeline-bot · 2024-03-22T19:39:15Z

2U Release Notice: This PR has been deployed to the edX production environment.

regisb · 2024-03-25T08:43:11Z

This is a huge step forward for Open edX. Awesome work Braden!

feat: management command to index content libraries using Meilisearch

463549f

openedx-webhooks added the open-source-contribution label (PR author is not from Axim or 2U) Feb 28, 2024

bradenmacdonald added 5 commits February 28, 2024 17:55

feat: include tags in library content index

81daab2

refactor: cleanup

ea01c70

feat: index courseware too

9f75b4c

refactor: combine indexes into one

b9eddb8

feat: index tags (in a way compatible with InstantSearch)

adbf357

bradenmacdonald changed the title ~~Proof of Concept: index content libraries using Meilisearch [FC-0040]~~ Proof of Concept: index Studio content using Meilisearch [FC-0040] Feb 29, 2024

feat: REST API to retrieve a user-specific API key

47dc3c1

bradenmacdonald mentioned this pull request Mar 5, 2024

Hacky Prototype search UI using Instantsearch + Meilisearch [FC-0040] openedx/frontend-app-authoring#868

Closed

fix: better handling index existence check, better comments

6cf9e04

bradenmacdonald mentioned this pull request Mar 6, 2024

[Course Search] Initial Integration of Studio Search Backend openedx/modular-learning#195

Closed

bradenmacdonald added 7 commits March 14, 2024 19:12

feat: disable studio search (w/ Meilisearch) by default

8637e16

feat: limit search endpoint usage to global staff for now

6431948

chore: fix quality issues

9c05806

refactor: put all "content" data into the "content" field

af1402f

chore: fix lint issue and comment

7c5cd7f

feat: add breadcrumbs to index

4c66d06

feat: use fallback display_name if needed

67cf0cc

bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 2af93c5 to f57dbfa Compare March 15, 2024 19:22

bradenmacdonald changed the title ~~Proof of Concept: index Studio content using Meilisearch [FC-0040]~~ Index Studio content using Meilisearch [FC-0040] Mar 15, 2024

feat: state how long the reindex_studio command took

c53963e

bradenmacdonald force-pushed the braden/meilisearch-libraries branch from f57dbfa to c53963e Compare March 18, 2024 01:31

test: add tests for the Studio search REST API

3884fa2

bradenmacdonald force-pushed the braden/meilisearch-libraries branch 2 times, most recently from 107d814 to 0c43e21 Compare March 18, 2024 02:52

docs: added ADR

6be622e

bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 0c43e21 to 6be622e Compare March 18, 2024 02:54

bradenmacdonald added 3 commits March 18, 2024 15:01

fix: hash used for ID wasn't stable

0af6939

fix: breadcrumbs weren't using fallback display_names

106cfd0

test: expand tests

1393d41

bradenmacdonald added 3 commits March 18, 2024 18:34

fix: lint issues

5a1429e

chore: update with latest master

ec27d5f

test: expand tests

69879e0

bradenmacdonald force-pushed the braden/meilisearch-libraries branch from 41a88a8 to 69879e0 Compare March 19, 2024 02:34

rpenido self-requested a review March 19, 2024 17:08

rpenido approved these changes Mar 19, 2024

docs: fix typo pointed out in review

1b5dfdf

Co-authored-by: Rômulo Penido <[email protected]>

rpenido mentioned this pull request Mar 19, 2024

feat: update search index when course content is updated [FC-0040] #34391

Merged

ormsbee requested changes Mar 21, 2024

bradenmacdonald added 2 commits March 21, 2024 11:18

docs: clarify code with comments, indicate command is experimental

ec0c4a8

perf: pre-fetch all of the blocks in the course

1ff4b75

ormsbee approved these changes Mar 21, 2024

bradenmacdonald merged commit f663739 into openedx:master Mar 22, 2024
67 checks passed

bradenmacdonald deleted the braden/meilisearch-libraries branch March 27, 2024 18:53
