Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index Studio content using Meilisearch [FC-0040] #34310

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
463549f
feat: management command to index content libraries using Meilisearch
bradenmacdonald Feb 27, 2024
81daab2
feat: include tags in library content index
bradenmacdonald Feb 29, 2024
ea01c70
refactor: cleanup
bradenmacdonald Feb 29, 2024
9f75b4c
feat: index courseware too
bradenmacdonald Feb 29, 2024
b9eddb8
refactor: combine indexes into one
bradenmacdonald Feb 29, 2024
adbf357
feat: index tags (in a way compatible with InstantSearch)
bradenmacdonald Feb 29, 2024
47dc3c1
feat: REST API to retrieve a user-specific API key
bradenmacdonald Mar 5, 2024
6cf9e04
fix: better handling index existence check, better comments
bradenmacdonald Mar 5, 2024
8637e16
feat: disable studio search (w/ Meilisearch) by default
bradenmacdonald Mar 15, 2024
6431948
feat: limit search endpoint usage to global staff for now
bradenmacdonald Mar 15, 2024
9c05806
chore: fix quality issues
bradenmacdonald Mar 15, 2024
af1402f
refactor: put all "content" data into the "content" field
bradenmacdonald Mar 15, 2024
7c5cd7f
chore: fix lint issue and comment
bradenmacdonald Mar 15, 2024
4c66d06
feat: add breadcrumbs to index
bradenmacdonald Mar 15, 2024
67cf0cc
feat: use fallback display_name if needed
bradenmacdonald Mar 15, 2024
c53963e
feat: state how long the reindex_studio command took
bradenmacdonald Mar 15, 2024
3884fa2
test: add tests for the Studio search REST API
bradenmacdonald Mar 18, 2024
6be622e
docs: added ADR
bradenmacdonald Mar 18, 2024
9b77442
chore: fix quality issue
bradenmacdonald Mar 18, 2024
91d0358
fix: handle case where there are no library/course blocks
bradenmacdonald Mar 18, 2024
0af6939
fix: hash used for ID wasn't stable
bradenmacdonald Mar 18, 2024
106cfd0
fix: breadcrumbs weren't using fallback display_names
bradenmacdonald Mar 18, 2024
1393d41
test: expand tests
bradenmacdonald Mar 18, 2024
5a1429e
fix: lint issues
bradenmacdonald Mar 19, 2024
ec27d5f
chore: update with latest master
bradenmacdonald Mar 19, 2024
69879e0
test: expand tests
bradenmacdonald Mar 19, 2024
1b5dfdf
docs: fix typo pointed out in review
bradenmacdonald Mar 19, 2024
ec0c4a8
docs: clarify code with comments, indicate command is experimental
bradenmacdonald Mar 21, 2024
1ff4b75
perf: pre-fetch all of the blocks in the course
bradenmacdonald Mar 21, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions cms/envs/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1776,6 +1776,9 @@
'openedx_tagging.core.tagging.apps.TaggingConfig',
'openedx.core.djangoapps.content_tagging',

# Search
'openedx.core.djangoapps.content.search',

'openedx.features.course_duration_limits',
'openedx.features.content_type_gating',
'openedx.features.discounts',
Expand Down Expand Up @@ -2874,3 +2877,18 @@ def _should_send_xblock_events(settings):
REST_FRAMEWORK['DEFAULT_SCHEMA_CLASS'] = 'drf_spectacular.openapi.AutoSchema'

BEAMER_PRODUCT_ID = ""

################### Studio Search (alpha, using Meilisearch) ###################

# Enable Studio search features (powered by Meilisearch) (beta, off by default)
MEILISEARCH_ENABLED = False
# Meilisearch URL that the python backend can use. Often points to another docker container or k8s service.
MEILISEARCH_URL = "http://meilisearch"
# URL that browsers (end users) can user to reach Meilisearch. Should be HTTPS in production.
MEILISEARCH_PUBLIC_URL = "http://meilisearch.example.com"
# To support multi-tenancy, you can prefix all indexes with a common key like "sandbox7-"
# and use a restricted tenant token in place of an API key, so that this Open edX instance
# can only use the index(es) that start with this prefix.
# See https://www.meilisearch.com/docs/learn/security/tenant_tokens
MEILISEARCH_INDEX_PREFIX = ""
MEILISEARCH_API_KEY = "devkey"
4 changes: 4 additions & 0 deletions cms/urls.py
Original file line number Diff line number Diff line change
Expand Up @@ -307,6 +307,10 @@
),
)

urlpatterns.append(
path('', include(('openedx.core.djangoapps.content.search.urls', 'content_search'), namespace='content_search')),
)

# display error page templates, for testing purposes
urlpatterns += [
path('404', handler404),
Expand Down
Empty file.
16 changes: 16 additions & 0 deletions openedx/core/djangoapps/content/search/apps.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
"""
Define the content search Django App.
"""

from django.apps import AppConfig


class ContentSearchConfig(AppConfig):
"""App config for the content search feature"""

default_auto_field = "django.db.models.BigAutoField"
name = "openedx.core.djangoapps.content.search"

def ready(self):
# Connect signal handlers
from . import handlers # pylint: disable=unused-import
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
Studio Content Search Powered by Meilisearch
############################################

Status
******

Draft


Context
*******

Existing search functionality
=============================

The Open edX platform currently implements many different forms of search. For
example, users can search for course content, library content, forum posts, and
more. Most of the search functionality in the core platform is powered by the
Elasticsearch search engine (though other functionality developed by 2U, such as
in edx-enterprise, is powered by Algolia).

Most uses of Elasticsearch in Open edX use
`edx-search <https://github.com/openedx/edx-search>`_ which provides a partial
abstraction over Elasticsearch. The edx-search library formerly used
`django-haystack <https://django-haystack.readthedocs.io/>`_ as an abstraction
layer across search engines, but "that was ripped out after the package was
abandoned upstream and it became an obstacle to upgrades and efficiently
utilizing Elasticsearch (the abstraction layer imposed significant limits)"
(thanks to Jeremy Bowman for this context). Due to these changes, the current
edx-search API is a mix of abstractions and direct usage of the Elasticsearch
API, which makes it confusing and difficult to work with. In addition, each
usage of edx-search has been implemented fairly differently. See
`State of edx-search <https://openedx.atlassian.net/wiki/spaces/AC/pages/3884744738/State+of+edx-search+2023>`_
for details (thanks to Andy Shultz).

Other platform components use Elasticsearch more directly:

* ``course-discovery`` and ``edx-notes-api`` do not use ``edx-search``, but are
very tied to Elasticsearch via the use of ``django-elasticsearch-dsl`` and
``django-elasticsearch-drf``.
* ``cs_comments_service`` uses Elasticsearch via the official ruby gems.

Problems with Elasticsearch
===========================

At the same time, there are many problems with the current reliance on
Elasticsearch:

1. In 2021, the license of Elasticsearch changed from Apache 2.0 to a more
restrictive license that prohibits providing "the products to others as a
managed service". Consequently, AWS forked the search engine to create
OpenSearch and no longer offers Elasticsearch as a service. This is
problematic for many Open edX operators that use AWS and prefer to avoid
any third-party services.
2. Elasticsearch is very resource-intensive and often uses more than a gigabyte
of memory just for small search use cases.
3. Elasticsearch has poor support for multi-tenancy, which multiplies the
problem of resource usage for organizations with many small Open edX sites.
4. The existing usage of edx-search/Elasticsearch routes all search requests and
result processing through edxapp (the LMS) or other IDAs, increasing the
load on those applications.

Need for Studio Search
======================

At the time of this ADR, we have a goal to implement new search functionality in
Studio, to support various course authoring workflows.

Meilisearch
===========

Meilisearch ("MAY-lee search") is a new, promising search engine that offers a
compelling alternative to Elasticsearch. It is open source, feature rich, and
very fast and memory efficient (written in Rust, uses orders of magnitude less
memory than Elasticsearch for small datasets). It has a simple API with an
official python driver, and has official integrations with the popular
Instantsearch frontend library from Algolia. It has strong support for
multi-tenancy, and allows creating restricted API keys that incorporate a user's
permissions, so that search requests can be made directly from the user to
Meilisearch, rather than routing them through Django. Initial testing has shown
it to be much more developer friendly than Elasticsearch/OpenSearch.

At the time of writing, there are only two known concerns with Meilisearch:

1. It doesn't (yet) support High Availability via replication, although this is
planned and under development. It does have other features to support high
availability, such as very low restart time (in ms).
2. It doesn't support boolean operators in keyword search ("red AND panda"),
though it does of course support boolean operators in filters. This is a
product decision aimed at keeping the user experience simple, and is unlikely
to change.


Decision
********

1. We will implement the new Studio search functionality using Meilisearch,
as an experiment and to evaluate it more thoroughly.
2. The Studio search functionality will be disabled by default in the next
Open edX release (Redwood), so that Meilisearch will not be a requirement
for any default nor existing features. This will also allow us to evaluate it
before deciding to embrace it or replace it.
3. We will keep the Meilisearch-specific code isolated to the
new ``content/search`` Django app, so it's relatively easy to swap out later
if this experiment doesn't pan out.
4. We will not use ``edx-search`` for the new search functionality.


Consequences
************

1. Organizations that wish to try out the new Studio Search functionality in
the Redwood release will have to install and configure Meilisearch.
2. Building both the backend and frontend components of the Studio search
project will be much faster and simpler than if we used ElasticSearch,
edx-search, OpenSearch, django-haystack, etc.
3. Keyword search with boolean operators will not be supported in any of the new
search features.


Alternatives Considered
***********************

OpenSearch Only
===============

Moving existing search functionality to OpenSearch is a possibility, but though
it mostly addresses the licensing issue, it doesn't solve the problems of
resource usage, API complexity, edx-search API complexity, lack of Instantsearch
integration, and poor multi-tenancy.

OpenSearch and Elasticsearch
============================

When OpenSearch was originally forked from Elasticsearch, it was completely API
compatible, but over time they have developed along divergent paths. Regardless
of whether ElasticSearch and OpenSearch are actually wire-compatible, recent
versions of all the official ElasticSearch clients have been made to actively
reject connections to OpenSearch, which is why you generally won't find client
libraries that work with both engines, and why there are OpenSearch forks of
everything on the client side as well as the server side.

As there is no ready-to-use abstraction layer that would allow us to comfortably
support both, and no interest in maintaining one ourselves, this is not an
appealing option.

Algolia
=======

Algolia is a great search engine service, but as it is a proprietary product, it
is not suitable as a requirement for an open source platform like Open edX.
Loading
Loading