feat: incremental reindex_studio management command #35864

DanielVZ96 · 2024-11-15T13:11:05Z

Description

Adds an incremental mode to the reindex_studio management command, along with a few other utilities for managing the index: reset and init.

Supporting information

Refactor reindex_studio management command to support large instances modular-learning#235

Testing instructions

First, open the home page of any library, e.g. http://apps.local.openedx.io:2001/course-authoring/library/lib:test:2test

Using tutor, exec into the cms instance: tutor dev exec cms bash
Verify that the command requires the experimental flag: ./manage.py cms reindex_studio should do nothing, and only reindex if you pass the --experimental flag
Test out the different flags for the command
- ./manage.py cms reindex_studio --experimental --init: Should do nothing since the index already exists.
- ./manage.py cms reindex_studio --experimental --reset: Should recreate an empty index. This means that searching on the Studio page should not return any results.
- Run and interrupt ./manage.py cms reindex_studio --experimental --incremental right after it finishes indexing collections. Run it again and assert that it continues from where it was interrupted. Try searching content now.
- Run ./manage.py cms reindex_studio --experimental and verify that the index is recreated by searching for blocks after it finishes. Also assert that no records of an in-progress incremental update are left lingering: ./manage.py cms shell -c 'from openedx.core.djangoapps.content.search.models import IncrementalIndexCompleted; print(IncrementalIndexCompleted.objects.all())'
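
The flag combinations exercised above can be summarized as a small dispatch function. This is only a sketch using plain argparse: `choose_action`, the action names it returns, and the precedence between `--reset`, `--init` and `--incremental` are assumptions made for illustration. The real command is a Django management command whose internals are not shown in this description.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser with the four flags named in the testing instructions."""
    parser = argparse.ArgumentParser(prog="reindex_studio")
    parser.add_argument("--experimental", action="store_true")
    parser.add_argument("--init", action="store_true")
    parser.add_argument("--reset", action="store_true")
    parser.add_argument("--incremental", action="store_true")
    return parser


def choose_action(argv: List[str]) -> str:
    """Map command-line flags to a (hypothetical) action name."""
    args = build_parser().parse_args(argv)
    if not args.experimental:
        # Without --experimental the command is expected to do nothing.
        return "noop"
    if args.reset:
        return "reset"
    if args.init:
        return "init"
    if args.incremental:
        return "incremental_rebuild"
    return "full_rebuild"
```

For example, `choose_action(["--experimental", "--incremental"])` returns `"incremental_rebuild"`, while `choose_action([])` returns `"noop"`.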

Deadline

Asap

Other information

Private-Ref: https://tasks.opencraft.com/browse/FAL-3902

openedx-webhooks · 2024-11-15T13:11:10Z

Thanks for the pull request, @DanielVZ96!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
- This process (including the steps you'll need to take) is documented here.
If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

- Dependencies: This PR must be merged before / after / at the same time as ...
- Blockers: This PR is waiting for OEP-1234 to be accepted.
- Timeline information: This PR must be merged by XX date because ...
- Partner information: This is for a course on edx.org.
- Supporting documentation
- Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently maintained by @openedx/wg-maintenance-edx-platform. Tag them in a comment and let them know that your changes are ready for review.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

- The size and impact of the changes that it introduces
- The need for product review
- Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

ChrisChV · 2024-11-15T16:48:36Z

openedx/core/djangoapps/content/search/tests/test_api.py

+        assert IncrementalIndexCompleted.objects.all().count() == 1
+        api.rebuild_index(incremental=True)
+        assert IncrementalIndexCompleted.objects.all().count() == 0
+        assert mock_meilisearch.return_value.index.return_value.add_documents.call_count == 7


@DanielVZ96 could you add a comment here? It is not easy at first glance to realize why this call_count is 7

DanielVZ96 · 2024-11-16

I modified the mocking mechanism to make it easier to understand and also added some comments.
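
As background for why such a count is hard to read at a glance: with `unittest.mock`, every `client.index(...)` call returns the same `index.return_value`, so `add_documents.call_count` accumulates across all index names and all batches. The sketch below only illustrates that behavior. `index_in_batches` and the index names are hypothetical, and it does not reproduce the PR's test or explain the specific value 7.

```python
from unittest.mock import MagicMock

# Stand-in for the patched client class used in the test.
mock_meilisearch = MagicMock()


def index_in_batches(client_cls, index_name, batches):
    """Hypothetical helper: send each batch to the named index."""
    client = client_cls()
    for batch in batches:
        client.index(index_name).add_documents(batch)


# Two batches go to one index, one batch to another.
index_in_batches(mock_meilisearch, "index_a", [[{"id": 1}], [{"id": 2}]])
index_in_batches(mock_meilisearch, "index_b", [[{"id": 3}]])

# The same child mock records all three calls, regardless of index name.
add_documents = mock_meilisearch.return_value.index.return_value.add_documents
total_calls = add_documents.call_count
```

Inspecting `add_documents.call_args_list` in a failing test shows which batches contributed to the total.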

ChrisChV · 2024-11-15T17:02:39Z

openedx/core/djangoapps/content/search/api.py

+        status_cb = log.info
+
+    status_cb("Creating new empty index...")
+    with _using_temp_index(status_cb) as temp_index_name:


@DanielVZ96 @bradenmacdonald @pomegranited I am concerned about the functionality of this reset.

If we need to create and populate a new index on a large instance, we need to run --reset, which removes the old index, and then run --incremental. However, the search results will be broken while the new index is being populated.

Wouldn't it be better to add an option to --incremental so that it creates a temporary index and does a swap, so as not to break the search results, as the non-incremental form does?

bradenmacdonald · 2024-11-15

@ChrisChV I'm assuming that most times an administrator is running this incremental build, they either have no existing index or their existing index is from an old version and is missing some configuration/columns/etc. (so it actually is broken). In that case, removing the old index and starting an incremental build will result in incomplete search results for a while, but the search will be working without errors. Also, the incremental index rebuilds the newest courses first, so the results should fill in relatively quickly.

I think for large instances, (where the reindex can take several days) it's better to have a working search with incomplete results, than to have a totally broken search (that displays errors because the old index does not exist or has the wrong configuration).

For Teak I would like to find a way to simplify this, maybe by only having the incremental option and allowing it to be either using a temporary index or not. But I think we'll need to see how this works first and hear from people testing it out.

ChrisChV · 2024-11-18

OK, that's fine for me 👍
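
The trade-off discussed in this thread can be illustrated with a toy in-memory model. This is a sketch only: `ToySearchServer` and both functions are hypothetical stand-ins, not the code in api.py. Each function returns what a search against the live index would see after every step.

```python
class ToySearchServer:
    """In-memory stand-in for a search server: index name -> list of documents."""

    def __init__(self):
        self.indexes = {}

    def swap(self, first, second):
        self.indexes[first], self.indexes[second] = (
            self.indexes[second],
            self.indexes[first],
        )


def visible_during_swap_rebuild(server, live, docs):
    """Build into a temporary index, then swap it with the live one."""
    temp = live + "_new"
    server.indexes[temp] = []
    seen = []
    for doc in docs:
        server.indexes[temp].append(doc)
        # Searches still hit the old live index while the new one fills up.
        seen.append(list(server.indexes[live]))
    server.swap(live, temp)
    del server.indexes[temp]
    seen.append(list(server.indexes[live]))
    return seen


def visible_during_reset_then_incremental(server, live, docs):
    """Empty the live index first, then fill it in place."""
    server.indexes[live] = []
    seen = [[]]
    for doc in docs:
        server.indexes[live].append(doc)
        # Searches work at every step but return partial results.
        seen.append(list(server.indexes[live]))
    return seen
```

With a live index containing `["old"]` and new documents `["a", "b"]`, the swap approach shows `["old"]` until the final step, whereas reset followed by an in-place build shows `[]`, then `["a"]`, then `["a", "b"]`.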

ChrisChV

@DanielVZ96 Looks good 👍 Some nits:

- I tested this: I followed the testing instructions
- I read through the code and considered the security, stability and performance implications of the changes.
- Includes tests for bugfixes and/or features added.
- Includes documentation

When running ./manage.py cms reindex_studio --experimental --init with the existing index, the message is lost in the logs. Is it possible to set the output message to have a different color, yellow or red?

DanielVZ96 · 2024-11-18T23:36:09Z

> When running ./manage.py cms reindex_studio --experimental --init with the existing index, the message is lost in the logs, is it possible to set the output message to have a different color, yellow or red?

Nice nit. I'll send it to stderr then
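
A minimal sketch of what routing the warning to stderr could look like, assuming a plain callback rather than the command's actual implementation. `make_warn_cb` and the message text are hypothetical.

```python
import io
import sys
from typing import Callable, Optional, TextIO


def make_warn_cb(stream: Optional[TextIO] = None) -> Callable[[str], None]:
    """Return a callback that writes warnings to `stream` (stderr by default)."""

    def warn_cb(message: str) -> None:
        # Resolve sys.stderr at call time so later redirection is respected.
        target = stream if stream is not None else sys.stderr
        print(message, file=target)

    return warn_cb


# Capture the output in memory to show where the message ends up.
captured = io.StringIO()
make_warn_cb(captured)("The index already exists; nothing to do.")
```

Inside a Django management command, `self.stderr.write(...)` combined with `self.style.WARNING(...)` is one way to get both the separate stream and the color that was asked about.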

bradenmacdonald · 2024-11-25T19:49:17Z

openedx/core/djangoapps/content/search/api.py

+    if _index_exists(STUDIO_INDEX_NAME):
+        warn_cb(
+            "A rebuild of the index is required. Please run ./manage.py cms reindex_studio"
+            " --experimental [--incremental]"
+        )
+        return


Hmm, I realize that we shouldn't actually print this every time if the index exists. What we really want is to print it only if the index is empty or the index configuration (schema) is outdated. Sorry for not realizing this sooner.

Suggested change

-    if _index_exists(STUDIO_INDEX_NAME):
-        warn_cb(
-            "A rebuild of the index is required. Please run ./manage.py cms reindex_studio"
-            " --experimental [--incremental]"
-        )
-        return
+    if _index_exists(STUDIO_INDEX_NAME):
+        if _index_is_empty(STUDIO_INDEX_NAME):
+            warn_cb(
+                "The studio search index is empty. Please run ./manage.py cms reindex_studio"
+                " --experimental [--incremental]"
+            )
+            return
+        elif get_index_version(STUDIO_INDEX_NAME) < CURRENT_INDEX_SCHEMA_VERSION:
+            warn_cb(
+                "A rebuild of the index is required. Please run ./manage.py cms reindex_studio"
+                " --experimental [--incremental]"
+            )
+            return

and something like this:

    CURRENT_INDEX_SCHEMA_VERSION = 20241021

    def get_index_version(index_name: str) -> int:
        # use the get index settings endpoint to get the current settings:
        # https://www.meilisearch.com/docs/reference/api/settings#get-settings
        current_settings = ...
        # Then compare the settings to determine the index schema version:
        if (Fields.published + "." + Fields.display_name) in current_settings["searchableAttributes"]:
            # We call this "version 20241021", including this change:
            # https://github.com/openedx/edx-platform/commit/fb25a5d635c0f2650d514454927da666b057aa39
            return 20241021
        return 0

DanielVZ96 · 2024-11-28

@bradenmacdonald Done. I did something slightly different for checking that the schema is up to date. Let me know what you think.
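
The thread does not show which alternative was implemented. One possible approach, given here purely as an assumption, is to compare the settings the code expects against the settings reported by the server instead of deriving a version number. `settings_outdated`, `rebuild_reason` and the reason strings are hypothetical names, and the sketch works on plain dictionaries rather than a live Meilisearch client.

```python
from typing import Any, Mapping, Optional


def settings_outdated(expected: Mapping[str, Any], current: Mapping[str, Any]) -> bool:
    """True if any expected setting is missing from, or differs in, `current`."""
    return any(current.get(key) != value for key, value in expected.items())


def rebuild_reason(
    document_count: int,
    expected: Mapping[str, Any],
    current: Mapping[str, Any],
) -> Optional[str]:
    """Return why a rebuild is needed, or None if the index looks usable."""
    if document_count == 0:
        return "empty"
    if settings_outdated(expected, current):
        return "outdated"
    return None
```

Note that this compares lists in order. That is appropriate for searchable attributes, whose order expresses importance, but may be too strict for settings where the server is free to return entries in a different order.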

bradenmacdonald · 2024-11-25T19:56:57Z

openedx/core/djangoapps/content/search/api.py

@@ -473,10 +535,16 @@ def add_with_children(block):
                status_cb(
                    f"{num_contexts_done + 1}/{num_contexts}. Now indexing course {course.display_name} ({course.id})"
                )
+                if course.id in keys_indexed:
+                    num_contexts_done += 1


Skipped courses are still included in the total count and in num_contexts_done. I like that. I think we should do the same for skipped (already-indexed) libraries. Currently, they're excluded altogether and won't be included in the num_contexts_done count.
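
The counting behavior requested here can be sketched as a single loop that treats libraries and courses alike. This is an illustration under assumptions, not the code in api.py: `index_contexts` and `index_one` are hypothetical, and `keys_indexed` stands in for whatever record the incremental mode keeps of completed contexts.

```python
from typing import Callable, Iterable, Set


def index_contexts(
    context_keys: Iterable[str],
    keys_indexed: Set[str],
    index_one: Callable[[str], None],
    status_cb: Callable[[str], None] = print,
) -> int:
    """Index every context not yet done; skipped ones still count as done."""
    keys = list(context_keys)
    num_contexts = len(keys)
    num_contexts_done = 0
    for key in keys:
        status_cb(f"{num_contexts_done + 1}/{num_contexts}. Now indexing {key}")
        if key in keys_indexed:
            # Already indexed in a previous run: count it, but do no work.
            num_contexts_done += 1
            continue
        index_one(key)
        keys_indexed.add(key)
        num_contexts_done += 1
    return num_contexts_done
```

Because skipped contexts stay in both the total and the running count, the progress messages read the same on a resumed run as on a fresh one.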

bradenmacdonald · 2024-11-25T19:57:39Z

@DanielVZ96 Very nice work here! I have a couple requests but I'm very happy with how this is looking.

bradenmacdonald

Nice, thanks! I like that approach you took. Just a few more tweaks I'm suggesting.

bradenmacdonald · 2024-11-28T17:44:18Z

openedx/core/djangoapps/content/search/api.py

@@ -62,6 +62,65 @@

 EXCLUDED_XBLOCK_TYPES = ['course', 'course_info']

+INDEX_DISTINCT_ATTRIBUTE = "usage_key"


Nit: I like how you separated out these INDEX_ settings, but I don't think they really belong in our public api.py file. What do you think about putting them in a new index_config.py module?

bradenmacdonald · 2024-11-28T17:44:57Z

openedx/core/djangoapps/content/search/api.py

@@ -62,6 +62,65 @@

 EXCLUDED_XBLOCK_TYPES = ['course', 'course_info']

+INDEX_DISTINCT_ATTRIBUTE = "usage_key"
+INDEX_FILTRABLE_ATTRIBUTES = [


Suggested change

-INDEX_FILTRABLE_ATTRIBUTES = [
+# Mark which attributes can be used for filtering/faceted search:
+INDEX_FILTERABLE_ATTRIBUTES = [

bradenmacdonald · 2024-11-28T17:46:52Z

openedx/core/djangoapps/content/search/api.py

+    Fields.modified,
+    Fields.last_published,
+]
+INDEX_RANKING_RULES = [


This was very important context - without this comment it's unclear why this is here and what it's doing.

Suggested change

-INDEX_RANKING_RULES = [
+# Update the search ranking rules to let the (optional) "sort" parameter take precedence over keyword relevance.
+# cf https://www.meilisearch.com/docs/learn/core_concepts/relevancy
+INDEX_RANKING_RULES = [
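
To make the comment above concrete: Meilisearch applies ranking rules in list order, and by default "sort" comes after several relevance rules. Moving it to the front lets an explicit sort parameter win over keyword relevance. The sketch below assumes the default rule order documented by Meilisearch. `prioritize_sort` is a hypothetical helper, and the PR's actual INDEX_RANKING_RULES value is not shown in this excerpt.

```python
from typing import List

# Default ranking rule order, as documented by Meilisearch.
DEFAULT_RANKING_RULES = ["words", "typo", "proximity", "attribute", "sort", "exactness"]


def prioritize_sort(rules: List[str]) -> List[str]:
    """Return the rules with "sort" moved to the front, others kept in order."""
    return ["sort"] + [rule for rule in rules if rule != "sort"]


sort_first_rules = prioritize_sort(DEFAULT_RANKING_RULES)
```

With "sort" first, results are ordered by the requested sort field, and the remaining rules only break ties.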

bradenmacdonald · 2024-11-28T17:47:06Z

openedx/core/djangoapps/content/search/api.py

+    Fields.published + "." + Fields.display_name,
+    Fields.published + "." + Fields.published_description,
+]
+INDEX_SORTABLE_ATTRIBUTES = [


Suggested change

-INDEX_SORTABLE_ATTRIBUTES = [
+# Mark which attributes can be used for sorting search results:
+INDEX_SORTABLE_ATTRIBUTES = [

bradenmacdonald · 2024-11-28T17:47:21Z

openedx/core/djangoapps/content/search/api.py

+    Fields.last_published,
+    Fields.content + "." + Fields.problem_types,
+]
+INDEX_SEARCHABLE_ATTRIBUTES = [


Suggested change

-INDEX_SEARCHABLE_ATTRIBUTES = [
+# Mark which attributes are used for keyword search, in order of importance:
+INDEX_SEARCHABLE_ATTRIBUTES = [

bradenmacdonald · 2024-11-28T17:48:10Z

openedx/core/djangoapps/content/search/api.py

+    # Mark usage_key as unique (it's not the primary key for the index, but nevertheless must be unique):
+    client.index(index_name).update_distinct_attribute(INDEX_DISTINCT_ATTRIBUTE)
+    # Mark which attributes can be used for filtering/faceted search:
+    client.index(index_name).update_filterable_attributes(INDEX_FILTRABLE_ATTRIBUTES)


Suggested change

-    client.index(index_name).update_filterable_attributes(INDEX_FILTRABLE_ATTRIBUTES)
+    client.index(index_name).update_filterable_attributes(INDEX_FILTERABLE_ATTRIBUTES)

openedx-webhooks added the open-source-contribution label (PR author is not from Axim or 2U) on Nov 15, 2024

DanielVZ96 force-pushed the dvz/refactor-reindex-studio branch from 9d7329c to 5ffe2d5 on November 15, 2024 13:21

feat: incremental reindex_studio management command

bd37e4c

DanielVZ96 force-pushed the dvz/refactor-reindex-studio branch from 5ffe2d5 to bd37e4c on November 15, 2024 13:22

ChrisChV reviewed Nov 15, 2024


fix: tests, linting and formatting

776b1d1

DanielVZ96 force-pushed the dvz/refactor-reindex-studio branch from 24da2c3 to 776b1d1 on November 16, 2024 04:37

ChrisChV approved these changes Nov 18, 2024


fix: improve output of init_index

de1d918

DanielVZ96 requested a review from bradenmacdonald November 18, 2024 23:37

bradenmacdonald reviewed Nov 25, 2024


DanielVZ96 requested a review from bradenmacdonald November 28, 2024 04:29

fix: address bradens comments

73957b6

DanielVZ96 force-pushed the dvz/refactor-reindex-studio branch from 0bf5368 to 73957b6 on November 28, 2024 04:35

fix: settings name overshadow

084c2f6

bradenmacdonald reviewed Nov 28, 2024




