Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Associate PublishableEntities with Collections [FC-0062] #216

Merged
merged 9 commits into from
Sep 3, 2024
2 changes: 1 addition & 1 deletion openedx_learning/__init__.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
"""
Open edX Learning ("Learning Core").
"""
__version__ = "0.11.2"
__version__ = "0.11.3"
115 changes: 108 additions & 7 deletions openedx_learning/apps/authoring/collections/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,19 +4,25 @@
from __future__ import annotations

from django.db.models import QuerySet
from django.db.transaction import atomic
from django.utils import timezone

from .models import Collection
from ..publishing.models import PublishableEntity
from .models import Collection, CollectionObject

# The public API that will be re-exported by openedx_learning.apps.authoring.api
# is listed in the __all__ entries below. Internal helper functions that are
# private to this module should start with an underscore. If a function does not
# start with an underscore AND it is not in __all__, that function is considered
# to be callable only by other apps in the authoring package.
__all__ = [
"add_to_collections",
"create_collection",
"get_collection",
"get_collections",
"get_learning_package_collections",
"get_object_collections",
"remove_from_collections",
"update_collection",
]

Expand All @@ -26,16 +32,27 @@ def create_collection(
title: str,
created_by: int | None,
description: str = "",
contents_qset: QuerySet[PublishableEntity] = PublishableEntity.objects.none(), # default to empty set,
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
) -> Collection:
"""
Create a new Collection
"""
collection = Collection.objects.create(
learning_package_id=learning_package_id,
title=title,
created_by_id=created_by,
description=description,
)

with atomic():
collection = Collection.objects.create(
learning_package_id=learning_package_id,
title=title,
created_by_id=created_by,
description=description,
)

added = add_to_collections(
Collection.objects.filter(id=collection.id),
contents_qset,
)
if added:
collection.refresh_from_db() # fetch updated modified date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick, optional]: It's a little weird to me that this is going to cause multiple writes to the same Collection with just slightly differing datetimes, but in the same transaction. I can't quite place my finger on why this bothers me, except some vague sense that something tracking the Collection lifecycle might get confused.

Instead of invoking add_to_collections and doing the refresh to get around the timestamp change, can we instead copy some of logic from there that does the adding of the records?

So something like:

        records_to_add = [
            CollectionPublishableEntity(
                collection_id=collection.id,
                entity_id=entity_id,
            )
            for entity_id in entities_qset.values_list("pk", flat=True)
        ]
        CollectionPublishableEntity.bulk_add(records_to_add)

Copy link
Contributor Author

@pomegranited pomegranited Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ormsbee I don't have a use case for creating a collection with components in it, I was just copying that code from your PR. Maybe it's better just to keep them as separate API calls, to avoid this duplication, and the need for a transaction?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way. I think that use case will come up when we do v1 library import functionality, but I'm fine with punting until then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing the entities_qset param from create_collection, we can add it back later if needed.


return collection


Expand Down Expand Up @@ -70,6 +87,90 @@ def update_collection(
return collection


def add_to_collections(
collections_qset: QuerySet[Collection],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question] Is it a common use case that we'll want to add to multiple Collections simultaneously?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think so, but that's what was in the ticket: openedx/modular-learning#227

@bradenmacdonald can you comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was just trying to indicate that the API needs to support having an item linked to multiple collections, but it doesn't necessarily have to support linking multiple collections to an item in a single API call. If it simplifies the implementation to only operate by adding/removing a single collection at a time, that's fine with me.

There is only one use case I can think of where we might want to associate/dissociate from multiple collections simultaneously and that's import/export. But we haven't planned that part out yet at all.

contents_qset: QuerySet[PublishableEntity],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change this to be entities_qset for better consistency (contents usually means the Content model, and may cause confusion).

) -> int:
"""
Adds a QuerySet of PublishableEntities to a QuerySet of Collections.

Records are created in bulk, and so integrity errors are deliberately ignored: they indicate that the content(s)
have already been added to the collection(s).

Returns the number of entities added (including any that already exist).
"""
collection_objects = []
object_ids = contents_qset.values_list("pk", flat=True)
collection_ids = collections_qset.values_list("pk", flat=True)

for collection_id in collection_ids:
for object_id in object_ids:
collection_objects.append(
CollectionObject(
collection_id=collection_id,
object_id=object_id,
)
)

created = []
if collection_objects:
created = CollectionObject.objects.bulk_create(
collection_objects,
ignore_conflicts=True,
)

# Update the modified date for all the provided Collections.
# (Ignoring conflicts means we don't know which ones were modified.)
collections_qset.update(modified=timezone.now())
pomegranited marked this conversation as resolved.
Show resolved Hide resolved

return len(created)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a useful number to return? If it includes the conflicts, doesn't it always return len(collections_qset) * len(contents_qset)?

Copy link
Contributor Author

@pomegranited pomegranited Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true.. What should this method return instead?

I had thought that I should filter out any entities that aren't in the same learning package as the collection, so that would change this count. But since there's potentially more than one collection/learning package, I did that filtering in the REST API.

Do you think it makes sense to enforce the "a collection may only contain entities in their learning package" here instead? It'll mean fetching more data in add_to_collections -- the Collections + PublishableEntities -- but if it's worth if if we should enforce this rule.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that's a fantastic callout. We should absolutely enforce it at the Learning Core layer, so people don't shoot themselves in the foot. It's very possible we'll eventually allow folks to add stuff from other LearningPackages into Collections, but we're going to have to be very deliberate when we do and think through the lifecycle issues when those LearningPackages disappear. We definitely don't want to let it happen by accident because other apps used Learning Core directly instead of going through the Libraries REST API.



def remove_from_collections(
collections_qset: QuerySet,
contents_qset: QuerySet,
) -> int:
"""
Removes a QuerySet of PublishableEntities from a QuerySet of Collections.

PublishableEntities are deleted from each Collection, in bulk.

Collections which had objects removed are marked with modified=now().

Returns the total number of entities deleted.
"""
total_deleted = 0
object_ids = contents_qset.values_list("pk", flat=True)
collection_ids = collections_qset.values_list("pk", flat=True)
modified_collection_ids = []

for collection_id in collection_ids:
num_deleted, _ = CollectionObject.objects.filter(
collection_id=collection_id,
object_id__in=object_ids,
).delete()

if num_deleted:
modified_collection_ids.append(collection_id)

total_deleted += num_deleted

# Update the modified date for the affected Collections
Collection.objects.filter(id__in=modified_collection_ids).update(modified=timezone.now())

return total_deleted


def get_object_collections(object_key: str) -> QuerySet[Collection]:
"""
Get all collections associated with a given PublishableEntity.key.

Only enabled collections are returned.
"""
entity = PublishableEntity.objects.get(key=object_key)
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
return entity.collections.filter(enabled=True).order_by("pk")


def get_learning_package_collections(learning_package_id: int) -> QuerySet[Collection]:
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
"""
Get all collections for a given learning package
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Generated by Django 4.2.14 on 2024-08-21 07:15

import django.db.models.deletion
from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('oel_publishing', '0002_alter_learningpackage_key_and_more'),
('oel_collections', '0002_remove_collection_name_collection_created_by_and_more'),
]

operations = [
migrations.CreateModel(
name='CollectionObject',
fields=[
('id', models.BigAutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
('collection', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='oel_collections.collection')),
('object', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='oel_publishing.publishableentity')),
],
),
migrations.AddField(
model_name='collection',
name='contents',
field=models.ManyToManyField(related_name='collections', through='oel_collections.CollectionObject', to='oel_publishing.publishableentity'),
),
migrations.AddConstraint(
model_name='collectionobject',
constraint=models.UniqueConstraint(fields=('collection', 'object'), name='oel_collections_cpe_uniq_col_obj'),
),
]
99 changes: 98 additions & 1 deletion openedx_learning/apps/authoring/collections/models.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,68 @@
"""
Core models for Collections

TLDR Guidelines:
1. DO NOT modify these models to store full version snapshots.
2. DO NOT use these models to try to reconstruct historical versions of
Collections for fast querying.

If you're trying to do either of these things, you probably want a new model or
app. For more details, read on.
Comment on lines +9 to +10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a lot of context!! Thank you! ❤️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh definitely.. @ormsbee 's notes are always worth preserving :)


The goal of these models is to provide a lightweight method of organizing
PublishableEntities. The first use case for this is modeling the structure of a
v1 Content Library within a LearningPackage. This is what we'll use the
Collection model for.

An important thing to note here is that Collections are *NOT* publishable
entities themselves. They have no "Draft" or "Published" versions. Collections
are never "published", though the things inside of them are.

When a LibraryContentBlock makes use of a Content Library, it copies all of
the items it will use into the Course itself. It will also store a version
on the LibraryContentBlock -- this is a MongoDB ObjectID in v1 and an integer in
v2 Libraries. Later on, the LibraryContentBlock will want to check back to see
if any updates have been made, using its version as a key. If a new version
exists, the course team has the option of re-copying data from the Library.

ModuleStore based v1 Libraries and Blockstore-based v2 libraries both version
the entire library in a series of snapshots. This makes it difficult to have
very large libraries, which is an explicit goal for Modular Learning. In
Learning Core, we've moved to tracking the versions of individual Components to
address this issue. But that means we no longer have a single version indicator
for "has anything here changed"?

We *could* have put that version in the ``publishing`` app's PublishLog, but
that would make it too broad. We want the ability to eventually collapse many v1
Libraries into a single Learning Core backed v2 Library. If we tracked the
versioning in only a central location, then we'd have many false positives where
the version was bumped because something else in the Learning Package changed.
So instead, we're creating a new Collection model inside the LearningPackage to
track that concept.

A critical takeaway is that we don't have to store snapshots of every version of
a Collection, because that data has been copied over by the LibraryContentBlock.
We only need to store the current state of the Collection, and increment the
version numbers when changes happen. This will allow the LibraryContentBlock to
check in and re-copy over the latest version if the course team desires.

That's why these models only store the current state of a Collection. Unlike the
``components`` app, ``collections`` does not store fully materialized snapshots
of past versions. This is done intentionally in order to save space and reduce
the cost of writes. Collections may grow to be very large, and we don't want to
be writing N rows with every version, where N is the number of
PublishableEntities in a Collection.

MVP of these models does not store changesets, but we can add this when there's a
use case for it. The number of rows in these changesets would grow in proportion
to the number of things that are actually changing (instead of copying over
everything on every version). This is could be used to make it easier to figure out
what changed between two given versions of a Collection. A LibraryContentBlock
in a course would have stored the version number of the last time it copied data
from the Collection, and we can eventually surface this data to the user. Note that
while it may be possible to reconstruct past versions of Collections based off of
this changeset data, it's going to be a very slow process to do so, and it is
strongly discouraged.
"""
from __future__ import annotations

Expand All @@ -9,10 +72,11 @@

from ....lib.fields import MultiCollationTextField, case_insensitive_char_field
from ....lib.validators import validate_utc_datetime
from ..publishing.models import LearningPackage
from ..publishing.models import LearningPackage, PublishableEntity

__all__ = [
"Collection",
"CollectionObject",
]


Expand Down Expand Up @@ -79,6 +143,12 @@ class Collection(models.Model):
],
)

contents: models.ManyToManyField[PublishableEntity, "CollectionObject"] = models.ManyToManyField(
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
PublishableEntity,
through="CollectionObject",
related_name="collections",
)

class Meta:
verbose_name_plural = "Collections"
indexes = [
Expand All @@ -96,3 +166,30 @@ def __str__(self) -> str:
User-facing string representation of a Collection.
"""
return f"<{self.__class__.__name__}> ({self.id}:{self.title})"


class CollectionObject(models.Model):
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
"""
Collection -> PublishableEntity association.
"""
collection = models.ForeignKey(
Collection,
on_delete=models.CASCADE,
)
object = models.ForeignKey(
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
PublishableEntity,
on_delete=models.CASCADE,
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
)

pomegranited marked this conversation as resolved.
Show resolved Hide resolved
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
class Meta:
constraints = [
# Prevent race conditions from making multiple rows associating the
# same Collection to the same Entity.
models.UniqueConstraint(
fields=[
"collection",
"object",
],
name="oel_collections_cpe_uniq_col_obj",
)
]
9 changes: 9 additions & 0 deletions openedx_learning/apps/authoring/collections/readme.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,3 +12,12 @@ had to create many small libraries.

For the Libraries Relaunch ("v2"), we want to encourage people to create large libraries with lots of content,
and organize that content using tags and collections.

Architecture Guidelines
-----------------------

Things to remember:

* Collections may grow very large.
* Collections are not publishable in and of themselves.
* Collections link to PublisableEntity records, **not** PublishableEntityVersion records.
pomegranited marked this conversation as resolved.
Show resolved Hide resolved
Loading