Update master branch to ES6 only (#36)
This commit updates the master branch to require ES6 and above. It formally deprecates ES5 and ES2, which will now live on in named branches only. Aside from documentation and dependency version updates, this commit also creates a separate code path for partial document updates: instead of a single as_search_document method that takes an optional update_fields kwarg, there are now two separate methods, as_search_document and as_search_document_update. Only the latter takes the update_fields arg, which is mandatory. There are examples in the README, so the recommendation is to start there.

One other addition is that the repo has been "Blackened", so the formatting of the app's *.py files should now be consistent.
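
For illustration, a minimal sketch of the new split (the model, its fields, and the ``elasticsearch_django.models`` import path are assumptions for this example - see the README for canonical usage):

.. code:: python

    from django.db import models

    from elasticsearch_django.models import SearchDocumentMixin


    class MyModel(SearchDocumentMixin, models.Model):
        """Hypothetical model used to illustrate the two-method API."""

        title = models.CharField(max_length=100)
        body = models.TextField()

        def as_search_document(self, index):
            # full document - used when indexing the complete object
            return {"title": self.title, "body": self.body}

        def as_search_document_update(self, index, update_fields):
            # partial document - the update_fields arg is now mandatory
            return {f: getattr(self, f) for f in update_fields}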
hugorodgerbrown authored Jan 17, 2019
1 parent 681192a commit 4ac14cc
Showing 13 changed files with 306 additions and 240 deletions.
4 changes: 4 additions & 0 deletions Pipfile
@@ -6,9 +6,13 @@ verify_ssl = true
[dev-packages]
tox = "*"
coverage = "*"
black = "*"

[packages]
elasticsearch-django = {path = ".",editable = true}

[requires]
python_version = "3.7"

[pipenv]
allow_prereleases = true
65 changes: 60 additions & 5 deletions README.rst
@@ -11,9 +11,7 @@ Elasticsearch for Django

This is a lightweight Django app for people who are using Elasticsearch with Django, and want to manage their indexes.

-**NB the master branch is now based on ES5.x. If you are using ES2.x, please switch to the ES2 branch (released on PyPI as 2.x)**
-
-Status
**NB the master branch is now based on ES6. If you are using ES2/ES5, please switch to the relevant branch (released on PyPI as 2.x, 5.x)**

----

@@ -104,13 +102,70 @@ So far we have configured Django to know the names of the indexes we want, and t

**SearchDocumentMixin**

-This mixin must be implemented by the model itself, and it requires a single method implementation - ``as_search_document()``. This should return a dict that is the index representation of the object; the ``index`` kwarg can be used to provide different representations for different indexes. By default this is ``_all`` which means that all indexes receive the same document for a given object.
This mixin is responsible for the search index document format. We are indexing JSON representations of each object, and we have two methods on the mixin responsible for outputting the correct format - ``as_search_document`` and ``as_search_document_update``.

An aside on the mechanics of the ``auto_sync`` process: it is hooked up using Django's ``post_save`` and ``post_delete`` model signals. ES supports partial updates to documents that already exist, and we make a fundamental assumption about indexing models - that **if you pass the ``update_fields`` kwarg to a ``model.save`` method call, then you are performing a partial update**, and this will be propagated to ES as a partial update only.

To this end, we have two methods for generating the model's JSON representation - ``as_search_document``, which should return a dict that represents the entire object; and ``as_search_document_update``, which takes the ``update_fields`` kwarg, and should return a dict that contains just the fields that are being updated. (NB this is not automated as the field being updated may itself be an object, which needs specific formatting - see below for an example).

To better understand this, let us say that we have a model (``MyModel``) that is configured to be included in an index called ``myindex``. If we save an object without passing ``update_fields``, then this is considered a full document update, which triggers the object's ``index_search_document`` method:

.. code:: python

    obj = MyModel.objects.first()
    obj.save()
    ...
    # AUTO_SYNC=true will trigger a re-index of the complete object document:
    obj.index_search_document(index='myindex')

However, if we only want to update a single field (say the ``timestamp``), and we pass this to the save method, then this will trigger the ``update_search_document`` method, passing in the names of the fields that we want updated.

.. code:: python

    # save a single field on the object
    obj.save(update_fields=['timestamp'])
    ...
    # AUTO_SYNC=true will trigger a partial update of the object document
    obj.update_search_document(index, update_fields=['timestamp'])

We pass the name of the index being updated as the first arg, as objects may have different representations in different indexes:

.. code:: python

-   def as_search_document(self, index='_all'):
    def as_search_document(self, index):
        return {'name': "foo"} if index == 'foo' else {'name': "bar"}

In the case of the second method, the simplest possible implementation would be a dictionary containing the names of the fields being updated and their new values. However, if the value is itself an object, it will have to be serialized properly. As an example:

.. code:: python

    def as_search_document_update(self, index, update_fields):
        # create a basic attr: value dict from the fields that were updated
        doc = {f: getattr(self, f) for f in update_fields}
        # if the 'user' attr was updated, we need to convert this from
        # a User object to something ES can index - in this case just the full name
        # (NB GDPR klaxon sounding at this point - do you have permission to do this?)
        if 'user' in doc:
            doc['user'] = self.user.get_full_name()
        return doc

The reason we have split the update out from the full-document index comes from a real problem that we ourselves suffered. The full object representation that we were using was quite DB-intensive - we were storing properties of the model that required walking the ORM tree. However, because we were also touching the objects (see below) to record activity timestamps, we ended up flooding the database with queries simply to update a single field in the output document. Partial updates solve this issue:

.. code:: python

    def touch(self):
        self.timestamp = now()
        self.save(update_fields=['timestamp'])

    def as_search_document_update(self, index, update_fields):
        if update_fields == ['timestamp']:
            # only propagate changes if it's +1hr since the last timestamp change
            if now() - self.timestamp < timedelta(hours=1):
                return {}
            else:
                return {'timestamp': self.timestamp}
        ...

**SearchDocumentManagerMixin**

This mixin must be implemented by the model's default manager (``objects``). It also requires a single method implementation - ``get_search_queryset()`` - which returns a queryset of objects that are to be indexed. This can also use the ``index`` kwarg to provide different sets of objects to different indexes.
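
A minimal sketch (``is_published`` is a hypothetical field used purely for illustration, and the ``elasticsearch_django.models`` import path is assumed):

.. code:: python

    from django.db import models

    from elasticsearch_django.models import SearchDocumentManagerMixin


    class MyModelManager(SearchDocumentManagerMixin, models.Manager):

        def get_search_queryset(self, index='_all'):
            # only include objects that should be searchable;
            # different indexes could apply different filters here
            return self.get_queryset().filter(is_published=True)
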
4 changes: 2 additions & 2 deletions elasticsearch_django/__init__.py
@@ -1,3 +1,3 @@
-default_app_config = 'elasticsearch_django.apps.ElasticAppConfig'
default_app_config = "elasticsearch_django.apps.ElasticAppConfig"

-__version__ = '5.3'
__version__ = "6.0"
69 changes: 30 additions & 39 deletions elasticsearch_django/admin.py
@@ -19,52 +19,40 @@ def pprint(data):
    until someone builds a custom syntax function.
    """
-    pretty = json.dumps(
-        data,
-        sort_keys=True,
-        indent=4,
-        separators=(',', ': ')
-    )
    pretty = json.dumps(data, sort_keys=True, indent=4, separators=(",", ": "))
    html = pretty.replace(" ", "&nbsp;").replace("\n", "<br>")
    return mark_safe("<code>%s</code>" % html)


class SearchQueryAdmin(admin.ModelAdmin):

    list_display = (
-        'id',
-        'user',
-        'search_terms_',
-        'total_hits',
-        'returned_',
-        'min_',
-        'max_',
-        'reference',
-        'executed_at',
-    )
-    list_filter = (
-        'index',
-    )
-    search_fields = (
-        'search_terms',
-        'user__first_name',
-        'user__last_name',
-        'reference'
        "id",
        "user",
        "search_terms_",
        "total_hits",
        "returned_",
        "min_",
        "max_",
        "reference",
        "executed_at",
    )
    list_filter = ("index",)
    search_fields = ("search_terms", "user__first_name", "user__last_name", "reference")
    # excluding because we are using a pretty version instead
-    exclude = ('hits', 'query', 'page')
    exclude = ("hits", "query", "page")
    readonly_fields = (
-        'user',
-        'index',
-        'search_terms',
-        'total_hits',
-        'returned_',
-        'min_',
-        'max_',
-        'duration',
-        'query_',
-        'hits_',
-        'executed_at',
        "user",
        "index",
        "search_terms",
        "total_hits",
        "returned_",
        "min_",
        "max_",
        "duration",
        "query_",
        "hits_",
        "executed_at",
    )

    def search_terms_(self, instance):
@@ -79,18 +67,20 @@ def query_(self, instance):

    def max_(self, instance):
        """Pretty version of max_score."""
-        return '-' if instance.page_size == 0 else instance.max_score
        return "-" if instance.page_size == 0 else instance.max_score

    max_.short_description = "Max score"

    def min_(self, instance):
        """Pretty version of min_score."""
-        return '-' if instance.page_size == 0 else instance.min_score
        return "-" if instance.page_size == 0 else instance.min_score

    min_.short_description = "Min score"

    def returned_(self, instance):
        """Number of hits returned in the page."""
        if instance.page_size == 0:
-            return '-'
            return "-"
        else:
            return "%i - %i" % (instance.page_from, instance.page_to)

@@ -100,4 +90,5 @@ def hits_(self, instance):
"""Pretty version of hits JSON."""
return pprint(instance.hits)


admin.site.register(SearchQuery, SearchQueryAdmin)
40 changes: 23 additions & 17 deletions elasticsearch_django/apps.py
@@ -13,15 +13,15 @@ class ElasticAppConfig(AppConfig):

"""AppConfig for Search3."""

-    name = 'elasticsearch_django'
    name = "elasticsearch_django"
    verbose_name = "Elasticsearch"
    configs = []

    def ready(self):
        """Validate config and connect signals."""
        super(ElasticAppConfig, self).ready()
-        _validate_config(settings.get_setting('strict_validation'))
-        if settings.get_setting('auto_sync'):
        _validate_config(settings.get_setting("strict_validation"))
        if settings.get_setting("auto_sync"):
            _connect_signals()
        else:
            logger.debug("SEARCH_AUTO_SYNC has been disabled.")
@@ -48,11 +48,9 @@ def _validate_mapping(index, strict=False):

def _validate_model(model):
    """Check that a model configured for an index subclasses the required classes."""
-    if not hasattr(model, 'as_search_document'):
-        raise ImproperlyConfigured(
-            "'%s' must implement `as_search_document`." % model
-        )
-    if not hasattr(model.objects, 'get_search_queryset'):
    if not hasattr(model, "as_search_document"):
        raise ImproperlyConfigured("'%s' must implement `as_search_document`." % model)
    if not hasattr(model.objects, "get_search_queryset"):
        raise ImproperlyConfigured(
            "'%s.objects must implement `get_search_queryset`." % model
        )
@@ -68,22 +66,30 @@ def _connect_signals():
def _connect_model_signals(model):
    """Connect signals for a single model."""
    if settings.auto_sync(model):
-        dispatch_uid = '%s.post_save' % model._meta.model_name
        dispatch_uid = "%s.post_save" % model._meta.model_name
        logger.debug("Connecting search index model sync signal: %s", dispatch_uid)
-        signals.post_save.connect(_on_model_save, sender=model, dispatch_uid=dispatch_uid)
-        dispatch_uid = '%s.post_delete' % model._meta.model_name
        signals.post_save.connect(
            _on_model_save, sender=model, dispatch_uid=dispatch_uid
        )
        dispatch_uid = "%s.post_delete" % model._meta.model_name
        logger.debug("Connecting search index model sync signal: %s", dispatch_uid)
-        signals.post_delete.connect(_on_model_delete, sender=model, dispatch_uid=dispatch_uid)
        signals.post_delete.connect(
            _on_model_delete, sender=model, dispatch_uid=dispatch_uid
        )
    else:
-        logger.debug("Auto sync disabled for '%s', ignoring update.", model._meta.model_name)
        logger.debug(
            "Auto sync disabled for '%s', ignoring update.", model._meta.model_name
        )


def _on_model_save(sender, **kwargs):
    """Update document in search index post_save."""
-    update_fields = kwargs['update_fields']
-    instance = kwargs['instance']
    update_fields = kwargs["update_fields"]
    instance = kwargs["instance"]
    for index in instance.search_indexes:
-        _update_search_index(instance=instance, index=index, update_fields=update_fields)
        _update_search_index(
            instance=instance, index=index, update_fields=update_fields
        )


def _update_search_index(*, instance, index, update_fields):
@@ -106,7 +112,7 @@ def _on_model_delete(sender, **kwargs):
    the object will no longer exist in the database.
    """
-    instance = kwargs['instance']
    instance = kwargs["instance"]
    for index in instance.search_indexes:
        try:
            instance.delete_search_document(index=index)
36 changes: 19 additions & 17 deletions elasticsearch_django/index.py
@@ -2,12 +2,7 @@

from elasticsearch import helpers

-from .settings import (
-    get_setting,
-    get_index_mapping,
-    get_index_models,
-    get_client
-)
from .settings import get_setting, get_index_mapping, get_index_models, get_client

logger = logging.getLogger(__name__)

@@ -27,8 +22,8 @@ def update_index(index):
    for model in get_index_models(index):
        logger.info("Updating search index model: '%s'", model.search_doc_type)
        objects = model.objects.get_search_queryset(index).iterator()
-        actions = bulk_actions(objects, index=index, action='index')
-        response = helpers.bulk(client, actions, chunk_size=get_setting('chunk_size'))
        actions = bulk_actions(objects, index=index, action="index")
        response = helpers.bulk(client, actions, chunk_size=get_setting("chunk_size"))
        responses.append(response)
    return responses

@@ -68,11 +63,15 @@ def prune_index(index):
                prunes.append(obj)
        logger.info(
            "Found %s objects of type '%s' for deletion from '%s'.",
-            len(prunes), model, index
            len(prunes),
            model,
            index,
        )
        if len(prunes) > 0:
-            actions = bulk_actions(prunes, index, 'delete')
-            response = helpers.bulk(client, actions, chunk_size=get_setting('chunk_size'))
            actions = bulk_actions(prunes, index, "delete")
            response = helpers.bulk(
                client, actions, chunk_size=get_setting("chunk_size")
            )
            responses.append(response)
    return responses

@@ -98,18 +97,17 @@ def _prune_hit(hit, model):
    a 'delete' action.
    """
-    hit_id = hit['_id']
-    hit_index = hit['_index']
    hit_id = hit["_id"]
    hit_index = hit["_index"]
    if model.objects.in_search_queryset(hit_id, index=hit_index):
        logger.debug(
-            "%s with id=%s exists in the '%s' index queryset.",
-            model, hit_id, hit_index
            "%s with id=%s exists in the '%s' index queryset.", model, hit_id, hit_index
        )
        return None
    else:
        logger.debug(
            "%s with id=%s does not exist in the '%s' index queryset and will be pruned.",
-            model, hit_id, hit_index
            model,
            hit_id,
            hit_index,
        )
        # we don't need the full obj for a delete action, just the id.
        # (the object itself may not even exist.)
@@ -158,7 +158,9 @@ def bulk_actions(objects, index, action):
    how the final document is formatted.
    """
-    assert index != '_all', "index arg must be a valid index name. '_all' is a reserved term."
    assert (
        index != "_all"
    ), "index arg must be a valid index name. '_all' is a reserved term."
    logger.info("Creating bulk '%s' actions for '%s'", action, index)
    for obj in objects:
        try:
