Update master branch to ES6 only (#36)
This commit updates the master branch to require ES6 and above. It formally deprecates ES5 and ES2, which will now live on in named branches only. Aside from documentation and dependency version updates, this commit also creates a separate code path for partial document updates: instead of a single as_search_document method that takes an optional update_fields kwarg, there are now two separate methods, as_search_document and as_search_document_update. Only the latter takes the update_fields arg, which is mandatory. There are examples in the README, so the recommendation is to start there.

One other addition is that the repo has been "Blackened", so the formatting of the app's *.py files should now be consistent.
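
For illustration, a minimal sketch of the new split (the model, its fields, and the ``elasticsearch_django.models`` import path are assumptions for this example - see the README for canonical usage):

.. code:: python

    from django.db import models

    from elasticsearch_django.models import SearchDocumentMixin


    class MyModel(SearchDocumentMixin, models.Model):
        """Hypothetical model used to illustrate the two-method API."""

        title = models.CharField(max_length=100)
        body = models.TextField()

        def as_search_document(self, index):
            # full document - used when indexing the complete object
            return {"title": self.title, "body": self.body}

        def as_search_document_update(self, index, update_fields):
            # partial document - the update_fields arg is now mandatory
            return {f: getattr(self, f) for f in update_fields}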
hugorodgerbrown authored Jan 17, 2019
1 parent 681192a commit 4ac14cc
Showing 13 changed files with 306 additions and 240 deletions.
4 changes: 4 additions & 0 deletions Pipfile
@@ -6,9 +6,13 @@ verify_ssl = true
[dev-packages]
tox = "*"
coverage = "*"
black = "*"

[packages]
elasticsearch-django = {path = ".",editable = true}

[requires]
python_version = "3.7"

[pipenv]
allow_prereleases = true
65 changes: 60 additions & 5 deletions README.rst
@@ -11,9 +11,7 @@ Elasticsearch for Django

This is a lightweight Django app for people who are using Elasticsearch with Django, and want to manage their indexes.

-**NB the master branch is now based on ES5.x. If you are using ES2.x, please switch to the ES2 branch (released on PyPI as 2.x)**
-
-Status
**NB the master branch is now based on ES6. If you are using ES2/ES5, please switch to the relevant branch (released on PyPI as 2.x, 5.x)**

----

@@ -104,13 +102,70 @@ So far we have configured Django to know the names of the indexes we want, and t

**SearchDocumentMixin**

-This mixin must be implemented by the model itself, and it requires a single method implementation - ``as_search_document()``. This should return a dict that is the index representation of the object; the ``index`` kwarg can be used to provide different representations for different indexes. By default this is ``_all`` which means that all indexes receive the same document for a given object.
This mixin is responsible for the search index document format. We are indexing JSON representations of each object, and we have two methods on the mixin responsible for outputting the correct format - ``as_search_document`` and ``as_search_document_update``.

An aside on the mechanics of the ``auto_sync`` process: it is hooked up using Django's ``post_save`` and ``post_delete`` model signals. ES supports partial updates to documents that already exist, and we make a fundamental assumption about indexing models - that **if you pass the ``update_fields`` kwarg to a ``model.save`` method call, then you are performing a partial update**, and this will be propagated to ES as a partial update only.

To this end, we have two methods for generating the model's JSON representation - ``as_search_document``, which should return a dict that represents the entire object; and ``as_search_document_update``, which takes the ``update_fields`` kwarg, and should return a dict that contains just the fields that are being updated. (NB this is not automated as the field being updated may itself be an object, which needs specific formatting - see below for an example).

To better understand this, let us say that we have a model (``MyModel``) that is configured to be included in an index called ``myindex``. If we save an object without passing ``update_fields``, then this is considered a full document update, which triggers the object's ``index_search_document`` method:

.. code:: python

    obj = MyModel.objects.first()
    obj.save()
    ...
    # AUTO_SYNC=true will trigger a re-index of the complete object document:
    obj.index_search_document(index='myindex')

However, if we only want to update a single field (say the ``timestamp``), and we pass this to the save method, then this will trigger the ``update_search_document`` method, passing in the names of the fields that we want updated.

.. code:: python

    # save a single field on the object
    obj.save(update_fields=['timestamp'])
    ...
    # AUTO_SYNC=true will trigger a partial update of the object document
    obj.update_search_document(index, update_fields=['timestamp'])

We pass the name of the index being updated as the first arg, as objects may have different representations in different indexes:

.. code:: python

-   def as_search_document(self, index='_all'):
    def as_search_document(self, index):
        return {'name': "foo"} if index == 'foo' else {'name': "bar"}

In the case of the second method, the simplest possible implementation would be a dictionary containing the names of the fields being updated and their new values. However, if the value is itself an object, it will have to be serialized properly. As an example:

.. code:: python

    def as_search_document_update(self, index, update_fields):
        # create a basic attr: value dict from the fields that were updated
        doc = {f: getattr(self, f) for f in update_fields}
        # if the 'user' attr was updated, we need to convert this from
        # a User object to something ES can index - in this case just the full name
        # (NB GDPR klaxon sounding at this point - do you have permission to do this?)
        if 'user' in doc:
            doc['user'] = self.user.get_full_name()
        return doc

The reason we have split the update out from the full-document index comes from a real problem that we ourselves suffered. The full object representation that we were using was quite DB-intensive - we were storing properties of the model that required walking the ORM tree. However, because we were also touching the objects (see below) to record activity timestamps, we ended up flooding the database with queries simply to update a single field in the output document. Partial updates solve this issue:

.. code:: python

    def touch(self):
        self.timestamp = now()
        self.save(update_fields=['timestamp'])

    def as_search_document_update(self, index, update_fields):
        if update_fields == ['timestamp']:
            # only propagate changes if it's +1hr since the last timestamp change
            if now() - self.timestamp < timedelta(hours=1):
                return {}
            else:
                return {'timestamp': self.timestamp}
        ...

**SearchDocumentManagerMixin**

This mixin must be implemented by the model's default manager (``objects``). It also requires a single method implementation - ``get_search_queryset()`` - which returns a queryset of objects that are to be indexed. This can also use the ``index`` kwarg to provide different sets of objects to different indexes.
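
A minimal sketch (``is_published`` is a hypothetical field used purely for illustration, and the ``elasticsearch_django.models`` import path is assumed):

.. code:: python

    from django.db import models

    from elasticsearch_django.models import SearchDocumentManagerMixin


    class MyModelManager(SearchDocumentManagerMixin, models.Manager):

        def get_search_queryset(self, index='_all'):
            # only include objects that should be searchable;
            # different indexes could apply different filters here
            return self.get_queryset().filter(is_published=True)
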
4 changes: 2 additions & 2 deletions elasticsearch_django/__init__.py
@@ -1,3 +1,3 @@
-default_app_config = 'elasticsearch_django.apps.ElasticAppConfig'
default_app_config = "elasticsearch_django.apps.ElasticAppConfig"

-__version__ = '5.3'
__version__ = "6.0"
69 changes: 30 additions & 39 deletions elasticsearch_django/admin.py
@@ -19,52 +19,40 @@ def pprint(data):
    until someone builds a custom syntax function.
    """
-    pretty = json.dumps(
-        data,
-        sort_keys=True,
-        indent=4,
-        separators=(',', ': ')
-    )
    pretty = json.dumps(data, sort_keys=True, indent=4, separators=(",", ": "))
    html = pretty.replace(" ", "&nbsp;").replace("\n", "<br>")
    return mark_safe("<code>%s</code>" % html)


class SearchQueryAdmin(admin.ModelAdmin):

    list_display = (
-        'id',
-        'user',
-        'search_terms_',
-        'total_hits',
-        'returned_',
-        'min_',
-        'max_',
-        'reference',
-        'executed_at',
-    )
-    list_filter = (
-        'index',
-    )
-    search_fields = (
-        'search_terms',
-        'user__first_name',
-        'user__last_name',
-        'reference'
        "id",
        "user",
        "search_terms_",
        "total_hits",
        "returned_",
        "min_",
        "max_",
        "reference",
        "executed_at",
    )
    list_filter = ("index",)
    search_fields = ("search_terms", "user__first_name", "user__last_name", "reference")
    # excluding because we are using a pretty version instead
-    exclude = ('hits', 'query', 'page')
    exclude = ("hits", "query", "page")
    readonly_fields = (
-        'user',
-        'index',
-        'search_terms',
-        'total_hits',
-        'returned_',
-        'min_',
-        'max_',
-        'duration',
-        'query_',
-        'hits_',
-        'executed_at',
        "user",
        "index",
        "search_terms",
        "total_hits",
        "returned_",
        "min_",
        "max_",
        "duration",
        "query_",
        "hits_",
        "executed_at",
    )

    def search_terms_(self, instance):
@@ -79,18 +67,20 @@ def query_(self, instance):

    def max_(self, instance):
        """Pretty version of max_score."""
-        return '-' if instance.page_size == 0 else instance.max_score
        return "-" if instance.page_size == 0 else instance.max_score

    max_.short_description = "Max score"

    def min_(self, instance):
        """Pretty version of min_score."""
-        return '-' if instance.page_size == 0 else instance.min_score
        return "-" if instance.page_size == 0 else instance.min_score

    min_.short_description = "Min score"

    def returned_(self, instance):
        """Number of hits returned in the page."""
        if instance.page_size == 0:
-            return '-'
            return "-"
        else:
            return "%i - %i" % (instance.page_from, instance.page_to)

@@ -100,4 +90,5 @@ def hits_(self, instance):
"""Pretty version of hits JSON."""
return pprint(instance.hits)


admin.site.register(SearchQuery, SearchQueryAdmin)
40 changes: 23 additions & 17 deletions elasticsearch_django/apps.py
@@ -13,15 +13,15 @@ class ElasticAppConfig(AppConfig):

"""AppConfig for Search3."""

-    name = 'elasticsearch_django'
    name = "elasticsearch_django"
    verbose_name = "Elasticsearch"
    configs = []

    def ready(self):
        """Validate config and connect signals."""
        super(ElasticAppConfig, self).ready()
-        _validate_config(settings.get_setting('strict_validation'))
-        if settings.get_setting('auto_sync'):
        _validate_config(settings.get_setting("strict_validation"))
        if settings.get_setting("auto_sync"):
            _connect_signals()
        else:
            logger.debug("SEARCH_AUTO_SYNC has been disabled.")
@@ -48,11 +48,9 @@ def _validate_mapping(index, strict=False):

def _validate_model(model):
    """Check that a model configured for an index subclasses the required classes."""
-    if not hasattr(model, 'as_search_document'):
-        raise ImproperlyConfigured(
-            "'%s' must implement `as_search_document`." % model
-        )
-    if not hasattr(model.objects, 'get_search_queryset'):
    if not hasattr(model, "as_search_document"):
        raise ImproperlyConfigured("'%s' must implement `as_search_document`." % model)
    if not hasattr(model.objects, "get_search_queryset"):
        raise ImproperlyConfigured(
            "'%s.objects must implement `get_search_queryset`." % model
        )
@@ -68,22 +66,30 @@ def _connect_signals():
def _connect_model_signals(model):
    """Connect signals for a single model."""
    if settings.auto_sync(model):
-        dispatch_uid = '%s.post_save' % model._meta.model_name
        dispatch_uid = "%s.post_save" % model._meta.model_name
        logger.debug("Connecting search index model sync signal: %s", dispatch_uid)
-        signals.post_save.connect(_on_model_save, sender=model, dispatch_uid=dispatch_uid)
-        dispatch_uid = '%s.post_delete' % model._meta.model_name
        signals.post_save.connect(
            _on_model_save, sender=model, dispatch_uid=dispatch_uid
        )
        dispatch_uid = "%s.post_delete" % model._meta.model_name
        logger.debug("Connecting search index model sync signal: %s", dispatch_uid)
-        signals.post_delete.connect(_on_model_delete, sender=model, dispatch_uid=dispatch_uid)
        signals.post_delete.connect(
            _on_model_delete, sender=model, dispatch_uid=dispatch_uid
        )
    else:
-        logger.debug("Auto sync disabled for '%s', ignoring update.", model._meta.model_name)
        logger.debug(
            "Auto sync disabled for '%s', ignoring update.", model._meta.model_name
        )


def _on_model_save(sender, **kwargs):
    """Update document in search index post_save."""
-    update_fields = kwargs['update_fields']
-    instance = kwargs['instance']
    update_fields = kwargs["update_fields"]
    instance = kwargs["instance"]
    for index in instance.search_indexes:
-        _update_search_index(instance=instance, index=index, update_fields=update_fields)
        _update_search_index(
            instance=instance, index=index, update_fields=update_fields
        )


def _update_search_index(*, instance, index, update_fields):
@@ -106,7 +112,7 @@ def _on_model_delete(sender, **kwargs):
    the object will no longer exist in the database.
    """
-    instance = kwargs['instance']
    instance = kwargs["instance"]
    for index in instance.search_indexes:
        try:
            instance.delete_search_document(index=index)
36 changes: 19 additions & 17 deletions elasticsearch_django/index.py
@@ -2,12 +2,7 @@

from elasticsearch import helpers

-from .settings import (
-    get_setting,
-    get_index_mapping,
-    get_index_models,
-    get_client
-)
from .settings import get_setting, get_index_mapping, get_index_models, get_client

logger = logging.getLogger(__name__)

@@ -27,8 +22,8 @@ def update_index(index):
    for model in get_index_models(index):
        logger.info("Updating search index model: '%s'", model.search_doc_type)
        objects = model.objects.get_search_queryset(index).iterator()
-        actions = bulk_actions(objects, index=index, action='index')
-        response = helpers.bulk(client, actions, chunk_size=get_setting('chunk_size'))
        actions = bulk_actions(objects, index=index, action="index")
        response = helpers.bulk(client, actions, chunk_size=get_setting("chunk_size"))
        responses.append(response)
    return responses

@@ -68,11 +63,15 @@ def prune_index(index):
                prunes.append(obj)
        logger.info(
            "Found %s objects of type '%s' for deletion from '%s'.",
-            len(prunes), model, index
            len(prunes),
            model,
            index,
        )
        if len(prunes) > 0:
-            actions = bulk_actions(prunes, index, 'delete')
-            response = helpers.bulk(client, actions, chunk_size=get_setting('chunk_size'))
            actions = bulk_actions(prunes, index, "delete")
            response = helpers.bulk(
                client, actions, chunk_size=get_setting("chunk_size")
            )
            responses.append(response)
    return responses

@@ -98,18 +97,17 @@ def _prune_hit(hit, model):
    a 'delete' action.
    """
-    hit_id = hit['_id']
-    hit_index = hit['_index']
    hit_id = hit["_id"]
    hit_index = hit["_index"]
    if model.objects.in_search_queryset(hit_id, index=hit_index):
        logger.debug(
-            "%s with id=%s exists in the '%s' index queryset.",
-            model, hit_id, hit_index
            "%s with id=%s exists in the '%s' index queryset.", model, hit_id, hit_index
        )
        return None
    else:
        logger.debug(
            "%s with id=%s does not exist in the '%s' index queryset and will be pruned.",
-            model, hit_id, hit_index
            model,
            hit_id,
            hit_index,
        )
        # we don't need the full obj for a delete action, just the id.
        # (the object itself may not even exist.)
@@ -158,7 +158,9 @@ def bulk_actions(objects, index, action):
    how the final document is formatted.
    """
-    assert index != '_all', "index arg must be a valid index name. '_all' is a reserved term."
    assert (
        index != "_all"
    ), "index arg must be a valid index name. '_all' is a reserved term."
    logger.info("Creating bulk '%s' actions for '%s'", action, index)
    for obj in objects:
        try:
