docs: operations
topless committed Mar 13, 2020
1 parent 8d877e9 commit dd19fe6
Showing 4 changed files with 70 additions and 27 deletions.
89 changes: 65 additions & 24 deletions docs/operations.rst
@@ -8,40 +8,81 @@
Operations
==========

Since the only copy of our statistics is stored in the indices of
Elasticsearch, in the unfortunate event that our cluster goes down we will
lose all of the stats data for our service. It is therefore advised to set up
a backup/restore mechanism for projects in production. We will go through the
de facto solution for that and provide some possible alternatives for those
who want a more fine-grained approach.

We have several options when it comes to tooling and methods for preserving
our data in Elasticsearch.

Backup ES
~~~~~~~~~
- `elasticdump <https://github.com/taskrabbit/elasticsearch-dump#readme>`_
  is a simple and straightforward tool for moving and saving indices.
- `Elasticsearch Snapshots <https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html>`_
  is a built-in mechanism that takes snapshots of our cluster. Snapshots are
  built in an incremental fashion, so a new snapshot only stores data that is
  not already contained in a previous one. We can take snapshots of individual
  indices or of the whole cluster (a rough sketch follows this list).
- `Curator <https://github.com/elastic/curator>`_
  is an advanced Python library from Elastic. You can read more about Curator
  and how to configure and use it in the official `Elasticsearch
  documentation <https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html>`_.
- Not recommended, but if you want, you can even keep raw filesystem backups
  for each of your Elasticsearch nodes.
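
A minimal sketch of how the snapshot approach could look; the repository name
``stats_backup``, its filesystem location, and the ``stats-*`` index pattern
are hypothetical, and the location must be whitelisted via the ``path.repo``
setting on every node:

.. code-block:: console

    # Register a filesystem snapshot repository (hypothetical name/location)
    $ curl -X PUT "localhost:9200/_snapshot/stats_backup" \
    >   -H 'Content-Type: application/json' \
    >   -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'
    # Snapshot only the stats indices (hypothetical index pattern)
    $ curl -X PUT "localhost:9200/_snapshot/stats_backup/snapshot_1?wait_for_completion=true" \
    >   -H 'Content-Type: application/json' \
    >   -d '{"indices": "stats-*"}'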

Fully demonstrating all of the aforementioned tools falls outside the scope of
this guide, so we will provide complete examples only for elasticdump.

.. note::
    To give you a sense of the magnitude of the data produced by stats:
    `Zenodo <https://zenodo.org>`_ got approximately **3M** visits in January
    2020 (harvesters and users combined), which produced approximately
    **10GB** of stats data.


Backup with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~

.. note::
    Apart from the data, you will also have to back up the mappings, so that
    you are able to restore the data properly.


Save our mappings and our index data to ``record_view_mapping_backup.json``
and ``record_view_index_backup.json`` respectively:

.. code-block:: console

    $ elasticdump \
    > --input=http://production.es.com:9200/stats-record-view \
    > --output=record_view_mapping_backup.json \
    > --type=mapping
    Fri, 13 Mar 2020 13:13:01 GMT | starting dump
    Fri, 13 Mar 2020 13:13:01 GMT | got 1 objects from source elasticsearch (offset: 0)
    Fri, 13 Mar 2020 13:13:01 GMT | sent 1 objects to destination file, wrote 1
    Fri, 13 Mar 2020 13:13:01 GMT | got 0 objects from source elasticsearch (offset: 1)
    Fri, 13 Mar 2020 13:13:01 GMT | Total Writes: 1
    Fri, 13 Mar 2020 13:13:01 GMT | dump complete
    $ elasticdump \
    > --input=http://production.es.com:9200/stats-record-view \
    > --output=record_view_index_backup.json \
    > --type=data
    Fri, 13 Mar 2020 13:13:13 GMT | starting dump
    Fri, 13 Mar 2020 13:13:13 GMT | got 5 objects from source elasticsearch (offset: 0)
    Fri, 13 Mar 2020 13:13:13 GMT | sent 5 objects to destination file, wrote 5
    Fri, 13 Mar 2020 13:13:13 GMT | got 0 objects from source elasticsearch (offset: 5)
    Fri, 13 Mar 2020 13:13:13 GMT | Total Writes: 5
    Fri, 13 Mar 2020 13:13:13 GMT | dump complete

Restore with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~~

There is a saying that goes: "A backup has only worked when it has been
restored." This section takes us through restoring the backup from the
previous step, bringing our application back close to the state it was in
before the ES cluster failure.
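
The restore mirrors the backup step with input and output swapped (a minimal
sketch, assuming the rebuilt cluster is again reachable at
``http://production.es.com:9200``); restore the mapping first, so that the
index is recreated with the correct field types before the data is loaded:

.. code-block:: console

    $ elasticdump \
    > --input=record_view_mapping_backup.json \
    > --output=http://production.es.com:9200/stats-record-view \
    > --type=mapping
    $ elasticdump \
    > --input=record_view_index_backup.json \
    > --output=http://production.es.com:9200/stats-record-view \
    > --type=data

Comparing the document count of the restored index against the backup file is
a quick sanity check that the restore is complete.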

.. note::
    Some data loss is possible between the time we notice the issue and the
    time we restore our cluster and its data from the last valid backup.

4 changes: 3 additions & 1 deletion examples/app.py
@@ -127,6 +127,7 @@ def fixtures():

def publish_filedownload(nb_events, user_id, file_key,
                         file_id, bucket_id, date):
    """Publish file download event."""
    current_stats.publish('file-download', [dict(
        # When:
        timestamp=(
@@ -143,7 +143,7 @@ def publish_filedownload(nb_events, user_id, file_key,

@fixtures.command()
def events():
    """Create events."""
    nb_days = 20
    day = datetime(2016, 12, 1, 0, 0, 0)
    max_events = 10
@@ -162,6 +163,7 @@ def events():

@fixtures.command()
def aggregations():
    """Aggregate events."""
    aggregate_events(['file-download-agg'])
    # flush elasticsearch indices so that the aggregations become searchable
    current_search_client.indices.flush(index='*')
1 change: 0 additions & 1 deletion requirements-devel.txt
@@ -14,4 +14,3 @@

-e git+https://github.com/inveniosoftware/invenio-queues.git#egg=invenio-queues
-e git+https://github.com/inveniosoftware/invenio-search.git#egg=invenio-search
-e git+https://github.com/inveniosoftware/invenio-base.git#egg=invenio-base
3 changes: 2 additions & 1 deletion setup.py
@@ -69,13 +69,14 @@

install_requires = [
    'counter-robots>=2018.6',
    'invenio-base>=1.2.2',
    'Flask>=0.11.1',
    'invenio-cache>=1.0.0',
    'invenio-celery>=1.1.3',
    'invenio-queues>=1.0.0a2',
    'maxminddb-geolite2>=2017.0404',
    'python-dateutil>=2.6.1',
    'python-geoip>=1.2',
    'Werkzeug>=0.15.0, <1.0.0',
]

packages = find_packages()
