diff --git a/docs/operations.rst b/docs/operations.rst
index 5779262e..96193dec 100644
--- a/docs/operations.rst
+++ b/docs/operations.rst
@@ -8,40 +8,81 @@
 Operations
 ==========
 
-(#NOTE: The only copy of the raw events is stored in the index, so in case of an
-Elasticsearch cluster failure/loss, the events will be lost.)
+Since our only copy of the stats data is stored in Elasticsearch indices, a
+cluster error or failure means losing that data. It is therefore advised to
+set up a backup/restore mechanism for projects in production.
 
-Since our statistics are stored in Elasticsearch in the unfortunate event that
-our cluster goes down, we will find ourself in the unpleasant poition to have
-lost all of our statistics for our service. Though a backup/restore mechanism
-is adviced for projects in production. We will go though the defacto solution
-for that and provide some possible alternatives for those who want a more fine
-grained approach.
+We have several options when it comes to tooling and methods for preserving
+our data in Elasticsearch:
 
-Backup ES
-~~~~~~~~~
+- `elasticdump `_
+  is a simple and straightforward tool for moving and saving indices.
+- `Elasticsearch Snapshots `_
+  is a feature that takes snapshots of our cluster. Snapshots are built
+  incrementally, so a new snapshot only stores data that is not already
+  contained in an earlier one. We can take snapshots of individual indices or
+  of the whole cluster.
+- `Curator `_
+  is an advanced Python library from Elastic. You can read more about Curator
+  and how to configure and use it in the official `Elasticsearch
+  documentation `_.
+- Not recommended, but you can also keep raw filesystem backups of each of
+  your Elasticsearch nodes.
 
-Possible options for backing up ES
+Demonstrating all the aforementioned tools falls outside the scope of this
+guide, so we will provide examples only for elasticdump.
 
-- elasticdump (defacto)
-- ES Snapshots
-- Raw filesystem backups for each node... 🤢
-- In terms of managing indices it might be also worth taking a look into the
-  Python library elasticsearch-curator.
+.. note::
+    To give you a sense of the magnitude of the data produced: `Zenodo `_
+    received approximately **3M** visits in January 2020 (harvesters and users
+    combined), which produced approximately **10GB** of stats data.
+
+
+Backup with elasticdump
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+    Apart from the data, you will also have to back up the mappings, so that
+    you are able to restore the data properly.
+
+
+Save our mappings and our index data to the ``record_view_mapping_backup.json``
+and ``record_view_index_backup.json`` files respectively:
 
-downloads and views for Zenodo for January 2020
+.. code-block:: console
 
-- 3M users (not crappy harvesters/ users)
-- ~ 10Gb
+    $ elasticdump \
+    > --input=http://production.es.com:9200/stats-record-view \
+    > --output=record_view_mapping_backup.json \
+    > --type=mapping
 
-Restore ES
-~~~~~~~~~~
+    Fri, 13 Mar 2020 13:13:01 GMT | starting dump
+    Fri, 13 Mar 2020 13:13:01 GMT | got 1 objects from source elasticsearch (offset: 0)
+    Fri, 13 Mar 2020 13:13:01 GMT | sent 1 objects to destination file, wrote 1
+    Fri, 13 Mar 2020 13:13:01 GMT | got 0 objects from source elasticsearch (offset: 1)
+    Fri, 13 Mar 2020 13:13:01 GMT | Total Writes: 1
+    Fri, 13 Mar 2020 13:13:01 GMT | dump complete
+
+    $ elasticdump \
+    > --input=http://production.es.com:9200/stats-record-view \
+    > --output=record_view_index_backup.json \
+    > --type=data
+
+    Fri, 13 Mar 2020 13:13:13 GMT | starting dump
+    Fri, 13 Mar 2020 13:13:13 GMT | got 5 objects from source elasticsearch (offset: 0)
+    Fri, 13 Mar 2020 13:13:13 GMT | sent 5 objects to destination file, wrote 5
+    Fri, 13 Mar 2020 13:13:13 GMT | got 0 objects from source elasticsearch (offset: 5)
+    Fri, 13 Mar 2020 13:13:13 GMT | Total Writes: 5
+    Fri, 13 Mar 2020 13:13:13 GMT | dump complete
+
+
+Restore with elasticdump
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 There is a saying that goes "A backup worked only when it got restored." This
 section will take us through the restore process of the previous step. We will
 have to bring our application close to the state it was before the ES cluster
 failure.
 
-.. note::
-    Some data loss is possible, from the time we notice the issue and restore
-    our cluster and its data to the last valid backed up dataset.
+Some data loss is possible: any events recorded after the last valid backup
+and before we notice the issue and restore our cluster will be lost.
+
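+A restore with elasticdump is essentially the backup commands with ``--input``
+and ``--output`` swapped. As a minimal sketch, reusing the host, index and file
+names from the backup example above (adjust them to your setup), restore the
+mapping first and then the data:
+
+.. code-block:: console
+
+    $ elasticdump \
+    > --input=record_view_mapping_backup.json \
+    > --output=http://production.es.com:9200/stats-record-view \
+    > --type=mapping
+
+    $ elasticdump \
+    > --input=record_view_index_backup.json \
+    > --output=http://production.es.com:9200/stats-record-view \
+    > --type=data
+
+After the restore, a quick sanity check is to compare the document count of the
+restored index against the number of objects elasticdump reported during the
+backup.
+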
diff --git a/examples/app.py b/examples/app.py
index 4071b3c6..2598780c 100644
--- a/examples/app.py
+++ b/examples/app.py
@@ -127,6 +127,7 @@ def fixtures():
 
 def publish_filedownload(nb_events, user_id, file_key,
                          file_id, bucket_id, date):
+    """Publish file download event."""
     current_stats.publish('file-download', [dict(
         # When:
         timestamp=(
@@ -143,7 +144,7 @@
 
 @fixtures.command()
 def events():
-    # Create events
+    """Create events."""
     nb_days = 20
     day = datetime(2016, 12, 1, 0, 0, 0)
     max_events = 10
@@ -162,6 +163,7 @@
 
 @fixtures.command()
 def aggregations():
+    """Aggregate events."""
     aggregate_events(['file-download-agg'])
     # flush elasticsearch indices so that the aggregations become searchable
     current_search_client.indices.flush(index='*')
diff --git a/requirements-devel.txt b/requirements-devel.txt
index dd330062..fbe1005c 100644
--- a/requirements-devel.txt
+++ b/requirements-devel.txt
@@ -14,4 +14,3 @@
 
 -e git+https://github.com/inveniosoftware/invenio-queues.git#egg=invenio-queues
 -e git+https://github.com/inveniosoftware/invenio-search.git#egg=invenio-search
--e git+https://github.com/inveniosoftware/invenio-base.git#egg=invenio-base
diff --git a/setup.py b/setup.py
index fd4e80b9..a5eb66b2 100644
--- a/setup.py
+++ b/setup.py
@@ -69,13 +69,14 @@
 
 install_requires = [
     'counter-robots>=2018.6',
-    'invenio-base>=1.2.2',
+    'Flask>=0.11.1',
     'invenio-cache>=1.0.0',
     'invenio-celery>=1.1.3',
     'invenio-queues>=1.0.0a2',
     'maxminddb-geolite2>=2017.0404',
     'python-dateutil>=2.6.1',
     'python-geoip>=1.2',
+    'Werkzeug>=0.15.0, <1.0.0',
 ]
 
 packages = find_packages()