docs: section operations
- backup ES
- restore ES
topless committed Mar 17, 2020
1 parent 7752c39 commit 310c0e2
Showing 11 changed files with 135 additions and 12 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -57,7 +57,7 @@

# General information about the project.
project = u'Invenio-Stats'
copyright = u'2017, CERN'
copyright = u'2020, CERN'
author = u'CERN'

# The version info for the project you're documenting, acts as replacement for
1 change: 1 addition & 0 deletions docs/index.rst
@@ -22,6 +22,7 @@ Invenio-Stats.
overview
configuration
usage
operations
examplesapp


121 changes: 121 additions & 0 deletions docs/operations.rst
@@ -0,0 +1,121 @@
..
    This file is part of Invenio.
    Copyright (C) 2016-2020 CERN.

    Invenio is free software; you can redistribute it and/or modify it
    under the terms of the MIT License; see LICENSE file for more details.

Operations
==========

Since the only copy of our stats is stored in Elasticsearch indices, a cluster
error or failure means losing our stats data. It is therefore advised to set up
a backup/restore mechanism for projects in production.

We have several options when it comes to tooling and methods for preserving our
data in Elasticsearch:

- `elasticdump <https://github.com/taskrabbit/elasticsearch-dump#readme>`_
  is a simple and straightforward tool for moving and saving indices.
- `Elasticsearch Snapshots <https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html>`_
  is a feature that takes snapshots of our cluster. Snapshots are built
  incrementally, so a new snapshot only stores data that is not already part of
  a previous one. We can snapshot individual indices or the whole cluster
  (see the sketch after this list).
- `Curator <https://github.com/elastic/curator>`_
  is an advanced Python library from Elastic. You can read more about Curator
  and how to configure and use it in the official `Elasticsearch
  documentation <https://www.elastic.co/guide/en/elasticsearch/client/curator/current/index.html>`_.
- Not recommended, but if you want, you can even keep raw filesystem backups of
  each of your Elasticsearch nodes.
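
As an illustration of the snapshot approach, you register a snapshot repository
and then snapshot the stats indices. This is a minimal sketch: the repository
name ``my_backup`` and the location ``/mnt/es-backups`` are placeholders, and
the location must be listed under ``path.repo`` in your ``elasticsearch.yml``.

.. code-block:: console

    $ # Register a shared filesystem snapshot repository (location is a placeholder).
    $ curl -XPUT http://localhost:9200/_snapshot/my_backup \
    > -H 'Content-Type: application/json' \
    > -d '{"type": "fs", "settings": {"location": "/mnt/es-backups"}}'
    $ # Snapshot only the stats indices and wait for completion.
    $ curl -XPUT "http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" \
    > -H 'Content-Type: application/json' \
    > -d '{"indices": "stats-*"}'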

Demonstrating all the aforementioned tools falls outside the scope of this
guide, so we will provide examples only for elasticdump.

.. note::

    To give you a sense of the magnitude of the data produced for stats,
    `Zenodo <https://zenodo.org>`_ received approximately **3M** visits in
    January 2020 (harvesters and users combined), which produced approximately
    **10GB** of stats data.


Backup with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~

.. note::

    Apart from the data, you will also have to back up the mappings so that
    you are able to restore the data properly. The following example backs up
    only the stats for record views (not the raw events); go through your
    indices and select the ones that make sense to back up (see the listing
    example below).

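To decide which indices to back up, you can first list the stats indices
present in your cluster (the exact names depend on your instance and its
aggregation intervals):

.. code-block:: console

    $ curl "http://localhost:9200/_cat/indices/stats-*?v"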

Save our mappings and our index data to the ``record_view_mapping_backup.json``
and ``record_view_index_backup.json`` files respectively:

.. code-block:: console

    $ elasticdump \
    > --input=http://localhost:9200/stats-record-view-2020-03 \
    > --output=record_view_mapping_backup.json \
    > --type=mapping
    Fri, 13 Mar 2020 13:13:01 GMT | starting dump
    Fri, 13 Mar 2020 13:13:01 GMT | got 1 objects from source elasticsearch (offset: 0)
    Fri, 13 Mar 2020 13:13:01 GMT | sent 1 objects to destination file, wrote 1
    Fri, 13 Mar 2020 13:13:01 GMT | got 0 objects from source elasticsearch (offset: 1)
    Fri, 13 Mar 2020 13:13:01 GMT | Total Writes: 1
    Fri, 13 Mar 2020 13:13:01 GMT | dump complete
    $ elasticdump \
    > --input=http://localhost:9200/stats-record-view-2020-03 \
    > --output=record_view_index_backup.json \
    > --type=data
    Fri, 13 Mar 2020 13:13:13 GMT | starting dump
    Fri, 13 Mar 2020 13:13:13 GMT | got 5 objects from source elasticsearch (offset: 0)
    Fri, 13 Mar 2020 13:13:13 GMT | sent 5 objects to destination file, wrote 5
    Fri, 13 Mar 2020 13:13:13 GMT | got 0 objects from source elasticsearch (offset: 5)
    Fri, 13 Mar 2020 13:13:13 GMT | Total Writes: 5
    Fri, 13 Mar 2020 13:13:13 GMT | dump complete
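
Since elasticdump writes one JSON document per line, a quick sanity check of
the data backup is to count the lines of the produced file, which should match
the 5 writes reported above:

.. code-block:: console

    $ wc -l record_view_index_backup.json
    5 record_view_index_backup.json
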
In order to test the restore functionality below, we will now delete the index
we just backed up from our instance on purpose.

.. code-block:: console

    $ curl -XDELETE http://localhost:9200/stats-record-view-2020-03
    {"acknowledged":true}

Restore with elasticdump
~~~~~~~~~~~~~~~~~~~~~~~~

As we are all aware, a backup has not truly worked until it has been restored.
Note that before importing our data, we need to import the mappings to
re-create the index. The process is identical to the backup, just with the
``--input`` and ``--output`` sources reversed.


.. code-block:: console

    $ elasticdump \
    > --input=record_view_mapping_backup.json \
    > --output=http://localhost:9200/stats-record-view-2020-03 \
    > --type=mapping
    Fri, 13 Mar 2020 15:22:17 GMT | starting dump
    Fri, 13 Mar 2020 15:22:17 GMT | got 1 objects from source file (offset: 0)
    Fri, 13 Mar 2020 15:22:17 GMT | sent 1 objects to destination elasticsearch, wrote 4
    Fri, 13 Mar 2020 15:22:17 GMT | got 0 objects from source file (offset: 1)
    Fri, 13 Mar 2020 15:22:17 GMT | Total Writes: 4
    Fri, 13 Mar 2020 15:22:17 GMT | dump complete
    $ elasticdump \
    > --input=record_view_index_backup.json \
    > --output=http://localhost:9200/stats-record-view-2020-03 \
    > --type=data
    Fri, 13 Mar 2020 15:23:01 GMT | starting dump
    Fri, 13 Mar 2020 15:23:01 GMT | got 5 objects from source file (offset: 0)
    Fri, 13 Mar 2020 15:23:01 GMT | sent 5 objects to destination elasticsearch, wrote 5
    Fri, 13 Mar 2020 15:23:01 GMT | got 0 objects from source file (offset: 5)
    Fri, 13 Mar 2020 15:23:01 GMT | Total Writes: 5
    Fri, 13 Mar 2020 15:23:01 GMT | dump complete
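
As a final check, you can refresh the index and count its documents; the count
should match the 5 documents we backed up:

.. code-block:: console

    $ curl -XPOST http://localhost:9200/stats-record-view-2020-03/_refresh
    $ curl http://localhost:9200/stats-record-view-2020-03/_count
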
4 changes: 3 additions & 1 deletion examples/app.py
@@ -127,6 +127,7 @@ def fixtures():

def publish_filedownload(nb_events, user_id, file_key,
file_id, bucket_id, date):
"""Publish file download event."""
current_stats.publish('file-download', [dict(
# When:
timestamp=(
@@ -143,7 +144,7 @@ def publish_filedownload(nb_events, user_id, file_key,

@fixtures.command()
def events():
# Create events
"""Create events."""
nb_days = 20
day = datetime(2016, 12, 1, 0, 0, 0)
max_events = 10
@@ -162,6 +163,7 @@ def events():

@fixtures.command()
def aggregations():
"""Aggregate events."""
aggregate_events(['file-download-agg'])
# flush elasticsearch indices so that the aggregations become searchable
current_search_client.indices.flush(index='*')
8 changes: 4 additions & 4 deletions invenio_stats/__init__.py
@@ -223,14 +223,14 @@ def register_events():
delete or archive old indices.
2. Aggregating
^^^^^^^^^^^^^^
~~~~~~~~~~~~~~
The :py:class:`~invenio_stats.processors.EventsIndexer` processor indexes raw
events. Querying those events can put a big strain on the Elasticsearch
cluster. Thus Invenio-Stats provides a way to *compress* those events by
pre-aggregating them into meaningful statistics.
*Example: individual file downoalds events can be aggregated into the number of
*Example: individual file downloads events can be aggregated into the number of
file download per day and per file.*
Aggregations are registered in the same way as events, under the entrypoint
@@ -270,7 +270,7 @@ def register_aggregations():
]
An aggregator class must be specified. The dictionary ``params``
contains all the arguments given to its construtor. An Aggregator class is
contains all the arguments given to its constructor. An Aggregator class is
just required to have a ``run()`` method.
The default one is :py:class:`~invenio_stats.aggregations.StatAggregator`
@@ -300,7 +300,7 @@ def register_aggregations():
]
}
Again the registering function returns the configuraton for the query:
Again the registering function returns the configuration for the query:
.. code-block:: python
1 change: 0 additions & 1 deletion requirements-devel.txt
@@ -14,4 +14,3 @@

-e git+https://github.com/inveniosoftware/invenio-queues.git#egg=invenio-queues
-e git+https://github.com/inveniosoftware/invenio-search.git#egg=invenio-search
-e git+https://github.com/inveniosoftware/invenio-base.git#egg=invenio-base
3 changes: 2 additions & 1 deletion setup.py
@@ -69,13 +69,14 @@

install_requires = [
'counter-robots>=2018.6',
'invenio-base>=1.2.2',
'Flask>=0.11.1',
'invenio-cache>=1.0.0',
'invenio-celery>=1.1.3',
'invenio-queues>=1.0.0a2',
'maxminddb-geolite2>=2017.0404',
'python-dateutil>=2.6.1',
'python-geoip>=1.2',
'Werkzeug>=0.15.0, <1.0.0',
]

packages = find_packages()
2 changes: 1 addition & 1 deletion tests/conftest.py
@@ -17,6 +17,7 @@
import uuid
from contextlib import contextmanager
from copy import deepcopy
from unittest.mock import Mock, patch

# imported to make sure that
# login_oauth2_user(valid, oauth) is included
@@ -42,7 +43,6 @@
from invenio_records.api import Record
from invenio_search import InvenioSearch, current_search, current_search_client
from kombu import Exchange
from unittest.mock import Mock, patch
from six import BytesIO
from sqlalchemy_utils.functions import create_database, database_exists

1 change: 0 additions & 1 deletion tests/contrib/test_event_builders.py
@@ -9,7 +9,6 @@
"""Test event builders."""

import datetime

from unittest.mock import patch

from invenio_stats.contrib.event_builders import file_download_event_builder, \
2 changes: 1 addition & 1 deletion tests/test_aggregations.py
@@ -9,12 +9,12 @@
"""Aggregation tests."""

import datetime
from unittest.mock import patch

import pytest
from conftest import _create_file_download_event
from elasticsearch_dsl import Index, Search
from invenio_search import current_search
from unittest.mock import patch

from invenio_stats import current_stats
from invenio_stats.aggregations import StatAggregator, filter_robots
2 changes: 1 addition & 1 deletion tests/test_processors.py
@@ -10,6 +10,7 @@

import logging
from datetime import datetime
from unittest.mock import patch

import pytest
from conftest import _create_file_download_event
@@ -18,7 +19,6 @@
from helpers import get_queue_size
from invenio_queues.proxies import current_queues
from invenio_search import current_search
from unittest.mock import patch

from invenio_stats.contrib.event_builders import build_file_unique_id, \
file_download_event_builder
