
Configuring data replication #580

Open
bpowell65536 opened this issue Oct 24, 2019 · 19 comments

@bpowell65536

I have been trying to configure replication of a Backslash instance to an Elasticsearch instance, and the process seems straightforward, but I am having trouble getting it to work.

Backslash version: 2.17.2 (UI 2.17.2), running on Ubuntu 18.04.3, installed using the instructions at https://backslash.readthedocs.io/en/latest/installation.html
Elasticsearch version: 5.4.3, run in a Docker container on Ubuntu 18.04.3 using docker run -p 9200:9200 -p 9300:9300 -e discovery.type=single-node docker.elastic.co/elasticsearch/elasticsearch:5.4.3

I have seen that replication doesn't support the latest version of Elasticsearch (#570), but I can't find any information about which versions of Elasticsearch should be supported.

Steps I took to reproduce this issue:

  • Create a new replication (Admin > Replications > Add New...)
  • Enter the hostname and port number of the Elasticsearch server
  • Leave the index name and credentials fields blank
  • Click "Create"
    • the replication status immediately shows "error" and "lagging"
  • Click "Start" for the new replication
    • the replication status changes to "running" and "lagging"
    • the progress bar starts at the ~40% position and does not move at all
    • after about five minutes the replication stops, the replication status changes back to "error" and "lagging", and no test data has appeared in Elasticsearch

The only log messages I can find related to this are:

Oct 24 22:01:58 bpowell-nz-vl2 7042cafe1e8a[1140]: [2019-10-24 22:01:58,633: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[b0408a37-afd0-415e-89f4-bfb67b6603d2]
Oct 24 22:01:58 bpowell-nz-vl2 7042cafe1e8a[1140]: [2019-10-24 22:01:58,640: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[b0408a37-afd0-415e-89f4-bfb67b6603d2] succeeded in 0.005087274999823421s: None

Are there some more detailed instructions for configuring Backslash replication that I can follow besides https://backslash.readthedocs.io/en/latest/server_configuration.html#data-replication?

@vmalloc
Member

vmalloc commented Oct 26, 2019

Nothing else in the worker log?

In general, during the initial resync the percentages aren't always accurate at the moment. The important thing is that the replication advances. Unfortunately, the only way to truly know that right now is through the database itself (or by viewing the REST API the browser is using and the responses it returns). Can you confirm that the state remains exactly the same and is not advancing at all?

@bpowell65536
Author

I couldn't see anything else that looked relevant in the worker log, but for completeness here are the log messages that appear while the replication is in progress:

[2019-10-28 19:02:43,787: INFO/ForkPoolWorker-3] Raven is not configured (logging is disabled). Please see the documentation for more information.
[2019-10-28 19:02:43,797: INFO/ForkPoolWorker-3] Started
[2019-10-28 19:02:44,349: INFO/ForkPoolWorker-3] Task flask_app.tasks.migrations.start_live_migrations[2d278ed3-e3fe-4aec-925a-023fe4cbf903] succeeded in 0.5734206900233403s: None
[2019-10-28 19:03:04,282: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:03:04,286: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9]
[2019-10-28 19:03:04,288: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[b1f69e1b-7e64-466b-be82-0e1ec10581e9] succeeded in 0.0010479899938218296s: None
[2019-10-28 19:04:04,313: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:04:04,319: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700]
[2019-10-28 19:04:04,322: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[f9786014-ed33-4509-80bc-aefe71a22700] succeeded in 0.0008305439841933548s: None
[2019-10-28 19:04:32,948: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4]
[2019-10-28 19:04:32,957: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[4a978fb2-1523-4d8f-9b45-8704ff5661c4] succeeded in 0.007307631021831185s: None
[2019-10-28 19:05:04,322: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:05:04,328: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e]
[2019-10-28 19:05:04,330: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[dc339c31-27a0-44c2-952f-e7fbf689246e] succeeded in 0.0005721849738620222s: None
[2019-10-28 19:06:04,374: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:06:04,379: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31]
[2019-10-28 19:06:04,383: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[2240523c-c9d3-483c-9d6c-659617fedf31] succeeded in 0.0008473509806208313s: None
[2019-10-28 19:07:04,421: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:07:04,426: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2]
[2019-10-28 19:07:04,432: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[e5133d80-22c5-4788-a281-34fb4e337bd2] succeeded in 0.0008454519556835294s: None
[2019-10-28 19:08:04,472: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:08:04,477: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f]
[2019-10-28 19:08:04,481: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[506042a3-a01e-4fe5-a805-4f6e9ae0753f] succeeded in 0.001215384982060641s: None
[2019-10-28 19:09:04,523: INFO/Beat] Scheduler: Sending due task rerun_stale_reliable_tasks (flask_app.tasks.maintenance.rerun_stale_tasks)
[2019-10-28 19:09:04,532: INFO/MainProcess] Received task: flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd]
[2019-10-28 19:09:04,534: INFO/ForkPoolWorker-3] Task flask_app.tasks.maintenance.rerun_stale_tasks[8f6a62fb-d4e8-49fe-890a-3dd022dc01bd] succeeded in 0.0006121189799159765s: None

(the replication task starts at 2019-10-28 19:04:32,948)

Yes, the replication state remains the same for the entire replication attempt.

I have looked at the Elasticsearch REST API, and the "backslash" index that should have been created for the test records does not exist. I tried creating that index manually, but no records were added to it.
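
For reference, the index check (and the manual creation) was roughly equivalent to the following standard-library sketch - the host is my ES instance, and authentication is omitted for brevity (add credentials if your instance requires them):

# Roughly equivalent to the REST calls described above; adjust the host as needed.
import urllib.request

ES = "http://bpowell-cd-2536:9200"

# GET /_cat/indices lists every index -- "backslash" never shows up here.
with urllib.request.urlopen(ES + "/_cat/indices?v") as resp:
    print(resp.read().decode())

# PUT /backslash creates the index manually; it stays empty afterwards.
req = urllib.request.Request(ES + "/backslash", method="PUT")
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())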

I also tried adding the default Elasticsearch REST API credentials to the replication setup, but the behaviour does not change.

@ayalash
Contributor

ayalash commented Oct 29, 2019

@vmalloc Please note that the Elasticsearch version @bpowell65536 is using is pretty old - the current (latest) version is 7.4.1. Could the problem he is having be due to a backwards incompatibility?

We run replication successfully with version 6.2.2.

@vmalloc
Member

vmalloc commented Oct 29, 2019

Definitely - an old ES can absolutely cause issues with replication

@bpowell65536
Author

Thanks for that info, @ayalash. I had seen in #570 that replication doesn't work with some versions of ES, but that issue didn't say which ones. I will try 6.2.2 and see how it goes.

@bpowell65536
Author

I tried ES 6.2.2, but the behaviour is still the same. Which versions are supported? Is it just 6.2.2?

@vmalloc
Member

vmalloc commented Oct 29, 2019

There is no clear statement as we didn't perform that many conformance tests, but I think >6 should be fine.

I think the only option is to look into the specific installation you have to see how (if at all) it progresses. Are you familiar with how to open a connection to Backslash's internal DB?

@bpowell65536
Author

I can connect to the database, but I don't know much about Postgres so I'll have to figure it out as I go 😄. Are there any specific queries you want me to run?

jenkins@bpowell-nz-vl2:~$ psql -U backslash -h 127.0.0.1 -p 5432
psql (10.10 (Ubuntu 10.10-0ubuntu0.18.04.1), server 9.6.15)
Type "help" for help.

backslash=# \dt
                 List of relations
 Schema |         Name         | Type  |   Owner
--------+----------------------+-------+-----------
 public | alembic_version      | table | backslash
 public | app_config           | table | backslash
 public | background_migration | table | backslash
 public | comment              | table | backslash
 public | entity               | table | backslash
 public | error                | table | backslash
 public | label                | table | backslash
 public | product              | table | backslash
 public | product_revision     | table | backslash
 public | product_version      | table | backslash
 public | replication          | table | backslash
 public | role                 | table | backslash
 public | roles_users          | table | backslash
 public | run_token            | table | backslash
 public | session              | table | backslash
 public | session_entity       | table | backslash
 public | session_label        | table | backslash
 public | session_metadata     | table | backslash
 public | session_subject      | table | backslash
 public | subject              | table | backslash
 public | subject_instance     | table | backslash
 public | test                 | table | backslash
 public | test_entity          | table | backslash
 public | test_information     | table | backslash
 public | test_metadata        | table | backslash
 public | test_variation       | table | backslash
 public | timing               | table | backslash
 public | user                 | table | backslash
 public | user_preference      | table | backslash
 public | user_starred_tests   | table | backslash
 public | warning              | table | backslash
(31 rows)

backslash=#

@vmalloc
Member

vmalloc commented Oct 29, 2019

Let's start with a few runs of SELECT * FROM replication, with about 30 minutes between each execution -- this will clarify whether or not there is any progress at all...

@vmalloc
Member

vmalloc commented Oct 29, 2019

PLEASE NOTE that the output will probably need to be sanitized before you paste it, as it can contain passwords.

@bpowell65536
Author

Here are the contents of the replication table, dumped every 30 minutes (the username and password fields did have the correct values for my ES instance in them, but I removed those).

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

 id |  service_type  |             url             | index_name | username | password | paused | untimed_done | backlog_remaining | last_replicated_timestamp | last_replicated_id | last_chunk_finished | last_error | avg_per_second
----+----------------+-----------------------------+------------+----------+----------+--------+--------------+-------------------+---------------------------+--------------------+---------------------+------------+----------------
 13 | elastic-search | http://bpowell-cd-2536:9200 |            |          |          | f      | f            |                   |                           |                    |    1572375905.53291 |            |              0
(1 row)

Is the index_name field supposed to be empty? I didn't provide that value when configuring the replication because the UI says it defaults to "backslash".

There doesn't seem to be any progress, but no records have been replicated either. There are definitely sessions in the database:

backslash=# select id,logical_id,start_time from session;
 id |               logical_id               |    start_time
----+----------------------------------------+------------------
  1 | f65aa066-f6a2-11e9-a46a-782bcb9c4f51_0 | 1571951539.27069
  2 | b07ef4ac-f6a7-11e9-a46a-782bcb9c4f51_0 | 1571953569.48793
  3 | c27f7f2a-f9c0-11e9-a46a-782bcb9c4f51_0 |  1572294191.5634
(3 rows)
backslash=#

I ran another test which pushed results to Backslash to see if that would change the replication status, and while the new test session appeared in the session table, the replication status did not change.

@vmalloc
Member

vmalloc commented Nov 4, 2019

It looks like the replication hasn't even started... Is it possible your URL isn't reachable from within the worker container?

@bpowell65536
Author

The ES URL is reachable from the machine that the worker container is running on, using ping and also using curl to access the ES REST interface. I started an interactive shell on the backslash_worker_1 container using docker exec -it backslash_worker_1 bash and the ES URL was reachable from there too (using ping and curl again). Is there anything more specific I should try to see if the URL is reachable?
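
For reference, the curl check from inside the container is equivalent to this small Python snippet (standard library only, with my ES host substituted):

# Run inside the worker container (docker exec -it backslash_worker_1 bash, then python3).
# Fetching the ES root document succeeds from here, so basic connectivity looks fine.
import urllib.request

with urllib.request.urlopen("http://bpowell-cd-2536:9200/", timeout=5) as resp:
    print(resp.status, resp.read()[:200])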

I tried running an ES Docker container on the same machine as the Backslash containers and replicating data to that, but the replication still doesn't start.

@vmalloc
Member

vmalloc commented Nov 5, 2019

This is really strange. May I suggest adding some debug information inside the container around def do_elasticsearch_replication(replica_id=None, reconfigure=True)? We will need to pin down what the replication process is doing exactly and where it is stuck... There is also the slight chance that there is a flaw in the replication chunking logic (fetching the next bulk to be replicated) and that this is causing the issue...

@bpowell65536
Author

I added some debug logging to the start of the do_elasticsearch_replication function:

@reliable_task
@needs_app_context
def do_elasticsearch_replication(replica_id=None, reconfigure=True):
    _logger.debug("replica_id is %r" % replica_id)
    if replica_id is None:
        for replica in models.Replication.query.all():
            _logger.debug("do_elasticsearch_replication.apply_async for replica %r" % replica)
            do_elasticsearch_replication.apply_async((replica.id,))
        return

    replica = models.Replication.query.get(replica_id)
    _logger.debug("got replica %r for replica_id %r" % (replica, replica_id))
    if replica is None:
        _logger.debug(f'Replica {replica_id} already deleted. Not doing anything')
        return
    if replica.paused:
        _logger.debug(f'Replica {replica_id} is paused')
        return
    sleep_time = 60
    results = None
    try:
        es_index = index_from_replica(replica)
        _logger.debug("es_index for replica %r is %r" % (replica, es_index))
        if not es_index:
            _logger.debug("No es_index, returning")
            return
        if reconfigure:
            _logger.debug("Reconfiguring es_index")
            es_index.reconfigure()
        _logger.debug("estimating time left")
        es_index.estimate_time_left()
        start_time = replica.last_chunk_finished or flux.current_timeline.time()

        results = [es_index.handle_result(result) \
                   for result in models.db.session.execute(get_next_bulk_query(es_index.get_fields_query(), replica, es_index))]
        num_replicated, _ = es_helpers.bulk(es_index.get_client(), results)
        _logger.debug("num_replicated = %s" % num_replicated)
        end_time = flux.current_timeline.time()

Creating and starting a new replication task now produces this log output from the worker container:

[2019-11-05 19:18:48,002: INFO/MainProcess] Received task: flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd]
[2019-11-05 19:18:48.024929] DEBUG: flask_app.tasks.replications: replica_id is 24
[2019-11-05 19:18:48.031890] DEBUG: flask_app.tasks.replications: got replica <Replica to http://bpowell-cd-2536:9200> for replica_id 24
[2019-11-05 19:18:48.032145] DEBUG: flask_app.tasks.replications: es_index for replica <Replica to http://bpowell-cd-2536:9200> is None
[2019-11-05 19:18:48.032241] DEBUG: flask_app.tasks.replications: No es_index, returning
[2019-11-05 19:18:48,032: INFO/ForkPoolWorker-3] Task flask_app.tasks.replications.do_elasticsearch_replication[6c5cb05e-473e-455a-ac00-821b9c1be2fd] succeeded in 0.024834798998199403s: None

This explains why the task finishes so quickly, and possibly explains why it looks like the replication hasn't started, but I don't understand why es_index is None. Should es_index being None really be a silent failure? I'll keep digging and see what I can find.

@bpowell65536
Author

The es_index value comes from the index_from_replica function at https://github.com/getslash/backslash/blob/develop/flask_app/tasks/replications.py#L330. This function expects the index name (the value taken from the "Index name" field on the replication configuration page) to be either "test" or "session", and it returns None in all other cases. The replication configuration page says the index name defaults to "backslash", so based on that I left the field blank, which appears to have caused all the problems here. Creating one replication with the index name set to "test" and another with it set to "session" lets Backslash replicate all of its data to Elasticsearch successfully, but this behaviour is quite unintuitive. Perhaps the documentation around replication needs to be updated?
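
In other words, the selection logic amounts to something like this (my rough paraphrase, not the actual source, and the index class names are placeholders):

# Rough paraphrase of index_from_replica as I read it -- not the actual source.
# Only the literal strings "test" and "session" produce an index; the blank value
# the UI implies is acceptable silently yields None, and the task then returns
# without doing anything.
def index_from_replica(replica):
    if replica.index_name == 'test':
        return TestsIndex(replica)       # placeholder class name
    if replica.index_name == 'session':
        return SessionsIndex(replica)    # placeholder class name
    return None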

@vmalloc
Member

vmalloc commented Nov 6, 2019

Good catch! That doesn't seem like the behavior we want - it sounds like a bug...

@bpowell65536
Author

Thanks! Yeah, I was expecting that all the data would be replicated to a single "backslash" index if I left the index name field blank - is that the intended behaviour?

@vmalloc
Member

vmalloc commented Nov 7, 2019

This sounds like a bug in the frontend/UI. I am guessing the "new" route sends an empty string, and since the API now accepts the index name this ends up as empty/null. @maormarcus can you confirm?
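
If that's what's happening, one possible server-side guard (just a sketch - the names here are placeholders, not the actual API code) would be to normalize or reject a blank index name when a replication is created, so that the failure is visible instead of silent:

# Sketch only -- placeholder names, not the real handler. The point is that an empty
# string coming from the UI should not silently become a replication that never
# selects an index.
def normalize_index_name(raw):
    name = (raw or '').strip().lower()
    if name not in ('test', 'session'):
        raise ValueError('index name must be "test" or "session"')
    return name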
