Improve Graph Sync #11

Open
miteshvp opened this issue Jun 19, 2017 · 9 comments

miteshvp (Contributor) commented Jun 19, 2017

This issue collates the points that could help improve graph writes. There are three ways to approach the problem.

  • Improve the current data_importer service
  1. Run the data_importer gunicorn (HTTP server) process in multi-worker mode
  2. Run more than one replica of data_importer in OpenShift
  • Split the actual graph sync process so that other workers are not kept waiting
  1. Call the data_importer/<ingest_to_graph> API in an asynchronous way; see the sketch after this list. (Cons: we may lose logging ability. Pros: very simple to implement and no worker is kept waiting)
  2. Update the workflow to not wait for the Graph Sync task (not sure whether this is possible, cc @fridex)
  3. Implement data_importer/<ingest_to_graph> as part of the Selinon tasks. (Cons: implementation might require more time. Pros: all the Selinon-related advantages)
  • Improve graph writes
  1. Use the single-item model instead of the multi-item model, which helps with faster writes. (Cons: rewrite the full graph. Pros: writes will be much faster; need to see how much faster with a small load test)
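
For reference, option 2.1 above could look roughly like the sketch below. This is only a minimal sketch: the host, port, endpoint path and payload shape are placeholders, not the actual data_importer API.

```python
# Minimal sketch of option 2.1: a fire-and-forget call to data_importer.
# The host, port, endpoint path and payload shape are hypothetical
# placeholders, not the real data_importer API.
import threading

import requests


def _post_ingest(payload):
    # Any exception raised here is only visible inside this thread,
    # not in the worker that scheduled it.
    requests.post("http://data-importer:9192/api/v1/ingest_to_graph",
                  json=payload, timeout=60)


def ingest_to_graph_async(payload):
    # Start the request in a background thread and return immediately,
    # so the calling worker is not kept waiting for the graph sync.
    thread = threading.Thread(target=_post_ingest, args=(payload,), daemon=True)
    thread.start()


ingest_to_graph_async([{"ecosystem": "maven", "name": "foo", "version": "1.0.0"}])
```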

cc @msrb @krishnapaparaju @samuzzal-choudhury

fridex (Contributor) commented Jun 19, 2017

Call the data_importer/<ingest_to_graph> API in an asynchronous way. (Cons: we may lose logging ability. Pros: very simple to implement and no worker is kept waiting)

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

Another thing is that these async requests are usually made in separate threads, which makes the outcome depend on thread scheduling. If we just issue the async request and immediately return from the call, this can simply lead to:

  1. the async request is actually never made because the thread was never started
  2. even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

And not to forget: even if you do async requests, you can easily DoS the remote API server with such an approach.

I still think that we should get rid of data_importer, as it really does not make sense to have it, and instead create a standalone Selinon task that offers us transparent horizontal scaling.
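
For illustration, such a standalone task could look roughly like the following. This is only a minimal sketch: the class name and the ingest_to_graph() helper are placeholders, not existing code.

```python
# Minimal sketch of a standalone Selinon task that does the graph ingestion
# itself, so it can be retried, logged and scaled like any other worker task.
# GraphIngestionTask and ingest_to_graph() are hypothetical names.
from selinon import SelinonTask


def ingest_to_graph(node_args):
    # Placeholder for the logic data_importer runs today: read the analysis
    # results and write them to the graph through gremlin-http.
    raise NotImplementedError


class GraphIngestionTask(SelinonTask):
    def run(self, node_args):
        # node_args carries the EPV (ecosystem/package/version) from the flow.
        return ingest_to_graph(node_args)
```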

fridex (Contributor) commented Jun 19, 2017

Update the workflow to not wait for the Graph Sync task (not sure whether this is possible, cc @fridex)

Yes, this is possible and we already have a pending patch for that in fabric8-analytics/fabric8-analytics-worker#63

EDIT: also note that this does not solve the issue; it has happened many times that all cluster nodes were stuck on graph imports because they occupied all available workers, and we were not able to analyse anything.

miteshvp (Contributor, Author) commented:

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

One option is that we could track these errors at the data_importer layer rather than in the worker processes.
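
For illustration, tracking failures at the data_importer layer could look roughly like the sketch below, assuming data_importer is a Flask app; the endpoint path and the insert_epv() helper are placeholders.

```python
# Minimal sketch of tracking ingestion failures inside data_importer itself.
# Assumes data_importer is a Flask app; the endpoint path and insert_epv()
# helper are hypothetical.
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger("data_importer")


def insert_epv(epv):
    # Placeholder for the actual gremlin-http write.
    raise NotImplementedError


@app.route("/api/v1/ingest_to_graph", methods=["POST"])
def ingest_to_graph():
    failed = []
    for epv in request.get_json() or []:
        try:
            insert_epv(epv)
        except Exception:
            # The failure is recorded in data_importer's own log even if the
            # worker that fired the request never looks at the response.
            logger.exception("Graph ingestion failed for %s", epv)
            failed.append(epv)
    return jsonify({"failed": failed}), (500 if failed else 200)
```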

even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

Again, we could handle this as part of the unknown scenario.

And not to forget: even if you do async requests, you can easily DoS the remote API server with such an approach.

Any API server faces this challenge. We might encounter it on our core API server as well.

miteshvp (Contributor, Author) commented:

EDIT: also note that this does not solve the issue; it has happened many times that all cluster nodes were stuck on graph imports because they occupied all available workers, and we were not able to analyse anything.

Yup, that is what I meant initially: whether it is possible to unblock the other workers. Thanks for clarifying. So we might have to remove that as an option.

miteshvp (Contributor, Author) commented:

@fridex - what about running data_importer in multi-worker mode? Do you think it might be helpful to perform a small load test?

fridex (Contributor) commented Jun 19, 2017

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

One option is that we could track these errors at the data_importer layer rather than in the worker processes.

How do you want to track these issues when you don't even know that the connection was actually established?

even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

Again, we could handle this as part of the unknown scenario.

How do you want to handle application-related issues at the transport layer?

@fridex - what about running data_importer in multi-worker mode? Do you think it might be helpful to perform a small load test?

I don't think this will help; this way we are scaling vertically, not horizontally. I created PRs this morning that separate the worker for graph imports, so other workers can continue with analyses (now we can scale the worker that does the ingestion independently). I still think we should remove data_importer completely and place this logic into the core worker tasks.

miteshvp (Contributor, Author) commented Jun 19, 2017

How do you want to track these issues when you don't even know that the connection was actually established?

It should be handled as part of the unknown scenario, but we may not know what went wrong.

I don't think this will help; this way we are scaling vertically, not horizontally.

This is about a short-term solution, an easy fix and faster turnarounds.

I still think we should remove data_importer completely and place this logic into the core worker tasks.

I am not saying we don't need this, but even if it is part of the core worker, we still rely on gremlin-http for the last-mile insertion into the graph. This is where we could keep track of all the comments and reasons for future reference.
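
For context, the last-mile write through gremlin-http is essentially an HTTP POST of a Gremlin script to the Gremlin Server endpoint. A minimal sketch follows; the host, port and the query itself are placeholders, not the real queries generated from the analysis data.

```python
# Minimal sketch of the "last mile": posting a Gremlin script to the
# Gremlin Server HTTP endpoint. Host/port and the query are placeholders.
import requests


def run_gremlin(query, bindings=None, url="http://gremlin-http:8182"):
    payload = {"gremlin": query, "bindings": bindings or {}}
    response = requests.post(url, json=payload, timeout=30)
    # Failures surface here and can be logged together with the reason
    # for future reference.
    response.raise_for_status()
    return response.json()


run_gremlin(
    "g.addV('Package').property('name', name).property('ecosystem', eco)",
    bindings={"name": "foo", "eco": "maven"},
)
```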

fridex (Contributor) commented Jun 19, 2017

How do you want to track these issues when you don't even know that the connection was actually established?

It should be handled as part of the unknown scenario, but we may not know what went wrong.

We are talking about TCP (the transport layer of the OSI/ISO model) and application-level logic. How does the unknown scenario (an unknown package in our analyses) relate to this?

I don't think this will help; this way we are scaling vertically, not horizontally.

This is about a short-term solution, an easy fix and faster turnarounds.

I think we have introduced a lot of "short-term" solutions. We should start doing things properly.

I still think we should remove data_importer completely and place this logic into the core worker tasks.

I am not saying we don't need this, but even if it is part of the core worker, we still rely on gremlin-http for the last-mile insertion into the graph. This is where we could keep track of all the comments and reasons for future reference.

Yes, exactly. So the bottleneck is not an API server written by us, but services that were designed to deal with heavy load and large data sets. I suppose Gremlin was chosen because of this.

miteshvp (Contributor, Author) commented:

Some performance benchmark results are available at https://docs.google.com/spreadsheets/d/1ojvQwhWzpxKBF77X8EGukRi2O0f2dPD_g0rFivZp8To/edit#gid=0

I have also raised an issue for the SINGLE item model read/write resulting in a high number of storage exceptions: amazon-archives/dynamodb-janusgraph-storage-backend#204
