Improve Graph Sync #11

Open
miteshvp opened this issue Jun 19, 2017 · 9 comments

miteshvp (Contributor) commented Jun 19, 2017

This issue collates the points that could help improve graph writes. There are three ways to approach the problem.

  • Improve the current data_importer service
  1. Run the data_importer gunicorn (HTTP server) process in multi-worker mode
  2. Run more than one replica of data_importer in OpenShift
  • Split the actual graph sync process so that other workers are not kept waiting
  1. Call the data_importer/<ingest_to_graph> API in an asynchronous way; see the sketch after this list. (Cons: we may lose logging ability. Pros: very simple to implement and no worker is kept waiting)
  2. Update the workflow to not wait for the Graph Sync task (not sure whether this is possible, cc @fridex)
  3. Implement data_importer/<ingest_to_graph> as part of the Selinon tasks. (Cons: implementation might require more time. Pros: all the Selinon-related advantages)
  • Improve graph writes
  1. Use the single-item model instead of the multi-item model, which helps with faster writes. (Cons: rewrite the full graph. Pros: writes will be much faster; need to see how much faster with a small load test)
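
For reference, option 2.1 above could look roughly like the sketch below. This is only a minimal sketch: the host, port, endpoint path and payload shape are placeholders, not the actual data_importer API.

```python
# Minimal sketch of option 2.1: a fire-and-forget call to data_importer.
# The host, port, endpoint path and payload shape are hypothetical
# placeholders, not the real data_importer API.
import threading

import requests


def _post_ingest(payload):
    # Any exception raised here is only visible inside this thread,
    # not in the worker that scheduled it.
    requests.post("http://data-importer:9192/api/v1/ingest_to_graph",
                  json=payload, timeout=60)


def ingest_to_graph_async(payload):
    # Start the request in a background thread and return immediately,
    # so the calling worker is not kept waiting for the graph sync.
    thread = threading.Thread(target=_post_ingest, args=(payload,), daemon=True)
    thread.start()


ingest_to_graph_async([{"ecosystem": "maven", "name": "foo", "version": "1.0.0"}])
```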

cc @msrb @krishnapaparaju @samuzzal-choudhury

fridex (Contributor) commented Jun 19, 2017

Call the data_importer/<ingest_to_graph> API in an asynchronous way. (Cons: we may lose logging ability. Pros: very simple to implement and no worker is kept waiting)

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

Another thing is that these async requests are usually made in separate threads, which makes the outcome depend on thread scheduling. If we just issue the async request and immediately return from the call, this can simply lead to:

  1. the async request is actually never made because the thread was never started
  2. even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

And not to forget: even if you do async requests, you can easily DoS the remote API server with such an approach.

I still think that we should get rid of data_importer, as it really does not make sense to have it, and instead create a standalone Selinon task that offers us transparent horizontal scaling.
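
For illustration, such a standalone task could look roughly like the following. This is only a minimal sketch: the class name and the ingest_to_graph() helper are placeholders, not existing code.

```python
# Minimal sketch of a standalone Selinon task that does the graph ingestion
# itself, so it can be retried, logged and scaled like any other worker task.
# GraphIngestionTask and ingest_to_graph() are hypothetical names.
from selinon import SelinonTask


def ingest_to_graph(node_args):
    # Placeholder for the logic data_importer runs today: read the analysis
    # results and write them to the graph through gremlin-http.
    raise NotImplementedError


class GraphIngestionTask(SelinonTask):
    def run(self, node_args):
        # node_args carries the EPV (ecosystem/package/version) from the flow.
        return ingest_to_graph(node_args)
```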

fridex (Contributor) commented Jun 19, 2017

Update the workflow to not wait for the Graph Sync task (not sure whether this is possible, cc @fridex)

Yes, this is possible and we already have a pending patch for that in fabric8-analytics/fabric8-analytics-worker#63

EDIT: also note that this does not solve the issue; it has happened many times that all cluster nodes were stuck on graph imports because they occupied all available workers, and we were not able to analyse anything.

miteshvp (Contributor, Author) commented:

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

One option is that we could track these errors at the data_importer layer rather than in the worker processes.
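
For illustration, tracking failures at the data_importer layer could look roughly like the sketch below, assuming data_importer is a Flask app; the endpoint path and the insert_epv() helper are placeholders.

```python
# Minimal sketch of tracking ingestion failures inside data_importer itself.
# Assumes data_importer is a Flask app; the endpoint path and insert_epv()
# helper are hypothetical.
import logging

from flask import Flask, jsonify, request

app = Flask(__name__)
logger = logging.getLogger("data_importer")


def insert_epv(epv):
    # Placeholder for the actual gremlin-http write.
    raise NotImplementedError


@app.route("/api/v1/ingest_to_graph", methods=["POST"])
def ingest_to_graph():
    failed = []
    for epv in request.get_json() or []:
        try:
            insert_epv(epv)
        except Exception:
            # The failure is recorded in data_importer's own log even if the
            # worker that fired the request never looks at the response.
            logger.exception("Graph ingestion failed for %s", epv)
            failed.append(epv)
    return jsonify({"failed": failed}), (500 if failed else 200)
```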

even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

Again, we could handle this as part of the unknown scenario.

And not to forget: even if you do async requests, you can easily DoS the remote API server with such an approach.

Any API server faces this challenge. We might encounter it on our core API server as well.

miteshvp (Contributor, Author) commented:

EDIT: also note that this does not solve the issue; it has happened many times that all cluster nodes were stuck on graph imports because they occupied all available workers, and we were not able to analyse anything.

Yup, that is what I meant initially: whether it is possible to unblock the other workers. Thanks for clarifying. So we might have to remove that as an option.

miteshvp (Contributor, Author) commented:

@fridex - what about running data_importer in multi-worker mode? Do you think it might be helpful to perform a small load test?

fridex (Contributor) commented Jun 19, 2017

There was already an implementation that used asynchronous HTTP requests from workers, but I insisted on changing it. One of the disadvantages of such an implementation is, as you stated, losing log control and all the log-related features. Making a request without making sure that it actually succeeded can lead to issues that are really hard to debug and track.

One option is that we could track these errors at the data_importer layer rather than in the worker processes.

How do you want to track these issues when you don't even know that the connection was actually established?

even if it was started, and let's say the call to the remote API was made (not necessarily true), the TCP connection could be closed, and this behaviour is highly dependent on the server and client implementations (and on the situation in which it happened)

Again, we could handle this as part of the unknown scenario.

How do you want to handle application-related issues at the transport layer?

@fridex - what about running data_importer in multi-worker mode? Do you think it might be helpful to perform a small load test?

I don't think this will help; this way we are scaling vertically, not horizontally. I created PRs this morning that separate the worker for graph imports, so other workers can continue with analyses (now we can scale the worker that does the ingestion independently). I still think we should remove data_importer completely and place this logic into the core worker tasks.

miteshvp (Contributor, Author) commented Jun 19, 2017

How do you want to track these issues when you don't even know that the connection was actually established?

It should be handled as part of the unknown scenario, but we may not know what went wrong.

I don't think this will help; this way we are scaling vertically, not horizontally.

This is about a short-term solution, an easy fix and faster turnarounds.

I still think we should remove data_importer completely and place this logic into the core worker tasks.

I am not saying we don't need this, but even if it is part of the core worker, we still rely on gremlin-http for the last-mile insertion into the graph. This is where we could keep track of all the comments and reasons for future reference.
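
For context, the last-mile write through gremlin-http is essentially an HTTP POST of a Gremlin script to the Gremlin Server endpoint. A minimal sketch follows; the host, port and the query itself are placeholders, not the real queries generated from the analysis data.

```python
# Minimal sketch of the "last mile": posting a Gremlin script to the
# Gremlin Server HTTP endpoint. Host/port and the query are placeholders.
import requests


def run_gremlin(query, bindings=None, url="http://gremlin-http:8182"):
    payload = {"gremlin": query, "bindings": bindings or {}}
    response = requests.post(url, json=payload, timeout=30)
    # Failures surface here and can be logged together with the reason
    # for future reference.
    response.raise_for_status()
    return response.json()


run_gremlin(
    "g.addV('Package').property('name', name).property('ecosystem', eco)",
    bindings={"name": "foo", "eco": "maven"},
)
```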

fridex (Contributor) commented Jun 19, 2017

How do you want to track these issues when you don't even know that the connection was actually established?

It should be handled as part of the unknown scenario, but we may not know what went wrong.

We are talking about TCP (the transport layer of the OSI/ISO model) and application-level logic. How does the unknown scenario (an unknown package in our analyses) relate to this?

I don't think this will help; this way we are scaling vertically, not horizontally.

This is about a short-term solution, an easy fix and faster turnarounds.

I think we have introduced a lot of "short-term" solutions. We should start doing things properly.

I still think we should remove data_importer completely and place this logic into the core worker tasks.

I am not saying we don't need this, but even if it is part of the core worker, we still rely on gremlin-http for the last-mile insertion into the graph. This is where we could keep track of all the comments and reasons for future reference.

Yes, exactly. So the bottleneck is not an API server written by us, but services that were designed to deal with heavy load and large data sets. I suppose Gremlin was chosen because of this.

miteshvp (Contributor, Author) commented:

Some performance benchmark results are available at https://docs.google.com/spreadsheets/d/1ojvQwhWzpxKBF77X8EGukRi2O0f2dPD_g0rFivZp8To/edit#gid=0

I have also raised an issue for the SINGLE item model read/write resulting in a high number of storage exceptions: amazon-archives/dynamodb-janusgraph-storage-backend#204
