
Split Titan and Gremlin - impossible to scale #10

Closed
fridex opened this issue Jun 14, 2017 · 15 comments

fridex (Contributor) commented Jun 14, 2017

Our current stack looks like the following:

DynamoDB - Titan - Gremlin

We would like to split Titan and Gremlin into two standalone containers, which would allow us to scale Gremlin independently. Note that running multiple Titan instances talking to the same DynamoDB tables results in data inconsistencies and data corruption, as stated in "Titan Limitations":

Running multiple Titan instances on one machine backed by the same storage backend (distributed or local) requires that each of these instances has a unique configuration for storage.machine-id-appendix. Otherwise, these instances might overwrite each other leading to data corruption. See Graph Configuration for more information.

Source: http://titan.thinkaurelius.com/wikidoc/0.3.1/Titan-Limitations.html
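
Just to illustrate what the quoted limitation means in practice, here is a minimal, hypothetical sketch of generating per-instance Titan properties with unique storage.machine-id-appendix values. The key names follow the DynamoDB storage backend documentation; the values are placeholders, not our actual settings.

```python
# Hypothetical sketch only: each Titan instance sharing the same storage
# backend needs its own storage.machine-id-appendix, otherwise instances may
# overwrite each other (per the 0.3.1 docs quoted above).
BASE_CONFIG = {
    # Key names follow the awslabs dynamodb-titan storage backend docs;
    # values are placeholders, not our deployment settings.
    "storage.backend": "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager",
    "storage.dynamodb.client.endpoint": "https://dynamodb.us-east-1.amazonaws.com",
}


def titan_properties(instance_index):
    """Render a per-instance properties file with a unique machine-id-appendix."""
    config = dict(BASE_CONFIG)
    config["storage.machine-id-appendix"] = instance_index
    return "\n".join("{}={}".format(key, value) for key, value in sorted(config.items()))


# Two Titan replicas backed by the same DynamoDB tables would each need their own file:
print(titan_properties(0))
print(titan_properties(1))
```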

fridex changed the title from "Split Titan and Gremlin" to "Split Titan and Gremlin - impossible to scale" on Jun 14, 2017
tuxdna (Contributor) commented Jun 14, 2017

Totally agree with having only one TitanGraph instance.

msrb (Member) commented Jun 14, 2017

My understanding (which may be completely wrong) is that Gremlin is just a thin wrapper sitting in front of Titan. So ideally we would like to be able to scale Titan as well, as Titan is the component that does all the heavy lifting.

Can we scale Titan as well?

fridex (Contributor, Author) commented Jun 14, 2017

My understanding (which may be completely wrong) is that Gremlin is just a thin wrapper sitting in front of Titan. So ideally we would like to be able to scale Titan as well, as Titan is the component that does all the heavy lifting.

If I understand the overall architecture correctly, Gremlin also performs query strategy evaluation and path traversal computations to optimize queries, so we would probably gain something by scaling Gremlin (besides possible connection reuse?).

Can we scale Titan as well?

I think this will not be doable - based on the configuration section, it is not possible to let two or more Titan instances talk to the same graph - see storage.machine-id-appendix [1], as stated above.

[1] http://titan.thinkaurelius.com/wikidoc/0.3.1/Graph-Configuration.html

@krishnapaparaju

Do we have the current issues documented (for ingestion into the Gremlin layer)?

@krishnapaparaju

What kind of issues are we getting into when more workers / threads / containers run in parallel and ingest into Gremlin? Any action plan can be based on this.

fridex (Contributor, Author) commented Jun 14, 2017

What kind of issues are we getting into when more workers / threads / containers run in parallel and ingest into Gremlin?

We wanted to scale Gremlin when we were under heavy load (right now we can only scale Gremlin+Titan together, as they live in one container). It took quite a while to store/retrieve data from the graph database. I will need to check the communication parameters or find the bottleneck there. Also, on the UI level there are retries, as it takes time to get data.

Currently the main issue is the data_model importer, which is an API service that simply pushes data into the graph. There was a plan to remove this API service (saving one container in the deployment) and let workers communicate with Gremlin directly.

From my perspective, there is a nice opportunity here to write a really small library that would help us with Gremlin communication, serializing query results, and constructing queries (it could directly use the schemas we maintain for task results). It could be used on both the worker side and the API side to unify work with the graph; a rough sketch follows below. This would also address other concerns we currently have.
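
A very rough sketch of what such a helper could look like, assuming Gremlin Server is exposed over its HTTP channelizer (default port 8182); the URL, environment variable, and function name here are made up for illustration:

```python
# Hypothetical sketch of a tiny Gremlin helper shared by workers and the API.
# Assumes Gremlin Server's HTTP endpoint; names and URL are placeholders.
import os

import requests

GREMLIN_URL = os.environ.get("GREMLIN_URL", "http://localhost:8182")


def execute(query, bindings=None):
    """POST a Gremlin query string with optional bindings and return the result data."""
    payload = {"gremlin": query}
    if bindings:
        payload["bindings"] = bindings
    response = requests.post(GREMLIN_URL, json=payload)
    response.raise_for_status()
    return response.json()["result"]["data"]


# Bindings keep values out of the query string, which would also make query
# construction from the task-result schemas easier to generate.
packages = execute("g.V().has('ecosystem', eco).valueMap()", bindings={"eco": "maven"})
```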

@miteshvp (Contributor)

Running multiple Titan instances on one machine backed by the same storage backend (distributed or local) requires that each of these instances has a unique configuration for storage.machine-id-appendix. Otherwise, these instances might overwrite each other leading to data corruption. See Graph Configuration for more information.

Source: http://titan.thinkaurelius.com/wikidoc/0.3.1/Titan-Limitations.html

We are using Titan 1.0.0; that document relates to a pretty old version of Titan.

@krishnapaparaju

Which mode are we using in production for Titan? The multiple or single item model?

miteshvp (Contributor) commented Jun 16, 2017

Which mode are we using in production for Titan? The multiple or single item model?

@krishnapaparaju - yes, we are using the multiple item model.

@krishnapaparaju

Thanks @miteshvp @tuxdna. I am sure MULTI would have issues on the write side. Are there any side effects we would get into if we switch to SINGLE? If not, I would start with (a) changing to the SINGLE model and (b) observing the WRITE-CAPACITY-UNITS-USED details at AWS (a rough sketch for pulling that metric is below).
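
For (b), the corresponding CloudWatch metric appears to be ConsumedWriteCapacityUnits; a rough boto3 sketch, where the table name is a placeholder and the period/window are arbitrary:

```python
# Rough sketch: pull consumed write capacity units for one DynamoDB table
# from CloudWatch. Table name is a placeholder; adjust period/window as needed.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "edgestore"}],  # placeholder table name
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```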

@krishnapaparaju

Once these issues are resolved, we might want to move to JanusGraph (again with the AWS plugin); this would protect our prior investments in both Gremlin (code) and DynamoDB (deployment). I don't think we need to do anything related to JanusGraph anytime soon; this is more of an FYI about moving to a DB which is actively maintained, with very minimal rework.

https://github.com/awslabs/dynamodb-janusgraph-storage-backend

miteshvp (Contributor) commented Jun 19, 2017

Are there any side effects we would get into if we switch to SINGLE?

  1. This means rewriting the entire graph in SINGLE mode from scratch (the data-model config keys are sketched below).
  2. A limit of 400 KB per node, which is still OK for our nodes.
  3. Graphs with a low vertex degree and a low number of items per index can take advantage of this implementation. We have a rather big skew towards fewer indices but more items.
  4. We will definitely save time on writes (initial graph loads), but in previous experimentation with reads there was hardly any difference in fetch times. That was one of the reasons for not going ahead and re-writing the entire graph in SINGLE mode.

cc @krishnapaparaju
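
For reference, a rough sketch of where the item model is selected: the DynamoDB storage backend takes a per-store data-model key. The key format follows the awslabs backend documentation; the store names listed here are an assumption for illustration, and SINGLE is subject to the 400 KB item limit mentioned above.

```python
# Hypothetical sketch: render the per-store data-model lines for the Titan
# properties file. Store names are an assumption; the key format follows the
# awslabs dynamodb-titan storage backend docs.
STORES = ["edgestore", "graphindex"]  # assumed subset of the backend's stores


def data_model_properties(model):
    """Render data-model config lines for each store (SINGLE or MULTI)."""
    assert model in ("SINGLE", "MULTI")
    return "\n".join(
        "storage.dynamodb.stores.{}.data-model={}".format(store, model)
        for store in STORES
    )


# SINGLE packs each row into one DynamoDB item (subject to the 400 KB limit);
# MULTI spreads a row across multiple items.
print(data_model_properties("SINGLE"))
```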

@krishnapaparaju

@miteshvp thanks. We can start by experimenting with the SINGLE model (on a different instance) to check whether we do better on the WRITE side. Good to know that changing to SINGLE would not lead to any READ-side challenges. Based on the results from that experiment, we can plan a course of action.

@miteshvp (Contributor)

Closing this issue w.r.t. #11.
