Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design or prototype distributed execution #25

Open
fpetkovski opened this issue Sep 23, 2022 · 4 comments
Open

Design or prototype distributed execution #25

fpetkovski opened this issue Sep 23, 2022 · 4 comments

Comments

@fpetkovski
Copy link
Collaborator

We would likely need to a logical plan first, but we can already start thinking about distributed query execution and how we can implement it in the engine.

I would avoid adding networking specifics to the engine itself, and defer that responsibility that to the library user. This way, a project that uses the engine might decide to inject an implementation based on gRPC, plain HTTP, or some other protocol for communicating between engine instances.

The engine itself would define an interface for inter-engine communication, and would make sure the query is properly decomposed and merged back after all parts of the plan complete successfully.

@alanprot
Copy link
Contributor

alanprot commented Sep 28, 2022

This seems very interesting...

Besides running a single query on multiple pods this would also allow compute parts of the query closer to the data (on the storage nodes) and reduce network traffic quite a bit, right? As we would not need transfer raw data anymore.

@GiedriusS
Copy link
Member

I was thinking a lot about this topic and I think the main problem at least in the Thanos space is that we do read-time deduplication instead of write-time deduplication. In my opinion, the problem with read-time deduplication is that we don't know whether there are any gaps in the data stored in a node (or any block) thus we need to download everything (all matching blocks) to be able to deduplicate effectively. If we would have identical copies of deduplicated data on multiple replicas then we could effectively execute a given query if all of the needed data for that query resides in the replica of that data.

@alanprot
Copy link
Contributor

alanprot commented Oct 4, 2022

Indeed, another possibility would be to support sharding natively, so we could split aggregation over multiple pods?

@fpetkovski
Copy link
Collaborator Author

That would be the way to go I think. But it should really be up to the user to decide where the engine will run. So in theory, if you can guarantee unique data in a Store, you could also embed an engine there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants