[WIP] Add tolerance in RSS cluster for server going away #97

Open · wants to merge 1 commit into base: master

Conversation

@mayurdb (Collaborator) commented on Mar 20, 2023

Adds fault tolerance in the RSS cluster for one or more servers going away. This is how the functionality works:

  • A node/server goes away
  • Tasks reading/writing data from that server fail
  • The code throws a fetch failed exception in both the read and the write flow
  • If it is a shuffle map stage, the patch added in Spark takes care of rolling back the stage completely and retrying the required stages: if the fetch failure is thrown while reading shuffle data, the current and the parent stage are retried; if it is thrown while writing shuffle data, only the current stage is retried
  • The patch also adds a stage retry hook which gets triggered before the stage is retried. The hook's implementation in RSS calls registerShuffle again for this particular shuffle (see the sketch after this list)
  • A new list of available servers is picked
  • As in the normal flow, the set of RSS servers to be used is sent to the mappers and reducers as part of the shuffle handle
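
For illustration, here is a minimal sketch of what the RSS-side hook implementation behind the last three steps could look like. The trait name `StageRetryHook`, the method `onStageRetry`, and the dependency cache are assumptions made for the sketch, not the actual interface added by the Spark patch; `registerShuffle`/`unregisterShuffle` are the standard Spark 3.x `ShuffleManager` calls.

```scala
// Illustrative sketch only: StageRetryHook, onStageRetry and the dependency cache are
// hypothetical names, not the interface added by this PR or by the Spark patch.
package org.apache.spark.shuffle

import scala.collection.concurrent.TrieMap
import org.apache.spark.ShuffleDependency

trait StageRetryHook {
  /** Called by the patched scheduler just before a shuffle map stage is retried. */
  def onStageRetry(shuffleId: Int): Unit
}

class RssStageRetryHook(shuffleManager: ShuffleManager) extends StageRetryHook {

  // Dependencies remembered from the original registerShuffle call (hypothetical cache,
  // filled in by the RSS shuffle manager when a shuffle is first registered).
  private val dependenciesByShuffleId = new TrieMap[Int, ShuffleDependency[_, _, _]]()

  def rememberDependency(shuffleId: Int, dep: ShuffleDependency[_, _, _]): Unit =
    dependenciesByShuffleId.put(shuffleId, dep)

  override def onStageRetry(shuffleId: Int): Unit = {
    dependenciesByShuffleId.get(shuffleId).foreach { dep =>
      // Dropping the stale registration and registering again makes RSS pick a fresh
      // list of available servers; the new set then reaches mappers and reducers
      // through the shuffle handle, exactly as in the normal flow.
      shuffleManager.unregisterShuffle(shuffleId)
      shuffleManager.registerShuffle(shuffleId, dep)
    }
  }
}
```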

In the Spark patch, a new interface is added for the stage retry hook. I won't be able to add UTs without these changes in the Spark binary. Maybe we can upload a fat jar in the repo for that.
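
Since the Spark patch itself isn't part of this repository, the following is only a rough guess at the shape of that interface and at how the patched scheduler could invoke registered hooks before resubmitting a failed shuffle map stage; every name below is hypothetical.

```scala
// Hypothetical sketch of the scheduler-side contract the Spark patch could add; none of
// these names come from the actual patch.
package org.apache.spark.scheduler.sketch

import scala.collection.mutable.ArrayBuffer

trait StageRetryHook {
  def onStageRetry(shuffleId: Int): Unit
}

object StageRetryHooks {
  private val hooks = new ArrayBuffer[StageRetryHook]()

  /** The RSS shuffle manager would register its hook here at startup. */
  def register(hook: StageRetryHook): Unit = synchronized { hooks += hook }

  /** The patched DAGScheduler would call this just before a shuffle map stage is retried. */
  def beforeStageRetry(shuffleId: Int): Unit = synchronized {
    hooks.foreach(_.onStageRetry(shuffleId))
  }
}
```

A callback registry like this would keep the shuffle plugin decoupled from scheduler internals: the only contract between the two is the shuffle id passed at retry time.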

Also, there is a patch added in open-source Spark 3.0 for rolling back a shuffle map stage, which I haven't evaluated yet. Maybe we can make use of it to avoid the large changes here. I'll evaluate and get back on that.

@hiboyang (Contributor)

Thanks @mayurdb for the PR!
