[WIP] Add tolerance in RSS cluster for server going away #97

Open · wants to merge 1 commit into base: master

Conversation

@mayurdb (Collaborator) commented on Mar 20, 2023

Adds fault tolerance in the RSS cluster for one or more servers going away. This is how the functionality works:

  • A node/server goes away
  • Tasks reading/writing data from that server fail
  • The code throws a fetch failed exception in both the read and the write flow
  • If it is a shuffle map stage, the patch added in Spark takes care of rolling back the stage completely and retrying the required stages: if the fetch failure is thrown while reading shuffle data, the current and the parent stage are retried; if it is thrown while writing shuffle data, only the current stage is retried
  • The patch also adds a stage retry hook which gets triggered before the stage is retried. The hook's implementation in RSS calls registerShuffle again for this particular shuffle (see the sketch after this list)
  • A new list of available servers is picked
  • As in the normal flow, the set of RSS servers to be used is sent to the mappers and reducers as part of the shuffle handle
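
For illustration, here is a minimal sketch of what the RSS-side hook implementation behind the last three steps could look like. The trait name `StageRetryHook`, the method `onStageRetry`, and the dependency cache are assumptions made for the sketch, not the actual interface added by the Spark patch; `registerShuffle`/`unregisterShuffle` are the standard Spark 3.x `ShuffleManager` calls.

```scala
// Illustrative sketch only: StageRetryHook, onStageRetry and the dependency cache are
// hypothetical names, not the interface added by this PR or by the Spark patch.
package org.apache.spark.shuffle

import scala.collection.concurrent.TrieMap
import org.apache.spark.ShuffleDependency

trait StageRetryHook {
  /** Called by the patched scheduler just before a shuffle map stage is retried. */
  def onStageRetry(shuffleId: Int): Unit
}

class RssStageRetryHook(shuffleManager: ShuffleManager) extends StageRetryHook {

  // Dependencies remembered from the original registerShuffle call (hypothetical cache,
  // filled in by the RSS shuffle manager when a shuffle is first registered).
  private val dependenciesByShuffleId = new TrieMap[Int, ShuffleDependency[_, _, _]]()

  def rememberDependency(shuffleId: Int, dep: ShuffleDependency[_, _, _]): Unit =
    dependenciesByShuffleId.put(shuffleId, dep)

  override def onStageRetry(shuffleId: Int): Unit = {
    dependenciesByShuffleId.get(shuffleId).foreach { dep =>
      // Dropping the stale registration and registering again makes RSS pick a fresh
      // list of available servers; the new set then reaches mappers and reducers
      // through the shuffle handle, exactly as in the normal flow.
      shuffleManager.unregisterShuffle(shuffleId)
      shuffleManager.registerShuffle(shuffleId, dep)
    }
  }
}
```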

In the Spark patch, a new interface is added for the stage retry hook. I won't be able to add UTs without these changes in the Spark binary. Maybe we can upload a fat jar in the repo for that.
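
Since the Spark patch itself isn't part of this repository, the following is only a rough guess at the shape of that interface and at how the patched scheduler could invoke registered hooks before resubmitting a failed shuffle map stage; every name below is hypothetical.

```scala
// Hypothetical sketch of the scheduler-side contract the Spark patch could add; none of
// these names come from the actual patch.
package org.apache.spark.scheduler.sketch

import scala.collection.mutable.ArrayBuffer

trait StageRetryHook {
  def onStageRetry(shuffleId: Int): Unit
}

object StageRetryHooks {
  private val hooks = new ArrayBuffer[StageRetryHook]()

  /** The RSS shuffle manager would register its hook here at startup. */
  def register(hook: StageRetryHook): Unit = synchronized { hooks += hook }

  /** The patched DAGScheduler would call this just before a shuffle map stage is retried. */
  def beforeStageRetry(shuffleId: Int): Unit = synchronized {
    hooks.foreach(_.onStageRetry(shuffleId))
  }
}
```

A callback registry like this would keep the shuffle plugin decoupled from scheduler internals: the only contract between the two is the shuffle id passed at retry time.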

Also, there is a patch added in open-source Spark 3.0 for rolling back a shuffle map stage, which I haven't evaluated yet. Maybe we can make use of it to avoid the large changes here. I'll evaluate and get back on that.

@hiboyang (Contributor)

Thanks @mayurdb for the PR!
