
[ShareChat] Introduced the concept of uniform parallelism #1

Merged
merged 1 commit into main-sharechat from force-same-parallelism on Oct 31, 2024

Conversation

isburmistrov
Member

@isburmistrov isburmistrov commented Oct 22, 2024

Context

In Tardis, the autoscaler struggled to find the right balance. The issue is that with heterogeneous parallelism across vertices, the autoscaler's decisions can be suboptimal. Vertices are not independent: the current parallelism of a "parent" vertex influences how much traffic a "child" vertex receives, and therefore affects the decision when we choose the new parallelism of the "child" vertex. In short, the relative parallelism of vertices can change after a scale event, and in the new, changed situation the optimal scaling decision can be very different.

In practice this shows up as constant "bouncing": the autoscaler scales down, then quickly realizes it needs to scale back up. This process never ends, no matter how hard we try to tune the parameters.

<img width="896" alt="image" src="https://github.com/user-attachments/assets/05d32637-cc96-4d60-8021-49396e297234">
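The feedback loop above can be illustrated with a minimal sketch. The function name, rates, and capacities here are hypothetical (the PR does not show the autoscaler's internals); the point is only that the child's "optimal" parallelism is a function of the parent's current parallelism, so every scale event on the parent invalidates the previous decision for the child.

```python
def child_target_parallelism(parent_parallelism, per_parent_rate, per_task_capacity):
    """Parallelism the child vertex needs to keep up with the parent's output.

    Illustrative only: per_parent_rate is records/sec emitted by one parent
    task, per_task_capacity is records/sec one child task can process.
    """
    incoming_rate = parent_parallelism * per_parent_rate
    # Ceiling division: a fractional task is not possible.
    return -(-incoming_rate // per_task_capacity)

# Before scaling: parent runs 10 tasks, each emitting 100 rec/s;
# each child task handles 250 rec/s -> the child "needs" 4 tasks.
before = child_target_parallelism(10, 100, 250)  # -> 4

# The autoscaler then scales the parent down to 6 tasks. The child's
# optimal value shifts to 3, prompting another rescale, which again
# changes the traffic the child sees -- the "bouncing" loop.
after = child_target_parallelism(6, 100, 250)    # -> 3
```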

This PR

Introduces the concept of "uniform parallelism". To reduce the "cognitive load" on the autoscaler and prevent the relative parallelism from changing over time, we simply maintain the same parallelism across all vertices, much like we do now in Tardis.
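One way to sketch the idea (the PR body does not show the actual implementation, so the function and vertex names below are assumptions): compute the per-vertex requirements as usual, then apply the maximum to every vertex. The relative parallelism is then fixed at 1:1 and can never change after a scale event.

```python
def uniform_parallelism(vertex_required):
    """Pick one parallelism for the whole job: the max any vertex needs.

    vertex_required maps vertex id -> the parallelism that vertex would
    need on its own. Taking the max keeps every vertex at or above its
    requirement, at the cost of over-provisioning the lighter vertices.
    """
    return max(vertex_required.values())

# Hypothetical per-vertex requirements for a three-vertex job:
required = {"source": 4, "transform": 7, "sink": 3}
target = uniform_parallelism(required)  # every vertex runs at 7
```

The trade-off is deliberate: some vertices get more slots than they strictly need, but the autoscaler's decisions become independent of each other, which is what eliminates the bouncing.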

With this setting, the Tardis job autoscales as intended, maintaining a small lag without "bouncing" back.

<img width="895" alt="image" src="https://github.com/user-attachments/assets/ca2f77e8-7ad4-4009-8fdd-3be5971dd8f6">

<img width="893" alt="image" src="https://github.com/user-attachments/assets/f9060ee1-0503-47b0-ac89-8babfc472862">

@isburmistrov isburmistrov changed the base branch from main to main-sharechat October 23, 2024 13:18
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch 4 times, most recently from 419679a to d5be0d4 Compare October 23, 2024 14:02
@isburmistrov isburmistrov changed the title Introduced the concept of flat parallelism Introduced the concept of uniform parallelism Oct 23, 2024
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from d5be0d4 to a2d745a Compare October 23, 2024 14:24
Member

@david-sharechat david-sharechat left a comment

Lgtm!

@isburmistrov isburmistrov changed the title Introduced the concept of uniform parallelism [ShareChat] Introduced the concept of uniform parallelism Oct 31, 2024
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from a2d745a to 00a2e51 Compare October 31, 2024 12:31
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from 00a2e51 to 5a36e40 Compare October 31, 2024 12:32
@isburmistrov isburmistrov merged commit 9a1f687 into main-sharechat Oct 31, 2024
232 checks passed
@isburmistrov isburmistrov deleted the force-same-parallelism branch October 31, 2024 14:43
isburmistrov added a commit that referenced this pull request Oct 31, 2024