
[ShareChat] Introduced the concept of uniform parallelism #1

Merged
merged 1 commit into main-sharechat from force-same-parallelism on Oct 31, 2024

Conversation

isburmistrov
Member

@isburmistrov isburmistrov commented Oct 22, 2024

Context

In Tardis, the autoscaler struggled to find the right balance. The issue is that with heterogeneous parallelism across vertices, the autoscaler's decisions can be suboptimal. Vertices are not independent: the current parallelism of a "parent" vertex influences how much traffic a "child" vertex receives, and therefore affects the decision when we choose the new parallelism of the "child" vertex. In short, the relative parallelism of vertices can change after a scale event, and in the new, changed situation the optimal scaling decision can be very different.

In practice this shows up as constant "bouncing": the autoscaler scales down, then quickly realizes it needs to scale back up. This process never ends, no matter how hard we try to tune the parameters.

<img width="896" alt="image" src="https://github.com/user-attachments/assets/05d32637-cc96-4d60-8021-49396e297234">
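The feedback loop above can be illustrated with a minimal sketch. The function name, rates, and capacities here are hypothetical (the PR does not show the autoscaler's internals); the point is only that the child's "optimal" parallelism is a function of the parent's current parallelism, so every scale event on the parent invalidates the previous decision for the child.

```python
def child_target_parallelism(parent_parallelism, per_parent_rate, per_task_capacity):
    """Parallelism the child vertex needs to keep up with the parent's output.

    Illustrative only: per_parent_rate is records/sec emitted by one parent
    task, per_task_capacity is records/sec one child task can process.
    """
    incoming_rate = parent_parallelism * per_parent_rate
    # Ceiling division: a fractional task is not possible.
    return -(-incoming_rate // per_task_capacity)

# Before scaling: parent runs 10 tasks, each emitting 100 rec/s;
# each child task handles 250 rec/s -> the child "needs" 4 tasks.
before = child_target_parallelism(10, 100, 250)  # -> 4

# The autoscaler then scales the parent down to 6 tasks. The child's
# optimal value shifts to 3, prompting another rescale, which again
# changes the traffic the child sees -- the "bouncing" loop.
after = child_target_parallelism(6, 100, 250)    # -> 3
```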

This PR

Introduces the concept of "uniform parallelism". To reduce the "cognitive load" on the autoscaler and prevent the relative parallelism from changing over time, we simply maintain the same parallelism across all vertices, much like we do now in Tardis.
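One way to sketch the idea (the PR body does not show the actual implementation, so the function and vertex names below are assumptions): compute the per-vertex requirements as usual, then apply the maximum to every vertex. The relative parallelism is then fixed at 1:1 and can never change after a scale event.

```python
def uniform_parallelism(vertex_required):
    """Pick one parallelism for the whole job: the max any vertex needs.

    vertex_required maps vertex id -> the parallelism that vertex would
    need on its own. Taking the max keeps every vertex at or above its
    requirement, at the cost of over-provisioning the lighter vertices.
    """
    return max(vertex_required.values())

# Hypothetical per-vertex requirements for a three-vertex job:
required = {"source": 4, "transform": 7, "sink": 3}
target = uniform_parallelism(required)  # every vertex runs at 7
```

The trade-off is deliberate: some vertices get more slots than they strictly need, but the autoscaler's decisions become independent of each other, which is what eliminates the bouncing.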

With this setting, the Tardis job autoscales as intended, maintaining a small lag without "bouncing" back.

<img width="895" alt="image" src="https://github.com/user-attachments/assets/ca2f77e8-7ad4-4009-8fdd-3be5971dd8f6">

<img width="893" alt="image" src="https://github.com/user-attachments/assets/f9060ee1-0503-47b0-ac89-8babfc472862">

@isburmistrov isburmistrov changed the base branch from main to main-sharechat October 23, 2024 13:18
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch 4 times, most recently from 419679a to d5be0d4 Compare October 23, 2024 14:02
@isburmistrov isburmistrov changed the title Introduced the concept of flat parallelism Introduced the concept of uniform parallelism Oct 23, 2024
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from d5be0d4 to a2d745a Compare October 23, 2024 14:24
Member

@david-sharechat david-sharechat left a comment

Lgtm!

@isburmistrov isburmistrov changed the title Introduced the concept of uniform parallelism [ShareChat] Introduced the concept of uniform parallelism Oct 31, 2024
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from a2d745a to 00a2e51 Compare October 31, 2024 12:31
@isburmistrov isburmistrov force-pushed the force-same-parallelism branch from 00a2e51 to 5a36e40 Compare October 31, 2024 12:32
@isburmistrov isburmistrov merged commit 9a1f687 into main-sharechat Oct 31, 2024
232 checks passed
@isburmistrov isburmistrov deleted the force-same-parallelism branch October 31, 2024 14:43
isburmistrov added a commit that referenced this pull request Oct 31, 2024