Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(rfc): operation cache warmer #1115

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

StarpTech
Copy link
Contributor

Motivation and Context

TODO

The User can push individual operations to the distributed operation cache by using the CLI:

```bash
wgc router cache add -g mygraph operations.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be under the namespace "router"?
At which level do we apply cache warming?
Org? Namespace? Graph? Router?

As an alternative, you could consider creating a "cache-warming-bucket", which you can then use to add operations, and which can be used in the router config, e.g.:

version: "1"

cache_warmup:
  enabled: true
  interval: 5m
  bucket: myBucket

I'm not sure though if this is the right direction.
The cache warming bucket is somewhat closely related to a Graph because the containing operations need to be compatible with the client schema of the graph.

]
```

The cli command is idempotent and always updates the cache with the latest operations. This doesn't trigger the computation of the Top-N operations which is done periodically by the Cosmo Platform.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another aspect to think about is pre-compiling operation plans.
We could make plans serializable, generate and serialize them at composition time.
Once the router wants to "load" them, it doesn't have to plan everything but can load the plan much faster.

@StarpTech StarpTech changed the title chore(rfc): distributed operation cache chore(rfc): operation cache warmer Aug 25, 2024
Copy link

@vasylruban-nw vasylruban-nw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for drafting this proposal, it looks great, can't wait to see it live!

Left a few questions I had during reading.

The User can push individual operations to the operation cache by using the CLI:

```bash
wgc federated-graph operation-cache add --graph mygraph --file operations.json

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about manual cache invalidation.

What will happen in the following scenario:

First, we run:

wgc federated-graph operation-cache add --graph mygraph --file operationA.json

Then we run:

wgc federated-graph operation-cache add --graph mygraph --file operationB.json

What will we have in the cache after the second operation? Will it be just Operation B or both Operation A and Operation B?

Curious, because if it is both Operation A and Operation B, the cache eventually will be filled with the manual operation and there will not be space for automatic operations unless we can invalidate it explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is right. Both operations will be in the "batch". This is the part where a customer can manage the cache themself and is responsible of it. At some point, we could also introduce a percentage limit about how much space can be reserved for manual pushed operations. In the first version, we want to keep it simple.

The User can push individual operations to the operation cache by using the CLI:

```bash
wgc federated-graph operation-cache add --graph mygraph --file operations.json

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this command could be executed automatically (like part of CI/CD) or if we have to call it manually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea is to run this locally as well as part of the CI/CD process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'd recommend to make the operation idempotent so that we can run this over and over again on each deployment without having duplicates.

Copy link
Contributor Author

@StarpTech StarpTech Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the idea. Operations are identified by the hash and appended to the "batch" until the cache is full. A user can repush the changes in another pipeline run to re-prioritize the list.


### Cosmo UI integration

A User can disable the operation cache in the Cosmo UI. The User can see the current operations in the cache and remove them if necessary. The User can also see the current status of the cache and the last computation time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question:

what will happen if we disable the operation cache in the Cosmo UI but keep cache_warmup enabled in the router configuration?

Copy link
Contributor Author

@StarpTech StarpTech Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, Cosmo would no longer compute the latest TopN operations for you and your router will fetch at some point an outdated list of operations. It won't break anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that in this case, the Router would try to fetch operations for warmup, but the CP won't return anything, so nothing will be warmed up.

Copy link
Contributor Author

@StarpTech StarpTech Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest batch of operations has been pushed on the S3. The router will fetch for it on startup and schema changes. If nothing changes, the router will always fetch the latest available batch. There is no interaction with the controlplane.

This is one possibility. It would be the "best-effort" approach because there might be operations that can still benefit from the stale cache.

Another solution is to delete the cache after the user has disabled it in the Studio, in that case the router won't find any artifact and skip the warm up. I think this is the case that @jensneuse described.

This could be made configurable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's reasonable to delete the warming data in S3 when the feature is disabled.

Copy link
Contributor Author

@StarpTech StarpTech Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with both ways. 👍 Let's purge it.

- Total operation pre-execution time: Normalization, Validation, Planning
- Total request count

The Top-N computation is done for a specific time interval e.g. 3-72 hour (configurable). The operations are sorted by the pre-execution time and request count. The Top-N operations are then pushed to the operation cache. Manual operations have a higher priority than automatic operations. This means when the cache capacity is reached, manual operations are moved to the cache first and automatic operations are removed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will Top-N prioritize between request count and pre-execution time? I could see wanting to prioritize slowest planned queries and then sort by request count or vice versa.

Copy link
Contributor Author

@StarpTech StarpTech Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ultimate goal is to prepare operations in advance, focusing on those that take the most time. We will only sort by request count when two items have the same pre-execution time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @joshlevinson is right in that we should prioritize by planning time and then request count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We mean the same thing 😄


### Platform integration

For containerized environments like Kubernetes, users should use the readiness probe to ensure that the router is ready to accept traffic. Setting not to small values for the readiness probe timeout is recommended to ensure that the router has enough time to prepare the cache. For schema updates after startup, this process is non-blocking because the new graph schema isn't swapped until the cache is warmed up.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To double-check:
/health/ready will be successful only when the router is warmed up, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right!

Copy link

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 25, 2024
Copy link

github-actions bot commented Oct 9, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Oct 9, 2024
@StarpTech StarpTech removed the Stale label Oct 9, 2024
@StarpTech StarpTech reopened this Oct 9, 2024
Copy link

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 24, 2024
@StarpTech StarpTech removed the Stale label Oct 24, 2024
Copy link

github-actions bot commented Nov 8, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 8, 2024
@StarpTech StarpTech removed the Stale label Nov 8, 2024
Copy link

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 23, 2024
@StarpTech StarpTech removed the Stale label Nov 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants