-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore(rfc): operation cache warmer #1115
base: main
Are you sure you want to change the base?
Conversation
rfc/distributed-operation-cache.md
Outdated
The User can push individual operations to the distributed operation cache by using the CLI: | ||
|
||
```bash | ||
wgc router cache add -g mygraph operations.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be under the namespace "router"?
At which level do we apply cache warming?
Org? Namespace? Graph? Router?
As an alternative, you could consider creating a "cache-warming-bucket", which you can then use to add operations, and which can be used in the router config, e.g.:
version: "1"
cache_warmup:
enabled: true
interval: 5m
bucket: myBucket
I'm not sure though if this is the right direction.
The cache warming bucket is somewhat closely related to a Graph because the containing operations need to be compatible with the client schema of the graph.
rfc/distributed-operation-cache.md
Outdated
] | ||
``` | ||
|
||
The cli command is idempotent and always updates the cache with the latest operations. This doesn't trigger the computation of the Top-N operations which is done periodically by the Cosmo Platform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another aspect to think about is pre-compiling operation plans.
We could make plans serializable, generate and serialize them at composition time.
Once the router wants to "load" them, it doesn't have to plan everything but can load the plan much faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for drafting this proposal, it looks great, can't wait to see it live!
Left a few questions I had during reading.
The User can push individual operations to the operation cache by using the CLI: | ||
|
||
```bash | ||
wgc federated-graph operation-cache add --graph mygraph --file operations.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about manual cache invalidation.
What will happen in the following scenario:
First, we run:
wgc federated-graph operation-cache add --graph mygraph --file operationA.json
Then we run:
wgc federated-graph operation-cache add --graph mygraph --file operationB.json
What will we have in the cache after the second operation? Will it be just Operation B or both Operation A and Operation B?
Curious, because if it is both Operation A and Operation B, the cache eventually will be filled with the manual operation and there will not be space for automatic operations unless we can invalidate it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is right. Both operations will be in the "batch". This is the part where a customer can manage the cache themself and is responsible of it. At some point, we could also introduce a percentage limit about how much space can be reserved for manual pushed operations. In the first version, we want to keep it simple.
The User can push individual operations to the operation cache by using the CLI: | ||
|
||
```bash | ||
wgc federated-graph operation-cache add --graph mygraph --file operations.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this command could be executed automatically (like part of CI/CD) or if we have to call it manually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the idea is to run this locally as well as part of the CI/CD process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'd recommend to make the operation idempotent so that we can run this over and over again on each deployment without having duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the idea. Operations are identified by the hash and appended to the "batch" until the cache is full. A user can repush the changes in another pipeline run to re-prioritize the list.
|
||
### Cosmo UI integration | ||
|
||
A User can disable the operation cache in the Cosmo UI. The User can see the current operations in the cache and remove them if necessary. The User can also see the current status of the cache and the last computation time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dumb question:
what will happen if we disable the operation cache in the Cosmo UI but keep cache_warmup
enabled in the router configuration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, Cosmo would no longer compute the latest TopN operations for you and your router will fetch at some point an outdated list of operations. It won't break anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say that in this case, the Router would try to fetch operations for warmup, but the CP won't return anything, so nothing will be warmed up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest batch of operations has been pushed on the S3. The router will fetch for it on startup and schema changes. If nothing changes, the router will always fetch the latest available batch. There is no interaction with the controlplane.
This is one possibility. It would be the "best-effort" approach because there might be operations that can still benefit from the stale cache.
Another solution is to delete the cache after the user has disabled it in the Studio, in that case the router won't find any artifact and skip the warm up. I think this is the case that @jensneuse described.
This could be made configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable to delete the warming data in S3 when the feature is disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with both ways. 👍 Let's purge it.
- Total operation pre-execution time: Normalization, Validation, Planning | ||
- Total request count | ||
|
||
The Top-N computation is done for a specific time interval e.g. 3-72 hour (configurable). The operations are sorted by the pre-execution time and request count. The Top-N operations are then pushed to the operation cache. Manual operations have a higher priority than automatic operations. This means when the cache capacity is reached, manual operations are moved to the cache first and automatic operations are removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will Top-N prioritize between request count and pre-execution time? I could see wanting to prioritize slowest planned queries and then sort by request count or vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ultimate goal is to prepare operations in advance, focusing on those that take the most time. We will only sort by request count when two items have the same pre-execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @joshlevinson is right in that we should prioritize by planning time and then request count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We mean the same thing 😄
|
||
### Platform integration | ||
|
||
For containerized environments like Kubernetes, users should use the readiness probe to ensure that the router is ready to accept traffic. Setting not to small values for the readiness probe timeout is recommended to ensure that the router has enough time to prepare the cache. For schema updates after startup, this process is non-blocking because the new graph schema isn't swapped until the cache is warmed up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To double-check:
/health/ready
will be successful only when the router is warmed up, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right!
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Motivation and Context
TODO