CEP: sharded repodata #75
# Sharded Repodata

We propose a new "repodata" format that can be fetched sparsely. That generally means smaller fetches (only fetch what you need) and faster updates of existing repodata (only fetch what has changed).

## Motivation

The current repodata format is a JSON file that contains all the packages in a given channel. Unfortunately, that means it grows with the number of packages in the channel. This is a problem for large channels like conda-forge, which has over 150,000 packages. Fetching, parsing, and updating the repodata becomes very slow.

## Design goals

1. **Speed**: Fetching repodata MUST be very fast, in both the hot- and cold-cache case.
2. **Easy to update**: The channel MUST be very easy to update when new packages become available.
3. **CDN friendly**: A CDN MUST be usable to cache the majority of the data. This reduces the operating cost of a channel.
4. **Support authN and authZ**: It MUST be possible to implement authentication and authorization with little extra overhead.
5. **Easy to implement**: It MUST be relatively straightforward to implement, to ease adoption in different tools.
6. **Client-side cachable**: A user with a hot cache SHOULD only have to download small incremental changes. Preferably, as little communication as possible with the server should be required to check the freshness of the data.
7. **Bandwidth optimized**: Any data that is transferred SHOULD be as small as possible.

## Previous work

### JLAP

A previously proposed CEP introduced [JLAP](https://github.com/conda-incubator/ceps/pull/20). With JLAP, only the changes to an initially downloaded `repodata.json` file have to be downloaded, which drastically saves bandwidth and in turn makes fetching repodata much faster.

However, in practice, patching the original repodata can be a very expensive operation, both in terms of memory and compute, because of the sheer amount of data involved.

JLAP also does not save anything with a cold cache, because the initial repodata still has to be downloaded. This is often the case for CI runners.

Finally, the implementation of JLAP is quite complex, which makes it hard to adopt for implementers.

### ZSTD compression

A notable improvement is compressing `repodata.json` with `zst` and serving that file. In practice this yields a file that is 20% of the original size (20-30 MB for large cases). Although this is still quite a big file, it is substantially smaller.

However, the file still contains all the repodata in the channel. This means the file needs to be redownloaded every time anyone adds a single package (even if a user doesn't need that package).

Because the file is relatively big, a large `max-age` is often used for caching, which means it takes more time for new packages to propagate through the ecosystem.

## Proposal

We propose a "sharded" repodata format. It works by splitting the repodata into multiple files (one per package name) and recursively fetching the "shards".

The shards are stored by the hash of their content (i.e. they are "content-addressable"): the URL of a shard is derived from its content. This allows for efficient caching and deduplication of shards on the client. Because the files are content-addressable, no round-trip to the server is required to check the freshness of individual shards.

Additionally, an index file stores the mapping from package name to shard hash.

Although not explicitly required, the server SHOULD support HTTP/2 to reduce the overhead of doing a massive number of requests.

### Repodata shard index

The shard index is a file that is stored under `<shard_base_url>/<subdir>/repodata_shards.msgpack.zst`. It is a zstandard-compressed `msgpack` file that contains a mapping from package name to shard hash. The `<shard_base_url>` is defined in [Authentication](#authentication).

The contents look like the following (written in JSON for readability):
```js
{
  "version": 1,
  "info": {
    "base_url": "https://example.com/channel/subdir/",
    "created_at": "2022-01-01T00:00:00Z",
    "...": "other metadata"
  },
  "shards": {
    // note that the hashes are stored as binary data (hex encoding just for visualization)
    "python": b"ad2c69dfa11125300530f5b390aa0f7536d1f566a6edc8823fd72f9aa33c4910",
    "numpy": b"27ea8f80237eefcb6c587fb3764529620aefb37b9a9d3143dce5d6ba4667583d",
    "...": "other packages"
  }
}
```

The index is still updated regularly, but the file does not grow with every package added; it only grows when new package names are added, which happens much less often.

For a large case (conda-forge linux-64) this file is 670 KB at the time of writing.

We suggest serving the file with a short-lived `Cache-Control` `max-age` header of 60 seconds to an hour, but we leave it up to the channel administrator to set a value that works for that channel.

### Repodata shard

Individual shards are stored under the URL `<shard_base_url>/<subdir>/shards/<sha256>.msgpack.zst`, where `<sha256>` is the lower-case hex representation of the hash bytes from the index. A shard is a zstandard-compressed msgpack file that contains the metadata of the package. The `<shard_base_url>` is defined in [Authentication](#authentication).

The files are content-addressable, which makes them ideal to be served through a CDN. They SHOULD be served with a `Cache-Control: immutable` header.

The shard contains the repodata information that would otherwise have been found in the `repodata.json` file. It is a dictionary that contains the following keys:

**Example (written in JSON for readability):**
```js
{
  // dictionary of .tar.bz2 files
  "packages": {
    "rich-10.15.2-pyhd8ed1ab_1.tar.bz2": {
      "build": "pyhd8ed1ab_1",
      "build_number": 1,
      "depends": [
        "colorama >=0.4.0,<0.5.0",
        "commonmark >=0.9.0,<0.10.0",
        "dataclasses >=0.7,<0.9",
        "pygments >=2.6.0,<3.0.0",
        "python >=3.6.2",
        "typing_extensions >=3.7.4,<5.0.0"
      ],
      "license": "MIT",
      "license_family": "MIT",
      "md5": "2456071b5d040cba000f72ced5c72032",
      "name": "rich",
      "noarch": "python",
      "sha256": "a38347390191fd3e60b17204f2f6470a013ec8753e1c2e8c9a892683f59c3e40",
      "size": 153963,
      "subdir": "noarch",
      "timestamp": 1638891318904,
      "version": "10.15.2"
    }
  },
  // dictionary of .conda files
  "packages.conda": {
    "rich-13.7.1-pyhd8ed1ab_0.conda": {
      "build": "pyhd8ed1ab_0",
      "build_number": 0,
      "depends": [
        "markdown-it-py >=2.2.0",
        "pygments >=2.13.0,<3.0.0",
        "python >=3.7.0",
        "typing_extensions >=4.0.0,<5.0.0"
      ],
      "license": "MIT",
      "license_family": "MIT",
      "md5": "ba445bf767ae6f0d959ff2b40c20912b",
      "name": "rich",
      "noarch": "python",
      "sha256": "2b26d58aa59e46f933c3126367348651b0dab6e0bf88014e857415bb184a4667",
      "size": 184347,
      "subdir": "noarch",
      "timestamp": 1709150578093,
      "version": "13.7.1"
    }
  },
  // list of strings of keys (filenames) that were removed from either packages or packages.conda
  "removed": [
    "rich-10.15.1-pyhd8ed1ab_1.tar.bz2"
  ]
}
```

The `sha256` and `md5` fields from the original repodata are converted from their hex representation to bytes. This is done to reduce the overall file size of the shards.

Implementers SHOULD ignore unknown keys; this allows adding additional keys to the format in the future without breaking old versions of tools.
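
As an illustration, a client can derive a shard's URL from an index entry and convert the binary hash back to hex with the standard library alone. The base URL and subdir below are made-up example values; the hash is the `python` entry from the index example above:

```python
# Derive the shard URL <shard_base_url>/<subdir>/shards/<sha256>.msgpack.zst
# from an index entry. The index stores the hash as raw bytes; URLs (and
# classic repodata.json) use the lower-case hex form.
shard_base_url = "https://example.com/channel/"  # example value
subdir = "linux-64"                              # example value

shard_hash = bytes.fromhex(
    "ad2c69dfa11125300530f5b390aa0f7536d1f566a6edc8823fd72f9aa33c4910"
)

shard_url = f"{shard_base_url}{subdir}/shards/{shard_hash.hex()}.msgpack.zst"
print(shard_url)
```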

Although these files can become relatively large (hundreds of kilobytes), for a large case (conda-forge) they typically remain very small, e.g. hundreds of bytes to a couple of kilobytes.

## <a id="authentication"></a>Authentication

To facilitate authentication and authorization we propose to add an additional endpoint at `<channel>/<subdir>/token` with the following content:

```json
{
  "shard_base_url": "https://shards.prefix.dev/conda-forge/<subdir>/",
  "token": "<bearer token>",
  "issued_at": "2024-01-30T03:35:39.896023447Z",
  "expires_in": 300
}
```

`shard_base_url` is an optional URL to use as the base URL for the `repodata_shards.msgpack.zst` file and the individual shards. If the field is not specified, it defaults to `<channel>/<subdir>`.

`token` is an optional field that, if set, MUST be added to any subsequent request in the `Authorization` header as `Authorization: Bearer <token>`. If the `token` field is not set, sending the `Authorization` header is not required either.

The optional `issued_at` and `expires_in` fields can be used to verify the freshness of a token. If these fields are not present, a client can assume that the token is valid for any subsequent request.

For a simple implementation, this endpoint could just be a static file with `{}` as its content.

## Fetch process

To fetch all needed package records, the client should implement the following steps:

1. Acquire a token (see [Authentication](#authentication)). Acquiring a token can be done lazily, so that a token is only requested when an actual network request is performed.
2. Fetch the `repodata_shards.msgpack.zst` file. Standard HTTP caching semantics can be applied to this file.
3. For each package name, start fetching the shards for the corresponding hashes from the index file (for both the architecture-specific subdir and noarch). Shards can be cached locally, and because they are content-addressable no additional round-trips to the server are required to check freshness. The server should also mark these with an `immutable` `Cache-Control` header.
4. Parse the requirements of the fetched records and add the package names of the requirements to the set of packages to fetch.
5. Loop back to step 3 until there are no new package names to fetch.

## Garbage collection

To prevent the cache from growing indefinitely, we propose to implement a garbage collection mechanism that removes shards that have no entry in the index file. The server should keep old shards around for a certain amount of time (e.g. one week) to allow clients with older shard-index data to fetch the previous versions.

On the client side, a garbage collection process should run every so often to remove old shards from the cache. This can be done by comparing the cached shards with the index file and removing those that are no longer referenced.
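
A client-side garbage collection pass could look like this sketch; the cache layout (one `<hex>.msgpack.zst` file per shard) is an assumption for illustration:

```python
# Remove cached shard files whose hash no longer appears in the index.
import os
import tempfile

def collect_garbage(cache_dir: str, referenced_hashes: set[str]) -> list[str]:
    """Delete cached `<hex>.msgpack.zst` files not referenced by the index."""
    removed = []
    for entry in os.listdir(cache_dir):
        if not entry.endswith(".msgpack.zst"):
            continue
        hex_hash = entry[: -len(".msgpack.zst")]
        if hex_hash not in referenced_hashes:
            os.remove(os.path.join(cache_dir, entry))
            removed.append(entry)
    return removed

# Tiny demo with a temporary cache directory and two fake shard files:
with tempfile.TemporaryDirectory() as cache:
    for h in ("aa" * 32, "bb" * 32):
        open(os.path.join(cache, f"{h}.msgpack.zst"), "w").close()
    removed = collect_garbage(cache, referenced_hashes={"aa" * 32})
    print(removed)  # only the unreferenced "bb..." shard is removed
```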

## Rejected ideas

### SHA hash compression

SHA hashes are incompressible because, in the eyes of the compressor, they are just random data. We investigated using a binary prefix tree to enable better compression, but this increased the complexity of the implementation quite a bit, which conflicts with our goal of keeping things simple.

### Shorter SHA hashes

Another approach would be to store only a prefix of each SHA hash. This reduces the total size of the hashes significantly, but it makes the client-side implementation more complex because hash collisions become an issue.

This would also make a future implementation based on an OCI registry harder, because layers in an OCI registry are also referenced by their full SHA256 hash.

### Storing the data as a struct of arrays

To improve compression we investigated storing the index file as a struct of arrays instead of as an array of structs:

```json
[
  {
    "name": "",
    "hash": ""
  },
  {
    "name": "",
    "hash": ""
  }
]
```

vs

```json
{
  "names": [...],
  "hashes": [...]
}
```

This did yield slightly better compression, but we felt it makes the format slightly harder to implement and adapt in the future, which we deemed not worth the small size decrease.
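
For clarity, the two layouts relate as follows; a toy Python conversion, where the names and truncated hash strings are dummies:

```python
# Convert the array-of-structs index layout into the struct-of-arrays
# layout that was evaluated (and rejected) for better compression.
records = [
    {"name": "python", "hash": "ad2c..."},
    {"name": "numpy", "hash": "27ea..."},
]
soa = {
    "names": [r["name"] for r in records],
    "hashes": [r["hash"] for r in records],
}
print(soa)  # {'names': ['python', 'numpy'], 'hashes': ['ad2c...', '27ea...']}
```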

## Future improvements

### Remove redundant keys

`platform` and `arch` can be removed because they can be inferred from `subdir`.

### Integrating additional data

With the total size of the repodata reduced, it becomes feasible to add additional fields directly to the repodata records. Examples are:

- `purl`, as a list of strings (Package URLs referencing the original source of the package) (see https://github.com/conda-incubator/ceps/pull/63)
- `run_exports`, as a list of strings (run exports needed to build the package) (see https://github.com/conda-incubator/ceps/pull/51)

### Update optimization

We could implement support for smaller index update files. This can be done by creating rolling daily and weekly index update files that can be used instead of fetching the whole `repodata_shards.msgpack.zst` file. The update operation is very simple (just update the hashmap with the new entries).

For this we propose to add the following two files:

- `<channel>/<subdir>/repodata_shards_daily.msgpack.zst`
- `<channel>/<subdir>/repodata_shards_weekly.msgpack.zst`

They have the same format as the `repodata_shards.msgpack.zst` file but only contain the packages that have been updated in the last day or week, respectively. `null` is used for keys that have been removed. The `created_at` field in the index file can be used to determine which file to fetch to make sure that the client has the latest information.

### Store `set(dependencies)` at the start of the shards or in a header

To reduce the time it takes to parse a shard and start fetching its dependencies, we could also store the set of all dependencies at the start of the shard, or in a separate header. This could enable fetching recursive dependencies while still parsing the records.