CEP: sharded repodata #75

Merged
merged 15 commits into from
Jul 22, 2024
260 changes: 260 additions & 0 deletions cep-16.md
@@ -0,0 +1,260 @@
# Sharded Repodata

We propose a new "repodata" format that can be sparsely fetched. That means, generally, smaller fetches (only fetch what you need) and faster updates of existing repodata (only fetch what has changed).

## Motivation

The current repodata format is a JSON file that contains all the packages in a given channel. Unfortunately, that means it grows with the number of packages in the channel. This is a problem for large channels like conda-forge, which has over 150,000 packages. The repodata becomes very slow to fetch, parse and update.

## Design goals

1. **Speed**: Fetching repodata MUST be very fast, in both the hot-cache and the cold-cache case.
2. **Easy to update**: The channel MUST be very easy to update when new packages become available.
3. **CDN friendly**: A CDN MUST be usable to cache the majority of the data. This reduces the operating cost of a channel.
4. **Support authN and authZ**: It MUST be possible to implement authentication and authorization with little extra overhead.
Member

That seems like an odd duck, authorization shouldn't be attached to the repodata format as it's part of the transport layer, not the application layer.

Contributor Author

@baszalmstra baszalmstra May 7, 2024

I'm not sure why this CEP cannot make changes to both? Authenticating requests can add serious overhead to the server-side implementation, which in turn affects the performance on the client. Having a way to effectively "cache" the authentication drastically reduces the work on the server, which, given the sheer number of requests introduced by this CEP, I think makes sense to address as well. Note that the approach taken here is very similar to how OCI registries operate.

Contributor

I would also argue that we're proposing more of a transport protocol here where the authentication fits in pretty well as a key part of the whole thing.

Contributor

Lastly, you can also skip the token by not replying with one on the endpoint and use more traditional means instead (depending on how your authentication works, it might be slower).

I also feel that authorization should be split off. You can do this in multiple ways that are independent of how the shards are actually implemented.

Contributor Author

I removed authentication from the CEP and marked it as something we should investigate in another CEP. I still left it in the design goals because I still think we shouldn't implement something that would rule this out.

5. **Easy to implement**: It MUST be relatively straightforward to implement to ease the adoption in different tools.
6. **Client-side cacheable**: If a user has a hot cache, the user SHOULD only have to download small incremental changes. Preferably, as little communication as possible with the server should be required to check the freshness of the data.
7. **Bandwidth optimized**: Any data that is transferred SHOULD be as small as possible.

## Previous work

### JLAP

In a previously proposed CEP, [JLAP](https://github.com/conda-incubator/ceps/pull/20) was introduced.
With JLAP, only the changes to an initially downloaded `repodata.json` file have to be downloaded. This drastically reduces bandwidth usage, which in turn makes fetching repodata much faster.

However, in practice patching the original repodata can be a very expensive operation, both in terms of memory and in terms of compute because of the sheer amount of data involved.

JLAP also does not save anything with a cold cache because the initial repodata still has to be downloaded. This is often the case for CI runners.

Finally, the implementation of JLAP is quite complex which makes it hard to adopt for implementers.
Contributor

IMO the complexity is the least fair criticism of JLAP, it can yield a small implementation.

Contributor Author

IMHO in theory it's pretty straightforward, but to integrate it properly there are a number of additional complexities that make this statement true. To name a few: it requires maintaining extra state on disk, HTTP range requests, we need to make sure that the repodata.json is still the previously stored repodata, indexing requires the previous state, and there is the problem that you need an additional overlay file to make it perform optimally.

But in fairness this is just my personal experience while implementing it.

Contributor

@wolfv wolfv May 9, 2024

@dholth I think we can claim that we are the only ones shipping JLAP in production and it's been pretty rough, unfortunately (we observe a waiting progress bar for multiple seconds while patches are applied and the file is serialized). We could improve with the two file approach but I think we make a good point about why this proposal is simpler.

Also, one thing we gain with our proposal here is complete hash-integrity. Something that is not possible with the JLAP two-file approach. We want to enable signed repodata and packages (TUF style) sooner or later and I don't see a way to do that with JLAP's two-file approach.

With our approach, we can at any time prove the integrity of the whole thing.


### ZSTD compression

A notable improvement is compressing the `repodata.json` with `zst` and serving that file. In practice this yields a file that is 20% of the original size (20-30 MB for large cases). Although this is still quite a big file, it's substantially smaller.

However, the file still contains all repodata in the channel. This means the file needs to be redownloaded every time anyone adds a single package (even if a user doesn't need that package).

Because the file is relatively big, a large `max-age` is often used for caching, which means it takes more time to propagate new packages through the ecosystem.

## Proposal

We propose a "sharded" repodata format. It works by splitting the repodata into multiple files (one per package name) and recursively fetching the "shards".
@aovasylenko aovasylenko May 29, 2024

It reminds me of the current simple PyPI index (mentioned below), where the whole Python ecosystem of 500K+ packages leads to that number of shards.
Were there any ideas about evaluating a smaller number of shards, for example by having a prefix-based rule or range?
My worry is that, depending on the server-side implementation, the initial pull of this information will generate a massive number of requests (even when using HTTP/2), and on the API/DB level each request could be processed separately.

Contributor

We haven't really discussed different sharding strategies. However, our implementation uses a CDN + static files and that seems to work relatively well. It also seems to work well enough for PyPI and crates.io so I don't think we're particularly concerned.


The shards are stored by the hash of their content (i.e. they are "content-addressable").
That means that the URL of the shard is derived from the content of the shard. This allows for efficient caching and deduplication of shards on the client. Because the files are content-addressable no round-trip to the server is required to check freshness of individual shards.

Additionally an index file stores the mapping from package name to shard hash.
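
As a minimal sketch of what content addressing means for a client cache, the snippet below derives a shard's key from its bytes. It assumes the key is the SHA-256 of the bytes exactly as they are stored on the server; `shard_key` and `ShardCache` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Dict, Optional


def shard_key(shard_bytes: bytes) -> str:
    """Return the lower-case hex SHA-256 digest used to address a shard."""
    return hashlib.sha256(shard_bytes).hexdigest()


class ShardCache:
    """In-memory stand-in for an on-disk shard cache (hypothetical helper)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, shard_bytes: bytes) -> str:
        key = shard_key(shard_bytes)
        # Identical content always maps to the same key, so it is stored once.
        self._blobs[key] = shard_bytes
        return key

    def get(self, key: str) -> Optional[bytes]:
        # A hit never needs revalidation: the key itself pins the content.
        return self._blobs.get(key)

    def __len__(self) -> int:
        return len(self._blobs)
```

Because the key is derived from the content, a cached entry can also be re-hashed at any time to check that it has not been corrupted.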

Although not explicitly required, the server SHOULD support HTTP/2 to reduce the overhead of making a massive number of requests.
Member

Given the sheer number of files, I think realistically this is a MUST.

Contributor Author

We have been running this with HTTP/1 by accident for a while; it's definitely slower but is still very acceptable. I left this as a SHOULD because everything still works perfectly fine without HTTP/2.

Contributor

Does pip do something similar with the simple index, requesting one per package name?

Contributor Author

Yes there is one index per package name e.g. https://pypi.org/simple/pinject .

But this index does not store any dependency information for any of the artifacts. An additional request is required PER artifact to get the dependency information.

In this proposal we add the dependency information of individual packages directly to the per-package-name shard.

Contributor

pip is regarded as fast even though it does not use HTTP/2 and has a similar access pattern to what this CEP would imply. You might keep a few pipelined HTTP 1 connections open per repodata server to grab everything.

I would also expect that HTTP/1 already gives a reasonable improvement. Most corporate proxies, but not all, support HTTP/2 by now, and thus I'm always happy to have enhancements that can fall back to HTTP/1.

Contributor

I tried doing an http/1 fetch of all ~10k files from the prefix test server using a 20-thread (and 10 connections at once pool) with Python requests. It takes about 20 seconds to fetch everything with a hot cache, and the spec is designed for us to fetch much less than everything.


### Repodata shard index

The shard index is a file that is stored under `<shard_base_url>/<subdir>/repodata_shards.msgpack.zst`. It is a zstandard compressed `msgpack` file that contains a mapping from package name to shard hash. The `<shard_base_url>` is defined in [Authentication](#authentication).

The contents look like the following (written in JSON for readability):

```js
{
  "version": 1,
  "info": {
    "base_url": "https://example.com/channel/subdir/",
    "created_at": "2022-01-01T00:00:00Z",
    "...": "other metadata"
  },
  "shards": {
    // note that the hashes are stored as binary data (hex encoding just for visualization)
    "python": b"ad2c69dfa11125300530f5b390aa0f7536d1f566a6edc8823fd72f9aa33c4910",
    "numpy": b"27ea8f80237eefcb6c587fb3764529620aefb37b9a9d3143dce5d6ba4667583d",
    "...": "other packages"
  }
}
```

The index is still updated regularly, but the file does not increase in size with every package added, only when new package names are added, which happens much less often.

For a large case (conda-forge linux-64) this file is 670 kB at the time of writing.

We suggest serving the file with a short-lived `Cache-Control` `max-age` header of 60 seconds to an hour, but we leave it up to the channel administrator to set a value that works for that channel.
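
To illustrate how a client might act on that `max-age` value, here is a deliberately simplified freshness check. It ignores other HTTP caching inputs such as the `Age` header or conditional revalidation, and `index_is_fresh` is a hypothetical helper name.

```python
import time
from typing import Optional


def index_is_fresh(fetched_at: float, max_age: int, now: Optional[float] = None) -> bool:
    """Return True while a cached shard index may be reused without a request.

    `fetched_at` and `now` are Unix timestamps in seconds; `max_age` is the
    value from the `Cache-Control: max-age=<seconds>` response header.
    """
    if now is None:
        now = time.time()
    return (now - fetched_at) < max_age
```

With `max_age=60`, an index fetched 30 seconds ago is reused as-is, while one fetched 60 or more seconds ago triggers a new request.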
Member

Why do you suggest 60 seconds TTL?

Contributor Author

I feel like this provides a sane tradeoff between limiting the number of requests to the server while also reducing the time it takes to get new packages. But TBH it's quite random.


### Repodata shard

Individual shards are stored under the URL `<shard_base_url>/<subdir>/shards/<sha256>.msgpack.zst`, where `<sha256>` is the lower-case hex representation of the hash bytes from the index. Each shard is a zstandard compressed msgpack file that contains the metadata of the package. The `<shard_base_url>` is defined in [Authentication](#authentication).

The files are content-addressable, which makes them ideal to be served through a CDN. They SHOULD be served with a `Cache-Control: immutable` header.
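
The URL layout described above can be sketched as follows. The two helper names are hypothetical, and the sketch assumes the index stores 32-byte SHA-256 digests as raw bytes.

```python
def shard_index_url(shard_base_url: str, subdir: str) -> str:
    """URL of the shard index for one subdir."""
    return f"{shard_base_url.rstrip('/')}/{subdir}/repodata_shards.msgpack.zst"


def shard_url(shard_base_url: str, subdir: str, digest: bytes) -> str:
    """URL of a single shard, addressed by the digest found in the index."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte SHA-256 digest")
    # bytes.hex() always produces lower-case hex, as the layout requires.
    return f"{shard_base_url.rstrip('/')}/{subdir}/shards/{digest.hex()}.msgpack.zst"
```

Since the URL is fully determined by the digest, a client can check its local cache for `<sha256>.msgpack.zst` before issuing any request.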
Member

Why SHOULD and not MUST?

Contributor Author

An implementation doesn't care about Cache-Control headers, it just uses the content-address. However, for CDNs having the immutable header ensures that the content is cached on the edge for as long as possible. But it is not a requirement.

Contributor

Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

> Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

Most permission-based solutions I have yet encountered (I have not seen Anaconda Enterprise) all managed permissions on the basis of whole channels. This would stay compatible with the current approach and was something that was also compatible with the old approach.

For giving access to a subset of packages in a channel, you would need to serve copies of repodata.json depending on the user. As the new approach (new = sharding) would still also serve the old repodata.json, I think this can only be supported with a backend that checks permissions on a per-request basis and doesn't do authentication purely based on the URL.

Another form of permission for content-addressed blob store systems I've seen is to store some additional tag style header attributes with the content blob and use that to drive authz. gcp and aws both support setting this kind of metadata and you can perform the auth part with a proxy in front of the blob store


The shard contains the repodata information that would otherwise have been found in the `repodata.json` file.
It is a dictionary that contains the following keys:

**Example (written in JSON for readability):**

```js
{
  // dictionary of .tar.bz2 files
  "packages": {
    "rich-10.15.2-pyhd8ed1ab_1.tar.bz2": {
      "build": "pyhd8ed1ab_1",
      "build_number": 1,
      "depends": [
        "colorama >=0.4.0,<0.5.0",
        "commonmark >=0.9.0,<0.10.0",
        "dataclasses >=0.7,<0.9",
        "pygments >=2.6.0,<3.0.0",
        "python >=3.6.2",
        "typing_extensions >=3.7.4,<5.0.0"
      ],
      "license": "MIT",
      "license_family": "MIT",
      "md5": "2456071b5d040cba000f72ced5c72032",
      "name": "rich",
      "noarch": "python",
      "sha256": "a38347390191fd3e60b17204f2f6470a013ec8753e1c2e8c9a892683f59c3e40",
      "size": 153963,
      "subdir": "noarch",
      "timestamp": 1638891318904,
      "version": "10.15.2"
    }
  },
  // dictionary of .conda files
  "packages.conda": {
    "rich-13.7.1-pyhd8ed1ab_0.conda": {
      "build": "pyhd8ed1ab_0",
      "build_number": 0,
      "depends": [
        "markdown-it-py >=2.2.0",
        "pygments >=2.13.0,<3.0.0",
        "python >=3.7.0",
        "typing_extensions >=4.0.0,<5.0.0"
      ],
      "license": "MIT",
      "license_family": "MIT",
      "md5": "ba445bf767ae6f0d959ff2b40c20912b",
      "name": "rich",
      "noarch": "python",
      "sha256": "2b26d58aa59e46f933c3126367348651b0dab6e0bf88014e857415bb184a4667",
      "size": 184347,
      "subdir": "noarch",
      "timestamp": 1709150578093,
      "version": "13.7.1"
    }
  },
  // list of strings of keys (filenames) that were removed from either packages or packages.conda
  "removed": [
    "rich-10.15.1-pyhd8ed1ab_1.tar.bz2"
  ]
}
```

The `sha256` and `md5` fields from the original repodata are converted from their hex representation to bytes.
This is done to reduce the overall file size of the shards.
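
A small sketch of that conversion, assuming records are plain dictionaries as in the example above; `pack_hashes` and `unpack_hashes` are hypothetical helper names. A hex string needs two characters per byte, so the raw form of a SHA-256 is 32 bytes instead of 64 characters, and an MD5 is 16 bytes instead of 32.

```python
HASH_FIELDS = ("sha256", "md5")


def pack_hashes(record: dict) -> dict:
    """Return a copy of a repodata record with hex hashes converted to bytes."""
    packed = dict(record)
    for key in HASH_FIELDS:
        if key in packed:
            packed[key] = bytes.fromhex(packed[key])
    return packed


def unpack_hashes(record: dict) -> dict:
    """Inverse of pack_hashes: convert raw hash bytes back to lower-case hex."""
    unpacked = dict(record)
    for key in HASH_FIELDS:
        if key in unpacked:
            unpacked[key] = unpacked[key].hex()
    return unpacked
```

Converting back with `unpack_hashes` yields the same strings that appear in `repodata.json`, provided those were lower-case hex to begin with.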

Implementers SHOULD ignore unknown keys. This allows adding additional keys to the format in the future without breaking old versions of tools.
Member

CEP 15 adds a repodata_version field incrementing it to 2 when a base_url is present. Since you're willing to update a number of fields, this should be incremented to 3, so clients don't have to implement forever backwards-compatible repodata loaders.

Contributor Author

The downside of that though is that a client that doesn't support version 3 will not be able to parse the repodata, effectively breaking it. Adding keys is forward-compatible as long as the client is not strict about unknown keys.

Contributor Author

Or did I misunderstand and is the repodata_version field forward-compatible?

Contributor

@dholth dholth May 8, 2024

We know that optional fields like "binstar" from anaconda.org non-CDN channels exist. Or whatever happens to be in the package's index.json goes into repodata.

But fetching files from the wrong url when base_url is present is obviously broken.

Should the index shard use repodata_version instead of "version": 1?

It would be good form to have an algorithm for converting between repodata.json and shards in both directions.


Although these files can become relatively large (100s of kilobytes), for a large channel (conda-forge) they typically remain very small, e.g. 100s of bytes to a couple of kilobytes.
Member

❤️

Contributor

What if we allow a shard to contain the metadata of more than one package, so that multiple package names could point to the same shard? It is a nuisance to download many 300-byte shards.

Contributor

Yeah we could, but it comes with some downsides:

  • cache might not be as efficient as multiple packages might change at different frequencies and cache becomes invalid more often
  • it's more complicated on the implementation as packages don't map 1-1 to shards, but that would not be too bad.

We could keep the current repodata_shards index structure and use the same hash for multiple package names (that would probably compress decently).

However, I am not sure if the tradeoff is worth it.

Also it will start to look a bit like zchunk then :)

Contributor

I tried using a buffer to combine sequential (ordered by package name) shards when the combined size is not > 8192 bytes compressed. So we still wind up with plenty of ~300 byte single package shards if it sees (large shard, small shard, large shard) but it's all in order.

That scheme gives 7987 files and 10883 package names on a conda-forge/linux-64 snapshot, probably taking 75% of the time to download if we happen to want to download everything. Unlike zchunk we have the option of not maintaining a complete local copy of the data.

The most packages combined into a single shard are these fifteen.

['perl-bit-vector-7.4-pl5321h166bdaf_0.tar.bz2',
 'perl-capture-tiny-0.48-pl5321ha770c72_1.tar.bz2',
 'perl-carp-assert-0.21-pl526_0.tar.bz2',
 'perl-class-method-modifiers-2.13-pl5321ha770c72_0.tar.bz2',
 'perl-clone-0.46-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-bzip2-2.201-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-zlib-2.202-pl5321h166bdaf_0.tar.bz2',
 'perl-data-dumper-2.183-pl5321h166bdaf_0.tar.bz2',
 'perl-b-hooks-endofscope-0.26-pl5321ha770c72_0.conda',
 'perl-b-hooks-endofscope-0.28-pl5321ha770c72_0.conda',
 'perl-class-data-inheritable-0.09-pl5321ha770c72_0.conda',
 'perl-class-load-0.25-pl5321ha770c72_0.conda',
 'perl-class-load-xs-0.10-pl5321h0b41bf4_0.conda',
 'perl-class-method-modifiers-2.14-pl5321ha770c72_0.conda',
 'perl-class-method-modifiers-2.15-pl5321ha770c72_0.conda',
 'perl-cpan-meta-check-0.014-pl5321ha770c72_0.conda',
 'perl-data-dumper-2.183-pl5321hd590300_0.conda',
 'perl-data-dumper-concise-2.023-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.112-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.113-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.114-pl5321ha770c72_0.conda']

Contributor

I am wondering, if you need everything, wouldn't you rather download the repodata.json file instead? I think for us that wasn't the use case we optimized for.

Contributor

I'm imagining a situation in which there is only repodata_shards. Suppose we are in the middle of processing a bunch of queries on the SubdirData and suddenly we encounter a query for "any package name, md5=x". Are we more likely to reject the query, download everything, or eject to downloading an alternate format?

Contributor

Why not deprecate it, and grab the repodata.json in the meantime? Tell people to use proper lockfiles? :)

Contributor

What if we simplified the implementation, and included one repodata shard with all the packages in it, and an index that pointed to that singular shard 10k times 🤣


## <a id="authentication"></a>Authentication

I'm -1 on adding this specific bit of authentication to the sharding proposal. We should be able to implement the sharding approach independently of the authentication mechanism.

I don't see in some setups how we would be able to provide such an endpoint. This ties the CEP to a certain kind of authentication and removes the flexibility that we currently have in also using HTTP Basic Auth instead of token based auth.

Contributor Author

I am open to opening another CEP specifically for authentication of the sharded index but let me first explain the reasoning behind why we added this.

The reason we added this is that we have noticed that adding support for generic authentication methods provides a very significant overhead to fetching the individual shards and a large infrastructure cost.

Take as an example our prefix.dev backend. When requesting private repodata a user requests https://prefix.dev/mychannel/win-64/repodata.json, our backend server authenticates the request using either a conda token, basic auth or an API key and redirects the request to a presigned URL. If we did something similar for the shards, each request would have to go through this dynamic endpoint. This adds additional delay to the request and incurs an infrastructure cost due to the required compute. And we are talking about hundreds to thousands of requests when a user is fetching repodata. We have noticed that this significantly impacts the performance.

Ideally, we want to have a solution where the "presigning" happens just once and we can reuse this information on subsequent requests.

I am open to alternative solutions, but I think being able to authenticate once for all subsequent requests is very important to how effective this proposal is. Standardizing this method is also important to ensure that every client will be able to access different server implementations.

Then with regards to our specific solution using the token endpoint. The token endpoint only really needs to be dynamic when the authentication mechanism used is dynamic. In any other case, the token end-point could just be an empty static file. Basic authentication or even conda-token-based authentication can still be applied to both the individual shards and the shard index.

As an alternative though, we could also look at returning a set of headers from the initial request to the shard index instead of requesting a separate file.

FYI @jezdez

Contributor

@wolfv wolfv May 13, 2024

I think what we are not making clear in the CEP is that "traditional" auth methods should continue to work just as well. For example, if the server uses basic HTTP authentication or conda token authentication, this should continue to work (and depending on the server implementation should also not negatively affect performance).

We are just interested in adding a mechanism to the protocol in order to facilitate really fast performance for large public channels.

As Bas mentioned, I think we can make this an "add-on" behavior when the request to repodata_shards.msgpack.zst returns specific header values such as:

X-Shard-BaseUrl: https://fast.prefix.dev/conda-forge/
X-Shard-Token: pfx_secret_token
X-Shard-Token-ExpiresAt: TIMESTAMP

The client would then use these values for subsequent requests. If these headers are not returned, it will just use the "regular" URL with regular authentication.

With these comments, I understand the need for an efficient implementation. Still, since it is part of this proposal, it seems necessary to implement sharding fully. We will have several places where this pre-signed authentication will not work (e.g., when using an HTTP caching proxy or Artifactory's implementation). I would love to use sharding in these places at the cost of the additional authentication checks.


To facilitate authentication and authorization we propose to add an additional endpoint at `<channel>/<subdir>/token` with the following content:

```json
{
  "shard_base_url": "https://shards.prefix.dev/conda-forge/<subdir>/",
  "token": "<bearer token>",
  "issued_at": "2024-01-30T03:35:39.896023447Z",
  "expires_in": 300
}
```

`shard_base_url` is an optional url to use as the base url for the `repodata_shards.msgpack.zst` and the individual shards. If the field is not specified it should default to `<channel>/<subdir>`.

`token` is an optional field that, if set, MUST be added to any subsequent request in the `Authorization` header as `Authorization: Bearer <token>`. If the `token` field is not set, sending the `Authorization` header is not required.

The optional `issued_at` and `expires_in` fields can be used to verify the freshness of a token. If the fields are not present a client can assume that the token is valid for any subsequent request.

For a simple implementor this endpoint could just be a static file with `{}` as the content.
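
As a rough sketch of how a client could interpret this response, the snippet below applies the defaults described above. `ShardAccess` and `parse_token_response` are hypothetical names. The header name is kept in a constant and is assumed to be the standard HTTP `Authorization` header for bearer tokens. Timestamps are assumed to be RFC 3339 strings that carry a timezone, as in the example.

```python
import json
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Assumption: bearer tokens travel in the standard HTTP Authorization header.
AUTH_HEADER = "Authorization"


def _parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, truncating sub-microsecond digits."""
    value = value.replace("Z", "+00:00")
    # datetime only stores microseconds, so keep at most six fractional digits.
    value = re.sub(r"(\.\d{6})\d+", r"\1", value)
    return datetime.fromisoformat(value)


@dataclass
class ShardAccess:
    shard_base_url: str
    token: Optional[str]
    expires_at: Optional[datetime]

    def headers(self) -> Dict[str, str]:
        """Headers to attach to index and shard requests."""
        if self.token is None:
            return {}
        return {AUTH_HEADER: f"Bearer {self.token}"}

    def is_valid(self, now: Optional[datetime] = None) -> bool:
        """Without expiry information the token is assumed to stay valid."""
        now = now or datetime.now(timezone.utc)
        return self.expires_at is None or now < self.expires_at


def parse_token_response(body: str, channel: str, subdir: str) -> ShardAccess:
    data = json.loads(body)
    # Default to <channel>/<subdir> when no shard_base_url is given.
    base = data.get("shard_base_url", f"{channel.rstrip('/')}/{subdir}")
    expires_at = None
    if "issued_at" in data and "expires_in" in data:
        issued_at = _parse_timestamp(data["issued_at"])
        expires_at = issued_at + timedelta(seconds=data["expires_in"])
    return ShardAccess(base, data.get("token"), expires_at)
```

With a static `{}` response this yields no headers and a base URL of `<channel>/<subdir>`, so a client needs no special case for servers without token support.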
Member

As stated above, authnz is a transport layer thing and doesn't fit into this application layer metadata standard. This should be moved into an own CEP.


## Fetch process

To fetch all needed package records, the client should implement the following steps:

1. Acquire a token (see: [Authentication](#authentication)). Acquiring a token can be done lazily, so that a token is only requested when an actual network request is performed.
Member

Authzn isn't part of the repodata standard and should be evolved in a separate CEP.

2. Fetch the `repodata_shards.msgpack.zst` file. Standard HTTP caching semantics can be applied to this file.
3. For each package name, start fetching the shards for the corresponding hashes from the index file (for both the arch-specific and the `noarch` subdir).
Shards can be cached locally and because they are content-addressable no additional round-trips to the server are required to check freshness. The server should also mark these with an `immutable` `Cache-Control` header.
4. Parse the requirements of the fetched records and add the package names of the requirements to the set of packages to fetch.
5. Loop back to step 2 until there are no new package names to fetch.
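
The traversal in the steps above can be sketched as a simple worklist loop. This is an illustration under several assumptions, not a reference implementation. It takes an already decoded index for a single subdir as input and fetches sequentially rather than concurrently. It also extracts a dependency's package name by taking the text before the first space. `fetch_shard` is a hypothetical callable that downloads (or reads from cache) and decodes one shard.

```python
from typing import Callable, Dict, Iterable, Set


def package_name(match_spec: str) -> str:
    """Return the package name of a dependency string like 'python >=3.7.0'."""
    return match_spec.split()[0]


def fetch_records(
    roots: Iterable[str],
    index: Dict[str, bytes],
    fetch_shard: Callable[[bytes], dict],
) -> Dict[str, dict]:
    """Fetch every shard reachable from the root package names."""
    shards: Dict[str, dict] = {}
    pending: Set[str] = set(roots)
    while pending:
        name = pending.pop()
        if name in shards or name not in index:
            # Already fetched, or the channel has no package with this name
            # (for example a virtual package such as `__glibc`).
            continue
        shard = fetch_shard(index[name])
        shards[name] = shard
        # Queue the names of all dependencies of all records in the shard.
        for section in ("packages", "packages.conda"):
            for record in shard.get(section, {}).values():
                for dep in record.get("depends", []):
                    pending.add(package_name(dep))
    return shards
```

Each shard is fetched at most once, so the loop terminates once no new package names appear. A real client would likely issue the fetches for all pending names in parallel.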

## Garbage collection

To prevent the cache from growing indefinitely, we propose to implement a garbage collection mechanism that removes shards that have no entry in the index file. The server should keep old shards for a certain amount of time (e.g. 1 week) to allow clients with older shard-index data to fetch the previous versions.

On the client side, a garbage collection process should run every so often to remove old shards from the cache. This can be done by comparing the cached shards with the index file and removing those that are not referenced anymore.
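
The client-side comparison can be sketched as a set difference. `stale_shards` is a hypothetical helper; it accepts several indexes on the assumption that one cache directory may be shared by multiple subdirs or channels, and it only reports candidates, leaving the actual deletion to the caller.

```python
from typing import Dict, Iterable, Set


def stale_shards(cached: Iterable[str], indexes: Iterable[Dict[str, bytes]]) -> Set[str]:
    """Return cached shard hashes (hex) that no known index references any more.

    `cached` holds the hex digests of the shards present in the local cache;
    each index maps a package name to the raw digest of its current shard.
    """
    referenced = {digest.hex() for index in indexes for digest in index.values()}
    return set(cached) - referenced
```

For example, with shards `aa`, `bb` and `cc` in the cache and indexes that only reference `aa` and `cc`, the function returns `{"bb"}`.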


## Rejected ideas

### SHA hash compression

SHA hashes are incompressible because to a compressor they look like random data. We have investigated using a binary prefix tree to enable better compression, but this increased the complexity of the implementation quite a bit, which conflicts with our goal of keeping things simple.

### Shorter SHA hashes

Another approach would be to store only the first 100 bits or so of the SHA hash (a full SHA256 hash is only 256 bits, i.e. 32 bytes). This reduces the total size of the SHA hashes significantly, but it makes the client-side implementation more complex because hash collisions become an issue.

This also makes a future implementation based on an OCI registry harder because layers in an OCI registry are also referenced by SHA256 hash.

### Storing the data as a struct of arrays

To improve compression we investigated storing the index file as a struct of arrays instead of as an array of structs:

```json
[
  {
    "name": "",
    "hash": ""
  },
  {
    "name": "",
    "hash": ""
  }
]
```

vs

```json
{
"names": [...],
"hashes": [...]
}
```

This did yield slightly better compression, but we felt it would be slightly harder to implement and to adapt in the future, which we deemed not worth the small size decrease.
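
For readers who want to reproduce this kind of comparison, the two layouts can be converted into each other with a few lines. The sketch below is only an illustration: it uses `json` and `zlib` from the standard library as stand-ins for the msgpack and zstd encoding used in this proposal, so the absolute sizes it reports are not representative. All function names are hypothetical.

```python
import json
import zlib
from typing import Dict, List


def to_struct_of_arrays(entries: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Turn [{"name": ..., "hash": ...}, ...] into {"names": [...], "hashes": [...]}."""
    return {
        "names": [entry["name"] for entry in entries],
        "hashes": [entry["hash"] for entry in entries],
    }


def to_array_of_structs(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Inverse of `to_struct_of_arrays`."""
    return [
        {"name": name, "hash": digest}
        for name, digest in zip(columns["names"], columns["hashes"])
    ]


def compressed_size(obj) -> int:
    """Size in bytes of the compactly serialized and compressed object.

    json + zlib are stand-ins here, not the encoding the proposal uses.
    """
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(zlib.compress(raw))
```

Calling `compressed_size` on the same entries in both layouts gives a rough feel for the difference. The extra conversion step on every read and write is the implementation cost weighed against it.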

## Future improvements

### Remove redundant keys

`platform` and `arch` can be removed because these can be inferred from `subdir`.

### Integrating additional data

With the total size of the repodata reduced it becomes feasible to add additional fields directly to the repodata records. Examples are:

- add `purl` as a list of strings (Package URLs that reference the original source of the package) (See: https://github.com/conda-incubator/ceps/pull/63)
- add `run_exports` as a list of strings (run-exports needed to build the package) (See: https://github.com/conda-incubator/ceps/pull/51)
Comment on lines +246 to +247
Member

@jezdez jezdez May 7, 2024

How does this relate to CEP-15?

Contributor Author

I'm not sure what you mean? The `base_url` from CEP-15 is part of the index file.


### Update optimization

We could implement support for smaller index update files. This can be done by creating a rolling daily and weekly index update file that can be used instead of fetching the whole `repodata_shards.msgpack.zst` file. The update operation is very simple (just update the hashmap with the new entries).
Contributor

This could be done with a jlap-like rolling checksum / changelog file where the client would look for the hash of their current index, and json merge patch instead of the more complicated json patch.

Contributor

Yeah I like the idea of having a hash of the "expected" file. We don't need any proper "patching" as it should be as simple as

`old_repodata["shards"].update(new_repodata["shards"])`

(with some extra handling of null keys potentially).

We would also store a timestamp in the info field that we could use to evaluate whether we can use the daily or weekly update.

Contributor

"json merge patch" is similar to ".update()". We could simulate the size of the proposed file using https://github.com/dholth/conda-test-data, probably it is pretty small based on what I've observed about jlap.

Contributor

Yes, we are logging what percentage of shards changes every 6 hours in the indexing pipeline. It's usually < 1 percent so it's pretty small.

I think adding json-merge-patch isn't going to add anything over the proposed update of dictionary keys - but the principle is definitely the same. We could also define a hash, but I would argue that we should do a "content" hash (i.e. one that does not rely on the specific formatting of a "JSON" file). Although that might also be less of an issue since we're using msgpack for serialization in this proposal :)

Contributor

@dholth dholth May 10, 2024

msgpack may have more options for formatting than json, e.g. "serializers SHOULD use the format which represents the data in the smallest number of bytes." but does our implementation? JSON with sorted keys and no whitespace is predictable. But we won't think about implementing https://www.rfc-editor.org/rfc/rfc8785
An algorithm like merge-patch would handle any change in the index file not just individual shards.


For this we propose to add the following two files:

- `<channel>/<subdir>/repodata_shards_daily.msgpack.zst`
- `<channel>/<subdir>/repodata_shards_weekly.msgpack.zst`

They will use the same format as the `repodata_shards.msgpack.zst` file but only contain the packages that have been updated in the last day or week, respectively. `null` is used for keys that have been removed. The `created_at` field in the index file can be used to determine which file to fetch to make sure that the client has the latest information.
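
How a client might pick and apply such an update file is sketched below. This is a hedged illustration, not part of the proposal: the function names are hypothetical, `created_at` is assumed to be already parsed into a timezone-aware `datetime`, the one-day and one-week windows are assumed to match the rolling files exactly, and taking `info` from the update file is an assumption about how the timestamp would advance.

```python
from datetime import datetime, timedelta


def choose_index_file(cached_created_at: datetime, now: datetime) -> str:
    """Pick the smallest index file that still covers the client's cached state."""
    age = now - cached_created_at
    if age <= timedelta(days=1):
        return "repodata_shards_daily.msgpack.zst"
    if age <= timedelta(weeks=1):
        return "repodata_shards_weekly.msgpack.zst"
    return "repodata_shards.msgpack.zst"  # too old: fetch the full index


def apply_index_update(index: dict, update: dict) -> dict:
    """Return a new index with `update` merged in; `None` values remove a key."""
    shards = dict(index.get("shards", {}))
    for name, digest in update.get("shards", {}).items():
        if digest is None:
            shards.pop(name, None)  # package was removed from the channel
        else:
            shards[name] = digest
    merged = dict(index)
    merged["shards"] = shards
    # Assumption: the update file carries the newer `info` (incl. `created_at`).
    merged["info"] = update.get("info", index.get("info"))
    return merged
```

The merge returns a new dictionary instead of mutating the cached index, so a failed or partial update does not corrupt the client's existing state.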

### Store `set(dependencies)` at the start of the shards or in a header

To reduce the time it takes to parse a shard and start fetching its dependencies, we could also store the set of all dependencies at the start of the shard or in a separate header. This could enable fetching recursive dependencies while still parsing the records.