
CEP: sharded repodata #75

Merged
merged 15 commits into from Jul 22, 2024
Conversation

@baszalmstra (Contributor) commented Apr 30, 2024

We propose a new solution to repodata downloading aimed at drastically reducing the time and memory required.

We implemented this proposal in rattler and created a mirror for popular channels containing the new files, updated at a fixed interval.

📝 Rendered

Preliminary results

Requesting repodata records for a specific set of packages for linux-64 and noarch, measured on a machine with a 200 Mbit internet connection.

| requested packages        | records | fresh | cached | JLAP  | sharded (cold) | sharded (hot) |
| ------------------------- | ------- | ----- | ------ | ----- | -------------- | ------------- |
| python + boto3 + requests | 6969    | 2.4 s | 0.58 s | 8.2 s | 0.34 s         | 0.015 s       |
| jupyterlab + detectron2   | 35524   | 2.4 s | 0.63 s | 8.7 s | 0.7 s          | 0.05 s        |
| rubin-env                 | 84254   | 2.9 s | 0.80 s | 8.8 s | 1 s            | 0.13 s        |

Note that with this approach the cache is not all-or-nothing, so even if the shard index is updated, a user will most likely still have a relatively hot cache.

@wolfv (Contributor) commented Apr 30, 2024

Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

cep-16.md Outdated

We propose a new "repodata" format that can be sparsely fetched. That means, generally, smaller fetches (only fetch what you need) and faster updates of existing repodata (only fetch what has changed).

We also change the encoding from JSON to MSGPACK for faster decoding.
Contributor:

Could this be considered an implementation detail? It imposes an additional dependency on implementors, but I think this should be mostly about the sharding strategy, the shard path locations, and the expected schema. Maybe we can standardize things like `<path>/<filename>.<encoding>.<compression>`. Or at least standardize that a JSON counterpart is also available.

Contributor Author:

I'm open to any format and compression but I do think it's important to only provide a single format.

  • The files are content-addressed so we would have to add multiple hashes for different file formats to the index.
  • It makes it easier for the server/indexer if there is just one file.
  • It makes the client simpler because it doesn't have to query which variants are available.

We chose msgpack over other formats (like protobuf or capnproto) exactly because it has a lot of great implementations for a large variety of languages and still produces small files.

Contributor:

Ah, I see, the hashes are for the compressed blobs? I had assumed we would hash the raw underlying content.

Do you have any numbers comparing JSON and msgpack?

Contributor:

Initially we hashed the underlying data. Unfortunately, that did not map well to the OCI registry, so we decided to use the hash of the compressed blob.

@baszalmstra (Contributor Author) Apr 30, 2024:

And here are some basic numbers.

I generated the repodata shard index for conda-forge's noarch subdir in both JSON and msgpack format, and took the largest shard (conda-forge-pinning) for comparison as well.

| which                       | size compressed | size uncompressed | deserialization time |
| --------------------------- | --------------- | ----------------- | -------------------- |
| json shard index            | 705 KB          | 1404 KB           | 4.8 ms               |
| msgpack shard index         | 666 KB (94%)    | 788 KB (56%)      | 3.2 ms (66%)         |
| json shard                  | 298 KB          | 1926 KB           | 12.8 ms              |
| msgpack shard               | 313 KB (105%)   | 1567 KB (81%)     | 10.2 ms (83%)        |
| msgpack shard (byte hashes) | 290 KB (97%)    | 1357 KB (70%)     | 8.8 ms (68%)         |

In terms of size, both JSON and msgpack files end up roughly the same. The biggest win is in deserialization time, which also takes up a significant amount of time in real-world examples.

Edit: If I change the format of the hashes in the shards from hex string to bytes, the msgpack files are always smaller (290 KB for the conda-forge-pinning case).

Edit 2: Updated table to include msgpack with byte hashes.
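For reference, a minimal illustration of the hex-to-bytes change (the values are placeholders):

```python
# A sha256 hex digest is 64 characters; the raw digest is 32 bytes.
hex_hash = "aa" * 32                 # placeholder for a real sha256 hex digest
raw = bytes.fromhex(hex_hash)        # this is what gets stored in the msgpack shard
assert raw.hex() == hex_hash         # round-trips back to hex when needed
```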

Contributor Author:

This is the size of all shards of the noarch conda-forge channel (with the above edit applied):

|                      | size    |
| -------------------- | ------- |
| repodata.json.zst    | 13.5 MB |
| shards/*.json.zst    | 18 MB   |
| shards/*.msgpack.zst | 17.1 MB |


Contributor Author:

Have you drawn any conclusions from this that we should incorporate @dholth ?

@wolfv (Contributor) commented May 1, 2024

One thing that just came to mind is that we might want to add some language that says that implementations should ignore any unknown keys. That will enable us to add run_exports, purls, and other keys to the repodata record without breaking old versions of the tools.
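A minimal sketch of what that language implies for a client, assuming a hypothetical set of known record fields (the real field set comes from the repodata schema):

```python
KNOWN_KEYS = {"name", "version", "build", "build_number", "depends", "sha256"}

def parse_record(raw: dict) -> dict:
    # Keep only the keys we understand and silently ignore everything else,
    # so future additions (run_exports, purls, ...) don't break older clients.
    return {key: value for key, value in raw.items() if key in KNOWN_KEYS}
```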

@wolfv (Contributor) commented May 1, 2024

One other thing:

`msgpack2json -d -b -i /Users/wolfv/Library/Caches/rattler/cache/repodata/shards-v1/7336fbc738626810350ff13c195012deb91b794840c81f59ffd8a74559102ec3.msg` works nowadays for looking at the shards in JSON :)

@jezdez (Member) commented May 1, 2024

> Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration. "Simpler" in what areas? Do you mean "faster"?

In my experience, "simplicity" is easier said than done (read: is hard) and, given that this introduces a whole new format, uses a different encoding type, and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's... a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

@wolfv (Contributor) commented May 2, 2024

> Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

> Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration. "Simpler" in what areas? Do you mean "faster"?

Simpler in that no "patching" is needed (you get the real repodata right away). There is also no "state" that needs to be kept - just update the index file and make sure that all the shards are there, which is really straightforward.

It should be faster, too, because, again, we don't need the whole index in the first place. The download is only ~1 MB (for linux + noarch). I can download all shards for jupyterlab in 350 ms.

> In my experience, "simplicity" is easier said than done (read: is hard) and, given that this introduces a whole new format, uses a different encoding type, and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's... a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

We're not at all focused on a non-standard hosting type; we just want to make sure that it works 100% fine there. In fact, we host our "repodata-shards" on a traditional S3-style bucket, but a regular file server would work just as well. One just needs to add the repodata_shards.msgpack.zst file and all the shards under /shards/....

I don't think there is anything to worry about with regards to the implementation in conda. We are using msgpack-python from Python to create the index (the API is the same as json.load/json.dumps).
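For illustration, a minimal sketch of producing shards and the index with msgpack-python and zstandard; the schema keys and file layout follow this CEP's description, but treat the details as illustrative rather than normative:

```python
import hashlib
import os

import msgpack
import zstandard

# Illustrative repodata records for a single package name (schema abbreviated).
shards = {
    "python": {
        "packages.conda": {
            "python-3.12.0-h123456_0.conda": {"name": "python", "version": "3.12.0"},
        },
    },
}

os.makedirs("shards", exist_ok=True)
index = {"version": 1, "shards": {}}
for name, records in shards.items():
    blob = zstandard.ZstdCompressor(level=19).compress(msgpack.packb(records))
    digest = hashlib.sha256(blob).digest()  # hash of the *compressed* blob
    index["shards"][name] = digest          # raw bytes rather than hex, to keep the index small
    with open(f"shards/{digest.hex()}.msgpack.zst", "wb") as f:
        f.write(blob)

with open("repodata_shards.msgpack.zst", "wb") as f:
    f.write(zstandard.ZstdCompressor().compress(msgpack.packb(index)))
```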

> I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

You could do it the same way as we do, and host an alternative index on a different subdomain for a testing period? What we do is: our GitHub Action ingests the existing repodata, creates shards, and stores the (new) shards on the bucket. It's all open source (in the linked repository). So you can try it out today! :)

With the base_url in the repodata_shards.msgpack.zst file we link back to the "original" conda-forge, by the way (so that packages are downloaded from the official conda.anaconda.org).

@wolfv (Contributor) commented May 4, 2024

I just tested the sharded repodata vs. regular repodata on a slow internet connection. It took 15 s for noarch + osx-arm64 in the traditional case, and 900 ms for the sharded repodata. Very nice!

@wolfv (Contributor) commented May 6, 2024

I also wanted to point out that this is an infrastructure improvement that should dramatically reduce indexing times and thereby hopefully pave the way to an almost-realtime experience when publishing new packages, especially once we start using the daily and weekly files.

The changes will be much smaller. I hope you guys are also excited about that prospect :) @jezdez

@wolfv (Contributor) commented May 6, 2024

And if you want it really simple + want to join forces, you can use the Gateway class from py-rattler:

https://github.com/baszalmstra/rattler/blob/e570cc74e8bce76606e3246926af723ae68461d2/py-rattler/rattler/repo_data/gateway.py#L65-L81

1. **Speed**: Fetching repodata MUST be very fast. Both in the hot- and cold-cache case.
2. **Easy to update**: The channel MUST be very easy to update when new packages become available.
3. **CDN friendly**: A CDN MUST be usable to cache the majority of the data. This reduces the operating cost of a channel.
4. **Support authN and authZ**: It MUST be possible to implement authentication and authorization with little extra overhead.
Member:

That seems like an odd duck; authorization shouldn't be attached to the repodata format, as it's part of the transport layer, not the application layer.

@baszalmstra (Contributor Author) May 7, 2024:

I'm not sure why this CEP cannot make changes to both? Authenticating requests can add serious overhead to the server-side implementation, which in turn affects performance on the client. Having a way to effectively "cache" the authentication drastically reduces the work on the server, which, given the sheer number of requests introduced by this CEP, I think makes sense to address as well. Note that the approach taken here is very similar to how OCI registries operate.

Contributor:

I would also argue that we're proposing more of a transport protocol here where the authentication fits in pretty well as a key part of the whole thing.

Contributor:

Lastly, you can also fall back to other (more traditional) means by not replying with a token on the endpoint (depending on how your authentication works, it might be slower).

@xhochy:

I also feel that authorization should be split off. You can do this in multiple ways that are independent of how shards are actually implemented.

Contributor Author:

I removed authentication from the CEP and marked it as something we should investigate in another CEP. I left it in the design goals because I still think we shouldn't implement something that would rule it out.


Additionally, an index file stores the mapping from package name to shard hash.

Although not explicitly required, the server SHOULD support HTTP/2 to reduce the overhead of doing a massive number of requests.
Member:

Given the sheer number of files, I think realistically this is a MUST.

Contributor Author:

We have been running this with HTTP/1 by accident for a while; it's definitely slower but still very acceptable. I left this as a SHOULD because everything still works perfectly fine without HTTP/2.

Contributor:

Doesn't pip do something similar with the simple index, requesting one file per package name?

Contributor Author:

Yes, there is one index per package name, e.g. https://pypi.org/simple/pinject.

But this index does not store any dependency information for any of the artifacts. An additional request is required PER artifact to get the dependency information.

In this proposal we add the dependency information of individual packages directly to the per-package-name shard.

Contributor:

pip is regarded as fast even though it does not use HTTP/2 and has a similar access pattern to what this CEP would imply. You might keep a few pipelined HTTP/1 connections open per repodata server to grab everything.

@xhochy:

I would also expect that HTTP/1 already gives a reasonable improvement. Not all corporate proxies (but most) support HTTP/2 yet, and thus I'm always happy to have enhancements that can fall back to HTTP/1.

Contributor:

I tried doing an HTTP/1 fetch of all ~10k files from the prefix test server using 20 threads (and a pool of 10 connections at once) with Python requests. It takes about 20 seconds to fetch everything with a hot cache, and the spec is designed for us to fetch much less than everything.


For a large case (conda-forge linux-64) this file is 670 kB at the time of writing.

We suggest serving the file with a short-lived `Cache-Control` `max-age` header of 60 seconds to an hour, but we leave it up to the channel administrator to set a value that works for that channel.
Member:

Why do you suggest 60 seconds TTL?

Contributor Author:

I feel like this provides a sane tradeoff between limiting the number of requests to the server and reducing the time it takes to get new packages. But TBH it's quite arbitrary.


Individual shards are stored under the URL `<shard_base_url>/<subdir>/shards/<sha256>.msgpack.zst`. Where the `sha256` is the lower-case hex representation of the bytes from the index. It is a zstandard compressed msgpack file that contains the metadata of the package. The `<shard_base_url>` is defined in [Authentication](#authentication).

The files are content-addressable which makes them ideal to be served through a CDN. They SHOULD be served with `Cache-Control: immutable` header.
Member:

Why SHOULD and not MUST?

Contributor Author:

An implementation doesn't care about Cache-Control headers; it just uses the content address. However, for CDNs, having the immutable header ensures that the content is cached on the edge for as long as possible. But it is not a requirement.
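To make the content-addressing concrete, a hedged sketch of a client fetching and verifying one shard (the helper name is hypothetical; the URL layout follows the quoted spec text; error handling omitted):

```python
import hashlib
import urllib.request

import msgpack
import zstandard

def fetch_shard(shard_base_url: str, subdir: str, sha256: bytes) -> dict:
    # The filename is the lowercase hex of the hash taken from the index.
    url = f"{shard_base_url}{subdir}/shards/{sha256.hex()}.msgpack.zst"
    blob = urllib.request.urlopen(url).read()
    # Verify the blob against its content address; no trust in HTTP caching needed.
    if hashlib.sha256(blob).digest() != sha256:
        raise ValueError(f"hash mismatch for {url}")
    return msgpack.unpackb(zstandard.ZstdDecompressor().decompress(blob))
```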

Contributor:

Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

@xhochy:

> Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

Most permission-based solutions I have encountered so far (I have not seen Anaconda Enterprise) manage permissions on the basis of whole channels. That stays compatible with the current approach and was also compatible with the old approach.

For giving access to a subset of packages in a channel, you would need to serve copies of repodata.json depending on the user. As the new approach (new = sharding) would still also serve the old repodata.json, I think this can only be supported with a backend that checks permissions on a per-request basis and doesn't do authentication purely based on the URL.

Comment:

Another form of permission for content-addressed blob store systems I've seen is to store some additional tag-style header attributes with the content blob and use those to drive authz. GCP and AWS both support setting this kind of metadata, and you can perform the auth part with a proxy in front of the blob store.

The `sha256` and `md5` fields from the original repodata are converted from their hex representation to bytes.
This is done to reduce the overall file size of the shards.

Implementers SHOULD ignore unknown keys; this allows adding additional keys to the format in the future without breaking old versions of tools.
Member:

CEP 15 adds a repodata_version field, incrementing it to 2 when a base_url is present. Since you're willing to update a number of fields, this should be incremented to 3, so clients don't have to implement forever-backwards-compatible repodata loaders.

Contributor Author:

The downside of that, though, is that a client that doesn't support version 3 will not be able to parse the repodata, effectively breaking it. Adding keys is forward-compatible as long as the client is not strict about unknown keys.

Contributor Author:

Or did I misunderstand and is the repodata_version field forward-compatible?

@dholth (Contributor) May 8, 2024:

We know that optional fields like "binstar" from anaconda.org non-CDN channels exist. Or whatever happens to be in the package's index.json goes into repodata.

But fetching files from the wrong url when base_url is present is obviously broken.

Should the index shard use repodata_version instead of "version": 1?

It would be good form to have an algorithm for converting between repodata.json and shards in both directions.

cep-16.md Outdated

Implementers SHOULD ignore unknown keys; this allows adding additional keys to the format in the future without breaking old versions of tools.

Although these files can become relatively large (100s of kilobytes), typically, even for a large case (conda-forge), these files remain very small, e.g. 100s of bytes to a couple of kilobytes.
Member:

❤️

Contributor:

What if we allow a shard to contain the metadata of more than one package, so that multiple package names could point to the same shard? It is a nuisance to download many 300-byte shards.

Contributor:

Yeah, we could, but it comes with some downsides:

  • the cache might not be as efficient, as multiple packages might change at different frequencies and the cache becomes invalid more often
  • it's more complicated in the implementation, as packages don't map 1:1 to shards, but that would not be too bad.

We could keep the current repodata_shards index structure and use the same hash for multiple package names (that would probably compress decently).

However, I am not sure if the tradeoff is worth it.

Also it will start to look a bit like zchunk then :)

Contributor:

I tried using a buffer to combine sequential (ordered by package name) shards when the combined size is not > 8192 bytes compressed. So we still wind up with plenty of ~300 byte single package shards if it sees (large shard, small shard, large shard) but it's all in order.

That scheme gives 7987 files and 10883 package names on a conda-forge/linux-64 snapshot, probably taking 75% of the time to download if we happen to want to download everything. Unlike zchunk we have the option of not maintaining a complete local copy of the data.

The most package names combined into a single shard are these fifteen:

['perl-bit-vector-7.4-pl5321h166bdaf_0.tar.bz2',
 'perl-capture-tiny-0.48-pl5321ha770c72_1.tar.bz2',
 'perl-carp-assert-0.21-pl526_0.tar.bz2',
 'perl-class-method-modifiers-2.13-pl5321ha770c72_0.tar.bz2',
 'perl-clone-0.46-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-bzip2-2.201-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-zlib-2.202-pl5321h166bdaf_0.tar.bz2',
 'perl-data-dumper-2.183-pl5321h166bdaf_0.tar.bz2',
 'perl-b-hooks-endofscope-0.26-pl5321ha770c72_0.conda',
 'perl-b-hooks-endofscope-0.28-pl5321ha770c72_0.conda',
 'perl-class-data-inheritable-0.09-pl5321ha770c72_0.conda',
 'perl-class-load-0.25-pl5321ha770c72_0.conda',
 'perl-class-load-xs-0.10-pl5321h0b41bf4_0.conda',
 'perl-class-method-modifiers-2.14-pl5321ha770c72_0.conda',
 'perl-class-method-modifiers-2.15-pl5321ha770c72_0.conda',
 'perl-cpan-meta-check-0.014-pl5321ha770c72_0.conda',
 'perl-data-dumper-2.183-pl5321hd590300_0.conda',
 'perl-data-dumper-concise-2.023-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.112-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.113-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.114-pl5321ha770c72_0.conda']
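For concreteness, a greedy sketch of the combining scheme described above (recompressing at each step is wasteful but keeps the idea clear; the 8192-byte threshold is taken from the comment):

```python
import zstandard

MAX_COMBINED = 8192  # compressed-size threshold, per the comment above

def combine_shards(shards: list[tuple[str, bytes]]) -> list[list[str]]:
    """Greedily merge consecutive (package name, raw shard bytes) pairs, ordered
    by package name, while the combined compressed size stays <= MAX_COMBINED."""
    cctx = zstandard.ZstdCompressor()
    groups: list[list[str]] = []
    current: list[str] = []
    buffer = b""
    for name, raw in sorted(shards):
        if current and len(cctx.compress(buffer + raw)) > MAX_COMBINED:
            groups.append(current)  # flush: adding this shard would overflow
            current, buffer = [], b""
        current.append(name)
        buffer += raw
    if current:
        groups.append(current)
    return groups
```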

Contributor:

I am wondering: if you need everything, wouldn't you rather download the repodata.json file instead? I think that wasn't the use case we optimized for.

Contributor:

I'm imagining a situation in which there is only repodata_shards. Suppose we are in the middle of processing a bunch of queries on the SubdirData and suddenly we encounter a query for "any package name, md5=x". Are we more likely to reject the query, download everything, or eject to downloading an alternate format?

Contributor:

Why not deprecate it, and grab the repodata.json in the meantime? Tell people to use proper lockfiles? :)

Contributor:

What if we simplified the implementation, and included one repodata shard with all the packages in it, and an index that pointed to that singular shard 10k times 🤣

cep-16.md Outdated
Comment on lines 157 to 174
To facilitate authentication and authorization we propose to add an additional endpoint at `<channel>/<subdir>/token` with the following content:

```json
{
  "shard_base_url": "https://shards.prefix.dev/conda-forge/<subdir>/",
  "token": "<bearer token>",
  "issued_at": "2024-01-30T03:35:39.896023447Z",
  "expires_in": 300
}
```

`shard_base_url` is an optional url to use as the base url for the `repodata_shards.msgpack.zst` and the individual shards. If the field is not specified it should default to `<channel>/<subdir>`.

`token` is an optional field that, if set, MUST be added to any subsequent request in the `Authentication` header as `Authentication: Bearer <token>`. If the `token` field is not set, sending the `Authentication` header is also not required.

The optional `issued_at` and `expires_in` fields can be used to verify the freshness of a token. If the fields are not present a client can assume that the token is valid for any subsequent request.

For a simple implementor this endpoint could just be a static file with `{}` as the content.
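A minimal client-side sketch of this (later removed) token flow, following the quoted draft's field and header names; treat it as illustrative:

```python
import json
import urllib.request

def get_token(channel: str, subdir: str) -> dict:
    # A simple implementor can serve a static `{}` here.
    with urllib.request.urlopen(f"{channel}/{subdir}/token") as response:
        return json.load(response)

def shard_base_url(token_info: dict, channel: str, subdir: str) -> str:
    # Defaults to <channel>/<subdir> when shard_base_url is absent.
    return token_info.get("shard_base_url", f"{channel}/{subdir}/")

def shard_headers(token_info: dict) -> dict:
    # Only send the header when a token was actually returned.
    token = token_info.get("token")
    return {"Authentication": f"Bearer {token}"} if token else {}
```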
Member:

As stated above, authn/z is a transport-layer thing and doesn't fit into this application-layer metadata standard. This should be moved into its own CEP.

cep-16.md Outdated

To fetch all needed package records, the client should implement the following steps:

1. Acquire a token (see: [Authentication](#authentication)). Acquiring a token can be done lazily, so as to only request a token when an actual network request is performed.
Member:

Authn/z isn't part of the repodata standard and should be evolved in a separate CEP.

Comment on lines +244 to +245
- add `purl` as a list of strings (Package URLs to reference to original source of the package) (See: https://github.com/conda-incubator/ceps/pull/63)
- add `run_exports` as a list of strings (run-exports needed to build the package) (See: https://github.com/conda-incubator/ceps/pull/51)
@jezdez (Member) May 7, 2024:

How does this relate to CEP-15?

Contributor Author:

I'm not sure what you mean? The base_url from CEP-15 is part of the index file.

Co-authored-by: Jannis Leidel <[email protected]>
@baszalmstra requested a review from jaimergp, May 7, 2024 15:08

JLAP also does not save anything with a cold cache because the initial repodata still has to be downloaded. This is often the case for CI runners.

Finally, the implementation of JLAP is quite complex, which makes it hard to adopt for implementers.
Contributor:

IMO the complexity is the least fair criticism of JLAP; it can yield a small implementation.

Contributor Author:

IMHO, in theory it's pretty straightforward, but to integrate it properly there are a number of additional complexities that make this statement true. To name a few: it requires maintaining extra state on disk and HTTP range requests; we need to make sure that the repodata.json is still the previously stored repodata; indexing requires the previous state; and you need an additional overlay file to make it perform optimally.

But in fairness, this is just my personal experience while implementing it.

@wolfv (Contributor) May 9, 2024:

@dholth I think we can claim that we are the only ones shipping JLAP in production, and it's been pretty rough, unfortunately (we observe a waiting progress bar for multiple seconds while patches are applied and the file is serialized). We could improve this with the two-file approach, but I think we make a good point about why this proposal is simpler.

Also, one thing we gain with our proposal here is complete hash integrity, something that is not possible with the JLAP two-file approach. We want to enable signed repodata and packages (TUF-style) sooner or later, and I don't see a way to do that with JLAP's two-file approach.

With our approach, we can at any time prove the integrity of the whole thing.

Co-authored-by: Daniel Holth <[email protected]>
Co-authored-by: Bas Zalmstra <[email protected]>

### Update optimization

We could implement support for smaller index update files. This can be done by creating rolling daily and weekly index update files that can be used instead of fetching the whole `repodata_shards.msgpack.zst` file. The update operation is very simple (just update the hashmap with the new entries).
Contributor:

This could be done with a JLAP-like rolling checksum / changelog file where the client would look for the hash of their current index, using JSON merge patch instead of the more complicated JSON patch.

Contributor:

Yeah, I like the idea of having a hash of the "expected" file. We don't need any proper "patching" as it should be as simple as

`old_repodata["shards"].update(new_repodata["shards"])`

(with some extra handling of null keys potentially).

We would also store a timestamp in the info field that we could use to evaluate whether we can use the daily or weekly update.
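A minimal sketch of that update step, assuming a `null`/`None` value marks a removed package (field names as in the index described above, so treat them as illustrative):

```python
def apply_index_update(index: dict, update: dict) -> None:
    for name, shard_hash in update["shards"].items():
        if shard_hash is None:
            index["shards"].pop(name, None)  # package was removed from the channel
        else:
            index["shards"][name] = shard_hash
```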

Contributor:

"json merge patch" is similar to ".update()". We could simulate the size of the proposed file using https://github.com/dholth/conda-test-data, probably it is pretty small based on what I've observed about jlap.

Contributor:

Yes, we are logging what percentage of shards changes every 6 hours in the indexing pipeline. It's usually < 1 percent, so it's pretty small.

I think adding json-merge-patch isn't going to add anything over the proposed update of dictionary keys, but the principle is definitely the same. We could also define a hash, but I would argue that we should use a "content" hash (i.e. one that does not rely on the specific formatting of a JSON file), although that might be less of an issue since we're using msgpack for serialization in this proposal :)

@dholth (Contributor) May 10, 2024:

msgpack may have more options for formatting than JSON, e.g. "serializers SHOULD use the format which represents the data in the smallest number of bytes", but does our implementation? JSON with sorted keys and no whitespace is predictable, but we won't think about implementing https://www.rfc-editor.org/rfc/rfc8785.

An algorithm like merge-patch would handle any change in the index file, not just individual shards.

@xhochy left a review:

I really like the sharding approach, but I'm very heavily -1 on mixing authentication into it. This will conflict with existing implementations and thus make the feature not usable there.


cep-16.md Outdated


## <a id="authentication"></a>Authentication
@xhochy:

I'm -1 on adding this specific bit of authentication to the sharding proposal. We should be able to implement the sharding approach independently of the authentication mechanism.

I don't see how we would be able to provide such an endpoint in some setups. This ties the CEP to a certain kind of authentication and removes the flexibility that we currently have in also using HTTP Basic Auth instead of token-based auth.

Contributor Author:

I am open to opening another CEP specifically for authentication of the sharded index but let me first explain the reasoning behind why we added this.

The reason we added this is that we have noticed that supporting generic authentication methods adds very significant overhead to fetching the individual shards, and a large infrastructure cost.

Take as an example our prefix.dev backend. When requesting private repodata, a user requests https://prefix.dev/mychannel/win-64/repodata.json; our backend server authenticates the request using either a conda token, basic auth, or an API key, and redirects the request to a presigned URL. If we did something similar for the shards, each request would have to go through this dynamic endpoint. This adds additional delay to the request and incurs an infrastructure cost due to the required compute. And we are talking about hundreds to thousands of requests when a user fetches repodata. We have noticed that this significantly impacts performance.

Ideally, we want to have a solution where the "presigning" happens just once and we can reuse this information on subsequent requests.

I am open to alternative solutions, but I think being able to authenticate once for all subsequent requests is very important to how effective this proposal is. Standardizing this method is also important to ensure that every client will be able to access different server implementations.

Then, with regards to our specific solution using the token endpoint: the token endpoint only really needs to be dynamic when the authentication mechanism used is dynamic. In any other case, the token endpoint could just be an empty static file. Basic authentication or even conda-token-based authentication can still be applied to both the individual shards and the shard index.

As an alternative though, we could also look at returning a set of headers from the initial request to the shard index instead of requesting a separate file.

FYI @jezdez

@wolfv (Contributor) May 13, 2024:

I think what we are not making clear in the CEP is that "traditional" auth methods should continue to work just as well. For example, if the server uses basic HTTP authentication or conda token authentication, this should continue to work (and depending on the server implementation should also not negatively affect performance).

We are just interested in adding a mechanism to the protocol in order to facilitate really fast performance for large public channels.

As Bas mentioned, I think we can make this an "add-on" behavior when the request to repodata_shards.msgpack.zst returns specific header values such as:

```
X-Shard-BaseUrl: https://fast.prefix.dev/conda-forge/
X-Shard-Token: pfx_secret_token
X-Shard-Token-ExpiresAt: TIMESTAMP
```

The client would then use these values for subsequent requests. If these headers are not returned, it will just use the "regular" URL with regular authentication.

@xhochy:

With these comments, I understand the need for an efficient implementation. Still, since it is part of this proposal, it seems necessary to implement it to support sharding fully. We will have several places where this pre-signed authentication will not work (e.g., when using an HTTP caching proxy or Artifactory's implementation). I would love to use sharding in these places at the cost of the additional authentication checks.

@xhochy commented May 13, 2024

Real-life note: This will probably work mostly out of the box with the existing conda support in Artifactory, except that the shard index would, for now, be cached indefinitely there (i.e. that is something JFrog would need to implement). It took them ~2 months with repodata.json.zst (and then add another 6 months for corporate IT to update their Artifactory version).


## Proposal

We propose a "sharded" repodata format. It works by splitting the repodata into multiple files (one per package name) and recursively fetching the "shards".
@aovasylenko May 29, 2024:

This reminds me of the current simple PyPI index (mentioned below); if the whole Python ecosystem of 500K+ packages were hosted this way, it would lead to that number of shards.
Were there any ideas about evaluating a smaller number of shards, for example by having a prefix-based rule or range?
My worry is that, depending on the server-side implementation, the initial pull of this information will generate a massive number of requests (even with HTTP/2), and at the API/DB level each request could be processed separately.

Contributor:

We haven't really discussed different sharding strategies. However, our implementation uses a CDN + static files, and that seems to work relatively well. It also seems to work well enough for PyPI and crates.io, so I don't think we're particularly concerned.
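To tie this back to the quoted proposal text, a hedged sketch of the recursive fetch it describes, reusing the hypothetical `fetch_shard()` helper sketched earlier and assuming classic repodata-style `packages.conda`/`depends` keys:

```python
def fetch_records(index: dict, base_url: str, subdir: str, roots: set[str]) -> dict:
    """Fetch shards for the requested package names, then for any dependency
    names discovered inside each fetched shard, until the closure is complete."""
    seen: set[str] = set()
    pending = set(roots)
    records: dict = {}
    while pending:
        name = pending.pop()
        seen.add(name)
        shard = fetch_shard(base_url, subdir, index["shards"][name])
        records[name] = shard
        for record in shard.get("packages.conda", {}).values():
            for dep in record.get("depends", []):
                dep_name = dep.split(" ", 1)[0]  # a match spec starts with the name
                if dep_name not in seen and dep_name in index["shards"]:
                    pending.add(dep_name)
    return records
```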

@baszalmstra (Contributor Author):

@jezdez @xhochy @dholth @wolfv I updated the CEP to:

  • Add shards_base_url
  • Remove authentication flow from the CEP (we will address this in a future CEP).

Please give it another read, I would like to put this to a vote soon.

@xhochy left a review:

Fine with me now.

@jezdez (Member) commented Jul 2, 2024

@conda/steering-council

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy; please vote and/or comment on this proposal at your earliest convenience.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave yes, no or abstain as comments below.

If you have questions concerning the proposal, you may also leave a comment or code review.

This vote will end on 2024-07-16, End of Day, Anywhere on Earth (AoE). This is an extended voting period due to summer holiday time in the Northern Hemisphere.

@jezdez jezdez added the vote Voting following governance policy label Jul 2, 2024
@baszalmstra (Contributor Author):

yes

@wolfv (Contributor) commented Jul 2, 2024

Please use the following form to vote!

@xhochy (Uwe Korn)

  • yes
  • no
  • abstain

@CJ-Wright (Christopher J. 'CJ' Wright)

  • yes
  • no
  • abstain

@mariusvniekerk (Marius van Niekerk)

  • yes
  • no
  • abstain

@goanpeca (Gonzalo Peña-Castellanos)

  • yes
  • no
  • abstain

@chenghlee (Cheng H. Lee)

  • yes
  • no
  • abstain

@ocefpaf (Filipe Fernandes)

  • yes
  • no
  • abstain

@marcelotrevisani (Marcelo Duarte Trevisani)

  • yes
  • no
  • abstain

@msarahan (Michael Sarahan)

  • yes
  • no
  • abstain

@mbargull (Marcel Bargull)

  • yes
  • no
  • abstain

@jakirkham (John Kirkham)

  • yes
  • no
  • abstain

@jezdez (Jannis Leidel)

  • yes
  • no
  • abstain

@wolfv (Wolf Vollprecht)

  • yes
  • no
  • abstain

@jaimergp (Jaime Rodríguez-Guerra)

  • yes
  • no
  • abstain

@kkraus14 (Keith Kraus)

  • yes
  • no
  • abstain

@baszalmstra (Bas Zalmstra)

  • yes
  • no
  • abstain

@jaimergp (Contributor) left a review:

Just a few typos I found while reading through the whole thing.

@jakirkham (Member):

Weird. Had checked the box above a few days ago and it appears to have disappeared. Rechecked it

@wolfv (Contributor) commented Jul 16, 2024

@jakirkham strange indeed. I can see your edit in the edit history, but it didn't seem to "tick" the box...

@wolfv (Contributor) commented Jul 16, 2024

@chenghlee @CJ-Wright @mbargull last chance to vote :)

@mbargull (Member):

I am thankful that someone is working on the repodata implementation, since the current one (one big blob per channel/subdir) became unwieldy a good while ago (at least in resource-constrained environments).
As far as I can tell, this proposal is a good starting point but still needs refinement (which I expect to get clearer once more people test it out in rattler and conda-index, etc.).
Unfortunately, I was not able to take the time to do a proper review and consult in the refinement process here, and as such opted to abstain.

@wolfv (Contributor) commented Jul 22, 2024

This vote was closed, with the following result

Total voters: 15 (valid: 93.33%)

Yes votes (12 / 85.71%):

No votes (0 / 0.00%):

Abstain votes (2 / 14.29%):

Not voted (1):

Invalid votes (0):

We reached quorum, and enough YES votes to accept the proposal! 🎉
