
CEP: sharded repodata #75

Merged
merged 15 commits into from Jul 22, 2024
Conversation

@baszalmstra (Contributor) commented Apr 30, 2024

We propose a new solution to repodata downloading aimed at drastically reducing the time and memory required.

We implemented this proposal in rattler and created a mirror for popular channels containing the new files, updated at a fixed interval.

📝 Rendered

Preliminary results

Requesting repodata records for a specific set of packages for linux-64 and noarch, measured on a machine with a 200 Mbit internet connection.

| requested packages        | records | fresh | cached | JLAP  | sharded (cold) | sharded (hot) |
| ------------------------- | ------- | ----- | ------ | ----- | -------------- | ------------- |
| python + boto3 + requests | 6969    | 2.4 s | 0.58 s | 8.2 s | 0.34 s         | 0.015 s       |
| jupyterlab + detectron2   | 35524   | 2.4 s | 0.63 s | 8.7 s | 0.7 s          | 0.05 s        |
| rubin-env                 | 84254   | 2.9 s | 0.80 s | 8.8 s | 1 s            | 0.13 s        |

Note that with this approach the cache is not all-or-nothing, so even if the shard index is updated, a user will most likely still have a relatively hot cache.

@wolfv (Contributor) commented Apr 30, 2024

Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

cep-16.md Outdated

We propose a new "repodata" format that can be sparsely fetched. That means, generally, smaller fetches (only fetch what you need) and faster updates of existing repodata (only fetch what has changed).

We also change the encoding from JSON to MSGPACK for faster decoding.
Contributor:

Could this be considered an implementation detail? It imposes an additional dependency on implementors, but I think this should be mostly about the sharding strategy, the shard path locations, and the expected schema. Maybe we can standardize things like `<path>/<filename>.<encoding>.<compression>`. Or at least standardize that a JSON counterpart is also available.

Contributor Author:

I'm open to any format and compression but I do think it's important to only provide a single format.

  • The files are content-addressed so we would have to add multiple hashes for different file formats to the index.
  • It makes it easier for the server/indexer if there is just one file.
  • It makes the client simpler because it doesn't have to query which variants are available.

We chose msgpack over other formats (like protobuf or capnproto) exactly because it has a lot of great implementations for a large variety of languages and still produces small files.

Contributor:

Ah, I see, the hashes are for the compressed blobs? I had assumed we would hash the raw underlying content.

Do you have any numbers comparing JSON and msgpack?

Contributor:

Initially we hashed the underlying data. Unfortunately, that did not map well to the OCI registry, so we decided to use the hash of the compressed blob.

@baszalmstra (Contributor Author) Apr 30, 2024:

And here are some basic numbers.

I generated the repodata shard index for conda-forge's noarch subdir in both JSON and msgpack format, and took the largest shard (conda-forge-pinning) for comparison as well.

| which                       | size compressed | size uncompressed | deserialization time |
| --------------------------- | --------------- | ----------------- | -------------------- |
| json shard index            | 705 KB          | 1404 KB           | 4.8 ms               |
| msgpack shard index         | 666 KB (94%)    | 788 KB (56%)      | 3.2 ms (66%)         |
| json shard                  | 298 KB          | 1926 KB           | 12.8 ms              |
| msgpack shard               | 313 KB (105%)   | 1567 KB (81%)     | 10.2 ms (83%)        |
| msgpack shard (byte hashes) | 290 KB (97%)    | 1357 KB (70%)     | 8.8 ms (68%)         |

In terms of size, both JSON and msgpack files end up roughly the same. The biggest win is in deserialization time, which also takes up a significant amount of time in real-world examples.

Edit: If I change the format of the hashes in the shards from hex string to bytes, the msgpack files are always smaller (290 KB for the conda-forge-pinning case).

Edit 2: Updated table to include msgpack with byte hashes.
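For reference, a minimal illustration of the hex-to-bytes change (the values are placeholders):

```python
# A sha256 hex digest is 64 characters; the raw digest is 32 bytes.
hex_hash = "aa" * 32                 # placeholder for a real sha256 hex digest
raw = bytes.fromhex(hex_hash)        # this is what gets stored in the msgpack shard
assert raw.hex() == hex_hash         # round-trips back to hex when needed
```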

Contributor Author:

This is the size of all shards of the noarch conda-forge channel (with the above edit applied):

|                      | size    |
| -------------------- | ------- |
| repodata.json.zst    | 13.5 MB |
| shards/*.json.zst    | 18 MB   |
| shards/*.msgpack.zst | 17.1 MB |


Contributor Author:

Have you drawn any conclusions from this that we should incorporate @dholth ?

@wolfv (Contributor) commented May 1, 2024

One thing that just came to mind is that we might want to add some language that says that implementations should ignore any unknown keys. That will enable us to add run_exports, purls, and other keys to the repodata record without breaking old versions of the tools.
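A minimal sketch of what that language implies for a client, assuming a hypothetical set of known record fields (the real field set comes from the repodata schema):

```python
KNOWN_KEYS = {"name", "version", "build", "build_number", "depends", "sha256"}

def parse_record(raw: dict) -> dict:
    # Keep only the keys we understand and silently ignore everything else,
    # so future additions (run_exports, purls, ...) don't break older clients.
    return {key: value for key, value in raw.items() if key in KNOWN_KEYS}
```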

@wolfv (Contributor) commented May 1, 2024

One other thing:

`msgpack2json -d -b -i /Users/wolfv/Library/Caches/rattler/cache/repodata/shards-v1/7336fbc738626810350ff13c195012deb91b794840c81f59ffd8a74559102ec3.msg` works nowadays for looking at the shards in JSON :)

@jezdez (Member) commented May 1, 2024

> Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration. "Simpler" in what areas? Do you mean "faster"?

In my experience, "simplicity" is easier said than done (read: is hard) and, given that this introduces a whole new format, uses a different encoding type, and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's... a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

@wolfv (Contributor) commented May 2, 2024

> Note that, in fairness, we never got around to implementing the faster "2-file" JLAP variant. However, sharded repodata is a lot "simpler" in terms of updates, so we believe that in rattler this will supersede the JLAP efforts.

> Thinking about this a little more, I think your statement that this proposal is "simpler" needs some elaboration. "Simpler" in what areas? Do you mean "faster"?

Simpler in that no "patching" is needed (you get the real repodata right away). There is also no "state" that needs to be kept - just update the index file and make sure that all the shards are there, which is really straightforward.

It should be faster, too, because, again, we don't need the whole index in the first place. The download is only ~1 MB (for linux + noarch). I can download all shards for jupyterlab in 350 ms.

> In my experience, "simplicity" is easier said than done (read: is hard) and, given that this introduces a whole new format, uses a different encoding type, and seems to be focused on a non-standard hosting type (OCI, from reading your comments), I'm hoping you can provide context about this. It's... a lot for one CEP, while trying to imagine a way to implement this in conda (which remains a goal with these CEPs after all).

We're not at all focused on a non-standard hosting type; we just want to make sure that it works 100% fine there. In fact, we host our "repodata-shards" on a traditional S3-style bucket, but a regular file server would work just as well. One just needs to add the repodata_shards.msgpack.zst file and all the shards under /shards/....

I don't think there is anything to worry about with regards to the implementation in conda. We are using msgpack-python from Python to create the index (the API is the same as json.load/json.dumps).
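For illustration, a minimal sketch of producing shards and the index with msgpack-python and zstandard; the schema keys and file layout follow this CEP's description, but treat the details as illustrative rather than normative:

```python
import hashlib
import os

import msgpack
import zstandard

# Illustrative repodata records for a single package name (schema abbreviated).
shards = {
    "python": {
        "packages.conda": {
            "python-3.12.0-h123456_0.conda": {"name": "python", "version": "3.12.0"},
        },
    },
}

os.makedirs("shards", exist_ok=True)
index = {"version": 1, "shards": {}}
for name, records in shards.items():
    blob = zstandard.ZstdCompressor(level=19).compress(msgpack.packb(records))
    digest = hashlib.sha256(blob).digest()  # hash of the *compressed* blob
    index["shards"][name] = digest          # raw bytes rather than hex, to keep the index small
    with open(f"shards/{digest.hex()}.msgpack.zst", "wb") as f:
        f.write(blob)

with open("repodata_shards.msgpack.zst", "wb") as f:
    f.write(zstandard.ZstdCompressor().compress(msgpack.packb(index)))
```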

> I would appreciate hearing your thoughts on what a rollout would look like for existing conda hosting (e.g. conda-forge) and how current conda users would benefit from this.

You could do it the same way as we do, and host an alternative index on a different subdomain for a testing period? What we do is: our GitHub Action ingests the existing repodata, creates shards, and stores the (new) shards on the bucket. It's all open source (in the linked repository). So you can try it out today! :)

With the base_url in the repodata_shards.msgpack.zst file we link back to the "original" conda-forge, by the way (so that packages are downloaded from the official conda.anaconda.org).

@wolfv (Contributor) commented May 4, 2024

I just tested the sharded repodata vs. regular repodata on a slow internet connection. It took 15 s for noarch + osx-arm64 in the traditional case, and 900 ms for the sharded repodata. Very nice!

@wolfv (Contributor) commented May 6, 2024

I also wanted to point out that this is an infrastructure improvement that should dramatically reduce indexing times and thereby hopefully pave the way to an almost-realtime experience when publishing new packages, especially once we start using the daily and weekly files.

The changes will be much smaller. I hope you guys are also excited about that prospect :) @jezdez

@wolfv (Contributor) commented May 6, 2024

And if you want it really simple + want to join forces, you can use the Gateway class from py-rattler:

https://github.com/baszalmstra/rattler/blob/e570cc74e8bce76606e3246926af723ae68461d2/py-rattler/rattler/repo_data/gateway.py#L65-L81

1. **Speed**: Fetching repodata MUST be very fast. Both in the hot- and cold-cache case.
2. **Easy to update**: The channel MUST be very easy to update when new packages become available.
3. **CDN friendly**: A CDN MUST be usable to cache the majority of the data. This reduces the operating cost of a channel.
4. **Support authN and authZ**: It MUST be possible to implement authentication and authorization with little extra overhead.
Member:

That seems like an odd duck; authorization shouldn't be attached to the repodata format, as it's part of the transport layer, not the application layer.

@baszalmstra (Contributor Author) May 7, 2024:

I'm not sure why this CEP cannot make changes to both? Authenticating requests can add serious overhead to the server-side implementation, which in turn affects performance on the client. Having a way to effectively "cache" the authentication drastically reduces the work on the server, which, given the sheer number of requests introduced by this CEP, I think makes sense to address as well. Note that the approach taken here is very similar to how OCI registries operate.

Contributor:

I would also argue that we're proposing more of a transport protocol here where the authentication fits in pretty well as a key part of the whole thing.

Contributor:

Lastly, you can also fall back to other (more traditional) means by not replying with a token on the endpoint (depending on how your authentication works, it might be slower).

@xhochy:

I also feel that authorization should be split off. You can do this in multiple ways that are independent of how shards are actually implemented.

Contributor Author:

I removed authentication from the CEP and marked it as something we should investigate in another CEP. I left it in the design goals because I still think we shouldn't implement something that would rule it out.


Additionally, an index file stores the mapping from package name to shard hash.

Although not explicitly required, the server SHOULD support HTTP/2 to reduce the overhead of doing a massive number of requests.
Member:

Given the sheer number of files, I think realistically this is a MUST.

Contributor Author:

We have been running this with HTTP/1 by accident for a while; it's definitely slower but still very acceptable. I left this as a SHOULD because everything still works perfectly fine without HTTP/2.

Contributor:

Doesn't pip do something similar with the simple index, requesting one file per package name?

Contributor Author:

Yes, there is one index per package name, e.g. https://pypi.org/simple/pinject.

But this index does not store any dependency information for any of the artifacts. An additional request is required PER artifact to get the dependency information.

In this proposal we add the dependency information of individual packages directly to the per-package-name shard.

Contributor:

pip is regarded as fast even though it does not use HTTP/2 and has a similar access pattern to what this CEP would imply. You might keep a few pipelined HTTP/1 connections open per repodata server to grab everything.

@xhochy:

I would also expect that HTTP/1 already gives a reasonable improvement. Not all corporate proxies (but most) support HTTP/2 yet, and thus I'm always happy to have enhancements that can fall back to HTTP/1.

Contributor:

I tried doing an HTTP/1 fetch of all ~10k files from the prefix test server using 20 threads (and a pool of 10 connections at once) with Python requests. It takes about 20 seconds to fetch everything with a hot cache, and the spec is designed for us to fetch much less than everything.


For a large case (conda-forge linux-64) this file is 670 kB at the time of writing.

We suggest serving the file with a short-lived `Cache-Control` `max-age` header of 60 seconds to an hour, but we leave it up to the channel administrator to set a value that works for that channel.
Member:

Why do you suggest 60 seconds TTL?

Contributor Author:

I feel like this provides a sane tradeoff between limiting the number of requests to the server and reducing the time it takes to get new packages. But TBH it's quite arbitrary.


Individual shards are stored under the URL `<shard_base_url>/<subdir>/shards/<sha256>.msgpack.zst`. Where the `sha256` is the lower-case hex representation of the bytes from the index. It is a zstandard compressed msgpack file that contains the metadata of the package. The `<shard_base_url>` is defined in [Authentication](#authentication).

The files are content-addressable which makes them ideal to be served through a CDN. They SHOULD be served with `Cache-Control: immutable` header.
Member:

Why SHOULD and not MUST?

Contributor Author:

An implementation doesn't care about Cache-Control headers; it just uses the content address. However, for CDNs, having the immutable header ensures that the content is cached on the edge for as long as possible. But it is not a requirement.
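To make the content-addressing concrete, a hedged sketch of a client fetching and verifying one shard (the helper name is hypothetical; the URL layout follows the quoted spec text; error handling omitted):

```python
import hashlib
import urllib.request

import msgpack
import zstandard

def fetch_shard(shard_base_url: str, subdir: str, sha256: bytes) -> dict:
    # The filename is the lowercase hex of the hash taken from the index.
    url = f"{shard_base_url}{subdir}/shards/{sha256.hex()}.msgpack.zst"
    blob = urllib.request.urlopen(url).read()
    # Verify the blob against its content address; no trust in HTTP caching needed.
    if hashlib.sha256(blob).digest() != sha256:
        raise ValueError(f"hash mismatch for {url}")
    return msgpack.unpackb(zstandard.ZstdDecompressor().decompress(blob))
```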

Contributor:

Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

@xhochy:

> Is there a non-content-addressable variant that might be suitable for a package server with per-user, per-package permissions?

Most permission-based solutions I have encountered so far (I have not seen Anaconda Enterprise) manage permissions on the basis of whole channels. That stays compatible with the current approach and was also compatible with the old approach.

For giving access to a subset of packages in a channel, you would need to serve copies of repodata.json depending on the user. As the new approach (new = sharding) would still also serve the old repodata.json, I think this can only be supported with a backend that checks permissions on a per-request basis and doesn't do authentication purely based on the URL.

Comment:

Another form of permission for content-addressed blob store systems I've seen is to store some additional tag-style header attributes with the content blob and use those to drive authz. GCP and AWS both support setting this kind of metadata, and you can perform the auth part with a proxy in front of the blob store.

The `sha256` and `md5` fields from the original repodata are converted from their hex representation to bytes.
This is done to reduce the overall file size of the shards.

Implementers SHOULD ignore unknown keys; this allows adding additional keys to the format in the future without breaking old versions of tools.
Member:

CEP 15 adds a repodata_version field, incrementing it to 2 when a base_url is present. Since you're willing to update a number of fields, this should be incremented to 3, so clients don't have to implement forever-backwards-compatible repodata loaders.

Contributor Author:

The downside of that, though, is that a client that doesn't support version 3 will not be able to parse the repodata, effectively breaking it. Adding keys is forward-compatible as long as the client is not strict about unknown keys.

Contributor Author:

Or did I misunderstand and is the repodata_version field forward-compatible?

@dholth (Contributor) May 8, 2024:

We know that optional fields like "binstar" from anaconda.org non-CDN channels exist. Or whatever happens to be in the package's index.json goes into repodata.

But fetching files from the wrong url when base_url is present is obviously broken.

Should the index shard use repodata_version instead of "version": 1?

It would be good form to have an algorithm for converting between repodata.json and shards in both directions.

cep-16.md Outdated

Implementers SHOULD ignore unknown keys; this allows adding additional keys to the format in the future without breaking old versions of tools.

Although these files can become relatively large (100s of kilobytes), typically, even for a large case (conda-forge), these files remain very small, e.g. 100s of bytes to a couple of kilobytes.
Member:

❤️

Contributor:

What if we allow a shard to contain the metadata of more than one package, so that multiple package names could point to the same shard? It is a nuisance to download many 300-byte shards.

Contributor:

Yeah, we could, but it comes with some downsides:

  • the cache might not be as efficient, as multiple packages might change at different frequencies and the cache becomes invalid more often
  • it's more complicated in the implementation, as packages don't map 1:1 to shards, but that would not be too bad.

We could keep the current repodata_shards index structure and use the same hash for multiple package names (that would probably compress decently).

However, I am not sure if the tradeoff is worth it.

Also it will start to look a bit like zchunk then :)

Contributor:

I tried using a buffer to combine sequential (ordered by package name) shards when the combined size is not > 8192 bytes compressed. So we still wind up with plenty of ~300 byte single package shards if it sees (large shard, small shard, large shard) but it's all in order.

That scheme gives 7987 files and 10883 package names on a conda-forge/linux-64 snapshot, probably taking 75% of the time to download if we happen to want to download everything. Unlike zchunk we have the option of not maintaining a complete local copy of the data.

The most package names combined into a single shard are these fifteen:

['perl-bit-vector-7.4-pl5321h166bdaf_0.tar.bz2',
 'perl-capture-tiny-0.48-pl5321ha770c72_1.tar.bz2',
 'perl-carp-assert-0.21-pl526_0.tar.bz2',
 'perl-class-method-modifiers-2.13-pl5321ha770c72_0.tar.bz2',
 'perl-clone-0.46-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-bzip2-2.201-pl5321h166bdaf_0.tar.bz2',
 'perl-compress-raw-zlib-2.202-pl5321h166bdaf_0.tar.bz2',
 'perl-data-dumper-2.183-pl5321h166bdaf_0.tar.bz2',
 'perl-b-hooks-endofscope-0.26-pl5321ha770c72_0.conda',
 'perl-b-hooks-endofscope-0.28-pl5321ha770c72_0.conda',
 'perl-class-data-inheritable-0.09-pl5321ha770c72_0.conda',
 'perl-class-load-0.25-pl5321ha770c72_0.conda',
 'perl-class-load-xs-0.10-pl5321h0b41bf4_0.conda',
 'perl-class-method-modifiers-2.14-pl5321ha770c72_0.conda',
 'perl-class-method-modifiers-2.15-pl5321ha770c72_0.conda',
 'perl-cpan-meta-check-0.014-pl5321ha770c72_0.conda',
 'perl-data-dumper-2.183-pl5321hd590300_0.conda',
 'perl-data-dumper-concise-2.023-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.112-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.113-pl5321ha770c72_0.conda',
 'perl-data-optlist-0.114-pl5321ha770c72_0.conda']
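For concreteness, a greedy sketch of the combining scheme described above (recompressing at each step is wasteful but keeps the idea clear; the 8192-byte threshold is taken from the comment):

```python
import zstandard

MAX_COMBINED = 8192  # compressed-size threshold, per the comment above

def combine_shards(shards: list[tuple[str, bytes]]) -> list[list[str]]:
    """Greedily merge consecutive (package name, raw shard bytes) pairs, ordered
    by package name, while the combined compressed size stays <= MAX_COMBINED."""
    cctx = zstandard.ZstdCompressor()
    groups: list[list[str]] = []
    current: list[str] = []
    buffer = b""
    for name, raw in sorted(shards):
        if current and len(cctx.compress(buffer + raw)) > MAX_COMBINED:
            groups.append(current)  # flush: adding this shard would overflow
            current, buffer = [], b""
        current.append(name)
        buffer += raw
    if current:
        groups.append(current)
    return groups
```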

Contributor:

I am wondering: if you need everything, wouldn't you rather download the repodata.json file instead? I think that wasn't the use case we optimized for.

Contributor:

I'm imagining a situation in which there is only repodata_shards. Suppose we are in the middle of processing a bunch of queries on the SubdirData and suddenly we encounter a query for "any package name, md5=x". Are we more likely to reject the query, download everything, or eject to downloading an alternate format?

Contributor:

Why not deprecate it, and grab the repodata.json in the meantime? Tell people to use proper lockfiles? :)

Contributor:

What if we simplified the implementation, and included one repodata shard with all the packages in it, and an index that pointed to that singular shard 10k times 🤣

cep-16.md Outdated
Comment on lines 157 to 174
To facilitate authentication and authorization we propose to add an additional endpoint at `<channel>/<subdir>/token` with the following content:

```json
{
  "shard_base_url": "https://shards.prefix.dev/conda-forge/<subdir>/",
  "token": "<bearer token>",
  "issued_at": "2024-01-30T03:35:39.896023447Z",
  "expires_in": 300
}
```

`shard_base_url` is an optional url to use as the base url for the `repodata_shards.msgpack.zst` and the individual shards. If the field is not specified it should default to `<channel>/<subdir>`.

`token` is an optional field that, if set, MUST be added to any subsequent request in the `Authentication` header as `Authentication: Bearer <token>`. If the `token` field is not set, sending the `Authentication` header is also not required.

The optional `issued_at` and `expires_in` fields can be used to verify the freshness of a token. If the fields are not present a client can assume that the token is valid for any subsequent request.

For a simple implementor this endpoint could just be a static file with `{}` as the content.
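A minimal client-side sketch of this (later removed) token flow, following the quoted draft's field and header names; treat it as illustrative:

```python
import json
import urllib.request

def get_token(channel: str, subdir: str) -> dict:
    # A simple implementor can serve a static `{}` here.
    with urllib.request.urlopen(f"{channel}/{subdir}/token") as response:
        return json.load(response)

def shard_base_url(token_info: dict, channel: str, subdir: str) -> str:
    # Defaults to <channel>/<subdir> when shard_base_url is absent.
    return token_info.get("shard_base_url", f"{channel}/{subdir}/")

def shard_headers(token_info: dict) -> dict:
    # Only send the header when a token was actually returned.
    token = token_info.get("token")
    return {"Authentication": f"Bearer {token}"} if token else {}
```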
Member:

As stated above, authn/z is a transport-layer thing and doesn't fit into this application-layer metadata standard. This should be moved into its own CEP.

cep-16.md Outdated

To fetch all needed package records, the client should implement the following steps:

1. Acquire a token (see: [Authentication](#authentication)). Acquiring a token can be done lazily, so as to only request a token when an actual network request is performed.
Member:

Authn/z isn't part of the repodata standard and should be evolved in a separate CEP.

Comment on lines +244 to +245
- add `purl` as a list of strings (Package URLs to reference to original source of the package) (See: https://github.com/conda-incubator/ceps/pull/63)
- add `run_exports` as a list of strings (run-exports needed to build the package) (See: https://github.com/conda-incubator/ceps/pull/51)
@jezdez (Member) May 7, 2024:

How does this relate to CEP-15?

Contributor Author:

I'm not sure what you mean? The base_url from CEP-15 is part of the index file.

Co-authored-by: Jannis Leidel <[email protected]>
@baszalmstra requested a review from jaimergp, May 7, 2024 15:08

JLAP also does not save anything with a cold cache because the initial repodata still has to be downloaded. This is often the case for CI runners.

Finally, the implementation of JLAP is quite complex, which makes it hard to adopt for implementers.
Contributor:

IMO the complexity is the least fair criticism of JLAP; it can yield a small implementation.

Contributor Author:

IMHO, in theory it's pretty straightforward, but to integrate it properly there are a number of additional complexities that make this statement true. To name a few: it requires maintaining extra state on disk and HTTP range requests; we need to make sure that the repodata.json is still the previously stored repodata; indexing requires the previous state; and you need an additional overlay file to make it perform optimally.

But in fairness, this is just my personal experience while implementing it.

@wolfv (Contributor) May 9, 2024:

@dholth I think we can claim that we are the only ones shipping JLAP in production, and it's been pretty rough, unfortunately (we observe a waiting progress bar for multiple seconds while patches are applied and the file is serialized). We could improve this with the two-file approach, but I think we make a good point about why this proposal is simpler.

Also, one thing we gain with our proposal here is complete hash integrity, something that is not possible with the JLAP two-file approach. We want to enable signed repodata and packages (TUF-style) sooner or later, and I don't see a way to do that with JLAP's two-file approach.

With our approach, we can at any time prove the integrity of the whole thing.

Co-authored-by: Daniel Holth <[email protected]>
Co-authored-by: Bas Zalmstra <[email protected]>

### Update optimization

We could implement support for smaller index update files. This can be done by creating rolling daily and weekly index update files that can be used instead of fetching the whole `repodata_shards.msgpack.zst` file. The update operation is very simple (just update the hashmap with the new entries).
Contributor:

This could be done with a JLAP-like rolling checksum / changelog file where the client would look for the hash of their current index, using JSON merge patch instead of the more complicated JSON patch.

Contributor:

Yeah, I like the idea of having a hash of the "expected" file. We don't need any proper "patching" as it should be as simple as

`old_repodata["shards"].update(new_repodata["shards"])`

(with some extra handling of null keys potentially).

We would also store a timestamp in the info field that we could use to evaluate whether we can use the daily or weekly update.
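A minimal sketch of that update step, assuming a `null`/`None` value marks a removed package (field names as in the index described above, so treat them as illustrative):

```python
def apply_index_update(index: dict, update: dict) -> None:
    for name, shard_hash in update["shards"].items():
        if shard_hash is None:
            index["shards"].pop(name, None)  # package was removed from the channel
        else:
            index["shards"][name] = shard_hash
```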

Contributor:

"json merge patch" is similar to ".update()". We could simulate the size of the proposed file using https://github.com/dholth/conda-test-data, probably it is pretty small based on what I've observed about jlap.

Contributor:

Yes, we are logging what percentage of shards changes every 6 hours in the indexing pipeline. It's usually < 1 percent, so it's pretty small.

I think adding json-merge-patch isn't going to add anything over the proposed update of dictionary keys, but the principle is definitely the same. We could also define a hash, but I would argue that we should use a "content" hash (i.e. one that does not rely on the specific formatting of a JSON file), although that might be less of an issue since we're using msgpack for serialization in this proposal :)

@dholth (Contributor) May 10, 2024:

msgpack may have more options for formatting than JSON, e.g. "serializers SHOULD use the format which represents the data in the smallest number of bytes", but does our implementation? JSON with sorted keys and no whitespace is predictable, but we won't think about implementing https://www.rfc-editor.org/rfc/rfc8785.

An algorithm like merge-patch would handle any change in the index file, not just individual shards.

@xhochy left a review:

I really like the sharding approach, but I'm very heavily -1 on mixing authentication into it. This will conflict with existing implementations and thus make the feature not usable there.


cep-16.md Outdated


## <a id="authentication"></a>Authentication
@xhochy:

I'm -1 on adding this specific bit of authentication to the sharding proposal. We should be able to implement the sharding approach independently of the authentication mechanism.

I don't see how we would be able to provide such an endpoint in some setups. This ties the CEP to a certain kind of authentication and removes the flexibility that we currently have in also using HTTP Basic Auth instead of token-based auth.

Contributor Author:

I am open to opening another CEP specifically for authentication of the sharded index but let me first explain the reasoning behind why we added this.

The reason we added this is that we have noticed that supporting generic authentication methods adds very significant overhead to fetching the individual shards, and a large infrastructure cost.

Take as an example our prefix.dev backend. When requesting private repodata, a user requests https://prefix.dev/mychannel/win-64/repodata.json; our backend server authenticates the request using either a conda token, basic auth, or an API key, and redirects the request to a presigned URL. If we did something similar for the shards, each request would have to go through this dynamic endpoint. This adds additional delay to the request and incurs an infrastructure cost due to the required compute. And we are talking about hundreds to thousands of requests when a user fetches repodata. We have noticed that this significantly impacts performance.

Ideally, we want to have a solution where the "presigning" happens just once and we can reuse this information on subsequent requests.

I am open to alternative solutions, but I think being able to authenticate once for all subsequent requests is very important to how effective this proposal is. Standardizing this method is also important to ensure that every client will be able to access different server implementations.

Then, with regards to our specific solution using the token endpoint: the token endpoint only really needs to be dynamic when the authentication mechanism used is dynamic. In any other case, the token endpoint could just be an empty static file. Basic authentication or even conda-token-based authentication can still be applied to both the individual shards and the shard index.

As an alternative though, we could also look at returning a set of headers from the initial request to the shard index instead of requesting a separate file.

FYI @jezdez

@wolfv (Contributor) May 13, 2024:

I think what we are not making clear in the CEP is that "traditional" auth methods should continue to work just as well. For example, if the server uses basic HTTP authentication or conda token authentication, this should continue to work (and depending on the server implementation should also not negatively affect performance).

We are just interested in adding a mechanism to the protocol in order to facilitate really fast performance for large public channels.

As Bas mentioned, I think we can make this an "add-on" behavior when the request to repodata_shards.msgpack.zst returns specific header values such as:

```
X-Shard-BaseUrl: https://fast.prefix.dev/conda-forge/
X-Shard-Token: pfx_secret_token
X-Shard-Token-ExpiresAt: TIMESTAMP
```

The client would then use these values for subsequent requests. If these headers are not returned, it will just use the "regular" URL with regular authentication.

@xhochy:

With these comments, I understand the need for an efficient implementation. Still, since it is part of this proposal, it seems necessary to implement it to support sharding fully. We will have several places where this pre-signed authentication will not work (e.g., when using an HTTP caching proxy or Artifactory's implementation). I would love to use sharding in these places at the cost of the additional authentication checks.

@xhochy commented May 13, 2024

Real-life note: This will probably work mostly out of the box with the existing conda support in Artifactory, except that the shard index would, for now, be cached indefinitely there (i.e. that is something JFrog would need to implement). It took them ~2 months with repodata.json.zst (and then add another 6 months for corporate IT to update their Artifactory version).


## Proposal

We propose a "sharded" repodata format. It works by splitting the repodata into multiple files (one per package name) and recursively fetching the "shards".
@aovasylenko May 29, 2024:

This reminds me of the current simple PyPI index (mentioned below); if the whole Python ecosystem of 500K+ packages were hosted this way, it would lead to that number of shards.
Were there any ideas about evaluating a smaller number of shards, for example by having a prefix-based rule or range?
My worry is that, depending on the server-side implementation, the initial pull of this information will generate a massive number of requests (even with HTTP/2), and at the API/DB level each request could be processed separately.

Contributor:

We haven't really discussed different sharding strategies. However, our implementation uses a CDN + static files, and that seems to work relatively well. It also seems to work well enough for PyPI and crates.io, so I don't think we're particularly concerned.
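To tie this back to the quoted proposal text, a hedged sketch of the recursive fetch it describes, reusing the hypothetical `fetch_shard()` helper sketched earlier and assuming classic repodata-style `packages.conda`/`depends` keys:

```python
def fetch_records(index: dict, base_url: str, subdir: str, roots: set[str]) -> dict:
    """Fetch shards for the requested package names, then for any dependency
    names discovered inside each fetched shard, until the closure is complete."""
    seen: set[str] = set()
    pending = set(roots)
    records: dict = {}
    while pending:
        name = pending.pop()
        seen.add(name)
        shard = fetch_shard(base_url, subdir, index["shards"][name])
        records[name] = shard
        for record in shard.get("packages.conda", {}).values():
            for dep in record.get("depends", []):
                dep_name = dep.split(" ", 1)[0]  # a match spec starts with the name
                if dep_name not in seen and dep_name in index["shards"]:
                    pending.add(dep_name)
    return records
```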

@baszalmstra (Contributor Author):

@jezdez @xhochy @dholth @wolfv I updated the CEP to:

  • Add shards_base_url
  • Remove authentication flow from the CEP (we will address this in a future CEP).

Please give it another read, I would like to put this to a vote soon.

@xhochy left a review:

Fine with me now.

@jezdez (Member) commented Jul 2, 2024

@conda/steering-council

This vote falls under the "Enhancement Proposal Approval" policy of the conda governance policy; please vote and/or comment on this proposal at your earliest convenience.

It needs 60% of the Steering Council to vote yes to pass.

To vote, please leave yes, no or abstain as comments below.

If you have questions concerning the proposal, you may also leave a comment or code review.

This vote will end on 2024-07-16, End of Day, Anywhere on Earth (AoE). This is an extended voting period due to summer holiday time in the Northern Hemisphere.

@jezdez jezdez added the vote Voting following governance policy label Jul 2, 2024
@baszalmstra (Contributor Author):

yes

@wolfv (Contributor) commented Jul 2, 2024

Please use the following form to vote!

@xhochy (Uwe Korn)

  • yes
  • no
  • abstain

@CJ-Wright (Christopher J. 'CJ' Wright)

  • yes
  • no
  • abstain

@mariusvniekerk (Marius van Niekerk)

  • yes
  • no
  • abstain

@goanpeca (Gonzalo Peña-Castellanos)

  • yes
  • no
  • abstain

@chenghlee (Cheng H. Lee)

  • yes
  • no
  • abstain

@ocefpaf (Filipe Fernandes)

  • yes
  • no
  • abstain

@marcelotrevisani (Marcelo Duarte Trevisani)

  • yes
  • no
  • abstain

@msarahan (Michael Sarahan)

  • yes
  • no
  • abstain

@mbargull (Marcel Bargull)

  • yes
  • no
  • abstain

@jakirkham (John Kirkham)

  • yes
  • no
  • abstain

@jezdez (Jannis Leidel)

  • yes
  • no
  • abstain

@wolfv (Wolf Vollprecht)

  • yes
  • no
  • abstain

@jaimergp (Jaime Rodríguez-Guerra)

  • yes
  • no
  • abstain

@kkraus14 (Keith Kraus)

  • yes
  • no
  • abstain

@baszalmstra (Bas Zalmstra)

  • yes
  • no
  • abstain

@jaimergp (Contributor) left a review:

Just a few typos I found while reading through the whole thing.

@jakirkham (Member):

Weird. Had checked the box above a few days ago and it appears to have disappeared. Rechecked it

@wolfv (Contributor) commented Jul 16, 2024

@jakirkham strange indeed. I can see your edit in the edit history, but it didn't seem to "tick" the box...

@wolfv (Contributor) commented Jul 16, 2024

@chenghlee @CJ-Wright @mbargull last chance to vote :)

@mbargull (Member):

I am thankful that someone is working on the repodata implementation, since the current one (one big blob per channel/subdir) became unwieldy a good while ago (at least in resource-constrained environments).
As far as I can tell, this proposal is a good starting point but still needs refinement (which I expect to get clearer once more people test it out in rattler and conda-index, etc.).
Unfortunately, I was not able to take the time to do a proper review and consult in the refinement process here, and as such opted to abstain.

@wolfv (Contributor) commented Jul 22, 2024

This vote was closed, with the following result

Total voters: 15 (valid: 93.33%)

Yes votes (12 / 85.71%):

No votes (0 / 0.00%):

Abstain votes (2 / 14.29%):

Not voted (1):

Invalid votes (0):

We reached quorum, and enough YES votes to accept the proposal! 🎉
