IPIP-305: CIDv2 - Tagged Pointers #305
@vmx I'd love to hear your take on this since it's coming from a Lurk perspective now and you're a bit more versed in that land than most of us. The big outstanding question for me before moving on to finer things is the version signalling. My bottom-up implementer's brain is coming at this from what's in the bytes and I don't think either the original PR or this are getting specific enough (or I'm reading them wrong, very possible). Are we using the CID version specifier the proper way:
or wrapping (hiding) this in a CIDv1 but using the codec to signal cidv2:
I'm suspecting that the language here, and over in multiformats/cid#49 might be suggesting both, with the latter reserved as a way to do backward compatibility where you know you're going to need it? Which might become a CIDv0-like problem when we've moved to a state where every system supports CIDv2 but everyone still wants to pass around CIDv1-wrapped-CIDv2s.
@rvagg Just to clarify, the intent of the proposal is to have a
That's what my Rust implementation does. However, I also described a backwards-compatible way of embedding a CIDv2 inside a CIDv1, as an optional thing:
We can remove this from my proposal if that would make things more clear. You can already use the identity multihash to embed any bytes inside a CIDv1, so there's no change to anything implied by this, and will still be possible regardless of what the final CIDv2 spec is. It's an existing (though rarely used) feature of multihashes and CIDs.
Added some questions and alternatives to try and flesh out our options here and the consequences.
I think referring to this proposal as tagged pointers is interesting. Both because it largely matches tagged pointers in them being useful but not necessarily required, and because to some extent CIDv1 is already a tagged pointer in that it is tagged with codec information.
If we extend the tagged pointer concept to allow further information then probably we should make sure there's enough flexibility here that we're less likely to continue to add more tagged pointers in a CIDv3.
Having arbitrary-length CID metadata allows the data to be fully self-describing and abstracts application-specific interpretation away into the metadata CID.

### Compatibility
I don't think this indicates the scope of what in the ecosystem will be affected by this change. The current text makes it appear as though introducing this new version of CID will be fairly trivial when that's not quite the case.
Some example ramifications:
- Existing CID parsers would need to be updated to support CIDv2, or else would error
- Many CIDv2s will be too big to represent in subdomains, which would effectively break how some tooling (e.g. HTTP gateways) works with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s.
- Tooling that only supports CIDv1 could break if any node being accessed within the graph contains a CIDv2. This could provide a problematic UX for tools that, say, only take a root CID and assume they can operate on a graph.
- Existing IPLD tooling may need to be upgraded to support the new type of links and expose needed information to users
Many of these are just the cost of doing upgrades in general, or the cost of adding metadata to links, but we should accumulate these and know what we're getting into here.
Existing CID parsers would need to be updated to support CIDv2, or else would error
Yes, but they would error cleanly, since afaiu parsers already have to match on the version varint. But I don't think it's a particularly complicated change to add a case for version 2, as I did in multiformats/rust-cid#123
Many CIDv2s will be too big to represent in subdomains which would effectively break how some tooling (e.g. HTTP gateways) work with CIDs today. Yes, the same is true of large CIDv1s but this is more likely with CIDv2s since they contain two CIDv1s
The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.
The most common CIDv2 sizes will probably be pairs of 256-bit or 512-bit hashes, which are roughly the same sizes as a 512-bit or 1024-bit CIDv1, which should be nearly universally supported.
Unfortunately not. The base36 encoding of a SHA2-512 raw CID is too long to fit into a URL subdomain. e.g. https://cid.ipfs.tech/#kf1siqqaod24wzk1b0jwakpjxj8z9xaqxwh56nnc267oznfqrm8cc0w0f36g6ir7zb1tuso6ch7kg3at9o6bnr8lm34hty32o1l0ljycu is 105 characters which is greater than the 63 character DNS label limit.
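For anyone who wants to check the length arithmetic, here is a minimal Rust sketch. It builds the binary form of a raw CIDv1, base36-encodes it with a hand-rolled big-integer division, and compares the resulting string length (including the `k` multibase prefix) to the 63-character DNS label limit. The helper names are made up for illustration and the digests are all-zero placeholders; only the length matters here.

```rust
/// Maximum length of a single DNS label (one subdomain component).
const DNS_LABEL_LIMIT: usize = 63;

/// Encode bytes as lowercase base36 digits, without a multibase prefix.
/// Leading zero bytes are dropped; CIDv1 bytes start with 0x01, so that
/// simplification does not affect the lengths computed here.
fn base36_encode(bytes: &[u8]) -> String {
    const ALPHABET: &[u8; 36] = b"0123456789abcdefghijklmnopqrstuvwxyz";
    let mut num = bytes.to_vec();
    let mut digits = Vec::new();
    // Repeatedly divide the big-endian number by 36, collecting remainders.
    while num.iter().any(|&b| b != 0) {
        let mut rem: u32 = 0;
        for b in num.iter_mut() {
            let acc = (rem << 8) | u32::from(*b);
            *b = (acc / 36) as u8;
            rem = acc % 36;
        }
        digits.push(ALPHABET[rem as usize]);
    }
    digits.reverse();
    String::from_utf8(digits).expect("alphabet is ASCII")
}

/// Binary CIDv1 for codes and digest lengths that fit in one varint byte.
fn cidv1_bytes(codec: u8, hash_code: u8, digest: &[u8]) -> Vec<u8> {
    assert!(codec < 0x80 && hash_code < 0x80 && digest.len() < 0x80);
    let mut out = vec![0x01, codec, hash_code, digest.len() as u8];
    out.extend_from_slice(digest);
    out
}

/// Length of the base36 string form: the 'k' prefix plus the digits.
fn base36_cid_len(cid: &[u8]) -> usize {
    1 + base36_encode(cid).len()
}

/// Whether the base36 form fits in a single DNS label.
fn fits_dns_label(cid: &[u8]) -> bool {
    base36_cid_len(cid) <= DNS_LABEL_LIMIT
}
```

Calling `fits_dns_label(&cidv1_bytes(0x55, 0x13, &[0u8; 64]))` (raw codec, sha2-512) returns `false`, while the same call with `0x12` and a 32-byte digest (sha2-256) returns `true`.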
- [CIDv2 with arbitrary-precision multicodec size](https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7#appendix-a-cidv2-and-arbitrary-precision-multicodec)
- CIDv2 with nested hashes
Could you detail this a bit more? Is this just allowing the CIDs inside the CIDv2 to also be CIDv2's rather than restricting them to CIDv1s?
In this proposal the contents of CIDv2s are not CIDv1s, but rather the broken apart multicodec-multihash pairs. This is specifically to mitigate the issues with nesting raised in the previous discussion multiformats/cid#49.
The other idea of arbitrary-precision multicodec is to figure out how to safely remove the 9-byte limit on multicodec-varints (such as by adding a size field), and then manage larger metadata tags by allocating ranges on the now infinite multicodec table. However, that solution requires both technical changes to implementations, as well as process changes to how multicodec is managed, whereas the current IPIP should largely only require the former.
The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata), or to error on the CIDv2 entirely.

### Alternatives
What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> }
or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }
This could be encoded as a CIDv1 in DAG-CBOR, or using any other format you wanted.
Some advantages:
- It doesn't require bumping the CID version and as a result a lot of tooling can be left alone
- Your type data can be more than a single block without requiring an extra level of indirection
- You can specify what your data is without reserving a code in the table for every data type you could want.
  - Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.
Some disadvantages:
- It takes up a couple more bytes
  - It's more than a few bytes if you want to be self-describing, but in theory an application could just have a tuple of CIDs which is fairly minimal overhead. This makes the data not self-describing, but it's not in the current proposal either
- A given application or ecosystem needs to decide on how to encode their metadata/type information
  - This needs to happen in the current proposal anyhow, but in the current proposal developers don't have to think about how to disambiguate data from metadata just how to actually encode their metadata
What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like { Data: <data-cid>, Metadata: <metadata-cid> } or { Data: <data-cid>, Metadata: <metadata-cid>, Type: <whatever-type-info-you-want> }
You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others since it requires that you have to traverse the hash. In the write-up I did for @vmx I go into some detail about why for Lurk we need to have the metadata tags in the pointers themselves: https://gist.github.com/johnchandlerburnham/d9b1b88d49b1e98af607754c0034f1c7
Your type data can be more than a single block without requiring an extra level of indirection
For large metadata, I think having a hash of the metadata is unavoidable. The advantage of this CIDv2 proposal though is that since a CIDv2 is isomorphic to a pair of CIDv1s, you can store your metadata and data in the same content-addressed store with self-describing keys. We do this in Yatima where we have large data and metadata trees for program ASTs: https://github.com/yatima-inc/yatima-lang/blob/35f868ab05a4059690e6da9db2e5c4419537fcd0/Yatima/Datatypes/Cid.lean#L23
So this proposal supports both large metadata (like Yatima's full metadata CIDs) and small metadata (like Lurk's 16-bit tags)
Sure maybe "IPLD Schema" is a reasonable way of representing many types, but I could also see applications showing up with a list of 100 types they'd want codes for. Allocating codes like this isn't just a pain for table maintenance and taking up table space, but it also forces more of the data structure logic out of band which makes it harder for an application that doesn't know what to do with the unknown code number to figure out what to do.
I think what would make sense if this proposal is adopted is to allocate a single metadata multicodec for each application, whether that's IPLD Schema, Lurk, Yatima, etc., and then each application would have its own logic of what its own metadata means. E.g.
name | tag | code | description |
---|---|---|---|
dag-cbor | ipld | 0x71 | MerkleDAG cbor |
... | ... | ... | ... |
ipld-schema | ipld | 0x3e7a_da7a_0001 | an IPLD Schema DML in dag-cbor |
lurk-metadata | lurk | 0x3e7a_da7a_0002 | A Lurk tag in the identity multihash |
yatima-metadata | yatima | 0x3e7a_da7a_0003 | A hash of a Yatima metadata AST |
This has a similar effect as allocating ranges in the multicodec table, but without the centralized overhead.
You mean creating IPLD lists or objects and then hashing them? This works fine for some cases, but not for others since it requires that you have to traverse the hash.
I read through the writeup, but still don't understand. What's the problem that you run into if instead of something like
<0x02><lurk-data-code><lurk-data-multihash><lurk-tag-code><lurk-tag-identity-multihash>
you had
taggedLink = EncodeDagCbor([<0x01><lurk-data-code><lurk-data-multihash>, <0x01><lurk-tag-code><lurk-tag-identity-multihash>])
<0x01><0x71><identity-multihash-of-taggedLink>
It seems like the bytes would be almost the same, and any code working with lurk data would already know how to do the conversion of the CIDv1 into two different objects and the use of identity multihashes saves you from doing any repeated hashing.
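To make the byte-level comparison concrete, here is a rough Rust sketch that builds both layouts and lets you compare their sizes. Everything specific in it is an assumption for illustration: no Lurk codes appear in this thread, so `0x71` (dag-cbor) and `0x55` (raw) stand in for the data and tag codecs, the data hash is a 32-byte sha2-256 placeholder, and the tag is a 2-byte value in an identity multihash. The DAG-CBOR encoding is hand-written for this one shape (a two-element list of links) rather than taken from a library.

```rust
/// Multihash with a single-byte code and length (values below 0x80 only).
fn multihash(code: u8, digest: &[u8]) -> Vec<u8> {
    assert!(code < 0x80 && digest.len() < 0x80);
    let mut out = vec![code, digest.len() as u8];
    out.extend_from_slice(digest);
    out
}

/// <0x01><codec><multihash>
fn cidv1(codec: u8, mh: &[u8]) -> Vec<u8> {
    let mut out = vec![0x01, codec];
    out.extend_from_slice(mh);
    out
}

/// Layout discussed above: <0x02><data-codec><data-multihash><tag-codec><tag-multihash>
fn cidv2(data_codec: u8, data_mh: &[u8], tag_codec: u8, tag_mh: &[u8]) -> Vec<u8> {
    let mut out = vec![0x02, data_codec];
    out.extend_from_slice(data_mh);
    out.push(tag_codec);
    out.extend_from_slice(tag_mh);
    out
}

/// DAG-CBOR link: CBOR tag 42 around a byte string of 0x00 followed by the CID.
fn dag_cbor_link(cid: &[u8]) -> Vec<u8> {
    let len = cid.len() + 1; // +1 for the 0x00 identity-multibase prefix
    assert!(len < 256, "sketch only handles short CIDs");
    let mut out = vec![0xd8, 0x2a]; // tag(42)
    if len < 24 {
        out.push(0x40 | len as u8); // byte string, length in the header byte
    } else {
        out.extend_from_slice(&[0x58, len as u8]); // byte string, 1-byte length
    }
    out.push(0x00);
    out.extend_from_slice(cid);
    out
}

/// Alternative: <0x01><0x71><identity multihash of [link(data), link(tag)]>
fn wrapped_cidv1(data_cid: &[u8], tag_cid: &[u8]) -> Vec<u8> {
    let mut list = vec![0x82]; // CBOR array with two items
    list.extend(dag_cbor_link(data_cid));
    list.extend(dag_cbor_link(tag_cid));
    cidv1(0x71, &multihash(0x00, &list))
}
```

With the placeholder values above, `cidv2(...)` comes out at 41 bytes and `wrapped_cidv1(...)` at 56 bytes; the difference is the CBOR array, tag, and byte-string framing plus the two inner version bytes. This only measures size and says nothing about whether the tag is readable without decoding the inner block.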
The key difference is that the lurk tags are not legible from taggedLink without traversing the pointer. In the Lurk case, this might be impossible if we're pointing towards a private input.
@porcuquine, @vmx and I had a long discussion on the Lurk discord about why this is necessary: https://discord.com/channels/908460868176596992/913200327547822110/964156408490754058
What would happen if instead of a CIDv2 you just used a CIDv1 that looked something like ...
My worry about just pushing as much as we can into CIDv1 is that we end up losing the utility of the CID because it just becomes a way to squish in arbitrary data to a point in a block. One of the main purposes of a CID in IPLD is to provide clear linking semantics between blocks. If we overloaded CIDv1 and hid the actual content address of the link in an inline portion of it then even though the blocks might load fine in existing systems, the DAG disappears because the links aren't links anymore. We end up at the same place as a CIDv2 of having to update all our systems to interpret this new thing, and while it may be less painful and give us more time to adjust, it also gives us lots of space to not upgrade at all—or to give edges of our ecosystem space to not upgrade. Turning DAGs into collections of arbitrary blocks.
The choice would be something like: would you rather push your DAG to a pinning service where you don't know if they support the new inline CIDv1-with-embedded-link, and therefore, just in case, you have to push them each block one by one and get them to pin each block individually? Or have the pinning service error with "unknown CID version: 2" and move on to a different pinning service, knowing that you just want to pin a root and they'll take care of the DAG connectivity?
I think I'm on team just accept the pain and upgrade all the things even though it's going to take time. I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a half way state that sucks more than just biting the bullet.
I also think I'd prefer to not have a CIDv1 variant in the spec because having an easy way out might leave us in a half way state that sucks more than just biting the bullet.
I think that makes a lot of sense. While my initial thinking was that CIDv2 would be an optional extension that would live alongside CIDv1, I think that there's certainly a way to modify CIDv2 to have it work as a CIDv1 replacement.
Specifically I think what I would want to do is
pub struct Cid<const S: usize, const M: usize> {
    /// The version of CID.
    version: Version,
    /// The codec of CID.
    codec: u64,
    /// The multihash of CID.
    hash: Multihash<S>,
    /// The metadata multicodec, if any.
    meta_codec: Option<u64>,
    /// The metadata multihash, if any.
    meta_hash: Option<Multihash<M>>,
}
And then we would need a bit to switch on whether the cid has metadata or not:
<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data>(<multicodec-metadata-content-type><multihash-metadata>)
or
<cidv2> ::= <multicodec-cidv2><multicodec-data-content-type><multihash-data><has-metadata-varint>(<multicodec-metadata-content-type><multihash-metadata>)
where everything in the parentheses is present if has-metadata = 1 and absent if has-metadata = 0.
If we don't want to add a whole extra varint for a single bit though, we could actually switch on the version varint, where Version::V1 has no trailing metadata and Version::V2 has mandatory trailing metadata. That's maybe more in the same vein as "CIDv2 as optional extension for CIDv1" though
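A minimal sketch of the last variant (switching on the version varint) might look like the following. This is not the rust-cid implementation: the types are simplified stand-ins with heap-allocated digests, the function names are invented for this example, CIDv0 is ignored, and errors are plain strings.

```rust
#[derive(Debug, PartialEq)]
struct Multihash {
    code: u64,
    digest: Vec<u8>,
}

#[derive(Debug, PartialEq)]
struct Cid {
    version: u64,
    codec: u64,
    hash: Multihash,
    /// (meta_codec, meta_hash); None for version 1, Some for version 2.
    meta: Option<(u64, Multihash)>,
}

/// Decode an unsigned varint of at most 9 bytes; returns (value, remaining bytes).
fn read_varint(input: &[u8]) -> Result<(u64, &[u8]), String> {
    let mut value = 0u64;
    for (i, &b) in input.iter().enumerate().take(9) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Ok((value, &input[i + 1..]));
        }
    }
    Err("truncated or oversized varint".to_string())
}

/// Read <hash-code><digest-length><digest>.
fn read_multihash(input: &[u8]) -> Result<(Multihash, &[u8]), String> {
    let (code, rest) = read_varint(input)?;
    let (len, rest) = read_varint(rest)?;
    let len = len as usize;
    if rest.len() < len {
        return Err("truncated digest".to_string());
    }
    let digest = rest[..len].to_vec();
    Ok((Multihash { code, digest }, &rest[len..]))
}

/// Parse a CID from the front of `input`, returning it with the leftover bytes.
/// Version 1 carries no trailing metadata; version 2 requires it.
fn parse_cid(input: &[u8]) -> Result<(Cid, &[u8]), String> {
    let (version, rest) = read_varint(input)?;
    if version != 1 && version != 2 {
        return Err(format!("unknown CID version: {}", version));
    }
    let (codec, rest) = read_varint(rest)?;
    let (hash, rest) = read_multihash(rest)?;
    let (meta, rest) = if version == 2 {
        let (meta_codec, rest) = read_varint(rest)?;
        let (meta_hash, rest) = read_multihash(rest)?;
        (Some((meta_codec, meta_hash)), rest)
    } else {
        (None, rest)
    };
    Ok((Cid { version, codec, hash, meta }, rest))
}
```

Because the version is checked before anything else is read, a parser that only knows version 1 can reject a version 2 CID with a clear error rather than misreading the trailing metadata as unrelated bytes.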
+1 on the struct but you're right about the optionality - I don't know if I have an opinion yet on having an additional bit vs making metadata mandatory for v2 and therefore requiring a v1 where there is no metadata. A third way would be to make it mandatory if you're using a v2 but allow for the metadata to be 3 zero-bytes [0,0,0] (codec=0, hasher=identity/0, digest length=0), which would be equivalent to the v1 form - 3 wasted bytes instead of a single one for a flag, but you still get to choose whether you use a v1 to save those bytes.
One thing that continues to bother me about this (I mentioned this in the other thread) is that I lose the ability to inspect initial bytes to see what's coming. Currently we can do this with just enough bytes to read 3 varints: https://github.com/multiformats/js-multiformats/blob/dcfdac59df3570b85e633afae5ac8f6caf0a4441/src/cid.js#L312-L324
Arguably the utility of this isn't as great as it seems, but I'd probably have to remove that function, or make it throw, or something else in the case of a CIDv2. Its main use is in decodeFirst() (function defined just above) which is basically the same as: https://github.com/ipfs/go-cid/blob/802b45594e1aed5be3a5b99f00991e9fa8198bfa/cid.go#L691 - the use-case being - "here's a source of bytes I know starts with a CID, give me the CID and the remaining bytes". If there were a way to make it easier to do this initial-bytes-inspection then that'd be great, but it's not a blocker. e.g. if we must have a flag for these optional pieces, we could turn it into a "full length" varint and put it near the front; for common cases I think we'd still fit that in a single byte so it wouldn't be a massive waste. 🤷
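As a rough illustration of the initial-bytes inspection (not a port of the js-multiformats or go-cid code linked above), the following Rust sketch computes the total encoded length of a CIDv1 from its leading varints alone: version, codec, hash code, and digest size. The function names are invented for this example.

```rust
/// Decode an unsigned varint of at most 9 bytes; returns (value, bytes consumed).
fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in input.iter().enumerate().take(9) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Total encoded length of a CIDv1, derived only from its leading varints.
/// Returns None if the prefix is too short or the version is not 1.
fn cidv1_total_len(prefix: &[u8]) -> Option<usize> {
    let mut offset = 0;
    let mut fields = [0u64; 4]; // version, codec, hash code, digest size
    for field in fields.iter_mut() {
        let (value, used) = read_varint(prefix.get(offset..)?)?;
        *field = value;
        offset += used;
    }
    if fields[0] != 1 {
        return None;
    }
    // Everything after the prefix is the digest itself.
    Some(offset + fields[3] as usize)
}
```

For a CIDv2 in the trailing-metadata layout, the metadata digest size sits after the data digest, so the same trick needs either a second read at a computed offset or a length field near the front, which is the trade-off described above.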
The proposal is also designed to be purely opt-in and backwards compatible with existing implementations. That said, some work may be required to ensure that implementations that do not wish to support CIDv2 can either read a CIDv2 as if it were a CIDv1 (and discard the trailing metadata), or to error on the CIDv2 entirely.

### Alternatives
Another alternative is if instead of redefining CID we redefined what Link means in the IPLD Data Model.
From what I can tell CIDs are used in primarily two places:
- As the descriptions of objects that users and applications pass around (e.g. ipfs://<cid>)
- As the internal links inside of DAGs
Given that the object descriptions always have their own custom meaning anyway (e.g. ipfs:// currently is approximately equal to "try seeing if the data is UnixFS", ipfs block get <cid> assumes the data is an independent block, v1 of the remote pinning API assumes the CID to pin is the root of a graph, ...) adding metadata here is not particularly interesting.
Adding metadata inside of the DAG is interesting; however, changing the CID spec isn't necessary for this. You could also change what links mean in the IPLD Data Model and get the same result. Historically it appears that this was intentional, for example in https://github.com/ipld/ipld/blob/835d010583accf0dbec7f3ddbd4b6a66f86e2fa2/_legacy/specs/FOUNDATIONS.md#linked it's indicated that Links were intended to eventually allow for referring to data inside of blocks. Similar logic could extend to allowing for other kinds of type information there as well.
- Advantages:
  - No need to bump the CID version and so a lot of existing tooling can be left in place
  - Type data can be more than a single block without requiring an extra level of indirection
  - It's not necessary to define codes in the table for your types if you don't want to
- Disadvantages:
  - Shares disadvantages with the current proposal regarding the need to rework tooling.
IPLD already feels a little like a second-class citizen in a lot of IPFS implementations, and I worry that breaking the identity between CID and IPLD::Link would just exacerbate that
How can that both be true and this CIDv2 proposal be relevant? If you take the position that non-IPLD things are second class then what you're left with is basically UnixFS and then what are these tags going to do for UnixFS data? In order for the tags to be useful the IPLD tooling is going to need to expose it anyhow.
If you take the position that non-IPLD things are second class
That's not my position. I was observing that in e.g. the IPFS http api we have two parallel sets of calls for ipfs block and ipfs dag: https://docs.ipfs.tech/reference/kubo/cli/#ipfs-dag, with the latter being generally less well supported.
Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD
Changing the IPLD data model to make an IPLD::Link not a CID would probably result in a lot of implementations just not supporting IPLD
How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support exposing or working with that tag information? In your example how would you expect either of kubo's block or dag commands to change to benefit from CIDv2 without having IPLD tooling support?
I wouldn't really expect the kubo commands to change much; the additional information in a CIDv2 is primarily intended to be used at the application level.
How are these implementations benefiting from the tag information inside the CID if the IPLD tooling doesn't support it exposing or working with that tag information?
Specific IPLD libraries like rust-cid will support extracting/manipulating the tag information, and that should be enough for the specific use-cases of CIDv2 I'm aware of
meta_codec : 0x3e7ada7a,
meta_hash: <schema-multihash>
This looks great, but what happens if my schema starts becoming large? For example, say I have a 3MiB schema. Now this schema exceeds the 2MiB block limit imposed by many IPFS implementations and that 3MiB schema won't be transferable. Maybe a 3MiB schema seems excessive, but people may go down this road for other reasons (e.g. I want my tag to be wasm-module and my WASM code happens to be large).
I could start playing around with a few levels of workarounds here such as:
- Get a new code for unixfs-representation-of-schema
  - Sad because now I need to change my code to process schema and unixfs-representation-of-schema as schemas
  - Sad because I need a new code for every different system I use to encode my bytes (UnixFS, FBL, BitTorrent v1/v2, WNFS, etc.)
- Make a new type ipld-ADL-wrapper-dag-cbor that looks like { TypeData: <type-cid>, TypeADL: "unixfs" } encoded as dag-cbor, and add code so that when I encounter an ipld-ADL-wrapper-dag-cbor I recurse in a layer
  - Sad because it seems like we end up having to make our own type system anyhow despite the CIDv2 version bump
This seems to indicate that putting type information in the CID this way is going to be problematic because types may themselves have types and so we may want to deal with them the same way we deal with the data itself (e.g. allowing the metadata to be a CIDv2 as well, or one of the other proposals).
This looks great, but what happens if my schema starts becoming large?
It doesn't sound like this is a CIDv2 specific issue. I can store a 3MiB dag-cbor IPLD object on IPFS and generate a CIDv1 with its sha256 multihash, right? In terms of transport, I think CIDv2 will just behave like a pair of CIDv1s
It doesn't sound like this is a CIDv2 specific issue.
No, it is a specific issue with this CIDv2 proposal. Nowhere else do we use codec identifiers as data types, we use them as deserialization types. As a result there is no notion of a data type changing or becoming too big, that becomes an application layer concern. For example, an object can represent a UnixFS directory whether it is a single directory block or the root of a sharded HAMT.
By using the code as a nominative type rather than a description of how to deserialize the data you've navigated into a position where there's nowhere to identify both the type of the data and how to get it as a multiblock data structure without another level of indirection. However, that level of indirection could similarly be used instead of CIDv2 entirely (see alternative proposal).
Nowhere else do we use codec identifiers as data types, we use them as deserialization types.
Using a codec identifier as a datatype isn't an essential (or actually even intended) part of this proposal, so I'm absolutely happy to make any changes you suggest to better align with how multicodecs should be used.
For example, replacing the 0x3e7ada7a example codec for ipld-schema with dag-cbor, and in general specifying that the metadata codec should refer to the metadata format, would be totally fine for the intended use-cases
CidV2 {
  data_codec: 0x71,
  data_hash: <data_multihash>,
  meta_codec : 0x3e7ada7a,
Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json?
It seems quite bizarre that we'd need to define multiple codes for ipld-schema-<some ipld codec> for any codec we might want to use to encode a schema. Basically what's happened here is we've glued back together the structure of the data and the serialized form of the data when describing type information. While sometimes users might be fine with that I suspect other times they may not, just as is the case with regular data.
Having a single code for "IPLD schema" followed by a multihash seems off. IPLD schemas can be represented in multiple different formats including dag-json and dag-cbor. Is this a codec for ipld-schema-dag-json
There are a lot of things in multicodec for which this is also true (e.g. the ethereum codecs: https://github.com/multiformats/multicodec/blob/master/table.csv#L55) and my understanding of how it works there is that the format is just described in the description (such as https://ethereum.org/en/developers/docs/data-structures-and-encoding/rlp/, even though you could in principle encode any RLP data as dag-cbor if you wanted)
There are a lot of things in multicodec for which this is also true ... the ethereum codecs ... even though you could in principle encode any RLP data as dag-cbor if you wanted
IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for. The ethereum codecs, like the Git ones, are tied to a particular serialized data format; if you wanted to transcode the data into something like dag-cbor, tagging the data with the prior codec would result in a deserialization error.
Many existing hash linked data structures have more fixed representations than, say, the FBL ADL, which is defined over arbitrary serialized forms as long as they can be decoded into a compatible IPLD Data Model layout. As a result it can appear as though the codecs are types even though they're deserialization mechanisms.
This means really what you'd need to express the type correctly is a second code to say "this is dag-json" next to the code saying it was an ipld-schema.
IPLD codecs tell you how to decode serialized representations of data (into the IPLD data model), not necessarily what the data is or what it's for.
The key word is necessarily there. A multicodec can absolutely tell you what the data is for. For example, if you have a CIDv1 that points to an Ethereum Block you could equally choose to encode using
name | tag | code | description |
---|---|---|---|
rlp | serialization | 0x60 | recursive length prefix |
eth-block | ipld | 0x90 | Ethereum Header (RLP) |
Likewise we have a codec for all cbor and a more specific codec for dag-cbor. And ofc raw supersets everything.
So there's nothing strange if the IPLD Schema team wanted to set a default format
name | tag | code | description |
---|---|---|---|
ipld-schema | ipld | 0xdead_beef | Ipld Schema (dag-cbor) |
or with dag-json. Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision whether or not they'd want to ask for multiple multicodecs to do that
The key word is necessarily there. A multicodec can absolutely tell you what the data is for.
If doing nominative typing this way was reasonable then CIDv2 wouldn't be necessary in almost any application since you could just register every type as a different codec. IIUC this kind of thing would in theory work for the UCAN case as well, it's just that using the global code table for nominative typing like this seems bad. Applications can end up with many different named data types, sometimes 10s or 100s, or the many more that Lurk would require, each needing a code reserved in the table this way.
Some links around not using multicodecs for nominative types:
- https://github.com/ipld/ipld/blob/master/_legacy/specs/FOUNDATIONS.md#multicodecs-are-not-meant-to-act-as-types
- Qualifications for identification as a "codec" multiformats/multicodec#204
Yes we could have ipld-schema-dag-json, ipld-schema-dag-cbor to disambiguate, but that seems like it should be an application level decision
Sure, but how can I do a non-disambiguated ipld-schema that just works like IPLD Schemas do on any IPLD Data Model data? This code field has provided nominative typing, but without enough parameters to be useful for parameterized nominative types like IPLD Schemas (or dealing with multiblock data structures as in #305 (comment)).
I'm kind of glad this discussion is happening here, although I feel it might be a diversion from the main discussion, which is why it's probably good that we get this on the table now. This specific point is why I was hoping to have @vmx chime in. I worry that the Lurk specifics embedded in the doc here might be a distraction from the main goal. Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want. Specifically: meta_codec could be dag-cbor (0x71), and meta_hash could be inline (0x0) with whatever you like for your tag; you could even embed the mega-int here that the 9-byte varints are getting in the way of currently.
Perhaps that's essentially what you're aiming for through the use of a new "codec" to identify a "schema", just keeping it more efficient.
But my point again is that I think this is a distraction for the purpose of this spec. If Lurk wants to abuse the multicodec spec then that's their choice. It would be best for everyone if they want to register a new codec for this purpose to identify a "schema" in the multicodec table and we could continue this discussion there. For now, I think 0x3e7ada7a is in the way. It stands apart from the commonly understood purpose of a CID and, as this discussion is suggesting, there's a weirdness about it that leads us into a deep hole (the multicodec repo has many of these deep holes, covering very similar territory; I even had this discussion specifically about rlp and the eth codecs just a couple of months ago). We accept that there are squishy edges to the concept of a "multicodec", but always work to try and keep things toward the well understood and agreed-upon center where possible.
So my suggestion is to remove 0x3e7ada7a, shunt this distraction to the multicodec repo in due course, and go with something more commonly understood - maybe just an inline dag-json blob. Then we can at least start reasoning about the basics.
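For anyone following along, here is a rough Python sketch of what "dag-cbor codec plus an inline (identity) multihash" looks like at the byte level, using the existing CIDv1 layout (version, codec, multihash). The helper names `varint` and `inline_cid_v1` and the example tag `{"t": 42}` are made up for illustration; this is not the layout proposed in the IPIP, only the standard identity-CID trick it would build on.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint,
    the integer encoding used throughout multiformats."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


DAG_CBOR = 0x71  # multicodec code for dag-cbor
IDENTITY = 0x00  # multihash code for the identity "hash"


def inline_cid_v1(codec: int, payload: bytes) -> bytes:
    """Build binary CIDv1 bytes whose multihash carries `payload` verbatim."""
    multihash = varint(IDENTITY) + varint(len(payload)) + payload
    return varint(1) + varint(codec) + multihash


# Hypothetical metadata tag: the dag-cbor encoding of {"t": 42}.
tag = bytes([0xA1, 0x61, 0x74, 0x18, 0x2A])
print(inline_cid_v1(DAG_CBOR, tag).hex())  # 01710005a16174182a
```

The payload is not hashed at all, which is why identity CIDs can carry arbitrary bytes and also why their size is unbounded unless a library imposes a limit.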
But my point again is that I think this is a distraction for the purpose of this spec.
I agree. I see this just as an example of how people might want to use it. I think the purpose of the proposal should be about whether we want those CIDs with two pointers to provide additional context or not.
In regards to Lurk, I also don't think the 0x3e7ada7a codec is needed. It could just be e.g. DAG-CBOR and you could encode your schema as such.
Even after reading all of this I don't really understand why, with the second CID-ish for metadata Lurk just couldn't encode a dag-cbor, dag-json, or even raw custom format bytes with the tag they want.
Just to clarify, we absolutely can, and this is part of the intent of having the second CID. The 0x3e7ada7a was only meant as an illustrative example for ipld-schemas, but the rest of the proposal is the same if we just replaced it in the doc with 0x71 for dag-cbor, as @vmx suggested.
Regarding the prior topic about nominative types, I don't think either Lurk or Yatima need or want to add typing to multicodec, really. To be super concrete about what we need: For Lurk, we want to add a 16-bit metadata field to our CIDs, and for Yatima we want a 256-bit metadata multihash. In terms of multicodecs, I don't think it matters that much to either of those use cases whether we get a single application codec, multiple application codecs, or no codecs (and we just use e.g. dag-cbor). As long as we have a flexible way to add metadata, if we do end up needing additional info in pointers, we can just put it in that variable metadata field (e.g. with the identity multihash).
Just as Rust doesn't force every pointer to be to a trait object (i.e. be double-wide with a vtable), I don't think this is good for CIDv2. We could have a "fat CID" spec in addition, but I don't think it should replace today's CIDs. It is simply a different use-case. It is also possible to go in the other direction (e.g. the WASM decoder that @aschmahmann mentions) and, for different purposes, use a pair of a raw multibase and a decoding function, but that only makes sense once we have a spec for what the decoder looks like. In that latter scheme I would want to be careful that we aren't just choosing between an enumerated set of formats but handling arbitrary data. E.g. the definition of "structured block" (referred to by such a CID) equality would be something like (f, b) == (g, c) = f b == g c, i.e. even if two decoders are different, if the result of running them on the raw block (referred to by multibase) is the same, then the referred-to structured blocks are also the same.
This is neat stuff that makes IPFS more inherently pluggable than enumerated with a fixed set of policies (multicodec), but I don't get the sense that that is what this proposal is going for. |
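The equality rule for "structured blocks" sketched above, (f, b) == (g, c) = f b == g c, can be written out in a few lines of Python. The two decoders here (`decode_json` and `decode_csv_pair`) are toy stand-ins invented for this sketch; no decoder spec exists yet, so treat this purely as an illustration of the definition.

```python
import json
from typing import Any, Callable, Tuple

Decoder = Callable[[bytes], Any]
StructuredBlock = Tuple[Decoder, bytes]  # (decoding function, raw block)


def same_structured_block(x: StructuredBlock, y: StructuredBlock) -> bool:
    """(f, b) == (g, c) holds exactly when f(b) == g(c)."""
    (f, b), (g, c) = x, y
    return f(b) == g(c)


def decode_json(raw: bytes) -> Any:
    """Toy decoder: parse the block as JSON."""
    return json.loads(raw)


def decode_csv_pair(raw: bytes) -> Any:
    """Toy decoder: parse a single 'key,int' line into a one-entry dict."""
    key, value = raw.decode().split(",")
    return {key: int(value)}


# Different decoders and different raw bytes, same decoded value.
print(same_structured_block((decode_json, b'{"a": 1}'),
                            (decode_csv_pair, b"a,1")))  # True
```

Note that equality is decided on the decoded result, so it can only be checked by actually running both decoders on their blocks.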
Just wanted to chime in that we're still pretty excited about this happening from the IPLD team's side. Particularly this could be useful for signaling ADLs and schemas and would be a good fit with IPLD URLs for signaling extra parameters beside CIDs. So far I think we're most excited about extensions that don't require extending any multiformats tables. |
@RangerMauve Awesome, I'm happy to make changes to my tentative draft here based on what you guys think will be most useful. It seems like there are a lot of different ways to slice the cake here, and I imagine that you guys have the best perspective on what the constraints are across the ecosystem. From the Lurk and Yatima side we mainly just want a flexible/extensible way to add metadata to CIDs that IPFS implementations still know how to resolve, and the proposal by @mikeal for CIDv2 as a pair of CIDv1's seemed essentially in the same vein. Let me know what you guys on the IPLD team think next steps should be! |
@johnchandlerburnham I'm personally into the two CIDs option myself. It'd be useful if Mikeal's team could comment on whether this works for them so we can progress further. 😁 For some of the IPLD URL use cases I was imagining it'd be useful to have the option of having arbitrary key-value pairs, but I think that's covered by having the second CID inline its data. |
My current thoughts on this: what we seem to be wanting boils down to essentially a combination of a link and a place to store some arbitrary properties. Those properties are likely going to be some form of namespaced value set of one or more things. Which is basically a set of key/value pairs (where perhaps the keys can come in the form of a multicodec table code, but that's still just a key, but with decent uniqueness properties). By using a second CID I suspect we're mostly going to be using it as a place to identity encode arbitrary data, but I'd bet that'd end up looking mostly like a key/value set of one or more things. So, is it possible to make a case to skip the complexity and extra bytes in needing to have it all encoded into a CID and just jump straight to encoding a key/value list? e.g. a potential scheme could be:
Where:
By skipping the second CID as an identity encoded block of data, and explicitly saying that we have a key/value set, we would be more likely to have an emergent set of common keys that systems could look for and describe. Identity CIDs can be any codec and the encoded data could take any form so mostly you'd have a link and an untyped object. One argument the other way is at least the identity CID can [potentially] tell you how to decode all parts of the bytes, a key/value set leaves you with values that aren't decoded unless you know what to do with the key. Thoughts? |
Separate topic for discussion: limits. We have to launch this thing with some kind of bounds. Not having this for identity CIDs continues to be a pain for so many parts of our stack (ref). My proposal is to come up with a basic byte limit, code it as a constant in our core CID handling libraries to be used wherever CIDs are decoded and error if the limit is exceeded. But, document the limit as changeable, i.e. if you can make a good case for increasing it for your use-case then you need to start a discussion. We could come up with some squishy language in the spec about this too and likely we'd want to have affordances for users to run CID libraries with their own custom limits too. We haven't solidified a format yet and we also haven't heard enough about enough use-cases to make informed decisions about such things, but just to get noggin's joggin' my starting bid would be in the order of 2048 bytes, a number I've started imposing in some places for identity CIDs. |
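A minimal sketch of what such a limit could look like in a CID library, in Python. The names `MAX_CID_BYTES`, `CidTooLargeError` and `check_cid_size` are hypothetical, and 2048 is only the "starting bid" from the comment above, not an agreed value.

```python
MAX_CID_BYTES = 2048  # assumed default; the thread treats this as negotiable


class CidTooLargeError(ValueError):
    """Raised when encoded CID bytes exceed the configured limit."""


def check_cid_size(raw: bytes, limit: int = MAX_CID_BYTES) -> bytes:
    """Return `raw` unchanged if it is within `limit` bytes, else raise.

    Intended to run before any further CID decoding; `limit` can be
    overridden by callers who need a custom bound.
    """
    if len(raw) > limit:
        raise CidTooLargeError(
            f"CID is {len(raw)} bytes, limit is {limit} bytes"
        )
    return raw


check_cid_size(b"\x00" * 2048)                # passes with the default limit
check_cid_size(b"\x00" * 4096, limit=8192)    # passes with a custom limit
```

Keeping the limit as a default argument rather than a hard-coded check is one way to give users the "custom limits" affordance mentioned above.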
In the case of the key:value idea, multiformats/multicodec#4 likely has relevance to this since you could encode mime types into CIDs potentially |
@rvagg One question about this layout:
How would this work when expanding out the
In which case, perhaps the |
Alternatively, since making this "CIDv2" is complicated by the fact that it can contain CIDv0 and CIDv1 within itself, what about this?
The idea is to register a "metadata" multicodec which will cause legacy parsers to fail there, and then move standard multicodecs until after the multihash followed by the keypairs (in the case of CIDv0, it'll just omit the standard multicodecs and go straight to the keypairs). I'll openly admit this may be a terrible, terrible idea, but, I thought I'd put the idea out |
Something I'm noticing from all the CIDv2 proposals so far is, what does CIDv2 actually need to do?
It seems that pretty much everything proposed answers some or all of these requirements, but, clearly there are other requirements which make some solutions better than others which aren't being formally declared. |
For the Yatima use-case, we have an intermediate representation for the Lean Theorem Prover (https://github.com/leanprover/lean4) where we separately content-address computationally relevant information from non-computational metadata (similar in spirit to Unison-Lang https://www.unison-lang.org/learn/the-big-idea/). A Yatima identifier for a declaration thus has two hashes, one for the anonymized program, one for the metadata (like variable names), such that whenever two identifiers share the same anon-id, they represent the same program. This CIDv2 proposal will allow us to make Yatima identifiers isomorphic to CIDs directly, without a layer of dereferencing or abusing the identity multihash. But we don't precisely need pairs of CIDs to make that work; pairs of multihashes with a single multicodec would be fine there. For the Lurk use case, we have at least 16 bits of metadata that need to be included with the Poseidon digest that hashes some Lurk expression. This requirement is implied by certain subtleties of how Lurk tries to minimize the number of constraints in its zero-knowledge circuit backend. For the Lurk use-case, we don't really need pairs of CIDs either; another option would be to allow users to reserve ranges in the multicodec table (and probably remove the current arbitrary 64-bit size limit on multicodec sizes). I don't have a deep understanding of the DAG House use-case, but from what I infer from https://github.com/multiformats/cid/pull/49/files, multihash pairs might work for them. The meta-level observation unifying these use-cases is that multicodec as it's currently structured (a single
My gut reaction to this is that it's adding a lot of epicycles to what a CID is. If CIDs can carry whole maps around, they could get big, so then we have to limit their size, which then might cause users to do strange compression things to try to get their desired metadata into a size-limited map. It seems kinda messy. That said, if key-value pairs can embed multihash values, then it should be workable for what we want to do, since it then supersets the expressiveness of my original CIDv2 as a 4-tuple proposal (which is also an epicycle, tbf). But it feels like there might be some more general/more elegant thing that solves this problem, either in the vein of changing how multicodec works, by changing the IPLD data model to include metadata links (as @aschmahmann suggested earlier), or maybe with some new idea from the WASM encoder effort. |
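To illustrate the Yatima-style identifier described above (one hash for the anonymized program, one for the metadata), here is a small Python sketch. `DeclId`, `make_id` and `same_program` are hypothetical names, and SHA-256 is used only as a stand-in hash; it is not a claim about what Yatima actually uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclId:
    """Identifier for a declaration: two independent content hashes."""
    anon: bytes  # hash of the anonymized (computationally relevant) program
    meta: bytes  # hash of the non-computational metadata, e.g. variable names


def make_id(anon_program: bytes, metadata: bytes) -> DeclId:
    """Hash the program and its metadata separately."""
    return DeclId(
        anon=hashlib.sha256(anon_program).digest(),
        meta=hashlib.sha256(metadata).digest(),
    )


def same_program(a: DeclId, b: DeclId) -> bool:
    """Two identifiers denote the same program iff their anon hashes match."""
    return a.anon == b.anon


a = make_id(b"lam 0", b"names: x")
b = make_id(b"lam 0", b"names: y")
print(same_program(a, b))  # True: same program, different variable names
print(a == b)              # False: the full identifiers still differ
```

A pair of multihashes under a single codec would be enough to carry such an identifier, which is the point made above about not strictly needing a pair of full CIDs.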
Also, one thing I've noticed is that the way CIDv2's would nest is really weird. Suppose I have some data
But then if I include
So actually in order to get something that behaves like a metadata CID, it seems like we need two parallel trees, one for data, one for metadata. So for example, if you have some IPLD data
you could have a separate expression
which has the same tree shape, but holds the metadata links which correspond to the data links. This way at every point in constructing nested CIDv2's, the metadata would never "bleed" into the data hashes. |
This comment, and your whole follow-up comment actually get to one of my biggest concerns about what we're doing. Right now it is possible for a CID to carry a whole map around. Identity CIDs open a Pandora's box that causes us all sorts of headaches and the initial proposal for So my suggestion was an attempt to open that up a little bit and explore the implications of just acknowledging the fact that people are going to want a bucket to store miscellaneous pieces of metadata and try and embed a basic schema of that metadata into the spec for a CIDv2. Because, if we say it's a list of "key=value" pairs in the same way as a URL querystring is, then at least there's the possibility of looking over them for some common keys that maybe you can do something with. If we used multicodec table entries for keys, then we could have things like Specifically to your second post, I'm not really sure there's much we can do about this other than:
|
Separate topic, again: there was a suggestion in a meeting today that we allocate some time at IPFS Camp in Lisbon to have an in-person round-table about this and see if we can hash (har har) out a way forward. Getting Yatima, Lurk and DAG House in the same place would be good; finding other potential users would be good too - folks like @RangerMauve who are already thinking about ways to make use of this in conjunction with IPLD URLs and gateways. |
For IPLD URLs, I'd like to be able to add some extra CIDs pointing at "additional" information to interpret a CID as, e.g. an IPLD Schema to interpret the data as during traversal, a WASM blob CID for IPVM / autocodec use cases, etc. I like the idea of having a table of predefined fields, and maybe having an escape for metadata that's super application specific and doesn't need to be standardized (maybe within an identity CID or a CID to the metadata). Ideally I think the table should be used for fields that we expect different applications to reuse between each other. 😅 Generally, for cases where there's some metadata that you expect will be reused across a lot of nodes, a CID pointing at the object seems efficient enough since it'll likely be cached locally already. For example, in the IPLD Prolly Tree spec that some folks and I are working on, we're planning on linking to the config in root/leaf nodes so that we can quickly tell if two trees are using the same options. If you're doing very application specific data that isn't going to be standardized for use outside of the application, could it not be stored in the data without having to add it to the CID? I suppose having it in the CID could get rid of an extra step of indirection, but IMO you could just as easily reference an object that has the metadata + cid instead of just the CID and have your application know to handle it correctly. |
Tagging @hannahhoward to discuss the potential of being able to include size in a CID. |
2022-11-15 IPLD triage conversation: I understand @vmx that you created a related doc here. Is that right? If so, can you please link it? |
Currently it's just rough notes at https://hackmd.io/@vmx/HkoYAr64o. I tried to capture all the conversations I had at LabWeek. The most promising proposal is the "application context" one. That is the one I want to spend some time on, to write it up a bit better. |
2023-01-03 IPLD triage conversation: we're looking for some action items to move this forward or close it out. I've put it down for the next IPLD community call (https://hackmd.io/PjKSfch8QNOY4uNrnrRbDA?edit ) |
In this IPLD sync call it was decided to close this PR. The original author of this IPIP is no longer convinced that this is the right way to do it (I talked with him in person). Also, talking with many other folks, we were in agreement that there shouldn't be a CIDv2, but something else. I wrote down a draft of a proposal named "Application Context" at https://hackmd.io/@vmx/SygxnMmso, which is based on the discussions I had with folks. I still have one more idea I need to write down that was floating around; once done, I'll link it from the hackmd mentioned above. |
This adds a spec for a CIDv2 proposal, originally discussed here: multiformats/cid#49
Included are corresponding changes to the CID specification repository (https://github.com/yatima-inc/cid/blob/master/README.md) and a preliminary draft implementation on rust-cid (https://github.com/yatima-inc/rust-cid/tree/cid-v2). I am happy to send PRs to https://github.com/multiformats/cid and https://github.com/multiformats/rust-cid respectively, but have not done so yet in the interest of centralizing discussion.