Implement Provider Store Module #3

iand · 2023-06-26T14:40:42Z

ETA: 2023-07-31

There are currently no implementations of the Provider Store module.

Note that the same kind of mechanism can be used to store both IPNS records and Provider Records.

In a DHT Server node, the role of the Provider Store is to store Provider Records/IPNS records that have been assigned to self, and to serve them to whoever requests them. It is expected that not all provider records fit into memory. Usually provider records are republished every 22 hours, and some provider records are accessed much more than other (some are never requested).

The store should only record the data it can parse and not the full Protobuf record.

When storing data on disk, we must be careful not to block on IO. When doing IO, a new thread should be created, and once it is done doing IO, it should die and the operation should be resumed by the single worker.

Ideally the provider store would contain a (single/multi-layer?) cache. For instance it can be a LRU cache with L1 stored in memory so we don't have to do IO for popular Provider Records/IPNS keys.

iand · 2023-06-30T14:06:52Z

go-kademlia should define the interfaces it needs to support record storage and allow applications to pass an appropriate implementation.

iand · 2023-07-04T14:27:14Z

go-kademlia needs to provide record put/get functionality but the specific record types (provider/ipns) belong in go-libp2p-kad-dht as per the discussion #34 (comment)

dennis-tra · 2023-07-06T15:20:36Z

Some observations:

the current provider manager is single-threaded as well
Are go-datastore implementations thread-safe?
I think IPNS records would belong to go-libp2p-kad-dht but provider records are always required
how do we handle the dependency on the peer store? that's a libp2p package.

iand · 2023-07-06T16:28:41Z

how do we handle the dependency on the peer store? that's a libp2p package.

Let's create a peer interface in go-kademlia and work in terms of that. KadDHT can then supply a libp2p peer

dennis-tra · 2023-07-07T14:35:11Z

I'm still trying to figure out which datastore abstractions make sense for our current goals. I think go-kademlia should be record type agnostic. It shouldn't be concerned about if it's a provider/peer/IPNS record. It should just verify that it's a valid record.

To further understand this, I went through the different PUT/GET operations of the current DHT implementation and which messages carry which record. Further, I want to know which messages we want to support in the go-kademlia vs. go-libp2p-kad-dht vs. boxo repositories.

go-libp2p-kad-dht defines:

PUT_VALUE - For storing IPNS records (also peer records?)
GET_VALUE - For extracting these records + closer nodes
ADD_PROVIDER - To store provider records (basically a specialized PUT_VALUE)
GET_PROVIDERS - To look up provider records + closer nodes (basically a specialized GET_VALUE)
FIND_NODE - To look up closer nodes only
PING (deprecated)

I briefly summarized what's happening for each of the messages to be able to compare it with https://github.com/libp2p/specs/tree/c733210b3a6c042d01f6b39f23c0c9a3a20d3e88/kad-dht. Feel free to skip this.

`PUT_VALUE`

Extract record -> yields a go-libp2p-record
Check if PUT key equals record key
Sanitise record to avoid storing arbitrary data
Validate record
Derive Datastore key
Acquire striped lock
Lookup existing record from datastore
Let the validator decide if existing record is "better"
Assign "received at" timestamp
marshal record to proto
Store record in datastore

`GET_VALUE`

Verify if there's a key to lookup the record for
Derive datastore key
Load data from datastore
Unmarshal record from datastore
Verify record age
If record is too old -> delete record from datastore
Check for better peers to query in routing table
Filter out self+requester from closer peers
Check peerstore for addresses of closer peers
Craft response message (assign record + closer peers)

`ADD_PROVIDER`

This is actually a PUT_VALUE operation.

Verify key in message
Extract "Provider Peers" from message
For every provider peer:
1. Add addresses to peer store
2. Check LRU cache and replace entry if it exists (refresh TTL)
3. Derive provider key
4. store record in datastore

`FIND_NODE`

This is actually a GET_VALUE operation:

Prepare response message with "cluster level" from received message (?)
Check for better peers to query in routing table
Filter out self+requester from closer peers
Add peers to response that we have addresses for

`GET_PROVIDERS`

Verify key in message
Lookup providers in provider store
Check LRU cache for provider set
Derive datastore key
Query data store for data
Verify record age
If record is too old -> delete record from datastore
Extract peerID from record key
Construct peer set
enrich peer set with peer addresses

The original Kademlia paper defines:

STORE
FIND_NODE
FIND_VALUE
PING

As I see it, this translates to:

STORE -> ADD_PROVIDER, PUT_VALUE
FIND_NODE -> FIND_NODE
FIND_VALUE -> GET_PROVIDERS, GET_VALUE
PING -> PING

Now the question: Should we define any message types in go-kademlia at all or solely operate on interfaces? The messages that we somehow should provide some logic for are the ones that the original Kademlia paper defines IMO. This includes PING. PING is deprecated in the libp2p Kademlia spec, but I think it should be part of go-kademlia because we plan to make it independent on libp2p and it’s part of the original spec.

If we don't want to force users of go-kademlia into a message format (e.g., the protobuf format that we're currently using) which I would opt for, then go-kademlia would need to hand control to the user on every message to, e.g., verify it's validity. The user then decides if it's a valid message, and so on.

Alternatively, we could implement a way to register message formats for STORE, FIND_NODE, FIND_VALUE operations + a way to determine the message type if we receive a request. go-kademlia could require at least one message format to be registered for any of the above types.

Different message formats would need to implement different things depending on their type:

STORE:

Generate a datastore key
Validate an incoming message
Extract data to store from the message

FIND_NODE:

Expose the target key (to calculate closer peers)
Validate an incoming message
Expose a way to set closer peers to the message format

FIND_VALUE:

Expose the target key (to calculate closer peers)
Generate a datastore key
Validate an incoming message
Expose a way to set closer peers to the message format

For example, go-libp2p-kad-dht would register GET_VALUE and GET_PROVIDERS for the FIND_VALUE operation. If a request comes in, go-kademlia would determine which type of message that is (based on a function or an interface that all messages must implement) and then would 1) verify the message, 2) generate the datastore key, 3) lookup the data in the local datastore 4) determine closer peers 5) craft a response message.

Each message type could optionally bring its own datastore implementation. If it doesn't bring one, a default one is used. This means, the GET_PROVIDERS data store logic could be slightly different from the GET_VALUE logic.

Now that I think about it, message formats are actually strongly coupled to record types. So a provider record is associated to GET_PROVIDERS ADD_PROVIDER. An IPNS/peer record is associated to GET_VALUE/PUT_VALUE. So users could register record types with go-kademlia. These record types could then define the message formats that are used to store and retrieve them.

iand · 2023-07-07T15:03:20Z

I haven't fully formed my thoughts yet but wanted to respond with a few notes before the weekend.

There is already the concept of a Message interface in https://github.com/plprobelab/go-kademlia/tree/main/network/message. I feel the protobuf specific messages will move to go-libp2p-kad-dht

The messages are associated with a protocol ID: https://github.com/plprobelab/go-kademlia/blob/main/network/address/address.go#L24

The Endpoint interface is the part that ties these together (https://github.com/plprobelab/go-kademlia/blob/main/network/endpoint/endpoint.go#L22) and I think the only essential method there is SendRequestHandleResponse.

A query sends a Message to an Endpoint using a Protocol, so I think the abstractions are there. But I think the specific implementations of the IPFS DHT messages, endpoints and protocols belong in go-libp2p-kad-dht. go-kademlia concerns itself wiih routing the message to the expected destination(s) and handling responses according the to the protocol and the query's handler (I am hand waving here, I don't know if it is doing it at the moment)

yiannisbot · 2023-07-11T07:48:34Z

One thing to clarify here is whether we need the PING functionality. From my perspective it would be useful to have, but I don't have context on why it was deprecated before. Was it to avoid overload, or abuse of some sort? Or was it just not needed? Maybe @aschmahmann can shed some light?

dennis-tra · 2023-07-11T07:56:55Z

Can't recall the exact reasons but here are some previous discussions @yiannisbot:

guillaumemichel · 2023-08-11T10:21:55Z

Great summary @dennis-tra !

I think that the datastore and PUT_VALUE GET_VALUE ADD_PROVIDER GET_PROVIDERS RPCs could be implemented in go-libp2p-kad-dht directly (and not go-kademlia), because these RPCs are specific to IPFS/libp2p. I see the go-kademlia (to be renamed) repo more as a tool to build DHTs defining their own RPCs, but go-kademlia wouldn't contain RPCs that aren't required for the purpose of routing (the only RPC that seems necessary is FIND_NODE).

Each message type could optionally bring its own datastore implementation. If it doesn't bring one, a default one is used.

I really like this principle. As long as the DHT implementation follows the specs of the network it is participating in, it can use any datastore implementation. It means that in the go-libp2p-kad-dht implementation we could have the default data store, but it could be possible to replace it with another one. And it would be possible to have different data stores for Provider and IPNS records.

There is already the concept of a Message interface in https://github.com/plprobelab/go-kademlia/tree/main/network/message. I feel the protobuf specific messages will move to go-libp2p-kad-dht

👍

The messages are associated with a protocol ID: https://github.com/plprobelab/go-kademlia/blob/main/network/address/address.go#L24

The Protocol ID is only required by libp2p. We can probably remove it from go-kademlia (or make it a parameter of the libp2p module, if we keep it).

go-kademlia concerns itself wiih routing the message to the expected destination(s) and handling responses according the to the protocol and the query's handler (I am hand waving here, I don't know if it is doing it at the moment)

I agree that go-kademlia concerns itself with routing the messages to the expected destinations. However handling responses should be done by the DHT implementation (e.g go-libp2p-kad-dht), because this behavior depends on the protocol. go-kademlia could have a generic server interface, but the specific server implementations (such as basicserver) should move away.

One thing to clarify here is whether we need the PING functionality.

I am not sure what a specific DHT PING RPC would achieve. If we need to test the connectivity to a host, we can simply use libp2p or ICMP PING. And if we want to check whether a node can actually respond to a DHT request, we can simply send a FIND_NODE request to this node. But maybe there was another use to this RPC that I missed?

dennis-tra · 2023-08-14T07:54:49Z

I think that the datastore and PUT_VALUE GET_VALUE ADD_PROVIDER GET_PROVIDERS RPCs could be implemented in go-libp2p-kad-dht directly (and not go-kademlia), because these RPCs are specific to IPFS/libp2p. I see the go-kademlia (to be renamed) repo more as a tool to build DHTs defining their own RPCs, but go-kademlia wouldn't contain RPCs that aren't required for the purpose of routing (the only RPC that seems necessary is FIND_NODE).

👍 👍 👍

The Protocol ID is only required by libp2p. We can probably remove it from go-kademlia (or make it a parameter of the libp2p module, if we keep it).

👍

I agree that go-kademlia concerns itself with routing the messages to the expected destinations. However handling responses should be done by the DHT implementation (e.g go-libp2p-kad-dht), because this behavior depends on the protocol. go-kademlia could have a generic server interface, but the specific server implementations (such as basicserver) should move away.

This is something I came across last week. I was wondering why go-kademlia would need to define a server interface at all? Do we want to enqueue requests to the event loop as well? Looking at the handler functions in go-libp2p-kad-dht this seems like an unnecessary indirection (as long as the routing table + data store are thread-safe).

I am not sure what a specific DHT PING RPC would achieve. If we need to test the connectivity to a host, we can simply use libp2p or ICMP PING. And if we want to check whether a node can actually respond to a DHT request, we can simply send a FIND_NODE request to this node. But maybe there was another use to this RPC that I missed?

ICMP is used to test if a host is reachable not necessarily if the process you're looking for is online. The libp2p ping wouldn't count against any resource manager kad/dht stream limits (which could be a good thing?). Generally, I think a FIND_NODE request is a valid way to test if a remote peer speaks the protocol. A FIND_NODE request uses few more CPU cycles (the remote peer needs to determine the closest peers it knows to a specific key) than a plain PING request 🤷

guillaumemichel · 2023-08-16T07:23:45Z

ICMP is used to test if a host is reachable not necessarily if the process you're looking for is online. The libp2p ping wouldn't count against any resource manager kad/dht stream limits (which could be a good thing?). Generally, I think a FIND_NODE request is a valid way to test if a remote peer speaks the protocol. A FIND_NODE request uses few more CPU cycles (the remote peer needs to determine the closest peers it knows to a specific key) than a plain PING request 🤷

Agreed that a dedicated DHT PING RPC would make sense, because it would use less CPU cycles, and could integrate with a resource manager and be used for back pressure e.g if a node is very busy, answering the PING isn't its first priority so the PING response will be delayed, and the remote peer that has sent the PING request may drop the node from its routing table.

The only benefit of using a FIND_NODE RPC to perform a PING is that it prevents misconfigured nodes to be added to other peer's routing tables. If a node correctly implements the PING response, but not the FIND_NODE response, it will propagate in other peers' routing tables even though it is useless to the network.

Overall, I think we should try to optimize for performance (CPU cycles), so we could reintroduce the DHT PING request.

PS: reading through the commit histories, I couldn't figure out a strong reason why PING was deprecated.

iand added the kind/enhancement New feature or request label Jun 26, 2023

guillaumemichel changed the title ~~Optimize Provider Store Module~~ Implement Provider Store Module Jun 27, 2023

guillaumemichel added the scope/required Feature required to match go-libp2p-kad-dht label Jun 27, 2023

dennis-tra added this to DHT Optimization Jul 6, 2023

dennis-tra modified the milestone: Plumb Refactored DHT into Kubo Jul 6, 2023

dennis-tra self-assigned this Jul 6, 2023

guillaumemichel mentioned this issue Jul 6, 2023

Document all features #54

Open

19 tasks

iand mentioned this issue Oct 2, 2023

Minimally Functional Modular DHT probe-lab/zikade#17

Open

iand moved this to Planned in DHT Optimization Jul 11, 2023

iand moved this from Planned to In Progress in DHT Optimization Jul 11, 2023

iand mentioned this issue Aug 11, 2023

Protocol vs Endpoint #89

Open

iand moved this from In Progress to Done in DHT Optimization Sep 19, 2023

iand closed this as completed Oct 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Provider Store Module #3

Implement Provider Store Module #3

iand commented Jun 26, 2023 •

edited

Loading

iand commented Jun 30, 2023

iand commented Jul 4, 2023

dennis-tra commented Jul 6, 2023

iand commented Jul 6, 2023 •

edited

Loading

dennis-tra commented Jul 7, 2023

iand commented Jul 7, 2023

yiannisbot commented Jul 11, 2023

dennis-tra commented Jul 11, 2023 •

edited

Loading

guillaumemichel commented Aug 11, 2023

dennis-tra commented Aug 14, 2023

guillaumemichel commented Aug 16, 2023

Implement Provider Store Module #3

Implement Provider Store Module #3

Comments

iand commented Jun 26, 2023 • edited Loading

iand commented Jun 30, 2023

iand commented Jul 4, 2023

dennis-tra commented Jul 6, 2023

iand commented Jul 6, 2023 • edited Loading

dennis-tra commented Jul 7, 2023

PUT_VALUE

GET_VALUE

ADD_PROVIDER

FIND_NODE

GET_PROVIDERS

iand commented Jul 7, 2023

yiannisbot commented Jul 11, 2023

dennis-tra commented Jul 11, 2023 • edited Loading

guillaumemichel commented Aug 11, 2023

dennis-tra commented Aug 14, 2023

guillaumemichel commented Aug 16, 2023

iand commented Jun 26, 2023 •

edited

Loading

iand commented Jul 6, 2023 •

edited

Loading

`PUT_VALUE`

`GET_VALUE`

`ADD_PROVIDER`

`FIND_NODE`

`GET_PROVIDERS`

dennis-tra commented Jul 11, 2023 •

edited

Loading