
POSIX-compliant read on the Mirage_flow.S interface #46

Open · dinosaure opened this issue Nov 30, 2020 · 9 comments

@dinosaure (Member) commented Nov 30, 2020

This issue is a long one: it tries to explain the whole situation in the current MirageOS ecosystem.

Currently, Mirage_flow.S has a strange definition of the read function, one which does not match the POSIX definition:

val read: flow -> (Cstruct.t or_eof, error) result Lwt.t
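
To make the shape of this interface concrete, here is a minimal consumer sketch of mine (the Drain functor is illustrative, not an existing API): the caller receives whatever chunk the implementation decided to hand out.

module Drain (F : Mirage_flow.S) = struct
  open Lwt.Infix

  (* Read until end-of-flow, handing each chunk to [f]. Note that the
     caller cannot ask for "at most N bytes": the chunk size is chosen
     by the implementation. *)
  let rec iter f flow =
    F.read flow >>= function
    | Ok (`Data buf) -> f buf >>= fun () -> iter f flow
    | Ok `Eof -> Lwt.return (Ok ())
    | Error e -> Lwt.return (Error e)
end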

Indeed, several questions come to mind for a non-seasoned user:

  • Where does the Cstruct.t come from?
  • Who allocates it?
  • How many bytes are available?
  • Can I write into it?

The documentation helps a bit, but it does not give a strict and complete definition of the expected behaviours; in the end it leaves the underlying implementation to fill in the gaps - if the maintainer is aware of what should, at least, be respected.

This definition has caused me a lot of trouble when:

  • A pre-existing project requires a POSIX read (such as Lwt_io.make, colombe, conduit.3.0.0 or gluten)
  • A protocol implementation provides a POSIX read (like tls.lwt or ssl)
  • I want to compose protocols (such as TCP/IP and TLS)

Given all of this, it is hard to write the glue from a protocol to an implementation (Mirage_flow.S to something) or from a protocol to our interface (something to Mirage_flow.S). And the solution is not a good one; it requires (see the sketch after this list):

  • a ring-buffer (Mirage_flow.S to something)
  • a linger buffer (something to Mirage_flow.S)
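
To make the first direction concrete, here is a hedged sketch of mine (Posix_of_flow and its leftover field are illustrative, not an existing API) deriving a POSIX-style read from Mirage_flow.S - note the unavoidable blit:

module Posix_of_flow (F : Mirage_flow.S) = struct
  open Lwt.Infix

  type t = { flow : F.flow; mutable leftover : Cstruct.t }

  let create flow = { flow; leftover = Cstruct.empty }

  (* [read t dst] behaves like Unix.read: it fills at most
     [Cstruct.length dst] bytes of the caller's buffer and returns how
     many bytes were written, 0 meaning end-of-flow. *)
  let read t dst =
    (if Cstruct.length t.leftover > 0
     then Lwt.return (Ok (`Data t.leftover))
     else F.read t.flow)
    >|= function
    | Ok (`Data buf) ->
      let len = min (Cstruct.length buf) (Cstruct.length dst) in
      Cstruct.blit buf 0 dst 0 len;         (* the copy this issue laments *)
      t.leftover <- Cstruct.shift buf len;  (* keep what did not fit *)
      Ok len
    | Ok `Eof -> Ok 0
    | Error e -> Error e
end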

Of course, this solution implies a performance regression because we make copies in every case. The problem becomes harder when it comes to protocol composition, where we must agree on one unique interface to be able to provide:

val compose : underlayer:(module Mirage_flow.S) -> (module Mirage_flow.S)

So, if we read between the lines, it seems that we have a systematic and error-prone layer between the MirageOS ecosystem and the rest of the OCaml ecosystem - mostly because read : flow -> (Cstruct.t or_eof, [> error ]) result Lwt.t is specific to one project: mirage-tcpip.

From what I know, Mirage_flow.S.read wants to provide a zero-copy read function. I don't know whether that is true, but even if this function can improve performance for some specific applications (and I stress specific!), the choice becomes a pain wherever the zero-copy assumption does not hold for other protocols/flows (at least TLS makes a copy in every case). So it seems strange to me to impose an assumption made for one protocol onto all the others, and to accept, in doing so, a split between the MirageOS ecosystem and the UNIX ecosystem (which should live together - because of the UNIX target).

The Linux kernel seems to provide a zero-copy interface to its TCP/IP stack too - but that part is also specific, and it should not drive the design of protocols in the first instance; it can, however, be an open door to improving performance later.

A possible solution is a major release of Mirage_flow.S with an update of read to make it POSIX compliant:

val read : flow -> Cstruct.t -> ([ `Input of int | `End_of_flow ], error) result Lwt.t
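
As an illustration of how close this is to the UNIX world, here is a sketch of mine of this signature implemented over Lwt_unix (the `Unix error constructor is an assumption of the sketch): a direct wrapper with no intermediate buffer.

open Lwt.Infix

let read (fd : Lwt_unix.file_descr) (buf : Cstruct.t) =
  Lwt.catch
    (fun () ->
       (* Fill the caller's buffer directly: no allocation on our side. *)
       Lwt_bytes.read fd buf.Cstruct.buffer buf.Cstruct.off buf.Cstruct.len
       >|= function
       | 0 -> Ok `End_of_flow
       | n -> Ok (`Input n))
    (function
       | Unix.Unix_error (err, _, _) -> Lwt.return (Error (`Unix err))
       | exn -> Lwt.fail exn)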

It would unblock many issues found across the MirageOS ecosystem (which I consider technical debt). The result is not the deletion of the zero-copy read, but a move to consider it specific to mirage-tcpip. It would then:

  1. let the end user opt into the zero-copy read when they want it
  2. give other protocols such as TLS a chance to provide a POSIX interface (and merge tls.lwt and tls.mirage into one unique package)
  3. open the door to protocols or implementations which are not part of the MirageOS ecosystem - and keep them compatible with the UNIX world
  4. allow protocol composition without performance regressions

/cc @mirage/core

@avsm (Member) commented Nov 30, 2020

Thanks for bringing this up, although I'm surprised you brought up read and not write, as the latter is where I see semantic issues.

Flow.read seems to be perfectly "POSIX compliant" to me; could you tell me where the below is incorrect so I can understand your issue better?

  • Where does the Cstruct.t come from?
  • Who allocates it?

These two can probably be answered in one go. The Cstruct.t is freshly allocated by the Flow implementation. This is an internal detail since the driver might use a local pool of buffers, or be passing through buffers from another implementation. Remember that Cstruct.t is a view onto an underlying buffer, so it is cheap to allocate the Cstruct.
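
A small illustration of that point (a sketch of mine, not from the comment): slicing a Cstruct.t allocates only a small record and shares the underlying Bigarray.

let () =
  let page = Cstruct.create 4096 in
  let view = Cstruct.sub page 0 512 in
  (* [view] is a 512-byte window onto [page]; no payload was copied. *)
  assert (view.Cstruct.buffer == page.Cstruct.buffer)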

The difficult question here is the lifetime of the underlying memory buffer, not the Cstruct (which is disposable). Is that where you view difficulties with POSIX?

  • How many bytes are available?

The Cstruct is precise: the length of the cstruct (not the underlying buffer) is accurate. So if you get a short read on POSIX, then the implementation will give you back a cstruct of the length read from the socket, but the underlying buffer may be longer (e.g. a page).

  • Can I write into it?

Good question! No. If we use the ro/rw capabilities in the new cstruct, then the call would under most circumstances have to allocate a fresh cstruct. It's pretty hard to have a zero-copy read and write interface that's safe without far stronger lifetime guarantees.

I didn't understand the other points about TLS and so on -- perhaps if we focus on "why isn't Mirage_flow.S.read POSIX-compliant" that might help answer the rest.

@dinosaure (Member, Author) commented

Flow.read seems to be perfectly "POSIX compliant" to me; could you tell me where the below is incorrect so I can understand your issue better?

POSIX compliant means, for me, this signature: val read : flow -> bytes -> int -> int -> (int, [> error ]) result, as we have with Unix.read or the plain C read syscall.

This is an internal detail since the driver might use a local pool of buffers, or be passing through buffers from another implementation. Remember that Cstruct.t is a view onto an underlying buffer, so it is cheap to allocate the Cstruct.

From my point of view, a protocol implementation such as TCP/IP should not implement another mechanism such as a local pool of buffers, nor rely on a pre-existing implementation which does that job. A protocol implementation should just decode and encode - and I think that is already enough complexity for the developer.

But if the protocol does handle its own pool of buffers, how is that concretised in Mirage_flow.S? How can I tweak it (set some configuration values depending on my context)? How can I replace it with something more aware of my context? The documentation does not mention any of this, and it all seems at least implicit, or magic.

Then, allocating a Cstruct.t is cheap only if you have already allocated a large, unproxied Bigarray/Cstruct. In many cases, Cstruct.of_string or Cstruct.create/Bigarray.create leads our code, and I will explain why later.

The difficult question here is the lifetime of the underlying memory buffer, not the Cstruct (which is disposable). Is that where you view difficulties with POSIX?

My main difficulty is about control. When we want to implement a higher-level protocol such as TLS or Git Smart, we discover one main design decision which drives several other protocol designs: we know the length of our payload (because these protocols use variable-length packets). Even SMTP advises limiting the packet/line to 1000 characters.

In that case we know, at most, how many bytes we should get from the underlying protocol (here, TCP/IP). The spontaneous way to implement these protocols is to pre-allocate a buffer which corresponds to the maximum number of bytes our peer can send us - of course, I simplify the problem a lot, but the idea stays the same. This way, we expect the underlying protocol to fill this buffer, and it is up to us to take care of when we ask for more input and when we start to parse.

With the current Mirage_flow.S, we must in any case implement a ring-buffer (allocate it, and use it to keep the remaining bytes given by Mirage_flow.S.read). Indeed, the current interface cannot guarantee that we will get, at most, N bytes.
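
By contrast, here is a sketch of mine assuming a hypothetical POSIX-style read : flow -> Cstruct.t -> (int, 'e) result Lwt.t: filling a pre-allocated buffer of a known length is then a simple loop, with no intermediate storage.

let really_read ~read flow buf =
  let open Lwt.Infix in
  let rec go buf =
    if Cstruct.length buf = 0 then Lwt.return `Done
    else
      read flow buf >>= function
      | Ok 0 -> Lwt.return `Unexpected_eof    (* peer closed too early *)
      | Ok n -> go (Cstruct.shift buf n)      (* fill the remainder *)
      | Error e -> Lwt.return (`Error e)
  in
  go buf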

The Cstruct is precise: the length of the cstruct (not the underlying buffer) is accurate. So if you get a short read on POSIX, then the implementation will give you back a cstruct of the length read from the socket, but the underlying buffer may be longer (e.g. a page).

This is where I have some trouble. We don't have any control over how many bytes we receive. We just receive something which can be larger than what we expect, or smaller. In the first case, we need to keep the remaining bytes in a ring-buffer (and pray that we never fill it up - in that case, we fail or we need to re-allocate a bigger version of our ring-buffer). In the second case, we must read again until we have enough bytes to start our processing.

But in any case, to fit the constraints of a protocol implementation, we make a copy, and this thin layer is reimplemented in many ways across our MirageOS ecosystem - I cannot count them, but I have seen the same unique goal several times: fall back to val read : flow -> buffer -> int -> int -> (int, [> error ]) result

But all of this describes only half of the problem, because it mostly comes down to this simple situation: I have a Mirage_flow.S and I would like to have something like Unix.read. As I showed in my first post, many libraries (inside or outside the MirageOS ecosystem) expect the POSIX-compliant read function instead of Mirage_flow.S.read. That mostly means that for us, where every protocol implementation is expected to sit on top of the Mirage_flow.S interface, we must write the glue with the ring-buffer, but:

  1. we don't have one common, well-tested way to provide the POSIX-compliant read from Mirage_flow.S.read - even though we have ke. I have seen many attempts, all of them exploiting protocol-dependent details, and in the end I see all of them as technical debt rather than as the common way to go
  2. we end up making a copy systematically - I mean that, currently, at other layers such as CoHTTP/HTTP/AF/colombe/etc., we cannot claim the zero-copy property
  3. doing and redoing the same job to provide val read : flow -> bytes -> int -> int -> (int, [> error ]) result is error-prone; I have done it several times and made mistakes

The other half of the problem

The other half of the problem is when we have a protocol implementation which does not respect Mirage_flow.S but should. Again, the first example in my mind is mirage-tcpip, which wants to provide a Mirage_flow.S interface on top of Lwt_unix.

In such a transition, an allocation is required; in that case we talk about a linger - and, as we can see, it allocates. Such a layer is required for any other protocol which wants a chance to be used in the MirageOS ecosystem.
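
To make this direction concrete as well, here is a sketch of mine providing a Mirage_flow.S-style read on top of Lwt_unix (the 4096-byte size and the `Unix error constructor are choices of the sketch): the fresh allocation on every call is exactly the linger cost described above.

open Lwt.Infix

let read (fd : Lwt_unix.file_descr) =
  let buf = Cstruct.create 4096 in   (* a fresh allocation on every call *)
  Lwt.catch
    (fun () ->
       Lwt_bytes.read fd buf.Cstruct.buffer 0 4096 >|= function
       | 0 -> Ok `Eof
       | n -> Ok (`Data (Cstruct.sub buf 0 n)))
    (function
       | Unix.Unix_error (err, _, _) -> Lwt.return (Error (`Unix err))
       | exn -> Lwt.fail exn)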

And the first entry point for a protocol implementation to be compatible with MirageOS is Conduit - which is why I chose to diverge from the Mirage_flow.S interface, because it seems:

  1. too specific to mirage-tcpip
  2. to imply some tweaks (copies/performance regressions) on other MirageOS protocols to make them fit under Mirage_flow.S

This is why I consider that we have a split between MirageOS and the rest of the world due to this single read function, and we should not. I mean, we should follow what is conventional elsewhere when we want to read from a socket.

Several other questions

Again, the documentation is not clear about the pool of buffers, its implementation, and where we are able to tweak it. Ownership is another problem, and the POSIX-compliant read solves it: the caller keeps ownership of the pre-allocated Cstruct.t (even Rust provides the same kind of interface).

Regarding implementations, we can see a divergence between tls.lwt, which provides the POSIX-compliant read, and tls.mirage, which respects Mirage_flow.S - I think this alone illustrates the split: why do we have two different interfaces for the same thing?

And, again, if this interface wants to presuppose zero-copy, it holds (and again, I'm not sure it does) only for mirage-tcpip. So why do we want to impose one specific aspect of one specific protocol on all the others - because I'm not wrong when I say that every protocol in MirageOS is expected to respect the Mirage_flow.S interface.

PS: I could go into details and point at code to highlight the existence of ring-buffers and lingers in our code-base, but I think it is fair to say that we have tried to fall back to the POSIX-compliant read several times, in many contexts (and we must maintain all of them, each with its own details, etc.).

@ansiwen commented Jul 25, 2022

I can confirm that this is an issue: it caused a bug in the read function in gluten-mirage. In this case too, an intermediate buffer must be implemented to fix the issue, because Gluten_lwt.IO.read can be called with a buffer smaller than what Flow.read returns. The excess data must then be stored and returned the next time Gluten_lwt.IO.read is called.

@avsm While implementing that, I also needed to know: how long is the Cstruct.t returned by Flow.read valid, and is ownership transferred to the caller? At the moment I keep (parts of) the buffer alive and use it in later calls, but am I risking that it has been overwritten in the meantime?

This is my implementation, which I'm not sure is safe:

(* [S] is gluten-mirage's flow signature (not shown here). *)
module Gluten_flow (Flow : Mirage_flow.S) :
  S with module Mirage = Flow = struct
  module Mirage = Flow

  open Lwt.Infix

  type t = {
    flow: Flow.flow;
    mutable buf: Cstruct.t;  (* excess bytes kept from a previous read *)
  }

  let create flow = {
    flow = flow;
    buf = Cstruct.empty;
  }

  let read flow req_len =
    (* Serve buffered bytes first; only hit the flow when the buffer is empty. *)
    (if flow.buf.len = 0 then
      Flow.read flow.flow
    else begin
      let res = Ok (`Data flow.buf) in
      flow.buf <- Cstruct.empty;
      Lwt.return res
    end)
    >|= function
    | Ok (`Data buf) when buf.len > req_len ->
        (* Hand back at most req_len bytes and stash the rest. *)
        let head, rest = Cstruct.split buf req_len in
        flow.buf <- rest;
        Ok (`Data head)
    | x -> x
end

@dinosaure (Member, Author) commented

I can confirm that this is an issue: it caused a bug in the read function in gluten-mirage. In this case too, an intermediate buffer must be implemented to fix the issue, because Gluten_lwt.IO.read can be called with a buffer smaller than what Flow.read returns. The excess data must then be stored and returned the next time Gluten_lwt.IO.read is called.

As you said, a solution like an intermediate buffer - e.g. a ring-buffer such as ke, or something like mirage-channel - will fix your problem and provide something closer to what Unix.read is.
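
For instance, a sketch along these lines with mirage-channel (assuming its read_exactly, which returns the chunks as a Cstruct.t list; read_n and its error wrapping are mine):

module Buffered (Flow : Mirage_flow.S) = struct
  module Ch = Mirage_channel.Make (Flow)
  open Lwt.Infix

  (* Create the channel once so that buffered excess survives across calls. *)
  let create = Ch.create

  (* Read exactly [len] bytes; the channel keeps any excess for later. *)
  let read_n ch ~len =
    Ch.read_exactly ~len ch >|= function
    | Ok (`Data bufs) -> Ok (Cstruct.concat bufs)
    | Ok `Eof -> Error `Eof
    | Error e -> Error (`Channel e)
end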

While implementing that, I also needed to know: how long is the Cstruct.t returned by Flow.read valid, and is ownership transferred to the caller? At the moment I keep (parts of) the buffer alive and use it in later calls, but am I risking that it has been overwritten in the meantime?

Currently, the zero-copy story is false for Solo5. That mostly means that a fresh Cstruct.t is allocated every time you receive a TCP/IP packet. Ownership remains with the end user as long as mirage-tcpip does nothing more than inspect the packet - it does not write into it. However, if the Cstruct.t is shared by writers and readers, you will probably get a problem (but this depends on the layer above mirage-tcpip).

More generally, a change to the Mirage_flow.S interface has big implications for several packages. In other words, a true consensus must be found among everyone before any changes are made.

@ansiwen commented Jul 26, 2022

I'm still not sure: does that mean that my code (see above) is safe, as long as nobody writes to the buffer, because the buffer is never reused by the lower layers?

@dinosaure (Member, Author) commented

Yes, but your logic is a bit weird. I mean, req_len does not mean anything when the internal buffer is not empty. That mostly means that, with your code, a subsequent call to read with req_len = 128 (for example) does not always give you at most (nor at least) 128 bytes. I'm not sure that is what you expect.

Otherwise, as far as ownership goes, even if it depends on the implementation, it is "safe" to pass the underlying buffer to the upper layer. Again, we make a fresh copy for every ethernet packet.

@ansiwen commented Jul 26, 2022

Yes, but your logic is a bit weird. I mean, req_len does not mean anything when the internal buffer is not empty. That mostly means that, with your code, a subsequent call to read with req_len = 128 (for example) does not always give you at most (nor at least) 128 bytes. I'm not sure that is what you expect.

OK, we are getting off topic now, but I think you got the code wrong. If I'm not mistaken, the split with req_len is always done when the buffer is longer, no matter whether the buffer comes from Flow.read or from the internal buffer.

@ansiwen commented Jul 26, 2022

But it's not concurrency-safe if several read promises are in flight at the same time.
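
One way to address that (a sketch, assuming a lock : Lwt_mutex.t field is added to the t record above) is to serialise the reads:

(* Serialise reads so two concurrent promises never race on the internal
   buffer; [t.lock] is a hypothetical Lwt_mutex.t field. *)
let read_exclusive t req_len =
  Lwt_mutex.with_lock t.lock (fun () -> read t req_len)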

@hannesm (Member) commented Dec 17, 2023

Long thread; if I read it correctly, the proposal is to revise the read signature to:

val read : flow -> Cstruct.t -> ([ `Input of int | `End_of_flow ], error) result Lwt.t

My observation in MirageOS (lower layers, zero copy, pretty sane) is:

  • The lowest layer is Mirage_net.listen : t -> header_size:int -> (Cstruct.t -> unit Lwt.t) -> (unit, error) result Lwt.t (in the implementation of mirage-net-solo5, a buffer is allocated and then returned --> thus the callback function passed in takes ownership thereof; still, the entire callback from the read() is executed in a Lwt.async - which certainly is not ideal);
  • The next layer is ethernet, which takes 3 callbacks (arp/ipv4/ipv6) and calls the right one with the Cstruct.shift()ed data (again, ownership is transferred to the callback; not sure whether that is documented);
  • The next layer is IPv4, which eventually puts the buffer into the fragment cache, and then pushes it (again, as a callback) to the next layer (UDP / TCP / ICMP)

Now, we are finally at the FLOW layer - where TCP is responsible for consuming the received network packet and providing data to the application. And the question is: how should this be done? On the one hand, the lower layers (a packet received at the network card) may need properly aligned buffers, so a good ownership and lifetime discipline may help to reuse buffers (e.g. in the ixy case). So, my preferred solution would be for the buffer from the network device to be owned by the network device, with the IPv4 fragment cache already doing a copy.

Now, depending on your TCP implementation, it may or may not be worth copying the data over to a user buffer (and keeping an internal buffer). I have not finally decided on this. In uTCP, I thought that maybe if read returned a Cstruct.t list, I could avoid some further allocations.

I would like to drop data_or_eof and instead return the number of bytes that have been read.

I did not push on this for 4.0.0, since integrating a shutdown already felt like enough change in the API (plus getting it across to the implementations, which is still WIP).
