adwpc edited this page May 18, 2020 · 18 revisions

Pion WebRTC Media API

This document details a completely new media API for Pion WebRTC. The current media API has deficiencies that prevent it from being used in a few production workloads. This document doesn't aim to modify/extend the existing API; we are looking at it with fresh eyes.

Adding Comments

I encourage everyone to comment on this page! When adding comments, add them in italics and include your GitHub username, e.g. *I believe this API can be improved by doing X -- Sean-Der*

API Requirements

API Users

If you can think of more use cases please provide them; this list is not exhaustive!

*No CGO, or pluggable CGO; prefer pure Go. -- adwpc*

Sending pre-recorded content to viewer(s)

A user has an audio/video file on disk and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.

Relaying RTP Traffic (with no feedback)

A user has an existing RTP feed (e.g. an RTSP camera) and wants to send the content to many viewers. There will be no congestion control, but there will be some loss handling (NACK). If the remote viewer doesn't support the codec we offer, handshaking will fail.

Sending live generated content

A user will be encoding content and sending it to many viewers; this could be an MCU, or capturing a webcam or desktop (like github.com/nerdism/neko). There will be congestion control and packet loss handling (NACK/PLI). The user should be informed of the codecs the remote supports, and then be able to generate what is requested on the fly.

Ingesting WebRTC for Later Playback

A user wants to save media from a remote peer to disk. This could be for playback later, or some other async task. We need to ensure the best experience possible by providing loss handling, and congestion control. Latency doesn't matter as much.

Ingesting WebRTC for Live Playback

A user wants to consume media from a remote peer live. This could be used for processing (like GoCv) or playing back live. We need to ensure the best experience possible by providing loss handling, and congestion control. We will also need to be careful to not add much latency, this could hurt the entire experience.

Relaying WebRTC Traffic

Users should be able to build the classical SFU use cases. For each peer you will have one PeerConnection, and transfer all tracks across it. If possible we should support Simulcast and SVC. If neither is supported, we should just request the lowest bitrate that works for all peers. Beyond that we should pass everything through and let de-jittering happen on each receiver's side. This needs more research.

Code that works in native and web

Users should be able to write idiomatic WebRTC code that works in both their native and Web applications. They should be able to call getUserMedia and have it work across both platforms. This portability is also very important for our ability to test.

API Features

An exact API will be defined below; this is a high-level view of what the user interaction will look like.

Sending Media

Set supported codecs at PeerConnection Level

A user on startup will declare what codecs they will support.

The user can add/remove from a list of RTCRtpCodecCapability

This allows us to express

  • All codecs (H264, Opus, VPx)
  • Attributes of that codec (packetization, profile)
  • RTCPFeedback (NACK, REMB)
Create a MediaStreamTrack

A user creates a MediaStreamTrack by either calling mediadevices.getUserMedia() OR creating a Track via webrtc.NewTrack(kind RTCCodecType, id, label string, f func(RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error))

Tracks must match the MediaStreamTrack specification, so codec/SSRC will no longer be defined at the Track level.

Add a MediaStreamTrack to the PeerConnection

No change from the current Pion API, peerConnection.AddTrack(track)

On SetRemoteDescription a callback is fired on MediaStreamTrack with a RtpSender and supported codecs

Every time a PeerConnection that has added the track finishes signaling, the callback is fired. Only then do we know the intersection of codecs; we can't pick H264 (or VPx) until we know the other side supports it.

func(sender RtpSender, supportedCodecs []RTCRtpCodecCapability) (RTCRtpCodecCapability, error) {
    if len(supportedCodecs) == 0 {
        return RTCRtpCodecCapability{}, fmt.Errorf("no supported codecs")
    }

    fanOutSlice = append(fanOutSlice, sender)
    return supportedCodecs[0], nil
}

The example above shows the typical fan-out case. We get a new RtpSender and add it to a list that another goroutine is looping over and writing to. When one of the RtpSenders returns io.EOF, it is removed from the list. This was already possible with the Pion API today, but the new design solves the problems described below.
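The fan-out bookkeeping described above can be sketched as follows. Note that `rtpSender` and `countingSender` are hypothetical stand-ins (the proposed RtpSender API does not exist yet); the point is only the remove-on-io.EOF loop.

```go
package main

import (
	"fmt"
	"io"
)

// rtpSender stands in for the proposed RtpSender; only WriteRTP matters here.
type rtpSender interface {
	WriteRTP(payload []byte) error
}

// fanOut writes one packet to every sender and drops senders that returned
// io.EOF, mirroring the fan-out loop described above. It returns the senders
// that are still active.
func fanOut(senders []rtpSender, payload []byte) []rtpSender {
	remaining := senders[:0]
	for _, s := range senders {
		if err := s.WriteRTP(payload); err == io.EOF {
			continue // peer is gone, remove it from the list
		}
		remaining = append(remaining, s)
	}
	return remaining
}

// countingSender is a toy sender that succeeds until it is closed.
type countingSender struct {
	writes int
	closed bool
}

func (c *countingSender) WriteRTP(payload []byte) error {
	if c.closed {
		return io.EOF
	}
	c.writes++
	return nil
}

func main() {
	a, b := &countingSender{}, &countingSender{closed: true}
	active := fanOut([]rtpSender{a, b}, []byte{0x80})
	fmt.Printf("active senders: %d, writes to a: %d\n", len(active), a.writes)
}
```

In a real SFU the writer goroutine would call something like fanOut once per packet read from the source track.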

SSRC/PayloadType will be internally managed

Juggling these values makes the API hard to use. Browsers use different PayloadTypes, so this creates a lot of pain for users. It is also hard to debug when an SSRC is wrong.

Codec can be chosen on the fly

You don't know ahead of time whether the remote supports H264/VP9/AV1. Now you can pick whichever codec you prefer from the intersection.

RTP and RTCP must be tightly coupled

The current API doesn't allow us to implement congestion control or error correction easily. By instead giving the user direct access to the RTPSender they have the hooks they need.

WriteSample should take time.Duration instead of (samples uint32)

The user shouldn't need to do the math. Internally we should convert the duration to a sample count using the codec's clock rate and pass it to pion/rtp.

Handling Jitter, Loss and Congestion

SettingEngine allows a user to pass their own JitterBuffer and CongestionController

We will provide a sensible default, but these will both be interfaces that a user just has to satisfy. This is out of the scope of this document; the only thing we need to ensure is that it is possible without an API break.

A user can then interact with the JitterBuffer/CongestionController as they wish, mutating it at runtime or modifying values. This will allow them to choose how much loss they are willing to tolerate, etc. This will also be helpful for building an SFU: you can have a CongestionController whose upper bound is set to the lowest estimate of all receivers. The REMB is then constructed and sent back to the sender.
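The "upper bound is the lowest of all receivers" idea can be sketched with a small helper. minReceiverBitrate is a hypothetical function, not part of any existing Pion API; an SFU's CongestionController could feed its result into the REMB it sends back.

```go
package main

import "fmt"

// minReceiverBitrate returns the lowest bitrate estimate among all receivers,
// which an SFU could use as the upper bound reported back via REMB.
// The boolean is false when there are no receivers (and thus no estimate).
func minReceiverBitrate(estimates []uint64) (uint64, bool) {
	if len(estimates) == 0 {
		return 0, false
	}
	min := estimates[0]
	for _, e := range estimates[1:] {
		if e < min {
			min = e
		}
	}
	return min, true
}

func main() {
	// Bits per second, one estimate per receiver.
	estimates := []uint64{2_500_000, 800_000, 1_200_000}
	if min, ok := minReceiverBitrate(estimates); ok {
		fmt.Println(min) // the slowest receiver caps the sender
	}
}
```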

RTPSender will have callbacks for RTCP Feedback results

We will put two callbacks on the RTPSender, and the user can ignore them if they wish. These aren't portable, but I think putting them in the SettingEngine is the wrong thing to do.

RtpSender.OnBitrateSuggestion(func(bitrate float64) {
})

RtpSender.OnKeyframeRequest(func() {
})

API In Action

Webcam capture that works in WASM and Go mode

This will capture a video device and works in both WASM and Go mode. When running in WASM mode the VP8 selection currently has no effect; if the WebRTC API allows it in the future, we will support it.

func main() {
	// We only want to send VP8
	s := webrtc.SettingEngine{
		Codecs: []RTCRtpCodecCapability{
			webrtc.RTCRtpCodecCapabilityDefaultVP8,
		},
	}
	api := webrtc.NewAPI(webrtc.WithSettingEngine(s))

	peerConnection, err := api.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		panic(err)
	}

	track, err := mediaDevices.GetUserMedia(mediaDevices.MediaStreamConstraints{Video: true})
	if err != nil {
		panic(err)
	}

	peerConnection.AddTrack(track)
}

We should allow users to provide their own encoders (@lherman-cs)

I think we should allow users to encode their own video/audio, because the tracks that we receive from GetUserMedia should still be in raw format (we need to be able to transform the video/audio). The following shows the data flow, starting from GetUserMedia and ending at the other peer.

Reference: https://w3c.github.io/mediacapture-main/#the-model-sources-sinks-constraints-and-settings

This diagram shows that the data from the source can be broadcast and transformed. Allowing users to encode their own video/audio also gives them some extra benefits:

  1. Fan out video to many PeerConnections
  2. Use the source for other outputs, e.g. simply stream MJPEG through an HTTP server
  3. Transform the source; the change will be reflected in all of the listeners
  4. Each listener has the option to transform the source without affecting other listeners

So, I propose that we should have a functional option that allows users to provide their own encoders.

type LocalTrack interface {
  ReadRTP() (*rtp.Packet, error)
  
  // The following methods allow PeerConnection to use RTCP Feedback to automatically control the input

  // SetBitRate sets the current target bitrate. A lower bitrate means less data will be
  // transmitted, but the quality will also be lower.
  SetBitRate(int) error
  // ForceKeyFrame forces the next frame to be a keyframe, aka intra-frame.
  ForceKeyFrame() error
}

type EncoderBuilder interface {
  Codec() webrtc.RTPCodec
  // Notice that this signature is opaque. This allows pion/webrtc to stay Pure Go.
  // The idea is to not require the main pion/webrtc package to know the input format from the track, 
  // it only needs to care how to handle the encoded version. This way, we let the users decide
  // whatever format they wish, which leads to a flexible design. But, since it is opaque, 
  // it'll be more error-prone and feels more "magical".
  BuildEncoder(Track) (LocalTrack, error)
}

type SettingEngine struct{
  // internal stuff
}

func (engine *SettingEngine) WithEncoders(encoders ...EncoderBuilder) {}


func (pc *PeerConnection) AddTrack(track Track) {
  // step 1: find common supported codec builders from SettingEngine
  // note 1.1: if multiple codecs match, try to build them in sequential order;
  //           if one fails, use the next one. This is useful if we have 2 or more codec
  //           implementations. We allow users to prioritize some encoders, e.g. hardware
  //           accelerated codecs (it's common for these to fail since the device
  //           might not have hardware support).

  // step 2: create a local track using the encoder builder

  // step 3: create a new RTPSender

  // step 4: replace the RTPSender's local track with the local track from step 2
}

This design is actually similar to what Chromium does: https://chromium.googlesource.com/external/webrtc/+/refs/heads/master/media/engine/webrtc_media_engine.h. They have a MediaEngine with an API to set encoder builders; later, PeerConnection can build encoders on the fly.

Note: I've created a couple of POCs in mediadevices:

  1. Non-WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/simple/main.go
    • Broadcast your camera stream through MJPEG server
  2. WebRTC: https://github.com/pion/mediadevices/blob/redesign/examples/webrtc/main.go
    • Classic 1:1 WebRTC example using jsfiddle

*Maybe consider how this ties into a broader (Go) media pipeline? Over time you could build out building blocks like enabling Picture-in-Picture, etc. -- Backkem*

Fan-out video from one PeerConnection to many

Distributing pre-recorded content

TODO/Questions

  • How do we accomplish SVC?
  • How do we accomplish Simulcast?