jbuf: fix for adaptive mode #960
Conversation
The wish size for the adaptive mode depends on how much the packets are re-ordered. This can only be measured by looking at the sequence number in the RTP header and how much it jumps around because of re-ordered packets. Counting video frames is not needed and causes failures when the wish size is computed.
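As a minimal sketch of that idea (illustrative helper name, not the actual libre API): the re-order distance can be derived from the 16-bit RTP sequence number alone, with wrap-safe signed arithmetic, and no knowledge of video frames is needed.

```c
#include <stdint.h>

/* Wrap-safe signed distance between the arriving sequence number and
 * the one expected next: 0 means in-order, positive means packets were
 * skipped, negative means a late (re-ordered) packet. The magnitude of
 * these jumps is what the adaptive wish size has to cover; counting
 * video frames plays no role here.
 * (Hypothetical sketch, not the actual jbuf code.) */
static int16_t seq_diff(uint16_t seq, uint16_t expect)
{
	return (int16_t)(seq - expect);
}
```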
The macOS tests are failing. This has nothing to do with this PR. @sreimers, could you please have a look?
But you want to specify, for video, how many frames are held. As you say, a video frame can be x packets (depending on keyframes and frame changes), and you want to make sure that a frame is ready (all packets in order) before decoding starts.
I think it is not relevant whether the frame is ready. The decoder should know what to do if the frame is not complete: it waits for more packets. The only job of the jbuf is to re-order packets, thus observing the sequence number is enough.
Ok, I think I can also move the logic for correct video frame handling outside of jbuf, to keep jbuf simple.
Is it really relevant to the decoder whether the chunks of a video frame follow each other very quickly or with small interruptions (waiting for the next RTP packets)?
It is important that a frame is complete, because otherwise, if say the last packet (or a new-frame packet) arrives before the frame is finished (since jbuf doesn't keep all relevant packets by default), the entire frame (all video chunks) is corrupted. Since keyframes can be hundreds of packets within a short period of time, the probability is high.
Here is an example of how pjsip handles this: https://github.com/pjsip/pjproject/blob/master/pjmedia/src/pjmedia/vid_stream.c#L924-L981
Another use case is Generic RTP Feedback - RTPFB (RTCP), since the encoder can try to resend lost packets (which only makes sense for video). Therefore the jitter buffer has to hold all packets that belong to a video frame until they all arrive, in order to sort the packets.
In a next PR, we could implement this:
We could enable this behavior for …
Not sure if I understand the APIs correctly, but jbuf always has to keep a whole frame (at least one or more), otherwise a video frame easily gets corrupted, as mentioned (if you move only chunks to the decoder). The adaptive mode wish size calculation based on packets does not make much sense for video, I think. Edit: Maybe we should use a ptime- and timestamp-based calculation instead.
I do not agree. The job of … The current implementation in … Edit:
Only in adaptive mode?
Yes, and that's not the point. Let's take this example:
If the video decoder consumes chunks of packets from jbuf, the frame can be corrupted, right? You have to wait for at least one frame's worth of packets, to ensure all packets are in the correct order, before processing within the video decoder. For audio it's easy, because 1 frame = 1 packet.
Yes, only in adaptive mode, because the computation of the
I think I know what you want and that it is enough to wait until:
We could add this behavior in a following PR. Edit: Of course only if jbuf min is configured to zero. As soon as one out-of-order case is detected, the
Low-latency video is necessary to achieve a lip-synchronous, low-latency audio+video conversation.
You always have to wait for a frame to be complete before passing it to the video decoder. I would prefer not to hold packets within the video decode module itself (avcodec uses an mbuf for this) until a frame is completely received; if jbuf handles any out-of-order packets, at least for that frame, this adds no extra latency, and jbuf can be emptied at once. For real lip sync you need RTCP SR and calculated playout buffers to keep audio and video in sync.
Okay. In order to decide that a frame with timestamp
We should check both; that's what pjsip does, as mentioned earlier:

```c
/* Quickly see if there may be a full picture in the jitter buffer, and
 * decode them if so. More thorough check will be done in decode_frame().
 */
ts_diff = pj_ntohl(hdr->ts) - stream->dec_frame.timestamp.u32.lo;
if (ts_diff != 0 || hdr->m) {
```
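A compact C sketch of that combined check (illustrative names and structure, not baresip's or pjsip's actual implementation): a new frame starts when the RTP timestamp changes, and the marker bit additionally flags the last packet of a frame.

```c
#include <stdbool.h>
#include <stdint.h>

/* State for collecting the packets of one video frame.
 * (Hypothetical sketch, not the real jbuf API.) */
struct frame_state {
	uint32_t ts;       /* RTP timestamp of the frame being collected */
	bool have_frame;   /* currently collecting a frame? */
};

/* Returns true when the collected frame can be flushed to the decoder:
 * either the timestamp jumped (a new frame started) or the current
 * packet carries the marker bit (the frame ends here). */
static bool frame_boundary(struct frame_state *st, uint32_t ts, bool marker)
{
	bool flush = false;

	if (st->have_frame && ts != st->ts)
		flush = true;    /* timestamp changed: new frame began */

	st->ts = ts;
	st->have_frame = true;

	if (marker)
		flush = true;    /* marker bit: last packet of the frame */

	return flush;
}
```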
Should I add this in this PR?
Currently this PR breaks fixed-buffer video frame handling, since it reverts it. See this example: you want to define a minimum latency for the jitter buffer, say at least 100 ms for audio; since you know one packet represents 20 ms, you can configure a minimum of 5 packets. But if you want to achieve the same for video, you can't specify minimum packets, since video frames are fragmented. So you have to count/specify frames instead.
That's not true; see pjsip as mentioned, or webrtc.
jbuf: trace data for plot
Another, more detailed explanation of why video frame handling is so useful in jbuf (besides jitter timing calculations): if you only handle packets, the decoder pulls packets from jbuf without frame knowledge, so some packets are in the decode buffer while others are still in jbuf. If an out-of-order packet arrives that was already consumed/cached by decode, jbuf detects a late packet and it's lost. So a single out-of-order packet results in the loss of an entire frame, which can be prevented if all the packets that are part of a frame are kept in jbuf. I added a PR with a better average ratio calculation (inspired by webrtc code): #962
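To illustrate the fix for that scenario (a hypothetical sketch with made-up names, not the actual jbuf data structures): packets are kept sorted by sequence number and only drained in whole frames, grouped by RTP timestamp, so a late in-frame packet can still be sorted in before the decoder sees the frame.

```c
#include <stddef.h>
#include <stdint.h>

#define CAP 32

struct pkt { uint16_t seq; uint32_t ts; };

/* Frame-aware buffer sketch: packets stay sorted by sequence number
 * and the newest frame is never released while it might still grow. */
struct fbuf {
	struct pkt p[CAP];
	size_t n;
};

static void fbuf_put(struct fbuf *b, uint16_t seq, uint32_t ts)
{
	size_t i = b->n;

	if (b->n == CAP)
		return;                       /* sketch: drop on overflow */

	/* insertion sort by wrap-safe sequence distance */
	while (i > 0 && (int16_t)(seq - b->p[i - 1].seq) < 0) {
		b->p[i] = b->p[i - 1];
		i--;
	}
	b->p[i].seq = seq;
	b->p[i].ts  = ts;
	b->n++;
}

/* Move all packets of completed frames (timestamp older than the
 * newest one) into `out`; returns the number of packets drained.
 * The (possibly still incomplete) newest frame stays buffered. */
static size_t fbuf_drain(struct fbuf *b, struct pkt *out)
{
	uint32_t newest = b->n ? b->p[b->n - 1].ts : 0;
	size_t k = 0, i, j;

	for (i = 0; i < b->n && b->p[i].ts != newest; i++)
		out[k++] = b->p[i];

	for (j = 0; i < b->n; i++, j++)
		b->p[j] = b->p[i];
	b->n -= k;

	return k;
}
```

With this scheme, a late packet of the newest frame is inserted in order instead of being declared lost, at the cost of holding at most one extra frame of latency.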
My argument was that if
I will have a look tomorrow. |
I am trying to implement a frame processing that guarantees that frames are complete. For now I am converting this to a draft.
I don't fully understand why a completely new solution is needed. Can you explain before putting too much work into this?
The field … I saw that the frame/packet ratio is not a good measure to adjust the wish size, even if we compute a moving average. Further, I would avoid linear effort in … I will close this and re-open a draft for discussion.
I think this works best if the video decoder pulls by timed callback (like an audio source/destination) and per frame. At the moment it's packet/frame/push driven.
The adaptive mode was broken. Video frames are not relevant for the wish size. The estimation of the number of frames was not accurate, because key frames consist of many more packets than differential frames.
The wish size depends only on `rdiff`, the measure of how much the packet sequence number differs from the expected sequence number when packets arrive out of order. The plots generated by `baresip/tools/jbuf` from PR baresip/baresip#2724 visualize how the adaptive mode works.
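As a rough sketch of how a wish size could follow `rdiff` (names, constants and the grow-fast/shrink-slow policy are all illustrative assumptions, not the actual libre implementation):

```c
#include <stdint.h>

/* Hypothetical adaptive wish-size update: grow immediately so the
 * buffer covers the largest observed re-order distance, decay slowly
 * toward the configured minimum when the network calms down, and
 * always stay within the [min, max] bounds. */
static uint32_t wish_update(uint32_t wish, uint32_t min, uint32_t max,
			    uint32_t rdiff)
{
	if (rdiff + 1 > wish)
		wish = rdiff + 1;  /* grow fast on re-ordering */
	else if (wish > min)
		wish--;            /* shrink slowly otherwise */

	if (wish > max)
		wish = max;
	if (wish < min)
		wish = min;

	return wish;
}
```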