How to reduce GStreamer latency?
Question
I wrote a pipeline that grabs a 720 x 576 image from a 1920 x 576 sensor with the v4l2src element on an NVIDIA Jetson Xavier NX.
The pipeline grabs each frame and then does two things:
- pushes the frame to the appsink element
- encodes it and streams it to the client with udpsink
The pipeline is as follows:
gst-launch-1.0 v4l2src device=/dev/video0 !
queue max-size-time=1000000 !
videoconvert n-threads=8 !
video/x-raw,format=I420,width=1920,height=576 !
videoscale n-threads=8 method=0 !
video/x-raw,format=I420,width=720,height=576 !
tee name=t ! queue !
valve name=appValve drop=false !
appsink name=app_sink t. !
queue max-size-time=1000000 !
videorate max-rate=25 !
nvvidconv name=nvconv !
capsfilter name=second_caps !
nvv4l2h265enc control-rate=1 iframeinterval=256 bitrate=1615000 peak-bitrate=1938000 preset-level=1 idrinterval=256 vbv-size=64600 maxperf-enable=true !
video/x-h265 !
h265parse config-interval=1 !
tee name=t2 !
queue max-size-time=1000000 !
valve name=streamValve drop=false !
udpsink host=192.168.10.56 port=5000 sync=false name=udpSinkElement
My question is:
Is there any way to reduce the latency of this pipeline?
I tried to reduce the latency by adding many queues and by setting n-threads on videoscale and videoconvert, but it didn't help.
Answer 1
Score: 1
How do you measure latency?
If it's the time it takes to see video after launching the client, then you probably need to reduce the GOP (iframeinterval), because the client has to wait for an I-frame (or many I-slices) before it can reconstruct a complete picture.
You can easily check whether that's the case by pointing your video source at a timer and showing the decoded stream on a monitor. Take a picture of both (monitor + timer) with your phone and you have a pretty good measure of glass-to-glass latency.
Launch the stream multiple times and measure the latency each time. Why multiple times? Because it varies depending on where you are in the GOP (close to or far from the next I-frame). Typically, with a 25 fps source, a GOP of 256 frames means you could wait anywhere from 0.04 s to 10.2 s before being able to decode.
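To put numbers on that range, here is a small sketch of the arithmetic. The helper name gop_wait_ms is made up for illustration, and the second call uses a hypothetical shorter GOP of 25 frames; only the 256-frame, 25 fps case comes from the question.

```shell
# Worst-case wait (in milliseconds) for the next I-frame when a client joins
# mid-GOP: up to one full GOP at the given frame rate.
# gop_wait_ms is a hypothetical helper, not part of GStreamer.
gop_wait_ms() {
  gop_frames=$1
  fps=$2
  echo $(( gop_frames * 1000 / fps ))
}

gop_wait_ms 256 25   # GOP from the question: prints 10240
gop_wait_ms 25 25    # hypothetical shorter GOP: prints 1000
```

The best case is a single frame period (1000 / 25 = 40 ms), which is why repeated measurements are spread between those two extremes.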
Also, your pipeline is too complex for what you're trying to achieve. You can rescale your video with nvvidconv (GPU), which is much better suited to it than videoscale (CPU). You can use capabilities to set the framerate directly (no need for videorate). You can also limit the buffer size of the UDP sink (and src), which trades reliability with reordered and late packets for lower latency.
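A rough sketch of what such a simplified streaming branch might look like follows. It is untested and rests on several assumptions: that the sensor's raw format is one nvvidconv accepts, that the driver can deliver 25 fps directly through caps, and that a GOP of 25 frames is acceptable (a hypothetical value). The appsink branch is left out for brevity, and the encoder properties are a subset of those in the question.

```shell
# Sketch only: assumes the sensor format is accepted by nvvidconv and that the
# driver can deliver 25 fps directly. The appsink branch is omitted.
HOST=192.168.10.56   # client address from the question
PORT=5000            # port from the question
GOP=25               # hypothetical: one I-frame per second at 25 fps

# Backslash-newline inside double quotes joins the lines into one string.
PIPELINE="v4l2src device=/dev/video0 \
 ! video/x-raw,width=1920,height=576,framerate=25/1 \
 ! nvvidconv \
 ! video/x-raw(memory:NVMM),format=NV12,width=720,height=576 \
 ! nvv4l2h265enc control-rate=1 iframeinterval=$GOP idrinterval=$GOP bitrate=1615000 maxperf-enable=true \
 ! h265parse config-interval=1 \
 ! udpsink host=$HOST port=$PORT sync=false"

# Print the pipeline description so it can be inspected before running it.
echo "$PIPELINE"
```

In bash or sh, the string can then be passed unquoted so the shell splits it into arguments: gst-launch-1.0 $PIPELINE. Compared with the original, this drops videoconvert, videoscale, videorate and every queue from the streaming path.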
There are other tricks to reduce latency, but you'll lose something else in exchange. You can ask the encoder to decrease the GOP, ask it not to use B-frames, enable slice-level encoding, reduce the slice length, increase or decrease the slice intra refresh, limit the profile, etc. All of these have drawbacks compared with the default settings, so YMMV.
Adding queues usually increases latency (though it relieves CPU hotspots) rather than reducing it, unless your queues are almost always empty, and in that case you don't need them. That's because a queue requires synchronizing threads, and this takes time. A queue is only required if you process the data in parallel and the different branches aren't ticking at the same speed. In the "simple" sequential grab, encode, stream mode, the queue is usually not required (since most steps can be done on the GPU and NVENC and are not CPU limited).
If you need synchronous I/O (like a filesink), then a queue can be beneficial if and only if the processing time is sometimes longer than the grabber's sample period.