Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[humble] How to avoid frames dropped without buffsize and chunksize options? #1845

Open
l4es opened this issue Oct 29, 2024 · 13 comments
Open

Comments

@l4es
Copy link

l4es commented Oct 29, 2024

Hi Everyone,

I am recording IMU topic at 200MHz together with /camera/left/image_raw and /camera/right/image_raw at 20Hz. However, I observed certain frame drops of either left or right camera topics :
ros2_bag_frame_drops_20hz

I even throttled the image topics to 5Hz but was still facing the drops:
ros2_bag_frame_drops_5hz

The resolution of the image frame is 1280x720.

In ROS1 version, there's buffsize and chunksize that could be used to solve this issue I suppose. However, they are not available anymore in the ROS2 version.

There's also another suggestion to override the QoS settings of ros2 bag record. Unfortunately, I received the Incorrect QoS compatibility warnings if I set the reliability of the relevant topic to reliable.

Could someone give me an advice, please ?

Thanks in advance and best regards,
Khang

@christophebedard
Copy link
Member

christophebedard commented Oct 29, 2024

Did you try the --max-cache-size option for ros2 bag record? That may not help if recording just can't keep up with the data input rate. You may have to record topics in separate bags/processes.

@MichaelOrlov
Copy link
Contributor

@l4es The "recording IMU topic at 200MHz together with /camera/left/image_raw and /camera/right/image_raw at 20Hz" is not a very good idea at first hand. At least because camera frames will delay or/and will cause frame drops on the IMU topic.
I recommend recording high frame rate topics as IMU in a separate bag file and then gluing them together or using distributed replay.

@l4es
Copy link
Author

l4es commented Oct 30, 2024

Hi @christophebedard,

I did try --max-cache-size 104857600 but drops still happened.

Hi @MichaelOrlov,

Indeed I could also see the IMU drops as well. Thanks for your suggestion to split the bag file per topic. May I know if this was the same approach with which you'd been able to record up to the 2.6 GBytes/Sec from 8 cameras (without any drop !?!?) ?

... then gluing them together or using distributed replay.

As I am new with ROS, could you help to share some way/tool to merge the separate bag files so that the synchronization of the topics is guaranteed, please ? Our own idea is merging the left and right image topics into one before being recorded with rosbag.

Thanks in advance and best regards,
Khang

@christophebedard
Copy link
Member

As I am new with ROS, could you help to share some way/tool to merge the separate bag files so that the synchronization of the topics is guaranteed, please ? Our own idea is merging the left and right image topics into one before being recorded with rosbag.

If you replay them separately, they indeed won't be sync. You can first merge them into 1 bag (time-synchronized). See "Example merge" under this section in the README: https://github.com/ros2/rosbag2/tree/humble#converting-bags

I'm currently working on adding support for replaying multiple bags (time-synchronized of course), which means you wouldn't need to first merge the bags into 1 bag before replaying them, although this feature will not be available in Humble.

@fujitatomoya
Copy link
Contributor

Just out of curiosity,

recording IMU topic at 200MHz

is this really 200 MHz data rate supported with physical sensor? it is almost processor frequency like Cortex-M???

@MichaelOrlov
Copy link
Contributor

Hi @l4es, As @christophebedard mentioned, we have the ros2 bag convert CLI tool, which is very powerful. However, in the current implementation, it doesn't take into account timestamps when merging multiple bags.
It basically takes the next message from each bag and puts it in the new bag in a round-robin cycle.

If you have a topic in one bag with a frequency at least twice more than other topics from other bags, it is likely that messages from that topic with higher frequency will not be ordered by timestamp in the resulting merged bag.

As @christophebedard mentioned, we gonna soon create a PR with an extension for the Rosbag2 player to allow playback from multiple bag files and respect the received timestamp for each topic and bag file.

@l4es
Copy link
Author

l4es commented Oct 31, 2024

Hi @christophebedard and @MichaelOrlov,

It is good to know and still okay for me to merge separate input bags into one output bag, as long as the output bag would be similar to the case of recording multiple topics into single bag file in terms of timing and synchronization (time-synchronized).

By the way of mentioning about the topics' timing and sync., I would like your confirmation about the chronical order of the topics in the output bag : Are they synced by the time they were received for recording or by the timestamp in their headers, please ?

@fujitatomoya :

is this really 200 MHz data rate supported with physical sensor? it is almost processor frequency like Cortex-M???

I used an MCU to read the IMU by interruption and to publish the IMU topic with microROS over serial port.

Best Regards,
Khang

@christophebedard
Copy link
Member

By the way of mentioning about the topics' timing and sync., I would like your confirmation about the chronical order of the topics in the output bag : Are they synced by the time they were received for recording or by the timestamp in their headers, please ?

Ordered by the time each message was received by the recorder, see

if (earliest_msg == nullptr || msg->recv_timestamp < earliest_msg->recv_timestamp) {
. They are not ordered by the original publication timestamp.

@l4es
Copy link
Author

l4es commented Nov 1, 2024

Hi @christophebedard,

Thanks for your confirmation.

@MichaelOrlov,
I did some extra tests in order to avoid as many as possible the frame drops :

  1. Original case with separate left & right topics in raw :
    IMG_3865

  2. Assemble left & right topic of my stereo camera :
    IMG_3867

  3. Assemble and compressed left & right topic of my stereo camera :
    IMG_3869

It looks like that the drops are caused by the amount of data written to the storage (which is quite natural) : with compressed image, less data is written (~700MB compared to 2.6GB for around 1 minutes of recording) thus less drops. And another conclusion is that it is hard to achieve the ideal situation in which there's no data/frame drop, especially in case of image topics with which there's high amount of data.

@MichaelOrlov
Copy link
Contributor

Hi @l4es, First of all, need to figure out on what side frame drops happened.

  1. Do you use the bag split by size feature, or writing in the one bag file continiously?
  2. Do you see this message in the Rosbag2 recorder console "Cache buffers lost messages per topic: "
  3. What is your maximum disk write speed for the block size similar to the size of the camera image which you are storing in the bag?
  4. What is QoS settings? Reliability, Durability and more important DDS queue sizes?
  5. Have you tried to use composable recorder node to avoid extra data copies and transfer over the DDS?
  6. Have you tried to run recorder on an isolated cpu cores (at least 2-4) ?

@l4es
Copy link
Author

l4es commented Nov 4, 2024

Hi @MichaelOrlov ,

Thanks for your suggestions. Below are my answers :

  1. Do you use the bag split by size feature, or writing in the one bag file continuously?

I recorded single bag file continuously.

  1. Do you see this message in the Rosbag2 recorder console "Cache buffers lost messages per topic: "

No, I do not see the above message.

  1. What is your maximum disk write speed for the block size similar to the size of the camera image which you are storing in the bag?

Would it be possible to do that with dd command?

  1. What is QoS settings? Reliability, Durability and more important DDS queue sizes?

About the QoS settings, as said in initial comment, I tried the solution given in the following link https://robotics.stackexchange.com/questions/100644/detect-messages-dropped-by-ros2-bag-record, but got the Incorrect QoS compatibility warnings if I set the reliability of the relevant topic to reliable. So I believe that they are default settings, especially Reliability = Best Effort.
About the DDS queue sizes, I haven't known how to set it. Could you advice?

  1. Have you tried to use composable recorder node to avoid extra data copies and transfer over the DDS?

My own publishing node has not been designed to be composable yet, is it possible to use just the composable recorder similar to this one : https://github.com/berndpfrommer/rosbag2_composable_recorder as I am still with ROS2 Humble?

  1. Have you tried to run recorder on an isolated cpu cores (at least 2-4) ?

I haven't tried this approach yet.

@MichaelOrlov
Copy link
Contributor

MichaelOrlov commented Nov 5, 2024

Hi @l4es
If you don't see the message "Cache buffers lost messages per topic: " when you stop recording with CTRL+C, it does mean that messages drop happened on the transport layer and internal Rosbag2 buffers didn't get to full.

To benchmark your disk writing speed, please refer to the https://github.com/ros2/rosbag2/tree/rolling/rosbag2_performance/rosbag2_performance_benchmarking#general-knowledge-io-benchmarking

If you are trying to override QoS settings and getting error messages that QoS settings are not compatible, it means that the publisher-side QoS settings need to be adjusted as well. Please refer to the https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html article.
The DDS queue size is the QoS parameter

Depth

  • Queue size: only honored if the “history” policy was set to “keep last”.

The composable recorder and player nodes are supported starting from the Jazzy release.

@l4es
Copy link
Author

l4es commented Nov 9, 2024

Hi @MichaelOrlov,

I would like to re-confirm that I did see the "Cache buffers lost messages per topic: " message when stopping recording of /camera/left/image_raw and /camera/right/image_raw with CTRL+C :

[INFO] [1731061741.190553915] [rosbag2_cpp]: Writing remaining messages from cache to the bag. It may take a while
[WARN] [1731061741.190605501] [rosbag2_cpp]: Cache buffers lost messages per topic: 
	/camera/right/image_raw: 1
	/camera/left/image_raw: 2
Total lost: 3
[INFO] [1731061741.199839269] [rosbag2_recorder]: Event publisher thread: Exiting
[INFO] [1731061741.200036620] [rosbag2_recorder]: Recording stopped
[INFO] [1731061741.201083981] [rosbag2_recorder]: Recording stopped

Also, I managed to run the rosbag2_performance benchmarking on the testing system (Jetson Orin NX + Samsung NVMe SSD) which gives below result :

~/Workspace/Nvidia/ros_ws$ src/rosbag2/rosbag2_performance/rosbag2_performance_benchmarking/scripts/report_gen.py -i rosbag2_performance_test_results/test_mixed_110Mbs_no_transport_2024-11-09_12-17-22/
100Mbs_large:
  msg_count_each: 100
  msg_size_bytes: 10000000
  publishers_count: 1
  qos:
    qos_depth: 5
    qos_durability: volatile
    qos_reliability: best_effort
  rate_hz: 10
  topic_root: benchmarking_large
10Mbs_many_frequent_small:
  msg_count_each: 2000
  msg_size_bytes: 100
  publishers_count: 500
  rate_hz: 200
  topic_root: benchmarking_small
publisher_groups:
- 10Mbs_many_frequent_small
- 100Mbs_large
wait_for_subscriptions: true

Results: 
	Repetitions: 2
	Max bagfile size: 0
	Compression: <default>
	Compression threads: 0
	Compression queue size: 1
	Recorded messages for different caches and storage config:
		storage config: default:
			cache 10,000,000 - min: 99.37%, average: 99.63%, max: 99.88%
			cache 100,000,000 - min: 94.95%, average: 97.47%, max: 100.00%
		storage config: storage_resilient.yaml:
			cache 10,000,000 - min: 25.22%, average: 26.66%, max: 28.11%
			cache 100,000,000 - min: 48.18%, average: 50.17%, max: 52.15%
Results: 
	Repetitions: 2
	Max bagfile size: 0
	Compression: zstd
	Compression threads: 0
	Compression queue size: 1
	Recorded messages for different caches and storage config:
		storage config: default:
			cache 10,000,000 - min: 45.40%, average: 56.57%, max: 67.73%
			cache 100,000,000 - min: 63.85%, average: 64.60%, max: 65.35%
		storage config: storage_resilient.yaml:
			cache 10,000,000 - min: 18.12%, average: 18.26%, max: 18.40%
			cache 100,000,000 - min: 42.01%, average: 42.14%, max: 42.28%

And some first I/O benchmarking :
with dd :

~/Workspace/Nvidia/ros_ws$ dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output
1024+0 records in
1024+0 records out
402653184 bytes (403 MB, 384 MiB) copied, 1.90264 s, 212 MB/s

then with fio :

~/Workspace/Nvidia/ros_ws$ fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=write --size=500m --io_size=10g --blocksize=1024k --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
TEST: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32
fio-3.16
Starting 1 process
TEST: Laying out IO file (1 file / 500MiB)
Jobs: 1 (f=1): [W(1)][12.1%][w=199MiB/s][w=199 IOPS][eta 00m:51s]
Jobs: 1 (f=1): [W(1)][22.0%][w=235MiB/s][w=235 IOPS][eta 00m:46s] 
Jobs: 1 (f=1): [W(1)][32.8%][w=141MiB/s][w=141 IOPS][eta 00m:39s] 
Jobs: 1 (f=1): [W(1)][41.7%][w=230MiB/s][w=230 IOPS][eta 00m:35s] 
Jobs: 1 (f=1): [W(1)][51.7%][w=117MiB/s][w=117 IOPS][eta 00m:29s] 
Jobs: 1 (f=1): [W(1)][61.7%][w=65.0MiB/s][w=65 IOPS][eta 00m:23s] 
Jobs: 1 (f=1): [W(1)][71.7%][w=86.1MiB/s][w=86 IOPS][eta 00m:17s] 
Jobs: 1 (f=1): [W(1)][81.7%][w=114MiB/s][w=114 IOPS][eta 00m:11s] 
Jobs: 1 (f=1): [W(1)][91.7%][w=78.0MiB/s][w=78 IOPS][eta 00m:05s] 
Jobs: 1 (f=1): [W(1)][100.0%][w=202MiB/s][w=202 IOPS][eta 00m:00s]
TEST: (groupid=0, jobs=1): err= 0: pid=229317: Sat Nov  9 12:37:28 2024
  write: IOPS=157, BW=157MiB/s (165MB/s)(9478MiB/60225msec); 0 zone resets
    slat (usec): min=59, max=447593, avg=697.87, stdev=13175.38
    clat (msec): min=3, max=1055, avg=202.15, stdev=138.21
     lat (msec): min=3, max=1055, avg=202.85, stdev=138.15
    clat percentiles (msec):
     |  1.00th=[   16],  5.00th=[   61], 10.00th=[  114], 20.00th=[  127],
     | 30.00th=[  130], 40.00th=[  133], 50.00th=[  138], 60.00th=[  146],
     | 70.00th=[  209], 80.00th=[  309], 90.00th=[  393], 95.00th=[  510],
     | 99.00th=[  667], 99.50th=[  718], 99.90th=[  894], 99.95th=[  986],
     | 99.99th=[ 1053]
   bw (  KiB/s): min= 4096, max=296960, per=100.00%, avg=161194.89, stdev=81609.28, samples=120
   iops        : min=    4, max=  290, avg=157.38, stdev=79.71, samples=120
  lat (msec)   : 4=0.09%, 10=0.49%, 20=0.80%, 50=2.62%, 100=4.57%
  lat (msec)   : 250=67.52%, 500=18.36%, 750=5.23%, 1000=0.28%, 2000=0.03%
  cpu          : usr=0.56%, sys=0.86%, ctx=8995, majf=0, minf=25
  IO depths    : 1=0.2%, 2=0.4%, 4=0.8%, 8=1.6%, 16=3.2%, 32=93.8%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,9478,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=157MiB/s (165MB/s), 157MiB/s-157MiB/s (165MB/s-165MB/s), io=9478MiB (9938MB), run=60225-60225msec

Disk stats (read/write):
  nvme0n1: ios=64/9730, merge=0/382, ticks=14399/1823067, in_queue=1838088, util=99.75%

finally with strace :

~/Data/perf-tests/endurance/raw/strace$ strace -c ros2 bag record --max-cache-size 10 /camera/left/image_raw /camera/right/image_raw /imu/data_raw
[INFO] [1731130928.971590855] [rosbag2_recorder]: Press SPACE for pausing/resuming
[INFO] [1731130928.979452208] [rosbag2_storage]: Opened database 'rosbag2_2024_11_09-12_42_08/rosbag2_2024_11_09-12_42_08_0.db3' for READ_WRITE.
[INFO] [1731130928.983323172] [rosbag2_recorder]: Listening for topics...
[INFO] [1731130928.983328452] [rosbag2_recorder]: Event publisher thread: Starting
[INFO] [1731130928.983734087] [rosbag2_recorder]: Recording...
[INFO] [1731130929.588204464] [rosbag2_recorder]: Subscribed to topic '/imu/data_raw'
[INFO] [1731130929.891170185] [rosbag2_recorder]: Subscribed to topic '/camera/left/image_raw'
[INFO] [1731130929.892392719] [rosbag2_recorder]: Subscribed to topic '/camera/right/image_raw'
[INFO] [1731130929.892448975] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...
strace: Process 229771 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.30   10.381692       29244       355        10 futex
  0.19    0.019688           3      5935      2097 newfstatat
  0.16    0.016992           3      4301      1803 openat
  0.06    0.006653           4      1530           getdents64
  0.06    0.006214           1      3256           read
  0.05    0.005637           1      4007           fstat
  0.05    0.005104           2      2518           close
  0.04    0.003700           1      3133         3 lseek
  0.03    0.003051           1      2172      1520 ioctl
  0.02    0.001938           6       302           mmap
  0.01    0.001197           6       193           mprotect
  0.00    0.000386          20        19           sendmsg
  0.00    0.000347           5        63           recvmsg
  0.00    0.000297           5        59           brk
  0.00    0.000285           6        46           munmap
  0.00    0.000227           3        70        69 mkdirat
  0.00    0.000165          11        14           clone
  0.00    0.000164           5        28           socket
  0.00    0.000134           9        14           sendto
  0.00    0.000124           4        27        19 setsockopt
  0.00    0.000121           2        45        21 faccessat
  0.00    0.000106           3        31         7 flock
  0.00    0.000076          25         3           fallocate
  0.00    0.000075           5        13         1 unlinkat
  0.00    0.000063           6        10           write
  0.00    0.000060           3        18         6 bind
  0.00    0.000050           0        69           fcntl
  0.00    0.000037           2        14           epoll_ctl
  0.00    0.000029           3         9           getsockname
  0.00    0.000027           2        12           pwrite64
  0.00    0.000025           5         5         4 linkat
  0.00    0.000021           1        13           getpid
  0.00    0.000019           0        73           rt_sigaction
  0.00    0.000018           4         4           rt_sigprocmask
  0.00    0.000015           3         4         2 readlinkat
  0.00    0.000010           3         3           fchmod
  0.00    0.000010           1         9           pread64
  0.00    0.000009           9         1           sysinfo
[INFO] [1731131073.136563494] [rosbag2_cpp]: Writing remaining messages from cache to the bag. It may take a while
  0.00    0.000008           2         3           getrandom
  0.00    0.000005           1         3           dup
  0.00    0.000005           5         1           geteuid
  0.00    0.000004           2         2           getcwd
  0.00    0.000004           4         1           statfs
  0.00    0.000004           4         1           set_tid_address
  0.00    0.000004           4         1           sigaltstack
  0.00    0.000004           4         1           getegid
  0.00    0.000004           2         2           gettid
  0.00    0.000004           4         1           prlimit64
  0.00    0.000003           3         1           eventfd2
  0.00    0.000003           3         1           set_robust_list
  0.00    0.000003           3         1           getuid
  0.00    0.000003           3         1           getgid
  0.00    0.000003           1         2           getsockopt
  0.00    0.000002           2         1           epoll_create1
  0.00    0.000001           1         1           timerfd_create
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00   10.454830                 28404      5562 total
[WARN] [1731131073.136686888] [rosbag2_cpp]: Cache buffers lost messages per topic: 
	/camera/right/image_raw: 466
	/camera/left/image_raw: 305
	/imu/data_raw: 3216
Total lost: 3987

By the way, I think I found how to configure the buffsize and chunksize similarly to ROS1 rosbag by configuring the MCAP writter. I also used below configuration :

$ vi mcap_writer_options.yaml : 
 noChunkCRC: false 
 noChunking: false 
 noMessageIndex: false 
 noSummary: false 
 chunkSize: 10485760 
 compression: "Zstd" 
 compressionLevel: "Fastest" 
 forceCompression: false 

$ ros2 bag record -s mcap --max-cache-size 1048576000 --storage-config-file /path-to/mcap_writer_options.yaml /camera/left/image_raw /camera/right/image_raw /imu/data_raw

However, the frame drops could not be avoided.

P/S : I will modify the code to be able to adjust the QoS settings of published messages and keep you updated later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants