Work Pipeline
This page describes the data collection pipeline.
- The synchronization process is described in https://github.com/aibtw/ZED_CAM/wiki/Synchronization
- One Jetson Nano is designated as (Main/Master) and the rest as (Secondary).
- Before starting data collection, ensure the three devices are in sync. The terminal output of the sync commands (on the secondary devices) should show a time difference of no more than 15 ms.
After all devices are in sync, start recording on each device. This uses the Record/zed_recorder.py Python file. The script accepts one argument: either a full video output path, or just a file name, in which case the default path (the Documents folder) is used.
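The recorder itself is not shown in this wiki; the following is only a minimal sketch of what such a recorder might look like with the ZED Python SDK (pyzed). The argument handling (plain name vs. full path) and the default output name are assumptions based on the description above, not the actual Record/zed_recorder.py code.

```python
# Hypothetical recorder sketch; names and defaults are assumptions.
import os
import signal
import sys

import pyzed.sl as sl

def main():
    arg = sys.argv[1] if len(sys.argv) > 1 else "output.svo"  # default name is an assumption
    # If only a name is given, fall back to the Documents folder.
    out_path = arg if os.path.dirname(arg) else os.path.join(
        os.path.expanduser("~"), "Documents", arg)

    cam = sl.Camera()
    init = sl.InitParameters(camera_resolution=sl.RESOLUTION.HD720, camera_fps=30)
    if cam.open(init) != sl.ERROR_CODE.SUCCESS:
        sys.exit("Failed to open the ZED camera")

    # Record every grabbed frame into a compressed .svo file.
    rec = sl.RecordingParameters(out_path, sl.SVO_COMPRESSION_MODE.H264)
    if cam.enable_recording(rec) != sl.ERROR_CODE.SUCCESS:
        sys.exit("Failed to start recording")

    stop = {"flag": False}
    signal.signal(signal.SIGINT, lambda *_: stop.update(flag=True))  # stop on Ctrl-C

    runtime = sl.RuntimeParameters()
    while not stop["flag"]:
        cam.grab(runtime)  # each successful grab appends one frame to the SVO

    cam.disable_recording()
    cam.close()

if __name__ == "__main__":
    main()
```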
The post-processing stage includes several steps:
This uses postprocessing/timestamp_extract.py, which takes the .svo video path as its input argument. The output is a folder called (ts), created in the parent directory of the video, with a .csv file saved inside it containing each frame number and its corresponding timestamp, in addition to the number of dropped frames at each frame.
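As a rough illustration of this step, here is a hedged sketch of timestamp extraction with pyzed. The CSV column names, the output file name, and the gap-based dropped-frame estimate are assumptions; the actual postprocessing/timestamp_extract.py may compute drops differently.

```python
# Sketch only: extract per-frame timestamps from an SVO and estimate drops from timestamp gaps.
import csv
import os
import sys

import pyzed.sl as sl

svo_path = sys.argv[1]
out_dir = os.path.join(os.path.dirname(svo_path), "ts")   # (ts) folder next to the video
os.makedirs(out_dir, exist_ok=True)
out_csv = os.path.join(out_dir, os.path.basename(svo_path) + ".csv")

cam = sl.Camera()
init = sl.InitParameters()
init.set_from_svo_file(svo_path)
init.svo_real_time_mode = False          # read the file as fast as possible
if cam.open(init) != sl.ERROR_CODE.SUCCESS:
    sys.exit("Could not open " + svo_path)

FPS = 30.0                               # nominal recording frame rate (assumption)
frame_interval_ms = 1000.0 / FPS

rows, prev_ts = [], None
runtime = sl.RuntimeParameters()
while cam.grab(runtime) == sl.ERROR_CODE.SUCCESS:
    frame = cam.get_svo_position()
    ts = cam.get_timestamp(sl.TIME_REFERENCE.IMAGE).get_milliseconds()
    # A gap much larger than one frame interval indicates dropped frames.
    dropped = 0 if prev_ts is None else max(round((ts - prev_ts) / frame_interval_ms) - 1, 0)
    rows.append((frame, ts, dropped))
    prev_ts = ts
cam.close()

with open(out_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "timestamp_ms", "dropped"])
    writer.writerows(rows)
```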
The timestamp extraction (previous step) outputs each frame and its timestamp. The videos were synchronized, but their starting and ending moments were different: the synchronization was at the level of timestamping, not at the level of starting/ending at the same moment. This means each video started (and ended) recording at a different time than the others. To make the videos fully aligned, we must first align their timestamps. This is done by finding the video that started last, taking its first frame, and searching for its timestamp in the other videos until we find the frames that correspond to the same timestamp. These frames are then considered the first frames, and everything before them is discarded. This way, the timestamps of all videos are aligned, and we know which frame is the first frame in each video.
This uses postprocessing/timestamp_align.py, which takes as input 3 csv files containing the timestamps of the 3 synchronized videos, and outputs 3 aligned timestamp files in another folder named (aligned).
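To make the alignment idea concrete, here is a minimal sketch using pandas. It assumes the CSVs from the previous step have "frame" and "timestamp_ms" columns (an assumption), and it trims both ends to the common time window; the actual script may instead search for the nearest matching timestamp.

```python
# Sketch of start/end alignment across three timestamp CSVs.
import os

import pandas as pd

def align(csv_paths, out_dir="aligned"):
    os.makedirs(out_dir, exist_ok=True)
    dfs = [pd.read_csv(p) for p in csv_paths]

    # The camera that started recording last defines the common start time,
    # and the one that stopped first defines the common end time.
    start_ts = max(df["timestamp_ms"].iloc[0] for df in dfs)
    end_ts = min(df["timestamp_ms"].iloc[-1] for df in dfs)

    for path, df in zip(csv_paths, dfs):
        # Keep only frames inside the common window; the first kept frame becomes frame 0.
        kept = df[(df["timestamp_ms"] >= start_ts) & (df["timestamp_ms"] <= end_ts)].copy()
        kept["frame"] = range(len(kept))
        kept.to_csv(os.path.join(out_dir, os.path.basename(path)), index=False)

align(["ts/cam1.csv", "ts/cam2.csv", "ts/cam3.csv"])  # hypothetical file names
```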
After aligning the timestamps, we can align the .svo files themselves using postprocessing/svo_cutter.py, which has a -v flag (videos) and a -t flag (aligned timestamps csv). The output is 3 svo files that are perfectly aligned and synchronized, ready for any further processing.
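The following is only a rough sketch of how one SVO could be cut to the aligned range with pyzed, by seeking to the aligned start frame and re-recording frames into a new file; the actual postprocessing/svo_cutter.py may work differently, and the start/end frames would come from the aligned timestamps CSV.

```python
# Sketch: cut one SVO to [start_frame, end_frame] by re-recording during playback.
import pyzed.sl as sl

def cut_svo(in_path, out_path, start_frame, end_frame):
    cam = sl.Camera()
    init = sl.InitParameters()
    init.set_from_svo_file(in_path)
    init.svo_real_time_mode = False
    if cam.open(init) != sl.ERROR_CODE.SUCCESS:
        raise RuntimeError("Could not open " + in_path)

    # Seek to the aligned first frame, then re-record grabbed frames into a new SVO.
    cam.set_svo_position(start_frame)
    rec = sl.RecordingParameters(out_path, sl.SVO_COMPRESSION_MODE.H264)
    if cam.enable_recording(rec) != sl.ERROR_CODE.SUCCESS:
        raise RuntimeError("Could not start recording " + out_path)

    runtime = sl.RuntimeParameters()
    while cam.grab(runtime) == sl.ERROR_CODE.SUCCESS:
        if cam.get_svo_position() >= end_frame:
            break

    cam.disable_recording()
    cam.close()
```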
We then need to extract the timestamps and aligned timestamps again for the newly cut files, since the older aligned timestamp files were aligned at points other than zero. That is, frame 0 in a newly cut svo might correspond to frame 1234 in the old aligned timestamps file (we cut the video to start at that frame, so it is now frame 0, not 1234), so the timestamps must be extracted again.
At the end, you should have the following:
- Cut svo files, all aligned to start and end at the same frame.
- Timestamps of each cut svo.
- Aligned timestamps (even though they are already aligned, the number of frames will appear different because the ZED API does not account for dropped frames; this is handled in the alignment code).
The following repository contains all files related to the annotation phase: https://github.com/aibtw/Wudu-GUI-Annotation
The annotation tool is the Wudu-GUI-Annotation/OAnnotator.py file. (A screenshot of the tool's interface appeared here.)
You will need to press the (browse) button and select a suitable avi file. This file shows ONLY ONE PERSON (captured from the 3 cameras). To produce such files, you first need to extract avi files (left lens only) from the 3 synchronized svo files. After that, we use the Wudu-GUI-Annotation/VideoEditor.py script, which takes in the 3 avi files and also reads the timestamps-aligned csv file corresponding to each svo/avi pair.
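As an illustration of the left-lens export, here is a hedged sketch using pyzed and OpenCV; Stereolabs also ships an official SVO export sample that can do the same. The nominal FPS and MJPG codec below are assumptions.

```python
# Sketch: export the left lens of one SVO to an .avi file.
import sys

import cv2
import pyzed.sl as sl

svo_path, avi_path = sys.argv[1], sys.argv[2]

cam = sl.Camera()
init = sl.InitParameters()
init.set_from_svo_file(svo_path)
init.svo_real_time_mode = False
if cam.open(init) != sl.ERROR_CODE.SUCCESS:
    sys.exit("Could not open " + svo_path)

mat = sl.Mat()
runtime = sl.RuntimeParameters()
writer = None
FPS = 30.0  # nominal recording frame rate (assumption)

while cam.grab(runtime) == sl.ERROR_CODE.SUCCESS:
    cam.retrieve_image(mat, sl.VIEW.LEFT)                    # left lens only
    frame = cv2.cvtColor(mat.get_data(), cv2.COLOR_BGRA2BGR)  # ZED images are BGRA
    if writer is None:
        h, w = frame.shape[:2]
        writer = cv2.VideoWriter(avi_path, cv2.VideoWriter_fourcc(*"MJPG"), FPS, (w, h))
    writer.write(frame)

writer.release()
cam.close()
```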
Recall that there are dropped frames in each camera, and they differ from camera to camera. The ts-aligned files to be loaded are aligned in terms of start frame, end frame, and also dropped frames. This means that if frames 1000 to 1100 were dropped in camera 1, they were probably not dropped in camera 2, so camera 1 must not jump from frame 1000 to 1100 directly; it should instead show frames 1000 to 1100 as duplicates of frame 1000. These frames are already filled in the ts-aligned files, but not in the avi videos we loaded, so the script fills them in the videos.
The script will then also (crop) a rectangle around each person in each camera. This is done manually: for our experiments, we recorded the boundaries of each box in the Wudu-GUI-Annotation/Frame_cut.txt file, and these are copied from the txt file and pasted into the code for every group of 3 videos.
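Conceptually, for one camera and one person, the duplicate-and-crop step boils down to something like the sketch below. This is not the actual VideoEditor.py code; the "dropped" column name and the (x, y, w, h) box format are assumptions.

```python
# Sketch: re-insert dropped frames as duplicates and crop one person's box.
import cv2
import pandas as pd

def fill_and_crop(avi_in, ts_aligned_csv, box, avi_out):
    x, y, w, h = box                      # manually recorded crop rectangle (from Frame_cut.txt)
    ts = pd.read_csv(ts_aligned_csv)
    cap = cv2.VideoCapture(avi_in)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(avi_out, cv2.VideoWriter_fourcc(*"MJPG"), fps, (w, h))

    last = None
    for _, row in ts.iterrows():
        # Where frames were dropped, write the previous frame again so that
        # all three cameras stay frame-for-frame aligned.
        for _ in range(int(row["dropped"])):
            if last is not None:
                writer.write(last)
        ok, frame = cap.read()
        if not ok:
            break
        last = frame[y:y + h, x:x + w]    # keep only this person's box
        writer.write(last)

    cap.release()
    writer.release()
```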
The output: 30 avi videos, each showing exactly 1 person from 3 different angles. These files are used as input to the annotation tool. In other words, person-wise videos instead of camera-wise videos.
- Don't use the progress bar. Instead, use the skip/back buttons.
Explanation: the buttons are bound to a flag that stops the annotations from updating when pressed. So, if someone makes a wrong annotation, they can pause, go back 4 seconds, and un-pause, and the old annotations will not get overridden. However, to continue updating annotations (to start overriding again), they must press a label again so that annotation updating resumes. If no label is pressed, this results in missing labels for a couple of frames. (A schematic sketch of this flag behavior is shown after these tips.)
- Try to time annotations with the start of the move.
- If any move is not very clear, try to approximate it (especially face-related moves)
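Schematically, the pause/override flag described above behaves roughly like this; the function names are hypothetical and this is not the actual GUI code.

```python
# Schematic of the pause/override flag in the annotation tool (illustration only).
paused = False
current_label = None

def on_skip_or_back():
    global paused
    paused = True          # stop overwriting annotations after a jump

def on_label_pressed(label):
    global paused, current_label
    current_label = label
    paused = False         # resume writing annotations from here on

def on_new_frame(frame_idx, annotations):
    # Frames visited while paused keep whatever label they already had.
    if not paused and current_label is not None:
        annotations[frame_idx] = current_label
```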
The output is a csv file containing the frame number and the label of that frame, for each frame.
For some reason, when we compare the number of frames in the annotation output with the number of frames in the ts-aligned file, we find a constant difference (13 frames). They should be exactly the same, so there is still no explanation for this. However, 13 missing frames are not a big problem; we simply fill these frames with the value of the previous label.
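Filling the missing frames with the previous label is a simple forward fill. A small sketch, assuming "frame" and "label" columns and hypothetical file names:

```python
# Sketch: pad missing trailing frames with the last available label.
import pandas as pd

ann = pd.read_csv("annotations.csv")            # hypothetical file name
n_frames = 12345                                # frame count from the ts-aligned file
full = ann.set_index("frame").reindex(range(n_frames))
full["label"] = full["label"].ffill()           # missing frames inherit the previous label
full.reset_index().to_csv("annotations_filled.csv", index=False)
```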
Some people used the skip/back buttons but then did not press a new label until they had gone further than the place they were originally at. This means some frames have no annotation. To detect the problem, we use the Wudu-GUI-Annotation/missing_label_detector.ipynb file, which outputs the frames with no labels and/or the times at which they occur. We can then open the annotation tool, load the same annotation file with the missing labels, and label only those frames, without relabeling the whole file.
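A minimal sketch of what the missing-label check boils down to, assuming "frame" and "label" columns and a 30 fps playback rate (both assumptions):

```python
# Sketch: list frames with no label and roughly where they occur in the video.
import pandas as pd

ann = pd.read_csv("annotations.csv")   # hypothetical file name
FPS = 30.0                             # assumed playback rate of the .avi

all_frames = set(range(int(ann["frame"].max()) + 1))
labeled = set(ann.dropna(subset=["label"])["frame"].astype(int))
for f in sorted(all_frames - labeled):
    print(f"frame {f:6d}  ->  ~{f / FPS:7.1f} s into the video")
```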
The idea here is that we will be relying on the raw svo files to extract skeleton information, so we need to convert these annotations (moves) from frames of the avi files to frames of the svo files.
You may now be thinking: why? Aren't the videos aligned, and isn't the avi just the svo exported to avi? The answer is that they should be the same, but there were problems, as mentioned in the previous section (Problems in the output). So we transfer these annotations to the ts-aligned files just in case, and also because we later want to map them back to the original, non-aligned timestamp files. Also, don't forget that the annotation output of the avi files is person-wise, while our original videos are camera-wise (3 people in each camera), so we need this conversion.
So, this is done in two steps:
- From avi frame numbers into ts-aligned frame numbers.
- From ts-aligned frame numbers to ts frame numbers.
For step 1, we use the Wudu-GUI-Annotation/annotations_align.ipynb file, specifically its first part. For step 2, we use the same notebook, but its second part.
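As an illustration only, the two conversions could look roughly like this for one camera. The file names, column names, and join keys are assumptions about the CSV layouts, not the actual notebook code.

```python
# Sketch of the two conversions: avi frames -> ts-aligned frames -> raw ts frames.
import pandas as pd

ann = pd.read_csv("annotations_filled.csv")     # avi frame -> label (hypothetical name)
aligned = pd.read_csv("aligned/cam1.csv")       # ts-aligned frames of one camera
raw_ts = pd.read_csv("ts/cam1.csv")             # original (non-aligned) timestamps

# Step 1: the avi was rendered frame-for-frame from the ts-aligned file,
# so labels can be attached positionally (row i of each).
aligned_moves = aligned.copy()
aligned_moves["label"] = ann["label"].reindex(range(len(aligned))).ffill()

# Step 2: map the labels back onto the original ts file by timestamp, so the
# annotations also exist for the raw, non-aligned frame numbering.
ts_annotated = raw_ts.merge(
    aligned_moves[["timestamp_ms", "label"]], on="timestamp_ms", how="left")

aligned_moves.to_csv("ts_aligned_moves_cam1.csv", index=False)
ts_annotated.to_csv("ts_annotated_cam1.csv", index=False)
```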
The output of the first part (ts_aligned_moves) shows that exactly 13 frames have not been annotated (the last 13 frames; again, the reason is unknown), so for these frames we simply reuse the annotation given to the last annotated frame available.
This output represents Annotated Aligned Timestamps. So, taking camera 1, take 4 as an example, the Left person will have the same annotations as the Left person in cameras 2 and 3 of take 4 as well.
The output of the second part (ts_annotated) is the same thing, except that the frame numbers in cameras 1, 2, and 3 are not aligned in terms of dropped frames (dropped frames are not filled, so if there was a drop in camera 1 from frame 1000 to 1100, the frame numbers simply jump from 1000 to 1100).
These two outputs (ts_aligned_moves and ts_annotated) can be used for the next step.
Each wudu is supposed to be 1 instance, so we need to separate each wudu. Luckily, we do not need to cut these files; we can just look at the annotation files and set a start and end frame number for each wudu. The start of each wudu can be detected by the kaf_wash label, and the end by detecting leg_wash followed by a transition_text[move] label. This could have been done entirely in code, but since many of the people performing wudu in our data repeatedly performed moves out of order, doing it programmatically would introduce errors. Instead, we have another script that facilitates this task in a semi-manual way. A simplified sketch of the automatic part is shown below.
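The sketch below only finds candidate (start, end) frame ranges from the label sequence; the file name is hypothetical, and treating any label that starts with "transition" as the closing transition label is an assumption.

```python
# Sketch: find candidate wudu segments from the annotated label sequence.
import pandas as pd

ann = pd.read_csv("ts_aligned_moves_cam1.csv")   # hypothetical file name
labels = ann["label"].tolist()

segments, start = [], None
for i in range(len(labels) - 1):
    if start is None and labels[i] == "kaf_wash":
        start = i                                            # a wudu begins at kaf_wash
    elif start is not None and labels[i] == "leg_wash" and \
            str(labels[i + 1]).startswith("transition"):
        segments.append((start, i))                          # ...and ends after leg_wash
        start = None

print(segments)  # candidate (start_frame, end_frame) pairs, to be checked manually
```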