Home
Welcome to the E2VID-repro-G36 wiki!
This repository contains the files and documentation of the reproducibility project carried out by Group 36 for the Deep Learning course at TUD (academic year 2021/22). The project focuses on the work presented in High Speed and High Dynamic Range Video with an Event Camera by H. Rebecq et al. (ref).
Over the last years, the number of scientific papers in the field of Deep Learning has grown drastically. However, as some authors have pointed out, it is often impossible to reproduce the results of a paper, either because no code is available or because it is incomplete. Science has always been built upon reproducible experiments, used to double-check the validity of an experiment and corroborate its results. If such a reproduction cannot be performed, it is impossible to assess the veracity of the claims or to build upon the project, since the solution cannot be further tested and evaluated. For this reason, reproducibility projects are critical in the Deep Learning field: with such a volume of projects and articles, it is important to shed light on the veracity of reported results and to further test their robustness and capabilities.
We will start by presenting the scope and focus of the original paper in Paper Description, followed by an explanation of the previously available code and a description of the dataset. We will then move on to our reproducibility work, including the advances and scripts we developed in an attempt to reproduce the results, in Work done. We will finish with the final results obtained from the reproduction, the takeaways from the project and a comparison with the authors' results in the Conclusion.
The reproduced paper, "High Speed and High Dynamic Range Video with an Event Camera" (ref), aims to develop a deep neural network for converting the data recorded by an event-based camera into a frame-by-frame video, as recorded by conventional cameras. Event cameras measure changes in intensity in the form of asynchronous pixel-wise events, which encode the time, the pixel location and the polarity of the brightness change. Such cameras provide a much higher dynamic range than normal cameras, are robust to motion blur, and have a latency in the order of microseconds, enabling the creation of slow-motion videos.
However, the reconstruction of images from a stream of events is a difficult process due to noise and sensitivity differences between regions of the camera's sensor. The authors propose a U-Net-like architecture (ref) for reconstructing the images from event information encoded in tensors, as described in the upcoming sections. The tensors are fed into the network, which, using the convolutional recurrent architecture depicted below, establishes the relationship between the previous hidden state of the network and the current input in order to generate a prediction.
Fig. 1 Network architecture as described in the paper, with 3 encoder and decoder layers, 2 residual blocks, and head and tail convolutional blocks. Skip connections are used between symmetric layers. Each encoder block (b) is composed of a down-sampling convolutional block with a stride of 2 and a recurrent convolutional LSTM block.
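For illustration, below is a minimal PyTorch sketch of one such recurrent encoder block (down-sampling convolution followed by a convolutional LSTM). The 5x5/3x3 kernels, the channel counts and the use of ReLU are placeholders of our own, not values taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (all four gates computed by a single convolution)."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state=None):
        if state is None:
            b, _, h, w = x.shape
            state = (x.new_zeros(b, self.hidden_ch, h, w),
                     x.new_zeros(b, self.hidden_ch, h, w))
        h_prev, c_prev = state
        i, f, o, g = self.gates(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class RecurrentEncoder(nn.Module):
    """Down-sampling convolution (stride 2) followed by a ConvLSTM, as in Fig. 1(b)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=5, stride=2, padding=2)
        self.rnn = ConvLSTMCell(out_ch, out_ch)

    def forward(self, x, state=None):
        x = torch.relu(self.down(x))
        return self.rnn(x, state)   # returns (output, new (h, c) state)
```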
Rather than feeding and processing the stream of events on an event-by-event basis, the authors created their own event tensors, gathering the events according to the timestamps of the ground-truth images. The resulting tensors have the same dimensions as the camera sensor plus a third dimension, which they refer to as "bins". The procedure first discretizes the duration spanned by the N events into B temporal bins (i.e. it maps the interval [t0, tN-1] onto [0, B-1]). Secondly, it rounds down each normalized timestamp to that of the corresponding bin (that is, the rounded normalized times take values in {0, 1, 2, ..., B-1}). Finally, it computes the difference between each normalized timestamp and its associated bin, which is used to distribute the polarity of every event between the assigned bin and the following one (if it exists) at the location of the event in the image.
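To make the binning scheme concrete, here is a rough sketch of how a list of events could be converted into such a tensor. The event layout (columns [t, x, y, polarity], sorted by time) and the variable names are our own assumptions for illustration, not the authors' implementation.

```python
import torch

def events_to_voxel_grid(events, num_bins=5, height=180, width=240):
    """Accumulate event polarities into a (num_bins, height, width) tensor.

    `events` is assumed to be an (N, 4) float tensor with columns [t, x, y, polarity],
    sorted by timestamp.
    """
    voxel = torch.zeros(num_bins, height, width)
    t = events[:, 0]
    x, y = events[:, 1].long(), events[:, 2].long()
    p = events[:, 3]

    # Map the timestamps from [t0, tN-1] onto [0, B-1].
    t_norm = (num_bins - 1) * (t - t[0]) / (t[-1] - t[0])
    left_bin = torch.floor(t_norm).long()   # bin each event is rounded down to
    dt = t_norm - left_bin.float()          # distance from that bin

    # Distribute each polarity between its bin and the following one (when it exists).
    voxel.index_put_((left_bin, y, x), p * (1.0 - dt), accumulate=True)
    nxt = left_bin + 1 < num_bins
    voxel.index_put_((left_bin[nxt] + 1, y[nxt], x[nxt]), p[nxt] * dt[nxt], accumulate=True)
    return voxel
```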
The tensors are then fed to the network, which integrates their information with the hidden representation of the convolutional LSTM encoders. The output is a frame with the same resolution as the input, similar to the one that would be obtained with a grayscale camera (although additional steps can be taken to also perform color reconstruction). Once the output is obtained, the loss function defined by the authors consists of a reconstruction loss, which measures the degree to which the predicted image resembles the ground-truth image and is computed through LPIPS. Additionally, the authors specify a temporal consistency loss, used to ensure that consecutive frames are consistent with each other (in other words, that the image at a certain time step corresponds to a "displaced" version of the image at the previous time step). For this purpose, the authors make use of optical flow maps between the ground-truth frames, obtained using FlowNet2. The temporal consistency loss is based on the warping error between consecutive predictions, as sketched below.
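The following is only a schematic reconstruction from the description; the choice of the L1 norm and the omission of any per-pixel weighting or masking in the authors' exact formulation are our own simplifications.

```latex
\mathcal{L}^{k}_{TC} \;=\; \bigl\lVert \hat{I}_k - W^{k}_{k-1}\!\bigl(\hat{I}_{k-1}\bigr) \bigr\rVert_1 ,
\qquad
\mathcal{L} \;=\; \sum_{k} \Bigl( \mathcal{L}^{k}_{R} \;+\; \lambda\, \mathcal{L}^{k}_{TC} \Bigr)
```

where $\mathcal{L}^{k}_{R}$ denotes the LPIPS reconstruction loss at time step k and $\lambda$ the weighting hyperparameter mentioned below.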
Here, $\hat{I}_k$ ($\hat{I}_{k-1}$) is the prediction of the network at time step k (k-1), and $W^k_{k-1}$ is the result of warping the image at time step k-1 onto the one at time step k using the corresponding flow map. By combining the two losses through a weighting hyperparameter, the authors ensure not only that the resulting video matches the quality of the ground truth but also that successive frames are consistent with each other.
The authors released their code on GitHub. The code was neatly organized and came with scripts for running a demonstration of the network, using a pre-trained model and a toy dataset (although the latter was formatted differently from the original dataset, which did not help in reading and understanding the original one). The repository also contained the necessary files for fully instantiating the network with the hyperparameters described in the paper.
The repository also contained a number of utility classes for event tensor preprocessing and post-processing, some of them beyond the scope of the modifications stated in the original paper, while others mentioned in the paper were missing (such as the data augmentation techniques used, e.g. random cropping and image rotation). In particular, in addition to the EventPreprocessor and IntensityRescaler classes described in the paper, it also included others, such as an UnsharpMaskFilter or CropParameters, which were not mentioned in the paper but were used in the sample reconstruction class.
The remaining code was provided specifically for running the demo of the reconstruction or for integration of the system with ROS, which was not particularly relevant for our purpose.
The dataset used by the authors is also openly available on the project's webpage on ETH's website, although only one of us was actually able to download the dataset through that link. The dataset was created manually by the authors by recording images from the MS-COCO dataset with a DAVIS event camera while moving the camera with respect to the image. Each sequence was approximately 2 seconds long and contained 50-60 ground-truth frames.
The authors reported a total of 1000 sequences, split into training and validation sets of 950 and 50 sequences, respectively, each with its own folder. Every sequence has its own subfolder within the train/validation directory. In addition to the frames subfolder, another two subfolders were found for every sequence: one containing the flow tensors obtained from FlowNet2, and the other containing the event tensors themselves. With all these contents, the dataset amounted to no less than 84 GB, which made it impractical to work on platforms like Google Colab, which require the dataset to be present on the cloud. Additionally, we found out during the first training attempts that sequences 107 and 382 were missing from the dataset.
The flow tensors were required for warping the frame at a certain time step onto the following one, as required by the temporal consistency term of the loss function defined by the authors. The event tensors, on the other hand, were provided already formatted according to the description in section 3.2 of the original paper, summarized in the previous sections. Consequently, every input event tensor was of size 5x180x240, matching the size of the event camera sensor. A depiction of the aforementioned folder structure can be found below:
- ecoco_depthmaps_test
  - train
    - sequence_0000000000
      - flow (containing the flow tensors)
      - frames (containing the ground-truth frames)
      - VoxelGrid-betweenframes-5 (event tensors)
    - ...
    - sequence_0000000949
  - validation
    - sequence_0000000950
    - ...
    - sequence_0000000999
Considering all this, the objective of any reproducibility project is to determine whether the results obtained by the authors are, indeed, reproducible. The extensive training of the network in the original paper (including possible perturbations of the input) and the lack of code for loading the data, computing the loss function, and training the network (which proved to be non-trivial) made it impossible to undertake a full replication of the results. Instead, we aimed to reproduce the results using a smaller subset of the training data and a smaller number of epochs. In particular, we performed different tests in which we evaluated the changes in performance of the solution for different lengths of the unrolled network (for which we rented a more powerful GPU from a public server). We also performed some robustness checks of the solution with respect to the learning rate, as it was a major hyperparameter for which the authors did not verify the robustness of the network. Finally, we performed an additional "hyperparameter" check by altering the post-processing components applied to the network's output during training; in particular, we tried different combinations of the IntensityRescaler and UnsharpMaskFilter classes. It is worth mentioning that the robustness checks, as we will describe, were carried out with a much more reduced training setup than that of the original paper, due to the time constraints.
However, before diving into the experiments themselves, we would like to comment on some preliminary work that was needed in order to start training the network.
In order to load the data and train the network, we had to define our own version of the Dataset class required by PyTorch's DataLoader, which loads the flow-map tensors, event tensors, and frames of a specific sequence and for a specific sequence duration/length. One Dataset class was created for the train set and another for the validation set, so that a separate DataLoader could be created for each. The DataLoader indexes the Dataset numerically, so we had to come up with a way of numerically indexing sequences organized in folders.
Before that, it is worth mentioning that, in order to train the recurrent network, it needs to be unrolled for a certain number of time steps. The authors used a sequence length of 40 time steps but, due to hardware limitations, we were limited to a sequence length of 8 time steps for most experiments. This means that all tensors and images had to be loaded as 4-dimensional tensors, of shapes:
- [T, 5, 180, 240] for the event tensors.
- [T+1, 1, 180, 240] for the images.
- [T, 2, 180, 240] for the flow maps.
Here, T is the number of time steps over which the network is unrolled. Notice that, while the images have a single channel (grayscale) and the event tensors have as many channels as bins (see the dataset description), the flow tensors have one channel indicating the shift of each pixel in the X direction and another channel for the Y direction. Also, since both event tensors and flow maps "interpolate" between consecutive images, an additional "initial" image had to be loaded.
Given this, and the fact that each sequence contained about 50 images/event tensors (which we later discovered was not always the case), we decided to implement the dataset in such a way that it could be set up to make use of the same sequence multiple times. Considering that we had 950 sequences, we created two parameters that enable the DataLoader to index beyond that value and, in doing so, retrieve later parts of the sequence. In particular, given a certain index, we extract the information as:
- Sequence to access = mod(idx, 950)
- First tensor to load = start_idx + shift * floor(idx / 950)
Here, start_idx is a pre-defined parameter specifying the first element of the sequence to use overall, and shift is the number of tensors/frames skipped on every access to the sequence. As an example, index 2341 would access sequence 441 and retrieve the tensors corresponding to frames [16, 24) (given start_idx = 0 and shift = sequence length = 8).
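A minimal sketch of how this index mapping could look inside a custom PyTorch Dataset is shown below. The directory layout follows the structure shown earlier, but the file loading, the passes_per_sequence parameter and the dummy _load helper are placeholders of our own, not our exact implementation.

```python
import os
import torch
from torch.utils.data import Dataset

class SequenceDataset(Dataset):
    """Sketch: map a flat index onto (sequence folder, sub-sequence start)."""

    def __init__(self, root, num_sequences=950, seq_len=8,
                 start_idx=0, shift=8, passes_per_sequence=5):
        self.root = root
        self.num_sequences = num_sequences
        self.seq_len = seq_len
        self.start_idx = start_idx
        self.shift = shift
        self.passes = passes_per_sequence

    def __len__(self):
        # Each sequence can be visited several times, shifted further in time on each pass.
        return self.num_sequences * self.passes

    def __getitem__(self, idx):
        seq_id = idx % self.num_sequences                                   # sequence to access
        first = self.start_idx + self.shift * (idx // self.num_sequences)   # first tensor to load
        seq_dir = os.path.join(self.root, f"sequence_{seq_id:010d}")

        events = torch.stack([self._load(seq_dir, "VoxelGrid-betweenframes-5", first + t)
                              for t in range(self.seq_len)])        # [T, 5, 180, 240]
        flows = torch.stack([self._load(seq_dir, "flow", first + t)
                             for t in range(self.seq_len)])         # [T, 2, 180, 240]
        frames = torch.stack([self._load(seq_dir, "frames", first + t)
                              for t in range(self.seq_len + 1)])    # [T+1, 1, 180, 240]
        return events, flows, frames

    def _load(self, seq_dir, subfolder, t):
        # Placeholder: in practice this would read tensor/frame number `t` from
        # os.path.join(seq_dir, subfolder); here we just return zeros of the right shape.
        shapes = {"VoxelGrid-betweenframes-5": (5, 180, 240),
                  "flow": (2, 180, 240),
                  "frames": (1, 180, 240)}
        return torch.zeros(shapes[subfolder])
```

With start_idx = 0 and shift = seq_len = 8, index 2341 maps to sequence 441 and frames [16, 24), matching the worked example above.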
The loss function, as hinted above, was not provided by the authors in their repository. The function was well described in the paper, but its implementation posed some challenges due to the incorporation of flow-map warping. In particular, we defined two functions for this purpose: one that computes the loss (loss_fn) and another that performs the flow-map warping (flow_map).
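A minimal sketch of what these two functions could look like is given below, using torch.nn.functional.grid_sample for the warping and the lpips pip package for the reconstruction term. The flow convention (per-pixel x/y displacements), the channel repetition for the grayscale inputs, the plain L1 temporal term and the weight lam are assumptions of this sketch, not our exact loss_fn/flow_map implementation.

```python
import torch
import torch.nn.functional as F

def warp(img_prev, flow):
    """Warp img_prev (B, C, H, W) towards the next frame using flow (B, 2, H, W).

    Assumes flow[:, 0] holds horizontal and flow[:, 1] vertical displacements in pixels.
    """
    b, _, h, w = img_prev.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img_prev.device, dtype=img_prev.dtype),
        torch.arange(w, device=img_prev.device, dtype=img_prev.dtype),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # grid_sample expects sampling locations normalized to [-1, 1], ordered as (x, y).
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(img_prev, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def loss_fn(pred, pred_prev, target, target_prev, lpips_fn, flow, lam=5.0):
    """Reconstruction (LPIPS) term plus a simple L1 temporal consistency term.

    target_prev is unused in this simplified version (it could drive an occlusion mask).
    Inputs are assumed to be grayscale images roughly in [-1, 1], repeated to 3 channels
    because LPIPS expects RGB-like inputs. `lam` is a placeholder weight.
    """
    rec = lpips_fn(pred.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)).mean()
    tc = torch.abs(pred - warp(pred_prev, flow)).mean()
    return rec + lam * tc

# Example instantiation of the perceptual metric (lpips pip package assumed):
# import lpips; lpips_fn = lpips.LPIPS(net="vgg")
```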
Another factor that affected the outcome of our reproduction was the fact that the authors did not consider the temporal consistency term of the loss function for the first two time steps of each sequence's prediction. Considering that the authors unrolled the network for 40 time steps, it is understandable that they could afford to do so. However, with our reduced sequence length, 2 time steps became a fraction of the sequence duration too large to neglect, so we only omitted the first time step.
The training loop was created according to the current implementation of the forward pass of the network and the particular requirements for training a recurrent network, that is, feeding the sequence data time step by time step and accumulating the loss for each output.
The training function was defined as training_loop. After wrapping the training data into the DataLoader, we load a mini-batch of event tensors and perform a training epoch. Before feeding the tensors into the model, we first apply different pre-processing steps, such as padding and normalization.
A mini-batch of event tensors has shape [N, T, C, H, W], and we iterate through its time-step dimension. As shown in Fig. 2, we feed the data at the current time step, `x_batch[:, t]` (E_k in the figure), together with the hidden state computed at the previous iteration, `hidden_states_prev` (s_{k-1} in the figure), into the model, and compute the output `I_predict` (\hat{I_k} in the figure) as well as the current hidden state `hidden_states` (s_k in the figure). After the forward pass, the loss is given by `loss_fn(I_predict, I_predict_previous, y_batch[:, t + 1], y_batch[:, t], rec_fun, flow=flow_batch[:, t]).sum()`, where we compute the temporal consistency loss between the current and previous predictions (`I_predict` and `I_predict_previous`) and the image reconstruction loss between the current prediction and the label (`I_predict` and `y_batch[:, t + 1]`). We call one gradient update step at each iteration to update the model parameters. We also keep track of the validation loss in each epoch to make sure we are not over-fitting the training set.
Fig. 2 Training pipeline described in the paper
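As an illustration, a condensed sketch of such an unrolled training loop is shown below. It reuses the hypothetical loss_fn/warp sketches from above, assumes the model's forward pass takes an event tensor and the previous hidden states and returns the prediction together with the new states, and omits the padding/normalization steps and the validation pass; batch size, optimizer and learning rate are placeholders.

```python
import lpips                                   # pip package assumed for the LPIPS term
import torch
from torch.utils.data import DataLoader

def training_loop(model, train_set, epochs=60, lr=1e-4, device="cuda"):
    """Simplified sketch: unroll the recurrent network over each sub-sequence."""
    loader = DataLoader(train_set, batch_size=2, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    lpips_fn = lpips.LPIPS(net="vgg").to(device)
    model.to(device)

    for epoch in range(epochs):
        for events, flows, frames in loader:   # [N,T,5,H,W], [N,T,2,H,W], [N,T+1,1,H,W]
            events, flows, frames = events.to(device), flows.to(device), frames.to(device)
            hidden_states, pred_prev = None, None
            total_loss = 0.0

            for t in range(events.shape[1]):   # iterate over the T unrolled time steps
                pred, hidden_states = model(events[:, t], hidden_states)
                if t == 0:
                    # First step: reconstruction term only, no previous prediction to warp.
                    total_loss = total_loss + lpips_fn(pred.repeat(1, 3, 1, 1),
                                                       frames[:, t + 1].repeat(1, 3, 1, 1)).mean()
                else:
                    total_loss = total_loss + loss_fn(pred, pred_prev,
                                                      frames[:, t + 1], frames[:, t],
                                                      lpips_fn, flow=flows[:, t])
                pred_prev = pred

            # One gradient update per mini-batch, on the loss accumulated over the sequence.
            optimizer.zero_grad()
            total_loss.backward()
            optimizer.step()
```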
During the implementation of the training loop we encountered a number of problems that we would like to mention:
- Firstly, we tried to integrate the preprocessing and rescaling functions (EventPreprocessor and IntensityRescaler) into the training loop. In doing so, we found out that the latter converted the rescaled images into an unsigned char datatype, so modifications had to be made to preserve the datatype for the loss computation. We suppose that this conversion was introduced for image plotting in the demonstration script; however, it hampered its use (without modification) for training the network.
- Secondly, we encountered an error related to mismatched dimensions in the skip connections. After some inspection, we managed to solve the issue by padding the input event tensors using the CropParameters class and later cropping the output image, in a similar way to the reconstruction class provided by the authors. We verified that the event tensors indeed changed their shape from [B, T, 5, 180, 240] to [B, T, 5, 184, 240] and that this was ultimately affecting the calculation of the input and output sizes of the encoders and decoders at instantiation (see the padding sketch after this list). However, there is no mention of such padding in the original paper, which caused some confusion about whether this class was required during training or only for the demonstration.
- Thirdly, we attempted a run over the data using as much information from the dataset as the original authors. To do so, we tried to use the information from the first 40 frames of each sequence by loading them as 5 sub-sequences of 8 frames with a shift of 8 (see Loading the data). We found that one of the sequences was actually only 36 frames long, which interrupted the training process and raised the question of how the authors handled this exception in the original training process.
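Regarding the padding issue above: with three encoders of stride 2, the spatial dimensions presumably need to be divisible by 2^3 = 8 for the skip connections to line up, which 240 satisfies but 180 does not. A minimal stand-in for such symmetric padding and cropping is sketched below (this is our own simplification, not the authors' CropParameters class).

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=8):
    """Pad H and W of a (..., H, W) tensor up to the next multiple; return tensor and padding."""
    h, w = x.shape[-2], x.shape[-1]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    # Split the padding evenly between the two sides (F.pad order: left, right, top, bottom).
    padding = (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2)
    return F.pad(x, padding), padding

def crop_back(x, padding):
    """Remove the padding added by pad_to_multiple from the network output."""
    left, right, top, bottom = padding
    return x[..., top:x.shape[-2] - bottom, left:x.shape[-1] - right]

events = torch.zeros(1, 5, 180, 240)
padded, padding = pad_to_multiple(events)   # -> shape (1, 5, 184, 240), as observed in training
restored = crop_back(padded, padding)       # -> back to (1, 5, 180, 240)
```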
We performed our experiments using a reduced setup compared to that of the authors. In the original paper, the network was unrolled for 40 time steps and trained for a total of 160 epochs, with additional data augmentation techniques such as random cropping, random rotations and random flipping. Due to the time constraints and the hardware limitations, we performed most of our experiments with a sequence length of 8 time steps and trained for 60 epochs, which resulted in training times of 12-24 hours depending on the hardware.
Once we obtained a system that produced interpretable images, we checked whether different learning rates and post-processing objects would alter the quality of the results. Additionally, we tried to improve the quality of our reconstruction by increasing the length of the sequence, for which we rented a GPU on a public server. Despite all our efforts, however, we did not succeed in recreating the level of detail achieved by the original authors, nor did we manage to remove some additional artifacts from the reconstruction.
Please note that some of the videos link to our own repository and need to be downloaded for visualization. We apologize for this, but GitHub seems to have a limit on the amount of media content the wiki page can hold, so we deemed it more relevant to prioritize the learning rate videos, which yielded larger differences.
From a qualitative perspective, the produced video reconstructions did not show any significant changes across the different unrolled sequence lengths. The video quality remains essentially the same regardless of the value of the parameter. Increasing the sequence length does seem to result in an increase in the value of the loss, as we can see in the table below. However, those changes are rather small and, as previously mentioned, do not result in any noticeable changes in the reconstructions.
Sequence Length | Training loss | Validation loss |
---|---|---|
8 | 610.90 | 34.775 |
12 | 627.40 | 35.290 |
16 | 628.89 | 35.170 |
20 | 643.46 | 35.110 |
34 | 662.61 | 36.410 |
It was observed that the learning rate had a small effect on the loss, as can be seen in the table below. The differences are more noticeable in the corresponding videos: smaller learning rates result in higher-contrast images but show slightly more bleeding and ghosting, while larger learning rates result in more saturated images in which it is harder to distinguish the shapes of the objects. The learning rate chosen by the authors therefore seems to be close to a "sweet spot", providing enough contrast to distinguish objects without incurring bleeding edges or ghosting.
Learning rate | Training loss | Validation loss |
---|---|---|
LR 0.001 | 618.67 | 34.288 |
LR 0.003 | 708.24 | 37.238 |
LR 0.0001 | 610.90 | 34.775 |
LR 0.0003 | 604.03 | 34.185 |
LR 0.00001 | 854.43 | 46.131 |
LR 0.00003 | 739.28 | 40.911 |
When it comes to changes in performance across the different post-processing configurations, no particular differences could be noticed by watching the recreated videos. Quantitatively, looking at the loss scores for each method, the best form of post-processing seems to be the application of the IntensityRescaler (IR), as can be seen in the table below. Adding the UnsharpMaskFilter (UMF) seems to systematically increase the loss (although by a very small amount).
It must be noted that the observed discrepancies in loss values are not very significant. As a result, we conclude that the effect of any form of post-processing is negligible (at least for the training conditions under which we performed our experiments).
Postprocessing | Training loss | Validation loss |
---|---|---|
IR | 610.90 | 34.775 |
None | 623.05 | 35.434 |
IR + UMF | 626.42 | 35.426 |
UMF | 618.95 | 34.993 |
Our project aimed at reproducing the paper ["High Speed and High Dynamic Range Video with an Event Camera"](https://arxiv.org/abs/1906.07165) by H. Rebecq et al. Due to the hardware limitations and the time constraints of the project, we had to carry out our experiments with a reduced training duration and a shorter sequence length for training the recurrent architecture.
Our experiments show that the contrast and dynamic range of the solution, at least for a fixed number of epochs, are slightly affected by the learning rate, with the value chosen by the authors being close to the optimal choice. With regard to the post-processing objects and the sequence length, the solution is rather robust, generating an output that is largely invariant to these changes. This was quite surprising in the case of the sequence length, since longer sequences should allow the network to establish longer-term relationships. Considering that the network was trained for a similar duration regardless of the other parameters, we suspect that the much longer training, with the corresponding data augmentation, is what yields the improved reconstructions reported by the authors.
We tried to distribute the work as much as possible, so that everyone would contribute to the documentation, code generation and experiments. The overall distribution can be seen below:
Rubén Martín Rodríguez
- Organization and coordination
- Documentation on existing code
- Network instantiation and train setup scripts
- Flow map and reconstruction loss definition
- Debugging and modifications to training loop and dataset classes
- Blog creation and development
- Repository maintenance
- Experiments on post-processing effects
Paul Féry
- Dataset parsing
- Sequence loader and Dataset Class creation
- Experiments on sequence length robustness
- Assistance in documentation and blog creation
- Video creation and recording
Xinjie Liu
- Training loop definition
- Reconstruction loss definition
- Padding input functions
- Description of training loop for GitHub documentation
- Assistance in pdf documentation
- Experiments on sequence length performance