# Stereo Vision

Each entry below lists a paper's title, date, abstract, comment, and code repository.
**Back to the Future Cyclopean Stereo: a human perception approach unifying deep and geometric constraints** (2025-02-28)

We innovate in stereo vision by explicitly providing analytical 3D surface models, as viewed by a cyclopean eye model, that incorporate depth discontinuities and occlusions. This geometrical foundation, combined with learned stereo features, allows our system to benefit from the strengths of both approaches. We also invoke a prior monocular model of surfaces to fill in occlusion regions or texture-less regions where data matching is not sufficient. Our results are already on par with the state-of-the-art purely data-driven methods and are of much better visual quality, emphasizing the importance of the 3D geometrical model in capturing critical visual information. Such qualitative improvements may find applicability in virtual reality, for a better human experience, as well as in robotics, for reducing critical errors. Our approach aims to demonstrate that understanding and modeling the geometrical properties of 3D surfaces is beneficial to computer vision research.

Code: None
**Mean of Means: Human Localization with Calibration-free and Unconstrained Camera Settings (extended version)** (2025-02-18)

Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high-precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints. To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body's geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 96% within a 0.3 m range and nearly 100% accuracy within a 0.5 m range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.

Comment: arXiv admin note: substantial text overlap with arXiv:2407.20870
Code: None
**Mean of Means: A 10-dollar Solution for Human Localization with Calibration-free and Unconstrained Camera Settings** (2025-01-25)

Accurate human localization is crucial for various applications, especially in the Metaverse era. Existing high-precision solutions rely on expensive, tag-dependent hardware, while vision-based methods offer a cheaper, tag-free alternative. However, current vision solutions based on stereo vision face limitations due to rigid perspective transformation principles and error propagation in multi-stage SVD solvers. These solutions also require multiple high-resolution cameras with strict setup constraints. To address these limitations, we propose a probabilistic approach that considers all points on the human body as observations generated by a distribution centered around the body's geometric center. This enables us to improve sampling significantly, increasing the number of samples for each point of interest from hundreds to billions. By modeling the relation between the means of the distributions of world coordinates and pixel coordinates, leveraging the Central Limit Theorem, we ensure normality and facilitate the learning process. Experimental results demonstrate human localization accuracy of 95% within a 0.3 m range and nearly 100% accuracy within a 0.5 m range, achieved at a low cost of only 10 USD using two web cameras with a resolution of 640×480 pixels.

Code: None
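
As a concrete illustration of the mean-of-means idea, here is a toy sketch (not the authors' code) using synthetic data and a made-up linear projection: per-point pixel samples are averaged into per-camera means, and a least-squares regressor maps the stacked means to the world-frame body center.

```python
# Toy sketch of the "mean of means" idea (illustrative only): treat every
# detected body pixel as a sample from a distribution centred on the body's
# geometric centre, then learn a map from mean pixel coordinates (two cameras)
# to the mean world position. All names and data here are made up.
import numpy as np

rng = np.random.default_rng(0)

def observe(center_xy, n_points=500, noise=15.0):
    """Simulate body-pixel detections scattered around a projected centre."""
    return center_xy + rng.normal(scale=noise, size=(n_points, 2))

# Synthetic training data: world centres and their pixel projections in two
# cameras. A real system would use an actual camera projection model.
world = rng.uniform(0, 5, size=(200, 2))            # ground-plane positions (m)
proj_a = world * 100 + np.array([320, 240])         # fake camera A projection
proj_b = world * 90 + np.array([300, 260])          # fake camera B projection

# Mean of means: average the per-point samples, then stack both cameras.
feats = np.stack([
    np.concatenate([observe(a).mean(axis=0), observe(b).mean(axis=0)])
    for a, b in zip(proj_a, proj_b)
])

# By the CLT the sample means are approximately Gaussian, so a simple
# least-squares regressor from mean pixels to world centre is well behaved.
X = np.hstack([feats, np.ones((len(feats), 1))])    # add bias column
W, *_ = np.linalg.lstsq(X, world, rcond=None)

pred = X @ W
print("mean localisation error (m):", np.linalg.norm(pred - world, axis=1).mean())
```
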
**YO-CSA-T: A Real-time Badminton Tracking System Utilizing YOLO Based on Contextual and Spatial Attention** (2025-01-11)

The 3D trajectory of a shuttlecock required for a badminton rally robot for human-robot competition demands real-time performance with high accuracy. However, the fast flight speed of the shuttlecock, along with various visual effects and its tendency to blend with environmental elements such as court lines and lighting, presents challenges for rapid and accurate 2D detection. In this paper, we first propose the YO-CSA detection network, which optimizes and reconfigures the YOLOv8s model's backbone, neck, and head by incorporating contextual and spatial attention mechanisms to enhance the model's ability to extract and integrate both global and local features. Next, we integrate three major subtasks, detection, prediction, and compensation, into a real-time 3D shuttlecock trajectory detection system. Specifically, our system maps the 2D coordinate sequence extracted by YO-CSA into 3D space using stereo vision, then predicts the future 3D coordinates based on historical information, and re-projects them onto the left and right views to update the position constraints for 2D detection. Additionally, our system includes a compensation module to fill in missing intermediate frames, ensuring a more complete trajectory. We conduct extensive experiments on our own dataset to evaluate both YO-CSA's performance and system effectiveness. Experimental results show that YO-CSA achieves a high accuracy of 90.43% mAP@0.5, surpassing both YOLOv8s and YOLO11s. Our system performs excellently, maintaining a speed of over 130 fps across 12 test sequences.

Comment: 8 pages, 14 figures
Code: None
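
For readers unfamiliar with the 2D-to-3D-to-2D loop such a tracking system relies on, the following hedged sketch triangulates a detection from a rectified stereo pair and re-projects a 3D point back into both views with OpenCV; the projection matrices and pixel coordinates are illustrative, not taken from the paper.

```python
# Minimal sketch of the 2D -> 3D -> 2D loop: triangulate a detection seen in
# rectified left/right views, then re-project a 3D point back into both views.
# P_left/P_right would come from stereo calibration; values here are made up.
import numpy as np
import cv2

# 3x4 projection matrices of a rectified stereo pair (f = 700 px, baseline 0.1 m).
K = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.], [0.]])])

# Shuttlecock detection (pixel centres) in the left and right images.
uv_left = np.array([[350.0], [200.0]])
uv_right = np.array([[330.0], [200.0]])

X_h = cv2.triangulatePoints(P_left, P_right, uv_left, uv_right)  # homogeneous 4x1
X = (X_h[:3] / X_h[3]).ravel()
print("3D position (m):", X)

def reproject(P, X):
    """Project a 3D point with a 3x4 matrix, returning pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Re-projecting a predicted 3D point gives the position constraint for the
# next frame's 2D detector, as in the system described above.
print("left reprojection:", reproject(P_left, X))
print("right reprojection:", reproject(P_right, X))
```
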
**H-Net: A Multitask Architecture for Simultaneous 3D Force Estimation and Stereo Semantic Segmentation in Intracardiac Catheters** (2024-12-31)

The success rate of catheterization procedures is closely linked to the sensory data provided to the surgeon. Vision-based deep learning models can deliver both tactile and visual information in a sensor-free manner, while also being cost-effective to produce. Given the complexity of these models for devices with limited computational resources, research has focused on force estimation and catheter segmentation separately. However, there is a lack of a comprehensive architecture capable of simultaneously segmenting the catheter from two different angles and estimating the applied forces in 3D. To bridge this gap, this work proposes a novel, lightweight, multi-input, multi-output encoder-decoder-based architecture. It is designed to segment the catheter from two points of view and concurrently measure the applied forces in the x, y, and z directions. This network processes two simultaneous X-ray images, intended to be fed by a biplane fluoroscopy system, showing a catheter's deflection from different angles. It uses two parallel sub-networks with shared parameters to output two segmentation maps corresponding to the inputs. Additionally, it leverages stereo vision to estimate the applied forces at the catheter's tip in 3D. The architecture features two input channels, two classification heads for segmentation, and a regression head for force estimation through a single end-to-end architecture. The output of all heads was assessed and compared with the literature, demonstrating state-of-the-art performance in both segmentation and force estimation. To the best of the authors' knowledge, this is the first time such a model has been proposed.

Code: None
**Data Fusion of Semantic and Depth Information in the Context of Object Detection** (2024-12-04)

Considerable research has already been conducted on autonomous driving in the modern era. An autonomous driving system must be extremely good at detecting objects surrounding the car to ensure safety. In this paper, classification and estimation of an object's (pedestrian's) position (with respect to an ego 3D coordinate system) are studied, and the distance between the ego vehicle and the object in the context of autonomous driving is measured. To classify the object, a Faster Region-based Convolutional Neural Network (R-CNN) with Inception v2 is utilized. First, a network is trained with a customized dataset to estimate the reference position of objects as well as the distance from the vehicle. From camera calibration to computing the distance, cutting-edge computer vision algorithms are applied in a series of processes to generate a 3D reference point for the region of interest. The foremost step in this process is generating a disparity map using the concept of stereo vision.

Code: None
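
The disparity-to-depth step mentioned in this abstract is standard; a minimal OpenCV sketch, with illustrative matcher parameters and synthetic stand-in images, looks like this:

```python
# Sketch of the disparity -> depth step: semi-global matching on a rectified
# stereo pair, then Z = f * B / d. Parameter values are illustrative, and a
# noise image shifted by 8 px stands in for a real rectified pair.
import cv2
import numpy as np

left = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
right = np.roll(left, -8, axis=1)          # uniform 8-px disparity everywhere

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,                    # must be a multiple of 16
    blockSize=5,
    P1=8 * 5 * 5,                          # smoothness penalty, small jumps
    P2=32 * 5 * 5,                         # smoothness penalty, large jumps
    uniquenessRatio=10,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Depth from disparity for a rectified pair: Z = f * B / d,
# with focal length f in pixels and baseline B in metres.
f_px, baseline_m = 700.0, 0.12
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * baseline_m / disparity[valid]
print("median scene depth (m):", np.median(depth[valid]))   # roughly 700*0.12/8
```
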
**Bidirectional Stereo Image Compression with Cross-Dimensional Entropy Model** (2024-10-26)

With the rapid advancement of stereo vision technologies, stereo image compression has emerged as a crucial field that continues to draw significant attention. Previous approaches have primarily employed a unidirectional paradigm, where the compression of one view is dependent on the other, resulting in imbalanced compression. To address this issue, we introduce a symmetric bidirectional stereo image compression architecture, named BiSIC. Specifically, we propose a 3D convolution based codec backbone to capture local features and incorporate bidirectional attention blocks to exploit global features. Moreover, we design a novel cross-dimensional entropy model that integrates various conditioning factors, including the spatial context, channel context, and stereo dependency, to effectively estimate the distribution of latent representations for entropy coding. Extensive experiments demonstrate that our proposed BiSIC outperforms conventional image/video compression standards, as well as state-of-the-art learning-based methods, in terms of both PSNR and MS-SSIM.

Comment: ECCV 2024
Code: None
**Bio-inspired reconfigurable stereo vision for robotics using omnidirectional cameras** (2024-10-11)

This work introduces a novel bio-inspired reconfigurable stereo vision system for robotics, leveraging omnidirectional cameras and a novel algorithm to achieve flexible visual capabilities. Inspired by the adaptive vision of various species, our visual system addresses traditional stereo vision limitations, i.e., immutable camera alignment with narrow fields of view, by introducing a reconfigurable stereo vision system to robotics. Our key innovations include the reconfigurable stereo vision strategy that allows dynamic camera alignment, a robust depth measurement system utilizing a nonrectified geometrical method combined with a deep neural network for feature matching, and a geometrical compensation technique to enhance visual accuracy. Implemented on a metamorphic robot, this vision system demonstrates its great adaptability to various scenarios by switching its configurations of 316° monocular with 79° binocular field for fast target seeking and 242° monocular with 150° binocular field for detailed close inspection.

Comment: 7 pages, 8 figures, submitted to IEEE ICRA 2025
Code: None
**HiSplat: Hierarchical 3D Gaussian Splatting for Generalizable Sparse-View Reconstruction** (2024-10-08)

Reconstructing 3D scenes from multiple viewpoints is a fundamental task in stereo vision. Recently, advances in generalizable 3D Gaussian Splatting have enabled high-quality novel view synthesis for unseen scenes from sparse input views by feed-forward predicting per-pixel Gaussian parameters without extra optimization. However, existing methods typically generate single-scale 3D Gaussians, which lack representation of both large-scale structure and texture details, resulting in mislocation and artefacts. In this paper, we propose a novel framework, HiSplat, which introduces a hierarchical manner in generalizable 3D Gaussian Splatting to construct hierarchical 3D Gaussians via a coarse-to-fine strategy. Specifically, HiSplat generates large coarse-grained Gaussians to capture large-scale structures, followed by fine-grained Gaussians to enhance delicate texture details. To promote inter-scale interactions, we propose an Error Aware Module for Gaussian compensation and a Modulating Fusion Module for Gaussian repair. Our method achieves joint optimization of hierarchical representations, allowing for novel view synthesis using only two-view reference images. Comprehensive experiments on various datasets demonstrate that HiSplat significantly enhances reconstruction quality and cross-dataset generalization compared to prior single-scale methods. The corresponding ablation study and analysis of different-scale 3D Gaussians reveal the mechanism behind the effectiveness. Project website: https://open3dvlab.github.io/HiSplat/

Code: Link
**Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Utilizing Deep Learning and YOLO Integration** (2024-10-06)

This research focuses on the development of a drone equipped with pruning tools and a stereo vision camera to accurately detect and measure the spatial positions of tree branches. YOLO is employed for branch segmentation, while two depth estimation approaches, monocular and stereo, are investigated. In comparison to SGBM, deep learning techniques produce more refined and accurate depth maps. In the absence of ground-truth data, a fine-tuning process using deep neural networks is applied to approximate optimal depth values. This methodology facilitates precise branch detection and distance measurement, addressing critical challenges in the automation of pruning operations. The results demonstrate notable advancements in both accuracy and efficiency, underscoring the potential of deep learning to drive innovation and enhance automation in the agricultural sector.

Code: None
**Fast Object Detection with a Machine Learning Edge Device** (2024-10-05)

This machine learning study investigates a low-cost edge device integrated with an embedded system having computer vision, resulting in improved performance in inference time and precision of object detection and classification. A primary aim of this study was to reduce inference time and power consumption, to enable an embedded device on a competition-ready autonomous humanoid robot, and to support real-time object recognition, scene understanding, visual navigation, motion planning, and autonomous navigation of the robot. This study compares inference time performance between a central processing unit (CPU), a graphical processing unit (GPU), and a tensor processing unit (TPU). CPUs, GPUs, and TPUs are all processors that can be used for machine learning tasks. Related to the aim of supporting an autonomous humanoid robot, there was an additional effort to observe whether there was a significant difference between using a camera with monocular vision and one with stereo vision capability. TPU inference time results for this study reflect a 25% reduction in time over the GPU, and an 87.5% reduction in inference time compared to the CPU. Much of the information in this paper contributed to the final selection of Google's Coral brand Edge TPU device. The Arduino Nano 33 BLE Sense TinyML Kit was also considered for comparison, but due to initial incompatibilities, and in the interest of time to complete this study, a decision was made to review the kit in a future experiment.

Code: None
**Individuation of 3D perceptual units from neurogeometry of binocular cells** (2024-10-03)

We model the functional architecture of the early stages of three-dimensional vision by extending the neurogeometric sub-Riemannian model for stereo-vision introduced in [BCSZ23]. A new framework for correspondence is introduced that integrates a neural-based algorithm to achieve stereo correspondence locally while, simultaneously, organizing the corresponding points into global perceptual units. The result is an effective scene segmentation. We achieve this using harmonic analysis on the sub-Riemannian structure and show, in a comparison against Riemannian distance, that the sub-Riemannian metric is central to the solution.

Comment: 30 pages, 13 figures
Code: None
**Drone Stereo Vision for Radiata Pine Branch Detection and Distance Measurement: Integrating SGBM and Segmentation Models** (2024-09-26)

Manual pruning of radiata pine trees presents significant safety risks due to their substantial height and the challenging terrains in which they thrive. To address these risks, this research proposes the development of a drone-based pruning system equipped with specialized pruning tools and a stereo vision camera, enabling precise detection and trimming of branches. Deep learning algorithms, including YOLO and Mask R-CNN, are employed to ensure accurate branch detection, while the Semi-Global Matching algorithm is integrated to provide reliable distance estimation. The synergy between these techniques facilitates the precise identification of branch locations and enables efficient, targeted pruning. Experimental results demonstrate that the combined implementation of YOLO and SGBM enables the drone to accurately detect branches and measure their distances from the drone. This research not only improves the safety and efficiency of pruning operations but also makes a significant contribution to the advancement of drone technology in the automation of agricultural and forestry practices, laying a foundational framework for further innovations in environmental management.

Code: None
**EF-Calib: Spatiotemporal Calibration of Event- and Frame-Based Cameras Using Continuous-Time Trajectories** (2024-09-25)

The event camera, a bio-inspired asynchronously triggered camera, offers promising prospects for fusion with frame-based cameras owing to its low latency and high dynamic range. However, calibrating stereo vision systems that incorporate both event and frame-based cameras remains a significant challenge. In this letter, we present EF-Calib, a spatiotemporal calibration framework for event- and frame-based cameras using continuous-time trajectories. A novel calibration pattern applicable to both camera types and a corresponding event recognition algorithm are proposed. Leveraging the asynchronous nature of events, a differentiable piecewise B-spline representing camera pose continuously is introduced, enabling calibration of intrinsic parameters, extrinsic parameters, and time offset, with analytical Jacobians provided. Various experiments are carried out to evaluate the calibration performance of EF-Calib, including calibration experiments for intrinsic parameters, extrinsic parameters, and time offset. Experimental results show that EF-Calib achieves the most accurate intrinsic parameters compared to the current SOTA, extrinsic-parameter accuracy close to that of frame-based results, and accurate time offset estimation. EF-Calib provides a convenient and accurate toolbox for calibrating systems that fuse events and frames. The code of this paper will also be open-sourced at: https://github.com/wsakobe/EF-Calib.

Comment: Accepted by IEEE Robotics and Automation Letters
Code: Link
**D3RoMa: Disparity Diffusion-based Depth Sensing for Material-Agnostic Robotic Manipulation** (2024-09-25)

Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth, which bottlenecks robot performance. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporate a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieves state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.

Code: None
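
The left-right consistency constraint used as classifier guidance above is, in its classical form, a simple cross-check between the two disparity maps; a purely illustrative numpy sketch:

```python
# Classical left-right consistency check: a left-view disparity is kept only
# if mapping it into the right view and reading back the right view's
# disparity (nearly) agrees with it. Toy data, illustrative tolerance.
import numpy as np

def lr_consistency_mask(disp_left, disp_right, tol=1.0):
    """Boolean mask of left-view pixels whose disparities are LR-consistent."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    # For each left pixel x, the matching right-view column is x - d_left(x).
    x_right = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
    d_right = np.take_along_axis(disp_right, x_right, axis=1)
    # Consistent if the right view's disparity agrees with the left's.
    return np.abs(disp_left - d_right) < tol

disp_l = np.full((4, 8), 2.0)
disp_r = np.full((4, 8), 2.0)
disp_r[:, :3] = 5.0                      # corrupt a region to show rejection
print(lr_consistency_mask(disp_l, disp_r))
```
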
**Object Depth and Size Estimation using Stereo-vision and Integration with SLAM** (2024-09-11)

Autonomous robots use simultaneous localization and mapping (SLAM) for efficient and safe navigation in various environments. LiDAR sensors are integral in these systems for object identification and localization. However, LiDAR systems, though effective in detecting solid objects (e.g., trash bins, bottles), encounter limitations in identifying semitransparent or non-tangible objects (e.g., fire, smoke, steam) due to poor reflecting characteristics. Additionally, LiDAR also fails to detect features such as navigation signs and often struggles to detect certain hazardous materials that lack a distinct surface for effective laser reflection. In this paper, we propose a highly accurate stereo-vision approach to complement LiDAR in autonomous robots. The system employs advanced stereo vision-based object detection to detect both tangible and non-tangible objects and then uses simple machine learning to precisely estimate the depth and size of the object. The depth and size information is then integrated into the SLAM process to enhance the robot's navigation capabilities in complex environments. Our evaluation, conducted on an autonomous robot equipped with LiDAR and stereo-vision systems, demonstrates high accuracy in the estimation of an object's depth and size. A video illustration of the proposed scheme is available at https://www.youtube.com/watch?v=nusI6tA9eSk.

Comment: Accepted version of the published article in IEEE Sensors Letters
Code: None
**sweet: An Open Source Modular Platform for Contactless Hand Vascular Biometric Experiments** (2024-09-11)

Current finger-vein or palm-vein recognition systems usually require direct contact of the subject with the apparatus. This can be problematic in environments where hygiene is of primary importance. In this work we present a contactless vascular biometrics sensor platform named sweet, which can be used for hand vascular biometrics studies (wrist, palm, and finger-vein) and surface features such as palmprint. It supports several acquisition modalities such as multi-spectral Near-Infrared (NIR), RGB-color, Stereo Vision (SV) and Photometric Stereo (PS). Using this platform we collect a dataset consisting of the fingers, palm and wrist vascular data of 120 subjects and develop a powerful 3D pipeline for the pre-processing of this data. We then present biometric experimental results, focusing on Finger-Vein Recognition (FVR). Finally, we discuss fusion of multiple modalities, such as palm-vein combined with palm-print biometrics. The acquisition software, parts of the hardware design, the new FV dataset, as well as source code for our experiments are publicly available for research purposes.

Code: None
**Extending 6D Object Pose Estimators for Stereo Vision** (2024-09-10)

Estimating the 6D pose of objects accurately, quickly, and robustly remains a difficult task. However, recent methods for directly regressing poses from RGB images using dense features have achieved state-of-the-art results. Stereo vision, which provides an additional perspective on the object, can help reduce pose ambiguity and occlusion. Moreover, stereo can directly infer the distance of an object, while mono-vision requires internalized knowledge of the object's size. To extend the state-of-the-art in 6D object pose estimation to stereo, we created a BOP compatible stereo version of the YCB-V dataset. Our method outperforms state-of-the-art 6D pose estimation algorithms by utilizing stereo vision and can easily be adopted for other dense feature-based algorithms.

Comment: 4th International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI)
Code: None
**Three-dimensional Morphological Reconstruction of Millimeter-Scale Soft Continuum Robots based on Dual-Stereo-Vision** (2024-08-15)

Continuum robots can be miniaturized to just a few millimeters in diameter. Among these, notched tubular continuum robots (NTCR) show great potential in many delicate applications. Existing works in robotic modeling focus on kinematics and dynamics but still face challenges in reproducing the robot's morphology -- a significant factor that can expand the research landscape of continuum robots, especially for those with asymmetric continuum structures. This paper proposes a dual stereo vision-based method for the three-dimensional morphological reconstruction of millimeter-scale NTCRs. The method employs two oppositely located stationary binocular cameras to capture the point cloud of the NTCR, then utilizes predefined geometry as a reference for the KD-tree method to relocate the captured point clouds, resulting in a morphologically correct NTCR despite the low-quality raw point cloud collection. The method has been proven feasible for an NTCR with a 3.5 mm diameter, capturing 14 out of 16 notch features, with the measurements generally centered around the standard of 1.5 mm, demonstrating the capability of revealing morphological details. Our proposed method paves the way for 3D morphological reconstruction of millimeter-scale soft robots for further self-modeling studies.

Comment: 6 pages, 6 figures, submitted to Robio 2024
Code: None
**Photogrammetry for Digital Twinning Industry 4.0 (I4) Systems** (2024-07-12)

The onset of Industry 4.0 is rapidly transforming the manufacturing world through the integration of cloud computing, machine learning (ML), artificial intelligence (AI), and universal network connectivity, resulting in performance optimization and increased productivity. Digital Twins (DT) are one such transformational technology that leverages software systems to replicate physical process behavior, representing the physical process in a digital environment. This paper aims to explore the use of photogrammetry (the process of reconstructing physical objects into virtual 3D models using photographs) and 3D scanning techniques to create an accurate visual representation of the 'physical process', to interact with the ML/AI-based behavior models. To achieve this, we have used a readily available consumer device, the iPhone 15 Pro, which features stereo vision capabilities, to capture the depth of an Industry 4.0 system. By processing these images using 3D scanning tools, we created a raw 3D model for 3D modeling and rendering software for the creation of a DT model. The paper highlights the reliability of this method by measuring the error rate between the ground truth (measurements done manually using a tape measure) and the final 3D model created using this method. The overall mean error is 4.97% and the overall standard deviation of the error is 5.54% between the ground truth measurements and their photogrammetry counterparts. The results from this work indicate that photogrammetry using consumer-grade devices can be an efficient and cost-effective approach to creating DTs for smart manufacturing, while the approach's flexibility allows for iterative improvement of the models over time.

Code: None
**Stereo Vision Based Robot for Remote Monitoring with VR Support** (2024-06-27)

Machine vision systems have been playing a significant role in visual monitoring systems. With the help of stereo vision and machine learning, they are able to mimic a human-like visual system and behaviour towards the environment. In this paper, we present a stereo vision based 3-DOF robot which can be used to monitor places remotely using a cloud server and internet devices. The 3-DOF robot transmits human-like head movements, i.e., yaw, pitch, and roll, and produces a 3D stereoscopic video stream in real time. This video stream is sent to the user through any generic internet device with VR box support, e.g., a smartphone, giving the user a first-person, real-time 3D experience, while the head motion of the user is transferred to the robot, also in real time. The robot is also able to track moving objects and faces as targets using deep neural networks, which enables it to be a standalone monitoring robot. The user is able to choose specific subjects to monitor in a space. Stereo vision enables tracking the depth information of the detected objects; objects of human interest are tracked along with their distances, and this information is sent to the cloud. A full working prototype is developed which showcases the capabilities of a monitoring system based on stereo vision, robotics, and machine learning.

Comment: 6 pages, 10 figures
Code: None
**Python-based DSL for generating Verilog model of Synchronous Digital Circuits** (2024-06-13)

We have designed a Python-based Domain Specific Language (DSL) for modeling synchronous digital circuits. In this DSL, hardware is modeled as a collection of transactions -- running in series, parallel, and loops. When the model is executed by a Python interpreter, synthesizable and behavioural Verilog is generated as output, which can be integrated with other RTL designs or directly used for FPGA and ASIC flows. In this paper, we describe: 1) the language (DSL), which allows users to express computation in series/parallel/loop constructs, with explicit cycle boundaries; 2) the internals of a simple Python implementation to produce synthesizable Verilog; and 3) several design examples and case studies for applications in post-quantum cryptography, stereo vision, digital signal processing, and optimization techniques. In the end, we list ideas to extend this framework.

Comment: 9 pages, 13 figures
Code: None
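
To make the transaction-to-Verilog idea concrete, here is a toy illustration (not the paper's DSL, and far simpler): a single clocked "transaction" object that emits a synthesizable module when the Python model is executed.

```python
# Toy sketch of a Python object emitting synthesizable Verilog (illustrative
# only). A real DSL would compose series/parallel/loop transactions; here a
# single registered-adder transaction is modelled.
class RegAdd:
    """One clocked transaction: acc <= acc + inp on every cycle."""
    def __init__(self, width):
        self.width = width

    def to_verilog(self, name="regadd"):
        w = self.width - 1
        return f"""module {name}(
  input clk, input rst,
  input [{w}:0] inp,
  output reg [{w}:0] acc
);
  always @(posedge clk) begin
    if (rst) acc <= 0;
    else     acc <= acc + inp;   // one transaction per clock cycle
  end
endmodule"""

print(RegAdd(width=8).to_verilog())
```
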
**Multi-Modal UAV Detection, Classification and Tracking Algorithm -- Technical Report for CVPR 2024 UG2 Challenge** (2024-05-26)

This technical report presents the 1st winning model for UG2+, a task in the CVPR 2024 UAV Tracking and Pose-Estimation Challenge. The challenge involves drone detection, UAV-type classification, and 2D/3D trajectory estimation in extreme weather conditions using multi-modal sensor information, including stereo vision, various Lidars, Radars, and audio arrays. Leveraging this information, we propose a multi-modal UAV detection, classification, and 3D tracking method for accurate UAV classification and tracking. A novel classification pipeline which incorporates sequence fusion, region of interest (ROI) cropping, and keyframe selection is proposed. Our system integrates cutting-edge classification techniques and sophisticated post-processing steps to boost accuracy and robustness. The designed pose estimation pipeline incorporates three modules: dynamic points analysis, a multi-object tracker, and trajectory completion techniques. Extensive experiments have validated the effectiveness and precision of our approach. In addition, we also propose a novel dataset pre-processing method and conduct a comprehensive ablation study for our design. We finally achieved the best performance in the classification and tracking of the MMAUD dataset. The code and configuration of our method are available at https://github.com/dtc111111/Multi-Modal-UAV.

Comment: Accepted by CVPR 2024 workshop. The 1st winning model in CVPR 2024 UG2+ challenge. The code and configuration of our method are available at https://github.com/dtc111111/Multi-Modal-UAV
Code: Link
**Geometry-Informed Distance Candidate Selection for Adaptive Lightweight Omnidirectional Stereo Vision with Fisheye Images** (2024-05-08)

Multi-view stereo omnidirectional distance estimation usually needs to build a cost volume with many hypothetical distance candidates. The cost volume building process is often computationally heavy, considering the limited resources a mobile robot has. We propose a new geometry-informed distance candidate selection method which enables the use of a very small number of candidates and reduces the computational cost. We demonstrate the use of the geometry-informed candidates in a set of model variants. We find that by adjusting the candidates during robot deployment, our geometry-informed distance candidates also improve a pre-trained model's accuracy if the extrinsics or the number of cameras changes. Without any re-training or fine-tuning, our models outperform models trained with evenly distributed distance candidates. Models are also released as hardware-accelerated versions with a new dedicated large-scale dataset. The project page, code, and dataset can be found at https://theairlab.org/gicandidates/.

Code: None
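
A common geometry-aware choice, shown below as a generic sketch (not necessarily the paper's scheme), is to space distance hypotheses uniformly in inverse distance between a minimum and maximum range, so that candidates concentrate where disparity resolution actually matters and the range limits can be adapted at deployment time without retraining:

```python
# Generic inverse-distance candidate spacing for a cost volume (illustrative).
import numpy as np

def inverse_distance_candidates(d_min, d_max, n):
    """n distance hypotheses spaced uniformly in 1/d between d_min and d_max."""
    inv = np.linspace(1.0 / d_max, 1.0 / d_min, n)
    return 1.0 / inv[::-1]          # return ascending distances

print(inverse_distance_candidates(0.5, 20.0, 8).round(2))
```
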
**DMODE: Differential Monocular Object Distance Estimation Module without Class Specific Information** (2024-05-07)

Utilizing a single camera for measuring object distances is a cost-effective alternative to stereo vision and LiDAR. Although monocular distance estimation has been explored in the literature, most existing techniques rely on object class knowledge to achieve high performance. Without this contextual data, monocular distance estimation becomes more challenging, lacking reference points and object-specific cues. However, these cues can be misleading for objects with wide-ranging variation or in adversarial situations, which is a challenging aspect of object-agnostic distance estimation. In this paper, we propose DMODE, a class-agnostic method for monocular distance estimation that does not require object class knowledge. DMODE estimates an object's distance by fusing its fluctuation in size over time with the camera's motion, making it adaptable to various object detectors and unknown objects, thus addressing these challenges. We evaluate our model on the KITTI MOTS dataset using ground-truth bounding box annotations and outputs from TrackRCNN and EagerMOT. The object's location is determined using the change in bounding box sizes and camera position, without relying on the object's detection source or class attributes. Our approach demonstrates superior performance in multi-class object distance detection scenarios compared to conventional methods.

Comment: 7 pages, 3 figures, 3 tables
Code: None
**HawkDrive: A Transformer-driven Visual Perception System for Autonomous Driving in Night Scene** (2024-05-06)

Many established vision perception systems for autonomous driving scenarios ignore the influence of light conditions, one of the key elements of driving safety. To address this problem, we present HawkDrive, a novel perception system with hardware and software solutions. Hardware that utilizes stereo vision perception, which has been demonstrated to be a more reliable way of estimating depth information than monocular vision, is partnered with the edge computing device Nvidia Jetson Xavier AGX. Our software for low light enhancement, depth estimation, and semantic segmentation tasks is a transformer-based neural network. Our software stack, which enables fast inference and noise reduction, is packaged into system modules in Robot Operating System 2 (ROS2). Our experimental results have shown that the proposed end-to-end system is effective in improving depth estimation and semantic segmentation performance. Our dataset and codes will be released at https://github.com/ZionGo6/HawkDrive.

Comment: Accepted by IEEE IV 2024
Code: Link
**A Minimal Set of Parameters Based Depth-Dependent Distortion Model and Its Calibration Method for Stereo Vision Systems** (2024-05-01)

Depth position highly affects lens distortion, especially in close-range photography, which limits the measurement accuracy of existing stereo vision systems. Moreover, traditional depth-dependent distortion models and their calibration methods have remained complicated. In this work, we propose a minimal-set-of-parameters based depth-dependent distortion model (MDM), which considers the radial and decentering distortions of the lens, to improve the accuracy of stereo vision systems and simplify their calibration process. In addition, we present an easy and flexible calibration method for the MDM of stereo vision systems with a commonly used planar pattern, which requires cameras to observe the planar pattern in different orientations. The proposed technique is easy to use and flexible compared with classical calibration techniques for depth-dependent distortion models, in which the lens must be perpendicular to the planar pattern. Experimental validation of the MDM and its calibration method showed that the MDM improved the calibration accuracy by 56.55% and 74.15% compared with Li's distortion model and the traditional Brown distortion model, respectively. Besides, an iteration-based reconstruction method is proposed to iteratively estimate the depth information in the MDM during three-dimensional reconstruction. The results showed that the accuracy of the iteration-based reconstruction method was improved by 9.08% compared with that of the non-iterative reconstruction method.

Comment: This paper has been accepted for publication in IEEE Transactions on Instrumentation and Measurement
Code: None
**SAT-NGP: Unleashing Neural Graphics Primitives for Fast Relightable Transient-Free 3D reconstruction from Satellite Imagery** (2024-03-27)

Current stereo-vision pipelines produce high accuracy 3D reconstruction when using multiple pairs or triplets of satellite images. However, these pipelines are sensitive to the changes between images that can occur as a result of multi-date acquisitions. Such variations are mainly due to variable shadows, reflections and transient objects (cars, vegetation). To take such changes into account, Neural Radiance Fields (NeRF) have recently been applied to multi-date satellite imagery. However, neural methods are very compute-intensive, taking dozens of hours to learn, compared with minutes for standard stereo-vision pipelines. Following the ideas of Instant Neural Graphics Primitives, we propose to use an efficient sampling strategy and multi-resolution hash encoding to accelerate the learning. Our model, Satellite Neural Graphics Primitives (SAT-NGP), decreases the learning time to 15 minutes while maintaining the quality of the 3D reconstruction.

Comment: 5 pages, 3 figures, 1 table; Accepted to International Geoscience and Remote Sensing Symposium (IGARSS) 2024; Code available at https://github.com/Ellimac0/SAT-NGP
Code: Link
**Landmark-based Localization using Stereo Vision and Deep Learning in GPS-Denied Battlefield Environment** (2024-02-19)

Localization in a battlefield environment is increasingly challenging as GPS connectivity is often denied or unreliable, and physical deployment of anchor nodes across wireless networks for localization can be difficult in hostile battlefield terrain. Existing range-free localization methods rely on radio-based anchors and their average hop distance, which suffers from poor accuracy and stability in dynamic and sparse wireless network topologies. Vision-based methods like SLAM and Visual Odometry use expensive sensor fusion techniques for map generation and pose estimation. This paper proposes a novel framework for localization in non-GPS battlefield environments using only passive camera sensors, considering naturally existing or artificial landmarks as anchors. The proposed method utilizes a custom-calibrated stereo vision camera for distance estimation and the YOLOv8s model, which is trained and fine-tuned with our real-world dataset for landmark recognition. The depth images are generated using an efficient stereo-matching algorithm, and distances to landmarks are determined by extracting the landmark depth feature using a bounding box predicted by the landmark recognition model. The position of the unknown node is then obtained using an efficient least-squares algorithm and optimized using the L-BFGS-B (limited-memory quasi-Newton code for bound-constrained optimization) method. Experimental results demonstrate that our proposed framework performs better than existing anchor-based DV-Hop algorithms and competes with the most efficient vision-based algorithms in terms of localization error (RMSE).

Comment: arXiv admin note: text overlap with arXiv:2402.12320
Code: None
**Landmark Stereo Dataset for Landmark Recognition and Moving Node Localization in a Non-GPS Battlefield Environment** (2024-02-19)

In this paper, we propose a new strategy of using a landmark anchor node instead of a radio-based anchor node to obtain the virtual coordinates (landmarkID, DISTANCE) of moving troops or defense forces, which will help in tracking and maneuvering the troops along a safe path within a GPS-denied battlefield environment. The proposed strategy implements landmark recognition using the YOLOv5 model and landmark distance estimation using an efficient stereo matching algorithm. We consider a moving node carrying a low-power mobile device equipped with a calibrated stereo vision camera that captures stereo images of a scene containing landmarks within the battlefield region, whose locations are stored in an offline server residing within the device itself. We created a custom landmark image dataset called MSTLandmarkv1 with 34 landmark classes and another landmark stereo dataset of those 34 landmark instances called MSTLandmarkStereov1. We trained the YOLOv5 model with the MSTLandmarkv1 dataset and achieved 0.95 mAP@0.5 IoU and 0.767 mAP@[0.5:0.95] IoU. We calculated the distance from a node to the landmark utilizing the bounding box coordinates and the depth map generated by the improved SGM algorithm using MSTLandmarkStereov1. The tuples of landmark IDs obtained from the detection results and the distances calculated by the SGM algorithm are stored as the virtual coordinates of a node. In future work, we will use these virtual coordinates to obtain the location of a node using an efficient trilateration algorithm and optimize the node position using an appropriate optimization method.

Code: None
**MMAUD: A Comprehensive Multi-Modal Anti-UAV Dataset for Modern Miniature Drone Threats** (2024-02-06)

In response to the evolving challenges posed by small unmanned aerial vehicles (UAVs), which possess the potential to transport harmful payloads or independently cause damage, we introduce MMAUD: a comprehensive Multi-Modal Anti-UAV Dataset. MMAUD addresses a critical gap in contemporary threat detection methodologies by focusing on drone detection, UAV-type classification, and trajectory estimation. MMAUD stands out by combining diverse sensory inputs, including stereo vision, various Lidars, Radars, and audio arrays. It offers a unique overhead aerial detection perspective, vital for addressing real-world scenarios with higher fidelity than datasets captured from specific vantage points using thermal and RGB cameras. Additionally, MMAUD provides accurate Leica-generated ground truth data, enhancing credibility and enabling confident refinement of algorithms and models, which has never been seen in other datasets. Most existing works do not disclose their datasets, making MMAUD an invaluable resource for developing accurate and efficient solutions. Our proposed modalities are cost-effective and highly adaptable, allowing users to experiment and implement new UAV threat detection tools. Our dataset closely simulates real-world scenarios by incorporating ambient heavy machinery sounds. This approach enhances the dataset's applicability, capturing the exact challenges faced during proximate vehicular operations. It is expected that MMAUD can play a pivotal role in advancing UAV threat detection, classification, trajectory estimation capabilities, and beyond. Our dataset, codes, and designs will be available at https://github.com/ntu-aris/MMAUD.

Comment: Accepted by ICRA 2024
Code: Link
**Left-right Discrepancy for Adversarial Attack on Stereo Networks** (2024-01-14)

Stereo matching neural networks often involve a Siamese structure to extract intermediate features from left and right images. The similarity between these intermediate left-right features significantly impacts the accuracy of disparity estimation. In this paper, we introduce a novel adversarial attack approach that generates perturbation noise specifically designed to maximize the discrepancy between left and right image features. Extensive experiments demonstrate the superior capability of our method to induce larger prediction errors in stereo neural networks, e.g., outperforming existing state-of-the-art attack methods by 219% MAE on the KITTI dataset and 85% MAE on the Scene Flow dataset. Additionally, we extend our approach to include a proxy-network black-box attack method, eliminating the need for access to the stereo neural network. This method leverages an arbitrary network from a different vision task as a proxy to generate adversarial noise, effectively causing the stereo network to produce erroneous predictions. Our findings highlight a notable sensitivity of stereo networks to discrepancies in shallow layer features, offering valuable insights that could guide future research in enhancing the robustness of stereo vision systems.

Code: None
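
The core attack idea, maximizing left-right feature discrepancy by gradient ascent on the input, can be sketched in a few lines of PyTorch (a minimal PGD-style example with a stand-in encoder, not the authors' implementation):

```python
# PGD-style sketch: ascend the gradient of the discrepancy between left- and
# right-image features from a shared (Siamese) encoder. The tiny conv net
# stands in for a stereo network's feature extractor; values are illustrative.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 8, 3, padding=1))

left = torch.rand(1, 3, 64, 64)
right = torch.rand(1, 3, 64, 64)
delta = torch.zeros_like(left, requires_grad=True)      # perturbation on left
eps, step = 8 / 255, 2 / 255

for _ in range(10):
    # Maximize the L2 discrepancy between intermediate left/right features.
    loss = ((encoder(left + delta) - encoder(right)) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()               # gradient ascent step
        delta.clamp_(-eps, eps)                         # stay in the eps-ball
        delta.grad.zero_()

with torch.no_grad():
    final = ((encoder(left + delta) - encoder(right)) ** 2).mean()
print("feature discrepancy after attack:", final.item())
```
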
**Autonomous robotic re-alignment for face-to-face underwater human-robot interaction** (2024-01-09)

The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component of UHRI is a system for safe robot-to-diver approach that establishes face-to-face communication while accounting for non-standard human body poses. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

Comment: Submitted to the Proceedings of the 2024 IEEE Conference on Robotics & Automation (ICRA)
Code: None
**Temporal-controlled Frame Swap for Generating High-Fidelity Stereo Driving Data for Autonomy Analysis** (2023-12-25)

This paper presents a novel approach, TeFS (Temporal-controlled Frame Swap), to generate synthetic stereo driving data for visual simultaneous localization and mapping (vSLAM) tasks. TeFS is designed to overcome the lack of native stereo vision support in commercial driving simulators, and we demonstrate its effectiveness using Grand Theft Auto V (GTA V), a high-budget open-world video game engine. We introduce GTAV-TeFS, the first large-scale GTA V stereo-driving dataset, containing over 88,000 high-resolution stereo RGB image pairs, along with temporal information, GPS coordinates, camera poses, and full-resolution dense depth maps. GTAV-TeFS offers several advantages over other synthetic stereo datasets and enables the evaluation and enhancement of state-of-the-art stereo vSLAM models under GTA V's environment. We validate the quality of the stereo data collected using TeFS by conducting a comparative analysis with the conventional dual-viewport data using an open-source simulator. We also benchmark various vSLAM models using the challenging-case comparison groups included in GTAV-TeFS, revealing the distinct advantages and limitations inherent to each model. The goal of our work is to bring more high-fidelity stereo data from commercial-grade game simulators into the research domain and push the boundary of vSLAM models.

Code: None
**Redefining the Laparoscopic Spatial Sense: AI-based Intra- and Postoperative Measurement from Stereoimages** (2023-11-16)

A significant challenge in image-guided surgery is the accurate measurement of relevant structures such as vessel segments, resection margins, or bowel lengths. While this task is an essential component of many surgeries, it involves substantial human effort and is prone to inaccuracies. In this paper, we develop a novel human-AI-based method for laparoscopic measurements utilizing stereo vision that has been guided by practicing surgeons. Based on a holistic qualitative requirements analysis, this work proposes a comprehensive measurement method, which comprises state-of-the-art machine learning architectures such as RAFT-Stereo and YOLOv8. The developed method is assessed in various realistic experimental evaluation environments. Our results demonstrate the potential of our method to achieve high accuracy in distance measurements, with errors below 1 mm. Furthermore, on-surface measurements demonstrate robustness when applied in challenging environments with textureless regions. Overall, by addressing the inherent challenges of image-guided surgery, we lay the foundation for a more robust and accurate solution for intra- and postoperative measurements, enabling more precise, safe, and efficient surgical procedures.

Comment: 38th AAAI Conference on Artificial Intelligence (AAAI-24)
Code: None
**Diver Interest via Pointing in Three Dimensions: 3D Pointing Reconstruction for Diver-AUV Communication** (2023-10-17)

This paper presents Diver Interest via Pointing in Three Dimensions (DIP-3D), a method to relay an object of interest from a diver to an autonomous underwater vehicle (AUV) by pointing, which includes three-dimensional distance information to discriminate between multiple objects in the AUV's camera image. Traditional dense stereo vision for distance estimation underwater is challenging because of the relative lack of saliency of scene features and degraded lighting conditions. Yet, including distance information is necessary for robotic perception of diver pointing when multiple objects appear within the robot's image plane. We circumvent the challenges of underwater distance estimation by using sparse reconstruction of keypoints to perform pose estimation on both the left and right images from the robot's stereo camera. Triangulated pose keypoints, along with a classical object detection method, enable DIP-3D to infer the location of an object of interest when multiple objects are in the AUV's field of view. By allowing the scuba diver to point at an arbitrary object of interest and enabling the AUV to autonomously decide which object the diver is pointing to, this method will permit more natural interaction between AUVs and human scuba divers in underwater human-robot collaborative tasks.

Comment: Under review, International Conference on Robotics and Automation (ICRA) 2024
Code: None
**AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion** (2023-09-04)

Recently, stereo vision based on lightweight RGB-D cameras has been widely used in various fields. However, limited by the imaging principles, the commonly used RGB-D cameras based on ToF, structured light, or binocular vision inevitably acquire some invalid data, such as weak reflections, boundary shadows, and artifacts, which may adversely affect follow-up work. In this paper, we propose a new model for depth image completion based on the Attention Guided Gated-convolutional Network (AGG-Net), through which more accurate and reliable depth images can be obtained from raw depth maps and the corresponding RGB images. Our model employs a UNet-like architecture which consists of two parallel branches of depth and color features. In the encoding stage, an Attention Guided Gated-Convolution (AG-GConv) module is proposed to realize the fusion of depth and color features at different scales, which can effectively reduce the negative impact of invalid depth data on the reconstruction. In the decoding stage, an Attention Guided Skip Connection (AG-SC) module is presented to avoid introducing too many depth-irrelevant features into the reconstruction. The experimental results demonstrate that our method outperforms the state-of-the-art methods on the popular benchmarks NYU-Depth V2, DIML, and SUN RGB-D.

Comment: 9 pages, 7 figures, ICCV 2023
Code: None
**Depth Estimation Analysis of Orthogonally Divergent Fisheye Cameras with Distortion Removal** (2023-07-07)

Stereo vision systems have become popular in computer vision applications, such as 3D reconstruction, object tracking, and autonomous navigation. However, traditional stereo vision systems that use rectilinear lenses may not be suitable for certain scenarios due to their limited field of view. This has led to the popularity of vision systems based on one or multiple fisheye cameras in different orientations, which can provide a field of view of 180×180 degrees or more. However, fisheye cameras introduce significant distortion at the edges that affects the accuracy of stereo matching and depth estimation. To overcome these limitations, this paper proposes a method for distortion removal and depth estimation analysis for a stereo vision system using orthogonally divergent fisheye cameras (ODFC). The proposed method uses two virtual pinhole cameras (VPC), each of which captures a small portion of the original view and presents it without any lens distortions, emulating the behavior of a pinhole camera. By carefully selecting the captured regions, it is possible to create a stereo pair using two VPCs. The performance of the proposed method is evaluated both in simulation using a virtual environment and in experiments using real cameras, and the results are compared to stereo cameras with parallel optical axes. The results demonstrate the effectiveness of the proposed method in terms of distortion removal and depth estimation accuracy.

Code: None
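
Carving a virtual pinhole camera (VPC) view out of a fisheye image is supported by OpenCV's fisheye model; a hedged sketch with illustrative intrinsics, where the rotation R selects which portion of the fisheye field of view the VPC sees:

```python
# Build a distortion-free virtual pinhole view from a fisheye image
# (illustrative intrinsics and distortion coefficients, synthetic image).
import numpy as np
import cv2

K = np.array([[280., 0., 640.], [0., 280., 480.], [0., 0., 1.]])  # fisheye intrinsics
D = np.array([0.05, -0.01, 0.002, 0.0])                           # fisheye distortion
R = cv2.Rodrigues(np.array([0.0, np.deg2rad(30.0), 0.0]))[0]      # look 30 deg to the side

# Intrinsics of the virtual pinhole camera (VPC) and its image size.
P = np.array([[400., 0., 320.], [0., 400., 240.], [0., 0., 1.]])
size = (640, 480)

map1, map2 = cv2.fisheye.initUndistortRectifyMap(K, D, R, P, size, cv2.CV_16SC2)
fisheye_img = np.random.randint(0, 255, (960, 1280, 3), dtype=np.uint8)  # stand-in image
vpc_view = cv2.remap(fisheye_img, map1, map2, interpolation=cv2.INTER_LINEAR)
print(vpc_view.shape)   # (480, 640, 3): one half of a VPC stereo pair
```
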
**3D reconstruction using Structure for Motion** (2023-06-10)

We are working towards 3D reconstruction of indoor spaces using a pair of HDR cameras in a stereo vision configuration mounted on an indoor mobile floor robot. The robot captures various textures and spatial features as 2D images, and this data is simultaneously fed to our algorithm, which allows us to visualize the depth map.

Comment: Implementation code can be found at https://github.com/KshitijKarnawat/Structure-from-Motion
Code: Link
**Experimental Energy Consumption Analysis of a Flapping-Wing Robot** (2023-06-01)

One of the motivations for exploring flapping-wing aerial robotic systems is to seek energy reduction, while maintaining manoeuvrability, compared to conventional unmanned aerial systems. A Flapping Wing Flying Robot (FWFR) can glide in favourable wind conditions, decreasing energy consumption significantly. In addition, it is necessary to investigate the power consumption of the components in the flapping-wing robot. In this work, two sets of the FWFR components are analyzed in terms of power consumption: a) motor/electronics components and b) a vision system for monitoring the environment during the flight. A measurement device is used to record the power utilization of the motors in the launching and ascending phases of the flight and also in cruising flight around the desired height. Additionally, an analysis of event cameras and stereo vision systems in terms of energy consumption has been performed. The results provide a first step towards decreasing battery usage and, consequently, providing additional flight time.

Code: None
**A Multi-modal Garden Dataset and Hybrid 3D Dense Reconstruction Framework Based on Panoramic Stereo Images for a Trimming Robot** (2023-05-10)

Recovering an outdoor environment's surface mesh is vital for an agricultural robot during task planning and remote visualization. Our proposed solution is based on a newly designed panoramic stereo camera along with a novel hybrid software framework that consists of three fusion modules. The panoramic stereo camera with a pentagon shape consists of 5 stereo vision camera pairs to stream synchronized panoramic stereo images for the following three fusion modules. In the disparity fusion module, rectified stereo images produce the initial disparity maps using multiple stereo vision algorithms. Then, these initial disparity maps, along with the intensity images, are input into a disparity fusion network to produce refined disparity maps. Next, the refined disparity maps are converted into full-view point clouds or single-view point clouds for the pose fusion module. The pose fusion module adopts a two-stage global-coarse-to-local-fine strategy. In the first stage, each pair of full-view point clouds is registered by a global point cloud matching algorithm to estimate the transformation for a global pose graph's edge, which effectively implements loop closure. In the second stage, a local point cloud matching algorithm is used to match single-view point clouds in different nodes. Next, we locally refine the poses of all corresponding edges in the global pose graph using three proposed rules, thus constructing a refined pose graph. The refined pose graph is optimized to produce a global pose trajectory for volumetric fusion. In the volumetric fusion module, the global poses of all the nodes are used to integrate the single-view point clouds into the volume to produce the mesh of the whole garden. The proposed framework and its three fusion modules are tested on a real outdoor garden dataset to demonstrate their superior performance.

32 pages None
MIPI 2023 Challenge on RGB+ToF Depth Completion: Methods and Results 2023-04-27
Show

Depth completion from RGB images and sparse Time-of-Flight (ToF) measurements is an important problem in computer vision and robotics. While traditional methods for depth completion have relied on stereo vision or structured light techniques, recent advances in deep learning have enabled more accurate and efficient completion of depth maps from RGB images and sparse ToF measurements. To evaluate the performance of different depth completion methods, we organized an RGB+sparse ToF depth completion competition. The competition aimed to encourage research in this area by providing a standardized dataset and evaluation metrics to compare the accuracy of different approaches. In this report, we present the results of the competition and analyze the strengths and weaknesses of the top-performing methods. We also discuss the implications of our findings for future research in RGB+sparse ToF depth completion. We hope that this competition and report will help to advance the state-of-the-art in this important area of research. More details of this challenge and the link to the dataset can be found at https://mipi-challenge.org/MIPI2023.

arXiv admin note: substantial text overlap with arXiv:2209.07057

None
Vehicle Safety Management System 2023-04-16
Show

Overtaking is a critical maneuver in driving that requires accurate information about the location and distance of other vehicles on the road. This study suggests a real-time overtaking assistance system that uses a combination of the You Only Look Once (YOLO) object detection algorithm and stereo vision techniques to accurately identify and locate vehicles in front of the driver, and estimate their distance. The system then signals the vehicles behind the driver using colored lights to inform them of the safe overtaking distance. The proposed system has been implemented using stereo vision for distance analysis and YOLO for object identification. The results demonstrate its effectiveness in accurately providing the vehicle type and the distance between the camera module and the vehicle, with an approximate error of 4.107%. Our system has the potential to reduce the risk of accidents and improve the safety of overtaking maneuvers, especially on busy highways and roads. (The underlying disparity-to-distance relation is sketched after this entry.)

None
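The distance estimate in the entry above ultimately rests on the pinhole stereo relation $Z = fB/d$. A minimal sketch follows; the focal length and baseline are made-up calibration values, not the paper's.

```python
# Made-up calibration values; a real system obtains these from stereo calibration.
FOCAL_PX = 700.0    # focal length expressed in pixels
BASELINE_M = 0.12   # horizontal distance between the two cameras [m]

def distance_from_disparity(disparity_px: float) -> float:
    """Pinhole stereo model: Z = f * B / d (larger disparity means closer object)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return FOCAL_PX * BASELINE_M / disparity_px

# A detected vehicle whose matched features are offset by 10 px between the views:
print(f"{distance_from_disparity(10.0):.2f} m")  # 8.40 m
```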
Deep learning-based stereo camera multi-video synchronization 2023-03-22
Show

Stereo vision is essential for many applications. Currently, the synchronization of the streams coming from two cameras is done using mostly hardware. A software-based synchronization method would reduce the cost, weight and size of the entire system and allow for more flexibility when building such systems. With this goal in mind, we present here a comparison of different deep learning-based systems and prove that some are efficient and generalizable enough for such a task. This study paves the way to a production ready software-based video synchronization system.

5 pages, 4 figures, Accepted at ICASSP 2023

None
Stereo X-ray Tomography 2023-02-26
Show

X-ray tomography is a powerful volumetric imaging technique, but detailed three dimensional (3D) imaging requires the acquisition of a large number of individual X-ray images, which is time consuming. For applications where spatial information needs to be collected quickly, for example, when studying dynamic processes, standard X-ray tomography is therefore not applicable. Inspired by stereo vision, in this paper, we develop X-ray imaging methods that work with two X-ray projection images. In this setting, without the use of additional strong prior information, we no longer have enough information to fully recover the 3D tomographic images. However, up to a point, we are nevertheless able to extract spatial locations of point and line features. From stereo vision, it is well known that, for a known imaging geometry, once the same point is identified in two images taken from different directions, then the point's location in 3D space is exactly specified. The challenge is the matching of points between images. As X-ray transmission images are fundamentally different from the surface reflection images used in standard computer vision, we here develop a different feature identification and matching approach. Once point-like features are identified, if there are limited points in the image, then they can often be matched exactly. Moreover, by utilising a third observation from an appropriate direction, matching becomes unique. Once matched, point locations in 3D space are easily computed using geometric considerations (a minimal triangulation sketch follows this entry). Linear features, with clear end points, can be located using a similar approach.

None
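Once the same feature is identified in two projections with known geometry, as the entry above notes, its 3D position follows from standard two-view triangulation. Below is a generic linear (DLT) triangulation sketch with toy projection matrices; it illustrates the geometric step only, not the paper's X-ray feature matching.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point matched in two views.

    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel coordinates."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]           # de-homogenize to 3D coordinates

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

# Toy geometry: identical intrinsics, second view shifted 10 cm along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.05, -0.02, 1.5, 1.0])          # homogeneous 3D point
print(triangulate(P1, P2, project(P1, X_true), project(P2, X_true)))
# -> approximately [0.05, -0.02, 1.5]
```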
Localizing Scan Targets from Human Pose for Autonomous Lung Ultrasound Imaging 2023-02-25
Show

Ultrasound is progressing toward becoming an affordable and versatile solution to medical imaging. With the advent of COVID-19 global pandemic, there is a need to fully automate ultrasound imaging as it requires trained operators in close proximity to patients for a long period of time, therefore increasing risk of infection. In this work, we investigate the important yet seldom-studied problem of scan target localization, under the setting of lung ultrasound imaging. We propose a purely vision-based, data driven method that incorporates learning-based computer vision techniques. We combine a human pose estimation model with a specially designed regression model to predict the lung ultrasound scan targets, and deploy multiview stereo vision to enhance the consistency of 3D target localization. While related works mostly focus on phantom experiments, we collect data from 30 human subjects for testing. Our method attains an accuracy level of 16.00(9.79) mm for probe positioning and 4.44(3.75) degree for probe orientation, with a success rate above 80% under an error threshold of 25mm for all scan targets. Moreover, our approach can serve as a general solution to other types of ultrasound modalities. The code for implementation has been released.

v2 2023/02/25 None
FLSea: Underwater Visual-Inertial and Stereo-Vision Forward-Looking Datasets 2023-02-24
Show

Visibility underwater is challenging, and degrades as the distance between the subject and camera increases, making vision tasks in the forward-looking direction more difficult. We have collected underwater forward-looking stereo-vision and visual-inertial image sets in the Mediterranean and Red Sea. To our knowledge there are no other public datasets in the underwater environment acquired with this camera-sensor orientation published with ground-truth. These datasets are critical for the development of several underwater applications, including obstacle avoidance, visual odometry, 3D tracking, Simultaneous Localization and Mapping (SLAM) and depth estimation. The stereo datasets include synchronized stereo images in dynamic underwater environments with objects of known-size. The visual-inertial datasets contain monocular images and IMU measurements, aligned with millisecond resolution timestamps and objects of known size which were placed in the scene. Both sensor configurations allow for scale estimation, with the calibrated baseline in the stereo setup and the IMU in the visual-inertial setup. Ground truth depth maps were created offline for both dataset types using photogrammetry. The ground truth is validated with multiple known measurements placed throughout the imaged environment. There are 5 stereo and 8 visual-inertial datasets in total, each containing thousands of images, with a range of different underwater visibility and ambient light conditions, natural and man-made structures and dynamic camera motions. The forward-looking orientation of the camera makes these datasets unique and ideal for testing underwater obstacle-avoidance algorithms and for navigation close to the seafloor in dynamic environments. With our datasets, we hope to encourage the advancement of autonomous functionality for underwater vehicles in dynamic and/or shallow water environments.

None
Mimetic Muscle Rehabilitation Analysis Using Clustering of Low Dimensional 3D Kinect Data 2023-02-15
Show

Facial nerve paresis is a severe complication that arises after head and neck surgery; this results in articulation problems, facial asymmetry, and severe problems in non-verbal communication. To overcome the side effects of post-surgery facial paralysis, rehabilitation is required, which lasts for several weeks. This paper discusses an unsupervised approach to rehabilitating patients who have temporary facial paralysis due to damage in mimetic muscles. The work aims to make the rehabilitation process objective, in contrast to current subjective approaches such as the House-Brackmann (HB) scale. The approach will also assist clinicians by reducing their workload in assessing improvement during rehabilitation. This paper focuses on a clustering approach to monitor the rehabilitation process. We compare the results obtained from different clustering algorithms on various forms of the same data set, namely the dynamic form, data expressed as functional data using a B-spline basis expansion, and the functional principal components of the functional data. The study uses a data set of 85 distinct patients with 120 measurements obtained using a Kinect stereo-vision camera. The method distinguishes effectively between patients with the least and greatest degrees of facial paralysis; however, patients with adjacent degrees of paralysis pose some challenges. In addition, we compared the cluster results to the HB scale outputs.

None
An Application of Stereo Thermal Vision for Preliminary Inspection of Electrical Power Lines by MAVs 2023-02-09
Show

An application of stereo thermal vision to perform preliminary inspection operations of electrical power lines by a particular class of small Unmanned Aerial Vehicles (UAVs), aka Micro Unmanned Aerial Vehicles (MAVs), is presented in this paper. The proposed hardware and software setup allows the detection of overheated power equipment, one of the major causes of power outages. The stereo vision complements the GPS information by finely detecting the potential source of damage while also providing a measure of the extent of the harm. The reduced size and light weight of the vehicle make it possible to survey areas otherwise difficult to access with standard UAVs. Gazebo simulations and real flight experiments demonstrate the feasibility and effectiveness of the proposed setup.

8 pages, 15 figures, conference

None
Real-time FPGA implementation of the Semi-Global Matching stereo vision algorithm for a 4K/UHD video stream 2023-01-12
Show

In this paper, we propose a real-time FPGA implementation of the Semi-Global Matching (SGM) stereo vision algorithm. The designed module supports a 4K/Ultra HD (3840 x 2160 pixels @ 30 frames per second) video stream in a 4 pixel per clock (ppc) format and a 64-pixel disparity range. The baseline SGM implementation had to be modified to process pixels in the 4ppc format and meet the timing constraints; however, our version provides results comparable to the original design. The solution has been positively evaluated on the Xilinx VC707 development board with a Virtex-7 FPGA device. (The clock-rate implication of the 4 ppc format is spelled out after this entry.)

Paper accepted for the DASIP 2023 workshop in conjunction with HiPEAC 2023

None
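The 4K figures in the entry above imply a hard lower bound on the processing clock, and the 4 ppc format is what makes it attainable on an FPGA; the arithmetic is worth spelling out.

```python
width, height, fps = 3840, 2160, 30   # the 4K/UHD stream from the entry above
pixels_per_clock = 4                  # the 4 ppc format

pixel_rate = width * height * fps     # 248,832,000 pixels per second
min_clock_hz = pixel_rate / pixels_per_clock
print(f"required clock >= {min_clock_hz / 1e6:.3f} MHz")  # ~62.208 MHz
```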
Vision-Based Environmental Perception for Autonomous Driving 2022-12-22
Show

Visual perception plays an important role in autonomous driving. One of the primary tasks is object detection and identification. Since the vision sensor is rich in color and texture information, it can quickly and accurately identify various road information. The commonly used techniques are based on extracting and calculating various features of the image. Recently developed deep learning-based methods offer better reliability and processing speed and have a greater advantage in recognizing complex elements. For depth estimation, vision sensors are also used for ranging due to their small size and low cost. A monocular camera uses image data from a single viewpoint as input to estimate object depth. In contrast, stereo vision is based on parallax and matching feature points of different views, and the application of deep learning further improves the accuracy. In addition, Simultaneous Localization and Mapping (SLAM) can establish a model of the road environment, thus helping the vehicle perceive the surrounding environment and complete its tasks. In this paper, we introduce and compare various methods of object detection and identification, then explain the development of depth estimation and compare various methods based on monocular, stereo, and RGB-D sensors, next review and compare various methods of SLAM, and finally summarize the current problems and present the future development trends of vision technologies.

39 pages, 17 figures None
Real-Time High-Quality Stereo Matching System on a GPU 2022-12-01
Show

In this paper, we propose a low error rate and real-time stereo vision system on GPU. Many stereo vision systems on GPU have been proposed to date. In those systems, the error rates and the processing speed are in a trade-off relationship. We propose a real-time stereo vision system on GPU for high-resolution images. This system also maintains a low error rate compared to other fast systems. In our approach, we have implemented the cost aggregation (CA), cross-checking and median filter on GPU in order to realize real-time processing (a minimal cross-checking sketch follows this entry). Its processing speed is 40 fps for 1436x992 pixel images when the maximum disparity is 145, and its error rate is the lowest among the GPU systems which are faster than 30 fps.

None
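Of the three GPU stages named in the entry above, cross-checking is the easiest to show compactly: a left-right consistency test that invalidates pixels whose two disparity maps disagree. The NumPy sketch below is a generic version of that idea (the tolerance is arbitrary), not the authors' CUDA code.

```python
import numpy as np

def cross_check(disp_left, disp_right, max_diff=1.0):
    """Invalidate pixels whose left and right disparities disagree.

    A consistent pixel satisfies disp_left[y, x] ~= disp_right[y, x - disp_left[y, x]];
    disagreement usually indicates an occlusion or a matching error."""
    h, w = disp_left.shape
    xs = np.tile(np.arange(w), (h, 1))
    xr = np.clip(xs - np.round(disp_left).astype(int), 0, w - 1)
    reprojected = np.take_along_axis(disp_right, xr, axis=1)
    out = disp_left.astype(float).copy()
    out[np.abs(disp_left - reprojected) > max_diff] = np.nan  # mark invalid pixels
    return out
```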
SOCRATES: A Stereo Camera Trap for Monitoring of Biodiversity 2022-10-13
Show

The development and application of modern technology is an essential basis for the efficient monitoring of species in natural habitats and landscapes to trace the development of ecosystems, species communities, and populations, and to analyze the reasons for changes. For estimating animal abundance using methods such as camera trap distance sampling, spatial information of natural habitats in terms of 3D (three-dimensional) measurements is crucial. Additionally, 3D information improves the accuracy of animal detection using camera trapping. This study presents a novel approach to 3D camera trapping featuring highly optimized hardware and software. This approach employs stereo vision to infer 3D information of natural habitats and is designated as StereO CameRA Trap for monitoring of biodivErSity (SOCRATES). A comprehensive evaluation of SOCRATES shows not only a $3.23%$ improvement in animal detection (bounding box $\text{mAP}_{75}$) but also its superior applicability for estimating animal abundance using camera trap distance sampling. The software and documentation of SOCRATES are provided at https://github.com/timmh/socrates

Code Link
Active-Passive SimStereo -- Benchmarking the Cross-Generalization Capabilities of Deep Learning-based Stereo Methods 2022-09-17
Show

In stereo vision, self-similar or bland regions can make it difficult to match patches between two images. Active stereo-based methods mitigate this problem by projecting a pseudo-random pattern on the scene so that each patch of an image pair can be identified without ambiguity. However, the projected pattern significantly alters the appearance of the image. If this pattern acts as a form of adversarial noise, it could negatively impact the performance of deep learning-based methods, which are now the de-facto standard for dense stereo vision. In this paper, we propose the Active-Passive SimStereo dataset and a corresponding benchmark to evaluate the performance gap between passive and active stereo images for stereo matching algorithms. Using the proposed benchmark and an additional ablation study, we show that the feature extraction and matching modules of twenty selected deep learning-based stereo matching methods generalize to active stereo without a problem. However, the disparity refinement modules of three of the twenty architectures (ACVNet, CascadeStereo, and StereoNet) are negatively affected by the active stereo patterns due to their reliance on the appearance of the input images.

22 pages, 12 figures, accepted in NeurIPS 2022 Datasets and Benchmarks Track

None
Bayesian Learning for Disparity Map Refinement for Semi-Dense Active Stereo Vision 2022-09-12
Show

A major focus of recent developments in stereo vision has been on how to obtain accurate dense disparity maps in passive stereo vision. Active vision systems enable more accurate estimations of dense disparity compared to passive stereo. However, subpixel-accurate disparity estimation remains an open problem that has received little attention. In this paper, we propose a new learning strategy to train neural networks to estimate high-quality subpixel disparity maps for semi-dense active stereo vision. The key insight is that neural networks can double their accuracy if they are able to jointly learn how to refine the disparity map while invalidating the pixels where there is insufficient information to correct the disparity estimate. Our approach is based on Bayesian modeling where validated and invalidated pixels are defined by their stochastic properties, allowing the model to learn how to choose by itself which pixels are worth its attention. Using active stereo datasets such as Active-Passive SimStereo, we demonstrate that the proposed method outperforms the current state-of-the-art active stereo models. We also demonstrate that the proposed approach compares favorably with state-of-the-art passive stereo models on the Middlebury dataset.

15 pages, 15 figures None
TriStereoNet: A Trinocular Framework for Multi-baseline Disparity Estimation 2022-09-04
Show

Stereo vision is an effective technique for depth estimation with broad applicability in autonomous urban and highway driving. While various deep learning-based approaches have been developed for stereo, the input data from a binocular setup with a fixed baseline are limited. Addressing such a problem, we present an end-to-end network for processing the data from a trinocular setup, which is a combination of a narrow and a wide stereo pair. In this design, two pairs of binocular data with a common reference image are treated with shared weights of the network and a mid-level fusion. We also propose a Guided Addition method for merging the 4D data of the two baselines. Additionally, an iterative sequential self-supervised and supervised learning on real and synthetic datasets is presented, making the training of the trinocular system practical with no need for ground-truth data of the real dataset. Experimental results demonstrate that the trinocular disparity network surpasses the scenario where individual pairs are fed into a similar architecture. Code and dataset: https://github.com/cogsys-tuebingen/tristereonet.

Code Link
Analysis & Computational Complexity Reduction of Monocular and Stereo Depth Estimation Techniques 2022-06-18
Show

Accurate depth estimation with the lowest compute and energy cost is a crucial requirement for unmanned and battery-operated autonomous systems. Robotic applications require real-time depth estimation for navigation and decision making under rapidly changing 3D surroundings. A high-accuracy algorithm may provide the best depth estimation but may consume tremendous compute and energy resources. A general trade-off is to choose a less accurate method for the initial depth estimate and a more accurate yet compute-intensive method when needed. Previous work has shown this trade-off can be improved by developing a state-of-the-art method (AnyNet) to improve stereo depth estimation. We studied both monocular and stereo vision depth estimation methods and investigated ways to reduce their computational complexity; this served as our baseline. Consequently, our experiments show that reducing the monocular depth estimation model size by ~75% reduces accuracy by less than 2% (SSIM metric). Our experiments with the novel stereo vision method (AnyNet) show that the accuracy of depth estimation does not degrade by more than 3% (three-pixel error metric) despite a reduction in model size of ~20%. We have shown that smaller models can indeed perform competitively.

None
Development of a Stereo-Vision Based High-Throughput Robotic System for Mouse Tail Vein Injection 2022-05-25
Show

In this paper, we present a robotic device for mouse tail vein injection. We propose a mouse holding mechanism to realize vein injection without anesthetizing the mouse, which consists of a tourniquet, vacuum port, and adaptive tail-end fixture. The position of the target vein in 3D space is reconstructed from high-resolution stereo vision. The vein is detected by a simple but robust vein line detector. Thanks to the proposed two-stage calibration process, the total time for the injection process is limited to 1.5 minutes, even though the positions of the needle and tail vein vary between trials. We performed an injection experiment targeting 40 mice and succeeded in injecting saline into 37 of them, a 92.5% success rate.

accepted to ICRA2022 (7 pages, 11 figures, 2 tables)

None
ZZ-Net: A Universal Rotation Equivariant Architecture for 2D Point Clouds 2022-03-28
Show

In this paper, we are concerned with rotation equivariance on 2D point cloud data. We describe a particular set of functions able to approximate any continuous rotation equivariant and permutation invariant function. Based on this result, we propose a novel neural network architecture for processing 2D point clouds and we prove its universality for approximating functions exhibiting these symmetries. We also show how to extend the architecture to accept a set of 2D-2D correspondences as input data, while maintaining similar equivariance properties. Experiments are presented on the estimation of essential matrices in stereo vision.

CVPR 2022 camera ready

None
3D endoscopic depth estimation using 3D surface-aware constraints 2022-03-04
Show

Robotic-assisted surgery allows surgeons to conduct precise surgical operations with stereo vision and flexible motor control. However, the lack of 3D spatial perception limits situational awareness during procedures and hinders mastering surgical skills in the narrow abdominal space. Depth estimation, as a representative perception task, is typically defined as an image reconstruction problem. In this work, we show that depth estimation can be reformed from a 3D surface perspective. We propose a loss function for depth estimation that integrates surface-aware constraints, leading to faster and better convergence by exploiting valid spatial information. In addition, camera parameters are incorporated into the training pipeline to increase the control and transparency of the depth estimation. We also integrate a specularity removal module to recover more buried image information. Quantitative experimental results on endoscopic datasets and user studies with medical professionals demonstrate the effectiveness of our method.

None
Self-Supervised Online Learning for Safety-Critical Control using Stereo Vision 2022-03-02
Show

With the increasing prevalence of complex vision-based sensing methods for use in obstacle identification and state estimation, characterizing environment-dependent measurement errors has become a difficult and essential part of modern robotics. This paper presents a self-supervised learning approach to safety-critical control. In particular, the uncertainty associated with stereo vision is estimated, and adapted online to new visual environments, wherein this estimate is leveraged in a safety-critical controller in a robust fashion. To this end, we propose an algorithm that exploits the structure of stereo-vision to learn an uncertainty estimate without the need for ground-truth data. We then robustify existing Control Barrier Function-based controllers to provide safety in the presence of this uncertainty estimate. We demonstrate the efficacy of our method on a quadrupedal robot in a variety of environments. When our method is not used, safety is violated. With offline training alone, we observe that the robot is safe, but overly conservative. With our online method, the quadruped remains safe and conservatism is reduced.

7 pages, 4 figures, conference publication at ICRA 2022

None
Rectifying homographies for stereo vision: analytical solution for minimal distortion 2022-02-28
Show

Stereo rectification is the determination of two image transformations (or homographies) that map corresponding points on the two images, projections of the same point in the 3D space, onto the same horizontal line in the transformed images. Rectification is used to simplify the subsequent stereo correspondence problem and to speed up the matching process. Rectifying transformations, in general, introduce perspective distortion in the obtained images, which must be minimised to improve the accuracy of the subsequent algorithm dealing with the stereo correspondence problem. The search for the optimal transformations is usually carried out by relying on numerical optimisation. This work proposes a closed-form solution for the rectifying homographies that minimise perspective distortion. The experimental comparison confirms its capability to solve the convergence issues of the previous formulation. Its Python implementation is provided. (For contrast, the standard numerical rectification pipeline is sketched after this entry.)

None
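For contrast with the closed-form solution above, this is the standard numerical rectification pipeline as exposed by OpenCV; every calibration value below is a placeholder, and a real setup would take K, D, R, T from cv2.stereoCalibrate.

```python
import cv2
import numpy as np

# Placeholder calibration: intrinsics K, distortion D, relative pose (R, T).
K1 = K2 = np.array([[700.0, 0.0, 640.0], [0.0, 700.0, 360.0], [0.0, 0.0, 1.0]])
D1 = D2 = np.zeros(5)
R = np.eye(3)                          # rotation from camera 1 to camera 2
T = np.array([[0.12], [0.0], [0.0]])   # 12 cm horizontal baseline [m]
size = (1280, 720)

# stereoRectify computes the rectifying rotations R1, R2 and new projections P1, P2.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
# rectified_left = cv2.remap(left_image, map1x, map1y, cv2.INTER_LINEAR)
```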
Stereo Matching with Cost Volume based Sparse Disparity Propagation 2022-01-28
Show

Stereo matching is crucial for binocular stereo vision. Existing methods mainly focus on simple disparity map fusion to improve stereo matching, which require multiple dense or sparse disparity maps. In this paper, we propose a simple yet novel scheme, termed feature disparity propagation, to improve general stereo matching based on matching cost volume and sparse matching feature points. Specifically, our scheme first calculates a reliable sparse disparity map by local feature matching, and then refines the disparity map by propagating reliable disparities to neighboring pixels in the matching cost domain. In addition, considering the gradient and multi-scale information of local disparity regions, we present a $\rho$-Census cost measure based on the well-known AD-Census, which guarantees the robustness of cost volume even without the cost aggregation step. Extensive experiments on Middlebury stereo benchmark V3 demonstrate that our scheme achieves promising performance comparable to state-of-the-art methods.

None
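The $\rho$-Census measure above extends the classic census transform; as background, here is the plain census transform with a Hamming-distance matching cost. It omits the paper's gradient and multi-scale terms, and uses wrap-around borders purely for brevity.

```python
import numpy as np

def census_transform(img, window=5):
    """Encode each pixel as a bit string of (neighbor < center) comparisons."""
    r = window // 2
    out = np.zeros(img.shape, dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            out = (out << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return out

def popcount(a):
    """Vectorized popcount: view each uint64 as eight bytes and count set bits."""
    return np.unpackbits(a.view(np.uint8)).reshape(*a.shape, 64).sum(axis=-1)

def census_cost(census_left, census_right, d):
    """Matching cost at disparity d: Hamming distance between census strings."""
    return popcount(census_left ^ np.roll(census_right, d, axis=1))
```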
Post-Stall Navigation with Fixed-Wing UAVs using Onboard Vision 2022-01-04
Show

Recent research has enabled fixed-wing unmanned aerial vehicles (UAVs) to maneuver in constrained spaces through the use of direct nonlinear model predictive control (NMPC). However, this approach has been limited to a priori known maps and ground truth state measurements. In this paper, we present a direct NMPC approach that leverages NanoMap, a light-weight point-cloud mapping framework to generate collision-free trajectories using onboard stereo vision. We first explore our approach in simulation and demonstrate that our algorithm is sufficient to enable vision-based navigation in urban environments. We then demonstrate our approach in hardware using a 42-inch fixed-wing UAV and show that our motion planning algorithm is capable of navigating around a building using a minimalistic set of goal-points. We also show that storing a point-cloud history is important for navigating these types of constrained environments.

7 pages, 10 figures None
3D Scene Understanding at Urban Intersection using Stereo Vision and Digital Map 2021-12-10
Show

The driving behavior at urban intersections is very complex. It is thus crucial for autonomous vehicles to comprehensively understand challenging urban traffic scenes in order to navigate intersections and prevent accidents. In this paper, we introduce a stereo vision and 3D digital map based approach to spatially and temporally analyze the traffic situation at urban intersections. Stereo vision is used to detect, classify and track obstacles, while a 3D digital map is used to improve ego-localization and provide context in terms of road-layout information. A probabilistic approach that temporally integrates these geometric, semantic, dynamic and contextual cues is presented. We qualitatively and quantitatively evaluate our proposed technique on real traffic data collected at an urban canyon in Tokyo to demonstrate the efficacy of the system in providing comprehensive awareness of the traffic surroundings.

6 pages, 6 figures None
DoubleStar: Long-Range Attack Towards Depth Estimation based Obstacle Avoidance in Autonomous Systems 2021-10-07
Show

Depth estimation-based obstacle avoidance has been widely adopted by autonomous systems (drones and vehicles) for safety purpose. It normally relies on a stereo camera to automatically detect obstacles and make flying/driving decisions, e.g., stopping several meters ahead of the obstacle in the path or moving away from the detected obstacle. In this paper, we explore new security risks associated with the stereo vision-based depth estimation algorithms used for obstacle avoidance. By exploiting the weaknesses of the stereo matching in depth estimation algorithms and the lens flare effect in optical imaging, we propose DoubleStar, a long-range attack that injects fake obstacle depth by projecting pure light from two complementary light sources. DoubleStar includes two distinctive attack formats: beams attack and orbs attack, which leverage projected light beams and lens flare orbs respectively to cause false depth perception. We successfully attack two commercial stereo cameras designed for autonomous systems (ZED and Intel RealSense). The visualization of fake depth perceived by the stereo cameras illustrates the false stereo matching induced by DoubleStar. We further use Ardupilot to simulate the attack and demonstrate its impact on drones. To validate the attack on real systems, we perform a real-world attack towards a commercial drone equipped with state-of-the-art obstacle avoidance algorithms. Our attack can continuously bring a flying drone to a sudden stop or drift it away across a long distance under various lighting conditions, even bypassing sensor fusion mechanisms. Specifically, our experimental results show that DoubleStar creates fake depth up to 15 meters in distance at night and up to 8 meters during the daytime. To mitigate this newly discovered threat, we provide discussions on potential countermeasures to defend against DoubleStar.

None
Stereo Video Reconstruction Without Explicit Depth Maps for Endoscopic Surgery 2021-09-16
Show

We introduce the task of stereo video reconstruction or, equivalently, 2D-to-3D video conversion for minimally invasive surgical video. We design and implement a series of end-to-end U-Net-based solutions for this task by varying the input (single frame vs. multiple consecutive frames), loss function (MSE, MAE, or perceptual losses), and network architecture. We evaluate these solutions by surveying ten experts - surgeons who routinely perform endoscopic surgery. We run two separate reader studies: one evaluating individual frames and the other evaluating fully reconstructed 3D video played on a VR headset. In the first reader study, a variant of the U-Net that takes as input multiple consecutive video frames and outputs the missing view performs best. We draw two conclusions from this outcome. First, motion information coming from multiple past frames is crucial in recreating stereo vision. Second, the proposed U-Net variant can indeed exploit such motion information for solving this task. The result from the second study further confirms the effectiveness of the proposed U-Net variant. The surgeons reported that they could successfully perceive depth from the reconstructed 3D video clips. They also expressed a clear preference for the reconstructed 3D video over the original 2D video. These two reader studies strongly support the usefulness of the proposed task of stereo reconstruction for minimally invasive surgical video and indicate that deep learning is a promising approach to this task. Finally, we identify two automatic metrics, LPIPS and DISTS, that are strongly correlated with expert judgement and that could serve as proxies for the latter in future studies.

9 pages, 5 figures None
Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms 2021-09-06
Show

Existing road pothole detection approaches can be classified as computer vision-based or machine learning-based. The former approaches typically employ 2-D image analysis/understanding or 3-D point cloud modeling and segmentation algorithms to detect road potholes from vision sensor data. The latter approaches generally address road pothole detection using convolutional neural networks (CNNs) in an end-to-end manner. However, road potholes are not necessarily ubiquitous and it is challenging to prepare a large well-annotated dataset for CNN training. In this regard, while computer vision-based methods were the mainstream research trend in the past decade, machine learning-based methods were merely discussed. Recently, we published the first stereo vision-based road pothole detection dataset and a novel disparity transformation algorithm, whereby the damaged and undamaged road areas can be highly distinguished. However, there are no benchmarks currently available for state-of-the-art (SoTA) CNNs trained using either disparity images or transformed disparity images. Therefore, in this paper, we first discuss the SoTA CNNs designed for semantic segmentation and evaluate their performance for road pothole detection with extensive experiments. Additionally, inspired by graph neural network (GNN), we propose a novel CNN layer, referred to as graph attention layer (GAL), which can be easily deployed in any existing CNN to optimize image feature representations for semantic segmentation. Our experiments compare GAL-DeepLabv3+, our best-performing implementation, with nine SoTA CNNs on three modalities of training data: RGB images, disparity images, and transformed disparity images. The experimental results suggest that our proposed GAL-DeepLabv3+ achieves the best overall pothole detection accuracy on all training data modalities.

accepted as a regular paper to IEEE Transactions on Image Processing

None
SIDE: Center-based Stereo 3D Detector with Structure-aware Instance Depth Estimation 2021-08-24
Show

3D detection plays an indispensable role in environment perception. Due to the high cost of commonly used LiDAR sensor, stereo vision based 3D detection, as an economical yet effective setting, attracts more attention recently. For these approaches based on 2D images, accurate depth information is the key to achieve 3D detection, and most existing methods resort to a preliminary stage for depth estimation. They mainly focus on the global depth and neglect the property of depth information in this specific task, namely, sparsity and locality, where exactly accurate depth is only needed for these 3D bounding boxes. Motivated by this finding, we propose a stereo-image based anchor-free 3D detection method, called structure-aware stereo 3D detector (termed as SIDE), where we explore the instance-level depth information via constructing the cost volume from RoIs of each object. Due to the information sparsity of local cost volume, we further introduce match reweighting and structure-aware attention, to make the depth information more concentrated. Experiments conducted on the KITTI dataset show that our method achieves the state-of-the-art performance compared to existing methods without depth map supervision.

accepted by WACV 2022

None
MobileStereoNet: Towards Lightweight Deep Networks for Stereo Matching 2021-08-22
Show

Recent methods in stereo matching have continuously improved the accuracy using deep models. This gain, however, is attained with a high increase in computation cost, such that the network may not fit even on a moderate GPU. This issue raises problems when the model needs to be deployed on resource-limited devices. For this, we propose two light models for stereo vision with reduced complexity and without sacrificing accuracy. Depending on the dimension of cost volume, we design a 2D and a 3D model with encoder-decoders built from 2D and 3D convolutions, respectively. To this end, we leverage 2D MobileNet blocks and extend them to 3D for stereo vision application. Besides, a new cost volume is proposed to boost the accuracy of the 2D model, making it perform close to 3D networks. Experiments show that the proposed 2D/3D networks effectively reduce the computational expense (27%/95% and 72%/38% fewer parameters/operations in 2D and 3D models, respectively) while upholding the accuracy. Our code is available at https://github.com/cogsys-tuebingen/mobilestereonet.

Under review. Further figures and tables in the appendix. Code provided

Code Link
Object Disparity 2021-08-18
Show

Most stereo vision work focuses on computing the dense pixel disparity of a given pair of left and right images. A camera pair usually requires lens undistortion and stereo calibration to provide an undistorted, epipolar-line-calibrated image pair for accurate dense pixel disparity computation. Due to noise, object occlusion, repetitive or absent texture, and limitations of matching algorithms, pixel disparity accuracy usually suffers most at object boundary areas. Although statistically the total number of pixel disparity errors might be low (under 2% according to the KITTI Vision Benchmark for current top-ranking algorithms), the percentage of these disparity errors at object boundaries is very high. This renders the subsequent 3D object distance detection much less accurate than desired. This paper proposes a different approach to 3D object distance detection by detecting object disparity directly, without going through a dense pixel disparity computation. An example SqueezeNet-based Object Disparity-SSD (OD-SSD) was constructed to demonstrate efficient object disparity detection with accuracy comparable to the KITTI dataset's pixel disparity ground truth. Further training and testing results with a mixed image dataset captured by several different stereo systems suggest that OD-SSD might be agnostic to stereo system parameters such as baseline, FOV, lens distortion, and even left/right camera epipolar-line misalignment.

10 pages, 13 figures, 7 tables

None
Accelerating Markov Random Field Inference with Uncertainty Quantification 2021-08-02
Show

Statistical machine learning has widespread application in various domains. These methods include probabilistic algorithms, such as Markov Chain Monte-Carlo (MCMC), which rely on generating random numbers from probability distributions. These algorithms are computationally expensive on conventional processors, yet their statistical properties, namely interpretability and uncertainty quantification (UQ) compared to deep learning, make them an attractive alternative approach. Therefore, hardware specialization can be adopted to address the shortcomings of conventional processors in running these applications. In this paper, we propose a high-throughput accelerator for Markov Random Field (MRF) inference, a powerful model for representing a wide range of applications, using MCMC with Gibbs sampling. We propose a tiled architecture which takes advantage of near-memory computing, and memory optimizations tailored to the semantics of MRF. Additionally, we propose a novel hybrid on-chip/off-chip memory system and logging scheme to efficiently support UQ. This memory system design is not specific to MRF models and is applicable to applications using probabilistic algorithms. In addition, it dramatically reduces off-chip memory bandwidth requirements. We implemented an FPGA prototype of our proposed architecture using high-level synthesis tools and achieved 146MHz frequency for an accelerator with 32 function units on an Intel Arria 10 FPGA. Compared to prior work on FPGA, our accelerator achieves 26X speedup. Furthermore, our proposed memory system and logging scheme to support UQ reduces off-chip bandwidth by 71% for two applications. ASIC analysis in 15nm shows our design with 2048 function units running at 3GHz outperforms GPU implementations of motion estimation and stereo vision on Nvidia RTX2080Ti by 120X-210X, occupying only 7.7% of the area.

None
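As background for the entry above, a tiny software reference for MRF inference with MCMC: Gibbs sampling on an Ising-style grid, with per-site marginals from the retained samples standing in for the uncertainty quantification the accelerator supports. Grid size, temperature, and sweep counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(spins, beta=0.8):
    """One in-place Gibbs sweep over a 2D Ising-style MRF (periodic borders).

    Each site is resampled from its conditional given its four neighbors:
    P(s = +1 | nb) = 1 / (1 + exp(-2 * beta * sum(nb)))."""
    h, w = spins.shape
    for y in range(h):
        for x in range(w):
            nb = (spins[(y - 1) % h, x] + spins[(y + 1) % h, x]
                  + spins[y, (x - 1) % w] + spins[y, (x + 1) % w])
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))
            spins[y, x] = 1 if rng.random() < p_up else -1
    return spins

spins = rng.choice(np.array([-1, 1]), size=(32, 32))
samples = []
for sweep in range(200):
    gibbs_sweep(spins)
    if sweep >= 100:              # discard burn-in sweeps
        samples.append(spins.copy())
marginal_up = (np.stack(samples) == 1).mean(axis=0)  # per-site uncertainty estimate
```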
Low-cost Stereovision system (disparity map) for few dollars 2021-06-02
Show

The paper presents an analysis of the latest developments in the field of stereo vision in the low-cost segment, both for prototypes and for industrial designs. We described the theory of stereo vision and presented information about cameras and data transfer protocols and their compatibility with various devices. The theory of image processing for stereo vision is considered, and the calibration process is described in detail. We then presented the developed stereo vision system and the main points that need to be considered when developing such systems. Finally, we presented software for adjusting stereo vision parameters in real time, written in Python for the Windows operating system. (A minimal disparity-map example follows this entry.)

None
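To complement the entry above: once a cheap rig is calibrated and rectified, OpenCV's semi-global matcher is the shortest route to a disparity map. File names and parameters below are placeholders to be tuned per setup.

```python
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,    # disparity search range; must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,         # penalty for small disparity changes between neighbors
    P2=32 * 5 * 5,        # penalty for large disparity changes (P2 > P1)
)
# compute() returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype("float32") / 16.0
```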
MarkerPose: Robust Real-time Planar Target Tracking for Accurate Stereo Pose Estimation 2021-05-29
Show

Despite the attention marker-less pose estimation has attracted in recent years, marker-based approaches still provide unbeatable accuracy under controlled environmental conditions. Thus, they are used in many fields such as robotics or biomedical applications but are primarily implemented through classical approaches, which require lots of heuristics and parameter tuning for reliable performance under different environments. In this work, we propose MarkerPose, a robust, real-time pose estimation system based on a planar target of three circles and a stereo vision system. MarkerPose is meant for high-accuracy pose estimation applications. Our method consists of two deep neural networks for marker point detection. A SuperPoint-like network for pixel-level accuracy keypoint localization and classification, and we introduce EllipSegNet, a lightweight ellipse segmentation network for sub-pixel-level accuracy keypoint detection. The marker's pose is estimated through stereo triangulation. The target point detection is robust to low lighting and motion blur conditions. We compared MarkerPose with a detection method based on classical computer vision techniques using a robotic arm for validation. The results show our method provides better accuracy than the classical technique. Finally, we demonstrate the suitability of MarkerPose in a 3D freehand ultrasound system, which is an application where highly accurate pose estimation is required. Code is available in Python and C++ at https://github.com/jhacsonmeza/MarkerPose.

Accepted at CVPR 2021 LXCV Workshop

Code Link
On the Advantages of Multiple Stereo Vision Camera Designs for Autonomous Drone Navigation 2021-05-26
Show

In this work we showcase the design and assessment of the performance of a multi-camera UAV, when coupled with state-of-the-art planning and mapping algorithms for autonomous navigation. The system leverages state-of-the-art receding horizon exploration techniques for Next-Best-View (NBV) planning with 3D and semantic information, provided by a reconfigurable multi stereo camera system. We employ our approaches in an autonomous drone-based inspection task and evaluate them in an autonomous exploration and mapping scenario. We discuss the advantages and limitations of using multi stereo camera flying systems, and the trade-off between number of cameras and mapping performance.

None
Measurement-Robust Control Barrier Functions: Certainty in Safety with Uncertainty in State 2021-04-28
Show

The increasing complexity of modern robotic systems and the environments they operate in necessitates the formal consideration of safety in the presence of imperfect measurements. In this paper we propose a rigorous framework for safety-critical control of systems with erroneous state estimates. We develop this framework by leveraging Control Barrier Functions (CBFs) and unifying the method of Backup Sets for synthesizing control invariant sets with robustness requirements -- the end result is the synthesis of Measurement-Robust Control Barrier Functions (MR-CBFs). This provides theoretical guarantees on safe behavior in the presence of imperfect measurements and improved robustness over standard CBF approaches. We demonstrate the efficacy of this framework both in simulation and experimentally on a Segway platform using an onboard stereo-vision camera for state estimation.

6 pages, 4 figures None
StereoPIFu: Depth Aware Clothed Human Digitization via Stereo Vision 2021-04-13
Show

In this paper, we propose StereoPIFu, which integrates the geometric constraints of stereo vision with implicit function representation of PIFu, to recover the 3D shape of the clothed human from a pair of low-cost rectified images. First, we introduce the effective voxel-aligned features from a stereo vision-based network to enable depth-aware reconstruction. Moreover, the novel relative z-offset is employed to associate predicted high-fidelity human depth and occupancy inference, which helps restore fine-level surface details. Second, a network structure that fully utilizes the geometry information from the stereo images is designed to improve the human body reconstruction quality. Consequently, our StereoPIFu can naturally infer the human body's spatial location in camera space and maintain the correct relative position of different parts of the human body, which enables our method to capture human performance. Compared with previous works, our StereoPIFu significantly improves the robustness, completeness, and accuracy of the clothed human reconstruction, which is demonstrated by extensive experimental results.

Accepted by CVPR2021. Project page: http://crishy1995.github.io/StereoPIFuProject

Code Link
Instantaneous Stereo Depth Estimation of Real-World Stimuli with a Neuromorphic Stereo-Vision Setup 2021-04-06
Show

The stereo-matching problem, i.e., matching corresponding features in two different views to reconstruct depth, is efficiently solved in biology. Yet, it remains the computational bottleneck for classical machine vision approaches. By exploiting the properties of event cameras, recently proposed Spiking Neural Network (SNN) architectures for stereo vision have the potential of simplifying the stereo-matching problem. Several solutions that combine event cameras with spike-based neuromorphic processors already exist. However, they are either simulated on digital hardware or tested on simplified stimuli. In this work, we use the Dynamic Vision Sensor 3D Human Pose Dataset (DHP19) to validate a brain-inspired event-based stereo-matching architecture implemented on a mixed-signal neuromorphic processor with real-world data. Our experiments show that this SNN architecture, composed of coincidence detectors and disparity sensitive neurons, is able to provide a coarse estimate of the input disparity instantaneously, thereby detecting the presence of a stimulus moving in depth in real-time.

None
The STDyn-SLAM: A stereo vision and semantic segmentation approach for SLAM in dynamic outdoor environments 2021-03-30
Show

Commonly, SLAM algorithms focus on a static environment; however, there are many scenes where dynamic objects are present. This work presents STDyn-SLAM, an image feature-based SLAM system for dynamic environments that uses a series of sub-systems, such as optical flow, ORB feature extraction, visual odometry, and convolutional neural networks, to discern moving objects in the scene. The neural network supports object detection and segmentation to avoid erroneous maps and wrong system localization. STDyn-SLAM employs a stereo pair and is developed for outdoor environments. Moreover, the processing time of the proposed system is fast enough to run in real time, as demonstrated through experiments in real dynamic outdoor environments. Further, we compare our SLAM with state-of-the-art methods, achieving promising results.

8 pages, 18 images None
IRS: A Large Naturalistic Indoor Robotics Stereo Dataset to Train Deep Models for Disparity and Surface Normal Estimation 2021-03-26
Show

Indoor robotics localization, navigation, and interaction heavily rely on scene understanding and reconstruction. Compared to monocular vision, which usually does not explicitly introduce any geometrical constraint, stereo vision-based schemes are more promising and robust in producing accurate geometrical information, such as surface normals and depth/disparity. Besides, deep learning models trained with large-scale datasets have shown their superior performance in many stereo vision tasks. However, existing stereo datasets rarely contain high-quality surface normal and disparity ground truth, which hardly satisfies the demands of training a prospective deep model for indoor scenes. To this end, we introduce a large-scale synthetic but naturalistic indoor robotics stereo (IRS) dataset with over 100K stereo RGB images and high-quality surface normal and disparity maps. Leveraging the advanced rendering techniques of our customized rendering engine, the dataset is considerably close to the real-world captured images and covers several visual effects, such as brightness changes, light reflection/transmission, lens flare, vivid shadow, etc. We compare the data distribution of IRS with existing stereo datasets to illustrate the typical visual attributes of indoor scenes. Besides, we present DTN-Net, a two-stage deep model for surface normal estimation. Extensive experiments show the advantages and effectiveness of IRS in training deep models for disparity estimation, and DTN-Net provides state-of-the-art results for normal estimation compared to existing methods.

None
MonStereo: When Monocular and Stereo Meet at the Tail of 3D Human Localization 2021-03-22
Show

Monocular and stereo visions are cost-effective solutions for 3D human localization in the context of self-driving cars or social robots. However, they are usually developed independently and have their respective strengths and limitations. We propose a novel unified learning framework that leverages the strengths of both monocular and stereo cues for 3D human localization. Our method jointly (i) associates humans in left-right images, (ii) deals with occluded and distant cases in stereo settings by relying on the robustness of monocular cues, and (iii) tackles the intrinsic ambiguity of monocular perspective projection by exploiting prior knowledge of the human height distribution. We specifically evaluate outliers as well as challenging instances, such as occluded and far-away pedestrians, by analyzing the entire error distribution and by estimating calibrated confidence intervals. Finally, we critically review the official KITTI 3D metrics and propose a practical 3D localization metric tailored for humans.

Accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2021

None
Learning Collision-Free Space Detection from Stereo Images: Homography Matrix Brings Better Data Augmentation 2021-03-12
Show

Collision-free space detection is a critical component of autonomous vehicle perception. The state-of-the-art algorithms are typically based on supervised learning. The performance of such approaches is always dependent on the quality and amount of labeled training data. Additionally, it remains an open challenge to train deep convolutional neural networks (DCNNs) using only a small quantity of training samples. Therefore, this paper mainly explores an effective training data augmentation approach that can be employed to improve the overall DCNN performance, when additional images captured from different views are available. Because the pixels of the collision-free space (generally regarded as a planar surface) between two images captured from different views can be associated by a homography matrix, the target image can be warped into the reference view. This provides a simple but effective way of generating training data from additional multi-view images (a minimal warping sketch follows this entry). Extensive experimental results, obtained with six state-of-the-art semantic segmentation DCNNs on three datasets, demonstrate the effectiveness of our proposed training data augmentation algorithm for enhancing collision-free space detection performance. When validated on the KITTI road benchmark, our approach provides the best results for stereo vision-based collision-free space detection.

accepted to IEEE/ASME Transactions on Mechatronics

None
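At its core, the augmentation in the entry above warps the extra view, and its label map, into the reference view using the road-plane homography. A sketch with the homography simply assumed known (in the paper it follows from the calibrated stereo geometry) and stand-in data throughout:

```python
import cv2
import numpy as np

# Hypothetical homography relating the road plane between target and reference views.
H = np.array([[1.0, 0.05, -20.0],
              [0.0, 1.10, -8.0],
              [0.0, 0.0005, 1.0]])

target = np.random.randint(0, 255, (384, 1280, 3), dtype=np.uint8)  # stand-in image
label = np.random.randint(0, 2, (384, 1280), dtype=np.uint8)        # stand-in mask

# Warp image and label into the reference view; nearest neighbor keeps label ids intact.
aug_image = cv2.warpPerspective(target, H, (1280, 384))
aug_label = cv2.warpPerspective(label, H, (1280, 384), flags=cv2.INTER_NEAREST)
```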
An underwater binocular stereo matching algorithm based on the best search domain 2021-02-09
Show

Binocular stereo vision is an important branch of machine vision, which imitates the human eye and matches the left and right images captured by the camera based on epipolar constraints. The matched disparity map can be calculated according to the camera imaging model to obtain a depth map, and then the depth map is converted to a point cloud image to obtain spatial point coordinates, thereby achieving the purpose of ranging. However, due to the influence of illumination under water, the captured images no longer meet the epipolar constraints, and the changes in imaging models make traditional calibration methods no longer applicable. Therefore, this paper proposes a new underwater real-time calibration method and a matching method based on the best search domain to improve the accuracy of underwater distance measurement using binocular vision.

None
Exploitation of Image Statistics with Sparse Coding in the Case of Stereo Vision 2021-01-26
Show

The sparse coding algorithm has served as a model for early processing in mammalian vision. It has been assumed that the brain uses sparse coding to exploit statistical properties of the sensory stream. We hypothesize that sparse coding discovers patterns from the data set, which can be used to estimate a set of stimulus parameters by simple readout. In this study, we chose a model of stereo vision to test our hypothesis. We used the Locally Competitive Algorithm (LCA), followed by a naïve Bayes classifier, to infer stereo disparity. From the results we report three observations. First, disparity inference was successful with this naturalistic processing pipeline. Second, an expanded, highly redundant representation is required to robustly identify the input patterns. Third, the inference error can be predicted from the number of active coefficients in the LCA representation. We conclude that sparse coding can generate a suitable general representation for subsequent inference tasks. Keywords: Sparse coding; Locally Competitive Algorithm (LCA); Efficient coding; Compact code; Probabilistic inference; Stereo vision

Author's accepted manuscript

None
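A minimal version of the LCA dynamics referenced in the entry above; the dictionary, threshold, and step count are arbitrary, and the Bayes-classifier readout is omitted.

```python
import numpy as np

def lca(D, s, lam=0.1, tau=10.0, steps=200):
    """Sketch of the Locally Competitive Algorithm (LCA) for sparse coding.

    D: (d, n) dictionary with unit-norm columns; s: (d,) input signal.
    Returns a sparse coefficient vector a for min ||s - D a||^2 + lam * ||a||_1."""
    G = D.T @ D - np.eye(D.shape[1])          # lateral inhibition (competition)
    b = D.T @ s                               # feedforward drive
    u = np.zeros(D.shape[1])                  # membrane potentials

    def thresh(u):                            # soft threshold -> active coefficients
        return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

    for _ in range(steps):
        u += (b - u - G @ thresh(u)) / tau    # LCA ODE, explicit Euler step
    return thresh(u)
```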
Computer Stereo Vision for Autonomous Driving 2020-12-17
Show

As an important component of autonomous systems, autonomous car perception has had a big leap with recent advances in parallel computing architectures. With the use of tiny but full-feature embedded supercomputers, computer stereo vision has been prevalently applied in autonomous cars for depth perception. The two key aspects of computer stereo vision are speed and accuracy. They are both desirable but conflicting properties, as the algorithms with better disparity accuracy usually have higher computational complexity. Therefore, the main aim of developing a computer stereo vision algorithm for resource-limited hardware is to improve the trade-off between speed and accuracy. In this chapter, we introduce both the hardware and software aspects of computer stereo vision for autonomous car systems. Then, we discuss four autonomous car perception tasks, including 1) visual feature detection, description and matching, 2) 3D information acquisition, 3) object detection/recognition and 4) semantic image segmentation. The principles of computer stereo vision and parallel computing on multi-threading CPU and GPU architectures are then detailed.

Book chapter None
E3D: Event-Based 3D Shape Reconstruction 2020-12-10
Show

3D shape reconstruction is a primary component of augmented/virtual reality. Despite being highly advanced, existing solutions based on RGB, RGB-D and Lidar sensors are power and data intensive, which introduces challenges for deployment in edge devices. We approach 3D reconstruction with an event camera, a sensor with significantly lower power, latency and data expense while enabling high dynamic range. While previous event-based 3D reconstruction methods are primarily based on stereo vision, we cast the problem as multi-view shape from silhouette using a monocular event camera. The output from a moving event camera is a sparse point set of space-time gradients, largely sketching scene/object edges and contours. We first introduce an event-to-silhouette (E2S) neural network module to transform a stack of event frames to the corresponding silhouettes, with additional neural branches for camera pose regression. Second, we introduce E3D, which employs a 3D differentiable renderer (PyTorch3D) to enforce cross-view 3D mesh consistency and fine-tune the E2S and pose network. Lastly, we introduce a 3D-to-events simulation pipeline and apply it to publicly available object datasets and generate synthetic event/silhouette training pairs for supervised learning.

Correct author names and only include primary author email

None
ADCPNet: Adaptive Disparity Candidates Prediction Network for Efficient Real-Time Stereo Matching 2020-11-18
Show

Efficient real-time disparity estimation is critical for the application of stereo vision systems in various areas. Recently, stereo network based on coarse-to-fine method has largely relieved the memory constraints and speed limitations of large-scale network models. Nevertheless, all of the previous coarse-to-fine designs employ constant offsets and three or more stages to progressively refine the coarse disparity map, still resulting in unsatisfactory computation accuracy and inference time when deployed on mobile devices. This paper claims that the coarse matching errors can be corrected efficiently with fewer stages as long as more accurate disparity candidates can be provided. Therefore, we propose a dynamic offset prediction module to meet different correction requirements of diverse objects and design an efficient two-stage framework. Besides, we propose a disparity-independent convolution to further improve the performance since it is more consistent with the local statistical characteristics of the compact cost volume. The evaluation results on multiple datasets and platforms clearly demonstrate that, the proposed network outperforms the state-of-the-art lightweight models especially for mobile devices in terms of accuracy and speed. Code will be made available.

None
Adjusting Bias in Long Range Stereo Matching: A semantics guided approach 2020-11-10
Show

Stereo vision generally involves the computation of pixel correspondences and estimation of disparities between rectified image pairs. In many applications, including simultaneous localization and mapping (SLAM) and 3D object detection, the disparities are primarily needed to calculate depth values, and the accuracy of depth estimation is often more compelling than disparity estimation. The accuracy of disparity estimation, however, does not directly translate to the accuracy of depth estimation, especially for faraway objects. In the context of learning-based stereo systems, this is largely due to biases imposed by the choices of the disparity-based loss function and the training data. Consequently, the learning algorithms often produce unreliable depth estimates of foreground objects, particularly at large distances (>50 m). To resolve this issue, we first analyze the effect of those biases and then propose a pair of novel depth-based loss functions for foreground and background, separately. These loss functions are tunable and can balance the inherent bias of the stereo learning algorithms. The efficacy of our solution is demonstrated by an extensive set of experiments, which are benchmarked against the state of the art. We show on the KITTI 2015 benchmark that our proposed solution yields substantial improvements in disparity and depth estimation, particularly for objects located at distances beyond 50 meters, outperforming the previous state of the art by 10%.
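
The paper defines its own pair of loss functions; as a rough PyTorch sketch of the general idea only (hypothetical names and weights), an L1 penalty in the depth domain with separate tunable foreground/background weights could look like:

```python
import torch

def depth_domain_loss(disp_pred, disp_gt, focal_px, baseline_m,
                      fg_mask, w_fg=2.0, w_bg=1.0, eps=1e-6):
    """Penalise error in depth Z = f*B/d rather than in disparity, with
    separate (tunable) weights for foreground and background pixels."""
    z_pred = focal_px * baseline_m / disp_pred.clamp(min=eps)
    z_gt = focal_px * baseline_m / disp_gt.clamp(min=eps)
    err = (z_pred - z_gt).abs()
    return w_fg * err[fg_mask].mean() + w_bg * err[~fg_mask].mean()
```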

10 pages, 8 figures None
MorphEyes: Variable Baseline Stereo For Quadrotor Navigation 2020-11-05
Show

Morphable design and depth-based visual control are two upcoming trends leading to advancements in the field of quadrotor autonomy. Stereo-cameras have struck the perfect balance of weight and accuracy of depth estimation but suffer from the problem of depth range being limited and dictated by the baseline chosen at design time. In this paper, we present a framework for quadrotor navigation based on a stereo camera system whose baseline can be adapted on-the-fly. We present a method to calibrate the system at a small number of discrete baselines and interpolate the parameters for the entire baseline range. We present an extensive theoretical analysis of calibration and synchronization errors. We showcase three different applications of such a system for quadrotor navigation: (a) flying through a forest, (b) flying through an unknown shaped/location static/dynamic gap, and (c) accurate 3D pose detection of an independently moving object. We show that our variable baseline system is more accurate and robust in all three scenarios. To our knowledge, this is the first work that applies the concept of morphable design to achieve a variable baseline stereo vision system on a quadrotor.
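
The motivation for a variable baseline follows from the first-order stereo depth uncertainty dZ ≈ Z²·dd / (f·B): error grows quadratically with range but shrinks linearly with baseline. A quick numeric check with illustrative camera values:

```python
def depth_uncertainty(z_m, focal_px, baseline_m, disp_err_px=0.5):
    """First-order stereo depth uncertainty: dZ ~ Z^2 * dd / (f * B)."""
    return z_m ** 2 * disp_err_px / (focal_px * baseline_m)

# Widening the baseline from 8 cm to 24 cm cuts the error at 10 m threefold:
for b in (0.08, 0.24):
    print(b, depth_uncertainty(10.0, focal_px=700.0, baseline_m=b))
```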

7 pages, 10 figures, 1 table. Under review in ICRA 2021

None
MaskNet: A Fully-Convolutional Network to Estimate Inlier Points 2020-10-19
Show

Point clouds have grown in importance in the way computers perceive the world. From LIDAR sensors in autonomous cars and drones to the time of flight and stereo vision systems in our phones, point clouds are everywhere. Despite their ubiquity, point clouds in the real world are often missing points because of sensor limitations or occlusions, or contain extraneous points from sensor noise or artifacts. These problems challenge algorithms that require computing correspondences between a pair of point clouds. Therefore, this paper presents a fully-convolutional neural network that identifies which points in one point cloud are most similar (inliers) to the points in another. We show improvements in learning-based and classical point cloud registration approaches when retrofitted with our network. We demonstrate these improvements on synthetic and real-world datasets. Finally, our network produces impressive results on test datasets that were unseen during training, thus exhibiting generalizability. Code and videos are available at https://github.com/vinits5/masknet

Accepted at International Conference on 3D Vision (3DV, 2020)

Code Link
Parallax Motion Effect Generation Through Instance Segmentation And Depth Estimation 2020-10-06
Show

Stereo vision is a growing topic in computer vision due to the innumerable opportunities and applications this technology offers for the development of modern solutions, such as virtual and augmented reality applications. To enhance the user's experience in three-dimensional virtual environments, motion parallax estimation is a promising technique. In this paper, we propose an algorithm for generating parallax motion effects from a single image, taking advantage of state-of-the-art instance segmentation and depth estimation approaches. This work also presents a comparison against such algorithms to investigate the trade-off between efficiency and quality of the parallax motion effects, taking into consideration a multi-task learning network capable of estimating instance segmentation and depth estimation at once. Experimental results and visual quality assessment indicate that the PyD-Net network (depth estimation) combined with Mask R-CNN or FBNet networks (instance segmentation) can produce parallax motion effects with good visual quality.

2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates

None
Improving Deep Stereo Network Generalization with Geometric Priors 2020-08-25
Show

End-to-end deep learning methods have advanced stereo vision in recent years and obtained excellent results when the training and test data are similar. However, large datasets of diverse real-world scenes with dense ground truth are difficult to obtain and currently not publicly available to the research community. As a result, many algorithms rely on small real-world datasets of similar scenes or synthetic datasets, but end-to-end algorithms trained on such datasets often generalize poorly to different images that arise in real-world applications. As a step towards addressing this problem, we propose to incorporate prior knowledge of scene geometry into an end-to-end stereo network to help networks generalize better. For a given network, we explicitly add a gradient-domain smoothness prior and occlusion reasoning into the network training, while the architecture remains unchanged during inference. Experimentally, we show consistent improvements if we train on synthetic datasets and test on the Middlebury (real images) dataset. Noticeably, we improve PSM-Net accuracy on Middlebury from 5.37 MAE to 3.21 MAE without sacrificing speed.

None
Single Storage Semi-Global Matching for Real Time Depth Processing 2020-07-07
Show

The depth map is a key computation in computer vision and robotics. One of the most popular approaches is the computation of a disparity map from images obtained with a stereo camera. The Semi-Global Matching (SGM) method is a popular choice for good accuracy with reasonable computation time. To use such compute-intensive algorithms for real-time applications such as autonomous aerial vehicles or aids for the blind, acceleration using a GPU or FPGA is necessary. In this paper, we show the design and implementation of a stereo-vision system based on an FPGA implementation of More Global Matching (MGM), a variant of SGM. We use 4 paths but store a single cumulative cost value for a corresponding pixel. Our stereo-vision prototype uses a Zedboard containing an ARM-based Zynq-SoC, a ZED stereo camera / ELP stereo camera / Intel RealSense D435i, and VGA for visualization. The power consumption attributed to the custom FPGA-based acceleration of the disparity map computation required for the depth map is just 0.72 W. The update rate of the disparity map is a realistic 10.5 fps.
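
For context, a minimal single-path SGM-style cost aggregation in NumPy (the P1/P2 penalties are illustrative; the paper's FPGA design accumulates 4 paths into one stored cumulative cost, which this sketch does not reproduce):

```python
import numpy as np

def aggregate_left_to_right(cost, P1=10.0, P2=120.0):
    """One SGM path (left to right) over a cost volume of shape (H, W, D)."""
    H, W, D = cost.shape
    L = cost.astype(np.float64).copy()
    for x in range(1, W):
        prev = L[:, x - 1, :]                        # (H, D) previous column
        m = prev.min(axis=1, keepdims=True)          # best previous cost
        up = np.roll(prev, 1, axis=1);  up[:, 0] = np.inf    # d-1 neighbour
        dn = np.roll(prev, -1, axis=1); dn[:, -1] = np.inf   # d+1 neighbour
        step = np.minimum(np.minimum(prev, up + P1),
                          np.minimum(dn + P1, m + P2))
        L[:, x, :] = cost[:, x, :] + step - m        # subtract m to bound growth
    return L
```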

10 pages, Published in National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics(NCVPRIPG) 2019

None
Monocular Depth Estimation Based On Deep Learning: An Overview 2020-07-03
Show

Depth information is important for autonomous systems to perceive environments and estimate their own state. Traditional depth estimation methods, like structure from motion and stereo vision matching, are built on feature correspondences from multiple viewpoints, and the predicted depth maps are sparse. Inferring depth information from a single image (monocular depth estimation) is an ill-posed problem. With the rapid development of deep neural networks, monocular depth estimation based on deep learning has been widely studied recently and has achieved promising accuracy; dense depth maps are estimated from single images by deep neural networks in an end-to-end manner. To improve the accuracy of depth estimation, different kinds of network frameworks, loss functions and training strategies have subsequently been proposed. Therefore, we survey the current deep learning-based monocular depth estimation methods in this review. First, we summarize several widely used datasets and evaluation metrics in deep learning-based depth estimation. Furthermore, we review some representative existing methods according to different training manners: supervised, unsupervised and semi-supervised. Finally, we discuss the challenges and provide some ideas for future research in monocular depth estimation.

14 pages, 4 figures None
Three-dimensional Human Tracking of a Mobile Robot by Fusion of Tracking Results of Two Cameras 2020-07-03
Show

This paper proposes a process that uses two cameras to obtain three-dimensional (3D) information of a target object for human tracking. Results of human detection and tracking from the two cameras are integrated to obtain the 3D information; OpenPose is used for human detection. In typical stereo-camera processing, a range image of the entire scene is acquired as precisely as possible and then processed, but this suffers from problems such as incorrect matching and the computational cost of the calibration process. A new stereo vision framework is proposed to cope with these problems. The effectiveness of the proposed framework and method is verified through target-tracking experiments.
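
Fusing per-camera detections into 3D ultimately rests on triangulation; a generic OpenCV sketch assuming known projection matrices and one already-matched keypoint (e.g., the same OpenPose joint seen in both views, in normalised image coordinates):

```python
import cv2
import numpy as np

# Projection matrices P = K[R|t]; K = I here for normalised coordinates.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                # reference camera
P2 = np.hstack([np.eye(3), np.array([[-0.1, 0.0, 0.0]]).T])  # 10 cm to the right

pts1 = np.array([[0.10], [0.05]])   # 2x1: one joint in the left view
pts2 = np.array([[0.06], [0.05]])   # same joint in the right view

X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # homogeneous 4x1 result
print((X_h[:3] / X_h[3]).ravel())                 # ~[0.25, 0.125, 2.5] metres
```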

4 pages, 11 figures None
FP-Stereo: Hardware-Efficient Stereo Vision for Embedded Applications 2020-07-01
Show

Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source hardware-efficient library, allowing designers to obtain the desired implementation instantly. Diverse methods are supported in our library for each stage of the stereo matching pipeline and a series of techniques are developed to exploit the parallelism and reduce the resource overhead. To improve the usability, FP-Stereo can generate synthesizable C code of the FPGA accelerator with our optimized HLS templates automatically. To guide users for the right design choice meeting specific application requirements, detailed comparisons are performed on various configurations of our library to investigate the accuracy/speed/cost trade-off. Experimental results also show that FP-Stereo outperforms the state-of-the-art FPGA design from all aspects, including 6.08% lower error, 2x faster speed, 30% less resource usage and 40% less energy consumption. Compared to GPU designs, FP-Stereo achieves the same accuracy at a competitive speed while consuming much less energy.

IEEE International Conference on Field Programmable Logic and Applications (FPL), 2020

None
Stereo Vision Based Single-Shot 6D Object Pose Estimation for Bin-Picking by a Robot Manipulator 2020-05-28
Show

We propose a fast and accurate method of 6D object pose estimation for bin-picking of mechanical parts by a robot manipulator. We extend the single-shot approach to stereo vision by applying an attention architecture. Our convolutional neural network model regresses to object locations and rotations from either a left image or a right image without depth information. Then, a stereo feature matching module, designated as Stereo Grid Attention, generates stereo grid matching maps. The key point of our method is to calculate the disparity only of the objects found by the attention module, instead of computing a point cloud over the entire image. The disparity value is then used to calculate the depth to the objects by the principle of triangulation. Our method also achieves a rapid processing speed of pose estimation through the single-shot architecture, and it can process a 1024 x 1024 pixel image in 75 milliseconds on the Jetson AGX Xavier with a half-float model. Weakly textured mechanical parts are used to exemplify the method. First, we create original synthetic datasets for training and evaluating the proposed model. This dataset is created by capturing and rendering numerous 3D models of several types of mechanical parts in virtual space. Finally, we use a robotic manipulator with an electromagnetic gripper to pick up the mechanical parts in a cluttered state to verify the validity of our method in an actual scene. When a raw stereo image from our stereo camera is used by the proposed method to detect black steel screws, stainless screws, and DC motor parts, i.e., cases, rotor cores and commutator caps, the bin-picking tasks are successful with 76.3%, 64.0%, 50.5%, 89.1% and 64.2% probability, respectively.

7 pages, 8 figures None
Stereo Vision for Unmanned Aerial Vehicle Detection, Tracking, and Motion Control 2020-05-07
Show

An innovative method of detecting Unmanned Aerial Vehicles (UAVs) is presented. The goal of this study is to develop a robust setup for an autonomous multi-rotor hunter UAV, capable of visually detecting and tracking the intruder UAVs for real-time motion planning. The system consists of two parts: object detection using a stereo camera to generate 3D point cloud data and video tracking applying a Kalman filter for UAV motion modeling. After detection, the hunter can aim and shoot a tethered net at the intruder to neutralize it. The computer vision, motion tracking, and planning algorithms can be implemented on a portable computer installed on the hunter UAV.
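
A constant-velocity Kalman filter of the kind commonly used for such motion modeling; all matrices and noise values below are illustrative, not the authors' tuning:

```python
import numpy as np

dt = 1.0 / 30.0                                # frame interval
F = np.eye(6); F[:3, 3:] = dt * np.eye(3)      # state: [x y z vx vy vz]
Hm = np.hstack([np.eye(3), np.zeros((3, 3))])  # we observe position only
Q = 1e-3 * np.eye(6)                           # process noise
R = 1e-2 * np.eye(3)                           # measurement noise

def kf_step(x, P, z):
    """One predict/update cycle given a 3D detection z from the point cloud."""
    x = F @ x; P = F @ P @ F.T + Q                           # predict
    K = P @ Hm.T @ np.linalg.inv(Hm @ P @ Hm.T + R)          # Kalman gain
    x = x + K @ (z - Hm @ x); P = (np.eye(6) - K @ Hm) @ P   # update
    return x, P

x, P = kf_step(np.zeros(6), np.eye(6), np.array([1.0, 2.0, 5.0]))
```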

This work was accepted as a Late-Breaking result at the IFAC World Congress 2020

None
Preintegrated Velocity Bias Estimation to Overcome Contact Nonlinearities in Legged Robot Odometry 2020-05-04
Show

In this paper, we present a novel factor graph formulation to estimate the pose and velocity of a quadruped robot on slippery and deformable terrain. The factor graph introduces a preintegrated velocity factor that incorporates velocity inputs from leg odometry and also estimates related biases. From our experimentation we have seen that it is difficult to model uncertainties at the contact point such as slip or deforming terrain, as well as leg flexibility. To accommodate for these effects and to minimize leg odometry drift, we extend the robot's state vector with a bias term for this preintegrated velocity factor. The bias term can be accurately estimated thanks to the tight fusion of the preintegrated velocity factor with stereo vision and IMU factors, without which it would be unobservable. The system has been validated on several scenarios that involve dynamic motions of the ANYmal robot on loose rocks, slopes and muddy ground. We demonstrate a 26% improvement of relative pose error compared to our previous work and 52% compared to a state-of-the-art proprioceptive state estimator.

Accepted to ICRA 2020. Video: youtu.be/w1Sx6dIqgQo

None
Active stereo vision three-dimensional reconstruction by RGB dot pattern projection and ray intersection 2020-03-31
Show

Active stereo vision is important in reconstructing objects without obvious textures. However, it is still very challenging to extract and match the projected patterns from two camera views automatically and robustly. In this paper, we propose a new pattern extraction method and a new stereo vision matching method based on our novel structured light pattern. Instead of using the widely used 2D disparity to calculate the depths of the objects, we use ray intersection to compute the 3D shapes directly. Experimental results showed that the proposed approach could reconstruct the 3D shape of the object significantly more robustly than state-of-the-art methods, including the widely used disparity-based active stereo vision method, the time-of-flight method and the structured light method. In addition, experimental results also showed that the proposed approach could reconstruct the 3D motions of dynamic shapes robustly.
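
Since two back-projected camera rays are generally skew, "ray intersection" is typically realised as the midpoint of their closest approach; a generic NumPy sketch (not necessarily the authors' exact formulation, and assuming non-parallel rays):

```python
import numpy as np

def intersect_rays(o1, d1, o2, d2):
    """Midpoint of closest approach between rays o1 + t1*d1 and o2 + t2*d2."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = d1 @ d2
    r = o2 - o1
    denom = 1.0 - b * b               # zero only for parallel rays
    t1 = ((r @ d1) - b * (r @ d2)) / denom
    t2 = (b * (r @ d1) - (r @ d2)) / denom
    return 0.5 * ((o1 + t1 * d1) + (o2 + t2 * d2))

p = intersect_rays(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                   np.array([0.1, 0.0, 0.0]), np.array([-0.02, 0.0, 1.0]))
```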

None
Hybrid calibration procedure for fringe projection profilometry based on stereo-vision and polynomial fitting 2020-03-09
Show

The key to accurate 3D shape measurement in Fringe Projection Profilometry (FPP) is the proper calibration of the measurement system. Current calibration techniques rely on phase-coordinate mapping (PCM) or back-projection stereo-vision (SV) methods. PCM methods are cumbersome to implement as they require precise positioning of the calibration target relative to the FPP system but produce highly accurate measurements within the calibration volume. SV methods generally do not achieve the same accuracy level. However, the calibration is more flexible in that the calibration target can be arbitrarily positioned. In this work, we propose a hybrid calibration method that leverages the SV calibration approach using a PCM method to achieve higher accuracy. The method has the flexibility of SV methods, is robust to lens distortions, and has a simple relation between the recovered phase and the metric coordinates. Experimental results show that the proposed Hybrid method outperforms the SV method in terms of accuracy and reconstruction time due to its low computational complexity.

Accepted for publication in Applied Optics Vol. 59 No. 13, 2020

None
A hybrid algorithm for disparity calculation from sparse disparity estimates based on stereo vision 2020-01-20
Show

In this paper, we propose a novel method for stereo disparity estimation by combining the existing methods of block-based and region-based stereo matching. Our method can generate dense disparity maps from disparity measurements of only 18% of the pixels of either the left or the right image of a stereo pair. It works by segmenting the lightness values of image pixels using a fast implementation of K-Means clustering. It then refines those segment boundaries by morphological filtering and connected-components analysis, thus removing many redundant boundary pixels. This is followed by determining the boundaries' disparities with the SAD cost function. Lastly, we reconstruct the entire disparity map of the scene from the boundaries' disparities through disparity propagation along the scan lines and disparity prediction of regions of uncertainty by considering the disparities of neighboring regions. Experimental results on the Middlebury stereo vision dataset demonstrate that the proposed method outperforms traditional disparity determination methods like SAD and NCC by up to 30% and achieves an improvement of 2.6% when compared to a recent approach based on the absolute difference (AD) cost function for disparity calculation [1].

2014 SPCOM None
Deep Learning Stereo Vision at the edge 2020-01-13
Show

We present an overview of the methodology used to build a new stereo vision solution that is suitable for System on Chip. This new solution was developed to bring computer vision capability to embedded devices that live in a power-constrained environment. The solution is constructed as a hybrid between classical stereo vision techniques and deep learning approaches. The stereoscopic module is composed of two separate modules: one that accelerates the neural network we trained and one that accelerates the front-end part. The system is completely passive and does not require any structured light to obtain very compelling accuracy. With respect to previous stereo vision solutions offered by industry, we offer a major improvement in robustness to noise, mainly due to the deep learning part of the chosen architecture. We submitted our result to the Middlebury dataset challenge, where it currently ranks as the best System on Chip solution. The system has been developed for low-latency applications which require better-than-real-time performance on high-definition videos.

None
3D Particle Positions from Computer Stereo Vision in PK-4 2019-12-09
Show

Complex plasmas consist of microparticles embedded in a low-temperature plasma containing ions, electrons and neutral particles. The microparticles form a dynamical system that can be used to study a multitude of effects on the level of the constituent particles. The microparticles are usually illuminated with a sheet of laser light, and the scattered light can be observed with digital cameras. Some complex plasma microgravity research facilities use two cameras with an overlapping field of view. An overlapping field of view can be used to combine the resulting images into one and trace the particles in the larger field of view. In previous work this was discussed for the images recorded by the PK-4 Laboratory on board the International Space Station. In that work the width of the laser sheet was, however, not taken into account. In this paper, we will discuss how to improve the transformation of the features into a joint coordinate system, and possibly extract information on the 3D position of particles in the overlap region.

None
MORPHOLO C++ Library for glasses-free multi-view stereo vision and streaming of live 3D video 2019-12-04
Show

The MORPHOLO C++ extended library converts a specific stereoscopic snapshot into a Native multi-view image through morphing algorithms that take into account display calibration data for specific slanted lenticular 3D monitors. MORPHOLO can also be used for glasses-free live streaming of 3D video, and for diverse innovative scientific, engineering and 3D video game applications; see http://www.morpholo.it

28 pages None
Sub-pixel matching method for low-resolution thermal stereo images 2019-11-30
Show

In the context of a localization and tracking application, we developed a stereo vision system based on cheap, low-resolution 80x60-pixel thermal cameras. We propose a threefold sub-pixel stereo matching framework (called ST, for Subpixel Thermal): 1) a robust feature extraction method based on phase congruency, 2) rough matching of these features at pixel precision, and 3) refined matching at sub-pixel accuracy based on local phase coherence. We performed experiments on our very low-resolution thermal images (acquired using a stereo system we manufactured) as well as on high-resolution images from a benchmark dataset. Even though phase congruency is expensive to compute, it was able to extract twice as many features as state-of-the-art methods such as ORB or SURF. We propose a modified version of phase correlation, applied in the phase congruency feature space, for sub-pixel matching. Using simulated stereo, we investigated how the phase congruency threshold and the sub-image size of sub-pixel matching influence the accuracy. We then showed that, given our stereo setup and the resolution of our images, an error of one pixel leads to a 500 mm error in the Z position of the point. Finally, we showed that our method could extract four times more matches than a baseline method (ORB + OpenCV KNN matching) on low-resolution images, and that our matches were more robust. More precisely, when projecting points of a standing person, ST achieved a standard deviation of 300 mm, whereas ORB + OpenCV KNN gave more than 1000 mm.
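
The pixel-level stage of such pipelines is often built on FFT-based phase correlation; a generic sketch (the authors apply a modified version in the phase congruency feature space, which this does not reproduce):

```python
import numpy as np

def phase_correlation(a, b):
    """Recover the circular shift between two patches from the normalised
    cross-power spectrum; sub-pixel methods interpolate around the peak."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.abs(cross) + 1e-12            # keep phase information only
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap indices so shifts can be negative.
    return [p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape)]

a = np.random.rand(64, 64)
b = np.roll(a, 3, axis=1)                     # shift a by +3 px horizontally
print(phase_correlation(a, b))                # [0, -3]: the a->b shift, negated
```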

None
Vision: A Deep Learning Approach to provide walking assistance to the visually impaired 2019-11-20
Show

Blind people face many problems in their daily routines and have to struggle just to complete day-to-day chores. In this paper, we propose a system that helps the visually impaired by providing audio guidance to avoid obstacles, assisting them in moving through their surroundings. Object detection using YOLO detects nearby objects, and depth estimation using monocular vision gives the approximate distance of the detected objects from the user. Despite its higher accuracy, stereo vision has many hardware constraints, which makes monocular vision the preferred choice for this application.

10 pages, 10 figures None
ASV: Accelerated Stereo Vision System 2019-11-15
Show

Estimating depth from stereo vision cameras, i.e., "depth from stereo", is critical to emerging intelligent applications deployed in energy- and performance-constrained devices, such as augmented reality headsets and mobile autonomous robots. While existing stereo vision systems make trade-offs between accuracy, performance and energy-efficiency, we describe ASV, an accelerated stereo vision system that simultaneously improves both performance and energy-efficiency while achieving high accuracy. The key to ASV is to exploit unique characteristics inherent to stereo vision, and apply stereo-specific optimizations, both algorithmically and computationally. We make two contributions. First, we propose a new stereo algorithm, invariant-based stereo matching (ISM), that achieves significant speedup while retaining high accuracy. The algorithm combines classic "hand-crafted" stereo algorithms with recent developments in Deep Neural Networks (DNNs), by leveraging the correspondence invariant unique to stereo vision systems. Second, we observe that the bottleneck of the ISM algorithm is the DNN inference, and in particular the deconvolution operations, which introduce massive compute inefficiencies. We propose a set of software optimizations that mitigate these inefficiencies. We show that with less than 0.5% hardware area overhead, these algorithmic and computational optimizations can be effectively integrated within a conventional DNN accelerator. Overall, ASV achieves a 5x speedup and 85% energy saving with 0.02% accuracy loss compared to today's DNN-based stereo vision systems.

MICRO 2019 None
Technical Report: Co-learning of geometry and semantics for online 3D mapping 2019-11-04
Show

This paper is a technical report about our submission for the ECCV 2018 3DRMS Workshop Challenge on Semantic 3D Reconstruction [Tylecek 2018]. In this paper, we address 3D semantic reconstruction for autonomous navigation using co-learning of depth map and semantic segmentation. The core of our pipeline is a deep multi-task neural network which tightly refines depth and also produces accurate semantic segmentation maps. Its inputs are an image and a raw depth map produced from a pair of images by standard stereo vision. The resulting semantic 3D point clouds are then merged in order to create a consistent 3D mesh, in turn used to produce dense semantic 3D reconstruction maps. The performance of each step of the proposed method is evaluated on the dataset and multiple tasks of the 3DRMS Challenge, and consistently surpasses state-of-the-art approaches.

None
SteReFo: Efficient Image Refocusing with Stereo Vision 2019-09-29
Show

Whether to attract viewer attention to a particular object, give the impression of depth or simply reproduce human-like scene perception, shallow depth-of-field images are used extensively by professional and amateur photographers alike. To this end, high-quality optical systems are used in DSLR cameras to focus on a specific depth plane while producing visually pleasing bokeh. We propose a physically motivated pipeline to mimic this effect from all-in-focus stereo images, typically retrieved by mobile cameras. It is capable of changing the focal plane a posteriori at 76 FPS on KITTI images, enabling real-time applications. As our portmanteau suggests, SteReFo interrelates stereo-based depth estimation and refocusing efficiently. In contrast to other approaches, our pipeline is simultaneously fully differentiable, physically motivated, and agnostic to scene content. It also enables computational video focus tracking for moving objects in addition to refocusing of static images. We evaluate our approach on the publicly available datasets SceneFlow, KITTI and CityScapes, and quantify the quality of architectural changes.

None
Real-time on-board obstacle avoidance for UAVs based on embedded stereo vision 2019-09-21
Show

In order to improve usability and safety, modern unmanned aerial vehicles (UAVs) are equipped with sensors to monitor the environment, such as laser scanners and cameras. One important aspect of this monitoring process is detecting obstacles in the flight path in order to avoid collisions. Since a large number of consumer UAVs suffer from tight weight and power constraints, our work focuses on obstacle avoidance based on a lightweight stereo camera setup. We use disparity maps, which are computed from the camera images, to locate obstacles and to automatically steer the UAV around them. For disparity map computation we optimize the well-known semi-global matching (SGM) approach for deployment on an embedded FPGA. The disparity maps are then converted into simpler representations, the so-called U-/V-maps, which are used for obstacle detection. Obstacle avoidance is based on a reactive approach which finds the shortest path around the obstacles as soon as they are within a critical distance of the UAV. One of the fundamental goals of our work was the reduction of development costs by closing the gap between application development and hardware optimization. Hence, we aimed at using high-level synthesis (HLS) for porting our algorithms, which are written in C/C++, to the embedded FPGA. We evaluated our implementation of the disparity estimation on the KITTI Stereo 2015 benchmark. The integrity of the overall real-time reactive obstacle avoidance algorithm has been evaluated by using hardware-in-the-loop testing in conjunction with two flight simulators.
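
U-/V-maps are simply column- and row-wise disparity histograms; a compact NumPy sketch (loops kept for clarity; np.add.at would vectorise them):

```python
import numpy as np

def uv_disparity(disp, d_max=64):
    """Obstacles show up as near-vertical segments in the U-map,
    the ground plane as a slanted line in the V-map."""
    d = np.clip(disp.astype(int), 0, d_max - 1)
    H, W = d.shape
    u_map = np.zeros((d_max, W), dtype=np.int32)  # disparity histogram per column
    v_map = np.zeros((H, d_max), dtype=np.int32)  # disparity histogram per row
    for v in range(H):
        for u in range(W):
            u_map[d[v, u], u] += 1
            v_map[v, d[v, u]] += 1
    return u_map, v_map
```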

Accepted in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Science

None
Extending Monocular Visual Odometry to Stereo Camera Systems by Scale Optimization 2019-09-17
Show

This paper proposes a novel approach for extending monocular visual odometry to a stereo camera system. The proposed method uses an additional camera to accurately estimate and optimize the scale of the monocular visual odometry, rather than triangulating 3D points from stereo matching. Specifically, the 3D points generated by the monocular visual odometry are projected onto the other camera of the stereo pair, and the scale is recovered and optimized by directly minimizing the photometric error. It is computationally efficient, adding minimal overhead to the stereo vision system compared to straightforward stereo matching, and is robust to repetitive texture. Additionally, direct scale optimization enables stereo visual odometry to be purely based on the direct method. Extensive evaluation on public datasets (e.g., KITTI), and outdoor environments (both terrestrial and underwater) demonstrates the accuracy and efficiency of a stereo visual odometry approach extended by scale optimization, and its robustness in environments with challenging textures.

None
Unsupervised Video Depth Estimation Based on Ego-motion and Disparity Consensus 2019-09-03
Show

Unsupervised depth estimation methods have received more and more attention as they do not need the vast quantities of densely labeled training data, which are tough to acquire. In this paper, we propose a novel unsupervised monocular video depth estimation method for natural scenes by taking advantage of the state-of-the-art method of Zhou et al., which jointly estimates depth and camera motion. Our method advances beyond the baseline method in three aspects: 1) we add an additional supervision signal to the baseline method by incorporating a left-right binocular image reconstruction loss based on the estimated disparities, so that the left frame can be reconstructed from the temporal frames and the right frames of the stereo pair; 2) the network is trained by jointly using two kinds of view-synthesis loss and a left-right disparity consistency regularization to estimate depth and pose simultaneously; 3) we use an edge-aware smooth L2 regularization to smooth the depth map while preserving the contours of the target. Extensive experiments on the KITTI autonomous driving dataset and the Make3D dataset indicate the superior training efficiency of our algorithm: we achieve results competitive with the baseline using only 3/5 of the training data. The experimental results also show that our method even outperforms classical supervised methods that use either ground-truth depth or a given pose for training.
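
Point 3 uses an edge-aware smoothness term; a generic PyTorch sketch of such a regulariser (L2 on depth gradients, down-weighted across strong image edges; the details are illustrative rather than the paper's exact formulation):

```python
import torch

def edge_aware_smoothness(depth, image):
    """depth: (N,1,H,W); image: (N,3,H,W). Depth gradients are penalised
    less where the image itself has strong gradients (object contours)."""
    dzx = (depth[:, :, :, 1:] - depth[:, :, :, :-1]) ** 2
    dzy = (depth[:, :, 1:, :] - depth[:, :, :-1, :]) ** 2
    gix = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    giy = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dzx * torch.exp(-gix)).mean() + (dzy * torch.exp(-giy)).mean()
```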

None
IVOA: Introspective Vision for Obstacle Avoidance 2019-07-31
Show

Vision, as an inexpensive yet information rich sensor, is commonly used for perception on autonomous mobile robots. Unfortunately, accurate vision-based perception requires a number of assumptions about the environment to hold -- some examples of such assumptions, depending on the perception algorithm at hand, include purely lambertian surfaces, texture-rich scenes, absence of aliasing features, and refractive surfaces. In this paper, we present an approach for introspective vision for obstacle avoidance (IVOA) -- by leveraging a supervisory sensor that is occasionally available, we detect failures of stereo vision-based perception from divergence in plans generated by vision and the supervisory sensor. By projecting the 3D coordinates where the plans agree and disagree onto the images used for vision-based perception, IVOA generates a training set of reliable and unreliable image patches for perception. We then use this training dataset to learn a model of which image patches are likely to cause failures of the vision-based perception algorithm. Using this model, IVOA is then able to predict whether the relevant image patches in the observed images are likely to cause failures due to vision (both false positives and false negatives). We empirically demonstrate with extensive real-world data from both indoor and outdoor environments, the ability of IVOA to accurately predict the failures of two distinct vision algorithms.

To be published in IROS 2019 (IEEE/RSJ International Conference on Intelligent Robots and Systems)

None
FPGA-based Binocular Image Feature Extraction and Matching System 2019-05-14
Show

Image feature extraction and matching is a fundamental but computation-intensive task in machine vision. This paper proposes a novel FPGA-based embedded system to accelerate feature extraction and matching. It implements SURF feature point detection and BRIEF feature descriptor construction and matching. For binocular stereo vision, feature matching includes both tracking matching and stereo matching, which simultaneously provide feature point correspondences and parallax information. Our system is evaluated on a ZYNQ XC7Z045 FPGA. The results demonstrate that it can process binocular video data at a high frame rate (640x480 @ 162 fps). Moreover, extensive tests show that our system is robust to image compression, blurring and illumination changes.

Accepted for the 4th International Conference on Multimedia Systems and Signal Processing (ICMSSP 2019)

None
UDFNet: Unsupervised Disparity Fusion with Adversarial Networks 2019-04-22
Show

Existing disparity fusion methods based on deep learning achieve state-of-the-art performance, but they require ground truth disparity data to train. To the best of our knowledge, this is the first time an unsupervised disparity fusion method that does not use ground truth disparity data has been proposed. In this paper, a mathematical model for disparity fusion is proposed to guide an adversarial network to train effectively without ground truth disparity data. The initial disparity maps from the left view are input, along with auxiliary information (gradient, left & right intensity images), into the refiner, and the refiner is trained to output a refined disparity map registered on the left view. The refined left disparity map and the left intensity image are used to reconstruct a fake right intensity image. Finally, the fake and real right intensity images (from the right stereo vision camera) are fed into the discriminator. In the model, the refiner is trained to output a refined disparity value close to the weighted sum of the disparity inputs for global initialisation. Then, three refinement principles are adopted to refine the results further: (1) the reconstruction error between the fake and real right intensity images is minimised; (2) the similarities between the fake and real right images in different receptive fields are maximised; (3) the refined disparity map is smoothed based on the corresponding intensity image. The adversarial network architectures are effective for the fusion task, and the fusion time using the proposed network is small. The network can achieve 90 fps using an Nvidia GeForce GTX 1080Ti on the KITTI 2015 dataset when the input resolution is 1242 x 375 (width x height) without downsampling and cropping. The accuracy of this work is equal to (or better than) that of the state-of-the-art supervised methods.

13 pages. arXiv admin note: text overlap with arXiv:1803.06657

None
AI-IMU Dead-Reckoning 2019-04-12
Show

In this paper we propose a novel accurate method for dead-reckoning of wheeled vehicles based only on an Inertial Measurement Unit (IMU). In the context of intelligent vehicles, robust and accurate dead-reckoning based on the IMU may prove useful to correlate feeds from imaging sensors, to safely navigate through obstructions, or for safe emergency stops in the extreme case of exteroceptive sensors failure. The key components of the method are the Kalman filter and the use of deep neural networks to dynamically adapt the noise parameters of the filter. The method is tested on the KITTI odometry dataset, and our dead-reckoning inertial method based only on the IMU accurately estimates 3D position, velocity, orientation of the vehicle and self-calibrates the IMU biases. We achieve on average a 1.10% translational error and the algorithm competes with top-ranked methods which, by contrast, use LiDAR or stereo vision. We make our implementation open-source at: https://github.com/mbrossar/ai-imu-dr

Code Link
Real-Time Dense Stereo Embedded in A UAV for Road Inspection 2019-04-12
Show

The condition assessment of road surfaces is essential to ensure their serviceability while still providing maximum road traffic safety. This paper presents a robust stereo vision system embedded in an unmanned aerial vehicle (UAV). The perspective view of the target image is first transformed into the reference view, and this not only improves the disparity accuracy, but also reduces the algorithm's computational complexity. The cost volumes generated from stereo matching are then filtered using a bilateral filter. The latter has been proved to be a feasible solution for the functional minimisation problem in a fully connected Markov random field model. Finally, the disparity maps are transformed by minimising an energy function with respect to the roll angle and disparity projection model. This makes the damaged road areas more distinguishable from the road surface. The proposed system is implemented on an NVIDIA Jetson TX2 GPU with CUDA for real-time purposes. It is demonstrated through experiments that the damaged road areas can be easily distinguished from the transformed disparity maps.

9 pages, 8 figures, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 16-20, 2019, Long Beach, USA

None
High-Precision Online Markerless Stereo Extrinsic Calibration 2019-03-26
Show

Stereo cameras and dense stereo matching algorithms are core components for many robotic applications due to their ability to directly obtain dense depth measurements and their robustness against changes in lighting conditions. However, the performance of dense depth estimation relies heavily on accurate stereo extrinsic calibration. In this work, we present a real-time markerless approach for obtaining high-precision stereo extrinsic calibration using a novel 5-DOF (degrees-of-freedom) nonlinear optimization on a manifold, which captures the observability property of vision-only stereo calibration. Our method minimizes epipolar errors between spatial per-frame sparse natural features. It does not require temporal feature correspondences, making it not only invariant to dynamic scenes and illumination changes, but also able to run significantly faster than standard bundle adjustment-based approaches. We introduce a principled method to determine if the calibration converges to the required level of accuracy, and show through online experiments that our approach achieves a level of accuracy that is comparable to offline marker-based calibration methods. Our method refines the stereo extrinsics to an accuracy that is sufficient for block matching-based dense disparity computation. It provides a cost-effective way to improve the reliability of stereo vision systems for long-term autonomy.
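
The minimised quantity is an epipolar residual; a generic NumPy sketch of the Sampson epipolar distance for matched points under a fundamental matrix (the paper's 5-DOF manifold parameterisation is not reproduced here):

```python
import numpy as np

def sampson_errors(F, pts1, pts2):
    """First-order epipolar distances for Nx2 point arrays under F."""
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    Fx1 = x1 @ F.T            # epipolar lines in image 2
    Ftx2 = x2 @ F             # epipolar lines in image 1
    num = np.sum(x2 * Fx1, axis=1) ** 2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

F = np.array([[0., 0., 0.], [0., 0., -1.], [0., 1., 0.]])  # ideal rectified pair
print(sampson_errors(F, np.array([[100.0, 50.0]]), np.array([[80.0, 50.5]])))
```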

None
BLVD: Building A Large-scale 5D Semantics Benchmark for Autonomous Driving 2019-03-15
Show

In the autonomous driving community, numerous benchmarks have been established to assist the tasks of 3D/2D object detection, stereo vision, and semantic/instance segmentation. However, the more meaningful dynamic evolution of the objects surrounding the ego-vehicle is rarely exploited and lacks a large-scale dataset platform. To address this, we introduce BLVD, a large-scale 5D semantics benchmark which does not concentrate on the static detection or semantic/instance segmentation tasks tackled adequately before. Instead, BLVD aims to provide a platform for the tasks of dynamic 4D (3D+temporal) tracking, 5D (4D+interactive) interactive event recognition and intention prediction. This benchmark will enable a deeper understanding of traffic scenes than ever before. In total, we provide 249,129 3D annotations, 4,902 independent individuals for tracking with an overall length of 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are contained in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime). The benchmark can be downloaded from our project site https://github.com/VCCIV/BLVD/.

To appear in ICRA2019

Code Link
Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving 2018-11-29
Show

We propose a stereo vision-based approach for tracking the camera ego-motion and 3D semantic objects in dynamic autonomous driving scenarios. Instead of directly regressing the 3D bounding box using end-to-end approaches, we propose to use easy-to-label 2D detection and discrete viewpoint classification together with a light-weight semantic inference method to obtain rough 3D object measurements. Based on object-aware-aided camera pose tracking, which is robust in dynamic environments, in combination with our novel dynamic object bundle adjustment (BA) approach to fuse temporal sparse feature correspondences and the semantic 3D measurement model, we obtain 3D object pose, velocity and anchored dynamic point cloud estimation with instance accuracy and temporal consistency. The performance of our proposed method is demonstrated in diverse scenarios. Both the ego-motion estimation and object localization are compared with state-of-the-art solutions.

14 pages, 9 figures, eccv2018

None
Hybrid Feature Based SLAM Prototype 2018-10-18
Show

The recent development of information technology and increased computing capacity have allowed artificial vision to be combined with SLAM, giving rise to what is known as Visual SLAM. The objective of this paper is to build a navigation framework based on Visual SLAM that places a robot in a basic, unknown environment and enables it to build a three-dimensional map of it, using as input only the recording of its path with a stereo vision camera. The result of this analysis is that the Visual SLAM framework, combined with FastSLAM (a Kalman filter combined with a particle filter and SIFT), recognizes and identifies characteristic points in images with adequate precision and without ambiguity. The framework uses MATLAB for its flexibility and convenience in performing a wide range of tests. The program has been tested on prerecorded video input from a stereo camera following a course through an office environment. The algorithm first locates points of interest in a stereo frame captured by the camera. These are located in 3D and associated with an identification descriptor. In the next frame, the camera likewise identifies points of interest, and those previously detected are found by comparing their descriptors. This process is known as "data association" and its successful completion is fundamental to the SLAM algorithm. The position data of the robot and the points of interest are stored in data structures known as "particles" that evolve independently. Their management is very important for the proper functioning of the FastSLAM algorithm. The results are found to be satisfactory.

7 pages,1 figures None
Improved Semantic Stixels via Multimodal Sensor Fusion 2018-09-27
Show

This paper presents a compact and accurate representation of 3D scenes that are observed by a LiDAR sensor and a monocular camera. The proposed method is based on the well-established Stixel model originally developed for stereo vision applications. We extend this Stixel concept to incorporate data from multiple sensor modalities. The resulting mid-level fusion scheme takes full advantage of the geometric accuracy of LiDAR measurements as well as the high resolution and semantic detail of RGB images. The obtained environment model provides a geometrically and semantically consistent representation of the 3D scene at a significantly reduced amount of data while minimizing information loss at the same time. Since the different sensor modalities are considered as input to a joint optimization problem, the solution is obtained with only minor computational overhead. We demonstrate the effectiveness of the proposed multimodal Stixel algorithm on a manually annotated ground truth dataset. Our results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.

None
Real-Time Stereo Vision on FPGAs with SceneScan 2018-09-21
Show

We present a flexible FPGA stereo vision implementation that is capable of processing up to 100 frames per second or image resolutions up to 3.4 megapixels, while consuming only 8 W of power. The implementation uses a variation of the Semi-Global Matching (SGM) algorithm, which provides superior results compared to many simpler approaches. The stereo matching results are improved significantly through a post-processing chain that operates on the computed cost cube and the disparity map. With this implementation we have created two stand-alone hardware systems for stereo vision, called SceneScan and SceneScan Pro. Both systems have been developed to market maturity and are available from Nerian Vision GmbH.

12 pages, 3 figures; accepted for publication at Forum Bildverarbeitung 2018

None
There's No Place Like Home: Visual Teach and Repeat for Emergency Return of Multirotor UAVs During GPS Failure 2018-09-15
Show

Redundant navigation systems are critical for safe operation of UAVs in high-risk environments. Since most commercial UAVs almost wholly rely on GPS, jamming, interference and multi-pathing are real concerns that usually limit their operations to low-risk environments and Visual Line-Of-Sight. This paper presents a vision-based route-following system for the autonomous, safe return of UAVs under primary navigation failure such as GPS jamming. Using a Visual Teach & Repeat framework to build a visual map of the environment during an outbound flight, we show the autonomous return of the UAV by visually localising the live view to this map when a simulated GPS failure occurs, controlling the vehicle to follow the safe outbound path back to the launch point. Using gimbal-stabilised stereo vision alone, without reliance on external infrastructure or inertial sensing, visual odometry and localisation are achieved at altitudes of 5-25 m and flight speeds up to 55 km/h. We examine the performance of the visual localisation algorithm under a variety of conditions and also demonstrate closed-loop autonomy along a complicated 450 m path.

8 pages, 8 figures, journal

None
Monocular Depth Estimation by Learning from Heterogeneous Datasets 2018-09-12
Show

Depth estimation provides essential information for autonomous driving and driver assistance. Monocular depth estimation is especially interesting from a practical point of view, since using a single camera is cheaper than many other options and avoids the need for the continuous calibration strategies required by stereo-vision approaches. State-of-the-art methods for monocular depth estimation are based on Convolutional Neural Networks (CNNs). A promising line of work consists of introducing additional semantic information about the traffic scene when training CNNs for depth estimation. In practice, this means that the depth data used for CNN training is complemented with images having pixel-wise semantic labels, which usually are difficult to annotate (e.g. crowded urban images). Moreover, so far it is common practice to assume that the same raw training data is associated with both types of ground truth, i.e., depth and semantic labels. The main contribution of this paper is to show that this hard constraint can be circumvented, i.e., that we can train CNNs for depth estimation by leveraging the depth and semantic information coming from heterogeneous datasets. In order to illustrate the benefits of our approach, we combine the KITTI depth and Cityscapes semantic segmentation datasets, outperforming state-of-the-art results on monocular depth estimation.

Accepted in IEEE-Intelligent Vehicles Symposium, IV'2018

None
Obstacle Detection Quality as a Problem-Oriented Approach to Stereo Vision Algorithms Estimation in Road Situation Analysis 2018-09-06
Show

In this work we present a method for performance evaluation of stereo vision-based obstacle detection techniques that takes into account the specifics of road situation analysis to minimize the effort required to prepare a test dataset. This approach has been designed to be implemented in systems such as self-driving cars or driver assistance, and can also be used as a problem-oriented quality criterion for the evaluation of stereo vision algorithms.

None
UnrealStereo: Controlling Hazardous Factors to Analyze Stereo Vision 2018-09-06
Show

A reliable stereo algorithm is critical for many robotics applications. But textureless and specular regions can easily cause failure by making feature matching difficult. Understanding whether an algorithm is robust to these hazardous regions is important. Although many stereo benchmarks have been developed to evaluate performance, it is hard to quantify the effect of hazardous regions in real images because the location and severity of these regions are unknown. In this paper, we develop a synthetic image generation tool that enables control of hazardous factors, such as making objects more specular or transparent, to produce hazardous regions of different degrees. The densely controlled sampling strategy in virtual worlds enables effective stress testing of stereo algorithms by varying the types and degrees of the hazard. We generate a large synthetic image dataset with automatically computed hazardous regions and analyze algorithms on these regions. The observations from synthetic images are further validated by annotating hazardous regions in the real-world datasets Middlebury and KITTI (which gives a sparse sampling of the hazards). Our synthetic image generation tool is based on the game engine Unreal Engine 4 and will be open-sourced along with the virtual scenes used in our experiments. Many publicly available realistic game contents can be used by our tool, providing an enormous resource for the development and evaluation of algorithms.

3DV 2018 (oral) None
Real-Time Stereo Vision for Road Surface 3-D Reconstruction 2018-08-29
Show

Stereo vision techniques have been widely used in civil engineering to acquire 3-D road data. The two important factors of stereo vision are accuracy and speed. However, it is very challenging to achieve both of them simultaneously and therefore the main aim of developing a stereo vision system is to improve the trade-off between these two factors. In this paper, we present a real-time stereo vision system used for road surface 3-D reconstruction. The proposed system is developed from our previously published 3-D reconstruction algorithm where the perspective view of the target image is first transformed into the reference view, which not only increases the disparity accuracy but also improves the processing speed. Then, the correlation cost between each pair of blocks is computed and stored in two 3-D cost volumes. To adaptively aggregate the matching costs from neighbourhood systems, bilateral filtering is performed on the cost volumes. This greatly reduces the ambiguities during stereo matching and further improves the precision of the estimated disparities. Finally, the subpixel resolution is achieved by conducting a parabola interpolation and the subpixel disparity map is used to reconstruct the 3-D road surface. The proposed algorithm is implemented on an NVIDIA GTX 1080 GPU for the real-time purpose. The experimental results illustrate that the reconstruction accuracy is around 3 mm.
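
The subpixel step mentioned at the end is the classic three-point parabola fit over matching costs; a self-contained sketch with a worked number:

```python
def subpixel_disparity(c_prev, c_best, c_next, d_best):
    """Fit a parabola through the costs at d-1, d, d+1 and return its minimum:
    d_sub = d + (C(d-1) - C(d+1)) / (2 * (C(d-1) - 2*C(d) + C(d+1)))."""
    denom = c_prev - 2.0 * c_best + c_next
    if denom == 0.0:
        return float(d_best)
    return d_best + 0.5 * (c_prev - c_next) / denom

# Costs 4.0, 1.0, 3.0 around d = 17: the minimum shifts toward the cheaper
# neighbour, giving 17 + 0.5 * (4 - 3) / (4 - 2 + 3) = 17.1.
print(subpixel_disparity(4.0, 1.0, 3.0, 17))
```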

6 pages, 4 figures, IEEE International Conference on Imaging System and Techniques (IST) 2018. arXiv admin note: substantial text overlap with arXiv:1807.02044

None
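The subpixel step mentioned above is, in its standard form, a parabola fitted through the winning matching cost and its two neighbouring disparities. A minimal NumPy sketch of that standard technique (the cost-volume layout and the function name are assumptions for illustration, not the authors' code):

```python
import numpy as np

def subpixel_parabola(cost_volume, disparity):
    """Refine integer disparities by fitting a parabola through the
    matching costs at d-1, d and d+1 and taking its analytic minimum."""
    h, w, dmax = cost_volume.shape
    rows, cols = np.indices((h, w))
    d = np.clip(disparity, 1, dmax - 2)            # keep neighbours in range
    c0 = cost_volume[rows, cols, d - 1]
    c1 = cost_volume[rows, cols, d]
    c2 = cost_volume[rows, cols, d + 1]
    denom = c0 - 2.0 * c1 + c2                     # parabola curvature
    offset = np.where(np.abs(denom) > 1e-6,
                      0.5 * (c0 - c2) / denom,     # vertex of the parabola
                      0.0)                         # flat cost: keep integer d
    return d + np.clip(offset, -0.5, 0.5)
```

Clipping the offset to ±0.5 keeps the refined disparity inside the winning integer bin, a common safeguard when the three costs do not form a well-behaved parabola.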
Multiple Lane Detection Algorithm Based on Optimised Dense Disparity Map Estimation 2018-08-28
Show

Lane detection is very important for self-driving vehicles. In recent years, computer stereo vision has been widely used to enhance the accuracy of lane detection systems. This paper presents a multiple lane detection algorithm based on optimised dense disparity map estimation, where the disparity information obtained at time t_{n} is utilised to optimise the disparity estimation at time t_{n+1}. This is achieved by estimating the road model at time t_{n} and then restricting the search range for the disparity estimation at time t_{n+1} (a sketch of this idea follows the entry below). The lanes are then detected using our previously published algorithm, in which the vanishing point information is used to model the lanes. The experimental results show that the runtime of the disparity estimation is reduced by around 37% and the accuracy of the lane detection is about 99%.

5 pages, 7 figures, IEEE International Conference on Imaging Systems and Techniques (IST) 2018

None
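A sketch of the search-range restriction, assuming a planar road so that the road disparity is linear in the image row, d(v) = a·v + b (the paper's actual road model and tolerance may differ; `margin` is hypothetical):

```python
import numpy as np

def row_search_ranges(road_model, image_height, margin=8):
    """Predict a per-row disparity search range for frame t_{n+1} from the
    road model (a, b) fitted on the disparity map of frame t_n."""
    a, b = road_model
    v = np.arange(image_height)
    d_pred = a * v + b                               # expected road disparity
    d_min = np.clip(np.floor(d_pred - margin), 0, None).astype(int)
    d_max = np.ceil(d_pred + margin).astype(int)
    return d_min, d_max  # matcher evaluates only d in [d_min[v], d_max[v]]
```

Restricting each row to a narrow band around the predicted road disparity removes most of the matching work, which is consistent with the reported 37% runtime reduction.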
Real-Time Subpixel Fast Bilateral Stereo 2018-08-15
Show

Stereo vision has been widely used in robotic systems to acquire 3-D information. In recent years, many researchers have applied bilateral filtering in stereo vision to adaptively aggregate matching costs, which has greatly improved the accuracy of the estimated disparity maps. However, filtering the whole cost volume is very time consuming, so researchers have had to resort to powerful hardware to achieve real-time performance. This paper presents an implementation of fast bilateral stereo on a state-of-the-art GPU. By fully exploiting the parallel computing architecture of the GPU, the fast bilateral stereo runs in real time when processing the Middlebury stereo datasets. (A sketch of bilateral cost aggregation follows the entry below.)

8 pages, 7 figures, International Conference on Information and Automation

None
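For reference, a direct (deliberately unoptimised) sketch of bilateral aggregation of one disparity slice of the cost volume, with weights computed from the reference image; the paper's contribution is a GPU parallelisation of this kind of computation, and its exact kernel and parameters are not reproduced here:

```python
import numpy as np

def bilateral_aggregate(cost_slice, guide, radius=5, sigma_s=3.0, sigma_r=10.0):
    """Aggregate one disparity slice of the cost volume with joint
    spatial/range (bilateral) weights from the grayscale guide image."""
    h, w = cost_slice.shape
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))
    g = np.pad(guide, radius, mode='edge')
    c = np.pad(cost_slice, radius, mode='edge')
    out = np.zeros_like(cost_slice)
    for y in range(h):
        for x in range(w):
            gw = g[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            cw = c[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-(gw - guide[y, x])**2 / (2 * sigma_r**2))
            wgt = spatial * rng                    # bilateral weight
            out[y, x] = (wgt * cw).sum() / wgt.sum()
    return out
```

Because every output pixel is independent, the double loop maps naturally onto one GPU thread per pixel, which is what makes a real-time implementation feasible.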
CADDY Underwater Stereo-Vision Dataset for Human-Robot Interaction (HRI) in the Context of Diver Activities 2018-07-12
Show

In this article we present a novel underwater dataset collected from several field trials within the EU FP7 project "Cognitive autonomous diving buddy (CADDY)", where an Autonomous Underwater Vehicle (AUV) was used to interact with divers and monitor their activities. To our knowledge, this is one of the first efforts to collect a large dataset in underwater environments targeting object classification, segmentation and human pose estimation tasks. The first part of the dataset contains stereo camera recordings (~10K) of divers performing hand gestures to communicate and interact with an AUV in different environmental conditions. These gesture samples serve to test the robustness of object detection and classification algorithms against underwater image distortions, i.e., color attenuation and light backscatter. The second part includes stereo footage (~12.7K) of divers free-swimming in front of the AUV, along with synchronized IMU measurements located throughout the diver's suit (DiverNet), which serve as ground truth for human pose and tracking methods. In both cases, the rectified images allow the investigation of 3D representation and reasoning pipelines on the low-texture targets commonly present in underwater scenarios. In this paper we describe our recording platform, the sensor calibration procedure, the data format, and the utilities provided for using the dataset.

submitted to IJRR None
Real-time stereo vision-based lane detection system 2018-07-08
Show

The detection of multiple curved lane markings on a non-flat road surface is still a challenging task for automotive applications. To make an improvement, depth information can be used to greatly enhance the robustness of lane detection systems. The system proposed in this paper is developed from our previous work, in which the dense vanishing point Vp is estimated globally to assist the detection of multiple curved lane markings. However, outliers may severely affect the accuracy of the least squares fitting when estimating Vp. Therefore, in this paper we use Random Sample Consensus to update the inliers and outliers iteratively until the fraction of inliers exceeds a pre-set threshold (a generic sketch of this loop follows the entry below). This significantly helps the system cope with suddenly changing conditions. Furthermore, we propose a novel lane position validation approach which provides a piecewise weight based on Vp and the gradient to reduce the gradient magnitude of non-lane candidates. Then, we compute the energy of each possible solution and select all satisfying lane positions for visualisation. The proposed system is implemented on a heterogeneous system consisting of an Intel Core i7-4720HQ CPU and an NVIDIA GTX 970M GPU. A processing speed of 143 fps has been achieved, which is over 38 times faster than our previous work. Also, to evaluate the detection precision, we tested 2495 frames with 5361 lanes from the KITTI database (1637 lanes more than in our previous experiment). The overall successful detection rate is improved from 98.7% to 99.5%.

24 pages, 10 figures None
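A generic RANSAC loop with the iterate-until-inlier-fraction stopping rule described above, shown for a simple line model (the paper fits a vanishing-point model instead; all thresholds here are hypothetical):

```python
import numpy as np

def ransac_line(points, n_iters=200, inlier_tol=2.0, stop_frac=0.8, seed=0):
    """Fit y = a*x + b to (N, 2) points with RANSAC; returns the best
    model and its inlier mask."""
    rng = np.random.default_rng(seed)
    best_model, best_mask = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue                                  # degenerate sample
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        resid = np.abs(points[:, 1] - (a * points[:, 0] + b))
        mask = resid < inlier_tol
        if mask.sum() > best_mask.sum():
            best_model, best_mask = (a, b), mask
            if mask.mean() >= stop_frac:              # inlier fraction reached
                break
    return best_model, best_mask
```

The same skeleton applies to vanishing-point estimation; only the minimal-sample model fit and the residual computation change.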
Non-flat Ground Detection Based on A Local Descriptor 2018-06-06
Show

The detection of roads and free space remains challenging for non-flat roads, especially with varying lateral and longitudinal slope or in the case of multiple ground planes. In this paper, we propose a framework for ground plane detection with stereo vision. The main contribution of this paper is a newly proposed descriptor which is applied to the disparity image to obtain a disparity texture image. Ground plane regions can be distinguished from their surroundings effectively in the disparity texture image. Because the descriptor operates on local areas of the image, it handles non-flat roads well. We also present a complete framework to detect ground plane regions based on the disparity texture image, using a convolutional neural network architecture.

9 pages, submitted to IEICE Transactions on Information and Systems

None
Fast Disparity Estimation using Dense Networks 2018-05-19
Show

Disparity estimation is a difficult problem in stereo vision because correspondence techniques fail in images with textureless and repetitive regions. A recent body of work using deep convolutional neural networks (CNNs) overcomes this problem with semantics. Most CNN implementations use an autoencoder method: stereo images are encoded, merged and finally decoded to predict the disparity map. In this paper, we present a CNN implementation inspired by dense networks to reduce the number of parameters. Furthermore, our approach takes semantic reasoning into account in disparity estimation. Our proposed network, called DenseMapNet, is compact, fast and can be trained end-to-end. DenseMapNet requires only 290k parameters and runs at 30Hz or faster on color stereo images at full resolution. Experimental results show that DenseMapNet accuracy is comparable with other, significantly bigger CNN-based methods. (A sketch of the dense connectivity pattern follows the entry below.)

In Proc. International Conference on Robotics and Automation 2018 (ICRA2018)

None
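Dense connectivity is what keeps the parameter count low: each layer receives the concatenation of all preceding feature maps, so individual layers can stay narrow. A minimal PyTorch sketch of that pattern (illustrative only, not the DenseMapNet architecture; the layer count and growth rate are assumptions):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """DenseNet-style block: layer k sees the channel-wise concatenation
    of the input and all previous layers' outputs."""
    def __init__(self, in_ch, growth=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1)))
            ch += growth                     # channels grow by concatenation

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # feature reuse
        return x
```

Each layer adds only `growth` channels yet still sees every earlier feature map, the feature-reuse effect that lets dense networks match wider architectures with far fewer parameters.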
Fast View Synthesis with Deep Stereo Vision 2018-05-07
Show

Novel view synthesis is an important problem in computer vision and graphics. Over the years, a large number of solutions have been put forward, but the large-baseline novel view synthesis problem is far from being "solved". Recent works have attempted to use Convolutional Neural Networks (CNNs) for view synthesis tasks. Due to the difficulty of learning scene geometry and interpreting camera motion, CNNs are often unable to generate realistic novel views. In this paper, we present a novel view synthesis approach based on stereo vision and CNNs that decomposes the problem into two sub-tasks: view-dependent geometry estimation and texture inpainting. Both tasks are structured prediction problems that can be effectively learned with CNNs. Experiments on the KITTI Odometry dataset show that our approach is more accurate and significantly faster than the current state-of-the-art. The code and supplementary material will be made publicly available; results can be found at https://youtu.be/5pzS9jc-5t0

None
Fusion of stereo and still monocular depth estimates in a self-supervised learning context 2018-03-20
Show

We study how autonomous robots can learn by themselves to improve their depth estimation capability. In particular, we investigate a self-supervised learning setup in which stereo vision depth estimates serve as targets for a convolutional neural network (CNN) that transforms a single still image into a dense depth map. After training, the stereo and mono estimates are fused with a novel fusion method that preserves high-confidence stereo estimates while leveraging the CNN estimates in low-confidence regions (a simplified sketch follows the entry below). The main contribution of the article is to show that the fused estimates lead to higher performance than the stereo vision estimates alone. Experiments are performed on the KITTI dataset and on board a Parrot SLAMDunk, showing that even rather limited CNNs can help provide stereo vision equipped robots with more reliable depth maps for autonomous navigation.

To be published at ICRA 2018, 8 pages, 8 figures

None
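A simplified version of confidence-weighted fusion, with a left-right consistency check standing in for the stereo confidence (both the fusion rule and the confidence cue are assumptions here; the paper's method is more elaborate):

```python
import numpy as np

def fuse_depth(stereo, mono, confidence):
    """Blend stereo and monocular depth maps: trust stereo where its
    confidence is high, fall back to the CNN estimate elsewhere."""
    return confidence * stereo + (1.0 - confidence) * mono

def lr_consistency_conf(disp_left, disp_right, tol=1.0):
    """Binary stereo confidence from a left-right consistency check:
    reproject the right disparity into the left view and compare."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    x_right = np.clip((xs - disp_left).astype(int), 0, w - 1)
    diff = np.abs(disp_left - disp_right[np.arange(h)[:, None], x_right])
    return (diff < tol).astype(float)
```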
Cubic Range Error Model for Stereo Vision with Illuminators 2018-03-11
Show

Use of low-cost depth sensors, such as a stereo camera setup with illuminators, is of particular interest for numerous applications ranging from robotics and transportation to mixed and augmented reality. The ability to quantify noise is crucial for these applications, e.g., when the sensor is used for map generation or to develop a sensor scheduling policy in a multi-sensor setup. Range error models provide uncertainty estimates and help weigh the data correctly in instances where range measurements are taken from different vantage points or with different sensors. This weighing is important for fusing range data into a map in a meaningful way, i.e., so that high-confidence data is relied on most heavily. Such a model is derived in this work: we propose that the range error is cubic in range for stereo systems with integrated illuminators (a sketch of the scaling argument follows the entry below). Experimental validation with an off-the-shelf structured light stereo system shows that the exponent is between 2.4 and 2.6; the deviation is attributed to our model considering only shot noise. The experiments confirm the validity of the model and simplify the application of this type of sensor in robotics. The proposed error model is relevant to any stereo system in low ambient light where the main light source is located at the camera system; among others, this is the case for structured light stereo systems and night stereo systems with headlights.

6 pages, to be published at ICRA 2018

None
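A back-of-the-envelope version of the scaling argument under the shot-noise assumption stated in the abstract (a sketch, not the paper's derivation): triangulation gives $z = fb/d$ for focal length $f$ and baseline $b$, and an illuminator co-located with the cameras makes the disparity noise itself grow with range.

```latex
\sigma_z = \left|\frac{\partial z}{\partial d}\right| \sigma_d
         = \frac{z^2}{fb}\,\sigma_d ,
\qquad
\sigma_d \propto z
\quad\text{(irradiance} \propto 1/z^2,\ \text{shot-noise SNR} \propto 1/z\text{)}
\quad\Longrightarrow\quad
\sigma_z \propto \frac{z^3}{fb}.
```

Passive stereo under constant ambient light has roughly constant $\sigma_d$, which recovers the familiar quadratic error law; the extra factor of $z$ is specific to the co-located illuminator.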
Flexible Stereo: Constrained, Non-rigid, Wide-baseline Stereo Vision for Fixed-wing Aerial Platforms 2018-02-26
Show

This paper proposes a computationally efficient method to estimate the time-varying relative pose between two visual-inertial sensor rigs mounted on the flexible wings of a fixed-wing unmanned aerial vehicle (UAV). The estimated relative poses are used to generate highly accurate depth maps in real time and can be employed for obstacle avoidance in low-altitude flights or landing maneuvers. The approach is structured as follows: initially, a wing model is identified by fitting a probability density function to measured deviations from the nominal relative baseline transformation. At run time, the prior knowledge about the wing model is fused in an Extended Kalman Filter (EKF) together with relative pose measurements obtained from solving a relative perspective-n-point (PnP) problem, and the linear accelerations and angular velocities measured by the two inertial measurement units (IMUs) rigidly attached to the cameras. Results from extensive synthetic experiments demonstrate that our proposed framework is able to estimate highly accurate baseline transformations and depth maps.

Accepted for publication in IEEE International Conference on Robotics and Automation (ICRA), 2018, Brisbane

None
Automatic Tool Landmark Detection for Stereo Vision in Robot-Assisted Retinal Surgery 2017-11-20
Show

Computer vision and robotics are being increasingly applied in medical interventions, and they could make a particular difference in interventions where extreme precision is required. One such application is robot-assisted retinal microsurgery. In recent works, such interventions are conducted under a stereo-microscope and with a robot-controlled surgical tool. However, the complementarity of computer vision and robotics has not yet been fully exploited. To improve robot control, we are interested in 3D reconstruction of the anatomy and in automatic tool localization using a stereo microscope. In this paper, we solve this problem for the first time using a single pipeline, starting from uncalibrated cameras and reaching metric 3D reconstruction and registration in retinal microsurgery. The key ingredients of our method are: (a) surgical tool landmark detection, and (b) 3D reconstruction with the stereo microscope using the detected landmarks. To address the former, we propose a novel deep learning method that detects and recognizes keypoints in high-definition images at faster than real-time speed. We use the detected 2D keypoints along with their corresponding 3D coordinates obtained from the robot sensors to calibrate the stereo microscope using an affine projection model. We design an online 3D reconstruction pipeline that makes use of smoothness constraints and performs robot-to-camera registration. The entire pipeline is extensively validated on open-sky porcine eye sequences, and quantitative and qualitative results are presented for all steps.

Accepted in Robotics and Automation Letters (RA-L). Project page: http://www.vision.ee.ethz.ch/~kmaninis/keypoints2stereo/index.html

None
Markerless visual servoing on unknown objects for humanoid robot platforms 2017-10-12
Show

To precisely reach for an object with a humanoid robot, it is of central importance to have good knowledge of the end-effector pose and of the object's pose and shape. In this work we propose a framework for markerless visual servoing on unknown objects, divided into four main parts: I) a least-squares minimization problem is formulated to find the volume of the object graspable by the robot's hand using its stereo vision; II) a recursive Bayesian filtering technique, based on Sequential Monte Carlo (SMC) filtering, estimates the 6D pose (position and orientation) of the robot's end-effector without the use of markers; III) a nonlinear constrained optimization problem is formulated to compute the desired graspable pose about the object; IV) an image-based visual servo control commands the robot's end-effector toward the desired pose. We demonstrate the effectiveness and robustness of our approach with extensive experiments on the iCub humanoid robot platform, achieving real-time computation, smooth trajectories and sub-pixel precision.

None
Automatic Extrinsic Calibration for Lidar-Stereo Vehicle Sensor Setups 2017-07-27
Show

Sensor setups consisting of a combination of 3D range scanner lasers and stereo vision systems are becoming a popular choice for on-board perception systems in vehicles; however, the combined use of both sources of information implies a tedious calibration process. We present a method for the extrinsic calibration of lidar-stereo camera pairs without user intervention. Our calibration approach is designed to cope with the constraints commonly found in automotive setups, such as low resolution and specific sensor poses. To demonstrate the performance of our method, we also introduce a novel approach for the quantitative assessment of calibration results, based on a simulation environment. Tests using real devices have been conducted as well, proving the usability of the system and the improvement over existing approaches. Code is available at http://wiki.ros.org/velo2cam_calibration

Accepted to IEEE International Conference on Intelligent Transportation Systems 2017 (ITSC)

None
Coarse-to-Fine Lifted MAP Inference in Computer Vision 2017-07-22
Show

There is a vast body of theoretical research on lifted inference in probabilistic graphical models (PGMs). However, few demonstrations exist where lifting is applied in conjunction with state-of-the-art applied algorithms. We pursue the applicability of lifted inference for computer vision (CV), with the insight that a globally optimal (MAP) labeling will likely assign the same label to two symmetric pixels. The success of our approach lies in efficiently handling a distinct unary potential on every node (pixel), which is typical of CV applications. This allows us to lift the large class of algorithms that model a CV problem via PGM inference. We propose a generic template for coarse-to-fine (C2F) inference in CV, which progressively refines an initial coarsely lifted PGM for varying quality-time trade-offs. We demonstrate the performance of C2F inference by developing lifted versions of two near-state-of-the-art CV algorithms for stereo vision and interactive image segmentation. We find that, compared with the flat algorithms, the lifted versions have much superior anytime performance, without any loss in final solution quality.

Published in IJCAI 2017

None
Single-Shot Clothing Category Recognition in Free-Configurations with Application to Autonomous Clothes Sorting 2017-07-22
Show

This paper proposes a single-shot approach for recognising clothing categories from 2.5D features. We propose two visual features, BSP (B-Spline Patch) and TSD (Topology Spatial Distances) for this task. The local BSP features are encoded by LLC (Locality-constrained Linear Coding) and fused with three different global features. Our visual feature is robust to deformable shapes and our approach is able to recognise the category of unknown clothing in unconstrained and random configurations. We integrated the category recognition pipeline with a stereo vision system, clothing instance detection, and dual-arm manipulators to achieve an autonomous sorting system. To verify the performance of our proposed method, we build a high-resolution RGBD clothing dataset of 50 clothing items of 5 categories sampled in random configurations (a total of 2,100 clothing samples). Experimental results show that our approach is able to reach 83.2% accuracy while classifying clothing items which were previously unseen during training. This advances beyond the previous state-of-the-art by 36.2%. Finally, we evaluate the proposed approach in an autonomous robot sorting system, in which the robot recognises a clothing item from an unconstrained pile, grasps it, and sorts it into a box according to its category. Our proposed sorting system achieves reasonable sorting success rates with single-shot perception.

9 pages, accepted by IROS2017

None
Efficient Optical flow and Stereo Vision for Velocity Estimation and Obstacle Avoidance on an Autonomous Pocket Drone 2017-03-14
Show

Miniature Micro Aerial Vehicles (MAVs) are very suitable for flying in indoor environments, but autonomous navigation is challenging due to their strict hardware limitations. This paper presents a highly efficient computer vision algorithm called Edge-FS for determining velocity and depth. It runs at 20 Hz on a 4 g stereo camera with an embedded STM32F4 microprocessor (168 MHz, 192 kB) and uses feature histograms to calculate optical flow and stereo disparity. The stereo-based distance estimates are used to scale the optical flow in order to retrieve the drone's velocity (a simplified sketch follows the entry below). The velocity and depth measurements are used for the fully autonomous flight of a 40 g pocket drone relying only on onboard sensors. The method allows the MAV to control its velocity and avoid obstacles.

7 pages, 10 figures, Published at IEEE Robotics and Automation Letters

None
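In its simplest form, the flow-scaling step converts pixel flow into metric velocity using depth and focal length. A sketch under simplifying assumptions (pure translation, pinhole camera, rotation compensation omitted; the names and the plain averaging are illustrative):

```python
import numpy as np

def velocity_from_flow(flow, depth, f_px, dt):
    """Estimate lateral/vertical velocity in m/s from optical flow.

    flow  : (..., 2) pixel displacements between consecutive frames
    depth : (...) stereo depth in metres at the same points
    f_px  : focal length in pixels
    dt    : frame interval in seconds
    """
    v_xy = flow * depth[..., None] / (f_px * dt)   # metres per second
    return v_xy.reshape(-1, 2).mean(axis=0)        # a robust mean in practice
```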
Free-Space Detection with Self-Supervised and Online Trained Fully Convolutional Networks 2017-01-05
Show

Recently, vision-based Advanced Driver Assist Systems have gained broad interest. In this work, we investigate free-space detection, for which we propose to employ a Fully Convolutional Network (FCN). We show that this FCN can be trained in a self-supervised manner and achieve results similar to training on manually annotated data, thereby reducing the need for large manually annotated training sets. To this end, our self-supervised training relies on a stereo-vision disparity system to automatically generate (weak) training labels for the color-based FCN. Additionally, our self-supervised training facilitates online training of the FCN instead of offline. Consequently, given that the applied FCN is relatively small, the free-space analysis becomes highly adaptive to any traffic scene that the vehicle encounters. We have validated our algorithm using publicly available data and on a new challenging benchmark dataset that is released with this paper. Experiments show that online training boosts performance by 5% compared to offline training, for both Fmax and AP.

version as accepted at IS&T Electronic Imaging - Autonomous Vehicles and Machines Conference (San Francisco USA, January 2017); updated with two additional robustness experiments and formatted in conference style; 8 pages, public data available

None
Obstacle Avoidance Strategy using Onboard Stereo Vision on a Flapping Wing MAV 2017-01-02
Show

The development of autonomous lightweight MAVs, capable of navigating in unknown indoor environments, is one of the major challenges in robotics. The complexity of this challenge comes from the constraints on the weight and power consumption of onboard sensing and processing devices. In this paper we propose the "Droplet" strategy, an avoidance strategy based on stereo vision inputs that outperforms reactive avoidance strategies by allowing constant-speed maneuvers while being computationally extremely efficient, and which does not need to store previous images or maps. The strategy deals with the nonholonomic motion constraints of most fixed-wing and flapping-wing platforms, and with the limited field of view of stereo camera systems. It guarantees obstacle-free flight in the absence of sensor and motor noise. We first analyze the strategy in simulation, and then show its robustness in real-world conditions by implementing it on a 20-gram flapping wing MAV.

None
Light Field Stitching for Extended Synthetic Aperture 2016-11-15
Show

Through capturing the spatial and angular radiance distribution, light field cameras introduce new capabilities that are not possible with conventional cameras. So far, the light field imaging literature has focused on the theory and applications of single light field capture. By combining multiple light fields, it is possible to obtain new capabilities and enhancements, and even to exceed physical limitations such as the spatial resolution and aperture size of the imaging device. In this paper, we present an algorithm to register and stitch multiple light fields. We utilize the regularity of the spatial and angular sampling in light field data, and extend techniques developed for stereo vision systems to light field data. Such an extension is not straightforward for a micro-lens array (MLA) based light field camera due to the extremely small baseline and low spatial resolution. By merging multiple light fields captured by an MLA based camera, we obtain a larger synthetic aperture, which results in improved light field capabilities, such as increased depth estimation range/accuracy and a wider perspective shift range.

None
Robot Vision Architecture for Autonomous Clothes Manipulation 2016-10-18
Show

This paper presents a novel robot vision architecture for perceiving generic 3D clothes configurations. Our architecture is hierarchically structured, starting from low-level curvatures, across mid-level geometric shapes and topology descriptions, and finally reaching high-level semantic surface structure descriptions. We demonstrate our robot vision architecture on a customised dual-arm industrial robot with our self-designed, off-the-shelf stereo vision system, carrying out autonomous grasping and dual-arm flattening. It is worth noting that the proposed dual-arm flattening approach is unique among state-of-the-art autonomous robot systems, which is the major contribution of this paper. The experimental results show that the proposed dual-arm flattening using the stereo vision system remarkably outperforms single-arm flattening and the widely-cited Kinect-based sensing system for dexterous manipulation tasks. In addition, the proposed grasping approach achieves satisfactory performance when grasping various kinds of garments, verifying the capability of the proposed visual perception architecture to be adapted to more than one clothing manipulation task.

14 pages, under review

None
Lost and Found: Detecting Small Road Hazards for Self-Driving Vehicles 2016-09-15
Show

Detecting small obstacles on the road ahead is a critical part of the driving task which has to be mastered by fully autonomous cars. In this paper, we present a method based on stereo vision to reliably detect such obstacles from a moving vehicle. The proposed algorithm performs statistical hypothesis tests in disparity space directly on stereo image data, assessing free-space and obstacle hypotheses on independent local patches (a toy version of such a test follows the entry below). This detection approach does not depend on a global road model and handles both static and moving obstacles. For evaluation, we employ a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of obstacle and free-space, and provide a thorough comparison to several stereo-based baseline methods. The dataset will be made available to the community to foster further research on this important topic. The proposed approach outperforms all considered baselines in our evaluations on both the pixel and object level and runs at frame rates of up to 20 Hz on 2-megapixel stereo imagery. Small obstacles down to a height of 5 cm can be detected successfully at 20 m distance at low false positive rates.

To be presented at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2016

None
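A toy illustration of a free-space vs. obstacle hypothesis test on a local disparity patch (the paper's test statistics are more refined; the Gaussian noise model, `sigma` and `alpha` are assumptions):

```python
import numpy as np
from scipy.stats import norm

def patch_is_obstacle(disp_patch, disp_road, sigma=0.5, alpha=0.01):
    """One-sided test: under H0 (free space) the measured disparities
    scatter around the road disparity; an obstacle, being nearer, raises
    them systematically."""
    resid = disp_patch - disp_road                  # residual vs. road model
    z = resid.mean() / (sigma / np.sqrt(resid.size))
    p = 1.0 - norm.cdf(z)                           # obstacles push z upward
    return p < alpha                                # True -> flag as obstacle
```

Running such a test independently on small patches is what removes the need for a global road model: each patch carries its own accept/reject decision.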
Color Homography Color Correction 2016-08-01
Show

Homographies -- a mathematical formalism for relating image points across different camera viewpoints -- are at the foundation of geometric methods in computer vision and are used in tasks such as geometric camera calibration, image registration, and stereo vision. In this paper, we show the surprising result that colors across a change in viewing condition (changing light color, shading and camera) are also related by a homography. We propose a new color correction method based on color homography (a rough sketch of the fitting idea follows the entry below). Experiments demonstrate that solving the color homography problem leads to more accurate calibration.

Accepted by Color Imaging Conference 2016

None
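One way to fit such a relation is alternating least squares over a 3x3 matrix and per-pixel shading scales, i.e. the model dst ≈ D · src · Hᵀ with D diagonal. A rough NumPy sketch of that idea (the paper's exact solver and normalisation may differ):

```python
import numpy as np

def fit_color_homography(src_rgb, dst_rgb, n_iters=10):
    """Alternating least squares for a 3x3 color homography H relating
    corresponding (N, 3) RGB arrays up to per-pixel shading factors D."""
    D = np.ones(len(src_rgb))
    H = np.eye(3)
    for _ in range(n_iters):
        # 1) Solve for H given the shading: (D * src) @ H.T ~= dst
        A = src_rgb * D[:, None]
        H = np.linalg.lstsq(A, dst_rgb, rcond=None)[0].T
        # 2) Update shading given H: best scale of each mapped color
        mapped = src_rgb @ H.T
        D = (mapped * dst_rgb).sum(1) / np.maximum((mapped**2).sum(1), 1e-12)
    return H
```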
Using Self-Contradiction to Learn Confidence Measures in Stereo Vision 2016-04-18
Show

Learned confidence measures gain increasing importance for outlier removal and quality improvement in stereo vision. However, acquiring the necessary training data is typically a tedious and time consuming task that involves manual interaction, active sensing devices and/or synthetic scenes. To overcome this problem, we propose a new, flexible, and scalable way for generating training data that only requires a set of stereo images as input. The key idea of our approach is to use different view points for reasoning about contradictions and consistencies between multiple depth maps generated with the same stereo algorithm. This enables us to generate a huge amount of training data in a fully automated manner. Among other experiments, we demonstrate the potential of our approach by boosting the performance of three learned confidence measures on the KITTI2012 dataset by simply training them on a vast amount of automatically generated training data rather than a limited amount of laser ground truth data.

This paper was accepted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. The copyright was transferred to IEEE (https://www.ieee.org). The official version of the paper will be made available on IEEE Xplore (http://ieeexplore.ieee.org). This version of the paper also contains the supplementary material, which will not appear on IEEE Xplore.

None
Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance 2016-03-25
Show

Self-Supervised Learning (SSL) is a reliable learning mechanism in which a robot uses an original, trusted sensor cue for training to recognize an additional, complementary sensor cue. We study for the first time in SSL how a robot's learning behavior should be organized so that the robot can keep performing its task if the original cue becomes unavailable. We study this persistent form of SSL in the context of a flying robot that has to avoid obstacles based on distance estimates from the visual cue of stereo vision. Over time it learns to also estimate distances based on monocular appearance cues. A strategy is introduced in which the robot switches from stereo-vision-based flight to monocular flight, with stereo vision used purely as 'training wheels' to avoid imminent collisions. This strategy is shown to be an effective approach to the 'feedback-induced data bias' problem, as also experienced in learning from demonstration. Both simulations and real-world experiments with a stereo-vision-equipped AR.Drone 2.0 show the feasibility of this approach, with the robot successfully using monocular vision to avoid obstacles in a 5 x 5 m room. The experiments show the potential of persistent SSL as a robust learning approach to enhance the capabilities of robots. Moreover, the abundant training data coming from the robot's own sensors makes it possible to gather the large data sets necessary for deep learning approaches.

None
Design and Analysis of a Single-Camera Omnistereo Sensor for Quadrotor Micro Aerial Vehicles (MAVs) 2015-10-03
Show

We describe the design and 3D sensing performance of an omnidirectional stereo-vision system (omnistereo) as applied to Micro Aerial Vehicles (MAVs). The proposed omnistereo model employs a monocular camera that is co-axially aligned with a pair of hyperboloidal mirrors (a folded catadioptric configuration). We show that this arrangement is practical for performing stereo vision when mounted on top of propeller-based MAVs characterized by low payloads. The theoretical single viewpoint (SVP) constraint helps us derive analytical solutions for the sensor's projective geometry and generate SVP-compliant panoramic images to compute 3D information from stereo correspondences (in a truly synchronous fashion). We perform an extensive analysis of various system characteristics such as size, catadioptric spatial resolution, and field of view. In addition, we pose a probabilistic model for uncertainty estimation of the depth from triangulation for skew back-projection rays. We expect to motivate the reproducibility of our solution since it can be adapted (optimally) to other catadioptric-based omnistereo vision applications.

49 pages, 22 figures, journal article draft

None
Behaviour Trees for Evolutionary Robotics 2015-08-07
Show

Evolutionary Robotics allows robots with limited sensors and processing to tackle complex tasks by means of sensory-motor coordination. In this paper we show the first application of the Behaviour Tree framework to a real robotic platform using the Evolutionary Robotics methodology. This framework is used to improve the intelligibility of the emergent robotic behaviour as compared to the traditional Neural Network formulation. As a result, the behaviour is easier to comprehend and manually adapt when crossing the reality gap from simulation to reality. This functionality is shown by performing real-world flight tests with the 20-gram DelFly Explorer flapping wing Micro Air Vehicle equipped with a 4-gram onboard stereo vision system. The experiments show that the DelFly can fully autonomously search for and fly through a window with only its onboard sensors and processing. The success rate of the optimised behaviour in simulation is 88% and the corresponding real-world performance is 54% after user adaptation. Although this leaves room for improvement, it is higher than the 46% success rate from a tuned user-defined controller.

Preprint version of article accepted for publication in Artificial Life, MIT Press. http://www.mitpressjournals.org/loi/artl

None
Detection of Non-Stationary Photometric Perturbations on Projection Screens 2014-11-23
Show

Interfaces based on projection screens have become increasingly popular in recent years, mainly due to the large screen size and resolution they provide, as well as their stereo-vision capabilities. This work presents a local method for the real-time detection of non-stationary photometric perturbations in projected images by means of computer vision techniques. The method is based on computing differences between the images in the projector's frame buffer and the corresponding images of the projection screen observed by the camera. It is robust to spatial variations in the intensity of light emitted by the projector on the projection surface, and also to stationary photometric perturbations caused by external factors. Moreover, we describe the experiments carried out to show the reliability of the method.

20 pages, Journal of Research and Practice in Information Technology, vol. 44, num. 4, 2012

None
Mobility Enhancement for Elderly 2014-10-21
Show

Loss of mobility is a common handicap for senior citizens. It denies them the ease of movement they would like to have, for outdoor visits, movement in hospitals and social outings, and, more seriously, in the day-to-day routine functions necessary for living. Trying to overcome this handicap by means of domestic help and simple wheelchairs is not only costly in the long run, but leaves the senior citizen at the mercy of the sincerity of domestic helpers, with a consequent loss of dignity. A dignified life requires that mobility be at the complete discretion, will and control of the senior citizen. This can be provided only by a reasonably sophisticated and versatile wheelchair, offering enhanced vision and hearing through a man-machine interface, together with sensor-aided navigation and control. More often than not, senior people have poor vision, which makes visual judgement difficult and so calls for the use of artificial intelligence in visual image analysis and guided navigation systems. In this project, we address two important mobility-enhancement features: audio command, and vision-aided obstacle detection and navigation. We implement a speech recognition algorithm that matches voice commands against a template of stored words, which frees the user from needing an agile hand to operate a joystick or mouse. We also develop a new appearance-based obstacle detection system using stereo-vision cameras, which estimates the distance of the nearest obstacle to the wheelchair and takes the necessary action. This helps the user make better route judgements and navigate around obstacles. The main challenge in this project is navigating in an unknown or unfamiliar environment while avoiding obstacles.

Masters thesis, Indian Institute of Technology Delhi

None
Intelligent Indoor Mobile Robot Navigation Using Stereo Vision 2014-09-10
Show

The majority of existing robot navigation systems, which rely on laser range finders, sonar sensors or artificial landmarks, have the ability to locate themselves in an unknown environment and then build a map of it. Stereo vision, while still a rapidly developing technique in the field of autonomous mobile robots, is currently less preferred due to its high implementation cost. This paper describes an experimental approach to building a stereo vision system that helps robots avoid obstacles and navigate through indoor environments while remaining cost-effective. It discusses fusion techniques for stereo vision and ultrasound sensors that enable successful navigation through different types of complex environments. The ultrasound data enables the robot to create a two-dimensional topological map of an unknown environment, while the stereo vision system builds a three-dimensional model of the same environment.

9 pages, SIPIJ August 2014

None
Pushbroom Stereo for High-Speed Navigation in Cluttered Environments 2014-07-26
Show

We present a novel stereo vision algorithm that is capable of obstacle detection on a mobile-CPU processor at 120 frames per second. Our system performs a subset of standard block-matching stereo processing, searching only for obstacles at a single depth (a sketch of this single-disparity search follows the entry below). By using an onboard IMU and state estimator, we can recover the position of obstacles at all other depths, building and updating a full depth map at framerate. Here, we describe both the algorithm and our implementation on a high-speed, small UAV flying at over 20 MPH (9 m/s) close to obstacles. The system requires no external sensing or computation and is, to the best of our knowledge, the first high-framerate stereo detection system running onboard a small UAV.

None
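The core trick is to run block matching at one disparity only, so that every match corresponds to the single depth plane z = f·b/d. A minimal sketch (block size and threshold are hypothetical, and the original system adds further sanity checks, e.g. requiring local texture so that flat regions do not match trivially):

```python
import numpy as np

def single_disparity_obstacles(left, right, d, block=8, sad_thresh=400):
    """Flag blocks whose left/right patches agree at disparity d, i.e.
    candidate obstacles lying on the depth plane z = f*b/d."""
    h, w = left.shape
    hits = []
    for y in range(0, h - block, block):
        for x in range(d, w - block, block):
            sad = np.abs(left[y:y + block, x:x + block].astype(int) -
                         right[y:y + block, x - d:x - d + block].astype(int)).sum()
            if sad < sad_thresh:
                hits.append((x, y))          # obstacle candidate at that plane
    return hits
```

Detections at the single plane are then propagated to other depths by integrating the vehicle's own motion from the IMU and state estimator, as the abstract describes.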
Multi Modal Face Recognition Using Block Based Curvelet Features 2014-05-21
Show

In this paper, we present a multimodal 2D+3D face recognition method using block-based curvelet features. The 3D surface of the face (a depth map) is computed from stereo face images using a stereo vision technique. Statistical measures such as mean, standard deviation, variance and entropy are extracted from each block of each curvelet subband, for the depth and intensity images independently. To compute the decision score, a KNN classifier is employed independently for the intensity image and the depth map. The computed decision scores of the intensity image and depth map are then combined at the decision level to improve the face recognition rate. The combination of intensity and depth is verified experimentally on a benchmark face database. The experimental results show that the proposed multimodal method outperforms either individual modality.

17 pages, 5 Figures None
Convex Relaxations of SE(2) and SE(3) for Visual Pose Estimation 2014-04-06
Show

This paper proposes a new method for rigid body pose estimation based on spectrahedral representations of the tautological orbitopes of $SE(2)$ and $SE(3)$. The approach can use dense point cloud data from stereo vision or an RGB-D sensor (such as the Microsoft Kinect), as well as visual appearance data. The method is a convex relaxation of the classical pose estimation problem, based on explicit linear matrix inequality (LMI) representations of the convex hulls of $SE(2)$ and $SE(3)$. Given these representations, the relaxed pose estimation problem can be framed as a robust least squares problem with the optimization variable constrained to these convex sets (a simplified $SE(2)$ sketch follows the entry below). Although this formulation is a relaxation of the original problem, numerical experiments indicate that it is indeed exact -- i.e. its solution is a member of $SE(2)$ or $SE(3)$ -- in many interesting settings. We additionally show that this method is guaranteed to be exact for a large class of pose estimation problems.

ICRA 2014 Preprint None
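For $SE(2)$ the convex hull of the rotations has a particularly simple description: matrices of the form [[a, -b], [b, a]] with a² + b² ≤ 1. A minimal CVXPY sketch of the relaxed least-squares pose problem built on that fact (a simplified stand-in for the paper's LMI formulation; the robust loss and the $SE(3)$ case are richer):

```python
import cvxpy as cp
import numpy as np

def relaxed_se2_pose(P, Q):
    """Convex relaxation of 2-D rigid registration Q ~= R P + t:
    SO(2) (a^2 + b^2 = 1) is relaxed to its convex hull, the unit disk."""
    a, b = cp.Variable(), cp.Variable()
    t = cp.Variable(2)
    R = cp.bmat([[a, -b], [b, a]])
    ones = np.ones((len(P), 1))
    resid = Q - P @ R.T - ones @ cp.reshape(t, (1, 2))
    prob = cp.Problem(cp.Minimize(cp.sum_squares(resid)),
                      [cp.norm(cp.hstack([a, b])) <= 1])
    prob.solve()
    return np.array([[a.value, -b.value], [b.value, a.value]]), t.value
```

When the data are consistent with a rigid motion, the optimum typically lands on the boundary a² + b² = 1, i.e. on $SE(2)$ itself, matching the exactness the paper observes.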
Efficiently Searching for Frustrated Cycles in MAP Inference 2012-10-16
Show

Dual decomposition provides a tractable framework for designing algorithms to find the most probable (MAP) configuration in graphical models. However, for many real-world inference problems, the typical decomposition has a large integrality gap due to frustrated cycles. One way to tighten the relaxation is to introduce additional constraints that explicitly enforce cycle consistency. Earlier work showed that cluster-pursuit algorithms, which iteratively introduce cycle and other higher-order consistency constraints, allow one to exactly solve many hard inference problems. However, these algorithms explicitly enumerate a candidate set of clusters, limiting them to triplets or other short cycles. We solve the search problem for cycle constraints, giving a nearly linear time algorithm for finding the most frustrated cycle of arbitrary length. We show how to use this search algorithm together with the dual decomposition framework and cluster-pursuit. The new algorithm exactly solves MAP inference problems arising from relational classification and stereo vision.

Appears in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI2012)

None
Efficient Selection of Disambiguating Actions for Stereo Vision 2012-06-27
Show

In many domains that involve the use of sensors, such as robotics or sensor networks, there are opportunities to use some form of active sensing to disambiguate data from noisy or unreliable sensors. These disambiguating actions typically take time and expend energy. One way to choose the next disambiguating action is to select the action with the greatest expected entropy reduction, or information gain. In this work, we consider active sensing in aid of stereo vision for robotics. Stereo vision is a powerful sensing technique for mobile robots, but it can fail in scenes that lack strong texture. In such cases, a structured light source, such as a vertical laser line, can be used for disambiguation. By treating the stereo matching problem as a specially structured HMM-like graphical model, we demonstrate that for a scan line with n columns and maximum stereo disparity d, the entropy-minimizing aim point for the laser can be selected in O(nd) time -- a cost no greater than that of the stereo algorithm itself. In contrast, a typical HMM formulation would suggest at least O(nd^2) time for the entropy calculation alone.

Appears in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006)

None
Picture Collage with Genetic Algorithm and Stereo vision 2011-11-29
Show

In this paper, a salient region extraction method for creating picture collages based on stereo vision is proposed. A picture collage is a kind of visual image summary that arranges all input images on a given canvas, allowing overlay, so as to maximize the visible visual information. The salient regions of each image are first extracted and represented as a depth map. The output picture collage shows as many visible salient regions (not overlaid by others) from all images as possible. A very efficient genetic algorithm is used for the optimization. The experimental results show the superior performance of the proposed method.

None
Competitive Analysis of Minimum-Cut Maximum Flow Algorithms in Vision Problems 2010-10-18
Show

Rapid advances in image acquisition and storage technology underline the need for algorithms that are capable of solving large-scale image processing and computer-vision problems. The minimum cut problem plays an important role in processing many of these imaging problems, such as image and video segmentation, stereo vision, multi-view reconstruction and surface fitting. While several min-cut/max-flow algorithms can be found in the literature, their performance in practice has been studied primarily outside the scope of computer vision. We present here the results of a comprehensive computational study, in terms of execution times and memory utilization, of four recently published algorithms which optimally solve the *s-t* cut and maximum flow problems: (i) Goldberg's and Tarjan's *push-relabel*; (ii) Hochbaum's *pseudoflow*; (iii) Boykov's and Kolmogorov's *augmenting paths*; and (iv) Goldberg's *partial augment-relabel*. Our results demonstrate that Hochbaum's *pseudoflow* algorithm is faster and utilizes less memory than the other algorithms on all problem instances investigated.

None
A Machine Learning Approach to Recovery of Scene Geometry from Images 2010-07-17
Show

Recovering the 3D structure of a scene from images yields useful information for tasks such as shape and scene recognition, object detection, and motion planning and object grasping in robotics. In this thesis, we introduce a general machine learning approach called unsupervised CRF learning, based on maximizing the conditional likelihood. We apply our approach to computer vision systems that recover 3-D scene geometry from images. We focus on recovering 3D geometry from single images, stereo pairs and video sequences. Building these systems requires algorithms for doing inference as well as for learning the parameters of conditional Markov random fields (MRFs). Our system is trained in an unsupervised manner, without ground-truth labeled data. We employ a slanted-plane stereo vision model in which a fixed over-segmentation is used to segment the left image into coherent regions called superpixels, and a disparity plane is then assigned to each superpixel. Plane parameters are estimated by solving an MRF labelling problem, through minimizing an energy function. We demonstrate the use of our unsupervised CRF learning algorithm for a parameterized slanted-plane stereo vision model involving shape-from-texture cues. Our stereo model with texture cues, trained only unsupervised, outperforms the results of related work on the same stereo dataset. In this thesis, we also formulate structure and motion estimation as an energy minimization problem, in which the model is an extension of our slanted-plane stereo vision model that also handles surface velocity. Velocity estimation is achieved by solving an MRF labeling problem using Loopy BP. Performance analysis is done using our novel evaluation metrics based on the notion of view prediction error. Experiments on road-driving stereo sequences show encouraging results.

None
A $p$-adic RanSaC algorithm for stereo vision using Hensel lifting 2009-11-03
Show

A $p$-adic variation of the Ran(dom) Sa(mple) C(onsensus) method for solving the relative pose problem in stereo vision is developed. From two 2-adically encoded images a random sample of five pairs of corresponding points is taken, and the equations for the essential matrix are solved by lifting solutions modulo 2 to the 2-adic integers. A recently devised $p$-adic hierarchical classification algorithm imitating the known LBG quantisation method classifies the solutions for all the samples after determining the number of clusters using the known intra-inter validity of clusterings. In the successful case, a cluster ranking determines the cluster containing a 2-adic approximation to the "true" solution of the problem.

15 pages; typos removed, abstract changed, computation error removed

None
Free actions and Grassmanian variety 2009-03-11
Show

An algebraic notion of representational consistency is defined. A theorem relating it to free actions is proved. A metrizability problem of the quotient (a shape space) is discussed. This leads to a new algebraic variety with a metrizability result. A concrete example is given from stereo vision.

fixed matrices lost in latex and numbered equations

None