Dalle immagini allo spazio 3D: il ruolo dei punti chiave semantici per la percezione 3D (From images to 3D space: the role of semantic keypoints in 3D perception)
Abstract: One of the goals of the Computer Vision community is to understand human 3D perception through 2D representations such as images and videos. Extracting robust 3D insights from these sources is a significant challenge. This dissertation focuses on keypoint-based 3D representations, exploring their application in different real-world scenarios. Unlike traditional pointwise feature descriptors such as ORB or SIFT, semantic keypoints establish correspondences between specific 3D points belonging to a rigid or articulated object. Recent advances in Deep Learning, particularly in 2D keypoint detection, have paved the way for addressing complex 3D vision problems. This thesis demonstrates the application of these methods in autonomous driving and video surveillance, showcasing their robustness and precision in bridging the gap between the 2D image plane and the 3D world.

In the automotive context, our investigation centers on novel view synthesis and 3D reconstruction of vehicles in urban scenes. A 3D representation of a vehicle in a scene can be valuable for traffic analysis and accident prevention. To achieve this, we design a method that leverages a 2D keypoint localization network to augment visual features for accurate classification of 3D vehicle models. Building on this robust classification, we study how to improve the generation of synthetic vehicle images from unseen novel views through a deep learning pipeline trained on a collection of single-view images. Additionally, to explore more sophisticated techniques for 3D object reconstruction from images, we introduce a deep learning architecture capable of reconstructing objects across multiple categories; this approach is trained on a dataset of single-view images and operates by deforming explicit 3D representations.

The second research area focuses on predicting the 3D skeletons of both humans and robots observed from an external viewpoint, such as a video surveillance camera. Here, keypoints are integrated into the definition of a skeleton, represented as a graph of semantic points. Our initial focus is on the robotics domain, where an intelligent system for predicting 3D skeletons can be crucial for safety in collaborative environments shared by humans and robots. Given the challenges of obtaining real datasets in robotics, we emphasize the role of simulation: our approach involves collecting both synthetic and real datasets and addressing the 3D pose estimation task through a double heatmap-based representation. We investigate the domain gap between synthetic and real data, exploiting depth maps to enhance accuracy. By introducing temporal cues, our pipeline adopts the novel Pose Nowcasting paradigm, in which predicting future poses serves as an auxiliary task that refines the precision of the current pose estimate. Shifting to the human scenario, we propose a pose refinement framework based on depth map analysis. In parallel, our investigation extends to Human-Computer Interaction, where we present an unsupervised method for detecting and classifying dynamic hand gestures using data from a motion tracking sensor.

This thesis seeks to make a valuable contribution at the intersection of 3D Computer Vision and Deep Learning across various domains. Following an overview of the state of the art in 3D reconstruction and 3D pose estimation, we present our proposed methods with a comprehensive technical explanation, supported by a detailed experimental investigation conducted on benchmark datasets widely acknowledged in the literature.
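As an illustrative aside, the following is a minimal sketch, in PyTorch, of how 2D keypoint coordinates are conventionally decoded from the per-keypoint heatmaps produced by a localization network of the kind the abstract mentions. This is not the thesis code: the function name, tensor shapes, and the argmax decoding are assumptions about a standard heatmap-based formulation.

```python
# Hypothetical sketch: decoding 2D keypoint coordinates from
# per-keypoint heatmaps (a standard heatmap-based representation).
import torch

def decode_keypoints(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, K, H, W), one channel per semantic keypoint.
    Returns (B, K, 2) pixel coordinates (x, y) of each channel's peak."""
    b, k, h, w = heatmaps.shape
    flat = heatmaps.view(b, k, -1)          # flatten the spatial grid
    idx = flat.argmax(dim=-1)               # index of the hottest pixel
    ys = torch.div(idx, w, rounding_mode="floor").float()  # peak row
    xs = (idx % w).float()                  # peak column
    return torch.stack((xs, ys), dim=-1)    # (B, K, 2)

# Example: 12 semantic vehicle keypoints on a 64x64 heatmap grid.
coords = decode_keypoints(torch.rand(1, 12, 64, 64))
print(coords.shape)  # torch.Size([1, 12, 2])
```

A soft-argmax (expectation over the softmax-normalized heatmap) is a common differentiable alternative when the decoded coordinates must take part in gradient-based training.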
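Similarly, the Pose Nowcasting paradigm described above can be sketched as a multi-task objective in which future poses are supervised only to regularize the current estimate. The loss weighting, tensor shapes, and L1 choice below are illustrative assumptions, not the dissertation's actual formulation.

```python
# Hypothetical sketch of a nowcasting-style objective: the future-pose
# branch acts as an auxiliary task that sharpens the current estimate.
import torch
import torch.nn.functional as F

def nowcasting_loss(pred_now: torch.Tensor, gt_now: torch.Tensor,
                    pred_future: torch.Tensor, gt_future: torch.Tensor,
                    aux_weight: float = 0.5) -> torch.Tensor:
    """pred_now/gt_now: (B, J, 3) current 3D joints;
    pred_future/gt_future: (B, T, J, 3) joints for T future steps."""
    loss_now = F.l1_loss(pred_now, gt_now)            # primary task
    loss_future = F.l1_loss(pred_future, gt_future)   # auxiliary task
    return loss_now + aux_weight * loss_future
```

At inference time only the current-pose branch would be needed; the future branch exists to inject temporal cues during training.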
Citation:
Simoni, Alessandro. "Dalle immagini allo spazio 3D: il ruolo dei punti chiave semantici per la percezione 3D." Dissertation, 2024.