Perceive, Reason, Act: the New Frontier of Embodied AI
Abstract: This thesis contributes to the field of Embodied Artificial Intelligence. Embodied AI is a novel research topic at the intersection of Computer Vision and Robotics that takes advantage of recent findings on Deep Neural Networks. Empowered by the so-called "deep revolution", we strive to create intelligent agents able to perceive the world, reason about spatio-temporal relationships, and act to reach a pre-defined goal. First, we need to identify a proper strategy to tackle such a complex topic, which entails time series and long-term dependencies on the one hand and multiple input modalities on the other. We distinguish three different problems that we need to address to build an intelligent agent. We start from the problem of long-term dependencies and sequence modeling, as the agent needs to process data coming from a sequence of time steps that acts as previous experience. Then, we consider and tackle a first, simple form of interaction with an unknown environment: exploration. In this way, we combine visual and spatial reasoning to perform simple actions such as in-place rotations and moving forward. Finally, we study how to incorporate natural language instructions to guide the agent's navigation towards a goal. Language then becomes a natural interface to communicate with the agent, paving the way to future research and applications. This thesis presents a step-by-step analysis of these features that any intelligent agent should possess. While doing so, we provide a comprehensive overview of the field, theoretical foundations for Embodied AI, state-of-the-art datasets and benchmarks, and practical indications regarding the deployment of the resulting agent in the real world. In the first part of this thesis, we discuss Recurrent Neural Networks (RNNs), the most common approach when dealing with time series. In particular, Long Short-Term Memory (LSTM) is the de-facto standard for many tasks involving sequential inputs and long-term dependencies.
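To make the role of LSTM in sequence modeling concrete, the sketch below implements one step of a standard LSTM cell in plain numpy. This is a textbook formulation for illustration only, not the heuristic enhancement the thesis proposes; the parameter layout (the four gates stacked into `W`, `U`, `b`) is an assumption of this sketch.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell (illustrative sketch).

    x: input at the current time step, shape (d_in,)
    h, c: previous hidden and cell states, shape (d_h,)
    W, U, b: parameters for the input, forget, output, and candidate
             gates, stacked: shapes (4*d_h, d_in), (4*d_h, d_h), (4*d_h,)
    """
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = W @ x + U @ h + b
    d_h = h.shape[0]
    i = sigmoid(z[:d_h])             # input gate
    f = sigmoid(z[d_h:2 * d_h])      # forget gate
    o = sigmoid(z[2 * d_h:3 * d_h])  # output gate
    g = np.tanh(z[3 * d_h:])         # candidate cell update
    c_new = f * c + i * g            # cell state carries long-term information
    h_new = o * np.tanh(c_new)       # hidden state exposed to the next layer
    return h_new, c_new
```

Unrolling this step over a whole sequence is what lets the cell state propagate information across many time steps, which is why LSTMs handle long-term dependencies better than vanilla RNNs.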
As such, RNNs represent an enabling technology for Embodied AI. We introduce a heuristic enhancement of LSTM that brings better results, increased training stability, and reduced convergence time on a set of tasks. In the following, we place the agent in a simulated, photorealistic, unknown environment. We aim to explore the largest portion of the new environment in a fixed amount of time. To that end, we propose two different training setups. The first approach relies on curiosity, where the agent tries to maximize its surprisal during the exploration episode. The second strategy promotes actions likely to produce a high impact (i.e., visual changes) on the environment. We show that exploration is an essential ability of embodied agents and that it can enable a series of downstream tasks such as scene description and coordinate-driven navigation in unknown environments. Then, we tackle the recent task of Vision-and-Language Navigation (VLN). In VLN, the agent needs to follow a language-specified instruction to reach a target location in a new environment. With that in mind, we propose two different methods to fuse linguistic and visual information: one based on dynamic convolutional filters and the other based on attention. This way, we show that it is possible to include natural language instructions from a human user in the agent's reasoning process. Hence, we enable a series of future research directions and applications. As a final contribution, we discuss how to deploy agents trained in simulation in the real world. While most of our experiments exploit simulation, we show that it is possible to deploy the resulting models on a Low-Cost Robot (LoCoBot) with little effort.
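The two exploration rewards can be read, in simplified form, as follows. In curiosity-driven setups, surprisal is typically measured as the prediction error of a learned forward model; impact-driven setups reward how much the observation actually changed after an action. The sketch below uses a generic `forward_model` callable and plain feature vectors as stand-ins for the learned networks; the function names and the squared-error/L2-norm choices are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def curiosity_reward(forward_model, feat_t, action, feat_next):
    """Surprisal-style reward: error of a forward model that predicts
    the next observation features from the current ones and the action."""
    predicted = forward_model(feat_t, action)
    return float(np.mean((predicted - feat_next) ** 2))

def impact_reward(feat_t, feat_next):
    """Impact-style reward: how much the observation changed after
    taking the action (zero if nothing in the view changed)."""
    return float(np.linalg.norm(feat_next - feat_t))
```

Both signals are intrinsic: they require no map or goal annotation, which is what makes them usable in unknown environments.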
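The dynamic-convolutional-filter idea for VLN can be sketched as generating image filters from the instruction itself: a projection of the instruction embedding is reshaped into a bank of 1x1 convolutional kernels, which are then slid over the visual feature map to highlight regions matching the instruction. The function name, the filter normalization, and the single-projection parameterization below are illustrative assumptions of this sketch.

```python
import numpy as np

def dynamic_filters_attention(instr_emb, visual_feats, W):
    """Fuse language and vision with dynamically generated filters.

    instr_emb:    instruction embedding, shape (d_l,)
    visual_feats: image feature map, shape (C, H, W_img)
    W:            projection mapping the instruction into K 1x1
                  convolutional filters, shape (K * C, d_l)

    Returns K response maps of shape (K, H, W_img): each generated
    filter highlights image regions that match the instruction.
    """
    C, H, W_img = visual_feats.shape
    filters = (W @ instr_emb).reshape(-1, C)  # (K, C), one 1x1 kernel per row
    # L2-normalize each generated filter (a common stabilization choice)
    filters /= np.linalg.norm(filters, axis=1, keepdims=True) + 1e-8
    # A 1x1 convolution is a dot product over channels at every location
    responses = np.einsum('kc,chw->khw', filters, visual_feats)
    return responses
```

Because the filters are a function of the instruction, the same visual backbone produces different response maps for different commands, which is the core of this language-vision fusion.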
Citation: Landi, Federico. "Percepire, Ragionare, Agire: la Nuova Frontiera dell'Embodied AI." 2022.