Unimore logo AImageLab
Back to the research area

Self-Supervised Navigation and Recounting

Advances in the field of embodied AI aim to foster the next generation of autonomous and intelligent robots. At the same time, tasks at the intersection of computer vision and natural language processing are of particular interest for the community, with image captioning being one of the most active areas. By describing the content of an image or a video, captioning models can bridge the gap between the black-box architecture and the user. In this project, we propose a new task at the intersection of embodied AI, computer vision, and natural language processing, and aim to create a robot that can navigate through a new environment and describe what it sees. We call this new task Explore and Explain since it tackles the problem of joint exploration and captioning. In this schema, the agent needs to perceive the environment around itself, navigate it driven by an exploratory goal, and describe salient objects and scenes in natural language. Beyond navigating the environment and translating visual cues in natural language, the agent also needs to identify appropriate moments to perform the explanation step.

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

The ability to generate natural language explanations conditioned on the visual perception is a crucial step towards autonomous agents which can explain themselves and communicate with humans. While the research efforts in image and video captioning are giving promising results, this is often done at the expense of the computational requirements of the approaches, limiting their applicability to real contexts. In this paper, we propose a fully-attentive captioning algorithm which can provide state-of-the-art performances on language generation while restricting its computational demands. Our model is inspired by the Transformer model and employs only two Transformer layers in the encoding and decoding stages. Further, it incorporates a novel memory-aware encoding of image regions. Experiments demonstrate that our approach achieves competitive results in terms of caption quality while featuring reduced computational demands. Further, to evaluate its applicability on autonomous agents, we conduct experiments on simulated scenes taken from the perspective of domestic robots.  


SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

M.Cornia, L.Baraldi, R.Cucchiara

ICRA 2020


1 Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita "SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability" International Conference on Robotics and Automation, Paris, France, pp. 1128 -1134 , May, 31 - June, 4, 2020 Conference

Video Demo