
Integration of vision and language for physical and cognitive human-robot interaction

Abstract: While many researchers study computer vision, natural language processing, or robotics in isolation, the works proposed here lie at the intersection of these three domains. In this manuscript, two application domains of Human-Robot Interaction (HRI) that combine vision and language are explored, namely physical HRI and cognitive HRI. For physical HRI, the task of Vision and Language Navigation (VLN) is considered. In VLN, an agent perceives a 360-degree view of the environment (vision) and has to follow human language instructions (language) such as “Go to the kitchen and clean the coffee table”. For cognitive HRI, the task of multimodal empathetic dialogue generation is considered. In this task, input signals from facial expressions (vision) and the text of what the human says (language) are provided. The agent should respond to the human empathetically by considering these two multimodal input signals. The first three works are related to physical HRI. The first work proposes a method to improve the navigation performance of an agent by augmenting already existing VLN datasets such as REVERIE. Specifically, a speaker model is proposed that generates language instructions for a sequence of images (for example, “Go to the sofa and bring me the remote control.”) using an adversarial approach. In the second work, the speaker model is extended to generate dialogue whenever the navigation agent is uncertain about where to go next. Finally, in the third work for physical HRI, a generalized VLN agent is proposed. This agent can summarize a trajectory given a sequence of images, navigate, and perform embodied question answering. Large Language Models (LLMs), such as ChatGPT, have become popular, but these models tend to give long, neutral answers aimed at assisting humans in one way or another. The works proposed on cognitive HRI introduce ways to make artificial agents respond empathetically to humans. In the first work for cognitive HRI, an agent replies to the human with parallel or reactive empathy, given the human's facial expression and the text of what is said. Specifically, a Transformer encoder-decoder structure is used to respond to the human empathetically. The second work also consists of an agent that learns to respond to humans empathetically. However, this work uses only a Transformer decoder model to generate the dialogue response, and the model is trained with Reinforcement Learning (RL) to respond in a manner that would make the human feel positive. To summarize, approaches based on Transformer models are proposed to enhance the performance of VLN agents for physical HRI tasks. Transformer models are also fine-tuned to learn to respond to humans empathetically for cognitive HRI tasks. While the two domains of physical HRI and cognitive HRI are treated separately here, ideally, a robot with general intelligence should be able to clean the house or bring a particular object (physical HRI) and be a social companion engaging in empathetic dialogue (cognitive HRI). In the future, a computational model that performs both physical HRI and cognitive HRI could be developed to investigate how these two fields can interplay.
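
To make the cognitive-HRI setup concrete, the following is a minimal sketch of a multimodal empathetic dialogue model built around a Transformer encoder-decoder, as described in the abstract. The model name, feature dimensions, vocabulary size, and the fusion-by-concatenation strategy are illustrative assumptions, not the implementation from the thesis.

```python
# Minimal sketch: a Transformer encoder-decoder that conditions the
# dialogue response on the human's utterance (language) and on
# facial-expression features (vision). All hyperparameters are assumptions.
import torch
import torch.nn as nn


class MultimodalEmpatheticDialogueModel(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, face_feat_dim=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project facial-expression features (e.g. from a vision backbone)
        # into the same embedding space as the text tokens.
        self.face_proj = nn.Linear(face_feat_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, utterance_ids, face_feats, response_ids):
        # Encoder input: one facial-expression token prepended to the utterance.
        text = self.token_emb(utterance_ids)             # (B, T, d)
        face = self.face_proj(face_feats).unsqueeze(1)   # (B, 1, d)
        src = torch.cat([face, text], dim=1)             # (B, 1+T, d)
        # Decoder input: the (shifted) empathetic response tokens.
        tgt = self.token_emb(response_ids)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            response_ids.size(1)
        )
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.lm_head(hidden)                      # (B, T_resp, vocab)


# Toy usage with random tensors.
model = MultimodalEmpatheticDialogueModel()
utt = torch.randint(0, 30522, (2, 12))     # tokenised human utterance
face = torch.randn(2, 128)                 # facial-expression features
resp = torch.randint(0, 30522, (2, 10))    # shifted response tokens
logits = model(utt, face, resp)
print(logits.shape)                        # torch.Size([2, 10, 30522])
```

In the decoder-only variant mentioned in the abstract, the same language-modeling head would instead be trained with an RL objective rewarding responses that elicit positive affect; the sketch above only covers the supervised encoder-decoder setting.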


Citation:

Rawal, Niyati, "Integrazione di visione e linguaggio per l'interazione fisica e cognitiva uomo-robot", 2025.

