
Decoding Facial Expressions in Video: A Multiple Instance Learning Perspective on Action Units
Abstract: Facial expression recognition (FER) in video sequences is a longstanding challenge in affective computing and computer vision, particularly due to the temporal complexity and subtlety of emotional expressions. In this paper, we propose a novel pipeline that leverages facial Action Units (AUs) as structured time-series descriptors of facial muscle activity, enabling emotion classification in videos through a Multiple Instance Learning (MIL) framework. Our approach models each video as a bag of AU-based instances, capturing localized temporal patterns, and allows for robust learning even when only coarse video-level emotion labels are available. Crucially, the approach incorporates interpretability mechanisms that highlight the temporal segments most influential to the final prediction, supporting informed decision-making and facilitating downstream analysis. Experimental results on benchmark FER video datasets demonstrate that our method achieves competitive performance using only visual data, without requiring multimodal signals or frame-level supervision. This highlights its potential as an interpretable and efficient solution for weakly supervised emotion recognition in real-world scenarios.
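The bag-of-instances formulation described above can be illustrated with a minimal attention-style MIL pooling sketch. This is not the paper's implementation: the function names, the 17-dimensional AU feature vectors, and the single scoring vector are illustrative assumptions. It shows how per-segment attention weights both aggregate instances into a video-level representation and expose which temporal segments drive the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def mil_attention_pool(instances, w_score):
    """Attention-based MIL pooling (illustrative sketch).

    Each row of `instances` is one temporal segment of the video,
    represented here by a hypothetical AU activation vector. Segments
    are scored, the scores are softmax-normalized into attention
    weights, and the bag embedding is the weighted sum of instances.
    The weights double as an interpretability signal: high-weight
    segments are the ones most influential to the prediction.
    """
    scores = instances @ w_score                  # one score per segment
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    bag_embedding = weights @ instances           # video-level descriptor
    return bag_embedding, weights

# One video as a bag: 12 temporal segments, each a 17-dim AU vector
# (17 is an assumed feature size, echoing common AU coding schemes).
bag = rng.random((12, 17))
w = rng.standard_normal(17)

embedding, attn = mil_attention_pool(bag, w)
print(embedding.shape)   # video-level embedding, ready for a classifier
print(attn.argmax())     # index of the most influential segment
```

In a full pipeline, `embedding` would feed a video-level classifier trained only on coarse emotion labels, while `attn` localizes the expressive segments without any frame-level supervision.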
Citation:
Del Gaudio, Livia; Cuculo, Vittorio; Cucchiara, Rita: "Decoding Facial Expressions in Video: A Multiple Instance Learning Perspective on Action Units". Image Analysis and Processing - ICIAP 2025 Workshops, Rome, Italy, 20/09/2025.