Gaze-enhanced Crossmodal Embeddings for Emotion Recognition

May 13, 2022

Ahmed Abdou

Ekta Sood

Phillip Müller

Andreas Bulling


Abstract

Emotional expressions are inherently multimodal, integrating facial behavior, speech, and gaze, but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, it did not include an explicit representation of gaze, despite the importance of gaze. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide detailed insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.

Type

Conference paper

Publication

In ETRA - ACM Symposium on Eye Tracking Research & Applications

Last updated on May 13, 2022

Deep Learning, Multi-Modal Learning

Authors

Ahmed Abdou

MLOps @ Zeiss
