The proposed MVP architecture combines facial expressions with physiological signals (heart rate, sweat response) using a transformer that can attend over longer 1-2 minute clips. It aims to outperform previous systems by more tightly integrating voluntary (facial) and involuntary (physiological) responses for emotion detection: in a cross-attention layer, the physiological stream supplies the queries while the facial stream supplies the keys and values.
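A minimal sketch of that cross-attention step, assuming hypothetical feature dimensions and random (untrained) projection weights; in the real model these projections would be learned and the attention would sit inside a full transformer block:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phys, face, d_k=32, seed=0):
    """Cross-attention where physiological features form the queries
    and facial features form the keys/values.

    phys: (T_phys, d_phys) physiological feature sequence
    face: (T_face, d_face) facial feature sequence
    Returns a (T_phys, d_k) fused representation: each physiological
    time step attends over all facial frames.
    """
    rng = np.random.default_rng(seed)  # stand-in for learned weights
    d_phys, d_face = phys.shape[-1], face.shape[-1]
    W_q = rng.standard_normal((d_phys, d_k)) / np.sqrt(d_phys)
    W_k = rng.standard_normal((d_face, d_k)) / np.sqrt(d_face)
    W_v = rng.standard_normal((d_face, d_k)) / np.sqrt(d_face)

    Q = phys @ W_q                      # queries from physiology
    K = face @ W_k                      # keys from facial stream
    V = face @ W_v                      # values from facial stream

    scores = Q @ K.T / np.sqrt(d_k)     # (T_phys, T_face)
    attn = softmax(scores, axis=-1)     # rows sum to 1
    return attn @ V                     # (T_phys, d_k)

# Illustrative shapes for a 2-minute clip: 120 physiological samples
# (e.g. 1 Hz heart-rate/EDA features) attending over 300 facial-frame
# embeddings. Dimensions here are placeholders, not the paper's.
phys = np.random.default_rng(1).standard_normal((120, 4))
face = np.random.default_rng(2).standard_normal((300, 64))
fused = cross_attention(phys, face)
print(fused.shape)  # (120, 32)
```

Because physiology drives the queries, the involuntary signal determines *where* in the facial sequence the model looks, which is one plausible way to realize the voluntary/involuntary integration described above.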