Multimodal body language analysis and video captioning
Every once in a while, I see a video like this show up in my YouTube feed. It’s the kind of video where an “expert” breaks down a character’s body language frame by frame and tries to explain their behaviour and the complexity of human social dynamics in a given scene. These videos are useful for people who want to understand qualities like charisma, present themselves better in public, or who perhaps struggle with social anxiety.
I’m not sure how scientific the whole thing is; some of it appears to draw loose connections to research or to “primitive”, almost hunter-gatherer behaviours. But it does point to a layer of analysis that we may not fully appreciate right now. It’s shocking to see how much information we give away in any conversation through our language, facial expressions, body language, and so much more. We’re constantly expressing ourselves in all these different ways …
The whole thing got me thinking: in the next few years, we will probably be able to apply some kind of multimodal AI video model to analyze any given clip in this way, giving us an additional layer of transcript or playback option that lets us examine the human social dynamics at play.
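As a rough illustration, here is a minimal sketch of what that might look like with today’s tools, using Google’s Gemini API, which already accepts video input. The file name, prompt wording, and output format are all my own assumptions, and whether such a model can actually read body language reliably is very much an open question.

```python
# A minimal sketch, assuming the google-generativeai SDK and a clip on disk.
# The file name, prompt, and output format are illustrative only.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip; video files are processed server-side before they can be used.
clip = genai.upload_file(path="meeting_clip.mp4")
while clip.state.name == "PROCESSING":
    time.sleep(5)
    clip = genai.get_file(clip.name)

model = genai.GenerativeModel("gemini-1.5-pro")
prompt = (
    "Watch this video and produce a second, timestamped 'transcript' track "
    "describing each person's body language and facial expressions, e.g. "
    "'00:42 speaker crosses arms and breaks eye contact'."
)

response = model.generate_content([clip, prompt])
print(response.text)  # the hypothetical body-language layer
```

Overlay that output on the player timeline and you have something like a second subtitle track for body language.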
I’m not sure if it’s a good thing, but imagine revisiting a work recording on Microsoft Teams or Zoom, applying the AI video analysis, and finding out whether you’re getting that promotion, or whether you may have upset somebody. It could be about much more than detecting whether someone is lying: it could be a way to understand, or perhaps even empathize with, each other better. Something to help us analyze our own behaviour and learn how to communicate better, or remain more elusive.
… or just something out of another episode of Black Mirror.