
Google is making captions on Android even better. Here's how.

Google's Angana Ghosh explains how Android's Live Captions feature is evolving to recognize things like tone, sighs, and laughter.


Captions are a crucial accessibility feature that millions of people who are deaf or hard of hearing rely on. Even people without hearing loss use captions to follow videos, whether because they're in a noisy environment or simply can't make out what's being said. Many videos, though, especially those shared on social media, lack captions entirely, making them inaccessible. Android's Live Caption feature solves that problem by generating captions for audio playing on the device, but it has a limitation: it doesn't recognize things like tone, vocal bursts, or background noises. That's changing today with Expressive Captions on Android, a new feature of Live Caption that Google says will "not only tell you what someone says, but how they say it."

What are Expressive Captions?

Expressive Captions uses an on-device AI model to recognize things like tone, volume, environmental cues, and human noises. For example, captions can now reflect the intensity of speech with capitalization, so you'll know when someone is shouting "HAPPY BIRTHDAY!" rather than whispering it. They can also indicate when someone is sighing or gasping, conveying the tone behind their words. Finally, they can tell you when people are applauding or cheering, so you know what's happening in the environment.


Because Expressive Captions is part of Live Caption, it'll be available on many Android phones that support Live Caption, not just the Pixel phones getting the latest Pixel Drop update. However, it currently only supports devices in the U.S. running Android 14 or later, and it only works with English-language media. To enable the feature, simply toggle "Expressive captions" in the Live Caption settings.

Once enabled, Expressive Captions will work with most content you watch or listen to, excluding phone calls and Netflix content. They'll even work in real time when you're offline.

How Google built Expressive Captions

Google says its Android and DeepMind teams worked together to "understand how we engage with content on our devices without sound." Expressive Captions uses "multiple AI models" to not only capture spoken words but also translate them into stylized captions while also providing labels for an even wider range of background sounds.

Angana Ghosh, the Director of Product Management on Android Input and Accessibility, sat down with Jason Howell and me on the Android Faithful podcast to dive a bit deeper into how the company created the Expressive Captions feature.

You can listen to our full 18-minute interview with Ghosh, in which she answered the following questions:

  1. Can you share what exactly Expressive Captions are and how it evolves Captions on Android?
  2. How did you train the model used by Expressive Captions to recognize things like sighs, cheers, and so on?
  3. Why are Expressive Captions only available in the U.S. in English right now?
  4. Currently, Captions work best on audio with clear speech and low background noise, but there’s a lot of media and videos that don’t meet these conditions. Has Google explored applying some sort of background noise reduction algorithm to audio before it’s processed to generate captions?
  5. While most Android devices released in the past year now have Live Caption support, there are still some devices, including flagships from manufacturers like HONOR, ASUS, and Sony, that do not support it. Is Google doing anything to make Live Caption available on all Android devices?
  6. In the U.S., we often find ourselves in multilingual situations, whether in a single conversation or when watching content with multilingual speakers. How does Expressive Captions handle situations where it detects non-English speech?
  7. Will users have the ability to tune the amount of expressiveness in captions for different contexts? There are several examples of background noise being captioned to give a sense of atmosphere and context, like the cheering at a football game, but in situations where the noise might be incidental, can users ask Expressive Captions to pay less attention to background sounds?
