Sony’s Visual Speech Enablement uses cameras and AI to read lips from a distance, no matter the noise. But are its accessibility benefits overshadowed by the potential for privacy violations?
By Steven Winkelman
Updated January 13, 2021
Facial-recognition software can identify faces in a crowd, but how about picking up conversations without the help of nearby microphones? Sony’s Visual Speech Enablement does just that, using camera sensors and AI for augmented lip reading in any environment.
Mark Hanson, Sony’s VP of Product Technology and Innovation, gave a limited overview of the technology during a CES keynote. It’s a new use case for Sony’s Intelligent Vision Image Sensor: AI isolates a user’s lips and translates their movements into words, independent of any background or foreground noise. In fact, it requires no microphone whatsoever. The distance between the sensor and the user is almost inconsequential; the system can work over many feet simply by using a higher-resolution sensor, Hanson told us last week.
Sony initially plans to market the technology for a handful of use cases, such as factory automation, kiosks, and voice-enabled ATMs. Visual Speech Enablement is optimized for use on computers, though consumer-facing versions of the feature could roll out on mobile hardware in the future, according to Hanson.
When asked about the potential for assistive uses of this technology (such as improving auto-generated captions, or reducing the need for a relay operator or automated speech-recognition intermediary that requires a solid data connection and minimal background noise), Hanson said the software is not optimized for such use cases yet but could be in the future.
For all of Visual Speech Enablement’s potential for good, there’s also the possibility it could be misused. Hanson says the technology captures only lips, not faces, so no user-identifiable data is retained. What remains unaddressed is the possibility of combining Visual Speech Enablement with other technologies, many of which use cameras and could incorporate Sony’s AI-enhanced sensors. If Visual Speech Enablement were to sit alongside a facial-recognition camera, the data from the two could be aggregated, undoing Sony’s built-in privacy protections.
Few mediums remain truly private, of course. Websites track you via cookies; some ISPs and mobile carriers sell your data. Despite crackdowns in some cities and states, facial-recognition technology is already in use on streets and in stores. Time will tell where something like Visual Speech Enablement fits in.
Editors’ Note: This story has been updated to correctly reflect statements from Sony’s spokesperson.