
AI has transformed synthesized speech from the monotone of robocalls and decades-old GPS navigation systems to the polished tone of virtual assistants in smartphones and smart speakers.
But there's still a gap between AI-synthesized speech and the human speech we hear in daily conversation and in the media. That's because people speak with complex rhythm, intonation and timbre that's challenging for AI to emulate.
The gap is closing fast: NVIDIA researchers are building models and tools for high-quality, controllable speech synthesis that capture the richness of human speech, without audio artifacts. Their latest projects are now on display in sessions at the Interspeech 2021 conference, which runs through Sept. 3.
These models can help voice automated customer service lines for banks and retailers, bring video-game or book characters to life, and provide real-time speech synthesis for digital avatars.
NVIDIA's in-house creative team even uses the technology to produce expressive narration for a video series on the power of AI.
Expressive speech synthesis is just one element of NVIDIA Research's work in conversational AI - a field that also encompasses natural language processing, automated speech recognition, keyword detection, audio enhancement and more.
Optimized to run efficiently on NVIDIA GPUs, some of this cutting-edge work has been made open source through the NVIDIA NeMo toolkit, available on our NGC hub of containers and other software.
Behind the Scenes of I AM AI
NVIDIA researchers and creative professionals don't just talk the conversational AI talk. They walk the walk, putting groundbreaking speech synthesis models to work in our I AM AI video series, which features global AI innovators reshaping just about every industry imaginable.
But until recently, these videos were narrated by a human. Previous speech synthesis models offered limited control over a synthesized voice's pacing and pitch, so attempts at AI narration didn't evoke the emotional response in viewers that a talented human speaker could.
That changed over the past year when NVIDIA's text-to-speech research team developed more powerful, controllable speech synthesis models like RAD-TTS, used in our winning demo at the SIGGRAPH Real-Time Live competition. Once trained on audio of an individual's speech, RAD-TTS can convert any text prompt into that speaker's voice.
Another of its features is voice conversion, where one speaker's words (or even singing) are delivered in another speaker's voice. Inspired by the idea of the human voice as a musical instrument, the RAD-TTS interface gives users fine-grained, frame-level control over the synthesized voice's pitch, duration and energy.
With this interface, our video producer could record himself reading the video script, and then use the AI model to convert his speech into the female narrator's voice. Using this baseline narration, the producer could then direct the AI like a voice actor - tweaking the synthesized speech to emphasize specific words, and modifying the pacing of the narration to better express the video's tone.
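To make the idea of frame-level control concrete, here is a minimal sketch in plain Python. It is not the RAD-TTS interface or any NeMo API: the `Frame` class and the `emphasize` and `render` functions are hypothetical names invented for illustration. It assumes only that a synthesized voice can be described as a sequence of frames, each with a pitch, an energy and a duration, and shows how emphasizing a word might amount to raising pitch and energy and stretching duration over a range of frames.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """One hypothetical synthesis frame (illustrative, not a RAD-TTS type)."""
    pitch_hz: float  # fundamental frequency of the frame
    energy: float    # relative loudness of the frame
    repeat: int = 1  # duration: how many output frames this frame occupies


def emphasize(frames: List[Frame], start: int, end: int,
              pitch_scale: float = 1.0, energy_scale: float = 1.0,
              stretch: int = 1) -> List[Frame]:
    """Return a copy of `frames` with frames[start:end] adjusted.

    Pitch and energy are multiplied by their scale factors and the
    duration is multiplied by `stretch`. Frames outside the range are
    copied unchanged, so the baseline narration is left intact.
    """
    edited = []
    for i, f in enumerate(frames):
        if start <= i < end:
            edited.append(Frame(f.pitch_hz * pitch_scale,
                                f.energy * energy_scale,
                                f.repeat * stretch))
        else:
            edited.append(Frame(f.pitch_hz, f.energy, f.repeat))
    return edited


def render(frames: List[Frame]) -> List[Tuple[float, float]]:
    """Expand durations into a flat per-frame list of (pitch, energy)."""
    out = []
    for f in frames:
        out.extend([(f.pitch_hz, f.energy)] * f.repeat)
    return out


# Baseline narration: four identical frames.
baseline = [Frame(pitch_hz=200.0, energy=1.0) for _ in range(4)]

# Emphasize the middle two frames: higher, louder and twice as long.
edited = emphasize(baseline, 1, 3, pitch_scale=1.5, energy_scale=2.0, stretch=2)
print(len(render(baseline)), len(render(edited)))
```

In a real system the edited contours would condition a neural synthesis model rather than be rendered directly; the sketch only shows the kind of per-frame adjustment a producer might make when directing the AI like a voice actor.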
The AI model's capabilities go beyond voiceover work: text-to-speech can be used in gaming, to aid individuals with vocal disabilities or to help users translate between languages in their own voice. It can even recreate the performances of iconic singers, matching not only the melody of a song, but also the emotional expression behind the vocals.
Giving Voice to AI Developers, Researchers
With NVIDIA NeMo - an open-source Python toolkit for GPU-accelerated conversational AI - researchers, developers and creators gain a head start in experimenting with, and fine-tuning, speech models for their own applications.
Easy-to-use APIs and models pretrained in NeMo help researchers develop and customize models for text-to-speech, natural language processing and real-time automated speech recognition. Several of the models are trained with tens of thousands of hours of audio data on NVIDIA DGX systems. Developers can fine-tune any model for their use cases, speeding up training using mixed-precision computing on NVIDIA Tensor Core GPUs.
Through NGC, NVIDIA NeMo also offers models trained on Mozilla Common Voice, a dataset with nearly 14,000 hours of crowd-sourced speech data in 76 languages. Supported by NVIDIA, the project aims to democratize voice technology with the world's largest open data voice dataset.
Voice Box: NVIDIA Researchers Unpack AI Speech
Interspeech brings together more than 1,000 researchers to showcase groundbreaking work in speech technology. At this week's conference, NVIDIA Research is presenting conversational AI model architectures as well as fully formatted speech datasets for developers.
Catch the following sessions led by NVIDIA speakers:
Scene-Agnostic Multi-Microphone Speech Dereverberation - Tues., Aug. 31
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition - Weds., Sept. 1
Hi-Fi Multi-Speaker English TTS Dataset - Weds., Sept. 1
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction - Thurs., Sept. 2
Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices - Friday, Sept. 3
NeMo Inverse Text Normalization: From Development to Production - Friday, Sept. 3
Find NVIDIA NeMo models in the NGC catalog, and tune into talks by NVIDIA researchers at Interspeech.