Google’s secretive British DeepMind division is teaching its AI to talk like a human.
The groundbreaking project has already halved the quality gap between computer systems and human speech, its creators say.
Called WaveNet, it is capable of creating natural-sounding synthesized speech by analyzing sound waves from the human voice – rather than focusing on the human language.
Google acquired UK-based DeepMind in 2014 for $533 million. Since then, DeepMind’s AI has beaten a professional human Go player, learned how to play the Atari game Space Invaders and read through thousands of Daily Mail and CNN articles.
Now, researchers are working towards a new goal – ‘allowing people to converse with machines,’ DeepMind shared in a recent blog.
The ability of computers to understand natural speech has made great strides over the past few years, progress that was only possible with the application of deep neural networks.
Google’s existing text-to-speech (TTS) systems typically rely on a technique called concatenative TTS, in which audio is created by recombining fragments of recorded speech.
Another technique, called parametric TTS, generates speech by passing information through a vocoder, but this method produces sounds that are even less natural.
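To make the concatenative idea concrete, here is a minimal Python sketch of recombining stored fragments into one waveform. It is an illustration only, not Google's system: the fragment names, the sine tones standing in for recorded speech, and the crossfade length are all assumptions made for the example.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (the 16kHz rate quoted by DeepMind)

def tone(freq_hz, duration_ms):
    """Stand-in for a recorded speech fragment: a short sine tone (assumption)."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

# Hypothetical fragment database. A real system stores many recorded speech units.
UNITS = {
    "HH": tone(200, 50),
    "EH": tone(300, 50),
    "L": tone(250, 50),
    "OW": tone(350, 50),
}

def concatenate(unit_names, crossfade=80):
    """Join fragments end to end, blending `crossfade` samples at each seam."""
    out = []
    for name in unit_names:
        frag = UNITS[name]
        if out and crossfade:
            # Fade the tail of the audio so far into the head of the new fragment.
            for k in range(crossfade):
                w = (k + 1) / (crossfade + 1)
                idx = len(out) - crossfade + k
                out[idx] = (1 - w) * out[idx] + w * frag[k]
            out.extend(frag[crossfade:])
        else:
            out.extend(frag)
    return out

audio = concatenate(["HH", "EH", "L", "OW"])
print(len(audio), "samples,", len(audio) / SAMPLE_RATE, "seconds")
```

The seams between fragments are where such systems tend to sound unnatural, which is why the sketch blends neighbouring fragments rather than simply butting them together.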
DeepMind’s WaveNet does things a little differently – this AI focuses on actual sound waves, rather than just language.
This technology is key for computers to learn speech recognition, image compression, computer vision and more.
WaveNet uses its neural networks to analyze the raw waveform of an audio signal and to model speech and other types of audio, including music.
Although this is a big leap for DeepMind, the team has no plans to add the AI to any commercial applications just yet.
WaveNet has to handle at least 16,000 waveform samples for every second of audio, each one predicted from the samples before it, which the team acknowledges ‘is clearly a challenging task’.
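The scale of that task is easier to see in code. The Python sketch below generates one second of audio one sample at a time, with every new sample computed from the ones before it. The predictor is a deliberately simple stand-in (a two-term recurrence that continues a 440Hz sine wave), not DeepMind's neural network, and the function names are hypothetical.

```python
import math

SAMPLE_RATE = 16_000  # samples per second of audio
OMEGA = 2 * math.pi * 440 / SAMPLE_RATE  # phase step per sample for a 440Hz tone

def toy_predictor(history):
    """Hypothetical stand-in for the neural network.

    Predicts the next sample from the previous two using the recurrence
    x[t] = 2*cos(w)*x[t-1] - x[t-2], which continues a sine wave.
    """
    return 2 * math.cos(OMEGA) * history[-1] - history[-2]

def generate(predict_next, seed, n_samples):
    """Extend `seed` to `n_samples` samples, one prediction per new sample."""
    samples = list(seed)
    calls = 0
    while len(samples) < n_samples:
        samples.append(predict_next(samples))
        calls += 1
    return samples, calls

seed = [0.0, math.sin(OMEGA)]  # the first two samples of the sine wave
audio, calls = generate(toy_predictor, seed, SAMPLE_RATE)
print(len(audio), "samples from", calls, "sequential predictions")
```

Because each prediction depends on earlier output, the loop cannot simply be run in parallel, so even this toy version needs nearly 16,000 sequential steps for a single second of sound.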
To test WaveNet’s synthesized speech, DeepMind played computer-generated clips to US English-speaking and Mandarin Chinese-speaking participants in their own language.
Two of the clips were created with Google’s existing concatenative and parametric TTS systems, and the third was generated by WaveNet.
Subjects found that WaveNet-generated speech sounded more natural than the clips created by Google’s existing TTS programs, but less natural than recordings of actual human speech.
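A claim such as halving the quality gap comes down to simple arithmetic on listener ratings. The Python sketch below shows one way to compute it. The function name is hypothetical and the scores are made-up placeholders chosen for round numbers, not DeepMind's results.

```python
def gap_closed(baseline_score, new_score, human_score):
    """Fraction of the gap to human speech that the new system closes."""
    old_gap = human_score - baseline_score
    new_gap = human_score - new_score
    if old_gap <= 0:
        raise ValueError("baseline must score below human speech")
    return 1 - new_gap / old_gap

# Placeholder ratings on an arbitrary naturalness scale (NOT measured data).
best_existing_tts = 3.0
new_system = 3.5
human_recording = 4.0

fraction = gap_closed(best_existing_tts, new_system, human_recording)
print(f"Gap to human speech closed: {fraction:.0%}")
```

With these placeholder numbers the new system sits halfway between the older system and the human recording, so the function reports that half of the gap has been closed.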
‘WaveNets open up a lot of possibilities for TTS, music generation and audio modelling in general,’ said DeepMind.
‘The fact that directly generating timestep per timestep with deep neural networks works at all for 16kHz audio is really surprising, let alone that it outperforms state-of-the-art TTS systems.’
‘We are excited to see what we can do with them next.’