High-quality expressive TTS that speaks in a multitude of voices created by you
Modern TTS technology approaches human capabilities in terms of speech naturalness and expressiveness. With its automation and flexibility, TTS has the potential for a growing range of applications in domains such as entertainment and education. A fundamental barrier that limits the spread of TTS is that speech synthesis systems can speak only in a limited number of voices prepared in advance, typically through an expensive, labor-intensive and lengthy process.
Traditionally, each TTS voice is created from a corpus of audio recordings of a single speaker. A typical high-quality TTS voice requires 10 to 20 hours of audio data recorded from a voice actor in a professional studio. Actor auditions and recordings can take weeks. The recordings are then converted to a TTS voice dataset using a complex semi-automatic process that typically involves manual inspection and cleaning steps performed by skilled personnel. Hence, creating a voice is time-consuming and costly.
The IBM Virtual Voice Creator technology removes this barrier by allowing users to change a TTS voice according to their needs and imagination. An entire universe populated with different human voices, along with exaggerated cartoonish ones, can now be derived from a couple of standard TTS voices.