China Research Laboratory
Speech Technologies

Text-To-Speech

Text-to-Speech (TTS) is a key speech technology that converts text into synthesized speech. Our goal is to make synthesized speech as intelligible, natural and pleasant as human speech. TTS facilitates human-machine interaction and has great potential in applications such as telephony, embedded systems and assistive technology.

In recent years, major progress has been made in the quality and naturalness of TTS systems. Most high-quality systems are corpus-based concatenative synthesis systems, in which speech segments are selected from a large speech corpus. CRL applies advanced linguistic and speech signal processing techniques in developing its TTS systems to meet market demands. We currently support Chinese, Taiwanese Chinese, Cantonese, Korean, Japanese and French. The bilingual TTS engines developed at CRL produce Chinese and English speech seamlessly and can be easily integrated into the corresponding desktop, embedded and telephony products.
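As a rough illustration of how segments can be selected from a speech corpus, the sketch below runs a Viterbi search that balances a target cost (how well a candidate matches the desired prosody) against a join cost (how smoothly neighbouring segments connect). This is a generic textbook formulation, not CRL's implementation; the Unit fields, the cost weights and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded segment in a hypothetical speech corpus."""
    label: str       # phonetic label, e.g. a syllable
    pitch: float     # mean pitch in Hz
    duration: float  # length in seconds


def target_cost(unit, want_pitch, want_duration):
    """Mismatch between a candidate and the desired prosody (weights are illustrative)."""
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_duration) * 5.0


def join_cost(left, right):
    """Penalty for a pitch jump at the concatenation point (illustrative)."""
    return abs(left.pitch - right.pitch) / 100.0


def select_units(targets, corpus):
    """Return (best unit sequence, total cost) via Viterbi search.

    targets: list of (label, desired_pitch, desired_duration)
    corpus:  dict mapping label -> list of candidate Units
    """
    if not targets:
        return [], 0.0
    # best[i][j] = (cheapest cost of a path ending at candidate j of target i, backpointer)
    best = []
    for i, (label, pitch, dur) in enumerate(targets):
        row = []
        for cand in corpus[label]:
            tc = target_cost(cand, pitch, dur)
            if i == 0:
                row.append((tc, None))
            else:
                prev_cands = corpus[targets[i - 1][0]]
                cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev, cand), k)
                    for k, prev in enumerate(prev_cands)
                )
                row.append((cost + tc, back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(corpus[targets[i][0]][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

Real systems use much richer costs (spectral distance at the join, phonetic context, prosodic position), but the search structure is typically of this kind.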

Server version TTS

A server-based system usually requires a large amount of disk and memory, ranging from tens to hundreds of megabytes for a single voice. Built on our state-of-the-art Mandarin and English systems, our TTS system provides seamless mixed-language synthesis.

Embedded version TTS

Concatenative TTS can generate speech with high quality and naturalness, but its footprint must be reduced for use on hand-held devices. Our embedded Mandarin system has a footprint of only 5.0 MB, yet its quality remains reasonably good compared with the server version. The downsizing techniques include voice compression and deep segment preselection.
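One simple way to picture segment preselection is to prune the unit inventory offline so that only the segments most likely to be used are shipped on the device. The sketch below keeps the most frequently selected units per label, based on a log of which units were chosen while synthesizing a large training script. This frequency criterion is an assumption chosen for brevity and is not the mutual-information method cited in the publications below; the function name, the unit ids and the keep_per_label parameter are hypothetical.

```python
from collections import Counter


def preselect(selection_log, inventory, keep_per_label):
    """Keep the most frequently selected units for each label.

    selection_log:  iterable of unit ids chosen while synthesizing a training script
    inventory:      dict mapping label -> list of unit ids
    keep_per_label: maximum number of units retained per label (at least 1 is kept)
    """
    usage = Counter(selection_log)
    pruned = {}
    for label, unit_ids in inventory.items():
        # sorted() is stable, so ties keep their original inventory order.
        ranked = sorted(unit_ids, key=lambda u: -usage[u])
        pruned[label] = ranked[:max(1, keep_per_label)]
    return pruned
```

Lowering keep_per_label shrinks the footprint at the cost of fewer candidates for the run-time search, which is the trade-off the embedded system has to balance.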

Personalized TTS

Personalized TTS is a technology for building a specific person's voice from only a small number of recorded training sentences. The key technologies include prosody adaptation and voice conversion. Here is a sample from a system built with 326 recorded sentences, which amounts to about 30 minutes of recording data.
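To give a feel for what prosody adaptation can involve, the sketch below shifts a source pitch contour so that its log-F0 mean and spread match those measured from a target speaker's recordings. This mean-variance transformation is a common baseline and is only an assumption here; the page does not describe the method CRL actually uses. The function name and the convention that 0 marks an unvoiced frame are also hypothetical.

```python
import math
from statistics import mean, pstdev


def adapt_f0(source_f0, target_f0_samples):
    """Map a source pitch contour onto a target speaker's log-F0 statistics.

    source_f0:         per-frame pitch in Hz, 0 for unvoiced frames
    target_f0_samples: voiced pitch values in Hz from the target speaker
    Both inputs must contain at least one voiced (positive) value.
    """
    src_log = [math.log(f) for f in source_f0 if f > 0]
    tgt_log = [math.log(f) for f in target_f0_samples if f > 0]
    mu_s, sd_s = mean(src_log), pstdev(src_log)
    mu_t, sd_t = mean(tgt_log), pstdev(tgt_log)
    # Fall back to a pure shift when the source contour is flat.
    scale = sd_t / sd_s if sd_s > 0 else 1.0
    return [
        math.exp(mu_t + (math.log(f) - mu_s) * scale) if f > 0 else 0.0
        for f in source_f0
    ]
```

With only a few hundred sentences from the target speaker, simple global statistics like these can be estimated reliably, which is one reason such transformations suit small-data settings.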

We will continue working to improve the quality and availability of our TTS engines, and we will continue to explore expressive TTS.

Below are some of our publications for your reference:

Q. Shi, "A Comparison of Statistical Methods and Features for the Prediction of Prosodic Structures," Proc. ICSLP, Korea, 2004.
L. Jin, W. Zhang, X.J. Ma, "Mutual-Information Based Segment Pre-selection in Concatenative Text-to-Speech," Proc. ICSLP, Korea, 2004.
F.X. Chen, A.J. Li, "Acoustic Analysis of Friendly Speech," Proc. ICASSP, Montreal, 2004.
X.J. Ma, W. Zhang, W. B. Zhu, Q. Shi and L. Jin, "Probability Based Prosody Model For Unit Selection," Proc. ICASSP, Montreal, 2004.
X.J. Ma, W. Zhang, "Automatic Prosody Labeling Using both Text and Acoustic Information," Proc. ICASSP, Hong Kong, 2003.
H.P. Li, "Trainable Cantonese/English Dual language Speech Synthesis System," Proc. ICASSP, Hong Kong, 2003.
F.X. Chen, "Syllable Clustering and Spectral Discontinuity in Syllable-based TTS Systems," Proc. ICASSP, Hong Kong, 2003.
Q. Shi, W. Zhang, X.J. Ma, "Comparisons among Four statistic based methods of prosody structure prediction," Proc. 7th National Conference on Man-Machine Speech Communications, Xiamen, China, 2003.
Q. Shi, L.Q. Shen and H.X. Chai, "Automatic New Word Extraction Method," Proc. ICASSP, Orlando, 2002.
H.P. Li, "Generating Script Using Statistical Information of the Context Variation Unit Vector," Proc. ICSLP, Denver, 2002.
J.H. Yuan, L.Q. Shen and F.X. Chen, "The Acoustic Realization of Anger, Fear, Joy and Sadness in Chinese," Proc. ICSLP, Denver, 2002.
W. Zhu, W. Zhang, Q. Shi, F. Chen, "Corpus Building for Data-Driven TTS System," IEEE TTS Workshop, Santa Monica, 2002.
Q. Shi, X.J. Ma, W.B. Zhu, W. Zhang and L.Q. Shen, "Statistic Prosody Structure Prediction Based on Annotated Corpus," IEEE TTS Workshop, Santa Monica, 2002.
J.F. Cao and W.B. Zhu, "Syntactic and Lexical Constraint in Prosodic Segmentation and Grouping," Speech Prosody 2002, France.
W.S. Lee, F.X. Chen, K.K. Luke and L.Q. Shen, "The Prosody of Bisyllabic and Polysyllabic Words in Hong Kong Cantonese," Speech Prosody 2002, France.
F.X. Chen, "Issues in Speech Synthesis for Tone Languages," Proc. 5th Symposium on Natural Language Processing 2002, Thailand.
W. Zhang, L.Q. Shen, and D. Tang, "Voice Conversion Based on Acoustic Feature Transformation," Proc. 6th National Conference on Man-Machine Speech Communications, Shenzhen, China, 2001.
K.K. Luke, F.X. Chen, W.S. Lee and L.Q. Shen, "A Phonetic Study of the Prosodic Properties of Bisyllabic Compounds in Hong Kong Cantonese," Proc. 5th National Conference on Modern Phonetics, Beijing, China, 2001.
X.C. Niu, L.Q. Shen, W.B. Zhu, and Q. Shi, "Modelling and Decision Tree Based Prediction of Pitch Contour in IBM's Mandarin Speech Synthesis System," Proc. International Symposium on Chinese Spoken Language Processing, Beijing, China, 2000.
W.B. Zhu, L.Q. Shen, and X.C. Niu, "Duration Modeling for Chinese Synthesis from C-ToBI Labeled Corpus," Proc. ICSLP, Beijing, China, 2000.

Speech Recognition

History

ViaVoice

IBM has long been a front runner in speech recognition technology, holding more than 100 patents in this area. In September 1997, IBM released ViaVoice, the world's first large-vocabulary, speaker-independent, continuous speech recognition system for simplified Chinese, which attracted wide attention from many quarters. It paved the way for fast and easy input of Chinese characters and was widely acclaimed as an important milestone in Chinese input. Since then, we have continued to improve recognition accuracy across our product series.

We also took part in the DARPA HUB4 evaluations on Chinese broadcast news transcription in 1997 and 1998, and ranked first in those evaluations.

We now have simplified Chinese, traditional Chinese and Cantonese ViaVoice technologies.

WebSphere Voice Server (WVS)

WVS extends and enhances IBM's speech technologies for the telecommunications field. It runs on top of WebSphere Application Server (WAS) and consists of a speech recognition server, a speech synthesis server and the IBM WebSphere Voice Toolkit. WVS gives application developers several programming interfaces for developing speech-enabled IVR applications, including VoiceXML, Java and state tables.

With this technology, many services can be automated, such as information inquiry and call centers; home banking and stock quotes in finance; and ticket booking and hotel reservation in travel services.

WVS now supports both simplified and traditional Chinese.

Embedded ViaVoice Technology

In 2001, IBM released IBM Embedded ViaVoice, which delivers IBM voice technologies to mobile devices, giving manufacturers the power to develop solutions that allow voice access to information from anywhere, at any time, on any device. It supports a variety of real-time operating systems and microprocessors, making the development of robust mobile speech solutions easy and practical for device and application developers.

Currently, we have Embedded ViaVoice products for simplified Chinese and traditional Chinese.


On-going projects

Telematics solution

In close collaboration with several worldwide car-service providers and car manufacturers, IBM has built two kinds of in-car speech-enabled solutions. One is based on WVS and the other on Embedded ViaVoice technology. In both solutions, the most difficult issue we tackled was the robustness of the speech recognizer in noisy environments, especially when the car is running at high speed.

In Greater China, we support telematics solutions for both simplified Chinese and traditional Chinese.

Audio Search

Retrieving information from audio, and especially from spoken documents, has been a widely pursued goal over the past ten years. We started the Audio Search (AS) project in 2003. AS aims to solve the out-of-vocabulary (OOV) detection problem in multimedia content retrieval applications. In our experiments, AS improved the OOV detection rate significantly, and it has been integrated with the IBM Content Management System (IBM CMS).

Publications

Guo Xue Feng, "The IBM LVCSR system used for 1998 mandarin broadcast news transcription evaluation", HUB4 workshop, 1998.

Shen Li Qin, "The measurement of acoustic similarity and its applications", ICSLP, 2000.

Shi Qin, "A novel classification method in building a class language model vocabulary",  NCMMSC6, 2001. 

Ye Meng, "Capacity modeling for enterprise voice application platform", NCMMSC6, 2001.

C. J. Chen, "Recognize tone languages using pitch information on the main vowel of each syllable", Proc. ICASSP, 2001.

Shi Qin, "Automatic new word extraction method", Proc. ICASSP, 2002.

Qin Yong, "A study on word acoustic distance measurement", NCMMSC7, 2003.

Li Hai Ping, "An information gain and grammar complexity based approach to attribute selection in speech enabled information retrieval dialogs", ISCSLP, 2004.

 