
China Research Laboratory
Text-To-Speech
Text-to-Speech (TTS) is a key speech technology that converts text
information into synthesized speech. Our goal is to
make synthesized speech as intelligible, natural and pleasant as human speech. TTS facilitates
human-machine interaction and has great potential in applications such as
telephony, embedded systems and assistive technology.
In recent years, major progress has been made in the quality and naturalness of
TTS systems. Most high-quality systems are corpus-based concatenative synthesis
systems, where speech segments are selected from a large speech dataset.
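The selection step in a corpus-based concatenative system is commonly framed as a shortest-path search: each target unit has several candidate segments, and the chosen sequence minimizes a target cost (how well a segment fits the requested unit) plus a join cost (how smoothly neighboring segments connect). The following is a minimal sketch of that general idea, not CRL's implementation. The function name `select_units`, the toy inventory, the pitch-based costs and the 130 Hz target are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Unit = dict  # hypothetical segment record, e.g. {"id": ..., "pitch": ...}


def select_units(
    targets: Sequence[str],
    inventory: Dict[str, List[Unit]],
    target_cost: Callable[[str, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Choose one candidate per target so that the summed target and
    join costs are minimal (dynamic programming over candidates).
    Assumes every target label has at least one candidate."""
    if not targets:
        return [], 0.0
    # best[i][j] = (cost of cheapest path ending in candidate j of target i,
    #               index of the predecessor candidate on that path)
    best = [[(target_cost(targets[0], c), -1) for c in inventory[targets[0]]]]
    for i in range(1, len(targets)):
        prev_cands = inventory[targets[i - 1]]
        row = []
        for cand in inventory[targets[i]]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, cand), k)
                for k, p in enumerate(prev_cands)
            )
            row.append((cost + target_cost(targets[i], cand), back))
        best.append(row)
    # Trace the cheapest path back from the last target to the first.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path: List[Unit] = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(inventory[targets[i]][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data: two syllables with two recorded candidates each (made-up values).
inventory = {
    "ni": [{"id": "ni_1", "pitch": 200.0}, {"id": "ni_2", "pitch": 120.0}],
    "hao": [{"id": "hao_1", "pitch": 125.0}, {"id": "hao_2", "pitch": 210.0}],
}


def target_cost(label: str, cand: Unit) -> float:
    # Hypothetical: distance from an assumed desired pitch of 130 Hz.
    return abs(cand["pitch"] - 130.0) / 100.0


def join_cost(left: Unit, right: Unit) -> float:
    # Hypothetical: pitch jump between adjacent segments.
    return abs(left["pitch"] - right["pitch"]) / 100.0


path, total = select_units(["ni", "hao"], inventory, target_cost, join_cost)
print([u["id"] for u in path], round(total, 3))
```

Real systems use much richer costs (spectral distance, duration, prosodic context), but the search structure stays the same: the cost of the search grows with the number of candidates per unit, which is why the size of the speech dataset matters.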
CRL applies advanced linguistic and speech signal processing techniques
in developing its TTS systems to meet market demands. We now support
Chinese, Taiwanese Chinese, Cantonese, Korean, Japanese and French.
The bilingual TTS engines developed at CRL can output
Chinese and English speech seamlessly and can be easily integrated
into the corresponding desktop, embedded and telephony products.
Server version TTS
The server-based system usually requires a large amount of disk and memory resources, ranging from tens to hundreds of megabytes for a single voice. Built on state-of-the-art Mandarin and English systems, our TTS system provides seamless mixed-language synthesis.
Embedded version TTS
Concatenative TTS can generate speech with high quality and naturalness,
but the footprint must be reduced for use on hand-held devices. In the embedded Mandarin system,
the footprint is only 5.0 MB, yet the quality remains reasonably good compared with the server version.
The downsizing technology includes voice compression and deep segment preselection.
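The source does not describe how segment preselection works. One plausible reading, offered here purely as an assumption, is offline pruning: run the full server-side selection over a large text set, count how often each segment is actually chosen, and ship only the most-used candidates per unit. The sketch below illustrates that usage-count idea. The name `prune_inventory`, the segment IDs and the usage log are hypothetical, and this is not the mutual-information method named in the publication list.

```python
from collections import Counter
from typing import Dict, Iterable, List


def prune_inventory(
    inventory: Dict[str, List[str]],
    usage_log: Iterable[str],
    keep: int,
) -> Dict[str, List[str]]:
    """Keep at most `keep` segment IDs per unit label, preferring the
    segments chosen most often in `usage_log`. Ties keep the original
    order because Python's sort is stable."""
    counts = Counter(usage_log)
    pruned = {}
    for label, candidates in inventory.items():
        ranked = sorted(candidates, key=lambda seg: counts[seg], reverse=True)
        pruned[label] = ranked[:keep]
    return pruned


# Hypothetical full inventory and a log of segments picked by the full system.
full_inventory = {
    "ni": ["ni_1", "ni_2", "ni_3"],
    "hao": ["hao_1", "hao_2"],
}
usage_log = ["ni_2", "hao_1", "ni_2", "ni_3", "hao_1", "ni_1", "ni_2", "ni_3"]

small_inventory = prune_inventory(full_inventory, usage_log, keep=2)
print(small_inventory)
```

The trade-off is the one the text describes: fewer stored candidates mean a smaller footprint, at the price of fewer choices at synthesis time and therefore somewhat lower quality than the server version.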
Personalized TTS
Personalized TTS is a technology for constructing a specific person's voice
from only a small number of recorded training sentences. The key technologies include prosody adaptation
and voice conversion. Here is a sample from a system built with 326 recorded sentences, which amounts to about 30 minutes of recorded data.
We will continue our efforts to improve the quality and the availability of our TTS engines.
We will also continue our exploration of expressive TTS.
Below are some of our publications for your reference:
Q. Shi, "A Comparison of Statistical Methods and Features for the Prediction of Prosodic Structures," Proc. ICSLP, Korea, 2004.
L. Jin, W. Zhang, X.J. Ma, "Mutual-Information Based Segment Pre-selection in Concatenative Text-to-Speech," Proc. ICSLP, Korea, 2004.
F.X. Chen, A.J. Li, "Acoustic Analysis of Friendly Speech," Proc. ICASSP, Montreal, 2004.
X.J. Ma, W. Zhang, W. B. Zhu, Q. Shi and L. Jin, "Probability Based Prosody Model For Unit Selection," Proc. ICASSP, Montreal, 2004.
X.J. Ma, W. Zhang, "Automatic Prosody Labeling Using both Text and Acoustic Information," Proc. ICASSP, Hong Kong, 2003.
H.P. Li, "Trainable Cantonese/English Dual language Speech Synthesis System," Proc. ICASSP, Hong Kong, 2003.
F.X. Chen, "Syllable Clustering and Spectral Discontinuity in Syllable-based TTS Systems," Proc. ICASSP, Hong Kong, 2003.
Q. Shi, W. Zhang, X.J. Ma, "Comparisons among Four statistic based methods of prosody structure prediction," Proc. 7th National Conference on Man-Machine speech Communications, Xiamen, China, 2003.
Q. Shi, L.Q. Shen and H.X. Chai, "Automatic New Word Extraction Method," Proc. ICASSP, Orlando, 2002.
H.P. Li, "Generating Script Using Statistical Information of the Context Variation Unit Vector," Proc. ICSLP, Denver, 2002.
J.H. Yuan, L.Q. Shen and F.X. Chen, "The Acoustic Realization of Anger, Fear, Joy and Sadness in Chinese," Proc. ICSLP, Denver, 2002.
W. Zhu, W. Zhang, Q. Shi, F. Chen, "Corpus Building for Data-Driven TTS System," IEEE TTS Workshop, Santa Monica, 2002.
Q. Shi, X.J. Ma, W.B. Zhu, W. Zhang and L.Q. Shen, "Statistic Prosody Structure Prediction Based on Annotated Corpus," IEEE TTS Workshop, Santa Monica, 2002.
J.F. Cao and W.B. Zhu, "Syntactic and Lexical Constraint in Prosodic Segmentation and Grouping," Speech Prosody 2002, France.
W.S. Lee, F.X. Chen, K.K. Luke and L.Q. Shen, "The Prosody of Bisyllabic and Polysyllabic Words in Hong Kong Cantonese," Speech Prosody 2002, France.
F.X. Chen, "Issues in Speech Synthesis for Tone Languages," Proc. 5th Symposium on Natural Language Processing 2002, Thailand.
W. Zhang, L.Q. Shen, and D. Tang, "Voice Conversion Based on Acoustic Feature Transformation," Proc. 6th National Conference on Man-Machine speech Communications, Shenzhen, China, 2001.
K.K. Luke, F.X. Chen, W.S. Lee and L.Q. Shen, "A Phonetic Study of the Prosodic Properties of Bisyllabic Compounds in Hong Kong Cantonese," Proc. 5th National Conference on Modern Phonetics, Beijing, China, 2001.
X.C. Niu, L.Q. Shen, W.B. Zhu, and Q. Shi, "Modelling and Decision Tree Based Prediction of Pitch Contour in IBM's Mandarin Speech Synthesis System," Proc. International Symposium on Chinese Spoken Language Processing, Beijing, China, 2000.
W.B. Zhu, L.Q. Shen, and X.C. Niu, "Duration Modeling for Chinese Synthesis from C-ToBI Labeled Corpus," Proc. ICSLP, Beijing, China, 2000.
Speech Recognition
History
ViaVoice
IBM has long been a technology front runner in speech
recognition worldwide, with more than 100 patents in this area. In September 1997,
IBM released ViaVoice, the world's first large-vocabulary, speaker-independent, continuous
speech recognition system for simplified Chinese, and it drew
wide attention from many quarters. It paved the
way for fast and easy input of Chinese characters and was widely acclaimed as
an important milestone in Chinese input. Since then, we have continued our
efforts to improve recognition accuracy across our product line.
We also took part in the DARPA HUB4 evaluations of Chinese
broadcast news transcription in 1997 and 1998, and ranked first
in the evaluations.
We now have simplified Chinese, traditional Chinese and Cantonese
ViaVoice technologies.
WebSphere Voice Server (WVS)
WVS extends and enhances IBM's speech
technologies for the telecommunications field.
It runs on top of WebSphere Application Server (WAS) and consists of
a speech recognition server, a speech synthesis server and the IBM
WebSphere Voice Toolkit. WVS gives application developers several
programming interfaces for developing speech-enabled IVR applications,
including VoiceXML, Java and state tables.
With this technology, many services can be automated, such as information inquiry
and call centers; home banking and stock quotes in finance; and ticket
booking and hotel reservations in travel services.
WVS now supports both simplified and traditional Chinese.
Embedded ViaVoice Technology
In 2001, IBM released IBM Embedded ViaVoice, which delivers IBM voice technologies to
mobile devices, giving manufacturers the power to develop solutions that
allow voice access to information from anywhere, at any time, on any
device. It supports a variety of real-time operating systems and
microprocessors, making the development of robust mobile speech solutions
easy and practical for device and application developers.
Currently, we offer Embedded ViaVoice products for simplified Chinese and
traditional Chinese.
On-going projects
Telematics solution
In close collaboration with several
worldwide car-service providers and car manufacturers, IBM has built two
kinds of in-car speech-enabled solutions. One is based on WVS and the
other on Embedded ViaVoice technology. In both
solutions, the most difficult issue we tackled was the robustness of the speech recognizer
in noisy environments, especially when the car is running
at high speed.
In Greater China, we support telematics solutions for both simplified Chinese and
traditional Chinese.
Audio Search
Retrieval of
information from audio, and especially from spoken documents, has been a goal
of many people over the past ten years. In 2003, we started the Audio
Search (AS) project. AS aims to solve the out-of-vocabulary (OOV)
detection problem in multimedia content information retrieval applications.
In our experiments, AS improved the OOV detection rate
significantly, and AS has been integrated with the IBM Content Management
System (IBM CMS).
Publications
Guo Xue Feng, "The IBM LVCSR system used for 1998 mandarin broadcast news transcription evaluation", HUB4 workshop, 1998.
Shen Li Qin, "The measurement of acoustic similarity and its applications", ICSLP, 2000.
Shi Qin, "A novel classification method in building a class language model vocabulary", NCMMSC6, 2001.
Ye Meng, "Capacity modeling for enterprise voice application platform", NCMMSC6, 2001.
C. J. Chen, "Recognition tone languages using pitch information on the main vowel of each syllable", Proc. ICASSP, 2001.
Shi Qin, "Automatic new word extraction method", Proc. ICASSP, 2002.
Qin Yong, "A study on word acoustic distance measurement", NCMMSC7, 2003.
Li Hai Ping, "An information gain and grammar complexity based approach to attribute selection in speech enabled information retrieval dialogs", ISCSLP, 2004.