
IBM Research
User Interface Technologies



Research Spotlight (January 2003)

Research on User Interface Technologies (UIT) at IBM is dedicated to developing innovative technologies, algorithms and tools for next-generation human-computer interfaces.

For decades, IBM has been among the leaders in all aspects of speech recognition technology. Most recently, ViaVoice for Embedded Multiplatforms, the IBM embedded speech recognition technology that represents the state of the art in using voice for command and control, was made available on multiple platforms and systems, such as Palm and iPaq handheld computers and cars.

The Telephony Speech Recognition project aims to develop algorithms that improve the accuracy and robustness of speech recognition in conversational telephony applications such as bank transactions and directory assistance. The project has significant product impact, directly contributing algorithms and data files to WebSphere Voice Server and related voice technology offerings from IBM.

The multi-year goal of the Superhuman Speech Recognition project is to develop speech recognition technology that surpasses the abilities of humans to recognize speech. We are addressing this goal by attacking a graded series of speech recognition tasks, ranging from simple read speech to complex spontaneous conversations in difficult channels and environments. The focal point of this work is the creation of more robust and sophisticated acoustic and language modeling techniques to improve performance while simultaneously reducing the labor required to install and tune new applications.

The Audio-Visual Speech Recognition project, which has been selected as an "IBM Research Science and Technology accomplishment" for 2002, explores the use of visual information in speech recognition systems. It aims to combine visual cues with audio signals to improve automatic machine recognition of large-vocabulary continuous speech. This combination proves particularly important in acoustically challenging conditions with significant background noise, such as multi-speaker environments, production lines, airport halls, or cars. Audio-Visual Speech Recognition can also help with speech-reading for the hearing/speech impaired.

IBM is using two key UIT technologies - speaker verification and conversational systems - to provide enhanced security for voice-based transactions. Our leading speaker verification technology (ranked first among 25 worldwide participants in the 2002 Speaker Recognition Evaluation organized by the National Institute of Standards and Technology) is combined with user knowledge (i.e., passwords and personal information) elicited through a brief conversation. The combination of the two information sources - called a "Conversational Biometric" - greatly increases the security and reliability of voice-based transactions in a non-intrusive manner and provides a flexible framework for various authentication scenarios that maximizes user convenience. A demonstration of "Conversational Biometrics" can be seen at http://www.research.ibm.com/VIVA_Demo/.
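As a rough illustration of how two information sources might be combined during such a conversation, the sketch below fuses a voice-match score with the fraction of correctly answered knowledge questions. It is not IBM's method: the `Turn` record, the `decide` function, the linear weighting, and the thresholds are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Turn:
    """One question-and-answer exchange in the verification dialog."""
    voice_score: float    # hypothetical speaker-match score in [0, 1]
    answer_correct: bool  # did the spoken answer match the stored knowledge?


def decide(turns: Sequence[Turn], voice_weight: float = 0.5,
           accept_at: float = 0.8, reject_at: float = 0.3) -> str:
    """Return "accept", "reject", or "continue" (ask another question).

    Assumed fusion rule: a weighted sum of the mean voice score and the
    fraction of correct answers. Weights and thresholds are illustrative.
    """
    if not turns:
        return "continue"  # nothing observed yet, keep the dialog going
    voice = sum(t.voice_score for t in turns) / len(turns)
    knowledge = sum(t.answer_correct for t in turns) / len(turns)
    combined = voice_weight * voice + (1.0 - voice_weight) * knowledge
    if combined >= accept_at:
        return "accept"
    if combined <= reject_at:
        return "reject"
    return "continue"
```

With these illustrative settings, a caller whose voice matches and who answers correctly is accepted quickly, while a caller who knows the answers but does not sound like the enrolled user is asked further questions rather than being accepted on knowledge alone.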

Exciting advancements are taking place in the area of text-to-speech (TTS). The goal of the project is to produce computer-generated speech that is indistinguishable from natural speech; our newest system takes a big step in that direction. The TTS system relies on a large database of natural speech, which is automatically divided into small building blocks that are then reassembled to form arbitrary word sequences. During synthesis, the blocks are chosen to minimize a cost function that considers various important aspects of naturalness in speech. Systems have been built in US and UK English as well as Chinese, Japanese, French, German, and Spanish.
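Choosing building blocks to minimize a cost function can be sketched as a shortest-path search over candidate units. The example below is a minimal sketch, not the IBM system: the function name `select_units`, the split into a per-unit `target_cost` and a between-unit `join_cost`, and the use of dynamic programming are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    join_cost: Callable[[str, str], float],
) -> Tuple[List[str], float]:
    """Pick one candidate unit per position, minimizing the summed cost.

    candidates[i] lists the database units that could fill position i.
    target_cost(i, u): how poorly unit u fits position i (hypothetical).
    join_cost(p, u): how audible the seam between p and u is (hypothetical).
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # best[i][j] = (cheapest cost of a path ending at candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev_costs = [
                best[i - 1][k][0] + join_cost(p, u)
                for k, p in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(prev_costs)), key=prev_costs.__getitem__)
            row.append((prev_costs[k_best] + target_cost(i, u), k_best))
        best.append(row)

    # Trace back from the cheapest final unit to recover the sequence.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path: List[str] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total
```

The search visits each pair of adjacent candidates once, so its running time grows linearly with the length of the utterance rather than with the number of possible unit sequences.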

An example of how UIT can enable human-to-human communication is the IBM Speech-To-Speech (Speech Translation) Technology. A speech translation system processes spoken input and translates its content into another language, for example from English to Mandarin Chinese and from Mandarin Chinese to English.

Another important example of translation technology is InfoScope, a handheld device equipped with a digital camera that can take snapshots of text in English, French, German, Spanish, Italian and Chinese and translate the text into another language in a matter of seconds. The device displays characteristics of augmented reality: it presents the real world in the form of a captured image, such as a restaurant sign, and merges it with virtual data by showing a translation as an overlay on the PDA's screen.

The use of UIT to increase productivity is demonstrated in BlueSpace, a next-generation workspace solution encompassing multiple software and hardware components that integrate sensors, actuators, displays and wireless networks into architectural elements. The goal of the space is to increase knowledge workers’ productivity by deterring unwanted interruptions, improving awareness and fluid communication among team members, and providing greater individual comfort through personalized environmental settings.

Combining many aspects of UIT in one device, the Cross-Industry Dashboard for Retail and Healthcare project deserves special mention. This project represents a combination of groundbreaking research in new form-factor multi-modal devices (MetaPad), digital ink using new InkXML standards, speech, and analytic methods. We are developing a mobile wireless device and middleware that enable retail store employees to access store-specific information anywhere, using various input modalities such as a traditional keyboard, bar code scanners, magnetic stripe readers, and handwritten information in the form of digital ink. This technology has applications in various domains, including healthcare.

As part of the IBM Corporate Community Relations (CCR) program, the Web Adaptation project has received a lot of press attention this past year and is currently in use by several organizations serving populations of elderly users and users with disabilities. For 2003, CCR plans to internationalize this project and deploy the software to CCR partners worldwide.

IBM researchers have been, and continue to be, among the worldwide leaders in developing User-Interface Technologies. Being part of IBM gives us rare opportunities to have our research affect both the state-of-the-art and the state-of-the-practice.


 Selected Papers

A. W. Senior, "Tracking with Probabilistic Appearance Model," in Proceedings of the ECCV Workshop on Performance Evaluation of Tracking and Surveillance Systems, 1 June 2002, pp. 48-55.

C. Neti, G. Potamianos, et al., Editorial to the special issue "Joint Audio-Visual Speech Processing," EURASIP Journal on Applied Signal Processing, in press, November 2003.

P. G. Fairweather, J. T. Richards, and V. L. Hanson, "Distributed accessibility control points to help deliver a directly accessible Web," Universal Access and Inclusion in Design: A Special Issue of Universal Access in the Information Society, 2002. DOI 10.1007/s10209-002-0037-3.

G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: An overview," to appear in: Audio-Visual Speech Processing, E. Vatikiotis-Bateson, G. Bailly, and P. Perrier (Eds.), MIT Press, pp. 121-148, 2003.

Lisa Brown and Yingli Tian, "Comparative Study of Coarse Head Pose Estimation," IEEE Workshop on Motion and Video Computing, Dec. 5-6, 2002, Orlando, FL.

Malcolm Slaney, "Image-based Facial Synthesis," to appear in: Audio-Visual Speech Processing, E. Vatikiotis-Bateson, G. Bailly, and P. Perrier (Eds.), MIT Press, pp. 149-161, 2003.

N. K. Ratha, J. H. Connell, and R. M. Bolle, "Secure Fingerprint Authentication," Chapter 11 in Automated Biometrics: Technologies and Systems, David Zhang (Ed.), Kluwer, 2002.

S. Maes, J. Navratil, and U. Chaudhari, "Conversational Speech Biometrics," in E-Commerce Agents: Marketplace Solutions, Security Issues, and Supply and Demand, J. Liu and Y. Ye (Eds.), Springer Verlag, 2001, pp. 166-179.

 Recent Accomplishments

Ruud Bolle and Nalini Ratha - General chair and Co-chair of AUTOID workshop

Ramesh Gopinath - Associate Editor IEEE Transactions on Speech and Audio Processing

David Nahamoo - IEEE Fellow, 11/2001

Chalapathy Neti - Associate Editor IEEE Transactions on Multimedia, 2002

Chalapathy Neti - Member, Multimedia Signal Processing Technical Committee, IEEE Signal Processing Society (2001)

Michael Picheny - IEEE Fellow, 2000

Michael Picheny - Chair, Speech Technical Committee, IEEE Signal Processing Society (2002)

Organizing Activities

InkXML Tutorial at IWFHR 2002
At the last IWFHR (International Workshop on Frontiers in Handwriting Recognition, 2002), Greg Russell of IBM organized a tutorial on InkXML for the UIT community. InkXML is an XML-based standard for representing handwritten data in a portable, interoperable form. It was jointly developed by IBM, Intel, Motorola and the Unipen Foundation (a university-based organization) and has been submitted to the W3C for review.

ACM ASSETS 2002 conference: the 5th International ACM Conference on Assistive Technologies (Conference Chair: Vicki Hanson)

INCITS V2 Standards Committee (Member and co-author of working document: Shari Trewin)



 