The Speech Technologies group specializes in speech and multimodal signal processing for interaction, analytics, entertainment and security applications. Through advanced research and development, we create algorithms, technology components, solutions, and services that enhance the experience and capabilities offered to enterprises, mobile users, application developers and content developers. The group's expertise covers a wide spectrum of technologies for expressive speech synthesis, voice transformation, speech-based emotion recognition, and multimodal biometrics.

With our vast expertise in speech signal processing, machine learning and deep learning, we provide research support to IBM product teams, and participate in research activities that advance speech science and technology.



IBM Virtual Voice Creator

Text-to-Speech (TTS) synthesis technologies have become increasingly natural-sounding and expressive, opening up new opportunities in domains such as entertainment and education. The IBM Virtual Voice Creator vision is customizable voice generation for game characters, cartoon heroes, and engaging conversational agents. It becomes a reality through high-quality, expressive TTS combined with customizable on-line voice transformations and an interactive web studio for voice design. We’re developing an inexpensive, fast, repeatable, and flexible voice-over process that encompasses static as well as dynamic and AI-generated textual content.


Mobile Multi-Factor Authentication (MMFA)

The mobile environment presents many security and usability challenges. MMFA combines the many sensors and information channels available on mobile devices with our multimodal biometric authentication technology. We’re working to maximize both security and usability according to the situation, risk, and environment. Video authentication combines speaker and face verification for high accuracy (0.1% equal error rate), audiovisual liveness detection against replay attacks, and high usability in a short authentication session: several seconds of selfie video.
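
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one common way to estimate it from verification scores. It is an illustration only, not the MMFA implementation: the function name, the score convention (higher means more likely genuine), and the sample scores are all assumptions.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Estimate the EER from two sets of similarity scores.

    Assumption: higher scores mean "more likely the claimed user".
    Returns (eer, threshold) at the point where the false acceptance
    and false rejection rates are closest to each other.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False rejection rate: genuine attempts scoring below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance rate: impostor attempts scoring at or above it.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0, thresholds[i]

# Made-up scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.6]
impostor_scores = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

An EER of 0.1% would mean that, at the balanced threshold, roughly one in a thousand impostor attempts is accepted and one in a thousand genuine attempts is rejected.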


Ron Hoory, Manager, Speech Technologies, IBM Research - Haifa




State-of-the-art expressive Text-to-Speech (TTS) technology for delivering information and interacting with enterprise customers, as well as for education and entertainment purposes. In addition to actively participating in the development of the TTS service on the IBM Watson Developer Cloud, we’re working on other facets of TTS. Other current research topics include customizing the TTS voice, either by applying on-line voice transformation with user-controlled parameters or by learning the characteristics and speaking style of a target speaker from uploaded samples and automatically generating an appropriate voice transformation. We are also working on controlling the expression, emphasis, and emotion of the synthesized speech while preserving high quality.

Multimodal Biometrics

Advanced multimodal biometrics technology for mobile multi-factor authentication solutions. Our focus is on voice, face, and video authentication with biometric fusion, as well as vocal, visual, and audiovisual liveness detection to protect against spoofing/replay attacks.

Affective Computing

Speech-based Emotion Recognition

Technologies and solutions for speech-based emotion recognition that enable analytics of spoken data as well as affect-aware human-computer interaction. Speech-based emotion recognition combines predictions from the verbal content (the textual transcript) with predictions from the non-verbal content (obtained by direct signal analysis).
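
One simple way to combine a verbal and a non-verbal prediction is late fusion: each channel produces a probability for every emotion, and the two are averaged with a weight. The sketch below illustrates that idea only. The text above does not say how the group combines the two channels, so the weighted-average rule, the label set, the function name, and the sample probabilities are all assumptions.

```python
import numpy as np

# Hypothetical label set, chosen for illustration.
EMOTIONS = ["neutral", "happy", "angry", "sad"]

def fuse_predictions(p_text, p_audio, text_weight=0.5):
    """Late fusion of two per-emotion probability vectors.

    p_text:  probabilities predicted from the transcript (verbal content).
    p_audio: probabilities predicted from the signal (non-verbal content).
    text_weight: assumed trust in the transcript channel, between 0 and 1.
    Returns (predicted_label, fused_probabilities).
    """
    p_text = np.asarray(p_text, dtype=float)
    p_audio = np.asarray(p_audio, dtype=float)
    fused = text_weight * p_text + (1.0 - text_weight) * p_audio
    fused = fused / fused.sum()  # renormalize to guard against rounding
    return EMOTIONS[int(np.argmax(fused))], fused

# Made-up example: the words look neutral, but the voice sounds angry.
label, probs = fuse_predictions([0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1])
print(label, probs)
```

In this made-up case the transcript alone would suggest "neutral", while the fused estimate favors "angry", which shows why the non-verbal channel can matter when the wording is unremarkable.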