Distributed Speech Recognition


Project Summary:

Distributed Speech Recognition (DSR) is an innovative technology that is making speech recognition practical for mobile devices. DSR works by splitting the processing required for speech recognition between the mobile device and network servers, instead of sending the speech data to the server and having all the processing done there. Beginning the processing on the mobile device, or 'front-end', enables the device itself to extract spectral features from the speech. These features are compressed, error protected, and transmitted over the wireless channel to the server, or 'back-end'. Once the compressed features have arrived at the server, the server can then convert the incoming stream of features into text.
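The split described above can be illustrated with a minimal sketch. It is not the DSR standard's front-end: the frame length, the number of bands, the log-energy features, the one-byte uniform quantizer, and all function names are assumptions chosen for brevity, and the CRC-32 only detects channel errors, standing in for real error protection.

```python
import math
import struct
import zlib

FRAME_LEN = 64                  # samples per frame (hypothetical, small for the demo)
NUM_BANDS = 4                   # spectral bands per frame (hypothetical)
LOG_MIN, LOG_MAX = -20.0, 10.0  # assumed clipping range for log energies


def band_log_energies(frame):
    """Front-end: log energy in NUM_BANDS equal-width bands of a naive DFT."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power.append((re * re + im * im) / n)
    width = half // NUM_BANDS
    return [math.log(sum(power[b * width:(b + 1) * width]) + 1e-9)
            for b in range(NUM_BANDS)]


def quantize(value):
    """Compress one feature to a single byte by uniform quantization."""
    clipped = min(max(value, LOG_MIN), LOG_MAX)
    return round((clipped - LOG_MIN) / (LOG_MAX - LOG_MIN) * 255)


def dequantize(code):
    """Map a one-byte code back to an approximate log energy."""
    return LOG_MIN + code / 255 * (LOG_MAX - LOG_MIN)


def encode_packet(samples):
    """Device side: features -> quantized bytes -> payload followed by CRC-32."""
    payload = bytearray()
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        feats = band_log_energies(samples[start:start + FRAME_LEN])
        payload.extend(quantize(f) for f in feats)
    return bytes(payload) + struct.pack(">I", zlib.crc32(bytes(payload)))


def decode_packet(packet):
    """Server side: verify the CRC, then recover per-frame feature vectors."""
    payload, (crc,) = packet[:-4], struct.unpack(">I", packet[-4:])
    if zlib.crc32(payload) != crc:
        raise ValueError("packet corrupted in transit")
    codes = list(payload)
    return [[dequantize(c) for c in codes[i:i + NUM_BANDS]]
            for i in range(0, len(codes), NUM_BANDS)]
```

In this sketch the device sends only `NUM_BANDS` bytes per frame instead of the raw samples, and the server would hand the decoded feature vectors to its recognizer, which is outside the scope of the example.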

DSR technology dramatically improves recognition performance while minimizing the memory and CPU requirements on the device. This is achieved by using a noise-robust front-end and by eliminating the detrimental effects of low bit-rate coding and channel errors that occur when speech itself is sent over a communications channel to the server.

The server can also reconstruct a speech waveform from the pitch and voicing information together with the features themselves, so the user’s speech can be played back at the receiving server. This is extremely useful for voice response services where the content is sensitive (e.g., financial transactions), and for reviewing how live customers interact with voice service applications. Such reviews can help developers tune the grammar and dialog for voice services.

DSR and Telematics in the Automotive Industry
Distributed speech recognition is just now starting to pick up speed in the telecommunications industry as we begin to hear more about speech-enabled services accessed from mobile devices. An example is the automotive industry, where speech can substantially improve services like telematics for car navigation, commanding car devices, and accessing remote information. Cars offer an ideal environment for voice-activated services, where using speech via mobile devices offers the only real option for a driver to safely communicate with the world. DSR offers a new approach for coping with the noisy car environment without compromising recognition accuracy.

Multi-modal Applications
The new extended standard opens the doors to multi-modal applications, which use two or more modalities of input and output, for example, voice and keypad or mouse-clicks. Users can, for instance, fill in a form or choose a menu item either by voice or by clicks. For multi-modal applications to really work, all data must be fully synchronized between the different modalities. In these cases, all the information flows on a data channel, as opposed to the voice channel that is used, for example, for regular phone conversations. Most new communication technologies, including GPRS, EDGE, 3G, and 3GPP, use data channels.

The advantage of DSR is that the voice features are sent on a data channel, not on a voice channel. Using DSR, multi-modal applications can send speech data and application data together on one channel. Using two separate channels for voice and data would require sequential synchronization, where each channel would have to wait for the other to finish sending data.
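As a rough illustration of carrying both kinds of data on one channel, the sketch below interleaves speech feature frames and an application event in a single byte stream, each tagged with a type and a timestamp so the receiver can keep the modalities aligned. The message layout, the `SPEECH` and `APP` tags, and the function names are hypothetical and are not taken from any DSR specification.

```python
import json
import struct

SPEECH, APP = 0, 1                 # hypothetical message-type tags
HEADER = struct.Struct(">BIH")     # type (1 byte), timestamp in ms (4), body length (2)


def pack_message(kind, timestamp_ms, body):
    """Frame one message as a fixed header followed by its body."""
    return HEADER.pack(kind, timestamp_ms, len(body)) + body


def unpack_stream(stream):
    """Split a byte stream back into (kind, timestamp_ms, body) tuples."""
    messages, offset = [], 0
    while offset < len(stream):
        kind, timestamp_ms, length = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        messages.append((kind, timestamp_ms, stream[offset:offset + length]))
        offset += length
    return messages


# Two speech feature frames and one form-field click share the same channel.
stream = b"".join([
    pack_message(SPEECH, 0, bytes([12, 40, 7, 3])),
    pack_message(APP, 5, json.dumps({"click": "city_field"}).encode()),
    pack_message(SPEECH, 10, bytes([14, 38, 9, 2])),
])
received = unpack_stream(stream)
```

Because every message carries its own timestamp, the receiver can order speech and click events on a common timeline without waiting for a second channel.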

(The above DSR project description was sourced from articles posted in the w3research News Archives.)

Publication List