|  |
 |
Table of contents:
|  | HTML |  | PDF |
This article:
|  |
HTML
|  | PDF | DOI: 10.1147/rd.515.0583 | Copyright info |  |
 |
 |
Speech recognition systems on the Cell Broadband Engine processor
|  |  |
by Y. Liu, H. Jones, S. Vaidya, M. Perrone, B. Tydlitát, and A. K. Nanda
|
|
|  |
 |  |  |
|
| |
|
Speech recognition has already been successfully integrated into many application areas and commercial products. Consider, for example, the Honda Acura** TL navigational system that responds to verbal queries, the Palm OS** 5 Voice Command recognition software for personal digital assistants (PDAs), the Motorola Bluetooth** Car Kit that includes voice recognition and automatic dial, or the Genesta speech-controlled portable computer. These products demonstrate that speech recognition at interactive rates is viable even within the limited processing capabilities and resources of portable and embedded devices. However, many other applications require speech processing beyond interactive rates. Speech recognition systems in telephony applications for automated call centers represent the largest segment of the speech processing market; these centers receive and must process thousands of telephone conversations. Similarly, in areas of data mining, such as intelligence and surveillance, there is also a growing interest in applying speech recognition to both online compressed speech channels and repositories of archival speech.
These systems must process many channels of speech at real-time rates and are generally constructed from clusters of processors based on commodity CPUs (central processing units). The number of nodes in such a cluster scales commensurately with the amount of speech traffic the system is expected to process. With the current generation of processors, each node can manage roughly 20 to 30 speech channels in real time, and cluster sizes range from tens to thousands of nodes. System performance can also be scaled by incorporating more powerful processors. This is perhaps a more viable approach since recent trends show that vector-streaming architectures, such as that of the Cell Broadband Engine† (Cell/B.E.) processor, exhibit a better cost–performance ratio than traditional computer architectures for a variety of data-parallel applications. Implementing a speech system on the Cell/B.E. processor, however, requires more effort than simply porting legacy source codes and then expecting automatic hardware acceleration to result only from compiler optimizations and special hand-tuned math libraries. Individual algorithms must be profiled and reformulated to explicitly expose areas of data parallelism amenable for a streaming and vector implementation. This is the approach we took in designing a prototype speech recognition engine on the Cell/B.E. processor. The results we observed are very surprising and encouraging: Our system performs roughly two orders of magnitude faster than existing speech systems.
| |
|
The Cell/B.E. processor is a new streaming heterogeneous multiprocessor architecture jointly designed by Sony, Toshiba, and IBM. This architecture is heterogeneous in the sense that it combines a general-purpose IBM PowerPC* processor element (PPE) with several special-purpose vector processing cores, called synergistic processor elements (SPEs). Each core executes on an independent instruction stream. The Cell/B.E. processor also supports data streaming by providing explicit user management over the data communication via direct memory access (DMA) transfers between the PPE main memory and the local store memories of the SPEs. Memory transactions can be interleaved with instruction execution, allowing their transfer latencies to be partially or completely concealed to improve pipeline efficiency.
This design provides the Cell/B.E. processor with several interesting advantages over traditional processors. Many data-parallel tasks can be structured to expose single-instruction multiple-data (SIMD) parallelism, predictable memory access patterns, and data-independent processing. These parallel tasks generally execute much faster on the SPE processors than on the PPE processor. SIMD computations map directly to vector instructions, predictable memory access patterns allow prefetching of data elements, and data-independent processing enables simplification of the vector execution pipeline (eliminating the need for complex branch-prediction strategies). Furthermore, whereas traditional processors employ caches to exploit data coherency, the Cell/B.E. processor allows users to directly program the memory hierarchy and implement their own application-specific data caching policies. Streaming applications with completely predictable memory access benefit the most from user-managed caches and, when implemented correctly, can experience 100% cache hit performance. For further information on the Cell/B.E. Architecture and its programming models, please refer to References [1] and [2].
| |
|
Early analysts segmented speech signals into small windowed intervals and annotated them by phonemes (linguistically distinct speech sounds). This classification is possible because a speech signal looks roughly like a sequence of stationary waveforms. Analysts look at the waveforms and spectrogram plots and distinguish phonemes by examining their spectral characteristics (e.g., format frequencies) (Figure 1). Today, this analysis is completely automated by digital signal processing and pattern-matching algorithms.
Figure 1
Speech recognition systems generally consist of three components: feature extraction, pattern matching, and model training. These components work together to recognize the information being communicated by verbal speech. In a real-time system, the first two components require special optimization since this system has the constraint that speech channels must be processed at line rates using a fixed amount of memory. Optimizations for model training are less important since the models need to be trained only once prior to any recognition activity. Efforts to optimize this step, however, are still worthwhile because training is an iterative process and can be computationally expensive. We limit our discussion here to only the feature extraction and pattern-matching components.
| |
|
The feature extraction front end takes a windowed speech frame from the speech audio waveform and from it derives a compact feature vector representation that captures important spectral and temporal properties. The most common features used by speech systems are the mel frequency cepstral coefficients (MFCC), which are based on the Fourier spectrum of the audio signal, mapped to a nonlinear frequency scale that roughly corresponds to the human perception of sounds. The first and second derivatives of this spectrum are also considered to measure the rate at which sounds change. The mean energy is subtracted and the variance is normalized to remove the channel transfer function.
There are 12 stages of processing:
- Window frame extraction.
- Mean subtraction.
- Energy computation.
- Preemphasis filtering.
- Hamming window filtering.
- Spectrum computation (using fast Fourier transform [FFT]).
- Mel frequency scale mapping.
- Cepstrum computation.
- Decorrelation (using discrete Fourier transform).
- Cepstral filtering.
- Cepstrum energy normalization.
- First- and second-order derivatives.
The first processing stage starts with a windowed frame of 200 samples (25 ms of audio at 8 kHz) and the final result is a 39-component feature vector (12 MFCC, 1 energy, 13 first derivative, 13 second derivative). This processing is uniformly applied to overlapping frames (10 ms of overlap) in the speech signal to produce a sequence of MFCC feature vectors.
| |
|
Under this representation, new speech samples can be compared with reference samples by discovering and quantifying common patterns in their feature vector sequences. This is a test for similarity rather than equality, since speech samples are not expected to match exactly. Matches are scored using hidden Markov models (HMMs), which statistically summarize patterns over a reference set. The purpose of the pattern-matching component is to then evaluate or decode new speech samples by comparing them against a set of HMMs. HMMs
An HMM [3–5] models a stochastic temporal process with parameters that are not directly observable (hence, hidden), but that can be inferred only from the set of observation sequences that it generates (here, the observations are MFCC feature vectors). HMMs are graphically represented by a set of nodes and directed edges. The nodes represent states and edges represent transitions between states. Observation sequences are generated by paths between the start node and the end nodes. Start and end states are special states that do not generate observations. All other states generate observations whenever they are visited according to their probability density functions (PDFs). State transitions also occur probabilistically.
An HMM learns patterns over reference examples by assigning state PDFs and transition probabilities that maximize the probabilities of their sequence output, while also accommodating the variability of individual feature sequences. For example, constructing an HMM to recognize the word “one” requires several verbal samples of this word by different speakers. The pronunciation of this word could vary from speaker to speaker, and even the same speaker cannot exactly reproduce the same sounds twice. However, these pronunciations share common spectral and temporal patterns that are captured by the HMM through selectively strengthening paths and feature distributions in the network during the training process. Although HMMs cannot be explicitly trained using negative examples, discrimination is possible by comparing probabilities across all other models.
An HMM is scored against a new speech sample by evaluating paths through the HMM network. Multiple paths could generate the same feature sequence, so the likelihood for an HMM matching the feature sequence is given by the total of all possible path probabilities. This requires an exhaustive search through the network, which can be efficiently computed using the Viterbi algorithm, explained in the next section. Viterbi algorithm
A direct search of all paths in the network is not feasible computationally, so the Viterbi algorithm applies recursion to cache the intermediate path probabilities. This recursion can be efficiently implemented using dynamic programming. For each feature vector frame fi, the algorithm examines each HMM state sj and computes its emission probability p(f = fi|sj) by evaluating the feature vector against the PDF of state sj. All possible transitions into this state are then examined. Probabilities from previous states that transition into this one are multiplied with their respective transitional probabilities p(s = sj|sk) and are then summed. This result is multiplied with the emission probability of state sj to give the total probability Li,j of all intermediate paths between the start state and the current state that generate feature vector frames up to fi. The initial conditions are set such that path computations begin at the start state for the first feature vector frame.
The recurrence is given by
 | (1a) |
with the initial values set to
| L0,0 = 1, L0,j = 0 | ∀j > 0. | (1b) |
Since the system obeys stochastic constraints, all path probabilities sum to unity. This means that probabilities of individual paths can be quite small. Therefore, it is useful to express these probabilities on a log scale. However, it is very expensive to add two numbers together in the log scale, i.e., computing log(a + b) directly from log(a) and log(b). To simplify matters, the maximal path is generally a good approximation to the summation of all possible paths. Using this approximation, Equations (1a) and (1b) can be approximated by
| L’i,j = log(p(f = fi | sj)) + maxk(L’i−1,k + log(p(s = sj | sk))), | (2a) |
with the initial conditions
L’0,0 = 0, L’0,j = − | ∀j > 0. | (2b) |
The goal of this computation is to evaluate Lm,n, the probability that the feature sequence was generated by a path through the HMM. The approximation for the term L’m,n is called the Viterbi probability and is computed recursively using Equations (2a) and (2b). Strictly speaking, L’m,n is not a probability, but a likelihood. This likelihood value is sufficient for recognizing speech from samples that contain exactly one word unit (called the isolated digit recognition problem). However, in most practical recognition systems, the speech channels contain multiple words, and decoding from these channels (called the connected digit recognition problem) requires an additional trace-back step after computing L’m,n to recover the maximal path through the HMM network and to identify the actual sequence of decoded words encountered along this path. Supporting this trace-back step requires that bookkeeping information, such as back-pointers and model labels, be maintained along with intermediate path likelihoods during the recursion. The Viterbi algorithm decodes HMMs against isolated digits, but recognizing connected digits requires searching hypothetical paths that pass through multiple HMMs. This search can also be organized efficiently using dynamic programming to extend the basic Viterbi algorithm. Such an approach, called the level-building algorithm (LBA), is discussed in the next section.
| |
|
Constructing a speech recognition system requires modeling at two levels. At the highest level, the vocabulary and grammar that govern their syntactic use must first be decided. Figure 2 shows an example of the simple task of recognizing sequences of numbers. The vocabulary consists of the numbers “one” through “nine” and “oh” (meaning “zero”). The grammar allows any arrangement of digits in the sequence. A special silence model is also included to account for periods of silence (or background noise) between each utterance of a number. The set of numbers and the silence model are modeled by HMMs. The type of HMM (e.g., number of states and the allowable transitions between them) most commonly used in speech recognition is called the left–right HMM. Here, the number of states roughly corresponds to the duration of the utterance, and the states are connected and arranged sequentially so that transitions occur only monotonically from left to right; that is, each state allows only self-transitions and forward transitions. The PDF for each HMM state is generally modeled by a set of Gaussian functions over the feature vectors. This representation is called the Gaussian mixture model (GMM). The parameters of a GMM include the Gaussian means and covariances (the feature vectors are decorrelated so the covariance matrices are diagonal), as well as weights for each Gaussian. Gaussian functions are commonly shared across multiple GMMs to reduce the model complexity, a technique called Gaussian parameter tying.
Figure 2
Decoding isolated digits amounts to evaluating the Viterbi probability of a speech sample against several HMM word models and selecting the best. Decoding connected digits is more challenging because the speech sample contains several words and the word boundaries are unknown. The LBA solves this problem by evaluating multiple hypothetical intervals within the speech sample. The computation is organized into levels, each of which corresponds to a single digit decode. The process begins by initializing all HMMs to decode starting at the first speech frame. The location of the ending frame for the first digit is unknown, so each HMM evaluates all speech frames thereafter as potential candidates for the last frame of the first digit. In practice though, only a small interval past the first speech frame is searched since the word utterance is not expected to span the entire speech sample. The second level then evaluates each of the ending frames from the first level as a possible starting frame for the second digit, and this process proceeds until all speech frames are evaluated. During the course of the decode process, word transition probabilities (e.g., bigrams or trigrams) can be applied to enforce a local syntax. Back-pointers are also kept to support the trace-back step, in which we work backwards from the last speech frame to recover all of the word-level transitions that were made.
In the LBA just described, all possibilities are explored; it has an exponential computational complexity but captures the idea of decoding connected digits. In real-time systems, the amount of processing must be directly proportional to the size of the speech sample, and the amount of storage must be constant. Therefore, the decoding must occur synchronously with each speech frame, and only a small word transition history can be kept. For details about this approach, please refer to the frame-synchronous level building (FSLB) algorithm by Lee and Rabiner [6].
| |
|
Our speech recognition system on the Cell/B.E. processor is implemented by three SPE kernel programs: spe_extract, spe_computeobs, and spe_viterbi. The PPE processor is responsible for initializing and loading data into the SPE kernels, invoking the SPE kernels, and performing the final scoring. In the future, we will implement a scheduler on the PPE to analyze and distribute load across the SPE processors. The feature extraction front end is implemented by spe_extract while the decoder is factored into two SPE programs: spe_computeobs and spe_viterbi. Our system processes a speech channel by calling each of the SPE kernels in sequence. Intermediate data is streamed between the SPE local store and PPE memory during successive SPE calls. The final scoring lattice from spe_viterbi is traversed by an FSLB implementation on the PPE to perform a trace-back step and recover the decoded text.
| |
|
The design of the feature extraction is based on the pipeline from the Mississippi State Institute for Signal and Information Processing (MS ISIP) speech recognition toolkit [7], with the stages listed in the section on feature extraction. All of these steps are implemented within the resources of a single SPE program. The mean subtraction and energy computation across the speech window requires the summation of elements in the window. The computation for this sum is vectorized by laying out data elements as an array of 128-bit (four-component) SIMD vectors, and then performing the sum across the vectors. Elements in the resulting array are then combined by dot product with a ones vector. The spectrum computation step is considered the pivot or core of the pipeline, as it is the algorithm with the highest computational cost. Fortunately the Cell/B.E. Software Development Kit library contains an extremely efficient FFT algorithm [8], which we judiciously apply. We profiled the FFT performance and determined that it completes eight FFTs in 3,800 cycles, which roughly accounts for 69 Gflops of computation and represents 34% efficiency on the Cell/B.E. processor. Data vectorization occurs along the axis of a speech window frame; each block of four sequential data elements in the window is processed concurrently using vector instructions. However, the FFT routine expects a complex signal input in a format that interleaves real and imaginary components.
Accommodating this data layout incurs only a small performance penalty to perform data interleaving and de-interleaving when moving data into and out of the FFT routine. Many of these stages require precomputed lookup tables. For example, FFT requires a table of twiddle factors to be precomputed for one of its parameters. Likewise, the discrete cosine transform (DCT) step, which decorrelates the MFCC vectors (to allow diagonal covariance matrices), and the various filtering operations also take advantage of precomputed factors. Using table lookups helps in both computation and accuracy as constant data terms can be computed only once and in higher precision. The first and second MFCC derivatives are computed by central differencing. Supporting this computation requires a short queue of MFCC frames to be maintained.
| |
|
Evaluating the emission probability of generating a particular feature vector at a particular HMM state is independent of the network search, and this computation can be factored easily from the main decode. This is actually necessary for recognition tasks of higher complexity since the parameters for the GMMs occupy a significant amount of memory. Currently, we use four Gaussian functions per mixture (one GMM per state), with a total of 57 HMM states. These parameters use roughly 71 KB of memory, which is almost a quarter of the SPE local store memory. Future implementations may require the parameter to be streamed into the local store. The end result of this kernel computation is a two-dimensional table (feature vector frames by HMM states) of emission probabilities. Since probabilities are expressed in log space, vectorizing this computation is very straightforward:
 | (3) |
The first term on the right-hand side of Equation (3) can be precomputed, and the second term can be computed by a dot product, since ∑−1 is diagonal. The dot-product computation is vectorized by first multiplying the three vectors componentwise using the 128-bit SIMD registers and then aggregating the result into a single value by summing four components at a time across the vector. Evaluating the emission probabilities is the most arithmetic-intensive step and represents the bottleneck of our system.
| |
|
The decoding process occurs frame synchronously by permitting the predecessor state for a model start state to come from the end state of any HMM (as allowed by the language grammar). This computation is vectorized along the HMM state axis by concatenating HMMs together and setting transitional probabilities across model boundaries to zero. The recurrence relations given in Equations (2a) and (2b) are processed using vector instructions for each block of four states. The layout for this computation is shown in Figure 3. Shaded groups of four elements in each column are stored in 128-bit registers and are operated on by SIMD vector instructions. Evaluating the path probabilities requires only two columns of data to be stored at any given time. In addition to path probabilities, other bookkeeping information is kept to support the trace-back process. Since all HMMs in our experiment are strictly first-order left–right, data access to previous state probabilities is aligned by shifting the state column down by one state. Decoding proceeds by seeding the start states of each model with an initial probability and then streaming the emission probability table in and streaming the intermediate path probabilities out to main memory.
Figure 3
| |
|
To profile the performance of our recognizer, we set up a simple experiment to perform speaker-independent speech recognition of phonemes from a digit vocabulary based on the TIDIGITS corpus. This vocabulary includes the utterances “zero,” “one,” …, “nine,” and “oh” by speakers of different gender and dialects, which are altogether modeled by 19 phonemes (each composed of a three-state HMM with four Gaussian functions per HMM state). We used the MS ISIP speech decoder to provide model training and establish the baseline decoder performance. The platforms we used for testing are shown in Table 1.
|
| Table 1 Speech recognition performance. |
|
|
|
|
|
| Platform | Processor | RTCs |
|
| SIM | 4.0-GHz Cell/B.E. processor simulator | 1,759 |
| Cell/B.E. processor | 2.4-GHz Cell/B.E. processor hardware | 1,216 |
| Sony PLAYSTATION†3 system | 3.2-GHz PLAYSTATION3 system (six SPEs) | 526 |
| Central processing unit | 3.2-GHz Intel Pentium** 4 | 10 |
|
The performance is measured by timing the latency to process a single-channel speech sample on a single SPE, and then extrapolating to the total number of physical SPEs available. This helps to estimate peak system performance under perfect load balancing and task scheduling. Table 1 also summarizes the speech recognition performance for these platforms. Units are measured in real-time channels (RTCs), where 1 RTC = 1 second of audio per 1 second of processing time.
On both the Cell/B.E. processor and the software platforms, recognition accuracy was 99%, which is to be expected for such a simple recognition task. Since recognition accuracy depends only on the training and language modeling, the performance of our prototype speech recognition engine on the Cell/B.E. processor can be extended to production systems because the SPE kernel programs were designed to scale with model and language complexity.
| |
|
We have implemented and demonstrated a prototype speech recognition engine that is capable of processing approximately 1,000 speech channels on a single Cell/B.E. processor. The kernel computations are designed to be highly scalable, and we expect this performance result to generalize well to commercial speech systems. We attribute the performance gains in our system mainly to the raw computational power and memory management of the Cell/B.E. processor. We harness these resources by carefully choosing data layouts and reformulating algorithms to expose data parallelism and streaming opportunities.
Although the performance we measured pertains to only a simple digit recognition problem with a small vocabulary, the relative performance between CPU and Cell/B.E. processor–based systems is important to note. Speech recognition systems that incorporate very large vocabularies, complex grammar, and detailed GMMs decode channels at rates far below real time. Implementing these systems on the Cell/B.E. processor allows the channel density to be scaled upward while handling more complex tasks and resulting in higher recognition quality.
| |
|
Having implemented the core algorithms in a basic speech recognition system, we identify three areas in which we can focus our future development efforts: language modeling, compressed speech, and speech activity detection.
| |
|
Toward the longer-range goal of developing a production speech system, we plan to apply tools from the HMM Toolkit 3.3 (HTK 3.3) framework [9] to train HMMs and language models to recognize speech from more complex and challenging sources. The simple experiment we conducted for this study did not include any language modeling; any digit can follow any other digit, and we made no attempt to construct actual words from the sequence of decoded phonemes. The HTK is a collection of software utilities and tools to train, decode, and evaluate HMMs. We plan to start with the TIMIT corpus [10], which contains conversations from a finite dictionary and strictly follows a language grammar. After constructing and training the appropriate models, we will integrate them into our existing Cell/B.E. speech recognition system.
| |
|
The amount of speech traffic being transmitted over digital networks (e.g., voice over Internet Protocol) is rapidly outpacing our ability to efficiently process it. To communicate over a digital network, speech samples are first encoded by a lossy compression protocol. This compression step allows speech to be represented using a very low (fixed) bit rate, which increases the channel density given a target bandwidth while unfortunately introducing significant noise and degradation to the original audio signals. Qualifying and quantifying how compression artifacts affect recognition accuracy is an interesting area of study. Furthermore, recent techniques have been proposed to derive MFCC features from the speech-encoding parameters and use the encoding parameters directly as a feature set. We expect to test both approaches and compare their results with a third approach, which is to compress and decompress audio samples (to artificially add compression noise) and apply speech recognition to establish a baseline. After establishing the best approach to computing features on compressed speech, we will integrate it into our existing feature-extraction pipeline on the Cell/B.E. processor.
| |
|
Speech channels often contain long periods of no speech. Removing these segments not only helps cull computation, but also improves recognition performance since speaker normalization is intended to be performed over voice activity. Identifying and annotating intervals of speech activity in voice channels is a binary classification problem; we are trying to classify speech from background. Therefore, models for both are required. We plan to investigate approaches using linear classifiers, such as support vector machines, single-state HMMs, GMMs, or a hybrid combination of these approaches. We plan to integrate the best result in our Cell/B.E. processor speech recognition pipeline.
*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of Honda Motor Company Ltd., Palm, Inc., Bluetooth SIG, Inc., or Intel Corporation in the United States, other countries, or both.
†Cell Broadband Engine and PLAYSTATION are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both.
| |
|
Received March 15, 2007; accepted for publication April 3, 2007; Published online August 11, 2007.
|
|