1. Introduction
The standardization of MPEG-2 [1] has greatly facilitated the transmission, representation, storage, and manipulation of digital video in various environments such as broadcast television, wireless communications, consumer electronics, and multimedia computers. Further, applications ranging from desktop publishing on personal computers (PCs) to home authoring with digital video disc (DVD) players demonstrate the major role played by the MPEG-2 standard in promoting new storage media for consumer video and interactive multimedia. This has enabled the home user to download video streams from a satellite system or an Internet site in order to create DVD video programs or multimedia presentations using a recordable medium. Although the syntax and specification of MPEG-2 bitstreams and multimedia programs (as in DVD titles) are well defined by the international standards, the actual encoding parameters and functions, which should lead to a fully compliant bitstream, have been and are the subject of many research efforts. The main challenge is how to achieve a close-to-optimum video quality in a compressed stream while reducing the amount of information in the source. This is because the statistical nature of any video source is either not known a priori or will change over time, and a true estimation of source distribution can be anywhere from computationally extensive, as in non-real-time multipass encoding, to almost impossible for real-time encoding.
The nonstationary nature of images makes them inherently variable; compression optimality for MPEG-2 coder-decoders (codecs) is achieved by carefully selecting a set of spatial or temporal image analyzers, quantizers, and variable-length entropy coders, of which some are frozen by the standard and some are to be defined by the designer. The results are variable1 streams which require a sophisticated buffering scheme to smooth out the variability of the signal before transmission over a fixed bandwidth is carried out. The receiver will have a similar buffering policy to convert the fixed-channel rate to variable streams prior to the decoding and display of each picture.
Since design and development of an MPEG-2 video encoder can become cumbersome as a result of formulating several mathematically or perceptually derived parameters, typical approaches [2] enforce a constant bit rate (CBR) for a group of pictures (GOP) regardless of the complexity of the video interval. This scheme assumes equal weighting of bit distribution among GOPs and reduces the degree of freedom of the encoding task. In short, the problem is reduced to minimizing (maximizing) the GOP distortion (quality) subject to a constant target rate. By CBR we mean that the sustainable rate of the encoded video stream per GOP is close to a constant target rate, but the instantaneous rate changes per picture depending on picture type2 or the quantization scaler. Another advantage of a CBR stream is that the transmitted signal may be terminated at any time and the user is assured of maintaining a rate close to the target rate. All CBR MPEG-2 encoders enforce different quantizing scalers for each picture type to achieve good-quality streams within a GOP. This method of compression works adequately when the complexity of the source varies slowly over time and therefore the encoding algorithm has time to adjust itself. However, if the statistical features of the source change rapidly over time, a constant-bit-rate operation may result in good picture quality for a short time window (e.g., a few frames or a GOP) and discontinuous quality when the whole video is perceived.
One way to improve the perceptual quality of a CBR stream while maintaining its constant rate from start to finish is to identify "difficult-to-encode" pictures and increase their bit budget accordingly. Conventional approaches to real-time CBR encoding use picture-to-picture3 correlations in terms of complexity measures to predict the level of difficulty of a picture. In this paper we improve upon this first-order prediction by estimating the encoding difficulty of a picture on the basis of the complexity of all previously encoded pictures. Our single-pass CBR algorithm employs an infinite impulse response (IIR) filter to dynamically determine a nominal value which represents the degree of difficulty of the partially analyzed video stream in real time. We claim that the number of bits consumed by a more (or less) "difficult- to-encode" picture can be adjusted by comparing the encoding difficulty of the substream against a local measurement. This method of real-time compression improves the overall quality of the video and maintains a constant rate throughout the stream. Therefore, it can potentially become a key encoding element in cable television (CATV), direct satellite, and terrestrial broadcast arenas as well as mobile and asynchronous transfer mode (ATM) communications.
Since we argued that the video is inherently variable, an even better compressed stream can be created by employing a variable-bit-rate (VBR) encoder algorithm. Applications for such a scheme are plentiful. VBR can be used for networks that employ a dynamic bandwidth, as in ATMs, or it can be exploited as a means of achieving statistical multiplexing for digital broadcast satellites. Other major arenas which can benefit from the use of a VBR encoder are consumer video and interactive multimedia, where recording of high-quality pictures onto a storage medium is desired. For the aforementioned environments, DVD video movies are created to be played on set-top players, while DVD-RAM drives are used for multimedia productions in computers. Some examples of end-to-end solutions of the above industry segments are shown in Figure 1.
Figure 1
Some attempts at formulating a VBR video coding scheme have appeared in the literature. Most of them propose constraining the quantizer-scale parameter to a user-defined target value for long periods of time [3]. This value is adjusted via network negotiation or monitoring the system buffer to match quality of service (QoS). Although simple intuition suggests that fixing the quantizer scaler would redistribute the amount of bits among GOPs of differing complexity, there are no guarantees of obtaining a constant video quality. Further, in ATM applications an encoded stream is usually queued before a bandwidth is available in the network. This lead time enables pre-encoding tasks to be performed on particular streams subject to the overall channel bandwidth and buffer fullness. Others have investigated a simple variable-bandwidth estimation model for the number of ATM cells generated through packetization of video sources [4]. In this allocation scheme, a fixed quantizer-scale parameter is again used by the encoder. Authors in [5] have presented a VBR video encoder which takes advantage of the limits of the human observer to improve the perceived quality of the decoded sequence while maintaining the output bit rate within permitted bounds.
For consumer video, a popular mode of encoding operation has been the multipass VBR scheme. For example, high-end DVD mastering can be accomplished with an encoding system which typically uses a three-step procedure. The first step encodes the video source in a CBR mode and gathers a set of predefined statistical features. This information is then used to compute a set of optimized quantization parameters which closely match the source distribution of the video data and would provide a better compressed stream during a second-pass encoding. The last step is a postprocessing task which creates the final DVD format. The intermediate procedure can be carried out several times to produce an ideal video program. This type of application has the advantage that many lookahead parameters are known a priori when the VBR video coding algorithm is implemented. Unfortunately, for applications such as home DVD productions, writable DVDs for PC, digital camcorders, and DVCRs, the availability of a platform capable of analyzing and performing a multipass VBR encoding is unrealistic. Therefore, a VBR scheme is desired which is superior to CBR and can be implemented in one pass. This paper is intended to introduce such an algorithm.
In the first part of the paper we propose a new single-pass CBR encoder algorithm which is tailor-made for real-time compression and can easily be realized with the IBM encoder architecture. We add a further level of complexity to the CBR encoder and introduce a new real-time single-pass VBR encoder in the second part of the paper. Our VBR scheme employs a causal predictive model to distinguish the "hardness" or "softness" of the incoming video material on the fly and adapts itself accordingly. Moreover, it relies on a perceptual model to improve the quality of the stressful segments of the stream and produce high-fidelity video. The perceptual model is also responsible for adjustment of the average rate of the stream. The rest of the paper is organized as follows. Section 2 gives some background information for CBR and VBR algorithms. The new CBR encoding technique is presented in Section 3, and the VBR rate-control algorithm is defined in Section 4. Simulation results are given in Section 5, and concluding remarks are provided in Section 6.
2. Background formulation
An MPEG-2 sequence is typically partitioned into small intervals called GOPs (groups of pictures), which in turn are categorized by picture types I (intracoded or intrapicture), P (predicted), and B (bidirectionally predicted) [1]. The number of bits per GOP is distributed such that the allocation for an I-picture is more than that for a P-picture. This is because a P-picture uses a motion-estimation (ME) technique to estimate its content; as a result, a motion-compensated frame difference (MCFD) with a lower entropy than the original source is encoded. B-pictures use the smallest number of bits because their ME techniques are more intensive than those for P-pictures. This method provides a basis for maintaining the same picture quality within a GOP when pictures of different types are encoded. We may further lower the bit allocation of B-pictures, since they will not be used to estimate other pictures. Each picture type is subdivided into square blocks of pixels called macroblocks. Since producing an efficient compressed stream is the main mission of an MPEG-2 encoder, devising a robust rate-control (RC) algorithm becomes an integral part of the encoding task. The rate-control algorithm monitors the number of bits that should be allocated to each picture or macroblock on the basis of type or image feature, respectively. Moreover, it should ensure that the decoder buffer does not experience an overflow or underflow during the time the stream is received from the communication channel and prepared for decoding. In short, there is a need for the following items in any type of MPEG-2 compression scheme:
- Target bits (or quantization scalers) for picture types.
-
Buffer regulation to avoid overflow/underflow conditions.
-
Maintenance of a target rate or consumption of no more than the bit budget.
-
A rate-control strategy which ensures that all of the above are monitored/satisfied.
In the remainder of this section, we build a framework for a CBR compression scheme, since most of these ideas are needed for the new single-pass CBR and VBR approaches. We assume that pictures of a video sequence can be modeled as a memoryless Gaussian source with variance 2 varying from picture to picture; hence, their rate-distortion (R-D) relationship is defined by R(D) = (1/2) log ( 2/D). Experimental results in the literature suggest that a similar behavior exists between the rate R of the source and the quantization factor Q [6],
Instead of using the logarithmic model of (1), we adopt a simpler hyperbolic relationship which is easily implementable in real time and has proved to be an effective realization of the R-Q model [2]. The simplified equation takes the form
|
R(Q) =
|
| |
(2)
|
|
.
|
|
Q.
|
Equation (2) indicates that the picture rate is inversely proportional to the quantization factor. is a predefined measure of complexity for each picture type. However, since the nature of the source changes over time, a new complexity measure is required prior to encoding of each picture type. This parameter is usually computed using the past encoding parameters, e.g., bits, quantization factor, and/or some lookahead statistics. For each GOP4 of the MPEG-2 stream, we enforce a total number of bits given by C ,
|
NxRx = C x = I, P, B.
|
(3)
|
|
x
|
Index denotes the GOP number, and x is the picture type. Nx is the number of pictures of type x in a GOP, and Rx is the number of target bits for picture type x. For a CBR sequence we have C = Cgop, where Cgop is a fixed number of GOP bits. For a given C , the video quality of the GOP is maximized by minimizing the average sum of the quantization scalers subject to the condition of Equation (3),
Instead of minimizing subject to the constraint of (3), we remove this condition and use the Lagrange multiplier to minimize the Lagrangian cost :
With the aid of the rate-quantization model described in (2), the target for each picture type is deduced:
Targets in Equation (6) represent only ideal picture bits; the actual bits would almost always deviate from this. The accumulated error must be computed and fed back to the rate-control algorithm to ensure that the final MPEG-2 bitstream meets the average bit rate or the total bit budget. Let C ,ideal and C ,actual represent the ideal and actual bits for GOP , respectively, and let  ,gop = C ,actual C ,ideal be the difference between the two. Further, let Ri,ideal and Ri,actual represent the ideal and actual bits for picture i, respectively, and i,pic = Ri,actual Ri,ideal be the difference between the two. After n pictures have been encoded, the total accumulated error can be computed as
|
ng-1
|
 ,gop +
|
n-ngG-1
|
i,pic = n-1,gop + n-1,pic.
|
(7)
|
|
|
=0
|
i-0
|
The size of the GOP is given by G = x Nx, and ng = n/G is the number of fully encoded GOPs. Suberror accumulation for all processed GOPs is given by n-1,gop, while n-1,pic is the suberror accumulation for the last but not yet finished GOP in the encoding order. The ideal picture target can now be adjusted for overproduction or underproduction of bits, computed from previously encoded pictures. The new set of ideal bits prior to encoding picture n belonging to GOP ( = ng) is
|
Rxn,ideal =
|
xn-1(C 1 n-1,gop 2 n-1,pic)
| |
(8)
|
|
,
|
x xn-1Nx ,
|
where 1 and 2 are constants that indicate how aggressively this adjustment is carried out. For a CBR scheme it is recommended to use 1 = 2 = with n-1 = n-1,gop + n-1,pic, and compute the adjusted picture bits as
|
Rxn,ideal =
|
xn-1(Cgop- n-1)
| |
(9)
|
|
.
|
x xn-1Nx .
|
The rate-quantization framework described so far is based on the assumptions that the decoder buffer is of infinite size and that a large enough number of bits is always available. However, this is not true in a real-life scenario, where the buffer size is limited and defined by the MPEG-2 standard [1]. The encoding scheme is responsible for eliminating any overflow or underflow condition that the decoder buffer may encounter. This is accomplished by examining a hypothetical decoder buffer, i.e., video buffer verifier (VBV), and computing lower/upper bounds on the number of bits assigned for a picture type. The lower bound should be a large enough nonnegative number to prevent the overflow condition, while the upper bound should not be larger than a predetermined value above which the VBV buffer will underflow. Let Bn and B*n be the decoder buffer fullness before and after picture n is removed, respectively. B*n is then computed as
|
B*n = Bn Rn,actual ,
|
(10)
|
and the buffer fullness before removing the next picture is
|
Bn+1 = B*n + RchTn ,
|
(11)
|
where Rch is the channel rate (in Mb/s) at which the decoder buffer is being filled, and Tn is the display period for picture n. Figure 2 shows how the occupancy of a VBV buffer changes over time. Before we remove picture n, we have all the information in the buffer available to us; therefore, the upper bound Un becomes the buffer occupancy. To compute the lower bound Ln, suppose Rn,actual is just that, and picture n has been removed at once and delivered for display. The decoder buffer is filled at the rate Rch during this period, and the VBV buffer fullness before removing picture (n + 1) would have to be smaller than the total size of the video buffer verifier, i.e., Bvbv, in order to prevent overflow. Therefore, the nominal values by which the picture target bits are bounded are given by
|
Un = Bn,
|
|
Ln = max (0, Bn + RchTn Bvbv).
|
(12)
|
Figure 2
After each picture is encoded, the complexity measures x are updated, and a new target based on Equation (8) is computed for the next picture. This target should meet the constraints imposed by (12) to satisfy the VBV buffer policy. We clip the ideal picture bits using the picture bounds Un and Ln:
|
Rn,ideal =
|
|
Un
|
if (Rn,ideal > Un),
|
(13)
|
|
Ln
|
else if (Rn,ideal < Ln),
|
|
Rn,ideal
|
else
|
Finally, a quantization scaler Q, which is defined on the hyperbola of Equation (2), is obtained. It should be noted that each picture type has its own composite R-Q curve, and further, for each picture type the Q factor may be adjusted slightly to ensure that all pictures are perceived to be of equal quality. In this paper we do not define how to incorporate perceptual effects, via modulating the Q factor, for a macroblock-level rate control during the encoding task; the reader can refer to [2] for this procedure. However, we must describe a strategy which ensures that the ideal target number of bits for each picture type is met. The above condition can be satisfied by monitoring the actual number of bits computed for a set of encoded macroblocks against the ideal average number of bits of a macroblock. Let rm,n,actual represent the actual number of bits calculated for macroblock m in picture n. In order to track how closely we match the ideal picture bit, a parameter m,n,mb is defined:
m,n,mb =
|
m-1
|
rp,n,actual
|
mRn,ideal
|
,
|
(14)
|
|
|
|
p=0
|
Mmb
|
where Mmb is the total number of macroblocks in a picture. A positive m,n,mb indicates an overproduction of bits at the macroblock level; therefore, the picture quantizer must be increased for the next macroblock to satisfy the target number of bits. Similarly, a negative value dictates the opposite scenario. A macroblock-level rate-control strategy can be formulated to modulate the Q factor on the basis of overproduction or underproduction of bits. We accomplish this by defining a factor qm,n, used to scale macroblock m in picture n, through [7]
|
qm,n = Qn
|
|
Ac(Dn, m,n,mb)
|
if ( m,n,mb > 0),
|
(15)
|
Af(En, m,n,mb)
|
else,
|
where Ac(., .) and Af(., .) are empirically derived functions which help in maintaining the picture target bits. They should behave such that Ac(., .) is always larger than 1 and Af(., .) is smaller than or equal to 1. Qn is the picture quantizer, and differential targets Dn and En are
|
Dn = Un Rn,ideal ,
|
|
|
En = Rn,ideal Ln .
|
(16)
|
We have now defined a CBR MPEG-2 encoder specified by {Ri, Qi}ni=0. In the following section, we describe how this framework can be modified to produce a better CBR stream.
3. Single-pass CBR video
Previous approaches to CBR video compression such as the schemes defined in Section 2 and in [2] have used only the encoding parameters of one previously encoded picture to estimate the R-Q relationship of a picture having the same type. After a picture is encoded, its complexity x is updated as defined in [2] and used for the next picture of the same type,
xn-1 =
|
|
Rxn-1,actualM -1mb
|
Mmb-1
|
qxm,n-1
|
|
if (n-1) is of type x,
|
(17)
|
|
|
m=0
|
|
|
xn-2
|
for all other types, |
and the complexity measures are used to predict a quantization value for picture n,
|
Qxn =
|
In-1NI + Pn-1NP + Bn-1NB
|
.
|
(18)
|
|
Cgop  n-1
|
An examination of Equation (18) reveals that a first-order prediction is used to estimate the position of an (Ri, Qi) pair for each picture type to be encoded. This method of estimation can be improved by observing the long-term complexity of all pictures processed so far and determining the relative complexity of a current picture to be encoded. The long-term complexity xn-1 of a picture type x is defined as
xn-1 =
|
|
( x 1) xn-2 + xn-1
|
|
if (n-1) is of type x,
|
(19)
|
|
x
|
|
|
xn-2
|
for all other types.
|
Equation (19) characterizes an IIR filter structure with the complexity of a picture as an input and the average complexity of a subset of the stream as the output. Coefficient x is fixed over time for a picture type x and reflects how strongly the output of the filter is dependent on the previous input and output samples. The canonical structure of the IIR filter represents an efficient and simple way to respond quickly to the statistical variations of a video sequence in real time. The recurrence in Equation (19) can be factored to write the average complexity in terms of the individual complexity of encoded pictures,
xn-1 =
|
( x-1)n-1
|
x0 +
|
n-1
|
( x-1)n-(1+i)
|
xi ,
|
(20)
|
|
|
|
xn-1
|
i=1
|
xn-i
|
with
I0 = init ,
|
x0 = x I0 for x {P, B}.
|
(21)
|
Since each GOP typically starts with an I-picture, the selection of the magnitude of init determines how well we may encode the first few Is of the sequence. Constants P and B are intended to maintain a complexity ratio among the start-up values. We now take advantage of the fact that before compression of a "difficult-to-encode" picture, the differential target Dn may be a large positive number which can be used to adjust the ideal target allocation of such a picture. Similarly, the bit allocation of an "easy- to-encode" picture can be lowered on the basis of the differential target En. It should be noted that the notion of difficulty of encoding reflects the relative complexity of a picture among all processed pictures. This conception is substituted for the expression "more (less) difficult to encode" throughout this formulation. In order to identify the level of difficulty of a picture type, we compare its complexity against the output of the IIR filter, which indicates the history of all previously encoded pictures of this type. If ( xn-1 xn-1), we label this picture as "difficult to encode" and increase its bit allocation. For the opposite scenario, i.e., ( xn-1 < xn-1), the picture is labeled as "easy to encode," and bit allocation is lowered. The actual adjustments are defined by modifying Rn,ideal of Equation (13) through
|
Rxn,ideal =
|
|
min
|
|
Rxn,ideal +
|
xn-1 xn-1
|
maxDn, maxRxn,ideal
|
|
|
xn-1 + xn-1
|
|
|
if ( xn-1 xn-1),
|
|
max
|
|
Rxn,ideal +
|
xn-1 xn-1
|
minEn, minRxn,ideal
|
|
|
xn-1 + xn-1
|
|
else,
|
|
(22)
|
and the picture quantizer is computed as
|
Qxn =
|
xn-1
|
(23)
|
|
|
Rxn,ideal .
|
The new targets in Equation (22) suggest that for extreme cases, i.e., pictures that are very difficult (or easy) to encode, we may come very close to the upper or lower bound of a picture and force the decoder buffer to underflow or overflow. To resolve this problem, constants max and min are incorporated to use only a portion of the available buffers Dn or En. The min and max operations, along with user-defined max and min parameters, are used to prevent a picture from using too many or too few bits. A further precaution is taken to avoid decoder buffer overflow by adding a guard band gL to increase the lower picture bound:
the new adjusted differential targets (buffers) are given by
|
Dn,adj = Un Rn,ideal gU ,
|
|
|
En,adj = Rn,ideal Ln,adj .
|
(25)
|
Again, a guard band gU is used to prevent the decoder buffer from underflowing. After the new target for a picture is set through the conditions given in Equation (22), we must still clip the bits using the adjusted lower bound Ln,adj and the upper bound Un. The updated parameters Dn,adj and En,adj are used to ensure that the new targets are achieved by implementing a macroblock-level rate-control strategy, as defined in (15). Figure 3 displays a graphical representation of how a transformation of a plot of the relative complexity of a picture is used to set a new bit target. It further reflects how VBV buffer compliance is guaranteed and picture bit budgets are satisfied by enforcing the actual bit production to operate within certain limits and cling to the ideal bit allocation.
Figure 3
The robustness of any type of constant-bit-rate control algorithm is directly dependent on the speed with which it can respond to the changes of the video content of an incoming stream. For real-time applications in which an efficient hardware implementation must be realized with dedicated integrated circuits, the veracity of the RC algorithm may be severely tested under stressful conditions. This is because such customized solutions do not use any preprocessing functions to pre-analyze the nature of the stream; consequently, the availability of the true encoding parameters of a picture always lags the assigned values. A real-time RC algorithm must process and memorize a few GOPs of the same complexity before it can reach its optimality. We argue, on the basis of the following observations, that our new approach to CBR video compression, as formulated in Equation (22), offers a quicker response and better results than the conventional method of CBR encoding.
Consider an encoder that has processed a few "difficult-to-encode" pictures and, further, that this new group of pictures belong to the same GOP. The increase in picture difficulty may be due to the sudden (or gradual) appearance of a high degree of spatial image detail, an increase in the velocity of many moving objects of different scales, directional changes of objects, or some form of higher-order combinations. At the start of a new GOP, we must encode an I-picture. Since we have already compressed one difficult I-picture, there is a high probability that the next I-picture is also "difficult to encode," as reflected by the complexity of the previously analyzed picture. Our CBR scheme can adjust to this picture much more quickly by knowing that the I-picture is significantly different from the rest of the video sequence. This observation is made by comparing the output of the IIR filter against the estimated I-complexity. As a result, the I-picture consumes more bits than the normal bit allocation generated by a conventional method of CBR encoding. Therefore, a higher-quality I-picture is reconstructed at the decoder output. I-pictures are used as references to predict a block of pixels in P- and B-pictures. A better prediction is now obtained for the non-intracoded pictures, resulting in better reconstruction of such pictures at the cost of a small number of bits. Hence, we improve the perceptual quality of a GOP while still adhering to the constant bit rate of the stream. If an easy picture is to be encoded, it consumes fewer bits, and the remaining bit budget is used to encode the future pictures of a GOP. Overall, the number of bits and the picture quality average out over an "easy-to-encode" GOP. It should be noted that the human observer finds image distortions in "difficult-to-encode" pictures most annoying; for long video programs, a small degradation in "easy-to-encode" pictures is tolerated. Further assessment of the argument presented in this section and comparison with the method described in Section 2 are provided in Section 5, which discusses the simulation results.
4. VBR video
The CBR rate-control strategy of the previous section is inspired by the fact that for a fixed target Cgop, constant quality is achieved within a GOP by modulating the quantization parameters. Any statistical variations or bit offshoots are exploited to help stabilize the CBR RC algorithm over time and maintain a desired rate. In addition, an MPEG-2 CBR encoder takes advantage of a set of universal constants and predetermined initial complexity measures to ensure a certain ratio between the number of bits allocated among different picture types [2]. However, efforts in classifying a group of continuous pictures, such as GOPs or video segments, into different types of time intervals in terms of complexity "hardness" or "softness" have been limited for real-time single-pass MPEG-2 encoding. We define "hardness" ("softness") of the video by the large (small) number of bits that it requires to produce high-fidelity results. For multipass CBR or VBR encoding, such information is known; hence, quality improvements can be made. One way of achieving single-pass VBR compression is to compare the Q factor, derived by the CBR algorithm, against a fixed parameter [8]:
|
Qvbr = max (Qfix, Q).
|
(26)
|
The method in (26) is intended to provide an upper bound on the quality of pictures which belong to soft video segments. When soft segments are analyzed, it is very likely that we have (Qfix > Q). Therefore, a quantization scaler, larger than the values normally assigned by the CBR encoder, is assigned to pictures in soft segments. A VBR stream is produced by distributing the surplus bits among the hard segments of the video. Qfix may be obtained through experimentation with a large number of sequences, but finding a near-optimum value is difficult if not impossible. A better scheme can be formulated by calculating a Qfix scaler in real time using prior image statistics [7]. The concept behind the method in [7] is to combine the Qfix approach of [8] with that of the CBR RC algorithm.
Our real-time single-pass VBR encoder exploits an R-Q compression model to differentiate the degree of "hardness" or "softness" of video segments, each segment corresponding to a particular hyperbola similar to the one defined by (2). The actual encoding parameters of the video segments are computed along this hyperbola. We also use an R-Q perceptual model to prioritize the video segments in terms of visual importance. To satisfy the average rate of the VBR stream, the position of the perceptual model must be changed over time. In this paper we specify two methods for meeting this condition. The perceptual model and the VBR RC algorithms are described next.
Rate-quantization perceptual model
The VBR scheme proposed in this paper is conceptually motivated by the fact that each video segment is associated with a level of encoding difficulty, and this difficulty can be measured by various source statistics or compression parameters such as total picture bits, quantization scaler, spatial activity, temporal activity, signal-to-noise ratio, or any combination of them. A larger number of bits should be allocated to a video segment with a high level of encoding difficulty. This is a different approach from that of the CBR RC algorithm, in which a fixed number of bits is allocated to each GOP regardless of the degree of complexity of the source. Research efforts have shown that for a large number of test cases composed of complex, moderate, and easy materials, a strong correlation exists between the rate of the video interval and its corresponding quantization scaler [9]:
Ra (a1 + a2Qßa).
|
(27)
|
The R-Q relationship of (27) is deduced on the basis of the criterion that all types of video intervals should be perceived equally. Further, it suggests that difficult video segments producing a large quantization scaler should consume more than the average bit rate of the compressed stream, while easy segments would use smaller bits. This provides a natural building block in formulating a robust single-pass VBR encoding algorithm, since it takes advantage of the variability of the video source. The actual number of bits allocated to each interval is determined by the slope K of the perceptual model,
|
Ra = K(a1 + a2Qßa) = KF(Qa).
|
(28)
|
Constant K (measured in Mb/s) is modulated for each video interval to ensure that the average rate of the compressed stream meets the desired target. For the case of ß = 1.0 the perceptual model reduces to a linear relationship.
VBR rate-control algorithms
The efficiency of a single-pass VBR encoder is assessed by the speed with which its rate-control algorithm can learn and adjust itself to the "softness" or "hardness" of the video stream. For regions where image discontinuities or special effects occur, degradations in picture quality should be minimized. Since for single-pass encoding, image statistics are limited by the previously analyzed pictures, the learning rate of the RC algorithm must be adequate to predict the content of the future video intervals, yet not aggressive enough to result in algorithmic instabilities. One way to solve the twofold problem is to adjust the quality of the encoded stream for every time interval and let the RC algorithm learn the local content of each picture within that time interval.
In this subsection we claim that a VBR video stream Svbr is a concatenation of several contiguous video intervals, each operating at a different CBR bit rate. We further use the GOP terminology as the definition of a video (or time) interval for the remainder of the paper, and compute the quantization parameters of the single-pass VBR algorithm for each GOP. Therefore, Svbr can be partitioned into a number of piecewise-continuous GOPs specified by {S } =0m. For GOP S we define an average number of bits  and an average quantization scaler  such that [   ] = G -1[C Q ]. We modify the terms of (28) to form a dependency between the average number of bits and quantization scaler of GOP S ,
The picture rate of the video is defined by f. Equation (29) represents the perceptual rate-quantization model for a GOP. We further take advantage of the theoretical rate-quantization model and assume a hyperbolic dependency between the average number of bits and quantization scaler of a GOP as in (2). The GOP rate-quantization model takes the form of
The average complexity of a GOP is given by c . To understand how the two perceptual and theoretical rate-quantization models work in harmony, we should look at their graphical representations in Figure 4. The perceptual model is overlaid on the ideal rate-quantization behavior of GOPs. Once the VBR encoder processes a few GOPs, the - relationship of Equation (30) can be realized. Depending on the actual compression parameters, the model of Equation (30) characterizes a "hard" or "soft" GOP. Then the - perceptual model is used to emphasize the visual importance of a "hard" GOP. The positive slope of this model ensures that "hard" GOPs are assigned more bits. A more detailed description of the VBR rate-control algorithm is presented later in this section.
Figure 4
Since our goal is to formulate a VBR algorithm which is readily applicable for hardware implementations, we form a linear approximation for the model of Equation (30). The new model uses previously processed GOPs to construct a straight line for the - relationship. We call this line the - predictive model and define its slope and intercept by  and  , respectively. Next, we explain how to derive this model.
Let ( -2,actual,  -2,actual) and ( -1,actual,  -1,actual) represent the average number of bits and average quantization scaler pairs of GOPs S -2 and S -1 just encoded.5 Then, the - relationship for GOP S to be encoded is given by
with
|
 =   (  ) -1
|
 =  -1,actual +    -1,actual ,
|
  =  -1,actual  -2,actual
|
  =  -2,actual  -1,actual .
|
Equation (31) defines the instantaneous rate-quantization behavior of a particular GOP under analysis and will change over time. To find the optimum operating point, we solve the causal predictive model of (31) together with the perceptual model of (29) and encode the next GOP at the average number of bits of
 =
|
K( a1 +  a2)
|
.
|
(32)
|
|
f + a2K
|
Constant K is modulated for each GOP to ensure that the total number of bits produced by the VBR stream is not more than the size of the storage or retrieval device, e.g., a DVD disc. Assume that the total number of bits available is RTOT. Then, after every GOP of the video sequence is analyzed and encoded, we can use the perceptual model of (29) to compute K:
|
fRTOT
| |
(33)
|
|
K =
|
|
.
|
G
|
Ngop-1
|
F( ,actual)
|
=0
|
The number of GOPs in the video sequence is given by Ngop. Given a number of pictures and a bit budget RTOT, the single-pass VBR encoder has the responsibility of fitting all of the produced bits into the digital medium. To prevent overruns or underruns, RTOT must be dynamically modified for each GOP. The adjusted budget R ,tot is obtained by subtracting the actual bits from RTOT and is used to set a new slope,
|
fR ,tot
| |
(34)
|
K =
|
|
.
|
G
|
Ngop-1
|
F( ,actual)
|
=0
|
We call the VBR encoder which uses this method of bit assignment (or slope modulation) VBR method 1 (VBR-1). The denominator of Equation (34) can easily be computed in a multipass encoding scheme, but is not available for a single-pass real-time compression. Instead, we use a pre-encoded phantom sequence with a set of GOP quantizers defined by { * } =0N*gop-1, and adjust the summation of Equation (34) after each GOP of the test sequence is encoded. N*gop is the number of encoded GOPs in the phantom sequence. The learning procedure for (34) is formulated as follows.
Learning procedure 1
-
Let P be a pre-encoded phantom defined by the tuple {N*gop, {
 } =0N*gop-1}, and set Z0 = [ =0N*gop-1 F( * )], N0 = N*gop.
-
Initialize the VBR-1 RC algorithm by encoding the first two GOPs of the video sequence, i.e., S0 and S1, at a rate of (f
gop) using a CBR MPEG-2 encoder. The nominal value of (f gop) should correspond to the average rate of the VBR stream.
-
After S
-1 is encoded, update Z and N as Z = Z -1 + F( -1,actual), N = N -1 + , where is the update speed of the learning algorithm. Compute C -1,actual and then the sum j=0 -1 Cj,actual.
-
If (
1 = 0) increment by one, go to step 3.
-
Otherwise, compute
 and  according to the causal predictive model defined in (31).
-
Use the sum term of step 3 to determine the remaining bits R
,tot in the budget. Update the number of remaining pictures P to be encoded.
-
Measure the new slope: K
= fR ,totN (P Z ) -1.
-
Calculate the new target
 and encode S in VBR mode.
-
If there are unfinished GOPs in the sequence, go to step 3; otherwise stop.
Term Z is intended to adapt itself to the image content of each GOP and smooth out the volatility of the VBR video. It acts as a safety measure to prohibit unrealistic bit allocations to GOPs of very high (low) complexity. Finally, the average number of bits of the GOP S to be encoded by the VBR-1 encoder is set by
 =
|
|
R ,totN
|
|
|
 a1 +  a2
|
|
.
|
(35)
|
|
|
P Z
|
 + a2
|
|
R ,totN
|
|
|
|
|
P Z
|
To adjust the target bits of each picture type, is multiplied by the GOP size G and Equation (8) is used to set the VBR picture bits. For a VBR scenario, upper and lower picture bounds are defined differently from those for the CBR mode of operation. An overflow condition cannot occur for the decoder buffer of a VBR codec. This is because the task of filling the decoder buffer is immediately stopped after the VBV buffer occupancy reaches its maximum level. Therefore, a lower bound of zero is imposed for picture bound. Moreover, the decoder buffer is filled at the maximum rate of Rmax,6 set by the user. Figure 5 shows an example of how the occupancy of a VBV buffer changes over time for the VBR mode of compression. We clip the target bits of Equation (8) by the newly derived picture bounds
|
Un = min (Bvbv, Bn-1 + RmaxTn-1 Rn-1,actual),
Ln = 0.
|
(36)
|
Figure 5
Picture quantization scalers are computed as in (23) and the macroblock-level rate-control strategy of Section 2 is used to maintain the picture targets. We have now defined a single-pass VBR MPEG-2 encoder specified by {RTOT, {Ri, Qi}ni=0}. The encoder operates along constellations which are formed by jointly solving for a time-varying perceptual model and a bank of - models. The relative ideal position of the ( ,  ) pair of a GOP within the constellation is first determined by the "softness" or "hardness" of the GOP and then adjusted by the remaining number of bits in the bit budget. The local position of the (Ri, Qi) pair of a picture is represented by the R-Q model of the picture type. Figure 4 displays how a constellation of ( ,  ) pairs is formed by the VBR-1 encoder. In this figure, the constant line defined by = gop indicates the ideal location of all ( ,  ) pairs if they were to be encoded in CBR mode. Moreover, it reflects that the average quantization scalers increase as the hardness of GOPs increases. The ideal and actual operating points are depicted by light and dark circles, respectively. Let ( 1, 1) and ( 2, 2) be the first two pairs of a "soft" GOP, which are obtained by initialization of the VBR-1 encoder in the beginning of a video sequence. The next ideal operating point, denoted by "3," is the intersection of the perceptual model (solid line) and the causal predictive model. The position of point "3" indicates that an average number of bits smaller than the sequence average gop is allocated. However, the output of the encoder, i.e., point ( 3, 3), will be different, and as a result, a new perceptual model (shown by the dotted line) is derived to meet the bit budget constraint. The pairs ( 2, 2) and ( 3, 3) are then used to obtain a new predictive model (shown by the dotted line) and compute a new operating point, i.e., "4." If the video sequence contains a large number of contiguous "soft" GOPs, the perceptual model will eventually converge to a line which intersects the = gop line in close proximity to the CBR output pairs ( 1, 1) and ( 2, 2).
If the video material starts with "hard" GOPs, the location of the actual point ( 3, 3) is above the = gop line. This ensures the optimization of the available bit budget for regions where a CBR encoder is most vulnerable, i.e., "hard" GOPs. In this scenario, a new perceptual model (shown by the dashed line) is formed to gradually lower the allocated GOP bits and comply with the bit budget constraint. This model, along with the new causal predictive model (also shown by the dashed line), determines the next operating point, denoted by "4." For the case in which the incoming sequence is composed only of "hard" GOPs, the perceptual model will eventually conform to a line which meets the = gop line at a location close to the ( 1, 1) and ( 2, 2) points. For a typical video program, it is unlikely that we operate along the same GOP (or a constellation of previously computed points) for a long duration of time. It is, however, likely that we jump out of a GOP to the next neighboring GOP after a short time. Therefore, we can deduce the following behavior. Before the perceptual model settles in a situation where it can monotonously take away from "hardness" or "softness" of a GOP, the RC algorithm will migrate to a new GOP. The actual quantization scaler values for previously encoded picture types determine the migration to a "harder" or "softer" GOP. For a future "harder" ("softer") GOP, we move to the right (left) of the previously encoded GOP as depicted in Figure 4. Hence, the point at which we intersect the perceptual model produces a number of GOP bits which is higher (lower) than the number of actual bits consumed by a previous GOP. This mechanism ensures that, for a collection of contiguous GOPs of similar "level of encoding difficulty," the effect of the "hardness" (or "softness") reduction or augmentation is distributed over the encoded content of the corresponding video interval, while the most (least) complex GOP still gets the largest (smallest) bit allocation relative to its lookalike GOPs.
While the VBR-1 encoding framework is based on modulating the slope K of the perceptual model, an alternative VBR RC algorithm can be formulated by translation of the perceptual model to meet the total number of bits set by the user. This method is labeled as VBR-2 and is defined next. For the VBR-2 bit-modulation scheme, we fix the slope K at a CBR rate of f gop and translate the position of the - perceptual model by solving for parameter a1 of (29) through
|
RTOT Cgopa2
|
|
Ngop-1
|

| |
(37)
|
|
=0
|
|
a1 =
|
|
.
|
|
CgopNgop
|
The sum term in the above equation is undefined for the real-time single-pass VBR-2 encoder. We use a pre-encoded phantom sequence (as in the previous algorithm) to initialize the VBR-2 RC algorithm. To ensure that the digital storage medium does not experience overruns/underruns, we monitor the remaining bits periodically and use them as the instantaneous total bit budget, i.e., R ,tot. Translation of the perceptual model is varied as
a1 =
|
P -1R ,tot
|
a2
|
Y
|
= r a a2E .
|
(38)
|
|
|
gop
|
N
|
Term R a represents a ratio between the long-term rate of the GOPs not yet encoded and the CBR rate of the video stream. For "hard" GOPs we have ( > gop) and, therefore, ratio R a must become smaller over time to satisfy the total bit budget. Further enforcement is provided by E to move the position of the ( , ) pairs downstream, along the constellation, to lower the rate of the compressed stream. The opposite scenario takes place if several soft GOPs are encoded over time. The average number of bits of each GOP is set as
 =
|
 a1 +  a2
|
.
|
(39)
|
|
  -1gop + a2
|
Term Y is initialized and updated during a learning procedure given below. The rest of the encoding parameters are defined with the formulation of the VBR-1 algorithm.
Learning procedure 2
-
Let P be a pre-encoded phantom defined by the tuple {N*gop, {
 } =0N*gop-1}, and set Y0 = ( =0N*gop-1 Q bar]* ), N0 = N*gop.
-
Initialize the VBR-1 RC algorithm by encoding the first two GOPs of the video sequence, i.e., S0 and S1, at a rate of (f
gop) using a CBR MPEG-2 encoder. The nominal value of (f gop) should correspond to the average rate of the VBR stream.
-
After S
-1 is encoded, update Y and N as Y = Y -1 +  -1,actual, N = N -1 + , where is the update speed of the learning algorithm. Compute C -1,actual and then the sum j=0 -1 Cj,actual.
-
If (
1 = 0) increment by one, go to step 3.
-
Otherwise, compute
 and  according to the causal predictive model defined in (31).
-
Use the sum term of step 3 to determine the remaining bits R
,tot in the budget. Update the number of remaining pictures P to be encoded.
-
Compute R
a, E , and the new translation factor a1 .
-
Calculate the new target
 and encode S in VBR mode.
-
If there are unfinished GOPs in the sequence, go to step 3; otherwise stop.
Constraints on the VBV buffer occupancy and target adjustments are handled as before. Figure 6 displays how the instantaneous causal predictive models are formed within each GOP and used in conjunction with a time-varying perceptual model to estimate a new ( ,  ) point for the VBR-2 encoder. If several soft (or hard) GOPs are encoded over time, the perceptual model will be shifted upward (or downward) to meet the bit budget constraint. In Section 5 we evaluate the efficiency and robustness of the single-pass VBR compression schemes by encoding several video sequences.
Figure 6
Scene transitions
The reliability of any type of real-time single-pass RC algorithm is directly related to the number of pictures processed after the encoder start-up. The encoder is initialized with a set of statistical information which is updated over time as the encoder learns and adjusts itself to the spatial or temporal perturbation of image details. As a result, the output stream may suffer some degradations at the beginning of the video sequence. This distortion is often ignored, since the human visual system (HVS) requires a recovery time to comprehend changes in the image distortion [10], and, more significantly, no past encoded pictures are available as a reference at the beginning of the video. After a few GOPs are encoded, the learning curve of the RC algorithm reaches an equilibrium level and produces streams which display acceptable image quality upon decoding.
During the course of the encoding procedure, the RC algorithm can become unreliable. This is usually caused by special effects or naturally occurring phenomena such as scene cuts, slow- or fast-moving fades, and luminance changes. For such cases, the encoder scheme undergoes temporal transitions for which parameter adjustments become nearly impossible. These image discontinuities typically result in serious image degradations which the human observer finds very distracting. One way to overcome this problem is to detect the temporal position of the special effect in the video and replace the quantization parameters with a properly adjusted set. The former is easily derivable for hard scene cuts but can become complicated for fades, while the latter requires a new framework for deriving a new set of target rates and quantization scalers.
The existence of image discontinuities is even more troublesome for a VBR video because of the volatile nature of the compressed video parameters. Since our intention is to formulate a VBR RC algorithm offering continuous picture quality, we seek a VBR encoding scheme which displays graceful degradations when special effects are compressed. We accomplish this by detecting scenarios in which, as a result of scene transitions, the causal predictive model cannot provide a faithful estimation. Under these circumstances we set the GOP targets along a trajectory different from the procedure described in the subsection on VBR rate-control algorithms. Figure 7 shows two cases (for the VBR-1 video encoder) in which transitions from "soft" GOP to "hard" GOP and vice versa make incorrect estimations for a future GOP to be encoded. The incorrect estimation is denoted by "bad operating point" in the figure. A much better estimation can be made by applying a three-point median filter to a set of carefully selected parameters. For the VBR-1 RC scheme, the new targets are derived using the following rule:
Figure 7
C ,new = G
|
|
median
|
[µ1f -1K F( -1,actual), µ2 gop, µ3f -1K ] |
|
if ( < 0),
|

|
else;
|
|
(40)
|
for the VBR-2 RC scheme, the targets are
C ,new = G
|
|
median
|
|
µ4 gop(a1 + a2 -1,actual), µ5 gop, µ6
|
R ,tot
|
|
|
P
|
|
if ( < 0),
|

| |
else;
|
|
(41)
|
The median filter is used to compensate for instability conditions that occur in the transition of video scenery among all types of GOPs, i.e., "soft" to "hard" and vice versa. The constant set {µi|1 i 6} enforces a bound on the number of bits produced during a scene transition. For the "soft"-to-"hard" GOP transition, the output of the filter in Equation (40) is marked by "1" in Figure 7, while point "2" is the new estimate for the reverse transition. The nonlinear nature of the filtering schemes in (40) and (41) is an effective way of controlling the rate of the VBR encoder for higher-order temporal changes. Further, it does not require a local (picture-level) scene-change detector. However, if the encoder is equipped with a scene-change detector, we may take advantage of the GOP target allocation of Equations (40) and (41).
- irregularities
The R-Q relationship of Equation (2) is known to work fairly well for a diverse class of video sequences and is a fundamental concept used for many real-time MPEG and non-MPEG compression systems. In this paper we have used it to build a causal predictive model to estimate the average bits of a present GOP to be encoded on the basis of the actual average number of GOP bits and average quantization factor of previous GOPs. However, this predictive model can become unreliable for some video segments even when there are no special effects, scene transitions, or scene cuts. The cause of this unreliability is related to the nonstationary nature of video sources, e.g., several highly detailed objects moving slowly across a background of significant luminance or chrominance image details, the presence of unwanted noise in the scene, etc. Figure 8 displays two points derived by an encoder, with one being outside the composite - curve. This point is a result of the aforementioned criteria and does not obey the - relationship. Further, it will contribute to the calculation of a "bad operating point." Our VBR encoder suppresses this irregularity by making an additional contribution to the strategies defined in (40) and (41). We set the GOP bits according to the condition
C ,new =
|
|
max (C ,new, G max ,fil)
|
|
|
if ( < 0) ( L < | | < U)\ ( ,fil > gop),
|
min (C ,new, G min ,fil)
|
|
else if ( < 0) ( L < | | < U) ( ,fil < gop),
|
G gop
|
|
else if ( < 0) ( L < | | < U) ( ,fil = gop),
|
|
(42)
|
with
 ,fil =
|
1 -2,actual + 2 -1,actual
|
.
|
(43)
|
|
1 + 2
|
Figure 8
 ,fil is an average number of GOP bits which is determined by applying a linear filter on the average number of bits of the previously encoded GOPs. The output of this filter is the new operating point, and ( 1, 2) represent the weights of the filter. Constants max and min are used to scale the filter output. Thresholds L and U are chosen under the assumption that a future GOP belongs to the same video segment as the already processed GOPs. The max and min functions are implemented to filter out the encoding bit parameters that result in - irregularities.
5. Simulation results and discussion
The performance of the proposed single-pass CBR and VBR video encoders is evaluated by compressing a library of test sequences. This library contains a diverse class of composite sequences, as described in Table 1. The video sequences are digitized according to the CCIR 601 resolution [11], with a color sampling ratio of 4:2:2. A 4:2:0 color sampling ratio is created by applying preprocessing filters on the chrominance samples of each source. The compression procedure is carried out in 4:2:0 mode using MPEG-2-compliant main profile at main level encoder software developed at our research laboratory. Since optimizing the communications bandwidth or the digital storage medium is very important for bro |