1. Introduction
The most effective video compression standards [1-4] use the motion-compensated picture difference (MCPD) technique to achieve high degrees of compression at acceptable levels of picture quality. For the purposes of this paper, an MPEG-2 MCPD is generated by three steps:
1) Segmenting a target picture into a grid of macroblocks of 16 x 16 pixels.
2) Predicting the pixel values in each target macroblock by estimating the translational displacement (motion) between macroblocks in the target picture and macroblocks in one or two reference pictures.
3) Subtracting the target macroblocks from their predicted values to generate an MCPD picture.
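The three steps above can be illustrated with a minimal full-search block-matching sketch. It is not the encoder used in this paper: the sum-of-absolute-differences cost, the tie-breaking rule, the sign convention of the residual, and all function names are assumptions made for illustration only.

```python
import random


def sad(target, reference, ty, tx, ry, rx, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(target[ty + i][tx + j] - reference[ry + i][rx + j])
        for i in range(size)
        for j in range(size)
    )


def best_motion_vector(target, reference, ty, tx, size=16, search=(4, 4)):
    """Full search over a window that is symmetric around zero displacement.

    search = (max |dx|, max |dy|). Returns ((dx, dy), cost); ties go to the
    shorter vector. Candidates that fall outside the reference are skipped.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search[1], search[1] + 1):
        for dx in range(-search[0], search[0] + 1):
            ry, rx = ty + dy, tx + dx
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue
            cost = sad(target, reference, ty, tx, ry, rx, size)
            key = (cost, abs(dx) + abs(dy))
            if best is None or key < best[0]:
                best = (key, (dx, dy))
    return best[1], best[0][0]


def residual_block(target, reference, ty, tx, mv, size=16):
    """Difference block (target minus prediction; the sign is a convention)."""
    dx, dy = mv
    return [
        [target[ty + i][tx + j] - reference[ty + dy + i][tx + dx + j]
         for j in range(size)]
        for i in range(size)
    ]


# Synthetic pictures: the target is the reference displaced by dx = -3, dy = 2.
rng = random.Random(0)
H, W = 32, 32
reference = [[rng.randrange(256) for _ in range(W)] for _ in range(H)]
target = [[reference[(y + 2) % H][(x - 3) % W] for x in range(W)] for y in range(H)]

mv, cost = best_motion_vector(target, reference, ty=8, tx=8, size=8, search=(4, 4))
residual = residual_block(target, reference, 8, 8, mv, size=8)
print(mv, cost)
```

The cost of a full search grows with the product of the horizontal and vertical search ranges, which is why the choice of range matters when computational power is limited.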
Two types of MCPD pictures exist in MPEG-2: P- and B-pictures. P-pictures use only one reference picture, which is temporally located before the target picture. B-pictures use two references: one before and one after the target picture. A third type of picture in MPEG-2, the so-called I-picture, is not motion-compensated. Recent reviews of MPEG-2 include [5] and [6].
When generating an MPEG** compressed stream, it is common to define an m-parameter to indicate the distance in picture periods between target P-pictures and their corresponding reference pictures. An m = 1 sequence is made up of P- and I-pictures only, and each P-picture in the sequence is motion-compensated on the basis of the previous P- or I-picture. In an m = 3 sequence, two B-pictures are typically sandwiched between pairs of P-pictures. Because in MPEG only P- and I-pictures can be used as reference pictures, the largest distance between target and reference pictures occurs when P-pictures are encoded. It turns out that the larger the distance between target and reference pictures, the larger the displacements found in step 2 above. In the remainder of this paper, we attempt to characterize the statistics for the maximum size of these displacements; thus, we concern ourselves primarily with the prediction of P-pictures at m = 3.¹
Figure 1 shows these functions for the well-known set of MPEG test sequences. Note that we evaluate the cumulative distribution functions for the absolute value of motion-vector components; this is because we are interested in investigating only motion-estimation search ranges that are symmetrical around 0.
Figure 1
It is important to understand how these plots were derived in order to interpret them correctly. As previously explained, macroblocks in a P-picture can be coded as intra-, frame-predicted, or field-predicted. In the case of field-predicted macroblocks, there exist two motion vectors per macroblock. In the case of frame-predicted macroblocks, only one motion vector per macroblock is used, whereas no motion vectors are specified for intra-macroblocks. Since we wish to make "motion-vector statistics" correspond to actual "picture area statistics," we use the following rules in deriving f and F from the experimental data:
- Frame-motion vectors are counted twice in calculating f or F.
- Intra-macroblocks are counted as two zero-motion vectors in calculating F.
In this manner, when calculating F, two motion vectors are always used for every macroblock in a P-picture. Thus, a value of MVx = x0, corresponding to F(MVx) = 0.9, indicates that 90% of the total pixel area in P-pictures can be "coded" with motion vectors MVx ≤ x0 (note that this 90% includes "unpredicted" intra-macroblocks). We refer to x0 as the 90% probability search range for the x-component of motion vectors.
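The counting rules above can be sketched as follows for a single motion-vector component. The macroblock representation (a kind string plus a list of component values) and the function names are hypothetical; only the counting rules and the definition of the probability search range come from the text.

```python
def vector_samples(macroblocks):
    """Expand macroblocks into |MV| samples, always two per macroblock.

    Each macroblock is (kind, components), where kind is "intra", "frame",
    or "field" and components holds one motion-vector component per vector.
    """
    samples = []
    for kind, components in macroblocks:
        if kind == "intra":
            samples += [0, 0]                       # two zero-motion vectors
        elif kind == "frame":
            samples += [abs(components[0])] * 2     # single vector counted twice
        elif kind == "field":
            samples += [abs(c) for c in components] # two vectors, one per field
        else:
            raise ValueError(f"unknown macroblock kind: {kind}")
    return samples


def search_range(samples, probability):
    """Smallest x0 such that F(x0) >= probability."""
    ordered = sorted(samples)
    total = len(ordered)
    for count, value in enumerate(ordered, start=1):
        if count / total >= probability:
            return value
    return ordered[-1]


# Hypothetical x-components for five macroblocks of one P-picture.
picture = [
    ("frame", [3]),
    ("field", [5, -12]),
    ("intra", []),
    ("frame", [-1]),
    ("field", [20, 2]),
]
samples = vector_samples(picture)
x0_90 = search_range(samples, 0.9)
print(x0_90)
```

Because every macroblock contributes exactly two samples, the fraction of samples at or below x0 equals the fraction of picture area covered by that search range.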
Short-term versus long-term statistics
To experimentally describe the f and F statistics, we must collect macroblock motion-vector measurements for one or more pictures. The elapsed time of the observations is very important. As one might expect, the statistics are heavily dependent on video content, and their behavior tends to appear stationary only from one scene change to the next. It should be clear, for example, that measuring the long-term statistics of motion vectors for the length of a two-hour movie will tell us very little about the short-term statistics of each individual scene in the movie. In this paper we are interested in both long-term and short-term behavior of motion-vector statistics. Long-term statistics provide us with an indication of the value of the required motion-estimation search range for effective video compression, where by effective we mean compression with optimum video quality "most of the time." Short-term statistics provide us with an indication of the value of the required motion-estimation search range for robust video compression, where by robust we mean compression with close-to-optimum video quality⁴ "all the time." Clearly, robust encoding is a desirable objective; however, one must deal with practical considerations such as the tradeoff between accuracy and search range when limited computational power is available.
We report on the results of simulation studies for several video sequences. The sequences we tested comprise three different groups: 1) the MPEG-2 set of test sequences, with which many practitioners of the MPEG-2 standard are familiar; 2) a set of sports-related test sequences which were deliberately chosen to stress the requirements for large search ranges; 3) several minutes of 24-frame-per-second film material, representative of typical action movie content.
Statistics for standard MPEG sequences
Table 1 lists the MPEG test sequences we used in our experiments, including our attempt to describe their content. The term simple motion means few objects moving at slow speeds; complex motion means one or more objects moving at moderate to high speeds; zoom and camera panning are self-descriptive. We measured motion-vector statistics for sixty pictures in each test sequence at m = 1 and m = 3.
Table 1 Description of MPEG-2 test sequences.
| Motion sequence | Simple motion | Complex motion | Zoom (in/out) | Camera panning |
|---|---|---|---|---|
| Cheerleaders | | | | |
| Flower Garden | | | | |
| Mobile and Cal. | | | | |
| Susie | | | | |
| Table Tennis | | | | |
| Football | | | | |
| Carousel | | | | |
While we are interested primarily in m = 3, it is useful to observe how search-range requirements scale with m. For most sequences, motion vectors do not scale linearly with m, as one might erroneously assume. In fact, Table 2 shows that only in the case of simple motion and camera panning is the scaling approximately linear. For all other cases, the motion-vector search range required for m = 3 is typically less than the linear rule would predict.
Table 2 Experimental results for m = 1 and m = 3.
| Motion sequence | F(MVx) = 0.95, m = 1 | F(MVx) = 0.95, m = 3 | F(MVy) = 0.95, m = 1 | F(MVy) = 0.95, m = 3 | F(MVx) = 0.99, m = 1 | F(MVx) = 0.99, m = 3 | F(MVy) = 0.99, m = 1 | F(MVy) = 0.99, m = 3 |
|---|---|---|---|---|---|---|---|---|
| Cheerleaders | 16 | 42 | 14 | 40 | 38 | 75 | 38 | 72 |
| Flower Garden | 11 | 32 | 2 | 6 | 15 | 40 | 8 | 20 |
| Mobile and Cal. | 3 | 9 | 2 | 4 | 28 | 36 | 4 | 16 |
| Susie | 2 | 4 | 1 | 4 | 4 | 7 | 3 | 10 |
| Table Tennis | 8 | 22 | 6 | 8 | 19 | 44 | 14 | 18 |
| Football | 39 | 60 | 16 | 40 | 52 | 92 | 32 | 68 |
| Carousel | 38 | 100 | 10 | 26 | 55 | 120 | 40 | 54 |
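To make the scaling behavior concrete, the short sketch below computes the ratio of the m = 3 to the m = 1 search range for the 95% horizontal values in Table 2. A ratio of 3 corresponds to the linear rule. The dictionary layout and function name are illustrative choices, not part of the original experiments.

```python
# 95% probability horizontal search ranges (m = 1, m = 3), copied from Table 2.
RANGE_95_MVX = {
    "Cheerleaders": (16, 42),
    "Flower Garden": (11, 32),
    "Mobile and Cal.": (3, 9),
    "Susie": (2, 4),
    "Table Tennis": (8, 22),
    "Football": (39, 60),
    "Carousel": (38, 100),
}


def scaling_ratio(m1_range, m3_range):
    """Ratio of the m = 3 search range to the m = 1 search range."""
    return m3_range / m1_range


ratios = {name: scaling_ratio(*pair) for name, pair in RANGE_95_MVX.items()}
for name, ratio in ratios.items():
    print(f"{name:16s} m3/m1 = {ratio:.2f} (linear rule: 3.00)")
```

For this column, none of the ratios exceeds 3, and Football falls well below it.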
The picture-by-picture statistics for these sequences are shown in Figure 2. In this figure we show the average motion-vector component, the 95% probability search range, and the 99% probability search range, for the horizontal and vertical components at m = 3. Also shown is the percentage of intra-macroblocks (note that, as one would expect, the number of intra-macroblocks correlates well with the amount of motion). Each of these sequences corresponds to a single video scene, and each of them shows an approximately statistically stationary behavior. Their overall statistics have already been presented in Figure 1. We observe in Table 2 that for even a 99% probability search range, MVx ≤ 120 and MVy ≤ 72 in all cases. To achieve 99% coverage for MVx, only Carousel requires a search range of ±120. This means that when we limit the horizontal search range to ±120, fewer than 1% of the macroblocks in this sequence may become "unpredictable" and may have to be coded as intra-macroblocks, thus adding to the roughly 6% intra-macroblocks required by these sequences to start with. This small increment has an imperceptible impact on coding efficiency or video quality, as we later see. To achieve 95% coverage, we see from Figure 1 and Table 2 that MVx ≤ 100 and MVy ≤ 40. Once again, only Carousel requires MVx = 100; if we limited the search range to this value, the number of intra-macroblocks for this sequence could potentially double to about 10% (see Figure 2). As we later see, objective and subjective measurements of video quality suggest that, even at 95% search range, video quality degradation appears to be below the threshold of human perception.
Figure 2
Statistics of sports sequences
Action sports stress the requirements for motion-estimation search range. For this reason we chose to simulate a set of six different sports clips of interlaced 60-Hz video, as shown in Table 3. We captured 30 seconds for each clip, and sampled two pictures every 0.5 second. These two pictures were separated by three picture periods such that the simulation results correspond to m = 3. The contents of each sequence are described in Table 3. The term close-up shot indicates that the height of the person or persons that the camera is tracking is greater than one half of the picture height; otherwise we label it a long shot. In close-up shots, tracking a moving object can result in extremely fast panning of the more distant background.
Table 3 Description of sports sequence contents.
| Motion sequence | Close-up shot | Long shot | Zoom (in/out) |
|---|---|---|---|
| Tennis | | | |
| Football II | | | |
| Ice Skating | | | |
| Basketball I | | | |
| Basketball II | | | |
| Hockey | | | |
| Soccer | | | |
Figure 3 shows the picture statistics for all of these sequences. The two basketball sequences appear to be the most demanding in terms of horizontal and vertical motion. Also identified in this figure is a portion of video with large motion vectors, which is indicated as Segment 1. The short-term statistics of this segment are studied in more detail in Section 4. Comparing "Basketball I" and "Basketball II," we see that the close-up sequence requires a larger overall search range. This is seen more easily in Table 4 and Figure 4, where long-term (30-second) statistics for each sequence are shown. The overall 99% probability search range requirement is MVx ≤ 123 and MVy ≤ 84, whereas the 95% probability search range is MVx ≤ 77 and MVy ≤ 50. Both of these ranges are defined by the statistics of "Basketball II."
Table 4 Experimental results for m = 3. The 95% and 99% probability search ranges are compared.
| Motion sequence | F(MVx) = 0.95 | F(MVy) = 0.95 | F(MVx) = 0.99 | F(MVy) = 0.99 |
|---|---|---|---|---|
| Tennis | 54 | 18 | 82 | 42 |
| Football | 46 | 20 | 78 | 40 |
| Ice Skating | 59 | 18 | 82 | 40 |
| Basketball I | 50 | 22 | 102 | 54 |
| Basketball II | 77 | 50 | 123 | 84 |
| Hockey | 44 | 10 | 71 | 20 |
| Soccer | 22 | 8 | 64 | 18 |
Figure 3
Figure 4
Statistics of movie sequences
To complete our experiments, we simulated film at 24 pictures per second. This content was provided to us through the courtesy of a movie studio, as an example of typical action material. Four different movie clips with different time lengths, labeled Movie 1 through Movie 4, were used. We sampled pairs of progressive pictures corresponding to m = 3 (after telecine inversion) every half second. We intentionally avoided those cases in which the two pictures in a pair belonged to different scenes (these cases should not be handled with MCPD techniques). The clips contain a variety of material ranging from low motion, e.g., people talking, to extreme motion, e.g., a close-up of a horseback-riding scene.
Picture statistics for these clips are shown in Figures 5-7. Long-term averages are shown in Table 5 and Figure 8. The largest averages correspond to Movie 2, which has a 99% macroblock coverage with a search range of MVx ≤ 103 and MVy ≤ 61. In contrast, the corresponding 95% range is MVx ≤ 46 and MVy ≤ 35.
Figure 5
Figure 6
Figure 7
Figure 8
Table 5 Experimental results for m = 3. The 95% and 99% probability horizontal search ranges are compared.
| Motion sequence | F(MVx) = 0.95 | F(MVy) = 0.95 | F(MVx) = 0.99 | F(MVy) = 0.99 |
|---|---|---|---|---|
| Movie 1 | 37 | 25 | 72 | 61 |
| Movie 2 | 46 | 35 | 103 | 61 |
| Movie 3 | 23 | 19 | 51 | 43 |
| Movie 4 | 22 | 11 | 53 | 29 |
In Figures 5-7 we have also identified a number of segments that significantly exceeded the 99% search range for Movie 2; they are labeled Segment 2, Segment 3, and Segment 4. Segment 2 was chosen because of its large vertical motion. In the next section of this paper, we analyze the short-term statistics of these segments more closely.
4. Picture quality and coding efficiency with a constrained search range
The experiments presented in Section 3 showed that the most demanding content in terms of motion-estimation search range was that of sports video. On the basis of those measurements, we suggest that a search range somewhere between the 95% and the 99% statistics of our sports video clips is sufficient to guarantee close-to-optimum coding results for even critical applications. More concretely, we recommend a search range in the interval
80 ≤ MVx ≤ 120,
50 ≤ MVy ≤ 85
for all applications of MPEG-2 encoding of CCIR 601 video resolution. We observe, however, that over short periods of time, short-term statistics can exceed this recommended search range significantly. Several examples were singled out in Section 3 and labeled as Segments 1-4. Using these segments and other video samples, we now wish to study the impact on picture quality (or, alternatively, coding efficiency) of MPEG compression with a constrained search range. We study this impact with subjective and objective measures.
Factors affecting picture quality in MPEG-2 coding are numerous, and their interaction is complex. Some of the factors that relate to the presence of motion include 1) motion-estimation search range; 2) human perception in the presence of fast motion; 3) picture complexity (motion tends to produce blurred images of low complexity); 4) number of unpredicted (intra-) macroblocks; and 5) data rate. Predicting the effect of a constrained search range in the presence of fast-moving images is difficult because of the complex interactions of all of these factors. For example, does it matter if a few extra macroblocks cannot be motion-compensated when most of a picture is blurred and of low complexity? What is the impact on image quality of adding a few more unpredicted macroblocks to an already significant percentage of intra-macroblocks? Even if all of these elements matter quantitatively, do they matter subjectively? That is, can the eye perceive the impact on quality in the presence of fast motion? At what data rate can the human eye perceive these effects? In practice, these complex interactions cannot be predicted; they can only be measured by experiments. That is the goal of this section.
Objective measurements of video quality and compression efficiency
In this section we measure the impact on PSNR of compressing video at constant bit rates as a function of motion-estimation search range. We also present an alternative view: the impact on compressed bit rate when coding with a constrained search range at constant PSNR. While we recognize that PSNR is not a perfect measure of video quality, relative values of PSNR do correlate with subjective evaluation of quality. This is particularly true for PSNR values below 38 dB; above 38 dB, video tends to be of "high" quality, and changes in PSNR are very difficult to observe.
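For reference, PSNR can be computed as in the sketch below. The text does not spell out the exact definition used in the experiments, so the 8-bit peak value of 255 and the restriction to a single sample plane (for example, luminance) are assumptions.

```python
import math


def psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized pictures given as lists of rows."""
    squared_error = 0
    count = 0
    for row_o, row_d in zip(original, decoded):
        for a, b in zip(row_o, row_d):
            squared_error += (a - b) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical pictures
    mse = squared_error / count
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 2 x 2 pictures whose samples each differ by one level (MSE = 1).
original = [[100, 100], [100, 100]]
decoded = [[101, 99], [101, 99]]
value_db = psnr(original, decoded)
print(f"{value_db:.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a loss of 0.1 dB corresponds to an increase in mean squared error of only about 2%.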
We focus on a couple of examples from the MPEG-2 test sequences, as well as on the short segments that we identified in Section 3 as cases that fall outside the search range we recommend. From the MPEG test set we choose two sequences, "Carousel" and "Mobile and Calendar," as examples of high motion and low motion, respectively. (We have experimented with the remainder of the sequences and find that these two are most representative of these two extremes.) Figures 9 and 10 respectively show the results of our simulations for the MPEG-2 test sequences and the selected video segments. The sequences in Figure 9 are more difficult to compress than those in Figure 10: we observe that at 4 Mbps the sequences in Figure 9 result in a PSNR well below 38 dB, whereas those in Figure 10 result in a PSNR well above 38 dB. The results in Figures 9 and 10 are based on actual MPEG-2 encoding experiments at search ranges of 4 x 4, 8 x 6, 16 x 12, 32 x 22, 60 x 42, 100 x 70, 128 x 90, and 200 x 140. For all of these, the ratio of horizontal to vertical search range is approximately constant and close to 100/70.
Figure 9
Figure 10
It is interesting to note that movie segments 2-4 are the "easiest" to compress, despite having the largest apparent requirement for search range. Even at 2 Mbps, these segments result in a PSNR greater than 36 dB, compared to 30 dB or less for the sequences in Figure 9. Figure 10 also shows that constraining the search range to 100 x 70 has an insignificant impact on PSNR or data rate.
Of the sequences in Figure 9, "Segment 1" (Basketball II) shows the most sensitivity to search range, particularly at low data rates. The loss of PSNR, due to our recommendation for constrained search range, is limited to less than 0.3 dB. In fact, at a 100 x 70 search range, the loss is 0.1 dB or less. One interesting artifact is seen in Figure 9(a): It appears that for "Mobile and Calendar" the coding efficiency or video quality, as measured by PSNR, actually decreases with increased search range! We have noticed this effect in almost all other sequences when the search range becomes larger than the 99% statistics of the sequence. Careful examination of the data shows that this effect can be explained by a combination of our choice for motion-estimation cost function (Section 2) and the frame/field/intra-coding decision algorithm. What happens is that with a larger search range, larger motion vectors are being selected, and more bits are being used to code the motion-vector differences. These additional bits, however, are not being sufficiently compensated for by the corresponding savings in coding smaller macroblock residual errors. We can think of other cost and other intra/inter-macroblock decision functions that could avoid this paradoxical behavior. However, such functions are generally not of practical value. One example is an algorithm in which we optimize the global picture PSNR at a given target bit rate by trying out various combinations of MPEG-2 coding and motion-compensation modalities for individual macroblocks. Such an algorithm, however, is simply impractical. As we pointed out in Sections 1 and 2, in this paper we are interested in practical results; thus, we believe that this paradoxical effect will always be present with practical implementations of MPEG-2 encoding and motion compensation. Fortunately the effect is small and, as far as we can tell from visual experiments, below the threshold of visibility.
Subjective measurements of video quality
We performed limited subjective evaluation experiments using fourteen observers. Included among these fourteen were five with experience in MPEG video coding. Our experiments followed closely the setup and procedures used in the MPEG committee [12]. A subjective rating from 0 to 5 was assigned in conformity with the following quality scale: Bad (0-1), Poor (1-2), Fair (2-3), Good (3-4), and Excellent (4-5).
The experiment comprised a number of sessions. For any one session we tested multiple data rates and multiple search ranges for one video segment and one observer. A session consisted of presenting the observer with multiple pairs of sequences, in which each pair consisted of one reference sequence and one test sequence in random order. Pairs of sequences were presented twice: the first time for previewing, followed immediately by a second time during which observers registered a quality score. All reference sequences were MPEG coded with a search range of 200 x 140, and at the same bit rate as the corresponding test sequence. Three data rates (2, 4, and 6 Mbps) and three search ranges (30 x 22, 60 x 42, and 100 x 70) were chosen for the test sequences; thus, a total of nine pairs were possible for each video segment. In addition, we chose to repeat each pair (but in reverse order of presentation) so as to test the consistency of the observer's evaluations. Thus, for each session, eighteen pairs were presented to observers in a random order. In addition, and in order to evaluate the effects of coding impairment at the test data rates (without including limitations of search range), we also paired the uncompressed originals against the reference sequences.
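The structure of one session can be sketched by enumerating the pairs described above. The tuple layout, the function name, and the seeded shuffle are illustrative assumptions; the data rates, search ranges, and pair counts come from the text.

```python
import itertools
import random

DATA_RATES_MBPS = (2, 4, 6)
TEST_SEARCH_RANGES = ((30, 22), (60, 42), (100, 70))
REFERENCE_SEARCH_RANGE = (200, 140)


def session_pairs(seed=0):
    """Return the eighteen (first, second) presentations for one session.

    Each of the nine rate/range combinations appears twice, once in each
    order, and the reference always uses the same data rate as the test.
    """
    pairs = []
    for rate, test_range in itertools.product(DATA_RATES_MBPS, TEST_SEARCH_RANGES):
        test = ("test", rate, test_range)
        reference = ("reference", rate, REFERENCE_SEARCH_RANGE)
        pairs.append((reference, test))
        pairs.append((test, reference))
    random.Random(seed).shuffle(pairs)  # random order of presentation
    return pairs


pairs = session_pairs()
print(len(pairs))
```

The additional pairs that compare uncompressed originals with the reference sequences are not included in this sketch.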
We asked the assessors to judge the relative and absolute quality of the sequences in each pair by marking a score sheet during the scoring portion of the presentation (we refer again to [12] for more details). To help the assessors calibrate their evaluation of quality, they were first shown the original sequence without degradation and then the most degraded sequence, prior to beginning the actual rating experiment. After this, observers proceeded with the formal testing, in which, as mentioned above, they did not know the order of presentation, or, for that matter, the specific coding parameters for which we were testing.
We calculated the average statistics for all observers. The results for three representative video segments are shown in Figure 11; these are the same as "Carousel," "Segment 1," and "Segment 2" in Figures 9 and 10. The Diff data shown in Figure 11(a) were obtained by subtracting the reference sequence score from the test sequence score for each pair and for each observer. This figure uses asterisks to show the average values of Diff for all observers; also shown are the one-sigma intervals (68% confidence intervals) for these statistics. Positive values of Diff mean that, on the average, observers rated the test sequences as being of higher quality than the reference sequences. On the other hand, negative values indicate that observers were able to perceive impairments in the sequences of limited search range, as compared to their corresponding references. We note that differences in Diff values of the order of 0.1 are statistically insignificant. In fact, we believe that Diff values in the interval -0.1 ≤ Diff ≤ 0.1 correspond to cases in which test and reference sequences were indistinguishable.
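A minimal sketch of the Diff statistic follows. The text does not state how the one-sigma interval was computed; the sample standard deviation across observers is used here as an assumption, and the scores are invented for illustration.

```python
import statistics


def diff_summary(test_scores, reference_scores):
    """Mean Diff (test minus reference) and a one-sigma interval around it.

    Assumption: sigma is the sample standard deviation of Diff over observers.
    """
    diffs = [t - r for t, r in zip(test_scores, reference_scores)]
    mean = statistics.mean(diffs)
    sigma = statistics.stdev(diffs)
    return mean, (mean - sigma, mean + sigma)


# Hypothetical scores from four observers for one test/reference pair.
test_scores = [3.0, 2.5, 3.5, 3.0]
reference_scores = [3.0, 3.0, 3.5, 3.5]
mean_diff, interval = diff_summary(test_scores, reference_scores)
indistinguishable = -0.1 <= mean_diff <= 0.1
print(mean_diff, interval, indistinguishable)
```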
Figure 11
For each video segment, we also analyzed separately the results for a subgroup of four "expert" observers from the pool of fourteen observers. "Experts" were selected from the four largest scores resulting from an algorithm in which, for each test-reference difference (or Diff score), we counted the number of negative Diff scores and then subtracted the number of positive Diff scores (zero differences were excluded from this count). In other words, this algorithm rewards observers who favored the reference sequence over the test sequence, and penalizes those who favored test over reference. The results for these "expert" groups are shown with black circles in Figure 11(a).
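The selection rule for "expert" observers described above can be sketched as follows. The observer labels and Diff values are invented, and the handling of tied scores (input order) is an assumption, since the text does not address ties.

```python
def expert_score(diffs):
    """Number of negative Diff scores minus number of positive ones.

    Zero differences are excluded from both counts.
    """
    negatives = sum(1 for d in diffs if d < 0)
    positives = sum(1 for d in diffs if d > 0)
    return negatives - positives


def select_experts(diffs_by_observer, count=4):
    """Observers with the largest scores; ties keep input order (assumption)."""
    ranked = sorted(
        diffs_by_observer,
        key=lambda name: expert_score(diffs_by_observer[name]),
        reverse=True,
    )
    return ranked[:count]


# Hypothetical Diff scores for six observers.
diffs_by_observer = {
    "A": [-1.0, -0.5, 0.0, -1.0],
    "B": [0.5, -0.5, 0.0, 0.0],
    "C": [-1.0, -1.0, -1.0, 0.5],
    "D": [0.5, 0.5, 0.0, 0.0],
    "E": [-0.5, 0.0, 0.0, 0.0],
    "F": [-1.0, -1.0, -1.0, -1.0],
}
experts = select_experts(diffs_by_observer)
print(experts)
```

Note that this rule selects the observers most inclined to prefer the reference, so the "expert" averages are biased toward negative Diff values by construction.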
The results from subjective experiments tend to confirm some of the trends and observations of the objective PSNR measurements. However, they also demonstrate that measuring differences in PSNR is much easier than "seeing" the effects of those differences in compressed MPEG video. In fact, the quality of sequences with extreme and complex motion, such as "Segment 1" (Basketball), appears to be unaffected by limitations in search range! Only absolute data rate appears to have an impact on subjective video quality. Clearly, the human eye is not capable of discerning small improvements in PSNR (due to increased search range) when the video has much disorganized motion. This conclusion appears to apply to both expert and nonexpert observers.
Humans appear to be a little more sensitive to differences in PSNR when the motion is more organized and predictable. Such is the case for "Carousel," in which statistically significant differences are observed in Figure 11, particularly for low data rates and expert observers. In fact, the effects of limited search range, when observable, are more significant at the lower bit rates. It is important to note, however, that at the 2-Mbps data rate, none of the three sequences we tested are of acceptable quality. As shown in Figure 11(b), all of the sequences were rated as "poor" when coded at this rate. Thus, the fact that observers could see that search-range-limited sequences were somewhat worse than "poor" reference sequences is of doubtful value. Regardless of data rate, however, we could not find one example for which expert assessors could tell the difference between a 100 x 70 and a 200 x 140 search range. We thus believe that this search range is a conservative value we can use for robust MPEG-2 encoding of CCIR 601 video.
5. Conclusions
Although our experimental data is strictly relevant only to our particular experimental setup and to the video sequences we tested, we believe that our conclusions have a much wider validity. We have conducted our experiments by using practical algorithms and intentionally looking for a range of demanding sequences that we believe stress the requirements for motion-estimation search range.
On the basis of the experimental data we conclude that, for CCIR 601 video and film at a 4:3 aspect ratio, a search range of around 100 x 70 is sufficient for robust motion estimation. In fact, this may well be a conservative search range to use. Even when this search range is clearly exceeded by the 99% statistics of a video segment, we find that objective differences in video quality are nonexistent or insignificant, while subjective differences in video quality are simply not observed. Furthermore, our experiments show that further constraining the search range will have an impact primarily on compression at very low data rates, where the quality of the video is poor, regardless of search range.
When these results are extrapolated for video with the same number of samples per picture, but with an aspect ratio of 16:9 instead of 4:3, the 100 x 70 search range becomes a more symmetrical 75 x 70. Further extrapolating from the CCIR 601 16:9 aspect ratio to the ATSC 16:9, 1920 x 1080 high-definition format, the required search range for robust MPEG-2 encoding becomes 200 x 158. The latter result takes into account the different pixel shapes between CCIR 601 720 x 480 and the ATSC 1920 x 1080 picture format.
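The arithmetic behind this extrapolation can be reconstructed as in the sketch below. This is one reading that reproduces the quoted figures, not necessarily the exact derivation used: the horizontal range is first scaled by the ratio of aspect ratios (same samples per line spread over a wider picture), then both ranges are scaled by the ratio of sample counts, rounding up to whole pixels.

```python
import math
from fractions import Fraction


def rescale_for_aspect(h_range, old_aspect, new_aspect):
    """Horizontal range when the same samples per line cover a wider picture.

    Each pixel spans more picture width, so the same motion covers fewer pixels.
    """
    return h_range * old_aspect / new_aspect


def rescale_for_resolution(value, old_samples, new_samples):
    """Range when the picture is resampled from old_samples to new_samples."""
    return value * Fraction(new_samples, old_samples)


# CCIR 601 (720 x 480) at 4:3 -> CCIR 601 at 16:9.
h_wide = rescale_for_aspect(Fraction(100), Fraction(4, 3), Fraction(16, 9))
v_wide = Fraction(70)

# CCIR 601 at 16:9 -> ATSC 1920 x 1080; rounding up is an assumption.
h_hd = math.ceil(rescale_for_resolution(h_wide, 720, 1920))
v_hd = math.ceil(rescale_for_resolution(v_wide, 480, 1080))
print(h_wide, v_wide, h_hd, v_hd)
```

Exact fractions are used so that the intermediate value of 157.5 for the vertical range is not affected by floating-point rounding.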
Acknowledgments
The authors appreciate the help and encouragement from their colleagues Jack Kouloheris and Wai Man Lam.
**Trademark or registered trademark of Moving Picture Experts Group.
Footnotes
¹ In practice, m-values larger than 3 are not used; therefore, they are not considered here.
² This paper focuses on frame-structure MPEG-2 encoding. However, the results are general and also apply to field-structure coding.
³ C. Gonzales, J. Kouloheris, W. Lam, H. Yeo, and C. J. Kuo, "A Family of Cost Functions for Motion Estimation in MPEG-2," work in preparation.
⁴ In this paper we measure video quality by either peak signal-to-noise ratio (PSNR) or a subjective evaluation of picture quality.
Received June 10, 1998; accepted for publication June 2, 1999