Video Semantic Summarization Systems

The Video Semantic Summarization System allows users to access personalized video summaries based on their requested preferences, keywords, and time, while dynamically adapting to transmission and device constraints.  Our Video Semantic Summarization System comprises three major components: the user client, the database server, and the video middleware.  Figure 1 illustrates the block diagram of these components.

The user client component, which is shown in the left module of the figure, allows the user to specify his or her preferences along with client profiles, and receives the personalized summaries on the video client. The database server component, which is displayed in the right module of Figure 1, stores all the video contents as well as their corresponding MPEG-7 descriptions. The video middleware represents the intermediate component of the figure, and processes the user preference with the video descriptions to generate a personalized summarization, which is appropriately transcoded and transmitted to the user.  The video middleware is appropriately divided into the semantic transcoder and the transmission server. The semantic transcoder accommodates user preferences, video requests, and personalized summaries. The transmission server adapts to different video sources, transmissions, and devices.

Figure 1: Block Diagram of Video Semantic Summarization System.



Table of Contents

    Video Semantic Summarization System

            User Client

            Database Server

            Video Middleware

                    Semantic Transcoder

                    Transmission Server

    MPEG-7 Video Annotation Tool

            Video Segmentation

            Video Content Semantic Lexicon

            VideoAnnEx Graphical User Interface

            MPEG-7 Video Segment Description

    Summarization Techniques

    Universal Tuner Pervasive Application

    Video Editing Software

            MPEG-1/2 Compressed Domain Composition

            VideoEd Graphical User Interface

    Video Semantic Summarization Demos

            Viacom Universal Tuner

            Viacom Web Interface


User Client

        The user client allows a user to request video by specifying a user request and sharing his/her client profile.  In response, the video client receives and displays the personalized video to the user.  The user client module of Figure 1 illustrates these three components and their data exchange with the video middleware module.  The user sends a request for specific content to the query block of the video middleware.  The user request can take the form of preference topics, certain keywords, and the user’s time constraint for watching the video.

        At the same time, several client profiles are also sent to the control module of the video middleware, including the user profile (e.g., an English-speaking American user residing in New York City), the device profile (e.g., a Palm IIIc with a limited color display, no audio capabilities, and limited memory capacity), and the transmission profile (e.g., internet access through a 56K modem).  These client profiles allow the middleware to dynamically perform the appropriate video transcoding on the requested video content.  Thus the delivered multimedia content can be customized and optimized for the current user environment.
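The request and the three client profiles described above could be represented as simple records. The following is a minimal sketch, not the system's actual data model: every class and field name is hypothetical, the profile values mirror the examples in the text, and the topic, keyword, and 120-second limit are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserRequest:
    """What the user asks for (all field names are illustrative)."""
    preference_topics: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    time_limit_s: Optional[float] = None  # viewing-time constraint in seconds


@dataclass
class UserProfile:
    language: str
    location: str


@dataclass
class DeviceProfile:
    model: str
    color_display: bool
    has_audio: bool


@dataclass
class TransmissionProfile:
    link: str
    bandwidth_bps: int


@dataclass
class ClientProfiles:
    """The three profiles sent to the middleware's control module."""
    user: UserProfile
    device: DeviceProfile
    transmission: TransmissionProfile


# Illustrative request; the topic, keyword, and time limit are invented.
request = UserRequest(preference_topics=["sports"], keywords=["launch"], time_limit_s=120)

# Profile values taken from the examples in the text.
profiles = ClientProfiles(
    user=UserProfile(language="English", location="New York City"),
    device=DeviceProfile(model="Palm IIIc", color_display=True, has_audio=False),
    transmission=TransmissionProfile(link="56K modem", bandwidth_bps=56_000),
)
```

Keeping the request separate from the profiles reflects the split in Figure 1: the request goes to the query block, while the profiles go to the control module.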


Figure 2: The User Client Interface on the Palm OS PDA for 3 Personalized Video Scenarios: (a) Video-on-demand with interactive hyperlinks,  (b) Summarized video based on preference topics and time constraint, and (c) Summarized video based on query keywords and time constraint.

        Next, a personalized video is delivered to the user and displayed on the video client.  The video client receives the multimedia stream from the video transcoder of the video middleware in accordance with the client profile and the user request.  The video client application is built on our Universal Tuner system [7]. This system includes two parts -- a software video transcoder at the server end, which transcodes MPEG-1/2 video or live broadcast video into client-dependent formats, and the client application software on the color or black-and-white Palm OS PDA.

        Figure 2 illustrates the user client interface on the Palm OS PDA for three personalized video scenarios.  The first emulator, Figure 2(a), shows video-on-demand on the Universal Tuner client application.  The interface allows the user to select among different channels of video content.  Furthermore, interactive hyperlinks are embedded in the video content and are shown as clickable hypertext on the device.  The second emulator, Figure 2(b), illustrates the video client interface of our summarization scenario based on preference topics.  The user selects a video to summarize according to two listed preference choices and a total playing time constraint.  Similarly, the third emulator of Figure 2(c) depicts the video client of another summarization scenario based on query keywords.  The user is only interested in multimedia content with the specified keywords, and requests the video summary to be limited to the desired time constraint.

        In summary, the user client allows the user to specify his/her preference, provides critical persistent data through the client profiles, and interfaces the user input/output with a video client.  The user client communicates through the video middleware in order to retrieve the appropriate personalized contents. In the following section, the database server, which is where the video content resides, is described.  



Database Server

        The database server provides the video middleware with the video descriptions and corresponding video sources in order to generate the personalized contents.  In the database server, video content is stored, analyzed, and annotated with MPEG-7 descriptions.  Figure 1 illustrates the database server with three major components.  First, the video sources identify the location and format of our content.  Because our video transcoding middleware accepts various video file types, bit rates, and resolutions, the database can store videos in any of the most popular formats (e.g., MPEG-1/2, AVI, and QuickTime). Also, even though each output summarized video is a composition of video clips, our database does not need to pre-segment the video sequences into individual shots. Each video needs to be saved in only one format, because the video transcoding middleware can randomly access and decode any video clip in the video sequence in real time. Second, manual, semi-automatic, or automatic annotation tools assist in generating the descriptions for videos.  The annotation can range from high-level semantic concepts to low-level feature descriptions.  Third, the video descriptions are stored as MPEG-7 XML files.  These MPEG-7 descriptions identify the underlying content of the videos.

        Annotation tools generate semantic meanings as well as other feature descriptions of the video, and output them as MPEG-7 description schemes.  Our annotation tool allows annotators to tag video segments with detailed text descriptions, which are based on a defined semantic lexicon, on individual shots or on individual regions/objects in the key frame of each shot.  In addition, depending on the application of the semantic summarization system and the corresponding video content, the annotation tool can be easily customized. For instance, the lexicon can be changed based on the content, or the shot boundary detection method can be customized to a hierarchical structure, which represents hierarchical semantic events.



Video Middleware

        The video middleware interfaces the user client and the database server.  As shown in Figure 1, the middleware consists of the semantic transcoder and the transmission server.  In the semantic transcoder, a query and retrieval component matches the user preference with the MPEG-7 video descriptions to generate a video summarization.  In the transmission server, the control module determines the appropriate transcoding for the desired video content to the user client, which in turn depends on the client profiles.  Afterwards, the summarized videos are optimally transcoded and transmitted to the user’s video client.  In the following two subsections, the semantic transcoder and the transmission server of the video middleware are explained.

Semantic Transcoder

        The semantic transcoder performs the matching of the user request with the video description.  The user may specify his/her request in terms of preference topics, topic ranking, query keywords, and time constraint.  The query module receives the user request to determine whether there are contents that fit these preferences.  A search request is sent to the database server, which in turn responds with the appropriate MPEG-7 video descriptions.  The MPEG-7 descriptions are first passed to the MPEG-7 parser, and the desired query results are extracted.  These query results identify the matched semantic descriptions and the corresponding video segments.
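The keyword side of this matching step can be sketched as follows. This is an illustration only: it assumes the MPEG-7 descriptions have already been parsed into one record per shot, and the record layout, the sample shots, and the function name `match_shots` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical parsed annotations: one dict per shot, labels drawn from a lexicon.
shots = [
    {"id": 1, "start_s": 0.0, "duration_s": 4.5,
     "labels": ["indoors", "person speaking"]},
    {"id": 2, "start_s": 4.5, "duration_s": 8.0,
     "labels": ["outdoors", "rockets", "take-off or launch"]},
    {"id": 3, "start_s": 12.5, "duration_s": 6.0,
     "labels": ["outer space", "astronomy"]},
]


def match_shots(shots: List[Dict], keywords: List[str]) -> List[Dict]:
    """Return the shots whose labels contain at least one query keyword."""
    wanted = {k.lower() for k in keywords}
    return [s for s in shots if wanted & {label.lower() for label in s["labels"]}]


results = match_shots(shots, ["Rockets", "astronomy"])
print([s["id"] for s in results])  # [2, 3]
```

The matched records carry the start time and duration of each segment, which is the information the later summarization and transcoding stages need.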

        Having matched the user preference with the video descriptions from the MPEG-7 XML, the appropriate videos are summarized and delivered to the user.  However, there are two additional issues that need to be addressed before the summarized video segments are sent back to the user.  First, there may be too many query results that satisfy the search request.  Consequently, the resulting summarization can be refined based on the priority ranking, playing time, and the user profile.  These different summarization techniques are discussed later in more detail.

        The second issue concerns the format and transmission of the video segments.  If the video sources in the database server are stored in varying formats, bit rates, frame rates, and image sizes, the video segments that comprise the summarization are likely to differ in the same ways.  These variations may not be compatible with, or optimized for, the user’s video client.  As a result, the transmission server dynamically transcodes all the video segments and streams a compatible video summary to the user, as discussed in the next section.

Transmission Server

        The Transmission Server is designed to deliver customized videos to various connected desktops, pervasive devices, or internet browsers. We have implemented a prototype of variable-complexity codecs that can transcode AVI or MPEG-1/2 video, or live TV broadcasts, in real time and stream the result to various platforms.
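How a control module might turn client profiles into transcoding parameters can be sketched as below. This is a hypothetical illustration, not the prototype's logic: the `TranscodeTarget` fields, the 160-pixel cap (taken from the full-screen Palm mode described later), and the assumption that 80 percent of the link bandwidth is available for video are all placeholders.

```python
from dataclasses import dataclass


@dataclass
class TranscodeTarget:
    width: int
    height: int
    color: bool
    audio: bool
    max_bitrate_bps: int


def choose_target(color_display: bool, has_audio: bool, screen_px: int,
                  bandwidth_bps: int, share_percent: int = 80) -> TranscodeTarget:
    """Pick output parameters that the device and the link can handle.

    share_percent is an assumed fraction of the link reserved for video.
    """
    side = min(screen_px, 160)  # assumed cap: square frame no larger than 160x160
    return TranscodeTarget(
        width=side,
        height=side,
        color=color_display,   # drop color for black-and-white devices
        audio=has_audio,       # drop the audio track if the device cannot play it
        max_bitrate_bps=bandwidth_bps * share_percent // 100,
    )


# Example: a color device without audio on a 56K modem link.
target = choose_target(color_display=True, has_audio=False,
                       screen_px=160, bandwidth_bps=56_000)
print(target)
```

Integer arithmetic is used for the bit-rate budget so that the result is exact for any whole-number bandwidth.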



MPEG-7 Video Annotation Tool

        The effectiveness of the Video Semantic Summarization System is highly dependent on the annotation descriptions of our content.  If the annotation tool generates useful and detailed MPEG-7 descriptions for an application, then the resulting summarization will be comprehensive and desirable.  In this section, we outline the functionalities and feature attributes generated by our annotation tool, called IBM VideoAnnEx Annotation Tool.


Figure 4: Four Major Components of the IBM VideoAnnEx Annotation Tool.  

        Four major components describe the annotation process and are depicted in Figure 4.  First, video segmentation is performed to cut the video sequence into smaller video units.  Second, a semantic lexicon is defined in order to regulate the video content descriptions. Third, an annotator labels the video segments with the semantic descriptions, and relevance scores are calculated to reflect the importance of each segment with respect to the labels.  Fourth, the MPEG-7 descriptions of the annotation process are output directly from the IBM VideoAnnEx Annotation Tool.  The goal of the video annotation is to categorize the semantic content of each video unit, assign the corresponding relevance score, and output the MPEG-7 XML description file.  The following four subsections describe these components in further detail.

Video Segmentation

        A short video clip can be annotated simply by describing its content in its entirety.  However, when the video is longer, annotation of its content can benefit from segmenting the video into smaller units.  In our IBM VideoAnnEx Annotation Tool, the annotation is performed at the video shot level.  A video shot is defined as a continuous camera-captured segment of a scene, and is usually well defined for most video content.  Given the shot boundaries, the annotations are assigned to each video shot.

        For a video sequence, shot boundary detection is performed to divide the video into multiple shots. The IBM CueVideo Toolkit performs the shot detection algorithm, which is based on the multiple timescale differencing of the color histogram [1].  CueVideo segments our video content into shorter shots, where scene cuts, dissolves, and fades are effectively detected. Because each video shot can be described and retrieved independently of each other, the next step is to define our lexicon for shot descriptions.
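The basic idea behind histogram-based shot boundary detection can be sketched as follows. This is a deliberately simplified single-timescale version that only catches abrupt cuts; it is not the CueVideo algorithm, which uses multiple timescales and also detects dissolves and fades. Frames are modeled as flat lists of 8-bit intensities, and the bin count and threshold are arbitrary assumptions.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 4, levels: int = 256) -> List[float]:
    """Normalized intensity histogram of one frame (a flat list of pixel values)."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // levels] += 1
    total = len(frame)
    return [c / total for c in counts]


def detect_cuts(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return the indices i at which a new shot starts at frame i."""
    hists = [histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        # L1 distance between consecutive histograms, which lies in [0, 2].
        diff = sum(abs(a - b) for a, b in zip(hists[i - 1], hists[i]))
        if diff > threshold:
            cuts.append(i)
    return cuts


dark = [10, 20, 30, 40]
bright = [200, 210, 220, 230]
frames = [dark, dark, dark, bright, bright]
print(detect_cuts(frames))  # [3]
```

Comparing frames at several temporal distances, rather than only adjacent frames, is what lets a multi-timescale detector pick up gradual transitions that this sketch would miss.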

Video Content Semantic Lexicon

        Given the segmentation of video content into video shots, the second step is to define the semantic lexicon in which to label the shots.  A video shot can fundamentally be described by three attributes.  The first is the background surrounding of where the shot was captured by the camera, which is referred to as the static scene.  The second attribute is the collection of significant subjects involved in the shot sequence, which is referred to as the key object.  Lastly, the third attribute is the corresponding action taken by some of the key objects, which is referred to as the event.  These three types of lexicon define the vocabulary for our video content. 

        An example of our lexicon is shown as follows. Our vocabulary for the static scenes includes “indoors”, “outdoors”, and “outer space”.  Furthermore, each category is hierarchically sub-classified to comprise more specific scene descriptions.  For example, “outdoors” consists of these three sub-categories: “natural scene – low level”, “natural scene – high level”, and “man-made”.  Our simplified vocabulary for the key objects includes the following categories: “animals”, “human”, “man-made structures”, “man-made objects”, “nature objects”, “graphics & text”, “transportation”, and “astronomy”.  In addition, each key object category is subdivided into more specific object descriptions; for instance, "rockets", "fire", "flag", "flower" and "robots."  For our events vocabulary, only six events are of specific interest to our summarization work, and they are the following: "water skiing", "boat sailing", "person speaking", "landing", "take-off or launch", and "explosion."
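The three-part lexicon can be held in a small nested mapping. The sketch below uses only the category names listed above; the structure, the name `LEXICON`, and the helper `is_valid_label` are hypothetical. The specific key objects ("rockets", "fire", and so on) are left out because the text does not say which category each belongs to.

```python
# Each lexicon type maps a category to its list of known sub-categories.
LEXICON = {
    "static scene": {
        "indoors": [],
        "outdoors": [
            "natural scene – low level",
            "natural scene – high level",
            "man-made",
        ],
        "outer space": [],
    },
    "key object": {name: [] for name in [
        "animals", "human", "man-made structures", "man-made objects",
        "nature objects", "graphics & text", "transportation", "astronomy",
    ]},
    "event": {name: [] for name in [
        "water skiing", "boat sailing", "person speaking",
        "landing", "take-off or launch", "explosion",
    ]},
}


def is_valid_label(kind: str, label: str) -> bool:
    """True if label is a category or sub-category of the given lexicon type."""
    categories = LEXICON.get(kind, {})
    return label in categories or any(label in subs for subs in categories.values())


print(is_valid_label("event", "explosion"))        # True
print(is_valid_label("static scene", "man-made"))  # True
print(is_valid_label("event", "indoors"))          # False
```

Because the lexicon is plain data, swapping in a different vocabulary for another summarization application only requires replacing the mapping.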

        Using the defined vocabulary for static scenes, key objects, and events, the lexicon is imported into our IBM VideoAnnEx Annotation Tool for describing and labeling each video shot.  Each shot is labeled for its content with respect to the selected lexicon.   Note that the lexicon depends on the summarization application, and can be easily modified and imported into the annotation tool.

VideoAnnEx Graphical User Interface

        The IBM VideoAnnEx Annotation Tool assists authors in the task of annotating video sequences.  Each shot in the video sequence can be annotated with static scenes, key objects, events, and other keywords.  These descriptions are labeled for each shot and are stored as MPEG-7 descriptions in the output XML file.  VideoAnnEx can also save, open, and retrieve MPEG-7 files in order to display the annotations for corresponding video sequences.

        VideoAnnEx is divided into four graphical sections as illustrated in Figure 5.  On the upper right-hand corner of the tool is the Video Playback window with shot information.  On the upper left-hand corner of the tool is the Shot Annotation module with a key frame image display.  On the bottom portion of the tool is the Views Panel, which offers two different views of the annotation preview.  A fourth component, not shown in Figure 5, is the Region Annotation pop-up window for specifying annotated regions.  These four sections provide interactivity to assist authors using the annotation tool.


Figure 5: The IBM VideoAnnEx Annotation Tool.

        The Video Playback window displays the opened MPEG video sequence.  As the video is played back in the display window, the current shot information is given as well.  The Shot Annotation module displays the defined semantic lexicons and the key frame window. The key frame is a representative image of the video shot segment, and thus offers an instantaneous recap of the whole video shot.  This is the region where the annotator selects the descriptions for the video segment.  The Views Panel displays two different previews of representative images of the video.  The Frames in the Shot view shows all the I-frames as representative images of the current video shot, while the Shots in the Video view (as in the bottom of Figure 5) shows the key frame of each shot as representative images over the entire video.  As the annotator labels each shot, the descriptions are displayed below the corresponding key frames in the Shots in the Video view.  Furthermore, after the MPEG-7 descriptions are saved into an XML file, anyone can load and review these files at a later time by previewing the annotations in this Views Panel.  The Region Annotation window allows the author to associate a rectangular region with a labeled text annotation.  After the text annotations are identified in the Shot Annotation window, each description can be associated with a corresponding region on the selected key frame of that shot.  The region annotation is also saved in the MPEG-7 descriptions, as described next. A more detailed description of the annotation tool and its active learning components is given in [16].

MPEG-7 Video Segment Description

        The IBM VideoAnnEx Annotation Tool segments the video content into shots, labels each video shot with some descriptions, identifies the associated region boundary, and generates an MPEG-7 XML description [10].  The ISO-standardized MPEG-7 defines the compatible scheme and language to represent the semantic meaning of multimedia content. Our MPEG-7 output is the Video Segment Description Scheme, as shown in Figure 6. In Figure 6, we annotate the first shot of a video, which includes 136 frames, as “Slide Representation” and annotate a rectangular region in the key frame as “Graphics & Text.” In MPEG-7, each video shot is defined as a Video Segment, where the shot start time (shown in the Thh:mm:ss:nnF30 format) and duration are given and the annotations are described. Furthermore, the embedded <SpatioTemporalDecomposition> tag allows us to specify the region location and the corresponding text annotation in a key frame. In Figure 6, the key frame is the 82nd frame of the video sequence.  The annotated region is specified by the <SpatialLocator> tag. It is identified by a polygon whose n vertex coordinates are recorded in the order of <x0, y0>, <x1, y1>, …, <xn-1, yn-1> after the <CoordsI> tag. For multiple regions in a key frame, the system needs to repeat the section between <StillRegion> and its closing tag inside the <SpatialDecomposition> section. If the annotator needs to label multiple frames in the shot, then the system needs to repeat the <StillRegion> section inside the <SpatioTemporalDecomposition> section.
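Two small pieces of this description lend themselves to a worked example: converting a Thh:mm:ss:nnF30 start time to seconds, and pairing the coordinate list of a region into polygon vertices. The sketch below makes two assumptions that are not confirmed by the text: that nn counts fractions of a second in units of 1/30 (generalized to 1/N for an FN suffix), and that the coordinate list is whitespace-separated. The function names and sample values are hypothetical.

```python
import re
from typing import List, Tuple

# Matches strings such as "T00:00:04:15F30".
TIME_POINT = re.compile(r"^T(\d{2}):(\d{2}):(\d{2}):(\d+)F(\d+)$")


def time_point_to_seconds(value: str) -> float:
    """Convert a 'Thh:mm:ss:nnFN' time point to seconds (nn counts 1/N-second units)."""
    match = TIME_POINT.match(value)
    if match is None:
        raise ValueError(f"not a Thh:mm:ss:nnFN time point: {value!r}")
    hh, mm, ss, nn, per_second = (int(g) for g in match.groups())
    return hh * 3600 + mm * 60 + ss + nn / per_second


def coords_to_vertices(text: str) -> List[Tuple[int, int]]:
    """Pair a whitespace-separated 'x0 y0 x1 y1 ...' list into polygon vertices."""
    numbers = [int(token) for token in text.split()]
    if len(numbers) % 2:
        raise ValueError("odd number of coordinates")
    return list(zip(numbers[0::2], numbers[1::2]))


print(time_point_to_seconds("T00:00:04:15F30"))         # 4.5
print(coords_to_vertices("10 20 110 20 110 90 10 90"))  # four corners of a rectangle
```

A rectangular region, as produced by the Region Annotation window, is simply the four-vertex case of the polygon.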


Figure 6: Example of MPEG-7 Video Segment Description XML file.

        In our Video Semantic Summarization System, a relevance score is automatically assigned to the video based on the confidence value of the classification. For our system, the annotation process generates a relevance score for the whole video sequence and for each attribute based on the probability of that attribute to the corresponding video unit.  After these steps, we implemented an interface to allow users to manually correct the annotation as well as the scene boundaries. All these results are then saved as an MPEG-7 XML file.



Summarization Techniques

        Given annotation descriptions and corresponding scores for our video content, this section describes the summarization techniques used to customize video delivery based on user preference and time constraint. The objective of video summarization is to show a shortened video that, based on user preference, maintains as much semantic content as possible within the desired time constraint.  Using shots as the basic video unit, there are four forms of video summarization based on temporal compression of the original video sequence: (1) maintain or delete each video shot depending on user preference, (2) extract temporal subsets of the original shot depending on attribute specification, (3) condense each shot in fast-forward temporal mode while maintaining comprehension, and (4) use a weighted combination of the previous forms.


Figure 7: Three Forms of Video Summarization Techniques.

        In the first formulation, each video shot is either included in or excluded from the final video summary.  In each shot, video annotation describes the semantic content with attributes and corresponding scores.  Assume there are a total of N attribute categories.  Let $P = [p_1, p_2, \ldots, p_N]^T$ be the user preference vector, where $p_i$ denotes the preference weighting for attribute i, $1 \le i \le N$.  Assume there are a total of M shots.  Let $S = [s_1, s_2, \ldots, s_M]^T$ be the shot segments that comprise the original video sequence, where $s_i$ denotes shot number i, $1 \le i \le M$.  Subsequently, the attribute score $a_{i,j}$ is defined as the relevance of attribute i in shot j, $1 \le i \le N$ and $1 \le j \le M$.  The attribute matrix A is:

$$A = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,M} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N,1} & a_{N,2} & \cdots & a_{N,M} \end{bmatrix}$$

It then follows that the weighted attribute $w_i$ for shot i given the user preference $P$ is calculated as:

$$W = [w_1, w_2, \ldots, w_M]^T = A^T P, \qquad w_i = \sum_{k=1}^{N} p_k \, a_{k,i}$$

$W$ specifies the weighted importance of each shot for the user.  Assume shot $s_i$ spans a duration of $t_i$, $1 \le i \le M$.  Then this summary formulation suggests that we include shot $s_i$ if the importance weighting $w_i$ of this shot is greater than some threshold, $\theta$, and exclude it otherwise.  $\theta$ is determined such that the sum of the durations $t_i$ of the included shots is less than the user-specified time constraint.  As a result, each shot is either included or excluded in the final video summary.
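The first formulation can be sketched in a few lines. The code below is an illustration under stated assumptions, not the system's implementation: the attribute matrix is stored with one row per attribute and one column per shot, the threshold is searched over the distinct shot weights from lowest to highest, and the time constraint is treated as "at most" rather than strictly "less than". The function names and sample numbers are made up.

```python
from typing import List, Sequence


def shot_weights(A: Sequence[Sequence[float]], p: Sequence[float]) -> List[float]:
    """Weight of shot j = sum over attributes i of p[i] * A[i][j]."""
    n_shots = len(A[0])
    return [sum(p[i] * A[i][j] for i in range(len(p))) for j in range(n_shots)]


def select_shots(weights: Sequence[float], durations: Sequence[float],
                 time_limit: float) -> List[int]:
    """Include shots whose weight exceeds the lowest threshold that fits the time limit."""
    # Try "keep everything" first, then raise the threshold one weight value at a time.
    for theta in [float("-inf")] + sorted(set(weights)):
        chosen = [j for j, w in enumerate(weights) if w > theta]
        if sum(durations[j] for j in chosen) <= time_limit:
            return chosen
    return []  # only reached for a negative time limit


# Two attributes (rows), three shots (columns); all numbers are illustrative.
A = [[0.9, 0.1, 0.5],
     [0.2, 0.8, 0.4]]
p = [1.0, 0.5]              # user preference per attribute
durations = [10, 20, 15]    # seconds per shot

w = shot_weights(A, p)
print(select_shots(w, durations, time_limit=30))  # [0, 2]
```

With a 30-second limit, keeping all three shots (45 seconds) does not fit, so the lowest-weighted shot is dropped and the remaining two (25 seconds) form the summary.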

        In the second formulation, we extend the attribute scoring from the shot level to the time domain.  In the first form, video annotation specifies a relevance score of each attribute for each shot based on the semantic relevance of the shot content.  Here, attribute scoring will incorporate low-level features to more precisely annotate each frame of the shot.  For example, in a sports clip, there should be higher relevance scoring associated with those temporal frames that demonstrate higher motion components.  So if we want to include the high-activity component of a shot, then only the highest-scoring subset is extracted.  Consequently, the attribute score $a_{i,j}$ for attribute i and shot j is no longer a constant within a shot, but becomes a function of time based on low-level features.  The attribute scoring function $a_{i,j}(t)$ can be calculated automatically as the product of the constant attribute score $a_{i,j}$ of shot j and a normalized low-level feature weighting $f_j(t)$, where $0 \le f_j(t) \le 1$ and t spans the duration of the shot.  Thus,

$$a_{i,j}(t) = a_{i,j} \, f_j(t).$$

Similar to the first formulation, the attribute scoring function $a_{i,j}(t)$ now determines the time interval $[t_{j,\mathrm{start}}, t_{j,\mathrm{end}}]$ of shot subset $s'_j$ to be included in the final video summary.
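One way to turn a time-varying score into an interval is to pick the contiguous run of frames with the highest total score. The sketch below assumes that reading; the text does not specify how the interval is chosen. The per-frame feature weights, the fixed window length, and the function name `best_interval` are hypothetical.

```python
from typing import Sequence, Tuple


def best_interval(shot_score: float, feature: Sequence[float],
                  window: int) -> Tuple[int, int]:
    """Return (start, end) frame indices, end exclusive, of the highest-scoring window.

    The per-frame score is the constant shot score times a normalized
    low-level feature weight in [0, 1], e.g. motion activity.
    """
    if not 0 < window <= len(feature):
        raise ValueError("window must be between 1 and the shot length")
    scores = [shot_score * f for f in feature]
    current = sum(scores[:window])
    best, best_start = current, 0
    # Slide the window one frame at a time, updating the running sum.
    for start in range(1, len(scores) - window + 1):
        current += scores[start + window - 1] - scores[start - 1]
        if current > best:
            best, best_start = current, start
    return best_start, best_start + window


# Illustrative motion weights for a six-frame shot with a burst in the middle.
motion = [0.1, 0.2, 0.9, 1.0, 0.8, 0.3]
print(best_interval(0.5, motion, window=3))  # (2, 5)
```

Ties are resolved in favor of the earliest window, since a later window replaces the best only when its score is strictly higher.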

        In the third formulation, the video is compressed temporally by subsampling the total number of frames in the original sequence.  The resulting video summary resembles a fast-forward playback in a shorter period of time.  When video is temporally compressed in this manner, maintaining comprehension becomes the most critical issue.  Comprehension depends on the compression factor, which in turn is highly dependent on the content and the amount of motion in the original video.  For example, commercial videos, which consist of high motion and very short shots, cannot tolerate a high compression factor.  On the other hand, interview or conferencing videos, which consist of nearly stationary people with limited motion, can withstand high compression factors.  Consequently, we can assume that to maintain comprehension of the summarized video, the cumulative motion of the resulting subsampled video must be below a perceptually acceptable threshold.  This requirement then determines the maximum sampling rate to adopt for the video.  Note that it is desirable to use one subsampling rate for the entire video, so as not to perceptually confuse the user when changing compression rates.  This restriction also limits the sampling rate over the entire video.  Let $m(t)$ denote the perceived motion associated with time t.  Let $m_{\max}$ denote the human tolerance limit for perceived motion.  It then follows that we are looking to find the maximum period T such that the perceived motion is always limited by the human tolerance limit $m_{\max}$, $\int_{t}^{t+T} m(\tau)\,d\tau \le m_{\max}$ for all t.  After the sampling period T is determined, the final video summary consists of subsampled frames of the selected video shots.
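The search for the sampling period can be sketched in discrete form. The code below assumes one particular reading of the constraint, namely that the motion accumulated over any T consecutive frames must not exceed the tolerance limit; the per-frame motion values, the limit, and the function name are hypothetical.

```python
from typing import Sequence


def max_sampling_period(motion: Sequence[float], limit: float) -> int:
    """Largest T (in frames) such that every run of T consecutive frames
    has total motion <= limit. Returns 0 if a single frame already exceeds it.

    Assumes non-negative motion values, so the worst-case window sum can
    only grow as T grows and the search can stop at the first failure.
    """
    best = 0
    for period in range(1, len(motion) + 1):
        window = sum(motion[:period])
        worst = window
        # Slide the window across the video, tracking the worst-case sum.
        for start in range(1, len(motion) - period + 1):
            window += motion[start + period - 1] - motion[start - 1]
            worst = max(worst, window)
        if worst > limit:
            break
        best = period
    return best


busy = [1, 1, 4, 1, 1, 2]    # a burst of motion in the third frame
calm = [1, 1, 1, 1, 1, 1]    # interview-like content
print(max_sampling_period(busy, limit=6))  # 3
print(max_sampling_period(calm, limit=6))  # 6
```

The two toy sequences reproduce the qualitative point of the text: the low-motion sequence tolerates a longer sampling period, and a single period is chosen for the whole video from its worst-case stretch.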



Universal Tuner Pervasive Application

        The Video Semantic Summarization System includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder on PCs, and an application interface on color/black-and-white Palm-OS PDAs. We designed a video annotation tool, VideoAnnEx, to annotate semantic labels associated with video shots. Videos are first segmented into shots based on their audio-visual characteristics. They are played back using an interactive interface, which facilitates and speeds up the annotation process. Users can annotate the video content in units of temporal shots or spatial regions. The annotated results are stored in the MPEG-7 XML format. We also designed and implemented a video transmission system, Universal Tuner, for wireless video streaming. This system transcodes MPEG-1/2 videos or live TV broadcast videos for black-and-white or indexed-color Palm OS devices. In our system, the complexity of multimedia compression and decompression algorithms is adaptively partitioned between the encoder and decoder. At the client end, users can access the summarized video based on their preferences, time, and keywords, as well as the transmission bandwidth and the remaining battery power of the pervasive device.

With the growing popularity and capability of Personal Digital Assistants (PDAs) and mobile phones, users have become much more enthusiastic about watching videos on their pervasive mobile devices. These devices vary widely and are limited in terms of power consumption, processing speed, display size, and video decoding capabilities.  When people use their pervasive devices, they generally restrict their viewing time on the limited displays and minimize the amount of interaction and navigation needed to get to the content. Therefore, summarization of video content for pervasive devices entails temporal as well as spatial considerations.

Pervasive devices usually have a smaller display resolution both in spatial domain and in color depth. Also, some devices such as Palm-OS PDAs do not have sound playing functionality. These constraints affect the key shots selection process of a video summarization system. For instance, if audio information is not available in the decoder, we may need to annotate more text information for video or extract video clips with embedded texts. Also, the limitation of remaining battery power in the devices may require the video summarization system to choose invariant stable scenes rather than high-motion video clips.

Most existing video summarization tools target Internet environments. Browsing tools can display the summarized video using a number of key frames for each detected scene shot to generate a storyboard [20]. Some clustering techniques are used to optimize key frame selection based on visual attributes or motion features [3, 12, 21]. Merialdo et al. generate personalized TV news programs based on user preference and time constraint [13]. Gong and Liu use Singular Value Decomposition (SVD) of the attribute matrix to reduce the redundancy of video segments and thus generate video summaries [5, 6]. In addition to the systems based on audio-visual information, some researchers have proposed methods to detect semantically important events based on other resources. For instance, Aizawa et al. use brain waves to detect exciting moments for the subjects [2]. Hu et al. detect similar video clips across different news stations to identify interesting news events [8]. In industry, companies such as NTT DoCoMo and Virage have implemented preliminary video summarization systems. NTT DoCoMo streams video summaries to its I-mode cellular phones [17]. Virage creates video clips of NHL hockey highlights [19].

This paper addresses issues of designing a video semantic summarization system in the wireless/mobile environments. We have designed and implemented a video summarization system, which includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder for Palm-OS devices, and an application interface on color/black-and-white Palm-OS PDAs.   We will first describe the system architecture of our video summarization system, and then focus on video summarization of annotated contents, which are segmented and classified.  The video contents are annotated using the ISO standardized Multimedia Content Description Interface, also known as MPEG-7. MPEG-7 addresses specific requirements for Description Schemes to describe different abstraction levels and variations of multimedia content.

The Universal Tuner system includes two parts -- a software video transcoder at the server end, which can transcode MPEG-1/2 video or A/D-converted live broadcast video into a client-dependent format, and a client application on the color or black-and-white Palm OS PDA. The client application constitutes the key display component of the user client in this Video Semantic Summarization System.

An example of the client end is shown in Figure 2. This application is capable of displaying transcoded MPEG-1/2 color video in both the 80x80 format (as shown in Figure 2) and the full-screen 160x160 viewing mode. Using the Palm IIIc emulator, we can show video in the 80x80 video mode at about 6 frames per second, where motion can be marginally considered continuous in human perception, and at 1.5 frames per second in the 160x160 video mode, which is perceptually closer to a slide show than to continuous-motion video.  In the client application, we also added WML compatibility in the small video mode, which can access information on the Internet through the transcoding server.
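A quick calculation shows that the two display modes push the same number of pixels per second, which is consistent with a fixed decoding and display budget on the device. That interpretation is an observation from the reported numbers, not a claim made by the system's authors, and the helper name below is hypothetical.

```python
def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Raw pixel throughput of a display mode."""
    return width * height * fps


small = pixels_per_second(80, 80, 6)       # 80x80 mode at about 6 frames per second
full = pixels_per_second(160, 160, 1.5)    # 160x160 mode at 1.5 frames per second

print(small, full, small == full)
```

Quadrupling the frame area from 80x80 to 160x160 is exactly offset by the fourfold drop in frame rate from 6 to 1.5 frames per second.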

We have tested the effectiveness of the black-and-white mode of our Universal Tuner in a real environment through a 9600 bps wireless modem. For the color modes, we have also tested our Video Semantic Summarization System on Palm m505 devices, which support a USB connection and wireless LAN at 48 kbps.  Our testing results show the effectiveness of the system. In the future, we will test our system in 3G and Bluetooth environments.



Video Editing Software

 

MPEG-1/2 Compressed Domain Composition

 

VideoEd Graphical User Interface

 

 



Video Semantic Summarization Demos

         The Video Semantic Summarization System generates personalized video summaries using MPEG-7 descriptions and delivers the content effectively to the user on three implemented platforms: (1) the stand-alone application, (2) mobile devices, and (3) web browser.  Each system allows the user to specify topic preferences, query keywords and total summary time. We present our contributions and corresponding demos here.

  • Universal Tuner Pervasive Application
  • VideoAnnEx Annotation Tool
  • VideoEd Editing Software
  • VideoSoup Summarization on User Preference Web Interface

We have described a video semantic summarization system that includes an MPEG-7 compliant annotation interface, a semantic summarization middleware, a real-time MPEG-1/2 video transcoder on PCs, and an application interface on color/black-and-white Palm-OS PDAs. Several issues regarding the design of a wireless-compliant video summarization system have been addressed in this paper. Our video summarization system offers the following benefits to the user of pervasive devices: (1) it saves time spent navigating to the desired information (no clicking through a series of links); (2) it saves time spent viewing whole video clips (only the desired video segments are shown); and (3) it saves bandwidth by downloading only the desired video segments in device-dependent formats. In the future, we will add more functionality to this Video Semantic Summarization System, such as mobile phone client applications, automatic semantic audio-visual indexing, and event detection.
