|
The Video Semantic Summarization
System allows users to access personalized video summaries
based on the requested preferences, keywords and time, while dynamically
adapting to the transmission and device constraints. Our Video Semantic
Summarization System comprises three major components: the user
client, the database server,
and the video middleware.
Figure 1 illustrates the block diagram of these components.
The
user client component, which is shown in the left module of the
figure, allows the user to specify his or her preferences along with client
profiles, and receives the personalized summaries on the video client.
The database server component,
which is displayed in the right module of Figure 1, stores all the video
contents as well as their corresponding MPEG-7 descriptions. The
video middleware represents the intermediate component of the
figure, and processes the user preference with the video descriptions
to generate a personalized summarization, which is appropriately transcoded
and transmitted to the user. The video middleware is further
divided into the semantic transcoder and the transmission server. The
semantic transcoder accommodates
user preferences, video requests, and personalized summaries. The
transmission server adapts to different video sources, transmissions,
and devices.

Figure 1: Block Diagram of Video Semantic Summarization System.
Table of Contents
Video Semantic Summarization System
User Client
Database Server
Video Middleware
Semantic Transcoder
Transmission Server
MPEG-7 Video Annotation Tool
Video Segmentation
Video Content Semantic Lexicon
VideoAnnEx Graphical User Interface
MPEG-7 Video Segment Description
Summarization Techniques
Universal Tuner Pervasive Application
Video Editing Software
MPEG-1/2 Compressed Domain Composition
VideoEd Graphical User Interface
Video Semantic Summarization Demos
Viacom Universal Tuner
Viacom Web Interface
User Client
The user client allows a user
to request video by specifying a user request and sharing his/her
client profile. In response,
the video client receives and displays the personalized video to the user.
The user client module of Figure 1 illustrates these three components
and their data exchange with the video middleware
module. The user sends
a request for specific content to the query block of the video
middleware. The user request
can take the form of preference topics, certain keywords, and the user’s
time constraint for watching the video.
At the same time, several
client profiles are also sent to the control module of the video
middleware, including the user profile [e.g., an English-speaking American
user residing in New York City], the device profile [e.g., a Palm IIIc with
limited color display, no audio capabilities, and memory capacity], and
the transmission profile [e.g., Internet access through a 56K modem].
These client profiles allow the middleware to dynamically perform
the appropriate video transcoding on the requested video content.
Thus the delivered multimedia content can be customized
and optimized for the current user environment.
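As a rough illustration of how client profiles might drive transcoding decisions, the following minimal sketch derives output parameters from a device profile and a transmission profile. All class names, field names, and the 80 percent bandwidth headroom are assumptions made for this example; they are not the middleware's actual interface.

```python
from dataclasses import dataclass


@dataclass
class DeviceProfile:
    """Hypothetical device profile; the field names are illustrative."""
    screen_width: int
    screen_height: int
    color: bool
    audio: bool


@dataclass
class TransmissionProfile:
    """Hypothetical transmission profile."""
    bandwidth_bps: int


@dataclass
class TranscodeSettings:
    width: int
    height: int
    color: bool
    include_audio: bool
    max_bitrate_bps: int


def choose_settings(device, link, source_width, source_height, headroom_percent=80):
    """Pick output parameters that fit both the device and the link."""
    # Scale down to fit the screen while keeping the aspect ratio; never scale up.
    scale = min(1.0,
                device.screen_width / source_width,
                device.screen_height / source_height)
    return TranscodeSettings(
        width=max(1, int(source_width * scale)),
        height=max(1, int(source_height * scale)),
        color=device.color,
        include_audio=device.audio,
        # Stay below the nominal link rate (assumed safety margin).
        max_bitrate_bps=link.bandwidth_bps * headroom_percent // 100,
    )


# Example: a color PDA with a 160x160 screen and no audio, on a 56K modem.
palm = DeviceProfile(screen_width=160, screen_height=160, color=True, audio=False)
modem = TransmissionProfile(bandwidth_bps=56_000)
settings = choose_settings(palm, modem, source_width=320, source_height=240)
```

With these inputs the sketch selects a 160x120 color stream without audio, capped at 44,800 bits per second.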
Figure 2: The User Client Interface on the Palm OS PDA for 3 Personalized Video
Scenarios: (a) Video-on-demand with interactive hyperlinks,
(b) Summarized video based on preference topics and time constraint,
and (c) Summarized video based on query keywords and time constraint.
Next, a personalized
video is delivered to the user and displayed on the video client.
The video client receives the multimedia stream from the video
transcoder of the video middleware in accordance with the client profile
and the user request. The
video client application is built on our
Universal Tuner system [7]. This system has two parts: a software
video transcoder at the server end, which transcodes MPEG-1/2 video or
live broadcast video into client dependent formats, and the client application
software on the color or black-and-white Palm OS PDA.
Figure 2 illustrates the user
client interface on the Palm OS PDA for three personalized video scenarios.
The first emulator, Figure 2(a), shows video-on-demand on the Universal
Tuner client application. The
interface allows the user to select among different channels of video
content. Furthermore, interactive
hyperlinks are embedded in the video content and are shown as clickable
hypertext on the device. The second emulator, Figure 2(b), illustrates the video client
interface of our summarization scenario based on preference topics.
The user selects a video to summarize according to two listed preference
choices and a total playing time constraint.
Similarly, the third emulator of Figure 2(c) depicts the video
client of another summarization scenario based on query keywords.
The user is only interested in multimedia content with the specified
keywords, and requests the video summary to be limited to the desired
time constraint.
In summary, the user client
allows the user to specify his/her preference, provides critical persistent
data through the client profiles, and interfaces the user input/output
with a video client. The
user client communicates through the video
middleware in order to retrieve the appropriate personalized contents.
In the following section, the database server, which is where the video
content resides, is described.
Database Server
The database server provides
the video middleware with the video descriptions
and corresponding video sources in order to generate the personalized
contents. In the database
server, video content is stored, analyzed and annotated with MPEG-7 descriptions.
Figure 1 illustrates the database server with three major components.
First, the video sources identify the location and format of our
content. Because our video
transmission transcoding middleware accepts various kinds of video file
types, bit-rates and resolutions, the database can store videos in any
of the most popular formats (e.g., MPEG-1/2, AVI, and QuickTime).
Also, even though each output summarized video is a composition of video
clips, our database does not need to pre-segment the video sequences into
individual shots. Each video needs to be saved in only one format, because
the video transcoding middleware can randomly access and decode any video
clip in the video sequence in real time. Second, manual, semi-automatic,
or automatic annotation tools assist
in generating the descriptions for videos.
The annotation can range from high-level semantic concepts to low-level
feature descriptions. And
third, the video descriptions are stored as
MPEG-7 XML files. These
MPEG-7 descriptions identify the underlying content of the videos.
Annotation tools generate
semantic meanings as well as other feature descriptions of the video,
and output them as MPEG-7 description
schemes. Our
annotation tool allows annotators to tag video segments with detailed
text descriptions, which are based on a defined
semantic lexicon, on individual shots or on individual regions/objects
in the key frame of the shots. In
addition, depending on the application of the semantic summarization system
and the corresponding video content, the annotation tool can be easily
customized. For instance, the lexicon can be changed based on the content,
or the shot boundary detection method can be customized to a hierarchical
structure, which represents hierarchical semantic events.
Video Middleware
The video middleware interfaces
the user client and the
database server. As shown
in Figure 1, the middleware consists of the
semantic transcoder and the transmission
server. In the semantic
transcoder, a query and retrieval component matches the user preference
with the MPEG-7 video descriptions to generate a video summarization. In the transmission server, the control module determines the
appropriate transcoding for the desired video content to the user client,
which in turn depends on the client profiles.
Afterwards, the summarized videos are optimally transcoded and
transmitted to the user’s video client.
In the following two subsections, the semantic transcoder and the
transmission server of the video middleware are explained.
Semantic Transcoder
The semantic transcoder performs
the matching of the user request with the video description.
The user may specify his/her request in terms of preference topics,
topic ranking, query keywords, and time constraint.
The query module receives the user request to determine if there
are contents that will fit these preferences.
A search request is sent to the database server, which in turn
responds with the appropriate MPEG-7 video descriptions.
The MPEG-7 descriptions
are initially passed to the MPEG-7 parser and the desired query results
are extracted. These query
results identify the matched semantic descriptions and the corresponding
video segments.
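The matching step can be pictured with a small sketch. It assumes the MPEG-7 parser has already reduced each description to a segment identifier and a list of annotation labels; the dictionary layout and the function name `match_segments` are hypothetical and chosen only for illustration.

```python
def match_segments(descriptions, topics=(), keywords=()):
    """Return the ids of segments whose labels match any topic or keyword.

    `descriptions` is a list of dicts with "id" and "labels" keys, an assumed
    stand-in for the parsed MPEG-7 output. Matching is case-insensitive and
    uses exact label equality, which is a simplification.
    """
    wanted = {t.lower() for t in topics} | {k.lower() for k in keywords}
    matches = []
    for segment in descriptions:
        labels = {label.lower() for label in segment["labels"]}
        if labels & wanted:
            matches.append(segment["id"])
    return matches


descriptions = [
    {"id": "shot-1", "labels": ["outer space", "rockets", "take-off or launch"]},
    {"id": "shot-2", "labels": ["indoors", "human", "person speaking"]},
    {"id": "shot-3", "labels": ["outdoors", "flag"]},
]
results = match_segments(descriptions,
                         topics=["person speaking"],
                         keywords=["Rockets"])
```

Here `results` contains `shot-1` and `shot-2`; ranking and time-constraint handling are left to the later summarization step.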
Once the user preference has been matched
with the video descriptions from the MPEG-7 XML, the appropriate videos
are summarized and delivered to the user. However, there are two additional issues that need to be addressed
before the summarized video segments are sent back to the user.
First, there may be too many query results that satisfy the search
request. Consequently, the
resulting summarization can be refined based on the priority ranking,
playing time, and the user profile. These different summarization
techniques are discussed later in more detail.
The second issue concerns
the format and transmission of the video segments.
If the video sources in the database
server are stored in varying formats, bit rates, frame rates, image
sizes, and other variations, the video segments that comprise the summarization
are likely to differ in the same ways.
These variations
may not be compatible or optimized for the user’s video client. As a result, the transmission server dynamically transcodes
all the video segments and streams a compatible video summary to the user,
as discussed in the next section.
Transmission Server
The Transmission Server
is designed for delivering
customized videos to various connected desktops, pervasive devices, or
internet browsers. We have implemented a prototype of variable-complexity
codecs that can transcode AVI, MPEG-1/2 video or live TV broadcasting
in real-time and stream it to various platforms.
MPEG-7 Video Annotation Tool
The effectiveness of the Video Semantic Summarization System is highly
dependent on the annotation descriptions of our content.
If the annotation tool generates useful and detailed
MPEG-7 descriptions for an application, then the resulting summarization
will be comprehensive and desirable.
In this section, we outline the functionalities and feature attributes
generated by our annotation tool, called IBM VideoAnnEx Annotation Tool.
Figure 4: Four Major Components of the IBM VideoAnnEx Annotation Tool.
Four major components describe the annotation process and are depicted
in Figure 4. First,
video segmentation is performed to cut up the video sequence into
smaller video units. Second,
a semantic lexicon is defined in order
to regulate the video content descriptions. Third, an annotator labels
the video segments with the semantic descriptions, and relevance scores
are calculated to reflect each segment's importance with respect to the labels.
Fourth, the MPEG-7 descriptions
of the annotation process are output directly from the IBM
VideoAnnEx
Annotation Tool. The
goal of the video annotation is to categorize the semantic content of
each video unit, assign the corresponding relevance score, and output
the MPEG-7 XML description file.
The following four subsections describe these components in further
detail.
Video Segmentation
A short video clip can be simply annotated by describing its content in
its entirety. However, when
the video is longer, annotation of its content can benefit from segmenting
the video into smaller units. In
our IBM VideoAnnEx Annotation Tool, the annotation is performed
on the video shot level. A video shot is defined as a continuous camera-captured segment
of a scene, and is usually well defined for most video content.
Given the shot boundaries, the annotations are assigned for each
video shot.
For a video sequence, shot boundary detection is performed to divide the
video into multiple shots. The IBM CueVideo Toolkit performs the
shot detection algorithm, which is based on the multiple timescale differencing
of the color histogram [1]. CueVideo
segments our video content into shorter shots, where scene cuts, dissolves,
and fades are effectively detected. Because each video shot can be described
and retrieved independently of the others, the next step is to define
our lexicon for shot descriptions.
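To make the idea of histogram differencing concrete, here is a deliberately simplified sketch. It is not the CueVideo algorithm: it uses grayscale intensity histograms instead of color histograms, a single fixed threshold, and one time lag per call. Frames are assumed to be flat lists of pixel values in the range 0 to 255, and all function names are hypothetical.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as a flat list of pixels."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    total = len(frame)
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_boundaries(frames, threshold=0.5, lag=1):
    """Return indices i where frame i differs strongly from frame i - lag."""
    hists = [histogram(f) for f in frames]
    return [i for i in range(lag, len(hists))
            if histogram_distance(hists[i], hists[i - lag]) > threshold]


# Three dark frames followed by three bright frames: one abrupt cut at index 3.
frames = [[10] * 16 for _ in range(3)] + [[200] * 16 for _ in range(3)]
cuts = detect_boundaries(frames)
```

Comparing frames at a lag of one exposes abrupt cuts. Repeating the comparison with larger lags is one way to expose gradual transitions such as dissolves and fades, where consecutive frames differ only slightly.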
Video Content Semantic Lexicon
Given the segmentation of
video content into video shots, the second step is to define the semantic
lexicon with which to label the shots.
A video shot can fundamentally
be described by three attributes. The first is the background setting in which the shot was
captured by the camera, which is referred to as the static scene.
The second attribute is the collection of significant subjects
involved in the shot sequence, which is referred to as the key object.
Lastly, the third attribute is the corresponding action taken by
some of the key objects, which is referred to as the event.
These three types of lexicon define the vocabulary for our video
content.
An example of our lexicon is shown as follows. Our vocabulary for the
static scenes includes “indoors”, “outdoors”, and “outer space”.
Furthermore, each category is hierarchically sub-classified to
comprise more specific scene descriptions.
For example, “outdoors” consists of these three sub-categories:
“natural scene – low level”, “natural scene – high level”, and “man-made”.
Our simplified vocabulary for the key objects includes the following
categories: “animals”, “human”, “man-made structures”, “man-made objects”,
“nature objects”, “graphics & text”, “transportation”, and “astronomy”.
In addition, each key object category is subdivided into more specific
object descriptions; for instance, "rockets", "fire",
"flag", "flower" and "robots."
For our events vocabulary, only six events are of specific interest
to our summarization work, and they are the following: "water skiing",
"boat sailing", "person speaking", "landing",
"take-off or launch", and "explosion."
Using the defined vocabulary for static scenes, key objects, and events,
the lexicon is imported into our IBM VideoAnnEx Annotation Tool
for describing and labeling each video shot.
Each shot is labeled for its content with respect to the selected
lexicon. Note that
the lexicon is dependent on the summarization application, and
can be easily modified and imported into the annotation tool.
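One possible in-memory representation of such a lexicon is a nested mapping from attribute type to category to sub-categories. The sketch below uses only vocabulary quoted in this section; the data layout and the helper names are assumptions, and sub-categories that the text does not enumerate are left empty rather than guessed.

```python
LEXICON = {
    "static scene": {
        "indoors": [],        # sub-categories are not listed in the text
        "outdoors": [
            "natural scene – low level",
            "natural scene – high level",
            "man-made",
        ],
        "outer space": [],    # sub-categories are not listed in the text
    },
    "key object": {
        name: [] for name in (
            "animals", "human", "man-made structures", "man-made objects",
            "nature objects", "graphics & text", "transportation", "astronomy",
        )
    },
    "event": {
        name: [] for name in (
            "water skiing", "boat sailing", "person speaking",
            "landing", "take-off or launch", "explosion",
        )
    },
}


def allowed_labels(attribute):
    """All categories and sub-categories defined for one attribute type."""
    labels = set()
    for category, subcategories in LEXICON[attribute].items():
        labels.add(category)
        labels.update(subcategories)
    return labels


def invalid_labels(annotation):
    """Return (attribute, label) pairs that are not part of the lexicon."""
    return [(attribute, label)
            for attribute, labels in annotation.items()
            for label in labels
            if label not in allowed_labels(attribute)]


annotation = {
    "static scene": ["outer space"],
    "key object": ["astronomy"],
    "event": ["take-off or launch", "docking"],
}
problems = invalid_labels(annotation)
```

In this example only the label "docking" is rejected, because it is not one of the six events in the vocabulary. Swapping in a different `LEXICON` mapping corresponds to customizing the lexicon for another application.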
VideoAnnEx Graphical User Interface
The IBM VideoAnnEx Annotation Tool assists authors in the task of
annotating video sequences. Each
shot in the video sequence can be annotated with static scenes, key objects,
events, and other keywords. These
descriptions are labeled for each shot and are stored as
MPEG-7 descriptions in the output XML file.
VideoAnnEx can also save, open, and retrieve MPEG-7 files
in order to display the annotations for corresponding video sequences.
VideoAnnEx is divided into four graphical sections as illustrated
in Figure 5. In the upper
right-hand corner of the tool is the Video Playback window with
shot information. In the
upper left-hand corner of the tool is the Shot Annotation with
a key frame image display. In
the bottom portion of the tool is the Views Panel, which offers two different
annotation previews. A
fourth component, not shown in Figure 5, is the Region Annotation
pop-up window for specifying annotated regions.
These four sections provide interactivity to assist authors using
the annotation tool.
Figure 5: The IBM VideoAnnEx Annotation Tool.
The Video Playback window displays the opened MPEG video sequence.
As the video is played back in the display window, the current
shot information is given as well.
The Shot Annotation module displays the defined semantic
lexicons and the key frame window. The key frame is a representative image
of the video shot segment, and thus offers an instantaneous recap of the
whole video shot. This is
the region where the annotator selects the descriptions for the video
segment. The Views Panel
displays two different previews of representative images of the video.
The Frames in the Shot shows all the I-frames as representative
images of the current video shot, while the Shots in the Video
view (as in the bottom of Figure 5) shows all the key frames of each shot
as representative images over the entire video.
As the annotator labels each shot, the descriptions are displayed
below the corresponding key frames in the Shots in the Video view.
Furthermore, after the MPEG-7 descriptions are saved into an XML
file, anyone can load and review these files at a later time by previewing
the annotations in this Views Panel.
The Region Annotation window allows the author to associate
a rectangular region with a labeled text annotation.
After the text annotations are identified on the Shot Annotation
window, each description can be associated with a corresponding region
on the selected key frame of that shot.
The region annotation is also saved in the MPEG-7 descriptions,
as is described next. A more detailed description of the annotation tool
and its active learning components is given in [16].
MPEG-7 Video Segment Description
The IBM VideoAnnEx Annotation Tool segments the video content into
shots, labels each video shot with some descriptions, identifies the associated
region boundary, and generates an MPEG-7 XML description [10].
The ISO standardized MPEG-7 defines the compatible scheme and language
to represent semantic meaning of multimedia content. Our MPEG-7 output
is the Video Segment Description Scheme, as shown in Figure 6. In Figure
6, we annotate the first shot, which includes 136 frames, of a video as
“Slide Representation” and annotate a rectangular region in the
key frame as “Graphics & Text.” In MPEG-7, each video shot
is defined as a Video Segment, where the shot start-time (shown in the
Thh:mm:ss:nnF30 format) and duration are given
and the annotations are described. Furthermore, the embedded <SpatioTemporalDecomposition>
tag allows us to specify the region location and the corresponding text
annotation in a key frame. In Figure 6, the key frame is the 82nd frame
of the video sequence. The
annotated region is specified by the <SpatialLocator>
tag. It is identified by a polygon whose n vertex coordinates are
recorded in the order of $(x_0, y_0)$,
$(x_1, y_1)$, …, $(x_{n-1}, y_{n-1})$
after the <CoordsI>
tag. For multiple regions in a key frame, the system needs to repeat the
section between <StillRegion> and its
closing tag inside the <SpatialDecomposition>
section. If the annotator needs to label multiple frames in the shot,
then the system needs to repeat the <StillRegion>
section inside the <SpatioTemporalDecomposition>
section.
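Two of the values described above are easy to decode in isolation. The sketch below converts a start time in the Thh:mm:ss:nnF30 form into seconds and turns a vertex list into coordinate pairs. It assumes the time point has no date prefix and that the text following the <CoordsI> tag is a whitespace-separated list of integers; both function names are hypothetical.

```python
import re

# "T", hours, minutes, seconds, a fraction count, "F", fractions per second.
_TIME_POINT = re.compile(r"^T(\d{2}):(\d{2}):(\d{2}):(\d+)F(\d+)$")


def time_point_to_seconds(text):
    """Convert a Thh:mm:ss:nnFxx time point into seconds."""
    match = _TIME_POINT.match(text.strip())
    if match is None:
        raise ValueError(f"not a Thh:mm:ss:nnFxx time point: {text!r}")
    hours, minutes, seconds, fraction, per_second = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + fraction / per_second


def parse_polygon(coords_text):
    """Turn 'x0 y0 x1 y1 ...' into a list of (x, y) vertex tuples."""
    values = [int(v) for v in coords_text.split()]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of coordinate values")
    return list(zip(values[0::2], values[1::2]))


start = time_point_to_seconds("T00:00:04:15F30")        # 15/30 of a second past 4 s
region = parse_polygon("10 20 110 20 110 80 10 80")      # a rectangle, four vertices
```

The example values are invented for illustration and are not taken from Figure 6.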

Figure 6: Example of MPEG-7 Video Segment Description XML file.
In our Video Semantic Summarization System,
a relevance score is automatically assigned to the video based on the
confidence value of the classification. For our system, the annotation
process generates a relevance score for the whole video sequence and for
each attribute, based on the probability that the attribute applies to the corresponding
video unit. After these steps,
we implemented an interface to allow users to manually correct the annotation
as well as the scene boundaries. All these results are then saved as an
MPEG-7 XML file.
Summarization Techniques
Given annotation descriptions
and corresponding scores for our video content, this section describes
the summarization techniques to customize video delivery based on user
preference and time constraint. The objective of video summarization is
to show a shortened video that maintains as much semantic content as possible within
the desired time constraint, based on user preference.
Using shots as the basic video unit, there are four forms of video
summarization based on temporal compression of the original video sequence:
(1) maintain or delete each video shot depending on user preference, (2)
extract temporal subsets of the original shot depending on attribute specification,
(3) condense each shot in fast-forward temporal mode while maintaining
comprehension, and (4) use a weighted combination of the previous
forms.

Figure 7: Three Forms of Video Summarization Techniques.
In the first formulation,
each video shot is either included or excluded from the final video summary.
In each shot, video annotation describes the semantic content with
attributes and corresponding scores.
Assume there are a total of N attribute categories.
Let $P = [p_1, p_2, \ldots, p_N]^T$ be
the user preference vector, where $p_i$
denotes the preference weighting
for attribute i,
$1 \le i \le N$. Assume there are a total
of M shots. Let $S = [s_1, s_2, \ldots, s_M]^T$
be the shot segments that
comprise the original video sequence, where $s_j$
denotes shot number j,
$1 \le j \le M$. Subsequently, the attribute
score $a_{i,j}$
is defined as the relevance
of attribute i in shot j,
$1 \le i \le N$ and
$1 \le j \le M$. The attribute matrix A
is:
$$A = \begin{bmatrix} a_{1,1} & \cdots & a_{1,M} \\ \vdots & \ddots & \vdots \\ a_{N,1} & \cdots & a_{N,M} \end{bmatrix}$$
It then follows that the weighted
attribute $w_j$
for shot j given the user preference $P$
is calculated as:
$$w_j = \sum_{i=1}^{N} p_i \, a_{i,j}, \qquad \text{that is,} \quad W = A^T P$$
$w_j$ specifies the weighted importance of each shot for the user. Assume
shot $s_j$
spans a duration of $t_j$,
$1 \le j \le M$. Then this summary formulation suggests that we include shot $s_j$
if the importance weighting $w_j$
of this shot is greater than some threshold,
$\theta$, and exclude it otherwise. $\theta$
is determined such that the sum of the durations $t_j$ of the included shots
is less than the user specified time constraint. As a result,
each shot is either included or excluded in the final video summary.
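The first formulation can be sketched in a few lines. The code below computes each shot's weighted importance as the preference-weighted sum of its attribute scores, then lowers the inclusion threshold step by step for as long as the selected shots still fit the time limit. The function names, the list-of-lists matrix layout (one row per attribute, one column per shot), and the treatment of ties are assumptions for illustration.

```python
def shot_weights(preferences, attribute_scores):
    """Weighted importance per shot.

    preferences[i] is the weight of attribute i; attribute_scores[i][j] is the
    relevance of attribute i in shot j.
    """
    num_shots = len(attribute_scores[0])
    return [sum(p * row[j] for p, row in zip(preferences, attribute_scores))
            for j in range(num_shots)]


def select_shots(weights, durations, time_limit):
    """Indices of shots kept under the lowest threshold that fits the time limit."""
    best = []
    # Try cutoffs from the highest weight downward. Total duration can only
    # grow as the cutoff drops, so stop at the first cutoff that does not fit.
    for cutoff in sorted(set(weights), reverse=True):
        chosen = [j for j, w in enumerate(weights) if w >= cutoff]
        if sum(durations[j] for j in chosen) > time_limit:
            break
        best = chosen
    return best


# Two attributes, three shots.
preferences = [1.0, 0.5]
attribute_scores = [
    [0.9, 0.1, 0.4],   # attribute 1 in shots 0, 1, 2
    [0.2, 0.8, 0.6],   # attribute 2 in shots 0, 1, 2
]
durations = [10, 20, 15]   # seconds per shot

weights = shot_weights(preferences, attribute_scores)
summary = select_shots(weights, durations, time_limit=30)
```

With a 30-second limit, the sketch keeps shots 0 and 2 (25 seconds in total) and drops shot 1, whose weight is lowest. Keeping shots in their original order preserves the temporal flow of the source video.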
In the second formulation, we extend the attribute scoring from the shot
level to the time domain. In the first form, video annotation specifies
a relevance score of each attribute for each shot based on semantic relevance
of the shot content. Here, attribute scoring will incorporate low-level
features to more precisely annotate each frame of the shot. For
example in a sports clip, there should be higher relevance scoring associated
with those temporal frames that demonstrate higher motion components.
So if we want to include the high activity component of a shot, then only
the highest score subset is extracted. Consequently, the attribute score $a_{i,j}$
for attribute i and shot j is no longer a constant
within a shot, but becomes a function of time based on low-level features.
The attribute scoring function $a_{i,j}(t)$
can be calculated automatically as the product of the constant attribute
score $a_{i,j}$
of shot j and a normalized low-level feature weighting
$f_j(t)$, where $0 \le f_j(t) \le 1$
and t spans the duration of the shot. Thus,
$a_{i,j}(t) = a_{i,j} \, f_j(t)$. Similar to the first
formulation, the attribute scoring function $a_{i,j}(t)$
now determines the time interval
of the shot subset
to be included in the final
video summary.
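A small sketch of the second formulation follows. It scales a shot's constant attribute score by a per-frame feature weighting and then picks the contiguous run of frames with the highest total score. Restricting the subset to one contiguous window of fixed length is an assumption made here for simplicity, as are the function names and the choice of motion activity as the low-level feature.

```python
def scoring_function(constant_score, feature_weights):
    """Per-frame score: the shot's constant score times a weighting in [0, 1]."""
    return [constant_score * f for f in feature_weights]


def best_interval(scores, length):
    """Return (start, end) of the contiguous window with the highest total score.

    `end` is exclusive. Ties are resolved in favor of the earliest window.
    """
    if not 0 < length <= len(scores):
        raise ValueError("length must be between 1 and the number of samples")
    window = sum(scores[:length])
    best_sum, best_start = window, 0
    for start in range(1, len(scores) - length + 1):
        # Slide the window by one sample: add the new one, drop the old one.
        window += scores[start + length - 1] - scores[start - 1]
        if window > best_sum:
            best_sum, best_start = window, start
    return best_start, best_start + length


# Normalized motion activity for six frames of one shot (assumed values).
motion = [0.1, 0.2, 0.9, 1.0, 0.8, 0.1]
scores = scoring_function(0.5, motion)
interval = best_interval(scores, length=3)
```

For these values the selected interval covers frames 2 through 4, the high-activity portion of the shot.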
In the third formulation, the video is compressed temporally by subsampling
the total number of frames in the original sequence. The resulting
video summary resembles a fast forward playback in a shorter period of
time. When video is temporally compressed in this manner, maintaining
comprehension becomes the most critical issue. Comprehension depends
on the compression factor, which in turn is highly dependent on the content
and the amount of motion in the original video. For example, commercial
videos, which consist of high motion and very short shots, cannot tolerate
a high compression factor. On the other hand, interview or conferencing
videos, which consist of nearly stationary people with limited motion,
can withstand high compression factors. Consequently, we can assume
that to maintain comprehension of the summarized video, the cumulative
motion of the resulting subsampled video must be below a perceptually
acceptable threshold. This requirement then determines the maximum
sampling rate to adopt for the video. Note that it is desirable
to use one subsampling rate for the entire video, so as not to perceptually
confuse the user when changing compression rates. This restriction
also limits the sampling rate over the entire video. Let $m(t)$
denote the perceived motion associated with time t. Let $m_{\max}$
denote the human tolerance limit for perceived motion. It
then follows that we are looking to find the maximum period T such
that the perceived motion accumulated over one sampling period is always limited by the human tolerance limit
$m_{\max}$,
$\int_{t}^{t+T} m(\tau)\, d\tau \le m_{\max}$
for all t. After the sampling period T is determined,
the final video summary consists of subsampled frames of the selected
video shots.
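The search for the sampling period can be sketched in discrete form. The code below assumes that motion is available as one non-negative value per frame and that "perceived motion" for a given period means the motion accumulated across the frames spanned by that period; both are interpretations made for this example, and the function names are hypothetical.

```python
def max_sampling_period(motion, tolerance):
    """Largest period T (in frames) whose worst-case accumulated motion fits.

    A period of 1 (no subsampling) is always allowed. Because motion values
    are non-negative, the worst-case sum can only grow with the period, so
    the search stops at the first period that exceeds the tolerance.
    """
    best = 1
    for period in range(2, len(motion) + 1):
        worst = max(sum(motion[s:s + period])
                    for s in range(len(motion) - period + 1))
        if worst > tolerance:
            break
        best = period
    return best


def subsample(frames, period):
    """Keep every `period`-th frame, using one rate for the whole video."""
    return frames[::period]


interview = [1] * 12      # low, steady motion (assumed units)
commercial = [3] * 12     # high motion

period_interview = max_sampling_period(interview, tolerance=4)
period_commercial = max_sampling_period(commercial, tolerance=4)
kept = subsample(list(range(12)), period_interview)
```

The low-motion sequence tolerates a period of four frames, while the high-motion sequence cannot be subsampled at all under the same tolerance, which mirrors the contrast between interview and commercial content described above.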
Universal Tuner Pervasive Application
The Video Semantic Summarization System includes an MPEG-7 compliant annotation
interface, a semantic summarization middleware, a real-time MPEG-1/2 video
transcoder on PCs, and an application interface on color/black-and-white
Palm-OS PDAs. We designed a video annotation tool, VideoAnnEx, to
annotate semantic labels associated with video shots. Videos are first
segmented into shots based on their visual-audio characteristics. They
are played back using an interactive interface, which facilitates and speeds up
the annotation process. Users can annotate the video content in
units of temporal shots or spatial regions. The annotated results are
stored in the MPEG-7 XML format. We also designed and implemented a video
transmission system, Universal Tuner, for wireless video streaming.
This system transcodes MPEG-1/2 videos or live TV broadcast videos
for black-and-white or indexed-color Palm OS devices. In our system, the complexity
of multimedia compression and decompression algorithms is adaptively partitioned
between the encoder and decoder. At the client end, users can access the
summarized video based on their preferences, time, and keywords, as well as
the transmission bandwidth and the remaining battery power on the pervasive
devices.
With the growing popularity and capability
of Personal Digital Assistants (PDAs) and mobile phones, users have become
much more enthusiastic about watching videos on their pervasive mobile devices.
These devices vary widely and are limited in terms of power consumption,
processing speed, display constraint, and video decoding capabilities.
When people use their pervasive devices, they generally restrict their
viewing time on the limited displays and minimize the amount of interaction
and navigation to get to the content. Therefore, summarization of video
content for pervasive devices entails temporal as well as spatial considerations.
Pervasive devices usually have a
smaller display resolution both in spatial domain and in color depth.
Also, some devices such as Palm-OS PDAs do not have sound playing functionality.
These constraints affect the key shot selection process of a video summarization
system. For instance, if audio information is not available in the decoder,
we may need to annotate more text information for video or extract video
clips with embedded texts. Also, the limitation of remaining battery power
in the devices may require the video summarization system to choose invariant
stable scenes rather than high-motion video clips.
Most existing video summarization
tools target applications in Internet environments. Browsing
tools can display the summarized video using a number of key frames for
each detected scene shot to generate a storyboard [20]. Some clustering
techniques are used to optimize key frame selection based on their visual
attributes or motion features [3, 12, 21]. Merialdo et al. generate
personalized TV news programs based on user preference and time constraint
[13]. Gong and Liu use Singular Value Decomposition (SVD) of the attribute
matrix to reduce the redundancy of video segments and thus generate video
summaries [5, 6]. In addition to the systems based on audio-visual information,
some researchers have proposed methods to detect semantically important events
based on other resources. For instance, Aizawa et al. use brain
waves to detect exciting moments of the subjects [2]. Hu et al.
detect similar video clips across different news stations
to identify interesting news events [8]. In industry, companies such as
NTT DoCoMo and Virage have implemented preliminary video summarization
systems. NTT DoCoMo streams video summaries to its I-mode cellular phones
[17]. Virage creates video clips of NHL hockey highlights [19].
This paper addresses issues of designing
a video semantic summarization system for wireless/mobile environments.
We have designed and implemented a video summarization system, which includes
an MPEG-7 compliant annotation interface, a semantic summarization middleware,
a real-time MPEG-1/2 video transcoder for Palm-OS devices, and an application
interface on color/black-and-white Palm-OS PDAs. We will first
describe the system architecture of our video summarization system, and
then focus on video summarization of annotated contents, which are segmented
and classified. The video contents are annotated using the ISO standardized
Multimedia Content Description Interface, also known as MPEG-7. MPEG-7
addresses specific requirements for Description Schemes to describe different
abstraction levels and variations of multimedia content.
The Universal Tuner system includes
two parts -- a software video transcoder at the server end, which can transcode
MPEG-1/2 video or A/D converted live broadcast video into a client-dependent
format, and client application software on the color or black-and-white
Palm OS PDA. The client software constitutes the key display component
of the user client in this Video Semantic Summarization System.
An example of the client end has
been shown in Figure 2. This application is capable of displaying transcoded
MPEG-1/2 color video in both the 80x80 format (as shown in Figure 2) and the full-screen
160x160 viewing mode. Using the Palm IIIc emulator, we can show video in the
80x80 video mode at about 6 frames per second, where motion could be marginally
considered as continuous in human perception, and at 1.5 frames per second
in the 160x160 video mode, which is perceptually closer to a slide
show than to continuous-motion video. In the client application,
we also added WML compatibility in the small video mode, which can
access information on the Internet through the transcoding server.
We have tested the effectiveness
of the black-and-white mode of our Universal Tuner in real environments through
a 9600 bps wireless modem. For the color modes, we have also tested our
Video Semantic Summarization System on Palm m505 devices, which support
USB connection and Wireless LAN using 48kbps. Our testing results
show the effectiveness of the system. In the future, we will test our
system in 3G and Bluetooth environments.
Video Editing Software
MPEG-1/2 Compressed Domain Composition
VideoEd Graphical User Interface
Video Semantic Summarization Demos
The Video Semantic Summarization System generates personalized video summaries
using MPEG-7 descriptions and delivers the content effectively to the
user on three implemented platforms: (1) the stand-alone application,
(2) mobile devices, and (3) web browser. Each system allows the
user to specify topic preferences, query keywords and total summary time.
We present our contributions and corresponding demos here.
- Universal Tuner Pervasive Application
- VideoAnnEx Annotation Tool
- VideoEd Editing Software
- VideoSoup Summarization on User Preference Web Interface
The video semantic summarization
system described here includes an MPEG-7 compliant annotation interface,
a semantic summarization middleware, a real-time MPEG-1/2 video transcoder
on PCs, and an application interface on color/black-and-white Palm-OS
PDAs. Several issues regarding the design of a wireless-compliant video
summarization system have been addressed in this paper. Our video summarization
system offers these benefits to users of pervasive devices: (1)
it saves time navigating to the desired information [no clicking through
a series of links]; (2) it saves time viewing video clips [only
desired video segments are shown]; and (3) it saves bandwidth by downloading only desired
video segments in device-dependent formats. In the future, we will
add more functionality to this Video Semantic Summarization System, such
as mobile phone client applications and automatic semantic audio-visual
indexing as well as event detection.
|