We wrote this text as part of an authoring chapter for an MPEG-4
book. The authoring chapter was, however, not included in the final book. Since this text discusses
XMT in the context of authoring for the Web and our authoring tools, we present it here.
1. Authoring in XMT
The XMT textual format for MPEG-4 was designed with the reduction of authoring
complexity as one of its main aims. MPEG-4 provides many
powerful, sophisticated tools for the representation of audiovisual scenes. Such
tools cover visual coding, audio coding, scene representation (BIFS), as well as
media properties and their associations with the scene (OD Framework). Taking
advantage of the powerful tools and functionalities that MPEG-4 provides can require
considerable skill and is often best left to expert authors. XMT changes this
landscape and allows expert authors and novices alike to produce compelling, rich,
interactive, animated content using MPEG-4 tools.
1.1 Using the XMT Formats
An author should be aware that the XMT format, which is XML-based,
consists of two levels: XMT-O, a high-level format, and XMT-A, a low-level
format.
The XMT-A format is a direct textual representation of the MPEG-4
Systems binary coded tools and provides a deterministic mapping to and from the
MPEG-4 binary representation. As such, its primary users will be authors skilled in
MPEG-4 who can utilize fundamental MPEG-4 tools and create the right nodes,
routes, etc. for the job they have in mind. One authoring dilemma is that such
scenes can become very complicated very quickly. If the content is passed to
another author, then understanding the combination of nodes and routes can be
difficult, and it is often not easy to recover the original author's intentions from the
XMT-A format. The original author can of course add human-readable
comments to the text, akin to commenting code when producing
software, but such comments, though useful for humans, do not give authoring tools
the same understanding.
XMT-O addresses these issues of complexity, author intentions, and skills.
XMT-O provides a useful high-level abstraction of the low-level MPEG-4
functionalities. It is based on SMIL and also has elements of syntax familiar to
Web authors. The high-level abstractions allow both expert and novice authors alike
to produce advanced MPEG-4 content, limited only by an authoring tool's capability
to take advantage of the richness of MPEG-4 tools to represent such content. By
sharing syntax and semantics with other Web-oriented media languages, XMT-O allows
re-use of skills and provides familiarity for authors, who can now more easily move
to utilizing the power of MPEG-4. Because XMT-O provides high-level abstractions,
the author's intentions are clearer, and content is more easily passed from author
to author and among authoring tools. For example, a
transition between two images can be simply stated as 'Four-Corners In', meaning the
four corners of the new image are progressively shown, hiding the prior image. In
XMT-A, decoding the combination of nodes and routes to recognize 'Four-Corners In'
would be complicated, not least because there is more than one way to represent it
in nodes and routes. XMT-O can therefore
better communicate authors' intents. XMT-O can include custom media objects
defined in XMT-A, so it is possible to use a mix-and-match approach that takes advantage
of the benefits of both formats.
1.2 MPEG-4 for Non-expert Authors
In addition to an authoring format, the availability of suitable
authoring tools at the author's disposal is vitally important. Although XMT
is a textual format and an author could resort to using a simple text editor, as one
might have done in the early days of HTML, a suitable authoring tool will vastly
improve productivity and make MPEG-4 accessible to novice authors.
While a text-based or structured XML editor may work, an
authoring tool that allows drag and drop and visual manipulation of content in a
what-you-see-is-what-you-get fashion works well for novices and experts alike. For
experts, additional features can allow direct editing of the content in the
XMT format.
To get started with the creation of content, a wizard or template may be
used. A wizard guides an author through the process of producing
a certain type of content for a particular application, for example a slideshow with
subtitles. Here the wizard prompts for each image and subtitle and, when complete,
creates the XMT representing that content. This can then be further edited as
desired. A template may also be a suitable starting point, for example to create an
advertising flyer for marketing a house. The template may contain non-editable
items, e.g. a company logo and disclaimers. Other items can be edited simply by
clicking on the placeholder within the template, whereupon the tool will prompt for
suitable content. For example, clicking on the placeholder for the photo of the house will
prompt for a suitable image to insert.
Content can also be created from scratch, using whatever
artistic flair one has. Sometimes it is necessary to lay down a few media objects
to see how they look. This section therefore presents a visual approach to authoring
MPEG-4 with high-level media objects (xMedia objects) as building blocks. The
artistic, creative process is best realized iteratively, with a rapid feedback loop.
When designed properly, an authoring tool with simple menus and easy-to-use user
interfaces allows high-level media objects to be used effectively for
developing an initial storyboard.
Such an authoring tool can provide a palette of media objects.
Inserting such an object into the scene, with appropriate defaulted values, would
also create the necessary XMT to represent this. Having appropriate default values
allows an author to readily insert a media object without being prompted for every
possible attribute. The default values themselves can be set in the tool's options
to suit the author. However, each attribute can
also be changed later, whether by visual interaction, such as dragging to move it,
through simple dialog boxes, say to change color, via pop-up menus for selections,
or even by direct editing of the XMT representation.
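The role of defaults described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the `TOOL_DEFAULTS` table, the `MediaObject` class, the `insert_object` helper, and all attribute names and values are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical tool-wide defaults; attribute names and values are illustrative only.
TOOL_DEFAULTS = {
    "rectangle": {"x": 0, "y": 0, "width": 100, "height": 60, "color": "gray"},
    "text": {"x": 0, "y": 0, "color": "black", "string": ""},
}


@dataclass
class MediaObject:
    kind: str
    attributes: dict = field(default_factory=dict)


def insert_object(scene, kind, **overrides):
    """Add a media object, filling unspecified attributes from the defaults."""
    attributes = {**TOOL_DEFAULTS.get(kind, {}), **overrides}
    obj = MediaObject(kind, attributes)
    scene.append(obj)
    return obj


scene = []
rect = insert_object(scene, "rectangle")            # inserted without any prompting
rect.attributes["color"] = "red"                    # changed later, e.g. via a dialog
label = insert_object(scene, "text", string="Hi")   # one default overridden at insert
```

Each object receives its own copy of the defaults, so a later edit to one object does not alter the tool-wide settings.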
Some properties may be intrinsic to the media, such as color, while
others are applied to the media, such as its duration or start time. Depending on
the property of the media object being changed, one method may be more satisfactory
than another, although a tool will often provide more than one way to do a job.
Once the content has been created from a suitable assembly of media
objects the XMT-O format will be a true representation of that content; even if the
content was only finally arrived at through a lengthy iterative, interactive, visual
creation process.
In the XMT-O format the content can be passed to other authors for refinement
in a publishing process. Indeed, other authoring tools can be used to process the
content at this high level. When the content is considered complete and ready for
deployment, the content can be encoded to the MPEG-4 binary representation and
stored in a file such as mp4. Such encoding can be done directly, or the XMT-O may
first be transformed to XMT-A before encoding. When in XMT-A an expert author has
the opportunity to tweak the content, if necessary or desirable, before encoding it
to the MPEG-4 binary representation.
With the MPEG-4 binary content in hand, the author can now play the
content on the target client(s) to check how it plays. If late changes are
needed, these can be rapidly accommodated in XMT. A feedback loop that is
rapidly responsive to change is a vital part of the authoring process,
especially if content is created interactively on the fly: it is important
to see the final result and be able to change it quickly.
While an author may choose to author with either XMT-O or XMT-A, the
author is likely to stay with the one format for the entire content creation cycle.
Of course, an author is free to ignore the XMT-O format once an initial storyboard
is obtained, and refine the presentation working with the XMT-A format. However
once a decision is made to work in XMT-A it will not, in general, be possible to go
back to XMT-O since there is no guarantee of an inverse mapping from XMT-A to XMT-O.
In working with XMT-O it is therefore recommended to stay with the XMT-O format. If
the XMT-A format is wanted then a tool can automatically generate the corresponding
XMT-A format as needed.
1.2.1 Objects as Building Blocks
The building blocks of the authoring tool described here are XMT-O
xMedia objects as defined in ISO/IEC 14496 (MPEG-4 Systems specification). The
XMT-O format has been designed around the xMedia objects, which include media such
as rectangle, circle, text, string, img, video, audio, sphere, and
cylinder.
An xMedia object is represented by an XML element and is principally an
abstraction based on MPEG-4 geometries with media-specific attributes. It can
contain other media-independent elements having attributes such as color, position,
scale and visibility. Some attributes are type-dependent, e.g. color applies
only to visual elements and not to audio elements, while some elements are specific
to a certain media type, e.g. font style for text.
Given an xMedia object, behaviors can be defined, such as animation of
the object, where properties are altered over time or changes are triggered by events. An
event is often a user interaction such as a mouse click on the object, but other
events associated with timing, such as the begin event and the repeat count event, are also
defined in XMT-O. Events and animation provide the basis for highly interactive,
dynamic presentations.
The timeline for XMT-O media objects is represented in XMT by timing
containers and synchronization dependencies (sync-arcs). Timing containers, such
as "par" and "seq" that play contained media objects in parallel or in
sequence, are based on the timing modules as defined in SMIL 2.0. Note that
the elements and the attributes defined in XMT-O are a large subset of those defined
in SMIL 2.0. As such the authoring approach described above is not only useful for
authoring MPEG-4 but it can also be applied for authoring SMIL documents.
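To make the behavior of the "par" and "seq" containers concrete, the following Python sketch computes begin and end times for a small nested document. The markup is illustrative only and has not been checked against the XMT-O schema; in particular, the use of a "dur" attribute with plain values such as "4s" is a simplifying assumption, and the `schedule` function is a hypothetical helper rather than part of any tool.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only; element and attribute usage is simplified
# and not validated against the XMT-O schema.
DOC = """
<seq>
  <video dur="5s"/>
  <par>
    <img dur="3s"/>
    <audio dur="4s"/>
  </par>
  <text dur="2s"/>
</seq>
"""


def seconds(value):
    """Parse a simple clock value such as '4s' (the only form handled here)."""
    return float(value.rstrip("s"))


def schedule(node, begin=0.0, out=None):
    """Return (end_time, [(tag, begin, end), ...]) for the subtree at node."""
    out = [] if out is None else out
    if node.tag == "seq":              # children play one after another
        t = begin
        for child in node:
            t, _ = schedule(child, t, out)
        return t, out
    if node.tag == "par":              # children start together
        end = begin
        for child in node:
            child_end, _ = schedule(child, begin, out)
            end = max(end, child_end)
        return end, out
    end = begin + seconds(node.get("dur"))
    out.append((node.tag, begin, end))
    return end, out


total, timeline = schedule(ET.fromstring(DOC))
for tag, begin, end in timeline:
    print(f"{tag}: {begin:g}s to {end:g}s")
```

Here the "par" container is assumed to end when its longest child ends, so the text object begins only after the audio has finished.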
xMedia objects can thus be seen to provide an abstraction layer over
MPEG-4 nodes, commands, etc. (BIFS) and the Object Descriptor Framework. This
makes it possible to design presentations at a more abstract level and allows
authors to avoid the complexity of MPEG-4 for many MPEG-4 applications.
1.3 MPEG-4 Authoring Jump-Start
To facilitate creative authoring for non-expert authors, it is
important to help them with easy-to-use user interfaces, and to provide wizards and
templates as easy starting points. This section describes IBM's MPEG-4 authoring
toolkit as an example of an XMT tool, as outlined above, that illustrates the
template and media object based approach. The tools consist of a visual editor, an
XMT Editor, and a preview facility as illustrated in Figure 1. The XMT textual
format, the output of the visual process and/or XMT editor, is compiled to the
MPEG-4 binary format that can then be previewed in a rapid feedback loop.
The toolkit is supported by various transformation tools, such as an
XMT-O to XMT-A converter, an XMT-A to mp4 encoder, and a hinting utility for
stream-ready files.
Figure 1. IBM MPEG-4 XMT Editor Tool
The building-blocks (xMedia objects) and their properties can be
altered either visually in the Visual Editor as described next, or textually in the
XMT Editor described later in this section.
1.3.1 Visual Editor
Designing visual metaphors for (MPEG-4) authoring is a difficult
problem, which will not be addressed here. However, there are proven techniques for
visually representing some of the spatial and temporal properties, e.g. resizing,
drag-and-drop, and time-boxes for manipulating time, which are utilized by the
tool.
Rather than attempting visual representations for low-level
functionalities in MPEG-4 Systems, the authoring tool takes a top-down approach and
treats media objects as pre-defined building blocks. Then the visual editing
becomes simply a matter of the spatial/temporal integration of these pre-defined
building blocks.
The Visual Editor provides a quick entry to MPEG-4 authoring even for
non-expert authors. It provides three views, a spatial view, a details view, and a
temporal view that are represented visually in the three panels as depicted in
Figure 1. The author can size the three view panels according to their
requirements.
The Spatial view is where the author can lay out the media
objects to create a suitable arrangement on the workspace and define animation
and user interactions with the objects. In the Spatial view the author can
therefore carry out the following tasks:
- Add, move, and size objects using drag and drop
- Change intrinsic properties of the media objects, such as color
- Change the z-order of the media objects
- Align media objects by position and/or size
- Define user interactions on the media objects
- Animate media objects
The Temporal view, in the lower right in Figure 1, contains
timeline information about each media object building-block. These timelines
specify when each media object will appear (start) and for how long it will remain
in the presentation. The start time can be moved by dragging the whole "time-box"
horizontally, and the duration of the time-box can be changed by stretching or
shrinking it, dragging one end of the time-box. Placing media objects on the
timeline in this way effectively assigns fixed start/end times to the objects.
Instead of, or in addition to, assigning fixed timing, the author can establish temporal
relationships among the objects, as will be described shortly.
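The time-box manipulations just described (dragging the whole box, or dragging one end) amount to simple arithmetic on a start time and a duration. The Python sketch below shows one way this could look; the `TimeBox` class and its method names are hypothetical, and the clamping rules are assumptions rather than documented behavior of the tool.

```python
from dataclasses import dataclass


@dataclass
class TimeBox:
    """A fixed-timing box on the timeline; times are in seconds."""
    start: float
    duration: float

    @property
    def end(self):
        return self.start + self.duration

    def move(self, delta):
        """Drag the whole box horizontally; the duration is unchanged."""
        self.start = max(0.0, self.start + delta)

    def drag_end(self, delta):
        """Drag the right edge; the start stays put."""
        self.duration = max(0.0, self.duration + delta)

    def drag_start(self, delta):
        """Drag the left edge; the end stays put."""
        delta = min(delta, self.duration)   # cannot pass the right edge
        delta = max(delta, -self.start)     # cannot go before time zero
        self.start += delta
        self.duration -= delta


box = TimeBox(start=3.0, duration=4.0)   # e.g. a text object
box.move(-1.0)        # box now spans 2s to 6s
box.drag_end(2.0)     # box now spans 2s to 8s
box.drag_start(1.0)   # box now spans 3s to 8s
```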
In the example in Figure 2 the timelines of three media objects can be
seen. Here the video media object starts at 0s (seconds), the image media object at
2s, and the text media object at 3s. The time-box of the text object is visible in its
entirety in the view, and hence it can be seen to have a duration of 4s.
The timeline can also depict when transition effects occur as media
objects start. A transition defines a visual (or audio) effect whereby one media object is
gradually replaced by a new one. The new media may fade in, it may slide in, or
it may grow out from the center, or it may use any one of numerous other transition
effects defined in XMT-O. The fact that a transition occurs, and its duration, can be
marked on the media timeline.
The Details view, in the lower left corner in Figure 1, contains
information about each property of the media objects. The author can choose from
the total set of properties which particular properties and values should be shown
in this view. The order of the properties in this table view can also be
configured. The values of the media object properties may be directly edited in
this view.
The three views share a common data model that holds all the
information about the media objects and their properties. Since the views share
the same data, any change to the model, made from any one of the views, can be
immediately reflected in the other views accordingly.
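The shared-model arrangement can be illustrated with a small observer-style sketch in Python. It is a generic model-view example under assumed names (`SceneModel`, `View`, `set_property`), not a description of how the IBM tool is actually structured internally.

```python
class SceneModel:
    """Single source of truth for object properties, observed by all views."""

    def __init__(self):
        self._objects = {}
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def set_property(self, object_id, name, value):
        self._objects.setdefault(object_id, {})[name] = value
        for listener in self._listeners:     # notify every registered view
            listener(object_id, name, value)

    def get_property(self, object_id, name):
        return self._objects[object_id][name]


class View:
    """Stand-in for the Spatial, Details, or Temporal view."""

    def __init__(self, label, model):
        self.label = label
        self.model = model
        self.refreshed = []                  # records what this view has redrawn
        model.subscribe(self.on_change)

    def on_change(self, object_id, name, value):
        self.refreshed.append((object_id, name, value))

    def edit(self, object_id, name, value):
        # Views never keep their own copy of the data; they write to the model.
        self.model.set_property(object_id, name, value)


model = SceneModel()
spatial, details, temporal = (View(n, model) for n in ("Spatial", "Details", "Temporal"))
details.edit("rect1", "color", "blue")       # an edit made in the Details view
```

After the edit, all three views have been notified of the same change, including the view in which it originated.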
Figure 2. Working with the Visual Editor
Consider now a simple example of using the Visual Editor to add a
rectangle media object and then change its color. Using the Media option on
the drop-down menus, a rectangle can be added to the presentation by choosing Add
rectangle as shown in Figure 3. A rectangle will be added to the model and it
will appear in all the views: it is displayed in the Spatial view, its properties
are added to the tabular Details view, and a timeline for it is shown in the Temporal
view, all with default values as given by the model.
Figure 3. Adding a Media Object
To change the color property of the rectangle in the Spatial view, use
the Color option from the Media menu. A color palette with three tabs will appear.
Choose a pre-defined color from the Swatches tab, or define a color using either the
HSB tab or the RGB tab as in Figure 4. Note that the same task can be accomplished
using the context-sensitive pop-up menu from the right mouse button in the
Spatial view, and also by directly editing the color in the Details view.
Figure 4. Changing the Color of the Rectangle
1.3.1.2 Establishing Temporal Relationships
Using the Media option in the menu bar, and choosing
properties..., the author can assign "co-start," "co-end," or "meet" timing
relationships among the selected objects. Co-start means that the selected objects
should start at the same time, Co-end means to end at the same time, and Meet means
to have new media start when the prior one(s) end, i.e. to play them back-to-back
in sequence. Figure 5 depicts a simple presentation where the rectangle is followed
by the circle using the "meet" relationship.
Figure 5. Establishing Temporal Connections (relationships)
This relationship model of the timing of objects is a fundamental
aspect of the FlexTime model [FlexTime] of MPEG-4, where objects are timed relative
to each other rather than being timed against a static fixed timeline. The concept
behind the FlexTime model is to provide application-level (author-determined)
quality of service that allows for runtime, dynamic adjustment of the playback of
the presentation over networks that have non-deterministic transmission delays. So,
for example, even if the media in a relationship is delayed, prior media can be held,
repeated, etc., according to the author's preference, to ensure a smooth, seamless playback
even when media does not arrive in time. Using FlexTime the author has control over
what happens in the presence of delays. Of course this adds a little extra work on
the part of the author, but rather than rely on the client just trying to do its
best to muddle through, the author can insert guidelines into the FlexTime
information on how to handle the conditions so that the presentation can be
experienced in the best way possible despite the network conditions.
Note that the FlexTime model can also make the author's job easier during
the planning phase. One of the main benefits of this relationship-based timing is
manifested when the author needs to make a temporal change (e.g. change the
duration) to an object that is positioned in the middle of the fixed timeline. If
the author intends to position the remaining objects relative to the one that is to
be changed, it becomes a chore to have to move the corresponding objects on the
timeline one at a time. But with the relationships assigned, it becomes
the authoring tool's responsibility to reschedule the timing of all the objects
to reflect the change. The author does not have to do any detailed alignment.
1.3.2 XMT Editor
As xMedia objects are added to the data model for the presentation at
hand, with their times and (optionally) temporal connections (or relationships), the
corresponding XMT is created. The XML representation of elements, which themselves
can contain other elements, forms a tree structure where elements have parent-child
relationships to one another through this containment. This XML tree can then be
operated on through a DOM (Document Object Model) that defines the manipulation of and
interaction with this tree. From the end user's point of view, a hierarchical
tree-structured view of the document can be edited. The XML tree can be kept consistent
with the XMT Schemas, which define the rules of element containment, attribute
values, etc.
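As a rough illustration of working with such a tree through a DOM, the Python sketch below uses the standard library's `xml.dom.minidom` to list the element hierarchy and edit one attribute. The markup, the "id" and "color" attributes, and the `outline` helper are assumptions for the example; the document is not validated against the XMT schemas here.

```python
from xml.dom import minidom

# Illustrative markup; not validated against the XMT-O schema.
SOURCE = (
    '<par><rectangle id="r1" color="gray"/>'
    '<seq><img id="i1"/><text id="t1"/></seq></par>'
)


def outline(node, depth=0, lines=None):
    """Return an indented outline of element names, as a tree view might show."""
    lines = [] if lines is None else lines
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            outline(child, depth + 1, lines)
    return lines


doc = minidom.parseString(SOURCE)
tree = outline(doc.documentElement)
print("\n".join(tree))

# Edit an attribute through the DOM, then serialize the element again.
rect = doc.getElementsByTagName("rectangle")[0]
rect.setAttribute("color", "red")
parent_name = rect.parentNode.tagName     # containment gives the parent-child link
serialized = rect.toxml()
```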
The IBM MPEG-4 XMT Editor tool, using the model-view paradigm, can
display the XMT tree as yet another view of the data model, just as it does for
the Spatial, Details and Temporal views described above. The XMT Editor has
been designed to review and edit the XMT-O format, signaling the "textual" changes
made by the author back into the data model, from where they can be presented in
the other views in the Visual Editor, and vice versa.
As the XMT-O is an XML document the presentation can be saved at any
point to a file whereupon the author can choose to manipulate or review the
content in any other XML editor of their choice. The XMT-O document will be
validated for conformance to the Schema when it is opened again in the authoring
tool.
In the authoring tool the XMT Editor displays a hierarchical tree
structure of the model, as in Figure 6, and provides a view of the media objects in
the presentation in terms of the XMT-O elements and their attributes that define
them.
The left panel displays the document tree. Items can be expanded by
clicking on the plus symbols, or collapsed by clicking on the minus
symbols.
The right panel displays the attributes for the highlighted node.
Attributes can be entered and their values edited in this area. After a change
is made in this panel, the author can click on the Apply button for the
changes to take effect, that is, to update the data model and hence propagate
the change to all the other views.
Figure 6. The XMT-O view in the XMT editor
1.4 Transformation Tools
The IBM authoring tool always saves the presentation being authored as
an XMT-O format document instance. However, other forms can be
exported, such as the corresponding XMT-A format and/or the mp4 interchangeable file
format. To deliver the mp4 over various transports, the mp4 file can also be
hinted, for example for transmission using RTP/RTSP. Hinting provides
assistance (hints) to servers so they can more readily serve the content to clients.
Interleaved formats suitable for simple lightweight delivery over HTTP can also be
exported. Moreover, to facilitate interoperability of the XMT format with SMIL 2.0
on the one hand and with X3D on the other, the toolkit contains format converters to
go to and from the XMT-O and SMIL 2.0 formats and also to and from the XMT-A and X3D
formats. In this section we discuss the profile-dependent XMT-O to
XMT-A export facility.
1.4.1 Profile dependent export of XMT-O to XMT-A/mp4
The XMT-O is a high-level authoring format that abstracts media and
captures an author's intent. The power of the XMT-O format is that the author
does not need to select the necessary MPEG-4 tools at the time of authoring. Given
an XMT-O file, many different combinations of MPEG-4 tools can be selected when
it is converted into XMT-A/mp4.
The choice of MPEG-4 tools used by the converter can be made
according to the MPEG-4 profile and level for which a particular
instance of the presentation is needed at that time. Of course, the XMT-O can
be authored knowing the particular profile and level that is the target; for
instance not much can reasonably be done with a video in the presentation if
the author wants to target an audio-only profile. And it is not always possible
to find a set of MPEG-4 tools supported by the profile and level for a
presentation authored using arbitrary XMT-O constructs. However, to facilitate
the creation of alternate content, XMT-O contains a content selection (switch)
mechanism. This allows alternate content representations or substitutes
to be defined by the author, or even the content to be omitted entirely. The
switch uses test attributes that determine which elements shall be selected.
The same switch mechanism also allows alternate forms of the same
content to be created in the same document. For instance, a presentation may
contain more than one language, and the content may then be exported using a
single chosen language. The alternate forms may also target different bandwidths.
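A minimal Python sketch of how a switch might be evaluated at export time is given below. It assumes SMIL 2.0 style semantics, in which the first child whose test attributes all pass is selected, and it borrows the SMIL 2.0 test attribute names systemLanguage and systemBitrate; whether XMT-O uses exactly these names and rules should be checked against the specification. The file names and the `select` and `passes` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute names follow SMIL 2.0 test attributes; their use here is an
# assumption, and the markup is not validated against the XMT-O schema.
DOC = """
<switch>
  <audio src="narration-fr.mp3" systemLanguage="fr"/>
  <audio src="narration-en-hq.mp3" systemLanguage="en" systemBitrate="128000"/>
  <audio src="narration-en.mp3" systemLanguage="en"/>
  <text src="transcript.txt"/>
</switch>
"""


def passes(element, language, bitrate):
    """True if every test attribute present on the element is satisfied."""
    langs = element.get("systemLanguage")
    if langs is not None and language not in [item.strip() for item in langs.split(",")]:
        return False
    min_rate = element.get("systemBitrate")
    if min_rate is not None and bitrate < int(min_rate):
        return False
    return True


def select(switch, language, bitrate):
    """Return the first acceptable child, or None if the content is omitted."""
    for child in switch:
        if passes(child, language, bitrate):
            return child
    return None


switch = ET.fromstring(DOC)
fast_en = select(switch, "en", 256000).get("src")
slow_en = select(switch, "en", 56000).get("src")
german = select(switch, "de", 256000).get("src")
```

A child without test attributes, such as the transcript here, acts as a fallback; leaving it out would cause the content to be omitted when no alternative matches.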
Test attributes also allow audiovisual source media to be encoded
according to the target profile. Encoding hints in the XMT format complement
the encoding process and allow some authoring control over the process.
To summarize, it is the choice of the author to create a
presentation that is either independent of the profile and level, or dependent
on a target profile and level. Moreover, the high-level, intent-based nature of
the XMT-O format provides flexibility in the delivery over various
transports by providing delivery hints that allow encoding to the various
transport-ready (hinted) forms, whether a hinted mp4 file or live streams.
