IBM Research

Authoring in XMT


This text was written by us as part of an authoring chapter for an MPEG-4 book. The authoring chapter was, however, not included in the final book. Since the text discusses XMT in the context of authoring for the Web and our authoring tools, we present it here.


1. Authoring in XMT

One of the main aims of the XMT textual format for MPEG-4 is to alleviate the complexity of authoring. MPEG-4 provides many powerful, sophisticated tools for the representation of audiovisual scenes. Such tools cover visual coding, audio coding, scene representation (BIFS), as well as media properties and their associations with the scene (OD Framework). Taking advantage of the powerful tools and functionalities that MPEG-4 provides can take considerable skill and is often best left to expert authors. XMT changes this landscape and allows expert authors and novices alike to produce compelling, rich, interactive, animated content using MPEG-4 tools.

1.1 Using the XMT Formats

An author should be aware that the XMT format, which is XML-based, consists of two levels: XMT-O, a high-level format, and XMT-A, a low-level format.

The XMT-A format is a direct textual representation of the MPEG-4 Systems binary coded tools and provides a deterministic mapping to and from the MPEG-4 binary representation. As such, its primary users will be authors skilled in MPEG-4 who can utilize fundamental MPEG-4 tools and create the right nodes, routes, etc. for the job they have in mind. One authoring dilemma is that such scenes can become very complicated very quickly. If the content is passed to another author, understanding the combination of nodes and routes can be difficult, and it is often not easy to recover the original author's intentions from the XMT-A format. The original author can of course add human-readable comments to the text, akin to adding comments to code when producing software, but such comments, though useful for humans, do not give authoring tools the same understanding.

XMT-O addresses the issues of complexity, author intentions and skills. XMT-O provides a useful high-level abstraction of the low-level MPEG-4 functionalities. It is based on SMIL and also has elements of syntax familiar to Web authors. The high-level abstractions allow expert and novice authors alike to produce advanced MPEG-4 content, limited only by an authoring tool's capability to take advantage of the richness of MPEG-4 tools to represent such content. In sharing syntax and semantics with other Web-oriented media languages, XMT-O allows re-use of skills and provides familiarity for authors, who can now more easily move to utilizing the power of MPEG-4. Because the intentions are clearer in the high-level XMT-O format, content is more easily passed from author to author and among authoring tools. For example, a transition between two images can be simply stated as 'Four-Corners In', meaning the four corners of the new image are progressively shown, hiding the prior image. In XMT-A, decoding the combination of nodes and routes to come up with 'Four-Corners In' would be complicated, and there is more than one way to represent the effect; hence XMT-O can better communicate the author's intent. XMT-O can include custom media objects defined in XMT-A, so it is possible to use a mix-and-match approach, taking advantage of the benefits of both formats.

1.2 MPEG-4 for Non-expert Authors

In addition to having an authoring format the availability of suitable authoring tool(s) at the author's disposal is vitally important. Although the XMT is a textual format and an author could resort to using a simple text editor, as one might have done in the early days of HTML, a suitable authoring tool will vastly improve productivity and provide accessibility to MPEG-4 for novice authors.

While a text-based or structured XML editor may work, an authoring tool that allows drag-and-drop, what-you-see-is-what-you-get visual manipulation of content works well for novices and experts alike. For experts, additional features can allow direct editing of the content in the XMT format.

To get started with the creation of content a wizard or template may be used. A wizard would guide an author through the process of producing a certain type of content for a particular application, for example a slideshow with subtitles. Here the wizard prompts for each image and subtitle and, when complete, creates the XMT representing that content. This can then be further edited as desired. A template may also be a suitable starting point, for example to create an advertising flyer for marketing a house. The template may contain non-editable items, e.g. a company logo and disclaimers. Other items can be edited simply by clicking on the placeholder within the template, whereupon the tool will prompt for suitable content. For example, clicking in the placeholder for the photo of the house will prompt for a suitable image to insert.

Content can also be created 'from scratch', using whatever artistic flair one has. Sometimes it is necessary to lay down a few media objects to see how they look. This section therefore presents a visual approach to authoring MPEG-4 that uses high-level media objects (xMedia objects) as building blocks. The artistic, creative process is best realized iteratively, with a rapid feedback loop. When designed properly, an authoring tool with simple menus and easy-to-use user interfaces can make high-level media objects an effective means of developing an initial storyboard.

Such an authoring tool can provide a palette of media objects. Inserting such an object into the scene, with appropriate defaulted values, would also create the necessary XMT to represent this. Having appropriate default values allows an author to readily insert a media object without being prompted for every possible attribute. Values for the defaults can be set in options for the tools so that satisfactory defaults are achieved for the author. However each attribute can also be changed later, whether by visual interaction, such as dragging to move it, through simple dialog boxes, say to change color, via pop-up menus for selections, or even by direct editing of the XMT representation.

Some properties may be intrinsic to the media, such as color, while others are applied to the media, such as its duration or start time. Dependent on the property of the media object being changed one method may be more satisfactory than another, although a tool will often provide more than one way to do a job.

Once the content has been created from a suitable assembly of media objects the XMT-O format will be a true representation of that content; even if the content was only finally arrived at through a lengthy iterative, interactive, visual creation process.

In the XMT-O format the content can be passed to other authors for refinement in a publishing process. Indeed, other authoring tools can be used to process the content at this high level. When the content is considered complete and ready for deployment, it can be encoded to the MPEG-4 binary representation and stored in a file such as mp4. Such encoding can be done directly, or the XMT-O may first be transformed to XMT-A before encoding. When the content is in XMT-A, an expert author has the opportunity to tweak it, if necessary or desirable, before encoding it to the MPEG-4 binary representation.

With the MPEG-4 binary content in hand the author can now check how it plays on the target client(s). If late changes are needed, these can be rapidly accommodated in XMT. Having a feedback loop that is rapidly responsive to change is a vital part of the authoring process, especially if content is created on-the-fly interactively: it is important to see the final result and be able to change it quickly.

While an author may choose to author with either XMT-O or XMT-A, the author is likely to stay with the one format for the entire content creation cycle. Of course, an author is free to ignore the XMT-O format once an initial storyboard is obtained, and refine the presentation working with the XMT-A format. However once a decision is made to work in XMT-A it will not, in general, be possible to go back to XMT-O since there is no guarantee of an inverse mapping from XMT-A to XMT-O. In working with XMT-O it is therefore recommended to stay with the XMT-O format. If the XMT-A format is wanted then a tool can automatically generate the corresponding XMT-A format as needed.

1.2.1 Objects as Building Blocks

The building blocks of the authoring tool described here are XMT-O xMedia objects as defined in ISO/IEC 14496 (MPEG-4 Systems specification). The XMT-O format has been designed around the xMedia objects, which include media such as rectangle, circle, text, string, img, video, audio, sphere, and cylinder, etc.

An xMedia object is represented by an XML element and is principally an abstraction based on MPEG-4 geometries with media specific attributes, which can contain other media independent elements having attributes such as color, position, scale and visibility. Some attributes are type dependent, e.g. a color applies only to visual elements and not to audio elements; while some elements are specific only to a certain media type, e.g. font style for text.

Given an xMedia object, behaviors can be defined such as animation of the object where properties are altered over time or are triggered by events. An event is often a user interaction such as a mouse click on the object. But other events associated with timing such as begin event, and repeat count event are also defined in XMT-O. Events and animation provide the basis for highly interactive dynamic presentations.

The timeline for XMT-O media objects is represented in XMT by timing containers and synchronization dependencies (sync-arcs). Timing containers, such as "par" and "seq" that play contained media objects in parallel or in sequence, are based on the timing modules as defined in SMIL 2.0. Note that the elements and the attributes defined in XMT-O are a large subset of those defined in SMIL 2.0. As such the authoring approach described above is not only useful for authoring MPEG-4 but it can also be applied for authoring SMIL documents.
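The behavior of the "par" and "seq" timing containers can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: every media object has a fixed duration, begin offsets, repeats and sync-arcs are ignored, and a "par" ends when its last child ends (the SMIL 2.0 default). The class and function names (Media, Container, schedule) are hypothetical and are not part of XMT or of any tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Media:
    name: str
    dur: float  # fixed duration in seconds (a simplifying assumption)


@dataclass
class Container:
    kind: str  # "par" or "seq"
    children: List[Union["Media", "Container"]] = field(default_factory=list)


def schedule(node, start: float = 0.0, out: Dict[str, Tuple[float, float]] = None):
    """Return (end_time, {name: (begin, end)}) for a par/seq tree."""
    if out is None:
        out = {}
    if isinstance(node, Media):
        end = start + node.dur
        out[node.name] = (start, end)
        return end, out
    if node.kind == "par":
        # All children begin together; the container ends with the last child.
        ends = [schedule(child, start, out)[0] for child in node.children]
        return (max(ends) if ends else start), out
    if node.kind == "seq":
        # Each child begins when the previous child ends.
        t = start
        for child in node.children:
            t, _ = schedule(child, t, out)
        return t, out
    raise ValueError(f"unknown container kind: {node.kind}")


# A video and an audio track in parallel, followed by an image.
tree = Container("seq", [
    Container("par", [Media("video", 5.0), Media("audio", 4.0)]),
    Media("img", 3.0),
])
total, times = schedule(tree)
for name, (begin, end) in times.items():
    print(f"{name}: {begin}s to {end}s")
```

Here the image begins at 5s because the enclosing "par" lasts as long as its longest child.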

xMedia objects can thus be seen to provide an abstraction layer over MPEG-4 nodes, and commands etc (BIFS) and the Object Descriptor Framework. This makes it possible to design presentations at a more abstract level and allows authors to avoid the complexity of MPEG-4 for many MPEG-4 applications.

1.3 MPEG-4 Authoring Jump-Start

To facilitate creative authoring for non-expert authors, it is important to help them with easy-to-use user interfaces, and to provide wizards and templates as easy starting points. This section describes IBM's MPEG-4 authoring toolkit as an example of an XMT tool, as outlined above, that illustrates the template and media object based approach. The tools consist of a visual editor, an XMT Editor, and a preview facility as illustrated in Figure 1. The XMT textual format, the output of the visual process and/or XMT editor, is compiled to the MPEG-4 binary format that can then be previewed in a rapid feedback loop.

The toolkit is supported by various transformation tools, such as an XMT-O to XMT-A converter, an XMT-A to mp4 encoder, and a hinting utility for stream-ready files.



Figure 1. IBM MPEG-4 XMT Editor Tool


The building-blocks (xMedia objects) and their properties can be altered either visually in the Visual Editor as described next, or textually in the XMT Editor described later in this section.

1.3.1 Visual Editor

Designing visual metaphors for (MPEG-4) authoring is a difficult problem, which will not be addressed here. However, there are proven techniques for visually representing some of the spatial and temporal properties, e.g. resizing, drag-and-drop, and time-boxes for manipulating time, which are utilized by the tool.

Rather than attempting visual representations for low-level functionalities in MPEG-4 Systems, the authoring tool takes a top-down approach and treats media objects as pre-defined building blocks. Then the visual editing becomes simply a matter of the spatial/temporal integration of these pre-defined building blocks.

The Visual Editor provides a quick entry to MPEG-4 authoring even for non-expert authors. It provides three views, a spatial view, a details view, and a temporal view that are represented visually in the three panels as depicted in Figure 1. The author can size the three view panels according to their requirements.

The Spatial view is where the author can lay out the media objects to create a suitable arrangement on the workspace and define animation and user interactions with the objects. In the Spatial view the author can therefore carry out the following tasks:

  • Add, move, and size objects using drag and drop
  • Change intrinsic properties of the media objects, such as color
  • Change the z-order of the media objects
  • Align media objects by position and/or size
  • Define user interactions on the media objects
  • Animate media objects

The Temporal view, in the lower right in Figure 1, contains timeline information about each media object building block. These timelines specify when each media object will appear (start) and for how long it will remain in the presentation. The start time can be moved by dragging the whole "time-box" horizontally, and the duration can be changed by dragging one end of the time-box to stretch or shrink it. Placing media objects on the timeline in this way effectively assigns fixed start/end times to the objects. Instead of, or in addition to, assigning fixed timing, the author can establish temporal relationships among the objects, as will be described shortly.

In the example in Figure 2 the timelines of three media objects can be seen. Here the video media object starts at 0s (second), the image media object at 2s, and the text media object at 3s. The text can be seen in its entirety in the view and hence it can be seen to have a duration of 4s.
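The two time-box gestures, dragging the whole box and dragging one end, amount to simple arithmetic on a start time and a duration. The following Python sketch illustrates this with the text object from the example (start 3s, duration 4s). The TimeBox class and its method names are hypothetical and do not reflect the tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class TimeBox:
    start: float  # seconds from the beginning of the presentation
    dur: float    # seconds the object remains in the presentation

    @property
    def end(self) -> float:
        return self.start + self.dur

    def move(self, delta: float) -> None:
        """Drag the whole box horizontally: start changes, duration does not."""
        self.start = max(0.0, self.start + delta)

    def drag_end(self, new_end: float) -> None:
        """Drag the right edge: duration changes, start does not."""
        if new_end <= self.start:
            raise ValueError("end must be later than start")
        self.dur = new_end - self.start


text = TimeBox(start=3.0, dur=4.0)
initial_end = text.end   # start + duration of the text object
text.move(2.0)           # drag the box two seconds to the right
text.drag_end(10.0)      # stretch the right edge out to 10s
print(text)
```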

The timeline can also depict when transition effects occur as media objects start. A transition defines a visual (or audio) effect whereby one media object is gradually replaced by a new one. The new media may fade in, slide in, or grow out from the center, or it may use any one of numerous other transition effects defined in XMT-O. The fact that a transition occurs, and its duration, can be marked on the media timeline.

The Details view, in the lower left corner in Figure 1, contains information about each property of the media objects. The author can choose from the total set of properties which particular properties and values should be shown in this view. The order of the properties in this table view can also be configured. The values of the media object properties may be directly edited in this view.

The three views share a common data model that holds all the information about the media objects and their properties. Since the views share the same data, any change to the model, made from any one of the views, can be immediately reflected in the other views accordingly.
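One common way to obtain this behavior is the observer pattern, in which each view registers with the model and is notified of every change. The Python sketch below is only an illustration of that idea; the names SceneModel and RecordingView are hypothetical, and the source does not state how the IBM tool implements its model.

```python
from typing import Any, Callable, Dict, List, Tuple

Listener = Callable[[str, str, Any], None]


class SceneModel:
    """Holds media object properties and notifies registered views of changes."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[str, Any]] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def set_property(self, obj_id: str, prop: str, value: Any) -> None:
        self._objects.setdefault(obj_id, {})[prop] = value
        for listener in self._listeners:
            listener(obj_id, prop, value)

    def get_property(self, obj_id: str, prop: str) -> Any:
        return self._objects[obj_id][prop]


class RecordingView:
    """Stand-in for a Spatial, Details or Temporal view; it only records updates."""

    def __init__(self, name: str, model: SceneModel) -> None:
        self.name = name
        self.seen: List[Tuple[str, str, Any]] = []
        model.subscribe(self.on_change)

    def on_change(self, obj_id: str, prop: str, value: Any) -> None:
        self.seen.append((obj_id, prop, value))


model = SceneModel()
views = [RecordingView(n, model) for n in ("spatial", "details", "temporal")]

# A change made from any one view goes through the model and reaches all views.
model.set_property("rect1", "color", "red")
for view in views:
    print(view.name, view.seen)
```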



Figure 2. Working with the Visual Editor


Consider now a simple example of using the Visual Editor to add a rectangle media object and then change its color. Using the Media option on the drop-down menus, a rectangle can be added to the presentation by choosing Add rectangle, as shown in Figure 3. A rectangle will be added to the model and will appear in all the views: it is displayed in the Spatial view, its properties are added to the tabular Details view, and a timeline for it is shown in the Temporal view, all with default values as given by the model.



Figure 3. Adding a Media Object


To change the color property of the rectangle in the Spatial view, use the Color option from the Media menu. A color palette with three tabs will appear. Choose a pre-defined color from the Swatches tab, or define a color using either the HSB tab or the RGB tab as in Figure 4. Note that the same task can be accomplished using the context-sensitive pop-up menu from the right mouse button in the Spatial view, and also by directly editing the color in the Details view.



Figure 4. Changing the Color of the Rectangle


1.3.1.2 Establishing Temporal Relationships

Using the Media option in the menu bar, and choosing properties..., the author can assign "co-start," "co-end," or "meet" timing relationships among the selected objects. Co-start means that the selected objects should start at the same time, Co-end means to end at the same time, and Meet means to have new media start when the prior one(s) end, i.e. to play them back-to-back in sequence. Figure 5 depicts a simple presentation where the rectangle is followed by the circle using the "meet" relationship.



Figure 5. Establishing Temporal Connections (relationships)


This relationship model of the timing of objects is a fundamental aspect of the FlexTime model [FlexTime] of MPEG-4, where objects are timed relative to each other rather than against a static fixed timeline. The concept behind the FlexTime model is to provide application-level (author-determined) quality of service that allows for runtime, dynamic adjustment of the playback of the presentation over networks that have non-deterministic transmission delays. So, for example, even if the media in a relationship is delayed, prior media can be held, repeated, etc., according to the author's preference, to ensure a smooth, seamless playback even when media does not arrive in time. Using FlexTime the author has control over what happens in the presence of delays. Of course this adds a little extra work on the part of the author, but rather than relying on the client just trying to do its best to muddle through, the author can insert guidelines into the FlexTime information on how to handle the conditions, so that the presentation can be experienced in the best way possible despite the network conditions.

Note that the FlexTime model can also make the author's job easier during the planning phase. One of the main benefits of this relationship-based timing is manifested when the author needs to make a temporal change (e.g. change the duration) to an object that is positioned in the middle of the fixed timeline. If the author intends to position the remaining objects relative to the one that is to be changed, it becomes a chore to have to move the corresponding objects on the timeline one at a time. But with the relationships assigned, it becomes the authoring tool's responsibility to reschedule the timing of all the objects to reflect the change. The author does not have to do any detailed alignment.
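The rescheduling step can be illustrated with a small Python sketch that recomputes start times from "co-start", "co-end" and "meet" relationships after a duration changes. This is a static, authoring-time illustration only: it is not the FlexTime runtime model and not the tool's actual algorithm, and the names Timed and reschedule are hypothetical. Each relation is written as (kind, reference object, dependent object).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Timed:
    start: float
    dur: float

    @property
    def end(self) -> float:
        return self.start + self.dur


Relation = Tuple[str, str, str]  # (kind, reference, dependent)


def reschedule(objects: Dict[str, Timed], relations: List[Relation]) -> None:
    """Adjust dependent start times until every relation is satisfied."""
    for _ in range(len(relations) + 1):
        changed = False
        for kind, ref, dep in relations:
            a, b = objects[ref], objects[dep]
            if kind == "co-start":
                new_start = a.start          # begin together
            elif kind == "co-end":
                new_start = a.end - b.dur    # end together
            elif kind == "meet":
                new_start = a.end            # play back-to-back
            else:
                raise ValueError(f"unknown relation: {kind}")
            if new_start != b.start:
                b.start = new_start
                changed = True
        if not changed:
            return
    raise ValueError("relations are cyclic or contradictory")


objects = {
    "rectangle": Timed(0.0, 4.0),
    "circle": Timed(0.0, 3.0),
    "text": Timed(0.0, 2.0),
}
relations = [("meet", "rectangle", "circle"), ("co-end", "circle", "text")]

reschedule(objects, relations)
before = {name: (o.start, o.end) for name, o in objects.items()}

# Lengthen the rectangle; the dependent objects shift without manual edits.
objects["rectangle"].dur = 6.0
reschedule(objects, relations)
after = {name: (o.start, o.end) for name, o in objects.items()}
print(before)
print(after)
```

Only the rectangle's duration is edited by hand; the circle and the text follow from the relations.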

1.3.2 XMT Editor

As xMedia objects are added to the data model for the presentation at hand, with their times and (optionally) temporal connections (or relationships), the corresponding XMT is created. The XML representation of elements, which themselves can contain other elements, forms a tree structure in which elements have parent-child relationships to one another through this containment. This XML tree can then be operated on through a DOM (Document Object Model), which defines the manipulation of and interaction with the tree. From the end user's point of view, a hierarchical tree-structured view of the document can be edited. The XML tree can be kept consistent with the XMT Schemas, which define the rules of element containment, attribute values, etc.
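As an illustration of operating on such a tree through a DOM, the Python sketch below uses the standard xml.dom.minidom module to list a small document as an indented outline, change an attribute, and add an element. The fragment is not validated XMT-O: the element names par, rectangle, circle and string and the color attribute are taken from the text, while the id attribute and all values are assumptions made for the example.

```python
from xml.dom import minidom

# Illustrative fragment only; not checked against the XMT-O Schema.
SOURCE = """<par>
  <rectangle id="rect1" color="red"/>
  <circle id="circle1" color="blue"/>
</par>"""


def outline(node, depth=0):
    """Return the element tree as indented lines, one per element."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        attrs = " ".join(f'{k}="{v}"' for k, v in node.attributes.items())
        lines.append("  " * depth + node.tagName + (f" [{attrs}]" if attrs else ""))
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    return lines


doc = minidom.parseString(SOURCE)
root = doc.documentElement

# Edit an attribute of an existing element.
rect = root.getElementsByTagName("rectangle")[0]
rect.setAttribute("color", "green")

# Add a new child element to the container.
label = doc.createElement("string")
label.setAttribute("id", "label1")
root.appendChild(label)

print("\n".join(outline(root)))
```

A schema-aware editor would additionally check each such change against the containment and attribute rules before accepting it; that step is omitted here.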

The IBM MPEG-4 XMT Editor tool, using the model-view paradigm, can display the XMT tree as yet another view of the data model, just as it does for the Spatial, Details and Temporal views described above. The XMT Editor has been designed for reviewing and editing the XMT-O format: it signals the "textual" changes made by the author back into the data model, from where they can be presented in the other views in the Visual Editor, and vice versa.

As the XMT-O is an XML document the presentation can be saved at any point to a file whereupon the author can choose to manipulate or review the content in any other XML editor of their choice. The XMT-O document will be validated for conformance to the Schema when it is opened again in the authoring tool.

In the authoring tool the XMT Editor displays a hierarchical tree structure of the model, as in Figure 6, and provides a view of the media objects in the presentation in terms of the XMT-O elements and their attributes that define them.

The left panel displays the document tree. Items can be expanded by clicking on the plus symbols, or collapsed by clicking on the minus symbols.

The right panel displays the attributes for the highlighted node. Attributes can be entered and their values edited in this area. After a change is made in this panel, the author can click on the Apply button for the changes to take effect, that is, to update the data model and hence propagate the change to all the other views.



Figure 6. The XMT-O view in the XMT editor


1.4 Transformation Tools

The IBM authoring tool always saves the presentation being authored as an XMT-O format document instance. However, other forms can be exported, such as the corresponding XMT-A format and/or the mp4 interchangeable file format. To deliver the mp4 over various transports, the mp4 file can also be hinted, for example for transmission using RTP/RTSP. Hinting provides assistance (hints) to servers so they can more readily serve the content to clients. Interleaved formats suitable for simple lightweight delivery over HTTP can also be exported. Moreover, to facilitate interoperability of the XMT format with SMIL 2.0 on the one hand and with X3D on the other, the toolkit contains format converters to go to and from the XMT-O and SMIL 2.0 formats and also to and from the XMT-A and X3D formats. In this section we discuss the profile-dependent XMT-O to XMT-A export facility.

1.4.1 Profile dependent export of XMT-O to XMT-A/mp4

The XMT-O is a high-level authoring format that abstracts media and captures an author's intent. The power of the XMT-O format is that the author does not need to select necessary MPEG-4 tools at the time of authoring. Given an XMT-O file, many different combinations of MPEG-4 tools can be selected when it is converted into XMT-A/mp4.

The choice of MPEG-4 tools used by the converter can be made according to the MPEG-4 profile and level for which a particular instance of the presentation is needed at that time. Of course, the XMT-O can be authored knowing the particular profile and level that is the target; for instance not much can reasonably be done with a video in the presentation if the author wants to target an audio-only profile. And it is not always possible to find a set of MPEG-4 tools supported by the profile and level for a presentation authored using arbitrary XMT-O constructs. However to facilitate the creation of alternate content XMT-O contains a content selection (switch) mechanism. Using this allows alternate content representations or substitutes to be defined by the author, or even the content to be omitted entirely. The switch uses test attributes that determine which elements shall be selected.

The same switch mechanism also allows alternate forms of the same content to be created in the same document. For instance a presentation may contain more than one language. The content may hence be exported using the chosen single language. Or it may be alternate forms for different bandwidths.
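The selection logic of a switch can be sketched in Python as follows. The sketch follows the SMIL 2.0 convention that the first child whose test attributes all evaluate to true is selected, and it uses the SMIL 2.0 test attribute names systemLanguage and systemBitrate; that XMT-O uses exactly these names is an assumption here, the language test is simplified to an exact match, and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; attribute names follow SMIL 2.0, file names are made up.
DOC = """<switch>
  <audio src="narration-fr.mp4" systemLanguage="fr"/>
  <audio src="narration-en.mp4" systemLanguage="en"/>
  <audio src="narration-default.mp4"/>
</switch>"""


def passes(element, settings):
    """True if every test attribute present on the element is satisfied."""
    lang = element.get("systemLanguage")
    if lang is not None and lang != settings.get("language"):
        return False
    bitrate = element.get("systemBitrate")
    if bitrate is not None and int(bitrate) > settings.get("bitrate", 0):
        return False
    return True


def select(switch, settings):
    """Return the first acceptable child of the switch, or None."""
    for child in switch:
        if passes(child, settings):
            return child
    return None


switch = ET.fromstring(DOC)
english = select(switch, {"language": "en"})
fallback = select(switch, {"language": "de"})
print(english.get("src"))
print(fallback.get("src"))
```

A child with no test attributes always passes, so placing it last gives the author a default; leaving it out allows the content to be omitted entirely when no alternative matches.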

Test attributes also allow audiovisual source media to be encoded according to the target profile. Encoding hints in the XMT format complement the encoding process and allow some authoring control over the process.

To summarize, it is the choice of the author to create a presentation that is either independent of the profile and level, or dependent on a target profile and level. Moreover, the high-level, intent-based nature of the XMT-O format provides flexibility in the delivery over various transports by providing delivery hints that allow encoding to the various transport-ready (hinted) forms, whether hinted mp4 files or live streams.



