TRL

Video enrichment/Image annotation scheme and search



Annotations

The position of a video object is usually expressed in image (pixel) coordinates. However, as the camera moves, these coordinates have to be corrected in order to reflect an object's actual movements. For this purpose, the camera's movements are reconstructed and the resulting parameters are used to project the video images onto a virtual plane. With the virtual plane as reference, it is possible to recalculate an object's position in a consistent coordinate system, using video obtained from a single camera. In addition, by setting the virtual plane as if seen from a camera above the ground, it is possible to reconstruct an aerial view with the positions of the objects projected onto it, allowing the measurement of actual distances between objects on the ground instead of distances given in image pixels.
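
The correction described above can be illustrated with a minimal sketch. It assumes that the recovered camera parameters for each frame are available as a 3x3 projective matrix (a homography) from image pixels to the virtual plane; that form, the function names and the example matrices are assumptions made for illustration, not the parameters used in the actual system.

```python
from math import hypot

def to_plane(H, x, y):
    """Map image pixel (x, y) to virtual-plane coordinates with a 3x3 matrix H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    return (u / w, v / w)

def plane_distance(p, q):
    """Distance between two points on the virtual plane (e.g. in metres)."""
    return hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical per-frame matrices: 0.25 plane units per pixel.
# In frame 1 the camera has panned 40 pixels to the right, so the
# matrix adds 0.25 * 40 = 10 units to compensate.
H_FRAME0 = [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]]
H_FRAME1 = [[0.25, 0.0, 10.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]]

# A stationary object appears at different pixels in the two frames,
# but at the same position on the virtual plane.
still_0 = to_plane(H_FRAME0, 200, 100)
still_1 = to_plane(H_FRAME1, 160, 100)

# Distance between two objects seen in the same frame, measured on the plane.
gap = plane_distance(to_plane(H_FRAME0, 200, 100), to_plane(H_FRAME0, 212, 116))
```

Once every frame has its own matrix, positions taken from different frames can be compared directly, which is what makes distances on the ground meaningful.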

Motion recognition is performed based on changes in an object's shape over time. This is accomplished by discarding the color information inside the region of the image comprising the object, which yields a silhouette. This silhouette changes as the object moves, generating patterns in eigenspace that characterize given movements. Therefore, a motion can be recognized by matching a movement in eigenspace with previously recorded movement patterns. This process requires computing the changes in an object's continuous movement and mapping them to the eigenspace; at the present development stage, however, these changes have been entered manually.

The manual input is done by determining an object's motion during an interval, assigning a motion identifier and registering the frame numbers at the beginning and end of the movement. Since an object is considered to perform the same motion in all frames within this interval, the data is entered only at the boundaries, when the object changes to a different movement. This is done for all objects during their lifetime in the video. Thus, the essential description unit is the motion identifier. The description of an object's movements is called an "Action", and the annotation comprises the motion identifier, the start and end frames, and the object's position observed over time (i.e. its trajectory, described as a series of discrete points in the time interval; the position of the object at an arbitrary point in time can be calculated by interpolating the points registered in the "Action").
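
The eigenspace matching mentioned above can be sketched as follows. This is only an illustration under several assumptions: the object region is already segmented with background pixels equal to 0, the eigenspace is built by principal component analysis using power iteration, and a movement is matched to the recorded pattern with the smallest summed point-to-point distance between sequences of equal length. All function names are hypothetical.

```python
import random
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def to_silhouette(region, background=0):
    """Discard pixel values: 1.0 inside the object, 0.0 elsewhere, flattened."""
    return [1.0 if px != background else 0.0 for row in region for px in row]

def fit_eigenspace(samples, k, iters=100, seed=0):
    """Return (mean, components): up to k principal axes found by power iteration."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        found = False
        for _ in range(iters):
            # Multiply v by the scatter matrix without forming it explicitly.
            w = [0.0] * d
            for x in centered:
                c = dot(x, v)
                for j in range(d):
                    w[j] += c * x[j]
            # Remove the directions already captured by earlier components.
            for u in components:
                c = dot(w, u)
                w = [wj - c * uj for wj, uj in zip(w, u)]
            norm = sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [wj / norm for wj in w]
            found = True
        if not found:
            break  # no variance left to describe
        components.append(v)
    return mean, components

def project(sample, mean, components):
    """Coordinates of one silhouette in the eigenspace."""
    centered = [s - m for s, m in zip(sample, mean)]
    return [dot(centered, u) for u in components]

def recognise(sequence, mean, components, patterns):
    """Return the motion identifier whose recorded pattern is closest."""
    path = [project(s, mean, components) for s in sequence]
    def cost(ref):
        return sum(sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
                   for p, q in zip(path, ref))
    return min(patterns, key=lambda motion: cost(patterns[motion]))

# Tiny 3x3 example regions (pixel values are arbitrary non-zero intensities).
stand = [to_silhouette([[0, 9, 0], [0, 9, 0], [0, 9, 0]])] * 2
run = [to_silhouette([[0, 9, 0], [9, 9, 9], [9, 0, 9]]),
       to_silhouette([[0, 9, 0], [9, 9, 0], [0, 9, 9]])]
mean, comps = fit_eigenspace(stand + run, k=2)
patterns = {"Stand": [project(s, mean, comps) for s in stand],
            "Run": [project(s, mean, comps) for s in run]}
label = recognise(run, mean, comps, patterns)
```

A real system would work on much larger silhouettes and would need to align sequences of different lengths; the sketch only shows the principle of comparing paths in a low-dimensional space.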

The example below depicts 20 seconds of a soccer game, showing the movements of the main players (thin line in black) and the trajectory of the ball (thick line in red). It was obtained by analyzing scenes of a video from an actual soccer game, extracting the objects, recovering the camera movement parameters and recreating the movements of each player on the playing field.

[Figure: top view]

The figure below represents the concept of objects in a time interval. (A) and (B) represent the teams, and Obj. X is the ball. The annotation for the ball's movement is that of an object without a motion identifier.

Action ::= <Action ID> <Time Interval> <Object ID> <Trajectory>
[Figure: ActionDS]
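
An "Action" record and the interpolation of positions between registered trajectory points can be sketched as below. The field names, the example values and the choice of linear interpolation are assumptions made for illustration; the description scheme itself only fixes the four components listed above.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Action:
    action_id: Optional[str]   # motion identifier, e.g. "Run"; None for the ball
    start_frame: int           # time interval, in frame numbers
    end_frame: int
    object_id: str
    trajectory: List[Tuple[int, float, float]]  # (frame, x, y), sorted by frame

    def position_at(self, frame: int) -> Tuple[float, float]:
        """Position at any frame in the interval (linear interpolation assumed)."""
        if not self.start_frame <= frame <= self.end_frame:
            raise ValueError("frame lies outside the action's time interval")
        frames = [f for f, _, _ in self.trajectory]
        i = bisect_right(frames, frame)
        if i == 0:                      # before the first registered point
            return self.trajectory[0][1:]
        if i == len(frames):            # at or after the last registered point
            return self.trajectory[-1][1:]
        f0, x0, y0 = self.trajectory[i - 1]
        f1, x1, y1 = self.trajectory[i]
        t = (frame - f0) / (f1 - f0)
        return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

# Hypothetical player action: three registered points over 60 frames.
run = Action("Run", 100, 160, "A7",
             [(100, 10.0, 20.0), (130, 16.0, 20.0), (160, 16.0, 32.0)])
halfway = run.position_at(115)

# The ball is annotated in the same way, but without a motion identifier.
ball = Action(None, 100, 160, "X", [(100, 10.0, 20.0), (160, 40.0, 20.0)])
```

Storing only a few points per interval keeps the annotation compact while still allowing a position to be queried for every frame.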

Next, an "Interaction" is built describing the meaning of a scene composed of several objects. Objects pictured in a scene can have different lifetimes and may be performing different actions, but their interaction is used to annotate a scene.

Interaction ::= <Interaction ID> <Time Interval> <Object No> <Object IDs> <Spatial Description>
"Interaction" describes events such as "pass" (passing a ball) or "goal" (scoring a goal).

Search

An "Interaction" is completely dependent on the contents of an image, and its definition changes with each type of content. However, to achieve consistency between results from different search engines, there are provisions to define an "Interaction" based on logical operations performed on multiple "Actions" and other "Interactions". For example, in a "through pass" an offensive player passes the ball to a teammate who is cutting toward the goal through the line of defensive players. This "Interaction" can be defined using the players' "Actions" and other "pass" interactions as follows:
begin
iact Through_pass t0 O0 L0                           
child_iact 1 Pass t1 O1 L1                                     
child_act 3 Stay Walk Run t2 o2  L2  
child_act 3 Stay Walk Run t3 o3  L3          
where
/* o2 and o3 are defensive players */
get_object_from_GO o4 1 O1                             
not_same_team o4 o2  
not_same_team o4 o3 
.
.
.
less_than d3 7.0
less_than d4 7.0          
fill 
t0 t1 
O0 O1
L0 L1 
end
With this description, it is possible to search consistently for a "Through_pass". The results of such a search are used to generate new descriptions as new "Interactions", so that the corresponding scenes can be searched for and retrieved thereafter.
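
How such a composite definition might be evaluated over existing annotations is sketched below. This is not the search engine's implementation: the record layout, the team convention in `team_of`, the choice of which member of the pass plays the role of o4, and above all the distance test are assumptions. The definition above elides how d3 and d4 are computed, so the sketch substitutes the distance from each defender to the location of the pass as a hypothetical stand-in.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Tuple

@dataclass
class Act:                          # simplified "Action"
    motion: str                     # e.g. "Stay", "Walk", "Run"
    interval: Tuple[int, int]       # (start frame, end frame)
    object_id: str
    location: Tuple[float, float]   # one representative position (assumption)

@dataclass
class Iact:                         # simplified "Interaction"
    name: str
    interval: Tuple[int, int]
    object_ids: List[str]
    location: Tuple[float, float]

def overlaps(a, b):
    """True if two frame intervals share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]

def team_of(object_id):
    # Hypothetical convention: the first character of the ID names the team.
    return object_id[0]

def find_through_passes(iacts, acts, max_dist=7.0):
    """Derive new "Through_pass" interactions from "Pass" interactions."""
    found = []
    for p in iacts:
        if p.name != "Pass":
            continue
        o4 = p.object_ids[0]        # assumption: one player taken from the pass
        defenders = {
            a.object_id for a in acts
            if a.motion in ("Stay", "Walk", "Run")
            and overlaps(a.interval, p.interval)
            and team_of(a.object_id) != team_of(o4)          # not_same_team
            and hypot(a.location[0] - p.location[0],
                      a.location[1] - p.location[1]) < max_dist  # stand-in for d3, d4
        }
        if len(defenders) >= 2:     # two distinct defenders, as o2 and o3
            # "fill": time, objects and location are taken over from the pass.
            found.append(Iact("Through_pass", p.interval,
                              list(p.object_ids), p.location))
    return found

passes = [Iact("Pass", (100, 130), ["A7", "A9"], (80.0, 30.0)),
          Iact("Pass", (300, 320), ["A7", "A5"], (20.0, 10.0))]
acts = [Act("Run", (90, 140), "B3", (78.0, 28.0)),
        Act("Walk", (95, 120), "B4", (83.0, 33.0)),
        Act("Stay", (290, 330), "B3", (21.0, 11.0))]
results = find_through_passes(passes, acts)
```

The derived records have the same shape as any other "Interaction", so they can be stored alongside the originals and used as building blocks in later definitions.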

The following picture shows the search screen and the results from a search. The interface is based on a web browser: it sends search queries to a video database server, retrieves the results and shows the corresponding scenes on the client side.
[Figure: search screen (Menu) and search results (Goals)]

Last modified 30 September 1999