[ This page is under construction!! ]
My research interest is in machine cognition and its applications. The objective is to advance scientific progress on enabling machines to process and learn from enormous amounts of multimodality sensing data, at a level similar to or beyond what human beings can do. To organize, realize, and extract intelligence from the huge volume of continuous multimodality streams that a person, a company, or a community receives, we consider real-world machine learning and distributed signal processing to be the keys. Real-world machine learning focuses on multimodality machine learning from imperfect and continuous sensing data. Distributed signal processing allows machines to classify large-scale data based on the learned models. Future machine cognition techniques have to be scalable, autonomous, and general enough to handle diverse types of data, such as video, text, speech, sensor signals, and human activity logs.
We are on the verge of a boom in content understanding techniques for various applications. For instance, there is an old saying: a picture is worth a thousand words. But understanding which thousand words was previously considered a far-fetched dream for machines. Only recent research progress has shown the dawn of the possibility of machine understanding of visual semantics. By integrating content understanding techniques with distributed sensor or information processing systems, we will be able to significantly reduce communication loads by making decisions at various streaming stages. By integrating content understanding with medical systems, we shall be able to provide novel hardware devices that gather information more intelligently and efficiently. The privacy, security, and information assurance requirements of systems can be better met if sensors can automatically remove or encrypt sensitive data before capture or transmission. By applying content understanding techniques to emails, behavior logs, or audio-visual sensing data together with social analysis techniques, we shall better understand people's communication behaviors and societal activities, and thus develop better technologies for enhancing human-machine and human-human interaction.

My research projects are divided into six main areas:
PART I: MULTIMODALITY LEARNING MACHINE
- Autonomous Learning
- Smart Semantic Video Camera
- Imperfect and Continuous Learning
- Multimedia Semantic Concept Analysis
PART II: MACHINE COGNITION for DISTRIBUTED SIGNAL PROCESSING
PART III: MACHINE COGNITION for MULTIMEDIA SECURITY
- Multimedia Authentication
- Robust Digital Signature and Self-Authentication-and-Recovery Image Systems
- Robust Watermarks surviving Rotation, Scaling, Translation and Print-and-Scan Processes
- Theoretical Data Hiding Capacity Bounds for Images
PART IV: MACHINE COGNITION for MULTIMEDIA INFORMATION MANAGEMENT
- Multimedia Semantic Adaptation and Summarization
- Multimedia Indexing and Retrieval
PART V: MACHINE COGNITION for SOCIAL INFORMATICS
- Expertise Mining
- Community Mining
- Personalized Recommendation
PART VI: MACHINE COGNITION for LIFE SCIENCE APPLICATIONS
- Multimodality Sensors for Sleep Activity Monitoring
- Contextual Environment Monitoring
PART I: LEARNING MACHINES
We are investigating the theory, algorithm, and system issues in constructing automatic cognitive learning machines. With the recent success of machine learning algorithms, much traditional thinking about system design may be re-examined via machine learning approaches. Under a machine learning infrastructure, system designers no longer assign rules; instead, they design algorithms that allow systems to learn to solve problems by themselves. For instance, this approach has significantly increased the number of concepts that machines can understand. Previously, researchers worked for decades to model a few concepts, e.g., face, car, people. With machine learning approaches, however, the number of concept detectors has increased to the range of hundreds, with accuracy competitive with or better than prior methods. To increase machines' problem-solving capability, we propose to create a new kind of learning system. These learning machines automatically capture data from multimodality sensing sources (audio, visual, text, etc.), execute recognition, and then learn to infer more complicated concepts and knowledge. The learning system can also draw on existing knowledge bases, e.g., an electronic dictionary, to extend and connect concepts.
(1) Autonomous Learning (with Xiaodan Song and Ming-Ting Sun)
(2) Smart Semantic Video Camera (with Victor Sutan and Jason Cardillo)
(3) Imperfect and Continuous Learning (with Xiaodan Song, Panda Navneet and Gang Wu)
(4) Multimedia Semantic Analysis
Multimedia Semantic Concept Analysis
My research objectives are:
(1) Object Detection - Robust and accurate detection, location, and counting of specific objects
(2) Object Recognition - Determine the specific instance of an object class (e.g., person)
(3) Event Understanding - Form inferences from occurrences or recurrences of activity
(4) Multi-Modal Fusion - Combine multiple sources to maximize the salient information that can be extracted from the video
(5) Video Query by Example - Retrieve information through database inquiry using a sequence of video or content descriptors
(6) Video Summary - Methods to reduce information representation and scenario-based activity summarization
(7) Multi-Modal Video Mining - Automatically discover trends, patterns, and associations in video
(8) Object Tracking - Determine the path of a known object within a video sequence
(9) Motion Analysis - Quantify the movement of objects or phenomena in a video sequence
(10) Kinematics Analysis - Identify an object or phenomenon by its motion
We had the following objectives: addressing the challenging problem of fully automatic indexing and retrieval of unstructured video content; engaging the research and industry community in establishing a benchmark for video content retrieval; participating in the benchmark and leveraging it to advance video content retrieval technology; and establishing IBM Research as a premier thought leader in multimedia indexing and semantic understanding.
Our effort resulted in the following accomplishments. First, we helped form the TREC video retrieval benchmark and its tasks, and we have participated in the benchmark since its establishment in 2001. We took the leadership role in establishing the "concept detection" task within the TREC video retrieval benchmark: IBM proposed the idea to NIST in Nov. 2001 and followed through by leading the effort to design the benchmark and test methodology and to choose the concepts for detection. We also took the leadership role in establishing the "MPEG-7 concept/transcript/shot exchange" task of the TREC-2002 benchmark, with the goal of accelerating technological advancement by allowing different participants to focus on different aspects of multimedia indexing problems. In 2003, we initiated and organized a collaborative video annotation forum, in which we worked jointly with colleagues in 23 groups to build ground-truth labels on 62 hours of video. Nearly 500K labels (after hierarchical propagation) were annotated on 45K shots. These ground-truth labels have been widely used for video semantic concept training and system evaluation.
(Collaborators: Belle L. Tseng, Milind Naphade, Apostol Natsev, John R. Smith)
PART II: MACHINE COGNITION for DISTRIBUTED SIGNAL PROCESSING
Cognition is a form of compression. In the past decades, the main thrust of communications has evolved from one-to-many broadcasting to many-to-many peer-to-peer communications. We are now seeing a trend in which many-to-one communication is becoming an important driving force for the next generation of the information technology industry. In the era of information overflow, technology users demand machines that help them interpret the signals they receive. While the fidelity of transmitted signals and the network infrastructure remain necessary, consumers are becoming keenly interested in the content of communication data. A technology that delivers the needed information to consumers can stand out as the winner. We believe machine cognition technology has to be integrated into communication systems to help users filter the data and thus significantly reduce the communication load.
Since joining the Distributed Computing department in summer 2004, I have been working on a novel video semantic routing/filtering and sensor network system. We propose a novel mechanism to reduce transmission loads based on semantic user profiles. In other words, the system transmits only those video shots or stories that are of interest to the end users. Figure 1 shows an example of concept filtering and semantic routing for a large-scale video streaming system. In the system, we deploy concept filters hierarchically based on semantic trees. For instance, if an end user is interested in basketball clips, the processing elements first filter out all shots that are not sport events and then classify video packets into baseball, basketball, hockey, soccer, tennis, etc. With this semantic routing structure, the processing load on each node is reduced, which lets the overall system scale to large streaming environments. We also use this distributed video semantic recognition framework to understand and manage visual information from visual sensor networks. Our work focuses mainly on scalability and real-time processing issues.
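The hierarchical filtering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the deployed system: the `ConceptNode` class, the `route` and `deliver` functions, the 0.5 threshold, and the representation of a shot as a dictionary of precomputed detector confidences are all hypothetical stand-ins for real classifiers running on distributed processing elements.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Hypothetical shot representation: concept name -> detector confidence.
Shot = Dict[str, float]


@dataclass
class ConceptNode:
    """One concept filter in the semantic tree (hypothetical structure)."""
    name: str
    threshold: float = 0.5  # assumed decision threshold
    children: List["ConceptNode"] = field(default_factory=list)

    def accepts(self, shot: Shot) -> bool:
        # Stand-in for a real classifier: read a precomputed confidence.
        return shot.get(self.name, 0.0) >= self.threshold


def route(shot: Shot, node: ConceptNode, path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every root-to-leaf concept path whose filters all accept the shot.

    A shot rejected by a node is never passed to that node's children,
    so the child filters do no work for it.
    """
    if not node.accepts(shot):
        return []
    path = path + (node.name,)
    if not node.children:
        return [path]
    matches: List[Tuple[str, ...]] = []
    for child in node.children:
        matches.extend(route(shot, child, path))
    return matches


def deliver(shot: Shot, root: ConceptNode, interests: Set[str]) -> bool:
    """Forward the shot only if a matched leaf concept is in the user profile."""
    return any(p[-1] in interests for p in route(shot, root))


sports = ConceptNode(
    "sport-event",
    children=[ConceptNode(n) for n in ("baseball", "basketball", "hockey", "soccer", "tennis")],
)
game_shot = {"sport-event": 0.9, "basketball": 0.8, "tennis": 0.1}
news_shot = {"sport-event": 0.2, "basketball": 0.9}  # rejected at the root

print(route(game_shot, sports))
print(deliver(game_shot, sports, {"basketball"}), deliver(news_shot, sports, {"basketball"}))
```

In this sketch the second shot never reaches the five sport-specific filters, which is the source of the per-node savings described above.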
We propose to use complexity-accuracy curves to optimally choose operating points in this semantic routing scenario. We also propose a set of novel video features that yield better performance, in both speed and accuracy, than our previous generic video concept classifiers. We have built one hundred concept classification filters. Experiments on 154 hours of video streams validated the effectiveness of the proposed system.
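As a rough illustration of choosing an operating point from a complexity-accuracy curve, the sketch below picks the most accurate point that fits a per-node compute budget. The selection criterion, the function name `pick_operating_point`, and the curve values are assumptions made for this example; the text above does not specify how the optimum is defined.

```python
from typing import List, Optional, Tuple

# (complexity, accuracy); the units and the numbers below are made up.
Point = Tuple[float, float]


def pick_operating_point(curve: List[Point], budget: float) -> Optional[Point]:
    """Return the most accurate point whose complexity fits the budget.

    Ties on accuracy are broken in favor of the cheaper point.
    Returns None when no point fits.
    """
    feasible = [p for p in curve if p[0] <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p[1], -p[0]))


# Hypothetical curve for one concept filter.
curve = [(1.0, 0.60), (2.0, 0.72), (4.0, 0.80), (8.0, 0.82)]

print(pick_operating_point(curve, budget=5.0))
print(pick_operating_point(curve, budget=0.5))
```

With a budget of 5.0 the filter runs at the (4.0, 0.80) point; doubling the cost to reach 0.82 is not allowed under that budget.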
(Collaborators: Lisa Amini, Olivier Verschure, Anshul Sehgal)
PART III: MULTIMEDIA SECURITY
My research on multimedia security consists of three parts: multimedia authentication, copyright protection, and theoretical information hiding capacity. Multimedia authentication techniques are required to ensure the trustworthiness of multimedia data. Their objective is to detect or prevent tampering with the integrity of video content at either the syntactic or the semantic level. We proposed a robust digital signature technique that can unambiguously distinguish some content-preserving manipulations from malicious tampering. We also proposed a unique Self-Authentication-and-Recovery Image (SARI) system, which produces smart images that help locate the changed regions and recover from the manipulations.
Watermarking is a promising solution that can protect the copyright of multimedia data through transcoding. A reasonable way to apply watermarking to copyright protection is to consider specific application scenarios, because the distortion behavior involved in these cases (geometric distortion and pixel value distortion) is reasonably predictable. We proposed a practical public watermarking algorithm that is robust to rotation, scaling, and/or translation (RST) distortion. It plays an important role in our design of the first watermarking technique that survives the image print-and-scan process.
In addition, we examined an important issue: the maximum amount of watermark information that can be embedded without causing noticeable perceptual degradation. This is original work analyzing the theoretical watermarking capacity bounds for digital images, based on information theory and the characteristics of the human vision system.
The well-known adage that “seeing is believing” no longer holds because of pervasive and powerful multimedia manipulation tools. This development has decreased the credibility that multimedia data such as photos, video or audio clips, and printed documents used to command.
We first proposed a robust digital signature technique, which substitutes a content-preserving robust visual hash for the hash function in the digital signature. We studied the methodologies behind multimedia compression techniques and then developed several theories that guided the design of a visual hash based on unambiguous invariant signal properties of multimedia data across various compressions. Because the theories hold up in both the theoretical domain and the practical system implementation domain, the proposed system has proved successful at distinguishing content-preserving standard compressions from malicious manipulations without error.
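To give a feel for what an invariant across compressions can look like, the sketch below checks one simple candidate: when two coefficients are quantized with the same step size, quantization can collapse their order to equality but can never reverse it. The text above does not name the invariant used in the signature, so treating pairwise coefficient ordering as the hashed property is an assumption of this example, and `quantize`, `relation_bit`, and `consistent_after_quantization` are hypothetical names.

```python
import random


def quantize(coeff: float, step: int) -> int:
    """Uniform quantization followed by reconstruction, as in JPEG-style coding."""
    return round(coeff / step) * step


def relation_bit(a: float, b: float) -> int:
    """One hypothetical hash bit: 1 if a >= b, else 0."""
    return 1 if a >= b else 0


def consistent_after_quantization(a: float, b: float, step: int) -> bool:
    """Check that quantization with a shared step never strictly reverses the order."""
    qa, qb = quantize(a, step), quantize(b, step)
    if a >= b:
        return qa >= qb
    return qa <= qb


rng = random.Random(0)
trials = [
    (rng.uniform(-255, 255), rng.uniform(-255, 255), rng.randint(1, 64))
    for _ in range(10_000)
]
print(all(consistent_after_quantization(a, b, s) for a, b, s in trials))

# A content change that swaps the two values flips the recorded bit.
print(relation_bit(40.0, -12.0), relation_bit(-12.0, 40.0))
```

Because rounding is monotone, the check holds for every pair and step size, so a verifier that tolerates equality can accept recompressed content while still flagging changes that reverse a recorded relationship.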
We then proposed a unique Self-Authentication-and-Recovery Image (SARI) system. SARI uses a novel semi-fragile watermarking technique that accepts JPEG lossy compression of the watermarked image down to a pre-determined quality factor and rejects malicious attacks. The authenticator can identify the positions of corrupted blocks and recover them with approximations of the original ones. In addition to JPEG compression, the proposed authenticator also accepts adjustments of image brightness within reasonable ranges. The security of the proposed method is achieved by a secret block mapping function that controls the signature generation and embedding processes.
Here is an example:
![[SARI Example]](sarifigure-cylin.jpg)
(Collaborators: Shih-Fu Chang)
Watermarking has been considered a promising solution that can protect the copyright of multimedia data through transcoding, because the embedded message is always included in the data. We first proposed a public watermarking technique that is invariant to geometric distortions. Our method does not embed an additional registration pattern or embed the watermark in a recognizable structure, so there is no need to identify and invert them. In particular, we are concerned with distortions due to rotation, scale, and/or translation (RST).
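One standard building block for this kind of invariance is that the magnitude of the discrete Fourier transform does not change under a circular shift of the signal. The sketch below demonstrates this on a short one-dimensional signal using only the standard library. It is an illustration of the translation case only, not the proposed algorithm: rotation and scaling need further steps that are not shown, and the helper names and sample values are hypothetical.

```python
import cmath
from typing import List


def dft_magnitude(signal: List[float]) -> List[float]:
    """Magnitude of the discrete Fourier transform, computed directly (O(n^2))."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def circular_shift(signal: List[float], offset: int) -> List[float]:
    """Shift samples to the right by `offset`, wrapping around the end."""
    offset %= len(signal)
    return signal[-offset:] + signal[:-offset]


row = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]  # made-up sample values
shifted = circular_shift(row, 3)

difference = max(abs(a - b) for a, b in zip(dft_magnitude(row), dft_magnitude(shifted)))
print(difference < 1e-9)
```

The shift only multiplies each DFT coefficient by a unit-magnitude phase factor, which is why the magnitudes agree up to floating-point error. A translation that crops or pads the image is not circular, so this property alone does not cover it.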
We then proposed a hypothetical model of the pixel value distortions. To our knowledge, in 1999 there was no appropriate model in the literature describing the pixel value distortions in the print-and-scan (PS) process. Therefore, we proposed a hypothetical model based on our experiments and related literature. Although more experiments are needed to verify its validity, we found the model appropriate in our experiments using different printers and scanners, as it reproduces several characteristics of rescanned images. We also note that, in general image editing processes, geometric distortion cannot be adequately modeled by the well-known rotation, scaling, and translation (RST) effects because of the effect of cropping. In the PS process, the scanned image may cover part of the original picture and/or part of the background, and may have an arbitrarily cropped size. These changes, especially the change in image size, introduce significant changes in the DFT coefficients. Thus, we analyzed the geometric distortion in the PS process and then focused on the changes in DFT coefficients for invariant extraction. Based on these adjustments, we applied our proposed technique to watermarking design for protecting copyright information through the image print-and-scan process.
(Collaborators: Jeffrey Bloom, Min Wu, Ingemar Cox, Matthew Miller, Yuiman Rui, Shih-Fu Chang)
In addition, we study the theoretical issue of the watermark embedding space that exists in multimedia data. This space should depend on the properties of the human audio-visual system. It is a complex scientific question for which we may not be able to find a thorough answer in this thesis. Our objective is to study existing human vision system models, achieve a better understanding of the various watermarking spaces, and then develop information-theoretic estimates of the information capacity available via watermarking. We investigate watermarking capacity in three directions: the zero-error capacity for public watermarking in magnitude-bounded noisy environments, the watermarking capacity based on domain-specific masking effects, and the watermarking capacity issues based on sophisticated Human Vision System models.
First, we investigated the watermarking capacity based on content-independent constraints on the magnitudes of watermarks and noises. We showed that, when the noise magnitudes are constrained, a capacity bound with “deterministic” zero error can actually be achieved. That is, in an environment with finite states and bounded noises, the transmission error can be exactly zero, instead of approaching zero as contemplated in Shannon's channel capacity theory. Specifically, we found the zero-error capacity for private and public watermarking in a magnitude-bounded noisy environment. For example, assuming the added noise is due to quantization (as in JPEG), we can calculate the zero-error capacity from the magnitude constraints set on the watermark and the noise. Note that we treat all pixels, watermarks, and noises as discrete values, as occurs in realistic cases. Second, we found the watermarking capacity based on domain-specific masking effects. We showed the capacity of private watermarking in which the power constraints are not uniform. Then, we applied several domain-specific HVS approximation models to estimate the power constraints and showed the theoretical watermarking capacity of an image in a general noisy environment. Third, we examined watermarking capacity issues based on actual Human Vision System models. We described in detail the most sophisticated Human Vision System models, developed by Daly and Lubin. Then, we discussed issues and possible directions in applying these models to the estimation of watermarking capacity.
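The zero-error idea can be illustrated with a toy one-coefficient case. Assume, for this example only, a private-watermarking setting in which the decoder knows the original value, the watermark is an integer in [-W, W], and the noise is an integer in [-N, N]. Two watermark values can be confused only if they differ by at most 2N, so values spaced 2N + 1 apart are always decodable without error. The sketch counts such values and checks the closed form by exhaustive search; it is a simplified illustration, not the capacity bound derived in the work described above, and the function names are hypothetical.

```python
import math
from itertools import combinations


def zero_error_levels(w_max: int, n_max: int) -> int:
    """Count integer watermark values in [-w_max, w_max] that remain
    distinguishable under any integer noise in [-n_max, n_max]."""
    return (2 * w_max) // (2 * n_max + 1) + 1


def zero_error_bits(w_max: int, n_max: int) -> float:
    """Bits carried by one coefficient with guaranteed error-free decoding."""
    return math.log2(zero_error_levels(w_max, n_max))


def brute_force_levels(w_max: int, n_max: int) -> int:
    """Largest set of values with pairwise distance > 2 * n_max, by exhaustive search."""
    values = range(-w_max, w_max + 1)
    for size in range(len(values), 0, -1):
        # combinations() yields sorted tuples, so checking neighbors suffices.
        for subset in combinations(values, size):
            if all(b - a > 2 * n_max for a, b in zip(subset, subset[1:])):
                return size
    return 0


print(zero_error_levels(4, 1), round(zero_error_bits(4, 1), 3))
print(all(
    zero_error_levels(w, n) == brute_force_levels(w, n)
    for w in range(5) for n in range(4)
))
```

With W = 4 and N = 1, the values -4, -1, and 2 can never be confused, giving three levels, or about 1.585 bits, per coefficient with exactly zero error.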
(Collaborators: Shih-Fu Chang)
PART IV: MULTIMEDIA INFORMATION MANAGEMENT
My main focus in multimedia semantic analysis research is on automatic content learning, detection, and recognition technologies for video data sources, which include video scenes of various indoor and outdoor activities involving people, meetings, and vehicles, as well as TV news broadcasts. For applications, our goal is to achieve (1) significant improvement in indexing and retrieval performance for video data; (2) autonomous video understanding; (3) ancillary improvement for still image processing; (4) enabling technologies for video data mining, filtering, and selection; and (5) a drastic reduction in volume for video storage. My research applies to both commercial uses in multimedia data management and security uses in intelligence analysis.
To achieve these application objectives, we are investigating mechanisms for fully automatic video indexing based on image, text, and audio content, which provide low-cost video corpus marking and preparation. We are also developing methods for robust person and text detection and recognition, and for event detection, recognition, and understanding. Applying these semantic analysis techniques, we aim to develop efficient methods for representing content and to advance cross-media content search and extraction.
With the growing amount of multimedia content, people are becoming more willing to view personalized multimedia based on their usage environments. When people use their pervasive devices, they generally restrict their viewing time on the limited displays and minimize the amount of interaction and navigation needed to get to the content. When they browse video on the Internet, they may want only the videos that match their preferences. Because of the heterogeneity of user clients and data sources, it is a real challenge to implement a universally compliant system that fits various usage environments. We proposed a video personalization system that comprises three major components: the user client, the database server, and the media middleware. The middleware framework is powered by a personalization engine and an adaptation engine that optimally produce video summaries based on the MPEG-7 metadata descriptions, MPEG-21 rights expressions, and content adaptability declarations in the server database, together with the MPEG-7 user preferences, MPEG-21 usage environments, and user queries at the client devices.
The major component tools in this system include VideoSue, VideoEd, VideoAnnEx, and Universal Tuner.
(Collaborators: Belle L. Tseng, John Smith)
PART V: SOCIAL INFORMATICS
In the information age, what matters most is not what you know, but who you know. A social network, the graph of relationships and interactions within a group of individuals, plays a fundamental role as a medium for the spread of information, ideas, and influence. At the organizational level, personal social networks are activated for recruitment, partnering, and information access. At the individual level, people exploit their networks to advance careers and gather information. In addition, multimodality sensors combined with speaker and face recognition techniques can help capture people's daily social activities. These are all important aspects of human interest and behavior modeling, and thus shall have a significant impact on personalized services.
(1) ExpertiseNet (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)
(2) CommunityNet (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)
(3) Community-based Dynamic Recommendation (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)
PART VI: LIFE SCIENCE APPLICATIONS
As the percentage of the elderly population continues to rise, advanced health care systems have become an active subject of many research projects. The common goal of these projects is to assist people in different aspects through various methods so that the quality and length of their lives can be improved. Health care applications often have a critical need to recognize a person's physical location, condition, and activity. With knowledge of a user's contextual information, a personal assistant/recommender or warning system can be developed to recommend activities for improving long-term health conditions, raise alarms in emergency situations, and monitor long-term health changes via early-warning detection. Various sensors can be deployed to acquire multimodality data for recognizing a user's contextual information. Once sensory personal information is acquired, the challenge is how to automatically perform semantic, contextual, and ontological analyses and data fusion in order to reliably support high-quality recommendation, detection, and decision-making.
(1) Multimodality Sleep Activity Detector (with Ya-Ti Peng and Ming-Ting Sun)
(2) Contextual Environmental Inference (with Ya-Ti Peng and Ming-Ting Sun)
Last Updated: 10/21/2005