Ching-Yung Lin -- Projects

[ This page is under construction!! ]

My research interest is Machine Cognition and its applications. The objective is to advance the ability of machines to process and learn from enormous amounts of multimodality sensing data, at a level similar to or beyond what human beings can do. To organize and extract intelligence from the huge volume of continuous multimodality streams that a person, a company, or a community receives, we consider real-world machine learning and distributed signal processing to be the keys. Real-world machine learning focuses on multimodality machine learning from imperfect and continuous sensing data. Distributed signal processing allows machines to classify large-scale data based on the learned models. Future machine cognition techniques have to be scalable, autonomous, and general enough to handle diverse types of data, such as video, text, speech, sensor signals, and human activity logs.


We are on the verge of a boom in content understanding techniques for various applications. For instance, there is an old saying: a picture is worth a thousand words. Understanding which thousand words, however, was previously considered a far-fetched dream for machines. Only recently has research progress shown that machine understanding of visual semantics may be possible. By integrating content understanding techniques with distributed sensor or information processing systems, we will be able to significantly reduce communication loads by making decisions at various streaming stages. By integrating content understanding with medical systems, we should be able to provide novel hardware devices that gather information more intelligently and efficiently. The privacy, security and information assurance requirements of systems can be better met if sensors can automatically remove or encrypt sensitive data before capture or transmission. By applying content understanding techniques to emails, behavior logs, or audio-visual sensing data, together with social analysis techniques, we should better understand people's communication behaviors and societal activities, and thus develop better technologies for enhancing human-machine and human-human interaction.

Machine Cognition


My research projects are divided into six main areas:


PART I: MULTIMODALITY LEARNING MACHINE

  • Autonomous Learning
  • Smart Semantic Video Camera
  • Imperfect and Continuous Learning
  • Multimedia Semantic Concept Analysis

PART II: MACHINE COGNITION for DISTRIBUTED SIGNAL PROCESSING

  • Large-Scale Video Semantic Concept Filtering and Sensor Network System

  • Large-Scale Email Topic Filtering
  • Social Network Construction for Radio Monitoring

PART III: MACHINE COGNITION for MULTIMEDIA SECURITY

  • Multimedia Authentication

  • Robust Digital Signature and Self-Authentication-and-Recovery Image Systems

  • Robust Watermarks surviving Rotation, Scaling, Translation and Print-and-Scan Processes

  • Theoretical Data Hiding Capacity Bounds for Images

PART IV: MACHINE COGNITION for MULTIMEDIA INFORMATION MANAGEMENT

  • Multimedia Semantic Adaptation and Summarization
  • Multimedia Indexing and Retrieval

PART V: MACHINE COGNITION for SOCIAL INFORMATICS

  • Expertise Mining
  • Community Mining
  • Personalized Recommendation

PART VI: MACHINE COGNITION for LIFE SCIENCE APPLICATIONS

  • Multimodality Sensors for Sleep Activity Monitoring
  • Contextual Environment Monitoring


PART I: LEARNING MACHINES

 We are investigating the theory, algorithm and system issues involved in constructing an automatic cognitive learning machine. With the recent success of machine learning algorithms, much traditional thinking about system design can be re-examined through machine learning approaches. Under a machine learning infrastructure, system designers no longer assign rules; instead, they design algorithms that allow systems to learn to solve problems by themselves. For instance, this approach has significantly increased the number of concepts that machines can understand. Previously, researchers worked for decades to model a few concepts, e.g., face, car, people, etc. With machine learning approaches, however, the number of concept detectors has increased to the range of hundreds, with accuracy competitive with or better than prior methods. To increase machines' problem-solving capability, we propose to create a new kind of learning system. These learning machines automatically capture data from multimodality sensing sources (audio, visual, text, etc.), execute recognition, and then learn to infer more complicated concepts and knowledge. The learning system can also draw on existing knowledge bases, e.g., an electronic dictionary, to extend and connect concepts.


(1) Autonomous Learning   (with Xiaodan Song and Ming-Ting Sun)


(2) Smart Semantic Video Camera  (with Victor Sutan and Jason Cardillo)


(3) Imperfect and Continuous Learning (with Xiaodan Song, Panda Navneet and Gang Wu)


(4) Multimedia Semantic Analysis


[VideoAL Mark]  Multimedia Semantic Concept Analysis

 My research objectives are:
 (1) Object Detection - robust and accurate detection, location and counting of specific objects;
 (2) Object Recognition - determine the specific instance of an object class (e.g. person);
 (3) Event Understanding - form inferences from occurrences or reoccurrences of activity;
 (4) Multi-Modal Fusion - combine multiple sources to maximize the salient information that can be extracted from the video;
 (5) Video Query by Example - retrieve information through database inquiry using a sequence of video or content descriptors;
 (6) Video Summary - methods to reduce information representation, and scenario-based activity summarization;
 (7) Multi-Modal Video Mining - automatically discover trends, patterns, and associations in video;
 (8) Object Tracking - determine the path of a known object within a video sequence;
 (9) Motion Analysis - quantify the movement of objects or phenomena in a video sequence; and
 (10) Kinematics Analysis - identify an object or phenomenon by its motion.

 Our objectives were to address the challenging problem of fully automatic indexing and retrieval of unstructured video content, to engage the research and industry community in establishing a benchmark for video content retrieval, to participate in the benchmark and leverage it to advance video content retrieval technology, and to establish IBM Research as a premier thought leader in multimedia indexing and semantic understanding.

 Our effort resulted in the following accomplishments. First, we helped form the TREC video retrieval benchmark and its tasks, and have participated in the benchmark since its establishment in 2001. We took the leadership role in establishing the "concept detection" task within the benchmark: IBM proposed the idea to NIST in Nov. 2001 and followed through by leading the effort to design the benchmark and test methodology and to choose the concepts for detection. We also took the leadership role in establishing the "MPEG-7 concept/transcript/shot exchange" task of the TREC-2002 benchmark, with the goal of accelerating technological advancement by allowing different participants to focus on different aspects of multimedia indexing problems. In 2003, we initiated and organized a collaborative video annotation forum, in which we worked jointly with colleagues in 23 groups to build ground-truth labels on 62 hours of video. Nearly 500K labels (after hierarchical propagation) were annotated on 45K shots. These ground-truth labels have been widely used for video semantic concept training and system evaluation.


(Collaborators: Belle L. Tseng, Milind Naphade, Apostol Natsev, John R. Smith)





PART II: MACHINE COGNITION for DISTRIBUTED SIGNAL PROCESSING


Cognition is a form of compression. Over the past decades, the main thrust of communications has evolved from one-to-many broadcasting to many-to-many peer-to-peer communications. We are now seeing many-to-one communications become an important driving force for the next generation of the information technology industry. In the era of information overflow, users demand machines that help them interpret the signals they receive. While signal fidelity and network infrastructure remain necessary, consumers increasingly care about the content of communication data. A technology that delivers the needed information to consumers can stand out as the winner. We believe machine cognition technology has to be integrated into communication systems to help users filter data and thus significantly reduce the communication load.



[VideoDIG Mark]  Large-Scale Video Semantic Routing and Sensor Network System


In summer 2004, I joined the Distributed Computing department, and since then I have been working on a novel video semantic routing/filtering and sensor network system. We propose a novel mechanism that reduces transmission load based on semantic user profiles. In other words, the system transmits only those video shots or stories that are of interest to the end users. Figure 1 shows an example of concept filtering and semantic routing for a large-scale video streaming system. In the system, we deploy concept filters hierarchically, based on semantic trees. For instance, if an end user is interested in basketball clips, the processing elements first filter out all shots that are not sports events, and then classify the remaining video packets into baseball, basketball, hockey, soccer, tennis, etc. With this semantic routing structure, the processing load on each node is reduced, which lets the overall system scale to large streaming environments. We also use this distributed video semantic recognition framework to understand and manage visual information from visual sensor networks. Our work focuses mainly on scalability and real-time processing issues.
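The hierarchical filtering idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the deployed system: the `ConceptNode` class, the `route` function, the 0.5 threshold, and the representation of a shot as a dictionary of per-concept detector scores are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical representation: a shot is a map from concept name to a
# detector score in [0, 1]. Real concept filters would compute these scores.
Shot = Dict[str, float]


@dataclass
class ConceptNode:
    """One concept filter in a semantic tree (illustrative only)."""
    name: str
    threshold: float = 0.5
    children: List["ConceptNode"] = field(default_factory=list)

    def accepts(self, shot: Shot) -> bool:
        return shot.get(self.name, 0.0) >= self.threshold


def route(shot: Shot, node: ConceptNode, path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return the concept paths a shot matches.

    A shot rejected by a parent filter is never passed to the child
    filters, which is where the per-node processing savings come from.
    """
    if not node.accepts(shot):
        return []
    path = path + (node.name,)
    matches: List[Tuple[str, ...]] = []
    for child in node.children:
        matches.extend(route(shot, child, path))
    # If no child matched, the shot still matches the parent concept.
    return matches or [path]


sports = ConceptNode("sport", children=[
    ConceptNode("baseball"), ConceptNode("basketball"), ConceptNode("tennis"),
])
clip = {"sport": 0.9, "basketball": 0.8, "baseball": 0.1}
print(route(clip, sports))
```

A shot that fails the top-level "sport" filter returns an empty list without the sport-specific classifiers ever being evaluated.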

We propose to use complexity-accuracy curves to choose optimal operating points in this semantic routing scenario. We also propose a set of novel video features that perform better, in terms of both speed and accuracy, than our previous generic video concept classifiers. We have built one hundred concept classification filters. Experiments on 154 hours of video streams validated the effectiveness of the proposed system.
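As a rough illustration of choosing an operating point on a complexity-accuracy curve, the sketch below picks the most accurate classifier configuration whose cost fits a per-node budget. The function name, the (cost, accuracy) representation, and the sample numbers are assumptions made for this example; they are not measurements from the system.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (processing cost, accuracy), both hypothetical units


def pick_operating_point(curve: List[Point], budget: float) -> Optional[Point]:
    """Return the most accurate point with cost <= budget, or None if none fits.

    Ties in accuracy are broken in favor of the cheaper point.
    """
    feasible = [p for p in curve if p[0] <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p[1], -p[0]))


# Made-up curve: accuracy rises with cost, with diminishing returns.
example_curve = [(1.0, 0.60), (2.0, 0.72), (4.0, 0.80), (8.0, 0.82)]
print(pick_operating_point(example_curve, budget=5.0))
```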

(Collaborators: Lisa Amini, Olivier Verschure, Anshul Sehgal)

 


PART III: MULTIMEDIA SECURITY

  My research on multimedia security consists of three parts: multimedia authentication, copyright protection, and theoretical information hiding capacity.

Multimedia authentication techniques are required to ensure the trustworthiness of multimedia data. Their objective is to detect or prevent tampering with the integrity of video content at either the syntactic level or the semantic level. We proposed a robust digital signature technique that can unambiguously distinguish some content-preserving manipulations from malicious tampering. We also proposed a unique Self-Authentication-and-Recovery Image (SARI) system, which produces smart images that help detect where an image was changed and recover from the manipulation.

  Watermarking is a promising solution for protecting the copyright of multimedia data through transcoding. A reasonable way to apply watermarking techniques to copyright protection is to consider specific application scenarios, because the distortions involved in these cases (geometric distortion and pixel value distortion) are reasonably predictable. We proposed a practical public watermarking algorithm that is robust to rotation, scaling, and/or translation (RST) distortion. It plays an important role in our design of the first watermarking technique that survives the image print-and-scan process.

   In addition, we examined an important issue: the maximum amount of watermark information that can be embedded without causing noticeable perceptual degradation. This is original work analyzing the theoretical watermarking capacity bounds for digital images, based on information theory and the characteristics of the human vision system.

[SARI Mark] 

Robust Digital Signature and Self-Authentication-and-Recovery Image (SARI) System

 The well-known adage that “seeing is believing” no longer holds, because of pervasive and powerful multimedia manipulation tools. This development has decreased the credibility that multimedia data such as photos, video or audio clips, and printed documents used to command.

 We first proposed a robust digital signature technique that substitutes a content-preserving robust visual hash for the hash function in the digital signature. We studied the methodologies behind multimedia compression techniques and then developed several theories that guided the design of a visual hash based on unambiguous invariant signal properties of multimedia data across various compressions. Because the theories hold up in both the theoretical domain and the practical system implementation domain, the proposed system has proven successful in achieving error-free distinction between content-preserving standard compressions and malicious manipulations.
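One simple example of a signal property that survives compression is the ordering of two coefficients under quantization with the same step size: because rounding is monotonic, the sign of their difference can shrink to zero but never flips. The sketch below illustrates only this general property. The text above does not say which invariants the signature actually uses, so the choice of this property, and all function names, are assumptions.

```python
import random


def quantize(coeff: float, step: int) -> int:
    """Uniform quantization followed by reconstruction (JPEG-style)."""
    return step * round(coeff / step)


def signature_bit(a: float, b: float) -> int:
    """Hypothetical one-bit feature: 1 if a >= b, else 0."""
    return 1 if a >= b else 0


def consistent(bit: int, qa: int, qb: int) -> bool:
    """Check a stored bit against coefficients seen after compression.

    Equality is always allowed, since quantization may merge two values.
    """
    return qa >= qb if bit == 1 else qa <= qb


def survives_quantization(trials: int = 10000, seed: int = 0) -> bool:
    """Empirically confirm that the bit is never contradicted by quantization."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.uniform(-255, 255), rng.uniform(-255, 255)
        step = rng.randint(1, 32)
        if not consistent(signature_bit(a, b), quantize(a, step), quantize(b, step)):
            return False
    return True


print(survives_quantization())
# A manipulation that reverses the relationship contradicts the stored bit.
print(consistent(signature_bit(50, 10), quantize(-30, 8), quantize(10, 8)))
```

Requantization at any step size leaves every stored bit consistent, while an edit that reverses the ordering of the two coefficients is flagged.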

 We then proposed a unique Self-Authentication-and-Recovery Image (SARI) system. SARI uses a novel semi-fragile watermarking technique that accepts JPEG lossy compression of the watermarked image down to a pre-determined quality factor, and rejects malicious attacks. The authenticator can identify the positions of corrupted blocks and recover them with approximations of the original ones. In addition to JPEG compression, the authenticator also accepts adjustments of the image brightness within reasonable ranges. The security of the method comes from a secret block mapping function that controls the signature generation and embedding processes.

Here is an example:

[SARI Example]

(Collaborators: Shih-Fu Chang)

 

[Robust Watermark Mark] Robust Watermarks Surviving Rotation, Scaling, Cropping, and Image Print-and-Scan Process

 Watermarking has been considered to be a promising solution that can protect the copyright of multimedia data through transcoding, because the embedded message is always included in the data. We first proposed a public watermark technique that is invariant to geometric distortions. Our method does not embed an additional registration pattern or embed watermark in a recognizable structure, so there is no need to identify and invert them. In particular, we are concerned with distortions due to rotation, scale and/or translation (RST).

 We then proposed a hypothetical model of the pixel value distortions. To our knowledge, in 1999 there was no appropriate model in the literature to describe the pixel value distortions in the print-and-scan (PS) process. Therefore, we proposed a hypothetical model based on our experiments and some related literature. Although more experiments are needed to verify its validity, we found the model appropriate in our experiments using different printers and scanners, as it reproduces several characteristics of rescanned images. We also note that, in general image editing processes, geometric distortion cannot be adequately modeled by the well-known rotation, scaling, and translation (RST) effects, because of the effect of cropping. In the PS process, the scanned image may cover part of the original picture and/or part of the background, and may have an arbitrarily cropped size. These changes, especially the change in image size, introduce significant changes in the DFT coefficients. We therefore analyzed the geometric distortion in the PS process and then focused on the changes in DFT coefficients for invariant extraction. Based on these adjustments, we applied our proposed technique to watermark design for protecting copyright information through the image print-and-scan process.
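The contrast between translation and cropping can be seen on a one-dimensional toy signal: the DFT magnitudes are unchanged by a circular shift but change once samples are cropped away. This is a minimal sketch of that standard DFT property only, written in plain Python with hypothetical helper names; it is not the watermarking algorithm itself and does not cover rotation or scaling.

```python
import cmath
from typing import List


def dft_magnitudes(x: List[float]) -> List[float]:
    """Magnitudes of the discrete Fourier transform of x (direct O(n^2) form)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def circular_shift(x: List[float], s: int) -> List[float]:
    """Shift x to the right by s samples, wrapping around."""
    s %= len(x)
    return x[-s:] + x[:-s]


def max_abs_diff(a: List[float], b: List[float]) -> float:
    return max(abs(u - v) for u, v in zip(a, b))


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
shifted = circular_shift(signal, 3)
cropped = signal[:6] + [0.0, 0.0]  # last two samples lost, as after cropping

base = dft_magnitudes(signal)
print(max_abs_diff(base, dft_magnitudes(shifted)))  # numerically zero
print(max_abs_diff(base, dft_magnitudes(cropped)))  # clearly non-zero
```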

 (Collaborators: Jeffrey Bloom, Min Wu, Ingemar Cox, Matthew Miller, Yuiman Rui, Shih-Fu Chang)

 

Theoretical Data Hiding Capacity Bound for Digital Images

 In addition, we study the theoretical question of the watermark embedding space that exists in multimedia data. This space should depend on the properties of the human audio-visual system. It is a complex scientific question, to which we may not be able to find a thorough answer in this thesis. Our objective is to study existing human vision system models, achieve a better understanding of the various watermarking spaces, and then develop an information-theoretic estimate of the information capacity available through watermarking. We investigate watermarking capacity in three directions: the zero-error capacity for public watermarking in magnitude-bounded noisy environments, the watermarking capacity based on domain-specific masking effects, and the watermarking capacity issues based on sophisticated Human Vision System models.

 First, we investigated the watermarking capacity based on content-independent constraints on the magnitudes of watermarks and noise. We showed that, when the noise magnitudes are constrained, a capacity bound with “deterministic” zero error can actually be achieved. That is, in an environment with finite states and bounded noise, the transmission error can be exactly zero, instead of approaching zero as contemplated in Shannon's channel capacity theory. Specifically, we found the zero-error capacity for private and public watermarking in a magnitude-bounded noisy environment. For example, assuming the added noise is due to quantization (as in JPEG), we can calculate the zero-error capacity from the magnitude constraints on the watermark and the noise. Note that we treat all pixels, watermarks and noise as discrete values, as occurs in realistic cases. Second, we derived the watermarking capacity based on domain-specific masking effects. We showed the capacity of private watermarking in which the power constraints are not uniform. We then applied several domain-specific HVS approximation models to estimate the power constraints and showed the theoretical watermarking capacity of an image in a general noisy environment. Third, we examined watermarking capacity issues based on actual Human Vision System models. We described in detail the most sophisticated Human Vision System models, developed by Daly and Lubin. We then discussed issues and possible directions in applying these models to the estimation of watermarking capacity.
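To see why bounded noise permits exactly zero error, consider a single integer coefficient in the private-watermarking case, where the decoder knows the original and therefore observes watermark plus noise. If the watermark is limited to [-W, W] and the noise to [-N, N], codewords spaced 2N + 1 apart can never be confused. The sketch below works through that simplified single-coefficient case; it is not the capacity derivation summarized above, and the function names and the values W = 7, N = 1 are assumptions.

```python
import math
from typing import List


def codebook(w_max: int, n_max: int) -> List[int]:
    """Watermark values in [-w_max, w_max] spaced so noise cannot confuse them."""
    spacing = 2 * n_max + 1
    return list(range(-w_max, w_max + 1, spacing))


def zero_error_levels(w_max: int, n_max: int) -> int:
    """Number of distinguishable values: floor(2W / (2N + 1)) + 1."""
    return (2 * w_max) // (2 * n_max + 1) + 1


def bits_per_coefficient(w_max: int, n_max: int) -> float:
    return math.log2(zero_error_levels(w_max, n_max))


def decode(received: int, book: List[int]) -> int:
    """Nearest-codeword decoding; exact whenever |noise| <= n_max."""
    return min(book, key=lambda c: abs(received - c))


W, N = 7, 1  # hypothetical magnitude bounds on watermark and noise
book = codebook(W, N)
print(book, bits_per_coefficient(W, N))

# Every codeword is recovered exactly under every admissible noise value.
all_ok = all(decode(c + e, book) == c for c in book for e in range(-N, N + 1))
print(all_ok)
```

The error here is exactly zero for every admissible noise value, not merely small on average, which is the distinction from the Shannon setting drawn in the text.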

 (Collaborators: Shih-Fu Chang)




PART IV: MULTIMEDIA INFORMATION MANAGEMENT

 My main focus in multimedia semantic analysis research is on automatic content learning, detection and recognition technologies for video data sources, including video scenes of various indoor and outdoor activities involving people, meetings, and vehicles, as well as TV news broadcasts. For applications, our goal is to achieve (1) significant improvement in indexing and retrieval performance for video data; (2) autonomous video understanding; (3) ancillary improvement for still image processing; (4) enabling technologies for video data mining, filtering and selection; and (5) a drastic reduction in the volume of video storage. This research applies to both commercial multimedia data management and the security uses of intelligence analysis.

 To achieve these application objectives, we are investigating mechanisms for fully automatic video indexing based on image, text, and audio content, which provide low-cost video corpus marking and preparation. We are also developing methods for robust person and text detection and recognition, and for event detection, recognition and understanding. Applying these semantic analysis techniques, we aim to develop efficient methods for representing content and to advance cross-media content search and extraction.


[Video Summarization Mark]    Multimedia Semantic Adaptation and Summarization


     With the growing amount of multimedia content, people increasingly want to view multimedia personalized to their usage environments. When people use their pervasive devices, they generally limit their viewing time on the small displays and minimize the amount of interaction and navigation needed to reach the content. When they browse video on the Internet, they may want only the videos that match their preferences. Because user clients and data sources are heterogeneous, it is a real challenge to implement a universally compliant system that fits various usage environments. We proposed a video personalization system comprising three major components: the user client, the database server, and the media middleware. The middleware framework is powered by a personalization engine and an adaptation engine that optimally produce video summaries. On the server side, these engines draw on the MPEG-7 metadata descriptions, the MPEG-21 rights expressions, and the content adaptability declarations in the server database. On the client side, they draw on the MPEG-7 user preferences, the MPEG-21 usage environments, and the user query from the client devices.

The major component tools in this system include VideoSue, VideoEd, VideoAnnEx, and Universal Tuner.

(Collaborators: Belle L. Tseng, John Smith)




PART V: SOCIAL INFORMATICS

  In the information age, what matters most is not what you know, but who you know. A social network, the graph of relationships and interactions within a group of individuals, plays a fundamental role as a medium for the spread of information, ideas, and influence. At the organizational level, personal social networks are activated for recruitment, partnering, and information access. At the individual level, people exploit their networks to advance careers and gather information. In addition, multimodality sensors with speaker and face recognition techniques can help capture people's daily social activities. These are all important aspects of human interest and behavior modeling, and thus should have significant impact on personalized services.

(1) ExpertiseNet   (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)

(2) CommunityNet  (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)

(3) Community-based Dynamic Recommendation  (with Xiaodan Song, Belle L. Tseng and Ming-Ting Sun)




PART VI: LIFE SCIENCE APPLICATIONS

  As the elderly share of the population continues to rise, advanced health care systems have become an active subject of many research projects. The common goal of these projects is to assist people in different ways so that the quality and length of their lives can be improved. Health care applications often have a critical need to recognize a person's physical location, condition, and activity. With knowledge of a user's contextual information, a personal assistant/recommender or warning system can be developed to recommend activities for improving long-term health, raise alarms in emergency situations, and monitor long-term health changes via early-warning detection. Various sensors can be deployed to acquire multimodality data for recognizing a user's contextual information. Once sensory personal information is acquired, the challenge is how to automatically perform semantic, contextual and ontological analyses and data fusion, in order to reliably support high-quality recommendation, detection, and decision-making.


(1) Multimodality Sleep Activity Detector   (with Ya-Ti Peng and Ming-Ting Sun)

(2) Contextual Environmental Inference   (with Ya-Ti Peng and Ming-Ting Sun)



Last Updated: 10/21/2005