Dr. Philip S. Yu


Current goals of the project include developing advanced algorithms for new data mining techniques and applications, indexing techniques that facilitate non-conventional similarity searches, and other optimization and performance topics (such as optimization techniques for scheduling). In the data mining area, we focus on overcoming the limitations and shortcomings of state-of-the-art mining methods and on exploring new application areas enabled by the new mining capabilities. These include stream data mining, grid mining, interactive mining, and anomaly detection.

The growing number of applications that must process data arriving as a continuous stream places stringent requirements on data mining algorithms. For example, under the data stream model, data records can only be accessed sequentially and read once or a small number of times. Furthermore, the computing systems performing the stream mining usually have only a limited amount of memory for storing intermediate results, and they need to keep up with the rate at which the data streams in. We study new mining algorithms under the data stream model. The focus is not only on supporting one-pass algorithms, but also on capturing the dynamically changing characteristics of the data.
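To make the one-pass, limited-memory constraint concrete, here is a minimal sketch of the well-known Misra-Gries frequent-items summary. It is an illustration of the data stream model only, not one of the project's own algorithms; the function name `misra_gries` and the parameter `k` are assumptions made for this example, and the sketch does not address changing data characteristics.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """One-pass summary that keeps at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to remain in the summary. Stored counts are lower bounds
    on the true frequencies, not exact values.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:  # each record is read exactly once
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    print(misra_gries(["a", "b", "a", "c", "a", "d", "a"], k=2))
```

Memory use depends only on `k`, not on the stream length, which is the property the stream model demands.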

Grid computing is a new model for distributed, high-performance computing. Because much commercial and scientific data is maintained at geographically distributed sites, we study how to adapt mining applications to exploit the grid model. Our focus is on using an ensemble-based learning approach to aggregate models trained on distributed grid nodes.
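The idea of aggregating models trained on separate nodes can be sketched as follows. This is a simplified illustration under stated assumptions: each node trains a nearest-centroid classifier on its local data, and predictions are combined by majority vote. Both the choice of learner and the names `train_centroids`, `predict_local`, and `predict_ensemble` are hypothetical and are not taken from the project.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]
Model = Dict[str, List[float]]


def train_centroids(samples: List[Sample]) -> Model:
    """Local model for one node: the mean feature vector of each class."""
    sums: Dict[str, List[float]] = {}
    counts: Counter = Counter()
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict_local(model: Model, features: Sequence[float]) -> str:
    """Return the class whose centroid is closest to the given point."""
    def sq_dist(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def predict_ensemble(models: List[Model], features: Sequence[float]) -> str:
    """Aggregate the node models by majority vote."""
    votes = Counter(predict_local(m, features) for m in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Each list stands in for data that never leaves its own node.
    node_data = [
        [([0.0, 0.0], "neg"), ([1.0, 1.0], "pos")],
        [([0.2, 0.1], "neg"), ([0.9, 1.2], "pos")],
        [([0.1, 0.3], "neg"), ([1.1, 0.8], "pos")],
    ]
    models = [train_centroids(data) for data in node_data]
    print(predict_ensemble(models, [0.9, 0.9]))
```

Only the small per-node models travel between sites; the raw data stays where it is stored.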

Most mining algorithms are still run as batch processes, and the user has minimal interaction with the system while mining is under way. If the results turn out to be unsatisfactory after a long wait, all the computing resources consumed by the mining effort are wasted. We focus on new approaches that perform mining interactively, so that the user can control the mining direction (e.g., in clustering) and even the accuracy (e.g., in classification) and efficiency.

Mining for anomalous events is very important and widely applicable: business activity monitoring, intrusion detection, uncovering irregularities and fraud in financial transactions, cleansing of data feeds, and autonomic computing. It has also taken on increasing urgency (since 9/11) for detecting potential leads or early warning signals of terrorist or biological attacks. Our work focuses on fundamental algorithms for anomaly detection. This includes one-class and partially supervised classification methods, cost-sensitive learning, and rule scheduling and indexing to support real-time detection.
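As a rough illustration of the one-class setting, the sketch below learns only from examples of normal behavior and flags values that fall far from them. It assumes a single numeric feature and a simple mean and standard deviation model; the class name `OneClassGaussian` and the default threshold of 3 standard deviations are assumptions for this example, not the project's method.

```python
import statistics
from typing import Sequence


class OneClassGaussian:
    """Toy one-class detector trained on normal observations only."""

    def __init__(self, threshold: float = 3.0) -> None:
        # Number of standard deviations tolerated before flagging.
        self.threshold = threshold
        self.mean = 0.0
        self.std = 0.0

    def fit(self, normal_values: Sequence[float]) -> "OneClassGaussian":
        if not normal_values:
            raise ValueError("need at least one normal observation")
        self.mean = statistics.fmean(normal_values)
        self.std = statistics.pstdev(normal_values)
        return self

    def is_anomaly(self, value: float) -> bool:
        # With zero spread, any value different from the mean is flagged.
        return abs(value - self.mean) > self.threshold * self.std


if __name__ == "__main__":
    detector = OneClassGaussian().fit([10, 11, 9, 10, 12, 8, 10])
    print(detector.is_anomaly(25), detector.is_anomaly(11))
```

No labeled anomalies are needed for training, which matters when anomalous events are rare or previously unseen.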

In bioinformatics, the amount of data generated is growing at an exponential rate, and data mining is badly needed to make sense of it. However, current mining techniques must be extended to handle some domain-specific issues. Our focus is on addressing these limitations and extending the capability of current mining techniques. We have studied clustering techniques based on pattern coherency for micro-array data, as well as string-based classification techniques. Another area is the mining of repeated patterns in a string sequence in the presence of noise. We have also pursued indexing methods to support approximate pattern matching of bio-sequences and micro-array data.
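To show what approximate pattern matching means in the presence of noise, here is a small sketch that reports every position where a pattern occurs with at most a given number of mismatched characters. It is a brute-force scan under the assumption that noise takes the form of substitutions only; it is not an indexing method, and the name `approximate_matches` is hypothetical.

```python
from typing import List


def approximate_matches(sequence: str, pattern: str, max_mismatches: int) -> List[int]:
    """Return start positions where pattern matches within max_mismatches.

    Only substitutions are tolerated (Hamming distance); insertions and
    deletions are not handled by this sketch.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions: List[int] = []
    m = len(pattern)
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too noisy, abandon this window early
        else:
            # Loop finished without exceeding the mismatch budget.
            positions.append(start)
    return positions


if __name__ == "__main__":
    print(approximate_matches("ACGTACGAACGT", "ACGT", max_mismatches=1))
```

The scan costs time proportional to the sequence length times the pattern length for every query, which is the cost an index is meant to avoid.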

With the popularity of on-line commerce, it has become increasingly critical for companies to employ data mining in order to obtain a competitive edge in the marketplace. We have studied several data mining problems that arise in this setting. One area of focus is the mining of high-dimensional data, where we have developed a 'projected clustering' approach. Much work has also been done on mining association rules, including developing faster and on-line algorithms, refining the large item set concept using collective strength, and devising a new framework to mine associations by pattern structures in relational databases.
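For readers unfamiliar with association rules, the sketch below computes the two standard measures behind the large item set concept: support of an item set and confidence of a rule. It shows the textbook definitions only and does not implement collective strength or the faster and on-line algorithms mentioned above; the function names and the sample transactions are made up for illustration.

```python
from typing import FrozenSet, Iterable, List, Set


def support(transactions: List[Set[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        raise ValueError("need at least one transaction")
    items: FrozenSet[str] = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[Set[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Confidence of the rule antecedent -> consequent."""
    left = frozenset(antecedent)
    both = left | frozenset(consequent)
    left_support = support(transactions, left)
    if left_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, both) / left_support


if __name__ == "__main__":
    # Hypothetical market baskets.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    print(support(baskets, {"bread", "milk"}))
    print(confidence(baskets, {"bread"}, {"milk"}))
```

An item set is called large when its support meets a user-chosen minimum threshold.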

Personalization refers to the ability to gather and store information about individual customers, analyze the information, and then act on the knowledge by delivering the right information to each customer at the proper time. It is a key technology needed in a variety of E-business applications, including customer relationship management, advertisement targeting and product promotion, marketing campaign management, Web site content management, knowledge management, and personalized portal management. Although each specific application area may need special tailoring, especially in the areas of user interface and data collection, the core techniques for personalization are quite similar. We have developed improved collaborative filtering algorithms and also a content-based collaborative filtering approach. We have also developed various text mining algorithms which provide automated content taxonomy and conceptual indexing, and which facilitate content-based collaborative filtering.
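The basic idea behind collaborative filtering can be sketched as follows: predict a customer's rating for an item from the ratings of other customers, weighted by how similar their past ratings are. This is the plain user-based variant with cosine similarity, offered as an assumption-laden illustration rather than the improved algorithms described above; the names `cosine_similarity` and `predict_rating` and the sample ratings are hypothetical.

```python
import math
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, float]]


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings: Ratings, user: str, item: str) -> Optional[float]:
    """Similarity-weighted average of other users' ratings for item."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    # None signals that no comparable user has rated the item.
    return weighted_sum / weight_total if weight_total else None


if __name__ == "__main__":
    sample: Ratings = {
        "ann": {"a": 5, "b": 1},
        "bob": {"a": 5, "b": 1, "c": 4},
        "cat": {"a": 1, "b": 5, "c": 2},
    }
    print(predict_rating(sample, "ann", "c"))
```

In this sample the prediction for "ann" lands closer to "bob"'s rating than to "cat"'s, because her past ratings agree with his.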

We have also studied Web-enabling technologies, including the development of caching and load balancing schemes to improve Web and proxy server performance, the discovery of new business methods and processes for the Internet, and the identification and design of new Internet and pervasive computing applications. We have developed a variety of collaborative proxy caching algorithms, as well as load balancing algorithms for clusters of Web servers, also known as 'Web server farms'. We have also developed near-optimal scheduling strategies for electronically distributing, via television broadcast channels, digital content purchased over the Web. (Examples of such digital content include CDs, DVDs, software, and, in the future, books.) Finally, in the coming post-PC era, cellular phones and other small devices will be increasingly connected to the Internet. Because it is difficult to display large tables of information on these small devices, we have developed table summarization techniques.