IBM Research

Information Retrieval/Search Technologies Seminar 2007

IBM Haifa Labs

June 27, 2007
Organized by IBM Research Lab in Haifa, Israel

Cluster Ranking

Ziv Bar-Yossef, Google Inc. (Haifa) and Faculty of Electrical Engineering, Technion, Israel

We initiate the study of a new clustering framework, called cluster ranking. Rather than simply partitioning a network into clusters, a cluster ranking algorithm also orders the clusters by their strength. To this end, we introduce a novel strength measure for clusters, the integrated cohesion, which is applicable to arbitrary weighted networks. We then present C-Rank: a new cluster ranking algorithm. Given a network with arbitrary pairwise similarity weights, C-Rank creates a list of overlapping clusters and ranks them by their integrated cohesion. We provide extensive theoretical and empirical analysis of C-Rank and show that it is likely to have high precision and recall.

Our experiments focus on mining mailbox networks. A mailbox network is an egocentric social network, consisting of contacts with whom an individual exchanges email. Ties among contacts are represented by the frequency of their co-occurrence on message headers. C-Rank is well suited to mine such networks, since they are abundant with overlapping communities of highly variable strengths. We demonstrate the effectiveness of C-Rank on the Enron data set, consisting of 130 mailbox networks.
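As a minimal sketch of how such a mailbox network could be built, the following Python counts how often each pair of contacts co-occurs on a message header. The function name, the input layout (one list of addresses per message), and the use of raw counts as tie weights are assumptions made for illustration; the abstract does not specify the exact weighting, and C-Rank itself is not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(messages, owner):
    """Count how often each pair of contacts appears on the same header.

    `messages` is an iterable of address collections, one per message.
    `owner` is the mailbox owner, excluded because the network is egocentric.
    Raw counts are an assumed weighting, used here for illustration only.
    """
    weights = Counter()
    for addresses in messages:
        # Deduplicate and sort so each pair has one canonical key.
        contacts = sorted(set(addresses) - {owner})
        for a, b in combinations(contacts, 2):
            weights[(a, b)] += 1
    return weights


messages = [
    ["me", "ann", "bob"],
    ["me", "ann", "bob", "cat"],
    ["me", "cat"],
]
weights = cooccurrence_weights(messages, owner="me")
```

The resulting weighted edge list is the kind of input a cluster ranking algorithm would take: a network with arbitrary pairwise similarity weights.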

Joint work with Ido Guy, Ronny Lempel, Yoelle Maarek, and Vova Soroka.

Using Text Analytics Techniques to Improve User Search Experience

Josemina Magdalen, Development Team Leader, Content Discovery Engineering, IBM Israel Software Labs, Jerusalem

The typical search experience is not good enough, since people cannot efficiently find the information they need. Conventional search techniques index content as text that users access through keyword searches. This tends to yield "feast or famine" results; many of the results lack relevance and there is no efficient way to refine results. However, in Enterprise Search experiences, it is extremely important to address the customer business needs, reduce site abandonment, and provide a friendly end-user experience.

IBM Omnifind Discovery Edition (IODE) is oriented towards the Enterprise Search market. IODE is built on technology that powers solutions delivering contextually related information to the right people at the right time, enabling end users to efficiently find the information they are looking for. In this presentation, we will focus on Query Contextual Understanding, which uses text analytics techniques and end-user context to determine the "intent" of a query. Effective query understanding is extremely important for a successful search and navigation application. IODE's natural language understanding of search queries captures end-user "intent" by using techniques of query expansion, query understanding, contextual mining, personalization, and query classification. In the talk we will present some of the Contextual Understanding text analytics techniques and their applications.

IR on Semistructured Data: from XML Trees to Entity-Relation Graphs

Gerhard Weikum, Research Director, Max-Planck Institute for Informatics (MPII), Saarbruecken, Germany

Information retrieval (IR) on semistructured data is important for digital libraries, enterprise search, and other application areas where text and structured data are combined or structured data bears some level of uncertainty. These situations mandate ranked retrieval rather than database-style precise querying. For non-schematic XML document collections, various approaches to XML IR have been advanced in recent years.

Considering also links within and across documents, this direction evolves into expressive forms of graph IR with challenging issues regarding ranking semantics and computational efficiency. At the same time, Web search is gaining structure and context awareness and more semantic flavor, for example, in the forms of vertical object search, Deep-Web search, or when exploiting semantically rich tags derived from large-scale information extraction or Social-Web-style user annotations. Again, this leads to a new kind of graph IR on large entity-relation networks.

The talk will discuss lessons learned from XML IR, ongoing approaches towards entity-relation graph IR, and research opportunities in this emerging area.

PageRank and Hubs and Authorities Without Hyperlinks: Structural Re-ranking of Search Results using Links Induced by Language Models

Oren Kurland, Faculty of Industrial Engineering & Management, Technion, Israel

The ad hoc retrieval task is to rank documents in a corpus (repository) in response to a query by their assumed relevance to the information need underlying the query. Inspired by the PageRank and HITS (hubs and authorities) algorithms for Web search, we propose a graph-based framework that applies to document collections lacking hyperlink information. Specifically, we focus on a structural re-ranking approach: we re-order the documents in an initially retrieved set by exploiting textual similarities between them. We show that centrality, as induced over graphs wherein links represent asymmetric language-model-based inter-document similarities, constitutes the basis of effective re-ranking algorithms. Furthermore, incorporating information induced from clusters of documents from the initial list in the graphs helps to improve the effectiveness of centrality-based approaches. For example, document "authoritativeness" as induced by the HITS algorithm over cluster-document graphs is a highly effective re-ranking criterion. Moreover, "authoritative" clusters are shown to contain a high percentage of relevant documents.
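The centrality idea can be illustrated with a small PageRank-style sketch in Python. It assumes the asymmetric similarities are already available as a matrix `sim`; how they are induced from language models is not shown, and the function name, the `top_k` link rule, and the damping value are illustrative choices rather than the authors' settings.

```python
def rerank_by_centrality(sim, top_k=2, damping=0.85, iterations=50):
    """Re-rank documents by PageRank-style centrality on a similarity graph.

    sim[i][j] is a nonnegative, possibly asymmetric similarity of
    document i to document j (assumed to be given). Each document links
    to its top_k most similar other documents.
    """
    n = len(sim)
    links = []
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:top_k]
        total = sum(sim[i][j] for j in neighbours)
        if total > 0:
            # Normalise outgoing weights so they sum to one.
            links.append({j: sim[i][j] / total for j in neighbours})
        else:
            links.append({})  # no usable out-links: a dangling document

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Mass held by dangling documents is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if not links[i])
        new_scores = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            for j, weight in links[i].items():
                new_scores[j] += damping * scores[i] * weight
        scores = new_scores

    ranking = sorted(range(n), key=lambda d: scores[d], reverse=True)
    return ranking, scores


# Toy initial list: documents 1, 2, and 3 are all most similar to document 0.
sim = [
    [0.0, 0.9, 0.1, 0.1],
    [0.8, 0.0, 0.2, 0.1],
    [0.7, 0.1, 0.0, 0.2],
    [0.6, 0.2, 0.1, 0.0],
]
ranking, scores = rerank_by_centrality(sim, top_k=1)
```

In this toy example document 0 receives links from every other document, so it ends up first in the re-ranked list.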

This is joint work with Lillian Lee.

Incremental Caching for Collection Selection Architectures

Fabrizio Silvestri, Institute of Information Science and Technologies, National Research Council (ISTI-CNR), Pisa, Italy

To address the rapid growth of the Internet, modern Web search engines have to adopt distributed organizations, where the collection of indexed documents is partitioned among several servers, and query answering is performed as a parallel and distributed task. Collection selection can be a way to reduce the overall computing load, by finding a trade-off between the quality of results retrieved and the cost of solving queries. In this paper, we analyze the relationship between the caching subsystem and the collection selection strategy, by exploring the design-space of this combined approach. In particular, we propose a novel caching policy able to incrementally refine the effectiveness of the results returned for each subsequent cache hit. The combination of collection selection and incremental caching strategies allows our system to retrieve two thirds of the top-ranked results returned by a baseline centralized index, with only one fifth of the computing workload.
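One way to picture incremental caching is the Python sketch below: each repeated request for a cached query polls the next most promising servers and merges their results into the cached entry, so the answer improves with every hit. The class name, the `select` callback, and the absence of any eviction policy are simplifying assumptions; this is not the policy evaluated in the work.

```python
import heapq


class IncrementalCache:
    """Toy cache that refines its results on every repeated query."""

    def __init__(self, servers, select, per_request=1, top_n=3):
        self.servers = servers          # server id -> function(query) -> [(score, doc)]
        self.select = select            # function(query) -> server ids, best first
        self.per_request = per_request  # servers polled per request
        self.top_n = top_n              # results kept per query
        self.entries = {}               # query -> (results, servers polled so far)

    def search(self, query):
        results, polled = self.entries.get(query, ([], 0))
        ranked = self.select(query)
        # Poll only servers that have not yet contributed to this entry.
        for server_id in ranked[polled:polled + self.per_request]:
            results = results + self.servers[server_id](query)
        polled = min(polled + self.per_request, len(ranked))
        # Merge, deduplicate, and keep the best top_n results.
        results = heapq.nlargest(self.top_n, set(results))
        self.entries[query] = (results, polled)
        return results


servers = {
    "a": lambda q: [(0.9, "d1"), (0.4, "d2")],
    "b": lambda q: [(0.8, "d3")],
    "c": lambda q: [(0.95, "d4")],
}
cache = IncrementalCache(servers, select=lambda q: ["a", "b", "c"],
                         per_request=1, top_n=2)
first = cache.search("q")   # polls server "a" only
second = cache.search("q")  # cache hit: also polls "b"
third = cache.search("q")   # cache hit: also polls "c"
```

Each request contacts only one server here, yet the cached entry converges to what polling all three servers would return.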

This is joint work with Ricardo Baeza-Yates, Raffaele Perego and Diego Puppin.

XML Fragments Extended with Database Operators

Benjamin Sznajder, IBM Haifa Research Lab

XML documents represent a middle range between unstructured data such as textual documents and fully structured data encoded in databases. Typically, information retrieval techniques are used to support search on the "unstructured" end of this scale, while database techniques are used for the structured part. To date, most of the work on XML query and search has stemmed from the structured side and is strongly inspired by database techniques. In previous work we described a new query approach via pieces of XML data called "XML Fragments", which are of the same nature as the queried XML documents and are specifically targeted to support the information needs of end users in an intuitive way. In addition to being simple, XML Fragments are a natural extension of traditional free-text information retrieval queries, where both documents and queries are represented as vectors of words; as such, they enable a natural extension of IR ranking models to rank XML documents by context and structure. In this paper, we extend XML Fragments with database operators, thus combining an IR-style approach with database "structured" query capabilities.

Joint work with Yosi Mass, Dafna Sheinwald and Sivan Yogev.

Keynote: Algorithmic Advertising

Andrei Broder, Fellow and VP of Emerging Search Technology, Yahoo! Research, USA

Algorithmic advertising is a new scientific discipline, at the intersection of large scale search and text analysis, information retrieval, statistical modeling, machine learning, classification, optimization, and microeconomics. The central challenge of algorithmic advertising is to find the "best match" between a given user in a given context and a suitable advertisement. The context could be a user entering a query in a search engine ("sponsored search"), a user reading a web page ("content match"), a user watching a movie on a portable device, and so on. The information about the user can vary from scarily detailed to practically nil. The number of potential advertisements might be in the billions. Thus, depending on the definition of "best match" this challenge leads to a variety of massive optimization and search problems, with complicated constraints.

In this talk I will present a survey of these issues, emphasizing the connections with classical IR. Time permitting, I will briefly discuss two forthcoming SIGIR papers, as examples of algorithmic advertising research.

High Throughput Real-Time IR Solution for the Financial Community

Yonatan Ben-Simhon, AOL Relegence Israel Ltd.

Financial analysts have unique needs when it comes to Information Retrieval. On top of the usual need to get the most relevant information, they need the information the second it is released from any of a multitude of data sources. Analysis of the data flow serves as a further source of information never available before.

The Relegence Real Time Internet Platform is a powerful Information Retrieval platform that gathers news stories, analyzes them and distributes them in less than a second. The platform monitors tens of thousands of sources, and processes more than half a million stories a day. Machine learning and rule-based algorithms are used to analyze these unstructured documents and enrich them with relevant information. The Relegence Real Time Internet Platform cuts through the information overload to obtain meaningful content with innovative features such as Entity Extraction, Heat analysis, Category Classification, Clustering of related stories, Relevancy Scoring, automatic Summarization and more. This platform serves the needs of the financial industry and can also be used to process other types of data streams.

In this talk I will describe the platform and demonstrate its capabilities.

A Novel Approach for Speech Information Retrieval

Jonathan Mamou, IBM Haifa Research Lab

We are interested in retrieving information from speech data such as broadcast news, telephone conversations, contact center calls and roundtable meetings. Today, many systems use large vocabulary continuous speech recognition tools to produce word transcripts; the transcripts are indexed and query terms are retrieved from the index. However, out-of-vocabulary (OOV) query terms, which are not part of the recognizer's vocabulary, cannot be retrieved, and the recall of the search suffers. In addition, recognition errors can also affect search effectiveness.

We present a vocabulary independent system that can handle queries containing OOV terms, exploiting the information provided by both word transcripts and phonetic transcripts. For in-vocabulary query terms, our system also uses word confusion networks (WCNs) generated by the speech recognizer. By taking the word alternatives provided by the WCNs and the terms' confidence levels into consideration, our system is able to reduce the effect of recognition errors in the word transcripts and to improve the search effectiveness. The word and phonetic transcripts are both indexed and combined during the query processing.
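The routing between the two indices can be sketched in Python as follows. Everything here is an illustrative assumption: the index layouts, the `to_phones` pronunciation function, the exact phone-sequence match, and the scoring by summed confidence are stand-ins, not the system's actual design.

```python
def search_term(term, vocabulary, word_index, phonetic_index, to_phones):
    """Score documents for one query term.

    word_index:     term -> {doc_id: [confidence of each WCN occurrence]}
    phonetic_index: tuple of phones -> set of doc_ids
    to_phones:      hypothetical pronunciation function, term -> tuple of phones
    """
    if term in vocabulary:
        occurrences = word_index.get(term, {})
        # Sum occurrence confidences so uncertain alternatives count for less.
        return {doc: sum(confs) for doc, confs in occurrences.items()}
    # OOV term: fall back to matching its phone sequence (exact match only here).
    return {doc: 1.0 for doc in phonetic_index.get(to_phones(term), set())}


vocabulary = {"market", "report"}
word_index = {"market": {"call1": [0.9, 0.3], "call2": [0.2]}}
phonetic_index = {("h", "ay", "f", "ah"): {"call2"}}
pronunciations = {"haifa": ("h", "ay", "f", "ah")}

in_vocab = search_term("market", vocabulary, word_index,
                       phonetic_index, pronunciations.get)
oov = search_term("haifa", vocabulary, word_index,
                  phonetic_index, pronunciations.get)
```

An in-vocabulary term is scored from the word index, while the OOV term "haifa" is still found through its phone sequence.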

We analyze the retrieval effectiveness at different error levels and when different ranking models are used. We show that the mean average precision is improved using WCNs compared to the raw word transcripts, even under a high error rate. We also show an improvement in retrieval performance when using a phonetic index to search for queries containing OOV terms.

In our evaluation, this approach of combining word transcripts and phonetic transcripts outperforms approaches that use only a word index or only a phonetic index.

The value of the proposed method has been demonstrated by the relatively high performance of our system, which received the highest overall ranking for US English speech data in the 2006 NIST Spoken Term Detection evaluation.

The talk is based on joint work with David Carmel, Ron Hoory, Bhuvana Ramabhadran and Olivier Siohan.

Panel: Web 2.0 Impact on Search

Moderator: David Konopnicki, Manager, Search Technologies Development, IBM Haifa Research Lab

  • Andrei Broder, Fellow and VP of Emerging Search Technology, Yahoo! Research, USA
  • Yoelle Maarek, Director, Google Engineering Lab, Haifa, Israel
  • Prof. Sheizaf Rafaeli, Director, Center for the Study of the Information Society and Head, Graduate School of Management, University of Haifa
  • Aya Soffer, Department Group Manager, Information and User Technologies, IBM Haifa Research Lab

