IBM Research

Information Retrieval Seminar 2005

IBM Haifa Labs

December 05, 2005
Organized by IBM Research Lab in Haifa, Israel

Advances and Challenges for Information Retrieval Evaluation

Ian Soboroff, NIST

Since the 1960s, advances in information retrieval technology have come hand-in-hand with methods for evaluating their effectiveness. The Text REtrieval Conference (TREC) series of workshops reflects this history of technology driving evaluation driving technology. The Cranfield paradigm, which uses test collections to measure the relative effectiveness of different retrieval approaches, spurred dramatic improvements in the state of the art. At the same time, new search tasks and information domains continually force us to adapt and extend the methodology in order to measure how well we perform in those spaces. A particular challenge has been the emergence of ever larger collections of information that undergo continuous change, such as the web. Originally, the test collection paradigm assumed that all documents could be assessed for relevance to a given information need, and test collections were thus limited to hundreds or a few thousand documents. The technique of pooling allowed for the creation of reusable test collections of millions of documents by focusing assessment on the documents most likely to be relevant. However, as the size of the collection grows, systems are increasingly likely to retrieve unjudged documents at high rank, making it difficult to measure their effectiveness.

The TREC Terabyte track was created to examine this problem of large-scale test collections and to propose solutions to it, while at the same time spurring IR researchers and research IR systems to work at a larger scale. So far, we have learned that pooling is effective up to much larger collection sizes than we first believed. At the same time, recent advances in test collection design and effectiveness metrics may make it possible to use the web as a test collection, allowing us to reliably compare retrieval approaches on immense, dynamic information domains.
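The pooling idea mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes that a pool is the union of the top-ranked documents from each contributing run; the run names, document IDs, and function names are hypothetical and are not taken from TREC tooling.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` documents from every run's ranked list."""
    pool: Set[str] = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def unjudged_at_k(ranking: List[str], pool: Set[str], k: int) -> int:
    """Count documents in the top k of a ranking that were never pooled."""
    return sum(1 for doc in ranking[:k] if doc not in pool)


# Hypothetical ranked lists from two systems that contributed to the pool.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)

# A later system that did not contribute to the pool.
new_system = ["d7", "d1", "d5", "d8"]
unjudged = unjudged_at_k(new_system, pool, k=4)
```

Only the pooled documents would be sent to assessors. The count of unjudged documents for the later system shows, on a toy scale, the difficulty the abstract describes for very large collections.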

Learning to predict query difficulty

David Carmel, IBM Haifa Research Lab

In this work we present novel learning methods for estimating the quality of results returned by a search engine in response to a query. Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries. We demonstrate the usefulness of quality estimation for several applications, among them improvement of retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval. Experiments on TREC data demonstrate the robustness and the effectiveness of our learning algorithms.

The paper presented, coauthored with Elad Yom-Tov, Shai Fine, and Adam Darlow of IBM Research Lab in Haifa, won the Best Paper Award at SIGIR 2005.
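As a rough illustration of the agreement signal described in the abstract, the following Python sketch counts how many top results of each sub-query also appear in the top results of the full query. Treating each single term as a sub-query is an assumption made here for brevity, and the toy search function and its data are hypothetical. The learning step that maps such features to a quality estimate is not shown.

```python
from typing import Callable, List


def overlap_features(
    query: str,
    search: Callable[[str], List[str]],
    top_n: int = 10,
) -> List[int]:
    """For each single-term sub-query, count the top results it shares
    with the top results of the full query."""
    full_top = set(search(query)[:top_n])
    return [len(full_top & set(search(term)[:top_n])) for term in query.split()]


# Hypothetical stand-in for a search engine: canned ranked lists per query.
TOY_RESULTS = {
    "jaguar speed": ["d1", "d2", "d3"],
    "jaguar": ["d1", "d2", "d9"],
    "speed": ["d3", "d7", "d8"],
}


def toy_search(q: str) -> List[str]:
    return TOY_RESULTS.get(q, [])


features = overlap_features("jaguar speed", toy_search, top_n=3)
```

A feature vector of this kind could then be passed to a learned estimator.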

Peer-to-Peer Data Integration with Active XML

Tova Milo, Tel Aviv University, Israel

The advent of XML as a universal exchange format and of Web services as a basis for distributed computing has fostered the emergence of a new class of documents that we call Active XML documents (AXML for short). These are XML documents where some of the data is given explicitly while other parts are given only intensionally, by means of embedded calls to Web services, which can be called to generate the required information. We argue that AXML provides powerful means for the modeling and integration of distributed dynamic Web data. AXML can capture various integration scenarios including peer-to-peer data mediation and warehousing, while providing support for new features of Web services such as subscription, service directories, and controlling data changes. Moreover, by allowing service call parameters and responses to contain calls to other services, AXML enables distributed computation over the Web. We give an overview of the AXML project, considering the new possibilities that Active XML brings to Web data management and the fundamental challenges it raises.

The research addresses functionality as well as efficiency of execution and develops data management and query processing techniques for AXML data and services.
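The notion of a document that mixes explicit data with embedded service calls can be sketched in a few lines of Python. The `call` element, its `service` and `arg` attributes, and the local function standing in for a Web service are all hypothetical choices made for illustration; they are not the AXML project's actual syntax or API.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# A document with explicit data (title) and intensional data (weather).
DOC = """<newspaper>
  <title>Daily News</title>
  <weather><call service="get_weather" arg="Haifa"/></weather>
</newspaper>"""


def materialize(root: ET.Element, services: Dict[str, Callable[[str], str]]) -> ET.Element:
    """Replace every <call> element with the text returned by the named service."""
    for parent in list(root.iter()):       # snapshot so the tree can be edited safely
        for child in list(parent):
            if child.tag == "call":
                result = services[child.get("service")](child.get("arg"))
                parent.remove(child)
                parent.text = (parent.text or "") + result
    return root


# Local stand-in for a remote Web service; no network access is involved.
services = {"get_weather": lambda city: f"sunny in {city}"}

root = materialize(ET.fromstring(DOC), services)
weather_text = root.find("weather").text
```

This sketch returns plain text only. In the setting the abstract describes, a response may itself contain further service calls, which would require repeating the materialization step.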

Database Inspired Search

David Konopnicki, IBM Haifa Research Lab

In this talk, we will review recent trends in the search engine market: new features of web search engines, the emergence of desktop and enterprise search engines, and others. We will show why search is still a difficult task and argue that techniques used in traditional database systems could help. Namely, we will focus on search index integration, object awareness (schema), correlation awareness (joins) and context awareness (high-level query languages).

The talk reflects upon the paper, coauthored with Oded Shmueli of the Technion, that won the VLDB 2005 10 Year Best Paper Award.

Searching the Tagspace with RawSugar

Frank Smadja, RawSugar, Israel

There has recently been a great surge of interest in collaborative tagging as a means of facilitating knowledge sharing in social computing. Collaborative tagging refers to the process in which a community of users adds meta-information in the form of keywords or tags to Web content such as web pages, links, photographs, and audio files on a centralized web server. While collaborative tagging is only starting to be studied in the research community, it seems to address a real need on the Web, as demonstrated by the growing popularity of tagging and annotation sites (see flickr, technorati, RawSugar, Shadows, etc.); the most popular sites already have a combined user base of well over one million. The philosophy of what is called Web 2.0, the social Web, or the two-way Web is that users can and should be content creators as well as consumers, and it suggests that there is a great deal of untapped potential for tagging to improve how web content is organized, navigated and experienced. Yet most tagging services are still far from fulfilling this promise: they are still limited to a very technical audience, they do not address the search issues of the tagspace, and it is not clear how they will evolve and scale when, if at all, the user base grows beyond early adopters.

In this talk we present RawSugar's approach to searching the tagspace and we focus on several technology aspects that we use for this purpose. Specifically, we discuss the use of faceted search and hierarchical tags and how we control the tagging language. We will also briefly discuss how we mine and learn from the tagspace, and our attempts to read the user's mind in our suggestion engine.

Linguistic Tools for Advanced Information Extraction

Aaron Kaplan, Xerox, XRCE, France

In order to advance the capabilities of information extraction systems beyond the recognition of simple patterns involving names and dates, Xerox is developing a set of tools based on deep linguistic processing. The foundation of this work is a parsing engine with a rich grammar formalism that supports the creation of fast parsers for multiple languages. A tightly integrated Python module facilitates the manipulation of parsing results, which speeds the development of new applications. A graphical interface currently under development will allow users with no knowledge of NLP techniques to create information extraction patterns from examples.

When You Don't Want to Search

Bob Rosenschein, Israel

As we watch them compete for market dominance, today's search engines index the Web, providing almost infinite options and potential paths to explore. The paradigm has existed for over a decade without significant improvement to the user experience. A wealth of links is an alluring notion and widely accepted for now, but not a particularly efficient one when the goal is not to surf, but rather to learn. Therein lies the essence of next-generation web information retrieval; it removes the intermediary step, saving the user's time and focusing his attention.

The new players - and new products from existing players - provide actual answers, in context, without the need to chase links. Bob Rosenschein will look at the industry players, their technologies, their challenges, and their individual approaches to providing answers.

Random Sampling from a Search Engine's Index

Ziv Bar-Yossef, Technion, Israel

We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from a search engine's index using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines.

The technique of Bharat and Broder suffers from two well-documented biases: it favors long documents and highly ranked documents. In this paper we introduce two novel sampling techniques: a lexicon-based technique and a random walk technique. Our methods produce biased sample documents, but each sample is accompanied by a corresponding "weight", which represents the probability that this document is selected in the sample. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to three well-known Monte Carlo simulation methods: rejection sampling, importance sampling and the Metropolis-Hastings algorithm.

We analyze our methods rigorously and prove that under plausible assumptions, our techniques are guaranteed to produce near-uniform samples from the search engine's index. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long or highly ranked documents. We use our algorithms to collect fresh interesting data about the relative sizes of Google and Yahoo!.

Joint work with Maxim Gurevich.
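The use of weights to turn biased samples into near-uniform ones can be illustrated with rejection sampling, one of the three Monte Carlo methods named in the abstract. The Python sketch below is a toy under stated assumptions: the document names and selection probabilities are hypothetical and fully known in advance, whereas in the talk's setting they would have to be obtained through the search engine's public interface.

```python
import random
from collections import Counter
from typing import Dict, List


def rejection_sample(probs: Dict[str, float], n: int, rng: random.Random) -> List[str]:
    """Draw n samples that are uniform over the documents, using a biased
    sampler whose per-document selection probability is known."""
    p_min = min(probs.values())
    names = list(probs)
    weights = [probs[d] for d in names]
    accepted: List[str] = []
    while len(accepted) < n:
        doc = rng.choices(names, weights=weights, k=1)[0]  # biased draw
        # Accept with probability inversely proportional to the bias.
        if rng.random() < p_min / probs[doc]:
            accepted.append(doc)
    return accepted


# Hypothetical biased sampler that strongly favors one document.
probs = {"long_doc": 0.7, "mid_doc": 0.2, "short_doc": 0.1}
samples = rejection_sample(probs, n=6000, rng=random.Random(0))
counts = Counter(samples)
```

Each document is accepted with overall probability proportional to its selection probability times the inverse of that probability, so the accepted samples are spread evenly across documents. The price is that frequently drawn documents are rejected most of the time.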

Ranking Systems: The PageRank Axioms

Alon Altman, Technion, Israel

This talk covers initial research on the foundations of ranking systems, a fundamental ingredient of web search and internet technologies. In order to understand the essence and the exact rationale of page ranking algorithms, we suggest the axiomatic approach taken in the formal theory of social choice. In this talk we deal with PageRank, the most famous page ranking algorithm. We present a set of simple (graph-theoretic, ordinal) axioms that are satisfied by PageRank, and show that any page ranking algorithm that satisfies them must coincide with PageRank. This is the first representation theorem of its kind, bridging the gap between page ranking algorithms and the mathematical theory of social choice.
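For readers unfamiliar with the algorithm being axiomatized, the following Python sketch computes PageRank scores by power iteration and derives the resulting ordering of pages. It uses the common damped formulation with a damping factor of 0.85, which is an assumption; the exact definition used in the talk may differ. The example graph is hypothetical, and every node must appear as a key in the link dictionary.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 100,
) -> Dict[str, float]:
    """Power iteration for PageRank on a graph given as node -> outgoing links."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            targets = outs if outs else nodes  # dangling node: spread evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
ordering = sorted(scores, key=scores.get, reverse=True)
```

The axioms in the abstract are ordinal, so the ordering of pages induced by the scores is the relevant output here, rather than the numeric values themselves.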

