IBM Haifa Labs Leadership Seminars

Search and Collaboration Seminar 2004

Information Extraction: Current State of the Art (Abstract)
Ronen Feldman, Chairman, Clearforest Corp.

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, while the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few keystrokes. We will contrast the two major approaches to Information Extraction (IE): the rule-based approach and the machine learning approach. In particular, we will focus on comparing IE rule languages and the various machine learning approaches. We will discuss the suitability of the two approaches to the tasks of Named Entity Recognition, Fact Extraction, and Relationship Extraction. We will provide an experimental evaluation of the two approaches and discuss ways to build a third, hybrid approach that can benefit from the strengths of both. We will demonstrate several real-world applications of Information Extraction.
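As a minimal illustration of what the rule-based approach looks like in practice, the sketch below applies one hand-written pattern to extract a person, title, and company from a sentence. The pattern, the `extract_facts` function, and the sample sentence are hypothetical; they are not taken from the talk or from any particular IE rule language.

```python
import re

# Hypothetical hand-written rule for the pattern "<Person>, <title> of <Company>".
# Real IE rule languages are far richer; this only illustrates the idea.
EMPLOYMENT_RULE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "            # two capitalized words
    r"(?P<title>[A-Za-z ]+?) of "                       # shortest run of words before " of "
    r"(?P<company>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"  # one or more capitalized words
)


def extract_facts(text):
    """Return (person, title, company) triples matched by the rule."""
    return [
        (m.group("person"), m.group("title"), m.group("company"))
        for m in EMPLOYMENT_RULE.finditer(text)
    ]


sentence = "Jane Doe, chief executive of Acme Widgets, spoke on Monday."
facts = extract_facts(sentence)
# facts == [("Jane Doe", "chief executive", "Acme Widgets")]
```

A rule like this is precise but brittle: any phrasing the author did not anticipate is missed, which is the usual motivation for learned or hybrid extractors.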

Rapid Deployment of High-quality Search: Removing the Evaluation Bottleneck (Abstract)
Ronny Lempel, IBM Haifa Research Lab

The evaluation of information retrieval (IR) systems is the process of assessing how well a system meets the information needs of its users. IR research has a well-established tradition of comparing the relative effectiveness of different retrieval approaches. These techniques, however, require human judgments of search quality and thus often result in an evaluation bottleneck. In this talk, we will present a new evaluation method that measures an IR system's quality by examining the content of the retrieved results rather than by looking for pre-specified relevant pages. The proposed method does not involve any document relevance judgments, and as such is not adversely affected by changes to the underlying collection. Consequently, it can scale to very large, dynamic collections such as the Web, and can evaluate a system's effectiveness on updatable "live" collections as well as collections derived from different data sources. We show that the new method is highly correlated with traditional IR measures and can thus be used for rapid deployment of high-quality search.
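Agreement between a new evaluation measure and a traditional one is commonly checked by scoring the same set of systems under both measures and computing a rank correlation. The sketch below computes Kendall's tau for that purpose. The choice of statistic, the function name, and the scores are illustrative assumptions; the abstract does not say which correlation was used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least two systems")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for four hypothetical systems under two measures.
traditional = [0.30, 0.25, 0.40, 0.10]
judgment_free = [0.60, 0.50, 0.70, 0.55]
tau = kendall_tau(traditional, judgment_free)  # 5 concordant, 1 discordant pair
```

A value near 1 means the two measures order the systems almost identically, which is what matters when the goal is to choose between systems.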

Online Learning by Projecting (Abstract)
Yoram Singer, Hebrew University

A unified view of online classification, regression, and uniclass problems is presented. This view leads to a single algorithmic framework for the three problems via projections. The new algorithms share similar loss bounds, which will be discussed briefly. The proposed framework is also extended to more complex decision tasks, in particular ranking problems, which will be described in more detail. To conclude, a few prototype systems will be demonstrated.
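One well-known kind of projection-based online update moves the weight vector to the Euclidean projection of its previous value onto the set of vectors that classify the current example with margin at least 1. The sketch below shows that update for binary classification only. It is offered as an illustration of the projection idea, on the assumption that it is representative; the abstract does not give the talk's exact algorithms, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def projection_update(w, x, y):
    """Project w onto the half-space {v : y * <v, x> >= 1}, with y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on the current example
    norm_sq = dot(x, x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)  # constraint already satisfied: leave w unchanged
    step = loss / norm_sq  # smallest step that reaches the half-space boundary
    return [wi + step * y * xi for wi, xi in zip(w, x)]


def train(examples, dim):
    """Run one online pass; return the final weights and the mistake count."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in examples:
        if y * dot(w, x) <= 0:
            mistakes += 1  # prediction made before the update was wrong
        w = projection_update(w, x, y)
    return w, mistakes


stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.0], 1)]
weights, mistakes = train(stream, dim=2)
# weights == [1.0, -1.0], mistakes == 2
```

After each update the current example is classified with margin exactly 1 whenever it was previously violated, which is the defining property of the projection.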

Web-a-Where: Geotagging Web Content (Abstract)
Nadav Har'El and Ron Sivan, IBM Haifa Research Lab

Web-a-Where is a system for associating geography with Web pages. Web-a-Where locates mentions of places and determines the place each name refers to. In addition, it assigns to each page a geographic focus: a locality that the page discusses as a whole. The tagging process is simple and fast, designed to be applied to large collections of Web pages and to facilitate a variety of location-based applications and data analyses.

Geotagging involves arbitrating two types of ambiguities: geo/non-geo and geo/geo. A geo/non-geo ambiguity occurs when a place name also has a non-geographic meaning, such as a person name (e.g., Washington) or a common word (e.g., Turkey or Java). A geo/geo ambiguity arises when distinct places have the same name, as in London, England vs. London, Ontario.
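The two kinds of ambiguity can be made concrete with a toy resolver. The gazetteer, the prior weights, the context cues, and the `resolve` function below are all hypothetical and deliberately simplistic; they illustrate the problem and are not the heuristics used by Web-a-Where.

```python
# Hypothetical toy gazetteer: surface name -> candidate places.
# Each candidate is (canonical name, context cue, prior weight).
# The prior weights are made-up illustrative numbers, not real statistics.
GAZETTEER = {
    "London": [
        ("London, England", "England", 0.9),
        ("London, Ontario", "Ontario", 0.1),
    ],
    "Turkey": [("Turkey (country)", "Ankara", 1.0)],
}

# Hypothetical list of names that usually carry a non-geographic meaning.
USUALLY_NOT_PLACES = {"Turkey", "Java", "Washington"}


def resolve(name, page_text):
    """Return the canonical place for a name mention, or None if judged non-geographic."""
    candidates = GAZETTEER.get(name, [])
    if not candidates:
        return None
    # geo/geo: prefer a candidate whose context cue also appears on the page.
    supported = [c for c in candidates if c[1] in page_text]
    if supported:
        return max(supported, key=lambda c: c[2])[0]
    # geo/non-geo: without supporting context, treat risky names as non-geographic.
    if name in USUALLY_NOT_PLACES:
        return None
    # Otherwise fall back to the candidate with the highest prior.
    return max(candidates, key=lambda c: c[2])[0]


print(resolve("London", "London is home to the University of Western Ontario."))
print(resolve("Turkey", "Turkey recipes for the holidays"))
```

The first call resolves to the Ontario reading because the page supplies a supporting cue; the second returns `None` because nothing on the page suggests a place.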

An implementation of the tagger within the framework of the WebFountain data mining system is described and evaluated on several corpora of real Web pages. Precision of up to 82% on individual geotags is achieved. We also evaluate the relative contribution of the various heuristics the tagger employs, and evaluate the focus-finding algorithm using a corpus pretagged with localities, showing that as many as 91% of the page foci reported are correct up to the country level.
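To show what a page focus "correct up to the country level" might mean operationally, the sketch below picks the most specific locality that a sufficient share of a page's geotags fall under. The voting scheme, the `min_share` threshold, and the `page_focus` function are assumptions made for illustration; the abstract does not describe the actual focus-finding algorithm.

```python
from collections import Counter


def page_focus(geotags, min_share=0.5):
    """Pick the most specific locality covering at least min_share of the geotags.

    Each geotag is a tuple ordered from broad to specific,
    e.g. ("Canada", "Ontario", "London"). Returns a tuple prefix or None.
    """
    if not geotags:
        return None
    counts = Counter()
    for tag in geotags:
        # A tag votes for itself and for every enclosing locality.
        for depth in range(1, len(tag) + 1):
            counts[tag[:depth]] += 1
    total = len(geotags)
    qualifying = [node for node, c in counts.items() if c / total >= min_share]
    if not qualifying:
        return None
    # Prefer deeper (more specific) nodes, then better-supported ones.
    return max(qualifying, key=lambda node: (len(node), counts[node]))


tags = [
    ("Canada", "Ontario", "London"),
    ("Canada", "Ontario", "Toronto"),
    ("Canada", "Quebec", "Montreal"),
    ("France",),
]
focus = page_focus(tags)  # ("Canada", "Ontario"): two of the four tags fall under it
```

Under this scheme a reported focus of `("Canada", "Ontario")` would count as correct at the country level for any page whose true locality lies in Canada.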

Parameterized Generation of Labeled Datasets for Text Categorization (Abstract)
Evgeniy Gabrilovich, Technion

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and serve as efficient heuristics for generating datasets subject to user requirements.
We also propose a new similarity metric for sets of documents based on the ease of separating them with a text classifier. Since texts acquired from the WWW are often plagued with noise and are generally quite different in nature from the formal written English found in printed publications, we report specific steps we undertook to filter the data and monitor its quality during acquisition. A large collection of automatically generated datasets is made available for other researchers to use; the software system that performs parameterized dataset acquisition will also be released later.
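The idea of measuring how alike two document sets are by how easily a classifier separates them can be sketched as follows. The nearest-centroid classifier, the leave-one-out protocol, and all function names are assumptions chosen for brevity; the abstract does not specify which classifier or protocol the authors used.

```python
from collections import Counter
from math import sqrt


def bag(text):
    """Bag-of-words representation: lowercase whitespace tokens with counts."""
    return Counter(text.lower().split())


def centroid(bags):
    """Average term vector of a list of bags (empty dict for an empty list)."""
    total = Counter()
    for b in bags:
        total.update(b)
    return {word: count / len(bags) for word, count in total.items()}


def cosine(a, b):
    num = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0


def separability(docs_a, docs_b):
    """Leave-one-out accuracy of a nearest-centroid classifier on the two sets.

    Values near 1.0 mean the sets are easy to tell apart (dissimilar);
    values near 0.5 mean they are hard to tell apart (similar).
    """
    labeled = [(bag(d), 0) for d in docs_a] + [(bag(d), 1) for d in docs_b]
    correct = 0
    for i, (doc, label) in enumerate(labeled):
        rest = labeled[:i] + labeled[i + 1:]
        cents = [centroid([b for b, l in rest if l == k]) for k in (0, 1)]
        predicted = 0 if cosine(doc, cents[0]) >= cosine(doc, cents[1]) else 1
        correct += predicted == label
    return correct / len(labeled)


sports = ["goal match team", "team score goal", "match team win"]
finance = ["stock market price", "market share price", "stock price fall"]
score = separability(sports, finance)  # the two sets share no words
```

A similarity value can then be derived from the score, for example by taking one minus the accuracy, so that hard-to-separate sets count as similar.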

IBM Search Technologies - the Haifa Perspective (Abstract)
Aya Soffer, Manager, Information Retrieval Group, IBM Haifa Research Lab

Search is becoming increasingly crucial for efficient and effective use of knowledge within the enterprise. In this talk we will describe IBM's offerings in the enterprise search market, highlighting the contributions of the Haifa Research Lab. We will outline the difficulties of search in the enterprise in comparison to Web search and describe two search solutions in which we are involved. Juru, a highly efficient, customizable search solution written entirely in Java, was fully developed in Haifa. Juru is the search engine of IBM's Lotus products, including WebSphere Portal and Lotus Workplace. We will additionally describe our contributions to IBM's internal enterprise search solution, which powers IBM's intranet search.

Keynote: Challenges in Running a Commercial Search Engine (Abstract)
Amit Singhal, Principal Scientist, Google

Running a Web search engine that searches billions of pages is a challenge in itself. The need to serve over 200 million queries a day makes it much harder. Not only do we have to address the engineering challenges, we also have to deal with the fact that many sites get a large proportion of their traffic from Google, so there is a huge incentive for sites to rank well in Google. This forces us to live in a world of "adversarial information retrieval" like we have never seen before. In this talk, I'll point out the challenges we face every day in running a high-quality service.

We Can See You: A Study of Communities' Invisible People through ReachOut (Abstract)
Michal Jacovi, Collaboration Technologies Group, IBM Haifa Research Lab

Virtual communities are a great tool, both at home and in the workplace. They help in finding new friends and solving complicated problems by creating a virtual family or a giant group-mind. However, building a virtual community is not a trivial task. Many problems need to be addressed for a new community to be successful. While many of these problems are features of the medium, participants themselves are still the major part of the equation. Understanding the behavioral patterns of virtual community members is crucial for attracting participants and facilitating active participation. In this paper, we describe our findings from analyzing more than a year of activities of a workplace community. Our community used ReachOut, a tool developed in our group to support semi-persistent collaboration and community building. Throughout the year, all users’ activities were logged, providing us with very detailed information. Not only do we know of people’s postings to the community, but we can also track lurking behavior that is usually hidden. This allows us to check several hypotheses about non-active participants’ behavior and propose some directions to increase active participation in virtual communities.

Mobilizing Communities, From Mobile Instant Messaging to Total Communication (Abstract)
Yuval Neria, Director Business Development, Comverse Instant Communication Division

During the past three years, mobile Instant Messaging (IM) services started creeping into the market but, despite their promise, none reached the expected levels of success. Various reasons have been put forward for the slow uptake, including the unappealing user experience offered through the use of SMS or WAP as transports.

Most of the wireless IM services focused solely on local communities instead of leveraging the existing and popular Web-based communities for immediate service adoption.

Now, all the signs point to the time being right for Mobile IM to take off as a lucrative business opportunity for mobile operators. There are a number of good reasons why this is happening:

  • The fundamental characteristics of IM are ideally suited to the wireless environment, liberating IM users from the desktop. IM emulates the rhythm of human conversation in text and uses short messages, replicating the familiar and popular PC-based IM and phone-based SMS experience.
  • SMS growth, meanwhile, is flattening out, and operators are looking to the next versions of messaging to further drive the market.
  • Next-generation handsets with rich graphics and an intuitive user interface open the environment for uploading applications and make the user experience much easier and more pleasant.
  • Most importantly, Web portals want to participate in the mobile messaging hype, applying a range of revenue models with telco operators. Accordingly, the largest operators in the world are launching the next generation of Mobile IM solutions, incorporating the lessons of the past.

People want better ways of keeping in touch. Sophisticated mobile instant messaging is more than just an extension of the already popular and widely-used desktop instant messaging. It is more than just SMS with buddy lists. Mobile IM takes messaging to the next level in terms of convenience, speed, and power.

The prime market for Mobile IM covers consumers, teens, and career-driven young professionals who use communication to coordinate with groups of contacts and friends; they like the status of using new technology and the immediacy it affords them.

In the enterprise environment, workers will be able to collaborate with each other more effectively than ever before. They will be able to tell which co-workers are available when and where and how best to reach them.

Mobile IM is the foundation for an advanced yet intuitive communication platform. Operators will be able to convert the IM contact list into a presence-enabled launch pad for rich multimedia services such as push to talk, picture messaging, direct calls, instant conferencing, and much more.

With proper solutions, Mobile IM will evolve into a Presence-based contact list that can be used as a launch pad for Value Added Services, eventually leading to Total Communication - a concept in which both consumers and corporate users are free to communicate in the way that is most appropriate and convenient for them.

QSIA - do online educators compete or collaborate? (Abstract)
Prof. Sheizaf Rafaeli, Center for the Study of the Information Society and the Graduate School of Business, University of Haifa

The ideology of the internet has often mentioned equality, participation, and democracy. The reality of internet-based environments has turned out to be, more often than not, unequal, lopsided, and governed by power laws rather than symmetry.

Is asymmetry a natural quality of interaction in human groups? Is the internet's promise of sharing information an illusion, a potential forever crippled by power laws?

QSIA is a Web-based and distributed system that serves as an environment for learning, assessing and knowledge sharing. QSIA - Questions Sharing and Interactive Assignments - offers a unified infrastructure for developing, collecting, managing and sharing of knowledge items. QSIA enhances collaboration in authoring via online recommendations and generates communities of teachers and learners. At the same time, QSIA fosters individual learning and might promote high-order thinking skills among its users.

QSIA is described in detail on the QSIA web site and in the following publications:
  • Rafaeli, S., Barak, M., Dan-Gur, Y. and Toch, E. (2004, in press) QSIA - A Web-based environment for learning, assessing and knowledge sharing in communities, Computers and Education
  • Barak, M. and Rafaeli, S. (2004, in press) Online Question-Posing and Peer-Assessment as Means for Web-based Knowledge Sharing in Learning, International Journal of Human-Computer Studies
  • Rafaeli, S., Dan-Gur, Y. and Barak, M. (2004, in press) Finding friends among recommenders: Social and "Black-Box" recommender systems, Journal of Distance Education Technologies, Vol. 2, No. 4, Special Issue on Knowledge Management Technologies for E-learning: Exploiting knowledge flows and knowledge networks for learning, October 2004

The organizational and procedural notion represented by QSIA calls for fewer boundaries and more boundary-spanning in higher education. Higher education has been urged to evolve from "Sage on the Stage" to "Guide on the Side" modes. Is there room for a "Colleague in the League" metaphor too? Sharing makes sense for a variety of pedagogical, ontological, and economic reasons. Can information systems make such sharing possible? What are the behavioral and structural impediments to such sharing? Does information lend itself to "being shared"? Can internet-based systems overcome these barriers?

WatchMe: Mobile Communication and Awareness Between Members of a Closely-knit Group (Abstract)
Natalia Marmasse, Speech Interface Group, MIT Media Lab

Communication between people who share the same physical space can be very rich. It is an interactive process, a collaborative act made up of both verbal and non-verbal cues. It is much more than simply a direct transfer of information. Today's telecommunication has several limitations. The telephone enables communication at a distance, in shared time, but without the richness we have when co-located. Its focus has been on verbal communication, restricting non-verbal expression. Current communication devices lack back-channels that help us maintain awareness of those with whom we share communication space; they afford no way of inferring a person's situation before communication is initiated; and they do not create opportunities for additional communication.

WatchMe, embodied in a wristwatch, addresses mobile communication and awareness between people in a closely-knit group, e.g., family and friends. It aims to enhance telecommunication between them by providing relevant telepresence and a mobile platform that facilitates various channels of verbal and non-verbal communication.
