Web Mining
The World Wide Web may be viewed as a vast repository of unstructured and
heterogeneous information. The challenge is to extract non-trivial, interesting,
and useful information from the Web and to organize it in a form that can
be used to answer specific queries or to synthesize new knowledge. This
process of extraction requires a host of tools that can perform various
tasks at different levels. At the lowest level, we would need tools to
perform HTML parsing, part-of-speech tagging, and other types of annotation.
At the intermediate level, we might need tools that can automatically identify
specific types of entities (such as important events, people, competitive
products and organizations) and the attributes associated with them. At
a higher level, we might need miners capable of synthesizing the information
required to solve a complex problem. At IRL, we are developing tools to
mine Web pages and synthesize useful knowledge from them. Some examples
are given below.
- The List and Table Miner identifies lists and tables in a page and extracts
  items from them.
- The Link Miner finds all pages that refer to a given page.
- The Email, Phone and Zip Miner extracts phone numbers, e-mail addresses,
  and zip codes from a given document.
- The Noun Phrase Miner extracts noun phrases from tagged phrases.
- The Spam Miner identifies pages with spam text.
- The Co-occurrence Miner computes the association (correlation) between
  important phrases.
- The Duplicate Detector detects duplicate and near-duplicate text documents
  in the corpus.
- The Opinion Miner extracts opinions expressed on the Web about competitive
  products.
- The Event Miner extracts events related to a given query along with the
  dates, people, locations, and organizations associated with them.
Most of the miners mentioned above are being used in IBM's Project WF.
The goal of this project is to provide a new, easily programmable interface
to the entire Web.
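
To make the idea of a low-level miner concrete, the following is a minimal sketch of how a pattern-based miner such as the Email, Phone and Zip Miner listed above might work. It is not the IRL implementation: the function name mine_contacts and the three regular expressions are assumptions made for illustration, and the phone and zip patterns cover only simple US-style formats.

```python
import re

# Assumed, deliberately simplified patterns (US-style phone numbers and
# zip codes). They are illustrative only, not the patterns used by IRL.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")
ZIP_RE = re.compile(r"(?<![\d-])\d{5}(?:-\d{4})?(?![\d-])")


def mine_contacts(text):
    """Return the e-mail addresses, phone numbers, and zip codes found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "zips": ZIP_RE.findall(text),
    }


# Hypothetical input document.
sample = "Contact sales@example.com or call (914) 555-0123. Mail: Anytown, NY 10504."
result = mine_contacts(sample)
print(result)
```

A production miner would need broader patterns (international phone formats, obfuscated addresses) and some validation of each match against its context; the sketch only shows the basic extract-by-pattern step.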
Another application that IRL is developing is a website content monitoring
tool that can track the changes that occur at an on-line store. Here, the
aim is to improve customer satisfaction by identifying and reporting incorrect
or missing data on the website. For example, the policy of the on-line
store might be to list a finance price for all products that cost more
than $1,000, but the finance price may be missing on the Web page. The
tool also collects information about product prices and other attributes
that can be used by marketing managers. For example, the tool can answer
queries such as: "List all URLs on which products in the Laptop category
are listed at less than $800." The solution consists of a crawler, a miner,
and a reporter, which can be configured via a Web-based graphical user
interface. The miner identifies and indexes differences between the newly
crawled pages and the previously crawled pages (including dynamic pages),
as well as other information of interest.
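
The checks described above can be sketched in a few lines, assuming the miner has already turned each crawled page into a structured record. Everything named here is hypothetical: the Listing record, its fields, the three helper functions, and the example URLs are illustrations, not part of the IRL tool. Only the $1,000 policy threshold and the Laptop query come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Listing:
    """Hypothetical record mined from one product page."""
    url: str
    category: str
    price: float
    finance_price: Optional[float] = None  # None: no finance price on the page


FINANCE_THRESHOLD = 1000.0  # example policy from the text


def missing_finance_price(listings):
    """URLs that violate the policy: price above threshold, no finance price."""
    return [item.url for item in listings
            if item.price > FINANCE_THRESHOLD and item.finance_price is None]


def urls_below(listings, category, max_price):
    """URLs on which products of the given category cost less than max_price."""
    return [item.url for item in listings
            if item.category == category and item.price < max_price]


def price_changes(previous, current):
    """Map URL -> (old price, new price) for prices that changed between crawls."""
    old = {item.url: item.price for item in previous}
    return {item.url: (old[item.url], item.price)
            for item in current
            if item.url in old and old[item.url] != item.price}


# Two hypothetical crawls of the same store.
previous = [
    Listing("https://shop.example.com/p/1", "Laptop", 849.0),
    Listing("https://shop.example.com/p/2", "Laptop", 1299.0, 54.0),
]
current = [
    Listing("https://shop.example.com/p/1", "Laptop", 799.0),
    Listing("https://shop.example.com/p/2", "Laptop", 1299.0),
    Listing("https://shop.example.com/p/3", "Desktop", 650.0),
]

print(missing_finance_price(current))
print(urls_below(current, "Laptop", 800))
print(price_changes(previous, current))
```

Keeping the mined records separate from the checks means new policies or marketing queries can be added as small functions without touching the crawler or the extraction step.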
Information Organization and Retrieval