Database Research at Watson
|
| Semi-structured and Unstructured Data Storage and Retrieval |
| With an increasing amount of data being captured with partial or no structure, efficient and meaningful
access to this data becomes paramount. We are investigating methods to store, index, and retrieve semi-structured XML
and WWW documents in large document repositories. |
 |
Autonomous Data Management in Fast Changing Environments
The highly dynamic, fast-paced business environment demands that business support software adapt at an equal or faster pace. Business applications in turn
pass this adaptation requirement on to data management middleware. It is increasingly acknowledged that the current practice of manually defining schemas is
not sustainable, for two reasons. First, the pace of schema change has increased dramatically in areas such as business process refinement and integration.
Second, real-time integration of data is not possible because manually evolving the schema takes too long. The project was motivated by two
separate scenarios, e-commerce catalogs and business performance management, both of which demand dynamic schema declaration, data population, and
efficient retrieval. In the former scenario, catalog content unforeseen at tool design time must be added and managed seamlessly by the catalog tool. In the latter
scenario, new business performance monitors are plugged in and out across the enterprise with no centralized control, yet the monitored performance data needs
to be archived for search and retrieval. Both call for data management tools with self-configuration and self-optimization features. We are developing a set
of data management tools that evolve along with the business requirements by embedding the logic of configuration and performance tuning. We have demonstrated
a data management layer in support of an e-commerce catalog that can accommodate a wide variety of heterogeneous catalog content. We have also been
developing sensor data management for a smart oil field early warning system. While most autonomous features today are embedded in database tuning, this
project targets the higher layer of autonomous support for applications.
Contributors: Yuan-Chi Chang, Ioana Stanoi
|
| |
 |
Query-Aware XML View Materialization
The current version of the Store2XML component of the XDM system provides the basic framework for generating XML documents from
relational databases. However, if an application wants to extract some data from these documents using a query language, the whole document
must be generated before it can be queried. In this project we consider the problem of rewriting an XML view into several custom
views in order to support clients that require various portions of the mapping-defined data. Each custom view extracts only the relevant data
from the underlying relational databases. View rewriting has the effect of reducing the amount of shipped data and, potentially, query
processing time at the client.
Contributors: George Mihaila, Oded Shmueli (Technion University)
|
| |
 |
Encoding XML for Faster Processing
The most important obstacle to efficient XML processing today is the compute-intensive parsing of XML documents in their textual
representation. This problem is further exacerbated in an e-business environment where XML messages need to be parsed multiple
times as they are processed by several subsystems (for example payment processing, inventory, warehouse management, and shipping).
To alleviate the XML parsing problem, one solution is to convert the XML messages from their textual form to a binary encoded form.
We are designing a binary format for XML targeted at efficient parsing and query processing. The objective is to allow fast, streaming
query processing over encoded XML messages.
Contributors: George Mihaila, Yi Chen (University of Pennsylvania)
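To make the idea of a binary-encoded form concrete, here is a minimal sketch in Python. It is not the format designed in this project: the token codes, the byte layout, and the function names (encode, events, text_of) are all assumptions chosen for illustration, and attributes, namespaces, and tail text are ignored. The point it illustrates is that the textual parse happens once, after which each subsystem can stream over length-prefixed tokens.

```python
import io
import struct
import xml.etree.ElementTree as ET

START, END, TEXT = 0, 1, 2  # hypothetical token codes, not a real standard


def encode(xml_text):
    """Parse textual XML once; return (tag name table, binary token stream)."""
    names = {}          # tag name -> small integer id
    out = bytearray()

    def walk(elem):
        tag_id = names.setdefault(elem.tag, len(names))
        out.extend(struct.pack(">BH", START, tag_id))
        if elem.text and elem.text.strip():
            data = elem.text.strip().encode("utf-8")
            out.extend(struct.pack(">BI", TEXT, len(data)))
            out.extend(data)
        for child in elem:
            walk(child)
        out.extend(struct.pack(">B", END))

    walk(ET.fromstring(xml_text))
    table = [name for name, _ in sorted(names.items(), key=lambda kv: kv[1])]
    return table, bytes(out)


def events(table, blob):
    """Stream (kind, value) events from the encoded form; no text parsing."""
    buf = io.BytesIO(blob)
    while True:
        head = buf.read(1)
        if not head:
            return
        kind = head[0]
        if kind == START:
            (tag_id,) = struct.unpack(">H", buf.read(2))
            yield ("start", table[tag_id])
        elif kind == TEXT:
            (length,) = struct.unpack(">I", buf.read(4))
            yield ("text", buf.read(length).decode("utf-8"))
        else:
            yield ("end", None)


def text_of(table, blob, tag):
    """Streaming query: yield the text inside every element named `tag`."""
    stack = []
    for kind, value in events(table, blob):
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif stack and stack[-1] == tag:
            yield value
```

In this sketch a message would be encoded once with encode(message) at the system boundary, and a downstream subsystem such as inventory could then call text_of(table, blob, "sku") without touching the textual form again.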
|
| |
 |
Memory and Bandwidth Adaptive Access Methods for Tagged Documents
An emerging trend in data exchange and management is to annotate data with metadata in
order to allow for more meaningful semantic interpretation. One example
is the increasing popularity of XML for data representation. For some applications, it
is crucial to efficiently filter through a collection of documents. To answer the need for efficient and portable filters, our project focuses
on developing effective structural filters with memory constraints. In other applications, it is more important
to quickly process queries over large documents. For existing structural indexes to be practical, they need
efficient incremental maintenance algorithms that keep the data access structures consistent with the
underlying data. We have developed the first such algorithms with provable guarantees on the
quality of the resulting structural index. Furthermore, we have considered the processing of XML documents in environments where the portability of
the index is a requirement, and developed a family of data access methods based on navigational aids that conform
to an a priori assigned memory size.
Contributors: Christian Lang, Ioana Stanoi
|
| |
 |
XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation
The Extensible Markup Language (XML) is gaining widespread
use as a format for data exchange and storage on the World
Wide Web. Queries over XML data require accurate
selectivity estimation of path expressions to optimize query
execution plans. Selectivity estimation of XML path
expressions is usually based on summary statistics
about the structure of the underlying XML repository. All
previous methods require an off-line scan of the XML
repository to collect the statistics. In this project, we
propose XPathLearner, a method for estimating selectivity
of the most commonly used types of path expressions without
looking at the XML data. XPathLearner gathers and refines the
statistics using query feedback in an on-line manner and
is especially suited to queries in Internet-scale
applications, where the underlying XML repository is
either inaccessible or too large to be scanned
in its entirety. Besides the on-line property, our method has
two other novel features: (a) XPathLearner is workload-aware
in collecting the statistics and thus can be
more accurate than the more costly off-line method under
tight memory constraints, and (b) XPathLearner automatically
adjusts the statistics using query feedback when the
underlying XML data change.
Contributors: Min Wang, Lipyeow Lim (Duke University), Sriram Padmanabhan (IBM Silicon Valley Labs),
Jeffrey Scott Vitter (Purdue University), and Ronald Parr (Duke University)
Publications
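As a rough illustration of learning path selectivity from query feedback alone, the following Python sketch keeps a first-order Markov table of tag and tag-pair counts and corrects it each time the true result size of a path query is observed. It is a simplification under stated assumptions, not the XPathLearner algorithm itself: the class name, the default count of 1.0 for unseen entries, and the multiplicative update rule are all hypothetical, and the memory-bounded, workload-aware pruning described above is omitted.

```python
class MarkovPathEstimator:
    """Toy first-order Markov estimator for simple paths such as a/b/c."""

    def __init__(self, rate=1.0):
        self.single = {}   # tag -> estimated number of elements with that tag
        self.pair = {}     # (parent, child) -> estimated number of such edges
        self.rate = rate   # 1.0 corrects the full error on each feedback

    def estimate(self, path):
        """Estimate f(t1/.../tn) as f(t1/t2) * prod f(ti/ti+1) / f(ti)."""
        tags = path.strip("/").split("/")
        if len(tags) == 1:
            return self.single.get(tags[0], 1.0)
        est = self.pair.get((tags[0], tags[1]), 1.0)
        for parent, child in zip(tags[1:], tags[2:]):
            est *= self.pair.get((parent, child), 1.0) / self.single.get(parent, 1.0)
        return est

    def feedback(self, path, true_count):
        """Refine the statistics with the observed result size of a query."""
        tags = path.strip("/").split("/")
        observed = max(float(true_count), 1e-9)   # keep counts strictly positive
        if len(tags) == 1:
            self.single[tags[0]] = observed
            return
        pairs = list(zip(tags, tags[1:]))
        ratio = observed / self.estimate(path)
        # Spread the correction evenly over the pair counts used by the path.
        factor = ratio ** (self.rate / len(pairs))
        for key in pairs:
            self.pair[key] = self.pair.get(key, 1.0) * factor
```

The estimator never reads the XML repository. Every number it holds comes from feedback calls, so when the underlying data change, later feedback on the same paths pulls the stored counts toward the new values.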
|
| |
 |
Dynamic Maintenance of Web Indexes Using Landmarks
Recent work on incremental crawling has enabled a search engine to keep its indexed document collection
more synchronized with the changing World Wide Web. However, the information in this synchronized
collection is not immediately searchable, since the keyword index is rebuilt from scratch
less frequently than the collection can be refreshed. An inverted index is usually used
to index documents crawled from the web. Complete index rebuild at high frequency is expensive.
Previous work on incremental inverted index updates has been restricted to adding and
removing documents. Updating the inverted index for previously indexed documents that have changed
has not been addressed.
In this project, we propose an efficient method to update the inverted index for previously indexed
documents whose contents have changed. Our method uses the idea of landmarks together
with the diff algorithm to significantly reduce the number of postings in the inverted
index that need to be updated. Our experiments verify that
our landmark-diff method results in significant savings in the number of update operations on the
inverted index.
Contributors: Min Wang, Lipyeow Lim (Duke University), Sriram Padmanabhan (IBM Silicon Valley Labs),
Jeffrey Scott Vitter (Purdue University), and Ramesh Agarwal
Publications
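The landmark idea can be sketched as follows, under the assumption that a posting stores a word's position as an offset from a landmark rather than as an absolute position in the document. The fixed-size blocks, the function names, and the rule that changed words join the preceding block are illustrative choices, not details taken from the project. Python's difflib stands in for the diff algorithm.

```python
import difflib


def assign_landmarks(words, block=3):
    """Tag each word position with the id of the landmark opening its block."""
    return [i // block for i in range(len(words))]


def carry_landmarks(old_words, old_marks, new_words):
    """Use a diff to carry landmark ids from the old version to the new one."""
    new_marks = []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            new_marks.extend(old_marks[i1:i2])
        elif tag in ("insert", "replace"):
            # Changed words join the block of the word before them.
            owner = new_marks[-1] if new_marks else 0
            new_marks.extend([owner] * (j2 - j1))
        # "delete" contributes no words to the new version.
    return new_marks


def landmark_postings(words, marks):
    """Return {(word, landmark id, offset from that landmark)}."""
    result, start = set(), {}
    for pos, (word, mark) in enumerate(zip(words, marks)):
        start.setdefault(mark, pos)   # first position seen for this landmark
        result.add((word, mark, pos - start[mark]))
    return result


def absolute_postings(words):
    """Return {(word, absolute position)}, the conventional scheme."""
    return {(word, pos) for pos, word in enumerate(words)}


old = "the quick brown fox jumps over the lazy dog".split()
new = "the very quick brown fox jumps over the lazy dog".split()

old_marks = assign_landmarks(old)
new_marks = carry_landmarks(old, old_marks, new)

# Postings that must be deleted or inserted under each scheme.
landmark_changes = landmark_postings(old, old_marks) ^ landmark_postings(new, new_marks)
absolute_changes = absolute_postings(old) ^ absolute_postings(new)
print(len(landmark_changes), len(absolute_changes))
```

With absolute positions, inserting one word shifts every later posting in the document. With landmark-relative positions, only postings in the edited block change; the blocks after it are handled by updating their landmarks' positions in a small per-document table, which this sketch does not model.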