mineXML
As industrial-strength XML databases emerge, the amount of
available XML information is growing tremendously. An interactive
system is needed that can automatically analyze XML data and find
meaningful, useful information within particular application
domains.
Goal
The mineXML project is designed
to extract high-level information
from repositories of Web Service descriptions
(XML documents) for the purpose of establishing
associations, building classifications, and
extracting meaningful information of various
types (e.g., business related, configuration
related).
The following are some of the techniques and issues
we are investigating:
- XML Schema Integration -- When the schema assumed by an XQuery query
differs from the schema of the XML documents it is querying,
or the XML documents have different schemas,
some automatic action needs to be taken to
migrate the schemas or to decide that the schemas
are not migratable.
Example: how do we know when different-seeming items
in different schemas actually correspond,
such as "state" in the US and "province"
in Canada?
- Mining
  - Association -- Find interesting association or correlation
  relationships among a large set of data items;
  the rules so mined can be used for cross-market
  analysis, correlation analysis, and so on.
  Example: which services are customers likely to use
  together with service A?
  - Classification -- Extract models that describe important data
  classes and predict categorical labels; the
  model so constructed is used to classify
  future data and to develop a better understanding
  of the data.
  Example: which services do customers with characteristic
  A tend to use?
- Tooling -- Visualize, and navigate through, the
extracted views of the services.
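
The association example above (which services are used together with
service A) can be illustrated with a minimal sketch. It is not the
mineXML implementation: the function name pair_rules, the thresholds,
and the usage data are all hypothetical, and it only considers rules
between pairs of services, scored by the standard support and
confidence measures.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for service pairs.

    support    = fraction of transactions containing both services
    confidence = count(both) / count(antecedent)
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for services in transactions:
        unique = sorted(set(services))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be interesting
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Highest-confidence rules first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical usage data: each set lists the services one customer used.
usage = [
    {"A", "billing", "shipping"},
    {"A", "billing"},
    {"A", "shipping"},
    {"billing", "catalog"},
]
rules = pair_rules(usage, min_support=0.5, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Filtering the resulting rules for those whose antecedent is "A" answers
the example question directly. In practice the transactions would be
derived from the Web Service description repositories rather than
written by hand, and larger itemsets would call for an algorithm such
as Apriori.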
Collaboration
This project is a joint effort by Cindy Chen, Judah
Diament, George Mihaila, Haixun Wang and Isabelle Rouvellou, at the IBM
T. J. Watson Research Center, Hawthorne, New York, USA.
For further information on
mineXML, please contact Cindy Chen.