A Survey of Search Engines for XML Documents

Robert Luk, Alvin Chan, Tharam Dillon (FIEEE) and H.V. Leong

Department of Computing, Hong Kong Polytechnic University

E-mail: {csrluk, cstschan, csdillon, cshleong}@comp.polyu.edu.hk


We survey several searching and indexing techniques for XML documents. We identify three broad approaches: the database-oriented approach, the information retrieval-oriented approach and the hybrid approach. For each approach we discuss at least one related commercial or academic software system as an example to motivate our classification. Relevant issues in indexing and searching XML documents are also discussed.

1. Introduction

An Internet search engine (e.g. AltaVista or Infoseek) returns thousands of matched documents for a single query, some of which are relevant and others irrelevant to the query. Human users typically have problems digesting such vast quantities of information, much of which is irrelevant. XML holds the promise that searching can be done more precisely, because structural, self-describing and meta (e.g. RDF) data are available to allow for context-based and/or category-based search. XML also holds the promise of modelling heterogeneous data, generated from databases or from word processors, thereby enabling search engines to locate and process heterogeneous documents or records.

Navarro and Baeza-Yates [1] provide a survey of the academic work on indexing and searching structured documents, together with a query language for structured documents. Some of these techniques could be applied to index and search XML documents, but they are implemented as research prototypes. Here, we focus on reviewing commercially available search engines for XML documents. We classify them into three general categories adopting different approaches: database-oriented, information retrieval-oriented and hybrid. Each approach is discussed in turn, but a comprehensive survey is not carried out.

2. Database-oriented Approach

XML can be used to represent data in databases. For instance, the W3C provides an example [2] showing how data from a relational database can be represented using XML. The primary goal is to enable the transfer of data in a standard format, as in EDI, across different systems on different platforms to improve system interoperability and integration. We are observing the contents of more and more databases being published on the Internet, first in HTML and now in XML. In this way, a database engine can read data from the database to provide search facilities (e.g. SQL) and format the results as dynamic web pages. However, this approach requires the XML documents to conform to a pre-defined DTD, with nesting levels limited to 4. In addition, at present, DTDs have very poor data and referential integrity constraints, which may cause problems when XML documents are imported into database software. To mitigate this problem, the W3C has drafted proposals for Document Content Description (DCD) [3] and XML Schema [4] to handle more stringent data constraints and user-defined data types.
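
To make the idea concrete, the following Python sketch (not taken from any of the cited products; the table name, column names and rows are invented) shows the kind of translation involved: rows from a relational result set are wrapped in elements named after the table and its columns.

# Minimal sketch: wrapping relational rows in XML elements.
# The table/column names and rows are hypothetical examples.
import xml.etree.ElementTree as ET

def rows_to_xml(table, columns, rows):
    root = ET.Element(table)
    for row in rows:
        record = ET.SubElement(root, "record")
        for name, value in zip(columns, row):
            field = ET.SubElement(record, name)
            field.text = str(value)
    return ET.tostring(root, encoding="unicode")

print(rows_to_xml("book", ["title", "author", "publisher"],
                  [("TCP/IP Illustrated", "Stevens", "Addison-Wesley")]))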

An estimated 75% of the data in Web pages are derived from databases [5]. Instead of converting these data into HTML format, which is display-friendly, it has been advocated to convert them into XML documents, which are rendered using XSL stylesheets. In this way, the relationships in the database can be self-described or retained in the XML descriptions. Converters like DB2XML [5] are available to perform this kind of translation. Importing these data back into a database system should cause few problems. Information extraction from XML documents may also be achieved by storing them as semi-structured data rather than as pure database records, while still supporting certain database-related operations [6]. The Stanford TSIMMIS project demonstrates one such example [7]. To clearly separate the translation and data access requirements in the context of the mobile web, the IBM DataX project adopts a middleware solution [8]. A proxy server interacts with the master database server to convert database information into appropriate XML pages in client mirrors, whose contents are synchronized with the actual client-cached XML pages upon connection from a mobile client. Data filtering and transformation are required to reduce the sheer size of the XML pages generated. Post-translation steps are thus required, and database independence is possible.

Typically, XML documents are generated from different databases with different internal structures. While converting database result sets into XML is relatively straightforward, the reverse procedure is more difficult and requires special handling. One major reason is that different databases have different entity-relationship structures, which cannot be handled by a conventional database system using a single entity-relationship structure. Therefore, a meta-database system is required to ensure system interoperability. A single SQL query posted to the meta-database system can be passed on to a set of databases. If a database structure does not qualify for the particular SQL query, no result is generated; if it does qualify, some results are returned. Techniques for merging results from multiple databases are therefore required. Alternatively, techniques for finding the desired databases are also needed. At present, there are few products with such functions. Research results in server ranking [9] can be employed to select the appropriate databases.
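
As an illustration of the meta-database idea, the sketch below fans a single query out to several back-end sources and merges whatever result sets come back. The back-end callables and the de-duplication rule are assumptions made purely for illustration; a real meta-database engine would wrap actual database connections and apply a proper merging strategy.

# Sketch of a meta-database layer: fan one query out to several
# back ends and merge the results. The back-end callables are
# hypothetical; a real system would wrap database connections.
def meta_query(query, sources):
    merged, seen = [], set()
    for source in sources:
        try:
            results = source(query)      # may return [] if the schema
        except Exception:                # does not qualify for the query
            continue
        for record in results:
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Usage with two toy sources that emulate different databases.
src1 = lambda q: [{"title": "XML-QL", "year": "1998"}]
src2 = lambda q: [{"title": "XML-QL", "year": "1998"},
                  {"title": "XQL", "year": "1998"}]
print(meta_query("SELECT * FROM papers", [src1, src2]))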

Although relational databases dominate the commercial market, object-oriented (OO) databases are becoming increasingly popular. Instead of assuming that XML documents are generated from a relational database, we must also cater for XML documents generated from an OO database. In this case, the DTD specifies the objects and relationships, although referential integrity constraints may not necessarily be checked. Retrieval of information from XML documents has also been considered as an object view problem [10], which can be based on the ODMG model [11].

A well-known technique for indexing OO databases is the use of path dictionaries. In this case, the structures of OO databases are represented as a set of paths, and each record is associated with a path. Whether a record qualifies for a query depends on both information about the record and its path. eXcelon Corp. [12] is an example of using OO database technology (ObjectStore) to index and search XML documents.
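
The following is a minimal sketch of a path dictionary, assuming a simple slash-separated path notation and integer record identifiers (both invented for illustration): candidate records are first selected by path and only then inspected.

# Sketch of a path dictionary: each record is registered under the
# label path that reaches it, so a query can first select candidate
# paths and only then inspect the records. Data are invented.
from collections import defaultdict

path_dictionary = defaultdict(list)   # path -> list of record ids

def add_record(path, record_id):
    path_dictionary[path].append(record_id)

def lookup(suffix):
    """Return record ids whose path ends with the given suffix."""
    return [rid for path, rids in path_dictionary.items()
            if path.endswith(suffix) for rid in rids]

add_record("bib/book/author/name", 1)
add_record("bib/article/author/name", 2)
print(lookup("author/name"))   # -> [1, 2]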

One problem with the OO database approach is its use of a common indexing method, superimposed coding signatures [13,14], which introduces the possibility of false drops (false alarms). Another problem is that the XML documents may represent data from heterogeneous sources, so the paths may be derived from a number of different OO databases. Clashes and differences in the use of names and conventions may degrade the retrieval performance or may even prevent proper indexing. Finally, there is a large overhead [15] in OO database management systems to support sophisticated access methods and relations that are not needed for many XML documents.
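
To illustrate why false drops arise, here is a minimal sketch of superimposed coding, with an invented signature width and hash scheme: each term sets a few bits in a fixed-width signature, a query signature that is covered by a record signature indicates only a possible match, and surviving candidates must still be verified against the raw record.

# Sketch of superimposed coding: each term sets a few bits in a
# fixed-width signature; a query signature that is "covered" by a
# record signature indicates a possible match, but false drops can
# occur, so candidates must still be verified against the record.
import hashlib

WIDTH, BITS_PER_TERM = 64, 3

def term_signature(term):
    sig = 0
    for i in range(BITS_PER_TERM):
        h = hashlib.md5(f"{term}/{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % WIDTH)
    return sig

def record_signature(terms):
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

record = record_signature(["xml", "query", "index"])
query = term_signature("search")
maybe_match = (record & query) == query   # True here would be a false drop
print(maybe_match)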

XML-QL [17,18] represents a novel approach in which SQL-like queries can be posted to search XML document databases. The following is an XML-QL example, which matches every <book> element in the XML document bib.xml that has at least one <title> element, one <author> element and one <publisher> element whose <name> element is equal to Addison-Wesley. For each such match, it binds the variables $t and $a to the title and author pair. Note that variable names are preceded by $ to distinguish them from string literals in the XML document (such as Addison-Wesley). Here </> is a short form for the closing tag of whichever element it matches.

WHERE <bib><book>
      <publisher><name> Addison-Wesley </></>
      <title> $t </>
      <author> $a </>
      </></> IN "bib.xml"


XML-QL is claimed to be relationally complete; it can extract and construct XML documents and supports both ordered and unordered views of XML documents. Internally, an XML document or an XML-QL query can be modeled as an XML graph, where the edges are element tags, the nodes are attribute-value pairs and the leaves are string values.
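
A minimal sketch of this graph view, using an invented Python representation (a node table plus a list of labelled edges), is given below; it illustrates the model rather than XML-QL's internal data structures.

# Sketch of the XML graph view: edges carry element tags, interior
# nodes carry attribute-value pairs, and leaves carry string values.
import xml.etree.ElementTree as ET
import itertools

def to_graph(root):
    ids = itertools.count()
    nodes, edges = {}, []

    def visit(element):
        node = next(ids)
        nodes[node] = dict(element.attrib)
        if element.text and element.text.strip():
            nodes[node]["#text"] = element.text.strip()
        for child in element:
            edges.append((node, child.tag, visit(child)))
        return node

    visit(root)
    return nodes, edges

doc = ET.fromstring('<bib><book year="1994"><title>TCP/IP</title></book></bib>')
print(to_graph(doc))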

By comparison, the IBM DataX project uses the folder concept to model access to the database. The folder definitions, coded by application programmers, map the database into XML documents when folder instances are created. Information from multiple relations is formatted in XML via hyperlinks upon materialization of the folder definition. In a sense, DataX does not support SQL with join operations explicitly; rather, navigation via foreign keys is implemented by programmers. On the other hand, the Stanford TSIMMIS project is based on the OEM model and the wrapper concept as middleware. Native queries can be submitted to a data source to retrieve native results. A native query can be an SQL statement in the usual sense, but it can also be any program written to access any data source, e.g., a legacy system. OEM objects are created by wrappers from the native results of native queries. The structure of an OEM object is so similar to a typical XML representation that XML is a natural format for storing it. Unlike DataX, the querying capability of TSIMMIS depends on the querying mechanism of the native module, which in the common case is an SQL server.

3. Information Retrieval-oriented Approach

Information retrieval techniques can be directly applied to the processing and retrieval of XML documents, each of which can be considered a text document with additional markup overhead. There are several methods of dealing with the markup tags. One method simply discards all the tags; the advantage is its simplicity but the disadvantage is the loss of information, leading to lower retrieval performance. Another method is to extract important structural and contextual information from the XML document for indexing. For example, the Ultraseek index server [16] indexes the first title element of an XML document.

A more comprehensive method is to index the tags as if they were index terms. Obviously, there is no need to index the end tags, since the start tags already provide the structuring information. It is best to index the tags and the tokens in the element content differently, so that the search process can be more flexible. For example, a user may want to find all documents containing certain tags, certain words, or certain words within certain tags. Such requirements may be expressed as forms or as regular expressions. Using regular expressions further allows one to combine pattern-based full-text search techniques, such as PAT arrays [19], with regular information retrieval-based approaches.
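
As a sketch of this idea, the fragment below indexes start tags and content words under different prefixes, using an invented regular-expression tokenizer that is far cruder than a real XML parser.

# Sketch of indexing start tags and content words as distinct index
# terms, so queries can ask for "documents containing the tag", "the
# word", or "the word inside the tag". Prefixes distinguish the two.
import re
from collections import defaultdict

index = defaultdict(set)          # term -> set of document ids

def index_document(doc_id, xml_text):
    for tag in re.findall(r"<([A-Za-z][\w.-]*)[^>]*>", xml_text):
        index["tag:" + tag.lower()].add(doc_id)   # start tags only
    for content in re.findall(r">([^<]+)<", xml_text):
        for token in content.split():
            index["word:" + token.lower()].add(doc_id)

index_document(1, "<conference><name>SIGIR</name></conference>")
print(index["tag:conference"], index["word:sigir"])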

Several XML search engines treat the search for particular words within particular tags differently. For example, GoXML.com treats a search requirement nested inside a tag as a proximity search: the user can post a query where the word SIGIR is contained inside the element conference, to search for information about the SIGIR conference without being overwhelmed by other SIGIR-related information. In this case, SIGIR can be in an element that is a descendant of conference in the hierarchy. The greater the level of nesting between the word and the containing element, the larger the semantic distance and therefore the lower the document is likely to be ranked.
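
A minimal sketch of such nesting-based ranking follows, under the simplifying assumption (invented here) that the score is the reciprocal of one plus the nesting distance between the containing element and the element whose text holds the query word.

# Sketch of nesting-based proximity ranking: the fewer levels between
# the containing element named in the query and the element whose text
# holds the query word, the higher the score. Illustration only.
import xml.etree.ElementTree as ET

def nesting_score(xml_text, container, word):
    root = ET.fromstring(xml_text)
    best = None
    def walk(element, depth_from_container):
        nonlocal best
        here = 0 if element.tag == container else (
            depth_from_container + 1 if depth_from_container is not None else None)
        if here is not None and element.text and word in element.text:
            best = here if best is None else min(best, here)
        for child in element:
            walk(child, here)
    walk(root, None)
    return None if best is None else 1.0 / (1 + best)

doc = "<conference><info><name>SIGIR 2001</name></info></conference>"
print(nesting_score(doc, "conference", "SIGIR"))   # deeper nesting -> lower score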

4. Hybrid Approach

Instead of proximity search, a more accurate method is to specify the nesting elements explicitly. Again, GoXML.com provides a direct association between words in the content and the immediate containing element. Other systems (e.g. Xdex [20] or XML Query Engine [21]) provide a flexible way to specify the nesting elements using XPaths [22]. The query language for specifying the XPaths is called XQL [23]. For example, the following XQL query specifies all documents that have an author name (either as a child element or as an attribute of the author element):

author/name | author/@name

On the other hand, XSet [15] requires the specification of the full path, because the meaning behind a tag may differ in different paths.
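
For illustration only, the XQL pattern above can be approximated against a small invented document using the limited XPath support in Python's standard library; this is not XQL itself, merely a rough analogue of the two alternatives in the pattern.

# Sketch of evaluating the XQL pattern above with the limited XPath
# support in Python's standard library. The sample document is invented.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<bib><author><name>Suciu</name></author>'
    '<author name="Deutsch"/></bib>')

# author/name : a <name> child element of <author>
child_matches = doc.findall(".//author/name")
# author/@name : a name attribute on <author>
attr_matches = doc.findall(".//author[@name]")

print([e.text for e in child_matches],
      [e.get("name") for e in attr_matches])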

XRS by Shin and his co-workers [24,25] uses the BUS architecture to index and search structured documents such as XML documents. The basic idea is to build an inverted list of all the terms appearing in the content of elements and in the values of attributes. The index maintains the association between each term, the corresponding XPaths reaching that term and the document identifier. The XPaths are generated and matched dynamically during searching. The rank value of an XML document is the aggregate of the weights of all the matched terms in the document. The BUS architecture uses a database engine to maintain the list of postings, so that additions, deletions and changes to XML document content can be carried out without re-indexing.
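
The sketch below illustrates the kind of index just described, with invented postings and weights: each posting records a document identifier, the path reaching the term and a weight, and a document's rank is the sum of the weights of its matched terms, optionally restricted by a path filter.

# Sketch in the spirit of the XRS/BUS index: each posting records the
# document, the path reaching the term and a weight; a document's rank
# is the sum of the weights of its matched terms. Data are invented.
from collections import defaultdict

postings = defaultdict(list)   # term -> [(doc_id, path, weight), ...]

postings["retrieval"] += [(1, "article/title", 2.0), (2, "article/body/p", 0.5)]
postings["xml"]       += [(1, "article/body/p", 0.5), (2, "article/title", 2.0)]

def rank(query_terms, path_filter=None):
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, path, weight in postings.get(term, []):
            if path_filter is None or path.endswith(path_filter):
                scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank(["retrieval", "xml"]))                      # content-only search
print(rank(["retrieval", "xml"], path_filter="title")) # restricted to titles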

In another hybrid method, a multi-level solution is adopted. Upon a search request, a category search based on information retrieval techniques is performed first. Once the category is found, a standard database query (e.g. in SQL) can be posted. These standard database queries are typically presented as a form that essentially abstracts the concept of a database view, using XSL transformations. XYZFind [26] is a good example of this hybrid method.
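
A minimal sketch of such a two-stage search, with invented categories and records standing in for the databases, is shown below: a coarse keyword match selects a category, and a fine field-based query then runs only inside that category.

# Sketch of a two-stage search in the spirit of the hybrid method:
# a coarse keyword match first selects a category, then a fine,
# field-based query runs only inside that category. All data are toy.
categories = {
    "books":  [{"title": "Data on the Web", "publisher": "Morgan Kaufmann"}],
    "papers": [{"title": "Proximal nodes", "journal": "ACM TOIS"}],
}

def category_search(keywords):
    def score(cat):
        text = " ".join(str(v) for rec in categories[cat] for v in rec.values())
        return sum(kw.lower() in text.lower() for kw in keywords)
    return max(categories, key=score)

def field_query(category, **constraints):
    return [rec for rec in categories[category]
            if all(rec.get(k) == v for k, v in constraints.items())]

cat = category_search(["web", "data"])
print(cat, field_query(cat, publisher="Morgan Kaufmann"))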

The benefit of this approach is that it can quickly filter out the best set of databases to be searched, and the database queries then allow very efficient retrieval of the required information. Apart from searching, the XYZFind indexer can automatically produce a (query) form for a set of XML documents.

5. Discussion

Search engines that index XML documents hold the promise that searching can be performed over heterogeneous data sources and in a more precise manner. The advantages of the database-oriented approach are that (1) data relations, constraints and integrity can be modeled and checked, (2) the query languages are similar to the standard database query language (SQL), so that little training is needed to use the system, and (3) standard (relational and object-oriented) database engines can be used. However, directly importing XML documents into the database engine can be difficult due to the heterogeneous structures of different documents. Similarly, the user may not know what kind of query specification is the most appropriate, since the heterogeneous database structures are unknown to the user. Also, these structures may be modified periodically in a dynamic environment like the Internet, and frequent updates to accommodate new database structures are costly. An alternative is to use a meta-database engine, similar to a meta search engine. The meta-database engine can post a set of queries to multiple databases and merge the results from them. In addition, server ranking methods [27] can be used to decide which database(s) is/are ultimately most appropriate for the user's information need. Such ranking information may be maintained at the server as a guideline for future queries submitted by users.

In the information retrieval-oriented approach, search based on the idea of proximity can be applied to XML documents to achieve better precision. The advantages of this approach are that (1) existing information retrieval engines can readily be modified for this purpose, (2) no training is required since the system can be used like a standard search engine, and (3) the indexing cost is smaller since the index does not have to contain detailed information about document structures. However, the problem with this approach is that it may not be as precise as the previous approach, because it is based on an approximate matching technique that does not support sophisticated matching of document structures. If the user is not interested in uniformly structured results and only needs the top N documents, then this approach is adequate. Interestingly, in this survey we found little work on extending the Boolean model directly for searching XML documents.

In the hybrid approach, popular techniques are combined: for example, the search for relevant XPaths, expressed in XQL, is combined with full-text search. This is likely to give more precise search results, since the XPath specification limits the possible matches. This approach is also flexible: it can work like a standard information retrieval engine or like a database engine (with full path specification). In this way, the novice user can use it like a search engine while the expert user can obtain more precise results. However, some storage overhead is paid for this flexibility, and the user requires some additional knowledge of XPaths to get more precise results. A more interesting hybrid search strategy is exemplified by XYZFind, in which an initial category search based on information retrieval is followed by a database search. The category search is a coarse search whereas the database search is a fine search. The advantages of this category-then-database approach are that (1) the user can identify the relevant form for the database search and (2) it can find information based on the full-text search alone. However, it is not known how much storage overhead is incurred, nor whether arbitrary web pages can be classified into different databases.

One issue that does not seem to be addressed is the problem brought about by allowing user-defined document structures. The problem is that document structures come from different sources, and the tag names and document structures may be different and idiosyncratic. This affects the recall performance of the search engine. A related issue concerns multilingual XML documents, since XML is specified using Unicode: the tag names coming from different sources may be given in different languages. Since a word can have more than one translation, or even no translation, how to find or make the appropriate translation is an interesting issue for multilingual information retrieval [28] of XML documents. Both issues affect all the approaches mentioned and are important in a dynamic and heterogeneous environment like the Internet.


This work is supported by the departmentally funded project H-ZJ88.


[1] Navarro, G. and R. Baeza-Yates (1997) Proximal nodes: a model to query document databases by content and structure, ACM Trans. on Information Systems, 15(4), 400-435.

[2] Bos, B. (1997) "XML representation of a relational database", http://www.w3.org/XML/RDB.html.

[3] W3C (1998) "Document content description for XML", http://www.w3.org/TR/NOTE-dcd.

[4] W3C (2000) "XML Schema Part 0: Primer W3C Working Draft", http://www.w3.org/TR/2000/WD-xmlschema-0-20000407.

[5] Turau, V. (1999) "Making legacy data accessible for XML applications", http://www.informatik.fh-wiesbaden.de/~turau/ps/legacy.pdf.

[6] Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., and Crespo, A. (1997) "Extracting semistructured information from the web", Proc. Workshop on Management of Semistructured Data, pp. 18-25.

[7] Hammer, J., McHugh J., and Garcia-Molina, H. (1997) "Semistructured data: the TSIMMIS experience", Proc. Advances in Databases and Information Systems, pp. 1-8.

[8] Lei, H., Lee, K.W., Blount, M., and Tait, C. (1999) "Enabling ubiquitous database access with XML", Proc. of Mobile data Access'99, pp. 119-134.

[9] Hawking, D. and P. Thistlewaite (1999) "Methods for information server selection", ACM Transactions on Information Systems, 17:1, 40-76.

[10] Abiteboul, S. (1999) "On views and XML", SIGMOD Record, 28(4), 30-38.

[11] Cattell, R.G. (1997) The Object Database Standard: ODMG 2.0, Morgan Kaufmann.

[12] eXcelon Corp. (2000) "ObjectStore: object data management system", http://www.exceloncorp.com/products/objectstore/os_overview.html.

[13] Lee, D.L. and W.C. Lee (1994) "Using path information for query processing in object-oriented database systems", Proc. 3rd International Conference on Information and Knowledge Management, pp. 64-71.

[14] Shin, H.G., K.S. Kim and J.W. Chang (1996) "S-signature: a new scheme for efficient query processing of complex objects in OODB", Proc. of CIKM 96, pp. 207-214.

[15] Zhao, B.Y. and A. Joseph, "XSet: A High Performance XML Search Engine", http://www.cs.berkeley.edu/%7Eravenben/xset/.

[16] Infoseek, "Ultraseek Server FAQ: what is the nature of Ultraseek Server's XML support?", http://software.infoseek.com/products/ultraseek/faqs/faq069.htm.

[17] Florescu, D., A. Deutsch, A. Levy, D. Suciu and M. Fernandez (1999) "A Query Language for XML", Proc. 8th International World Wide Web Conference.

[18] Levy, A., M. Fernandez, D. Suciu, D. Florescu and A. Deutsch (1998) "XML-QL: A Query Language for XML", World Wide Web Consortium Technical Report, Number NOTE-xml-ql-19980819.

[19] Frakes, W. and Baeza-Yates, R. (1992) Information Retrieval: Data Structures and Algorithms, Prentice-Hall.

[20] CAP Ventures (1999) "Xdex, an XML indexing engine from Sequoia Software Corporation", Dynamic Content Software Strategies Analysis, 4(33) http://www.sequoiasoftware.com/xdex/cap101700.asp.

[21] Katz, H. (2000) "XML Query Engine", http://www.fatdog.com/.

[22] W3C (1999) "XML Path Language". http://www.w3.org/TR/xpath.

[23] Robie, J., J. Lapp, D. Schach. (1998) "XML Query Language (XQL)", http://www.w3.org/TandS/QL/QL98/pp/xql.html.

[24] Shin, D.W., H.C. Jang and H.L. Jin (1998) "BUS: an effective indexing and retrieval scheme in structured documents", Proc. of Digital Libraries 98, pp. 235-243.

[25] Jang, H.C., Y.I. Kim and D.W. Shin (1999) "An effective mechanism for index update in structured documents", Proc. of International Conference on Information and Knowledge Management, pp. 383-390.

[26] XYZFind Corporation (2000) "XYZFind white paper: a new technology for search over heterogeneous structured data", http://www.xyzfind.com/paper_001.pdf.

[27] Dreilinger, D. and A. E. Howe (1997) "Experiences with selecting search engines using metasearch", ACM Trans. on Information Systems, 15(3), 195-222.

[28] Grefenstette, G. (1998) Cross-language information retrieval, Kluwer Academic Publisher.

Short Vita

Robert Luk is an assistant professor in the Department of Computing, Hong Kong Polytechnic University. He teaches XML and XSL in the Internet Computing module for both undergraduates and postgraduates of the department. He has worked on signature-based and multilingual indexing.

Alvin Chan is an assistant professor in the Department of Computing, Hong Kong Polytechnic University. He teaches Internet Computing. His research is in Internet computing, computer networking and security.

Prof. Tharam Dillon is the Acting Head of the Department of Computing, Hong Kong Polytechnic University, and a Fellow of the IEEE. He works on indexing for data mining and indexing XML documents. He is also working on self-describing data models expressed in XML and conceptual data models for capturing XML-expressed data. He has published 4 authored books, 3 edited books and over 400 refereed journal papers, conference papers and book chapters.

Hong Va Leong is an assistant professor in the Department of Computing, Hong Kong Polytechnic University. He teaches Distributed Computing and has been developing XML search engines. His research interests include Distributed Computing, Distributed Databases, Mobile Computing and Internet Computing.