|
The FTContainsExpr expression consists of two parts. The first part specifies the sequence of XML nodes over which the full-text search is to be performed. We call this sequence the search context. The second part specifies the full-text search condition. The full-text search condition is specified using expressions called FTSelection, which express simple term search queries as well as more complex conditions involving phrase matching, Boolean connectives, proximity operators, stemming, and thesauruses.
The FTContainsExpr expression has the following syntax:
FTContainsExpr ::= Expr "ftcontains" FTSelection
Expr is an XQuery expression that specifies the search context, which is the sequence of XML nodes over which the full-text search is to be performed. (The issue of whether the search context can contain atomic values is still under discussion at the FTTF.) FTSelection specifies the full-text search condition. FTContainsExpr returns a Boolean value that is true if and only if some node in the search context satisfies the full-text search condition.
In order to evaluate a full-text search condition over a search context node, all textual content of that node is (conceptually) transformed into a sequence of words, or more generally, tokens, by a process called tokenization. Those tokens are the units that search predicates ultimately can “look for.” In XQFT, the process of tokenization is left to be defined by the implementation, as it is highly language- and domain-dependent. Note, however, that the tokenization process establishes the most fundamental difference between pure substring matching and full-text search. We use “word” and “token” interchangeably, and when we talk about matching a word or a phrase, we generally mean matching a token or a sequence of tokens.
Examples
We now present several examples of queries that use the FTContainsExpr expression:
//book ftcontains "web" && "usability"
The above query returns TRUE if and only if some book in the search context //book (which is an XQuery expression) contains the search terms web and usability. Here "web" && "usability" is a simple example of an FTSelection. Note how FTContainsExpr can limit the search context by using an XQuery expression (//book in the above example). Further, because FTContainsExpr returns a result in the XQuery data model (a Boolean atomic value), it can be arbitrarily nested within other XQuery expressions. A more complex example illustrating this aspect is given below (Query 4.2.1 in Reference 9): book[.//@shortTitle ftcontains "improve" && "web" && "usability"]/title
The above query returns the titles of those books whose short title contains the search terms improve, web, and usability. Note how the FTContainsExpr expression (.//@shortTitle ftcontains "improve" && "web" && "usability") is nested within the XQuery expression book [...]/title.
There are two other interesting aspects to note about the preceding query. First, the query shows how XQFT can specify a return context, or the part of the selected XML items that are to be returned. Specifically, the return context is only the titles of the selected books (and not the contents of these books). Second, XQFT takes advantage of existing XQuery constructs, such as path expressions, to specify the search context (.//@shortTitle) and the return context (/title).
We can use the composability of the FTContainsExpr expression with XQuery to construct more sophisticated queries with which we can search in multiple search contexts, as illustrated below: //book [(metadata ftcontains "usability tests") and (content/part/chapter/title ftcontains "web-site usability")] /title
The above query returns the titles of books that contain the phrase usability tests in their metadata and the phrase web-site usability in a chapter title.
Finally, the following example shows how an XQuery expression can be used also inside the FTSelection of an FTContainsExpr expression: //book[.//section ftcontains {//article[@id = 10]/title} all words]
The above query returns the books that contain at least one section such that the section contains all the words in the title (or titles) of the article with id = 10. Here, the full-text search condition is {//article[@id = 10]/title} all words and is based on the XQuery expression //article[@id = 10]/title. The keywords all words state that all the words in the title (or titles) should be present in the relevant section of the book.
FTSelection expressions
As mentioned above, FTSelection expressions are used to specify the full-text condition in an FTContainsExpr expression. There are many types of FTSelection expressions and they are fully composable, so that users can construct complex full-text conditions. We now describe the FTSelection expressions supported in XQFT.
Word and phrase matching: The simplest FTSelection expression contains a single word or phrase in a search string. The following two queries return books that contain the word usability and the phrase usability testing, respectively:
//book[. ftcontains "usability"]
//book[. ftcontains "usability testing"]
As mentioned earlier, an FTSelection expression can also be the result of an XQuery query (the latter must be enclosed in curly braces {}). The following query returns books that contain an occurrence of one of the section titles of the article with id = 10 matched as a phrase: //book[. ftcontains {//article[@id = 10]/section/title} any]
For example, if the expression in curly braces evaluates to the sequence ("site usability", "testing"), books are required to contain the phrase site usability or the word testing.
Other possible options are any word, all, all words, and phrase. The difference between the any and any word options is that in the latter case, the elements of the sequence are not matched as phrases, but are tokenized into separate words and searched individually. Therefore, if any word were used instead of any in the preceding example, the query would require only that books contain any of the words site, usability, or testing. Likewise, if the all option is used on a sequence, then all elements of the sequence are required to be contained simultaneously as phrases (site usability and testing in our example), whereas the all words option requires only that all of the individual words be contained (site, usability, and testing in our example). Finally, the phrase option requires that all strings from the sequence returned by the nested XQuery expression be concatenated into a single phrase, which is to be matched. For example, the next query requires that books contain the phrase site usability testing:
//book[. ftcontains {//article[@id = 10]/section/title} phrase]
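The five options can be illustrated with a small Python sketch. It is not XQFT itself: the tokenizer is a toy stand-in for the implementation-defined one, and the function names (tokenize, contains_phrase, ft_match) are hypothetical helpers introduced only for this illustration.

```python
import re

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def ft_match(text, strings, option):
    """Apply one of the any/all options to a sequence of search strings."""
    doc = tokenize(text)
    phrases = [tokenize(s) for s in strings]          # each string kept as a phrase
    words = [[w] for p in phrases for w in p]         # each string split into words
    if option == "any":
        return any(contains_phrase(doc, p) for p in phrases)
    if option == "any word":
        return any(contains_phrase(doc, w) for w in words)
    if option == "all":
        return all(contains_phrase(doc, p) for p in phrases)
    if option == "all words":
        return all(contains_phrase(doc, w) for w in words)
    if option == "phrase":
        # Concatenate every string into one long phrase.
        return contains_phrase(doc, [w for p in phrases for w in p])
    raise ValueError("unknown option: " + option)

titles = ["site usability", "testing"]
text = "Testing the usability of a site"
for option in ("any", "any word", "all", "all words", "phrase"):
    print(option, ft_match(text, titles, option))
```

For the sample text, all words succeeds while all and phrase fail, because site and usability occur but not next to each other.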
Boolean operators—FTSelection expressions can be combined by using Boolean operators to create more complex full-text search conditions: && specifies a conjunction of words, || specifies a disjunction, ! specifies negation or absence of a word, and not in (also called mild negation) specifies that a word or phrase is not considered a match if occurring in a given context. For example, the following query returns books that contain both the word usability and the word testing: //book[. ftcontains "usability" && "testing"]
Note that the FTSelection expression in the above query ("usability" && "testing") contains two simpler FTSelection expressions ("usability" and "testing") combined by using a &&. As a more complex example, the following query combines the && and || operators to return books that contain both site and usability or both usability and testing: //book[. ftcontains ("site" && "usability") ||("usability" && "testing")]
The negation ! specifies that a full-text search condition must not be satisfied in the search context. For example, the following query returns books that contain the phrase New Mexico but not the phrase Mexico City: //book[. ftcontains "New Mexico" && !"Mexico City"]
Sometimes the negation ! can be too restrictive and can produce unexpected results. As an illustration, consider a user who is interested in books about Mexico, but not about New Mexico. Assume the user expresses the query using ! as follows: //book[. ftcontains "Mexico" && ! "New Mexico"]
The query will not return a book about Mexico if it contains a statement such as Mexico shares a border with New Mexico. Clearly, this is not what the user intended. To address this issue, XQFT supports a weaker notion of negation using the not in FTSelection. The binary operator not in returns nodes from the search context that contain words satisfying the left operand, so long as the same words are not part of a match that satisfies the right operand. For example, the following query will return the books that contain an occurrence of the word Mexico that is not part of the phrase New Mexico: //book[. ftcontains "Mexico" not in "New Mexico"]
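The difference between the two kinds of negation can be sketched in Python as follows. This is an illustration under simplifying assumptions: the tokenizer is a toy one, an occurrence is treated as "part of" a right-operand match whenever their token positions overlap, and the helper names are hypothetical.

```python
import re

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def occurrences(doc, phrase):
    """Return one range of token positions per occurrence of phrase in doc."""
    n = len(phrase)
    return [range(i, i + n)
            for i in range(len(doc) - n + 1)
            if doc[i:i + n] == phrase]

def strict_negation(text, wanted, unwanted):
    """Sketch of: wanted && ! unwanted."""
    doc = tokenize(text)
    return (bool(occurrences(doc, tokenize(wanted)))
            and not occurrences(doc, tokenize(unwanted)))

def not_in(text, wanted, context):
    """Sketch of: wanted not in context."""
    doc = tokenize(text)
    covered = {pos for occ in occurrences(doc, tokenize(context)) for pos in occ}
    # Keep only occurrences of `wanted` that lie outside every `context` match.
    return any(not (set(occ) & covered)
               for occ in occurrences(doc, tokenize(wanted)))

sentence = "Mexico shares a border with New Mexico."
print(strict_negation(sentence, "Mexico", "New Mexico"))
print(not_in(sentence, "Mexico", "New Mexico"))
```

The strict form rejects the sentence because New Mexico appears somewhere in it, whereas the mild form accepts it because the first Mexico is not covered by a New Mexico match.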
Distance predicates—In many applications, users may wish to specify the distance between words in the full-text condition. For example, a user may wish to search for books where the words site and usability occur close to each other. XQFT supports three ways to specify such distance predicates.
The first approach uses the same sentence, different sentence, same paragraph, and different paragraph FTSelection operators, which specify that the words should occur in the same sentence, different sentence, same paragraph, or different paragraph, respectively. The boundaries of a sentence and paragraph are determined by an implementation-dependent tokenizer that operates on the search context, as these concepts may vary across languages. Apart from the Boolean operators, all FTSelection operators are postfix operators that are appended to an FTSelection expression to form a new FTSelection expression. For example, the following query returns books that contain the words site, usability, and testing in the same paragraph: //book[. ftcontains ("site" && "usability" && "testing") same paragraph]
Similarly, the following query returns books that contain the words site, usability, and testing such that each of them appears in a different sentence: //book[. ftcontains ("site" && "usability" && "testing") different sentence]
The second approach to specifying distance predicates is using the distance FTSelection operator, which specifies the distance between every two consecutive occurrences of the matching words in units of words, sentences, or paragraphs. For example, the following query returns books that contain the words site, web, and testing in the same sentence, so that there is a triple occurrence of these three words where every two consecutive occurrences do not have more than one intervening term:
//book[. ftcontains ("site" && "web" && "testing") same sentence distance at most 1 words]
Note that the distance is given as a range (between 0 and 1 in this example). Other options available for specifying the allowable range of distances are at least E (denoting the range [E, +∞)), exactly E (denoting the range [E, E]), and from E1 to E2 (denoting the range [E1, E2]). Here E, E1, and E2 are XQuery expressions that evaluate to an integer.
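A word-level distance check can be sketched in Python as shown below. The sketch assumes a toy tokenizer, measures distance as the number of intervening tokens, and ignores sentence and paragraph boundaries; distance_match and its lo and hi parameters are hypothetical names. The four range forms map to (lo, hi) as follows: at most E is (0, E), at least E is (E, None), exactly E is (E, E), and from E1 to E2 is (E1, E2).

```python
import re
from itertools import product

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def distance_match(text, words, lo=0, hi=None):
    """True if one occurrence of each word can be chosen so that every two
    consecutive chosen occurrences have between lo and hi intervening tokens
    (hi=None means no upper bound)."""
    doc = tokenize(text)
    positions = [[i for i, tok in enumerate(doc) if tok == w.lower()]
                 for w in words]
    for combo in product(*positions):        # one position per query word
        ordered = sorted(combo)
        gaps = [b - a - 1 for a, b in zip(ordered, ordered[1:])]
        if all(g >= lo and (hi is None or g <= hi) for g in gaps):
            return True
    return False

text = "The web site needs testing soon"
print(distance_match(text, ["site", "web", "testing"], lo=0, hi=1))  # at most 1 words
print(distance_match(text, ["site", "web", "testing"], lo=0, hi=0))  # exactly 0 words
```

In the sample text, web and site are adjacent and one token separates site from testing, so the at most 1 form matches and the exactly 0 form does not.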
The third approach to distance predicates is using the window FTSelection operator. It specifies that words or phrases must be matched (or not matched) within a certain number of consecutive words, sentences, or paragraphs within the text. The first of the following queries returns books that contain the words site and usability within a window of at most three words, whereas the second query requires an occurrence of web site such that neither the sentence preceding it nor the sentence following it contains the word testing:
//book[. ftcontains ("site" && "usability") window 3 words]
//book[. ftcontains ("web site" && ! "testing") window 2 sentences]
Order of the words—The ordered FTSelection operator specifies whether the words in the search context should occur in the same order as they appear in the query. For example, the following query returns books that contain the word site before the word usability within a window of three words: //book[. ftcontains ("site" && "usability") ordered window 3 words]
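The combination of window and ordered can be sketched in Python as follows. This is a word-level illustration only, with a toy tokenizer; window_match and its parameters are hypothetical names, and sentence or paragraph units are not modeled.

```python
import re
from itertools import product

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def window_match(text, words, size, ordered=False):
    """True if one occurrence of each word fits inside `size` consecutive
    tokens; with ordered=True the occurrences must follow the query order."""
    if not words:
        return False
    doc = tokenize(text)
    positions = [[i for i, tok in enumerate(doc) if tok == w.lower()]
                 for w in words]
    for combo in product(*positions):        # one position per query word
        if max(combo) - min(combo) + 1 > size:
            continue                         # does not fit in the window
        if ordered and list(combo) != sorted(combo):
            continue                         # words appear in a different order
        return True
    return False

print(window_match("improve site usability", ["site", "usability"], 3, ordered=True))
print(window_match("usability site checks", ["site", "usability"], 3, ordered=True))
```

The second call fails only because of ordered: both words are within three tokens of each other, but usability precedes site in the text.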
Number of occurrences—The occurs FTSelection operator can be used to specify the number of distinct occurrences of a full-text search condition. For example, the following query returns books that contain at least two distinct occurrences of the words site and testing within a window of three words:
//book[. ftcontains ("site" && "testing") window 3 words occurs at least 2]
String content—The FTContent operator is used to find matches in which the words and phrases are the first, last, or all of the words and phrases in the tokenized string value of the element being searched. For example, /books//title[. ftcontains "improving the usability of a web site" at start]
finds title elements starting with the phrase improving the usability of a web site. If at end was used instead of at start, the query would find title elements ending with the phrase improving the usability of a web site. Finally, the entire content option would return title elements where the phrase improving the usability of a web site constitutes the entire content of the title.
Match options
Although FTSelection expressions are used to find search context nodes that contain exact matches for the query words, in many cases, users may also be interested in context nodes that do not contain exact matches for the query words, but contain similar matches. For example, a user searching for search context nodes that contain the word usability may also be interested in search context nodes that contain the word usage (with the same stem as usability, namely use), or the word Usability (with the same spelling as usability, but with an uppercase character), or the word easy-to-use (with the same semantic meaning as usability). Match options are used to specify such relaxations on the query words so that they can be matched in a more flexible manner with the search context nodes.
Match options can be seamlessly composed with FTSelection expressions. A match option applied on a (possibly complex) FTSelection expression applies to all query words and distance predicates within the FTSelection expression. We now describe the match options supported in XQFT.
Stemming—Implementations of full-text search usually have a means to extend the result set by looking for linguistic variants of the query terms, such as use, used, and using. In XQFT, the “with stemming” match option is used for this purpose, and the “without stemming” match option is used to disable this feature. For example, the following query returns books that contain the word achieve as well as all words that share the same stem as achieve (such as achieving): //book[. ftcontains "achieve" with stemming]
Stemming can also be selectively disabled, as illustrated in the following query that returns books which contain the word Tudor-Medina without applying stemming, and contain the words site and testing in the same sentence, using stemming to match site and testing: //book[. ftcontains ("Tudor-Medina" without stemming && ("site" && "testing" same sentence)) with stemming]
Note that the outermost “with stemming” applies to the entire FTSelection, except where it is explicitly overridden within (for Tudor-Medina).
The exact method used to perform stemming is implementation defined. Hence, implementations are free to provide more sophisticated linguistic matching than a simple stemming approach, which in many languages gives poor results.
Character case variations—The case sensitive, case insensitive, lowercase, and uppercase match options deal with variations in the character case of words. By default, the case insensitive match option is used, which means that the case of the words is not considered when interpreting the full-text search condition. For example, the following two queries are equivalent and will return books that contain the words Usability and testing, ignoring the case of the words:
//book[. ftcontains "Usability" && "testing"]
//book[. ftcontains ("usability" && "testing") case insensitive]
Although the case insensitive match option is the default, users may wish to explicitly specify it because the default can be overridden in the query prologue or by using another case match option at a higher level of the query.
The case sensitive match option is used to match the search context nodes that contain exactly the same word (in the same case) as the query. For example, the following query returns books that contain the word LaTeX in which each character of the word is spelled in exactly the same way: //book[. ftcontains "LaTeX" case sensitive]
The lowercase and uppercase match options match words that appear as all lowercase or all uppercase in the search context. For example, the following query returns books that contain all the words from each title of the article with id = 10, with all the words being interpreted as uppercase words: //book[. ftcontains {//article[@id = 10]/title} all words uppercase]
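The four case options can be summarized in a short Python sketch that compares a single document word with a single query word. It follows the description given here (lowercase and uppercase match words that appear entirely in that case in the search context); case_matches is a hypothetical helper, not part of XQFT.

```python
def case_matches(doc_word, query_word, option="case insensitive"):
    """Compare one document word with one query word under a case option."""
    if option == "case sensitive":
        return doc_word == query_word
    if option == "case insensitive":
        return doc_word.lower() == query_word.lower()
    if option == "lowercase":
        # The document word must be the all-lowercase form of the query word.
        return doc_word == query_word.lower()
    if option == "uppercase":
        # The document word must be the all-uppercase form of the query word.
        return doc_word == query_word.upper()
    raise ValueError("unknown option: " + option)

print(case_matches("latex", "LaTeX"))
print(case_matches("latex", "LaTeX", "case sensitive"))
print(case_matches("USABILITY", "usability", "uppercase"))
```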
Diacritics—The diacritics sensitive, diacritics insensitive, with diacritics, and without diacritics match options deal with diacritical marks in characters, such as accents, diaereses, and cedillas (see the Unicode standard, Reference 35, for a definition of diacritical marks). By default, the diacritics insensitive match option is used, which means that when matching a word in a search context node against a query word, diacritical marks are ignored. For example, the following three queries return the same set of books, namely books that contain the word Vera, possibly including diacritics:
//book[. ftcontains "Véra"]
//book[. ftcontains "Véra" diacritics insensitive]
//book[. ftcontains "Vera" diacritics insensitive]
The diacritics sensitive match option requires words in the search context nodes to contain exactly the same characters as the respective query words, including diacritics. The with diacritics and without diacritics match options both imply a diacritics insensitive match option, but in the with diacritics case, only matching words in the search context nodes that contain at least one diacritical character are considered, while in the without diacritics case, matching words must not contain any diacritical character. Note that any diacritics in the query words have no impact on the result in either case. For example, the following query returns books that contain the word naive with at least one diacritical character (such as naïve, näive, etc.):
//book[. ftcontains "naive" with diacritics]
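The four diacritics options can be sketched in Python using the standard unicodedata module. The sketch assumes that a diacritical mark is any Unicode combining character left after canonical decomposition (NFD), and it applies the default case-insensitive comparison; the helper names are hypothetical.

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_diacritics(word):
    """True if the word contains at least one combining mark."""
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFD", word))

def word_matches(doc_word, query_word, option="diacritics insensitive"):
    """Compare one document word with one query word under a diacritics option."""
    d, q = doc_word.lower(), query_word.lower()   # default: case insensitive
    if option == "diacritics sensitive":
        return unicodedata.normalize("NFC", d) == unicodedata.normalize("NFC", q)
    same_base = strip_diacritics(d) == strip_diacritics(q)
    if option == "diacritics insensitive":
        return same_base
    if option == "with diacritics":
        return same_base and has_diacritics(d)     # only the document word counts
    if option == "without diacritics":
        return same_base and not has_diacritics(d)
    raise ValueError("unknown option: " + option)

print(word_matches("Vera", "Véra"))
print(word_matches("naïve", "naive", "with diacritics"))
print(word_matches("naive", "naive", "with diacritics"))
```

Note that the with diacritics and without diacritics branches inspect only the document word, reflecting the statement that diacritics in the query words have no impact on the result.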
Character wildcards—The with wildcards match option allows certain character sequences in a query word to be interpreted as character wildcards. The following wildcard character sequences are supported by XQFT:
- "." stands for any single character.
- ".?" stands for zero or one character.
- ".*" stands for zero or more characters.
- ".+" stands for one or more characters.
- ".{n,m}" stands for n to m characters (where n and m are numbers).
This notation is a subset of the regular expression notation used elsewhere in XQuery. When the with wildcards match option is used and a wildcard is present in a query word, it means that the query word matches a word in a document if and only if the document word can be obtained from the query word by replacing the wildcard character sequence by a sequence of arbitrary characters with a length as allowed by the wildcard. The match between query words and document words is always one to one. A wildcard, therefore, does not allow a query word to match multiple words simultaneously. If the with wildcards match option is not used, then wildcard characters are simply matched as regular characters in the document.
As an illustration, the following query returns books that contain at least one word that matches the query word "eff.c.+", where "." and ".+" are interpreted as wildcards (e.g., books that contain words such as efficient and effective): //book[. ftcontains "eff.c.+" with wildcards]
Another interesting example is the following query: //book//*[. ftcontains "site.* user." with wildcards]
It contains a multiword query term containing the wildcards ".*" and ".". The tokenizer should break this up into two words (the wildcard sequences are to be considered token-internal characters). Hence, if applied to our sample document, this query returns only the note element. The word sequence site supports the users in the paragraph above is not matched, because "site.*" can only match a single word.
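Because the wildcard notation is a subset of regular expression syntax, its word-by-word behavior can be sketched with Python's re module. The sketch assumes that query words contain no regular expression metacharacters other than the wildcard sequences listed above, splits the query on whitespace so that wildcards stay token-internal, and uses a toy document tokenizer; the function names are hypothetical.

```python
import re

def tokenize_doc(text):
    """Toy document tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def word_matches(doc_word, query_word, with_wildcards):
    """A query word must match exactly one whole document word."""
    pattern = query_word if with_wildcards else re.escape(query_word)
    return re.fullmatch(pattern, doc_word) is not None

def phrase_matches(text, query, with_wildcards=True):
    """Match a (possibly multiword) query against consecutive document words."""
    doc = tokenize_doc(text)
    q = query.lower().split()        # wildcards are kept inside each query word
    n = len(q)
    if n == 0:
        return False
    return any(all(word_matches(d, p, with_wildcards)
                   for d, p in zip(doc[i:i + n], q))
               for i in range(len(doc) - n + 1))

print(phrase_matches("an efficient approach", "eff.c.+"))
print(phrase_matches("the site supports the users", "site.* user."))
```

The second call returns False for the same reason given above: "site.*" may only consume a single word, so it cannot bridge the words between site and users.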
Thesaurus expansions—In some cases, when users issue a full-text search query with query words such as canine, they are also looking for results that contain semantically related words such as dog and poodle. The with thesaurus match option specifies such full-text search conditions, where relationships as defined in a standard thesaurus, such as synonyms, broader terms, and narrower terms, are exploited (see References 34–38 for thesaurus standards). In XQFT, a thesaurus expansion can be specified in a query by providing three things: a Uniform Resource Identifier (URI) reference to the thesaurus to be used, the relationship to be used, and an optional depth parameter. Because with thesaurus is a match option, it can be specified at any level of the query and applied to all query words mentioned in that part of the query. The application of a thesaurus expansion to a query word means that the full-text search is performed as though the disjunction of related words has been specified in place of the query word.
As an example, the following query returns books that contain a synonym of the word canine: //book[. ftcontains "canine" with thesaurus at "http://bstore1.example.com/BSThesaurus.xml" relationship "synonyms"]
The actual technique used for thesaurus expansion is implementation defined, including whether the thesaurus URI refers to a system-defined or user-defined thesaurus.
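The idea of expansion into a disjunction can be sketched in Python with a toy in-memory thesaurus. Everything here is an illustrative assumption: the THESAURUS data, the helper names, the interpretation of depth as the number of relationship steps to follow, and the choice to keep the original query word in the expanded set.

```python
import re

# Toy thesaurus (assumption): relationship -> word -> directly related words.
THESAURUS = {
    "synonyms": {"canine": {"dog"}},
    "narrower terms": {"canine": {"dog"}, "dog": {"poodle"}},
}

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def expand(word, relationship, depth=1):
    """Return the query word plus words reachable in at most `depth` steps."""
    related = {word}
    frontier = {word}
    table = THESAURUS.get(relationship, {})
    for _ in range(depth):
        frontier = {r for w in frontier for r in table.get(w, set())} - related
        related |= frontier
    return related

def contains_with_thesaurus(text, word, relationship, depth=1):
    """Search as though the disjunction of the expanded words had been written."""
    return bool(set(tokenize(text)) & expand(word.lower(), relationship, depth))

print(sorted(expand("canine", "narrower terms", depth=2)))
print(contains_with_thesaurus("My poodle barks", "canine", "narrower terms", depth=2))
```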
Stop words—When performing a search, search engines often have a built-in means to disregard words that do not carry their own meaning, such as articles, prepositions, and function words (such as but and if). Such words are called stop words. The advantage of disregarding stop words is that queries can be processed faster (because common stop words do not need to be processed), and the returned results are of higher relevance (as stop words typically carry little meaning). Further, if stop words are not indexed, the size of the full-text indexes may be considerably smaller, depending on the kind of encoding used (Reference 39). However, if stop words are ignored, then queries where stop words are relevant, as in the phrase query to be or not to be, can no longer be answered. In XQFT, the with stop words match option can be used to control the list of stop words to be employed. The stop words can be specified either as a URI that points to a stop word list or by directly listing the stop words in the query. In addition, the lists can be combined dynamically by using the set operations union and except. For example, the following query returns books that contain the phrase planning then conducting while ignoring stop words that are specified in the URI http://bstore1.example.com/StopWordList.xml:
//book[. ftcontains "planning then conducting" with stop words at "http://bstore1.example.com/StopWordList.xml"]
The with default stop words match option can be used to select a system-provided default stop word list, and the without stop words match option switches off stop-word processing in the part of the query to which this option is applied.
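The effect of stop word lists on phrase matching can be sketched in Python. The sketch makes two assumptions that are not spelled out here: stop word lists are modeled as plain sets (so union and except become set union and difference), and a stop word in the query is allowed to match any single token at that position. The word lists and helper names are hypothetical.

```python
import re

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def phrase_with_stop_words(text, phrase, stop_words):
    """Match a phrase, letting query stop words match any single token."""
    doc = tokenize(text)
    q = tokenize(phrase)
    n = len(q)
    if n == 0:
        return False
    def token_ok(doc_tok, query_tok):
        return query_tok in stop_words or doc_tok == query_tok
    return any(all(token_ok(d, p) for d, p in zip(doc[i:i + n], q))
               for i in range(len(doc) - n + 1))

# Combining lists, in the spirit of union and except.
base_list = {"then", "the", "a"}           # hypothetical list loaded from a URI
extra_list = {"and"}                       # hypothetical list given inline
stop_words = (base_list | extra_list) - {"the"}

text = "planning and conducting tests"
print(phrase_with_stop_words(text, "planning then conducting", stop_words))
print(phrase_with_stop_words(text, "planning then conducting", set()))
```

With an empty set, which corresponds to switching stop word processing off, the phrase must match token for token and the sample text no longer qualifies.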
Language option—The stemming, thesaurus, and stop words match options may not produce sensible results if the language of the documents or query words is unknown. For example, but in English is likely to be a stop word, whereas in French this word means aim, which is not likely to be a stop word. It is therefore necessary to be able to specify the language of the words in a query. In XQFT, the language is specified using the language match option. The following query selects the French language for language-dependent features, such as the selection of the default stop-word list: $book[. ftcontains "salon de the" with default stop words language "fr"]
The set of valid language identifiers (such as fr) is implementation defined.
FTIgnore—The match options we have described so far all relate to the matching of single words or phrases. Using the FTIgnore option, it is possible to modify which parts of the XML structure are available for a single match of FTSelection. The FTIgnore option can be specified only on the top-level FTSelection, basically extending the syntax of FTContainsExpr to the following: FTContainsExpr ::= Expr "ftcontains" FTSelection ["without" "content" Expr]
The Expr following without content specifies a sequence of nodes, the containing text of which should be ignored when searching the search context nodes. For instance, the following query allows us to search the content element of a book, including all its descendant elements, but not including descendants of type footnote: //book[. ftcontains "web site testing" without content .//footnote]
There are two aspects to this exclusion of text material from the search context. First, when the phrase web site testing appears in a footnote element (or descendant element thereof), it should not be found. Second, when eliminating footnotes, the distances of terms in the remaining text are affected. For instance, when ignoring a footnote that happens to stand between web site and testing, as in the example below, the terms become adjacent and can be matched as a single phrase: <p>Web site<footnote>only sample.com here</footnote> testing ...</p>
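The footnote example can be reproduced with a short Python sketch based on the standard xml.etree.ElementTree module. It assumes a toy tokenizer and treats every element boundary as a token boundary (which is an implementation choice, not something stated here); text_without and contains_phrase are hypothetical helpers.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Toy tokenizer (assumption): lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def text_without(elem, ignored_tag):
    """Collect the text of elem, skipping the content of ignored_tag elements
    but keeping the text that follows them."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != ignored_tag:
            parts.append(text_without(child, ignored_tag))
        parts.append(child.tail or "")     # text after the child is never ignored
    return " ".join(parts)                 # element boundary = token boundary

p = ET.fromstring(
    "<p>Web site<footnote>only sample.com here</footnote> testing ...</p>")
phrase = tokenize("web site testing")

print(contains_phrase(tokenize(text_without(p, "footnote")), phrase))
print(contains_phrase(tokenize(text_without(p, None)), phrase))
```

With the footnote ignored, Web site and testing become adjacent and the phrase matches; with the footnote text kept, the intervening tokens break the phrase.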
|