Full-Text

From BaseX Documentation

This article is part of the XQuery Portal. It summarizes the full-text and language-specific features of BaseX.

Full-text retrieval is an essential query feature for working with XML documents, and BaseX was the first query processor that fully supported the W3C XQuery Full Text 1.0 Recommendation.

Introduction

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to query both XML documents and single strings for words and phrases. This section gives you a quick insight into the most important features of the language.

This is a simple example of a basic full-text expression:

"This is YOUR World" contains text "your world"

It yields true, because the search string is tokenized before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: for example, differences in upper/lower case and diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields true:

"Well... Done!" contains text "well, done"

The occurs keyword comes into play when more than one occurrence of a token is to be found:

"one and two and three" contains text "and" occurs at least 2 times

Various range modifiers are available: exactly, at least, at most, and from ... to ....
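
The remaining range modifiers can be sketched on the same input string. This is a minimal illustration that only assumes the token "and" occurs twice in the input, so the three expressions should yield true, false and true:

```xquery
(: "and" occurs exactly twice :)
"one and two and three" contains text "and" occurs exactly 2 times,
(: two occurrences exceed the upper limit of one :)
"one and two and three" contains text "and" occurs at most 1 times,
(: two occurrences lie within the range 1..3 :)
"one and two and three" contains text "and" occurs from 1 to 3 times
```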

Combining Results

In the following example, curly braces are used to combine multiple keywords:

for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name

The query will output the names of all countries with a religion element containing sunni or shia. The any keyword is optional; it can be replaced with:

The keywords ftand, ftor and ftnot can also be used to combine multiple query terms. The following query yields the same result as the last one does (but it takes more memory):

doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
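
As a further minimal sketch, ftand and ftnot can be tried on plain strings. The input strings are made up for illustration and are not taken from the factbook document; both expressions should yield true:

```xquery
(: both terms must be found :)
"Sunni Muslim 85%, Shia Muslim 15%" contains text "sunni" ftand "shia",
(: the first term must be found, the second one must not :)
"Sunni Muslim 85%" contains text "sunni" ftand ftnot "shia"
```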

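The keywords not in are special: they are used to find tokens which are not part of a longer token sequence. In the following query, the first string yields false, because its only occurrence of "New" is part of "New York", whereas the second string yields true: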

for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"

Positional Filters

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<xml>
  <text>There is some reason why ...</text>
  <text>For some good yet unknown reason, ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]

…the first two texts will be returned as result, because there are at most three words between some and reason. Additionally, the ordered keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that all is required here to guarantee that only those hits will be accepted that contain all searched words.

The window keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
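
For comparison, here is a minimal sketch of the same query with a wider window. In the second string, "A" and "E" are the first and fifth token, so a window of five words spans both of them and "A B C D E" should be returned:

```xquery
(: only the second string contains both terms; they fit into a window of 5 words :)
("A C D", "A B C D E")[. contains text { "A", "E" } all window 5 words]
```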

Sometimes it is interesting to only select texts in which all searched terms occur in the same sentence or paragraph (you can even filter for different sentences/paragraphs). This is obviously not the case in the following example:

'“I will survive!” This is what Mary told me.' contains text { 'will', 'told' } all words same sentence

Sentences are delimited by sentence-ending punctuation marks (., !, ?, etc.), and newline characters are treated as paragraph delimiters. By the way: in the examples above, the words unit has been used, but sentences and paragraphs are valid alternatives.
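
The scope filters can be sketched as follows. The first input string is made up for illustration; under the delimiter rules described above, both expressions would be expected to yield true:

```xquery
(: both words appear within a single sentence :)
'Mary told me that she will survive.'
  contains text { 'will', 'told' } all words same sentence,
(: the two words appear in two different sentences :)
'“I will survive!” This is what Mary told me.'
  contains text { 'will', 'told' } all words different sentence
```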

Last but not least, three specifiers exist to filter results depending on the position of a hit:

Match Options

As indicated in the introduction, the input and query texts are tokenized before they are compared with each other. During this process, texts are split into tokens, which are then normalized, based on the following matching options:

"Respect Upper Case" contains text "Upper" using case sensitive,
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive,
"catch" contains text "catches" using stemming,
"Haus" contains text "Häuser" using stemming using language 'de',
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"

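The effect of a match option is easiest to see when a query is run with and without it. The following minimal sketch reuses the input string from above. With the default options, case is ignored and the first expression should yield true, whereas the second one should yield false:

```xquery
(: default options: case is ignored :)
"Respect Upper Case" contains text "upper",
(: case sensitive: "upper" does not match "Upper" :)
"Respect Upper Case" contains text "upper" using case sensitive
```
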
The wildcards option facilitates search operations similar to simple regular expressions:

"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards

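The other wildcard quantifiers defined in the XQuery Full Text Recommendation can be sketched in the same way. The input strings are made up for illustration, and all three expressions should yield true:

```xquery
(: "." matches exactly one character :)
"house" contains text "h.use" using wildcards,
(: ".?" matches zero or one character :)
"hous" contains text "hous.?" using wildcards,
(: ".+" matches one or more characters :)
"houses" contains text "house.+" using wildcards
```
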
This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

BaseX Features

This page lists BaseX-specific full-text features and options.

Options

The available full-text index can handle various combinations of the match options defined in the XQuery Full Text Recommendation. By default, most options are disabled. The GUI dialogs for creating new databases or displaying the database properties contain a tab for choosing between all available options. On the command line, the SET command can be used to activate full-text indexing or to create a full-text index for existing databases:

The following indexing options are available:

Languages

The chosen language determines how the input text will be tokenized and stemmed. The basic code base and JAR file of BaseX come with built-in support for English and German. More languages are supported if the following libraries are found in the classpath:

The JAR files can also be found in the zip and exe distribution files of BaseX.

The following two queries, which both return true, demonstrate that stemming depends on the selected language:

"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "de"

Scoring

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation-defined.

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be:

(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

As scores will be implicitly bound to the boolean item of a full-text expression, they can also be made visible later by binding full-text results to a score variable:

let $hits1 := //text[. contains text "A"]
let $hits2 := //text[. contains text "B"]
let score $score := $hits1 and $hits2
return $score

Thesaurus

BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why queries such as

'computers' contains text 'hardware'
  using thesaurus default

will return false. However, if the thesaurus is specified, then the result will be true:

'computers' contains text 'hardware'
  using thesaurus at 'XQFTTS_1_0_4/TestSources/usability2.xml'

The format of the thesaurus files must be the same as the format of the thesauri provided by the XQuery and XPath Full Text 1.0 Test Suite. Each file is an XML document whose structure is defined by an XSD schema.

Fuzzy Querying

In addition to the official recommendation, BaseX supports fuzzy querying. The XQFT grammar was enhanced by the FTMatchOption using fuzzy to allow for approximate searches in full texts. By default, the standard full-text index already supports the efficient execution of fuzzy searches.

Document 'doc.xml':

<doc>
   <a>house</a>
   <a>hous</a>
   <a>haus</a>
</doc>

Command: CREATE DB doc.xml; CREATE INDEX fulltext

Query:

//a[text() contains text 'house' using fuzzy]

Result:

<a>house</a>
<a>hous</a>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, with a minimum of 1 error. A static error distance can be set by adjusting the LSERROR property (default: SET LSERROR 0). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
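
The error limit described above can be sketched in plain XQuery. The function local:max-errors is a hypothetical helper and not part of BaseX. It assumes that the division by 4 is an integer division and that LSERROR has its default value of 0:

```xquery
(: Hypothetical helper, not a BaseX function: estimates the number of
   allowed errors for a query term, assuming integer division by 4
   and a minimum of one error. :)
declare function local:max-errors($term as xs:string) as xs:integer {
  xs:integer(max((1, string-length($term) idiv 4)))
};

for $term in ("house", "database", "interoperability")
return concat($term, ": ", local:max-errors($term))
```

Under these assumptions, the terms have lengths 5, 8 and 16, so the query returns house: 1, database: 2 and interoperability: 4. This is consistent with the example above, in which “hous” is accepted with one error.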

Performance

Index Processing

BaseX offers different evaluation strategies for XQFT queries; the choice depends on the input data and on the existence of a full-text index. The query compiler tries to speed up queries by applying a full-text index structure whenever possible and useful. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation, and a hybrid one that combines both strategies (see XQuery Full Text implementation in BaseX). Query optimization and the selection of the most efficient evaluation strategy are fully automatic. The output of the query optimizer indicates which evaluation plan is chosen for a specific query. It can be inspected by activating verbose querying (Command: SET VERBOSE ON) or by opening the Query Info in the GUI. The message

Applying full-text index

suggests that the full-text index is applied to speed up query evaluation. A second message

Removing path with no index results

indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.

FTAnd

The internal XQuery Full Text data model is fairly complex and may consume more main memory than one would initially guess. If you plan to combine search terms via ftand, we recommend resorting to an alternative, memory-saving representation:

(: representation via "ftand" :)
"A B" contains text "A" ftand "B" ftor "C" ftand "D"

(: memory saving representation :)
"A B" contains text { "A", "B" } all ftor { "C", "D" } all

Mixed Content

When working with so-called narrative XML documents, such as HTML, TEI, or DocBook documents, you typically have mixed content, i.e., elements containing a mix of text and markup, such as:

<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see XQuery and XPath Full Text 1.0 Use Cases.

To enable this kind of search, whitespace chopping must be turned off when importing XML documents by setting the option CHOP to OFF (default: SET CHOP ON). In the GUI, you find this option in Database → New… → Parsing → Chop Whitespaces. A query such as //p[. contains text 'real text'] will then match the example paragraph above. However, the full-text index will not be used in this query, so it may take a long time. The full-text index would be used for the query //p[text() contains text 'real text'], but this query will not find the example paragraph, because the matching text is split over two text nodes.

Note that the node structure is completely ignored by the full-text tokenizer: The contains text expression applies all full-text operations to the string value of its left operand. As a consequence, the ft:mark and ft:extract functions (see Full-Text Functions) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real'])
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])

Note that BaseX does not support the ignore option (without content) of the W3C XQuery Full Text 1.0 Recommendation. This means that it is not possible to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow. Here is an example document:

<p>This text is provided for illustrative<note>Serving as an example or explanation.</note> purposes only.</p>

The ignore option would enable you to search for the string “illustrative purposes”:

//p[. contains text 'illustrative purposes' without content note]

For more examples, see XQuery and XPath Full Text 1.0 Use Cases.

As BaseX does not support the ignore option, it raises error FTST0007 when it encounters without content in a full-text contains expression.

Functions

Some additional Full-Text Functions have been added to BaseX to extend the official language recommendation with useful features, such as explicitly requesting the score value of an item, marking the hits of a full-text request, or directly accessing the full-text index with the default index options.
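
As a minimal sketch of explicitly requesting a score value, the following query assumes that the ft:score function of the Full-Text Functions accepts the result of a full-text expression and returns its score as a double value. The input strings are the ones used in the scoring example above:

```xquery
(: returns one score value per input string :)
for $text in ("A", "A B", "A B C")
return ft:score($text contains text "A")
```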

Collations

Collations are another XQuery feature related to natural language processing. By default, string comparisons in XQuery are based on the Unicode codepoint order. The default collation URI http://www.w3.org/2005/xpath-functions/collation/codepoint specifies this ordering. Other collations are completely implementation-defined. In BaseX, the following URI syntax is supported to specify collations:

 http://basex.org/collation?lang=...;strength=...;decomposition=...

Semicolons can be replaced with ampersands; for convenience, the URL can be reduced to its query string component (including the question mark). All arguments are optional:

Argument Description
lang A language code, selecting a Locale. It may be followed by a language variant. If no language is specified, the system’s default will be chosen. Examples: de, en-US.
strength Level of difference considered significant in comparisons. Four strengths are supported: primary, secondary, tertiary, and identical. For example, in German, "Ä" and "A" are considered primary differences, "Ä" and "ä" are secondary differences, "Ä" and "A&#x308;" are tertiary differences, and "A" and "A" are identical.
decomposition Defines how composed characters are handled. Three decompositions are supported: none, standard, and full. More details are found in the JavaDoc of the JDK.

Examples

If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields true:

declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
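
The same comparison can be sketched with fn:compare and the shortened query-string form of the collation URI. With the default codepoint collation, the first call returns 1, because ß has a higher codepoint than s. Given that the two strings are equal under the German collation used above, the second call should return 0:

```xquery
(: default Unicode codepoint collation :)
compare('Straße', 'Strasse'),
(: BaseX collation, reduced to its query string component :)
compare('Straße', 'Strasse', '?lang=de;strength=secondary')
```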

Collations can also be specified in order by and group by clauses of FLWOR expressions. This query returns à plutôt! bonjour!:

for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w

Various string functions exist that accept an optional collation as argument. The following function calls return a and 1 2 3 as results:

distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")

Changelog

Version 7.7
Version 7.3