Full-Text

From BaseX Documentation
Revision as of 16:48, 18 June 2012 by CG (talk | contribs) (→‎Options)

This article is part of the XQuery Portal. It summarizes the full-text features of BaseX.

Full-text retrieval is an essential query feature for working with XML documents, and BaseX was the first query processor that fully supported the W3C XQuery Full Text 1.0 Recommendation. This page lists some particularities and extensions of the BaseX implementation.

Query Evaluation

BaseX offers different evaluation strategies for XQFT queries; the choice depends on the input data and the existence of a full-text index. The query compiler tries to speed up queries by applying a full-text index whenever possible and useful. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation, and a hybrid one that combines both (see "XQuery Full Text implementation in BaseX"). Query optimization and the selection of the most efficient evaluation strategy are fully automatic. The output of the query optimizer indicates which evaluation plan is chosen for a specific query. It can be inspected by activating verbose querying (Command: SET VERBOSE ON) or by opening the Query Info in the GUI. The message

Applying full-text index

suggests that the full-text index is applied to speed up query evaluation. A second message

Removing path with no index results

indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.

Options

The available full-text index can handle various combinations of the match options defined in the XQuery Full Text Recommendation. By default, most options are disabled. The GUI dialogs for creating new databases or displaying the database properties contain a tab for choosing between all available options. On the command line, the SET command can be used to activate full-text indexing before a database is created, and the CREATE INDEX command can be used to create a full-text index for an existing database:

  • SET FTINDEX true; CREATE DB input.xml
  • CREATE INDEX fulltext

The following indexing options are available:

  • Language: see below for more details (SET LANGUAGE EN).
  • Stemming: tokens are stemmed with the Porter Stemmer before being indexed (SET STEMMING true).
  • Case Sensitive: tokens are indexed in case-sensitive mode (SET CASESENS true).
  • Diacritics: diacritics are indexed as well (SET DIACRITICS true).
  • Stopword List: a stop word list can be defined to reduce the number of indexed tokens (SET STOPWORDS [filename]).
  • TF/IDF Scoring (removed with Version 7.3): TF/IDF-based scoring values were calculated and stored in the index (SET SCORING 0/1/2). This feature was removed in favor of the internal scoring model; see below for more details.
  • Removed with Version 7.3: The explicit option for wildcard queries was removed, as the fuzzy index now supports both wildcard and fuzzy queries.

Languages

The chosen language determines how the input text will be tokenized and stemmed. The basic code base and JAR file of BaseX come with built-in support for English and German. More languages are supported if the following libraries are found in the classpath:

  • lucene-stemmers-3.4.0.jar: includes Snowball and Lucene stemmers and extends language support to the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

The JAR files can also be found in the zip and exe distribution files of BaseX.

The following two queries, which both return true, demonstrate that stemming depends on the selected language:

"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "de"

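As a small contrast to the two queries above, the following sketch compares the same English term with and without the stemming option. It assumes that no stemming is applied when the option is not specified in the query, which is the default of the XQuery Full Text Recommendation.

```xquery
(: Without stemming, the token "indexing" is compared as is
   and is not expected to match "index". :)
"Indexing" contains text "index",

(: With stemming, both tokens are reduced to the same stem;
   this is the first query shown above. :)
"Indexing" contains text "index" using stemming
```

Under the stated assumption, the first expression should evaluate to false and the second one to true.
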
Scoring

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation defined. BaseX offers an internal scoring model which can be extended to different application scenarios.

With Version 7.3, TF/IDF scoring was discarded in favor of the internal scoring model, which proved to yield better results for XML documents in most cases. The score of a full-text result is calculated by taking the number of found terms and their frequency in a single text node into account. Terms will be ranked higher if they are found in short texts.
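
Score values can be bound to a variable with the score keyword of the XQuery Full Text Recommendation. The following query is a minimal sketch; the in-memory document and the hit element are made up for illustration, and the concrete score values depend on the internal scoring model, so they are not listed here.

```xquery
(: Hypothetical sample data: one short and one longer text node. :)
let $doc :=
  <doc>
    <a>house</a>
    <a>a house on a hill near the lake</a>
  </doc>
(: Bind each hit to $a and its score value to $s. :)
for $a score $s in $doc/a[text() contains text 'house']
order by $s descending
return <hit score="{ $s }">{ string($a) }</hit>
```

If short texts are ranked higher, as described above, the first a element would be expected to appear first in the result.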

Thesaurus

BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why a query such as

'computers' contains text 'hardware'
  using thesaurus default

will return false. However, if the thesaurus is specified, the result will be true:

'computers' contains text 'hardware'
  using thesaurus at 'XQFTTS_1_0_4/TestSources/usability2.xml'

The format of the thesaurus files must be the same as the format of the thesauri provided by the XQuery and XPath Full Text 1.0 Test Suite. It is an XML format whose structure is defined by an XML Schema.

Fuzzy Querying

In addition to the official recommendation, BaseX supports fuzzy querying. The XQFT grammar was enhanced by the FTMatchOption using fuzzy to allow for approximate searches in full texts. By default, the standard full-text index already supports the efficient execution of fuzzy searches.

Document 'doc.xml':

<doc>
   <a>house</a>
   <a>hous</a>
   <a>haus</a>
</doc>

Command: CREATE DB doc.xml; CREATE INDEX fulltext

Query:

//a[text() contains text 'house' using fuzzy]

Result:

<a>house</a>
<a>hous</a>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the LSERROR property (default: SET LSERROR 0). The query above yields two results, as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
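
The error limit described above can be sketched as a small XQuery function. This is an illustration, not the internal implementation: the function name local:max-errors is hypothetical, and the sketch assumes that the division by 4 rounds down to an integer.

```xquery
(: Hypothetical helper: maximum number of allowed errors for a query term.
   Assumption: the token length is divided by 4 with integer division,
   and the result is never smaller than 1. :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

for $term in ('house', 'documentation')
return concat($term, ': ', local:max-errors($term))
```

Under these assumptions, the query returns "house: 1" and "documentation: 3". A limit of 1 is consistent with the example above: “hous” differs from “house” by one error and is returned, whereas “haus” differs by two errors and is not.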

Mixed Content

When working with so-called narrative XML documents, such as HTML, TEI, or DocBook documents, you typically have mixed content, i.e., elements containing a mix of text and markup, such as:

<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see XQuery and XPath Full Text 1.0 Use Cases.

To enable this kind of search, whitespace chopping must be turned off when importing XML documents by setting the option CHOP to OFF (default: SET CHOP ON). In the GUI, you find this option in Database → New… → Parsing → Chop Whitespaces. A query such as //p[. contains text 'real text'] will then match the example paragraph above.

Note that the node structure is completely ignored by the full-text tokenizer: The contains text expression applies all full-text operations to the string value of its left operand. As a consequence, the ft:mark and ft:extract functions (see Full-Text Functions) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real'])
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])
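
That the full-text tokenizer works on the string value of the left operand can be checked with an in-memory element, for which the CHOP option for importing documents is not relevant. The following sketch reuses the example paragraph from above; the variable name $p is chosen for illustration only.

```xquery
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return (
  (: The string value concatenates all descendant text nodes. :)
  string($p),
  (: The phrase spans the q element and the text node that follows it. :)
  $p contains text 'real text'
)
```

The first item should be the string "This is only an illustrative example, not a real text.", and the second one should be true, as the phrase is matched across element boundaries.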

Note that BaseX does not support the ignore option (without content) of the W3C XQuery Full Text 1.0 Recommendation. This means that it is not possible to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow. Here is an example document:

<p>This text is provided for illustrative<note>Serving as an example or explanation.</note> purposes only.</p>

The ignore option would enable you to search for the string “illustrative purposes”:

//p[. contains text 'illustrative purposes' without content note]

For more examples, see XQuery and XPath Full Text 1.0 Use Cases.

As BaseX does not support the ignore option, it raises error FTST0007 when it encounters without content in a full-text contains expression.

Functions

Some additional Full-Text Functions have been added to BaseX to extend the official language recommendation with useful features, such as explicitly requesting the score value of an item, marking the hits of a full-text request, or directly accessing the full-text index with the default index options.

Changelog

Version 7.3
  • Removed: The trie index, which was specialized for wildcard queries, was removed. The fuzzy index now supports both wildcard and fuzzy queries.
  • Removed: TF/IDF scoring was discarded in favor of the internal scoring model.