Full-Text
This article is part of the Query Portal. It summarizes the Full-Text features of BaseX.
Full-text retrieval is an essential query feature for working with XML documents, and BaseX was the first query processor that fully supported the W3C XQuery Full Text 1.0 Recommendation. This page lists some peculiarities and extensions of the BaseX implementation.
Query Evaluation
BaseX offers different evaluation strategies for XQFT queries. The choice
depends on the input data and the existence of a full-text index. The query compiler tries
to optimize and speed up queries by applying a full-text index structure whenever
possible and useful. Three evaluation strategies are available: the standard sequential
database scan, a full-text index-based evaluation, and a hybrid one combining both strategies (see "XQuery Full Text implementation in BaseX").
Query optimization and selection of the most efficient evaluation strategy are done
fully automatically. The output of the query optimizer indicates which
evaluation plan is chosen for a specific query. It can be inspected by activating verbose
querying (Command: SET VERBOSE ON) or by opening the Query Info in the GUI.
The message
Applying full-text index
indicates that the full-text index is applied to speed up query evaluation. A second message,
Removing path with no index results
indicates that the index does not yield any results for the specified term, so the corresponding path is removed from the query. If the index is not applied, it sometimes helps to rewrite the same query in a different way to give the compiler a second chance.
Options
To support a wide variety of scenarios, the available full-text index can handle different
combinations of the match options defined in the XQuery Full Text Recommendation.
By default, most indexing options are disabled. The GUI dialogs for creating new databases
or displaying the database properties contain a tab for choosing between all available
options. On the command line, the SET command can be used to activate
full-text indexing for new databases, and the CREATE INDEX command creates a full-text index for existing databases:
SET FTINDEX on; CREATE DB FILENAME.xml
CREATE INDEX fulltext
The following indexing options are available:
- Language: see below for more details (SET LANGUAGE EN)
- Support Wildcards: a trie-based index can be applied to support wildcard searches (SET WILDCARDS ON)
- Stemming: tokens are stemmed with the Porter Stemmer before being indexed (SET STEMMING ON)
- Case Sensitive: tokens are indexed in case-sensitive mode (SET CASESENS ON)
- Diacritics: diacritics are indexed as well (SET DIACRITICS ON)
- TF/IDF Scoring: TF/IDF-based scoring values are calculated and stored in the index (SET SCORING 0/1/2; see below for details)
- Stopword List: a stop word list can be defined to reduce the number of indexed tokens (SET STOPWORDS [filename])
Caution: The index will only be applied if the activated options are also specified in the query:
Index Options: Case Sensitive, Stemming ON
Query 1 (wrong):
//*[text() contains text 'inform']
Query 2 (correct):
//*[text() contains text 'inform' using case sensitive using stemming]
Query 3 (correct):
declare ft-option using case sensitive using stemming; //*[text() contains text 'inform']
Languages
The chosen language determines how the input text is tokenized and stemmed. The core code base and JAR
file of BaseX come with built-in support for English and German. More languages are supported if the following libraries are placed in the classpath (Version 6.8):
- lucene-stemmers-3.4.0.jar: includes Snowball and Lucene stemmers and extends language support to the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
- igo-0.4.3.jar: includes a tokenizer for Japanese texts. In addition to the library, the file ipadic.zip must be unzipped either in the current directory or in the project’s Home Directory. A big thank you goes out to Toshio HIRAI for integrating the lexer into BaseX!
The JAR files can also be found in the zip and exe distribution files of BaseX.
Scoring
The XQuery Full Text Recommendation allows for the usage of scoring models
and values within queries, with scoring being completely implementation-defined.
BaseX offers an efficient internal scoring model that can be easily extended to
different application scenarios. Additionally, BaseX allows scoring values to be
stored within the full-text index structure (at the cost of additional time and
memory). Three scoring types are currently available, which can be selected
with the SCORING property (default: SET SCORING 0):
- 0: This algorithm yields the best results for general-purpose use cases. It calculates the scoring value from the length of a term and its frequency in a single text node. This algorithm is also applied if no index exists, or if the index cannot be applied in a query.
- 1: Standard TF/IDF algorithm, which treats document nodes as document units.
- 2: Each text node is treated as a document unit in the TF/IDF algorithm. This variant is an alternative to type 1 if the database contains few, large XML files.
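The difference between types 1 and 2 lies only in what counts as a "document unit". The following Python sketch illustrates this with a textbook TF/IDF formula. It is not the formula BaseX uses internally; the function name tf_idf, the sample data, and the whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(units: list[list[str]], term: str) -> list[float]:
    """Textbook TF/IDF: one score per unit (assumed formula, not BaseX's)."""
    n = len(units)
    # Document frequency: number of units that contain the term
    df = sum(1 for u in units if term in u)
    if df == 0:
        return [0.0] * n
    idf = math.log(n / df)
    scores = []
    for u in units:
        # Term frequency, normalized by the number of tokens in the unit
        tf = Counter(u)[term] / len(u) if u else 0.0
        scores.append(tf * idf)
    return scores


# Two hypothetical documents, each given as a list of text nodes
docs = [["xml databases", "xml query"], ["full text search"]]

# Type 1 (sketch): every document is one unit
doc_units = [" ".join(nodes).split() for nodes in docs]
# Type 2 (sketch): every text node is one unit
node_units = [node.split() for nodes in docs for node in nodes]

by_doc = tf_idf(doc_units, "xml")
by_node = tf_idf(node_units, "xml")
print(len(by_doc), len(by_node))
```

With only a few large documents, the document-level variant has very few units to compare, so the inverse document frequency barely discriminates between terms. Treating text nodes as units yields more units and thus finer-grained scores.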
Querying Using Thesaurus
BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why a query such as
'computers' contains text 'hardware' using thesaurus default
will return false. However, if a thesaurus is specified, the result will be true:
'computers' contains text 'hardware' using thesaurus at 'XQFTTS_1_0_4/TestSources/usability2.xml'
The format of the thesaurus files must be the same as the format of the thesauri provided by the XQuery and XPath Full Text 1.0 Test Suite. Each file is an XML document whose structure is defined by an XSD schema.
Fuzzy Querying
In addition to the official recommendation, BaseX supports fuzzy querying.
The XQFT grammar was enhanced by the FTMatchOption using fuzzy
to allow for approximate searches in full texts.
By default, the standard full-text index already supports the efficient
execution of fuzzy searches.
Document 'doc.xml':
<doc> <a>foo bar</a> <a>foa bar</a> <a>faa bar</a> </doc>
Command: CREATE DB doc.xml; CREATE INDEX fulltext
Query:
//a[text() contains text 'foo' using fuzzy]
Result:
<a>foo bar</a> <a>foa bar</a>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed
errors is calculated by dividing the token length of a specified query term by 4,
with a minimum of 1 error. A static error distance can be set by adjusting
the LSERROR property (default: SET LSERROR 0).
The query above yields two results as there is no error between the query term
"foo" and the text node "foo bar", and one error between
"foo" and "foa bar".
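The error rule can be sketched in Python as follows. This is only an illustration, not the BaseX implementation. The function names levenshtein, max_errors and fuzzy_match are hypothetical. The sketch assumes that "dividing by 4" means integer division, that an LSERROR value of 0 selects the computed limit, and that tokens are separated by whitespace.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def max_errors(term: str, lserror: int = 0) -> int:
    """Allowed errors: static value if set, else length // 4, at least 1."""
    if lserror > 0:
        return lserror
    return max(1, len(term) // 4)


def fuzzy_match(text: str, term: str, lserror: int = 0) -> bool:
    """True if any whitespace-separated token is within the error limit."""
    limit = max_errors(term, lserror)
    return any(levenshtein(tok, term) <= limit for tok in text.split())


# The text nodes of doc.xml from the example above
texts = ["foo bar", "foa bar", "faa bar"]
hits = [t for t in texts if fuzzy_match(t, "foo")]
print(hits)  # ['foo bar', 'foa bar']
```

For the three-character term "foo", the computed limit is 1, so "faa" (two substitutions) is rejected while "foa" (one substitution) is accepted.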
Functions
Some additional Full-Text Functions have been added to BaseX to extend the official language recommendation with useful features, such as explicitly requesting the score value of an item, marking the hits of a full-text request, or directly accessing the full-text index with the default index options.
Error Messages
Along with the Full Text Recommendation, a number of new error codes and messages have been added to the specification and to BaseX. All errors are listed in the XQuery Errors overview.