Changes

62 bytes added, 16:58, 10 April 2019
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the full-text features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Full-text retrieval in XML documents is an essential requirement in many use cases. Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
=Introduction=
The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.
This is a simple example for a basic full-text expression:
</pre>
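As a minimal sketch of how both strings and XML nodes can serve as search context, the following query may help. The input string and the inline <code>doc</code> element are illustrative assumptions, not taken from a real database:

<pre class="brush:xquery">
(: a plain string as search context; matching is case-insensitive by default :)
"This is YOUR World" contains text "your world",

(: an XML node as search context; the inline document is hypothetical :)
let $doc := <doc><title>Full-Text Search</title></doc>
return $doc/title[. contains text "search"]
</pre>

Under these assumptions, the first expression should yield <code>true</code>, and the second one should return the <code>title</code> element.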
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
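The range modifiers can be sketched with the <code>occurs ... times</code> clause of the specification. The input strings are illustrative assumptions:

<pre class="brush:xquery">
(: "A" occurs three times, "B" occurs twice :)
"A B A B A" contains text "A" occurs exactly 3 times,     (: expected: true :)
"A B A B A" contains text "A" occurs at least 2 times,    (: expected: true :)
"A B A B A" contains text "A" occurs at most 2 times,     (: expected: false :)
"A B A B A" contains text "B" occurs from 1 to 2 times    (: expected: true :)
</pre>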
==Combining Results==
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre class="brush:xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
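A minimal sketch of the stemming option, assuming the built-in English stemmer is used (the input strings are illustrative):

<pre class="brush:xquery">
(: without stemming, the tokens "houses" and "house" differ :)
"Two houses" contains text "house",
(: with stemming, both tokens are expected to be reduced to the same base form :)
"Two houses" contains text "house" using stemming using language "en"
</pre>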
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish. With {{Version|9.2}}, support for Arabic texts was added.
* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
</pre>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
Fuzzy search is also supported by the full-text index.
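The error calculation described above can be tried out with a self-contained sketch. The inline document and its three <code>a</code> elements are assumptions chosen for illustration, and {{Code|using fuzzy}} is a BaseX-specific match option that is not part of the W3C specification:

<pre class="brush:xquery">
(: hypothetical inline document; in practice, the nodes would come from a database :)
let $doc :=
  <doc>
    <a>house</a>
    <a>hous</a>
    <a>haus</a>
  </doc>
(: query term "house" has 5 characters: 5 divided by 4 allows 1 error :)
return $doc/a[text() contains text 'house' using fuzzy]
</pre>

With the default error settings, this should return the elements with “house” (no error) and “hous” (one error), but not “haus”, which differs from the query term by two edit operations.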
To enable this kind of search, it is recommended to:
* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting the option {{Option|CHOP}} to <code>OFF</code>. This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.
A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
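The difference between the two queries can be sketched on an inline paragraph. The <code>p</code> element below is a hypothetical stand-in for a mixed-content paragraph, and no full-text index is involved as the node is not stored in a database:

<pre class="brush:xquery">
(: hypothetical mixed-content paragraph: "real" is wrapped in a child element :)
let $p := <p>This is <b>real</b> text.</p>
return (
  (: the string value "This is real text." is tokenized as a whole: match :)
  $p[. contains text 'real text'],
  (: the text nodes "This is " and " text." are checked separately: no match :)
  $p[text() contains text 'real text']
)
</pre>

Only the first predicate is expected to return the paragraph.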
=Changelog=
 
; Version 9.2:
 
* Added: Arabic stemmer.
; Version 8.0: