Changes

725 bytes added ,  16:58, 10 April 2019
no edit summary
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the full-text features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Full-text retrieval in XML documents is an essential requirement in many use cases. Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
=Introduction=
The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.
This is a simple example for a basic full-text expression:
</pre>
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
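The four range modifiers can be compared side by side in a small sketch. The input string is made up for illustration, and the modifiers are applied to a {{Code|distance}} constraint here. Two words lie between “one” and “four”, so each of the four comparisons is expected to return <code>true</code>:
<pre class="brush:xquery">
(: Illustrative input: 'two' and 'three' lie between the two search terms :)
let $text := 'one two three four'
return (
  $text contains text { 'one', 'four' } all words distance exactly 2 words,
  $text contains text { 'one', 'four' } all words distance at least 2 words,
  $text contains text { 'one', 'four' } all words distance at most 2 words,
  $text contains text { 'one', 'four' } all words distance from 1 to 3 words
)
</pre>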
==Combining Results==
<pre class="brush:xquery">
'Mary told me, “I will survive!” This is what Mary told me.' contains text { 'will', 'told' } all words same sentence
</pre>
Sentences are delimited by end-of-sentence markers ({{Code|.}}, {{Code|!}}, {{Code|?}}, etc.), and newline characters are treated as paragraph delimiters. In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.
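The paragraph rule can be tried out with a string that contains a single newline. This is a minimal sketch with made-up input; the newline is built with <code>codepoints-to-string(10)</code> to avoid escaping issues. If newlines act as paragraph delimiters as described, the first comparison should yield <code>false</code> and the second <code>true</code>:
<pre class="brush:xquery">
(: Two lines separated by a newline character (code point 10) :)
let $text := 'The first line.' || codepoints-to-string(10) || 'The second line.'
return (
  $text contains text { 'first', 'second' } all words same paragraph,
  $text contains text { 'first', 'second' } all words different paragraph
)
</pre>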
Last but not least, three specifiers exist to filter results depending on the position of a hit:
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre class="brush:xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
==Languages==
The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:
<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>
By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:
* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers for the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish. With {{Version|9.2}}, support for Arabic texts was added.
* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
==Fuzzy Querying==
In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>using fuzzy</code> match option to allow for approximate results in full texts:
'''Document 'doc.xml'''':
</pre>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. Fuzzy search is also supported by the full-text index.
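The error limit described above can be sketched as a small helper function. Both the function name <code>local:max-errors</code> and the use of integer division (rounding down) are assumptions made for illustration; the actual rounding applied by BaseX is not stated here:
<pre class="brush:xquery">
(: Hypothetical helper: token length divided by 4 (assumed to round down),
   but never less than 1 :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

for $term in ('house', 'retrieval', 'implementation')
return $term || ': ' || local:max-errors($term)
</pre>
Under these assumptions, the 5-letter term “house” is allowed one error, which is consistent with “hous” being accepted as a match.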
=Mixed Content=
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
To enable this kind of search, it is recommended to:
* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code> (default: <code>SET CHOP ON</code>). This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.
A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
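The difference between the two predicates can be reproduced without a database. The following sketch uses a hypothetical inline paragraph (not necessarily the example referred to above): the first test looks at the string value of the whole element and is expected to return <code>true</code>, whereas the second test checks each text node on its own and is expected to return <code>false</code>:
<pre class="brush:xquery">
(: Hypothetical paragraph with mixed content :)
let $p := <p>This is <b>real</b> text.</p>
return (
  (: string value of the element: 'This is real text.' :)
  exists($p[. contains text 'real text']),
  (: direct text nodes only: 'This is ' and ' text.' :)
  exists($p[text() contains text 'real text'])
)
</pre>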
Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
=Changelog=
 
; Version 9.2:
* Added: Arabic stemmer.
; Version 8.0: