Changes

245 bytes removed ,  06:43, 29 November 2019
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
=Introduction=
The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to query both XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.
This is a simple example for a basic full-text expression:
</pre>
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
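As a brief illustration, the following sketch combines the range modifiers with the {{Code|occurs}} and {{Code|distance}} operators of the XQFT grammar. The input strings are made up for this example:

<pre class="brush:xquery">
(: 'one' must occur exactly twice in the input string :)
'one two one' contains text 'one' occurs exactly 2 times,
(: at most two words may lie between 'A' and 'D' :)
'A B C D' contains text 'A' ftand 'D' distance at most 2 words,
(: between one and three words may lie between 'A' and 'D' :)
'A B C D' contains text 'A' ftand 'D' distance from 1 to 3 words
</pre>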
==Combining Results==
<pre class="brush:xquery">
'Mary told me, “I will survive!” This is what Mary told me.' contains text { 'will', 'told' } all words same sentence
</pre>
Sentences are delimited by end-of-sentence markers ({{Code|.}}, {{Code|!}}, {{Code|?}}, etc.), and newline characters are treated as paragraph delimiters. In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.
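The scope operators accept the larger units as well. A minimal sketch, again with a made-up input string:

<pre class="brush:xquery">
(: both words must occur in different sentences :)
'Mary told me. I will survive!' contains text { 'will', 'told' } all words different sentence,
(: both words must occur in the same paragraph, i.e., without a newline in between :)
'Mary told me. I will survive!' contains text { 'will', 'told' } all words same paragraph
</pre>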
Last but not least, three specifiers exist to filter results depending on the position of a hit:
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre class="brush:xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
<pre class="brush:xquery">
"catch" contains text "catches" using stemming,
"Haus" contains text "Häuser" using stemming using language 'de'
</pre>
* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
<pre class="brush:xquery">
"You and me" contains text "you or me" using stop words ("and", "or"),
=BaseX Features=
==Languages==
The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:
<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>
By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:
* Whitespace characters are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Greek, Hindi, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
<pre class="brush:xquery">
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</pre>
==Fuzzy Querying==
In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>using fuzzy</code> match option to allow for approximate results in full texts:
'''Document 'doc.xml'''':
</pre>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. Fuzzy search is also supported by the full-text index.
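The rule for the number of allowed errors can be sketched in plain XQuery. This is an illustration only, not the BaseX implementation: the function name {{Code|local:max-errors}} is hypothetical, and integer division is an assumption, as the rule does not state how the quotient is rounded:

<pre class="brush:xquery">
(: hypothetical helper: allowed errors for a query term, assuming integer division :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};
(: 'house' has 5 characters: 5 idiv 4 = 1 :)
local:max-errors('house'),
(: 'database' has 8 characters: 8 idiv 4 = 2 :)
local:max-errors('database')
</pre>

Under these assumptions, one error is allowed for the query term “house”, so a text node such as “hous” is still within reach.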
=Mixed Content=
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
To enable this kind of search, it is advisable to:
* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code> (default: <code>SET CHOP ON</code>). This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
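The difference between searching the string value of an element and searching its text nodes can be reproduced without a database. The paragraph in the following sketch is a made-up stand-in for a mixed-content paragraph; as the element is constructed inline, no import options and no index are involved:

<pre class="brush:xquery">
let $p := <p>This is <b>real</b> text.</p>
return (
  (: the string value 'This is real text.' is searched: expected to match :)
  $p[. contains text 'real text'],
  (: each text node is searched separately: not expected to match :)
  $p[text() contains text 'real text']
)
</pre>

Only the first predicate should return the paragraph, as the phrase is spread over the {{Code|b}} element and the text node that follows it.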
Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
|-
| {{Code|strength}}
| Level of difference considered significant in comparisons. Four strengths are supported: {{Code|primary}}, {{Code|secondary}}, {{Code|tertiary}}, and {{Code|identical}}. As an example, in German:
* "Ä" and "A" are considered primary differences,
* "Ä" and "ä" are secondary differences,
* "Ä" and "A&amp;#x308;" (see http://www.fileformat.info/info/unicode/char/308/index.htm) are tertiary differences, and
* "A" and "A" are identical.
|-
| {{Code|decomposition}}
=Changelog=
 
; Version 9.2:
* Added: Arabic stemmer.
; Version 8.0:
* Removed: Trie index, which was specialized for wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.
 
[[Category:XQuery]]