Changes

1,339 bytes added, 10:41, 25 April 2022
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the full-text features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
=Introduction=
The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.
This is a simple example of a basic full-text expression:
<syntaxhighlight lang="xquery">
"This is YOUR World" contains text "your world"
</syntaxhighlight>
It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: for example, differences in upper/lower case and diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields true:
<syntaxhighlight lang="xquery">
"Well... Done!" contains text "well, done"
</syntaxhighlight>
The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:
<syntaxhighlight lang="xquery">
"one and two and three" contains text "and" occurs at least 2 times
</syntaxhighlight>
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
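The range modifiers can be compared side by side. The following sketch reuses the input string of the previous example; the results given in the comments follow from the fact that {{Code|and}} occurs exactly twice in it:
<syntaxhighlight lang="xquery">
let $input := "one and two and three"
return (
  (: true: "and" occurs exactly twice :)
  $input contains text "and" occurs exactly 2 times,
  (: false: two occurrences exceed the upper limit of one :)
  $input contains text "and" occurs at most 1 times,
  (: true: two occurrences lie within the range from 1 to 3 :)
  $input contains text "and" occurs from 1 to 3 times
)
</syntaxhighlight>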
==Combining Results==
In the given example, curly braces are used to combine multiple keywords:
<syntaxhighlight lang="xquery">
for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name
</syntaxhighlight>
The query will output the names of all countries with a religion element containing {{Code|sunni}} or {{Code|shia}}. The {{Code|any}} keyword is optional; it can be replaced with:
The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does:
<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
</syntaxhighlight>
The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:
<syntaxhighlight lang="xquery">
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
</syntaxhighlight>
Due to the complex data model of the XQuery Full Text specification, the usage of {{Code|ftand}} may lead to high memory consumption. If you encounter problems, simply use the {{Code|all}} keyword:
<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name
</syntaxhighlight>
==Positional Filters==
A popular retrieval operation is to filter texts by the distance of the searched words. In this query…
<syntaxhighlight lang="xquery">
<xml>
  <text>There is some reason why ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]
</syntaxhighlight>
…the first two texts will be returned as result, because there are at most three words between {{Code|some}} and {{Code|reason}}. Additionally, the {{Code|ordered}} keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that {{Code|all}} is required here to guarantee that only those hits will be accepted that contain all searched words.
The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?
<syntaxhighlight lang="xquery">
("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
</syntaxhighlight>
Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:
<syntaxhighlight lang="xquery">
'Mary told me, “I will survive!” This is what Mary told me.' contains text { 'will', 'told' } all words same sentence
</syntaxhighlight>
Sentences are delimited by sentence-ending punctuation marks ({{Code|.}}, {{Code|!}}, {{Code|?}}, etc.), and newline characters are treated as paragraph delimiters. In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.
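The {{Code|different}} keyword can be illustrated in the same way. The following is a minimal sketch with a made-up input string, which is not taken from the specification. As the two search terms occur in separate sentences, the query is expected to yield {{Code|true}}:
<syntaxhighlight lang="xquery">
(: 'told' occurs in the first sentence, 'will' in the second one :)
'Mary told me. I will survive!' contains text { 'will', 'told' } all words different sentence
</syntaxhighlight>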
Last but not least, three specifiers exist to filter results depending on the position of a hit:
* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<syntaxhighlight lang="xquery">
"Respect Upper Case" contains text "Upper" using case sensitive
</syntaxhighlight>
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<syntaxhighlight lang="xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</syntaxhighlight>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
<syntaxhighlight lang="xquery">
"catch" contains text "catches" using stemming
</syntaxhighlight>
* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
<syntaxhighlight lang="xquery">
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
</syntaxhighlight>
* Related terms such as synonyms can be found with the sophisticated [[#Thesaurus|Thesaurus]] option.
* <code>.{min,max}</code> matches ''min''–''max'' number of characters.
<syntaxhighlight lang="xquery">
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards
</syntaxhighlight>
This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!
A list of all language codes that are available on your system can be retrieved as follows:
<syntaxhighlight lang="xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</syntaxhighlight>
By default, unless one of the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> is specified, a tokenizer for Western texts is used:
* Whitespace characters are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
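The result of the default tokenization can be inspected with the {{Code|ft:tokenize}} function of the [[Full-Text Module]]. The following sketch assumes that no additional options are set, so that case is normalized and punctuation is dropped; the input string is the one used in the introduction:
<syntaxhighlight lang="xquery">
(: returns the two tokens "well" and "done" :)
ft:tokenize("Well... Done!")
</syntaxhighlight>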
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
<syntaxhighlight lang="xquery">
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</syntaxhighlight>
==Scoring==
The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be:
<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>
This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.
Scoring values can be further processed to compute custom values:
<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>
Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
==Thesaurus==
One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus default
</syntaxhighlight>
If a thesaurus is employed…
<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>
…the result will be {{Code|true}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>
Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used. The type of relationship and the level depth can be specified as well:
<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>
More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].
==Fuzzy Querying==
In addition to the official recommendation, BaseX supports a fuzzy search feature, which is also supported by the full-text index. The XQFT grammar was enhanced by the <code>fuzzy</code> match option to allow for approximate results in full texts:
'''Document 'doc.xml'''':
<syntaxhighlight lang="xml">
<doc>
  <a>house</a>
  <a>hous</a>
  <a>haus</a>
</doc>
</syntaxhighlight>
'''Query:'''
<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy]
</syntaxhighlight>
'''Result:'''
<syntaxhighlight lang="xml">
<a>house</a>
<a>hous</a>
</syntaxhighlight>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. A custom value can be set globally via the {{Option|LSERROR}} option, or per query via an additional argument:
<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>
=Mixed Content=
When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:
<syntaxhighlight lang="xml">
<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
</syntaxhighlight>
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

To enable this kind of search, it is advisable to:
* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting the {{Option|CHOP}} option to <code>OFF</code> (default: <code>SET CHOP ON</code>). This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
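The difference between the two predicates can be reproduced without a database. The following sketch builds the example paragraph with a direct element constructor (the variable name <code>$p</code> is chosen for illustration only) and compares the string value of the element with its single text nodes:
<syntaxhighlight lang="xquery">
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return (
  (: true: the string value of the element contains "real text" :)
  $p contains text 'real text',
  (: false: no single child text node contains both tokens :)
  $p/text() contains text 'real text'
)
</syntaxhighlight>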
Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
<syntaxhighlight lang="xquery">
(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real'])
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])
</syntaxhighlight>
BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):
<syntaxhighlight lang="xquery">
let $docs := db:open('docs')
return db:create(
map { 'ftindex': true() }
)
</syntaxhighlight>
=Functions=
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details can be found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}
* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:
<syntaxhighlight lang="xquery">
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
</syntaxhighlight>
* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:
<syntaxhighlight lang="xquery">
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
</syntaxhighlight>
* Various string functions exist that take an optional collation as an argument. The following function calls yield {{Code|a}} and {{Code|1 2 3}}:
<syntaxhighlight lang="xquery"><nowiki>
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")
</nowiki></syntaxhighlight>
If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:
<syntaxhighlight lang="xquery">
(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')
</syntaxhighlight>
=Changelog=
 
; Version 9.6
* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error
 
; Version 9.5:
* Removed: Scoring propagation.
 
; Version 9.2:
* Added: Arabic stemmer.

; Version 8.0:
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:
* Added: [[#Collations|Collations]] support.

; Version 7.3:
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.