Changes

372 bytes removed, 18:34, 1 December 2023
m
Text replacement - "<syntaxhighlight lang="xquery">" to "<pre lang='xquery'>"
This is a simple example of a basic full-text expression:
<pre lang='xquery'>
"This is YOUR World" contains text "your world"
</pre>
It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: as an example, upper/lower case is normalized, diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields {{Code|true}}:
<pre lang='xquery'>
"Well... Done!" contains text "well, done"
</pre>
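To make the tokenization step visible, the following sketch assumes that the {{Code|ft:tokenize}} function of the BaseX Full-Text Module is available; it should return the normalized tokens of a string. The second expression switches to case-sensitive matching, so the first example above should no longer match:
<pre lang='xquery'>
(: expected tokens: "well", "done" :)
ft:tokenize("Well... Done!"),
(: case is not normalized here, so "YOUR World" differs from "your world" :)
"This is YOUR World" contains text "your world" using case sensitive
</pre>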
The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:
<pre lang='xquery'>
"one and two and three" contains text "and" occurs at least 2 times
</pre>
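Besides {{Code|at least}}, the XQuery Full Text grammar defines further range forms for {{Code|occurs}}. The following sketch reuses the input string from above; the expected results in the comments follow from counting the two occurrences of "and":
<pre lang='xquery'>
(: exactly two occurrences: true :)
"one and two and three" contains text "and" occurs exactly 2 times,
(: at most one occurrence: false :)
"one and two and three" contains text "and" occurs at most 1 times,
(: between one and three occurrences: true :)
"one and two and three" contains text "and" occurs from 1 to 3 times
</pre>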
In the following example, curly braces are used to combine multiple keywords:
<pre lang='xquery'>
for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does:
<pre lang='xquery'>
doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
</pre>
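The boolean keywords can also be tried without the factbook database. The strings below are made up for illustration; both expressions should yield {{Code|true}}:
<pre lang='xquery'>
(: both terms must occur :)
"Sunni and Shia" contains text "sunni" ftand "shia",
(: the first term must occur, the second one must not :)
"Sunni" contains text "sunni" ftand ftnot "shia"
</pre>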
The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:
<pre lang='xquery'>
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to high memory consumption. If you encounter problems, use the {{Code|all}} keyword instead:
<pre lang='xquery'>
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name
</pre>
A popular retrieval operation is to filter texts by the distance of the searched words. In this query…
<pre lang='xquery'>
<xml>
<text>There is some reason why ...</text>
The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?
<pre lang='xquery'>
("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
</pre>
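As a point of comparison, here is a variation with a larger window. In the second string, "A" and "E" span five tokens, so a window of five words should be just wide enough, while the first string does not contain "E" at all:
<pre lang='xquery'>
(: expected result: "A B C D E" :)
("A C D", "A B C D E")[. contains text { "A", "E" } all window 5 words]
</pre>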
Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:
<pre lang='xquery'>
'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence
</pre>
* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre lang='xquery'>
"Respect Upper Case" contains text "Upper" using case sensitive
</pre>
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre lang='xquery'>
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
<pre lang='xquery'>
"catch" contains text "catches" using stemming
</pre>
* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
<pre lang='xquery'>
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
* <code>.{min,max}</code> matches ''min''–''max'' number of characters.
<pre lang='xquery'>
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards
</pre>
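The remaining wildcard forms from the XQuery Full Text specification ({{Code|.}}, {{Code|.?}}, {{Code|.*}}, {{Code|.+}}) can be tried in the same way. The strings below are invented for illustration; each expression should yield {{Code|true}}:
<pre lang='xquery'>
(: ".*" matches zero or more characters :)
"wildcards" contains text "wild.*" using wildcards,
(: ".?" matches zero or one character :)
"colour" contains text "colo.?r" using wildcards,
(: ".{2,4}" matches two to four characters, here "ous" :)
"house" contains text "h.{2,4}e" using wildcards
</pre>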
A list of all language codes that are available on your system can be retrieved as follows:
<pre lang='xquery'>
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
<pre lang='xquery'>
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher the score will be:
<pre lang='xquery'>
(: Score values: 1 0.62 0.45 :)
for $text in ("A", "A B", "A B C")
Scoring values can be further processed to compute custom values:
<pre lang='xquery'>
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<pre lang='xquery'>
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),
One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:
<pre lang='xquery'>
'hardware' contains text 'computers'
using thesaurus default
…the result will be {{Code|true}}:
<pre lang='xquery'>
'hardware' contains text 'computers'
using thesaurus at 'thesaurus.xml'
The type of relationship and the level depth can be specified as well:
<pre lang='xquery'>
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
'''Query:'''
<pre lang='xquery'>
//a[text() contains text 'house' using fuzzy]
</pre>
A user-defined value can be adjusted globally via the {{Option|LSERROR}} option or via an additional argument:
<pre lang='xquery'>
//a[text() contains text 'house' using fuzzy 3 errors]
</pre>
Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the {{Function|Full-Text|ft:mark}} and {{Function|Full-Text|ft:extract}} functions will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
<pre lang='xquery'>
(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real'])
BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):
<pre lang='xquery'>
let $docs := db:get('docs')
return db:create(
* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:
<pre lang='xquery'>
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:
<pre lang='xquery'>
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
</pre>
* Various string functions exist that take an optional collation as argument. The following function calls yield {{Code|a}} and {{Code|1 2 3}} as results:
<pre lang='xquery'><nowiki>
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")
If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:
<pre lang='xquery'>
(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')
