Changes

Full-Text (edit)

Revision as of 18:34, 1 December 2023

372 bytes removed , 18:34, 1 December 2023

m

Text replacement - "<syntaxhighlight lang="xquery">" to "<pre lang='xquery'>"

This is a simple example for a basic full-text expression:

<pre lang='xquery'>

"This is YOUR World" contains text "your world"

</pre>

It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. During tokenization, several normalizations take place, many of which can hardly be simulated with plain XQuery: for example, case distinctions and diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. In addition, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields {{Code|true}}:

<pre lang='xquery'>

"Well... Done!" contains text "well, done"

</pre>
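
The effect of tokenization can be explored with two more comparisons. This is a minimal sketch that assumes the default options (case insensitive, no stemming); the second query term is an illustrative variation and does not come from the original examples. A single search string is matched as a sequence of tokens, so a changed token order is expected to fail:

<pre lang='xquery'>

(: expected to yield true: case is normalized during tokenization :)
"This is YOUR World" contains text "WORLD",
(: expected to yield false: the tokens occur, but not in this order :)
"Well... Done!" contains text "done well"

</pre>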

The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:

<pre lang='xquery'>

"one and two and three" contains text "and" occurs at least 2 times

</pre>
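
Besides {{Code|at least}}, the XQuery Full Text grammar also defines the range modifiers {{Code|exactly}}, {{Code|at most}} and {{Code|from ... to}}. The following sketch reuses the input string from above; the variable name {{Code|$text}} is only chosen for illustration:

<pre lang='xquery'>

let $text := "one and two and three"
return (
  (: "and" occurs twice, so this should yield true :)
  $text contains text "and" occurs exactly 2 times,
  (: two occurrences exceed the upper limit, so this should yield false :)
  $text contains text "and" occurs at most 1 times,
  (: two occurrences lie within the range, so this should yield true :)
  $text contains text "and" occurs from 1 to 3 times
)

</pre>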

In the given example, curly braces are used to combine multiple keywords:

<pre lang='xquery'>

for $country in doc('factbook')//country

where $country//religions[text() contains text { 'Sunni', 'Shia' } any]

The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does:

<pre lang='xquery'>

doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name

</pre>
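
The three operators can be tried out on a plain string without any database. The input string below is an assumption made for illustration and is not taken from the factbook document:

<pre lang='xquery'>

let $text := "Sunni and Shia communities"
return (
  (: both terms occur: expected to yield true :)
  $text contains text "sunni" ftand "shia",
  (: the second term is missing: expected to yield false :)
  $text contains text "sunni" ftand "hindu",
  (: one matching term is sufficient: expected to yield true :)
  $text contains text "sunni" ftor "hindu",
  (: first term occurs, second term must not occur: expected to yield true :)
  $text contains text "sunni" ftand ftnot "hindu"
)

</pre>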

The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:

<pre lang='xquery'>

for $text in ("New York", "new conditions")

return $text contains text "New" not in "New York"

</pre>

Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to high memory consumption. If you encounter problems, use the {{Code|all}} keyword instead:

<pre lang='xquery'>

doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name

</pre>
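
The difference between {{Code|all}} and {{Code|any}} can be checked on a plain string. This is a small sketch with an assumed input string; it does not depend on the factbook document:

<pre lang='xquery'>

let $text := "Christian and Jewish traditions"
return (
  (: every listed term occurs: expected to yield true :)
  $text contains text { "Christian", "Jewish" } all,
  (: "Muslim" is missing: expected to yield false :)
  $text contains text { "Christian", "Muslim" } all,
  (: one matching term is sufficient: expected to yield true :)
  $text contains text { "Christian", "Muslim" } any
)

</pre>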

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<pre lang='xquery'>

<xml>

<text>There is some reason why ...</text>

The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

<pre lang='xquery'>

("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]

</pre>
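
As a further illustration of {{Code|window}}, the following sketch uses an assumed input string of four tokens. A window of three tokens can cover the first and third token, but not the first and fourth:

<pre lang='xquery'>

let $text := "red green blue yellow"
return (
  (: "red green blue" spans 3 tokens: expected to yield true :)
  $text contains text { "red", "blue" } all window 3 words,
  (: "red ... yellow" spans 4 tokens: expected to yield false :)
  $text contains text { "red", "yellow" } all window 3 words
)

</pre>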

Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:

<pre lang='xquery'>

'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence

</pre>

* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<pre lang='xquery'>

"Respect Upper Case" contains text "Upper" using case sensitive

</pre>
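
The effect of the option becomes visible when the same input is queried with a lower-case term. A minimal sketch, assuming that no other match options are set:

<pre lang='xquery'>

(: default (insensitive): expected to yield true :)
"Respect Upper Case" contains text "upper",
(: case is respected, "upper" differs from "Upper": expected to yield false :)
"Respect Upper Case" contains text "upper" using case sensitive,
(: identical case: expected to yield true :)
"Respect Upper Case" contains text "Upper" using case sensitive

</pre>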

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<pre lang='xquery'>

"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive

</pre>
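
For comparison, the following sketch contrasts the default behavior with the {{Code|sensitive}} setting. The short input string is chosen for illustration only:

<pre lang='xquery'>

(: default (insensitive): "Ä" and "A" are treated alike, expected to yield true :)
"Äpfel" contains text "Apfel",
(: diacritics are respected: expected to yield false :)
"Äpfel" contains text "Apfel" using diacritics sensitive,
(: identical diacritics: expected to yield true :)
"Äpfel" contains text "Äpfel" using diacritics sensitive

</pre>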

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

<pre lang='xquery'>

"catch" contains text "catches" using stemming

</pre>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):

<pre lang='xquery'>

"You and me" contains text "you or me" using stop words ("and", "or"),

"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"

</pre>

* <code>.{min,max}</code> matches between ''min'' and ''max'' characters.

<pre lang='xquery'>

"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards

</pre>
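
The quantifier can be tested on a single token. In this sketch, the input string and the patterns are assumptions made for illustration; a wildcard pattern is expected to match a complete token:

<pre lang='xquery'>

(: "ou" fills exactly two positions between "h" and "se": expected to yield true :)
"house" contains text "h.{2,2}se" using wildcards,
(: three to four characters would be required: expected to yield false :)
"house" contains text "h.{3,4}se" using wildcards,
(: ".*" matches zero or more characters: expected to yield true :)
"house" contains text "hou.*" using wildcards

</pre>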

A list of all language codes that are available on your system can be retrieved as follows:

<pre lang='xquery'>

declare namespace locale = "java:java.util.Locale";

distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))

</pre>

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

<pre lang='xquery'>

"Indexing" contains text "index" using stemming,

"häuser" contains text "haus" using stemming using language "German"

</pre>

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher the scores will be:

<pre lang='xquery'>

(: Score values: 1 0.62 0.45 :)

for $text in ("A", "A B", "A B C")

Scoring values can be further processed to compute custom values:

<pre lang='xquery'>

let $terms := ('a', 'b')

let $scores := ft:score($terms ! ('a b c' contains text { . }))

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

<pre lang='xquery'>

let $string := 'a b'

return ft:score($string contains text 'a' ftand 'b'),

One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:

<pre lang='xquery'>

'hardware' contains text 'computers'

using thesaurus default

</pre>

…the result will be {{Code|true}}:

<pre lang='xquery'>

'hardware' contains text 'computers'

using thesaurus at 'thesaurus.xml'

</pre>

The type of relationship and the level depth can be specified as well:

<pre lang='xquery'>

(: BT: find broader terms; NT means narrower term :)

'computers' contains text 'hardware'

'''Query:'''

<pre lang='xquery'>

//a[text() contains text 'house' using fuzzy]

</pre>

A user-defined value can be adjusted globally via the {{Option|LSERROR}} option or via an additional argument:

<pre lang='xquery'>

//a[text() contains text 'house' using fuzzy 3 errors]

</pre>

Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the {{Function|Full-Text|ft:mark}} and {{Function|Full-Text|ft:extract}} functions will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

<pre lang='xquery'>

(: Structure is ignored; no highlighting: :)

ft:mark(//p[. contains text 'real'])

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information that should not be searched. See the following example (visit [[XQuery Update]] to learn more about updates):

<pre lang='xquery'>

let $docs := db:get('docs')

return db:create(

* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:

<pre lang='xquery'>

declare default collation 'http://basex.org/collation?lang=de;strength=secondary';

'Straße' = 'Strasse'

</pre>

* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:

<pre lang='xquery'>

for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w

</pre>

* Various string functions exist that take an optional collation as an argument. The following function calls return {{Code|a}} and {{Code|1 2 3}}:

<pre lang='xquery'><nowiki>

distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),

index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")

</nowiki></pre>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

<pre lang='xquery'>

(: returns 0 (both strings are compared as equal) :)

compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')

</pre>

CG

