Changes

Full-Text (edit)

Revision as of 18:31, 1 December 2023

420 bytes removed , 18:31, 1 December 2023

m

Text replacement - "</syntaxhighlight>" to "</pre>"

"This is YOUR World" contains text "your world"

</pre>

It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. During tokenization, several normalizations take place, many of which can hardly be simulated with plain XQuery: for example, differences in upper/lower case and diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields true:

"Well... Done!" contains text "well, done"

</pre>
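The normalization steps described above can be approximated outside of XQuery. The following Python sketch is only an illustration, not BaseX's tokenizer: the function names <code>tokenize</code> and <code>contains_text</code> are hypothetical, stemming is left out, and a phrase is assumed to match when its tokens appear as a contiguous run.

```python
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Lower-case the text, strip diacritics, and split on non-word characters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks so that "Ä" and "A" produce the same token.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", stripped)


def contains_text(source: str, query: str) -> bool:
    """Return True if the query tokens occur as a contiguous run in the source."""
    src, qry = tokenize(source), tokenize(query)
    if not qry:
        return False
    return any(src[i:i + len(qry)] == qry for i in range(len(src) - len(qry) + 1))


print(contains_text("This is YOUR World", "your world"))
print(contains_text("Well... Done!", "well, done"))
```

Both calls mirror the two queries above: case and punctuation disappear during tokenization, so only the token sequences are compared.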

The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:

"one and two and three" contains text "and" occurs at least 2 times

</pre>

Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
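As a rough illustration of how the range modifiers relate to each other, the following Python sketch counts token occurrences and checks them against a lower and an upper bound. The helper names <code>count_token</code> and <code>occurs</code> are hypothetical, and the tokenizer is a simplification.

```python
import re


def count_token(text: str, token: str) -> int:
    """Count how often a token occurs in the lower-cased, tokenized text."""
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(token.lower())


def occurs(text: str, token: str, at_least: int = 0, at_most=None) -> bool:
    """Check the occurrence count against a range.

    exactly n      -> at_least=n, at_most=n
    at least n     -> at_least=n
    at most n      -> at_most=n
    from a to b    -> at_least=a, at_most=b
    """
    n = count_token(text, token)
    return n >= at_least and (at_most is None or n <= at_most)


print(occurs("one and two and three", "and", at_least=2))
```

All four modifiers reduce to the same bounds check, which is why a single function with two optional limits is sufficient for this sketch.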

where $country//religions[text() contains text { 'Sunni', 'Shia' } any]

return $country/name

</pre>

The query will output the names of all countries with a religion element containing {{Code|sunni}} or {{Code|shia}}. The {{Code|any}} keyword is optional; it can be replaced with:

doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name

</pre>

The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:

for $text in ("New York", "new conditions")

return $text contains text "New" not in "New York"

</pre>
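The idea behind {{Code|not in}} can be sketched in Python as follows. This is a simplified model under the assumption that a match is rejected whenever it lies completely inside an occurrence of the excluded phrase; the function names are hypothetical.

```python
def find_phrase(tokens, phrase):
    """Return the start positions at which the phrase occurs in the token list."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def contains_not_in(text, query, excluded):
    """True if the query occurs at least once outside of the excluded phrase."""
    tokens = text.lower().split()
    q, ex = query.lower().split(), excluded.lower().split()
    # Token spans (start, end) covered by the excluded phrase.
    excluded_spans = [(s, s + len(ex)) for s in find_phrase(tokens, ex)]
    for start in find_phrase(tokens, q):
        end = start + len(q)
        if not any(lo <= start and end <= hi for lo, hi in excluded_spans):
            return True
    return False


for text in ("New York", "new conditions"):
    print(text, contains_not_in(text, "New", "New York"))
```

For the first string the only occurrence of "new" is part of "New York" and is therefore discarded; for the second string the occurrence stands on its own.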

Due to the complex data model of the XQuery Full Text specification, using {{Code|ftand}} may lead to high memory consumption. If you encounter problems, use the {{Code|all}} keyword instead:

doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name

</pre>

==Positional Filters==

<text>The reason why some people ...</text>

</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]

</pre>

…the first two texts will be returned as the result, because there are at most three words between {{Code|some}} and {{Code|reason}}. Additionally, the {{Code|ordered}} keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that {{Code|all}} is required here to guarantee that only those hits are accepted that contain all searched words.

("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]

</pre>
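The {{Code|distance}}, {{Code|ordered}} and {{Code|window}} filters can be modelled with token positions. The Python sketch below is an approximation under stated assumptions: distance is taken as the number of tokens between two consecutive matches, a window as the span from the first to the last match, and all function names are hypothetical.

```python
import re
from itertools import product


def match_sets(text, terms):
    """Yield every combination of positions, one position per search term."""
    tokens = re.findall(r"\w+", text.lower())
    per_term = [[i for i, t in enumerate(tokens) if t == term.lower()]
                for term in terms]
    # If any term is missing, the product is empty and nothing matches.
    return product(*per_term)


def all_ordered_distance(text, terms, at_most):
    """All terms, in query order, with at most `at_most` tokens in between."""
    for combo in match_sets(text, terms):
        pairs = list(zip(combo, combo[1:]))
        ordered = all(a < b for a, b in pairs)
        close = all(b - a - 1 <= at_most for a, b in pairs)
        if ordered and close:
            return True
    return False


def all_window(text, terms, size):
    """All terms must fit into a window of `size` consecutive tokens."""
    return any(max(c) - min(c) + 1 <= size for c in match_sets(text, terms))


print(all_ordered_distance("The reason why some people ...", ["some", "reason"], 3))
print([s for s in ("A C D", "A B C D E") if all_window(s, ["A", "E"], 3)])
```

In this model the first call fails because "reason" precedes "some", and the window query returns no string: "A C D" lacks "E", and in "A B C D E" the two terms span five tokens.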

Sometimes it is useful to select only texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:

'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence

</pre>

By the way: In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.

"Respect Upper Case" contains text "Upper" using case sensitive

</pre>

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive

</pre>

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

"catch" contains text "catches" using stemming

</pre>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes up too much space (a standard stop word list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and is available online at http://files.basex.org/etc/stopwords.txt):

"You and me" contains text "you or me" using stop words ("and", "or"),

"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"

</pre>

* Related terms such as synonyms can be found with the sophisticated [[#Thesaurus|Thesaurus]] option.

"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards

</pre>
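The two wildcard patterns used above happen to be valid regular expressions as well, so their effect can be illustrated with Python's <code>re</code> module. This is only a sketch: it assumes that each pattern must match one complete token, it relies on the overlap between the wildcard syntax and regular expressions for these specific patterns, and the function name is hypothetical.

```python
import re


def contains_wildcards(text, patterns):
    """True if any pattern matches at least one complete token of the text."""
    tokens = re.findall(r"\w+", text)
    return any(
        re.fullmatch(pattern, token, flags=re.IGNORECASE) is not None
        for pattern in patterns
        for token in tokens
    )


text = "This may be interesting in the year 2000"
print(contains_wildcards(text, ["interest.*", "2.{3,3}"]))
```

Here <code>interest.*</code> matches the token "interesting", and <code>2.{3,3}</code> matches "2000" because the digit 2 is followed by exactly three more characters.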

This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

declare namespace locale = "java:java.util.Locale";

distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))

</pre>

By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

"Indexing" contains text "index" using stemming,

"häuser" contains text "haus" using stemming using language "German"

</pre>

==Scoring==

order by $score descending

return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>

</pre>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

let $scores := ft:score($terms ! ('a b c' contains text { . }))

return avg($scores)

</pre>

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

order by $s descending

return $s || ': ' || $n

</pre>

==Thesaurus==

'hardware' contains text 'computers'

using thesaurus default

</pre>

If a thesaurus is employed…

</entry>

</thesaurus>

</pre>

…the result will be {{Code|true}}:

'hardware' contains text 'computers'

using thesaurus at 'thesaurus.xml'

</pre>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

'computers' contains text 'hardware'

using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels

</pre>

More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].

</doc>

</pre>

'''Query:'''

//a[text() contains text 'house' using fuzzy]

</pre>

'''Result:'''

<a>house</a>

</pre>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.

//a[text() contains text 'house' using fuzzy 3 errors]

</pre>
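The fuzzy matching rule can be sketched in Python as follows. This is not BaseX's implementation: the function names are hypothetical, and the default error limit is assumed to be the length of the query term divided by 4 and rounded down. An explicit limit corresponds to a query such as {{Code|using fuzzy 3 errors}}.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(token: str, term: str, max_errors=None) -> bool:
    """Compare a text token with a query term, allowing a limited number of errors."""
    if max_errors is None:
        # Assumption: default limit is the term length divided by 4, rounded down.
        max_errors = len(term) // 4
    return levenshtein(token.lower(), term.lower()) <= max_errors


print(fuzzy_match("house", "house"), fuzzy_match("hous", "house"))
print(fuzzy_match("home", "house"), fuzzy_match("home", "house", max_errors=3))
```

With the query term "house" (five characters), one error is tolerated under this assumption, so "hous" is accepted while "home", which needs two edits, is only accepted once the limit is raised explicitly.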

=Mixed Content=

<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>

</pre>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

(: Single text nodes are addressed: results will be highlighted: :)

ft:mark(//p[.//text() contains text 'real'])

</pre>
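Why a phrase such as “real text” is only found when searching across elements can be illustrated with the paragraph from above. The following Python sketch uses the standard <code>xml.etree.ElementTree</code> module and plain substring tests in place of a full-text engine; the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>This is only an illustrative <hi>example</hi>, "
    "not a <q>real</q> text.</p>"
)

# The individual text nodes, in document order.
text_nodes = list(p.itertext())

# The string value of the element: all text nodes concatenated.
string_value = "".join(text_nodes)

# No single text node contains both words of the phrase ...
phrase_in_single_node = any("real text" in node for node in text_nodes)
# ... but the concatenated string value does.
phrase_in_string_value = "real text" in string_value

print(text_nodes)
print(phrase_in_single_node, phrase_in_string_value)
```

The word "real" and the word "text" end up in different text nodes, so a search that addresses single text nodes cannot match the phrase, whereas a search on the whole paragraph can.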

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information that should not be searched. See the following example (visit [[XQuery Update]] to learn more about updates):

map { 'ftindex': true() }

)

</pre>

=Functions=

declare default collation 'http://basex.org/collation?lang=de;strength=secondary';

'Straße' = 'Strasse'

</pre>

* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:

for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w

</pre>

* Various string functions exist that take an optional collation as an argument. The following functions give us {{Code|a}} and {{Code|1 2 3}} as results:

distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),

index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")

</nowiki></pre>
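The effect of a collation with primary strength can be approximated in Python by comparing strings via a normalized key. This is a rough sketch and not a real collator: it assumes that primary strength ignores accents and case, and the names <code>primary_key</code>, <code>distinct_values</code> and <code>index_of</code> are hypothetical stand-ins for the XQuery functions.

```python
import unicodedata


def primary_key(s: str) -> str:
    """Approximate a primary-strength key: drop accents and case differences."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def distinct_values(items):
    """Keep the first item of every group that is equal under the key."""
    seen, result = set(), []
    for item in items:
        key = primary_key(item)
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result


def index_of(items, target):
    """1-based positions of all items that are equal to the target under the key."""
    wanted = primary_key(target)
    return [i for i, item in enumerate(items, 1) if primary_key(item) == wanted]


print(distinct_values(["a", "á", "à"]))
print(index_of(["a", "á", "à"], "a"))
```

Because all three strings share the same key, only the first one survives deduplication, and all three positions are reported as matches.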

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

(: returns 0 (both strings are compared as equal) :)

compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')

</pre>

=Changelog=

CG

