Changes

544 bytes added, 13:34, 20 July 2022 (minor edit)
Text replacement - "db:pre(" to "db:get("
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>
This simple approach has proven to consistently deliver good results, and in particular when little is known about the structure of the queried XML documents.
Scoring values can be further processed to compute custom values:
<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>
Scoring is supported within full-text expressions, by the {{Function|Full-Text|ft:search}} function, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:get('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
==Thesaurus==
One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus default
</syntaxhighlight>
If a thesaurus is employed…
<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>
…the result will be {{Code|true}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>
Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:
<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>
More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].
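Since more than one thesaurus file can be specified, the following sketch shows how two files might be combined in a single expression. It follows the parenthesized list form of the thesaurus option in the XQuery Full Text grammar; both file names are hypothetical placeholders, and the result depends entirely on the entries these files contain.

<syntaxhighlight lang="xquery">
(: Sketch only: 'thesaurus.xml' and 'custom.xml' are hypothetical files
   that are assumed to exist and to comply with the thesaurus schema :)
'hardware' contains text 'computers'
  using thesaurus (at 'thesaurus.xml', at 'custom.xml' relationship 'NT')
</syntaxhighlight>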
==Fuzzy Querying==
</syntaxhighlight>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. A user-defined value can be set globally via the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>) or for a single query via an additional argument:
<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>
Fuzzy search is also supported by the full-text index.
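The default error limit can be sketched in plain XQuery. The helper <code>local:max-errors</code> is hypothetical and only mirrors the rule described above (token length divided by 4, with a minimum of 1); the use of integer division is an assumption, not a statement about the internal implementation. The second part of the query runs a fuzzy match on two in-memory nodes, assuming that {{Option|LSERROR}} has its default value.

<syntaxhighlight lang="xquery">
(: Hypothetical helper: approximates the default error limit for a query term.
   Integer division is an assumption about the rounding. :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

(: In-memory nodes used for illustration :)
let $nodes := (<a>house</a>, <a>hous</a>)
return (
  (: error limit that the rule yields for the five-letter term 'house' :)
  local:max-errors('house'),
  (: nodes that match the term within the default error limit :)
  $nodes[text() contains text 'house' using fuzzy]
)
</syntaxhighlight>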
=Mixed Content=
To enable this kind of search, it is advisable to:
* Keep ''whitespace stripping'' turned off when importing XML documents. This can be done by ensuring that {{Option|STRIPWS}} is disabled. This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Strip Whitespaces'').
* Keep automatic indentation turned off. Ensure that the [[Serialization|serialization parameter]] {{Code|indent}} is set to {{Code|no}}.
A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
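The difference between the two predicates can be reproduced without a database. The sketch below uses a hypothetical in-memory paragraph (not the example paragraph of this article) in which the phrase is split by an inline element. As no database is involved, the full-text index plays no role here; only the matching behavior is shown.

<syntaxhighlight lang="xquery">
(: Hypothetical paragraph: the phrase 'real text' spans an inline element :)
let $p := <p>This is not a <q>real</q> text.</p>
return (
  (: the string value of the element is searched: the phrase is found :)
  exists($p[. contains text 'real text']),
  (: each text child is searched on its own: the phrase is not found :)
  exists($p[text() contains text 'real text'])
)
</syntaxhighlight>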
Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the {{Function|Full-Text|ft:mark}} and {{Function|Full-Text|ft:extract}} functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
<syntaxhighlight lang="xquery">
</syntaxhighlight>
BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database that excludes all information you want to avoid searching for. See the following example (visit [[XQuery Update]] to learn more about updates):
<syntaxhighlight lang="xquery">
let $docs := db:get('docs')
return db:create(
'index-db',
=Changelog=
 
; Version 9.6:
* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error.

; Version 9.5:
* Removed: Scoring propagation.

; Version 9.2:
* Added: Arabic stemmer.

; Version 8.0:
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:
* Added: [[#Collations|Collations]] support.

; Version 7.3:
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.