Changes

491 bytes added ,  10:41, 25 April 2022
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>
This simple approach has proven to consistently deliver good results, and in particular when little is known about the structure of the queried XML documents.
Scoring values can be further processed to compute custom values:
<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>
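As a further illustration of custom score processing, the following sketch computes a weighted average instead of a plain one. The weights map and its values are assumptions made up for this example. Only {{Code|ft:score}} and the full-text expression are taken from the text above.

<syntaxhighlight lang="xquery">
(: Hypothetical weights: a hit for 'a' counts twice as much as a hit for 'b' :)
let $weights := map { 'a': 2, 'b': 1 }
let $scores :=
  for $term in map:keys($weights)
  (: Multiply the score of each term by its weight :)
  return $weights($term) * ft:score('a b c' contains text { $term })
(: Normalize by the sum of all weights :)
return sum($scores) div sum($weights?*)
</syntaxhighlight>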
Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
==Thesaurus==
One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
using thesaurus default
</syntaxhighlight>
If a thesaurus is employed…

<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>

…the result will be {{Code|true}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:
<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>
More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].
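Since more than one thesaurus file can be specified, a query might combine several of them in a parenthesized list, as defined by the grammar of the specification. The following is a minimal sketch. The file names {{Code|general.xml}} and {{Code|technical.xml}} are hypothetical, and both files are assumed to follow the thesaurus format shown above.

<syntaxhighlight lang="xquery">
(: Both file names are placeholders for this example :)
'hardware' contains text 'computers'
  using thesaurus (
    at 'general.xml',
    (: Restrict the second thesaurus to narrower terms, at most 2 levels deep :)
    at 'technical.xml' relationship 'NT' at most 2 levels
  )
</syntaxhighlight>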
==Fuzzy Querying==
</syntaxhighlight>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. A user-defined value can be set globally via the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>) or via an additional argument:
<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>
Fuzzy search is also supported by the full-text index.
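The default error threshold described above can be sketched in plain XQuery. The function {{Code|local:max-errors}} is a hypothetical helper for illustration, not part of BaseX. It assumes that the division by 4 is an integer division and that at least one error is always allowed.

<syntaxhighlight lang="xquery">
(: Hypothetical helper: token length divided by 4, with a minimum of 1 :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

(: Returns: house: 1, database: 2, internationalization: 5 :)
for $term in ('house', 'database', 'internationalization')
return $term || ': ' || local:max-errors($term)
</syntaxhighlight>

Under these assumptions, a five-letter term such as “house” tolerates a single error, which matches the “hous” example.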
=Mixed Content=
When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:
<syntaxhighlight lang="xml">
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}
=Changelog=

; Version 9.6:
* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error

; Version 9.5:
* Removed: Scoring propagation.

; Version 9.2:
* Added: Arabic stemmer.

; Version 8.0:
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:
* Added: [[#Collations|Collations]] support.

; Version 7.3:
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.