Changes

519 bytes added ,  13:47, 19 August 2021
no edit summary
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
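As a minimal sketch of how stemming is requested in a query, the following expression uses the standard <code>using stemming</code> and <code>using language</code> match options. The input strings are illustrative, and the query assumes that the built-in English stemmer reduces “houses” and “house” to the same stem:

<syntaxhighlight lang="xquery">
(: Illustrative sketch: the strings are made up for this example.
   Assumes the English stemmer maps 'houses' and 'house' to the same stem. :)
'three houses' contains text 'house'
  using stemming
  using language 'en'
</syntaxhighlight>

For languages from the additional libraries, the language code would be replaced accordingly, provided that the corresponding JAR file is found in the classpath.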
<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>
This simple approach has proven to consistently deliver good results, and in particular when little is known about the structure of the queried XML documents.
Scoring values can be further processed to compute custom values:
<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>
Scoring is supported within full-text expressions, by the {{Function|Full-Text|ft:search}} function, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
==Thesaurus==
One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:
<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
using thesaurus default
</syntaxhighlight>
If a thesaurus is employed…

<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>

…the result will be {{Code|true}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:
<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>
More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].
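Since more than one thesaurus file can be specified, the XQuery Full Text grammar allows several files to be listed in parentheses. The following is a sketch only: both file names are hypothetical, and the relationship and level range are chosen for illustration:

<syntaxhighlight lang="xquery">
(: Hypothetical file names; each entry may carry its own
   relationship and level range. :)
'hardware' contains text 'computers'
  using thesaurus (
    at 'thesaurus1.xml',
    at 'thesaurus2.xml' relationship 'NT' at most 2 levels
  )
</syntaxhighlight>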
==Fuzzy Querying==
</syntaxhighlight>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. A user-defined value can be set globally via the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>) or, since {{Version|9.6}}, via an additional argument:

<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>

Fuzzy search is also supported by the full-text index.
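The default error limit can be approximated with a small helper function. This is a sketch, not the BaseX implementation: <code>local:max-errors</code> is a hypothetical name, and the use of integer division is an assumption, as the rounding of the quotient is not specified here:

<syntaxhighlight lang="xquery">
(: Hypothetical helper, not part of BaseX: token length divided by 4
   (integer division assumed), with a minimum of 1 error. :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

for $term in ('hous', 'house', 'university')
return $term || ': ' || local:max-errors($term)
</syntaxhighlight>

Under these assumptions, terms with fewer than eight characters tolerate a single error, which is consistent with “hous” matching the query term “house”.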
=Mixed Content=
When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:
<syntaxhighlight lang="xml">
</syntaxhighlight>
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
To enable this kind of search, it is advisable to:
</syntaxhighlight>
BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):
<syntaxhighlight lang="xquery">
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}
</nowiki></syntaxhighlight>
If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:
<syntaxhighlight lang="xquery">
=Changelog=

; Version 9.6:
* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error.

; Version 9.5:
* Removed: Scoring propagation.

; Version 9.2:
* Added: Arabic stemmer.

; Version 8.0:
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:
* Added: [[#Collations|Collations]] support.

; Version 7.3:
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.