Changes

360 bytes added ,  15:14, 14 January 2021
This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
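As a minimal sketch of how stemming is requested in a query, the following expression uses the standard <code>using stemming</code> and <code>using language</code> match options of XQuery Full Text. The input strings are illustrative assumptions; with the built-in English stemmer, both terms are expected to be reduced to the same stem:
<syntaxhighlight lang="xquery">
(: Illustrative strings; 'en' selects the built-in English stemmer :)
'Indexing' contains text 'index' using stemming using language 'en'
</syntaxhighlight>
For languages that are not built in, the same syntax should apply once the corresponding stemmer library is found in the classpath.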
<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>
This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.
Scoring values can be further processed to compute custom values:
<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>
Please note that scoring propagation was removed with {{Mark|Version 9.5}}. The following expressions will now yield {{Code|0}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' and $string contains text 'b'),

for $n score $s in db:open('factbook')//religions[text() contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
Scoring is still supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:
<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>
 
The reason for removing scoring propagation was that storing the scoring values required additional memory, even when scoring was not needed.
==Thesaurus==
</syntaxhighlight>
The format of the thesaurus files must be the same as the format of the thesauri provided by the [https://dev.w3.org/2007/xpath-full-text-10-test-suite XQuery and XPath Full Text 1.0 Test Suite]. It is an XML format whose structure is defined by an [https://dev.w3.org/cvsweb/~checkout~/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd?rev=1.3;content-type=application%2Fxml XSD Schema].
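A thesaurus file in this format can be referenced with the standard <code>using thesaurus at</code> match option. The following is only a sketch: the file name <code>thesaurus.xml</code> and the search terms are hypothetical, and the result depends entirely on the entries of the referenced file:
<syntaxhighlight lang="xquery">
(: 'thesaurus.xml' is a hypothetical path to a file in the test suite format :)
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>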
==Fuzzy Querying==
=Mixed Content=
When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:
<syntaxhighlight lang="xml">
</syntaxhighlight>
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
To enable this kind of search, it is recommended to:
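The idea of searching across elements can be sketched with an inline element. The paragraph below is an illustrative assumption and not taken from a database. The full-text test is applied to the string value of the whole element (the context item <code>.</code>), so the phrase is expected to be found even though its two words are separated by markup:
<syntaxhighlight lang="xquery">
(: Hypothetical mixed-content paragraph, constructed inline for illustration :)
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
(: '.' refers to the complete element, so tokens of all descendant texts are searched :)
return $p[. contains text 'real text']
</syntaxhighlight>
A test on <code>text()</code> instead of <code>.</code> would look at each text node individually and would therefore not be expected to find a phrase that spans element boundaries.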
</syntaxhighlight>
BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search for. See the following example (visit [[XQuery Update]] to learn more about updates):
<syntaxhighlight lang="xquery">
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}
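As a rough sketch of how such an option might be passed, the following comparison assumes the BaseX collation URI syntax, in which options are appended to <code>http://basex.org/collation</code> as semicolon-separated key/value pairs. The {{Code|lang}} and {{Code|strength}} keys, their values and the compared strings are assumptions made for illustration:
<syntaxhighlight lang="xquery">
(: Assumed URI syntax: options are appended as key=value pairs, separated by semicolons :)
compare(
  'Strasse',
  'Straße',
  'http://basex.org/collation?lang=de;strength=primary;decomposition=full'
)
</syntaxhighlight>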
=Changelog=
 
; Version 9.5:
 
* Removed: Scoring propagation.
; Version 9.2: