Changes

Full-Text (edit)

Revision as of 13:47, 19 August 2021

519 bytes added , 13:47, 19 August 2021

no edit summary

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [https://en.osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.
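As a minimal sketch of how stemming is requested in a query, the following expression uses the standard <code>using stemming</code> and <code>using language</code> match options. The sample strings are illustrative only. With the built-in English stemmer, both tokens are expected to be reduced to the same stem, so the expression should yield {{Code|true}}:

<syntaxhighlight lang="xquery">
(: Illustrative strings; 'en' selects the English stemmer. :)
'houses' contains text 'house'
  using stemming
  using language 'en'
</syntaxhighlight>

For languages that are not built in, the same syntax applies once the corresponding library is available in the classpath.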

(: Score values: 1 0.62 0.45 :)

for $text in ("A", "A B", "A B C")

let score $score := $text contains text "A"

order by $score descending

return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>

</syntaxhighlight>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

Scoring values can be further processed to compute custom values:

let $terms := ('a', 'b')

let $scores := ft:score($terms ! ('a b c' contains text { . }))

return avg($scores)

</syntaxhighlight>

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

let $string := 'a b'

return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')

order by $s descending

return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']

order by $s descending

return $s || ': ' || $n

</syntaxhighlight>

==Thesaurus==

One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:

'hardware' contains text 'computers'

using thesaurus default

</syntaxhighlight>

If a thesaurus is employed…

<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>

…the result will be {{Code|true}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:

(: BT: find broader terms; NT means narrower term :)

'computers' contains text 'hardware'

using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels

</syntaxhighlight>

More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].

==Fuzzy Querying==

</syntaxhighlight>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results, as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. A user-defined value can be set globally via the {{Option|LSERROR}} option or, since {{Version|9.6}}, via an additional argument:

<syntaxhighlight lang="xquery">//a[text() contains text 'house' using fuzzy 3 errors]</syntaxhighlight>
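The effect of an explicit error limit can be sketched on a few literal strings. The candidate tokens below are made up for illustration, and the query assumes the <code>using fuzzy … errors</code> syntax described above (Version 9.6 or later). Under the plain Levenshtein definition, “hous” and “horse” are one edit away from “house”, whereas “home” is two edits away and would therefore not be expected to match:

<syntaxhighlight lang="xquery">
(: Illustrative tokens; allow at most one Levenshtein error per token. :)
for $token in ('house', 'hous', 'horse', 'home')
where $token contains text 'house' using fuzzy 1 errors
return $token
</syntaxhighlight>

Raising the limit makes the search more tolerant at the cost of more false positives, which is why a small value is usually a reasonable starting point.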

=Mixed Content=

When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:

</syntaxhighlight>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
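A minimal sketch of such a search across elements, using a hypothetical paragraph with inline markup: the full-text expression is evaluated on the string value of the element, so the phrase is expected to match although its two words are located in different nodes.

<syntaxhighlight lang="xquery">
(: Hypothetical mixed-content paragraph; the element names are illustrative. :)
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return $p contains text 'real text'
</syntaxhighlight>

Note that this sketch queries an in-memory element; for stored documents, whitespace handling at import time can influence whether words in adjacent elements remain separate tokens.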

To enable this kind of search, it is recommended to:

</syntaxhighlight>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):
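The idea can be sketched on an in-memory fragment, assuming a hypothetical <code>footnote</code> element that should be excluded from the search. The copy is modified with an XQuery Update <code>copy/modify/return</code> expression, and the phrase is then expected to match across the position where the footnote used to be:

<syntaxhighlight lang="xquery">
(: Hypothetical input; the element names are assumptions for illustration. :)
let $origin :=
  <p>The results were significant<footnote>See appendix B.</footnote> in all trials.</p>
(: Create a copy without the footnotes; the original remains unchanged. :)
let $clean :=
  copy $c := $origin
  modify delete node $c//footnote
  return $c
return $clean contains text 'significant in all'
</syntaxhighlight>

In a real setup, the cleaned copy would be stored in a separate database, which is then used for full-text requests.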

|-

| {{Code|decomposition}}

| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.

|}

</nowiki></syntaxhighlight>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

=Changelog=

; Version 9.6:

* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error

; Version 9.5:

* Removed: Scoring propagation.

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:

* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:

* Added: [[#Collations|Collations]] support.

; Version 7.3:

* Removed: Trie index, which was specialized for wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.

* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.

CG

