Changes

Full-Text (edit)

Revision as of 13:34, 20 July 2022

432 bytes added , 13:34, 20 July 2022

m

Text replacement - "db:pre(" to "db:get("

</syntaxhighlight>

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:get('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>

==Thesaurus==

One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers' using thesaurus default
</syntaxhighlight>

If a thesaurus is employed…

<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>

…the result will be {{Code|true}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers' using thesaurus at 'thesaurus.xml'
</syntaxhighlight>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used. The type of relationship and the level depth can be specified as well:

<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>

More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].

==Fuzzy Querying==

</syntaxhighlight>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.

A user-defined value can be adjusted globally via the {{Option|LSERROR}} option or via an additional argument:

<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>
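As a minimal sketch, the two ways of controlling the error tolerance can be compared on a small inline document. The element names and sample strings below are made up for illustration and do not come from the original example. By the rule described above, the first expression would be expected to accept “house” and “hous”. The second allows up to three errors and should therefore also admit “haus”, which is two edits away from “house”.

<syntaxhighlight lang="xquery">
(: hypothetical sample data, not part of the original article :)
let $doc := <doc><a>house</a><a>hous</a><a>haus</a></doc>
return (
  (: default tolerance, derived from the length of the query term :)
  <default>{ $doc/a[text() contains text 'house' using fuzzy] }</default>,
  (: explicit limit of three errors, valid for this expression only :)
  <explicit>{ $doc/a[text() contains text 'house' using fuzzy 3 errors] }</explicit>
)
</syntaxhighlight>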

=Mixed Content=

To enable this kind of search, it is advisable to:

* Keep ''whitespace stripping'' turned off when importing XML documents. This can be done by ensuring that {{Option|STRIPWS}} is disabled. This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Strip Whitespaces'').
* Keep automatic indentation turned off. Ensure that the [[Serialization|serialization parameter]] {{Code|indent}} is set to {{Code|no}}.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
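The difference between the two predicates can be sketched on an inline element. The paragraph below is a hypothetical stand-in for a mixed-content paragraph, not the example from this article. Since <code>contains text</code> works on the string value of its left operand, the first predicate sees “This is real text.” as one string, whereas the second tests the text nodes “This is ” and “ text.” one by one.

<syntaxhighlight lang="xquery">
(: hypothetical mixed-content paragraph: the phrase spans an inline element :)
let $p := <p>This is <b>real</b> text.</p>
return (
  (: string value of the whole paragraph: the phrase is found :)
  <whole>{ exists($p[. contains text 'real text']) }</whole>,
  (: individual text nodes: no single node contains the phrase :)
  <nodes>{ exists($p[text() contains text 'real text']) }</nodes>
)
</syntaxhighlight>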

Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the {{Function|Full-Text|ft:mark}} and {{Function|Full-Text|ft:extract}} functions will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

</syntaxhighlight>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database that excludes all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):

let $docs := db:get('docs')
return db:create(
  'index-db',

=Changelog=

; Version 9.6:

* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error

; Version 9.5:

* Removed: Scoring propagation.

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:

* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:

* Added: [[#Collations|Collations]] support.

; Version 7.3:

* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.

* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.

CG

