Changes

Full-Text (edit)

Revision as of 10:41, 25 April 2022

1,339 bytes added , 10:41, 25 April 2022

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

=Introduction=

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

<syntaxhighlight lang="xquery">
"This is YOUR World" contains text "your world"
</syntaxhighlight>

It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: as an example, upper/lower case distinctions and diacritics (umlauts, accents, etc.) are removed and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks will be ignored. Thus, this query also yields true:

<syntaxhighlight lang="xquery">
"Well... Done!" contains text "well, done"
</syntaxhighlight>

The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:

<syntaxhighlight lang="xquery">
"one and two and three" contains text "and" occurs at least 2 times
</syntaxhighlight>

Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
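
The range modifiers can be compared side by side. The following sketch reuses the input string from the example above; the comments state the results that follow from the two occurrences of {{Code|and}}:

<syntaxhighlight lang="xquery">
(: true: "and" occurs exactly twice :)
"one and two and three" contains text "and" occurs exactly 2 times,
(: false: two occurrences exceed the upper limit of 1 :)
"one and two and three" contains text "and" occurs at most 1 times,
(: true: two occurrences lie within the range from 1 to 3 :)
"one and two and three" contains text "and" occurs from 1 to 3 times
</syntaxhighlight>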

==Combining Results==

In the given example, curly braces are used to combine multiple keywords:

<syntaxhighlight lang="xquery">
for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name
</syntaxhighlight>

The query will output the names of all countries with a religion element containing {{Code|sunni}} or {{Code|shia}}. The {{Code|any}} keyword is optional; it can be replaced with:

The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does:

<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
</syntaxhighlight>

The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:

<syntaxhighlight lang="xquery">
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
</syntaxhighlight>

Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to a high memory consumption. If you should encounter problems, simply use the {{Code|all}} keyword:

<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish' } all]/name
</syntaxhighlight>
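
As a minimal sketch of this equivalence, the following two expressions (with an input string made up for illustration) both yield {{Code|true}}, because each of them requires both terms to be present:

<syntaxhighlight lang="xquery">
'Christian and Jewish' contains text 'christian' ftand 'jewish',
'Christian and Jewish' contains text { 'christian', 'jewish' } all
</syntaxhighlight>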

==Positional Filters==

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<syntaxhighlight lang="xquery">
<xml>
  <text>There is some reason why ...</text>
  <text>For some good yet unknown reason, ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]
</syntaxhighlight>

…the first two texts will be returned as result, because there are at most three words between {{Code|some}} and {{Code|reason}}. Additionally, the {{Code|ordered}} keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that {{Code|all}} is required here to guarantee that only those hits will be accepted that contain all searched words.

The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

<syntaxhighlight lang="xquery">
("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
</syntaxhighlight>
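
For comparison, a variant of this query with a larger window (an illustrative modification, not part of the original example) selects the second string, as the tokens from {{Code|A}} to {{Code|E}} span exactly five words:

<syntaxhighlight lang="xquery">
("A C D", "A B C D E")[. contains text { "A", "E" } all window 5 words]
</syntaxhighlight>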

Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:

<syntaxhighlight lang="xquery">
'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence
</syntaxhighlight>

By the way: In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.

Last but not least, three specifiers exist to filter results depending on the position of a hit:

* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<syntaxhighlight lang="xquery">
"Respect Upper Case" contains text "Upper" using case sensitive
</syntaxhighlight>

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are treated as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<syntaxhighlight lang="xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</syntaxhighlight>

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

<syntaxhighlight lang="xquery">
"catch" contains text "catches" using stemming
</syntaxhighlight>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):

<syntaxhighlight lang="xquery">
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
</syntaxhighlight>

* Related terms such as synonyms can be found with the sophisticated [[#Thesaurus|Thesaurus]] option.

* <code>.{min,max}</code> matches ''min''–''max'' number of characters.

<syntaxhighlight lang="xquery">
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards
</syntaxhighlight>
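
Other wildcard forms can be tried in the same way. The following sketch assumes the usual BaseX wildcard syntax, in which {{Code|.}} stands for a single arbitrary character and {{Code|.+}} for one or more characters; the input string is made up for illustration:

<syntaxhighlight lang="xquery">
(: "h.t" can match the token "hat"; "interest.+" can match "interesting" :)
"The hat is interesting" contains text { "h.t", "interest.+" } all using wildcards
</syntaxhighlight>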

This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

A list of all language codes that are available on your system can be retrieved as follows:

<syntaxhighlight lang="xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</syntaxhighlight>

By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

* Whitespaces are interpreted as token delimiters.

* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.

* Paragraph delimiters are newlines (<code>&#xa;</code>).

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [https://en.osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

<syntaxhighlight lang="xquery">
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</syntaxhighlight>

==Scoring==

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be:

<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text in ("A", "A B", "A B C")
let score $score := $text contains text "A"
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

Scoring values can be further processed to compute custom values:

<syntaxhighlight lang="xquery">
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</syntaxhighlight>

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

<syntaxhighlight lang="xquery">
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:open('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</syntaxhighlight>

==Thesaurus==

One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus default
</syntaxhighlight>

If a thesaurus is employed…

<syntaxhighlight lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</syntaxhighlight>

…the result will be {{Code|true}}:

<syntaxhighlight lang="xquery">
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</syntaxhighlight>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:

<syntaxhighlight lang="xquery">
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</syntaxhighlight>

More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].

==Fuzzy Querying==

In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>fuzzy</code> match option to allow for approximate results in full texts:

'''Document 'doc.xml'''':

<syntaxhighlight lang="xml">
<doc>
  <a>house</a>
  <a>hous</a>
</doc>
</syntaxhighlight>

'''Query:'''

<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy]
</syntaxhighlight>

'''Result:'''

<syntaxhighlight lang="xml">
<a>house</a>
<a>hous</a>
</syntaxhighlight>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.

A user-defined value can be adjusted globally via the {{Option|LSERROR}} option or via an additional argument:

<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy 3 errors]
</syntaxhighlight>
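
The effect of the error limit can be sketched with a string that is two edits away from the query term. The input {{Code|haus}} is chosen for illustration, and the {{Code|errors}} argument is assumed to be available (see the changelog entry for Version 9.6). With the default limit derived from the length of {{Code|house}}, only one error is tolerated, so the first expression is not expected to match, whereas the second one raises the limit explicitly:

<syntaxhighlight lang="xquery">
(: The Levenshtein distance between "haus" and "house" is 2 (o → a, missing e) :)
'haus' contains text 'house' using fuzzy,
'haus' contains text 'house' using fuzzy 2 errors
</syntaxhighlight>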

=Mixed Content=

When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:

<syntaxhighlight lang="xml">
<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
</syntaxhighlight>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

To enable this kind of search, it is recommended to:

* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code>. This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
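
The difference between the two predicates can be reproduced without a database. In this minimal sketch, the example paragraph is bound to a variable (the name {{Code|$p}} is arbitrary), and no index is involved:

<syntaxhighlight lang="xquery">
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return (
  (: true: the string value of the element contains "real text" :)
  $p contains text 'real text',
  (: false: no single text child of p contains both words :)
  $p/text() contains text 'real text'
)
</syntaxhighlight>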

Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

<syntaxhighlight lang="xquery">
(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real']),
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])
</syntaxhighlight>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search for. See the following example (visit [[XQuery Update]] to learn more about updates):

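<syntaxhighlight lang="xquery">
let $docs := db:open('docs')
return db:create(
  'index-db',  (: hypothetical name of the second database :)
  $docs update { delete node .//footnote },  (: 'footnote' is an assumed element name :)
  $docs/db:path(.),
  map { 'ftindex': true() }
)
</syntaxhighlight>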

=Functions=

|-

| {{Code|decomposition}}

| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.

|}

* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:

<syntaxhighlight lang="xquery">
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
</syntaxhighlight>

* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:

<syntaxhighlight lang="xquery">
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
</syntaxhighlight>

* Various string functions exist that take an optional collation as argument. The following function calls return {{Code|a}} and {{Code|1 2 3}}:

<syntaxhighlight lang="xquery"><nowiki>
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")
</nowiki></syntaxhighlight>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

<syntaxhighlight lang="xquery">
(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')
</syntaxhighlight>

=Changelog=

; Version 9.6:

* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error

; Version 9.5:

* Removed: Scoring propagation.

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:

* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:

* Added: [[#Collations|Collations]] support.

; Version 7.3:

* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.

* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.

CG

Bureaucrats, editor, reviewer, Administrators

13,550

edits
