Full-Text

Revision as of 11:01, 15 September 2020

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

=Introduction=

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

<syntaxhighlight lang="xquery">
"This is YOUR World" contains text "your world"
</syntaxhighlight>

It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: for example, upper/lower case distinctions and diacritics (umlauts, accents, etc.) are removed, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields true:

<syntaxhighlight lang="xquery">
"Well... Done!" contains text "well, done"
</syntaxhighlight>

The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:

<syntaxhighlight lang="xquery">
"one and two and three" contains text "and" occurs at least 2 times
</syntaxhighlight>

Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
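
The remaining range modifiers can be tried out with the same input string. This is only a small sketch; the values in the comments follow from counting the two occurrences of “and”:

<syntaxhighlight lang="xquery">
let $text := "one and two and three"
return (
  (: true: "and" occurs exactly twice :)
  $text contains text "and" occurs exactly 2 times,
  (: false: two occurrences exceed the upper limit :)
  $text contains text "and" occurs at most 1 times,
  (: true: two occurrences lie within the range :)
  $text contains text "and" occurs from 1 to 3 times
)
</syntaxhighlight>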

==Combining Results==

In the following example, curly braces are used to combine multiple keywords:

<syntaxhighlight lang="xquery">
for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name
</syntaxhighlight>

The query will output the names of all countries with a religion element containing {{Code|sunni}} or {{Code|shia}}. The {{Code|any}} keyword is optional; it can be replaced with {{Code|all}}, {{Code|any word}}, {{Code|all words}}, or {{Code|phrase}}.
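
The alternatives to {{Code|any}} defined by the XQuery Full Text Recommendation ({{Code|all}}, {{Code|any word}}, {{Code|all words}}, {{Code|phrase}}) differ in how the strings in the curly braces are split up and combined. The following sketch uses an invented input string to contrast them:

<syntaxhighlight lang="xquery">
let $text := "red green blue"
return (
  (: false: neither the phrase "red blue" nor "yellow" is found :)
  $text contains text { "red blue", "yellow" } any,
  (: true: at least one of the single words (red, blue, yellow) is found :)
  $text contains text { "red blue", "yellow" } any word,
  (: true: both single words (red, blue) are found :)
  $text contains text { "red blue" } all words,
  (: false: the strings are joined to the phrase "red blue", which does not occur :)
  $text contains text { "red", "blue" } phrase
)
</syntaxhighlight>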

The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does:

<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
</syntaxhighlight>

The keywords {{Code|not in}} are special: they are used to find tokens which are not part of a longer token sequence:

<syntaxhighlight lang="xquery">
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
</syntaxhighlight>

Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to a high memory consumption. If you should encounter problems, simply use the {{Code|all}} keyword:

<syntaxhighlight lang="xquery">
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name
</syntaxhighlight>

==Positional Filters==

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<syntaxhighlight lang="xquery">
<xml>
  <text>There is some reason why ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]
</syntaxhighlight>

…only the first text will be returned as result: it contains both words, and there are at most three words between {{Code|some}} and {{Code|reason}}. The {{Code|ordered}} keyword ensures that the words are found in the specified order, which is why the last text is excluded. Note that {{Code|all}} is required here to guarantee that only those hits will be accepted that contain all searched words.

The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

<syntaxhighlight lang="xquery">
("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
</syntaxhighlight>

Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:

<syntaxhighlight lang="xquery">
'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence
</syntaxhighlight>

By the way: In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.

* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<syntaxhighlight lang="xquery">
"Respect Upper Case" contains text "Upper" using case sensitive
</syntaxhighlight>

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

<syntaxhighlight lang="xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</syntaxhighlight>

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

<syntaxhighlight lang="xquery">
"catch" contains text "catches" using stemming
</syntaxhighlight>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):

<syntaxhighlight lang="xquery">
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
</syntaxhighlight>

* Related terms such as synonyms can be found with the sophisticated [[#Thesaurus|Thesaurus]] option.

* <code>.{min,max}</code> matches ''min''–''max'' number of characters.

<syntaxhighlight lang="xquery">
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards
</syntaxhighlight>
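
Besides the <code>.{min,max}</code> form, the XQuery Full Text Recommendation defines the wildcard indicators <code>.</code>, <code>.?</code>, <code>.*</code> and <code>.+</code>. The following sketch, with invented input strings, shows one query per indicator:

<syntaxhighlight lang="xquery">
(: true: "." matches exactly one character :)
"cat" contains text "c.t" using wildcards,
(: true: ".?" matches zero or one character :)
"wildcard" contains text "wild.?card" using wildcards,
(: false: ".+" requires at least one character between "c" and "t" :)
"ct" contains text "c.+t" using wildcards,
(: true: ".*" matches zero or more characters :)
"interesting" contains text "interest.*" using wildcards
</syntaxhighlight>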

This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

A list of all language codes that are available on your system can be retrieved as follows:

<syntaxhighlight lang="xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</syntaxhighlight>

By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

* Paragraph delimiters are newlines (<code>&#xa;</code>).

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [https://en.osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

<syntaxhighlight lang="xquery">
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</syntaxhighlight>

==Scoring==

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher scores will be:

<syntaxhighlight lang="xquery">
(: Score values: 1 0.62 0.45 :)
for $text score $score in ("A", "A B", "A B C")[. contains text "A"]
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</syntaxhighlight>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

Please note that scores will only be computed if a parent expression requests them:

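<syntaxhighlight lang="xquery">
(: Computes and returns a scoring value. :)
let score $score := <x>Hello Universe</x> contains text "hello"
return $score,

(: Here, the full-text expression is bound with a plain let clause, so no
   parent expression requests a score when it is evaluated. :)
let $result := <x>Hello Universe</x> contains text "hello"
let score $score := $result
return $score
</syntaxhighlight>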

Scores will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates. In the following query, all returned scores are equal:

<syntaxhighlight lang="xquery">
let $text := "A B C"
(: Score of a combined full-text selection :)
let score $s1 := $text contains text "A" ftand "B C"
(: Score propagated through two successive predicates :)
let score $s5 := $text[. contains text "A"][. contains text "B C"]
return ($s1, $s5)
</syntaxhighlight>

==Thesaurus==

BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why queries such as:

<syntaxhighlight lang="xquery">
'computers' contains text 'hardware'
  using thesaurus default
</syntaxhighlight>

will return <code>false</code>. However, if the thesaurus is specified, then the result will be <code>true</code>:

<syntaxhighlight lang="xquery">
'computers' contains text 'hardware'
  using thesaurus at 'XQFTTS_1_0_4/TestSources/usability2.xml'
</syntaxhighlight>

The format of the thesaurus files must be the same as the format of the thesauri provided by the [https://dev.w3.org/2007/xpath-full-text-10-test-suite XQuery and XPath Full Text 1.0 Test Suite]. It is an XML document whose structure is defined by an [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema].

==Fuzzy Querying==

'''Document 'doc.xml'''':

<syntaxhighlight lang="xml">
<doc>
  <a>house</a>
  <a>hous</a>
</doc>
</syntaxhighlight>

'''Query:'''

<syntaxhighlight lang="xquery">
//a[text() contains text 'house' using fuzzy]
</syntaxhighlight>

'''Result:'''

<syntaxhighlight lang="xml">
<a>house</a>
<a>hous</a>
</syntaxhighlight>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, with a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
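
The error threshold described above can be illustrated with a small helper function. This is only a sketch: <code>local:max-errors</code> is a hypothetical function that is not part of BaseX, and it assumes that the division by 4 is rounded down:

<syntaxhighlight lang="xquery">
(: Hypothetical helper: allowed errors = token length divided by 4
   (rounding down is an assumption), but at least 1 :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

(: Yields "house: 1" and "encyclopedia: 3" :)
for $term in ("house", "encyclopedia")
return $term || ": " || local:max-errors($term)
</syntaxhighlight>

Under this assumption, a five-letter term such as “house” tolerates a single error, which matches the result shown above.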

Fuzzy search is also supported by the full-text index.

=Mixed Content=

When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:

<syntaxhighlight lang="xml">
<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
</syntaxhighlight>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases]. To enable this kind of search, it is advisable to:

* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code> (default: <code>SET CHOP ON</code>). This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
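
The difference between searching the string value of an element and searching its single text nodes can be reproduced without a database. The following sketch applies both kinds of tests to the example paragraph, which is constructed inline:

<syntaxhighlight lang="xquery">
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return (
  (: true: the string value of the paragraph contains "real text" :)
  $p contains text 'real text',
  (: false: no child text node of the paragraph contains both words :)
  $p/text() contains text 'real text',
  (: true: the text node inside the q element contains "real" :)
  $p//text() contains text 'real'
)
</syntaxhighlight>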

Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

<syntaxhighlight lang="xquery">
(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real']),
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])
</syntaxhighlight>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search for. See the following example (visit [[XQuery Update]] to learn more about updates):

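<syntaxhighlight lang="xquery">
let $docs := db:open('docs')
return db:create(
  (: Assumed arguments: the database name 'index-db' and the footnote
     element are placeholders chosen for illustration :)
  'index-db',
  $docs update { delete node .//footnote },
  $docs/db:path(.),
  map { 'ftindex': true() }
)
</syntaxhighlight>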

=Functions=

{|
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}

* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:

<syntaxhighlight lang="xquery">
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
</syntaxhighlight>

* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:

<syntaxhighlight lang="xquery">
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
</syntaxhighlight>

* Various string functions exist that take an optional collation as argument. The following functions give us {{Code|a}} and {{Code|1 2 3}} as results:

<syntaxhighlight lang="xquery"><nowiki>
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")
</nowiki></syntaxhighlight>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

<syntaxhighlight lang="xquery">
(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')
</syntaxhighlight>

=Changelog=

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:
