Full-Text

From BaseX Documentation

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

=Introduction=

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to query both XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification.
This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

<pre lang='xquery'>
"This is YOUR World" contains text "your world"
</pre>

It yields {{Code|true}}, because the search string is ''tokenized'' before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: for example, distinctions between upper and lower case are removed, diacritics (umlauts, accents, etc.) are dropped, and an optional, language-dependent stemming algorithm is applied. Besides that, special characters such as whitespace and punctuation marks are ignored. Thus, this query also yields {{Code|true}}:

<pre lang='xquery'>
"Well... Done!" contains text "well, done"
</pre>

The {{Code|occurs}} keyword comes into play when more than one occurrence of a token is to be found:

<pre lang='xquery'>
"one and two and three" contains text "and" occurs at least 2 times
</pre>
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
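As a small illustration (not taken from the official examples), both of the following queries yield {{Code|true}}: the token {{Code|one}} occurs exactly three times in the input, and {{Code|two}} occurs twice:

<pre lang='xquery'>
"one two one two one" contains text "one" occurs exactly 3 times,
"one two one two one" contains text "two" occurs from 1 to 2 times
</pre>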

==Combining Results==

In the following example, curly braces are used to combine multiple keywords:

<pre lang='xquery'>
for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name
</pre>

The query will output the names of all countries with a religion element containing {{Code|sunni}} or {{Code|shia}}. The {{Code|any}} keyword is optional; it can be replaced with:

* {{Code|all}}: all strings need to be found
* {{Code|any word}}: any of the single words within the specified strings needs to be found
* {{Code|all words}}: all single words within the specified strings need to be found
* {{Code|phrase}}: all strings need to be found as a single phrase
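The distinction matters once a search string contains several words. As an illustrative sketch, the first of the following queries yields {{Code|true}} and the second {{Code|false}}:

<pre lang='xquery'>
(: true: both single words occur somewhere in the text :)
"nice new houses" contains text { "new houses" } all words,
(: false: the words do not occur side by side as one phrase :)
"new nice houses" contains text { "new houses" } phrase
</pre>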

The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the previous one:

<pre lang='xquery'>
doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name
</pre>

The keywords {{Code|not in}} are special: they are used to find tokens that are not part of a longer token sequence:

<pre lang='xquery'>
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
</pre>

Due to the complex data model of the XQuery Full Text specification, the usage of {{Code|ftand}} may lead to high memory consumption. If you encounter problems, simply use the {{Code|all}} keyword:

<pre lang='xquery'>
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish' } all]/name
</pre>
==Positional Filters==

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<pre lang='xquery'>
<xml>
  <text>There is some reason why ...</text>
  <text>For some good yet unknown reason, ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]
</pre>

…the first two texts will be returned as result, because there are at most three words between {{Code|some}} and {{Code|reason}}. Additionally, the {{Code|ordered}} keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that {{Code|all}} is required here to guarantee that only those hits will be accepted that contain all searched words.

The {{Code|window}} keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

<pre lang='xquery'>
("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]
</pre>

Sometimes it is interesting to only select texts in which all searched terms occur in the {{Code|same sentence}} or {{Code|paragraph}} (you can even filter for {{Code|different}} sentences/paragraphs). This is obviously not the case in the following example:

<pre lang='xquery'>
'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence
</pre>

By the way: in some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.

Last but not least, three specifiers exist to filter results depending on the position of a hit:

* {{Code|at start}} expects tokens to occur at the beginning of a text
* {{Code|at end}} expects tokens to occur at the end of a text
* {{Code|entire content}} only accepts texts which have no other words at the beginning or end
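As a brief illustration of these specifiers, all three of the following queries yield {{Code|true}}:

<pre lang='xquery'>
"hello world" contains text "hello" at start,
"hello world" contains text "world" at end,
"hello world" contains text "hello world" entire content
</pre>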

==Match Options==

As indicated in the introduction, the input and query texts are tokenized before they are compared with each other. During this process, texts are split into tokens, which are then normalized, based on the following matching options:

* If {{Code|case}} is insensitive, no distinction is made between characters in upper and lower case. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre lang='xquery'>
"Respect Upper Case" contains text "Upper" using case sensitive
</pre>
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre lang='xquery'>
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
<pre lang='xquery'>
"catch" contains text "catches" using stemming
</pre>
* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
<pre lang='xquery'>
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
</pre>
* Related terms such as synonyms can be found with the sophisticated [[#Thesaurus|Thesaurus]] option.

The {{Code|wildcards}} option facilitates search operations similar to simple regular expressions:

* {{Code|.}} matches a single arbitrary character.
* {{Code|.?}} matches either zero or one character.
* {{Code|.*}} matches zero or more characters.
* {{Code|.+}} matches one or more characters.
* <code>.{min,max}</code> matches ''min''–''max'' number of characters.

<pre lang='xquery'>
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards
</pre>

This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

=BaseX Features=

==Languages==

The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified.
A list of all language codes that are available on your system can be retrieved as follows:

<pre lang='xquery'>
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>
By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
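The effect of these defaults can be observed with the {{Function|Full-Text|ft:tokenize}} function of the [[Full-Text Module]], which applies the same normalizations as the {{Code|contains text}} expression:

<pre lang='xquery'>
(: yields the normalized tokens of the input string :)
ft:tokenize("This is YOUR World!")
</pre>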

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

<pre lang='xquery'>
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</pre>

==Scoring==

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation-defined.

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher the scores will be:

<pre lang='xquery'>
(: Score values: 1 0.62 0.45 :)
for $text in ("A", "A B", "A B C")
let score $score := $text contains text "A"
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>
</pre>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

Scoring values can be further processed to compute custom values:

<pre lang='xquery'>
let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)
</pre>

Scoring is supported within full-text expressions, by {{Function|Full-Text|ft:search}}, and by simple predicate tests that can be rewritten to {{Function|Full-Text|ft:search}}:

<pre lang='xquery'>
let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:get('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n
</pre>

==Thesaurus==

One or more thesaurus files can be specified in a full-text expression. The following query returns {{Code|false}}:

<pre lang='xquery'>
'hardware' contains text 'computers'
  using thesaurus default
</pre>

If a thesaurus is employed…

<pre lang="xml">
<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>
</pre>

…the result will be {{Code|true}}:

<pre lang='xquery'>
'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'
</pre>

Thesaurus files must comply with the [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema] of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in [https://www.iso.org/standard/7776.html ISO 2788] (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the level depth can be specified as well:

<pre lang='xquery'>
(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels
</pre>

More details can be found in the [https://www.w3.org/TR/xpath-full-text-10/#ftthesaurusoption specification].

==Fuzzy Querying==

In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>fuzzy</code> match option to allow for approximate results in full texts:

'''Document 'doc.xml':'''
<pre lang="xml">
<doc>
  <a>house</a>
  <a>hous</a>
  <a>haus</a>
</doc>
</pre>

'''Query:'''
<pre lang='xquery'>
//a[text() contains text 'house' using fuzzy]
</pre>

'''Result:'''
<pre lang="xml">
<a>house</a>
<a>hous</a>
</pre>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results, as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.

A user-defined value can be adjusted globally via the {{Option|LSERROR}} option or via an additional argument:

<pre lang='xquery'>
//a[text() contains text 'house' using fuzzy 3 errors]
</pre>

=Mixed Content=

When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:

<pre lang="xml">
<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
</pre>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

To enable this kind of search, it is recommended to:

* Keep ''whitespace stripping'' turned off when importing XML documents. This can be done by ensuring that {{Option|STRIPWS}} is disabled. This can also be done in the GUI when a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Strip Whitespaces'').
* Keep automatic indentation turned off. Ensure that the [[Serialization|serialization parameter]] {{Code|indent}} is set to {{Code|no}}.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.

Note that the node structure is ignored by the full-text tokenizer: the {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the {{Function|Full-Text|ft:mark}} and {{Function|Full-Text|ft:extract}} functions will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

<pre lang='xquery'>
(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real']),
(: Single text nodes are addressed; results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])
</pre>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database and exclude all information you do not want to search. See the following example (visit [[XQuery Update]] to learn more about updates):

<pre lang='xquery'>
let $docs := db:get('docs')
return db:create(
  'index-db',
  $docs update delete node (
    .//footnote
  ),
  $docs/db:path(.),
  map { 'ftindex': true() }
)
</pre>

=Functions=

Some additional [[Full-Text Module|Full-Text Functions]] have been added to BaseX to extend the official language recommendation with useful features, such as explicitly requesting the score value of an item, marking the hits of a full-text request, or directly accessing the full-text index with the default index options.

=Collations=

See [[XQuery 3.1#Collations|XQuery 3.1]] for standard collation features.

By default, string comparisons in XQuery are based on the Unicode codepoint order. The default namespace URI {{Code|http://www.w3.org/2003/05/xpath-functions/collation/codepoint}} specifies this ordering. In BaseX, the following URI syntax is supported to specify collations:

 <nowiki>http://basex.org/collation?lang=...;strength=...;decomposition=...</nowiki>

Semicolons can be replaced with ampersands; for convenience, the URL can be reduced to its ''query string component'' (including the question mark). All arguments are optional:

{| class="wikitable"
|-
! width="190" | Argument
! Description
|-
| {{Code|lang}}
| A language code, selecting a Locale. It may be followed by a language variant. If no language is specified, the system’s default will be chosen. Examples: {{Code|de}}, {{Code|en-US}}.
|-
| {{Code|strength}}
| Level of difference considered significant in comparisons. Four strengths are supported: {{Code|primary}}, {{Code|secondary}}, {{Code|tertiary}}, and {{Code|identical}}. As an example, in German:
* "Ä" and "A" are considered primary differences,
* "Ä" and "ä" are secondary differences,
* "Ä" and "A&amp;#x308;" (see http://www.fileformat.info/info/unicode/char/308/index.htm) are tertiary differences, and
* "A" and "A" are identical.
|-
| {{Code|decomposition}}
| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details can be found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.
|}

'''Some Examples:'''

* If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields <code>true</code>:

<pre lang='xquery'>
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
</pre>

* Collations can also be specified in {{Code|order by}} and {{Code|group by}} clauses of FLWOR expressions. This query returns {{Code|à plutôt! bonjour!}}:

<pre lang='xquery'>
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
</pre>

* Various string functions exist that take an optional collation as argument. The following functions give us {{Code|a}} and {{Code|1 2 3}} as results:

<pre lang='xquery'><nowiki>
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")
</nowiki></pre>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:

<pre lang='xquery'>
(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')
</pre>
=Changelog=

; Version 9.6:
* Updated: [[#Fuzzy_Querying|Fuzzy Querying]]: Specify Levenshtein error.

; Version 9.5:
* Removed: Scoring propagation.

; Version 9.2:
* Added: Arabic stemmer.

; Version 8.0:
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

; Version 7.7:
* Added: [[#Collations|Collations]] support.

; Version 7.3:
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.

[[Category:XQuery]]
Latest revision as of 18:38, 1 December 2023

This article is part of the XQuery Portal. It summarizes the features of the W3C XQuery Full Text Recommendation, and custom features of the implementation in BaseX.

Please read the separate Full-Text Index section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

Introduction[edit]

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification.

This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

"This is YOUR World" contains text "your world"

It yields true, because the search string is tokenized before it is compared with the tokenized input string. In the tokenization process, several normalizations take place. Many of those steps can hardly be simulated with plain XQuery: as an example, upper/lower case and diacritics (umlauts, accents, etc.) are removed and an optional, language-dependent stemming algorithm is applied. Beside that, special characters such as whitespaces and punctuation marks will be ignored. Thus, this query also yields true:

"Well... Done!" contains text "well, done"

The occurs keyword comes into play when more than one occurrence of a token is to be found:

"one and two and three" contains text "and" occurs at least 2 times

Various range modifiers are available: exactly, at least, at most, and from ... to ....

Combining Results

In the following example, curly braces are used to combine multiple keywords:

for $country in doc('factbook')//country
where $country//religions[text() contains text { 'Sunni', 'Shia' } any]
return $country/name

The query will output the names of all countries with a religions element containing sunni or shia. The any keyword is optional; it can be replaced with:

  • all: all strings need to be found
  • any word: any of the single words within the specified strings need to be found
  • all words: all single words within the specified strings need to be found
  • phrase: all strings need to be found as a single phrase
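The difference between these keywords can be sketched with a hypothetical input string; the first expression yields true, the second false:

(: true: both single words occur somewhere in the text :)
"the quick brown fox" contains text { "quick fox" } all words,
(: false: the two words do not occur side by side as a phrase :)
"the quick brown fox" contains text { "quick fox" } phrase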

The keywords ftand, ftor and ftnot can also be used to combine multiple query terms. The following query yields the same result as the previous one:

doc('factbook')//country[descendant::religions contains text 'sunni' ftor 'shia']/name

The keywords not in are special: they are used to find tokens which are not part of a longer token sequence:

for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"

Due to the complex data model of the XQuery Full Text spec, the usage of ftand may lead to high memory consumption. If you encounter problems, simply use the all keyword instead:

doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name

Positional Filters

A popular retrieval operation is to filter texts by the distance of the searched words. In this query…

<xml>
  <text>There is some reason why ...</text>
  <text>For some good yet unknown reason, ...</text>
  <text>The reason why some people ...</text>
</xml>//text[. contains text { "some", "reason" } all ordered distance at most 3 words]

…the first two texts will be returned as results, because there are at most three words between some and reason. Additionally, the ordered keyword ensures that the words are found in the specified order, which is why the third text is excluded. Note that all is required here to guarantee that only hits containing all searched words are accepted.

The window keyword is related: it accepts those texts in which all keywords occur within the specified number of tokens. Can you guess what is returned by the following query?

("A C D", "A B C D E")[. contains text { "A", "E" } all window 3 words]

Sometimes it is interesting to select only texts in which all searched terms occur within the same sentence or paragraph (you can also filter for terms occurring in different sentences/paragraphs). This is obviously not the case in the following example:

'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence

By the way: in some of the examples above, the words unit was used; sentences and paragraphs are valid alternatives.

Last but not least, three specifiers exist to filter results depending on the position of a hit:

  • at start expects tokens to occur at the beginning of a text
  • at end expects tokens to occur at the text end
  • entire content only accepts texts which have no other words at the beginning or end
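As a sketch with made-up input strings, each of the following expressions yields true:

(: "hello" is the first token of the text :)
"Hello world, this is BaseX" contains text "hello" at start,
(: "basex" is the last token of the text :)
"Hello world, this is BaseX" contains text "basex" at end,
(: the query tokens cover the entire text :)
"Hello world" contains text "hello world" entire content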

Match Options

As indicated in the introduction, the input and query texts are tokenized before they are compared with each other. During this process, texts are split into tokens, which are then normalized, based on the following matching options:

  • If case is insensitive, no distinction is made between characters in upper and lower case. By default, the option is insensitive; it can also be set to sensitive:
"Respect Upper Case" contains text "Upper" using case sensitive
  • If diacritics is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is insensitive; it can also be set to sensitive:
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
  • If stemming is activated, words are shortened to a base form by a language-specific stemmer:
"catch" contains text "catches" using stemming
  • With the stop words option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory etc/stopwords.txt in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
"You and me" contains text "you or me" using stop words ("and", "or"),
"You and me" contains text "you or me" using stop words at "http://files.basex.org/etc/stopwords.txt"
  • Related terms such as synonyms can be found with the sophisticated Thesaurus option.

The wildcards option facilitates search operations similar to simple regular expressions:

  • . matches a single arbitrary character.
  • .? matches either zero or one character.
  • .* matches zero or more characters.
  • .+ matches one or more characters.
  • .{min,max} matches between min and max number of characters.
"This may be interesting in the year 2000" contains text { "interest.*", "2.{3,3}" } using wildcards

This was a quick introduction to XQuery Full Text; you are invited to explore the numerous other features of the language!

BaseX Features

Languages

The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. English, German) or codes (en, de) can be specified. A list of all language codes that are available on your system can be retrieved as follows:

declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))

By default, unless one of the language codes ja, ar, ko, th, or zh is specified, a tokenizer for Western texts is used:

  • Whitespace characters are interpreted as token delimiters.
  • Sentence delimiters are ., !, and ?.
  • Paragraph delimiters are newlines (&#xa;).

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the classpath:

  • lucene-stemmers-3.4.0.jar includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

The JAR files are included in the ZIP and EXE distributions of BaseX.

The following two queries, which both return true, demonstrate that stemming depends on the selected language:

"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"

Scoring

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation-defined.

The scoring model of BaseX takes into consideration the number of found terms, their frequency in a text, and the length of a text. The shorter the input text is, the higher the score will be:

(: Score values: 1 0.62 0.45 :)
for $text in ("A", "A B", "A B C")
let score $score := $text contains text "A"
order by $score descending
return <hit score='{ format-number($score, "0.00") }'>{ $text }</hit>

This simple approach has proven to consistently deliver good results, in particular when little is known about the structure of the queried XML documents.

Scoring values can be further processed to compute custom values:

let $terms := ('a', 'b')
let $scores := ft:score($terms ! ('a b c' contains text { . }))
return avg($scores)

Scoring is supported within full-text expressions, by ft:search, and by simple predicate tests that can be rewritten to ft:search:

let $string := 'a b'
return ft:score($string contains text 'a' ftand 'b'),

for $n score $s in ft:search('factbook', 'orthodox')
order by $s descending
return $s || ': ' || $n,

for $n score $s in db:get('factbook')//text()[. contains text 'orthodox']
order by $s descending
return $s || ': ' || $n

Thesaurus

One or more thesaurus files can be specified in a full-text expression. The following query returns false:

'hardware' contains text 'computers'
  using thesaurus default

If a thesaurus is employed…

<thesaurus xmlns="http://www.w3.org/2007/xqftts/thesaurus">
  <entry>
    <term>computers</term>
    <synonym>
      <term>hardware</term>
      <relationship>NT</relationship>
    </synonym>
  </entry>
</thesaurus>

…the result will be true:

'hardware' contains text 'computers'
  using thesaurus at 'thesaurus.xml'

Thesaurus files must comply with the XSD Schema of the XQFT Test Suite (but the namespace can be omitted). Apart from the relationships defined in ISO 2788 (NT: narrower term, RT: related term, etc.), custom relationships can be used.

The type of relationship and the number of levels to traverse can be specified as well:

(: BT: find broader terms; NT means narrower term :)
'computers' contains text 'hardware'
  using thesaurus at 'x.xml' relationship 'BT' from 1 to 10 levels

More details can be found in the specification.

Fuzzy Querying

In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the fuzzy match option to allow for approximate results in full texts:

Document 'doc.xml':

<doc>
   <a>house</a>
   <a>hous</a>
   <a>haus</a>
</doc>

Query:

//a[text() contains text 'house' using fuzzy]

Result:

<a>house</a>
<a>hous</a>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4. The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
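As a further illustration with hypothetical input strings: the query term house has five characters, so one error (5 idiv 4) is tolerated. The first expression yields true, the second false:

(: true: the Levenshtein distance between "hause" and "house" is 1 :)
'hause' contains text 'house' using fuzzy,
(: false: the distance between "hse" and "house" is 2, exceeding the limit :)
'hse' contains text 'house' using fuzzy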

A user-defined maximum can be set globally via the LSERROR option, or supplied as an additional argument:

//a[text() contains text 'house' using fuzzy 3 errors]

Mixed Content

When working with so-called narrative XML documents, such as HTML, TEI, or DocBook documents, you typically have mixed content, i.e., elements containing a mix of text and markup, such as:

<p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see XQuery and XPath Full Text 1.0 Use Cases.

To enable this kind of search, it is recommended to:

  • Keep whitespace stripping turned off when importing XML documents. This can be done by ensuring that STRIPWS is disabled. In the GUI, it can be disabled when a new database is created (Database → New… → Parsing → Strip Whitespaces).
  • Keep automatic indentation turned off. Ensure that the serialization parameter indent is set to no.

A query such as //p[. contains text 'real text'] will then match the example paragraph above. However, the full-text index will not be used in this query, so it may take a long time. The full-text index would be used for the query //p[text() contains text 'real text'], but this query will not find the example paragraph because the matching text is split over two text nodes.

Note that the node structure is ignored by the full-text tokenizer: The contains text expression applies all full-text operations to the string value of its left operand. As a consequence, the ft:mark and ft:extract functions will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

(: Structure is ignored; no highlighting: :)
ft:mark(//p[. contains text 'real'])
(: Single text nodes are addressed: results will be highlighted: :)
ft:mark(//p[.//text() contains text 'real'])

BaseX does not support the ignore option (without content) of the W3C XQuery Full Text 1.0 Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database and exclude all information you do not want to search. See the following example (visit XQuery Update to learn more about updates):

let $docs := db:get('docs')
return db:create(
  'index-db',
  $docs update delete node (
    .//footnote
  ),
  $docs/db:path(.),
  map { 'ftindex': true() }
)

Functions

Some additional Full-Text Functions have been added to BaseX to extend the official language recommendation with useful features, such as explicitly requesting the score value of an item, marking the hits of a full-text request, or directly accessing the full-text index with the default index options.

Collations

See XQuery 3.1 for standard collation features.

By default, string comparisons in XQuery are based on the Unicode codepoint order. The default namespace URI http://www.w3.org/2003/05/xpath-functions/collation/codepoint specifies this ordering. In BaseX, the following URI syntax is supported to specify collations:

 http://basex.org/collation?lang=...;strength=...;decomposition=...

Semicolons can be replaced with ampersands; for convenience, the URL can be reduced to its query string component (including the question mark). All arguments are optional:

The following arguments are supported:

  • lang: A language code, selecting a Locale. It may be followed by a language variant. If no language is specified, the system's default is chosen. Examples: de, en-US.
  • strength: The level of difference considered significant in comparisons. Four strengths are supported: primary, secondary, tertiary, and identical.
  • decomposition: Defines how composed characters are handled. Three decompositions are supported: none, standard, and full. More details can be found in the JavaDoc of the JDK.

Some Examples:

  • If a default collation is specified, it applies to all collation-dependent string operations in the query. The following expression yields true:
declare default collation 'http://basex.org/collation?lang=de;strength=secondary';
'Straße' = 'Strasse'
  • Collations can also be specified in order by and group by clauses of FLWOR expressions. This query returns à plutôt! bonjour!:
for $w in ("bonjour!", "à plutôt!") order by $w collation "?lang=fr" return $w
  • Various string functions exist that take an optional collation as argument. The following functions return a and 1 2 3 as results:
distinct-values(("a", "á", "à"), "?lang=it-IT;strength=primary"),
index-of(("a", "á", "à"), "a", "?lang=it-IT;strength=primary")

If the ICU Library is added to the classpath, the full Unicode Collation Algorithm features become available:

(: returns 0 (both strings are compared as equal) :)
compare('a-b', 'ab', 'http://www.w3.org/2013/collation/UCA?alternate=shifted')

Changelog

Version 9.6
Version 9.5
  • Removed: Scoring propagation.
Version 9.2
  • Added: Arabic stemmer.
Version 8.0
  • Updated: Scores will be propagated by the and and or expressions and in predicates.
Version 7.7
Version 7.3
  • Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
  • Removed: TF/IDF scoring was discarded in favor of the internal scoring model.