Difference between revisions of "Full-Text Module"

From BaseX Documentation
m (Text replacement - "<syntaxhighlight lang="xquery">" to "<pre lang='xquery'>")
 
(97 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This [[Module Library|XQuery Module]] extends the [http://www.w3.org/TR/xpath-full-text-10 W3C Full Text Recommendation] with some useful functions: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items.
+
This [[Module Library|XQuery Module]] extends the [[Full-Text]] features of BaseX: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items.
  
 
=Conventions=
 
=Conventions=
  
All functions in this module are assigned to the {{Code|http://basex.org/modules/ft}} namespace, which is statically bound to the {{Code|ft}} prefix.<br/>
+
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.<br/>
All errors are assigned to the {{Code|http://basex.org/errors}} namespace, which is statically bound to the {{Code|bxerr}} prefix.
 
  
=Functions=
+
=Database Functions=
  
 
==ft:search==
 
==ft:search==
  
{{Mark|Updated with Version 7.3:}} second argument generalized, third parameter added.
+
{| width='100%'
 
+
|- valign="top"
{|
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:search(
| width='90' | '''Signatures'''
+
  $db       as xs:string,
|{{Func|ft:search|$db as item(), $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as item(), $terms as item()*, $options as item()|text()*}}
+
  $terms   as item()*,
|-
+
  $options as map(*)?    := map { }
 +
) as text()*</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Returns all text nodes from the full-text index of the [[Database Module#Database Nodes|database node]] {{Code|$db}} that contain the specified {{Code|$terms}}.<br/>The options used for building the full-text will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.
+
|Returns all text nodes from the full-text index of the database {{Code|$db}} that contain the specified {{Code|$terms}}.<br/>The options used for tokenizing the input and building the full-text will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.
The {{Code|$options}} argument can be used to overwrite the default full-text options, which can be either specified<br/>
+
The {{Code|$options}} argument can be used to control full-text processing. The following options are supported (the introduction on [[Full-Text]] processing gives you equivalent expressions in the XQuery Full-Text notation):
* as children of an {{Code|&lt;options/&gt;}} element, e.g.:
+
* {{Code|mode}}: determine the mode how tokens are searched. Allowed values are {{Code|any}}, {{Code|any word}}, {{Code|all}}, {{Code|all words}}, and {{Code|phrase}}. {{Code|any}} is the default search mode.
<pre class="brush:xml">
+
* {{Code|wildcards}}: turn wildcard querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, wildcard querying is turned off.
<options>
+
* {{Code|fuzzy}}: turn fuzzy querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, fuzzy querying is turned off.
  <key1 value='value1'/>
+
* {{Code|errors}}: control the maximum number of tolerated errors for fuzzy querying. By default, {{Code|0}} is assigned (see [[Full-Text#Fuzzy_Querying|Fuzzy Querying]] for more details).
  ...
+
* {{Code|ordered}}: indicate if all tokens must occur in the order in which they are specified. Allowed values are {{Code|true}} and {{Code|false}}. The default is {{Code|false}}.
</options>
+
* {{Code|content}}: specify that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are {{Code|start}}, {{Code|end}}, and {{Code|entire}}. By default, the option is turned off.
</pre>
+
* {{Code|scope}}: define the scope in which tokens must be located. The option has following sub options:
* as map, which contains all key/value pairs:
+
** {{Code|same}}: can be set to {{Code|true}} or {{Code|false}}. It specifies if tokens need to occur in the same or different units.
<pre class="brush:xml">
+
** {{Code|unit}}: can be {{Code|sentence}} or {{Code|paragraph}}. It specifies the unit for finding tokens.
map { "key1" := "value1", ... }
+
* {{Code|window}}: set up a window in which all tokens must be located. By default, the option is turned off. It has following sub options:
</pre>
+
** {{Code|size}}: specify the size of the window in terms of ''units''.
The following keys are supported:
+
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
* {{Code|mode}}: determines the search mode (also called [http://www.w3.org/TR/xpath-full-text-10/#ftwords AnyAllOption]). Allowed values are {{Code|any}}, {{Code|any word}}, {{Code|all}}, {{Code|all words}}, and {{Code|phrase}}. {{Code|any}} is the default search mode.
+
* {{Code|distance}}: specify the distance in which tokens must occur. By default, the option is turned off. It has following sub options:
* {{Code|fuzzy}}: turns fuzzy querying on or off. Allowed values are an empty string or {{Code|true}}, or {{Code|false}}. By default, fuzzy querying is turned off.
+
** {{Code|min}}: specify the minimum distance in terms of ''units''. The default is {{Code|0}}.
* {{Code|wildcards}}: turns wildcard querying on or off. Allowed values are an empty string or {{Code|true}}, or {{Code|false}}. By default, wildcard querying is turned off.
+
** {{Code|max}}: specify the maximum distance in terms of ''units''. The default is {{Code|∞}}.
|-
+
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
 +
|- valign="top"
 
| '''Errors'''
 
| '''Errors'''
|{{Error|BXDB0004|Database Module#Errors}} the full-text index is not available.<br/>{{Error|BXFT0001|#Errors}} both fuzzy and wildcard querying was selected.
+
|{{Error|db:get|Database Module#Errors}} The addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the index is not available.<br/>{{Error|options|#Errors}} the fuzzy and wildcard option cannot be both specified.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
* {{Code|ft:search("DB", "QUERY")}} returns all text nodes of the database {{Code|DB}} that contain the term {{Code|QUERY}}.
+
* {{Code|ft:search("DB", "QUERY")}}: Return all text nodes of the database {{Code|DB}} that contain the term {{Code|QUERY}}.
* <code>ft:search("DB", ("2010","2011"), map { 'mode':='all' })</code><br/>returns all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2011}}.
+
* Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code>
* The last example iterates over five databases and returns all elements containing terms similar to {{Code|Hello World}} in the text nodes:
+
* Return text nodes that contain the terms {{Code|A}} and {{Code|B}} within a distance of at most 5 words:
<pre class="brush:xquery">
+
<pre lang='xquery'>
 +
ft:search("db", ("A", "B"), map {
 +
  "mode": "all words",
 +
  "distance": map {
 +
    "max": "5",
 +
    "unit": "words"
 +
  }
 +
})
 +
</pre>
 +
* Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes:
 +
<pre lang='xquery'>
 
let $terms := "Hello Worlds"
 
let $terms := "Hello Worlds"
 
let $fuzzy := true()
 
let $fuzzy := true()
let $options :=
 
  <options>
 
    <fuzzy>{ $fuzzy }</fuzzy>
 
  </options>
 
 
for $db in 1 to 3
 
for $db in 1 to 3
 
let $dbname := 'DB' || $db
 
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, $options)/..
+
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
 
</pre>
 
</pre>
 
|}
 
|}
  
==ft:mark==
+
==ft:tokens==
{|
+
 
|-
+
{| width='100%'
| width='90' | '''Signatures'''
+
|- valign="top"
|{{Func|ft:mark|$nodes as node()*|node()*}}<br />{{Func|ft:mark|$nodes as node()*, $tag as xs:string|node()*}}
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:tokens(
 +
  $db      as xs:string,
 +
  $prefix  as xs:string := ()
 +
) as element(value)*</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Puts a marker element around the resulting {{Code|$nodes}} of a full-text index request.<br />The default tag name of the marker element is {{Code|mark}}. An alternative tag name can be chosen via the optional {{Code|$tag}} argument.<br />Note that the XML node to be transformed must be an internal "database" node. The {{Code|transform}} expression can be used to apply the method to a main-memory fragment (see example).
+
|Returns all full-text tokens stored in the index of the database {{Code|$db}}, along with their numbers of occurrences.<br/>If {{Code|$prefix}} is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text used for creating the index.
|-
+
|- valign="top"
 +
| '''Errors'''
 +
|{{Error|db:get|Database Module#Errors}} The addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the full-text index is not available.
 +
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
|
+
|Returns the number of occurrences for a single, specific index entry:
* The following query returns {{Code|&lt;XML&gt;&lt;mark&gt;hello&lt;/mark&gt; world&lt;/XML&gt;}}, if one text node of the database {{Code|DB}} has the value "hello world":
+
<pre lang='xquery'>
<pre class="brush:xquery">
+
let $term := ft:tokenize($term)
ft:mark(db:open('DB')//*[text() contains text 'hello'])
+
return number(ft:tokens('db', $term)[. = $term]/@count)
 
</pre>
 
</pre>
* The following expression returns {{Code|&lt;p&gt;&lt;b&gt;word&lt;/b&gt;&lt;/p&gt;}}:
 
<pre class="brush:xquery">
 
copy $p := &lt;p&gt;word&lt;/p&gt;
 
modify ()
 
return ft:mark($p[text() contains text 'word'], 'b')</pre>
 
 
|}
 
|}
  
==ft:extract==
+
=General Functions=
{|
+
 
|-
+
==ft:contains==
| width='90' | '''Signatures'''
+
 
|{{Func|ft:extract|$nodes as node()*|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $tag as xs:string|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $tag as xs:string, $length as xs:integer|node()*}}
+
{| width='100%'
|-
+
|- valign="top"
 +
| width='120' | '''Signature'''
 +
|<pre>ft:contains(
 +
  $input    as item()*,
 +
  $terms    as item()*,
 +
  $options  as map(*)?  := map { }
 +
) as xs:boolean</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting {{Code|$nodes}} of a full-text index request and chops irrelevant sections of the result.<br />The default tag name of the marker element is {{Code|mark}}. An alternative tag name can be chosen via the optional {{Code|$tag}} argument.<br />The default length of the returned text is {{Code|150}} characters. An alternative length can be specified via the optional {{Code|$length}} argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.
+
|Checks if the specified {{Code|$input}} items contain the specified {{Code|$terms}}.<br/>The function does the same as the [[Full-Text]] expression {{Code|contains text}}, but options can be specified more dynamically. The {{Code|$options}} are the same as for {{Function||ft:search}}, and the following ones exist:
|-
+
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case-insensitive.
 +
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics-insensitive.
 +
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
 +
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
 +
|- valign="top"
 +
| '''Errors'''
 +
|{{Error|options|#Errors}} specified options are conflicting.
 +
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
* The following query may return {{Code|&lt;XML&gt;...&lt;b&gt;hello&lt;/b&gt;...&lt;XML&gt;}} if a text node of the database {{Code|DB}} contains the string "hello world":
+
* Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}:
<pre class="brush:xquery">
+
<pre lang='xquery'>
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
+
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
 +
</pre>
 +
* Calls the function with stemming turned on and off:
 +
<pre lang='xquery'>
 +
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' })
 
</pre>
 
</pre>
 
|}
 
|}
  
 
==ft:count==
 
==ft:count==
{|
+
 
|-
+
{| width='100%'
| width='90' | '''Signatures'''
+
|- valign="top"
|{{Func|ft:count|$nodes as node()*|xs:integer}}
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:count(
 +
  $nodes as node()*
 +
) as xs:integer</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
 
|Returns the number of occurrences of the search terms specified in a full-text expression.
 
|Returns the number of occurrences of the search terms specified in a full-text expression.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
Line 111: Line 142:
  
 
==ft:score==
 
==ft:score==
{|
+
 
|-
+
{| width='100%'
| width='90' | '''Signatures'''
+
|- valign="top"
|{{Func|ft:score|$item as item()*|xs:double*}}
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:score(
 +
  $item as item()*
 +
) as xs:double*</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
 
|Returns the score values (0.0 - 1.0) that have been attached to the specified items. {{Code|0}} is returned as the value if no score was attached.
 
|Returns the score values (0.0 - 1.0) that have been attached to the specified items. {{Code|0}} is returned as the value if no score was attached.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
Line 124: Line 158:
 
|}
 
|}
  
==ft:tokens==
+
==ft:tokenize==
{|
+
 
|-
+
{| width='100%'
| width='90' | '''Signatures'''
+
|- valign="top"
|{{Func|ft:tokens|$db as item()|element(value)*}}<br/>{{Func|ft:tokens|$db as item(), $prefix as xs:string|element(value)*}}
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:tokenize(
 +
  $string  as xs:string?,
 +
  $options  as map(*)?    := map { }
 +
) as xs:string*</pre>
 +
|- valign="top"
 +
| '''Summary'''
 +
|Tokenizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument, and returns a sequence with the tokenized string. The following options are available:
 +
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case-insensitive.
 +
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics-insensitive.
 +
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
 +
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
 +
The {{Code|$options}} argument can be used to control full-text processing.
 +
|- valign="top"
 +
| '''Examples'''
 +
|
 +
* <code>ft:tokenize("No Doubt")</code> returns the two strings {{Code|no}} and {{Code|doubt}}.
 +
* <code>ft:tokenize("École", map { 'diacritics': 'sensitive' })</code> returns the string {{Code|école}}.
 +
* <code>declare ft-option using stemming; ft:tokenize("GIFTS")</code> returns a single string {{Code|gift}}.
 +
|}
 +
 
 +
==ft:normalize==
 +
 
 +
{| width='100%'
 +
|- valign="top"
 +
| width='120' | '''Signature'''
 +
|<pre>ft:normalize(
 +
  $string  as xs:string?,
 +
  $options  as map(*)?    := map { }
 +
) as xs:string</pre>
 +
|- valign="top"
 +
| '''Summary'''
 +
|Normalizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument. The function accepts the same arguments as {{Function||ft:tokenize}}; special characters and separators will be preserved.
 +
|- valign="top"
 +
| '''Examples'''
 +
|
 +
* <code>ft:normalize("Häuser am Meer", map { 'case': 'sensitive' })</code> returns the string {{Code|Hauser am Meer}}.
 +
|}
 +
 
 +
==ft:thesaurus==
 +
 
 +
{| width='100%'
 +
|- valign="top"
 +
| width='120' | '''Signature'''
 +
|<pre>ft:thesaurus(
 +
  $node    as node(),
 +
  $term    as xs:string,
 +
  $options  as map(*)?    := map { }
 +
) as xs:string*</pre>
 +
|- valign="top"
 +
| '''Summary'''
 +
|Looks up a {{Code|$term}} in a [[Full-Text#Thesaurus|Thesaurus Structure]] supplied by {{Code|$node}}. The following {{Code|$options}} exist:
 +
* {{Code|relationship}}: determines the relationship between terms
 +
* {{Code|levels}}: determines the maximum number of levels to traverse
 +
|- valign="top"
 +
| '''Examples'''
 +
| Returns {{Code|happy}} and {{Code|lucky}}:
 +
<pre lang='xquery'>
 +
ft:thesaurus(
 +
  <thesaurus>
 +
    <entry>
 +
      <term>happy</term>
 +
      <synonym>
 +
        <term>lucky</term>
 +
        <relationship>RT</relationship>
 +
      </synonym>
 +
    </entry>
 +
  </thesaurus>,
 +
  'happy'
 +
)
 +
</pre>
 +
|}
 +
 
 +
=Highlighting Functions=
 +
 
 +
==ft:mark==
 +
 
 +
{| width='100%'
 +
|- valign="top"
 +
| width='120' | '''Signature'''
 +
|<pre>ft:mark(
 +
  $nodes  as node()*,
 +
  $name  as xs:string  := ()
 +
) as node()*</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Returns all full-text tokens stored in the index of the [[Database Module#Database Nodes|database node]] {{Code|$db}}, along with their numbers of occurrences.<br/>If {{Code|$prefix}} is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text used for creating the index.
+
|Puts a marker element around the resulting {{Code|$nodes}} of a full-text request.<br/>The default name of the marker element is {{Code|mark}}. An alternative name can be chosen via the optional {{Code|$name}} argument.<br/>Please note that:
|-
+
* The full-text expression that computes the token positions must be specified as argument of the <code>ft:mark()</code> function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
| '''Errors'''
+
* The supplied node must be a [[Database Module#Database Node|Database Node]]. As shown in Example 3, {{Code|update}} or {{Code|transform}} can be utilized to convert a fragment to the required internal representation.
|{{Error|BXDB0004|Database Module#Errors}} the full-text index is not available.
+
|- valign="top"
 +
| '''Examples'''
 +
|'''Example 1''': The following query returns {{Code|&lt;XML&gt;&lt;mark&gt;hello&lt;/mark&gt; world&lt;/XML&gt;}}, if one text node of the database {{Code|DB}} has the value "hello world":
 +
<pre lang='xquery'>
 +
ft:mark(db:get('DB')//*[text() contains text 'hello'])
 +
</pre>
 +
'''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression:
 +
<pre lang='xquery'>
 +
let $start := 1
 +
let $end  := 10
 +
let $term  := 'welcome'
 +
let $test  := function($node) { $node/text() contains text { $term } }
 +
for $ft in (db:get('DB')//*[$test(.)])[position() = $start to $end]
 +
return ft:mark($ft[$test(.)])
 +
</pre>
 +
'''Example 3''': The following expression returns <code>&lt;xml&gt;hello &lt;b&gt;word&lt;/b&gt;&lt;/xml&gt;</code>:
 +
<pre lang='xquery'>
 +
copy $p := <xml>hello word</xml>
 +
modify ()
 +
return ft:mark($p[text() contains text 'word'], 'b')
 +
</pre>
 
|}
 
|}
  
==ft:tokenize==
+
==ft:extract==
{|
+
 
|-
+
{| width='100%'
| width='90' | '''Signatures'''
+
|- valign="top"
|{{Func|ft:tokenize|$input as xs:string|xs:string*}}
+
| width='120' | '''Signature'''
|-
+
|<pre>ft:extract(
 +
  $nodes  as node()*,
 +
  $name    as xs:string   := (),
 +
  $length  as xs:integer  := ()
 +
) as node()*</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Tokenizes the given {{Code|$input}} string, using the current default full-text options.
+
|Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting {{Code|$nodes}} of a full-text index request and chops irrelevant sections of the result.<br/>The default element name of the marker element is {{Code|mark}}. An alternative element name can be chosen via the optional {{Code|$name}} argument.<br/>The default length of the returned text is {{Code|150}} characters. An alternative length can be specified via the optional {{Code|$length}} argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.<br/>For more details on this function, please have a look at {{Function||ft:mark}}.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
* {{Code|ft:tokenize("No Doubt")}} returns the two strings {{Code|no}} and {{Code|doubt}}.
+
* The following query may return {{Code|&lt;XML&gt;...&lt;b&gt;hello&lt;/b&gt;...&lt;XML&gt;}} if a text node of the database {{Code|DB}} contains the string "hello world":
* {{Code|declare ft-option using stemming; ft:tokenize("GIFTS")}} returns a single string {{Code|gift}}.
+
<pre lang='xquery'>
 +
ft:extract(db:get('DB')//*[text() contains text 'hello'], 'b', 1)
 +
</pre>
 
|}
 
|}
  
Line 155: Line 299:
  
 
{| class="wikitable" width="100%"
 
{| class="wikitable" width="100%"
! width="5%"|Code
+
! width="110"|Code
! width="95%"|Description
+
|Description
|-
+
|- valign="top"
|{{Code|BXFT0001}}
+
|{{Code|options}}
 
|Both wildcards and fuzzy search have been specified as search options.
 
|Both wildcards and fuzzy search have been specified as search options.
 
|}
 
|}
Line 164: Line 308:
 
=Changelog=
 
=Changelog=
  
;Version 7.2
+
; Version 9.6
 +
* Added: {{Function||ft:thesaurus}}
 +
* Updated: {{Function||ft:search}}, {{Function||ft:contains}}: new {{Code|errors}} option.
 +
 
 +
; Version 9.1
 +
* Updated: {{Function||ft:tokenize}} and {{Function||ft:normalize}} can be called with empty sequence.
 +
 
 +
;Version 9.0
 +
* Updated: error codes updated; errors now use the module namespace
 +
 
 +
;Version 8.0
 +
* Added: {{Function||ft:contains}}, {{Function||ft:normalize}}
 +
* Updated: Options added to {{Function||ft:tokenize}}
  
* Updated: [[#ft:search|ft:search]] (second argument generalized, third parameter added)
+
;Version 7.8
 +
* Added: {{Function||ft:contains}}
 +
* Updated: Options added to {{Function||ft:search}}
  
;Version 7.1
+
;Version 7.7
 +
* Updated: the functions no longer accept [[Database Module#Database Nodes|Database Nodes]] as reference. Instead, the name of a database must now be specified.
  
* Added: [[#ft:tokens|ft:tokens]], [[#ft:tokenize|ft:tokenize]]
+
;Version 7.2
 +
* Updated: {{Function||ft:search}} (second argument generalized, third parameter added)
  
[[Category:XQuery]]
+
;Version 7.1
 +
* Added: {{Function||ft:tokens}}, {{Function||ft:tokenize}}

Latest revision as of 18:34, 1 December 2023

This XQuery Module extends the Full-Text features of BaseX: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the contains text expression, can be explicitly requested from items.

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/ft namespace, which is statically bound to the ft prefix.

Database Functions

ft:search

Signature
ft:search(
  $db       as xs:string,
  $terms    as item()*,
  $options  as map(*)?    := map { }
) as text()*
Summary Returns all text nodes from the full-text index of the database $db that contain the specified $terms.
The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.

The $options argument can be used to control full-text processing. The following options are supported (the introduction on Full-Text processing gives you equivalent expressions in the XQuery Full-Text notation):

  • mode: determine how tokens are searched. Allowed values are any, any word, all, all words, and phrase. any is the default search mode.
  • wildcards: turn wildcard querying on or off. Allowed values are true and false. By default, wildcard querying is turned off.
  • fuzzy: turn fuzzy querying on or off. Allowed values are true and false. By default, fuzzy querying is turned off.
  • errors: control the maximum number of tolerated errors for fuzzy querying. By default, 0 is assigned (see Fuzzy Querying for more details).
  • ordered: indicate if all tokens must occur in the order in which they are specified. Allowed values are true and false. The default is false.
  • content: specify that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are start, end, and entire. By default, the option is turned off.
  • scope: define the scope in which tokens must be located. The option has the following suboptions:
    • same: can be set to true or false. It specifies if tokens need to occur in the same or different units.
    • unit: can be sentence or paragraph. It specifies the unit for finding tokens.
  • window: set up a window in which all tokens must be located. By default, the option is turned off. It has the following suboptions:
    • size: specify the size of the window in terms of units.
    • unit: can be words, sentences or paragraphs. The default is words.
  • distance: specify the distance in which tokens must occur. By default, the option is turned off. It has the following suboptions:
    • min: specify the minimum distance in terms of units. The default is 0.
    • max: specify the maximum distance in terms of units. The default is ∞.
    • unit: can be words, sentences or paragraphs. The default is words.
Errors db:get: The addressed database does not exist or could not be opened.
db:no-index: the index is not available.
options: the fuzzy and wildcard option cannot be both specified.
Examples
  • ft:search("DB", "QUERY"): Return all text nodes of the database DB that contain the term QUERY.
  • Return all text nodes of the database DB that contain the numbers 2010 and 2020:
    ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })
  • Return text nodes that contain the terms A and B within a distance of at most 5 words:
ft:search("db", ("A", "B"), map {
  "mode": "all words",
  "distance": map {
    "max": "5",
    "unit": "words"
  }
})
  • Iterate over three databases and return all elements containing terms similar to Hello World in the text nodes:
let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
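
The window option listed above can be combined with ordered in the same way as distance. The following is only a minimal sketch: it assumes that a database named DB with a full-text index exists, the terms A and B and the window size are placeholders, and the numeric value is passed as a string to mirror the distance example.

```xquery
(: Sketch: both terms must occur, in the given order, within a window of 10 words. :)
(: Assumes a database "DB" with a full-text index; terms and size are placeholders. :)
ft:search("DB", ("A", "B"), map {
  "mode": "all",
  "ordered": true(),
  "window": map {
    "size": "10",
    "unit": "words"
  }
})
```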

ft:tokens

Signature
ft:tokens(
  $db      as xs:string,
  $prefix  as xs:string  := ()
) as element(value)*
Summary Returns all full-text tokens stored in the index of the database $db, along with their numbers of occurrences.
If $prefix is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
Errors db:get: The addressed database does not exist or could not be opened.
db:no-index: the full-text index is not available.
Examples Returns the number of occurrences for a single, specific index entry:
let $term := ft:tokenize($term)
return number(ft:tokens('db', $term)[. = $term]/@count)
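
The returned value elements can also be processed as a whole. As a sketch, assuming a database DB with a full-text index and a hypothetical prefix hel, the following query lists the matching index entries by descending frequency. It relies on the count attribute used in the example above.

```xquery
(: Sketch: list index entries starting with a prefix, most frequent first. :)
(: 'DB' and the prefix 'hel' are placeholders. :)
for $entry in ft:tokens('DB', 'hel')
let $count := xs:integer($entry/@count)
order by $count descending
return $entry || ': ' || $count
```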

General Functions

ft:contains

Signature
ft:contains(
  $input    as item()*,
  $terms    as item()*,
  $options  as map(*)?  := map { }
) as xs:boolean
Summary Checks if the specified $input items contain the specified $terms.
The function does the same as the Full-Text expression contains text, but options can be specified more dynamically. All options of ft:search are supported; in addition, the following ones exist:
  • case: determines how character case is processed. Allowed values are insensitive, sensitive, upper and lower. By default, search is case-insensitive.
  • diacritics: determines how diacritical characters are processed. Allowed values are insensitive and sensitive. By default, search is diacritics-insensitive.
  • stemming: determines if tokens are stemmed. Allowed values are true and false. By default, stemming is turned off.
  • language: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is en.
Errors options: specified options are conflicting.
Examples
  • Checks if jack or john occurs in the input string John Doe:
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
  • Calls the function with stemming turned on and off:
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' })
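
Because the options are passed as a map, they can be computed at runtime. The following sketch evaluates the same test once per value of the case option. The input string and search term are illustrative. With sensitive, the lowercase term john is not expected to match the capitalized John.

```xquery
(: Sketch: compare the default case handling with case-sensitive matching. :)
for $case in ("insensitive", "sensitive")
return ft:contains("John Doe", "john", map { "case": $case })
```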

ft:count

Signature
ft:count(
  $nodes  as node()*
) as xs:integer
Summary Returns the number of occurrences of the search terms specified in a full-text expression.
Examples
  • ft:count(//*[text() contains text 'QUERY']) returns the xs:integer value 2 if a document contains two occurrences of the string "QUERY".

ft:score

Signature
ft:score(
  $item  as item()*
) as xs:double*
Summary Returns the score values (0.0 - 1.0) that have been attached to the specified items. 0 is returned as the value if no score was attached.
Examples
  • ft:score('a' contains text 'a') returns the xs:double value 1.
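
Score values can be used to rank results. The sketch below follows the pattern of the example above and requests the score of a contains text expression for each string of an in-memory sequence. The sample strings are made up for illustration, and no specific score values are implied.

```xquery
(: Sketch: rank a few in-memory strings by the score of a full-text test. :)
let $texts := ('hello world', 'hello hello', 'goodbye')
for $text in $texts
let $score := ft:score($text contains text 'hello')
order by $score descending
return $text || ': ' || $score
```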

ft:tokenize

Signature
ft:tokenize(
  $string   as xs:string?,
  $options  as map(*)?     := map { }
) as xs:string*
Summary Tokenizes the given $string, using the current default full-text options or the $options specified as second argument, and returns a sequence with the tokenized string. The following options are available:
  • case: determines how character case is processed. Allowed values are insensitive, sensitive, upper and lower. By default, search is case-insensitive.
  • diacritics: determines how diacritical characters are processed. Allowed values are insensitive and sensitive. By default, search is diacritics-insensitive.
  • stemming: determines if tokens are stemmed. Allowed values are true and false. By default, stemming is turned off.
  • language: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is en.

The $options argument can be used to control full-text processing.

Examples
  • ft:tokenize("No Doubt") returns the two strings no and doubt.
  • ft:tokenize("École", map { 'diacritics': 'sensitive' }) returns the string école.
  • declare ft-option using stemming; ft:tokenize("GIFTS") returns a single string gift.
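  • A sketch of a simple term-frequency count built on top of the function, using the default options. The input string and the variable names are illustrative assumptions:
(: hypothetical input string :)
let $text := "No doubt, no Doubt at all"
for $token in ft:tokenize($text)
(: group identical tokens; $token then holds all members of a group :)
let $key := $token
group by $key
order by count($token) descending, $key
return $key || ': ' || count($token)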

ft:normalize

Signature
ft:normalize(
  $string   as xs:string?,
  $options  as map(*)?     := map { }
) as xs:string
Summary Normalizes the given $string, using the current default full-text options or the $options specified as second argument. The function accepts the same arguments as ft:tokenize; special characters and separators will be preserved.
Examples
  • ft:normalize("Häuser am Meer", map { 'case': 'sensitive' }) returns the string Hauser am Meer.

ft:thesaurus

Signature
ft:thesaurus(
  $node     as node(),
  $term     as xs:string,
  $options  as map(*)?    := map { }
) as xs:string*
Summary Looks up a $term in a Thesaurus Structure supplied by $node. The following $options exist:
  • relationship: determines the relationship between terms
  • levels: determines the maximum number of levels to traverse
Examples Returns happy and lucky:
ft:thesaurus(
  <thesaurus>
    <entry>
      <term>happy</term>
      <synonym>
        <term>lucky</term>
        <relationship>RT</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'happy'
)
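
The $options can be supplied as a map. The following sketch reuses the same thesaurus; the option values 'RT' and 1 are assumptions chosen to match the sample data, and the result is not shown because it depends on how the options are interpreted:
let $thesaurus :=
  <thesaurus>
    <entry>
      <term>happy</term>
      <synonym>
        <term>lucky</term>
        <relationship>RT</relationship>
      </synonym>
    </entry>
  </thesaurus>
(: assumed values: only follow RT relationships, one level deep :)
return ft:thesaurus($thesaurus, 'happy', map { 'relationship': 'RT', 'levels': 1 })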

Highlighting Functions

ft:mark

Signature
ft:mark(
  $nodes  as node()*,
  $name   as xs:string  := ()
) as node()*
Summary Puts a marker element around the resulting $nodes of a full-text request.
The default name of the marker element is mark. An alternative name can be chosen via the optional $name argument.
Please note that:
  • The full-text expression that computes the token positions must be specified as argument of the ft:mark() function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
  • The supplied node must be a Database Node. As shown in Example 3, update or transform can be utilized to convert a fragment to the required internal representation.
Examples Example 1: The following query returns <XML><mark>hello</mark> world</XML>, if one text node of the database DB has the value "hello world":
ft:mark(db:get('DB')//*[text() contains text 'hello'])

Example 2: The following expression loops through the first ten full-text results and marks the results in a second expression:

let $start := 1
let $end   := 10
let $term  := 'welcome'
let $test  := function($node) { $node/text() contains text { $term } }
for $ft in (db:get('DB')//*[$test(.)])[position() = $start to $end]
return ft:mark($ft[$test(.)])

Example 3: The following expression returns <xml>hello <b>world</b></xml>:

copy $p := <xml>hello world</xml>
modify ()
return ft:mark($p[text() contains text 'world'], 'b')

ft:extract

Signature
ft:extract(
  $nodes   as node()*,
  $name    as xs:string   := (),
  $length  as xs:integer  := ()
) as node()*
Summary Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting $nodes of a full-text index request and chops irrelevant sections of the result.
The default element name of the marker element is mark. An alternative element name can be chosen via the optional $name argument.
The default length of the returned text is 150 characters. An alternative length can be specified via the optional $length argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.
For more details on this function, please have a look at ft:mark.
Examples
  • The following query may return <XML>...<b>hello</b>...</XML> if a text node of the database DB contains the string "hello world":
ft:extract(db:get('DB')//*[text() contains text 'hello'], 'b', 1)
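  • A sketch that applies the function to a fragment instead of a database, assuming the copy/modify conversion described for ft:mark works in the same way here. The sample text, the marker name b and the length 20 are illustrative; the exact shape of the chopped result is not shown, as the effective length may differ:
(: convert the fragment to the internal representation first :)
copy $p := <xml>The quick brown fox says hello world to the lazy dog</xml>
modify ()
return ft:extract($p[text() contains text 'world'], 'b', 20)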

Errors

Code Description
options Both wildcards and fuzzy search have been specified as search options.
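
As the errors of this module use the module namespace, the error can presumably be caught via the ft prefix. A minimal sketch, assuming that ft:contains accepts the wildcards and fuzzy options and that requesting both raises this error:
try {
  (: assumed conflicting options: wildcards and fuzzy search together :)
  ft:contains('John Doe', 'jo.*', map { 'wildcards': true(), 'fuzzy': true() })
} catch ft:options {
  (: $err:description holds the message of the caught error :)
  'Conflicting options: ' || $err:description
}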

Changelog

Version 9.6
Version 9.1
Version 9.0
  • Updated: error codes updated; errors now use the module namespace
Version 8.0
Version 7.8
Version 7.7
  • Updated: the functions no longer accept Database Nodes as reference. Instead, the name of a database must now be specified.
Version 7.2
  • Updated: ft:search (second argument generalized, third parameter added)
Version 7.1