Difference between revisions of "Full-Text Module"

From BaseX Documentation
m (Text replace - "{{Mono|" to "{{Code|")

Revision as of 15:54, 26 May 2012

This XQuery Module extends the W3C Full Text Recommendation with some useful functions: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the contains text expression, can be explicitly requested from items.

Conventions

All functions in this module are assigned to the http://basex.org/modules/ft namespace, which is statically bound to the ft prefix.
All errors are assigned to the http://basex.org/errors namespace, which is statically bound to the bxerr prefix.

Functions

ft:search

Updated with Version 7.2: second argument generalized, third parameter added.

Signatures ft:search($db as item(), $terms as item()*) as text()*
ft:search($db as item(), $terms as item()*, $options as item()) as text()*
Summary Returns all text nodes from the full-text index of the database node $db that contain the specified $terms.
The options used for building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.

The $options argument can be used to overwrite the default full-text options. It can be specified as

  • element(options): <options/> must be used as root element, and the parameters are specified as child nodes, with the element name representing the key and the text node representing the value:
<options>
  <key>value</key>
  ...
</options>
  • map structure: all parameters can be directly represented as key/value pairs:
    map { "key" := "value", ... }
    This variant is more compact, but please note that the W3C’s specification of maps in XQuery is still work in progress.

The following keys are supported:

  • mode: determines the search mode (also called AnyAllOption). Allowed values are any, any word, all, all words, and phrase. any is the default search mode.
  • fuzzy: turns fuzzy querying on or off. Allowed values are true (or, equivalently, an empty string) and false. By default, fuzzy querying is turned off.
  • wildcards: turns wildcard querying on or off. Allowed values are true (or, equivalently, an empty string) and false. By default, wildcard querying is turned off.
Errors BXDB0004: the full-text index is not available.
BXFT0001: both fuzzy and wildcard querying were selected.
Examples
  • ft:search("DB", "QUERY") returns all text nodes of the database DB that contain the term QUERY.
  • ft:search("DB", (2010,2011), map { 'mode':='all' })
    returns all text nodes of the database DB that contain the numbers 2010 and 2011.
  • The last example iterates over three databases and returns all elements containing terms similar to Hello Worlds in the text nodes:
let $terms := "Hello Worlds"
let $fuzzy := true()
let $options :=
  <options>
    <fuzzy>{ $fuzzy }</fuzzy>
  </options>
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, $options)/..
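
The two ways of passing options can be compared side by side. The following is a minimal sketch and not part of the original examples: it assumes a database named DB with a full-text index, and it assumes that .* is accepted as a wildcard for any sequence of characters. Since both variants carry the same keys and values, the two calls would be expected to return the same text nodes.

```xquery
(: Assumption: 'DB' is a placeholder for a database with a full-text index. :)
(: Variant 1: options as an <options/> element with one child per key. :)
let $as-element :=
  <options>
    <mode>all words</mode>
    <wildcards>true</wildcards>
  </options>
(: Variant 2: the same options as a map (":=" syntax as used in this revision). :)
let $as-map := map { 'mode' := 'all words', 'wildcards' := 'true' }
return (
  ft:search('DB', 'hel.* wor.*', $as-element),
  ft:search('DB', 'hel.* wor.*', $as-map)
)
```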

ft:mark

Signatures ft:mark($nodes as node()*) as node()*
ft:mark($nodes as node()*, $tag as xs:string) as node()*
Summary Puts a marker element around the resulting $nodes of a full-text index request.
The default tag name of the marker element is mark. An alternative tag name can be chosen via the optional $tag argument.
Note that the XML node to be transformed must be an internal "database" node. The transform expression can be used to apply the method to a main-memory fragment (see example).
Examples
  • The following query returns <XML><mark>hello</mark> world</XML>, if one text node of the database DB has the value "hello world":
ft:mark(db:open('DB')//*[text() contains text 'hello'])
  • The following expression returns <p><b>word</b></p>:
copy $p := <p>word</p>
modify ()
return ft:mark($p[text() contains text 'word'], 'b')

ft:extract

Signatures ft:extract($nodes as node()*) as node()*
ft:extract($nodes as node()*, $tag as xs:string) as node()*
ft:extract($nodes as node()*, $tag as xs:string, $length as xs:integer) as node()*
Summary Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting $nodes of a full-text index request and chops irrelevant sections of the result.
The default tag name of the marker element is mark. An alternative tag name can be chosen via the optional $tag argument.
The default length of the returned text is 150 characters. An alternative length can be specified via the optional $length argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.
Examples
  • The following query may return <XML>...<b>hello</b>...</XML> if a text node of the database DB contains the string "hello world":
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)

ft:count

Signatures ft:count($nodes as node()*) as xs:integer
Summary Returns the number of occurrences of the search terms specified in a full-text expression.
Examples
  • ft:count(//*[text() contains text 'QUERY']) returns the xs:integer value 2 if a document contains two occurrences of the string "QUERY".

ft:score

Signatures ft:score($item as item()*) as xs:double*
Summary Returns the score values (0.0 - 1.0) that have been attached to the specified items. 0 is returned if no score was attached.
Examples
  • ft:score('a' contains text 'a') returns the xs:double value 1.
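
A typical use of the score is to rank results. The following is a hedged sketch rather than an official example: it assumes a database named DB, and the hit element is a hypothetical wrapper chosen for illustration. Whether a non-zero score is attached to each node depends on how the query is evaluated; as stated above, 0 is returned for items without a score.

```xquery
(: Assumption: 'DB' is a placeholder database name; <hit/> is a made-up wrapper. :)
for $node in db:open('DB')//*[text() contains text 'hello']
(: Request the score attached by the "contains text" expression. :)
let $score := ft:score($node)
order by $score descending
return <hit score="{ $score }">{ $node }</hit>
```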

ft:tokens

Signatures ft:tokens($db as item()) as element(value)*
ft:tokens($db as item(), $prefix as xs:string) as element(value)*
Summary Returns all full-text tokens stored in the index of the database node $db, along with their numbers of occurrences.
If $prefix is specified, the returned nodes will be restricted to the tokens starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
Errors BXDB0004: the full-text index is not available.
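
As a brief illustration of the prefix variant, the sketch below lists the indexed tokens that start with hel. It assumes a database named DB with a full-text index, and it assumes that the string value of each returned value element is the token itself; how the occurrence counts are represented is not shown here.

```xquery
(: Assumption: 'DB' is a placeholder for a database with a full-text index. :)
for $token in ft:tokens('DB', 'hel')
(: Assumption: the string value of each <value/> element is the token. :)
return string($token)
```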

ft:tokenize

Signatures ft:tokenize($input as xs:string) as xs:string*
Summary Tokenizes the given $input string, using the current default full-text options.
Examples
  • ft:tokenize("No Doubt") returns the two strings no and doubt.
  • declare ft-option using stemming; ft:tokenize("GIFTS") returns a single string gift.
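
The result of ft:tokenize is an ordinary sequence of strings, so it can be processed with standard functions. The following minimal sketch counts the tokens and joins them again; if the tokens are no and doubt, as in the first example above, it would yield 2 and the string no doubt.

```xquery
(: Tokenize with the current default full-text options. :)
let $tokens := ft:tokenize("No Doubt")
return (
  count($tokens),            (: number of tokens :)
  string-join($tokens, ' ')  (: normalized tokens, separated by blanks :)
)
```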

Errors

Code Description
BXFT0001 Both wildcards and fuzzy search have been specified as search options.
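
Because the error namespace is statically bound to the bxerr prefix, the error can be caught by name. The following is a sketch under stated assumptions: it assumes a database named DB with a full-text index, and the returned message is made up for illustration. Other errors, such as BXDB0004, are not handled by this catch clause.

```xquery
(: Assumption: 'DB' is a placeholder for a database with a full-text index. :)
try {
  (: Selecting fuzzy and wildcard querying together is expected to raise BXFT0001. :)
  ft:search('DB', 'hello', map { 'fuzzy' := 'true', 'wildcards' := 'true' })
} catch bxerr:BXFT0001 {
  (: Hypothetical fallback message, chosen for this sketch. :)
  'Choose either fuzzy or wildcard querying, not both.'
}
```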

Changelog

Version 7.2

  • Updated: ft:search (second argument generalized, third parameter added)

Version 7.1