Difference between revisions of "Full-Text Module"

Revision as of 13:15, 25 October 2013

This XQuery Module extends the W3C Full Text Recommendation with some useful functions: the index can be accessed directly, full-text results can be marked with additional elements, and the relevant parts of a result can be extracted. Moreover, the score value that is generated by the contains text expression can be explicitly requested from items.

Conventions

All functions in this module are assigned to the http://basex.org/modules/ft namespace, which is statically bound to the ft prefix.
All errors are assigned to the http://basex.org/errors namespace, which is statically bound to the bxerr prefix.

Functions

ft:search

Signatures ft:search($db as xs:string, $terms as item()*) as text()*
ft:search($db as xs:string, $terms as item()*, $options as item()) as text()*
Summary Returns all text nodes from the full-text index of the database $db that contain the specified $terms.
The options used for tokenizing the input and building the full-text will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.

The $options argument can be used to control full-text processing. Options can be specified either

  • as children of an <options/> element, e.g.:
<options>
  <key1 value='value1'/>
  ...
</options>
  • as map, which contains all key/value pairs:
{ "key1": "value1", ... }

The following options are supported (the introduction on Full-Text processing gives you equivalent expressions in the XQuery Full-Text notation):

  • mode: determines how tokens are searched. Allowed values are any, any word, all, all words, and phrase. any is the default search mode.
  • fuzzy: turns fuzzy querying on or off. Allowed values are true (or an empty string) and false. By default, fuzzy querying is turned off.
  • wildcards: turns wildcard querying on or off. Allowed values are true (or an empty string) and false. By default, wildcard querying is turned off.

The following options have been added in Version 7.8:

  • ordered: requires that all tokens occur in the order in which they are specified. Allowed values are true and false. The default is false.
  • content: specifies that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are start, end, and entire. By default, the option is turned off.
  • scope: defines the scope in which tokens must be located. The option has the following sub-options:
    • same: can be set to true or false. It specifies whether tokens need to occur in the same or in different units.
    • unit: can be sentence or paragraph. It specifies the unit for finding tokens.
  • window: sets up a window in which all tokens must be located. By default, the option is turned off. It has the following sub-options:
    • size: specifies the size of the window in terms of units.
    • unit: can be words, sentences, or paragraphs. The default is words.
  • distance: specifies the distance in which tokens must occur. By default, the option is turned off. It has the following sub-options:
    • min: specifies the minimum distance in terms of units. The default is 0.
    • max: specifies the maximum distance in terms of units. The default is ∞.
    • unit: can be words, sentences, or paragraphs. The default is words.
Errors BXDB0002: The addressed database does not exist or could not be opened.
BXDB0004: The full-text index is not available.
BXFT0001: Both fuzzy and wildcard querying were selected.
Examples
  • ft:search("DB", "QUERY") returns all text nodes of the database DB that contain the term QUERY.
  • ft:search("DB", ("2010","2011"), { 'mode': 'all' })
    returns all text nodes of the database DB that contain the numbers 2010 and 2011.
  • The last example iterates over three databases and returns all elements containing terms similar to Hello World in the text nodes:
let $terms := "Hello Worlds"
let $fuzzy := true()
let $options :=
  <options>
    <fuzzy>{ $fuzzy }</fuzzy>
  </options>
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, $options)/..
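
The options listed above can also be combined in a single map. The following sketch assumes a database named DB with a full-text index, and it assumes that boolean option values may be passed as true(), as in the previous example. The second expression is a comparable query in XQuery Full-Text notation; it is meant as an illustration and is not guaranteed to return exactly the same nodes:

```xquery
(: Assumption: a database named "DB" exists and has a full-text index. :)
(
  (: index lookup: both tokens must occur, in the specified order :)
  ft:search("DB", ("hello", "world"), { 'mode': 'all', 'ordered': true() }),
  (: comparable query in XQuery Full-Text notation :)
  db:open("DB")//text()[. contains text { "hello", "world" } all ordered]
)
```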

ft:mark

Signatures ft:mark($nodes as node()*) as node()*
ft:mark($nodes as node()*, $tag as xs:string) as node()*
Summary Puts a marker element around the resulting $nodes of a full-text index request.
The default tag name of the marker element is mark. An alternative tag name can be chosen via the optional $tag argument.
Please note that:
  • the XML node to be transformed must be an internal "database" node. The transform expression can be used to apply the method to a main-memory fragment, as shown in Example 2.
  • the full-text expression, which computes the token positions, must be specified within the ft:mark() function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 3.
Examples Example 1: The following query returns <XML><mark>hello</mark> world</XML>, if one text node of the database DB has the value "hello world":
ft:mark(db:open('DB')//*[text() contains text 'hello'])

Example 2: The following expression returns <p><b>word</b></p>:

copy $p := <p>word</p>
modify ()
return ft:mark($p[text() contains text 'word'], 'b')

Example 3: The following expression loops through the first ten full-text results and marks the results in a second expression:

let $start := 1
let $end   := 10
let $term  := 'welcome'
for $ft in (db:open('DB')//*[text() contains text { $term }])[position() = $start to $end]
return element hit {
  ft:mark($ft[text() contains text { $term }])
}

ft:extract

Signatures ft:extract($nodes as node()*) as node()*
ft:extract($nodes as node()*, $tag as xs:string) as node()*
ft:extract($nodes as node()*, $tag as xs:string, $length as xs:integer) as node()*
Summary Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting $nodes of a full-text index request and chops irrelevant sections of the result.
The default tag name of the marker element is mark. An alternative tag name can be chosen via the optional $tag argument.
The default length of the returned text is 150 characters. An alternative length can be specified via the optional $length argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.
For more details on this function, please have a look at ft:mark.
Examples
  • The following query may return <XML>...<b>hello</b>...</XML> if a text node of the database DB contains the string "hello world":
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
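
A longer extract can be requested by raising the $length argument. The following sketch assumes a database named DB; the value 40 is an arbitrary choice, and the effective length of the returned text may differ, as noted above:

```xquery
(: Assumption: a database named "DB" exists. :)
(: Marks hits with <b/> and requests roughly 40 characters of context. :)
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 40)
```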

ft:count

Signatures ft:count($nodes as node()*) as xs:integer
Summary Returns the number of occurrences of the search terms specified in a full-text expression.
Examples
  • ft:count(//*[text() contains text 'QUERY']) returns the xs:integer value 2 if a document contains two occurrences of the string "QUERY".
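
The function may also be used in a FLWOR expression. The following sketch assumes a database named DB and repeats the full-text expression inside ft:count(), following the pattern of Example 3 of ft:mark; whether the repetition is strictly required for ft:count is an assumption:

```xquery
(: Assumption: a database named "DB" exists. :)
let $term := 'QUERY'
for $hit in db:open('DB')//*[text() contains text { $term }]
(: the full-text expression is repeated so that positions can be counted :)
let $count := ft:count($hit[text() contains text { $term }])
order by $count descending
return <hit count="{ $count }">{ name($hit) }</hit>
```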

ft:score

Signatures ft:score($item as item()*) as xs:double*
Summary Returns the score values (0.0 - 1.0) that have been attached to the specified items. 0 is returned as the value if no score was attached.
Examples
  • ft:score('a' contains text 'a') returns the xs:double value 1.
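
The score can be used to rank results. The following sketch assumes a database named DB; it evaluates the contains text expression for every text node and therefore does not necessarily benefit from the full-text index:

```xquery
(: Assumption: a database named "DB" exists. :)
for $text in db:open('DB')//text()
(: request the score that is attached to the result of the full-text expression :)
let $score := ft:score($text contains text 'hello')
where $score > 0
order by $score descending
return <hit score="{ $score }">{ $text }</hit>
```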

ft:tokens

Signatures ft:tokens($db as xs:string) as element(value)*
ft:tokens($db as xs:string, $prefix as xs:string) as element(value)*
Summary Returns all full-text tokens stored in the index of the database $db, along with their numbers of occurrences.
If $prefix is specified, the returned nodes will be restricted to the tokens starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
Errors BXDB0002: The addressed database does not exist or could not be opened.
BXDB0004: The full-text index is not available.
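
A minimal sketch of a prefix lookup, assuming a database named DB with a full-text index. The prefix hel is an arbitrary choice, and it is assumed here that the token is the string value of each returned element:

```xquery
(: Assumptions: database "DB" has a full-text index; "hel" is an arbitrary prefix. :)
for $token in ft:tokens("DB", "hel")
(: return the token strings only; the occurrence information is ignored :)
return string($token)
```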

ft:tokenize

Signatures ft:tokenize($input as xs:string) as xs:string*
Summary Tokenizes the given $input string, using the current default full-text options.
Examples
  • ft:tokenize("No Doubt") returns the two strings no and doubt.
  • declare ft-option using stemming; ft:tokenize("GIFTS") returns a single string gift.

Errors

Code Description
BXFT0001 Both wildcards and fuzzy search have been specified as search options.

Changelog

Version 7.7
  • Updated: the functions no longer accept Database Nodes as reference. Instead, the name of a database must now be specified.
Version 7.2
  • Updated: ft:search (second argument generalized, third parameter added)
Version 7.1