Difference between revisions of "Full-Text Module"
m (Text replace - "error codes updates" to "error codes updated") |
m (Text replacement - "[http://www.w3.org/TR/xpath" to "[https://www.w3.org/TR/xpath") |
||
(12 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | This [[Module Library|XQuery Module]] extends the [ | + | This [[Module Library|XQuery Module]] extends the [https://www.w3.org/TR/xpath-full-text-10 W3C Full Text Recommendation] with some useful functions: The index can be directly accessed, fulltext results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items. |
=Conventions= | =Conventions= | ||
− | |||
− | |||
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.<br/> | All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.<br/> | ||
Line 14: | Line 12: | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|ft:search|$db as xs:string, $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as xs:string, $terms as item()*, $options as map( | + | |{{Func|ft:search|$db as xs:string, $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as xs:string, $terms as item()*, $options as map(*)?|text()*}} |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
Line 43: | Line 41: | ||
* Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code> | * Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code> | ||
* Return text nodes that contain the terms {{Code|A}} and {{Code|B|}} in a distance of at most 5 words: | * Return text nodes that contain the terms {{Code|A}} and {{Code|B|}} in a distance of at most 5 words: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
ft:search("db", ("A", "B"), map { | ft:search("db", ("A", "B"), map { | ||
"mode": "all words", | "mode": "all words", | ||
Line 51: | Line 49: | ||
} | } | ||
}) | }) | ||
− | </ | + | </syntaxhighlight> |
* Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes: | * Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
let $terms := "Hello Worlds" | let $terms := "Hello Worlds" | ||
let $fuzzy := true() | let $fuzzy := true() | ||
Line 59: | Line 57: | ||
let $dbname := 'DB' || $db | let $dbname := 'DB' || $db | ||
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/.. | return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/.. | ||
− | </ | + | </syntaxhighlight> |
|} | |} | ||
Line 67: | Line 65: | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|ft:contains|$input as item()*, $terms as item()*|xs:boolean}}<br/>{{Func|ft:contains|$input as item()*, $terms as item()*, $options as map( | + | |{{Func|ft:contains|$input as item()*, $terms as item()*|xs:boolean}}<br/>{{Func|ft:contains|$input as item()*, $terms as item()*, $options as map(*)?|xs:boolean}} |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
Line 82: | Line 80: | ||
| | | | ||
* Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}: | * Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" }) | ft:contains("John Doe", ("jack", "john"), map { "mode": "any" }) | ||
− | </ | + | </syntaxhighlight> |
* Calls the function with stemming turned on and off: | * Calls the function with stemming turned on and off: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' }) | (true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' }) | ||
− | </ | + | </syntaxhighlight> |
|} | |} | ||
==ft:mark== | ==ft:mark== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
Line 98: | Line 97: | ||
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | |Puts a marker element around the resulting {{Code|$nodes}} of a full-text | + | |Puts a marker element around the resulting {{Code|$nodes}} of a full-text request.<br />The default name of the marker element is {{Code|mark}}. An alternative name can be chosen via the optional {{Code|$name}} argument.<br />Please note that: |
− | * | + | * The full-text expression that computes the token positions must be specified as argument of the <code>ft:mark()</code> function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2. |
− | * | + | * The supplied node must be a [[Database Module#Database Node|Database Node]]. As shown in Example 3, {{Code|update}} or {{Code|transform}} can be utilized to convert a fragment to the required internal representation. |
|- | |- | ||
| '''Examples''' | | '''Examples''' | ||
|'''Example 1''': The following query returns {{Code|<XML><mark>hello</mark> world</XML>}}, if one text node of the database {{Code|DB}} has the value "hello world": | |'''Example 1''': The following query returns {{Code|<XML><mark>hello</mark> world</XML>}}, if one text node of the database {{Code|DB}} has the value "hello world": | ||
− | < | + | <syntaxhighlight lang="xquery"> |
ft:mark(db:open('DB')//*[text() contains text 'hello']) | ft:mark(db:open('DB')//*[text() contains text 'hello']) | ||
− | </ | + | </syntaxhighlight> |
'''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression: | '''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
let $start := 1 | let $start := 1 | ||
let $end := 10 | let $end := 10 | ||
Line 116: | Line 115: | ||
ft:mark($ft[text() contains text { $term }]) | ft:mark($ft[text() contains text { $term }]) | ||
} | } | ||
− | </ | + | </syntaxhighlight> |
− | '''Example 3''': The following expression returns | + | '''Example 3''': The following expression returns <code><xml>hello <b>word</b></xml></code>: |
− | < | + | <syntaxhighlight lang="xquery"> |
− | copy $p := | + | copy $p := <xml>hello world</xml> |
modify () | modify () | ||
− | return ft:mark($p[text() contains text 'word'], 'b')</ | + | return ft:mark($p[text() contains text 'word'], 'b') |
+ | </syntaxhighlight> | ||
|} | |} | ||
==ft:extract== | ==ft:extract== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
Line 136: | Line 137: | ||
| | | | ||
* The following query may return {{Code|<XML>...<b>hello</b>...<XML>}} if a text node of the database {{Code|DB}} contains the string "hello world": | * The following query may return {{Code|<XML>...<b>hello</b>...<XML>}} if a text node of the database {{Code|DB}} contains the string "hello world": | ||
− | < | + | <syntaxhighlight lang="xquery"> |
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1) | ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1) | ||
− | </ | + | </syntaxhighlight> |
|} | |} | ||
Line 156: | Line 157: | ||
==ft:score== | ==ft:score== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
Line 183: | Line 185: | ||
| '''Examples''' | | '''Examples''' | ||
|Returns the number of occurrences for a single, specific index entry: | |Returns the number of occurrences for a single, specific index entry: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
let $term := ft:tokenize($term) | let $term := ft:tokenize($term) | ||
return number(ft:tokens('db', $term)[. = $term]/@count) | return number(ft:tokens('db', $term)[. = $term]/@count) | ||
− | </ | + | </syntaxhighlight> |
|} | |} | ||
==ft:tokenize== | ==ft:tokenize== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|ft:tokenize|$ | + | |{{Func|ft:tokenize|$string as xs:string?|xs:string*}}<br/>{{Func|ft:tokenize|$string as xs:string?, $options as map(*)?|xs:string*}} |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | |Tokenizes the given {{Code|$ | + | |Tokenizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument, and returns a sequence with the tokenized string. The following options are available: |
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive. | * {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive. | ||
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritical insensitive. | * {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritical insensitive. | ||
Line 211: | Line 214: | ||
==ft:normalize== | ==ft:normalize== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|ft:normalize|$ | + | |{{Func|ft:normalize|$string as xs:string?|xs:string}}<br/>{{Func|ft:normalize|$string as xs:string?, $options as map(*)?|xs:string}} |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | |Normalizes the given {{Code|$ | + | |Normalizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument. The function expects the same arguments as [[#ft:tokenize|ft:tokenize]]. |
|- | |- | ||
| '''Examples''' | | '''Examples''' | ||
Line 225: | Line 229: | ||
=Errors= | =Errors= | ||
− | |||
− | |||
{| class="wikitable" width="100%" | {| class="wikitable" width="100%" | ||
Line 237: | Line 239: | ||
=Changelog= | =Changelog= | ||
+ | |||
+ | ; Version 9.1 | ||
+ | * Updated: [[#ft:tokenize|ft:tokenize]] and [[#ft:normalize|ft:normalize]] can be called with empty sequence. | ||
;Version 9.0 | ;Version 9.0 |
Revision as of 12:33, 2 July 2020
This XQuery Module extends the W3C Full Text Recommendation with some useful functions: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the contains text expression, can be explicitly requested from items.
Conventions
All functions and errors in this module are assigned to the http://basex.org/modules/ft namespace, which is statically bound to the ft prefix.
Functions
ft:search
Signatures | ft:search($db as xs:string, $terms as item()*) as text()* ft:search($db as xs:string, $terms as item()*, $options as map(*)?) as text()*
|
Summary | Returns all text nodes from the full-text index of the database $db that contain the specified $terms. The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well. The
|
Errors | db:open: The addressed database does not exist or could not be opened.
db:no-index: The index is not available.
options: The fuzzy and wildcard options cannot both be specified.
|
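Examples |
- Return all text nodes of the database DB that contain the numbers 2010 and 2020: ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })
- Return text nodes that contain the terms A and B in a distance of at most 5 words:
<syntaxhighlight lang="xquery">
ft:search("db", ("A", "B"), map {
  "mode": "all words",
  "distance": map {
    "max": "5",
    "unit": "words"
  }
})
</syntaxhighlight>
- Iterate over three databases and return all elements containing terms similar to Hello World in the text nodes:
<syntaxhighlight lang="xquery">
let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
</syntaxhighlight> |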
ft:contains
Signatures | ft:contains($input as item()*, $terms as item()*) as xs:boolean ft:contains($input as item()*, $terms as item()*, $options as map(*)?) as xs:boolean
|
Summary | Checks if the specified $input items contain the specified $terms. The function does the same as the Full-Text expression contains text, but options can be specified more dynamically. The $options are the same as for ft:search, and the following ones in addition:
|
Errors | options: Specified options are conflicting.
|
Examples |
- Checks if jack or john occurs in the input string John Doe:
<syntaxhighlight lang="xquery">
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
</syntaxhighlight>
- Calls the function with stemming turned on and off:
<syntaxhighlight lang="xquery">
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language': 'de' })
</syntaxhighlight> |
ft:mark
Signatures | ft:mark($nodes as node()*) as node()* ft:mark($nodes as node()*, $name as xs:string) as node()*
|
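Summary | Puts a marker element around the resulting $nodes of a full-text request. The default name of the marker element is mark. An alternative name can be chosen via the optional $name argument. Please note that:
- The full-text expression that computes the token positions must be specified as argument of the ft:mark() function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
- The supplied node must be a Database Node. As shown in Example 3, update or transform can be utilized to convert a fragment to the required internal representation.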
|
Examples | Example 1: The following query returns <XML><mark>hello</mark> world</XML> if one text node of the database DB has the value "hello world":
<syntaxhighlight lang="xquery">
ft:mark(db:open('DB')//*[text() contains text 'hello'])
</syntaxhighlight>
Example 2: The following expression loops through the first ten full-text results and marks the results in a second expression:
<syntaxhighlight lang="xquery">
let $start := 1
let $end := 10
let $term := 'welcome'
for $ft in (db:open('DB')//*[text() contains text { $term }])[position() = $start to $end]
return element hit {
  ft:mark($ft[text() contains text { $term }])
}
</syntaxhighlight>
Example 3: The following expression returns <xml>hello <b>world</b></xml>:
<syntaxhighlight lang="xquery">
copy $p := <xml>hello world</xml>
modify ()
return ft:mark($p[text() contains text 'world'], 'b')
</syntaxhighlight> |
ft:extract
Signatures | ft:extract($nodes as node()*) as node()* ft:extract($nodes as node()*, $name as xs:string) as node()* ft:extract($nodes as node()*, $name as xs:string, $length as xs:integer) as node()*
|
Summary | Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting $nodes of a full-text index request and chops irrelevant sections of the result. The default element name of the marker element is mark. An alternative element name can be chosen via the optional $name argument. The default length of the returned text is 150 characters. An alternative length can be specified via the optional $length argument. Note that the effective text length may differ from the specified length due to formatting and readability issues. For more details on this function, please have a look at ft:mark. |
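Examples |
- The following query may return <XML>...<b>hello</b>...</XML> if a text node of the database DB contains the string "hello world":
<syntaxhighlight lang="xquery">
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
</syntaxhighlight> |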
ft:count
Signatures | ft:count($nodes as node()*) as xs:integer
|
Summary | Returns the number of occurrences of the search terms specified in a full-text expression. |
Examples |
|
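As a minimal sketch of how ft:count might be called: the database name DB and the search term hello are placeholders, and the input is assumed to consist of database nodes, as for ft:mark. The full-text expression is passed directly as the argument so that its position information is still available when the occurrences are counted.

<syntaxhighlight lang="xquery">
(: 'DB' is a hypothetical database name; 'hello' is an arbitrary search term :)
ft:count(db:open('DB')//*[text() contains text 'hello'])
</syntaxhighlight>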
ft:score
Signatures | ft:score($item as item()*) as xs:double*
|
Summary | Returns the score values (0.0 - 1.0) that have been attached to the specified items. 0 is returned as value if no score was attached.
|
Examples |
|
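The following sketch shows two ways in which ft:score could be used. The database name DB and the element name hit are assumptions made for illustration. The first expression requests the score of a single contains text expression; the second one sorts the results of ft:search by their score values.

<syntaxhighlight lang="xquery">
(: score of a single full-text expression :)
ft:score('a' contains text 'a'),

(: 'DB' is a hypothetical database with a full-text index :)
for $text in ft:search('DB', 'hello')
let $score := ft:score($text)
order by $score descending
return <hit score='{ $score }'>{ string($text) }</hit>
</syntaxhighlight>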
ft:tokens
Signatures | ft:tokens($db as xs:string) as element(value)* ft:tokens($db as xs:string, $prefix as xs:string) as element(value)*
|
Summary | Returns all full-text tokens stored in the index of the database $db, along with their numbers of occurrences. If $prefix is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
|
Errors | db:open: The addressed database does not exist or could not be opened.
db:no-index: The full-text index is not available.
|
Examples | Returns the number of occurrences for a single, specific index entry:
<syntaxhighlight lang="xquery">
let $term := ft:tokenize($term)
return number(ft:tokens('db', $term)[. = $term]/@count)
</syntaxhighlight> |
ft:tokenize
Signatures | ft:tokenize($string as xs:string?) as xs:string* ft:tokenize($string as xs:string?, $options as map(*)?) as xs:string*
|
Summary | Tokenizes the given $string, using the current default full-text options or the $options specified as second argument, and returns a sequence with the tokenized string. The following options are available:
The |
Examples |
|
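A short sketch of ft:tokenize with and without options; the input strings are arbitrary. It assumes the default full-text options, under which tokens are compared case and diacritics insensitively, so the tokens are returned in lower case.

<syntaxhighlight lang="xquery">
(: default options: yields the two strings "no" and "doubt" :)
ft:tokenize("No Doubt"),

(: diacritics are preserved, case is still normalized: yields "école" :)
ft:tokenize("École", map { 'diacritics': 'sensitive' }),

(: since Version 9.1, the empty sequence is accepted as input :)
ft:tokenize(())
</syntaxhighlight>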
ft:normalize
Signatures | ft:normalize($string as xs:string?) as xs:string ft:normalize($string as xs:string?, $options as map(*)?) as xs:string
|
Summary | Normalizes the given $string, using the current default full-text options or the $options specified as second argument. The function expects the same arguments as ft:tokenize.
|
Examples |
|
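A possible call of ft:normalize, sketched with an arbitrary German input string. Unlike ft:tokenize, the function returns a single string. With the case option set to sensitive and all other options left at their defaults, the result is expected to keep the original case but to lose its diacritics.

<syntaxhighlight lang="xquery">
(: expected result under default diacritics handling: "Hauser am Meer" :)
ft:normalize("Häuser am Meer", map { 'case': 'sensitive' })
</syntaxhighlight>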
Errors
Code | Description |
---|---|
options | Both wildcards and fuzzy search have been specified as search options. |
Changelog
- Version 9.1
  - Updated: ft:tokenize and ft:normalize can be called with empty sequence.
- Version 9.0
  - Updated: error codes updated; errors now use the module namespace
- Version 8.0
  - Added: ft:contains, ft:normalize
  - Updated: Options added to ft:tokenize
- Version 7.8
  - Added: ft:contains
  - Updated: Options added to ft:search
- Version 7.7
  - Updated: the functions no longer accept Database Nodes as reference. Instead, the name of a database must now be specified.
- Version 7.2
  - Updated: ft:search (second argument generalized, third parameter added)
- Version 7.1
  - Added: ft:tokens, ft:tokenize