Changes

4,642 bytes added ,  12:33, 2 July 2020
Text replacement - "[http://www.w3.org/TR/xpath" to "[https://www.w3.org/TR/xpath"
This [[Module Library|XQuery Module]] extends the [https://www.w3.org/TR/xpath-full-text-10 W3C Full Text Recommendation] with some useful functions: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items.

=Conventions=

All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.

=Functions=
==ft:search==
{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:search|$db as xs:string, $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as xs:string, $terms as item()*, $options as map(*)?|text()*}}
|-
| valign='top' | '''Summary'''
|Returns all text nodes from the full-text index of the database {{Code|$db}} that contain the specified {{Code|$terms}}.<br/>The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.
The {{Code|$options}} argument can be used to control full-text processing. The following options are supported (the introduction on [[Full-Text]] processing gives you equivalent expressions in the XQuery Full-Text notation):
* {{Code|mode}}: determines how tokens are searched. Allowed values are {{Code|any}}, {{Code|any word}}, {{Code|all}}, {{Code|all words}}, and {{Code|phrase}}. {{Code|any}} is the default search mode.
* {{Code|fuzzy}}: turns fuzzy querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, fuzzy querying is turned off.
* {{Code|wildcards}}: turns wildcard querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, wildcard querying is turned off.
* {{Code|ordered}}: requires that all tokens occur in the order in which they are specified. Allowed values are {{Code|true}} and {{Code|false}}. The default is {{Code|false}}.
* {{Code|content}}: specifies that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are {{Code|start}}, {{Code|end}}, and {{Code|entire}}. By default, the option is turned off.
* {{Code|scope}}: defines the scope in which tokens must be located. The option has the following sub options:
** {{Code|same}}: can be set to {{Code|true}} or {{Code|false}}. It specifies if tokens need to occur in the same or different units.
** {{Code|unit}}: can be {{Code|sentence}} or {{Code|paragraph}}. It specifies the unit for finding tokens.
* {{Code|window}}: sets up a window in which all tokens must be located. By default, the option is turned off. It has the following sub options:
** {{Code|size}}: specifies the size of the window in terms of ''units''.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
* {{Code|distance}}: specifies the distance in which tokens must occur. By default, the option is turned off. It has the following sub options:
** {{Code|min}}: specifies the minimum distance in terms of ''units''. The default is {{Code|0}}.
** {{Code|max}}: specifies the maximum distance in terms of ''units''. The default is {{Code|∞}}.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} The addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} The full-text index is not available.<br/>{{Error|options|#Errors}} The fuzzy and wildcard options cannot both be specified.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:search("DB", "QUERY")}}: Return all text nodes of the database {{Code|DB}} that contain the term {{Code|QUERY}}.
* Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code>
* Return text nodes that contain the terms {{Code|A}} and {{Code|B}} in a distance of at most 5 words:
<syntaxhighlight lang="xquery">
ft:search("db", ("A", "B"), map {
  "mode": "all words",
  "distance": map {
    "max": "5",
    "unit": "words"
  }
})
</syntaxhighlight>
* Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes:
<syntaxhighlight lang="xquery">
let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
</syntaxhighlight>
|}
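
As a further sketch of how several of the options above might be combined, the following query enables {{Code|wildcards}} and {{Code|ordered}} at the same time. The database name {{Code|DB}} and both search terms are placeholders chosen for illustration, and the pattern assumes the {{Code|.*}} wildcard notation of the W3C Full Text Recommendation:
<syntaxhighlight lang="xquery">
(: text nodes that contain a token starting with "data" and the token "index",
   with the tokens occurring in the specified order :)
ft:search("DB", ("data.*", "index"), map {
  "mode": "all",
  "wildcards": true(),
  "ordered": true()
})
</syntaxhighlight>
Note that {{Code|fuzzy}} is deliberately left out here, as fuzzy and wildcard querying cannot be combined.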
==ft:contains==

{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:contains|$input as item()*, $terms as item()*|xs:boolean}}<br />{{Func|ft:contains|$input as item()*, $terms as item()*, $options as map(*)?|xs:boolean}}
|-
| valign='top' | '''Summary'''
|Checks if the specified {{Code|$input}} items contain the specified {{Code|$terms}}.<br />The function does the same as the [[Full-Text]] expression {{Code|contains text}}, but options can be specified more dynamically. The {{Code|$options}} are the same as for [[#ft:search|ft:search]], and the following ones in addition:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics insensitive.
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
|-
| valign='top' | '''Errors'''
|{{Error|options|#Errors}} The specified options are conflicting.
|-
| valign='top' | '''Examples'''
|
* Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}:
<syntaxhighlight lang="xquery">
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
</syntaxhighlight>
* Calls the function with stemming turned on and off:
<syntaxhighlight lang="xquery">
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language': 'de' })
</syntaxhighlight>
|}

==ft:mark==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:mark|$nodes as node()*|node()*}}<br />{{Func|ft:mark|$nodes as node()*, $name as xs:string|node()*}}
|-
| '''Summary'''
|Puts a marker element around the resulting {{Code|$nodes}} of a full-text request.<br />The default name of the marker element is {{Code|mark}}. An alternative name can be chosen via the optional {{Code|$name}} argument.<br />Please note that:
* The full-text expression that computes the token positions must be specified as argument of the <code>ft:mark()</code> function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
* The supplied node must be a [[Database Module#Database Node|Database Node]]. As shown in Example 3, {{Code|update}} or {{Code|transform}} can be utilized to convert a fragment to the required internal representation.
|-
| '''Examples'''
|'''Example 1''': The following query returns {{Code|&lt;XML&gt;&lt;mark&gt;hello&lt;/mark&gt; world&lt;/XML&gt;}}, if one text node of the database {{Code|DB}} has the value "hello world":
<syntaxhighlight lang="xquery">
ft:mark(db:open('DB')//*[text() contains text 'hello'])
</syntaxhighlight>
'''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression:
<syntaxhighlight lang="xquery">
let $start := 1
let $end   := 10
let $term  := 'welcome'
for $ft in (db:open('DB')//*[text() contains text { $term }])[position() = $start to $end]
return element hit {
  ft:mark($ft[text() contains text { $term }])
}
</syntaxhighlight>
'''Example 3''': The following expression returns <code>&lt;xml&gt;hello &lt;b&gt;world&lt;/b&gt;&lt;/xml&gt;</code>:
<syntaxhighlight lang="xquery">
copy $p := <xml>hello world</xml>
modify ()
return ft:mark($p[text() contains text 'world'], 'b')
</syntaxhighlight>
|}
==ft:extract==
{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:extract|$nodes as node()*|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string, $length as xs:integer|node()*}}
|-
| valign='top' | '''Summary'''
|Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting {{Code|$nodes}} of a full-text request and chops irrelevant sections of the result.<br />The default element name of the marker element is {{Code|mark}}. An alternative element name can be chosen via the optional {{Code|$name}} argument.<br />The default length of the returned text is {{Code|150}} characters. An alternative length can be specified via the optional {{Code|$length}} argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.<br />For more details on this function, please have a look at [[#ft:mark|ft:mark]].
|-
| valign='top' | '''Examples'''
|
* The following query may return {{Code|&lt;XML&gt;...&lt;b&gt;hello&lt;/b&gt;...&lt;/XML&gt;}} if a text node of the database {{Code|DB}} contains the string "hello world":
<syntaxhighlight lang="xquery">
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
</syntaxhighlight>
|}
==ft:count==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:count|$nodes as node()*|xs:integer}}
|-
| valign='top' | '''Summary'''
|Returns the number of occurrences of the search terms specified in a full-text expression.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:count(//*[text() contains text 'QUERY'])}} returns the {{Code|xs:integer}} value {{Code|2}} if a document contains two occurrences of the string "QUERY".
|}
==ft:score==
{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:score|$item as item()*|xs:double*}}
|-
| valign='top' | '''Summary'''
|Returns the score values (0.0 - 1.0) that have been attached to the specified items. {{Code|0}} is returned as value if no score was attached.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:score('a' contains text 'a')}} returns the {{Code|xs:double}} value {{Code|1}}.
|}
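
A minimal sketch of how score values might be requested for several items in one query. The two input strings are made up for illustration. Following the summary above, {{Code|0}} would be expected for an item to which no score was attached:
<syntaxhighlight lang="xquery">
(: evaluate a full-text expression per string and request the score of each result :)
for $string in ('a', 'b')
return ft:score($string contains text 'a')
</syntaxhighlight>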
==ft:tokens==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:tokens|$db as xs:string|element(value)*}}<br/>{{Func|ft:tokens|$db as xs:string, $prefix as xs:string|element(value)*}}
|-
| valign='top' | '''Summary'''
|Returns all full-text tokens stored in the index of the database {{Code|$db}}, along with their numbers of occurrences.<br/>If {{Code|$prefix}} is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} The addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} The full-text index is not available.
|-
| '''Examples'''
|Returns the number of occurrences for a single, specific index entry:
<syntaxhighlight lang="xquery">
let $term := ft:tokenize($term)
return number(ft:tokens('db', $term)[. = $term]/@count)
</syntaxhighlight>
|}
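
The {{Code|$prefix}} argument can be sketched in a similar way. The query below lists the ten most frequent index tokens that start with a given prefix. The database name {{Code|DB}} and the prefix {{Code|ab}} are placeholders, and the query assumes that each returned {{Code|value}} element carries a {{Code|count}} attribute, as in the example above:
<syntaxhighlight lang="xquery">
(: sort the matching tokens by their number of occurrences and keep the first ten :)
(
  for $token in ft:tokens("DB", "ab")
  order by xs:integer($token/@count) descending
  return $token
)[position() <= 10]
</syntaxhighlight>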
==ft:tokenize==
{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:tokenize|$string as xs:string?|xs:string*}}<br/>{{Func|ft:tokenize|$string as xs:string?, $options as map(*)?|xs:string*}}
|-
| valign='top' | '''Summary'''
|Tokenizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument, and returns a sequence with the tokenized string. The following options are available:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics insensitive.
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
The {{Code|$options}} argument can be used to control full-text processing.
|-
| valign='top' | '''Examples'''
|
* <code>ft:tokenize("No Doubt")</code> returns the two strings {{Code|no}} and {{Code|doubt}}.
* <code>ft:tokenize("École", map { 'diacritics': 'sensitive' })</code> returns the string {{Code|école}}.
* <code>declare ft-option using stemming; ft:tokenize("GIFTS")</code> returns a single string {{Code|gift}}.
|}

==ft:normalize==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:normalize|$string as xs:string?|xs:string}}<br/>{{Func|ft:normalize|$string as xs:string?, $options as map(*)?|xs:string}}
|-
| '''Summary'''
|Normalizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument. The function expects the same arguments as [[#ft:tokenize|ft:tokenize]].
|-
| '''Examples'''
|
* <code>ft:normalize("Häuser am Meer", map { 'case': 'sensitive' })</code> returns the string {{Code|Hauser am Meer}}.
|}

=Errors=

{| class="wikitable" width="100%"
! width="110"|Code
|Description
|-
|{{Code|options}}
|Both wildcards and fuzzy search have been specified as search options.
|}
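
The {{Code|options}} error might be handled with a standard XQuery {{Code|try}}/{{Code|catch}} expression. The following is a sketch only: it assumes that the error is raised under the name {{Code|ft:options}} in the module namespace, and the input string and search term are placeholders:
<syntaxhighlight lang="xquery">
(: fuzzy and wildcard querying are requested together, which is expected to conflict :)
try {
  ft:contains("John Doe", "john", map { "fuzzy": true(), "wildcards": true() })
} catch ft:options {
  "Conflicting full-text options: " || $err:description
}
</syntaxhighlight>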
=Changelog=
;Version 9.1
* Updated: [[#ft:tokenize|ft:tokenize]] and [[#ft:normalize|ft:normalize]] can be called with empty sequence.

;Version 9.0
* Updated: error codes updated; errors now use the module namespace

;Version 8.0
* Added: [[#ft:contains|ft:contains]], [[#ft:normalize|ft:normalize]]
* Updated: Options added to [[#ft:tokenize|ft:tokenize]]

;Version 7.8
* Added: [[#ft:contains|ft:contains]]
* Updated: Options added to [[#ft:search|ft:search]]

;Version 7.7
* Updated: the functions no longer accept [[Database Module#Database Nodes|Database Nodes]] as reference. Instead, the name of a database must now be specified.

;Version 7.2
* Updated: [[#ft:search|ft:search]] (second argument generalized, third parameter added)

;Version 7.1
* Added: [[#ft:tokens|ft:tokens]], [[#ft:tokenize|ft:tokenize]]
 
[[Category:XQuery]]