Changes

7,204 bytes added ,  12:33, 2 July 2020
m
Text replacement - "[http://www.w3.org/TR/xpath" to "[https://www.w3.org/TR/xpath"
This [[Module Library|XQuery Module]] extends the [https://www.w3.org/TR/xpath-full-text-10 W3C Full Text Recommendation] with some useful functions: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items.

=Conventions=

All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.<br/>

=Functions=
==ft:search==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:search|$db as xs:string, $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as xs:string, $terms as item()*, $options as map(*)?|text()*}}
|-
| valign='top' | '''Summary'''
|Returns all text nodes from the full-text index of the database {{Code|$db}} that contain the specified {{Code|$terms}}.<br/>The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.

The {{Code|$options}} argument can be used to control full-text processing. The following options are supported (the introduction on [[Full-Text]] processing gives you equivalent expressions in the XQuery Full-Text notation):
* {{Code|mode}}: determines how tokens are searched. Allowed values are {{Code|any}}, {{Code|any word}}, {{Code|all}}, {{Code|all words}}, and {{Code|phrase}}. {{Code|any}} is the default search mode.
* {{Code|fuzzy}}: turns fuzzy querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, fuzzy querying is turned off.
* {{Code|wildcards}}: turns wildcard querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, wildcard querying is turned off.
* {{Code|ordered}}: requires that all tokens occur in the order in which they are specified. Allowed values are {{Code|true}} and {{Code|false}}. The default is {{Code|false}}.
* {{Code|content}}: specifies that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are {{Code|start}}, {{Code|end}}, and {{Code|entire}}. By default, the option is turned off.
* {{Code|scope}}: defines the scope in which tokens must be located. The option has the following suboptions:
** {{Code|same}}: can be set to {{Code|true}} or {{Code|false}}. It specifies whether tokens need to occur in the same or in different units.
** {{Code|unit}}: can be {{Code|sentence}} or {{Code|paragraph}}. It specifies the unit for finding tokens.
* {{Code|window}}: sets up a window in which all tokens must be located. By default, the option is turned off. It has the following suboptions:
** {{Code|size}}: specifies the size of the window in terms of ''units''.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
* {{Code|distance}}: specifies the distance in which tokens must occur. By default, the option is turned off. It has the following suboptions:
** {{Code|min}}: specifies the minimum distance in terms of ''units''. The default is {{Code|0}}.
** {{Code|max}}: specifies the maximum distance in terms of ''units''. The default is {{Code|∞}}.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} the addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the full-text index is not available.<br/>{{Error|options|#Errors}} the fuzzy and wildcard options cannot both be specified.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:search("DB", "QUERY")}}: Return all text nodes of the database {{Code|DB}} that contain the term {{Code|QUERY}}.
* Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code>
* Return text nodes that contain the terms {{Code|A}} and {{Code|B}} in a distance of at most 5 words:
<syntaxhighlight lang="xquery">
ft:search("db", ("A", "B"), map {
  "mode": "all words",
  "distance": map {
    "max": "5",
    "unit": "words"
  }
})
</syntaxhighlight>
* Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes:
<syntaxhighlight lang="xquery">
let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
</syntaxhighlight>
|}
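Several of the options listed above can be combined in a single map. The following query is a minimal sketch, not a reference example: the database name {{Code|DB}} and the two search terms are placeholders, and the database is assumed to have been created with a full-text index. It asks for text nodes in which both terms occur in the specified order within a window of ten words:
<syntaxhighlight lang="xquery">
(: "DB", "full" and "text" are placeholder values for illustration :)
ft:search("DB", ("full", "text"), map {
  "mode": "all words",   (: every word of every term must be found :)
  "ordered": true(),     (: tokens must occur in the specified order :)
  "window": map {
    "size": "10",        (: window size, given in units :)
    "unit": "words"
  }
})
</syntaxhighlight>
The window size is passed as a string here only to mirror the {{Code|max}} value in the distance example above.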
==ft:contains==

{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:contains|$input as item()*, $terms as item()*|xs:boolean}}<br />{{Func|ft:contains|$input as item()*, $terms as item()*, $options as map(*)?|xs:boolean}}
|-
| valign='top' | '''Summary'''
|Checks if the specified {{Code|$input}} items contain the specified {{Code|$terms}}.<br />The function does the same as the [[Full-Text]] expression {{Code|contains text}}, but options can be specified more dynamically.<br />The {{Code|$options}} are the same as for [[#ft:search|ft:search]], and the following ones in addition:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case-insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics-insensitive.
* {{Code|stemming}}: determines whether tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
|-
| valign='top' | '''Errors'''
|{{Error|options|#Errors}} specified options are conflicting.
|-
| valign='top' | '''Examples'''
|
* Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}:
<syntaxhighlight lang="xquery">
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
</syntaxhighlight>
* Calls the function with stemming turned on and off:
<syntaxhighlight lang="xquery">
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' })
</syntaxhighlight>
|}

==ft:mark==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:mark|$nodes as node()*|node()*}}<br />{{Func|ft:mark|$nodes as node()*, $name as xs:string|node()*}}
|-
| '''Summary'''
|Puts a marker element around the resulting {{Code|$nodes}} of a full-text request.<br />The default name of the marker element is {{Code|mark}}. An alternative name can be chosen via the optional {{Code|$name}} argument.<br />Please note that:
* The full-text expression that computes the token positions must be specified as argument of the <code>ft:mark()</code> function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
* The supplied node must be a [[Database Module#Database Node|Database Node]]. As shown in Example 3, {{Code|update}} or {{Code|transform}} can be utilized to convert a fragment to the required internal representation.
|-
| '''Examples'''
|'''Example 1''': The following query returns {{Code|&lt;XML&gt;&lt;mark&gt;hello&lt;/mark&gt; world&lt;/XML&gt;}}, if one text node of the database {{Code|DB}} has the value "hello world":
<syntaxhighlight lang="xquery">
ft:mark(db:open('DB')//*[text() contains text 'hello'])
</syntaxhighlight>
'''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression:
<syntaxhighlight lang="xquery">
let $start := 1
let $end   := 10
let $term  := 'welcome'
for $ft in (db:open('DB')//*[text() contains text { $term }])[position() = $start to $end]
return element hit {
  ft:mark($ft[text() contains text { $term }])
}
</syntaxhighlight>
'''Example 3''': The following expression returns <code>&lt;xml&gt;hello &lt;b&gt;world&lt;/b&gt;&lt;/xml&gt;</code>:
<syntaxhighlight lang="xquery">
copy $p := <xml>hello world</xml>
modify ()
return ft:mark($p[text() contains text 'world'], 'b')
</syntaxhighlight>
|}
==ft:extract==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:extract|$nodes as node()*|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string, $length as xs:integer|node()*}}
|-
| valign='top' | '''Summary'''
|Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting {{Code|$nodes}} of a full-text index request and chops irrelevant sections of the result.<br />The default element name of the marker element is {{Code|mark}}. An alternative element name can be chosen via the optional {{Code|$name}} argument.<br />The default length of the returned text is {{Code|150}} characters. An alternative length can be specified via the optional {{Code|$length}} argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.<br />For more details on this function, please have a look at [[#ft:mark|ft:mark]].
|-
| valign='top' | '''Errors'''
|'''[[XQuery Errors#BaseX Errors|BASX0002]]''' is raised if a referenced node is not stored in a database (i.e., references a main-memory XML fragment).<br />'''[[XQuery Errors#Functions Errors|FOCA0002]]''' is raised if {{Code|$name}} is not a valid QName.
|-
| valign='top' | '''Examples'''
|
* The following query may return {{Code|&lt;XML&gt;...&lt;b&gt;hello&lt;/b&gt;...&lt;/XML&gt;}} if a text node of the database {{Code|DB}} contains the string "hello world":
<syntaxhighlight lang="xquery">
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
</syntaxhighlight>
|}
==ft:count==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:count|$nodes as node()*|xs:integer}}
|-
| valign='top' | '''Summary'''
|Returns the number of occurrences of the search terms specified in a full-text expression.
|-
| valign='top' | '''Errors'''
|'''[[XQuery Errors#BaseX Errors|BASX0002]]''' is raised if a referenced node is not stored in a database (i.e., references a main-memory XML fragment).
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:count(//*[text() contains text 'QUERY'])}} returns the {{Code|xs:integer}} value {{Code|2}} if a document contains two occurrences of the string "QUERY".
|}
==ft:score==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:score|$item as item()*|xs:double*}}
|-
| valign='top' | '''Summary'''
|Returns the score values (0.0 - 1.0) that have been attached to the specified items. {{Code|0}} is returned as value if no score was attached.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:score('a' contains text 'a')}} returns the {{Code|xs:double}} value {{Code|1}}.
|}
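As a small sketch that needs no database, the score can also be requested for several {{Code|contains text}} results at once. The two input strings and the search term are made-up placeholders; the query returns one {{Code|xs:double}} value per input string:
<syntaxhighlight lang="xquery">
(: placeholder strings; each one is tested against the term "hello" :)
for $text in ("hello world", "goodbye")
return ft:score($text contains text "hello")
</syntaxhighlight>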
==ft:tokens==
{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:tokens|$db as xs:string|element(value)*}}<br/>{{Func|ft:tokens|$db as xs:string, $prefix as xs:string|element(value)*}}
|-
| valign='top' | '''Summary'''
|Returns all full-text tokens stored in the index of the database {{Code|$db}}, along with their numbers of occurrences.<br/>If {{Code|$prefix}} is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} the addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the full-text index is not available.
|-
| valign='top' | '''Examples'''
|Returns the number of occurrences for a single, specific index entry:
<syntaxhighlight lang="xquery">
let $term := ft:tokenize($term)
return number(ft:tokens('db', $term)[. = $term]/@count)
</syntaxhighlight>
|}
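The prefix variant can be used to list index entries together with their frequencies. The following query is a hedged sketch: the database name {{Code|DB}} and the prefix {{Code|hel}} are placeholders, and it relies on the {{Code|count}} attribute that is also used in the example above:
<syntaxhighlight lang="xquery">
(: "DB" and "hel" are placeholder values for illustration :)
for $token in ft:tokens("DB", "hel")
(: the count attribute holds the number of occurrences of the token :)
order by xs:integer($token/@count) descending
return string($token) || ": " || $token/@count
</syntaxhighlight>
Sorting by the numeric value of the attribute puts the most frequent tokens first.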
==ft:tokenize==
{|width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:tokenize|$string as xs:string?|xs:string*}}<br/>{{Func|ft:tokenize|$string as xs:string?, $options as map(*)?|xs:string*}}
|-
| valign='top' | '''Summary'''
|Tokenizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument, and returns a sequence with the tokenized string. The following options are available:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case-insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics-insensitive.
* {{Code|stemming}}: determines whether tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
The {{Code|$options}} argument can be used to control full-text processing.
|-
| valign='top' | '''Examples'''
|
* <code>ft:tokenize("No Doubt")</code> returns the two strings {{Code|no}} and {{Code|doubt}}.
* <code>ft:tokenize("École", map { 'diacritics': 'sensitive' })</code> returns the string {{Code|école}}.
* <code>declare ft-option using stemming; ft:tokenize("GIFTS")</code> returns a single string {{Code|gift}}.
|}

==ft:normalize==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:normalize|$string as xs:string?|xs:string}}<br/>{{Func|ft:normalize|$string as xs:string?, $options as map(*)?|xs:string}}
|-
| '''Summary'''
|Normalizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument. The function expects the same arguments as [[#ft:tokenize|ft:tokenize]].
|-
| '''Examples'''
|
* <code>ft:normalize("Häuser am Meer", map { 'case': 'sensitive' })</code> returns the string {{Code|Hauser am Meer}}.
|}

=Errors=

{| class="wikitable" width="100%"
! width="110"|Code
|Description
|-
|{{Code|options}}
|Both wildcards and fuzzy search have been specified as search options.
|}
=Changelog=

;Version 9.1
* Updated: [[#ft:tokenize|ft:tokenize]] and [[#ft:normalize|ft:normalize]] can be called with empty sequence.

;Version 9.0
* Updated: error codes updated; errors now use the module namespace

;Version 8.0
* Added: [[#ft:contains|ft:contains]], [[#ft:normalize|ft:normalize]]
* Updated: Options added to [[#ft:tokenize|ft:tokenize]]

;Version 7.8
* Added: [[#ft:contains|ft:contains]]
* Updated: Options added to [[#ft:search|ft:search]]

;Version 7.7
* Updated: the functions no longer accept [[Database Module#Database Nodes|Database Nodes]] as reference. Instead, the name of a database must now be specified.

;Version 7.2
* Updated: [[#ft:search|ft:search]] (second argument generalized, third parameter added)

;Version 7.1
* Added: [[#ft:tokens|ft:tokens]], [[#ft:tokenize|ft:tokenize]]
 
[[Category:XQuery]]