9,728 bytes added ,  12:33, 2 July 2020
m
Text replacement - "[http://www.w3.org/TR/xpath" to "[https://www.w3.org/TR/xpath"
This [[Module Library|XQuery Module]] extends the [https://www.w3.org/TR/xpath-full-text-10 W3C Full Text Recommendation] with some useful functions: the index can be accessed directly, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the {{Code|contains text}} expression, can be explicitly requested from items.

=Conventions=

All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/ft</nowiki></code> namespace, which is statically bound to the {{Code|ft}} prefix.<br/>

=Functions=

==ft:search==
{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:search|$db as xs:string, $terms as item()*|text()*}}<br/>{{Func|ft:search|$db as xs:string, $terms as item()*, $options as map(*)?|text()*}}
|-
| valign='top' | '''Summary'''
|Returns all text nodes from the full-text index of the database {{Code|$db}} that contain the specified {{Code|$terms}}.<br/>The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.
The {{Code|$options}} argument can be used to control full-text processing. The following options are supported (the introduction on [[Full-Text]] processing gives you equivalent expressions in the XQuery Full-Text notation):
* {{Code|mode}}: determines how tokens are searched. Allowed values are {{Code|any}}, {{Code|any word}}, {{Code|all}}, {{Code|all words}}, and {{Code|phrase}}. {{Code|any}} is the default search mode.
* {{Code|fuzzy}}: turns fuzzy querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, fuzzy querying is turned off.
* {{Code|wildcards}}: turns wildcard querying on or off. Allowed values are {{Code|true}} and {{Code|false}}. By default, wildcard querying is turned off.
* {{Code|ordered}}: requires that all tokens occur in the order in which they are specified. Allowed values are {{Code|true}} and {{Code|false}}. The default is {{Code|false}}.
* {{Code|content}}: specifies that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are {{Code|start}}, {{Code|end}}, and {{Code|entire}}. By default, the option is turned off.
* {{Code|scope}}: defines the scope in which tokens must be located. The option has the following sub options:
** {{Code|same}}: can be set to {{Code|true}} or {{Code|false}}. It specifies if tokens need to occur in the same or different units.
** {{Code|unit}}: can be {{Code|sentence}} or {{Code|paragraph}}. It specifies the unit for finding tokens.
* {{Code|window}}: sets up a window in which all tokens must be located. By default, the option is turned off. It has the following sub options:
** {{Code|size}}: specifies the size of the window in terms of ''units''.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
* {{Code|distance}}: specifies the distance in which tokens must occur. By default, the option is turned off. It has the following sub options:
** {{Code|min}}: specifies the minimum distance in terms of ''units''. The default is {{Code|0}}.
** {{Code|max}}: specifies the maximum distance in terms of ''units''. The default is {{Code|∞}}.
** {{Code|unit}}: can be {{Code|words}}, {{Code|sentences}} or {{Code|paragraphs}}. The default is {{Code|words}}.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} the addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the index is not available.<br/>{{Error|options|#Errors}} the fuzzy and wildcard options cannot both be specified.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:search("DB", "QUERY")}}: Return all text nodes of the database {{Code|DB}} that contain the term {{Code|QUERY}}.
* Return all text nodes of the database {{Code|DB}} that contain the numbers {{Code|2010}} and {{Code|2020}}:<br/><code>ft:search("DB", ("2010", "2020"), map { 'mode': 'all' })</code>
* Return text nodes that contain the terms {{Code|A}} and {{Code|B}} in a distance of at most 5 words:
<syntaxhighlight lang="xquery">
ft:search("db", ("A", "B"), map {
  "mode": "all words",
  "distance": map {
    "max": "5",
    "unit": "words"
  }
})
</syntaxhighlight>
* Iterate over three databases and return all elements containing terms similar to {{Code|Hello World}} in the text nodes:
<syntaxhighlight lang="xquery">
let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, map { 'fuzzy': $fuzzy })/..
</syntaxhighlight>
|}

==ft:contains==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:contains|$input as item()*, $terms as item()*|xs:boolean}}<br/>{{Func|ft:contains|$input as item()*, $terms as item()*, $options as map(*)?|xs:boolean}}
|-
| '''Summary'''
|Checks if the specified {{Code|$input}} items contain the specified {{Code|$terms}}.<br/>The function does the same as the [[Full-Text]] expression {{Code|contains text}}, but options can be specified more dynamically. The {{Code|$options}} are the same as for [[#ft:search|ft:search]], and the following ones in addition:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics insensitive.
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
|-
| '''Errors'''
|{{Error|options|#Errors}} specified options are conflicting.
|-
| '''Examples'''
|
* Checks if {{Code|jack}} or {{Code|john}} occurs in the input string {{Code|John Doe}}:
<syntaxhighlight lang="xquery">
ft:contains("John Doe", ("jack", "john"), map { "mode": "any" })
</syntaxhighlight>
* Calls the function with stemming turned on and off:
<syntaxhighlight lang="xquery">
(true(), false()) ! ft:contains("Häuser", "Haus", map { 'stemming': ., 'language':'de' })
</syntaxhighlight>
|}
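
Because the options are passed as a map, they can be computed at runtime, which the static {{Code|contains text}} expression does not allow. The following is a minimal sketch of that idea; the input strings, the search terms and the variable names are illustrative assumptions only:

<syntaxhighlight lang="xquery">
(: Assumption: sample input and terms, chosen only for illustration :)
let $input := ("Databases and XML", "Full-text search")
let $terms := ("xml", "search")
(: evaluate the same request once per search mode :)
for $mode in ("any", "all")
return ft:contains($input, $terms, map { "mode": $mode })
</syntaxhighlight>

The query returns one {{Code|xs:boolean}} value per mode, so the effect of an option can be compared without rewriting the full-text expression.
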
==ft:mark==
{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:mark|$nodes as node()*|node()*}}<br />{{Func|ft:mark|$nodes as node()*, $name as xs:string|node()*}}
|-
| '''Summary'''
|Puts a marker element around the resulting {{Code|$nodes}} of a full-text request.<br />The default name of the marker element is {{Code|mark}}. An alternative name can be chosen via the optional {{Code|$name}} argument.<br />Please note that:
* The full-text expression that computes the token positions must be specified as argument of the <code>ft:mark()</code> function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
* The supplied node must be a [[Database Module#Database Node|Database Node]]. As shown in Example 3, {{Code|update}} or {{Code|transform}} can be utilized to convert a fragment to the required internal representation.
|-
| '''Examples'''
|'''Example 1''': The following query returns {{Code|&lt;XML&gt;&lt;mark&gt;hello&lt;/mark&gt; world&lt;/XML&gt;}}, if one text node of the database {{Code|DB}} has the value "hello world":
<syntaxhighlight lang="xquery">
ft:mark(db:open('DB')//*[text() contains text 'hello'])
</syntaxhighlight>
'''Example 2''': The following expression loops through the first ten full-text results and marks the results in a second expression:
<syntaxhighlight lang="xquery">
let $start := 1
let $end   := 10
let $term  := 'welcome'
for $ft in (db:open('DB')//*[text() contains text { $term }])[position() = $start to $end]
return element hit {
  ft:mark($ft[text() contains text { $term }])
}
</syntaxhighlight>
'''Example 3''': The following expression returns <code>&lt;xml&gt;hello &lt;b&gt;world&lt;/b&gt;&lt;/xml&gt;</code>:
<syntaxhighlight lang="xquery">
copy $p := <xml>hello world</xml>
modify ()
return ft:mark($p[text() contains text 'world'], 'b')
</syntaxhighlight>
|}

==ft:extract==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:extract|$nodes as node()*|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string|node()*}}<br />{{Func|ft:extract|$nodes as node()*, $name as xs:string, $length as xs:integer|node()*}}
|-
| '''Summary'''
|Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting {{Code|$nodes}} of a full-text index request and chops irrelevant sections of the result.<br />The default element name of the marker element is {{Code|mark}}. An alternative element name can be chosen via the optional {{Code|$name}} argument.<br />The default length of the returned text is {{Code|150}} characters. An alternative length can be specified via the optional {{Code|$length}} argument. Note that the effective text length may differ from the specified length due to formatting and readability issues.<br />For more details on this function, please have a look at [[#ft:mark|ft:mark]].
|-
| '''Examples'''
|
* The following query may return {{Code|&lt;XML&gt;...&lt;b&gt;hello&lt;/b&gt;...&lt;/XML&gt;}} if a text node of the database {{Code|DB}} contains the string "hello world":
<syntaxhighlight lang="xquery">
ft:extract(db:open('DB')//*[text() contains text 'hello'], 'b', 1)
</syntaxhighlight>
|}

==ft:count==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:count|$nodes as node()*|xs:integer}}
|-
| '''Summary'''
|Returns the number of occurrences of the search terms specified in a full-text expression.
|-
| '''Examples'''
|
* {{Code|ft:count(//*[text() contains text 'QUERY'])}} returns the {{Code|xs:integer}} value {{Code|2}} if a document contains two occurrences of the string "QUERY".
|}

==ft:score==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:score|$item as item()*|xs:double*}}
|-
| valign='top' | '''Summary'''
|Returns the score values (0.0 - 1.0) that have been attached to the specified items. {{Code|0}} is returned as a value if no score was attached.
|-
| valign='top' | '''Examples'''
|
* {{Code|ft:score('a' contains text 'a')}} returns the {{Code|xs:double}} value {{Code|1}}.
|}
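
The score of several full-text comparisons can be requested in one pass. The sketch below follows the pattern of the example above and wraps each {{Code|contains text}} expression in {{Code|ft:score}}; the sample strings and variable names are assumptions made for illustration:

<syntaxhighlight lang="xquery">
(: Assumption: sample strings, chosen only for illustration :)
let $texts := ("databases", "full-text databases")
for $text in $texts
(: one score value per full-text comparison :)
return ft:score($text contains text "databases")
</syntaxhighlight>

Each returned value is an {{Code|xs:double}} between 0.0 and 1.0; the exact numbers depend on the scoring algorithm in use.
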
==ft:tokens==

{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:tokens|$db as xs:string|element(value)*}}<br />{{Func|ft:tokens|$db as xs:string, $prefix as xs:string|element(value)*}}
|-
| valign='top' | '''Summary'''
|Returns all full-text tokens stored in the index of the database {{Code|$db}}, along with their numbers of occurrences.<br/>If {{Code|$prefix}} is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
|-
| valign='top' | '''Errors'''
|{{Error|db:open|Database Module#Errors}} the addressed database does not exist or could not be opened.<br/>{{Error|db:no-index|Database Module#Errors}} the full-text index is not available.
|-
| valign='top' | '''Examples'''
|Returns the number of occurrences for a single, specific index entry:
<syntaxhighlight lang="xquery">
let $term := ft:tokenize($term)
return number(ft:tokens('db', $term)[. = $term]/@count)
</syntaxhighlight>
|}

==ft:tokenize==

{| width='100%'
|-
| width='120' | '''Signatures'''
|{{Func|ft:tokenize|$string as xs:string?|xs:string*}}<br/>{{Func|ft:tokenize|$string as xs:string?, $options as map(*)?|xs:string*}}
|-
| '''Summary'''
|Tokenizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument, and returns a sequence with the tokenized string. The following options are available:
* {{Code|case}}: determines how character case is processed. Allowed values are {{Code|insensitive}}, {{Code|sensitive}}, {{Code|upper}} and {{Code|lower}}. By default, search is case insensitive.
* {{Code|diacritics}}: determines how diacritical characters are processed. Allowed values are {{Code|insensitive}} and {{Code|sensitive}}. By default, search is diacritics insensitive.
* {{Code|stemming}}: determines if tokens are stemmed. Allowed values are {{Code|true}} and {{Code|false}}. By default, stemming is turned off.
* {{Code|language}}: determines the language. This option is relevant for stemming tokens. All language codes are supported. The default language is {{Code|en}}.
|-
| '''Examples'''
|
* <code>ft:tokenize("No Doubt")</code> returns the two strings {{Code|no}} and {{Code|doubt}}.
* <code>ft:tokenize("École", map { 'diacritics': 'sensitive' })</code> returns the string {{Code|école}}.
* <code>declare ft-option using stemming; ft:tokenize("GIFTS")</code> returns a single string {{Code|gift}}.
|}
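
The effect of the options can be checked by tokenizing the same string with and without them. This is a minimal sketch; the input string and the variable names are illustrative assumptions, and the option keys are the ones listed in the summary above:

<syntaxhighlight lang="xquery">
(: Assumption: sample input, chosen only for illustration :)
let $string  := "Gifts and Presents"
let $options := map { 'stemming': true(), 'language': 'en' }
return (
  (: default full-text options :)
  ft:tokenize($string),
  (: stemming enabled for English :)
  ft:tokenize($string, $options)
)
</syntaxhighlight>

The result is a single sequence that holds the tokens of the first call followed by the tokens of the second call.
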
==ft:normalize==

{| width='100%'
|-
| valign='top' width='120' | '''Signatures'''
|{{Func|ft:normalize|$string as xs:string?|xs:string}}<br/>{{Func|ft:normalize|$string as xs:string?, $options as map(*)?|xs:string}}
|-
| valign='top' | '''Summary'''
|Normalizes the given {{Code|$string}}, using the current default full-text options or the {{Code|$options}} specified as second argument. The function expects the same arguments as [[#ft:tokenize|ft:tokenize]].
|-
| valign='top' | '''Examples'''
|
* <code>ft:normalize("Häuser am Meer", map { 'case': 'sensitive' })</code> returns the string {{Code|Hauser am Meer}}.
|}

=Errors=

{| class="wikitable" width="100%"
! width="110"|Code
|Description
|-
|{{Code|options}}
|Both wildcards and fuzzy search have been specified as search options.
|}
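
Since the errors are assigned to the module namespace, they can be caught by name with the {{Code|ft}} prefix. The following sketch assumes that {{Code|ft:contains}} raises the {{Code|options}} error when both conflicting options are supplied; the input and the search term are illustrative only:

<syntaxhighlight lang="xquery">
try {
  (: Assumption: fuzzy and wildcards together are rejected as conflicting :)
  ft:contains("databases", "data.*", map { 'fuzzy': true(), 'wildcards': true() })
} catch ft:options {
  (: $err:description holds the message of the caught error :)
  'Conflicting options: ' || $err:description
}
</syntaxhighlight>
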
=Changelog=

;Version 9.1
* Updated: [[#ft:tokenize|ft:tokenize]] and [[#ft:normalize|ft:normalize]] can be called with empty sequence.

;Version 9.0
* Updated: error codes updated; errors now use the module namespace

;Version 8.0
* Added: [[#ft:contains|ft:contains]], [[#ft:normalize|ft:normalize]]
* Updated: Options added to [[#ft:tokenize|ft:tokenize]]

;Version 7.8
* Added: [[#ft:contains|ft:contains]]
* Updated: Options added to [[#ft:search|ft:search]]

;Version 7.7
* Updated: the functions no longer accept [[Database Module#Database Nodes|Database Nodes]] as reference. Instead, the name of a database must now be specified.

;Version 7.2
* Updated: [[#ft:search|ft:search]] (second argument generalized, third parameter added)

;Version 7.1
* Added: [[#ft:tokens|ft:tokens]], [[#ft:tokenize|ft:tokenize]]