Full-Text Functions

This module extends the Full-Text features of BaseX: The index can be directly accessed, full-text results can be marked with additional elements, or the relevant parts can be extracted. Moreover, the score value, which is generated by the contains text expression, can be explicitly requested from items.

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/ft namespace, which is statically bound to the ft prefix.

Database Functions

`ft:search`

Signature

ft:search(
  $db       as xs:string,
  $terms    as item()*,
  $options  as map(*)?    := {}
) as text()*

Summary

Returns all text nodes from the full-text index of the database $db that contain the specified $terms. The options used for tokenizing the input and building the full-text index will also be applied to the search terms. As an example, if the index terms have been stemmed, the search string will be stemmed as well.

The $options argument can be used to control full-text processing. The following options are available (the introduction on Full-Text processing presents equivalent expressions in the XQuery Full-Text notation):

option	default	description
`mode`	`any`	Determine how tokens are searched. Allowed values are `any`, `any word`, `all`, `all words`, and `phrase`.
`wildcards`	`false()`	Turn wildcard querying on or off.
`fuzzy`	`false()`	Turn fuzzy querying on or off.
`errors`	`0`	Control the maximum number of tolerated errors for fuzzy querying (see Fuzzy Querying for more details).
`ordered`	`false()`	Indicate if all tokens must occur in the order in which they are specified.
`content`	`–`	Specify that the matched tokens need to occur at the beginning or end of a searched string, or need to cover the entire string. Allowed values are `start`, `end`, and `entire`.
`scope`	`–`	Define the scope in which tokens must be located. The following sub options are available: `same`: can be set to `true` or `false`. It specifies if tokens need to occur in the same or different units. `unit`: can be `sentence` or `paragraph`. It specifies the unit for finding tokens.
`window`	`–`	Set up a window in which all tokens must be located. The following sub options are available: `size`: specify the size of the window in terms of units. `unit`: can be `words`, `sentences` or `paragraphs`. The default is `words`.
`distance`	`–`	Specify the distance in which tokens must occur. The following sub options are available: `min`: specify the minimum distance in terms of units. The default is `0`. `max`: specify the maximum distance in terms of units. The default is `∞`. `unit`: can be `words`, `sentences` or `paragraphs`. The default is `words`.

Errors

`options`	Both wildcards and fuzzy search have been specified as search options.

Examples

ft:search("DB", "QUERY")

Return all text nodes of the database DB that contain the term QUERY.

ft:search("DB", ("2010", "2020"), { 'mode': 'all' })

Return all text nodes of the database DB that contain the numbers 2010 and 2020.

ft:search("db", ("A", "B"), {
  "mode": "all words",
  "distance": { "max": "5", "unit": "words" }
})

Return text nodes that contain the terms A and B within a distance of at most 5 words.

let $terms := "Hello Worlds"
let $fuzzy := true()
for $db in 1 to 3
let $dbname := 'DB' || $db
return ft:search($dbname, $terms, { 'fuzzy': $fuzzy })/..

Iterate over three databases and return all elements containing terms similar to Hello World in the text nodes.
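
As a further sketch, the `wildcards` option from the table above can be combined with a parent step, so that the enclosing elements are returned instead of the text nodes. The database name `DB` and the search term are placeholders, and the database is assumed to have a full-text index. The pattern `.*` (any number of characters) follows the wildcard syntax of the Full-Text documentation and is not defined on this page.

```xquery
(: Assumption: a database named 'DB' exists and has a full-text index. :)
for $text in ft:search('DB', 'data.*', { 'wildcards': true() })
(: step up from each matching text node to its parent element :)
return $text/..
```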

`ft:tokens`

Signature	ft:tokens( $db as xs:string, $prefix as xs:string := () ) as element(value)*
Summary	Returns all full-text tokens stored in the index of the database `$db`, along with their numbers of occurrences. If `$prefix` is specified, the returned nodes will be refined to the strings starting with that prefix. The prefix will be tokenized according to the full-text options used for creating the index.
Examples	`let $term := ft:tokenize($term) return number(ft:tokens('db', $term)[. = $term]/@count)` Returns the number of occurrences for a single, specific index entry.
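
The following sketch lists the index entries that start with a given prefix, most frequent first. It relies only on what the example above shows: each returned element has the token as its string value and a `count` attribute. The database name `DB` and the prefix `hel` are placeholders.

```xquery
(: Assumption: a database named 'DB' exists and has a full-text index. :)
for $entry in ft:tokens('DB', 'hel')
let $count := xs:integer($entry/@count)
order by $count descending
(: one line per token, e.g. "<token>: <number of occurrences>" :)
return string($entry) || ': ' || $count
```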

General Functions

`ft:contains`

Signature

ft:contains(
  $input    as item()*,
  $terms    as item()*,
  $options  as map(*)?  := {}
) as xs:boolean

Summary

Checks if the specified $input items contain the specified $terms. The function does the same as the Full-Text expression contains text, but options can be specified more dynamically. The $options are the same as for ft:search, plus the following ones:

option	default	description
`case`	`insensitive`	Determine how upper/lower case is processed. Allowed values are `insensitive`, `sensitive`, `upper` and `lower`.
`diacritics`	`insensitive`	Determine how diacritical characters are processed. Allowed values are `insensitive` and `sensitive`.
`stemming`	`false()`	Determine how tokens are stemmed.
`language`	`en`	Determine the input language. This option is relevant for stemming tokens. Arbitrary language codes are accepted.

Errors

`options`	Both wildcards and fuzzy search have been specified as search options.
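
Since all errors of this module are assigned to the `ft` namespace, the error above can presumably be caught by name. The following is a minimal sketch; the input strings and the returned message are illustrative only.

```xquery
try {
  (: wildcards and fuzzy are both enabled, so this call is expected to fail :)
  ft:contains('hello world', 'hel.*', { 'wildcards': true(), 'fuzzy': true() })
} catch ft:options {
  (: illustrative fallback value :)
  'wildcards and fuzzy cannot be combined'
}
```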

Examples

ft:contains("John Doe", ("jack", "john"), { "mode": "any" })

Checks if jack or john occurs in the input string John Doe.

for $s in (true(), false())
return ft:contains("Häuser", "Haus", { 'stemming': $s, 'language':'de' })

Calls the function with stemming turned on and off.
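
The options shared with `ft:search` can be tried out on plain strings, as `ft:contains` does not require a database. The sketch below runs three independent checks on the same input; the input sentence is made up for illustration, and the `distance` values are written as strings to mirror the `ft:search` example.

```xquery
let $input := 'The quick brown fox jumps over the lazy dog'
return (
  (: both terms must occur, in the given order :)
  ft:contains($input, ('quick', 'fox'), { 'mode': 'all', 'ordered': true() }),
  (: both terms must occur at most 5 words apart :)
  ft:contains($input, ('fox', 'dog'), {
    'mode': 'all',
    'distance': { 'max': '5', 'unit': 'words' }
  }),
  (: upper/lower case must match exactly :)
  ft:contains($input, 'QUICK', { 'case': 'sensitive' })
)
```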

`ft:count`

Signature	ft:count( $nodes as node()* ) as xs:integer
Summary	Returns the number of occurrences of the search terms specified in a full-text expression.
Examples	`ft:count(//*[text() contains text 'QUERY'])` Returns the `xs:integer` value `2` if a document contains two occurrences of the string `QUERY`.

`ft:score`

Signature	ft:score( $item as item()* ) as xs:double*
Summary	Returns the score values (0.0 - 1.0) that have been attached to the specified items. `0` is returned as value if no score was attached.
Examples	`ft:score('a' contains text 'a')` Returns the `xs:double` value `1`.

`ft:tokenize`

Signature

ft:tokenize(
  $string   as xs:string?,
  $options  as map(*)?     := {}
) as xs:string*

Summary

Tokenizes the given $string, using the current default full-text options or the $options specified as second argument, and returns a sequence with the tokenized string. The following options are available:

option	default	description
`case`	`insensitive`	Determine how upper/lower case is processed. Allowed values are `insensitive`, `sensitive`, `upper` and `lower`.
`diacritics`	`insensitive`	Determine how diacritical characters are processed. Allowed values are `insensitive` and `sensitive`.
`stemming`	`false()`	Determine how tokens are stemmed.
`language`	`en`	Determine the input language. This option is relevant for stemming tokens. Arbitrary language codes are accepted.

Examples

ft:tokenize("No Doubt")

Returns the two strings no and doubt.

ft:tokenize("École", { 'diacritics': 'sensitive' })

Returns the string école.

declare ft-option using stemming; ft:tokenize("GIFTS")

Returns a single string gift.
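
Because the function returns a plain sequence of strings, its result can be processed with standard FLWOR clauses. The following sketch counts how often each token occurs in a string; the sample text and variable names are illustrative.

```xquery
let $text := 'The quick brown fox jumps over the lazy dog. The dog sleeps.'
for $token in ft:tokenize($text)
(: after grouping, $token holds all occurrences of the same token :)
group by $key := $token
let $count := count($token)
order by $count descending, $key
return $key || ': ' || $count
```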

`ft:normalize`

Signature	ft:normalize( $string as xs:string?, $options as map(*)? := {} ) as xs:string
Summary	Normalizes the given `$string`, using the current default full-text options or the `$options` specified as second argument. The function accepts the same arguments as `ft:tokenize`; special characters and separators will be preserved.
Examples	`ft:normalize("Häuser am Meer", { 'case': 'sensitive' })` Returns the string `Hauser am Meer`.
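
One possible use is a comparison that ignores case and diacritics. The sketch below normalizes both operands with the default options listed for `ft:tokenize` before comparing them; the two strings are illustrative.

```xquery
let $a := 'Häuser am Meer'
let $b := 'HAUSER AM MEER'
(: compare the normalized forms instead of the raw strings :)
return ft:normalize($a) = ft:normalize($b)
```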

`ft:thesaurus`

Signature

ft:thesaurus(
  $node     as node(),
  $term     as xs:string,
  $options  as map(*)?    := {}
) as xs:string*

Summary

Looks up a $term in a Thesaurus Structure supplied by $node. The following $options are available:

option	default	description
`relationship`	`–`	The relationship between terms.
`levels`	`–`	The maximum number of levels to traverse.

Examples

ft:thesaurus(
  <thesaurus>
    <entry>
      <term>happy</term>
      <synonym>
        <term>lucky</term>
        <relationship>RT</relationship>
      </synonym>
    </entry>
  </thesaurus>,
  'happy'
)

Result: 'lucky', 'happy'
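
The returned strings can be passed on as search terms. The following sketch reuses the thesaurus from the example above and checks whether an input string contains any of the related terms; the input string is illustrative.

```xquery
let $thesaurus :=
  <thesaurus>
    <entry>
      <term>happy</term>
      <synonym>
        <term>lucky</term>
        <relationship>RT</relationship>
      </synonym>
    </entry>
  </thesaurus>
(: look up the related terms, then accept a match on any of them :)
let $terms := ft:thesaurus($thesaurus, 'happy')
return ft:contains('What a lucky day', $terms, { 'mode': 'any' })
```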

Highlighting Functions

`ft:mark`

Signature

ft:mark(
  $nodes  as node()*,
  $name   as xs:string  := ()
) as node()*

Summary

Puts a marker element around the resulting $nodes of a full-text request. The default name of the marker element is mark. An alternative name can be chosen via the optional $name argument. Please note that:

The full-text expression that computes the token positions must be specified as argument of the ft:mark() function, as all position information is lost in subsequent processing steps. You may need to specify more than one full-text expression if you want to use the function in a FLWOR expression, as shown in Example 2.
The supplied node must be a Database Node. As shown in Example 3, update or transform can be utilized to convert a fragment to the required internal representation.

Examples

ft:mark(db:get('DB')//*[text() contains text 'hello'])

Returns <XML><mark>hello</mark> world</XML>, if one text node of the database DB has the value "hello world".

let $start := 1
let $end   := 10
let $term  := 'welcome'
let $test  := fn($node) { $node/text() contains text { $term } }
for $ft in (db:get('DB')//*[$test(.)])[position() = $start to $end]
return ft:mark($ft[$test(.)])

Iterates over the first ten full-text results and marks the results in a second expression.

copy $copy := <xml>hello world</xml>
modify ()
return ft:mark($copy[text() contains text 'world'], 'b')

Result: <xml>hello <b>world</b></xml>

`ft:extract`

Signature	ft:extract( $nodes as node(), $name as xs:string := (), $length as xs:integer := () ) as node()
Summary	Extracts and returns relevant parts of full-text results. It puts a marker element around the resulting `$nodes` of a full-text index request and chops irrelevant sections of the result. The default element name of the marker element is `mark`. An alternative element name can be chosen via the optional `$name` argument. The default length of the returned text is `150` characters. An alternative length can be specified via the optional `$length` argument. Note that the effective text length may differ from the specified length due to formatting and readability issues. For more details on this function, please have a look at `ft:mark`.
Examples	`ft:extract(db:get('DB')//*[text() contains text 'hello'], 'b', 1)` Returns `<XML>...<b>hello</b>...</XML>` if a text node of the database `DB` contains the string `hello world`.
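
Assuming the notes given for `ft:mark` apply here as well, a fragment can be converted with `copy`/`modify` before it is passed to `ft:extract`. The sketch below follows the third `ft:mark` example; the sample text and the length of `20` are illustrative, and the exact cut-off points of the result are not specified here.

```xquery
copy $copy := <xml>This is a longer sample text in which the string hello world is embedded somewhere.</xml>
modify ()
(: the full-text expression is specified directly in the function argument :)
return ft:extract($copy[text() contains text 'world'], 'b', 20)
```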