String Functions

This module contains functions for string operations and computations.

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/string namespace, which is statically bound to the string prefix.

Computations

`string:levenshtein`

Signature

string:levenshtein(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double

Summary

Computes the Damerau-Levenshtein Distance for $value1 and $value2 and returns a normalized double value (0 – 1). The returned value is computed as follows:

1 – distance / max(lengths of strings)

1 is returned if the strings are equal; 0 is returned if the strings are too different.

Errors

bounds The specified string exceeds the maximum supported length.

Examples

string:levenshtein("flower", "flowers")

Result: 0.8571428571428571e0

let $norm := ft:normalize(?, { 'stemming': true() })
return string:levenshtein($norm("HOUSES"), $norm("house"))

1 is returned after the input has been normalized (words are stemmed, converted to lower case, and diacritics are removed).
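
A typical use of the normalized value is threshold-based fuzzy matching. The following sketch keeps only those candidates whose similarity to a query string reaches a threshold; the word list and the threshold of 0.8 are illustrative assumptions, not recommended defaults:

```xquery
(: illustrative values: the candidate words and the threshold are assumptions :)
let $query := "flower"
let $threshold := 0.8
for $word in ("flowers", "flour", "tower", "car")
let $similarity := string:levenshtein($query, $word)
where $similarity >= $threshold
order by $similarity descending
return $word
```

With the formula given in the summary, only flowers passes the threshold (1 – 1/7 ≈ 0.857); flour and tower each require two edits (1 – 2/6 ≈ 0.667).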

`string:jaro-winkler`

Signature

string:jaro-winkler(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double

Summary

Computes the Jaro-Winkler Distance for $value1 and $value2 and returns a double value (0 – 1). 1 is returned if the strings are equal; 0 is returned if the strings are too different.

Examples

string:jaro-winkler("flower", "flowers")

Result: 0.98e0

let $norm := ft:normalize(?, { 'stemming': true() })
return string:jaro-winkler($norm("HOUSES"), $norm("house"))

1 is returned after the input has been normalized (words are stemmed, converted to lower case, and diacritics are removed).

`string:token-sort-ratio`

Added: New function.

Signature

string:token-sort-ratio(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double

Summary

Tokenizes $value1 and $value2 on whitespace, sorts the tokens, and returns the normalized string:levenshtein similarity (0 – 1) of the rejoined strings. As the tokens are sorted before the comparison, the measure is insensitive to their order. 1 is returned if the strings are equal after sorting; 0 is returned if they are too different.

Errors

bounds The specified string exceeds the maximum supported length.

Examples

string:token-sort-ratio("Ishida Yūtei", "Yūtei Ishida")

Result: 1e0. 1 is returned, as both strings consist of the same tokens in a different order.
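
The steps named in the summary (tokenize, sort, rejoin, compare) can be approximated with standard XQuery functions. This is a minimal sketch, not the built-in implementation: local:token-sort-ratio is a hypothetical helper, and joining the tokens with a single space is an assumption.

```xquery
(: hypothetical helper that mirrors the documented steps :)
declare function local:token-sort-ratio(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double {
  (: split on whitespace, sort the tokens, rejoin them with a single space :)
  let $sorted := function($value) {
    string-join(sort(tokenize($value)), ' ')
  }
  return string:levenshtein($sorted($value1), $sorted($value2))
};

local:token-sort-ratio("Ishida Yūtei", "Yūtei Ishida")
```

Both arguments are reduced to the same sorted string here, so the comparison is performed on two equal strings.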

`string:token-set-ratio`

Added: New function.

Signature

string:token-set-ratio(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double

Summary

Tokenizes $value1 and $value2 on whitespace and compares them as sets: the shared tokens and the tokens unique to either string are reassembled, compared with the normalized string:levenshtein similarity, and the best of the resulting values (0 – 1) is returned. A high value is returned if the tokens of one string are a subset of the other; this makes the measure robust against extra or omitted words.

Errors

bounds The specified string exceeds the maximum supported length.

Examples

string:token-set-ratio("Fortuny y Marsal", "Fortuny Marsal")

Result: 1e0. 1 is returned, as the tokens of the second string are a subset of the first.
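
One plausible reading of the summary is sketched below. The summary does not specify how the token groups are reassembled, so the three comparisons chosen here are an assumption, local:token-set-ratio is a hypothetical helper, and the built-in function may differ in detail (for example, for inputs without shared tokens).

```xquery
(: hypothetical helper; the reassembly strategy is an assumption :)
declare function local:token-set-ratio(
  $value1  as xs:string,
  $value2  as xs:string
) as xs:double {
  let $tokens1 := distinct-values(tokenize($value1))
  let $tokens2 := distinct-values(tokenize($value2))
  let $join := function($tokens) { string-join(sort($tokens), ' ') }
  (: tokens that occur in both strings :)
  let $shared := $join($tokens1[. = $tokens2])
  (: shared tokens followed by the tokens unique to each string :)
  let $combined1 := normalize-space($shared || ' ' || $join($tokens1[not(. = $tokens2)]))
  let $combined2 := normalize-space($shared || ' ' || $join($tokens2[not(. = $tokens1)]))
  return max((
    string:levenshtein($shared, $combined1),
    string:levenshtein($shared, $combined2),
    string:levenshtein($combined1, $combined2)
  ))
};

local:token-set-ratio("Fortuny y Marsal", "Fortuny Marsal")
```

In this call, the second string contributes no unique tokens, so the shared tokens are compared with an identical string, which accounts for the maximum value.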

`string:ngram-similarity`

Added: New function.

Signature

string:ngram-similarity(
  $value1  as xs:string,
  $value2  as xs:string,
  $n       as xs:integer  := 2
) as xs:double

Summary

Computes the similarity of $value1 and $value2 via the Sørensen-Dice coefficient on the sets of their character n-grams of length $n, and returns a double value (0 – 1). 1 is returned if the n-gram sets are equal; 0 is returned if they are disjoint. The measure is robust against transliteration and OCR drift, where string:levenshtein thresholds tend to be brittle.

Errors

ngram The specified n-gram length is not positive.

Examples

string:ngram-similarity("night", "nacht")

Result: 0.25e0

string:ngram-similarity("Massys", "Metsys")

Result: 0.4e0. The two name variants share 2 of 5 bigrams each.
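
The Sørensen-Dice coefficient is twice the number of shared n-grams, divided by the sum of the sizes of both n-gram sets. The sketch below computes it on top of string:ngrams; local:dice is a hypothetical helper, and the handling of empty input (no n-grams on either side) is left out.

```xquery
(: hypothetical helper: Sørensen-Dice coefficient on distinct n-grams :)
declare function local:dice(
  $value1  as xs:string,
  $value2  as xs:string,
  $n       as xs:integer
) as xs:double {
  let $grams1 := distinct-values(string:ngrams($value1, $n))
  let $grams2 := distinct-values(string:ngrams($value2, $n))
  (: n-grams of the first set that also occur in the second set :)
  let $shared := count($grams1[. = $grams2])
  return 2e0 * $shared div (count($grams1) + count($grams2))
};

local:dice("Massys", "Metsys", 2)
```

For the two name variants, the sets contain 5 bigrams each and share sy and ys, which gives 2 · 2 / (5 + 5) = 0.4.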

`string:ngrams`

Added: New function.

Signature

string:ngrams(
  $value  as xs:string,
  $n      as xs:integer  := 2
) as xs:string*

Summary

Returns the character n-grams of length $n of $value, in string order and including duplicates. A non-empty string that is shorter than $n yields a single n-gram with the whole string; an empty string yields no n-grams. Whitespace is treated like any other character, and the input is not normalized. The distinct n-grams are the building block of string:ngram-similarity, which equals the Sørensen-Dice coefficient over distinct-values(string:ngrams(...)) of both arguments.

Errors

ngram The specified n-gram length is not positive.

Examples

string:ngrams("flower")

Result: 'fl', 'lo', 'ow', 'we', 'er'

string:ngrams("flower", 3)

Result: 'flo', 'low', 'owe', 'wer'
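
The rules from the summary can be expressed with standard XQuery functions only. The following is a minimal sketch: local:ngrams is a hypothetical helper, and it does not raise the ngram error for non-positive lengths.

```xquery
(: hypothetical helper; no validation of $n :)
declare function local:ngrams(
  $value  as xs:string,
  $n      as xs:integer
) as xs:string* {
  let $length := string-length($value)
  return
    (: an empty string yields no n-grams :)
    if ($length = 0) then ()
    (: a string shorter than $n yields itself as single n-gram :)
    else if ($length < $n) then $value
    (: otherwise, slide a window of $n characters over the string :)
    else
      for $pos in 1 to $length - $n + 1
      return substring($value, $pos, $n)
};

local:ngrams("flower", 2)
```

The call returns the same five bigrams as the first example above: 'fl', 'lo', 'ow', 'we', 'er'.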

`string:soundex`

Signature

string:soundex(
  $value  as xs:string
) as xs:string

Summary

Computes the Soundex value for the specified string $value. The algorithm can be used to find and index English words with similar pronunciation.

Examples

string:soundex("Michael")

Result: 'M240'

string:soundex("OBrien") = string:soundex("O'Brien")

Result: true()
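
Soundex values lend themselves to grouping, for example when building a simple phonetic index. The name list in the following sketch is an illustrative assumption:

```xquery
(: illustrative input: group names by their Soundex value :)
for $name in ("Robert", "Rupert", "Rubin", "Michael")
group by $code := string:soundex($name)
return $code || ': ' || string-join($name, ', ')
```

Names with the same code end up in the same group; the order of the groups is implementation-dependent unless an order by clause is added.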

`string:cologne-phonetic`

Signature

string:cologne-phonetic(
  $value  as xs:string
) as xs:string

Summary

Computes the Kölner Phonetik value for the specified string $value. Similar to Soundex, the algorithm is used to find similarly pronounced words, but for the German language. As the first returned digit can be 0, the result is returned as string.

Examples

string:cologne-phonetic("Michael")

Result: '645'

every $s in ("Mayr", "Maier", "Meier")
satisfies string:cologne-phonetic($s) = "67"

Result: true()
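
The value can also be used for lookups, by comparing the code of a query string with the codes of the candidates. The names in this sketch are illustrative assumptions:

```xquery
(: illustrative input: find names that sound like the query name :)
let $names := ("Mayr", "Maier", "Meier", "Müller")
let $code := string:cologne-phonetic("Meyer")
return $names[string:cologne-phonetic(.) = $code]
```

For larger name lists, it is usually preferable to compute the codes once and store them along with the names instead of recomputing them for each query.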

Formatting

`string:format`

Signature

string:format(
  $pattern    as xs:string,
  $values...  as item()
) as xs:string

Summary

Returns a formatted string. The remaining $values are incorporated into the $pattern, according to Java’s printf syntax.

Errors

format The specified format is invalid.

Examples

string:format("%b", true())

Result: 'true'

string:format("%06d", 256)

Result: '000256'

string:format("%e", 1234.5678)

Result: '1.234568e+03'
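
As the signature is variadic, several values can be supplied for a single pattern. The following call is a sketch with made-up values; each placeholder consumes the next argument, in order:

```xquery
(: illustrative values: %s takes the string, the two %d take the integers :)
string:format("%s: %d of %d", "done", 3, 10)
```

If a value does not match the conversion of its placeholder, the format error can be expected.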

Errors

Code	Description
`bounds`	The specified string exceeds the maximum supported length.
`format`	The specified format is invalid.
`ngram`	The specified n-gram length is not positive.

Changelog

Version 13.0

Added: string:token-sort-ratio for order-insensitive token similarity.
Added: string:token-set-ratio for set-based token similarity.
Added: string:ngram-similarity for character n-gram similarity.
Added: string:ngrams for extracting character n-grams.

Version 11.0

Added: string:jaro-winkler for computing the Jaro-Winkler Distance.
Removed: string:tab, string:nl and string:cr in favor of fn:char.

Version 10.0

Updated: Renamed from Strings Module to String Module. The namespace URI has been updated as well.
Updated: string:format, string:cr, string:nl and string:tab adopted from the obsolete Output Module.

Version 8.3

Added: New module added. Functions were adopted from the obsolete Utility and Output Modules.
