Changes

2,726 bytes removed ,  06:43, 29 November 2019
This article is part of the [[XQuery|XQuery Portal]]. Full-text retrieval is an essential query feature for working with XML documents. This article summarizes the features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and language-specific custom features of the implementation in BaseX.
Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.
=Introduction=
The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.
This is a simple example for a basic full-text expression:
</pre>
Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
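The range modifiers can be tried out on a single string. The following is a minimal sketch, assuming the modifiers are combined with the <code>occurs ... times</code> clause of the Recommendation; the sample string is made up for illustration:
<pre class="brush:xquery">
(: "be" occurs exactly twice in the sample string :)
let $text := "to be or not to be"
return (
  $text contains text "be" occurs exactly 2 times,     (: true :)
  $text contains text "be" occurs at least 3 times,    (: false :)
  $text contains text "be" occurs at most 2 times,     (: true :)
  $text contains text "be" occurs from 1 to 2 times    (: true :)
)
</pre>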
==Combining Results==
* {{Code|phrase}}: all strings need to be found as a single phrase
The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does (but it takes [[#FTAnd|more memory]]):
<pre class="brush:xquery">
for $text in ("New York", "new conditions")
return $text contains text "New" not in "New York"
</pre>
 
Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to high memory consumption. If you encounter problems, simply use the {{Code|all}} keyword:
 
<pre class="brush:xquery">
doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name
</pre>
<pre class="brush:xquery">
'Mary told me, “I will survive!” This is what Mary told me.' contains text { 'will', 'told' } all words same sentence
</pre>
Sentences are delimited by end of line markers ({{Code|.}}, {{Code|!}}, {{Code|?}}, etc.), and newline characters are treated as paragraph delimiters. In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.
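The effect of the sentence delimiters can be checked with the {{Code|same}} and {{Code|different}} scope filters. This is a small sketch with a made-up sample string; it assumes the default tokenizer for Western texts:
<pre class="brush:xquery">
(: "told" is in the first sentence, "will" in the second one :)
let $text := "Mary told me. I will survive!"
return (
  $text contains text { "told", "will" } all words same sentence,       (: false :)
  $text contains text { "told", "will" } all words different sentence   (: true :)
)
</pre>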
Last but not least, three specifiers exist to filter results depending on the position of a hit:
* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:
<pre class="brush:xquery">
"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive
</pre>
* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:
<pre class="brush:xquery">
"catch" contains text "catches" using stemming,
"Haus" contains text "Häuser" using stemming using language 'de'
</pre>
* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):
<pre class="brush:xquery">
"You and me" contains text "you or me" using stop words ("and", "or"),
=BaseX Features=
This section lists BaseX-specific full-text features and options.
==Options==
The available full-text index can handle various combinations of the match options defined in the XQuery Full Text Recommendation. By default, most options are disabled. The GUI dialogs for creating new databases or displaying the database properties contain a tab for choosing between all available options. On the command line, full-text indexing can be activated before a database is created, or a full-text index can be created for an existing database:
* <code>SET FTINDEX true; CREATE DB input.xml</code>
* <code>CREATE INDEX fulltext</code>
The following indexing options are available:
* '''Language''': [[#Languages|see below]] for more details (<code>SET LANGUAGE EN</code>).
* '''Stemming''': tokens are stemmed with the Porter Stemmer before being indexed (<code>SET STEMMING true</code>).
* '''Case Sensitive''': tokens are indexed in case-sensitive mode (<code>SET CASESENS true</code>).
* '''Diacritics''': diacritics are indexed as well (<code>SET DIACRITICS true</code>).
* '''Stopword List''': a stop word list can be defined to reduce the number of indexed tokens (<code>SET STOPWORDS [filename]</code>).
==Languages==
The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:
<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>
By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:
* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&amp;#xa;</code>).
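To see how a string is split into tokens, the <code>ft:tokenize</code> function of the [[Full-Text Module]] can be used. The following sketch assumes that this function is available in your BaseX version and that no special language is set; the input string is made up for illustration:
<pre class="brush:xquery">
(: returns one string per token; whitespace and punctuation act as delimiters :)
ft:tokenize("BaseX is fast. Is it? Yes!")
</pre>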
The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Greek, Hindi, Hungarian, Indonesian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are included in the ZIP and EXE distributions of BaseX.
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
<pre class="brush:xquery">
"Indexing" contains text "index" using stemming,
"häuser" contains text "haus" using stemming using language "German"
</pre>
</pre>
Scores will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates. In the following query, all returned scores are equal:
<pre class="brush:xquery">
let $text := "A B C"
let score $s1 := $text contains text "A" ftand "B C"
let score $s2 := $text contains text "A" ftand "B C"
let score $s3 := $text contains text "A" and $text contains text "B C"
let score $s4 := $text contains text "A" or $text contains text "B C"
let score $s5 := $text[. contains text "A"][. contains text "B C"]
return ($s1, $s2, $s3, $s4, $s5)
</pre>
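Scores are typically used to rank hits. The following sketch binds a score with {{Code|let score}} and sorts a few strings by it. The sample strings and the element name <code>hit</code> are illustrative assumptions, and the actual score values depend on the internal scoring model:
<pre class="brush:xquery">
(: rank sample strings by how well they match the search term :)
for $text in ("A", "A B", "A B C")
let score $score := $text contains text "A"
order by $score descending
return <hit score="{ $score }">{ $text }</hit>
</pre>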
==Thesaurus==
BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why queries such as:
<pre class="brush:xquery">
==Fuzzy Querying==
In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>using fuzzy</code> match option to allow for approximate results in full texts:
'''Document 'doc.xml'''':
</doc>
</pre>
'''Command:''' <code>CREATE DB doc.xml; CREATE INDEX fulltext</code>
'''Query:'''
</pre>
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.
Fuzzy search is also supported by the full-text index.
=Performance=
==Index Processing==
BaseX offers different evaluation strategies for XQFT queries, the choice of which depends on the input data and the existence of a full-text index. The query compiler tries to optimize and speed up queries by applying a full-text index structure whenever possible and useful. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see [http://www.inf.uni-konstanz.de/gk/pubsys/publishedFiles/GrGaHo09.pdf XQuery Full Text implementation in BaseX]).
Query optimization and selection of the most efficient evaluation strategy are done fully automatically. The output of the query optimizer indicates which evaluation plan is chosen for a specific query. It can be inspected by activating verbose querying (Command: <code>SET VERBOSE ON</code>) or opening the Query Info in the GUI. The message <code>Applying full-text index</code> suggests that the full-text index is applied to speed up query evaluation. A second message <code>Removing path with no index results</code> indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.
==FTAnd==
The internal XQuery Full Text data model is pretty complex and may consume more main memory than one would initially guess. If you plan to combine search terms via {{Code|ftand}}, we recommend resorting to an alternative, memory-saving representation:
<pre class="brush:xquery">
(: representation via "ftand" :)
"A B" contains text "A" ftand "B" ftor "C" ftand "D"

(: memory saving representation :)
"A B" contains text { "A", "B" } all ftor { "C", "D" } all
</pre>
=Mixed Content=
Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
To enable this kind of search, it is recommended to:
* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code> (default: <code>SET CHOP ON</code>). This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.
A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
Note that the node structure is completely ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:
<pre class="brush:xquery">
</pre>
Note that BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [http://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. This means that it is not possible to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, within a query. Here is an example document:
<pre class="brush:xml">
<p>This text is provided for illustrative<note>Serving as an example or explanation.</note> purposes only.</p>
</pre>
The ignore option would enable a search for the string “illustrative purposes”:
<pre class="brush:xquery">
/p[. contains text 'illustrative purposes' without content note]
</pre>
For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Ignore XQuery and XPath Full Text 1.0 Use Cases].
As BaseX does not support the ignore option, it raises error [[XQuery_Errors#Full-Text_Errors|FTST0007]] when it encounters <code>without content</code> in a full-text <code>contains</code> expression.
If you want to ignore such content, you can build a second database and exclude all information you do not want to search for. See the following example (visit [[XQuery Update]] to learn more about updates):
<pre class="brush:xquery">
let $docs := db:open('docs')
return db:create(
  'index-db',
  $docs update delete node (.//footnote),
  $docs/db:path(.),
  map { 'ftindex': true() }
)
</pre>
=Functions=
=Collations=
See [[XQuery 3.1#Collations|XQuery 3.1]] for standard collation features. By default, string comparisons in XQuery are based on the Unicode codepoint order. The default namespace URI {{Code|http://www.w3.org/2005/xpath-functions/collation/codepoint}} specifies this ordering. In BaseX, the following URI syntax is supported to specify collations:
<nowiki>http://basex.org/collation?lang=...;strength=...;decomposition=...</nowiki>
|-
| {{Code|strength}}
| Level of difference considered significant in comparisons. Four strengths are supported: {{Code|primary}}, {{Code|secondary}}, {{Code|tertiary}}, and {{Code|identical}}. As an example, in German:
* "Ä" and "A" are considered primary differences,
* "Ä" and "ä" are secondary differences,
* "Ä" and "A&amp;#x308;" (see http://www.fileformat.info/info/unicode/char/308/index.htm) are tertiary differences, and
* "A" and "A" are identical.
|-
| {{Code|decomposition}}
</nowiki></pre>
==Case-Insensitive Collation==
{{Mark|Introduced with BaseX 8.0}}
XQuery 3.1 provides another default collation, which allows for a case-insensitive comparison of ASCII characters (<code>A-Z</code> = <code>a-z</code>). This query returns <code>true</code>:
<pre class="brush:xquery">
declare default collation 'http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive';
'HTML' = 'html'
</pre>
=Changelog=
; Version 9.2:
* Added: Arabic stemmer.
; Version 8.0:
* Added: [[#Case-Insensitive Collation|Case-Insensitive Collation]].
* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.
* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.
* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.
 
[[Category:XQuery]]