Changes

Full-Text (edit)

Revision as of 06:43, 29 November 2019

2,726 bytes removed , 06:43, 29 November 2019

no edit summary

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and ~~language-specific~~ custom features of the implementation in BaseX.

~~Full-text retrieval is an essential query feature for working with XML documents, and BaseX was the first query processor that fully supported the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation.~~ Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

=Introduction=

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

</pre>

~~Varius~~ Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
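The range modifiers can be tried out on a plain string. The following is a minimal sketch, assuming the modifiers are combined with the standard <code>occurs ... times</code> clause of XQFT. The input string is made up for illustration, and each of the four expressions should evaluate to <code>true</code>:

<pre class="brush:xquery">
(: "one" occurs exactly twice in the input string :)
"one two one" contains text "one" occurs exactly 2 times,
(: lower bound only :)
"one two one" contains text "one" occurs at least 1 times,
(: upper bound only :)
"one two one" contains text "one" occurs at most 2 times,
(: lower and upper bound :)
"one two one" contains text "one" occurs from 1 to 3 times
</pre>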

==Combining Results==

* {{Code|phrase}}: all strings need to be found as a single phrase

The keywords {{Code|ftand}}, {{Code|ftor}} and {{Code|ftnot}} can also be used to combine multiple query terms. The following query yields the same result as the last one does ~~(but it takes [[#FTAnd|more memory]])~~:

for $text in ("New York", "new conditions")

return $text contains text "New" not in "New York"

</pre>

Due to the complex data model of the XQuery Full Text spec, the usage of {{Code|ftand}} may lead to a high memory consumption. If you should encounter problems, simply use the {{Code|all}} keyword:

doc('factbook')//country[descendant::religions contains text { 'Christian', 'Jewish'} all]/name

</pre>

'Mary told me, “I will survive!” ~~This is what Mary told me~~.' contains text { 'will', 'told' } all words same sentence

</pre>

~~Sentences are delimited by end of line markers ({{Code|.}}, {{Code|!}}, {{Code|?}}, etc.), and newline characters are treated as paragraph delimiters.~~ By the way: ~~in the~~ In some examples above, the {{Code|~~word~~words}} unit ~~has been~~ was used, but {{Code|sentences}} and {{Code|paragraphs}} ~~are~~ would have been valid alternatives.
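The unit keyword follows the number in a distance or window specification. The following sketch uses made-up input and the {{Code|words}} unit. The first two expressions should yield <code>true</code>, the third one <code>false</code>:

<pre class="brush:xquery">
(: one word ("B") lies between "A" and "C" :)
"A B C D" contains text { "A", "C" } all words distance at most 1 words,
(: "A B C" fits into a window of three words :)
"A B C D" contains text { "A", "C" } all words window 3 words,
(: "A" and "C" are not adjacent :)
"A B C D" contains text { "A", "C" } all words distance exactly 0 words
</pre>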

Last but not least, three specifiers exist to filter results depending on the position of a hit:

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive

</pre>

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

"catch" contains text "catches" using stemming,~~"Haus" contains text "Häuser" using stemming using language 'de'~~

</pre>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful ~~when~~ if the ~~size of a~~ full-text index ~~structure needs to be reduced~~ takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):

"You and me" contains text "you or me" using stop words ("and", "or"),

=BaseX Features=

~~This page lists BaseX-specific full-text features and options.~~

==Languages==

~~==Options==~~

The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:

<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>

~~The available full-text index can handle various combinations of the match options defined in the XQuery Full Text Recommendation. By default, most options are disabled. The GUI dialogs for creating new databases or displaying the database properties contain a tab for choosing between all available options. On the command-line, the <code>SET</code> command can be used to activate full-text indexing or creating a full-text index for existing databases:~~

* ~~<code>SET FTINDEX true; CREATE DB input.xml</code>~~
* ~~<code>CREATE INDEX fulltext</code>~~

~~The following indexing options are available:~~

* ~~'''Language''': [[#Languages|see below]] for more details (<code>SET LANGUAGE EN</code>).~~
* ~~'''Stemming''': tokens are stemmed with the Porter Stemmer before being indexed (<code>SET STEMMING true</code>).~~
* ~~'''Case Sensitive''': tokens are indexed in case-sensitive mode (<code>SET CASESENS true</code>).~~
* ~~'''Diacritics''': diacritics are indexed as well (<code>SET DIACRITICS true</code>).~~
* ~~'''Stopword List''': a stop word list can be defined to reduce the number of indexed tokens (<code>SET STOPWORDS [filename]</code>).~~

By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

~~==Languages==~~

* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&#xa;</code>).
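The effect of the sentence delimiters can be observed with the scope keywords of XQFT. The following sketch uses a made-up input string and assumes the default tokenizer for Western texts. As the two query terms are separated by a period, the first expression should yield <code>false</code> and the second one <code>true</code>:

<pre class="brush:xquery">
(: "told" and "will" are located in two different sentences :)
"Mary told me. I will survive!" contains text { "told", "will" } all words same sentence,
"Mary told me. I will survive!" contains text { "told", "will" } all words different sentence
</pre>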

~~The chosen language determines how the input text will be tokenized and stemmed.~~ The basic ~~code base and <code>jar</code>~~ JAR file of BaseX comes with built-in stemming support for English, German, Greek and ~~German~~ Indonesian. ~~More~~ Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers ~~and extends language support to~~ for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French~~, Greek~~, Hindi, Hungarian~~, Indonesian~~, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files ~~can also be found~~ are included in the ~~<code>zip</code>~~ ZIP and ~~<code>exe</code> distribution files~~ EXE distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

"Indexing" contains text "index" using stemming,

"häuser" contains text "haus" using stemming using language "German"

</pre>

~~With {{Version|8.0}}, scores~~ Scores will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates. ~~The~~ In the following ~~queries will~~ query, all ~~yield the same result~~ returned scores are equal:

let $text := "A B C"

let score $s1 := $text[. contains text "A" ftand "B C"]

let score $s2 := $text contains text "A" ftand "B C"

let score $s3 := $text contains text "A" and $text contains text "B C"

let score $s4 := $text contains text "A" or $text contains text "B C"

let score $s5 := $text[. contains text "A"][. contains text "B C"]

return ($s1, $s2, $s3, $s4, $s5)

</pre>

==Thesaurus==

BaseX supports full-text queries using thesauri, but it does not provide a default thesaurus. This is why queries such as:

==Fuzzy Querying==

In addition to the official recommendation, BaseX supports a fuzzy ~~querying~~ search feature. The XQFT grammar was enhanced by the ~~FTMatchOption~~ <code>~~using~~ fuzzy</code> match option to allow for approximate ~~searches~~ results in full texts ~~. By default, the standard [[indexes|full-text index]] already supports the efficient execution of fuzzy searches.~~:

'''Document 'doc.xml'''':

</doc>

</pre>

~~'''Command:''' <code>CREATE DB doc.xml; CREATE INDEX fulltext</code>~~

'''Query:'''

</pre>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the ~~<code>[[Options#LSERROR|LSERROR]]</code> property~~ {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”.

Fuzzy search is also supported by the full-text index.

~~=Performance=~~

~~==Index Processing==~~

~~BaseX offers different evaluation strategies for XQFT queries, the choice of which depends on the input data and the existence of a full text index. The query compiler tries to optimize and speed up queries by applying a full text index structure whenever possible and useful. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see [http://www.inf.uni-konstanz.de/gk/pubsys/publishedFiles/GrGaHo09.pdf XQuery Full Text implementation in BaseX]).~~

~~Query optimization and selection of the most efficient evaluation strategy is done in a full-fledged automatic manner. The output of the query optimizer indicates which evaluation plan is chosen for a specific query. It can be inspected by activating verbose querying (Command: <code>SET VERBOSE ON</code>) or opening the Query Info in the GUI. The message <code>Applying full-text index</code> suggests that the full-text index is applied to speed up query evaluation. A second message <code>Removing path with no index results</code> indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.~~

~~==FTAnd==~~

~~The internal XQuery Full Text data model is pretty complex and may consume more main memory as would initially guess. If you plan to combine search terms via {{Code|ftand}}, we recommend you to resort to an alternative, memory-saving representation:~~

~~<pre class="brush:xquery">(: representation via "ftand" :) "A B" contains text "A" ftand "B" ftor "C" ftor "D" (: memory saving representation :) "A B" contains text { "A", "B" } all ftor { "C", "D" } all</pre>~~
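The default error threshold described for fuzzy search can be sketched as a small function. This is an illustration only: the function name <code>local:max-errors</code> is hypothetical, and the use of integer division (rounding down) is an assumption, as the text does not state how the quotient is rounded:

<pre class="brush:xquery">
(: Hypothetical helper: token length divided by 4, but at least 1.
   Rounding down via "idiv" is an assumption. :)
declare function local:max-errors($term as xs:string) as xs:integer {
  max((1, string-length($term) idiv 4))
};

for $term in ("hous", "house", "database")
return $term || ": " || local:max-errors($term)
</pre>

Under these assumptions, one error is tolerated for “hous” and “house”, and two errors for “database”.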

=Mixed Content=

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

To enable this kind of search, it is recommended to:

* Turn off ''whitespace chopping'' ~~must be turned off~~ when importing XML documents. This can be done by setting ~~the option <code>[[Options#CHOP|CHOP]]</code>~~ {{Option|CHOP}} to <code>OFF</code> ~~(default: <code>SET CHOP ON</code>)~~. ~~In the GUI, you find this option in~~ This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
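The difference between the two predicates can be reproduced without a database. In the following sketch, the paragraph is a hypothetical stand-in for a mixed-content element. The first expression should yield <code>true</code>, as the string value of the element is searched; the second one should yield <code>false</code>, as each text node is searched separately:

<pre class="brush:xquery">
(: hypothetical mixed-content paragraph :)
let $p := <p>This is <b>real</b> text.</p>
return (
  (: searches the string value "This is real text." :)
  exists($p[. contains text 'real text']),
  (: searches the text nodes "This is " and " text." one by one :)
  exists($p[text() contains text 'real text'])
)
</pre>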

Note that the node structure is ~~completely~~ ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

</pre>

~~Note that~~ BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [http://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. ~~This means that it is not possible~~ If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow ~~. Here is an example document: <pre class="brush:xml"><p>This text is provided for illustrative<note>Serving as an example or explanation.</note> purposes only.</p></pre> The ignore option would enable~~, you can build a second database and exclude all information you do not want to search for. See the ~~string “illustrative purposes”~~ following example (visit [[XQuery Update]] to learn more about updates):

let $docs := db:open('docs')

return db:create(

  'index-db',

  $docs update delete node (.//footnote),

  $docs/db:path(.),

  map { 'ftindex': true() }

)

</pre>

~~For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Ignore XQuery and XPath Full Text 1.0 Use Cases].~~

As BaseX does not support the ignore option, it raises error [[XQuery_Errors#Full-Text_Errors|FTST0007]] when it encounters <code>without content</code> in a full-text <code>contains</code> expression.

=Functions=

=Collations=

~~Another XQuery feature related to natural language processing are '''Collations'''.~~ See [[XQuery 3.1#Collations|XQuery 3.1]] for standard collation features. By default, string comparisons in XQuery are based on the Unicode codepoint order. The default collation URI {{Code|http://www.w3.org/2005/xpath-functions/collation/codepoint}} specifies this ordering. In BaseX, the following URI syntax is supported to specify collations:

<nowiki>http://basex.org/collation?lang=...;strength=...;decomposition=...</nowiki>

|-

| {{Code|strength}}

| Level of difference considered significant in comparisons. Four strengths are supported: {{Code|primary}}, {{Code|secondary}}, {{Code|tertiary}}, and {{Code|identical}}. ~~For~~ As an example, in German:
* "Ä" and "A" are considered primary differences,
* "Ä" and "ä" are secondary differences,
* "Ä" and "A&#x308;" (see http://www.fileformat.info/info/unicode/char/308/index.htm) are tertiary differences, and
* "A" and "A" are identical.

|-

| {{Code|decomposition}}

</nowiki></pre>

~~==Case-Insensitive Collation==~~

~~{{Mark|Introduced with BaseX 8.0}}~~

~~XQuery 3.1 provides another default collation, which allows for a case-insensitive comparison of ASCII characters (<code>A-Z</code> = <code>a-z</code>). This query returns <code>true</code>:~~

~~<pre class="brush:xquery">declare default collation 'http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive'; 'HTML' = 'html'</pre>~~

=Changelog=

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:

* Added: [[#Case-Insensitive Collation|Case-Insensitive Collation]].

* Updated: [[#Scoring|Scores]] will be propagated by the {{Code|and}} and {{Code|or}} expressions and in predicates.

* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.

* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.

~~[[Category:XQuery]]~~

CG
