Changes

Full-Text (edit)

Revision as of 06:43, 29 November 2019

245 bytes removed , 06:43, 29 November 2019

no edit summary

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [http://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text 1.0] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

=Introduction=

The XQuery and XPath Full Text Recommendation (XQFT) is a feature-rich extension of the XQuery language. It can be used to both query XML documents and single strings for words and phrases. BaseX was the first query processor that supported all features of the specification. This section gives you a quick insight into the most important features of the language.

This is a simple example for a basic full-text expression:

</pre>

Various range modifiers are available: {{Code|exactly}}, {{Code|at least}}, {{Code|at most}}, and {{Code|from ... to ...}}.
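As a hedged illustration of the four range modifiers, the following sketch counts occurrences of a search term with <code>occurs ... times</code>. The input string is made up for this example and is not taken from the specification:

<pre class="brush:xquery">
(: illustrative input: the word 'and' appears twice :)
'one and two and three' contains text 'and' occurs exactly 2 times,
'one and two and three' contains text 'and' occurs at least 2 times,
'one and two and three' contains text 'and' occurs at most 3 times,
'one and two and three' contains text 'and' occurs from 1 to 2 times
</pre>

Since “and” appears twice in the input, all four expressions should evaluate to <code>true</code>.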

==Combining Results==

'Mary told me, “I will survive!”.' contains text { 'will', 'told' } all words same sentence

</pre>

By the way: In some examples above, the {{Code|words}} unit was used, but {{Code|sentences}} and {{Code|paragraphs}} would have been valid alternatives.
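A short sketch of the alternative units, assuming a made-up input string: the same <code>distance</code> filter is written once per unit, and only the unit keyword differs between the three queries.

<pre class="brush:xquery">
(: hypothetical input string, chosen only for illustration :)
let $text := 'Mary told me a story. I will survive.'
return (
  $text contains text { 'told', 'will' } all words distance at most 5 words,
  $text contains text { 'told', 'will' } all words distance at most 1 sentences,
  $text contains text { 'told', 'will' } all words distance at most 1 paragraphs
)
</pre>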

Last but not least, three specifiers exist to filter results depending on the position of a hit:

* If {{Code|diacritics}} is insensitive, characters with and without diacritics (umlauts, characters with accents) are declared as identical. By default, the option is {{Code|insensitive}}; it can also be set to {{Code|sensitive}}:

"'Äpfel' will not be found..." contains text "Apfel" using diacritics sensitive

</pre>

* If {{Code|stemming}} is activated, words are shortened to a base form by a language-specific stemmer:

"catch" contains text "catches" using stemming,~~"Haus" contains text "Häuser" using stemming using language 'de'~~

</pre>

* With the {{Code|stop words}} option, a list of words can be defined that will be ignored when tokenizing a string. This is particularly helpful if the full-text index takes too much space (a standard stopword list for English texts is provided in the directory {{Code|etc/stopwords.txt}} in the full distributions of BaseX, and available online at http://files.basex.org/etc/stopwords.txt):

"You and me" contains text "you or me" using stop words ("and", "or"),

=BaseX Features=

==Languages==

The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:

<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>

By default, unless the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> are specified, a tokenizer for Western texts is used:

* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&#xa;</code>).

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:

"Indexing" contains text "index" using stemming,

"häuser" contains text "haus" using stemming using language "deGerman"

</pre>

==Fuzzy Querying==

In addition to the official recommendation, BaseX supports a fuzzy search feature. The XQFT grammar was enhanced by the <code>fuzzy</code> match option to allow for approximate results in full texts:

'''Document 'doc.xml'''':

</pre>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 error. A static error distance can be set by adjusting the {{Option|LSERROR}} option (default: <code>SET LSERROR 0</code>). The query above yields two results as there is no error between the query term “house” and the text node “house”, and one error between “house” and “hous”. Fuzzy search is also supported by the full-text index.
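The error rule can also be tried out on plain strings. The following is a minimal sketch with made-up inputs. The comments apply the divide-by-4 rule by hand and assume the default {{Option|LSERROR}} setting; how fractions are rounded is an assumption, not something stated above:

<pre class="brush:xquery">
(: query term 'house' has 5 characters: at least 1 error is allowed :)
'hous' contains text 'house' using fuzzy,
(: query term 'database' has 8 characters: 8 div 4 = 2 errors are allowed :)
'databsae' contains text 'database' using fuzzy
</pre>

The first comparison mirrors the “house”/“hous” pair from the explanation above. The second one swaps two letters, which counts as two edits in the plain Levenshtein distance.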

=Mixed Content=

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [http://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].

To enable this kind of search, it is recommended to:

* Turn off ''whitespace chopping'' when importing XML documents. This can be done by setting {{Option|CHOP}} to <code>OFF</code>. This can also be done in the GUI if a new database is created (''Database'' → ''New…'' → ''Parsing'' → ''Chop Whitespaces'').
* Turn off automatic indentation by assigning <code>indent=no</code> to the {{Option|SERIALIZER}} option.

A query such as <code>//p[. contains text 'real text']</code> will then match the example paragraph above. However, the full-text index will '''not''' be used in this query, so it may take a long time. The full-text index would be used for the query <code>//p[text() contains text 'real text']</code>, but this query will not find the example paragraph, because the matching text is split over two text nodes.
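The difference between the two predicates can be sketched without a database. The paragraph below is a hypothetical stand-in constructed inside the query, not necessarily the example paragraph referred to above:

<pre class="brush:xquery">
(: hypothetical mixed-content paragraph, built directly in the query :)
let $p := <p>This is only an illustration of how <b>real</b> text looks like.</p>
return (
  (: searches the string value of the whole element :)
  $p[. contains text 'real text'],
  (: searches each direct text node separately :)
  $p[text() contains text 'real text']
)
</pre>

Following the explanation above, only the first predicate is expected to select the paragraph, because the phrase is spread over the <code>b</code> element and the text node that follows it.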

Note that the node structure is ignored by the full-text tokenizer: The {{Code|contains text}} expression applies all full-text operations to the ''string value'' of its left operand. As a consequence, the <code>ft:mark</code> and <code>ft:extract</code> functions (see [[Full-Text Module|Full-Text Functions]]) will only yield useful results if they are applied to single text nodes, as the following example demonstrates:

|-

| {{Code|strength}}

| Level of difference considered significant in comparisons. Four strengths are supported: {{Code|primary}}, {{Code|secondary}}, {{Code|tertiary}}, and {{Code|identical}}. As an example, in German:
* "Ä" and "A" are considered primary differences,
* "Ä" and "ä" are secondary differences,
* "Ä" and "A&#x308;" (see http://www.fileformat.info/info/unicode/char/308/index.htm) are tertiary differences, and
* "A" and "A" are identical.

|-

| {{Code|decomposition}}

=Changelog=

; Version 9.2:

* Added: Arabic stemmer.

; Version 8.0:

* Removed: Trie index, which was specialized on wildcard queries. The fuzzy index now supports both wildcard and fuzzy queries.

* Removed: TF/IDF scoring was discarded in favor of the internal scoring model.


CG
