Changes

Full-Text (edit)

Revision as of 11:01, 15 September 2020

39 bytes removed , 11:01, 15 September 2020

no edit summary

This article is part of the [[XQuery|XQuery Portal]]. It summarizes the features of the [https://www.w3.org/TR/xpath-full-text-10/ W3C XQuery Full Text] Recommendation, and custom features of the implementation in BaseX.

Please read the separate [[Indexes#Full-Text Index|Full-Text Index]] section in our documentation if you want to learn how to evaluate full-text requests on large databases within milliseconds.

The basic JAR file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the [[Startup#Distributions|classpath]]:

* [https://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar] includes the Snowball and Lucene stemmers for the following languages: Arabic, Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [https://en.osdn.net/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the ZIP and EXE distributions of BaseX.
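
As a brief illustration of how a stemmer is selected in a query, the following sketch combines the standard <code>using stemming</code> and <code>using language</code> match options. The sample strings are made up for this example, and the French query assumes that the Lucene stemmers library listed above is available in the classpath.

<syntaxhighlight lang="xquery">
(: English stemming is part of the basic JAR file :)
'Indexing documents' contains text 'index' using stemming using language 'en',
(: French stemming: assumes lucene-stemmers-3.4.0.jar is found in the classpath :)
'les chevaux' contains text 'cheval' using stemming using language 'fr'
</syntaxhighlight>

If the stemmer for the requested language cannot be found, the second query is expected to fail rather than to silently fall back to another language.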

</syntaxhighlight>

The format of the thesaurus files must be the same as the format of the thesauri provided by the [https://dev.w3.org/2007/xpath-full-text-10-test-suite XQuery and XPath Full Text 1.0 Test Suite]. Each file is an XML document whose structure is defined by an [https://dev.w3.org/2007/xpath-full-text-10-test-suite/TestSuiteStagingArea/TestSources/thesaurus.xsd XSD Schema].
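
A query that consults such a file could look like the following sketch. The file name <code>thesaurus.xml</code> and the search terms are hypothetical, and the file is assumed to connect the terms ''happy'' and ''lucky'' with the relationship <code>RT</code>. Whether the query matches depends entirely on the entries in that file.

<syntaxhighlight lang="xquery">
(: 'thesaurus.xml' is a hypothetical file in the test suite format :)
'happy days' contains text 'lucky'
  using thesaurus at 'thesaurus.xml' relationship 'RT'
</syntaxhighlight>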

==Fuzzy Querying==

=Mixed Content=

When working with so-called narrative XML documents, such as HTML, [https://tei-c.org/ TEI], or [https://docbook.org/ DocBook] documents, you typically have ''mixed content'', i.e., elements containing a mix of text and markup, such as:

</syntaxhighlight>

Since the logical flow of the text is not interrupted by the child elements, you will typically want to search across elements, so that the above paragraph would match a search for “real text”. For more examples, see [https://www.w3.org/TR/xpath-full-text-10-use-cases/#Across XQuery and XPath Full Text 1.0 Use Cases].
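
The difference between searching across elements and searching single text nodes can be sketched as follows. The paragraph is an illustrative stand-in constructed inline; no database or index is assumed.

<syntaxhighlight lang="xquery">
let $p := <p>This is only an illustrative <hi>example</hi>, not a <q>real</q> text.</p>
return (
  (: searches the string value of the whole paragraph, across child elements :)
  $p contains text 'real text',
  (: searches each direct text node on its own; "real" and "text" are in different nodes :)
  $p/text() contains text 'real text'
)
</syntaxhighlight>

The first expression should match because the words are adjacent in the string value of the paragraph, whereas the second one should not, as no single text node contains both words.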

To enable this kind of search, it is advisable to:

</syntaxhighlight>

BaseX does '''not''' support the ''ignore option'' (<code>without content</code>) of the [https://www.w3.org/TR/xpath-full-text-10/#ftignoreoption W3C XQuery Full Text 1.0] Recommendation. If you want to ignore descendant element content, such as footnotes or other material that does not belong to the same logical text flow, you can build a second database from the original one and exclude all information you do not want to search for. See the following example (visit [[XQuery Update]] to learn more about updates):
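
A minimal sketch of this approach is shown below. The database names <code>docs</code> and <code>docs-search</code> and the element name <code>footnote</code> are hypothetical placeholders. The sketch uses the function name <code>db:open</code>, which may differ in other versions of BaseX.

<syntaxhighlight lang="xquery">
(: 'docs', 'docs-search' and footnote are hypothetical names :)
let $stripped :=
  for $doc in db:open('docs')
  return
    (: work on a copy, so the source database remains unchanged :)
    copy $c := $doc
    modify delete node $c//footnote
    return $c
return db:create(
  'docs-search',
  $stripped,
  (: keep the original document paths :)
  db:open('docs') ! db:path(.),
  (: create a full-text index for the new database :)
  map { 'ftindex': true() }
)
</syntaxhighlight>

Full-text queries are then sent to the second database, while the first one keeps the complete documents.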

|-

| {{Code|decomposition}}

| Defines how composed characters are handled. Three decompositions are supported: {{Code|none}}, {{Code|standard}}, and {{Code|full}}. More details are found in the [https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/text/Collator.html JavaDoc] of the JDK.

|}
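
As a hedged illustration, options like {{Code|decomposition}} are appended to the collation URI as semicolon-separated parameters. In the sketch below, the base URI <code>http://basex.org/collation</code> and the parameters <code>lang</code> and <code>strength</code> are assumptions that are not confirmed by the table row shown here, and the compared strings are made up.

<syntaxhighlight lang="xquery">
(: assumed URI format; compares two strings with a French collation :)
compare(
  'resume',
  'résumé',
  'http://basex.org/collation?lang=fr;strength=primary;decomposition=standard'
)
</syntaxhighlight>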

</nowiki></syntaxhighlight>

If the [http://site.icu-project.org/download ICU Library] is added to the classpath, the full [https://www.w3.org/TR/xpath-functions-31/#uca-collations Unicode Collation Algorithm] features become available:
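
For instance, the following sketch sorts a few made-up strings with a UCA collation URI as defined in the XPath and XQuery Functions and Operators 3.1 specification. It assumes that the ICU Library is found in the classpath. The chosen parameters are only an example.

<syntaxhighlight lang="xquery">
(: German collation, primary strength: case and diacritics are not significant :)
sort(
  ('Zebra', 'Äpfel', 'apple'),
  'http://www.w3.org/2013/collation/UCA?lang=de;strength=primary'
)
</syntaxhighlight>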

CG

Bureaucrats, editor, reviewer, Administrators

13,550

edits
