Changes

Full-Text (edit)

Revision as of 22:23, 25 May 2016

702 bytes added , 22:23, 25 May 2016

==Languages==

The chosen language determines how strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:

<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>

By default, unless the language codes "ja", "ar", "ko", "th", or "zh" are specified, a tokenizer for Western texts will be used:

* Whitespace characters are interpreted as token delimiters;
* <code>.</code>, <code>!</code> and <code>?</code> are interpreted as sentence delimiters; and
* newlines are interpreted as paragraph delimiters.

The basic <code>jar</code> file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. More languages are supported if the following libraries are found in the classpath:

* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes Snowball and Lucene stemmers and extends language support to the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.

The JAR files are included in the <code>zip</code> and <code>exe</code> distributions of BaseX.

The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
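
A minimal sketch of two such queries, assuming the built-in English and German stemmers and English as the default language; the example words are illustrative choices and are not taken from this revision:

<pre class="brush:xquery">
(: no language given: the default (assumed to be English) stemmer is applied to both sides :)
"Indexing" contains text "index" using stemming,
(: German is selected explicitly, so the plural "häuser" is expected to be reduced to "haus" :)
"häuser" contains text "haus" using stemming using language "de"
</pre>

If <code>using language "de"</code> is dropped from the second expression, the default stemmer is applied instead, and the match is then unlikely to succeed.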

CG

