702 bytes added, 22:23, 25 May 2016
==Languages==
The chosen language determines how the input text strings will be tokenized and stemmed. Either names (e.g. <code>English</code>, <code>German</code>) or codes (<code>en</code>, <code>de</code>) can be specified. A list of all language codes that are available on your system can be retrieved as follows:

<pre class="brush:xquery">
declare namespace locale = "java:java.util.Locale";
distinct-values(locale:getAvailableLocales() ! locale:getLanguage(.))
</pre>

By default, unless the language codes "ja", "ar", "ko", "th", or "zh" are specified, a tokenizer for Western texts will be used:

* Whitespace characters are interpreted as token delimiters;
* <code>.</code>, <code>!</code> and <code>?</code> are interpreted as sentence delimiters; and
* newlines are interpreted as paragraph delimiters.

The basic code base and <code>jar</code> file of BaseX come with built-in stemming support for English, German, Greek and Indonesian. More languages are supported if the following libraries are found in the classpath:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes Snowball and Lucene stemmers and extends language support to the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
* [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar]: [[Full-Text: Japanese|An additional article]] explains how Igo can be integrated, and how Japanese texts are tokenized and stemmed.
The JAR files are also included in the <code>zip</code> and <code>exe</code> distributions of BaseX.
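
The behaviour of the default Western tokenizer can be inspected with the <code>ft:tokenize</code> function of the BaseX Full-Text Module. The following is only a sketch: it assumes that the function is available in your BaseX version and that the default full-text options are in effect, and the input string is purely illustrative.

<pre class="brush:xquery">
(: returns the tokens of the input string, split at whitespace by the default tokenizer :)
ft:tokenize("No Doubt. No problem!")
</pre>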
The following two queries, which both return <code>true</code>, demonstrate that stemming depends on the selected language:
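
A minimal sketch of such a pair, assuming the standard XQuery Full Text match options <code>using stemming</code> and <code>using language</code>, the built-in English and German stemmers, and English as the default language (the sample words are illustrative):

<pre class="brush:xquery">
(: no language given: the default (assumed English) stemmer should reduce "Indexing" to "index" :)
"Indexing" contains text "index" using stemming,
(: German requested by its code: "häuser" and "haus" should share the same stem :)
"häuser" contains text "haus" using stemming using language "de"
</pre>

The language could equally be given by name (<code>"German"</code>) instead of by code, as described at the beginning of this section.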