Changes

7 bytes removed, 22:25, 25 May 2016
By default, unless one of the language codes "ja", "ar", "ko", "th", or "zh" is specified, a tokenizer for Western texts is used:
* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&#xa;</code>).
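The three delimiter rules above can be illustrated with a short Python sketch. This is not the BaseX implementation: the function names are hypothetical, and an actual tokenizer would likely apply further normalization (for example case folding or punctuation handling) that is not shown here.

```python
import re


def paragraphs(text):
    # Rule 3: newlines (&#xa;) delimit paragraphs; blank paragraphs are dropped.
    return [p for p in text.split("\n") if p.strip()]


def sentences(paragraph):
    # Rule 2: '.', '!' and '?' delimit sentences.
    return [s.strip() for s in re.split(r"[.!?]", paragraph) if s.strip()]


def tokens(sentence):
    # Rule 1: whitespace delimits tokens.
    return sentence.split()


def tokenize(text):
    # Nested result: paragraphs -> sentences -> tokens.
    return [[tokens(s) for s in sentences(p)] for p in paragraphs(text)]


if __name__ == "__main__":
    print(tokenize("Hello world! How are you?\nFine."))
```

Keeping the result nested by paragraph and sentence makes it easy to see which delimiter produced which boundary.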
The basic <code>jar</code> file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. More languages are supported if the following libraries are found in the classpath:
* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes Snowball and Lucene stemmers and extends language support to the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.