Changes

Full-Text

Revision as of 22:25, 25 May 2016

7 bytes removed, 22:25, 25 May 2016

By default, unless one of the language codes "ja", "ar", "ko", "th", or "zh" is specified, a tokenizer for Western texts is used to tokenize texts:

* Whitespaces are interpreted as token delimiters.
* Sentence delimiters are <code>.</code>, <code>!</code>, and <code>?</code>.
* Paragraph delimiters are newlines (<code>&#xa;</code>).
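
The three rules above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not BaseX's implementation: the function name <code>tokenize</code> is hypothetical, and, as a simplifying assumption, a sentence is closed whenever a token ends with one of the sentence delimiters (the delimiter stays attached to the token).

```python
# Illustrative sketch only; not the BaseX tokenizer.
# Assumption: a token ending in '.', '!' or '?' closes the current sentence.
SENTENCE_DELIMITERS = ".!?"


def tokenize(text):
    """Return a list of paragraphs, each a list of sentences,
    each a list of whitespace-separated tokens."""
    paragraphs = []
    for paragraph in text.split("\n"):      # newlines delimit paragraphs
        sentences = []
        current = []
        for token in paragraph.split():     # whitespace delimits tokens
            current.append(token)
            if token[-1] in SENTENCE_DELIMITERS:
                sentences.append(current)   # delimiter ends the sentence
                current = []
        if current:                         # trailing tokens without a delimiter
            sentences.append(current)
        if sentences:                       # skip empty paragraphs
            paragraphs.append(sentences)
    return paragraphs


if __name__ == "__main__":
    for paragraph in tokenize("Hello world. How are you?\nNew paragraph!"):
        print(paragraph)
```

A real tokenizer would typically also handle punctuation, case, and diacritics; the sketch only shows how the three delimiter levels nest.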

The basic <code>jar</code> file of BaseX comes with built-in stemming support for English, German, Greek and Indonesian. Some more languages are supported if the following libraries are found in the classpath:

* [http://files.basex.org/maven/org/apache/lucene-stemmers/3.4.0/lucene-stemmers-3.4.0.jar lucene-stemmers-3.4.0.jar]: includes Snowball and Lucene stemmers and extends language support to the following languages: Bulgarian, Catalan, Czech, Danish, Dutch, Finnish, French, Hindi, Hungarian, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

CG

