</pre>
By default, unless one of the language codes <code>ja</code>, <code>ar</code>, <code>ko</code>, <code>th</code>, or <code>zh</code> is specified, a tokenizer for Western texts is used:
* Whitespace characters are interpreted as token delimiters.