Changes

125 bytes added, 17:30, 27 March 2015
no edit summary
This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [http://files.basex.org/etc/ja-ft.pdf also available as PDF].
Thank you to [http://blog.infinite.jp Toshio HIRAI] for integrating the lexer in BaseX!
==Introduction==
The lexical analysis of Japanese documents is performed by
[http://igo.sourceforge.jp/ Igo]. Igo is a ''morphological analyzer'',
and some of the advantages and reasons for using Igo are:
* it is a popular project whose results are compatible with those of the prominent morphological analyzer "MeCab"
* it can use the dictionaries distributed by the MeCab project
* the morphological analyzer is implemented in Java and is relatively fast
* NAIST Dictionary: http://files.basex.org/etc/naistdic.zip
==Lexical Analysis==
The example sentence "私は本を書きました。(I wrote a book.)"
morpheme are used in indexing and stemming.
==Parsing==
During indexing and parsing, the input strings are split into single ''tokens''.
* Auxiliary verb
Thus, in the example above, {{Code|私}}, {{Code|本}}, and {{Code|書き}} will be passed to the indexer
as individual tokens.
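The filtering step can be sketched outside of BaseX as follows. This is only an illustration: the morpheme list is written by hand for the example sentence (it is not real Igo output), the class names follow common IPADIC-style labels, and both the set of excluded classes and the helper <code>index_tokens</code> are assumptions made for this sketch.

```python
# Hand-written (surface, word class) pairs for "私は本を書きました。".
# NOT real Igo output; the labels are assumed IPADIC-style class names.
MORPHEMES = [
    ("私", "名詞"),      # noun
    ("は", "助詞"),      # postpositional particle
    ("本", "名詞"),      # noun
    ("を", "助詞"),      # postpositional particle
    ("書き", "動詞"),    # verb
    ("まし", "助動詞"),  # auxiliary verb
    ("た", "助動詞"),    # auxiliary verb
    ("。", "記号"),      # mark
]

# Assumed set of word classes that are not indexed (hypothetical).
EXCLUDED = {"助詞", "助動詞", "記号"}


def index_tokens(morphemes, excluded=EXCLUDED):
    """Return the surfaces whose word class is not in the excluded set."""
    return [surface for surface, word_class in morphemes
            if word_class not in excluded]


print(index_tokens(MORPHEMES))
```

Under these assumptions, only the two nouns and the verb survive the filter; particles, auxiliary verbs and the closing mark are dropped.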
==Token Processing==
"Fullwidth" and "Halfwidth" (which is defined by
[http://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] Option.
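As a rough illustration of what the fullwidth/halfwidth distinction means, the following sketch uses Python's standard <code>unicodedata</code> module to inspect the East Asian Width property and to fold both forms to a common representation via NFKC normalization. Whether BaseX uses this particular normalization form is an assumption; the helper name <code>fold_width</code> is hypothetical.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Fold fullwidth/halfwidth variants via NFKC normalization.

    Note: NFKC also rewrites other compatibility characters
    (ligatures, for example), so it is broader than width folding.
    """
    return unicodedata.normalize("NFKC", text)


# East Asian Width classes: 'F' = fullwidth, 'H' = halfwidth, 'Na' = narrow.
for ch in ("Ｘ", "X", "ｶ"):
    print(ch, unicodedata.east_asian_width(ch))

# After folding, fullwidth Latin letters compare equal to their ASCII forms.
print(fold_width("ＸＭＬ") == "XML")
```

With such a folding step applied to both the indexed text and the query, a search for "XML" would also find the fullwidth spelling "ＸＭＬ".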
==Stemming==
Stemming in Japanese means to analyze the results of morphological analysis
</pre>
==Wildcards==
The Wildcard option in XQuery Full-Text is available for Japanese as well.
The following example is based on 芥川 龍之介 (AKUTAGAWA, Ryunosuke), a prominent Japanese writer,
whose first name is often spelled "竜之介". The following two
queries both return <code>true</code>:
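The effect of a single-character wildcard on the two spellings can also be sketched outside of BaseX. The snippet below is not the BaseX implementation: it assumes that the name is split into tokens at whitespace, it uses the illustrative pattern <code>.之介</code> (any one character followed by 之介), and the helper <code>matches_wildcard</code> is hypothetical.

```python
import re


def matches_wildcard(text: str, pattern: str) -> bool:
    """Return True if any whitespace-separated token fully matches pattern.

    The pattern is interpreted as a Python regular expression, in which
    '.' stands for exactly one arbitrary character.
    """
    regex = re.compile(pattern)
    return any(regex.fullmatch(token) for token in text.split())


# Both spellings of the first name match the same wildcard pattern.
print(matches_wildcard("芥川 龍之介", ".之介"))
print(matches_wildcard("芥川 竜之介", ".之介"))
```

Because the wildcard stands for exactly one character, it covers both "龍" and "竜", while the family name 芥川 does not match the pattern.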