Changes

126 bytes added, 17:30, 27 March 2015
This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [http://files.basex.org/etc/ja-ft.pdf also available as PDF].
Thank you to [http://blog.infinite.jp Toshio HIRAI] for integrating the lexer in BaseX!
==Introduction==
The lexical analysis of Japanese documents is performed by
[http://igo.sourceforge.jp/ Igo], a ''morphological analyzer''.
Some of the reasons for using Igo are:
* it is a popular project whose results are compatible with those of the prominent morphological analyzer "MeCab"
* it can use the dictionaries distributed by the MeCab project
* the morphological analyzer is implemented in Java and is relatively fast
* NAIST Dictionary: http://files.basex.org/etc/naistdic.zip
==Lexical Analysis==
The example sentence "私は本を書きました。(I wrote a book.)"
morpheme are used in indexing and stemming.
==Parsing==
During indexing and parsing, the input strings are split into single ''tokens''.
* Auxiliary verb
Thus, in the example above, {{Code|私}}, {{Code|本}}, and {{Code|書き}} will be passed to the indexer
as tokens.
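The filtering step can be sketched in a few lines of Python. This is only an illustration and not the code used by BaseX or Igo: the morpheme list is annotated by hand for the example sentence, the word class labels are simplified, and the set of excluded classes is an assumption (of these, only "Auxiliary verb" appears in the list above). The names <code>MORPHEMES</code>, <code>EXCLUDED</code> and <code>index_tokens</code> are hypothetical.

```python
# Hand-annotated morphemes for "私は本を書きました。".
# Assumption for illustration: a real analyzer such as Igo derives these
# (surface form, word class) pairs from a dictionary.
MORPHEMES = [
    ("私", "Noun"),
    ("は", "Postpositional particle"),
    ("本", "Noun"),
    ("を", "Postpositional particle"),
    ("書き", "Verb"),
    ("まし", "Auxiliary verb"),
    ("た", "Auxiliary verb"),
    ("。", "Mark"),
]

# Word classes that are dropped before indexing (assumed set).
EXCLUDED = {"Mark", "Filler", "Postpositional particle", "Auxiliary verb"}


def index_tokens(morphemes, excluded=EXCLUDED):
    """Return the surface forms whose word class is not excluded."""
    return [surface for surface, word_class in morphemes
            if word_class not in excluded]


print(index_tokens(MORPHEMES))
```

Under these assumptions, only the two nouns and the verb stem remain as tokens.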
==Token Processing==
"Fullwidth" and "Halfwidth" (which is defined by
[http://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] Option.
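One common way to make fullwidth and halfwidth variants of a token comparable is Unicode compatibility normalization (NFKC). The Python sketch below shows this idea; whether BaseX or Igo use NFKC internally is an assumption, and <code>normalize_width</code> is a hypothetical helper name.

```python
import unicodedata


def normalize_width(token: str) -> str:
    """Fold fullwidth/halfwidth variants of a token into one form via NFKC."""
    return unicodedata.normalize("NFKC", token)


# Fullwidth Latin letters are folded to ASCII.
print(normalize_width("ＸＭＬ") == "XML")
# Halfwidth katakana plus voiced sound mark is folded to fullwidth katakana.
print(normalize_width("ｶﾞ") == "ガ")
```

Note that NFKC also rewrites other compatibility characters (for example, ligatures and squared units), so it is broader than a pure width conversion.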
==Stemming==
Stemming in Japanese means to analyze the results of morphological analysis
Because the "auxiliary verb" is always excluded from the tokens, there is
no need to consider its use. Therefore, the same result (<code>true</code>)
is returned for the following two types of queries:
</pre>
==Wildcards==
The Wildcard option in XQuery Full-Text is available for Japanese as well.
The following example is based on 芥川 龍之介 (AKUTAGAWA, Ryunosuke), a prominent Japanese writer
whose first name is often spelled as "竜之介". The following two
queries both return <code>true</code>:
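As a rough illustration outside of BaseX, the Python sketch below emulates a single-character wildcard (<code>.</code>) that is matched against whole tokens. The tokenization of the name into family name and first name is written by hand and is an assumption, as are the helper names <code>token_matches</code> and <code>any_token_matches</code>. Only the plain <code>.</code> wildcard is handled; the quantified forms defined by XQuery Full Text are not.

```python
import re


def token_matches(token: str, pattern: str) -> bool:
    """True if the whole token matches; '.' stands for one arbitrary character."""
    regex = "".join("." if ch == "." else re.escape(ch) for ch in pattern)
    return re.fullmatch(regex, token, flags=re.DOTALL) is not None


def any_token_matches(tokens, pattern: str) -> bool:
    """True if at least one token matches the wildcard pattern."""
    return any(token_matches(token, pattern) for token in tokens)


# Hand-tokenized spellings of the writer's name (assumed tokenization).
print(any_token_matches(["芥川", "龍之介"], ".之介"))
print(any_token_matches(["芥川", "竜之介"], ".之介"))
```

Both spellings of the first name differ only in the first character, which is why a single wildcard character covers them.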