Changes

Full-Text: Japanese

Revision as of 17:30, 27 March 2015

609 bytes added , 17:30, 27 March 2015

no edit summary

This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [http://files.basex.org/etc/ja-ft.pdf also available as PDF].

Thank you to [http://blog.infinite.jp Toshio HIRAI] for integrating the lexer in BaseX!

=Introduction=

The lexical analysis of Japanese documents is performed by [http://igo.sourceforge.jp/ Igo]. Igo is a ''morphological analyzer''; some of the advantages and reasons for using Igo are:

* it is compatible with the results of the prominent morphological analyzer "MeCab"
* it can use the dictionary distributed by the MeCab project
* the morphological analyzer is implemented in Java and is relatively fast

Japanese tokenization will be activated in BaseX if Igo is found in the classpath. [http://en.sourceforge.jp/projects/igo/releases/ igo-0.4.3.jar] is currently included in all distributions of BaseX.

In addition to the library, one of the following dictionary files must be unzipped either into the current directory or into the <code>etc</code> sub-directory of the project’s [[Configuration#Home Directory|Home Directory]]:

* IPA Dictionary: http://files.basex.org/etc/ipadic.zip
* NAIST Dictionary: http://files.basex.org/etc/naistdic.zip

=Lexical Analysis=

The example sentence "私は本を書きました。(I wrote a book.)" is analyzed as follows.

<pre>私は本を書きました。

morpheme are used in indexing and stemming.

==Parsing==

During indexing and parsing, the input strings are split into single ''tokens''.

* Auxiliary verb

Thus, in the example above, {{Code|私}}, {{Code|本}}, and {{Code|書き}} will be passed to the indexer

for each token.
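The filtering step described above can be sketched outside BaseX. The following Python snippet is only an illustration: the morpheme list is hard-coded rather than produced by Igo, the part-of-speech labels are simplified English names, and the exclusion set is an assumption (only "auxiliary verb" is confirmed by the text above).

```python
# Hypothetical analysis of "私は本を書きました。" as (surface form, part of speech)
# pairs. The list is hard-coded for illustration; it is not output from Igo.
MORPHEMES = [
    ("私", "noun"),
    ("は", "particle"),
    ("本", "noun"),
    ("を", "particle"),
    ("書き", "verb"),
    ("まし", "auxiliary verb"),
    ("た", "auxiliary verb"),
    ("。", "symbol"),
]

# Assumed exclusion set: only "auxiliary verb" is stated in the article;
# "particle" and "symbol" are added here to reproduce the example result.
EXCLUDED = {"particle", "auxiliary verb", "symbol"}


def index_tokens(morphemes, excluded=EXCLUDED):
    """Return the surface forms whose part of speech is not excluded."""
    return [surface for surface, pos in morphemes if pos not in excluded]


print(index_tokens(MORPHEMES))
```

With these assumptions, the three tokens that remain are the ones named in the example: 私, 本, and 書き.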

==Token Processing==

"Fullwidth" and "Halfwidth" (which is defined by

[http://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] Option.

==Stemming==

Stemming in Japanese means to analyze the results of morphological analysis

Because the "auxiliary verb" is always excluded from the tokens, there is

no need to consider its use. Therefore, the same result (<code>true</code>)

is returned for the following two types of queries:

</pre>
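The effect of dropping auxiliary verbs and comparing base forms can be illustrated with a small Python sketch. It makes several assumptions: the analysis results are hard-coded instead of coming from Igo, the two query types are taken to be a dictionary form (書く) and an inflected form (書いた), and the helper names <code>stems</code> and <code>contains</code> are hypothetical and not part of BaseX.

```python
# Hypothetical analyzer output as (surface form, base form, part of speech).
# All entries are hard-coded for illustration; they are not produced by Igo.
DOCUMENT = [
    ("書き", "書く", "verb"),
    ("まし", "ます", "auxiliary verb"),
    ("た", "た", "auxiliary verb"),
]
QUERY_PLAIN = [("書く", "書く", "verb")]
QUERY_PAST = [
    ("書い", "書く", "verb"),
    ("た", "た", "auxiliary verb"),
]


def stems(morphemes):
    """Drop auxiliary verbs and reduce every other morpheme to its base form."""
    return [base for _, base, pos in morphemes if pos != "auxiliary verb"]


def contains(document, query):
    """Check whether the stemmed query occurs as a contiguous run in the document."""
    doc, qry = stems(document), stems(query)
    return any(doc[i:i + len(qry)] == qry for i in range(len(doc) - len(qry) + 1))


print(contains(DOCUMENT, QUERY_PLAIN))
print(contains(DOCUMENT, QUERY_PAST))
```

Both queries reduce to the single stem 書く, which is why they match the same document token in this sketch.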

==Wildcards==

The Wildcard option in XQuery Full-Text is available for Japanese as well.

The following example is based on '芥川龍之介(AKUTAGAWA, Ryunosuke)', a prominent Japanese writer,

whose first name is often spelled as "竜之介". The following two

queries both return <code>true</code>:
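The idea behind the wildcard match can be mimicked outside BaseX with a regular expression. The Python sketch below is an analogy only: it uses Python's <code>re</code> module rather than XQuery Full-Text, and the pattern <code>芥川.之介</code> is a hypothetical choice in which <code>.</code> stands for exactly one arbitrary character.

```python
import re

# "." matches exactly one character, so it covers both 龍 and 竜.
# The pattern is a hypothetical example, not a query taken from BaseX.
PATTERN = re.compile("芥川.之介")


def matches(name):
    """Return True if the whole name fits the single-character wildcard pattern."""
    return PATTERN.fullmatch(name) is not None


for name in ("芥川龍之介", "芥川竜之介"):
    print(name, matches(name))
```

Both spellings of the name fit the pattern, because only the one character that differs is left open.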

