Changes

5,065 bytes added, 15:26, 19 November 2011

Toshio HIRAI's article on Japanese Full Text support added

This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [http://files.basex.org/etc/ja-ft.pdf also available as PDF].

==Lexical Analysis==

The lexical analysis of Japanese documents is performed by ''morphological analysis''.
[http://igo.sourceforge.jp/ Igo - A morphological analyzer] 0.4.3 is used as an
external library for this purpose. Some of the reasons for using Igo are:
* it is a prominent morphological analyzer whose results are compatible with those of "MeCab"
* it can use the dictionaries distributed by the MeCab project
* it is implemented in Java and is relatively fast

The example sentence "私は本を書きました。(I wrote a book.)" is analyzed as follows.

<pre>私は本を書きました。
私	名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
本	名詞,一般,*,*,*,*,本,ホン,ホン
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
書き	動詞,自立,*,*,五段・カ行イ音便,連用形,書く,カキ,カキ
まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。	記号,句点,*,*,*,*,。,。,。
</pre>

Each element of the decomposed sentence is called the "Surface";
the analysis attached to it is called the "Morpheme".
The Morpheme component is structured as follows:

<pre>品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用形,活用型,原形,読み,発音
(POS, subtyping POS 1, subtyping POS 2, subtyping POS 3, inflections, use type, prototype, reading, pronunciation)
</pre>

Of these, the surface is used as the token. The analysis results contained in
the morpheme are used for indexing and stemming.
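
The split into surface and morpheme can be illustrated with a minimal Python sketch. It assumes MeCab-style output in which the surface and the comma-separated feature list are separated by a tab. The field names in <code>FIELDS</code> and the helper <code>parse_line</code> are hypothetical and chosen for illustration only; this is not the code used by Igo or BaseX.

```python
from typing import NamedTuple

# Hypothetical English names for the nine morpheme fields (assumption).
FIELDS = (
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
)


class Token(NamedTuple):
    surface: str    # the decomposed piece of the input text
    features: dict  # the morpheme fields, keyed by the names in FIELDS


def parse_line(line: str) -> Token:
    """Split one analyzer output line into its surface and morpheme fields."""
    # Assumption: surface and feature list are separated by a single tab.
    surface, feature_str = line.split("\t", 1)
    return Token(surface, dict(zip(FIELDS, feature_str.split(","))))


token = parse_line("書き\t動詞,自立,*,*,五段・カ行イ音便,連用形,書く,カキ,カキ")
print(token.surface, token.features["pos"], token.features["base_form"])
```

Here the surface <code>書き</code> would serve as the token, while the part of speech and the base form are the pieces of the morpheme that matter for indexing and stemming.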

==Parsing==

During indexing and parsing, the input strings are split into single ''tokens''.
In order to reduce the index size and speed up search, the following word classes
have been intentionally excluded:
* Mark
* Filler
* Postpositional particle
* Auxiliary verb

Thus, in the example above, only "私", "本" and "書き" are passed to the indexer
as tokens.
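
The filtering step can be sketched in Python as follows. The mapping of the four excluded word classes to the part-of-speech labels 記号, フィラー, 助詞 and 助動詞 is an assumption based on the analyzer output shown earlier, and the function name <code>index_tokens</code> is hypothetical.

```python
# Part-of-speech labels assumed to correspond to the excluded word classes:
# Mark, Filler, Postpositional particle, Auxiliary verb.
EXCLUDED_POS = {"記号", "フィラー", "助詞", "助動詞"}

# (surface, part of speech) pairs taken from the example analysis of
# "私は本を書きました。"
ANALYSIS = [
    ("私", "名詞"),
    ("は", "助詞"),
    ("本", "名詞"),
    ("を", "助詞"),
    ("書き", "動詞"),
    ("まし", "助動詞"),
    ("た", "助動詞"),
    ("。", "記号"),
]


def index_tokens(analysis):
    """Keep only the surfaces whose word class is not excluded."""
    return [surface for surface, pos in analysis if pos not in EXCLUDED_POS]


print(index_tokens(ANALYSIS))
```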

==Token Processing==

"Fullwidth" and "Halfwidth" characters (as defined by the
[http://unicode.org/Public/UNIDATA/EastAsianWidth.txt East Asian Width Properties])
are not distinguished (this is the so-called ZENKAKU/HANKAKU problem).
For example, <code>ＸＭＬ</code> and <code>XML</code> will be treated
as the same word. If documents are ''hybrid'', i.e. written in multiple languages,
this is also helpful for some other options of the XQuery Full Text Specification,
such as the [http://www.w3.org/TR/xpath-full-text-10/#ftcaseoption Case] or the
[http://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] Option.
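
The effect of ignoring character width can be approximated in Python with Unicode NFKC normalization. This is a stand-in technique chosen for illustration, not necessarily how BaseX implements it: NFKC also folds other compatibility characters (for example, circled digits), so it is broader than pure width folding. The function name <code>normalize_width</code> is hypothetical.

```python
import unicodedata


def normalize_width(text: str) -> str:
    """Fold fullwidth/halfwidth variants to one form via NFKC (approximation)."""
    return unicodedata.normalize("NFKC", text)


# Fullwidth Latin letters are folded to their ASCII counterparts.
print(normalize_width("ＸＭＬ") == normalize_width("XML"))
# Halfwidth katakana are folded to regular (fullwidth) katakana.
print(normalize_width("ｶﾀｶﾅ") == "カタカナ")
```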

==Stemming==

Stemming in Japanese relies on the results of morphological analysis:
"verbs" and "adjectives" are reduced to their "prototype" (the base form, 原形).

If the stemming option is enabled, for example, the two statements
"私は本を書いた (I wrote the book)" and "私は本を書く (I write the book)"
can be led back to the same prototype by analyzing their verb:

<pre>
書く	動詞,自立,*,*,五段・カ行イ音便,基本形,[書く],カク,カク

書い	動詞,自立,*,*,五段・カ行イ音便,連用タ接続,[書く],カイ,カイ
た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
</pre>

Because the "auxiliary verb" is always excluded from the tokens, there is
no need to consider it. Therefore, the same result (<code>true</code>)
is returned for the following two queries:

<pre class="brush:xquery">
'私は本を書いた' contains text '書く' using stemming using language 'ja'
'私は本を書く' contains text '書いた' using stemming using language 'ja'
</pre>
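
Why both queries succeed can be sketched in Python. The sketch assumes that stemming replaces verbs (動詞) and adjectives (形容詞) by their prototype and that the excluded word classes are dropped beforehand; the triples are abbreviated from the analyzer output, and <code>stem_tokens</code> is a hypothetical helper, not a BaseX function.

```python
STEMMED_POS = {"動詞", "形容詞"}                     # verbs and adjectives
EXCLUDED_POS = {"記号", "フィラー", "助詞", "助動詞"}  # never indexed (assumed labels)

# (surface, part of speech, prototype) for "私は本を書いた"
PAST = [
    ("私", "名詞", "私"), ("は", "助詞", "は"),
    ("本", "名詞", "本"), ("を", "助詞", "を"),
    ("書い", "動詞", "書く"), ("た", "助動詞", "た"),
]

# (surface, part of speech, prototype) for "私は本を書く"
PRESENT = [
    ("私", "名詞", "私"), ("は", "助詞", "は"),
    ("本", "名詞", "本"), ("を", "助詞", "を"),
    ("書く", "動詞", "書く"),
]


def stem_tokens(analysis):
    """Drop excluded word classes; replace verbs/adjectives by their prototype."""
    tokens = []
    for surface, pos, prototype in analysis:
        if pos in EXCLUDED_POS:
            continue
        tokens.append(prototype if pos in STEMMED_POS else surface)
    return tokens


print(stem_tokens(PAST))
print(stem_tokens(PRESENT))
```

Under these assumptions both sentences yield the same stemmed token list, so a stemmed search term such as 書く or 書いた is found in either of them.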

==Wildcards==

The Wildcard option in XQuery Full-Text is available for Japanese as well.
The following example is based on '芥川龍之介', a prominent Japanese writer,
whose given name is often also written as "竜之介". The following two
queries both return <code>true</code>:

<pre class="brush:xquery">
'芥川龍之介' contains text '.之介' using wildcards using language 'ja'
'芥川竜之介' contains text '.之介' using wildcards using language 'ja'
</pre>

However, there is a special case that requires attention. The following
query will yield <code>false</code>:

<pre class="brush:xquery">
'芥川龍之介' contains text '芥川.之介' using wildcards using language 'ja'
</pre>

This is because the word boundary next to the wildcard metacharacter
cannot be determined in the query. In this case, you may insert
additional whitespace as a word boundary:

<pre class="brush:xquery">
'芥川龍之介' contains text '芥川　.之介' using wildcards using language 'ja'
</pre>

As an alternative, you may modify the query as follows:

<pre class="brush:xquery">
'芥川龍之介' contains text '芥川' ftand '.之介' using wildcards using language 'ja'
</pre>
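
The special case can be illustrated with a small Python sketch. It assumes that the text '芥川龍之介' is tokenized into the two tokens 芥川 and 龍之介, and that a wildcard term must match one whole token; both points are assumptions made for illustration. The wildcard <code>.</code> is emulated with Python regular expressions, and <code>any_token_matches</code> is a hypothetical helper.

```python
import re

# Assumed tokenization of '芥川龍之介' (family name + given name).
TOKENS = ["芥川", "龍之介"]


def any_token_matches(tokens, pattern):
    """True if the wildcard pattern matches at least one whole token."""
    return any(re.fullmatch(pattern, token) for token in tokens)


# A pattern that stays within one token matches.
print(any_token_matches(TOKENS, ".之介"))
# A pattern that spans the token boundary matches no single token.
print(any_token_matches(TOKENS, "芥川.之介"))
# Splitting the query at the boundary lets each part match its own token.
print(all(any_token_matches(TOKENS, p) for p in ["芥川", ".之介"]))
```

The last line corresponds to the two workarounds: once the query is split at the word boundary, each part can be matched against a token of its own.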
