This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [https://files.basex.org/etc/ja-ft.pdf also available as PDF].
  
The lexer was contributed by Toshio HIRAI.

=Introduction=
The lexical analysis of Japanese documents is performed by [https://igo.osdn.jp/ Igo], a ''morphological analyzer''. Some of the advantages of Igo, and reasons for using it, are:
  
* It is compatible with the results of "MeCab", a prominent morphological analyzer.
* It can use the dictionary distributed by the MeCab project.
* It is implemented in Java and is relatively fast.
 
Japanese tokenization will be activated in BaseX if Igo is found on the classpath. [https://osdn.net/projects/igo/releases/ igo-0.4.3.jar] is currently included in all distributions of BaseX.
 
In addition to the library, one of the following dictionary files must be unzipped, either into the current directory or into the <code>etc</code> sub-directory of the project’s [[Configuration#Home Directory|Home Directory]]:

* IPA Dictionary: https://files.basex.org/etc/ipadic.zip
* NAIST Dictionary: https://files.basex.org/etc/naistdic.zip
 
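Once the library and one of the dictionaries are in place, the setup can be verified with a simple query. As a minimal sketch, the following expression should yield <code>true</code> if the Japanese tokenizer is active, because the sentence is then split into the tokens shown in the next section:

<syntaxhighlight lang="xquery">
(: Minimal check: if Igo and a dictionary have been found, the sentence is
   tokenized morphologically and the noun 本 is indexed as a token of its own. :)
'私は本を書きました。' contains text '本' using language 'ja'
</syntaxhighlight>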
=Lexical Analysis=

The example sentence "私は本を書きました。(I wrote a book.)" is analyzed as follows:

<syntaxhighlight>
私は本を書きました。
私      名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
本      名詞,一般,*,*,*,*,本,ホン,ホン
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
書き    動詞,自立,*,*,五段・カ行イ音便,連用形,書く,カキ,カキ
まし    助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。      記号,句点,*,*,*,*,。,。,。
</syntaxhighlight>
 
Each decomposed element is called a "Surface", and the analysis attached to it is called a "Morpheme". The Morpheme component is built as follows:

<syntaxhighlight>
品詞,品詞細分類1,品詞細分類2,品詞細分類3,活用形,活用型,原形,読み,発音
(POS, subtyping POS 1, subtyping POS 2, subtyping POS 3, inflections, use type, prototype, reading, pronunciation)
</syntaxhighlight>
  
 
Of these, the surface is used as a token, while the contents of the morpheme analysis are used for indexing and stemming.
=Parsing=
  
 
During indexing and parsing, the input strings are split into single ''tokens''. In order to reduce the index size and speed up search, the following word classes are intentionally excluded:

* Mark
* Filler
* Postpositional particle
* Auxiliary verb

Thus, in the example above, {{Code|私}}, {{Code|本}}, and {{Code|書き}} will be passed to the indexer as tokens.
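As a sketch of this behaviour, the following query should therefore yield <code>true</code>, since all three surfaces are indexed tokens of the example sentence:

<syntaxhighlight lang="xquery">
(: Sketch: 私, 本 and 書き are passed to the indexer; particles, auxiliary
   verbs and punctuation are not. :)
'私は本を書きました。' contains text { '私', '本', '書き' } all using language 'ja'
</syntaxhighlight>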
  
=Token Processing=
"Fullwidth" and "Halfwidth" characters (as defined by the Unicode [https://unicode.org/Public/UNIDATA/EastAsianWidth.txt East Asian Width property]) are not distinguished (this is the so-called ZENKAKU/HANKAKU problem).

For example, <code>ＸＭＬ</code> (fullwidth) and <code>XML</code> (halfwidth) will be treated as the same word. If documents are ''hybrid'', i.e. written in multiple languages, this is also helpful for some other options of the XQuery Full Text specification, such as the [https://www.w3.org/TR/xpath-full-text-10/#ftcaseoption Case] or the [https://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] option.
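As a minimal sketch of this normalization, both of the following comparisons should yield <code>true</code>, because the fullwidth and the halfwidth spelling are mapped to the same token:

<syntaxhighlight lang="xquery">
(: Sketch: fullwidth ＸＭＬ and halfwidth XML are treated as the same word. :)
'ＸＭＬ' contains text 'XML' using language 'ja',
'XML' contains text 'ＸＭＬ' using language 'ja'
</syntaxhighlight>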
 
  
=Stemming=
  
 
Stemming in Japanese means that the results of the morphological analysis of "verbs" and "adjectives" are processed by means of their "prototype" (base form).

If the stemming option is enabled, for example, the two statements "私は本を書いた (I wrote the book)" and "私は本を書く (I write the book)" can be led back to the same prototype by analyzing their verbs:

<syntaxhighlight>
書く    動詞,自立,*,*,五段・カ行イ音便,基本形,[書く],カク,カク

書い    動詞,自立,*,*,五段・カ行イ音便,連用タ接続,[書く],カイ,カイ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
</syntaxhighlight>
  
 
Because the "auxiliary verb" is always excluded from the tokens, there is  
 
Because the "auxiliary verb" is always excluded from the tokens, there is  
no need to consider its use. Therefore, the same result (<code>true</code)
+
no need to consider its use. Therefore, the same result (<code>true</code>)
 
is returned for the following two types of queries:
 
is returned for the following two types of queries:
  
<pre class="brush:xquery">
+
<syntaxhighlight lang="xquery">
 
'私は本を書いた' contains text '書く' using stemming using language 'ja'
 
'私は本を書いた' contains text '書く' using stemming using language 'ja'
 
'私は本を書く' contains text '書いた' using stemming using language 'ja'
 
'私は本を書く' contains text '書いた' using stemming using language 'ja'
</pre>
+
</syntaxhighlight>
  
=Wildcards=
  
 
The wildcard option of XQuery Full Text is available for Japanese as well. The following example is based on '芥川 龍之介' (AKUTAGAWA, Ryunosuke), a prominent Japanese writer, whose first name is often also spelled "竜之介". The following two queries both return <code>true</code>:

<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '.之介' using wildcards using language 'ja'
'芥川竜之介' contains text '.之介' using wildcards using language 'ja'
</syntaxhighlight>
  
 
However, there is a special case that requires attention. The following query will yield <code>false</code>:

<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川.之介' using wildcards using language 'ja'
</syntaxhighlight>
  
 
This is because the word boundary next to the wildcard metacharacter cannot be determined in the query. In this case, you may insert an additional whitespace as a word boundary:

<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川 .之介' using wildcards using language 'ja'
</syntaxhighlight>
  
 
As an alternative, you may modify the query as follows:

<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川' ftand '.之介' using wildcards using language 'ja'
</syntaxhighlight>
