Changes

38 bytes added ,  13:56, 2 July 2020
This article is linked from the [[Full-Text]] page. It gives some insight into the implementation of the full-text features for Japanese text corpora. The Japanese version is [https://files.basex.org/etc/ja-ft.pdf also available as PDF]. The lexer was contributed by [http://blog.infinite.jp Toshio HIRAI].
=Introduction=
The lexical analysis of Japanese documents is performed by [https://igo.osdn.jp/ Igo]. Igo is a ''morphological analyzer'', and some of the advantages and reasons for using Igo are:

* It is compatible with the results of the prominent morphological analyzer "MeCab".
* It can use the dictionary distributed by the MeCab project.
* It is implemented in Java and is relatively fast.

Japanese tokenization will be activated in BaseX if Igo is found in the classpath. [https://en.osdn.net/projects/igo/releases/ igo-0.4.3.jar] of Igo is currently included in all distributions of BaseX.

In addition to the library, one of the following dictionary files must either be unzipped into the current directory, or into the <code>etc</code> sub-directory of the project’s [[Configuration#Home Directory|Home Directory]]:
* IPA Dictionary: https://files.basex.org/etc/ipadic.zip
* NAIST Dictionary: https://files.basex.org/etc/naistdic.zip
=Lexical Analysis=
=Token Processing=
"Fullwidth" and "Halfwidth" characters (as defined by the [https://unicode.org/Public/UNIDATA/EastAsianWidth.txt East Asian Width Properties]) are not distinguished (this is the so-called ZENKAKU/HANKAKU problem). For example, <code>ＸＭＬ</code> and <code>XML</code> will be treated as the same word. If documents are ''hybrid'', i.e. written in multiple languages, this is also helpful for some other options of the XQuery Full Text Specification, such as the [https://www.w3.org/TR/xpath-full-text-10/#ftcaseoption Case] or the [https://www.w3.org/TR/xpath-full-text-10/#ftdiacriticsoption Diacritics] option.
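As a minimal sketch of the width handling described above: the two queries below compare the full-width and half-width spellings of the same word in both directions. The input strings are illustrative and not taken from the original article. Assuming Japanese tokenization is active (Igo and a dictionary are available), both queries would be expected to match, because the two forms are treated as the same word:

<syntaxhighlight lang="xquery">
(: full-width input, half-width search term :)
'ＸＭＬ' contains text 'XML' using language 'ja'
(: half-width input, full-width search term :)
'XML' contains text 'ＸＭＬ' using language 'ja'
</syntaxhighlight>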
=Stemming=
is returned for the following two types of queries:
<syntaxhighlight lang="xquery">
'私は本を書いた' contains text '書く' using stemming using language 'ja'
'私は本を書く' contains text '書いた' using stemming using language 'ja'
</syntaxhighlight>
=Wildcards=
queries both return <code>true</code>:
<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '.之介' using wildcards using language 'ja'
'芥川竜之介' contains text '.之介' using wildcards using language 'ja'
</syntaxhighlight>
However, there is a special case that requires attention. The following
query will yield <code>false</code>:
<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川.之介' using wildcards using language 'ja'
</syntaxhighlight>
This is because the next word boundary cannot be determined in the query. In this case, you may insert an additional whitespace as a word boundary:
<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川 .之介' using wildcards using language 'ja'
</syntaxhighlight>
As an alternative, you may modify the query as follows:
<syntaxhighlight lang="xquery">
'芥川龍之介' contains text '芥川' ftand '.之介' using wildcards using language 'ja'
</syntaxhighlight>