Difference between revisions of "Full-Text"

From BaseX Documentation
Jump to navigation Jump to search
m (wikified)
Line 1: Line 1:
 +
 
==Fuzzy Querying==  
 
==Fuzzy Querying==  
 
In addition to the W3C XQFT Recommendation, BaseX supports fuzzy querying.
 
In addition to the W3C XQFT Recommendation, BaseX supports fuzzy querying.
Line 5: Line 6:
 
By default, the standard full-text index already supports the efficient
 
By default, the standard full-text index already supports the efficient
 
execution of fuzzy searches.
 
execution of fuzzy searches.
<blockquote>
+
 
<b>Document 'doc.xml'</b>:
+
'''Document 'doc.xml'''':
 
<pre> &lt;doc&gt;
 
<pre> &lt;doc&gt;
 
   &lt;a&gt;foo bar&lt;/a&gt;
 
   &lt;a&gt;foo bar&lt;/a&gt;
Line 13: Line 14:
 
  &lt;/doc&gt;
 
  &lt;/doc&gt;
 
</pre>  
 
</pre>  
<p><b>Command:</b> <code>create db doc.xml; create index fullext</code>  
+
'''Command:''' <code>create db doc.xml; create index fullext</code>  
</p>
+
<p><b>Query:</b> <code>xquery //a[text() contains text 'foo' using fuzzy]</code>  
+
'''Query:''' <code>xquery //a[text() contains text 'foo' using fuzzy]</code>  
</p>
+
<p><b>Result:</b> <code>&lt;a&gt;foo bar&lt;/a&gt; &lt;a&gt;foa bar&lt;/a&gt;</code>  
+
'''Result:''' <code>&lt;a&gt;foo bar&lt;/a&gt; &lt;a&gt;foa bar&lt;/a&gt;</code>  
</p>
+
</blockquote>
+
 
   
 
   
 
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed
 
Fuzzy search is based on the Levenshtein distance. The maximum number of allowed
Line 27: Line 28:
 
The query above yields two results as there is no error between the query term
 
The query above yields two results as there is no error between the query term
 
&quot;foo&quot; and the text node &quot;foo bar&quot;, and one error between
 
&quot;foo&quot; and the text node &quot;foo bar&quot;, and one error between
&quot;foo&quot; and &quot;foa bar&quot;.<br/>
+
&quot;foo&quot; and &quot;foa bar&quot;.  
 
   
 
   
 
==Query Evaluation==  
 
==Query Evaluation==  
Line 35: Line 36:
 
possible and useful. Three evaluation strategies are available: the standard sequential
 
possible and useful. Three evaluation strategies are available: the standard sequential
 
database scan, a full-text index based evaluation and a hybrid one, combining both strategies
 
database scan, a full-text index based evaluation and a hybrid one, combining both strategies
(see our <a target='_top' href='publications'>publications</a> for details).<br/><br/>
+
(see our <a target='_top' href='publications'>publications</a> for details).  
 
Query optimization and selection of the most efficient evaluation strategy is done
 
Query optimization and selection of the most efficient evaluation strategy is done
 
in a full-fledged automatic manner. The output of the query optimizer indicates which
 
in a full-fledged automatic manner. The output of the query optimizer indicates which
Line 51: Line 52:
 
indicates that the index does not yield any results for the specified term and
 
indicates that the index does not yield any results for the specified term and
 
is thus skipped. If index optimizations are missing, it sometimes helps to give
 
is thus skipped. If index optimizations are missing, it sometimes helps to give
the compiler a second chance and try different rewritings of the same query.<br/>
+
the compiler a second chance and try different rewritings of the same query.  
 
   
 
   
 
==Indexes==  
 
==Indexes==  
<p>To support a wide variety of scenarios, the available full-text index can
+
To support a wide variety of scenarios, the available full-text index can
 
handle different combinations of the match options in XQuery Full Text.
 
handle different combinations of the match options in XQuery Full Text.
 
By default, most indexing options are disabled. The GUI dialog for creating new databases
 
By default, most indexing options are disabled. The GUI dialog for creating new databases
 
allows to choose the available options. On the command-line, the <code>SET</code>  
 
allows to choose the available options. On the command-line, the <code>SET</code>  
command has to be used before the full-text index is created, either by</p>
+
command has to be used before the full-text index is created, either by  
 
<code> create index fulltext </code> or
 
<code> create index fulltext </code> or
 
<code> set ftindex on; create db FILENAME.xml</code>.
 
<code> set ftindex on; create db FILENAME.xml</code>.
 
   
 
   
<p>The following options are available:</p>
+
The following options are available:  
<ul>
+
 
  <li>Support Wildcards: a trie-based index can be applied to support wildcard searches (<code>SET WILDCARDS ON</code>)</li>
+
* Support Wildcards: a trie-based index can be applied to support wildcard searches (<code>SET WILDCARDS ON</code>)  
  <li>Stemming: tokens are stemmed with the Porter Stemmer before being indexed (<code>SET STEMMING ON</code>)</li>
+
* Stemming: tokens are stemmed with the Porter Stemmer before being indexed (<code>SET STEMMING ON</code>)  
  <li>Case Sensitive: tokens are indexed in case-sensitive mode (<code>SET CASESEND ON</code>)</li>
+
* Case Sensitive: tokens are indexed in case-sensitive mode (<code>SET CASESEND ON</code>)  
  <li>Diacritics: diacritics are indexed as well (<code>SET DIACRITICS ON</code>)</li>
+
* Diacritics: diacritics are indexed as well (<code>SET DIACRITICS ON</code>)  
  <li>TF/IDF Scoring: TF/IDF-based scoring values are calculated and stored in the index (<code>SET SCORING 1/2</code>; details see below)</li>
+
* TF/IDF Scoring: TF/IDF-based scoring values are calculated and stored in the index (<code>SET SCORING 1/2</code>; details see below)  
        <li> Stopwords: a stop word list can be defined to reduce the number of indexed tokens (<code>SET STOPWORDS FILENAME</code>)</li>
+
* Stopwords: a stop word list can be defined to reduce the number of indexed tokens (<code>SET STOPWORDS FILENAME</code>)  
</ul>
 
 
   
 
   
<p><b>Caution:</b> The index will only be applied if the activated options
 
are also specified in the query:</p>
 
 
   
 
   
<blockquote>
+
'''Caution:''' The index will only be applied if the activated options
<p><b>Index Options:</b> Case Sensitive, Stemming ON</p>
+
are also specified in the query:
<p><b>Query 1 (wrong):</b>
+
 +
 +
'''Index Options:''' Case Sensitive, Stemming ON  
 +
'''Query 1 (wrong):'''
 
<pre>//*[text() contains text 'inform']
 
<pre>//*[text() contains text 'inform']
 
</pre>  
 
</pre>  
</p>
+
<p><b>Query 2 (correct):</b>
+
'''Query 2 (correct):'''
 
<pre>//*[text() contains text 'inform' using case sensitive using stemming]
 
<pre>//*[text() contains text 'inform' using case sensitive using stemming]
 
</pre>  
 
</pre>  
</p>
+
<p><b>Query 3 (correct):</b>
+
'''Query 3 (correct):'''
 
<pre>declare ft-option using case sensitive using stemming;
 
<pre>declare ft-option using case sensitive using stemming;
 
//*[text() contains text 'inform']
 
//*[text() contains text 'inform']
 
</pre>  
 
</pre>  
</p>
+
</blockquote>
+
 
   
 
   
 
==Scoring==  
 
==Scoring==  
<p>The XQuery Full Text Recommendation allows for the usage of scoring models
+
The XQuery Full Text Recommendation allows for the usage of scoring models
 
and values within queries, with scoring being completely implementation defined.
 
and values within queries, with scoring being completely implementation defined.
 
BaseX offers an efficient internal scoring model which can be easily extended to
 
BaseX offers an efficient internal scoring model which can be easily extended to
Line 99: Line 100:
 
values within the full-text index structure. Three scoring types are currently
 
values within the full-text index structure. Three scoring types are currently
 
available, which can be adjusted with the <code>SCORING</code> property
 
available, which can be adjusted with the <code>SCORING</code> property
(Default: <code>SET SCORING 0</code>):</p>
+
(Default: <code>SET SCORING 0</code>):  
<ul>
+
  <li><code>0:</code> A standard algorithm is applied, which
+
*<code>0:</code> A standard algorithm is applied, which considers the length of a term and its frequency in a single text node. This algorithm is always applied if no index exists, or if the index cannot be applied in a query.  
    considers the length of a term and its frequency in a single text node.
+
*<code>1:</code> Standard TF/IDF algorithm, which treats document nodes as document units  
    This algorithm is always applied if no index exists, or if the index cannot be
+
*<code>2:</code> Each text node is treated as a document unit in the TF/IDF algorithm. This variant is recommendable for large XML files which only contain one document node  
    applied in a query.</li>
+
 
  <li><code>1:</code> Standard TF/IDF algorithm, which treats document nodes
 
    as document units</li>
 
        <li><code>2:</code> Each text node is treated as a document unit in the
 
  TF/IDF algorithm. This variant is recommendable for large XML files which
 
  only contain one document node</li>
 
</ul>
 
 
[[Category:XQuery]]
 
[[Category:XQuery]]
[[Category:Wikify]]
 

Revision as of 10:51, 12 December 2010

Fuzzy Querying

In addition to the W3C XQFT Recommendation, BaseX supports fuzzy querying. The XQFT grammar was enhanced by the FTMatchOption using fuzzy to allow for approximate searches in full texts. By default, the standard full-text index already supports the efficient execution of fuzzy searches.

Document 'doc.xml':

 <doc>
   <a>foo bar</a>
   <a>foa bar</a>
   <a>faa bar</a>
 </doc>

Command: create db doc.xml; create index fullext

Query: xquery //a[text() contains text 'foo' using fuzzy]

Result: <a>foo bar</a> <a>foa bar</a>


Fuzzy search is based on the Levenshtein distance. The maximum number of allowed errors is calculated by dividing the token length of a specified query term by 4, preserving a minimum of 1 errors. A static error distance can be set by adjusting the LSERR property (default: SET LSERR 0). The query above yields two results as there is no error between the query term "foo" and the text node "foo bar", and one error between "foo" and "foa bar".

Query Evaluation

BaseX offers different evaluation strategies for XQFT queries, the choice of which depends on the input data and the existence of a full text index. The query compiler tries to optimize and speed up queries by applying a full text index structure whenever possible and useful. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see our <a target='_top' href='publications'>publications</a> for details). Query optimization and selection of the most efficient evaluation strategy is done in a full-fledged automatic manner. The output of the query optimizer indicates which evaluation plan is chosen for a specific query. It can be inspected by activating verbose querying (Command: SET VERBOSE ON) or opening the Query Info in the GUI. The message

 Applying full-text index

suggests that the full-text index is applied to speed up query evaluation. A second message

 Removing path with no index results

indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.

Indexes

To support a wide variety of scenarios, the available full-text index can handle different combinations of the match options in XQuery Full Text. By default, most indexing options are disabled. The GUI dialog for creating new databases allows to choose the available options. On the command-line, the SET command has to be used before the full-text index is created, either by create index fulltext or set ftindex on; create db FILENAME.xml.

The following options are available:

  • Support Wildcards: a trie-based index can be applied to support wildcard searches (SET WILDCARDS ON)
  • Stemming: tokens are stemmed with the Porter Stemmer before being indexed (SET STEMMING ON)
  • Case Sensitive: tokens are indexed in case-sensitive mode (SET CASESEND ON)
  • Diacritics: diacritics are indexed as well (SET DIACRITICS ON)
  • TF/IDF Scoring: TF/IDF-based scoring values are calculated and stored in the index (SET SCORING 1/2; details see below)
  • Stopwords: a stop word list can be defined to reduce the number of indexed tokens (SET STOPWORDS FILENAME)


Caution: The index will only be applied if the activated options are also specified in the query:


Index Options: Case Sensitive, Stemming ON Query 1 (wrong):

//*[text() contains text 'inform']

Query 2 (correct):

//*[text() contains text 'inform' using case sensitive using stemming]

Query 3 (correct):

declare ft-option using case sensitive using stemming;
//*[text() contains text 'inform']


Scoring

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation defined. BaseX offers an efficient internal scoring model which can be easily extended to different application scenarios. Additionally, BaseX allows to store scoring values within the full-text index structure. Three scoring types are currently available, which can be adjusted with the SCORING property (Default: SET SCORING 0):

  • 0: A standard algorithm is applied, which considers the length of a term and its frequency in a single text node. This algorithm is always applied if no index exists, or if the index cannot be applied in a query.
  • 1: Standard TF/IDF algorithm, which treats document nodes as document units
  • 2: Each text node is treated as a document unit in the TF/IDF algorithm. This variant is recommendable for large XML files which only contain one document node