Changes

Full-Text (edit)

Revision as of 11:51, 12 December 2010

311 bytes removed , 11:51, 12 December 2010

m

wikified

==Fuzzy Querying==

In addition to the W3C XQFT Recommendation, BaseX supports fuzzy querying. By default, the standard full-text index already supports the efficient execution of fuzzy searches.

'''Document 'doc.xml':'''

</doc>

</pre>

'''Command:''' <code>create db doc.xml; create index fulltext</code>

'''Query:''' <code>xquery //a[text() contains text 'foo' using fuzzy]</code>

'''Result:''' <code><a>foo bar</a> <a>foa bar</a></code>

Fuzzy search is based on the Levenshtein distance. The maximum number of allowed

The query above yields two results, as there is no error between the query term "foo" and the text node "foo bar", and one error between "foo" and "foa bar".
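The error counts in this example can be reproduced with a small Levenshtein sketch. This is an illustration only, not BaseX's implementation: the function names, the whitespace tokenization, and the <code>max_errors</code> threshold of 1 are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def fuzzy_contains(text: str, term: str, max_errors: int = 1) -> bool:
    """True if any whitespace-separated token of text is within max_errors of term.

    max_errors is a hypothetical threshold chosen for this sketch.
    """
    return any(levenshtein(token, term) <= max_errors for token in text.split())


# The two text nodes from the example, plus one node that should not match.
nodes = ["foo bar", "foa bar", "baz bar"]
matches = [n for n in nodes if fuzzy_contains(n, "foo")]
print(matches)
```

With a threshold of one error, "foo bar" matches with distance 0 and "foa bar" with distance 1, while "baz bar" is rejected.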

==Query Evaluation==

possible and useful. Three evaluation strategies are available: the standard sequential database scan, an evaluation based on the full-text index, and a hybrid one that combines both strategies (see our <a target='_top' href='publications'>publications</a> for details).

Query optimization and the selection of the most efficient evaluation strategy are fully automatic. The output of the query optimizer indicates which

indicates that the index does not yield any results for the specified term and is thus skipped. If index optimizations are missing, it sometimes helps to give the compiler a second chance and try different rewritings of the same query.
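The difference between a sequential scan and an index-based evaluation can be pictured with a toy inverted index. This is a simplified sketch, not the BaseX optimizer: all function names are hypothetical, tokens are split on whitespace, and the hybrid strategy is not modeled.

```python
from collections import defaultdict


def build_index(nodes):
    """Map each lower-cased token to the positions of the text nodes containing it."""
    index = defaultdict(set)
    for pos, text in enumerate(nodes):
        for token in text.lower().split():
            index[token].add(pos)
    return dict(index)


def sequential_scan(nodes, term):
    """Visit every text node and test it against the term."""
    return [pos for pos, text in enumerate(nodes) if term in text.lower().split()]


def evaluate(nodes, term, index=None):
    """Return (strategy, matching positions) for a single-term query."""
    if index is None:
        return "scan", sequential_scan(nodes, term)
    if term not in index:
        # The index proves there are no hits, so no node has to be visited.
        return "index (no results, skipped)", []
    return "index", sorted(index[term])


nodes = ["foo bar", "foa bar"]
index = build_index(nodes)
print(evaluate(nodes, "bar", index))
print(evaluate(nodes, "qux", index))
print(evaluate(nodes, "foo"))
```

The second call mirrors the case described above: the index yields nothing for the term, so the evaluation is skipped without touching any node.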

==Indexes==

To support a wide variety of scenarios, the available full-text index can handle different combinations of the match options in XQuery Full Text. By default, most indexing options are disabled. The GUI dialog for creating new databases lets you choose among the available options. On the command line, the <code>SET</code> command has to be used before the full-text index is created, either by <code>create index fulltext</code> or <code>set ftindex on; create db FILENAME.xml</code>.

The following options are available:
* Support Wildcards: a trie-based index can be applied to support wildcard searches (<code>SET WILDCARDS ON</code>)
* Stemming: tokens are stemmed with the Porter Stemmer before being indexed (<code>SET STEMMING ON</code>)
* Case Sensitive: tokens are indexed in case-sensitive mode (<code>SET CASESENS ON</code>)
* Diacritics: diacritics are indexed as well (<code>SET DIACRITICS ON</code>)
* TF/IDF Scoring: TF/IDF-based scoring values are calculated and stored in the index (<code>SET SCORING 1/2</code>; see below for details)
* Stopwords: a stop word list can be defined to reduce the number of indexed tokens (<code>SET STOPWORDS FILENAME</code>)

'''Caution:''' The index will only be applied if the activated options are also specified in the query:

'''Index Options:''' Case Sensitive, Stemming ON

'''Query 1 (wrong):'''

<pre>//*[text() contains text 'inform']
</pre>

'''Query 2 (correct):'''

<pre>//*[text() contains text 'inform' using case sensitive using stemming]
</pre>

'''Query 3 (correct):'''

<pre>declare ft-option using case sensitive using stemming;
//*[text() contains text 'inform']
</pre>
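One way to see why the options have to agree: index keys are produced by normalizing tokens according to the index options, so a query that normalizes differently looks up keys the index does not store in that form. The sketch below is an assumption-laden illustration, not BaseX code. The helper names are hypothetical, and only case sensitivity and diacritics are modeled (stemming is left out).

```python
import unicodedata


def normalize(token, case_sensitive=False, diacritics=False):
    """Normalize a token the way a (hypothetical) index with these options would."""
    if not diacritics:
        # Decompose characters and drop combining marks, e.g. "é" becomes "e".
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(c for c in decomposed if not unicodedata.combining(c))
    if not case_sensitive:
        token = token.lower()
    return token


def build_index(tokens, **options):
    """Map each normalized token to the positions where it occurs."""
    index = {}
    for pos, token in enumerate(tokens):
        index.setdefault(normalize(token, **options), []).append(pos)
    return index


tokens = ["Inform", "inform", "INFORM"]
cs_index = build_index(tokens, case_sensitive=True)  # index with Case Sensitive ON
ci_index = build_index(tokens)                       # index with default options

# A case-insensitive query for "inform" should hit all three tokens.
print(ci_index.get("inform"))  # all three positions
print(cs_index.get("inform"))  # only the exact-case occurrence
```

A plain lookup in the case-sensitive index finds only one of the three expected hits, which is why a query without the matching options cannot safely be answered from that index.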

==Scoring==

The XQuery Full Text Recommendation allows for the usage of scoring models and values within queries, with scoring being completely implementation-defined.

BaseX offers an efficient internal scoring model which can be easily extended to

values within the full-text index structure. Three scoring types are currently available, which can be adjusted with the <code>SCORING</code> property (Default: <code>SET SCORING 0</code>):
* <code>0</code>: A standard algorithm is applied, which considers the length of a term and its frequency in a single text node. This algorithm is always applied if no index exists, or if the index cannot be applied in a query.
* <code>1</code>: Standard TF/IDF algorithm, which treats document nodes as document units.
* <code>2</code>: Each text node is treated as a document unit in the TF/IDF algorithm. This variant is recommended for large XML files that contain only one document node.
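The effect of the document unit in variants 1 and 2 can be sketched with the textbook TF/IDF formula, tf × log(N / df). The exact formula BaseX uses is not given here, so the function below is an assumption for illustration, as are the sample data and the whitespace tokenization.

```python
import math


def tf_idf(term, unit, units):
    """Textbook TF/IDF of term in one unit, relative to all units."""
    tokens = unit.lower().split()
    if not tokens:
        return 0.0
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for u in units if term in u.lower().split())
    if df == 0:
        return 0.0
    return tf * math.log(len(units) / df)


# Two documents, given as lists of their text nodes (made-up sample data).
documents = [["foo bar", "foo foo"], ["bar baz"]]

# Variant 1: every document node is one unit.
doc_units = [" ".join(text_nodes) for text_nodes in documents]
# Variant 2: every text node is one unit.
node_units = [text for text_nodes in documents for text in text_nodes]

print(tf_idf("foo", doc_units[0], doc_units))
print(tf_idf("foo", node_units[1], node_units))

# A file with a single document node: N = df = 1, so the IDF factor is log(1) = 0.
single_doc = [" ".join(node_units)]
print(tf_idf("foo", single_doc[0], single_doc))
```

With only one document node, every term occurs in the single unit and the IDF factor vanishes, so all scores collapse to zero. Treating text nodes as units avoids this, which matches the recommendation for variant 2.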

[[Category:XQuery]]


Michael

