Difference between revisions of "Indexes"

From BaseX Documentation
Jump to navigation Jump to search
Line 1: Line 1:
This article is part of the [[Advanced User's Guide]] and introduces the available index structures, which may speed up querying by orders of magnitudes.
+
This article is part of the [[Advanced User's Guide]] and introduces the available index structures, which are utilized by the query optimizer to rewrite expressions and speed up query evaluation.
  
==Index Structures==
+
The examples in this article are based on the [http://files.basex.org/xml/factbook.xml factbook.xml] document. To see how a query is rewritten, please turn on the [[GUI#Visualizations|Info View]] in the GUI or use the [[Startup_Options#BaseX_Standalone|-V flag]] on command line.
 
Currently, the following index structures exist in BaseX:
 
  
===Structural Indexes===
+
=Structural Indexes=
  
 
Structural indexes will always be present and cannot be dropped by the user:
 
Structural indexes will always be present and cannot be dropped by the user:
  
* '''Tag/Attribute Name Index'''
+
==Name Index==
: All element and attribute names are automatically indexed and enriched with statistical information.
 
  
* '''Path Summary'''
+
The name index contains all element and attribute names of a database, and the fixed-size index ids are stored in the main database table. If a database is updated, new names are automatically added. The index is further enriched with statistical information, which will get out-of-dated after new updates.
: Unique paths in a document or collection are referenced by the path index, which is applied e.g. to rewrite descendant to more specific child steps.
 
  
* '''Document Index'''
+
The name index is applied to pre-evaluate location steps that will never yield results:
: This index caches references to all document nodes in a database. It provides fast access to single documents in large database instances.
 
  
===Value Indexes===
+
<pre class="brush:xquery">
 +
(: will be rewritten to an empty sequence :)
 +
/non-existing-name
 +
</pre>
  
Value indexes can be dropped and created by the user:
+
==Path Index==
  
* '''Text Index'''
+
The path index (also called ''path summary'') stores all distinct paths of the documents in the database. It also contains additional statistical information. Currently, the index will get out-of-dated after new updates.
: This index speeds up equality tests and simple range queries on text nodes in XPath location steps with predicates.
 
  
* '''Attribute Index'''
+
The path index is applied to rewrite descendant steps to multiple child steps. Child steps can be evaluated faster, as less nodes have to be accessed:
: This index speeds up equality tests and simple range queries on attribute value in XPath location steps with predicates.
 
  
* '''[[Full-Text|Full-Text Index]]'''
+
<pre class="brush:xquery">
: This index speeds up queries using the {{Mono|contains text}} keyword. Internally, BaseX handles two different index structures: the default index sorts all keys alphabetically by their character length. It is particularly fast if fuzzy searches are performed. The second index is a compressed trie structure, which needs slightly more memory, but is specialized on wildcard searches.
+
doc('factbook.xml')//province,
 +
(: ...will be rewritten to... :)
 +
doc('factbook.xml')/mondial/country/province
 +
</pre>
  
With {{Mark|Version 7.1}}, a new database option was introduced to support [[Options#UPDINDEX|incremental indexing]] of texts and attributes.
+
The paths statistics are e.g. used to pre-evaluate the {{Mono|count()}} function:
 +
 
 +
<pre class="brush:xquery">
 +
(: will be rewritten and pre-evaluated by the path index :)
 +
count( doc('factbook')//country )
 +
</pre>
 +
 
 +
==Document Index==
 +
 
 +
The document index contains references to the {{Mono|pre}} values of all document nodes. It speeds up the access to specific documents in a database, and it will be automatically updated when updates are performed.
 +
 
 +
The following query will be sped up by the document index:
 +
 
 +
<pre class="brush:xquery">
 +
db:open('DatabaseWithLotsOfDocuments')
 +
</pre>
 +
 
 +
=Value Indexes=
 +
 
 +
Value indexes can be optionally created and dropped by the user. The text and attribute index will be created by default:
 +
 
 +
==Text Index==
 +
 
 +
This index speeds up string-based equality tests and (since {{Version|7.2.1}}) range queries on text nodes. The [[Options#UPDINDEX|UPDINDEX]] option can be activated to keep this index up-to-date.
 +
 
 +
The following queries will all be rewritten for index access:
 +
 
 +
<pre class="brush:xquery">
 +
//*[text() = 'Germany'],
 +
doc('factbook.xml')//name[. = 'Germany'],
 +
db:open('factbook')//country[.//city/name = 'Hanoi']/name
 +
</pre>
 +
 
 +
==Attribute Index==
 +
 
 +
Similar to the text index, this index speeds up string-based tests on attribute values. The [[Options#UPDINDEX|UPDINDEX]] option can be activated to keep this index up-to-date.
 +
 
 +
The following queries will all be rewritten for index access:
 +
 
 +
<pre class="brush:xquery">
 +
//country[@car_code = 'J'],
 +
//province[@* = 'Hokkaido']//name,
 +
//sea[@depth > '2100' and @depth < '4000']
 +
</pre>
 +
 
 +
==Full-Text Index==
 +
 
 +
The [[Full-Text]] index speeds up queries using the {{Mono|contains text}} expression. Internally, two index structures are provided: the default index sorts all keys alphabetically by their character length. It is particularly fast if fuzzy searches are performed. The second index is a compressed trie structure, which needs slightly more memory, but is specialized on wildcard searches. Both index structures will be merged in a future version of BaseX.
  
==Example Queries==
 
 
 
The following queries are examples for expressions that will be optimized for index access (provided that the relevant index exists in a particular database):
 
The following queries are examples for expressions that will be optimized for index access (provided that the relevant index exists in a particular database):
 
===Name/Path Index===
 
* {{Mono|//address}} is rewritten to {{Mono|addressbook/address}} if all {{Mono|address}} elements have an {{Mono|addressbook}} element as their only ancestor.
 
* {{Mono|/non-existing-name}} is rewritten to an empty sequence
 
  
===Text Index===
+
If the full-text index exists, the following queries will all be rewritten for index access:
* <code>//node()[text() = 'Usability']</code>
 
* <code>//div[p = 'Usability' or p = 'Testing']</code>
 
* <code>path/to/relevant[text() = 'Usability Testing']/and/so/on</code>
 
  
===Attribute Index===
+
<pre class="brush:xquery">  
* <code>//node()[@align = 'right']</code>
+
//country/name[text() contains text 'and'],
* <code>descendant::elem[@id = '1']</code>
+
//religions[. contains text { 'Catholic', 'Roman' }
* <code>range/query[@id &gt;= 1 and @id &lt;= 5]</code>
+
    using case insensitive distance at most 2 words]
+
</pre>  
===Full-Text Index===
 
* <code>//node[text() contains text 'Usability']</code>
 
* <code>//node[text() contains text 'Usebiliti' using fuzzy]</code>
 
* <code>//book[chapter contains text ('web' ftor 'WWW' using no stemming) ftand 'diversity' using stemming distance at most 5 words]</code>
 
  
 
[[Category:Internals]]
 
[[Category:Internals]]

Revision as of 15:10, 17 April 2012

This article is part of the Advanced User's Guide and introduces the available index structures, which are utilized by the query optimizer to rewrite expressions and speed up query evaluation.

The examples in this article are based on the factbook.xml document. To see how a query is rewritten, please turn on the Info View in the GUI or use the -V flag on command line.

Structural Indexes

Structural indexes will always be present and cannot be dropped by the user:

Name Index

The name index contains all element and attribute names of a database, and the fixed-size index ids are stored in the main database table. If a database is updated, new names are automatically added. The index is further enriched with statistical information, which will get out-of-dated after new updates.

The name index is applied to pre-evaluate location steps that will never yield results:

 
(: will be rewritten to an empty sequence :)
/non-existing-name

Path Index

The path index (also called path summary) stores all distinct paths of the documents in the database. It also contains additional statistical information. Currently, the index will get out-of-dated after new updates.

The path index is applied to rewrite descendant steps to multiple child steps. Child steps can be evaluated faster, as less nodes have to be accessed:

 
doc('factbook.xml')//province,
(: ...will be rewritten to... :)
doc('factbook.xml')/mondial/country/province

The paths statistics are e.g. used to pre-evaluate the count() function:

 
(: will be rewritten and pre-evaluated by the path index :)
count( doc('factbook')//country )

Document Index

The document index contains references to the pre values of all document nodes. It speeds up the access to specific documents in a database, and it will be automatically updated when updates are performed.

The following query will be sped up by the document index:

 
db:open('DatabaseWithLotsOfDocuments')

Value Indexes

Value indexes can be optionally created and dropped by the user. The text and attribute index will be created by default:

Text Index

This index speeds up string-based equality tests and (since Version 7.2.1) range queries on text nodes. The UPDINDEX option can be activated to keep this index up-to-date.

The following queries will all be rewritten for index access:

 
//*[text() = 'Germany'],
doc('factbook.xml')//name[. = 'Germany'],
db:open('factbook')//country[.//city/name = 'Hanoi']/name

Attribute Index

Similar to the text index, this index speeds up string-based tests on attribute values. The UPDINDEX option can be activated to keep this index up-to-date.

The following queries will all be rewritten for index access:

 
//country[@car_code = 'J'],
//province[@* = 'Hokkaido']//name,
//sea[@depth > '2100' and @depth < '4000']

Full-Text Index

The Full-Text index speeds up queries using the contains text expression. Internally, two index structures are provided: the default index sorts all keys alphabetically by their character length. It is particularly fast if fuzzy searches are performed. The second index is a compressed trie structure, which needs slightly more memory, but is specialized on wildcard searches. Both index structures will be merged in a future version of BaseX.

The following queries are examples for expressions that will be optimized for index access (provided that the relevant index exists in a particular database):

If the full-text index exists, the following queries will all be rewritten for index access:

 
//country/name[text() contains text 'and'],
//religions[. contains text { 'Catholic', 'Roman' }
    using case insensitive distance at most 2 words]