Difference between revisions of "Indexes"

From BaseX Documentation
Jump to navigation Jump to search
(Improve readability of Updates section)
Line 53: Line 53:
 
=Value Indexes=
 
=Value Indexes=
  
Value indexes can be optionally created and dropped by the user. The text and attribute index will be created by default.
+
Value indexes can be optionally created and dropped by the user. By default, the text and attribute index will be created, and it comprises all text and attribute nodes of a database.
  
 
==Text Index==
 
==Text Index==
Line 125: Line 125:
  
 
Matching text nodes can be directly requested from the index via the XQuery function [[Full-Text Module#ft:search|ft:search]]. The index contents can be accessed with [[Full-Text Module#ft:tokens|ft:tokens]].
 
Matching text nodes can be directly requested from the index via the XQuery function [[Full-Text Module#ft:search|ft:search]]. The index contents can be accessed with [[Full-Text Module#ft:tokens|ft:tokens]].
 +
 +
==Selective Indexing==
 +
 +
Since {{Version|8.3}}, value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the [[Options#TEXTINCLUDE|TEXTINCLUDE]], [[Options#ATTRINCLUDE|ATTRINCLUDE]] and [[Options#FTINCLUDE|FTINCLUDE]] options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:
 +
 +
* <code>*</code>: accept all names
 +
* <code>name</code>: accept all elements or attributes called <code>name</code> in the empty default namespace
 +
* <code>*:name</code>: accepts all elements or attributes called <code>name</code>; the namespace is ignored
 +
* <code>Q{uri}*</code>: accepts all elements or attributes in the <code>uri</code> namespace
 +
* <code>Q{uri}name</code>: accepts all elements or attributes called <code>name</code> in the <code>uri</code> namespace
 +
 +
The options can also be specified via the functions of the Database Module:
 +
 +
<pre class="brush:xquery">
 +
(: only index id and name attributes :)
 +
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '', map { 'attrinclude': 'id,name' })
 +
</pre>
  
 
==Index Construction==
 
==Index Construction==
Line 157: Line 174:
  
 
=Changelog=
 
=Changelog=
 +
 +
;Version 8.3
 +
 +
* Added: [[#Selective Indexing|Selective Indexing]]
  
 
;Version 8.0
 
;Version 8.0

Revision as of 15:02, 12 August 2015

This article is part of the Advanced User's Guide and introduces the available index structures, which are utilized by the query optimizer to rewrite expressions and speed up query evaluation.

Nearly all examples in this article are based on the factbook.xml document. To see how a query is rewritten, please turn on the Info View in the GUI or use the -V flag on the command line.

Structural Indexes

Structural indexes will always be present and cannot be dropped by the user:

Name Index

The name index contains all element and attribute names of a database, and the fixed-size index ids are stored in the main database table. If a database is updated, new names are automatically added. Furthermore, the index is enriched with statistical information, such as the distinct (categorical) or minimum and maximum values of its elements and attributes. The maximum number of categories to store per name can be changed via MAXCATS. The statistics are discarded after database updates and can be recreated with the OPTIMIZE command.

The name index is e.g. applied to pre-evaluate location steps that will never yield results:

 
(: will be rewritten to an empty sequence :)
/non-existing-name

The contents of the name indexes can be directly accessed with the XQuery functions index:element-names and index:attribute-names.

Path Index

The path index (also called path summary) stores all distinct paths of the documents in the database. It contains the same statistical information as the name index. The statistics are discarded after database updates and can be recreated with the OPTIMIZE command.

The path index is applied to rewrite descendant steps to multiple child steps. Child steps can be evaluated faster, as fewer nodes have to be accessed:

 
doc('factbook.xml')//province,
(: ...will be rewritten to... :)
doc('factbook.xml')/mondial/country/province

The paths statistics are e.g. used to pre-evaluate the count function:

 
(: will be rewritten and pre-evaluated by the path index :)
count( doc('factbook')//country )

The contents of the path index can be directly accessed with the XQuery function index:facets.

Resource Index

The resource index contains references to the pre values of all XML document nodes. It speeds up the access to specific documents in a database, and it will be automatically updated when updates are performed.

The following query will be sped up by the resource index:

 
db:open('DatabaseWithLotsOfDocuments')

Value Indexes

Value indexes can be optionally created and dropped by the user. By default, the text and attribute index will be created, and it comprises all text and attribute nodes of a database.

Text Index

Exact Queries

This index speeds up string-based equality tests on text nodes. The UPDINDEX option can be activated to keep this index up-to-date.

The following queries will all be rewritten for index access:

 
(: 1st example :)
//*[text() = 'Germany'],
(: 2nd example :)
doc('factbook.xml')//name[. = 'Germany'],
(: 3rd example :)
for $c in db:open('factbook')//country
where $c//city/name = 'Hanoi'
return $c/name

Matching text nodes can be directly requested from the index with the XQuery function db:text. The index contents can be accessed via index:texts.

Range Queries

The text index also supports range queries based on string comparisons:

 
(: 1st example :)
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
(: 2nd example :)
let $min := '2014-04-16T00:00:00'
let $max := '2014-04-19T23:59:59' 
return db:open('news')//entry[date-time > $min and date-time < $max]

Text nodes can be directly accessed from the index via the XQuery function db:text-range.

Please note that the current index structures do not support queries for numbers and dates.

Attribute Index

Similar to the text index, this index speeds up string-based equality and range tests on attribute values. The UPDINDEX option can be activated to keep this index up-to-date.

The following queries will all be rewritten for index access:

 
(: 1st example :)
//country[@car_code = 'J'],
(: 2nd example :)
//province[@* = 'Hokkaido']//name,
(: 3rd example :)
//sea[@depth > '2100' and @depth < '4000']

Matching text nodes can be directly requested from the index with the XQuery functions db:attribute and db:attribute-range. The index contents can be accessed with index:attributes.

Full-Text Index

The Full-Text index speeds up queries using the contains text expression. Internally, two index structures are provided: the default index sorts all keys alphabetically by their character length. It is particularly fast if fuzzy searches are performed. The second index is a compressed trie structure, which needs slightly more memory, but is specialized on wildcard searches. Both index structures will be merged in a future version of BaseX.

If the full-text index exists, the following queries will all be rewritten for index access:

 
(: 1st example :)
//country/name[text() contains text 'and'],
(: 2nd example :)
//religions[. contains text { 'Catholic', 'Roman' }
    using case insensitive distance at most 2 words]

Matching text nodes can be directly requested from the index via the XQuery function ft:search. The index contents can be accessed with ft:tokens.

Selective Indexing

Since Version 8.3, value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the TEXTINCLUDE, ATTRINCLUDE and FTINCLUDE options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:

  • *: accept all names
  • name: accept all elements or attributes called name in the empty default namespace
  • *:name: accepts all elements or attributes called name; the namespace is ignored
  • Q{uri}*: accepts all elements or attributes in the uri namespace
  • Q{uri}name: accepts all elements or attributes called name in the uri namespace

The options can also be specified via the functions of the Database Module:

 
(: only index id and name attributes :)
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '', map { 'attrinclude': 'id,name' })

Index Construction

If main memory runs out while creating a value index, the currently generated index structures will be partially written to disk and eventually merged. If the used memory heuristics fails for some reason (i.e., because multiple index operations run at the same time), fixed index split sizes may be chosen via the INDEXSPLITSIZE and FTINDEXSPLITSIZE options.

If DEBUG is set to true, and if a new database is created from the command line, the number of index operations will be output to standard output; this might help you to choose a proper split size. The following example shows how the output can look for a document with 111 MB and 128 MB of available main memory:

> basex -d -c"set ftindex; create db 111mb 111mb.xml"
Creating Database...
.... 8132.44 ms (17824 KB)
Indexing Text...
.. 979920 operations, 2913.78 ms (44 MB)
Indexing Attribute Values...
.. 381870 operations, 630.61 ms (21257 KB)
Indexing Full-Text...
..|| 3 splits, 12089347 operations, 16420.47 ms (36 MB)

The info string 3 splits indicates that three partial full-text index structures were written to disk, and the string 12089347 operations tells that the index construction consisted of approximately 12 mio index operations. If we set FTINDEXSPLITSIZE to the fixed value 4000000 (12 mio divided by three), or a smaller value, we should be able to build the index and circumvent the memory heuristics.

Updates

By default, index structures are discarded after an update operation. As a result, queries will be executed more slowly. There are different alternatives to cope with this:

  • After the execution of one or more update operations, call the OPTIMIZE command or the db:optimize function to rebuild the index structures. Multiple update operations will be performed faster, as the database meta data is only updated and regenerated when requested by the database users.
  • Enable the UPDINDEX option before creating or optimizing the database. As a result, text and attribute index structures will be incrementally updated after each database update. Please note that incremental updates are not available for the full-text index and database statistics. This is also the reason why the internal up-to-date flag of a database (which can e.g. be displayed via INFO DB or db:info) will be set to false until the next optimize call is triggered.
  • Enable the AUTOOPTIMIZE option before creating or optimizing the database. As a result, all outdated index structures and statistics will be recreated after each database update. Enable this option only for small or medium-sized databases.

Changelog

Version 8.3
Version 8.0
  • Added: AUTOOPTIMIZE option
Version 7.2.1
  • Added: string-based range queries