Indexes

From BaseX Documentation

Jump to: navigation, search

This article is part of the XQuery Portal. It contains information on the available index structures.

The query compiler tries to optimize and speed up queries by applying the index whenever it is possible and seems promising. To see how a query is rewritten, and if an index is used, you can turn on the Info View in the GUI or use the -V flag on the command line:

Contents

[edit] Structural Indexes

Structural indexes are automatically created and cannot be dropped by the user:

[edit] Name Index

The name index contains references to the names of all elements and attributes in a database. It contains some basic statistical information, such as the number of occurrence of a name.

The name index is e.g. applied to discard location steps that will never yield results:

 
(: will be rewritten to an empty sequence :)
/non-existing-name

The contents of the name indexes can be directly accessed with the XQuery functions index:element-names and index:attribute-names.

If a database is updated, new names will be added incrementally, but the statistical information will get out-dated.

[edit] Path Index

The path index (which is also called path summary or data guide) stores all distinct paths of the documents in the database. It contains additional statistical information, such as the number of occurrence of a path, its distinct string values, and the minimum/maximum of numeric values. The maximum number of distinct values to store per name can be changed via MAXCATS.

Since Version 8.6, the distinct values are also stored for elements and attributes of numeric type.

Various queries will be evaluated much faster if an up-to-date path index is available (as can be observed when opening the Info View):

 
doc('factbook.xml')//province,
(: ...will be rewritten to... :)
doc('factbook.xml')/mondial/country/province
 
count(doc('factbook')//country)
 
distinct-values(db:open('factbook')//religions)

The contents of the path index can be directly accessed with the XQuery function index:facets.

If a database is updated, the statistics in the path index will be invalidated.

[edit] Document Index

The document index contains references to all document nodes in a database. Once documents with specific paths are requested, the index will be extended to also contain document paths.

The index generally speeds up access to single documents and database paths. It will always be kept up-to-date.

[edit] Value Indexes

Value indexes can be created and dropped by the user. Four types of values indexes are available: a text and attribute index, and an optional token and full-text index. By default, the text and attribute index will automatically be created.

In the GUI, index structures can be managed in the dialog windows for creating new databases or displaying the database properties. On command-line, the commands CREATE INDEX and DROP INDEX are used to create and drop index structures. With INFO INDEX, you get some insight into the contents of an index structure, and SET allows you to change the index defaults for new databases:

With XQuery, index structures can be created and dropped via db:optimize:

 
(: Optimize specified database, create full-text index for texts of the specified elements :)
db:optimize(
  'factbook',
  false(),
  map { 'ftindex': true(), 'ftinclude': 'p div' }
)

[edit] Text Index

[edit] Exact Queries

This index speeds up string-based equality tests on text nodes. The following queries will all be rewritten for index access:

 
(: 1st example :)
//*[text() = 'Germany'],
(: 2nd example :)
doc('factbook.xml')//name[. = 'Germany'],
(: 3rd example :)
for $c in db:open('factbook')//country
where $c//city/name = 'Hanoi'
return $c/name

Matching text nodes can be directly requested from the index with the XQuery function db:text. The index contents can be accessed via index:text.

The UPDINDEX option can be activated to keep this index up-to-date.

[edit] Range Queries

The text index also supports range queries based on string comparisons:

 
(: 1st example :)
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
(: 2nd example :)
let $min := '2014-04-16T00:00:00'
let $max := '2014-04-19T23:59:59' 
return db:open('news')//entry[date-time > $min and date-time < $max]

Text nodes can directly be retrieved from the index via the XQuery function db:text-range.

Please note that the current index structures do not support queries for numbers and dates.

[edit] Attribute Index

Similar to the text index, this index speeds up string-based equality and range tests on attribute values. Additionally, the XQuery function fn:id takes advantage of the index whenever possible. The following queries will all be rewritten for index access:

 
(: 1st example :)
//country[@car_code = 'J'],
(: 2nd example :)
//province[@* = 'Hokkaido']//name,
(: 3rd example :)
//sea[@depth > '2100' and @depth < '4000']
(: 4th example :)
fn:id('f0_119', db:open('factbook'))

Attribute nodes can directly be retrieved from the index with the XQuery functions db:attribute and db:attribute-range. The index contents can be accessed with index:attributes.

The UPDINDEX option can be activated to keep this index up-to-date.

[edit] Token Index

In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be created to speed up the retrieval of these tokens. The XQuery functions fn:contains-token, fn:tokenize and fn:idref are rewritten for index access whenever possible. If a token index exists, it will e.g. be utilized for the following queries:

 
(: 1st example :)
//div[contains-token(@class, 'row')],
(: 2nd example :)
//p[tokenize(@class) = 'row'],
(: 3rd example :)
doc('graph.xml')/idref('edge8')

Attributes with tokens can be directly retrieved from the index with the XQuery function db:token. The index contents can be accessed with index:tokens.

[edit] Full-Text Index

The Full-Text index contains the normalized tokens of text nodes of a document. It is utilized to speed up queries with the contains text expression, and it is capable of processing wildcard and fuzzy search operations. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see XQuery Full Text implementation in BaseX).

If the full-text index exists, the following queries will all be rewritten for index access:

 
(: 1st example :)
//country[name/text() contains text 'and'],
(: 2nd example :)
//religions[.//text() contains text { 'Catholic', 'Roman' }
    using case insensitive distance at most 2 words]

The index provides support for the following full-text features (the values can be changed in the GUI or via the SET command):

The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overwritten if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is not explicitly disabled. This is demonstrated in the following Command Script:

<commands>
  <!-- Create database with stemmed full-text index -->
  <set option='stemming'>true</set>
  <set option='ftindex'>true</set>
  <create-db name='test-db'> <text>house</text> </create-db>
  <!-- Index access: Query term will be stemmed -->
  <xquery> /text[. contains text { 'houses' }] </xquery>
  <!-- Disable stemming (query will not be evaluated by the index) -->
  <xquery> /text[. contains text { 'houses' } using no stemming] </xquery>
</commands>

Text nodes can be directly requested from the index via the XQuery function ft:search. The index contents can be accessed with ft:tokens.

[edit] Selective Indexing

Value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the TEXTINCLUDE, ATTRINCLUDE, TOKENINCLUDE and FTINCLUDE options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:

The options can either be specified via the SET command or via XQuery. With the following operations, an attribute index is created for all id and name attributes:

Commands
 
SET ATTRINCLUDE id,name
CREATE DB factbook http://files.basex.org/xml/factbook.xml'
# Restore default
SET ATTRINCLUDE
XQuery
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '',
  map { 'attrinclude': 'id,name' })

With CREATE INDEX and db:optimize, new selective indexing options will ba applied to an existing database.

[edit] Custom Index Structures

With XQuery, it is comparatively easy to create your own, custom index structures. The following query demonstrate how you can create a factbook-index database, which contains all texts of the original database in lower case:

let $db := 'factbook'

let $index := <index>{
  for $nodes in db:open($db)//text()
  group by $text := lower-case($nodes)
  return <text string='{ $text }'>{
    for $node in $nodes
    return <id>{ db:node-id($node ) }</id>
  }</text>
}</index>

return db:create($db || '-index', $index, $db || '-index.xml')

In the following query, a text string is searched, and the text nodes of the original database are retrieved:

let $db := 'factbook'
let $text := 'italian'
for $id in db:open($db || '-index')//*[@string = $text]/id
return db:open-id($db, $id)/..

With some extra effort, and if UPDINDEX is enabled for both your original and your index database (see below), your index database will support updates as well (try it, it’s fun!).

[edit] Performance

If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged. If the memory heuristics fail for some reason (i.e., because multiple index operations run at the same time, or because the applied JVM does not support explicit garbage collections), a fixed index split sizes may be chosen via the SPLITSIZE option.

If DEBUG is enabled, the command-line output might help you to find a good split size. The following example shows the output for creating a database for an XMark document with 1 GB, and with 128 MB assigned to the JVM:

> basex -d -c"SET FTINDEX ON; SET TOKENINDEX ON; CREATE DB xmark 1gb.xml"
Creating Database...
................................ 76559.99 ms (29001 KB)
Indexing Text...
....|...|...|.....|. 9.81 M operations, 18576.92 ms (13523 KB). Recommended SPLITSIZE: 20.
Indexing Attribute Values...
.........|....... 3.82 M operations, 7151.77 ms (6435 KB). Recommended SPLITSIZE: 20.
Indexing Tokens...
.......|..|.....|.. 3.82 M operations, 9636.73 ms (10809 KB). Recommended SPLITSIZE: 10.
Indexing Full-Text...
..|.|.|.|...|...|..|.|..| 116.33 M operations, 138740.94 ms (106 MB). Recommended SPLITSIZE: 12.

The output can be interpreted as follows:

[edit] Updates

Generally, update operations are very fast in BaseX. By default, the index structures will be invalidated by updates; as a result, queries that benefit from index structures may slow down after updates. There are different alternatives to cope with this:

[edit] Changelog

Version 8.4
Version 8.4
Version 8.3
Version 8.0
Version 7.2.1
Personal tools
Namespaces
Variants
Actions
Navigation
Print/export