Indexes
This article is part of the Advanced User's Guide and introduces the available index structures. The query compiler tries to optimize and speed up queries by applying the index whenever it is possible and seems promising.
Most examples in this article are based on the factbook.xml document. To see how a query is rewritten, and if an index is used, please turn on the Info View in the GUI or use the -V flag on the command line:
- A message like
Applying text index for "Japan"
indicates that the text index is applied to speed up the search of the shown string. The following message… Removing path with no index results
indicates that a string in a path expression will never yield results. Because of that, the path does not need to be evaluated at all.- If you cannot find any index optimization hints in the info output, it often helps if you rewrite and simplify your query.
Contents
Structural Indexes
Structural indexes are automatically created and cannot be dropped by the user:
Name Index
The name index contains all element and attribute names of a database, and the fixed-size index ids are stored in the main database table. If a database is updated, new names are automatically added. Furthermore, the index is enriched with statistical information, such as the distinct (categorical) or minimum and maximum values of its elements and attributes. The maximum number of categories to store per name can be changed via MAXCATS. The statistics are discarded after database updates and can be recreated with the OPTIMIZE command.
The name index is e.g. applied to pre-evaluate location steps that will never yield results:
(: will be rewritten to an empty sequence :) /non-existing-name
The contents of the name indexes can be directly accessed with the XQuery functions index:element-names and index:attribute-names.
Path Index
The path index (also called path summary) stores all distinct paths of the documents in the database. It contains the same statistical information as the name index. The statistics are discarded after database updates and can be recreated with the OPTIMIZE command.
The path index is applied to rewrite descendant steps to multiple child steps. Child steps can be evaluated faster, as fewer nodes have to be accessed:
doc('factbook.xml')//province, (: ...will be rewritten to... :) doc('factbook.xml')/mondial/country/province
The paths statistics are e.g. used to pre-evaluate the count
function:
(: will be rewritten and pre-evaluated by the path index :) count( doc('factbook')//country )
The contents of the path index can be directly accessed with the XQuery function index:facets.
Document Index
The document index contains references to all document nodes in a database. Once documents with specific paths are requested, the index will be extended to also contain document paths.
The index generally speeds up access to single documents and database paths. It will always be kept up-to-date.
Value Indexes
Value indexes can be created and dropped by the user. Four types of values indexes are available: a text and attribute index, and an optional token and full-text index. By default, the text and attribute index will automatically be created.
In the GUI, index structures can be managed in the dialog windows for creating new databases or displaying the database properties. On command-line, the commands CREATE INDEX
and DROP INDEX
are used to create and drop index structures. With INFO INDEX
, you get some insight into the contents of an index structure, and SET
allows you to change the index defaults for new databases:
OPEN factbook; CREATE INDEX fulltext
: Open database; create full-text indexOPEN factbook; INFO INDEX TOKEN
: Open database; show info on token indexSET ATTRINDEX true; SET ATTRINCLUDE id name; CREATE DB factbook.xml
: Enable attribute index; only index 'id' and 'name' attributes; create database
With XQuery, index structures can be created and dropped via db:optimize:
(: Optimize specified database, create full-text index for texts of the specified elements :) db:optimize( 'factbook', false(), map { 'ftindex': true(), 'ftinclude': 'p div' } )
Text Index
Exact Queries
This index speeds up string-based equality tests on text nodes. The UPDINDEX option can be activated to keep this index up-to-date.
The following queries will all be rewritten for index access:
(: 1st example :) //*[text() = 'Germany'], (: 2nd example :) doc('factbook.xml')//name[. = 'Germany'], (: 3rd example :) for $c in db:open('factbook')//country where $c//city/name = 'Hanoi' return $c/name
Matching text nodes can be directly requested from the index with the XQuery function db:text. The index contents can be accessed via index:texts.
Range Queries
The text index also supports range queries based on string comparisons:
(: 1st example :) db:open('Library')//Medium[Year >= '2011' and Year <= '2016'], (: 2nd example :) let $min := '2014-04-16T00:00:00' let $max := '2014-04-19T23:59:59' return db:open('news')//entry[date-time > $min and date-time < $max]
Text nodes can directly be retrieved from the index via the XQuery function db:text-range.
Please note that the current index structures do not support queries for numbers and dates.
Attribute Index
Similar to the text index, this index speeds up string-based equality and range tests on attribute values. The UPDINDEX option can be activated to keep this index up-to-date.
The following queries will all be rewritten for index access:
(: 1st example :) //country[@car_code = 'J'], (: 2nd example :) //province[@* = 'Hokkaido']//name, (: 3rd example :) //sea[@depth > '2100' and @depth < '4000']
Attribute nodes can directly be retrieved from the index with the XQuery functions db:attribute and db:attribute-range. The index contents can be accessed with index:attributes.
Token Index
In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be used to access these entries.
Following queries such as the following ones will (soon) be rewritten for index access:
(: 1st example :) //div[contains-token(@class, 'row')], (: 2nd example :) //p[tokenize(@class) = 'row'], (: 3rd example :) doc('graph.xml')/idref('edge8')
Attributes with tokens can be directly retrieved from the index with the XQuery function db:token. The index contents can be accessed with index:tokens.
Full-Text Index
The Full-Text index contains the normalized tokens of text nodes of a document. It is utilized to speed up queries with the contains text
expression, and it is capable of processing wildcard and fuzzy search operations. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see XQuery Full Text implementation in BaseX).
If the full-text index exists, the following queries will all be rewritten for index access:
(: 1st example :) //country[name/text() contains text 'and'], (: 2nd example :) //religions[.//text() contains text { 'Catholic', 'Roman' } using case insensitive distance at most 2 words]
The index provides support for the following full-text features (the values can be changed in the GUI or via the SET
command):
- Stemming: tokens are stemmed before being indexed (see
STEMMING
) - Case Sensitive: tokens are indexed in case-sensitive mode (see
CASESENS
) - Diacritics: diacritics are indexed as well (see
DIACRITICS
) - Stopword List: a stop word list can be defined to reduce the number of indexed tokens (see
STOPWORDS
) - Language: see Languages for more details (see
LANGUAGE
)
The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overwritten if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is not explicitly disabled. This is demonstrated in the following Command Script:
<commands> <!-- Create database with stemmed full-text index --> <set option='stemming'>true</set> <set option='ftindex'>true</set> <create-db name='test-db'> <text>house</text> </create-db> <!-- Index access: Query term will be stemmed --> <xquery> /text[. contains text { 'houses' }] </xquery> <!-- Disable stemming (query will not be evaluated by the index) --> <xquery> /text[. contains text { 'houses' } using no stemming] </xquery> </commands>
Text nodes can be directly requested from the index via the XQuery function ft:search. The index contents can be accessed with ft:tokens.
Selective Indexing
Template:Mark TOKENINCLUDE
option added
Value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the TEXTINCLUDE, ATTRINCLUDE, TOKENINCLUDE and FTINCLUDE options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:
*
: all namesname
: elements or attributes calledname
, which are in the empty default namespace*:name
: elements or attributes calledname
, no matter which namespaceQ{uri}*
: all elements or attributes in theuri
namespaceQ{uri}name
: elements or attributes calledname
in theuri
namespace
The options can either be specified via the SET command or via XQuery. With the following operations, an attribute index is created for all id
and name
attributes:
- Commands
SET ATTRINCLUDE id,name CREATE DB factbook http://files.basex.org/xml/factbook.xml' # Restore default SET ATTRINCLUDE
- XQuery
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '', map { 'attrinclude': 'id,name' })
With OPTIMIZE ALL
and db:optimize, new selective indexing options can be assigned to an existing database.
Performance
If main memory runs out while creating a value index, the currently generated index structures will be partially written to disk and eventually merged. If the used memory heuristics fails for some reason (i.e., because multiple index operations run at the same time), fixed index split sizes may be chosen via the INDEXSPLITSIZE
and FTINDEXSPLITSIZE
options.
If debugging is enabled, and if a new database is created from the command line, the number of index operations will be output to standard output; this might help you to choose a proper split size. The following example shows how the output can look for a document with 111 MB and 128 MB of available main memory:
> basex -d -c"set ftindex; create db 111mb 111mb.xml" Creating Database... .... 8132.44 ms (17824 KB) Indexing Text... .. 979920 operations, 2913.78 ms (44 MB) Indexing Attribute Values... .. 381870 operations, 630.61 ms (21257 KB) Indexing Full-Text... ..|| 3 splits, 12089347 operations, 16420.47 ms (36 MB)
The info string 3 splits
indicates that three partial full-text index structures were written to disk, and the string 12089347 operations
tells that the index construction consisted of approximately 12 mio index operations. If we set FTINDEXSPLITSIZE
to the fixed value 4000000
(12 mio divided by three), or a smaller value, we should be able to build the index and circumvent the memory heuristics.
Updates
Updates in BaseX are generally very fast, because the index structures will be invalidated by updates. As a result, subsequent queries may be executed more slowly than before the update. There are different alternatives to cope with this:
- After the execution of one or more update operations, the OPTIMIZE command or the db:optimize function can be called to rebuild the index structures.
- The UPDINDEX option can be activated before creating or optimizing the database. As a result, the text and attribute index structures will be incrementally updated after each database update. Please note that incremental updates are not available for the token index, full-text index, and database statistics. This is also explains why the up-to-date flag, which is e.g. displayed via INFO DB or db:info, will be set to
false
until the database will be optimized again. - The AUTOOPTIMIZE option can be enabled before creating or optimizing the database. All outdated index structures and statistics will then be recreated after each database update. This option should only be used for small and medium-sized databases.
Changelog
- Version 8.4
- Added: Token Index
- Version 8.3
- Added: Selective Indexing
- Version 8.0
- Added: AUTOOPTIMIZE option
- Version 7.2.1
- Added: string-based range queries