Difference between revisions of "Indexes"

From BaseX Documentation
Jump to navigation Jump to search
m (Editorial-type tweaks)
 
(94 intermediate revisions by 5 users not shown)
Line 1: Line 1:
This article is part of the [[Advanced User's Guide]] and introduces the available index structures, which are utilized by the query optimizer to rewrite expressions and speed up query evaluation.
+
This article is part of the [[XQuery|XQuery Portal]]. It contains information on the available index structures.
  
Nearly all examples in this article are based on the [http://files.basex.org/xml/factbook.xml factbook.xml] document. To see how a query is rewritten, please turn on the [[GUI#Visualizations|Info View]] in the GUI or use the [[Command-Line Options#BaseX_Standalone|-V flag]] on the command line.
+
The query compiler tries to optimize and speed up queries by applying the index whenever it is possible and seems promising. To see how a query is rewritten, and if an index is used, you can turn on the [[GUI#Visualizations|Info View]] in the GUI or use the [[Command-Line Options#BaseX_Standalone|-V flag]] on the command line:
 +
 
 +
* A message like <code>apply text index for "Japan"</code> indicates that the text index is applied to speed up the search of the shown string. The following message…
 +
* <code>no index results</code> indicates that a string in a path expression will never yield results. Because of that, the path does not need to be evaluated at all.
 +
* If you cannot find any index optimization hints in the info output, it often helps if you rewrite and simplify your query.
 +
 
 +
Additional examples for index rewritings are presented in our article on [[XQuery Optimizations]].
  
 
=Structural Indexes=
 
=Structural Indexes=
  
Structural indexes will always be present and cannot be dropped by the user:
+
Structural indexes are automatically created and cannot be dropped by the user:
  
 
==Name Index==
 
==Name Index==
  
The name index contains all element and attribute names of a database, and the fixed-size index ids are stored in the main database table. If a database is updated, new names are automatically added. Furthermore, the index is enriched with statistical information, such as the distinct (categorical) or minimum and maximum values of its elements and attributes. The maximum number of categories to store per name can be changed via [[Options#MAXCATS|MAXCATS]]. The statistics are discarded after database updates and can be recreated with the [[Commands#OPTIMIZE|OPTIMIZE]] command.
+
The name index contains references to the names of all elements and attributes in a database. It contains some basic statistical information, such as the number of occurrence of a name.
  
The name index is e.g. applied to pre-evaluate location steps that will never yield results:
+
The name index is e.g. applied to discard location steps that will never yield results:
  
<pre class="brush:xquery">  
+
<syntaxhighlight lang="xquery">
 
(: will be rewritten to an empty sequence :)
 
(: will be rewritten to an empty sequence :)
 
/non-existing-name
 
/non-existing-name
</pre>
+
</syntaxhighlight>
  
 
The contents of the name indexes can be directly accessed with the XQuery functions [[Index Module#index:element-names|index:element-names]] and [[Index Module#index:attribute-names|index:attribute-names]].
 
The contents of the name indexes can be directly accessed with the XQuery functions [[Index Module#index:element-names|index:element-names]] and [[Index Module#index:attribute-names|index:attribute-names]].
 +
 +
If a database is updated, new names will be added incrementally, but the statistical information will get out-dated.
  
 
==Path Index==
 
==Path Index==
  
The path index (also called ''path summary'') stores all distinct paths of the documents in the database. It contains the same statistical information as the name index. The statistics are discarded after database updates and can be recreated with the [[Commands#OPTIMIZE|OPTIMIZE]] command.
+
The path index (which is also called ''path summary'' or ''data guide'') stores all distinct paths of the documents in the database. It contains additional statistical information, such as the number of occurrence of a path, its distinct string values, and the minimum/maximum of numeric values. The maximum number of distinct values to store per name can be changed via {{Option|MAXCATS}}. Distinct values are also stored for elements and attributes of numeric type.
 +
 
 +
Various queries will be evaluated much faster if an up-to-date path index is available (as can be observed when opening the [[GUI#Visualizations|Info View]]):
  
The path index is applied to rewrite descendant steps to multiple child steps. Child steps can be evaluated faster, as fewer nodes have to be accessed:
+
* Descendant steps will be rewritten to multiple child steps. Child steps are evaluated faster, as fewer nodes have to be traversed:
  
<pre class="brush:xquery">  
+
<syntaxhighlight lang="xquery">
 
doc('factbook.xml')//province,
 
doc('factbook.xml')//province,
 
(: ...will be rewritten to... :)
 
(: ...will be rewritten to... :)
 
doc('factbook.xml')/mondial/country/province
 
doc('factbook.xml')/mondial/country/province
</pre>
+
</syntaxhighlight>
 +
 
 +
* The {{Code|fn:count}} function will be pre-evaluated by looking up the number in the index:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
count(doc('factbook')//country)
 +
</syntaxhighlight>
  
The paths statistics are e.g. used to pre-evaluate the {{Code|count}} function:
+
* The distinct values of elements or attributes can be looked up in the index as well:
  
<pre class="brush:xquery">  
+
<syntaxhighlight lang="xquery">
(: will be rewritten and pre-evaluated by the path index :)
+
distinct-values(db:open('factbook')//religions)
count( doc('factbook')//country )
+
</syntaxhighlight>
</pre>
 
  
 
The contents of the path index can be directly accessed with the XQuery function [[Index Module#index:facets|index:facets]].
 
The contents of the path index can be directly accessed with the XQuery function [[Index Module#index:facets|index:facets]].
  
==Resource Index==
+
If a database is updated, the statistics in the path index will be invalidated.
  
The resource index contains references to the {{Code|pre}} values of all XML document nodes. It speeds up the access to specific documents in a database, and it will be automatically updated when updates are performed.
+
==Document Index==
  
The following query will be sped up by the resource index:
+
The document index contains references to all document nodes in a database. Once documents with specific paths are requested, the index will be extended to also contain document paths.
  
<pre class="brush:xquery">
+
The index generally speeds up access to single documents and database paths. It will always be kept up-to-date.
db:open('DatabaseWithLotsOfDocuments')
 
</pre>
 
  
 
=Value Indexes=
 
=Value Indexes=
  
Value indexes can be optionally created and dropped by the user. The text and attribute index will be created by default.
+
Value indexes can be created and dropped by the user. Four types of values indexes are available: a text and attribute index, and an optional token and full-text index. By default, the text and attribute index will automatically be created.
 +
 
 +
In the GUI, index structures can be managed in the dialog windows for creating new databases or displaying the database properties. On command-line, the commands {{Command|CREATE INDEX}} and {{Command|DROP INDEX}} are used to create and drop index structures. With {{Command|INFO INDEX}}, you get some insight into the contents of an index structure, and {{Command|SET}} allows you to change the index defaults for new databases:
 +
 
 +
* <code>OPEN factbook; CREATE INDEX fulltext</code>: Open database; create full-text index
 +
* <code>OPEN factbook; INFO INDEX TOKEN</code>: Open database; show info on token index
 +
* <code>SET ATTRINDEX true; SET ATTRINCLUDE id name; CREATE DB factbook.xml</code>: Enable attribute index; only index 'id' and 'name' attributes; create database
 +
 
 +
With XQuery, index structures can be created and dropped via {{Function|Database|db:optimize}}:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
(: Optimize specified database, create full-text index for texts of the specified elements :)
 +
db:optimize(
 +
  'factbook',
 +
  false(),
 +
  map { 'ftindex': true(), 'ftinclude': 'p div' }
 +
)
 +
</syntaxhighlight>
  
 
==Text Index==
 
==Text Index==
Line 59: Line 89:
 
===Exact Queries===
 
===Exact Queries===
  
This index speeds up string-based equality tests on text nodes. The [[Options#UPDINDEX|UPDINDEX]] option can be activated to keep this index up-to-date.
+
This index references text nodes of documents. It will be utilized to accelerate string comparisons in path expressions. The following queries will all be rewritten for index access:
  
The following queries will all be rewritten for index access:
+
<syntaxhighlight lang="xquery">
 
+
(: example 1 :)
<pre class="brush:xquery">  
 
(: 1st example :)
 
 
//*[text() = 'Germany'],
 
//*[text() = 'Germany'],
(: 2nd example :)
+
(: example 2 :)
 
doc('factbook.xml')//name[. = 'Germany'],
 
doc('factbook.xml')//name[. = 'Germany'],
(: 3rd example :)
+
(: example 3 :)
 
for $c in db:open('factbook')//country
 
for $c in db:open('factbook')//country
 
where $c//city/name = 'Hanoi'
 
where $c//city/name = 'Hanoi'
 
return $c/name
 
return $c/name
</pre>
+
</syntaxhighlight>
 +
 
 +
Before the actual index rewriting takes places, some preliminary optimizations are applied:
 +
* In example 2, the context item expression {{Code|.}} will be replaced with a {{Code|text()}} step.
 +
* In example 3, the {{Code|where}} clause will be rewritten to a predicate and attached to the first path expression.
 +
 
 +
The indexed text nodes can be accessed directly with the XQuery function {{Function|Database|db:text}}. The indexed string values can be looked up via {{Function|Index|index:text}}.
 +
 
 +
The {{Option|UPDINDEX}} option can be enabled to keep this index up-to-date:
  
Matching text nodes can be directly requested from the index with the XQuery function [[Database Module#db:text|db:text]]. The index contents can be accessed via [[Index Module#index:texts|index:texts]].
+
<syntaxhighlight lang="xquery">
 +
db:optimize(
 +
  'mydb',
 +
  true(),
 +
  map { 'updindex':true(), 'textindex': true(), 'textinclude':'id' }
 +
)
 +
</syntaxhighlight>
  
 
===Range Queries===
 
===Range Queries===
Line 80: Line 122:
 
The text index also supports range queries based on string comparisons:
 
The text index also supports range queries based on string comparisons:
  
<pre class="brush:xquery">  
+
<syntaxhighlight lang="xquery">
(: 1st example :)
+
(: example 1 :)
 
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
 
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
(: 2nd example :)
+
(: example 2 :)
 
let $min := '2014-04-16T00:00:00'
 
let $min := '2014-04-16T00:00:00'
 
let $max := '2014-04-19T23:59:59'  
 
let $max := '2014-04-19T23:59:59'  
 
return db:open('news')//entry[date-time > $min and date-time < $max]
 
return db:open('news')//entry[date-time > $min and date-time < $max]
</pre>
+
</syntaxhighlight>
  
Text nodes can be directly accessed from the index via the XQuery function [[Database Module#db:text-range|db:text-range]].
+
With {{Function|Database|db:text-range}}, you can access all text nodes whose values are between a minimum and maximum value.
  
Please note that the current index structures do not support queries for numbers and dates.
+
Please note that the index structures do not support queries for numbers and dates.
  
 
==Attribute Index==
 
==Attribute Index==
  
Similar to the text index, this index speeds up string-based equality and range tests on attribute values. The [[Options#UPDINDEX|UPDINDEX]] option can be activated to keep this index up-to-date.
+
Similar to the text index, this index speeds up string and range comparisons on attribute values. Additionally, the XQuery function {{Code|fn:id}} takes advantage of the index whenever possible. The following queries will all be rewritten for index access:
  
The following queries will all be rewritten for index access:
+
<syntaxhighlight lang="xquery">
 
 
<pre class="brush:xquery">  
 
 
(: 1st example :)
 
(: 1st example :)
 
//country[@car_code = 'J'],
 
//country[@car_code = 'J'],
Line 106: Line 146:
 
(: 3rd example :)
 
(: 3rd example :)
 
//sea[@depth > '2100' and @depth < '4000']
 
//sea[@depth > '2100' and @depth < '4000']
</pre>
+
(: 4th example :)
 +
fn:id('f0_119', db:open('factbook'))
 +
</syntaxhighlight>
 +
 
 +
''Attribute nodes'' (which you can use as starting points of navigation) can directly be retrieved from the index with the XQuery functions {{Function|Database|db:attribute}} and {{Function|Database|db:attribute-range}}. The index contents (''strings'') can be accessed with {{Function|Index|index:attributes}}.
 +
 
 +
The {{Option|UPDINDEX}} option can be activated to keep this index up-to-date.
 +
 
 +
==Token Index==
 +
 
 +
In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be created to speed up the retrieval of these tokens. The XQuery functions {{Code|fn:contains-token}}, {{Code|fn:tokenize}} and {{Code|fn:idref}} are rewritten for index access whenever possible. If a token index exists, it will e.g. be utilized for the following queries:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
(: 1st example :)
 +
//div[contains-token(@class, 'row')],
 +
(: 2nd example :)
 +
//p[tokenize(@class) = 'row'],
 +
(: 3rd example :)
 +
doc('graph.xml')/idref('edge8')
 +
</syntaxhighlight>
  
Matching text nodes can be directly requested from the index with the XQuery functions [[Database Module#db:attribute|db:attribute]] and [[Database Module#db:attribute-range|db:attribute-range]]. The index contents can be accessed with [[Index Module#index:attributes|index:attributes]].
+
''Attribute nodes'' with a matching value (containing at least one from a set of given tokens) can be directly retrieved from the index with the XQuery function {{Function|Database|db:token}}. The index contents (''token strings'') can be accessed with {{Function|Index|index:tokens}}.
  
 
==Full-Text Index==
 
==Full-Text Index==
  
The [[Full-Text]] index speeds up queries using the {{Code|contains text}} expression. Internally, two index structures are provided: the default index sorts all keys alphabetically by their character length. It is particularly fast if fuzzy searches are performed. The second index is a compressed trie structure, which needs slightly more memory, but is specialized on wildcard searches. Both index structures will be merged in a future version of BaseX.
+
The [[Full-Text]] index contains the normalized tokens of text nodes of a document. It is utilized to speed up queries with the {{Code|contains text}} expression, and it is capable of processing wildcard and fuzzy search operations. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see [https://files.basex.org/publications/Gruen%20et%20al.%20%5B2009%5D,%20XQuery%20Full%20Text%20Implementation%20in%20BaseX.pdf XQuery Full Text implementation in BaseX]).
  
 
If the full-text index exists, the following queries will all be rewritten for index access:
 
If the full-text index exists, the following queries will all be rewritten for index access:
  
<pre class="brush:xquery">  
+
<syntaxhighlight lang="xquery">
 
(: 1st example :)
 
(: 1st example :)
//country/name[text() contains text 'and'],
+
//country[name/text() contains text 'and'],
 
(: 2nd example :)
 
(: 2nd example :)
//religions[. contains text { 'Catholic', 'Roman' }
+
//religions[.//text() contains text { 'Catholic', 'Roman' }
 
     using case insensitive distance at most 2 words]
 
     using case insensitive distance at most 2 words]
</pre>  
+
</syntaxhighlight>
 +
 
 +
The index provides support for the following full-text features (the values can be changed in the GUI or via the {{Command|SET}} command):
 +
 
 +
* '''Stemming''': tokens are stemmed before being indexed (option: {{Option|STEMMING}})
 +
* '''Case Sensitive''': tokens are indexed in case-sensitive mode (option: {{Option|CASESENS}})
 +
* '''Diacritics''': diacritics are indexed as well (option: {{Option|DIACRITICS}})
 +
* '''Stopword List''': a stop word list can be defined to reduce the number of indexed tokens (option: {{Option|STOPWORDS}})
 +
* '''Language''': see [[Full-Text#Languages|Languages]] for more details (option: {{Option|LANGUAGE}})
 +
 
 +
The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overwritten if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is not explicitly disabled. This is demonstrated in the following [[Commands#Command_Scripts|Command Script]]:
 +
 
 +
<syntaxhighlight lang="xml">
 +
<commands>
 +
  <!-- Create database with stemmed full-text index -->
 +
  <set option='stemming'>true</set>
 +
  <set option='ftindex'>true</set>
 +
  <create-db name='test-db'> <text>house</text> </create-db>
 +
  <!-- Index access: Query term will be stemmed -->
 +
  <xquery> /text[. contains text { 'houses' }] </xquery>
 +
  <!-- Disable stemming (query will not be evaluated by the index) -->
 +
  <xquery> /text[. contains text { 'houses' } using no stemming] </xquery>
 +
</commands>
 +
</syntaxhighlight>
 +
 
 +
Text nodes can be directly requested from the index via the XQuery function {{Function|Full-Text|ft:search}}. The index contents can be accessed with {{Function|Full-Text|ft:tokens}}.
 +
 
 +
==Selective Indexing==
 +
 
 +
Value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the {{Option|TEXTINCLUDE}}, {{Option|ATTRINCLUDE}}, {{Option|TOKENINCLUDE}} and {{Option|FTINCLUDE}} options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:
 +
 
 +
* <code>*</code>: all names
 +
* <code>name</code>: elements or attributes called <code>name</code>, which are in the empty default namespace
 +
* <code>*:name</code>: elements or attributes called <code>name</code>, no matter which namespace
 +
* <code>Q{uri}*</code>: all elements or attributes in the <code>uri</code> namespace
 +
* <code>Q{uri}name</code>: elements or attributes called <code>name</code> in the <code>uri</code> namespace
 +
 
 +
The options can either be specified via the {{Command|SET}} command or via XQuery. With the following operations, an attribute index is created for all {{Code|id}} and {{Code|name}} attributes:
 +
 
 +
; Commands
 +
<syntaxhighlight lang="xml">
 +
SET ATTRINCLUDE id,name
 +
CREATE DB factbook http://files.basex.org/xml/factbook.xml'
 +
# Restore default
 +
SET ATTRINCLUDE
 +
</syntaxhighlight>
 +
 
 +
; XQuery
 +
<syntaxhighlight lang="xquery">
 +
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '',
 +
  map { 'attrinclude': 'id,name' })
 +
</syntaxhighlight>
 +
 
 +
With {{Command|CREATE INDEX}} and {{Function|Database|db:optimize}}, new selective indexing options will ba applied to an existing database.
 +
 
 +
==Enforce Rewritings==
 +
 
 +
In various cases, existing index structures will not be utilized by the query optimizer. This is usually the case if the name of the database is not a static string (e.g., because it is bound to a variable or passed on as argument of a function call). Furthermore, several candidates for index rewritings may exist, and the query optimizer may decide for a rewriting that turns out to be suboptimal.
 +
 
 +
With the {{Option|ENFORCEINDEX}} option, certain index rewritings can be enforced. While the option can be globally enabled, it is usually better to supply it as [[XQuery Extensions#Pragmas|Pragma]]. Two examples:
 +
 
 +
* In the query below, 10 databases will be addressed. If it is known in advance that these databases contain an up-to-date text index, the index rewriting can be enforced as follows:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
(# db:enforceindex #) {
 +
  for $n in 1 to 10
 +
  let $db := 'persons' || $n
 +
  return db:open($db)//person[name/text() = 'John']
 +
}
 +
</syntaxhighlight>
 +
 
 +
* The following query contains two predicates that may both be rewritten for index access. If the automatically chosen rewriting is known not to be optimal, another index rewriting can enforced by surrounding the specific expression with the pragma:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
db:open('factbook')//country
 +
  [(# db:enforceindex #) {
 +
  @population > '10000000' and
 +
  @population < '10999999'
 +
  }]
 +
  [religions/text() = 'Protestant']
 +
</syntaxhighlight>
 +
 
 +
The option can also be assigned to predicates with dynamic values. In the following example the comparison of the first comparison will be rewritten for index access. Without the pragma expression, the second comparison is preferred and chosen for the rewriting, because the statically known string allows for an exact cost estimation:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
for $name in ('Germany', 'Italy')
 +
for $country in db:open('factbook')//country
 +
where (# db:enforceindex #) { $country/name = $name }
 +
where $country/religions/text() = 'Protestant'
 +
return $country
 +
</syntaxhighlight>
 +
 
 +
Please note that:
 +
 
 +
* The option should only be enabled if the addressed databases exist, have all required index structures and are up-to-date (otherwise, you will be given an error message).
 +
* If you address the full-text index, and if you use non-default indexing options, you will have to specify them in your query (via {{Code|using stemming}},  {{Code|using language 'de'}}, etc).
 +
* If you have more than one enforce pragma in a single path expression, only the first will be considered.
 +
* In general, there are always expressions that cannot be rewritten for index access. If you enforce rewritings, you will have no guarantee that an index will be used.
 +
 
 +
=Custom Index Structures=
 +
 
 +
With XQuery, it is comparatively easy to create your own, custom index structures. The following query demonstrate how you can create a {{Code|factbook-index}} database, which contains all texts of the original database in lower case:
 +
 
 +
<syntaxhighlight lang="xquery">
 +
let $db := 'factbook'
 +
 
 +
let $index := <index>{
 +
  for $nodes in db:open($db)//text()
 +
  group by $text := lower-case($nodes)
 +
  return <text string='{ $text }'>{
 +
    for $node in $nodes
 +
    return <id>{ db:node-id($node ) }</id>
 +
  }</text>
 +
}</index>
 +
 
 +
return db:create($db || '-index', $index, $db || '-index.xml')
 +
</syntaxhighlight>
  
Matching text nodes can be directly requested from the index via the XQuery function [[Full-Text Module#ft:search|ft:search]]. The index contents can be accessed with [[Full-Text Module#ft:tokens|ft:tokens]].
+
In the following query, a text string is searched, and the text nodes of the original database are retrieved:
  
==Index Construction==
+
<syntaxhighlight lang="xquery">
 +
let $db := 'factbook'
 +
let $text := 'italian'
 +
for $id in db:open($db || '-index')//*[@string = $text]/id
 +
return db:open-id($db, $id)/..
 +
</syntaxhighlight>
  
If main memory runs out while creating a value index, the currently generated index structures will be partially written to disk and eventually merged. If the used memory heuristics fails for some reason (i.e., because multiple index operations run at the same time), fixed index split sizes may be chosen via the [[Options#INDEXSPLITSIZE|INDEXSPLITSIZE]] and [[Options#FTINDEXSPLITSIZE|FTINDEXSPLITSIZE]] options.
+
With some extra effort, and if {{Option|UPDINDEX}} is enabled for both your original and your index database (see below), your index database will support updates as well (try it, it’s fun!).
  
If [[Options#DEBUG|DEBUG]] is set to true, and if a new database is created from the command line, the number of index operations will be output to standard output; this might help you to choose a proper split size. The following example shows how the output can look for a document with 111 MB and 128 MB of available main memory:
+
=Performance=
 +
 
 +
If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged. If the memory heuristics fail for some reason (i.e., because multiple index operations run at the same time, or because the applied JVM does not support explicit garbage collections), a fixed index split sizes may be chosen via the {{Option|SPLITSIZE}} option.
 +
 
 +
If {{Option|DEBUG}} is enabled, the command-line output might help you to find a good split size. The following example shows the output for creating a database for an XMark document with 1 GB, and with 128 MB assigned to the JVM:
  
 
<pre>
 
<pre>
> basex -d -c"set ftindex; create db 111mb 111mb.xml"
+
> basex -d -c"SET FTINDEX ON; SET TOKENINDEX ON; CREATE DB xmark 1gb.xml"
 
Creating Database...
 
Creating Database...
.... 8132.44 ms (17824 KB)
+
................................ 76559.99 ms (29001 KB)
 
Indexing Text...
 
Indexing Text...
.. 979920 operations, 2913.78 ms (44 MB)
+
....|...|...|.....|. 9.81 M operations, 18576.92 ms (13523 KB). Recommended SPLITSIZE: 20.
 
Indexing Attribute Values...
 
Indexing Attribute Values...
.. 381870 operations, 630.61 ms (21257 KB)
+
.........|....... 3.82 M operations, 7151.77 ms (6435 KB). Recommended SPLITSIZE: 20.
 +
Indexing Tokens...
 +
.......|..|.....|.. 3.82 M operations, 9636.73 ms (10809 KB). Recommended SPLITSIZE: 10.
 
Indexing Full-Text...
 
Indexing Full-Text...
..|| 3 splits, 12089347 operations, 16420.47 ms (36 MB)
+
..|.|.|.|...|...|..|.|..| 116.33 M operations, 138740.94 ms (106 MB). Recommended SPLITSIZE: 12.
 
</pre>
 
</pre>
  
The info string {{Code|3 splits}} indicates that three partial full-text index structures were written to disk, and the string {{Code|12089347 operations}} tells that the index construction consisted of approximately 12 mio index operations. If we set [[Options#FTINDEXSPLITSIZE|FTINDEXSPLITSIZE]] to the fixed value {{Code|4000000}} (12 mio divided by three), or a smaller value, we should be able to build the index and circumvent the memory heuristics.
+
The output can be interpreted as follows:
 +
 
 +
* The vertical bar <code>|</code> indicates that a partial index structure was written to disk.
 +
* The mean value of the recommendations can be assigned to the {{Option|SPLITSIZE}} option. Please note that the recommendation is only a vague proposal, so try different values if you get main-of-memory errors or indexing gets too slow. Greater values will require more main memory.
 +
* In the example, the full-text index was split 12 times. 116 million tokens were indexed, processing time was 2,5 minutes, and final main memory consumption (after writing the index to disk) was 76 MB. A good value for the split size option could be {{Code|15}}.
  
 
=Updates=
 
=Updates=
  
By default, index structures are discarded after an update operation.
+
Generally, update operations are very fast in BaseX. By default, the index structures will be invalidated by updates; as a result, queries that benefit from index structures may slow down after updates. There are different alternatives to cope with this:
As a result, queries will be executed more slowly. There are
 
different alternatives to cope with this:
 
  
After the execution of update operations, the [[Commands#OPTIMIZE|OPTIMIZE]]
+
* After the execution of one or more update operations, the {{Command|OPTIMIZE}} command or the {{Function|Database|db:optimize}} function can be called to rebuild the index structures.
command or the [[Database Module#db:optimize|db:optimize]] function can be
+
* The {{Option|UPDINDEX}} option can be activated before creating or optimizing the database. As a result, the text, attribute and token indexes will be incrementally updated after each database update. Please note that incremental updates are not available for the full-text index and database statistics. This is also explains why the UPTODATE flag, which is e.g. displayed via {{Command|INFO DB}} or {{Function|Database|db:info}}, will be set to {{Code|false}} until the database will be optimized again (various optimizations won’t be triggered. For example, count(//item) can be extremely fast if all meta data is up-to-date.
called to rebuild the index structures. This way, multiple update operations will be performed faster,
+
* The {{Option|AUTOOPTIMIZE}} option can be enabled before creating or optimizing the database. All outdated index structures and statistics will then be recreated after each database update. This option should only be done for small and medium-sized databases.
as the database meta data is only updated and regenerated
+
* Both options can be used side by side: {{Option|UPDINDEX}} will take care that the value index structures will be updated as part of the actual update operation. {{Option|AUTOOPTIMIZE}} will update the remaining data structures (full-text index, database statistics).
when requested by the database users.
 
  
With the [[Options#UPDINDEX|UPDINDEX]] option, text and attribute
+
=Changelog=
index structures will incrementally be updated. This option must be
 
turned on before the database is created or optimized.
 
Please note that incremental updates are not available for
 
the full-text index and database statistics. This is also the
 
reason why the internal up-to-date flag of a database (which can e.g.
 
be displayed via [[Commands#INFO DB|INFO DB]] or [[Database_Module#db:info|db:info]])
 
will be set to {{Code|false}} until the next optimize call is triggered.
 
  
If [[Options#AUTOOPTIMIZE|AUTOOPTIMIZE]] is activated, all outdated
+
;Version 9.1
index structures and statistics will be recreated after a database update.
 
This option should only be activated for medium-sized databases.
 
Similar to UPDINDEX, it must be turned on before the database is created
 
or optimized.
 
  
=Changelog=
+
* Updated: [[#Enforce Rewritings|Enforce Rewritings]], support for comparisons with dynamic values.
 +
 
 +
;Version 9.0
 +
 
 +
* Added: [[#Enforce Rewritings|Enforce Rewritings]]
 +
 
 +
;Version 8.4
 +
 
 +
* Updated: [[#Name Index|Name Index]], [[#Path Index|Path Index]]
 +
 
 +
;Version 8.4
 +
 
 +
* Added: [[#Token Index|Token Index]]
 +
 
 +
;Version 8.3
  
 +
* Added: [[#Selective Indexing|Selective Indexing]]
 
;Version 8.0
 
;Version 8.0
  
Line 182: Line 373:
  
 
* Added: string-based range queries
 
* Added: string-based range queries
 
[[Category:Internals]]
 

Latest revision as of 11:17, 15 February 2021

This article is part of the XQuery Portal. It contains information on the available index structures.

The query compiler tries to optimize and speed up queries by applying the index whenever it is possible and seems promising. To see how a query is rewritten, and if an index is used, you can turn on the Info View in the GUI or use the -V flag on the command line:

  • A message like apply text index for "Japan" indicates that the text index is applied to speed up the search of the shown string. The following message…
  • no index results indicates that a string in a path expression will never yield results. Because of that, the path does not need to be evaluated at all.
  • If you cannot find any index optimization hints in the info output, it often helps if you rewrite and simplify your query.

Additional examples for index rewritings are presented in our article on XQuery Optimizations.

Structural Indexes[edit]

Structural indexes are automatically created and cannot be dropped by the user:

Name Index[edit]

The name index contains references to the names of all elements and attributes in a database. It contains some basic statistical information, such as the number of occurrence of a name.

The name index is e.g. applied to discard location steps that will never yield results:

(: will be rewritten to an empty sequence :)
/non-existing-name

The contents of the name indexes can be directly accessed with the XQuery functions index:element-names and index:attribute-names.

If a database is updated, new names will be added incrementally, but the statistical information will get out-dated.

Path Index[edit]

The path index (which is also called path summary or data guide) stores all distinct paths of the documents in the database. It contains additional statistical information, such as the number of occurrence of a path, its distinct string values, and the minimum/maximum of numeric values. The maximum number of distinct values to store per name can be changed via MAXCATS. Distinct values are also stored for elements and attributes of numeric type.

Various queries will be evaluated much faster if an up-to-date path index is available (as can be observed when opening the Info View):

  • Descendant steps will be rewritten to multiple child steps. Child steps are evaluated faster, as fewer nodes have to be traversed:
doc('factbook.xml')//province,
(: ...will be rewritten to... :)
doc('factbook.xml')/mondial/country/province
  • The fn:count function will be pre-evaluated by looking up the number in the index:
count(doc('factbook')//country)
  • The distinct values of elements or attributes can be looked up in the index as well:
distinct-values(db:open('factbook')//religions)

The contents of the path index can be directly accessed with the XQuery function index:facets.

If a database is updated, the statistics in the path index will be invalidated.

Document Index[edit]

The document index contains references to all document nodes in a database. Once documents with specific paths are requested, the index will be extended to also contain document paths.

The index generally speeds up access to single documents and database paths. It will always be kept up-to-date.

Value Indexes[edit]

Value indexes can be created and dropped by the user. Four types of values indexes are available: a text and attribute index, and an optional token and full-text index. By default, the text and attribute index will automatically be created.

In the GUI, index structures can be managed in the dialog windows for creating new databases or displaying the database properties. On command-line, the commands CREATE INDEX and DROP INDEX are used to create and drop index structures. With INFO INDEX, you get some insight into the contents of an index structure, and SET allows you to change the index defaults for new databases:

  • OPEN factbook; CREATE INDEX fulltext: Open database; create full-text index
  • OPEN factbook; INFO INDEX TOKEN: Open database; show info on token index
  • SET ATTRINDEX true; SET ATTRINCLUDE id name; CREATE DB factbook.xml: Enable attribute index; only index 'id' and 'name' attributes; create database

With XQuery, index structures can be created and dropped via db:optimize:

(: Optimize specified database, create full-text index for texts of the specified elements :)
db:optimize(
  'factbook',
  false(),
  map { 'ftindex': true(), 'ftinclude': 'p div' }
)

Text Index[edit]

Exact Queries[edit]

This index references text nodes of documents. It will be utilized to accelerate string comparisons in path expressions. The following queries will all be rewritten for index access:

(: example 1 :)
//*[text() = 'Germany'],
(: example 2 :)
doc('factbook.xml')//name[. = 'Germany'],
(: example 3 :)
for $c in db:open('factbook')//country
where $c//city/name = 'Hanoi'
return $c/name

Before the actual index rewriting takes places, some preliminary optimizations are applied:

  • In example 2, the context item expression . will be replaced with a text() step.
  • In example 3, the where clause will be rewritten to a predicate and attached to the first path expression.

The indexed text nodes can be accessed directly with the XQuery function db:text. The indexed string values can be looked up via index:text.

The UPDINDEX option can be enabled to keep this index up-to-date:

db:optimize(
  'mydb',
  true(),
  map { 'updindex':true(), 'textindex': true(), 'textinclude':'id' }
)

Range Queries[edit]

The text index also supports range queries based on string comparisons:

(: example 1 :)
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
(: example 2 :)
let $min := '2014-04-16T00:00:00'
let $max := '2014-04-19T23:59:59' 
return db:open('news')//entry[date-time > $min and date-time < $max]

With db:text-range, you can access all text nodes whose values are between a minimum and maximum value.

Please note that the index structures do not support queries for numbers and dates.

Attribute Index[edit]

Similar to the text index, this index speeds up string and range comparisons on attribute values. Additionally, the XQuery function fn:id takes advantage of the index whenever possible. The following queries will all be rewritten for index access:

(: 1st example :)
//country[@car_code = 'J'],
(: 2nd example :)
//province[@* = 'Hokkaido']//name,
(: 3rd example :)
//sea[@depth > '2100' and @depth < '4000']
(: 4th example :)
fn:id('f0_119', db:open('factbook'))

Attribute nodes (which you can use as starting points of navigation) can directly be retrieved from the index with the XQuery functions db:attribute and db:attribute-range. The index contents (strings) can be accessed with index:attributes.

The UPDINDEX option can be activated to keep this index up-to-date.

Token Index[edit]

In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be created to speed up the retrieval of these tokens. The XQuery functions fn:contains-token, fn:tokenize and fn:idref are rewritten for index access whenever possible. If a token index exists, it will e.g. be utilized for the following queries:

(: 1st example :)
//div[contains-token(@class, 'row')],
(: 2nd example :)
//p[tokenize(@class) = 'row'],
(: 3rd example :)
doc('graph.xml')/idref('edge8')

Attribute nodes with a matching value (containing at least one from a set of given tokens) can be directly retrieved from the index with the XQuery function db:token. The index contents (token strings) can be accessed with index:tokens.

Full-Text Index[edit]

The Full-Text index contains the normalized tokens of text nodes of a document. It is utilized to speed up queries with the contains text expression, and it is capable of processing wildcard and fuzzy search operations. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see XQuery Full Text implementation in BaseX).

If the full-text index exists, the following queries will all be rewritten for index access:

(: 1st example :)
//country[name/text() contains text 'and'],
(: 2nd example :)
//religions[.//text() contains text { 'Catholic', 'Roman' }
    using case insensitive distance at most 2 words]

The index provides support for the following full-text features (the values can be changed in the GUI or via the SET command):

  • Stemming: tokens are stemmed before being indexed (option: STEMMING)
  • Case Sensitive: tokens are indexed in case-sensitive mode (option: CASESENS)
  • Diacritics: diacritics are indexed as well (option: DIACRITICS)
  • Stopword List: a stop word list can be defined to reduce the number of indexed tokens (option: STOPWORDS)
  • Language: see Languages for more details (option: LANGUAGE)

The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overwritten if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is not explicitly disabled. This is demonstrated in the following Command Script:

<commands>
  <!-- Create database with stemmed full-text index -->
  <set option='stemming'>true</set>
  <set option='ftindex'>true</set>
  <create-db name='test-db'> <text>house</text> </create-db>
  <!-- Index access: Query term will be stemmed -->
  <xquery> /text[. contains text { 'houses' }] </xquery>
  <!-- Disable stemming (query will not be evaluated by the index) -->
  <xquery> /text[. contains text { 'houses' } using no stemming] </xquery>
</commands>

Text nodes can be directly requested from the index via the XQuery function ft:search. The index contents can be accessed with ft:tokens.

Selective Indexing[edit]

Value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the TEXTINCLUDE, ATTRINCLUDE, TOKENINCLUDE and FTINCLUDE options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:

  • *: all names
  • name: elements or attributes called name, which are in the empty default namespace
  • *:name: elements or attributes called name, no matter which namespace
  • Q{uri}*: all elements or attributes in the uri namespace
  • Q{uri}name: elements or attributes called name in the uri namespace

The options can either be specified via the SET command or via XQuery. With the following operations, an attribute index is created for all id and name attributes:

Commands
SET ATTRINCLUDE id,name
CREATE DB factbook http://files.basex.org/xml/factbook.xml'
# Restore default
SET ATTRINCLUDE
XQuery
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '',
  map { 'attrinclude': 'id,name' })

With CREATE INDEX and db:optimize, new selective indexing options will ba applied to an existing database.

Enforce Rewritings[edit]

In various cases, existing index structures will not be utilized by the query optimizer. This is usually the case if the name of the database is not a static string (e.g., because it is bound to a variable or passed on as argument of a function call). Furthermore, several candidates for index rewritings may exist, and the query optimizer may decide for a rewriting that turns out to be suboptimal.

With the ENFORCEINDEX option, certain index rewritings can be enforced. While the option can be globally enabled, it is usually better to supply it as Pragma. Two examples:

  • In the query below, 10 databases will be addressed. If it is known in advance that these databases contain an up-to-date text index, the index rewriting can be enforced as follows:
(# db:enforceindex #) {
  for $n in 1 to 10
  let $db := 'persons' || $n
  return db:open($db)//person[name/text() = 'John']
}
  • The following query contains two predicates that may both be rewritten for index access. If the automatically chosen rewriting is known not to be optimal, another index rewriting can enforced by surrounding the specific expression with the pragma:
db:open('factbook')//country
  [(# db:enforceindex #) {
   @population > '10000000' and
   @population < '10999999'
  }]
  [religions/text() = 'Protestant']

The option can also be assigned to predicates with dynamic values. In the following example the comparison of the first comparison will be rewritten for index access. Without the pragma expression, the second comparison is preferred and chosen for the rewriting, because the statically known string allows for an exact cost estimation:

for $name in ('Germany', 'Italy')
for $country in db:open('factbook')//country
where (# db:enforceindex #) { $country/name = $name }
where $country/religions/text() = 'Protestant'
return $country

Please note that:

  • The option should only be enabled if the addressed databases exist, have all required index structures and are up-to-date (otherwise, you will be given an error message).
  • If you address the full-text index, and if you use non-default indexing options, you will have to specify them in your query (via using stemming, using language 'de', etc).
  • If you have more than one enforce pragma in a single path expression, only the first will be considered.
  • In general, there are always expressions that cannot be rewritten for index access. If you enforce rewritings, you will have no guarantee that an index will be used.

Custom Index Structures[edit]

With XQuery, it is comparatively easy to create your own, custom index structures. The following query demonstrate how you can create a factbook-index database, which contains all texts of the original database in lower case:

let $db := 'factbook'

let $index := <index>{
  for $nodes in db:open($db)//text()
  group by $text := lower-case($nodes)
  return <text string='{ $text }'>{
    for $node in $nodes
    return <id>{ db:node-id($node ) }</id>
  }</text>
}</index>

return db:create($db || '-index', $index, $db || '-index.xml')

In the following query, a text string is searched, and the text nodes of the original database are retrieved:

let $db := 'factbook'
let $text := 'italian'
for $id in db:open($db || '-index')//*[@string = $text]/id
return db:open-id($db, $id)/..

With some extra effort, and if UPDINDEX is enabled for both your original and your index database (see below), your index database will support updates as well (try it, it’s fun!).

Performance[edit]

If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged. If the memory heuristics fail for some reason (i.e., because multiple index operations run at the same time, or because the applied JVM does not support explicit garbage collections), a fixed index split sizes may be chosen via the SPLITSIZE option.

If DEBUG is enabled, the command-line output might help you to find a good split size. The following example shows the output for creating a database for an XMark document with 1 GB, and with 128 MB assigned to the JVM:

> basex -d -c"SET FTINDEX ON; SET TOKENINDEX ON; CREATE DB xmark 1gb.xml"
Creating Database...
................................ 76559.99 ms (29001 KB)
Indexing Text...
....|...|...|.....|. 9.81 M operations, 18576.92 ms (13523 KB). Recommended SPLITSIZE: 20.
Indexing Attribute Values...
.........|....... 3.82 M operations, 7151.77 ms (6435 KB). Recommended SPLITSIZE: 20.
Indexing Tokens...
.......|..|.....|.. 3.82 M operations, 9636.73 ms (10809 KB). Recommended SPLITSIZE: 10.
Indexing Full-Text...
..|.|.|.|...|...|..|.|..| 116.33 M operations, 138740.94 ms (106 MB). Recommended SPLITSIZE: 12.

The output can be interpreted as follows:

  • The vertical bar | indicates that a partial index structure was written to disk.
  • The mean value of the recommendations can be assigned to the SPLITSIZE option. Please note that the recommendation is only a vague proposal, so try different values if you get main-of-memory errors or indexing gets too slow. Greater values will require more main memory.
  • In the example, the full-text index was split 12 times. 116 million tokens were indexed, processing time was 2,5 minutes, and final main memory consumption (after writing the index to disk) was 76 MB. A good value for the split size option could be 15.

Updates[edit]

Generally, update operations are very fast in BaseX. By default, the index structures will be invalidated by updates; as a result, queries that benefit from index structures may slow down after updates. There are different alternatives to cope with this:

  • After the execution of one or more update operations, the OPTIMIZE command or the db:optimize function can be called to rebuild the index structures.
  • The UPDINDEX option can be activated before creating or optimizing the database. As a result, the text, attribute and token indexes will be incrementally updated after each database update. Please note that incremental updates are not available for the full-text index and database statistics. This is also explains why the UPTODATE flag, which is e.g. displayed via INFO DB or db:info, will be set to false until the database will be optimized again (various optimizations won’t be triggered. For example, count(//item) can be extremely fast if all meta data is up-to-date.
  • The AUTOOPTIMIZE option can be enabled before creating or optimizing the database. All outdated index structures and statistics will then be recreated after each database update. This option should only be done for small and medium-sized databases.
  • Both options can be used side by side: UPDINDEX will take care that the value index structures will be updated as part of the actual update operation. AUTOOPTIMIZE will update the remaining data structures (full-text index, database statistics).

Changelog[edit]

Version 9.1
Version 9.0
Version 8.4
Version 8.4
Version 8.3
Version 8.0
  • Added: AUTOOPTIMIZE option
Version 7.2.1
  • Added: string-based range queries