Changes

2,606 bytes added, 15:56, 14 February 2017
This article is part of the [[XQuery|XQuery Portal]]. It contains information on the available index structures.
The query compiler tries to optimize and speed up queries by applying an index whenever it is possible and seems promising. Most examples in this article are based on the [http://files.basex.org/xml/factbook.xml factbook.xml] document. To see how a query is rewritten, and if an index is used, you can turn on the [[GUI#Visualizations|Info View]] in the GUI or use the [[Command-Line Options#BaseX_Standalone|-V flag]] on the command line:
* A message like <code>Applying text index for "Japan"</code> indicates that the text index is applied to speed up the search of the shown string. The following message…
==Name Index==
The name index contains references to the names of all elements and attributes in a database. It contains some basic statistical information, such as the number of occurrences of a name.
The name index is e.g. applied to discard location steps that will never yield results:
<pre class="brush:xquery">
The contents of the name indexes can be directly accessed with the XQuery functions [[Index Module#index:element-names|index:element-names]] and [[Index Module#index:attribute-names|index:attribute-names]].
 
If a database is updated, new names will be added incrementally, but the statistical information will become outdated.
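As a rough illustration of the access functions named above, the following query lists the element names stored in the name index. It assumes that a database called {{Code|factbook}} exists; calling {{Code|index:attribute-names}} with the same argument would list the attribute names in the same way:
<pre class="brush:xquery">
(: sketch: assumes an existing database named 'factbook' :)
index:element-names('factbook')
</pre>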
==Path Index==
The path index (which is also called ''path summary'' or ''data guide'') stores all distinct paths of the documents in the database. It contains additional statistical information, such as the number of occurrences of a path, its distinct string values, and the minimum/maximum of numeric values. The maximum number of distinct values to store per name can be changed via {{Option|MAXCATS}}. Since {{Version|8.6}}, the distinct values are also stored for elements and attributes of numeric type. Various queries will be evaluated much faster if an up-to-date path index is available (as can be observed when opening the [[GUI#Visualizations|Info View]]):
* Descendant steps will be rewritten to multiple child steps. Child steps are evaluated faster, as fewer nodes have to be traversed:
<pre class="brush:xquery">
</pre>
* The {{Code|fn:count}} function will be pre-evaluated by looking up the number in the index:
<pre class="brush:xquery">
count(doc('factbook')//country)
</pre>
* The distinct values of elements or attributes can be looked up in the path index as well:
<pre class="brush:xquery">
distinct-values(db:open('factbook')//religions)
</pre>
The contents of the path index can be directly accessed with the XQuery function [[Index Module#index:facets|index:facets]].
 
If a database is updated, the statistics in the path index will be invalidated.
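To inspect what the path index currently holds, {{Code|index:facets}} can be called directly. The following is a minimal sketch that assumes a database named {{Code|factbook}}; the returned structure reflects the statistics described above and may be outdated after updates:
<pre class="brush:xquery">
(: sketch: assumes an existing database named 'factbook' :)
index:facets('factbook')
</pre>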
==Document Index==
Value indexes can be created and dropped by the user. Four types of value indexes are available: a text and attribute index, and an optional token and full-text index. By default, the text and attribute index will automatically be created.
In the GUI, index structures can be managed in the dialog windows for creating new databases or displaying the database properties. On the command line, the commands {{Command|CREATE INDEX}} and {{Command|DROP INDEX}} are used to create and drop index structures. With {{Command|INFO INDEX}}, you get some insight into the contents of an index structure, and {{Command|SET}} allows you to change the index defaults for new databases:
* <code>OPEN factbook; CREATE INDEX fulltext</code>: Open database; create full-text index
* <code>OPEN factbook; INFO INDEX TOKEN</code>: Open database; show info on token index
* <code>SET ATTRINDEX true; SET ATTRINCLUDE id,name; CREATE DB factbook.xml</code>: Enable attribute index; only index 'id' and 'name' attributes; create database
* <code>OPEN factbook; SET FTINDEX true; OPTIMIZE</code>: Open database; enable full-text indexing; optimize database
With XQuery, index structures can be created and dropped via {{Function|Database|db:optimize}}:
<pre class="brush:xquery">
===Exact Queries===
This index speeds up string-based equality tests on text nodes. The following queries will all be rewritten for index access:
<pre class="brush:xquery">
</pre>
Matching text nodes can be directly requested from the index with the XQuery function {{Function|Database|db:text}}. The index contents can be accessed via {{Function|Index|index:texts}}.
The {{Option|UPDINDEX}} option can be activated to keep this index up-to-date, for example:
<pre class="brush:xquery">
db:optimize(
  'mydb',
  true(),
  map { 'updindex': true(), 'textindex': true(), 'textinclude': 'id' }
)
</pre>
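The direct lookup via {{Code|db:text}} mentioned in this section can be sketched as follows. The database name {{Code|factbook}} and the search string are only assumptions for illustration, and the database is expected to have a text index:
<pre class="brush:xquery">
(: sketch: assumes a 'factbook' database with a text index :)
(: db:text returns the matching text nodes; the parent step yields their elements :)
db:text('factbook', 'Japan')/..
</pre>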
===Range Queries===
</pre>
Text nodes can directly be retrieved from the index via the XQuery function {{Function|Database|db:text-range}}.
Please note that the current index structures do not support queries for numbers and dates.
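A string-based range lookup might look as follows. This is only a sketch: the database name and the two bounds are chosen for illustration, and the comparison is performed on strings, not on numbers or dates:
<pre class="brush:xquery">
(: sketch: assumes a 'factbook' database with a text index :)
(: returns all text nodes whose string value lies between the two bounds :)
db:text-range('factbook', 'Ja', 'Jz')
</pre>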
==Attribute Index==
Similar to the text index, this index speeds up string-based equality and range tests on attribute values. Additionally, the XQuery function {{Code|fn:id}} takes advantage of the index whenever possible. The following queries will all be rewritten for index access:
<pre class="brush:xquery">
(: 3rd example :)
//sea[@depth > '2100' and @depth < '4000']
(: 4th example :)
fn:id('f0_119', db:open('factbook'))
</pre>
Attribute nodes can directly be retrieved from the index with the XQuery functions {{Function|Database|db:attribute}} and {{Function|Database|db:attribute-range}}. The index contents can be accessed with {{Function|Index|index:attributes}}. The {{Option|UPDINDEX}} option can be activated to keep this index up-to-date.
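The two lookup functions can be used roughly as shown below. The sketch assumes a {{Code|factbook}} database with an attribute index and reuses the values from the examples above; how the range bounds are treated may differ from the comparison operators in a rewritten query:
<pre class="brush:xquery">
(: sketch: assumes a 'factbook' database with an attribute index :)
(
  (: elements whose id attribute has the value 'f0_119' :)
  db:attribute('factbook', 'f0_119', 'id')/..,
  (: elements whose depth attribute lies between '2100' and '4000' (string comparison) :)
  db:attribute-range('factbook', '2100', '4000', 'depth')/..
)
</pre>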
==Token Index==
{{Mark|Introduced with Version 8.4:}} In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be created to speed up the retrieval of these tokens. The XQuery functions {{Code|fn:contains-token}}, {{Code|fn:tokenize}} and {{Code|fn:idref}} are rewritten for index access whenever possible. If a token index exists, it will e.g. be utilized for the following queries:
<pre class="brush:xquery">
</pre>
Attributes with tokens can be directly retrieved from the index with the XQuery function {{Function|Database|db:token}}. The index contents can be accessed with {{Function|Index|index:tokens}}.
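A direct token lookup could be written as in the following sketch. The database name {{Code|pages}}, the token {{Code|row}} and the attribute name {{Code|class}} are hypothetical and only stand for an HTML-like database with a token index:
<pre class="brush:xquery">
(: sketch: 'pages', 'row' and 'class' are hypothetical names :)
(: db:token returns the class attributes containing the token; the parent step yields their elements :)
db:token('pages', 'row', 'class')/..
</pre>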
==Full-Text Index==
using case insensitive distance at most 2 words]
</pre>
 
The index provides support for the following full-text features (the values can be changed in the GUI or via the {{Command|SET}} command):
 
* '''Stemming''': tokens are stemmed before being indexed (option: {{Option|STEMMING}})
* '''Case Sensitive''': tokens are indexed in case-sensitive mode (option: {{Option|CASESENS}})
* '''Diacritics''': diacritics are indexed as well (option: {{Option|DIACRITICS}})
* '''Stopword List''': a stop word list can be defined to reduce the number of indexed tokens (option: {{Option|STOPWORDS}})
* '''Language''': see [[Full-Text#Languages|Languages]] for more details (option: {{Option|LANGUAGE}})
The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overridden if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is explicitly disabled. This is demonstrated in the following [[Commands#Command_Scripts|Command Script]]:
</pre>
Text nodes can be directly requested from the index via the XQuery function {{Function|Full-Text|ft:search}}. The index contents can be accessed with {{Function|Full-Text|ft:tokens}}.
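As a small sketch of such a direct request, the following query searches the full-text index for a single term. It assumes a {{Code|factbook}} database for which a full-text index has been created; the search term is arbitrary:
<pre class="brush:xquery">
(: sketch: assumes a 'factbook' database with a full-text index :)
(: ft:search returns the matching text nodes; the parent step yields their elements :)
ft:search('factbook', 'japan')/..
</pre>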
==Selective Indexing==
{{Mark|Updated with Version 8.4:}} {{Code|TOKENINCLUDE}} option added.
Value indexing can be restricted to specific elements and attributes. The nodes to be indexed can be restricted via the {{Option|TEXTINCLUDE}}, {{Option|ATTRINCLUDE}}, {{Option|TOKENINCLUDE}} and {{Option|FTINCLUDE}} options. The options take a list of name patterns, which are separated by commas. The following name patterns are supported:
* <code>*</code>: all names
* <code>Q{uri}name</code>: elements or attributes called <code>name</code> in the <code>uri</code> namespace
The options can either be specified via the {{Command|SET}} command or via XQuery. With the following operations, an attribute index is created for all {{Code|id}} and {{Code|name}} attributes:
; Commands
</pre>
With {{Command|CREATE INDEX}} and {{Function|Database|db:optimize}}, new selective indexing options will be applied to an existing database.
=Custom Index Structures=
With XQuery, it is comparatively easy to create your own, custom index structures. The following query demonstrates how you can create a {{Code|factbook-index}} database, which contains all texts of the original database in lower case:
<pre class="brush:xquery">
let $db := 'factbook'
let $index := <index>{
  for $nodes in db:open($db)//text()
  group by $text := lower-case($nodes)
  return <text string='{ $text }'>{
    for $node in $nodes
    return <id>{ db:node-id($node) }</id>
  }</text>
}</index>
return db:create($db || '-index', $index, $db || '-index.xml')
</pre>
In the following query, a text string is searched, and the text nodes of the original database are retrieved:
<pre class="brush:xquery">
let $db := 'factbook'
let $text := 'italian'
for $id in db:open($db || '-index')//*[@string = $text]/id
return db:open-id($db, $id)/..
</pre>
With some extra effort, and if {{Option|UPDINDEX}} is enabled for both your original and your index database (see below), your index database will support updates as well (try it, it’s fun!).
=Performance=
If main memory runs out while creating a value index, the current index structures will be partially written to disk and eventually merged. If the memory heuristics fail for some reason (for example, because multiple index operations run at the same time, or because the applied JVM does not support explicit garbage collections), a fixed index split size may be chosen via the {{Option|SPLITSIZE}} option.
If {{Option|DEBUG}} is enabled, and if a new database is created from the command line, the number of index operations will be written to standard output; this might help you to find a good split size. The following example shows how the output can look when creating a database for an XMark document with 1 GB, and with 128 MB assigned to the JVM:
<pre>
> basex -d -c"SET FTINDEX ON; SET TOKENINDEX ON; CREATE DB xmark 1gb.xml"
Creating Database...
........................... 76559.99 ms (29001 KB)
Indexing Text...
..|...|...|.....|. 9.81 M operations, 18576.92 ms (13523 KB). Recommended SPLITSIZE: 20.
Indexing Attribute Values...
.......|....... 3.82 M operations, 7151.77 ms (6435 KB). Recommended SPLITSIZE: 20.
Indexing Tokens...
.......|..|.....|.. 3.82 M operations, 9636.73 ms (10809 KB). Recommended SPLITSIZE: 10.
Indexing Full-Text...
.|...|...|..|.|..| 116.33 M operations, 138740.94 ms (106 MB). Recommended SPLITSIZE: 12.
</pre>
The output can be interpreted as follows:
* The vertical bar <code>|</code> indicates that a partial index structure was written to disk.
* The mean value of the recommendations can be assigned to the {{Option|SPLITSIZE}} option. Please note that the recommendation is only a vague proposal, so try different values if you get out-of-memory errors or indexing gets too slow. Greater values will require more main memory.
* In the example, the full-text index was split 12 times. 116 million tokens were indexed, processing time was 2.5 minutes, and final main memory consumption (after writing the index to disk) was 76 MB. A good value for the split size option could be {{Code|15}}.
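The suggested value of {{Code|15}} is consistent with the mean of the four recommendations in the output above. The following sketch computes that mean; rounding down to an integer is an assumption made here, not a rule stated by BaseX:
<pre class="brush:xquery">
(: recommended split sizes for the text, attribute, token and full-text index :)
let $recommendations := (20, 20, 10, 12)
(: the mean is 15.5; rounding down yields 15 :)
return floor(avg($recommendations))
</pre>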
=Updates=
Generally, update operations are very fast in BaseX. By default, the index structures will be invalidated by updates; as a result, queries that benefit from index structures may slow down after updates. There are different alternatives to cope with this:
* After the execution of one or more update operations, the {{Command|OPTIMIZE}} command or the {{Function|Database|db:optimize}} function can be called to rebuild the index structures.
* The {{Option|UPDINDEX}} option can be activated before creating or optimizing the database. As a result, the text, attribute and token indexes will be incrementally updated after each database update. Please note that incremental updates are not available for the full-text index and database statistics. This also explains why the UPTODATE flag, which is e.g. displayed via {{Command|INFO DB}} or {{Function|Database|db:info}}, will be set to {{Code|false}} until the database is optimized again. While the flag is {{Code|false}}, various optimizations won’t be triggered; for example, {{Code|count(//item)}} can be extremely fast if all meta data is up-to-date.
* The {{Option|AUTOOPTIMIZE}} option can be enabled before creating or optimizing the database. All outdated index structures and statistics will then be recreated after each database update. This option should only be used for small and medium-sized databases.
* Both options can be used side by side: {{Option|UPDINDEX}} will take care that the value index structures will be updated as part of the actual update operation. {{Option|AUTOOPTIMIZE}} will update the remaining data structures (full-text index, database statistics).
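The first two alternatives can be sketched with {{Code|db:optimize}}. The database name {{Code|factbook}} is an assumption, and the option map only enables incremental updates of the value indexes:
<pre class="brush:xquery">
(: sketch: rebuild all index structures of 'factbook' and enable UPDINDEX :)
db:optimize('factbook', true(), map { 'updindex': true() })
</pre>
In a separate query, the state of the database can then be inspected. The element name {{Code|uptodate}} is an assumption about the structure returned by {{Code|db:info}} and may need to be adapted:
<pre class="brush:xquery">
(: sketch: look up the up-to-date flag in the database information :)
db:info('factbook')//uptodate
</pre>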
=Changelog=
 
;Version 8.4
 
* Updated: [[#Name Index|Name Index]], [[#Path Index|Path Index]]
;Version 8.4
* Added: string-based range queries
 
[[Category:Internals]]