Changes

Indexes (edit)

Revision as of 12:43, 2 July 2020

For example, the name index is applied to discard location steps that will never yield results:

<syntaxhighlight lang="xquery">
(: will be rewritten to an empty sequence :)
/non-existing-name
</syntaxhighlight>

The contents of the name indexes can be directly accessed with the XQuery functions [[Index Module#index:element-names|index:element-names]] and [[Index Module#index:attribute-names|index:attribute-names]].
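
As a minimal sketch, assuming a database named {{Code|factbook}} exists, the element names could be listed together with their numbers of occurrences as follows. The results are expected to be {{Code|entry}} elements that carry a {{Code|count}} attribute:

<syntaxhighlight lang="xquery">
(: assumption: the 'factbook' database exists :)
(: list element names, most frequent first :)
for $entry in index:element-names('factbook')
order by xs:integer($entry/@count) descending
return $entry
</syntaxhighlight>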

* Descendant steps will be rewritten to multiple child steps. Child steps are evaluated faster, as fewer nodes have to be traversed:

<syntaxhighlight lang="xquery">
doc('factbook.xml')//province,
(: ...will be rewritten to... :)
doc('factbook.xml')/mondial/country/province
</syntaxhighlight>

* The {{Code|fn:count}} function will be pre-evaluated by looking up the number in the index:

<syntaxhighlight lang="xquery">
count(doc('factbook')//country)
</syntaxhighlight>

* The distinct values of elements or attributes can be looked up in the index as well:

<syntaxhighlight lang="xquery">
distinct-values(db:open('factbook')//religions)
</syntaxhighlight>

The contents of the path index can be directly accessed with the XQuery function [[Index Module#index:facets|index:facets]].
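
A short sketch of such a lookup, again assuming a {{Code|factbook}} database. The first call is expected to return the path summary as a tree, the second one a flat, summarized version:

<syntaxhighlight lang="xquery">
(: assumption: the 'factbook' database exists :)
(: path summary with statistics, as a tree :)
index:facets('factbook'),
(: the same information in a flat representation :)
index:facets('factbook', 'flat')
</syntaxhighlight>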

With XQuery, index structures can be created and dropped via {{Function|Database|db:optimize}}:

<syntaxhighlight lang="xquery">
(: Optimize specified database, create full-text index for texts of the specified elements :)
db:optimize(
  'factbook',
  false(),
  map { 'ftindex': true(), 'ftinclude': 'p div' }
)
</syntaxhighlight>

==Text Index==

This index references text nodes of documents. It will be utilized to accelerate string comparisons in path expressions. The following queries will all be rewritten for index access:

<syntaxhighlight lang="xquery">
(: example 1 :)
//*[text() = 'Germany'],
(: example 2 :)
for $c in db:open('factbook')//country
where $c//city/name = 'Hanoi'
return $c/name
</syntaxhighlight>

Before the actual index rewriting takes place, some preliminary optimizations are applied:

The {{Option|UPDINDEX}} option can be enabled to keep this index up-to-date:

<syntaxhighlight lang="xquery">
db:optimize(
  'mydb',
  (: second argument: optimize all database structures :)
  true(),
  map { 'updindex': true(), 'textindex': true(), 'textinclude': 'id' }
)
</syntaxhighlight>

===Range Queries===

The text index also supports range queries based on string comparisons:

<syntaxhighlight lang="xquery">
(: example 1 :)
db:open('Library')//Medium[Year >= '2011' and Year <= '2016'],
(: example 2 :)
let $min := '2014-04-16T00:00:00'
let $max := '2014-04-19T23:59:59'
return db:open('news')//entry[date-time > $min and date-time < $max]
</syntaxhighlight>

With {{Function|Database|db:text-range}}, you can access all text nodes whose values are between a minimum and maximum value.
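
As a hedged sketch, reusing the {{Code|Library}} database from the example above: the call below requests all text nodes with a string value between {{Code|'2011'}} and {{Code|'2016'}} and steps up to their parent elements. Unlike the path expression, the index lookup is not restricted to {{Code|Year}} elements, so the parent step may yield other elements as well:

<syntaxhighlight lang="xquery">
(: assumption: 'Library' exists and has an up-to-date text index :)
(: the comparison is string-based, not numeric :)
db:text-range('Library', '2011', '2016')/..
</syntaxhighlight>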

==Attribute Index==

Similar to the text index, this index speeds up string and range comparisons on attribute values. Additionally, the XQuery function {{Code|fn:id}} takes advantage of the index whenever possible. The following queries will all be rewritten for index access:

<syntaxhighlight lang="xquery">
(: 1st example :)
//country[@car_code = 'J'],
(: 4th example :)
fn:id('f0_119', db:open('factbook'))
</syntaxhighlight>

''Attribute nodes'' (which you can use as starting points of navigation) can directly be retrieved from the index with the XQuery functions {{Function|Database|db:attribute}} and {{Function|Database|db:attribute-range}}. The index contents (''strings'') can be accessed with {{Function|Index|index:attributes}}.
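
The direct index access could look as follows. This is a sketch only: it assumes a {{Code|factbook}} database with an attribute index, and the {{Code|depth}} attribute and its bounds are chosen purely for illustration:

<syntaxhighlight lang="xquery">
(: elements with a car_code attribute that has the value 'J' :)
db:attribute('factbook', 'J', 'car_code')/..,
(: elements whose depth attribute lies between the two strings
   (hypothetical attribute name and values; string-based comparison) :)
db:attribute-range('factbook', '2100', '4000', 'depth')/..
</syntaxhighlight>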

In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values. The token index can be created to speed up the retrieval of these tokens. The XQuery functions {{Code|fn:contains-token}}, {{Code|fn:tokenize}} and {{Code|fn:idref}} are rewritten for index access whenever possible. If a token index exists, it will e.g. be utilized for the following queries:

<syntaxhighlight lang="xquery">
(: 1st example :)
//div[contains-token(@class, 'row')],
(: 3rd example :)
doc('graph.xml')/idref('edge8')
</syntaxhighlight>

''Attribute nodes'' with a matching value (containing at least one from a set of given tokens) can be directly retrieved from the index with the XQuery function {{Function|Database|db:token}}. The index contents (''token strings'') can be accessed with {{Function|Index|index:tokens}}.
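
A minimal sketch of such a lookup, assuming a hypothetical database named {{Code|site}} that contains HTML documents and has a token index:

<syntaxhighlight lang="xquery">
(: 'site' is a hypothetical database name :)
(: div elements with a class attribute that contains the token 'row' :)
db:token('site', 'row', 'class')/parent::div,
(: distinct tokens stored in the index :)
index:tokens('site')
</syntaxhighlight>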

==Full-Text Index==

The [[Full-Text]] index contains the normalized tokens of text nodes of a document. It is utilized to speed up queries with the {{Code|contains text}} expression, and it is capable of processing wildcard and fuzzy search operations. Three evaluation strategies are available: the standard sequential database scan, a full-text index based evaluation and a hybrid one, combining both strategies (see [https://files.basex.org/publications/Gruen%20et%20al.%20%5B2009%5D,%20XQuery%20Full%20Text%20Implementation%20in%20BaseX.pdf XQuery Full Text implementation in BaseX]).

If the full-text index exists, the following queries will all be rewritten for index access:

<syntaxhighlight lang="xquery">
(: 1st example :)
//country[name/text() contains text 'and'],
//religions[.//text() contains text { 'Catholic', 'Roman' }
  using case insensitive distance at most 2 words]
</syntaxhighlight>

The index provides support for the following full-text features (the values can be changed in the GUI or via the {{Command|SET}} command):

The options that have been used for creating the full-text index will also be applied to the optimized full-text queries. However, the defaults can be overwritten if you supply options in your query. For example, if words were stemmed in the index, and if the query can be rewritten for index access, the query terms will be stemmed as well, unless stemming is explicitly disabled. This is demonstrated in the following [[Commands#Command_Scripts|Command Script]]:

<syntaxhighlight lang="xml">
<commands>
  <xquery> /text[. contains text { 'houses' } using no stemming] </xquery>
</commands>
</syntaxhighlight>

Text nodes can be directly requested from the index via the XQuery function {{Function|Full-Text|ft:search}}. The index contents can be accessed with {{Function|Full-Text|ft:tokens}}.
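
As a sketch, assuming a {{Code|factbook}} database with a full-text index, a direct index request could be written like this. The search terms and the {{Code|mode}} option are chosen for illustration:

<syntaxhighlight lang="xquery">
(: parents of all text nodes that contain the token 'Catholic' :)
ft:search('factbook', 'Catholic')/..,
(: text nodes that contain both terms :)
ft:search('factbook', ('Roman', 'Catholic'), map { 'mode': 'all' })
</syntaxhighlight>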

; Commands

<syntaxhighlight lang="xml">
SET ATTRINCLUDE id,name
CREATE DB factbook http://files.basex.org/xml/factbook.xml
# Restore default
SET ATTRINCLUDE
</syntaxhighlight>

; XQuery

<syntaxhighlight lang="xquery">
db:create('factbook', 'http://files.basex.org/xml/factbook.xml', '',
  map { 'attrinclude': 'id,name' })
</syntaxhighlight>

With {{Command|CREATE INDEX}} and {{Function|Database|db:optimize}}, new selective indexing options will be applied to an existing database.

* In the query below, 10 databases will be addressed. If it is known in advance that these databases contain an up-to-date text index, the index rewriting can be enforced as follows:

<syntaxhighlight lang="xquery">
(# db:enforceindex #) {
  for $n in 1 to 10
  (: assumption: the databases are named persons1 to persons10 :)
  let $db := 'persons' || $n
  return db:open($db)//person[name/text() = 'John']
}
</syntaxhighlight>

* The following query contains two predicates that may both be rewritten for index access. If the automatically chosen rewriting is known not to be optimal, another index rewriting can be enforced by surrounding the specific expression with the pragma:

<syntaxhighlight lang="xquery">
db:open('factbook')//country
  [(# db:enforceindex #) {
  }]
  [religions/text() = 'Protestant']
</syntaxhighlight>

The option can also be assigned to predicates with dynamic values. In the following example, the first comparison will be rewritten for index access. Without the pragma expression, the second comparison is preferred and chosen for the rewriting, because the statically known string allows for an exact cost estimation:

<syntaxhighlight lang="xquery">
for $name in ('Germany', 'Italy')
for $country in db:open('factbook')//country
where (# db:enforceindex #) { $country/name = $name }
where $country/religions/text() = 'Protestant'
return $country
</syntaxhighlight>

Please note that:

With XQuery, it is comparatively easy to create your own, custom index structures. The following query demonstrates how you can create a {{Code|factbook-index}} database, which contains all texts of the original database in lower case:

<syntaxhighlight lang="xquery">
let $db := 'factbook'
return db:create($db || '-index', $index, $db || '-index.xml')
</syntaxhighlight>
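
The query above refers to {{Code|$index}} without showing how it is built. One possible construction is sketched here: it groups all text nodes by their lower-case string value and stores the node ids of the matching text nodes. The element names {{Code|index}} and {{Code|text}} are assumptions. The {{Code|string}} attribute and the {{Code|id}} child elements follow from the lookup query in the next example:

<syntaxhighlight lang="xquery">
let $db := 'factbook'
let $index := <index>{
  (: group all text nodes by their lower-case string value :)
  for $nodes in db:open($db)//text()
  group by $text := lower-case($nodes)
  return <text string='{ $text }'>{
    (: store the node id of each text node with this value :)
    for $node in $nodes
    return <id>{ db:node-id($node) }</id>
  }</text>
}</index>
return db:create($db || '-index', $index, $db || '-index.xml')
</syntaxhighlight>

Because the string values are stored in attributes, the attribute index of the new database can speed up the lookup of {{Code|@string}}.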

In the following query, a text string is searched, and the text nodes of the original database are retrieved:

<syntaxhighlight lang="xquery">
let $db := 'factbook'
let $text := 'italian'
for $id in db:open($db || '-index')//*[@string = $text]/id
return db:open-id($db, $id)/..
</syntaxhighlight>

With some extra effort, and if {{Option|UPDINDEX}} is enabled for both your original and your index database (see below), your index database will support updates as well (try it, it’s fun!).

If {{Option|DEBUG}} is enabled, the command-line output might help you to find a good split size. The following example shows the output for creating a database for an XMark document with 1 GB, and with 128 MB assigned to the JVM:

<syntaxhighlight>
> basex -d -c"SET FTINDEX ON; SET TOKENINDEX ON; CREATE DB xmark 1gb.xml"
Creating Database...
Indexing Full-Text...
..|.|.|.|...|...|..|.|..| 116.33 M operations, 138740.94 ms (106 MB). Recommended SPLITSIZE: 12.
</syntaxhighlight>

The output can be interpreted as follows:
