Changes

Jump to navigation Jump to search

XQuery Optimizations (edit)

Revision as of 18:52, 18 November 2020

8,929 bytes added , 18:52, 18 November 2020

→‎FLWOR Rewritings

This article is part of the [[XQuery|XQuery Portal]]. Optimizations are presented that speed up the execution time and reduce memory consumption. The ~~article~~ text will be ~~incrementally enhanced~~ regularly extended with ~~more~~ further examples.

=Introduction=

# At parse time, the query string – an XQuery main module – is transformed to a tree representation, the ''abstract syntax tree'' (AST).

# At compile time, the syntax tree is decorated with additional information (type information, expression properties); expressions are relocated, simplified, or pre-evaluated:## Logical optimizations are ''context-independent''. They can be applied no matter which data will be processed later on.## Physical optimizations rely on context information, such as database statistics or available indexes.

# At evaluation time, the resulting expression tree is processed.

# The results are returned to the user. Some expression (such as simple loops) can be evaluated in iterative manner, whereas others (such as sort operations) need to be fully evaluated before the first result is available.

If you run a query on [[Command-Line_Options#Standalone|command-line]], you can use {{Code|-V}} to output detailed query information. In the [[GUI]], you can enable the Info View panel.

=~~Compile-Time~~ Logical Optimizations=

==Pre-Evaluation==

return 2 * 3

(: ~~will be~~ rewritten to :)

for $i in 1 to 10

return 6

return 'Results: ' || count($nodes)

(: ~~will be~~ rewritten to :)

let $nodes := //nodes

where true()

return 'Results: ' || count($nodes)

(: ~~will be~~ rewritten to :)

let $nodes := //nodes

return 'Results: ' || count($nodes)

(: ~~will be~~ rewritten to :)

'Results: ' || count(//nodes)

</syntaxhighlight>

return local:inc($n)

(: ~~will be~~ rewritten to :)

for $n in 1 to 5

return (

)

(: ~~will be~~ rewritten to :)

for $n in 1 to 5

return $n + 1

Subsequent rewritings might result in query plans that differ a lot from the original query. As this might complicate debugging, you can disable function inling during development by setting {{Option|INLINELIMIT}} to {{Code|0}}.

==~~Static Typing~~Paths== Due to the compact syntax of XPath, it can make a big difference if a slash is added or omitted in a path expression. A classical example is the double slash {{Code|//}}, which is a shortcut for {{Code|descendant-or-node()/}}. If the query is evaluated without optimizations, all nodes of a document are gathered, and for each of them, the next step is evaluated. This leads to a potentially huge number of duplicate node tree traversals, most of which are redundant, as all duplicate nodes will be removed at the end anyway. In most cases, paths with a double slash can be rewritten to descendant steps… <syntaxhighlight lang="xquery">(: equivalent queries, with identical syntax trees :)doc('addressbook.xml')//city,doc('addressbook.xml')/descendant-or-node()/child::city (: rewritten to :)doc('addressbook.xml')/descendant::city</syntaxhighlight> …unless the last step does not contain a positional predicate: <syntaxhighlight lang="xquery">doc('addressbook.xml')//city[1]</syntaxhighlight> As the positional test refers to the city child step, a rewritten query would yield different steps. Paths may contain predicates that will be evaluated again by a later axis step. Such predicates are either shifted down or discarded: <syntaxhighlight lang="xquery">(: equivalent query :)a[b]/b[c/d]/c (: rewritten to :)a/b/c[d]</syntaxhighlight>

~~If the type~~ Names of ~~a value is known at compile time, type checks~~ nodes can be ~~removed~~specified via name tests or predicates. ~~In the example below~~If names are e.g. supplied via external variables, the ~~static information that {{Code|$i}} will always reference items of type {{Code|xs:integer}}~~ predicates can often be ~~utilized to simplify the expression~~dissolved:

<syntaxhighlight lang="xquery">

~~for~~ declare variable $~~i in 1 to 5return typeswitch($i)~~ ~~case xs~~name external :~~numeric return~~ = '~~number~~city'; ~~default return~~ db:open('~~string~~addressbook')/descendant::*[name() = $name]

(: ~~will be~~ rewritten to :)~~for $i in 1 to 5return~~ db:open('~~number~~addressbook')/descendant::city

</syntaxhighlight>

* {{Code|where}} clauses are rewritten to predicates.

* {{Code|if}} expressions in the return clause are rewritten to {{Code|where}} clauses.

~~Since {{Version|9.4}}, the~~ * The last {{Code|for}} clause is merged into the {{Code|return}} clause and rewritten to a [[XQuery_3.0|Simple_Map_Operator|simple map]] expression.

Various of these rewriting are demonstrated in the following example:

return $c

(: the where expression is rewritten to a predicate :)

let $b := 2

for $a in 1 to 5[. > 3]

return $a + 2

(: ~~for clause is~~ the remaining clauses are merged ~~into return clause~~ and rewritten to a simple map :)

(1 to 5)[. > 3] ! (. + 2)

</syntaxhighlight>

==Static Typing==

If the type of a value is known at compile time, type checks can be removed. In the example below, the static information that {{Code|$i}} will always reference items of type {{Code|xs:integer}} can be utilized to simplify the expression:

<syntaxhighlight lang="xquery">

for $i in 1 to 5

return typeswitch($i)

case xs:numeric return 'number'

default return 'string'

(: rewritten to :)

for $i in 1 to 5

return 'number'

</syntaxhighlight>

==Pure Logic==

If expressions can often be simplified:

<syntaxhighlight lang="xquery">

for $a in ('a', '')

return $a[boolean(if(.) then true() else false())]

(: rewritten to :)

for $a in ('a', '')

return $a[boolean(.)]

(: rewritten to :)

for $a in ('a', '')

return $a[.]

(: rewritten to :)

('a', '')[.]

</syntaxhighlight>

Boolean algebra (and set theory) comes with a set of laws that can all be applied to XQuery expressions.

{| class="wikitable"

|- valign="top"

| Expression

| Rewritten expression

| Rule

|- valign="top"

| <code>$a + 0</code>, <code>$a * 1</code>

| <code>$a</code>

| Identity

|- valign="top"

| <code>$a * 0</code>

| <code>0</code>

| Annihilator

|- valign="top"

| <code>$a and $a</code>

| <code>$a</code>

| Idempotence

|- valign="top"

| <code>$a and ($a or $b)</code>

| <code>$a</code>

| Absorption

|- valign="top"

| <code>($a and $b) or ($a and $c)</code>

| <code>$a and ($b or $c)</code>

| Distributivity

|- valign="top"

| <code>$a or not($a)</code>

| <code>true()</code>

| Tertium non datur

|- valign="top"

| <code>not($a) and not($b)</code>

| <code>not($a or $b)</code>

| De Morgan

|}

It is not sufficient to apply the rules to arbitrary input. Examples:

* If the operands are no boolean values, a conversion is enforced: <code>$string and $string</code> is rewritten to <code>boolean($string)</code>.

* <code>xs:double('NaN') * 0</code> yields <code>NaN</code> instead of <code>0</code>

* <code>true#0 and true#0</code> must raise an error; it cannot be simplified to <code>true#0</code>

=Physical Optimizations=

Some physical optimizations are also presented in the article on [[Indexes|index structures]].

==Database Statistics==

In each database, metadata is stored that can be utilized by the query optimizer to speed up or even skip query evaluation:

;Count element nodes

The number of elements that are found for a specific path need not be evaluated sequentially. Instead, the count can directly be retrieved from the database statistics:

<syntaxhighlight lang="xquery">

count(/mondial/country)

(: rewritten to :)

231

</syntaxhighlight>

;Return distinct values

The distinct values for specific names and paths can also be fetched from the database metadata, provided that the number does not exceed the maximum number of distinct values (see {{Option|MAXCATS}} for more information):

<syntaxhighlight lang="xquery">

distinct-values(//religions)

(: rewritten to :)

('Muslim', 'Roman Catholic', 'Albanian Orthodox', ...)

</syntaxhighlight>

==Index Rewritings==

A major feature of BaseX is the ability to rewrite all kinds of query patterns for index access.

The following queries are all equivalent. They will be rewritten to exactly the same query that will eventually access the text index of a <code>factbook.xml</code> database instance (the file included in our full distributions):

<syntaxhighlight lang="xquery">

declare context item := db:open('factbook');

declare variable $DB := 'factbook';

//name[. = 'Shenzhen'],

//name[data() = 'Shenzhen'],

//name[./text() = 'Shenzhen'],

//name[text()[. = 'Shenzhen']],

//name[string() = 'Shenzhen'],

//name[string() = 'Shen' || 'zhen'],

//name[./data(text()/string()) = 'Shenzhen'],

//name[text() ! data() ! string() = 'Shenzhen'],

.//name[. = 'Shenzhen'],

//*[local-name() = 'name'][data() = 'Shenzhen'],

db:open('factbook')//name[. = 'Shenzhen'],

db:open($DB)//name[. = 'Shenzhen'],

for $name in //name[text() = 'Shenzhen']

return $name,

for $name in //name

return $name[text() = 'Shenzhen'],

for $name in //name

return if($name/text() = 'Shenzhen') then $name else (),

for $name in //name

where $name/text() = 'Shenzhen'

return $name,

for $name in //name

where $name/text()[. = 'Shenzhen']

return $name,

for $node in //*

where data($node) = 'Shenzhen'

where name($node) = 'name'

return $node,

(: rewritten to :)

db:text('factbook', 'Shenzhen')/parent::name

</syntaxhighlight>

Multiple element names and query strings can be supplied in a path:

<syntaxhighlight lang="xquery">

//*[(ethnicgroups, religions)/text() = ('Jewish', 'Muslim')]

(: rewritten to :)

db:text('factbook', ('Jewish', 'Muslim'))/(parent::*:ethnicgroups | parent::*:religions)/parent::*

</syntaxhighlight>

If multiple candidates for index access are found, the database statistics (if available) are consulted to choose the cheapest candidate:

<syntaxhighlight lang="xquery">

/mondial/country

[religions = 'Muslim'] (: yields 77 results :)

[ethnicgroups = 'Greeks'] (: yields 2 results :)

(: rewritten to :)

db:text('factbook', 'Greeks')/parent::ethnicgroups/parent::country[religions = 'Muslim']

</syntaxhighlight>

If index access is possible within more complex FLWOR expressions, only the paths will be rewritten:

<syntaxhighlight lang="xquery">

for $country in //country

where $country/ethnicgroups = 'German'

order by $country/name[1]

return element { replace($country/@name, ' ', '') } {},

(: rewritten to :)

for $country in db:text('factbook', 'German')/parent::ethnicgroups/parent::country

order by $country/name[1]

return element { replace($country/@name, ' ', '') } {}

</syntaxhighlight>

The [https://projects.cwi.nl/xmark/ XMark XML Benchmark] comes with sample auction data and a bunch of queries, some of which are suitable for index rewritings:

;XMark Query 1

<syntaxhighlight lang="xquery">

let $auction := doc('xmark')

return for $b in $auction/site/people/person[@id = 'person0']

return $b/name/text()

(: rewritten to :)

db:attribute('xmark', 'person0')/self::attribute(id)/parent::person/name/text()

</syntaxhighlight>

;XMark Query 8

<syntaxhighlight lang="xquery">

let $auction := doc('xmark')

return

for $p in $auction/site/people/person

let $a :=

for $t in $auction/site/closed_auctions/closed_auction

where $t/buyer/@person = $p/@id

return $t

return <item person='{ $p/name/text() }'>{ count($a) }</item>,

(: rewritten to :)

db:open('xmark')/site/people/person !

<item person='{ name/text() }'>{ count(

db:attribute('xmark', @id)/self::attribute(person)/parent::buyer/parent::closed_auction

)

}</item>

</syntaxhighlight>

If the accessed database is not known at compile time, or if you want to give a predicate preference to another one, you can [[Indexes#Enforce Rewritings|enforce index rewritings]].

=Evaluation-Time Optimizations=

==Comparisons==

In many cases, the amount of data to be processed is only known after the query has been compiled. Moreover, the data that is looped through expressions may change. In those cases, the best optimizations needs to be chosen at runtime.

If sequences of items are compared against each other, a dynamic hash index will be generated, and the total number of comparisons can be significantly reduced. In the following example, <code>count($input1) * count($input2)</code> comparisons would need to be made without the intermediate index structure:

<syntaxhighlight lang="xquery">

let $input1 := file:read-text-lines('huge1.txt')

let $input2 := file:read-text-lines('huge2.txt')

return $input1[not(. = $input2)]

</syntaxhighlight>

Bureaucrats, editor, reviewer, Administrators

13,550

edits

Retrieved from "http://docs.basex.org/wiki/Special:MobileDiff/15202...15323"