Difference between revisions of "XQuery Optimizations"

From BaseX Documentation
Jump to navigation Jump to search
m (Text replacement - "<syntaxhighlight lang="xquery">" to "<pre lang='xquery'>")
 
(48 intermediate revisions by the same user not shown)
Line 1: Line 1:
This article is part of the [[XQuery|XQuery Portal]]. Optimizations are presented that speed up the execution time and reduce memory consumption.
+
This article is part of the [[XQuery|XQuery Portal]]. It presents some of the optimizations that speed up the execution and reduce memory consumption of queries.
 
 
The article will be further extended with each version of BaseX.
 
  
 
=Introduction=
 
=Introduction=
  
An XQuery expression is evaluated in multiple steps:
+
Query execution encompasses multiple steps:
  
# At parse time, the query string – an XQuery main module – is transformed to a tree representation, the ''abstract syntax tree'' (AST).
+
# '''Parsing''': The query input string is transformed to executable code. The result is a tree representation, called the ''abstract syntax tree'' (AST).
# At compile time, the syntax tree is decorated with additional information (type information, expression properties); expressions are relocated, simplified, or pre-evaluated.
+
# '''Compilation''': The syntax tree is decorated with additional information (type information, expression properties). Expressions (nodes) in the tree are relocated, simplified, or pre-evaluated. Logical optimizations are performed that do not rely on external information.
# At evaluation time, the resulting expression tree is processed.
+
# '''Optimization''': The dynamic context is incorporated: Referenced databases are opened and analyzed; queries are rewritten to use available indexes; accumulative and statistical operations (counts, summations, min/max, distinct values) are pre-evaluated; XPath expressions are simplified, based on the existence of steps.
# The results are returned to the user. Some expression (such as simple loops) can be evaluated in iterative manner, whereas others (such as sort operations) need to be fully evaluated before the first result is available.
+
# '''Evaluation''': The resulting code is executed.
 +
# '''Printing''': The query result is serialized and presented in a format that is either human-readable, or can be further processed by an API.
  
Each of the steps allows for numerous optimizations, some of which are described in this article.
+
Some rewritings are described in this article.
  
 
If you run a query on [[Command-Line_Options#Standalone|command-line]], you can use {{Code|-V}} to output detailed query information. In the [[GUI]], you can enable the Info View panel.
 
If you run a query on [[Command-Line_Options#Standalone|command-line]], you can use {{Code|-V}} to output detailed query information. In the [[GUI]], you can enable the Info View panel.
  
=Compile-Time Optimizations=
+
=Compilation=
  
 
==Pre-Evaluation==
 
==Pre-Evaluation==
Line 22: Line 21:
 
Parts of the query that are static and would be executed multiple times can already be evaluated at compile time:
 
Parts of the query that are static and would be executed multiple times can already be evaluated at compile time:
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
 
for $i in 1 to 10
 
for $i in 1 to 10
 
return 2 * 3
 
return 2 * 3
Line 29: Line 28:
 
for $i in 1 to 10
 
for $i in 1 to 10
 
return 6
 
return 6
</syntaxhighlight>
+
</pre>
  
 
==Variable Inlining==
 
==Variable Inlining==
Line 35: Line 34:
 
The value of a variable can be ''inlined'': The variables references are replaced by the expression that is bound to the variable. The resulting expression can often be simplified, and further optimizations can be triggered:
 
The value of a variable can be ''inlined'': The variables references are replaced by the expression that is bound to the variable. The resulting expression can often be simplified, and further optimizations can be triggered:
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
 
declare variable $INFO := true();
 
declare variable $INFO := true();
  
Line 53: Line 52:
 
(: rewritten to :)
 
(: rewritten to :)
 
'Results: ' || count(//nodes)
 
'Results: ' || count(//nodes)
</syntaxhighlight>
+
</pre>
  
 
As the example shows, variable declarations might be located in the query prolog and in FLWOR expressions. They may also occur (and be inlined) in {{Code|try}}/{{Code|catch}}, {{Code|switch}} or {{Code|typeswitch}} expressions.
 
As the example shows, variable declarations might be located in the query prolog and in FLWOR expressions. They may also occur (and be inlined) in {{Code|try}}/{{Code|catch}}, {{Code|switch}} or {{Code|typeswitch}} expressions.
Line 61: Line 60:
 
Functions can be inlined as well. The parameters are rewitten to {{Code|let}} clauses and the function is body is bound to the {{Code|return}} clause.
 
Functions can be inlined as well. The parameters are rewitten to {{Code|let}} clauses and the function is body is bound to the {{Code|return}} clause.
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
 
declare function local:inc($i) { $i + 1 };
 
declare function local:inc($i) { $i + 1 };
 
for $n in 1 to 5
 
for $n in 1 to 5
Line 76: Line 75:
 
for $n in 1 to 5
 
for $n in 1 to 5
 
return $n + 1
 
return $n + 1
</syntaxhighlight>
+
</pre>
  
 
Subsequent rewritings might result in query plans that differ a lot from the original query. As this might complicate debugging, you can disable function inling during development by setting {{Option|INLINELIMIT}} to {{Code|0}}.
 
Subsequent rewritings might result in query plans that differ a lot from the original query. As this might complicate debugging, you can disable function inling during development by setting {{Option|INLINELIMIT}} to {{Code|0}}.
  
==Static Typing==
+
==Loop Unrolling==
 +
 
 +
Loops with few iterations are ''unrolled'' by the XQuery compiler to enable further optimizations:
 +
 
 +
<pre lang='xquery'>
 +
(1 to 2) ! (. * 2)
 +
 
 +
(: rewritten to :)
 +
1 ! (. * 2), 2 ! (. * 2)
 +
 
 +
(: further rewritten to :)
 +
1 * 2, 2 * 2
 +
 
 +
(: further rewritten to :)
 +
2, 4
 +
</pre>
 +
 
 +
Folds are unrolled, too:
 +
 
 +
<pre lang='xquery'>
 +
let $f := function($a, $b) { $a * $b }
 +
return fold-left(2 to 5, 1, $f)
 +
 
 +
(: rewritten to :)
 +
let $f := function($a, $b) { $a * $b }
 +
return $f($f($f($f(1, 2), 3), 4), 5)
 +
</pre>
 +
 
 +
The standard unroll limit is <code>5</code>. It can be adjusted with the {{Option|UNROLLLIMIT}} option, e.g. via a pragma:
 +
 
 +
<pre lang='xquery'>
 +
(# db:unrolllimit 10 #) {
 +
  for $i in 1 to 10
 +
  return db:get('db' || $i)//*[text() = 'abc']
 +
}
 +
 
 +
(: rewritten to :)
 +
db:get('db1')//*[text() = 'abc'],
 +
db:get('db2')//*[text() = 'abc'],
 +
...
 +
db:get('db10')//*[text() = 'abc'],
 +
</pre>
 +
 
 +
The last example indicates that index rewritings might be triggered by unrolling loops with paths on database nodes.
 +
 
 +
The following expressions can be unrolled:
 +
 
 +
* Simple map expressions
 +
* Simple FLWOR expressions
 +
* Filter expressions
 +
* [[Higher-Order Functions#fn:fold-left|fn:fold-left]], [[Higher-Order Functions#fn:fold-right|fn:fold-right]], {{Function|Higher-Order Functions|fn:fold-left1}}
 +
 
 +
Care should be taken if a higher value is selected, as memory consumption and compile time will increase.
 +
 
 +
==Paths==
 +
 
 +
Due to the compact syntax of XPath, it can make a big difference if a slash is added or omitted in a path expression. A classical example is the double slash {{Code|//}}, which is a shortcut for {{Code|descendant-or-node()/}}. If the query is evaluated without optimizations, all nodes of a document are gathered, and for each of them, the next step is evaluated. This leads to a potentially huge number of duplicate node tree traversals, most of which are redundant, as all duplicate nodes will be removed at the end anyway.
 +
 
 +
In most cases, paths with a double slash can be rewritten to descendant steps…
 +
 
 +
<pre lang='xquery'>
 +
(: equivalent queries, with identical syntax trees :)
 +
doc('addressbook.xml')//city,
 +
doc('addressbook.xml')/descendant-or-self::node()/child::city
 +
 
 +
(: rewritten to :)
 +
doc('addressbook.xml')/descendant::city
 +
</pre>
 +
 
 +
…unless the last step does not contain a positional predicate:
 +
 
 +
<pre lang='xquery'>
 +
doc('addressbook.xml')//city[1]
 +
</pre>
 +
 
 +
As the positional test refers to the city child step, a rewritten query would yield different steps.
 +
 
 +
Paths may contain predicates that will be evaluated again by a later axis step. Such predicates are either shifted down or discarded:
 +
 
 +
<pre lang='xquery'>
 +
(: equivalent query :)
 +
a[b]/b[c/d]/c
 +
 
 +
(: rewritten to :)
 +
a/b/c[d]
 +
</pre>
  
If the type of a value is known at compile time, type checks can be removed. In the example below, the static information that {{Code|$i}} will always reference items of type {{Code|xs:integer}} can be utilized to simplify the expression:
+
Names of nodes can be specified via name tests or predicates. If names are e.g. supplied via external variables, the predicates can often be dissolved:
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
for $i in 1 to 5
+
declare variable $name external := 'city';
return typeswitch($i)
+
db:get('addressbook')/descendant::*[name() = $name]
  case xs:numeric return 'number'
 
  default return 'string'
 
  
 
(: rewritten to :)
 
(: rewritten to :)
for $i in 1 to 5
+
db:get('addressbook')/descendant::city
return 'number'
+
</pre>
</syntaxhighlight>
 
  
 
==FLWOR Rewritings==
 
==FLWOR Rewritings==
Line 106: Line 187:
 
* {{Code|where}} clauses are rewritten to predicates.
 
* {{Code|where}} clauses are rewritten to predicates.
 
* {{Code|if}} expressions in the return clause are rewritten to {{Code|where}} clauses.
 
* {{Code|if}} expressions in the return clause are rewritten to {{Code|where}} clauses.
 
+
* The last {{Code|for}} clause is merged into the {{Code|return}} clause and rewritten to a [[XQuery_3.0#Simple_Map_Operator|simple map]] expression.
Since {{Version|9.4}}, the last {{Code|for}} clause is merged into the {{Code|return}} clause and rewritten to a [[XQuery_3.0|Simple_Map_Operator|simple map]] expression.
 
  
 
Various of these rewriting are demonstrated in the following example:
 
Various of these rewriting are demonstrated in the following example:
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
for $a in 1 to 5
+
for $a in 1 to 10
 
for $b in 2
 
for $b in 2
 
where $a > 3
 
where $a > 3
Line 119: Line 199:
  
 
(: for is rewritten to let :)
 
(: for is rewritten to let :)
for $a in 1 to 5
+
for $a in 1 to 10
 
let $b := 2
 
let $b := 2
 
where $a > 3
 
where $a > 3
Line 127: Line 207:
 
(: let is lifted up :)
 
(: let is lifted up :)
 
let $b := 2
 
let $b := 2
for $a in 1 to 5
+
for $a in 1 to 10
 
where $a > 3
 
where $a > 3
 
let $c := $a + $b
 
let $c := $a + $b
Line 134: Line 214:
 
(: the where expression is rewritten to a predicate :)
 
(: the where expression is rewritten to a predicate :)
 
let $b := 2
 
let $b := 2
for $a in 1 to 5[. > 3]
+
for $a in (1 to 10)[. > 3]
 
let $c := $a + $b
 
let $c := $a + $b
 
return $c
 
return $c
  
 
(: $b is inlined :)
 
(: $b is inlined :)
for $a in 1 to 5[. > 3]
+
for $a in (1 to 10)[. > 3]
 
let $c := $a + 2
 
let $c := $a + 2
 
return $c
 
return $c
  
 
(: $c is inlined :)
 
(: $c is inlined :)
for $a in 1 to 5[. > 3]
+
for $a in (1 to 10)[. > 3]
 
return $a + 2
 
return $a + 2
  
 
(: the remaining clauses are merged and rewritten to a simple map :)
 
(: the remaining clauses are merged and rewritten to a simple map :)
(1 to 5)[. > 3] ! (. + 2)
+
(1 to 10)[. > 3] ! (. + 2)
</syntaxhighlight>
+
</pre>
 +
 
 +
==Static Typing==
 +
 
 +
If the type of a value is known at compile time, type checks can be removed. In the example below, the static information that {{Code|$i}} will always reference items of type {{Code|xs:integer}} can be utilized to simplify the expression:
 +
 
 +
<pre lang='xquery'>
 +
for $i in 1 to 5
 +
return typeswitch($i)
 +
  case xs:numeric return 'number'
 +
  default return 'string'
 +
 
 +
(: rewritten to :)
 +
for $i in 1 to 5
 +
return 'number'
 +
</pre>
 +
 
 +
==Pure Logic==
 +
 
 +
If expressions can often be simplified:
 +
 
 +
<pre lang='xquery'>
 +
for $a in ('a', '')
 +
return $a[boolean(if(.) then true() else false())]
 +
 
 +
(: rewritten to :)
 +
for $a in ('a', '')
 +
return $a[boolean(.)]
 +
 
 +
(: rewritten to :)
 +
for $a in ('a', '')
 +
return $a[.]
  
==Logic==
+
(: rewritten to :)
 +
('a', '')[.]
 +
</pre>
  
 
Boolean algebra (and set theory) comes with a set of laws that can all be applied to XQuery expressions.
 
Boolean algebra (and set theory) comes with a set of laws that can all be applied to XQuery expressions.
Line 181: Line 294:
 
| Distributivity
 
| Distributivity
 
|- valign="top"
 
|- valign="top"
| <code>$a and not($a)</code>
+
| <code>$a or not($a)</code>
 
| <code>true()</code>
 
| <code>true()</code>
 
| Tertium non datur
 
| Tertium non datur
Line 196: Line 309:
 
* <code>true#0 and true#0</code> must raise an error; it cannot be simplified to <code>true#0</code>
 
* <code>true#0 and true#0</code> must raise an error; it cannot be simplified to <code>true#0</code>
  
==Paths==
+
=Optimization=
 +
 
 +
Some physical optimizations are also presented in the article on [[Indexes|index structures]].
  
Due to the compact syntax of XPath, it can make a big difference if a slash is added or omitted in a path expression. A classical example is the double slash {{Code|//}}, which is a shortcut for {{Code|descendant-or-node()/}}. If the query is evaluated without optimizations, all nodes of a document are gathered, and for each of them, the next step is evaluated. This leads to a potentially huge number of duplicate node tree traversals, most of which are redundant, as all duplicate nodes will be removed at the end anyway.
+
==Database Statistics==
  
In most cases, paths with a double slash can be rewritten to descendant steps…
+
In each database, metadata is stored that can be utilized by the query optimizer to speed up or even skip query evaluation:
  
<syntaxhighlight lang="xquery">
+
;Count element nodes
(: equivalent queries, with identical syntax trees :)
 
doc('addressbook.xml')//city,
 
doc('addressbook.xml')/descendant-or-node()/child::city
 
  
(: rewritten to :)
+
The number of elements that are found for a specific path need not be evaluated sequentially. Instead, the count can directly be retrieved from the database statistics:
doc('addressbook.xml')/descendant::city
 
</syntaxhighlight>
 
  
…unless the last step does not contain a positional predicate:
+
<pre lang='xquery'>
 +
count(/mondial/country)
  
<syntaxhighlight lang="xquery">
+
(: rewritten to :)
doc('addressbook.xml')//city[1]
+
231
</syntaxhighlight>
+
</pre>
  
As the positional test refers to the city child step, a rewritten query would yield different steps.
+
;Return distinct values
  
Paths may contain predicates that will be evaluated again by a later axis step. Such predicates are either shifted down or discarded:
+
The distinct values for specific names and paths can also be fetched from the database metadata, provided that the number does not exceed the maximum number of distinct values (see {{Option|MAXCATS}} for more information):
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
(: equivalent query :)
+
distinct-values(//religions)
a[b]/b[c/d]/c
 
  
 
(: rewritten to :)
 
(: rewritten to :)
a/b/c[d]
+
('Muslim', 'Roman Catholic', 'Albanian Orthodox', ...)
</syntaxhighlight>
+
</pre>
 
 
Names of nodes can be specified via name tests or predicates. If names are e.g. supplied via external variables, the predicates can often be dissolved:
 
 
 
<syntaxhighlight lang="xquery">
 
declare variable $name external := 'city';
 
db:open('addressbook')//*[name() = $name]
 
  
(: rewritten to :)
+
==Index Rewritings==
db:open('addressbook')//city
 
</syntaxhighlight>
 
  
=Index Rewritings=
+
A major feature of BaseX is the ability to rewrite all kinds of query patterns for [[Indexes#Value Indexes|index access]].
  
A major feature of BaseX is the ability to rewrite all kinds of query patterns for [[Indexes|database index access]]. The following queries will all be rewritten to exactly the same query that will access an index:
+
The following queries are all equivalent. They will be rewritten to exactly the same query that will eventually access the text index of a <code>factbook.xml</code> database instance (the file included in our full distributions):
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
declare context item := db:open('factbook');
+
declare context item := db:get('factbook');
 +
declare variable $DB := 'factbook';
  
 
//name[. = 'Shenzhen'],
 
//name[. = 'Shenzhen'],
//name[text() = 'Shenzhen'],
 
 
//name[data() = 'Shenzhen'],
 
//name[data() = 'Shenzhen'],
 +
//name[./text() = 'Shenzhen'],
 +
//name[text()[. = 'Shenzhen']],
 
//name[string() = 'Shenzhen'],
 
//name[string() = 'Shenzhen'],
 
//name[string() = 'Shen' || 'zhen'],
 
//name[string() = 'Shen' || 'zhen'],
//name[text() ! data() ! string() = 'Iceland'],
+
//name[./data(text()/string()) = 'Shenzhen'],
 +
//name[text() ! data() ! string() = 'Shenzhen'],
 +
 
 +
//name[. eq 'Shenzhen'],
 +
//name[not(. ne 'Shenzhen')],
 +
//name[not(. != 'Shenzhen')],
 +
.//name[. = 'Shenzhen'],
 +
//*[local-name() = 'name'][data() = 'Shenzhen'],
  
db:open('factbook')//name[. = 'Shenzhen'],
+
db:get('factbook')//name[. = 'Shenzhen'],
 +
db:get($DB)//name[. = 'Shenzhen'],
  
 
for $name in //name[text() = 'Shenzhen']
 
for $name in //name[text() = 'Shenzhen']
Line 279: Line 391:
 
(: rewritten to :)
 
(: rewritten to :)
 
db:text('factbook', 'Shenzhen')/parent::name
 
db:text('factbook', 'Shenzhen')/parent::name
</syntaxhighlight>
+
</pre>
 +
 
 +
Multiple element names and query strings can be supplied in a path:
 +
 
 +
<pre lang='xquery'>
 +
//*[(ethnicgroups, religions)/text() = ('Jewish', 'Muslim')]
 +
 
 +
(: rewritten to :)
 +
db:text('factbook', ('Jewish', 'Muslim'))/(parent::*:ethnicgroups | parent::*:religions)/parent::*
 +
</pre>
 +
 
 +
If multiple candidates for index access are found, the database statistics (if available) are consulted to choose the cheapest candidate:
 +
 
 +
<pre lang='xquery'>
 +
/mondial/country
 +
  [religions    = 'Muslim']  (: yields 77 results :)
 +
  [ethnicgroups = 'Greeks']  (: yields 2 results :)
 +
 
 +
(: rewritten to :)
 +
db:text('factbook', 'Greeks')/parent::ethnicgroups/parent::country[religions = 'Muslim']
 +
</pre>
 +
 
 +
If index access is possible within more complex FLWOR expressions, only the paths will be rewritten:
  
=Evaluation-Time Optimizations=
+
<pre lang='xquery'>
 +
for $country in //country
 +
where $country/ethnicgroups = 'German'
 +
order by $country/name[1]
 +
return element { replace($country/@name, ' ', '') } {},
 +
 
 +
(: rewritten to :)
 +
for $country in db:text('factbook', 'German')/parent::ethnicgroups/parent::country
 +
order by $country/name[1]
 +
return element { replace($country/@name, ' ', '') } {}
 +
</pre>
 +
 
 +
The [https://projects.cwi.nl/xmark/ XMark XML Benchmark] comes with sample auction data and a bunch of queries, some of which are suitable for index rewritings:
 +
 
 +
;XMark Query 1
 +
 
 +
<pre lang='xquery'>
 +
let $auction := doc('xmark')
 +
return for $b in $auction/site/people/person[@id = 'person0']
 +
return $b/name/text()
 +
 
 +
(: rewritten to :)
 +
db:attribute('xmark', 'person0')/self::attribute(id)/parent::person/name/text()
 +
</pre>
 +
 
 +
;XMark Query 8
 +
 
 +
<pre lang='xquery'>
 +
let $auction := doc('xmark')
 +
return
 +
  for $p in $auction/site/people/person
 +
  let $a :=
 +
    for $t in $auction/site/closed_auctions/closed_auction
 +
    where $t/buyer/@person = $p/@id
 +
    return $t
 +
  return <item person="{ $p/name/text() }">{ count($a) }</item>,
 +
 
 +
(: rewritten to :)
 +
db:get('xmark')/site/people/person !
 +
  <item person='{ name/text() }'>{ count(
 +
    db:attribute('xmark', @id)/self::attribute(person)/parent::buyer/parent::closed_auction
 +
  )
 +
}</item>
 +
</pre>
 +
 
 +
If the accessed database is not known at compile time, or if you want to give a predicate preference to another one, you can [[Indexes#Enforce Rewritings|enforce index rewritings]].
 +
 
 +
=Evaluation=
  
 
==Comparisons==
 
==Comparisons==
  
In many cases, the amount of data to be processed is only known after the query has been compiled. Moreover, the data that is looped through expressions may change.
+
In many cases, the amount of data to be processed is only known after the query has been compiled. Moreover, the data that is looped through expressions may change. In those cases, the best optimizations needs to be chosen at runtime.
  
 
If sequences of items are compared against each other, a dynamic hash index will be generated, and the total number of comparisons can be significantly reduced. In the following example, <code>count($input1) * count($input2)</code> comparisons would need to be made without the intermediate index structure:
 
If sequences of items are compared against each other, a dynamic hash index will be generated, and the total number of comparisons can be significantly reduced. In the following example, <code>count($input1) * count($input2)</code> comparisons would need to be made without the intermediate index structure:
  
<syntaxhighlight lang="xquery">
+
<pre lang='xquery'>
 
let $input1 := file:read-text-lines('huge1.txt')
 
let $input1 := file:read-text-lines('huge1.txt')
 
let $input2 := file:read-text-lines('huge2.txt')
 
let $input2 := file:read-text-lines('huge2.txt')
 
return $input1[not(. = $input2)]
 
return $input1[not(. = $input2)]
</syntaxhighlight>
+
</pre>
  
==Pre-Evaluation==
+
=Changelog=
  
=Changelog=
+
;Version 9.6
 +
* Added: {{Option|UNROLLLIMIT}}
  
 
Introduced with Version 9.4.
 
Introduced with Version 9.4.

Latest revision as of 18:33, 1 December 2023

This article is part of the XQuery Portal. It presents some of the optimizations that speed up the execution and reduce memory consumption of queries.

Introduction[edit]

Query execution encompasses multiple steps:

  1. Parsing: The query input string is transformed to executable code. The result is a tree representation, called the abstract syntax tree (AST).
  2. Compilation: The syntax tree is decorated with additional information (type information, expression properties). Expressions (nodes) in the tree are relocated, simplified, or pre-evaluated. Logical optimizations are performed that do not rely on external information.
  3. Optimization: The dynamic context is incorporated: Referenced databases are opened and analyzed; queries are rewritten to use available indexes; accumulative and statistical operations (counts, summations, min/max, distinct values) are pre-evaluated; XPath expressions are simplified, based on the existence of steps.
  4. Evaluation: The resulting code is executed.
  5. Printing: The query result is serialized and presented in a format that is either human-readable, or can be further processed by an API.

Some rewritings are described in this article.

If you run a query on command-line, you can use -V to output detailed query information. In the GUI, you can enable the Info View panel.

Compilation[edit]

Pre-Evaluation[edit]

Parts of the query that are static and would be executed multiple times can already be evaluated at compile time:

for $i in 1 to 10
return 2 * 3

(: rewritten to :)
for $i in 1 to 10
return 6

Variable Inlining[edit]

The value of a variable can be inlined: The variables references are replaced by the expression that is bound to the variable. The resulting expression can often be simplified, and further optimizations can be triggered:

declare variable $INFO := true();

let $nodes := //nodes
where $INFO
return 'Results: ' || count($nodes)

(: rewritten to :)
let $nodes := //nodes
where true()
return 'Results: ' || count($nodes)

(: rewritten to :)
let $nodes := //nodes
return 'Results: ' || count($nodes)

(: rewritten to :)
'Results: ' || count(//nodes)

As the example shows, variable declarations might be located in the query prolog and in FLWOR expressions. They may also occur (and be inlined) in try/catch, switch or typeswitch expressions.

Function Inlining[edit]

Functions can be inlined as well. The parameters are rewitten to let clauses and the function is body is bound to the return clause.

declare function local:inc($i) { $i + 1 };
for $n in 1 to 5
return local:inc($n)

(: rewritten to :)
for $n in 1 to 5
return (
  let $_ := $n
  return $_ + 1
)

(: rewritten to :)
for $n in 1 to 5
return $n + 1

Subsequent rewritings might result in query plans that differ a lot from the original query. As this might complicate debugging, you can disable function inling during development by setting INLINELIMIT to 0.

Loop Unrolling[edit]

Loops with few iterations are unrolled by the XQuery compiler to enable further optimizations:

(1 to 2) ! (. * 2)

(: rewritten to :)
1 ! (. * 2), 2 ! (. * 2)

(: further rewritten to :)
1 * 2, 2 * 2

(: further rewritten to :)
2, 4

Folds are unrolled, too:

let $f := function($a, $b) { $a * $b }
return fold-left(2 to 5, 1, $f)

(: rewritten to :)
let $f := function($a, $b) { $a * $b }
return $f($f($f($f(1, 2), 3), 4), 5)

The standard unroll limit is 5. It can be adjusted with the UNROLLLIMIT option, e.g. via a pragma:

(# db:unrolllimit 10 #) {
  for $i in 1 to 10
  return db:get('db' || $i)//*[text() = 'abc']
}

(: rewritten to :)
db:get('db1')//*[text() = 'abc'],
db:get('db2')//*[text() = 'abc'],
...
db:get('db10')//*[text() = 'abc'],

The last example indicates that index rewritings might be triggered by unrolling loops with paths on database nodes.

The following expressions can be unrolled:

Care should be taken if a higher value is selected, as memory consumption and compile time will increase.

Paths[edit]

Due to the compact syntax of XPath, it can make a big difference if a slash is added or omitted in a path expression. A classical example is the double slash //, which is a shortcut for descendant-or-node()/. If the query is evaluated without optimizations, all nodes of a document are gathered, and for each of them, the next step is evaluated. This leads to a potentially huge number of duplicate node tree traversals, most of which are redundant, as all duplicate nodes will be removed at the end anyway.

In most cases, paths with a double slash can be rewritten to descendant steps…

(: equivalent queries, with identical syntax trees :)
doc('addressbook.xml')//city,
doc('addressbook.xml')/descendant-or-self::node()/child::city

(: rewritten to :)
doc('addressbook.xml')/descendant::city

…unless the last step does not contain a positional predicate:

doc('addressbook.xml')//city[1]

As the positional test refers to the city child step, a rewritten query would yield different steps.

Paths may contain predicates that will be evaluated again by a later axis step. Such predicates are either shifted down or discarded:

(: equivalent query :)
a[b]/b[c/d]/c

(: rewritten to :)
a/b/c[d]

Names of nodes can be specified via name tests or predicates. If names are e.g. supplied via external variables, the predicates can often be dissolved:

declare variable $name external := 'city';
db:get('addressbook')/descendant::*[name() = $name]

(: rewritten to :)
db:get('addressbook')/descendant::city

FLWOR Rewritings[edit]

FLWOR expressions are central to XQuery and the most complex constructs the language offers. Numerous optimizations have been realized to improve the execution time:

  • Nested FLWOR expressions are flattened.
  • for clauses with single items are rewritten to let clauses.
  • let clauses that are iterated multiple times are lifted up.
  • Expressions of let clauses are inlined.
  • Unused variables are removed.
  • where clauses are rewritten to predicates.
  • if expressions in the return clause are rewritten to where clauses.
  • The last for clause is merged into the return clause and rewritten to a simple map expression.

Various of these rewriting are demonstrated in the following example:

for $a in 1 to 10
for $b in 2
where $a > 3
let $c := $a + $b
return $c

(: for is rewritten to let :)
for $a in 1 to 10
let $b := 2
where $a > 3
let $c := $a + $b
return $c

(: let is lifted up :)
let $b := 2
for $a in 1 to 10
where $a > 3
let $c := $a + $b
return $c

(: the where expression is rewritten to a predicate :)
let $b := 2
for $a in (1 to 10)[. > 3]
let $c := $a + $b
return $c

(: $b is inlined :)
for $a in (1 to 10)[. > 3]
let $c := $a + 2
return $c

(: $c is inlined :)
for $a in (1 to 10)[. > 3]
return $a + 2

(: the remaining clauses are merged and rewritten to a simple map :)
(1 to 10)[. > 3] ! (. + 2)

Static Typing[edit]

If the type of a value is known at compile time, type checks can be removed. In the example below, the static information that $i will always reference items of type xs:integer can be utilized to simplify the expression:

for $i in 1 to 5
return typeswitch($i)
  case xs:numeric return 'number'
  default return 'string'

(: rewritten to :)
for $i in 1 to 5
return 'number'

Pure Logic[edit]

If expressions can often be simplified:

for $a in ('a', '')
return $a[boolean(if(.) then true() else false())]

(: rewritten to :)
for $a in ('a', '')
return $a[boolean(.)]

(: rewritten to :)
for $a in ('a', '')
return $a[.]

(: rewritten to :)
('a', '')[.]

Boolean algebra (and set theory) comes with a set of laws that can all be applied to XQuery expressions.

Expression Rewritten expression Rule
$a + 0, $a * 1 $a Identity
$a * 0 0 Annihilator
$a and $a $a Idempotence
$a and ($a or $b) $a Absorption
($a and $b) or ($a and $c) $a and ($b or $c) Distributivity
$a or not($a) true() Tertium non datur
not($a) and not($b) not($a or $b) De Morgan

It is not sufficient to apply the rules to arbitrary input. Examples:

  • If the operands are no boolean values, a conversion is enforced: $string and $string is rewritten to boolean($string).
  • xs:double('NaN') * 0 yields NaN instead of 0
  • true#0 and true#0 must raise an error; it cannot be simplified to true#0

Optimization[edit]

Some physical optimizations are also presented in the article on index structures.

Database Statistics[edit]

In each database, metadata is stored that can be utilized by the query optimizer to speed up or even skip query evaluation:

Count element nodes

The number of elements that are found for a specific path need not be evaluated sequentially. Instead, the count can directly be retrieved from the database statistics:

count(/mondial/country)

(: rewritten to :)
231
Return distinct values

The distinct values for specific names and paths can also be fetched from the database metadata, provided that the number does not exceed the maximum number of distinct values (see MAXCATS for more information):

distinct-values(//religions)

(: rewritten to :)
('Muslim', 'Roman Catholic', 'Albanian Orthodox', ...)

Index Rewritings[edit]

A major feature of BaseX is the ability to rewrite all kinds of query patterns for index access.

The following queries are all equivalent. They will be rewritten to exactly the same query that will eventually access the text index of a factbook.xml database instance (the file included in our full distributions):

declare context item := db:get('factbook');
declare variable $DB := 'factbook';

//name[. = 'Shenzhen'],
//name[data() = 'Shenzhen'],
//name[./text() = 'Shenzhen'],
//name[text()[. = 'Shenzhen']],
//name[string() = 'Shenzhen'],
//name[string() = 'Shen' || 'zhen'],
//name[./data(text()/string()) = 'Shenzhen'],
//name[text() ! data() ! string() = 'Shenzhen'],

//name[. eq 'Shenzhen'],
//name[not(. ne 'Shenzhen')],
//name[not(. != 'Shenzhen')],
.//name[. = 'Shenzhen'],
//*[local-name() = 'name'][data() = 'Shenzhen'],

db:get('factbook')//name[. = 'Shenzhen'],
db:get($DB)//name[. = 'Shenzhen'],

for $name in //name[text() = 'Shenzhen']
return $name,

for $name in //name
return $name[text() = 'Shenzhen'],

for $name in //name
return if($name/text() = 'Shenzhen') then $name else (),

for $name in //name
where $name/text() = 'Shenzhen'
return $name,

for $name in //name
where $name/text()[. = 'Shenzhen']
return $name,

for $node in //*
where data($node) = 'Shenzhen'
where name($node) = 'name'
return $node,

(: rewritten to :)
db:text('factbook', 'Shenzhen')/parent::name

Multiple element names and query strings can be supplied in a path:

//*[(ethnicgroups, religions)/text() = ('Jewish', 'Muslim')]

(: rewritten to :)
db:text('factbook', ('Jewish', 'Muslim'))/(parent::*:ethnicgroups | parent::*:religions)/parent::*

If multiple candidates for index access are found, the database statistics (if available) are consulted to choose the cheapest candidate:

/mondial/country
  [religions    = 'Muslim']  (: yields 77 results :)
  [ethnicgroups = 'Greeks']  (: yields 2 results :) 

(: rewritten to :)
db:text('factbook', 'Greeks')/parent::ethnicgroups/parent::country[religions = 'Muslim']

If index access is possible within more complex FLWOR expressions, only the paths will be rewritten:

for $country in //country
where $country/ethnicgroups = 'German'
order by $country/name[1]
return element { replace($country/@name, ' ', '') } {},

(: rewritten to :)
for $country in db:text('factbook', 'German')/parent::ethnicgroups/parent::country
order by $country/name[1]
return element { replace($country/@name, ' ', '') } {}

The XMark XML Benchmark comes with sample auction data and a bunch of queries, some of which are suitable for index rewritings:

XMark Query 1
let $auction := doc('xmark')
return for $b in $auction/site/people/person[@id = 'person0']
return $b/name/text()

(: rewritten to :)
db:attribute('xmark', 'person0')/self::attribute(id)/parent::person/name/text()
XMark Query 8
let $auction := doc('xmark')
return
  for $p in $auction/site/people/person
  let $a :=
    for $t in $auction/site/closed_auctions/closed_auction
    where $t/buyer/@person = $p/@id
    return $t
  return <item person="{ $p/name/text() }">{ count($a) }</item>,

(: rewritten to :)
db:get('xmark')/site/people/person !
  <item person='{ name/text() }'>{ count(
    db:attribute('xmark', @id)/self::attribute(person)/parent::buyer/parent::closed_auction
  )
}</item>

If the accessed database is not known at compile time, or if you want to give a predicate preference to another one, you can enforce index rewritings.

Evaluation[edit]

Comparisons[edit]

In many cases, the amount of data to be processed is only known after the query has been compiled. Moreover, the data that is looped through expressions may change. In those cases, the best optimizations needs to be chosen at runtime.

If sequences of items are compared against each other, a dynamic hash index will be generated, and the total number of comparisons can be significantly reduced. In the following example, count($input1) * count($input2) comparisons would need to be made without the intermediate index structure:

let $input1 := file:read-text-lines('huge1.txt')
let $input2 := file:read-text-lines('huge2.txt')
return $input1[not(. = $input2)]

Changelog[edit]

Version 9.6

Introduced with Version 9.4.