Difference between revisions of "Fetch Module"

m (Text replacement - "<syntaxhighlight lang="xquery">" to "<pre lang='xquery'>")
 
(27 intermediate revisions by the same user not shown)
Line 1: Line 1:
This [[Module Library|XQuery Module]] provides simple functions to fetch the content of resources identified by URIs. Resources can be stored locally or remotely and e.g. use the {{Code|file://}} or {{Code|http://}} scheme. If more control over HTTP requests is required, the [[HTTP Module]] can be used. With the [[HTML Module]], retrieved HTML documents can be converted to XML.
+
This [[Module Library|XQuery Module]] provides simple functions to fetch the content of resources identified by URIs. Resources can be stored locally or remotely and e.g. use the {{Code|file://}} or {{Code|http://}} scheme. If more control over HTTP requests is required, the [[HTTP Client Module]] can be used. With the [[HTML Module]], retrieved HTML documents can be converted to XML.
  
 
=Conventions=
 
=Conventions=
 
{{Mark|Updated with Version 9.0}}:
 
  
 
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/fetch</nowiki></code> namespace, which is statically bound to the {{Code|fetch}} prefix.<br/>
 
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/fetch</nowiki></code> namespace, which is statically bound to the {{Code|fetch}} prefix.<br/>
Line 14: Line 12:
  
 
{| width='100%'
 
{| width='100%'
|-
+
|- valign="top"
| width='120' | '''Signatures'''
+
| width='120' | '''Signature'''
|{{Func|fetch:binary|$uri as xs:string|xs:base64Binary}}<br/>
+
|<pre>fetch:binary(
|-
+
  $href  as xs:string
 +
) as xs:base64Binary</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Fetches the resource referred to by the given URI and returns it as [[Streaming Module|streamable]] {{Code|xs:base64Binary}}.
+
|Fetches the resource referred to by the given {{Code|href}} string and returns it as [[Lazy Module|lazy]] {{Code|xs:base64Binary}} item.
|-
+
|- valign="top"
 
| '''Errors'''
 
| '''Errors'''
|{{Error|open|XQuery Errors#Functions Errors}} the URI could not be resolved, or the resource could not be retrieved.
+
|{{Error|open|#Errors}} the URI could not be resolved, or the resource could not be retrieved.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
 
* <code><nowiki>fetch:binary("http://images.trulia.com/blogimg/c/5/f/4/679932_1298401950553_o.jpg")</nowiki></code> returns the addressed image.
 
* <code><nowiki>fetch:binary("http://images.trulia.com/blogimg/c/5/f/4/679932_1298401950553_o.jpg")</nowiki></code> returns the addressed image.
* <code><nowiki>stream:materialize(fetch:binary("http://en.wikipedia.org"))</nowiki></code> returns a materialized representation of the streamable result.
+
* <code><nowiki>lazy:cache(fetch:binary("http://en.wikipedia.org"))</nowiki></code> enforces the fetch operation (otherwise, it will be delayed until requested first).
 
|}
 
|}
  
Line 33: Line 33:
  
 
{| width='100%'
 
{| width='100%'
|-
+
|- valign="top"
| width='120' | '''Signatures'''
+
| width='120' | '''Signature'''
|{{Func|fetch:text|$uri as xs:string|xs:string}}<br/>{{Func|fetch:text|$uri as xs:string, $encoding as xs:string|xs:string}}<br/>{{Func|fetch:text|$uri as xs:string, $encoding as xs:string, $fallback as xs:boolean|xs:string}}<br/>
+
|<pre>fetch:text(
|-
+
  $href      as xs:string,
 +
  $encoding as xs:string   := (),
 +
  $fallback as xs:boolean?  := false()
 +
) as xs:string</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Fetches the resource referred to by the given {{Code|$uri}} and returns it as [[Streaming Module|streamable]] {{Code|xs:string}}:
+
|Fetches the resource referred to by the given {{Code|href}} string and returns it as [[Lazy Module|lazy]] {{Code|xs:string}} item:
 
* The UTF-8 default encoding can be overwritten with the optional {{Code|$encoding}} argument.
 
* The UTF-8 default encoding can be overwritten with the optional {{Code|$encoding}} argument.
 
* By default, invalid characters will be rejected. If {{Code|$fallback}} is set to true, these characters will be replaced with the Unicode replacement character <code>FFFD</code> (&#xFFFD;).
 
* By default, invalid characters will be rejected. If {{Code|$fallback}} is set to true, these characters will be replaced with the Unicode replacement character <code>FFFD</code> (&#xFFFD;).
|-
+
|- valign="top"
 
| '''Errors'''
 
| '''Errors'''
|{{Error|open|XQuery Errors#Functions Errors}} the URI could not be resolved, or the resource could not be retrieved.<br/>{{Error|encoding|XQuery Errors#Functions Errors}} the specified encoding is not supported, or unknown.
+
|{{Error|open|#Errors}} the URI could not be resolved, or the resource could not be retrieved.<br/>{{Error|encoding|#Errors}} the specified encoding is not supported, or unknown.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
 
* <code><nowiki>fetch:text("http://en.wikipedia.org")</nowiki></code> returns a string representation of the English Wikipedia main HTML page.
 
* <code><nowiki>fetch:text("http://en.wikipedia.org")</nowiki></code> returns a string representation of the English Wikipedia main HTML page.
 
* <code><nowiki>fetch:text("http://www.bbc.com","US-ASCII",true())</nowiki></code> returns the BBC homepage in US-ASCII with all non-US-ASCII characters replaced with &#xFFFD;.
 
* <code><nowiki>fetch:text("http://www.bbc.com","US-ASCII",true())</nowiki></code> returns the BBC homepage in US-ASCII with all non-US-ASCII characters replaced with &#xFFFD;.
* <code><nowiki>stream:materialize(fetch:text("http://en.wikipedia.org"))</nowiki></code> returns a materialized representation of the streamable result.
+
* <code><nowiki>lazy:cache(fetch:text("http://en.wikipedia.org"))</nowiki></code> enforces the fetch operation (otherwise, it will be delayed until requested first).
 
|}
 
|}
  
==fetch:xml==
+
==fetch:doc==
  
 
{| width='100%'
 
{| width='100%'
|-
+
|- valign="top"
| width='120' | '''Signatures'''
+
| width='120' | '''Signature'''
|{{Func|fetch:xml|$uri as xs:string|document-node()}}<br/>{{Func|fetch:xml|$uri as xs:string, $options as map(*)|document-node()}}
+
|<pre>fetch:doc(
|-
+
  $href    as xs:string,
 +
  $options as map(*)?    := map { }
 +
) as document-node()</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Fetches the resource referred to by the given {{Code|$uri}} and returns it as XML document node.<br/>In contrast to <code>fn:doc</code>, each function call returns a different document node. As a consequence, document instances created by this function will not be kept in memory until the end of query evaluation.<br/>The {{Code|$options}} argument can be used to change the parsing behavior. Allowed options are all [[Options#Parsing|parsing]] and [[Options#XML Parsing|XML parsing]] options in lower case.
+
|Fetches the resource referred to by the given {{Code|href}} string and returns it as a document node.<br/>The {{Code|$options}} argument can be used to change the parsing behavior. Allowed options are all [[Options#Parsing|parsing]] and [[Options#XML Parsing|XML parsing]] options in lower case.<br/>The function differs from {{Code|fn:doc}} in various aspects:
|-
+
* It is ''nondeterministic'', i.e., a new document node will be created by each call of this function.
 +
* A document created by this function will be garbage-collected as soon as it is not referenced anymore.
 +
* URIs will not be resolved against existing databases. As a result, it will not trigger any locks (see [[Transaction Management#Limitations|limitations of database locking]] for more details).
 +
|- valign="top"
 
| '''Errors'''
 
| '''Errors'''
|{{Error|open|XQuery Errors#Functions Errors}} the URI could not be resolved, or the resource could not be retrieved.
+
|{{Error|open|#Errors}} the URI could not be resolved, or the resource could not be retrieved.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
* Retrieve an XML representation of the English Wikipedia main HTML page, chop all whitespace nodes:
+
* Retrieve an XML representation of the English Wikipedia main HTML page with whitespace stripped:
<pre class="brush:xquery">
+
<pre lang='xquery'>
fetch:xml("http://en.wikipedia.org", map { 'chop': true() })
+
fetch:doc("http://en.wikipedia.org", map { 'stripws': true() })
 
</pre>
 
</pre>
* Return a document located in the current base directory:
+
* Return a web page as XML, preserve namespaces:
<pre class="brush:xquery">
+
<pre lang='xquery'>
fetch:xml(file:base-dir() || "example.xml")
+
fetch:doc(
 +
  'http://basex.org/',
 +
  map {
 +
    'parser': 'html',
 +
    'htmlparser': map { 'nons': false() }
 +
  }
 +
)
 
</pre>
 
</pre>
 
|}
 
|}
  
==fetch:xml-binary==
+
==fetch:binary-doc==
 
 
{{Mark|Introduced with Version 9.0:}}
 
  
 
{| width='100%'
 
{| width='100%'
|-
+
|- valign="top"
| width='120' | '''Signatures'''
+
| width='120' | '''Signature'''
|{{Func|fetch:xml-binary|$data as xs:base64Binary|document-node()}}<br/>{{Func|fetch:xml-binary|$data as xs:base64Binary, $options as map(*)|document-node()}}
+
|<pre>fetch:binary-doc(
|-
+
  $input    as xs:anyAtomicType,
 +
  $options as map(*)?          := map { }
 +
) as document-node()</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Parses binary {{Code|$data}} and returns it as XML document node.<br/>In contrast to fn:parse-xml, which expects an XQuery string, the input of this function can be arbitrarily encoded. The encoding will be derived from the XML declaration or (in case of UTF16 or UTF32) from the first bytes of the input.<br/>The {{Code|$options}} argument can be used to change the parsing behavior. Allowed options are all [[Options#Parsing|parsing]] and [[Options#XML Parsing|XML parsing]] options in lower case.
+
|Converts the specified {{Code|$input}} ({{Code|xs:base64Binary}}, {{Code|xs:hexBinary}}) to XML and returns it as a document node.<br/>In contrast to {{Code|fn:parse-xml}}, which expects a string, the input can be arbitrarily encoded. The encoding will be derived from the XML declaration or (in case of UTF-16 or UTF-32) from the first bytes of the input.<br/>The {{Code|$options}} argument can be used to change the parsing behavior. Allowed options are all [[Options#Parsing|parsing]] and [[Options#XML Parsing|XML parsing]] options in lower case.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
 
* Retrieves file input as binary data and parses it as XML:
 
* Retrieves file input as binary data and parses it as XML:
<pre class="brush:xquery">
+
<pre lang='xquery'>
fetch:xml-binary(file:read-binary('doc.xml'))
+
fetch:binary-doc(file:read-binary('doc.xml'))
 
</pre>
 
</pre>
 
* Encodes a string as CP1252 and parses it as XML. The input and the string {{Code|touché}} will be correctly decoded because of the XML declaration:
 
* Encodes a string as CP1252 and parses it as XML. The input and the string {{Code|touché}} will be correctly decoded because of the XML declaration:
<pre class="brush:xquery">
+
<pre lang='xquery'>
fetch:xml-binary(convert:string-to-base64(
+
fetch:binary-doc(convert:string-to-base64(
 
   "<?xml version='1.0' encoding='CP1252'?><xml>touché</xml>",
 
   "<?xml version='1.0' encoding='CP1252'?><xml>touché</xml>",
 
   "CP1252"
 
   "CP1252"
 
))
 
))
 
</pre>
 
</pre>
* Encodes a string as UTF16 and parses it as XML. The document will be correctly decoded, as the first bytes of the data indicate that the input must be UTF16:
+
* Encodes a string as UTF-16 and parses it as XML. The document will be correctly decoded, as the first bytes of the data indicate that the input must be UTF-16:
<pre class="brush:xquery">
+
<pre lang='xquery'>
fetch:xml-binary(convert:string-to-base64("<xml/>", "UTF16"))
+
fetch:binary-doc(convert:string-to-base64("<xml/>", "UTF16"))
 
</pre>
 
</pre>
 +
|- valign="top"
 +
| '''Errors'''
 +
|{{Error|open|#Errors}} the input could not be parsed.
 
|}
 
|}
  
Line 111: Line 131:
  
 
{| width='100%'
 
{| width='100%'
|-
+
|- valign="top"
| width='120' | '''Signatures'''
+
| width='120' | '''Signature'''
|{{Func|fetch:content-type|$uri as xs:string|xs:string}}<br/>
+
|<pre>fetch:content-type(
|-
+
  $href  as xs:string
 +
) as xs:string</pre>
 +
|- valign="top"
 
| '''Summary'''
 
| '''Summary'''
|Returns the content-type (also called mime-type) of the resource specified by {{Code|$uri}}:
+
|Returns the content-type (also called mime-type) of the resource specified by {{Code|href}} string:
 
* If a remote resource is addressed, the request header will be evaluated.
 
* If a remote resource is addressed, the request header will be evaluated.
 
* If the addressed resource is locally stored, the content-type will be guessed based on the file extension.
 
* If the addressed resource is locally stored, the content-type will be guessed based on the file extension.
|-
+
|- valign="top"
 
| '''Errors'''
 
| '''Errors'''
|{{Error|open|XQuery Errors#Functions Errors}} the URI could not be resolved, or the resource could not be retrieved.
+
|{{Error|open|#Errors}} the URI could not be resolved, or the resource could not be retrieved.
|-
+
|- valign="top"
 
| '''Examples'''
 
| '''Examples'''
 
|
 
|
Line 129: Line 151:
  
 
=Errors=
 
=Errors=
 
{{Mark|Updated with Version 9.0}}:
 
  
 
{| class="wikitable" width="100%"
 
{| class="wikitable" width="100%"
 
! width="110"|Code
 
! width="110"|Code
 
|Description
 
|Description
|-
+
|- valign="top"
 
|{{Code|encoding}}
 
|{{Code|encoding}}
 
|The specified encoding is not supported, or unknown.
 
|The specified encoding is not supported, or unknown.
|-
+
|- valign="top"
 
|{{Code|open}}
 
|{{Code|open}}
 
|The URI could not be resolved, or the resource could not be retrieved.
 
|The URI could not be resolved, or the resource could not be retrieved.
Line 144: Line 164:
  
 
=Changelog=
 
=Changelog=
 +
 +
;Version 10.0
 +
* Updated: {{Function||fetch:doc}} renamed (before: {{Code|fetch:xml}}).
 +
* Updated: {{Function||fetch:binary-doc}} renamed (before: {{Code|fetch:xml-binary}}).
  
 
;Version 9.0
 
;Version 9.0
 
+
* Added: {{Code|fetch:xml-binary}}
* Added: [[#fetch:xml-binary|fetch:xml-binary]]
+
* Updated: error codes updated; errors now use the module namespace
* Updated: error codes updates; errors now use the module namespace
 
  
 
;Version 8.5
 
;Version 8.5
 
+
* Updated: {{Function||fetch:text}}: <code>$fallback</code> argument added.
* Updated: [[#fetch:text|fetch:text]]: <code>$fallback</code> argument added.
 
  
 
;Version 8.0
 
;Version 8.0
 
+
* Added: {{Code|fetch:xml}}
* Added: [[#fetch:xml|fetch:xml]]
 
  
 
The module was introduced with Version 7.6.
 
The module was introduced with Version 7.6.

Latest revision as of 17:34, 1 December 2023

This XQuery Module provides simple functions to fetch the content of resources identified by URIs. Resources can be stored locally or remotely and are addressed via schemes such as file:// or http://. If more control over HTTP requests is required, the HTTP Client Module can be used. With the HTML Module, retrieved HTML documents can be converted to XML.

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/fetch namespace, which is statically bound to the fetch prefix.

URI arguments can be URLs or point to local files. Relative file paths will be resolved against the current working directory (for more details, have a look at the File Module).
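
As a rough illustration of the two kinds of arguments described above, the following sketch passes one URL and one relative file path. The file name notes.txt is hypothetical and is assumed to exist in the current working directory.

```xquery
(: notes.txt is a hypothetical file in the current working directory :)
let $remote := fetch:text('http://en.wikipedia.org')
let $local  := fetch:text('notes.txt')
return (string-length($remote), string-length($local))
```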

Functions

fetch:binary

Signature
fetch:binary(
  $href  as xs:string
) as xs:base64Binary
Summary Fetches the resource referred to by the given href string and returns it as a lazy xs:base64Binary item.
Errors open: the URI could not be resolved, or the resource could not be retrieved.
Examples
  • fetch:binary("http://images.trulia.com/blogimg/c/5/f/4/679932_1298401950553_o.jpg") returns the addressed image.
  • lazy:cache(fetch:binary("http://en.wikipedia.org")) forces the fetch operation to run immediately (otherwise, it is delayed until the result is first requested).
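
A minimal sketch of a typical use, assuming the File Module is available: the fetched binary item is written to a local file, and the open error is caught. The output path target.jpg is hypothetical.

```xquery
(: target.jpg is a hypothetical output path :)
let $href := 'http://images.trulia.com/blogimg/c/5/f/4/679932_1298401950553_o.jpg'
return try {
  (: writing the item consumes it, so the lazy fetch runs inside the try block :)
  file:write-binary('target.jpg', fetch:binary($href))
} catch fetch:open {
  'Could not fetch ' || $href || ': ' || $err:description
}
```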

fetch:text

Signature
fetch:text(
  $href      as xs:string,
  $encoding  as xs:string    := (),
  $fallback  as xs:boolean?  := false()
) as xs:string
Summary Fetches the resource referred to by the given href string and returns it as a lazy xs:string item:
  • The UTF-8 default encoding can be overridden with the optional $encoding argument.
  • By default, invalid characters will be rejected. If $fallback is set to true, these characters will be replaced with the Unicode replacement character FFFD (�).
Errors open: the URI could not be resolved, or the resource could not be retrieved.
encoding: the specified encoding is not supported, or unknown.
Examples
  • fetch:text("http://en.wikipedia.org") returns a string representation of the English Wikipedia main HTML page.
  • fetch:text("http://www.bbc.com","US-ASCII",true()) returns the BBC homepage in US-ASCII with all non-US-ASCII characters replaced with �.
  • lazy:cache(fetch:text("http://en.wikipedia.org")) forces the fetch operation to run immediately (otherwise, it is delayed until the result is first requested).
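
The following sketch combines the $encoding and $fallback arguments with handling for both documented errors. The file name legacy.txt is hypothetical. Wrapping the call in lazy:cache is an assumption made here so that the fetch happens inside the try block rather than later, when the lazy string is first consumed.

```xquery
(: legacy.txt is a hypothetical Latin-1 encoded file :)
try {
  lazy:cache(fetch:text('legacy.txt', 'ISO-8859-1', true()))
} catch fetch:encoding {
  'Unsupported encoding: ' || $err:description
} catch fetch:open {
  'Resource not available: ' || $err:description
}
```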

fetch:doc

Signature
fetch:doc(
  $href     as xs:string,
  $options  as map(*)?    := map { }
) as document-node()
Summary Fetches the resource referred to by the given href string and returns it as a document node.
The $options argument can be used to change the parsing behavior. Allowed options are all parsing and XML parsing options in lower case.
The function differs from fn:doc in several respects:
  • It is nondeterministic, i.e., a new document node will be created by each call of this function.
  • A document created by this function will be garbage-collected as soon as it is not referenced anymore.
  • URIs will not be resolved against existing databases. As a result, it will not trigger any locks (see limitations of database locking for more details).
Errors open: the URI could not be resolved, or the resource could not be retrieved.
Examples
  • Retrieve an XML representation of the English Wikipedia main HTML page with whitespace stripped:
fetch:doc("http://en.wikipedia.org", map { 'stripws': true() })
  • Return a web page as XML, preserve namespaces:
fetch:doc(
  'http://basex.org/',
  map {
    'parser': 'html',
    'htmlparser': map { 'nons': false() }
  }
)
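
To make the first difference from fn:doc more concrete, this sketch compares node identity across two calls. The file example.xml is hypothetical and assumed to be well-formed. Going by the description above, the fetch:doc comparison is expected to be false (each call creates a new node), while fn:doc is expected to return the same node twice.

```xquery
(: example.xml is a hypothetical well-formed file in the working directory :)
let $href := 'example.xml'
return map {
  'fetch:doc': fetch:doc($href) is fetch:doc($href),
  'fn:doc':    doc($href) is doc($href)
}
```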

fetch:binary-doc

Signature
fetch:binary-doc(
  $input    as xs:anyAtomicType,
  $options  as map(*)?           := map { }
) as document-node()
Summary Converts the specified $input (xs:base64Binary, xs:hexBinary) to XML and returns it as a document node.
In contrast to fn:parse-xml, which expects a string, the input can be arbitrarily encoded. The encoding will be derived from the XML declaration or (in case of UTF-16 or UTF-32) from the first bytes of the input.
The $options argument can be used to change the parsing behavior. Allowed options are all parsing and XML parsing options in lower case.
Examples
  • Retrieves file input as binary data and parses it as XML:
fetch:binary-doc(file:read-binary('doc.xml'))
  • Encodes a string as CP1252 and parses it as XML. The input and the string touché will be correctly decoded because of the XML declaration:
fetch:binary-doc(convert:string-to-base64(
  "<?xml version='1.0' encoding='CP1252'?><xml>touché</xml>",
  "CP1252"
))
  • Encodes a string as UTF-16 and parses it as XML. The document will be correctly decoded, as the first bytes of the data indicate that the input must be UTF-16:
fetch:binary-doc(convert:string-to-base64("<xml/>", "UTF16"))
Errors open: the input could not be parsed.
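
As a sketch of how the two accepted input types might be used: the first expression parses an xs:hexBinary literal (the bytes of the string <xml/>), the second fetches a resource as binary data and lets the parser detect its encoding. The file name doc.xml is hypothetical.

```xquery
(: 3C786D6C2F3E are the bytes of the string "<xml/>" :)
fetch:binary-doc(xs:hexBinary('3C786D6C2F3E')),

(: doc.xml is a hypothetical file; its encoding is taken from the XML declaration :)
fetch:binary-doc(fetch:binary('doc.xml'), map { 'stripws': true() })
```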

fetch:content-type

Signature
fetch:content-type(
  $href  as xs:string
) as xs:string
Summary Returns the content-type (also called mime-type) of the resource specified by the href string:
  • If a remote resource is addressed, the Content-Type header of the response will be evaluated.
  • If the addressed resource is locally stored, the content-type will be guessed based on the file extension.
Errors open: the URI could not be resolved, or the resource could not be retrieved.
Examples
  • fetch:content-type("http://docs.basex.org/skins/vector/images/wiki.png") returns image/png.
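
One possible use of the returned content-type is to choose the matching fetch function. The dispatch rules below are an illustrative assumption, not part of the module.

```xquery
let $href := 'http://docs.basex.org/skins/vector/images/wiki.png'
let $type := fetch:content-type($href)
return
  (: check for XML first, as types such as text/xml also start with "text/" :)
  if (contains($type, 'xml')) then fetch:doc($href)
  else if (starts-with($type, 'text/')) then fetch:text($href)
  else fetch:binary($href)
```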

Errors

Code      Description
encoding  The specified encoding is not supported, or unknown.
open      The URI could not be resolved, or the resource could not be retrieved.
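
Since the errors are assigned to the module namespace, they can be caught by name or with a wildcard. The sketch below catches any error of this module and returns its code and description. The host name unknown.invalid is a placeholder that is expected not to resolve, and lazy:cache is used on the assumption that it forces the fetch inside the try block.

```xquery
try {
  lazy:cache(fetch:text('http://unknown.invalid/'))
} catch fetch:* {
  map { 'code': $err:code, 'message': $err:description }
}
```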

Changelog

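Version 10.0
  • Updated: fetch:doc renamed (before: fetch:xml).
  • Updated: fetch:binary-doc renamed (before: fetch:xml-binary).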
Version 9.0
  • Added: fetch:xml-binary
  • Updated: error codes updated; errors now use the module namespace
Version 8.5
  • Updated: fetch:text: $fallback argument added.
Version 8.0
  • Added: fetch:xml

The module was introduced with Version 7.6.