Difference between revisions of "HTML Module"

From BaseX Documentation

Revision as of 17:50, 18 November 2020

This XQuery Module provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see HTML Parsing for more details).

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/html namespace, which is statically bound to the html prefix.

Functions

html:parser

Signatures html:parser() as xs:string
Summary Returns the name of the applied HTML parser (currently: TagSoup). If an empty string is returned, TagSoup was not found in the classpath, and the input will be treated as well-formed XML.
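
As a quick check, the following sketch reports whether TagSoup is available. It relies only on the empty-string behavior described above; the message texts are illustrative:

<syntaxhighlight lang="xquery">
(: html:parser() returns an empty string if TagSoup is not in the classpath :)
let $parser := html:parser()
return if ($parser = '') then
  'No HTML parser found: input will be treated as well-formed XML'
else
  'HTML parser in use: ' || $parser
</syntaxhighlight>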

html:parse

Signatures html:parse($input as xs:anyAtomicType) as document-node()
html:parse($input as xs:anyAtomicType, $options as map(*)?) as document-node()
Summary Converts the HTML document specified by $input to XML and returns a document node:
  • The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary).
  • If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding.

The $options argument can be used to set TagSoup Options.

Errors parse: the input cannot be converted to XML.
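
A minimal sketch of handling a failed conversion, assuming that the error is raised with the QName html:parse (the module namespace combined with the code listed under Errors). The $input value and the fallback failed element are illustrative only; whether this particular input raises the error depends on whether TagSoup is available:

<syntaxhighlight lang="xquery">
(: illustrative input; with TagSoup in the classpath, malformed input may still be converted :)
let $input := "<p>unclosed paragraph"
return try {
  html:parse($input)
} catch html:parse {
  (: return a placeholder element that carries the error message :)
  <failed>{ $err:description }</failed>
}
</syntaxhighlight>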

html:doc

Signatures html:doc($uri as xs:string?) as document-node()?
html:doc($uri as xs:string?, $options as map(*)?) as document-node()?
Summary Fetches the HTML document referred to by the given $uri, converts it to XML and returns a document node. The $options argument can be used to set TagSoup Options.
Errors parse: the input cannot be converted to XML.
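
As a sketch, both signatures of html:doc can be called as follows. The URL is the one used in the binary input example on this page, and nons is the TagSoup option used in the options example; any other option name would need to be checked against the TagSoup Options:

<syntaxhighlight lang="xquery">
(: fetch and convert a remote HTML page :)
html:doc("https://en.wikipedia.org"),
(: same call with a TagSoup option :)
html:doc("https://en.wikipedia.org", map { 'nons': false() })
</syntaxhighlight>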

Examples

Basic Example

The following query converts the specified string to an XML document node.

Query

<syntaxhighlight lang="xquery"> html:parse("<html>") </syntaxhighlight>

Result

<syntaxhighlight lang="xml"> <html xmlns="http://www.w3.org/1999/xhtml"/> </syntaxhighlight>

Specifying Options

The next query sets the nons option to false(), which creates an XML document with namespaces:

Query

<syntaxhighlight lang="xquery"> html:parse("<a href='ok.html'/>", map { 'nons': false() }) </syntaxhighlight>

Result

<syntaxhighlight lang="xml">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a shape="rect" href="ok.html"/>
  </body>
</html>
</syntaxhighlight>

Parsing Binary Input

If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:

Query

<syntaxhighlight lang="xquery"> html:parse(fetch:binary("https://en.wikipedia.org")) </syntaxhighlight>

Result

<syntaxhighlight lang="xml">
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
  <head>
    <title>Wikipedia, the free encyclopedia</title>
    <meta charset="UTF-8"/>
    ...
</syntaxhighlight>
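
Binary input does not have to come from fetch:binary. As a small sketch, the xs:hexBinary value below encodes the bytes of the ASCII string <p>Hi</p>; the value is made up for illustration:

<syntaxhighlight lang="xquery">
(: 3C 70 3E 48 69 3C 2F 70 3E are the ASCII bytes of "<p>Hi</p>" :)
html:parse(xs:hexBinary('3C703E48693C2F703E'))
</syntaxhighlight>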

Errors

Code Description
parse The input cannot be converted to XML.

Changelog

Version 9.4
  • Added: html:doc
Version 9.0
  • Updated: error codes updated; errors now use the module namespace

The module was introduced with Version 7.6.