Difference between revisions of "HTML Module"

From BaseX Documentation
m (Made link to Wikipedia HTTPS for binary example - as HTTP returns nothing)
(32 intermediate revisions by 3 users not shown)
Line 1: Line 1:
This [[Module Library|XQuery Module]] provides functions for converting HTML to XML. The input will only be converted if [http://home.ccil.org/~cowan/XML/tagsoup/ TagSoup] is included in the classpath (see [[Parsers#HTML Parser|HTML Parsing]] for more details).
+
This [[Module Library|XQuery Module]] provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see [[Parsers#HTML Parser|HTML Parsing]] for more details).
  
 
=Conventions=
 
=Conventions=
  
All functions in this module are assigned to the {{Code|http://basex.org/modules/html}} namespace, which is statically bound to the {{Code|html}} prefix.<br/>
+
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/html</nowiki></code> namespace, which is statically bound to the {{Code|html}} prefix.<br/>
All errors are assigned to the {{Code|http://basex.org/errors}} namespace, which is statically bound to the {{Code|bxerr}} prefix.
 
  
 
=Functions=
 
=Functions=
Line 12: Line 11:
 
{| width='100%'
 
{| width='100%'
 
|-
 
|-
| width='90' | '''Signatures'''
+
| width='120' | '''Signatures'''
 
|{{Code|'''html:parser'''() as xs:string}}<br />
 
|{{Code|'''html:parser'''() as xs:string}}<br />
 
|-
 
|-
Line 22: Line 21:
 
{| width='100%'
 
{| width='100%'
 
|-
 
|-
| width='90' | '''Signatures'''
+
| width='120' | '''Signatures'''
|{{Func|html:parse|$input as xs:anyAtomicType|document-node()}}<br />{{Func|html:parse|$input as xs:anyAtomicType, $options as item()|document-node()}}<br />
+
|{{Func|html:parse|$input as xs:anyAtomicType|document-node()}}<br />{{Func|html:parse|$input as xs:anyAtomicType, $options as map(*)?|document-node()}}<br />
 
|-
 
|-
 
| '''Summary'''
 
| '''Summary'''
|Converts the HTML document specified by {{Code|$input}} to XML, and returns a document node. The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary). If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding.<br/>The {{Code|$options}} argument can be used to set [[Parsers#TagSoup Options|TagSoup options]]. It can be specified<br />
+
|Converts the HTML document specified by {{Code|$input}} to XML, and returns a document node:<br/>
* as children of an {{Code|<html:options/>}} element; e.g.:
+
* The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary).
 +
* If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding.
 +
 
 +
The {{Code|$options}} argument can be used to set [[Parsers#Options|TagSoup Options]].
 +
|-
 +
| '''Errors'''
 +
|{{Error|parse|#Errors}} the input cannot be converted to XML.
 +
|}
 +
 
 +
=Examples=
 +
 
 +
===Basic Example===
 +
 
 +
The following query converts the specified string to an XML document node.
 +
 
 +
;Query:
 +
<pre class="brush:xquery">
 +
html:parse("<html>")
 +
</pre>
 +
 
 +
;Result:
 
<pre class="brush:xml">
 
<pre class="brush:xml">
<html:options>
+
<html xmlns="http://www.w3.org/1999/xhtml"/>
  <html:key1 value='value1'/>
 
  ...
 
</html:options>
 
 
</pre>
 
</pre>
* as map, which contains all key/value pairs:
+
 
<pre class="brush:xml">
+
===Specifying Options===
map { "key1" := "value1", ... }
+
 
 +
The next query creates an XML document with namespaces:
 +
 
 +
;Query:
 +
<pre class="brush:xquery">
 +
html:parse("<a href='ok.html'/>", map { 'nons': false() })
 
</pre>
 
</pre>
|-
+
 
| '''Errors'''
+
;Result:
|{{Error|BXHL0001|#Errors}} the input cannot be converted to XML.
 
|-
 
| '''Examples'''
 
|
 
* {{Code|html:parse("<html></html>")}} returns {{Code|<html/>}}
 
* <code><nowiki>html:parse("<a href='ok.html'/>", map { 'nons' := true() })</nowiki></code> creates an XML document without namespaces. It returns:
 
 
<pre class="brush:xml">
 
<pre class="brush:xml">
<html>
+
<html xmlns="http://www.w3.org/1999/xhtml">
 
   <body>
 
   <body>
 
     <a shape="rect" href="ok.html"/>
 
     <a shape="rect" href="ok.html"/>
Line 53: Line 68:
 
</html>
 
</html>
 
</pre>
 
</pre>
* <code><nowiki>html:parse(fetch:content-binary("http://en.wikipedia.org"))</nowiki></code> returns an XML representation of the English Wikipedia main page. The input is passed on its binary representation such that the HTML parser can automatically detect the correct encoding.
+
 
|}
+
===Parsing Binary Input===
 +
 
 +
If the input encoding is unknown, the data to be processed can be passed on in its binary representation.
 +
The HTML parser will automatically try to detect the correct encoding:
 +
 
 +
;Query:
 +
<pre class="brush:xquery">
 +
html:parse(fetch:binary("https://en.wikipedia.org"))
 +
</pre>
 +
 
 +
;Result:
 +
<pre class="brush:xml">
 +
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
 +
  <head>
 +
    <title>Wikipedia, the free encyclopedia</title>
 +
    <meta charset="UTF-8"/>
 +
    ...
 +
</pre>
  
 
=Errors=
 
=Errors=
  
{| width='100%' class="wikitable" width="100%"
+
{| class="wikitable" width="100%"
! width="5%"|Code
+
! width="110"|Code
! width="95%"|Description
+
|Description
 
|-
 
|-
|{{Code|BXHL0001}}
+
|{{Code|parse}}
 
|The input cannot be converted to XML.
 
|The input cannot be converted to XML.
 
|}
 
|}
Line 68: Line 100:
 
=Changelog=
 
=Changelog=
  
The module was introduced with Version 7.5.1.
+
;Version 9.0
 +
 
 +
* Updated: error codes updated; errors now use the module namespace
  
[[Category:XQuery]]
+
The module was introduced with Version 7.6.

Revision as of 12:31, 28 June 2019

This XQuery Module provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see HTML Parsing for more details).

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/html namespace, which is statically bound to the html prefix.

Functions

html:parser

Signatures: html:parser() as xs:string
Summary: Returns the name of the applied HTML parser (currently: TagSoup). If an empty string is returned, TagSoup was not found in the classpath, and the input will be treated as well-formed XML.
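
The check described above can be sketched as follows. This is only an illustration: the message strings are made up for the example, and the only documented behaviour it relies on is that html:parser() returns an empty string when TagSoup is not in the classpath.

```xquery
(: A minimal sketch: report which HTML parser is active.
   The message texts are illustrative, not part of the module. :)
let $parser := html:parser()
return
  if ($parser = '') then
    'No HTML parser found: input will be treated as well-formed XML'
  else
    'HTML parser in use: ' || $parser
```

A query like this can be run before a batch conversion to avoid surprises when TagSoup is missing.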

html:parse

Signatures:
  html:parse($input as xs:anyAtomicType) as document-node()
  html:parse($input as xs:anyAtomicType, $options as map(*)?) as document-node()
Summary: Converts the HTML document specified by $input to XML and returns a document node:
  • The input may be either a string or a binary item (xs:hexBinary, xs:base64Binary).
  • If the input is passed in its binary representation, the HTML parser will try to choose the correct encoding automatically.

The $options argument can be used to set TagSoup Options.

Errors: parse: the input cannot be converted to XML.
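
Since Version 9.0 the error lives in the module namespace, so it can presumably be caught by the name html:parse. The following is a sketch under that assumption. The input string and the <error/> wrapper element are hypothetical, and whether a given input actually raises the error depends on the parser in use (TagSoup is very tolerant of malformed HTML).

```xquery
(: Declared explicitly so the query is self-contained. :)
declare namespace err = "http://www.w3.org/2005/xqt-errors";

(: Hypothetical input; any string or binary item could be used here. :)
let $input := "<p>unclosed paragraph"
return
  try {
    html:parse($input)
  } catch html:parse {
    (: $err:description holds the message of the caught error. :)
    <error>{ $err:description }</error>
  }
```

Revisions before 9.0 used the code BXHL0001 in the bxerr namespace instead, so the catch clause would need to name that error there.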

Examples

Basic Example

The following query converts the specified string to an XML document node.

Query
html:parse("<html>")
Result
<html xmlns="http://www.w3.org/1999/xhtml"/>

Specifying Options

The next query sets the TagSoup option nons to false(), so the resulting XML document keeps the XHTML namespace:

Query
html:parse("<a href='ok.html'/>", map { 'nons': false() })
Result
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a shape="rect" href="ok.html"/>
  </body>
</html>
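
Because the result is in the XHTML namespace, path expressions against it need to take that namespace into account. A short sketch, in which the xhtml prefix is an arbitrary choice made for this example:

```xquery
declare namespace xhtml = "http://www.w3.org/1999/xhtml";

(: Same call as above; the path step must name the XHTML namespace to match. :)
let $doc := html:parse("<a href='ok.html'/>", map { 'nons': false() })
return $doc//xhtml:a/@href/string()
```

Given the result shown above, this should yield the string ok.html. An unprefixed step such as //a would not match the namespaced element.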

Parsing Binary Input

If the input encoding is unknown, the data can be passed in its binary representation. The HTML parser will then try to detect the correct encoding automatically:

Query
html:parse(fetch:binary("https://en.wikipedia.org"))
Result
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
  <head>
    <title>Wikipedia, the free encyclopedia</title>
    <meta charset="UTF-8"/>
    ...
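
The parsed page can be queried like any other document node. The following sketch extracts the page title. It uses the wildcard prefix *: so that the step matches regardless of namespace, and it assumes network access to the URL from the example above.

```xquery
(: *:title matches the element whatever its namespace is. :)
let $page := html:parse(fetch:binary("https://en.wikipedia.org"))
return ($page//*:title)[1]/string()
```

The returned text depends on the live page, so no fixed result is given here.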

Errors

Code     Description
parse    The input cannot be converted to XML.

Changelog

Version 9.0
  • Updated: error codes; errors now use the module namespace

The module was introduced with Version 7.6.