HTML Module

From BaseX Documentation
Revision as of 19:36, 31 March 2014 by James Ball (talk | contribs) (Corrected fetch:content-binary to fetch:binary in section Parsing Binary Input)

This XQuery Module provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see HTML Parsing for more details).


All functions in this module are assigned to the http://basex.org/modules/html namespace, which is statically bound to the html prefix.
All errors are assigned to the http://basex.org/errors namespace, which is statically bound to the bxerr prefix.



Signatures: html:parser() as xs:string
Summary: Returns the name of the applied HTML parser (currently: TagSoup). If an empty string is returned, TagSoup was not found in the classpath, and the input will be treated as well-formed XML.
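
As a rough illustration of the summary above, the following sketch reports whether an HTML parser is available. The message strings are illustrative assumptions; only html:parser() comes from this module.

```xquery
(: Report which HTML parser is active.
   An empty string means that TagSoup was not found in the classpath. :)
let $parser := html:parser()
return
  if ($parser = '') then
    'No HTML parser found: input will be treated as well-formed XML'
  else
    concat('HTML parser in use: ', $parser)
```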


Signatures: html:parse($input as xs:anyAtomicType) as document-node()
            html:parse($input as xs:anyAtomicType, $options as item()) as document-node()
Summary: Converts the HTML document specified by $input to XML, and returns a document node:
  • The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary).
  • If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding.

The $options argument can be used to set TagSoup Options, which can be specified…

  • as children of an <html:options/> element; e.g.:
      <html:key1 value='value1'/>
  • as map, which contains all key/value pairs:
      map { "key1" := "value1", ... }

Errors: BXHL0001: the input cannot be converted to XML.
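
The two ways of passing options can be sketched side by side. This is a minimal sketch, not a definitive usage: it assumes the TagSoup option nons (the one used in the Specifying Options example on this page), assumes that the string 'true' is accepted as its value in the element form, and uses the map syntax shown on this page.

```xquery
(: Both calls are intended to be equivalent.
   The option name 'nons' and the value 'true' are assumptions for illustration. :)
let $input := "<a href='ok.html'/>"
return (
  (: options as children of an html:options element :)
  html:parse($input, <html:options><html:nons value='true'/></html:options>),
  (: options as a map, in the map syntax used on this page :)
  html:parse($input, map { 'nons' := true() })
)
```

The map form tends to be shorter for options computed at runtime, while the element form can be stored and reused as ordinary XML.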


Basic Example

The following query converts the specified string to an XML document node.
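
A minimal sketch of such a query, assuming TagSoup is in the classpath; the input string is an illustrative assumption:

```xquery
(: Parse a one-tag HTML string; TagSoup closes the element when building the result. :)
html:parse("<html>")
```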

Result:

<html xmlns="http://www.w3.org/1999/xhtml"/>

Specifying Options

The next query creates an XML document without namespaces:

Query:

html:parse("<a href='ok.html'/>", map { 'nons' := true() })

Result:

<a shape="rect" href="ok.html"/>

Parsing Binary Input

If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:
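
A sketch of such a call, assuming the Fetch Module function fetch:binary (the function named in this page's revision note) and an illustrative URL:

```xquery
(: fetch:binary returns the resource as a binary item, so the HTML parser,
   not the caller, decides on the encoding. The URL is an assumption. :)
html:parse(fetch:binary("http://en.wikipedia.org"))
```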

Result (excerpt):

<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
  <head>
    <title>Wikipedia, the free encyclopedia</title>
    <meta charset="UTF-8"/>
    ...


Code      Description
BXHL0001  The input cannot be converted to XML.
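
The error can be caught with a standard XQuery 3.0 try/catch expression. The sketch below is illustrative only: the input string and the shape of the <error/> element are assumptions. With TagSoup in the classpath, most malformed input is repaired rather than rejected, so the error is more likely to occur when the input is treated as well-formed XML.

```xquery
(: Illustrative input with unclosed tags. :)
let $input := "<p>some <b>markup"
return
  try {
    html:parse($input)
  } catch bxerr:BXHL0001 {
    (: $err:code and $err:description are the standard XQuery 3.0 catch variables :)
    <error code="{ $err:code }">{ $err:description }</error>
  }
```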


The module was introduced with Version 7.6.