HTML Module

From BaseX Documentation

Latest revision as of 18:39, 1 December 2023

This XQuery Module provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see HTML Parsing for more details).

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/html namespace, which is statically bound to the html prefix.

Functions

html:doc

Signature
html:doc(
  $href     as xs:string?,
  $options  as map(*)?     := map { }
) as document-node()?
Summary
Fetches the HTML document referred to by the given $href, converts it to XML and returns a document node. The $options argument can be used to set TagSoup Options.
Errors
parse: the input cannot be converted to XML.
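
As a brief illustration of the signature above, the following sketch fetches a remote page and removes namespaces from the result. The URL is the one used in the binary example on this page, and nons is the TagSoup option shown under Specifying Options. Whether the call succeeds depends on network access and on TagSoup being in the classpath, so this is a sketch rather than a reference call.

Query
(: fetch and convert a remote page; setting 'nons' to true() drops namespaces :)
html:doc("https://en.wikipedia.org", map { 'nons': true() })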

html:parse

Signature
html:parse(
  $value    as xs:anyAtomicType,
  $options  as map(*)?           := map { }
) as document-node()
Summary
Converts the HTML document specified by $value to XML and returns a document node:
  • The input may be of type xs:string, xs:base64Binary, or xs:hexBinary.
  • If the input is passed as a binary item and no encoding option is supplied, the HTML parser will try to choose the correct encoding automatically.

The $options argument can be used to set TagSoup Options.

Errors
parse: the input cannot be converted to XML.
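
The binary case with an explicit encoding might look as follows, so that the parser does not have to guess. This is a minimal sketch: the option name encoding is assumed to be the TagSoup option that the summary refers to, and the hex value is an illustrative input chosen for this example.

Query
(: 3C703E68693C2F703E is the UTF-8 byte sequence of the string <p>hi</p> :)
html:parse(xs:hexBinary('3C703E68693C2F703E'), map { 'encoding': 'UTF-8' })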

html:parser

Signature
html:parser() as xs:string
Summary
Returns the name of the applied HTML parser (currently: TagSoup). If an empty string is returned, TagSoup was not found in the classpath, and the input will be treated as well-formed XML.
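
Because the function returns an empty string when TagSoup is missing, it can be used as a quick availability check before converting input. The message strings in this sketch are illustrative only.

Query
(: report which parser is active; an empty string means TagSoup was not found :)
let $parser := html:parser()
return if ($parser = '')
  then 'TagSoup not found: input is treated as well-formed XML'
  else 'HTML parser: ' || $parser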

Examples

Basic Example

The following query converts the specified string to an XML document node.

Query
html:parse("<html>")
Result
<html xmlns="http://www.w3.org/1999/xhtml"/>

Specifying Options

The next query sets the nons option to false(), which creates an XML document with namespaces:

Query
html:parse("<a href='ok.html'/>", map { 'nons': false() })
Result
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a shape="rect" href="ok.html"/>
  </body>
</html>
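
When namespaces are kept, path expressions on the result have to address the XHTML namespace. The following sketch assumes TagSoup is available and uses the prefix xhtml, which is chosen here for illustration and is not predefined by the module.

Query
declare namespace xhtml = 'http://www.w3.org/1999/xhtml';
(: namespaces are kept, so the element step uses the xhtml prefix :)
html:parse("<a href='ok.html'/>", map { 'nons': false() })//xhtml:a/@href/string()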

Parsing Binary Input

If the input encoding is unknown, the data to be processed can be passed as a binary item. The HTML parser will then try to detect the correct encoding automatically:

Query
html:parse(fetch:binary("https://en.wikipedia.org"))
Result
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
  <head>
    <title>Wikipedia, the free encyclopedia</title>
    <meta charset="UTF-8"/>
    ...
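
The converted document can be queried like any other XML node. As a sketch, the following query reads the page title from the fetched document. The wildcard prefix *: is used so that the path matches whether or not namespaces are present in the result.

Query
(: convert the fetched bytes and return the text of the title element :)
html:parse(fetch:binary("https://en.wikipedia.org"))//*:title/string()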

Errors

Code    Description
parse   The input cannot be converted to XML.
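
Since errors are assigned to the module namespace, the parse error can be caught by name with the html prefix. The sketch below uses an illustrative $input value. With TagSoup in the classpath this input converts without error (see the Basic Example), so the catch clause is only expected to be reached when the input is treated as well-formed XML.

Query
(: $input is an illustrative value; the catch clause matches the html:parse error :)
let $input := '<html>'
return try {
  html:parse($input)
} catch html:parse {
  'Conversion failed: ' || $err:description
}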

Changelog

Version 9.4
  • Added: html:doc
Version 9.0
  • Updated: error codes updated; errors now use the module namespace

The module was introduced with Version 7.6.