Difference between revisions of "HTML Module"
m (Text replace - "assigned to the {{Code|http://basex.org/errors}} namespace" to "assigned to the <code><nowiki>http://basex.org/errors</nowiki></code> namespace") |
|||
(17 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | This [[Module Library|XQuery Module]] provides functions for converting HTML to XML. Conversion will only take place if | + | This [[Module Library|XQuery Module]] provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see [[Parsers#HTML Parser|HTML Parsing]] for more details). |
=Conventions= | =Conventions= | ||
− | All functions in this module | + | All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/html</nowiki></code> namespace, which is statically bound to the {{Code|html}} prefix.<br/> |
− | |||
=Functions= | =Functions= | ||
Line 20: | Line 19: | ||
==html:parse== | ==html:parse== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|html:parse|$input as xs:anyAtomicType|document-node()}}<br />{{Func|html:parse|$input as xs:anyAtomicType, $options as | + | |{{Func|html:parse|$input as xs:anyAtomicType|document-node()}}<br />{{Func|html:parse|$input as xs:anyAtomicType, $options as map(*)?|document-node()}}<br /> |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | |Converts the HTML document specified by {{Code|$input}} to XML | + | |Converts the HTML document specified by {{Code|$input}} to XML and returns a document node:<br/> |
* The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary). | * The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary). | ||
* If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding. | * If the input is passed on in its binary representation, the HTML parser will try to automatically choose the correct encoding. | ||
− | The {{Code|$options}} argument can be used to set [[Parsers# | + | The {{Code|$options}} argument can be used to set [[Parsers#Options|TagSoup Options]]. |
− | + | |- | |
− | + | | '''Errors''' | |
− | + | |{{Error|parse|#Errors}} the input cannot be converted to XML. | |
− | + | |} | |
− | + | ||
− | </html:options | + | ==html:doc== |
− | </ | + | |
− | + | {| width='100%' | |
− | + | |- | |
− | + | | width='120' | '''Signatures''' | |
− | + | |{{Func|html:doc|$uri as xs:string?|document-node()?}}<br />{{Func|html:doc|$uri as xs:string?, $options as map(*)?|document-node()?}}<br /> | |
+ | |- | ||
+ | | '''Summary''' | ||
+ | |Fetches the HTML document referred to by the given {{Code|$uri}}, converts it to XML and returns a document node. The {{Code|$options}} argument can be used to set [[Parsers#Options|TagSoup Options]]. | ||
|- | |- | ||
| '''Errors''' | | '''Errors''' | ||
− | |{{Error| | + | |{{Error|parse|#Errors}} the input cannot be converted to XML. |
|} | |} | ||
Line 54: | Line 57: | ||
;Query: | ;Query: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
html:parse("<html>") | html:parse("<html>") | ||
− | </ | + | </syntaxhighlight> |
;Result: | ;Result: | ||
− | < | + | <syntaxhighlight lang="xml"> |
<html xmlns="http://www.w3.org/1999/xhtml"/> | <html xmlns="http://www.w3.org/1999/xhtml"/> | ||
− | </ | + | </syntaxhighlight> |
===Specifying Options=== | ===Specifying Options=== | ||
− | The next query creates an XML document | + | The next query creates an XML document with namespaces: |
;Query: | ;Query: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
− | html:parse("<a href='ok.html'/>", map { 'nons': | + | html:parse("<a href='ok.html'/>", map { 'nons': false() }) |
− | </ | + | </syntaxhighlight> |
;Result: | ;Result: | ||
− | < | + | <syntaxhighlight lang="xml"> |
− | <html> | + | <html xmlns="http://www.w3.org/1999/xhtml"> |
<body> | <body> | ||
<a shape="rect" href="ok.html"/> | <a shape="rect" href="ok.html"/> | ||
</body> | </body> | ||
</html> | </html> | ||
− | </ | + | </syntaxhighlight> |
===Parsing Binary Input=== | ===Parsing Binary Input=== | ||
Line 87: | Line 90: | ||
;Query: | ;Query: | ||
− | < | + | <syntaxhighlight lang="xquery"> |
− | html:parse(fetch:binary(" | + | html:parse(fetch:binary("https://en.wikipedia.org")) |
− | </ | + | </syntaxhighlight> |
;Result: | ;Result: | ||
− | < | + | <syntaxhighlight lang="xml"> |
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en"> | <html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en"> | ||
<head> | <head> | ||
Line 98: | Line 101: | ||
<meta charset="UTF-8"/> | <meta charset="UTF-8"/> | ||
... | ... | ||
− | </ | + | </syntaxhighlight> |
=Errors= | =Errors= | ||
Line 106: | Line 109: | ||
|Description | |Description | ||
|- | |- | ||
− | |{{Code| | + | |{{Code|parse}} |
|The input cannot be converted to XML. | |The input cannot be converted to XML. | ||
|} | |} | ||
=Changelog= | =Changelog= | ||
+ | |||
+ | ;Version 9.4 | ||
+ | |||
+ | * Added: [[#html:doc|html:doc]] | ||
+ | |||
+ | ;Version 9.0 | ||
+ | |||
+ | * Updated: error codes updated; errors now use the module namespace | ||
The module was introduced with Version 7.6. | The module was introduced with Version 7.6. | ||
− | |||
− |
Revision as of 18:50, 18 November 2020
This XQuery Module provides functions for converting HTML to XML. Conversion will only take place if TagSoup is included in the classpath (see HTML Parsing for more details).
Conventions
All functions and errors in this module are assigned to the http://basex.org/modules/html namespace, which is statically bound to the html prefix.
Functions
html:parser
Signatures | html:parser() as xs:string |
Summary | Returns the name of the applied HTML parser (currently: TagSoup). If an empty string is returned, TagSoup was not found in the classpath, and the input will be treated as well-formed XML. |
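The check described above can be sketched as a short query. This is a minimal illustration, assuming only the documented behavior that an empty string signals a missing parser; the message strings are made up for the example.

<syntaxhighlight lang="xquery">
(: Report which HTML parser, if any, is available :)
let $parser := html:parser()
return if ($parser = '')
  then 'No HTML parser found: input will be treated as well-formed XML'
  else 'HTML parser in use: ' || $parser
</syntaxhighlight>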
html:parse
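Signatures | html:parse($input as xs:anyAtomicType) as document-node()
html:parse($input as xs:anyAtomicType, $options as map(*)?) as document-node() |
Summary | Converts the HTML document specified by $input to XML and returns a document node:
- The input may either be a string or a binary item (xs:hexBinary, xs:base64Binary).
- If the input is passed in its binary representation, the HTML parser tries to detect the correct encoding automatically.
The $options argument can be used to set TagSoup Options. |
Errors | parse: the input cannot be converted to XML. |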
html:doc
Signatures | html:doc($uri as xs:string?) as document-node()?
html:doc($uri as xs:string?, $options as map(*)?) as document-node()? |
Summary | Fetches the HTML document referred to by the given $uri, converts it to XML and returns a document node. The $options argument can be used to set TagSoup Options. |
Errors | parse: the input cannot be converted to XML. |
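As a rough sketch of how html:doc might be called, the following query fetches a page and extracts its title. The URL is only an illustration and requires network access; the nons option is borrowed from the examples on this page, and the *:title wildcard is used so that the path matches whether or not the result is namespaced.

<syntaxhighlight lang="xquery">
(: Fetch and convert a remote HTML page, then read its title.
   The URL is an assumption chosen for illustration. :)
let $doc := html:doc("https://en.wikipedia.org", map { 'nons': false() })
return $doc//*:title/string()
</syntaxhighlight>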
Examples
Basic Example
The following query converts the specified string to an XML document node.
- Query
<syntaxhighlight lang="xquery">
html:parse("<html>")
</syntaxhighlight>
- Result
<syntaxhighlight lang="xml">
<html xmlns="http://www.w3.org/1999/xhtml"/>
</syntaxhighlight>
Specifying Options
The next query creates an XML document with namespaces:
- Query
<syntaxhighlight lang="xquery">
html:parse("<a href='ok.html'/>", map { 'nons': false() })
</syntaxhighlight>
- Result
<syntaxhighlight lang="xml">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <a shape="rect" href="ok.html"/>
  </body>
</html>
</syntaxhighlight>
Parsing Binary Input
If the input encoding is unknown, the data to be processed can be passed on in its binary representation. The HTML parser will automatically try to detect the correct encoding:
- Query
<syntaxhighlight lang="xquery">
html:parse(fetch:binary("https://en.wikipedia.org"))
</syntaxhighlight>
- Result
<syntaxhighlight lang="xml">
<html xmlns="http://www.w3.org/1999/xhtml" class="client-nojs" dir="ltr" lang="en">
  <head>
    <title>Wikipedia, the free encyclopedia</title>
    <meta charset="UTF-8"/>
    ...
</syntaxhighlight>
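For contrast, when the encoding is known in advance, the page can be fetched as a string and passed to html:parse directly. This is a sketch only: it assumes the Fetch Module's fetch:text function with an explicit encoding argument, and it assumes the target page is encoded as UTF-8.

<syntaxhighlight lang="xquery">
(: Fetch the page as text with an explicitly stated encoding
   (UTF-8 is an assumption here), then convert it to XML :)
html:parse(fetch:text("https://en.wikipedia.org", "UTF-8"))
</syntaxhighlight>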
Errors
Code | Description |
---|---|
parse | The input cannot be converted to XML. |
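Because the error is assigned to the module namespace, it can presumably be caught by its qualified name html:parse. The query below is a minimal sketch of that pattern; the input string and the message text are chosen for illustration, and this particular input is not expected to fail.

<syntaxhighlight lang="xquery">
(: Catch a conversion failure by its qualified error name.
   $err:description holds the error message inside the catch clause. :)
try {
  html:parse("<a href='ok.html'/>")
} catch html:parse {
  'Conversion failed: ' || $err:description
}
</syntaxhighlight>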
Changelog
- Version 9.4
- Added: html:doc
- Version 9.0
- Updated: error codes updated; errors now use the module namespace
The module was introduced with Version 7.6.