Parsers

Revision as of 13:34, 2 July 2020

=XML Parsers=

BaseX provides two alternatives for parsing XML:

* By default, Java’s [https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/javax/xml/parsers/SAXParser.html SAXParser] is used to parse XML documents. This parser is stricter than the built-in parser, but it refuses to process some large documents.
* The internal, built-in XML parser is more fault-tolerant than Java’s XML parser. It supports standard HTML entities out-of-the-box, and it is faster than the default parser, in particular if small documents are to be parsed. However, the internal parser does not support the full range of DTD features and cannot resolve [[Catalog_Resolver|catalogs]].

==GUI==

To turn the internal XML parser and DTD parsing on/off, modify the <code>INTPARSE</code> and <code>DTD</code> options:

 SET {{Option|INTPARSE}} true
 SET {{Option|DTD}} true

==XQuery==

The [[Database Module#db:add|db:add]] and [[Database Module#db:replace|db:replace]] functions can also be used to add new XML documents to the database. The following example query uses the internal XML parser and adds all files to the database <code>DB</code> that are found in the directory <code>2Bimported</code>:

<syntaxhighlight lang="xquery">
for $file in file:list("2Bimported")
return db:add('DB', $file, '', map { 'intparse': true() })
</syntaxhighlight>
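The options map can presumably carry more than one parsing option. The query below is a minimal sketch that mirrors the two <code>SET</code> commands shown earlier. It assumes that <code>dtd</code> is accepted as a key in the same way as <code>intparse</code>, and the file <code>2Bimported/doc.xml</code> is a hypothetical input:

<syntaxhighlight lang="xquery">
(: Hypothetical input file; 'dtd' is assumed to be a valid key of the options map :)
db:add('DB', '2Bimported/doc.xml', '', map { 'intparse': true(), 'dtd': true() })
</syntaxhighlight>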

=HTML Parser=

If [http://vrici.lojban.org/~cowan/XML/tagsoup/ TagSoup] is found in the [[Startup#Distributions|classpath]], HTML can be imported in BaseX without any problems. TagSoup ensures that only well-formed HTML arrives at the XML parser (correct opening and closing tags, etc.). If TagSoup is not available on a system, the default XML parser will be used. In that case, the import will only succeed if the input is well-formed XML.

==Installation==

====Maven====

An easy way to add TagSoup to your project is to follow these steps:

1. Visit [https://mvnrepository.com/artifact/org.ccil.cowan.tagsoup/tagsoup/ MVN TagSoup Repository]

2. Click on the version you want

3. On the first tab, you can see an XML snippet like this:

<syntaxhighlight lang="xml">
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId> <!-- plus the <version> element of the release chosen in step 2 -->
</dependency>
</syntaxhighlight>

4. Copy that into the <code>&lt;dependencies&gt;</code> element of your own Maven project’s <code>pom.xml</code> file.

5. Don’t forget to run <code>mvn jetty:run</code> again.

====Debian====

apt-get install libtagsoup-java

==Options==

TagSoup offers a variety of options to customize the HTML conversion. For the complete list, please visit the [http://vrici.lojban.org/~cowan/XML/tagsoup/#program TagSoup] website. BaseX supports most of these options with a few exceptions:

* '''encoding''': BaseX tries to guess the input encoding, but this can be overwritten by this option.

* '''files''': not supported, as input documents are piped directly to the XML parser.

* '''method''': set to 'xml' by default. If it is set to 'html', end tags may be missing, for instance.

Turn on the HTML Parser before parsing documents, and set a file filter:

 SET {{Option|PARSER}} html
 SET {{Option|HTMLPARSER}} method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
 SET {{Option|CREATEFILTER}} *.html

===XQuery===

The parser and its options can also be passed as function arguments:

<syntaxhighlight lang="xquery">
fetch:xml("index.html", map {
  'parser': 'html',
  'htmlparser': map { 'html': false(), 'nodefaults': true() }
})
</syntaxhighlight>

=JSON Parser=

Turn on the JSON Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} json
 SET {{Option|JSONPARSER}} encoding=utf-8, jsonml=true
 SET {{Option|CREATEFILTER}} *.json

==XQuery==
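Following the pattern used for the other parsers on this page, a JSON document can presumably be added via XQuery as well. The query below is a minimal sketch and is not taken from the original text: the file name <code>2Bimported/data.json</code> is hypothetical, and the <code>jsonparser</code> key with its <code>format</code> option is assumed from the JSON Module (the command above expresses the same setting as <code>jsonml=true</code>):

<syntaxhighlight lang="xquery">
(: Hypothetical input file; 'jsonparser' and 'format' are assumed option names :)
db:add("DB", "2Bimported/data.json", "", map {
  'parser': 'json',
  'jsonparser': map { 'format': 'jsonml' }
})
</syntaxhighlight>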

=CSV Parser=

Turn on the CSV Parser before parsing documents, and set some optional, parser-specific options and a file filter. Unicode code points can be specified as separators; {{Code|32}} is the code point for spaces:

 SET {{Option|PARSER}} csv
 SET {{Option|CSVPARSER}} encoding=utf-8, lines=true, header=false, separator=space
 SET {{Option|CREATEFILTER}} *.csv

==XQuery==

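The following query adds all CSV files in the directory <code>2Bimported</code> to the database <code>DB</code> and interprets the first lines as column headers: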

<syntaxhighlight lang="xquery">
for $file in file:list("2Bimported", false(), "*.csv")
return db:add("DB", $file, "", map {
  'parser': 'csv',
  'csvparser': map { 'header': true() }
})
</syntaxhighlight>

=Text Parser=

Turn on the Text Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} text
 SET {{Option|TEXTPARSER}} lines=yes
 SET {{Option|CREATEFILTER}} *

==XQuery==

Similar to the other formats, the text parser can also be specified via XQuery:

<syntaxhighlight lang="xquery">
for $file in file:list("2Bimported", true(), "*.txt")
return db:add("DB", $file, "", map { 'parser': 'text' })
</syntaxhighlight>
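The <code>lines</code> option from the command above could presumably be supplied in the same way. This is a sketch under the assumption that <code>textparser</code> is accepted as a nested map, analogous to <code>csvparser</code>; the file name <code>2Bimported/notes.txt</code> is hypothetical:

<syntaxhighlight lang="xquery">
(: Hypothetical input file; 'textparser' is assumed to work like 'csvparser' :)
db:add("DB", "2Bimported/notes.txt", "", map {
  'parser': 'text',
  'textparser': map { 'lines': true() }
})
</syntaxhighlight>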

=Changelog=

* Updated: Enhanced support for TagSoup options

CG
