Changes

Parsers (edit)

Revision as of 13:34, 2 July 2020

337 bytes removed , 13:34, 2 July 2020

no edit summary

This article is part of the [[Getting Started]] Section.

It presents the available parsers that can be used to import various data sources into BaseX databases.

Please visit the [[Serialization]] article if you want to know how to export data.

==XML Parsers==

BaseX provides two alternatives for parsing XML:

* By default, Java’s [https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/javax/xml/parsers/SAXParser.html SAXParser] is used to parse XML documents.
* The internal, built-in XML parser is more fault-tolerant than Java’s XML parser. It supports standard HTML entities out-of-the-box, and it is faster than the default parser, in particular if small documents are to be parsed. However, the internal parser does not support the full range of DTD features and cannot resolve [[Catalog_Resolver|catalogs]].

===GUI===

Go to Menu ''Database'' → ''New'', then choose the ''Parsing'' tab and (de)activate ''Use internal XML parser''. The parsing of DTDs can be turned on/off by selecting the checkbox below.

===Command Line===

To turn the internal XML parser and DTD parsing on/off, modify the <code>INTPARSE</code> and <code>DTD</code> options:

 SET {{Option|INTPARSE}} true
 SET {{Option|DTD}} true

===XQuery===

The [[Database Module#db:add|db:add()]] and [[Database Module#db:replace|db:replace()]] functions can also be used to add new XML documents to the database. The following example query uses the internal XML parser and adds all files to the database <code>DB</code> that are found in the directory <code>2Bimported</code>:

<syntaxhighlight lang="xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
for $file in file:list("2Bimported")
return db:add('DB', "2Bimported/" || $file, '', map { 'intparse': true() })
</syntaxhighlight>
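A single document can be handled the same way with <code>db:replace()</code>. The following is a minimal sketch, assuming the BaseX 9 argument order (database, target path, input, options) and a hypothetical input file <code>2Bimported/doc.xml</code>:

<syntaxhighlight lang="xquery">
(: replaces (or adds) the document 'doc.xml' in the database 'DB',
   parsing the hypothetical input file with the internal XML parser :)
db:replace('DB', 'doc.xml', '2Bimported/doc.xml', map { 'intparse': true() })
</syntaxhighlight>

Both examples are updating queries: the database changes only after the query has been fully evaluated.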

==HTML Parser==

If [http://vrici.lojban.org/~cowan/XML/tagsoup/ TagSoup] is found in the [[Startup#Distributions|classpath]], HTML can be imported into BaseX. TagSoup ensures that only well-formed markup arrives at the XML parser (correct opening and closing tags, etc.).

If TagSoup is not available on a system, the default XML parser will be used. In that case, the import will only succeed if the input is well-formed XML.

===Installation===

=====Downloads=====

TagSoup is already included in the full BaseX distributions ({{Code|BaseX.zip}}, {{Code|BaseX.exe}}, etc.). It can also be manually downloaded and embedded on the appropriate platforms.

=====Maven=====

An easy way to add TagSoup to your project is to follow these steps:

1. Visit [https://mvnrepository.com/artifact/org.ccil.cowan.tagsoup/tagsoup/ MVN TagSoup Repository].

2. Click on the version you want.

3. On the first tab, you can see an XML snippet like this:

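<syntaxhighlight lang="xml">
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <!-- example version only; use the version selected in step 2 -->
  <version>1.2.1</version>
</dependency>
</syntaxhighlight>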

4. Copy that into the <code>&lt;dependencies&gt;</code> element of your own Maven project’s <code>pom.xml</code> file.

5. Don’t forget to run <code>mvn jetty:run</code> again.

=====Debian=====

With Debian, TagSoup will be automatically detected and included after it has been installed via:

apt-get install libtagsoup-java

===Options===

TagSoup offers a variety of options to customize the HTML conversion. For the complete list, please visit the [http://vrici.lojban.org/~cowan/XML/tagsoup/#program TagSoup] website. BaseX supports most of these options, with a few exceptions:

* '''encoding''': BaseX tries to guess the input encoding, but this can be overridden by this option.
* '''files''': not supported, as input documents are piped directly to the XML parser.
* '''method''': set to 'xml' by default. If it is set to 'html', end tags may be missing, for instance.

Turn on the HTML Parser before parsing documents, and set a file filter:

 SET {{Option|PARSER}} html
 SET {{Option|HTMLPARSER}} method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
 SET {{Option|CREATEFILTER}} *.html

===XQuery===

Documents can also be converted by specifying the parser and additional options as function arguments:

<syntaxhighlight lang="xquery">
fetch:xml("index.html", map {
  'parser': 'html',
  'htmlparser': map { 'html': false(), 'nodefaults': true() }
})
</syntaxhighlight>
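HTML that is already available as a string can be converted in a similar way. The following is a rough sketch, assuming that the <code>html:parse()</code> function of the HTML Module is available in the BaseX version in use and that TagSoup is on the classpath. The input string is made up for illustration:

<syntaxhighlight lang="xquery">
(: the input is not well-formed XML: <br> is never closed and </p> is missing :)
let $html := "<p>First line<br>Second line"
(: 'nons' is one of the TagSoup options used in the command-line example above :)
return html:parse($html, map { 'nons': true() })
</syntaxhighlight>

Without TagSoup, the same call falls back to the XML parser and is expected to fail for this input, as it is not well-formed.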

==JSON Parser==

BaseX can also import JSON documents. The resulting format is described in the documentation for the XQuery [[JSON Module]].

===GUI===

Go to Menu ''Database'' → ''New'' and select "JSON" in the input format combo box.

* '''JsonML''': Activate this option if the incoming file is a JsonML file.

===Command Line===

Turn on the JSON Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} json
 SET {{Option|JSONPARSER}} encoding=utf-8, jsonml=true
 SET {{Option|CREATEFILTER}} *.json

===XQuery===

The [[JSON Module]] provides functions for converting JSON objects to XML documents.
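As a minimal sketch, a JSON string can be converted with <code>json:parse()</code>. The input below is made up for illustration, and the call relies on the module's default conversion format:

<syntaxhighlight lang="xquery">
(: converts a small JSON object to an XML document :)
json:parse('{ "title": "Parsers", "pages": 3 }')
</syntaxhighlight>

With the default format, the result is an XML document with a <code>json</code> root element whose child elements are named after the object keys.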

==CSV Parser==

BaseX can be used to import CSV documents. The following sections show different ways to do this:

===GUI===

Go to Menu ''Database'' → ''New'' and select "CSV" in the input format combo box.

* '''Separator''': Choose the column separator of the CSV file. Possible values: <code>comma</code>, <code>semicolon</code>, <code>tab</code>, <code>space</code>, or an arbitrary character.

* '''Header''': Activate this option if the incoming CSV files have a header line.

* {{Mark|Removed in Version 7.7.2}}: '''XML format''': Choose the XML format. Possible: <code>verbose</code>, <code>simple</code>.

===Command Line===

Turn on the CSV Parser before parsing documents, and set some optional, parser-specific options and a file filter. Unicode code points can be specified as separators; {{Code|32}} is the code point for spaces:

 SET {{Option|PARSER}} csv
 SET {{Option|CSVPARSER}} encoding=utf-8, lines=true, header=false, separator=space
 SET {{Option|CREATEFILTER}} *.csv

===XQuery===

The [[CSV Module]] provides a function for converting CSV to XML documents.

Documents can also be converted by specifying the parser in an XQuery function. The following example query adds all CSV files that are located in the directory {{Code|2Bimported}} to the database {{Code|DB}} and interprets the first lines as column headers:

<syntaxhighlight lang="xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
for $file in file:list("2Bimported", false(), "*.csv")
return db:add("DB", "2Bimported/" || $file, "", map {
  'parser': 'csv',
  'csvparser': map { 'header': true() }
})
</syntaxhighlight>
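To preview how a CSV input is converted before it is added to a database, the <code>csv:parse()</code> function of the CSV Module can be called directly. The following is a minimal sketch. The two-line input string is made up for illustration:

<syntaxhighlight lang="xquery">
(: &#xA; is a newline: the first line holds the column names, the second one record :)
let $csv := "Name,City&#xA;Ann,Paris"
return csv:parse($csv, map { 'header': true() })
</syntaxhighlight>

With <code>header</code> enabled, the values of the first line are used as element names for the fields of each record.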

==Text Parser==

Plain text can be imported as well:

===GUI===

Go to Menu ''Database'' → ''New'' and select "TEXT" in the input format combo box.

* '''Lines''': Activate this option to create a <code><line>...</line></code> element for each line of the input text file.

===Command Line===

Turn on the Text Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} text
 SET {{Option|TEXTPARSER}} lines=yes
 SET {{Option|CREATEFILTER}} *

===XQuery===

Similar to the other formats, the text parser can also be specified via XQuery:

<syntaxhighlight lang="xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
for $file in file:list("2Bimported", true(), "*.txt")
return db:add("DB", "2Bimported/" || $file, "", map { 'parser': 'text' })
</syntaxhighlight>
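The <code>lines</code> option from the command-line example can presumably be passed in the same way. The sketch below assumes that a <code>textparser</code> entry is accepted as a nested map, by analogy with <code>csvparser</code> in the CSV example. This has not been verified against a specific BaseX version:

<syntaxhighlight lang="xquery">
for $file in file:list("2Bimported", true(), "*.txt")
return db:add("DB", "2Bimported/" || $file, "", map {
  'parser': 'text',
  (: assumption: nested map, analogous to 'csvparser' :)
  'textparser': map { 'lines': true() }
})
</syntaxhighlight>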

=Changelog=

;Version 7.7.2

* Removed: CSV option "format".

;Version 7.3

* Updated: the CSV {{Code|SEPARATOR}} option may now be assigned arbitrary single characters.

;Version 7.2

* Updated: Enhanced support for TagSoup options.

CG

Bureaucrats, editor, reviewer, Administrators

13,550

edits
