Parsers

Revision as of 16:56, 10 April 2019


This article is part of the [[Getting Started]] Section.

It presents the available parsers that can be used to import various data sources into BaseX databases. Please visit the [[Serialization]] article if you want to know how to export data.

==XML Parsers==

BaseX provides two alternatives for parsing XML:

* By default, Java’s [http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/SAXParser.html SAXParser] is used to parse XML documents. This parser is stricter than the built-in parser, but it refuses to process some large documents.
* The internal, built-in XML parser is more fault-tolerant than Java’s XML parser. It supports standard HTML entities out of the box, and it is faster than the default parser, in particular if small documents are to be parsed. However, the internal parser does not support the full range of DTD features and cannot resolve [[Catalog_Resolver|catalogs]].

===GUI===

Go to Menu ''Database'' → ''New'', then choose the ''Parsing'' tab and (de)activate ''Use internal XML parser''. The parsing of DTDs can be turned on/off by selecting the checkbox below.

===Command Line===

To turn the internal XML parser and DTD parsing on/off, modify the <code>INTPARSE</code> and <code>DTD</code> options:

 SET {{Option|INTPARSE}} true
 SET {{Option|DTD}} true
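As a minimal sketch, both options can be set in a command script before a database is created with the {{Code|CREATE DB}} command. The database name {{Code|DB}} and the input path {{Code|input.xml}} are placeholders chosen for illustration only:

 SET INTPARSE true
 SET DTD true
 CREATE DB DB input.xml

The options need to be set before the database is created, as they are evaluated while the input is parsed.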

===XQuery===

The [[Database Module#db:add|db:add()]] and [[Database Module#db:replace|db:replace()]] functions can also be used to add new XML documents to the database. The following example query uses the internal XML parser and adds all files to the database <code>DB</code> that are found in the directory <code>2Bimported</code>:

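<pre class="brush:xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
let $dir := "2Bimported/"
for $file in file:list($dir)
return db:add('DB', $dir || $file, '', map { 'intparse': true() })
</pre>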

==HTML Parser==

If [http://vrici.lojban.org/~cowan/XML/tagsoup/ TagSoup] is found in the [[Startup#Distributions|classpath]], HTML can be imported into BaseX without any problems. TagSoup ensures that only well-formed HTML arrives at the XML parser (correct opening and closing tags, etc.).

If TagSoup is not available on a system, the default XML parser will be used. The import will then succeed only if the input is well-formed XML.

===Installation===

=====Downloads=====

TagSoup is already included in the full BaseX distributions ({{Code|BaseX.zip}}, {{Code|BaseX.exe}}, etc.). It can also be manually downloaded and embedded on the appropriate platforms.

=====Maven=====

An easy way to add TagSoup to your project is to follow these steps:

1. visit [http://mvnrepository.com/artifact/org.ccil.cowan.tagsoup/tagsoup/ MVN TagSoup Repository]

2. click on the version you want

3. on the first tab, you can see an XML snippet like this:

<pre class="brush:xml">
<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2.1</version>
</dependency>
</pre>

4. copy that snippet into the <code><dependencies></code> element of your own Maven project’s <code>pom.xml</code> file.

5. don’t forget to run <code>mvn jetty:run</code> again

=====Debian=====

With Debian, TagSoup will be automatically detected and included after it has been installed via:

apt-get install libtagsoup-java

===Options===

TagSoup offers a variety of options to customize the HTML conversion. For the complete list, please visit the [http://vrici.lojban.org/~cowan/XML/tagsoup/#program TagSoup] website. BaseX supports most of these options with a few exceptions:

* '''encoding''': BaseX tries to guess the input encoding, but this can be overridden with this option.

* '''files''': not supported, as input documents are piped directly to the XML parser.

* '''method''': set to 'xml' by default. If it is set to 'html', closing tags may be missing, for instance.

Turn on the HTML Parser before parsing documents, and set a file filter:

 SET {{Option|PARSER}} html
 SET {{Option|HTMLPARSER}} method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
 SET {{Option|CREATEFILTER}} *.html

===XQuery===

The [[HTML Module]] provides a function for converting HTML to XML documents.

Documents can also be converted by specifying the parser and additional options as function arguments:

<pre class="brush:xquery">
fetch:xml("index.html", map {
  'parser': 'html',
  'htmlparser': map { 'html': false(), 'nodefaults': true() }
})
</pre>
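The conversion function of the [[HTML Module]] can be sketched as follows. The example assumes that TagSoup is available in the classpath; the input string and the choice of the {{Code|nons}} option are for illustration only:

<pre class="brush:xquery">
(: converts a malformed HTML string to an XML document;
   assumes that TagSoup is found in the classpath :)
html:parse("<p>Hello<br>World", map { 'nons': true() })
</pre>

Without TagSoup, the input is passed on to the XML parser, so a string like the one above is expected to be rejected because it is not well-formed.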

==JSON Parser==

BaseX can also import JSON documents. The resulting format is described in the documentation for the XQuery [[JSON Module]].

===GUI===

Go to Menu ''Database'' → ''New'' and select "JSON" in the input format combo box.

* '''JsonML''': Activate this option if the incoming file is a JsonML file.

===Command Line===

Turn on the JSON Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} json
 SET {{Option|JSONPARSER}} encoding=utf-8, jsonml=true
 SET {{Option|CREATEFILTER}} *.json

===XQuery===

The [[JSON Module]] provides functions for converting JSON objects to XML documents.
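A minimal sketch of such a conversion, using the {{Code|json:parse}} function with its default options. The JSON string is made up for illustration; the structure of the result is described in the [[JSON Module]] article:

<pre class="brush:xquery">
(: converts a JSON string to an XML document, using the default conversion format :)
json:parse('{ "title": "Parsers", "pages": 2 }')
</pre>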

==CSV Parser==

BaseX can be used to import CSV documents. The following sections show the available alternatives:

===GUI===

Go to Menu ''Database'' → ''New'' and select "CSV" in the input format combo box.

* '''Separator''': Choose the column separator of the CSV file. Possible values are <code>comma</code>, <code>semicolon</code>, <code>tab</code>, <code>space</code>, or an arbitrary character.

* '''Header''': Activate this option if the incoming CSV files have a header line.

* {{Mark|Removed in Version 7.7.2}}: '''XML format''': Choose the XML format. Possible: <code>verbose</code>, <code>simple</code>.

===Command Line===

Turn on the CSV Parser before parsing documents, and set some optional, parser-specific options and a file filter. Unicode code points can be specified as separators; {{Code|32}} is the code point for spaces:

 SET {{Option|PARSER}} csv
 SET {{Option|CSVPARSER}} encoding=utf-8, lines=true, header=false, separator=space
 SET {{Option|CREATEFILTER}} *.csv

===XQuery===

The [[CSV Module]] provides a function for converting CSV to XML documents.

Documents can also be converted by specifying the parser in an XQuery function. The following example query adds all CSV files that are located in the directory {{Code|2Bimported}} to the database {{Code|DB}} and interprets the first lines as column headers:
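As a rough sketch, a CSV string can be converted directly with {{Code|csv:parse}}. The sample data is invented for illustration, and the {{Code|header}} option is assumed to treat the first line as column names, as in the import examples of this article:

<pre class="brush:xquery">
(: a small CSV string with a header line; the line breaks are part of the string :)
let $csv := "Name,City
Jack,Chicago
Jill,Berlin"
(: converts the string to an XML document and interprets the first line as header :)
return csv:parse($csv, map { 'header': true() })
</pre>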

<pre class="brush:xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
let $dir := "2Bimported/"
for $file in file:list($dir, false(), "*.csv")
return db:add("DB", $dir || $file, "", map { 'parser': 'csv', 'csvparser': map { 'header': true() } })
</pre>

==Text Parser==

Plain text can be imported as well:

===GUI===

Go to Menu ''Database'' → ''New'' and select "TEXT" in the input format combo box.

* '''Lines''': Activate this option to create a <code><line>...</line></code> element for each line of the input text file.

===Command Line===

Turn on the Text Parser before parsing documents, and set some optional, parser-specific options and a file filter:

 SET {{Option|PARSER}} text
 SET {{Option|TEXTPARSER}} lines=yes
 SET {{Option|CREATEFILTER}} *

===XQuery===

Similar to the other formats, the text parser can also be specified via XQuery:

<pre class="brush:xquery">
(: file:list returns paths relative to the directory, so the directory is prepended :)
let $dir := "2Bimported/"
for $file in file:list($dir, true(), "*.txt")
return db:add("DB", $dir || $file, "", map { 'parser': 'text' })
</pre>

=Changelog=

;Version 7.8

* Updated: parser options

;Version 7.7.2

* Removed: CSV option "format".

;Version 7.3

* Updated: the CSV {{Code|SEPARATOR}} option may now be assigned arbitrary single characters

;Version 7.2

* Updated: Enhanced support for TagSoup options.

CG
