Difference between revisions of "CSV Module"

From BaseX Documentation
Jump to navigation Jump to search
(32 intermediate revisions by 3 users not shown)
Line 3: Line 3:
 
=Conventions=
 
=Conventions=
  
All functions in this module are assigned to the {{Code|http://basex.org/modules/csv}} namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
+
All functions in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
All errors are assigned to the {{Code|http://basex.org/errors}} namespace, which is statically bound to the {{Code|bxerr}} prefix.
+
All errors are assigned to the <code><nowiki>http://basex.org/errors</nowiki></code> namespace, which is statically bound to the {{Code|bxerr}} prefix.
  
 
==Conversion==
 
==Conversion==
Line 16: Line 16:
 
* Fields are represented via {{Code|<entry>}} elements. The value of a field is represented as text node.
 
* Fields are represented via {{Code|<entry>}} elements. The value of a field is represented as text node.
 
* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header, and the {{Code|<entry>}} elements are replaced with the field names:
 
* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header, and the {{Code|<entry>}} elements are replaced with the field names:
** Empty names are represented by a single underscore ({{Code|_}}), and characters that are not valid in element names are replaced with underscores.
+
** Empty names are represented by a single underscore ({{Code|_}}), and characters that are not valid in element names are replaced with underscores or (when invalid as first character of an element name) prefixed with an underscore.
 
** If the {{Code|lax}} option is set to {{Code|false}}, invalid characters will be rewritten to an underscore and the character’s four-digit Unicode, and underscores will be represented as two underscores ({{Code|__}}). The resulting element names may be less readable, but can always be converted back to the original field names.
 
** If the {{Code|lax}} option is set to {{Code|false}}, invalid characters will be rewritten to an underscore and the character’s four-digit Unicode, and underscores will be represented as two underscores ({{Code|__}}). The resulting element names may be less readable, but can always be converted back to the original field names.
 
* If {{Code|format}} is set to {{Code|attributes}}, field names will be stored in name attributes.
 
* If {{Code|format}} is set to {{Code|attributes}}, field names will be stored in name attributes.
 
In the Database Creation dialog of the GUI, when the CSV parser is selected, the ''Parsing'' tab demonstrates the conversion of CSV to XML and the effects of the single conversion options.
 
  
 
===Map===
 
===Map===
Line 29: Line 27:
 
* By default, all entries of a records are represented in a sequence.
 
* By default, all entries of a records are represented in a sequence.
 
* If the {{Code|header}} option is set to {{Code|true}}, a map is created, which contains all field names and its values.
 
* If the {{Code|header}} option is set to {{Code|true}}, a map is created, which contains all field names and its values.
 +
 +
'''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options.
  
 
==Options==
 
==Options==
  
The following options are available:
+
{{Mark|Updated with Version 8.6}}: improved Excel compatibility
 +
 
 +
In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel.
  
 
{| class="wikitable sortable" width="100%"
 
{| class="wikitable sortable" width="100%"
Line 40: Line 42:
 
! Allowed
 
! Allowed
 
! Default
 
! Default
 +
! Excel
 
|- valign="top"
 
|- valign="top"
 
| {{Code|separator}}
 
| {{Code|separator}}
| Defines the character which separates the entries of a record in a single line.
+
| Defines the character which separates the values of a single record.
 
| {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character''
 
| {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character''
 
| {{Code|comma}}
 
| {{Code|comma}}
 +
| {{Code|semicolon}}
 
|- valign="top"
 
|- valign="top"
 
| {{Code|header}}
 
| {{Code|header}}
Line 50: Line 54:
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|no}}
 
| {{Code|no}}
 +
|
 
|- valign="top"
 
|- valign="top"
 
| {{Code|format}}
 
| {{Code|format}}
Line 58: Line 63:
 
| {{Code|direct}}, {{Code|attributes}}, {{Code|map}}
 
| {{Code|direct}}, {{Code|attributes}}, {{Code|map}}
 
| {{Code|direct}}
 
| {{Code|direct}}
 +
|
 
|- valign="top"
 
|- valign="top"
 
| {{Code|lax}}
 
| {{Code|lax}}
Line 63: Line 69:
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|yes}}
 
| {{Code|yes}}
 +
| {{Code|no}}
 +
|- valign="top"
 +
| {{Code|quotes}}
 +
| Specifies how quotes are parsed:
 +
* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
 +
* Serialization: If the option is enabled, the value will be wrapped with quotes. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option.
 +
| {{Code|yes}}, {{Code|no}}
 +
| {{Code|yes}}
 +
| {{Code|yes}}
 +
|- valign="top"
 +
| {{Code|backslashes}}
 +
| Specifies how quotes and other characters are escaped:
 +
* Parsing: If the option is enabled, {{Code|\r}}, {{Code|n}} and {{Code|\t}} will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: {{Code|\"}} → {{Code|"}}). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless {{Code|quotes}} is enabled and the quote is the first or last character of a value).
 +
* Serialization: If the option is enabled, {{Code|\r}}, {{Code|n}}, {{Code|\t}}, {{Code|"}} and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.
 +
| {{Code|yes}}, {{Code|no}}
 +
| {{Code|no}}
 +
| {{Code|no}}
 
|}
 
|}
 
The CSV function signatures provide an {{Code|$options}} argument. Options can either be specified
 
* as children of an {{Code|<csv:options/>}} element; e.g.:
 
<pre class="brush:xml">
 
<csv:options>
 
  <csv:separator value=';'/>
 
  ...
 
</csv:options>
 
</pre>
 
* or as map, which contains all key/value pairs:
 
<pre class="brush:xquery">
 
{ 'separator': ';', ... }
 
</pre>
 
  
 
=Functions=
 
=Functions=
  
 
==csv:parse==
 
==csv:parse==
 
{{Version|7.8}}: [[#csv:parse|csv:parse]] now returns a document node instead of an element, or an XQuery map if {{Code|format}} is set to {{Code|map}}. New options have been added.
 
  
 
{| width='100%'
 
{| width='100%'
 
|-
 
|-
 
| width='120' | '''Signatures'''
 
| width='120' | '''Signatures'''
|{{Func|csv:parse|$input as xs:string|document-node(element(csv))}}<br/>{{Func|csv:parse|$input as xs:string, $options as item()|item()}}
+
|{{Func|csv:parse|$input as xs:string|document-node(element(csv))}}<br/>{{Func|csv:parse|$input as xs:string, $options as map(xs:string, item())|item()}}
 
|-
 
|-
 
| '''Summary'''
 
| '''Summary'''
Line 97: Line 105:
  
 
==csv:serialize==
 
==csv:serialize==
 +
 
{| width='100%'
 
{| width='100%'
 
|-
 
|-
 
| width='120' | '''Signatures'''
 
| width='120' | '''Signatures'''
|{{Func|csv:serialize|$input as node(), $options as item()|xs:string}}
+
|{{Func|csv:serialize|$input as node()|xs:string}}<br/>{{Func|csv:serialize|$input as node(), $options as map(xs:string, item())|xs:string}}
 
|-
 
|-
 
| '''Summary'''
 
| '''Summary'''
|Serializes the node specified by {{Code|$input}} as CSV data, and returns the result as {{Code|xs:string}}.Items can also be serialized as JSON if the [[Serialization|Serialization Parameter]] {{Code|method}} is set to {{Code|csv}}.<br/>The {{Code|$options}} argument can be used to control the way the input is serialized.
+
|Serializes the node specified by {{Code|$input}} as CSV data, and returns the result as {{Code|xs:string}}. Items can also be serialized as JSON if the [[Serialization|Serialization Parameter]] {{Code|method}} is set to {{Code|csv}}.<br/>The {{Code|$options}} argument can be used to control the way the input is serialized.
 
|-
 
|-
 
| '''Errors'''
 
| '''Errors'''
Line 122: Line 131:
 
<pre class="brush:xquery">
 
<pre class="brush:xquery">
 
let $text := file:read-text('addressbook.csv')
 
let $text := file:read-text('addressbook.csv')
return csv:parse($text, { 'header': true() })
+
return csv:parse($text, map { 'header': true() })
 
</pre>
 
</pre>
  
Line 142: Line 151:
 
'''Query:'''
 
'''Query:'''
 
<pre class="brush:xquery">
 
<pre class="brush:xquery">
let $text := file:read-text('some-data.csv')
+
let $options := map { 'lax': false() }
let $options := { 'lax': 'no' }
+
let $input := file:read-text('some-data.csv')
let $xml := csv:parse($text, $options)
+
let $output := $input => csv:parse($options) => csv:serialize($options)
let $csv := csv:serialize($xml, $options)
+
return $input eq $output
return $text eq $csv
 
 
</pre>
 
</pre>
  
Line 154: Line 162:
 
<pre class="brush:xquery">
 
<pre class="brush:xquery">
 
let $text := "Name;City" || out:nl() || "John;Newton" || out:nl() || "Jack;Oldtown"
 
let $text := "Name;City" || out:nl() || "John;Newton" || out:nl() || "Jack;Oldtown"
let $options :=
+
let $options := map {
   <csv:options>
+
   'separator': ';',
    <csv:separator value=';'/>
+
  'format'  : 'map',
    <csv:format value='map'/>
+
  'header'  : true()
    <csv:header value='yes'/>
+
}
   </csv:options>
+
return csv:parse($text, $options)
let $map := csv:parse($text, $options)
 
return map:serialize($map)
 
 
</pre>
 
</pre>
  
 
'''Result:'''
 
'''Result:'''
<pre class="brush:xml">
+
<pre class="brush:xquery">
{
+
map {
   1: {
+
   1: map {
 
     "City": "Newton",
 
     "City": "Newton",
 
     "Name": "John"
 
     "Name": "John"
 
   },
 
   },
   2: {
+
   2: map {
 
     "City": "Oldtown",
 
     "City": "Oldtown",
 
     "Name": "Jack"
 
     "Name": "Jack"
Line 192: Line 198:
  
 
=Changelog=
 
=Changelog=
 +
 +
;Version 8.6
 +
 +
* Updated: [[#Options|Options]]: improved Excel compatibility
 +
 +
;Version 8.0
 +
 +
* Added: {{Code|backslashes}} option
  
 
;Version 7.8
 
;Version 7.8
Line 199: Line 213:
  
 
The module was introduced with Version 7.7.2.
 
The module was introduced with Version 7.7.2.
 
[[Category:XQuery]]
 

Revision as of 15:18, 11 February 2017

This XQuery Module contains a single function to parse CSV input. CSV (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.

Conventions

All functions in this module are assigned to the http://basex.org/modules/csv namespace, which is statically bound to the csv prefix.
All errors are assigned to the http://basex.org/errors namespace, which is statically bound to the bxerr prefix.

Conversion

XML: Direct, Attributes

CSV is converted to XML as follows:

  • The resulting XML document has a <csv> root element.
  • Rows are represented via <record> elements.
  • Fields are represented via <entry> elements. The value of a field is represented as text node.
  • If the header option is set to true, the first text line is parsed as table header, and the <entry> elements are replaced with the field names:
    • Empty names are represented by a single underscore (_), and characters that are not valid in element names are replaced with underscores or (when invalid as first character of an element name) prefixed with an underscore.
    • If the lax option is set to false, invalid characters will be rewritten to an underscore and the character’s four-digit Unicode, and underscores will be represented as two underscores (__). The resulting element names may be less readable, but can always be converted back to the original field names.
  • If format is set to attributes, field names will be stored in name attributes.

Map

If format is set to map, the CSV data will be converted to an XQuery map:

  • All records are enumerated with positive integers.
  • By default, all entries of a records are represented in a sequence.
  • If the header option is set to true, a map is created, which contains all field names and its values.

A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.

Options

Template:Mark: improved Excel compatibility

In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel.

Option Description Allowed Default Excel
separator Defines the character which separates the values of a single record. comma, semicolon, colon, tab, space or a single character comma semicolon
header Indicates if the first line of the parsed or serialized CSV data is a table header. yes, no no
format Specifies the format of the XML data. The format is only relevant if the header option is activated:
  • With direct conversion, field names are represented as element names
  • With attributes conversion, field names are stored in name attributes
  • With map conversion, the input is converted to an XQuery map
direct, attributes, map direct
lax Specifies if a lax approach is used to convert QNames to JSON names. yes, no yes no
quotes Specifies how quotes are parsed:
  • Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
  • Serialization: If the option is enabled, the value will be wrapped with quotes. A quote character in the value will be encoded according to the rules of the backslashes option.
yes, no yes yes
backslashes Specifies how quotes and other characters are escaped:
  • Parsing: If the option is enabled, \r, n and \t will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: \""). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless quotes is enabled and the quote is the first or last character of a value).
  • Serialization: If the option is enabled, \r, n, \t, " and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.
yes, no no no

Functions

csv:parse

Signatures csv:parse($input as xs:string) as document-node(element(csv))
csv:parse($input as xs:string, $options as map(xs:string, item())) as item()
Summary Converts the CSV data specified by $input to an XML document or a map. The $options argument can be used to control the way the input is converted.
Errors BXCS0001: the input cannot be parsed.

csv:serialize

Signatures csv:serialize($input as node()) as xs:string
csv:serialize($input as node(), $options as map(xs:string, item())) as xs:string
Summary Serializes the node specified by $input as CSV data, and returns the result as xs:string. Items can also be serialized as JSON if the Serialization Parameter method is set to csv.
The $options argument can be used to control the way the input is serialized.
Errors BXCS0002: the input cannot be serialized.

Examples

Example 1: Converts CSV data to XML, interpreting the first row as table header:

Input addressbook.csv:

Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing

Query:

let $text := file:read-text('addressbook.csv')
return csv:parse($text, map { 'header': true() })

Result:

<csv>
  <record>
    <Name>Huber</Name>
    <First_Name>Sepp</First_Name>
    <Address>Hauptstraße 13</Address>
    <City>93547 Hintertupfing</City>
  </record>
</csv>

Example 2: Converts some CSV data to XML and back, and checks if the input and output are equal. The expected result is true:

Query:

let $options := map { 'lax': false() }
let $input := file:read-text('some-data.csv')
let $output := $input => csv:parse($options) => csv:serialize($options)
return $input eq $output

Example 3: Converts CSV data to an XQuery map item and serializes its contents:

Query:

let $text := "Name;City" || out:nl() || "John;Newton" || out:nl() || "Jack;Oldtown"
let $options := map {
  'separator': ';',
  'format'   : 'map',
  'header'   : true()
}
return csv:parse($text, $options)

Result:

map {
  1: map {
    "City": "Newton",
    "Name": "John"
  },
  2: map {
    "City": "Oldtown",
    "Name": "Jack"
  }
}

Errors

Code Description
BXCS0001 The input cannot be parsed.
BXCS0002 The node cannot be serialized.

Changelog

Version 8.6
  • Updated: Options: improved Excel compatibility
Version 8.0
  • Added: backslashes option
Version 7.8
  • Updated: csv:parse now returns a document node instead of an element, or an XQuery map if format is set to map.
  • Added: format and lax options

The module was introduced with Version 7.7.2.