Difference between revisions of "CSV Module"

From BaseX Documentation
Jump to navigation Jump to search
 
(33 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [http://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
+
This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [https://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
  
 
=Conventions=
 
=Conventions=
  
All functions in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
+
All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
All errors are assigned to the <code><nowiki>http://basex.org/errors</nowiki></code> namespace, which is statically bound to the {{Code|bxerr}} prefix.
 
  
 
==Conversion==
 
==Conversion==
Line 10: Line 9:
 
===XML: Direct, Attributes===
 
===XML: Direct, Attributes===
  
CSV is converted to XML as follows:
+
If the {{Code|direct}} or {{Code|attributes}} format is chosen, a CSV string is converted to XML:
  
* The resulting XML document has a {{Code|<csv>}} root element.
+
* The resulting XML document has a {{Code|csv}} root element.
* Rows are represented via {{Code|<record>}} elements.
+
* Rows are represented via {{Code|record}} elements.
* Fields are represented via {{Code|<entry>}} elements. The value of a field is represented as text node.
+
* Fields are represented via {{Code|entry}} elements. The value of a field is represented as text node.
* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header, and the {{Code|<entry>}} elements are replaced with the field names:
+
* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header:
** Empty names are represented by a single underscore ({{Code|_}}), and characters that are not valid in element names are replaced with underscores or (when invalid as first character of an element name) prefixed with an underscore.
+
** If {{Code|format}} is set to {{Code|direct}}, the field names are encoded, as described in the [[Conversion Module#Keys|Conversion Module]], and used as element names.
** If the {{Code|lax}} option is set to {{Code|false}}, invalid characters will be rewritten to an underscore and the character’s four-digit Unicode, and underscores will be represented as two underscores ({{Code|__}}). The resulting element names may be less readable, but can always be converted back to the original field names.
+
** Otherwise, if {{Code|format}} is {{Code|attributes}}, the field names will be stored in name attributes.
* If {{Code|format}} is set to {{Code|attributes}}, field names will be stored in name attributes.
 
  
 
'''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options.
 
'''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options.
Line 24: Line 22:
 
===XQuery===
 
===XQuery===
  
{{Mark|Introduced with Version 9.0}}:
+
With the {{Code|xquery}} format, CSV records are converted to a sequence of arrays:
 
 
CSV records can be converted to a plain sequence of arrays:
 
  
 
* The resulting value will be a map with a {{Code|records}} and an optional {{Code|names}} key.
 
* The resulting value will be a map with a {{Code|records}} and an optional {{Code|names}} key.
Line 32: Line 28:
 
* The column names will be available if {{Code|header}} option is set to {{Code|true}}.
 
* The column names will be available if {{Code|header}} option is set to {{Code|true}}.
  
Rows in the resulting records sequence can be accessed with positional predicates, and columns can be accessed via the
+
The CSV map can e.g. be accessed as follows:
  
* <code>$csv?records[2]</code> returns all entries of the second record (row)
+
* <code>$csv?records[5]</code> returns all entries of the 5th record (row)
* <code>$csv?records(2)</code> returns all entries of the second field (column)
+
* <code>$csv?records(2)</code> returns all entries of the 2nd field (column)
 +
* <code>$csv?names?*</code> returns the names of all fields (if available)
 +
* Return enumerated strings for all records:
 +
<syntaxhighlight lang="xquery">
 +
for $record at $pos in $csv?records
 +
return $pos || ". " || string-join($record?*, ', ')
 +
</syntaxhighlight>
  
The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing. Before {{Version|9.0}}, a {{Code|map}} format was available, which was now replaced with the more flexible and light-weight {{Code|xquery}} format.
+
The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.
  
 
==Options==
 
==Options==
Line 68: Line 70:
 
* With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes
 
* With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes
 
* With {{Code|xquery}} conversion, the input is converted to an XQuery map
 
* With {{Code|xquery}} conversion, the input is converted to an XQuery map
| {{Code|direct}}, {{Code|attributes}}, {{Code|map}}
+
| {{Code|direct}}, {{Code|attributes}}, {{Code|xquery}}
 
| {{Code|direct}}
 
| {{Code|direct}}
 
|
 
|
Line 81: Line 83:
 
| Specifies how quotes are parsed:
 
| Specifies how quotes are parsed:
 
* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
 
* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
* Serialization: If the option is enabled, the value will be wrapped with quotes. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option.
+
* Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option.
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|yes}}, {{Code|no}}
 
| {{Code|yes}}
 
| {{Code|yes}}
Line 102: Line 104:
 
|-
 
|-
 
| width='120' | '''Signatures'''
 
| width='120' | '''Signatures'''
|{{Func|csv:parse|$input as xs:string|document-node(element(csv))}}<br/>{{Func|csv:parse|$input as xs:string, $options as map(xs:string, item())|item()}}
+
|{{Func|csv:parse|$string as xs:string?|item()?}}<br/>{{Func|csv:parse|$string as xs:string?, $options as map(*)?|item()?}}
 
|-
 
|-
 
| '''Summary'''
 
| '''Summary'''
|Converts the CSV data specified by {{Code|$input}} to an XML document or a map. The {{Code|$options}} argument can be used to control the way the input is converted.
+
|Converts the CSV {{Code|$string}} to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.
 
|-
 
|-
 
| '''Errors'''
 
| '''Errors'''
|{{Error|BXCS0001|#Errors}} the input cannot be parsed.
+
|{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.
 +
|}
 +
 
 +
==csv:doc==
 +
 
 +
{| width='100%'
 +
|-
 +
| width='120' | '''Signatures'''
 +
|{{Func|csv:doc|$uri as xs:string?|item()?}}<br />{{Func|csv:doc|$uri as xs:string?, $options as map(*)?|item()?}}<br />
 +
|-
 +
| '''Summary'''
 +
|Fetches the CSV document referred to by the given {{Code|$uri}} and converts it to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.
 +
|-
 +
| '''Errors'''
 +
|{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.<br/>{{Error|options|#Errors}} the specified options are conflicting.
 
|}
 
|}
  
Line 116: Line 132:
 
|-
 
|-
 
| width='120' | '''Signatures'''
 
| width='120' | '''Signatures'''
|{{Func|csv:serialize|$input as node()|xs:string}}<br/>{{Func|csv:serialize|$input as node(), $options as map(xs:string, item())|xs:string}}
+
|{{Func|csv:serialize|$input as item()?|xs:string}}<br/>{{Func|csv:serialize|$input as item()?, $options as map(*)?|xs:string}}
 
|-
 
|-
 
| '''Summary'''
 
| '''Summary'''
|Serializes the node specified by {{Code|$input}} as CSV data, and returns the result as {{Code|xs:string}}. Items can also be serialized as JSON if the [[Serialization|Serialization Parameter]] {{Code|method}} is set to {{Code|csv}}.<br/>The {{Code|$options}} argument can be used to control the way the input is serialized.
+
|Serializes the specified {{Code|$input}} as CSV, using the specified {{Code|$options}}, and returns the result as string.
 +
Values can also be serialized as CSV with the standard [[Serialization]] feature of XQuery:
 +
* The parameter {{Code|method}} needs to be set to {{Code|csv}}, and
 +
* the options presented in this article need to be assigned to the {{Code|csv}} parameter.
 
|-
 
|-
 
| '''Errors'''
 
| '''Errors'''
|{{Error|BXCS0002|#Errors}} the input cannot be serialized.
+
|{{Error|serialize|#Errors}} the input cannot be serialized.
 
|}
 
|}
  
Line 130: Line 149:
  
 
'''Input''' {{Code|addressbook.csv}}:
 
'''Input''' {{Code|addressbook.csv}}:
<pre class="brush:xml">
+
<syntaxhighlight lang="xml">
 
Name,First Name,Address,City
 
Name,First Name,Address,City
 
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
 
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
</pre>
+
</syntaxhighlight>
  
 
'''Query:'''
 
'''Query:'''
<pre class="brush:xquery">
+
<syntaxhighlight lang="xquery">
 
let $text := file:read-text('addressbook.csv')
 
let $text := file:read-text('addressbook.csv')
 
return csv:parse($text, map { 'header': true() })
 
return csv:parse($text, map { 'header': true() })
</pre>
+
</syntaxhighlight>
  
 
'''Result:'''
 
'''Result:'''
<pre class="brush:xml">
+
<syntaxhighlight lang="xml">
 
<csv>
 
<csv>
 
   <record>
 
   <record>
Line 151: Line 170:
 
   </record>
 
   </record>
 
</csv>
 
</csv>
</pre>
+
</syntaxhighlight>
 
'''
 
'''
  
Line 157: Line 176:
  
 
'''Query:'''
 
'''Query:'''
<pre class="brush:xquery">
+
<syntaxhighlight lang="xquery">
 
let $options := map { 'lax': false() }
 
let $options := map { 'lax': false() }
 
let $input := file:read-text('some-data.csv')
 
let $input := file:read-text('some-data.csv')
 
let $output := $input => csv:parse($options) => csv:serialize($options)
 
let $output := $input => csv:parse($options) => csv:serialize($options)
 
return $input eq $output
 
return $input eq $output
</pre>
+
</syntaxhighlight>
  
 
'''Example 3:''' Converts CSV data to XQuery and returns distinct column values:
 
'''Example 3:''' Converts CSV data to XQuery and returns distinct column values:
  
 
'''Query:'''
 
'''Query:'''
<pre class="brush:xquery">
+
<syntaxhighlight lang="xquery">
 
let $text := ``[Name,City
 
let $text := ``[Name,City
 
Jack,Chicago
 
Jack,Chicago
Line 174: Line 193:
 
]``
 
]``
 
let $options := map { 'format': 'xquery', 'header': true() }
 
let $options := map { 'format': 'xquery', 'header': true() }
let $data := csv:parse($text, $options)
+
let $csv := csv:parse($text, $options)
 
return (
 
return (
 
   'Distinct values:',
 
   'Distinct values:',
   let $records := $data?records
+
   let $records := $csv('records')
   for $name at $pos in $data?names?*
+
   for $name at $pos in $csv('names')?*
 
   let $values := $records?($pos)
 
   let $values := $records?($pos)
 
   return (
 
   return (
Line 184: Line 203:
 
   )
 
   )
 
)
 
)
</pre>
+
</syntaxhighlight>
  
 
'''Result:'''
 
'''Result:'''
<pre class="brush:xquery">
+
<syntaxhighlight lang="xquery">
 
Distinct values:
 
Distinct values:
 
* Name: Jack, John
 
* Name: Jack, John
 
* City: Chicago, Washington, New York
 
* City: Chicago, Washington, New York
</pre>
+
</syntaxhighlight>
  
 
=Errors=
 
=Errors=
Line 199: Line 218:
 
|Description
 
|Description
 
|-
 
|-
|{{Code|BXCS0001}}
+
|{{Code|parse}}
 
| The input cannot be parsed.
 
| The input cannot be parsed.
 
|-
 
|-
|{{Code|BXCS0002}}
+
|{{Code|serialize}}
 
| The node cannot be serialized.
 
| The node cannot be serialized.
 
|}
 
|}
  
 
=Changelog=
 
=Changelog=
 +
 +
;Version 9.4
 +
* Added: [[#csv:doc|csv:doc]]
 +
 +
; Version 9.1
 +
* Updated: [[#csv:parse|csv:parse]] can be called with empty sequence.
  
 
;Version 9.0
 
;Version 9.0
Line 212: Line 237:
 
* Added: {{Code|xquery}} option
 
* Added: {{Code|xquery}} option
 
* Removed: {{Code|map}} option
 
* Removed: {{Code|map}} option
 +
* Updated: error codes updated; errors now use the module namespace
  
 
;Version 8.6
 
;Version 8.6

Latest revision as of 18:51, 18 November 2020

This XQuery Module contains a single function to parse CSV input. CSV (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.

Conventions[edit]

All functions and errors in this module are assigned to the http://basex.org/modules/csv namespace, which is statically bound to the csv prefix.

Conversion[edit]

XML: Direct, Attributes[edit]

If the direct or attributes format is chosen, a CSV string is converted to XML:

  • The resulting XML document has a csv root element.
  • Rows are represented via record elements.
  • Fields are represented via entry elements. The value of a field is represented as text node.
  • If the header option is set to true, the first text line is parsed as table header:
    • If format is set to direct, the field names are encoded, as described in the Conversion Module, and used as element names.
    • Otherwise, if format is attributes, the field names will be stored in name attributes.

A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.

XQuery[edit]

With the xquery format, CSV records are converted to a sequence of arrays:

  • The resulting value will be a map with a records and an optional names key.
  • Records are organized as a sequence of arrays. A single array contains the entries of a single record.
  • The column names will be available if header option is set to true.

The CSV map can e.g. be accessed as follows:

  • $csv?records[5] returns all entries of the 5th record (row)
  • $csv?records(2) returns all entries of the 2nd field (column)
  • $csv?names?* returns the names of all fields (if available)
  • Return enumerated strings for all records:
for $record at $pos in $csv?records
return $pos || ". " || string-join($record?*, ', ')

The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.

Options[edit]

In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel.

Option Description Allowed Default Excel
separator Defines the character which separates the values of a single record. comma, semicolon, colon, tab, space or a single character comma semicolon
header Indicates if the first line of the parsed or serialized CSV data is a table header. yes, no no
format Specifies the format of the XML data:
  • With direct conversion, field names are represented as element names
  • With attributes conversion, field names are stored in name attributes
  • With xquery conversion, the input is converted to an XQuery map
direct, attributes, xquery direct
lax Specifies if a lax approach is used to convert QNames to JSON names. yes, no yes no
quotes Specifies how quotes are parsed:
  • Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
  • Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the backslashes option.
yes, no yes yes
backslashes Specifies how quotes and other characters are escaped:
  • Parsing: If the option is enabled, \r, n and \t will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: \""). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless quotes is enabled and the quote is the first or last character of a value).
  • Serialization: If the option is enabled, \r, n, \t, " and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.
yes, no no no

Functions[edit]

csv:parse[edit]

Signatures csv:parse($string as xs:string?) as item()?
csv:parse($string as xs:string?, $options as map(*)?) as item()?
Summary Converts the CSV $string to an XQuery value. The $options argument can be used to control the way the input is converted.
Errors parse: the specified input cannot be parsed as CSV document.

csv:doc[edit]

Signatures csv:doc($uri as xs:string?) as item()?
csv:doc($uri as xs:string?, $options as map(*)?) as item()?
Summary Fetches the CSV document referred to by the given $uri and converts it to an XQuery value. The $options argument can be used to control the way the input is converted.
Errors parse: the specified input cannot be parsed as CSV document.
options: the specified options are conflicting.

csv:serialize[edit]

Signatures csv:serialize($input as item()?) as xs:string
csv:serialize($input as item()?, $options as map(*)?) as xs:string
Summary Serializes the specified $input as CSV, using the specified $options, and returns the result as string.

Values can also be serialized as CSV with the standard Serialization feature of XQuery:

  • The parameter method needs to be set to csv, and
  • the options presented in this article need to be assigned to the csv parameter.
Errors serialize: the input cannot be serialized.

Examples[edit]

Example 1: Converts CSV data to XML, interpreting the first row as table header:

Input addressbook.csv:

Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing

Query:

let $text := file:read-text('addressbook.csv')
return csv:parse($text, map { 'header': true() })

Result:

<csv>
  <record>
    <Name>Huber</Name>
    <First_Name>Sepp</First_Name>
    <Address>Hauptstraße 13</Address>
    <City>93547 Hintertupfing</City>
  </record>
</csv>

Example 2: Converts some CSV data to XML and back, and checks if the input and output are equal. The expected result is true:

Query:

let $options := map { 'lax': false() }
let $input := file:read-text('some-data.csv')
let $output := $input => csv:parse($options) => csv:serialize($options)
return $input eq $output

Example 3: Converts CSV data to XQuery and returns distinct column values:

Query:

let $text := ``[Name,City
Jack,Chicago
Jack,Washington
John,New York
]``
let $options := map { 'format': 'xquery', 'header': true() }
let $csv := csv:parse($text, $options)
return (
  'Distinct values:',
  let $records := $csv('records')
  for $name at $pos in $csv('names')?*
  let $values := $records?($pos)
  return (
    '* ' || $name || ': ' || string-join(distinct-values($values), ', ')
  )
)

Result:

Distinct values:
* Name: Jack, John
* City: Chicago, Washington, New York

Errors[edit]

Code Description
parse The input cannot be parsed.
serialize The node cannot be serialized.

Changelog[edit]

Version 9.4
Version 9.1
  • Updated: csv:parse can be called with empty sequence.
Version 9.0
  • Added: xquery option
  • Removed: map option
  • Updated: error codes updated; errors now use the module namespace
Version 8.6
  • Updated: Options: improved Excel compatibility
Version 8.0
  • Added: backslashes option
Version 7.8
  • Updated: csv:parse now returns a document node instead of an element, or an XQuery map if format is set to map.
  • Added: format and lax options

The module was introduced with Version 7.7.2.