Latest revision as of 18:39, 1 December 2023

This XQuery Module contains a single function to parse CSV input. CSV (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.

Conventions[edit]

All functions and errors in this module are assigned to the http://basex.org/modules/csv namespace, which is statically bound to the csv prefix.

Conversion Formats[edit]

XML: Direct, Attributes[edit]

If the direct or attributes format is chosen, a CSV string is converted to XML:

The resulting XML document has a csv root element.
Rows are represented via record elements.
Fields are represented via entry elements. The value of a field is represented as text node.
If the header option is set to true, the first text line is parsed as table header:
- If format is set to direct, the field names are encoded, as described in the Conversion Module, and used as element names.
- Otherwise, if format is attributes, the field names will be stored in name attributes.

A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.

XQuery[edit]

With the xquery format, CSV records are converted to a sequence of arrays:

The resulting value will be a map with a records entry and an optional names entry.
Records are organized as a sequence of arrays. An array contains the entries of a single record.
The names entry contains an array with the column names. It is generated if the header option is set to true.

The CSV map can e.g. be accessed as follows:

$csv?records[5] returns all entries of the 5th record (row)
$csv?records?(2) returns all entries of the 2nd field (column)
$csv?names?* returns the names of all fields (if available)
Return enumerated strings for all records:

for $record at $pos in $csv?records
return $pos || ". " || string-join($record?*, ', ')

The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.

Options[edit]

In the following table, all available options are listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office.

Option	Description	Allowed	Default	Direction	Excel
`separator`	Defines the character which separates the values of a single record.	`comma`, `semicolon`, `colon`, `tab`, `space` or a single character	`comma`	parse, serialize	`semicolon` or `comma`, depending on the region
`header`	Indicates if the first line of the parsed or serialized CSV data is a table header.	`yes`, `no`	`no`	parse, serialize
`format`	Specifies the format for converting CSV data (see above).	`direct`, `attributes`, `xquery`	`direct`	parse, serialize
`lax`	Specifies if lax conversion rules are used to convert QNames to JSON names.	`yes`, `no`	`yes`	parse, serialize	`no`
`quotes`	Specifies how quotes are parsed: Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the `backslashes` option.	`yes`, `no`	`yes`	parse, serialize	`yes`
`backslashes`	Specifies how quotes and other characters are escaped: Parsing: If the option is enabled, `\r`, `n` and `\t` will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: `\"` → `"`). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless `quotes` is enabled and the quote is the first or last character of a value). Serialization: If the option is enabled, `\r`, `n`, `\t`, `"` and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.	`yes`, `no`	`no`	parse, serialize	`no`
`skip-empty`	Version 11: Indicates if empty fields are included in the result. Only gets effective for the formats `direct` or `attribute`, and if the `header` option is enabled. Please note that if this option is used and the data is serialized again, the resulting CSV header may be incomplete.	`yes`, `no`	`no`	parse	`no`
`allow`	In Excel, a value will be evaluated if it starts with the character `-`, `+`, `=`, `@`, `\t` or `\r`. A regular expression can be specified to reject data that will be handled differently than expected by an application, or that may be malicious (see https://owasp.org/www-community/attacks/CSV_Injection for more details).	string		serialize	`[^-+=\t\r].\|[-+]\d([,.]\d+)?`

Functions[edit]

csv:doc[edit]

Signature	csv:doc( $href as xs:string?, $options as map(*)? := map { } ) as item()?
Summary	Fetches the CSV document referred to by the given `$href` and converts it to an XQuery value. The `$options` argument can be used to control the way the input is converted.
Errors	`parse`: the specified input cannot be parsed as CSV document. `options`: the specified options are conflicting.

csv:parse[edit]

Signature	csv:parse( $value as xs:string?, $options as map(*)? := map { } ) as item()?
Summary	Converts the CSV `$value` to an XQuery value. The `$options` argument can be used to control the way the input is converted.
Errors	`parse`: the specified input cannot be parsed as CSV document.

csv:serialize[edit]

Signature

csv:serialize(
  $input    as item()?,
  $options  as map(*)?  := map { }
) as xs:string

Summary

Serializes the specified $input as CSV, using the specified $options, and returns the result as string.

Values can also be serialized as CSV with the standard Serialization feature of XQuery:

The parameter method needs to be set to csv, and
the options presented in this article need to be assigned to the csv parameter.

Errors serialize: the input cannot be serialized.

Examples[edit]

Example 1: Converts CSV data to XML, interpreting the first row as table header:

Input addressbook.csv:

Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing

Query:

let $text := file:read-text('addressbook.csv')
return csv:parse($text, map { 'header': true() })

Result:

<csv>
  <record>
    <Name>Huber</Name>
    <First_Name>Sepp</First_Name>
    <Address>Hauptstraße 13</Address>
    <City>93547 Hintertupfing</City>
  </record>
</csv>

Example 2: Converts some CSV data to XML and back, and checks if the input and output are equal. The expected result is true:

Query:

let $options := map { 'lax': false() }
let $input := file:read-text('some-data.csv')
let $output := $input => csv:parse($options) => csv:serialize($options)
return $input eq $output

Example 3: Converts CSV data to XQuery and returns distinct column values:

Query:

let $text := ``[Name,City
Jack,Chicago
Jack,Washington
John,New York
]``
let $options := map { 'format': 'xquery', 'header': true() }
let $csv := csv:parse($text, $options)
return (
  'Distinct values:',
  let $records := $csv('records')
  for $name at $pos in $csv('names')?*
  let $values := $records?($pos)
  return (
    '* ' || $name || ': ' || string-join(distinct-values($values), ', ')
  )
)

Result:

Distinct values:
* Name: Jack, John
* City: Chicago, Washington, New York

Errors[edit]

Code	Description
`parse`	The input cannot be parsed.
`serialize`	The node cannot be serialized.

Changelog[edit]

Version 11

Added: Options: skip-empty option.

Version 9.7

Added: Options: allow option.

Version 9.4

Added: csv:doc

Version 9.1

Updated: csv:parse can be called with empty sequence.

Version 9.0

Added: xquery option
Removed: map option
Updated: error codes updated; errors now use the module namespace

Version 8.6

Updated: Options: improved Excel compatibility

Version 8.0

Added: backslashes option

Version 7.8

Updated: csv:parse now returns a document node instead of an element, or an XQuery map if format is set to map.
Added: format and lax options

The module was introduced with Version 7.7.2.

Difference between revisions of "CSV Module"

Latest revision as of 18:39, 1 December 2023

Contents

Conventions[edit]

Conversion Formats[edit]

XML: Direct, Attributes[edit]

XQuery[edit]

Options[edit]

Functions[edit]

csv:doc[edit]

csv:parse[edit]

csv:serialize[edit]

Examples[edit]

Errors[edit]

Changelog[edit]

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

@@ Line 1: / Line 1: @@
-This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [http://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
+This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [https://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
 =Conventions=
-All functions in this module are assigned to the {{Code|http://basex.org/modules/csv}} namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
+All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/>
-All errors are assigned to the {{Code|http://basex.org/errors}} namespace, which is statically bound to the {{Code|bxerr}} prefix.
-=Rules=
+==Conversion Formats==
-CSV is converted to XML as follows:
+===XML: Direct, Attributes===
-# The resulting XML document has a {{Code|<csv/>}} root elements.
+If the {{Code|direct}} or {{Code|attributes}} format is chosen, a CSV string is converted to XML:
-# Rows are represented via {{Code|<record/>}} elements.
-# Fields are represented via {{Code|<entry/>}} elements. The value of a field is represented as text node.
-# If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header, and the {{Code|entry}} elements are replaced with the field names:
-## Empty names are represented by a single underscore ({{Code|_}}), and characters that are not valid in element names are replaced with underscores.
-## If the {{Code|lax}} option is set to {{Code|false}}, invalid characters will be rewritten to an underscore and the character’s four-digit Unicode, and underscores will be represented as two underscores ({{Code|__}}). The resulting element names may be less readable, but can always be converted back to the original field names.
-# If {{Code|format}} is set to {{Code|attributes}}, field names will be stored in name attributes.
-In the Database Creation dialog of the GUI, when the CSV parser is selected, the ''Parsing'' tab demonstrates the conversion of CSV to XML and the effects of the single conversion options.
+* The resulting XML document has a {{Code|csv}} root element.
+* Rows are represented via {{Code|record}} elements.
+* Fields are represented via {{Code|entry}} elements. The value of a field is represented as text node.
+* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header:
+** If {{Code|format}} is set to {{Code|direct}}, the field names are encoded, as described in the [[Conversion Module#Keys|Conversion Module]], and used as element names.
+** Otherwise, if {{Code|format}} is {{Code|attributes}}, the field names will be stored in name attributes.
+'''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options.
+===XQuery===
+With the {{Code|xquery}} format, CSV records are converted to a sequence of arrays:
+* The resulting value will be a map with a {{Code|records}} entry and an optional {{Code|names}} entry.
+* Records are organized as a sequence of arrays. An array contains the entries of a single record.
+* The {{Code|names}} entry contains an array with the column names. It is generated if the {{Code|header}} option is set to {{Code|true}}.
+The CSV map can e.g. be accessed as follows:
+* <code>$csv?records[5]</code> returns all entries of the 5th record (row)
+* <code>$csv?records?(2)</code> returns all entries of the 2nd field (column)
+* <code>$csv?names?*</code> returns the names of all fields (if available)
+* Return enumerated strings for all records:
+<pre lang='xquery'>
+for $record at $pos in $csv?records
+return $pos || ". " || string-join($record?*, ', ')
+</pre>
+The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.
 ==Options==
-The following options are available:
+In the following table, all available options are listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office.
 {| class="wikitable sortable" width="100%"
 |- valign="top"
-! width="140" | Option
+! Option
-! width="50%" | Description
+! Description
 ! Allowed
 ! Default
+! Direction
+! Excel
 |- valign="top"
 | {{Code|separator}}
-| Defines the character which separates the entries of a record in a single line.
+| Defines the character which separates the values of a single record.
 | {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character''
 | {{Code|comma}}
+| ''parse'', ''serialize''
+| {{Code|semicolon}} or {{Code|comma}}, depending on the region
 |- valign="top"
 | {{Code|header}}
@@ Line 40: / Line 65: @@
 | {{Code|yes}}, {{Code|no}}
 | {{Code|no}}
+| ''parse'', ''serialize''
+|
 |- valign="top"
 | {{Code|format}}
-| Specifies the format of the XML data. The format is only relevant if the {{Code|header}} option is activated:<br/>
+| Specifies the format for converting CSV data ([[#Conversion Formats|see above]]).
-* With {{Code|direct}} conversion, field names are represented as element names
+| {{Code|direct}}, {{Code|attributes}}, {{Code|xquery}}
-* With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes
-| {{Code|direct}}, {{Code|attributes}}
 | {{Code|direct}}
+| ''parse'', ''serialize''
+|
 |- valign="top"
 | {{Code|lax}}
-| Specifies if a lax approach is used to convert QNames to JSON names.
+| Specifies if [[Conversion Module#Keys|lax conversion rules]] are used to convert QNames to JSON names.
+| {{Code|yes}}, {{Code|no}}
+| {{Code|yes}}
+| ''parse'', ''serialize''
+| {{Code|no}}
+|- valign="top"
+| {{Code|quotes}}
+| Specifies how quotes are parsed:
+* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
+* Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option.
 | {{Code|yes}}, {{Code|no}}
 | {{Code|yes}}
+| ''parse'', ''serialize''
+| {{Code|yes}}
+|- valign="top"
+| {{Code|backslashes}}
+| Specifies how quotes and other characters are escaped:
+* Parsing: If the option is enabled, {{Code|\r}}, {{Code|n}} and {{Code|\t}} will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: {{Code|\"}} → {{Code|"}}). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless {{Code|quotes}} is enabled and the quote is the first or last character of a value).
+* Serialization: If the option is enabled, {{Code|\r}}, {{Code|n}}, {{Code|\t}}, {{Code|"}} and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.
+| {{Code|yes}}, {{Code|no}}
+| {{Code|no}}
+| ''parse'', ''serialize''
+| {{Code|no}}
+|- valign="top"
+| {{Code|skip-empty}}
+| {{Announce|Version 11:}} Indicates if empty fields are included in the result. Only gets effective for the formats {{Code|direct}} or {{Code|attribute}}, and if the {{Code|header}} option is enabled. Please note that if this option is used and the data is serialized again, the resulting CSV header may be incomplete.
+| {{Code|yes}}, {{Code|no}}
+| {{Code|no}}
+| ''parse''
+| {{Code|no}}
+|- valign="top"
+| {{Code|allow}}
+| In Excel, a value will be evaluated if it starts with the character {{Code|-}}, {{Code|+}}, <code>=</code>, {{Code|@}}, {{Code|\t}} or {{Code|\r}}. A regular expression can be specified to reject data that will be handled differently than expected by an application, or that may be malicious (see https://owasp.org/www-community/attacks/CSV_Injection for more details).
+| ''string''
+|
+| ''serialize''
+| <code><nowiki>[^-+=\t\r].*|[-+]\d*([,.]\d+)?</nowiki></code>
 |}
-Options can either be specified<br />
+=Functions=
-* as children of an {{Code|<csv:options/>}} element; e.g.:
-<pre class="brush:xml">
+==csv:doc==
-<csv:options>
-  <csv:separator value=';'/>
-  ...
-</csv:options>
-</pre>
-* or as map, which contains all key/value pairs:
-<pre class="brush:xquery">
-{ 'separator': ';', ... }
-</pre>
-=Functions=
+{| width='100%'
+|- valign="top"
+| width='120' | '''Signature'''
+|<pre>csv:doc(
+  $href     as xs:string?,
+  $options  as map(*)?     := map { }
+) as item()?</pre>
+|- valign="top"
+| '''Summary'''
+|Fetches the CSV document referred to by the given {{Code|$href}} and converts it to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.
+|- valign="top"
+| '''Errors'''
+|{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.<br/>{{Error|options|#Errors}} the specified options are conflicting.
+|}
 ==csv:parse==
-{{Version|7.8}}: the return type has been changed from {{Code|element(<csv>)}} to {{Code|document-node(element(<csv>))}}, and the {{Code|format}} and {{Code|lax}} options have been added.
 {| width='100%'
-|-
+|- valign="top"
-| width='120' | '''Signatures'''
+| width='120' | '''Signature'''
-|{{Func|csv:parse|$input as xs:string|document-node(element(csv))}}<br/>{{Func|csv:parse|$input as xs:string, $options as item()|document-node(element(csv))}}
+|<pre>csv:parse(
-|-
+  $value    as xs:string?,
+  $options  as map(*)?     := map { }
+) as item()?</pre>
+|- valign="top"
 | '''Summary'''
-|Converts the CSV data specified by {{Code|$input}} to XML, and returns the result as {{Code|<csv/>}} value. The {{Code|$options}} argument can be used to control the way the input is converted.
+|Converts the CSV {{Code|$value}} to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.
-|-
+|- valign="top"
 | '''Errors'''
-|{{Error|BXCS0001|#Errors}} the input cannot be parsed.
+|{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.
 |}
 ==csv:serialize==
 {| width='100%'
-|-
+|- valign="top"
-| width='120' | '''Signatures'''
+| width='120' | '''Signature'''
-|{{Func|csv:serialize|$input as node(), $options as item()|xs:string}}
+|<pre>csv:serialize(
-|-
+  $input    as item()?,
+  $options  as map(*)?  := map { }
+) as xs:string</pre>
+|- valign="top"
 | '''Summary'''
-|Serializes the node specified by {{Code|$input}} as CSV data, and returns the result as {{Code|xs:string}}.<br />XML documents can also be serialized as CSV if the [[Serialization|Serialization Option]] {{Code|method}} is set to {{Code|csv}}.<br/>With the {{Code|$options}} argument, the way the node is serialized can be controlled.
+|Serializes the specified {{Code|$input}} as CSV, using the specified {{Code|$options}}, and returns the result as string.
-|-
+Values can also be serialized as CSV with the standard [[Serialization]] feature of XQuery:
+* The parameter {{Code|method}} needs to be set to {{Code|csv}}, and
+* the options presented in this article need to be assigned to the {{Code|csv}} parameter.
+|- valign="top"
 | '''Errors'''
-|{{Error|BXCS0002|#Errors}} the input cannot be serialized.
+|{{Error|serialize|#Errors}} the input cannot be serialized.
 |}
@@ Line 103: / Line 176: @@
 '''Input''' {{Code|addressbook.csv}}:
-<pre class="brush:xml">
+<pre lang="xml">
 Name,First Name,Address,City
 Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
@@ Line 109: / Line 182: @@
 '''Query:'''
-<pre class="brush:xquery">
+<pre lang='xquery'>
 let $text := file:read-text('addressbook.csv')
-return csv:parse($text, { 'header': 'true' })
+return csv:parse($text, map { 'header': true() })
 </pre>
 '''Result:'''
-<pre class="brush:xml">
+<pre lang="xml">
 <csv>
    <record>
@@ Line 130: / Line 203: @@
 '''Query:'''
-<pre class="brush:xquery">
+<pre lang='xquery'>
-let $text := file:read-text('some-data.csv')
+let $options := map { 'lax': false() }
-let $options := { 'lax': 'no' }
+let $input := file:read-text('some-data.csv')
-let $xml := csv:parse($text, $options)
+let $output := $input => csv:parse($options) => csv:serialize($options)
-let $csv := csv:serialize($xml, $options)
+return $input eq $output
-return $text eq $csv
+</pre>
+'''Example 3:''' Converts CSV data to XQuery and returns distinct column values:
+'''Query:'''
+<pre lang='xquery'>
+let $text := ``[Name,City
+Jack,Chicago
+Jack,Washington
+John,New York
+]``
+let $options := map { 'format': 'xquery', 'header': true() }
+let $csv := csv:parse($text, $options)
+return (
+  'Distinct values:',
+  let $records := $csv('records')
+  for $name at $pos in $csv('names')?*
+  let $values := $records?($pos)
+  return (
+    '* ' || $name || ': ' || string-join(distinct-values($values), ', ')
+  )
+)
+</pre>
+'''Result:'''
+<pre lang='xquery'>
+Distinct values:
+* Name: Jack, John
+* City: Chicago, Washington, New York
 </pre>
@@ Line 143: / Line 244: @@
 ! width="110"|Code
 |Description
-|-
+|- valign="top"
-|{{Code|BXCS0001}}
+|{{Code|parse}}
 | The input cannot be parsed.
-|-
+|- valign="top"
-|{{Code|BXCS0002}}
+|{{Code|serialize}}
 | The node cannot be serialized.
 |}
 =Changelog=
+;Version 11
+* Added: [[#Options|Options]]: <code>skip-empty</code> option.
+;Version 9.7
+* Added: [[#Options|Options]]: <code>allow</code> option.
+;Version 9.4
+* Added: {{Function||csv:doc}}
+; Version 9.1
+* Updated: {{Function||csv:parse}} can be called with empty sequence.
+;Version 9.0
+* Added: {{Code|xquery}} option
+* Removed: {{Code|map}} option
+* Updated: error codes updated; errors now use the module namespace
+;Version 8.6
+* Updated: [[#Options|Options]]: improved Excel compatibility
+;Version 8.0
+* Added: {{Code|backslashes}} option
 ;Version 7.8
-* Updated: return type of [[#csv:parse|csv:parse]] changed from {{Code|element(<csv>)}} to {{Code|document-node(element(<csv>))}}
+* Updated: {{Function||csv:parse}} now returns a document node instead of an element, or an XQuery map if {{Code|format}} is set to {{Code|map}}.
 * Added: {{Code|format}} and {{Code|lax}} options
 The module was introduced with Version 7.7.2.
-[[Category:XQuery]]