Main Page » XQuery » Functions » CSV Functions

CSV Functions

This module contains a single function to parse CSV input. CSV (comma-separated values) is a popular format for tabular data, exported e.g. from Excel.

Conventions

All functions and errors in this module are assigned to the http://basex.org/modules/csv namespace, which is statically bound to the csv prefix.

Conversion Formats

XML: Direct, Attributes, W3-XML

If the direct or attributes format is chosen, a CSV string is converted to XML:

The resulting XML document has a csv root element.
Rows are represented via record elements.
Fields are represented via entry elements. The value of a field is represented as text node.
If the header option is set to true, the first text line is parsed as table header:
- If format is set to direct, the field names are encoded, as described in the Conversion Functions, and used as element names.
- Otherwise, if format is attributes, the field names will be stored in name attributes.

A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.

If the w3-xml format is chosen, a CSV string is converted to XML according to the rules of function fn:csv-to-xml. More details can be found there. This also applies to the interpretation of the options.

Map: W3

Updated: Revised (old name: xquery): Renamed, more options, header represented as sequence.

With the w3 format, CSV records are converted to a sequence of arrays:

The resulting value will be a map with a rows entry and an optional columns entry.
Rows are organized as a sequence of arrays. An array contains the entries of a single record.
The columns entry contains a sequence with the column names. It is generated if the header option is set to true.

The CSV map can e.g. be accessed as follows:

$csv?rows[5] returns all entries of the 5th record (row)
$csv?rows?2 returns all entries of the 2nd field (column)
$csv?columns returns the names of all fields (if available)
Return enumerated strings for all rows:

for $row at $pos in $csv?rows
return $pos || ". " || string-join($row?*, ', ')

The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.

With the w3 format, CSV data is converted to a map, containing a sequence of arrays, according to the rules of function fn:parse-csv. More details can be found there. This also applies to the interpretation of the options.

Arrays: W3-Arrays

With the w3-arrays format, CSV rows are converted to a sequence of arrays, according to the rules of function fn:csv-to-arrays. More details can be found there. This also applies to the interpretation of the options.

Options

In the following table, all available options are listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office.

Option	Description	Allowed	Default	Direction	Excel
`field-delimiter`	Defines the character which separates the values of a single record. Can also still be used with its old name `separator`.	`comma`, `semicolon`, `colon`, `tab`, `space` or a single character	`comma`	parse, serialize	`semicolon` or `comma`, depending on the region
`row-delimiter`	Defines the character which separates fields.	a single character	newline (` `)	parse, serialize	newline (` `)
`quote-character`	Defines the character which is used to quote field values.	a single character	`"`	parse, serialize	`"`
`header`	Indicates if the first line of the parsed or serialized CSV data is a table header, or provides explicit column names.	`yes`, `no` or a sequence of strings	`no`	parse, serialize
`format`	Specifies the format for converting CSV data (see above).	`direct`, `attributes`, `w3`, `w3-arrays`, `w3-xml`	`direct`	parse, serialize
`lax`	Specifies if lax conversion rules are used to convert QNames to JSON names.	`yes`, `no`	`yes`	parse, serialize	`no`
`quotes`	Specifies how quotes are parsed: Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the `backslashes` option.	`yes`, `no`	`yes`	parse, serialize	`yes`
`strict-quoting`	If enabled, quotes (unless doubled, to represent a quote inside a quoted field) are allowed only at the very beginning and end of a field. Then, not even whitespace is permitted between a quote and a field or row delimiter.	`yes`, `no`	`no`	parse	`no`
`backslashes`	Specifies how quotes and other characters are escaped: Parsing: If the option is enabled, `\r`, `n` and `\t` will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: `\"` → `"`). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless `quotes` is enabled and the quote is the first or last character of a value). Serialization: If the option is enabled, `\r`, `n`, `\t`, `"` and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.	`yes`, `no`	`no`	parse, serialize	`no`
`skip-empty`	Save memory by skipping entries in the result that would contain an empty string. Only gets effective for the formats `direct` or `attribute`, and if the `header` option is enabled. Note that skipping entries may cause columns to be lost in the serialised CSV output.	`yes`, `no`	`no`	parse	`no`
`trim-whitespace`	Determines whether leading and trailing whitespace is removed from the content of unquoted fields.	`yes`, `no`	`no`	parse	`no`
`trim-rows`	Determines whether all rows should be adjusted to contain the same number of fields. This option is ignored if `select-columns` is specified. If enabled, the number of fields in the first row (whether this be a header or a data row) determines the number of fields in every subsequent row; to achieve this, excess fields are removed, or additional zero-length fields are added.	`yes`, `no`	`no`	parse	`no`
`select-columns`	A sequence of integers indicating which columns to include and in which order. If this option is absent or empty, all columns are returned in their original order. An integer in the sequence is treated as the 1-based index of the column to include. Any other columns are dropped. If a particular row includes no field at the specified index, an empty field is included at the relevant position in the result. If an integer appears more than once then the result will include duplicated columns. During serialization, the integers indicate the column numbers of the target columns, where empty columns are inserted as necessary.	a sequence of positive integers	`()`	parse, serialize	`()`

Functions

`csv:doc`

Signature

csv:doc(
  $source   as xs:string?,
  $options  as map(*)?     := {}
) as item()?

Summary Fetches the CSV document referred to by the given $source and converts it to an XDM value. The $options argument can be used to control the way the input is converted.

Errors

`options`
`parse`	The input cannot be parsed.

`csv:parse`

Signature

csv:parse(
  $value    as xs:string?,
  $options  as map(*)?     := {}
) as item()?

Summary Converts the CSV $value to an XDM value. The $options argument can be used to control the way the input is converted.

Errors

parse The input cannot be parsed.

`csv:serialize`

Signature

csv:serialize(
  $input    as item()?,
  $options  as map(*)?  := {}
) as xs:string

Summary

Serializes the specified $input as CSV, using the specified $options, and returns the result as string. Values can also be serialized as CSV with the standard Serialization feature of XQuery:

The parameter method needs to be set to csv, and
the options presented in this article need to be assigned to the csv parameter.

Errors

serialize The node cannot be serialized.

Examples

Example 1: Converts CSV data to XML, interpreting the first row as table header:

Input addressbook.csv

Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing

Query

let $text := file:read-text('addressbook.csv')
return csv:parse($text, { 'header': true() })

Result

<csv>
  <record>
    <Name>Huber</Name>
    <First_Name>Sepp</First_Name>
    <Address>Hauptstraße 13</Address>
    <City>93547 Hintertupfing</City>
  </record>
</csv>

Example 2: Convert some CSV data to XML and back, and check if the input and output are equal. The expected result is true:

Query

let $options := { 'lax': false() }
let $input := file:read-text('some-data.csv')
let $output := $input => csv:parse($options) => csv:serialize($options)
return $input eq $output

Example 3: Convert CSV data to XQuery and returns distinct column values:

Query

let $text := ``[Name,City
Jack,Chicago
Jack,Washington
John,New York
]``
let $options := { 'format': 'w3', 'header': true() }
let $csv := csv:parse($text, $options)
return (
  'Distinct values:',
  let $rows := $csv?rows
  for $column at $pos in $csv?columns
  let $values := $rows?$pos
  return (
    '* ' || $column || ': ' || string-join(distinct-values($values), ', ')
  )
)

Result

Distinct values:
* Name: Jack, John
* City: Chicago, Washington, New York

Errors

Code	Description
`parse`	The input cannot be parsed.
`serialize`	The node cannot be serialized.

Changelog

Version 12.0

Added: Options: row-delimiter, quote-character, trim-whitespace, trim-rows, strict-quoting, select-columns options.
Updated: W3 Conversion Format: Revised (old name: xquery): Renamed, more options, header represented as sequence.
Updated: Options: separator renamed to field-delimiter.

Version 11.0

Added: Options: skip-empty option.
Removed: Options: allow option.

Version 9.7

Added: Options: allow option.

Version 9.4

Added: csv:doc

Version 9.1

Updated: csv:parse can be called with empty sequence.

Version 9.0

Added: xquery option
Updated: error codes updated; errors now use the module namespace
Removed: map option

Version 8.6

Updated: Options: improved Excel compatibility

Version 8.0

Added: backslashes option

Version 7.8

Added: format and lax options
Updated: csv:parse now returns a document node instead of an element, or an XQuery map if format is set to map.

Version 7.7.2

Added: New module added.

⚡Generated with XQuery