Difference between revisions of "CSV Module"
(→Rules) |
m (Text replacement - "syntaxhighlight" to "pre") Tags: Mobile web edit Mobile edit |
||
(126 intermediate revisions by 5 users not shown) | |||
Line 1: | Line 1: | ||
− | This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [ | + | This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [https://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel. |
=Conventions= | =Conventions= | ||
− | All functions in this module are assigned to the | + | All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/> |
− | |||
− | = | + | ==Conversion Formats== |
− | + | ===XML: Direct, Attributes=== | |
− | + | If the {{Code|direct}} or {{Code|attributes}} format is chosen, a CSV string is converted to XML: | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | * The resulting XML document has a {{Code|csv}} root element. | |
+ | * Rows are represented via {{Code|record}} elements. | ||
+ | * Fields are represented via {{Code|entry}} elements. The value of a field is represented as text node. | ||
+ | * If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header: | ||
+ | ** If {{Code|format}} is set to {{Code|direct}}, the field names are encoded, as described in the [[Conversion Module#Keys|Conversion Module]], and used as element names. | ||
+ | ** Otherwise, if {{Code|format}} is {{Code|attributes}}, the field names will be stored in name attributes. | ||
+ | |||
+ | '''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options. | ||
+ | |||
+ | ===XQuery=== | ||
+ | |||
+ | With the {{Code|xquery}} format, CSV records are converted to a sequence of arrays: | ||
+ | |||
+ | * The resulting value will be a map with a {{Code|records}} entry and an optional {{Code|names}} entry. | ||
+ | * Records are organized as a sequence of arrays. An array contains the entries of a single record. | ||
+ | * The {{Code|names}} entry contains an array with the column names. It is generated if the {{Code|header}} option is set to {{Code|true}}. | ||
+ | |||
+ | The CSV map can e.g. be accessed as follows: | ||
+ | |||
+ | * <code>$csv?records[5]</code> returns all entries of the 5th record (row) | ||
+ | * <code>$csv?records?(2)</code> returns all entries of the 2nd field (column) | ||
+ | * <code>$csv?names?*</code> returns the names of all fields (if available) | ||
+ | * Return enumerated strings for all records: | ||
+ | <pre lang='xquery'> | ||
+ | for $record at $pos in $csv?records | ||
+ | return $pos || ". " || string-join($record?*, ', ') | ||
+ | </pre> | ||
+ | |||
+ | The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing. | ||
==Options== | ==Options== | ||
− | + | In the following table, all available options are listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office. | |
{| class="wikitable sortable" width="100%" | {| class="wikitable sortable" width="100%" | ||
|- valign="top" | |- valign="top" | ||
− | ! | + | ! Option |
− | ! | + | ! Description |
! Allowed | ! Allowed | ||
! Default | ! Default | ||
+ | ! Direction | ||
+ | ! Excel | ||
|- valign="top" | |- valign="top" | ||
| {{Code|separator}} | | {{Code|separator}} | ||
− | | Defines the character which separates the | + | | Defines the character which separates the values of a single record. |
| {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character'' | | {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character'' | ||
| {{Code|comma}} | | {{Code|comma}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | {{Code|semicolon}} or {{Code|comma}}, depending on the region | ||
|- valign="top" | |- valign="top" | ||
| {{Code|header}} | | {{Code|header}} | ||
Line 40: | Line 65: | ||
| {{Code|yes}}, {{Code|no}} | | {{Code|yes}}, {{Code|no}} | ||
| {{Code|no}} | | {{Code|no}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | | ||
|- valign="top" | |- valign="top" | ||
| {{Code|format}} | | {{Code|format}} | ||
− | | Specifies the format | + | | Specifies the format for converting CSV data ([[#Conversion Formats|see above]]). |
− | + | | {{Code|direct}}, {{Code|attributes}}, {{Code|xquery}} | |
− | |||
− | |||
| {{Code|direct}} | | {{Code|direct}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | | ||
|- valign="top" | |- valign="top" | ||
| {{Code|lax}} | | {{Code|lax}} | ||
− | | Specifies if | + | | Specifies if [[Conversion Module#Keys|lax conversion rules]] are used to convert QNames to JSON names. |
+ | | {{Code|yes}}, {{Code|no}} | ||
+ | | {{Code|yes}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | {{Code|no}} | ||
+ | |- valign="top" | ||
+ | | {{Code|quotes}} | ||
+ | | Specifies how quotes are parsed: | ||
+ | * Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. | ||
+ | * Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option. | ||
| {{Code|yes}}, {{Code|no}} | | {{Code|yes}}, {{Code|no}} | ||
| {{Code|yes}} | | {{Code|yes}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | {{Code|yes}} | ||
+ | |- valign="top" | ||
+ | | {{Code|backslashes}} | ||
+ | | Specifies how quotes and other characters are escaped: | ||
+ | * Parsing: If the option is enabled, {{Code|\r}}, {{Code|n}} and {{Code|\t}} will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: {{Code|\"}} → {{Code|"}}). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless {{Code|quotes}} is enabled and the quote is the first or last character of a value). | ||
+ | * Serialization: If the option is enabled, {{Code|\r}}, {{Code|n}}, {{Code|\t}}, {{Code|"}} and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated. | ||
+ | | {{Code|yes}}, {{Code|no}} | ||
+ | | {{Code|no}} | ||
+ | | ''parse'', ''serialize'' | ||
+ | | {{Code|no}} | ||
+ | |- valign="top" | ||
+ | | {{Code|skip-empty}} | ||
+ | | {{Announce|Version 11:}} Indicates if empty fields are included in the result. Only gets effective for the formats {{Code|direct}} or {{Code|attribute}}, and if the {{Code|header}} option is enabled. Please note that if this option is used and the data is serialized again, the resulting CSV header may be incomplete. | ||
+ | | {{Code|yes}}, {{Code|no}} | ||
+ | | {{Code|no}} | ||
+ | | ''parse'' | ||
+ | | {{Code|no}} | ||
+ | |- valign="top" | ||
+ | | {{Code|allow}} | ||
+ | | In Excel, a value will be evaluated if it starts with the character {{Code|-}}, {{Code|+}}, <code>=</code>, {{Code|@}}, {{Code|\t}} or {{Code|\r}}. A regular expression can be specified to reject data that will be handled differently than expected by an application, or that may be malicious (see https://owasp.org/www-community/attacks/CSV_Injection for more details). | ||
+ | | ''string'' | ||
+ | | | ||
+ | | ''serialize'' | ||
+ | | <code><nowiki>[^-+=\t\r].*|[-+]\d*([,.]\d+)?</nowiki></code> | ||
|} | |} | ||
− | + | =Functions= | |
− | + | ||
− | + | ==csv:doc== | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | = | + | {| width='100%' |
+ | |- valign="top" | ||
+ | | width='120' | '''Signature''' | ||
+ | |<pre>csv:doc( | ||
+ | $href as xs:string?, | ||
+ | $options as map(*)? := map { } | ||
+ | ) as item()?</pre> | ||
+ | |- valign="top" | ||
+ | | '''Summary''' | ||
+ | |Fetches the CSV document referred to by the given {{Code|$href}} and converts it to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted. | ||
+ | |- valign="top" | ||
+ | | '''Errors''' | ||
+ | |{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.<br/>{{Error|options|#Errors}} the specified options are conflicting. | ||
+ | |} | ||
==csv:parse== | ==csv:parse== | ||
− | |||
− | |||
{| width='100%' | {| width='100%' | ||
− | |- | + | |- valign="top" |
− | | width='120' | ''' | + | | width='120' | '''Signature''' |
− | | | + | |<pre>csv:parse( |
− | |- | + | $value as xs:string?, |
+ | $options as map(*)? := map { } | ||
+ | ) as item()?</pre> | ||
+ | |- valign="top" | ||
| '''Summary''' | | '''Summary''' | ||
− | |Converts the CSV | + | |Converts the CSV {{Code|$value}} to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted. |
− | |- | + | |- valign="top" |
| '''Errors''' | | '''Errors''' | ||
− | |{{Error| | + | |{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document. |
|} | |} | ||
==csv:serialize== | ==csv:serialize== | ||
+ | |||
{| width='100%' | {| width='100%' | ||
− | |- | + | |- valign="top" |
− | | width='120' | ''' | + | | width='120' | '''Signature''' |
− | | | + | |<pre>csv:serialize( |
− | |- | + | $input as item()?, |
+ | $options as map(*)? := map { } | ||
+ | ) as xs:string</pre> | ||
+ | |- valign="top" | ||
| '''Summary''' | | '''Summary''' | ||
− | |Serializes the | + | |Serializes the specified {{Code|$input}} as CSV, using the specified {{Code|$options}}, and returns the result as string. |
− | |- | + | Values can also be serialized as CSV with the standard [[Serialization]] feature of XQuery: |
+ | * The parameter {{Code|method}} needs to be set to {{Code|csv}}, and | ||
+ | * the options presented in this article need to be assigned to the {{Code|csv}} parameter. | ||
+ | |- valign="top" | ||
| '''Errors''' | | '''Errors''' | ||
− | |{{Error| | + | |{{Error|serialize|#Errors}} the input cannot be serialized. |
|} | |} | ||
Line 103: | Line 176: | ||
'''Input''' {{Code|addressbook.csv}}: | '''Input''' {{Code|addressbook.csv}}: | ||
− | <pre | + | <pre lang="xml"> |
Name,First Name,Address,City | Name,First Name,Address,City | ||
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing | Huber,Sepp,Hauptstraße 13,93547 Hintertupfing | ||
Line 109: | Line 182: | ||
'''Query:''' | '''Query:''' | ||
− | <pre | + | <pre lang='xquery'> |
let $text := file:read-text('addressbook.csv') | let $text := file:read-text('addressbook.csv') | ||
− | return csv:parse($text, { 'header': | + | return csv:parse($text, map { 'header': true() }) |
</pre> | </pre> | ||
'''Result:''' | '''Result:''' | ||
− | <pre | + | <pre lang="xml"> |
<csv> | <csv> | ||
<record> | <record> | ||
Line 130: | Line 203: | ||
'''Query:''' | '''Query:''' | ||
− | <pre | + | <pre lang='xquery'> |
− | let $ | + | let $options := map { 'lax': false() } |
− | let $options := { ' | + | let $input := file:read-text('some-data.csv') |
− | let $ | + | let $output := $input => csv:parse($options) => csv:serialize($options) |
− | let $ | + | return $input eq $output |
− | return $ | + | </pre> |
+ | |||
+ | '''Example 3:''' Converts CSV data to XQuery and returns distinct column values: | ||
+ | |||
+ | '''Query:''' | ||
+ | <pre lang='xquery'> | ||
+ | let $text := ``[Name,City | ||
+ | Jack,Chicago | ||
+ | Jack,Washington | ||
+ | John,New York | ||
+ | ]`` | ||
+ | let $options := map { 'format': 'xquery', 'header': true() } | ||
+ | let $csv := csv:parse($text, $options) | ||
+ | return ( | ||
+ | 'Distinct values:', | ||
+ | let $records := $csv('records') | ||
+ | for $name at $pos in $csv('names')?* | ||
+ | let $values := $records?($pos) | ||
+ | return ( | ||
+ | '* ' || $name || ': ' || string-join(distinct-values($values), ', ') | ||
+ | ) | ||
+ | ) | ||
+ | </pre> | ||
+ | |||
+ | '''Result:''' | ||
+ | <pre lang='xquery'> | ||
+ | Distinct values: | ||
+ | * Name: Jack, John | ||
+ | * City: Chicago, Washington, New York | ||
</pre> | </pre> | ||
Line 143: | Line 244: | ||
! width="110"|Code | ! width="110"|Code | ||
|Description | |Description | ||
− | |- | + | |- valign="top" |
− | |{{Code| | + | |{{Code|parse}} |
| The input cannot be parsed. | | The input cannot be parsed. | ||
− | |- | + | |- valign="top" |
− | |{{Code| | + | |{{Code|serialize}} |
| The node cannot be serialized. | | The node cannot be serialized. | ||
|} | |} | ||
=Changelog= | =Changelog= | ||
+ | |||
+ | ;Version 11 | ||
+ | * Added: [[#Options|Options]]: <code>skip-empty</code> option. | ||
+ | |||
+ | ;Version 9.7 | ||
+ | * Added: [[#Options|Options]]: <code>allow</code> option. | ||
+ | |||
+ | ;Version 9.4 | ||
+ | * Added: {{Function||csv:doc}} | ||
+ | |||
+ | ; Version 9.1 | ||
+ | * Updated: {{Function||csv:parse}} can be called with empty sequence. | ||
+ | |||
+ | ;Version 9.0 | ||
+ | |||
+ | * Added: {{Code|xquery}} option | ||
+ | * Removed: {{Code|map}} option | ||
+ | * Updated: error codes updated; errors now use the module namespace | ||
+ | |||
+ | ;Version 8.6 | ||
+ | |||
+ | * Updated: [[#Options|Options]]: improved Excel compatibility | ||
+ | |||
+ | ;Version 8.0 | ||
+ | |||
+ | * Added: {{Code|backslashes}} option | ||
;Version 7.8 | ;Version 7.8 | ||
− | * Updated: | + | * Updated: {{Function||csv:parse}} now returns a document node instead of an element, or an XQuery map if {{Code|format}} is set to {{Code|map}}. |
* Added: {{Code|format}} and {{Code|lax}} options | * Added: {{Code|format}} and {{Code|lax}} options | ||
The module was introduced with Version 7.7.2. | The module was introduced with Version 7.7.2. | ||
− | |||
− |
Latest revision as of 18:39, 1 December 2023
This XQuery Module contains a single function to parse CSV input. CSV (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
Contents
Conventions[edit]
All functions and errors in this module are assigned to the http://basex.org/modules/csv
namespace, which is statically bound to the csv
prefix.
Conversion Formats[edit]
XML: Direct, Attributes[edit]
If the direct
or attributes
format is chosen, a CSV string is converted to XML:
- The resulting XML document has a
csv
root element. - Rows are represented via
record
elements. - Fields are represented via
entry
elements. The value of a field is represented as text node. - If the
header
option is set totrue
, the first text line is parsed as table header:- If
format
is set todirect
, the field names are encoded, as described in the Conversion Module, and used as element names. - Otherwise, if
format
isattributes
, the field names will be stored in name attributes.
- If
A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.
XQuery[edit]
With the xquery
format, CSV records are converted to a sequence of arrays:
- The resulting value will be a map with a
records
entry and an optionalnames
entry. - Records are organized as a sequence of arrays. An array contains the entries of a single record.
- The
names
entry contains an array with the column names. It is generated if theheader
option is set totrue
.
The CSV map can e.g. be accessed as follows:
$csv?records[5]
returns all entries of the 5th record (row)$csv?records?(2)
returns all entries of the 2nd field (column)$csv?names?*
returns the names of all fields (if available)- Return enumerated strings for all records:
for $record at $pos in $csv?records
return $pos || ". " || string-join($record?*, ', ')
The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.
Options[edit]
In the following table, all available options are listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office.
Option | Description | Allowed | Default | Direction | Excel |
---|---|---|---|---|---|
separator
|
Defines the character which separates the values of a single record. | comma , semicolon , colon , tab , space or a single character
|
comma
|
parse, serialize | semicolon or comma , depending on the region
|
header
|
Indicates if the first line of the parsed or serialized CSV data is a table header. | yes , no
|
no
|
parse, serialize | |
format
|
Specifies the format for converting CSV data (see above). | direct , attributes , xquery
|
direct
|
parse, serialize | |
lax
|
Specifies if lax conversion rules are used to convert QNames to JSON names. | yes , no
|
yes
|
parse, serialize | no
|
quotes
|
Specifies how quotes are parsed:
|
yes , no
|
yes
|
parse, serialize | yes
|
backslashes
|
Specifies how quotes and other characters are escaped:
|
yes , no
|
no
|
parse, serialize | no
|
skip-empty
|
Version 11: Indicates if empty fields are included in the result. Only gets effective for the formats direct or attribute , and if the header option is enabled. Please note that if this option is used and the data is serialized again, the resulting CSV header may be incomplete.
|
yes , no
|
no
|
parse | no
|
allow
|
In Excel, a value will be evaluated if it starts with the character - , + , = , @ , \t or \r . A regular expression can be specified to reject data that will be handled differently than expected by an application, or that may be malicious (see https://owasp.org/www-community/attacks/CSV_Injection for more details).
|
string | serialize | [^-+=\t\r].*|[-+]\d*([,.]\d+)?
|
Functions[edit]
csv:doc[edit]
Signature | csv:doc( $href as xs:string?, $options as map(*)? := map { } ) as item()? |
Summary | Fetches the CSV document referred to by the given $href and converts it to an XQuery value. The $options argument can be used to control the way the input is converted.
|
Errors | parse : the specified input cannot be parsed as CSV document.options : the specified options are conflicting.
|
csv:parse[edit]
Signature | csv:parse( $value as xs:string?, $options as map(*)? := map { } ) as item()? |
Summary | Converts the CSV $value to an XQuery value. The $options argument can be used to control the way the input is converted.
|
Errors | parse : the specified input cannot be parsed as CSV document.
|
csv:serialize[edit]
Signature | csv:serialize( $input as item()?, $options as map(*)? := map { } ) as xs:string |
Summary | Serializes the specified $input as CSV, using the specified $options , and returns the result as string.
Values can also be serialized as CSV with the standard Serialization feature of XQuery:
|
Errors | serialize : the input cannot be serialized.
|
Examples[edit]
Example 1: Converts CSV data to XML, interpreting the first row as table header:
Input addressbook.csv
:
Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
Query:
let $text := file:read-text('addressbook.csv')
return csv:parse($text, map { 'header': true() })
Result:
<csv>
<record>
<Name>Huber</Name>
<First_Name>Sepp</First_Name>
<Address>Hauptstraße 13</Address>
<City>93547 Hintertupfing</City>
</record>
</csv>
Example 2: Converts some CSV data to XML and back, and checks if the input and output are equal. The expected result is true
:
Query:
let $options := map { 'lax': false() }
let $input := file:read-text('some-data.csv')
let $output := $input => csv:parse($options) => csv:serialize($options)
return $input eq $output
Example 3: Converts CSV data to XQuery and returns distinct column values:
Query:
let $text := ``[Name,City
Jack,Chicago
Jack,Washington
John,New York
]``
let $options := map { 'format': 'xquery', 'header': true() }
let $csv := csv:parse($text, $options)
return (
'Distinct values:',
let $records := $csv('records')
for $name at $pos in $csv('names')?*
let $values := $records?($pos)
return (
'* ' || $name || ': ' || string-join(distinct-values($values), ', ')
)
)
Result:
Distinct values:
* Name: Jack, John
* City: Chicago, Washington, New York
Errors[edit]
Code | Description |
---|---|
parse
|
The input cannot be parsed. |
serialize
|
The node cannot be serialized. |
Changelog[edit]
- Version 11
- Added: Options:
skip-empty
option.
- Version 9.7
- Added: Options:
allow
option.
- Version 9.4
- Added:
csv:doc
- Version 9.1
- Updated:
csv:parse
can be called with empty sequence.
- Version 9.0
- Added:
xquery
option - Removed:
map
option - Updated: error codes updated; errors now use the module namespace
- Version 8.6
- Updated: Options: improved Excel compatibility
- Version 8.0
- Added:
backslashes
option
- Version 7.8
- Updated:
csv:parse
now returns a document node instead of an element, or an XQuery map ifformat
is set tomap
. - Added:
format
andlax
options
The module was introduced with Version 7.7.2.