Difference between revisions of "CSV Module"
(42 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [ | + | This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [https://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel. |
=Conventions= | =Conventions= | ||
− | All functions in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/> | + | All functions and errors in this module are assigned to the <code><nowiki>http://basex.org/modules/csv</nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/> |
− | |||
==Conversion== | ==Conversion== | ||
Line 10: | Line 9: | ||
===XML: Direct, Attributes=== | ===XML: Direct, Attributes=== | ||
− | CSV is converted to XML | + | If the {{Code|direct}} or {{Code|attributes}} format is chosen, a CSV string is converted to XML: |
− | * The resulting XML document has a {{Code| | + | * The resulting XML document has a {{Code|csv}} root element. |
− | * Rows are represented via {{Code| | + | * Rows are represented via {{Code|record}} elements. |
− | * Fields are represented via {{Code| | + | * Fields are represented via {{Code|entry}} elements. The value of a field is represented as text node. |
− | * If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header | + | * If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header: |
− | + | ** If {{Code|format}} is set to {{Code|direct}}, the field names are encoded, as described in the [[Conversion Module#Keys|Conversion Module]], and used as element names. | |
− | ** If | + | ** Otherwise, if {{Code|format}} is {{Code|attributes}}, the field names will be stored in name attributes. |
− | * | ||
− | + | '''A little advice''': in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the ''Parsing'' tab, you can see the effects of some of the conversion options. | |
− | + | ===XQuery=== | |
− | + | With the {{Code|xquery}} format, CSV records are converted to a sequence of arrays: | |
− | |||
− | |||
− | + | * The resulting value will be a map with a {{Code|records}} and an optional {{Code|names}} key. | |
+ | * Records are organized as a sequence of arrays. A single array contains the entries of a single record. | ||
+ | * The column names will be available if {{Code|header}} option is set to {{Code|true}}. | ||
+ | |||
+ | The CSV map can e.g. be accessed as follows: | ||
+ | |||
+ | * <code>$csv?records[5]</code> returns all entries of the 5th record (row) | ||
+ | * <code>$csv?records(2)</code> returns all entries of the 2nd field (column) | ||
+ | * <code>$csv?names?*</code> returns the names of all fields (if available) | ||
+ | * Return enumerated strings for all records: | ||
+ | <syntaxhighlight lang="xquery"> | ||
+ | for $record at $pos in $csv?records | ||
+ | return $pos || ". " || string-join($record?*, ', ') | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing. | ||
==Options== | ==Options== | ||
− | |||
− | |||
In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel. | In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel. | ||
Line 45: | Line 54: | ||
|- valign="top" | |- valign="top" | ||
| {{Code|separator}} | | {{Code|separator}} | ||
− | | Defines the character which separates the | + | | Defines the character which separates the values of a single record. |
| {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character'' | | {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character'' | ||
| {{Code|comma}} | | {{Code|comma}} | ||
Line 57: | Line 66: | ||
|- valign="top" | |- valign="top" | ||
| {{Code|format}} | | {{Code|format}} | ||
− | | Specifies the format of the XML data | + | | Specifies the format of the XML data:<br/> |
* With {{Code|direct}} conversion, field names are represented as element names | * With {{Code|direct}} conversion, field names are represented as element names | ||
* With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes | * With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes | ||
− | * With {{Code| | + | * With {{Code|xquery}} conversion, the input is converted to an XQuery map |
− | | {{Code|direct}}, {{Code|attributes}}, {{Code| | + | | {{Code|direct}}, {{Code|attributes}}, {{Code|xquery}} |
| {{Code|direct}} | | {{Code|direct}} | ||
| | | | ||
Line 74: | Line 83: | ||
| Specifies how quotes are parsed: | | Specifies how quotes are parsed: | ||
* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. | * Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change. | ||
− | * Serialization: If the option is enabled, the value will be wrapped with quotes. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option. | + | * Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option. |
| {{Code|yes}}, {{Code|no}} | | {{Code|yes}}, {{Code|no}} | ||
| {{Code|yes}} | | {{Code|yes}} | ||
Line 95: | Line 104: | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|csv:parse|$ | + | |{{Func|csv:parse|$string as xs:string?|item()?}}<br/>{{Func|csv:parse|$string as xs:string?, $options as map(*)?|item()?}} |
+ | |- | ||
+ | | '''Summary''' | ||
+ | |Converts the CSV {{Code|$string}} to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted. | ||
+ | |- | ||
+ | | '''Errors''' | ||
+ | |{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document. | ||
+ | |} | ||
+ | |||
+ | ==csv:doc== | ||
+ | |||
+ | {| width='100%' | ||
+ | |- | ||
+ | | width='120' | '''Signatures''' | ||
+ | |{{Func|csv:doc|$uri as xs:string?|item()?}}<br />{{Func|csv:doc|$uri as xs:string?, $options as map(*)?|item()?}}<br /> | ||
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | | | + | |Fetches the CSV document referred to by the given {{Code|$uri}} and converts it to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted. |
|- | |- | ||
| '''Errors''' | | '''Errors''' | ||
− | |{{Error| | + | |{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.<br/>{{Error|options|#Errors}} the specified options are conflicting. |
|} | |} | ||
Line 109: | Line 132: | ||
|- | |- | ||
| width='120' | '''Signatures''' | | width='120' | '''Signatures''' | ||
− | |{{Func|csv:serialize|$input as | + | |{{Func|csv:serialize|$input as item()?|xs:string}}<br/>{{Func|csv:serialize|$input as item()?, $options as map(*)?|xs:string}} |
|- | |- | ||
| '''Summary''' | | '''Summary''' | ||
− | |Serializes the | + | |Serializes the specified {{Code|$input}} as CSV, using the specified {{Code|$options}}, and returns the result as string. |
+ | Values can also be serialized as CSV with the standard [[Serialization]] feature of XQuery: | ||
+ | * The parameter {{Code|method}} needs to be set to {{Code|csv}}, and | ||
+ | * the options presented in this article need to be assigned to the {{Code|csv}} parameter. | ||
|- | |- | ||
| '''Errors''' | | '''Errors''' | ||
− | |{{Error| | + | |{{Error|serialize|#Errors}} the input cannot be serialized. |
|} | |} | ||
Line 123: | Line 149: | ||
'''Input''' {{Code|addressbook.csv}}: | '''Input''' {{Code|addressbook.csv}}: | ||
− | < | + | <syntaxhighlight lang="xml"> |
Name,First Name,Address,City | Name,First Name,Address,City | ||
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing | Huber,Sepp,Hauptstraße 13,93547 Hintertupfing | ||
− | </ | + | </syntaxhighlight> |
'''Query:''' | '''Query:''' | ||
− | < | + | <syntaxhighlight lang="xquery"> |
let $text := file:read-text('addressbook.csv') | let $text := file:read-text('addressbook.csv') | ||
return csv:parse($text, map { 'header': true() }) | return csv:parse($text, map { 'header': true() }) | ||
− | </ | + | </syntaxhighlight> |
'''Result:''' | '''Result:''' | ||
− | < | + | <syntaxhighlight lang="xml"> |
<csv> | <csv> | ||
<record> | <record> | ||
Line 144: | Line 170: | ||
</record> | </record> | ||
</csv> | </csv> | ||
− | </ | + | </syntaxhighlight> |
''' | ''' | ||
Line 150: | Line 176: | ||
'''Query:''' | '''Query:''' | ||
− | < | + | <syntaxhighlight lang="xquery"> |
− | |||
let $options := map { 'lax': false() } | let $options := map { 'lax': false() } | ||
− | let $ | + | let $input := file:read-text('some-data.csv') |
− | let $csv := csv:serialize( | + | let $output := $input => csv:parse($options) => csv:serialize($options) |
− | return $ | + | return $input eq $output |
− | </ | + | </syntaxhighlight> |
− | '''Example 3:''' Converts CSV data to | + | '''Example 3:''' Converts CSV data to XQuery and returns distinct column values: |
'''Query:''' | '''Query:''' | ||
− | < | + | <syntaxhighlight lang="xquery"> |
− | let $text := | + | let $text := ``[Name,City |
− | let $options := map { | + | Jack,Chicago |
− | + | Jack,Washington | |
− | ' | + | John,New York |
− | ' | + | ]`` |
− | + | let $options := map { 'format': 'xquery', 'header': true() } | |
− | + | let $csv := csv:parse($text, $options) | |
− | </ | + | return ( |
+ | 'Distinct values:', | ||
+ | let $records := $csv('records') | ||
+ | for $name at $pos in $csv('names')?* | ||
+ | let $values := $records?($pos) | ||
+ | return ( | ||
+ | '* ' || $name || ': ' || string-join(distinct-values($values), ', ') | ||
+ | ) | ||
+ | ) | ||
+ | </syntaxhighlight> | ||
'''Result:''' | '''Result:''' | ||
− | < | + | <syntaxhighlight lang="xquery"> |
− | + | Distinct values: | |
− | + | * Name: Jack, John | |
− | + | * City: Chicago, Washington, New York | |
− | + | </syntaxhighlight> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | </ | ||
=Errors= | =Errors= | ||
Line 191: | Line 218: | ||
|Description | |Description | ||
|- | |- | ||
− | |{{Code| | + | |{{Code|parse}} |
| The input cannot be parsed. | | The input cannot be parsed. | ||
|- | |- | ||
− | |{{Code| | + | |{{Code|serialize}} |
| The node cannot be serialized. | | The node cannot be serialized. | ||
|} | |} | ||
=Changelog= | =Changelog= | ||
+ | |||
+ | ;Version 9.4 | ||
+ | * Added: [[#csv:doc|csv:doc]] | ||
+ | |||
+ | ; Version 9.1 | ||
+ | * Updated: [[#csv:parse|csv:parse]] can be called with empty sequence. | ||
+ | |||
+ | ;Version 9.0 | ||
+ | |||
+ | * Added: {{Code|xquery}} option | ||
+ | * Removed: {{Code|map}} option | ||
+ | * Updated: error codes updated; errors now use the module namespace | ||
;Version 8.6 | ;Version 8.6 |
Revision as of 18:51, 18 November 2020
This XQuery Module contains a single function to parse CSV input. CSV (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
Contents
Conventions
All functions and errors in this module are assigned to the http://basex.org/modules/csv
namespace, which is statically bound to the csv
prefix.
Conversion
XML: Direct, Attributes
If the direct
or attributes
format is chosen, a CSV string is converted to XML:
- The resulting XML document has a
csv
root element. - Rows are represented via
record
elements. - Fields are represented via
entry
elements. The value of a field is represented as text node. - If the
header
option is set totrue
, the first text line is parsed as table header:- If
format
is set todirect
, the field names are encoded, as described in the Conversion Module, and used as element names. - Otherwise, if
format
isattributes
, the field names will be stored in name attributes.
- If
A little advice: in the Database Creation dialog of the GUI, if you select CSV Parsing and switch to the Parsing tab, you can see the effects of some of the conversion options.
XQuery
With the xquery
format, CSV records are converted to a sequence of arrays:
- The resulting value will be a map with a
records
and an optionalnames
key. - Records are organized as a sequence of arrays. A single array contains the entries of a single record.
- The column names will be available if
header
option is set totrue
.
The CSV map can e.g. be accessed as follows:
$csv?records[5]
returns all entries of the 5th record (row)$csv?records(2)
returns all entries of the 2nd field (column)$csv?names?*
returns the names of all fields (if available)- Return enumerated strings for all records:
<syntaxhighlight lang="xquery"> for $record at $pos in $csv?records return $pos || ". " || string-join($record?*, ', ') </syntaxhighlight>
The resulting representation consumes less memory than XML-based formats, and values can be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.
Options
In the following table, all available options are listed. The Excel column indicates what are the preferred options for data that is to be imported, or has been exported from Excel.
Option | Description | Allowed | Default | Excel |
---|---|---|---|---|
separator
|
Defines the character which separates the values of a single record. | comma , semicolon , colon , tab , space or a single character
|
comma
|
semicolon
|
header
|
Indicates if the first line of the parsed or serialized CSV data is a table header. | yes , no
|
no
|
|
format
|
Specifies the format of the XML data:
|
direct , attributes , xquery
|
direct
|
|
lax
|
Specifies if a lax approach is used to convert QNames to JSON names. | yes , no
|
yes
|
no
|
quotes
|
Specifies how quotes are parsed:
|
yes , no
|
yes
|
yes
|
backslashes
|
Specifies how quotes and other characters are escaped:
|
yes , no
|
no
|
no
|
Functions
csv:parse
Signatures | csv:parse($string as xs:string?) as item()? csv:parse($string as xs:string?, $options as map(*)?) as item()?
|
Summary | Converts the CSV $string to an XQuery value. The $options argument can be used to control the way the input is converted.
|
Errors | parse : the specified input cannot be parsed as CSV document.
|
csv:doc
Signatures | csv:doc($uri as xs:string?) as item()? csv:doc($uri as xs:string?, $options as map(*)?) as item()? |
Summary | Fetches the CSV document referred to by the given $uri and converts it to an XQuery value. The $options argument can be used to control the way the input is converted.
|
Errors | parse : the specified input cannot be parsed as CSV document.options : the specified options are conflicting.
|
csv:serialize
Signatures | csv:serialize($input as item()?) as xs:string csv:serialize($input as item()?, $options as map(*)?) as xs:string
|
Summary | Serializes the specified $input as CSV, using the specified $options , and returns the result as string.
Values can also be serialized as CSV with the standard Serialization feature of XQuery:
|
Errors | serialize : the input cannot be serialized.
|
Examples
Example 1: Converts CSV data to XML, interpreting the first row as table header:
Input addressbook.csv
:
<syntaxhighlight lang="xml">
Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
</syntaxhighlight>
Query: <syntaxhighlight lang="xquery"> let $text := file:read-text('addressbook.csv') return csv:parse($text, map { 'header': true() }) </syntaxhighlight>
Result: <syntaxhighlight lang="xml"> <csv>
<record> <Name>Huber</Name> <First_Name>Sepp</First_Name> <Address>Hauptstraße 13</Address> <City>93547 Hintertupfing</City> </record>
</csv> </syntaxhighlight>
Example 2: Converts some CSV data to XML and back, and checks if the input and output are equal. The expected result is true
:
Query: <syntaxhighlight lang="xquery"> let $options := map { 'lax': false() } let $input := file:read-text('some-data.csv') let $output := $input => csv:parse($options) => csv:serialize($options) return $input eq $output </syntaxhighlight>
Example 3: Converts CSV data to XQuery and returns distinct column values:
Query: <syntaxhighlight lang="xquery"> let $text := ``[Name,City Jack,Chicago Jack,Washington John,New York ]`` let $options := map { 'format': 'xquery', 'header': true() } let $csv := csv:parse($text, $options) return (
'Distinct values:', let $records := $csv('records') for $name at $pos in $csv('names')?* let $values := $records?($pos) return ( '* ' || $name || ': ' || string-join(distinct-values($values), ', ') )
) </syntaxhighlight>
Result: <syntaxhighlight lang="xquery"> Distinct values:
- Name: Jack, John
- City: Chicago, Washington, New York
</syntaxhighlight>
Errors
Code | Description |
---|---|
parse
|
The input cannot be parsed. |
serialize
|
The node cannot be serialized. |
Changelog
- Version 9.4
- Added: csv:doc
- Version 9.1
- Updated: csv:parse can be called with empty sequence.
- Version 9.0
- Added:
xquery
option - Removed:
map
option - Updated: error codes updated; errors now use the module namespace
- Version 8.6
- Updated: Options: improved Excel compatibility
- Version 8.0
- Added:
backslashes
option
- Version 7.8
- Updated: csv:parse now returns a document node instead of an element, or an XQuery map if
format
is set tomap
. - Added:
format
andlax
options
The module was introduced with Version 7.7.2.