Changes

Jump to navigation Jump to search
3,308 bytes added ,  14:17, 20 July 2022
no edit summary
This [[Module Library|XQuery Module]] contains a single function to parse CSV input. [httphttps://en.wikipedia.org/wiki/Comma-separated_values CSV] (comma-separated values) is a popular representation for tabular data, exported e. g. from Excel.
=Conventions=
All functions and errors in this module are assigned to the {{Code|<code><nowiki>http://basex.org/modules/csv}} </nowiki></code> namespace, which is statically bound to the {{Code|csv}} prefix.<br/>All errors are assigned to the {{Code|http://basex.org/errors}} namespace, which is statically bound to the {{Code|bxerr}} prefix.
==Conversion==
===XML: Direct, Attributes===
If the {{Code|direct}} or {{Code|attributes}} format is chosen, a CSV string is converted to XML as follows:
* The resulting XML document has a {{Code|<csv>}} root element.* Rows are represented via {{Code|<record>}} elements.* Fields are represented via {{Code|<entry>}} elements. The value of a field is represented as text node.* If the {{Code|header}} option is set to {{Code|true}}, the first text line is parsed as table header, and the {{Code|<entry>}} elements are replaced with the field names:** Empty names are represented by a single underscore ({{Code|_}}), and characters that are not valid in element names are replaced with underscores.** If the {{Code|laxformat}} option is set to {{Code|falsedirect}}, invalid characters will be rewritten to an underscore and the character’s four-digit Unicodefield names are encoded, as described in the [[Conversion Module#Keys|Conversion Module]], and underscores will be represented used as two underscores ({{Code|__}}). The resulting element names may be less readable, but can always be converted back to the original field names.* If * Otherwise, if {{Code|format}} is set to {{Code|attributes}}, the field names will be stored in name attributes.
In '''A little advice''': in the Database Creation dialog of the GUI, when the if you select CSV parser is selected, Parsing and switch to the ''Parsing'' tab demonstrates , you can see the conversion effects of CSV to XML and the effects some of the single conversion options.
===MapXQuery===
If With the {{Code|format}} is set to {{Code|mapxquery}}format, the CSV data will be records are converted to an XQuery mapa sequence of arrays:
* All The resulting value will be a map with a {{Code|records are enumerated with positive integers}} and an optional {{Code|names}} key.* By default, all Records are organized as a sequence of arrays. A single array contains the entries of a records are represented in a sequencesingle record.* If the The column names will be available if {{Code|header}} option is set to {{Code|true}}, a . The CSV map is created, which contains can e.g. be accessed as follows: * <code>$csv?records[5]</code> returns all entries of the 5th record (row)* <code>$csv?records?(2)</code> returns all entries of the 2nd field (column)* <code>$csv?names?*</code> returns the names of all fields (if available)* Return enumerated strings for all records:<syntaxhighlight lang="xquery">for $record at $pos in $csv?recordsreturn $pos || ". " || string-join($record?*, ', ')</syntaxhighlight> The resulting representation consumes less memory than XML-based formats, and its valuescan be directly accessed without conversion. Thus, it is recommendable for very large inputs and for efficient ad-hoc processing.
==Options==
The In the following table, all available options are available:listed. The Excel column lists recommended options for data that is processed with Excel or Open/Libre Office.
{| class="wikitable sortable" width="100%"
|- valign="top"
! width="140" | Option! width="50%" | Description
! Allowed
! Default
! Direction
! Excel
|- valign="top"
| {{Code|separator}}
| Defines the character which separates the entries values of a single record in a single line.
| {{Code|comma}}, {{Code|semicolon}}, {{Code|colon}}, {{Code|tab}}, {{Code|space}} or a ''single character''
| {{Code|comma}}
| ''parse'', ''serialize''
| {{Code|semicolon}} or {{Code|comma}}, depending on the region
|- valign="top"
| {{Code|header}}
| {{Code|yes}}, {{Code|no}}
| {{Code|no}}
| ''parse'', ''serialize''
|
|- valign="top"
| {{Code|format}}
| Specifies the format of the XML data. The format is only relevant if the {{Code|header}} option is activated:<br/>
* With {{Code|direct}} conversion, field names are represented as element names
* With {{Code|attributes}} conversion, field names are stored in {{Code|name}} attributes
* With {{Code|mapxquery}} conversion, the input is converted to an XQuery map| {{Code|direct}}, {{Code|attributes}}, {{Code|mapxquery}}
| {{Code|direct}}
| ''parse'', ''serialize''
|
|- valign="top"
| {{Code|lax}}
| {{Code|yes}}, {{Code|no}}
| {{Code|yes}}
| ''parse'', ''serialize''
| {{Code|no}}
|- valign="top"
| {{Code|quotes}}
| Specifies how quotes are parsed:
* Parsing: If the option is enabled, quotes at the start and end of a value will be treated as control characters. Separators and newlines within the quotes will be adopted without change.
* Serialization: If the option is enabled, the value will be wrapped with quotes if it contains characters that might be treated as control characters. A quote character in the value will be encoded according to the rules of the {{Code|backslashes}} option.
| {{Code|yes}}, {{Code|no}}
| {{Code|yes}}
| ''parse'', ''serialize''
| {{Code|yes}}
|- valign="top"
| {{Code|backslashes}}
| Specifies how quotes and other characters are escaped:
* Parsing: If the option is enabled, {{Code|\r}}, {{Code|n}} and {{Code|\t}} will be replaced with the corresponding control characters. All other escaped characters will be adopted as literals (e.g.: {{Code|\"}} → {{Code|"}}). If the option is disabled, two consecutive quotes will be replaced with a single quote (unless {{Code|quotes}} is enabled and the quote is the first or last character of a value).
* Serialization: If the option is enabled, {{Code|\r}}, {{Code|n}}, {{Code|\t}}, {{Code|"}} and the separator character will be encoded with a backslash. If the option is disabled, quotes will be duplicated.
| {{Code|yes}}, {{Code|no}}
| {{Code|no}}
| ''parse'', ''serialize''
| {{Code|no}}
|- valign="top"
| {{Code|allow}}
| In Excel, a value will be evaluated if it starts with the character {{Code|-}}, {{Code|+}}, <code>=</code>, {{Code|@}}, {{Code|\t}} or {{Code|\r}}. A regular expression can be specified to reject data that will be handled differently than expected by an application, or that may be malicious (see https://owasp.org/www-community/attacks/CSV_Injection for more details).
| ''string''
|
| ''serialize''
| <code><nowiki>[^-+=\t\r].*|[-+]\d*([,.]\d+)?</nowiki></code>
|}
The CSV function signatures provide an {{Code|$options}} argument. Options can either be specified=Functions=* as children of an {{Code|<csv:options/>}} element; e.g.:<pre class="brush:xml"><csv:options> <=csv:separator valuedoc=';'/> ...</csv:options></pre>* or as map, which contains all key/value pairs:<pre class="brush:xquery">{ 'separator': ';', ... }</pre>
{| width=Functions'100%'|- valign="top"| width='120' | '''Signatures'''|{{Func|csv:doc|$uri as xs:string?|item()?}}<br />{{Func|csv:doc|$uri as xs:string?, $options as map(*)?|item()?}}<br />|- valign="top"| '''Summary'''|Fetches the CSV document referred to by the given {{Code|$uri}} and converts it to an XQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.|- valign="top"| '''Errors'''|{{Error|parse|#Errors}} the specified input cannot be parsed as CSV document.<br/>{{Error|options|#Errors}} the specified options are conflicting.|}
==csv:parse==
 
{{Version|7.8}}: [[#csv:parse|csv:parse]] now returns a document node instead of an element, or an XQuery map if {{Code|format}} is set to {{Code|map}}. New options have been added.
{| width='100%'
|-valign="top"
| width='120' | '''Signatures'''
|{{Func|csv:parse|$input string as xs:string?|document-node(elementitem(csv))?}}<br/>{{Func|csv:parse|$input string as xs:string?, $options as itemmap(*)?|item()?}}|-valign="top"
| '''Summary'''
|Converts the CSV data specified by {{Code|$inputstring}} to an XML document or a mapXQuery value. The {{Code|$options}} argument can be used to control the way the input is converted.|-valign="top"
| '''Errors'''
|{{Error|BXCS0001parse|#Errors}} the specified input cannot be parsedas CSV document.
|}
==csv:serialize==
 
{| width='100%'
|-valign="top"
| width='120' | '''Signatures'''
|{{Func|csv:serialize|$input as nodeitem()?|xs:string}}<br/>{{Func|csv:serialize|$input as item()?, $options as itemmap(*)?|xs:string}}|-valign="top"
| '''Summary'''
|Serializes the node specified by {{Code|$input}} as CSV data, and returns using the result as specified {{Code|xs:string$options}}, and returns the result as string.Items Values can also be serialized as JSON if CSV with the standard [[Serialization|Serialization Parameter]] feature of XQuery:* The parameter {{Code|method}} is needs to be set to {{Code|csv}}.<br/>The , and* the options presented in this article need to be assigned to the {{Code|$optionscsv}} argument can be used to control the way the input is serializedparameter.|-valign="top"
| '''Errors'''
|{{Error|BXCS0002serialize|#Errors}} the input cannot be serialized.
|}
'''Input''' {{Code|addressbook.csv}}:
<pre classsyntaxhighlight lang="brush:xml">
Name,First Name,Address,City
Huber,Sepp,Hauptstraße 13,93547 Hintertupfing
</presyntaxhighlight>
'''Query:'''
<pre classsyntaxhighlight lang="brush:xquery">
let $text := file:read-text('addressbook.csv')
return csv:parse($text, map { 'header': true() })</presyntaxhighlight>
'''Result:'''
<pre classsyntaxhighlight lang="brush:xml">
<csv>
<record>
</record>
</csv>
</presyntaxhighlight>
'''
'''Query:'''
<pre classsyntaxhighlight lang="brush:xquery">let $text options := map { 'lax': false() }let $input := file:read-text('some-data.csv')let $options output := { 'lax': 'no' }let $xml :input = > csv:parse($text, $options)let $csv := > csv:serialize($xml, $options)return $text input eq $csvoutput</presyntaxhighlight>
'''Example 3:''' Converts CSV data to an XQuery map item and serializes its contentsreturns distinct column values:
'''Query:'''
<pre classsyntaxhighlight lang="brush:xquery">let $text := "``[Name;,City" || outJack,ChicagoJack,WashingtonJohn,New York]``let $options := map { 'format':nl() || "John;Newton" || out'xquery', 'header':nltrue() || "Jack;Oldtown"}let $options csv := <csv:parse($text, $options>)return ( <csv 'Distinct values:separator value=';'/>, <csv let $records :format value=$csv('maprecords'/>) < for $name at $pos in $csv:header value=('yesnames'/>)?* </csv:options>let $map values := csv:parse$records?($text, $optionspos) return map( '* ' || $name || ':serialize' || string-join(distinct-values($mapvalues), ', ') ))</presyntaxhighlight>
'''Result:'''
<pre classsyntaxhighlight lang="brush:xmlxquery">{ 1Distinct values: { "City"* Name: "Newton"Jack, "Name": "John" }, 2: { "* City": "Oldtown"Chicago, Washington, "Name": "Jack" }}New York</presyntaxhighlight>
=Errors=
! width="110"|Code
|Description
|-valign="top"|{{Code|BXCS0001parse}}
| The input cannot be parsed.
|-valign="top"|{{Code|BXCS0002serialize}}
| The node cannot be serialized.
|}
=Changelog=
 
;Version 9.7
* Added: [[#Options|Options]]: <code>allow</code> option.
 
;Version 9.4
* Added: {{Function||csv:doc}}
 
; Version 9.1
* Updated: {{Function||csv:parse}} can be called with empty sequence.
 
;Version 9.0
 
* Added: {{Code|xquery}} option
* Removed: {{Code|map}} option
* Updated: error codes updated; errors now use the module namespace
 
;Version 8.6
 
* Updated: [[#Options|Options]]: improved Excel compatibility
 
;Version 8.0
 
* Added: {{Code|backslashes}} option
;Version 7.8
* Updated: [[#csv:parse{{Function||csv:parse]] }} now returns a document node instead of an element, or an XQuery map if {{Code|format}} is set to {{Code|map}}.
* Added: {{Code|format}} and {{Code|lax}} options
The module was introduced with Version 7.7.2.
 
[[Category:XQuery]]
Bureaucrats, editor, reviewer, Administrators
13,550

edits

Navigation menu