Whole CSV binary documents can be decoded with decode/1,2.
decode/1 assumes default RFC4180-style
options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
decode/2 allows using custom options:
#{separator => Separator, % any byte except $\r or $\n (default $,)
  enclosure => Enclosure, % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote}         % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
Restrictions for option combinations:
- If Enclosure is 'undefined' (i.e., no enclosing), Quote must be either 'enclosure' or 'undefined'.
- If Enclosure is not 'undefined', Quote must also not be 'undefined'.
- If Enclosure is not 'undefined', it must not be the same as Separator.
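The three restrictions can be read as a small validity check. The following is a minimal sketch and not part of hnc_csv: the module name decode_opts_sketch and the function valid/1 are hypothetical. It assumes that omitted keys fall back to the defaults listed above, and it checks only the combination rules, not the allowed byte values of each option.

```erlang
-module(decode_opts_sketch).
-export([valid/1]).

%% Hypothetical helper, not part of hnc_csv: checks only the three
%% combination rules. Missing keys are assumed to take the documented
%% defaults.
valid(Opts) when is_map(Opts) ->
    Separator = maps:get(separator, Opts, $,),
    Enclosure = maps:get(enclosure, Opts, $"),
    Quote = maps:get(quote, Opts, enclosure),
    check(Separator, Enclosure, Quote).

%% No enclosing: Quote may only be 'enclosure' or 'undefined'.
check(_Separator, undefined, Quote) ->
    Quote =:= enclosure orelse Quote =:= undefined;
%% Enclosing: Quote must be set and Enclosure must differ from Separator.
check(Separator, Enclosure, Quote) ->
    Quote =/= undefined andalso Enclosure =/= Separator.
```

Under these assumptions, valid(#{separator => $;}) holds, while valid(#{quote => undefined}) does not, because the default enclosure is still in effect.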
Lines are separated by \r, \n or \r\n. Empty lines are ignored by the decoder.
The result of decoding is a list of CSV lines, which are lists of CSV fields, which are in turn binaries representing the field values on the respective line.
Assume the following CSV data:
a,b,c
"d,d","e""e","f
f"
In an Erlang binary, this will look like:
1> CsvBinary = <<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>.
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
Decoded with decode/1, this will become:
2> hnc_csv:decode(CsvBinary).
[[<<"a">>,<<"b">>,<<"c">>],
 [<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]]
hnc_csv provides the functions decode_fold/3,4, decode_filter/2,3,
decode_map/2,3, decode_filtermap/2,3 and decode_foreach/2,3 which
allow decoding and processing decoded lines in one operation, much
like the lists functions foldl/3, filter/2, map/2, filtermap/2
and foreach/2.
In fact, decode/1,2 is implemented via decode_fold/3,4.
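To illustrate how decode/1 relates to decode_fold/3, here is a minimal sketch that collects every line through a fold. The module and function names are hypothetical, and this is not claimed to be the library's actual implementation. It relies only on decode_fold/3 taking the data, a fold function and an initial accumulator, in the same order as in the decode_fold/3 example later in this document.

```erlang
-module(decode_via_fold_sketch).
-export([decode/1]).

%% Hypothetical re-creation of decode/1 on top of decode_fold/3.
%% Each decoded line is prepended to the accumulator, and the list is
%% reversed at the end to restore document order.
decode(Data) ->
    Reversed = hnc_csv:decode_fold(Data,
                                   fun(Line, Acc) -> [Line | Acc] end,
                                   []),
    lists:reverse(Reversed).
```

Prepending and reversing once keeps the fold linear in the number of lines, which is the usual idiom when building a list with a left fold.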
The decode family of functions accepts either a raw binary or a
provider that delivers chunks of raw binary. When given a raw binary,
the binary is converted into a binary provider for further processing.
A provider is a 0-arity function which, when called, returns either a
tuple where the first element is a chunk of binary data and the second
is a new provider function for the next chunk of data, or the atom
end_of_data to indicate that the provider has delivered all data.
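The provider contract described above can be exercised without any decoder. As a minimal sketch (the module provider_sketch and the function drain/1 are hypothetical, not part of hnc_csv), the following calls a provider until it reports end_of_data and collects the chunks it delivers:

```erlang
-module(provider_sketch).
-export([drain/1]).

%% Hypothetical helper: calls the provider, then each follow-up
%% provider it returns, until end_of_data is reached. Returns the
%% delivered chunks in order.
drain(Provider) when is_function(Provider, 0) ->
    case Provider() of
        end_of_data ->
            [];
        {Chunk, NextProvider} when is_binary(Chunk) ->
            [Chunk | drain(NextProvider)]
    end.
```

A helper like this can be handy for checking a custom provider in isolation before handing it to a decoding function.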
Providers can be implemented stateless or stateful, usually depending on the characteristics of the underlying data source.
A stateless provider does not change and is not susceptible to external changes to the state of the underlying data source.
A stateful provider, on the other hand, may change, may be susceptible to changes to the state of the underlying data source, or both. It is recommended not to use or reuse stateful providers or their underlying data sources before, during, or after their use in decoding functions, except for any necessary setup beforehand or cleanup afterwards.
hnc_csv comes with two convenience functions, get_binary_provider/1,2
(stateless) and get_file_provider/1,2 (stateful) which return providers for
binaries or files, respectively.
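To illustrate what "stateful" means in practice, here is a minimal sketch of a provider that reads from an already opened file handle. The module io_provider_sketch and the function provider/2 are hypothetical, and nothing here is meant to describe how get_file_provider/1,2 is implemented. It assumes the handle was opened in binary mode.

```erlang
-module(io_provider_sketch).
-export([provider/2]).

%% Hypothetical stateful provider: reads up to ChunkSize bytes per call
%% from an already opened file handle. The read position of the handle
%% is external state, so calling the same provider fun twice delivers
%% different chunks.
provider(IoDevice, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    fun() ->
        case file:read(IoDevice, ChunkSize) of
            {ok, Chunk} when is_binary(Chunk) ->
                {Chunk, provider(IoDevice, ChunkSize)};
            eof ->
                end_of_data;
            {error, Reason} ->
                error({read_failed, Reason})
        end
    end.
```

Opening the file with file:open(Path, [read, binary]) before use and closing it with file:close/1 afterwards are the kind of setup and cleanup steps mentioned above. Reading from or repositioning the handle in between would change what the provider delivers.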
The following is an implementation of a (stateless) custom provider which delivers data taken from a given list of binaries:
-module(example_provider).
-export([get_list_provider/1]).
get_list_provider(L) ->
    fun() -> list_provider(L) end.
list_provider([]) ->
    end_of_data;
list_provider([Bin|More]) when is_binary(Bin) ->
    {Bin, fun() -> list_provider(More) end}.
get_list_provider/1 creates the initial provider, which is a call to list_provider/1 wrapped in a 0-arity function.
list_provider/1 is the actual implementation of the provider. It returns end_of_data when the list given as argument is exhausted; otherwise it returns a tuple whose first element is the head of the list and whose second element is a 0-arity function wrapping a call to itself with the tail of the list.
This provider can then be used as follows, for example to count the lines and fields in the CSV data which the provider delivers:
1> Provider = example_provider:get_list_provider([<<"a,b">>, <<",c\r">>,
                                                  <<"\nd,">>, <<"e,f">>,
                                                  <<"\r\n">>]).
#Fun<example_provider.0.64990923>
2> hnc_csv:decode_fold(Provider,
                       fun(Line, {LCnt, FCnt}) -> {LCnt+1, FCnt+length(Line)} end,
                       {0, 0}).
{2,6}
For more complex scenarios than what the built-in functions provide
for, the functions decode_init/0,1,2, decode_next_line/1 and
decode_flush/1 can be used together to decode and process CSV
documents incrementally.
- decode_init/0,1,2 creates a decoder state to be used in the other functions listed above.
- decode_next_line/1 decodes and returns the next line, together with an updated state. If the data in the provider backing the state is exhausted, the atom end_of_data is returned instead of a line.
- decode_flush/1 returns all lines in the given state that have not been read yet.
In fact, decode_fold/4 is implemented using those functions.
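One scenario where incremental decoding helps is stopping early, for example to read only the first few lines of a large document. The following is a minimal sketch under stated assumptions: the module and function names are hypothetical, and the step function passed as NextLine is assumed to return {Line, NewState}, with the atom end_of_data in place of Line once the data is exhausted. Whether decode_next_line/1 has exactly this return shape should be checked against the hnc_csv function reference before passing it in.

```erlang
-module(take_lines_sketch).
-export([take/3]).

%% Hypothetical helper: pulls at most N lines out of a decoder state.
%% NextLine is ASSUMED to map a state to {Line, NewState}, or to
%% {end_of_data, NewState} when no more lines are available.
%% Returns the lines read, in order, together with the latest state.
take(N, State, NextLine) when is_integer(N), N >= 0,
                              is_function(NextLine, 1) ->
    take(N, State, NextLine, []).

take(0, State, _NextLine, Acc) ->
    {lists:reverse(Acc), State};
take(N, State0, NextLine, Acc) ->
    case NextLine(State0) of
        {end_of_data, State1} ->
            {lists:reverse(Acc), State1};
        {Line, State1} ->
            take(N - 1, State1, NextLine, [Line | Acc])
    end.
```

If the assumption holds, a call would presumably look like take_lines_sketch:take(10, hnc_csv:decode_init(Provider), fun hnc_csv:decode_next_line/1). Passing the step function as an argument keeps the sketch independent of the exact library API.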
CSV documents can be encoded with encode/1,2.
encode/1 assumes default RFC4180-style
options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
- Lines are separated by \r\n.
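The default rules for a single field can be sketched as follows. This is a minimal illustration, not hnc_csv code: the module encode_field_sketch and the function encode_field/1 are hypothetical, and the sketch assumes that "optionally enclosed" means a field is enclosed only when it contains a comma, a double quote, \r or \n.

```erlang
-module(encode_field_sketch).
-export([encode_field/1]).

%% Hypothetical sketch of the default rules for one field: enclose only
%% when the field contains a comma, a double quote, \r or \n, and
%% double every double quote inside an enclosed field.
encode_field(Field) when is_binary(Field) ->
    case binary:match(Field, [<<",">>, <<"\"">>, <<"\r">>, <<"\n">>]) of
        nomatch ->
            Field;
        _Found ->
            Quoted = binary:replace(Field, <<"\"">>, <<"\"\"">>, [global]),
            <<$", Quoted/binary, $">>
    end.
```

Joining the encoded fields of a line with commas and terminating each line with \r\n would then give a document in the default format.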
encode/2 allows using custom options:
#{separator => Separator,    % any byte except $\r or $\n (default $,)
  enclosure => Enclosure,    % 'undefined' or any byte except $\r or $\n (default $")
  quote => Quote,            % 'undefined', 'enclosure', or any byte except $\r or $\n (default 'enclosure')
  enclose => Enclose,        % 'optional' (default), 'never' or 'always'
  end_of_line => EndOfLine}  % <<"\r\n">> (default), <<"\n">> or <<"\r">>
Restrictions for option combinations:
- If Enclose is 'never' (i.e., no enclosing), Enclosure must be 'undefined' and Quote must be 'undefined' or 'enclosure'.
- If Enclose is 'optional' or 'always', Enclosure and Quote must not be 'undefined'.
- If Enclosure is not 'undefined', it must not be the same as Separator.
The input for encoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
The result is a CSV binary document containing the given CSV lines, each made up of its given CSV fields.
Assume the following CSV structure:
1> Csv = [[<<"a">>,<<"b">>,<<"c">>],
          [<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]].
Encoded with encode/1, this will become:
2> hnc_csv:encode(Csv).
<<"a,b,c\r\n"
  "\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
- Maria Scott (Maria-12648430)
- Jan Uhlig (juhlig)