CSV is a popular format for storing tabular data used in many disciplines. Metadata concerning the contents of the file is often included in the header, but it rarely follows a machine-readable format - sometimes it is not even human readable! In some cases, such information is provided in a separate file, which is not ideal as it is easy for data and metadata to become separated.

CSVY is a small Python package to handle CSV files in which the metadata in the header is formatted in YAML. It supports reading/writing tabular data contained in numpy arrays, pandas DataFrames, polars DataFrames, and nested lists, as well as metadata using a standard Python dictionary. Ultimately, it aims to incorporate information about the CSV dialect used and a Table Schema specifying the contents of each column to aid the reading and interpretation of the data.
`pycsvy` is available on PyPI and conda-forge, so installing it is as easy as:

```bash
pip install pycsvy
```

or

```bash
conda install --channel=conda-forge pycsvy
```

In order to support reading into numpy arrays, pandas DataFrames or polars DataFrames, you will need to install those packages, too. This can be done by specifying extras, i.e.:

```bash
pip install pycsvy[pandas,polars]
```

In the simplest case, to save some data contained in `data` and some metadata contained in a `metadata` dictionary into a CSVY file `important_data.csv` (the extension is not relevant), just do the following:
```python
import csvy

csvy.write("important_data.csv", data, metadata)
```

The resulting file will have the YAML-formatted header in between `---` markers with,
optionally, a comment character starting each header line. It could look something like
the following:
```
---
name: my-dataset
title: Example file of csvy
description: Show a csvy sample file.
encoding: utf-8
schema:
  fields:
    - name: Date
      type: object
    - name: WTI
      type: number
---
Date,WTI
1986-01-02,25.56
1986-01-03,26.00
1986-01-06,26.53
1986-01-07,25.85
1986-01-08,25.87
```
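A file like the one above could be written from a nested list and a plain metadata dictionary, for example. This is only a sketch: the values and schema entries are illustrative, and the exact key order of the dumped header may differ.

```python
import csvy

# Column names go in the first row when using nested lists,
# since rows are written verbatim via csv.writer.
data = [
    ["Date", "WTI"],
    ["1986-01-02", 25.56],
    ["1986-01-03", 26.00],
]

metadata = {
    "name": "my-dataset",
    "title": "Example file of csvy",
    "description": "Show a csvy sample file.",
    "encoding": "utf-8",
    "schema": {
        "fields": [
            {"name": "Date", "type": "object"},
            {"name": "WTI", "type": "number"},
        ]
    },
}

csvy.write("important_data.csv", data, metadata)
```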
For reading the information back:
```python
import csvy

# To read into a numpy array
data, metadata = csvy.read_to_array("important_data.csv")

# To read into a pandas DataFrame
data, metadata = csvy.read_to_dataframe("important_data.csv")

# To read into a polars LazyFrame
data, metadata = csvy.read_to_polars("important_data.csv")

# To read into a polars DataFrame
data, metadata = csvy.read_to_polars("important_data.csv", eager=True)
```

The appropriate writer/reader will be selected based on the type of data:
- numpy array: `np.savetxt` and `np.loadtxt`
- pandas DataFrame: `pd.DataFrame.to_csv` and `pd.read_csv`
- polars DataFrame/LazyFrame: `pl.DataFrame.write_csv` and `pl.scan_csv`
- nested lists: `csv.writer` and `csv.reader`
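Whichever reader you use, the metadata comes back as a plain Python dictionary, so the header fields can be inspected directly. A brief sketch using the example file above:

```python
import csvy

# Read the example file; the YAML header becomes a plain dictionary.
data, metadata = csvy.read_to_dataframe("important_data.csv")

print(metadata["title"])  # "Example file of csvy"

# List the column names and types declared in the schema block, if any.
for field in metadata.get("schema", {}).get("fields", []):
    print(field["name"], field["type"])
```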
Options can be passed to the tabular data writer/reader by setting the `csv_options`
dictionary. Likewise, you can set the `yaml_options` dictionary with whatever options you
want to pass to the `yaml.safe_load` and `yaml.safe_dump` functions, used when reading and
writing the YAML-formatted header, respectively.
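For example, assuming both dictionaries are passed as keyword arguments to the same call (a sketch; the specific options shown are illustrative):

```python
import csvy

data = [["Date", "WTI"], ["1986-01-02", 25.56]]
metadata = {"name": "my-dataset"}

csvy.write(
    "important_data.csv",
    data,
    metadata,
    csv_options={"delimiter": ";"},      # forwarded to csv.writer for nested lists
    yaml_options={"sort_keys": False},   # forwarded to yaml.safe_dump for the header
)
```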
You can also instruct a writer to use line buffering, instead of the usual chunk buffering.
Finally, you can control the character(s) used to indicate comments by setting the
`comment` keyword when writing a file. By default, there is no comment character (`""`).
During reading, the comment character is detected automatically.
Note that, by default, these reader functions will assume UTF-8 encoding. You can choose a
different character encoding by setting the `encoding` keyword argument on any of these
reader or writer functions. For example, on Windows, Windows-1252 encoding is often used,
which can be specified via `encoding='cp1252'`.
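For example, a minimal sketch combining the `comment` and `encoding` keywords (the values shown are illustrative):

```python
import csvy

data = [["Date", "WTI"], ["1986-01-02", 25.56]]
metadata = {"name": "my-dataset"}

# Prefix every header line with "# " and write using Windows-1252 encoding.
csvy.write("important_data.csv", data, metadata, comment="# ", encoding="cp1252")

# The comment character is detected automatically when reading back;
# only the encoding needs to be specified if it is not UTF-8.
data, metadata = csvy.read_to_dataframe("important_data.csv", encoding="cp1252")
```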
Thanks goes to these wonderful people (emoji key):
- Diego Alonso Álvarez 🚇 🤔 🚧
- Alex Dewar 🤔
- Adrian D'Alessandro 🐛 💻 📖
- James Paul Turner 🚇 💻
- Dan Cummins 🚇 💻
- mikeheyns 🚇
- Nana Adjei Manu 🚇
- Harsh 💻
This project follows the all-contributors specification. Contributions of any kind welcome!