README.md (10 additions & 9 deletions)
@@ -8,17 +8,17 @@
 *TODO: the above badges that indicate python version and package version will only work if your package is on PyPI.
 If you don't plan to publish to PyPI, you can remove them.*
 
-csvplus provides a set of convenient enhancements on top of the Python `pandas` package for reading, cleaning, and summarizing data. Working with CSV files in pandas often requires manually specifying encodings or column types, which can be tedious. During analysis, it is common to encounter inconsistent values, such as "Google" vs. "Google, Inc.", that should be treated as the same entity. During data exploration, pandas offers descriptive statistics, but it does not automatically generate visual summaries.
+csvplus provides a set of convenient enhancements on top of the Python `pandas` package for reading, comparing, cleaning, and summarizing data. When reading CSV files, pandas does not always choose the most memory-efficient data type for each column. It is sometimes helpful to see the differences between two versions of a CSV file. Within a CSV file, values can be inconsistent, such as "Google" vs. "Google, Inc.", even though they should be treated as the same entity. Finally, it is helpful to have not only descriptive statistics but also the number of missing values.
 
 This package aims to address these pain points with these functions:
 |||
 |--------|--------|
-|read_csv_auto|Automatically detecting the correct encoding when reading CSV files|
-|detect_column_types|Inferring appropriate data types for each column|
-|consolidate_data_variation|Consolidating variations of the same data value|
-|summary_report|Producing graphical summary reports of the dataset|
+|load_optimized_csv|Loads a CSV file and automatically downcasts data types to minimize memory footprint.|
+|data_version_diff|Compares two versions of a pandas DataFrame and summarizes their differences.|
+|resolve_string_value|Consolidates spelling variations of the same data value in a column.|
+|summary_report|Produces a list of descriptive statistics of the data and information about missing values.|
 
-Our package fits into the Python preprocessing framework. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV and detect NA values, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package sanitizes the column names, convert column dtype to categorical, and add onto dealing with missing data functionalities. Our package builds on top of those two with auto-detection capacities that further simplify data import, cleaning, and exploration, so that the data scientist can focus on modeling.
+Our package fits into the Python preprocessing ecosystem. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV files and produce summary statistics, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package provides functions for sanitizing column names and converting column dtypes. Our package can be used alongside these two, adding auto-detection and summarization functionality that further increases the efficiency of data preprocessing and data exploration workflows.
 
 ## Contributors
 - Alan Liu
@@ -34,13 +34,14 @@ You can install this package into your preferred Python environment using pip:
 $ pip install csvplus
 ```
 
-TODO: Add a brief example of how to use the package to this section
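
The removed TODO asked for a usage example, and the `load_optimized_csv` row describes downcasting data types to reduce memory use. The diff does not show the signatures of the csvplus functions, so the following is only a sketch in plain pandas of what such downcasting might look like. The helper name `downcast_numeric` and the sample CSV text are hypothetical and are not part of csvplus.

```python
import io

import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: copy df and store numeric columns in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            # int64 -> smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Made-up sample data, read from memory instead of a file on disk.
csv_text = "id,price,name\n1,1.5,a\n2,2.25,b\n3,3.0,c\n"
raw = pd.read_csv(io.StringIO(csv_text))
small = downcast_numeric(raw)

print(raw.dtypes)
print(small.dtypes)
print(raw.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

Only `pd.read_csv`, `pd.to_numeric`, and `DataFrame.memory_usage` are relied on here; how csvplus itself chooses dtypes may well differ.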
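
The `resolve_string_value` row describes consolidating spelling variations such as "Google" vs. "Google, Inc.". The diff does not say how csvplus matches values, so the sketch below swaps in fuzzy matching from the standard library's `difflib` purely as an illustration. The function name `consolidate`, the `canonical` list, and the 0.6 cutoff are assumptions, not the package's API.

```python
from difflib import get_close_matches


def consolidate(values, canonical, cutoff=0.6):
    """Hypothetical helper: map each value to its closest canonical spelling.

    Values with no canonical spelling above the similarity cutoff are kept as-is.
    """
    resolved = []
    for value in values:
        match = get_close_matches(value, canonical, n=1, cutoff=cutoff)
        resolved.append(match[0] if match else value)
    return resolved


# Made-up sample values with two spelling variations.
names = ["Google", "Google, Inc.", "Microsoft", "Mircosoft"]
resolved = consolidate(names, canonical=["Google", "Microsoft"])
print(resolved)
```

A fixed cutoff is a blunt instrument: set too low, it merges distinct entities, and set too high, it misses real variants, so results should be reviewed before overwriting a column.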