Commit cdfa618

Address inconsistencies in function names in README.md and in data-correction.py (#21)
1 parent 347f22b commit cdfa618

2 files changed

Lines changed: 11 additions & 10 deletions

README.md

Lines changed: 10 additions & 9 deletions
@@ -8,17 +8,17 @@
 *TODO: the above badges that indicate python version and package version will only work if your package is on PyPI.
 If you don't plan to publish to PyPI, you can remove them.*
 
-csvplus provides a set of convenient enhancements on top of the Python `pandas` package for reading, cleaning, and summarizing data. Working with CSV files in pandas often requires manually specifying encodings or column types, which can be tedious. During analysis, it is common to encounter inconsistent values, such as "Google" vs. "Google, Inc.", that should be treated as the same entity. During data exploration, pandas offers descriptive statistics, but it does not automatically generate visual summaries.
+csvplus provides a set of convenient enhancements on top of the Python `pandas` package for reading, comparing, cleaning, and summarizing data. Reading CSV files with pandas does not always use the data type of the least memory. Sometimes, it is helpful to tell the differences between two version of a CSV file. Within a CSV file, the file data values can be inconsistent, such as "Google" vs. "Google, Inc.", and they should be treated as the same entity. Also, it is helpful not only to have descriptive statistics, but also the number of missing values.
 
 This package aims to address these pain points with these functions:
 | | |
 |--------|--------|
-|read_csv_auto|Automatically detecting the correct encoding when reading CSV files|
-|detect_column_types|Inferring appropriate data types for each column|
-|consolidate_data_variation|Consolidating variations of the same data value|
-|summary_report|Producing graphical summary reports of the dataset|
+|load_optimized_csv|Loads a CSV file and automatically downcasts data types to minimize memory footprint.|
+|data_version_diff|Compare two versions of a pandas DataFrame and summarize their differences.|
+|resolve_string_value|Consolidating spelling variations of the same data value in a column.|
+|summary_report|Produce a list of descriptive statistics of the data and information about missing values.|
 
-Our package fits into the Python preprocessing framework. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV and detect NA values, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package sanitizes the column names, convert column dtype to categorical, and add onto dealing with missing data functionalities. Our package builds on top of those two with auto-detection capacities that further simplify data import, cleaning, and exploration, so that the data scientist can focus on modeling.
+Our package fits into the Python preprocessing framework. Currently, the [`pandas`](https://pandas.pydata.org/) package provides basic functionality to read CSV and produce summary statistics, and the [`pyjanitor`](https://pyjanitor-devs.github.io/pyjanitor/) package provides functions for sanitizing the column names and converting column dtype. Our package can be used along with these two with more auto-detection and summarization functionalities that further increase the efficiency of data preprocessing and data exploration workflows.
 
 ## Contributors
 - Alan Liu
@@ -34,13 +34,14 @@ You can install this package into your preferred Python environment using pip:
 $ pip install csvplus
 ```
 
-TODO: Add a brief example of how to use the package to this section
-
 To use csvplus in your code:
 
 ```python
 >>> import csvplus
->>> csvplus.hello_world()
+>>> csvplus.load_optimized_csv.load_optimized_csv("large_dataset.csv")
+>>> csvplus.data_version_diff.data_version_diff(df_v1, df_v2)
+>>> csvplus.data-correction.resolve_string_value(data, "company_name", ["Google", "Microsoft"], 80)
+>>> csvplus.generate-report.summary_report(df)
 ```
 
 ## Copyright
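
The new README example reaches two functions through `csvplus.data-correction` and `csvplus.generate-report`. Python reads a hyphen inside a dotted name as subtraction, so those two lines would not work as ordinary attribute access. A minimal sketch of one workaround, assuming the module files keep their hyphenated names: `importlib.import_module` takes the module name as a string. The package `demo_pkg` and its stub function are hypothetical stand-ins, built in a temporary directory so the snippet runs without csvplus installed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package (hypothetical stand-in for csvplus) containing
# a module whose file name has a hyphen, like data-correction.py.
tmp_dir = tempfile.mkdtemp()
pkg_dir = Path(tmp_dir) / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "data-correction.py").write_text(
    "def resolve_string_value(values):\n"
    "    # Stub body for illustration only.\n"
    "    return list(values)\n"
)

# Make the temporary directory importable.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

# `import demo_pkg.data-correction` is a SyntaxError, but the same module
# name can be passed to importlib as a plain string.
data_correction = importlib.import_module("demo_pkg.data-correction")
result = data_correction.resolve_string_value(["Google", "Gogle"])
print(result)
```

Renaming the files to `data_correction.py` and `generate_report.py` would allow plain dotted access instead; that is a suggestion, not something this commit does.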

src/csvplus/data-correction.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ def resolve_string_value(df, column_name, resolved_names, threshold):
 ... "company_name": ["Google", "Google Inc.", "Gogle", "Microsoftt", "Micro-soft"],
 ... "location": ["Mt. view", "Mt. view", "Mt. view", "Redmond", , "Redmond"]
 ... })
->>> consolidate_data(data, "company_name", ["Google", "Microsoft"], 80)
+>>> resolve_string_value(data, "company_name", ["Google", "Microsoft"], 80)
 >>> data
 company_name location
 1 Google Mt. view
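
The docstring call above passes a list of canonical names and a threshold of 80. The function body is not part of this diff, so the following is only a sketch of the general technique (threshold-based fuzzy matching), not the csvplus implementation. It assumes similarity is `difflib.SequenceMatcher.ratio()` scaled to 0-100 and works on a plain list rather than a DataFrame column; the helper name `resolve_values` is hypothetical.

```python
from difflib import SequenceMatcher


def resolve_values(values, resolved_names, threshold):
    """Map each value to its closest canonical name when the score is high enough.

    Hypothetical helper: the 0-100 score is difflib's ratio times 100,
    an assumption that may differ from the scorer csvplus actually uses.
    """
    resolved = []
    for value in values:
        best_name, best_score = value, 0.0
        for name in resolved_names:
            # Case-insensitive similarity between the raw value and a candidate.
            score = 100 * SequenceMatcher(None, value.lower(), name.lower()).ratio()
            if score > best_score:
                best_name, best_score = name, score
        # Keep the original value when no candidate reaches the threshold.
        resolved.append(best_name if best_score >= threshold else value)
    return resolved


companies = ["Google", "Gogle", "Microsoftt", "Micro-soft"]
cleaned = resolve_values(companies, ["Google", "Microsoft"], 80)
print(cleaned)
```

With this particular scorer, "Google Inc." scores roughly 71 against "Google" and would be left unchanged at a threshold of 80; scorers that reward partial matches behave differently, so the threshold only has meaning relative to the scorer in use.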
