Commit b6ed4a8

Merge pull request #38 from VladimirShitov/docs
Add better readme
2 parents 14926d1 + 6a855cf commit b6ed4a8

9 files changed

Lines changed: 83 additions & 82 deletions

README.md

Lines changed: 78 additions & 79 deletions
@@ -13,102 +13,120 @@
1313
[![License](https://img.shields.io/github/license/VladimirShitov/nafig)](https://github.com/VladimirShitov/nafig/blob/master/LICENSE)
1414
![Coverage Report](assets/images/coverage.svg)
1515

16-
Package for creating figures with NA data distribution
16+
Do you want to visualize missing values in your data? There are plenty of amazing methods (check [missingno](https://github.com/ResidentMario/missingno) for example), but they all look bulky when your data has too many columns. `nafig` will help you build a perfect NA figure!
1717

1818
</div>
1919

20-
## Very first steps
20+
# Installation
2121

22-
### Initialize your code
22+
```bash
23+
$ pip install -U nafig
24+
```
2325

24-
1. Initialize `git` inside your repo:
26+
or install with `Poetry`
2527

2628
```bash
27-
cd nafig && git init
29+
$ poetry add nafig
2830
```
2931

30-
2. If you don't have `Poetry` installed run:
32+
# Usage
3133

32-
```bash
33-
make poetry-download
34+
Here are some usage examples for both simulated and real-world data. Check [this notebook](example.ipynb) to play with the code yourself!
35+
36+
First, let's import the core function and other useful things:
37+
38+
```python
39+
>>> from nafig.plots import na_text_barplot # The core function
40+
>>> from nafig.utils import create_example_data # To simulate data
41+
>>> import pandas as pd # To work with tables
3442
```
3543

36-
3. Initialize poetry and install `pre-commit` hooks:
44+
```python
45+
>>> df, feature_types = create_example_data()
46+
```
3747

38-
```bash
39-
make install
40-
make pre-commit-install
48+
`df` is just a pandas dataframe with missing values. `feature_types` is an array containing a data type description for each column. This is just an example, so the labels don't correspond to actual data types.
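
If you want to try the plot on your own toy data, a similar pair of objects can be built by hand. The sketch below is an assumption about what such input might look like, not the implementation of `create_example_data`; the names `toy_df` and `toy_feature_types`, the 30% missing rate and the labels are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

n_rows, n_cols = 100, 6
values = rng.normal(size=(n_rows, n_cols))

# Hide roughly 30% of the cells to simulate missing data (arbitrary rate)
mask = rng.random(size=(n_rows, n_cols)) < 0.3
values[mask] = np.nan

toy_df = pd.DataFrame(values, columns=[f"feature_{i}" for i in range(n_cols)])

# One arbitrary label per column, in column order (hypothetical labels)
toy_feature_types = np.array(
    ["Categorical", "Binary", "Continuous", "Continuous", "Binary", "Continuous"]
)
```

The only structural requirement assumed here is that the label array has one entry per column of the dataframe.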
49+
50+
```python
51+
>>> feature_types[:10]
52+
array(['Categorical', 'Categorical', 'Binary', 'Continuous', 'Continuous',
53+
'Continuous', 'Binary', 'Continuous', 'Continuous', 'Binary'],
54+
dtype='<U11')
4155
```
4256

43-
4. Run the codestyle:
57+
This toy dataframe contains 300 columns. Visualizing missing data with a heatmap would unfortunately be too bulky. How can you explore the missing data distribution in this dataset? Try the NA text barplot!
4458

45-
```bash
46-
make codestyle
59+
```python
60+
>>> na_text_barplot(df, hue=feature_types, line_height=1.5)
4761
```
4862

49-
5. Upload initial code to GitHub:
63+
![1_simulated_data.png](images/1_simulated_data.png)
5064

51-
```bash
52-
git add .
53-
git commit -m ":tada: Initial commit"
54-
git branch -M main
55-
git remote add origin https://github.com/VladimirShitov/nafig.git
56-
git push -u origin main
65+
Columns of the dataset are binned by the percentage of missing data in them. Colouring by feature type helps to understand which types of data are missing. On the Y-axis you can see the number of features in each group.
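
To make the binning idea concrete, here is a minimal pandas sketch of how columns could be grouped by their share of missing values. It is an illustration under assumptions, not `nafig`'s internal code: the equal-width bins over 0 to 100% and the names `toy_df`, `na_percent` and `features_per_bin` are chosen for the example only.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],           # nothing missing
        "b": [1.0, np.nan, 3.0, 4.0],        # 1 of 4 values missing
        "c": [np.nan, np.nan, 3.0, np.nan],  # 3 of 4 values missing
        "d": [np.nan] * 4,                   # everything missing
    }
)

# Percentage of missing values in each column
na_percent = toy_df.isna().mean() * 100

# Equal-width bins over the 0-100% range (assumed binning scheme)
num_bins = 4
edges = np.linspace(0, 100, num_bins + 1)
bins = pd.cut(na_percent, bins=edges, include_lowest=True)

# Number of columns in each bin, kept in bin order; empty bins are kept too
features_per_bin = bins.value_counts(sort=False)
print(features_per_bin)
```

Each count corresponds to the height of one bar on the plot, and the columns assigned to a bin are the feature names that would be written inside that bar.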
66+
67+
You can vary the number of bins using the `num_bins` parameter:
68+
69+
```python
70+
>>> na_text_barplot(df, hue=feature_types, line_height=1.5, num_bins=20)
5771
```
5872

59-
### Set up bots
73+
![2_20_bins.png](images/2_20_bins.png)
6074

61-
- Set up [Dependabot](https://docs.github.com/en/github/administering-a-repository/enabling-and-disabling-version-updates#enabling-github-dependabot-version-updates) to ensure you have the latest dependencies.
62-
- Set up [Stale bot](https://github.com/apps/stale) for automatic issue closing.
75+
```python
76+
>>> na_text_barplot(df, hue=feature_types, line_height=2, num_bins=2, fig_width=8, font_size=3)
77+
```
6378

64-
### Poetry
79+
![3_2_bins.png](images/3_2_bins.png)
6580

66-
Want to know more about Poetry? Check [its documentation](https://python-poetry.org/docs/).
81+
Now let's see some real data examples!
6782

68-
<details>
69-
<summary>Details about Poetry</summary>
70-
<p>
83+
## House prices missing data visualization
7184

72-
Poetry's [commands](https://python-poetry.org/docs/cli/#commands) are very intuitive and easy to learn, like:
85+
Data source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
7386

74-
- `poetry add numpy@latest`
75-
- `poetry run pytest`
76-
- `poetry publish --build`
87+
```python
88+
>>> DATA_PATH = "data/house-prices/train.csv"
89+
>>> house_prices_df = pd.read_csv(DATA_PATH, index_col=0)
90+
```
7791

78-
etc
79-
</p>
80-
</details>
92+
This is reasonably good data, with most of the values present. But thanks to this plot, we can see which features are the bad guys!
93+
94+
```python
95+
>>> na_text_barplot(house_prices_df, fig_width=17, num_bins=20, line_height=1.5)
96+
```
8197

82-
### Building and releasing your package
98+
![4_house_prices_data.png](images/4_house_prices_data.png)
8399

84-
Building a new version of the application contains steps:
100+
Note that if you don't pass the `hue` parameter, features will be colored by the data type of the column. If you don't want to colorize features at all, set `hue` to `False`.
101+
102+
By setting `remove_empty_bins` to `True`, you can remove the empty bins. It will require a reader to pay more attention to the X-axis but will save you some space.
103+
104+
```python
105+
>>> na_text_barplot(house_prices_df, fig_width=10, num_bins=20,
106+
line_height=1.5, remove_empty_bins=True)
107+
```
85108

86-
- Bump the version of your package `poetry version <version>`. You can pass the new version explicitly, or a rule such as `major`, `minor`, or `patch`. For more details, refer to the [Semantic Versions](https://semver.org/) standard.
87-
- Make a commit to `GitHub`.
88-
- Create a `GitHub release`.
89-
- And... publish 🙂 `poetry publish --build`
109+
![5_house_prices_no_bins.png](images/5_house_prices_no_bins.png)
90110

91-
## 🎯 What's next
111+
## Seattle Airbnb dataset missing values visualization
92112

93-
Well, that's up to you 💪🏻. I can only recommend the packages and articles that helped me.
113+
Data source: https://www.kaggle.com/datasets/airbnb/seattle
114+
115+
```python
116+
>>> airbnb_df = pd.read_csv("data/airbnb/listings.csv")
117+
```
118+
119+
This dataset has a bit more missing data. On the plot we can see that all integer features are almost complete, while some `object` and floating-point columns contain missing values.
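
The same observation can be cross-checked numerically by grouping the share of missing values by column dtype. The sketch below uses a tiny made-up dataframe rather than the Airbnb data, and the names `toy_df`, `summary` and `mean_by_dtype` are hypothetical.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],                               # integer, complete
        "price": [10.5, np.nan, 12.0, np.nan],            # float, half missing
        "city": ["Seattle", None, "Seattle", "Seattle"],  # text, one value missing
    }
)

# Percentage of missing values per column
na_percent = toy_df.isna().mean() * 100

# Pair every column with its dtype name, then average within each dtype
summary = pd.DataFrame(
    {"dtype": toy_df.dtypes.astype(str), "na_percent": na_percent}
)
mean_by_dtype = summary.groupby("dtype")["na_percent"].mean()
print(mean_by_dtype)
```

On a real dataset, `summary.sort_values("na_percent")` gives a quick tabular view of the columns with the most missing values.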
120+
121+
```python
122+
>>> na_text_barplot(airbnb_df, fig_width=18, line_height=1.8, font_size=9, remove_empty_bins=True)
123+
```
94124

95-
- [`Typer`](https://github.com/tiangolo/typer) is great for creating CLI applications.
96-
- [`Rich`](https://github.com/willmcgugan/rich) makes it easy to add beautiful formatting in the terminal.
97-
- [`Pydantic`](https://github.com/samuelcolvin/pydantic/) – data validation and settings management using Python type hinting.
98-
- [`Loguru`](https://github.com/Delgan/loguru) makes logging (stupidly) simple.
99-
- [`tqdm`](https://github.com/tqdm/tqdm) – fast, extensible progress bar for Python and CLI.
100-
- [`IceCream`](https://github.com/gruns/icecream) is a little library for sweet and creamy debugging.
101-
- [`orjson`](https://github.com/ijl/orjson) – ultra fast JSON parsing library.
102-
- [`Returns`](https://github.com/dry-python/returns) makes you function's output meaningful, typed, and safe!
103-
- [`Hydra`](https://github.com/facebookresearch/hydra) is a framework for elegantly configuring complex applications.
104-
- [`FastAPI`](https://github.com/tiangolo/fastapi) is a type-driven asynchronous web framework.
125+
![6_airbnb_data.png](images/6_airbnb_data.png)
105126

106-
Articles:
127+
Feel free to explore other parameters! There are more to help you create a perfect missing values visualization.
107128

108-
- [Open Source Guides](https://opensource.guide/).
109-
- [A handy guide to financial support for open source](https://github.com/nayafia/lemonade-stand)
110-
- [GitHub Actions Documentation](https://help.github.com/en/actions).
111-
- Maybe you would like to add [gitmoji](https://gitmoji.carloscuesta.me/) to commit names. This is really funny. 😄
129+
# Developers section
112130

113131
## 🚀 Features
114132

@@ -131,25 +149,6 @@ Articles:
131149
- Always up-to-date dependencies with [`@dependabot`](https://dependabot.com/). You will only [enable it](https://docs.github.com/en/github/administering-a-repository/enabling-and-disabling-version-updates#enabling-github-dependabot-version-updates).
132150
- Automatic drafts of new releases with [`Release Drafter`](https://github.com/marketplace/actions/release-drafter). You may see the list of labels in [`release-drafter.yml`](https://github.com/VladimirShitov/nafig/blob/master/.github/release-drafter.yml). Works perfectly with [Semantic Versions](https://semver.org/) specification.
133151

134-
### Open source community features
135-
136-
- Ready-to-use [Pull Requests templates](https://github.com/VladimirShitov/nafig/blob/master/.github/PULL_REQUEST_TEMPLATE.md) and several [Issue templates](https://github.com/VladimirShitov/nafig/tree/master/.github/ISSUE_TEMPLATE).
137-
- Files such as: `LICENSE`, `CONTRIBUTING.md`, `CODE_OF_CONDUCT.md`, and `SECURITY.md` are generated automatically.
138-
- [`Stale bot`](https://github.com/apps/stale) that closes abandoned issues after a period of inactivity. (You will only [need to setup free plan](https://github.com/marketplace/stale)). Configuration is [here](https://github.com/VladimirShitov/nafig/blob/master/.github/.stale.yml).
139-
- [Semantic Versions](https://semver.org/) specification with [`Release Drafter`](https://github.com/marketplace/actions/release-drafter).
140-
141-
## Installation
142-
143-
```bash
144-
pip install -U nafig
145-
```
146-
147-
or install with `Poetry`
148-
149-
```bash
150-
poetry add nafig
151-
```
152-
153152

154153

155154
### Makefile usage

ehrapy_paper_plot.ipynb

Lines changed: 4 additions & 2 deletions
Large diffs are not rendered by default.

images/1_simulated_data.png

151 KB

images/2_20_bins.png

127 KB

images/3_2_bins.png

53.1 KB

images/4_house_prices_data.png

83.2 KB

images/5_house_prices_no_bins.png

68.3 KB

images/6_airbnb_data.png

150 KB

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ build-backend = "poetry.core.masonry.api"
55

66
[tool.poetry]
77
name = "nafig"
8-
version = "0.2.0"
8+
version = "1.0.0"
99
description = "Package for creating figures with NA data distribution"
1010
readme = "README.md"
1111
authors = ["vladimirshitov98 <vladimirshitov98@gmail.com>"]
