| 1 | +REFERENCE (initial code): https://github.com/sdv-dev/CTGAN |
| 2 | + |
| 3 | +<p align="left"> |
| 4 | +<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="sdv-dev" /> |
| 5 | +<i>An open source project from Data to AI Lab at MIT.</i> |
| 6 | +</p> |
| 7 | + |
| 8 | +[](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha) |
| 9 | +[](https://pypi.python.org/pypi/ctgan) |
| 10 | +[](https://travis-ci.org/sdv-dev/CTGAN) |
| 11 | +[](https://pepy.tech/project/ctgan) |
| 12 | +[](https://codecov.io/gh/sdv-dev/CTGAN) |
| 13 | + |
| 14 | +# CTGAN |
| 15 | + |
| 16 | +Implementation of our NeurIPS paper [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503). |
| 17 | + |
| 18 | +CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity. |
| 19 | + |
| 20 | +* License: [MIT](https://github.com/sdv-dev/CTGAN/blob/master/LICENSE) |
| 21 | +* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha) |
| 22 | +* Documentation: https://sdv-dev.github.io/CTGAN |
| 23 | +* Homepage: https://github.com/sdv-dev/CTGAN |
| 24 | + |
| 25 | +## Overview |
| 26 | + |
| 27 | +Based on previous work ([TGAN](https://github.com/sdv-dev/TGAN)) on synthetic data generation, |
| 28 | +we develop a new model called CTGAN. Several major differences make CTGAN outperform TGAN. |
| 29 | + |
| 30 | +- **Preprocessing**: CTGAN uses a more sophisticated Variational Gaussian Mixture Model to detect |
| 31 | + the modes of continuous columns. |
| 32 | +- **Network structure**: TGAN uses an LSTM to generate synthetic data column by column. CTGAN uses |
| 33 | + fully-connected networks, which are more efficient. |
| 34 | +- **Features to prevent mode collapse**: We design a conditional generator and resample the |
| 35 | + training data to prevent mode collapse on discrete columns. We use WGAN-GP and PacGAN to |
| 36 | + stabilize GAN training. |
| 37 | + |
| 38 | + |
| 39 | +# Install |
| 40 | + |
| 41 | +## Requirements |
| 42 | + |
| 43 | +**CTGAN** has been developed and tested on [Python 3.5, 3.6 and 3.7](https://www.python.org/downloads/). |
| 44 | + |
| 45 | +## Install from PyPI |
| 46 | + |
| 47 | +The recommended way to install **CTGAN** is with [pip](https://pip.pypa.io/en/stable/): |
| 48 | + |
| 49 | +```bash |
| 50 | +pip install ctgan |
| 51 | +``` |
| 52 | + |
| 53 | +This will pull and install the latest stable release from [PyPI](https://pypi.org/). |
| 54 | + |
| 55 | +If you want to install from source or contribute to the project, please read the |
| 56 | +[Contributing Guide](https://sdv-dev.github.io/CTGAN/contributing.html#get-started). |
| 57 | + |
| 58 | +# Data Format |
| 59 | + |
| 60 | +**CTGAN** expects the input data to be a table given as either a `numpy.ndarray` or a |
| 61 | +`pandas.DataFrame` object with two types of columns: |
| 62 | + |
| 63 | +* **Continuous columns**: Columns that contain numerical values and can take any value. |
| 64 | +* **Discrete columns**: Columns that contain only a finite number of possible values, whether |
| 65 | +these are string values or not. |
| 66 | + |
| 67 | +This is an example of a table with 4 columns: |
| 68 | + |
| 69 | +* A continuous column with float values |
| 70 | +* A continuous column with integer values |
| 71 | +* A discrete column with string values |
| 72 | +* A discrete column with integer values |
| 73 | + |
| 74 | +| | A | B | C | D | |
| 75 | +|---|------|-----|-----|---| |
| 76 | +| 0 | 0.1 | 100 | 'a' | 1 | |
| 77 | +| 1 | -1.3 | 28 | 'b' | 2 | |
| 78 | +| 2 | 0.3 | 14 | 'a' | 2 | |
| 79 | +| 3 | 1.4 | 87 | 'a' | 3 | |
| 80 | +| 4 | -0.1 | 69 | 'b' | 2 | |
| 81 | + |
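As a minimal sketch, the example table above could be built as a `pandas.DataFrame` like this. The variable names `data` and `discrete_columns` are illustrative choices, and the column labels `A` to `D` simply mirror the table.

```python
import pandas as pd

# Rebuild the example table: two continuous and two discrete columns.
data = pd.DataFrame({
    'A': [0.1, -1.3, 0.3, 1.4, -0.1],   # continuous, float values
    'B': [100, 28, 14, 87, 69],         # continuous, integer values
    'C': ['a', 'b', 'a', 'a', 'b'],     # discrete, string values
    'D': [1, 2, 2, 3, 2],               # discrete, integer values
})

# Column D holds integers but takes only a few distinct values,
# so it is listed as discrete alongside the string column C.
discrete_columns = ['C', 'D']

print(data)
```

Note that a column's dtype alone does not decide whether it is discrete: `B` and `D` both hold integers, yet only `D` is treated as discrete here.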
| 82 | + |
| 83 | +**NOTE**: CTGAN does not distinguish between float and integer columns, which means that it will |
| 84 | +sample float values in all cases. If integer values are required, the sampled float values |
| 85 | +must be rounded to integers in a later step, outside of CTGAN. |
| 86 | + |
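The rounding step mentioned in the note could look like the sketch below. The `samples` frame is made-up stand-in data rather than real CTGAN output, and `integer_columns` is a hypothetical list that you would maintain yourself.

```python
import pandas as pd

# Stand-in for sampled output: every continuous column comes back as floats.
samples = pd.DataFrame({
    'A': [0.12, -1.31, 0.29],
    'B': [99.7, 28.2, 13.6],
    'C': ['a', 'b', 'a'],
})

# Hypothetical list of the continuous columns that should hold integers.
integer_columns = ['B']

rounded = samples.copy()
for column in integer_columns:
    # Round to the nearest whole number, then cast to an integer dtype.
    rounded[column] = rounded[column].round().astype('int64')

print(rounded)
```

Working on a copy keeps the raw sampled values available in case a different rounding rule is needed later.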
| 87 | +# Python Quickstart |
| 88 | + |
| 89 | +In this short tutorial we will guide you through a series of steps that will help you |
| 90 | +get started with **CTGAN**. |
| 91 | + |
| 92 | +## 1. Model the data |
| 93 | + |
| 94 | +### Step 1: Prepare your data |
| 95 | + |
| 96 | +Before you can use CTGAN, you will need to prepare your data as specified above. |
| 97 | + |
| 98 | +For this example, we will be loading some data using the `ctgan.load_demo` function. |
| 99 | + |
| 100 | +```python |
| 101 | +from ctgan import load_demo |
| 102 | + |
| 103 | +data = load_demo() |
| 104 | +``` |
| 105 | + |
| 106 | +This will download a copy of the [Adult Census Dataset](https://archive.ics.uci.edu/ml/datasets/adult) as a dataframe: |
| 107 | + |
| 108 | +| age | workclass | fnlwgt | ... | hours-per-week | native-country | income | |
| 109 | +|-------|------------------|----------|-----|------------------|------------------|----------| |
| 110 | +| 39 | State-gov | 77516 | ... | 40 | United-States | <=50K | |
| 111 | +| 50 | Self-emp-not-inc | 83311 | ... | 13 | United-States | <=50K | |
| 112 | +| 38 | Private | 215646 | ... | 40 | United-States | <=50K | |
| 113 | +| 53 | Private | 234721 | ... | 40 | United-States | <=50K | |
| 114 | +| 28 | Private | 338409 | ... | 40 | Cuba | <=50K | |
| 115 | +| ... | ... | ... | ... | ... | ... | ... | |
| 116 | + |
| 117 | + |
| 118 | +Aside from the table itself, you will need to create a list with the names of the discrete |
| 119 | +variables. |
| 120 | + |
| 121 | +For this example: |
| 122 | + |
| 123 | +```python |
| 124 | +discrete_columns = [ |
| 125 | + 'workclass', |
| 126 | + 'education', |
| 127 | + 'marital-status', |
| 128 | + 'occupation', |
| 129 | + 'relationship', |
| 130 | + 'race', |
| 131 | + 'sex', |
| 132 | + 'native-country', |
| 133 | + 'income' |
| 134 | +] |
| 135 | +``` |
| 136 | + |
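Because the discrete columns are given by name, a quick check that every name actually exists in the table can catch typos before a long training run. The helper below is a hypothetical convenience function, not part of CTGAN, and the column list is only the subset of Adult Census columns visible in the table above.

```python
def find_missing_columns(table_columns, discrete_columns):
    """Return the discrete column names that are absent from the table."""
    available = set(table_columns)
    return [name for name in discrete_columns if name not in available]


# Illustrative subset of column names, taken from the preview table.
table_columns = ['age', 'workclass', 'fnlwgt', 'hours-per-week', 'native-country', 'income']

# 'ocupation' is a deliberate typo to show what the check reports.
missing = find_missing_columns(table_columns, ['workclass', 'ocupation', 'income'])
print(missing)
```

With a real dataframe, `data.columns` can be passed as `table_columns`.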
| 137 | +### Step 2: Fit CTGAN to your data |
| 138 | + |
| 139 | +Once you have the data ready, you need to import and create an instance of the `CTGANSynthesizer` |
| 140 | +class and fit it, passing your data and the list of discrete columns. |
| 141 | + |
| 142 | +```python |
| 143 | +from ctgan import CTGANSynthesizer |
| 144 | + |
| 145 | +ctgan = CTGANSynthesizer() |
| 146 | +ctgan.fit(data, discrete_columns) |
| 147 | +``` |
| 148 | + |
| 149 | +This process is likely to take a long time to run. |
| 150 | +To make it shorter or longer, you can control the number of training epochs |
| 151 | +that the model performs by passing the `epochs` argument to the `fit` call: |
| 152 | + |
| 153 | +```python |
| 154 | +ctgan.fit(data, discrete_columns, epochs=5) |
| 155 | +``` |
| 156 | + |
| 157 | +## 2. Generate synthetic data |
| 158 | + |
| 159 | +Once the process has finished, all you need to do is call the `sample` method of your |
| 160 | +`CTGANSynthesizer` instance, indicating the number of rows that you want to generate. |
| 161 | + |
| 162 | +```python |
| 163 | +samples = ctgan.sample(1000) |
| 164 | +``` |
| 165 | + |
| 166 | +The output will be a table with the exact same format as the input and filled with the synthetic |
| 167 | +data generated by the model. |
| 168 | + |
| 169 | +| age | workclass | fnlwgt | ... | hours-per-week | native-country | income | |
| 170 | +|---------|--------------|-----------|-----|------------------|------------------|----------| |
| 171 | +| 26.3191 | Private | 124079 | ... | 40.1557 | United-States | <=50K | |
| 172 | +| 39.8558 | Private | 133996 | ... | 40.2507 | United-States | <=50K | |
| 173 | +| 38.2477 | Self-emp-inc | 135955 | ... | 40.1124 | Ecuador | <=50K | |
| 174 | +| 29.6468 | Private | 3331.86 | ... | 27.012 | United-States | <=50K | |
| 175 | +| 20.9853 | Private | 120637 | ... | 40.0238 | United-States | <=50K | |
| 176 | +| ... | ... | ... | ... | ... | ... | ... | |
| 177 | + |
| 178 | + |
| 179 | +# Join our community |
| 180 | + |
| 181 | +1. If you would like to try more dataset examples, please have a look at the [examples folder]( |
| 182 | +https://github.com/sdv-dev/CTGAN/tree/master/examples) of the repository. Please contact us |
| 183 | +if you have a usage example that you would like to share with the community. |
| 184 | +2. If you want to contribute to the project code, please head to the [Contributing Guide]( |
| 185 | +https://sdv-dev.github.io/CTGAN/contributing.html#get-started) for more details about how to do it. |
| 186 | +3. If you have any questions or feature requests, or if you detect an error, please [open an issue on GitHub]( |
| 187 | +https://github.com/sdv-dev/CTGAN/issues). |
| 188 | +4. Also do not forget to check the [project documentation site](https://sdv-dev.github.io/CTGAN/)! |
| 189 | + |
| 190 | + |
| 191 | +# Citing CTGAN |
| 192 | + |
| 193 | +If you use CTGAN, please cite the following work: |
| 194 | + |
| 195 | +- *Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni.* **Modeling Tabular data using Conditional GAN**. NeurIPS, 2019. |
| 196 | + |
| 197 | +```LaTeX |
| 198 | +@inproceedings{xu2019modeling, |
| 199 | + title={Modeling Tabular data using Conditional GAN}, |
| 200 | + author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan}, |
| 201 | + booktitle={Advances in Neural Information Processing Systems}, |
| 202 | + year={2019} |
| 203 | +} |
| 204 | +``` |
| 205 | + |
| 206 | +# Related Projects |
| 207 | +Please note that these libraries are external contributions and are neither maintained nor supervised by |
| 208 | +the MIT DAI-Lab team. |
| 209 | + |
| 210 | +## R interface for CTGAN |
| 211 | + |
| 212 | +A wrapper around **CTGAN** has been implemented by Kevin Kuo @kevinykuo, bringing the functionality |
| 213 | +of **CTGAN** to **R** users. |
| 214 | + |
| 215 | +More details can be found in the corresponding repository: https://github.com/kasaai/ctgan |
| 216 | + |
| 217 | +## CTGAN Server CLI |
| 218 | + |
| 219 | +A package to easily deploy **CTGAN** onto a remote server. This package is developed by Timothy Pillow @oregonpillow. |
| 220 | + |
| 221 | +More details can be found in the corresponding repository: https://github.com/oregonpillow/ctgan-server-cli |