updated pip_desc.md

Diyago · Diyago · commit 049d962e8be5 · 2023-09-30T22:29:52.000+03:00
diff --git a/pip_desc.md b/pip_desc.md
@@ -1,26 +1,20 @@
 [![CodeFactor](https://www.codefactor.io/repository/github/diyago/gan-for-tabular-data/badge)](https://www.codefactor.io/repository/github/diyago/gan-for-tabular-data)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-# GANs for tabular  data
-<img src="https://raw.githubusercontent.com/Diyago/GAN-for-tabular-data/e5a4d437655261755de962b9779c73203611d921/images/logo%20tabular%20gan.svg" height="15%" width="15%">
+[![Downloads](https://pepy.tech/badge/tabgan)](https://pepy.tech/project/tabgan)
 
-We well know GANs for success in the realistic image generation. However, they can be applied in tabular data generation. We will review and examine some recent papers about tabular GANs in action.
+# GANs and Diffusions for tabular data
+
+<img src="./images/tabular_gan.png" height="15%" width="15%">
+Generative Adversarial Networks (GANs) are well known for their success in realistic image generation. However, they can also be applied to generate tabular data. Here we give you the opportunity to try some of them.
 
-* Github project: ["GAN-for-tabular-data"](https://github.com/Diyago/GAN-for-tabular-data)
 * Arxiv article: ["Tabular GANs for uneven distribution"](https://arxiv.org/abs/2010.00638)
 * Medium post: [GANs for tabular data](https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342)
 
-### Library goal
-
-Let say we have **T_train** and **T_test** (train and test set respectively).
-We need to train the model on **T_train** and make predictions on **T_test**.
-However, we will increase the train by generating new data by GAN,
-somehow similar to **T_test**, without using ground truth labels.
-
-### How to use library
+## How to use the library
 
 * Installation: `pip install tabgan`
-* To generate new data to train by sampling and then filtering by adversarial
-  training call `GANGenerator().generate_data_pipe`:
+* To generate new training data by sampling and then filtering it with adversarial training,
+  call `GANGenerator().generate_data_pipe`:
 
 ``` python
-from tabgan.sampler import OriginalGenerator, GANGenerator
+from tabgan.sampler import OriginalGenerator, GANGenerator, ForestDiffusionGenerator
@@ -34,10 +28,11 @@ test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD
 
 # generate data
 new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test, )
-new_train1, new_target1 = GANGenerator().generate_data_pipe(train, target, test, )
+new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test, )
+new_train3, new_target3 = ForestDiffusionGenerator().generate_data_pipe(train, target, test, )
 
 # example with all params defined
-new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
+new_train4, new_target4 = GANGenerator(gen_x_times=1.1, cat_cols=None,
            bot_filter_quantile=0.001, top_filter_quantile=0.999, is_post_process=True,
            adversarial_model_params={
                "metrics": "AUC", "max_depth": 2, "max_bin": 100, 
@@ -47,21 +42,23 @@ new_train3, new_target3 = GANGenerator(gen_x_times=1.1, cat_cols=None,
                                           test, deep_copy=True, only_adversarial=False, use_adversarial=True)
 ```
 
-Both samplers `OriginalGenerator` and `GANGenerator` have same input parameters:
+All samplers (`OriginalGenerator`, `ForestDiffusionGenerator` and `GANGenerator`) take the same input parameters.
+
+1. **GANGenerator** is based on **CTGAN**
+2. **ForestDiffusionGenerator** is based on **Forest Diffusion**
 
 * **gen_x_times**: float = 1.1 - how much data to generate, output might be less because of postprocessing and
-adversarial filtering
+  adversarial filtering
 * **cat_cols**: list = None - categorical columns
 * **bot_filter_quantile**: float = 0.001 - bottom quantile for postprocess filtering
 * **top_filter_quantile**: float = 0.999 - top quantile for postprocess filtering
-* **is_post_process**: bool = True - perform or not postfiltering, if false bot_filter_quantile
- and top_filter_quantile ignored
+* **is_post_process**: bool = True - whether to perform post-filtering; if False, bot_filter_quantile and
+  top_filter_quantile are ignored
 * **adversarial_model_params**: dict params for adversarial filtering model, default values for binary task
-* **pregeneration_frac**: float = 2 - for generataion step gen_x_times * pregeneration_frac amount of data
-will be generated. However, in postprocessing (1 + gen_x_times) % of original data will be returned
+* **pregeneration_frac**: float = 2 - at the generation step, gen_x_times * pregeneration_frac amount of data will be
+  generated. However, in postprocessing (1 + gen_x_times) % of original data will be returned
 * **gen_params**: dict params for GAN training
 
-
 For `generate_data_pipe` methods params:
 
 * **train_df**: pd.DataFrame Train dataframe which has separate target
@@ -72,8 +69,7 @@ For `generate_data_pipe` methods params:
 * **use_adversarial**: bool = True - perform or not adversarial filtering
 * **only_generated_data**: bool = False  - After generation get only newly generated, without 
   concating input train dataframe.  
-
-* **@return**: -> Tuple[pd.DataFrame, pd.DataFrame] -  Newly generated train dataframe and test data
+* **@return**: -> Tuple[pd.DataFrame, pd.DataFrame] - Newly generated train dataframe and its target
 
 Thus, you may use this library to improve your dataset quality:
 
@@ -83,20 +79,19 @@ def fit_predict(clf, X_train, y_train, X_test, y_test):
     return sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
 
 
-if __name__ == "__main__":
-    dataset = sklearn.datasets.load_breast_cancer()
-    clf = sklearn.ensemble.RandomForestClassifier(n_estimators=25, max_depth=6)
-    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
-        pd.DataFrame(dataset.data), pd.DataFrame(dataset.target, columns=["target"]), test_size=0.33, random_state=42)
-    print("initial metric", fit_predict(clf, X_train, y_train, X_test, y_test))
 
-    new_train1, new_target1 = OriginalGenerator().generate_data_pipe(X_train, y_train, X_test, )
-    print("OriginalGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
+dataset = sklearn.datasets.load_breast_cancer()
+clf = sklearn.ensemble.RandomForestClassifier(n_estimators=25, max_depth=6)
+X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
+    pd.DataFrame(dataset.data), pd.DataFrame(dataset.target, columns=["target"]), test_size=0.33, random_state=42)
+print("initial metric", fit_predict(clf, X_train, y_train, X_test, y_test))
 
-    new_train1, new_target1 = GANGenerator().generate_data_pipe(X_train, y_train, X_test, )
-    print("GANGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
-```
+new_train1, new_target1 = OriginalGenerator().generate_data_pipe(X_train, y_train, X_test, )
+print("OriginalGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
 
+new_train1, new_target1 = GANGenerator().generate_data_pipe(X_train, y_train, X_test, )
+print("GANGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
+```
 ## Timeseries GAN generation TimeGAN
 
 You can easily adjust code to generate multidimensional timeseries data.
@@ -134,37 +129,60 @@ new_train = collect_dates(new_train)
 **Running experiment**
 
 To run experiment follow these steps:
+
 1. Clone the repository. All required dataset are stored in `./Research/data` folder
 2. Install requirements `pip install -r requirements.txt`
-4. Run all experiments  `python ./Research/run_experiment.py`. Run all experiments  `python run_experiment.py`. You may add more datasets, adjust validation type and categorical encoders.
-5. Observe metrics across all experiment in console or
-   in `./Research/results/fit_predict_scores.txt`
-
+3. Run all experiments: `python ./Research/run_experiment.py`. You may add more datasets, adjust the validation type
+   and categorical encoders.
+4. Observe metrics across all experiments in the console or in `./Research/results/fit_predict_scores.txt`
 
 
-## Acknowledgments
+**Experiment design**
 
-The author would like to thank Open Data Science community [7] for many
-valuable discussions and educational help in the growing field of machine and
-deep learning. Also, special big thanks to Sber [8] for allowing solving
-such tasks and providing computational resources.
+![Experiment design and workflow](./images/workflow.png?raw=true)
 
-## References
+**Picture 1.1** Experiment design and workflow
 
-[1] Jonathan Hui. GAN — What is Generative Adversarial Networks GAN? (2018), medium article
+## Results
+To determine the best sampling strategy, the ROC AUC scores of each dataset were min-max scaled and then averaged
+across the datasets.
 
-[2]Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661
+**Table 1.2** Sampling results across the datasets; higher is better (100% is the maximum per-dataset ROC AUC)
 
-[3] Lei Xu LIDS, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:1811.11264v1 [cs.LG]
+| dataset_name  |   None |   gan |   sample_original |
+|:-----------------------|-------------------:|------------------:|------------------------------:|
+| credit                 |           0.997 |          **0.998** |                      0.997 |
+| employee               |           **0.986** |          0.966 |                      0.972 |
+| mortgages              |           0.984 |          0.964 |                      **0.988** |
+| poverty_A              |           0.937 |          **0.950** |                      0.933 |
+| taxi                   |           0.966 |          0.938 |                      **0.987** |
+| adult                  |           0.995 |          0.967 |                      **0.998** |
 
-[4] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN (2019). arXiv:1907.00503v2 [cs.LG]
+## Acknowledgments
 
-[5] Denis Vorotyntsev. Benchmarking Categorical Encoders (2019). Medium post
+The author would like to thank the Open Data Science community (https://ods.ai/) for many valuable discussions and
+educational help in the growing field of machine and deep learning.
+
+## Citation
+
+If you use **GAN-for-tabular-data** in a scientific publication, we would appreciate a reference to the following
+BibTeX entry for the arXiv publication:
+```bibtex
+@misc{ashrapov2020tabular,
+      title={Tabular GANs for uneven distribution}, 
+      author={Insaf Ashrapov},
+      year={2020},
+      eprint={2010.00638},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG}
+}
+```
 
-[6] Insaf Ashrapov. GAN-for-tabular-data (2020). Github repository.
+## References
 
-[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the Image Quality of StyleGAN (2019) arXiv:1912.04958v2 [cs.CV]
+[1] Lei Xu, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018).
+arXiv:1811.11264v1 [cs.LG]
 
-[8]  ODS.ai: Open data science (2020), https://ods.ai/
+[2] Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman. Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees (2023). https://github.com/SamsungSAILMontreal/ForestDiffusion [cs.LG]
 
-[9]  Sber (2020), https://www.sberbank.ru/
+[3] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN (2019). NeurIPS
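
The `bot_filter_quantile` and `top_filter_quantile` parameters documented in this README describe a quantile-based post-processing filter. The snippet below is a minimal sketch of that idea, not tabgan's actual implementation: it assumes the filter keeps generated rows whose numeric values fall inside the quantile range of a reference frame (here, the test set). The helper name `quantile_filter` and the random data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def quantile_filter(generated, reference, bot_q=0.001, top_q=0.999):
    """Keep generated rows whose numeric values lie inside the
    [bot_q, top_q] quantile range of the reference frame.

    Hypothetical helper for illustration; not part of tabgan.
    """
    keep = pd.Series(True, index=generated.index)
    for col in reference.select_dtypes(include="number").columns:
        low = reference[col].quantile(bot_q)
        high = reference[col].quantile(top_q)
        # between() is inclusive on both ends by default
        keep &= generated[col].between(low, high)
    return generated[keep].reset_index(drop=True)


rng = np.random.default_rng(0)
# Reference data in [0, 100), generated data deliberately wider
reference = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))
generated = pd.DataFrame(rng.integers(-50, 150, size=(500, 4)), columns=list("ABCD"))

filtered = quantile_filter(generated, reference)
print(len(generated), "rows generated,", len(filtered), "rows kept")
```

Under this assumption, a filter like this can only shrink the generated set, which is consistent with the note on `gen_x_times` that the output may contain less data than requested after postprocessing.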