Diyago
diff --git a/‎Research/.deepsource.toml‎
Lines changed: 8 additions & 0 deletions b/‎Research/.deepsource.toml‎
diff --git a/‎Research/ctgan/README.MD‎
Lines changed: 221 additions & 0 deletions b/‎Research/ctgan/README.MD‎
diff --git a/‎Research/ctgan/__init__.py‎
Lines changed: 15 additions & 0 deletions b/‎Research/ctgan/__init__.py‎
diff --git a/‎Research/ctgan/__main__.py‎
Lines changed: 46 additions & 0 deletions b/‎Research/ctgan/__main__.py‎
diff --git a/‎Research/ctgan/conditional.py‎
Lines changed: 98 additions & 0 deletions b/‎Research/ctgan/conditional.py‎
@@ -0,0 +1,8 @@
+version = 1
+
+[[analyzers]]
+name = "python"
+enabled = true
+
+  [analyzers.meta]
+  runtime_version = "3.x.x"
@@ -0,0 +1,221 @@
+REFERENCE (initial code): https://github.com/sdv-dev/CTGAN
+
+<p align="left">
+<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="sdv-dev" />
+<i>An open source project from Data to AI Lab at MIT.</i>
+</p>
+
+[![Development Status](https://img.shields.io/badge/Development%20Status-2%20--%20Pre--Alpha-yellow)](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
+[![PyPI Shield](https://img.shields.io/pypi/v/ctgan.svg)](https://pypi.python.org/pypi/ctgan)
+[![Travis CI Shield](https://travis-ci.org/sdv-dev/CTGAN.svg?branch=master)](https://travis-ci.org/sdv-dev/CTGAN)
+[![Downloads](https://pepy.tech/badge/ctgan)](https://pepy.tech/project/ctgan)
+[![Coverage Status](https://codecov.io/gh/sdv-dev/CTGAN/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/CTGAN)
+
+# CTGAN
+
+Implementation of our NeurIPS paper [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503).
+
+CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity.
+
+* License: [MIT](https://github.com/sdv-dev/CTGAN/blob/master/LICENSE)
+* Development Status: [Pre-Alpha](https://pypi.org/search/?c=Development+Status+%3A%3A+2+-+Pre-Alpha)
+* Documentation: https://sdv-dev.github.io/CTGAN
+* Homepage: https://github.com/sdv-dev/CTGAN
+
+## Overview
+
+Based on previous work ([TGAN](https://github.com/sdv-dev/TGAN)) on synthetic data generation,
+we develop a new model called CTGAN. Several major differences make CTGAN outperform TGAN.
+
+- **Preprocessing**: CTGAN uses a more sophisticated Variational Gaussian Mixture Model to detect
+  modes of continuous columns.
+- **Network structure**: TGAN uses an LSTM to generate synthetic data column by column. CTGAN uses
+  fully-connected networks, which are more efficient.
+- **Features to prevent mode collapse**: We design a conditional generator and resample the
+  training data to prevent mode collapse on discrete columns. We use WGAN-GP and PacGAN to
+  stabilize the training of the GAN.
+
+
+# Install
+
+## Requirements
+
+**CTGAN** has been developed and tested on [Python 3.5, 3.6 and 3.7](https://www.python.org/downloads/)
+
+## Install from PyPI
+
+The recommended way to install **CTGAN** is using [pip](https://pip.pypa.io/en/stable/):
+
+```bash
+pip install ctgan
+```
+
+This will pull and install the latest stable release from [PyPI](https://pypi.org/).
+
+If you want to install from source or contribute to the project, please read the
+[Contributing Guide](https://sdv-dev.github.io/CTGAN/contributing.html#get-started).
+
+# Data Format
+
+**CTGAN** expects the input data to be a table given as either a `numpy.ndarray` or a
+`pandas.DataFrame` object with two types of columns:
+
+* **Continuous columns**: Columns that contain numerical values and can take any value.
+* **Discrete columns**: Columns that contain only a finite number of possible values, whether
+these are string values or not.
+
+This is an example of a table with 4 columns:
+
+* A continuous column with float values
+* A continuous column with integer values
+* A discrete column with string values
+* A discrete column with integer values
+
+|   | A    | B   | C   | D |
+|---|------|-----|-----|---|
+| 0 | 0.1  | 100 | 'a' | 1 |
+| 1 | -1.3 | 28  | 'b' | 2 |
+| 2 | 0.3  | 14  | 'a' | 2 |
+| 3 | 1.4  | 87  | 'a' | 3 |
+| 4 | -0.1 | 69  | 'b' | 2 |
+
+
+**NOTE**: CTGAN does not distinguish between float and integer columns, which means that it will
+sample float values in all cases. If integer values are required, the sampled float values
+must be rounded to integers in a later step, outside of CTGAN.
+
+# Python Quickstart
+
+In this short tutorial we will guide you through a series of steps that will help you
+get started with **CTGAN**.
+
+## 1. Model the data
+
+### Step 1: Prepare your data
+
+Before being able to use CTGAN you will need to prepare your data as specified above.
+
+For this example, we will be loading some data using the `ctgan.load_demo` function.
+
+```python
+from ctgan import load_demo
+
+data = load_demo()
+```
+
+This will download a copy of the [Adult Census Dataset](https://archive.ics.uci.edu/ml/datasets/adult) as a dataframe:
+
+|   age | workclass        |   fnlwgt | ... |   hours-per-week | native-country   | income   |
+|-------|------------------|----------|-----|------------------|------------------|----------|
+|    39 | State-gov        |    77516 | ... |               40 | United-States    | <=50K    |
+|    50 | Self-emp-not-inc |    83311 | ... |               13 | United-States    | <=50K    |
+|    38 | Private          |   215646 | ... |               40 | United-States    | <=50K    |
+|    53 | Private          |   234721 | ... |               40 | United-States    | <=50K    |
+|    28 | Private          |   338409 | ... |               40 | Cuba             | <=50K    |
+|   ... | ...              |      ... | ... |              ... | ...              | ...      |
+
+
+Aside from the table itself, you will need to create a list with the names of the discrete
+variables.
+
+For this example:
+
+```python
+discrete_columns = [
+    'workclass',
+    'education',
+    'marital-status',
+    'occupation',
+    'relationship',
+    'race',
+    'sex',
+    'native-country',
+    'income'
+]
+```
+
+### Step 2: Fit CTGAN to your data
+
+Once you have the data ready, import the `CTGANSynthesizer` class, create an instance of it,
+and fit it by passing your data and the list of discrete columns.
+
+```python
+from ctgan import CTGANSynthesizer
+
+ctgan = CTGANSynthesizer()
+ctgan.fit(data, discrete_columns)
+```
+
+This process is likely to take a long time to run.
+To make training shorter or longer, set the number of training epochs by passing `epochs`
+to the `fit` call:
+
+```python
+ctgan.fit(data, discrete_columns, epochs=5)
+```
+
+## 2. Generate synthetic data
+
+Once the process has finished, all you need to do is call the `sample` method of your
+`CTGANSynthesizer` instance indicating the number of rows that you want to generate.
+
+```python
+samples = ctgan.sample(1000)
+```
+
+The output will be a table with the same columns as the input, filled with the synthetic
+data generated by the model. Note that integer columns come back as floats, as explained above.
+
+|     age | workclass    |    fnlwgt | ... |   hours-per-week | native-country   | income   |
+|---------|--------------|-----------|-----|------------------|------------------|----------|
+| 26.3191 | Private      | 124079    | ... |          40.1557 | United-States    | <=50K    |
+| 39.8558 | Private      | 133996    | ... |          40.2507 | United-States    | <=50K    |
+| 38.2477 | Self-emp-inc | 135955    | ... |          40.1124 | Ecuador          | <=50K    |
+| 29.6468 | Private      |   3331.86 | ... |          27.012  | United-States    | <=50K    |
+| 20.9853 | Private      | 120637    | ... |          40.0238 | United-States    | <=50K    |
+|     ... | ...          |       ... | ... |              ... | ...              | ...      |
+
+
+# Join our community
+
+1. If you would like to try more dataset examples, please have a look at the [examples folder](
+https://github.com/sdv-dev/CTGAN/tree/master/examples) of the repository. Please contact us
+if you have a usage example that you would want to share with the community.
+2. If you want to contribute to the project code, please head to the [Contributing Guide](
+https://sdv-dev.github.io/CTGAN/contributing.html#get-started) for more details about how to do it.
+3. If you have questions or feature requests, or if you find a bug, please [open an issue on GitHub](
+https://github.com/sdv-dev/CTGAN/issues).
+4. Also do not forget to check the [project documentation site](https://sdv-dev.github.io/CTGAN/)!
+
+
+# Citing CTGAN
+
+If you use CTGAN, please cite the following work:
+
+- *Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni.* **Modeling Tabular data using Conditional GAN**. NeurIPS, 2019.
+
+```LaTeX
+@inproceedings{xu2019modeling,
+  title={Modeling Tabular data using Conditional GAN},
+  author={Xu, Lei and Skoularidou, Maria and Cuesta-Infante, Alfredo and Veeramachaneni, Kalyan},
+  booktitle={Advances in Neural Information Processing Systems},
+  year={2019}
+}
+```
+
+# Related Projects
+Please note that these libraries are external contributions and are neither maintained nor
+supervised by the MIT DAI-Lab team.
+
+## R interface for CTGAN
+
+A wrapper around **CTGAN** has been implemented by Kevin Kuo @kevinykuo, bringing the functionality
+of **CTGAN** to **R** users.
+
+More details can be found in the corresponding repository: https://github.com/kasaai/ctgan
+
+## CTGAN Server CLI
+
+A package to easily deploy **CTGAN** onto a remote server. This package is developed by Timothy Pillow @oregonpillow.
+
+More details can be found in the corresponding repository: https://github.com/oregonpillow/ctgan-server-cli
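The NOTE in the README's Data Format section says that integer columns are sampled as floats and must be rounded outside of CTGAN. A minimal sketch of that post-processing step follows. It assumes the sampled table has been converted to a list of dicts; `sampled_rows`, `integer_columns`, and `round_integer_columns` are hypothetical names used only for illustration and are not part of the `ctgan` package.

```python
# Hypothetical stand-in for rows returned by `ctgan.sample(...)`.
# The values are copied from the sample output table in the README.
sampled_rows = [
    {"age": 26.3191, "workclass": "Private", "hours-per-week": 40.1557},
    {"age": 29.6468, "workclass": "Private", "hours-per-week": 27.012},
]

# Columns that held integers in the training data (chosen by the user).
integer_columns = ["age", "hours-per-week"]


def round_integer_columns(rows, columns):
    """Return copies of the rows with the given columns rounded to int."""
    rounded = []
    for row in rows:
        new_row = dict(row)  # copy so the sampled rows stay untouched
        for name in columns:
            new_row[name] = int(round(new_row[name]))
        rounded.append(new_row)
    return rounded


clean_rows = round_integer_columns(sampled_rows, integer_columns)
print(clean_rows)
```

When the samples are kept in a `pandas.DataFrame`, rounding the selected columns and casting them to an integer dtype would play the same role.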
@@ -0,0 +1,15 @@
+# -*- coding: utf-8 -*-
+
+"""Top-level package for ctgan."""
+
+__author__ = 'MIT Data To AI Lab'
+__email__ = '[email protected]'
+__version__ = '0.2.1'
+
+from ctgan.demo import load_demo
+from ctgan.synthesizer import CTGANSynthesizer
+
+__all__ = (
+    'CTGANSynthesizer',
+    'load_demo'
+)
@@ -0,0 +1,46 @@
+import argparse
+
+from ctgan.data import read_csv, read_tsv, write_tsv
+from ctgan.synthesizer import CTGANSynthesizer
+
+
+def _parse_args():
+    parser = argparse.ArgumentParser(description='CTGAN Command Line Interface')
+    parser.add_argument('-e', '--epochs', default=300, type=int,
+                        help='Number of training epochs')
+    parser.add_argument('-t', '--tsv', action='store_true',
+                        help='Load data in TSV format instead of CSV')
+    parser.add_argument('--no-header', dest='header', action='store_false',
+                        help='The CSV file has no header. Discrete columns will be indices.')
+
+    parser.add_argument('-m', '--metadata', help='Path to the metadata')
+    parser.add_argument('-d', '--discrete',
+                        help='Comma separated list of discrete columns, no whitespaces')
+
+    parser.add_argument('-n', '--num-samples', type=int,
+                        help='Number of rows to sample. Defaults to the training data size')
+
+    parser.add_argument('data', help='Path to training data')
+    parser.add_argument('output', help='Path of the output file')
+
+    return parser.parse_args()
+
+
+def main():
+    args = _parse_args()
+
+    if args.tsv:
+        data, discrete_columns = read_tsv(args.data, args.metadata)
+    else:
+        data, discrete_columns = read_csv(args.data, args.metadata, args.header, args.discrete)
+
+    model = CTGANSynthesizer()
+    model.fit(data, discrete_columns, args.epochs)
+
+    num_samples = args.num_samples or len(data)
+    sampled = model.sample(num_samples)
+
+    if args.tsv:
+        write_tsv(sampled, args.metadata, args.output)
+    else:
+        sampled.to_csv(args.output, index=False)
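Two details of the CLI above are easy to miss: `--no-header` uses `action='store_false'` with `dest='header'`, so `args.header` defaults to `True`, and `args.num_samples or len(data)` falls back to the training size whenever `-n` is omitted. The sketch below reproduces only those flags in a standalone parser to show the behaviour. The file names and `training_rows` are hypothetical placeholders, and the `ctgan.data` readers are deliberately left out.

```python
import argparse


def build_parser():
    """Standalone copy of a few flags from the CLI, for illustration only."""
    parser = argparse.ArgumentParser(description="CTGAN CLI flag sketch")
    parser.add_argument("-e", "--epochs", default=300, type=int)
    # store_false means the attribute is True unless the flag is given.
    parser.add_argument("--no-header", dest="header", action="store_false")
    parser.add_argument("-n", "--num-samples", type=int)
    parser.add_argument("data")
    parser.add_argument("output")
    return parser


parser = build_parser()

# Hypothetical file names; nothing is read from or written to disk here.
defaults = parser.parse_args(["train.csv", "synthetic.csv"])
custom = parser.parse_args(
    ["-e", "5", "--no-header", "-n", "1000", "train.csv", "synthetic.csv"]
)
zero = parser.parse_args(["-n", "0", "train.csv", "synthetic.csv"])

training_rows = 500  # hypothetical stand-in for len(data)

rows_default = defaults.num_samples or training_rows
rows_custom = custom.num_samples or training_rows
rows_zero = zero.num_samples or training_rows

print(defaults.header, defaults.epochs, rows_default)
print(custom.header, custom.epochs, rows_custom)
print(rows_zero)
```

One consequence of the `or` fallback is that `-n 0` cannot request an empty sample: `0` is falsy, so the training size is used instead.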
@@ -0,0 +1,98 @@
+import numpy as np
+
+
+class ConditionalGenerator(object):
+    def __init__(self, data, output_info, log_frequency):
+        self.model = []
+
+        start = 0
+        skip = False
+        max_interval = 0
+        counter = 0
+        for item in output_info:
+            if item[1] == 'tanh':
+                start += item[0]
+                skip = True
+                continue
+
+            elif item[1] == 'softmax':
+                if skip:
+                    skip = False
+                    start += item[0]
+                    continue
+
+                end = start + item[0]
+                max_interval = max(max_interval, end - start)
+                counter += 1
+                self.model.append(np.argmax(data[:, start:end], axis=-1))
+                start = end
+
+            else:
+                assert 0
+
+        assert start == data.shape[1]
+
+        self.interval = []
+        self.n_col = 0
+        self.n_opt = 0
+        skip = False
+        start = 0
+        self.p = np.zeros((counter, max_interval))
+        for item in output_info:
+            if item[1] == 'tanh':
+                skip = True
+                start += item[0]
+                continue
+            elif item[1] == 'softmax':
+                if skip:
+                    start += item[0]
+                    skip = False
+                    continue
+                end = start + item[0]
+                tmp = np.sum(data[:, start:end], axis=0)
+                if log_frequency:
+                    tmp = np.log(tmp + 1)
+                tmp = tmp / np.sum(tmp)
+                self.p[self.n_col, :item[0]] = tmp
+                self.interval.append((self.n_opt, item[0]))
+                self.n_opt += item[0]
+                self.n_col += 1
+                start = end
+            else:
+                assert 0
+
+        self.interval = np.asarray(self.interval)
+
+    def random_choice_prob_index(self, idx):
+        a = self.p[idx]
+        r = np.expand_dims(np.random.rand(a.shape[0]), axis=1)
+        return (a.cumsum(axis=1) > r).argmax(axis=1)
+
+    def sample(self, batch):
+        if self.n_col == 0:
+            return None
+
+        # Pick one discrete column per row, uniformly at random.
+        idx = np.random.choice(np.arange(self.n_col), batch)
+
+        vec1 = np.zeros((batch, self.n_opt), dtype='float32')
+        mask1 = np.zeros((batch, self.n_col), dtype='float32')
+        mask1[np.arange(batch), idx] = 1
+        opt1prime = self.random_choice_prob_index(idx)
+        opt1 = self.interval[idx, 0] + opt1prime
+        vec1[np.arange(batch), opt1] = 1
+
+        return vec1, mask1, idx, opt1prime
+
+    def sample_zero(self, batch):
+        if self.n_col == 0:
+            return None
+
+        vec = np.zeros((batch, self.n_opt), dtype='float32')
+        idx = np.random.choice(np.arange(self.n_col), batch)
+        for i in range(batch):
+            col = idx[i]
+            pick = int(np.random.choice(self.model[col]))
+            vec[i, pick + self.interval[col, 0]] = 1
+
+        return vec
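`ConditionalGenerator` does two things per discrete column: it turns category counts into sampling probabilities (optionally through `log(count + 1)` when `log_frequency` is set), and `random_choice_prob_index` picks a category by finding the first position where the cumulative sum exceeds a uniform random number. The sketch below is a pure-Python analogue of those two steps for a single column, not the NumPy code from the diff. `category_probs` and `sample_index` are hypothetical names, and the counts are made up.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def category_probs(counts, log_frequency=True):
    """Normalise category counts, optionally damping them with log(count + 1)."""
    if log_frequency:
        weights = [math.log(c + 1) for c in counts]
    else:
        weights = [float(c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, r):
    """Return the first index whose cumulative probability exceeds r.

    `r` is a number in [0, 1). The result is clamped to the last index so
    that floating-point rounding in the cumulative sum cannot overflow;
    that clamp is a choice made for this sketch.
    """
    cdf = list(accumulate(probs))
    return min(bisect_right(cdf, r), len(probs) - 1)


counts = [900, 90, 10]  # made-up counts for a three-category column
raw_probs = category_probs(counts, log_frequency=False)
log_probs = category_probs(counts, log_frequency=True)

print(raw_probs)
print(log_probs)
print(sample_index(raw_probs, 0.95))
```

The log transform raises the share of rare categories, which is why the rarest category here receives more probability mass under `log_frequency=True` than under raw frequencies.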