Using DoWhy for categorical network causal effect estimation

**Question**
I want to use DoWhy to learn the parameters of a predefined network structure (B -> A, A -> T, B -> T) from a simulated dataset.
Once the parameters are learned, I want to intervene on variable A (a do-operation, i.e. a hard intervention) and see how the posterior of T changes.
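
To make the target quantity explicit: in this graph B is the only common cause of A and T, so, assuming the graph is correct, the interventional distribution I am after should be given by the standard adjustment over B:

$$
P(T = t \mid do(A = a)) = \sum_{b} P(T = t \mid A = a, B = b)\, P(B = b)
$$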

I found this very straightforward in pgmpy and SMILE, but I am not sure how to do it in DoWhy.


My sample code is as follows:

```python
# %%
import pandas as pd
import numpy as np
import os
import dowhy.gcm as gcm
import networkx as nx


# %%
output_folder = "./simulated_data"
os.makedirs(output_folder, exist_ok=True)

simulated_dataset_path = "./simulated_data/simulated_discrete_dataset.csv"

# %%
def simulate_bayesian_network():
    # Set random seed for reproducibility
    np.random.seed(1)

    N = 50000

    # Step 1: B ~ Bernoulli(0.5)
    B = np.random.binomial(1, 0.5, size=(N,))

    # Step 2: A depends on B
    A = np.zeros(N, dtype=int)
    A[B == 1] = np.random.binomial(1, 0.9, size=(np.sum(B == 1),))
    A[B == 0] = np.random.binomial(1, 0.1, size=(np.sum(B == 0),))

    # Step 3: T depends on A and B
    T = np.zeros(N, dtype=int)
    T[(A == 1) & (B == 1)] = np.random.binomial(1, 0.9, size=np.sum((A == 1) & (B == 1)))
    T[(A == 1) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 1) & (B == 0)))
    T[(A == 0) & (B == 1)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 1)))
    T[(A == 0) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 0)))
    
    data = {}
    data['A'] = A
    data['B'] = B
    data['T'] = T

    return pd.DataFrame(data)

# Simulate and preview the dataset
data = simulate_bayesian_network()
print(data.head())

# %%
# gcm stands for "Graphical Causal Models"

causal_model = gcm.StructuralCausalModel(nx.DiGraph([('B', 'A'), ('B', 'T'), ('A', 'T')]))


# %%

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data)


# %%
print(auto_assignment_summary)

# %% [markdown]
# ## Comments
# 
# The auto-detected causal mechanisms are not right.
# 
# The source for all available causal mechanisms is here:
# https://github.com/py-why/dowhy/blob/main/dowhy/gcm/causal_mechanisms.py
# 
# 

# %%
# causal_model.set_causal_mechanism('B', gcm.EmpiricalDistribution())
# causal_model.set_causal_mechanism('A', gcm.classification.LogisticRegressionClassifier())
# causal_model.set_causal_mechanism('T', gcm.classification.LogisticRegressionClassifier())

# %%

gcm.fit(causal_model, data)

# %%
gcm.average_causal_effect(causal_model,
                         'T',
                         interventions_alternative={'A': lambda x: 1},
                         interventions_reference={'A': lambda x: 0},
                         num_samples_to_draw=100000)

# %%
def print_out_posterior(df, variable_n):
    posterior_counts = df[variable_n].value_counts(normalize=True)
    print(posterior_counts)
    

# %%
samples = gcm.interventional_samples(causal_model,
                                     {'A': lambda x: 1},
                                     num_samples_to_draw=1000)
samples.head()

print_out_posterior(samples, "T")

# %%
mean_T = samples['T'].mean()
print(f"The mean of column 'T' is: {mean_T}")

# %%
samples_2 = gcm.interventional_samples(causal_model,
                                     {'A': lambda x: 0},
                                     num_samples_to_draw=1000)
samples_2.head()

# %%
mean_T_2 = samples_2['T'].mean()
print(f"The mean of column 'T' is: {mean_T_2}")

# %%
print_out_posterior(samples_2, "T")

```


In pgmpy and SMILE, there is a very explicit way to define a categorical network: you pass the network structure to the package, and it learns the conditional probability tables (CPTs) from the data.
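
To illustrate what I mean by learning the conditional probability tables, here is a minimal sketch that estimates them by simple counting with pandas. It does not use pgmpy, SMILE, or DoWhy; the variable names (`df`, `cpt_A`, `cpt_T`) are just for illustration, and the data is re-simulated with the same probabilities as in my code above.

```python
import numpy as np
import pandas as pd

# Re-simulate the same network (B -> A, {A, B} -> T) with the same probabilities.
rng = np.random.default_rng(1)
n = 50_000
B = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, np.where(B == 1, 0.9, 0.1))
T = rng.binomial(1, np.where((A == 1) & (B == 1), 0.9, 0.1))
df = pd.DataFrame({"A": A, "B": B, "T": T})

# Maximum-likelihood CPTs: relative frequencies within each parent configuration.
# Each row is one parent configuration and sums to 1.
cpt_A = pd.crosstab(df["B"], df["A"], normalize="index")               # P(A | B)
cpt_T = pd.crosstab([df["A"], df["B"]], df["T"], normalize="index")    # P(T | A, B)

print(cpt_A)
print(cpt_T)
```

This is the kind of table I would like DoWhy to learn for nodes A and T.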

(1)
However, in DoWhy, `auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data)` does not work as expected, and I cannot find a good way to define a categorical network using `causal_model.set_causal_mechanism`.

In `auto_assignment_summary`, node A is assigned "Discrete AdditiveNoiseModel using LinearRegression" and node T is assigned "Discrete AdditiveNoiseModel using Pipeline".

Could you please help me explicitly define a categorical network?

I want to explore DoWhy because my real-world project contains (1) categorical variables that depend on continuous variables, (2) continuous variables that depend on categorical variables, (3) continuous variables that depend on continuous variables, and (4) categorical variables that depend on categorical variables. DoWhy appears to handle all of these cases (other packages have limitations) and supports hard interventions on a variable to estimate the effect on a target variable.


(2) 
Is it OK to use `gcm.interventional_samples` to perform a hard intervention, sample data, and then estimate the posterior probability of the target variable `T`? Is my sample code doing this the right way?
With the current setup, I get:

```
T 
0 0.482 
1 0.479 
2 0.021 
-1 0.018
```

It is surprising that T contains the values 2 and -1. In my input data, T only takes the two values 0 and 1, since it is a categorical variable.
Why does it behave this way, and what can I do to fix it?

**Expected behavior**
The posterior of T should only contain the values 0 and 1.
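
For reference, the interventional probabilities implied by the simulation parameters can be worked out by hand. The sketch below is plain arithmetic on the probabilities hard-coded in `simulate_bayesian_network` (adjusting over B); it makes no DoWhy calls, and the names `p_b`, `p_t1_given`, and `p_t1_do` are only illustrative.

```python
# Ground-truth parameters taken from the simulation code above.
p_b = {0: 0.5, 1: 0.5}                       # P(B = b)
p_t1_given = {                               # P(T = 1 | A = a, B = b), keyed by (a, b)
    (1, 1): 0.9,
    (1, 0): 0.1,
    (0, 1): 0.1,
    (0, 0): 0.1,
}


def p_t1_do(a):
    """P(T = 1 | do(A = a)), obtained by averaging over the marginal of B."""
    return sum(p_t1_given[(a, b)] * p_b[b] for b in p_b)


ace = p_t1_do(1) - p_t1_do(0)
print(f"P(T=1 | do(A=1)) = {p_t1_do(1):.3f}")
print(f"P(T=1 | do(A=0)) = {p_t1_do(0):.3f}")
print(f"Average causal effect of A on T = {ace:.3f}")
```

So I would expect roughly 0.5 for T = 1 under do(A = 1), about 0.1 under do(A = 0), and an average causal effect of about 0.4.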

**Version information:**
 - DoWhy version: 0.13


