Using DoWhy for categorical network causal effect estimation #1348

@NianzuMa

Question
I want to use DoWhy to learn the parameters of a predefined network structure (B -> A, A -> T, B -> T) from a simulated dataset.
Once the parameters are learned, I want to manipulate variable A (a do-operation, i.e. a hard intervention) and see how the posterior of T changes.

I found this very straightforward in pgmpy and SMILE, but I am not sure how to do it in DoWhy.

My sample code is as follows:

# %%
import os

import networkx as nx
import numpy as np
import pandas as pd

import dowhy.gcm as gcm


# %%
output_folder = "./simulated_data"
os.makedirs(output_folder, exist_ok=True)

simulated_dataset_path = "./simulated_data/simulated_discrete_dataset.csv"

# %%
def simulate_bayesian_network():
    # Set random seed for reproducibility
    np.random.seed(1)

    N = 50000

    # Step 1: B ~ Bernoulli(0.5)
    B = np.random.binomial(1, 0.5, size=(N,))

    # Step 2: A depends on B
    A = np.zeros(N, dtype=int)
    A[B == 1] = np.random.binomial(1, 0.9, size=(np.sum(B == 1),))
    A[B == 0] = np.random.binomial(1, 0.1, size=(np.sum(B == 0),))

    # Step 3: T depends on A and B
    T = np.zeros(N, dtype=int)
    T[(A == 1) & (B == 1)] = np.random.binomial(1, 0.9, size=np.sum((A == 1) & (B == 1)))
    T[(A == 1) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 1) & (B == 0)))
    T[(A == 0) & (B == 1)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 1)))
    T[(A == 0) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 0)))
    
    data = {}
    data['A'] = A
    data['B'] = B
    data['T'] = T

    return pd.DataFrame(data)

# Simulate and preview the dataset
data = simulate_bayesian_network()
print(data.head())

# %%
# gcm stands for "Graphical Causal Models"

causal_model = gcm.StructuralCausalModel(nx.DiGraph([('B', 'A'), ('B', 'T'), ('A', 'T')]))


# %%

auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data)


# %%
print(auto_assignment_summary)

# %% [markdown]
# ## Comments
# 
# The auto-assigned causal mechanisms are not what I expect for a categorical network.
# 
# All available causal mechanisms are defined here:
# https://github.com/py-why/dowhy/blob/main/dowhy/gcm/causal_mechanisms.py
# 
# 

# %%
# causal_model.set_causal_mechanism('B', gcm.EmpiricalDistribution())
# causal_model.set_causal_mechanism('A', gcm.classification.LogisticRegressionClassifier())
# causal_model.set_causal_mechanism('T', gcm.classification.LogisticRegressionClassifier())

# %%

gcm.fit(causal_model, data)

# %%
gcm.average_causal_effect(causal_model,
                         'T',
                         interventions_alternative={'A': lambda x: 1},
                         interventions_reference={'A': lambda x: 0},
                         num_samples_to_draw=100000)

# %%
def print_out_posterior(df, variable_n):
    posterior_counts = df[variable_n].value_counts(normalize=True)
    print(posterior_counts)
    

# %%
samples = gcm.interventional_samples(causal_model,
                                     {'A': lambda x: 1},
                                     num_samples_to_draw=1000)
samples.head()

print_out_posterior(samples, "T")

# %%
mean_T = samples['T'].mean()
print(f"The mean of column 'T' is: {mean_T}")

# %%
samples_2 = gcm.interventional_samples(causal_model,
                                     {'A': lambda x: 0},
                                     num_samples_to_draw=1000)
samples_2.head()

# %%
mean_T_2 = samples_2['T'].mean()
print(f"The mean of column 'T' is: {mean_T_2}")

# %%
print_out_posterior(samples_2, "T")

In pgmpy and SMILE, there is an explicit way to define a categorical network: you pass the network structure to the package, and it learns the conditional probability tables for the network from the data.
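For reference, this kind of parameter learning amounts to counting. The sketch below estimates maximum-likelihood conditional probability tables with plain pandas for the B -> A, A -> T, B -> T structure. The helper `estimate_cpt` is hypothetical (it is not part of pgmpy, SMILE, or DoWhy), and the data is re-simulated with the same probabilities as in the sample code.

```python
import numpy as np
import pandas as pd

# Re-create a dataset with the same structure and probabilities as the sample code.
rng = np.random.default_rng(1)
n = 50_000
B = rng.binomial(1, 0.5, n)
A = rng.binomial(1, np.where(B == 1, 0.9, 0.1))
T = rng.binomial(1, np.where((A == 1) & (B == 1), 0.9, 0.1))
df = pd.DataFrame({"A": A, "B": B, "T": T})


def estimate_cpt(data, child, parents):
    """Hypothetical helper: relative frequency of each child value per parent configuration."""
    if not parents:
        return data[child].value_counts(normalize=True).sort_index()
    counts = data.groupby(parents + [child]).size().unstack(child, fill_value=0)
    return counts.div(counts.sum(axis=1), axis=0)


cpt_B = estimate_cpt(df, "B", [])
cpt_A = estimate_cpt(df, "A", ["B"])       # rows: B, columns: A
cpt_T = estimate_cpt(df, "T", ["A", "B"])  # rows: (A, B), columns: T
print(cpt_T)
```

Each row of `cpt_T` is the estimated distribution of T for one (A, B) configuration, which is the table a categorical Bayesian network stores for node T.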

(1)
However, in DoWhy I found that auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data) does not work as expected, and I cannot find a good way to define a categorical network using causal_model.set_causal_mechanism.

In the auto_assignment_summary, node A is assigned "Discrete AdditiveNoiseModel using LinearRegression" and node T is assigned "Discrete AdditiveNoiseModel using Pipeline".

Could you please help me explicitly define a categorical network?

I want to explore DoWhy because my real-world project contains (1) categorical variables that depend on continuous variables, (2) continuous variables that depend on categorical variables, (3) continuous variables that depend on continuous variables, and (4) categorical variables that depend on categorical variables. DoWhy looks like it can handle all of these cases (other packages have limitations), and it can apply a hard intervention to one variable and estimate the effect on the target variable.

(2)
Is it OK to use gcm.interventional_samples to apply a hard intervention, sample data, and then estimate the posterior probability of the target variable T? Is my sample code doing this the right way?
With the current setup, I get:

T 
0 0.482 
1 0.479 
2 0.021 
-1 0.018

It is surprising that T contains the values 2 and -1. In my input data, T is a categorical variable that takes only the two values 0 and 1.
Why does it behave this way, and what can I do to fix it?
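A plausible explanation, assuming the discrete additive noise model rounds a regression prediction and then adds a residual resampled independently of the parents: the residuals of a binary target lie in {-1, 0, 1}, so a prediction of 1 plus a residual of +1 yields 2, and a prediction of 0 plus a residual of -1 yields -1. The NumPy-only sketch below imitates that behaviour with a least-squares fit. It is an illustration of the mechanism, not DoWhy's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Same generative process as the sample code.
B = rng.binomial(1, 0.5, n)
A = rng.binomial(1, np.where(B == 1, 0.9, 0.1))
T = rng.binomial(1, np.where((A == 1) & (B == 1), 0.9, 0.1))

# Fit T ~ intercept + A + B by least squares and round predictions to integers.
X = np.column_stack([np.ones(n), A, B])
coef, *_ = np.linalg.lstsq(X, T, rcond=None)
pred = np.rint(X @ coef).astype(int)

# Residuals of a binary target around a rounded prediction lie in {-1, 0, 1}.
residuals = T - pred

# Sample under do(A = 1): rounded prediction plus an independently resampled residual.
X_do = np.column_stack([np.ones(n), np.ones(n), B])
pred_do = np.rint(X_do @ coef).astype(int)
T_do = pred_do + rng.choice(residuals, size=n)

values = sorted(np.unique(T_do).tolist())
print(values)
print({v: float(np.mean(T_do == v)) for v in values})
```

Because the noise is added on top of the prediction instead of being drawn from a conditional distribution over {0, 1}, nothing constrains the result to the original label set. A classifier-based mechanism avoids this by sampling from predicted class probabilities.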

Expected behavior
The posterior of T should contain only the values 0 and 1.
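As a reference point for checking any estimate, the ground-truth interventional distribution can be computed directly from the simulation parameters. Since B is the only common cause of A and T in this graph, adjusting for B gives P(T=1 | do(A=a)) = sum over b of P(T=1 | A=a, B=b) * P(B=b). The function name `p_T1_do_A` is hypothetical; the probabilities are copied from the sample code.

```python
# Parameters copied from simulate_bayesian_network() in the sample code.
p_B = {0: 0.5, 1: 0.5}
# P(T = 1 | A = a, B = b), keyed by (a, b).
p_T1_given_AB = {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.1}


def p_T1_do_A(a):
    """Backdoor adjustment over B: sum_b P(T=1 | A=a, B=b) * P(B=b)."""
    return sum(p_T1_given_AB[(a, b)] * p_B[b] for b in p_B)


p_do_a1 = p_T1_do_A(1)
p_do_a0 = p_T1_do_A(0)
ace = p_do_a1 - p_do_a0
print(p_do_a1, p_do_a0, ace)
```

Under these parameters, P(T=1 | do(A=1)) is about 0.5 and P(T=1 | do(A=0)) is about 0.1, so a correctly specified model should report an average causal effect of roughly 0.4.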

Version information:

  • DoWhy version [0.13]
