Description
Question
I want to use DoWhy to learn the parameters of a predefined network structure B -> A, A -> T, B -> T from a simulated dataset.
After the network's parameters are learned, I want to perform a hard intervention on variable A (a do-operation) and see how the posterior of T changes.
I found that this is very straightforward in pgmpy and SMILE, but I am not sure how to do it in DoWhy.
My sample code is as follows:
```python
# %%
import pandas as pd
import numpy as np
import os
import scipy
import math
from functools import partial
import dowhy.gcm as gcm
import networkx as nx

# %%
output_folder = "./simulated_data"
os.makedirs(output_folder, exist_ok=True)
simulated_dataset_path = "./simulated_data/simulated_discrete_dataset.csv"

# %%
def simulate_bayesian_network():
    # Set random seed for reproducibility
    np.random.seed(1)
    N = 50000
    # Step 1: B ~ Bernoulli(0.5)
    B = np.random.binomial(1, 0.5, size=(N,))
    # Step 2: A depends on B
    A = np.zeros(N, dtype=int)
    A[B == 1] = np.random.binomial(1, 0.9, size=(np.sum(B == 1),))
    A[B == 0] = np.random.binomial(1, 0.1, size=(np.sum(B == 0),))
    # Step 3: T depends on A and B
    T = np.zeros(N, dtype=int)
    T[(A == 1) & (B == 1)] = np.random.binomial(1, 0.9, size=np.sum((A == 1) & (B == 1)))
    T[(A == 1) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 1) & (B == 0)))
    T[(A == 0) & (B == 1)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 1)))
    T[(A == 0) & (B == 0)] = np.random.binomial(1, 0.1, size=np.sum((A == 0) & (B == 0)))
    data = {}
    data['A'] = A
    data['B'] = B
    data['T'] = T
    return pd.DataFrame(data)

# Simulate and preview the dataset
data = simulate_bayesian_network()
print(data.head())

# %%
# gcm stands for "Graphical Causal Models"
causal_model = gcm.StructuralCausalModel(nx.DiGraph([('B', 'A'), ('B', 'T'), ('A', 'T')]))

# %%
auto_assignment_summary = gcm.auto.assign_causal_mechanisms(causal_model, data)

# %%
print(auto_assignment_summary)

# %% [markdown]
# ## Comments
#
# The auto-detected causal mechanisms are not right.
#
# This is the code for all possible causal mechanisms:
# https://github.com/py-why/dowhy/blob/main/dowhy/gcm/causal_mechanisms.py

# %%
# causal_model.set_causal_mechanism('B', gcm.EmpiricalDistribution())
# causal_model.set_causal_mechanism('A', gcm.classification.LogisticRegressionClassifier())
# causal_model.set_causal_mechanism('T', gcm.classification.LogisticRegressionClassifier())

# %%
gcm.fit(causal_model, data)

# %%
gcm.average_causal_effect(causal_model,
                          'T',
                          interventions_alternative={'A': lambda x: 1},
                          interventions_reference={'A': lambda x: 0},
                          num_samples_to_draw=100000)

# %%
def print_out_posterior(df, variable_n):
    posterior_counts = df[variable_n].value_counts(normalize=True)
    print(posterior_counts)

# %%
samples = gcm.interventional_samples(causal_model,
                                     {'A': lambda x: 1},
                                     num_samples_to_draw=1000)
samples.head()
print_out_posterior(samples, "T")

# %%
mean_T = samples['T'].mean()
print(f"The mean of column 'T' is: {mean_T}")

# %%
samples_2 = gcm.interventional_samples(causal_model,
                                       {'A': lambda x: 0},
                                       num_samples_to_draw=1000)
samples_2.head()

# %%
mean_T_2 = samples_2['T'].mean()
print(f"The mean of column 'T' is: {mean_T_2}")

# %%
print_out_posterior(samples_2, "T")
```
In pgmpy and SMILE, there is a very explicit way to define a categorical network: you pass the network structure to the package, and it learns the conditional probability tables from the data.
(1)
However, in DoWhy I found that `gcm.auto.assign_causal_mechanisms(causal_model, data)` does not work as expected, and I also cannot find a good way to define a categorical network using `causal_model.set_causal_mechanism`.
In the auto-assignment summary, node A is assigned "Discrete AdditiveNoiseModel using LinearRegression" and node T is assigned "Discrete AdditiveNoiseModel using Pipeline".
Could you please help me explicitly define a categorical network?
I want to explore DoWhy because my real-world project contains (1) categorical variables depending on continuous variables, (2) continuous variables depending on categorical variables, (3) continuous variables depending on continuous variables, and (4) categorical variables depending on categorical variables. DoWhy looks like it can handle all of these cases (other packages have limitations) and can perform a hard intervention on a variable and estimate its effect on the target variable.
(2)
Is it OK to use `gcm.interventional_samples` to perform a hard intervention and sample data, and then estimate the posterior probability of the target variable T from those samples? In my sample code, am I doing this the right way?
With the current setting, I found that
```
T
 0    0.482
 1    0.479
 2    0.021
-1    0.018
```
It is surprising that T contains the values 2 and -1: in my input data, T only takes the two values 0 and 1, since it is a categorical variable.
Why does it behave this way, and what can I do to make it right?
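My guess at why this happens (please correct me if wrong): the discrete additive noise model seems to add a resampled residual noise term to a regression prediction and then round to an integer, which can overshoot the observed support. A toy numpy illustration of that overshoot, not DoWhy's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discrete-ANM-style sampling: a regression prediction in [0, 1]
# plus an additive residual noise term, rounded to the nearest integer.
prediction = rng.uniform(0.0, 1.0, size=10_000)  # stand-in regression output
noise = rng.normal(0.0, 0.5, size=10_000)        # stand-in residual noise
sampled = np.rint(prediction + noise).astype(int)
print(sorted(set(sampled)))  # contains values outside {0, 1}
```

A classifier-based mechanism would instead sample from a distribution over the observed categories, so it can never produce values like 2 or -1.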
Expected behavior
The posterior of T should only contain the values 0 and 1.
Version information:
- DoWhy version: 0.13