Simulation Abnormality #250
danielagandrade asked this question in Q&A (Unanswered)
Hi Daniela,

Is it possible that you're just missing a minus sign in front of your LR calculation, or likewise are not using -lnL inside the parentheses? I think this would make all of your distribution plots make a lot more sense. Otherwise, I don't have any good idea what's happening.

Matt
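To illustrate Matt's point numerically, here is a small sketch using the -lnL values Daniela reports below for the 2L and 3L models. It assumes the standard nested LRT statistic LR = 2 * (lnL_complex - lnL_simple); feeding in the -lnL values as if they were lnL (the suspected sign error) negates the statistic:

```python
# -lnL values for the 2L and 3L models, taken from the post below.
neg_lnL_2L = 93942.016575889
neg_lnL_3L = 93887.766913779

# Correct nested LRT statistic: LR = 2 * (lnL_complex - lnL_simple).
# With -lnL values in hand, lnL = -(-lnL), so:
lr_correct = 2 * ((-neg_lnL_3L) - (-neg_lnL_2L))

# Sign error: treating the -lnL values as if they were lnL flips the sign.
lr_sign_error = 2 * (neg_lnL_3L - neg_lnL_2L)

print(round(lr_correct, 4))     # matches the empirical LR of 108.4993
print(round(lr_sign_error, 4))  # the negated value
```

A sign slip like this would make the null distribution (simulated under the simpler model) appear mostly negative, which matches what Daniela describes.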
Hi CAFE5 developers, I am seeing some interesting results with my simulations and would like some input.
I have tested four different nested models using the base model. I modeled with a Poisson root distribution and included the error model for all of my runs. Here are the -lnL values for the empirical data:
Global lambda model (GL): 96839.4
Two lambda model (2L): 93942.016575889
Three lambda model (3L): 93887.766913779
Four lambda model (4L): 93326.065646918
To select which model was best, I compared the GL to the 2L model, the 2L to the 3L model, and the 3L to the 4L model.
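For reference, a sketch of those three comparisons from the -lnL values above, assuming the standard nested LRT statistic LR = 2 * (lnL_complex - lnL_simple) (an assumption about the formula used, but it reproduces the empirical LR of 108.4993 reported below for the 2L vs. 3L comparison):

```python
# -lnL values reported above for each nested model
neg_lnL = {
    "GL": 96839.4,
    "2L": 93942.016575889,
    "3L": 93887.766913779,
    "4L": 93326.065646918,
}

def lrt(simple, complex_):
    """LR = 2 * (lnL_complex - lnL_simple), computed from -lnL values."""
    return 2 * (neg_lnL[simple] - neg_lnL[complex_])

# The three nested comparisons described in the post
for pair in [("GL", "2L"), ("2L", "3L"), ("3L", "4L")]:
    print(pair, round(lrt(*pair), 4))
```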
The following was my general procedure:
Simulate 1000 datasets using the root distribution from my data under the simpler of the two models. (Important note: I was not able to make this run with the base error model included; the error said "Trying to simulate leaf family size that was not included in error model".)
Fit both models to each one of the simulated datasets.
Calculate the likelihood ratio for every simulation and plot the distribution. Then calculate my empirical likelihood ratio and compare it to the distribution. I used an alpha cutoff of 0.05.
I have attached the plots of the three comparisons, with the empirical LR plotted on them. I have ruled out the global lambda model and the four lambda model because the plots for those comparisons are clear and straightforward. However, I am seeing some interesting results in the comparison of the two lambda model to the three lambda model, and I would like your input.
My empirical LR is 108.4993. I have run both models multiple times on the empirical data and see convergence, with the -lnL consistently indicating that the 3L model is better (which is to be expected due to the extra parameter). Nonetheless, almost all of the LR values from the simulated data are negative, indicating that the 3L model has a worse fit: almost all of the -lnL values for the 3L model are larger than those for the 2L model.
Because the empirical LR is a positive value, when I compare it to the distribution of mostly negative numbers and the p value cutoff, it appears that the 3L model is the better choice. The p value of the empirical data is 0.001, calculated as follows:
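The p-value calculation itself did not survive in the post, so here is a sketch of one common convention for a simulation-based (empirical) p-value, p = (1 + #{LR_sim >= LR_obs}) / (1 + N). This is an assumption about the method; note that it yields approximately 0.001 when none of 1000 simulated LRs reach the observed value:

```python
def empirical_p(lr_obs, lr_sims):
    """Simulation-based p-value under one common convention (an assumption;
    the exact formula used in the post is not shown):
        p = (1 + #{LR_sim >= LR_obs}) / (1 + N)
    The +1 terms avoid reporting p = 0 when no simulated LR reaches lr_obs.
    """
    exceed = sum(1 for lr in lr_sims if lr >= lr_obs)
    return (1 + exceed) / (1 + len(lr_sims))

# Toy example: 1000 hypothetical null LR values, none reaching the
# empirical LR of 108.4993 (mimicking the mostly-negative distribution)
sims = [-5.0 + 0.001 * i for i in range(1000)]
print(round(empirical_p(108.4993, sims), 3))  # ~1/1001, rounds to 0.001
```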
However, I would like your input because this decision does not sit well with me, since in almost all of the simulations the 3L model performed worse. I find this confusing: I would expect that adding parameters would almost always lead to a better fit, but that is not what I am seeing. Additionally, the distribution of LR test values is skewed to the left. Based on the simulated data, I am inclined to choose the 2 lambda model. Nonetheless, I would like to hear your thoughts and whether you have seen similar behavior in the past when comparing models.
I have uploaded all of my distribution plots, as well as the 2L and 3L trees. Below is my code for the 2L vs 3L procedure in case it is helpful.
3_separate_lambdas.txt
2_separate_lambdas.txt
Distribution_Plot_2Lvs3L.pdf
Distribution_Plot_3Lvs4L.pdf
Distribution_Plot_GLvs2L.pdf
Thank you for developing CAFE5 and for all of your help when questions and issues arise; it is always very appreciated.
Daniela
CODE: