Top end model flattening #3306

Friedemannn · 2026-05-13T16:11:24Z

Friedemannn
May 13, 2026

Hi all,

After implementing cross-validation (see #3298) as a new visualisation option for my models, I discovered that the models do not reproduce the underlying (simulated) objective values at the top end.
The predictions never exceed a certain threshold.
It's most visible in this 6D run: 6D_crossval_vis.pdf
It can also be seen in 2D runs (subsets of the 6 dimensions, same bounds): 2D_crossval_vis1.pdf, 2D_crossval_vis2.pdf, 2D_crossval_vis3.pdf

My code looks like this:

def bo_loop():

    # Numerical-stability knobs (all optional in parameters.json)
    # - min_noise: variance floor of the Gaussian likelihood in *scaled* objective space
    # - cholesky_jitter: added jitter used by gpytorch for (near-)singular covariances
    min_noise = float(loop_dict.get("min_noise", 1e-6))
    cholesky_jitter = float(loop_dict.get("cholesky_jitter", 1e-6))


    # Initialize tensors for BO
    x_trainT = torch.from_numpy(x_train).reshape(-1, num_leaf_keys(bounds_dict))
    objectiveT = torch.from_numpy(np.abs(y_train)).reshape(-1, 1)
    bounds3DT = torch.from_numpy(
        np.array([get_value_by_path(bounds_dict, path) for path in get_leaf_paths(bounds_dict)])
    ).T

    x_trainT = normalize(x_trainT, bounds3DT)

    # Scale objective so that standardization is numerically stable
    y_scaled = objectiveT * y_scale
    objective_list = y_train

    for i in range(num_loops):
        likelihood = gpytorch.likelihoods.GaussianLikelihood(
            noise_constraint=gpytorch.constraints.GreaterThan(min_noise)
        )
        model = SingleTaskGP(
            x_trainT,
            y_scaled,
            likelihood=likelihood,
            outcome_transform=Standardize(1),
        )
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        with gpytorch.settings.cholesky_jitter(cholesky_jitter):
            fit_gpytorch_mll(mll)

        # Optimize acquisition
        acq_func = qLogExpectedImprovement(
            model=model,
            best_f=torch.max(y_scaled),
            sampler=SobolQMCNormalSampler(sample_shape=torch.Size([1024])),
        )
        with gpytorch.settings.cholesky_jitter(cholesky_jitter):
            candidates, _ = optimize_acqf(
                acq_function=acq_func,
                bounds=torch.tensor([[0.0] * num_leaf_keys(bounds_dict), [1.0] * num_leaf_keys(bounds_dict)]),
                q=batch_size,
                num_restarts=num_restarts,
                raw_samples=raw_samples,  # used for initialization heuristic
            )

        # Append new candidate(s)
        x_trainT = torch.cat([x_trainT, candidates])
        new_x_train = unnormalize(candidates, bounds3DT).detach()

        # Write new x values into parameter.json to call with fbpic
        ...
        # Run simulation with new x value(s)
        ...
        # Wait for all simulations to complete
        ...

        # Load sim results and calculate charge in energy bounds

        #latest_dirs = latest_scan_entry(run, limit=batch_size)
        for cand_dir in cand_dir_paths:
            #calculate objective
            with np.errstate(divide='ignore', invalid='ignore'):
                objective = np.sqrt(np.abs(charge)) / (mad / median_energy)
            if not np.isfinite(objective):
                print(f"Warning: Non-finite objective detected (charge={charge}, mad={mad}, median_energy={median_energy}). Setting objective to 0.0.")
                objective = 0.0
            objective_list = np.append(objective_list, objective)
            objectiveT = torch.cat([objectiveT, torch.tensor([[objective]])])
            y_scaled = torch.cat([y_scaled, torch.tensor([[objective * y_scale]])])
            
    # Save BoTorch model
    torch.save(model.state_dict(), os.path.join(run_dir, "model.pt"))

    return run_dir

Does anyone know what the problem might be?
Could the minimum noise I'm introducing be the problem? (I introduced it to solve the problem I had in #3133.)

Thanks!
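
On the minimum-noise question: since the model is built with outcome_transform=Standardize(1), the likelihood is fitted on standardized targets, so the GreaterThan(min_noise) floor is probably best read in units of the data's sample standard deviation rather than in y_scale units. A minimal sketch of that conversion, using only the standard library. The objective values are made up for illustration, and noise_floor_in_objective_units is a hypothetical helper, not part of BoTorch:

```python
import math
import statistics


def noise_floor_in_objective_units(y, min_noise):
    """Convert a noise-variance floor given in standardized units into a
    noise standard deviation in the units of y.

    Assumes y is standardized as (y - mean) / std, so a variance v in
    standardized space corresponds to a standard deviation sqrt(v) * std(y).
    """
    y_std = statistics.stdev(y)  # sample standard deviation (n - 1)
    return math.sqrt(min_noise) * y_std


# Made-up objective values, for illustration only.
y = [0.8, 1.1, 1.9, 2.4, 3.0, 3.6]
y_scale = 1e8  # same role as y_scale_factor in the loop
y_scaled = [v * y_scale for v in y]

floor_raw = noise_floor_in_objective_units(y, min_noise=1e-6)
floor_scaled = noise_floor_in_objective_units(y_scaled, min_noise=1e-6)

# Relative to the spread of the data, the floor is the same with or without y_scale.
rel_raw = floor_raw / statistics.stdev(y)
rel_scaled = floor_scaled / statistics.stdev(y_scaled)
print(rel_raw, rel_scaled)
```

Under these assumptions, min_noise = 1e-6 corresponds to a noise standard deviation of roughly 0.1 % of the spread of the observed objectives, independent of y_scale, which makes it seem an unlikely cause of a hard ceiling on the predictions.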

hvarfner · 2026-05-14T19:05:57Z

hvarfner
May 14, 2026
Maintainer

Hi @Friedemannn ,

The setup looks similar to the one where I asked you about the bounds issue - I see that I missed replying to the second half of that discussion.

Please use the input transform class, input_transform=Normalize(d=num_dims, bounds=your_bounds), along with the outcome transform. Normalization/standardization is the most common issue I see, and using it correctly eliminates that source of error. In your previous problem, the bounds were not [0, 1], and I really recommend providing the actual bounds everywhere and letting the model deal with it through Normalize. Manual solutions to this problem tend to be substantially more error-prone.

Some other things:

Do you need with gpytorch.settings.cholesky_jitter(cholesky_jitter): for your script to run? The "numerical stability knobs" are not something that is typically needed, and seeing them makes me a bit nervous.
The ... parts are very interesting here, please include them.

Getting the actual data that you are loading would help a lot here! The minimum noise could be the problem, but I'm pretty confident that the issue is the bounds and input normalization of the 6D objective specifically.

3 replies

Friedemannn May 17, 2026
Author

Hi @hvarfner,

thank you for the quick answer!
I might not completely understand what you mean, but I'm pretty sure that I'm normalizing correctly and providing the actual bounds everywhere. With:

from botorch.utils.transforms import normalize, unnormalize
x_trainT = normalize(x_trainT, bounds3DT)

bounds3DT is badly named: it's not 3D, it has as many dimensions as bounds_dict has entries.
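
For readers following along, here is a rough standard-library sketch of what this manual normalization amounts to: collect the [low, high] leaves of the nested bounds dict in a fixed order, then map each coordinate with (x - low) / (high - low). The helpers leaf_paths and value_at are hypothetical stand-ins for get_leaf_paths and get_value_by_path, whose real implementations are not shown in this thread:

```python
from typing import Any, Dict, List, Tuple

Path = Tuple[str, ...]


def leaf_paths(tree: Dict[str, Any], prefix: Path = ()) -> List[Path]:
    """Return the key paths of all leaves ([low, high] lists) in insertion order."""
    paths: List[Path] = []
    for key, value in tree.items():
        if isinstance(value, dict):
            paths.extend(leaf_paths(value, prefix + (key,)))
        else:
            paths.append(prefix + (key,))
    return paths


def value_at(tree: Dict[str, Any], path: Path) -> Any:
    """Follow a key path down the nested dict."""
    for key in path:
        tree = tree[key]
    return tree


def normalize_point(x, lower, upper):
    """Map raw coordinates into the unit cube."""
    return [(xi - lo) / (hi - lo) for xi, lo, hi in zip(x, lower, upper)]


def unnormalize_point(u, lower, upper):
    """Map unit-cube coordinates back to raw units."""
    return [lo + ui * (hi - lo) for ui, lo, hi in zip(u, lower, upper)]


# Two of the bounds from this thread, kept small for illustration.
bounds_dict = {
    "laser_params": {"zf": [0.0005, 0.0020]},
    "other_params": {"species": {"Hydrogen": {"density": [1e23, 1e24]}}},
}
paths = leaf_paths(bounds_dict)
lower = [value_at(bounds_dict, p)[0] for p in paths]
upper = [value_at(bounds_dict, p)[1] for p in paths]

x = [0.00125, 5.5e23]  # raw point at the midpoint of both ranges
u = normalize_point(x, lower, upper)
x_back = unnormalize_point(u, lower, upper)
print(paths, u, x_back)
```

The point worth checking in a setup like this is that the column order of the training data matches the order in which the leaves are traversed, since a mismatch would normalize each column with the wrong bounds without raising an error.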

I'll try without the jitter and see if it still works.
I wanted to keep it simple, but here is the full function. I can also upload the full script if needed:

def bo_loop(
    loop_dict: Dict[str, Any],
    bot_dict: Dict[str, Any],
    bounds_dict: Dict[str, Any],
    template_dict: Dict[str, Any],
    x_train: np.ndarray,
    y_train: np.ndarray,
    lim_energy_up: float,
    lim_energy_low: float,
    above_Elim: np.ndarray,
    below_Elim: np.ndarray,
    charge_list: np.ndarray,
    median_energy_list: np.ndarray,
    mad_list: np.ndarray,
    h_n_sym_flag: bool = False,
) -> str:
    """Run the Bayesian Optimization loop for LaPA_BO parameters.

    Returns the absolute path to the created scan directory.
    """
    y_scale = loop_dict["y_scale_factor"]
    num_loops = loop_dict["num_loops"]
    ucb_beta = loop_dict["UCB_beta"]

    # Numerical-stability knobs (all optional in parameters.json)
    # - min_noise: variance floor of the Gaussian likelihood in *scaled* objective space
    # - cholesky_jitter: added jitter used by gpytorch for (near-)singular covariances
    min_noise = float(loop_dict.get("min_noise", 1e-6))
    cholesky_jitter = float(loop_dict.get("cholesky_jitter", 1e-6))

    batch_size = bot_dict["BATCH_SIZE"]
    num_restarts = bot_dict["NUM_RESTARTS"]
    raw_samples = bot_dict["RAW_SAMPLES"]

    run = "run_LaPA_BO" + datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S") + uuid4().hex[:6]
    run_dir = os.path.join(SCANS_DIR, run)
    ensure_dir(run_dir)
    logs_dir = os.path.join(run_dir, "logs")
    ensure_dir(logs_dir)

    gpu_ids = detect_gpu_ids()
    if batch_size == "auto":
        batch_size = len(gpu_ids)
    else:
        print(
            f"Warning: BATCH_SIZE is not set to auto. This is currently not supported. "
            "Resetting BATCH_SIZE to the number of available GPUs."
        )
        batch_size = len(gpu_ids)

    # Initialize tensors for BO
    x_trainT = torch.from_numpy(x_train).reshape(-1, num_leaf_keys(bounds_dict))
    objectiveT = torch.from_numpy(np.abs(y_train)).reshape(-1, 1)
    bounds3DT = torch.from_numpy(
        np.array([get_value_by_path(bounds_dict, path) for path in get_leaf_paths(bounds_dict)])
    ).T

    x_trainT = normalize(x_trainT, bounds3DT)

    # Scale objective so that standardization is numerically stable
    y_scaled = objectiveT * y_scale
    objective_list = y_train

    for i in range(num_loops):
        t_model1 = time.time()
        likelihood = gpytorch.likelihoods.GaussianLikelihood(
            noise_constraint=gpytorch.constraints.GreaterThan(min_noise)
        )
        model = SingleTaskGP(
            x_trainT,
            y_scaled,
            likelihood=likelihood,
            outcome_transform=Standardize(1),
        )
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        with gpytorch.settings.cholesky_jitter(cholesky_jitter):
            fit_gpytorch_mll(mll)

        # Optimize acquisition
        acq_func = qLogExpectedImprovement(
            model=model,
            best_f=torch.max(y_scaled),
            sampler=SobolQMCNormalSampler(sample_shape=torch.Size([1024])),
        )
        with gpytorch.settings.cholesky_jitter(cholesky_jitter):
            candidates, _ = optimize_acqf(
                acq_function=acq_func,
                bounds=torch.tensor([[0.0] * num_leaf_keys(bounds_dict), [1.0] * num_leaf_keys(bounds_dict)]),
                q=batch_size,
                num_restarts=num_restarts,
                raw_samples=raw_samples,  # used for initialization heuristic
            )
        t_model2 = time.time()
        print(f"Model fitting and acquisition optimization took {t_model2 - t_model1:.2f} seconds.")

        # Append new candidate(s)
        x_trainT = torch.cat([x_trainT, candidates])
        new_x_train = unnormalize(candidates, bounds3DT).detach()
        t_fbpic1 = time.time()
        # Write new x values into parameter.json to call with fbpic
        candidate_paths: List[str] = []
        cand_dir_paths: List[str] = []
        for cand_id, candidate in enumerate(new_x_train):
            candidate_dict = deepcopy(template_dict)
            for path, x_val in zip(get_leaf_paths(bounds_dict), candidate):
                candidate_dict = set_value_by_path(candidate_dict, path, float(x_val))
            if h_n_sym_flag:
                candidate_dict["other_params"]["species"]["Hydrogen"]["profile_params"]["curve_down"] = candidate_dict["other_params"]["species"]["Hydrogen"]["profile_params"]["ramp_up"]
                candidate_dict["other_params"]["species"]["Nitrogen"]["profile_params"]["curve_down"] = candidate_dict["other_params"]["species"]["Nitrogen"]["profile_params"]["ramp_up"]
            candidate_dir = os.path.join(run_dir, f"iter{i}_cand{cand_id}")
            ensure_dir(candidate_dir)
            cand_dir_paths.append(candidate_dir)
            candidate_dict["Path"] = candidate_dir
            candidate_dict["Name"] = run
            iter_param_path = os.path.join(candidate_dir, f"iter_params{i}_cand{cand_id}.json")
            candidate_paths.append(iter_param_path)
            with open(iter_param_path, "w") as json_file:
                json.dump(candidate_dict, json_file)
            
        print("----------------------")

        # Run simulation with new x value(s)
        processes: List[tuple[subprocess.Popen, io.TextIOBase]] = []
        for cand_idx,(gpu_id, cand_path) in enumerate(zip(gpu_ids, candidate_paths)):
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
            log_path = os.path.join(logs_dir, f"log_iter{i}_cand{cand_idx}.txt")
            with open(log_path, "w") as log_file:
                proc = subprocess.Popen(
                    [
                        sys.executable,
                        "-m",
                        "fbpic_wrapper.SimForeman",
                        "--json=load",
                        f"--File={cand_path}",
                        "--Debug",
                        "--Summarize",
                    ],
                    stdout=log_file,
                    stderr=subprocess.STDOUT,
                    env=env,
                )
            processes.append((proc, log_file))

        # Wait for all simulations to complete
        for proc, log_file in processes:
            proc.wait()
            log_file.close()
        t_fbpic2 = time.time()
        print(f"FBPIC simulations took {t_fbpic2 - t_fbpic1:.2f} seconds.")

        # Load sim results and calculate charge in energy bounds
        for cand_dir in cand_dir_paths:
            dir_entries = [os.path.join(cand_dir, d) for d in os.listdir(cand_dir)]
            only_dirs = [entry for entry in dir_entries if os.path.isdir(entry)]
            only_dirs.remove(os.path.join(cand_dir, run)) #remove the checkpoints folder from list
            sim_dir = only_dirs[0]
            ts_2d = LpaDiagnostics(os.path.join(sim_dir, "diags/hdf5/"))

            #calculate objective
            charge = ts_2d.get_charge(iteration=ts_2d.iterations[-2], species="Electrons")
            charge_list = np.append(charge_list, charge)
            median_energy, mad = ts_2d.get_energy_spread(iteration=ts_2d.iterations[-2], species="Electrons", center="median", width="mad", property="energy")
            median_energy_list = np.append(median_energy_list, median_energy)
            mad_list = np.append(mad_list, mad)
            with np.errstate(divide='ignore', invalid='ignore'):
                objective = np.sqrt(np.abs(charge)) / (mad / median_energy)
            if not np.isfinite(objective):
                print(f"Warning: Non-finite objective detected (charge={charge}, mad={mad}, median_energy={median_energy}). Setting objective to 0.0.")
                objective = 0.0
            objective_list = np.append(objective_list, objective)
            objectiveT = torch.cat([objectiveT, torch.tensor([[objective]])])
            y_scaled = torch.cat([y_scaled, torch.tensor([[objective * y_scale]])])
            if median_energy > lim_energy_up:
                above_Elim = np.append(above_Elim, 1)
            else:
                above_Elim = np.append(above_Elim, 0)
            if median_energy < lim_energy_low:
                below_Elim = np.append(below_Elim, 1)
            else:
                below_Elim = np.append(below_Elim, 0)
            print(f"Debug: objective value is {objective}")
            
    # Save inputs and outputs
    np.savez(
        os.path.join(run_dir, "results.npz"),
        x_train=unnormalize(x_trainT, bounds3DT).numpy(),
        objective_list=objective_list,
        lim_energy_up=lim_energy_up,
        lim_energy_low=lim_energy_low,
        above_Elim=above_Elim,
        below_Elim=below_Elim,
        charge_list=charge_list,
        median_energy_list=median_energy_list,
        mad_list=mad_list,
    )

    # Save BoTorch model
    torch.save(model.state_dict(), os.path.join(run_dir, "model.pt"))

    return run_dir

Dicts used:

    "bo_loop": {
        "num_loops": 200,
        "y_scale_factor": 1e8,
    },
    
    "bo_torch": {
        "BATCH_SIZE": 1,
        "NUM_RESTARTS": 10,
        "RAW_SAMPLES": 1024,
        "MC_SAMPLES": 128
    }

bounds dict:

{
    "laser_params": {
        "zf": [0.0005, 0.0020]
    },
    "other_params": {
        "species": {
            "Hydrogen": {
                "profile_params": {
                    "plateau": [0.0001, 0.001],
                    "curve_down": [0.001, 0.006]
                },
                "density": [1e23,1e24]
            },
            "Nitrogen": {
                "profile_params": {
                    "start": [0.00005, 0.0018]
                },
                "density":[1e22,10e22]
            }
        }
    }
}

Data:

I currently don't have access to an example results.npz. I'll upload something in the next few days.

Friedemannn May 19, 2026
Author

Hi @hvarfner

here is the results.npz file of the 6D run, renamed with a .txt extension: results.txt

I also created a .csv containing x_trainT, the normalized x_trainT (using x_trainT = normalize(x_trainT, bounds3DT)), y_train (the corresponding objective value from the simulation), and y_scaled: debug_flat_model_run_LaPA_BO2026-04-15_17-59-0116c4f4.csv (also as an Excel file: debug_flat_model_run_LaPA_BO2026-04-15_17-59-0116c4f4.xlsx)

The first 12 rows are the initial data points the bo_loop starts with; the following rows were chosen by the bo_loop.

As far as I can see, everything seems to be in order?

I'm currently also running a loop without cholesky_jitter.

Friedemannn May 22, 2026
Author

@hvarfner small cholesky_jitter update. I did another loop without cholesky_jitter and it seems to work fine without it. I'll remove it for future loops.

cholesky_comparison.zip

hvarfner · 2026-05-22T16:54:14Z

hvarfner
May 22, 2026
Maintainer

@Friedemannn Glad to hear! Anything else that's not working, or can we close out the discussion?

1 reply

Friedemannn May 22, 2026
Author

@hvarfner Yes, the main issue (the trained model having a flat top) is still there.
I should've phrased my latest answer a bit more precisely:
I did another loop without cholesky_jitter and my problems from #3133 did not reappear. However, the cross-validation plot still has a flat top. The plots I attached are from 2D runs with around 180 samples. The flat top is harder to see there than in the 6D version with >800 samples (from my original post), but it's still there.

(I did not repeat the 6D run without cholesky_jitter, because that one takes a lot of time.)

Besides the cholesky_jitter, you mentioned that I'm not normalizing correctly. With the complete code and the data I uploaded, do you still think that is the case?

Do you have any other ideas about what could cause this flat top behavior of the model?
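
To put a number on the flat top instead of reading it off the cross-validation plots, one option is to compare held-out predictions against observations only for the best observed points. The following is a minimal sketch under that assumption; top_end_shortfall is a hypothetical helper and the values are toy numbers, not data from the runs above:

```python
def top_end_shortfall(y_true, y_pred, fraction=0.1):
    """Mean of (observed - predicted) over the top `fraction` of observed values.

    A clearly positive value means the model under-predicts the best points.
    """
    n_top = max(1, int(round(fraction * len(y_true))))
    # Indices of the n_top largest observed values.
    order = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)[:n_top]
    return sum(y_true[i] - y_pred[i] for i in order) / n_top


# Toy data: predictions track the observations up to 7, then saturate at 7.5.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 7.5, 7.5]

shortfall = top_end_shortfall(y_true, y_pred, fraction=0.3)
ceiling_gap = max(y_true) - max(y_pred)
print(shortfall, ceiling_gap)
```

Tracking such a number across runs (for example with and without cholesky_jitter, or 2D versus 6D) would make it easier to tell whether a change actually affects the flat top.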

Friedemannn · 2026-06-01T13:06:53Z

Friedemannn
Jun 1, 2026
Author

Does anyone have an idea where this behavior comes from?

The deviation from the ideal fit in the 6D version is probably due to too few data points for 6D to work well.

But I have a hard time even guessing where this flat top comes from.

0 replies

hvarfner · 2026-06-01T19:53:51Z

hvarfner
Jun 1, 2026
Maintainer

@Friedemannn Sorry, was off for a few days. You mean the predictions for the very largest values, right? These would be considered the "worst", right?

A tip: Color the points by the distance they have to either the boundary or to any other point in the search space. For example, points that are close to other points (e.g. by min distance and by kernel distance - two different plots) can be colored blue, and those with a large min distance would be colored red. I'm sure you would see a lot of red for those flat top points, meaning that they would be far from other points and thus difficult to predict.

Just a hunch, but I think it's worth exploring.
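
A minimal standard-library sketch of the two quantities suggested above for colouring the points, assuming the inputs are already normalized to the unit cube. The optional lengthscales argument is an assumption on my side for approximating a "kernel distance" by dividing each dimension by a fitted lengthscale; the points are toy values:

```python
import math


def min_distances(points, lengthscales=None):
    """For each point, the Euclidean distance to its nearest other point.

    If lengthscales is given, each dimension is divided by its lengthscale
    first, as a rough stand-in for a kernel distance.
    """
    dim = len(points[0])
    scales = lengthscales if lengthscales is not None else [1.0] * dim
    scaled = [[c / s for c, s in zip(p, scales)] for p in points]
    return [
        min(math.dist(p, q) for j, q in enumerate(scaled) if j != i)
        for i, p in enumerate(scaled)
    ]


def boundary_distances(points):
    """For each point in the unit cube, the distance to the nearest face."""
    return [min(min(c, 1.0 - c) for c in p) for p in points]


# Toy points in the unit square: two close together, two isolated.
points = [[0.1, 0.1], [0.15, 0.1], [0.9, 0.9], [0.5, 0.5]]

nn_dist = min_distances(points)
edge_dist = boundary_distances(points)
print(nn_dist)
print(edge_dist)
```

Either list can then be passed as the colour values of a scatter plot of predicted versus observed objective, so that isolated points stand out from well-covered ones.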

1 reply

Friedemannn Jun 1, 2026
Author

@hvarfner In my setup the objective value is maximized, so the higher values in the crossval plots (where the flat top is) are the best ones, i.e. the ones I'm looking for.

That sounds like an interesting plot, I'll try that. Thank you!
Basically, you're saying that these flat tops might also be a symptom of too few data points for the model to accurately predict more difficult points?

hvarfner · 2026-06-02T13:11:46Z

hvarfner
Jun 2, 2026
Maintainer

@Friedemannn okay, I see. I thought it was the opposite, which makes me less optimistic. However:

(...) these flat tops might also be a symptom of too few data points for the model to accurately predict more difficult points?

I would add: too few data points around those specific data points, yes. However, that is pretty unlikely if they are good - I thought they were bad. Now, I actually think they will be very close to one another, but that your objective varies very quickly around these specific data points, making them hard to predict. Simply put, I would guess there's some level of heterogeneity.
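
One way to probe this hunch is to estimate how fast the objective changes locally, for instance as the objective difference to the nearest neighbour divided by the distance to it. This is only a crude sketch under that assumption, with a hypothetical helper name and toy values, and it assumes inputs normalized to the unit cube:

```python
import math


def local_variation(points, values):
    """For each point, |y_i - y_nn| / dist(x_i, x_nn) with nn the nearest neighbour.

    Large values flag regions where the objective changes quickly over short
    distances. Exact duplicate inputs give inf if their values differ, else 0.
    """
    result = []
    for i, p in enumerate(points):
        others = [k for k in range(len(points)) if k != i]
        j = min(others, key=lambda k: math.dist(p, points[k]))
        dist = math.dist(p, points[j])
        diff = abs(values[i] - values[j])
        if dist > 0.0:
            result.append(diff / dist)
        else:
            result.append(math.inf if diff > 0.0 else 0.0)
    return result


# Toy 1D data: smooth on the left, a sharp jump between the two right-most points.
points = [[0.0], [0.1], [0.5], [0.52]]
values = [1.0, 1.1, 3.0, 5.0]

variation = local_variation(points, values)
print(variation)
```

If the best points show much larger local variation than the rest, that would be consistent with the objective changing faster near the optimum than a single set of stationary kernel hyperparameters can capture.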

0 replies
