- Python 3.8.17
- NvFlare 2.4.0
Decentralized ComBat is a privacy‑preserving tool that harmonizes neuroimaging data stored at multiple labs without ever copying raw files to a central server. Each site runs the ComBat math locally, shares only encrypted summary statistics with a lightweight aggregator, and then adjusts its data using the combined grand mean and variance. The result is a dataset that is statistically “site‑neutral,” giving analyses the same power and consistency as traditional, centralized ComBat while sidestepping legal, storage, and security hurdles. Tested on traumatic‑brain‑injury studies and large‑scale simulations, the method matches centralized results, scales cleanly to many sites, and lets researchers blend public and private datasets that previously could not be combined. In short, Decentralized ComBat makes multi‑center neuroimaging studies easier, safer, and more statistically robust.
Below are the key steps in the algorithm:
In our decentralized environment there are two types of nodes. The first is the aggregator node, also known as the remote node, which holds no data; it stores intermediate results and performs simple operations such as aggregation. The second is the local (regional) node, where the datasets reside.
- Each participating site runs COINSTAC’s decentralized regression to obtain initial β‑coefficients.
- Using those coefficients, the site computes its local mean and local variance.
- These summary statistics—never raw data—are securely sent to the remote aggregator node.
- The aggregator combines all incoming summaries to derive the grand mean and grand variance across sites.
- It broadcasts those global values back to every local node.
- Each node uses the grand statistics to standardize its own dataset.
- It then estimates site‑specific effects via parametric empirical Bayes and adjusts its data accordingly.
- The result: harmonized, site‑neutral data that remain in place and ready for pooled analysis.
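To make the aggregation concrete, here is a minimal, self-contained Python sketch of steps 3–6. It assumes each site reports only its sample count and per-feature mean and variance, and it omits the covariate regression (steps 1–2) and the empirical-Bayes adjustment (step 7). All names are illustrative; this is not the actual COINSTAC/NvFlare implementation.

```python
import numpy as np

def aggregate_summaries(counts, means, variances):
    """Pool per-site summary statistics into a grand mean and grand variance.

    counts:    per-site sample sizes
    means:     per-site feature means, each an array of shape [n_features]
    variances: per-site feature variances, same shape
    """
    n = np.asarray(counts, dtype=float)
    m = np.stack(means)      # shape: [n_sites, n_features]
    v = np.stack(variances)
    total = n.sum()

    # Grand mean: sample-size-weighted average of the site means.
    grand_mean = (n[:, None] * m).sum(axis=0) / total

    # Grand variance: within-site spread plus between-site spread.
    within = ((n - 1)[:, None] * v).sum(axis=0)
    between = (n[:, None] * (m - grand_mean) ** 2).sum(axis=0)
    grand_var = (within + between) / (total - 1)
    return grand_mean, grand_var

def standardize_local(data, grand_mean, grand_var):
    """Standardize one site's data with the broadcast grand statistics."""
    return (data - grand_mean) / np.sqrt(grand_var)

# Example: two hypothetical sites, three features each.
gm, gv = aggregate_summaries(
    counts=[120, 80],
    means=[np.array([1.0, 2.0, 3.0]), np.array([1.2, 1.9, 3.1])],
    variances=[np.array([0.5, 0.4, 0.6]), np.array([0.55, 0.38, 0.61])],
)
```

In the actual computation, only these summaries leave a site; the raw `data` matrix stays local, and the empirical-Bayes site-effect estimation of step 7 runs on the standardized values.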
The computation requires two CSV files as input:
- Covariates File (`CatCovariate.csv`)
- Dependent Variables File (`Data.csv`)
Both files must follow a consistent format, though the specific covariates and dependent variables may vary from study to study. The computation expects these files to match the covariate and dependent-variable names specified in the `parameters.json` file.
The key `covariate_file` in `parameters.json` must match the covariate file's name at each local site.
Example: `test_data/site1/CatCovariate.csv`
- Format: CSV (Comma-Separated Values)
- Headers: The file must include a header row where each column name corresponds to a covariate specified in `parameters.json`.
- Rows: Each row represents a subject, where each column contains the value for a specific covariate.
- Variable Names: The names of the covariates in the header must match the entries in the `covariates_types` section of `parameters.json`.
<Covariate_1>,<Covariate_2>,...,<Covariate_N>
<value_1>,<value_2>,...,<value_N>
<value_1>,<value_2>,...,<value_N>
...
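For example, a covariates file declaring three hypothetical covariates (`age`, `sex`, `isControl`; these names are illustrative, not required by the computation) would look like:

```csv
age,sex,isControl
34,M,true
27,F,false
```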
The key `data_file` in `parameters.json` must match the dependent-variables file's name at each local site.
- Format: CSV (Comma-Separated Values)
- Headers: The file must include a header row where each column name corresponds to an ROI (region of interest) in the brain.
- Rows: Each row represents the same subject as in the covariates file, with values for the dependent variables.
<Dependent_1>,<Dependent_2>,...,<Dependent_N>
<value_1>,<value_2>,...,<value_N>
<value_1>,<value_2>,...,<value_N>
...
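Before running the computation, a site can sanity-check that its two CSV files line up with `parameters.json`. The snippet below is an illustrative standalone check, not part of the computation; the helper name and paths are hypothetical.

```python
import json
import pandas as pd

def validate_site_inputs(params_path: str, site_dir: str) -> None:
    """Verify that a site's CSV files match the names declared in parameters.json."""
    with open(params_path) as f:
        params = json.load(f)

    covariates = pd.read_csv(f"{site_dir}/{params['covariate_file']}")
    data = pd.read_csv(f"{site_dir}/{params['data_file']}")

    # Covariate headers must match the covariates_types section.
    missing = set(params["covariates_types"]) - set(covariates.columns)
    if missing:
        raise ValueError(f"Covariate columns missing from CSV: {missing}")

    # Both files must describe the same subjects, row for row.
    if len(covariates) != len(data):
        raise ValueError("Covariates and data files have different row counts")

validate_site_inputs("test_data/server/parameters.json", "test_data/site1")
```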
- The data provided by each site follows the specified format (standardized covariate and dependent variable headers).
- The computation is run in a federated environment, and each site contributes valid data.
This file is loaded by `combat_controller.py` on the remote node, which then passes it to the edge nodes (executors) in the computation via the FLContext object.
Example: test_data/server/parameters.json
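To illustrate the mechanism, below is a minimal sketch of the general NvFlare controller pattern. It is not the repository's `combat_controller.py`; the task name, property key, and file-path handling are hypothetical.

```python
import json

from nvflare.apis.controller_spec import Task
from nvflare.apis.fl_context import FLContext
from nvflare.apis.impl.controller import Controller
from nvflare.apis.shareable import Shareable
from nvflare.apis.signal import Signal


class ParamsBroadcastController(Controller):
    """Minimal sketch: load parameters.json on the remote node and send it to executors."""

    def start_controller(self, fl_ctx: FLContext):
        with open("test_data/server/parameters.json") as f:
            self.params = json.load(f)
        # The parameters can also be attached to the FLContext so other
        # server-side components can read them.
        fl_ctx.set_prop("combat_parameters", self.params, private=False, sticky=True)

    def control_flow(self, abort_signal: Signal, fl_ctx: FLContext):
        data = Shareable()  # Shareable behaves like a dict
        data["parameters"] = self.params
        task = Task(name="combat_step", data=data)
        # Broadcast the task to every participating site and wait for replies.
        self.broadcast_and_wait(task=task, fl_ctx=fl_ctx, abort_signal=abort_signal)

    def stop_controller(self, fl_ctx: FLContext):
        pass
```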
| Key | Type | Required | Description | Example |
|---|---|---|---|---|
| `covariate_file` | string | ✅ | Covariate file name inside the edge node's data directory | `"CatCovariate.csv"` |
| `data_file` | string | ✅ | Dependent-variables file name inside the edge node's data directory | `"Data.csv"` |
| `combat_algo` | string | ✅ | Which algorithm variant to run during the computation | `combatDC` or `combatMegaDC` |
| `covariates_types` | object | ✅ | Datatype of each column in the covariates file | `{"<Covariate_1>": "int", ...}` |
| `covariates_types["key_name"]` | string | ✅ | Primitive datatype name supported in Python 3.8 | `int`, `float`, `string`, or `bool` |
Note: In the dependent file, each cell value is assumed to be either empty or of type `float`.
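Putting the table together, a complete `parameters.json` might look like the following; the covariate names and types are illustrative, not prescribed by the computation:

```json
{
  "covariate_file": "CatCovariate.csv",
  "data_file": "Data.csv",
  "combat_algo": "combatDC",
  "covariates_types": {
    "age": "int",
    "sex": "string",
    "isControl": "bool"
  }
}
```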
The computation creates three categories of log files:
- Site logs, stored under `test_output/{site_name}/{site_name}.log`.
- Remote logs, stored under `test_output/remote/remote.log`; these are the controller logs.
- The aggregator log file, stored in the same location as the remote logs and specific to the aggregator computation.
Set the environment variable `LOG_LEVEL` (supported values: `info`, `debug`, `error`, `warning`) in `dockerRun.sh`, or pass it to the application directly in the `docker run` command.
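For instance, `LOG_LEVEL` can be passed with Docker's standard `-e` flag; the flags below other than `-e` are illustrative and may differ from what `dockerRun.sh` uses:

```bash
# Illustrative: run the image built in the steps below with debug logging
docker run -it -e LOG_LEVEL=debug nvflare-dccombat
```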
Once the computation completes, it generates the harmonized, site-specific CSV files in the `test_output/{site_name}` directory.
1. Clone the repository.
2. Build the Docker image with the command below:
   `docker build . -t nvflare-dccombat -f Dockerfile-dev`
3. The above command generates a Docker image with the tag `nvflare-dccombat`.
4. Start the Docker container with the `./dockerRun.sh` command. Provide the necessary execute permission for this file first.
5. The above opens a shell inside the container. Run the following command to start the computation:
   `nvflare simulator -c site1,site2 ./app/`
6. Make changes as needed and repeat step 5 to test them.