Multimodal Tabular Imputation

Training framework for deep learning models that leverage image data to improve tabular data imputation via cross-attention.


Installation

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

Data Preparation

The framework expects a single CSV file containing all tabular features, image references, and split assignments. Columns must follow a strict ordering convention.

CSV Structure

Columns must appear in this exact order:

[categorical features] | [continuous features] | image_name | [taxonomy] | split

1. Categorical Features (optional)

Place all categorical columns first. Each value must be integer-encoded (label-mapped). Missing values are supported and represented as empty cells.

⚠️ You must declare the number of classes for each categorical column — in the same order they appear in the CSV — via the field_lengths_tabular parameter in your experiment config:

data:
  field_lengths_tabular: [5, 12, 3]  # one entry per categorical column
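
For illustration, a minimal pandas sketch of label-mapping categorical columns and deriving the matching field_lengths_tabular values (the column names below are hypothetical, not part of the dataset):

import pandas as pd

df = pd.read_csv("datasets/phenofish/train_val.csv")
categorical_cols = ["habitat", "diet_type", "sex"]  # hypothetical categorical columns

field_lengths = []
for col in categorical_cols:
    # Map each category to an integer code; keep missing values as empty cells (NaN).
    codes = df[col].astype("category").cat.codes   # -1 marks missing
    df[col] = codes.where(codes >= 0)
    field_lengths.append(int(df[col].max()) + 1)    # number of classes for this column

print(field_lengths)  # paste into field_lengths_tabular, in CSV column order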

2. Continuous Features

Place all continuous columns after the categorical ones. Missing values are supported (empty cells).

Recommended preprocessing:

  • Z-score normalization: zero mean, unit variance (required for stable training)
  • Log₁₀ scaling: recommended for heavily skewed or scale-varying features (e.g. body length, weight). Apply before z-scoring and store both the raw and log-transformed columns if needed. A sketch of both steps follows this list.
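
As an illustration of both steps, assuming pandas and numpy (the trait column names and the output path are hypothetical):

import numpy as np
import pandas as pd

df = pd.read_csv("datasets/phenofish/train_val.csv")
continuous_cols = ["body_elongation", "eye_size"]  # hypothetical continuous trait columns
skewed_cols = ["body_length"]                      # hypothetical heavily skewed columns

# Log10-scale skewed features first; missing values propagate as NaN.
for col in skewed_cols:
    df[col + "_log10"] = np.log10(df[col])

# Z-score every continuous column: zero mean, unit variance (NaNs are skipped).
for col in continuous_cols + [c + "_log10" for c in skewed_cols]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

df.to_csv("datasets/phenofish/train_val_preprocessed.csv", index=False)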

3. image_name

A column named image_name containing the filename of the associated image (e.g. Salmo_trutta.png). The directory containing all images is set separately in the config:

data:
  dataset:
    dir_image_path: datasets/phenofish/images/

For now, the dataset classes expect images of size 256x256. Be careful: to speed up training, the dataset's __init__ function caches all images into RAM.

This column is only required when using the image or multimodal modalities.
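
A quick sanity check before training, assuming Pillow is installed, that every referenced image exists and has the expected 256x256 size:

import pandas as pd
from pathlib import Path
from PIL import Image

df = pd.read_csv("datasets/phenofish/train_val.csv")
image_dir = Path("datasets/phenofish/images/")

for name in df["image_name"].dropna():
    path = image_dir / name
    assert path.exists(), f"missing image: {path}"
    with Image.open(path) as img:
        assert img.size == (256, 256), f"unexpected size {img.size} for {path}"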

4. Taxonomy Columns (optional)

To include taxonomic classification in the loss, set the with_taxonomy parameter to True and add the following three columns (in this order), each integer-encoded:

Column       Description
order_idx    Taxonomic order index
family_idx   Taxonomic family index
genus_idx    Taxonomic genus index

Declare the total number of classes for each level in your experiment config:

data:
  taxonomy:
    n_order: 37
    n_family: 213
    n_genus: 1899
    train_genus: true # or false
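
For illustration, a minimal pandas sketch that integer-encodes raw taxonomy labels into the three index columns and reports the class counts to declare in the config (the raw label column names are hypothetical):

import pandas as pd

df = pd.read_csv("datasets/phenofish/train_val.csv")

# Hypothetical raw label columns mapped to the required integer-encoded index columns.
for raw_col, idx_col in [("order", "order_idx"), ("family", "family_idx"), ("genus", "genus_idx")]:
    df[idx_col] = df[raw_col].astype("category").cat.codes
    print(idx_col, "classes:", df[idx_col].max() + 1)  # use for n_order, n_family, n_genus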

5. split

The last column must be split, with values train or val. Rows with an empty split value are ignored during training.
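
One way to fill the split column, here with a random 90/10 partition (the fraction and seed are illustrative, not prescribed by the framework):

import numpy as np
import pandas as pd

df = pd.read_csv("datasets/phenofish/train_val.csv")
rng = np.random.default_rng(seed=0)

# Assign each row to train or val; rows left empty are ignored during training.
df["split"] = np.where(rng.random(len(df)) < 0.9, "train", "val")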


Example CSV

The example below has no categorical features, 30 z-scored continuous traits, one log₁₀-transformed body length column, image filenames, taxonomy indices, and split assignment:

diet_troph,...,body_elongation,image_name,order_idx,family_idx,genus_idx,split
,-0.3762,...,-0.3675,Petroleuciscus_borysthenicus.png,10,69,1343,train
,...,1.3433,Rineloricaria_heteroptera.png,33,123,1605,val

In this example, field_lengths_tabular is [] since there are no categorical columns.


Config Reference

Set the following fields in your dataset config (or experiment override) to match your CSV:

data:
  dataset:
    data_frame_path: datasets/phenofish/train_val.csv
    dir_image_path:  datasets/phenofish/images/
    n_features: 30          # number of continuous + categorical features
    batch_size: 124
    num_workers: 8
    multi_img_per_line: False 
    with_taxonomy: True

  field_lengths_tabular: []   # number of classes per categorical column, in order

  taxonomy:
    n_order:  37
    n_family: 213
    n_genus:  1899
    train_genus: True

Usage

Training is handled by train.py using Hydra for configuration. You must always specify a data config and a modality, and optionally an experiment override.

Modalities

Modality     Description
image        Train image encoder only
tabular      Train tabular encoder only
multimodal   Train full cross-attention model (image + tabular)

Commands

Train tabular encoder:

python train.py data=phenofish modality=tabular

Train image encoder:

python train.py data=phenofish modality=image

Train multimodal model:

python train.py data=phenofish modality=multimodal

With a specific experiment config:

python train.py data=phenofish modality=multimodal experiment=phenofish

Override specific parameters at runtime:

python train.py data=phenofish modality=multimodal experiment=phenofish model.multimodal.lr=1e-4 model.multimodal.freeze_encoders=True

Configuration

Configs are organized under conf/ and follow Hydra's config group structure:

conf/
├── main_config.yaml
├── modality/
│   ├── image.yaml
│   ├── tabular.yaml
│   └── multimodal.yaml
├── model/
│   ├── image.yaml
│   ├── tabular.yaml
│   └── multimodal.yaml
├── trainer.yaml
├── config.yaml
└── experiment/
    └── phenofish.yaml

Both data and modality are mandatory — Hydra will raise an error if either is not specified.
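
As a quick check that a combination of config groups resolves correctly, one can compose the config without launching training. This is a sketch assuming Hydra 1.2+ and that main_config.yaml (shown in the tree above) is the primary config:

from hydra import compose, initialize
from omegaconf import OmegaConf

# Build the config exactly as a Hydra app would, then print it for inspection.
with initialize(version_base=None, config_path="conf"):
    cfg = compose(
        config_name="main_config",
        overrides=["data=phenofish", "modality=multimodal", "experiment=phenofish"],
    )
print(OmegaConf.to_yaml(cfg))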

Experiment overrides

Experiment configs (under conf/experiment/) are the recommended way to customize a run without modifying the base configs. They use # @package _global_ and can override any parameter from any config group:

# conf/experiment/phenofish.yaml
# @package _global_

data:
  dataset:
    data_frame_path: datasets/phenofish/dataset_multimodal_v1/train_val_dataset_30_traits_zscore_preprocess.csv
    dir_image_path: datasets/phenofish/dataset_multimodal_v1/preprocess_fishmorph_1_2
    cpm_aug: 1
    n_features: 30
    batch_size: 124
    num_workers: 8
    #multi_img_per_line: False
  
  field_lengths_tabular: []
    
  # data specific
  taxonomy:
    
    n_order: 37
    n_family: 213
    n_genus: 1899
    train_genus: true
model:
  tabular:
    ckpt_path: #runs/phenofish/tabular/best_try_1/checkpoints/best-epoch=971.ckpt
  image:
    ckpt_path: #runs/phenofish/image/best_try_1/checkpoints/best-epoch=00.ckpt
  multimodal: 
    ckpt_path: runs/phenofish/multimodal/now_way_from_pt_best_try_1_reprod_lambda/checkpoints/last-v1.ckpt
    hprams:
      lr: 0.0001
      weight_decay: 1.0e-05
      lambda_classif: 1
      lambda_recon: 3.0
      lambda_clip: 0.5
      lambda_encoders:  1
      freeze_encoders: False

This keeps base configs untouched and makes each experiment fully reproducible and self-contained.
