Binary file added .DS_Store
Binary file added working/.DS_Store
Binary file added working/process/.DS_Store
106 changes: 106 additions & 0 deletions working/process/subcaption_and_summary_generation/README.md
@@ -0,0 +1,106 @@
## vLLM Inference Pipeline for Open-PMC-18M: Subcaption Extraction, Image-Context Summary Generation, and Modality Labeling

This repo contains three vLLM inference stages, each launched via a Slurm bash script:

* **Stage 1 (Subcaption extraction, VLM):** `Qwen2.5-VL-32B-Instruct` generates a *verbatim* subfigure caption from a full figure caption + subfigure image.
* **Stage 2 (Context summary, LLM):** `Qwen2.5-14B-Instruct` generates a focused summary of the context passage relevant to the subcaption.
* **Stage 3 (Modality labeling, VLM):** `Qwen2.5-VL-32B-Instruct` generates an L2 label; L1 and L0 labels are then inferred from a predefined mapping based on the generated L2 label (sketched below).
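
Stage 3 only asks the VLM for the fine-grained L2 label; L1 and L0 then come from a fixed lookup. A minimal sketch, assuming hypothetical label names and a dictionary-based mapping (the real taxonomy is defined in `generate_modality_labels_vllm.py`):

```python
# Hypothetical taxonomy entries: the actual L2 -> L1 -> L0 mapping lives in
# generate_modality_labels_vllm.py; these names are placeholders.
L2_TO_L1 = {
    "chest x-ray": "x-ray",
    "brain mri": "mri",
    "line chart": "chart",
}
L1_TO_L0 = {
    "x-ray": "radiology",
    "mri": "radiology",
    "chart": "illustrative figure",
}

def roll_up(l2_label: str) -> tuple[str, str]:
    """Infer the L1 and L0 labels from a generated L2 label via the predefined mapping."""
    l1 = L2_TO_L1.get(l2_label.strip().lower(), "unknown")
    l0 = L1_TO_L0.get(l1, "unknown")
    return l1, l0
```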

### Environment / Versions

This pipeline was run with:

* `vllm==0.8.2`
* `xformers==0.0.29.post2`
* `torch==2.6.0`

### Inputs

All scripts read and **overwrite** the same CSV or JSONL file (checkpointing is done by writing back to `--data_path`).

**Required columns**

* Subcaption stage (`generate_subcaption_vllm.py`):

  * `subfig_path` (path to subfigure image)
  * `caption` (full compound figure caption)
  * Output column: `sub_caption`
* Summary stage (`generate_summary_vllm.py`):

  * `caption` (full compound figure caption)
  * `sub_caption` (subcaption for each subfigure)
  * `image_context` (image context related to the subfigure)
  * Output column: `summary`
* Modality labeling stage (`generate_modality_labels_vllm.py`):

  * `subfig_path` (path to subfigure image)
  * Output columns: `L0_label`, `L1_label`, and `L2_label`

All stages support **resume** behavior: they skip rows where the output column is already filled (non-empty).
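
A minimal sketch of that shared resume-and-checkpoint pattern, assuming a pandas-backed CSV and a hypothetical `generate_batch` helper (the actual logic lives in each `generate_*_vllm.py` script):

```python
import pandas as pd

def generate_batch(rows: pd.DataFrame) -> list[str]:
    """Stand-in for the vLLM call made by the real script (hypothetical helper)."""
    return ["" for _ in range(len(rows))]

def run_stage(data_path: str, output_col: str, batch_size: int = 32) -> None:
    df = pd.read_csv(data_path)
    if output_col not in df.columns:
        df[output_col] = ""

    # Resume: only process rows whose output column is still empty.
    todo = df.index[df[output_col].fillna("").astype(str).str.strip() == ""]

    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        df.loc[batch, output_col] = generate_batch(df.loc[batch])
        # Checkpoint: overwrite --data_path so an interrupted job can pick up where it left off.
        df.to_csv(data_path, index=False)
```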

---

## How to Run (Slurm)

### 1) Subcaption generation (Qwen2.5-VL-32B-Instruct)

Edit the Slurm script to point to:

* your python file path
* your CSV path (`--data_path`)
* your model weights path (`--model_dir`)
* any desired batch size / tensor-parallel (`--tp_size`) settings

Then submit:

```bash
sbatch run_vllm_subcaption_inference.sh
```

**Slurm script reference** (`run_vllm_subcaption_inference.sh`): launches `generate_subcaption_vllm.py` with vLLM tensor parallelism and writes `sub_caption` back into the CSV.
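
For orientation, a hedged sketch of the kind of vLLM call this stage implies; the real prompt template, arguments, and batching are in `generate_subcaption_vllm.py`, and the prompt text below is only a placeholder:

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Mirrors the Slurm flags: --tp_size 2, --gpu_mem_util 0.90, --dtype bfloat16.
llm = LLM(
    model="/path/to/qwen2.5_vl_32B_model_weights_directory",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
    dtype="bfloat16",
)
params = SamplingParams(temperature=0.0, max_tokens=1024)  # --max_new_tokens 1024

# One request per subfigure: the full compound caption as text plus the subfigure image.
# Placeholder prompt; the script applies the actual Qwen2.5-VL chat template.
request = {
    "prompt": (
        "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
        "Extract the verbatim subcaption for this subfigure from the caption: ...<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "multi_modal_data": {"image": Image.open("/path/to/subfigure.png")},
}
outputs = llm.generate([request], sampling_params=params)
sub_caption = outputs[0].outputs[0].text  # the <caption>...</caption> span is regex-extracted from this
```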

---

### 2) Summary generation (Qwen2.5-14B-Instruct)

After Stage 1 finishes (CSV now has `sub_caption`), edit and submit:

```bash
sbatch run_vllm_summary_inference.sh
```

**Slurm script reference** (`run_vllm_summary_inference.sh`): runs `generate_summary_vllm.py` and writes `summary` back into the same CSV.
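
The prompt for this stage is assembled per row from the three input columns; a rough sketch with assumed wording (the exact template is defined inside `generate_summary_vllm.py`):

```python
def build_summary_prompt(caption: str, sub_caption: str, image_context: str) -> str:
    """Hypothetical prompt template; the real wording lives in the script."""
    return (
        "Full figure caption:\n" + caption + "\n\n"
        "Subfigure caption:\n" + sub_caption + "\n\n"
        "Context passage:\n" + image_context + "\n\n"
        "Write a focused summary of the parts of the context relevant to this subfigure, "
        "wrapped in <summary>...</summary> tags."
    )
```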

---

### 3) Modality label generation (Qwen2.5-VL-32B-Instruct)

Edit the Slurm script to point to:

* your python file path
* your CSV/JSONL path (`--data_path`)
* your model weights path (`--model_dir`)
* any desired batch size / tensor-parallel (`--tp_size`) settings

Then submit:

```bash
sbatch run_vllm_modality_inference.sh
```

**Slurm script reference** (`run_vllm_modality_inference.sh`): runs `generate_modality_labels_vllm.py` and writes `L0_label`, `L1_label`, and `L2_label` back into the same JSONL file.
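
This stage uses the same overwrite-in-place checkpointing, just against a JSONL file; a minimal sketch assuming pandas and the same hypothetical batching as above:

```python
import pandas as pd

# JSONL variant of the checkpoint pattern: read, fill label columns, write back.
df = pd.read_json("/path/to/data.jsonl", lines=True)
for col in ("L0_label", "L1_label", "L2_label"):
    if col not in df.columns:
        df[col] = ""

# ... generate L2 labels with the VLM, then roll L1/L0 up from the predefined mapping ...

df.to_json("/path/to/data.jsonl", orient="records", lines=True)
```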

---

## Notes

* **Paths:** All Slurm scripts include placeholder paths like `/path/to/...` — replace them before submitting.
* **GPU selection:** All scripts set `CUDA_VISIBLE_DEVICES=0,1` and use `--tp_size 2` to shard across 2 GPUs.
* **Checkpointing:** All scripts checkpoint periodically by writing the partially filled file back to `--data_path`.
* **Output formatting:** subcaptions are extracted from `<caption>...</caption>` and summaries from `<summary>...</summary>` via regex (see the sketch below).
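
A minimal sketch of that extraction (the exact patterns live in the scripts; this assumes one tag pair per generation):

```python
import re

def extract_tagged(text: str, tag: str) -> str:
    """Return the content of <tag>...</tag>, or an empty string if the tag is missing."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""

# Example: pull the subcaption out of a raw generation.
raw = "Sure, here it is: <caption>Axial CT of the chest (arrow: nodule).</caption>"
print(extract_tagged(raw, "caption"))  # -> "Axial CT of the chest (arrow: nodule)."
```
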
@@ -0,0 +1,34 @@
#!/bin/bash
#SBATCH --job-name=pmc-modality-qwen32b
#SBATCH --partition=a100
#SBATCH --qos=scavenger
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=59G
#SBATCH --output=qwen32b-modality.%j.out

# Activate your environment

echo "Script Run Start!"
nvidia-smi

#module load cuda-12.4
module load gcc-12.3.0
gcc --version

source ~/envs/exp/bin/activate # Adjust this path to your virtual environment

echo "Module Loaded and Environment Activated!"


CUDA_VISIBLE_DEVICES=0,1 \
python /path/to/generate_modality_labels_vllm.py \
--data_path /path/to/data \
--model_dir /path/to/Qwen2.5-VL-32B-Instruct \
--batch_size 512 \
--max_new_tokens 128 \
--tp_size 2 \
--gpu_mem_util 0.90 \
--dtype bfloat16
@@ -0,0 +1,34 @@
#!/bin/bash
#SBATCH --job-name=pmc-subcaption-qwen32b
#SBATCH --partition=a100
#SBATCH --qos=scavenger
#SBATCH --time=1-00:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=59G
#SBATCH --output=qwen32b-subcap.%j.out

# Activate your environment

echo "Script Run Start!"
nvidia-smi

#module load cuda-12.4
module load gcc-12.3.0
gcc --version

source ~/envs/exp/bin/activate # Adjust this path to your virtual environment

echo "Module Loaded and Environment Activated!"

# Specify which GPUs to use
CUDA_VISIBLE_DEVICES=0,1 \
python /path/to/generate_subcaption_vllm.py \
--data_path /path/to/data.csv \
--model_dir /path/to/qwen2.5_vl_32B_model_weights_directory \
--batch_size 32 \
--max_new_tokens 1024 \
--tp_size 2 \
--gpu_mem_util 0.90 \
--dtype bfloat16
@@ -0,0 +1,33 @@
#!/bin/bash
#SBATCH --job-name=summary-pmc
#SBATCH --partition=a40
#SBATCH --qos=scavenger
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=43G
#SBATCH --output=qwen14b-summary.%j.out

echo "Script Run Start!"
nvidia-smi

#module load cuda-12.4
module load gcc-12.3.0
gcc --version

source ~/envs/exp2/bin/activate # Adjust this path to your virtual environment

echo "Module Loaded and Environment Activated!"

# Specify which GPUs to use
CUDA_VISIBLE_DEVICES=0,1 \
python /path/to/generate_summary_vllm.py \
--data_path /path/to/data.csv \
--model_dir /path/to/qwen2.5_14b_instruct_model_weights \
--batch_size 1024 \
--max_new_tokens 256 \
--tp_size 2 \
--gpu_mem_util 0.90 \
--dtype bfloat16
