LongCLIP Checkpoint Conversion Guide

This document describes how to convert an original LongCLIP checkpoint (.pt format) to the HuggingFace Transformers format.

Overview

The original LongCLIP implementation uses its own checkpoint format. Converting it to a format compatible with HuggingFace Transformers makes the model usable through standard APIs such as AutoModel.from_pretrained().

Original Checkpoint Structure

State Dict Keys

The original LongCLIP checkpoints (longclip-B.pt, longclip-L.pt) contain the following keys:

Text Encoder

# Token embeddings
"token_embedding.weight"                              # [vocab_size, text_hidden_size]

# Dual positional embeddings (LongCLIP-specific)
"positional_embedding"                                 # [248, text_hidden_size] - Base embeddings
"positional_embedding_res"                             # [248, text_hidden_size] - Residual embeddings

# Transformer layers (N layers)
"transformer.resblocks.{i}.ln_1.weight"               # Layer norm 1
"transformer.resblocks.{i}.ln_1.bias"
"transformer.resblocks.{i}.attn.in_proj_weight"       # Combined Q,K,V projection
"transformer.resblocks.{i}.attn.in_proj_bias"
"transformer.resblocks.{i}.attn.out_proj.weight"      # Output projection
"transformer.resblocks.{i}.attn.out_proj.bias"
"transformer.resblocks.{i}.ln_2.weight"               # Layer norm 2
"transformer.resblocks.{i}.ln_2.bias"
"transformer.resblocks.{i}.mlp.c_fc.weight"           # MLP first layer
"transformer.resblocks.{i}.mlp.c_fc.bias"
"transformer.resblocks.{i}.mlp.c_proj.weight"         # MLP second layer
"transformer.resblocks.{i}.mlp.c_proj.bias"

# Final layer norm
"ln_final.weight"
"ln_final.bias"

# Text projection
"text_projection"                                      # [text_hidden_size, projection_dim]

Vision Encoder

# Patch embeddings
"visual.conv1.weight"                                  # [vision_hidden_size, 3, patch_size, patch_size]

# Position embeddings
"visual.class_embedding"                               # [vision_hidden_size]
"visual.positional_embedding"                          # [num_patches+1, vision_hidden_size]

# Pre-layer norm
"visual.ln_pre.weight"
"visual.ln_pre.bias"

# Transformer layers (N layers)
"visual.transformer.resblocks.{i}.ln_1.weight"
"visual.transformer.resblocks.{i}.ln_1.bias"
"visual.transformer.resblocks.{i}.attn.in_proj_weight"
"visual.transformer.resblocks.{i}.attn.in_proj_bias"
"visual.transformer.resblocks.{i}.attn.out_proj.weight"
"visual.transformer.resblocks.{i}.attn.out_proj.bias"
"visual.transformer.resblocks.{i}.ln_2.weight"
"visual.transformer.resblocks.{i}.ln_2.bias"
"visual.transformer.resblocks.{i}.mlp.c_fc.weight"
"visual.transformer.resblocks.{i}.mlp.c_fc.bias"
"visual.transformer.resblocks.{i}.mlp.c_proj.weight"
"visual.transformer.resblocks.{i}.mlp.c_proj.bias"

# Post-layer norm
"visual.ln_post.weight"
"visual.ln_post.bias"

# Visual projection
"visual.proj"                                          # [vision_hidden_size, projection_dim]

Shared

# Logit scale (temperature parameter)
"logit_scale"                                          # scalar

HuggingFace Transformers Format Structure

Model Architecture

LongCLIPModel
├── text_model (LongCLIPTextModel)
│   └── text_model (LongCLIPTextTransformer)
│       ├── embeddings (LongCLIPTextEmbeddings)
│       │   ├── token_embedding
│       │   ├── position_embedding           # Base positional embeddings
│       │   └── position_embedding_res       # Residual positional embeddings
│       ├── encoder
│       │   └── layers[i]
│       │       ├── layer_norm1
│       │       ├── self_attn
│       │       │   ├── q_proj
│       │       │   ├── k_proj
│       │       │   ├── v_proj
│       │       │   └── out_proj
│       │       ├── layer_norm2
│       │       └── mlp
│       │           ├── fc1
│       │           └── fc2
│       └── final_layer_norm
├── text_projection
│
├── vision_model (LongCLIPVisionModel)
│   └── vision_model (CLIPVisionTransformer)
│       ├── embeddings
│       │   ├── patch_embedding
│       │   ├── class_embedding
│       │   └── position_embedding
│       ├── pre_layrnorm
│       ├── encoder
│       │   └── layers[i]
│       │       ├── layer_norm1
│       │       ├── self_attn (Q,K,V,out_proj)
│       │       ├── layer_norm2
│       │       └── mlp (fc1, fc2)
│       └── post_layernorm
├── visual_projection
│
└── logit_scale

Key Mapping

Text Encoder

Original	HuggingFace	Notes
`token_embedding.weight`	`text_model.text_model.embeddings.token_embedding.weight`	Copied as-is
`positional_embedding`	`text_model.text_model.embeddings.position_embedding.weight`	Base embeddings
`positional_embedding_res`	`text_model.text_model.embeddings.position_embedding_res`	Residual embeddings (Parameter)
`transformer.resblocks.{i}.ln_1.*`	`text_model.text_model.encoder.layers[i].layer_norm1.*`	Layer norm 1
`transformer.resblocks.{i}.attn.in_proj_*`	`text_model.text_model.encoder.layers[i].self_attn.{q,k,v}_proj.*`	Must be split into three
`transformer.resblocks.{i}.attn.out_proj.*`	`text_model.text_model.encoder.layers[i].self_attn.out_proj.*`	Output projection
`transformer.resblocks.{i}.ln_2.*`	`text_model.text_model.encoder.layers[i].layer_norm2.*`	Layer norm 2
`transformer.resblocks.{i}.mlp.c_fc.*`	`text_model.text_model.encoder.layers[i].mlp.fc1.*`	MLP first layer
`transformer.resblocks.{i}.mlp.c_proj.*`	`text_model.text_model.encoder.layers[i].mlp.fc2.*`	MLP second layer
`ln_final.*`	`text_model.text_model.final_layer_norm.*`	Final layer norm
`text_projection`	`text_projection.weight`	Transpose required (.T)

Vision Encoder

Original	HuggingFace	Notes
`visual.conv1.weight`	`vision_model.vision_model.embeddings.patch_embedding.weight`	Patch embedding
`visual.class_embedding`	`vision_model.vision_model.embeddings.class_embedding`	CLS token
`visual.positional_embedding`	`vision_model.vision_model.embeddings.position_embedding.weight`	Position embeddings
`visual.ln_pre.*`	`vision_model.vision_model.pre_layrnorm.*`	Pre-layer norm
`visual.transformer.resblocks.{i}.ln_1.*`	`vision_model.vision_model.encoder.layers[i].layer_norm1.*`	Layer norm 1
`visual.transformer.resblocks.{i}.attn.in_proj_*`	`vision_model.vision_model.encoder.layers[i].self_attn.{q,k,v}_proj.*`	Must be split into three
`visual.transformer.resblocks.{i}.attn.out_proj.*`	`vision_model.vision_model.encoder.layers[i].self_attn.out_proj.*`	Output projection
`visual.transformer.resblocks.{i}.ln_2.*`	`vision_model.vision_model.encoder.layers[i].layer_norm2.*`	Layer norm 2
`visual.transformer.resblocks.{i}.mlp.c_fc.*`	`vision_model.vision_model.encoder.layers[i].mlp.fc1.*`	MLP first layer
`visual.transformer.resblocks.{i}.mlp.c_proj.*`	`vision_model.vision_model.encoder.layers[i].mlp.fc2.*`	MLP second layer
`visual.ln_post.*`	`vision_model.vision_model.post_layernorm.*`	Post-layer norm
`visual.proj`	`visual_projection.weight`	Transpose required (.T)

Shared

Original	HuggingFace	Notes
`logit_scale`	`logit_scale`	Temperature parameter
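
The one-to-one renames in the tables above can be expressed as a small lookup plus one regular expression. The following is a minimal sketch rather than the conversion script itself: `rename_key`, `DIRECT`, `PREFIXES`, and `BLOCK` are names invented for this illustration, and it assumes that HF state-dict keys address layers as `layers.{i}` (written `layers[i]` in the tables). Keys that need a split or a transpose deliberately return None so that they are handled separately.

```python
import re
from typing import Optional

TEXT = "text_model.text_model"
VISION = "vision_model.vision_model"

# Keys copied one-to-one (no split, no transpose).
DIRECT = {
    "token_embedding.weight": f"{TEXT}.embeddings.token_embedding.weight",
    "positional_embedding": f"{TEXT}.embeddings.position_embedding.weight",
    "positional_embedding_res": f"{TEXT}.embeddings.position_embedding_res",
    "visual.conv1.weight": f"{VISION}.embeddings.patch_embedding.weight",
    "visual.class_embedding": f"{VISION}.embeddings.class_embedding",
    "visual.positional_embedding": f"{VISION}.embeddings.position_embedding.weight",
    "logit_scale": "logit_scale",
}
# Module prefixes whose .weight/.bias suffix is kept unchanged.
PREFIXES = {
    "ln_final": f"{TEXT}.final_layer_norm",
    "visual.ln_pre": f"{VISION}.pre_layrnorm",  # spelling as used in the mapping table
    "visual.ln_post": f"{VISION}.post_layernorm",
}
# Sub-modules inside each transformer block.
BLOCK = {
    "ln_1": "layer_norm1",
    "ln_2": "layer_norm2",
    "attn.out_proj": "self_attn.out_proj",
    "mlp.c_fc": "mlp.fc1",
    "mlp.c_proj": "mlp.fc2",
}
BLOCK_RE = re.compile(r"^(visual\.)?transformer\.resblocks\.(\d+)\.(.+)\.(weight|bias)$")


def rename_key(key: str) -> Optional[str]:
    """Return the HF key for a one-to-one rename, or None when the tensor
    needs special handling (in_proj split, projection transpose)."""
    if key in DIRECT:
        return DIRECT[key]
    module, _, suffix = key.rpartition(".")
    if module in PREFIXES:
        return f"{PREFIXES[module]}.{suffix}"
    match = BLOCK_RE.match(key)
    if match and match.group(3) in BLOCK:
        root = VISION if match.group(1) else TEXT
        return f"{root}.encoder.layers.{match.group(2)}.{BLOCK[match.group(3)]}.{match.group(4)}"
    return None
```

A converter built this way would loop over the original state dict, copy every tensor for which `rename_key` returns a name, and route the remaining keys (`in_proj_*`, `text_projection`, `visual.proj`) to dedicated handlers.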

Key Conversion Steps

1. Dual Positional Embeddings (LongCLIP-specific)

The most important feature of LongCLIP is its dual positional embeddings mechanism:

# From the original checkpoint
positional_embedding = state_dict["positional_embedding"]      # [248, hidden_size]
positional_embedding_res = state_dict["positional_embedding_res"]  # [248, hidden_size]

# Into the HF model
hf_model.text_model.text_model.embeddings.position_embedding.weight.data = positional_embedding
hf_model.text_model.text_model.embeddings.position_embedding_res.data = positional_embedding_res

Notes:

- position_embedding is a regular nn.Embedding, so the tensor goes into its .weight
- position_embedding_res is an nn.Parameter (a bare tensor), so it is assigned directly
- The two are applied through mask1/mask2 (see the implementation)
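
How the two tables might combine at run time can be sketched as follows. This is an illustration under stated assumptions, not the model code: it assumes that `mask1` selects the first `keep` positions and `mask2` the remaining ones, with `keep = 20` taken from the original Long-CLIP implementation as far as I understand it. `combine_positional` and `keep` are names invented here, and the implementation remains the authoritative reference.

```python
import torch


def combine_positional(base: torch.Tensor, res: torch.Tensor, keep: int = 20) -> torch.Tensor:
    """Blend base and residual positional embeddings with complementary masks."""
    num_positions = base.shape[0]
    mask1 = torch.zeros(num_positions, 1)
    mask1[:keep] = 1.0      # positions served by the base table (assumed split point)
    mask2 = 1.0 - mask1     # positions served by the residual table
    return base * mask1 + res * mask2


# Small hidden size for illustration; 248 matches the checkpoint's position count.
base = torch.randn(248, 8)
res = torch.randn(248, 8)
pos = combine_positional(base, res)
```

Under this assumption each position reads from exactly one of the two tables, which is why both tensors have to be copied exactly: an error in either one affects a disjoint range of token positions.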

2. Attention Layer: Splitting in_proj_weight

The original packs Q, K, and V into a single tensor:

# Original: [3*hidden_size, hidden_size]
in_proj_weight = state_dict[f"{prefix}.attn.in_proj_weight"]
in_proj_bias = state_dict[f"{prefix}.attn.in_proj_bias"]

# Split into three
q_proj, k_proj, v_proj = in_proj_weight.chunk(3, dim=0)
q_proj_bias, k_proj_bias, v_proj_bias = in_proj_bias.chunk(3, dim=0)

# Copy into the HF model
hf_attn.q_proj.weight.data = q_proj
hf_attn.q_proj.bias.data = q_proj_bias
hf_attn.k_proj.weight.data = k_proj
hf_attn.k_proj.bias.data = k_proj_bias
hf_attn.v_proj.weight.data = v_proj
hf_attn.v_proj.bias.data = v_proj_bias
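
A quick way to convince yourself that the split is lossless is to compare the packed projection with three separate linear layers on random input. The sketch below uses made-up sizes and random weights and touches neither the checkpoint nor the HF model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size = 16
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)
in_proj_bias = torch.randn(3 * hidden_size)

# Split the packed projection along the output dimension: rows [0:h] are Q,
# [h:2h] are K, [2h:3h] are V.
q_w, k_w, v_w = in_proj_weight.chunk(3, dim=0)
q_b, k_b, v_b = in_proj_bias.chunk(3, dim=0)

x = torch.randn(2, 5, hidden_size)  # [batch, seq, hidden]

# Packed path: one matmul, then split the result into Q, K, V.
q_ref, k_ref, v_ref = F.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)

# Split path: three separate linear projections, as in the HF attention module.
q = F.linear(x, q_w, q_b)
k = F.linear(x, k_w, k_b)
v = F.linear(x, v_w, v_b)

split_matches = all(
    torch.allclose(a, b, atol=1e-5) for a, b in [(q, q_ref), (k, k_ref), (v, v_ref)]
)
```

Note that `chunk` returns views into the packed tensor. If the pieces are serialized later, calling `.clone()` on each one before assignment may be a prudent precaution against shared-storage issues.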

3. Transposing the Projection Matrices

The text projection and the visual projection must be transposed:

# Original: [hidden_size, projection_dim]
text_projection = state_dict["text_projection"]
visual_projection = state_dict["visual.proj"]

# HF model: [projection_dim, hidden_size] (weight of a Linear layer)
hf_model.text_projection.weight.data = text_projection.T.contiguous()
hf_model.visual_projection.weight.data = visual_projection.T.contiguous()

Reason: the original computes features @ projection, whereas HF's nn.Linear computes x @ weight.T.
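
The equivalence can be checked numerically on a toy example. This is a sketch with arbitrary sizes (12 and 8 are chosen only so that the matrix is visibly non-square), not part of the conversion script.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden_size, projection_dim = 12, 8
projection = torch.randn(hidden_size, projection_dim)  # original layout
features = torch.randn(4, hidden_size)

# A bias-free Linear layer stores its weight as [out_features, in_features].
linear = nn.Linear(hidden_size, projection_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(projection.T)  # copy_ checks that the shapes agree

expected = features @ projection       # how the original applies the projection
actual = linear(features).detach()     # how the HF model applies it
```

Using `copy_` inside `torch.no_grad()` instead of assigning to `.data` has the side benefit of raising immediately when the shapes disagree.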

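4. Handling Layer Norm

Layer norm weight and bias are copied as-is. The helper shown here is named copy_linear, but the same weight/bias assignment applies to layer norm modules: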

def copy_linear(hf_linear, pt_weight, pt_bias):
    hf_linear.weight.data = pt_weight
    hf_linear.bias.data = pt_bias

Automatic Architecture Detection

The conversion script detects the model configuration from the checkpoint automatically:

Text Encoder

# Hidden size
text_hidden_size = state_dict["ln_final.weight"].shape[0]  # 512 (B) or 768 (L)

# Number of layers (count unique layer indices)
text_layers = set()
for key in state_dict.keys():
    if key.startswith("transformer.resblocks."):
        layer_num = int(key.split(".")[2])
        text_layers.add(layer_num)
text_num_layers = len(text_layers)  # 12

# Attention heads
text_num_heads = text_hidden_size // 64  # head_dim is fixed at 64

# Intermediate size (MLP)
text_intermediate_size = state_dict["transformer.resblocks.0.mlp.c_fc.weight"].shape[0]

# Vocab size
vocab_size = state_dict["token_embedding.weight"].shape[0]  # 49408

# Max positions
max_position_embeddings = state_dict["positional_embedding"].shape[0]  # 248

Vision Encoder

# Hidden size
vision_hidden_size = state_dict["visual.ln_post.weight"].shape[0]  # 768 (B) or 1024 (L)

# Number of layers
vision_layers = set()
for key in state_dict.keys():
    if key.startswith("visual.transformer.resblocks."):
        layer_num = int(key.split(".")[3])
        vision_layers.add(layer_num)
vision_num_layers = len(vision_layers)  # 12 (B) or 24 (L)

# Patch size
vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]  # 16 (B) or 14 (L)

# Attention heads
vision_num_heads = vision_hidden_size // 64

# Intermediate size
vision_intermediate_size = state_dict["visual.transformer.resblocks.0.mlp.c_fc.weight"].shape[0]

Projection Dimension

projection_dim = state_dict["text_projection"].shape[1]  # 512 (B) or 768 (L); text_projection is [text_hidden_size, projection_dim]
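
The detection rules above can be collected into a single function. The version below is a sketch that works on a plain mapping from key to shape tuple, so it runs without a checkpoint. `detect_config`, `count_layers`, and the returned dictionary layout are hypothetical, and the `image_size` entry is an extra derived from the `[num_patches+1, vision_hidden_size]` shape of `visual.positional_embedding`. The projection dimension is read from the second axis of `text_projection`, following its documented `[text_hidden_size, projection_dim]` layout.

```python
import math
from typing import Dict, Tuple

Shapes = Dict[str, Tuple[int, ...]]


def count_layers(shapes: Shapes, prefix: str) -> int:
    """Count unique block indices among keys that start with `prefix`."""
    index_pos = prefix.count(".")  # the index follows the last dot of the prefix
    return len({int(k.split(".")[index_pos]) for k in shapes if k.startswith(prefix)})


def detect_config(shapes: Shapes, head_dim: int = 64) -> Dict[str, int]:
    text_hidden = shapes["ln_final.weight"][0]
    vision_hidden = shapes["visual.ln_post.weight"][0]
    patch = shapes["visual.conv1.weight"][-1]
    # The vision positional table has num_patches + 1 rows (one for the class token).
    grid = math.isqrt(shapes["visual.positional_embedding"][0] - 1)
    return {
        "text_hidden_size": text_hidden,
        "text_num_layers": count_layers(shapes, "transformer.resblocks."),
        "text_num_heads": text_hidden // head_dim,
        "text_intermediate_size": shapes["transformer.resblocks.0.mlp.c_fc.weight"][0],
        "vocab_size": shapes["token_embedding.weight"][0],
        "max_position_embeddings": shapes["positional_embedding"][0],
        "vision_hidden_size": vision_hidden,
        "vision_num_layers": count_layers(shapes, "visual.transformer.resblocks."),
        "vision_num_heads": vision_hidden // head_dim,
        "vision_patch_size": patch,
        "vision_intermediate_size": shapes["visual.transformer.resblocks.0.mlp.c_fc.weight"][0],
        "image_size": grid * patch,
        "projection_dim": shapes["text_projection"][1],
    }


# Synthetic shapes matching the LongCLIP-B table; only the keys read above are
# included. The 197 rows assume a 224x224 input (14 * 14 patches + 1).
shapes_b: Shapes = {
    "ln_final.weight": (512,),
    "token_embedding.weight": (49408, 512),
    "positional_embedding": (248, 512),
    "text_projection": (512, 512),
    "visual.ln_post.weight": (768,),
    "visual.conv1.weight": (768, 3, 16, 16),
    "visual.positional_embedding": (197, 768),
}
for i in range(12):
    shapes_b[f"transformer.resblocks.{i}.mlp.c_fc.weight"] = (2048, 512)
    shapes_b[f"visual.transformer.resblocks.{i}.mlp.c_fc.weight"] = (3072, 768)

config_b = detect_config(shapes_b)
```

With a real checkpoint the input could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`.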

Configuration by Model Size

LongCLIP-B (Base)

Component	Value
Text hidden size	512
Text layers	12
Text heads	8
Text intermediate	2048
Vision hidden size	768
Vision layers	12
Vision heads	12
Vision patch size	16
Vision intermediate	3072
Projection dim	512
Max positions	248
Vocab size	49408

LongCLIP-L (Large)

Component	Value
Text hidden size	768
Text layers	12
Text heads	12
Text intermediate	3072
Vision hidden size	1024
Vision layers	24
Vision heads	16
Vision patch size	14
Vision intermediate	4096
Projection dim	768
Max positions	248
Vocab size	49408

Conversion Procedure

Usage

# Convert LongCLIP-B
python scripts/convert_longclip_to_hf.py \
    --checkpoint_path checkpoints/longclip-B.pt \
    --output_path ./longclip-base-hf

# Convert LongCLIP-L
python scripts/convert_longclip_to_hf.py \
    --checkpoint_path checkpoints/longclip-L.pt \
    --output_path ./longclip-large-hf
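
For readers curious how such a command line might be wired up, here is a hypothetical argparse skeleton. Only the two flag names come from the commands above; the function name, help strings, and the use of `Path` are assumptions and may differ from the actual script.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the two flags used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Convert a LongCLIP checkpoint to HuggingFace format"
    )
    parser.add_argument("--checkpoint_path", type=Path, required=True,
                        help="Original .pt checkpoint")
    parser.add_argument("--output_path", type=Path, required=True,
                        help="Directory for the converted model")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--checkpoint_path", "checkpoints/longclip-B.pt", "--output_path", "./longclip-base-hf"]
)
```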

Conversion Process

1. Load the checkpoint

state_dict = torch.load(checkpoint_path, map_location="cpu")

2. Detect the configuration automatically

config = determine_config_from_checkpoint(state_dict)

3. Create the HF model
```
hf_model = LongCLIPModel(config)
```
4. Copy the weights
- Text model weights (including dual positional embeddings)
- Vision model weights
- Logit scale
5. Verify
- Forward pass with a 248-token test input
- NaN/Inf check
- Output shape check

6. Save

hf_model.save_pretrained(output_path)
config.save_pretrained(output_path)
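
The NaN/Inf and output shape checks in the verification step can be sketched as a small helper. `check_output` is a hypothetical name, and the random tensor stands in for a real model output; the actual script may perform these checks differently.

```python
import torch


def check_output(name: str, tensor: torch.Tensor, expected_shape: tuple) -> None:
    """Raise ValueError if the tensor has the wrong shape or non-finite values."""
    if tuple(tensor.shape) != tuple(expected_shape):
        raise ValueError(f"{name}: expected shape {expected_shape}, got {tuple(tensor.shape)}")
    if not torch.isfinite(tensor).all():
        raise ValueError(f"{name}: contains NaN or Inf")


# Stand-in for model output: a batch of 2 embeddings with projection_dim=512.
text_embeds = torch.randn(2, 512)
check_output("text_embeds", text_embeds, (2, 512))  # passes silently

# A single NaN is enough to trip the check.
bad = text_embeds.clone()
bad[0, 0] = float("nan")
try:
    check_output("text_embeds", bad, (2, 512))
    caught = False
except ValueError:
    caught = True
```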

Output Files

After conversion, the output directory contains the following files:

longclip-base-hf/
├── config.json           # Model configuration
└── model.safetensors     # Model weights (SafeTensors format)

or

├── config.json
└── pytorch_model.bin     # Model weights (PyTorch format)

Verification

Run the integration tests to confirm that the conversion is correct:

# Create baseline fixtures (original implementation)
uv run pytest tests/test_baseline.py -v

# Integration tests (compare against the converted model)
uv run pytest tests/test_integration.py -v

Tolerances

rtol: 1e-2 (1% relative tolerance)
atol: 1e-2 (0.01 absolute tolerance)

The baseline is stored in float16, so comparing it against float32 results requires somewhat looser tolerances.
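
To get a feel for the magnitudes involved, the sketch below rounds a value to half precision using only the standard library and applies the usual closeness criterion. The helper names are invented for this example. A single float16 rounding of a normal-range value introduces a relative error of at most 2^-11 (about 4.9e-4), so the tolerances above leave headroom beyond one rounding step.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def is_close(actual: float, expected: float, rtol: float = 1e-2, atol: float = 1e-2) -> bool:
    """Closeness criterion used by allclose-style checks: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


value = 0.1234567
stored = to_float16(value)  # what a float16 baseline would hold
rel_error = abs(stored - value) / abs(value)
within_tolerance = is_close(value, stored)
```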

Verification Items

✅ Text features match (single and batch)
✅ Image features match (single and batch)
✅ Similarity scores match
✅ 248 token context works correctly
✅ No NaN or Inf in outputs
✅ Logits symmetry (logits_per_image == logits_per_text.T)

Troubleshooting

Common Errors

1. KeyError: Layer not found

Problem: the layer count is wrong

Cause: the total number of keys was counted (each layer has several keys)

Solution: count unique layer indices

# ❌ Wrong
num_layers = len([k for k in state_dict if "resblocks" in k]) // 4

# ✅ Correct
layers = set()
for k in state_dict.keys():
    if k.startswith("transformer.resblocks."):
        layer_num = int(k.split(".")[2])
        layers.add(layer_num)
num_layers = len(layers)
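
The difference between the two approaches is easy to reproduce with key names alone. The sketch below rebuilds the per-block key names listed in the State Dict Keys section for a 12-layer text encoder and a 12-layer vision encoder, then applies both counting methods.

```python
# The twelve parameter names inside each transformer block.
SUFFIXES = [
    "ln_1.weight", "ln_1.bias",
    "attn.in_proj_weight", "attn.in_proj_bias",
    "attn.out_proj.weight", "attn.out_proj.bias",
    "ln_2.weight", "ln_2.bias",
    "mlp.c_fc.weight", "mlp.c_fc.bias",
    "mlp.c_proj.weight", "mlp.c_proj.bias",
]

# Key names only; tensor values are irrelevant for counting.
keys = [f"transformer.resblocks.{i}.{s}" for i in range(12) for s in SUFFIXES]
keys += [f"visual.transformer.resblocks.{i}.{s}" for i in range(12) for s in SUFFIXES]

# Wrong: the substring test also matches vision keys, and a block has 12 keys, not 4.
wrong = len([k for k in keys if "resblocks" in k]) // 4

# Correct: unique indices among text-encoder keys only.
right = len({int(k.split(".")[2]) for k in keys if k.startswith("transformer.resblocks.")})
```

Here `wrong` evaluates to 72 while `right` is 12, which is why a model built from the wrong count would ask for layers that do not exist in the checkpoint.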

2. Shape mismatch in projection

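Problem: RuntimeError: shape mismatch for text_projection

Cause: the transpose was omitted. When the hidden size and the projection dim are equal, as they are for text_projection in both model sizes listed in this guide, a missing transpose may raise no shape error at all and instead surface as mismatched features in the integration tests.

Solution:

# ❌ Wrong
hf_model.text_projection.weight.data = state_dict["text_projection"]

# ✅ Correct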
hf_model.text_projection.weight.data = state_dict["text_projection"].T.contiguous()

3. Integration test failures

Problem: results fall outside the test tolerances

Cause: the baseline is stored in float16

Solution: set appropriate tolerances

# For a float16 baseline
RTOL = 1e-2  # 1%
ATOL = 1e-2  # 0.01

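Summary

Key points of LongCLIP checkpoint conversion:

- Dual positional embeddings: the most important feature; both tensors must be copied exactly
- Splitting the attention in_proj: split the packed tensor into Q, K, and V
- Transposing the projections: text_projection and visual_projection
- Automatic architecture detection: count layers accurately
- Verification: confirm numerical compatibility with the integration tests

Conversion script: scripts/convert_longclip_to_hf.py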
