Conversation

@nyo16 nyo16 commented Jan 8, 2026

Summary

Add FP8 (8-bit floating point) dtype support for the E4M3FN and E5M2 formats. This enables reading and writing FP8-quantized model
weights from HuggingFace models such as Qwen3-FP8.

Changes

  • Add FP8 dtype constants (F8_E4M3FN, F8_E5M2) and type mappings
  • Handle the 3-tuple FP8 types {:f, 8, :e4m3fn} and {:f, 8, :e5m2} in:
    • dtype_from_string/1 - Parse "F8_E4M3" and "F8_E5M2" from safetensors headers
    • tensor_byte_size/1 - Calculate the byte size of FP8 tensors
    • tensor_to_iodata/1 - Serialize FP8 tensors
    • build_tensor/2 - Deserialize FP8 tensors
  • Support reading FP8 model files (e.g., Qwen/Qwen3-0.6B-FP8)
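For context on what the two formats encode, here is a minimal sketch, not part of the PR and written in Python rather than Elixir since the bit layouts are language-agnostic: E4M3FN uses 1 sign, 4 exponent, and 3 mantissa bits with bias 7 and no infinities (the all-ones pattern is NaN), while E5M2 uses 1/5/2 bits with bias 15 and IEEE-style infinities and NaNs.

```python
import math

def decode_e4m3fn(byte: int) -> float:
    """Decode one F8_E4M3FN byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
    The 'FN' (finite) variant has no infinities; S.1111.111 is the only NaN."""
    s = -1.0 if byte & 0x80 else 1.0
    e = (byte >> 3) & 0x0F
    m = byte & 0x07
    if e == 0x0F and m == 0x07:          # all-ones -> NaN (no inf in E4M3FN)
        return math.nan
    if e == 0:                           # subnormal: 2^-6 * (m/8)
        return s * 2.0 ** -6 * (m / 8.0)
    return s * 2.0 ** (e - 7) * (1.0 + m / 8.0)

def decode_e5m2(byte: int) -> float:
    """Decode one F8_E5M2 byte: 1 sign, 5 exponent (bias 15), 2 mantissa bits."""
    s = -1.0 if byte & 0x80 else 1.0
    e = (byte >> 2) & 0x1F
    m = byte & 0x03
    if e == 0x1F:                        # IEEE-style: inf when m == 0, else NaN
        return s * math.inf if m == 0 else math.nan
    if e == 0:                           # subnormal: 2^-14 * (m/4)
        return s * 2.0 ** -14 * (m / 4.0)
    return s * 2.0 ** (e - 15) * (1.0 + m / 4.0)

print(decode_e4m3fn(0x38))  # 1.0
print(decode_e4m3fn(0x7E))  # 448.0, the E4M3FN maximum finite value
print(decode_e5m2(0x3C))    # 1.0
```

The trade-off this illustrates: E4M3FN spends bits on precision (more mantissa) at the cost of range, which is why it is the common choice for weights, while E5M2 keeps the wider dynamic range.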

Test plan

  • Unit tests for FP8 type encoding/decoding
  • Integration test reading real FP8 model files
  • Verified with Qwen3-0.6B-FP8 model inference

Notes

This is the first PR in a series to enable native FP8 model inference:

  1. safetensors (this PR) - FP8 file I/O
  2. nx/exla - FP8 type system support
  3. bumblebee - FP8 model loading and inference

nyo16 added 5 commits January 6, 2026 00:13
Adds support for the F8_E4M3 and F8_E5M2 dtypes in the SafeTensors format,
enabling the loading of FP8-quantized models from HuggingFace.

Changes:
- Add {:f, 8, :e4m3fn} → "F8_E4M3" mapping
- Add {:f, 8, :e5m2} → "F8_E5M2" mapping
- Add {:f, 8} → "F8_E5M2" for backward compatibility
- Update dtype_to_type reverse mappings for fp8 formats

Enables loading models such as Qwen3-4B-Instruct-2507-FP8, which uses the
F8_E4M3 format for weights with fine-grained quantization.
- Test write/read for E4M3FN and E5M2 tensors
- Test type preservation in round-trip
- Test lazy loading with fp8 types
- Test byte size calculation
- Test dtype strings in SafeTensors header
- Add NX_PATH environment variable support for local development
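The commit bullets above describe both directions of the string/type mapping plus the byte-size rule. A hedged illustration follows; the real code is Elixir pattern-match clauses in the safetensors library, and this Python sketch only mirrors the mappings named in the commit messages:

```python
# Illustrative only: mirrors the commit's mappings between Nx-style type
# tuples and safetensors header dtype strings (actual library is Elixir).
TYPE_TO_DTYPE = {
    ("f", 8, "e4m3fn"): "F8_E4M3",
    ("f", 8, "e5m2"):   "F8_E5M2",
    ("f", 8):           "F8_E5M2",   # backward-compatible 2-tuple form
}
DTYPE_TO_TYPE = {
    "F8_E4M3": ("f", 8, "e4m3fn"),
    "F8_E5M2": ("f", 8, "e5m2"),
}

def tensor_byte_size(type_tuple, shape):
    """FP8 is one byte per element, so byte size is just the element count."""
    bits = type_tuple[1]
    n = 1
    for dim in shape:
        n *= dim
    return n * (bits // 8)

print(tensor_byte_size(("f", 8, "e4m3fn"), (1024, 768)))  # 786432
```

Note the asymmetry the mapping implies: two Elixir tuples can serialize to "F8_E5M2", but each header string deserializes to exactly one canonical 3-tuple.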
@josevalim
Contributor

Please remove the convo.txt :)

My suggestion is to break this into two. The first PR adds FP8 support, meaning E5M2. No need for additional tuples and steps.

Then a separate PR adds handling for unknown types. For now, the user should pass a separate function that receives the type and the value and builds the tensors.

Which types does Qwen use?

@nyo16
Author

nyo16 commented Jan 8, 2026

Qwen3 uses F8_E4M3.

@nyo16
Author

nyo16 commented Jan 8, 2026

OK, I will work on breaking this down. For Bumblebee and Nx I will open the PRs as drafts to open up the discussion.

@josevalim
Contributor

josevalim commented Jan 8, 2026

@nyo16 to get everyone on the same page: elixir-nx/nx#1657 (comment)

I think this PR will be straightforward once we add e4m3fn to Nx, no need for custom functions :)

