
Model trained on FP8 and NVFP4 but still saved in bf16 format #2058

@ayakakirima

Description


Describe the bug

I tried to finetune Qwen3-1.7B with an FP8 config and with an NVFP4 config, but after converting the checkpoint back to HF, the weights are still in bf16.

Steps/Code to reproduce bug

# convert to mcore format
python /root/megatron-test/Megatron-Bridge/examples/conversion/convert_checkpoints.py import \
                --hf-model Qwen/Qwen3-1.7B \
                --megatron-path /root/megatron-test/test-mcore
                
# training script: reuses the qwen3-next recipe at examples/recipes/qwen3_next/finetune_qwen3_next_80b_a3b.py,
# with only the model config swapped for Qwen3-1.7B
torchrun --nproc_per_node=1 /root/megatron-test/Megatron-Bridge/examples/recipes/finetune_qwen3_4b.py \
                --hf-path Qwen/Qwen3-1.7B \
                --pretrained-checkpoint /root/megatron-test/test-mcore \
                --data-path /root/megatron-test/data \
                --config-file /root/megatron-test/qwen3_next_4b_pretrain_override.yaml
-> resulting checkpoint is ~3.4 GB (no optimizer state):
/root/megatron-test/test-mcore-ckpt
├── iter_0000002
│   ├── __0_0.distcp
│   ├── __0_1.distcp
│   ├── common.pt
│   ├── metadata.json
│   ├── modelopt_run_config.yaml
│   ├── run_config.yaml
│   ├── tokenizer
│   │   ├── added_tokens.json
│   │   ├── chat_template.jinja
│   │   ├── merges.txt
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer_config.json
│   │   └── vocab.json
│   └── train_state.pt
├── latest_checkpointed_iteration.txt
├── latest_train_state.pt
└── latest_wandb_artifact_path.txt
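
For reference, the dtypes that were actually written into the torch_dist checkpoint can be read from its metadata. A minimal sketch using torch.distributed.checkpoint (the iteration path is the one from the tree above):

# inspect tensor dtypes recorded in the torch_dist (distcp) checkpoint metadata
from torch.distributed.checkpoint import FileSystemReader

reader = FileSystemReader("/root/megatron-test/test-mcore-ckpt/iter_0000002")
metadata = reader.read_metadata()
for name, item in metadata.state_dict_metadata.items():
    # tensor entries carry dtype/shape; bytes entries (e.g. rng state) do not
    props = getattr(item, "properties", None)
    if props is not None:
        print(name, props.dtype, tuple(item.size))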

# convert script
python /root/megatron-test/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
               --hf-model Qwen/Qwen3-1.7B \
               --megatron-path /root/megatron-test/test-mcore-ckpt \
               --hf-path /root/megatron-test/test-mcore2hf
-> The exported HF weights are in bf16, there is no quantization_config in config.json, and there are no FP8 weight keys in the safetensors file (~3.4 GB).
The same thing happens with the NVFP4 config.
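
A quick way to confirm what the export produced (a minimal sketch with the safetensors library; the single-shard filename model.safetensors is an assumption and may differ if the export is sharded):

# check tensor dtypes and quantization_config in the exported HF checkpoint
import json
from collections import Counter
from safetensors import safe_open

export_dir = "/root/megatron-test/test-mcore2hf"

with safe_open(f"{export_dir}/model.safetensors", framework="pt") as f:
    print(Counter(str(f.get_tensor(k).dtype) for k in f.keys()))  # all torch.bfloat16 here

with open(f"{export_dir}/config.json") as f:
    print(json.load(f).get("quantization_config"))  # None here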

Final training config (FP8):

INFO:megatron.core.timers:(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (2133.77, 2133.77)
    train/valid/test-data-iterators-setup ..........: (4752.22, 4752.22)
------- Task Configuration -------
_target_: megatron.bridge.training.config.ConfigContainer
checkpoint:
  _target_: megatron.bridge.training.config.CheckpointConfig
  async_save: false
  ckpt_assume_constant_structure: false
  ckpt_convert_format: null
  ckpt_convert_save: null
  ckpt_format: torch_dist
  ckpt_step: null
  dist_ckpt_optim_fully_reshardable: true
  dist_ckpt_strictness: log_all
  distrib_optim_fully_reshardable_mem_efficient: false
  exit_on_missing_checkpoint: false
  finetune: false
  fully_parallel_load: false
  fully_parallel_save: true
  load: /root/megatron-test/nemo_experiments/default/checkpoints
  load_main_params_from_ckpt: false
  load_optim: true
  load_rng: true
  most_recent_k: -1
  non_persistent_ckpt_type: global
  non_persistent_global_ckpt_dir: null
  non_persistent_local_ckpt_algo: fully_parallel
  non_persistent_local_ckpt_dir: null
  non_persistent_save_interval: null
  pretrained_checkpoint: /root/megatron-test/test-mcore
  replication: false
  replication_factor: 2
  replication_jump: null
  save: /root/megatron-test/test-mcore-ckpt
  save_interval: 2
  save_optim: false
  save_rng: true
  save_tokenizer_assets: true
  strict_fsdp_dtensor_load: false
  use_checkpoint_args: false
  use_persistent_ckpt_worker: true
comm_overlap: null
dataset:
  _target_: megatron.bridge.training.config.FinetuningDatasetConfig
  data_sharding: true
  dataloader_type: batch
  dataset_kwargs:
    answer_only_loss: false
    chat: true
    pad_to_max_length: true
    tool_schemas:
    - function:
        description: Execute Python code in a stateful Jupyter notebook environment.
          Use this tool when you need to perform calculations, verify results, or
          solve complex mathematical problems. Python will respond with the output
          of the execution or time out after 80.0 seconds.
        name: python
        parameters:
          properties:
            code:
              description: The Python code to execute
              type: string
          required:
          - code
          type: object
      type: function
    use_hf_tokenizer_chat_template: true
  dataset_root: /root/megatron-test/data
  do_test: true
  do_validation: true
  max_train_samples: null
  memmap_workers: 12
  num_workers: 12
  packed_sequence_specs:
    _target_: megatron.bridge.data.datasets.packed_sequence.PackedSequenceSpecs
    packed_metadata_path: null
    packed_sequence_size: 40960
    packed_train_data_path: null
    packed_val_data_path: null
    pad_cu_seqlens: false
    pad_seq_to_mult: 1
    tokenizer_model_name: hihi
  persistent_workers: false
  pin_memory: true
  seed: 5678
  seq_length: 40960
  trust_remote_code: null
ddp:
  _target_: megatron.core.distributed.distributed_data_parallel_config.DistributedDataParallelConfig
  align_param_gather: false
  average_in_collective: false
  bucket_size: 40000000
  check_for_large_grads: false
  check_for_nan_in_grad: true
  data_parallel_sharding_strategy: no_shard
  delay_wgrad_compute: false
  disable_symmetric_registration: false
  fp8_param_gather: true
  fsdp_double_buffer: false
  fsdp_manual_registration: false
  grad_reduce_in_fp32: false
  gradient_reduce_div_fusion: true
  keep_fp8_transpose_cache: false
  nccl_ub: false
  num_distributed_optimizer_instances: 1
  outer_dp_sharding_strategy: no_shard
  overlap_grad_reduce: true
  overlap_param_gather: false
  pad_buckets_for_high_nccl_busbw: false
  preserve_fp32_weights: true
  reduce_scatter_with_fp32_accumulation: false
  reuse_grad_buf_for_mxfp8_param_ag: false
  suggested_communication_unit_size: null
  use_custom_fsdp: false
  use_distributed_optimizer: true
  use_megatron_fsdp: false
dist:
  _target_: megatron.bridge.training.config.DistributedInitConfig
  align_grad_reduce: true
  disable_jit_fuser: false
  distributed_backend: nccl
  distributed_timeout_minutes: 10
  distributed_timeout_seconds_after_init: null
  enable_megatron_core_experimental: true
  external_gpu_device_mapping: false
  high_priority_stream_groups: null
  lazy_init: false
  local_rank: 0
  nccl_communicator_config_path: null
  sharp_enabled_group: null
  use_gloo_process_groups: true
  use_megatron_fsdp: false
  use_sharp: false
  use_torch_fsdp2: false
  use_tp_pp_dp_mapping: false
ft: null
inprocess_restart: null
logger:
  _target_: megatron.bridge.training.config.LoggerConfig
  filter_warnings: true
  log_energy: false
  log_interval: 1
  log_l2_norm_grad_to_tensorboard: false
  log_loss_scale_to_tensorboard: true
  log_memory_to_tensorboard: true
  log_params_norm: false
  log_progress: false
  log_runtime_to_tensorboard: false
  log_throughput: true
  log_throughput_to_tensorboard: true
  log_timers_to_tensorboard: true
  log_validation_ppl_to_tensorboard: true
  log_world_size_to_tensorboard: false
  logging_level: 20
  memory_keys: null
  modules_to_filter: null
  runtime_time_unit: hours
  save_config_filepath: null
  set_level_for_all_loggers: false
  tensorboard_dir: /root/megatron-test/nemo_experiments/default/tb_logs
  tensorboard_log_interval: 1
  tensorboard_queue_size: 1000
  throughput_window_size: 100
  timing_log_level: 0
  timing_log_option: minmax
  wandb_entity: null
  wandb_exp_name: qwen3-1.7b-1xH100
  wandb_project: megatron-bridge
  wandb_save_dir: /root/megatron-test/wandb
mixed_precision:
  _target_: megatron.bridge.training.mixed_precision.MixedPrecisionConfig
  autocast_dtype: null
  autocast_enabled: false
  bf16: true
  first_last_layers_bf16: true
  fp16: false
  fp32: false
  fp4: null
  fp4_recipe: nvfp4
  fp8: hybrid
  fp8_amax_compute_algo: most_recent
  fp8_amax_history_len: 1
  fp8_dot_product_attention: false
  fp8_margin: 0
  fp8_multi_head_attention: false
  fp8_param: true
  fp8_param_gather: true
  fp8_recipe: blockwise
  fp8_wgrad: true
  grad_reduce_in_fp32: false
  hysteresis: 2
  initial_loss_scale: 4294967296
  loss_scale: null
  loss_scale_window: 1000
  min_loss_scale: 1.0
  num_layers_at_end_in_bf16: 1
  num_layers_at_start_in_bf16: 1
  params_dtype:
    _call_: false
    _target_: torch.bfloat16
  pipeline_dtype:
    _call_: false
    _target_: torch.bfloat16
  reuse_grad_buf_for_mxfp8_param_ag: false
model:
  _target_: megatron.bridge.models.qwen.qwen_provider.Qwen3ModelProvider
  account_for_embedding_in_pipeline_split: false
  account_for_loss_in_pipeline_split: false
  activation_func:
    _call_: false
    _target_: torch.nn.functional.silu
  activation_func_clamp_value: null
  activation_func_fp8_input_store: false
  add_bias_linear: false
  add_qkv_bias: false
  apply_query_key_layer_scaling: false
  apply_residual_connection_post_layernorm: false
  apply_rope_fusion: false
  async_tensor_model_parallel_allreduce: false
  attention_backend:
    _args_:
    - 5
    _call_: true
    _target_: megatron.core.transformer.enums.AttnBackend
  attention_dropout: 0.0
  attention_output_gate: false
  attention_softmax_in_fp32: false
  autocast_dtype:
    _call_: false
    _target_: torch.bfloat16
  barrier_with_L1_time: true
  batch_invariant_mode: false
  batch_p2p_comm: true
  batch_p2p_sync: true
  bf16: true
  bias_activation_fusion: false
  bias_dropout_fusion: false
  calculate_per_token_loss: true
  clone_scatter_output_in_embedding: true
  config_logger_dir: ''
  context_parallel_size: 1
  cp_comm_type: null
  cpu_offloading: false
  cpu_offloading_activations: false
  cpu_offloading_double_buffering: false
  cpu_offloading_num_layers: 0
  cpu_offloading_weights: false
  cross_entropy_fusion_impl: te
  cross_entropy_loss_fusion: true
  cuda_graph_impl: none
  cuda_graph_retain_backward_graph: false
  cuda_graph_scope: []
  cuda_graph_use_single_mempool: false
  cuda_graph_warmup_steps: 3
  deallocate_pipeline_outputs: true
  defer_embedding_wgrad_compute: false
  delay_wgrad_compute: false
  deterministic_mode: false
  disable_bf16_reduced_precision_matmul: false
  disable_parameter_transpose_cache: false
  distribute_saved_activations: null
  dsa_indexer_head_dim: null
  dsa_indexer_loss_coeff: null
  dsa_indexer_n_heads: null
  dsa_indexer_topk: null
  dsa_indexer_use_sparse_loss: false
  embedding_init_method:
    _args_: []
    _partial_: true
    _target_: torch.nn.init.normal_
    mean: 0.0
    std: 0.02
  embedding_init_method_std: 0.02
  enable_autocast: false
  enable_cuda_graph: false
  ep_overlap_early_attn_memory_release: false
  experimental_attention_variant: null
  expert_model_parallel_size: 1
  expert_tensor_parallel_size: 1
  external_cuda_graph: false
  ffn_hidden_size: 6144
  finalize_model_grads_func:
    _args_: []
    _partial_: true
    _target_: megatron.core.distributed.finalize_model_grads.finalize_model_grads
    pg_collection:
      _call_: true
      _target_: megatron.core.process_groups_config.ProcessGroupCollection
  fine_grained_activation_offloading: false
  first_last_layers_bf16: true
  flash_decode: false
  fp16: false
  fp16_lm_cross_entropy: false
  fp32_residual_connection: false
  fp4: null
  fp4_param: false
  fp4_quantizer_factory: null
  fp4_recipe: nvfp4
  fp8: hybrid
  fp8_amax_compute_algo: most_recent
  fp8_amax_history_len: 1
  fp8_dot_product_attention: false
  fp8_interval: 1
  fp8_margin: 0
  fp8_multi_head_attention: false
  fp8_param: true
  fp8_quantizer_factory: null
  fp8_recipe: blockwise
  fp8_wgrad: true
  fused_single_qkv_rope: false
  gated_linear_unit: true
  generation_config:
    _call_: true
    _target_: transformers.generation.configuration_utils.GenerationConfig.from_dict
    config_dict:
      _from_model_config: false
      assistant_confidence_threshold: 0.4
      assistant_early_exit: null
      assistant_lookbehind: 10
      bad_words_ids: null
      begin_suppress_tokens: null
      bos_token_id: 151643
      cache_config: null
      cache_implementation: null
      constraints: null
      decoder_start_token_id: null
      disable_compile: false
      diversity_penalty: 0.0
      do_sample: true
      dola_layers: null
      early_stopping: false
      encoder_no_repeat_ngram_size: 0
      encoder_repetition_penalty: 1.0
      eos_token_id:
      - 151645
      - 151643
      epsilon_cutoff: 0.0
      eta_cutoff: 0.0
      exponential_decay_length_penalty: null
      force_words_ids: null
      forced_bos_token_id: null
      forced_eos_token_id: null
      guidance_scale: null
      is_assistant: false
      length_penalty: 1.0
      low_memory: null
      max_length: 20
      max_matching_ngram_size: null
      max_new_tokens: null
      max_time: null
      min_length: 0
      min_new_tokens: null
      min_p: null
      no_repeat_ngram_size: 0
      num_assistant_tokens: 20
      num_assistant_tokens_schedule: constant
      num_beam_groups: 1
      num_beams: 1
      num_return_sequences: 1
      output_attentions: false
      output_hidden_states: false
      output_logits: null
      output_scores: false
      pad_token_id: 151643
      penalty_alpha: null
      prefill_chunk_size: null
      prompt_lookup_num_tokens: null
      remove_invalid_values: false
      renormalize_logits: false
      repetition_penalty: 1.0
      return_dict_in_generate: false
      return_legacy_cache: null
      sequence_bias: null
      stop_strings: null
      suppress_tokens: null
      target_lookbehind: 10
      temperature: 0.6
      token_healing: false
      top_k: 20
      top_p: 0.95
      transformers_version: 4.57.6
      trust_remote_code: false
      typical_p: 1.0
      use_cache: true
      watermarking_config: null
  glu_linear_offset: 0.0
  grad_scale_func:
    _call_: false
    _target_: megatron.core.optimizer.optimizer.MegatronOptimizer.scale_loss
  grad_sync_func:
    _call_: false
    _target_: megatron.core.distributed.distributed_data_parallel.DistributedDataParallel.start_grad_sync
  gradient_accumulation_fusion: true
  hetereogenous_dist_checkpoint: false
  heterogeneous_block_specs: false
  hf_model_id: Qwen/Qwen3-1.7B
  hidden_dropout: 0.0
  hidden_size: 2048
  hierarchical_context_parallel_sizes: null
  hybrid_context_parallel: false
  inference_fuse_tp_communication: false
  inference_rng_tracker: false
  inference_sampling_seed: 42
  init_method:
    _args_: []
    _partial_: true
    _target_: torch.nn.init.normal_
    mean: 0.0
    std: 0.02
  init_method_std: 0.02
  init_model_with_meta_device: false
  is_hybrid_model: false
  kitchen_attention_backend: sdpa
  kv_channels: 128
  layernorm_epsilon: 1.0e-06
  layernorm_zero_centered_gamma: false
  linear_attention_freq: null
  linear_conv_kernel_dim: null
  linear_key_head_dim: null
  linear_num_key_heads: null
  linear_num_value_heads: null
  linear_value_head_dim: null
  log_max_attention_logit: false
  make_vocab_size_divisible_by: 128
  mamba_head_dim: 64
  mamba_num_groups: 8
  mamba_num_heads: null
  mamba_state_dim: 128
  masked_softmax_fusion: true
  max_position_embeddings: 40960
  max_seqlen_per_dp_cp_rank: null
  memory_efficient_layer_norm: false
  microbatch_group_size_per_vp_stage: 1
  min_offloaded_tensor_size: 1048576
  mlp_chunks_for_prefill: 1
  moe_apply_probs_on_input: false
  moe_aux_loss_coeff: 0.0
  moe_deepep_num_sms: 20
  moe_enable_deepep: false
  moe_expert_capacity_factor: null
  moe_extended_tp: false
  moe_ffn_hidden_size: null
  moe_flex_dispatcher_backend: deepep
  moe_grouped_gemm: false
  moe_hybridep_num_sms: 16
  moe_input_jitter_eps: null
  moe_latent_size: null
  moe_layer_freq: 1
  moe_layer_recompute: false
  moe_pad_expert_input_to_capacity: false
  moe_per_layer_logging: false
  moe_permute_fusion: false
  moe_router_bias_update_rate: 0.001
  moe_router_dtype: null
  moe_router_enable_expert_bias: false
  moe_router_force_load_balancing: false
  moe_router_fusion: false
  moe_router_group_topk: null
  moe_router_load_balancing_type: aux_loss
  moe_router_num_groups: null
  moe_router_padding_for_fp8: false
  moe_router_padding_for_quantization: false
  moe_router_pre_softmax: false
  moe_router_score_function: softmax
  moe_router_topk: 2
  moe_router_topk_limited_devices: null
  moe_router_topk_scaling_factor: null
  moe_shared_expert_gate: false
  moe_shared_expert_intermediate_size: null
  moe_shared_expert_overlap: false
  moe_token_dispatcher_type: allgather
  moe_token_drop_policy: probs
  moe_token_dropping: false
  moe_use_legacy_grouped_gemm: false
  moe_z_loss_coeff: null
  mrope_section: null
  mtp_enabled: false
  mtp_loss_scaling_factor: null
  mtp_num_layers: null
  mtp_standalone: false
  multi_latent_attention: false
  no_rope_freq: null
  no_sync_func:
    _call_: false
    _target_: megatron.core.distributed.distributed_data_parallel.DistributedDataParallel.no_sync
  normalization: RMSNorm
  num_attention_heads: 16
  num_layers: 28
  num_layers_at_end_in_bf16: 1
  num_layers_at_start_in_bf16: 1
  num_layers_in_first_pipeline_stage: null
  num_layers_in_last_pipeline_stage: null
  num_microbatches_with_partial_activation_checkpoints: null
  num_moe_experts: null
  num_query_groups: 8
  offload_modules: null
  output_layer_init_method:
    _args_: []
    _partial_: true
    _target_: torch.nn.init.normal_
    mean: 0.0
    std: 0.002672612419124244
  overlap_moe_expert_parallel_comm: false
  overlap_p2p_comm: false
  overlap_p2p_comm_warmup_flush: false
  parallel_output: true
  param_sync_func: null
  params_dtype:
    _call_: false
    _target_: torch.bfloat16
  perform_initialization: true
  persist_layer_norm: false
  pipeline_dtype:
    _call_: false
    _target_: torch.bfloat16
  pipeline_model_parallel_comm_backend: null
  pipeline_model_parallel_layout: null
  pipeline_model_parallel_size: 1
  position_embedding_type: rope
  qk_clip: false
  qk_clip_alpha: 0.5
  qk_clip_threshold: 100
  qk_l2_norm: false
  qk_layernorm: true
  quant_recipe: null
  recompute_granularity: full
  recompute_method: uniform
  recompute_modules:
  - core_attn
  recompute_num_layers: 1
  restore_modelopt_state: false
  rotary_base: 1000000
  rotary_interleaved: false
  rotary_percent: 1.0
  scatter_embedding_sequence_parallel: true
  seq_len_interpolation_factor: null
  seq_length: 40960
  sequence_parallel: false
  share_embeddings_and_output_weights: true
  should_pad_vocab: false
  softmax_scale: null
  softmax_type: vanilla
  symmetric_ar_type: null
  tensor_model_parallel_size: 1
  test_mode: false
  timers:
    _call_: true
    _target_: megatron.core.timers.Timers
  tp_comm_atomic_ag: false
  tp_comm_atomic_rs: false
  tp_comm_bootstrap_backend: nccl
  tp_comm_bulk_dgrad: true
  tp_comm_bulk_wgrad: true
  tp_comm_overlap: false
  tp_comm_overlap_ag: true
  tp_comm_overlap_cfg: null
  tp_comm_overlap_disable_fc1: false
  tp_comm_overlap_disable_qkv: false
  tp_comm_overlap_rs: true
  tp_comm_overlap_rs_dgrad: false
  tp_comm_split_ag: true
  tp_comm_split_rs: true
  tp_only_amax_red: false
  transformer_impl: transformer_engine
  transformer_layer_spec:
    _call_: false
    _target_: megatron.bridge.models.gpt_provider.default_layer_spec
  use_arbitrary_attention_mask: null
  use_cpu_initialization: false
  use_fused_weighted_squared_relu: false
  use_inference_optimized_layers: false
  use_kitchen: false
  use_kitchen_attention: false
  use_mamba_mem_eff_path: true
  use_ring_exchange_p2p: false
  use_te_activation_func: false
  use_te_rng_tracker: false
  use_transformer_engine_full_layer_spec: false
  use_transformer_engine_op_fuser: false
  variable_seq_lengths: false
  virtual_pipeline_model_parallel_size: null
  vocab_size: 151936
  wgrad_deferral_limit: 0
  window_attn_skip_freq: null
  window_size: null
nvrx_straggler: null
optimizer:
  _target_: megatron.bridge.training.config.OptimizerConfig
  adam_beta1: 0.9
  adam_beta2: 0.98
  adam_eps: 1.0e-05
  apply_wd_to_qk_layernorm: false
  barrier_with_L1_time: false
  bf16: true
  clip_grad: 1.0
  config_logger_dir: ''
  decoupled_lr: null
  decoupled_min_lr: null
  decoupled_weight_decay: true
  exp_avg_dtype:
    _call_: false
    _target_: torch.bfloat16
  exp_avg_sq_dtype:
    _call_: false
    _target_: torch.bfloat16
  fp16: false
  fp8_recipe: blockwise
  hysteresis: 2
  initial_loss_scale: 4294967296
  log_num_zeros_in_grad: false
  loss_scale: null
  loss_scale_window: 1000
  lr: 1.0e-05
  main_grads_dtype:
    _call_: false
    _target_: torch.float32
  main_params_dtype:
    _call_: false
    _target_: torch.float32
  min_loss_scale: 1.0
  min_lr: 5.0e-06
  muon_extra_scale_factor: 1.0
  muon_fp32_matmul_prec: medium
  muon_momentum: 0.95
  muon_num_ns_steps: 5
  muon_scale_mode: spectral
  muon_split_qkv: true
  muon_tp_mode: blockwise
  muon_use_nesterov: false
  optimizer: adam
  optimizer_cpu_offload: false
  optimizer_offload_fraction: 0.4
  overlap_cpu_optimizer_d2h_h2d: true
  overlap_param_gather: false
  overlap_param_gather_with_optimizer_step: false
  params_dtype:
    _call_: false
    _target_: torch.bfloat16
  pin_cpu_grads: true
  pin_cpu_params: true
  reuse_grad_buf_for_mxfp8_param_ag: false
  sgd_momentum: 0.9
  store_param_remainders: true
  timers:
    _call_: true
    _target_: megatron.core.timers.Timers
  use_distributed_optimizer: true
  use_precision_aware_optimizer: true
  use_torch_optimizer_for_cpu_offload: false
  weight_decay: 0.1
optimizer_config_override_provider:
  _target_: megatron.bridge.training.config.OptimizerConfigOverrideProvider
peft: null
profiling:
  _target_: megatron.bridge.training.config.ProfilingConfig
  memory_snapshot_path: snapshot.pickle
  nvtx_ranges: false
  profile_ranks:
  - 0
  profile_step_end: 12
  profile_step_start: 10
  record_memory_history: false
  record_shapes: false
  use_nsys_profiler: false
  use_pytorch_profiler: false
rerun_state_machine:
  _target_: megatron.bridge.training.config.RerunStateMachineConfig
  check_for_nan_in_loss: true
  check_for_spiky_loss: false
  error_injection_rate: 0
  error_injection_type: transient_error
  rerun_mode: disabled
rng:
  _target_: megatron.bridge.training.config.RNGConfig
  data_parallel_random_init: false
  inference_rng_tracker: false
  seed: 42
  te_rng_tracker: false
scheduler:
  _target_: megatron.bridge.training.config.SchedulerConfig
  end_weight_decay: 0.033
  lr_decay_iters: 1000
  lr_decay_samples: null
  lr_decay_steps: 6000
  lr_decay_style: cosine
  lr_warmup_fraction: null
  lr_warmup_init: 0.0
  lr_warmup_iters: 20
  lr_warmup_samples: 0
  lr_warmup_steps: 120
  lr_wsd_decay_iters: null
  lr_wsd_decay_samples: null
  lr_wsd_decay_style: exponential
  no_weight_decay_cond_type: null
  override_opt_param_scheduler: true
  start_weight_decay: 0.033
  use_checkpoint_opt_param_scheduler: false
  wd_incr_steps: 6000
  weight_decay_incr_style: constant
  wsd_decay_steps: null
straggler: null
tensor_inspect: null
tokenizer:
  _target_: megatron.bridge.training.tokenizers.config.TokenizerConfig
  chat_template: null
  hf_tokenizer_kwargs: {}
  image_tag_type: null
  legacy_tokenizer: false
  merge_file: null
  metadata_path: null
  sp_tokenizer_kwargs: {}
  special_tokens: null
  tiktoken_num_special_tokens: 1000
  tiktoken_pattern: null
  tiktoken_special_tokens: null
  tokenizer_model: Qwen/Qwen3-1.7B
  tokenizer_prompt_format: null
  tokenizer_type: HuggingFaceTokenizer
  vocab_extra_ids: 0
  vocab_file: null
  vocab_size: null
train:
  _target_: megatron.bridge.training.config.TrainingConfig
  check_weight_hash_across_dp_replicas_interval: null
  decrease_batch_size_if_needed: false
  empty_unused_memory_level: 0
  eval_interval: 100
  eval_iters: 15
  exit_duration_in_mins: null
  exit_interval: null
  exit_signal:
    _args_:
    - 15
    _call_: true
    _target_: signal.Signals
  exit_signal_handler: false
  exit_signal_handler_for_dataloader: false
  global_batch_size: 6
  iterations_to_skip: []
  manual_gc: false
  manual_gc_eval: true
  manual_gc_interval: 0
  micro_batch_size: 1
  rampup_batch_size: null
  skip_train: false
  train_iters: 1000
  train_samples: null
  train_sync_interval: null

----------------------------------

Expected behavior

The checkpoint should be saved and exported in FP8 or NVFP4 format, with a correspondingly smaller weight file size.

Additional context

Env: latest Megatron-Bridge and Megatron-LM commits.
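
For debugging, a rough way to tell whether low-precision parameters were ever materialized in memory during training (i.e. whether this is a training-side or an export-side problem). This is a hypothetical helper, not a Megatron-Bridge API; model is whatever module the recipe builds:

from collections import Counter

def param_type_histogram(model):
    # With fp8_param enabled, Transformer Engine parameters typically show up as
    # special tensor subclasses (e.g. a Float8Tensor-like class) rather than plain
    # bf16 torch.Tensor, so the class name is a cheap proxy for what got materialized.
    return Counter(f"{type(p).__name__}/{p.dtype}" for p in model.parameters())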
