LoRA adapters are not properly syncing when using LocalBackend - stale adapters are being used in rollouts #661

@gmanlan

Description

When using LocalBackend (colocate mode), I noticed that the LoRA adapter was not being synced properly. If I killed the run after several steps and resumed from a checkpoint, the reward would jump abruptly, because ART was forced to reload the last checkpoint as the first/base model. Upon inspection, it seems that the adapter correctly makes its way to vLLM, but the OpenAI serving layer is not "seeing" the new adapter. As a result, rollouts (which obtain the current model name via model.get_inference_name()) keep using the initial/base adapter (step 0), so the whole training process silently falls apart due to stale inference.
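
One way to confirm this symptom from the outside might be to compare the adapter name a rollout is expected to use with the model ids the OpenAI-compatible server reports. The following is a minimal sketch, not ART or vLLM code: the function names are hypothetical, the `name@step` naming is taken from the snippet further down, the payload is assumed to follow the OpenAI list-models shape (`{"data": [{"id": ...}]}`), and the sample payload is made up for illustration.

```python
from typing import Any


def served_model_ids(models_payload: dict[str, Any]) -> set[str]:
    """Collect model ids from an OpenAI-style list-models response body."""
    return {entry["id"] for entry in models_payload.get("data", [])}


def adapter_is_served(models_payload: dict[str, Any], model_name: str, step: int) -> bool:
    """Return True if the adapter for `step` is visible to the serving layer."""
    # Assumption: adapters are named "<model_name>@<step>", as in the issue's snippet.
    expected = f"{model_name}@{step}"
    return expected in served_model_ids(models_payload)


# Made-up payload: only the step-0 adapter is listed, i.e. the stale state.
payload = {"data": [{"id": "base-model"}, {"id": "my-model@0"}]}
print(adapter_is_served(payload, "my-model", 0))  # True
print(adapter_is_served(payload, "my-model", 3))  # False
```

If the check returns False for the step that was just trained, the adapter may be loaded in the engine but is not listed by the serving layer, which matches the behavior described above.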

This seems to be a regression that may have affected multiple versions, because I do remember this working properly in last year's versions.

While I don't have the means to properly submit a PR at the moment, I wanted to share one possible solution here, in case it helps maintainers:

The problem arises from unsloth/service.py, where at the end of _train_shared() it adds the new adapter to vLLM, but it forgets to also register it within _openai_serving_models. A quick fix would be to add the following snippet right after the llm.add_lora() code block:

lora_request = LoRARequest(
	lora_name=f"{self.model_name}@{new_step}",
	lora_int_id=self._next_lora_id(),
	lora_path=checkpoint_dir,
)
added = await llm.add_lora(lora_request)
if not added:
	raise RuntimeError(f"Failed to add LoRA adapter for step {new_step} at {checkpoint_dir}")

# -- Patch here:
import art.vllm.server as _vllm_server_mod
serving_models = _vllm_server_mod._openai_serving_models
if serving_models is not None:
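	# lora_name is only a keyword argument above, not a local variable,
	# so read the name back from the request object.
	serving_models.lora_requests[lora_request.lora_name] = lora_request
	logger.info(
		"Registered '%s' in OpenAI serving models registry", lora_request.lora_name
	)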
else:
	logger.warning(
		"_openai_serving_models is None — LoRA loaded into vLLM "
		"but NOT registered in the OpenAI serving layer. Inference requests "
		"may still use the previous adapter."
	)
# --

self._latest_step = new_step

Notes:

  • Affected versions: I believe multiple versions are affected, but I personally discovered this with version 0.5.16.
  • I acknowledge that this may not be the best place to fix it, but I've verified that ART works correctly after patching this.
  • I don't know whether train_sft is affected by this, but if so, I believe it could be patched in exactly the same way.
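
If train_sft does need the same treatment, the registration step could be factored into a small helper so both code paths share it. This is only a sketch: `register_lora_for_serving` is a hypothetical name, and it assumes, as the patch above does, that the serving-models object exposes a dict-like `lora_requests` keyed by adapter name. The objects in the usage lines are stand-ins, not real vLLM types.

```python
import logging
from types import SimpleNamespace
from typing import Any, Optional

logger = logging.getLogger(__name__)


def register_lora_for_serving(serving_models: Optional[Any], lora_request: Any) -> bool:
    """Record lora_request in the serving registry; return False if there is none."""
    if serving_models is None:
        logger.warning(
            "No serving registry available; '%s' was not registered",
            lora_request.lora_name,
        )
        return False
    # Assumption: lora_requests behaves like a dict keyed by adapter name.
    serving_models.lora_requests[lora_request.lora_name] = lora_request
    logger.info("Registered '%s' in serving registry", lora_request.lora_name)
    return True


# Stand-ins for the real serving-models and LoRARequest objects.
registry = SimpleNamespace(lora_requests={})
request = SimpleNamespace(lora_name="my-model@3", lora_int_id=4, lora_path="/tmp/ckpt")
print(register_lora_for_serving(registry, request))  # True
print(register_lora_for_serving(None, request))      # False
```

Returning a boolean lets the caller decide whether a missing registry should only be logged, as in the patch above, or treated as an error.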

@bradhilton this is what we briefly discussed on Discord.

I hope it helps.
