Directional Steering

Directional steering is a runtime activation edit for DS4. A steering file is a flat f32 matrix with one normalized 4096-wide direction per layer. During inference, ds4 can apply the edit after attention outputs, FFN outputs, or both:

y = y - scale * direction[layer] * dot(direction[layer], y)

A positive scale suppresses the represented direction, and a scale of 1 projects it out entirely because each direction is normalized. A negative scale amplifies it. With no steering file or zero scales, ds4 follows the normal inference path.
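
As a minimal sketch of the formula above, the following pure-Python function applies the edit to a single activation vector. The function name `steer` and the 2-wide toy vectors are illustrative assumptions; the real edit runs inside ds4 on 4096-wide activations.

```python
def steer(y, direction, scale):
    """Return y - scale * direction * dot(direction, y) for one vector."""
    proj = sum(d * v for d, v in zip(direction, y))
    return [v - scale * d * proj for v, d in zip(y, direction)]


# Toy example (hypothetical values): the direction is the unit x-axis.
direction = [1.0, 0.0]
y = [3.0, 4.0]

removed = steer(y, direction, 1.0)     # positive scale: component projected out
amplified = steer(y, direction, -1.0)  # negative scale: component amplified
unchanged = steer(y, direction, 0.0)   # zero scale: normal path

print(removed)    # [0.0, 4.0]
print(amplified)  # [6.0, 4.0]
print(unchanged)  # [3.0, 4.0]
```

Only the component along the direction changes; the part of the vector orthogonal to it is left alone at every scale.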

Runtime Options

--dir-steering-file FILE   load a 43 x 4096 f32 direction file
--dir-steering-ffn F       apply steering after FFN outputs; default is 1 when a file is provided
--dir-steering-attn F      apply steering after attention outputs; default is 0

The FFN output is usually the best first target because it is late enough in each layer to represent behavior, style, and topic signals. Attention steering is available for experiments, but it can be more fragile.
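
Before passing a file to --dir-steering-file, it can help to confirm that it really has the 43 x 4096 f32 shape and unit-length rows. The sketch below is one way to do that with the standard library only. It assumes the file has no header, stores layers row by row, and uses the machine's native byte order; none of these details are confirmed by this document, and the helper names are hypothetical.

```python
import array
import math
import os
import tempfile

N_LAYERS, WIDTH = 43, 4096  # shape stated for --dir-steering-file


def load_directions(path, n_layers=N_LAYERS, width=WIDTH):
    """Read a flat f32 file into per-layer rows (assumed row-major, no header)."""
    expected = n_layers * width * 4
    size = os.path.getsize(path)
    if size != expected:
        raise ValueError(f"{path}: expected {expected} bytes, found {size}")
    data = array.array("f")  # native byte order (assumption)
    with open(path, "rb") as fh:
        data.fromfile(fh, n_layers * width)
    return [data[i * width:(i + 1) * width] for i in range(n_layers)]


def row_norms(rows):
    """Euclidean norm of each per-layer direction."""
    return [math.sqrt(sum(v * v for v in row)) for row in rows]


# Self-check on a synthetic file: each layer gets a one-hot unit vector.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.f32")
    demo = array.array("f", [0.0]) * (N_LAYERS * WIDTH)
    for layer in range(N_LAYERS):
        demo[layer * WIDTH + layer] = 1.0
    with open(path, "wb") as fh:
        demo.tofile(fh)
    rows = load_directions(path)
    norms = row_norms(rows)

print(len(rows), len(rows[0]))
print(all(abs(n - 1.0) < 1e-6 for n in norms))
```

On a real steering file, a norm far from 1 in any row would suggest that the layout assumptions above do not match what ds4 wrote.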

Verbosity Example

The bundled example builds a style direction from 100 paired prompts. Each pair asks for the same information in two ways:

dir-steering/examples/succinct.txt: terse target prompts.
dir-steering/examples/verbose.txt: detailed contrast prompts.

Because the extracted direction is succinct minus verbose, negative FFN scales make answers shorter, while positive FFN scales tend to make answers longer and more explanatory.

Build the vector:

python3 dir-steering/tools/build_direction.py \
  --ds4 ./ds4 \
  --model ds4flash.gguf \
  --good-file dir-steering/examples/succinct.txt \
  --bad-file dir-steering/examples/verbose.txt \
  --out dir-steering/out/verbosity.json \
  --component ffn_out \
  --ctx 512

This writes:

dir-steering/out/verbosity.json
dir-steering/out/verbosity.f32

Try a terse run:

./ds4 -m ds4flash.gguf --nothink --temp 0 -n 160 \
  --dir-steering-file dir-steering/out/verbosity.f32 \
  --dir-steering-ffn -1 \
  -p "Explain why databases use indexes."

Try a verbose run:

./ds4 -m ds4flash.gguf --nothink --temp 0 -n 220 \
  --dir-steering-file dir-steering/out/verbosity.f32 \
  --dir-steering-ffn 2 \
  -p "Explain why databases use indexes."

The same vector can be used in either direction. The sign is the important part:

negative scale amplifies the succinct target direction;
positive scale suppresses that direction and usually gives the model more room to elaborate.
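
The sign convention follows directly from the steering formula. Writing d for direction[layer] and s for the scale (symbols introduced here only for brevity), and using the fact that each stored direction is normalized, the component of the output along d after the edit is:

```latex
\begin{aligned}
y' &= y - s\, d\, (d^\top y), \qquad \lVert d \rVert = 1 \\
d^\top y' &= d^\top y - s\,(d^\top d)(d^\top y) = (1 - s)\, d^\top y
\end{aligned}
```

So s = -1 doubles the component along the succinct direction, s = 0 leaves it unchanged, s = 1 removes it, and s = 2 flips its sign while keeping its magnitude.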

Evaluating Scales

Use the sweep helper to test several strengths on a fixed prompt set:

python3 dir-steering/tools/run_sweep.py \
  --ds4 ./ds4 \
  --model ds4flash.gguf \
  --direction dir-steering/out/verbosity.f32 \
  --prompts dir-steering/examples/eval_prompts.txt \
  --scales "-1,-0.5,0,0.5,1,2" \
  --tokens 180 \
  --nothink

Start with FFN scales between -1 and 2. If the model becomes repetitive, ignores the prompt, or starts losing factual content, the scale is too strong. For this example, -1 is a good first terse setting and 2 is a good first verbose setting. Strong negative scales such as -2 or -3 can over-amplify the terse direction and collapse into repetition on some prompts.
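
When comparing sweep outputs, two cheap signals are answer length and how much of the text is repeated. The sketch below is a hypothetical heuristic, not part of run_sweep.py: it counts words and measures the fraction of word trigrams that duplicate an earlier trigram. Any threshold for "too repetitive" is left to the reader.

```python
def word_count(text):
    """Number of whitespace-separated words."""
    return len(text.split())


def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = text.lower().split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


# Hypothetical outputs, for illustration only.
healthy = "An index lets the database find rows without scanning the table."
collapsed = "index index index index index index"

print(word_count(healthy), repeated_ngram_fraction(healthy))
print(word_count(collapsed), repeated_ngram_fraction(collapsed))
```

A high repeated fraction at a given scale is a hint that the scale is too strong for that prompt. It does not detect lost factual content or ignored prompts, which still need a manual read.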

Observed Effect

With the 100-pair vector built from the commands above, local greedy checks showed the expected behavior:

Prompt: Explain why databases use indexes.
--dir-steering-ffn -1: 67 words, one compact paragraph.
--dir-steering-ffn 0: 136 words, structured explanation.
--dir-steering-ffn 1: 140 words, structured explanation with more detail.

On a prompt that the unsteered model already answered briefly, positive steering made the expansion more visible:

Prompt: What does DNS do?
--dir-steering-ffn 0: 44 words.
--dir-steering-ffn 2: 171 words, with sections and step-by-step detail.

Building Other Directions

The extractor compares two prompt sets:

good-file: target prompts for the direction you want to represent.
bad-file: contrast prompts that should be separated from the target.

It captures DS4 activations from the same local GPU graph used for inference, averages the target-minus-contrast difference, normalizes the result to one unit vector per layer, and writes both the metadata JSON and the runtime .f32 file.
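
The averaging and normalization step can be sketched for a single layer as follows. This is an illustration, not the code in build_direction.py: it assumes each prompt has already been reduced to one activation vector per layer (how the extractor pools over tokens is not described here), and it takes the difference of the two set means, which equals the mean of paired differences when both sets have the same size. The function names and toy values are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def build_direction(good_acts, bad_acts):
    """mean(good) - mean(bad), scaled to unit length, for one layer."""
    good_mean = mean_vector(good_acts)
    bad_mean = mean_vector(bad_acts)
    diff = [g - b for g, b in zip(good_mean, bad_mean)]
    norm = math.sqrt(sum(v * v for v in diff))
    if norm == 0.0:
        raise ValueError("target and contrast means are identical")
    return [v / norm for v in diff]


# Toy activations: two prompts per set, width 2 instead of 4096.
good = [[2.0, 1.0], [4.0, 1.0]]  # mean is [3.0, 1.0]
bad = [[0.0, 1.0], [0.0, 1.0]]   # mean is [0.0, 1.0]

direction = build_direction(good, bad)
print(direction)  # [1.0, 0.0]
```

The shared second coordinate cancels in the difference, so the resulting direction keeps only what separates the two prompt sets.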

Concept removal:

Put concept-heavy prompts in good-file.
Put neutral prompts in bad-file.
Run with a positive FFN scale.

Concept amplification:

Put desired concept prompts in good-file.
Put neutral prompts in bad-file.
Run with a negative FFN scale.

Style control:

Put prompts for the target style in good-file.
Put contrasting style prompts in bad-file.
Use negative scale to amplify the target style, positive scale to reduce it.

The method is not a fine-tune. It is a low-rank runtime edit (one direction per layer), so it works best for coarse behavior, topic, or style directions that are consistently present in the activation captures.
