
CBify Supervised to Contextual Bandit

JohnLangford edited this page Feb 2, 2026 · 1 revision

cbify is a reduction that converts supervised learning examples (multiclass, cost-sensitive, or regression) into contextual bandit problems. This lets you apply CB exploration algorithms to existing labeled datasets, which is useful for evaluating exploration strategies, simulating online decision-making from offline data, or warm-starting a CB policy.

How it works

Given a supervised example with a known label:

  1. The CB exploration policy proposes a probability distribution over actions
  2. An action is sampled from this distribution
  3. A cost is computed by comparing the sampled action to the true label
  4. The CB learner updates using this (action, cost, probability) tuple

This simulates the partial-feedback setting of contextual bandits: the learner only observes the cost for the action it chose, not for all actions.
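The four steps above can be sketched in pure Python. This is a conceptual illustration, not VW's implementation: all names are hypothetical, and an epsilon-greedy distribution is assumed for step 1.

```python
import random

def cbify_step(features, true_label, policy, k, epsilon=0.05):
    """One cbify step: turn a supervised example into bandit feedback.

    `policy` is any callable mapping features -> greedy action in 1..k.
    """
    greedy = policy(features)
    # Step 1: epsilon-greedy probability distribution over the K actions
    probs = [epsilon / k] * k
    probs[greedy - 1] += 1.0 - epsilon
    # Step 2: sample an action from the distribution
    action = random.choices(range(1, k + 1), weights=probs)[0]
    # Step 3: cost comes from comparing the sampled action to the hidden label
    cost = 0.0 if action == true_label else 1.0
    # Step 4: the CB learner would update on this (action, cost, probability)
    return action, cost, probs[action - 1]
```

Note that the learner never sees the costs of the unsampled actions, which is exactly the information loss cbify is designed to simulate.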

Multiclass mode (default)

The basic mode converts multiclass examples into CB problems. The argument to --cbify specifies the number of actions K.

Suppose train.dat contains multiclass examples:

1 | feature_a feature_b
3 | feature_c
2 | feature_a feature_c

Then:

vw --cbify 3 --epsilon 0.05 -d train.dat

This converts each example into a K-armed bandit problem. If the sampled action matches the true label, cost is 0; otherwise cost is 1. The --loss0 and --loss1 options control these values (defaults 0.0 and 1.0).
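The cost rule reduces to a two-valued function of the sampled action; a minimal sketch (function name is illustrative):

```python
def multiclass_cost(action, label, loss0=0.0, loss1=1.0):
    # cbify's multiclass cost: --loss0 on a match, --loss1 otherwise
    return loss0 if action == label else loss1
```

Setting, say, `--loss0 -1 --loss1 0` turns the loss into a reward-style signal.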

All of the standard CB exploration strategies are available:

vw --cbify 10 --epsilon 0.1 -d train.dat       # epsilon-greedy (default)
vw --cbify 10 --first 5 -d train.dat            # explore-first
vw --cbify 10 --bag 7 -d train.dat              # bagging
vw --cbify 10 --cover 3 -d train.dat            # online cover

ADF mode

Adding --cb_explore_adf makes cbify use the action-dependent features (ADF) framework. This is more flexible and supports additional exploration algorithms such as RegCB and SquareCB:

vw --cbify 10 --cb_explore_adf --cb_type mtr --regcb --mellowness 0.01 -d train.dat
vw --cbify 10 --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 500 -d train.dat

Cost-sensitive mode (--cbify_cs)

When examples have per-action costs rather than a single correct label, use --cbify_cs:

vw --cbify 3 --cbify_cs --epsilon 0.05 -d cs_data.dat

The input uses VW's cost-sensitive format:

1:0 2:1 3:1 | feature_a
1:1 2:0 3:0.5 | feature_b

Costs are interpolated between --loss0 and --loss1 based on the per-class cost values.
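Assuming the interpolation is linear (as the description above suggests), it can be sketched as:

```python
def interpolate_cost(class_cost, loss0=0.0, loss1=1.0):
    # linear interpolation: a per-class cost of 0 maps to --loss0,
    # a cost of 1 maps to --loss1, values in between scale linearly
    return loss0 + class_cost * (loss1 - loss0)
```

With the defaults, a per-class cost of 0.5 (as in `3:0.5` above) simply stays 0.5.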

Cost-sensitive LDF mode (--cbify_ldf)

For cost-sensitive examples with label-dependent features (multiline format), use --cbify_ldf:

vw --cbify_ldf --cb_type mtr --squarecb --gamma_scale 500 -d cs_ldf_data.dat

The input uses VW's csoaa_ldf multiline format with a shared line and one line per action.
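For reference, a hypothetical csoaa_ldf-style input might look like the following (feature names and cost values are illustrative): a shared line carrying the context, then one line per action with its cost, and a blank line between examples.

```
shared | user_age user_location
1:0.0 | article_sports
2:1.0 | article_politics
3:1.0 | article_tech

shared | user_age user_device
1:0.5 | article_sports
2:0.0 | article_politics
3:1.0 | article_tech
```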

Regression mode (--cbify_reg)

Converts regression examples into CB problems. Requires --min_value and --max_value to define the continuous range:

vw --cbify 8 --cbify_reg --min_value 0 --max_value 100 --loss_option 1 -d regression.dat

The continuous range is discretized into K bins (the --cbify argument). Three loss functions are available:

--loss_option   Loss function   Formula
0 (default)     Squared         (predicted - actual)^2 / range^2
1               Absolute        |predicted - actual| / range
2               Zero-one        0 if |predicted - actual| / range is within the threshold, else 1

where range = max_value - min_value.

The zero-one loss threshold is controlled by --loss_01_ratio (default 0.1).
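The three loss options can be sketched in one function (a conceptual illustration with hypothetical names, not VW's code; whether the zero-one threshold comparison is strict is an assumption here):

```python
def regression_loss(predicted, actual, min_value, max_value,
                    loss_option=0, loss_01_ratio=0.1):
    # normalized distance between prediction and label
    value_range = max_value - min_value
    d = abs(predicted - actual) / value_range
    if loss_option == 0:   # squared
        return d * d
    if loss_option == 1:   # absolute
        return d
    # zero-one: free if within loss_01_ratio of the true value (assumed <=)
    return 0.0 if d <= loss_01_ratio else 1.0
```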

Continuous actions with CATS

For truly continuous action spaces (instead of discretization), combine with --cats:

vw --cats 4 --cbify_reg --min_value 185 --max_value 23959 --bandwidth 3000 -d regression.dat --loss_option 1

Discrete CB mode

Use --cb_discrete to discretize the continuous space and route through cb_explore:

vw --cbify 2048 --cbify_reg --cb_discrete --min_value 185 --max_value 23959 -d regression.dat --loss_option 1
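Equal-width binning of the continuous range can be sketched as follows (function names are illustrative, not VW internals): a value maps to one of K bins, and an action maps back to its bin centre.

```python
def to_bin(value, k, min_value, max_value):
    # map a continuous value to one of K equal-width bins (1..K)
    width = (max_value - min_value) / k
    idx = int((value - min_value) / width)
    return min(max(idx, 0), k - 1) + 1

def bin_center(action, k, min_value, max_value):
    # map an action index (1..K) back to the centre of its bin
    width = (max_value - min_value) / k
    return min_value + (action - 0.5) * width
```

With K = 2048 over [185, 23959] as above, each bin is roughly 11.6 units wide, so the discretization error per prediction is small relative to the range.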

Options reference

Option Description
--cbify <K> Convert to CB with K actions
--cbify_cs Accept cost-sensitive input instead of multiclass
--cbify_reg Accept regression input
--cbify_ldf Accept cost-sensitive LDF (multiline) input
--loss0 <v> Cost for correct prediction (default 0.0)
--loss1 <v> Cost for incorrect prediction (default 1.0)
--flip_loss_sign Use reward (negate costs) instead of loss
--min_value <v> Minimum value for regression mode
--max_value <v> Maximum value for regression mode
--loss_option <n> Regression loss: 0=squared, 1=absolute, 2=zero-one
--loss_report <n> 0=normalized loss, 1=denormalized
--loss_01_ratio <v> Threshold ratio for zero-one loss (default 0.1)
--cb_discrete Discretize continuous space for regression mode

These combine with the standard CB exploration options (--epsilon, --first, --bag, --cover, --cb_explore_adf, --cb_type, --regcb, --squarecb, etc.) documented on the Contextual Bandit algorithms page.
