
CBify Supervised to Contextual Bandit

JohnLangford edited this page Feb 2, 2026 · 1 revision

cbify is a reduction that converts supervised learning examples (multiclass, cost-sensitive, or regression) into contextual bandit problems. This lets you apply CB exploration algorithms to existing labeled datasets, which is useful for evaluating exploration strategies, simulating online decision-making from offline data, or warm-starting a CB policy.

How it works

Given a supervised example with a known label:

  1. The CB exploration policy proposes a probability distribution over actions
  2. An action is sampled from this distribution
  3. A cost is computed by comparing the sampled action to the true label
  4. The CB learner updates using this (action, cost, probability) tuple

This simulates the partial-feedback setting of contextual bandits: the learner only observes the cost for the action it chose, not for all actions.
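The four steps above can be sketched in pure Python. This is a conceptual illustration, not VW's implementation: all names are hypothetical, and an epsilon-greedy distribution is assumed for step 1.

```python
import random

def cbify_step(features, true_label, policy, k, epsilon=0.05):
    """One cbify step: turn a supervised example into bandit feedback.

    `policy` is any callable mapping features -> greedy action in 1..k.
    """
    greedy = policy(features)
    # Step 1: epsilon-greedy probability distribution over the K actions
    probs = [epsilon / k] * k
    probs[greedy - 1] += 1.0 - epsilon
    # Step 2: sample an action from the distribution
    action = random.choices(range(1, k + 1), weights=probs)[0]
    # Step 3: cost comes from comparing the sampled action to the hidden label
    cost = 0.0 if action == true_label else 1.0
    # Step 4: the CB learner would update on this (action, cost, probability)
    return action, cost, probs[action - 1]
```

Note that the learner never sees the costs of the unsampled actions, which is exactly the information loss cbify is designed to simulate.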

Multiclass mode (default)

The basic mode converts multiclass examples into CB problems. The argument to --cbify specifies the number of actions K.

Suppose train.dat contains multiclass examples:

1 | feature_a feature_b
3 | feature_c
2 | feature_a feature_c

Then:

vw --cbify 3 --epsilon 0.05 -d train.dat

This converts each example into a K-armed bandit problem. If the sampled action matches the true label, cost is 0; otherwise cost is 1. The --loss0 and --loss1 options control these values (defaults 0.0 and 1.0).
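The cost rule reduces to a two-valued function of the sampled action; a minimal sketch (function name is illustrative):

```python
def multiclass_cost(action, label, loss0=0.0, loss1=1.0):
    # cbify's multiclass cost: --loss0 on a match, --loss1 otherwise
    return loss0 if action == label else loss1
```

Setting, say, `--loss0 -1 --loss1 0` turns the loss into a reward-style signal.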

All of the standard CB exploration strategies are available:

vw --cbify 10 --epsilon 0.1 -d train.dat       # epsilon-greedy (default)
vw --cbify 10 --first 5 -d train.dat            # explore-first
vw --cbify 10 --bag 7 -d train.dat              # bagging
vw --cbify 10 --cover 3 -d train.dat            # online cover

ADF mode

Adding --cb_explore_adf makes cbify use the action-dependent features (ADF) framework. This is more flexible and supports additional exploration algorithms such as RegCB and SquareCB:

vw --cbify 10 --cb_explore_adf --cb_type mtr --regcb --mellowness 0.01 -d train.dat
vw --cbify 10 --cb_explore_adf --cb_type mtr --squarecb --gamma_scale 500 -d train.dat

Cost-sensitive mode (--cbify_cs)

When examples have per-action costs rather than a single correct label, use --cbify_cs:

vw --cbify 3 --cbify_cs --epsilon 0.05 -d cs_data.dat

The input uses VW's cost-sensitive format:

1:0 2:1 3:1 | feature_a
1:1 2:0 3:0.5 | feature_b

Costs are interpolated between --loss0 and --loss1 based on the per-class cost values.
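Assuming the interpolation is linear (as the description above suggests), it can be sketched as:

```python
def interpolate_cost(class_cost, loss0=0.0, loss1=1.0):
    # linear interpolation: a per-class cost of 0 maps to --loss0,
    # a cost of 1 maps to --loss1, values in between scale linearly
    return loss0 + class_cost * (loss1 - loss0)
```

With the defaults, a per-class cost of 0.5 (as in `3:0.5` above) simply stays 0.5.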

Cost-sensitive LDF mode (--cbify_ldf)

For cost-sensitive examples with label-dependent features (multiline format), use --cbify_ldf:

vw --cbify_ldf --cb_type mtr --squarecb --gamma_scale 500 -d cs_ldf_data.dat

The input uses VW's csoaa_ldf multiline format with a shared line and one line per action.
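For reference, a hypothetical csoaa_ldf-style input might look like the following (feature names and cost values are illustrative): a shared line carrying the context, then one line per action with its cost, and a blank line between examples.

```
shared | user_age user_location
1:0.0 | article_sports
2:1.0 | article_politics
3:1.0 | article_tech

shared | user_age user_device
1:0.5 | article_sports
2:0.0 | article_politics
3:1.0 | article_tech
```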

Regression mode (--cbify_reg)

Converts regression examples into CB problems. Requires --min_value and --max_value to define the continuous range:

vw --cbify 8 --cbify_reg --min_value 0 --max_value 100 --loss_option 1 -d regression.dat

The continuous range is discretized into K bins (the --cbify argument). Three loss functions are available:

--loss_option   Loss function   Formula
0 (default)     Squared         (predicted - actual)^2 / range^2
1               Absolute        |predicted - actual| / range
2               Zero-one        0 if |predicted - actual| / range is within the threshold, else 1

where range = max_value - min_value.

The zero-one loss threshold is controlled by --loss_01_ratio (default 0.1).
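The three loss options can be sketched in one function (a conceptual illustration with hypothetical names, not VW's code; whether the zero-one threshold comparison is strict is an assumption here):

```python
def regression_loss(predicted, actual, min_value, max_value,
                    loss_option=0, loss_01_ratio=0.1):
    # normalized distance between prediction and label
    value_range = max_value - min_value
    d = abs(predicted - actual) / value_range
    if loss_option == 0:   # squared
        return d * d
    if loss_option == 1:   # absolute
        return d
    # zero-one: free if within loss_01_ratio of the true value (assumed <=)
    return 0.0 if d <= loss_01_ratio else 1.0
```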

Continuous actions with CATS

For truly continuous action spaces (instead of discretization), combine with --cats:

vw --cats 4 --cbify_reg --min_value 185 --max_value 23959 --bandwidth 3000 -d regression.dat --loss_option 1

Discrete CB mode

Use --cb_discrete to discretize the continuous space and route through cb_explore:

vw --cbify 2048 --cbify_reg --cb_discrete --min_value 185 --max_value 23959 -d regression.dat --loss_option 1
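Equal-width binning of the continuous range can be sketched as follows (function names are illustrative, not VW internals): a value maps to one of K bins, and an action maps back to its bin centre.

```python
def to_bin(value, k, min_value, max_value):
    # map a continuous value to one of K equal-width bins (1..K)
    width = (max_value - min_value) / k
    idx = int((value - min_value) / width)
    return min(max(idx, 0), k - 1) + 1

def bin_center(action, k, min_value, max_value):
    # map an action index (1..K) back to the centre of its bin
    width = (max_value - min_value) / k
    return min_value + (action - 0.5) * width
```

With K = 2048 over [185, 23959] as above, each bin is roughly 11.6 units wide, so the discretization error per prediction is small relative to the range.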

Options reference

Option Description
--cbify <K> Convert to CB with K actions
--cbify_cs Accept cost-sensitive input instead of multiclass
--cbify_reg Accept regression input
--cbify_ldf Accept cost-sensitive LDF (multiline) input
--loss0 <v> Cost for correct prediction (default 0.0)
--loss1 <v> Cost for incorrect prediction (default 1.0)
--flip_loss_sign Use reward (negate costs) instead of loss
--min_value <v> Minimum value for regression mode
--max_value <v> Maximum value for regression mode
--loss_option <n> Regression loss: 0=squared, 1=absolute, 2=zero-one
--loss_report <n> 0=normalized loss, 1=denormalized
--loss_01_ratio <v> Threshold ratio for zero-one loss (default 0.1)
--cb_discrete Discretize continuous space for regression mode

These combine with the standard CB exploration options (--epsilon, --first, --bag, --cover, --cb_explore_adf, --cb_type, --regcb, --squarecb, etc.) documented on the Contextual Bandit algorithms page.
