Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

Replacing few-shot examples with automatically extracted task knowledge for scalable, interpretable B2B conversation classification.

Overview

In B2B sales, automatically classifying conversation segments at scale, across diverse and evolving intents, requires handling scarce labeled data, keeping annotation overhead low, and avoiding per-intent fine-tuning. Few-shot In-Context Learning (ICL) is theoretically suited to these constraints, but in practice concatenating multiple long conversation snippets causes severe performance degradation well before hitting context-window limits.

This work proposes a novel extraction framework that shifts from example concatenation to a different representational form entirely. A single offline step uses an LLM to transform a small set of labeled examples into natural-language classification criteria or task descriptions. At inference time, those compact artifacts replace the examples, producing transparent, human-editable decision logic with no lossy compression.

Key Contributions

Performance: Up to 7% improvement in macro-averaged AUC over direct few-shot ICL.
Efficiency: 99% reduction in token usage, enabling classification at scale at a fraction of the cost.
Interpretability: Human-readable, editable criteria/descriptions enable human-in-the-loop refinement.
Cross-model transfer: Large models can distill task knowledge for deployment on smaller ones.
Dataset: The Call Playbook Dataset — 5 binary classification tasks from real, anonymized enterprise sales conversations.

Methods at a Glance

All methods share the same classifier prompt structure; they differ in what knowledge is passed to it.

Examples is the standard few-shot baseline: N sampled training examples are included directly in the classification prompt. Each B2B conversation snippet can be hundreds of words long, so token cost grows quickly with N, and the model must infer classification rules implicitly from the examples.

Criteria-Ex replaces the examples with explicit classification criteria extracted from them. Given the sampled examples and the task objective, an LLM generates two lists: positive criteria (patterns indicating the concept is present) and negative criteria (patterns indicating its absence). This reduces token usage substantially, makes the classification logic transparent, and improves generalization by surfacing patterns rather than relying on specific instances.

Description-Ex follows the same two-step structure but extracts a free-text task description instead of criteria. The description captures the essence of the classification task and clarifies concept boundaries, providing a cohesive narrative that can better capture complex, context-dependent relationships that rigid criteria may miss.

Criteria-De and Description-Cr are iterative variants. Criteria-De derives criteria from a Description-Ex output; Description-Cr derives a description from a Criteria-Ex output. These variants explore whether knowledge representations can be progressively refined and are particularly suited to human-in-the-loop workflows, where users can inspect and edit an intermediate artifact before the final classification step.

Method	Knowledge extraction	Classifies with	Token cost
Examples	None	Raw examples	High
Criteria-Ex	Derives positive/negative criteria from examples	Criteria	Low
Description-Ex	Derives a free-text task description from examples	Description	Low
Criteria-De	Description-Ex output → criteria	Criteria	Low
Description-Cr	Criteria-Ex output → description	Description	Low

The Call Playbook Dataset

Access the Dataset

Request Dataset Access →

The Call Playbook Dataset contains annotated excerpts from real enterprise sales conversations across 5 binary classification tasks, each capturing a critical signal in the B2B sales process:

Task	Positive class definition	Example positive
Business Goals	Desired outcomes or strategic objectives articulated by the prospect	"We need to cut churn by 20% before Q3."
Decision Criteria	Specific attributes, features, or evaluation metrics used by the prospect to assess potential solutions	"Security compliance is non-negotiable for us."
Decision Makers	Individuals or roles identified as having authority or influence over the purchasing decision	"This ultimately goes to our VP of Finance."
Decision Making Process	The series of steps or procedures described by the prospect for arriving at a final decision	"We run a 3-week POC with two vendors in parallel."
Pain Points	Inefficiencies, obstacles, or needs expressed by the prospect that they aim to address through a potential solution	"Our current tool breaks on calls longer than an hour."

Setup

After requesting access, place the downloaded data under call_playbook_dataset/ in the project root:

call_playbook_dataset/
├── business_goals/
│   ├── train.csv
│   └── test.csv
├── decision_criteria/
│   ├── train.csv
│   └── test.csv
├── decision_makers/
│   ├── train.csv
│   └── test.csv
├── decision_making_process/
│   ├── train.csv
│   └── test.csv
└── pain_points/
    ├── train.csv
    └── test.csv

Pass the task subfolder as --data_dir (e.g., --data_dir ./call_playbook_dataset/business_goals).

Format

Each task contains train.csv / test.csv with the following structure:

id,text,label
0,"[PROSPECT_A] We need to reduce churn before Q3. [SELLER_A] Absolutely, let's talk about how.",1
1,"[SELLER_A] Great, I'll send over the proposal tonight. [PROSPECT_A] Sounds good.",0

Conversations are represented as speaker-tagged utterance sequences, preserving turn structure while enabling flexible windowing.

Speaker Tags

Speaker roles are annotated inline within each snippet:

Tag	Role
`[PROSPECT_A]`, `[PROSPECT_B]`	Prospect-side speakers
`[SELLER_A]`, `[SELLER_B]`	Seller-side speakers
`[SPEAKER_A]`, ...	Generic speaker (role unknown)

Dataset Statistics

Task	Train Samples	Train Calls	Test Samples	Test Calls	Avg Words
Business Goals	200	25	200	25	~284
Decision Criteria	200	25	200	25	~276
Decision Makers	200	25	200	25	~267
Decision Making Process	200	20	200	21	~267
Pain Points	200	25	200	25	~286

Privacy & Anonymization

All data has been rigorously anonymized before release. Named entities were identified and systematically replaced with fictional alternatives that preserve conversational realism:

Entity type	Replacements	Sample substitutions
Organizations	120+	"Quantum Solutions", "Zenith Innovations", "Nebula Technologies"
Persons	130+	"Alex", "Jordan", "Casey", "Taylor"
Products	80+	"CodeCraft", "QuantaQuery", "NebulaNet"
Locations	180+	"Varthevia", "Brindmere", "Corswick"
URLs / IDs / Phones / Emails	20+	anonymized placeholder formats

The full replacement table is in replacements.json.

Getting Started

Installation

Requires Python >= 3.12.

git clone https://github.com/gong-io/call-playbook
cd call-playbook
pip install uv
uv venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install -r requirements.txt

Configure credentials

Create a .env file in the project root:

# Azure OpenAI
AZURE_OPENAI_API_KEY=your_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_API_VERSION=your_api_version
AZURE_OPENAI_MODEL_VERSION=your_model_version

# AWS Bedrock
AWS_PROFILE=your_profile
AWS_REGION=us-east-1
AWS_BEDROCK_ENDPOINT=https://your-bedrock-endpoint.amazonaws.com

Run an experiment

python run_classification.py \
  --model_id gpt-4o \
  --model_source azure \
  --objective "the prospect discusses their business goals in the context of the purchasing process." \
  --data_dir ./call_playbook_dataset/business_goals \
  --output_dir ./results/business_goals \
  --num_few_shot_examples 0 10 25 50 75 100 \
  --num_experiments 5 \
  --env_file .env

Or pass everything via a config file:

python run_classification.py --config config.json --env_file .env

Full config.json reference

{
    "model_id": "gpt-4o",
    "model_source": "azure",
    "objective": "the prospect discusses their business goals in the context of the purchasing process",
    "data_dir": "./call_playbook_dataset/business_goals",
    "num_few_shot_examples": [0, 10, 25, 50, 75, 100],
    "num_experiments": 5,
    "batch_size": 10,
    "output_dir": "./results/business_goals",
    "use_wandb": false,
    "override": false,
    "mix_examples": false,
    "sampling_method": "label_distribution",
    "label_column": "label",
    "text_column": "text",
    "example_column": "text",
    "label_map": {"0": "Negative", "1": "Positive"},
    "wandb_entity": "your-entity",
    "wandb_project": "b2b-classification-research",
    "wandb_run": "experiment-business_goals-gpt4o"
}

CLI reference

Argument	Default	Description
`--model_id`	`gpt-4o`	Model identifier
`--model_source`	`azure`	`azure` or `bedrock`
`--objective`	—	One-sentence description of the classification task
`--data_dir`	—	Path to directory with `train.csv` / `test.csv`
`--num_few_shot_examples`	`0 10 25 50 75 100`	Space-separated list of K values to sweep
`--num_experiments`	`5`	Repetitions per configuration (for ± std)
`--batch_size`	`10`	Number of examples processed per LLM call
`--sampling_method`	`label_distribution`	`random` or `label_distribution`
`--mix_examples`	off	Mix examples across classes instead of grouping by label
`--label_column`	`label`	Name of the label column in the dataset
`--text_column`	`text`	Name of the text column in the dataset
`--example_column`	`text`	Name of the column used as few-shot examples
`--label_map`	`{"0":"Negative","1":"Positive"}`	JSON string mapping integer labels to display names
`--output_dir`	—	Where to write results
`--use_wandb`	off	Enable Weights & Biases logging
`--override`	off	Re-run even if results already exist
`--wandb_entity`	—	W&B entity (username or team)
`--wandb_project`	—	W&B project name
`--wandb_run`	—	W&B run name
`--config`	—	Path to JSON config file (overrides all flags)
`--env_file`	`.env`	Path to `.env` file with API credentials

Outputs

Results are written to output_dir/<model_id>/:

results/
└── gpt-4o/
    ├── Examples_10_iter_1.csv                                        # sampled few-shot sets
    ├── Criteria-Ex_10_iter_1.json                                    # generated criteria
    ├── Description-Ex_10_iter_1.json                                 # generated descriptions
    ├── Examples_num_samples_10_iter_1_classification_report.json     # per-iteration classification report
    ├── Examples_10_average_metrics.json                              # mean across iterations
    ├── Examples_10_std_metrics.json                                  # std across iterations
    ├── macro_avg_f1-score.png                                        # performance curves (raster)
    ├── weighted_avg_precision.pdf                                    # performance curves (vector)
    └── experiment.log                                                # run log with timings and metadata

Performance plots show precision / recall / F1 vs. number of few-shot examples with error bars across iterations, for every method and every metric category (per-class, macro, micro, weighted).

Experiment Tracking

The framework supports Weights & Biases for full experiment tracking. Enable with --use_wandb. Logged artifacts include:

Per-iteration classification reports for all 5 methods
Mean / std summary tables
Few-shot example sets
Generated criteria and description strings
All performance plots

Project Structure

.
├── run_classification.py         # Entry point: argument parsing + experiment loop
├── src/
│   ├── classifiers.py            # Prompt-based binary classifier
│   ├── criteria_creator.py       # Criteria generation (from examples or descriptions)
│   ├── description_creator.py    # Description generation (from examples or criteria)
│   ├── dataset_loader.py         # Dataset loading and preprocessing
│   ├── experiment.py             # Experiment runner and logger setup
│   ├── metrics.py                # Classification metrics and aggregation
│   ├── models.py                 # LLM initialization (Azure OpenAI, AWS Bedrock)
│   ├── utils.py                  # Sampling, I/O, and artifact helpers
│   └── visualization.py          # Performance curve plots
├── figures/
│   ├── main_figure.png           # Method overview figure
│   └── call_playbook_avatar.png  # Dataset example snippets
├── call_playbook_dataset/        # Downloaded dataset (not included in repo)
│   ├── business_goals/
│   ├── decision_criteria/
│   ├── decision_makers/
│   ├── decision_making_process/
│   └── pain_points/
├── config.json                   # Example configuration
├── replacements.json             # Anonymization entity mappings
├── requirements.txt
├── LICENSE
└── .gitignore

Citation

If this work or dataset is useful to your research, please cite:

@inproceedings{rotman-etal-2026-distilling,
    title = "Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World {B2B} Conversations",
    author = "Rotman, Guy  and
      Kopilov, Adi  and
      Berger Zalmanson, Danit  and
      Allouche, Omri",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
    month = jul,
    year = "2026",
    address = "San Diego, California, USA",
    publisher = "Association for Computational Linguistics",
}

Request Dataset Access · MIT License

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

Table of Contents

Overview

Key Contributions

Methods at a Glance

The Call Playbook Dataset

Access the Dataset

Setup

Format

Speaker Tags

Dataset Statistics

Privacy & Anonymization

Getting Started

Installation

Configure credentials

Run an experiment

CLI reference

Outputs

Experiment Tracking

Project Structure

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
figures		figures
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.json		config.json
replacements.json		replacements.json
requirements.txt		requirements.txt
run_classification.py		run_classification.py

Folders and files

Latest commit

History

Repository files navigation

Distilling Examples into Task Instructions: Enhanced In-Context Learning for Real-World B2B Conversations

Table of Contents

Overview

Key Contributions

Methods at a Glance

The Call Playbook Dataset

Access the Dataset

Setup

Format

Speaker Tags

Dataset Statistics

Privacy & Anonymization

Getting Started

Installation

Configure credentials

Run an experiment

CLI reference

Outputs

Experiment Tracking

Project Structure

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages