Runtime errors, OOM issues, and configuration problems for NeMo Safe Synthesizer. Sections are organized by pipeline phase. For output quality and evaluation metrics, see Synthetic Data Quality. For environment variables, model caching, offline setup, NIM endpoint configuration, and NER parallelism, see Environment Variables.
| Symptom | Likely Cause | Fix |
|---|---|---|
| "kernels package not installed" | No network for Kernels Hub | Set training.attn_implementation: sdpa |
| ConnectionError during startup | No internet / model not cached | Pre-cache models |
| OOM in training | VRAM exhausted | Reduce batch size, quantize |
| OOM in generation | VRAM exhausted | Verify training cleanup |
| OOM in evaluation | Large dataset + PCA | Reduce columns or disable eval |
| "Cannot use unsloth without GPU" | No CUDA device | Switch to HuggingFace backend |
| "max_sequences_per_example must be 1" | Incompatible DP config | Configuration Reference -- Differential Privacy |
| "Unsloth not compatible with DP" | Mutual exclusion | Configuration Reference -- Differential Privacy |
| "Unable to automatically determine a noise multiplier" | Epsilon too low | Increase epsilon or add records |
| "no valid records" in generation | Underfitting / schema mismatch | See GenerationError |
| "exceeds context length" | Records too long | Reduce record size |
| "fraction of invalid records" | Generation quality too low | Lower threshold or retrain |
| Metrics show UNAVAILABLE | Too few records / columns | Ensure >= 200 records |
| Low SQS scores | Underfit or too few records | Review distributions |
| PII uses default entities | Classifier failed | Set entities explicitly |
| "timestamp_column has missing values" | Dirty time series data | Clean NaN/nulls from timestamp column |
| "groups must have same start" | Inconsistent groups | Align group start timestamps |
GPU memory, context length, and backend issues during fine-tuning.
Training OOM errors appear during the "Training" phase with HuggingFace Trainer
stack traces. If you see torch.cuda.OutOfMemoryError:
- Enable 4-bit quantization -- the single largest memory saver. Set `training.quantize_model: true` and `training.quantization_bits: 4`. QLoRA stores the frozen base model in 4-bit NF4 while training LoRA adapters in full precision, cutting model weight memory by ~4x. Quantization reduces precision in the frozen weights; in practice QLoRA typically produces results close to full-precision LoRA, but verify with your evaluation report
- Reduce the context window -- see Context Length and Record Fitting for how to lower `training.rope_scaling_factor`, truncate records, or simplify grouped examples. Longer sequences require more activation memory even with gradient checkpointing enabled
- Verify `training.batch_size` is `1` (the default). The effective batch size is `batch_size * gradient_accumulation_steps` (default 1 x 8 = 8). Peak memory is set by the forward/backward pass on one micro-batch -- `gradient_accumulation_steps` controls how many micro-batches accumulate before each optimizer step but does not affect peak memory
- Lower `training.max_vram_fraction` (default `0.8`) to leave headroom for other GPU consumers on the same device
GPU memory during LoRA SFT breaks down into three components:
- Base model weights (dominant) -- ~14 GiB for a 7B model in fp16, ~3.5 GiB in 4-bit. Quantization targets this component
- Activations (proportional to sequence length and batch size) -- self-attention computes an n x n score matrix, so activation memory scales quadratically with sequence length. Gradient checkpointing, which Safe Synthesizer enables by default, reduces this by recomputing activations during the backward pass instead of storing them. Context length and batch size target this component
- LoRA adapter gradients and optimizer states (small) -- typically < 1 GiB for standard LoRA ranks
For deeper coverage, see Methods and tools for efficient training on a single GPU in the HuggingFace documentation.
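For a quick sanity check before choosing quantization settings, the dominant weight component can be estimated with back-of-envelope arithmetic (the helper below is illustrative, not a library API; 4.5 bits/param roughly approximates NF4 plus quantization constants):

```python
def estimate_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Rough estimate of model weight memory in GiB: params * bits / 8 bytes."""
    return num_params * bits_per_param / 8 / (1024 ** 3)

# 7B-parameter model: fp16 (~16 bits/param) vs 4-bit NF4 (~4.5 bits/param effective)
fp16_gib = estimate_weight_memory_gib(7e9, 16)   # roughly 13 GiB
nf4_gib = estimate_weight_memory_gib(7e9, 4.5)   # roughly 3.7 GiB
```

If the fp16 estimate already approaches your GPU's total VRAM, enable quantization before tuning anything else.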
If Safe Synthesizer fails to find a GPU, the Unsloth backend raises immediately:
RuntimeError: Cannot use unsloth without GPU.
The HuggingFace backend will not error but will attempt to use CPU (extremely slow).
To diagnose:
- Verify NVIDIA drivers: `nvidia-smi`
- Verify PyTorch CUDA build: `python -c "import torch; print(torch.cuda.is_available())"`
- Ensure you installed the CUDA extras, not the CPU-only package. See Installation for the full command with required index URLs.
Switch to the HuggingFace backend for CPU-only environments (useful for development, not recommended for production training).
The effective context window (max_seq_length) is a computed property on
[ModelMetadata][nemo_safe_synthesizer.llm.metadata.ModelMetadata] --
base_max_seq_length * rope_scaling_factor. Every training example must
fit within this window. If it doesn't, data assembly fails with a
GenerationError before training even starts.
Two error messages indicate context-length problems during data assembly:
The number of tokens in an example exceeds the available context length
: A single training example (schema prompt + records) exceeds
  `max_seq_length`.

The dataset schema requires more tokens than the max length of the model
: The schema prompt alone is wider than `max_seq_length` -- typically
  because the table has too many columns for the model's context window.
- Reduce record size -- shorten text fields, drop unnecessary columns, or simplify the schema.
- When using `data.group_training_examples_by`, all records in the same group must fit in context together, making the limit tighter. Consider reducing the number of records per group.

    ??? tip "Sizing formula (approximate)"
        Estimate token budget before adjusting parameters. The /4 divisor is a rough heuristic for BPE tokenizers on JSON content (actual ratios vary by tokenizer and content):

        - `tokens_per_group ≈ (records_per_group × chars_per_record) / 4`
        - `total ≈ prompt_tokens + tokens_per_group × max_sequences_per_example`

        Example: 5 records × 200 chars ≈ 250 tokens/group; with a 400-token prompt and 3 groups per example: `400 + 250 × 3 = 1150` tokens. See [Example Generation -- Sizing](../developer-guide/example-generation.md#sizing-and-context-budget) for per-mode formulas.

- If using `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, increase `training.rope_scaling_factor` to extend the context window. When set to `"auto"`, it is estimated from dataset token counts using heuristics (4 chars per token for text, 1 token per digit) -- this can underestimate for complex or multilingual data. `training.rope_scaling_factor` is not applicable when using `HuggingFaceTB/SmolLM3-3B` (the default model) or `mistralai/Mistral-7B-Instruct-v0.3`.
!!! note "Error type clarification"
These errors are typed as GenerationError in the codebase even though
they fire during data assembly, not during generation proper. They appear
in the pipeline before any training or generation occurs.
Context-length issues can also surface as OOM during training (the model attempts to process sequences near the limit). See Out of Memory During Training for memory-specific fixes like quantization and batch size reduction.
VRAM, invalid records, and early stopping during synthetic data production.
Generation OOM errors appear during the "Generation" phase with vLLM.
GPU allocation defaults to 80% of available VRAM. Training exposes
training.max_vram_fraction to override this; generation does not yet have
an equivalent config field.
- Ensure no other processes hold GPU memory -- training cleanup should release it, but verify with `nvidia-smi`
- If the GPU has less memory than expected, check that the training teardown completed before generation started
Generation failures during synthetic data production. The two most common:
Generation stopped prematurely due to no valid records
: The first batch produced zero valid records. The model may be underfitting
or the schema may not match the training data. Increase
training.num_input_records_to_sample to give the model more context,
and check training logs for quality issues.
Generation stopped prematurely because the average fraction of invalid records was higher than...
: Too many invalid records across `generation.patience` consecutive batches.
  Consider retraining with more records, adjusting `training.num_input_records_to_sample`, or setting `use_structured_generation=True`.
For context-length errors during data assembly ("The number of tokens in an example exceeds the available context length"), see
Context Length and Record Fitting.
Memory and scope issues during quality scoring and report generation.
If evaluation OOMs, reduce the evaluation scope or dataset size:
- For wide datasets, PCA computation in deep structure analysis can OOM. Reduce the number of columns included in evaluation by lowering `evaluation.sqs_report_columns` or by subsetting the input data. If evaluation is not required for your run, disable it entirely with `evaluation.enabled: false`.
- Histogram binning uses the `doane` method to reduce memory, but very large datasets may still cause issues. Reduce `evaluation.sqs_report_columns` or `evaluation.sqs_report_rows` to limit the evaluation scope.
!!! tip "Evaluation and Data Quality"
    SQS scores, UNAVAILABLE metrics, report limits, and low-quality diagnostics are covered in Synthetic Data Quality.
Defaults, auto-resolution, and validation errors for pipeline parameters.
Several defaults may not match your expectations:
| Parameter | Default | Notes |
|---|---|---|
| `training.batch_size` | `1` | Effective batch = batch_size x gradient_accumulation_steps (8) |
| `training.validation_ratio` | `0.0` | No validation split by default |
| `data.holdout` | `0.05` | 5% of records held out for evaluation; capped by `data.max_holdout` (2000) |
| `data.random_state` | `None` | Auto-generates a random seed -- set this value explicitly if you need reproducibility |
| `generation.num_records` | `1000` | May be too small for production use |
Many parameters accept "auto" and are resolved at runtime by the
[AutoConfigResolver][nemo_safe_synthesizer.config.autoconfig.AutoConfigResolver].
See Configuration Reference for the full list.
- `training.rope_scaling_factor` -- auto-estimated from dataset token counts; see Context Length and Record Fitting for details and caveats
- `training.num_input_records_to_sample` -- derived from `rope_scaling_factor * 25000`
- `training.use_unsloth` -- resolves to `true` unless DP is enabled. DP uses Opacus per-sample gradients (GradSampleModule), which require standard model layers and disable gradient checkpointing -- Unsloth's custom layers and checkpointing are incompatible
- `training.learning_rate` -- model-specific default from `ModelMetadata`: Mistral uses 0.0001, all other supported model families use 0.0005
- `data.max_sequences_per_example` -- resolves to `1` when differential privacy is enabled (required to limit per-example gradient contribution), `10` otherwise for best performance
- `privacy.delta` -- computed from record count
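The resolution rules for the DP-sensitive fields can be expressed as plain conditionals. This is an illustration of the documented behavior only, not the actual `AutoConfigResolver` implementation:

```python
def resolve_auto(dp_enabled: bool, rope_scaling_factor: float) -> dict:
    """Sketch of documented 'auto' resolution rules (illustration only)."""
    return {
        # Unsloth is incompatible with Opacus per-sample gradients
        "use_unsloth": not dp_enabled,
        # DP requires limiting per-example gradient contribution
        "max_sequences_per_example": 1 if dp_enabled else 10,
        # Derived from the context window scaling factor
        "num_input_records_to_sample": int(rope_scaling_factor * 25_000),
    }

resolve_auto(dp_enabled=True, rope_scaling_factor=1.0)
```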
!!! warning "Unsloth and Mistral compatibility"
If you encounter issues when using Unsloth with Mistral models, set
training.use_unsloth: false explicitly. There is no automatic
detection for this incompatibility.
Use safe-synthesizer config validate to see how "auto" and default values resolve for
your configuration. Note that some "auto" fields (such as
training.rope_scaling_factor and training.num_input_records_to_sample)
require a dataset to resolve -- they will remain "auto" in the validate
output and only resolve during an actual run:
```
safe-synthesizer config validate --config config.yaml
```

order_training_examples_by requires group_training_examples_by:
: If you set data.order_training_examples_by without also setting
data.group_training_examples_by, config validation will fail. Ordering only
makes sense within groups.
Unsupported file extensions:
: The `url` parameter accepts `.csv`, `.json`, `.jsonl`, `.parquet`, and `.txt`
  files. Other formats raise a `ValueError`.
Incompatible DP settings:
: If `privacy.dp_enabled` is true but `use_unsloth` is true or
  `data.max_sequences_per_example` is not 1, config validation will fail
  with a clear error message. Set these to `"auto"` and they will resolve
  correctly.
!!! tip "Differential Privacy"
    DP errors and privacy budget troubleshooting are covered in Synthetic Data Quality.
Model downloads and processing timeouts for PII detection.
The PII replacer downloads the GLiNER NER model on first use. If the download fails, it raises an exception immediately.
Pre-download the model by running PII replacement once in an environment
with internet access, or set LOCAL_FILES_ONLY=true after the model is cached.
NER uses an internal max_runtime_seconds timeout. If processing a chunk takes
too long, it is dropped with a warning in the logs.
Check the logs for timeout warnings. The timeout is not currently configurable; for large datasets, reduce the amount of text processed per chunk (for example, shorten text fields or split them into smaller pieces) and optionally reduce CPU parallelism so each worker has more resources.
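Because the timeout applies per chunk, pre-splitting long text fields in your input data lowers the chance that any single chunk exceeds it. A naive sketch (both the helper and the 2000-character limit are illustrative, not library settings):

```python
def split_into_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Split long text so each piece stays under an assumed per-chunk budget."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]

chunks = split_into_chunks("x" * 5000)  # three pieces: 2000, 2000, 1000 chars
```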
WandB requires an API key when running in online mode. If the key is missing
or invalid, training will fail when the WandB run is initialized.
wandb: ERROR api_key not configured (no-auth)
Set the API key before running:
```
export WANDB_API_KEY="your-api-key"  # pragma: allowlist secret
```

Or switch to offline mode to avoid network access entirely:

```
safe-synthesizer run --wandb-mode disabled --config config.yaml --data-source data.csv
```

See Running Safe Synthesizer -- WandB Integration for the full WandB setup.
If a WandB run fails to resume (e.g., the run ID no longer exists on the WandB server),
pass --wandb-resume-job-id with a valid run ID from the same WandB project, or
remove the argument to start a fresh WandB run.
!!! warning "Experimental"
    Time series synthesis is an experimental feature. APIs and behavior may change between releases.
Time series synthesis has additional validation and generation requirements. For configuration examples, see Configuration -- Time Series.
Missing timestamp values:
: Any NaN or null values in the timestamp column raise a DataError.
Clean your data before running the pipeline:
```python
df = df.dropna(subset=["timestamp"])
df = df.sort_values(by=["group_column", "timestamp"])
```
Interval mismatch:
: If timestamp_interval_seconds does not match the actual intervals in your
data, a warning is logged but the pipeline continues. Verify your interval
setting matches the data.
Groups skipped during generation:
: If a group consistently produces invalid records (exceeding
generation.patience consecutive batches above
generation.invalid_fraction_threshold), that group is skipped entirely.
Check your training data quality for those groups.
Out-of-order records:
: During generation, records are validated for chronological order. Records that arrive out of order are marked invalid.
All groups in the dataset must begin at the same timestamp when
time_series.start_timestamp is null (inferred from data). If group
start timestamps differ, the pipeline raises a DataError. Either align
all group start timestamps in your data, or set
time_series.start_timestamp to an explicit value that applies to all
groups.
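To catch misaligned group starts before the pipeline raises a `DataError`, a simple pre-check can compare each group's first timestamp (hypothetical helper, not a library API):

```python
def check_group_starts(first_timestamps: dict[str, str]) -> bool:
    """All groups must share the same first timestamp when
    time_series.start_timestamp is inferred from the data."""
    return len(set(first_timestamps.values())) <= 1

check_group_starts({"sensor_a": "2024-01-01T00:00:00",
                    "sensor_b": "2024-01-01T00:00:00"})  # True: aligned
```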
Safe Synthesizer uses a structured error hierarchy. Understanding which error
class you received helps narrow down the cause and write targeted except clauses.
Inheritance:
```
SafeSynthesizerError
├── InternalError (also RuntimeError)
└── UserError
    ├── DataError (also ValueError)
    ├── ParameterError (also ValueError)
    └── GenerationError (also RuntimeError)
```
SDK callers can catch [UserError][nemo_safe_synthesizer.errors.UserError] to handle all user-facing errors, or
[SafeSynthesizerError][nemo_safe_synthesizer.errors.SafeSynthesizerError] to also catch internal errors. Catching the built-in
base (ValueError, RuntimeError) also works since each class inherits from
both.
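To see how `except` clauses resolve against this hierarchy, here is a self-contained sketch with local stand-in classes mirroring the documented inheritance (in real code, import these from `nemo_safe_synthesizer.errors` instead of redefining them):

```python
# Local stand-ins mirroring the documented hierarchy -- illustration only.
class SafeSynthesizerError(Exception): ...
class InternalError(SafeSynthesizerError, RuntimeError): ...
class UserError(SafeSynthesizerError): ...
class DataError(UserError, ValueError): ...
class ParameterError(UserError, ValueError): ...
class GenerationError(UserError, RuntimeError): ...

try:
    raise DataError("timestamp column has missing values")
except UserError as exc:
    # Catches DataError, ParameterError, and GenerationError alike
    print(type(exc).__name__)  # prints "DataError"
```

Because each class also inherits a built-in base, `except ValueError` would catch `DataError` and `ParameterError` too, which is why existing generic handlers keep working.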
Bad input data -- NaNs, unsupported types, empty DataFrames, missing values in group or timestamp columns.
Checklist:
- Verify your CSV loads cleanly with `pd.read_csv()`
- Check for mixed types in columns
- Check that column names in your config match the actual data
- For time series, ensure no nulls in timestamp or group columns
Context-length errors (records too long for the model) raise GenerationError,
not DataError -- see Context Length and Record Fitting.
Invalid configuration -- missing columns referenced in config, incompatible option combinations, or missing required parameters. The stacktrace will indicate which parameter is invalid.
Checklist:
- Run `safe-synthesizer config validate --config config.yaml`
- Verify column names in `group_training_examples_by` and `order_training_examples_by` exist in your data
- For DP, ensure all required privacy parameters are set
Errors during generation or data assembly. Two common cases:
- Sampling failures (no valid records, patience exceeded) -- see GenerationError in the Generation section
- Context-length errors during data assembly (records too long for the model) -- see Context Length and Record Fitting
Library bugs. If you encounter this error through documented interfaces, please file an issue on GitHub.
- Running Safe Synthesizer -- pipeline execution and CLI commands
- Configuration Reference -- parameter tables
- Synthetic Data Quality -- quality and privacy score diagnostics