feat: support native embedding generation #106

Merged
nabinchha merged 52 commits into main from nmulepati/feat/support-embedding-generation
Dec 15, 2025

@nabinchha nabinchha commented Dec 8, 2025

Contributor

Major changes:

  • Split InferenceParameters into generation-type-specific classes. This includes renaming the existing InferenceParameters to ChatCompletionInferenceParams, with backwards compatibility and a deprecation warning.
  • Moved the inference parameter docs into concepts/models/inference-parameters.md
  • Updated the CLI for a better UX around displaying and performing CRUD on generation-type-specific inference parameters
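
For readers unfamiliar with the pattern, a rename with backwards compatibility and a deprecation warning can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the fields `temperature` and `max_tokens` are hypothetical placeholders, and the real classes may be built differently (for example on a validation library rather than dataclasses).

```python
import warnings
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChatCompletionInferenceParams:
    """Hypothetical stand-in for the renamed class; the fields are illustrative only."""

    temperature: Optional[float] = None
    max_tokens: Optional[int] = None


class InferenceParameters(ChatCompletionInferenceParams):
    """Old name kept as a thin subclass so existing imports keep working."""

    def __init__(self, *args, **kwargs):
        # Warn at the caller's location (stacklevel=2), then defer to the new class.
        warnings.warn(
            "InferenceParameters is deprecated; use ChatCompletionInferenceParams instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

With this shape, code that still constructs `InferenceParameters(...)` gets an object that passes `isinstance` checks against the new class, while the warning points callers toward the new name.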

Minor changes:

  • TokenUsageStats.prompt_tokens -> TokenUsageStats.input_tokens
  • TokenUsageStats.completion_tokens -> TokenUsageStats.output_tokens
  • Added nvidia-embedding and openai-embedding to the default model configs.

Here's an example of what the workflow looks like for embeddings:

import json
import pandas as pd
from data_designer.essentials import (
    DataDesigner,
    DataDesignerConfigBuilder,
    EmbeddingColumnConfig,
    EmbeddingInferenceParameters,
    ExpressionColumnConfig,
    ModelConfig,
)

model_configs = [
    ModelConfig(
        alias="nvidia-embedder",
        model="nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2",
        provider="nvidia",
        inference_parameters=EmbeddingInferenceParameters(
            extra_body={"input_type": "query"},
        ),
    ),
    ModelConfig(
        alias="openai-embedder",
        model="text-embedding-3-small",
        provider="openai",
        inference_parameters=EmbeddingInferenceParameters(
            dimensions=768,
            encoding_format="float"
        )
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

with open("dummy_generated_data.json", "r") as f:
    full_generation_data = json.load(f)

config_builder.with_seed_dataset(
    dataset_reference=DataDesigner.make_seed_reference_from_dataframe(
        pd.DataFrame(full_generation_data),
        "tmp_dedup.json"
    ),
    sampling_strategy="ordered"
)

config_builder.add_column(
    ExpressionColumnConfig(
        name="questions",
        expr='[{% for pair in qa_generation.pairs %}"{{ pair.question }}",{% endfor %}]'
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_nvidia",
        model_alias="nvidia-embedder",
        target_column="questions",
        chunk_pattern=r"\n+"
    )
)

config_builder.add_column(
    EmbeddingColumnConfig(
        name="embedding_openai",
        model_alias="openai-embedder",
        target_column="questions",
        chunk_pattern=r"\n+"
    )
)

data_designer = DataDesigner()
result = data_designer.preview(config_builder)
result.display_sample_record()
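
The description does not spell out how `chunk_pattern` is applied. One plausible reading, sketched below purely as an assumption, is that the target column's text is split on the regex and each non-empty piece is embedded separately. The helper `split_into_chunks` is hypothetical and is not part of the library.

```python
import re


def split_into_chunks(text: str, chunk_pattern: str = r"\n+") -> list[str]:
    """Split text on a regex and drop empty pieces (hypothetical helper)."""
    pieces = re.split(chunk_pattern, text)
    # Strip surrounding whitespace and discard chunks that end up empty.
    return [piece.strip() for piece in pieces if piece.strip()]


chunks = split_into_chunks("What is A?\n\nWhat is B?\n")
print(chunks)
```

Under this assumption, a pattern of `\n+` treats any run of newlines as a single separator, so blank lines between questions do not produce empty chunks.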

Pending:

  • Update docs
  • Add/Update unit tests
  • Update CLI for inference parameter specification

closes #110, #40, and #89

@nabinchha nabinchha changed the title from "Nmulepati/feat/support embedding generation" to "feat: support native embedding generation" Dec 8, 2025
@nabinchha nabinchha linked an issue Dec 9, 2025 that may be closed by this pull request
Comment thread docs/concepts/columns.md Outdated
Comment thread src/data_designer/config/utils/constants.py
Comment thread src/data_designer/engine/column_generators/generators/embedding.py

@andreatgretel andreatgretel left a comment

Contributor

🚀

andreatgretel previously approved these changes Dec 15, 2025
eric-tramel previously approved these changes Dec 15, 2025
johnnygreco previously approved these changes Dec 15, 2025
Comment thread docs/concepts/models/inference-parameters.md Outdated
@nabinchha nabinchha merged commit 8370e4a into main Dec 15, 2025
28 checks passed
@nabinchha nabinchha deleted the nmulepati/feat/support-embedding-generation branch December 15, 2025 18:03
Development

Successfully merging this pull request may close these issues.

Add native embedding generation support

4 participants