Releases: argilla-io/distilabel
1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4
1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in #786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in #791
- docs: update script for issue dashboard by @sdiazlor in #775
- Fix 404 model not found for private Serverless IE by @dvsrepo in #806
New Contributors
- @Hassaan-Qaisar made their first contribution in #786
Full Changelog: 1.2.2...1.2.3
1.2.2
What's Changed
- Fix passing `input` to `format_output` function by @gabrielmbmb in #781
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in #745
- docs: change references by @sdiazlor in #754
- Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in #764
Full Changelog: 1.2.0...1.2.1
1.2.0
✨ Release highlights
Structured generation with `instructor`, `InferenceEndpointsLLM` now supports structured generation and `StructuredGeneration` task
- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:
Structured generation with `instructor` example
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field

class Node(BaseModel):
    id: int
    label: str
    color: str

class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"

class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(..., default_factory=list)
    edges: List[Edge] = Field(..., default_factory=list)

with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row (a minimal sketch follows this list).
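A minimal sketch of what the per-row schema could look like, pairing `InferenceEndpointsLLM` with the new `StructuredGeneration` task. The shape of the `structured_output` column (a JSON schema under `format`/`schema` keys) and the column name itself are assumptions based on the other structured-generation integrations, so check the docs for the exact format:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import StructuredGeneration

with Pipeline(name="structured-generation-sketch") as pipeline:
    # Each input row carries its own schema (column name and payload shape are assumptions).
    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "instruction": "Create a character profile for a fantasy RPG.",
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "level": {"type": "integer"},
                        },
                        "required": ["name", "level"],
                    },
                },
            }
        ],
    )

    structured_generation = StructuredGeneration(
        name="structured_generation",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
    )

    load_dataset >> structured_generation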
New tasks for generating datasets for training embedding models
sentence-transformers v3 was recently released and we couldn't resist the urge to add a few new tasks for creating datasets to train embedding models!
- New `GenerateSentencePair` task that generates a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrase, generate a semantically-similar sentence, generate a query, or generate an answer (a minimal sketch follows this list).
- Implemented Improving Text Embeddings with Large Language Models, adding the following tasks derived from the paper:
- `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
- `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
- `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
- `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
- `GenerateTextClassificationData`, which allows creating text classification data from the input data.
- `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
- `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
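A minimal sketch of how `GenerateSentencePair` might be wired up; the `action` values and the `triplet` flag follow the description above but are assumptions to verify against the docs:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import GenerateSentencePair

with Pipeline(name="sentence-pairs-sketch") as pipeline:
    load_dataset = LoadDataFromDicts(
        name="load_anchors",
        data=[{"anchor": "distilabel is a framework for synthetic data generation."}],
    )

    generate_pairs = GenerateSentencePair(
        name="generate_pairs",
        action="query",  # or "paraphrase", "semantically-similar", "answer"
        triplet=True,  # also generate a `negative` sentence (assumed flag name)
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    load_dataset >> generate_pairs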
New Steps for loading data from different sources and saving/loading Distiset to disk
We've added a few new steps that allow loading data from different sources:
- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method (see the sketch below).
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
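A minimal sketch of loading a previously saved distiset back into a pipeline; the `dataset_path` argument name mirrors the `save_to_disk` example below and is an assumption:
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDisk

with Pipeline(name="load-from-disk-sketch") as pipeline:
    # Points at the folder written by `distiset.save_to_disk(...)` (argument name assumed).
    load_distiset = LoadDataFromDisk(
        name="load_distiset",
        dataset_path="my-distiset",
    )
    ...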
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, which allows saving the generated distiset to disk along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example
from distilabel.pipeline import Pipeline
with Pipeline(name="my-pipeline") as pipeline:
...
if __name__ == "__main__":
distiset = pipeline.run(...)
distiset.save_to_disk(dataset_path="my-distiset")MixtureOfAgentsLLM implementation
We've added a new LLM called MixtureOfAgentsLLM derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new LLM allows generating improved outputs thanks to the collective expertise of several LLMs.
`MixtureOfAgentsLLM` example
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM
llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
Saving cache and passing batches to GlobalSteps optimizations
- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as a backend for passing the data of the batches (a rough sketch follows this list).
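A rough sketch of how this could be enabled when running a pipeline; the `use_fs_to_pass_data` and `storage_parameters` argument names are assumptions based on this description, so verify them against the docs:
from distilabel.pipeline import Pipeline

with Pipeline(name="fs-batches-sketch") as pipeline:
    ...

if __name__ == "__main__":
    # Pass the data of the batches for `GlobalStep`s through a (possibly remote)
    # fsspec file system instead of the queue; argument names are assumptions.
    distiset = pipeline.run(
        use_fs_to_pass_data=True,
        storage_parameters={"path": "file:///tmp/distilabel-batches"},
    )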
BasePipeline and _BatchManager refactor
The logic around BasePipeline and _BatchManager has been refactored, which will make it easier to implement new pipelines in the future.
Added ArenaHard as an example of how to use distilabel to implement a benchmark
distilabel can be easily used to create an LLM benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with distilabel: Arena Hard
📚 Improved documentation structure
We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.
What's Changed
- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in https://githu...
1.1.1
1.1.0
Distilabel 1.1.0
Two new tasks implemented!
Genstruct task (#600)
You can now use the Genstruct task, as described in https://huggingface.co/NousResearch/Genstruct-7B, to generate synthetic instruction fine-tuning datasets from a raw document:
from distilabel.llms import TransformersLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromDicts
from distilabel.steps.tasks import Genstruct
with Pipeline(name="harry-potter-genstruct") as pipeline:
load_hub_dataset = LoadDataFromDicts(
name="load_dataset",
data=[
{
"title": "Harry Potter and the Sorcerer's Stone",
"content": "An orphaned boy enrolls in a school of wizardry, where he learns the truth about himself, his family and the terrible evil that haunts the magical world.",
},
{
"title": "Harry Potter and the Chamber of Secrets",
"content": "Harry Potter lives his second year at Hogwarts with Ron and Hermione when a message on the wall announces that the legendary Chamber of Secrets has been opened. The trio soon realize that, to save the school, it will take a lot of courage.",
},
],
)
task = Genstruct(
name="task",
llm=TransformersLLM(
model="NousResearch/Genstruct-7B",
torch_dtype="float16",
chat_template="{{ messages[0]['content'] }}",
device="cuda:0",
),
num_generations=2,
group_generations=False,
output_mappings={"model_name": "model"},
)PrometheusEval task (#610)
A new PrometheusEval task, based on the recently published paper "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models":
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadHubDataset
from distilabel.steps.tasks import PrometheusEval
with Pipeline(name="prometheus") as pipeline:
load_dataset = LoadHubDataset(
name="load_dataset",
repo_id="HuggingFaceH4/instruction-dataset",
split="test",
output_mappings={"prompt": "instruction", "completion": "generation"},
)
task = PrometheusEval(
name="task",
llm=vLLM(
model="prometheus-eval/prometheus-7b-v2.0",
chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
),
mode="absolute",
rubric="factual-validity",
reference=False,
num_generations=1,
group_generations=False,
)
load_dataset >> taskConnect the steps in the pipeline with >> (#490)
Now you can connect your steps using the right-shift operator (>>) in Python:
from distilabel.llms import InferenceEndpointsLLM, OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps.generators.huggingface import LoadHubDataset
from distilabel.steps.task.evol_instruct.base import EvolInstruct
from distilabel.steps.combine import CombineColumns

with Pipeline(name="Pipe name") as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)

    evol_instruction_complexity_1 = EvolInstruct(
        llm=OpenAILLM(model="gpt-3.5-turbo"),
    )

    evol_instruction_complexity_2 = EvolInstruct(
        llm=InferenceEndpointsLLM(model_id="mistralai/Mixtral-8x7B-Instruct-v0.1"),
    )

    combine_columns = CombineColumns(
        columns=["response"],
        output_columns=["candidates"],
    )

    (
        load_hub_dataset
        >> [evol_instruction_complexity_1, evol_instruction_complexity_2]
        >> combine_columns
    )
Routing batch function (#595)
Thanks to the new routing_batch_function, each batch of an upstream step can be routed conditionally to a list of specific downstream steps. In addition, we have included a sample_n_steps routing batch function, making it easier to replicate the setup of the original UltraFeedback paper:
import random
from distilabel.llms import MistralLLM, OpenAILLM, VertexAILLM
from distilabel.pipeline import Pipeline, routing_batch_function
from distilabel.steps import CombineColumns, LoadHubDataset
from distilabel.steps.tasks import TextGeneration
@routing_batch_function()
def sample_two_steps(steps: list[str]) -> list[str]:
    return random.sample(steps, 2)

with Pipeline("pipe-name", description="My first pipe") as pipeline:
    load_dataset = LoadHubDataset(
        name="load_dataset",
        output_mappings={"prompt": "instruction"},
    )

    tasks = []
    for llm in (
        OpenAILLM(model="gpt-4-0125-preview"),
        MistralLLM(model="mistral-large-2402"),
        VertexAILLM(model="gemini-1.0-pro"),
    ):
        tasks.append(
            TextGeneration(name=f"text_generation_with_{llm.model_name}", llm=llm)
        )

    combine_generations = CombineColumns(
        name="combine_generations",
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"],
    )

    load_dataset >> sample_two_steps >> tasks >> combine_generations
Generate structured outputs using outlines (#601)
You can generate JSON or regex-constrained outputs using `TransformersLLM`, `LlamaCppLLM` or `vLLM`, thanks to the integration with [outlines](https://github.com/outlines-dev/outlines):
from enum import Enum
from distilabel.llms import LlamaCppLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, StringConstraints, conint
from typing_extensions import Annotated
class Weapon(str, Enum):
    sword = "sword"
    axe = "axe"
    mace = "mace"
    spear = "spear"
    bow = "bow"
    crossbow = "crossbow"

class Armor(str, Enum):
    leather = "leather"
    chainmail = "chainmail"
    plate = "plate"
    mithril = "mithril"

class Character(BaseModel):
    name: Annotated[str, StringConstraints(max_length=30)]
    age: conint(gt=1, lt=3000)
    armor: Armor
    weapon: Weapon

with Pipeline("RPG-characters") as pipeline:
    system_prompt = (
        "You are a leading role play gamer. You have seen thousands of different characters and their attributes."
        " Please return a JSON object with common attributes of an RPG character."
    )

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": system_prompt,
                "instruction": f"Give me a character description for a {char}",
            }
            for char in ["dwarf", "elf", "human", "ork"]
        ],
    )

    text_generation = TextGeneration(
        name="text_generation_rpg",
        llm=LlamaCppLLM(
            model_path="model/path",  # type: ignore
            structured_output={"format": "json", "schema": Character},
        ),
    )

    load_dataset >> text_generation
New GroqLLM (#583)
New integration with groq; special mention to @kcentric, who did the initial work prior to the refactor for 1.0.0.
from distilabel.llms.groq import GroqLLM
from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import TextGeneration
with Pipeline(name="text-generation-groq") as pipeline:
...
text_generation_with_groq = TextGeneration(
llm=GroqLLM(model="llama3-70b-8192"),
)
...Easily test your pipeline doing a dry_run (#635)
with Pipeline(...) as pipeline:
    ...

distiset = pipeline.dry_run(
    parameters=...,  # The same argument as `Pipeline.run`
    batch_size=1,  # Optional, will be set to 1 by default.
)

[05/13/24 16:22:30] INFO ['distilabel.pipeline.local'] 🌵 Dry run mode          local.py:103
                    INFO ['distilabel.pipeline.local'] 📝 Pipeline data will be ...   local.py:125
Pipeline.log file is dumped to the Hugging Face repository (#568)
From now on, when you call `distiset.push_to_hub`, the `pipeline.log` file will be automatically pushed to your dataset repository along with the `pipeline.yaml` to keep track of the execution.
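For reference, a minimal sketch of the call (the repository id is a placeholder):
distiset = pipeline.run(...)
# Uploads the dataset and, from this release on, also pipeline.yaml and pipeline.log.
distiset.push_to_hub("my-org/my-distiset")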
New distilabel_metadata column to store internal data (#586)
You can now optionally enable the addition of a metadata column. This column can store other things in the future, but for the moment it is really handy for keeping the raw output from an LLM: in case the task does some post-processing via `format_output`, the original output is preserved so nothing is lost.
You can include the metadata at the task level as:
TextGeneration(..., add_raw_output=True|False)
And directly determine whether you want this column in your final Distiset:
with Pipeline(..., enable_metadata=True|False):
    ...
This way we can decide whether to remove the column altogether.
All the changes in this PR
- Allow nested connect calls and overload rshift method to connect steps by @plaguss in #490
- Fix `llm_blender` installation by @alvarobartt in #557
- Warn user a...
1.0.3
What's Changed
- Add `stop` and `stop_sequences` in `LLM.generate` subclasses by @alvarobartt in #585
Full Changelog: 1.0.2...1.0.3
1.0.2
What's Changed
- Fix `RuntimeParameter` validation when provided as `_Step` attr by @alvarobartt in #564
- Add `seed` with `random.randint` to ensure cache is not used by @alvarobartt in #571
Full Changelog: 1.0.1...1.0.2
1.0.1
What's Changed
- Fix typo in readme and remove the ToArgilla step by @dvsrepo in #548
- Fix `model_validator` in `InferenceEndpoints` due to `Pipeline` pickling by @alvarobartt in #552
Full Changelog: 1.0.0...1.0.1
