Releases: argilla-io/distilabel
1.5.3
What's Changed
- Fix typo by @Riezebos in #1111
- Checks for images using PIL only if available by @plaguss in #1112
- Fix pipeline getting stuck when multiple step replicas by @gabrielmbmb in #1113
New Contributors
Full Changelog: 1.5.2...1.5.3
1.5.2
What's Changed
- Fix structured output JSON to
pydantic.BaseModelandLiteLLMasync completion client by @rolshoven in #1105
New Contributors
- @rolshoven made their first contribution in #1105
Full Changelog: 1.5.1...1.5.2
1.5.1
What's Changed
- Remove deprecated
CombineColumnsstep by @gabrielmbmb in #1101 - Fix image import handling and update MlxLLM initialisation by @davidberenstein1957 in #1102
- Fix
MlxLLMby aligning it withmlx-lm>=0.21by @davidberenstein1957 in #1103 1.5.1by @gabrielmbmb in #1104
Full Changelog: 1.5.0...1.5.1
1.5.0
✨ Release highlights
🖼️ Image Generation Support
We're excited to introduce ImageGenerationModel, a new abstraction for working with image generation models. This addition enables seamless integration with models that can transform text prompts into images.
Available Services
- 🤗
InferenceEndpointsImageGeneration: Integration with Hugging Face's Inference Endpoints OpenAIImageGeneration: Integration with OpenAI's DALL-E
Architecture
Just as LLMs are used by a Task, we've introduced ImageTask as a high-level abstraction for image generation workflows. ImageTask defines how a step should use an ImageGenerationModel to accomplish specific image generation tasks.
Our first implementation, the ImageGeneration task, provides a straightforward interface: given a text prompt, it generates the corresponding image, leveraging any of the supported image generation models.
We've also added a small tutorial on how to generate images using distilabel: distilabel - Tutorials - Image generation with distilabel
Images as inputs for LLMs
We've added initial support for providing images as input to an LLM through the new TextGenerationWithImage task. We've updated and tested InferenceEndpointsLLM and OpenAILLM with this new task, but we'll image as input compatibility in the next releases for others such as vLLM.
Check the tutorial distilabel - Tutorials - Text generation with images in distilabel to get started!
💻 New MlxLLM integration
We've integrated mlx-lm package with the new MlxLLM class, enabling native machine learning acceleration on Apple Silicon Macs. This integration supercharges synthetic data generation by leveraging MLX's highly optimized framework designed specifically for the M-series chips.
New InstructionResponsePipeline template
We've started making changes so distilabel is easier to use since minute one. We'll start adding presets or templates that allows to quickly get a pipeline with some sensible preconfigured defaults for generating data for certain tasks. The first task we've worked on is the SFT or Instruction Response tuning pipeline which you can use like:
from distilabel.pipeline import InstructionResponsePipeline
pipeline = InstructionResponsePipeline()
distiset = pipeline.run()Define load stages
We've added a way for users to define which steps of the pipeline should be loaded together, allowing for more efficient resource management and better control over the execution flow. This new feature is particularly useful in scenarios where resource-constrained environments limit the ability to execute all steps simultaneously, requiring steps to be executed in distinct stages.
We've added a detailed guide on how to use this feature: distilabel - How-to guides - Load groups and execution stages.
What's Changed
- Add common typing module by @plaguss in #1029
- docs: textcat tutorial by @sdiazlor in #949
- Add
taskdecorator by @gabrielmbmb in #1028 - Update
docsworkflows to useuvby @gabrielmbmb in #1032 - fix: simplify prompt template
ArgillaLabellerby @davidberenstein1957 in #1033 - Add
dataset_batch_sizeargument by @gabrielmbmb in #1039 - Move all LLMs to distilabel.models by @plaguss in #1045
- Fix a tiny typo in
_Stepdocstring by @sadra-barikbin in #1051 - docs: improve docs for
MinHashDedupStepby @anakin87 in #1050 - Fix new response_format variable in openai api by @plaguss in #1053
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci in #1043
- Update
LLM.generateoutput to includestatisticsby @plaguss in #1034 - Add example of structured output. by @plaguss in #1061
- feat: implenent basic SFT pipeline based on synthetic data generator by @burtenshaw in #1059
- fix: broken import in instruction by @burtenshaw in #1063
- Fix StepOutput type by @plaguss in #1072
- docs: update issue templates by @sdiazlor in #1074
- Update
unloadmethod fromvLLMto properly free resources by @gabrielmbmb in #1077 - Add tasks to replicate Math-shepherd by @plaguss in #1052
- Add
load_groupsargument torunby @gabrielmbmb in #1075 - Add
TextGenerationWithImagetask by @plaguss in #1066 - Create columns with
LLMreturned extra keys by @gabrielmbmb in #1078 - Fix
vLLMunload logic when model isNoneby @gabrielmbmb in #1080 - Fix
merge_distilabel_metadatafunction when handling outputs fromTaskwithgroup_generations==Trueby @gabrielmbmb in #1082 - chore: update base.py by @eltociear in #1085
- Add magpie support llama cpp ollama by @davidberenstein1957 in #1086
- Feat/954 llama cpp by @bikash119 in #1000
- fix import by replacing GeneratorOutput with GeneratorStepOutput by @davidberenstein1957 in #1093
- add mlx support by @davidberenstein1957 in #1089
- Support custom default headers in
OpenAILLMclass. by @khulaifi95 in #1088 - fix/pip install messages by @davidberenstein1957 in #1095
- Fix handling empty list statistics by @gabrielmbmb in #1094
- update to outlines010 by @davidberenstein1957 in #1092
- update: search by match by @sdiazlor in #1096
- Add Legend to Component Gallery Icons by @ParagEkbote in #1090
- Image Language Models and
ImageGenerationtask by @plaguss in #1060 - Update
LLMs to support prompt logprobs use-case by @gabrielmbmb in #1099 1.5.0by @gabrielmbmb in #1100
New Contributors
- @sadra-barikbin made their first contribution in #1051
- @anakin87 made their first contribution in #1050
- @pre-commit-ci made their first contribution in #1043
- @eltociear made their first contribution in #1085
- @bikash119 made their first contribution in #1000
- @khulaifi95 made their first contribution in #1088
- @ParagEkbote made their first contribution in #1090
Full Changelog: 1.4.2...1.5.0
1.4.2
What's Changed
- Fix chat template not applied in
TransformersLLMby @gabrielmbmb in #1083
Full Changelog: 1.4.1...1.4.2
1.4.1
What's Changed
- Fix not handling list of all primitive types in
SignatureMixinby @gabrielmbmb in #1037
Full Changelog: 1.4.0...1.4.1
1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We’ve updated the LLM interface so now LLMs using an external platform that offers a batch service can be integrated in distilabel. In addition, OpenAILLM has been updated so it can use the OpenAI Batch API to get 50% cost reductions.
distilabel-offline-batch-generation.mp4
Improved cache for maximum outputs reusability
We all know that running LLM is costly and most of the times we want to reuse as much as we can the outputs generated with them. Before this release, distilabel cache mechanism enabled to recover a pipeline execution that was stopped before finishing and to re-create the Distiset generated by one that finished its execution and was re-executed.
In this release, we've greatly improved the cache so the outputs of all the Steps are cached and therefore can be reused in other pipelines executions even if the pipeline has changed:
In addition, we've added a use_cache attribute in the Steps that allows toggling the use of the cache at step level.
Steps can generated artifacts
In some cases, Step produces some additional artifacts that are used to generate its outputs. These artifacts can take some time to be generated and they could be reused in the future. That’s why we’ve added a new method called Step.save_artifact that can be called within the step to store artifacts generated by it. The artifacts generated by the Step will also get uploaded to the Hugging Face Hub.
from typing import List, TYPE_CHECKING
from distilabel.steps import GlobalStep, StepInput, StepOutput
import matplotlib.pyplot as plt
if TYPE_CHECKING:
from distilabel.steps import StepOutput
class CountTextCharacters(GlobalStep):
@property
def inputs(self) -> List[str]:
return ["text"]
@property
def outputs(self) -> List[str]:
return ["text_character_count"]
def process(self, inputs: StepInput) -> "StepOutput": # type: ignore
character_counts = []
for input in inputs:
text_character_count = len(input["text"])
input["text_character_count"] = text_character_count
character_counts.append(text_character_count)
# Generate plot with the distribution of text character counts
plt.figure(figsize=(10, 6))
plt.hist(character_counts, bins=30, edgecolor="black")
plt.title("Distribution of Text Character Counts")
plt.xlabel("Character Count")
plt.ylabel("Frequency")
# Save the plot as an artifact of the step
self.save_artifact(
name="text_character_count_distribution",
write_function=lambda path: plt.savefig(path / "figure.png"),
metadata={"type": "image", "library": "matplotlib"},
)
plt.close()
yield inputsNew Tasks: CLAIR, APIGEN and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A´ such that the resulting preference A
preferredA’ is much more contrastive and precise. - New tasks to replicate APIGen framework:
APIGenGenerator,APIGenSemanticChecker,APIGenExecutionChecker. These tasks allow generating datasets like the one presented in the paper: APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets - New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New TextClassification task to make zero-shot text classification based on a predefined but highly customizable prompt.
- TextClustering, to generate clusters from text and group your generations, discovering labels from your data. Comes with 2 steps to run UMAP and DBSCAN algorithms.
- Updated TextGeneration to simplify customization of tasks that don’t require further post-processing.
New Steps to sample data in your pipelines and remove duplicates
- New DataSampler step to sample data from other datasets, which can be useful to inject different examples for few-shot examples in your prompts.
- New EmbeddingDedup step to remove duplicates based on embeddings and a distance metric.
- New MinHashDedup step to remove near duplicates from the text based on MinHash and MinHashLSH algorithm.
- New TruncateTextColumns to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New CombineOutputs to combine the outputs of two or more steps into a single output.
Generate text embeddings using vLLM
- Now you can generate embeddings using vLLMEmbeddings!
Extra things
- Easily visualize the tasks’ prompts using Task.print method.
- New use_default_structured_outputs flag in tasks to automatically use structured generation in some tasks that can benefit from it.
What's Changed
- Make
ClientvLLM.model_nameacached_propertyby @gabrielmbmb in #862 - Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for
GenerateSentencePairtask by @plaguss in #868 - Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of
default_chat_templateby @gabrielmbmb in #888 - Temporary fix for installing
llama-cpp-pythonby @gabrielmbmb in #886 - Fix unit tests after release of
transformers==4.44.0by @gabrielmbmb in #891 - Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude
repo_idfromLoadDataFromFileSystemby @plaguss in #898 - Fix loader to read from a glob pattern by @plaguss in #877
- Add
save_artifactmethod to_Stepby @gabrielmbmb in #871 - Add new
add_raw_inputargument to_Taskso we can automatically include the formatted input by @plaguss in #903 - New
TruncateTextColumnto truncate the length of texts using the number of tokens or characters by @plaguss in #902 - Update
inputsandoutputsinterface to allow returning dict indicating optionality by @gabrielmbmb in #883 - Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update
RewardModelScore.inputsproperty by @gabrielmbmb in #908 - Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add
URIALtask by @gabrielmbmb in #921 - Add
vLLMEmbeddingsby @plaguss in #920 - docs: add tutorials preference and clean by @sdiazlor in #917
- Fix
StructuredGenerationexamples and internal check by @plaguss in #912 - Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add
revisionruntime parameter toLoadDataFromHubby @gabrielmbmb in #928 - Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update
make_generator_stepto set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://g...
1.3.2
What's Changed
- Deepseek prover task by @plaguss in #733
- Do not cancel in progress docs workflows by @gabrielmbmb in #919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in #918
- Fix passing
base_urlinmodel_idinInferenceEndpointsLLMby @gabrielmbmb in #924
Full Changelog: 1.3.1...1.3.2
1.3.1
What's Changed
- Create new
distilabel.constantsmodule to store constants and avoid circular imports by @plaguss in #861 - Add OpenAI request timeout by @ashim-mahara in #858
New Contributors
- @ashim-mahara made their first contribution in #858
Full Changelog: 1.3.0...1.3.1
1.3.0
What's Changed
- Add new step
CombineKeysby @plaguss in #747 - Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in #758
- Drop remove deprecated
LoadHubDatasetby @davidberenstein1957 in #759 - Add
requirementslist forPipelineby @plaguss in #720 - Add
StepResourcesand step replicas inPipelineby @gabrielmbmb in #750 - Add load stages by @gabrielmbmb in #760
- Update min required version to
python==3.9by @gabrielmbmb in #770 - Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in #762
- Add
docs-pr.ymlanddocs-pr-close.ymlworkflows by @gabrielmbmb in #774 - Add
RayPipelineclass by @gabrielmbmb in #769 - Fixed closed PR workflow by @gabrielmbmb in #776
- Add
MagpieandMagpieGeneratortasks by @gabrielmbmb in #778 - Fix some issues related to
Magpietask by @gabrielmbmb in #783 - Add
end_with_userandinclude_system_promptflags toMagpietasks and handleNones. by @gabrielmbmb in #784 - Add workflow concurrency group for publishing docs by @gabrielmbmb in #796
- Add
_desired_num_gpusattribute toCudaDevicePlacementMixinby @gabrielmbmb in #795 - Compatibility with
vLLMwithtensor_parallel_sizeargument by @gabrielmbmb in #805 - Update default names in
GroupColumnsby @plaguss in #808 - Request batches to
GeneratorStepif only step in pipeline by @gabrielmbmb in #828 - Add default name for a pipeline by @plaguss in #809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in #821
- Some more
Magpieimprovements by @gabrielmbmb in #833 - Add
Embeddingsbase class,SentenceTransformerEmbeddingsclass,EmbeddingGenerationandFaissNearestNeighboursteps by @gabrielmbmb in #830 - Create file per hostname in
CudaDevicePlacementMixinby @gabrielmbmb in #814 - Create a
GeneratorStepfrom a dataset using a helper function by @plaguss in #812 - Do not take into account
disable_cuda_device_placementfor pipeline signature by @gabrielmbmb in #838 - Add
RewardModelScorestep by @gabrielmbmb in #840 - Fix
LoadDataFromHubattribute_datasethadellipsisby default instead ofNoneby @gabrielmbmb in #841 - Create
PlacementGroupfor steps usingvLLMby @gabrielmbmb in #842 - Update
argillaintegration to useargilla_sdkv2 by @alvarobartt in #705 - Make
overall-ratingthe default aspect forUltraFeedbacktask by @gabrielmbmb in #843 - fix typo index.md by @franperic in #844
- Use
CudaDevicePlacementMixininRewardModelScorestep by @gabrielmbmb in #845 - Gather GPUs per Ray node to create placement groups by @gabrielmbmb in #848
- Fix typo in docs by @plaguss in #850
- Add
xfailrouting batch function tests by @gabrielmbmb in #852 - Fix creating placement group when
pipeline_parallel_size>1by @gabrielmbmb in #851 - docs: 846 docs include google analytics by @davidberenstein1957 in #847
- Add
ClientvLLMclass by @gabrielmbmb in #854 - Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in #856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in #855
- distilabel
1.3.0by @gabrielmbmb in #857
New Contributors
- @franperic made their first contribution in #844
Full Changelog: 1.2.4...1.3.0
