@plaguss commented Nov 8, 2024

Description

This PR adds a way to compute summary statistics for the distiset and adds a table with them to the final README card template.
The statistics are computed per leaf node, and one table is shown for each, like the following:


Dataset Statistics

  • Summary statistics: default

|  | mean | std | min | max | sum |
|---|---|---|---|---|---|
| input_tokens_statistics_generation | 1881 | 1.41421 | 1880 | 1882 | 3762 |
| output_tokens_statistics_generation | 582 | 1.41421 | 581 | 583 | 1164 |
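
For readers who want to check the numbers, here is a minimal sketch of how one row of such a table could be computed. It is not the code from this PR: the `summarize` helper and the two sample values are hypothetical, chosen only because they are consistent with the `input_tokens_statistics_generation` row (two rows, sample standard deviation).

```python
import statistics


def summarize(values):
    """Return mean, std, min, max and sum for a list of token counts.

    Hypothetical helper for illustration; the PR's actual implementation
    may differ. Uses the sample standard deviation, so it needs at least
    two values.
    """
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
    }


# Assumed per-row input token counts, consistent with the table above.
input_tokens = [1880, 1882]

stats = summarize(input_tokens)
print(stats)
```

With these two values the sample standard deviation is the square root of 2, which rounds to the 1.41421 shown in the table.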

Should be merged after #1034

Closes #1046

…dd new merge_dicts to help merging user-assistant messages in magpie
@plaguss added the enhancement (New feature or request) label Nov 8, 2024
@plaguss added this to the 1.5.0 milestone Nov 8, 2024
@plaguss self-assigned this Nov 8, 2024
@plaguss linked an issue Nov 8, 2024 that may be closed by this pull request
github-actions bot commented Nov 8, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-1055/

codspeed-hq bot commented Nov 8, 2024

CodSpeed Performance Report

Merging #1055 will not alter performance

Comparing count-dataset-tokens (5741cd1) with develop (e830e25)

Summary

✅ 1 untouched benchmarks

@gabrielmbmb removed this from the 1.5.0 milestone Jan 16, 2025
Development

Successfully merging this pull request may close these issues.

[FEATURE] Compute the input/output tokens of a dataset
