An instruction-following dataset generation recipe for Danish.
To install the dependencies and set up the project, you can run the following command:
```bash
make install
```
To then generate the dataset, you can run:
```bash
make dataset
```
To finetune a model on the dataset, you can run:
```bash
uv run src/scripts/finetune_model.py \
    --base-model HUGGINGFACE_MODEL_ID \
    --new-model HUGGINGFACE_MODEL_ID
```
All examples in the dataset are structured as follows:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "(...)"
    },
    {
      "role": "assistant",
      "content": "(...)"
    },
    {
      "role": "user",
      "content": "(...)"
    },
    (...)
    {
      "role": "assistant",
      "content": "(...)"
    }
  ]
}
```
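For reference, a conversation in this format can be loaded and inspected with the `datasets` library. This is a minimal sketch; the dataset ID below is a placeholder, not the published name:

```python
from datasets import load_dataset  # pip install datasets

# Placeholder dataset ID; substitute the actual Hugging Face Hub name.
dataset = load_dataset("your-org/danish-instruction-dataset", split="train")

# Print the first few characters of each turn in the first conversation.
for message in dataset[0]["messages"]:
    print(f"{message['role']}: {message['content'][:80]}")
```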
The dataset was created using several steps, each of which is described in detail in the subsections below. The code base used for the dataset generation can be found here.
We started by manually creating a set of 176 Danish seed prompts and answers, adapted from the English Self-Instruct seed prompts as well as from prompts crowdsourced as part of the EU Horizon project TrustLLM (grant agreement number 101135671). These seed prompts can be found here.
With the seed prompts in hand, we used the Alpaca recipe to generate an initial instruction dataset of 52,000 examples, based on the Gemma-3-27b-pt base decoder model. The generation used the seed prompts from the previous step as few-shot examples, and the generated examples were filtered using the same filters as in the Alpaca recipe, with the additional requirement that the examples had to be in Danish, which was checked using the Lingua language detection package.
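As an illustration, a Danish-language filter along these lines could be implemented with the Lingua Python bindings. This is a minimal sketch; the confidence threshold and the set of candidate languages are assumptions for illustration, not the project's actual settings:

```python
from lingua import Language, LanguageDetectorBuilder  # pip install lingua-language-detector

# Restricting the detector to a few plausible candidate languages makes
# detection faster and more reliable than enabling all supported languages.
detector = LanguageDetectorBuilder.from_languages(
    Language.DANISH, Language.ENGLISH, Language.SWEDISH, Language.BOKMAL
).build()

def is_danish(text: str, threshold: float = 0.8) -> bool:
    """Keep an example only if the detector is reasonably confident it is Danish."""
    confidence = detector.compute_language_confidence(text, Language.DANISH)
    return confidence >= threshold

examples = ["Skriv et digt om efteråret.", "Write a poem about autumn."]
danish_examples = [text for text in examples if is_danish(text)]
```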
The generated dataset was then grammar-corrected using the Gemma-3-27b-it instruction-tuned model. We used the base model in the previous step to avoid introducing instruction bias into the generation, but such bias is not a concern in this correction step, so the instruction-tuned model could be used here.
A number of the generated examples were nonsensical or generally of low quality, so we ran the generated instructions through Gemma-3-27b-it again, this time asking it to rewrite any low-quality instructions to improve them.
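Both of these cleanup passes amount to a chat-style prompt over each example. Below is a minimal sketch of such a pass; the prompt wording, the OpenAI-compatible endpoint, and the `ask_model` helper are illustrative assumptions, not the project's actual setup:

```python
from openai import OpenAI  # pip install openai

# Any OpenAI-compatible server (e.g. vLLM) hosting the model will do;
# the endpoint here is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask_model(prompt: str) -> str:
    """Send a single-turn chat request and return the model's reply."""
    response = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output suits a cleanup pass
    )
    return response.choices[0].message.content.strip()

REWRITE_PROMPT = (
    "Below is an instruction. If it is ungrammatical, nonsensical or of low "
    "quality, rewrite it into a clear, well-formed Danish instruction. "
    "Otherwise, return it unchanged.\n\nInstruction: {instruction}"
)

instructions = ["skriv et digt om efterår uden komma"]
cleaned = [ask_model(REWRITE_PROMPT.format(instruction=inst)) for inst in instructions]
```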
We next used the Evol-Instruct recipe to evolve the dataset for 4 generations, using the Gemma-3-27b-it model. This process makes the examples both more complex and more diverse. All the newly evolved examples were added to the dataset and shuffled together with the previous examples.
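Conceptually, each evolution generation rewrites every instruction with a randomly chosen evolution strategy, in the spirit of Evol-Instruct. A sketch under assumptions: the strategy prompts are paraphrased rather than the project's actual prompts, and `ask_model` is the illustrative helper from the sketch above:

```python
import random

# Paraphrased Evol-Instruct-style strategies; the actual prompts differ.
EVOLUTION_STRATEGIES = [
    "Add one extra constraint or requirement to the following instruction:",
    "Make the following instruction require multi-step reasoning:",
    "Rewrite the following instruction to cover a rarer, more specific topic:",
]

def evolve_dataset(instructions: list[str], generations: int = 4) -> list[str]:
    """Each generation evolves every instruction; old and new examples are kept."""
    evolved = list(instructions)
    current = instructions
    for _ in range(generations):
        current = [
            ask_model(f"{random.choice(EVOLUTION_STRATEGIES)}\n\n{inst}")
            for inst in current
        ]
        evolved.extend(current)
    random.shuffle(evolved)  # shuffle the evolved examples with the previous ones
    return evolved
```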
Finally, we added 3 follow-up queries and answers to each of the examples in the dataset, again using the Gemma-3-27b-it model.
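Extending a conversation with follow-up turns can be sketched as below, again reusing the illustrative `ask_model` helper; the prompts are assumptions for illustration:

```python
def add_follow_ups(messages: list[dict], num_follow_ups: int = 3) -> list[dict]:
    """Append follow-up user queries and assistant answers to a conversation."""
    for _ in range(num_follow_ups):
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        follow_up = ask_model(
            "Given this conversation, write a natural follow-up user question "
            f"in Danish:\n\n{transcript}"
        )
        messages.append({"role": "user", "content": follow_up})
        answer = ask_model(
            "Answer the final user question in this conversation in Danish:\n\n"
            f"{transcript}\nuser: {follow_up}"
        )
        messages.append({"role": "assistant", "content": answer})
    return messages
```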
This dataset is licensed under the Gemma Terms of Use, which allows use for both commercial and non-commercial purposes, provided that the dataset is not used to cause harm. Any modifications of the dataset as well as models trained on it must also be shared under the same license.
This dataset was created by Sofie Helene Bruun and Dan Saattrup Smart from the Alexandra Institute as part of the Danish Foundation Models project. The project is funded by the Danish Research Reserve as part of the national budget of Denmark for 2025, and consists of the following partners: