An instruction-following dataset generation recipe for Danish.
To install the dependencies and set up the project, you can run the following command:
```bash
make install
```
To then generate the dataset, you can run:
```bash
make dataset
```
To finetune a model on the dataset, you can run:
```bash
uv run src/scripts/finetune_model.py \
    --base-model HUGGINGFACE_MODEL_ID \
    --new-model HUGGINGFACE_MODEL_ID
```
All examples in the dataset are structured as follows:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "(...)"
    },
    {
      "role": "assistant",
      "content": "(...)"
    },
    {
      "role": "user",
      "content": "(...)"
    },
    (...)
    {
      "role": "assistant",
      "content": "(...)"
    }
  ]
}
```
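For reference, a conversation in this format can be loaded and inspected with the `datasets` library. This is a minimal sketch; the dataset ID below is a placeholder, not the published name:

```python
from datasets import load_dataset  # pip install datasets

# Placeholder dataset ID; substitute the actual Hugging Face Hub name.
dataset = load_dataset("your-org/danish-instruction-dataset", split="train")

# Print the first few characters of each turn in the first conversation.
for message in dataset[0]["messages"]:
    print(f"{message['role']}: {message['content'][:80]}")
```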
The dataset was created using several steps, each of which is described in detail in the subsections below. The code base used for the dataset generation can be found here.
We started by manually creating a set of 176 Danish seed prompts and answers, adapted from the English Self-Instruct seed prompts as well as from prompts crowdsourced as part of the EU Horizon project TrustLLM (grant agreement number 101135671). These seed prompts can be found here.
With the seed prompts in hand, we used the Alpaca recipe to generate an initial instruction dataset of 52,000 examples, based on the Gemma-3-27b-pt base decoder model. The generation used the seed prompts from the previous step as few-shot examples, and the generated examples were filtered using the same filters as in the Alpaca recipe, with the additional requirement that the examples had to be in Danish, which was checked using the Lingua language detection package.
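As an illustration, a Danish-language filter along these lines could be implemented with the Lingua Python bindings. This is a minimal sketch; the confidence threshold and the set of candidate languages are assumptions for illustration, not the project's actual settings:

```python
from lingua import Language, LanguageDetectorBuilder  # pip install lingua-language-detector

# Restricting the detector to a few plausible candidate languages makes
# detection faster and more reliable than enabling all supported languages.
detector = LanguageDetectorBuilder.from_languages(
    Language.DANISH, Language.ENGLISH, Language.SWEDISH, Language.BOKMAL
).build()

def is_danish(text: str, threshold: float = 0.8) -> bool:
    """Keep an example only if the detector is reasonably confident it is Danish."""
    confidence = detector.compute_language_confidence(text, Language.DANISH)
    return confidence >= threshold

examples = ["Skriv et digt om efteråret.", "Write a poem about autumn."]
danish_examples = [text for text in examples if is_danish(text)]
```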
The generated dataset was then grammar-corrected using the Gemma-3-27b-it instruction-tuned model. We used the base model in the previous step to avoid introducing instruction bias into the generation, but such bias is not a concern in this correction step, so the instruction-tuned model could be used here.
A number of the generated examples were nonsensical or generally of low quality, so we ran the generated instructions through Gemma-3-27b-it again, this time asking it to rewrite any low-quality instructions to improve them.
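Both of these cleanup passes amount to a chat-style prompt over each example. Below is a minimal sketch of such a pass; the prompt wording, the OpenAI-compatible endpoint, and the `ask_model` helper are illustrative assumptions, not the project's actual setup:

```python
from openai import OpenAI  # pip install openai

# Any OpenAI-compatible server (e.g. vLLM) hosting the model will do;
# the endpoint here is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask_model(prompt: str) -> str:
    """Send a single-turn chat request and return the model's reply."""
    response = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output suits a cleanup pass
    )
    return response.choices[0].message.content.strip()

REWRITE_PROMPT = (
    "Below is an instruction. If it is ungrammatical, nonsensical or of low "
    "quality, rewrite it into a clear, well-formed Danish instruction. "
    "Otherwise, return it unchanged.\n\nInstruction: {instruction}"
)

instructions = ["skriv et digt om efterår uden komma"]
cleaned = [ask_model(REWRITE_PROMPT.format(instruction=inst)) for inst in instructions]
```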
We next used the Evol-Instruct recipe to evolve the dataset for 4 generations, using the Gemma-3-27b-it model. This process makes the examples both more complex and more diverse. All the newly evolved examples were added to the dataset and shuffled together with the previous examples.
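Conceptually, each evolution generation rewrites every instruction with a randomly chosen evolution strategy, in the spirit of Evol-Instruct. A sketch under assumptions: the strategy prompts are paraphrased rather than the project's actual prompts, and `ask_model` is the illustrative helper from the sketch above:

```python
import random

# Paraphrased Evol-Instruct-style strategies; the actual prompts differ.
EVOLUTION_STRATEGIES = [
    "Add one extra constraint or requirement to the following instruction:",
    "Make the following instruction require multi-step reasoning:",
    "Rewrite the following instruction to cover a rarer, more specific topic:",
]

def evolve_dataset(instructions: list[str], generations: int = 4) -> list[str]:
    """Each generation evolves every instruction; old and new examples are kept."""
    evolved = list(instructions)
    current = instructions
    for _ in range(generations):
        current = [
            ask_model(f"{random.choice(EVOLUTION_STRATEGIES)}\n\n{inst}")
            for inst in current
        ]
        evolved.extend(current)
    random.shuffle(evolved)  # shuffle the evolved examples with the previous ones
    return evolved
```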
Finally, we added 3 follow-up queries and answers to each of the examples in the dataset, again using the Gemma-3-27b-it model.
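Extending a conversation with follow-up turns can be sketched as below, again reusing the illustrative `ask_model` helper; the prompts are assumptions for illustration:

```python
def add_follow_ups(messages: list[dict], num_follow_ups: int = 3) -> list[dict]:
    """Append follow-up user queries and assistant answers to a conversation."""
    for _ in range(num_follow_ups):
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
        follow_up = ask_model(
            "Given this conversation, write a natural follow-up user question "
            f"in Danish:\n\n{transcript}"
        )
        messages.append({"role": "user", "content": follow_up})
        answer = ask_model(
            "Answer the final user question in this conversation in Danish:\n\n"
            f"{transcript}\nuser: {follow_up}"
        )
        messages.append({"role": "assistant", "content": answer})
    return messages
```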
This dataset is licensed under the Gemma Terms of Use, which allows use for both commercial and non-commercial purposes, provided that the dataset is not used to cause harm. Any modifications of the dataset as well as models trained on it must also be shared under the same license.
This dataset was created by Sofie Helene Bruun and Dan Saattrup Smart from the Alexandra Institute as part of the Danish Foundation Models project. The project is funded by the Danish Research Reserve as part of the national budget of Denmark for 2025, and consists of the following partners: