Multi-turn conversation Alpaca format for instruction fine-tuning #579
-
Great book so far, @rasbt! I'm currently on my second pass through it and exploring further. The appendix links the ShareGPT dataset. One thing that isn't clear to me is how I would go about translating each of those entries into the Alpaca format. I noticed that the format really only assumes one question and one answer. However, in the entries of this dataset, I see there is a back-and-forth interaction between the human and the assistant. Is there anything different to be done when building out the training/validation/test data when it comes to formatting the input? Do we just build separate Alpaca-formatted messages?
-
Good question. It was a bit too much for the already (too) long Chapter 7, but multi-turn would be an interesting case for a bonus material section one day...

The way ChatGPT and other LLMs handle multi-turn questions is that they simply shove the previous conversation into the prompt.

This is true for the training (SFT) stage as well. E.g., if the turn 1 JSON input is

```json
{
  "instruction": "I'm planning a trip to Italy. Can you suggest three cities I should visit?",
  "input": "",
  "output": "Absolutely! Here are three cities in Italy worth visiting: Rome, Florence, Venice."
}
```

and turn 2 is

```json
{
  "instruction": "Nice. What's the best time of year to visit those places?",
  "input": "",
  "output": "The best times to visit Rome, Florence, and Venice are typically in the spring (April to June) and fall (September to October)."
}
```
For turn 1, the formatted example would be similar to what's in the chapter.
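A minimal sketch of what that looks like, assuming the Alpaca-style template used in Chapter 7 (the helper name `format_alpaca_prompt` is illustrative, not the chapter's exact code):

```python
def format_alpaca_prompt(entry):
    # Alpaca-style prompt: fixed preamble, the instruction, and an optional input field
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


turn_1 = {
    "instruction": "I'm planning a trip to Italy. Can you suggest three cities I should visit?",
    "input": "",
    "output": "Absolutely! Here are three cities in Italy worth visiting: Rome, Florence, Venice.",
}

prompt_1 = format_alpaca_prompt(turn_1)
# During SFT, the target text appended after the prompt would be
# "\n\n### Response:\n" + turn_1["output"]
```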
But for turn 2, you'd fold the previous exchange into the prompt and format it as follows.
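A sketch of one way to do this, reusing `turn_1` and `format_alpaca_prompt` from the snippet above; placing the history inside the instruction text is an assumption for illustration, not necessarily the exact layout shown in the chapter:

```python
turn_2 = {
    "instruction": "Nice. What's the best time of year to visit those places?",
    "input": "",
    "output": (
        "The best times to visit Rome, Florence, and Venice are typically in the "
        "spring (April to June) and fall (September to October)."
    ),
}

# "Shove the previous conversation into the prompt":
# prepend turn 1's question and answer before turn 2's instruction
history = f"{turn_1['instruction']}\n{turn_1['output']}\n\n"

prompt_2 = format_alpaca_prompt(
    {"instruction": history + turn_2["instruction"], "input": turn_2["input"]}
)
# The SFT target for this second training example is then
# "\n\n### Response:\n" + turn_2["output"]
```

So each turn becomes its own Alpaca-formatted training example, with the earlier turns carried along in the prompt.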
I hope this helps!
-
Hi @rasbt, conversations are trained in batches, so what happens if their lengths differ? Are they padded, or is another conversation concatenated to avoid wasting computation on the padding tokens? I think I read in the Llama 3 paper that they concatenate instead of padding (I guess that's for pretraining; do they also do that for SFT?). Also, is padding done on the left or the right? How is it done in production? Thanks.
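To make the question concrete, here is a small illustration of the two padding layouts I mean (the pad token ID 0 is just a placeholder):

```python
batch = [
    [5, 9, 2],        # tokens of a short conversation
    [7, 1, 4, 8, 3],  # tokens of a longer conversation
]
max_len = max(len(seq) for seq in batch)

# Right padding: pad tokens appended after the sequence
right_padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
# Left padding: pad tokens prepended before the sequence
left_padded = [[0] * (max_len - len(seq)) + seq for seq in batch]

print(right_padded)  # [[5, 9, 2, 0, 0], [7, 1, 4, 8, 3]]
print(left_padded)   # [[0, 0, 5, 9, 2], [7, 1, 4, 8, 3]]
```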