
Commit c731467

Apply standardised formatter to practical-ai-298.md
This commit was automatically generated by the formatter GitHub Action, which ran the src/format.js script. Files changed: practicalai/practical-ai-298.md
1 parent 6f83f7d commit c731467
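The diff below only shows the effect of the formatter, not its implementation. A minimal sketch of the timestamp rule it appears to apply, assuming the escaped-bracket markdown form; the regex and the non-zero-hour branch are assumptions, and the actual src/format.js may do more:

```js
function normalizeTimestamps(markdown) {
  // Matches "[HH:MM:SS.cc]", with optional backslash-escaped brackets as in "\[00:07:56.09\]".
  const stamp = /(\\?\[)(\d{2}):(\d{2}):(\d{2})\.\d{2}(\\?\])/g;
  return markdown.replace(stamp, (_match, open, hh, mm, ss, close) =>
    hh === '00'
      ? `${open}${mm}:${ss}${close}`       // "\[00:07:56.09\]" -> "\[07:56\]" (what this diff shows)
      : `${open}${hh}:${mm}:${ss}${close}` // non-zero hours: assumed behaviour, not shown in this diff
  );
}

// Reproduces the "Break" line change further down in this commit.
console.log(normalizeTimestamps('**Break**: \\[00:16:18.22\\]'));
// -> **Break**: \[16:18\]
```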

File tree

1 file changed, +6 -6 lines changed


practicalai/practical-ai-298.md

+6 -6
@@ -18,7 +18,7 @@ Basically, we have sufficient capacity to be kind of competitive with big labs.

So now I think what has changed in the recent years is really the kind of independence that's growing from this kind of initial seeding. I think for many years there weren't a number of like truly French organizations where you could have access to a sufficient number of GPU, large enough clusters so as to develop machine learning model for a number of applications, and that's especially the case with large language models... But there's been a number of events that have kind of led to this diversification of the ecosystem in France. So now I guess there's a number of big startups, there's Kyutai, and I think that's only going to grow.

-\[00:07:56.09\] Also, there's one specificity in France which I think is very nice, especially for deep learning, and it's the fact that we can do a PhD as a resident in a private company. So for instance -- or even like a nonprofit. So at Kyutai we're going to have PhD students. At Facebook, where I partially did my PhD, there was also a number of PhD students, and I think it's such a great opportunity to get to use graphics cards so early during our career, and even as students. And I think that's very specific to France, and that's also part of the success we're seeing at the moment... And that I think can only be growing as we train more and more people in such a way.
+\[07:56\] Also, there's one specificity in France which I think is very nice, especially for deep learning, and it's the fact that we can do a PhD as a resident in a private company. So for instance -- or even like a nonprofit. So at Kyutai we're going to have PhD students. At Facebook, where I partially did my PhD, there was also a number of PhD students, and I think it's such a great opportunity to get to use graphics cards so early during our career, and even as students. And I think that's very specific to France, and that's also part of the success we're seeing at the moment... And that I think can only be growing as we train more and more people in such a way.

**Chris Benson:** I'm curious, as you were describing the ecosystem there in France, and how strong it is, what was the specific dynamic with all these for-profit organizations around you, that brought about the desire to have the nonprofit? And how did you find yourself in the middle of that as you were in the formative stages?

@@ -30,7 +30,7 @@ I got the opportunity -- so I was contacted by \[unintelligible 00:09:47.09\] wh

**Alexandre Défossez:** Yes. So I think the two are quite related. Usually the open science comes really around explaining how you arrived at the final results, and kind of what are the mistakes you made, what are the things you tried, what was important and whatnot... So I would say that's like a first part that we've been doing really well with Moshi. We released like a preprint technical report with a lot of details, that actually took us a bit of time, and that's something that's not necessarily... I don't think if we were not with this kind of nonprofit mindset, we would dedicate as much time, but I think on the long run it's kind of important. And then there are several aspects. The open sourcing can go from just the weights to like full training pipelines...

-\[00:11:57.23\] So releasing more code around the training of such models is also on our roadmap. We didn't get a chance to do it yet because - yeah, the paper already took us a bit of time, and we have other things we're working on. But I think that's also part of it, explaining exactly how you got to the final results, and not just having a set of weights for one specific task, but being kind of stuck with it, if you need to adapt it to something else. That's kind of the, I think, the vision of open science.
+\[11:57\] So releasing more code around the training of such models is also on our roadmap. We didn't get a chance to do it yet because - yeah, the paper already took us a bit of time, and we have other things we're working on. But I think that's also part of it, explaining exactly how you got to the final results, and not just having a set of weights for one specific task, but being kind of stuck with it, if you need to adapt it to something else. That's kind of the, I think, the vision of open science.

**Chris Benson:** Could you talk a little bit about kind of what you're able to do with that model that maybe the commercial labs that you have in the same ecosystem aren't able to do? And maybe also kind of -- is it more standard within other nonprofits around the world, that are doing similar things, or is there something very, very distinctive compared to you, that maybe other nonprofits that you've seen, or maybe even modeled after don't have?

@@ -44,7 +44,7 @@ Some of them might be more around like contribution to science, for instance lik

Then I think we have a strong -- for instance, we have a desire to go more and more towards on-device models. So Moshi is kind of barely on-device. We demoed it on a MacBook Pro, but it was like a top tier MacBook Pro, so it's kind of like proof of concept; it runs on device, not every device... But I think we definitely have a value there, because a number of for-profits are not going to develop really powerful on-device models, because that would be a potential threat to their... Like, it's harder to protect in terms of intellectual property. And I think in general, between the bigger players, there is kind of the race to the very top, very best numbers on the benchmarks, MMLU and everything... And so if it takes 10 times more inference time to beat the other on the benchmark, they are going to do it, because it's either beating the other on the benchmarks, or kind of leaving the arena. So we're not really in this mindset. We're more like -- the on-device, I think could have a very large number of applications. It definitely cannot solve all issues... But I think as a non-profit, we won't have the kind of reservation other for-profit might have for on-device models.

-**Break**: \[00:16:18.22\]
+**Break**: \[16:18\]

**Daniel Whitenack:** So Alex, you've mentioned Moshi a few times now... Maybe if you could just give those that haven't heard of this an idea of, first, what is Moshi? And then maybe if you could then after that step back and describe - well, how did the lab, how did Kyutai start thinking about that sort of model or that sort of research direction as a research direction of the lab?

@@ -58,15 +58,15 @@ So that was back in November. At the time, OpenAI hadn't made any announcements,

**Daniel Whitenack:** That's great. And just one more kind of background question, for those -- some people might have seen, I guess, non-real-time agents... So agents that would take in audio, transcribe that, maybe transcribe that with one model, use a language model to generate an answer, and then use a third model maybe to generate speech. So that's one kind of way to process this pipeline. You're talking about something different here, particularly for these speech to speech models, or the kind of multiplex models that you're talking about. Could you give a little bit of a background? How long have people sort of been studying this, researching this type of model? And has it really only been possible in sort of recent times to make this kind of real-time speech a reality? Because I think some people are -- at least public-wise, they may have seen things like Alexa in the past, that processes speech in certain ways... But these sort of demos, at least, that they're seeing from OpenAI, demos that they're seeing from Kyutai - this is a different type of interaction. So how long has this sort of been possible, and what is the kind of history of research? I know that's a hard question, because there's probably a million things that have been done... But from an overall perspective, how would you view it?
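A rough sketch of the cascaded setup described in the question, with hypothetical stub functions standing in for the three models (none of this is a real API); the point is simply that each stage blocks on the previous one, which is what a real-time speech-to-speech model avoids:

```js
// Hypothetical stubs standing in for three separate models; not a real API.
async function transcribe(audioIn)    { return 'what is the weather like'; }   // speech -> text (ASR)
async function generateReply(text)    { return `You asked: "${text}"`; }       // text -> text (LLM)
async function synthesizeSpeech(text) { return new Float32Array(24000); }      // text -> speech (TTS), ~1 s at 24 kHz

async function cascadedTurn(userAudio) {
  // Each stage waits for the previous one to finish, so latencies add up and
  // the agent can neither listen nor start answering while a stage is running.
  const userText   = await transcribe(userAudio);
  const replyText  = await generateReply(userText);
  const replyAudio = await synthesizeSpeech(replyText);
  return replyAudio;
}

cascadedTurn(new Float32Array(24000)).then((audio) =>
  console.log(`reply: ${audio.length} samples`)
);
```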

-**Alexandre Défossez:** \[00:24:20.16\] So I guess just to put it in perspective - so I'm not necessarily entirely familiar with how Alexa works, but it's more... I mean, anything that's kind of pre-GPT model would be kind of rule-based, or based on automatic speech recognition, which is actually a fairly old field; and even real-time speech recognition has been successful for a while, not necessarily with the amount of success we see with deep learning. I mean, it was already using, some of them, deep learning before... But then it's kind of rule-based. So if you don't formulate your request in quite the right way, it's quickly going to say "I don't know", or just do a Google search.
+**Alexandre Défossez:** \[24:20\] So I guess just to put it in perspective - so I'm not necessarily entirely familiar with how Alexa works, but it's more... I mean, anything that's kind of pre-GPT model would be kind of rule-based, or based on automatic speech recognition, which is actually a fairly old field; and even real-time speech recognition has been successful for a while, not necessarily with the amount of success we see with deep learning. I mean, it was already using, some of them, deep learning before... But then it's kind of rule-based. So if you don't formulate your request in quite the right way, it's quickly going to say "I don't know", or just do a Google search.

Then what brought a change of paradigm was all the GPT models, and ChatGPT in particular, with this ability to perfectly understand human requests, no matter how it is formulated. Then to bring that to the audio domain, what you need is the ability for a kind of language model like a transformer to process the audio streams. Ideally, you would think it's very easy for a GPT model. You have text tokens in, and you predict the next token, and then you just need some special characters to differentiate between the request and the reply, and you want to be able to do something similar with audio... But things are not quite as easy with audio. Audio is not as dense in terms of information. You can think of words as being like really almost information -- from an information theory point of view optimal way of transmitting information, while audio as recorded by a microphone is just a wave that's oscillating like maybe 40,000 times per second, and if you just look at it with your naked eye, it will make no sense. So you need the right representation to be able to feed that into like a transformer model, have the transformer understand it, and be able to produce the output, and that has been quite a challenging task.

If we just talk about audio, the first few successes were, for instance, WaveNet, and on top of WaveNet, there was Jukebox by OpenAI, that I think was the first like "Let's use a transformer language model to try to model audio." But I think I recall from their paper that processing one minute of audio would take eight hours on a top of the line GPU at the time. So obviously, the technology has progressed a lot, and I think some of this progress was especially done by \[unintelligible 00:26:46.08\] for instance - he's another co-founder at Kyutai - at Google, with SoundStream in particular, that provided these kinds of discrete representations at a relatively low sample rate, low frame rate... And then already very quickly, Nel and his team showed that this could be fed into a transformer. At the time they were kind of using a technique where you would still have many more -- like, for one second of audio, you would need to do maybe like a few hundred autoregressive steps, which is very costly. One second with a transformer of like equivalent information would be maybe three autoregressive steps... So that naturally put a constraint on both your context, and the kind of length of the sequence you can generate, and completely ruled out the real-time aspect.
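A back-of-the-envelope version of that sequence-length argument; the frame rate and codebook count below are illustrative assumptions in the range of SoundStream-style codecs, not Moshi's actual settings:

```js
// Rough sequence-length arithmetic for one second of audio.
// All numbers here are illustrative assumptions, not Moshi's or SoundStream's actual settings.
const rawSampleRate = 24000; // waveform samples per second fed to the codec (assumed)
const frameRateHz   = 50;    // codec frames per second (assumed)
const codebooks     = 8;     // parallel tokens per frame from residual quantization (assumed)

const flattenedStepsPerSecond = frameRateHz * codebooks; // flatten every token -> 400 autoregressive steps
const parallelStepsPerSecond  = frameRateHz;             // predict a frame's tokens together -> 50 steps
const textStepsPerSecond      = 3;                       // roughly what a second of speech is worth in text tokens

console.log({ rawSampleRate, flattenedStepsPerSecond, parallelStepsPerSecond, textStepsPerSecond });
// The "few hundred" flattened steps are what ruled out real-time generation;
// predicting the per-frame tokens jointly keeps the temporal sequence short.
```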

Then when I was at Meta, I also worked on a similar topic, especially on how to kind of not do as many autoregressive steps, but try to predict some of the information in parallel, and how to organize it in a way that you would have kind of minimal dependency between the different aspects you need to predict. That maybe I guess is a bit hard to say orally, but basically it's like for each timestamp, instead of having just one token like you would have in text, now you have maybe four, or eight, or 16 tokens... And yeah, you need to make sense of that. You cannot just flatten everything, because that's just not going to work in terms of performance.

-\[00:28:13.23\] And then there was a number of works... I think one we use for Moshi, the RQ transformer, that kind of models the dependency between those tokens for a given timestamp with a smaller transformer. I guess it was a pretty important algorithmic contribution from -- I'm trying to find back who did that, but I don't have it under my eyes... But yeah, so we kind of built -- so both on this expertise, the work that Nel had been doing, the work that I've been doing, and this kind of RQ transformer paper... And that's to solve the aspect of being able to run a big language model, so let's say 7 billion parameters, to take audio as input, and then output audio sufficiently fast for real-time processing.
+\[28:13\] And then there was a number of works... I think one we use for Moshi, the RQ transformer, that kind of models the dependency between those tokens for a given timestamp with a smaller transformer. I guess it was a pretty important algorithmic contribution from -- I'm trying to find back who did that, but I don't have it under my eyes... But yeah, so we kind of built -- so both on this expertise, the work that Nel had been doing, the work that I've been doing, and this kind of RQ transformer paper... And that's to solve the aspect of being able to run a big language model, so let's say 7 billion parameters, to take audio as input, and then output audio sufficiently fast for real-time processing.
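A toy sketch of that RQ-Transformer-style decoding order, with stand-in functions instead of real networks; it only illustrates the loop structure of one temporal step per frame plus a cheap inner pass over that frame's tokens, not Moshi's actual architecture:

```js
// Toy stand-ins for the two networks; real models would produce logits, not random ids.
const VOCAB = 2048;   // codec vocabulary size per codebook (illustrative)
const CODEBOOKS = 8;  // tokens per frame (illustrative)

// Big temporal transformer: summarizes all previously generated frames into a context.
const temporalContext = (framesSoFar) => ({ pastFrames: framesSoFar.length });
// Small depth transformer: predicts the next token of the current frame from that
// context and the tokens of the frame generated so far.
const depthStep = (context, frameSoFar) => Math.floor(Math.random() * VOCAB);

function generate(numFrames = 5) {
  const frames = []; // frames[t] holds the CODEBOOKS token ids for time step t
  for (let t = 0; t < numFrames; t++) {    // one temporal step per frame, not per token
    const context = temporalContext(frames);
    const frame = [];
    for (let k = 0; k < CODEBOOKS; k++) {  // cheap inner loop over the frame's tokens
      frame.push(depthStep(context, frame));
    }
    frames.push(frame);
  }
  return frames;
}

console.log(generate()); // 5 frames x 8 token ids
```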

And yes, then the other aspects - I guess the one where we kind of brought a lot of innovation was the full duplex aspect of kind of having multiple audio streams. So one audio stream for the user, one audio stream for Moshi... And I think that's kind of -- it's not something you would naturally do with text, because you already have one stream, so going to two streams, it's kind of a hassle... But if you think of it for audio, it's like all those kinds of tokens in parallel, they already form like up to 16 streams that we already had to enter, so it was just like "Okay, let's just double the number of streams." Then now we have two of them, that are clearly separated. We do, actually -- the model is trained, for instance, during pre-training to also generate some of the user's reply, even if at that stage of the training there's no real -- it's just kind of a participant in the conversation that's sampled randomly. Then obviously with the model we released, now it only tries to model its own stream... But yeah, so that's kind of like the rough line of work that led to Moshi.
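A toy sketch of the full-duplex token layout described above; the stream names and codebook count are illustrative, not Moshi's actual configuration:

```js
// Toy full-duplex token grid: every time step carries one set of codec tokens per
// speaker, so listening and speaking are two columns of the same autoregressive step.
const CODEBOOKS = 8;               // tokens per speaker per frame (illustrative)
const STREAMS = ['moshi', 'user']; // two clearly separated audio streams

const emptyStep = () =>
  Object.fromEntries(STREAMS.map((s) => [s, new Array(CODEBOOKS).fill(0)])); // 0 = placeholder id

const sequence = Array.from({ length: 3 }, emptyStep); // 3 time steps, 2 x 8 tokens each
console.log(Object.keys(sequence[0]), sequence[0].moshi.length);
// During pre-training both columns are modelled; the released model only generates
// its own "moshi" column and reads the "user" column from the microphone input.
```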

@@ -80,7 +80,7 @@ There's a number of things that we're exploring with this kind of approach. Anyt

And yes, in terms of more as a general community, I'm not aware of anything in particular. I think one thing we want to do though is to release code to allow fine-tuning, maybe with LoRA, and also make it really easy. Obviously, the pipeline is a bit more complex, because you need audio, ideally you need transcripts, you need separation between the agent you want to train and the users... So we want to help with that regard, and try to make it easier to adapt it to a new use case.
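As a purely hypothetical illustration of those requirements (this is not a format Kyutai has published, and every field name below is made up), one fine-tuning example might carry per-speaker audio plus an aligned transcript:

```js
// Purely hypothetical layout for one fine-tuning example; field names are made up
// for illustration and are not a format Kyutai has published.
const example = {
  audio: {
    agent: 'calls/0001.agent.wav', // the voice the model should learn to produce
    user:  'calls/0001.user.wav',  // the other side of the conversation
  },
  transcript: [
    { speaker: 'user',  start: 0.0, end: 2.4, text: 'Hey, can you help me book a table?' },
    { speaker: 'agent', start: 1.9, end: 4.1, text: 'Of course, for how many people?' }, // overlap is fine: full duplex
  ],
};

console.log(JSON.stringify(example, null, 2));
```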

-**Break**: \[00:34:06.14\]
+**Break**: \[34:06\]

**Daniel Whitenack:** So Alex, you touched a little bit on the data side of this, and also kind of hopeful future fine-tuning opportunities... But I'm wondering if you could go into a little bit in particular, because we're able to talk about this sort of thing which sometimes we're not able to talk about, given the nature of the models that we're talking about on the podcast... What was the sort of data situation that you had to put together in terms of the specific training datasets or fine-tuning datasets that you put together and curated for the model that you've publicly released as kind of model builder?
