https://huggingface.co/ai-forever/Kandinsky3.0
Last week, Kandinsky 3.0 came out, without a paper or blog post on HF. I noticed its unusually large size: 11.8B parameters in total, the largest open-source T2I model I've seen.
Yes, a 3B UNet, the largest latent-space UNet I've seen (IF-I-XL has 4.7B, but it works in pixel space). For the text encoder they went even larger than the T5-XXL that IF uses: they use Flan-UL2, which adds up to 8.6B parameters on the text-encoder side.
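If anyone wants to sanity-check that encoder figure, here is a minimal sketch; it assumes the text encoder is the stock google/flan-ul2 checkpoint loadable via transformers' `T5EncoderModel`, which I believe is the case but haven't verified against their weights:

```python
# Sanity-check the text-encoder size claim. Sketch only -- assumes the
# encoder is the stock google/flan-ul2 checkpoint (decoder weights are
# simply discarded by T5EncoderModel).
import torch
from transformers import T5EncoderModel

encoder = T5EncoderModel.from_pretrained("google/flan-ul2", torch_dtype=torch.bfloat16)
n_params = sum(p.numel() for p in encoder.parameters())
print(f"Flan-UL2 encoder: {n_params / 1e9:.1f}B parameters")  # should land around 8.6B
```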
In my testing, the extra-large encoder proved worthwhile. The model can handle much trickier prompts than SDXL, and feels a bit closer to DALL·E 3.
Some examples (rough success rates from tallying by hand; a sketch of the loop I used follows the list):
"A redbrick on a blue ball". >60% success rate, while SDXL rarely got it.
"A black apple in front of a green bag" ~80% success rate, while SDXL only got ~20%?
"A pear cut into seven pieces arranged in a ring" ~20% chance to get amount right
"Rat chasing a cat" Both model hardly get this correct, only dalle3 have small chance.
So I'll leave the pros and cons of adopting this model for everyone to discuss:
Pros:
1: The enhanced text-image alignment makes it far easier to arrange objects in an image and to generate counterfactual images from the prompt alone. That helps close the gap to DALL·E 3 and fits Fooocus's simple-to-use, prompt-only design philosophy.
2: Potentially higher image quality, with a bigger UNet and VAE, according to scaling laws.
3: More permissive license: Apache 2.0, compared to SDXL's OpenRAIL++ or non-commercial terms.
4: Possible support for prompts in more languages.
Cons:
1: Huge model; hard to support GPUs with <8GB VRAM. The 3.0B UNet is already a big pain (it has to be offloaded or loaded in fp8), and the added 8.6B encoder is even worse. It will need dirty work on quantization and offloading (a sketch follows this list). The download size is also annoyingly large (6GB + 20GB).
2: Lacks certain concepts, such as anime, and may carry hidden biases compared to MJ or SDXL, which would need future work to address.
3: We have little to start from; most of the pipeline would have to be coded from scratch. Given that the "shelf life" of SOTA models is often only 3-6 months, I'm not sure how much work the community can get done.
4: Longer compute time than SDXL or SDXL-Turbo. Just loading and running the UL2 encoder takes 2-3 seconds, the model somehow requires ~50 steps to do its job, and there is no LCM support.
5: Still unable to render text correctly the way DALL·E 3 can.
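To make cons 1 and 4 concrete, here is a minimal sketch of the kind of offloading I mean: load the Flan-UL2 encoder in 8-bit, encode the prompt once and cache the embeddings, then free the encoder before the UNet runs. It assumes the stock google/flan-ul2 checkpoint and an installed bitsandbytes; it is not the official pipeline.

```python
# Encode the prompt once with an 8-bit Flan-UL2 encoder, cache the
# embeddings on CPU, then free the encoder so the 3B UNet fits in VRAM.
import gc
import torch
from transformers import AutoTokenizer, BitsAndBytesConfig, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
encoder = T5EncoderModel.from_pretrained(
    "google/flan-ul2",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

with torch.no_grad():
    tokens = tokenizer("A red brick on a blue ball", return_tensors="pt").to(encoder.device)
    prompt_embeds = encoder(**tokens).last_hidden_state.cpu()  # cache for reuse

# Free the 8.6B encoder before loading the UNet.
del encoder
gc.collect()
torch.cuda.empty_cache()
# ...now move the UNet to GPU and run the ~50 denoising steps with prompt_embeds.
```

Caching `prompt_embeds` also means the 2-3 second encoder pass is paid once per prompt, not once per image.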
So, Kandinsky 3.0 does seem to be an interesting model, and I'm putting it here to discuss whether adopting it as an "upgrade" would be a good idea. Such a huge model still comes with issues and would take real effort (there is a bit of an ethical question here: it would be bad to waste developers' effort if things turn out not to be worthwhile). Whatever the decision is, I will be excited to see 🤗