Thanks for making transformers much more approachable! The downside of this may be stupid questions from beginners like me (still, I hope this is not one). In the NLP results, each of the five datasets reached its best accuracy with a different CCT model, and for the Transformer, ViT-Lite, and CVT models accuracy is almost inversely correlated with model size. My "intuition" is that bigger models should be better (LLMs, for example, often give the best results). Maybe the small size of the datasets means larger models can't be trained as well, or maybe the embedding is not optimized for transformers. Could you please offer some insight into this?
The CCT is an encoder architecture. Are there small transformers that demonstrate an encoder/decoder or decoder-only architecture? How would you expect a decoder implementation of CCT to perform on generative tasks?
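
For context, by "decoder implementation" I mean something like keeping the encoder-style self-attention blocks but adding a causal mask so each token only attends to earlier positions. A minimal sketch of what I have in mind is below; the class name, shapes, and hyperparameters are my own illustration and not taken from the repo's code.

```python
# Sketch of a "decoder-style" attention block: standard self-attention
# plus a causal mask that blocks attention to future positions.
import torch
import torch.nn as nn


class CausalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) token embeddings
        seq_len = x.size(1)
        # -inf above the diagonal masks out future positions.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device),
            diagonal=1,
        )
        out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        return out


# Tiny smoke test: 2 sequences of 10 tokens with 128-dim embeddings.
x = torch.randn(2, 10, 128)
block = CausalSelfAttention(dim=128)
print(block(x).shape)  # torch.Size([2, 10, 128])
```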