Update documentation to explicitly describe compatibility/performance with early Pascal cards #55
Description
This was originally a question I wanted to ask, but in the interest of not abusing GitHub Issues, I'm disguising it as a feature request for documentation :)
There are a couple of very inexpensive cards with large VRAM: the Tesla M40 24GB (Maxwell) and the Tesla P40 24GB (Pascal). Neither of these seems to have Tensor cores, which makes them pretty useless for FP16 math, and maybe equally useless for int8/int4; I'm not sure.
What confuses a lot of people who are interested in running LLMs on commodity hardware is that the Tesla P40 is listed as part of the "Pascal" family, and a feature of Pascal is the inclusion of FP16 processing. However, the Tesla P40 specifically lacks fast FP16 support and thus runs FP16 at 1/64th the performance of other Tesla Pascal series cards.
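For anyone trying to sort out which card has which feature, here is a minimal sketch that maps a CUDA compute capability to the features discussed above. The helper name `describe_capability` is my own (hypothetical), and the thresholds reflect my understanding of NVIDIA's published specs rather than anything FlexGen checks: Tensor cores start at 7.0 (Volta), full-rate FP16 exists on 5.3, 6.0, 6.2 and 7.0+, and the int8 `dp4a` instruction needs 6.1 or newer. The M40 reports 5.2 and the P40 reports 6.1. The optional part uses `torch.cuda.get_device_capability` if PyTorch is installed.

```python
def describe_capability(major: int, minor: int) -> dict:
    """Map a CUDA compute capability to coarse feature flags.

    Thresholds are assumptions based on NVIDIA's public specs,
    not on anything FlexGen itself checks.
    """
    cc = (major, minor)
    return {
        # Tensor cores first appeared with Volta (7.0).
        "tensor_cores": cc >= (7, 0),
        # Full-rate FP16: 5.3, 6.0, 6.2 and everything from 7.0 up.
        # 6.1 (e.g. Tesla P40) runs FP16 at a small fraction of FP32 speed.
        "fast_fp16": cc >= (7, 0) or cc in {(5, 3), (6, 0), (6, 2)},
        # The int8 dot-product instruction (dp4a) requires 6.1 or newer.
        "int8_dp4a": cc >= (6, 1),
    }


if __name__ == "__main__":
    try:
        import torch  # optional; only needed to query real devices
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            name = torch.cuda.get_device_name(i)
            print(name, (major, minor), describe_capability(major, minor))
    else:
        # No GPU visible: show what a P40 (6.1) would report.
        print("Tesla P40 (6.1):", describe_capability(6, 1))
```

These flags only say what the hardware can do in principle; whether a given library actually uses `dp4a` or falls back to FP16/FP32 kernels is a separate question.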
Question 1: Do you know if FlexGen will run on a P40 24GB with reasonable performance, given that it is using 8-bit or 4-bit math? Is it comparable to other Pascal cards in terms of performance?
Question 2: Do you know if FlexGen can split a model across multiple Tesla P40 cards? Something I read suggested that splitting the model was not possible using bitsandbytes on older cards, but I'm not clear on the reason.
For context: if it turns out that the Tesla P40, or 2-3 Tesla P40s, can give reasonable performance in the < 1 second/token range for inference on large models, it would open up a new world of possibility to individuals looking to run LLMs at home.