Commit 23f2619
Matmul - Add Support for 2D DRAM interleaved in0 + batched height sharded in1 (#37681)
### Ticket
#37403
### Problem description
For Deepseek MLA, prefill must reuse the same weights as decode.
Optimised decode will require sharded weights, while prefill has assumed
interleaved inputs. To reuse the sharded weights, prefill will need to
run with interleaved activations and sharded weights. For the batched
matmuls, this combination of inputs was not supported.
Based on past experience in Llama, it's expected this will slow down
matmuls by 30%.
### What's changed
This PR adds support for the specific case where in1 is the height-sharded
input of a batched matmul and the sharding cleanly splits in1 along B/num_banks.
The in1 reader is updated to correctly index the data, while in0/compute
remain unchanged. The PR also includes a test to exercise all of the
prefill matmuls with this interleaved + sharded pattern, with the
sequence length left at the minimum (128) due to the test time required
for the larger sequence lengths.
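As a rough illustration of the indexing the in1 reader has to perform, the sketch below maps a (batch, tile) pair to a bank and an offset inside that bank's shard. It is not the kernel code from this PR: the function name, its parameters, and the assumption that each bank holds a contiguous run of B/num_banks batches are all hypothetical.

```python
def in1_shard_location(batch, tile_in_batch, num_batches, num_banks, tiles_per_batch):
    """Map (batch, tile) of a height-sharded batched in1 to (bank, tile offset).

    Hypothetical helper: assumes each bank stores a contiguous block of
    num_batches / num_banks batches, laid out batch after batch.
    """
    if num_batches % num_banks != 0:
        # The supported case requires the batch dimension to split cleanly.
        raise ValueError("num_batches must be divisible by num_banks")
    if not (0 <= batch < num_batches and 0 <= tile_in_batch < tiles_per_batch):
        raise IndexError("batch or tile index out of range")

    batches_per_bank = num_batches // num_banks
    bank = batch // batches_per_bank          # which shard owns this batch
    local_batch = batch % batches_per_bank    # position of the batch inside the shard
    offset = local_batch * tiles_per_batch + tile_in_batch
    return bank, offset


# Illustrative values only: 16 batches over 8 banks, 8 tiles per batch.
print(in1_shard_location(batch=5, tile_in_batch=3,
                         num_batches=16, num_banks=8, tiles_per_batch=8))
```

With these illustrative values, batch 5 lands in bank 2 as the second batch of that shard, so tile 3 sits at offset 11.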
These matmuls were profiled and compared to interleaved+interleaved
matmuls with no program config (i.e. current prefill implementation) at
sequence lengths up to 8k (the larger 32k and 128k sequence lengths are
too unwieldy to profile). The relative slowdown worsens as sequence
length increases. The best case speedup is 0.47x, the worst case
slowdown is 1.16x.
Also note that the wkv_b1 numbers assume 8 DRAM banks.
Adding this support is in progress and should be complete before prefill
perf becomes a focus. However, if wkv_b1 is instead padded for 12 DRAM
banks (as it will be for the very first implementation), it will be
about 1.5x slower than currently shown. This does not impact the other
matmuls.
<img width="995" height="608" alt="image"
src="https://github.com/user-attachments/assets/d41fe9d2-bdc8-4d64-b7d2-898b38c4eb61"
/>
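To make the bank-count note above concrete, here is a small sketch of how padding the batch dimension to a multiple of the bank count can inflate the amount of data. The batch count of 16 is a hypothetical value chosen for illustration; the PR does not state the actual wkv_b1 batch size, nor that padding overhead alone accounts for the 1.5x figure.

```python
def padded_batches(num_batches, num_banks):
    """Round num_batches up to the next multiple of num_banks (hypothetical helper)."""
    if num_batches <= 0 or num_banks <= 0:
        raise ValueError("num_batches and num_banks must be positive")
    # Integer ceiling division, then scale back up to a whole number of banks.
    return -(-num_batches // num_banks) * num_banks


num_batches = 16  # assumed for illustration, not taken from the PR
for num_banks in (8, 12):
    padded = padded_batches(num_batches, num_banks)
    print(f"{num_banks} banks: {padded} batches after padding, "
          f"{padded / num_batches:.2f}x the original")
```

Under this assumed batch count, 8 banks need no padding, while 12 banks pad 16 batches up to 24, which is 1.5 times the original.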
### Checklist
- [x] [all-post-commit-workflows.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/all-post-commit-workflows.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
- [x] [blackhole-post-commit.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/blackhole-post-commit.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
- [ ] [tt-metal-l2-nightly.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/tt-metal-l2-nightly.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
- [ ] New/Existing tests provide coverage for changes
#### Model tests
If your changes cover model-related code, you should run tests
corresponding to affected models and platforms (Single card, T3K,
Galaxy). "Choose your pipeline" workflows facilitate running multiple
kinds of tests in a single run. Each offers `models-mandatory` and
`models-extended` presets.
The former includes a minimal set of tests, to be run always. The latter
extends that with additional ones - use your best judgement in deciding
which is the most appropriate for your PR.
- [ ] [pipeline-select.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
  - [ ] `models-mandatory` preset (runs: [Device perf regressions](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-device-models.yaml) and [Frequent model and ttnn tests](https://github.com/tenstorrent/tt-metal/actions/workflows/fast-dispatch-full-regressions-and-models.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/single-card-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/perf-models.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [pipeline-select-t3k.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-t3k.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
  - [ ] `models-mandatory` preset (runs: [Unit tests](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-unit-tests.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/t3000-model-perf-tests.yaml) tests)
  - [ ] other selection - specify runs
- [ ] [pipeline-select-galaxy.yaml](https://github.com/tenstorrent/tt-metal/actions/workflows/pipeline-select-galaxy.yaml?query=branch:edwinlee/37403/deepseek_prefill_sharded)
  - [ ] `models-mandatory` preset (runs: [Quick tests](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-quick.yaml))
  - [ ] `models-extended` preset (runs: the mandatory tests, plus [Demo](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-demo-tests.yaml) and [Model perf](https://github.com/tenstorrent/tt-metal/actions/workflows/galaxy-perf-tests.yaml) tests)
  - [ ] other selection - specify runs

1 parent be9fb56
File tree: 5 files changed, +475 -74 lines changed
- tests/ttnn/unit_tests/operations/matmul
- ttnn/cpp/ttnn/operations/matmul
- device
- factory
- kernels/dataflow

Lines changed: 275 additions & 0 deletions
