Commit 8a465b6
levanter: parallelize build_caches over components (#5388)
* parallelize `LmDataConfig.build_caches` since sequential GCS
round-trips dominated startup (~40 min for ~100 components in the
Datakit Testbed before the first training step)
* run per-component work in a `ThreadPoolExecutor` with
`max_workers=min(32, len(items))`; work is GCS-metadata-bound (ledger
reads, per-shard `ShardedTreeCache.__init__`), so threads are a good fit [^1]
* refactor the loop body into a `_build_one` helper returning `(name,
cache_or_None)`; pre-filter eligible components (skip zero-weight train,
`DirectDatasetComponent`, raise on unsupported types) before scheduling,
then post-filter `None` results when keying the result dict
* wrap the executor in `rigging.timing.log_time` so total wall time per
`build_caches[<split>]` lands in the logs
* skip and exception semantics unchanged — one bad component still fails
the whole build
* add unit tests in `lib/levanter/tests/test_text.py`
  * `test_build_caches_returns_all_components_in_parallel`: 4-component
    build, asserts the result dict is keyed by name with the right cache
    contents
  * `test_build_caches_propagates_exception_from_one_component`: mixed
    good/bad pair must raise so errors aren't swallowed by `pool.map`
[^1]: cap of 32 avoids hammering GCS on very large component lists.
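
The pattern described above can be sketched roughly as follows. This is not the code from the commit: `build_caches_sketch`, `build_fn`, the `(weight, payload)` layout, and the local `log_time` are hypothetical stand-ins (the real `rigging.timing.log_time` signature may differ), and the eligibility filter is reduced to a plain weight check.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_time(label):
    # Hypothetical stand-in for rigging.timing.log_time; logs total wall time.
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s took %.2fs", label, time.monotonic() - start)


def build_caches_sketch(components, split, build_fn, max_threads=32):
    """components maps name -> (weight, payload) (assumed layout).

    build_fn(name, payload) returns a cache object, or None to skip.
    """
    # Pre-filter before scheduling so skipped components never occupy a worker.
    items = [
        (name, payload)
        for name, (weight, payload) in components.items()
        if weight > 0
    ]
    if not items:
        # ThreadPoolExecutor rejects max_workers=0, so return early.
        return {}

    def _build_one(item):
        name, payload = item
        return name, build_fn(name, payload)

    with log_time(f"build_caches[{split}]"):
        with ThreadPoolExecutor(max_workers=min(max_threads, len(items))) as pool:
            # Consuming pool.map re-raises the first worker exception,
            # so one bad component still fails the whole build.
            results = list(pool.map(_build_one, items))

    # Post-filter None results when keying the result dict by name.
    return {name: cache for name, cache in results if cache is not None}
```

Wrapping `pool.map` in `list(...)` matters here: the iterator only re-raises a worker's exception when its result is retrieved, so leaving it unconsumed would hide the failure.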
---------
Co-authored-by: Rafal Wojdyla <ravwojdyla@gmail.com>

1 parent a4a43fa commit 8a465b6
2 files changed
Lines changed: 85 additions & 28 deletions