<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Functional Programming in Haskell — A Compiler for a VLIW+SIMD Accelerator</title>
<link rel="icon" href="favicon.ico" type="image/x-icon">
<link rel="stylesheet" href="./writeup/fph.css" type="text/css" media="all">
</head>
<body>
<div id="header">
<h1>Functional Programming<br> in Haskell</h1>
<h2>Winter 2026</h2>
</div>
<div id="navigation">
<a href="../../index.html">Table of Contents</a>
</div>
<div id="content">
<h1>A Compiler for a VLIW+SIMD Accelerator</h1>
<h2 id="introduction">Introduction</h2>
<p>For our final project, we wrote a compiler and simulator targeting a VLIW+SIMD accelerator.
The compiler and simulator are based on a performance engineering challenge written by Anthropic, in which the program of interest hashes and traverses a forest data structure. We wrote an optimizing compiler that generates a list of VLIW
<em>bundles</em>, each representing one cycle of parallel work on the target machine. Our simulator
reimplements Anthropic’s reference Python simulator as a monad transformer stack, where each
layer in the stack corresponds to a distinct simulator concern: configuration, mutable state,
tracing, and error handling. Along the way the compiler performs constant folding, dead code elimination,
automatic vectorization, peephole optimization, instruction scheduling,
and VLIW instruction packing. We verified the simulator’s correctness through
property-based testing with Hedgehog, comparing cycle traces against the Python reference
across 1000 randomly generated programs.</p>
<p>Our compiler is designed as a sequence of composable passes:</p>
<pre><code>let fn = scalarKernel rounds batchSize
fn' = combineMulAdd . vectorizeBatch 8
. eliminateDeadCode . canonicalize $ fn
ops = schedule (lowerToMachineOps 8 fn')
bundles = bundleProgram ops</code></pre>
<p>Each of these is a pure function, and the whole pipeline is an expression
whose value is the final program.</p>
<p>This writeup is organized in two parts. Part I covers the compiler: its
intermediate representation, optimization passes, and code generation pipeline.
Part II covers the simulator: its monad transformer architecture, execution
semantics, and the property-based testing infrastructure used to validate it.</p>
<h1>Part I: The Compiler</h1>
<h2 id="target-machine">The Target Machine</h2>
<p>Before discussing the compiler, we should understand the machine it
targets. The accelerator is a VLIW (very long instruction word) processor with
SIMD (single instruction, multiple data) support. Each cycle, it can issue
instructions to five independent <em>execution engines</em> in parallel:</p>
<table>
<tr><th>Engine</th><th>Max Slots/Cycle</th><th>Operations</th></tr>
<tr><td><strong>ALU</strong></td><td style="text-align:center">12</td><td>Scalar arithmetic: <code>+</code>, <code>-</code>, <code>*</code>, <code>//</code>, <code>^</code>, <code>&amp;</code>, <code>|</code>, <code>&lt;&lt;</code>, <code>&gt;&gt;</code>, <code>%</code>, <code>&lt;</code>, <code>==</code></td></tr>
<tr><td><strong>VALU</strong></td><td style="text-align:center">6</td><td>8-wide SIMD versions of the above, plus <code>vbroadcast</code> and <code>multiply_add</code></td></tr>
<tr><td><strong>Load</strong></td><td style="text-align:center">2</td><td><code>load</code>, <code>load_offset</code>, <code>vload</code> (8 consecutive words), <code>const</code> (immediate)</td></tr>
<tr><td><strong>Store</strong></td><td style="text-align:center">2</td><td><code>store</code>, <code>vstore</code> (8 consecutive words)</td></tr>
<tr><td><strong>Flow</strong></td><td style="text-align:center">1</td><td><code>select</code>, <code>vselect</code>, <code>halt</code>, <code>pause</code>, jumps</td></tr>
</table>
<p>A single cycle can therefore issue up to 12+6+2+2+1 = 23 operations. The
register file is a flat <em>scratch-pad</em> addressed by integer index
(<code>ScratchAddr</code>). SIMD operations work on blocks of 8 consecutive scratch
addresses, and memory is flat and word-addressable. Again, each bundle is a single cycle,
so all reads see the state from the <em>beginning</em> of the cycle, and all
writes commit atomically at the end.</p>
<p>The compiler’s job is twofold: (a) <em>vectorize</em> scalar code to use 8-wide SIMD
where legal, and (b) <em>pack</em> independent operations into each bundle to exploit
the parallelism of the VLIW execution model.</p>
<p>The compiler constructs typed ISA values — actual <code>AluSlot</code>, <code>ValuSlot</code>, <code>LoadSlot</code> values — and the simulator pattern-matches on the same constructors to execute them. These types guarantee that the compiler can only emit instructions the simulator knows how to execute.</p>
<h2 id="lifecycle">The Lifecycle of a Program</h2>
<p>A program passes through nine stages, each a pure function that transforms the
previous stage’s output:</p>
<pre><code>
scalarKernel -- build the IR (BaselineKernel.hs)
|
canonicalize -- constant fold (Canonicalize.hs)
|
eliminateDeadCode -- dead-code-elim (DCE.hs)
|
vectorizeBatch -- auto-vectorize loops (Vectorize.hs)
|
combineMulAdd -- operator fusion (Peephole.hs)
|
lowerToMachineOps -- lower to machine ops (FinalLowering.hs)
|
schedule -- reorder for packing (Schedule.hs)
|
bundleProgram -- VLIW bundle packing (Bundle.hs)
|
renderProgramPayload -- emit Python literal (PythonInterop.hs)</code></pre>
<p>The first five stages operate on <code>Function</code>, the compiler’s intermediate
representation. <code>lowerToMachineOps</code> converts to <code>[MachineOp]</code>, a flat
list of individual machine instructions with metadata.
<code>schedule</code> reorders those ops for better VLIW packing.
<code>bundleProgram</code> packs them into <code>[ISA.Bundle ()]</code>, the final VLIW bundles.
<code>renderProgramPayload</code> serializes the bundles as a Python dict literal for testing with Anthropic’s provided Python simulator.</p>
<p>Let us now examine each stage in detail.</p>
<h2 id="the-ir">The Intermediate Representation (<code>FinalIR.hs</code>)</h2>
<p>The IR is a structured-SSA language with three key design properties:</p>
<p><strong>1. SSA values.</strong> Every <code>Let</code> introduces a fresh <code>Id</code> that is defined exactly
once and used by later statements through <code>Expr = Var Id | Const Word32</code>.
This discipline is known as <em>static single assignment</em>: each name is assigned exactly
once, and that definition dominates all uses. This makes dataflow analysis
trivial — to find where a value comes from, you look up its (unique)
definition.</p>
<p><strong>2. Structured control flow.</strong> The only looping construct is <code>For</code>, which
carries its bounds and body inline. We do not allow arbitrary gotos.</p>
<p><strong>3. Named memory regions.</strong> Loads and stores address a <code>Buffer</code> (a symbolic
memory region) at an <code>Expr</code> offset, instead of pointer arithmetic. This is
what makes the IR easily analyzable.</p>
<h3 id="types">Types</h3>
<pre><code>data Ty = I32 | Vec !Int !Ty</code></pre>
<p><code>I32</code> is a scalar 32-bit integer. <code>Vec w I32</code> is a vector of <code>w</code> scalars,
corresponding to <code>w</code> consecutive scratch-pad slots at the machine level. The
kernel starts entirely scalar; the vectorizer introduces <code>Vec</code> types.</p>
<p>Pointers are represented as <code>I32</code> values. Boolean conditions are <code>I32</code> values (0 = false, nonzero = true). This simplicity is a deliberate choice, and we will later examine the alternative (a richer type
system with pointer types and memory tokens).</p>
<h3 id="buffers">Buffers: Symbolic Memory</h3>
<pre><code>data Buffer = Header | Forest | InpIdx | InpVal</code></pre>
<p>These four constructors correspond 1:1 to the memory regions the test harness
allocates:</p>
<table>
<tr><th>Buffer</th><th>Harness Region</th><th>Contents</th></tr>
<tr><td><code>Header</code></td><td>header words</td><td>rounds, n_nodes, batch_size, forest_height, 3 base pointers</td></tr>
<tr><td><code>Forest</code></td><td>forest_values</td><td>Tree node array</td></tr>
<tr><td><code>InpIdx</code></td><td>inp_indices</td><td>Current index for each batch element</td></tr>
<tr><td><code>InpVal</code></td><td>inp_values</td><td>Current value for each batch element</td></tr>
</table>
<p>Using symbolic buffer names instead of raw pointers is the single most
important design decision in the IR. It is what makes the vectorizer
<em>possible</em> without a general-purpose alias analysis. The vectorizer can ascertain
“Does this load use the induction variable as an index into
<code>InpIdx</code>?” by pattern-matching on the buffer name. We avoid trying to solve the
(occasionally undecidable) problem of whether two arbitrary pointer expressions might alias.</p>
<h3 id="load-store-kinds">Load and Store Kinds</h3>
<pre><code>data LoadKind = LoadScalar | LoadContigVec !Int | LoadGatherVec !Int
data StoreKind = StoreScalar | StoreContigVec !Int</code></pre>
<p>The <code>Kind</code> tags record <em>how wide</em> a memory access is. The kernel emits only
<code>LoadScalar</code>/<code>StoreScalar</code>. The vectorizer rewrites eligible accesses to
<code>LoadContigVec w</code> (contiguous vector load of <code>w</code> words),
<code>LoadGatherVec w</code> (indexed gather of <code>w</code> words at data-dependent offsets),
or <code>StoreContigVec w</code>. The lowering then selects the correct machine
instruction based on this tag: <code>load</code> vs. <code>vload</code> vs. <code>load_offset</code>,
<code>store</code> vs. <code>vstore</code>.</p>
<p>This is an instance of a useful compiler design pattern: the analysis pass
(Vectorize) communicates its decisions to the code-generation pass (Lowering)
<em>declaratively</em>, by rewriting the IR. This borrows heavily from the philosophy and design of MLIR.</p>
<h3 id="rhs">Right-Hand Sides</h3>
<p>Every value-producing computation is one <code>Rhs</code> variant:</p>
<table>
<tr><th>Constructor</th><th>Meaning</th><th>Introduced by</th></tr>
<tr><td><code>RConst w</code></td><td>32-bit literal</td><td>Kernel</td></tr>
<tr><td><code>RBin op a b</code></td><td>Binary ALU operation</td><td>Kernel</td></tr>
<tr><td><code>RMulAdd a b c</code></td><td>Fused multiply-add: <code>a*b + c</code></td><td>Peephole pass</td></tr>
<tr><td><code>RSelect c a b</code></td><td>Conditional select: <code>c ≠ 0 ? a : b</code></td><td>Kernel</td></tr>
<tr><td><code>RLoad kind buf ix</code></td><td>Load from <code>buf[ix]</code></td><td>Kernel; kind rewritten by Vectorize</td></tr>
<tr><td><code>RReduce op e</code></td><td>Horizontal vector-to-scalar reduction</td><td>Kernel builders (reduction patterns)</td></tr>
<tr><td><code>RVBroadcast w e</code></td><td>Broadcast scalar to vector</td><td>(Available for future use)</td></tr>
</table>
<p><code>RSelect</code> replaces traditional if-conversion. Instead of <code>if c then a else b</code>
as a control-flow construct, the kernel uses <code>RSelect</code> directly, which lowers
to the simulator’s <code>select</code> (scalar) or <code>vselect</code> (vector) flow slot.</p>
<p><code>RMulAdd</code> does not appear in the initial kernel. It is introduced by the
peephole pass when it discovers a <code>Mul</code> followed by an <code>Add</code>. The lowering maps
it to a single <code>multiply_add</code> VALU slot for vectors, or two ALU slots (mul
then add) for scalars.</p>
<p><code>RReduce op e</code> represents a horizontal reduction of a vector value <code>e</code> to a
scalar using an associative operation <code>op</code>. It is introduced for reduction patterns and lowered to a log₂(w) tree of scalar ALU ops — for example, an 8-wide <code>RReduce Add v</code> becomes three stages of pairwise additions.</p>
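<p>The shape of that reduction tree can be sketched as a pure function over a list of values (a hypothetical helper for illustration, not the actual lowering code):</p>
<pre><code>-- Reduce a list with an associative op by repeatedly combining
-- adjacent pairs; each level halves the number of live values.
reduceTree :: (a -> a -> a) -> [a] -> a
reduceTree _  [x] = x
reduceTree op xs  = reduceTree op (pairwise xs)
  where
    pairwise (a:b:rest) = op a b : pairwise rest
    pairwise rest       = rest</code></pre>
<p>For <code>w = 8</code> this takes three levels (4 + 2 + 1 = 7 pairwise ops), which is exactly the log₂(w) structure the lowering emits.</p>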
<h3 id="stmts">Statements</h3>
<pre><code>data Stmt
= Let !Id !Ty !Rhs
| Store !StoreKind !Buffer !Expr !Expr
| For !Id !Int !Int !Int ![(Id, Expr)] ![Stmt]
-- above is iv, start, end, step, carries, body</code></pre>
<p><code>Let</code> introduces a new SSA value. <code>Store</code> is a side effect.
<code>For</code> is a counted loop with compile-time integer bounds, a step, a list of
carries, and an inline body. The carries field holds loop-carried accumulators
— <code>(Id, Expr)</code> pairs where each <code>Id</code> is initialized from its <code>Expr</code>
before the first iteration and updated by the loop body. An empty carries list
(the common case) means the loop has no cross-iteration state. The body
is an inline list of statements, which makes the IR a tree.
Statement order determines memory ordering. There are no explicit memory tokens
or effect edges. This is sound because all loops are fully unrolled during
lowering, and within a single unrolled iteration the statement sequence
<em>is</em> the execution sequence.</p>
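<p>The full-unrolling step that makes this sound can be sketched as a one-liner (a simplified stand-in for the real lowering):</p>
<pre><code>-- Replace a counted loop with one copy of the body per iteration,
-- substituting a concrete value for the induction variable each time.
unrollFor :: Int -> Int -> Int -> (Int -> [stmt]) -> [stmt]
unrollFor start end step body =
  concatMap body [start, start + step .. end - 1]</code></pre>
<p>After unrolling, the flat statement list really is the execution order, so no separate effect edges are needed.</p>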
<h3 id="builder">The Builder Monad</h3>
<pre><code>newtype Build a = Build { unBuild :: State BuildSt a }
emitLet :: Ty -> Rhs -> Build Expr
emitStore :: StoreKind -> Buffer -> Expr -> Expr -> Build ()
emitFor :: Int -> Int -> Int -> (Expr -> Build ()) -> Build ()</code></pre>
<p><code>Build</code> is a thin wrapper around <code>State</code> that provides a convenient
monadic API for constructing IR. <code>emitLet</code> allocates a fresh <code>Id</code>, appends a
<code>Let</code> statement, and returns the new <code>Var</code> as an <code>Expr</code>.</p>
<p>The interesting piece is <code>emitFor</code>: it takes a callback that receives the
induction variable as an <code>Expr</code> and builds the loop body. Internally,
<code>capture</code> temporarily redirects the statement buffer so that body statements
don’t escape into the outer scope. The body is captured, the outer buffer is restored, and the captured
body is wrapped in a <code>For</code> node. This is quite nice to express with
first-class functions and closures: the body-building callback
closes over the induction variable and the enclosing environment, and the
builder captures the side effects (emitted statements) without any of them
leaking.</p>
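<p>A minimal sketch of the <code>capture</code> idea, using a bare <code>State</code> over a statement buffer (the real <code>BuildSt</code> carries more than just the buffer):</p>
<pre><code>import Control.Monad.State

type Build s = State [s]   -- the buffer of emitted statements

emit :: s -> Build s ()
emit st = modify (++ [st])

capture :: Build s () -> Build s [s]
capture body = do
  outer <- get     -- save the enclosing buffer
  put []           -- the body emits into a fresh, empty buffer
  body
  inner <- get     -- everything the body emitted
  put outer        -- restore: body statements don't leak outward
  pure inner</code></pre>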
<h3 id="why-this-ir">Why This IR Design?</h3>
<p>Each design choice has a concrete consequence for the passes that follow:</p>
<table>
<tr><th>Decision</th><th>Consequence</th></tr>
<tr><td>Named <code>Buffer</code> instead of raw pointers</td><td>Vectorizer uses pattern matching, not alias analysis</td></tr>
<tr><td><code>LoadKind</code>/<code>StoreKind</code> tags</td><td>Vectorizer communicates access pattern to lowering declaratively</td></tr>
<tr><td><code>For</code> with inline body</td><td>No CFG, no phi nodes; loop transformations are tree rewrites</td></tr>
<tr><td><code>Ty</code> on every <code>Let</code></td><td>Lowering dispatches scalar vs. vector without re-inferring types</td></tr>
<tr><td><code>AluOp</code> from shared <code>ISA.hs</code></td><td>Compiler and simulator agree on operations by construction</td></tr>
<tr><td>No memory tokens</td><td>Simpler IR; valid because loops are fully unrolled</td></tr>
<tr><td><code>Expr = Var | Const</code></td><td>Constants appear directly in operands; enables simple folding</td></tr>
</table>
<h2 id="kernel">The Kernel (<code>BaselineKernel.hs</code>)</h2>
<p><code>scalarKernel :: Int -> Int -> Function</code> builds the baseline kernel using the
<code>Build</code> monad. The two parameters are <code>rounds</code> (outer loop count) and
<code>batchSize</code> (inner loop count). The result is a purely scalar <code>Function</code> with
all types <code>I32</code>, all loads/stores scalar, and all loop steps equal to 1.</p>
<p>The kernel’s structure is:</p>
<p><strong>1. Header loads</strong> (7 scalar loads from <code>Header[0..6]</code>): metadata including
<code>nNodes</code> (the tree size, used for wrap-around) and base pointers for each
memory region.</p>
<p><strong>2. Outer loop</strong> over rounds.</p>
<p><strong>3. Inner loop</strong> over the batch: for each element <code>i</code>:</p>
<ul>
<li>Load <code>idx</code> from <code>InpIdx[i]</code> and <code>val</code> from <code>InpVal[i]</code>.</li>
<li>Load the tree node from <code>Forest[idx]</code>.</li>
<li>XOR <code>val</code> with the node and hash the result through 6 stages.</li>
<li>Compute the next index: <code>idx*2 + (1 if hash%2==0 else 2)</code>.</li>
<li>Wrap: if <code>nextIdx ≥ nNodes</code> then <code>nextIdx = 0</code>.</li>
<li>Store the new index and hashed value back.</li>
</ul>
<p><strong>4. Hash function</strong> (<code>emitHash</code>): The 6-stage integer hash uses a table of
<code>(op1, const1, op2, op3, const3)</code> tuples. Each stage computes
<code>a = op2(op1(a, const1), op3(a, const3))</code>, just like the Anthropic Python harness.
Conveniently, we can write the entire hash pipeline as a <code>foldl</code> over the stage table, with each step emitting IR via the builder monad.</p>
<pre><code>emitHash :: Expr -> Build Expr
emitHash a0 = foldl go (pure a0) hashStages
where
go ma (op1, c1, op2, op3, c3) = do
a <- ma
t1 <- emitLet I32 (RBin op1 a (Const c1))
t2 <- emitLet I32 (RBin op3 a (Const c3))
emitLet I32 (RBin op2 t1 t2)</code></pre>
<p>The accumulator here is a monadic computation (<code>Build Expr</code>). Each step sequences the monadic
action from the previous stage with two new let-bindings and their combination.
The <code><-</code> extracts the value from the accumulated monadic computation, and
<code>emitLet</code> adds to it.</p>
<h2 id="canonicalize">Pass 1: Canonicalize (<code>Canonicalize.hs</code>)</h2>
<p>The first pass performs constant propagation and constant folding in a single
forward traversal; we just walk the statements in order, maintaining
an environment <code>Map Id Word32</code> of SSA values known to be constant. For
each <code>Let</code>:</p>
<ol>
<li><code>simpRhs</code> substitutes known-constant <code>Var</code>s with <code>Const</code> in the RHS operands.</li>
<li><code>rhsConst</code> checks whether the simplified RHS is fully constant (e.g.,
<code>RBin Add (Const 3) (Const 5)</code> folds to <code>RConst 8</code>).</li>
<li>If so, the <code>Id</code> is added to the constant environment for use by later statements.</li>
</ol>
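<p>The core of steps 1 and 2 can be sketched on a toy two-operation ALU (the real <code>evalAluOp</code> covers every <code>AluOp</code> variant; the names here are illustrative):</p>
<pre><code>import qualified Data.Map.Strict as M
import Data.Bits (xor)
import Data.Word (Word32)

data Op = OAdd | OXor
data E  = Var Int | Const Word32 deriving (Eq, Show)

evalOp :: Op -> Word32 -> Word32 -> Word32
evalOp OAdd = (+)   -- Word32 arithmetic is modular by definition
evalOp OXor = xor

-- Substitute known constants into operands, then fold if fully constant.
foldBin :: M.Map Int Word32 -> Op -> E -> E -> Either (E, E) Word32
foldBin env op a b =
  case (subst a, subst b) of
    (Const x, Const y) -> Right (evalOp op x y)
    (a', b')           -> Left (a', b')
  where
    subst (Var i) | Just c <- M.lookup i env = Const c
    subst e                                  = e</code></pre>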
<p><code>evalAluOp</code> implements all 13 <code>AluOp</code> variants in 32-bit modular arithmetic,
matching the simulator’s semantics. This function is defined once and is
correct by construction: it handles every constructor of <code>AluOp</code>, and a
non-exhaustive pattern warning would fire if a new operation were added to
<code>ISA.hs</code> without updating the evaluator.</p>
<p>For <code>For</code> loops, the induction variable is <em>deleted</em> from the constant map when
entering the body, since it takes different values on each iteration. Constants
defined outside the loop remain visible inside.</p>
<p>For <code>RSelect</code>, if both branches are the same expression, the condition is
replaced with <code>Const 1</code> — a signal to later passes that the select is
trivially dead.</p>
<h2 id="dce">Pass 2: Dead Code Elimination (<code>DCE.hs</code>)</h2>
<p>After canonicalization propagates constants forward, some <code>Let</code> bindings become
unused since their values were folded into downstream expressions and the
original names are never referenced again. The DCE pass removes these dead
bindings in two phases: a backward liveness analysis followed by a forward
filter. <code>computeUsed</code> walks the statement list via <code>foldr</code>: for
<code>Let x _ rhs</code>, it adds <code>rhsIds rhs</code> (the set of IDs referenced by the
right-hand side) to the used set only if <code>x</code> is already in that set.
<code>Store</code> and <code>For</code> statements unconditionally contribute their referenced IDs,
since their effects are always observable. Then <code>filterDead</code> makes a forward
pass that drops any <code>Let</code> binding whose ID is not in the used set.</p>
<p>This is the textbook backward dataflow analysis, and the functional
implementation is notably clean: the analysis is a <code>foldr</code> and the
rewrite is a <code>concatMap</code>. It is the "dual" to Canonicalize —
one propagates information forward (substituting constants into uses), the
other propagates information backward (determining which definitions are
needed).</p>
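<p>A condensed sketch of the two phases over a toy statement type (the real pass also recurses into <code>For</code> bodies):</p>
<pre><code>import qualified Data.Set as S

data Stmt = Let Int [Int]   -- defined id, ids read by the RHS
          | Store [Int]     -- always live: contributes its reads

-- Backward pass: a Let's reads count only if its definition is used.
computeUsed :: [Stmt] -> S.Set Int
computeUsed = foldr step S.empty
  where
    step (Store rs) used = S.union used (S.fromList rs)
    step (Let x rs) used
      | x `S.member` used = S.union used (S.fromList rs)
      | otherwise         = used

-- Forward pass: drop any Let whose id was never needed.
filterDead :: S.Set Int -> [Stmt] -> [Stmt]
filterDead used = filter keep
  where
    keep (Let x _) = x `S.member` used
    keep _         = True</code></pre>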
<h2 id="vectorize">Pass 3: Vectorize (<code>Vectorize.hs</code>)</h2>
<p>This is the core optimization. It rewrites eligible inner loops from scalar
(step 1) to SIMD (step <code>w</code>, typically 8), turning scalar loads/stores into
vector loads/stores and promoting arithmetic to vector width.</p>
<h3 id="vectorize-legality">Legality</h3>
<p>The vectorizer must first determine whether it is <em>legal</em> to vectorize a loop.
The answer is yes only if the induction variable <code>iv</code> is used <em>exclusively</em> as
a direct index into <code>InpIdx</code> or <code>InpVal</code>. Concretely, <code>iv</code> may appear in:</p>
<ul>
<li><code>RLoad LoadScalar InpIdx (Var iv)</code> — loading from inp_indices at the batch index</li>
<li><code>RLoad LoadScalar InpVal (Var iv)</code> — loading from inp_values at the batch index</li>
<li><code>Store StoreScalar InpIdx (Var iv) _</code> — storing to inp_indices</li>
<li><code>Store StoreScalar InpVal (Var iv) _</code> — storing to inp_values</li>
</ul>
<p>And it must <em>not</em> appear in arithmetic operands, non-InpIdx/InpVal loads, or
nested loops.</p>
<p>This is a rather conservative vectorizer: it vectorizes loops where each
iteration is independent and the induction variable maps to a contiguous
memory range. The required check is just a pattern-match on
<code>Buffer</code> and <code>Expr</code>! We get to avoid general dependence analysis thanks to
named <code>Buffer</code> regions. Without them, the
vectorizer would need to prove that two pointer expressions don’t alias,
which is undecidable in general.</p>
<h3 id="strip-mining">Strip-Mining and Tail Loops</h3>
<p>When the trip count is not divisible by the SIMD width <code>w</code>, the vectorizer
performs <em>strip-mining</em>:</p>
<pre><code>vecEnd = start + (tripCount / w) * w
For iv start vecEnd w [vectorized body] -- main loop
For iv vecEnd end 1 [scalar body] -- tail loop</code></pre>
<p>If <code>vecEnd ≤ start</code> (trip count &lt; <code>w</code>), no vectorization occurs. If
<code>vecEnd == end</code> (evenly divisible), no tail loop is emitted. For the default
<code>batchSize=256</code> with <code>w=8</code>, the trip count is exactly divisible, so no tail
is needed. For <code>batchSize=10</code>, the main loop processes elements 0–7 and
the tail processes 8–9.</p>
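<p>The split arithmetic is simple enough to state directly (a sketch, not the pass itself):</p>
<pre><code>-- Split the iteration space [start, end) into a vector part that
-- steps by w and a scalar tail covering the remainder.
stripMine :: Int -> Int -> Int -> ((Int, Int), (Int, Int))
stripMine start end w =
  let vecEnd = start + ((end - start) `div` w) * w
  in  ((start, vecEnd), (vecEnd, end))</code></pre>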
<h3 id="rewriting">Rewriting</h3>
<p>Within the vectorized loop body, the vectorizer rewrites each statement:</p>
<p><strong>Loads.</strong> <code>RLoad LoadScalar InpIdx (Var iv)</code> becomes
<code>RLoad (LoadContigVec w) InpIdx (Var iv)</code>. The load was reading one element at
the batch index; now it reads <code>w</code> contiguous elements. The forest load
<code>RLoad LoadScalar Forest idx</code> becomes <code>RLoad (LoadGatherVec w) Forest idx</code>
when its index has been promoted to a vector, because the forest indices are
data-dependent and therefore not contiguous.</p>
<p><strong>Stores.</strong> <code>Store StoreScalar InpIdx (Var iv) v</code> becomes
<code>Store (StoreContigVec w) InpIdx (Var iv) v</code>.</p>
<p><strong>Arithmetic.</strong> <code>RBin</code>, <code>RSelect</code>, <code>RMulAdd</code> are structurally unchanged. The
<em>type</em> is promoted by <code>inferTy</code>: if any operand is a <code>Vec</code>, the result is
<code>Vec w I32</code>. Otherwise <code>I32</code>.</p>
<p>Once <code>InpIdx</code>/<code>InpVal</code>
loads become vectors, every downstream computation (XOR, all six hash
stages, mod, select, multiply) gets promoted too. The forest load
becomes a gather because its index (which came from the now-vectorized
<code>InpIdx</code> load) is a vector.</p>
<h3 id="vectorize-limits">What the Vectorizer Does Not Do</h3>
<p>The vectorizer does not handle reductions (loop-carried dependencies). It does
not vectorize loops with nested loops. It does not handle strided access
patterns. It does not reorder statements. These are deliberate simplifications, made largely to preserve our sanity.
The baseline kernel’s inner loop is a pure map, so none of these features are needed here, though it would be fun to spend more time designing a more elegant and powerful vectorizer.</p>
<h2 id="peephole">Pass 4: Peephole (<code>Peephole.hs</code>)</h2>
<p>The peephole pass scans the statement list for adjacent pairs:</p>
<pre><code>Let t (Vec w I32) (RBin Mul a b)
Let y (Vec w I32) (RBin Add (Var t) c) -- or (RBin Add c (Var t))</code></pre>
<p>where <code>t</code> is not used anywhere after <code>y</code>. It replaces both with:</p>
<pre><code>Let y (Vec w I32) (RMulAdd a b c)</code></pre>
<p>This is vector-only because the simulator has a <code>multiply_add</code> VALU instruction
but no scalar fused-multiply-add. The vector version saves a VALU slot and a
scratch register.</p>
<p>The dead-use check (<code>usesId</code>) traverses all subsequent statements, including
nested loop bodies, to verify <code>t</code> doesn’t appear. This is conservative but
sufficient for our kernel of interest.</p>
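<p>The pair rewrite itself, sketched on a toy binding list (the real pass also runs <code>usesId</code> to confirm <code>t</code> is dead, and handles the symmetric <code>Add c (Var t)</code> operand order):</p>
<pre><code>data Rhs = Bin String Int Int   -- op name, two operand ids
         | MulAdd Int Int Int   -- fused a*b + c
         deriving (Eq, Show)

fuse :: [(Int, Rhs)] -> [(Int, Rhs)]
fuse ((t, Bin "mul" a b) : (y, Bin "add" t' c) : rest)
  | t == t' = (y, MulAdd a b c) : fuse rest
fuse (s : rest) = s : fuse rest
fuse []         = []</code></pre>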
<h2 id="machine">The Machine Model (<code>Machine.hs</code>)</h2>
<p>After the IR passes, the program is lowered to a sequence of
<code>MachineOp</code> values. These bridge the gap between the IR (which thinks in
buffers and types) and the ISA (which thinks in scratch addresses and
instruction slots).</p>
<pre><code>data MachineSlot
= MSlotAlu !AluSlot
| MSlotValu !ValuSlot
| MSlotLoad !LoadSlot
| MSlotStore !StoreSlot
| MSlotFlow !FlowSlot
data MachineOp = MachineOp
{ moSlot :: !MachineSlot
, moReads :: ![Int] -- scratch addresses read
, moWrites :: ![Int] -- scratch addresses written
, moMem :: ![MemAccess] -- memory regions accessed
}</code></pre>
<p>Each <code>MachineSlot</code> wraps one of the typed slot ADTs from <code>ISA.hs</code>. For
example, <code>MSlotAlu (Alu Add (ScratchAddr 10) (ScratchAddr 5) (ScratchAddr 6))</code>
represents a scalar add writing scratch 10. The slot carries full type
information — not strings — so the bundler extracts typed slot lists
directly into the final <code>ISA.Bundle</code>.</p>
<p>The metadata fields (<code>moReads</code>, <code>moWrites</code>, <code>moMem</code>) are what the bundler and
scheduler use to determine which operations can safely share a cycle or be
reordered. For a scalar op, <code>moWrites</code> is a single address; for a vector op,
it’s <code>w</code> consecutive addresses. <code>MemAccess</code> records describe the memory
region, direction (load/store), and optional address range for each memory
operation. An <code>Unknown</code> range conservatively prevents packing with any other
access to the same region.</p>
<h2 id="lowering">Lowering (<code>FinalLowering.hs</code>)</h2>
<p>The lowering pass is the largest module because it handles
scratch-pad allocation, address computation, constant materialization, loop
unrolling, and the scalar/vector dispatch for every <code>Rhs</code> variant.</p>
<h3 id="scratch-layout">Scratch-Pad Layout</h3>
<p>The lowering allocates scratch-pad addresses in a single up-front pass.
Loop induction variables don’t get persistent slots — they are materialized
as constants on each unrolled iteration. Header-load IDs are already mapped to
header block slots 0–6.</p>
<h3 id="header-prelude">Header Prelude</h3>
<p>The lowering emits a single <code>vload</code> that reads 8 words from memory
address 0 into scratch slots <code>[0..7]</code>. This loads the entire header in one
cycle. The IR’s seven <code>RLoad LoadScalar Header</code> statements are then no-ops
during lowering — the value for each is already in its scratch slot.</p>
<h3 id="constants">Constants and Broadcasts</h3>
<p><code>getConstS c</code> materializes a scalar constant once and caches the scratch
address. <code>getConstV c</code> calls <code>getConstS</code> then broadcasts. <code>getBroadcast s</code>
broadcasts an existing scalar scratch to a vector. All three check their cache
first and emit nothing on a hit. When a vector operation has a scalar operand
(e.g., a <code>Const</code> in the IR), the lowering calls <code>exprVector</code> which
automatically broadcasts it.</p>
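<p>The caching pattern can be sketched as a small <code>State</code> computation (a simplified model; the real <code>getConstS</code> also emits the const-load op and uses the project’s own lowering state):</p>

```haskell
import Control.Monad.State
import qualified Data.Map.Strict as Map

-- Minimal sketch of the caching pattern; emitting the const-load
-- MachineOp is elided here.
data LowerSt = LowerSt
  { constCache  :: Map.Map Int Int  -- constant value -> scratch address
  , nextScratch :: Int              -- next free scratch slot
  } deriving Show

getConstS :: Int -> State LowerSt Int
getConstS c = do
  st <- get
  case Map.lookup c (constCache st) of
    Just addr -> pure addr          -- cache hit: emit nothing
    Nothing   -> do                 -- cache miss: allocate and record
      let addr = nextScratch st
      put st { constCache  = Map.insert c addr (constCache st)
             , nextScratch = addr + 1 }
      pure addr
```
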
<h3 id="sv-dispatch">Scalar vs. Vector Dispatch</h3>
<p><code>lowerLet</code> cases on the <em>type</em> of the let-binding:</p>
<table>
<tr><th>IR Type</th><th><code>RBin op</code></th><th><code>RSelect</code></th><th><code>RLoad</code></th></tr>
<tr><td><code>I32</code></td><td><code>Alu op</code> (ALU)</td><td><code>Select</code> (Flow)</td><td><code>Load</code> / <code>load</code> (Load)</td></tr>
<tr><td><code>Vec _ _</code></td><td><code>VAlu op</code> (VALU)</td><td><code>VSelect</code> (Flow)</td><td><code>VLoad</code> / <code>load_offset</code> (Load)</td></tr>
</table>
<p>For vectors, <code>moReads</code>/<code>moWrites</code> lists span <code>w</code> consecutive addresses.
For scalars, they are single-element lists.</p>
<h3 id="address-comp">Address Computation and Caching</h3>
<p>Memory accesses go through <code>computeAddr</code>, which computes <code>base_ptr + offset</code>.
The function <code>basePtr</code> maps each <code>Buffer</code> to its base scratch address:
<code>Header → const 0</code>, <code>Forest → hdrBase+4</code>, <code>InpIdx → hdrBase+5</code>,
<code>InpVal → hdrBase+6</code>. These are the pointer words from the header.</p>
<p>An optimization: if the same buffer was just accessed at the same static offset,
the address ALU add is skipped and the cached address temp is reused. This
saves an ALU slot when the same address is used for a load and its subsequent
store (a common case: <code>InpIdx[i]</code> is loaded and then stored back in every iteration).</p>
<h3 id="gather-loads">Gather Loads</h3>
<p><code>lowerLoadVecGather</code> handles the forest load after vectorization. The index
is a vector of <code>w</code> data-dependent offsets:</p>
<ol>
<li>Broadcast the forest base pointer to a vector.</li>
<li>VALU-add the index vector and base vector, producing a vector of absolute
addresses.</li>
<li>Issue <code>w</code> individual <code>load_offset</code> instructions, one per lane.</li>
</ol>
<p>Each <code>load_offset</code> reads from a different element of the address vector. The
memory metadata is <code>Unknown</code> because gather addresses are data-dependent.</p>
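<p>The three steps can be sketched as a tiny emitter. The constructors and scratch addresses below are hypothetical; the real ops are typed <code>MachineSlot</code>s:</p>

```haskell
-- Illustrative constructors standing in for the real MachineSlot types.
data GatherOp
  = Broadcast Int Int        -- dstVec, srcScalar: base pointer -> vector
  | VAdd Int Int Int         -- dstVec, idxVec, baseVec: absolute addresses
  | LoadOffset Int Int Int   -- dstScalar, addrVec, lane
  deriving (Eq, Show)

-- The three-step gather recipe for SIMD width w.
lowerGather :: Int -> Int -> Int -> Int -> Int -> [GatherOp]
lowerGather w basePtr idxV baseV dstBase =
  Broadcast baseV basePtr
    : VAdd addrV idxV baseV
    : [LoadOffset (dstBase + l) addrV l | l <- [0 .. w - 1]]
  where addrV = baseV + 8  -- hypothetical temp: one vector past baseV
```
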
<h3 id="unrolling">Loop Unrolling</h3>
<p>The lowering fully unrolls every loop by iterating
<code>[start, start+step .. end-1]</code> and emitting the body for each iteration value.
This is valid because loop bounds are compile-time constants. Before each
iteration, if the IV is used in the body, a <code>const</code> load materializes the
current iteration value as a scalar scratch.</p>
<p>After vectorization, the inner loop has step <code>w=8</code>. For <code>batchSize=256</code>,
this means 32 unrolled iterations (down from 256 scalar iterations). Each
iteration emits vector ops that process 8 elements, so the total work is
the same but uses roughly 8x fewer instructions.</p>
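<p>The unroll driver itself is tiny; a sketch, with <code>emitBody</code> standing in for the real per-iteration lowering:</p>

```haskell
-- Compile-time trip values for a loop with constant bounds.
unrollIters :: Int -> Int -> Int -> [Int]
unrollIters start step end = [start, start + step .. end - 1]

-- Full unrolling: emit the body once per trip value and concatenate.
unroll :: (Int -> [op]) -> Int -> Int -> Int -> [op]
unroll emitBody start step end = concatMap emitBody (unrollIters start step end)
```

<p>With <code>start=0</code>, <code>step=8</code>, <code>end=256</code> this yields the 32 vectorized iterations mentioned above.</p>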
<h2 id="scheduling">Instruction Scheduling (<code>Schedule.hs</code>)</h2>
<p>The scheduler reorders the flat <code>[MachineOp]</code> list produced by lowering
before it reaches the bundler. The bundler is greedy and in-order — it
never reorders ops, it only decides whether the next op fits in the current
bundle. So the scheduler’s job is to present ops in an order that gives
the bundler more packing opportunities, placing independent ops from
different engines adjacent to each other rather than in long runs of the same
engine type.</p>
<h3 id="schedule-algo">Algorithm</h3>
<p>The scheduler builds a dependency DAG from the <code>moReads</code> and <code>moWrites</code> fields
of each <code>MachineOp</code>, tracking three kinds of hazards: RAW (read-after-write:
an op reads a scratch address written by an earlier op), WAW
(write-after-write: two ops write the same address), and WAR
(write-after-read: an op writes an address read by an earlier op). Memory
conflicts through <code>moMem</code> are also modeled — two ops that access the same
memory region with at least one store are linked by a dependence edge.</p>
<p>With the DAG built, the scheduler performs a topological sort using two
priority criteria. First, ops on the longest dependence chain are scheduled
first, because delaying them would extend the total schedule length. Second, as
a tiebreaker, the most-constrained engine wins: ops targeting engines with lower
slot limits (Flow has only 1 slot per cycle; Store and Load have 2 each) are
preferred over ops targeting the less constrained ALU engine (12 slots). I think
this is a reasonable list-scheduling approach for a VLIW machine, where the goal
is not to minimize latency (all ops are single-cycle) but to maximize bundle
packing density.</p>
<h3 id="schedule-impact">Impact</h3>
<p>For a 16-element batch with 1 round, the unscheduled vectorized pipeline
produces 77 bundles; with the scheduler, 59 — a 23% reduction. For an
8-element batch: 50 bundles unscheduled, 31 with scheduling — a 38%
reduction. The improvement is larger for smaller batches because there are
fewer total ops and more opportunities for cross-iteration packing.</p>
<h3 id="schedule-composability">Scheduling as a Separate Pass</h3>
<p>The scheduler’s type signature is <code>[MachineOp] → [MachineOp]</code> — a
pure reordering that preserves the set of ops exactly. This means it
composes cleanly between lowering and bundling. The bundler does not need to
know about scheduling; the scheduler does not need to know about slot limits
or bundle structure.</p>
<h2 id="bundling">Bundling (<code>Bundle.hs</code>)</h2>
<p>The bundler takes the flat <code>[MachineOp]</code> list (now reordered by the scheduler)
and packs it into VLIW bundles, where each bundle represents one cycle on the
simulator.</p>
<h3 id="bundling-algo">Algorithm</h3>
<p>The bundler is both greedy and order-preserving: it scans ops front-to-back
and tries to add each op to the current bundle. If the op doesn’t fit, the
current bundle is flushed and a new one is started. <strong>Ops are never
reordered</strong>; reordering is the scheduler’s responsibility.</p>
<h3 id="bundling-legality">Legality Checks</h3>
<p><code>canPack</code> checks four conditions. All must hold for the op to enter the
current bundle:</p>
<p><strong>1. Slot capacity.</strong> Each engine has a maximum: ALU ≤ 12, VALU ≤ 6,
LOAD ≤ 2, STORE ≤ 2, FLOW ≤ 1.</p>
<p><strong>2. No RAW hazard.</strong> The op’s reads must not overlap with addresses written
by earlier ops in this bundle. Even though the simulator executes all ops
in a bundle atomically, the compiler emits ops in program order and expects
sequential semantics.</p>
<p><strong>3. No WAW hazard.</strong> No two writes to the same scratch address in one cycle.</p>
<p><strong>4. No memory hazard.</strong> If the op accesses a memory region that already has
a store in this bundle, the address ranges must not overlap.</p>
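<p>Condition 1 can be sketched as per-engine occupancy counting (a model of the capacity check only; the hazard checks are elided):</p>

```haskell
import qualified Data.Map.Strict as Map

data Engine = Alu | Valu | Load | Store | Flow deriving (Eq, Ord, Show)

-- Per-cycle slot caps from the machine description.
slotLimit :: Engine -> Int
slotLimit Alu = 12; slotLimit Valu = 6
slotLimit Load = 2; slotLimit Store = 2; slotLimit Flow = 1

-- A bundle viewed only as per-engine occupancy.
type Occupancy = Map.Map Engine Int

-- Nothing means "doesn't fit": flush the bundle and start a new one.
place :: Engine -> Occupancy -> Maybe Occupancy
place e occ
  | Map.findWithDefault 0 e occ < slotLimit e = Just (Map.insertWith (+) e 1 occ)
  | otherwise                                 = Nothing
```
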
<h3 id="bundling-practice">Packing in Practice</h3>
<p>For a batch size of 16 with 1 round, the scalar pipeline
produces <strong>386 bundles</strong> (nearly one op per cycle), while the vectorized and
scheduled pipeline produces <strong>59 bundles</strong>. The roughly 6.5x reduction comes
from three sources: vectorization reduces instruction count by processing
8 elements per op, the scheduler reorders ops so that independent ops from
different engines land next to each other, and the bundler fills cycles
with those independent ops (an ALU address computation and a LOAD can share
a cycle, two LOADs can share a cycle, a VALU and a LOAD <code>const</code> can share
a cycle, and so on).</p>
<h2 id="rendering">Rendering (<code>PythonInterop.hs</code>)</h2>
<p><code>renderProgramPayload</code> produces a Python dict literal:</p>
<pre><code>{"rounds":3,"batch_size":256,"program":[{...},{...},...]}</code></pre>
<p>Each bundle is a dict with engine keys (<code>"alu"</code>, <code>"valu"</code>, <code>"load"</code>,
<code>"store"</code>, <code>"flow"</code>) mapping to lists of tuples. The rendering functions
pattern-match on the typed ISA slot constructors and produce the correct
string. Because the slots are typed, the renderer is total over all ISA
constructors and the output format is guaranteed to match what the simulator’s
parser expects.</p>
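<p>A sketch of the serialization shape (the slot strings below are placeholders; the real renderer derives them from the typed constructors):</p>

```haskell
import Data.List (intercalate)

-- Serialize one bundle's engine lists into a Python dict literal.
-- show on the key yields the double-quoted string Python expects.
renderBundle :: [(String, [String])] -> String
renderBundle engines =
  "{" ++ intercalate ","
           [ show key ++ ":[" ++ intercalate "," slots ++ "]"
           | (key, slots) <- engines
           , not (null slots) ]      -- empty engines are omitted
      ++ "}"
```
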
<h2 id="worked-example">Worked Example: One Batch Element Through the Pipeline</h2>
<p>To make the pipeline concrete, let us trace the transformation of one batch
element through every stage.</p>
<h3>Stage 1: Scalar IR</h3>
<p>The inner loop body for element <code>i</code> (all types <code>I32</code>):</p>
<pre><code>Let %12 I32 (RLoad LoadScalar InpIdx (Var %10)) -- idx = InpIdx[i]
Let %13 I32 (RLoad LoadScalar InpVal (Var %10)) -- val = InpVal[i]
Let %14 I32 (RLoad LoadScalar Forest (Var %12)) -- node = Forest[idx]
Let %15 I32 (RBin Xor (Var %13) (Var %14)) -- val ^= node
... 6 hash stages, each 3 Lets ...
Let %33 I32 (RBin Mod (Var %32) (Const 2)) -- mod2 = hash % 2
Let %34 I32 (RBin Eq_ (Var %33) (Const 0)) -- even?
Let %35 I32 (RSelect (Var %34) (Const 1) (Const 2)) -- add = even?1:2
Let %36 I32 (RBin Mul (Var %12) (Const 2)) -- idx*2
Let %37 I32 (RBin Add (Var %36) (Var %35)) -- nextIdx
Let %38 I32 (RBin Lt (Var %37) (Var %1)) -- < nNodes?
Let %39 I32 (RSelect (Var %38) (Var %37) (Const 0)) -- wrap to 0
Store StoreScalar InpIdx (Var %10) (Var %39)
Store StoreScalar InpVal (Var %10) (Var %32)</code></pre>
<h3>Stage 3: After Vectorize (step becomes 8)</h3>
<p>The vectorizer checks: <code>iv</code> (%10) is used only as an index into <code>InpIdx</code> and
<code>InpVal</code>. Legality passes. Types are promoted:</p>
<pre><code>Let %12 (Vec 8 I32) (RLoad (LoadContigVec 8) InpIdx (Var %10))
Let %13 (Vec 8 I32) (RLoad (LoadContigVec 8) InpVal (Var %10))
Let %14 (Vec 8 I32) (RLoad (LoadGatherVec 8) Forest (Var %12))
Let %15 (Vec 8 I32) (RBin Xor (Var %13) (Var %14))
... all hash stages now Vec 8 I32 ...
Store (StoreContigVec 8) InpIdx (Var %10) (Var %39)
Store (StoreContigVec 8) InpVal (Var %10) (Var %32)</code></pre>
<p>The <code>InpIdx</code>/<code>InpVal</code> loads became contiguous vector loads. The forest load
became a gather (its index is now a vector). All arithmetic was promoted
to <code>Vec 8 I32</code>. The <code>(Const 2)</code> operands remain <code>I32</code> — the lowering
will broadcast them.</p>
<h3>Stage 5: After Lowering (one unrolled iteration at i=0)</h3>
<pre><code>MSlotAlu (Alu Add addr=8 base=5 offset=const(0)) -- addr for InpIdx[0]
MSlotLoad (VLoad dst=18 addr=8) -- InpIdx[0:8] → scratch 18..25
MSlotAlu (Alu Add addr=9 base=6 offset=const(0)) -- addr for InpVal[0]
MSlotLoad (VLoad dst=26 addr=9) -- InpVal[0:8] → scratch 26..33
MSlotValu (VAlu Add addrV=10 ixV=18 baseV=bcast(4)) -- gather addresses
MSlotLoad (LoadOffset dst=34 base=10 lane=0) -- Forest[idx[0]]
... -- 8 lane loads total
MSlotValu (VAlu Xor dst=42 src=26 src=34) -- val ^= node
... (6 hash stages, each ~3 VALU ops) ...
MSlotStore (VStore addr=8 src=...) -- InpIdx[0:8] = wrapped
MSlotStore (VStore addr=9 src=...) -- InpVal[0:8] = hashed</code></pre>
<h3>Stage 6: After Scheduling and Bundling</h3>
<p>The scheduler reorders ops so that independent operations targeting different
engines are adjacent. The bundler then packs them into cycles: the address
ALU + VLoad share a cycle (different engines), the 8 gather
<code>LoadOffset</code>s span 4 cycles (2 loads/cycle), VALU hash ops pack up to 6
per cycle, and the two VStores fit in one cycle.</p>
<h2 id="earlier-design">An Earlier, More Ambitious Design</h2>
<p>Early in the project, I was aiming at a much more general compiler. The plan was to build a structured-SSA IR with region-based control flow, explicit memory tokens for effect ordering, and an affine index language for reasoning about layout and strides. A lot of this direction came from my past time working with MLIR. But trying to reproduce the generality and completeness of a large compiler framework inside a small project was unwise: producing non-trivial optimizations to arbitrary structured-SSA programs required significant engineering work. What’s worse, in pursuing this generality, I obtained lackluster optimizations on the particular tree-hash kernel. Some particular issues:</p>
<p><strong>1. The IR made simple code hard to write.</strong> In the old design, even a basic load had a lot of machinery around it. Every load had to thread a memory token through a <code>Let</code> node with explicit output bindings. Something as simple as <code>idx = InpIdx[i]</code>, which is one line in the current builder, expanded into several steps: creating fresh token IDs, building an address expression such as <code>IndexAff (IxAdd (IxConst 0) (IxMul 1 (IxVar iv)))</code>, issuing the
load, and rebinding the resulting token.</p>
<p><strong>2. The old representation left too much room for subtle mistakes.</strong> Affine index analysis is easy to get subtly wrong: if an expression like <code>IxMul 2 (IxVar iv)</code> were mistakenly treated as contiguous instead of strided, the vectorizer could emit a <code>vload</code> from the wrong addresses. What’s scary is that this would not necessarily crash; it would just quietly compute the wrong answer. This caused me to pull a lot of hair out with some of the test programs I wrote. The newer named-buffer representation prevents this class of mistakes. With <code>LoadScalar InpIdx (Var iv)</code>, the buffer identifies the memory region and the expression identifies the offset. There is much less room for the compiler to infer the wrong thing.
</p><h1>Part II: The Simulator and Property-Based Testing</h1>
<h2 id="sim-goal">The Goal</h2>
<p>The reference implementation of Anthropic’s simulator is a Python class — roughly
200 lines of imperative code that mutates dictionaries and lists in place. Translating
it line-for-line into Haskell would be awkward and pointless. The interesting question
here is whether we can write a simulator, using Haskell’s type system and abstractions,
that is <em>clearer</em> than the original. But how?</p>
<p>It turns out that we can decompose the simulator into a set of independent(ish) concerns:</p>
<ol>
<li>The machine configuration. Things like the scratch size.</li>
<li>The mutable state, that is, the scratch and memory.</li>
<li>Execution.</li>
<li>Tracing and Debugging, which the Python implementation achieves via custom debug
instructions that, in our opinion, pollute the ISA.</li>
<li>Error Handling. Do not divide by zero!</li>
</ol>
<p>Then, the question becomes whether our implementation could take advantage of and
reflect this separation.</p>
<p>The answer is yes. In particular, we can achieve this via a monad transformer stack,
where each concern maps to one layer of the stack. We also get the “onion” effect:
adding or removing a concern (such as disabling tracing entirely) is a matter of
changing which layers you peel, not of rewriting the execution logic.</p>
<p>However, a <em>clear</em> and <em>elegant</em> simulator is useless if it does not work correctly.
It is easy to get VLIW + SIMD semantics wrong in very subtle ways. Fortunately,
we have a reference implementation, and Haskell’s excellent property-based testing
tools! Using these tools we can check that our implementation matches the reference
across a good, randomly generated set of programs.</p>
<h2 id="sim-architecture">Architecture: A Monad Transformer Stack (<code>Sim/Types.hs</code>)</h2>
<p>The simulator’s type is a four-layer transformer stack:</p>
<pre><code>type SimM a =
ReaderT SimConfig
(StateT MachineState
(WriterT (Seq CycleTrace)
(ExceptT SimError IO))) a</code></pre>
<p>Each layer corresponds to one simulator concern:</p>
<table>
<tr><th>Layer</th><th>Type</th><th>Concern</th></tr>
<tr><td><code>ReaderT SimConfig</code></td><td>Immutable config</td><td>Machine limits, scratch size, trace/pause flags</td></tr>
<tr><td><code>StateT MachineState</code></td><td>Mutable state</td><td>Core registers, memory, cycle counter</td></tr>
<tr><td><code>WriterT (Seq CycleTrace)</code></td><td>Append-only output</td><td>Cycle-by-cycle trace records</td></tr>
<tr><td><code>ExceptT SimError</code></td><td>Short-circuit errors</td><td>Divide-by-zero, out-of-bounds, unknown PC</td></tr>
</table>
<p>The <code>SimConfig</code> record holds values that never change during execution: the machine
configuration (slot limits), scratch-pad size, and flags for enabling pause and trace
modes. Placing these in <code>ReaderT</code> means every function in the simulator can read them
with <code>ask</code>, and none can accidentally modify them.</p>
<p><code>MachineState</code> holds everything that <em>does</em> change: the core’s program counter,
run state, scratch registers (an <code>IntMap Word32</code> of 1536 entries), the flat memory
image, and the cycle counter. <code>StateT</code> gives us <code>get</code> and <code>put</code>: in
some ways the same interface an imperative simulator would use, but scoped to the
<code>runStateT</code> call.</p>
<p>The <code>WriterT</code> layer is the one that benefits most from being a separate concern.
Cycle traces, records of what each cycle wrote to scratch and memory, accumulate via
<code>tell</code>. When tracing is disabled, the trace sequence stays empty. The execution code
does not branch on whether tracing is on; it simply emits a <code>CycleTrace</code> after each
cycle, and the runner decides whether to <code>tell</code> it based on the config. If we wanted to
strip tracing entirely, we could replace <code>WriterT</code> with <code>Identity</code> and the execution
modules would not need to change.</p>
<p><code>ExceptT SimError</code> handles the four runtime errors the ISA can produce:
divide-by-zero (<code>SimDivideByZero</code>), scratch out-of-bounds (<code>SimScratchOob</code>),
memory out-of-bounds (<code>SimMemoryOob</code>), and unknown program counter
(<code>SimUnknownPc</code>). Any instruction handler can <code>throwError</code> and the entire
simulation short-circuits — no error codes to check, no sentinel values to
propagate.</p>
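<p>A sketch of this short-circuiting style, using <code>Either</code> as the error channel (the op spellings are illustrative; in the real stack a <code>Left</code> is rethrown with <code>throwError</code>):</p>

```haskell
import Data.Word (Word32)

data SimError
  = SimDivideByZero
  | SimScratchOob
  | SimMemoryOob
  | SimUnknownPc
  deriving (Eq, Show)

-- Any Left aborts the whole evaluation; no sentinel values needed.
aluEval :: String -> Word32 -> Word32 -> Either SimError Word32
aluEval "+" a b  = Right (a + b)
aluEval "//" _ 0 = Left SimDivideByZero
aluEval "//" a b = Right (a `div` b)
aluEval op _ _   = error ("op not modeled in this sketch: " ++ op)
```
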
<h3 id="sim-stepm">The Inner Monad: <code>StepM</code></h3>
<p>Execution within a single cycle uses a second, smaller stack:</p>
<pre><code>type StepM a =
ReaderT StepSnapshot (StateT PendingWrites (Either SimError)) a</code></pre>
<p><code>StepSnapshot</code> freezes the machine state at the beginning of the cycle: the current
PC, scratch registers, and memory. <code>PendingWrites</code> accumulates the writes that will be
committed at cycle end. This design is really important to preserve the semantics of the
machine. All slots in a VLIW bundle execute in parallel, so all reads must see the state
from the <em>beginning</em> of the cycle and all writes must commit atomically at the end. That
said, we are trusting the programmer to not bundle instructions with data hazards.</p>
<h2 id="sim-exec">Execution Semantics (<code>Sim/Exec.hs</code>)</h2>
<p>Bundle execution follows a snapshot-then-commit protocol:</p>
<pre><code>executeBundle :: ISA.Bundle () -> SimM PendingWrites
executeBundle bundle = do
cfg <- ask
ms <- get
let snap = buildSnapshot cfg ms
case runStateT
(runReaderT (executeBundleSlots bundle) snap)
emptyPendingWrites of
Left err -> throwError err
Right (_, pw) -> do
commitPendingWrites pw
pure pw</code></pre>
<p>First, the current machine state is frozen into a <code>StepSnapshot</code>. Then
<code>executeBundleSlots</code> runs every slot in the bundle — ALU, VALU, Load, Store,
Flow — against that snapshot, accumulating writes into <code>PendingWrites</code>. Finally,
<code>commitPendingWrites</code> applies the buffered writes to the real machine state in one
step.</p>
<p>Each engine dispatch is a pattern match on the ISA constructors. Scalar ALU:</p>
<pre><code>executeAluSlot :: ISA.AluSlot -> StepM ()
executeAluSlot (ISA.Alu op dst srcA srcB) = do
a <- readScratch srcA
b <- readScratch srcB
r <- liftEitherStep (aluEval op a b)
writeScratch dst r</code></pre>
<p>Vector operations iterate over 8 lanes:</p>
<pre><code>executeValuSlot :: ISA.ValuSlot -> StepM ()
executeValuSlot slot = case slot of
ISA.VBroadcast dest src ->
forLane8 (execVBroadcastLane dest src)
ISA.MultiplyAdd dest a b c ->
forLane8 (execMultiplyAddLane dest a b c)
ISA.VAlu op dest a b ->
forLane8 (execVAluLane op dest a b)</code></pre>
<p>where <code>forLane8 = forM_ [0..7]</code>. Each lane reads from <code>scratchPlus base lane</code>
and writes to the corresponding destination lane. In our machine a vector operand is
simply 8 consecutive scratch addresses, so the lane loop is plain integer arithmetic
on <code>ScratchAddr</code>.</p>
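<p>The lane arithmetic can be sketched in a few lines (a model of <code>scratchPlus</code> and <code>forLane8</code> as described above):</p>

```haskell
import Control.Monad (forM_)

-- A vector operand is its base ScratchAddr plus lane offsets 0..7.
newtype ScratchAddr = ScratchAddr Int deriving (Eq, Show)

scratchPlus :: ScratchAddr -> Int -> ScratchAddr
scratchPlus (ScratchAddr a) lane = ScratchAddr (a + lane)

-- Run a per-lane action over all 8 lanes.
forLane8 :: Monad m => (Int -> m ()) -> m ()
forLane8 = forM_ [0 .. 7]

laneAddrs :: ScratchAddr -> [ScratchAddr]
laneAddrs base = map (scratchPlus base) [0 .. 7]
```
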
<p>The <code>readScratch</code> and <code>writeScratch</code> functions in <code>StateAccess</code> enforce the snapshot
design: reads come from the <code>StepSnapshot</code> (via <code>ReaderT</code>), writes go into
<code>PendingWrites</code> (via <code>StateT</code>). A slot cannot observe a write made by another slot in
the same bundle. This guarantee falls out of the monad stack for free.</p>
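<p>A minimal model of this discipline shows why the guarantee holds: reads go through the snapshot, writes through the pending map, and nothing else is possible (types simplified from the real <code>StepM</code>):</p>

```haskell
import Control.Monad.Reader
import Control.Monad.State
import qualified Data.IntMap.Strict as IM
import Data.Word (Word32)

type Snapshot      = IM.IntMap Word32
type PendingWrites = IM.IntMap Word32
type Step          = ReaderT Snapshot (State PendingWrites)

readScratch :: Int -> Step Word32
readScratch addr = asks (IM.findWithDefault 0 addr)  -- never sees pending writes

writeScratch :: Int -> Word32 -> Step ()
writeScratch addr v = modify (IM.insert addr v)

commit :: Snapshot -> PendingWrites -> Snapshot
commit snap pw = IM.union pw snap                    -- left-biased: writes win

-- Run all slots against a frozen snapshot, then commit atomically.
runCycle :: Snapshot -> Step a -> (a, Snapshot)
runCycle snap slots =
  let (x, pw) = runState (runReaderT slots snap) IM.empty
  in (x, commit snap pw)
```
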
<h2 id="sim-loop">The Simulation Loop (<code>Sim/Runner.hs</code>)</h2>
<p>The top-level loop is short:</p>
<pre><code>runUntilStop :: [ISA.Bundle ()] -> SimM ()
runUntilStop bundles = do
progressed <- stepCore bundles
when progressed (runUntilStop bundles)</code></pre>
<p><code>stepCore</code> fetches the bundle at the current PC, executes it, records a trace if
tracing is enabled, increments the cycle counter, and returns <code>True</code> if the core is
still running. When the core halts (or the PC goes out of range), <code>stepCore</code> returns
<code>False</code> and the loop terminates.</p>
<p>Note: the original Python implementation supports multiple cores, but Anthropic’s
optimization assignment says to ignore this and assume a single core. We likewise run
only one core, while keeping the stepping semantics faithful to the reference.</p>
<p><code>initMachineState</code> builds a dense <code>IntMap</code> for both scratch (1536 zeros) and
memory (the provided image). <code>finalMemoryImage</code> reads the <code>IntMap</code> back into a flat
list for comparison with the reference.</p>
<p>The entire simulator entry point is:</p>
<pre><code>runSimulator
:: SimConfig -> [ISA.Bundle ()] -> [Word32]
-> ExceptT SimError IO ([Word32], Int, Seq CycleTrace)
runSimulator cfg bundles mem0 = do
let st0 = initMachineState cfg mem0
((_, st1), trace) <-
runWriterT
(runStateT
(runReaderT (runProgram bundles) cfg) st0)
pure (finalMemoryImage st1, msCycle st1, trace)</code></pre>
<p>One line peels the entire monad stack. The result is the final memory image, the cycle
count, and the trace — all pure values.</p>
<h2 id="pbt">Property-Based Testing with Hedgehog (<code>test/PBT.hs</code>, <code>test/Test/Gen.hs</code>)</h2>
<p>Time to test! The simulator must agree with the Python reference on every cycle, for
every valid program. This is a natural property to test: generate a random program and
memory image, run both simulators, compare the traces. Easy, right? Well… this part
was challenging for a variety of reasons, which we describe in this section. The first
issue we encountered was that QuickCheck, and its shrinking in particular, was not
well suited to this task. In the next section we describe the second big problem:
running the Python implementation from Haskell.</p>
<h3 id="pbt-quickcheck">Why Not QuickCheck</h3>
<p>QuickCheck was the first attempt, and it failed in an instructive way. QuickCheck
generates a random value and, when a test fails, <em>shrinks the value</em> — it tries to
produce a smaller counterexample by removing elements from the generated data. The
problem is that the generated data is an ISA program with structural invariants: slot
limits, live-register dependencies, valid memory addresses. Shrinking a bundle list by
dropping an element can produce a program where a register is read before it is written,
or where a conditional jump targets a now-nonexistent bundle. The shrunk counterexample
is not a valid program.</p>
<p>QuickCheck’s <code>Arbitrary</code> instances work well for algebraic data where any subterm
is valid in isolation. ISA programs are not like that. An <code>AluSlot</code> that reads scratch
address 42 is only valid if a prior bundle wrote to scratch 42. Shrinking must preserve
this dependency structure, and QuickCheck does not provide a way to express that.</p>
<p>You could validate each shrunk candidate and discard the invalid ones, but in
property-based testing, generate-and-filter is a poor strategy: every miss wastes
time and compute. To see why, imagine you want to generate Python programs and you
have a way to test whether a string is valid Python. The naive strategy is to
generate random strings and filter out those that are not Python programs. Insane,
right?</p>
<h3 id="pbt-hypothesis">Why Not Hypothesis</h3>
<p>Since the reference simulator is Python, using Hypothesis (Python’s property-based
testing library) was a natural thought, especially given that we have experience using it.
Hypothesis has integrated shrinking: it shrinks the <em>generator’s random choices</em>, not
the generated value, so structural invariants are preserved by construction. This solves
our problem completely. But using Hypothesis would mean writing the testing harness in
Python, and this is a Haskell project for a Haskell class. So we decided not to go that
way, and keep looking for a solution.</p>
<h3 id="pbt-hedgehog">Hedgehog: Integrated Shrinking in Haskell</h3>
<p>This is when we discovered
<a href="https://hackage.haskell.org/package/hedgehog">Hedgehog</a> solves both problems.
Like Hypothesis, it shrinks the random byte stream that drives the generator, not the
generated value. This means the generator’s invariants (slot limits, register liveness,
address validity) are all preserved during shrinking. But the best part is that it is a
Haskell library!</p>
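<p>A sketch of the invariant-preserving generator style (the slot shape is illustrative and assumes the <code>hedgehog</code> package): source operands are drawn only from the live set, so any shrink of the underlying choices still reads defined scratch:</p>

```haskell
import Hedgehog (Gen)
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

-- Hypothetical ALU-slot shape: (dst, srcA, srcB).
-- Shrinking replays smaller random choices through this generator,
-- so the "reads are live" invariant holds for every shrunk value too.
genAluTriple :: [Int] -> Gen (Int, Int, Int)
genAluTriple live = do
  dst  <- Gen.int (Range.linear 0 1535)  -- any scratch slot may be written
  srcA <- Gen.element live               -- reads stay inside the live set
  srcB <- Gen.element live
  pure (dst, srcA, srcB)
```
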
<h3 id="pbt-generators">The Generators</h3>
<p>The generators in <code>Test/Gen.hs</code> are themselves stateful programs. A <code>GenM</code> monad
threads a <code>GenEnv</code> that tracks:</p>
<ul>
<li>Scratch addresses with defined values.</li>
<li>Scratch addresses holding valid memory pointers.</li>
<li>Addresses that must not be overwritten, like loop counters.</li>
<li>And addresses written in the current bundle (WAW prevention).</li>
</ul>
<pre><code>data GenEnv = GenEnv
{ geLive :: !(Set Int),
geMemAddrs :: !(Set Int),
geReserved :: !(Set Int),
geBundleDests :: !(Set Int),
geMemSize :: !Int,