This project implements an LLVM backend for the Texas Instruments TMS9900 microprocessor, the CPU used in the TI-99/4A home computer (1981). The TMS9900 was the first single-chip 16-bit microprocessor and has several unique architectural features that required special handling in the compiler.
- Word size: 16-bit
- Endianness: Big-endian
- Registers: 16 general-purpose 16-bit registers (R0-R15)
- Unique feature: Registers are memory-mapped via the Workspace Pointer (WP)
- No hardware stack: Software stack implemented using R10 as stack pointer
All standard 16-bit operations are supported natively using TMS9900 instructions:
- Arithmetic: ADD (
A), SUB (S), NEG, ABS, INC, DEC, INCT, DECT - Logic: OR (
SOC), XOR, NOT (INV) - Shifts: SLA, SRA, SRL, SRC (rotate right)
- Compare: C (sets status bits for signed/unsigned comparisons)
TMS9900 Quirk: The TMS9900 has no AND instruction! Instead, it has:
SZC(Set Zeros Corresponding):dst = dst AND (NOT src)(bit clear)SOC(Set Ones Corresponding):dst = dst OR src
Solution: We implement AND using a three-instruction pseudo-expansion:
; To compute: rd = rs1 AND rs2
INV rs2 ; rs2 = NOT rs2
SZC rs2, rd ; rd = rd AND (NOT rs2) = rd AND original_rs2
INV rs2 ; Restore rs2 to original valueThis is handled in TMS9900InstrInfo.cpp:expandPostRAPseudo().
The TMS9900 has byte instructions (MOVB, AB, SB, etc.) that operate on the upper byte of registers. Rather than dealing with this complexity throughout the backend, we promote all i8 operations to i16. Byte loads/stores use MOVB which handles the byte positioning automatically.
The TMS9900 is a 16-bit CPU, so 32-bit operations require special handling:
-
Addition/Subtraction: LLVM automatically expands these to pairs of 16-bit operations with carry handling. We synthesize carry propagation since TMS9900 lacks ADC/SBC instructions.
-
Multiplication: LLVM expands to partial products using 16-bit multiplies.
-
Division/Remainder: Uses libcalls (
__divsi3,__udivsi3,__modsi3,__umodsi3) from the runtime library because inline expansion would be too complex. -
Shifts: Variable-amount shifts use libcalls (
__ashlsi3,__ashrsi3,__lshrsi3) from the runtime library. Constant shifts are optimized by LLVM at compile time. The runtime implementation uses a word-swap optimization for shifts ≥16 bits, then loops for the remaining 0-15 bits. Edge cases handled: shift by 0 (return unchanged), shift ≥32 (return 0 or sign-extended -1 for arithmetic right shift).
The TMS9900 has native multiply and divide instructions with quirks:
MPY (Multiply): 16x16 -> 32-bit result
- Produces 32-bit result in register pair Rd:Rd+1
- Rd must be an even register
- We hardcode to R0 and extract the low 16 bits from R1 for normal
mul
DIV (Divide): 32-bit / 16-bit -> 16-bit quotient + 16-bit remainder
- Dividend must be in Rd:Rd+1 (even register pair)
- For 16-bit divide, we clear the high word and place dividend in low word
- Quotient ends up in Rd, remainder in Rd+1
Signed Division: The DIV instruction is unsigned-only. For signed division, we:
- Save the XOR of operand signs (determines result sign)
- Take absolute values of both operands
- Perform unsigned division
- Negate result if signs differed
This is implemented via pseudo-instructions (SDIV16, UDIV16, SREM16, UREM16) that expand in EmitInstrWithCustomInserter.
Defined in TMS9900CallingConv.td:
| Register | Purpose |
|---|---|
| R0 | Return value |
| R1-R9 | Arguments (first 9 words), caller-saved |
| R10 | Stack Pointer (SP) |
| R11 | Link Register (return address) |
| R12 | Scratch (caller-saved) |
| R13-R15 | Callee-saved |
- 32-bit values use register pairs: R0:R1 (high:low) for return, R1:R2, R3:R4 for args
- Additional arguments spill to the stack
- Stack grows downward (high to low addresses)
No Hardware Stack: TMS9900 has no PUSH/POP or hardware stack pointer. We implement a software stack:
- R10 serves as the stack pointer
- Push:
DECT R10thenMOV Rx,*R10 - Pop:
MOV *R10+,Rx
Indexed Stack Access: For register spills and local variables, we use indexed addressing:
MOV R5,@4(R10) ; Store R5 at SP+4
MOV @6(R10),R6 ; Load from SP+6 into R6This is implemented via MOV_FI_Load and MOV_FI_Store pseudo-instructions that get frame indices resolved in eliminateFrameIndex.
The TMS9900 compare instruction (C Rs,Rd) sets multiple status bits:
- ST0 (L>): Logical (unsigned) greater than
- ST1 (A>): Arithmetic (signed) greater than
- ST2 (EQ): Equal
Different jump instructions test different conditions:
- JGT: ST1=1 (signed greater than)
- JLT: ST1=0 AND ST2=0 (signed less than)
- JHE: ST0=1 OR ST2=1 (unsigned high or equal)
- JLE: ST0=0 OR ST2=1 (unsigned low or equal)
- JEQ/JNE: ST2=1/0 (equal/not equal)
We map LLVM's comparison conditions to appropriate jumps, sometimes swapping operands to simplify:
a >= bbecomes:C b,a+JGT false_branch(using !(b > a))a <= bbecomes:C a,b+JGT false_branch(using !(a > b))
Switch statements compile to jump tables for efficiency:
SLA R0,1 ; Index * 2 (word offset)
MOV @LJTI0(R0),R1 ; Load target address from table
B *R1 ; Indirect branch
LJTI0:
DATA L_case0
DATA L_case1
DATA L_case2Implemented via BR_JT expansion and custom JumpTable lowering.
TMS9900 supports post-increment addressing (*R+):
- Word operations increment by 2
- Byte operations increment by 1
We expose this to LLVM via setIndexedLoadAction(ISD::POST_INC, ...) allowing the optimizer to combine pointer arithmetic with loads/stores:
MOV *R3+,R5 ; Load word at R3, then R3 += 2Full support for inline assembly with:
- Register constraints:
r(any register),{R0}through{R15}(specific registers) - Immediate constraints:
i,n - Memory constraints:
m
Example:
int result;
asm("MPY %1,R0\nMOV R1,%0" : "=r"(result) : "r"(multiplier) : "R0", "R1");Variable argument functions are supported:
va_startstores the frame pointer to the va_list locationva_argloads successive arguments from the stack- Arguments beyond R0-R3 are already on the stack from the caller
TMS9900 Quirk: The TMS9900 has no atomic instructions, cache, or memory barriers. It's a simple single-core CPU where all memory operations are inherently ordered and immediately visible.
Solution: We tell LLVM to expand all atomic operations to regular loads/stores via shouldExpandAtomicLoadInIR() returning true. This is correct because there's no concurrency that could observe non-atomic behavior.
Defined in TMS9900Schedule.td based on timing data from the TMS 9900 Data Manual (Table 3):
| Operation | Base Cycles | Notes |
|---|---|---|
| Fast (B, CLR, INC, INV) | 8-10 | |
| ALU (MOV, A, S, C) | 14 | Register-to-register |
| Memory access | +4-8 | Addressing mode overhead |
| MPY | 52 | |
| DIV | 92-124 | Data dependent |
| Shifts | 12 + 2*count |
The scheduler uses this to prefer shorter instructions when order doesn't matter.
The TMS9900 has SWPB which swaps the two bytes of a register, directly supporting __builtin_bswap16().
The LLVM backend generates assembly in a format based on LLVM's MC layer conventions:
- Section directives like
.text,.data,.bss - Alignment directives like
.p2align - ELF-style directives like
.type,.size
However, the xas99 assembler (from the xdt99 toolkit) uses traditional TI assembler syntax:
DEF,REF,EVEN,BSS,DATA,TEXT,BYTE,END
llvm2xas99.py post-processes LLVM output for xas99 compatibility:
# Usage:
llc -march=tms9900 input.ll -o - | python3 llvm2xas99.py > output.asm
# Or:
llc -march=tms9900 input.ll -o temp.s
python3 llvm2xas99.py temp.s > output.asm| LLVM Output | xas99 Output |
|---|---|
.text |
; === TEXT SECTION === |
.data |
; === DATA SECTION === |
.bss |
; === BSS SECTION === |
.p2align N |
EVEN |
.zero N |
DATA 0 (repeated) |
.ascii "str" |
TEXT 'str' |
.asciz "str" |
TEXT 'str' + BYTE 0 |
.type, .size, etc. |
(removed) |
The translator automatically detects references to external symbols (like libcall functions) and generates REF directives:
; External references (libcalls)
REF __mulsi3
REF __divsi3We added native xas99 dialect support directly to the LLVM backend:
clang --target=tms9900 -O2 -S -fno-addrsig \
-mllvm -tms9900-asm-dialect=xas99 test.c -o test.sFeatures:
- Hex immediates as
>XXXXformat (not0x...) - Negative values as two's complement (e.g.,
-1→>FFFF) DEFfor exported symbolsBSS Nfor zero-fill (not.zero N)
Quirk: LLVM's MC layer always emits certain directives (.text, .data, .bss, .p2align) when switching sections. These cannot be suppressed without a custom MCStreamer. Workaround:
grep -v '^\s*\.' test.s > test_clean.sFor complex projects, llvm2xas99.py still provides more comprehensive conversion
Located in runtime/tms9900_rt.asm, this provides compiler support functions:
| Function | Purpose | Status |
|---|---|---|
__mulsi3 |
32-bit multiply | Implemented |
__divsi3 |
32-bit signed divide | Implemented |
__udivsi3 |
32-bit unsigned divide | Implemented |
__modsi3 |
32-bit signed remainder | Implemented |
__umodsi3 |
32-bit unsigned remainder | Implemented |
__ashlsi3 |
32-bit left shift | Implemented |
__ashrsi3 |
32-bit arithmetic right shift | Implemented |
__lshrsi3 |
32-bit logical right shift | Implemented |
# Assemble standalone object
xas99.py -R runtime/tms9900_rt.asm -o tms9900_rt.o
# Or include directly in your program
cat your_code.asm runtime/tms9900_rt.asm > combined.asm
xas99.py -R combined.asm -b -o program.binWhat: LLVM's branch analysis allows dead code elimination and branch optimization. Status: Partially implemented (unconditional branches only). Why Deferred: Low optimization impact. Dead code elimination happens at IR level anyway. The main benefit would be slightly cleaner assembly output.
What: Software floating point emulation.
Status: Not implemented.
Why Deferred: Would require a substantial softfloat library. The TI-99/4A rarely used floating point in assembly programs. Could be added by implementing __addsf3, __mulsf3, etc.
What: Dedicated frame pointer register for debugging/unwinding. Status: Not used (hasFP returns false). Why Deferred: Not needed for correctness. Could be added for debugger support.
The TMS9900 has a unique interrupt architecture based on workspace switching:
- 16 Priority Levels: Level 0 (RESET, highest) through Level 15 (lowest)
- Vector Table at 0x0000-0x003F: Each level has a 4-byte vector (WP, PC)
- Automatic Context Switch: On interrupt, the CPU:
- Fetches new Workspace Pointer from vector
- Fetches new Program Counter from vector+2
- Saves old WP→R13, PC→R14, ST→R15 in the NEW workspace
- Sets interrupt mask to (level - 1)
- Return via RTWP: Restores WP, PC, ST from R13-R15
This means each interrupt handler gets a fresh set of registers (R0-R12) without any software save/restore overhead.
Generates no prologue or epilogue. The user is responsible for the entire function body including return:
void __attribute__((naked)) my_asm_func(void) {
asm volatile(
"LI R0, 42\n"
"B *R11" // User provides return
);
}Output:
my_asm_func:
LI R0, 42
B *R11Generates no prologue/epilogue and returns with RTWP instead of B *R11:
void __attribute__((interrupt)) my_isr(void) {
// ISR code - can use R0-R12 freely
// R13-R15 hold return context, don't modify
}Output:
my_isr:
; ... your code ...
RTWP ; Return from interruptNote: The interrupt attribute only affects code generation. You still need to:
- Set up the vector table (in assembly or linker script)
- Allocate workspace memory for the ISR
- Set up a stack (R10) if calling C functions from the ISR
The startup/ directory contains templates for bare-metal applications:
startup.asm- Vector table, reset handler, interrupt handlersREADME.md- Detailed usage instructions
See startup/README.md for complete documentation.
If your ISR needs to call C functions:
IRQ1_Handler:
LI R10,IRQ_STACK_TOP ; Set up stack for C calling convention
BL @my_c_handler ; Call C function
RTWP ; Return from interruptThe C handler is a normal function:
void my_c_handler(void) {
// Handle interrupt
// Uses normal calling convention (R10=SP, R11=LR)
}What: The TMS9900's BLWP instruction can call a subroutine while atomically switching to a new register set, and RTWP returns while restoring the old workspace.
Use Cases:
- Coroutine-style context switching
- Multiple "threads" with separate register banks
- Fast subroutine calls without register save/restore
Current Status: The hardware mechanism works for interrupts (automatic BLWP). For explicit BLWP-style calls, use inline assembly.
Potential Future Implementation:
- New calling convention attribute:
__attribute__((workspace(0x8300))) - Generate
BLWP @vectorinstead ofBL @function
What: TMS9900 has special instructions for bit-addressable I/O:
LDCR- Load Communication Register (input bits)STCR- Store Communication Register (output bits)SBO- Set Bit to OneSBZ- Set Bit to ZeroTB- Test Bit
Use Cases: Keyboard scanning, cassette I/O, RS-232, speech synthesizer.
Potential Implementation: Intrinsic functions or inline assembly patterns.
What: Software-implemented "extended instructions" via XOP trap.
Use Cases: OS services, debugger breakpoints.
Status: The instruction is defined but not used.
What: Direct manipulation of status register bits.
STST- Store Status (save SR to register)LST- Load Status (unsupported on TMS9900, available on TMS9995)
Use Cases: Saving/restoring interrupt state, checking overflow flag.
Potential Implementation: Intrinsics like __builtin_tms9900_stst().
What: Since TMS9900 registers are memory-mapped, you can access another context's registers directly.
Example: If WP=0x8300, then R0 is at 0x8300, R1 at 0x8302, etc.
Use Cases: Debuggers, context inspection, inter-context communication.
Potential Implementation: Would need careful interaction with register allocator.
cd llvm-project
mkdir build && cd build
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD="TMS9900" \
-DCMAKE_BUILD_TYPE=Release ../llvm
ninja# Compile LLVM IR to TMS9900 assembly
./bin/llc -march=tms9900 test.ll -o test.s
# Convert to xas99 format
python3 ../llvm2xas99.py test.s > test_xas.asm
# Assemble with xas99
xas99.py -R -o test.o test_xas.asmThe tms9900-trace tool provides a standalone TMS9900 CPU simulator for testing compiled code without requiring TI-99/4A ROMs or a full system emulator.
Repository: ~/personal/ti99/tms9900-trace/
Features:
- Flat 64K RAM memory model (no ROM requirements)
- NDJSON trace output with PC, WP, ST, and all registers after each instruction
- Configurable load address, entry point, and workspace pointer
- Interrupt injection for ISR testing (
--irq=LEVEL@STEP) - Infinite loop detection for automatic termination
- Cycle-accurate timing based on TMS9900 Data Manual
Complete Test Workflow:
# 1. Compile LLVM IR to assembly
./bin/llc -march=tms9900 test.ll -o test.s
# 2. Convert to xas99 format
python3 ../llvm2xas99.py test.s > test.asm
# 3. Create a harness with startup code (see startup/startup.asm template)
# Or assemble standalone with absolute addresses:
python3 ~/personal/ti99/xdt99/xas99.py -R test.asm -b -o test.bin
# 4. Run through trace simulator
~/personal/ti99/tms9900-trace/build/tms9900-trace \
-l 0x0000 -e 0x0000 -w 0x8300 -n 1000 test.binExample Output:
{"step":0,"pc":"0000","wp":"8300","st":"0000","clk":0,"op":"LWPI","asm":"LWPI >8300","r":["0000",...]}
{"step":1,"pc":"0004","wp":"8300","st":"0000","clk":44,"op":"LI","asm":"LI R0,>000F","r":["000F",...]}Testing Interrupt Handlers:
# Trigger IRQ level 1 after 500 instructions
./tms9900-trace --irq=1@500 -s 0x0200 test.binThe tests/ directory contains an end-to-end test that validates the complete toolchain:
./tests/run_mwe_test.shTest Coverage (tests/mwe_test.c):
- Array operations with pointer arithmetic
- Dot product computation (exercises MPY instruction)
- Bubble sort (exercises comparisons, swaps, nested loops)
- Global variable access
- Function calls
Expected Results:
result_dot1= 55 (0x0037) - dot product before sortresult_dot2= 86 (0x0056) - dot product after sortsorted_array= {1, 2, 5, 8, 9}- Return value in R0 = 141 (0x008D)
We implemented direct ELF object file emission, eliminating the need for an external assembler (xas99) in many workflows. The compiler can now produce machine code bytes directly.
Created the MCTargetDesc layer with the following components:
| File | Purpose |
|---|---|
TMS9900FixupKinds.h |
Defines relocation types (16-bit absolute, 8-bit PC-relative, 16-bit PC-relative) |
TMS9900MCCodeEmitter.cpp |
Encodes MCInst to binary bytes, big-endian |
TMS9900AsmBackend.cpp |
Handles fixup application, creates ELF writer |
TMS9900ELFObjectWriter.cpp |
Maps fixups to ELF relocation types |
TMS9900MCTargetDesc.cpp |
Registers all MC components |
The TableGen format classes needed variants to handle different operand patterns for the same encoding:
Two-Address Operations (A, S, SOC, SZC, XOR):
- Input:
(outs $rd), (ins $rs1, $rs2)with$rd = $rs1constraint - Need to encode
$rdand$rs2(not$rs1which is tied) - Created:
Format1_Reg_TwoAddr,Format2_Reg_TwoAddr
Compare Operations (C, CI):
- No output operand, just sets flags
- Created:
Format1_Reg_Cmp,Format8_Cmp
Shift Operations (SLA, SRA, SRL, SRC):
- Two-address with count operand
- Created:
Format7_TwoAddr,Format7_R0Count
Implicit Register Operations (MPY, DIV, RET):
- Some operands are hardcoded (R0 for MPY/DIV, R11 for RET)
- Created:
Format2_Reg_R0,Format3_Reg_R11
# Compile C to ELF object file
clang --target=tms9900 -c source.c -o source.o
# Examine the object file
llvm-readelf -a source.o
# Extract raw binary (for ROM/RAM loading)
llvm-objcopy -O binary source.o source.bin
# Or Intel HEX format
llvm-objcopy -O ihex source.o source.hex- Class: ELF32
- Endianness: Big-endian (MSB)
- Machine: None (EM_NONE = 0) - no official ELF machine type for TMS9900
- OS/ABI: Standalone (embedded)
Old Workflow (still supported):
C code → clang -S → assembly → llvm2xas99.py → xas99.py → binary
New Workflow:
C code → clang -c → ELF object → objcopy → raw binary
The new workflow is simpler and doesn't require external Python scripts or xas99.
- No disassembler yet (llvm-objdump can't disassemble TMS9900)
- No linker support (use objcopy for single-file programs, or external linker)
- No debug info emission
The object file generation was verified by:
- Compiling a simple function to .o file
- Examining ELF structure with llvm-readelf
- Extracting .text section with objcopy
- Inspecting machine code bytes with xxd
| File | Purpose |
|---|---|
TMS9900.td |
Main target description, processor features |
TMS9900RegisterInfo.td |
Register definitions (R0-R15, WP, ST) |
TMS9900CallingConv.td |
Calling convention (argument passing, callee-saved) |
TMS9900InstrInfo.td |
Instruction definitions and patterns |
TMS9900InstrFormats.td |
Instruction encoding formats |
TMS9900Schedule.td |
Instruction timing/scheduling model |
| File | Purpose |
|---|---|
TMS9900TargetMachine.cpp |
Target machine setup |
TMS9900Subtarget.cpp |
Subtarget features |
TMS9900ISelLowering.cpp |
DAG lowering (custom operation handling) |
TMS9900ISelDAGToDAG.cpp |
Instruction selection |
TMS9900InstrInfo.cpp |
Instruction info, pseudo expansion |
TMS9900RegisterInfo.cpp |
Register info, frame index elimination |
TMS9900FrameLowering.cpp |
Prologue/epilogue generation |
TMS9900AsmPrinter.cpp |
Assembly output |
TMS9900MCInstLower.cpp |
MC layer interface |
- TMS 9900 Microprocessor Data Manual (May 1976) - Instruction set, timing
- TI-99/4A Technical Data - Memory map, I/O addresses
- xdt99 Cross-Development Tools - xas99 assembler documentation
- LLVM Backend Tutorial - General LLVM backend development
What: Implemented support for xas99-style labels that don't require colons (e.g., START LI R0,>1234 instead of START: LI R0,>1234)
Where: llvm/lib/Target/TMS9900/AsmParser/TMS9900AsmParser.cpp - added isKnownMnemonic(), isKnownDirective(), modified ParseInstruction()
Why: xas99 and traditional TI assembler syntax uses labels without colons. This enables direct assembly of existing TI-99/4A code and interoperability with the xas99 toolchain.
Technical notes:
- Added StringSaver member to properly manage string lifetime when lowercasing mnemonics
- Labels are uppercased for case-insensitive symbol matching (xas99 convention)
- When identifier is not a known mnemonic/directive, emit it as a label then parse rest of line as actual instruction
- Both LLVM-style (with colons) and xas99-style (without) produce identical object files
- Verified with
clang --target=tms9900 -c test.S -o test.o
What: Added .org directive parsing to set the assembly origin address.
Where: llvm/lib/Target/TMS9900/AsmParser/TMS9900AsmParser.cpp - parseDirective(), parseDirectiveOrg()
Why: TI-99/4A cartridges must be placed at specific addresses (0x6000 for ROM). The .org directive (and xas99's AORG) allows setting this without linker scripts.
Technical notes:
- Uses
getStreamer().emitValueToAlignment()for padding - Supports both
.org(LLVM standard) andAORG(xas99 dialect) - Expression evaluation handled by LLVM's expression parser
What: Added .include "filename" directive to include external assembly files.
Where: llvm/lib/Target/TMS9900/AsmParser/TMS9900AsmParser.cpp - parseDirectiveInclude()
Why: Enables modular assembly code structure - wrapper files can include compiled C output.
Technical notes:
- Uses LLVM's SourceMgr infrastructure
- Searches relative to current file's directory
- Proper error handling for missing files
What: Discovered LLD applies relocations in little-endian, but TMS9900 is big-endian. Object files are correct, but linked output has byte-swapped addresses.
Where: LLD's ELF linker - no TMS9900-specific code exists. Affects all R_TMS9900_16 relocations.
Why: LLD's generic relocation handling defaults to little-endian. Without target-specific code, symbol addresses are written with wrong byte order.
Technical notes:
- Object file (cart2.o) shows correct big-endian format:
00 10followed bycartridge_mainplaceholder - Linked ELF shows byte-swapped result: addresses in little-endian
- Currently using EM_NONE (0) as machine type - need EM_TMS9900
- Fix requires: new
lld/ELF/Arch/TMS9900.cppwithrelocate()using big-endian writes - Similar to how MSP430 handles this (also 16-bit)
What: Created implementation plan for proper TMS9900 LLD support.
Where: New files needed in lld/ELF/Arch/ and modifications to lld/ELF/Target.cpp
Why: Enable proper linking of TMS9900 object files with correct big-endian address encoding.
Tasks identified:
- Add EM_TMS9900 to
llvm/include/llvm/BinaryFormat/ELF.h - Create
lld/ELF/Arch/TMS9900.cppwithrelocate()usingwrite16be() - Add
getTMS9900TargetInfo()declaration tolld/ELF/Target.h - Add EM_TMS9900 case to
lld/ELF/Target.cpp - Update
lld/ELF/Arch/CMakeLists.txt - Update
TMS9900ELFObjectWriter.cppto useELF::EM_TMS9900
What: Implemented TMS9900 target support in LLD with proper big-endian relocations.
Where:
lld/ELF/Arch/TMS9900.cpp(new file)lld/ELF/Target.h,lld/ELF/Target.cpp,lld/ELF/CMakeLists.txtllvm/include/llvm/BinaryFormat/ELF.h(added EM_TMS9900 = 0x99)llvm/lib/Target/TMS9900/MCTargetDesc/TMS9900ELFObjectWriter.cpp
Why: Complete the native LLVM toolchain - can now compile, assemble, and link TMS9900 programs without external tools.
Example workflow:
clang --target=tms9900 -c startup.S -o startup.o
clang --target=tms9900 -c main.c -o main.o
ld.lld -T linker.ld startup.o main.o -o program.elf
llvm-objcopy -O binary program.elf program.binTechnical notes:
- Machine type EM_TMS9900 = 0x99 (153) - fits nicely with TI processor family (near EM_TI_C6000 etc.)
- Relocations use
write16be()for proper big-endian address encoding - Supports R_TMS9900_16 and R_TMS9900_PCREL_8/16 relocation types
What: Fixed instruction encoding for Format8 instructions (LI, AI, ANDI, ORI, CI) and JMP fixup application.
Where:
TMS9900InstrFormats.td- Format8/Format8_Cmp class definitionsTMS9900InstrInfo.td- Instruction definitions using Format8TMS9900AsmBackend.cpp- applyFixup for pcrel_8
Why: Instructions were encoding with wrong byte/bit positions:
- LI R12,0 produced 0x200C instead of 0x02C0 (opcode and register swapped)
- JMP loop produced 0xFF00 instead of 0x10FF (displacement in wrong byte)
Technical notes:
- Format8 register is in bits 7-4 (not 3-0), subop in bits 3-0
- Changed Format8 parameter from 8-bit opcode to 4-bit subop
- JMP fixup must write displacement to byte 1 (low byte in big-endian), not byte 0
What: Implemented disassembler enabling llvm-objdump -d support for TMS9900 ELF files.
Where:
llvm/include/llvm/Object/ELFObjectFile.h- Added EM_TMS9900 to ELF triple mappingllvm/lib/Target/TMS9900/Disassembler/TMS9900Disassembler.cpp(new file)llvm/lib/Target/TMS9900/Disassembler/CMakeLists.txt(new file)llvm/lib/Target/TMS9900/CMakeLists.txt- Added Disassembler subdirectory
Why: Complete the toolchain - allows inspecting compiled code, debugging, and round-trip verification (compile → disassemble → verify).
Technical notes:
- Instruction decoding by format: Format8 (LI/AI/ANDI/ORI/CI), Format9 (LWPI/LIMI), Format6 (jumps), Format7 (shifts), Format3 (single operand), Format1 (dual operand)
- Key challenge: TableGen-generated instruction names have suffixes (e.g.,
ABSr,MOVrr,SRAri) not just base names - Special-cased BL @symbol - our encoding puts Ts=0 with target in second word, not standard Format3 addressing
- MOVB partially supported (register-to-indirect mode); operand order may be swapped in print output
- BLWP and X instructions not implemented in current TableGen, so not disassembled
- Successfully decodes cart_example: LWPI, LI, BL, JMP, CLR, MOVB all working
What: Found and fixed critical bug where register and sub-opcode fields were swapped in Format8 instructions. LI R10, 0x83FE was generating 0x02A0 (which is STWP R0) instead of 0x020A.
Where: TMS9900InstrFormats.td lines 552-556 (Format8 class) and lines 573-576 (Format8_Cmp class)
Why: This broke stack pointer initialization (LI R10, 0x83FE) and consequently all function calls. The TMS9900 Format VIII encoding is 0000 0010 ssss rrrr where ssss=sub-opcode, rrrr=register. We had them swapped.
Technical notes:
- Before:
Inst{7-4} = rd; Inst{3-0} = subop;→ LI R10 = 0x02A0 (STWP R0!) - After:
Inst{7-4} = subop; Inst{3-0} = rd;→ LI R10 = 0x020A (correct) - Verified with tms9900-trace: function calls now work correctly at -O0
- Audited ALL instruction formats against
tms9900_reference.txt- no other encoding bugs found - Affected instructions: LI (subop=0), AI (subop=2), ANDI (subop=4), ORI (subop=6), CI (subop=8)
What: Fixed multiple crashes in the TMS9900 disassembler caused by incorrect operand counts for instructions with tied constraints.
Where: TMS9900Disassembler.cpp - Format8, Format3, Format7, and Format1 auto-increment handling
Why: Instructions with tied operands (e.g., $rd = $rs constraints) require the register to be added multiple times to the MCInst. The printer expects operands at specific indices based on the instruction definition.
Technical notes:
- Format8 (AI/ANDI/ORI): Have
(outs $rd), (ins $rs, $imm)with$rd = $rs- need 3 operands (reg, reg, imm), was only adding 2 - Format3 (INC/DEC/NEG/etc): Have
(outs $rd), (ins $rs)with$rd = $rs- need 2 operands (reg, reg), was only adding 1 - Format3 opcode mapping: Was completely wrong! INC/INCT/DEC/DECT opcodes were mixed up with ABS/SWPB/etc. Fixed to match TMS9900InstrInfo.td definitions
- Format7 (shifts): Have
(outs $rd), (ins $rs, $cnt)with$rd = $rs- need 3 operands (reg, reg, imm), was only adding 2 - Format1 auto-increment: MOVpim has
(outs $rd, $rs_wb), (ins $rs)- need 3 operands, was only adding 2
What: Fixed critical bug where register spills to the stack would crash with "Not supported instr: MCInst 275". Any C code with nested function calls or register pressure would fail.
Where: TMS9900MCInstLower.cpp - added expansion of MOV_FI_Store and MOV_FI_Load pseudo instructions
Why: The MOV_FI_Store and MOV_FI_Load pseudo instructions (used by storeRegToStackSlot/loadRegFromStackSlot) were being passed directly to MCCodeEmitter, which can't encode pseudos. They need to be expanded to real MOVmx/MOVxm instructions first.
Technical notes:
MOV_FI_Store: (ins base, offset, rs) →MOVmx: (ins offset, ri, rs) - operand reordering requiredMOV_FI_Load: (outs rd), (ins base, offset) →MOVxm: (outs rd), (ins offset, ri) - operand reordering required- MCInst 275 = MOV_FI_Store opcode number
- Now cart_example with
vdp_write_at()helper function compiles correctly - This was blocking any real C development - functions that call other functions need to save registers
What: Fixed critical bug where Format1_SymLoad, Format1_SymStore, Format1_IdxLoad, and Format1_IdxStore instruction classes weren't emitting the second instruction word (address/offset). All global variable accesses produced addresses of 0x0000.
Where: TMS9900InstrFormats.td lines 300-436 - all four Format1_*Load and Format1_*Store classes with memory addressing
Why: These instruction classes had bits<16> Inst but the instructions are 4 bytes (2 words). The address/offset operand was defined but never assigned to instruction bits. The second word was just zeros - no relocations were generated.
Technical notes:
- Changed
bits<16> Insttobits<32> Instin all four classes - Added
let Inst{31-16} = addr;(or= offset;) to emit the second word BL_symalready had this correct:bits<32> InstwithInst{31-16} = target- ball.c bouncing ball demo now has 36 relocations instead of 1
- Global variables at addresses 0x2000+ now appear correctly in linked binary
- MOVma/MOVam/MOVmx/MOVxm all affected (word + byte variants)
What: Disassembler was crashing when encountering Format 1 instructions with symbolic or indexed addressing modes (Ts=2 or Td=2). Added proper handling for these cases.
Where: TMS9900Disassembler.cpp lines 523-610 - added four new decoding paths
Why: Format 1 instructions with memory addressing (MOV @addr,Rd, MOV Rs,@addr, MOV @offset(Rs),Rd, MOV Rs,@offset(Rd)) need to read the second word for address/offset and decode accordingly.
Technical notes:
- Symbolic source (Ts=2, S=0, Td=0): MOV @addr,Rd → MOVam
- Symbolic dest (Td=2, D=0, Ts=0): MOV Rs,@addr → MOVma
- Indexed source (Ts=2, S!=0, Td=0): MOV @offset(Rs),Rd → MOVxm
- Indexed dest (Td=2, D!=0, Ts=0): MOV Rs,@offset(Rd) → MOVmx
- Must distinguish symbolic (S=0/D=0) from indexed (S!=0/D!=0) addressing
What: Disassembler crashed on A/S/SOC/SZC register-to-register instructions due to tied operand handling. These instructions have $rd = $rs1 constraint requiring the destination register to appear twice in the MCInst.
Where: TMS9900Disassembler.cpp lines 427-465 - register-to-register Format 1 decoding
Why: Instructions like A R2,R1 have pattern (outs $rd), (ins $rs1, $rs2) with $rd = $rs1 tied constraint. MCInst needs 3 register operands but only 2 were being added. Compare instructions (C/CB) have no output and don't need tied handling.
Technical notes:
- Added
needsTiedOperandflag for A/S/SOC/SZC (opcodes 0x4,0x6,0xA,0xE and byte variants) - Added
isCompareflag for C/CB (opcodes 0x8,0x9) which have no output register - MOV doesn't need tied operand - it's a simple copy not read-modify-write
- ball.elf and cart.elf now disassemble completely without crashes
What: Implemented automatic branch relaxation for JMP instructions that exceed the 8-bit signed displacement range (~256 bytes). Out-of-range JMPs are converted to B @addr (4-byte absolute branch).
Where:
TMS9900InstrInfo.td- addedB_syminstruction andbrtarget16operand typeTMS9900MCCodeEmitter.cpp- addedgetBranchTarget16Encoding()methodTMS9900AsmBackend.cpp- implementedfixupNeedsRelaxation(),fixupNeedsRelaxationAdvanced(),relaxInstruction(), and fixedmayNeedRelaxation()to only return true for JMP
Why: ball2.c (bouncing ball demo with hit counters) was failing with "fixup value out of range" because the main() function exceeded the ~256 byte conditional branch limit. TMS9900 Format 6 jumps have 8-bit signed displacement.
Technical notes:
B @addrencoding: opcode 0x0440 with Ts=10 (symbolic), S=0, followed by 16-bit addressB_symis a 4-byte instruction (vs 2-byte JMP)- Only JMP can be relaxed; conditional branches (JEQ/JNE/JGT/JL/etc.) cannot be relaxed in LLVM's MC layer because it would require emitting multiple instructions (inverted branch + B @addr)
- Code generator emits patterns like
JL skip; JMP targetso relaxing JMP handles most cases - Currently ALL JMPs are relaxed (fixupNeedsRelaxationAdvanced returns true for unresolved fixups) - could be optimized to only relax when actually out of range
What: Created libtms9900/ runtime library providing compiler builtins and math functions. Includes custom compact sinf/cosf implementation (1.2KB vs 7KB+ standard picolibc).
Where:
libtms9900/builtins/- 32-bit integer ops (mul32.asm, div32.asm, shift32.asm) and soft-float from compiler-rtlibtms9900/libm/- Math library with sincosf_tiny.c and picolibc sourceslibtms9900/picolibc/- Upstream picolibc source (TODO: convert to submodule)
Why: Needed soft-float and libm support for floating-point C code. Picolibc chosen over newlib for smaller size. Custom sin/cos written because standard picolibc's Payne-Hanek range reduction pulls in 8KB+ of code.
Technical notes:
- Float32 only - no 64-bit double support to minimize code size
- Compact sinf/cosf uses Cody-Waite range reduction with extended-precision π/2 constants
- Tradeoff: Full precision for |x| < 10^4, reduced precision (~4 digits) for |x| > 10^6
- libm.a sizes: -O1=21,522B, -Os=20,980B, -O2=21,546B, -O3=25,816B (sqrtf explodes to 4.4KB at -O3)
- powf is 5.1KB (24% of library) with 126 soft-float calls - TODO: optimize for small integer exponents
- Documented full symbol table with sizes in libtms9900/README.md
What: Added TMS9900 system and I/O instructions: BLWP (branch and load workspace pointer), XOP (extended operation), and CRU bit I/O (LDCR, STCR, SBO, SBZ, TB).
Where:
TMS9900InstrFormats.td- new Format9/Format12 instruction classesTMS9900InstrInfo.td- instruction definitionsTMS9900Disassembler.cpp- decoding logicTMS9900AsmParser.cpp- parsing support
Why: These instructions are essential for TI-99/4A system programming - BLWP for context switches, XOP for system calls, CRU for peripheral I/O (keyboard, cassette, etc).
What: Added clock control and interrupt instructions: CKOF (clock off), CKON (clock on), LREX (load or restart execution).
Where: TMS9900InstrInfo.td, TMS9900Disassembler.cpp
Why: Complete the TMS9900 instruction set for system-level programming.
What: Introduced CMPBR pseudo-instruction to keep compare and branch adjacent. Select CMPBR when BR_CC is glued to CMP, expand post-RA to C/CI + Jcc.
Where:
TMS9900ISelDAGToDAG.cpp- pattern matching for CMPBR selectionTMS9900InstrInfo.cpp- post-RA expansionTMS9900InstrInfo.td- CMPBR pseudo definition
Why: TMS9900 conditional branches test the status register set by the previous compare. If other flag-setting instructions interleave between compare and branch, the condition is corrupted.
Technical notes:
- CMPBR bundles compare opcode + condition + operands into single pseudo
- Expanded after register allocation when no more instructions can be inserted
- Handles both register and immediate comparisons
What: Marked memory MOV/MOVB instructions as defining the status register (ST). Changed BRCOND lowering to go via CMP+BR_CC to preserve flag dependencies.
Where: TMS9900InstrInfo.td (Defs = [ST] on MOV patterns), TMS9900ISelLowering.cpp
Why: TMS9900's MOV instruction sets status flags (compare result with 0). If this isn't modeled, the scheduler might move a MOV between a compare and its branch, corrupting the condition.
What: Added NOP as alias for JMP $+2 (jump to next instruction). Prevent relaxation of JMP with zero offset so inline asm NOP stays 2 bytes.
Where: TMS9900InstrInfo.td, TMS9900AsmBackend.cpp
Why: Convenience for assembly programmers and inline asm. TMS9900 has no hardware NOP - JMP $+2 is the standard idiom.
What: Fixed register allocation crash when truncating i16 to i8 with frame index operands. Tied operand constraints weren't properly handled.
Where: TMS9900InstrInfo.td - i8 store patterns with frame index
Why: Code like char x = (char)value; stack_var = x; was crashing during register allocation.
What: Fixed bug where materializing frame index addresses would clobber live registers. Added proper scratch register handling in frame lowering.
Where: TMS9900RegisterInfo.cpp, TMS9900FrameLowering.cpp
Why: Functions with multiple stack variables were getting corrupted values because the frame index materialization code was reusing registers without checking liveness.
What: Major rework of comparison and branch lowering. Use -1 (all ones) for boolean true, custom SETCC for i16/i32, restrict indexed addressing base registers to avoid R0.
Where:
TMS9900ISelLowering.cpp- custom SETCC lowering, boolean representationTMS9900InstrInfo.td- IdxRegs register class excluding R0
Why: Multiple issues: (1) compare+branch weren't staying adjacent, (2) i32 BR_CC wasn't lowering correctly, (3) R0 used as index register encodes as symbolic addressing (0 means "no register").
Technical notes:
- -1 booleans allow AND/OR to work correctly with condition results
- i32 BR_CC now lowers via SETCC+BRCOND chain
- Frame index pseudos also restricted to IdxRegs
What: Added LEAfi pseudo-instruction to compute frame index addresses (base + offset) into a register.
Where: TMS9900InstrInfo.td, TMS9900ISelLowering.cpp
Why: Needed for passing addresses of stack variables to functions (e.g., scanf(&x)).
What: Taught analyzeBranch/insertBranch/removeBranch to handle CMPBR pseudo instructions alongside regular branches.
Where: TMS9900InstrInfo.cpp - branch analysis methods
Why: The branch optimization passes need to understand CMPBR to correctly analyze control flow and avoid breaking the compare+branch bundles.
What: Fixed incorrect control flow when a branch's fallthrough target couldn't be inverted. Was generating incorrect code for certain if/else patterns.
Where: TMS9900InstrInfo.cpp - branch inversion logic
Why: Some branch patterns have asymmetric invertibility - e.g., JL (jump if less) inverts to JGE, but complex compound conditions might not have a simple inverse.
What: Fixed select (ternary operator) lowering to properly glue compare and conditional move operations.
Where: TMS9900ISelLowering.cpp - SELECT_CC lowering
Why: Code like x = (a < b) ? c : d was generating incorrect results because the compare flags were being clobbered before the conditional move.
What: Fixed crash when call-frame setup/destroy pseudos appeared with reserved call frame (no SP adjustment needed). Skip these pseudos during expansion.
Where: TMS9900FrameLowering.cpp
Why: Certain calling patterns with small/no stack frames were triggering assertion failures during pseudo expansion.
What: Major refactor of disassembler from hand-written decoding to TableGen-generated tables. Deleted ~974 lines of manual decoder code.
Where:
TMS9900Disassembler.cpp- now uses generated tables with custom decoders for branches/CRU/shiftsTMS9900InstrInfo.td- added DecoderMethod annotationsCMakeLists.txt- enabled gen-disassembler
Why: Hand-written decoder was error-prone and hard to maintain. TableGen tables are auto-generated from instruction definitions, ensuring consistency.
Technical notes:
- Custom decoders still needed for: branch targets (PC-relative), CRU bit offsets, R0-count shifts
- R0-count shifts hidden from asm/disasm (use SLA R1,0 not SLA R1,R0)
What: Added disassembly support for format1/format2 memory ops, format3 (memory operands), byte ops (MOVB, AB, etc.), COC/CZC, and 48-bit format1 mem-to-mem instructions.
Where: TMS9900Disassembler.cpp, TMS9900InstrInfo.td
Why: Complete disassembler coverage for all TMS9900 instruction formats.
What: New TMS9900LongBranch pass expands conditional branches that exceed the signed 8-bit displacement range. Converts JLT target (out of range) to JGE skip; B @target; skip:.
Where:
TMS9900LongBranch.cpp- new MachineFunctionPassTMS9900TargetMachine.cpp- pass registration
Why: TMS9900 conditional branches (JEQ/JNE/JGT/JLT/etc.) have only 8-bit signed displacement (~256 bytes). Large functions need branch expansion.
Technical notes:
- Different from MC-layer relaxation which only handles JMP
- Conditional branches require two-instruction sequence (inverted condition + absolute branch)
- Pass runs late, after branch folding
What: New TMS9900Peephole pass for target-specific optimizations:
- Fold
MOV @x,Ry; INC RyintoMOV *Rx+,Ry(auto-increment) - Optimize
AI Rx,0(remove),AI Rx,1→INC,AI Rx,2→INCT - Optimize
LI Rx,0→CLR Rx,LI Rx,-1→SETO Rx - Fold
LI Rx,val; XOR Rx,RyintoXOR @lit,Ry
Where:
TMS9900Peephole.cpp- new MachineFunctionPass- Added
-tms9900-disable-peepholeflag for debugging
Why: These patterns are common in compiled code but LLVM's generic optimizers don't know about TMS9900-specific instructions like INC/INCT/CLR/SETO.
What: Added 8-bit fixup support for CRU single-bit instructions (SBO, SBZ, TB). Allows symbolic bit offsets that the linker resolves.
Where: TMS9900AsmBackend.cpp, TMS9900MCCodeEmitter.cpp, TMS9900AsmParser.cpp
Why: CRU programming often uses named bit offsets (e.g., CRU_LED EQU 5; SBO CRU_LED). Previously only immediate values worked.
What: Generate inline code for 32-bit shifts by constant amounts instead of always calling runtime library.
Where: TMS9900ISelLowering.cpp
Why: x << 1 was calling __ashlsi3 even though inline code is smaller and faster for small constants.
Technical notes:
- Shift by 16 uses word swap
- Small shifts (1-3) expand to repeated operations
- Large/variable shifts still use libcall
What: Fixed MC-layer branch relaxation to not relax JMPs with unresolved symbols (external references).
Where: TMS9900AsmBackend.cpp
Why: Was incorrectly relaxing all JMPs including those to external symbols, causing link errors.
What: Added basic scheduling model annotations (instruction latencies) to TMS9900 instruction definitions.
Where: TMS9900InstrInfo.td, TMS9900Schedule.td
Why: Enables LLVM's scheduler to make better decisions about instruction ordering. Foundation for future cycle-count optimization.
What: Fixed issues with SELECT16 pseudo-instruction: wasn't preserving SSA form correctly, and CMPBR pseudos weren't being scheduled properly with their compare inputs.
Where: TMS9900ISelLowering.cpp, TMS9900InstrInfo.cpp
Why: Ternary operator code (a ? b : c) was generating incorrect results in some cases.
What: Fixed compiler warnings, cleaned up test cases for conditional branch relaxation.
Where: Various files
Why: Code hygiene.
What: Enabled DWARF5 debug information emission for the TMS9900 backend. The compiler now produces valid .debug_info, .debug_line, .debug_frame, .debug_addr, and .debug_str sections when compiling with -g.
Where:
MCTargetDesc/TMS9900MCTargetDesc.cpp: SetSupportsDebugInformation = true,UsesCFIWithoutEH = true,DwarfRegNumForCFI = true. Added initial CFA frame state (DW_CFA_def_cfa R10, 0) increateTMS9900MCAsmInfo(). ChangedshouldOmitSectionDirective()to defer to base class so debug sections get proper directives.TMS9900FrameLowering.cpp: Added CFI directives inemitPrologue()andemitEpilogue()—cfiDefCfaOffsetafter stack adjustments,createOffsetfor R11 (return address) save.llvm/include/llvm/BinaryFormat/ELFRelocs/TMS9900.def: New file defining R_TMS9900_NONE/16/PCREL_8/PCREL_16/8 relocation types (previously only in a local enum in the ELF object writer).llvm/include/llvm/BinaryFormat/ELF.h: Added#include "ELFRelocs/TMS9900.def"enum block.llvm/lib/Object/ELF.cpp: Added EM_TMS9900 case for relocation name printing.llvm/lib/Object/RelocationResolver.cpp: AddedsupportsTMS9900()/resolveTMS9900()and switch case, sollvm-dwarfdumpcan resolve relocations in.ofiles.TMS9900ELFObjectWriter.cpp: Removed local relocation enum, now usesELF::R_TMS9900_*from shared header.
Why: Enables source-level debugging with GDB. The DWARF info maps addresses to source lines, names functions/variables, and provides frame unwinding data for backtraces.
Technical notes:
- DWARF register numbering was already correct in
TMS9900RegisterInfo.td(R0-R15 = 0-15, PC=16, WP=17, ST=18). - The key missing piece was the initial CFA rule in the CIE — without
DW_CFA_def_cfa R10, 0, allcfiDefCfaOffsetinstructions in FDEs failed with "CFA rule was not RegPlusOffset". intis correctly reported asDW_ATE_signedwithbyte_size = 0x02(16-bit).- The relocation resolver was critical for
llvm-dwarfdumpto work on.ofiles — without it, all string references showed as(). - This serves as a prototype for adding DWARF to other vintage CPU backends (i8085, i8086, etc.).
What: Created a complete LLDB ABI plugin so LLDB can debug TMS9900 targets natively (no MSP430 pretense). This includes the ABI plugin, ArchSpec registration, GDB remote register fallback, and trap opcodes.
Where:
- New:
lldb/source/Plugins/ABI/TMS9900/(ABISysV_tms9900.h, .cpp, CMakeLists.txt) - Modified:
lldb/source/Plugins/ABI/CMakeLists.txt,lldb/include/lldb/Utility/ArchSpec.h,lldb/source/Utility/ArchSpec.cpp,lldb/source/Plugins/Process/gdb-remote/GDBRemoteRegisterFallback.cpp,lldb/source/Host/common/NativeProcessProtocol.cpp,lldb/source/Target/Platform.cpp
Why: GDB requires pretending to be MSP430 (stock GDB has no TMS9900 architecture). LLDB gets its arch support from LLVM, so with the ABI plugin it can use the real tms9900 triple, real DWARF register numbers, and get native disassembly from the LLVM backend.
Technical notes:
- Key difference from MSP430: TMS9900 is a link-register architecture (BL puts return address in R11, not on stack). Function entry unwind plan sets PC=R11 (register, not memory). Default unwind (post-prologue) uses CFA=SP+2, return address at [CFA-2].
- 19 registers exposed: R0-R15 (DWARF 0-15), PC (16), WP (17), ST (18). R10=SP, R11=LR, ST=flags.
- Big-endian (unlike MSP430 which is little-endian).
- Callee-saved: R13-R15 (plus R10/R11 managed by unwind).
- Trap opcode: 0x0000 (undefined instruction). Emulator handles breakpoints via GDB stub, so this is rarely exercised.
- Build deferred: adding
lldbtoLLVM_ENABLE_PROJECTSrequires ~1-2GB extra build space; disk was at 3.2GB free. - GDB stub (tms9900-trace) will need update to send TMS9900 target description XML when LLDB connects.
What: Fixed three bugs to achieve a perfect 45/45 benchmark pass rate (9 benchmarks × 5 opt levels: O0, O1, O2, Os, Oz). Previously had failures in q7_8_matmul (O0), float_torture (O0), and json_parse (O1/Oz).
Where:
libtms9900/builtins/mul32.S— MPY instruction workaroundtests/fp32_builtins.c— removed duplicate 32-bit builtinstests/benchmarks/json_parse.c— removedoptnoneattributestests/benchmarks/q7_8_matmul.c— simplified back to clean>> 8version
Why: Three independent bugs were causing failures at specific opt levels:
-
MPY assembler encoding bug (q7_8_matmul O0): The TMS9900 LLVM assembler always encodes the MPY destination register as R0 regardless of the source.
MPY R3,R4encodes as0x3803(R0) instead of0x3843(R4). Workaround: restructured__mulsi3to always use R0 as MPY destination, then MOV results to R4:R5. The underlying assembler bug in the MPY type-9 instruction encoder remains unfixed. -
Calling convention mismatch (float_torture O0):
fp32_builtins.cdefined C implementations of__ashlsi3(int32_t a, int32_t b)etc. These read the shift count from R3 (low word of the 32-bit R2:R3 pair). But the compiler passes the count as a 16-bit value in R2 alone. R3 contained garbage, causing random shift amounts. Fix: removed all duplicate 32-bit builtins from fp32_builtins.c — the hand-coded assembly versions in libbuiltins.a use the correct R2 convention. -
optnone attribute mismatch (json_parse O1/Oz): Leftover
__attribute__((noinline, optnone))from i8085 port prevented optimization of helper functions whilemainwas optimized, triggering register spill bugs. Fix: changed to__attribute__((noinline))only.
Technical notes:
- The MPY encoding bug affects any
.Sfile usingMPY Rx,Rnwhere Rn≠R0. The destination register field (bits 7-4 of the type-9 instruction) is always zeroed. DIV likely has the same bug but our div32.S uses shift-and-subtract instead. - The calling convention issue was subtle:
int32_t __ashlsi3(int32_t, int32_t)signature makes the compiler-generated implementation receive the count in R2:R3, but the compiler's callers only set R2. This only manifested when fp32_builtins.o was linked before libbuiltins.a, which only happens for float_torture. - Benchmark results now show O2 generating the most efficient code (e.g., fib: 70 steps at O2 vs 416 at O0), with Os/Oz close behind.
What: Fixed the root cause of the MPY destination register encoding bug in the TMS9900 LLVM backend. DIV had the same bug and was fixed simultaneously.
Where: llvm-project/llvm/lib/Target/TMS9900/TMS9900InstrInfo.td (MPY/DIV instruction definitions), llvm-project/llvm/lib/Target/TMS9900/TMS9900ISelLowering.cpp (pseudo instruction expanders for MUL16, UDIV16, UREM16, SDIV16, SREM16)
Why: The MPY/DIV instructions used Format2_*_R0 TableGen format classes which hardcode the destination register field (bits 9-6) to 0000 (R0). The assembly strings also had R0 hardcoded. This meant MPY R3,R4 encoded as 0x3803 (dest R0) instead of 0x3903 (dest R4).
Technical notes:
- All 10 MPY/DIV definitions (5 addressing modes each) were changed from
Format2_*_R0→ standardFormat2_*classes with explicitGR16:$rdoperand - Pseudo expanders updated to pass explicit
TMS9900::R0destination register operand via.addReg(TMS9900::R0) - Removed workaround comment from
libtms9900/builtins/mul32.S(R0-only MPY workaround still works but is no longer required) - All 45/45 benchmarks still pass across O0/O1/O2/Os/Oz after the fix
What: Evaluated three potential codegen optimizations. Auto-increment addressing was already working. Hardware DIV fast path added to div32.S. CTLZ left as Expand (already optimal).
Where: libtms9900/builtins/div32.S (hardware DIV fast path), TMS9900ISelLowering.cpp (auto-increment and CTLZ verified)
Why: Post-correctness optimization pass. Sought measurable cycle count improvements.
Technical notes:
- Auto-increment (
*R+): Fully working.copy_words()generatesMOV *R1+,R3/MOV R3,*R0+. Infrastructure:POST_INClegal,getPostIndexedAddressParts(), peepholetryFoldPostInc, full.tdpatterns. TMS9900 only supports auto-increment on source operand, so temp register intermediary is correct. - Hardware DIV fast path: Added 21 lines at entry of
UDIV32. When divisor is 16-bit, divisor nonzero, and dividend_hi < divisor, uses singleDIV R3,R0instruction instead of 32-iteration software loop. Saves ~500-800 cycles per qualifying division. Verified with 12-test suite covering signed/unsigned div/mod edge cases. - CTLZ: LLVM's Expand generates 19 straight-line instructions (spread-bits-down + popcount). A loop would average ~40 executed instructions (5/iter × 8 avg). Since TMS9900 has no branch prediction, branchless expansion is faster.
- Pre-existing bug noted: Software division path returns incorrect results for some dividend/divisor combinations where dividend_hi >= divisor (e.g., 0x30000/3). Not introduced by fast path changes.
- Benchmark results: All 9/9 pass, identical cycle counts (none exercise 32-bit division).
What: Created life3 variant with hand-coded TMS9900 assembly for the CSA inner loop. C wrapper (life3.c) delegates per-row processing to life_compute_row() in life3_step.S. Result: 2,572,436 cycles/step — 8.1% faster than life2_2x (2,800,354) and 33.6% faster than original life2x (3,876,196).
Where: cart_example/life3.c (C wrapper, life_next_word replaced with extern call), cart_example/life3_step.S (hand-coded assembly), cart_example/Makefile (added life3 build targets)
Why: Compiler-generated code had 8 stack spills per word in the inner loop (224 cycles overhead). Hand-coded assembly reduces this to 1 spill (mC, 56 cycles) by using all 10 available computation registers (R4-R9, R12-R15) and keeping row pointers (R0-R2) in registers throughout.
Technical notes:
- CSA tree factored into
.Lcsa_lifesubroutine called via BL from word 0, loop body (words 1-6), and word 7. Saves ~200 bytes of code duplication at ~54 cycles/call overhead. - Register allocation: R0-R2 = row pointers, R3 = loop counter/row_dst on stack, R4-R9/R12-R15 = CSA computation. Only mC needs a push/pop per word.
- Full adder pattern:
XOR+XORfor sum,INV+SZC+INV+SZC+SOCfor carry (majority function via AND-NOT). - Life rule uses 5 instructions:
SOC(OR mC into bit0), twoSZC(AND-NOT for ~bit2/~bit3),INV+SZC(AND with bit0|mC). - Words 1-6 use auto-increment addressing (
*R4+) for loading 3 consecutive words per row, saving address computation. - ROM: 6472B (79.00%) vs life2_2x 6048B (73.83%) — 424B larger due to expanded assembly.
What: Fixed critical bug where @offset(R0) instructions in word 0 and word 7 code assembled as absolute/symbolic addressing instead of indexed, causing garbage neighbor data and cells accumulating in a column on the right side of the screen.
Where: cart_example/life3_step.S — all @offset(R0) instructions in word 0 (lines 44-46) and word 7 (lines 194-196) sections.
Why: TMS9900 cannot use R0 as an index register. When encoding indexed mode with R0 (Ts=10, S=0000), the CPU interprets it as symbolic (absolute) addressing. So MOV @14(R0), R5 assembled as MOV @0x000E, R5 — reading from absolute address 0x000E (cart header memory) instead of row_prev + 14.
Technical notes:
- Fix: moved row_prev from R0 to R3 (non-zero register supports indexed addressing), used R0 as loop counter instead. Added
MOV R0, R3in prologue after saving R3 (row_dst) to stack. - All
@offset(R0)became@offset(R3),*R0became*R3in word 0/7 code. - Loop body already worked because it copied R0 to R4 first (
MOV R0, R4; A R3, R4), so the swap just changed which register holds what. - Performance unchanged at 2,577,684 cycles/step (8.0% faster than life2_2x). The extra
MOV R0, R3is negligible (22 cycles, once per row). - Confirmed correct by disassembly:
c1 63 00 0e MOV @14(R3),R5(indexed with R3) vs previousc1 60 00 0e MOV @0x000e,R5(absolute). - Added R0 indexing constraint to MEMORY.md as a critical ISA note.
What: Added setLoadExtAction for MVT::i1 → Promote (ZEXT/SEXT/EXT) in the TMS9900 backend. Without this, compiling if (debug_snapshot_pending) at -O3 crashed with "Cannot select: load".
Where: llvm-project/llvm/lib/Target/TMS9900/TMS9900ISelLowering.cpp (constructor, after the existing i8 load ext actions around line 60)
Why: LLVM's optimizer deduced that debug_snapshot_pending (a uint8_t only assigned 0 or 1) could be treated as i1, emitting a zextload i1 node. The backend only had i8 load extension actions (Custom), not i1. Promoting i1 → i8 lets the existing Custom i8 path handle it.
Technical notes: Any uint8_t variable that only stores 0/1 values can trigger this at higher optimization levels. The fix is general — affects all future boolean-like byte variables.
What: Added the missing if (debug_snapshot_pending) { vdp_debug_clean_snapshot_clear(); } to the main loop in life2_2x_opt.c, completing the one-shot 'F' key snapshot feature.
Where: cart_example/life2_2x_opt.c main loop (after vdp_debug_dirty_update(), before #endif)
Why: The fix was already applied to life3.c but had not been ported to life2_2x_opt.c. Without it, pressing 'F' would color tiles purple permanently instead of clearing after one frame.
What: Added 4 new peephole patterns to TMS9900Peephole.cpp: LI -1 → SETO, MOV Rx,Rx self-move deletion, CI Rx,0 elimination when preceding instruction already set flags, and redundant consecutive load elimination.
Where: llvm/lib/Target/TMS9900/TMS9900Peephole.cpp — all 4 patterns in runOnMachineFunction()
Why: Every byte matters in 8K cartridge ROM. These patterns eliminate redundant instructions that survive register allocation and prior optimization passes. Combined savings: life2_2x ROM 6848B → 6724B (124B, 1.8%).
Technical notes: CI Rx,0 elimination is the most impactful (16/35 instances eliminated in life2_2x, 64 bytes saved). Requires: (1) immediately preceding instruction's operand 0 is def of TestReg, (2) that instruction sets ST, (3) next instruction is a conditional branch. Must mark preceding instruction's ST def as not-dead when eliminating CI. Expanded from JEQ/JNE to all 8 conditional branches (JGT/JLT/JH/JHE/JL/JLE also only test EQ/LGT/AGT flags). Redundant load elimination moved early in the loop (before LI→CLR) to catch more cases.
What: Initial CI Rx,0 elimination walked backward to find any instruction defining TestReg. This caused json_parse benchmark to enter an infinite loop.
Where: TMS9900Peephole.cpp, CI elimination section
Why: Multi-def instructions like MOVpim (auto-increment load) have two defs: operand 0 is the loaded value, and a secondary def is the auto-incremented pointer. The ST flags reflect operand 0 (the loaded value), NOT the pointer. If CI tested the pointer register, the backward walk would find MOVpim as a "definer" and incorrectly delete CI, even though flags don't reflect the pointer's value.
Technical notes: Fixed by restricting to immediately preceding instruction only, and requiring operand 0 (primary result, which flags reflect) to be TestReg. This is conservative but safe — 16/35 CI Rx,0 still eliminated. The remaining 19 can't be optimized without cross-basic-block analysis or walking past non-ST-clobbering instructions.
What: Added DAG patterns to fold LI Rt,global / A Rx,Rt / MOV *Rt,Rd into MOV @global(Rx),Rd for word loads, word stores, byte loads, and byte stores accessing global arrays with a register offset.
Where: llvm/lib/Target/TMS9900/TMS9900InstrInfo.td (lines ~1984-2000), new test llvm/test/CodeGen/TMS9900/indexed-global.ll
Why: The TMS9900 indexed addressing mode @addr(Rx) can encode a global symbol + register offset in a single 2-word instruction, replacing a 3-instruction sequence (LI+A+MOV*) that uses 4 words. Saves 2 words + ~8 cycles per occurrence.
Technical notes: Added 4 patterns matching (load/store (add (TMS9900Wrapper tglobaladdr:$addr), GR16:$idx)) to emit MOVxm/MOVmx/MOVBxm/MOVBmx. The IdxRegs constraint on these instructions automatically prevents R0 from being used as the index register (R0 cannot be used for indexed addressing on TMS9900). LLVM's pattern matcher handles add commutativity automatically. A similar pattern already existed for jump tables (tjumptable). Current benchmarks don't exercise this pattern (they use pointer arithmetic, not global array indexing), so no size change in existing code. 33/33 lit tests pass, 9/9 benchmarks pass.
What: Made R12 reservation conditional on a new subtarget feature FeatureReserveCRU (default OFF). R12 is now available as a general-purpose register for programs that don't use CRU I/O instructions (LDCR, STCR, SBO, SBZ, TB).
Where: TMS9900.td (new FeatureReserveCRU), TMS9900Subtarget.h (ReserveCRU member), TMS9900RegisterInfo.cpp (conditional reservation)
Why: R12 was unconditionally reserved as CRU base address, wasting 1 of ~10 allocatable registers. Most programs (benchmarks, Game of Life) never use CRU instructions. Freeing R12 gives the register allocator one more register, reducing spill pressure significantly.
Technical notes: Programs needing CRU can compile with -mattr=+reserve-cru. Impact was dramatic: life2_2x ROM dropped from 6724B to 6088B (-636B, -9.5%). This is because one extra register eliminates many stack spills in tight loops. Verified safe: 45/45 benchmarks pass across O0/O1/O2/Os/Oz, 57/57 lit tests pass. No cart_example programs use CRU instructions directly (keyboard scanning is done via TI console ROM routines, not direct CRU access).
What: Added 24 new lit tests (19 CodeGen .ll, 4 CodeGen .mir, 1 MC .s) bringing total from 33 to 57, all passing.
Where: llvm/test/CodeGen/TMS9900/ (new files: addressing-modes.ll, calling-convention.ll, 32bit-ops.ll, select-ops.ll, global-address.ll, alu-ops.ll, type-conv.ll, stack-frame.ll, auto-increment.ll, const-materialization.ll, callee-saved.ll, shifts-extended.ll, byte-arith.ll, compare-unsigned.ll, control-flow.ll, inline-asm.ll, volatile-ops.ll, mul-div.ll, pointer-arith.ll, peephole-ci-elim.mir, peephole-seto.mir, peephole-ai-fold.mir, peephole-mov-self.mir), llvm/test/MC/TMS9900/inst-negative-imm.s
Why: Previous coverage was 33 tests. Needed comprehensive tests for: all addressing modes, calling convention, 32-bit operations, peephole patterns, inline asm, volatile ops, type conversions, stack frames, control flow patterns, etc.
What: All 9 items from the TMS9900 optimization plan are complete. Summary of results:
Metrics:
- life2_2x ROM: 6848B → 6088B (-760B, -11.1%)
- Lit tests: 33 → 57 (all passing)
- Benchmarks: 45/45 across O0/O1/O2/Os/Oz
Optimizations implemented:
- LI Rx,-1 → SETO Rx (2B savings per instance)
- MOV Rx,Rx self-move deletion (2B + 14 cycles per instance)
- CI Rx,0 elimination (4B + 14 cycles per instance, 16/35 eliminated in life2_2x)
- Redundant consecutive load elimination (4-6B per instance)
- AI 0 deletion when ST dead
- Indexed global address folding (4B + ~22 cycles per instance)
- R12 freed for general allocation (-636B in life2_2x alone)
What: Added two new peephole optimizations based on waste pattern analysis of life2_2x disassembly. Extended self-MOV elimination to handle live-ST cases where preceding instruction already set flags (8/25 eliminated). Added ANDI Rx,0xFF00 elimination before MOVB stores where the low byte doesn't matter (16/24 eliminated).
Where: TMS9900Peephole.cpp (lines ~286-340 for self-MOV, ~417-482 for ANDI), new test peephole-andi-ff00.mir, updated peephole-mov-self.mir
Why: Waste analysis of life2_2x found 25 self-MOVs used as flag tests (50B) and 24 redundant ANDI 0xFF00 after MOVB (96B). Self-MOV flag-test uses same safety constraints as CI Rx,0 elimination. ANDI elimination checks that next use is MOVB (only sends high byte) and Rx is killed.
Technical notes: Self-MOV remaining 17/25 are at BB boundaries (no local preceding instruction). ANDI remaining 8/24 are before inline assembly MOVB instructions the peephole can't match. Cumulative life2_2x ROM: 6848B → 6008B (-840B, -12.3%). 58 lit tests (46 CG + 12 MC), 9/9 benchmarks. Waste analysis also identified untackled targets: trailing INV (~480K cyc/frame), stack spills (~374K cyc/frame), SRL/SLA→SWPB (~45K cyc/frame).
What: Added Tier 2 fallback for CI Rx,0 optimization. When CI Rx,0 cannot be fully eliminated (Tier 1: preceding instruction set flags on same register), it is now replaced with MOV Rx,Rx -- same 14 cycles, same EQ/LGT/AGT flag semantics, but 2 bytes smaller (2B vs 4B). Saves 36 bytes in life2_2x (18 instances).
Where: TMS9900Peephole.cpp (CI block restructured with do/while(false) for Tier 1, new Tier 2 at lines ~434-460), updated peephole-ci-elim.mir (7 test cases), updated peephole.mir (2 CHECK lines)
Why: ~18 CI Rx,0 remained in life2_2x where the preceding instruction clobbers flags (e.g., a MOVB store between the value-producing instruction and the CI). Full elimination is unsafe but the encoding can still be shrunk.
Technical notes: Fixed a latent bug in ST dead-flag propagation: findRegisterDefOperandIdx(ST, isDead=true) on a freshly-built instruction always returns -1 because implicit defs start as non-dead. New code uses isDead=false, Overlap=true to find the ST def regardless of dead status. life2_2x ROM: 6008B -> 5972B (-36B). 59 lit tests (47 CG + 12 MC), 9/9 benchmarks pass.
What: Added two peephole optimizations: (1) SRL/SLA Rx,8 + ANDI → SWPB Rx + ANDI (2B smaller, 26 cycles faster per instance), (2) Crr elimination: delete C Rx,Ry when one operand is provably zero and preceding instruction already set flags. Also added SLA Rx,8 + MOVB → SWPB Rx + MOVB variant for byte stores.
Where: TMS9900Peephole.cpp — new trySWPBFolding() and tryCrrElimination() functions
Why: SWPB folding targets byte-swap patterns common in byte load/store sequences. Crr elimination removes compare instructions that are redundant because the preceding instruction already set the needed flags with a zero operand. life2_2x ROM: 5972B → 5916B (-56B).
What: Added 3 new benchmarks bringing suite from 10 to 13 programs, all passing across O0/O1/O2 (39/39).
Where: tests/benchmarks/huffman.c, tests/benchmarks/long_torture.c, tests/benchmarks/heap4.c, updated Makefile and run_benchmarks.py
Why: Expand test coverage with diverse workloads — Huffman exercises bit manipulation and tree traversal, long_torture validates 30 different 32-bit operations, heap4 tests a FreeRTOS-style dynamic memory allocator with coalescing.
Technical notes:
- huffman: 1266B text at O2, 520K cycles. Encode+decode+verify cycle with frequency table, Huffman tree, and bitstream operations.
- long_torture: 1584B text at O2, 21K cycles. 30 tests: add/sub with carry, multiply, shifts, bitwise, comparisons, chain computations. Uses noinline fold32 (inlined version triggers O2 miscompilation — 0x8000 checksum error).
- heap4: 2042B text at O2, 114K cycles. 2048-byte static heap, first-fit allocation, block splitting, adjacent-block coalescing. 7-phase test workload.
- Discovered SLA Rw,0 with R0=0 shifts by 16 (not no-op) — worked around in huffman code.
What: Fixed compiler bug where variable shifts could produce incorrect results when the shift count is zero. TMS9900 hardware treats R0=0 in SLA Rw,0 (variable shift via R0) as shift-by-16, not shift-by-0.
Where: TMS9900ISelLowering.cpp — SLA_VAR/SRA_VAR/SRL_VAR expansion in EmitInstrWithCustomInserter(). New lit test var-shift-zero.ll.
Why: Variable shift pseudos expanded to MOV cnt,R0; SLA Rw,0. When cnt=0, R0=0, and the 4-bit count encoding treats 0 as 16. This is documented in TMS9900 Data Manual page 26 and confirmed by MAME reference emulator.
Technical notes: Fix adds a JEQ guard: MOV cnt,R0; JEQ done; SLA Rw,0; done: PHI [shifted, shift_bb], [original, start_bb]. Combined three separate cases into one switch. Constant shifts (e.g., SLA R1,3) are unaffected — only variable (count-from-R0) shifts had the bug. 63 lit tests (51 CodeGen + 12 MC), 39/39 benchmarks pass.
What: Investigation of byte operation code generation quality. Current promote-to-i16 architecture is sound but peephole improvements are possible.
Where: Analysis of compiled byte-heavy code patterns.
Why: Identified two priority improvements:
- Priority 1: SRL Rx,8 + CI Rx,N → CI Rx,(N<<8) — fold byte comparison with pre-shifted value. Saves 2 bytes + 22 cycles per instance.
- Priority 2: CB (compare bytes) instruction — compare memory/register bytes directly without SRL/MOVB. Requires ISel pattern or peephole to detect byte-compare opportunities.
Technical notes: ISel restructuring to MSP430-style GR8 subregisters is NOT recommended — TMS9900 MOVB operates on HIGH byte (unlike MSP430's LOW byte), making the register promotion approach correct. The improvements should be peephole-level.
What: Investigated the O2 miscompilation in long_torture (0x8000 checksum error when fold32 is inlined). Bug does not reproduce with current compiler/code.
Where: Analysis of compiled output for various long_torture code shapes.
Why: Root cause was likely previously-fixed bugs: SELECT16 PHI SSA form fix and/or frame index scratch register clobber fix. The bug only appeared with the original volatile _result/_halt halt pattern, not with the halt_ok()/fail_loop() pattern used in the final code.
Technical notes: Found two latent issues during investigation: (1) SWPB incorrectly modeled as Defs=[ST] in InstrInfo.td — real hardware doesn't affect ST. Conservative (safe but may inhibit some peephole opportunities). (2) Post-increment fold in peephole may not check ST liveness. Neither causes current failures.
What: Implemented 64-bit integer runtime functions (__muldi3, __udivmoddi4, __divdi3, __udivdi3, __moddi3, __umoddi3) and i64_torture benchmark. All 14 benchmarks pass at O0/O1/O2.
Where: libtms9900/builtins/i64/ (6 new C files), libtms9900/builtins/Makefile, tests/benchmarks/i64_torture.c
Why: i64 operations (long long) needed runtime support. LLVM legalizes i64 → 2×i32 → 4×i16, generating libcalls for mul/div/mod.
Technical notes: compiler-rt's __udivmoddi4 uses inline i64 subtract/shift expressions that miscompile on TMS9900. Rewrote division loop with explicit 32-bit word operations (hi32/lo32/make64 helpers). Also fixed latent shift32.S bug: bit-by-bit shift loop had wrong word order (must shift high first for left shift, low first for right shift).
What: Fixed bit-by-bit shift loop in __ashlsi3/__lshrsi3/__ashrsi3 that had wrong word shift order, causing carry bits to be lost.
Where: libtms9900/builtins/shift32.S
Why: __ashlsi3 was shifting R1 (low) before R0 (high), so the carry from R1's MSB was lost. __lshrsi3 was shifting R0 (high) before R1 (low), same issue.
Technical notes: Left shift must do SLA R0,1 then SLA R1,1 then propagate carry. Right shift must do SRL R1,1 then SRL/SRA R0,1 then propagate carry. This was latent — only triggered by i64 code paths using 32-bit shifts with counts > hardware shift range.
What: Created comprehensive TMS9900.rst backend documentation covering architecture, ABI, calling convention, toolchain usage, subtarget features, instruction set, known limitations, and runtime library.
Where: llvm-project/llvm/docs/TMS9900.rst
Why: First-class LLVM targets need documentation. This serves as the reference for anyone using or maintaining the backend.
What: Created disasm-comprehensive.s with ~200 instruction/addressing-mode round-trip tests. No disassembler bugs found.
Where: llvm-project/llvm/test/MC/TMS9900/disasm-comprehensive.s (1127 lines)
Why: Verify every instruction encoding/decoding path works correctly. Tests both assembly encoding (CHECK) and disassembly (DISASM) in a single file.
Technical notes: Covers all 5 addressing modes × all Format1 instructions, plus Format2 (MPY/DIV/XOR/etc), shifts, immediates, jumps, CRU, and special instructions. 16/16 MC tests pass.
What: Removed 10 unused includes from 3 backend source files and fixed 1 stale comment.
Where: TMS9900AsmPrinter.cpp (5 includes + comment), TMS9900FrameLowering.cpp (2 includes), TMS9900ISelDAGToDAG.cpp (3 includes)
Why: Reduce unnecessary compilation dependencies and keep code clean for upstream review.
What: All 5 delivery items completed: (1) lit.local.cfg verified, (2) code quality audit, (3) disassembler completeness, (4) backend documentation, (5) i64 runtime library.
Where: Across submodule and outer repo. Commits: df0576681 (submodule), 31a2636 (outer repo).
Why: This concludes the final polish phase. Backend is now functionally complete with 98 lit tests (82 CodeGen + 16 MC), 14 benchmarks (42/42 across O0/O1/O2), comprehensive documentation, and full integer support through 64-bit.
What: CoreMark ported to TMS9900 freestanding environment. Passes CRC validation at -O0, -O1, -O2, -O3. Fails at -Os and -Oz due to variable-count SRL instruction when shift amount is 0 (TMS9900 treats R0=0 as shift-by-16).
Where: Bug manifests in core_main.c expression (1 << (ee_u32)i) & results[0].execs. Root cause is in backend shift lowering (TMS9900ISelLowering.cpp / TMS9900InstrInfo.td). Minimal reproducer: tests/stress/coremark/Os_repro.c. Full bug report: tests/stress/BUGS.md (Bug 7).
Why: The TMS9900 SRL/SLA/SRA instructions with count field = 0 read the count from R0 bits 12-15. When R0[12:15] = 0, hardware shifts by 16 (not 0). The compiler generates SRL R6,0 for variable shifts without guarding against count=0. At -O2/-O3, the loop is unrolled and constant-folded, avoiding the variable shift. At -Os/-Oz, the loop is kept, exposing the bug.
Technical notes: The corruption cascade in full CoreMark: shift bug -> num_algorithms=2 instead of 3 -> size=1000 instead of 666 -> memblock[1]=0x0000 -> core_list_init writes linked list data to address 0x0000 -> .text section corrupted -> CPU executes garbage -> infinite loop.
What: Fixed the SLA_VAR/SRA_VAR/SRL_VAR shift-by-zero miscompilation that caused CoreMark -Os/-Oz to fail. The zero-guard now uses CMPBRri (a terminator pseudo) instead of MOV+JEQ, preventing PHI elimination from breaking the flag chain.
Where: TMS9900ISelLowering.cpp (EmitInstrWithCustomInserter, SLA_VAR/SRA_VAR/SRL_VAR case), TMS9900InstrInfo.td (added R0 to Defs), var-shift-zero.ll (updated lit test)
Why: The original fix emitted MOV $cnt, R0; JEQ DoneBB in StartBB. PHI elimination inserted copies (e.g., MOV R1, R6) between MOV and JEQ, clobbering ST flags. The JEQ then tested the wrong register and failed to skip the shift when count=0.
Technical notes:
- New approach: CMPBRri is a terminator pseudo. PHI copies are placed BEFORE terminators, so they go before CMPBRri, not between CI and JEQ. CMPBRri expands atomically to CI+JEQ in expandPostRAPseudo.
- MOV $cnt, R0 moved to ShiftBB (the shift basic block) where it can't be disrupted.
- First attempted expandPostRAPseudo approach but discovered it can't split MBBs: the ExpandPostRAPseudos pass uses
make_early_inc_range(MBB), and splicing instructions to a new MBB orphans the iterator, causing an infinite loop. - Verification: CoreMark passes at O0, O1, O2, O3, Os, Oz. Os_repro.c passes. 82/82 lit tests, 16/16 MC tests, 14/14 benchmarks.
What: Analyzed 3 Csmith checksum mismatches (test_1, test_3, test_11). Determined 2/3 are NOT compiler bugs but 16-bit int semantic differences. test_3 may have a real O1+ optimizer bug.
Where: tests/stress/csmith/test_{1,3,11}.c, tests/stress/BUGS.md
Why: test_1 and test_11 produce the same (wrong vs native) checksum at O0, O1, and O2 — self-consistent across all optimization levels. This proves the compiler is correct; the difference is from C integer promotion rules (int16_t+int16_t promotes to 32-bit int on x86 but stays 16-bit on TMS9900). Overriding INT_MAX macros doesn't change actual int width.
Technical notes: test_3 has O0=FA241ADF but O1/O2/Os=1A77C269, suggesting a real optimizer bug on top of the semantic difference. Deferred — needs C-Reduce minimization. Key lesson: Csmith is NOT a reliable cross-platform correctness oracle when int width differs. Self-consistency checks (same result across opt levels on the SAME target) are the right approach.
What: Fixed a miscompilation where LLVM's DAG combiner transforms ADD(ptr, 2) to OR(ptr, 2) for computing the low-word address of split i32 stack values. With 2-byte stack alignment, stack addresses like 0xFAFE have bit 1 already set, making ORI Rx,2 a no-op — causing the high word to be read twice instead of accessing the low word.
Where: TMS9900FrameLowering.cpp (prologue/epilogue +2 alignment padding), TMS9900RegisterInfo.cpp (eliminateFrameIndex +2 offset adjustment), TMS9900TargetMachine.cpp (S16->S32 data layout), clang/lib/Basic/Targets/TMS9900.h (S16->S32), calling-convention.ll and spill-reload.ll (updated test expectations)
Why: The root cause was traced via Csmith test_3 O0 vs O1 checksum mismatch. Execution trace comparison showed divergence at a JNE branch where a 32-bit zero-check read the high word twice (via ORI no-op) instead of reading both words. The DAGCombiner fold at line 2977 ((a+b) -> (a|b) iff no common bits) is correct generically but requires 4-byte-aligned stack bases.
Technical notes:
- Stack alignment changed to
Align(4)in FrameLowering andS32in data layout - Non-leaf functions push R11 with DECT (2 bytes), breaking 4-byte alignment. Fix: prologue adds +2 padding to the AI allocation (
AI R10,-(StackSize+2)instead ofAI R10,-StackSize), making total displacement StackSize+4 (4-aligned) eliminateFrameIndexadds +2 to StackAdj for non-leaf functions with stack objects to account for the padding- Leaf functions (no DECT) need no padding — SP stays 4-aligned naturally
- Non-leaf with StackSize=0 needs no padding — no stack accesses to misalign
- All 98 lit tests pass (82 CodeGen + 16 MC), 42/42 benchmarks pass (O0/O1/O2)
- Csmith test_3 O0 now correct (FA24:1ADF). O1+ still wrong — separate backend bug, likely in CodeGenPrepare/ISel interaction (confirmed via
-disable-cgpfixing limit=7 IR, but full pipeline has additional issues)
What: The ANDrr pseudo instruction (expands to INV+SZC+INV) was defined without Defs = [ST], meaning LLVM's instruction scheduler and register allocator did not know that ANDrr clobbers the status register. This allowed flag-dependent sequences to be reordered across ANDrr expansions, corrupting comparison results.
Where: llvm-project/llvm/lib/Target/TMS9900/TMS9900InstrInfo.td (line ~301, ANDrr definition)
Why: All three expansion instructions (INV, SZC, INV) set status flags on real hardware. The other bitwise pseudos (SOC, SZC, XOR, ANDI, ORI) were correctly inside a let Defs = [ST] block (lines 1123-1338), but ANDrr was in a separate let isPseudo = 1 block that omitted the ST clobber annotation.
Fix: Added let Defs = [ST] in before the ANDrr definition.
Verification: 82/82 CodeGen lit tests, 16/16 MC tests, 14/14 benchmarks at O2 pass.
What: The long_torture benchmark's fold32 function can now be safely inlined at O2. Previously, inlining fold32 caused a 0x8000 single-bit checksum error, worked around with __attribute__((noinline)). The root cause was the missing Defs = [ST] on ANDrr (Bug 9).
Where: tests/benchmarks/long_torture.c (removed __attribute__((noinline)) from fold32)
Why: When fold32 is inlined across 30 call sites, val & 0xFFFF generates ANDrr instructions at the i16 level. With 30 inlined fold32 calls producing 60 XOR operations, the scheduler had many opportunities to move ANDrr between compares and branches, corrupting status flags and causing one test case to produce a sign-bit error.
Technical notes: After the Bug 9 fix, all 42/42 benchmarks pass across O0/O1/O2 with fold32 inlined. The noinline workaround is no longer needed.
What: The Csmith test_3 O1+ miscompilation (O0=FA24:1ADF correct, O1/O2=1A77:C269 wrong) was previously believed to be a backend codegen bug. Investigation proved it is a middle-end IR optimization bug.
Where: Analysis files saved in tests/stress/csmith/build/test_3_before.ll and test_3_after.ll
Why: Key evidence:
- O0 IR compiled through
llc -O0= FA24:1ADF (CORRECT) - O1 IR compiled through
llc -O0= 1A77:C269 (WRONG) - This proves the O1 IR itself is already incorrect -- the backend is not at fault.
Technical notes:
- Used
-mllvm -opt-bisect-limit=Nbinary search to narrow down:- limit=37 (LowerExpectIntrinsic on func_13): CORRECT
- limit=38 (SimplifyCFGPass on func_13): WRONG
- SimplifyCFGPass performs three transformations on func_13:
- Short-circuit optimization: converts two sequential
br i1intoselect i1 %cond1, i1 %cond2, i1 false - Switch-to-branch: simplifies two-case switch to linear code
- Single-case switch to
icmp eq + br
- Short-circuit optimization: converts two sequential
- All three transformations appear semantically correct on inspection
- May be related to TMS9900's 16-bit
intcausing different behavior in safe_math overflow checks (INT16_MAX >= INT_MAX is true on TMS9900), or an interaction between SimplifyCFG and the TMS9900 DataLayout - Deferred as a potential upstream LLVM issue or DataLayout/TargetInfo configuration issue
- The ANDrr
Defs = [ST]fix does NOT affect this bug (verified)
What: Exhaustive line-by-line IR comparison of the before/after SimplifyCFGPass output for func_13 in Csmith test_3. The function shrinks from 1272 lines to 893 lines. Seven distinct transformations identified and analyzed.
Where: tests/stress/csmith/build/test_3_before.ll (limit=37) and test_3_after.ll (limit=38)
Transformations found (all individually semantically correct):
-
Block merging in loops: Three nested initialization loops have body+increment blocks merged into single blocks. Standard SimplifyCFG optimization.
-
Short-circuit OR to select (C line 166:
g_4 || (l_26 != &g_8)):- Before:
load volatile @g_4; icmp; br -> load %4; icmp; br -> phi i1 - After:
load volatile @g_4; icmp; load %4; icmp; select i1 %cond, i1 true, i1 %other - The non-volatile load of
%4is speculated. TBAA metadata!25lost on speculated load.
- Before:
-
Short-circuit AND to select (C line 184:
safe_lshift(...) && g_10):- Before:
call safe_lshift; icmp; br -> load @g_10; icmp; br - After:
call safe_lshift; icmp; load @g_10; icmp; select i1 %cond, i1 %other, i1 false - The load of
@g_10is speculated after the call (still sequenced after it). TBAA metadata!3lost.
- Before:
-
Dead phi folding:
phi i1 [ true, %322 ], [ true, %326 ]->zext i1 true to i16. Always-true phi correctly simplified. -
Dead branch removal:
br i1 true, label %337, label %378eliminated. Block 378 (dead else branch with ~260 lines of dead code) removed. Correct. -
Inner switch elimination:
switch i32 %val, unreachable [i32 0, %a; i32 22, %b]where both %a and %b reach the same block -> unconditional branch. Correct. -
Outer switch to icmp:
switch i32 %val, %default [i32 0, %case0]->icmp eq + br. Default and non-zero both go to cleanup. Correct.
Key finding: The most suspicious transformation is #3 (AND to select). The select i1 %220, i1 %222, i1 false pattern requires the i1 result of icmp ne i32 @g_10, 0 to be type-legalized from i1 to i16 and used in a SELECT expansion chain. On TMS9900:
ISD::SELECTfor i16 -> Expand -> SELECT_CCISD::SELECT_CCfor i16 -> Custom -> SELECT16 pseudo- SELECT16 -> EmitInstrWithCustomInserter -> CMPBR + branch + PHI
- Meanwhile,
ISD::SETCCfor i16 -> Custom -> returns 0xFFFF/-1 for true, 0 for false - The i1-to-i16 promotion should insert AND-with-1, but if this masking step is missing or mis-ordered, the 0xFFFF could propagate incorrectly through the select chain.
Conclusion: The SimplifyCFGPass transforms are individually correct at the IR semantic level. The bug is most likely a backend type legalization issue where the select i1, i1, i1 pattern, after i1-to-i16 promotion, interacts incorrectly with the TMS9900 SETCC/SELECT lowering chain. The branch-based IR in the "before" version avoids this codepath entirely because br i1 directly uses BRCOND which has a simpler lowering than SELECT.
Next steps: C-Reduce test_3.c, then trace select i1 through llc -O0 -debug to pinpoint the legalization bug.
What: Fixed intermittent SIGSEGV in computeRegisterProperties / findRepresentativeClass that crashed clang when compiling any code that included libc++ headers. Crash address was random (ASLR-dependent), manifesting as NULL or garbage function pointer dereference.
Where: llvm-project/llvm/lib/Target/TMS9900/TMS9900Subtarget.h (member declaration order), TMS9900Subtarget.cpp (constructor initializer list)
Why: C++ members initialize in declaration order. TLInfo (TMS9900TargetLowering) was declared before RegInfo (TMS9900RegisterInfo), but TLInfo's constructor calls STI.getRegisterInfo() which returns &RegInfo — an uninitialized object with a corrupt vtable. Fix: moved RegInfo declaration before TLInfo.
Technical notes: The bug was latent since project inception but only manifested with libc++ headers because the larger AST / heap activity changed memory layout enough to make the uninitialized vtable pointer reliably point to unmapped memory. Simpler C files happened to have heap residue that looked like valid vtable entries. Classic undefined behavior.
What: Brought up libc++ <array> and <algorithm> (std::sort) on TMS9900 freestanding. Three issues fixed: (1) backend crash above, (2) 16-bit hash specialization, (3) sort extern template linkage.
Where:
llvm-project/libcxx/include/__functional/hash.h— added__murmur2_or_cityhash<_Size, 16>specialization using FNV-1allvm-project/libcxx/include/__algorithm/sort.h— guardedextern templatedeclarations and__sort_is_specialized_in_librarybehind_LIBCPP_DISABLE_EXTERN_TEMPLATEtests/benchmarks/libcxx_config/time.h— new stub fortime_t,clock_t,struct tmtests/benchmarks/Makefile— addedSTL_BENCHMARKS,STLFLAGS, stl_test build rulestests/benchmarks/run_benchmarks.py— addedcpp_testandstl_testto benchmark list
Why: libc++ hash.h only had 32-bit and 64-bit murmur2/cityhash specializations; TMS9900 has 16-bit size_t. Sort used extern template for common types (int, unsigned, etc.) expecting definitions in libc++.so which doesn't exist freestanding.
Technical notes:
- 16-bit hash uses FNV-1a (offset basis 0x811D, prime 0x0193) — simple byte-at-a-time, appropriate for 16-bit targets
- Sort fix makes
__sort_is_specialized_in_libraryyieldfalse_typewhen_LIBCPP_DISABLE_EXTERN_TEMPLATEis defined, forcing inline__introsortpath - stl_test: std::array + std::sort on 8 elements, 3526B at O2, 341 instructions, 6156 cycles
- 48/48 benchmarks pass (16 programs x 3 opt levels), 98/98 lit tests pass
What: Launched 3 parallel autonomous agents to stress-test C++ language features beyond STL. Created lambda_test.cpp (8 tests), mi_test.cpp (8 tests), cpp_adv_test.cpp (10 tests). All 26 features pass at O0/O1/O2. Zero new backend bugs found.
Where:
tests/benchmarks/lambda_test.cpp— lambdas: stateless, capture by value/reference, mutable, fn ptr conversion, with std::sort, capturing this, nestedtests/benchmarks/mi_test.cpp— multiple inheritance: simple MI, this-pointer adjustment, diamond (non-virtual), virtual inheritance, deep hierarchy (4 levels), override both bases, data layout, static_cast up/downcasttests/benchmarks/cpp_adv_test.cpp— move ctor/assign, std::move, rule of five, perfect forwarding, variadic templates, structured bindings, constexpr, enum class, static local inittests/benchmarks/Makefile— added CPP_FEATURE_BENCHMARKS with compile/link rulestests/benchmarks/run_benchmarks.py— added lambda_test, mi_test, cpp_adv_test to ALL_BENCHMARKS and IDLE_HALT
Why: After STL bringup (Phases 10-11), tested deeper C++ features to verify backend correctness for vtable thunks, this-pointer adjustments, move semantics, template instantiation, and other patterns that stress register allocation and calling conventions.
Technical notes:
- MI test verified non-virtual thunks (DECT R0 for -2, AI R0,0xFFFC for -4, AI R0,0xFFFA for -6), virtual inheritance vcall offset thunks, and diamond inheritance
- Anti-devirtualization via
escape()template with inline asm ensures 25+ indirect vtable calls at all opt levels - 60/60 benchmarks pass (20 programs x 3 opt levels), 98/98 lit tests pass
- Binary sizes at O2: lambda_test 2886B, mi_test 1706B, cpp_adv_test 1090B
What: Built Rust 1.81.0 (LLVM 18) stage 1 with TMS9900 support. Test crate with 5 tests (arithmetic, u8, control flow, loop, array) compiles and runs on emulator: 146B code, 58 steps, 940 cycles.
Where:
- Rust repo:
~/personal/ti99/rust-tms9900/(Rust 1.81.0, config.toml) - Calling convention:
compiler/rustc_target/src/abi/call/tms9900.rs - LLVM registration:
compiler/rustc_llvm/build.rs(OPTIONAL_COMPONENTS),compiler/rustc_llvm/src/lib.rs(init_target!) - Target spec:
~/personal/ti99/rust-tms9900/test/tms9900-unknown-none.json - Test crate:
~/personal/ti99/rust-tms9900/test/ - LLVM fixes:
TMS9900ISelLowering.cpp(CanLowerReturn, ROTR/ROTL Expand)
Why: Extending TMS9900 backend from C/C++ to Rust. Rust 1.81.0 is the last stable release using LLVM 18.
Technical notes:
CanLowerReturn()added: limits returns to 8 bytes (4×i16 R0-R3), larger types (i128 in compiler_builtins) use sret- ROTR/ROTL i16 changed from Legal to Expand: SRC instruction only has constant-count pattern; compiler_builtins needs variable-count rotate
- Linker must use
ld.llddirectly ("linker-flavor": "ld"), NOT clang as driver (avoids GNU-ld flag confusion on macOS) core::arch::asm!doesn't work for custom targets — use extern "C" calls to .S files insteadconfig.toml(not bootstrap.toml) for Rust 1.81; setextended = falseto avoid book submodule dependency- LLVM targets must include AArch64 for host:
-DLLVM_TARGETS_TO_BUILD="TMS9900;AArch64" - All 98 lit tests pass (rotate.ll updated for Expand codegen), 60/60 C/C++ benchmarks still pass
Project Journal - Last Updated: February 11, 2026