|
| 1 | +# Halcyon: A C Compiler in Haskell |
| 2 | + |
| 3 | +Halcyon is a work-in-progress compiler for a large subset of C, written in Haskell. It targets the x86_64 instruction set architecture. This project focuses on implementing the core compiler functionality while leveraging existing system tools for preprocessing, assembly, and linking. |
| 4 | + |
| 5 | +## Current Status |
| 6 | + |
| 7 | +The compiler currently handles only the simplest C programs: a single function that returns an integer constant. For example: |
| 8 | + |
| 9 | +```c |
| 10 | +int main(void) { |
| 11 | + return 42; |
| 12 | +} |
| 13 | +``` |
| 14 | + |
| 15 | +### Compilation Pipeline |
| 16 | + |
| 17 | +The compiler processes source code through the following stages: |
| 18 | + |
| 19 | +1. **Lexical Analysis**: Breaks source code into a sequence of tokens |
| 20 | +2. **Parsing**: Converts tokens into an Abstract Syntax Tree (AST) |
| 21 | +3. **Code Generation**: Transforms AST into x86_64 assembly |
| 22 | +4. **Code Emission**: Writes the assembly as text to a file, which the system assembler and linker then turn into an executable |
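
The first two stages can be sketched for the minimal `return 42` program as follows. This is an illustrative sketch, not the project's actual `Lexer.hs` or `Parse.hs`: the `Token` constructors, the `lexer`, `parseProgram`, and `frontend` names, and the use of `Either String` for errors are all assumptions, and `String` stands in for `Text` to keep the example dependency-free.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

-- Hypothetical token type; the real definitions live in Tokens.hs.
data Token
  = TIdent String
  | TInt Int
  | TKeyword String
  | TOpenParen
  | TCloseParen
  | TOpenBrace
  | TCloseBrace
  | TSemicolon
  deriving (Eq, Show)

-- AST shaped like the one in this README (String instead of Text).
data Program = Program FunctionDef deriving (Eq, Show)
data FunctionDef = Function {name :: String, body :: Statement} deriving (Eq, Show)
data Statement = Return Expr deriving (Eq, Show)
data Expr = Constant Int deriving (Eq, Show)

keywords :: [String]
keywords = ["int", "void", "return"]

-- Stage 1: turn source text into tokens, or report a lexical error.
lexer :: String -> Either String [Token]
lexer [] = Right []
lexer (c : cs)
  | isSpace c = lexer cs
  | isDigit c =
      let (digits, rest) = span isDigit (c : cs)
       in case rest of
            -- A digit run followed directly by a letter (e.g. "12ab") is malformed.
            (r : _) | isAlpha r || r == '_' -> Left ("malformed number: " ++ digits ++ [r])
            _ -> (TInt (read digits) :) <$> lexer rest
  | isAlpha c || c == '_' =
      let (word, rest) = span (\x -> isAlphaNum x || x == '_') (c : cs)
          tok = if word `elem` keywords then TKeyword word else TIdent word
       in (tok :) <$> lexer rest
  | otherwise = case lookup c punctuation of
      Just tok -> (tok :) <$> lexer cs
      Nothing -> Left ("invalid character: " ++ [c])
  where
    punctuation =
      [ ('(', TOpenParen), (')', TCloseParen)
      , ('{', TOpenBrace), ('}', TCloseBrace)
      , (';', TSemicolon)
      ]

-- Stage 2: accept exactly `int <name>(void) { return <int>; }`.
parseProgram :: [Token] -> Either String Program
parseProgram
  [ TKeyword "int", TIdent fn, TOpenParen, TKeyword "void", TCloseParen
  , TOpenBrace, TKeyword "return", TInt n, TSemicolon, TCloseBrace
  ] = Right (Program (Function fn (Return (Constant n))))
parseProgram toks = Left ("syntax error: unexpected tokens " ++ show toks)

-- Chain the stages; the first failure short-circuits the rest.
frontend :: String -> Either String Program
frontend src = lexer src >>= parseProgram
```

Chaining the stages in `Either` means a lexical error stops the pipeline before the parser runs, which matches the stage-by-stage structure listed above.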
| 23 | + |
| 24 | +### Internal Representations |
| 25 | + |
| 26 | +Programs are represented internally using a series of increasingly lower-level data structures: |
| 27 | + |
| 28 | +1. **Abstract Syntax Tree (AST)**: |
| 29 | + ```haskell |
| 30 | + data Program = Program FunctionDef |
| 31 | + data FunctionDef = Function |
| 32 | + { name :: Text |
| 33 | + , body :: Statement |
| 34 | + } |
| 35 | + data Statement = Return Expr |
| 36 | + data Expr = Constant Int |
| 37 | + ``` |
| 38 | + |
| 39 | +2. **Assembly AST**: |
| 40 | + ```haskell |
| 41 | + data Program = Program FunctionDef |
| 42 | + data FunctionDef = Function |
| 43 | + { name :: Text |
| 44 | + , instructions :: [Instruction] |
| 45 | + } |
| 46 | + data Instruction = Mov Operand Operand | Ret |
| 47 | + data Operand = Imm Int | Register |
| 48 | + ``` |
| 49 | + |
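
To show how the two representations relate, here is a minimal sketch of lowering the C AST to the assembly AST and then printing it. It is not the contents of `Codegen.hs` or `Emit.hs`. Because both ASTs reuse the names `Program` and `FunctionDef`, the assembly types are renamed with an `Asm` prefix so the sketch fits in one module. Other assumptions: `String` replaces `Text`, the single `Register` is printed as `%eax`, the output uses AT&T syntax, and labels follow the Linux convention (no leading underscore, no `.note.GNU-stack` section).

```haskell
-- C AST, as in item 1 above (String instead of Text).
data Program = Program FunctionDef
data FunctionDef = Function {name :: String, body :: Statement}
data Statement = Return Expr
data Expr = Constant Int

-- Assembly AST, as in item 2 above, renamed to avoid clashes in one module.
data AsmProgram = AsmProgram AsmFunction deriving (Eq, Show)
data AsmFunction = AsmFunction {asmName :: String, instructions :: [Instruction]}
  deriving (Eq, Show)
data Instruction = Mov Operand Operand | Ret deriving (Eq, Show)
data Operand = Imm Int | Register deriving (Eq, Show)

-- Code generation: each C construct maps to a short instruction sequence.
codegen :: Program -> AsmProgram
codegen (Program (Function fn stmt)) = AsmProgram (AsmFunction fn (genStatement stmt))

genStatement :: Statement -> [Instruction]
genStatement (Return e) = [Mov (genExpr e) Register, Ret]

genExpr :: Expr -> Operand
genExpr (Constant n) = Imm n

-- Emission: render the assembly AST as text (assumed AT&T syntax).
emit :: AsmProgram -> String
emit (AsmProgram (AsmFunction fn instrs)) =
  unlines ([ "    .globl " ++ fn, fn ++ ":" ] ++ map emitInstruction instrs)

emitInstruction :: Instruction -> String
emitInstruction (Mov src dst) = "    movl " ++ emitOperand src ++ ", " ++ emitOperand dst
emitInstruction Ret = "    ret"

emitOperand :: Operand -> String
emitOperand (Imm n) = "$" ++ show n
emitOperand Register = "%eax" -- assumption: the lone register is the return register
```

Under these assumptions, `emit (codegen (Program (Function "main" (Return (Constant 42)))))` yields a `.globl main` directive, a `main:` label, `movl $42, %eax`, and `ret`, one per line.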
| 50 | +## Project Structure |
| 51 | + |
| 52 | +``` |
| 53 | +lib/ |
| 54 | +├── Halcyon/ |
| 55 | +│ ├── Backend/ # Code generation and emission |
| 56 | +│ │ ├── Codegen.hs # AST to Assembly conversion |
| 57 | +│ │ └── Emit.hs # Assembly to text output |
| 58 | +│ ├── Core/ # Core data types and utilities |
| 59 | +│ │ ├── Assembly.hs # Assembly representation |
| 60 | +│ │ ├── Ast.hs # C language AST |
| 61 | +│ │ ├── Monad.hs # Compiler monad stack |
| 62 | +│ │ └── Settings.hs # Compiler settings and types |
| 63 | +│ ├── Driver/ # Compiler driver |
| 64 | +│ │ ├── Cli.hs # Command line interface |
| 65 | +│ │ └── Pipeline.hs # Compilation pipeline |
| 66 | +│ └── Frontend/ # Parsing and analysis |
| 67 | +│ ├── Lexer.hs # Lexical analysis |
| 68 | +│ ├── Parse.hs # Parsing |
| 69 | +│ └── Tokens.hs # Token definitions |
| 70 | +``` |
| 71 | + |
| 72 | +### Architecture |
| 73 | + |
| 74 | +The compiler uses a monad transformer stack to handle IO operations and error management: |
| 75 | + |
| 76 | +```haskell |
| 77 | +newtype CompilerT m a = CompilerT |
| 78 | + { unCompilerT :: ExceptT CompilerError m a } |
| 79 | + |
| 80 | +type Compiler = CompilerT IO |
| 81 | +``` |
| 82 | + |
| 83 | +This provides: |
| 84 | +- Error handling through `ExceptT` |
| 85 | +- IO capabilities through the underlying monad |
| 86 | +- Clean separation of pure and effectful code |
| 87 | +- Structured error reporting and recovery |
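
As a rough illustration of how such a stack is typically used, the sketch below derives the usual instances for `CompilerT` and adds a runner. The `CompilerError` constructors, `runCompiler`, and `checkSource` are hypothetical names invented for this example; the README does not show how `Monad.hs` defines them. The sketch assumes the `mtl` package.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.Except (ExceptT, MonadError, runExceptT, throwError)
import Control.Monad.IO.Class (MonadIO, liftIO)

-- Hypothetical error type; the real one is not shown in this README.
data CompilerError
  = LexError String
  | SystemError String
  deriving (Eq, Show)

newtype CompilerT m a = CompilerT
  {unCompilerT :: ExceptT CompilerError m a}
  deriving (Functor, Applicative, Monad, MonadIO, MonadError CompilerError)

type Compiler = CompilerT IO

-- Unwrap the stack: IO effects run, and failures surface as a Left value.
runCompiler :: Compiler a -> IO (Either CompilerError a)
runCompiler = runExceptT . unCompilerT

-- Example action mixing IO (liftIO) with error signalling (throwError).
checkSource :: String -> Compiler Int
checkSource src = do
  liftIO (putStrLn "checking source")
  if null src
    then throwError (LexError "empty input")
    else pure (length src)
```

Deriving `MonadError CompilerError` lets any stage call `throwError` directly, while pure stages can stay outside the monad and be lifted in only where they are sequenced.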
| 88 | + |
| 89 | +## Command Line Interface |
| 90 | + |
| 91 | +```bash |
| 92 | +halcyon [OPTIONS] FILE |
| 93 | + |
| 94 | +Options: |
| 95 | + --lex Run lexical analysis only |
| 96 | + --parse Run parsing only |
| 97 | + --codegen Run through code generation |
| 98 | + -S Stop after assembly generation |
| 99 | + -h,--help Show help text |
| 100 | +``` |
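
One way to model these options is as a "stop after" stage carried alongside the input file. The sketch below is a hand-rolled argument parser for illustration only. The `Stage` and `Options` types and `parseArgs` are hypothetical, `Cli.hs` may well use an option-parsing library instead, and treating "no flag" as "build an executable" is an assumption based on the `halcyon input.c` usage shown in this README. `--help` handling is omitted.

```haskell
import Data.List (isPrefixOf)

-- How far the pipeline should run (hypothetical type).
data Stage = Lex | Parse | Codegen | Assembly | Executable
  deriving (Eq, Show)

data Options = Options {stopAfter :: Stage, inputFile :: FilePath}
  deriving (Eq, Show)

-- Accepts either `FILE` or `FLAG FILE`.
parseArgs :: [String] -> Either String Options
parseArgs args = case args of
  [file] | not (isFlag file) -> Right (Options Executable file)
  [flag, file] | not (isFlag file) -> (\s -> Options s file) <$> stageFor flag
  _ -> Left "usage: halcyon [OPTIONS] FILE"
  where
    isFlag = isPrefixOf "-"
    stageFor "--lex" = Right Lex
    stageFor "--parse" = Right Parse
    stageFor "--codegen" = Right Codegen
    stageFor "-S" = Right Assembly
    stageFor other = Left ("unknown option: " ++ other)
```

Representing the flags as a single `Stage` value lets the driver compare against it after each pipeline step instead of checking several booleans.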
| 101 | + |
| 102 | +### Build and Run |
| 103 | + |
| 104 | +```bash |
| 105 | +# Build the project |
| 106 | +cabal build |
| 107 | + |
| 108 | +# Run the compiler |
| 109 | +cabal run halcyon -- [OPTIONS] input.c |
| 110 | + |
| 111 | +# Example: Compile a file |
| 112 | +cabal run halcyon -- input.c |
| 113 | + |
| 114 | +# Example: Run only the lexer |
| 115 | +cabal run halcyon -- --lex input.c |
| 116 | +``` |
| 117 | + |
| 118 | +## External Dependencies |
| 119 | + |
| 120 | +Halcyon relies on the following system tools: |
| 121 | +- **GCC**: For preprocessing C source files (`gcc -E`) |
| 122 | +- **Assembler**: For converting assembly to object files |
| 123 | +- **Linker**: For producing final executables |
| 124 | + |
| 125 | +Make sure these tools are installed and available on your `PATH`. |
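
Invoking an external tool from Haskell might look like the sketch below, which uses `readProcessWithExitCode` from the `process` package. The helper names (`preprocessArgs`, `runTool`, `preprocess`) are hypothetical. The `-P` flag (suppress linemarkers) and the `-o` output path are additions beyond the `gcc -E` named above, chosen here as an assumption so the preprocessed file contains only C code.

```haskell
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Arguments for preprocessing `input` into `output` (assumed flag set).
preprocessArgs :: FilePath -> FilePath -> [String]
preprocessArgs input output = ["-E", "-P", input, "-o", output]

-- Run a tool and turn a non-zero exit code into a Left with its stderr.
-- Note: if the tool is not on PATH, readProcessWithExitCode throws an
-- IOException rather than returning an exit code.
runTool :: FilePath -> [String] -> IO (Either String ())
runTool tool args = do
  (code, _stdout, err) <- readProcessWithExitCode tool args ""
  pure $ case code of
    ExitSuccess -> Right ()
    ExitFailure n -> Left (tool ++ " exited with code " ++ show n ++ ": " ++ err)

preprocess :: FilePath -> FilePath -> IO (Either String ())
preprocess input output = runTool "gcc" (preprocessArgs input output)
```

Keeping the argument list in a pure function makes the command line easy to inspect or test without running GCC.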
| 126 | + |
| 127 | +## Error Handling |
| 128 | + |
| 129 | +The compiler provides detailed error reporting for: |
| 130 | +- Lexical errors (invalid characters, malformed numbers) |
| 131 | +- Syntax errors (invalid program structure) |
| 132 | +- Semantic errors (coming soon) |
| 133 | +- System errors (file I/O, external tool failures) |
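
These categories map naturally onto a sum type with one constructor per kind of failure. The sketch below is hypothetical: the constructor names, the `SourcePos` record, and the message format are assumptions, and the project's actual `CompilerError` type is not shown in this README.

```haskell
-- Position of an error in the source file (hypothetical).
data SourcePos = SourcePos {line :: Int, column :: Int}
  deriving (Eq, Show)

-- One constructor per error category listed above.
data CompilerError
  = LexicalError SourcePos String
  | SyntaxError String
  | SystemError String
  deriving (Eq, Show)

-- Render an error as a single human-readable line.
renderError :: CompilerError -> String
renderError err = case err of
  LexicalError (SourcePos l c) msg ->
    "lexical error at " ++ show l ++ ":" ++ show c ++ ": " ++ msg
  SyntaxError msg -> "syntax error: " ++ msg
  SystemError msg -> "system error: " ++ msg
```

A constructor for semantic errors could be added in the same way once that stage exists.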
| 134 | + |
| 135 | +## Future Plans |
| 136 | + |
| 137 | +### The Basics |
| 138 | +- [x] A minimal compiler |
| 139 | +- [ ] Unary operators |
| 140 | +- [ ] Binary operators |
| 141 | +- [ ] Logical and relational operators |
| 142 | +- [ ] Local variables |
| 143 | +- [ ] if statements and conditional expressions |
| 144 | +- [ ] Compound statements |
| 145 | +- [ ] Loops |
| 146 | +- [ ] Functions |
| 147 | +- [ ] File-scope variable declarations and storage-class specifiers |
| 148 | + |
| 149 | +### Types Beyond Int |
| 150 | +- [ ] Long integers |
| 151 | +- [ ] Unsigned integers |
| 152 | +- [ ] Floating-point numbers |
| 153 | +- [ ] Pointers |
| 154 | +- [ ] Arrays and pointer arithmetic |
| 155 | +- [ ] Characters and strings |
| 156 | +- [ ] Dynamic memory allocation |
| 157 | +- [ ] Structures |
| 158 | + |
| 159 | +### Optimizations |
| 160 | +- [ ] Optimizing TACKY programs |
| 161 | +- [ ] Register allocation |
| 162 | + |
| 163 | +## Contributing |
| 164 | + |
| 165 | +This is a personal learning project following the book "Writing a C Compiler" by Nora Sandler. While it's not currently open for contributions, feel free to use it as a reference for your own compiler projects. |