BitFly is a research codebase for studying low-bit LLM GEMM execution on top of the Ara RISC-V vector architecture. The repository contains:
- a proposed BMPU/BMPMM execution path for low-bit mixed-precision GEMM
- an RVV baseline path under the same Ara-based software and simulation stack
- application overlays derived from model-layer GEMM workloads
- analysis and automation scripts for correctness, benchmarking, tiling search, and roofline-style post-processing
The central evaluation question is:
Under the same model-derived workload slices and the same Ara-based execution stack, how does the proposed BMPMM path compare against an RVV baseline?
- Persistent project-owned sources live under `src/`; the synced Ara working tree lives under `ara/`.
- Benchmark apps are organized as a model-split matrix: `bmpmm_*` versus `rvv_*`, across `binary`, `INT2`, and `INT4`.
- The repository includes both fast correctness checks such as `bmpu_verify` and paper-oriented benchmark runners.
- Tiling search and roofline analysis are implemented in `scripts/analysis/` and operate on the same execution assumptions used by the software template and RTL.
This repository is best understood as a research artifact workspace rather than a polished end-user distribution. The maintained workflow is:
src/ -> sync into ara/ -> build and simulate -> logs and plots under tmp/ or repository outputs
If you want to preserve a change as part of BitFly, edit src/ first and treat ara/ as the working build tree.
git clone --recursive git@github.com:hypertseng/bitfly.git
cd bitfly
git submodule update --init --recursive

Recommended host tools:
- `git`
- `make`
- `python3`
- `gcc`/`g++`
- `rsync`
- `Verilator`
- optional: `QuestaSim`, `gtkwave`
Ara-specific prerequisites are documented in ara/DEPENDENCIES.md.
scripts/dev/sync_src_to_ara.sh
make -C ara/hardware verilate -j8
make -C ara/apps bin/bmpu_verify -j8
make -C ara/hardware simv app=bmpu_verify

A healthy run ends with `ALL CASES PASSED`.
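When the smoke check runs unattended, the pass condition above can be tested mechanically. The following is a minimal sketch, assuming the simulator output has been captured to a text file. The helper name `log_reports_pass` and the log path in the usage note are hypothetical and not part of the repository.

```python
from pathlib import Path

# Marker quoted from the README; a healthy bmpu_verify run prints it.
PASS_MARKER = "ALL CASES PASSED"


def log_reports_pass(log_text: str) -> bool:
    """Return True if the captured simulator output contains the pass marker."""
    return PASS_MARKER in log_text


def log_file_reports_pass(log_path: str) -> bool:
    """Convenience wrapper (hypothetical) that reads a captured log from disk."""
    return log_reports_pass(Path(log_path).read_text(errors="replace"))
```

For example, if the `simv` output was redirected to a file such as `tmp/bmpu_verify.log` (an assumed location), `log_file_reports_pass("tmp/bmpu_verify.log")` returns `True` only for a passing run.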
One smoke-check benchmark app:
scripts/benchmarks/run_model_split_apps.sh \
--mode run \
--apps bmpmm_INT2_gemma3_270m \
--parallel 1 \
--batch-size 1

Paper-style benchmark matrix:
scripts/benchmarks/run_model_split_apps.sh \
--mode all \
--build-jobs 16 \
--parallel 5 \
--batch-size 5

python3 scripts/analysis/roofline.py

This produces:
- `roofline_search_results.png`
- `roofline_search_results.pdf`
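For readers unfamiliar with roofline plots, the sketch below shows the standard roofline bound that such figures visualize. It is not the logic of `scripts/analysis/roofline.py`: the function names are hypothetical, and the traffic model (each GEMM operand moved exactly once) is a simplifying assumption made only for this example.

```python
def gemm_intensity(m: int, n: int, k: int,
                   a_bits: int, b_bits: int, c_bits: int) -> float:
    """Operations per byte for an M x K by K x N GEMM.

    Assumes every operand crosses the memory interface exactly once,
    which real tiling generally does not achieve.
    """
    ops = 2 * m * n * k  # one multiply and one add per MAC
    traffic_bytes = (m * k * a_bits + k * n * b_bits + m * n * c_bits) / 8
    return ops / traffic_bytes


def attainable_throughput(intensity: float, peak: float, bandwidth: float) -> float:
    """Classic roofline: the lower of the compute roof and the memory roof.

    peak is in ops/cycle, bandwidth in bytes/cycle, intensity in ops/byte.
    """
    return min(peak, bandwidth * intensity)
```

With placeholder numbers (not measured BitFly or Ara values), `attainable_throughput(2.0, peak=16.0, bandwidth=4.0)` is memory-bound at 8.0 ops/cycle, while an intensity of 10.0 reaches the 16.0 ops/cycle compute roof.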
| Path | Role | Notes |
|---|---|---|
| `src/` | BitFly source of truth | Project-owned overlays for apps, RTL, and LLVM-side instruction support |
| `ara/` | Working Ara tree | Main build and simulation workspace |
| `scripts/` | Automation entry points | Benchmark runners, sync scripts, analysis, and debug helpers |
| `docs/` | Reproducibility and workflow notes | Start here for artifact-style documentation |
| `tmp/` | Generated outputs | Logs, CSV summaries, and intermediate artifacts |
| `build/` | Local build outputs | Disposable products |
| `patches/` | Sync-side review artifacts | Optional exported diffs from sync workflows |
- `docs/artifact_quickstart.md`
- `docs/artifact_checklist.md`
- `docs/repo_structure.md`
- `src/README.md`
- `src/apps/README.md`
- `scripts/README.md`
- `docs/benchmark_workflow.md`
The main benchmark matrix treats each app as one workload slice:
<implementation>_<precision>_<model>
Examples:
- `bmpmm_binary_gemma3_270m`
- `bmpmm_INT2_qwen25_15b`
- `rvv_INT4_opt_13b`
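The naming scheme can be split back into its three fields when post-processing results. The following is a minimal sketch, assuming the implementation and precision tokens are exactly those listed in this README. `AppName` and `parse_app_name` are hypothetical names, not repository code.

```python
from typing import NamedTuple

# Token sets taken from this README; extend them if the matrix grows.
IMPLEMENTATIONS = {"bmpmm", "rvv"}
PRECISIONS = {"binary", "INT2", "INT4"}


class AppName(NamedTuple):
    implementation: str
    precision: str
    model: str


def parse_app_name(name: str) -> AppName:
    """Split '<implementation>_<precision>_<model>' into its fields."""
    # Split at most twice: model names such as 'gemma3_270m' contain underscores.
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"not a benchmark app name: {name!r}")
    implementation, precision, model = parts
    if implementation not in IMPLEMENTATIONS:
        raise ValueError(f"unknown implementation in {name!r}")
    if precision not in PRECISIONS:
        raise ValueError(f"unknown precision in {name!r}")
    return AppName(implementation, precision, model)
```

Names outside the matrix, such as `bmpu_verify`, raise `ValueError`, which keeps correctness apps out of benchmark tables.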
The intended comparison is always:
- proposed path: `bmpmm_*`
- baseline path: `rvv_*`
under:
- the same model-derived GEMM shape set
- the same precision
- the same Ara-based build and simulation stack unless intentionally changed
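One way to enforce this pairing during analysis is to match each proposed-path app with the baseline app that shares its precision and model. The sketch below is illustrative only: `baseline_counterpart` and `pair_apps` are hypothetical helpers, and the sketch assumes that matching names imply the same GEMM shape set.

```python
from typing import Iterable, List, Tuple

PROPOSED_PREFIX = "bmpmm_"
BASELINE_PREFIX = "rvv_"


def baseline_counterpart(app: str) -> str:
    """Map 'bmpmm_<precision>_<model>' to 'rvv_<precision>_<model>'."""
    if not app.startswith(PROPOSED_PREFIX):
        raise ValueError(f"not a proposed-path app: {app!r}")
    return BASELINE_PREFIX + app[len(PROPOSED_PREFIX):]


def pair_apps(apps: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (proposed, baseline) pairs and the proposed apps lacking a baseline."""
    available = set(apps)
    pairs: List[Tuple[str, str]] = []
    unmatched: List[str] = []
    for app in sorted(available):
        if not app.startswith(PROPOSED_PREFIX):
            continue  # baseline and non-matrix apps are only looked up, not paired from
        baseline = baseline_counterpart(app)
        if baseline in available:
            pairs.append((app, baseline))
        else:
            unmatched.append(app)
    return pairs, unmatched
```

Reporting the unmatched list alongside results makes it visible when a speedup number has no same-precision, same-model baseline behind it.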
bmpu_verify is a correctness regression app, not a paper-performance data point.
For artifact-quality runs, keep the following together:
- the exact launcher command
- the run directory under `tmp/model_app_runs/<run>/`
- `apps.txt`
- `runner.log`
- `summary.csv`
- per-app logs under `batch_XX/`
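A quick completeness check before archiving a run can catch missing pieces early. This is a minimal sketch, assuming the files listed above sit directly inside the run directory and that per-app logs live in subdirectories whose names start with `batch_`. The function name `missing_run_artifacts` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names taken from the list above; the flat layout is an assumption.
REQUIRED_FILES = ("apps.txt", "runner.log", "summary.csv")


def missing_run_artifacts(run_dir: str) -> List[str]:
    """Return the expected artifacts that are absent from a run directory."""
    root = Path(run_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    # At least one batch_XX/ directory is expected to hold per-app logs.
    if not any(path.is_dir() for path in root.glob("batch_*")):
        missing.append("batch_XX/")
    return missing
```

An empty list means the run directory contains everything named above; the exact launcher command still has to be recorded separately.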
For a complete workflow, including runner options and log interpretation, see docs/benchmark_workflow.md.
- Edit `src/` when the change should be maintained by BitFly.
- Edit `ara/` only for local working-tree experiments or upstream Ara concerns.
- Use `scripts/dev/sync_src_to_ara.sh` to propagate maintained overlays into `ara/`.
Treat these as generated or runtime artifacts, not primary source:
- `tmp/`
- `build/`
- simulator logs
- large local model binaries
- generated figures and CSV summaries unless explicitly versioned for a reason
The repository intentionally separates source from outputs. Measurements should live in dedicated run directories rather than being mixed with maintained source overlays.
- `docs/README.md`: documentation index
- `docs/artifact_quickstart.md`: fastest path from clone to benchmark and plot
- `docs/artifact_checklist.md`: artifact and repository release checklist
- `docs/benchmark_workflow.md`: benchmark methodology and output contract
- `docs/repo_structure.md`: directory map and file placement guide
- `docs/repo_guide.md`: ownership boundaries and repository mental model
- `src/README.md`: source-of-truth policy and overlay map
- `src/apps/README.md`: benchmark app taxonomy
- `src/hardware/README.md`: RTL overlay organization
- `src/llvm_instr/README.md`: custom instruction support
- `scripts/README.md`: command-oriented entry points
See CONTRIBUTING.md for repository conventions, patch boundaries, and pre-PR checks.
If BitFly is used in academic work, cite the repository and the associated paper when available. Citation metadata is provided in CITATION.cff.
This repository is released under the MIT License. The ara/ submodule and other vendored dependencies keep their own upstream licenses.