Releases: nareshis21/Truelarge-RT

v0.1.0-beta

20 Feb 13:36

TrueLarge-RT v1.0-beta: The Pipelining Update

This release marks a significant milestone in mobile LLM inference. We have moved from sequential layer loading to a Deep-Pipelined Architecture that overlaps storage reads with computation, so large models (32B to 70B parameters) run noticeably more smoothly on Android devices.

Major Highlights

1. Deep-Pipelined Layer Execution (Gen 2)

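We have targeted the I/O bottleneck directly. Weights for upcoming layers are now loaded in the background while the current layer computes, so in the common case the engine no longer waits for weights to load from storage.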

  • Eager Prefetch Queue: A thread-safe, std::deque-based queue that looks 3-5 layers ahead, keeping the storage pipeline (UFS 3.1/4.0) saturated.
  • Asynchronous Memory Touching: Background "touch" loops force physical page-ins, ensuring data is in RAM before the computation thread reaches it.
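
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the names PrefetchQueue, schedule_lookahead and touch_pages are hypothetical, the 4096-byte page size is an assumed default, and wrapping the lookahead around the last layer is an assumption based on the inter-token prefetching described later in these notes.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical sketch; names are not taken from the engine.
class PrefetchQueue {
public:
    // Enqueue a layer index unless it is already pending.
    void push(int layer) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            for (int queued : q_) {
                if (queued == layer) return;
            }
            q_.push_back(layer);
        }
        cv_.notify_one();
    }

    // Block until a layer is available or the queue is closed.
    // Returns false once the queue is closed and drained.
    bool pop(int& layer) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        layer = q_.front();
        q_.pop_front();
        return true;
    }

    // Wake any waiting consumer and stop accepting blocking waits.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mu_);
        return q_.size();
    }

private:
    mutable std::mutex mu_;
    std::condition_variable cv_;
    std::deque<int> q_;
    bool closed_ = false;
};

// Queue the next `lookahead` layers after `current`.
// Assumption: indices wrap so the tail of one token prefetches the head of the next.
inline void schedule_lookahead(PrefetchQueue& q, int current, int lookahead, int n_layers) {
    assert(n_layers > 0 && lookahead >= 0);
    for (int i = 1; i <= lookahead; ++i) {
        q.push((current + i) % n_layers);
    }
}

// Read one byte per page so the kernel has to fault each page into RAM.
// The volatile pointer keeps the compiler from optimizing the reads away.
inline std::uint64_t touch_pages(const unsigned char* data, std::size_t len,
                                 std::size_t page_size = 4096) {
    assert(page_size > 0);
    const volatile unsigned char* p = data;
    std::uint64_t sum = 0;
    for (std::size_t off = 0; off < len; off += page_size) {
        sum += p[off];
    }
    return sum;
}
```

In a design like this, a background thread would loop on pop() and call touch_pages() on the mapped bytes of each popped layer, while the compute thread calls schedule_lookahead() as it advances.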

2. "Greedy" RAM Window (Gen 3)

Maximize your flagship hardware. We've shifted from "minimum RAM usage" to "high-performance hybrid caching."

  • Expanded Sliding Window: Increased the layer cap from 10 to 80 layers.
  • Aggressive Budgeting: Reduced the OS safety buffer to 500 MB, allowing the model to occupy nearly all available RAM for maximum speed.
  • Inter-Token Pipelining: Prefetches layers 0-8 for the next token while the current token is still being sampled.
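
As a rough illustration of the budgeting described above, the window size could be derived from available RAM like this. The 500 MB buffer and the 80-layer cap come from these notes; the function name window_layers, the choice to treat 500 MB as 500 MiB, and the assumption that all layers are the same size are illustrative only.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kSafetyBufferBytes = 500 * kMiB;  // OS safety buffer (assumed MiB)
constexpr int kMaxWindowLayers = 80;                      // layer cap from the release notes

// Number of layers that fit in the RAM window after reserving the safety buffer.
// Hypothetical helper; assumes every layer occupies bytes_per_layer.
inline int window_layers(std::uint64_t available_bytes,
                         std::uint64_t bytes_per_layer,
                         int n_layers) {
    if (bytes_per_layer == 0 || available_bytes <= kSafetyBufferBytes) return 0;
    const std::uint64_t budget = available_bytes - kSafetyBufferBytes;
    const std::uint64_t fit = budget / bytes_per_layer;
    const std::uint64_t cap =
        static_cast<std::uint64_t>(std::min(std::max(n_layers, 0), kMaxWindowLayers));
    return static_cast<int>(std::min(fit, cap));
}
```

For example, with 12288 MiB available and 400 MiB per layer, the budget is 11788 MiB, which holds 29 layers; the remaining layers would still be streamed from storage.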

3. Precision Telemetry

  • 4-Digit TPS Tracking: Real-time logging now reports Tokens-Per-Second to four decimal places (%.4f), allowing fine-grained performance profiling on large models.
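  • I/O Overlap Analysis: Logs now record "HIT" vs "WAIT" times for prefetched layers, showing whether a layer was already in RAM when the compute thread reached it or whether the thread had to wait on I/O.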
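
A small sketch of how such telemetry might be computed. The "TPS: " prefix, the OverlapStats struct, and the rule that a zero wait counts as a HIT are assumptions made for illustration; the engine's actual log format is not shown in these notes.

```cpp
#include <cstdio>
#include <string>

// Format tokens-per-second with four decimal places (%.4f).
// The "TPS: " prefix is a hypothetical log format.
inline std::string format_tps(int tokens, double seconds) {
    const double tps = seconds > 0.0 ? tokens / seconds : 0.0;
    char buf[64];
    std::snprintf(buf, sizeof buf, "TPS: %.4f", tps);
    return std::string(buf);
}

// Tally prefetch outcomes: a HIT means the layer was already resident,
// a WAIT means the compute thread stalled for waited_ms milliseconds.
struct OverlapStats {
    int hits = 0;
    int waits = 0;
    double total_wait_ms = 0.0;

    void record(double waited_ms) {
        if (waited_ms <= 0.0) {
            ++hits;
        } else {
            ++waits;
            total_wait_ms += waited_ms;
        }
    }

    // Fraction of layer accesses that did not stall; 0 when nothing was recorded.
    double hit_rate() const {
        const int total = hits + waits;
        return total > 0 ? static_cast<double>(hits) / total : 0.0;
    }
};
```

For instance, format_tps(10, 4.0) yields "TPS: 2.5000".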

Technical Improvements

  • Multi-Layer Logic: Added eviction protection so the prefetcher never evicts a layer that is still queued as an upcoming target.
  • Kernel-Level Tweaks: Added MADV_SEQUENTIAL and MADV_WILLNEED hints to optimize the Android kernel's read-ahead behavior.
  • Stability: Fixed memory leaks and race conditions in the background I/O thread.
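
The eviction protection and kernel hints above might look roughly like the following on a POSIX system. madvise, MADV_SEQUENTIAL and MADV_WILLNEED are standard calls and flags; everything else (MappedModel, map_model, hint_layer, pick_victim, and the oldest-first eviction order) is a hypothetical sketch rather than the engine's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedModel {
    unsigned char* base = nullptr;
    std::size_t len = 0;
};

// Map a model file read-only and mark it as sequentially accessed.
// Returns an empty MappedModel (base == nullptr) on failure.
inline MappedModel map_model(const char* path) {
    MappedModel m;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.base = static_cast<unsigned char*>(p);
            m.len = len;
            madvise(p, len, MADV_SEQUENTIAL);  // advisory only; failure is ignored
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return m;
}

// Ask the kernel to start reading an upcoming layer's byte range.
// madvise needs a page-aligned start, so the offset is rounded down.
inline bool hint_layer(const MappedModel& m, std::size_t offset, std::size_t len) {
    if (m.base == nullptr || offset >= m.len) return false;
    len = std::min(len, m.len - offset);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    assert(page > 0);
    const std::size_t aligned = offset - (offset % page);
    return madvise(m.base + aligned, len + (offset - aligned), MADV_WILLNEED) == 0;
}

// Choose the oldest resident layer that is not a pending prefetch target.
// Returns -1 when every resident layer is protected.
inline int pick_victim(const std::deque<int>& resident_oldest_first,
                       const std::deque<int>& pending) {
    for (int layer : resident_oldest_first) {
        if (std::find(pending.begin(), pending.end(), layer) == pending.end()) {
            return layer;
        }
    }
    return -1;
}
```

Both madvise flags are hints: the kernel is free to ignore them, so code like this should treat them as an optimization and still handle the case where a page is not resident.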

📦 Getting Started

  1. Model Support: Fully tested with Qwen 2.5 (32B), Llama 3.1 (70B), and Mistral-based models in GGUF format.
  2. Device Requirements:
    • 4GB RAM Minimum (LBL, i.e. layer-by-layer, Mode)
    • 12GB+ RAM Recommended for Hybrid Mode
    • UFS 3.1/4.0 Storage highly recommended.

Full Changelog: f931c56...v1.0.0-beta