Conversation

@gaarutyunov (Owner)

Major additions:

  • Complete Swift port of V-JEPA2 model using mlx-swift
  • macOS app with camera integration for real-time action recognition
  • iPhone app with camera-based video classification
  • Comprehensive test suite with cross-language Python-Swift comparison
  • CI/CD integration for automated testing on macOS runners
  • Something-Something-V2 labels (174 action classes)

Swift Implementation:

  • VisionTransformer: ViT-Large/16 encoder with video support
  • AttentivePooler: Classification head with cross-attention (see the sketch after this list)
  • Core modules: Patch embedding, positional encodings, attention, MLP
  • Support for both image and video inputs
  • RoPE attention implementation
  • Dynamic position encoding interpolation
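
For orientation, here is a minimal sketch of what a cross-attention pooling head can look like on top of MLXNN. The class name, weight layout, and default dimensions (ViT-L/16 width, 174 classes) are illustrative rather than copied from the actual port:

```swift
import MLX
import MLXNN
import MLXRandom

/// Minimal cross-attention pooling head in the spirit of the attentive pooler:
/// one learned query attends over the encoder's patch tokens, and the pooled
/// token is projected to class logits. Names and shapes are illustrative.
class AttentivePoolerSketch: Module {
    let numHeads: Int

    @ParameterInfo var queryToken: MLXArray   // (1, 1, embedDim), learned
    @ModuleInfo var q: Linear
    @ModuleInfo var kv: Linear
    @ModuleInfo var proj: Linear
    @ModuleInfo var head: Linear

    init(embedDim: Int = 1024, numHeads: Int = 16, numClasses: Int = 174) {
        self.numHeads = numHeads
        self._queryToken.wrappedValue = MLXRandom.normal([1, 1, embedDim]) * 0.02
        self._q.wrappedValue = Linear(embedDim, embedDim)
        self._kv.wrappedValue = Linear(embedDim, 2 * embedDim)
        self._proj.wrappedValue = Linear(embedDim, embedDim)
        self._head.wrappedValue = Linear(embedDim, numClasses)
        super.init()
    }

    /// `tokens`: (batch, numTokens, embedDim) output of the video encoder.
    func callAsFunction(_ tokens: MLXArray) -> MLXArray {
        let B = tokens.dim(0)
        let D = tokens.dim(2)
        let headDim = D / numHeads

        // Single learned query, shared across the batch via broadcasting.
        let queries = q(queryToken)
            .reshaped(1, 1, numHeads, headDim)
            .transposed(0, 2, 1, 3)                               // (1, H, 1, hd)

        // Keys and values come from the encoder tokens.
        let kvPairs = split(kv(tokens).reshaped(B, -1, 2, numHeads, headDim),
                            parts: 2, axis: 2)
        let keys = kvPairs[0].squeezed(axis: 2).transposed(0, 2, 1, 3)    // (B, H, N, hd)
        let values = kvPairs[1].squeezed(axis: 2).transposed(0, 2, 1, 3)  // (B, H, N, hd)

        // Scaled dot-product cross-attention; batch dims broadcast in matmul.
        let scale = 1.0 / Float(headDim).squareRoot()
        let scores = softmax(matmul(queries, keys.transposed(0, 1, 3, 2)) * scale, axis: -1)
        let pooled = matmul(scores, values)                       // (B, H, 1, hd)
            .transposed(0, 2, 1, 3)
            .reshaped(B, D)

        // Project the pooled token and map it to the SSv2 classes.
        return head(proj(pooled))
    }
}
```

In the port, this head sits on top of the VisionTransformer's output tokens to produce the 174 Something-Something-V2 logits.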

Applications:

  • macOS: Desktop app with a split-view interface and real-time preview
  • iOS: Mobile app with a full-screen camera view and prediction overlay
  • Both apps include FPS monitoring and inference time metrics
  • AVFoundation integration for camera capture (see the sketch after this list)
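
A stripped-down sketch of the AVFoundation capture path the apps rely on; the class name, session preset, and pixel format are illustrative, and the delegate callback is where frames would be handed to the classifier:

```swift
import AVFoundation

/// Minimal camera pipeline: frames arrive as CMSampleBuffers on a background
/// queue and can be converted to model inputs. Names here are illustrative,
/// not the ones used in the apps.
final class CameraCaptureSketch: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "camera.frames")

    func start() throws {
        session.beginConfiguration()
        session.sessionPreset = .hd1280x720

        // Pick the default wide-angle camera (back camera on iPhone, built-in on Mac).
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: .unspecified),
              let input = try? AVCaptureDeviceInput(device: device),
              session.canAddInput(input) else {
            throw NSError(domain: "Camera", code: 1)
        }
        session.addInput(input)

        // Deliver BGRA frames to the delegate; drop late frames to keep latency low.
        let output = AVCaptureVideoDataOutput()
        output.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String:
                                    kCVPixelFormatType_32BGRA]
        output.alwaysDiscardsLateVideoFrames = true
        output.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(output) { session.addOutput(output) }

        session.commitConfiguration()
        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Convert `pixelBuffer` to the model's input tensor and enqueue inference here.
        _ = pixelBuffer
    }
}
```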

Testing:

  • Unit tests for all components (shape verification)
  • Cross-language tests comparing Python and Swift outputs (see the sketch after this list)
  • Component tests: embeddings, attention, MLP, blocks
  • Integration tests: end-to-end classification pipeline
  • Test data export from Python for cross-validation
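
Roughly, a cross-language test can look like the sketch below; the file name, dictionary keys, tolerances, and the `runSwiftEncoder` placeholder are assumptions, not the project's actual test code:

```swift
import XCTest
import MLX

// Placeholder standing in for the real Swift encoder forward pass; the actual
// project wires up the full VisionTransformer here.
func runSwiftEncoder(_ input: MLXArray) -> MLXArray { input }

/// Sketch of one cross-language check: Python exports a reference input/output
/// pair, and the Swift side must reproduce the output within a tolerance.
final class CrossLanguageSketchTests: XCTestCase {
    func testEncoderMatchesPythonReference() throws {
        // Reference tensors exported from Python (e.g. as a .safetensors file).
        let url = URL(fileURLWithPath: "TestData/encoder_reference.safetensors")
        let arrays = try MLX.loadArrays(url: url)

        guard let input = arrays["input"], let expected = arrays["output"] else {
            throw XCTSkip("Reference export missing; run the Python export step first.")
        }

        let actual = runSwiftEncoder(input)

        XCTAssertEqual(actual.shape, expected.shape)
        XCTAssertTrue(allClose(actual, expected, rtol: 1e-4, atol: 1e-5).item(Bool.self),
                      "Swift output diverged from the Python reference")
    }
}
```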

CI/CD:

  • GitHub Actions workflow with Swift test job
  • Runs on GitHub-hosted macos-14 runners (Apple Silicon)
  • Automated test data export and cross-language validation
  • Swift package build and test automation

Performance:

  • Native Apple Silicon optimization via MLX
  • Real-time inference (~6.5 FPS on M2; see the timing sketch after this list)
  • Efficient memory management for video processing
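
As a rough illustration of how the FPS and inference-time figures can be measured around a lazily evaluated MLX forward pass (function names, cache limit, and input shape are assumptions):

```swift
import Foundation
import MLX
import MLXRandom

/// Rough timing harness for one inference step. MLX evaluates lazily, so
/// `eval` is called to make sure the work has actually run before the clock
/// stops. `classifier` stands in for the real encoder + attentive pooler.
func timedInference(_ classifier: (MLXArray) -> MLXArray,
                    clip: MLXArray) -> (logits: MLXArray, seconds: Double) {
    let start = CFAbsoluteTimeGetCurrent()
    let logits = classifier(clip)
    eval(logits)                                   // force the lazy graph to execute
    return (logits, CFAbsoluteTimeGetCurrent() - start)
}

/// Example usage with a dummy clip and a stand-in classifier. The cache limit
/// and the (batch, channels, frames, height, width) shape are illustrative.
func reportInferenceSpeed() {
    MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)      // keep the GPU buffer cache bounded

    let dummyClip = MLXRandom.normal([1, 3, 16, 256, 256])
    let (_, seconds) = timedInference({ $0.sum(axes: [2, 3, 4]) }, clip: dummyClip)
    print(String(format: "inference: %.1f ms (~%.1f FPS)",
                 seconds * 1000, 1.0 / max(seconds, 1e-6)))
}
```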
