Conversation

@gaarutyunov (Owner)

Major additions:

  • Complete Swift port of V-JEPA2 model using mlx-swift
  • macOS app with camera integration for real-time action recognition
  • iPhone app with camera-based video classification
  • Comprehensive test suite with cross-language Python-Swift comparison
  • CI/CD integration for automated testing on macOS runners
  • Something-Something-V2 labels (174 action classes)

Swift Implementation:

  • VisionTransformer: ViT-Large/16 encoder with video support
  • AttentivePooler: Classification head with cross-attention (see the sketch after this list)
  • Core modules: Patch embedding, positional encodings, attention, MLP
  • Support for both image and video inputs
  • RoPE attention implementation
  • Dynamic position encoding interpolation
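
For orientation, here is a minimal sketch of what a cross-attention pooling head can look like on top of MLXNN. The class name, weight layout, and default dimensions (ViT-L/16 width, 174 classes) are illustrative rather than copied from the actual port:

```swift
import MLX
import MLXNN
import MLXRandom

/// Minimal cross-attention pooling head in the spirit of the attentive pooler:
/// one learned query attends over the encoder's patch tokens, and the pooled
/// token is projected to class logits. Names and shapes are illustrative.
class AttentivePoolerSketch: Module {
    let numHeads: Int

    @ParameterInfo var queryToken: MLXArray   // (1, 1, embedDim), learned
    @ModuleInfo var q: Linear
    @ModuleInfo var kv: Linear
    @ModuleInfo var proj: Linear
    @ModuleInfo var head: Linear

    init(embedDim: Int = 1024, numHeads: Int = 16, numClasses: Int = 174) {
        self.numHeads = numHeads
        self._queryToken.wrappedValue = MLXRandom.normal([1, 1, embedDim]) * 0.02
        self._q.wrappedValue = Linear(embedDim, embedDim)
        self._kv.wrappedValue = Linear(embedDim, 2 * embedDim)
        self._proj.wrappedValue = Linear(embedDim, embedDim)
        self._head.wrappedValue = Linear(embedDim, numClasses)
        super.init()
    }

    /// `tokens`: (batch, numTokens, embedDim) output of the video encoder.
    func callAsFunction(_ tokens: MLXArray) -> MLXArray {
        let B = tokens.dim(0)
        let D = tokens.dim(2)
        let headDim = D / numHeads

        // Single learned query, shared across the batch via broadcasting.
        let queries = q(queryToken)
            .reshaped(1, 1, numHeads, headDim)
            .transposed(0, 2, 1, 3)                               // (1, H, 1, hd)

        // Keys and values come from the encoder tokens.
        let kvPairs = split(kv(tokens).reshaped(B, -1, 2, numHeads, headDim),
                            parts: 2, axis: 2)
        let keys = kvPairs[0].squeezed(axis: 2).transposed(0, 2, 1, 3)    // (B, H, N, hd)
        let values = kvPairs[1].squeezed(axis: 2).transposed(0, 2, 1, 3)  // (B, H, N, hd)

        // Scaled dot-product cross-attention; batch dims broadcast in matmul.
        let scale = 1.0 / Float(headDim).squareRoot()
        let scores = softmax(matmul(queries, keys.transposed(0, 1, 3, 2)) * scale, axis: -1)
        let pooled = matmul(scores, values)                       // (B, H, 1, hd)
            .transposed(0, 2, 1, 3)
            .reshaped(B, D)

        // Project the pooled token and map it to the SSv2 classes.
        return head(proj(pooled))
    }
}
```

In the port, this head sits on top of the VisionTransformer's output tokens to produce the 174 Something-Something-V2 logits.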

Applications:

  • macOS: Desktop app with a split-view interface and real-time preview
  • iOS: Mobile app with a full-screen camera view and prediction overlay
  • Both apps include FPS monitoring and inference time metrics
  • AVFoundation integration for camera capture (see the sketch after this list)
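
A stripped-down sketch of the AVFoundation capture path the apps rely on; the class name, session preset, and pixel format are illustrative, and the delegate callback is where frames would be handed to the classifier:

```swift
import AVFoundation

/// Minimal camera pipeline: frames arrive as CMSampleBuffers on a background
/// queue and can be converted to model inputs. Names here are illustrative,
/// not the ones used in the apps.
final class CameraCaptureSketch: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    let session = AVCaptureSession()
    private let queue = DispatchQueue(label: "camera.frames")

    func start() throws {
        session.beginConfiguration()
        session.sessionPreset = .hd1280x720

        // Pick the default wide-angle camera (back camera on iPhone, built-in on Mac).
        guard let device = AVCaptureDevice.default(.builtInWideAngleCamera,
                                                   for: .video,
                                                   position: .unspecified),
              let input = try? AVCaptureDeviceInput(device: device),
              session.canAddInput(input) else {
            throw NSError(domain: "Camera", code: 1)
        }
        session.addInput(input)

        // Deliver BGRA frames to the delegate; drop late frames to keep latency low.
        let output = AVCaptureVideoDataOutput()
        output.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String:
                                    kCVPixelFormatType_32BGRA]
        output.alwaysDiscardsLateVideoFrames = true
        output.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(output) { session.addOutput(output) }

        session.commitConfiguration()
        session.startRunning()
    }

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Convert `pixelBuffer` to the model's input tensor and enqueue inference here.
        _ = pixelBuffer
    }
}
```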

Testing:

  • Unit tests for all components (shape verification)
  • Cross-language tests comparing Python and Swift outputs (see the sketch after this list)
  • Component tests: embeddings, attention, MLP, blocks
  • Integration tests: end-to-end classification pipeline
  • Test data export from Python for cross-validation
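
Roughly, a cross-language test can look like the sketch below; the file name, dictionary keys, tolerances, and the `runSwiftEncoder` placeholder are assumptions, not the project's actual test code:

```swift
import XCTest
import MLX

// Placeholder standing in for the real Swift encoder forward pass; the actual
// project wires up the full VisionTransformer here.
func runSwiftEncoder(_ input: MLXArray) -> MLXArray { input }

/// Sketch of one cross-language check: Python exports a reference input/output
/// pair, and the Swift side must reproduce the output within a tolerance.
final class CrossLanguageSketchTests: XCTestCase {
    func testEncoderMatchesPythonReference() throws {
        // Reference tensors exported from Python (e.g. as a .safetensors file).
        let url = URL(fileURLWithPath: "TestData/encoder_reference.safetensors")
        let arrays = try MLX.loadArrays(url: url)

        guard let input = arrays["input"], let expected = arrays["output"] else {
            throw XCTSkip("Reference export missing; run the Python export step first.")
        }

        let actual = runSwiftEncoder(input)

        XCTAssertEqual(actual.shape, expected.shape)
        XCTAssertTrue(allClose(actual, expected, rtol: 1e-4, atol: 1e-5).item(Bool.self),
                      "Swift output diverged from the Python reference")
    }
}
```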

CI/CD:

  • GitHub Actions workflow with Swift test job
  • Runs on GitHub-hosted macos-14 runners (Apple Silicon)
  • Automated test data export and cross-language validation
  • Swift package build and test automation

Performance:

  • Native Apple Silicon optimization via MLX
  • Real-time inference (~6.5 FPS on M2; see the timing sketch after this list)
  • Efficient memory management for video processing
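
As a rough illustration of how the FPS and inference-time figures can be measured around a lazily evaluated MLX forward pass (function names, cache limit, and input shape are assumptions):

```swift
import Foundation
import MLX
import MLXRandom

/// Rough timing harness for one inference step. MLX evaluates lazily, so
/// `eval` is called to make sure the work has actually run before the clock
/// stops. `classifier` stands in for the real encoder + attentive pooler.
func timedInference(_ classifier: (MLXArray) -> MLXArray,
                    clip: MLXArray) -> (logits: MLXArray, seconds: Double) {
    let start = CFAbsoluteTimeGetCurrent()
    let logits = classifier(clip)
    eval(logits)                                   // force the lazy graph to execute
    return (logits, CFAbsoluteTimeGetCurrent() - start)
}

/// Example usage with a dummy clip and a stand-in classifier. The cache limit
/// and the (batch, channels, frames, height, width) shape are illustrative.
func reportInferenceSpeed() {
    MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)      // keep the GPU buffer cache bounded

    let dummyClip = MLXRandom.normal([1, 3, 16, 256, 256])
    let (_, seconds) = timedInference({ $0.sum(axes: [2, 3, 4]) }, clip: dummyClip)
    print(String(format: "inference: %.1f ms (~%.1f FPS)",
                 seconds * 1000, 1.0 / max(seconds, 1e-6)))
}
```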
