Abstract: Conventional video style transfer techniques apply styles uniformly across entire frames, making it challenging to selectively transform specific objects. In this study, I propose RotoNet, a novel deep-learning framework that enables object-specific style transfer based on rotoscoping. RotoNet consists of an object tracking network and a style transfer network, and aims to selectively apply artistic styles to targeted objects within a video. By overcoming the limitations of existing style transfer models, RotoNet captures the distinctive aesthetic qualities of rotoscoping animation: precision in motion tracing, line expressiveness, and artistic interpretation of human movement.
Rotoscoping is a traditional animation technique that involves manually tracing objects in live-action video frame by frame. While it enables highly realistic motion representation, it is time-consuming, labor-intensive, and requires pre-recorded footage, making it costly and difficult to scale for large projects. To address these limitations, I propose RotoNet, a deep learning–based framework for object-specific style transfer in videos. RotoNet aims to automate the rotoscoping process, reduce production time and cost, and improve the efficiency of video stylization.
The overall architecture of RotoNet consists of two main components designed to accurately track specific objects in a video and selectively apply style transformations. The object tracking network identifies the target object specified by the user in the initial frame and consistently segments and tracks the object throughout the entire sequence of video frames. The style transfer network utilizes the binary masks generated by the object tracking network to selectively apply style only to the designated object regions within the video.
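The selective application described above amounts to compositing the stylized frame with the original frame through the binary mask. A minimal sketch of that step (the function name and toy values are illustrative, not part of RotoNet):

```python
import numpy as np

def apply_selective_style(frame, stylized, mask):
    """Keep stylized pixels inside the mask, original pixels outside.

    frame, stylized: (H, W, 3) float arrays; mask: (H, W) binary array.
    """
    mask3 = mask[..., None].astype(frame.dtype)  # broadcast mask over RGB channels
    return stylized * mask3 + frame * (1 - mask3)

# Toy example: the mask selects only the left column of a 2x2 frame.
frame = np.zeros((2, 2, 3), dtype=np.float32)      # original (all black)
stylized = np.ones((2, 2, 3), dtype=np.float32)    # stylized (all white)
mask = np.array([[1, 0], [1, 0]], dtype=np.float32)
out = apply_selective_style(frame, stylized, mask)
```

Only the masked region takes on the stylized appearance; everything outside the mask is passed through unchanged.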
For accurate object segmentation and tracking in videos, I employ SAMURAI. SAMURAI introduces motion-based modeling and a motion-aware memory selection mechanism, enabling robust object tracking even in cluttered and dynamic environments. It supports zero-shot video object segmentation, allowing the target object to be segmented and tracked throughout the entire video using only a simple prompt—such as a box or mask—in the first frame, without any additional training. Built upon the Segment Anything Model (SAM), SAMURAI ensures strong segmentation performance and provides spatiotemporal consistency tailored for the video domain.
Directly applying image style transfer models to video often leads to a lack of temporal consistency, resulting in frame-to-frame variations known as the "popping effect." To mitigate this temporal discontinuity, I adopt a blending strategy that combines the current frame with the previously stylized frame at a fixed ratio. This introduces a slight ghosting effect but enhances temporal coherence across frames, ensuring more consistent stylization and reducing visual artifacts throughout the video.
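The blending strategy can be sketched as an exponential moving average over stylized frames. The blend ratio below (`alpha = 0.8`) is an assumed value for illustration; the text specifies only that the ratio is fixed:

```python
import numpy as np

def blend_frames(stylized_current, prev_output, alpha=0.8):
    """Fixed-ratio blend of the current stylized frame with the previous output.

    alpha is an assumed ratio: higher alpha favors the current frame,
    lower alpha carries more of the previous frame (stronger ghosting).
    """
    return alpha * stylized_current + (1 - alpha) * prev_output

def stylize_video(stylized_frames, alpha=0.8):
    """Chain the blend over a sequence of per-frame stylization results."""
    outputs = [stylized_frames[0]]  # first frame has no predecessor
    for frame in stylized_frames[1:]:
        outputs.append(blend_frames(frame, outputs[-1], alpha))
    return outputs

# Toy example: an abrupt black-to-white change is softened by the blend.
frames = [np.zeros(3, dtype=np.float32), np.ones(3, dtype=np.float32)]
outs = stylize_video(frames, alpha=0.8)
```

Because each output also feeds into the next blend, abrupt per-frame changes decay smoothly instead of popping.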
| Original | Object Tracking & Segmentation |
|---|---|
| ![]() | ![]() |

| Binary Mask | Stylization |
|---|---|
| ![]() | ![]() |
This is the final project for the course *Introduction to Generative AI*.




