Mind-Blowing Dream-To-Video Could Be Coming With Stable Diffusion Video Rebuild From Brain Activity #247
FurkanGozukara
announced in
Tutorials
Full tutorial: https://www.youtube.com/watch?v=dmzdoMnuloo
In this groundbreaking video, we delve into the realm of mind-video and brain-activity reconstruction, bringing you an in-depth discussion of a new research paper titled "Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity". This may open the door to a dream-to-video era.
If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on Patreon 🥰⤵️
https://www.patreon.com/SECourses
Technology & Science: News, Tips, Tutorials, Tricks, Best Applications, Guides, Reviews⤵️
https://www.youtube.com/playlist?list=PL_pbwdIyffsnkay6X91BWb9rrfLATUMr3
Playlist of #StableDiffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img⤵️
https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3
Research Paper⤵️
https://arxiv.org/pdf/2305.11675.pdf
Video Footage Source⤵️
https://mind-video.com/
This fascinating research explores the intersection of neuroscience, machine learning and video generation, aiming to understand and recreate visual experiences directly from brain signals. Using advanced techniques such as masked brain modeling, multimodal contrastive learning and co-training with an augmented Stable Diffusion model, the MinD-Video approach seeks to convert functional Magnetic Resonance Imaging (fMRI) data into high-quality videos.
We dissect the various components of the MinD-Video methodology, focusing on the fMRI encoder and the video generative model. We also discuss the paper's innovative use of progressive learning and explain the pre-processing of the fMRI data for efficient results.
Further, we explore how the research attempts to address the challenges of time delays and individual variations in brain activity. We go in depth into each stage of the progressive learning applied to the fMRI encoder, from general to semantic-related features and from large-scale pre-training to contrastive learning.
Discover how the Stable Diffusion model is adapted for video generation, and how scene-dynamic sparse causal attention ensures smooth video transitions. We also cover the use of adversarial guidance in controlling the diversity of generated videos and how attention maps help visualize the learning process.
Perfect for anyone interested in neuroscience, machine learning or video generation, this video provides a comprehensive overview of a cutting-edge approach in brain-activity reconstruction. Expand your knowledge and join the discussion as we explore the future of mind-video.
For a more detailed understanding, the link to the full research paper is provided in the description. Stay curious, keep learning, and don't forget to like, comment, and subscribe for more exciting content.
Abstract
Reconstructing human vision from brain activities has been an appealing task that helps to understand our cognitive process. Even though recent research has seen great success in reconstructing static images from non-invasive brain recordings, work on recovering continuous visual experiences in the form of videos is limited. In this work, we propose MinD-Video that learns spatiotemporal information from continuous fMRI data of the cerebral cortex progressively through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation. We show that high-quality videos of arbitrary frame rates can be reconstructed with MinD-Video using adversarial guidance. The recovered videos were evaluated with various semantic and pixel-level metrics. We achieved an average accuracy of 85% in semantic classification tasks and 0.19 in structural similarity index (SSIM), outperforming the previous state-of-the-art by 45%. We also show that our model is biologically plausible and interpretable, reflecting established physiological processes.
Introduction
Life unfolds like a film reel, each moment seamlessly transitioning into the next, forming a “perpetual theater” of experiences. This dynamic narrative forms our perception, explored through the naturalistic paradigm, painting the brain as a moviegoer engrossed in the relentless film of experience. Understanding the information hidden within our complex brain activities is a big puzzle in cognitive neuroscience. Recreating human vision from brain recordings, especially using non-invasive tools like functional Magnetic Resonance Imaging (fMRI), is an exciting but difficult task. Non-invasive methods, while less intrusive, capture limited information and are susceptible to interference such as noise. Furthermore, the acquisition of neuroimaging data is a complex, costly process. Despite these complexities, progress has been made, notably in learning valuable fMRI features with limited fMRI-annotation pairs.
#MinDVideo #fMRI
Video Transcription
00:00:00 Greetings everyone.
00:00:01 It looks like Dream to Video is coming soon.
00:00:03 Today, I will introduce you to a new research paper.
00:00:06 MinD-Video.
00:00:07 Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity.
00:00:12 The research paper focuses on reconstructing high-quality videos from brain activity, aiming
00:00:17 to understand the cognitive process and visual perception.
00:00:20 The proposed approach, called MinD-Video, utilizes masked brain modeling, multimodal
00:00:25 contrastive learning, and co-training with an augmented Stable Diffusion model to learn
00:00:31 spatiotemporal information from continuous functional Magnetic Resonance Imaging (fMRI)
00:00:36 data.
00:00:37 The paper focuses on recreating human vision from brain recordings, particularly using
00:00:42 non-invasive tools like fMRI.
00:00:44 The unique challenge of reconstructing dynamic visual experiences from fMRI data is addressed,
00:00:51 considering the time delays in capturing brain activity and the variations in hemodynamic
00:00:55 response across individuals.
00:00:56 The MinD-Video methodology consists of two modules: an fMRI encoder and a video generative
00:01:03 model.
00:01:04 The fMRI encoder progressively learns from brain signals, starting with general visual
00:01:09 fMRI features obtained through large-scale unsupervised learning with masked brain modeling.
00:01:16 Semantic-related features are then distilled using multimodal contrastive learning in the
00:01:20 Contrastive Language-Image Pre-Training (CLIP) space.
00:01:24 The augmented Stable Diffusion model is employed for video generation, with scene-dynamic sparse
00:01:29 causal attention to handle scene changes and temporal constraints.
00:01:33 The fMRI data captured during visual stimuli is pre-processed to identify the regions of
00:01:39 interest (ROIs) in the visual cortex.
00:01:43 The activated voxels are determined through statistical tests, and the top 50% most significant
00:01:48 voxels are selected.
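As a rough illustration of the voxel selection step just described, here is a minimal Python sketch. It assumes every voxel already has a p-value from some statistical test. The function name `select_top_voxels`, the `fraction` parameter and the sample p-values are made up for illustration and are not taken from the paper's code.

```python
def select_top_voxels(p_values, fraction=0.5):
    """Return the indices of the most significant voxels (smallest p-values)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(len(p_values) * fraction))
    # Rank voxel indices by ascending p-value: a smaller p-value means
    # a more significant activation.
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    # Return the kept indices in their original spatial order.
    return sorted(ranked[:keep])


# Hypothetical p-values for six voxels (illustrative numbers only).
p_values = [0.20, 0.001, 0.04, 0.0005, 0.5, 0.03]
print(select_top_voxels(p_values))  # → [1, 3, 5]
```

With `fraction=0.5`, three of the six voxels survive: the ones with the three smallest p-values.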
00:01:50 Progressive learning is employed as an efficient training scheme for the fMRI encoder.
00:01:55 The encoder undergoes multiple stages to learn fMRI features progressively, starting from
00:02:01 general features to more specific and semantic-related features.
00:02:05 Large-scale pre-training with masked brain modeling is utilized to learn general features
00:02:09 of the visual cortex.
00:02:11 An autoencoder architecture is trained on the Human Connectome Project dataset using
00:02:16 the visual cortex regions defined by a parcellation method.
00:02:19 The goal of this pre-training is to obtain rich and compact embeddings that describe
00:02:24 the original fMRI data effectively.
00:02:27 Spatiotemporal attention is introduced to process multiple fMRI frames in a sliding
00:02:32 window, considering the time delays caused by the hemodynamic response.
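The sliding window over fMRI frames can be sketched in a few lines of Python. The function name `sliding_windows`, the window size of 2 and the stride of 1 are assumptions chosen for illustration; the video does not state the window size that was used.

```python
def sliding_windows(frames, window_size, stride=1):
    """Group consecutive fMRI frames into overlapping windows."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")
    last_start = len(frames) - window_size
    return [frames[start:start + window_size]
            for start in range(0, last_start + 1, stride)]


# Four toy frames; each string stands in for one fMRI volume.
frames = ["t0", "t1", "t2", "t3"]
print(sliding_windows(frames, window_size=2))
# → [['t0', 't1'], ['t1', 't2'], ['t2', 't3']]
```

Because the hemodynamic response lags behind the stimulus, a window that spans several frames gives the encoder a chance to see the delayed signal that belongs to the current moment of the video.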
00:02:37 The augmented fMRI encoder is further trained using multimodal contrastive learning.
00:02:42 Triplets consisting of fMRI, video, and caption are used for training.
00:02:47 Videos are downsampled and captioned with the BLIP model.
00:02:51 Contrastive learning is applied to pull the fMRI embeddings closer to a shared CLIP space,
00:02:56 which contains rich semantic information.
00:02:59 The aim is to make the fMRI embeddings more understandable by the generative model during
00:03:03 conditioning.
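The contrastive step can be illustrated with a generic InfoNCE-style loss in plain Python. This is a sketch under assumptions, not the paper's exact objective: it only matches each fMRI embedding to its paired CLIP embedding in one direction, and the function names, the temperature of 0.07 and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def contrastive_loss(fmri_embs, clip_embs, temperature=0.07):
    """Mean cross-entropy of picking the paired CLIP embedding for each fMRI one.

    fmri_embs[i] and clip_embs[i] are assumed to come from the same sample.
    """
    total = 0.0
    for i, f in enumerate(fmri_embs):
        logits = [cosine(f, c) / temperature for c in clip_embs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(fmri_embs)


clip = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], clip)
mismatched = contrastive_loss([[0.0, 1.0], [1.0, 0.0]], clip)
print(aligned < mismatched)  # → True
```

The loss is small when each fMRI embedding sits next to its own CLIP embedding and large when it sits next to someone else's, which is what "pulling the embeddings closer" means in practice.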
00:03:04 The Stable Diffusion model is used as the base generative model, modified to handle
00:03:08 video generation.
00:03:11 Scene-dynamic sparse causal attention is employed to condition each video frame on its previous
00:03:15 two frames, allowing for scene changes while ensuring video smoothness.
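To make "conditioning each frame on its previous two frames" concrete, the sketch below lists which frames each frame would take its attention keys and values from. It covers only the indexing, not the attention itself. The function name `previous_frame_indices` is hypothetical, and clamping the first frames to index 0 is an assumption about how the start of a clip might be handled.

```python
def previous_frame_indices(num_frames, num_previous=2):
    """For each frame, list the earlier frames it is conditioned on.

    Indices before the start of the clip are clamped to frame 0
    (an assumption made for this sketch).
    """
    return [[max(0, t - k) for k in range(num_previous, 0, -1)]
            for t in range(num_frames)]


for t, context in enumerate(previous_frame_indices(4)):
    print(f"frame {t} attends to frames {context}")
# frame 0 attends to frames [0, 0]
# frame 1 attends to frames [0, 0]
# frame 2 attends to frames [0, 1]
# frame 3 attends to frames [1, 2]
```

Since no frame looks further back than two steps, the content can drift over the length of a clip, which leaves room for scene changes, while neighbouring frames still stay consistent with each other.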
00:03:20 Adversarial guidance is introduced to control the diversity of generated videos based on
00:03:25 positive and negative conditions.
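One plausible way to combine a positive and a negative condition is sketched below, in the style of classifier-free guidance. The exact formula is not given in the video, so treat this as an assumption and check the paper for the real equation. The function name `guided_noise`, the argument names and the numbers are illustrative only; each list stands in for a noise prediction from the diffusion model.

```python
def guided_noise(eps_uncond, eps_positive, eps_negative, scale):
    """Combine noise predictions so sampling moves toward the positive
    condition and away from the negative one.

    Assumed form: eps_uncond + scale * (eps_positive - eps_negative).
    """
    return [u + scale * (p - n)
            for u, p, n in zip(eps_uncond, eps_positive, eps_negative)]


# Toy two-element "noise predictions" (illustrative numbers only).
print(guided_noise([0.0, 1.0], [1.0, 1.0], [0.5, 3.0], scale=2.0))
# → [1.0, -3.0]
```

A larger `scale` pushes the result harder toward the positive condition, which typically trades diversity for fidelity.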
00:03:27 The generative module is trained with the target dataset using text conditioning.
00:03:32 The paper aims to understand the biological principles of the decoding process.
00:03:37 Attention maps from different layers of the fMRI encoder are visualized to observe the
00:03:41 transition from capturing local relations to recognizing global, abstract features.
00:03:47 The attention maps are projected back to brain surface maps, enabling the observation of
00:03:52 each brain region's contributions and the learning progress through each training stage.
00:03:57 To learn more, please check the description for the link to the paper.