Commit 2dfa95d

update main body

1 parent dbc235b commit 2dfa95d

File tree

5 files changed: +746 additions, −109 deletions

figures/avatar.jpeg (21 KB)

figures/framework.png (657 KB)

figures/teaser.png (294 KB)

index.html

Lines changed: 126 additions & 109 deletions
@@ -77,8 +77,8 @@
     <title>World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy - Xiaokang Liu et al. | Academic Research</title>

     <!-- Favicon and App Icons -->
-    <link rel="icon" type="image/x-icon" href="static/images/favicon.ico">
-    <link rel="apple-touch-icon" href="static/images/favicon.ico">
+    <link rel="icon" type="image/jpeg" href="figures/avatar.jpeg">
+    <link rel="apple-touch-icon" href="figures/avatar.jpeg">

     <!-- Critical CSS - Load synchronously -->
     <link rel="stylesheet" href="static/css/bulma.min.css">
@@ -270,23 +270,28 @@ <h1 class="title is-1 publication-title">World-VLA-Loop : Closed-Loop Learning o
 </section>


-  <!-- Teaser video-->
+  <!-- Teaser figure-->
   <section class="hero teaser">
     <div class="container is-max-desktop">
       <div class="hero-body">
-        <!-- TODO: Replace with your teaser video -->
-        <video poster="" id="tree" autoplay controls muted loop height="100%" preload="metadata">
+        <!-- Original teaser video -->
+        <!-- <video poster="" id="tree" autoplay controls muted loop height="100%" preload="metadata"> -->
         <!-- TODO: Add your video file path here -->
-        <source src="static/videos/banner_video.mp4" type="video/mp4">
-        </video>
-        <!-- TODO: Replace with your video description -->
+        <!-- <source src="static/videos/banner_video.mp4" type="video/mp4"> -->
+        <!-- </video> -->
+        <!-- Teaser figure -->
+        <div style="text-align: center;">
+          <img src="figures/teaser.png" alt="World-VLA-Loop Teaser" style="width: 100%; max-width: 100%; height: auto;" />
+        </div>
+
+        <!-- TODO: Replace with your figure description -->
         <h2 class="subtitle has-text-centered">
-          Aliquam vitae elit ullamcorper tellus egestas pellentesque. Ut lacus tellus, maximus vel lectus at, placerat pretium mi. Maecenas dignissim tincidunt vestibulum. Sed consequat hendrerit nisl ut maximus.
+          (a) Paradigms for world-model-based VLA reinforcement learning. Existing methodologies typically rely on reconstructing the environment within a 3D world or training video world models that simulate the environment. To address the imprecise action-following inherent in existing video-based simulators, we propose World-VLA-Loop, a closed-loop paradigm that jointly optimizes the world model and the VLA policy to iteratively enhance the performance and grounding of both. (b) We show that the real-world policy success rate is improved by 36.7% after two iterations of joint optimization of the VLA model and the world model.
         </h2>
       </div>
     </div>
   </section>
-  <!-- End teaser video -->
+  <!-- End teaser figure -->

   <!-- Paper abstract -->
   <section class="section hero is-light">
@@ -297,7 +302,7 @@ <h2 class="title is-3">Abstract</h2>
       <div class="content has-text-justified">
         <!-- TODO: Replace with your paper abstract -->
         <p>
-          Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin ullamcorper tellus sed ante aliquam tempus. Etiam porttitor urna feugiat nibh elementum, et tempor dolor mattis. Donec accumsan enim augue, a vulputate nisi sodales sit amet. Proin bibendum ex eget mauris cursus euismod nec et nibh. Maecenas ac gravida ante, nec cursus dui. Vivamus purus nibh, placerat ac purus eget, sagittis vestibulum metus. Sed vestibulum bibendum lectus gravida commodo. Pellentesque auctor leo vitae sagittis suscipit.
+          Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables closed-loop reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model’s precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics.
         </p>
       </div>
     </div>
@@ -307,122 +312,134 @@ <h2 class="title is-3">Abstract</h2>
   <!-- End paper abstract -->


-  <!-- Image carousel -->
-  <section class="hero is-small">
-    <div class="hero-body">
-      <div class="container">
-        <div id="results-carousel" class="carousel results-carousel">
-          <div class="item">
-            <!-- TODO: Replace with your research result images -->
-            <img src="static/images/carousel1.jpg" alt="First research result visualization" loading="lazy"/>
-            <!-- TODO: Replace with description of this result -->
-            <h2 class="subtitle has-text-centered">
-              First image description.
-            </h2>
-          </div>
-          <div class="item">
-            <!-- Your image here -->
-            <img src="static/images/carousel2.jpg" alt="Second research result visualization" loading="lazy"/>
-            <h2 class="subtitle has-text-centered">
-              Second image description.
-            </h2>
-          </div>
-          <div class="item">
-            <!-- Your image here -->
-            <img src="static/images/carousel3.jpg" alt="Third research result visualization" loading="lazy"/>
-            <h2 class="subtitle has-text-centered">
-              Third image description.
-            </h2>
-          </div>
-          <div class="item">
-            <!-- Your image here -->
-            <img src="static/images/carousel4.jpg" alt="Fourth research result visualization" loading="lazy"/>
-            <h2 class="subtitle has-text-centered">
-              Fourth image description.
-            </h2>
-          </div>
-        </div>
-      </div>
-    </div>
-  </section>
-  <!-- End image carousel -->
-
-

-
-  <!-- Youtube video -->
-  <section class="hero is-small is-light">
-    <div class="hero-body">
-      <div class="container">
-        <!-- Paper video. -->
-        <h2 class="title is-3">Video Presentation</h2>
-        <div class="columns is-centered has-text-centered">
-          <div class="column is-four-fifths">
-
-            <div class="publication-video">
-              <!-- TODO: Replace with your YouTube video ID -->
-              <iframe src="https://www.youtube.com/embed/JkaxUblCGz0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
-            </div>
+  <section class="section hero is-small">
+    <div class="container is-max-desktop">
+      <div class="columns is-centered">
+        <div class="column is-full">
+          <div class="content">
+            <div class="level-set has-text-justified">
+              <p>
+                The World-VLA-Loop framework consists of four phases.
+                <ol>
+                  <li>Curate a success-and-near-success (SANS) dataset, mainly via manual teleoperation; only a few demonstrations are needed.</li>
+                  <li>Fine-tune the action-conditioned world model on the SANS dataset with joint reward and video supervision.</li>
+                  <li>Execute VLA policy rollouts within the world model and perform RL (GRPO) optimization.</li>
+                  <li>Deploy the refined policy in the real world, where new rollouts yield additional failure and success data that augment the SANS dataset and iteratively improve both the world model and the policy.</li>
+                </ol>
+                This cycle enables joint optimization of the world model and the VLA policy, iteratively enhancing the performance of both.
+              </p>
+            </div>
+          </div>
           </div>
-        </div>
       </div>
-      </div>
+    </div>
   </section>
-  <!-- End youtube video -->


-  <!-- Video carousel -->
   <section class="hero is-small">
-    <div class="hero-body">
-      <div class="container">
-        <h2 class="title is-3">Another Carousel</h2>
-        <div id="results-carousel" class="carousel results-carousel">
-          <div class="item item-video1">
-            <!-- TODO: Add poster image for better preview -->
-            <video poster="" id="video1" controls muted loop height="100%" preload="metadata">
-              <!-- Your video file here -->
-              <source src="static/videos/carousel1.mp4" type="video/mp4">
-            </video>
-          </div>
-          <div class="item item-video2">
-            <!-- TODO: Add poster image for better preview -->
-            <video poster="" id="video2" controls muted loop height="100%" preload="metadata">
-              <!-- Your video file here -->
-              <source src="static/videos/carousel2.mp4" type="video/mp4">
-            </video>
-          </div>
-          <div class="item item-video3">
-            <!-- TODO: Add poster image for better preview -->
-            <video poster="" id="video3" controls muted loop height="100%" preload="metadata">
-              <!-- Your video file here -->
-              <source src="static/videos/carousel3.mp4" type="video/mp4">
-            </video>
+    <div class="hero-body">
+      <div class="container">
+        <div class="columns is-centered">
+          <div class="column is-two-thirds">
+            <div class="item">
+              <!-- Your image here -->
+              <img src="figures/framework.png" alt="Full pipeline of the World-VLA-Loop framework"/>
+              <h2 class="subtitle has-text-centered">
+                <em><b>Full pipeline of our proposed framework.</b></em>
+              </h2>
            </div>
          </div>
        </div>
-      </div>
     </div>
-  </div>
   </section>
-  <!-- End video carousel -->

+  <section class="section hero is-small">
+    <div class="container is-max-desktop">
+      <div class="columns is-centered">
+        <div class="column is-full">
+          <div class="content">
+            <div class="level-set has-text-justified">
+              <p>
+                We propose <em>Deep Linear Probe Generators</em> (<strong>ProbeGen</strong>) for learning better probes. ProbeGen optimizes a deep generator module limited to linear expressivity, which shares information between the different probes. It then observes the responses from all probes and trains an MLP classifier on them. While simple, we demonstrate that it greatly enhances probing methods and also outperforms other approaches by a large margin.
+              </p>
+            </div>
+          </div>

+        </div>
+      </div>
+    </div>
+  </section>

+  <section class="hero teaser">
+    <div class="container is-max-desktop">
+      <div class="hero-body">
+        <img src="static/images/ProbeGen_results.png" alt="Main results of our method ProbeGen"/>
+      </div>
+    </div>
+  </section>


+  <section class="section hero is-small">
+    <div class="container is-max-desktop">
+      <div class="columns is-centered">
+        <div class="column is-full">
+          <div class="content">
+            <div class="level-set has-text-justified">
+              <p>
+                ProbeGen represents each model as an ordered list of output values based on carefully chosen probes. These representations often have semantic meaning, as the output space of the model (here, image pixels or logits) is semantic by design.
+              </p>
+            </div>
+          </div>
+        </div>
+      </div>
+    </div>
+  </section>

-  <!-- Paper poster -->
-  <section class="hero is-small is-light">
-    <div class="hero-body">
-      <div class="container">
-        <h2 class="title">Poster</h2>

-        <!-- TODO: Replace with your poster PDF -->
-        <iframe src="static/pdfs/sample.pdf" width="100%" height="550">
-        </iframe>
-
-      </div>
+  <section class="section hero is-small">
+    <div class="hero-body">
+      <div class="container">
+        <div id="results-carousel" class="carousel results-carousel">
+          <div class="item">
+            <!-- Your image here -->
+            <img src="static/images/mnist_queries.png" alt="MNIST INR Representation visualization"/>
+            <h2 class="subtitle has-text-centered">
+              <em><b>MNIST INR Representations.</b></em> ProbeGen chooses object-centric locations as suitable for this task, while Vanilla Probing chooses locations scattered around the image, including pixels far outside the image.
+            </h2>
+          </div>
+          <div class="item">
+            <!-- Your image here -->
+            <img src="static/images/cifar_queries.png" alt="CIFAR10 Wild Park Representation visualization"/>
+            <h2 class="subtitle has-text-centered">
+              <em><b>CIFAR10 Wild Park Representations.</b></em> The values become more uniform as the accuracy of the models decreases, and sharper as it increases. This suggests that ProbeGen uses some form of prediction entropy in its classifier. We validate this by training a classifier that only takes the entropy of each probe as its features, which already reaches a Kendall’s τ of 0.877.
+            </h2>
+          </div>
+          <div class="item">
+            <!-- Your image here -->
+            <img src="static/images/probes_comp_flat_space.png" alt="Comparing the probes learned from different algorithms"/>
+            <h2 class="subtitle has-text-centered">
+              <em><b>ProbeGen vs. Vanilla Probing Learned Probes.</b></em> Although neither is interpretable by humans, ProbeGen probes clearly have much more structure than latent-optimized ones.
+            </h2>
+          </div>
+        </div>
+      </div>
     </div>
-  </section>
-  <!--End paper poster -->
+  </section>


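The four-phase cycle added in this commit (curate SANS data, fine-tune the reward-predicting world model, run GRPO on rollouts inside it, redeploy) can be sketched in miniature. This is only an illustrative toy, not the paper's implementation: `ToyWorldModel` and `ToyPolicy` are hypothetical scalar stand-ins, with the world model reduced to a reward function over actions and GRPO reduced to its group-relative advantage normalization.

```python
import random
import statistics

# Toy sketch of World-VLA-Loop's phase 3: RL (GRPO-style) entirely inside
# a learned world model. All names and shapes here are illustrative
# assumptions, not the paper's API.

class ToyWorldModel:
    """Stand-in for the action-conditioned world model with a reward head."""
    def __init__(self, target=0.7):
        self.target = target  # action the simulated task rewards most

    def rollout_reward(self, action):
        # Reward is higher the closer the rolled-out action is to the target.
        return 1.0 - abs(action - self.target)

class ToyPolicy:
    """Stand-in for the VLA policy: samples scalar actions around a mean."""
    def __init__(self, mean=0.0, std=0.2):
        self.mean = mean
        self.std = std

    def sample(self, rng):
        return rng.gauss(self.mean, self.std)

def grpo_step(policy, world_model, rng, group_size=16, lr=0.3):
    """One GRPO-style update: roll out a group of actions in the world
    model, normalize rewards within the group (the group-relative
    advantage), and nudge the policy mean toward high-advantage actions."""
    actions = [policy.sample(rng) for _ in range(group_size)]
    rewards = [world_model.rollout_reward(a) for a in actions]
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean_r) / std_r for r in rewards]
    grad = statistics.mean(adv * (a - policy.mean)
                           for adv, a in zip(advantages, actions))
    policy.mean += lr * grad

rng = random.Random(0)
world_model = ToyWorldModel()
policy = ToyPolicy()
for _ in range(300):
    grpo_step(policy, world_model, rng)  # no physical interaction needed
print(f"learned action mean: {policy.mean:.2f} (target {world_model.target})")
```

Normalizing rewards within each sampled group is the core GRPO trick: it removes the need for a learned value baseline, which is why it pairs naturally with a world model that only has to emit scalar rewards.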