This repository was archived by the owner on Nov 12, 2025. It is now read-only.

Commit c1ecbc3

DelinQu and EO-Robotics authored
Qwen 2.5 VL Dependency Decoupling and EO-1 Codebase Simplification (#16)
* This pull request sets up the initial GitHub repository configuration and automation for the EO-1 project (#5) (#6)
* Add initial project structure with configuration files, datasets, and example scripts
* Update .gitignore to include new demo data paths, modify pre-commit configuration to exclude additional directories, and enhance README with more examples and installation instructions. Adjust dataset handling in pipeline configuration and dataset classes for improved training flexibility. Remove deprecated demo scripts and refine evaluation scripts for clarity.
* Update .gitignore to include demo data paths, enhance README with additional examples, and modify Libero benchmark configuration files for improved clarity and structure. Adjust training scripts and evaluation settings across various experiments for consistency.
* Remove fast testing workflow configuration from GitHub Actions
* Update pre-commit configuration to refine exclusions, enhance README with structured examples, and remove unused imports in the EO model script.

Co-authored-by: dlqu_0010 <dlqu22@m.fudan.edu.cn>

* Refactor model input handling for multimodal data, including image and video features
* Refactor model input handling for multimodal data, including image and video features
* Refactor training scripts and configuration files for improved clarity and performance.
* Update .gitignore to include new output paths, modify dataset configurations for improved clarity, and adjust training scripts for consistency across experiments. Enhance README documentation for better guidance on dataset preparation and training processes.
* Refactor import order in test_vlm.py for improved readability and consistency.
* Update pre-commit configuration to exclude processing_eo1.py from bandit checks for improved security analysis.
* Refactor EO1 configuration and processing classes for improved structure and functionality. Updated EO1VisionFlowMatchingConfig to inherit from PretrainedConfig, streamlined initialization, and added keys_to_ignore_at_inference. Enhanced EO1VisionProcessor to support new text processing capabilities and improved handling of robot inputs and outputs. Adjusted class names for consistency and clarity.
* Update .gitignore to exclude hf_save_pretrained.py and enhance README with integration details for EO-1 with LERobot. Refactor dataset handling in MultimodaLeRobotDataset and adjust model architecture in EO1VisionFlowMatchingModel for improved functionality. Update training utilities for better configuration management and streamline processor methods for action selection.

---------

Co-authored-by: EO-Robotics-Team <22110240029@m.fudan.edu.cn>
1 parent 35612c8 commit c1ecbc3

18 files changed

Lines changed: 427 additions & 403 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -237,4 +237,5 @@ demo_data/demos25
 
 demo_data/libero_spatial_no_noops_1.0.0_lerobot
 experiments/test
+tools/hf_save_pretrained.py
 dev/

README.md

Lines changed: 2 additions & 1 deletion
@@ -82,7 +82,7 @@ pip install --upgrade setuptools
 # install flash-attn 2
 MAX_JOBS=4 pip install flash-attn==2.8.3 --no-build-isolation
 
-# [recommended] install from source with H100 / H800 GPU, CUDA 12.8 for best performance
+# [recommended] ⭐️ install flash-attn 3 from source with H100 / H800 GPU, CUDA 12.8 for best performance
 # git clone https://github.com/Dao-AILab/flash-attn.git -b v2.8.3 --recursive --depth 1
 # cd hopper && python setup.py install
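The README lines above are shell commands; as a purely illustrative sanity check (not part of this commit), something like the following can confirm that flash-attn imported correctly and that the GPU is the Hopper class the flash-attn 3 build targets:

```python
# Illustrative check only (not repo code): verify the flash-attn install and that
# the GPU reports Hopper compute capability (H100 / H800 report 9.0).
import torch
import flash_attn

print("flash-attn version:", flash_attn.__version__)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("Hopper-class GPU:", (major, minor) >= (9, 0))
```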

@@ -308,6 +308,7 @@ Robot Control Benchmark Results
 ## 📅 Roadmap
 
 - [x] 🤖 Release [EO-1](https://huggingface.co/IPEC-COMMUNITY/EO-1-3B) pretraining, finetune scripts, and documentations.
+- [x] Integrate into [LERobot](https://github.com/huggingface/lerobot). We have merged the [PR](https://github.com/huggingface/lerobot/pull/1971) into the main branch. You can now use EO-1 with LERobot without any modifications.
 - [ ] 🤗 Release [pre-training models](https://huggingface.co/collections/IPEC-COMMUNITY/eo-robotics-68ac4ff30e1f746cac28ca14), Interleaved Dataset `EO-Data1.5M` and benchmark `EO-Bench`.
 - [ ] ⚡️ Efficient LLM Inference over Long Sequences, Efficient KV-cache, etc.
 - [ ] 🤖 Integrate with human feedback fine-tuning.
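Not part of the diff: since the refactored config registers itself for auto classes (see the configuration_eo1.py change below) and the checkpoint linked in the roadmap lives on the Hub, a minimal sketch of pulling the custom config could look roughly like this, assuming the checkpoint ships the registered remote code:

```python
# Sketch under assumptions (not from this commit): load the custom EO-1 config
# from the Hub checkpoint; trust_remote_code is required because the config class
# is defined in the model repo rather than in transformers itself.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
print(config.model_type)         # "eo1" after this refactor
print(config.action_chunk_size)  # flow-matching action horizon (default 50)
```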

eo/data/dataset.py

Lines changed: 4 additions & 1 deletion
@@ -88,7 +88,7 @@ def __init__(
         if len(data_configs.mm_datasets) > 0:
             mm_dataset = MultimodaDataset(
                 data_configs=data_configs.mm_datasets,
-                max_packed_length=args.max_packed_length,
+                # max_packed_length=args.max_packed_length,
                 max_action_dim=args.max_action_dim,
                 meta_dataset=lerobot_dataset,
                 chunk_size=args.chunk_size,
@@ -327,6 +327,9 @@ def __getitem__(self, i) -> dict[str, torch.Tensor]:
     def info_qwen_vision_fetch(self):
         from qwen_vl_utils import smart_resize
 
+        if not self.lerobot_dataset:
+            return
+
         print(f"qwen2.5 vl min pixel {self.args.image_min_pixels}, max pixel {self.args.image_max_pixels}")
         for dataset in self.lerobot_dataset._datasets:
             meta_features, video_key = dataset.meta.features, dataset.select_video_keys
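For context (not part of the change): `info_qwen_vision_fetch` uses `qwen_vl_utils.smart_resize` to work out how large each frame becomes under the configured pixel budget. A standalone illustration of that call, with made-up pixel limits rather than the repo's actual defaults:

```python
# Illustration only (the pixel budget values here are assumptions):
# smart_resize snaps a frame to multiples of the 28-px patch grid while keeping
# its area inside [min_pixels, max_pixels].
from qwen_vl_utils import smart_resize

height, width = 480, 640
new_h, new_w = smart_resize(
    height, width, factor=28, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28
)
# With Qwen2.5-VL's 2x2 token merge, one frame costs roughly this many visual tokens:
num_tokens = (new_h // 28) * (new_w // 28) // 4
print(new_h, new_w, num_tokens)
```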

eo/data/lerobot_dataset.py

Lines changed: 0 additions & 1 deletion
@@ -77,7 +77,6 @@ def __init__(
         delta_action: bool = False,
         effector_indices: list[int] | None = None,
         weight: float | None = None,
-        chunk_size: int = 32,
     ):
         super().__init__(
             repo_id=repo_id,

eo/model/configuration_eo1.py

Lines changed: 30 additions & 35 deletions
@@ -12,66 +12,61 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+from transformers.configuration_utils import PretrainedConfig
 from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import (
-    Qwen2_5_VLConfig,
     Qwen2_5_VLTextConfig,
     Qwen2_5_VLVisionConfig,
 )
 
 
-class EO1VisionVLTextConfig(Qwen2_5_VLTextConfig):
-    def __init__(
-        self,
-        state_token_id=None,
-        action_token_start_id=None,
-        action_token_id=None,
-        action_pass_id=None,
-        vision_token_start_id=None,
-        image_token_id=None,
-        video_token_id=None,
-        **kwargs,
-    ):
-        super().__init__(**kwargs)
-        self.state_token_id = state_token_id
-        self.action_token_start_id = action_token_start_id
-        self.action_token_id = action_token_id
-        self.action_pass_id = action_pass_id
-
-        self.vision_token_start_id = vision_token_start_id
-        self.image_token_id = image_token_id
-        self.video_token_id = video_token_id
-
-
-class EO1VisionFlowMatchingConfig(Qwen2_5_VLConfig):
-    model_type = "onevision_fm"
-    sub_configs = {"vision_config": Qwen2_5_VLVisionConfig, "text_config": EO1VisionVLTextConfig}
+class EO1VisionFlowMatchingConfig(PretrainedConfig):
+    model_type = "eo1"
+    sub_configs = {"vision_config": Qwen2_5_VLVisionConfig, "text_config": Qwen2_5_VLTextConfig}
+    keys_to_ignore_at_inference = ["past_key_values"]
 
     def __init__(
         self,
         text_config=None,
         vision_config=None,
         image_token_id=151655,
         video_token_id=151656,
-        # flow matching specific
         action_chunk_size=50,
         max_action_dim=32,
         num_denoise_steps=10,
         action_act="linear",
         num_action_layers=2,
+        state_token_id=151670,
+        action_token_id=151666,
+        action_pass_id=151667,
         **kwargs,
     ):
-        super().__init__(
-            text_config=text_config,
-            vision_config=vision_config,
-            image_token_id=image_token_id,
-            video_token_id=video_token_id,
-            **kwargs,
-        )
+        if isinstance(vision_config, dict):
+            self.vision_config = self.sub_configs["vision_config"](**vision_config)
+        elif vision_config is None:
+            self.vision_config = self.sub_configs["vision_config"](
+                hidden_size=1280,
+                out_hidden_size=2048,
+                tokens_per_second=2,
+            )
+
+        if isinstance(text_config, dict):
+            self.text_config = self.sub_configs["text_config"](**text_config)
+        elif text_config is None:
+            self.text_config = self.sub_configs["text_config"](**kwargs)
+
+        self.image_token_id = image_token_id
+        self.video_token_id = video_token_id
+        self.state_token_id = state_token_id
+        self.action_token_id = action_token_id
+        self.action_pass_id = action_pass_id
+
         self.action_chunk_size = action_chunk_size
         self.max_action_dim = max_action_dim
         self.num_denoise_steps = num_denoise_steps
         self.action_act = action_act
         self.num_action_layers = num_action_layers
 
+        super().__init__(**kwargs)
+
 
 EO1VisionFlowMatchingConfig.register_for_auto_class()
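An unofficial sketch of exercising the refactored config (the import path follows the file location above; token ids and defaults are taken from the diff, and the save/reload step assumes the standard PretrainedConfig serialization applies):

```python
# Sketch only, not repo documentation. Build the refactored config with its
# defaults and round-trip it through the usual PretrainedConfig machinery.
from eo.model.configuration_eo1 import EO1VisionFlowMatchingConfig

config = EO1VisionFlowMatchingConfig()   # default Qwen2.5-VL vision/text sub-configs
assert config.model_type == "eo1"
assert config.action_token_id == 151666  # default from the diff above
assert config.action_chunk_size == 50

config.save_pretrained("./eo1_config")   # writes config.json
reloaded = EO1VisionFlowMatchingConfig.from_pretrained("./eo1_config")
assert reloaded.num_denoise_steps == config.num_denoise_steps
```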
