MarcoPolo Team
Alibaba International Digital Commerce
GitHub | Hugging Face | Paper | Model | Data | Demo
Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding, which are well suited to reinforcement learning, but also emphasizes open-ended solutions. Our goal is to build a general model applicable to agentic scenarios, incorporating comprehensive planning and function-calling capabilities.

Figure 1: A classic "strawberry" question answered by our Marco-o1 model: "How many 'r' are in 'strawberry'?" Although the answer is correct, the final letter 'y' is overlooked during CoT. This is an interesting finding, which is discussed in issue #3.
- [Coming Soon] Marco-o1 ???: We are working on training a more powerful reinforcement-learning-based model. The new model will provide better support for agents, with enhanced planning, task decomposition, and function-calling capabilities.
- [Coming Soon] Marco-o1 ???: We are working on training a more efficient reasoning model that can actively skip several steps in the reasoning process while maintaining performance, thereby improving reasoning efficiency. Notably, this does not require significant changes to the original model, and users can control the model's reasoning granularity.
- [2025/05/15] Our paper "Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models" has been accepted to the main conference of ACL 2025.
- [2025/02/14] We released Marco-o1 v2. This version relies entirely on self-built data and has undergone DPO. It has been optimized more comprehensively for mathematical problem-solving, planning, and instruction-following capabilities. This time, our model's ability to count letters is quite impressive!
- [2024/11/13] We released Marco-o1 v1. This initial release includes our reasoning model, optimized for complex problem-solving and versatile applications across various domains.
OpenAI recently introduced the groundbreaking o1 model, renowned for its exceptional reasoning capabilities. This model has demonstrated outstanding performance on platforms such as AIME and CodeForces, surpassing other leading models. Inspired by this success, we aimed to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex, real-world challenges.
Marco-o1 leverages advanced techniques like CoT fine-tuning, MCTS, and Reasoning Action Strategies to enhance its reasoning power. As shown in Figure 2, by fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, the Marco-o1 CoT dataset, and the Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks. MCTS allows exploration of multiple reasoning paths using confidence scores derived from softmax-applied log probabilities of the top-k alternative tokens, guiding the model to optimal solutions. Moreover, our reasoning action strategy involves varying the granularity of actions within steps and mini-steps to optimize search efficiency and accuracy.
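The confidence scoring used to guide MCTS can be sketched as follows. This is a minimal illustration under our own assumptions (the function name `step_confidence`, the tensor shapes, and top-k = 5 are illustrative), not the released implementation:

```python
import torch
import torch.nn.functional as F

def step_confidence(logits: torch.Tensor, chosen_ids: torch.Tensor, top_k: int = 5) -> float:
    """Score one reasoning step (a span of generated tokens).

    For each generated token, the log probabilities of the top-k candidate
    tokens are softmax-normalised, and the probability mass assigned to the
    token actually produced is taken as that token's confidence; the step
    score is the mean confidence over its tokens.

    logits:     [seq_len, vocab_size] logits at the generated positions
    chosen_ids: [seq_len] ids of the tokens that were generated
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # [seq_len, vocab]
    topk_vals, _ = log_probs.topk(top_k, dim=-1)                   # top-k alternatives per position
    chosen_logp = log_probs.gather(-1, chosen_ids.unsqueeze(-1))   # log p of the generated token

    # Confidence = probability of the generated token renormalised over the top-k set.
    conf = torch.exp(chosen_logp.squeeze(-1)) / torch.exp(topk_vals).sum(dim=-1)
    return conf.mean().item()
```

Averaging these per-token confidences over a rollout yields the value estimate that guides node selection during the search.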
As shown in Figure 3, Marco-o1 achieved accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset, showcasing enhanced reasoning capabilities.
Additionally, in translation tasks, we show that Marco-o1 excels at translating slang expressions, such as translating "这个鞋拥有踩屎感" (literal translation: "This shoe offers a stepping-on-poop sensation.") to "This shoe has a comfortable sole," demonstrating its superior grasp of colloquial nuances.
For more details, please refer to this or our paper.
For Marco-o1 v2, we have removed some data from Open-O1 and replaced it entirely with Marco-o1 CoT data, expanding both the categories and the quantity of our CoT data. Additionally, we improved our MCTS architecture to enable dynamic addition of reflections, as shown in Figure 5, and conducted DPO using naturally formed data pairs from MCTS.
As mentioned in our paper, we found that models like R1 and QwQ often engage in reflection for the sake of reflection itself, which we call formalistic long-time thinking. This has a certain impact on distillation learning for smaller models, leading to behaviors such as repetitive generation and redundant thinking.
Data constructed using MCTS is more suitable for smaller models, as it does not involve redundant thinking and reflection. Instead, we start with planning at the very beginning of the CoT process and then gradually work through the problem, guiding the model to reflect only at appropriate moments. This aligns better with the capabilities and thinking patterns of lower-capacity models.
Additionally, we have conducted DPO using naturally formed positive and negative pairs from MCTS and have made some preliminary findings.
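A minimal sketch of how such pairs can be harvested from the search tree is shown below; the `Node` structure, `collect_dpo_pairs`, and the `margin` threshold are illustrative assumptions rather than our released pipeline:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One MCTS node: the reasoning text reached through it and its value estimate."""
    text: str
    value: float  # e.g. mean token confidence of rollouts passing through this node
    children: list["Node"] = field(default_factory=list)

def collect_dpo_pairs(root: Node, prompt: str, margin: float = 0.1) -> list[dict]:
    """Pair each node's best child against clearly worse siblings.

    The higher-valued branch becomes the 'chosen' completion and the
    lower-valued one the 'rejected' completion, yielding DPO records
    without any additional annotation.
    """
    pairs, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = sorted(node.children, key=lambda n: n.value, reverse=True)
        for worse in kids[1:]:
            if kids[0].value - worse.value >= margin:
                pairs.append({"prompt": prompt, "chosen": kids[0].text, "rejected": worse.text})
        stack.extend(node.children)
    return pairs
```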
We have open-sourced our MCTS search code. For more details, please refer to this or our paper.
We are now working on expanding the Marco-o1 family. These expansions include a more robust model based on RL, tailored for agent scenarios. This model places greater emphasis on the accuracy of function calls and on planning abilities, which are crucial for current agent applications.
Additionally, as mentioned earlier, the outputs of current reasoning models tend to be quite redundant. Unlike other works that focus on compression to enable models to distinguish problem difficulty and produce outputs of varying lengths, our goal is for the model to dynamically skip unnecessary reasoning steps based on a hyperparameter provided by the user.
For more details, we will open-source and update our latest work later.
Marco-o1 v1
Marco-o1 v2
To install Marco-o1, follow these steps:
# Clone the repository
git clone https://github.com/AIDC-AI/Marco-o1
# Change to the Marco-o1 directory
cd Marco-o1
# Install required packages
pip install -r requirements.txt
Load Marco-o1-CoT model:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-o1")
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Marco-o1")
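A minimal generation sketch using the loaded model (the question, sampling settings, and output slicing are illustrative, not repository defaults):

```python
# Build a chat-formatted prompt and generate a reply
messages = [{"role": "user", "content": "How many 'r' are in strawberry?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```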
Inference:
Execute the inference script (you can provide any customized inputs inside):
./src/output/talk_with_model.py

# Use vLLM
./src/output/talk_with_model_vllm.py
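Alternatively, a minimal vLLM sketch (the prompt and sampling values are illustrative; note that passing a raw string skips any chat template the scripts may apply):

```python
from vllm import LLM, SamplingParams

# Load the checkpoint with vLLM and run a single prompt
llm = LLM(model="AIDC-AI/Marco-o1")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)

outputs = llm.generate(["How many 'r' are in strawberry?"], params)
print(outputs[0].outputs[0].text)
```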
Deploy using FastAPI:
Check the README.md file in the examples folder.
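For orientation, here is a minimal FastAPI sketch for serving the model; the endpoint path, request schema, and file name are illustrative assumptions and do not mirror the layout of the examples folder:

```python
# server.py (hypothetical file name) -- run with: uvicorn server:app
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("AIDC-AI/Marco-o1")
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Marco-o1", device_map="auto")

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 1024

@app.post("/generate")
def generate(req: GenerateRequest):
    messages = [{"role": "user", "content": req.prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=req.max_new_tokens)
    reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    return {"response": reply}
```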
From MarcoPolo Team, AI Business, Alibaba International Digital Commerce:
If you find Marco-o1 useful for your research and applications, please cite:
@misc{zhao2024marcoo1openreasoningmodels,
title={Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions},
author={Yu Zhao and Huifeng Yin and Bo Zeng and Hao Wang and Tianqi Shi and Chenyang Lyu and Longyue Wang and Weihua Luo and Kaifu Zhang},
year={2024},
eprint={2411.14405},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.14405},
}
@misc{yin2025wideningdistillationbottleneckreasoning,
title={Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models},
author={Huifeng Yin and Yu Zhao and Minghao Wu and Xuanfan Ni and Bo Zeng and Hao Wang and Tianqi Shi and Liangying Shao and Chenyang Lyu and Longyue Wang and Weihua Luo and Kaifu Zhang},
year={2025},
eprint={2503.01461},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.01461},
}
This project is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).
We used compliance-checking algorithms during the training process to ensure the compliance of the trained model and dataset to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.