Switch to the Chinese version (切换至中文版)
- [2025/4/20] ✨ The first attempt to replicate o3-like visual clue-tracking reasoning capabilities! We have open-sourced the SeekWorld dataset and the directly RL-trained model SeekWorld-7B!
To enhance the performance of multimodal large language models (MLLMs), recent methods have tried to stimulate pure reasoning abilities through image-based mathematical problems, chart analysis, and logical puzzles. Others focus on enhancing low-level perceptual abilities through traditional detection tasks (e.g., object detection, counting, and segmentation). Additionally, some works aim to achieve visual content re-perception in text form as part of the reasoning process. However, a key limitation is that MLLMs still rely solely on text when performing visual reasoning.
OpenAI's ChatGPT-o3 enables visual reasoning within a chain-of-thought, allowing for dynamic image manipulation (e.g., rotation, zooming, or transformation) during the reasoning process, as exemplified by the phrase "but I’ll zoom in a bit just to be absolutely sure!" This manipulation significantly boosts perceptual capabilities, enabling the model to uncover subtle, ambiguous, or easily overlooked visual clues and form a comprehensive chain of visual reasoning evidence. One particularly interesting example from the official documentation demonstrates how an image can be used to infer the filming location of a movie. In such scenarios, visual clues are extracted and reasoned over iteratively (extract visual clues, reason, extract visual clues, reason) until a final conclusion is reached. We believe the term "Visual Clue-Tracking" aptly captures this capability.
We introduce a novel task named Geolocation Reasoning. This task requires the model to reason through high-level logical relationships within the visual semantics while perceiving visual information, ultimately determining the correct location, which makes it a natural candidate for o3-like visual clue-tracking reasoning. You can gain a deeper understanding of this task through these "guess where the photo was taken" games: GeoGuessr and TuXun. We have built a rule-based reinforcement learning dataset for geolocation: SeekWorld. The dataset includes two training sets: one (Train-Clue-Tracking) contains 50 (with ongoing expansion) detailed examples of visual clue-tracking reasoning processes collected from o3, while the other (Train-No-Process) contains 8541 data points without visual clue-tracking reasoning. The former is used for Cold-Start SFT training, and the latter is used for RL training. Additionally, there are two test sets for comprehensive evaluation.
Starting from Qwen2.5-7B-VL-Instruct, we trained SeekWorld-7B with RL on Train-No-Process.
- Further expand the size of the Cold-Start SFT Train-Clue-Tracking dataset
- Cold-Start SFT (Train-Clue-Tracking) + RL (Train-No-Process): The Cold-Start SFT training has not yet been completed, and we aim to replicate o3-like visual clue-tracking abilities.
- Evaluating o3's performance on SeekWorld: Due to API restrictions, we have not yet been able to evaluate o3's performance on SeekWorld.
- Evaluating performance on other perceptual and reasoning benchmarks: We will assess how the o3-like model trained on geolocation reasoning performs across other domains.
- Visual Clue-Tracking Process: The first dataset containing the o3 model's visual clue-tracking reasoning chains.
- Global Diverse Sampling: Includes a wide variety of scenes from across the world, ensuring that the model can generalize well to different cultures, terrains, and environmental contexts.
- Image-Label Pairs Optimized for Rule-Based RL: Location watermarks are removed from the images, and each label provides additional aliases for the administrative divisions of the geographic coordinates, so that a correct answer written under a different name is not judged wrong.
- Hierarchical Difficulty Architecture: Contains three levels of reasoning difficulty—easy, medium, and hard—designed to gradually challenge and assess the model's geolocation ability.
| Dataset | Data Volume | Source |
|---|---|---|
| Train-Clue-Tracking | 50 (with ongoing expansion) | Collection of visual clue-tracking reasoning processes from o3 |
| Train-No-Process (easy-medium-hard) | 8541 (1945-941-5655) | Panoramas & user-uploaded images from Google Maps in recent years |
| Global-Test | 320 | Panoramas & user-uploaded images from Google Maps in recent years |
| China-Test | 373 | Recent Xiaohongshu App images collected on April 14, 2025, which are very unlikely to appear in any pre-training data |
Google Drive: SeekWorld. For more details about the dataset, refer to DATASET.md.
| Model | Global-Test | China-Test | Overall Accuracy |
|---|---|---|---|
| Bigger model | | | |
| GPT4o-240806🔒 | 56.50 | 31.90 | 43.26 |
| Doubao-1.5-vision-pro-32k-250115🔒 | 43.75 | 40.48 | 41.99 |
| Gemini-2.0-flash-thinking-exp-01-21🔒🧠 | 56.25 | 29.49 | 41.85 |
| QvQ-72B-max-2025-03-25🧠 | 48.13 | 31.63 | 39.25 |
| Qwen-2.5-32B-VL | 38.12 | 24.13 | 30.59 |
| Small model (7B) | | | |
| SeekWorld-7B [Cold-Start SFT + RL] (ours) | - | - | - |
| SeekWorld-7B [Direct RL] (ours) | 59.69 | 34.65 | 46.21 |
| Qwen-2.5-7B-VL [Direct RL] | 51.25 | 31.90 | 40.84 |
| Qwen-2.5-7B-VL [Direct SFT] | 37.19 | 25.47 | 30.88 |
| Qwen-2.5-7B-VL | 33.44 | 24.40 | 28.57 |
| Qwen-2.5-7B-VL + CoT | 25.31 | 21.45 | 23.23 |
Models marked with 🔒 are proprietary closed-source models, and those marked with 🧠 have enhanced reasoning capabilities. We use the Reinforce++ RL algorithm.
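The Overall Accuracy column appears to be the sample-weighted average of the two test sets (320 Global-Test images and 373 China-Test images). This is an inference from the reported numbers, not a statement from the evaluation code. A minimal sketch that reproduces the column under that assumption (the function name `overall_accuracy` is hypothetical):

```python
# Test-set sizes taken from the dataset table.
GLOBAL_TEST_SIZE = 320
CHINA_TEST_SIZE = 373


def overall_accuracy(global_acc: float, china_acc: float) -> float:
    """Sample-weighted average of two per-test-set accuracies (in percent).

    Assumption: "Overall Accuracy" pools both test sets, so each set is
    weighted by its number of images.
    """
    total = GLOBAL_TEST_SIZE + CHINA_TEST_SIZE
    weighted = global_acc * GLOBAL_TEST_SIZE + china_acc * CHINA_TEST_SIZE
    return round(weighted / total, 2)


print(overall_accuracy(59.69, 34.65))  # SeekWorld-7B [Direct RL] -> 46.21
print(overall_accuracy(56.50, 31.90))  # GPT4o-240806 -> 43.26
```

Because China-Test is slightly larger, a model's China-Test score moves the overall number a little more than its Global-Test score does.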
Currently, we have not completed Cold-Start SFT training on Train-Clue-Tracking. Direct SFT and Direct RL refer to SFT and RL training directly on Train-No-Process, respectively. Compared to Qwen-2.5-7B-VL, SeekWorld-7B attempted two optimizations during RL training: difficulty sampling, which significantly improved test accuracy, and length incentives, which lengthened the reasoning process without improving accuracy but made the intermediate reasoning easier to follow. We are also experimenting with GRM (code for geolocation reasoning).
- Difficulty Sampling: We sampled questions of varying difficulty levels in the training set. Specifically, due to the relatively large number of hard questions (5655), we doubled the number of easy (1945) and medium difficulty (941) questions, expanding them to 3890 and 1882, respectively.
- Length Incentives: We introduced a reward mechanism to encourage increased reasoning length. The reward value is set as the character count multiplied by 0.001, with a maximum reward of 1.0.
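The two optimizations can be sketched as follows. This is an illustrative sketch rather than the training code: the function names, the `difficulty` field, and the synthetic samples are hypothetical, and it assumes that "doubling" means duplicating each easy and medium sample and that the length reward counts characters of the whole response.

```python
import random
from collections import Counter


def oversample_by_difficulty(samples, factors=None, seed=0):
    """Repeat each sample according to its difficulty level, then shuffle.

    Assumption: easy and medium samples are duplicated once (factor 2),
    while hard samples are kept as-is (factor 1).
    """
    factors = factors or {"easy": 2, "medium": 2, "hard": 1}
    resampled = []
    for sample in samples:
        resampled.extend([sample] * factors[sample["difficulty"]])
    random.Random(seed).shuffle(resampled)
    return resampled


def length_reward(response: str, per_char: float = 0.001, cap: float = 1.0) -> float:
    """Length incentive: 0.001 per character, capped at 1.0."""
    return min(len(response) * per_char, cap)


# Synthetic stand-in for Train-No-Process with the reported split sizes.
train = (
    [{"id": f"easy-{i}", "difficulty": "easy"} for i in range(1945)]
    + [{"id": f"medium-{i}", "difficulty": "medium"} for i in range(941)]
    + [{"id": f"hard-{i}", "difficulty": "hard"} for i in range(5655)]
)
resampled = oversample_by_difficulty(train)
counts = Counter(sample["difficulty"] for sample in resampled)
print(counts["easy"], counts["medium"], counts["hard"])  # 3890 1882 5655
print(length_reward("x" * 5000))  # 1.0 (cap is reached at 1000 characters)
```

With these factors the resampled set holds 3890 + 1882 + 5655 = 11427 items, so hard questions drop from about two thirds of the data to about half. The length reward saturates once a response reaches 1000 characters.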
Let's test a picture and experience the power of geolocation reasoning from the o3 model:
Can you identify this location?
The process of visual clue-tracking and its results are recorded in o3_demo.md.
Let’s take a look at the effect of our SeekWorld-7B:
CUDA_VISIBLE_DEVICES=0 python simple_inference.py --image_path test.jpg --model_name TheEighthDay/SeekWorld_RL_PLUS --inference_engine transformers
<think>
The image depicts a traditional Chinese street scene with wooden buildings, red lanterns, and Chinese characters on the walls. The architecture and the style of the buildings, along with the presence of traditional Chinese elements such as the lanterns and the writing, suggest that this is a scene from a Chinese village or town. The design and the layout of the buildings are characteristic of ancient Chinese towns, which often have a historical significance and are known for their preserved traditional architecture.
The presence of Chinese characters and the style of the buildings suggest that this is likely in China, specifically in an area where such traditional architecture is common. The style of the buildings and the overall atmosphere could be indicative of a place like Lijiang, which is famous for its well-preserved old town and is located in Yunnan Province, a province known for its diverse cultural heritage and historical sites.
Given the architectural style and the cultural elements, it is reasonable to infer that this image is from a region in China that has a rich history and traditional architecture. The specific style of the buildings and the presence of traditional Chinese elements suggest that this could be a part of Yunnan Province, which is known for its ancient towns and villages that have been preserved over time.
Therefore, based on the architectural style and the cultural elements, it is reasonable to conclude that this image was taken in China, specifically in Yunnan Province, which is one of the first-level administrative regions of China.
</think>
<answer>
$China, Yunnan$
</answer>
This is indeed the ancient town of Shaxi in Yunnan, and the model provided a detailed reasoning process, locating it in Yunnan!
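Since the dataset is built for rule-based RL, an output like the one above can be scored by parsing the `<answer>` block and comparing it with the label and its aliases. The sketch below is an assumption about how such a check could look, not the reward used in the repository: the function names, the all-or-nothing 1.0/0.0 scoring, and the example alias sets are hypothetical.

```python
import re

# Matches "<answer> $Country, Region$ </answer>" as in the sample output.
ANSWER_RE = re.compile(r"<answer>\s*\$(.*?)\$\s*</answer>", re.DOTALL)


def parse_answer(output: str):
    """Return (country, region) from a model output, or None if malformed."""
    match = ANSWER_RE.search(output)
    if match is None:
        return None
    parts = [part.strip() for part in match.group(1).split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def accuracy_reward(output: str, country_aliases, region_aliases) -> float:
    """Hypothetical rule-based reward: 1.0 only if both fields match an alias."""
    parsed = parse_answer(output)
    if parsed is None:
        return 0.0
    country, region = (field.casefold() for field in parsed)
    countries = {alias.casefold() for alias in country_aliases}
    regions = {alias.casefold() for alias in region_aliases}
    return 1.0 if country in countries and region in regions else 0.0


sample_output = "<think>...</think>\n<answer>\n$China, Yunnan$\n</answer>"
# Illustrative alias sets; the real aliases come from the dataset labels.
print(parse_answer(sample_output))  # ('China', 'Yunnan')
print(accuracy_reward(sample_output, {"China", "中国"}, {"Yunnan", "云南"}))  # 1.0
```

Matching against a set of aliases rather than a single string is what keeps a correct answer from being penalized for using an alternative name of the same administrative division.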
You can try it yourself in our Online Demo.
Please refer to LMM-R1 for setting up the training environment.
cd src/lmm-r1/
bash examples/scripts/lmm_r1/train_direct_rl_seekworld.sh
We would like to express our sincere thanks to lmm-r1 and OpenRLHF for providing the excellent baseline code!
If you find our work helpful in your research, please consider citing us:
@misc{seekworld2025,
title = {{S}eek{W}orld: Geolocation is a Natural {RL} Task for o3-like Visual Clue-Tracking},
author = {Tian, Kaibin and Xin, Zijie and Liu, Jiazhen},
year = {2025},
howpublished = {\url{https://github.com/TheEighthDay/SeekWorld}},
note = {GitHub repository}
}
We warmly welcome you to contribute to the SeekWorld project! If you are interested in geolocation reasoning, you can send us a challenging test image to help us build a more comprehensive evaluation dataset. Here’s how you can contribute:
- Take a photo that contains geographic clues but whose location is not immediately recognizable (e.g., street views, lifestyle photos, architecture, natural landscapes).
- Ensure the photo corresponds to a real location (e.g., specific country and first-level administrative region). If possible, please also provide the latitude and longitude of the location. Ensure the image does not contain any personal information.
- Please include "[SeekWorld Crowd Contribution]" in the email subject. Then, send the image to our email address: tikibi001@163.com.
Kaibin Tian: 1109419614@qq.com




