MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Beijing University of Posts and Telecommunications, 4Zhejiang University, 5Princeton University

Overview of the proposed framework. Our framework comprises two core components: (1) the MM-HELIX benchmark, which evaluates the reflective capabilities of MLLMs, and (2) the AHPO method, which boosts reflective capability and transfers the enhanced skills to general reasoning tasks.

Abstract

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples drawn from 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and to conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization, with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
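The adaptive mixing of offline supervision and online optimization described above can be sketched in a few lines. The gate below is a hypothetical illustration, not the paper's exact formulation: the function names, the sigmoid schedule, and the `threshold`/`sharpness` hyperparameters are assumptions. It simply down-weights the expert (offline) loss as the policy's on-task success rate rises, so the model relies on expert traces when rewards are sparse and shifts to independent exploration once proficient:

```python
import math

def ahpo_weights(success_rate, threshold=0.5, sharpness=10.0):
    """Illustrative adaptive gate (hypothetical hyperparameters).

    Returns (w_offline, w_online): the weight on the offline expert
    (SFT-style) loss decays smoothly as the policy's success rate on
    the task rises past `threshold`; the online RL weight grows to
    compensate, so the two always sum to 1.
    """
    w_offline = 1.0 / (1.0 + math.exp(sharpness * (success_rate - threshold)))
    return w_offline, 1.0 - w_offline

def ahpo_loss(sft_loss, rl_loss, success_rate):
    """Single-stage hybrid objective: a success-rate-dependent convex
    combination of the offline (expert) and online (RL) losses."""
    w_off, w_on = ahpo_weights(success_rate)
    return w_off * sft_loss + w_on * rl_loss
```

With a low success rate the combined objective is dominated by expert supervision (e.g. `ahpo_weights(0.0)` puts over 99% of the weight on the offline loss under these assumed hyperparameters), while a near-solved task flips the balance toward pure online policy optimization.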

Details of AHPO

Task Category


Overview of tasks in MM-HELIX benchmark. MM-HELIX contains 42 challenging tasks designed to evaluate long-chain reflective reasoning across five progressive levels of difficulty.

Evaluation results on MM-HELIX Benchmark

Evaluation results on MM-HELIX across both multimodal and text-only settings. These results underscore the ongoing difficulty MLLMs face with complex, long-chain reflective tasks. Thinking models with reflective reasoning capabilities generally achieve higher scores than those without. Furthermore, a significant modality gap is observed: performance with text-only inputs is consistently higher than with image inputs.

Model  Algorithms Txt/Img  Graphs Txt/Img  Puzzles Txt/Img  Games Txt/Img  Overall Txt/Img
Proprietary Models
GPT-5  83.0/88.5  98.3/50.4  80.9/52.6  80.0/40.0  84.5/58.1
Seed-1.5-VL  89.3/78.9  86.7/40.4  51.6/41.9  55.6/33.3  66.9/48.3
o4-mini  76.3/50.7  95.0/42.1  69.1/45.8  66.7/35.6  75.2/44.7
Gemini-2.5-Flash  92.6/66.7  88.3/40.8  52.1/36.7  49.4/28.3  67.3/42.7
GPT-4.1  61.9/44.4  73.8/35.0  30.9/16.8  13.9/8.9  43.3/25.1
GPT-4o  33.7/18.9  44.6/25.4  10.2/4.2  10.6/6.7  21.8/11.7
Open-Source Models
Intern-S1-241B-A28B  75.2/69.3  76.7/30.0  35.3/23.5  26.1/15.0  50.4/33.3
GLM-4.5V-106B-A12B-Thinking  49.6/29.3  40.4/11.3  15.3/20.2  12.2/13.9  27.0/19.5
Kimi-VL-16B-A3B-Thinking-2506  45.9/36.3  49.6/23.3  9.6/10.4  10.6/7.2  28.9/19.3
GLM-4.1V-9B-Thinking  38.1/30.7  50.4/29.2  11.6/7.4  5.0/6.1  23.7/16.3
Qwen-2.5-VL-72B  24.4/18.5  42.1/25.8  8.2/3.9  5.6/7.2  20.1/13.9
Qwen-2.5-VL-32B  22.2/15.2  46.3/22.5  8.1/4.7  5.6/6.7  20.6/12.3
QVQ-72B-Preview  22.6/21.1  36.7/16.7  4.9/3.3  6.7/3.3  17.7/11.1
MiniCPM-V-4.5-8B  20.0/20.0  32.1/20.8  5.8/3.7  0.0/3.3  13.0/10.4
InternVL3-78B  20.0/14.4  43.3/25.4  10.2/4.0  10.0/1.1  18.6/9.9
InternVL3-38B  19.3/14.1  40.8/22.5  8.2/3.5  7.8/5.6  16.7/9.7
Llama-4-Scout-109B-A17B-16E  24.1/16.3  40.8/21.3  4.4/4.2  2.2/1.7  15.2/9.7
Ovis2-34B  14.4/10.4  33.8/22.1  3.9/1.2  5.0/1.7  12.0/7.2
Gemma-3-27B-IT  20.7/10.4  44.2/22.1  6.5/0.5  5.6/1.7  16.6/6.9
Qwen-2.5-VL-7B  5.6/5.9  25.4/17.9  0.4/0.4  0.6/1.1  8.0/6.3
InternVL3-8B  8.1/5.9  28.8/16.7  1.6/0.7  1.1/1.1  8.1/4.9
Ovis2-8B  7.8/3.3  24.2/15.4  0.5/0.2  1.1/0.6  6.7/3.8
Ours
MM-HELIX-7B-Thinking  32.2/34.8  27.5/19.2  16.3/25.3  16.1/16.7  21.8/24.9
Per-task results: Algorithms
Model 24 BuySell Container Hills Crypto HIndex Rect LIS Rain
Proprietary Models
GPT-5 96.7 80.0 93.3 73.3 100.0 96.7 90.0 93.3 73.3
Seed-1.5-VL 100.0 80.0 83.3 60.0 86.7 83.3 73.3 73.3 70.0
o4-mini 86.7 10.0 36.7 43.3 60.0 66.7 50.0 63.3 40.0
Gemini-2.5-Flash 96.7 43.3 66.7 56.7 83.3 76.7 56.7 70.0 50.0
GPT-4.1 63.3 46.7 56.7 16.7 26.7 60.0 33.3 43.3 53.3
GPT-4o 10.0 30.0 23.3 0.0 0.0 30.0 23.3 33.3 20.0
Open-Source Models
Intern-S1-241B-A28B 86.7 80.0 70.0 83.3 63.3 46.7 66.7 83.3 43.3
GLM-4.5V-106B-A12B-Thinking 56.7 16.7 40.0 3.3 23.3 23.3 33.3 53.3 13.3
Kimi-VL-16B-A3B-Thinking-2506 90.0 36.7 33.3 10.0 16.7 43.3 26.7 43.3 26.7
GLM-4.1V-9B-Thinking 76.7 10.0 43.3 13.3 20.0 30.0 16.7 30.0 36.7
Qwen-2.5-VL-72B 13.3 20.0 26.7 16.7 0.0 43.3 6.7 30.0 10.0
Qwen-2.5-VL-32B 33.3 26.7 16.7 0.0 3.3 16.7 3.3 26.7 10.0
QVQ-72B-Preview 76.7 20.0 26.7 3.3 0.0 20.0 3.3 33.3 6.7
MiniCPM-V-4.5-8B 53.3 6.7 20.0 13.3 6.7 30.0 13.3 33.3 3.3
InternVL3-78B 46.7 20.0 20.0 6.7 6.7 10.0 10.0 10.0 0.0
InternVL3-38B 43.3 3.3 23.3 3.3 3.3 13.3 3.3 26.7 6.7
Llama-4-Scout-109B-A17B-16E 66.7 30.0 3.3 10.0 0.0 6.7 3.3 20.0 6.7
Ovis2-34B 23.3 0.0 3.3 6.7 0.0 20.0 13.3 26.7 0.0
Gemma-3-27B-IT 10.0 0.0 13.3 3.3 0.0 23.3 10.0 30.0 3.3
Qwen-2.5-VL-7B 10.0 0.0 6.7 0.0 0.0 10.0 3.3 23.3 0.0
InternVL3-8B 10.0 0.0 6.7 3.3 0.0 10.0 0.0 23.3 0.0
Ovis2-8B 13.3 0.0 0.0 0.0 0.0 10.0 0.0 6.7 0.0
Ours
MM-HELIX-7B-Thinking 56.7 30.0 46.7 40.0 10.0 46.7 26.7 43.3 13.3
Per-task results: Graphs
Model EulerCyc EulerPath GraphIso HamilCyc HamilPath MaxFlow ShortDist TopoSort
Proprietary Models
GPT-5 33.3 33.3 53.3 40.0 60.0 80.0 90.0 13.3
Seed-1.5-VL 23.3 30.0 56.7 23.3 46.7 70.0 60.0 13.3
o4-mini 33.3 33.3 53.3 33.3 50.0 66.7 56.7 10.0
Gemini-2.5-Flash 30.0 36.7 43.3 26.7 46.7 63.3 66.7 13.3
GPT-4.1 10.0 20.0 63.3 20.0 33.3 70.0 60.0 3.3
GPT-4o 6.7 26.7 56.7 16.7 20.0 33.3 43.3 0.0
Open-Source Models
Intern-S1-241B-A28B 16.7 26.7 50.0 16.7 23.3 50.0 56.7 0.0
GLM-4.5V-106B-A12B-Thinking 0.0 10.0 6.7 10.0 20.0 30.0 13.3 0.0
Kimi-VL-16B-A3B-Thinking-2506 16.7 20.0 46.7 16.7 26.7 40.0 20.0 0.0
GLM-4.1V-9B-Thinking 16.7 23.3 46.7 16.7 33.3 50.0 43.3 3.3
Qwen-2.5-VL-72B 16.7 23.3 56.7 10.0 20.0 43.3 36.7 0.0
Qwen-2.5-VL-32B 13.3 20.0 30.0 16.7 23.3 40.0 36.7 0.0
QVQ-72B-Preview 16.7 16.7 36.7 6.7 13.3 20.0 20.0 3.3
MiniCPM-V-4.5-8B 6.7 23.3 40.0 20.0 16.7 26.7 30.0 3.3
InternVL3-78B 10.0 20.0 46.7 16.7 26.7 40.0 40.0 3.3
InternVL3-38B 10.0 23.3 46.7 16.7 13.3 33.3 36.7 0.0
Llama-4-Scout-109B-A17B-16E 16.7 26.7 43.3 10.0 23.3 26.7 20.0 3.3
Ovis2-34B 16.7 23.3 53.3 23.3 16.7 23.3 13.3 6.7
Gemma-3-27B-IT 16.7 26.7 33.3 16.7 23.3 36.7 20.0 3.3
Qwen-2.5-VL-7B 10.0 23.3 53.3 0.0 13.3 23.3 20.0 0.0
InternVL3-8B 13.3 26.7 33.3 16.7 23.3 6.7 13.3 0.0
Ovis2-8B 16.7 10.0 26.7 23.3 13.3 16.7 16.7 0.0
Ours
MM-HELIX-7B-Thinking 16.7 23.3 20.0 10.0 26.7 26.7 30.0 3.3
Per-task results: Puzzles
Model Aqua Bina Brid Calcu Camp Eule Futo Hito Kaku Kuku Nono Num Shin Sky Snak Sudo Tapa WLad WSch
Proprietary Models
GPT-5 33.3 23.3 83.3 30.0 63.3 53.3 33.3 83.3 26.7 100.0 26.7 86.7 80.0 53.3 10.0 26.7 50.0 36.7 100.0
Seed-1.5-VL 10.0 30.0 50.0 16.7 86.7 60.0 20.0 40.0 36.7 63.3 3.3 6.7 70.0 40.0 100.0 50.0 33.3 20.0 60.0
o4-mini 26.7 13.3 73.3 23.3 53.3 50.0 30.0 43.3 43.3 76.7 13.3 50.0 43.3 43.3 96.7 3.3 43.3 30.0 100.0
Gemini-2.5-Flash 3.3 20.0 60.0 3.3 46.7 46.7 16.7 63.3 36.7 40.0 0.0 40.0 40.0 40.0 83.3 40.0 36.7 10.0 70.0
GPT-4.1 3.3 0.0 46.7 13.3 13.3 33.3 10.0 40.0 16.7 60.0 0.0 16.7 53.3 20.0 43.3 0.0 3.3 10.0 33.3
GPT-4o 0.0 20.0 0.0 3.3 16.7 20.0 10.0 16.7 13.3 33.3 0.0 0.0 6.7 3.3 33.3 0.0 0.0 13.3 13.3
Open-Source Models
Intern-S1-241B-A28B 3.3 23.3 60.0 26.7 20.0 16.7 20.0 20.0 30.0 0.0 0.0 26.7 13.3 23.3 53.3 53.3 16.7 0.0 43.3
GLM-4.5V-106B-A12B-Thinking 13.3 30.0 13.3 6.7 60.0 6.7 6.7 30.0 0.0 33.3 0.0 0.0 0.0 6.7 40.0 20.0 6.7 6.7 50.0
Kimi-VL-16B-A3B-Thinking-2506 3.3 16.7 20.0 6.7 16.7 13.3 26.7 13.3 16.7 10.0 0.0 10.0 3.3 6.7 50.0 10.0 0.0 0.0 26.7
GLM-4.1V-9B-Thinking 6.7 16.7 6.7 13.3 40.0 10.0 3.3 16.7 13.3 20.0 0.0 10.0 0.0 3.3 30.0 3.3 6.7 0.0 0.0
Qwen-2.5-VL-72B 0.0 6.7 13.3 10.0 23.3 16.7 10.0 6.7 0.0 6.7 0.0 6.7 16.7 3.3 13.3 6.7 0.0 0.0 10.0
Qwen-2.5-VL-32B 3.3 0.0 10.0 3.3 6.7 3.3 16.7 0.0 6.7 13.3 0.0 6.7 3.3 3.3 16.7 3.3 0.0 0.0 3.3
QVQ-72B-Preview 10.0 13.3 6.7 6.7 6.7 6.7 16.7 10.0 13.3 0.0 0.0 0.0 6.7 0.0 0.0 0.0 0.0 6.7 10.0
MiniCPM-V-4.5-8B 6.7 3.3 10.0 10.0 20.0 10.0 20.0 6.7 6.7 6.7 0.0 0.0 0.0 0.0 6.7 0.0 0.0 0.0 16.7
InternVL3-78B 0.0 0.0 30.0 26.7 3.3 3.3 6.7 3.3 0.0 13.3 0.0 0.0 6.7 3.3 6.7 0.0 0.0 3.3 0.0
InternVL3-38B 3.3 3.3 16.7 20.0 0.0 3.3 13.3 10.0 6.7 10.0 0.0 0.0 3.3 3.3 3.3 0.0 0.0 0.0 3.3
Llama-4-Scout-109B-A17B-16E 6.7 10.0 13.3 3.3 30.0 16.7 3.3 23.3 10.0 3.3 0.0 0.0 20.0 3.3 13.3 0.0 0.0 6.7 16.7
Ovis2-34B 13.3 30.0 6.7 13.3 6.7 13.3 30.0 10.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 3.3 3.3 0.0 0.0
Gemma-3-27B-IT 6.7 3.3 10.0 3.3 0.0 0.0 6.7 13.3 10.0 3.3 0.0 0.0 3.3 0.0 3.3 0.0 0.0 0.0 10.0
Qwen-2.5-VL-7B 13.3 0.0 3.3 0.0 6.7 0.0 10.0 6.7 10.0 16.7 0.0 0.0 0.0 0.0 3.3 0.0 0.0 6.7 3.3
InternVL3-8B 6.7 0.0 3.3 16.7 10.0 0.0 10.0 3.3 6.7 0.0 0.0 0.0 0.0 3.3 0.0 0.0 6.7 0.0 3.3
Ovis2-8B 0.0 10.0 0.0 0.0 0.0 6.7 3.3 10.0 0.0 3.3 0.0 0.0 0.0 0.0 0.0 3.3 6.7 0.0 6.7
Ours
MM-HELIX-7B-Thinking 3.3 23.3 50.0 6.7 46.7 43.3 20.0 13.3 30.0 26.7 13.3 23.3 60.0 20.0 16.7 30.0 6.7 6.7 40.0
Per-task results: Games
Model Maze Mine Nib Slide Soko Hanoi
Proprietary Models
GPT-5 10.0 23.3 10.0 86.7 16.7 93.3
Seed-1.5-VL 6.7 53.3 20.0 63.3 3.3 53.3
o4-mini 6.7 26.7 10.0 66.7 13.3 90.0
Gemini-2.5-Flash 0.0 50.0 13.3 46.7 3.3 56.7
GPT-4.1 3.3 0.0 0.0 3.3 0.0 46.7
GPT-4o 0.0 0.0 0.0 3.3 0.0 36.7
Open-Source Models
Intern-S1-241B-A28B 0.0 20.0 0.0 36.7 0.0 33.3
GLM-4.5V-106B-A12B-Thinking 0.0 16.7 3.3 10.0 3.3 50.0
Kimi-VL-16B-A3B-Thinking-2506 0.0 3.3 0.0 3.3 0.0 10.0
GLM-4.1V-9B-Thinking 0.0 0.0 3.3 3.3 0.0 26.7
Qwen-2.5-VL-72B 0.0 20.0 0.0 36.7 3.3 26.7
Qwen-2.5-VL-32B 0.0 16.7 3.3 33.3 0.0 6.7
QVQ-72B-Preview 0.0 3.3 3.3 6.7 0.0 16.7
MiniCPM-V-4.5-8B 0.0 0.0 0.0 3.3 3.3 13.3
InternVL3-78B 0.0 0.0 3.3 6.7 0.0 16.7
InternVL3-38B 0.0 3.3 3.3 10.0 6.7 13.3
Llama-4-Scout-109B-A17B-16E 0.0 3.3 0.0 10.0 0.0 33.3
Ovis2-34B 0.0 0.0 0.0 3.3 0.0 6.7
Gemma-3-27B-IT 0.0 0.0 0.0 3.3 0.0 3.3
Qwen-2.5-VL-7B 0.0 0.0 0.0 0.0 0.0 3.3
InternVL3-8B 0.0 0.0 0.0 0.0 0.0 6.7
Ovis2-8B 0.0 0.0 0.0 0.0 3.3 3.3
Ours
MM-HELIX-7B-Thinking 3.3 16.7 23.3 26.7 3.3 26.7

Let's Try MM-HELIX Benchmark!

Puzzles
Games
Graphs
Algorithms

Dataset Visualization

Puzzles
Games
Graphs
Algorithms

BibTeX

@article{zhao2025mmhelix,
  title={MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization},
  author={Zhao, Xiangyu and Lin, Junming and Liang, Tianhao and Zhou, Yifan and Chai, Wenhao and Gu, Yuzhe and Wang, Weiyun and Chen, Kai and Luo, Gen and Zhang, Wenwei and Yan, Junchi and Yang, Hua and Duan, Haodong and Yang, Xue},
  journal={arXiv preprint arXiv:2510.08540},
  year={2025}
}