Evaluation results on MM-HELIX across both multimodal and text-only settings. These results underscore the ongoing difficulty MLLMs face with complex, long-chain reflective tasks. Thinking models with reflective reasoning capabilities generally achieve higher scores than those without. Furthermore, a significant modality gap is observed where text-only inputs are superior.
Model | Thinking | Breakdown by Category | Overall | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Puzzles | Games | Algorithms | Graphs | ||||||||
Txt | Img | Txt | Img | Txt | Img | Txt | Img | Txt | Img | ||
Proprietary Models | |||||||||||
GPT-5 | ✓ | 83.0 | 88.5 | 98.3 | 50.4 | 80.9 | 52.6 | 80.0 | 40.0 | 84.5 | 58.1 |
Seed-1.5-VL | ✓ | 89.3 | 78.9 | 86.7 | 40.4 | 51.6 | 41.9 | 55.6 | 33.3 | 66.9 | 48.3 |
o4-mini | ✓ | 76.3 | 50.7 | 95.0 | 42.1 | 69.1 | 45.8 | 66.7 | 35.6 | 75.2 | 44.7 |
Gemini-2.5-Flash | ✓ | 92.6 | 66.7 | 88.3 | 40.8 | 52.1 | 36.7 | 49.4 | 28.3 | 67.3 | 42.7 |
GPT-4.1 | ✗ | 61.9 | 44.4 | 73.8 | 35.0 | 30.9 | 16.8 | 13.9 | 8.9 | 43.3 | 25.1 |
GPT-4o | ✗ | 33.7 | 18.9 | 44.6 | 25.4 | 10.2 | 4.2 | 10.6 | 6.7 | 21.8 | 11.7 |
Open-Source Models | |||||||||||
Intern-S1-241B-A28B | ✓ | 75.2 | 69.3 | 76.7 | 30.0 | 35.3 | 23.5 | 26.1 | 15.0 | 50.4 | 33.3 |
GLM-4.5V-106B-A12B-Thinking | ✓ | 49.6 | 29.3 | 40.4 | 11.3 | 15.3 | 20.2 | 12.2 | 13.9 | 27.0 | 19.5 |
Kimi-VL-16B-A3B-Thinking-2506 | ✓ | 45.9 | 36.3 | 49.6 | 23.3 | 9.6 | 10.4 | 10.6 | 7.2 | 28.9 | 19.3 |
GLM-4.1V-9B-Thinking | ✓ | 38.1 | 30.7 | 50.4 | 29.2 | 11.6 | 7.4 | 5.0 | 6.1 | 23.7 | 16.3 |
Qwen-2.5-VL-72B | ✗ | 24.4 | 18.5 | 42.1 | 25.8 | 8.2 | 3.9 | 5.6 | 7.2 | 20.1 | 13.9 |
Qwen-2.5-VL-32B | ✗ | 22.2 | 15.2 | 46.3 | 22.5 | 8.1 | 4.7 | 5.6 | 6.7 | 20.6 | 12.3 |
QVQ-72B-Preview | ✓ | 22.6 | 21.1 | 36.7 | 16.7 | 4.9 | 3.3 | 6.7 | 3.3 | 17.7 | 11.1 |
MiniCPM-V-4.5-8B | ✓ | 20.0 | 20.0 | 32.1 | 20.8 | 5.8 | 3.7 | 0.0 | 3.3 | 13.0 | 10.4 |
InternVL3-78B | ✗ | 20.0 | 14.4 | 43.3 | 25.4 | 10.2 | 4.0 | 10.0 | 1.1 | 18.6 | 9.9 |
InternVL3-38B | ✗ | 19.3 | 14.1 | 40.8 | 22.5 | 8.2 | 3.5 | 7.8 | 5.6 | 16.7 | 9.7 |
Llama-4-Scout-109B-A17B-16E | ✗ | 24.1 | 16.3 | 40.8 | 21.3 | 4.4 | 4.2 | 2.2 | 1.7 | 15.2 | 9.7 |
Ovis2-34B | ✗ | 14.4 | 10.4 | 33.8 | 22.1 | 3.9 | 1.2 | 5.0 | 1.7 | 12.0 | 7.2 |
Gemma-3-27B-IT | ✗ | 20.7 | 10.4 | 44.2 | 22.1 | 6.5 | 0.5 | 5.6 | 1.7 | 16.6 | 6.9 |
Qwen-2.5-VL-7B | ✗ | 5.6 | 5.9 | 25.4 | 17.9 | 0.4 | 0.4 | 0.6 | 1.1 | 8.0 | 6.3 |
InternVL3-8B | ✗ | 8.1 | 5.9 | 28.8 | 16.7 | 1.6 | 0.7 | 1.1 | 1.1 | 8.1 | 4.9 |
Ovis2-8B | ✗ | 7.8 | 3.3 | 24.2 | 15.4 | 0.5 | 0.2 | 1.1 | 0.6 | 6.7 | 3.8 |
Ours | |||||||||||
MM-HELIX-7B-Thinking | ✓ | 32.2 | 34.8 | 27.5 | 19.2 | 16.3 | 25.3 | 16.1 | 16.7 | 21.8 | 24.9 |
Model | 24 | BuySell | Container | Hills | Crypto | HIndex | Rect | LIS | Rain | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||||||
GPT-5 | 96.7 | 80.0 | 93.3 | 73.3 | 100.0 | 96.7 | 90.0 | 93.3 | 73.3 | ||||||||||
Seed-1.5-VL | 100.0 | 80.0 | 83.3 | 60.0 | 86.7 | 83.3 | 73.3 | 73.3 | 70.0 | ||||||||||
o4-mini | 86.7 | 10.0 | 36.7 | 43.3 | 60.0 | 66.7 | 50.0 | 63.3 | 40.0 | ||||||||||
Gemini-2.5-Flash | 96.7 | 43.3 | 66.7 | 56.7 | 83.3 | 76.7 | 56.7 | 70.0 | 50.0 | ||||||||||
GPT-4.1 | 63.3 | 46.7 | 56.7 | 16.7 | 26.7 | 60.0 | 33.3 | 43.3 | 53.3 | ||||||||||
GPT-4o | 10.0 | 30.0 | 23.3 | 0.0 | 0.0 | 30.0 | 23.3 | 33.3 | 20.0 | ||||||||||
Open-Source Models | |||||||||||||||||||
Intern-S1-241B-A28B | 86.7 | 80.0 | 70.0 | 83.3 | 63.3 | 46.7 | 66.7 | 83.3 | 43.3 | ||||||||||
GLM-4.5V-106B-A12B-Thinking | 56.7 | 16.7 | 40.0 | 3.3 | 23.3 | 23.3 | 33.3 | 53.3 | 13.3 | ||||||||||
Kimi-VL-16B-A3B-Thinking-2506 | 90.0 | 36.7 | 33.3 | 10.0 | 16.7 | 43.3 | 26.7 | 43.3 | 26.7 | ||||||||||
GLM-4.1V-9B-Thinking | 76.7 | 10.0 | 43.3 | 13.3 | 20.0 | 30.0 | 16.7 | 30.0 | 36.7 | ||||||||||
Qwen-2.5-VL-72B | 13.3 | 20.0 | 26.7 | 16.7 | 0.0 | 43.3 | 6.7 | 30.0 | 10.0 | ||||||||||
Qwen-2.5-VL-32B | 33.3 | 26.7 | 16.7 | 0.0 | 3.3 | 16.7 | 3.3 | 26.7 | 10.0 | ||||||||||
QVQ-72B-Preview | 76.7 | 20.0 | 26.7 | 3.3 | 0.0 | 20.0 | 3.3 | 33.3 | 6.7 | ||||||||||
MiniCPM-V-4.5-8B | 53.3 | 6.7 | 20.0 | 13.3 | 6.7 | 30.0 | 13.3 | 33.3 | 3.3 | ||||||||||
InternVL3-78B | 46.7 | 20.0 | 20.0 | 6.7 | 6.7 | 10.0 | 10.0 | 10.0 | 0.0 | ||||||||||
InternVL3-38B | 43.3 | 3.3 | 23.3 | 3.3 | 3.3 | 13.3 | 3.3 | 26.7 | 6.7 | ||||||||||
Llama-4-Scout-109B-A17B-16E | 66.7 | 30.0 | 3.3 | 10.0 | 0.0 | 6.7 | 3.3 | 20.0 | 6.7 | ||||||||||
Ovis2-34B | 23.3 | 0.0 | 3.3 | 6.7 | 0.0 | 20.0 | 13.3 | 26.7 | 0.0 | ||||||||||
Gemma-3-27B-IT | 10.0 | 0.0 | 13.3 | 3.3 | 0.0 | 23.3 | 10.0 | 30.0 | 3.3 | ||||||||||
Qwen-2.5-VL-7B | 10.0 | 0.0 | 6.7 | 0.0 | 0.0 | 10.0 | 3.3 | 23.3 | 0.0 | ||||||||||
InternVL3-8B | 10.0 | 0.0 | 6.7 | 3.3 | 0.0 | 10.0 | 0.0 | 23.3 | 0.0 | ||||||||||
Ovis2-8B | 13.3 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 0.0 | 6.7 | 0.0 | ||||||||||
Ours | |||||||||||||||||||
MM-HELIX-7B-Thinking | 56.7 | 30.0 | 46.7 | 40.0 | 10.0 | 46.7 | 26.7 | 43.3 | 13.3 |
Model | EulerCyc | EulerPath | GraphIso | HamilCyc | HamilPath | MaxFlow | ShortDist | TopoSort | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||||||
GPT-5 | 33.3 | 33.3 | 53.3 | 40.0 | 60.0 | 80.0 | 90.0 | 13.3 | |||||||||||
Seed-1.5-VL | 23.3 | 30.0 | 56.7 | 23.3 | 46.7 | 70.0 | 60.0 | 13.3 | |||||||||||
o4-mini | 33.3 | 33.3 | 53.3 | 33.3 | 50.0 | 66.7 | 56.7 | 10.0 | |||||||||||
Gemini-2.5-Flash | 30.0 | 36.7 | 43.3 | 26.7 | 46.7 | 63.3 | 66.7 | 13.3 | |||||||||||
GPT-4.1 | 10.0 | 20.0 | 63.3 | 20.0 | 33.3 | 70.0 | 60.0 | 3.3 | |||||||||||
GPT-4o | 6.7 | 26.7 | 56.7 | 16.7 | 20.0 | 33.3 | 43.3 | 0.0 | |||||||||||
Open-Source Models | |||||||||||||||||||
Intern-S1-241B-A28B | 16.7 | 26.7 | 50.0 | 16.7 | 23.3 | 50.0 | 56.7 | 0.0 | |||||||||||
GLM-4.5V-106B-A12B-Thinking | 0.0 | 10.0 | 6.7 | 10.0 | 20.0 | 30.0 | 13.3 | 0.0 | |||||||||||
Kimi-VL-16B-A3B-Thinking-2506 | 16.7 | 20.0 | 46.7 | 16.7 | 26.7 | 40.0 | 20.0 | 0.0 | |||||||||||
GLM-4.1V-9B-Thinking | 16.7 | 23.3 | 46.7 | 16.7 | 33.3 | 50.0 | 43.3 | 3.3 | |||||||||||
Qwen-2.5-VL-72B | 16.7 | 23.3 | 56.7 | 10.0 | 20.0 | 43.3 | 36.7 | 0.0 | |||||||||||
Qwen-2.5-VL-32B | 13.3 | 20.0 | 30.0 | 16.7 | 23.3 | 40.0 | 36.7 | 0.0 | |||||||||||
QVQ-72B-Preview | 16.7 | 16.7 | 36.7 | 6.7 | 13.3 | 20.0 | 20.0 | 3.3 | |||||||||||
MiniCPM-V-4.5-8B | 6.7 | 23.3 | 40.0 | 20.0 | 16.7 | 26.7 | 30.0 | 3.3 | |||||||||||
InternVL3-78B | 10.0 | 20.0 | 46.7 | 16.7 | 26.7 | 40.0 | 40.0 | 3.3 | |||||||||||
InternVL3-38B | 10.0 | 23.3 | 46.7 | 16.7 | 13.3 | 33.3 | 36.7 | 0.0 | |||||||||||
Llama-4-Scout-109B-A17B-16E | 16.7 | 26.7 | 43.3 | 10.0 | 23.3 | 26.7 | 20.0 | 3.3 | |||||||||||
Ovis2-34B | 16.7 | 23.3 | 53.3 | 23.3 | 16.7 | 23.3 | 13.3 | 6.7 | |||||||||||
Gemma-3-27B-IT | 16.7 | 26.7 | 33.3 | 16.7 | 23.3 | 36.7 | 20.0 | 3.3 | |||||||||||
Qwen-2.5-VL-7B | 10.0 | 23.3 | 53.3 | 0.0 | 13.3 | 23.3 | 20.0 | 0.0 | |||||||||||
InternVL3-8B | 13.3 | 26.7 | 33.3 | 16.7 | 23.3 | 6.7 | 13.3 | 0.0 | |||||||||||
Ovis2-8B | 16.7 | 10.0 | 26.7 | 23.3 | 13.3 | 16.7 | 16.7 | 0.0 | |||||||||||
Ours | |||||||||||||||||||
MM-HELIX-7B-Thinking | 16.7 | 23.3 | 20.0 | 10.0 | 26.7 | 26.7 | 30.0 | 3.3 |
Model | Aqua | Bina | Brid | Calcu | Camp | Eule | Futo | Hito | Kaku | Kuku | Nono | Num | Shin | Sky | Snak | Sudo | Tapa | WLad | WSch |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||||||
GPT-5 | 33.3 | 23.3 | 83.3 | 30.0 | 63.3 | 53.3 | 33.3 | 83.3 | 26.7 | 100.0 | 26.7 | 86.7 | 80.0 | 53.3 | 10.0 | 26.7 | 50.0 | 36.7 | 100.0 |
Seed-1.5-VL | 10.0 | 30.0 | 50.0 | 16.7 | 86.7 | 60.0 | 20.0 | 40.0 | 36.7 | 63.3 | 3.3 | 6.7 | 70.0 | 40.0 | 100.0 | 50.0 | 33.3 | 20.0 | 60.0 |
o4-mini | 26.7 | 13.3 | 73.3 | 23.3 | 53.3 | 50.0 | 30.0 | 43.3 | 43.3 | 76.7 | 13.3 | 50.0 | 43.3 | 43.3 | 96.7 | 3.3 | 43.3 | 30.0 | 100.0 |
Gemini-2.5-Flash | 3.3 | 20.0 | 60.0 | 3.3 | 46.7 | 46.7 | 16.7 | 63.3 | 36.7 | 40.0 | 0.0 | 40.0 | 40.0 | 40.0 | 83.3 | 40.0 | 36.7 | 10.0 | 70.0 |
GPT-4.1 | 3.3 | 0.0 | 46.7 | 13.3 | 13.3 | 33.3 | 10.0 | 40.0 | 16.7 | 60.0 | 0.0 | 16.7 | 53.3 | 20.0 | 43.3 | 0.0 | 3.3 | 10.0 | 33.3 |
GPT-4o | 0.0 | 20.0 | 0.0 | 3.3 | 16.7 | 20.0 | 10.0 | 16.7 | 13.3 | 33.3 | 0.0 | 0.0 | 6.7 | 3.3 | 33.3 | 0.0 | 0.0 | 13.3 | 13.3 |
Open-Source Models | |||||||||||||||||||
Intern-S1-241B-A28B | 3.3 | 23.3 | 60.0 | 26.7 | 20.0 | 16.7 | 20.0 | 20.0 | 30.0 | 0.0 | 0.0 | 26.7 | 13.3 | 23.3 | 53.3 | 53.3 | 16.7 | 0.0 | 43.3 |
GLM-4.5V-106B-A12B-Thinking | 13.3 | 30.0 | 13.3 | 6.7 | 60.0 | 6.7 | 6.7 | 30.0 | 0.0 | 33.3 | 0.0 | 0.0 | 0.0 | 6.7 | 40.0 | 20.0 | 6.7 | 6.7 | 50.0 |
Kimi-VL-16B-A3B-Thinking-2506 | 3.3 | 16.7 | 20.0 | 6.7 | 16.7 | 13.3 | 26.7 | 13.3 | 16.7 | 10.0 | 0.0 | 10.0 | 3.3 | 6.7 | 50.0 | 10.0 | 0.0 | 0.0 | 26.7 |
GLM-4.1V-9B-Thinking | 6.7 | 16.7 | 6.7 | 13.3 | 40.0 | 10.0 | 3.3 | 16.7 | 13.3 | 20.0 | 0.0 | 10.0 | 0.0 | 3.3 | 30.0 | 3.3 | 6.7 | 0.0 | 0.0 |
Qwen-2.5-VL-72B | 0.0 | 6.7 | 13.3 | 10.0 | 23.3 | 16.7 | 10.0 | 6.7 | 0.0 | 6.7 | 0.0 | 6.7 | 16.7 | 3.3 | 13.3 | 6.7 | 0.0 | 0.0 | 10.0 |
Qwen-2.5-VL-32B | 3.3 | 0.0 | 10.0 | 3.3 | 6.7 | 3.3 | 16.7 | 0.0 | 6.7 | 13.3 | 0.0 | 6.7 | 3.3 | 3.3 | 16.7 | 3.3 | 0.0 | 0.0 | 3.3 |
QVQ-72B-Preview | 10.0 | 13.3 | 6.7 | 6.7 | 6.7 | 6.7 | 16.7 | 10.0 | 13.3 | 0.0 | 0.0 | 0.0 | 6.7 | 0.0 | 0.0 | 0.0 | 0.0 | 6.7 | 10.0 |
MiniCPM-V-4.5-8B | 6.7 | 3.3 | 10.0 | 10.0 | 20.0 | 10.0 | 20.0 | 6.7 | 6.7 | 6.7 | 0.0 | 0.0 | 0.0 | 0.0 | 6.7 | 0.0 | 0.0 | 0.0 | 16.7 |
InternVL3-78B | 0.0 | 0.0 | 30.0 | 26.7 | 3.3 | 3.3 | 6.7 | 3.3 | 0.0 | 13.3 | 0.0 | 0.0 | 6.7 | 3.3 | 6.7 | 0.0 | 0.0 | 3.3 | 0.0 |
InternVL3-38B | 3.3 | 3.3 | 16.7 | 20.0 | 0.0 | 3.3 | 13.3 | 10.0 | 6.7 | 10.0 | 0.0 | 0.0 | 3.3 | 3.3 | 3.3 | 0.0 | 0.0 | 0.0 | 3.3 |
Llama-4-Scout-109B-A17B-16E | 6.7 | 10.0 | 13.3 | 3.3 | 30.0 | 16.7 | 3.3 | 23.3 | 10.0 | 3.3 | 0.0 | 0.0 | 20.0 | 3.3 | 13.3 | 0.0 | 0.0 | 6.7 | 16.7 |
Ovis2-34B | 13.3 | 30.0 | 6.7 | 13.3 | 6.7 | 13.3 | 30.0 | 10.0 | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | 3.3 | 0.0 | 0.0 |
Gemma-3-27B-IT | 6.7 | 3.3 | 10.0 | 3.3 | 0.0 | 0.0 | 6.7 | 13.3 | 10.0 | 3.3 | 0.0 | 0.0 | 3.3 | 0.0 | 3.3 | 0.0 | 0.0 | 0.0 | 10.0 |
Qwen-2.5-VL-7B | 13.3 | 0.0 | 3.3 | 0.0 | 6.7 | 0.0 | 10.0 | 6.7 | 10.0 | 16.7 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 0.0 | 6.7 | 3.3 |
InternVL3-8B | 6.7 | 0.0 | 3.3 | 16.7 | 10.0 | 0.0 | 10.0 | 3.3 | 6.7 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 0.0 | 6.7 | 0.0 | 3.3 |
Ovis2-8B | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 6.7 | 3.3 | 10.0 | 0.0 | 3.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | 6.7 | 0.0 | 6.7 |
Ours | |||||||||||||||||||
MM-HELIX-7B-Thinking | 3.3 | 23.3 | 50.0 | 6.7 | 46.7 | 43.3 | 20.0 | 13.3 | 30.0 | 26.7 | 13.3 | 23.3 | 60.0 | 20.0 | 16.7 | 30.0 | 6.7 | 6.7 | 40.0 |
Model | 24 | BuySell | Container | Hills | Crypto | HIndex | Rect | LIS | Rain | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||||||
GPT-5 | 96.7 | 80.0 | 93.3 | 73.3 | 100.0 | 96.7 | 90.0 | 93.3 | 73.3 | ||||||||||
Seed-1.5-VL | 100.0 | 80.0 | 83.3 | 60.0 | 86.7 | 83.3 | 73.3 | 73.3 | 70.0 | ||||||||||
o4-mini | 86.7 | 10.0 | 36.7 | 43.3 | 60.0 | 66.7 | 50.0 | 63.3 | 40.0 | ||||||||||
Gemini-2.5-Flash | 96.7 | 43.3 | 66.7 | 56.7 | 83.3 | 76.7 | 56.7 | 70.0 | 50.0 | ||||||||||
GPT-4.1 | 63.3 | 46.7 | 56.7 | 16.7 | 26.7 | 60.0 | 33.3 | 43.3 | 53.3 | ||||||||||
GPT-4o | 10.0 | 30.0 | 23.3 | 0.0 | 0.0 | 30.0 | 23.3 | 33.3 | 20.0 | ||||||||||
Open-Source Models | |||||||||||||||||||
Intern-S1-241B-A28B | 86.7 | 80.0 | 70.0 | 83.3 | 63.3 | 46.7 | 66.7 | 83.3 | 43.3 | ||||||||||
GLM-4.5V-106B-A12B-Thinking | 56.7 | 16.7 | 40.0 | 3.3 | 23.3 | 23.3 | 33.3 | 53.3 | 13.3 | ||||||||||
Kimi-VL-16B-A3B-Thinking-2506 | 90.0 | 36.7 | 33.3 | 10.0 | 16.7 | 43.3 | 26.7 | 43.3 | 26.7 | ||||||||||
GLM-4.1V-9B-Thinking | 76.7 | 10.0 | 43.3 | 13.3 | 20.0 | 30.0 | 16.7 | 30.0 | 36.7 | ||||||||||
Qwen-2.5-VL-72B | 13.3 | 20.0 | 26.7 | 16.7 | 0.0 | 43.3 | 6.7 | 30.0 | 10.0 | ||||||||||
Qwen-2.5-VL-32B | 33.3 | 26.7 | 16.7 | 0.0 | 3.3 | 16.7 | 3.3 | 26.7 | 10.0 | ||||||||||
QVQ-72B-Preview | 76.7 | 20.0 | 26.7 | 3.3 | 0.0 | 20.0 | 3.3 | 33.3 | 6.7 | ||||||||||
MiniCPM-V-4.5-8B | 53.3 | 6.7 | 20.0 | 13.3 | 6.7 | 30.0 | 13.3 | 33.3 | 3.3 | ||||||||||
InternVL3-78B | 46.7 | 20.0 | 20.0 | 6.7 | 6.7 | 10.0 | 10.0 | 10.0 | 0.0 | ||||||||||
InternVL3-38B | 43.3 | 3.3 | 23.3 | 3.3 | 3.3 | 13.3 | 3.3 | 26.7 | 6.7 | ||||||||||
Llama-4-Scout-109B-A17B-16E | 66.7 | 30.0 | 3.3 | 10.0 | 0.0 | 6.7 | 3.3 | 20.0 | 6.7 | ||||||||||
Ovis2-34B | 23.3 | 0.0 | 3.3 | 6.7 | 0.0 | 20.0 | 13.3 | 26.7 | 0.0 | ||||||||||
Gemma-3-27B-IT | 10.0 | 0.0 | 13.3 | 3.3 | 0.0 | 23.3 | 10.0 | 30.0 | 3.3 | ||||||||||
Qwen-2.5-VL-7B | 10.0 | 0.0 | 6.7 | 0.0 | 0.0 | 10.0 | 3.3 | 23.3 | 0.0 | ||||||||||
InternVL3-8B | 10.0 | 0.0 | 6.7 | 3.3 | 0.0 | 10.0 | 0.0 | 23.3 | 0.0 | ||||||||||
Ovis2-8B | 13.3 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 0.0 | 6.7 | 0.0 | ||||||||||
Ours | |||||||||||||||||||
MM-HELIX-7B-Thinking | 56.7 | 30.0 | 46.7 | 40.0 | 10.0 | 46.7 | 26.7 | 43.3 | 13.3 |
Model | Maze | Mine | Nib | Slide | Soko | Hanoi | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Proprietary Models | |||||||||||||||||||
GPT-5 | 10.0 | 23.3 | 10.0 | 86.7 | 16.7 | 93.3 | |||||||||||||
Seed-1.5-VL | 6.7 | 53.3 | 20.0 | 63.3 | 3.3 | 53.3 | |||||||||||||
o4-mini | 6.7 | 26.7 | 10.0 | 66.7 | 13.3 | 90.0 | |||||||||||||
Gemini-2.5-Flash | 0.0 | 50.0 | 13.3 | 46.7 | 3.3 | 56.7 | |||||||||||||
GPT-4.1 | 3.3 | 0.0 | 0.0 | 3.3 | 0.0 | 46.7 | |||||||||||||
GPT-4o | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 36.7 | |||||||||||||
Open-Source Models | |||||||||||||||||||
Intern-S1-241B-A28B | 0.0 | 20.0 | 0.0 | 36.7 | 0.0 | 33.3 | |||||||||||||
GLM-4.5V-106B-A12B-Thinking | 0.0 | 16.7 | 3.3 | 10.0 | 3.3 | 50.0 | |||||||||||||
Kimi-VL-16B-A3B-Thinking-2506 | 0.0 | 3.3 | 0.0 | 3.3 | 0.0 | 10.0 | |||||||||||||
GLM-4.1V-9B-Thinking | 0.0 | 0.0 | 3.3 | 3.3 | 0.0 | 26.7 | |||||||||||||
Qwen-2.5-VL-72B | 0.0 | 20.0 | 0.0 | 36.7 | 3.3 | 26.7 | |||||||||||||
Qwen-2.5-VL-32B | 0.0 | 16.7 | 3.3 | 33.3 | 0.0 | 6.7 | |||||||||||||
QVQ-72B-Preview | 0.0 | 3.3 | 3.3 | 6.7 | 0.0 | 16.7 | |||||||||||||
MiniCPM-V-4.5-8B | 0.0 | 0.0 | 0.0 | 3.3 | 3.3 | 13.3 | |||||||||||||
InternVL3-78B | 0.0 | 0.0 | 3.3 | 6.7 | 0.0 | 16.7 | |||||||||||||
InternVL3-38B | 0.0 | 3.3 | 3.3 | 10.0 | 6.7 | 13.3 | |||||||||||||
Llama-4-Scout-109B-A17B-16E | 0.0 | 3.3 | 0.0 | 10.0 | 0.0 | 33.3 | |||||||||||||
Ovis2-34B | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 6.7 | |||||||||||||
Gemma-3-27B-IT | 0.0 | 0.0 | 0.0 | 3.3 | 0.0 | 3.3 | |||||||||||||
Qwen-2.5-VL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | |||||||||||||
InternVL3-8B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.7 | |||||||||||||
Ovis2-8B | 0.0 | 0.0 | 0.0 | 0.0 | 3.3 | 3.3 | |||||||||||||
Ours | |||||||||||||||||||
MM-HELIX-7B-Thinking | 3.3 | 16.7 | 23.3 | 26.7 | 3.3 | 26.7 |