Tiny Models Tackle Math with Phased RL, MLLMs Merge Without Data, Medical Minds Learn to Reason, and Jigsaw Puzzles Sharpen Spatial Skill
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

What’s the research question?
How can we enhance the reasoning capabilities of multimodal small language models (MSLMs) through a curriculum-based phased reinforcement learning framework?
What did the authors do?
The authors developed Infi-MMR, a novel training framework designed to systematically improve the multimodal reasoning abilities of small language models by leveraging reinforcement learning in three distinct phases:
Foundational Reasoning Activation (FRA): Uses high-quality textual reasoning datasets to activate core reasoning skills in the model.
Cross-Modal Reasoning Adaptation (CMRA): Employs caption-augmented multimodal data (e.g., charts, tables, spatial info) to transfer reasoning skills from text to multimodal contexts.
Multimodal Reasoning Enhancement (MRE): Uses caption-free multimodal data to eliminate linguistic biases and further strengthen cross-modal reasoning.
Training involves rule-based reinforcement learning with a custom reward function that evaluates output correctness and format. The dataset includes 39,000 math problem-answer pairs and 39,000 multimodal question-answer pairs. Performance is tested on benchmarks like MATH500, MathVerse, MathVision, OlympiadBench, and MathVista.
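The paper does not reproduce its reward code here, but a rule-based reward of this kind is easy to sketch. The snippet below is an illustration only: it assumes responses follow a <think>...</think><answer>...</answer> template, and the 0.9/0.1 weighting is a placeholder rather than the authors' setting.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows a <think>...</think><answer>...</answer> layout."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference after whitespace normalization."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return 1.0 if match and match.group(1).strip() == gold.strip() else 0.0

def rule_based_reward(response: str, gold: str,
                      w_acc: float = 0.9, w_fmt: float = 0.1) -> float:
    """Weighted mix of correctness and format rewards (weights are illustrative)."""
    return w_acc * accuracy_reward(response, gold) + w_fmt * format_reward(response)

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.0
```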
What did they find?
The Infi-MMR-3B model achieved state-of-the-art results on multiple benchmarks:
43.68% accuracy on MathVerse
27.04% on MathVision
21.33% on OlympiadBench
67.2% on MathVista
It showed significant improvements over baseline models, especially in mathematical reasoning and multimodal inference. The phased curriculum approach effectively addressed challenges like limited high-quality multimodal reasoning datasets and the degradation of reasoning abilities during multimodal integration.
Why does this matter?
This work introduces a powerful new method for enhancing the reasoning capabilities of multimodal small language models, which are crucial for AI systems that need to understand and reason across different types of data such as text, images, and tables. By systematically training models through a curriculum that progressively builds and transfers reasoning skills, Infi-MMR overcomes key limitations that have hindered multimodal reasoning in smaller models. This approach paves the way for more robust, generalizable, and resource-efficient multimodal AI systems, enabling applications in education, data analysis, and human-AI interaction where understanding complex, multi-source information is essential.
Key Points
Introduces Infi-MMR, a curriculum-based phased reinforcement learning framework for multimodal reasoning.
Combines textual reasoning activation, cross-modal adaptation, and bias reduction to improve multimodal inference.
Achieves state-of-the-art results on multiple challenging benchmarks with small language models.
Addresses data scarcity and reasoning degradation issues in multimodal model training.
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging
What’s the research question?
Can model merging effectively combine diverse capabilities and modalities in Multimodal Large Language Models (MLLMs) without requiring additional training data?
What did the authors do?
The authors introduced a novel approach that unifies multiple specialized MLLMs by merging their weights rather than relying on traditional multi-task training:
Developed a benchmark focusing on tasks like Visual Question Answering (VQA), Geometry, Charts, Optical Character Recognition (OCR), and Grounding.
Used two models: InternVL2.5 (full fine-tuning with instruction-following capabilities) and Qwen2-VL (LoRA fine-tuning for general vision-language tasks).
Applied a model merging technique that combines task-specific vectors (parameter differences) from each fine-tuned model into a single unified model.
Employed different optimization strategies: low-rank SVD approximations for InternVL2.5 to reduce noise, and SGD instead of Adam for Qwen2-VL to avoid local minima.
Optimized the merged model over 300 iterations and evaluated on benchmarks like VizWiz, GQA, MathVista, and RefCOCO.
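As a rough illustration of the merging recipe described above (not the authors' code), the sketch below builds task vectors as parameter deltas, denoises each 2-D delta with a low-rank SVD, and adds them back onto a shared base. The layer names, rank, and merging coefficients are all hypothetical.

```python
import torch

def task_vector(finetuned: dict, base: dict) -> dict:
    """Task vector = parameter difference between a fine-tuned model and its shared base."""
    return {k: finetuned[k] - base[k] for k in base}

def low_rank_approx(delta: torch.Tensor, rank: int) -> torch.Tensor:
    """Keep only the top-`rank` singular directions of a 2-D delta to suppress noise."""
    if delta.dim() != 2:
        return delta
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    r = min(rank, S.numel())
    return U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

def merge(base: dict, task_vectors: list, coeffs: list, rank: int = 16) -> dict:
    """Merged weights = base + sum_i coeff_i * low_rank(task_vector_i)."""
    merged = {k: v.clone() for k, v in base.items()}
    for tv, c in zip(task_vectors, coeffs):
        for k in merged:
            merged[k] += c * low_rank_approx(tv[k], rank)
    return merged

# Toy usage with random "models"; in practice the coefficients would themselves be
# optimized (e.g., with SGD over a few hundred iterations, as the paper does).
base = {"w": torch.zeros(8, 8)}
ft_a, ft_b = {"w": torch.randn(8, 8)}, {"w": torch.randn(8, 8)}
merged = merge(base, [task_vector(ft_a, base), task_vector(ft_b, base)],
               coeffs=[0.5, 0.5], rank=2)
```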
What did they find?
The model merging approach yielded promising results:
InternVL2.5 merged model achieved an average score of 57.44 across tasks, outperforming individual models and matching or exceeding traditional multi-task training.
Qwen2-VL merged model scored 63.30, showing strong performance especially in VQA and OCR tasks.
Both merged models excelled in different areas: InternVL2.5 in Geometry and Grounding, Qwen2-VL in VQA and OCR.
Ablation studies revealed that low-rank approximation and SGD optimization significantly contributed to performance gains.
Successfully merged models with different modalities (vision, audio, video), improving zero-shot multimodal task performance.
Limitations include the need for careful tuning of merging hyperparameters and potential challenges in scaling to many models or modalities simultaneously.
Why does this matter?
This work offers a scalable, data-free method to enhance the capabilities of Multimodal Large Language Models by merging specialized models instead of retraining from scratch. It provides a new benchmark and methodology for evaluating model merging techniques, demonstrating that merging can be a cost-effective alternative to traditional multi-task learning. This approach can lead to more versatile and efficient multimodal AI systems, enabling better cross-modal reasoning and understanding in applications like visual question answering, document analysis, and multimodal interaction, with potential impacts on AI assistants, robotics, and content analysis.
Key Points
Introduces a model merging approach to unify multimodal LLM capabilities without additional training data.
Uses specialized fine-tuned models and combines them via task vector merging with optimized strategies.
Achieves competitive or superior performance compared to traditional multi-task training on diverse benchmarks.
Enables effective merging across different modalities, including vision, audio, and video.
Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
What’s the research question?
How can we effectively extend multimodal reasoning strategies to improve AI performance in complex medical domains?
What did the authors do?
The authors developed a novel two-stage training pipeline called MedE 2 to enhance multimodal reasoning in medical AI models:
Stage I: Fine-tuned large multimodal language models (MLLMs) on 2,000 curated text-only medical reasoning samples. These samples were carefully selected to focus on complex reasoning tasks, excluding questions solvable by pattern recognition or prior knowledge, thereby encouraging models to use reasoning rather than memorization.
Stage II: Further refined the models using 1,500 multimodal medical cases, aligning model outputs with a multimodal medical reasoning preference (MMRP) that emphasizes logical coherence, accurate visual analysis, and avoiding hallucinations. This was achieved using Direct Preference Optimization (DPO) to calibrate the models’ reasoning outputs.
The pipeline was applied to general-purpose MLLMs such as Qwen2.5-VL and InternVL3, fine-tuned with LoRA and optimized with DeepSpeed.
Evaluated the models on five benchmarks targeting medical knowledge and reasoning, including MedXpertQA-MM, MMMU-Health, and MMMU-Pro-Health, and compared performance against proprietary models like GPT-4o and Gemini-2.5-Pro.
Conducted ablation studies to assess the impact of different training strategies and model sizes.
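Stage II relies on Direct Preference Optimization. The snippet below is a generic sketch of the standard DPO loss over paired "preferred vs. rejected" reasoning traces, not the paper's implementation; the per-sequence log-probabilities are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy to prefer the chosen reasoning trace
    over the rejected one, measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy per-sequence log-probabilities
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```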
What did they find?
Models trained with MedE 2 showed consistent and significant improvements across all benchmarks:
Qwen2.5-VL-7B improved by +4.45% on MedXpertQA-MM and +6.67% on MMMU-Health.
Larger models like Qwen2.5-VL-72B gained even more, with +11.25% and +12.10% improvements respectively.
The training strategy prioritized high-quality, task-specific supervision over sheer data volume, leading to more robust reasoning abilities.
MedE 2 enhanced model output coherence and reduced hallucinations, as demonstrated by qualitative examples.
Models with more extensive pretraining benefited more from MedE 2, though combining text-only and multimodal training sometimes impaired language abilities.
Why does this matter?
This work provides a scalable and effective approach to improving multimodal reasoning in medical AI systems, which is crucial for deploying reliable and accurate clinical tools. By focusing on high-quality, task-specific supervision and a two-stage training process, MedE 2 advances the ability of models to integrate and reason over complex visual and textual medical data. This has broad implications:
Enhances the interpretability and trustworthiness of AI in healthcare by reducing hallucinations and improving logical consistency.
Offers a blueprint for applying similar multimodal reasoning enhancements to other specialized domains beyond medicine.
Highlights the importance of targeted, high-quality training data over large data volume for developing reasoning skills in AI models.
Supports the development of AI assistants capable of complex clinical reasoning, potentially improving diagnosis and treatment planning.
Key Points
Introduces MedE 2, a two-stage fine-tuning pipeline for multimodal medical AI models.
Stage I: Fine-tuning on curated reasoning demonstrations to elicit reasoning behaviors.
Stage II: Refinement with multimodal cases aligned to a reasoning preference using DPO.
Significant performance gains on multiple medical reasoning benchmarks, outperforming proprietary models.
Improves output coherence and reduces hallucinations in multimodal medical AI.
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
What’s the research question?
Can large multimodal language models (MLLMs) effectively learn and generalize visual reasoning tasks that are based on explicit rules, such as solving jigsaw puzzles?
What did the authors do?
The authors conducted a comprehensive experimental study to evaluate how MLLMs perform on rule-based visual reinforcement learning (RL) tasks using jigsaw puzzles. Their approach included:
Creating puzzles by dividing images into m×n grids, with adjustable complexity by varying the number of pieces.
Generating two types of questions: 'full' puzzles requiring the model to reconstruct the entire image by identifying the original position of each patch, and 'pair' puzzles asking the model to determine the relative position of two patches.
Representing model outputs as either a list of position indices for 'full' puzzles or a single letter indicating relative position for 'pair' puzzles.
Training models using reinforcement learning with the GRPO algorithm, optimizing for accuracy and correct output format.
Fine-tuning models on subsets of puzzles to improve their ability to solve similar tasks and testing their generalization on downstream spatial reasoning and visual grounding tasks.
Comparing reinforcement learning (RL) with supervised fine-tuning (SFT) and exploring the effects of task complexity and question type on learning and generalization.
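To make the task setup concrete, here is a simplified, hypothetical version of how a 'full' puzzle instance could be constructed; the grid size, shuffling, and answer encoding mirror the description above but are not taken from the paper's code.

```python
import random
from PIL import Image

def make_full_puzzle(image: Image.Image, m: int = 2, n: int = 2, seed: int = 0):
    """Split an image into an m x n grid, shuffle the patches, and return the shuffled
    patches together with the original index of each one (the target the model must output)."""
    w, h = image.size
    pw, ph = w // n, h // m
    patches = [image.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
               for r in range(m) for c in range(n)]
    order = list(range(m * n))
    random.Random(seed).shuffle(order)
    shuffled = [patches[i] for i in order]
    return shuffled, order  # order[k] = original position of the k-th shuffled patch

shuffled, answer = make_full_puzzle(Image.new("RGB", (224, 224)), m=2, n=1)
print(answer)  # e.g. [1, 0]
```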
What did they find?
The study revealed several key findings:
Before fine-tuning, models performed at or near random chance on simple puzzles, highlighting the difficulty of rule-based visual RL without task-specific training.
After fine-tuning, models achieved near-perfect accuracy—e.g., Qwen2.5-VL-3B reached 97.8% on 2×1 puzzles, a significant improvement from 54.1% pre-fine-tuning.
Models successfully generalized to more complex puzzles and downstream tasks like spatial reasoning, with performance gains up to 15.97% after fine-tuning.
Larger puzzles and 'pair' questions generally led to better generalization, suggesting that task difficulty and question format influence learning outcomes.
Complex reasoning patterns were found to be largely pre-existing in the models rather than emerging solely from training, indicating that models already encode some rule-based knowledge.
Reinforcement learning outperformed supervised fine-tuning in terms of generalization, and starting with supervised fine-tuning before RL was found to be detrimental.
Why does this matter?
This research demonstrates that rule-based visual reinforcement learning can be effectively applied to multimodal models, opening new avenues for perception-heavy AI tasks that require spatial and relational reasoning. The findings highlight the importance of task design—such as puzzle complexity and question format—in training models to generalize beyond training examples. Moreover, the observation that complex reasoning patterns are often intrinsic rather than emergent suggests that leveraging pre-existing knowledge in models can be a powerful strategy. These insights are valuable for developing AI systems capable of understanding and manipulating visual information in real-world applications like robotics, visual grounding, and spatial navigation, where rule-based reasoning is essential.
Key Points
Rule-based visual RL with jigsaw puzzles reveals strengths and limitations of multimodal large language models.
Fine-tuning significantly improves puzzle-solving accuracy and generalization to related tasks.
Puzzle complexity and question type influence learning and transfer performance.
Reinforcement learning outperforms supervised fine-tuning in generalization, with pre-training strategies affecting results.
Improving Medical Reasoning with Curriculum-Aware Reinforcement Learning
What’s the research question?
How can curriculum-aware reinforcement learning enhance medical reasoning in vision-language models?
What did the authors do?
The authors developed MedCCO, a multimodal medical reasoning framework that integrates curriculum-aware reinforcement learning to improve visual question answering (VQA) in medical AI systems. Their approach includes:
Two-stage training process: First, fine-tune on close-ended VQA tasks using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that drops the learned value function and relies on rule-based rewards rather than a separate reward model. Then, fine-tune on open-ended VQA tasks, where the model generates free-form answers rewarded by semantic similarity and lexical overlap (see the reward sketch after this list).
Data refinement: Introduced a VQA-Consistency Auditor to improve the quality of open-ended question-answer pairs, ensuring better alignment and clarity.
Multi-reward policy: Combined rewards for correctness, semantic similarity, and output formatting to guide learning effectively.
Joint training strategies: Explored direct joint GRPO with gradient re-weighting and a curriculum approach where the model first learns close-ended tasks before adapting to open-ended ones, finding the latter more effective.
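A toy sketch of the multi-reward idea referenced above: close-ended answers get an exact-match reward, open-ended answers get a cheap lexical-overlap score standing in for the paper's semantic-similarity reward (which would normally use an embedding or judge model), and both are mixed with a format term. The tag names and weights are assumptions.

```python
import re
from difflib import SequenceMatcher

def format_reward(response: str) -> float:
    """Reward well-formed outputs that wrap the final answer in <answer> tags."""
    return 1.0 if re.search(r"<answer>.*?</answer>", response, flags=re.DOTALL) else 0.0

def closed_reward(pred: str, gold: str) -> float:
    """Close-ended VQA: exact match on the selected option (e.g. 'A', 'B', ...)."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

def open_reward(pred: str, gold: str) -> float:
    """Open-ended VQA: lexical overlap as a stand-in for semantic similarity."""
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio()

def total_reward(response: str, gold: str, open_ended: bool,
                 w_task: float = 0.9, w_fmt: float = 0.1) -> float:
    """Mix the task reward (open- or close-ended) with the format reward."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    pred = match.group(1) if match else response
    task = open_reward(pred, gold) if open_ended else closed_reward(pred, gold)
    return w_task * task + w_fmt * format_reward(response)
```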
What did they find?
MedCCO achieved state-of-the-art performance on eight medical VQA benchmarks, demonstrating significant improvements:
11.4% accuracy gain on in-domain tasks compared to previous methods.
5.7% improvement on out-of-domain tasks, showing strong generalization.
Outperformed baselines like HuatuoGPT-Vision and Qwen2.5-VL.
Ablation studies confirmed that the curriculum training strategy and data refinement contributed substantially to performance gains.
Generated clinically relevant, structured reasoning in both close-ended and open-ended scenarios, enhancing interpretability.
Limitations include the reliance on high-quality data refinement and the need for careful task sequencing, which may require domain expertise to replicate in other settings.
Why does this matter?
This work advances the development of multimodal medical AI by demonstrating how curriculum-aware reinforcement learning can effectively bridge structured and free-form reasoning in vision-language models. By emphasizing data quality and task sequencing, MedCCO paves the way for more robust, interpretable, and adaptable AI systems in healthcare. Such models have the potential to support clinical decision-making, medical education, and personalized patient care by providing transparent and clinically relevant reasoning across diverse medical imaging and question types.
Key Points
Introduces MedCCO, a curriculum-aware reinforcement learning framework for medical vision-language reasoning.
Combines close-ended and open-ended VQA tasks with tailored reward strategies and data refinement.
Achieves state-of-the-art accuracy and strong cross-domain generalization on medical VQA benchmarks.
Highlights the importance of task sequencing and high-quality data in training multimodal medical AI models.
Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
What’s the research question?
How do different post-training strategies affect the compositional generalization abilities of vision-language models (VLMs)?
What did the authors do?
The authors investigated how two common training approaches—supervised fine-tuning (SFT) and reinforcement learning (RL)—impact the ability of VLMs to generalize compositionally across multiple reasoning challenges.
Developed ComPABench, a comprehensive benchmark testing three key dimensions: cross-modal (text and vision integration), cross-task (combining different reasoning skills), and out-of-distribution (OOD) robustness.
Designed three types of tasks:
Cross-modal tasks: Test the model’s ability to jointly reason with text and images.
Cross-task tasks: Challenge the model to combine independently learned skills like geometry and spatial reasoning.
OOD tasks: Evaluate robustness when task objectives change unexpectedly.
Compared two training strategies:
SFT: Fine-tuned models using labeled data via maximum likelihood.
RL: Trained models using reward signals to optimize reasoning performance.
Introduced RL-Ground, a variant that incorporates:
A caption-before-thinking prompt to guide structured reasoning.
A progress reward providing intermediate feedback during reasoning steps.
Conducted experiments on two vision-language models, Qwen2.5-VL-3B and 7B, comparing SFT, RL, and RL-Ground variants, measuring accuracy across the benchmark tasks.
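The 'caption-before-thinking' prompt and the progress reward lend themselves to a short sketch; the template wording, tag names, and partial-credit weights below are illustrative guesses, not the benchmark's exact specification.

```python
import re

def build_rl_ground_prompt(question: str) -> str:
    """Caption-before-thinking prompt: ground the image in text, then reason, then answer."""
    return (
        "First describe the relevant visual content inside <caption></caption>, "
        "then reason step by step inside <think></think>, "
        "and finally give the answer inside <answer></answer>.\n"
        f"Question: {question}"
    )

def progress_reward(response: str, gold_answer: str) -> float:
    """Intermediate feedback: partial credit for each completed stage,
    plus a larger bonus for a correct final answer."""
    score = 0.0
    if re.search(r"<caption>.+?</caption>", response, flags=re.DOTALL):
        score += 0.25
    if re.search(r"<think>.+?</think>", response, flags=re.DOTALL):
        score += 0.25
    answer = re.search(r"<answer>(.+?)</answer>", response, flags=re.DOTALL)
    if answer and answer.group(1).strip() == gold_answer.strip():
        score += 0.5
    return score
```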
What did they find?
The study revealed significant differences in how training strategies affect compositional reasoning abilities:
RL-trained models consistently outperformed SFT models in compositional generalization, achieving up to 93% accuracy on pure-text tasks and 52.8% with RL-Ground.
SFT models excelled at individual tasks with high accuracy but struggled in compositional and OOD scenarios, indicating limited generalization.
RL-Ground, combining visual-to-text grounding and intermediate rewards, delivered the best overall performance across all settings.
Initializing RL with SFT (SFT-init RL) did not improve results and sometimes worsened performance, highlighting the importance of structured prompting and grounded supervision.
Why does this matter?
This work advances our understanding of how training strategies influence the ability of vision-language models to generalize compositionally—a key challenge in multimodal AI. By introducing ComPABench, the authors provide a valuable tool for systematically evaluating and improving reasoning in VLMs. The novel RL-Ground approach demonstrates that grounded, reward-based training with structured prompts can significantly enhance the robustness and interpretability of multimodal reasoning systems. These insights are crucial for developing more capable AI applications such as visual question answering, multimodal decision-making, and interactive agents that better understand and reason about complex, real-world scenarios.
Key Points
Introduces ComPABench, a benchmark for evaluating compositional generalization in vision-language models.
Shows reinforcement learning with structured grounding (RL-Ground) outperforms supervised fine-tuning in compositional reasoning.
Highlights limitations of traditional fine-tuning in handling cross-modal, cross-task, and out-of-distribution challenges.
Provides a pathway toward more robust, interpretable multimodal AI systems capable of complex reasoning.
Reinforced Reasoning for Embodied Planning
What’s the research question?
How can reinforcement fine-tuning improve the performance of vision-language models in embodied planning tasks?
What did the authors do?
The authors developed a novel two-stage training framework to enhance structured decision-making in embodied AI tasks:
Data Distillation: They generated a high-quality dataset by prompting a large proprietary model (Gemini-2.0-Flash) to produce structured, multi-step decision outputs across various tasks.
Supervised Fine-Tuning (SFT): The distilled dataset was used to fine-tune a smaller open-source vision-language model (Qwen2.5-VL), aligning it with high-quality task decompositions and commonsense reasoning.
Reinforcement Fine-Tuning (RFT): The model then underwent reinforcement learning using Group Relative Policy Optimization (GRPO). This involved:
A rule-based reward function evaluating the quality of multi-step plans based on format correctness and action accuracy.
Generating multiple responses per prompt, scoring them, and computing relative advantages to update the model’s policy.
Online data filtering to discard responses with extremely low or high rewards, ensuring stable training.
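The group-relative update and the online filtering step can be sketched compactly. This is one plausible reading (filtering whole prompt groups by mean reward), with illustrative thresholds, rather than the paper's exact procedure.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize each response's reward against the other
    responses sampled for the same prompt (rewards has shape [num_prompts, group_size])."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def keep_groups(rewards: torch.Tensor, low: float = 0.05, high: float = 0.95) -> torch.Tensor:
    """Online filtering: drop prompt groups whose mean reward is extreme (almost all
    wrong or almost all right), since they carry little learning signal."""
    mean = rewards.mean(dim=1)
    return (mean > low) & (mean < high)

rewards = torch.tensor([[0.0, 1.0, 1.0, 0.0],   # informative group
                        [1.0, 1.0, 1.0, 1.0]])  # saturated group, filtered out
print(group_relative_advantages(rewards))
print(keep_groups(rewards))  # tensor([ True, False])
```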
What did they find?
The combined supervised and reinforcement fine-tuning approach yielded significant improvements:
In-Domain Performance: The model achieved a 15% higher success rate on Embench benchmarks compared to similar or larger models, including GPT-4o-mini and 70B+ open-source baselines.
Out-of-Domain Generalization: It demonstrated strong ability to handle unseen tasks, with a 12% higher success rate than baselines.
Ablation Studies: Both supervised and reinforcement fine-tuning contributed to performance gains, highlighting the importance of their combination.
Response Length Findings: Longer responses did not always improve performance, indicating that optimal response length is crucial for effective structured reasoning.
Why does this matter?
This work advances the field of embodied AI by showing how integrating supervised learning with reinforcement fine-tuning can significantly improve structured, multi-step decision-making in dynamic, interactive environments. By focusing on generating well-formed JSON outputs that accurately capture action sequences, the approach enhances the model’s ability to reason over long horizons and generalize to new tasks. This framework can be adapted to a wide range of embodied and interactive AI applications, such as robotics, virtual assistants, and autonomous agents, where robust planning and reasoning are essential for success.
Key Points
Introduces a two-stage training pipeline combining supervised fine-tuning and reinforcement fine-tuning for embodied planning.
Uses a rule-based reward function to optimize structured multi-step decision outputs.
Achieves significant performance improvements on benchmark embodied tasks, including strong out-of-domain generalization.
Highlights the importance of response length and structured output quality in reinforcement learning for reasoning tasks.
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
What’s the research question?
Can multimodal large language models (MLLMs) effectively generate detailed and accurate feedback on AI-generated content (AIGC) videos, covering various reasoning and error detection tasks?
What did the authors do?
The authors developed VF-Eval, a comprehensive benchmark to assess MLLMs' ability to analyze and critique AIGC videos. Their approach included:
Four evaluation tasks: coherence validation (checking alignment between prompts and videos), error awareness (detecting presence of errors), error type detection (identifying specific errors like visual or semantic issues), and reasoning evaluation (performing spatial, temporal, object reasoning, counting, and summarization).
Created a dataset of 9,740 question-answer pairs using videos generated by both proprietary and open-source AIGC models.
Validated dataset quality through human annotations to ensure high agreement.
Evaluated several models, including GPT-4.1, Gemini-2.0-Flash, and InternVL3-38B, measuring accuracy and F1 scores across tasks.
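A benchmark of this shape reduces to a per-task scoring loop. The sketch below assumes a flat list of items with hypothetical 'task' and 'answer' fields and uses scikit-learn for macro F1; VF-Eval's actual data format and metric details may differ.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def evaluate(predictions: list, dataset: list) -> dict:
    """Group QA items by task, then report accuracy and macro F1 per task."""
    by_task = defaultdict(lambda: {"gold": [], "pred": []})
    for pred, item in zip(predictions, dataset):
        by_task[item["task"]]["gold"].append(item["answer"])
        by_task[item["task"]]["pred"].append(pred)
    report = {}
    for task, d in by_task.items():
        acc = sum(g == p for g, p in zip(d["gold"], d["pred"])) / len(d["gold"])
        report[task] = {"accuracy": acc,
                        "macro_f1": f1_score(d["gold"], d["pred"], average="macro")}
    return report

dataset = [{"task": "error_awareness", "answer": "yes"},
           {"task": "error_awareness", "answer": "no"}]
print(evaluate(["yes", "no"], dataset))
```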
What did they find?
Key results include:
GPT-4.1 achieved an overall accuracy of 66.3%, excelling in error awareness (84.2%) but struggling with coherence validation (39.7%).
Open-source models like InternVL3-38B and LLaVA-NeXT-Video-7B showed competitive performance, highlighting the potential of accessible models.
Analysis revealed models often relied heavily on textual cues and overlooked critical visual details, leading to errors in error detection and reasoning tasks.
The RePrompt experiment demonstrated that aligning MLLMs with human preferences improved video quality, with over 56% win rates in subject consistency and aesthetic quality.
Why does this matter?
VF-Eval provides a vital benchmark for understanding how well multimodal large language models can analyze and critique AI-generated videos. By identifying strengths and limitations—such as difficulties in visual reasoning and coherence validation—it guides future research toward developing models with better visual understanding and alignment with human preferences. This work is crucial for advancing AI tools that can automatically evaluate and improve AIGC content, which is increasingly prevalent in entertainment, education, and content creation industries. Ultimately, VF-Eval helps push the boundaries of multimodal AI, enabling more intelligent and human-like feedback on synthetic videos.
Key Points
Introduces VF-Eval, a benchmark for evaluating MLLMs on AIGC video feedback tasks.
Includes four tasks: coherence validation, error awareness, error type detection, and reasoning evaluation.
Highlights challenges in visual reasoning and reliance on textual cues by current models.
Demonstrates the potential of open-source MLLMs and the benefits of aligning models with human preferences.
Safety Through Reasoning: An Empirical Study of Reasoning Guardrail Models

What’s the research question?
How can reasoning-based guardrail models improve safety and content moderation in large language models (LLMs)?
What did the authors do?
The authors conducted an extensive empirical study using the Llama-3.1-8B-Instruct model to evaluate how reasoning can enhance safety in LLMs:
Fine-tuned the Llama model on two safety-focused datasets: WildGuardMix and AEGIS 2.0.
Generated reasoning traces (chain-of-thought explanations) using DeepSeek-R1-671B, a large reasoning model used as a teacher to produce high-quality reasoning steps.
Filtered reasoning traces for quality using rule-based and LLM-based judgment methods.
Trained models in both reasoning and non-reasoning modes, including dual-mode training where models learn both styles simultaneously.
Evaluated model performance on multiple safety and moderation benchmarks: OpenAI Moderation, ToxicChat, JailbreakBench, and custom datasets DynaGuard and CoSA.
Analyzed the effects of reasoning length, data efficiency, and adaptation to custom safety policies.
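To make the dual-mode setup concrete, here is a hypothetical formatter for a single training example: with a reasoning trace the target is "think, then verdict"; without one it is the verdict alone. The moderation prompt wording and tags are assumptions, not the datasets' actual templates.

```python
from typing import Optional

def build_guardrail_example(user_prompt: str, assistant_response: str, label: str,
                            reasoning_trace: Optional[str] = None) -> dict:
    """Format one guardrail training example; mixing both variants gives dual-mode training."""
    instruction = (
        "You are a safety moderator. Decide whether the assistant response is "
        "'safe' or 'unsafe'.\n"
        f"User: {user_prompt}\nAssistant: {assistant_response}"
    )
    if reasoning_trace is not None:
        instruction += "\nThink step by step before giving the verdict."
        target = f"<think>{reasoning_trace}</think>\nVerdict: {label}"
    else:
        target = f"Verdict: {label}"
    return {"instruction": instruction, "target": target}

example = build_guardrail_example("How do I pick a lock?", "Here are the steps...",
                                  "unsafe",
                                  reasoning_trace="The response gives actionable instructions for harm.")
print(example["target"])
```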
What did they find?
The study revealed several key insights:
Reasoning-based models consistently outperformed non-reasoning baselines: For example, the reasoning model L3.1-8B-WildGuardMix-R achieved an average F1 score of 0.841, surpassing the non-reasoning version’s 0.832.
Concise reasoning traces are sufficient: Models trained with shorter reasoning steps (e.g., one sentence) performed comparably to those with full-length explanations, indicating that lengthy reasoning is not always necessary.
Dual-mode models maintained high performance in both reasoning and non-reasoning modes: Demonstrating flexibility and robustness across different training styles.
Fine-tuning with dialogue moderation data improved generalization: Models better adapted to new safety policies beyond the training data.
Limitations include the need for high-quality reasoning data: Generating and filtering reasoning traces remains a resource-intensive process, though the study shows that even limited reasoning data can be effective.
Why does this matter?
This research advances the development of safer and more reliable large language models by demonstrating that integrating reasoning into safety guardrails enhances both effectiveness and efficiency. By showing that models can be trained to reason through safety challenges and generalize to new policies with limited data, the work paves the way for deploying more robust, adaptable, and low-latency safety systems in real-world AI applications. This is particularly important as LLMs become increasingly embedded in platforms requiring strict content moderation and user safety, helping prevent harmful outputs while maintaining high performance.
Key Points
Reasoning-based guardrail models outperform non-reasoning baselines on safety benchmarks.
Concise reasoning traces are nearly as effective as full-length explanations, reducing training complexity.
Dual-mode training enables models to excel in both reasoning and non-reasoning safety tasks.
Quality reasoning data can be generated and filtered effectively, supporting scalable safety model development.
Large Language Models for Planning: A Comprehensive and Systematic Survey
What’s the research question?
How do large language models (LLMs) enhance planning capabilities in intelligent agents, and what are the current methodologies, evaluations, and future directions in this field?
What did the authors do?
The authors conducted a thorough survey of how LLMs are used to improve planning in AI agents by categorizing existing approaches into three main types:
External Module Augmented Methods: Integrate LLMs with external components like symbolic planners or memory modules. For example, translating natural language into Planning Domain Definition Language (PDDL) for classical planners.
Finetuning-based Methods: Adjust LLM parameters using trajectory data or feedback signals. This includes:
Imitation Learning: Training LLMs to mimic expert planning behavior.
Feedback-based Methods: Using environmental rewards or signals to refine planning.
Searching-based Methods: Employ task decomposition, exploration, and advanced decoding strategies. Examples include:
Decomposition: Breaking complex tasks into subtasks.
Exploration: Using algorithms like Monte Carlo Tree Search (MCTS) for trial-and-error.
Decoding Enhancements: Improving output generation with techniques like beam search.
They also evaluated these methods across various benchmarks and real-world scenarios.
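As a concrete picture of the external-module pattern (the PDDL example mentioned above), the sketch below wires a hypothetical LLM call to a hypothetical classical planner; both `call_llm` and `run_classical_planner` are stubs to be replaced with a real client and a real planner such as Fast Downward.

```python
def call_llm(prompt: str) -> str:
    """Stub: replace with a real LLM client."""
    raise NotImplementedError

def run_classical_planner(domain_pddl: str, problem_pddl: str) -> list:
    """Stub: replace with a real PDDL planner (e.g., Fast Downward)."""
    raise NotImplementedError

def plan_with_external_module(task_description: str, domain_pddl: str) -> list:
    """External-module-augmented planning: the LLM translates the natural-language task
    into a PDDL problem file, and a symbolic planner searches for a verified action plan."""
    problem_pddl = call_llm(
        "Translate the following task into a PDDL problem definition consistent "
        f"with this domain:\n{domain_pddl}\n\nTask: {task_description}"
    )
    return run_classical_planner(domain_pddl, problem_pddl)
```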
What did they find?
The survey revealed that LLMs can significantly boost planning abilities, but their effectiveness varies depending on the task complexity and approach:
In the ALFWorld benchmark, some models achieved over 80% success rates, demonstrating strong performance in simulated environments.
In more challenging settings like Mind2Web, success rates dropped below 65%, highlighting the difficulty of generalization.
Finetuning-based methods, especially those leveraging imitation learning and feedback signals, generally outperformed other approaches.
Integrating multimodal inputs (e.g., combining visual and textual data) improved agent performance in tasks such as web navigation and robotic control.
Challenges remain in ensuring robustness, generalization to new environments, and the quality of external modules or feedback signals used to guide planning.
While promising, these findings underscore the need for continued research to address current limitations.
Why does this matter?
This comprehensive survey provides a valuable framework for understanding how LLMs can be harnessed to improve planning in AI agents. Effective planning is crucial for autonomous systems operating in complex, dynamic environments—ranging from robots and virtual assistants to web navigation and game playing. By systematically analyzing different methodologies and benchmarks, the paper guides researchers and practitioners toward more robust, generalizable, and multimodal planning agents. Advancements in this area can lead to smarter robots, more capable virtual assistants, and AI systems that better understand and interact with the world, ultimately accelerating progress toward truly intelligent autonomous agents.
Key Points
LLMs enhance planning by integrating language understanding with decision-making and reasoning.
Three main approaches: external modules, finetuning-based methods, and searching-based methods.
Finetuning and multimodal integration improve performance across benchmarks.
Challenges include generalization, robustness, and external module quality.