Perception Meets Reasoning, LLMs Reuse Their Thoughts, THOR Powers Math with Tools, Latent Thoughts Guide Logic, MIRA Puts AI in Your Pocket
Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models
What’s the research question?
How can a two-stage reinforcement learning framework enhance both perceptual and reasoning capabilities in vision-language models?
What did the authors do?
The authors proposed a novel two-stage reinforcement learning (RL) approach to improve vision-language models (VLMs) for visual reasoning tasks:
Stage 1: Perception RL — Focuses on visual perception by training the model to attend to important visual regions and semantic concepts.
Uses a coarse-grained alignment reward based on FGCLIP to measure semantic correspondence between generated descriptions and input images.
Includes a fine-grained semantic keyword reward to encourage recognition of key visual concepts.
Stage 2: Reasoning RL — Concentrates on logical reasoning by employing rule-based rewards to improve the logical consistency and accuracy of model outputs.
Uses a dataset sampling strategy that sorts questions into Easy, Medium, and Hard cases, routing Easy cases to Perception RL and Medium cases to Reasoning RL.
Optimizes using Group Relative Policy Optimization (GRPO), leveraging normalized advantage estimation across multiple candidate responses to stabilize learning.
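To make the training recipe concrete, here is a minimal sketch of how a Stage-1 perception reward and a GRPO-style group-normalized advantage could be computed. The generic image-text score standing in for FGCLIP, the keyword-recall term, the 0.5 weighting, and all function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def perception_reward(image_text_score: float,
                      predicted_keywords: set[str],
                      reference_keywords: set[str],
                      alpha: float = 0.5) -> float:
    """Illustrative Stage-1 reward: a coarse image-description alignment score
    (a CLIP-style score standing in for FGCLIP) plus a fine-grained
    keyword-recall term. The 0.5 weighting is an assumption."""
    keyword_recall = (len(predicted_keywords & reference_keywords)
                      / max(len(reference_keywords), 1))
    return alpha * image_text_score + (1 - alpha) * keyword_recall

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: normalize each candidate's reward against the
    mean and standard deviation of its group of sampled responses."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: four candidate descriptions/answers for the same image-question pair.
reference = {"triangle", "angle", "60"}
candidates = [(0.81, {"triangle", "angle"}),
              (0.62, {"circle"}),
              (0.77, {"triangle", "angle", "60"}),
              (0.55, set())]
rewards = [perception_reward(score, kw, reference) for score, kw in candidates]
print(grpo_advantages(rewards))  # above-average candidates get positive advantage
```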
What did they find?
The proposed two-stage RL model, named PeBR-R1, demonstrated strong performance across multiple benchmarks:
PeBR-R1-3B and PeBR-R1-7B outperform similarly sized models on the MathVista benchmark, with accuracy gains of +8.9% and +7.8%, respectively.
PeBR-R1-7B surpasses larger models like InternVL2.5-78B and Qwen2.5-VL-72B on MathVista.
The model also outperforms several open-source multimodal reasoning models and closed-source giants like GPT-4o and Claude 3.5 Sonnet.
Limitations include the reliance on question categorization and potential challenges in scaling to more diverse or complex visual reasoning tasks.
Why does this matter?
This work highlights the importance of explicitly separating perception and reasoning in training vision-language models. By focusing first on visual understanding and then on logical inference, the two-stage RL approach leads to significant improvements in visual reasoning accuracy. This methodology can inform future multimodal AI development, making models more robust, interpretable, and capable of handling complex reasoning tasks that combine visual and language information. Such advancements have broad applications in areas like intelligent assistants, educational tools, and autonomous systems that require integrated perception and reasoning.
Key Points
Introduces a two-stage reinforcement learning framework to improve perception and reasoning separately in vision-language models.
Uses FGCLIP-based semantic alignment and keyword rewards for perception; rule-based rewards for reasoning.
Achieves state-of-the-art results on multiple visual reasoning benchmarks, outperforming larger and proprietary models.
Offers a promising strategy for enhancing multimodal AI capabilities by decoupling perception and reasoning training phases.
Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
What’s the research question?
How can large language models (LLMs) improve reasoning efficiency and accuracy by turning recurring reasoning fragments into concise, reusable behaviors?
What did the authors do?
The authors proposed a novel framework called Metacognitive Reuse that extracts and leverages short, reusable reasoning instructions—called behaviors—from LLMs’ own reasoning traces. The approach involves three key roles:
Metacognitive Strategist (LLM A): Generates solutions and reflections on reasoning steps.
Teacher (LLM B): Creates behavior-conditioned responses based on the strategist’s outputs.
Student (LLM C): Learns to produce responses conditioned on these behaviors.
They instantiated this framework in three ways:
Behavior-conditioned inference (BCI): Retrieves relevant behaviors based on question topics or embeddings and includes them in the input context during inference to guide reasoning.
Behavior-guided self-improvement: Uses a curated set of behaviors from past reasoning traces as hints to improve future reasoning without updating model parameters.
Behavior-conditioned supervised fine-tuning (BC-SFT): Fine-tunes models on datasets generated via BCI, embedding behaviors directly into training examples.
Evaluation involved testing these methods on math benchmarks and comparing against baseline reasoning approaches.
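As a rough illustration of the behavior-conditioned inference (BCI) step described above, the sketch below retrieves a few behaviors by embedding similarity and prepends them to the prompt. The toy embedding function, the behavior store, and the prompt format are assumptions for illustration rather than the paper's implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence-embedding model: a deterministic
    (within one process) random unit vector seeded from the text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

# Hypothetical behavior store: short, reusable reasoning instructions
# distilled from earlier reasoning traces.
behaviors = [
    "When a problem involves ratios, set up a single variable for the common unit first.",
    "Check edge cases (zero, negative, empty) before finalizing an answer.",
    "For counting problems, decide whether order matters before choosing a formula.",
]
behavior_vecs = np.stack([embed(b) for b in behaviors])

def behavior_conditioned_prompt(question: str, k: int = 2) -> str:
    """Retrieve the top-k behaviors most similar to the question and
    include them in the input context as guidance for the student model."""
    sims = behavior_vecs @ embed(question)
    top = np.argsort(-sims)[:k]
    hints = "\n".join(f"- {behaviors[i]}" for i in top)
    return f"Useful behaviors:\n{hints}\n\nQuestion: {question}\nAnswer:"

print(behavior_conditioned_prompt("In how many ways can 5 books be arranged on a shelf?"))
```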
What did they find?
Key results include:
BCI: Achieved up to 46% token savings while maintaining or improving accuracy, demonstrating more concise reasoning without sacrificing performance.
Behavior-guided self-improvement: Improved accuracy by up to 10% without any parameter updates, showing effective reuse of reasoning behaviors for model enhancement.
BC-SFT: Produced models that were both more accurate and more concise than those trained on vanilla (non-behavioral) reasoning traces.
These gains were consistent across multiple models and datasets, highlighting the robustness of the approach. Limitations include reliance on the quality of extracted behaviors and the potential computational cost of retrieval and fine-tuning.
Why does this matter?
This work introduces a powerful new way to enhance LLM reasoning by transforming slow, verbose chains of thought into fast, reusable behaviors. By doing so, it addresses key challenges in reasoning efficiency and accuracy, enabling models to reason more effectively with less computational overhead. The approach has broad implications for scalable reasoning in diverse domains such as mathematics, coding, and complex question answering. It also opens avenues for integrating procedural knowledge into LLMs, making them better equipped to handle multi-step, logical, or symbolic tasks. Overall, Metacognitive Reuse advances the state of the art in LLM reasoning by combining metacognitive reflection, retrieval, and fine-tuning to turn reasoning fragments into reusable building blocks.
Key Points
Introduces Metacognitive Reuse framework to extract and reuse reasoning behaviors from LLM traces.
Achieves up to 46% token savings and up to 10% accuracy improvements without parameter updates.
Uses three instantiations: behavior-conditioned inference, self-improvement, and supervised fine-tuning.
Enhances reasoning efficiency and accuracy by turning slow chains of thought into fast, reusable behaviors.
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
What’s the research question?
How can external tools be integrated into large language models to improve their mathematical reasoning capabilities?
What did the authors do?
The authors developed a novel framework called THOR that combines multiple advanced techniques to enhance mathematical problem-solving in language models:
TIRGen data generation: An actor-critic pipeline where the actor generates reasoning steps and the critic evaluates and converts these steps into executable code. This iterative process produces high-quality training data aligned with model policies.
Hierarchical reinforcement learning (RL): Trajectory-level optimization rewards the model based on the correctness of the final answer, guiding overall problem-solving strategy. Step-level optimization focuses on correcting code generation errors by backtracking and regenerating parts of the reasoning process, improving robustness.
Self-correction during inference: The model dynamically revises its reasoning paths based on immediate feedback from external tools, reducing errors and increasing reliability.
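A minimal sketch of an inference-time self-correction loop of the kind described above: generate a reasoning step with code, run the code in an external interpreter, and regenerate the step when the tool reports an error. The generate_step callable is a hypothetical stand-in for the model, and these interfaces are not THOR's actual API.

```python
import subprocess
import sys
import tempfile

def run_code(code: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Execute a candidate code snippet with a Python interpreter and return
    (success, output-or-error). Stands in for THOR's external tool call."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "execution timed out"
    ok = proc.returncode == 0
    return ok, proc.stdout if ok else proc.stderr

def solve_with_self_correction(question: str, generate_step, max_retries: int = 3) -> str:
    """Illustrative loop: ask the model for a reasoning step plus code, run the
    code, and feed any error back so the model can regenerate that step."""
    feedback = ""
    for _ in range(max_retries):
        reasoning, code = generate_step(question, feedback)  # hypothetical model call
        ok, output = run_code(code)
        if ok:
            return f"{reasoning}\nTool output: {output.strip()}"
        feedback = f"Previous code failed with:\n{output}\nPlease fix this step."
    return "Unable to produce a working step within the retry budget."
```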
What did they find?
THOR achieved state-of-the-art performance on key mathematical benchmarks:
4.3% improvement over previous methods on MATH500
2.1% improvement on AIME 2024
Its hierarchical RL and self-correction mechanisms significantly enhanced reasoning accuracy and code generation robustness:
The TIRGen pipeline generated high-quality, policy-aligned data that improved generalization across different models.
Step-level optimization reduced code errors, leading to more reliable problem-solving.
Limitations include potential computational complexity due to the iterative generation and correction processes, and the need for external tools to be well-integrated and reliable.
Why does this matter?
This work advances the integration of external tools with large language models, demonstrating a scalable approach to boosting mathematical reasoning and problem-solving abilities. By combining hierarchical reinforcement learning with dynamic self-correction, THOR enhances model robustness and accuracy, paving the way for more capable AI systems in education, scientific research, and automated reasoning tasks. The publicly available code and data will enable further exploration and development of tool-enhanced reasoning methods.
Key Points
Introduces THOR, a framework combining tool integration, hierarchical RL, and self-correction for math reasoning.
Uses TIRGen to generate high-quality, policy-aligned training data through iterative reasoning and code conversion.
Achieves state-of-the-art results on math benchmarks, outperforming previous methods.
Enhances robustness and generalization of language models in complex reasoning tasks.
LTA-thinker: Latent Thought-Augmented Training Framework for Large Language Models on Complex Reasoning
What’s the research question?
Can generating and optimizing latent thought representations improve the reasoning capabilities of large language models (LLMs)?
What did the authors do?
The authors introduced LTA-Thinker, a novel training framework designed to enhance complex reasoning in LLMs through latent thought generation and distributional optimization:
Latent Thought Generation Architecture: A lightweight Transformer-based module with a learnable prior generates Latent Thought vectors that serve as intermediate reasoning cues.
Integration with LLM: These vectors are incorporated into the input of a frozen (pre-trained and unmodified) backbone LLM to guide reasoning.
Multi-objective Loss Function: The training optimizes three components simultaneously:
Distributional Foundation Constraint: Ensures Latent Thought vectors match the output distribution for accurate reasoning.
Semantic Alignment (KL Divergence): Aligns the Latent Thought distribution with the question distribution to maintain relevance.
Reasoning Focus (Contrastive Learning): Emphasizes critical reasoning steps by contrasting important versus less important reasoning paths.
Joint Optimization: Adjusts both the shape and the direction of the Latent Thought distribution, improving how well it guides reasoning.
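For intuition, here is a minimal sketch of a multi-objective loss in the spirit described above, combining a cross-entropy term, a KL term between Gaussian-parameterized latent-thought and question distributions, and an InfoNCE-style contrastive term. The Gaussian parameterization, the loss weights, and the exact contrastive formulation are assumptions for illustration, not LTA-Thinker's published objective.

```python
import torch
import torch.nn.functional as F

def lta_style_loss(answer_logits, answer_targets,
                   thought_mu, thought_logvar,
                   question_mu, question_logvar,
                   thought_vec, positive_vec, negative_vecs,
                   w_kl: float = 0.1, w_con: float = 0.1, tau: float = 0.07):
    """Illustrative combination of the three objectives; weights and the
    diagonal-Gaussian parameterization are assumptions."""
    # 1. Distributional foundation: cross-entropy on the answer tokens.
    ce = F.cross_entropy(answer_logits.reshape(-1, answer_logits.size(-1)),
                         answer_targets.reshape(-1))

    # 2. Semantic alignment: KL(thought || question) for diagonal Gaussians.
    kl = 0.5 * torch.mean(
        question_logvar - thought_logvar
        + (thought_logvar.exp() + (thought_mu - question_mu) ** 2) / question_logvar.exp()
        - 1.0
    )

    # 3. Reasoning focus: InfoNCE-style contrast of the latent thought against an
    #    important reasoning step (positive) vs. less important ones (negatives).
    pos = F.cosine_similarity(thought_vec, positive_vec, dim=-1) / tau
    negs = F.cosine_similarity(thought_vec.unsqueeze(0), negative_vecs, dim=-1) / tau
    contrastive = -pos + torch.logsumexp(torch.cat([pos.view(1), negs]), dim=0)

    return ce + w_kl * kl + w_con * contrastive
```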
What did they find?
LTA-Thinker demonstrated strong reasoning performance across multiple benchmarks:
State-of-the-art Results: Achieved top scores on GSM8K (93.25 with N=1, close to SoftCoT++'s 93.65 with N=10) and MATH-500 (88.00 vs. SoftCoT++'s 86.91 with N=100).
Ablation Studies: Confirmed the importance of semantic alignment and reasoning focus losses; removing them led to performance drops.
Efficiency: Outperformed baselines with fewer inference responses, indicating better reasoning efficiency.
Why does this matter?
This work advances the understanding of how to improve reasoning in large language models by focusing on the distributional properties of latent thought representations. The lightweight architecture and joint optimization approach make it a scalable and effective method for tackling complex reasoning tasks. By better modeling the variance and relevance of intermediate reasoning steps, LTA-Thinker paves the way for more capable AI systems that can perform multi-step logical inference, solve challenging math problems, and understand nuanced questions—benefiting applications in education, scientific research, and AI-powered decision-making.
Key Points
Introduces a lightweight Latent Thought generator integrated into large language models to enhance reasoning.
Uses a multi-objective loss combining distributional accuracy, semantic relevance, and reasoning focus.
Achieves state-of-the-art results on challenging reasoning benchmarks with fewer inference responses.
Highlights the importance of distributional variance and semantic alignment in latent reasoning representations.
MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
What’s the research question?
Can multimodal large language models (MLLMs) be effectively used to improve instruction recommendation for AI services on smartphones?
What did the authors do?
The authors developed a novel system called MIRA that enhances instruction recommendation by integrating multiple advanced techniques within MLLMs:
Structured reasoning: Mimics human-like reasoning by extracting key entities, understanding user intent, and generating precise instructions through a three-step process involving entity recognition, contextual relevance analysis, and instruction synthesis.
Template-augmented reasoning: Uses a library of reasoning templates that contain high-level strategies. These templates are retrieved based on similarity to the initial reasoning output, refining the reasoning process and improving accuracy.
Constrained decoding with prefix trees: Restricts the model’s output to a set of predefined instruction candidates by dynamically masking invalid tokens during generation, ensuring coherence and alignment with user intent.
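The prefix-tree constraint can be illustrated with a small sketch: tokenized candidate instructions are stored in a trie, and at each decoding step only tokens that continue some candidate are allowed, while everything else would be masked out. The whitespace tokenizer and the toy candidates below are simplifying assumptions; a real system would use the model's tokenizer and apply the mask to the logits before sampling.

```python
END = "<eos>"  # marker for a completed candidate instruction

def build_trie(candidates: list[list[str]]) -> dict:
    """Build a prefix tree over tokenized candidate instructions."""
    root: dict = {}
    for tokens in candidates:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root

def allowed_next_tokens(trie: dict, prefix: list[str]) -> set[str]:
    """Given the tokens generated so far, return the tokens that keep the
    output inside the candidate set (all other tokens would be masked)."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return set()  # prefix has left the candidate set
    return set(node.keys())

# Example with a whitespace "tokenizer" over hypothetical instruction candidates.
candidates = ["summarize this article", "translate this article", "summarize this photo caption"]
trie = build_trie([c.split() for c in candidates])

print(allowed_next_tokens(trie, []))                     # {'summarize', 'translate'}
print(allowed_next_tokens(trie, ["summarize", "this"]))  # {'article', 'photo'}
```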
Evaluation involved:
Testing on a real-world dataset of 1,000 users and nearly 5,000 instruction pairs.
Comparing against zero-shot and supervised fine-tuning baselines.
Conducting ablation studies to assess the contribution of each component.
Performing sensitivity analysis to determine optimal similarity thresholds.
Running a user study with 100 participants to evaluate practical validity.
What did they find?
MIRA achieved impressive results:
Recall of 0.7164 and F1-score of 0.7271, outperforming baseline methods.
Template-augmented reasoning improved accuracy by 20.4%, highlighting the benefit of combining learned reasoning with high-level templates.
Robustness was confirmed through sensitivity analysis, with the best performance at a similarity threshold of 0.6.
The user study showed a high validity ratio of 93-95%, indicating practical usefulness.
Limitations include:
Evaluation was limited to a specific dataset and user group, so generalization to broader populations remains to be tested.
The system’s complexity may pose challenges for real-time deployment on resource-constrained smartphones.
Why does this matter?
MIRA represents a significant step forward in making AI services on smartphones more intuitive and user-friendly. By intelligently recommending instructions that unify multimodal inputs (like text, images, and audio), it enables users to interact with AI tools more naturally with a single touch. This has broad implications for:
Virtual assistants: Providing more accurate and context-aware commands.
Multimodal recommendation systems: Enhancing personalized content delivery across different media types.
Human-AI collaboration: Facilitating seamless and efficient interactions between users and AI-powered apps.
Overall, MIRA’s innovative combination of structured reasoning, template guidance, and constrained decoding paves the way for smarter, more responsive AI interfaces on everyday devices.
Key Points
Introduces MIRA, a system that improves instruction recommendation on smartphones using multimodal large language models.
Combines structured reasoning, template retrieval, and constrained decoding for high accuracy and coherence.
Outperforms baselines with a recall of 0.7164 and F1-score of 0.7271 on real-world data.
Validated by user study showing high practical validity (93-95%).