VRBench Probes Narrative Video Reasoning, Time-R1 Applies Slow Thinking to Forecasting, PAL Tunes Audio-LLMs, VideoDeepResearch Uses Agentic Tools, and RRP Links Knowledge Graphs to Reasoning
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
What’s the research question?
How can we effectively evaluate the multi-step reasoning capabilities of large models on long narrative videos?
What did the authors do?
The authors developed VRBench, a comprehensive benchmark designed to assess how well AI models can understand and reason about long, complex videos that tell stories over time. Their approach included:
Curating 1,010 long narrative videos through expert review to ensure quality and diversity.
Annotating each video with 8–10 question-answer pairs that require multi-step reasoning, including seven reasoning types such as event attribution and implicit inference, with precise timestamps.
Validating annotations via expert review to ensure accuracy and clarity.
Designing a multi-phase evaluation pipeline that assesses models both on the final answers (outcome-level) and on the reasoning process itself (process-level).
Using multiple evaluation metrics: multiple-choice question (MCQ) accuracy for outcome assessment, and LLM-guided open-ended ratings evaluating reasoning chains across four dimensions—logical coherence, similarity to ground truth, factual accuracy, and clarity.
Supporting both large language models (LLMs) and vision-language models (VLMs), with the latter directly perceiving raw visual content.
Enabling long-context processing to handle extended narrative content and complex reasoning.
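To make the two-level evaluation concrete, here is a minimal sketch of how outcome-level MCQ accuracy and process-level judge ratings over the four dimensions could be aggregated. The function and field names are illustrative, not VRBench's actual scoring code.

```python
from statistics import mean

PROCESS_DIMENSIONS = ("logical_coherence", "gt_similarity", "factual_accuracy", "clarity")

def outcome_accuracy(predictions, gold):
    """Outcome level: fraction of MCQ answers matching the gold option."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def process_score(chain_ratings):
    """Process level: mean judge rating over four dimensions, averaged across chains."""
    return mean(mean(r[d] for d in PROCESS_DIMENSIONS) for r in chain_ratings)

# toy example
preds, gold = ["B", "C", "A"], ["B", "C", "D"]
ratings = [
    {"logical_coherence": 0.9, "gt_similarity": 0.8, "factual_accuracy": 1.0, "clarity": 0.7},
    {"logical_coherence": 0.6, "gt_similarity": 0.5, "factual_accuracy": 0.8, "clarity": 0.9},
]
print(f"outcome-level accuracy: {outcome_accuracy(preds, gold):.2%}")  # 66.67%
print(f"process-level score:    {process_score(ratings):.2f}")         # 0.78
```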
What did they find?
Key findings from VRBench include:
GPT-4o achieved 83.25% accuracy on outcome-level MCQs but only 58.1% on process-level reasoning quality, revealing that models can get the right answers but struggle with the reasoning steps leading to them.
Proprietary vision-language models with long-context support outperformed text-only LLMs by 12.2% in accuracy, highlighting the importance of dense visual grounding in understanding videos.
Test-time scaling improved the accuracy of a 32-billion-parameter model (QwQ-32B) by 12.43%, demonstrating the benefit of allocating additional computation at inference time.
Open-source vision-language models lagged behind proprietary ones by nearly 9%, indicating architectural limitations beyond just model size.
Ablation studies showed high correlation (>0.8) between human judgments and LLM evaluations, validating the automatic scoring approach.
Why does this matter?
VRBench sets a new standard for evaluating AI models’ ability to understand and reason about long, complex visual narratives. By combining detailed annotations, nuanced reasoning types, and both outcome and process assessments, it provides a more comprehensive picture of model capabilities. This is crucial because real-world video understanding often involves integrating visual and temporal information over extended periods, requiring sustained reasoning similar to human cognition.
Supporting long-context multimodal reasoning pushes the development of models that can handle intricate storytelling, event causality, and implicit inferences—skills vital for applications like video summarization, content analysis, and human-AI interaction in dynamic environments. By scoring final answers separately from the reasoning process, VRBench encourages the creation of models that excel in real-world video understanding tasks, bridging the gap between AI research and practical deployment.
Key Points
VRBench is a large, expert-annotated benchmark for multi-step reasoning in long narrative videos.
Combines outcome accuracy with detailed process-level reasoning quality assessments.
Supports both vision-language and large language models with long-context capabilities.
Highlights the importance of dense visual grounding and model scaling for improved reasoning.
Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Image from arXiv paper.
What’s the research question?
Can training large language models (LLMs) to develop slow-thinking reasoning capabilities improve time series forecasting performance?
What did the authors do?
The authors introduced Time-R1, a novel framework that trains LLMs to perform time series forecasting through structured reasoning and reinforcement learning:
Two-stage fine-tuning: First, supervised fine-tuning (SFT) with synthetic reasoning trajectories guides the model to analyze temporal patterns and produce well-formatted forecasts. These trajectories are carefully generated and refined to match ground-truth data.
Reinforcement learning (RL): The model further improves via RL, guided by a multi-objective reward function that balances structural integrity, numerical accuracy, and temporal coherence of forecasts.
GRIP (Group-based Relative Importance for Policy Optimization): An innovative sampling and weighting strategy that encourages the model to explore diverse reasoning paths and enhances generalization during RL training.
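As an illustration of the multi-objective reward idea (not the paper's exact formulation), a reward for a generated forecast could combine a format check, a squashed MSE term, and a term that compares step-to-step changes; the weights below are arbitrary placeholders.

```python
import numpy as np

def forecast_reward(pred, target, well_formatted, w_struct=0.2, w_acc=0.6, w_temp=0.2):
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    # Structural integrity: parseable output with the expected horizon length.
    r_struct = 1.0 if well_formatted and pred.shape == target.shape else 0.0
    # Numerical accuracy: map MSE into (0, 1], higher is better.
    r_acc = 1.0 / (1.0 + float(np.mean((pred - target) ** 2)))
    # Temporal coherence: agreement between predicted and true first differences.
    r_temp = 1.0 / (1.0 + float(np.mean((np.diff(pred) - np.diff(target)) ** 2)))
    return w_struct * r_struct + w_acc * r_acc + w_temp * r_temp

print(forecast_reward([1.0, 1.2, 1.1], [1.0, 1.3, 1.1], well_formatted=True))
```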
What did they find?
Time-R1 achieved significant improvements over baseline models on nine different datasets, with lower mean squared error (MSE) and mean absolute error (MAE). Ablation studies confirmed that both the supervised fine-tuning and reinforcement learning stages contributed to these gains. Additionally, the reasoning paths generated by Time-R1 were more coherent and interpretable than those of competing models, demonstrating enhanced explainability alongside accuracy.
Why does this matter?
This work pioneers a slow-thinking approach to time series forecasting by explicitly training LLMs to generate structured reasoning steps. By combining reasoning with reinforcement learning, it not only boosts forecasting accuracy but also improves interpretability—a key factor for real-world applications like finance, supply chain management, and climate modeling. This methodology opens new avenues for integrating complex reasoning into AI systems that handle sequential data, bridging the gap between large language models and traditional time series analysis.
Key Points
Introduces Time-R1, a two-stage reinforcement fine-tuning framework for time series forecasting with LLMs.
Uses synthetic reasoning trajectories and multi-objective rewards to train models to analyze temporal data.
Employs GRIP sampling to enhance exploration of reasoning paths and improve generalization.
Achieves state-of-the-art accuracy and interpretability across multiple datasets.
PAL: Probing Audio Encoders via LLMs - A Study of Information Transfer from Audio Encoders to LLMs

Image from arXiv paper.
What’s the research question?
How do architectural design choices affect the transfer of semantic information from audio encoders to large language models (LLMs)?
What did the authors do?
- Investigated how different ways of integrating audio representations into LLMs impact their ability to understand and query audio content.
- Used a standard audio-LLM architecture inspired by Pengi/LLaVA, with modifications guided by mechanistic interpretability hypotheses.
- Baseline architecture combined audio and text via cross-attention and projection layers, prepending audio tokens to text tokens.
- Tested three key hypotheses:
Delaying audio integration until later LLM layers
Processing audio exclusively through attention without propagating to feed-forward networks (FFNs)
Using an ensemble of diverse audio encoders to capture varied audio features
- Evaluated all modifications individually and in combination using a three-stage training curriculum on 5.6 million audio-text pairs.
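The sketch below illustrates the "late, attention-only" integration hypothesis in PyTorch: audio features enter through cross-attention from a chosen layer onward and never touch the feed-forward path. The layer index, dimensions, and module names are assumptions for illustration, not PAL's released code.

```python
import torch
import torch.nn as nn

class LateAudioCrossAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, inject_from_layer=16):
        super().__init__()
        self.inject_from_layer = inject_from_layer
        self.audio_proj = nn.Linear(1024, d_model)           # map encoder features to LLM width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden, audio_feats, layer_idx):
        """hidden: (B, T_text, d_model); audio_feats: (B, T_audio, 1024)."""
        if layer_idx < self.inject_from_layer:
            return hidden                                     # early layers stay text-only
        audio = self.audio_proj(audio_feats)
        attended, _ = self.cross_attn(query=hidden, key=audio, value=audio)
        return hidden + attended                              # residual add; FFN path untouched

x = torch.randn(2, 10, 512)      # text hidden states
a = torch.randn(2, 50, 1024)     # audio encoder features
block = LateAudioCrossAttention()
print(block(x, a, layer_idx=20).shape)   # torch.Size([2, 10, 512])
```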
What did they find?
- Delaying audio integration until later layers improved model performance across training stages.
- Processing audio solely through attention was sufficient for effective cross-modal concept activation, without needing FFN propagation.
- An ensemble of diverse audio encoders broadened the LLM’s capacity to query different audio information types.
- The combined architecture with all modifications achieved relative improvements ranging from 10% to 60% over the baseline.
Why does this matter?
This work provides valuable architectural insights for designing more effective audio-LLMs capable of rich cross-modal reasoning. By systematically evaluating how and when audio information is integrated, it advances our understanding of how to enable LLMs to effectively query and utilize complex audio representations. These findings have broad implications for improving multimodal AI systems that combine language and audio, enhancing applications in audio-grounded reasoning, speech understanding, and human-AI interaction.
Key Points
Delaying audio integration to later LLM layers improves cross-modal transfer.
Processing audio exclusively through attention suffices for effective querying.
An ensemble of diverse audio encoders enhances the model’s ability to handle varied audio content.
Systematic architectural probing reveals best practices for audio-LLM design.
VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Image from arXiv paper.
What’s the research question?
How can agentic frameworks improve long video understanding without relying solely on large context windows in multimodal large language models?
What did the authors do?
The authors developed VideoDeepResearch, an innovative framework that combines a text-only reasoning model with a modular multimodal toolkit to tackle long video understanding. Key components include:
Core reasoning model: A large reasoning model (LRM) that plans and executes problem-solving strategies through iterative thought generation and action.
Multimodal toolkit: Specialized modules that handle different aspects of video content:
Video Clip Retriever: Segments long videos into fixed-length clips and retrieves the most relevant ones based on a query.
Subtitle Retriever: Finds relevant subtitles within specific timeframes.
Visual Perceiver: Converts local visual information from short video segments into textual descriptions.
Subtitle Extractor: Identifies subtitles within given timestamps.
Video Browser: Provides an overall understanding of the entire video for general questions.
Inference process: The model segments videos, initializes context with instructions and questions, then iteratively generates reasoning thoughts and invokes tools based on its reasoning, updating context with tool outputs until it produces a final answer.
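A schematic version of this think-act-observe loop might look as follows; `reasoner`, the tool-call format, and the step budget are placeholders rather than the paper's actual interface.

```python
def answer_question(question, video, reasoner, tools, max_steps=8):
    """tools: dict mapping tool names (e.g. 'clip_retriever', 'subtitle_retriever',
    'visual_perceiver', 'subtitle_extractor', 'video_browser') to callables."""
    context = [f"Question: {question}", f"Video: {video} (pre-segmented into clips)"]
    for _ in range(max_steps):
        step = reasoner("\n".join(context))            # returns {'thought', 'action', 'argument'}
        context.append(f"Thought: {step['thought']}")
        if step["action"] == "final_answer":
            return step["argument"]
        observation = tools[step["action"]](video, step["argument"])
        context.append(f"Observation: {observation}")  # tool output feeds the next thought
    return None  # step budget exhausted without an answer
```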
What did they find?
VideoDeepResearch achieved an average score of 60.8 across four benchmark datasets, significantly outperforming:
The base Qwen2.5VL-7B model (52.0)
Other long video models like LongVILA-7B (49.0) and Video-XL-7B (45.5)
It maintained strong performance across videos of varying lengths, with only a 4.9-point decline from shorter to longer videos, demonstrating robustness. Additionally, it was more efficient in token usage, requiring only 48,932 tokens for shorter videos and 53,920 for longer ones, outperforming models like GPT-4o and Gemini-1.5-Pro in both accuracy and efficiency.
Why does this matter?
This work shows that an agentic framework combining strategic reasoning with specialized multimodal tools can effectively understand long videos without relying solely on large context windows. This challenges the common assumption that massive context sizes are necessary, offering a more resource-efficient approach that can adaptively select relevant information. The success of VideoDeepResearch opens new avenues for designing intelligent systems that reason strategically and process multimodal content more effectively, with potential applications in video analysis, content retrieval, and AI-powered multimedia understanding. By leveraging modular tools and iterative reasoning, this approach could significantly advance how AI comprehends complex, long-form visual and textual data in real-world scenarios.
Reliable Reasoning Path: Distilling Effective Guidance for LLM Reasoning with Knowledge Graphs
What’s the research question?
Can integrating structured knowledge graphs and logical reasoning paths improve the reasoning accuracy of large language models (LLMs)?
What did the authors do?
The authors introduced the Reliable Reasoning Path (RRP) framework, which combines multiple strategies to generate and refine reasoning paths for LLMs:
Semantic reasoning paths: Generated by LLMs based on the question's meaning, capturing natural language-based reasoning steps.
Structural reasoning paths: Derived from knowledge graphs (KGs) using relation embeddings and bidirectional learning to explore graph structure.
Rethinking module: Evaluates and ranks the generated paths to filter out redundancies and emphasize the most relevant and coherent ones.
Implementation: Built on the LLaMA2-Chat-7B model, with hyperparameters tuned for each dataset.
Evaluation: Tested on WebQSP and CWQ datasets, measuring performance with Hits@1 and F1 scores.
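As a toy illustration of the rethinking step, one could score candidate paths by similarity to the question, drop near-duplicates, and keep the top few; the embedding model and thresholds below are arbitrary choices, not those used in RRP.

```python
from sentence_transformers import SentenceTransformer, util

def rethink(question, paths, top_k=3, redundancy_threshold=0.95):
    """Rank candidate reasoning paths by relevance and filter redundant ones."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    q_emb = model.encode(question, convert_to_tensor=True)
    p_emb = model.encode(paths, convert_to_tensor=True)
    relevance = util.cos_sim(q_emb, p_emb)[0]                     # one score per path
    ranked = sorted(zip(paths, relevance.tolist(), p_emb), key=lambda t: -t[1])
    kept, kept_embs = [], []
    for path, score, emb in ranked:
        if any(float(util.cos_sim(emb, k)) > redundancy_threshold for k in kept_embs):
            continue                                              # near-duplicate of a kept path
        kept.append((path, score))
        kept_embs.append(emb)
        if len(kept) == top_k:
            break
    return kept
```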
What did they find?
The RRP framework achieved state-of-the-art results:
WebQSP: 90.0% Hits@1 and 72.5 F1
CWQ: 64.5% Hits@1 and 56.5 F1
Ablation studies confirmed that removing any component (semantic paths, structural paths, or rethinking module) reduced performance, highlighting their importance. Hyperparameter tuning identified optimal thresholds for filtering and weighting paths.
Limitations include potential computational overhead from generating and evaluating multiple paths and reliance on the quality of knowledge graphs.
Why does this matter?
This work advances the integration of structured knowledge and logical reasoning in LLMs, leading to more accurate and interpretable reasoning. Its plug-and-play design makes it adaptable to various models and tasks, paving the way for more robust AI systems capable of complex, multi-step reasoning by effectively leveraging external knowledge sources.
Key Points
Combines semantic and structural reasoning paths to enhance LLM reasoning.
Introduces a rethinking module to select the most relevant reasoning paths.
Achieves state-of-the-art performance on knowledge-based question answering datasets.
Framework is flexible and can be integrated with different LLMs and knowledge graphs.
Resa: Transparent Reasoning Models via SAEs

Image from arXiv paper.
What’s the research question?
How can we efficiently and transparently induce strong reasoning abilities in language models without relying on costly reinforcement learning or explicit reasoning traces?
What did the authors do?
The authors introduced SAE-Tuning, a novel approach to elicit reasoning in language models through the following steps:
Stage 1: Train a Sparse Autoencoder (SAE) to learn reasoning-related features by reconstructing activations from a source language model using a trigger dataset.
Stage 2: Insert the trained SAE into a target language model at a specific layer and fine-tune the model using a labeled elicitation dataset, guiding the model to develop reasoning pathways without explicit reasoning traces.
The SAE acts as a bridge, capturing and transferring reasoning features efficiently, requiring only minimal compute (roughly $1 and about 20 minutes of training) and verified question-answer data.
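The Stage-1 idea can be sketched as a standard sparse autoencoder over hidden activations; the sizes, sparsity coefficient, and hook point below are illustrative assumptions, and Stage 2 would then splice the trained module into the target model before fine-tuning on the elicitation data.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=2048, d_features=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))        # sparse feature activations
        return self.decoder(z), z

def sae_loss(recon, h, z, l1_coeff=1e-3):
    # reconstruction fidelity + L1 sparsity penalty on the feature code
    return torch.mean((recon - h) ** 2) + l1_coeff * z.abs().mean()

sae = SparseAutoencoder()
h = torch.randn(64, 2048)                      # activations from a chosen source-model layer
recon, z = sae(h)
sae_loss(recon, h, z).backward()
```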
What did they find?
The SAE-Tuning method demonstrated impressive results:
Achieved high reasoning performance, e.g., 43.33% Pass@1 on AIME24 and 90% on AMC23, comparable to or better than models trained with reinforcement learning.
Required minimal training costs and data, significantly reducing resource barriers.
Produced reasoning abilities that are generalizable across datasets and modular, allowing them to be attached to other models without retraining.
Limitations include the need for a source model to extract reasoning features and potential challenges in scaling to very large or diverse reasoning tasks.
Why does this matter?
This work offers a resource-efficient and transparent approach to enhancing reasoning in language models, addressing key challenges in AI interpretability and generalization. By avoiding costly reinforcement learning and explicit reasoning traces, SAE-Tuning makes reasoning capabilities more accessible and easier to deploy in real-world applications such as education, automated reasoning, and AI assistants. Its modularity and transferability open new avenues for building more capable, interpretable, and adaptable AI systems that can better understand and solve complex problems.
Key Points
Introduces SAE-Tuning, a novel autoencoder-based method for eliciting reasoning in language models.
Achieves high reasoning performance with minimal computational resources and data.
Produces reasoning abilities that are generalizable and modular, enabling transfer to other models.
Offers a transparent alternative to reinforcement learning and explicit reasoning trace methods.
LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs
What’s the research question?
Can large language models (LLMs) effectively perform complex logical planning and relational reasoning tasks involving intricate relational structures?
What did the authors do?
The authors developed LogiPlan, a comprehensive benchmark designed to evaluate LLMs on logical planning and relational reasoning. Their approach included:
Three core tasks: Plan Generation (construct relational graphs), Consistency Detection (identify inconsistencies), and Comparison Questions (evaluate relationship validity).
Dynamic variation of task complexity by adjusting parameters like the number of objects, relations, and relational chain depth.
Generation of synthetic relational graphs using Python and NetworkX to control complexity and structure.
Evaluation of a range of models, including state-of-the-art reasoning models (DeepSeek R1, O1) and instruction-tuned LLMs (GPT-4.5, Llama 3.1 405B).
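Since the benchmark's graphs are generated with Python and NetworkX, a minimal generator in that spirit might look like this; relation names and parameters are illustrative assumptions, not the benchmark's exact specification.

```python
import random
import networkx as nx

RELATIONS = ["contains", "precedes", "depends_on"]   # illustrative relation types

def make_relational_graph(n_objects=8, n_relations=12, seed=0):
    """Random directed relational graph; complexity grows with both parameters."""
    rng = random.Random(seed)
    g = nx.DiGraph()
    g.add_nodes_from(f"obj_{i}" for i in range(n_objects))
    while g.number_of_edges() < n_relations:
        u, v = rng.sample(sorted(g.nodes), 2)
        g.add_edge(u, v, relation=rng.choice(RELATIONS))
    return g

g = make_relational_graph()
depth = nx.dag_longest_path_length(g) if nx.is_directed_acyclic_graph(g) else "cyclic"
print(g.number_of_nodes(), "objects,", g.number_of_edges(), "relations, chain depth:", depth)
```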
What did they find?
The study revealed several key insights:
Model scale and architecture matter: Reasoning-focused models like DeepSeek R1 and O1 achieved high accuracy in Plan Generation (>97%), while instruction-tuned models like GPT-4.5 and Llama 3.1 405B performed well but showed less consistency.
Complexity challenges: All models struggled as graph size and relational chain length increased. For example, DeepSeek R1’s F1 score dropped from 0.85 in simple graphs to 0.45 in complex ones.
Relational reasoning difficulty: In Comparison Question tasks, accuracy decreased with complexity, with GPT-4.5 reaching only 49.9% against a 33% random-chance baseline.
Self-correction helps: Prompting models to verify and refine their outputs improved performance; notably, Gemini 2’s F1 score increased by over 10%.
Why does this matter?
LogiPlan fills a critical gap in AI benchmarking by providing a structured, scalable way to evaluate LLMs on logical planning and relational reasoning. These skills are vital for real-world applications such as:
Designing network topologies
Validating knowledge bases
Modeling complex business processes
and are essential in domains where relational coherence and logical consistency are paramount. The benchmark’s ability to vary difficulty helps researchers identify strengths and weaknesses across models, guiding improvements in architecture, training, and self-correction techniques. Ultimately, LogiPlan advances our understanding of how well LLMs can handle the kind of structured, multi-step reasoning that underpins many AI applications, especially those requiring rigorous relational logic and decision-making.
Key Points
LogiPlan is a new benchmark for logical planning and relational reasoning in LLMs, covering graph generation, consistency, and relational validation.
Model performance varies significantly with scale and architecture; reasoning models outperform instruction-tuned LLMs on complex tasks.
Increasing graph complexity and relational chain length challenges all models, highlighting areas for future improvement.
Self-correction prompts can enhance model reasoning accuracy, showing promise for more robust AI systems.

Image from arXiv paper.
What’s the research question?
How can retrieval-augmented prompting improve mistake identification in AI tutors?
What did the authors do?
The authors explored four different approaches to enhance mistake detection in AI tutoring systems using large language models (LLMs):
Ensemble of traditional classifiers: Extracted token embeddings from five pretrained language models (BERT, RoBERTa, XLNet, T5, GPT-2) on conversation history and responses, then trained classifiers (SVM, Decision Tree, Random Forest) on these features.
Token-level attention model: Used a sentence-transformer encoder (all-mpnet-base-v2) combined with a custom multi-head attention module to model interactions between conversation history and student responses at the token level.
Frozen sentence-transformer with MLP: Encoded conversation history and responses with a fixed sentence-transformer, then concatenated embeddings and classified with a multilayer perceptron.
Retrieval-augmented few-shot classification: Retrieved semantically similar examples from a vector database, constructed structured prompts, and used GPT-4o to classify mistakes, leveraging retrieval to provide contextually relevant examples.
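Approach 4 can be sketched roughly as follows: embed the dialogue, retrieve the nearest labeled examples, and assemble a few-shot prompt for the LLM judge. The encoder choice, prompt template, and field names here are illustrative guesses; the actual system used a vector database and GPT-4o as described above.

```python
from sentence_transformers import SentenceTransformer, util

def build_prompt(history, response, bank, k=4):
    """bank: list of dicts with 'history', 'response', and 'label' keys."""
    encoder = SentenceTransformer("all-mpnet-base-v2")
    query = encoder.encode(history + " " + response, convert_to_tensor=True)
    keys = encoder.encode([ex["history"] + " " + ex["response"] for ex in bank],
                          convert_to_tensor=True)
    top = util.cos_sim(query, keys)[0].topk(min(k, len(bank))).indices.tolist()
    shots = "\n\n".join(
        f"History: {bank[i]['history']}\nTutor reply: {bank[i]['response']}\n"
        f"Mistake identified: {bank[i]['label']}"
        for i in top
    )
    # The assembled prompt would then be sent to the LLM judge for classification.
    return f"{shots}\n\nHistory: {history}\nTutor reply: {response}\nMistake identified:"
```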
What did they find?
The approaches showed varying levels of success:
Approach 1 (ensemble classifiers): F1 = 0.446, Accuracy = 0.657
Approach 2 (token-level attention): F1 = 0.571, Accuracy = 0.765
Approach 3 (frozen sentence-transformer + MLP): F1 = 0.583, Accuracy = 0.809
Approach 4 (retrieval-augmented GPT-4o): F1 = 0.584, Accuracy = 0.827
In lenient evaluation settings, the retrieval-augmented GPT-4o approach outperformed others with F1 = 0.814 and accuracy = 0.897. However, the final system ranked 37th on the shared task leaderboard, indicating room for further optimization and integration.
Limitations include the complexity of combining multiple models and retrieval components, as well as the challenge of balancing precision and recall in mistake detection.
Why does this matter?
This work demonstrates the power of combining retrieval-augmented prompting with large language models to improve nuanced mistake identification in educational NLP tasks. By effectively leveraging example retrieval and LLM reasoning, the approach can help develop more pedagogically capable AI tutors that provide precise and contextually relevant feedback. This has broader implications for personalized education, automated grading, and intelligent tutoring systems, where accurate mistake detection is crucial for adaptive learning and student support.
Key Points
Retrieval-augmented prompting enhances mistake detection by providing contextually relevant examples to LLMs.
Combining multiple embedding and attention-based models improves classification accuracy over single-model approaches.
The retrieval-augmented GPT-4o system achieved the highest performance in the shared task, highlighting the value of example retrieval.
Advances in pedagogical feedback can lead to smarter, more responsive AI tutors and personalized learning experiences.
OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

Image from arXiv paper.
What’s the research question?
Can large language models (LLMs) effectively learn from iterative feedback to optimize solutions in large, complex search space problems?
What did the authors do?
- Developed OPT-BENCH, a comprehensive benchmark with 20 real-world machine learning tasks and 10 classical NP-hard problems to test LLMs' iterative reasoning and solution refinement.
- Created OPT-Agent, an end-to-end optimization pipeline that mimics human problem-solving by:
Drafting initial solutions
Iteratively improving solutions based on feedback
Debugging solutions to fix errors
- Evaluated nine state-of-the-art LLMs from six different model families across various hyperparameters, including iteration count and temperature settings.
- Measured performance using metrics such as Win Count (how often historical feedback improved solutions), Buggy Rate (invalid solutions), Rank (overall optimization quality), and Improvement Rate (relative gains from feedback).
- Analyzed how historical context, iteration length, and temperature influenced solution quality and convergence.
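A stripped-down version of that draft-improve-debug loop is sketched below; `llm`, `evaluate`, and the prompts are placeholders, and the real OPT-Agent pipeline and its metrics (Win Count, Buggy Rate, Rank, Improvement Rate) are considerably richer.

```python
def optimize(task, llm, evaluate, max_iters=10):
    """llm(prompt) -> candidate solution; evaluate(solution) -> (ok, score_or_error)."""
    history = []
    solution = llm(f"Draft an initial solution for:\n{task}")
    for _ in range(max_iters):
        ok, result = evaluate(solution)
        history.append((solution, result, ok))
        if ok:   # valid solution: ask for an improvement conditioned on the history
            prompt = (f"Task:\n{task}\n\nPrevious attempts and scores:\n"
                      f"{[(s, r) for s, r, o in history if o]}\n\nPropose a better solution.")
        else:    # invalid solution: ask for a bug fix
            prompt = f"Task:\n{task}\n\nThis attempt failed with:\n{result}\n\nFix it:\n{solution}"
        solution = llm(prompt)
    scored = [(s, r) for s, r, o in history if o]
    return max(scored, key=lambda sr: sr[1])[0] if scored else solution
```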
What did they find?
- Incorporating historical feedback significantly improved optimization performance for most models, increasing Win Counts and Improvement Rates.
- Longer iteration horizons generally led to better solutions, demonstrating the benefit of multiple refinement steps.
- Lower to moderate temperature settings produced more stable and consistent solutions.
- In classical NP-hard problems, models also benefited from historical information, with reduced Buggy Rates and better overall Ranks.
- Open-source models lagged behind proprietary ones, especially on NP-hard tasks, highlighting the importance of model size and training data.
- Effectiveness of historical feedback varied across tasks and models, suggesting some models are better suited to iterative optimization than others.
Why does this matter?
This work advances our understanding of how large language models can be used not just for language tasks but also as powerful agents for complex optimization problems. By demonstrating that LLMs can learn from iterative feedback and improve solutions over multiple steps, the study opens new avenues for applying LLMs to real-world challenges such as hyperparameter tuning, combinatorial optimization, and automated machine learning. The OPT-BENCH benchmark and OPT-Agent framework provide valuable tools for researchers to evaluate and enhance LLM-based problem-solving agents. Ultimately, this research highlights the potential of LLMs to serve as adaptable, intelligent agents capable of tackling large-scale, dynamic search spaces with minimal human intervention.
Key Points
Introduces OPT-BENCH, a benchmark for evaluating LLMs on large-scale optimization problems.
Shows that leveraging historical feedback improves LLM solution quality across diverse tasks.
Demonstrates the importance of iteration length and temperature tuning for stable optimization.
Highlights differences between open-source and proprietary LLMs in complex problem-solving.
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
What’s the research question?
Can we develop a versatile, multimodal agent that can perceive, plan, act, ground, and reflect across diverse tasks in the open-world environment of Minecraft?
What did the authors do?
The authors introduced Optimus-3, a groundbreaking multimodal agent designed to handle a wide range of Minecraft tasks. Their approach combined three key innovations:
Knowledge-enhanced data generation pipeline: Leveraged a knowledge graph to generate diverse task plans, which were executed by a goal-conditioned policy (STEVE-1) to produce high-quality observation-action pairs. During execution, visual frames were sampled and annotated by expert models with Minecraft-specific knowledge for tasks like captioning, embodied question answering (QA), and grounding.
Mixture-of-Experts (MoE) architecture with task-level routing: Implemented a modular architecture where each task type was assigned to a dedicated expert, with a shared knowledge expert facilitating transfer across tasks. This design prevented interference among heterogeneous tasks and supported scalability by allowing new tasks to be added seamlessly.
Multimodal reasoning-augmented reinforcement learning (RL): Enhanced the agent’s reasoning capabilities by explicitly generating multimodal reasoning processes based on visual content and instructions before executing tasks. The Group Relative Policy Optimization (GRPO) algorithm with an IoU-Density Reward was used to fine-tune the model, encouraging precise grounding and reasoning.
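A minimal sketch of task-level routing, assuming each task label selects its own expert while a shared knowledge expert is always mixed in; the dimensions, task names, and equal mixing weights are illustrative rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    def __init__(self, d_model=512,
                 tasks=("planning", "captioning", "qa", "grounding", "reflection")):
        super().__init__()
        self.task_experts = nn.ModuleDict({t: nn.Linear(d_model, d_model) for t in tasks})
        self.shared_expert = nn.Linear(d_model, d_model)   # knowledge shared across tasks

    def forward(self, hidden, task):
        # Routing is by task label (no learned gate), so heterogeneous tasks do not
        # interfere, and adding a task only means registering one new expert.
        return 0.5 * self.task_experts[task](hidden) + 0.5 * self.shared_expert(hidden)

moe = TaskRoutedMoE()
h = torch.randn(2, 16, 512)
print(moe(h, task="grounding").shape)   # torch.Size([2, 16, 512])
```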
What did they find?
Optimus-3 achieved state-of-the-art performance across six diverse Minecraft tasks, including:
Long-horizon planning
Captioning
Embodied question answering (QA)
Grounding
Reflection
with notable improvements:
20% better on planning
66% better on captioning
76% better on embodied QA
3.4x improvement on grounding
18% better on reflection
Compared to previous generalist multimodal models and specialized agents, Optimus-3 demonstrated superior scalability and task performance. Ablation studies confirmed that:
The MoE architecture effectively prevented task interference and supported adding new tasks without degrading existing ones.
The multimodal reasoning phase significantly improved vision-related task accuracy.
Limitations include the complexity of integrating multiple components and the need for extensive annotated data, though the approach shows strong promise for scalable, multimodal agent development.
Why does this matter?
Optimus-3 pushes the frontier of generalist multimodal AI agents capable of handling complex, diverse tasks in open-world environments like Minecraft. Its innovative architecture and training methodology provide a scalable blueprint for building agents that can reason across modalities, adapt to new tasks, and operate robustly in dynamic settings. This work has broad implications for AI applications requiring integrated perception, reasoning, and action—such as robotics, interactive virtual assistants, and autonomous agents—by demonstrating how to combine multimodal understanding with scalable, task-specific learning without interference. Ultimately, Optimus-3 advances the goal of creating AI systems that can think, see, and act like flexible, intelligent agents in the real world.
Key Points
Introduces Optimus-3, a multimodal Minecraft agent with scalable task experts.
Combines knowledge-enhanced data generation, MoE architecture, and multimodal reasoning RL.
Achieves state-of-the-art results across diverse Minecraft tasks with strong scalability.
Provides a blueprint for scalable, multimodal generalist agents in open-world environments.