VideoLLMs Guide the Blind, EchoInk Learns to Listen, X-Reasoner Defies Modality, and “Wait” Boosts Math Scores

I Can See Forever!: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Image from arXiv paper — copyright belongs to authors or publishers.

What’s the research question?
How effective are real-time Video-based Large Language Models (VideoLLMs) in assisting visually impaired individuals with daily life tasks?

What did the authors do?
To evaluate three state-of-the-art VideoLLMs for assistive applications, the study:

  • Introduced two datasets: VisAssistDaily (tasks in Basic Skills, Home Life, Social Life) and SafeVid (1,204 videos annotated for environmental hazards).

  • Assessed GPT-4o, Zhipu, and VITA-1.5 on task success rate, response latency, prompt cost, and language consistency (a metric-aggregation sketch follows this list).

  • Conducted user studies with visually impaired volunteers to gather qualitative feedback on model assistance in closed- and open-world scenarios.

  • Fine-tuned VITA-1.5 on SafeVid to improve proactive hazard detection capabilities.
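
For concreteness, here is a minimal sketch of how logged interactions could be aggregated into the metrics named above (task success rate, mean latency, prompt cost). The record fields are illustrative assumptions, not the paper's logging schema.

```python
# Minimal sketch: aggregating logged assistive-task interactions into summary
# metrics. The field names below are illustrative, not taken from the paper.
from dataclasses import dataclass
from statistics import mean

@dataclass
class InteractionRecord:
    task: str           # e.g. "read the label on this bottle"
    success: bool       # did the model complete the task?
    latency_s: float    # seconds from request to first usable response
    prompt_tokens: int  # proxy for prompt cost

def summarize(records: list[InteractionRecord]) -> dict:
    return {
        "task_success_rate": mean(r.success for r in records),
        "mean_latency_s": mean(r.latency_s for r in records),
        "mean_prompt_tokens": mean(r.prompt_tokens for r in records),
    }

if __name__ == "__main__":
    demo = [
        InteractionRecord("read the label", True, 1.8, 950),
        InteractionRecord("locate the exit", False, 2.4, 1100),
    ]
    print(summarize(demo))
```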

What did they find?
Key results include:

  • GPT-4o achieved the highest average task success rate of 92.63%, outperforming other models across most tasks.

  • User studies reported high satisfaction, with some closed-world tasks reaching 100% success.

  • Challenges remained in complex environments, particularly for hazard recognition and spatial reasoning.

  • Fine-tuned VITA-1.5 improved proactive risk detection accuracy to 62.24%, showing promise for real-world safety applications.

  • Users highlighted the need for improved accuracy and consistency in noisy or dynamic settings.

Why does this matter?
This research advances the practical deployment of VideoLLMs as assistive tools for visually impaired individuals, emphasizing real-time interaction and proactive environmental awareness. The evaluation framework and datasets provide a foundation for future improvements, aiming to enhance independence and quality of life through AI-powered assistance.

Key Points

  • Systematic evaluation of VideoLLMs on real-world assistive tasks for visually impaired users.

  • Introduces new datasets targeting daily activities and environmental hazard detection.

  • Demonstrates high task success rates with GPT-4o and improvements via fine-tuning.

  • Highlights challenges in complex scenarios and need for robustness enhancements.

REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLMs

Image from arXiv paper — copyright belongs to authors or publishers.

What’s the research question?
How can we effectively evaluate the safety and ethical implications of Vision Large Language Models (VLLMs) in multi-turn interactions involving image inputs?

What did the authors do?
The authors developed REVEAL, a comprehensive framework for multi-turn harm evaluation in VLLMs:

  • Defined detailed harm policies covering sexual harm, violence, and misinformation.

  • Automated image mining to collect relevant images aligned with harm policies.

  • Generated seed user queries targeting harm categories, ensuring contextual relevance.

  • Expanded seeds into multi-turn dialogues using a crescendo attack strategy that incrementally increases harmfulness.

  • Evaluated five state-of-the-art VLLMs on 4,750 generated conversations, using GPT-4o as an automated evaluator with few-shot prompting (see the sketch after this list).
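
Below is a hedged sketch of what a crescendo-style multi-turn probe might look like. The `query_target` and `judge_harm` callables are hypothetical stand-ins for the target VLLM and the GPT-4o-based evaluator; REVEAL's actual prompts, harm policies, and scoring rubric are not reproduced here.

```python
# Sketch of one crescendo-style multi-turn conversation with per-turn judging.
# `query_target` and `judge_harm` are hypothetical callables, not REVEAL's code.
from typing import Callable

def crescendo_dialogue(
    seed_query: str,
    image_ref: str,
    escalations: list[str],                      # increasingly harmful follow-ups
    query_target: Callable[[list[dict]], str],   # messages -> model reply
    judge_harm: Callable[[list[dict]], dict],    # messages -> {"defect": bool, "refusal": bool}
) -> dict:
    """Run one multi-turn conversation and record defects/refusals per turn."""
    messages = [{"role": "user", "content": seed_query, "image": image_ref}]
    per_turn = []
    for follow_up in [None] + escalations:
        if follow_up is not None:
            messages.append({"role": "user", "content": follow_up})
        reply = query_target(messages)
        messages.append({"role": "assistant", "content": reply})
        per_turn.append(judge_harm(messages))
    return {
        "defect": any(t["defect"] for t in per_turn),
        "refusals": sum(t["refusal"] for t in per_turn),
        "turns": len(per_turn),
    }
```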

What did they find?
Key findings include:

  • Multi-turn interactions resulted in roughly double the defect rates compared to single-turn evaluations across all models.

  • GPT-4o had a defect rate of 6.33%, while Llama-3.2 showed the highest at 16.55%.

  • Lower refusal rates in multi-turn settings indicated models struggled more with contextually integrated harmful requests.

  • Misinformation emerged as a critical vulnerability area, with varying robustness across models.

  • The Safety-Usability Index (SUI) highlighted trade-offs between safety and usability among models.

Why does this matter?
REVEAL addresses a crucial gap by evaluating VLLMs in realistic multi-turn, multimodal scenarios, reflecting real user interactions. Its scalable and automated approach enables better identification of vulnerabilities, informing safer AI system design and policy. This work advances the field of AI safety by emphasizing the importance of multi-turn context in harm detection.

Key Points

  • Introduces a novel multi-turn harm evaluation framework for vision-language models.

  • Employs automated image mining and crescendo attack dialogue generation.

  • Reveals increased vulnerabilities in multi-turn versus single-turn settings.

  • Highlights misinformation as a key safety challenge for VLLMs.

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

What’s the research question?

How can reinforcement learning enhance audio-visual reasoning in multimodal large language models?

What did the authors do?

The authors developed EchoInk-R1, a framework that fine-tunes a multimodal LLM using reinforcement learning to improve reasoning over synchronized audio-image inputs:

  • Started with the Qwen2.5-Omni-7B model and fine-tuned it on the AVQA-R1-6K dataset containing paired audio and images with multiple-choice questions.

  • Employed Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that computes each sampled response's advantage relative to the reward statistics of its group, rather than relying on an explicit learned value function.

  • Designed two reward signals: Answer Accuracy (matching ground-truth answers) and Format Consistency (ensuring outputs follow structured reasoning and answer tags); both are sketched after this list.

  • Fine-tuned the model over 562 reinforcement learning iterations, focusing on enhancing structured reasoning capabilities.
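
As a concrete illustration of the two rewards and the group-relative advantage at the heart of GRPO, here is a minimal sketch. The tag format and equal reward weighting are assumptions for illustration, not EchoInk-R1's exact implementation.

```python
# Minimal sketch of accuracy + format rewards and GRPO-style group advantages.
import re

def format_reward(output: str) -> float:
    """1.0 if the output wraps reasoning and answer in the expected tags."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_answer = re.search(r"<answer>.*?</answer>", output, re.DOTALL) is not None
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(output: str, gold_choice: str) -> float:
    """1.0 if the text inside <answer>...</answer> matches the gold choice."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold_choice else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled response's reward by the group mean and std
    instead of using a learned value function (the GRPO idea)."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + 1e-8) for r in rewards]

# Example: a group of 4 sampled responses for one audio-image question.
outputs = [
    "<think>hoofbeats plus an image of a horse</think><answer>B</answer>",
    "<think>sounds like rain</think><answer>C</answer>",
    "no tags, just an answer: B",
    "<think>...</think><answer>B</answer>",
]
rewards = [accuracy_reward(o, "B") + format_reward(o) for o in outputs]
print(group_relative_advantages(rewards))
```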

What did they find?

EchoInk-R1-7B achieved an accuracy of 85.77% on the validation set, significantly outperforming the baseline Qwen2.5-Omni-7B model’s 80.53%. Key observations include:

  • The emergence of “aha moments” where the model self-corrects and refines its reasoning on ambiguous inputs, demonstrating belief revision and improved cross-modal understanding.

  • Training dynamics showed a steady increase in accuracy rewards and a shift towards more concise reasoning outputs, indicating the model learned to express reasoning efficiently.

  • Limitations include the relatively small dataset size and focus on multiple-choice tasks, which may limit generalization to open-ended reasoning.

Why does this matter?

This work represents a significant step in integrating audio, visual, and textual modalities for complex reasoning tasks. By leveraging reinforcement learning, EchoInk-R1 enhances the structured reasoning abilities of multimodal LLMs, which is crucial for applications such as interactive agents, multimedia retrieval, and assistive technologies. The released code and dataset provide valuable resources for further research in multimodal reasoning, addressing a notable gap in combining audio and visual modalities effectively.

Key Points

  • Introduces reinforcement learning fine-tuning for multimodal LLMs combining audio and vision.

  • Uses Group Relative Policy Optimization to optimize answer accuracy and output format.

  • Demonstrates improved reasoning accuracy and emergent self-correction behaviors.

  • Provides new dataset and code to foster further research in audio-visual reasoning.

SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models

What’s the research question?

How can we achieve effective zero-shot spatial reasoning in 3D environments using multimodal large language models without the need for specialized 3D inputs or fine-tuning?

What did the authors do?

The authors proposed SpatialPrompting, a framework that enables spatial reasoning in 3D scenes by leveraging keyframe extraction and prompt engineering without specialized 3D model training:

  • Extracted keyframes from RGB-D videos using spatial (Mahalanobis distance) and semantic (cosine similarity) metrics to select diverse and informative views; a sketch of this selection follows the list.

  • Estimated 3D camera poses via RGB-D SLAM to provide spatial context for each keyframe.

  • Generated structured prompts combining keyframe images, camera poses, annotations, and user queries for multimodal LLMs.

  • Evaluated the approach on ScanQA and SQA3D benchmarks, focusing on zero-shot spatial reasoning performance.
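
The keyframe selection step can be pictured with a short greedy sketch: keep a frame only if it is spatially far (Mahalanobis distance over camera positions) and semantically distinct (cosine similarity of image features) from frames already kept. The thresholds and feature sources below are assumptions, not the paper's settings.

```python
# Greedy keyframe selection sketch combining spatial and semantic novelty.
import numpy as np

def select_keyframes(
    positions: np.ndarray,   # (N, 3) camera positions from RGB-D SLAM
    features: np.ndarray,    # (N, D) image embeddings (e.g., from a VLM encoder)
    spatial_thresh: float = 2.0,
    semantic_thresh: float = 0.9,
) -> list[int]:
    cov_inv = np.linalg.pinv(np.cov(positions.T) + 1e-6 * np.eye(3))
    kept: list[int] = [0]
    for i in range(1, len(positions)):
        diffs = positions[kept] - positions[i]                       # (K, 3)
        maha = np.sqrt(np.einsum("kd,de,ke->k", diffs, cov_inv, diffs))
        feat = features[i] / (np.linalg.norm(features[i]) + 1e-8)
        kept_feats = features[kept] / (
            np.linalg.norm(features[kept], axis=1, keepdims=True) + 1e-8
        )
        cos = kept_feats @ feat
        # Novel enough in both space and semantics -> keep as a keyframe.
        if maha.min() > spatial_thresh and cos.max() < semantic_thresh:
            kept.append(i)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(select_keyframes(rng.normal(size=(50, 3)) * 3, rng.normal(size=(50, 64))))
```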

What did they find?

SpatialPrompting achieved state-of-the-art zero-shot results:

  • On ScanQA, it reached an exact match rate (EM@1) of 27.34%, outperforming existing methods on metrics like ROUGE-L and SPICE.

  • On SQA3D, it excelled in diverse question categories including “What” and “How,” demonstrating robust spatial reasoning.

  • Ablation studies confirmed that keyframe selection, camera pose integration, and few-shot annotations each significantly contributed to performance.

  • Limitations were noted in handling questions requiring explicit user orientation, indicating room for improvement in spatial context modeling.

Why does this matter?

By showing that effective spatial reasoning can be achieved without costly 3D-specific fine-tuning, SpatialPrompting challenges conventional approaches that rely heavily on specialized inputs. This makes spatial reasoning more scalable and accessible for real-world applications such as robotics, augmented reality, and autonomous navigation. The framework paves the way for more efficient use of off-the-shelf multimodal LLMs in complex 3D environments.

Key Points

  • Introduces a novel keyframe-driven prompt generation strategy for spatial reasoning.

  • Leverages camera pose data and vision-language features without 3D fine-tuning.

  • Achieves state-of-the-art zero-shot performance on 3D spatial reasoning benchmarks.

  • Demonstrates the importance of combining spatial and semantic metrics for keyframe selection.

Crosslingual Reasoning through Test-Time Scaling

What’s the research question?
How effective is test-time scaling of English-centric reasoning language models on multilingual reasoning tasks?

What did the authors do?
The study investigated whether increasing inference compute at test time improves multilingual reasoning in English-trained models:

  • Used s1 models based on Qwen2.5-Instruct, fine-tuned on a small English STEM reasoning dataset (1k examples).

  • Evaluated crosslingual test-time scaling on the Multilingual Grade School Math (MGSM) benchmark across multiple languages.

  • Applied budget forcing to control inference length, either truncating the reasoning at a token cap or extrapolating it by appending “Wait” to elicit longer chains of thought (sketched after this list).

  • Assessed the impact of reasoning language control (language forcing) on accuracy and compliance.

  • Tested cross-domain generalization on benchmarks like Global-MMLU, FORK, and COPAL-ID.
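
Budget forcing itself is simple to sketch: stop the reasoning at a hard token cap, or, when the model tries to stop too early, append “Wait” and let it keep going (the trick behind this newsletter's title). The `generate` callable, tag handling, and token counting below are simplified assumptions, not the authors' code.

```python
# Sketch of budget forcing: truncate at max_tokens, extrapolate with "Wait".
from typing import Callable

THINK_END = "</think>"

def budget_forced_reasoning(
    prompt: str,
    generate: Callable[[str, int, list[str]], str],  # (prefix, max_new_tokens, stop) -> text
    min_tokens: int = 2000,
    max_tokens: int = 8000,
    max_waits: int = 4,
) -> str:
    """Extend or truncate the model's reasoning to hit a target compute budget."""
    reasoning = ""
    for _ in range(max_waits + 1):
        budget = max_tokens - len(reasoning.split())   # crude token proxy
        reasoning += generate(prompt + reasoning, budget, [THINK_END])
        if len(reasoning.split()) >= min_tokens:
            break
        # The model tried to stop early: suppress end-of-thinking and nudge it on.
        reasoning += "\nWait"
    return reasoning + THINK_END
```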

What did they find?
Key findings include:

  • Test-time scaling significantly improved accuracy for models ≥3B parameters, with the 14B model gaining +9.4% accuracy by increasing inference tokens from 0.5k to 8k.

  • s1 models outperformed larger state-of-the-art models on multilingual math reasoning tasks.

  • Models often mixed languages during reasoning, frequently quoting non-English phrases.

  • Reasoning in high-resource languages (e.g., French, Mandarin) yielded better results than low-resource languages.

  • Cross-domain generalization was limited; performance sometimes degraded with more inference compute outside STEM domains, suggesting overthinking effects.

Why does this matter?
This work highlights the practical benefit of increasing inference compute at test time to enhance multilingual reasoning, even for English-centric models. It suggests prioritizing high-resource languages for better performance and calls attention to the challenges of reasoning in low-resource languages and cross-domain settings. These insights guide future development of more robust and efficient multilingual reasoning systems.

Key Points

  • Demonstrates test-time scaling as a simple yet effective method to boost multilingual reasoning.

  • Shows superior performance of English-trained models on multilingual math tasks with increased inference budget.

  • Highlights language mixing and better results in high-resource languages.

  • Reveals challenges in cross-domain generalization and potential overthinking.

R3-VQA: “Read the Room” by Video Social Reasoning

Image from arXiv paper — copyright belongs to authors or publishers.

What’s the research question?
How can we effectively evaluate and enhance social reasoning capabilities in large vision-language models using a comprehensive video question answering dataset?

What did the authors do?
The authors created R3-VQA, a novel video question answering dataset focused on social reasoning:

  • Collected 316 video clips annotated with detailed social events, mental states (belief, intent, desire, emotion), and causal chains.

  • Generated 5,156 question-answer pairs, including 4,840 automatically generated and 316 human-designed, covering four question types: Event Understanding, Mental State Estimation, Causal-Why, and Causal-How/What.

  • Evaluated state-of-the-art large vision-language models (LVLMs) on the dataset, measuring accuracy and reasoning consistency with novel metrics such as Chain Consistency and Subchain Consistency (an illustrative sketch follows this list).

  • Tested model configurations with and without subtitles and with Theory of Mind (ToM) prompting to enhance reasoning.
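
One plausible way to compute a chain-level consistency score is sketched below; the paper's exact definitions of Chain Consistency and Subchain Consistency may differ, so treat this purely as an illustration of the idea of scoring coherence across linked questions.

```python
# Illustrative only: chain-level consistency from per-question correctness.
def chain_consistency(chains: list[list[str]], correct: dict[str, bool]) -> float:
    """Fraction of causal chains whose questions are *all* answered correctly."""
    ok = sum(all(correct[q] for q in chain) for chain in chains)
    return ok / len(chains)

def subchain_consistency(chains: list[list[str]], correct: dict[str, bool], k: int = 2) -> float:
    """Fraction of length-k contiguous subchains answered fully correctly."""
    subchains = [c[i:i + k] for c in chains for i in range(len(c) - k + 1)]
    ok = sum(all(correct[q] for q in sc) for sc in subchains)
    return ok / max(len(subchains), 1)

# Toy example: two causal chains over question IDs, with per-question correctness.
chains = [["q1", "q2", "q3"], ["q4", "q5"]]
correct = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": True}
print(chain_consistency(chains, correct), subchain_consistency(chains, correct))
```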

What did they find?
Results revealed significant challenges for LVLMs in social reasoning:

  • Even the strongest models evaluated (e.g., GPT-4o, Gemini 1.5 Pro) achieved only 48.73% accuracy on human-designed questions, far below the human accuracy of 80.06%.

  • Models struggled most with mental state estimation questions, showing lower accuracy than on event understanding.

  • Chain consistency scores were low, indicating difficulty maintaining coherent reasoning across related questions.

  • Incorporating subtitles and ToM prompting improved both accuracy and consistency, suggesting these are promising strategies to enhance social reasoning.

Why does this matter?
R3-VQA sets a new benchmark for evaluating social reasoning in multimodal AI systems, exposing current limitations in understanding complex social interactions and mental states. This work encourages development of more nuanced datasets and reasoning techniques, advancing artificial social intelligence. Improving social reasoning is critical for applications like human-computer interaction, assistive technologies, and social robotics.

Key Points

  • Introduces a rich video QA dataset focused on social events and mental states.

  • Evaluates LVLMs, revealing significant gaps in social reasoning capabilities.

  • Proposes novel consistency metrics to assess reasoning coherence.

  • Shows that subtitles and Theory of Mind prompting improve model performance.

X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

What’s the research question?
Is reasoning generalizable across modalities and domains?

What did the authors do?
The authors proposed X-Reasoner, a vision-language model trained to generalize reasoning across different input types and domains:

  • Started with Qwen2.5-VL-7B-Instruct as the base model.

  • Stage 1: Supervised fine-tuning on 114k text questions with distilled long chain-of-thought reasoning traces from the OpenThoughts-114k dataset (math, coding, science).

  • Stage 2: Reinforcement learning with verifiable rewards (RLVR) using Group Relative Policy Optimization on Orz-math-57k textual math questions.

  • Evaluated on four settings: general-domain text-only, general-domain multimodal, specialized-domain text-only, and specialized-domain multimodal tasks.

  • Implemented a forced-exiting mechanism to prevent endless reasoning loops and ensure concise outputs (sketched after this list).
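
A forced-exit guard can be sketched as follows: if the reasoning block is still open when the token budget runs out, close it and force a final answer. The `generate` callable, tags, and forcing string are assumptions for illustration, not X-Reasoner's actual decoding code.

```python
# Sketch of a forced-exit guard around a reasoning-style decoding loop.
from typing import Callable, Tuple

def generate_with_forced_exit(
    prompt: str,
    generate: Callable[[str, int, list[str]], Tuple[str, bool]],
    think_budget: int = 4096,
    answer_budget: int = 256,
) -> str:
    """`generate(prefix, max_new_tokens, stop) -> (text, hit_stop)` is a
    hypothetical wrapper around whatever decoding loop is in use."""
    reasoning, hit_stop = generate(prompt + "<think>", think_budget, ["</think>"])
    if hit_stop:
        # Normal case: the model closed its reasoning on its own.
        suffix = "</think>\n"
    else:
        # Budget exhausted mid-reasoning: close the block and force a final answer.
        suffix = "\n</think>\nThe final answer is:"
    answer, _ = generate(prompt + "<think>" + reasoning + suffix, answer_budget, ["\n\n"])
    return reasoning + suffix + answer
```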

What did they find?
X-Reasoner showed strong generalization and reasoning improvements:

  • Achieved state-of-the-art results on general-domain benchmarks like MMLU-Pro (+3.6% accuracy) and GSM8K (+1.4%).

  • Outperformed prior multimodal models on tasks such as MMMU and MathVista, with 56.4% accuracy on MMMU vs. previous best of 54.7%.

  • Fine-tuning a medical variant (X-Reasoner-Med) further improved performance on medical benchmarks (MedXpertQA-MM, MMMU-Health).

  • Consistent gains were observed even after excluding text-solvable examples, indicating robust multimodal reasoning.

Why does this matter?
This work challenges the assumption that multimodal reasoning requires multimodal training by showing that text-based post-training can impart universal reasoning patterns transferable across modalities and domains. This approach reduces reliance on expensive multimodal datasets and enables efficient development of generalizable reasoning models. The medical domain results highlight potential for impactful applications in healthcare decision support.

Key Points

  • Introduces a two-stage training combining supervised fine-tuning and reinforcement learning for reasoning.

  • Demonstrates reasoning generalization across text and vision modalities, and across domains.

  • Achieves state-of-the-art performance on multiple benchmarks, including specialized medical tasks.

  • Implements mechanisms to ensure concise and coherent reasoning outputs.

Advancing and Benchmarking Personalized Tool Invocation for LLMs

What’s the research question?
How can personalized tool invocation enhance the capabilities of Large Language Models (LLMs) in real-world applications?

What did the authors do?
The authors introduced PTool, a data synthesis framework to enable personalized tool invocation in LLMs:

  • Tool Generation: Created a hierarchical tree of diverse APIs using depth-first expansion to simulate various tool scenarios.

  • User Profile Construction: Generated realistic user profiles by clustering relevant features and assigning values top-down, capturing both explicit and implicit preferences.

  • Query and Solution Generation: Employed a multi-agent approach where a user agent simulates queries based on profiles and an assistant agent generates tool invocation solutions.

  • Validated solutions using rule-based and model-based verification to ensure correctness; a rule-based check is sketched after this list.

  • Produced PTBench, a dataset of 8,197 samples split into training and manually verified testing sets.
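
Rule-based verification of a generated tool call might look like the sketch below: check that the invoked tool exists, required parameters are present, and values have the declared types. The schema layout and example API are hypothetical stand-ins, not PTool's actual specification.

```python
# Illustrative rule-based check of a tool invocation against a toy API schema.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_invocation(call: dict, api_schemas: dict) -> list[str]:
    errors = []
    schema = api_schemas.get(call.get("tool"))
    if schema is None:
        return [f"unknown tool: {call.get('tool')!r}"]
    params = call.get("parameters", {})
    for name, spec in schema["parameters"].items():
        if spec.get("required") and name not in params:
            errors.append(f"missing required parameter: {name}")
        elif name in params and not isinstance(params[name], TYPE_MAP[spec["type"]]):
            errors.append(f"parameter {name} should be of type {spec['type']}")
    for name in params:
        if name not in schema["parameters"]:
            errors.append(f"unexpected parameter: {name}")
    return errors

# Toy example with a hypothetical restaurant-booking API.
schemas = {"book_restaurant": {"parameters": {
    "cuisine": {"type": "string", "required": True},
    "party_size": {"type": "integer", "required": True},
}}}
call = {"tool": "book_restaurant", "parameters": {"cuisine": "sichuan"}}
print(validate_invocation(call, schemas))  # ['missing required parameter: party_size']
```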

What did they find?
Experimental results showed:

  • Fine-tuned GPT-4-turbo on PTBench achieved an overall accuracy of 95.75%, excelling in tool preference and parameter extraction.

  • Marked improvements in handling profile-dependent queries, though challenges remain in extracting profile-dependent parameters.

  • Demonstrated generalization to untrained users, indicating robustness of the approach.

  • Highlighted areas for further refinement, particularly in complex profile-parameter interactions.

Why does this matter?
This work pioneers personalized tool invocation for LLMs, addressing a critical gap in adapting AI assistance to individual user preferences. The PTBench benchmark enables systematic evaluation and fosters research toward more user-centric AI applications in domains like e-commerce, healthcare, and personal assistants, ultimately improving relevance and user satisfaction.

Key Points

  • Proposes a novel framework for synthesizing personalized tool invocation data.

  • Generates realistic user profiles and corresponding queries to simulate diverse scenarios.

  • Achieves high accuracy in tool selection and parameter extraction with GPT-4-turbo.

  • Provides PTBench benchmark to facilitate future research in personalized AI assistance.

Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects

What’s the research question?
How do trust dynamics develop in user interactions with Vision Language Models (VLMs)?

What did the authors do?
The study explored user trust in VLMs through a mixed-methods approach:

  • Conducted a systematic literature review to categorize research on user trust in VLMs, resulting in a taxonomy based on cognitive capabilities, collaboration modes, and agent behaviors.

  • Organized a pilot workshop with experts involving a collaborative game comparing responses from a blindfolded partner, an LLM, and a VLM on visual tasks.

  • Collected qualitative data from participant notes, discussions, and feedback forms.

  • Evaluated a mock-up web application designed to assess user trust in VLMs, gathering user preferences on features and usability.

What did they find?
Findings revealed complex trust dynamics:

  • Blindfolded participants and LLMs achieved similar accuracy (~75%), while VLMs lagged significantly (~29.63%).

  • Prompt wording influenced understanding and trust; LLMs outperformed VLMs on perception tasks, but VLMs did better on temporal reasoning.

  • Participants expressed distrust in VLM capabilities, suggesting removal from the game.

  • Users prioritized basic interaction features over trust metric trackers in the mock-up app.

  • Both textual and graph-based VLM responses were valued for usability and engagement.

Why does this matter?
This research advances understanding of how users develop trust in VLMs, highlighting the importance of cognitive and collaborative factors. The taxonomy and user-centered insights provide a foundation for designing more trustworthy and engaging AI systems, crucial for adoption in critical applications.

Key Points

  • Develops a taxonomy of factors influencing user trust in VLMs.

  • Uses mixed methods including literature review and expert workshop.

  • Reveals performance gaps and trust challenges in current VLMs.

  • Informs design of user-centered tools for trust assessment and engagement.

Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration

Image from arXiv paper — copyright belongs to authors or publishers.

What’s the research question?
Can agents learn to infer informative state representations from their own observations to enhance cooperation and exploration in multi-agent reinforcement learning?

What did the authors do?
The authors proposed SMPE2, a novel multi-agent reinforcement learning (MARL) method combining state modelling and adversarial exploration:

  • Each agent uses a variational encoder-decoder architecture to infer latent belief representations of the joint state from its own observations (a generic sketch follows this list).

  • Agent Modelling (AM) filters remove non-informative features, improving the quality of inferred states.

  • Adversarial exploration incentivizes agents to explore novel states beneficial to themselves and others, promoting cooperative learning.

  • Implemented on top of Multi-Agent Actor-Critic (MAA2C) with two critic networks: one for standard value estimation and one for joint value approximation based on filtered states.

  • Training optimizes actor and critic networks with policy gradients, updating encoder-decoder parameters and AM filters periodically.
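
A generic version of such a per-agent belief module is sketched below: a variational encoder maps the agent's local observation to a Gaussian latent, and a decoder is trained to reconstruct the joint state (available centrally during training). Layer sizes, the reconstruction target, and the loss weighting are assumptions, not the authors' architecture.

```python
# Generic per-agent variational belief module (not SMPE2's exact architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefStateModel(nn.Module):
    def __init__(self, obs_dim: int, joint_state_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, joint_state_dim)
        )

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def elbo_loss(recon, joint_state, mu, logvar, beta: float = 1e-3):
    recon_loss = F.mse_loss(recon, joint_state)          # reconstruct the joint state
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl

# Toy usage: 8-dim local observation, 24-dim joint state (e.g., 3 agents).
model = BeliefStateModel(obs_dim=8, joint_state_dim=24)
obs, joint = torch.randn(16, 8), torch.randn(16, 24)
recon, mu, logvar = model(obs)
print(elbo_loss(recon, joint, mu, logvar).item())
```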

What did they find?
Experimental results on Multi-Agent Particle Environment (MPE), Level-Based Foraging (LBF), and Multi-Robot Warehouse (RWARE) benchmarks showed:

  • SMPE2 outperformed state-of-the-art MARL algorithms in episodic rewards and convergence speed.

  • Handled sparse-reward and complex coordination tasks more effectively than baselines.

  • Achieved higher average rewards with lower variance, indicating robustness across environments.

  • Demonstrated scalability to increasing agent numbers and task complexity.

Why does this matter?
This work advances cooperative MARL by addressing partial observability and coordination challenges through learned state representations and adversarial exploration. The approach is applicable to real-world multi-agent systems such as robotic fleets, autonomous vehicles, and resource management, improving efficiency and cooperation in decentralized settings.

Key Points

  • Introduces state modelling with variational encoder-decoder and agent modelling filters.

  • Incorporates adversarial exploration to encourage beneficial cooperative behaviors.

  • Outperforms existing MARL methods on diverse benchmarks with sparse rewards.

  • Demonstrates robustness and scalability in multi-agent coordination tasks.