• TensorTeach's Newsletter

Meta AI Delays, Google Gemini OS Push, NVIDIA GTC, OpenAI Enterprise Agents, Baidu Agent Ecosystem

The week’s biggest AI moves across model competition, AI operating systems, infrastructure scaling, enterprise agents, and global AI competition.

This Week In AI

Over the past week, major AI developments centered on model competition, agentic systems, infrastructure scaling, enterprise adoption, and global AI expansion.

In model development, Meta AI faced a major setback as it delayed the release of its next-generation model, internally known as “Avocado,” after underwhelming performance in reasoning, coding, and writing tasks. The delay, now expected until at least May, signals increasing difficulty in competing at the frontier and raises questions about Meta’s position relative to leading labs. Reports that Meta may consider licensing external models further highlight the growing gap at the top end of AI capability.

On the platform side, Google Gemini made a significant leap toward becoming a full AI operating system within Workspace. Gemini can now generate and orchestrate content across documents, spreadsheets, emails, and files, effectively turning Google Drive into a queryable knowledge base. This marks a major shift from chat-based assistants toward deeply integrated AI systems capable of operating across an organization’s entire data layer.

In infrastructure, NVIDIA continues to anchor the AI ecosystem with its GTC event, often referred to as “AI Woodstock.” The company is expected to introduce new inference-optimized hardware and scaling solutions, reinforcing the industry’s shift from training large models to efficiently serving them at scale. This transition is critical as demand grows for real-time AI systems and autonomous agents operating in production environments.

Enterprise adoption also accelerated through OpenAI, which is increasingly partnering with major consulting firms to deploy AI systems inside large organizations. Rather than replacing consultants, AI is augmenting them, creating a new hybrid service layer where businesses rely on both AI agents and human expertise to automate workflows, analyze data, and drive decision-making.

Finally, global competition intensified as Baidu advanced its AI agent ecosystem, signaling continued momentum from China in building end-to-end AI platforms. The focus on agent frameworks and multimodal capabilities reflects a broader shift toward systems that can act, reason, and interact across environments, not just generate text.

This Week In AI Research

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

What’s the research question?
How does attention dispersion during chain-of-thought prompting affect visual reasoning in multimodal large language models (MLLMs)?

What did the authors do?
The authors investigated how attention mechanisms in MLLMs influence visual reasoning by:

  • Analyzing attention maps of MLLMs during visual question answering (VQA) tasks to understand focus patterns.

  • Introducing the Relevant Region Attention Ratio (RRAR) to quantify how well attention concentrates on question-relevant visual regions.

  • Comparing three prompting strategies (Direct, Reason, and Region-guided) to see how each affects attention focus.

  • Performing layer-wise and head-level analysis of attention behaviors to identify patterns and issues.

  • Proposing the Visual Region-Guided Attention (VRGA) framework, a training-free method that dynamically enhances attention to question-relevant regions by:

    • Localizing relevant visual regions using attention maps from vision-focused heads.

    • Constructing refined attention maps emphasizing question-relevant areas.

    • Selecting high-attention tokens as question-relevant regions.

    • Amplifying attention to these regions during response generation in key heads.
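The two core ideas above — measuring how much attention lands on question-relevant tokens, and amplifying attention to those tokens without retraining — can be sketched in a few lines. This is a minimal NumPy illustration: the function names, the top-k selection rule, and the boost factor are simplifying assumptions for exposition, not the paper's exact formulation of RRAR or VRGA.

```python
import numpy as np

def rrar(attn, relevant_mask):
    """Relevant Region Attention Ratio (simplified): the fraction of
    attention mass that falls on question-relevant visual tokens."""
    return attn[relevant_mask].sum() / attn.sum()

def vrga_reweight(attn, top_k=8, boost=2.0):
    """Training-free reweighting in the spirit of VRGA: treat the top-k
    attended visual tokens as the relevant region, amplify attention to
    them, then renormalize to keep a valid distribution."""
    relevant = np.argsort(attn)[-top_k:]      # high-attention tokens
    reweighted = attn.copy()
    reweighted[relevant] *= boost             # amplify relevant regions
    return reweighted / reweighted.sum()

# Toy attention distribution over 16 visual tokens.
rng = np.random.default_rng(0)
attn = rng.random(16)
attn /= attn.sum()

# Pretend the top-8 attended tokens are the question-relevant region.
mask = np.zeros(16, dtype=bool)
mask[np.argsort(attn)[-8:]] = True

new_attn = vrga_reweight(attn, top_k=8, boost=2.0)
print(f"RRAR before: {rrar(attn, mask):.3f}")
print(f"RRAR after:  {rrar(new_attn, mask):.3f}")
```

Because the boosted tokens coincide with the masked region, the renormalized distribution necessarily concentrates more mass there, so RRAR rises — which is the mechanism the paper credits for reduced attention dispersion.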

What did they find?
The study yielded several key findings:

  • VRGA improved visual reasoning accuracy across three VQA benchmarks (MMStar, HallusionBench, HaloQuest).

  • On HaloQuest, Qwen2.5-VL-3B’s comprehensive score increased from 0.445 to 0.488, mainly by reducing irrelevance (II) from 0.601 to 0.405.

  • For Qwen2.5-VL-7B, the score rose from 0.502 to 0.549, with higher answer correctness (AA) from 0.667 to 0.740.

  • VRGA consistently decreased attention dispersion by guiding focus to question-relevant visual regions, leading to more precise and robust reasoning.

  • Layer-wise and head-level analyses revealed how visual heads can be dynamically identified and reweighted to improve focus.

  • Limitations include reliance on attention maps for localization, which may not always perfectly identify relevant regions, and the need to test VRGA on broader tasks beyond VQA.

Why does this matter?
This work advances the understanding of how attention mechanisms influence multimodal reasoning in large language models. By addressing the common problem of attention dispersion—where models spread their focus too broadly rather than homing in on relevant information—VRGA provides a training-free and interpretable approach to improve visual grounding. This has several important implications:

  • Enhances the robustness and accuracy of multimodal models in visual reasoning tasks, which are critical for applications like robotics, assistive technologies, and intelligent assistants.

  • Offers insights into the internal attention dynamics of MLLMs, aiding future model design and interpretability.

  • Demonstrates that dynamic, head-level attention reweighting can effectively mitigate perceptual impairments without additional training or fine-tuning.

  • Supports the broader goal of integrating vision and language understanding in AI systems, enabling more natural and effective human-AI interactions.

Key Points

  • Attention dispersion during chain-of-thought prompting impairs visual reasoning in multimodal large language models.

  • VRGA dynamically identifies and amplifies attention to question-relevant visual regions in a training-free manner.

  • VRGA improves accuracy and reduces irrelevant reasoning across multiple VQA benchmarks.

  • Layer-wise and head-level analysis provides new insights into attention behavior during multimodal reasoning.

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

What’s the research question?
Can vision-language-action models be improved by explicitly integrating perception as a dynamic, reasoning-based component during decision-making?

What did the authors do?
The authors introduced VLA-Thinker, a novel framework that treats visual perception not just as a passive input but as an active, invocable reasoning step within a model’s decision process. Their approach includes:

  • Explicit perception as reasoning: Perception is modeled as a reasoning operation that can be invoked dynamically during task execution, allowing the model to actively decide when and how to perceive visual information.

  • Two-stage training pipeline:

    • Supervised Fine-Tuning (SFT): fine-tunes the model on a curated embodied Chain-of-Thought (CoT) dataset that emphasizes structured reasoning and tool use.

    • Group Relative Policy Optimization (GRPO): aligns complete reasoning-action trajectories with task success signals to improve decision quality.

  • Inference process: During deployment, the model generates interleaved sequences of reasoning steps, perception invocations, visual evidence, and actions, enabling active perception and dynamic reasoning.
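The interleaved inference process can be sketched as a simple control loop: the policy emits either a reasoning step, a perception invocation (which returns visual evidence that is fed back into context), or an action. Everything below is a hypothetical toy — the token names, tool interface, and scripted policy are illustrative assumptions, not VLA-Thinker's actual API.

```python
# Toy sketch of an interleaved reasoning/perception/action loop in the
# spirit of VLA-Thinker. The policy here is scripted for demonstration;
# a real system would sample these steps from the trained model.

PERCEIVE, ACT, STOP = "<perceive>", "<act>", "<eos>"

def perception_tool(query):
    # Stand-in for an active perception call (e.g., crop/zoom on a region).
    return f"[visual evidence for '{query}']"

def toy_policy(context):
    # Stand-in for the VLA model: returns the next step given the context.
    script = [
        ("reason", "Locate the red mug."),
        (PERCEIVE, "red mug"),
        ("reason", "Mug is on the left shelf."),
        (ACT, "pick(red_mug)"),
        (STOP, ""),
    ]
    return script[context.count("|")]  # step index from context growth

def run_episode(task):
    context, trajectory = task, []
    while True:
        kind, payload = toy_policy(context)
        if kind == STOP:
            return trajectory
        if kind == PERCEIVE:
            payload = perception_tool(payload)  # perceive mid-reasoning
        trajectory.append((kind, payload))
        context += f"|{kind}:{payload}"         # evidence re-enters context

traj = run_episode("task: fetch the red mug")
for step in traj:
    print(step)
```

The key design point this loop illustrates is that perception is a step the model chooses to take, and its output re-enters the context before the next reasoning or action step — rather than all visual input being consumed once up front.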

What did they find?
VLA-Thinker demonstrated significant improvements over previous models:

  • Achieved a 97.5% success rate on the LIBERO benchmark, outperforming the baseline OpenVLA-OFT by 6.5%.

  • On RoboTwin 2.0, it showed strong robustness and reasoning capabilities with:

    • 62.3% success on short-horizon tasks

    • 70.7% success on medium-horizon tasks

    • 64.6% success on long-horizon tasks

  • The explicit, dynamic perception approach improved the model’s ability to handle complex, multi-step tasks requiring active perception and reasoning.

  • Limitations include the need for curated reasoning datasets and potential computational overhead from dynamic perception invocations.

Why does this matter?
This work advances embodied AI by integrating perception directly into the core reasoning process, rather than treating it as a static input. By enabling models to actively decide when and how to perceive visual information, VLA-Thinker enhances robustness and flexibility in complex environments—crucial for applications like robotics, autonomous agents, and interactive systems. Its trajectory-level alignment of reasoning and actions offers a new paradigm for training vision-language-action models, paving the way for more intelligent, adaptable AI agents capable of sophisticated cross-modal reasoning.

Key Points

  • Introduces perception as an explicit, dynamically invocable reasoning step in vision-language-action models.

  • Uses a two-stage training pipeline combining structured reasoning and trajectory alignment.

  • Achieves state-of-the-art results on benchmarks requiring active perception and long-horizon reasoning.

  • Enhances robustness and flexibility in complex, embodied AI tasks.

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

What’s the research question?
How effectively can multimodal large language models (MLLMs) utilize external visual tools to perform complex visual reasoning tasks?

What did the authors do?
The authors developed VTC-Bench, a comprehensive benchmark to evaluate the ability of MLLMs to compose and use multiple visual tools. Key features include:

  • 32 diverse OpenCV-based visual operations grouped into four modules: Geometry, Enhancement, Feature Extraction, and Drawing.

  • 680 problems structured across a nine-category cognitive hierarchy, each with ground-truth tool invocation sequences.

  • Tasks covering Visual Perception Enhancement, Quantitative Visual Estimation, and Compositional Visual Reasoning.

  • Evaluation metrics such as Average Pass Rate (APR), Tool Call Rate (TCR), Mean Absolute Error (MAE), and Tool Usage Efficiency (Eff).

  • Support for both interface-driven and code-driven tool invocation to assess models’ ability to generate programmatic solutions.
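The aggregate metrics above can be sketched from per-problem records. This toy computation is a simplifying assumption for illustration — the record fields and the exact efficiency formula here are invented, and the benchmark paper defines APR, TCR, MAE, and Eff precisely.

```python
# Toy VTC-Bench-style metric aggregation over hypothetical per-problem
# records: whether the model passed, whether it invoked tools, how many
# tool calls it made, and the minimal (ground-truth) toolchain length.

records = [
    {"passed": True,  "used_tools": True,  "calls": 3, "minimal": 2},
    {"passed": False, "used_tools": True,  "calls": 6, "minimal": 3},
    {"passed": True,  "used_tools": False, "calls": 0, "minimal": 1},
    {"passed": False, "used_tools": True,  "calls": 4, "minimal": 4},
]

n = len(records)
apr = sum(r["passed"] for r in records) / n        # Average Pass Rate
tcr = sum(r["used_tools"] for r in records) / n    # Tool Call Rate

# Efficiency (one plausible definition): minimal vs. actual call count,
# averaged over problems where the model made at least one call.
with_calls = [r for r in records if r["calls"] > 0]
eff = sum(r["minimal"] / r["calls"] for r in with_calls) / len(with_calls)

# MAE in toolchain length against the ground-truth sequence length.
mae = sum(abs(r["calls"] - r["minimal"]) for r in records) / n

print(f"APR={apr:.2f} TCR={tcr:.2f} Eff={eff:.2f} MAE={mae:.2f}")
```

Separating pass rate from efficiency is what lets the benchmark surface the redundancy finding below: a model can solve a task yet still score poorly on Eff by chaining far more tools than the ground-truth sequence requires.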

What did they find?
Performance of 19 MLLMs on VTC-Bench was generally low, highlighting significant challenges:

  • The best model, Gemini-3.0-Pro, achieved only 51.2% APR, leaving nearly half of the problems unsolved.

  • Proprietary models outperformed open-source counterparts, especially when augmented with tools.

  • Models struggled with precise tool invocation, often relying on a limited subset of tools.

  • High redundancy in tool calls revealed inefficiencies in multi-step reasoning, with some models producing unnecessarily long toolchains (e.g., GPT-5.2, with an MAE of 9.96 in toolchain length).

  • Tool Usage Efficiency was low in many models, showing difficulty in selecting and chaining tools effectively.

Why does this matter?
VTC-Bench provides a rigorous and detailed evaluation framework that exposes the current limitations of agentic multimodal models in composing and generalizing across diverse visual tools. By highlighting the challenges in multi-tool orchestration, it guides researchers toward developing more robust, efficient, and versatile visual reasoning agents. Improving these capabilities is crucial for advancing AI systems that can interact with and interpret complex visual environments in real-world applications such as robotics, autonomous agents, and intelligent assistants.

Key Points

  • VTC-Bench evaluates how well multimodal large language models can chain multiple visual tools for complex reasoning.

  • Performance is currently limited, with top models achieving just over 50% success rate.

  • Models face challenges in precise tool invocation and efficient tool chaining, often showing redundancy and inefficiency.

  • The benchmark sets a high bar for multi-tool orchestration, pushing the field toward more capable visual agents.