Gemini 3.1 vs Claude Plug-Ins, Meta’s $60B AMD Deal, and the India AI Summit

Google’s reasoning benchmark gains, Anthropic’s enterprise agents, massive chip spending, and global AI policy moves show AI shifting from chatbots to world-scale systems.

This Week In AI

Over the past week, major AI developments centered on frontier-model competition, enterprise agent deployment, massive infrastructure spending, and accelerating global AI policy coordination.

At the model frontier, reports suggested that Gemini 3.1 Pro has taken the benchmark lead on ARC-AGI-2, highlighting real progress toward multi-step reasoning and agentic planning. At the same time, Anthropic announced new enterprise plug-ins for Claude, targeting workflows in finance, HR, and engineering. Together, these signals point to a clear industry shift: the era of standalone chatbots is ending, and customizable AI agents integrated into real workflows are becoming the dominant paradigm.

Infrastructure competition also intensified. Reports that Meta secured a $60 billion AI chip deal with AMD underline how the compute arms race is accelerating. Hyperscalers are locking in supply chains to support next-generation training clusters, reinforcing a key reality of the AI boom: massive infrastructure investment is happening before clear platform winners emerge.

On the policy front, global coordination around AI continued to expand. The U.S. announced a Tech Corps initiative to spread AI expertise internationally, while 86 nations signed a cooperative declaration at the India AI Impact Summit. These developments show that AI governance is moving from abstract safety debates toward concrete geopolitical strategy, workforce development, and technology diplomacy.

Taken together, this week’s news reinforces three trends shaping the future of AI: rapid advances in reasoning models, enterprise adoption of agentic systems, and growing global competition to control compute, talent, and standards. As AI moves from research labs into infrastructure and policy, the technology is becoming less of a product—and more of a foundational layer of the global economy.

This Week In AI Research

TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models

Image from arXiv paper.

What’s the research question?
How can we improve the temporal and procedural reasoning abilities of resource-efficient multimodal large language models (MLLMs)?


What did the authors do?
The authors developed TPRU, a new dataset and training approach aimed at enhancing temporal and procedural understanding in MLLMs:

  • Dataset creation: Collected from diverse embodied scenarios like robotic manipulation and GUI navigation, TPRU includes three core tasks—Temporal Reordering, Next-Frame Prediction, and Previous-Frame Review—to teach models about event sequences and state changes.

  • Negative sampling: Included challenging negative samples to encourage models to actively validate and distinguish correct from incorrect sequences.

  • Model fine-tuning: Used reinforcement learning (RL) with Group Relative Policy Optimization (GRPO) to fine-tune Qwen2.5-VL models, focusing on small to medium-sized, resource-efficient models.

  • Training stages: Implemented three stages—sequence filtering for quality, description generation, and task formulation—to ensure high-quality, coherent data for training.
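The two most algorithmic pieces above, hard-negative construction for temporal reordering and GRPO's group-relative reward normalization, can be sketched roughly as follows. This is a minimal illustration under assumed details, not the authors' implementation; the function names and the 0/1 reward scheme are invented for clarity:

```python
import random
import statistics

def make_reordering_sample(frames, num_negatives=3):
    """Build one temporal-reordering instance: the correct frame order
    plus shuffled hard negatives (assumes enough frames to permute)."""
    candidates = [list(frames)]  # index 0 = correct order
    while len(candidates) < 1 + num_negatives:
        neg = list(frames)
        random.shuffle(neg)
        if neg != list(frames) and neg not in candidates:
            candidates.append(neg)
    return candidates

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean and std of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical usage: score 4 sampled orderings (1.0 = correct, 0.0 = not)
sample = make_reordering_sample(["f1", "f2", "f3", "f4"])
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed within each sampled group rather than against a learned value function, GRPO avoids training a separate critic, which is part of what makes it attractive for resource-efficient fine-tuning.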

What did they find?
The TPRU approach significantly improved temporal and procedural reasoning:

  • Accuracy gains: The TPRU-7B model improved its accuracy on the TPRU-Test benchmark from 50.33% to 75.70% (+25.37 percentage points), and TPRU-3B from 37.96% to 60.95% (+22.99 points).

  • Outperforming larger models: The TPRU-7B outperformed GPT-4o (68.00%) and other models on the TPRU-Test benchmark.

  • Better generalization: Showed strong performance on external benchmarks like MuirBench and LEGO-Puzzles, indicating improved ability to handle diverse, unseen tasks.

  • Ablation studies: Confirmed that task diversity, negative samples, and RL fine-tuning were critical for success.

Why does this matter?
This work demonstrates that small to medium-sized multimodal models can achieve state-of-the-art temporal and procedural reasoning without relying on massive, resource-intensive architectures. By carefully designing targeted datasets and leveraging reinforcement learning, the authors show a practical path to more capable, deployable AI systems that understand complex event sequences and state changes. This has broad implications for robotics, navigation, interactive agents, and other applications requiring nuanced temporal understanding. Additionally, it provides a scalable blueprint for future research aiming to enhance reasoning capabilities in resource-efficient multimodal AI.

Key Points

  • Introduces TPRU, a multimodal dataset focused on temporal and procedural reasoning from embodied scenarios.

  • Uses reinforcement learning with Group Relative Policy Optimization to fine-tune resource-efficient models.

  • Achieves significant accuracy improvements and outperforms larger models on key benchmarks.

  • Highlights the importance of task diversity, negative sampling, and RL in training temporal reasoning models.

A Very Big Video Reasoning Suite

What’s the research question?
How can we systematically evaluate and improve the reasoning capabilities of video models?


What did the authors do?
The authors developed a comprehensive benchmarking suite called VBVR to assess the reasoning skills of video models across multiple cognitive domains. Their approach included:

  • VBVR-Dataset: A large-scale, diverse collection of over 1 million video clips spanning 200 reasoning tasks designed to test five core faculties: perception, transformation, spatiality, abstraction, and knowledge.

  • Task Generation: Tasks are implemented as parameterized generators with standardized interfaces, enabling scalable and diverse instance creation while maintaining quality through expert review and automated validation.

  • Evaluation Framework: The VBVR-Bench employs rule-based, human-aligned scorers to evaluate models on in-domain and out-of-domain generalization, task completion, reasoning logic, and visual quality.

  • Benchmarking: Eight state-of-the-art video models were tested to reveal strengths and weaknesses in reasoning abilities and to explore correlations among cognitive faculties.
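The "parameterized generators with standardized interfaces" idea above can be sketched as a small protocol. This is a hypothetical interface written to mirror the description, not VBVR's actual API; the `TaskInstance` fields, the `ObjectCountingTask` example, and its prompt are all invented:

```python
import random
from dataclasses import dataclass
from typing import Protocol

@dataclass
class TaskInstance:
    prompt: str          # question posed about the generated clip
    video_spec: dict     # parameters describing the clip to render
    expected: str        # ground truth for the rule-based scorer

class TaskGenerator(Protocol):
    """Standardized generator interface: every task exposes a cognitive
    faculty tag and a parameterized sampler."""
    faculty: str  # perception | transformation | spatiality | abstraction | knowledge
    def sample(self, rng: random.Random) -> TaskInstance: ...

class ObjectCountingTask:
    """Illustrative perception task: count objects appearing in a clip."""
    faculty = "perception"
    def sample(self, rng: random.Random) -> TaskInstance:
        n = rng.randint(1, 5)  # parameter sampled per instance
        return TaskInstance(
            prompt="How many red cubes appear in the video?",
            video_spec={"object": "red cube", "count": n},
            expected=str(n),
        )
```

A shared interface like this is what makes the suite scalable: new tasks plug into the same sampling and scoring pipeline, and instance diversity comes from the parameter distributions rather than hand-authored examples.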

What did they find?
The study uncovered several key insights:

  • Emergent Generalization: Models showed early signs of learning to generalize across unseen tasks, with performance improving as dataset size increased.

  • Model Performance: The best model, VBVR-Wan2.2, achieved an overall score of 0.8835, outperforming the baseline Wan2.2 by 84.6%.

  • Out-of-Domain Challenges: Performance on out-of-domain tasks improved from 0.329 to 0.610, but a significant 15% gap in generalization ability remained.

  • Qualitative Insights: Models demonstrated emergent multi-step reasoning strategies and controllability, but struggled with process faithfulness (accurately following reasoning steps) and maintaining long-term identity consistency across videos.

Why does this matter?
This work establishes a new standard for the large-scale, systematic evaluation of video reasoning models. By providing a diverse and challenging benchmark, VBVR helps researchers identify the strengths and limitations of current architectures, guiding future improvements. The insights into emergent reasoning capabilities and persistent challenges inform the design of more generalizable, reasoning-enabled video models, which are crucial for applications like autonomous agents, video understanding, and multimodal AI systems that require complex temporal and spatial reasoning.

Key Points

  • Introduces VBVR, a large-scale benchmark with 200 reasoning tasks and over 1 million videos.

  • Evaluates five core cognitive faculties: perception, transformation, spatiality, abstraction, and knowledge.

  • Shows models can learn emergent reasoning skills with increased data scale but still face generalization gaps.

  • Highlights challenges in process faithfulness and long-term identity stability in video reasoning.

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models

Image from arXiv paper.

What’s the research question?
How well do vision-language models (VLMs) perform on fine-grained visual perception tasks, and how can their performance be improved?


What did the authors do?
The authors systematically evaluated 15 VLMs on four fine-grained classification benchmarks: ImageNet, Flowers, Pets, and Food. Their approach included:

  • Testing models using a 5-way multiple-choice format, where each image is paired with four distractor labels selected by cosine similarity to the true label.

  • Varying key components such as the vision encoder (e.g., CLIP ViT-L/14, DFN-CLIP ViT-H/14) and the language model (e.g., Vicuna-7B, Qwen2-7B).

  • Comparing fine-grained classification accuracy with general VQA (Visual Question Answering) performance to assess whether these are distinct facets of visual intelligence.

  • Conducting ablation experiments, including unfreezing the language model during pretraining, to identify factors influencing fine-grained perception.
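The distractor-selection step in the evaluation above can be sketched as follows: given label embeddings, pick the four labels most cosine-similar to the true one, so that distractors are maximally confusable. This is a minimal sketch assuming precomputed embeddings; the function names and toy vectors are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pick_distractors(true_label, label_embeddings, k=4):
    """Return the k labels most similar to the true label, forming the
    hard negatives of a (k+1)-way multiple-choice question."""
    true_vec = label_embeddings[true_label]
    scored = [
        (cosine(true_vec, vec), name)
        for name, vec in label_embeddings.items()
        if name != true_label  # never include the answer as a distractor
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]
```

Choosing distractors by similarity rather than at random is what makes the test fine-grained: a model must separate visually and semantically close categories (e.g. two dog breeds) instead of trivially dissimilar ones.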

What did they find?
Key findings include:

  • Fine-grained classification performance varies widely among models with similar general VQA scores, indicating that current benchmarks do not fully capture fine-grained visual knowledge.

  • For example, LLaVA-1.5-7B achieves 52.8% accuracy on fine-grained tasks but only 41.8% on general VQA, while Qwen2-VL-7B scores 87.9% and 62.4%, respectively.

  • Upgrading the vision encoder from CLIP to DFN-CLIP improves fine-grained accuracy by 4.5 percentage points when combined with proper pretraining.

  • Unfreezing the language model during pretraining yields a 5.5-point increase in fine-grained accuracy, highlighting the importance of training strategies.

  • These results demonstrate that both the choice of vision encoder and training methodology are critical for enhancing fine-grained visual understanding.

Why does this matter?
This work emphasizes that fine-grained classification is a distinct and important aspect of visual intelligence in vision-language models. Improving fine-grained perception can lead to more accurate and detailed understanding of visual content, which is crucial for real-world applications like medical imaging, species identification, and detailed scene analysis. The findings suggest that future evaluations of VLMs should include fine-grained benchmarks to better capture their true visual understanding capabilities. By systematically optimizing vision encoders and training strategies, researchers can develop models that are more robust, precise, and better suited for complex visual reasoning tasks.

Key Points

  • Fine-grained classification performance varies significantly among VLMs, independent of general VQA scores.

  • Upgrading vision encoders and unfreezing language models during training improve fine-grained visual perception.

  • Current benchmarks may underestimate models’ detailed visual knowledge; new fine-grained benchmarks are needed.

  • Optimizing both vision and language components is essential for advancing fine-grained multimodal understanding.