
Anthropic’s $50 Billion Data Center Bet, Agentic AI Reaches Wall Street, and Rising Fears of an AI Spending Bubble

How trillion-dollar bets on infrastructure, enterprise-grade AI agents, and investor mania are reshaping the trajectory of artificial intelligence.

This Week In AI

This week revealed one of the clearest signs yet that the global AI race is shifting from model development to infrastructure dominance. Anthropic announced a historic $50 billion U.S. data-center expansion. Meanwhile, Foxconn’s new Kaohsiung AI data center is already 70–80% booked months ahead of schedule, a sign of unprecedented global demand for inference and training capacity.

But the money isn’t only flowing into hardware. This week, agentic AI entered mainstream finance as Franklin Templeton partnered with Wand AI to deploy autonomous AI into investment workflows—marking one of the most serious enterprise deployments of agentic systems so far.

Yet behind all this acceleration, the financial world is starting to sweat. A Wall Street Journal report warned that AI spending may be entering bubble territory, fueled by investor fear of missing out as companies secure compute at any cost.

Research

An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR

Image from arXiv paper.

What’s the research question?
How do different reasoning architectures affect the performance of large language models (LLMs) in abstract visual reasoning tasks?


What did the authors do?
The authors systematically evaluated four state-of-the-art LLMs—GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, and Llama-3.3-70B—across four reasoning architectures (three of which are sketched in code after this list):

  • Single-shot: One-step reasoning without iterative refinement.

  • Embedding-controlled repetition: Repeated reasoning guided by learned embeddings.

  • Self-reflection: Models generate reasoning steps and revise their outputs iteratively.

  • Multi-agent: Multiple models collaborate, each handling different reasoning aspects.
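
To make these setups concrete, here is a minimal Python sketch of three of the strategies built around a hypothetical call_llm helper. The helper, the prompts, and the round counts are illustrative assumptions rather than the paper's implementation; embedding-controlled repetition is omitted because its mechanics are not spelled out here.

```python
# Minimal sketch of three of the reasoning architectures; call_llm is a
# hypothetical stand-in for an API call and is NOT taken from the paper.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    raise NotImplementedError

def single_shot(model: str, task: str) -> str:
    # One-step reasoning: ask once, keep the first answer.
    return call_llm(model, f"Solve this RAVEN-style puzzle:\n{task}")

def self_reflection(model: str, task: str, rounds: int = 3) -> str:
    # The model critiques and revises its own answer for a few rounds.
    answer = single_shot(model, task)
    for _ in range(rounds):
        critique = call_llm(model, f"Task:\n{task}\nAnswer:\n{answer}\nCritique this answer.")
        answer = call_llm(model, f"Task:\n{task}\nCritique:\n{critique}\nGive a revised answer.")
    return answer

def multi_agent(models: list[str], task: str) -> str:
    # Each model analyses a different aspect; one final call merges the analyses.
    analyses = [call_llm(m, f"Analyse one aspect of this puzzle:\n{task}") for m in models]
    merged = "\n".join(analyses)
    return call_llm(models[0], f"Task:\n{task}\nPartial analyses:\n{merged}\nGive the final answer.")
```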

They used the RAVEN-FAIR dataset, a benchmark for abstract visual reasoning, where models generate visual responses through a three-stage process:

  • Extract JSON parameters describing visual elements (shape, size, color, position).

  • Use LLM reasoning to interpret the task and generate visual descriptions.

  • Render visual panels from the JSON using a Tool Function (a minimal sketch of the full pipeline follows this list).
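
Read literally, that pipeline looks roughly like the sketch below. The JSON schema, the stubbed reasoning step, and the textual renderer are assumptions made for illustration; the paper's actual prompts and Tool Function are not reproduced here.

```python
# Hypothetical, simplified version of the three-stage generation pipeline.
import json

def extract_parameters(panel: str) -> dict:
    # Stage 1: describe each panel's visual elements as JSON (schema is assumed).
    return {"shape": "triangle", "size": 0.5, "color": "black", "position": [0, 0]}

def reason_over_panels(panels: list[dict]) -> dict:
    # Stage 2: an LLM would read this prompt, infer the rule, and emit JSON for
    # the missing panel; the call is stubbed with a fixed guess to stay runnable.
    prompt = "Infer the rule and output JSON for the missing panel:\n" + json.dumps(panels)
    return {"shape": "triangle", "size": 0.75, "color": "black", "position": [2, 2]}

def render_panel(params: dict) -> str:
    # Stage 3: the "Tool Function" deterministically renders JSON into an image;
    # a textual stand-in is returned here.
    return f"<panel: {params['shape']}, size {params['size']}, at {params['position']}>"

context = [extract_parameters(f"panel {i}") for i in range(8)]
print(render_panel(reason_over_panels(context)))
```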

Model outputs were evaluated with SSIM and LPIPS perceptual similarity metrics, Chain-of-Thought (CoT) reasoning scores, and error analyses focusing on semantic hallucination and numeric misperception.
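
For readers unfamiliar with these metrics, here is a rough sketch of how SSIM and LPIPS can be computed with the widely used scikit-image and lpips packages; the preprocessing, resolution, and grayscale handling shown are assumptions, not details taken from the paper.

```python
# Sketch of the two perceptual-similarity metrics; preprocessing is assumed.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

def panel_similarity(pred: np.ndarray, target: np.ndarray) -> tuple[float, float]:
    """pred/target: HxW grayscale arrays scaled to [0, 1]."""
    s = ssim(pred, target, data_range=1.0)  # SSIM: higher is better (max 1.0)
    # LPIPS expects NxCxHxW tensors in [-1, 1]; replicate grayscale to 3 channels.
    to_lpips = lambda a: (torch.from_numpy(a).float() * 2 - 1).repeat(3, 1, 1).unsqueeze(0)
    d = lpips.LPIPS(net="alex")(to_lpips(pred), to_lpips(target)).item()  # lower is better
    return s, d
```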


What did they find?
Key results include:

  • GPT-4.1-Mini achieved the highest overall accuracy (46.91%) and perceptual similarity (SSIM 0.946), outperforming larger models in accuracy.

  • Llama-3.3-70B had the highest CoT reasoning score (8.31) but lower accuracy (32.57%), indicating that strong reasoning paths do not always translate to correct answers.

  • The multi-agent architecture improved Llama-3.3-70B’s accuracy by 8.76% but increased semantic hallucination errors, showing a trade-off between reasoning diversity and correctness.

  • Self-reflection generally decreased accuracy and increased hallucination errors, especially in Claude-3.5-Haiku, which also showed a 24.5% drop in reasoning coverage.

  • The embedding-controlled architecture boosted GPT-4.1-Mini’s accuracy but also led to more semantic hallucinations, highlighting complex interactions between reasoning strategies and output quality.

This study reveals that high Chain-of-Thought scores do not necessarily correlate with accuracy, emphasizing the need for multiple evaluation metrics and error analyses.


Why does this matter?
Understanding how reasoning architecture influences LLM performance on visual tasks is crucial for designing more reliable and effective AI systems. The findings suggest that hybrid approaches combining different reasoning strategies could leverage their strengths and mitigate weaknesses, leading to better cross-modal reasoning capabilities. This work advances the development of AI that can interpret and generate complex visual information, with potential applications in robotics, autonomous agents, and multimodal AI systems that require integrated language and vision understanding. By highlighting the importance of architectural choices and nuanced evaluation, it guides future research toward more robust and explainable AI models for visual reasoning.

Key Points

  • Systematic benchmarking of four LLMs across four reasoning architectures on the RAVEN-FAIR visual reasoning dataset.

  • GPT-4.1-Mini achieved the best accuracy and perceptual similarity, outperforming larger models in correctness.

  • High Chain-of-Thought scores do not always align with accuracy; multiple metrics are needed for evaluation.

  • Hybrid reasoning architectures may offer a path to improved performance and reliability in multimodal AI systems.

Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

Image from arXiv paper.

What’s the research question?
Can a large language model (LLM) learn to iteratively build and refine SPARQL queries using execution feedback to answer complex multi-hop knowledge graph questions?


What did the authors do?
The authors transformed the task of multi-hop knowledge graph question answering (KGQA) into an iterative decision-making process and developed an agentic reinforcement learning (RL) framework with these key features (the refinement loop is sketched in code after this list):

  • Used a large language model (LLM) as the core agent that generates and refines SPARQL queries based on interaction history.

  • Operated within a sequential decision-making loop where the agent analyzes previous actions and environment feedback to decide whether to generate a new query or produce a final answer.

  • Integrated live execution of SPARQL queries against a knowledge graph (KG) to obtain structured feedback, which updates the agent’s state.

  • Optimized the agent’s policy using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm designed for sparse reward signals.

  • Fine-tuned the LLM with trainable QLoRA adapters to efficiently learn effective query construction strategies.
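
Conceptually, the refinement loop looks something like the sketch below. The SPARQLWrapper client, the Wikidata endpoint, and the generate_query stub are illustrative assumptions; the GRPO training and QLoRA fine-tuning that produce the policy are omitted entirely.

```python
# Hypothetical sketch of the query-refinement loop only; training is not shown.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"   # assumed knowledge-graph endpoint

def generate_query(question: str, history: list) -> str:
    """Placeholder for the fine-tuned LLM policy: propose the next SPARQL query."""
    raise NotImplementedError

def execute_sparql(query: str) -> dict:
    # Environment step: run the candidate query and return structured feedback.
    client = SPARQLWrapper(ENDPOINT)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    try:
        bindings = client.query().convert()["results"]["bindings"]
        return {"ok": True, "bindings": bindings}
    except Exception as exc:                      # syntax errors, timeouts, etc.
        return {"ok": False, "error": str(exc)}

def answer_question(question: str, max_steps: int = 5) -> dict:
    history = []                                  # (query, feedback) pairs seen so far
    for _ in range(max_steps):
        query = generate_query(question, history)
        feedback = execute_sparql(query)
        history.append((query, feedback))
        if feedback["ok"] and feedback["bindings"]:
            return {"query": query, "answer": feedback["bindings"]}
    return {"query": None, "answer": None}        # no successful query within budget
```

In the actual system the agent itself decides when to stop and emit a final answer; returning as soon as a query yields non-empty results is a simplification made for this sketch.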

What did they find?

  • Achieved 49.7% accuracy on a curated subset of LC-QuAD 2.0, outperforming the best baseline by 17.5 percentage points.

  • Attained an 81.0% executability rate, indicating a high likelihood of generating valid, executable SPARQL queries.

  • Learned to adapt effort based on question complexity, showing strategic decomposition of multi-hop queries.

  • Ablation studies confirmed that outcome-driven RL was the main driver of success, with the reasoning block enhancing policy precision.

Why does this matter?

  • Demonstrates that small LLMs can effectively learn complex, multi-step reasoning strategies in knowledge graphs through interactive refinement.

  • Bridges the gap between the symbolic nature of SPARQL queries and the flexible, pattern-based strengths of neural models.

  • Offers a promising approach for robust, interpretable, and adaptive KGQA systems that can handle multi-hop questions requiring multiple reasoning steps.

  • Potentially impacts applications in semantic search, data integration, and AI assistants that rely on precise querying of large knowledge bases.

Key Points

  • Reinforcement learning enables an LLM to iteratively construct and refine SPARQL queries using execution feedback.

  • The agent learns an adaptive policy that balances query complexity and accuracy.

  • Significant accuracy improvements over baselines demonstrate the effectiveness of outcome-driven RL in structured reasoning.

  • The approach combines neural flexibility with symbolic rigor, opening new directions for KGQA research.

CrossVid: A Comprehensive Benchmark for Evaluating Cross-Video Reasoning in Multimodal Large Language Models

What’s the research question?
How effectively can multimodal large language models (MLLMs) perform reasoning tasks that involve understanding and integrating information across multiple videos and diverse reasoning dimensions?


What did the authors do?
The authors developed CrossVid, a new benchmark dataset designed to evaluate the cross-video reasoning (CVR) capabilities of MLLMs. Their approach included:

  • Curating a large and diverse dataset with 5,331 videos and 9,015 question-answer (QA) pairs from six sources, covering four high-level reasoning dimensions: comparative analysis, temporal understanding, multi-view reasoning, and free-form QA.

  • Annotating the dataset using a semi-automated pipeline that combined frame captioning, QA generation, manual filtering, and refinement to ensure quality and diversity.

  • Designing 10 challenging CVR tasks such as behavioral understanding, narrative comprehension, culinary comparison, procedural error detection, plot inference, functional step alignment, step sequencing, multi-view spatial reasoning, object counting, and culinary QA.

  • Evaluating MLLMs’ performance using metrics like accuracy, Intersection-over-Union (IoU), and GPT-4.1-based scoring for open-ended questions (two of these scoring rules are sketched below).
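
As a concrete illustration of the simpler scoring rules, here is a small sketch of multiple-choice accuracy and temporal-interval IoU. The interval interpretation of IoU is an assumption, and the GPT-4.1-based judging of open-ended answers is not reproduced.

```python
# Minimal sketch of two scoring rules used in benchmarks like this one.

def mcq_accuracy(preds: list[str], golds: list[str]) -> float:
    # Fraction of multiple-choice questions answered correctly.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def interval_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    # IoU between two temporal segments (start, end) in seconds.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

print(mcq_accuracy(["B", "C"], ["B", "A"]))   # 0.5
print(interval_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```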

What did they find?

  • The best-performing model, Gemini-2.5-Pro, achieved an average accuracy of 50.4% across all tasks, significantly below human performance of 89.2%.

  • MLLMs struggled particularly with multi-view reasoning (40.7%) and temporal understanding (13.4%) tasks, highlighting challenges in integrating evidence across different videos and understanding temporal sequences.

  • Closed-source models outperformed open-source counterparts, and reasoning-enabled models like Gemini-2.5-Pro showed notable gains over simpler models.

  • Ablation studies demonstrated that increasing input frames improved performance, and applying Chain-of-Thought prompting enhanced reasoning capabilities.

  • Error analysis identified key limitations: loss of critical frames, misunderstandings of video content, difficulties in cross-video comparison, and formatting errors in model outputs.

Why does this matter?

Cross-video reasoning stresses several capabilities that current MLLMs have yet to master and that real-world systems will need:

  • Evidence integration across different videos and views

  • Causal and temporal reasoning to understand sequences and relationships

  • Handling diverse and complex reasoning challenges in real-world applications such as video analysis, multimedia content understanding, and autonomous agents that operate across multiple visual inputs.

Key Points

  • Introduces CrossVid, a large, diverse benchmark for cross-video reasoning in multimodal large language models.

  • Evaluates models on 10 challenging tasks spanning temporal, spatial, and comparative reasoning across multiple videos.

  • Finds that current MLLMs lag significantly behind human performance, especially in multi-view and temporal understanding.

  • Highlights the importance of input frame quantity and Chain-of-Thought prompting for improving reasoning accuracy.