Amazon’s $50B AI Infrastructure Plan, State-Level AI Regulation, and Anthropic Claude Opus 4.5 Model Release
What the latest moves in compute, policy, and advanced model design mean for the industry.
This Week In AI
This week’s AI news wasn’t just a collection of big announcements — it painted a clear picture of where the industry is headed. On the infrastructure side, Amazon made a bold move with plans to invest $50 billion into AI and supercomputing capacity for U.S. federal agencies. It’s a signal that the race for large-scale compute is only accelerating, and that government demand for AI capability is becoming a major competitive battleground.
(Source: LiveMint)
At the same time, the regulatory landscape is heating up. A coalition of 35 state attorneys general urged Congress not to limit states’ ability to pass their own AI laws — a reminder that AI governance in the U.S. may become a patchwork of state-level rules rather than a single federal standard. That tension is shaping how companies will need to deploy and monitor their models going forward.
(Source: Reuters)
And on the model front, Anthropic rolled out Claude Opus 4.5, an upgrade focused on better reasoning and multimodal understanding. It’s another step in the steady climb of model capability — not a revolution, but an important signal that frontier labs are still pushing rapid, iterative progress.
(Source: AI Business)
Taken together, this week reinforces a familiar pattern: bigger infrastructure, more political pressure, and increasingly capable AI systems, all advancing at the same time.
New Research This Past Week
Synthesizing Visual Concepts as Vision-Language Programs

Image from arXiv paper.
What’s the research question?
Can a neuro-symbolic framework combining vision-language models with program synthesis improve visual reasoning tasks?
What did the authors do?
The authors developed a novel approach called Vision-Language Programs (VLPs) that integrates neural perception with symbolic reasoning through a three-stage process:
Symbol grounding: Use a pretrained vision-language model to identify and ground relevant objects, properties, and actions in images, creating a set of symbolic representations.
Grammar construction: Build a Probabilistic Context-Free Grammar (PCFG) from a domain-specific language (DSL) and the grounded symbols, defining the space of valid programs.
Program synthesis: Search for the best program within this grammar that distinguishes positive from negative examples, guided by the PCFG prior and evaluated on visual inputs.
They then select the top-ranked program based on accuracy and probability, which can be executed on new images for inference.
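To make the three-stage pipeline concrete, here is a minimal, illustrative Python sketch. It is not the authors' implementation: the VLM grounding step is replaced by a hard-coded lookup, and the PCFG over a DSL is reduced to enumerating short conjunctions of grounded predicates with a length-based prior. All function and predicate names are invented for the example.

```python
# Toy sketch of the Vision-Language Program (VLP) idea: ground symbols,
# define a program space, and search for the program that best separates
# positive from negative images. Heavily simplified relative to the paper.

from itertools import combinations
from typing import Dict, List, Tuple

Symbols = Dict[str, bool]   # grounded predicates for one image, e.g. {"red_cube": True}
Program = Tuple[str, ...]   # a conjunction of predicate names

def ground_symbols(image_id: str) -> Symbols:
    """Stand-in for VLM symbol grounding: which predicates hold in the image."""
    fake_db = {
        "pos_1": {"red_cube": True,  "large_sphere": True,  "metal": True},
        "pos_2": {"red_cube": True,  "large_sphere": True,  "metal": False},
        "neg_1": {"red_cube": True,  "large_sphere": False, "metal": True},
        "neg_2": {"red_cube": False, "large_sphere": True,  "metal": True},
    }
    return fake_db[image_id]

def enumerate_programs(predicates: List[str], max_len: int = 2) -> List[Program]:
    """'Grammar construction', reduced to enumerating short conjunctions."""
    programs: List[Program] = []
    for k in range(1, max_len + 1):
        programs.extend(combinations(predicates, k))
    return programs

def execute(program: Program, symbols: Symbols) -> bool:
    """Run a program on one image's grounded symbols."""
    return all(symbols.get(p, False) for p in program)

def synthesize(pos: List[str], neg: List[str]) -> Program:
    """'Program synthesis': pick the program that best separates positives from
    negatives, with a shorter-is-more-probable prior standing in for the PCFG."""
    predicates = sorted({p for img in pos + neg for p in ground_symbols(img)})
    best, best_score = None, (-1.0, -1.0)
    for prog in enumerate_programs(predicates):
        labels = [(execute(prog, ground_symbols(i)), True) for i in pos] + \
                 [(execute(prog, ground_symbols(i)), False) for i in neg]
        acc = sum(pred == gold for pred, gold in labels) / len(labels)
        prior = 1.0 / len(prog)
        if (acc, prior) > best_score:
            best, best_score = prog, (acc, prior)
    return best

if __name__ == "__main__":
    rule = synthesize(pos=["pos_1", "pos_2"], neg=["neg_1", "neg_2"])
    print("Synthesized concept:", " AND ".join(rule))   # -> large_sphere AND red_cube
```

The synthesized program is interpretable by construction: it is a symbolic rule that can be read, debugged, or edited before being executed on new images, which is the interactivity the paper highlights.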
What did they find?
The VLP approach outperformed traditional direct prompting across multiple models and datasets, with improvements of up to 13.5%. It was especially effective on complex, compositional datasets such as CLEVR-Hans3, demonstrating its ability to handle intricate logical reasoning, and in some cases VLPs even surpassed dedicated reasoning models built specifically for such tasks. VLPs also benefited from additional input images, which improved robustness and generalization, and the approach lets users interact with the program space to debug synthesized programs and incorporate prior knowledge.
However, the method relies on the quality of the vision-language grounding and the design of the grammar, which may require domain expertise. Computational complexity during program search can also be a consideration.
Why does this matter?
This work advances the integration of neural perception and symbolic reasoning, a key challenge in AI. By synthesizing visual concepts into executable programs, VLPs offer a more interpretable and flexible way to perform complex visual reasoning tasks. This neuro-symbolic approach can improve generalization to new, unseen combinations of concepts and provide clearer insights into model decision-making. It has potential applications in robotics, autonomous agents, and any domain requiring robust understanding and reasoning over visual data, bridging the gap between pattern recognition and systematic logic.
Key Points
Introduces Vision-Language Programs (VLPs) combining vision-language models with program synthesis for visual reasoning.
Uses a three-stage process: symbol grounding, grammar construction, and program synthesis guided by a probabilistic grammar.
Outperforms direct prompting and some dedicated reasoning models on multiple datasets, especially on complex compositional tasks.
Enables interaction with the program space for debugging and knowledge integration, enhancing interpretability and robustness.
Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
What’s the research question?
How can large language models (LLMs) be extended to handle new modalities such as vision without requiring costly retraining or architectural changes?
What did the authors do?
The authors introduced BeMyEyes, a modular, multi-agent framework that enables LLMs to perform multimodal reasoning by orchestrating collaboration between specialized perceiver agents and powerful reasoner agents:
Perceiver agents: Small, adaptable vision-language models that process visual inputs and generate descriptive summaries.
Reasoner agents: Large LLMs (like GPT-4) that interpret perceiver outputs and leverage extensive knowledge to solve tasks.
Orchestration mechanism: Defines roles and conversational flow, allowing perceiver and reasoner agents to engage in iterative exchanges and refine their understanding (a minimal version of this loop is sketched right after this list).
Data synthesis pipeline: Uses GPT-4o to generate synthetic multimodal reasoning questions and dialogues, creating a dataset of 12,145 multimodal questions with images and simulated conversations.
Fine-tuning: The perceiver agent is fine-tuned on the synthetic dataset via supervised learning to improve its descriptive and communicative abilities.
Evaluation: The framework is tested on knowledge-intensive multimodal reasoning tasks (MMMU, MMMU Pro, MathVista, MathVision) using different perceiver and reasoner model pairings.
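Here is an illustrative sketch of the perceiver/reasoner loop, not the authors' code. In a real setup, `perceiver_respond` would call a small vision-language model that can see the image and `reasoner_respond` would call a text-only LLM; both are toy stand-ins here so the loop runs end to end.

```python
# Toy perceiver/reasoner collaboration loop in the spirit of BeMyEyes.
# The reasoner never receives the image, only the perceiver's descriptions.

from typing import Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}

def perceiver_respond(image_path: str, conversation: List[Message]) -> str:
    """Toy stand-in for a VLM: answers the latest request about the image."""
    return f"[description of {image_path} addressing: {conversation[-1]['content']!r}]"

def reasoner_respond(conversation: List[Message]) -> str:
    """Toy stand-in for a text-only LLM. By convention it prefixes its final
    reply with 'ANSWER:' once it has enough visual detail."""
    perceiver_turns = sum(m["role"] == "perceiver" for m in conversation)
    if perceiver_turns < 2:
        return "What units and range are shown on the chart's y-axis?"
    return "ANSWER: (toy) the quantity peaks in the final year shown"

def solve(question: str, image_path: str, max_turns: int = 4) -> str:
    """Multi-turn collaboration: the reasoner requests visual details it is
    missing, the perceiver describes what it sees, until an answer is committed."""
    conversation: List[Message] = [{"role": "user", "content": question}]
    # The perceiver opens with an initial description of the image.
    conversation.append({"role": "perceiver",
                         "content": perceiver_respond(image_path, conversation)})
    for _ in range(max_turns):
        reply = reasoner_respond(conversation)
        conversation.append({"role": "reasoner", "content": reply})
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        # Otherwise treat the reply as a follow-up request for more visual detail.
        conversation.append({"role": "perceiver",
                             "content": perceiver_respond(image_path, conversation)})
    return "no answer within the turn budget"

if __name__ == "__main__":
    print(solve("In which year does the plotted value peak?", "figure.png"))
```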
What did they find?
The BeMyEyes framework demonstrated strong and consistent improvements in multimodal reasoning performance:
When paired with Qwen2.5-VL-7B as the perceiver, GPT-4 came close to the performance of the fully multimodal GPT-4o, including a 12.1% gain on MathVision.
Within the framework, the text-only DeepSeek-R1 outperformed GPT-4o on MMMU Pro, MathVista, and MathVision, showing that the collaborative approach holds up across reasoners with different strengths.
Ablation studies confirmed that supervised fine-tuning and multi-turn conversations were crucial for maximizing performance.
The framework generalized well to domain-specific tasks, maintaining strong results on MMMU Med and MMMU Pro Med without additional domain-specific training.
Why does this matter?
BeMyEyes offers a scalable, flexible alternative to training large multimodal models from scratch. By letting LLMs collaborate with specialized perceiver agents, it avoids costly retraining and architectural modifications, and its modular design makes it straightforward to plug in new modalities and models, which also supports open-source development. The results show that pairing perception with reasoning through multi-agent collaboration can extend LLM capabilities to new modalities, pointing toward versatile systems that reason across language, vision, and beyond without the heavy computational cost usually associated with training large multimodal models.
Key Points
Introduces BeMyEyes, a multi-agent framework combining perceiver (vision-language) and reasoner (LLM) agents for multimodal reasoning.
Uses synthetic data generation and fine-tuning to enable perceivers to communicate visual information effectively to reasoners.
Achieves significant performance gains on multimodal reasoning benchmarks, rivaling fully multimodal large models.
Offers a scalable, modular approach that facilitates integration of new modalities and promotes open research.
What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models

Image from arXiv paper.
What’s the research question?
How do different retrieval approaches and interventions impact the effectiveness of cross-lingual information retrieval (CLIR), where queries and documents are in different languages?
What did the authors do?
The authors systematically evaluated four key retrieval strategies using multilingual language models:
Document translation: Translating documents into a single language using NLLB-200, a multilingual translation model.
Multilingual dense retrieval: Embedding queries and documents into a shared vector space with pretrained multilingual encoders, then ranking by cosine similarity.
Contrastive learning: Fine-tuning encoders to better align query and document representations at different granularities (word, phrase, query–document pairs); a toy version of such a loss is sketched after this list.
Cross-encoder re-ranking: Jointly encoding query–document pairs with a transformer to produce relevance scores.
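As a toy illustration of the contrastive-learning strategy above, here is an InfoNCE-style loss that pulls matched cross-lingual query–document pairs together and treats the rest of the batch as negatives. This is a generic PyTorch sketch, not the paper's training setup; the embedding size and temperature are arbitrary placeholder values.

```python
# Generic in-batch contrastive (InfoNCE-style) loss for aligning query and
# document embeddings across languages; embeddings would come from the
# multilingual encoder being fine-tuned.

import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """query_emb[i] and doc_emb[i] are a matched cross-lingual pair;
    all other documents in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature        # (batch, batch) cosine similarities
    targets = torch.arange(q.size(0))     # the diagonal holds the true pairs
    return F.cross_entropy(logits, targets)

# Usage with random placeholder embeddings (e.g. English queries paired with
# documents in other languages):
batch_q = torch.randn(8, 384, requires_grad=True)
batch_d = torch.randn(8, 384, requires_grad=True)
loss = info_nce(batch_q, batch_d)
loss.backward()                           # gradients flow back into the encoder
print(float(loss))
```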
Experiments were conducted on three datasets: CLIRMatrix, mMARCO, and Large-Scale CLIR, using approximate nearest neighbor search for efficiency and evaluating with Recall@100, Recall@10, and nDCG@100 metrics.
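And here is a minimal sketch of the multilingual dense-retrieval setup together with the Recall@k metric used in the evaluation: queries and documents are embedded into a shared space and ranked by cosine similarity, with no translation step. It assumes the sentence-transformers package; the checkpoint name is just one example of a multilingual encoder, and the toy documents and relevance labels are made up.

```python
# Minimal cross-lingual dense retrieval: embed, rank by cosine similarity,
# score with Recall@k. Not the paper's models or data.

import numpy as np
from sentence_transformers import SentenceTransformer

def recall_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k ranking."""
    return len(set(ranked_doc_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

if __name__ == "__main__":
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # English query, documents in German / Spanish / French: ranking happens
    # directly in the shared multilingual embedding space.
    query = "effects of climate change on coral reefs"
    docs = {
        "d1": "Auswirkungen des Klimawandels auf Korallenriffe",    # relevant (German)
        "d2": "Recetas tradicionales de la cocina española",        # irrelevant (Spanish)
        "d3": "Le blanchissement des coraux lié au réchauffement",  # relevant (French)
    }

    # Normalized embeddings make the dot product equal to cosine similarity.
    q_emb = encoder.encode([query], normalize_embeddings=True)
    d_emb = encoder.encode(list(docs.values()), normalize_embeddings=True)
    scores = (q_emb @ d_emb.T).ravel()

    ranking = [doc_id for doc_id, _ in sorted(zip(docs, scores), key=lambda x: -x[1])]
    print("ranking:", ranking)
    print("Recall@2:", recall_at_k(ranking, relevant_ids={"d1", "d3"}, k=2))
```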
What did they find?
Key results include:
Dense retrieval models trained specifically for CLIR outperform lexical matching and translation-based methods. Embedding-based approaches capture semantic cross-lingual similarity more effectively than translation pipelines.
Contrastive learning improves alignment and retrieval effectiveness, especially for encoders that are initially weakly aligned across languages. Fine-tuning at different granularities helps models better match queries and documents in different languages.
Cross-encoder re-ranking is effective but highly dependent on the quality and quantity of training data.
Document translation offers limited benefits for embedding-based retrieval, suggesting that semantic alignment in the embedding space is more crucial than translation.
Performance varies across language pairs, generally decreasing as linguistic distance increases. Languages that are more similar benefit more from these approaches, while distant language pairs remain challenging.
Limitations include the dependency of re-ranking on training data quality and the varying impact of linguistic factors across language pairs.
Why does this matter?
This study advances our understanding of how to build more effective cross-lingual search systems. By demonstrating that semantic alignment through multilingual embeddings and targeted contrastive learning outperforms traditional translation pipelines, it paves the way for more efficient and accurate multilingual information retrieval. This is especially important for low-resource and cross-script languages, where translation quality may be poor or unavailable. Improving CLIR can enhance global information access, support multilingual digital libraries, and enable more inclusive AI-powered search experiences across diverse languages and scripts.
Key Points
Multilingual dense retrieval with semantic embeddings outperforms translation-based methods in cross-lingual search.
Contrastive learning effectively improves cross-lingual alignment at multiple granularities.
Cross-encoder re-ranking is powerful but sensitive to training data quality.
Linguistic distance between languages affects retrieval performance, with closer languages benefiting more.