OpenAI’s $38B Amazon Deal, South Korea’s AI Surge, and the Automation Shock: This Week in AI
Global investments, corporate shifts, and the rising tension between automation and opportunity.
This Week In AI
The past week underscored just how global—and competitive—the AI landscape has become. OpenAI announced a massive $38 billion, seven-year cloud partnership with Amazon, marking a major shift in infrastructure strategy as the ChatGPT maker turns to AWS for access to hundreds of thousands of Nvidia GPUs. The deal strengthens Amazon’s bid to reassert itself in the AI race, following Microsoft’s deep integration with OpenAI via Azure.
Meanwhile, governments are racing to keep pace. South Korea unveiled a ₩10.1 trillion (≈ $6.9 billion) AI investment plan for 2026—tripling its current spending—to boost chip manufacturing, robotics, and national AI infrastructure. The initiative reflects how countries are treating compute access as a matter of economic security, with massive orders for Nvidia hardware now central to national policy.
On the corporate front, AI automation narratives dominated headlines. Several companies, including Amazon, attributed new rounds of layoffs to productivity gains from AI—though analysts caution that “AI” may sometimes serve as a convenient cover for broader restructuring. Regardless, the shift highlights both the transformative power and social tension surrounding automation as organizations redefine efficiency in the AI era.
From billion-dollar infrastructure deals to national-level investments and labor-market shake-ups, this week made one thing clear: AI is no longer a niche technology—it’s a geopolitical and economic force reshaping how nations, companies, and workers adapt to the future.
Research
Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
What’s the research question?
How can vision-language models (VLMs) be improved to learn new tasks continually without forgetting previous ones, while using data efficiently?
What did the authors do?
The authors developed a novel modular framework for continual learning in VLMs, featuring:
Task-specific LoRA modules: Lightweight neural modules added to the base model, each dedicated to a particular task, which are trained while keeping the main model frozen.
Token-level dynamic routing: A learned routing vector determines, at the token level, which LoRA modules influence each input token during inference, enabling fine-grained, dynamic activation of modules (see the sketch after this list).
Training and evaluation: Only the current task's LoRA modules are updated, with no access to previous-task data. The approach was tested on InternVL-2 models (2B and 8B parameters) across diverse vision-language tasks such as image captioning, visual entailment, hate speech detection, and multimodal classification.
Comparative analysis: The routing method was compared against multi-task learning (MTL), sequential fine-tuning, experience replay, and model merging, including analysis of model size and cross-modal transfer effects.
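To make the token-level routing concrete, here is a minimal sketch of per-token gating over task-specific LoRA adapters. The softmax gating, adapter rank, and tensor shapes are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenRoutedLoRA(nn.Module):
    """Minimal sketch: per-token gating over task-specific LoRA adapters.

    Assumptions (not the paper's exact design): softmax gating computed from
    learned routing vectors, rank-r adapters added to a frozen linear layer.
    """
    def __init__(self, base_linear: nn.Linear, num_tasks: int, rank: int = 8):
        super().__init__()
        self.base = base_linear                       # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # One low-rank adapter (A, B) per task.
        self.A = nn.Parameter(torch.randn(num_tasks, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_tasks, rank, d_out))
        # Learned routing vectors, one per task, scored against each token.
        self.routing = nn.Parameter(torch.randn(num_tasks, d_in) * 0.01)

    def forward(self, x):                             # x: (batch, seq, d_in)
        gate = torch.softmax(x @ self.routing.T, dim=-1)        # (batch, seq, tasks)
        lora_out = torch.einsum("bsd,tdr,tro->bsto", x, self.A, self.B)
        mixed = torch.einsum("bst,bsto->bso", gate, lora_out)   # per-token mixture
        return self.base(x) + mixed
```

Under the continual-learning setup described above, only the adapter and routing parameters for the current task would be trainable; everything else stays frozen, which is what keeps the method free of replayed data.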
What did they find?
Key results include:
Competitive performance: Routing achieved results comparable to MTL on specialized tasks and even outperformed MTL on the MMIMDb dataset, demonstrating effective task-specific specialization.
Cross-modal transfer: Knowledge learned in one modality improved performance in others, highlighting the method’s ability to leverage shared representations across tasks.
Robustness to forgetting: Larger models (8B parameters) showed increased resilience to catastrophic forgetting, with minimal performance drops across tasks.
Efficiency advantages: Routing was more computationally efficient than MTL, requiring only current task data and no access to previous task data during training, making it suitable for real-world scenarios with data privacy or storage constraints.
Limitations: The study focused on vision-language tasks and models; applicability to other modalities or larger-scale continual learning settings remains to be explored.
Why does this matter?
This work advances the field of continual learning by demonstrating that token-level dynamic routing between task-specific modules can effectively prevent catastrophic forgetting in vision-language models. Its data efficiency and ability to balance task specialization with generalization make it highly relevant for deploying AI systems that must adapt to new tasks over time without retraining from scratch or storing all previous data. This approach opens new avenues for building scalable, modular AI architectures capable of lifelong learning across diverse multimodal applications, from interactive agents to adaptive content understanding.
Key Points
Introduces a routing-based modular framework for continual learning in vision-language models.
Uses lightweight, task-specific LoRA modules with token-level dynamic gating.
Achieves competitive and transfer learning performance without access to previous task data.
Enhances scalability and efficiency for real-world, multi-task AI systems.
Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series
What’s the research question?
How can knowledge distillation be used to develop small, efficient reasoning models that balance inference speed and reasoning performance for real-world applications?
What did the authors do?
The authors extended the DistilQwen model family by creating four specialized series of models aimed at improving reasoning efficiency:
Slow-thinking models: Optimized for high accuracy by generating detailed, structured Chain-of-Thought (CoT) reasoning paths.
Adaptive-thinking models: Dynamically adjust the length and complexity of reasoning based on input difficulty, using specialized scorers to guide CoT verbosity (a toy sketch of this idea follows this list).
Distilled reward models: Trained via reinforcement learning (RL) to predict reasoning quality, guiding model improvements without requiring large amounts of labeled data.
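As a toy illustration of the adaptive-thinking behavior, a difficulty scorer can be used to pick a chain-of-thought token budget before generation. The scorer interface, thresholds, and budgets below are hypothetical, not DistilQwen's actual values.

```python
def choose_reasoning_budget(prompt: str, difficulty_scorer) -> int:
    """Toy sketch of adaptive thinking: map a difficulty score in [0, 1] to a
    chain-of-thought token budget. Thresholds and budgets are hypothetical."""
    score = difficulty_scorer(prompt)   # e.g. a small classifier or an LLM judge
    if score < 0.3:
        return 256     # easy prompt: short, direct reasoning
    if score < 0.7:
        return 1024    # medium prompt: moderate chain of thought
    return 4096        # hard prompt: long, detailed chain of thought
```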
To support training and evaluation, they developed a Data Source Collector to gather diverse reasoning datasets from platforms like Hugging Face and ModelScope. They used several advanced techniques:
LLM-based CoT processors: Elastic Teacher LLM Inference enabled scalable reasoning path generation.
Difficulty scorers: CoT Difficulty Scorer and Reasoning Verbosity/Cognitive Difficulty scorers to categorize and adapt reasoning complexity.
Curriculum learning: Fine-tuned models starting with medium-difficulty CoTs and progressing to harder examples.
Reinforcement learning: Applied Group Relative Policy Optimization (GRPO) with distilled reward models to improve reasoning quality (a minimal sketch of the group-relative advantage computation appears below).
Models were evaluated on challenging benchmarks including AIME2024, MATH500, GPQA Diamond, and LiveCodeBench V2, focusing on accuracy and reasoning ability.
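GRPO pairs naturally with a distilled reward model because it compares rewards within a group of completions sampled for the same prompt, avoiding a separate value network. Below is a minimal sketch of the group-relative advantage computation; the sampling loop, reward model, and clipped policy update around it are assumed context and not shown.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: standardize each completion's reward against
    the other completions sampled for the same prompt.

    rewards: (num_prompts, group_size) scores, e.g. from a distilled reward model.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

In a full training loop these advantages would weight a clipped policy-gradient objective with a KL penalty toward the reference model, in the usual PPO-style fashion.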
What did they find?
The key findings include:
Slow-thinking models (DistilQwen2.5-R1 series) achieved significant accuracy improvements over baseline models, demonstrating the benefit of detailed reasoning paths.
Adaptive-thinking models (DistilQwen-ThoughtX and ThoughtY series) outperformed slow-thinking models, especially on complex tasks, by tailoring reasoning length to input difficulty—using longer CoTs for hard problems and shorter ones for easy ones.
Increasing the number of reasoning attempts (K) during inference improved accuracy, with 7B models approaching the performance of much larger 32B models but at lower computational costs.
Distilled reward models trained via RL enhanced reasoning capabilities beyond vanilla GRPO, particularly in mathematical reasoning tasks.
Integration into Alibaba Cloud’s PAI platform demonstrated practical utility, showing these models can be deployed effectively in real-world applications requiring high reasoning accuracy and efficiency.
Limitations: the evaluation focused on specific benchmarks, so real-world task diversity may present additional challenges, and the adaptive reasoning strategies require careful tuning of the difficulty scorers to balance speed and accuracy.
Why does this matter?
This work advances the development of small, efficient reasoning models that do not sacrifice accuracy for speed—a critical need for deploying AI in resource-constrained environments and real-time applications. By introducing adaptive reasoning strategies and distilled reward models, the authors provide new tools for scaling reasoning capabilities in language models without requiring prohibitively large compute resources.
The successful integration into Alibaba Cloud’s platform highlights the practical impact, enabling industries to leverage high-quality reasoning AI at lower costs and faster inference times. This approach opens avenues for deploying intelligent agents, automated reasoning systems, and multimodal AI applications that need to balance complexity and efficiency.
Key Points
Four tailored model series (slow-thinking, adaptive-thinking, distilled reward models) improve reasoning efficiency and accuracy.
Adaptive reasoning dynamically adjusts complexity based on input difficulty, optimizing inference time.
Distilled reward models trained via RL enhance reasoning quality beyond traditional methods.
Models demonstrate strong performance on challenging benchmarks and practical deployment scenarios.
Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs
What’s the research question?
What are the underlying information-theoretic principles of large language models (LLMs) from a semantic perspective?
What did the authors do?
The authors developed a novel semantic information theory tailored for LLMs, focusing on tokens as the fundamental units of information rather than bits. Their approach includes:
A probabilistic model of LLMs as autoregressive next-token predictors, where input token sequences are mapped to semantic vector sequences.
Modeling the generation of the next token’s embedding based on previous semantic vectors using transition probabilities learned after training.
Defining new information-theoretic measures such as the directed rate-distortion function (pre-training), the directed rate-reward function (post-training), and semantic information flow during inference (the standard form of directed information underlying these measures is sketched after this list).
Introducing token-level semantic embeddings and vectorization techniques, including a semantic space, semantic vectors, and an optimal vectorization method.
Formalizing autoregressive LLMs (AR-LLMs) as time-varying vector autoregression (TV-VAR) processes, with the Transformer architecture as a special case.
Applying variational inference to derive the Evidence Lower Bound (ELBO) of the Transformer and establishing generalization error bounds using Rademacher complexity and Talagrand's inequalities.
Discussing other architectures like Mamba/Mamba2 and LLaDA within this semantic information framework.
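For readers unfamiliar with directed information, the standard (Massey-style) definition below, together with a rate-distortion-style objective over sequences, conveys the flavor of the paper's directed rate-distortion quantity; it is a generic sketch, not the paper's exact definition.

```latex
% Directed information from a token sequence X^n to a semantic sequence Y^n:
I(X^n \to Y^n) = \sum_{t=1}^{n} I\left(X^t ; Y_t \mid Y^{t-1}\right)

% A rate-distortion-style objective over sequences (distortion d, budget D),
% written here with directed information as a generic sketch:
R(D) = \min_{p(\hat{y}^n \mid x^n)\,:\; \mathbb{E}\left[d(X^n, \hat{Y}^n)\right] \le D}
       \frac{1}{n}\, I\left(X^n \to \hat{Y}^n\right)
```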
What did they find?
The paper presents a comprehensive theoretical framework that reinterprets LLMs through the lens of semantic information theory:
Shows that the Transformer architecture can be viewed as an autoregressive semantic model (AR-LLM) with specific properties.
Derives bounds on the generalization error of Transformers using advanced statistical tools, providing insights into their learning capabilities.
Introduces new measures—directed rate-distortion, rate-reward, and semantic information flow—that quantify how semantic information is captured, compressed, and transmitted during training and inference.
Provides neural estimators to practically measure these semantic information quantities from trained models.
Highlights the importance of token-level semantics over traditional bit-based information measures, emphasizing the role of semantic properties in model performance.
Limitations include the complexity of the theoretical framework and the need for further empirical validation on diverse LLM architectures and tasks.
Why does this matter?
This work offers a transformative perspective on understanding and improving LLMs by focusing on the semantic content of tokens rather than purely statistical or bit-based metrics. By formalizing how semantic information flows and is optimized during training and inference, the framework can:
Guide the design of more efficient and semantically-aware language models that better capture meaning and context.
Enable new evaluation metrics that directly measure the quality of semantic representations, beyond traditional perplexity or accuracy.
Inspire novel training algorithms that explicitly optimize semantic information flow, potentially leading to models with improved generalization and robustness.
Bridge the gap between information theory and NLP, fostering interdisciplinary research that leverages insights from both fields.
Key Points
Tokens are the fundamental units of semantic information in LLMs, not bits.
The paper develops a semantic information theory framework for autoregressive LLMs, including Transformers.
Introduces new measures: directed rate-distortion, rate-reward, and semantic information flow.
Provides theoretical bounds on model generalization and practical estimators for semantic quantities.
Assessing LLM Reasoning Steps via Principal Knowledge Grounding
What’s the research question?
How can we systematically evaluate the knowledge grounding of intermediate reasoning steps in large language models (LLMs)?
What did the authors do?
The authors developed a comprehensive evaluation framework to measure how well LLMs recall and apply prerequisite knowledge during reasoning:
Principal Knowledge Collection (PK Collection): Created a large repository of atomic knowledge essential for reasoning by extracting and clustering knowledge from top-performing LLMs on the MMLU benchmark, resulting in 112,780 PK units.
Knowledge-grounded evaluation metrics: Designed metrics including knowledge recall, precision, and F1 score to assess how accurately and comprehensively models apply PK units in their generated rationales (a simplified sketch of these metrics follows this list).
Lightweight LLM evaluator: Trained a distilled evaluator model from a proprietary GPT-4-based teacher to efficiently and reliably compute the metrics at scale.
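As a simplified illustration of the knowledge-grounded metrics, recall, precision, and F1 can be computed by matching the PK units a rationale actually applies against the PK units required for the question. In the paper this judgment is made by the lightweight LLM evaluator; exact matching of unit identifiers here is a simplifying assumption.

```python
def knowledge_grounding_scores(required_pk: set[str], used_pk: set[str]) -> dict:
    """Simplified sketch of knowledge-grounded metrics over PK units.

    required_pk: PK units deemed necessary to answer the question.
    used_pk:     PK units the evaluator judges the rationale to have applied.
    """
    hits = required_pk & used_pk
    recall = len(hits) / len(required_pk) if required_pk else 0.0
    precision = len(hits) / len(used_pk) if used_pk else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}
```

Note that when such scores are averaged per example, an aggregate F1 need not equal the harmonic mean of the aggregate precision and recall.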
What did they find?
The evaluation revealed significant differences in how well various open-source LLMs recall and apply knowledge:
Llama3-8B-Instruct achieved 81.1% knowledge recall and 92.4% knowledge precision, resulting in a 76.0% F1 score.
Larger models like Qwen2.5-32B outperformed smaller ones with 97.7% recall, 93.9% precision, and an 85.0% F1 score.
Integrating knowledge-grounded metrics into a preference optimization framework improved model controllability, enabling generation of more concise or comprehensive reasoning.
Limitations include reliance on the quality of the PK collection and potential challenges in generalizing to unseen knowledge or tasks.
Why does this matter?
This work advances the evaluation of LLM reasoning by providing a systematic, knowledge-grounded framework that captures both correctness and coverage of prerequisite knowledge application. The PK Collection and associated metrics enable more interpretable diagnostics and targeted improvements, helping researchers develop more reliable and transparent LLMs capable of grounded reasoning. The lightweight evaluator makes large-scale, cost-effective assessment feasible, supporting the broader goal of building AI systems that reason more like humans by grounding their steps in explicit knowledge.
Key Points
Introduces a large-scale PK Collection of atomic knowledge for evaluating reasoning.
Develops knowledge-grounded metrics to measure recall, precision, and F1 in reasoning steps.
Trains a lightweight, distillation-based evaluator for scalable assessment.
Demonstrates improved model controllability and reasoning quality through knowledge-grounded evaluation.
Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance
What’s the research question?
How can we systematically evaluate and improve the reasoning capabilities of Large Language Models (LLMs) in temporal and causal reasoning tasks?
What did the authors do?
The authors developed TempoBench, a diagnostic benchmark designed to analyze LLM reasoning in temporal and causal contexts through the following steps:
Created two core evaluation tasks: Temporal Trace Evaluation (TTE), which tests whether an LLM can determine if a sequence of inputs and outputs (a trace) is accepted by a formal reactive system modeled as a finite-state automaton (FSA); and
Temporal Causality Evaluation (TCE), which assesses an LLM's ability to identify the causal inputs needed to produce a specific output effect at a given time.
Generated a formal, verifiable dataset by synthesizing controllers from temporal logic specifications using reactive synthesis tools, then producing finite traces with the HOAX tool, ensuring known optimal solutions.
Prompted LLMs with JSON-encoded traces, asking them to classify acceptance (TTE) or identify causal inputs (TCE); a minimal sketch of TTE-style trace checking follows this list.
Evaluated performance using precision, recall, and F1 scores at both the Atomic Proposition (AP) and Timestep (TS) levels, and analyzed how task complexity affected results by varying parameters like effect depth, system states, transition count, causal inputs, and input diversity.
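The TTE task amounts to simulating a finite-state automaton over the trace and checking acceptance; a minimal sketch is below. The transition encoding and acceptance convention are illustrative assumptions, not TempoBench's exact trace format.

```python
def trace_accepted(transitions: dict, accepting: set, initial, trace) -> bool:
    """Minimal sketch of Temporal Trace Evaluation (TTE).

    transitions: maps (state, observation) -> next state, where an observation
                 is a frozenset of the atomic propositions true at that step.
    accepting:   set of accepting states.
    trace:       sequence of observations (a simplified stand-in for the
                 JSON-encoded traces used in the benchmark).
    """
    state = initial
    for obs in trace:
        key = (state, frozenset(obs))
        if key not in transitions:
            return False              # no defined transition: reject the trace
        state = transitions[key]
    return state in accepting
```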
What did they find?
Key findings include:
Performance varied significantly across five tested LLMs (GPT-4o-mini, GPT-4o, Claude-3.5-sonnet, Claude-4.5, Qwen3-coder-plus), with F1 scores on TCE-normal tasks ranging from 65.6% to 69.5%, but dropping to as low as 5.4% on TCE-hard tasks.
Scores on TTE-normal tasks ranged from 42.2% to 61.5%, and TTE-hard tasks also proved challenging, with scores as low as 49.6%.
Larger automata with more states and transitions negatively impacted LLM performance, especially on TCE tasks requiring causal inference over longer temporal horizons.
The results revealed that while LLMs can parse formal temporal and causal representations, their ability to perform complex causal reasoning and handle long-range temporal dependencies remains limited, particularly as task difficulty increases.
Why does this matter?
TempoBench introduces a formal, interpretable framework for diagnosing the reasoning strengths and weaknesses of LLMs in temporal and causal domains. By providing a structured way to evaluate how well models understand and manipulate temporal logic and causality, it helps researchers identify specific challenges and guide targeted improvements. The benchmark's design allows for scalable and rigorous assessment, making it valuable for developing LLM-based agentic systems, decision-making tools, and long-horizon planning applications where understanding temporal and causal relationships is crucial. Its emphasis on formal verification and interpretability sets a new standard for evaluating reasoning capabilities in large language models.
Key Points
Introduces TempoBench, a benchmark for temporal and causal reasoning in LLMs based on formal automata models.
Evaluates LLMs on trace acceptance and causal input inference with verifiable, synthesized datasets.
Shows significant performance drops as automaton complexity and task difficulty increase.
Highlights the need for improved causal and temporal reasoning in large language models to enable more reliable agentic AI systems.