- TensorTeach's Newsletter
Yann LeCun AMI, Microsoft Copilot Agents, Pentagon Agent Designer, Anthropic–DoD Clash, ChatGPT 5.4
The week’s biggest AI moves across world-model research, enterprise agents, government AI platforms, regulation battles, and massive LLM adoption.
This Week In AI
Over the past week, major AI developments centered on agents, embodied intelligence, enterprise automation, regulation, and the continued scaling of AI infrastructure.
In research, Yann LeCun launched Advanced Machine Intelligence (AMI), a new startup backed by $1 billion in funding focused on building AI systems that understand the physical world through persistent memory, planning, and “world models.” The effort challenges the current LLM scaling paradigm and signals growing momentum toward embodied and agentic AI systems designed to interact with real environments.
On the enterprise side, Microsoft introduced Copilot Cowork, a new AI agent system inside Microsoft 365 capable of orchestrating tasks across documents, spreadsheets, and communications. The system reflects a broader shift from chat interfaces toward autonomous workflow agents that can operate across enterprise data and productivity tools.
Government adoption of AI agents also accelerated. The U.S. Department of Defense launched Agent Designer, a platform built on GenAI.mil that allows internal teams to create custom AI assistants powered by Google Gemini models, highlighting how governments are beginning to build their own internal AI development ecosystems.
Regulation and governance also moved into the spotlight as Anthropic filed a lawsuit against the U.S. Department of Defense after being labeled a “supply-chain risk.” The dispute stems from Anthropic’s refusal to remove guardrails preventing certain military applications, making it one of the first major conflicts between a frontier AI company and a national government over how advanced AI systems should be deployed.
Meanwhile, the scale of AI adoption continued expanding rapidly. ChatGPT is approaching 900 million weekly active users and roughly 50 million paying subscribers, reinforcing the idea that LLMs are becoming foundational internet infrastructure. OpenAI also released ChatGPT 5.4, improving coding capabilities, document handling, and developer productivity tools.
This Week In AI Research
Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective
What’s the research question?
How can the intrinsic behavioral plasticity of Large Language Models (LLMs) be exposed and stabilized to enable diverse and effective behaviors?
What did the authors do?
The authors introduced a novel framework called ToCoRL that leverages token-conditional generation to reveal and stabilize the flexible behaviors of LLMs at inference time without retraining. Their approach involves:
- Guiding exploration by conditioning token generation on specific prefixes, allowing the model to adapt behaviors dynamically.
- Shaping behavior with a KL divergence constraint that balances imitation of a reference instruct model against behavioral diversity.
- Instantiating the framework with two models: a large reasoning model (LRM) trained on complex reasoning and factual questions, and an instruct model that supplies the guiding prefixes.
- Training the LRM with a mixture of standard reinforcement learning (RL) and token-conditional rollouts, where the latter use prefixes from the instruct model to steer behavior.
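The training signal described above can be sketched as a KL-shaped reward: the rollout's task reward minus a penalty for drifting from the reference instruct model, with behavior steered by prepending an instruct-model prefix. The sketch below is illustrative only; the function names, the penalty weight `beta`, and the toy log-probabilities are assumptions, not details from the paper.

```python
# Hypothetical sketch of a KL-shaped RL reward with token-conditional
# rollouts, in the spirit of ToCoRL. All names and numbers are illustrative.

def kl_shaped_reward(task_reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-rollout reward with a KL(policy || reference) penalty.

    policy_logprobs / ref_logprobs: log-probabilities the policy (LRM) and
    the reference instruct model assign to the tokens actually sampled.
    """
    # Monte-Carlo estimate of KL over the sampled tokens:
    # E_policy[log p_policy(t) - log p_ref(t)]
    kl_est = sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    return task_reward - beta * kl_est

def build_conditioned_prompt(question, instruct_prefix):
    """Token-conditional rollout: steer the LRM by prepending a prefix
    produced by the instruct model before the LRM continues generation."""
    return instruct_prefix + question

# Toy example: the policy drifts slightly from the reference on 3 tokens,
# so a small KL penalty is subtracted from the task reward of 1.0.
policy_lp = [-0.5, -1.0, -0.7]
ref_lp = [-0.6, -1.2, -0.7]
reward = kl_shaped_reward(1.0, policy_lp, ref_lp)
```

A larger `beta` pulls the policy harder toward the instruct model's style; a smaller one preserves more behavioral diversity, which is the trade-off the KL constraint in the paper manages.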
What did they find?
The ToCoRL framework demonstrated significant improvements and interesting behaviors:
- Factual question answering accuracy increased from 18.9% to 28.3% on the SimpleQA benchmark, showing better handling of real-world knowledge.
- The model exhibited a recalibrative reasoning style, iteratively refining answers based on confidence estimates, leading to more accurate and thoughtful responses.
- Ablation studies confirmed the robustness of the approach to hyperparameter choices and the selection of prefix providers.
- The emergent behavioral plasticity was transferable to supervised fine-tuning, resulting in further gains in factual problem solving.
- Math reasoning performance was maintained, indicating that the approach enhances factuality without sacrificing reasoning skills.
Why does this matter?
This work advances our understanding of how to unlock and control the flexible behaviors of LLMs at inference time, without the need for costly retraining. By exposing and stabilizing behavioral plasticity through token-conditioned prompts and reinforcement learning, the framework enables LLMs to adapt dynamically to different tasks and styles. This has broad implications for creating controllable, versatile AI systems that can be tailored to diverse applications such as education, decision support, and human-AI collaboration. The ability to refine and steer model behavior on the fly opens new avenues for making large language models more effective, reliable, and aligned with user needs.
Key Points
Introduces ToCoRL, a reinforcement learning framework that stabilizes behavioral plasticity in LLMs via token-conditional generation.
Achieves significant improvements in factual question answering accuracy and maintains reasoning performance.
Uses a KL divergence constraint to balance imitation of a reference instruct model and behavioral diversity.
Demonstrates transferability of emergent behaviors to supervised fine-tuning, enhancing adaptability.
SaiVLA-0: Cerebrum–Pons–Cerebellum Tripartite Architecture for Compute-Aware Vision-Language-Action
What’s the research question?
How can a neuroscience-inspired tripartite architecture improve compute-aware vision-language-action models for robotic control?
What did the authors do?
- Developed SaiVLA-0, a novel architecture inspired by the brain’s Cerebrum, Pons, and Cerebellum, designed to handle vision, language, and action in robotics.
- The architecture separates high-level semantic planning (Cerebrum) from low-level control (Cerebellum), with the Pons acting as a semantic-to-dynamics compiler.
- Used a large, frozen vision-language model (Cerebrum) to provide multimodal priors without retraining during downstream tasks.
- The Pons adapter projects and fuses Cerebrum outputs into context tokens, aligning high-level intent with low-level control signals.
- The Cerebellum performs fast, parallel decoding of action commands as categorical deltas {-1, 0, +1} per control dimension, stabilized through hysteresis, EMA, temperature scaling, and entropy regularization.
- Employed a fixed-ratio schedule (N=5) for Cerebrum updates and micro-horizon reuse (K=20) for Cerebellum to balance compute load and reactivity.
- Implemented feature caching for offline extraction of Cerebrum features, enabling faster training and reproducibility.
- Integrated a region-of-interest perception module tied to the robot’s end-effector for stable, high-resolution contact cues.
- Designed the system to support modular upgrades, allowing replacement of the Cerebrum or robot hardware without retraining the entire model.
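The Cerebellum's categorical {-1, 0, +1} decoding with EMA smoothing and hysteresis can be sketched for a single control dimension. This is a minimal illustration of the general stabilization idea, not the paper's implementation; the smoothing factor `alpha`, the threshold `band`, and the release rule are assumptions.

```python
# Illustrative per-dimension action decoder: a categorical delta in
# {-1, 0, +1}, smoothed by an exponential moving average (EMA) and held
# inside a hysteresis band so the command does not chatter around zero.

class DeltaDecoder:
    def __init__(self, alpha=0.3, band=0.5):
        self.alpha = alpha   # EMA smoothing factor (assumed value)
        self.band = band     # |ema| must cross this to flip the command
        self.ema = 0.0
        self.last = 0        # last emitted delta, held inside the band

    def step(self, raw_score):
        """raw_score: the decoder's signed preference for this dimension."""
        self.ema = (1 - self.alpha) * self.ema + self.alpha * raw_score
        if self.ema > self.band:
            self.last = +1
        elif self.ema < -self.band:
            self.last = -1
        elif abs(self.ema) < self.band / 2:
            self.last = 0    # release only once well inside the band
        return self.last

# Toy trace: a positive push, a lull, then a strong negative push.
dec = DeltaDecoder()
deltas = [dec.step(s) for s in [2.0, 2.0, 0.0, -3.0, -3.0, 0.0]]
```

Note how the hysteresis holds the previous command during the sign transition instead of flickering through zero; in the paper this kind of smoothing (together with temperature scaling and entropy regularization) is what stabilizes the fast, parallel action decoding.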
What did they find?
- SaiVLA-0 achieved a 99.0% mean success rate on the LIBERO benchmark, surpassing prior models like π0 and OpenVLA.
- Feature caching reduced training time from 7.5 hours to 4.5 hours and increased success rates from 86.5% to 92.5%.
- The fixed-ratio schedule and micro-horizon reuse effectively balanced compute and control latency, enabling low-latency robotic control.
- Demonstrated robust performance across diverse tasks: spatial, object, goal-oriented, and long-horizon manipulations, with success rates between 97.8% and 99.8%.
- The modular design facilitated rapid upgrades and transferability to new robot platforms.
- Limitations include reliance on a fixed Cerebrum model and potential challenges in scaling to more complex control spaces.
Why does this matter?
- Introduces a neuroscience-inspired tripartite architecture that effectively separates semantic planning from control, improving stability, latency, and generalization in vision-language-action robotics.
- The compute-aware scheduling and feature caching strategies enable efficient training and deployment, making the approach practical for real-world robotic systems.
- Achieving near-perfect success on the LIBERO benchmark demonstrates SaiVLA-0’s potential to advance embodied AI, especially in tasks requiring tight integration of vision, language, and action.
- The modular design and use of frozen large models provide a blueprint for building scalable, flexible, and compute-efficient embodied AI systems that can adapt to new hardware and tasks.
Key Points
Neuroscience-inspired tripartite architecture separates high-level semantic planning from low-level control.
Fixed-ratio scheduling and micro-horizon reuse balance compute load and control reactivity.
Feature caching accelerates training and enhances reproducibility.
Achieves state-of-the-art success on the LIBERO robotic vision-language benchmark.
Large Language Model for Discrete Optimization Problems: Evaluation and Step-by-step Reasoning Report
What’s the research question?
How do different large language models perform on discrete optimization problems across various problem types and dataset configurations?
What did the authors do?
The authors conducted a comprehensive evaluation of several large language models (LLMs) on discrete optimization tasks, including:
- Models tested: GPT-4 Mini, LLAMA3-8B, ORLM, and DeepSeek-R1.
- Datasets: derived from Operations Research (OR) Library and Vehicle Routing Problem (VRP) collections, with problems expressed in natural language in three formats: original (structured), expanded (with extra context), and disordered (randomized sentence order).
- Prompting techniques: Chain of Thought (CoT) and Program of Thought (PoT) to guide step-by-step reasoning and code generation.
- Evaluation metrics: Pass Rate (PR), Accuracy Rate (AR), Mean Absolute Percentage Error (MAPE), and Timeout Rate (TR).
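The four metrics can be sketched over a batch of solved instances. The field names, the exact-optimum tolerance, and the convention that AR and MAPE are computed only over runs that passed are assumptions for illustration, not the paper's definitions.

```python
# Illustrative computation of Pass Rate (PR), Accuracy Rate (AR),
# Mean Absolute Percentage Error (MAPE), and Timeout Rate (TR).
# Record fields and the AR tolerance are assumed, not from the paper.

def evaluate(runs, ar_tol=0.0):
    """runs: list of dicts with keys
       'ok'      - generated code ran and produced a feasible solution
       'timeout' - the run exceeded the time limit
       'value'   - objective value found (None if no solution)
       'opt'     - known optimal objective value
    """
    n = len(runs)
    passed = [r for r in runs if r["ok"] and not r["timeout"]]
    pr = len(passed) / n                                # Pass Rate
    ar = sum(1 for r in passed                          # Accuracy Rate:
             if abs(r["value"] - r["opt"])              # within tolerance
             <= ar_tol * abs(r["opt"])) / n             # of the optimum
    mape = sum(abs(r["value"] - r["opt"]) / abs(r["opt"])
               for r in passed) / max(len(passed), 1)   # MAPE over passes
    tr = sum(1 for r in runs if r["timeout"]) / n       # Timeout Rate
    return pr, ar, mape, tr

# Toy batch: one optimal solve, one 10%-off solve, one timeout, one crash.
runs = [
    {"ok": True,  "timeout": False, "value": 100.0, "opt": 100.0},
    {"ok": True,  "timeout": False, "value": 110.0, "opt": 100.0},
    {"ok": False, "timeout": True,  "value": None,  "opt": 100.0},
    {"ok": False, "timeout": False, "value": None,  "opt": 100.0},
]
pr, ar, mape, tr = evaluate(runs)
```

Under these assumed conventions the toy batch yields PR 50%, AR 25%, MAPE 5%, and TR 25%, which shows why a model can have a high PR (code runs) yet a low AR (solutions rarely optimal), exactly the pattern reported below.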
What did they find?
Key results include:
- GPT-4 Mini achieved a high Pass Rate of 92.59% on disordered datasets, significantly outperforming weaker models like LLAMA3-8B and ORLM, which both had PRs around 13.89%.
- AR was highest for GPT-4 Mini at 11.11%, indicating it found near-optimal solutions more consistently, while DeepSeek-R1 had a much lower AR of 0.93%.
- MAPE was lowest for GPT-4 Mini at 0.93%, showing its solutions were close to the optimal objective value.
- Timeout rates varied, with GPT-4 Mini at 2.78% and DeepSeek-R1 at 0.93%, reflecting differences in computational efficiency.
- Disordered datasets improved GPT-4 Mini's PR but worsened AR and MAPE for weaker models, suggesting dataset structure influences model performance.
- CoT prompting improved AR for GPT-4 Mini but reduced PR for weaker models, highlighting that the effectiveness of prompting techniques depends on model strength and dataset format.
Why does this matter?
This study establishes a benchmark for applying large language models to complex discrete optimization problems, which are central to operations research, logistics, and planning. By systematically evaluating how model strength, dataset design, and prompting strategies interact, it provides valuable insights for developing AI tools that can assist in solving real-world optimization challenges. The findings suggest that carefully tailored prompts and dataset configurations can significantly enhance LLM performance, paving the way for more effective AI-driven decision-making in domains like supply chain management, vehicle routing, and resource allocation. Ultimately, this work advances the integration of language models into the toolkit of optimization and operations research, enabling smarter, faster solutions to complex problems.
Key Points
Large language models can be applied to discrete optimization problems, but performance varies widely by model strength and dataset design.
GPT-4 Mini outperforms other models significantly when evaluated on diverse problem formats and prompting techniques.
Disordered datasets challenge models but can also reveal robustness and generalization capabilities.
Prompting strategies like Chain of Thought can improve reasoning but may have mixed effects depending on the model.