OpenAI Partners With Broadcom On New Chips + How Do Machines Learn to Reason?
This Week In AI
AI’s momentum keeps building—and this week, it was impossible to miss. OpenAI’s new partnership with Broadcom marks a major shift toward self-reliance, with plans to co-develop massive custom AI chips by 2026. At the same time, Apple’s M5 chip pushes AI performance into everyday devices, and Salesforce’s $15 billion investment in San Francisco shows how the AI boom is reshaping cities and corporate strategy alike. The message is clear: AI is no longer a side project—it’s the main event driving both innovation and infrastructure.
On the policy side, California’s new “AI must tell you it’s AI” law sets an early standard for transparency, likely influencing global norms. Philadelphia’s AI task force and other state initiatives signal that local governments aren’t waiting for Washington—they’re experimenting, regulating, and deploying AI on their own terms. We’re seeing the early blueprint for how human institutions adapt to algorithmic power: from open labs to open laws.
What ties it all together is scale and integration. Chips, models, laws, and universities are now moving in sync to shape how AI fits into daily life. The focus is shifting from building smarter systems to building trustworthy ones—AI that works reliably, ethically, and efficiently at every level. The takeaway? The next wave of AI won’t just be about what models can do—it’ll be about where, how, and why we let them.
Latest Research
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
What’s the research question?
How can we improve the spatio-temporal reasoning capabilities of multimodal large language models (MLLMs) in video understanding tasks?
What did the authors do?
The authors introduced Video-STR, a novel framework designed to enhance video reasoning by integrating structured spatial and temporal relationships into MLLMs. Key components include:
Graph-based reasoning: Nodes represent objects, and edges encode relationships such as spatial distance and relative direction, explicitly modeling object topology (a minimal construction is sketched in code after this list).
Reinforcement learning with verifiable rewards: Multiple reward types guide learning—format rewards ensure answer structure, multi-choice rewards compare to ground truth, numerical rewards evaluate numeric predictions, and IoU rewards assess object localization.
Group Relative Policy Optimization (GRPO): An extension of Proximal Policy Optimization (PPO) that samples multiple candidate responses per prompt, normalizes their rewards within the group, and uses the normalized rewards as advantages for the policy update (a toy advantage computation is also sketched after this list).
Large-scale dataset: The STV-205k dataset with 205,000 QA pairs covering diverse spatial and temporal reasoning tasks in indoor and outdoor scenes.
Evaluation benchmarks: Performance assessed on multiple datasets including STI-Bench, V-STaR, VSI-Bench, SPAR-Bench, Video-MME, and TempCompass.
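To make the relation-graph component concrete, here is a minimal sketch of how per-frame object detections could be turned into a graph whose edges carry distance and direction. The ObjectNode structure, normalized coordinates, and 8-way direction buckets are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed data structures, not the paper's code): build a
# per-frame relation graph whose nodes are detected objects and whose edges
# encode pairwise spatial distance and a coarse relative direction.
import math
from dataclasses import dataclass

@dataclass
class ObjectNode:                # hypothetical node type
    obj_id: int
    label: str
    cx: float                    # box center x, normalized to [0, 1]
    cy: float                    # box center y, normalized to [0, 1]; y assumed to point up

def relation_graph(objects):
    """Return (src_id, dst_id, distance, direction) edges for all object pairs."""
    buckets = ["E", "NE", "N", "NW", "W", "SW", "S", "SE"]
    edges = []
    for a in objects:
        for b in objects:
            if a.obj_id == b.obj_id:
                continue
            dist = math.hypot(b.cx - a.cx, b.cy - a.cy)
            angle = math.degrees(math.atan2(b.cy - a.cy, b.cx - a.cx))
            direction = buckets[int(((angle + 22.5) % 360) // 45)]   # 8-way discretization
            edges.append((a.obj_id, b.obj_id, round(dist, 3), direction))
    return edges

frame = [ObjectNode(0, "person", 0.2, 0.5), ObjectNode(1, "car", 0.7, 0.4)]
print(relation_graph(frame))     # e.g. the car lies roughly east of the person
```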
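Likewise, the GRPO step with verifiable rewards boils down to scoring a group of sampled responses and normalizing the rewards within that group. The sketch below uses only a format reward and a multiple-choice reward with an assumed 50/50 weighting; the paper's actual reward mix, answer format, and weights may differ.

```python
# Toy sketch of GRPO-style group-relative advantages over verifiable rewards.
# The reward functions and their weights are placeholders for the format,
# multiple-choice, numerical, and IoU rewards described above.
import re
import statistics

def format_reward(response: str) -> float:
    # Reward responses that wrap their final answer in <answer>...</answer> tags (assumed format).
    return 1.0 if re.search(r"<answer>.*?</answer>", response, re.S) else 0.0

def choice_reward(response: str, gold: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

def total_reward(response: str, gold: str) -> float:
    return 0.5 * format_reward(response) + 0.5 * choice_reward(response, gold)  # assumed weighting

def group_relative_advantages(responses, gold):
    """Normalize rewards within a group of sampled responses; the normalized
    values act as advantages for the policy update."""
    rewards = [total_reward(r, gold) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0    # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

samples = ["<answer>B</answer>", "I think the answer is B", "<answer>C</answer>"]
print(group_relative_advantages(samples, gold="B"))   # the well-formatted correct answer gets the highest advantage
```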
What did they find?
Video-STR achieved state-of-the-art results across all benchmarks, outperforming the base model by 13% on STI-Bench. Ablation studies confirmed the importance of each component:
Removing graph reasoning or spatial/temporal subsets led to significant performance drops.
The full framework also improved numerical prediction accuracy and generalized better across tasks.
Qualitative analysis showed the model's ability to explicitly reason about object distribution and motion trends, surpassing models lacking explicit spatial reasoning.
Limitations include:
Training on only 16 frames and running inference on 32 frames may limit the handling of longer videos.
Dependence on explicit graph structures might limit flexibility in highly dynamic or unstructured scenes.
Why does this matter?
This work advances video spatio-temporal reasoning by explicitly modeling object relationships and integrating them into multimodal language models through reinforcement learning. The combination of graph reasoning, verifiable rewards, and a large diverse dataset enables more accurate and generalizable video understanding. Potential applications include:
Autonomous navigation and driving, where understanding object motion and relationships is critical.
Human-robot interaction, requiring real-time comprehension of dynamic scenes.
Immersive virtual environments and augmented reality experiences.
Moreover, the explicit reasoning mechanisms could improve AI interpretability and fairness by making decision processes more transparent. The introduction of a new structured reasoning paradigm and publicly available dataset/codebase will also catalyze further research into explicit spatial-temporal modeling in multimodal AI systems.
Key Points
Integrates graph-based spatial-temporal reasoning with reinforcement learning in multimodal video models.
Uses verifiable rewards to guide learning of object relationships and motion.
Achieves state-of-the-art results on diverse video reasoning benchmarks.
Provides a large, annotated dataset to facilitate future research.
Revisiting Model Interpolation for Efficient Reasoning
What’s the research question?
How does model interpolation influence the performance and efficiency of reasoning in large language models?
What did the authors do?
- Investigated the effects of combining two specialized large language models (LLMs): one focused on reasoning (Thinking) and one on instruction-following (Instruct).
- Used the Qwen3 series of models (4B and 30B parameters) to systematically interpolate between the Thinking and Instruct models by varying an interpolation coefficient (λ); a minimal weight-space sketch follows this list.
- Evaluated the interpolated models across diverse reasoning benchmarks: AIME 2019 (math reasoning), IFEval (instruction following), and GPQA-Diamond (scientific reasoning).
- Analyzed how varying λ affects model behavior and performance, identifying a three-stage evolution in reasoning capabilities.
- Compared the interpolation method (MI) with baseline model merging techniques like Task Arithmetic and TIES-Merging to assess effectiveness and controllability.
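In weight space, the interpolation itself is just a convex combination of matched parameters, θ(λ) = (1 − λ)·θ_Instruct + λ·θ_Thinking. Below is a minimal PyTorch sketch under the assumption that the two checkpoints share an architecture; the file paths and loading code are hypothetical.

```python
# Minimal sketch of model interpolation (MI): blend an Instruct checkpoint and
# a Thinking checkpoint with coefficient lam, where lam = 0 recovers Instruct
# and lam = 1 recovers Thinking.
import torch

def interpolate_state_dicts(instruct_sd: dict, thinking_sd: dict, lam: float) -> dict:
    assert instruct_sd.keys() == thinking_sd.keys(), "checkpoints must share an architecture"
    return {
        name: (1.0 - lam) * instruct_sd[name] + lam * thinking_sd[name]
        for name in instruct_sd
    }

# Hypothetical usage (paths are placeholders):
# instruct_sd = torch.load("qwen3-4b-instruct.pt")
# thinking_sd = torch.load("qwen3-4b-thinking.pt")
# model.load_state_dict(interpolate_state_dicts(instruct_sd, thinking_sd, lam=0.8))
```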
What did they find?
- Model interpolation exhibits a clear three-stage progression as λ varies:
Stage 1 (λ ∈ [0, 0.4)): Dominated by the Instruct model, producing direct responses with minimal explicit reasoning.
Stage 2 (λ ∈ [0.4, 0.6]): Emergence of explicit reasoning, leading to significant improvements in reasoning accuracy (Mean@k scores).
Stage 3 (λ ∈ (0.6, 1]): Convergence to the Thinking model, generating longer, more detailed responses with diminishing returns in accuracy.
- Empirical results showed that MI outperforms baseline merging methods, with MI-0.8 achieving a Mean@64 of 80.5 on AIME 2019, surpassing Task Arithmetic and TIES-Merging.
- Demonstrated robustness of MI to different decoding strategies and its ability to smoothly interpolate between models with distinct reasoning capabilities.
- Limitations include potential diminishing returns in response length and the need to carefully tune λ for optimal performance.
Why does this matter?
- Provides a practical and effective framework for balancing reasoning ability and computational efficiency in large language models through simple interpolation.
- Reveals a three-stage paradigm that enhances understanding of how model behavior evolves as models are combined, informing future model fusion and adaptation strategies.
- Suggests that straightforward interpolation can outperform more complex merging techniques, opening new avenues for efficient reasoning in AI systems.
- Impacts the design of adaptive models that can dynamically adjust their reasoning strategies based on task complexity and resource constraints, benefiting applications in education, scientific research, and AI-assisted problem-solving.
Key Points
Model interpolation between Thinking and Instruct LLMs reveals a three-stage reasoning evolution.
Interpolation coefficient λ controls the balance between direct responses and explicit reasoning.
MI outperforms baseline merging methods across diverse reasoning benchmarks.
Findings enable more efficient and adaptable large language model deployment.
From Answer to Think: Multidimensional Supervision of Reasoning Process for LLM Optimization
What’s the research question?
Can supervising the reasoning process of large language models (LLMs) along multiple dimensions improve their reasoning abilities?
What did the authors do?
The authors introduced a novel approach called the Dimension-level Reward Model (DRM) to enhance LLM reasoning by evaluating and supervising their reasoning process across three key dimensions:
Confidence: Faithfulness of the reasoning to the input data.
Relevance: Semantic alignment between input and output.
Coherence: Logical consistency of the reasoning steps.
DRM assigns dense, interpretable scores without needing ground truth answers, making it suitable for diverse training paradigms:
Off-policy training (DPO): DRM scores guide sample selection for preference-based learning.
On-policy training (GRPO): DRM scores are combined with traditional correctness rewards (a toy reward-mixing sketch follows below).
They evaluated DRM on various open-domain tasks—including mathematics, code generation, question answering, and logical reasoning—using models like Llama-3.1-8B-Instruct, R1-distil-LLaMA8B, and Qwen3-8B.
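As a rough illustration of the on-policy variant, the dimension-level scores can be folded into the scalar reward alongside a correctness check. The scorer below is a hand-written stand-in (the real DRM is a learned model), and the equal dimension weighting and 0.5 mixing coefficient are assumptions.

```python
# Sketch of mixing dimension-level (DRM-style) rewards with a correctness
# reward for on-policy training. drm_score is a placeholder for the learned
# reward model; weights are illustrative, not the paper's settings.
from dataclasses import dataclass

@dataclass
class DimensionScores:
    confidence: float   # faithfulness of the reasoning to the input
    relevance: float    # semantic alignment between input and output
    coherence: float    # logical consistency across reasoning steps

def drm_score(question: str, reasoning: str, answer: str) -> DimensionScores:
    # Placeholder: a real DRM returns dense scores in [0, 1] per dimension.
    return DimensionScores(confidence=0.8, relevance=0.9, coherence=0.7)

def combined_reward(question, reasoning, answer, gold, w_dim=0.5):
    correctness = 1.0 if answer.strip() == gold.strip() else 0.0
    d = drm_score(question, reasoning, answer)
    process = (d.confidence + d.relevance + d.coherence) / 3.0
    return (1.0 - w_dim) * correctness + w_dim * process

print(combined_reward("What is 2 + 2?", "Add 2 and 2 to get 4.", "4", gold="4"))  # 0.9
```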
What did they find?
Models trained with DRM supervision consistently outperformed answer-supervised counterparts across multiple tasks and models:
On the RewardBench 2 dataset, DRM@T+T (relevance + coherence) achieved 66.3% answer correctness versus 63.9% for answer-supervised RLVR.
DRM reduced the proportion of correct answers with flawed reasoning from 71.9% to 52.9% in one case.
Combining DRM supervision with traditional answer supervision led to further performance gains.
Improvements were observed both in in-distribution and out-of-distribution settings, demonstrating robustness.
While promising, the approach relies on accurately defining and measuring the three reasoning dimensions, which may be challenging for some tasks.
Why does this matter?
This work shows that supervising the reasoning process along multiple, interpretable dimensions can significantly enhance the reasoning quality of LLMs. Unlike traditional answer-based supervision, DRM provides a scalable and transparent way to guide models toward more logical and relevant reasoning steps. This has broad implications:
Broader Impact: Improves the trustworthiness and generalizability of AI systems in complex reasoning tasks.
Applications: Enables more reliable AI in education, scientific discovery, decision support, and other domains requiring nuanced understanding.
Future Directions: Opens avenues for extending multidimensional supervision to other reasoning aspects, integrating user feedback, and combining with reinforcement learning.
By advancing how we teach AI to think rather than just answer, DRM helps build smarter, more interpretable, and more dependable language models.
Key Points
Introduces the Dimension-level Reward Model (DRM) to supervise reasoning along confidence, relevance, and coherence.
Supervised DRM outperforms traditional answer-based methods on diverse reasoning tasks.
Provides an interpretable, scalable alternative to improve LLM reasoning quality.
Enhances trust and applicability of LLMs in real-world, complex reasoning scenarios.
MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
What’s the research question?
Can reinforcement learning (RL) be used to help large language models (LLMs) go beyond their initial reasoning abilities and solve new, more challenging problems?
What did the authors do?
The authors developed a new benchmark called MATH-Beyond (MATH-B) to evaluate how well RL-finetuned LLMs can expand their reasoning skills beyond what their base models can do:
Constructed the benchmark from high-school-level math problems sourced from the DAPO-Math-17K and DeepScaleR datasets.
Filtered problems to ensure quality and novelty, verifying difficulty against stronger models like GPT-5-Mini.
Created a union set of 181 problems unsolved by at least one base model, and a core intersection set of 41 problems unsolved by all base models.
Used the pass@1024 metric to measure how many problems models can solve within 1024 attempts.
Introduced the Expansion Rate metric to quantify how many problems the RL-finetuned model can solve that its base model could not (both metrics are sketched in code below).
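For reference, pass@k has a standard unbiased estimator, and the expansion idea reduces to a simple set computation. The estimator below is the widely used combinatorial formula; the expansion-rate definition is a plain reading of the description above and may differ in detail from the paper's exact formulation.

```python
# Sketch: the standard unbiased pass@k estimator plus a simple expansion-rate
# computation as described above. The normalization choice is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples is correct,
    given c correct completions observed out of n samples."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def expansion_rate(base_solved: set, rl_solved: set, all_problems: set) -> float:
    """Fraction of problems unsolved by the base model that the RL-finetuned model solves."""
    unsolved_by_base = all_problems - base_solved
    newly_solved = rl_solved & unsolved_by_base
    return len(newly_solved) / max(len(unsolved_by_base), 1)

print(pass_at_k(n=16, c=2, k=1))   # 0.125: with 2 of 16 samples correct, pass@1 is 12.5%
problems = {"p1", "p2", "p3", "p4"}
print(expansion_rate(base_solved={"p4"}, rl_solved={"p1", "p4"}, all_problems=problems))  # 1/3
```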
What did they find?
The results revealed that:
RL-finetuned models like Nemotron-Research-Reasoning-Qwen-1.5B and DeepScaleR-1.5B performed poorly on MATH-B, with pass@1024 scores near zero, indicating limited ability to expand beyond their base models.
Models with better distributional overlap with the dataset, such as Qwen3-4B and Qwen3-8B, achieved higher scores (58.93% and 66.38%, respectively), suggesting that similarity to training data helps but does not guarantee boundary expansion.
The study highlights the challenge of discovering truly novel reasoning pathways without relying on teacher models, emphasizing the need for exploration methods that can find new solutions rather than just refine existing ones.
Why does this matter?
This work provides a rigorous and challenging benchmark for evaluating how well LLMs can push beyond their initial reasoning limits using reinforcement learning. By focusing on boundary expansion—solving problems that the base model cannot—the benchmark encourages the development of AI systems capable of genuine exploration and learning new reasoning strategies. This has broad implications for advancing AI's ability to tackle complex, unseen problems in education, science, and real-world decision-making, moving beyond rote pattern matching toward true logical and mathematical reasoning.
Key Points
Introduces MATH-Beyond (MATH-B), a benchmark for boundary expansion in LLMs using math problems.
Highlights the difficulty of RL models to discover new reasoning pathways beyond their base capabilities.
Shows that better dataset overlap improves performance but does not guarantee boundary expansion.
Provides a foundation for future research on exploration-driven learning in large language models.
From Reasoning LLMs to BERT: A Two-Stage Distillation Framework for Search Relevance
What’s the research question?
Can the advanced reasoning abilities of large language models (LLMs) be effectively transferred to smaller, more practical models to improve search relevance?
What did the authors do?
- Developed a two-stage distillation framework to transfer reasoning skills from LLMs to lightweight models.
- Stage 1: Created a domain-adapted reasoning LLM through continued pre-training, supervised fine-tuning, and preference optimization with a multi-dimensional reward model.
- Stage 2: Introduced Contrastive Reasoning Self-Distillation (CRSD), in which a BERT-based model learns to produce similar semantic representations for standard and reasoning-augmented inputs via contrastive learning (a minimal loss sketch follows this list).
- Evaluated the approach on relevance prediction tasks with real-world search data.
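A minimal sketch of the contrastive alignment step in CRSD: the same input is encoded twice, once as-is and once with the reasoning trace appended, and the two pooled embeddings are pulled together against in-batch negatives. The InfoNCE formulation, temperature, and embedding dimension here are assumptions rather than the paper's exact recipe.

```python
# Sketch of the CRSD alignment objective: make a BERT encoder's representation
# of the standard input match its representation of the reasoning-augmented
# input, using the rest of the batch as negatives. Hyperparameters are assumed.
import torch
import torch.nn.functional as F

def crsd_loss(std_emb: torch.Tensor, aug_emb: torch.Tensor, temperature: float = 0.05):
    """std_emb, aug_emb: [batch, dim] pooled embeddings of the standard and
    reasoning-augmented views of the same batch of examples."""
    std_emb = F.normalize(std_emb, dim=-1)
    aug_emb = F.normalize(aug_emb, dim=-1)
    logits = std_emb @ aug_emb.T / temperature      # pairwise cosine similarities
    targets = torch.arange(std_emb.size(0))         # i-th standard view matches i-th augmented view
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for BERT [CLS] embeddings.
std = torch.randn(8, 768)
aug = torch.randn(8, 768)
print(crsd_loss(std, aug).item())
```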
What did they find?
- The domain-adapted reasoning LLM achieved a Macro F1 score of 0.7174, demonstrating strong reasoning capabilities.
- The distilled lightweight BERT model retained 98.6% of the teacher’s performance, with a Macro F1 of 0.7076.
- Online A/B testing showed a 0.64% increase in click-through rate (CTR) and a 1.73% increase in conversion rate (CVR) for ads, indicating practical relevance improvements.
- Limitations include the need for careful domain adaptation and the potential complexity of the two-stage process.
Why does this matter?
This work offers a practical pathway to embed advanced reasoning capabilities into lightweight relevance models used in search engines and advertising platforms. By effectively distilling reasoning skills from large, resource-intensive LLMs into deployable models, it bridges the gap between high performance and real-world applicability. This approach can enhance search relevance, improve user engagement, and drive better business outcomes without the computational costs typically associated with large LLMs.
Key Points
Introduces a two-stage distillation framework combining reasoning LLMs and lightweight models for search relevance.
Uses Contrastive Reasoning Self-Distillation to align semantic representations of standard and reasoning-augmented inputs.
Achieves high retention of reasoning performance in a compact BERT model, with strong improvements in online relevance metrics.
Provides a scalable method to bring advanced reasoning into production search systems.