Apple and Google Gemini Team Up, Meta’s Yann LeCun Departs, AI Launches Into Orbit

From corporate leadership shifts to space-based autonomy, this week marks a defining moment for artificial intelligence.

This Week In AI

The past week highlighted just how far—and how fast—AI is expanding beyond the lab. Yann LeCun, Meta’s Chief AI Scientist and Turing Award winner, is set to leave the company to launch his own startup, marking a major shift in the AI research landscape. His departure underscores growing tension between open scientific exploration and product-driven Big Tech strategies, and could reshape where frontier work happens—and who leads it.

Meanwhile, AI literally left the planet. Scientists at Julius-Maximilians-Universität Würzburg announced that an onboard AI system successfully performed the first fully autonomous satellite attitude maneuver in orbit, showing how machine intelligence is moving into space operations and enabling real-time decisions without ground control.

On the consumer front, Apple is preparing to integrate custom AI models into iPhones alongside expanded satellite connectivity, and is reportedly in talks with Google on a $1 billion-per-year deal to power Siri with Gemini. The move signals a new era of AI-enhanced devices that think and connect beyond traditional cloud constraints.

From leadership shake-ups to autonomous satellites and next-gen smartphones, this week underscored a defining theme: AI is breaking out of the data center and embedding itself in infrastructure, orbit, and everyday devices.

Research

MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning

Image from arXiv paper.

What’s the research question?
How can iterative fine-tuning, reflection, and reward-guided feedback improve the mathematical reasoning capabilities of multimodal large language models (MLLMs)?


What did the authors do?
The authors developed a novel training framework called MathSE that enhances multimodal mathematical reasoning through an iterative process built around three stages:

  • Initial supervised fine-tuning (SFT): They trained a base multimodal language model on data distilled from GPT-4o, containing high-quality, step-by-step reasoning examples across text and images.

  • Outcome Reward Model (ORM) evaluation: An ORM was designed to assess the correctness of the model’s reasoning paths, providing detailed error analyses to guide improvements.

  • Reflection and self-improvement: The model revisits incorrect reasoning paths, reflects on errors, and generates refined reasoning steps. Correct reasoning paths are added to the training set, and incorrect ones are improved through reflection.

  • This cycle repeats: the model generates reasoning paths on the remaining data, which are then evaluated and refined, leading to progressively better multimodal mathematical reasoning (a rough sketch follows below).
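
To make the loop concrete, here is a minimal Python sketch of the self-evolving cycle described above. The callables it takes (fine_tune, generate_path, orm_judge, reflect) are hypothetical stand-ins for the paper's components, not the authors' code:

```python
# Minimal sketch of the MathSE self-evolving loop described above.
# All callables passed in (fine_tune, generate_path, orm_judge, reflect) are
# hypothetical stand-ins for the paper's components, not the authors' code.

def mathse_loop(model, seed_data, pool, fine_tune, generate_path, orm_judge,
                reflect, num_rounds=3):
    """Iteratively grow the training set with ORM-verified reasoning paths."""
    # Stage 1: initial supervised fine-tuning on distilled, step-by-step data
    model = fine_tune(model, seed_data)
    train_set = list(seed_data)

    for _ in range(num_rounds):
        solved, remaining = [], []
        for problem in pool:
            path = generate_path(model, problem)
            # Stage 2: the Outcome Reward Model judges the path and explains errors
            correct, error_analysis = orm_judge(problem, path)
            if not correct:
                # Stage 3: reflect on the ORM feedback and produce a revised path
                path = reflect(model, problem, path, error_analysis)
                correct, _ = orm_judge(problem, path)
            (solved if correct else remaining).append((problem, path))
        # Verified paths join the training set; unsolved problems stay in the pool
        train_set.extend(solved)
        pool = [p for p, _ in remaining]
        model = fine_tune(model, train_set)
    return model
```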

What did they find?
The MathSE framework led to significant improvements on challenging multimodal math reasoning benchmarks:

  • MathVL-test: MathSE-InternVL achieved 65% accuracy, outperforming the previous best open-source model QVQ by 12.8%.

  • MathVista: The model improved accuracy by 15.39%, demonstrating enhanced geometric reasoning capabilities.

  • Ablation studies confirmed the importance of the ORM and reflection mechanisms: ORM feedback alone improved accuracy from 62.35% to 64.70%, highlighting the value of reward-guided evaluation.

  • Limitations include reliance on high-quality initial data and the computational cost of iterative refinement, which may challenge scalability.

Why does this matter?
MathSE introduces a powerful new approach to improving multimodal mathematical reasoning by combining iterative self-reflection with reward-guided feedback. This dynamic, self-evolving training paradigm moves beyond static datasets, enabling models to learn from their mistakes and refine their reasoning skills over multiple rounds. The success of MathSE demonstrates that such iterative, reflection-based methods can significantly enhance the capabilities of multimodal large language models, opening new avenues for AI systems that need to understand and reason about complex visual and textual information simultaneously. This advancement has broad implications for AI applications in education, scientific research, and any domain requiring sophisticated multimodal reasoning.

Key Points

  • Introduces MathSE, an iterative framework combining reflection and reward-guided fine-tuning for multimodal math reasoning.

  • Achieves state-of-the-art performance on MathVL-test and MathVista benchmarks, surpassing previous models.

  • Highlights the importance of reward-based evaluation and self-reflection in model improvement.

  • Offers a new paradigm for dynamic, self-improving multimodal AI systems.

Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B

Image from arXiv paper.

What’s the research question?
Can a small, 1.5 billion-parameter language model achieve reasoning capabilities comparable to larger models?


What did the authors do?
The researchers developed VibeThinker-1.5B, a relatively small language model trained with a novel approach designed to elicit strong reasoning. Their methodology involved two key phases:

  • Spectrum Phase (SFT): Fine-tuned the model using a Diversity-Exploring Distillation technique to generate a wide range of solutions across different subdomains such as algebra, geometry, calculus, and statistics. This was achieved through Domain-Aware Diversity Probing, which identified specialized sub-models for each domain, and Expert Model Fusion, which combined these into a unified fine-tuned model.

  • Signal Phase (RL): Employed MaxEnt-Guided Policy Optimization (MGPO), a reinforcement learning framework that dynamically focused on problems where the model showed high uncertainty, encouraging the model to explore and strengthen its reasoning paths.
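
To make the MaxEnt-guided focus concrete, here is a small, hedged sketch of the weighting idea described above (an assumed toy version, not the paper's implementation): a problem's outcome entropy peaks when the model's empirical pass rate is around 0.5, so such problems would carry the strongest training signal.

```python
import math

# Illustrative, assumption-laden sketch of the MaxEnt-guided idea described
# above: weight each problem by the entropy of its observed pass rate, which
# peaks when the model succeeds about half the time. Names are hypothetical.

def outcome_entropy(pass_rate: float) -> float:
    """Entropy (bits) of a correct/incorrect outcome with the given pass rate."""
    if pass_rate in (0.0, 1.0):
        return 0.0
    return -(pass_rate * math.log2(pass_rate)
             + (1 - pass_rate) * math.log2(1 - pass_rate))

def uncertainty_weight(pass_rate: float) -> float:
    """Largest for problems the model currently solves about half the time."""
    return outcome_entropy(pass_rate)  # maximal at pass_rate = 0.5

# Example: rollouts give pass rates of 0.9, 0.5, and 0.1 on three problems.
# The RL phase would up-weight the middle problem, where uncertainty is highest.
for rate in (0.9, 0.5, 0.1):
    print(f"pass rate {rate:.1f} -> weight {uncertainty_weight(rate):.3f}")
```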

They evaluated VibeThinker-1.5B on several challenging benchmarks, including mathematical problem sets (AIME24, AIME25, HMMT25), coding tasks (LiveCodeBench V6), and general knowledge assessments (GPQA).

What did they find?
VibeThinker-1.5B demonstrated remarkable reasoning abilities despite its small size:

  • Achieved an 80.3 score on AIME24, outperforming larger models like DeepSeek R1 (70.0) and Qwen3-1.7B (36.8).

  • Scored 74.4 on AIME25, surpassing DeepSeek R1 (70.0) and matching the performance of MiniMax-M1 (74.6).

  • Reached 50.4 on HMMT25, outperforming DeepSeek R1 (41.7).

  • Excelled in coding tasks with a 55.9 score on LiveCodeBench V6, exceeding Magistral Medium (50.3).

  • Showed strong general knowledge with a 46.7 on GPQA.

This approach highlights the importance of diversity and targeted optimization in training small models to develop complex reasoning skills. Limitations include the need for sophisticated training techniques and potential challenges in scaling or applying to other domains.

Why does this matter?
This work challenges the common belief that only large models can perform advanced reasoning tasks. By demonstrating that a carefully trained small model can match or exceed larger counterparts, it opens new avenues for democratizing AI:

  • Reduces computational and financial barriers, making advanced reasoning AI more accessible.

  • Encourages the development of efficient training paradigms that leverage diversity and targeted optimization.

  • Potentially accelerates AI research by enabling broader participation and experimentation with smaller, more manageable models.

  • Highlights the value of innovative training principles like the Spectrum-to-Signal approach for future AI system design.

Key Points

  • VibeThinker-1.5B uses a two-phase training combining diversity-driven distillation and reinforcement learning.

  • Achieves reasoning performance comparable to or better than larger models on math, coding, and knowledge benchmarks.

  • Demonstrates the power of diversity and targeted optimization in small model training.

  • Offers a new paradigm that could democratize access to advanced AI capabilities.

SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Image from arXiv paper.

What’s the research question?
Can integrating structured scene graph grounding with dense spatial rewards in reinforcement learning improve 3D spatial reasoning in multimodal large language models?


What did the authors do?
The authors introduced SpatialThinker, a novel approach to enhance 3D spatial reasoning in multimodal large language models (LLMs) by combining structured scene graph representations with dense spatial rewards in reinforcement learning (RL). Key components include:

  • Scene graph construction: From a synthetic dataset (STVQA-7K), they generate question-focused scene subgraphs where nodes are objects with category labels and 2D bounding boxes, and edges represent spatial or interactive relations.

  • Structured reasoning: The model encodes the scene as a JSON-encoded subgraph, capturing objects, their spatial relations, and local coordinates, and reasons over this structure to produce answers.

  • Dense spatial rewards: Four components guide training: format (correct output structure), count (correct number of objects/relations), accuracy (correct answer), and spatial (localization accuracy using CIoU). Spatial rewards are only granted if the answer is correct, encouraging precise spatial understanding.

  • Training methodology: Uses Group-Relative Policy Optimization (GRPO), an online RL algorithm that estimates advantages via intra-group comparisons, avoiding critic networks. The policy is updated with a PPO-style clipped loss and KL regularization to balance exploration and stability.
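
A rough sketch of how the reward and advantage pieces described above could fit together. The weights, field names, and the mean_ciou placeholder below are illustrative assumptions rather than the paper's implementation; the structural points to note are the answer-gated spatial term and GRPO's critic-free, within-group advantage estimate.

```python
# Hedged sketch of the multi-component reward and group-relative advantages
# described above. Field names, weights, and the mean_ciou placeholder are
# illustrative assumptions, not the authors' implementation.

def spatial_reward(pred, gold, mean_ciou,
                   w_format=0.1, w_count=0.1, w_acc=0.6, w_spatial=0.2):
    """Combine format, count, accuracy, and localization terms into one scalar."""
    r_format = 1.0 if pred.get("valid_json") else 0.0
    r_count = 1.0 if len(pred.get("objects", [])) == len(gold["objects"]) else 0.0
    r_acc = 1.0 if pred.get("answer") == gold["answer"] else 0.0
    # The localization term (e.g., mean CIoU over matched boxes) is granted
    # only when the answer is correct, tying spatial grounding to task success.
    r_spatial = mean_ciou(pred, gold) if r_acc == 1.0 else 0.0
    return (w_format * r_format + w_count * r_count
            + w_acc * r_acc + w_spatial * r_spatial)

def grpo_advantages(rewards):
    """Group-relative advantages: score each rollout against its group's mean."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Example: four rollouts on one question, rewarded and then ranked in-group.
rollouts = [
    {"valid_json": True, "objects": ["cup", "table"], "answer": "left"},
    {"valid_json": True, "objects": ["cup"], "answer": "right"},
    {"valid_json": False, "objects": [], "answer": "left"},
    {"valid_json": True, "objects": ["cup", "table"], "answer": "right"},
]
gold = {"objects": ["cup", "table"], "answer": "left"}
rewards = [spatial_reward(r, gold, mean_ciou=lambda p, g: 0.8)  # box-overlap stand-in
           for r in rollouts]
print(grpo_advantages(rewards))
```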

What did they find?
SpatialThinker demonstrated strong performance on several benchmarks:

  • Achieved 78.2% accuracy on CV-Bench, close to GPT-4o’s 79.4%.

  • Scored 56.4% on 3DSRBench, a +12.1% improvement over GPT-4o.

  • Reached 86.0% on BLINK Spatial Relations.

These results show that dense spatial rewards combined with structured scene graph reasoning enable robust 3D spatial understanding, even with limited training data. The model outperformed open-source models trained on much larger datasets, highlighting the effectiveness of their approach.

Limitations include reliance on synthetic data and the need to evaluate generalization to real-world scenes. Future work could explore applying this method to real images and extending to dynamic scenes.

Why does this matter?
This work advances multimodal reasoning by demonstrating that integrating structured scene representations with dense, multi-objective rewards can significantly improve 3D spatial reasoning in LLMs. Such capabilities are crucial for applications like robotics, embodied AI, and scene understanding, where understanding complex spatial relationships is essential. By showing that effective 3D reasoning can be achieved with limited data, SpatialThinker opens new avenues for building more intelligent, spatially aware AI systems that can better interpret and interact with the physical world.

Key Points

  • Introduces SpatialThinker, combining scene graph grounding with dense spatial rewards in RL.

  • Uses a structured JSON scene subgraph to reason over objects and spatial relations.

  • Achieves near state-of-the-art accuracy on 3D spatial reasoning benchmarks with limited data.

  • Highlights the importance of structured representations and dense rewards for multimodal reasoning.