TensorTeach's Newsletter
Mastercard Agent Payments, Meta Ads AI, Markets Signal Resilience
The week’s biggest AI moves across autonomous commerce, embedded enterprise agents, and AI’s growing influence on financial infrastructure.
This Week In AI
Over the past week, major AI progress centered on agentic commerce, platform integration, and financial markets.
Santander and Mastercard completed Europe’s first live end-to-end AI-executed payment, marking a significant step toward real-world autonomous commerce. The milestone demonstrates that AI agents can initiate and execute regulated financial transactions under institutional guardrails — moving “agent checkout” from concept to live infrastructure.
On the platform side, Meta integrated its Manus AI agent directly into Ads Manager, embedding autonomous analysis and workflow support inside one of the world’s largest advertising systems. Rather than shipping as standalone chat tools, AI agents are increasingly being built directly into core enterprise interfaces.
Meanwhile, AI’s macro footprint continues to expand. AI-linked crypto assets and equities showed resilience despite geopolitical volatility, reinforcing that AI remains a dominant structural theme across markets — influencing capital allocation, investor narratives, and sector positioning.
Together, these developments signal a shift from AI as an experimental interface toward AI as embedded economic infrastructure — operating inside payments, advertising, and financial markets at scale.
This Week In AI Research
LiTS: A Modular Framework for LLM Tree Search
What’s the research question?
How can a modular framework improve the reusability and extensibility of large language model (LLM) tree search algorithms across different reasoning tasks?
What did the authors do?
The authors developed LiTS (Language Inference via Tree Search), a flexible Python framework designed to decompose LLM reasoning into three core, task-agnostic components:
Policy: Generates candidate actions from current states.
Transition: Executes actions to produce new states.
RewardModel: Evaluates the quality of actions.
These components are registered using a decorator-based registry, allowing users to compose and extend them without altering core code. LiTS supports three reasoning task types:
Environment Grounded: Tasks involving physical or simulated environments (e.g., BlocksWorld).
Language Grounded: Tasks involving language inputs and outputs (e.g., Math QA).
Tool Use: Tasks requiring tool invocation (e.g., MapEval).
For each task type, LiTS defines specific subclasses for Action, Step, and State, which integrate with search algorithms like Monte Carlo Tree Search (MCTS) and Breadth-First Search (BFS). The framework also allows registration of custom components and algorithms to tailor the search process to domain-specific needs.
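The paper's exact API isn't reproduced here, but a decorator-based registry of the kind described might look like the following sketch. All names (`register`, `build`, the component classes) are illustrative stand-ins, not LiTS's actual interface:

```python
# Hypothetical sketch of a decorator-based component registry, in the
# spirit of LiTS's Policy / Transition / RewardModel decomposition.
# None of these names are taken from the actual LiTS codebase.

REGISTRY = {"policy": {}, "transition": {}, "reward_model": {}}

def register(kind, name):
    """Register a component class under a kind ('policy', ...) and a name."""
    def decorator(cls):
        REGISTRY[kind][name] = cls
        return cls
    return decorator

@register("policy", "default")
class DefaultPolicy:
    def propose(self, state, n=4):
        # In a real system this would sample candidate actions from an LLM.
        return [f"action-{i}" for i in range(n)]

@register("transition", "blocksworld")
class BlocksWorldTransition:
    def step(self, state, action):
        # Domain-specific: apply the action in the simulated environment.
        return state + [action]

def build(kind, name, **kwargs):
    """Compose components by name without touching core code."""
    return REGISTRY[kind][name](**kwargs)
```

A task then assembles its search stack by name, e.g. `build("transition", "blocksworld")`, so swapping the BlocksWorld `Transition` for a Crosswords one leaves the `Policy` and `RewardModel` untouched — the reuse pattern the evaluation below describes.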
What did they find?
LiTS was evaluated on three reasoning tasks:
BlocksWorld: Used a domain-specific Transition while reusing Policy and RewardModel, demonstrating component reuse across environment-grounded tasks.
Crosswords: Incorporated a different Transition and dataset loader, but reused Policy and RewardModel, highlighting flexibility in language-grounded tasks.
MapEval: Employed a tool registration process for tool use, showcasing extensibility for tool-based reasoning.
Key insights include:
LiTS enables reuse of core reasoning components across diverse tasks and algorithms, reducing development effort.
The framework revealed a mode-collapse issue in infinite action spaces, where lack of policy diversity hindered effective tree search.
Supporting custom components and algorithms facilitates domain-specific optimizations and comparisons.
Limitations noted include the challenge of maintaining policy diversity in large action spaces, which is critical for search effectiveness.
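Mode collapse here means the policy keeps proposing near-identical candidate actions, so the search tree never branches meaningfully. One simple diagnostic (a sketch for intuition, not something from the paper) is the fraction of distinct actions among the policy's samples:

```python
def distinct_ratio(candidates):
    """Fraction of unique actions among a policy's sampled candidates.
    A value near 1/len(candidates) signals mode collapse: the policy is
    proposing (nearly) the same action every time, so tree search
    degenerates into a single path instead of exploring alternatives."""
    if not candidates:
        return 0.0
    return len(set(candidates)) / len(candidates)

# A collapsed policy: four samples, one distinct action.
collapsed = ["move A to B"] * 4
# A diverse policy: four samples, four distinct actions.
diverse = ["move A to B", "move B to C", "unstack C", "stack A on C"]
```

In an infinite action space this matters doubly: the search can only ever visit actions the policy actually proposes, so low diversity caps the reachable portion of the tree regardless of how good the reward model is.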
Why does this matter?
LiTS offers a flexible and extensible platform for developing and evaluating LLM-based reasoning algorithms. By modularizing core components, it lowers the barrier for researchers to experiment with new search strategies, integrate domain-specific actions, and compare different approaches fairly. This accelerates progress toward more capable LLMs that can perform complex, multi-step reasoning across modalities and domains. The framework’s insights into policy diversity bottlenecks also inform future research on improving LLM reasoning robustness and efficiency, with potential applications in AI assistants, automated problem-solving, and intelligent agents.
Key Points
LiTS is a modular Python framework for LLM tree search, decomposing reasoning into Policy, Transition, and RewardModel components.
Supports environment-grounded, language-grounded, and tool-use reasoning tasks with customizable components.
Enables reuse and extension of components across tasks, facilitating fair comparisons and rapid prototyping.
Revealed challenges with policy diversity in infinite action spaces, guiding future improvements in LLM reasoning.
Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning
What’s the research question?
How can large language models be evaluated on multi-step reasoning tasks with verifiable intermediate steps?
What did the authors do?
The authors developed a new benchmark called Pencil Puzzle Bench to assess the reasoning capabilities of large language models (LLMs) on complex logic puzzles. Their approach includes:
Curating a dataset of 94 varieties of pencil puzzles, totaling 62,231 puzzles with verified unique solutions and detailed step-by-step solutions.
Implementing two evaluation strategies:
Direct ask (single-shot): Asking the model to solve the puzzle in one step.
Agentic (multi-turn with iterative verification): Allowing the model to make a move, check for constraint violations, and course-correct through multiple interactions.
Supporting multiple representations of puzzle states, including ASCII serialization, SVG vector rendering, and pixel images.
Enabling step-level verification by applying variety-specific constraints and a verification pipeline that validates each move against these constraints.
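The agentic strategy above — propose a move, verify it against the variety's constraints, and retry on violation — can be sketched as a simple loop. The proposer and verifier here are stand-ins, not the benchmark's actual interfaces:

```python
def agentic_solve(propose_move, verify_move, apply_move, state,
                  is_solved, max_turns=50):
    """Multi-turn solving loop with step-level verification (illustrative).
    propose_move(state, feedback) -> candidate move (stands in for an LLM call)
    verify_move(state, move)      -> error string, or None if the move is legal
    apply_move(state, move)       -> next puzzle state
    """
    feedback = None
    for _ in range(max_turns):
        move = propose_move(state, feedback)
        error = verify_move(state, move)
        if error is not None:
            # Constraint violated: feed the error back so the model
            # can course-correct on the next turn.
            feedback = error
            continue
        state = apply_move(state, move)
        feedback = None
        if is_solved(state):
            return state
    return None  # turn budget exhausted without a verified solution
```

The direct-ask baseline is the degenerate case of this loop: one proposal, no feedback, no retries — which is why the gap between the two modes is a measure of how much iterative verification helps.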
What did they find?
The study revealed several key insights:
Success rates improved by up to 81× as reasoning effort was scaled from none to maximum.
The agentic gap (performance difference between agentic and direct ask modes) was substantial:
Claude Opus 4.6 improved from 0.3% to 30.0% success with iterative verification.
GPT-5.2@xhigh improved from 20.2% to 56.0% success.
Extended agentic evaluation solved 3 previously unsolved puzzle varieties.
Reasoning effort scaling highlighted a tradeoff between success rate and infrastructure reliability, with 35% of requests failing at xhigh effort.
Why does this matter?
This work introduces a novel benchmark that enables more reliable evaluation of language models’ reasoning by focusing on verifiable intermediate steps. The agentic evaluation paradigm and step-level verification infrastructure provide valuable tools for future research in:
Process supervision: guiding models to produce correct reasoning steps.
Reinforcement learning from verifiable rewards: training models with feedback based on step correctness.
Curriculum learning: progressively increasing puzzle difficulty while ensuring reasoning quality.
By highlighting the importance of both reasoning effort and iterative verification, this benchmark paves the way for developing more robust and explainable AI systems capable of complex logical reasoning.
Key Points
Introduces Pencil Puzzle Bench, a large-scale benchmark for multi-step verifiable reasoning in logic puzzles.
Evaluates large language models using both single-shot and iterative agentic strategies with step-level verification.
Demonstrates significant success rate improvements and solves previously unsolved puzzle varieties.
Highlights the tradeoff between reasoning effort and infrastructure reliability, informing future model design and evaluation.
Advancing Multimodal Judge Models through a Capability-Oriented Benchmark and MCTS-Driven Data Generation
What’s the research question?
How can we improve the evaluation and training of multimodal large language models (MLLMs) when they act as judges for assessing response quality across different modalities?
What did the authors do?
The authors tackled this challenge by developing two key innovations:
M-JudgeBench: A comprehensive, ten-dimensional benchmark designed to evaluate MLLMs as judges. It breaks down the judging task into fine-grained subtasks such as pairwise Chain-of-Thought (CoT) comparison, length bias detection, and process error identification, covering diverse aspects of response quality.
Judge-MCTS: A novel data generation framework that uses Monte Carlo Tree Search (MCTS) to create structured, contrastive pairs of reasoning trajectories. These pairs vary in correctness and length, providing rich supervision signals to help models distinguish subtle differences in reasoning style and quality.
To train their judge models, called M-Judger, the authors combined supervised fine-tuning (SFT) on open-source data with reinforcement learning (RL) using a hybrid reward function called DAPO. They evaluated their models on three benchmarks—M-JudgeBench, VL-RewardBench, and Multimodal RewardBench—using pairwise accuracy as the main metric.
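Pairwise accuracy, the headline metric, simply measures how often the judge prefers the response that the ground-truth label marks as better. A minimal sketch (the `judge` callable is a placeholder for an actual MLLM verdict, not the paper's code):

```python
def pairwise_accuracy(pairs, judge):
    """pairs: iterable of (response_a, response_b, preferred),
    where preferred is 'a' or 'b' per the ground-truth label.
    judge(a, b) -> 'a' or 'b', standing in for the judge model's verdict.
    Returns the fraction of pairs where the judge agrees with the label."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for a, b, gold in pairs if judge(a, b) == gold)
    return correct / len(pairs)
```

Note how this metric interacts with the length-bias subtask: a judge that always prefers the longer response scores poorly on pairs where the shorter answer is labeled better, which is exactly the failure mode the contrastive, length-varying MCTS pairs are meant to train away.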
What did they find?
The M-Judger models enhanced with Judge-MCTS significantly outperformed existing judge models across all benchmarks. Notably:
On M-JudgeBench, the best M-Judger-RL-Qwen8B achieved an accuracy of 62.93%, compared to 44.53% for the previous R1-Reward and 48.03% for UnifiedReward-Think-qwen-7b.
The improvements were especially strong in pairwise CoT comparison and length bias detection tasks, demonstrating the effectiveness of structured reasoning pairs generated by MCTS.
The models maintained or improved performance on other judge benchmarks, showing that the approach generalizes well across different evaluation settings.
However, the study focused primarily on pairwise judgment and may require further validation on real-world multimodal datasets. Additionally, the MCTS-based data generation, while powerful, introduces computational complexity that could impact scalability.
Why does this matter?
This work advances the field by providing a more nuanced and capability-oriented way to evaluate and train multimodal large language models as judges. By generating structured, contrastive reasoning pairs and decomposing evaluation into detailed subtasks, the approach enables more reliable and fine-grained assessment of model responses across language, vision, and other modalities. This has important implications for improving the quality, fairness, and trustworthiness of multimodal AI systems, which are increasingly used in applications like content moderation, AI-assisted creativity, and human-AI interaction. Better judge models can lead to more accurate feedback loops during model training and deployment, ultimately enhancing the user experience and safety of multimodal AI.
Key Points
Introduced M-JudgeBench, a ten-dimensional, capability-oriented benchmark for evaluating multimodal judge models.
Developed Judge-MCTS, a Monte Carlo Tree Search-based data generation framework producing structured, contrastive reasoning pairs.
Enhanced judge models (M-Judger) outperform existing models on multiple benchmarks, especially in reasoning and length bias detection.
Approach improves the reliability and interpretability of multimodal response evaluation, supporting better AI system development.