AI-Designed Drugs, Soft Robotics Advances, and Enterprise AI Gains This Week
New advances in AI medicine, flexible robotics, and enterprise productivity.
This Week in AI
AI research took some bold steps into the real world this week. Scientists are pushing one of the first truly AI-designed drugs toward clinical trials, as machine-engineered antibodies show early promise in hitting disease targets more effectively than human-designed candidates. Meanwhile, at MIT, researchers unveiled a new speech-to-reality system that lets users generate physical objects on demand by combining generative AI with robotics, a breakthrough that could transform product design and manufacturing. In robotics, another MIT team advanced control systems for soft robots that adapt to their environment and interact safely with people, pointing toward a future where automation is far more flexible and human-friendly.
Business leaders are moving quickly to capture this momentum. This week, Accenture and Anthropic announced a major enterprise partnership, signaling massive deployment of safe foundation models into sectors like finance and healthcare. And according to a new industry report from OpenAI, businesses adopting AI at scale are already seeing productivity gains of up to 10 hours per employee per week, shifting AI from “nice-to-have” to mission-critical. Nations and corporations alike are beginning to treat AI as a strategic priority shaping long-term competitiveness and economic power.
But while progress accelerates, concerns are rising. The AI research community is facing scrutiny over questionable publishing practices and paper quality, as output skyrockets. And as enterprise adoption widens, gaps in skills, access, and infrastructure are leaving some organizations behind. With AI becoming deeply embedded in healthcare, energy, and national infrastructure, transparency, governance, and reliability are no longer optional — they’re the new frontier of innovation.
This Week in Research
On Memory: A comparison of memory mechanisms in world models

Image from arXiv paper.
What’s the research question?
How can different memory encoding and injection strategies be effectively integrated into transformer-based world models to enhance their ability to remember past states and close loops over long sequences?
What did the authors do?
The authors systematically compared various memory mechanisms in transformer-based world models using the MemoryMaze dataset, which challenges models with sequences of randomly colored objects to test long-term recall.
Built a backbone model based on a Vision Transformer (ViT) operating in latent space, with a pretrained CNN encoder and decoder.
Implemented different memory encoding strategies: Cache (explicit stored representations), Neural Weights (learned weight matrix for retrieval), and Recurrent Hidden State (using RNN or state-space models).
Explored multiple injection methods to incorporate memory into the transformer: Attention biasing (modifying attention coefficients), Adaptive normalization (conditioning normalization parameters), Attention (cross-attention between current context and memory), and Additive (adding memory representations to the residual stream); a minimal code sketch of the cache-plus-injection idea follows this list.
Evaluated models on their ability to recall past context over sequences of increasing length, using metrics like SSIM, LPIPS, and MSE to measure image and latent-space reconstruction quality.
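To make the encoding-and-injection taxonomy above concrete, here is a minimal PyTorch sketch, not the authors' code, of an explicit cache memory combined with two injection styles: pre-pending memory tokens to the context versus cross-attention. All class names, shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (PyTorch): an explicit cache of past latent tokens plus a
# transformer block that can consume that memory either by pre-pending it to
# the context or by cross-attending to it. Names and shapes are illustrative.
import torch
import torch.nn as nn

class CacheMemory:
    """Explicitly stores past latent tokens, oldest-first, up to a fixed capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots: list[torch.Tensor] = []

    def write(self, latents: torch.Tensor):           # latents: (batch, tokens, dim)
        self.slots.append(latents.detach())
        self.slots = self.slots[-self.capacity:]       # keep only the newest entries

    def read(self) -> torch.Tensor:                    # -> (batch, mem_tokens, dim)
        return torch.cat(self.slots, dim=1)

class MemoryInjectedBlock(nn.Module):
    """One transformer block that injects memory via pre-pend or cross-attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, memory=None, mode="prepend"):
        if mode == "prepend" and memory is not None:
            ctx = torch.cat([memory, x], dim=1)         # memory tokens widen the context
            h = self.norm1(ctx)
            attn, _ = self.self_attn(h, h, h)
            x = (ctx + attn)[:, memory.shape[1]:]       # drop memory positions from the output
        else:
            h = self.norm1(x)
            attn, _ = self.self_attn(h, h, h)
            x = x + attn
            if mode == "cross" and memory is not None:
                q = self.norm2(x)
                mem_attn, _ = self.cross_attn(q, memory, memory)
                x = x + mem_attn                        # memory injected through cross-attention
        return x + self.mlp(self.norm3(x))
```

In this sketch the pre-pend path simply widens the self-attention context with cached tokens, while the cross-attention path keeps the context fixed and queries memory separately, which is the selective routing behavior the findings below attribute to attention-based injection.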
What did they find?
The study revealed that certain combinations of memory encoding and injection strategies significantly improve long-term recall and loop closure:
The Cache + Context Pre-pend combination achieved the best overall performance, with the highest image reconstruction and latent-space accuracy (average rank 1.0).
State-space models (SSM) with pre-pended hidden states performed comparably well (average rank 3.0), highlighting the effectiveness of recurrent-like memory integration.
Attention-based injection methods outperformed others, likely because they can selectively route information across token channels, enhancing the model’s ability to focus on relevant memory content.
LoRA and AdaNorm injections underperformed on image reconstruction quality but showed lower latent error, a pattern consistent with reconstruction collapse in the decoder and pointing to a different trade-off between latent fidelity and pixel fidelity.
Overall, augmenting world models with memory mechanisms improved their ability to recall recent context and close loops over long sequences, which is critical for planning and reasoning tasks.
Why does this matter?
This work advances our understanding of how to effectively incorporate memory into transformer-based world models, a key challenge in building AI systems capable of long-term reasoning and planning.
Provides a taxonomy of memory encoding and injection strategies, guiding future design choices for memory-augmented models.
Highlights the importance of attention-based injection methods for dynamic and selective memory access, which can improve model flexibility and performance.
Demonstrates that combining different memory mechanisms can significantly enhance long-term recall and loop closure, enabling AI agents to better understand and interact with complex, temporally extended environments.
Implications for developing more capable autonomous agents, robotics, and reasoning systems that require persistent memory and context awareness.
VisChainBench: A Benchmark for Multi-Turn, Multi-Image Visual Reasoning Beyond Language Priors
What’s the research question?
How can we evaluate the ability of Large Vision-Language Models (LVLMs) to perform multi-step visual reasoning across sequential, interdependent tasks with minimal language guidance?
What did the authors do?
The authors developed VisChainBench, a comprehensive benchmark designed to test LVLMs’ capabilities in complex visual reasoning tasks that involve multiple images and multiple reasoning steps. Key features include:
Constructed 1,457 tasks spanning over 20,000 images across three diverse domains: daily scenarios, engineering troubleshooting, and natural science understanding.
Structured tasks as visual chains, simulating extended human-AI interactions with up to 6 dialogue turns and 27 evolving images.
Generated initial task structures using language models, then retrieved or synthesized relevant images, followed by human annotation and refinement to ensure quality.
Formulated tasks as single- or multiple-choice questions with clear ground-truth labels, and released evaluation scripts for consistency; an illustrative sketch of such a task record appears after this list.
Evaluated three major forms of visual reasoning: image-text multi-turn reasoning (ITMR), in-context image-only reasoning (ICIR), and image-only multi-turn reasoning (IOMR).
Assessed ten LVLMs, spanning proprietary and open-source models, on the benchmark.
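The benchmark's released format isn't reproduced here, so the following Python sketch only illustrates what a multi-turn, multi-image multiple-choice task and a turn-by-turn accuracy scorer could look like; the field names and the model.answer call are hypothetical, not VisChainBench's actual schema or evaluation scripts.

```python
# Illustrative sketch only: a possible shape for one multi-turn, multi-image
# multiple-choice task and a plain accuracy scorer. Field names and the
# `model.answer` call are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class Turn:
    image_paths: list[str]                 # images shown at this dialogue turn
    question: str                          # may be near-empty for image-only settings
    choices: list[str]                     # candidate answers
    answer: str                            # ground-truth label, e.g. "B"

@dataclass
class VisualChainTask:
    domain: str                            # e.g. "engineering troubleshooting"
    turns: list[Turn] = field(default_factory=list)   # up to ~6 interdependent turns

def evaluate(model, tasks: list[VisualChainTask]) -> float:
    """Score a model turn by turn; every earlier turn stays in the dialogue history."""
    correct, total = 0, 0
    for task in tasks:
        history = []                        # carries images, questions, and answers so far
        for turn in task.turns:
            pred = model.answer(history, turn.image_paths, turn.question, turn.choices)
            correct += int(pred == turn.answer)
            total += 1
            history.append((turn.image_paths, turn.question, pred))
    return correct / max(total, 1)
```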
What did they find?
Major findings include:
Proprietary models like GPT-4o and Gemini-2.0-flash significantly outperformed open-source models, especially in ITMR and ICIR tasks.
Gemini-2.0-flash achieved the highest accuracy (82.04%) in ITMR, while GPT-4o led in ICIR with 71.74% accuracy.
Larger models such as Qwen2.5VL-32B outperformed smaller variants by over 41% in accuracy, highlighting the importance of scale.
Open-source models trained on structured, multi-image, and instruction-aligned data (e.g., Qwen2.5VL-32B, InternVL3-14B) performed well, suggesting training data quality and structure matter.
Smaller models like LLaVA-NEXT and MiniCPM underperformed due to their focus on single-image tasks and limited context handling.
Performance improves steeply with model size, indicating that multi-turn, multi-image reasoning requires latent world model construction and temporal-spatial coherence—capabilities that emerge only at larger scales.
Why does this matter?
VisChainBench introduces a new paradigm for evaluating LVLMs’ ability to handle complex, multi-step visual reasoning tasks that mirror real-world scenarios such as technical troubleshooting, scientific analysis, and sequential decision-making. By focusing on image-centric reasoning with minimal language cues, it moves beyond traditional text-based benchmarks and emphasizes the importance of structured training data and model architecture in enabling robust multimodal understanding. This benchmark provides a valuable foundation for future research aiming to develop LVLMs capable of sophisticated, multi-turn, multi-image reasoning, which is critical for advancing AI systems that interact seamlessly with the visual world.
Key Points
VisChainBench evaluates LVLMs’ multi-turn, multi-image visual reasoning beyond language priors.
Contains 1,457 tasks across diverse domains, structured as visual chains with up to 6 dialogue turns and 27 images.
Proprietary models outperform open-source ones; larger models excel due to emergent reasoning capabilities.
Highlights the importance of structured training data and architectural design for multimodal reasoning.
RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs
What’s the research question?
How can we develop a scalable and robust reinforcement learning framework for large language models (LLMs) on TPUs that supports diverse algorithms, training paradigms, and preemption?
What did the authors do?
The authors introduced RLAX, a new reinforcement learning (RL) system designed specifically for training large language models on TPU clusters. Key features include:
A parameter-server architecture with a centralized controller to enable flexible resource management and robust preemption handling.
Integration with AXLearn as the backend, supporting multiple RL algorithms through a modular design that separates loss computation, advantage estimation, and gradient calculation.
Support for both on-policy and off-policy training paradigms via configurable staleness bounds, allowing inference workers to generate rollouts using stale weights without compromising convergence; a toy sketch of such a staleness check follows this list.
A numerical alignment technique that recomputes log-probabilities using an inference-matched training graph to ensure consistency between trainer and inference workers.
Design optimizations for large-scale TPU clusters, including distributed parameter servers, in-memory persistence, and custom version management.
A verifier service for code execution and a data curation pipeline to ensure high-quality training data.
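The staleness-bound and numerical-alignment ideas can be illustrated with a short Python sketch. This is a toy reconstruction under stated assumptions, not RLAX's implementation: the Rollout fields, the max_staleness reading of the paper's bound, and the policy.log_probs method (standing in for an inference-matched training-graph forward pass) are all hypothetical.

```python
# Toy sketch, not RLAX's actual system or API: gate off-policy rollouts by
# weight-version staleness, and recompute per-token log-probabilities with the
# trainer's graph so trainer and inference workers agree numerically.
from dataclasses import dataclass

@dataclass
class Rollout:
    token_ids: list[int]              # tokens generated by an inference worker
    inference_logprobs: list[float]   # log-probs reported by that worker
    weight_version: int               # parameter-server version used for generation

def accept_rollout(rollout: Rollout, trainer_version: int, max_staleness: int) -> bool:
    """Keep a rollout only if its weights are at most `max_staleness` versions old."""
    return (trainer_version - rollout.weight_version) <= max_staleness

def align_logprobs(policy, rollout: Rollout) -> list[float]:
    """Recompute log-probs with the trainer's inference-matched graph.

    Using these values instead of `rollout.inference_logprobs` removes the
    numerical drift between the two execution paths that the paper reports.
    `policy.log_probs` is a hypothetical method, not a real RLAX call.
    """
    return policy.log_probs(rollout.token_ids)

def filter_batch(policy, rollouts: list[Rollout], trainer_version: int, max_staleness: int = 32):
    """Drop over-stale rollouts, then pair each survivor with aligned log-probs."""
    kept = [r for r in rollouts if accept_rollout(r, trainer_version, max_staleness)]
    return [(r, align_logprobs(policy, r)) for r in kept]
```

Bounding staleness this way lets inference workers keep generating with slightly old weights while the trainer only consumes rollouts recent enough to keep gradients close to on-policy.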
What did they find?
RLAX demonstrated impressive performance and robustness in large-scale training experiments:
Achieved a 12.8% improvement in pass@8 accuracy for QwQ-32B within 12 hours and 48 minutes of training on 1024 v5p TPUs.
Scaled inference throughput linearly up to 1024 TPUs, reducing training step time by a factor of 3.6 as inference throughput increased 8-fold.
Through ablation studies, showed that bounded staleness parameters (e.g., (j,k)=(16,32)) effectively balanced convergence speed and training efficiency.
Reduced log-probability volatility from a 95th percentile of 0.0443 to 0.0199 through numerical alignment, improving training stability.
Limitations include the need for careful tuning of staleness bounds and potential complexity in system deployment.
Why does this matter?
RLAX advances the field of large-scale reinforcement learning for language models by providing a flexible, scalable, and robust system architecture tailored for TPU clusters. Its support for diverse RL algorithms, training paradigms, and preemption makes it highly applicable to real-world deployment scenarios where resource variability and data quality are critical concerns. The emphasis on numerical stability addresses a common challenge in RL training, enabling more reliable and efficient development of future large language models. This work paves the way for more effective use of RL in improving language understanding, generation, and alignment tasks at scale.
Key Points
RLAX is a modular, distributed RL framework optimized for large language models on TPUs.
Supports on-policy and off-policy training with configurable staleness bounds for stability and efficiency.
Numerical alignment via log-prob recomputation improves training robustness.
Achieved significant accuracy gains and scalable inference throughput on large TPU clusters.