📣 Headlines
           
• Anthropic’s Claude gains memory, an optional incognito mode, and cross‑export features, even as the company faces a potential settlement of about $3,000 per book over AI piracy claims.
            
• Oracle is doubling down on AI hardware and cloud offerings to power models, while OpenAI plans its first AI chip by 2026 and Arm unveils Lumex for on‑device AI, highlighting a broader push on compute at every layer.
            
• "How do AI models generate videos?" explains the current stack: diffusion in latent space, transformers, and the frame‑consistency techniques that systems like Sora and Veo 3 use to produce coherent video and audio.
            
• Ex‑Google X founders raised $5.7M for TwinMind to build a passive “second brain” that captures ambient speech and builds a personal knowledge graph, reflecting the rising focus on background and agentic assistants described in coverage of agentic AI.
            
• Security teams are shipping autonomous defenses: AegisAI uses agents to inspect and neutralize email threats in real time, Lookout launched Smishing AI for mobile social‑engineering detection, and Miru raised funding for AI‑assisted cyber investigations.
            
• Policy and oversight debates intensify as California seeks frontier AI transparency reports to mitigate catastrophic risks, the FTC probes AI chatbots used with children, and Sen. Ted Cruz proposes regulatory waivers and a White House sandbox for AI companies.
            
• AI training infrastructure draws massive valuations: Mercor eyes a $10B+ valuation on a reported $450M run rate, underlining demand and capital flow into model training services.
            
• Browser and device AI updates: Firefox for iOS adds local AI page summarization on newer iPhones, and Firefox will end support for 32‑bit Linux in 2026, signaling shifts toward on‑device AI and platform consolidation.
            
            🔧 Company Engineering Blogs
           
Jupyter Agents: training LLMs to reason with notebooks (huggingface.co). Jupyter Agent builds a data science workflow inside notebooks using Qwen models, scaffolding, QA generation, and E2B execution pipelines.

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2) (medium.com/pinterest-engineering). Deploying EKS clusters, Fluent Bit logging, OTEL metrics pipelines, image management, and a custom Moka UI for Spark on Kubernetes.

Accelerating scientific discovery with AI-powered empirical software (research.google). Google Research presents an AI-powered system, built on Gemini, that writes, optimizes, and empirically evaluates scientific software across genomics, public health, geospatial analysis, neuroscience, and time-series forecasting.

Scientific frontiers of agentic AI (amazon.science). Explores open questions for agentic AI (embedding languages, context, negotiation, common sense, and privacy), drawing on embeddings, context windows, and behavioral economics insights.
            
            📈 Applied LLMs in the Wild: RAG, Recsys, Pipelines, and Science
           
How to Train an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs (eugeneyan.com). An LLM-RecSys hybrid with semantic IDs built using RQ-VAE, SASRec, and Qwen models, trained on Amazon Video Games data to give steerable, conversational recommendations.

Text analytics in Data Pipelines using AI (medium.com/@ed.bullen). Databricks AI Query workflows for ETL pipelines; using LLMs to classify, rate sentiment, and justify results on Amazon Reviews data.

Single-cell analysis and infectious disease forecasting: Google's new AI scientist (blog.stephenturner.us). AI systems generate and test new methods for single-cell RNA-seq batch integration and COVID-19 forecasting, surpassing some benchmarks.

Stumbling into AI: Part 3—RAG (rmoff.net). Explains Retrieval-Augmented Generation (RAG) using embeddings, vector stores (ChromaDB), Ollama, and Llama models, with Kafka release notes as the example.

Beyond the Chatbot: What Actually Works in Enterprise AI (thedataexchange.media). The evolution of RAG systems, evaluation as IP, embeddings, enterprise security, agent workflows, multi-modality, small models, and AI-enabled coding tools.
            
            🧭 Agents and Human-in-the-Loop Orchestration
           
Exploring Active Agent, or can we build AI features the Rails way? (evilmartians.com). Rails-style AI abstractions with Active Agent: agents, prompts, callbacks, templates, and battle-tested Rails examples.

Lessons learned from a 100 blog posts on AI (frontierai.substack.com). Big-picture AI trends: economics of inference, token costs vs. volume, open-loop agents, evals, data quality, context management, and UX in AI apps.

Generalists Can Also Dig Deep (towardsdatascience.com). Generalist Ida Silfverskiöld on AI agents, RAG, evals, and design choices in agentic systems.

LangGraph 201: Adding Human Oversight to Your Deep Research Agent (towardsdatascience.com). LangGraph 201 adds two human-in-the-loop checkpoints to a deep research agent, using interruption patterns, state graphs, and in-memory/DB checkpointers.
            
            ⚙️ Deterministic and Efficient Inference & Serving
           
Defeating Nondeterminism in LLM Inference (simonwillison.net). Nondeterminism in LLM inference arises mainly from varying load and batch size; the work proposes batch-invariant kernels in PyTorch to achieve determinism.

Speculative cascades — A hybrid approach for smarter, faster LLM inference (research.google). Speculative cascades combine cascades and speculative decoding with a deferral rule to speed LLM inference and improve cost–quality trade-offs.

The Rise of Multimodal LLMs and Efficient Serving with vLLM (pyimagesearch.com). Multimodal LLMs (LLaVA, GPT-4V, BakLLaVA) and vLLM enable OpenAI-compatible vision–language inference and efficient deployment.

Defeating Nondeterminism in LLM Inference – Thinking Machines Lab (jmason.ie). Defeating nondeterminism in LLM inference by examining sampling, temperature effects, and deterministic behavior across stacks and libraries.

Benchmarking AI & ML on local CPU/GPUs: an end-to-end Python project (allaboutdata.substack.com). Benchmarking AI/ML on local CPU/GPU with Python: XGBoost, Ollama, CUDA, uv, Altair, a Streamlit dashboard, and a Docker-free workflow.

Text embedding inference on a cheap server (paulw.tokyo). Benchmarking text embedding inference on cheap CPUs using Granite-embedding-107m-multilingual-GGUF and EmbeddingGemma with llama-server and wrk load testing.
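
A quick aside on the nondeterminism entries above: one commonly cited reason that batch size can change a model's output is that floating-point addition is not associative, so a kernel that sums the same numbers in a different order can return a slightly different result. The sketch below is plain Python, not code from any of the linked posts, and the function names `reduce_left` and `reduce_right` are hypothetical, chosen only for illustration.

```python
# Assumption: this is a toy illustration, not the linked post's kernels.
# It shows that the grouping of a floating-point sum changes the result.

def reduce_left(values):
    """Sum as ((a + b) + c) + ..., accumulating from the first element."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def reduce_right(values):
    """Sum as a + (b + (c + ...)), accumulating from the last element."""
    total = values[-1]
    for v in reversed(values[:-1]):
        total = v + total
    return total


values = [0.1, 0.2, 0.3]
left = reduce_left(values)
right = reduce_right(values)

print(left, right, left == right)
```

The two results differ only in the last bits, but in an LLM such differences can be enough to flip a near-tie between tokens, after which the generations diverge. Fixing the reduction order regardless of batch shape removes this particular source of variation.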
            
            🧪 Architectures, Alignment, and Post-Training
           
Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?! (simonwillison.net). Qwen3-Next-80B-A3B Instruct and Thinking models: 80B total parameters with 3B active; OpenRouter deployment; llm-openrouter plugin; pelican SVG prompt; performance claims.

Paper Review: Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (andlukyane.com). Decentralized RL post-training with SAPO, which shares rollouts across a swarm for LM fine-tuning and reward-based learning.

lecture three (aarnphm.xyz). Lecture three on tokenizers, LLMs, alignment, sparse autoencoders, residual streams, and speculative decoding for efficient inference.

assignment three reports (aarnphm.xyz). Discussion of replacing one-hot cross-entropy, 2D GEMMs, batching, tokenization, and optimization techniques for large vocabularies (large V).

Qwen 3 Next (sibellavia.lol). Qwen3-Next-80B models with hybrid Gated DeltaNet, ultra-sparse MoE (512 experts), YaRN context up to 1,000,000 tokens, and multi-token prediction.
            
            📚 Academic Research
           
Inpainting-Guided Policy Optimization for Diffusion Large Language Models (arxiv:cs). Inpainting-guided RL for diffusion LLMs improves exploration, using partial ground-truth reasoning to boost GRPO, with synthetic traces and entropy filtering.

Can Understanding and Generation Truly Benefit Together -- or Just Coexist? (arxiv:cs). Unified multimodal learning: encoder–decoder paradigm with long-context captions, UAE framework, Unified-GRPO RL, and Unified-Bench benchmark.

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning (arxiv:cs). AgentGym-RL trains LLM agents for multi-turn decision making using RL, ScalingInter-RL for exploration-exploitation balance across diverse environments.

RewardDance: Reward Scaling in Visual Generation (arxiv:cs). RewardDance: scalable reward modeling for visual generation using yes-token probability, enabling large RMs and CoT integration.

Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining (arxiv:cs). MuSe uses separate semantic clustering and multipole expansions to accelerate transformer attention with dipole corrections for causal and acausal attention.

Customizing the Inductive Biases of Softmax Attention using Structured Matrices (arxiv:cs). Structured-matrix attention with BTT and MLR boosts high-dimensional input tasks and locality-aware language modeling performance.

Recurrence Meets Transformers for Universal Multimodal Retrieval (arxiv:cs). ReT-2: a unified multimodal retrieval model using recurrent Transformer with LSTM-inspired gating for image-text queries across multimodal documents.

D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics (arxiv:cs). D-LEAF localizes and corrects multimodal LLM hallucinations via Layer Image Attention Entropy and Image Attention Focus for dynamic, efficient inference.

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (arxiv:cs). Noise-resistant data curation with MLLMs and GA-DMS framework for robust cross-modal alignment in WebPerson dataset.

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents (arxiv:cs). Entropy-Modulated Policy Gradients (EMPG) re-calibrate step-wise learning signals for LLM agents in long-horizon tasks to balance uncertainty and outcomes.

A Survey of Reinforcement Learning for Large Reasoning Models (arxiv:cs). Survey of RL methods for reasoning with LLMs/LRMs, focusing on mathematics, coding, scalability, data, and infrastructure toward ASI.

Selective Induction Heads: How Transformers Select Causal Structures In Context (arxiv:stat). Selective Induction Heads enable transformers to choose causal structures and copy past tokens via interleaved Markov chains and lag selection.

Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge (arxiv:cs). Adaptive token merging reduces transformer compute and communication at the edge via data-dependent, training-free redundancy reduction.
            
AgentX: Towards Orchestrating Robust Agentic Workflow Patterns with FaaS-hosted MCP Services (arxiv:cs). AgentX orchestrates robust agentic workflows with stage designer, planner, and executor agents using MCP tools and FaaS-hosted services.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference (arxiv:cs). PLENA: hardware-software co-design with asymmetric quantization and FlashAttention for long-context LLM inference, delivering higher utilization and throughput.

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness (arxiv:cs). MCBP: a bit-slice-enabled, memory-efficient LLM inference accelerator exploiting BRCR, BSTC, and BGPP for faster GEMMs and KV caching.

Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control? (arxiv:cs). Vision-driven cross-modal RL using SSD-Mamba2 backbone for end-to-end motion control with proprioceptive and exteroceptive tokens.

BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (arxiv:cs). BcQLM: BreezeCLIP-based lightweight vision-language model for efficient visual question answering with 1.2B parameters.

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization (arxiv:cs). Scalable training of vector-quantized networks with 100% codebook utilization via VQBridge and learning annealing.

CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models (arxiv:cs). Curiosity-driven exploration for RLVR in LLMs using actor perplexity and multi-head critic variance to boost exploration.

The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward (arxiv:cs). Diversity-Preserving Hybrid RL uses mass-covering f-divergences to counter Pass@k degradation in RLVR for LLMs with math and SQL tasks.

Visual Representation Alignment for Multimodal Large Language Models (arxiv:cs). VIRAL aligns internal visual representations of MLLMs with vision foundation models to improve object counting and spatial reasoning.

TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making (arxiv:cs). Thought-Centric Preference Optimization (TCPO) enhances embodied decision-making with stepwise preference learning and Action Policy Consistency in ALFWorld.

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search (arxiv:cs). Mini-o3 scales multi-turn visual search reasoning with deep tool-based exploration and over-turn masking for tens of interaction steps.

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (arxiv:cs). Reinforcement learning framework Parallel-R1 enables parallel thinking for complex reasoning on math benchmarks like MATH, AMC23, AIME.

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles (arxiv:cs). Multimodal Bayesian prompt ensembles improve calibration and accuracy for judging TTI quality with image clustering in MLLMs.
            
Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching (arxiv:cs). Cluster-Driven Feature Caching (ClusCa) accelerates diffusion transformers by spatially clustering tokens and propagating one token per cluster.

ENSI: Efficient Non-Interactive Secure Inference for Large Language Models (arxiv:cs). ENSI: co-designs cryptographic protocols and BitNet LLM, enabling CKKS-based secure inference with sigmoid attention and RMSNorm bootstrapping.

GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection (arxiv:cs). GAMMA uses multi-task, manipulation-augmented training and reverse cross-attention for robust AI-generated image detection across diverse models.

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems (arxiv:cs). MOAT enables joint alignment tuning for planning and grounding agents to improve coordination in multi-agent LLM systems.

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens (arxiv:cs). Investigates how LLMs perform mental math via late, last-token computation using Context-Aware Mean Ablation and Attention-Based Peeking.

Astra: A Multi-Agent System for GPU Kernel Performance Optimization (arxiv:cs). Astra: an LLM-driven multi-agent system for GPU kernel optimization from SGLang CUDA implementations, achieving 1.32x speedups.

Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization (arxiv:cs). TAM Bench: automated task collection from Kaggle, AIcrowd, Biendata; multi-modal ML tasks; leaderboard-based difficulty; multi-dimensional evaluation.

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models (arxiv:cs). Med3DInsight: pretraining 3D medical encoders with 2D multimodal LLMs via plane-slice transformer and partial optimal transport alignment.

Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (arxiv:cs). SAPO: a decentralized, asynchronous RL post-training method for LMs with shared rollouts across heterogeneous nodes.

Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference (arxiv:cs). MoMA: Mixture of Models and Agents for adaptive routing of LLMs and agents to optimize cost, performance, and efficiency.

VARCO-VISION-2.0 Technical Report (arxiv:cs). VARCO-VISION-2.0: bilingual Korean-English VLM with multi-image understanding, layout-aware OCR, and on-device 1.7B variant.

Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia (arxiv:cs). Compass-v3: a 245B multilingual e-commerce Mixture-of-Experts model with OTPO alignment and GPU-optimized training for Southeast Asia.

Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems (arxiv:cs). Dual knowledge-enhanced two-stage reasoner (DK2R) uses structured attributes and unstructured reviews with LLMs for multimodal dialog response generation.
            
            👋 Before you go
           
I've got a big favor to ask: keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can.
That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:
            
- Real say in how Blaze evolves: vote on new topics, features, and topic curation ideas
- First dibs on merch (details still cooking)
- That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing
 
If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries: the newsletters keep coming either way, and you can follow along on Patreon for free.
Thanks for reading and being part of this nerdy corner of the internet. All the best, Alastair.
            