The AI Engineer
Generative AI newsletter • Blaze Email

September 16, 2025 • read online • patreon

📣 Headlines

• Anthropic’s Claude gains memory, an optional incognito mode, and cross‑export features, even as the company faces a potential settlement of roughly $3,000 per pirated work over AI piracy claims.

• Oracle is doubling down on AI hardware and cloud offerings to power models, while OpenAI plans its first AI chip by 2026 and Arm unveils Lumex for on‑device AI, highlighting a broader push on compute at every layer.

• How do AI models generate videos? explains the current stack of latent‑space diffusion, transformers, and frame‑consistency techniques that systems like Sora and Veo 3 use to produce coherent video and audio.

• Ex‑Google X founders raised $5.7M for TwinMind to build a passive “second brain” that captures ambient speech and builds a personal knowledge graph, reflecting the rising focus on background and agentic assistants described in coverage of agentic AI.

• Security teams are shipping autonomous defenses: AegisAI uses agents to inspect and neutralize email threats in real time, Lookout launched Smishing AI for mobile social‑engineering detection, and Miru raised funding for AI‑assisted cyber investigations.

• Policy and oversight debates intensify as California seeks frontier AI transparency reports to mitigate catastrophic risks, the FTC probes AI chatbots used with children, and Sen. Ted Cruz proposes regulatory waivers and a White House sandbox for AI companies.

• AI training infrastructure draws massive valuations: Mercor eyes a $10B+ valuation on a reported $450M run rate, underscoring the demand and capital flowing into model‑training services.

• Browser and device AI updates: Firefox for iOS adds local AI page summarization on newer iPhones, and Firefox will end support for 32‑bit Linux in 2026, signaling shifts toward on‑device AI and platform consolidation.

🔧 Company Engineering Blogs

Jupyter Agents: training LLMs to reason with notebooks (huggingface​.co) . Jupyter Agent builds a data science workflow inside notebooks using Qwen models, scaffolding, QA generation, and E2B execution pipelines

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2) (medium​.com/pinterest-engineering) . Deploying EKS clusters, Fluent Bit logging, OTEL metrics pipelines, image management, and a custom Moka UI for Spark on Kubernetes

Accelerating scientific discovery with AI-powered empirical software (research​.google) . Google Research presents an AI-powered system, built on Gemini, that writes, optimizes, and empirically evaluates scientific software across genomics, public health, geospatial analysis, neuroscience, and time-series forecasting

Scientific frontiers of agentic AI (amazon​.science) . Agentic AI explores embedding languages, context, negotiation, common sense, and privacy with embeddings, context windows, and behavioral economics insights

📈 Applied LLMs in the Wild: RAG, Recsys, Pipelines, and Science

How to Train an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs (eugeneyan​.com) . LLM-Recsys hybrid with semantic IDs using RQ-VAE, SASRec, Qwen models; train on Amazon Video Games data; steerable, conversational recommendations
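As a complement to the semantic-ID idea, here is a minimal sketch of residual quantization, the mechanism RQ-VAE uses to turn an item embedding into a short tuple of codebook indices that an LLM can emit as tokens. The codebooks, shapes, and names below are illustrative assumptions, not the post's code (a real RQ-VAE learns its codebooks jointly with an autoencoder).

```python
# Minimal residual-quantization sketch (assumed, illustrative): each stage picks the
# nearest code vector for the residual left over from the previous stage.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, CODEBOOK_SIZE, NUM_LEVELS = 64, 256, 3

# Random codebooks for illustration; an RQ-VAE would learn these.
codebooks = [rng.normal(size=(CODEBOOK_SIZE, EMB_DIM)) for _ in range(NUM_LEVELS)]

def semantic_id(embedding: np.ndarray) -> tuple[int, ...]:
    """Quantize an item embedding into a tuple of codebook indices (its semantic ID)."""
    residual = embedding.copy()
    codes = []
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # nearest code vector
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]  # the next level quantizes what's left
    return tuple(codes)

item_embedding = rng.normal(size=EMB_DIM)
print(semantic_id(item_embedding))  # e.g. (17, 203, 88): tokens a recommender LLM can generate
```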

Text analytics in Data Pipelines using AI (medium​.com/@ed​.bullen) . Databricks AI Query workflows for ETL pipelines; using LLMs to classify, rate sentiment, and justify results on Amazon Reviews data

Single-cell analysis and infectious disease forecasting: Google's new AI scientist (blog​.stephenturner​.us) . AI systems generate and test new methods for single-cell RNA-seq batch integration and COVID-19 forecasting, surpassing some benchmarks

Stumbling into AI: Part 3—RAG (rmoff​.net) . Explains Retrieval-Augmented Generation (RAG) using embeddings, vector stores (ChromaDB), Ollama, and Llama models with Kafka release notes as example
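For readers who want the shape of the pipeline, here is a minimal RAG sketch in the same spirit: embed documents into a vector store, retrieve the nearest chunks for a question, and feed them to a local LLM. It assumes the `chromadb` and `ollama` Python packages and a running Ollama server; the model name and sample documents are placeholders, not the article's exact setup.

```python
# Minimal RAG sketch (assumptions noted above), using ChromaDB's default embedder
# for retrieval and a local Ollama chat model for generation.
import chromadb
import ollama

client = chromadb.Client()
collection = client.create_collection("release_notes")
collection.add(
    ids=["kafka-3.7", "kafka-3.8"],
    documents=[
        "Kafka 3.7 adds JBOD support in KRaft mode.",
        "Kafka 3.8 improves the consumer rebalance protocol via KIP-848.",
    ],
)

question = "What changed about consumer rebalancing?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

answer = ollama.chat(
    model="llama3.1",  # assumption: any locally pulled chat model works here
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQ: {question}"}],
)
print(answer["message"]["content"])
```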

Beyond the Chatbot: What Actually Works in Enterprise AI (thedataexchange​.media) . RAG systems evolution, evaluation as IP, embeddings, enterprise security, agent workflows, multi-modality, small models, and AI-enabled coding tools

🧭 Agents and Human-in-the-Loop Orchestration

Exploring Active Agent, or can we build AI features the Rails way? (evilmartians​.com) . Rails-style AI abstractions with Active Agent: agents, prompts, callbacks, templates, and battle-tested Rails examples

Lessons learned from a 100 blog posts on AI (frontierai​.substack​.com) . Big-picture AI trends: economics of inference, token costs vs. volume, open-loop agents, evals, data quality, context management, and UX in AI apps

Generalists Can Also Dig Deep (towardsdatascience​.com) . Generalist Ida Silfverskiöld on AI agents, RAG, evals, and design choices in agentic systems

LangGraph 201: Adding Human Oversight to Your Deep Research Agent (towardsdatascience​.com) . LangGraph 201 adds two human-in-the-loop checkpoints to a deep research agent, using interruption patterns, state graphs, and in-memory/DB checkpointers
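To make the pattern concrete without reproducing LangGraph's API, here is a framework-agnostic sketch of a human-in-the-loop checkpoint: run until a review point, hold the state, ask a human to approve, then resume. The state fields, node functions, and `input()` prompt are all illustrative assumptions, not the article's code.

```python
# Plain-Python sketch of the human-in-the-loop checkpoint pattern (not LangGraph's API).
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    plan: str = ""
    approved: bool = False
    report: str = ""
    history: list = field(default_factory=list)

def draft_plan(state: ResearchState) -> ResearchState:
    state.plan = f"1. Search sources about: {state.question}\n2. Summarize findings"
    state.history.append("plan_drafted")
    return state

def human_checkpoint(state: ResearchState) -> ResearchState:
    """Pause here: a real system would persist the state (checkpointer) and resume
    later; input() stands in for the human review step."""
    print("Proposed plan:\n", state.plan)
    state.approved = input("Approve plan? [y/n] ").strip().lower() == "y"
    return state

def run_research(state: ResearchState) -> ResearchState:
    state.report = f"Report for '{state.question}' following:\n{state.plan}"
    state.history.append("research_done")
    return state

state = ResearchState("What changed in LLM inference efficiency in 2025?")
state = draft_plan(state)
state = human_checkpoint(state)                     # checkpoint 1: plan approval
state = run_research(state) if state.approved else state
print(state.report or "Stopped at human checkpoint.")
```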

⚙️ Deterministic and Efficient Inference & Serving

Defeating Nondeterminism in LLM Inference (simonwillison​.net) . Nondeterminism in LLM inference arises mainly from varying load and batch size; paper proposes invariant kernels in PyTorch to achieve determinism
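A toy illustration of the underlying issue (not the paper's code): floating-point addition is non-associative, so kernels that change their reduction order with batch size can return slightly different logits for the identical request. The array size and chunk count below are arbitrary.

```python
# Why batching breaks bitwise determinism: summing the same float32 values in two
# different orders usually disagrees in the last bits.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10_000).astype(np.float32)

full = np.sum(x)                                         # one reduction order
chunked = sum(np.sum(c) for c in np.array_split(x, 7))   # another order, as if batched

print(full, chunked, full == chunked)  # typically False by a few ULPs
# The proposed fix is "batch-invariant" kernels whose reduction order does not
# depend on how many requests are packed together.
```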

Speculative cascades — A hybrid approach for smarter, faster LLM inference (research​.google) . Speculative cascades combine cascades and speculative decoding with a deferral rule to speed LLM inference and improve cost–quality trade-offs
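Below is a hedged sketch of the deferral half of the idea: a small model drafts each token and a per-token rule decides whether to keep the cheap draft or defer to the large model. The toy "models", confidence scores, and threshold are stand-ins; real speculative cascades also verify drafts against the large model in parallel, which this sketch omits.

```python
# Toy per-token deferral rule (illustrative only): trust the small model when it is
# confident, pay for the large model otherwise.
import random

def small_model(prefix):   # cheap drafter: returns (token, confidence)
    return random.choice(["the", "a", "cat"]), random.uniform(0.3, 1.0)

def large_model(prefix):   # expensive fallback model
    return "the"

def generate(prefix, steps=10, confidence_threshold=0.7):
    tokens, deferred = [], 0
    for _ in range(steps):
        draft, conf = small_model(prefix + " ".join(tokens))
        if conf >= confidence_threshold:     # deferral rule: keep the cheap draft
            tokens.append(draft)
        else:                                # defer to the large model
            tokens.append(large_model(prefix + " ".join(tokens)))
            deferred += 1
    return tokens, deferred

out, n_deferred = generate("Once upon a time")
print(out, f"deferred {n_deferred}/10 tokens to the large model")
```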

The Rise of Multimodal LLMs and Efficient Serving with vLLM (pyimagesearch​.com) . Multimodal LLMs (LLaVA, GPT-4V, BakLLaVA) and vLLM enable OpenAI-compatible vision–language inference and efficient deployment

Defeating Nondeterminism in LLM Inference – Thinking Machines Lab (jmason​.ie) . Defeating nondeterminism in LLM inference by examining sampling, temperature effects, and deterministic behavior across stacks and libraries

Benchmarking AI & ML on local CPU/GPUs: an end-to-end Python project (allaboutdata​.substack​.com) . Benchmarking AI/ML on local CPU/GPU with Python: XGBoost, Ollama, CUDA, uv, Altair, Streamlit dashboard and Docker-free workflow

Text embedding inference on a cheap server (paulw​.tokyo) . Benchmarking text embedding inference on cheap CPUs using Granite-embedding-107m-multilingual-GGUF and EmbeddingGemma with llama-server and wrk load testing
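As a rough stand-in for the post's wrk benchmark, here is a small Python loop against llama-server's OpenAI-compatible /v1/embeddings endpoint. It assumes llama-server is running locally with an embedding GGUF model and embeddings enabled; the host, port, and sample texts are assumptions.

```python
# Crude throughput check against a local llama-server embeddings endpoint
# (assumptions noted above; the post uses wrk for proper load testing).
import time
import requests

URL = "http://localhost:8080/v1/embeddings"
texts = ["hello world", "cheap CPU inference", "多言語の埋め込み"] * 50

start = time.perf_counter()
for text in texts:
    resp = requests.post(URL, json={"input": text})
    resp.raise_for_status()
    vec = resp.json()["data"][0]["embedding"]
elapsed = time.perf_counter() - start

print(f"{len(texts)} requests in {elapsed:.2f}s "
      f"({len(texts) / elapsed:.1f} req/s), dim={len(vec)}")
```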

🧪 Architectures, Alignment, and Post-Training

Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?! (simonwillison​.net) . Qwen3-Next-80B-A3B-Instruct and Thinking models; 80B with 3B active per round; OpenRouter deployment; llm-openrouter plugin; pelican SVG prompt; performance claims
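For those not using the llm CLI, a minimal sketch of calling the model through OpenRouter's OpenAI-compatible API follows; the model slug is an assumption (check OpenRouter's catalog), and the pelican-on-a-bicycle prompt is the benchmark mentioned in the post.

```python
# Hedged sketch: Qwen3-Next via OpenRouter's OpenAI-compatible endpoint.
# Requires OPENROUTER_API_KEY in the environment; model slug is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",  # assumed slug
    messages=[{"role": "user",
               "content": "Generate an SVG of a pelican riding a bicycle"}],
)
print(response.choices[0].message.content)
```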

Paper Review: Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (andlukyane​.com) . Decentralized RL post-training with SAPO sharing rollouts across a swarm for LM fine-tuning and reward-based learning

lecture three (aarnphm​.xyz) . Lecture three on tokenizers, LLMs, alignment, sparse autoencoders, residual streams, and speculative decoding for efficient inference

assignment three reports. (aarnphm​.xyz) . Discussion of replacing one-hot cross-entropy, 2D GEMMs, batching, tokenization, and optimization techniques for large V vocabularies

Qwen 3 Next (sibellavia​.lol) . Qwen3-Next-80B models with hybrid Gated DeltaNet, ultra-sparse MoE (512 experts), YaRN context up to 1,000,000 tokens, and multi-token prediction

📚 Academic Research

Inpainting-Guided Policy Optimization for Diffusion Large Language Models (arxiv:cs) . Inpainting-guided RL for diffusion LLMs improves exploration, using partial ground-truth reasoning to boost GRPO, with synthetic traces and entropy filtering

Can Understanding and Generation Truly Benefit Together -- or Just Coexist? (arxiv:cs) . Unified multimodal learning: encoder–decoder paradigm with long-context captions, UAE framework, Unified-GRPO RL, and Unified-Bench benchmark

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning (arxiv:cs) . AgentGym-RL trains LLM agents for multi-turn decision making using RL, ScalingInter-RL for exploration-exploitation balance across diverse environments

RewardDance: Reward Scaling in Visual Generation (arxiv:cs) . RewardDance: scalable reward modeling for visual generation using yes-token probability, enabling large RMs and CoT integration

Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining (arxiv:cs) . MuSe uses separate semantic clustering and multipole expansions to accelerate transformer attention with dipole corrections for causal and acausal attention

Customizing the Inductive Biases of Softmax Attention using Structured Matrices (arxiv:cs) . Structured-matrix attention with BTT and MLR boosts high-dimensional input tasks and locality-aware language modeling performance

Recurrence Meets Transformers for Universal Multimodal Retrieval (arxiv:cs) . ReT-2: a unified multimodal retrieval model using recurrent Transformer with LSTM-inspired gating for image-text queries across multimodal documents

D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics (arxiv:cs) . D-LEAF localizes and corrects multimodal LLM hallucinations via Layer Image Attention Entropy and Image Attention Focus for dynamic, efficient inference

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (arxiv:cs) . Noise-resistant data curation with MLLMs and GA-DMS framework for robust cross-modal alignment in WebPerson dataset

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents (arxiv:cs) . Entropy-Modulated Policy Gradients (EMPG) re-calibrate step-wise learning signals for LLM agents in long-horizon tasks to balance uncertainty and outcomes

A Survey of Reinforcement Learning for Large Reasoning Models (arxiv:cs) . Survey of RL methods for reasoning with LLMs/LRMs, focusing on mathematics, coding, scalability, data, and infrastructure toward ASI

Selective Induction Heads: How Transformers Select Causal Structures In Context (arxiv:stat) . Selective Induction Heads enable transformers to choose causal structures and copy past tokens via interleaved Markov chains and lag selection

Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge (arxiv:cs) . Adaptive token merging reduces transformer compute and communication at the edge via data-dependent, training-free redundancy reduction

AgentX: Towards Orchestrating Robust Agentic Workflow Patterns with FaaS-hosted MCP Services (arxiv:cs) . AgentX orchestrates robust agentic workflows with stage designer, planner, and executor agents using MCP tools and FaaS-hosted services

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference (arxiv:cs) . PLENA: hardware-software co-design with asymmetric quantization and FlashAttention for long-context LLM inference, delivering higher utilization and throughput

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness (arxiv:cs) . MCBP: a bit-slice-enabled, memory-efficient LLM inference accelerator exploiting BRCR, BSTC, and BGPP for faster GEMMs and KV caching

Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control? (arxiv:cs) . Vision-driven cross-modal RL using SSD-Mamba2 backbone for end-to-end motion control with proprioceptive and exteroceptive tokens

BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (arxiv:cs) . BcQLM: BreezeCLIP-based lightweight vision-language model for efficient visual question answering with 1.2B parameters

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization (arxiv:cs) . Scalable training of vector-quantized networks with 100% codebook utilization via VQBridge and learning annealing

CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models (arxiv:cs) . Curiosity-driven exploration for RLVR in LLMs using actor perplexity and multi-head critic variance to boost exploration

The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward (arxiv:cs) . Diversity-Preserving Hybrid RL uses mass-covering f-divergences to counter Pass@k degradation in RLVR for LLMs with math and SQL tasks

Visual Representation Alignment for Multimodal Large Language Models (arxiv:cs) . VIRAL aligns internal visual representations of MLLMs with vision foundation models to improve object counting and spatial reasoning

TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making (arxiv:cs) . Thought-Centric Preference Optimization (TCPO) enhances embodied decision-making with stepwise preference learning and Action Policy Consistency in ALFWorld

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search (arxiv:cs) . Mini-o3 scales multi-turn visual search reasoning with deep tool-based exploration and over-turn masking for tens of interaction steps

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (arxiv:cs) . Reinforcement learning framework Parallel-R1 enables parallel thinking for complex reasoning on math benchmarks like MATH, AMC23, AIME

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles (arxiv:cs) . Multimodal Bayesian prompt ensembles improve calibration and accuracy for judging TTI quality with image clustering in MLLMs

Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching (arxiv:cs) . Cluster-Driven Feature Caching (ClusCa) accelerates diffusion transformers by spatially clustering tokens and propagating one token per cluster

ENSI: Efficient Non-Interactive Secure Inference for Large Language Models (arxiv:cs) . ENSI: co-designs cryptographic protocols and BitNet LLM, enabling CKKS-based secure inference with sigmoid attention and RMSNorm bootstrapping

GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection (arxiv:cs) . GAMMA uses multi-task, manipulation-augmented training and reverse cross-attention for robust AI-generated image detection across diverse models

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems (arxiv:cs) . MOAT enables joint alignment tuning for planning and grounding agents to improve coordination in multi-agent LLM systems

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens (arxiv:cs) . Investigates how LLMs perform mental math via late, last-token computation using Context-Aware Mean Ablation and Attention-Based Peeking

Astra: A Multi-Agent System for GPU Kernel Performance Optimization (arxiv:cs) . Astra: an LLM-driven multi-agent system for GPU kernel optimization from SGLang CUDA implementations, achieving 1.32x speedups

Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization (arxiv:cs) . TAM Bench: automated task collection from Kaggle, AIcrowd, Biendata; multi-modal ML tasks; leaderboard-based difficulty; multi-dimensional evaluation

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models (arxiv:cs) . Med3DInsight: pretraining 3D medical encoders with 2D multimodal LLMs via plane-slice transformer and partial optimal transport alignment

Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (arxiv:cs) . SAPO: a decentralized, asynchronous RL post-training method for LMs with shared rollouts across heterogeneous nodes

Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference (arxiv:cs) . MoMA: Mixture of Models and Agents for adaptive routing of LLMs and agents to optimize cost, performance, and efficiency

VARCO-VISION-2.0 Technical Report (arxiv:cs) . VARCO-VISION-2.0: bilingual Korean-English VLM with multi-image understanding, layout-aware OCR, and on-device 1.7B variant

Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia (arxiv:cs) . Compass-v3: a 245B multilingual e-commerce Mixture-of-Experts model with OTPO alignment and GPU-optimized training for Southeast Asia

Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems (arxiv:cs) . Dual knowledge-enhanced two-stage reasoner (DK2R) uses structured attributes and unstructured reviews with LLMs for multimodal dialog response generation

👋 Before you go

I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can. That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:

  • Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
  • First dibs on merch (details still cooking)
  • That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing

If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries—the newsletters keep coming either way, and you can follow along on Patreon for free. Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.

Have an idea for how Blaze could be better? Please visit the feedback form to let us know. To update your preferences, or to unsubscribe, please go to blaze.email/unsubscribe.
