The AI Engineer
Generative AI newsletter • Blaze Email

September 16, 2025 • read online • patreon

📣 Headlines

• Anthropic’s Claude gains memory, an optional incognito mode, and cross‑export features, even as the company faces a potential settlement of roughly $3,000 per pirated work over AI piracy claims.

• Oracle is doubling down on AI hardware and cloud offerings to power models, while OpenAI plans its first AI chip by 2026 and Arm unveils Lumex for on‑device AI, highlighting a broader push on compute at every layer.

• How do AI models generate videos? explains the current stack of latent‑space diffusion, transformers, and frame‑consistency techniques that systems like Sora and Veo 3 use to produce coherent video and audio.

• Ex‑Google X founders raised $5.7M for TwinMind to build a passive “second brain” that captures ambient speech and builds a personal knowledge graph, reflecting the rising focus on background and agentic assistants described in coverage of agentic AI.

• Security teams are shipping autonomous defenses: AegisAI uses agents to inspect and neutralize email threats in real time, Lookout launched Smishing AI for mobile social‑engineering detection, and Miru raised funding for AI‑assisted cyber investigations.

• Policy and oversight debates intensify as California seeks frontier AI transparency reports to mitigate catastrophic risks, the FTC probes AI chatbots used with children, and Sen. Ted Cruz proposes regulatory waivers and a White House sandbox for AI companies.

• AI training infrastructure draws massive valuations: Mercor eyes a $10B+ valuation on a reported $450M run rate, underscoring the demand and capital flowing into model‑training services.

• Browser and device AI updates: Firefox for iOS adds local AI page summarization on newer iPhones, and Firefox will end support for 32‑bit Linux in 2026, signaling shifts toward on‑device AI and platform consolidation.

🔧 Company Engineering Blogs

Jupyter Agents: training LLMs to reason with notebooks (huggingface​.co) . Jupyter Agent builds a data science workflow inside notebooks using Qwen models, scaffolding, QA generation, and E2B execution pipelines

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 2 of 2) (medium​.com/pinterest-engineering) . Deploying EKS clusters, Fluent Bit logging, OTEL metrics pipelines, image management, and a custom Moka UI for Spark on Kubernetes

Accelerating scientific discovery with AI-powered empirical software (research​.google) . Google Research presents an AI-powered system, built on Gemini, that writes, optimizes, and empirically evaluates scientific software across genomics, public health, geospatial analysis, neuroscience, and time-series forecasting

Scientific frontiers of agentic AI (amazon​.science) . Agentic AI explores embedding languages, context, negotiation, common sense, and privacy with embeddings, context windows, and behavioral economics insights

📈 Applied LLMs in the Wild: RAG, Recsys, Pipelines, and Science

How to Train an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs (eugeneyan​.com) . LLM-Recsys hybrid with semantic IDs using RQ-VAE, SASRec, Qwen models; train on Amazon Video Games data; steerable, conversational recommendations
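As a complement to the semantic-ID idea, here is a minimal sketch of residual quantization, the mechanism RQ-VAE uses to turn an item embedding into a short tuple of codebook indices that an LLM can emit as tokens. The codebooks, shapes, and names below are illustrative assumptions, not the post's code (a real RQ-VAE learns its codebooks jointly with an autoencoder).

```python
# Minimal residual-quantization sketch (assumed, illustrative): each stage picks the
# nearest code vector for the residual left over from the previous stage.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, CODEBOOK_SIZE, NUM_LEVELS = 64, 256, 3

# Random codebooks for illustration; an RQ-VAE would learn these.
codebooks = [rng.normal(size=(CODEBOOK_SIZE, EMB_DIM)) for _ in range(NUM_LEVELS)]

def semantic_id(embedding: np.ndarray) -> tuple[int, ...]:
    """Quantize an item embedding into a tuple of codebook indices (its semantic ID)."""
    residual = embedding.copy()
    codes = []
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)  # nearest code vector
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]  # the next level quantizes what's left
    return tuple(codes)

item_embedding = rng.normal(size=EMB_DIM)
print(semantic_id(item_embedding))  # e.g. (17, 203, 88): tokens a recommender LLM can generate
```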

Text analytics in Data Pipelines using AI (medium​.com/@ed​.bullen) . Databricks AI Query workflows for ETL pipelines; using LLMs to classify, rate sentiment, and justify results on Amazon Reviews data

Single-cell analysis and infectious disease forecasting: Google's new AI scientist (blog​.stephenturner​.us) . AI systems generate and test new methods for single-cell RNA-seq batch integration and COVID-19 forecasting, surpassing some benchmarks

Stumbling into AI: Part 3—RAG (rmoff​.net) . Explains Retrieval-Augmented Generation (RAG) using embeddings, vector stores (ChromaDB), Ollama, and Llama models with Kafka release notes as example
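For readers who want the shape of the pipeline, here is a minimal RAG sketch in the same spirit: embed documents into a vector store, retrieve the nearest chunks for a question, and feed them to a local LLM. It assumes the `chromadb` and `ollama` Python packages and a running Ollama server; the model name and sample documents are placeholders, not the article's exact setup.

```python
# Minimal RAG sketch (assumptions noted above), using ChromaDB's default embedder
# for retrieval and a local Ollama chat model for generation.
import chromadb
import ollama

client = chromadb.Client()
collection = client.create_collection("release_notes")
collection.add(
    ids=["kafka-3.7", "kafka-3.8"],
    documents=[
        "Kafka 3.7 adds JBOD support in KRaft mode.",
        "Kafka 3.8 improves the consumer rebalance protocol via KIP-848.",
    ],
)

question = "What changed about consumer rebalancing?"
hits = collection.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

answer = ollama.chat(
    model="llama3.1",  # assumption: any locally pulled chat model works here
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQ: {question}"}],
)
print(answer["message"]["content"])
```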

Beyond the Chatbot: What Actually Works in Enterprise AI (thedataexchange​.media) . RAG systems evolution, evaluation as IP, embeddings, enterprise security, agent workflows, multi-modality, small models, and AI-enabled coding tools

🧭 Agents and Human-in-the-Loop Orchestration

Exploring Active Agent, or can we build AI features the Rails way? (evilmartians​.com) . Rails-style AI abstractions with Active Agent: agents, prompts, callbacks, templates, and battle-tested Rails examples

Lessons learned from a 100 blog posts on AI (frontierai​.substack​.com) . Big-picture AI trends: economics of inference, token costs vs. volume, open-loop agents, evals, data quality, context management, and UX in AI apps

Generalists Can Also Dig Deep (towardsdatascience​.com) . Generalist Ida Silfverskiöld on AI agents, RAG, evals, and design choices in agentic systems

LangGraph 201: Adding Human Oversight to Your Deep Research Agent (towardsdatascience​.com) . LangGraph 201 adds two human-in-the-loop checkpoints to a deep research agent, using interruption patterns, state graphs, and in-memory/DB checkpointers
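To make the pattern concrete without reproducing LangGraph's API, here is a framework-agnostic sketch of a human-in-the-loop checkpoint: run until a review point, hold the state, ask a human to approve, then resume. The state fields, node functions, and `input()` prompt are all illustrative assumptions, not the article's code.

```python
# Plain-Python sketch of the human-in-the-loop checkpoint pattern (not LangGraph's API).
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    plan: str = ""
    approved: bool = False
    report: str = ""
    history: list = field(default_factory=list)

def draft_plan(state: ResearchState) -> ResearchState:
    state.plan = f"1. Search sources about: {state.question}\n2. Summarize findings"
    state.history.append("plan_drafted")
    return state

def human_checkpoint(state: ResearchState) -> ResearchState:
    """Pause here: a real system would persist the state (checkpointer) and resume
    later; input() stands in for the human review step."""
    print("Proposed plan:\n", state.plan)
    state.approved = input("Approve plan? [y/n] ").strip().lower() == "y"
    return state

def run_research(state: ResearchState) -> ResearchState:
    state.report = f"Report for '{state.question}' following:\n{state.plan}"
    state.history.append("research_done")
    return state

state = ResearchState("What changed in LLM inference efficiency in 2025?")
state = draft_plan(state)
state = human_checkpoint(state)                     # checkpoint 1: plan approval
state = run_research(state) if state.approved else state
print(state.report or "Stopped at human checkpoint.")
```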

⚙️ Deterministic and Efficient Inference & Serving

Defeating Nondeterminism in LLM Inference (simonwillison​.net) . Nondeterminism in LLM inference arises mainly from varying load and batch size; paper proposes invariant kernels in PyTorch to achieve determinism
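A toy illustration of the underlying issue (not the paper's code): floating-point addition is non-associative, so kernels that change their reduction order with batch size can return slightly different logits for the identical request. The array size and chunk count below are arbitrary.

```python
# Why batching breaks bitwise determinism: summing the same float32 values in two
# different orders usually disagrees in the last bits.
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10_000).astype(np.float32)

full = np.sum(x)                                         # one reduction order
chunked = sum(np.sum(c) for c in np.array_split(x, 7))   # another order, as if batched

print(full, chunked, full == chunked)  # typically False by a few ULPs
# The proposed fix is "batch-invariant" kernels whose reduction order does not
# depend on how many requests are packed together.
```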

Speculative cascades — A hybrid approach for smarter, faster LLM inference (research​.google) . Speculative cascades combine cascades and speculative decoding with a deferral rule to speed LLM inference and improve cost–quality trade-offs
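Below is a hedged sketch of the deferral half of the idea: a small model drafts each token and a per-token rule decides whether to keep the cheap draft or defer to the large model. The toy "models", confidence scores, and threshold are stand-ins; real speculative cascades also verify drafts against the large model in parallel, which this sketch omits.

```python
# Toy per-token deferral rule (illustrative only): trust the small model when it is
# confident, pay for the large model otherwise.
import random

def small_model(prefix):   # cheap drafter: returns (token, confidence)
    return random.choice(["the", "a", "cat"]), random.uniform(0.3, 1.0)

def large_model(prefix):   # expensive fallback model
    return "the"

def generate(prefix, steps=10, confidence_threshold=0.7):
    tokens, deferred = [], 0
    for _ in range(steps):
        draft, conf = small_model(prefix + " ".join(tokens))
        if conf >= confidence_threshold:     # deferral rule: keep the cheap draft
            tokens.append(draft)
        else:                                # defer to the large model
            tokens.append(large_model(prefix + " ".join(tokens)))
            deferred += 1
    return tokens, deferred

out, n_deferred = generate("Once upon a time")
print(out, f"deferred {n_deferred}/10 tokens to the large model")
```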

The Rise of Multimodal LLMs and Efficient Serving with vLLM (pyimagesearch​.com) . Multimodal LLMs (LLaVA, GPT-4V, BakLLaVA) and vLLM enable OpenAI-compatible vision–language inference and efficient deployment

Defeating Nondeterminism in LLM Inference – Thinking Machines Lab (jmason​.ie) . Defeating nondeterminism in LLM inference by examining sampling, temperature effects, and deterministic behavior across stacks and libraries

Benchmarking AI & ML on local CPU/GPUs: an end-to-end Python project (allaboutdata​.substack​.com) . Benchmarking AI/ML on local CPU/GPU with Python: XGBoost, Ollama, CUDA, uv, Altair, Streamlit dashboard and Docker-free workflow

Text embedding inference on a cheap server (paulw​.tokyo) . Benchmarking text embedding inference on cheap CPUs using Granite-embedding-107m-multilingual-GGUF and EmbeddingGemma with llama-server and wrk load testing
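As a rough stand-in for the post's wrk benchmark, here is a small Python loop against llama-server's OpenAI-compatible /v1/embeddings endpoint. It assumes llama-server is running locally with an embedding GGUF model and embeddings enabled; the host, port, and sample texts are assumptions.

```python
# Crude throughput check against a local llama-server embeddings endpoint
# (assumptions noted above; the post uses wrk for proper load testing).
import time
import requests

URL = "http://localhost:8080/v1/embeddings"
texts = ["hello world", "cheap CPU inference", "多言語の埋め込み"] * 50

start = time.perf_counter()
for text in texts:
    resp = requests.post(URL, json={"input": text})
    resp.raise_for_status()
    vec = resp.json()["data"][0]["embedding"]
elapsed = time.perf_counter() - start

print(f"{len(texts)} requests in {elapsed:.2f}s "
      f"({len(texts) / elapsed:.1f} req/s), dim={len(vec)}")
```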

🧪 Architectures, Alignment, and Post-Training

Qwen3-Next-80B-A3B: 🐧🦩 Who needs legs?! (simonwillison​.net) . Qwen3-Next-80B-A3B-Instruct and Thinking models; 80B with 3B active per round; OpenRouter deployment; llm-openrouter plugin; pelican SVG prompt; performance claims
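For those not using the llm CLI, a minimal sketch of calling the model through OpenRouter's OpenAI-compatible API follows; the model slug is an assumption (check OpenRouter's catalog), and the pelican-on-a-bicycle prompt is the benchmark mentioned in the post.

```python
# Hedged sketch: Qwen3-Next via OpenRouter's OpenAI-compatible endpoint.
# Requires OPENROUTER_API_KEY in the environment; model slug is an assumption.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-instruct",  # assumed slug
    messages=[{"role": "user",
               "content": "Generate an SVG of a pelican riding a bicycle"}],
)
print(response.choices[0].message.content)
```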

Paper Review: Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (andlukyane​.com) . Decentralized RL post-training with SAPO sharing rollouts across a swarm for LM fine-tuning and reward-based learning

lecture three (aarnphm​.xyz) . Lecture three on tokenizers, LLMs, alignment, sparse autoencoders, residual streams, and speculative decoding for efficient inference

assignment three reports. (aarnphm​.xyz) . Discussion of replacing one-hot cross-entropy, 2D GEMMs, batching, tokenization, and optimization techniques for large V vocabularies

Qwen 3 Next (sibellavia​.lol) . Qwen3-Next-80B models with hybrid Gated DeltaNet, ultra-sparse MoE (512 experts), YaRN context up to 1,000,000 tokens, and multi-token prediction

📚 Academic Research

Inpainting-Guided Policy Optimization for Diffusion Large Language Models (arxiv:cs) . Inpainting-guided RL for diffusion LLMs improves exploration, using partial ground-truth reasoning to boost GRPO, with synthetic traces and entropy filtering

Can Understanding and Generation Truly Benefit Together -- or Just Coexist? (arxiv:cs) . Unified multimodal learning: encoder–decoder paradigm with long-context captions, UAE framework, Unified-GRPO RL, and Unified-Bench benchmark

AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning (arxiv:cs) . AgentGym-RL trains LLM agents for multi-turn decision making using RL, ScalingInter-RL for exploration-exploitation balance across diverse environments

RewardDance: Reward Scaling in Visual Generation (arxiv:cs) . RewardDance: scalable reward modeling for visual generation using yes-token probability, enabling large RMs and CoT integration

Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining (arxiv:cs) . MuSe uses separate semantic clustering and multipole expansions to accelerate transformer attention with dipole corrections for causal and acausal attention

Customizing the Inductive Biases of Softmax Attention using Structured Matrices (arxiv:cs) . Structured-matrix attention with BTT and MLR boosts high-dimensional input tasks and locality-aware language modeling performance

Recurrence Meets Transformers for Universal Multimodal Retrieval (arxiv:cs) . ReT-2: a unified multimodal retrieval model using recurrent Transformer with LSTM-inspired gating for image-text queries across multimodal documents

D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics (arxiv:cs) . D-LEAF localizes and corrects multimodal LLM hallucinations via Layer Image Attention Entropy and Image Attention Focus for dynamic, efficient inference

Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (arxiv:cs) . Noise-resistant data curation with MLLMs and GA-DMS framework for robust cross-modal alignment in WebPerson dataset

Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents (arxiv:cs) . Entropy-Modulated Policy Gradients (EMPG) re-calibrate step-wise learning signals for LLM agents in long-horizon tasks to balance uncertainty and outcomes

A Survey of Reinforcement Learning for Large Reasoning Models (arxiv:cs) . Survey of RL methods for reasoning with LLMs/LRMs, focusing on mathematics, coding, scalability, data, and infrastructure toward ASI

Selective Induction Heads: How Transformers Select Causal Structures In Context (arxiv:stat) . Selective Induction Heads enable transformers to choose causal structures and copy past tokens via interleaved Markov chains and lag selection

Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge (arxiv:cs) . Adaptive token merging reduces transformer compute and communication at the edge via data-dependent, training-free redundancy reduction

AgentX: Towards Orchestrating Robust Agentic Workflow Patterns with FaaS-hosted MCP Services (arxiv:cs) . AgentX orchestrates robust agentic workflows with stage designer, planner, and executor agents using MCP tools and FaaS-hosted services

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference (arxiv:cs) . PLENA: hardware-software co-design with asymmetric quantization and FlashAttention for long-context LLM inference, delivering higher utilization and throughput

MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness (arxiv:cs) . MCBP: a bit-slice-enabled, memory-efficient LLM inference accelerator exploiting BRCR, BSTC, and BGPP for faster GEMMs and KV caching

Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control? (arxiv:cs) . Vision-driven cross-modal RL using SSD-Mamba2 backbone for end-to-end motion control with proprioceptive and exteroceptive tokens

BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion (arxiv:cs) . BcQLM: BreezeCLIP-based lightweight vision-language model for efficient visual question answering with 1.2B parameters

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization (arxiv:cs) . Scalable training of vector-quantized networks with 100% codebook utilization via VQBridge and learning annealing

CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models (arxiv:cs) . Curiosity-driven exploration for RLVR in LLMs using actor perplexity and multi-head critic variance to boost exploration

The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward (arxiv:cs) . Diversity-Preserving Hybrid RL uses mass-covering f-divergences to counter Pass@k degradation in RLVR for LLMs with math and SQL tasks

Visual Representation Alignment for Multimodal Large Language Models (arxiv:cs) . VIRAL aligns internal visual representations of MLLMs with vision foundation models to improve object counting and spatial reasoning

TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making (arxiv:cs) . Thought-Centric Preference Optimization (TCPO) enhances embodied decision-making with stepwise preference learning and Action Policy Consistency in ALFWorld

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search (arxiv:cs) . Mini-o3 scales multi-turn visual search reasoning with deep tool-based exploration and over-turn masking for tens of interaction steps

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (arxiv:cs) . Reinforcement learning framework Parallel-R1 enables parallel thinking for complex reasoning on math benchmarks like MATH, AMC23, AIME

Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles (arxiv:cs) . Multimodal Bayesian prompt ensembles improve calibration and accuracy for judging TTI quality with image clustering in MLLMs

Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching (arxiv:cs) . Cluster-Driven Feature Caching (ClusCa) accelerates diffusion transformers by spatially clustering tokens and propagating one token per cluster

ENSI: Efficient Non-Interactive Secure Inference for Large Language Models (arxiv:cs) . ENSI: co-designs cryptographic protocols and BitNet LLM, enabling CKKS-based secure inference with sigmoid attention and RMSNorm bootstrapping

GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection (arxiv:cs) . GAMMA uses multi-task, manipulation-augmented training and reverse cross-attention for robust AI-generated image detection across diverse models

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems (arxiv:cs) . MOAT enables joint alignment tuning for planning and grounding agents to improve coordination in multi-agent LLM systems

All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens (arxiv:cs) . Investigates how LLMs perform mental math via late, last-token computation using Context-Aware Mean Ablation and Attention-Based Peeking

Astra: A Multi-Agent System for GPU Kernel Performance Optimization (arxiv:cs) . Astra: an LLM-driven multi-agent system for GPU kernel optimization from SGLang CUDA implementations, achieving 1.32x speedups

Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization (arxiv:cs) . TAM Bench: automated task collection from Kaggle, AIcrowd, Biendata; multi-modal ML tasks; leaderboard-based difficulty; multi-dimensional evaluation

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models (arxiv:cs) . Med3DInsight: pretraining 3D medical encoders with 2D multimodal LLMs via plane-slice transformer and partial optimal transport alignment

Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (arxiv:cs) . SAPO: a decentralized, asynchronous RL post-training method for LMs with shared rollouts across heterogeneous nodes

Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference (arxiv:cs) . MoMA: Mixture of Models and Agents for adaptive routing of LLMs and agents to optimize cost, performance, and efficiency

VARCO-VISION-2.0 Technical Report (arxiv:cs) . VARCO-VISION-2.0: bilingual Korean-English VLM with multi-image understanding, layout-aware OCR, and on-device 1.7B variant

Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia (arxiv:cs) . Compass-v3: a 245B multilingual e-commerce Mixture-of-Experts model with OTPO alignment and GPU-optimized training for Southeast Asia

Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems (arxiv:cs) . Dual knowledge-enhanced two-stage reasoner (DK2R) uses structured attributes and unstructured reviews with LLMs for multimodal dialog response generation

👋 Before you go

I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can. That's why I'm launching a Patreon page! Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month. In return, you'll get:

  • Real say in how Blaze evolves — vote on new topics, features, topic curation ideas
  • First dibs on merch (details still cooking)
  • That warm fuzzy feeling knowing you're supporting something that saves you time and keeps you plugged into great tech writing

If you are getting value from Blaze, checking this out would mean the world. And if you can't contribute, no worries—the newsletters keep coming either way, and you can follow along on Patreon for free. Thanks for reading and being part of this nerdy corner of the internet. All the best - Alastair.

Have an idea for how Blaze could be better? Please visit the feedback form to let us know. To update your preferences, or to unsubscribe, please go to blaze.email/unsubscribe.
