The AI Engineer 12-11-2025
cloud-native AI, Snapchat's AI integration, Siri's temporary brain rental
📣 Headlines
• Open-source and cloud-native forces are reshaping AI stacks, driving adoption of open-weight models, agents, OpenFGA, and infrastructure orchestration; Moonshot’s Kimi K2 agent showcases step-by-step reasoning and heavy tool use, while vibe coding and context engineering are changing developer workflows.
• Perplexity and Snap struck a $400M deal for Perplexity to power Snapchat’s conversational search and My AI integration in a 2026 rollout, embedding generative search across Snapchat’s user base (report).
• Apple will temporarily rent a Gemini-powered brain for Siri at about $1B/year, combining on-device and cloud processing as a stopgap while it develops its own models.
• OpenAI urged extending CHIPS Act tax credits to AI data centers and server makers to support a proposed $500B AI build plan, seeking a 35% AMIC-style incentive for AI infrastructure.
• AI factories require massive CapEx and face a decade-long J-curve, with data, power and talent constraints tempering near-term returns despite multitrillion-dollar potential.
• Major security vendors rolled out AI-focused defenses and tooling — Fortinet, SentinelOne and CrowdStrike announced new AI security features while Sysdig enhanced Falco for real-time threat detection and forensic capture.
• Oracle’s Autonomous AI Lakehouse pairs Exadata performance with Apache Iceberg openness for multicloud analytics and AI-ready data management.
• Anthropic opened new European offices and appointed Pip White head of EMEA North as it expands enterprise operations and regional presence.
🔧 Company Engineering Blogs
1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Part 1) (engineering.atspotify.com). Spotify uses Fleet Management with AI coding agents to automate cross-repo migrations, Java/POM updates, YAML/JSON configs, and UI migrations
Video Invisible Watermarking at Scale (engineering.fb.com). Meta discusses scalable invisible watermarking for video using CPU-first pipelines, FFmpeg filters, and frame selection to balance quality, detection accuracy, and BD-Rate
How Cursor AI Slashed Dashboard Migration Time 75% Across 240 Queries (engineering.salesforce.com). Cursor with an MCP server cut migration time by 75% while moving 240 queries across 20 dashboards from Splunk to Tableau
A Decade of AI Platform at Pinterest (medium.com/pinterest-engineering). A decade-long look at Pinterest’s unified ML platform, covering Linchpin, Scorpion, EzFlow, Galaxy, UFR, MLEnv, TabularML, and GPU-centric innovations across GPUs, Ray, and large embedding models
Introducing Nested Learning: A new ML paradigm for continual learning (research.google). Nested Learning proposes multi-level optimization to tackle catastrophic forgetting, introducing CMS memory and the Hope architecture
🔓 Open Models & Architectures
Kimi K2 Thinking (simonwillison.net). Kimi K2 Thinking: Moonshot’s 1T-parameter agentic model with INT4 quantization, tool use, and benchmark leadership
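K2's headline INT4 weights are easy to picture with a toy. A minimal NumPy sketch of symmetric per-row INT4 weight quantization; illustrative only, not Moonshot's actual quantization-aware training pipeline:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-row INT4: map each row's floats to integers in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per row
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Storage drops from 32 (or 16) bits per weight to 4 plus one scale per row; the reconstruction error stays below half a quantization step per element.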
Beyond Standard LLMs (magazine.sebastianraschka.com). Explores linear attention hybrids, text diffusion, code world models, and small recursive transformers with Qwen3-Next, MiniMax-M2, Kimi Linear, DeepSeek, and related architectures in open-weight LLMs
5 Thoughts on Kimi K2 Thinking (interconnects.ai). Open model Kimi K2 Thinking from Moonshot AI: a 1T-parameter MoE with 32B active params, 256K context, INT4 post-training, and tool use, plus benchmarks and China AI-rise insights by Nathan Lambert
Reflection (alexpolozov.com). Polozov discusses AI as a platform, open Western models, and open frontier-model science, highlighting Reflection, Windsurf, Cursor, and 2025–2026 open-model goals
🤖 Agents & RL
An ARENA 6.0 Capstone: Model Organism of Encoded Reasoning (lesswrong.com). ARENA 6.0 capstone using RL on Qwen-3-4B to study encoded reasoning in GSM8K math CoTs with judge models, toxicity signals, and RL challenges
Introspection or confusion? (lesswrong.com). Explores introspection vs. confusion in LLMs; reproduces Anthropic-style experiments with steering vectors; analyzes control questions and noise across Mistral, Qwen, and Llama models
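The steering-vector setup in these experiments is simple to sketch: a direction (often the difference of mean activations between two prompt sets) is added to a layer's hidden states at inference time. A NumPy toy, not the post's actual replication code:

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0):
    """Add a scaled, unit-norm steering vector to every token position."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 64))   # 5 token positions, hidden size 64
v = rng.standard_normal(64)        # e.g. mean(acts_concept) - mean(acts_baseline)
h_steered = steer(h, v)
shift = np.linalg.norm(h_steered - h, axis=-1)  # every position moves by alpha
```

In a real model this addition happens inside a forward hook at a chosen layer; the control-question analyses in the post then check whether the model's self-reports track the injected direction or just the added noise.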
RL Learning with LoRA: A Diverse Deep Dive (kalomaze.bearblog.dev). RL training with LoRA for SFT and RL finetuning in prime-rl using rsLoRA scaling and multi-environment experiments
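The rsLoRA detail is a one-line change worth seeing: the adapter output is scaled by alpha/sqrt(r) instead of the classic alpha/r, so larger ranks are not implicitly down-weighted. A minimal sketch (shapes and alpha are illustrative, not prime-rl's configuration):

```python
import numpy as np

def lora_scale(alpha: float, rank: int, rs: bool = True) -> float:
    # rsLoRA: alpha / sqrt(r); classic LoRA: alpha / r
    return alpha / np.sqrt(rank) if rs else alpha / rank

def lora_forward(x, W, A, B, alpha: float = 16.0, rs: bool = True):
    """Frozen base weight W plus a trainable low-rank update A @ B."""
    rank = A.shape[1]
    return x @ W + lora_scale(alpha, rank, rs) * (x @ A @ B)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
W = rng.standard_normal((8, 8))      # frozen
A = rng.standard_normal((8, 4)) * 0.01   # rank-4 adapter, small init
B = np.zeros((4, 8))                     # B = 0, so the delta is zero at init
y0 = lora_forward(x, W, A, B)
```

At rank 64 with alpha = 16, classic scaling multiplies the update by 0.25 while rsLoRA uses 2.0, which is the stability-at-high-rank argument in a nutshell.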
#299 Jacob Buckman: Why the Future of AI Won't Be Built on Transformers (aneyeonai.libsyn.com). Explores memory, long-context challenges, and retrofitting transformers with state-space models and Power Retention by Manifest AI in AI agent workflows
Building Metacognitive AI Agents: A Complete Guide from Theory to Production (rewire.it). Metacognitive AI agents with dual-loop Actor-Critic design, Reflexion patterns, and LangGraph-driven production-ready workflows
Language as a Universal Interface for Reinforcement Learning Agents (richardli.xyz). Language as a universal interface for RL agents, detailing a formal framework, memory compression, and two-layer thought-action generation using LLMs
📚 Retrieval & RAG
Essential Chunking Techniques for Building Better LLM Applications (machinelearningmastery.com). Chunking techniques for LLMs: fixed-size to hierarchical, semantic, LLM- and agent-based strategies for better retrieval and generation
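The baseline every fancier strategy is measured against is fixed-size chunking with overlap, where consecutive chunks share a margin so text cut at a boundary survives whole in at least one chunk. A self-contained sketch (sizes are illustrative):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking; each chunk repeats the last
    `overlap` characters of its predecessor."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

doc = "abcdefghij" * 30                      # 300-character stand-in document
chunks = chunk_text(doc, size=200, overlap=50)
```

Semantic and hierarchical variants replace the fixed stride with sentence or section boundaries, but keep the same overlap idea.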
How I'm Building a Context-Aware Retriever to Boost RAG Quality (Part 3: Evaluation) (egpivo.github.io). Context-aware retriever for RAG quality using multi-stage retrieval, GPT-4o, DSPy, and a ContractNLI dataset
How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP) (towardsdatascience.com). Binary, order-aware metrics MRR and AP for evaluating retrieval in RAG pipelines using Python implementations
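Both metrics fit in a few lines; a straightforward implementation (the article's own code may differ in detail):

```python
def reciprocal_rank(relevant: set, ranked: list) -> float:
    """1 / rank of the first relevant result, 0 if none retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(relevant: set, ranked: list) -> float:
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

queries = [
    ({"d2"}, ["d1", "d2", "d3"]),        # first hit at rank 2 -> RR = 0.5
    ({"d1", "d3"}, ["d1", "d2", "d3"]),  # AP = (1/1 + 2/3) / 2
]
mrr = sum(reciprocal_rank(rel, rk) for rel, rk in queries) / len(queries)
```

MRR only cares about the first hit, which suits RAG pipelines that feed a single top chunk to the LLM; AP rewards ranking all relevant chunks early.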
Multi-Agent SQL Assistant, Part 2: Building a RAG Manager (towardsdatascience.com). RAG strategies with Keyword, FAISS, and Chroma for a SQL assistant using Python, SQLite, and LLMs
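The keyword strategy for a SQL assistant can be as simple as term overlap between the user's question and each table's schema text. A toy sketch with hypothetical schemas (not the article's implementation, which also wires in FAISS and Chroma backends):

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document (bag-of-words)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# hypothetical table schemas serving as the retrieval corpus
docs = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total REAL)",
    "users": "CREATE TABLE users (id INT, name TEXT, email TEXT)",
}
query = "total spent per customer"
best = max(docs, key=lambda name: keyword_score(query, docs[name]))
```

The retrieved schema is then placed in the LLM prompt so generated SQL references real columns; vector backends replace `keyword_score` with embedding similarity.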
⚙️ Systems & Deployment
Deterministic Inference: The Latency Tax (akashbajwa.co). Latency vs. determinism in AI inference; diffusion models, floating-point nondeterminism, RAG/guardrails, and enterprise AI workloads
Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (aarnphm.xyz). Guided decoding and speculative decoding combine to enable CPU-GPU collaboration for LLM inference using TensorRT-LLM and XGrammar with CUDA graphs
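The speculative half of that combination reduces to a draft-then-verify loop. A toy greedy variant (real systems like TensorRT-LLM verify with batched logits and stochastic acceptance; the "models" here are stand-in lambdas):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft model proposes k tokens; target accepts the longest prefix it
    agrees with, then appends one token of its own."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:   # one target call verifies one draft token
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token after divergence
    return accepted

# toy "models": draft always increments; target does too until it sees a 3
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
out = speculative_step(draft, target, [0])  # -> [1, 2, 3, 0]
```

When the draft agrees with the target, several tokens land per expensive target pass; guided decoding then constrains which tokens are even proposable, which is where the grammar work on the CPU comes in.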
The Untold System Design Problem in LLM Inference (dylanhuang.com). Variable-shape LLM inference, KV caches, MoE routing, quantization, and adaptive scheduling for interactive and batch workloads
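The KV-cache pressure behind these scheduling problems is plain arithmetic: two tensors (K and V) per layer, per KV head, per token. A back-of-envelope helper with hypothetical model dimensions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# e.g. a hypothetical 32-layer model, 8 KV heads of dim 128, 32K context, fp16
size_gib = kv_cache_bytes(32, 8, 128, 32768) / 2**30  # 4.0 GiB per sequence
```

Four gibibytes per in-flight 32K-token sequence is why variable sequence lengths, paging, and quantized caches dominate inference system design.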
How neuroscientists are using AI (thetransmitter.org). Eight neuroscientists describe using large language models to analyze literature, brainstorm hypotheses, code, and interpret spatial genomics and neural data
🔬 Theory & Foundations
The Principles of Diffusion Models (arxiv.org). Fundamentals of diffusion models: training, sampling, and theoretical insights into the underlying diffusion processes
The Spacetime of Large Language Models (medium.com/data-science-collective). Geometric view of Transformers: curvature, parallel transport, and QKV attention in language models
Large Language Models and Emergence: A Complex Systems Perspective (rewire.it). Explores whether LLMs exhibit genuine emergence or measurement artifacts using complexity science, KI/KO distinctions, and mechanistic interpretability, with figures and studies by Krakauer, Li, and Wei
📚 Academic Research
Whisper Leak: a side-channel attack on Large Language Models (arxiv:cs). Demonstrates a side‑channel attack that infers prompt topics from encrypted LLM streaming metadata (packet sizes/timing), exposing severe privacy risks and evaluating partial mitigations—critical for deployment security
Attention and Compression is all you need for Controllably Efficient Language Models (arxiv:cs). Introduces CAT: chunk compression + dense attention to trade quality for compute. Adaptive single model supports run-time compute budgeting—improves speed/memory for long-context and production LLMs
Apriel-H1: Towards Efficient Enterprise Reasoning Models (arxiv:cs). Presents Apriel‑H1, a distilled hybrid SSM–Transformer replacing attention with Mamba blocks for 15B models; yields ~2× inference throughput in production with minimal reasoning loss—enterprise impact
PerfDojo: Automated ML Library Generation for Heterogeneous Architectures (arxiv:cs). PerfLLM/PerfDojo leverages LLMs and RL to auto-optimize ML kernels across CPUs/GPUs/accelerators, improving performance portability—highly relevant for ML systems engineers and deployment optimization
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks (arxiv:cs). Proposes FedAttn: distributed self‑attention where participants compute local attention and aggregate KV matrices, enabling private, communication‑efficient collaborative LLM inference with theoretical tradeoffs for edge deployments
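The KV-aggregation idea in FedAttn rests on a standard identity: softmax attention over concatenated keys and values equals a denominator-weighted merge of each participant's local attention. A NumPy sketch of that merge under my reading of the abstract (the paper's actual protocol also handles privacy and communication costs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def local_attention(q, k, v):
    # Each participant returns an unnormalized numerator and the softmax
    # denominator, so a coordinator can merge without the raw context.
    e = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    return e @ v, e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal((4, d))                                    # queries
k1, v1 = rng.standard_normal((6, d)), rng.standard_normal((6, d))  # participant 1
k2, v2 = rng.standard_normal((6, d)), rng.standard_normal((6, d))  # participant 2

n1, z1 = local_attention(q, k1, v1)
n2, z2 = local_attention(q, k2, v2)
merged = (n1 + n2) / (z1 + z2)   # == attention over the concatenated KV

out = attention(q, np.concatenate([k1, k2]), np.concatenate([v1, v2]))
```

Only the small per-query numerators and denominators cross the network, which is the communication-efficiency argument; production code would use the log-sum-exp form for numerical safety.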