The AI Engineer 12-11-2025
cloud-native AI, Snapchat's AI integration, Siri's temporary brain rental
📣 Headlines
• Open-source and cloud-native forces are reshaping AI stacks, driving adoption of open-weight models, agents, OpenFGA, and infrastructure orchestration; Moonshot’s Kimi K2 agent showcases step-by-step reasoning and heavy tool use, while vibe coding and context engineering are changing developer workflows.
• Perplexity and Snap struck a $400M deal for Perplexity to power Snapchat’s conversational search and My AI integration in a 2026 rollout, embedding generative search across Snapchat’s user base (report).
• Apple will temporarily rent a Gemini-powered brain for Siri at about $1B/year, combining on-device and cloud processing as a stopgap while it develops its own models.
• OpenAI urged extending CHIPS Act tax credits to AI data centers and server makers to support a proposed $500B AI build plan, seeking a 35% AMIC-style incentive for AI infrastructure.
• AI factories require massive CapEx and face a decade-long J-curve, with data, power and talent constraints tempering near-term returns despite multitrillion-dollar potential.
• Major security vendors rolled out AI-focused defenses and tooling — Fortinet, SentinelOne and CrowdStrike announced new AI security features while Sysdig enhanced Falco for real-time threat detection and forensic capture.
• Oracle’s Autonomous AI Lakehouse pairs Exadata performance with Apache Iceberg openness for multicloud analytics and AI-ready data management.
• Anthropic opened new European offices and appointed Pip White head of EMEA North as it expands enterprise operations and regional presence.
🔧 Company Engineering Blogs
1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Part 1) (engineering.atspotify.com). Spotify uses Fleet Management with AI coding agents to automate cross-repo migrations, Java/POM updates, YAML/JSON configs, and UI migrations
Video Invisible Watermarking at Scale (engineering.fb.com). Meta discusses scalable invisible watermarking for video using CPU-first pipelines, FFmpeg filters, and frame selection to balance quality, detection accuracy, and BD-Rate
How Cursor AI Slashed Dashboard Migration Time 75% Across 240 Queries (engineering.salesforce.com). Cursor with an MCP server cut migration time by 75% while moving 240 queries across 20 dashboards from Splunk to Tableau
A Decade of AI Platform at Pinterest (medium.com/pinterest-engineering). A decade-long look at Pinterest’s unified ML platform, covering Linchpin, Scorpion, EzFlow, Galaxy, UFR, MLEnv, TabularML, and GPU-centric innovations across GPUs, Ray, and large embedding models
Introducing Nested Learning: A new ML paradigm for continual learning (research.google). Nested Learning proposes multi-level optimization to tackle catastrophic forgetting, introducing CMS memory and the Hope architecture
🔓 Open Models & Architectures
Kimi K2 Thinking (simonwillison.net). Kimi K2 Thinking: Moonshot’s 1T-parameter agentic model with INT4 quantization, tool use, and benchmark leadership
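K2's headline INT4 weights are easy to picture with a toy. A minimal NumPy sketch of symmetric per-row INT4 weight quantization; illustrative only, not Moonshot's actual quantization-aware training pipeline:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-row INT4: map each row's floats to integers in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # one scale per row
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
err = np.abs(w - w_hat).max()  # bounded by half a quantization step
```

Storage drops from 32 (or 16) bits per weight to 4 plus one scale per row; the reconstruction error stays below half a quantization step per element.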
Beyond Standard LLMs (magazine.sebastianraschka.com). Explores linear attention hybrids, text diffusion, code world models, and small recursive transformers with Qwen3-Next, MiniMax-M2, Kimi Linear, DeepSeek, and related architectures in open-weight LLMs
5 Thoughts on Kimi K2 Thinking (interconnects.ai). Open model Kimi K2 Thinking from Moonshot AI: a 1T-parameter MoE with 32B active params, 256K context, INT4 post-training, and tool use, plus benchmarks and China AI-rise insights by Nathan Lambert
Reflection (alexpolozov.com). Polozov discusses AI as a platform, open Western models, and open frontier-model science, highlighting Reflection, Windsurf, Cursor, and 2025–2026 open-model goals
🤖 Agents & RL
An ARENA 6.0 Capstone: Model Organism of Encoded Reasoning (lesswrong.com). ARENA 6.0 capstone using RL on Qwen-3-4B to study encoded reasoning in GSM8K math CoTs with judge models, toxicity signals, and RL challenges
Introspection or confusion? (lesswrong.com). Explores introspection vs. confusion in LLMs; reproduces Anthropic-style experiments with steering vectors; analyzes control questions and noise across Mistral, Qwen, and Llama models
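The steering-vector setup in these experiments is simple to sketch: a direction (often the difference of mean activations between two prompt sets) is added to a layer's hidden states at inference time. A NumPy toy, not the post's actual replication code:

```python
import numpy as np

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0):
    """Add a scaled, unit-norm steering vector to every token position."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 64))   # 5 token positions, hidden size 64
v = rng.standard_normal(64)        # e.g. mean(acts_concept) - mean(acts_baseline)
h_steered = steer(h, v)
shift = np.linalg.norm(h_steered - h, axis=-1)  # every position moves by alpha
```

In a real model this addition happens inside a forward hook at a chosen layer; the control-question analyses in the post then check whether the model's self-reports track the injected direction or just the added noise.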
RL Learning with LoRA: A Diverse Deep Dive (kalomaze.bearblog.dev). RL training with LoRA for SFT and RL finetuning in prime-rl using rsLoRA scaling and multi-environment experiments
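The rsLoRA detail is a one-line change worth seeing: the adapter output is scaled by alpha/sqrt(r) instead of the classic alpha/r, so larger ranks are not implicitly down-weighted. A minimal sketch (shapes and alpha are illustrative, not prime-rl's configuration):

```python
import numpy as np

def lora_scale(alpha: float, rank: int, rs: bool = True) -> float:
    # rsLoRA: alpha / sqrt(r); classic LoRA: alpha / r
    return alpha / np.sqrt(rank) if rs else alpha / rank

def lora_forward(x, W, A, B, alpha: float = 16.0, rs: bool = True):
    """Frozen base weight W plus a trainable low-rank update A @ B."""
    rank = A.shape[1]
    return x @ W + lora_scale(alpha, rank, rs) * (x @ A @ B)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
W = rng.standard_normal((8, 8))      # frozen
A = rng.standard_normal((8, 4)) * 0.01   # rank-4 adapter, small init
B = np.zeros((4, 8))                     # B = 0, so the delta is zero at init
y0 = lora_forward(x, W, A, B)
```

At rank 64 with alpha = 16, classic scaling multiplies the update by 0.25 while rsLoRA uses 2.0, which is the stability-at-high-rank argument in a nutshell.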
#299 Jacob Buckman: Why the Future of AI Won't Be Built on Transformers (aneyeonai.libsyn.com). Explores memory, long-context challenges, and retrofitting transformers with state-space models and Power Retention by Manifest AI in AI agent workflows
Building Metacognitive AI Agents: A Complete Guide from Theory to Production (rewire.it). Metacognitive AI agents with dual-loop Actor-Critic design, Reflexion patterns, and LangGraph-driven production-ready workflows
Language as a Universal Interface for Reinforcement Learning Agents (richardli.xyz). Language as a universal interface for RL agents, detailing a formal framework, memory compression, and two-layer thought-action generation using LLMs
📚 Retrieval & RAG
Essential Chunking Techniques for Building Better LLM Applications (machinelearningmastery.com). Chunking techniques for LLMs: fixed-size to hierarchical, semantic, LLM- and agent-based strategies for better retrieval and generation
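The baseline every fancier strategy is measured against is fixed-size chunking with overlap, where consecutive chunks share a margin so text cut at a boundary survives whole in at least one chunk. A self-contained sketch (sizes are illustrative):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking; each chunk repeats the last
    `overlap` characters of its predecessor."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

doc = "abcdefghij" * 30                      # 300-character stand-in document
chunks = chunk_text(doc, size=200, overlap=50)
```

Semantic and hierarchical variants replace the fixed stride with sentence or section boundaries, but keep the same overlap idea.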
How I'm Building a Context-Aware Retriever to Boost RAG Quality (Part 3: Evaluation) (egpivo.github.io). Context-aware retriever for RAG quality using multi-stage retrieval, GPT-4o, DSPy, and a ContractNLI dataset
How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP) (towardsdatascience.com). Binary, order-aware metrics MRR and AP for evaluating retrieval in RAG pipelines using Python implementations
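Both metrics fit in a few lines; a straightforward implementation (the article's own code may differ in detail):

```python
def reciprocal_rank(relevant: set, ranked: list) -> float:
    """1 / rank of the first relevant result, 0 if none retrieved."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def average_precision(relevant: set, ranked: list) -> float:
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

queries = [
    ({"d2"}, ["d1", "d2", "d3"]),        # first hit at rank 2 -> RR = 0.5
    ({"d1", "d3"}, ["d1", "d2", "d3"]),  # AP = (1/1 + 2/3) / 2
]
mrr = sum(reciprocal_rank(rel, rk) for rel, rk in queries) / len(queries)
```

MRR only cares about the first hit, which suits RAG pipelines that feed a single top chunk to the LLM; AP rewards ranking all relevant chunks early.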
Multi-Agent SQL Assistant, Part 2: Building a RAG Manager (towardsdatascience.com). RAG strategies with Keyword, FAISS, and Chroma for a SQL assistant using Python, SQLite, and LLMs
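The keyword strategy for a SQL assistant can be as simple as term overlap between the user's question and each table's schema text. A toy sketch with hypothetical schemas (not the article's implementation, which also wires in FAISS and Chroma backends):

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document (bag-of-words)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# hypothetical table schemas serving as the retrieval corpus
docs = {
    "orders": "CREATE TABLE orders (id INT, customer_id INT, total REAL)",
    "users": "CREATE TABLE users (id INT, name TEXT, email TEXT)",
}
query = "total spent per customer"
best = max(docs, key=lambda name: keyword_score(query, docs[name]))
```

The retrieved schema is then placed in the LLM prompt so generated SQL references real columns; vector backends replace `keyword_score` with embedding similarity.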
⚙️ Systems & Deployment
Deterministic Inference: The Latency Tax (akashbajwa.co). Latency vs. determinism in AI inference; diffusion models, floating-point nondeterminism, RAG/guardrails, and enterprise AI workloads
Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (aarnphm.xyz). Guided decoding and speculative decoding combine to enable CPU-GPU collaboration for LLM inference using TensorRT-LLM and XGrammar with CUDA graphs
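The speculative half of that combination reduces to a draft-then-verify loop. A toy greedy variant (real systems like TensorRT-LLM verify with batched logits and stochastic acceptance; the "models" here are stand-in lambdas):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Draft model proposes k tokens; target accepts the longest prefix it
    agrees with, then appends one token of its own."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target_next(ctx) == t:   # one target call verifies one draft token
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token after divergence
    return accepted

# toy "models": draft always increments; target does too until it sees a 3
draft = lambda ctx: ctx[-1] + 1
target = lambda ctx: ctx[-1] + 1 if ctx[-1] < 3 else 0
out = speculative_step(draft, target, [0])  # -> [1, 2, 3, 0]
```

When the draft agrees with the target, several tokens land per expensive target pass; guided decoding then constrains which tokens are even proposable, which is where the grammar work on the CPU comes in.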
The Untold System Design Problem in LLM Inference (dylanhuang.com). Variable-shape LLM inference, KV caches, MoE routing, quantization, and adaptive scheduling for interactive and batch workloads
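The KV-cache pressure behind these scheduling problems is plain arithmetic: two tensors (K and V) per layer, per KV head, per token. A back-of-envelope helper with hypothetical model dimensions:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Per-sequence KV cache: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# e.g. a hypothetical 32-layer model, 8 KV heads of dim 128, 32K context, fp16
size_gib = kv_cache_bytes(32, 8, 128, 32768) / 2**30  # 4.0 GiB per sequence
```

Four gibibytes per in-flight 32K-token sequence is why variable sequence lengths, paging, and quantized caches dominate inference system design.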
How neuroscientists are using AI (thetransmitter.org). Eight neuroscientists describe using large language models to analyze literature, brainstorm hypotheses, code, and interpret spatial genomics and neural data
🔬 Theory & Foundations
The Principles of Diffusion Models (arxiv.org). Fundamentals of diffusion models: training, sampling, and theoretical insights into the underlying diffusion processes
The Spacetime of Large Language Models (medium.com/data-science-collective). Geometric view of Transformers: curvature, parallel transport, and QKV attention in language models
Large Language Models and Emergence: A Complex Systems Perspective (rewire.it). Explores whether LLMs exhibit genuine emergence or measurement artifacts using complexity science, KI/KO distinctions, and mechanistic interpretability, with figures and studies by Krakauer, Li, and Wei
📚 Academic Research
Whisper Leak: a side-channel attack on Large Language Models (arxiv:cs). Demonstrates a side‑channel attack that infers prompt topics from encrypted LLM streaming metadata (packet sizes/timing), exposing severe privacy risks and evaluating partial mitigations—critical for deployment security
Attention and Compression is all you need for Controllably Efficient Language Models (arxiv:cs). Introduces CAT: chunk compression + dense attention to trade quality for compute. Adaptive single model supports run-time compute budgeting—improves speed/memory for long-context and production LLMs
Apriel-H1: Towards Efficient Enterprise Reasoning Models (arxiv:cs). Presents Apriel‑H1, a distilled hybrid SSM–Transformer replacing attention with Mamba blocks for 15B models; yields ~2× inference throughput in production with minimal reasoning loss—enterprise impact
PerfDojo: Automated ML Library Generation for Heterogeneous Architectures (arxiv:cs). PerfLLM/PerfDojo leverages LLMs and RL to auto-optimize ML kernels across CPUs/GPUs/accelerators, improving performance portability—highly relevant for ML systems engineers and deployment optimization
Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks (arxiv:cs). Proposes FedAttn: distributed self‑attention where participants compute local attention and aggregate KV matrices, enabling private, communication‑efficient collaborative LLM inference with theoretical tradeoffs for edge deployments
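The KV-aggregation idea in FedAttn rests on a standard identity: softmax attention over concatenated keys and values equals a denominator-weighted merge of each participant's local attention. A NumPy sketch of that merge under my reading of the abstract (the paper's actual protocol also handles privacy and communication costs):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def local_attention(q, k, v):
    # Each participant returns an unnormalized numerator and the softmax
    # denominator, so a coordinator can merge without the raw context.
    e = np.exp(q @ k.T / np.sqrt(q.shape[-1]))
    return e @ v, e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16
q = rng.standard_normal((4, d))                                    # queries
k1, v1 = rng.standard_normal((6, d)), rng.standard_normal((6, d))  # participant 1
k2, v2 = rng.standard_normal((6, d)), rng.standard_normal((6, d))  # participant 2

n1, z1 = local_attention(q, k1, v1)
n2, z2 = local_attention(q, k2, v2)
merged = (n1 + n2) / (z1 + z2)   # == attention over the concatenated KV

out = attention(q, np.concatenate([k1, k2]), np.concatenate([v1, v2]))
```

Only the small per-query numerators and denominators cross the network, which is the communication-efficiency argument; production code would use the log-sum-exp form for numerical safety.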