
The AI Engineer

November 12, 2025

The AI Engineer 12-11-2025

cloud-native AI, Snapchat's AI integration, Siri's temporary brain rental

📣 Headlines

• The open-source and cloud-native forces reshaping AI stacks are driving adoption of open-weight models, agents, OpenFGA and infra orchestration; Moonshot’s Kimi K2 agent showcases step-by-step reasoning and heavy tool use, and vibe coding/context engineering is changing developer workflows.

• Perplexity and Snap struck a $400M deal for Perplexity to power Snapchat’s conversational search and My AI integration in a 2026 rollout, embedding generative search across Snapchat’s user base (report).

• Apple will reportedly pay about $1B/year to temporarily rent a Gemini-powered brain for Siri, combining on-device and cloud processing as a stopgap while it develops its own longer-term models.

• OpenAI urged extending CHIPS Act tax credits to AI data centers and server makers to support a proposed $500B AI build plan, seeking a 35% AMIC-style incentive for AI infrastructure.

• AI factories require massive CapEx and face a decade-long J-curve, with data, power and talent constraints tempering near-term returns despite multitrillion-dollar potential.

• Major security vendors rolled out AI-focused defenses and tooling — Fortinet, SentinelOne and CrowdStrike announced new AI security features while Sysdig enhanced Falco for real-time threat detection and forensic capture.

• Oracle’s Autonomous AI Lakehouse pairs Exadata performance with Apache Iceberg openness for multicloud analytics and AI-ready data management.

• Anthropic opened new European offices and appointed Pip White head of EMEA North as it expands enterprise operations and regional presence.

🔧 Company Engineering Blogs

1,500+ PRs Later: Spotify’s Journey with Our Background Coding Agent (Part 1) (engineering​.atspotify​.com). Spotify uses Fleet Management with AI coding agents to automate cross-repo work: Java/POM updates, YAML/JSON config changes, and UI migrations

Video Invisible Watermarking at Scale (engineering​.fb​.com). Meta discusses scalable invisible watermarking for video using CPU-first pipelines, FFmpeg filters, and frame-selection to balance quality, detection accuracy, and BD-Rate

How Cursor AI Slashed Dashboard Migration Time 75% Across 240 Queries (engineering​.salesforce​.com). Cursor paired with an MCP Server cut migration time by 75% while moving 240 queries across 20 dashboards from Splunk to Tableau

A Decade of AI Platform at Pinterest (medium​.com/pinterest-engineering). A decade-long look at Pinterest’s unified ML platform, covering Linchpin, Scorpion, EzFlow, Galaxy, UFR, MLEnv, TabularML, and GPU-centric innovations spanning Ray and large embedding models

Introducing Nested Learning: A new ML paradigm for continual learning (research​.google). Nested Learning proposes multi-level optimization to tackle catastrophic forgetting in ML models, introducing CMS memory and Hope architecture (Lang: Python-like concepts)

🔓 Open Models & Architectures

Kimi K2 Thinking (simonwillison​.net). Kimi K2 Thinking: Moonshot’s 1T-parameter agentic model with INT4 quantization, tool use, and benchmark leadership
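
As a rough illustration of what the INT4 quantization mentioned here involves, a toy symmetric, per-group scheme in plain Python (a sketch of the general idea, not Moonshot's actual recipe — real pipelines quantize tensors with calibrated, hardware-packed kernels):

```python
# Toy symmetric INT4 quantization: each group of weights shares one scale,
# and values are rounded into the signed 4-bit range [-8, 7].

def quantize_int4(weights, group=4):
    out = []
    for i in range(0, len(weights), group):
        g = weights[i:i + group]
        scale = max(abs(w) for w in g) / 7 or 1.0  # avoid zero scale
        q = [max(-8, min(7, round(w / scale))) for w in g]
        out.append((scale, q))
    return out

def dequantize_int4(groups):
    # Reconstruct approximate floats from (scale, int4-list) pairs.
    return [q * scale for scale, qs in groups for q in qs]

groups = quantize_int4([0.7, -0.1, 0.3, 0.2])
restored = dequantize_int4(groups)
```

The 4x size reduction versus FP16 comes from storing only the int4 codes plus one scale per group; the rounding step is where quality can degrade, which is why post-training schemes calibrate scales carefully.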

Beyond Standard LLMs (magazine​.sebastianraschka​.com). Explores linear attention hybrids, text diffusion, code world models, and small recursive transformers with Qwen3-Next, MiniMax-M2, Kimi Linear, DeepSeek, and related architectures in open-weight LLMs

5 Thoughts on Kimi K2 Thinking (interconnects​.ai). Open model Kimi K2 Thinking from Moonshot AI: a 1T-parameter MoE with 32B active params, 256K context, INT4 post-training, tool use, benchmarks, and China AI rise insights by Nathan Lambert

Reflection (alexpolozov​.com). Polozov discusses AI as a platform, open Western models, and open frontier-model science, highlighting Reflection, Windsurf, Cursor, and 2025–2026 open-model goals

🤖 Agents & RL

An ARENA 6.0 Capstone: Model Organism of Encoded Reasoning (lesswrong​.com). ARENA 6.0 capstone using RL on Qwen-3-4B to study encoded reasoning in GSM8K math CoTs with judge models, toxicity signals, and RL challenges

Introspection or confusion? (lesswrong​.com). Explores introspection vs. confusion in LLMs; reproduces Anthropic-style experiments with steering vectors; analyzes control questions and noise across Mistral, Qwen, and Llama models

RL Learning with LoRA: A Diverse Deep Dive (kalomaze​.bearblog​.dev). RL training with LoRA for SFT and RL finetuning in prime-rl using rsLoRA scaling and multi-environment experiments

#299 Jacob Buckman: Why the Future of AI Won't Be Built on Transformers (aneyeonai​.libsyn​.com). Explores memory, long-context challenges, and retrofitting transformers with state-space and Power Retention by Manifest AI in AI agent workflows

Building Metacognitive AI Agents: A Complete Guide from Theory to Production (rewire​.it). Metacognitive AI agents with dual-loop Actor-Critic design, Reflexion patterns, and LangGraph-driven production-ready workflows

Language as a Universal Interface for Reinforcement Learning Agents (richardli​.xyz). Language as a universal interface for RL agents, detailing a formal framework, memory compression, and two-layer thought-action generation using LLMs

📚 Retrieval & RAG

Essential Chunking Techniques for Building Better LLM Applications (machinelearningmastery​.com). Chunking techniques for LLMs: fixed-size to hierarchical, semantic, LLM- and agent-based strategies for better retrieval and generation
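
The fixed-size baseline that article starts from fits in a few lines; a sketch with hypothetical size/overlap defaults (the more advanced semantic and agent-based strategies it covers build on the same splitting contract):

```python
def chunk_fixed(text, size=200, overlap=50):
    """Fixed-size character chunks with overlap, the simplest baseline.

    Overlap keeps context that straddles a boundary present in both
    neighboring chunks, at the cost of some duplicated storage.
    """
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(500))
chunks = chunk_fixed(doc)
```

Semantic and hierarchical chunkers replace the fixed `step` with boundaries derived from sentence embeddings or document structure, but they still emit a list of overlapping spans like this one.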

How I'm Building a Context-Aware Retriever to Boost RAG Quality (Part 3: Evaluation) (egpivo​.github​.io). Context-aware retriever for RAG quality using multi-stage retrieval, GPT-4o, DSPy, and a ContractNLI dataset

How to Evaluate Retrieval Quality in RAG Pipelines (part 2): Mean Reciprocal Rank (MRR) and Average Precision (AP) (towardsdatascience​.com). Binary, order-aware metrics MRR and AP for evaluating retrieval in RAG pipelines using Python implementations
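
Both metrics are short enough to sketch here; the shapes below (a ranked list of doc ids plus a set of relevant ids) are assumptions for illustration, not the article's code:

```python
def reciprocal_rank(results, relevant):
    """1/rank of the first relevant hit; 0.0 if none is retrieved."""
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(results, relevant):
    """Mean of precision@k over ranks k where a relevant doc appears,
    normalized by the total number of relevant docs."""
    hits, precisions = 0, []
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

# MRR and MAP are simply these per-query scores averaged over a query set.
queries = [(["d3", "d1", "d7"], {"d1"}), (["d2", "d5"], {"d9"})]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```

RR only rewards the first hit, so it suits single-answer retrieval; AP credits every relevant document and penalizes relevant docs pushed down the ranking.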

Multi-Agent SQL Assistant, Part 2: Building a RAG Manager (towardsdatascience​.com). RAG strategies with Keyword, FAISS, and Chroma for a SQL assistant using Python, SQLite, and LLMs
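
The keyword strategy in that lineup can be approximated with plain word-overlap scoring — a stand-in sketch, not the article's implementation (which pairs keyword search with FAISS and Chroma vector stores):

```python
def keyword_retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query; a crude stand-in for
    BM25-style keyword retrieval in a RAG manager."""
    qwords = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(qwords & set(d.lower().split())))
    return scored[:k]

docs = [
    "orders table schema",
    "customer churn report",
    "monthly orders by region",
]
top = keyword_retrieve("orders by region", docs)
```

A RAG manager typically hides several such retrievers behind one interface and routes queries to keyword, dense, or hybrid search based on the query type.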

⚙️ Systems & Deployment

Deterministic Inference: The Latency Tax (akashbajwa​.co). Latency vs determinism in AI inference; diffusion models, FP nondeterminism, RAG/guardrails, and enterprise AI workloads

Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (aarnphm​.xyz). Guided decoding and speculative decoding merge to enable CPU-GPU collaboration for LLM inference using TensorRT LLM and XGrammar with CUDA graphs

The Untold System Design Problem in LLM Inference (dylanhuang​.com). Variable-shape LLM inference, KV caches, MoE routing, quantization, and adaptive scheduling for interactive and batch workloads

How neuroscientists are using AI (thetransmitter​.org). Eight neuroscientists describe using large language models to analyze literature, brainstorm hypotheses, code, and interpret spatial genomics and neural data

🔬 Theory & Foundations

The Principles of Diffusion Models (arxiv​.org). Diffusion models fundamentals, training, sampling, and theoretical insights using ML concepts and diffusion processes

The Spacetime of Large Language Models (medium​.com/data-science-collective). Geometric view of Transformers: curvature, parallel transport, and QKV attention in language models

Large Language Models and Emergence: A Complex Systems Perspective (rewire​.it). Explores whether LLMs exhibit genuine emergence or measurement artifacts using complexity science, KI/KO distinctions, and mechanistic interpretability with figures and studies by Krakauer, Li, and Wei

📚 Academic Research

Whisper Leak: a side-channel attack on Large Language Models (arxiv:cs). Demonstrates a side‑channel attack that infers prompt topics from encrypted LLM streaming metadata (packet sizes/timing), exposing severe privacy risks and evaluating partial mitigations—critical for deployment security

Attention and Compression is all you need for Controllably Efficient Language Models (arxiv:cs). Introduces CAT: chunk compression + dense attention to trade quality for compute. Adaptive single model supports run-time compute budgeting—improves speed/memory for long-context and production LLMs

Apriel-H1: Towards Efficient Enterprise Reasoning Models (arxiv:cs). Presents Apriel‑H1, a distilled hybrid SSM–Transformer replacing attention with Mamba blocks for 15B models; yields ~2× inference throughput in production with minimal reasoning loss—enterprise impact

PerfDojo: Automated ML Library Generation for Heterogeneous Architectures (arxiv:cs). PerfLLM/PerfDojo leverages LLMs and RL to auto-optimize ML kernels across CPUs/GPUs/accelerators, improving performance portability—highly relevant for ML systems engineers and deployment optimization

Federated Attention: A Distributed Paradigm for Collaborative LLM Inference over Edge Networks (arxiv:cs). Proposes FedAttn: distributed self‑attention where participants compute local attention and aggregate KV matrices, enabling private, communication‑efficient collaborative LLM inference with theoretical tradeoffs for edge deployments
