The AI Engineer

December 17, 2025

The AI Engineer 17-12-2025

Vibecoding on Slack, Google's MCP servers, rising AI training data market, EU's inquiry on Google

📣 Headlines

• Claude’s Slack-based coding beta can create and modify files, run tests, and iterate from chat, pushing “vibecoding” deeper into everyday workflows.

• Google launched managed MCP servers for services like BigQuery, Maps, and GKE to simplify connecting AI agents to cloud tools and data sources.

• The AI training-data market is booming as suppliers of expert-driven data and RLHF rubrics scale up to meet demand for higher-quality model alignment and evaluation.

• The EU is investigating Google over AI-generated search summaries, focusing on publisher compensation, data use, and opt-out mechanisms.

• Intel is reportedly in talks to acquire inference-chip startup SambaNova for about $1.6B, signaling continued consolidation pressure in AI hardware.

• Couchbase rolled out a unified platform for AI data, vector search, and governance aimed at simplifying and speeding up agent development.

• Oracle shares fell after earnings missed expectations, with investors scrutinizing AI-infrastructure growth and broader “AI bubble” fears.

• AI-enabled kids’ toys are surfacing safety and propaganda risks, underscoring gaps in regulation, content controls, and data protections for consumer LLM deployments.



🔧 Company Engineering Blogs

Deepening our partnership with the UK AI Security Institute (deepmind​.google). Google DeepMind and UK AI Security Institute expand a research partnership to study monitoring AI reasoning, socioaffective impacts, and economic evaluations

Background Coding Agents: Predictable Results Through Strong Feedback Loops (Part 3) (engineering​.atspotify​.com). Strong verification loops with Maven verifiers, an LLM judge, and sandboxed agents drive reliable background code changes in Spotify's CI/CD workflow

4x Faster: How AI-Assisted Development Accelerated Building New SQL Dialects for Zero Copy Connectors (engineering​.salesforce​.com). AI-assisted dialect generation and automated 25,000-query validation accelerate Zero Copy connectors across 16 dialects, expanding from 5 to 100+

The future of AI-powered software optimization (and how it can help your team) (github​.blog). Explores Continuous Efficiency: AI-enabled, agentic workflows for green software on GitHub, using Agentic Workflows and LLMs

Codex is Open Sourcing AI models (huggingface​.co). Hugging Face demonstrates an OpenAI Codex integration with HF Skills for end-to-end ML experiments and model fine-tuning

🔓 Open Model Drops

Olmo 3 and the Open LLM Renaissance (cameronrwolfe​.substack​.com). Olmo 3 releases fully-open training artifacts for reproducible open-LLM research and transparency-driven benchmarking

Devstral 2 (simonwillison​.net). Devstral 2: Mistral's 123B and Small 24B models for coding agents, with licensing contrasts and SVG prompt examples

2025-12-11 22:35 (aicode​.danvoronov​.com). Devstral 2 (123B) and Devstral Small 2 (24B) from Mistral AI, Vibe CLI, SWE-bench performance, open licenses, local GPU viability

Gemma 3 AI model in Clojure (dragan​.rocks). Gemma 3 AI model in Clojure demonstrates ONNX runtime integration for loading and running a 1B parameter Gemma 3 LLM in main memory with oneDNN

Introduction to Qwen3-VL (debuggercafe​.com). Qwen3-VL 4B-Instruct and Thinking models enhance multimodal understanding with OCR, object detection, and video tasks using Python and Transformers

The $250 Million Paper (blog​.bytebytego​.com). Molmo builds a strong vision-language model from scratch using PixMo datasets for open training and improved grounding

🌍 Policy & Ecosystem

Why AI reading science fiction could be a problem (transformernews​.ai). Explores how science fiction and misalignment research influence AI training, with examples from Anthropic, Redwood, and policy options

State of AI: An Empirical Study of 100 Trillion Tokens of Real-World LLM Interactions (openrouter​.ai). Empirical study of 100 trillion tokens reveals OSS vs proprietary usage, roleplay and programming focus, and agentic inference trends across regions

Agentic AI: Governance, risks, and responsible deployment in the social sector (merltech​.org). Explores Agentic AI, its differences from GenAI and AI agents, governance, consent, risks, and multi-stakeholder accountability in the social sector

Cryptographers Show That AI Protections Will Always Have Holes (quantamagazine​.org). Cryptographers show two-tier AI protections have inherent gaps, exploitable via time-lock puzzles and substitution ciphers

🎬 AI in Media

All AI Videos Are Harmful (idiallo​.com). Explores AI video tools like Sora, Runway ML, Veo; warns of misuse by scammers and erosion of trust in visual media

Users are complaining OpenAI wrecked Sora 2. EOSHD takes a look at what happened… (eoshd​.com). OpenAI's Sora 2 struggles with scaling, provoking investor controversy and debates on compute costs and accessibility

Unleashing Autonomy: Getting Daily News Updates (danraine​.substack​.com). Daily AI news on film, using Perplexity Gemini 3 to surface articles from Disney licensing to AI actors and copyright topics

Dubbing a Podcast with AI (victorcorreal​.substack​.com). Strategies for dubbing podcasts into English using AI, tests with cutting-edge tools, and implementation examples

🧑‍💻 Coding Culture

Ask HN: How can I get better at using AI for programming? (news​.ycombinator​.com). Tips for AI-assisted programming with Claude and Plan mode, CLAUDE.md, memory, and browser checks to speed Svelte/Django workflow

Claude Code Changed How I Work (Part 1) (causalinf​.substack​.com). Economist Scott Cunningham chronicles using Claude Code for AI-driven coding that runs autonomously inside local directories, with multi-part insights on attention and collaboration

If you're going to vibe code, why not do it in C? (stephenramsay​.net). Explores vibe coding, its relation to C and assembly, and the idea of a dedicated vibe-oriented language for AI-assisted programming

How to review AI Generated PRs (kevinjmurphy​.com). Guidance on reviewing AI-generated pull requests with caution, focusing on safety, quality, and collaboration across teams

Responding to "The highest quality codebase" (schneidenba​.ch). Claude-driven refactoring experiment: quality vs vanity metrics, prompts, and lessons from a 47k–120k LOC codebase with test and comment growth

🧰 Agent Tooling

Stop using GitHub Copilot as a chatbot! (m365princess​.com). Stop using Copilot as a chatbot; shift decisions into stable repo files to make it a reliable tool

Supercharge Your Design System with LLMs and Storybook MCP (tympanus​.net). Using LLM coding agents with Storybook MCP to build high-quality UI components in a design system

Quoting OpenAI Codex CLI (simonwillison​.net). A look at OpenAI Codex CLI usage, prompts, and skill rendering in Rust with progressive disclosure across example prompts and YAML-driven triggers

Advent of AI 2025 - Day 8: Messy Data to Structured Output (nickyt​.co). Messy napkin notes to structured JSON and an HTML site using Goose, MCP, Netlify, and Day 4 styling

How to Customize Your Claude Code Status Line (alexop​.dev). Step-by-step guide to create a Claude Code status line that shows model, context usage, and costs in your terminal using bash and jq

You are holding GitHub Copilot Wrong! (m365princess​.com). Critical take on GitHub Copilot: prompts, memory, context windows, and the need for structured workflows in software development

🧱 RAG & App Architecture

The Stability-Plasticity Dilemma: How Memory Architectures Are Solving Continual Learning (rewire​.it). Explores memory-augmented and retrieval-based approaches to continual learning in LLMs, including sparse memory finetuning, RAG, and model merging

Personal, Agentic Assistants: A Practical Blueprint for a Secure, Multi-User, Self-Hosted Chatbot (towardsdatascience​.com). Self-hosted, multi-user agentic chatbot with private file access, vector search, and tool-based reasoning using LangGraph, Postgres, MinIO, Ollama, and RabbitMQ in Python

SE Radio 698: Srujana Merugu on How to build an LLM App (se-radio​.net). Srujana Merugu discusses building LLM apps, RAG, agentic architectures, evaluation, safety, prompts, and multi-modal trends

Building a "Lawyer GPT" for Your Blog - Part 9: Document Ingestion with Docling (mostlylucid​.net). Ingests PDFs, DOCX, and more using Docling to power a Lawyer GPT with C#/.NET, Docker, OCR, and vector search

SOTA RAG & Memory without the database: Files, Git, and simple folders (nijho​.lt). Bas Nijholt outlines a file-based RAG and memory approach using Markdown, Git, ONNX Runtime, and a local document folder

How OpenAI, Gemini, and Claude Use Agents to Power Deep Research (blog​.bytebytego​.com). Deep Research workflows across OpenAI, Gemini, Claude, Perplexity, and more using multi-agent orchestration and specialized tools

Analyzing an email dump with BigQuery and Gemini (corinfaife​.co). Analyzes an Epstein email dump using BigQuery and Gemini to extract metadata, summarize at scale with Vertex AI, and present results via a Streamlit interface

🏗️ Systems & Serving

chores 0.3.0 and local LLMs (simonpcouch​.com). Local LLMs power chores helpers for R, with Qwen3-4B-2507, LM Studio, Ollama, and GPT-4.1/Claude prompts guiding roxygen2 templating

SGLang Adds Day-0 Support for the Highly Efficient, Open Nemotron 3 Nano Hybrid MoE Model (lmsys​.org). SGLang adds Day-0 support for NVIDIA Nemotron 3 Nano hybrid MoE model, enabling open-source, efficient AI agent development with BF16/FP8 configurations

Neo, 6 years and 600 citations later (rmarcus​.info). Neo, a learned query optimizer using deep reinforcement learning and tree convolution, details its journey to 600 citations and industry impact

New in llama.cpp: Model Management (huggingface​.co). llama.cpp adds router mode for dynamic multi-model loading and management in a lightweight HTTP server

Fedora Magazine: Find out how your Fedora system really feels (with the linux-mcp-server!) (fedoramagazine​.org). Explains linux-mcp-server and MCP, enabling LLMs to query Fedora system data for troubleshooting and upgrades

Build an AI inference server on Ubuntu (gjolly​.fr). Build a local Ubuntu AI inference server using Ollama and Open WebUI for privacy-preserving LLM chats

🌫️ Diffusion Generation

Text Diffusion Models are Faster at Writing Code (nathan​.rs). Diffusion language models accelerate code generation through more parallel decoding on structured output, demonstrated with Fast dLLM v2 on code vs unstructured text

Omni-Attribute: Open-Vocabulary Attribute Encoder for Visual Concept Personalization (opencv​.org). Omni-Attribute enables open-vocabulary, attribute-specific embeddings for controllable diffusion, combining image + text pairs to disentangle attributes like identity, lighting, and style

Implementing Diffusion Models (vkethana​.com). Explores diffusion models with CFG, SDEdit, inpainting, image editing, and image-to-image techniques using Python code and DeepFloyd IF

📏 Evals & Benchmarks

Open Source Replication of the Auditing Game Model Organism (lesswrong​.com). Replication of an auditing model organism using Llama 3.3 70B Instruct to test reward model biases and auditing techniques

What are AI Evals? (nickyt​.co). How AI evals score outputs, use guardrails, and optimize models with Galileo and LLMs for testing non-deterministic AI systems

Gemini 3 Pro vs GPT 5.2: A Blind Test Reveals Code Red DNA (prashamhtrivedi​.in). Blind A/B test compares Gemini 3 Pro and GPT 5.2 on converting slash commands for sandboxed coding agents, revealing training-driven behavior and salience issues

FACTS Benchmark Suite: Systematically evaluating the factuality of large language models (deepmind​.google). FACTS Benchmark Suite evaluates LLM factuality across Parametric, Search, and Multimodal reasoning with public/private sets and a Gemini-led performance showcase

Validating LLM-as-a-Judge Systems under Rating Indeterminacy (blog​.ml​.cmu​.edu). Framework for validating LLM-as-a-judge under rating indeterminacy; multi-label elicitation, translation matrices, and downstream evaluation tasks

Blog/2025-12-11/LLMs Excel At Easy Verification Problems (wiki​.roshangeorge​.dev). LLMs excel at checkable problems and reasoning, using verification loops over memory and minimal reproducible examples

The Missing Piece in AI Safety (cafebedouin​.org). Structural evaluation: rethinking AI safety by auditing evaluators and metrics that define 'smart' and safe

How AI Benchmarking Works and Where It Falls Short (stuker​.com). On AI benchmarks, synthetic vs. human evaluations, evals, memorization, reward hacking, and the protection of intellectual property

🧠 Understanding Model Behavior

NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating (towardsdatascience​.com). Systematic study of gating in SDPA attention for LLMs using Qwen’s methods; practical insights in gating placement, multiplicative gating, head-specific gates, RoPE, YaRN, and code in Python

I. Drafts (studium​.dev). LLMs as Meaning Optimization explored using Quartz v4.2.3 in a JavaScript/TypeScript context by jerlendds

Some evidence against the idea strange CoT stems from incentives to compress language (lesswrong​.com). Incentives to compress language in RL-tuned LLMs are explored via CoT entropy comparison across RL'd and instruct-tuned variants of Qwen models

Why Vision Language Models Ignore What They See with Munawar Hayat - #758 (twimlai​.com). Munawar Hayat discusses NeurIPS 2025 work on multimodal AI, object hallucination, attention-guided alignment, generalized contrastive learning, and MultiHuman Testbench

A Confessional for LLMs Seems to Be a Good Idea (stuker​.com). A confessional concept for LLMs explores confession-based honesty, RL rewards, and training gaps with GPT-5-Thinking experiments

In Defense of Curiosity (davidbau​.com). Explores pragmatic interpretability, three research perspectives, Copernican shift in AI thinking, and the role of curiosity in human-AI collaboration

Cracking the Code: Tackling AI Hallucinations in the Quest for Reliable Language Models (eliza-ng​.me). Explores AI hallucinations in LLMs, grounding, confidence scores, and hybrid data integration for reliable language models

Shortwave: Beyond Hallucinations: The Illusion of Understanding (mamund​.substack​.com). Explores epistemic drift in LLMs, framing with Korzybski and Rose-Frame model, urging cautious use and human-centered reasoning

📚 Academic Research

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language (arxiv:cs). Meta FAIR proposes VL-JEPA, predicting continuous text embeddings instead of tokens. It cuts parameters 50% and enables selective decoding, boosting VQA/retrieval efficiently for multimodal deployment

Exploring MLLM-Diffusion Information Transfer with MetaCanvas (arxiv:cs). MetaCanvas turns MLLMs into latent-space planners for diffusion generators. It improves layout control, attribute binding, and video editing across backbones, bridging understanding-to-generation via lightweight interfaces

Mull-Tokens: Modality-Agnostic Latent Thinking (arxiv:cs). Mull-Tokens adds 10–40 discrete latent scratchpad tokens to multimodal LLMs. Trained with curriculum and optional RL, it boosts visual reasoning cheaply without verbose CoT outputs

HybridToken-VLM: Hybrid Token Compression for Vision-Language Models (arxiv:cs). HTC-VLM compresses hundreds of vision tokens into one via hybrid continuous patches plus discrete semantic anchors. It retains ~87% of benchmark performance at a 580-to-1 compression ratio

Learning Unmasking Policies for Diffusion Language Models (arxiv:cs). This paper trains RL policies to choose which tokens to unmask in masked diffusion language models. Learned samplers improve the quality–throughput trade-off and transfer across models and sequence lengths
