The AI Engineer 23-12-2025
Amazon's potential OpenAI investment, new Codex and image models from OpenAI, the environmental impact of AI
📣 Headlines
• Amazon weighed a $10B OpenAI investment alongside supplying Trainium chips and AWS data-center capacity to deepen AI infrastructure ties.
• OpenAI rolled out new models for builders, with GPT-5.2-Codex for more capable software engineering and GPT Image 1.5 optimized for image editing and text rendering.
• Research found the 2025 AI boom is driving major environmental impact via surging CO2 emissions and water use.
• A UK survey reported that one-third of citizens have used AI for emotional support, raising safety and misinformation concerns.
• The creative sector's AI fight intensified as major labels embraced AI-generated music while UK creators pushed back, with only 3% backing an active opt-out copyright plan.
• Marketing automation firm MoEngage extended its fundraising with another $180M after a recent $100M round to fund AI expansion and growth in the US and Europe.
• US policy focus sharpened as Sen. Mark Kelly weighed in on taxing AI companies that eliminate jobs, data-center backlash, and bipartisan tech regulation.
• UK lawmakers questioned government use of Palantir after an investigation highlighted security concerns and potential US data-access risks.
🔧 Company Engineering Blogs
Gemini 3 Flash: frontier intelligence built for speed (deepmind.google). Gemini 3 Flash delivers frontier intelligence at speed, with Pro-grade reasoning and low latency for coding, analysis, and multimodal tasks
1,500+ PRs Later: Spotify's Journey with Their Background Coding Agent (engineering.atspotify.com). Spotify scales Fleet Management with AI coding agents to automate complex migrations across Java, YAML, and UI changes
How We Built Meta Ray-Ban Display: From Zero to Polish (engineering.fb.com). Explores Meta Ray-Ban Display development, AI glasses, display tech, UI patterns, and hardware design challenges
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator (huggingface.co). Open evaluation standard for Nemotron 3 Nano using NeMo Evaluator, open tooling, configs, artifacts, and reproducible workflows
Google Research 2025: Bolder breakthroughs, bigger impact (research.google). Google Research 2025 highlights breakthroughs in generative models, quantum computing, Earth/health AI, education, and private ML tools, with Gemini, LAVA, MUVERA, and Parfait
🎥 Gemini & Multimodal
Building Speakeasy: From Python Prototype to Native macOS App (migueldavid.eu). Native macOS Speakeasy using AVSpeechSynthesizer for local, privacy-friendly text-to-speech with real-time highlighting
Asking Gemini 3 Flash To Watch A Video And Vividly Visually Describe It Scene By Scene & The Importance Of Media Resolution (blog.gdeltproject.org). Gemini 3 Flash analyzes videos at high media resolution for rich scene-by-scene descriptions and visual search capabilities
How to use Gemini Live audio as an interviewer for a software engineer's job (with video) (geshan.com.np). Use Gemini Live audio in Google AI Studio to interview backend engineers with prompts, modes, and audio-focused feedback
Gemini 3 Flash: Comparing Accuracy Vs Cost Of Different Media Resolutions For Video Analysis (blog.gdeltproject.org). Video analysis compares Low, Medium, High resolutions for Gemini 3 Flash, showing token costs and no clear accuracy gain on TV news content
Quoting Gemini thinking trace (simonwillison.net). Gemini thinking trace reviews code feedback and comparisons with Claude and ChatGPT, focusing on manifest.json and content.js
🏗️ Vibe Coding & Learning
You Don't Need to Spend $100/mo on Claude Code: Your Guide to Local Coding Models (aiforswes.com). Local coding models on high-RAM Macs offer cost savings, with tooling like MLX/Ollama and Qwen, compared to cloud tiers
This morning I was asked, if I vibe-coded all or parts of Hule. The asker wasn't accusing me, the... (mikka.is). Local LLM-assisted coding in Hule using Python tooling, CSS tweaks, and code reviews with Claude and Codex
Vibe Coding (davidbau.com). Vibe coding with LLMs: tests, metaprogramming, and towers of complexity for a Mandelbrot web page
Code Revolution: How AI-Driven IDEs and CLI Preferences are Shaping the Developer's Future (eliza-ng.me). AI-driven IDEs like Cursor reshape dev workflows, balancing integration with CLI preferences and market competition
The Strange Case of Engineers Who Dismiss AI (terriblesoftware.org). Engineers resist AI coding tools; Claude Code and Cursor boost project-wide understanding and refactoring across codebases
AI and Elaboration: Which Coding Patterns Build Understanding? (innoq.com). Elaboration-driven AI patterns for software learning: navigator, worked examples, teaching back, and attempting before verifying, discussed by Daniel Westheide at INNOQ in the context of Python/Java ecosystems
🧰 MCP & Tool Selection
Embedding-Based Tool Selection for AI Agents (zarar.dev). Embedding-based tool selection using pgvector in Postgres, OpenAI embeddings, and category expansions to scale AI agents' tools with Elixir code
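The post implements this in Elixir on top of pgvector; as a rough illustration of the core idea only, here is a minimal Python sketch that ranks tools by cosine similarity between their description embeddings and the query embedding, then exposes just the top-k to the agent. The tool names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_tools(query_vec, tool_vecs, k=3):
    """Rank tools by how well their description embeddings match the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]  # hand only the top-k tool schemas to the agent

# Hypothetical 3-dim embeddings; real ones come from an embedding model.
tools = {"search_web": [0.9, 0.1, 0.0], "send_email": [0.0, 0.8, 0.6]}
print(select_tools([0.8, 0.2, 0.1], tools, k=1))
```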
Make the eyes go away (hexeditreality.com). Building an MCP server to bridge AI agents with i3, using Go, MCP SDK, and Ollama-enabled models
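The post builds its bridge in Go; for flavor, here is a hedged sketch of the same shape using the official Python MCP SDK instead. It assumes the `mcp` package is installed and that i3's `i3-msg` binary is on PATH; the server name and tool are illustrative, not the post's code.

```python
import subprocess
from mcp.server.fastmcp import FastMCP

server = FastMCP("i3-bridge")  # illustrative server name

@server.tool()
def run_i3_command(command: str) -> str:
    """Send a command string to i3 via i3-msg and return its output."""
    result = subprocess.run(["i3-msg", command], capture_output=True, text=True)
    return result.stdout or result.stderr

if __name__ == "__main__":
    server.run()  # serves MCP over stdio by default
```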
On AI Agents, MCP, and Tool Selection (acalustra.com). Global vs playbook AI agents, MCP tool selection, and balancing many tools for exploration vs few tools for reliable, single-task workflows
Architecting Agentic AI on AWS: From Intelligent Agents to Enterprise-Scale Execution (forgeahead.io). Explores architecting agentic AI on AWS with LLMs, Bedrock/SageMaker, Step Functions, and IAM for enterprise-scale execution
🧑‍💻 Coding Agent Tactics
Coding agents write 90% of my code now (ben.page). Coding agents like Claude Code or Amp now write the majority of the author's code, with the author guiding edits and tweaks
Trying GitHub Copilot coding agent (jlelse.blog). Explores GitHub Copilot Pro usage, PR-driven tasking, GoBlog test coverage, and AI-assisted coding in Go
What Actually Is Claude Code's Plan Mode? (lucumr.pocoo.org). Claude Code plan mode explored via prompts, tooling, and read-only workflow, contrasting with YOLO mode and manual planning
Claude Code skills not triggering? It might not see them. (blog.fsck.com). Claude Code skills may not trigger due to skill list size and system prompt limits in Claude Code 2.0.70, with a workaround using SLASH_COMMAND_TOOL_CHAR_BUDGET
Claude Code: stash (perrotta.dev). Claude Code stash feature for multi-line prompts enables temporary saving and auto-restoration during coding sessions
⌥← and ⌥→ hotkey navigation in Claude Code and Codex (banagale.com). Discussion of ⌥-key navigation for Claude Code and Codex via iTerm2; includes hex-send workflows and shortcuts
🔍 RAG & Retrieval
AI in Production Field Notes: Beyond "Just Call an LLM": Vimeo's Production Subtitle Engine (mlopsworld.com). Vimeo's subtitle pipeline uses layered, production-grade workflows with LLMs, chunking, validation, and async orchestration
How to Do Evals on a Bloated RAG Pipeline (towardsdatascience.com). Evaluates a bloated RAG pipeline with seed vs expanded context using RAGAS and DeepEval across GPT-5 models for faithfulness and relevance
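As a toy stand-in for what faithfulness-style metrics measure (the article uses real RAGAS/DeepEval LLM judges, not this), a crude lexical support score makes the seed-versus-expanded comparison concrete:

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer terms that also appear in the retrieved context."""
    answer_terms = set(answer.lower().split())
    context_terms = set(context.lower().split())
    return len(answer_terms & context_terms) / len(answer_terms) if answer_terms else 0.0

seed_ctx = "The seed documents directly support the claim."
expanded_ctx = seed_ctx + " Plus loosely related padding that dilutes retrieval."
answer = "The seed documents support the claim."

# Same harness shape as the article: score the answer against each context.
for label, ctx in [("seed", seed_ctx), ("expanded", expanded_ctx)]:
    print(label, round(support_score(answer, ctx), 2))
```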
DocSummarizer Part 3 - Advanced Concepts: The "I Went Too Far 🤦" Deep Dive (mostlylucid.net). Deep dive into DocSummarizer: ONNX embeddings, RAG architecture, MMR, RRF, and hybrid retrieval for local, production-grade summarization
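Of the pieces named there, Reciprocal Rank Fusion is compact enough to sketch exactly. This is the textbook formula with the conventional k=60, not DocSummarizer's own code:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists from different retrievers via RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a vector-search ranking with a keyword (BM25) ranking:
print(rrf([["a", "b", "c"], ["c", "a", "d"]]))  # -> ['a', 'c', 'b', 'd']
```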
Stop Shoving Documents Into LLMs: Build a Local Summarizer with Docling + RAG (mostlylucid.net). Local, offline document summarization pipeline using Docling + ONNX embeddings, Ollama support, and Qdrant for structured, citation-grounded summaries
SatoriDB: vector database built from scratch (nubskr.com). SatoriDB presents an embedded, billion-scale vector database with two-tier RAM/SSD routing, HNSW-based clustering, custom caches, and CPU pinning for predictable latency
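SatoriDB's internals are custom, but the HNSW building block it reportedly sits on is easy to poke at off the shelf. A hedged sketch using the hnswlib package (not SatoriDB's API); parameters are illustrative:

```python
import numpy as np
import hnswlib

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # query-time recall/latency trade-off

labels, distances = index.knn_query(data[:1], k=5)
print(labels[0], distances[0])  # nearest neighbors of the first vector
```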
⚙️ Serving & Performance
Mini-SGLang: Efficient Inference Engine in a Nutshell (lmsys.org). Mini-SGLang offers a lightweight, OpenAI-compatible LLM inference engine with Radix Attention and Tensor Parallelism implemented in a ~5k-line Python codebase
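The prefix-reuse idea behind Radix Attention can be caricatured in a few lines. This toy token trie (not Mini-SGLang's actual radix tree, which compresses paths and handles eviction) shows why requests sharing a prompt prefix can skip recomputing KV state:

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.kv_handle = None  # stand-in for cached KV tensors

def cached_prefix_len(root, tokens):
    """How many leading tokens already have KV state in the cache."""
    node, matched = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node, matched = node.children[tok], matched + 1
    return matched

def insert(root, tokens):
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())
    node.kv_handle = object()  # pretend the KV cache for this prefix lives here

root = TrieNode()
insert(root, [1, 2, 3, 4])
print(cached_prefix_len(root, [1, 2, 3, 9]))  # -> 3 tokens reusable
```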
Small adventures with small language models (blog.engora.com). Explores small language models (SLMs) with Ollama and HuggingFace, evaluation, and performance on data-breach analysis tasks
Diagnose & Fix Painfully Slow Ollama: 4 Essential Debugging Techniques + Fixes (journal.hexmos.com). Diagnose Ollama performance: GPU heat, quantization, KV caching, and model comparisons with --verbose
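The same numbers that `--verbose` prints are exposed by Ollama's REST API, so throughput checks can be scripted. A short sketch, assuming a local server on the default port; the model name is illustrative:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Say hello.", "stream": False},
).json()

tokens = resp["eval_count"]            # generated tokens
seconds = resp["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.1f} tok/s")
```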
VRAM vs System RAM: What Actually Limits Running LLMs Locally? (dewanahmed.com). VRAM vs system RAM in local LLMs: how GPU memory and host memory shape feasibility and performance with Qwen3-Next-style models
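The feasibility question mostly reduces to arithmetic: weight bytes plus KV-cache bytes versus available memory. A back-of-envelope sketch with illustrative, not benchmarked, numbers:

```python
def weight_bytes(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8

def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for keys and values; assumes fp16/bf16 cache entries
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
print(f"8B model @ 4-bit: {weight_bytes(8, 4) / GIB:.1f} GiB of weights")
print(f"KV cache (32 layers, 8 KV heads, 128 head dim, 32k ctx): "
      f"{kv_cache_bytes(32, 8, 128, 32_768) / GIB:.1f} GiB")
```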
Framework Desktop: How to Expand your Unified Memory For LLM Use (boilingsteam.com). Expands unified memory for LLM use on Framework Desktop by adjusting BIOS and kernel parameters to 90 GB VRAM for large models
🔒 Security & Offense
How I Think About Agentic Risks (cloudberry.engineering). Thoughtful exploration of agentic risks, risk amplifiers, threat modeling, and mitigations for AI agents with input sanitization, data access, and human-in-the-loop approaches
Edition 1: AI for Offense Is Here. Defenders Aren't Ready. (boringappsec.substack.com). AI-native offense using Claude Code sub-agents and MCP; automation of the kill chain across ~30 targets discussed by Sandesh Mysore Anand
The Developer's Guide to LLM Security (thedataexchange.media). Steve Wilson (Exabeam), OWASP GenAI Security Project lead, discusses prompt injection, AI supply chains, guardrails, MCP, A-to-A, and the security of agentic LLMs
Red Hat Buys an AI Safety Company, Promises to Open Source Its Tech (itsfoss.com). Red Hat acquires Chatterbox Labs to integrate AI safety tooling and promises to open source the tech over time
When the AI Says No: Compliance vs. Security (gordonbeeming.com). GPT-5.2 refuses to write secrets to disk, Claude 4.5 Sonnet complies, highlighting security vs. compliance in AI tooling
🧠 Alignment & Auditing
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (lesswrong.com). Activation Oracles train LLMs to answer questions about their activations, enabling auditing tasks and misalignment detection using diverse prompts and Colab demos
Alignment Fine-Tuning: Lessons from Operant Conditioning (lesswrong.com). Neuroscientist applies operant conditioning to alignment fine-tuning in LLMs, proposing early RLHF, slow post-deployment updates, and cue-based feedback
Note (hsu.cy). RLVR and verifiable rewards drive longer, deeper LLM reasoning; benchmarks face overhang and jagged progress in 2025
How to Teach LLMs to Reason for 50 Cents (artificialintelligencemadesimple.com). Latent space reasoning, multi-judge architecture, and open-source latency-friendly LLM tooling to access model reasoning for 50 cents using the IQIDIS approach
Video and transcript of talk on human-like-ness in AI safety (joecarlsmith.com). Joe Carlsmith discusses human-like-ness in AI safety, critiquing alien-ness, corrigibility, and generalization in ML-built AIs
🧪 Evals & Reliability
HELM Arabic (crfm.stanford.edu). HELM Arabic evaluates Arabic benchmarks using the open HELM framework and collaborates with Arabic.AI on multilingual LLM capabilities
Structured outputs create false confidence (boundaryml.com). Structured outputs often degrade quality; a hands-on look at constrained decoding versus free-form parsing with OpenAI models and BAML tooling
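The comparison is easy to reproduce in miniature: the same extraction requested via OpenAI's structured-output mode and as free-form text parsed afterwards. A hedged sketch; the schema and model name are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()
schema = {
    "name": "person",
    "schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
        "additionalProperties": False,
    },
}

constrained = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: Ada Lovelace, age 36."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
free_form = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Extract name and age as JSON: Ada Lovelace, age 36."}],
)

print(json.loads(constrained.choices[0].message.content))  # guaranteed shape
print(free_form.choices[0].message.content)  # validate this side yourself
```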
When AI Reviews AI: A Case Study in Benchmark Contamination (cafebedouin.org). Staged Adversarial Review exposes benchmark contamination in SDE evaluation for LLMs in scientific discovery
Do a sanity check on your experiments (ehudreiter.com). Sanity checks on data, model outputs, and evaluation to detect bugs in NLP/AI experiments
🧱 Diffusion & 1-bit Models
Power Up Diffusion LLMs: Day-0 Support for LLaDA 2.0 (lmsys.org). Diffusion LLMs via SGLang's Chunked-Prefill with LLaDA 2.0, showing day-0 support and streaming for 100B-scale models
What Happens When You Build an LLM Using Only 1s and 0s (towardsdatascience.com). BitNet b1.58 trains LLMs with ternary weights {-1, 0, 1}, enabling 1-bit-like efficiency and up to 9x throughput gains
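The quantizer at the heart of BitNet b1.58 is small enough to state exactly: scale each weight matrix by its mean absolute value, then round and clip to the ternary set. A numpy sketch of the paper's absmean scheme:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """Quantize weights to {-1, 0, 1} with a per-tensor absmean scale."""
    gamma = np.abs(w).mean()
    q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return q.astype(np.int8), gamma  # dequantize as q * gamma

w = np.random.randn(4, 4).astype(np.float32)
q, gamma = absmean_ternary(w)
print(q)                             # entries are only -1, 0, or 1
print(np.abs(w - q * gamma).mean())  # crude reconstruction error
```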
The AI Revolution: How Diffusion Models and New Architectures Will Replace Today's LLMs (grigio.org). Diffusion models, sub-quadratic architectures, private thinking, continuous learning, and a continuous-thinking machine reshape AI beyond current LLMs
🎓 Academic Research
Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models (arxiv:cs). MAHA: a hierarchical attention framework with multiscale aggregation and convex/Nash optimization for scalable LLM context modeling
Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers (arxiv:cs). Log-linear Sparse Attention (LLSA) enables hierarchical Top-K selection for long token sequences in Diffusion Transformers, boosting training and inference efficiency
Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models (arxiv:cs). Dynamic Rank Reinforcement Learning optimizes low-rank MHSA in LLMs via RL-guided rank selection and online perturbation bounds for efficient inference
SFTok: Bridging the Performance Gap in Discrete Tokenizers (arxiv:cs). SFTok: a discrete tokenizer with self-forcing guided reconstruction and debias-and-fitting training to boost image tokenization for high-resolution multimodal generation
IPCV: Information-Preserving Compression for MLLM Visual Encoders (arxiv:cs). IPCV compresses Vision Transformer tokens for MLLMs with Neighbor-Guided Reconstruction and Attention Stabilization to reduce compute without sacrificing text-critical cues