The AI Engineer

December 23, 2025

The AI Engineer 23-12-2025

Amazon's potential OpenAI investment, new Codex and image models from OpenAI, and the environmental impact of AI

📣 Headlines

• Amazon weighed a $10B OpenAI investment alongside supplying Trainium chips and AWS data-center capacity to deepen AI infrastructure ties.

• OpenAI rolled out new models for builders, with GPT-5.2-Codex for more capable software engineering and GPT Image 1.5 optimized for image editing and text rendering.

• Research found the 2025 AI boom is driving major environmental impact via surging CO2 emissions and water use.

• A UK survey reported that one-third of citizens have used AI for emotional support, raising safety and misinformation concerns.

• The creative sector’s AI fight intensified as major labels embraced AI-generated music while UK creators pushed back, with only 3% backing an active opt-out copyright plan.

• Marketing automation firm MoEngage extended its fundraising with another $180M after a recent $100M round to fund AI expansion and growth in the US and Europe.

• US policy focus sharpened as Sen. Mark Kelly floated taxing AI companies that eliminate jobs, amid data-center backlash and bipartisan tech-regulation talks.

• UK lawmakers questioned government use of Palantir after an investigation highlighted security concerns and potential US data-access risks.



🔧 Company Engineering Blogs

Gemini 3 Flash: frontier intelligence built for speed (deepmind​.google). Gemini 3 Flash delivers frontier intelligence at speed, with Pro-grade reasoning and low latency for coding, analysis, and multimodal tasks

1,500+ PRs Later: Spotify's Journey with Their Background Coding Agent (engineering.atspotify.com). Spotify scales Fleet Management with AI coding agents to automate complex migrations across Java, YAML, and UI changes

How We Built Meta Ray-Ban Display: From Zero to Polish (engineering​.fb​.com). Explores Meta Ray-Ban Display development, AI glasses, display tech, UI patterns, and hardware design challenges

The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator (huggingface​.co). Open evaluation standard for Nemotron 3 Nano using NeMo Evaluator, open tooling, configs, artifacts, and reproducible workflows

Google Research 2025: Bolder breakthroughs, bigger impact (research​.google). Google Research 2025 highlights breakthroughs in generative models, quantum computing, Earth/health AI, education, and private ML tools, with Gemini, LAVA, MUVERA, and Parfait

🎥 Gemini & Multimodal

Building Speakeasy: From Python Prototype to Native macOS App (migueldavid​.eu). Native macOS Speakeasy using AVSpeechSynthesizer for local, privacy-friendly text-to-speech with real-time highlighting

Asking Gemini 3 Flash To Watch A Video And Vividly Visually Describe It Scene By Scene & The Importance Of Media Resolution (blog​.gdeltproject​.org). Gemini 3 Flash analyzes videos at high media resolution for rich scene-by-scene descriptions and visual search capabilities

How to use Gemini Live audio as an interviewer for a software engineer’s job (with video) (geshan​.com​.np). Use Gemini Live audio in Google AI Studio to interview backend engineers with prompts, modes, and audio-focused feedback

Gemini 3 Flash: Comparing Accuracy Vs Cost Of Different Media Resolutions For Video Analysis (blog​.gdeltproject​.org). Video analysis compares Low, Medium, High resolutions for Gemini 3 Flash, showing token costs and no clear accuracy gain on TV news content

Quoting Gemini thinking trace (simonwillison​.net). Gemini thinking trace reviews code feedback and comparisons with Claude and ChatGPT, focusing on manifest.json and content.js

🎛️ Vibe Coding & Learning

You Don’t Need to Spend $100/mo on Claude Code: Your Guide to Local Coding Models (aiforswes​.com). Local coding models on high-RAM Macs offer cost savings, with tooling like MLX/Ollama and Qwen, compared to cloud tiers

This morning I was asked, if I vibe-coded all or parts of Hule. The asker wasn't accusing me, the... (mikka​.is). Local LLM-assisted coding in Hule using Python tooling, CSS tweaks, and code reviews with Claude and Codex

Vibe Coding (davidbau​.com). Vibe coding with LLMs: tests, metaprogramming, and towers of complexity for a Mandelbrot web page

Code Revolution: How AI-Driven IDEs and CLI Preferences are Shaping the Developer's Future (eliza-ng​.me). AI-driven IDEs like Cursor reshape dev workflows, balancing integration with CLI preferences and market competition

The Strange Case of Engineers Who Dismiss AI (terriblesoftware​.org). Engineers resist AI coding tools; Claude Code and Cursor boost project-wide understanding and refactoring across codebases

AI and Elaboration: Which Coding Patterns Build Understanding? (innoq.com). Elaboration-driven AI coding patterns for learning: navigator, worked examples, teaching back, and attempting before verifying, discussed by Daniel Westheide at INNOQ in Python/Java contexts

🧰 MCP & Tool Selection

Embedding-Based Tool Selection for AI Agents (zarar​.dev). Embedding-based tool selection using pgvector in Postgres, OpenAI embeddings, and category expansions to scale AI agents’ tools with Elixir code
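
The underlying pattern is simple: embed each tool description once, embed the incoming request, and keep only the nearest tools. The post does this in Elixir with pgvector; below is a rough Python sketch of the same idea with made-up vectors, not the author's code.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_tools(query_vec, tool_vecs, k=3):
    """Rank tools by cosine similarity to the query embedding and keep the top k.
    In the article this ranking happens in Postgres via pgvector instead."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]

# Toy 3-dim "embeddings"; a real system would use an embedding model
# (e.g. OpenAI embeddings) and store the vectors in pgvector.
tools = {
    "get_weather":   np.array([0.9, 0.1, 0.0]),
    "send_email":    np.array([0.1, 0.9, 0.2]),
    "search_issues": np.array([0.2, 0.3, 0.9]),
}
print(select_tools(np.array([0.85, 0.2, 0.1]), tools, k=2))
```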

Make the eyes go away (hexeditreality​.com). Building an MCP server to bridge AI agents with i3, using Go, MCP SDK, and Ollama-enabled models

On AI Agents, MCP, and Tool Selection (acalustra​.com). Global vs playbook AI agents, MCP tool selection, and balancing many tools for exploration vs few tools for reliable, single-task workflows

Architecting Agentic AI on AWS: From Intelligent Agents to Enterprise-Scale Execution (forgeahead​.io). Explores architecting agentic AI on AWS with LLMs, Bedrock/SageMaker, Step Functions, and IAM for enterprise-scale execution

🧑‍💻 Coding Agent Tactics

Coding agents write 90% of my code now (ben​.page). Coding agents like Claude Code or Amp now write the majority of the author's code, with the author guiding edits and tweaks

Trying GitHub Copilot coding agent (jlelse​.blog). Explores GitHub Copilot Pro usage, PR-driven tasking, GoBlog test coverage, and AI-assisted coding on Go

What Actually Is Claude Code’s Plan Mode? (lucumr​.pocoo​.org). Claude Code plan mode explored via prompts, tooling, and read-only workflow, contrasting with YOLO mode and manual planning

Claude Code skills not triggering? It might not see them. (blog.fsck.com). Claude Code skills may fail to trigger due to skill-list size and system-prompt limits in Claude Code 2.0.70, with a workaround using SLASH_COMMAND_TOOL_CHAR_BUDGET

Claude Code: stash (perrotta​.dev). Claude Code stash feature for multi-line prompts enables temporary saving and auto-restoration during coding sessions

⌘← and ⌘→ hotkey navigation in Claude Code and Codex (banagale​.com). Discussion of ⌘ key navigation for Claude Code and Codex via iTerm2; includes hex-send workflows and shortcuts

📚 RAG & Retrieval

AI in Production Field Notes: Beyond “Just Call an LLM”: Vimeo’s Production Subtitle Engine (mlopsworld​.com). Vimeo's subtitle pipeline uses layered, production-grade workflows with LLMs, chunking, validation, and async orchestration

How to Do Evals on a Bloated RAG Pipeline (towardsdatascience​.com). Evaluates a bloated RAG pipeline with seed vs expanded context using RAGAS and DeepEval across GPT-5 models for faithfulness and relevance
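
The metrics in play (faithfulness, answer relevance) ultimately reduce to asking a judge model whether the answer is grounded in the retrieved context. Below is a minimal illustration of that judge step, not the article's RAGAS/DeepEval harness; the model name and prompt are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_faithfulness(question: str, context: str, answer: str,
                       model: str = "gpt-4o-mini") -> str:
    """Ask a judge model whether the answer is supported by the retrieved context.
    A stand-in for what RAGAS/DeepEval faithfulness metrics automate."""
    prompt = (
        "Question:\n" + question +
        "\n\nRetrieved context:\n" + context +
        "\n\nAnswer:\n" + answer +
        "\n\nIs every claim in the answer supported by the context? Reply YES or NO."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```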

DocSummarizer Part 3 - Advanced Concepts: The "I Went Too Far 🤦" Deep Dive (mostlylucid​.net). Deep dive into DocSummarizer: ONNX embeddings, RAG architecture, MMR, RRF, and hybrid retrieval for local, production-grade summarization
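
One of the hybrid-retrieval pieces mentioned, Reciprocal Rank Fusion, is small enough to sketch: each ranker contributes 1/(k + rank) per document and the sums are re-sorted. The k=60 constant below is the commonly cited default, not necessarily the article's setting.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranker adds 1/(k + rank) per document.
    `rankings` is a list of ranked lists of doc ids (best first)."""
    scores = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a dense (vector) ranking with a lexical (BM25-style) ranking.
dense   = ["d3", "d1", "d7", "d2"]
lexical = ["d1", "d2", "d3", "d9"]
print(rrf([dense, lexical]))  # ['d1', 'd3', 'd2', ...]
```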

Stop Shoving Documents Into LLMs: Build a Local Summarizer with Docling + RAG (mostlylucid​.net). Local, offline document summarization pipeline using Docling + ONNX embeddings, Ollama support, and Qdrant for structured, citation-grounded summaries

SatoriDB: vector database built from scratch (nubskr​.com). SatoriDB presents an embedded, billion-scale vector database with two-tier RAM/SSD routing, HNSW-based clustering, custom caches, and CPU pinning for predictable latency

⚙️ Serving & Performance

Mini-SGLang: Efficient Inference Engine in a Nutshell (lmsys​.org). Mini-SGLang offers a lightweight, OpenAI-compatible LLM inference engine with Radix Attention and Tensor Parallelism implemented in a ~5k-line Python codebase
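
Radix Attention boils down to reusing KV-cache entries for any tokens a new request shares as a prefix with an earlier one. A toy illustration of the prefix-matching idea; Mini-SGLang actually organizes cached prefixes in a radix tree rather than scanning linearly.

```python
def longest_cached_prefix(request_tokens, cached_sequences):
    """Find how many leading tokens of a new request are already covered by a
    cached sequence, so their KV entries can be reused instead of recomputed."""
    best_len, best_seq = 0, None
    for seq in cached_sequences:
        n = 0
        for a, b in zip(request_tokens, seq):
            if a != b:
                break
            n += 1
        if n > best_len:
            best_len, best_seq = n, seq
    return best_len, best_seq

cache = [[1, 2, 3, 4, 5], [1, 2, 9, 9]]
print(longest_cached_prefix([1, 2, 3, 7], cache))  # (3, [1, 2, 3, 4, 5])
```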

Small adventures with small language models (blog​.engora​.com). Explores small language models (SLMs) with Ollama and HuggingFace, evaluation, and performance on data-breach analysis tasks

Diagnose & Fix Painfully Slow Ollama: 4 Essential Debugging Techniques + Fixes (journal​.hexmos​.com). Diagnose Ollama performance: GPU heat, quantization, KV caching, and model comparisons with --verbose

VRAM vs System RAM: What Actually Limits Running LLMs Locally? (dewanahmed​.com). VRAM vs system RAM in local LLMs: how GPU memory and host memory shape feasibility and performance with Qwen3-Next-style models
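
A quick way to reason about this: weights need roughly parameter-count × bytes-per-weight, and the KV cache scales with layers, KV heads, head size, and context length. A back-of-the-envelope estimator (the layer/head numbers below are illustrative, not a specific model's):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, e.g. a 14B model at 4-bit
    quantization needs roughly 14e9 * 0.5 bytes ~= 7 GB before KV cache/overhead."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache ~= 2 (K and V) * layers * kv_heads * head_dim * context_len * bytes."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(round(weight_memory_gb(14, 4), 1), "GB weights")
print(round(kv_cache_gb(layers=40, kv_heads=8, head_dim=128, context_len=32768), 1),
      "GB KV cache")
```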

Framework Desktop: How to Expand your Unified Memory For LLM Use (boilingsteam​.com). Expands unified memory for LLM use on Framework Desktop by adjusting BIOS and kernel parameters to 90 GB VRAM for large models

🔐 Security & Offense

How I Think About Agentic Risks (cloudberry​.engineering). Thoughtful exploration of agentic risks, risk amplifiers, threat modeling, and mitigations for AI agents with input sanitization, data access, and human-in-the-loop approaches

Edition 1: AI for Offense Is Here. Defenders Aren’t Ready. (boringappsec.substack.com). AI-native offense using Claude Code sub-agents and MCP; automation of the kill chain across ~30 targets, discussed by Sandesh Mysore Anand

The Developer’s Guide to LLM Security (thedataexchange​.media). Steve Wilson (Exabeam), OWASP GenAI Security Project lead, discusses prompt injection, AI supply chains, guardrails, MCP, A-to-A, and the security of agentic LLMs

Red Hat Buys an AI Safety Company, Promises to Open Source Its Tech (itsfoss​.com). Red Hat acquires Chatterbox Labs to integrate AI safety tooling and promises to open source the tech over time

When the AI Says No: Compliance vs. Security (gordonbeeming​.com). GPT-5.2 refuses to write secrets to disk, Claude 4.5 Sonnet complies, highlighting security vs. compliance in AI tooling

🧭 Alignment & Auditing

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (lesswrong​.com). Activation Oracles train LLMs to answer questions about their activations, enabling auditing tasks and misalignment detection using diverse prompts and Colab demos

Alignment Fine-Tuning: Lessons from Operant Conditioning (lesswrong​.com). Neuroscientist applies operant conditioning to alignment fine-tuning in LLMs, proposing early RLHF, slow post-deployment updates, and cue-based feedback

Note (hsu​.cy). RLVR and verifiable rewards drive longer, deeper LLM reasoning; benchmarks face overhang and jagged progress in 2025

How to Teach LLMs to Reason for 50 Cents (artificialintelligencemadesimple.com). Latent-space reasoning, a multi-judge architecture, and open-source latency-friendly LLM tooling to access model reasoning for 50 cents using the IQIDIS approach

Video and transcript of talk on human-like-ness in AI safety (joecarlsmith​.com). Joe Carlsmith discusses human-like-ness in AI safety, critiquing alien-ness, corrigibility, and generalization in ML-built AIs

🧪 Evals & Reliability

HELM Arabic (crfm​.stanford​.edu). HELM Arabic evaluates Arabic benchmarks using open HELM framework and collaborates with Arabic.AI on multilingual LLM capabilities

Structured outputs create false confidence (boundaryml​.com). Structured outputs often degrade quality; a hands-on look at constrained decoding versus free-form parsing with OpenAI models and BAML tooling
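
One way to test the article's claim is to skip constrained decoding entirely: ask for JSON free-form and validate afterward, so failures surface instead of being masked by a grammar. A minimal Pydantic sketch of that comparison arm; the model name and schema are placeholders, not BoundaryML's BAML setup.

```python
import json
from pydantic import BaseModel, ValidationError
from openai import OpenAI

class Invoice(BaseModel):
    vendor: str
    total: float

client = OpenAI()

def extract_free_form(text: str, model: str = "gpt-4o-mini") -> Invoice | None:
    """Ask for JSON without constrained decoding, then validate with Pydantic.
    Parsing failures are returned as None instead of being hidden by a grammar."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": "Extract vendor and total as JSON from:\n" + text,
        }],
    )
    try:
        return Invoice(**json.loads(resp.choices[0].message.content or ""))
    except (json.JSONDecodeError, ValidationError):
        return None
```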

When AI Reviews AI: A Case Study in Benchmark Contamination (cafebedouin​.org). Staged Adversarial Review exposes benchmark contamination in SDE evaluation for LLMs in scientific discovery

Do a sanity check on your experiments (ehudreiter​.com). Sanity checks on data, model outputs, and evaluation to detect bugs in NLP/AI experiments

🧱 Diffusion & 1-bit Models

Power Up Diffusion LLMs: Day‑0 Support for LLaDA 2.0 (lmsys​.org). Diffusion LLMs via SGLang's Chunked-Prefill with LLaDA 2.0, showing day-0 support and streaming for 100B-scale models

What Happens When You Build an LLM Using Only 1s and 0s (towardsdatascience.com). BitNet b1.58 trains LLMs with ternary weights {−1, 0, 1}, enabling 1-bit-like efficiency and up to 9x throughput gains
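
The b1.58 trick is absmean quantization: scale the weight matrix by its mean absolute value, round, and clip to {−1, 0, +1}. A numpy sketch of that forward-pass step (straight-through-estimator training details omitted):

```python
import numpy as np

def ternary_quantize(W: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style absmean quantization: scale by mean |W|, round, and
    clip to {-1, 0, +1}. Returns the ternary matrix and the scale for dequant."""
    gamma = np.mean(np.abs(W)) + eps
    W_q = np.clip(np.round(W / gamma), -1, 1)
    return W_q, gamma

W = np.random.randn(4, 4)
W_q, gamma = ternary_quantize(W)
print(np.unique(W_q))  # only values from {-1, 0, 1}
```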

The AI Revolution: How Diffusion Models and New Architectures Will Replace Current LLMs (grigio.org). Diffusion models, sub-quadratic architectures, private thinking, continuous learning, and a continuous-thinking machine reshape AI beyond current LLMs

📚 Academic Research

Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models (arxiv:cs). MAHA: a hierarchical attention framework with multiscale aggregation and convex/Nash optimization for scalable LLM context modeling

Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers (arxiv:cs). Log-linear Sparse Attention (LLSA) enables hierarchical Top-K selection for long token sequences in Diffusion Transformers, boosting training and inference efficiency

Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models (arxiv:cs). Dynamic Rank Reinforcement Learning optimizes low-rank MHSA in LLMs via RL-guided rank selection and online perturbation bounds for efficient inference

SFTok: Bridging the Performance Gap in Discrete Tokenizers (arxiv:cs). SFTok: a discrete tokenizer with self-forcing guided reconstruction and debias-and-fitting training to boost image tokenization for high-resolution multimodal generation

IPCV: Information-Preserving Compression for MLLM Visual Encoders (arxiv:cs). IPCV compresses Vision Transformer tokens for MLLMs with Neighbor-Guided Reconstruction and Attention Stabilization to reduce compute without sacrificing text-critical cues
