The AI Engineer 23-12-2025
Amazon's potential OpenAI investment, new Codex and image models from OpenAI, the environmental impact of AI
📣 Headlines
• Amazon weighed a $10B OpenAI investment alongside supplying Trainium chips and AWS data-center capacity to deepen AI infrastructure ties.
• OpenAI rolled out new models for builders, with GPT-5.2-Codex for more capable software engineering and GPT Image 1.5 optimized for image editing and text rendering.
• Research found the 2025 AI boom is driving major environmental impact via surging CO2 emissions and water use.
• A UK survey reported that one-third of citizens have used AI for emotional support, raising safety and misinformation concerns.
• The creative sector's AI fight intensified as major labels embraced AI-generated music while UK creators pushed back, with only 3% backing an active opt-out copyright plan.
• Marketing automation firm MoEngage extended its fundraising with another $180M after a recent $100M round to fund AI expansion and growth in the US and Europe.
• US policy focus sharpened as Sen. Mark Kelly weighed in on taxing AI companies that eliminate jobs, data-center backlash, and bipartisan tech regulation.
• UK lawmakers questioned government use of Palantir after an investigation highlighted security concerns and potential US data-access risks.
🔧 Company Engineering Blogs
Gemini 3 Flash: frontier intelligence built for speed (deepmind.google). Gemini 3 Flash delivers frontier intelligence at speed, with Pro-grade reasoning and low latency for coding, analysis, and multimodal tasks
1,500+ PRs Later: Spotify's Journey with Their Background Coding Agent (engineering.atspotify.com). Spotify scales Fleet Management with AI coding agents to automate complex migrations across Java, YAML, and UI changes
How We Built Meta Ray-Ban Display: From Zero to Polish (engineering.fb.com). Explores Meta Ray-Ban Display development, AI glasses, display tech, UI patterns, and hardware design challenges
The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator (huggingface.co). Open evaluation standard for Nemotron 3 Nano using NeMo Evaluator, open tooling, configs, artifacts, and reproducible workflows
Google Research 2025: Bolder breakthroughs, bigger impact (research.google). Google Research 2025 highlights breakthroughs in generative models, quantum computing, Earth/health AI, education, and private ML tools, with Gemini, LAVA, MUVERA, and Parfait
🎥 Gemini & Multimodal
Building Speakeasy: From Python Prototype to Native macOS App (migueldavid.eu). Native macOS Speakeasy using AVSpeechSynthesizer for local, privacy-friendly text-to-speech with real-time highlighting
Asking Gemini 3 Flash To Watch A Video And Vividly Visually Describe It Scene By Scene & The Importance Of Media Resolution (blog.gdeltproject.org). Gemini 3 Flash analyzes videos at high media resolution for rich scene-by-scene descriptions and visual search capabilities
How to use Gemini Live audio as an interviewer for a software engineer's job (with video) (geshan.com.np). Use Gemini Live audio in Google AI Studio to interview backend engineers with prompts, modes, and audio-focused feedback
Gemini 3 Flash: Comparing Accuracy Vs Cost Of Different Media Resolutions For Video Analysis (blog.gdeltproject.org). Video analysis compares Low, Medium, High resolutions for Gemini 3 Flash, showing token costs and no clear accuracy gain on TV news content
Quoting Gemini thinking trace (simonwillison.net). Gemini thinking trace reviews code feedback and comparisons with Claude and ChatGPT, focusing on manifest.json and content.js
🏗️ Vibe Coding & Learning
You Don't Need to Spend $100/mo on Claude Code: Your Guide to Local Coding Models (aiforswes.com). Local coding models on high-RAM Macs offer cost savings, with tooling like MLX/Ollama and Qwen, compared to cloud tiers
This morning I was asked, if I vibe-coded all or parts of Hule. The asker wasn't accusing me, the... (mikka.is). Local LLM-assisted coding in Hule using Python tooling, CSS tweaks, and code reviews with Claude and Codex
Vibe Coding (davidbau.com). Vibe coding with LLMs: tests, metaprogramming, and towers of complexity for a Mandelbrot web page
Code Revolution: How AI-Driven IDEs and CLI Preferences are Shaping the Developer's Future (eliza-ng.me). AI-driven IDEs like Cursor reshape dev workflows, balancing integration with CLI preferences and market competition
The Strange Case of Engineers Who Dismiss AI (terriblesoftware.org). Engineers resist AI coding tools; Claude Code and Cursor boost project-wide understanding and refactoring across codebases
AI and Elaboration: Which Coding Patterns Build Understanding? (innoq.com). Elaboration-driven AI patterns for software learning: navigator, worked examples, teaching back, and attempting before verifying, discussed by Daniel Westheide at INNOQ in the context of Python/Java ecosystems
🧰 MCP & Tool Selection
Embedding-Based Tool Selection for AI Agents (zarar.dev). Embedding-based tool selection using pgvector in Postgres, OpenAI embeddings, and category expansions to scale AI agents' tools with Elixir code
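The post implements this in Elixir on top of pgvector; as a rough illustration of the core idea only, here is a minimal Python sketch that ranks tools by cosine similarity between their description embeddings and the query embedding, then exposes just the top-k to the agent. The tool names and vectors are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_tools(query_vec, tool_vecs, k=3):
    """Rank tools by how well their description embeddings match the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]  # hand only the top-k tool schemas to the agent

# Hypothetical 3-dim embeddings; real ones come from an embedding model.
tools = {"search_web": [0.9, 0.1, 0.0], "send_email": [0.0, 0.8, 0.6]}
print(select_tools([0.8, 0.2, 0.1], tools, k=1))
```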
Make the eyes go away (hexeditreality.com). Building an MCP server to bridge AI agents with i3, using Go, MCP SDK, and Ollama-enabled models
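The post builds its bridge in Go; for flavor, here is a hedged sketch of the same shape using the official Python MCP SDK instead. It assumes the `mcp` package is installed and that i3's `i3-msg` binary is on PATH; the server name and tool are illustrative, not the post's code.

```python
import subprocess
from mcp.server.fastmcp import FastMCP

server = FastMCP("i3-bridge")  # illustrative server name

@server.tool()
def run_i3_command(command: str) -> str:
    """Send a command string to i3 via i3-msg and return its output."""
    result = subprocess.run(["i3-msg", command], capture_output=True, text=True)
    return result.stdout or result.stderr

if __name__ == "__main__":
    server.run()  # serves MCP over stdio by default
```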
On AI Agents, MCP, and Tool Selection (acalustra.com). Global vs playbook AI agents, MCP tool selection, and balancing many tools for exploration vs few tools for reliable, single-task workflows
Architecting Agentic AI on AWS: From Intelligent Agents to Enterprise-Scale Execution (forgeahead.io). Explores architecting agentic AI on AWS with LLMs, Bedrock/SageMaker, Step Functions, and IAM for enterprise-scale execution
🧑‍💻 Coding Agent Tactics
Coding agents write 90% of my code now (ben.page). Coding agents like Claude Code or Amp now write the majority of the author's code, with the author guiding edits and tweaks
Trying GitHub Copilot coding agent (jlelse.blog). Explores GitHub Copilot Pro usage, PR-driven tasking, GoBlog test coverage, and AI-assisted coding in Go
What Actually Is Claude Code's Plan Mode? (lucumr.pocoo.org). Claude Code plan mode explored via prompts, tooling, and read-only workflow, contrasting with YOLO mode and manual planning
Claude Code skills not triggering? It might not see them. (blog.fsck.com). Claude Code skills may not trigger due to skill list size and system prompt limits in Claude Code 2.0.70, with a workaround using SLASH_COMMAND_TOOL_CHAR_BUDGET
Claude Code: stash (perrotta.dev). Claude Code stash feature for multi-line prompts enables temporary saving and auto-restoration during coding sessions
⌥← and ⌥→ hotkey navigation in Claude Code and Codex (banagale.com). Discussion of ⌥-key navigation for Claude Code and Codex via iTerm2; includes hex-send workflows and shortcuts
🔍 RAG & Retrieval
AI in Production Field Notes: Beyond "Just Call an LLM": Vimeo's Production Subtitle Engine (mlopsworld.com). Vimeo's subtitle pipeline uses layered, production-grade workflows with LLMs, chunking, validation, and async orchestration
How to Do Evals on a Bloated RAG Pipeline (towardsdatascience.com). Evaluates a bloated RAG pipeline with seed vs expanded context using RAGAS and DeepEval across GPT-5 models for faithfulness and relevance
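As a toy stand-in for what faithfulness-style metrics measure (the article uses real RAGAS/DeepEval LLM judges, not this), a crude lexical support score makes the seed-versus-expanded comparison concrete:

```python
def support_score(answer: str, context: str) -> float:
    """Fraction of answer terms that also appear in the retrieved context."""
    answer_terms = set(answer.lower().split())
    context_terms = set(context.lower().split())
    return len(answer_terms & context_terms) / len(answer_terms) if answer_terms else 0.0

seed_ctx = "The seed documents directly support the claim."
expanded_ctx = seed_ctx + " Plus loosely related padding that dilutes retrieval."
answer = "The seed documents support the claim."

# Same harness shape as the article: score the answer against each context.
for label, ctx in [("seed", seed_ctx), ("expanded", expanded_ctx)]:
    print(label, round(support_score(answer, ctx), 2))
```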
DocSummarizer Part 3 - Advanced Concepts: The "I Went Too Far 🤦" Deep Dive (mostlylucid.net). Deep dive into DocSummarizer: ONNX embeddings, RAG architecture, MMR, RRF, and hybrid retrieval for local, production-grade summarization
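Of the pieces named there, Reciprocal Rank Fusion is compact enough to sketch exactly. This is the textbook formula with the conventional k=60, not DocSummarizer's own code:

```python
def rrf(rankings, k=60):
    """Fuse ranked doc-id lists from different retrievers via RRF."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a vector-search ranking with a keyword (BM25) ranking:
print(rrf([["a", "b", "c"], ["c", "a", "d"]]))  # -> ['a', 'c', 'b', 'd']
```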
Stop Shoving Documents Into LLMs: Build a Local Summarizer with Docling + RAG (mostlylucid.net). Local, offline document summarization pipeline using Docling + ONNX embeddings, Ollama support, and Qdrant for structured, citation-grounded summaries
SatoriDB: vector database built from scratch (nubskr.com). SatoriDB presents an embedded, billion-scale vector database with two-tier RAM/SSD routing, HNSW-based clustering, custom caches, and CPU pinning for predictable latency
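SatoriDB's internals are custom, but the HNSW building block it reportedly sits on is easy to poke at off the shelf. A hedged sketch using the hnswlib package (not SatoriDB's API); parameters are illustrative:

```python
import numpy as np
import hnswlib

dim, n = 64, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # query-time recall/latency trade-off

labels, distances = index.knn_query(data[:1], k=5)
print(labels[0], distances[0])  # nearest neighbors of the first vector
```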
⚙️ Serving & Performance
Mini-SGLang: Efficient Inference Engine in a Nutshell (lmsys.org). Mini-SGLang offers a lightweight, OpenAI-compatible LLM inference engine with Radix Attention and Tensor Parallelism implemented in a ~5k-line Python codebase
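The prefix-reuse idea behind Radix Attention can be caricatured in a few lines. This toy token trie (not Mini-SGLang's actual radix tree, which compresses paths and handles eviction) shows why requests sharing a prompt prefix can skip recomputing KV state:

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.kv_handle = None  # stand-in for cached KV tensors

def cached_prefix_len(root, tokens):
    """How many leading tokens already have KV state in the cache."""
    node, matched = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node, matched = node.children[tok], matched + 1
    return matched

def insert(root, tokens):
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())
    node.kv_handle = object()  # pretend the KV cache for this prefix lives here

root = TrieNode()
insert(root, [1, 2, 3, 4])
print(cached_prefix_len(root, [1, 2, 3, 9]))  # -> 3 tokens reusable
```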
Small adventures with small language models (blog.engora.com). Explores small language models (SLMs) with Ollama and HuggingFace, evaluation, and performance on data-breach analysis tasks
Diagnose & Fix Painfully Slow Ollama: 4 Essential Debugging Techniques + Fixes (journal.hexmos.com). Diagnose Ollama performance: GPU heat, quantization, KV caching, and model comparisons with --verbose
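The same numbers that `--verbose` prints are exposed by Ollama's REST API, so throughput checks can be scripted. A short sketch, assuming a local server on the default port; the model name is illustrative:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Say hello.", "stream": False},
).json()

tokens = resp["eval_count"]            # generated tokens
seconds = resp["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.1f} tok/s")
```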
VRAM vs System RAM: What Actually Limits Running LLMs Locally? (dewanahmed.com). VRAM vs system RAM in local LLMs: how GPU memory and host memory shape feasibility and performance with Qwen3-Next-style models
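The feasibility question mostly reduces to arithmetic: weight bytes plus KV-cache bytes versus available memory. A back-of-envelope sketch with illustrative, not benchmarked, numbers:

```python
def weight_bytes(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8

def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # 2x for keys and values; assumes fp16/bf16 cache entries
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
print(f"8B model @ 4-bit: {weight_bytes(8, 4) / GIB:.1f} GiB of weights")
print(f"KV cache (32 layers, 8 KV heads, 128 head dim, 32k ctx): "
      f"{kv_cache_bytes(32, 8, 128, 32_768) / GIB:.1f} GiB")
```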
Framework Desktop: How to Expand your Unified Memory For LLM Use (boilingsteam.com). Expands unified memory for LLM use on Framework Desktop by adjusting BIOS and kernel parameters to 90 GB VRAM for large models
🔒 Security & Offense
How I Think About Agentic Risks (cloudberry.engineering). Thoughtful exploration of agentic risks, risk amplifiers, threat modeling, and mitigations for AI agents with input sanitization, data access, and human-in-the-loop approaches
Edition 1: AI for Offense Is Here. Defenders Aren't Ready. (boringappsec.substack.com). AI-native offense using Claude Code sub-agents and MCP; automation of the kill chain across ~30 targets discussed by Sandesh Mysore Anand
The Developer's Guide to LLM Security (thedataexchange.media). Steve Wilson (Exabeam), OWASP GenAI Security Project lead, discusses prompt injection, AI supply chains, guardrails, MCP, A-to-A, and the security of agentic LLMs
Red Hat Buys an AI Safety Company, Promises to Open Source Its Tech (itsfoss.com). Red Hat acquires Chatterbox Labs to integrate AI safety tooling and promises to open source the tech over time
When the AI Says No: Compliance vs. Security (gordonbeeming.com). GPT-5.2 refuses to write secrets to disk, Claude 4.5 Sonnet complies, highlighting security vs. compliance in AI tooling
🧠 Alignment & Auditing
Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers (lesswrong.com). Activation Oracles train LLMs to answer questions about their activations, enabling auditing tasks and misalignment detection using diverse prompts and Colab demos
Alignment Fine-Tuning: Lessons from Operant Conditioning (lesswrong.com). Neuroscientist applies operant conditioning to alignment fine-tuning in LLMs, proposing early RLHF, slow post-deployment updates, and cue-based feedback
Note (hsu.cy). RLVR and verifiable rewards drive longer, deeper LLM reasoning; benchmarks face overhang and jagged progress in 2025
How to Teach LLMs to Reason for 50 Cents (artificialintelligencemadesimple.com). Latent space reasoning, multi-judge architecture, and open-source latency-friendly LLM tooling to access model reasoning for 50 cents using the IQIDIS approach
Video and transcript of talk on human-like-ness in AI safety (joecarlsmith.com). Joe Carlsmith discusses human-like-ness in AI safety, critiquing alien-ness, corrigibility, and generalization in ML-built AIs
🧪 Evals & Reliability
HELM Arabic (crfm.stanford.edu). HELM Arabic evaluates Arabic benchmarks using the open HELM framework and collaborates with Arabic.AI on multilingual LLM capabilities
Structured outputs create false confidence (boundaryml.com). Structured outputs often degrade quality; a hands-on look at constrained decoding versus free-form parsing with OpenAI models and BAML tooling
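The comparison is easy to reproduce in miniature: the same extraction requested via OpenAI's structured-output mode and as free-form text parsed afterwards. A hedged sketch; the schema and model name are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()
schema = {
    "name": "person",
    "schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
        "required": ["name", "age"],
        "additionalProperties": False,
    },
}

constrained = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: Ada Lovelace, age 36."}],
    response_format={"type": "json_schema", "json_schema": schema},
)
free_form = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Extract name and age as JSON: Ada Lovelace, age 36."}],
)

print(json.loads(constrained.choices[0].message.content))  # guaranteed shape
print(free_form.choices[0].message.content)  # validate this side yourself
```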
When AI Reviews AI: A Case Study in Benchmark Contamination (cafebedouin.org). Staged Adversarial Review exposes benchmark contamination in SDE evaluation for LLMs in scientific discovery
Do a sanity check on your experiments (ehudreiter.com). Sanity checks on data, model outputs, and evaluation to detect bugs in NLP/AI experiments
🧱 Diffusion & 1-bit Models
Power Up Diffusion LLMs: Day-0 Support for LLaDA 2.0 (lmsys.org). Diffusion LLMs via SGLang's Chunked-Prefill with LLaDA 2.0, showing day-0 support and streaming for 100B-scale models
What Happens When You Build an LLM Using Only 1s and 0s (towardsdatascience.com). BitNet b1.58 trains LLMs with ternary weights {-1, 0, 1}, enabling 1-bit-like efficiency and up to 9x throughput gains
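The quantizer at the heart of BitNet b1.58 is small enough to state exactly: scale each weight matrix by its mean absolute value, then round and clip to the ternary set. A numpy sketch of the paper's absmean scheme:

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """Quantize weights to {-1, 0, 1} with a per-tensor absmean scale."""
    gamma = np.abs(w).mean()
    q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return q.astype(np.int8), gamma  # dequantize as q * gamma

w = np.random.randn(4, 4).astype(np.float32)
q, gamma = absmean_ternary(w)
print(q)                             # entries are only -1, 0, or 1
print(np.abs(w - q * gamma).mean())  # crude reconstruction error
```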
The AI Revolution: How Diffusion Models and New Architectures Will Replace Today's LLMs (grigio.org). Diffusion models, sub-quadratic architectures, private thinking, continuous learning, and a continuous-thinking machine reshape AI beyond current LLMs
🎓 Academic Research
Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models (arxiv:cs). MAHA: a hierarchical attention framework with multiscale aggregation and convex/Nash optimization for scalable LLM context modeling
Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers (arxiv:cs). Log-linear Sparse Attention (LLSA) enables hierarchical Top-K selection for long token sequences in Diffusion Transformers, boosting training and inference efficiency
Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models (arxiv:cs). Dynamic Rank Reinforcement Learning optimizes low-rank MHSA in LLMs via RL-guided rank selection and online perturbation bounds for efficient inference
SFTok: Bridging the Performance Gap in Discrete Tokenizers (arxiv:cs). SFTok: a discrete tokenizer with self-forcing guided reconstruction and debias-and-fitting training to boost image tokenization for high-resolution multimodal generation
IPCV: Information-Preserving Compression for MLLM Visual Encoders (arxiv:cs). IPCV compresses Vision Transformer tokens for MLLMs with Neighbor-Guided Reconstruction and Attention Stabilization to reduce compute without sacrificing text-critical cues