The AI Engineer 09-12-2025
Gemini 3 competitiveness, OpenAI acquires Neptune Labs, AWS unveils new tools, UK calls for AI regulation
📣 Headlines
• The frontier-model race heats up: Google’s Gemini 3 reportedly outperforms rivals, while OpenAI responds to the ‘code red’ competition by focusing on core ChatGPT improvements. Meanwhile, China’s DeepSeek is readying its R2 model, which could further intensify the U.S.-China AI rivalry in 2026 and boost open-source alternatives.
• OpenAI expands its internal training and evaluation stack by acquiring Neptune Labs to collect full-layer telemetry from its models, promising more systematic debugging and optimization of future LLMs.
• At AWS re:Invent 2025, Amazon unveiled new Trainium3 chips, Nova foundation models, and agent tooling for building LLM-powered applications across cloud and on-premises deployments, while a deeper partnership with Nvidia delivers integrated on-prem ‘AI Factories’ that combine Blackwell GPUs with services like Bedrock and SageMaker for data-sovereign workloads.
• In the UK, scores of MPs and peers called for binding regulation of the most powerful ‘frontier’ AI systems, warning about risks from potential superintelligence and urging mandatory safety standards for models exceeding certain capability thresholds and compute scales.
• Venture funding for startups surged in November: megarounds hit a three-year high, AI companies drew over $20 billion in capital, and the U.S. led most large deals. Some analysts warn that concentrated bets by Big Tech and investors may signal an AI valuation bubble that could deflate sharply if growth slows.
• AI infrastructure startups for automation raised early funding, with Curvestone AI securing $4M for no-code, audit-ready agentic automation frameworks targeting regulated industries such as finance and healthcare, and Antioch closing a $4.25M pre-seed round to accelerate autonomous robot testing using scalable cloud-based digital twins and AI-powered simulations.
🔧 Company Engineering Blogs
CodeMender: an AI agent for code security (deepmind.google). CodeMender uses Gemini Deep Think to autonomously patch and secure open-source code with AI-driven validation and proactive rewrites
How Agentforce Achieved 3–5x Faster Response Times While Solving Enterprise-Scale Architectural Complexity (engineering.salesforce.com). How Salesforce refactors deterministic and LLM tasks in Apex, reduces latency 75%, and deploys multi-brand Agentforce agents for tailored brand voice
Your stack, your rules: Introducing custom agents in GitHub Copilot for observability, IaC, and security (github.blog). Partner-built Copilot agents extend observability, IaC, and security workflows across terminals, editors, and GitHub
We Got Claude to Fine-Tune an Open Source LLM (huggingface.co). Demonstrates Claude using Hugging Face Skills to fine-tune open-source LLMs such as Qwen3-0.6B with SFT, DPO, and GRPO
Titans + MIRAS: Helping AI have long-term memory (research.google). Titans and MIRAS enable long-term memory in AI with on-the-fly learning, surprise metrics, and deep memory modules
🌍 AI Research & Studies
NeurIPS 2025 - Friday Notes (blog.matthewbrunelle.com). NeurIPS 2025 Friday notes spotlight posters on medical data, GRPO, DRPO, NOVA benchmark, open-world reasoning, and wearables
AI Is still making code worse: A new CMU study confirms (blog.robbowley.net). CMU study on AI-assisted coding (Cursor) shows no sustained code quality gains across 807 open-source projects using SonarQube analysis through mid-2025
Marius Hobbhahn on the race to solve AI scheming before models go superhuman (80000hours.org). Marius Hobbhahn discusses AI scheming, OpenAI collaboration, anti-scheming training, and detection challenges with Rob Wiblin
AI Tool Fixes Kids' Speech Issues While Preserving Identity (cs.cmu.edu). CMU's ChiReSSD AI tool reconstructs children's speech in their own voice to fix pronunciation and preserve identity
🧬 Multimodal & Generative
BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation (opencv.org). BlockVid uses block diffusion and KV-cache improvements to generate minute-long videos with high fidelity and stability using semantic-aware memory
Open Source AI with Alibaba Qwen (akashbajwa.co). Open Source AI with Alibaba Qwen explores Qwen model family, vision-language capabilities, coding aids, agent tools, deployment via Model Studio, and Open Source vs commercial strategies with Alibaba Cloud
Fine-Tuning Phi-3.5 Vision Instruct (debuggercafe.com). Fine-tuning Phi-3.5 Vision Instruct for receipt OCR using LoRA on SROIEv2 with Hugging Face tools
LeCun’s Alternative Future: A Gentle Guide to World-Model AI [Guest] (artificialintelligencemadesimple.com). Explores LeCun’s world-model AI: joint-embedding, energy-based models, regularized learning, MPC, and AI co-creation frameworks with case studies
📈 Agents in Practice
From Prompt To Product: How I Turned One Claude Project Into A Client Getting Tool (ai30dc.substack.com). Turning a Claude project into a public artifact and lead magnet using Prompt Whispering, artifacts, and no-code deployment
AI Agents for Productivity: Beyond LLMs & Chatbots (cognitivetoday.com). Explores AI agents beyond LLMs, outlining autonomy, memory, tools, and practical workflows to boost productivity
What Does It Take to Build a Statistics Agent? Lessons from the Trenches (dspn.substack.com). AI-powered statistics agent guides researchers through experiment design with prompt engineering and domain knowledge
LLMs Need Better Executive Function (robdearborn.com). Executive-function gaps in LLMs like GPT-5.1, Gemini 3, and Opus 4.5 hinder sustained task completion and reliability
💻 AI for Coding
How to Turn Your LLM Prototype into a Production-Ready System (towardsdatascience.com). Practical guide to production-ready LLMs: prompt engineering, structured outputs, tools, and guardrails using LangChain/CrewAI
A̶I̶ ̶S̶l̶o̶p̶ ̶i̶n̶ ̶C̶o̶d̶e̶ The Future is Here, If You Are Ready (subbu.org). AI-native IDEs (Cursor, Kiro, Copilot, Antigravity, Claude Code) enable intent-driven code with closed-loop reasoning and emphasis on specs, tests, and CI/CD
Why Synthesis Coding Still Writes Code in the Age of LLMs (rajiv.com). Synthesis Coding with LLMs: generate and test code artifacts, not just prompts, using Python, tests, and governance patterns
How to Start AI Coding (pavpanchekha.com). Practical guide to AI-assisted coding with agents, CLI tools, and vendors like Amp, Codex and Claude for refactoring, testing, and automation
Quoting David Crespo (simonwillison.net). Quoting David Crespo on using Claude for reading and editing code in a repo, and how to reset conversations to manage costs
Latest LLMs in the Test: GPT 5.1 Codex Max vs. Gemini Pro 3 vs. Opus 4.5 (hansreinl.de). Full-stack MVP coding benchmark: GPT-5.1-Codex-Max, Gemini 3 Pro, and Claude Opus 4.5 build Speakit MVP, comparing speed, code quality, and feature completeness
Day 6 – Robust code generation combining grammars and LLMs (raku-advent.blog). Robust code generation combining grammars and LLMs using GBPI, LLM::Graph, and ML packages in Raku for executable recommender workflows
🖥️ Local LLM Setups
Home-made LLM Recipe (voidsec.com). Local LLM setup with Ollama and Open WebUI on Mac Studio vs workstation, hardware costs, prompts, and model benchmarking
SoupaWhisper: How I Replaced SuperWhisper on Linux (ksred.com). Open-source local voice dictation on Linux using Whisper (faster-whisper), Python, arecord, xclip, and xdotool with systemd service
I Built an AI Chief of Staff That Runs Entirely on My Laptop (siddharthbharath.com). Local AI vault: building a private, on-device chief of staff using Parallax, ChromaDB, Gmail/Drive sync, and a RAG engine on a MacBook
AMD Radeon Instinct MI50-32GB: best AI card for beginners ? (wtarreau.blogspot.com). Willy Tarreau explores AMD Radeon Instinct MI50-32GB for AI, shows setup, cooling mods, ROCm llama.cpp tests, and performance insights
How to run Ollama with docker compose and GPU support (sleeplessbeastie.eu). GPU-enabled Ollama setup using Docker Compose for accelerated model inference with Nvidia devices
Local models are not there (yet) (simonpcouch.com). Local models on laptops underperform for coding agents; frontier models excel, and local costs vary; recommends higher-capability models for reliable refactoring tasks
🔎 Search, RAG & Vectors
Vector Stores for RAG Comparison (glukhov.org). Comprehensive comparison of vector stores for RAG using Pinecone, Chroma, Weaviate, Milvus, Qdrant, FAISS, and pgvector with benchmarks
Using Ollama Web Search API in Python (glukhov.org). Ollama web search integration enables real-time information gathering for local LLMs with Python tooling
Using LLMs for web search (ankursethi.com). Explores using LLMs for web search (Claude, ChatGPT, Gemini), citing web sources, grounding, search workflows, and personal workflows for code and research
The Architecture Behind Web Search in AI Chatbots (towardsdatascience.com). Explores two-stage web search in AI chatbots, query rewriting, chunking, embedding retrieval, and GEO scoring with SEO implications
Product Quantization (arpitbhayani.me). Explains Product Quantization for compressing high-dimensional vectors, subspace coding, PQ codebooks, and distance computations with Python snippets
Where's the semantic search Scott? (mostlylucid.net). Discussion of implementing semantic search with Docker, Qdrant, and environment-driven configuration on Mostly Lucid
From trees to graphs: speeding up vector search 10x with Hannoy (blog.kerollmops.com). Meilisearch-backed hannoy: an LMDB-based, graph ANN vector store in Rust delivering faster indexing and 10x search speed
🤝 Agent Architectures & MCP
SLM-default, LLM-fallback pattern with Agent Framework and Azure AI Foundry (strathweb.com). Pattern ties local on-device SLMs with cloud LLM fallbacks using Agent Framework and Azure AI Foundry, featuring Phi-4-mini, confidence gating, and Python wiring
The Inverted Agent (jlowin.dev). SEP-1577 enables MCP sampling with tools, flipping agent architecture to server-driven control using FastMCP and structured outputs
Building JARVIS Properly - Phase 6: Vision Awakens (The Power of Protocol) by Robert Griffiths (blog.scottlogic.com). Phase 6 enables JARVIS to interact with the real world via MCP tools, Python, Obsidian vaults, and a governance-driven ReAct loop
Why MCP Shouldn’t Wrap an API One-to-One (nordicapis.com). MCP should leverage intent-driven workflows rather than a one-to-one API mirror for AI agents
Connecting AI to biology: Model Context Protocol (embl.org). MCP servers connect AI to EMBL-EBI databases, enabling reliable, up-to-date biology data access for LLMs and workflows
Context Engineering for AI Agents: Part 2 (philschmid.de). Part 2 explores context rot prevention, multi-agent coordination, and small toolsets with Agent Harness and GoLang-inspired sharing
A Practical Approach to Smart Tool Retrieval for Enterprise AI Agents (next.redhat.com). Smart Tool Retrieval with Tool2Vec for enterprise AI agents, using MCP alignment and Stable ToolBench to improve tool selection
Treating your agents like microservices (stackoverflow.blog). Explores multi-agent architectures as microservices, infrastructure challenges, interoperability, and decentralized scalability with Guillaume De Saint Marc and AGNTCY
📑 Prompting & Structured Output
10 AI Prompting Techniques (for Effective Writing) ... (garethdyke.substack.com). Ten AI prompting techniques for writing: chain-of-thought, zero/few-shot, prompt chaining, ToT, meta prompting, generated knowledge, least-to-most, self-consistency
Does JSON Prompting Actually Work? Tested with Nano Banana (chasejarvis.com). Explores JSON prompting in AI art with Nano Banana and Midjourney, testing structure vs. natural prompts
Dis Is Weird (developsense.com). Reflections on how AI text modification tools handle language, prompts, and reliability across conversations and documents
Generating Relevant Random JSON with Chrome AI (raymondcamden.com). Chrome AI on-device prompts generate JSON samples from a template, with schematized output and JSON schema integration
The Difference Between the OpenAI responses.create() Method and responses.parse() Method (jamesmccaffreyblog.com). Differences between OpenAI responses.create() and responses.parse() for JSON output using JSON Schema or Pydantic, with a demo comparing outputs in a chess-tournament query
🧩 Model Internals & Training
Creating a Llama or GPT Model for Next-Token Prediction (machinelearningmastery.com). Learn to build a decoder-only Llama/GPT-style model for next-token prediction using PyTorch, RoPE, GQA, SwiGLU, and a pretraining head
Cross Layer Transcoders for the Qwen3 LLM Family (lesswrong.com). Explores sparse autoencoders and cross layer transcoders (CLTs) for Qwen3 LLMs with BLUELightAI’s CLT features and TDA tools
Machine Learning (danieldk.eu). Explores Dish Activation, attention mechanisms, logits, quantization, and multi-GPU model parallelism with Tensor Parallelism on machine learning models
DeepSeek V3.2 (aarnphm.xyz). DeepSeek V3.2 introduces Sparse Attention (DSA) and FP8 indexing for efficient memory and FLOPs, detailing MHA/MQA training and inference, compressed caches, and Hadamard transforms
Support FSDP2 as A Training Backend for Miles (lmsys.org). Miles adds FSDP2 as a flexible training backend, enabling DTensor-based sharding, true on-policy training, data packing, and CP/DP optimization for Qwen3-Next and VLM RL
A Technical Tour of the DeepSeek Models from V3 to V3.2 (magazine.sebastianraschka.com). Technical tour of DeepSeek V3 to V3.2, RLVR, MLA, DSA, GRPO updates, and self-verification/self-refinement for open-weight models
IBM Granite and The Small Model Thesis [Livestream] (artificialintelligencemadesimple.com). IBM Granite and Small Model Thesis: hybrid architectures, inference scaling, modular multimodal approach with Melia framework, and safety governance
🛡️ Alignment & Safety
Is Friendly AI an Attractor? Self-Reports from 22 Models Say Probably Not (lesswrong.com). 22 frontier models tested for self-modification preferences; labs diverge on alignment as an attractor, with Grok showing near-zero alignment pull
Reward Mismatches in RL Cause Emergent Misalignment (thezvi.wordpress.com). RL reward mismatches trigger emergent misalignment; inoculation, diversity in RLHF, and removing reward hacking shown as mitigations
Why Hallucinations Will Never Be Eliminated (jurgengravestein.substack.com). Explores the limits and unreliability of large language models, hallucinations, prompt engineering, multi-turn failures, and safety risk in AI systems
Do LLMs cheat on benchmarks (ehudreiter.com). LLMs cheat on benchmarks via data contamination and reward hacking, urging real-world impact measurement in software development
Theory and AI Alignment (scottaaronson.blog). Theoretical CS views on AI alignment, watermarking LLM outputs, backdoors, interpretability, and complexity theory insights
📚 Academic Research
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models (arxiv:cs). DeepSeek-V3.2 details an open LLM with sparse attention, scaled RL, and agentic training data. Engineers gain long-context efficiency plus stronger reasoning, coding, and tool-use performance
Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction (arxiv:cs). Nex-N1 introduces a scalable ecosystem for training agentic LLMs across simulated and real environments. It offers infrastructure, tools, and benchmarks for building autonomous AI agents
Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space (arxiv:cs). Natural Language Actor-Critic trains LLM agents using a language-generating critic instead of scalar rewards. This enables off-policy learning with rich feedback for challenging tool-use tasks
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs (arxiv:cs). TRIM-KV learns per-token retention scores to evict unimportant KV-cache entries under memory budgets. Engineers get longer contexts, faster inference, and interpretability without changing LLM architectures
Multimodal Reinforcement Learning with Agentic Verifier for AI Agents (arxiv:cs). Argos introduces an agentic multimodal verifier scoring answers, grounding, and reasoning with tools and models. It trains vision-language agents reducing hallucination and reward-hacking during RL
👋 Before you go...
I've got a big favor to ask - keeping Blaze running isn't expensive, but it does all add up, so I'm asking readers like you to help, if you can, by joining the Patreon page. Nothing flashy, just a way for folks who find value in these newsletters to chip in a little each month.
If you are getting value from Blaze, checking this out would mean the absolute world. But if you can't contribute, no worries - the newsletters keep coming either way. Thanks for reading and being part of this nerdy corner of the internet. All the best for the coming week - Alastair.