
The AI Engineer

November 4, 2025

The AI Engineer 04-11-2025

Musk's Grokipedia, OpenAI's ambitious IPO plans, concerns over AI safety

📣 Headlines

• Elon Musk’s Grok and its AI-built "Grokipedia" face scrutiny over accuracy, sourcing and political tilt after academics and coverage highlighted reliability concerns; see In Grok we don’t trust and What could go wrong?.

• Reports say OpenAI is planning a 2026–27 IPO at a valuation of up to $1T, while investors worry about corporate AI spending and the OpenAI–Microsoft relationship as Microsoft’s stock slipped.

• MIT unveiled TX‑GAIN, a GPU‑accelerated supercomputer delivering roughly two exaflops of AI compute to speed up workloads in medicine, climate and defense research.

• AI safety and moderation issues surfaced: Google pulled Gemma AI Studio after a reported fabricated assault claim, Character.AI will ban minors from chatbot conversations, creators flagged odd YouTube takedowns, and reports show teenage boys using personalized AI for therapy and romance.

• Wearable AI raised privacy and social concerns as Kevin Rose proposed an emotional/social impact test for hardware (test idea), and reviews of Ray‑Ban Meta Gen 2 and Oakley Meta Vanguard highlight intrusive on‑device AI features.

• Web security is shifting as captchas evolve: coverage examines disappearing captchas, reCAPTCHA, Cloudflare Turnstile and Arkose Labs, and how AI is changing bot‑defense challenges.

• Thermo Fisher agreed to acquire Clario for up to $9.4B to integrate an AI‑enabled digital endpoint platform into clinical trials and accelerate drug development (Clario acquisition).

🔧 Company Engineering Blogs

AI Infrastructure and Ontology (blog.palantir.com). Palantir and NVIDIA bring NVIDIA CUDA-X, Nemotron models and Ontology integration to Foundry and AIP for edge computing and operational AI

Building a GenAI Agent for Partner-Guest Messaging (booking.ai). GenAI agent for partner-guest messaging using Python, LangGraph, FastAPI and OpenAI GPT-4o mini, with PII guardrails and multilingual retrieval

Accelerating discovery with the AI for Math Initiative (deepmind​.google). AI for Math Initiative partners with top institutions to accelerate mathematical research using Gemini Deep Think, AlphaEvolve, and AlphaProof

Measuring what matters: How offline evaluation of GitHub MCP Server works (github​.blog). Offline evaluation pipeline for GitHub MCP Server evaluates tool selection, argument accuracy, and multi-tool flow readiness

🧑‍💻 LLMs for Software Engineering

Introducing SWE-1.5: Our Fast Agent Model (simonwillison​.net). SWE-1.5, a fast, frontier-size coding model by Windsurf with Cerebras inference at up to 950 tok/s

Composer: Building a fast frontier model with RL (simonwillison​.net). Cursor debuts Composer 1, a fast MoE model for RL-driven software engineering with tooling integration and parallel agent execution

Efficient and green LLMs for software engineering (chuniversiteit​.nl). Efficient and green LLMs for software engineering: data, model, system, and program-centric techniques for cost-effective code tasks

My current development tooling setup (danieldemmel​.me). Developer tooling for LLM integration, guardrails, local vs cloud models, MoE runs, and workflow automation in a startup context

🔎 RAG, Retrieval & Search

How Perplexity Built an AI Google (blog​.bytebytego​.com). How Perplexity built an AI-powered, model-agnostic RAG system with Vespa, ROSE, Sonar, and Bedrock

Announcing llm-docs-builder: An Open Source Tool for Making Documentation AI-Friendly (mensfeld​.pl). llm-docs-builder transforms Markdown docs into AI-friendly, noise-reduced content, generates llms.txt indexes, and Docker-supported tooling to reduce RAG costs

Graph RAG vs SQL RAG (towardsdatascience​.com). Graph vs SQL RAG: evaluating LLM retrieval on F1 data using SQL and graph databases with GPT-5, GPT-4, and GPT-3.5-turbo

Unifying Retrieval and Reranking with Embedding Models, Learning to Reason for Recommendations, and More! (recsys​.substack​.com). Unified embedding and reasoning advances for retrieval and reranking in recommender systems using E²RANK, LIMRANK, and DeepAgent

How I'm Building a Context-Aware Retriever to Boost RAG Quality (Part 2: Implementation) (egpivo​.github​.io). Context-aware retriever using DSPy and FastMCP to boost RAG quality with file discovery, boosted chunk search, and reflection for ContractNLI datasets

Combining OpenAI API Responses web_search and file_search Tools (jamesmccaffreyblog​.com). Combining OpenAI file_search and web_search tools to augment answers with retrieval-augmented generation (RAG) in Python

✅ Evaluation & Reliability

Best Practices and Methods for LLM Evaluation (databricks​.com). Explore metrics, datasets, and frameworks for evaluating LLMs, including automated tools, LLM judges, human assessments, and Mosaic AI Agent Evaluation

Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model (gilesthomas.com). Evaluating a fine-tuned LLM with Ollama and Llama 3 on an RTX 3090; scripts, seeds and non-determinism considerations

LLM Judges aren’t the shortcut you think (softwaredoug​.com). Insights on LLM judges in search: limitations, data needs, overfitting risks, and a shift toward learning-to-rank and feature generation

Stable LLM Inference (gojiberries​.io). Stable LLM inference using zero-temperature generation, margins, and infrastructure fixes for reproducible prompts and deterministic outputs
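Zero-temperature generation, as the post describes, collapses sampling to an argmax. A minimal sketch (toy logits and a toy sampler of my own, not the post's code) of why that makes repeated runs deterministic when the logits are identical:

```python
import math
import random

def softmax(logits):
    # numerically stable softmax over a list of floats
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    # temperature -> 0: all probability mass collapses onto the argmax,
    # so every run picks the same token given identical logits
    return max(range(len(logits)), key=lambda i: logits[i])

def sampled_token(logits, temperature=1.0, rng=random):
    # temperature > 0: near-ties stay random across runs
    probs = softmax([x / temperature for x in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.9, 0.5]  # a near-tie between tokens 0 and 1
assert all(greedy_token(logits) == 0 for _ in range(100))
```

The post's further point still applies: even at temperature zero, nondeterministic kernels or batching can perturb the logits themselves, which is where the margin checks and infrastructure fixes come in.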

🚀 Inference Systems & Performance

DGX Spark and Mac Mini for Local PyTorch Development (sebastianraschka​.com). DGX Spark and Mac Mini compared for local PyTorch development, inference, benchmarking, and small-scale training with KV-cache and MPS limitations

An intro to the Tensor Economics blog (lesswrong​.com). Tensor Economics analyzes LLM inference economics: FLOP/s per $, memory-bound decoding, KV cache, HBM latency, embeddings vs. inference costs, MoE advantages, interconnects (NVLink, InfiniBand), batching, OpenRouter usage, and RLaaS via LoRA adapters
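The memory-bound decoding argument above reduces to one line of arithmetic: at batch size 1, each decode step streams every model weight from HBM once, so bandwidth rather than FLOP/s caps the token rate. A back-of-envelope sketch (hypothetical hardware numbers, not figures from Tensor Economics):

```python
def decode_tokens_per_s_upper_bound(hbm_bandwidth_gbs, weight_bytes_gb):
    # batch size 1: every decode step reads all model weights once,
    # so the memory system, not compute, bounds tokens/s
    return hbm_bandwidth_gbs / weight_bytes_gb

# e.g. ~140 GB of 16-bit weights for a 70B model on ~3350 GB/s of HBM:
bound = decode_tokens_per_s_upper_bound(3350.0, 140.0)  # ~24 tokens/s ceiling
```

Batching amortizes that weight read across many sequences, and MoE models activate only a fraction of their parameters per token, which is why both change the economics so much.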

Optimizing GPT-OSS on NVIDIA DGX Spark: Getting the Most Out of Your Spark (lmsys​.org). NVIDIA DGX Spark boosts GPT-OSS 20B/120B with SGLang, benchmarks, Docker setup, and Open WebUI integration

SGLang-Jax: An Open-Source Solution for Native TPU Inference (lmsys​.org). SGLang-Jax enables native TPU inference with Jax/XLA, Ragged Paged Attention, MoE kernels, speculative decoding, and overlap scheduling

TIL: The two phases in LLM inference (olshansky​.info). Two inference phases in LLMs: prefill and decode, with KV cache, TTFT, TPS, using kernel compute and memory bandwidth
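The two phases in that post can be sketched as a toy generation loop (the model forward pass is a stand-in lambda of my own; timings are illustrative, not benchmarks): prefill processes the whole prompt and ends at the first emitted token (TTFT), then decode emits one token per step (TPS):

```python
import time

def generate(prompt_tokens, n_new, step_fn):
    """Toy two-phase loop: prefill, then decode.

    step_fn(tokens) -> next_token stands in for a real forward pass;
    in practice prefill also populates the KV cache so each decode
    step only attends over cached keys/values.
    """
    start = time.perf_counter()
    tokens = list(prompt_tokens)        # prefill: whole prompt in one pass
    tokens.append(step_fn(tokens))      # first token ends the prefill phase
    ttft = time.perf_counter() - start  # time to first token (TTFT)
    for _ in range(n_new - 1):          # decode: one token per step
        tokens.append(step_fn(tokens))
    tps = n_new / (time.perf_counter() - start)  # tokens per second (TPS)
    return tokens, ttft, tps

# usage with a trivial stand-in "model" that echoes last token + 1:
out, ttft, tps = generate([1, 2, 3], 5, lambda toks: toks[-1] + 1)
```

Prefill is compute-bound (one big batched pass over the prompt), while decode is memory-bandwidth-bound (one small pass per token), which is why the two phases are optimized so differently.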

🧠 Interpretability & Reasoning

Steering Evaluation-Aware Models to Act Like They Are Deployed (lesswrong​.com). Activation steering of evaluation-aware LLMs to induce deployment-like behavior via contrastive prompts, synthetic fine-tuning, and model organism training using Llama Nemotron

OII researchers to explore LLM interpretability at EMNLP 2025 (oii​.ox​.ac​.uk). OII researchers from OxRML present two papers on LLM safety, interpretability and narrative mapping at EMNLP 2025 in Suzhou

circuit tracing (aarnphm​.xyz). Autonomous circuit tracing for causal influence in language models using transcoders, feature-level gradients, and attribution graphs

Breakthrough to True Thinking (thorsteinnsiglaugsson​.substack​.com). Discipline in thinking: CLR-guided CRTs for humans and LLMs, tool-assisted reasoning, memory, and feedback to prevent errors

📚 Training Notes & Tutorials

Watch the recordings from my Python + AI series (blog​.pamelafox​.org). Watch a nine-part Python + AI live series covering LLMs, embeddings, RAG, vision, structured outputs, safety, tool calling, agents, and MCP using Python

Writing an LLM from scratch, part 25 -- instruction fine-tuning (gilesthomas​.com). Instruction fine-tuning techniques, data handling, batch collation, masking padding, and implementation notes from Raschka's LLM-from-scratch chapter

Trying out other people’s ideas (snimu.github.io). Explores multi-embedding techniques, value embeddings, and per-layer transformations in modded-nanogpt experiments for speed and loss targets

📚 Academic Research

Emu3.5: Native Multimodal Models are World Learners (arxiv:cs). BAAI’s Emu3.5 is an open-source multimodal world model trained on 10T interleaved tokens, enabling long-horizon VL generation. DiDA delivers 20x faster inference without quality loss

Continuous Autoregressive Language Models (arxiv:cs). CALM replaces next-token prediction with next-vector prediction by compressing K tokens into continuous codes via autoencoding. Fewer steps yield comparable accuracy at lower compute cost
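The CALM abstract's idea can be written out; in notation of my choosing (not necessarily the paper's), an autoencoder compresses each chunk of K tokens into one continuous vector, and the language model predicts the next vector instead of the next token:

```latex
z_t = E\bigl(x_{tK+1}, \dots, x_{(t+1)K}\bigr)       % encoder: K tokens -> one vector
\hat{z}_{t+1} = f_\theta(z_1, \dots, z_t)            % next-vector prediction
(\hat{x}_{tK+1}, \dots) = D(\hat{z}_{t+1})           % decoder: vector -> K tokens
```

A sequence of T tokens then takes T/K autoregressive steps rather than T, which is where the lower compute cost comes from.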

SpecAttn: Speculating Sparse Attention (arxiv:cs). SpecAttn piggybacks on draft-model attention during speculative decoding to predict important tokens and prune KV cache accesses. Training-free sparsity yields speedups while keeping perplexity acceptable

Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model (arxiv:cs). Google DeepMind revisits encoder-decoder LLMs, scaling them with recipes and comparing against decoder-only baselines. Results show competitive performance and better inference efficiency, challenging architecture choices

Tongyi DeepResearch Technical Report (arxiv:cs). Tongyi DeepResearch is an open-source agentic LLM for long-horizon web research, trained end-to-end with synthetic data. It achieves SOTA on browsing benchmarks with efficient tooling

✨ Before you go...

You can now follow posts on the brand new AI-dedicated Mastodon and Bluesky feeds. You can also search all blog posts shown here (and MUCH more) over at https://blognerd.app.

Finally, I hope that this newsletter brings you some value. Everything here is offered free and always will be, and I'd be so grateful if you'd consider supporting me over on Patreon to help keep blaze newsletters going (if you can't afford to support financially, you can still follow there for free).

Thanks a million, and have a great week! Alastair.
