AI Tech Daily - 2026-05-25

type

Post

status

Published

date

May 26, 2026 01:56

slug

ai-daily-2026-05-25

summary

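Today marks a historic moment for AI: OpenAI and DeepMind almost simultaneously announced breakthroughs in mathematical reasoning, with AI for the first time autonomously solving problems that had stumped mathematicians for decades. Meanwhile, the industry shows a split picture. On one side, DeepSeek cut prices sharply and Google reworked its search box, pushing the cost of AI applications lower; on the other, Microsoft halted internal Claude Code use over runaway costs and Uber exhausted its full-year AI budget early, exposing the serious cost challenge of deploying AI at scale. Today's issue covers 8 web stories, 2 GitHub projects, 5 papers, 23 KOL posts, and 1 podcast episode.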

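📊 Today's Overview

🔥 Trend Insights

[A "Sputnik moment" for AI mathematical reasoning]: The breakthroughs by OpenAI and DeepMind on classic math problems mark a qualitative shift in AI reasoning, from "assistive tool" to "autonomous discoverer". They validate the potential of LLMs in formal reasoning and suggest that AI will accelerate scientific discovery, making this today's most significant milestone.

[The "cost wall" and the "price war" in AI deployment]: The internal cases at Microsoft and Uber expose the real cost of rolling out AI at scale, while DeepSeek's permanent price cut signals that the market is entering a phase of brutal price competition. Going forward, an AI company's core advantage will rest not only on model capability but also on inference efficiency and cost control.

[Agent evaluation moves from "qualitative" to "reproducible"]: Several of today's papers and articles focus on the reproducibility and standardization of agent evaluation. From disclosure audits to multi-dimensional evaluation frameworks, the field is working toward a more rigorous and trustworthy way to measure agent capability, which is essential for the healthy development of agent technology.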

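🐦 X Posts

📈 Hot Topics and Trends

Microsoft bans internal engineers from Claude Code over cost; Uber's full-year AI budget runs out in April – Microsoft had given thousands of engineers access to Claude Code, then cancelled nearly all licenses after token bills spiraled out of control. Uber's CTO said the full-year budget was spent by April: 84% of engineers use AI, 70% of committed code comes from AI, and heavy users spend $500–$2000 per month. Nvidia VP Bryan Catanzaro also acknowledged that his team's compute costs far exceed employee salaries @Ric_RTP (independent blogger)

David Sacks (former PayPal COO / cloud infrastructure CEO): AI drove a 14x year-over-year increase in GitHub commits while software engineering jobs rose rather than fell – AI lowers the cost of coding and spawns more applications and jobs, he argues, questioning the claim that "AI is causing mass unemployment" @DavidSacks

AI leads consulting clients to question the value of human advice; McKinsey and other firms reprice – According to Polymarket, traditional consulting firms face AI-driven pricing pressure @Polymarket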

Cathie Wood (founder of ARK Invest) predicts AI agents will push the GPU:CPU ratio from 4–5:1 down to 1:1 – Citing OpenAI CFO Sarah Friar, she argues that agentic AI drives CPU demand, benefiting Intel and similar companies @MilkRoadAI

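🔧 Tools and Products

Together AI releases a Blackwell-optimized inference stack, ranking first in several Artificial Analysis categories – Includes a new attention kernel and outpaces other GPU endpoints on models such as Kimi 2.6 and MiniMax @vipulved

Tom Dörr (independent developer) publishes a build-an-AI-agent-from-scratch tutorial and a self-hosted orchestration tool – The tutorial starts from first principles; the orchestration tool has no external dependencies @tom_doerr | @tom_doerr

OpenClaw 2026.5.22 released: model load latency down to 5 ms, npm dependencies locked – Startup path optimized and the Windows install path hardened @openclaw

CodeWhale released: an agent harness for open-source and open-weight models – Formerly deepseek-tui, it aims to become the gold standard for open-model agents @goodhunt

StepFun launches a meeting-notes assistant built on Step Plan – Paste in messy notes and it automatically extracts to-dos and action items, using the Step 3.5 Flash model @StepFun_ai

Bittensor-based ChatGPT alternative launches in alpha at 1/250 of the cost – Supports file Q&A, persistent memory, and uncensored output, running on the chutes.ai subnet @jaltucher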

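⚙️ Technical Practice

Percy Liang's team preregistered a loss of 2.252 for a 129B MoE; the actual training run landed at 2.234 – The 1e23-FLOP run shows that model performance can be predicted in advance @percyliang

DeepMind AI agent autonomously solves 9 open Erdős problems, along with 44 OEIS conjectures – Includes two problems unsolved for 56 years; each problem cost a few hundred dollars, with automated LLM-to-Lean formalization and verification throughout DeepMind | @AISafetyMemes | @Cointelegraph | @AcerFur

The SOUL.md file, defined: 8 key sections for an AI agent's identity and principles – Sections include identity, core truths, worldview, and voice; 30–80 lines are enough to change agent behavior @akshay_pachaar

RACO paper accepted as an ICML2026 Oral (top 0.7%): conflict-averse optimization for multi-objective LLM fine-tuning – Presents a counterintuitive theoretical speedup and a better Pareto front @PeterLauLukCh

New preprint studies how evolutionary coding agents evolve – Titled "What Do Evolutionary Coding Agents Evolve?", with paper and blog follow-up @maxzimmerberlin

InsForge Skills+CLI optimizes Claude Code: tokens drop from 10.4M to 3.7M, cost from $9.21 to $2.81 – Local and open source; achieves zero errors through context engineering @RodmanAi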
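
As a quick check on the InsForge figures above, the relative savings can be computed directly. This is a minimal sketch: the helper name `pct_reduction` is hypothetical, and the only inputs are the token and dollar figures quoted in the post.

```python
def pct_reduction(before: float, after: float) -> float:
    """Return the percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Figures quoted in the post: tokens 10.4M -> 3.7M, cost $9.21 -> $2.81.
token_cut = pct_reduction(10.4e6, 3.7e6)
cost_cut = pct_reduction(9.21, 2.81)

print(f"tokens: {token_cut:.1f}% fewer, cost: {cost_cut:.1f}% lower")
```

On these numbers, tokens fall by roughly 64% and cost by roughly 70%, so the reported cost drops slightly faster than the token count.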

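⭐ Featured

AI solves an 80-year-old math problem: twin breakthroughs from OpenAI and DeepMind ｜ Industry inflection point

OpenAI and Google DeepMind almost simultaneously announced historic breakthroughs in mathematical reasoning. OpenAI's model resolved the planar unit distance conjecture posed by Paul Erdős in 1946, which had stumped mathematicians for 80 years; the breakthrough started from a simple query: "Was Erdős wrong?" DeepMind's AlphaProof Nexus system autonomously solved nine open Erdős problems at a cost of only a few hundred dollars per inference run, two of which had been open for 56 years. Unlike OpenAI's natural-language approach, DeepMind uses the Lean compiler to automatically verify every step of each proof. Cambridge mathematician Tim Gowers said the paper would be directly publishable had a human written it. This is the first time AI has achieved a breakthrough of this kind in mathematics, with far-reaching implications for research on LLM reasoning.

Sources: The New York Times ｜ The Decoder ｜ Phys.org ｜ New Scientist ｜ The Guardian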

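HRM-Text: SOTA with 100-900x less compute, breaking the scaling-law monopoly ｜ Pretraining efficiency breakthrough

HRM-Text proposes a hierarchical recurrent model (HRM) as a replacement for the standard Transformer, stabilizes deep recurrent training with MagicNorm, and replaces raw-text pretraining with a task-completion objective plus PrefixLM. A 1B model trained on only 40B tokens, with a $1,500 budget and 1.9 days of training, reaches the level of 2-7B open-source models on MMLU, ARC-C, DROP, GSM8K, and MATH while using 96-432x less compute. The code is open source. The work directly challenges the assumption that large-scale pretraining must depend on massive data and compute, and offers empirical evidence for low-cost pretraining.

Source: arXiv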

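Google redesigns its search box for the first time in 25 years as AI reshapes the gateway to information ｜ Inflection-point product launch

For the first time in 25 years, Google has redesigned its search box. It runs on the new Gemini 3.5 Flash model and supports longer queries, image and video uploads, chat-style follow-ups, and automated agents (for example, rental alerts). This is a landmark in AI reshaping search, and a direct reference for LLM practitioners thinking about product deployment and agent applications.

Source: The New York Times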

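2026 landscape of LLM agent and tool-use benchmarks: 24 benchmarks, top 15 model rankings ｜ Technology selection reference

BenchLM.ai has published its 2026 LLM Agent & Tool-Use benchmark rankings, covering 24 agentic benchmarks such as Terminal-Bench 2.0, BrowseComp, and OSWorld-Verified, plus tool-calling benchmarks such as MCP Atlas and BFCL v4. GPT-5.5 Pro leads with a score of 90.1, and the open-source model Holo3-35B-A3B stands out at 82.6. The page offers CSV/JSON data export, which suits technology selection and trend analysis.

Source: BenchLM.ai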

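In-depth guide to agent evaluation: a complete framework from concepts to practice ｜ Systematic reference

Cameron R. Wolfe's in-depth guide to agent evaluation systematically covers core concepts such as the agentic loop, tool calling, and multi-agent systems; lays out a clear evaluation framework (dimensions, metrics, common pitfalls); illustrates best practices through case studies such as SWE-bench and WebArena; and closes with a roadmap for building custom agent evaluations. For practitioners who build, evaluate, or research agents, it is a rare systematic reference that combines theoretical depth with hands-on guidance.

Source: Deep (Learning) Focus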

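Disclosure audit of 12 agent benchmark papers: average score only 0.38/1.0 ｜ Reproducibility crisis

This paper audits the disclosure practices of 12 well-known LLM agent benchmark papers using a five-field scoring scheme (benchmark identity, framework specification, inference settings, cost reporting, failure breakdown). Agent benchmarks average only 0.38/1.0, well below the 0.66 of classic benchmarks; the largest gaps are in inference cost (disclosed by 0 papers) and framework specification (no complete container images). The authors release a JSON Schema, a codebook, and the raw scoring sheets as standardized tools for reproducible agent evaluation.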
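
To make the five-field scoring scheme concrete, here is a minimal sketch of how such a disclosure score could be computed. The field keys, the 0 to 1 per-field scale, and the unweighted mean are all assumptions made for illustration; the authors' released JSON Schema and codebook define the actual rubric, and the example values are made up.

```python
from statistics import mean

# Hypothetical keys mirroring the five audit fields named in the summary.
FIELDS = (
    "benchmark_identity",
    "framework_spec",
    "inference_settings",
    "cost_reporting",
    "failure_breakdown",
)


def disclosure_score(paper: dict) -> float:
    """Unweighted mean of per-field scores in [0, 1]; missing fields count as 0."""
    scores = []
    for field in FIELDS:
        value = float(paper.get(field, 0.0))
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{field} must be in [0, 1], got {value}")
        scores.append(value)
    return mean(scores)


# Made-up paper: clear about which benchmark it is, silent on cost and failures.
example = {
    "benchmark_identity": 1.0,
    "framework_spec": 0.5,
    "inference_settings": 0.5,
    "cost_reporting": 0.0,
    "failure_breakdown": 0.0,
}
print(disclosure_score(example))
```

Under this assumed rubric, a paper that fully discloses one field and half-discloses two others already lands near the reported 0.38 average, which gives a feel for how sparse typical disclosure is.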

Source: arXiv

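DeepSeek cuts prices permanently by 75%; API cost is just 1/9 of GPT-5.5 ｜ Price war reshapes the market

DeepSeek announced a permanent 75% price cut for its V4-Pro API, bringing its API cost to just 1/9 that of OpenAI's GPT-5.5 and 1/7 that of Anthropic's Claude Opus 4.7. The permanently low pricing puts direct pressure on the pricing of US AI labs and could reshape the economics of the AI market. For AI practitioners, it directly affects API selection costs and points to price wars ahead.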
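
The ratios above also imply what the gap looked like before the cut. The sketch below works that out with exact fractions. It assumes the competitors' prices did not change and that each ratio is a single blended figure, neither of which the article states; `pre_cut_ratio` is a hypothetical helper.

```python
from fractions import Fraction

CUT = Fraction(75, 100)            # 75% price reduction, from the article
POST_CUT_VS_GPT = Fraction(1, 9)   # post-cut price relative to GPT-5.5
POST_CUT_VS_OPUS = Fraction(1, 7)  # post-cut price relative to Claude Opus 4.7


def pre_cut_ratio(post_cut_ratio: Fraction, cut: Fraction) -> Fraction:
    """Price ratio before the cut, given new_price = old_price * (1 - cut)."""
    if not 0 <= cut < 1:
        raise ValueError("cut must be in [0, 1)")
    return post_cut_ratio / (1 - cut)


print(pre_cut_ratio(POST_CUT_VS_GPT, CUT))   # 4/9
print(pre_cut_ratio(POST_CUT_VS_OPUS, CUT))  # 4/7
```

Under those assumptions, V4-Pro already cost a bit under half of GPT-5.5 and a bit over half of Claude Opus 4.7 before the cut.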

Source: TheStreet

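AWS MCP Server reaches GA with full API coverage and IAM governance ｜ Infrastructure milestone

The AWS-managed MCP server is now generally available, offering full API coverage and IAM-based governance as a standard interface for AI coding agents to access AWS services securely. The server is part of the AWS Agent Toolkit and supports up-to-date documentation, authenticated API access, and sandboxed script execution. It currently supports only OAuth 2.1, but local IAM credentials can be used through the open-source MCP Proxy. This is a key step toward a production-grade MCP ecosystem.

Source: InfoQ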

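🎙️ Podcast Picks

E238｜AI-First organizational structure in the Harness era: from trusting people to trusting AI

📍 Source: 硅谷101 | ⭐⭐⭐⭐⭐ | 🏷️ Agent, LLM, Product | ⏱️ 1:05:20

This episode takes a deep look at the Harness Engineering paradigm. The guest, from CreaoAI, describes an agent system in which 99% of code is written by AI and production deployments happen 3-8 times a day. Key points: AI-First means letting AI lead production rather than merely using AI; the key to organizational change is trusting AI; the product manager role can be replaced; junior engineers adapt more readily to the AI era, while senior engineers' core value shifts toward spotting flaws in AI plans and judging value. The discussion covers hands-on experience with agent system design, feedback loops, and automatic bug fixing, and is highly relevant for LLM and agent practitioners.

💡 Why we recommend it: A high-profile guest shares in-depth, hands-on experience with Harness agents and offers first-hand insight into AI-First organizational change; the content is cutting-edge and thought-provoking for practitioners.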

📄 Today's Paper Picks

Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

📍 Source: NVIDIA | ⭐⭐⭐⭐ | 🏷️ Agent Framework, Agentic Workflow, Tool Use, Reasoning, Code Generation, Application

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

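💡 Why we recommend it: An industry lab paper (NVIDIA) with clear technical novelty: it proposes a test-time scaling framework that improves EDA agent performance by evolving natural-language skills rather than fine-tuning the model, and introduces dense verifier feedback. Experiments show substantially higher pass rates on hard CVDP tasks, including passes on previously unsolved tasks.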

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

📍 Source: University of Science and Technology of China, Huawei | ⭐⭐⭐⭐ | 🏷️ Inference, Agent Framework, Reasoning, Transformer

Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference to tree-structured search with branching and backtracking, but it substantially amplifies the Key--Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.

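💡 Why we recommend it: An industry paper (Huawei) with clear technical novelty (structure-aware KV cache management). It is validated on several ToT benchmarks with significant gains (up to ~4x memory reduction with almost no loss of accuracy).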
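
The core observation in the abstract, that the active branch and its ancestors must stay resident while inactive subtrees can be evicted and recovered later, can be illustrated with a toy eviction policy. This is a sketch under heavy assumptions and not the paper's algorithm: it evicts whole nodes rather than selected tokens, the `value` field stands in for the paper's learned value estimator, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    parent: "Node | None"
    kv_tokens: int         # size of the KV state held for this node
    value: float           # hypothetical estimate of near-term reuse
    resident: bool = True  # False once the KV state has been evicted


def ancestors_inclusive(node):
    """Names of `node` and every ancestor up to the root."""
    path = set()
    while node is not None:
        path.add(node.name)
        node = node.parent
    return path


def evict_to_budget(nodes, active, budget):
    """Evict lowest-value inactive nodes until resident KV fits `budget`."""
    protected = ancestors_inclusive(active)
    used = sum(n.kv_tokens for n in nodes if n.resident)
    candidates = sorted(
        (n for n in nodes if n.resident and n.name not in protected),
        key=lambda n: n.value,
    )
    evicted = []
    for n in candidates:
        if used <= budget:
            break
        n.resident = False  # state dropped; would be recomputed on a revisit
        used -= n.kv_tokens
        evicted.append(n.name)
    return evicted, used


root = Node("root", None, 100, 1.0)
a = Node("a", root, 200, 0.9)
b = Node("b", root, 200, 0.2)
a1 = Node("a1", a, 150, 0.8)
b1 = Node("b1", b, 150, 0.1)

evicted, used = evict_to_budget([root, a, b, a1, b1], active=a1, budget=600)
print(evicted, used)
```

With `a1` as the active leaf, only the inactive `b` subtree is eligible for eviction; the protected path `root`, `a`, `a1` is never touched, even if the budget cannot be met.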

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

📍 Source: AWS AI Lab, HSBC | ⭐⭐⭐⭐ | 🏷️ Agent Framework, Agent Memory, Code Agent, Agentic Workflow, Reasoning

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds).

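💡 Why we recommend it: An industry paper (AWS AI Lab) with clear deployment evidence. The method is technically novel (lifecycle management mechanisms that address skill-library drift), and the experiments are thorough with significant gains.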
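
Two of the four hygiene mechanisms named in the abstract, outcome-driven retirement and a bounded active cap, are simple enough to sketch. The class below is a toy illustration and not Ratchet's implementation: the thresholds, the neutral 0.5 prior for untested skills, and every name are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    text: str
    uses: int = 0
    wins: int = 0

    def score(self) -> float:
        """Observed win rate; untested skills get a neutral 0.5 prior (assumption)."""
        return self.wins / self.uses if self.uses else 0.5


class SkillLibrary:
    def __init__(self, active_cap=2, min_uses=2, retire_below=0.5):
        self.active_cap = active_cap      # bounded active-cap
        self.min_uses = min_uses          # evidence needed before retiring
        self.retire_below = retire_below  # outcome-driven retirement threshold
        self.active = {}                  # name -> Skill, in insertion order
        self.retired = []                 # names, in retirement order

    def add(self, skill):
        self.active[skill.name] = skill
        # Enforce the cap by retiring the lowest-scoring active skill.
        while len(self.active) > self.active_cap:
            weakest = min(self.active.values(), key=Skill.score)
            self._retire(weakest.name)

    def record(self, name, success):
        """Log one task outcome and retire the skill if it keeps failing."""
        skill = self.active[name]
        skill.uses += 1
        skill.wins += int(success)
        if skill.uses >= self.min_uses and skill.score() < self.retire_below:
            self._retire(name)

    def _retire(self, name):
        self.retired.append(self.active.pop(name).name)


lib = SkillLibrary()
lib.add(Skill("A", "check edge cases first"))
lib.add(Skill("B", "always use recursion"))
for _ in range(2):
    lib.record("A", success=True)
    lib.record("B", success=False)
lib.add(Skill("C", "write a brute-force oracle"))
print(sorted(lib.active), lib.retired)
```

Here skill B is retired after two failures, which frees a slot so that C can join without breaching the cap.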

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

📍 Source: USC, Amazon | ⭐⭐⭐⭐ | 🏷️ Agent Framework, Inference, Planning, Multi-Agent, Tool Use

Large language model (LLM)-based agents solve complex tasks by leveraging multi-step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle-time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback.

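💡 Why we recommend it: An industry paper (USC/Amazon) with deployment evidence. Novelty: IdleSpec uses idle time for speculative planning and handles observation uncertainty through an adaptive sampling strategy. The experiments are thorough and the performance gains are significant.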
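
The basic idea of drafting plans while a tool call is still in flight can be sketched with a thread pool. This is a toy illustration rather than the IdleSpec method: the paper samples between progressive and recovery drafting from a learned distribution, whereas this sketch simply alternates them, and the tool, the drafter, and all names are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_tool_call(query):
    """Stand-in for a tool call; sleeps to simulate waiting on the environment."""
    time.sleep(0.2)
    return {"query": query, "status": "ok"}


def draft_plan(step, strategy):
    """Toy drafter: 'progressive' assumes the call succeeds, 'recovery' assumes it fails."""
    assumed = "ok" if strategy == "progressive" else "error"
    return {"assumes": assumed, "plan": f"{strategy}-plan-{step}"}


def run_step(query, max_drafts=4):
    drafts = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_tool_call, query)
        step = 0
        # Use the idle period to draft candidate plans, alternating strategies.
        while not future.done() and step < max_drafts:
            strategy = "progressive" if step % 2 == 0 else "recovery"
            drafts.append(draft_plan(step, strategy))
            step += 1
            time.sleep(0.01)
        observation = future.result()
    # Once the observation arrives, keep only drafts whose assumption held.
    usable = [d["plan"] for d in drafts if d["assumes"] == observation["status"]]
    return observation, usable


observation, usable = run_step("list open pull requests")
print(observation["status"], usable)
```

The drafts that assumed the wrong outcome are discarded, so the extra work costs idle time only and does not delay the next reasoning step.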

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

📍 Source: Anthropic | ⭐⭐⭐⭐ | 🏷️ Agent Framework, Agentic Workflow, Application, Reasoning, NLP Task

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis.

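💡 Why we recommend it: An industry paper (Anthropic team) focused on agentic engineering (end-to-end workflow evaluation). Novelty: one of the first benchmarks for end-to-end spreadsheet tasks in finance, with a multi-dimensional evaluation scheme (Accuracy/Formula/Format).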

🐙 Trending GitHub Projects

Aider-AI/aider

⭐ 45313 | 🗣️ Python | 🏷️ LLM, DevTool, Agent

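Aider is an AI pair-programming tool that runs in the terminal. It supports multiple LLMs (such as GPT-4 and Claude) and can automatically edit code, run commands, and manage git commits. By understanding the context of the codebase, it helps developers implement features, fix bugs, or refactor code quickly, and it suits everyday development that needs fast iteration. Technical highlights include automatic git management, multi-file editing, and deep terminal integration.

💡 Why we recommend it: Aider is one of the most mature terminal AI coding assistants available. It supports mainstream LLMs, can noticeably improve development efficiency, and is fully open source and self-deployable.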

onyx-dot-app/onyx

⭐ 29763 | 🗣️ Python | 🏷️ LLM, Agent, DevTool

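Onyx is an open-source AI platform that provides intelligent chat integrated with all mainstream LLMs, with advanced features such as multi-model switching, context management, and a plugin system. Aimed at developers and enterprise users, it can be used to build custom AI assistants, customer-service systems, and similar applications. Its main strengths are a highly extensible plugin architecture and a unified LLM interface abstraction, which lower integration costs.

💡 Why we recommend it: As an open-source AI chat platform, Onyx directly serves LLM application development, addresses the pain of multi-model integration, and can be deployed right away, which makes it worth sharing.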