<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="zh-CN"><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://ifuryst.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://ifuryst.github.io/" rel="alternate" type="text/html" hreflang="zh-CN"/><updated>2026-05-28T06:32:18+00:00</updated><id>https://ifuryst.github.io/feed.xml</id><title type="html">ifuryst</title><subtitle>📝 &amp; 💭 </subtitle><entry><title type="html">A New Understanding of Harness</title><link href="https://ifuryst.github.io/blog/2026/harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/" rel="alternate" type="text/html" title="Harness的新理解"/><published>2026-05-28T00:00:00+00:00</published><updated>2026-05-28T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/"><![CDATA[<p>This moment of clarity came from <a href="https://www.xiaoyuzhoufm.com/episode/6a15a2cbff7b9a8c0a5b953f?s=eyJ1IjogIjY4Mzk3OTM0ZDFkMzUwNzI2OWRiOTQ4NCJ9">episode 2 of 张小珺 and 戴雨森's 创投观察 (venture investing observations) series</a></p> <p>I happened to be re-examining the idea of a Harness recently, and listening to this podcast episode over the past two days gave me some good ideas!</p> <p>Codex, Claude Code, and OpenClaw can all be described as Harnesses: in essence, each uses a set of supporting mechanisms, tools, and environments to help the model make better use of its capabilities.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949798_3-480.webp 480w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949798_3-800.webp 800w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949798_3-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949798_3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
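$('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>This image, which I generated with ChatGPT, gives a clear and concise overview.</p> <p>Building a baseline Harness is easy: codex, openclaw, openhands, and hermes are open source, and claude code has leaked source code. Clone one, work through it with an AI, and you have an in-house harness. So where does the value of a harness lie?</p> <p>I think comparing a Harness to an Agent OS makes a lot of sense!</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_4-480.webp 480w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_4-800.webp 800w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_4-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>I quickly generated this one from my own idea. Ignore the details; I only wanted to get the idea across quickly (for now I don't want to spend much effort drawing it by hand).</p> <p>Think of the LLM as the CPU of earlier days and the Harness as the OS, which ships a driver for each LLM. The LLM then becomes essentially swappable, with only the quality of the results differing. The surrounding sandbox, memory, tools, and so on do not need to be coupled to it.</p> <p>This is where the significance of the Harness shows. Claude Code's open ecosystem is currently limited, for example, while Codex has an App Server and an official stance that clearly favors openness (similar to the ecosystem pull of ETH in crypto). As a result, many products built on top of a “harness” have appeared, for example:</p> <ul> <li> <p>slock (built by people from the original kimi cli team): a daemon runs on the local machine and connects to slock's server over ws, so the server can send commands down to codex/claude code</p> </li> <li> <p>Codex.app can also be seen as built on top of the Codex Harness itself</p> </li> <li> <p>郭宇's wanman is another nice angle: you used to sign in to a website with Google OAuth2.0, and now you sign in with Codex OAuth2.0 directly. Worth pondering</p> </li> <li> <p>Managed Agents are first-party services of the Agent OS, or go even deeper: think of them as sitting at the level of syscalls in a traditional OS. Nothing more needs to be built on top of the Harness; the capability is exposed to the outside directly, as a zero-code, out-of-the-box offering</p> </li> </ul> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" 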
srcset="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_5-480.webp 480w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_5-800.webp 800w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_5-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949799_5.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Generated again, so don't dwell on the details; it is broadly right. Codex.app today is just a shell whose foundation is codex cli.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949800_6-480.webp 480w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949800_6-800.webp 800w,/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949800_6-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-28-harness%E7%9A%84%E6%96%B0%E7%90%86%E8%A7%A3/1779949800_6.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>You can see that it bundles codex cli itself. This one happens to be a first-party wrapper, but in theory anyone could build an open-codex.app, or any other product, on top of the codex cli harness.</p> <p>This is the opportunity for the Harness, or Agent OS. Treated as a product, it could be sold on its own as an OS, could support a rapid bloom of apps on top, or could be offered directly as a SaaS. Moreover, codex and claude code are themselves heavily single-machine, local-first versions, whereas Managed Agents must be backed by an enterprise-grade Harness, and that is what is worth pursuing. Few people are discussing or building this today, but as enterprise (B2B) customers grow more certain that AI can be profitable, it will inevitably become a main battleground.</p> <p>So what a Harness/Agent OS should do is get this layer right:</p> <ul> <li> <p>Outward: hide these details and deliver product quality, ease of use, flexibility, extensibility, stability, and so on</p> </li> <li> <p>Inward: build the supporting mechanisms that ensure stability, low latency, scale, distribution, high availability, and so on</p> </li> <li> <p>Downward: adapt to each LLM, which may even require some understanding of models and LLM 
infra</p> </li> </ul> <p>Once these concepts are sorted out, a clear picture emerges to guide the direction of iteration. The only way to sharpen your understanding is to read, listen, and experiment more. Things are changing very fast right now, which is a big challenge for decision-makers, so it is important to stay alert, take in a large volume of information, and internalize it into something more ordered (reducing entropy).</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><summary type="html"><![CDATA[This moment of clarity came from episode 2 of 张小珺 and 戴雨森's 创投观察 (venture investing observations) series]]></summary></entry><entry><title type="html">LLM Infra 101 v0.5: KV Cache Block Management</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-5-kv-cache-block/" rel="alternate" type="text/html" title="LLM Infra 101 v0.5: KV Cache分块管理"/><published>2026-05-28T00:00:00+00:00</published><updated>2026-05-28T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-v0-5-kv-cache-block</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-5-kv-cache-block/"><![CDATA[<p>This is the sixth installment of the series. The earlier ones are:</p> <ol> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-model-inference/">LLM Infra 101 v0.0: Model Inference</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-1-openai-compatible-api/">LLM Infra 101 v0.1: API Calls</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-2-kv-cache-decode/">LLM Infra 101 v0.2: KV Cache</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-3-static-batching/">LLM Infra 101 v0.3: Static Batching</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-4-continuous-batching/">LLM Infra 101 v0.4: Continuous Batching</a></p> </li> </ol> <p>The code for this installment is at <a href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.5.0">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.5.0</a></p> <p>Continuous Batching is already in place from the previous installment. This time we go a step further and build a block management model for the KV Cache, laying the groundwork for a full Paged Attention implementation later.</p> <p>Previously we let requests join the batch dynamically and leave it as they finish. There is a sizable problem here: requests have different sequence lengths, so their KV Cache footprints differ too. If every request reserves its KV Cache as one large contiguous region, a lot of GPU memory is wasted. For example:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Request</span> <span class="n">A</span><span 
class="p">:</span> <span class="n">prompt</span> <span class="mi">200</span> <span class="n">tokens</span><span class="p">,</span> <span class="n">generates</span> <span class="mi">100</span> <span class="n">tokens</span>  <span class="err">→</span> <span class="n">needs</span> <span class="n">KV</span> <span class="n">for</span> <span class="mi">300</span> <span class="n">tokens</span>
<span class="n">Request</span> <span class="n">B</span><span class="p">:</span> <span class="n">prompt</span> <span class="mi">4</span><span class="p">,</span><span class="mi">000</span> <span class="n">tokens</span><span class="p">,</span> <span class="n">generates</span> <span class="mi">1</span><span class="p">,</span><span class="mi">000</span> <span class="n">tokens</span> <span class="err">→</span> <span class="n">needs</span> <span class="n">KV</span> <span class="n">for</span> <span class="mi">5</span><span class="p">,</span><span class="mi">000</span> <span class="n">tokens</span>
<span class="n">Request</span> <span class="n">C</span><span class="p">:</span> <span class="n">prompt</span> <span class="mi">50</span> <span class="n">tokens</span><span class="p">,</span> <span class="n">generates</span> <span class="mi">20</span> <span class="n">tokens</span> <span class="err">→</span> <span class="n">needs</span> <span class="n">KV</span> <span class="n">for</span> <span class="mi">70</span> <span class="n">tokens</span>

<span class="n">Req</span> <span class="n">A</span> <span class="n">uses</span> <span class="mi">300</span>   <span class="o">/</span> <span class="n">reserves</span> <span class="mi">8000</span>  <span class="err">→</span> <span class="n">heavy</span> <span class="n">waste</span>
<span class="n">Req</span> <span class="n">B</span> <span class="n">uses</span> <span class="mi">5000</span>  <span class="o">/</span> <span class="n">reserves</span> <span class="mi">8000</span>  <span class="err">→</span> <span class="n">acceptable</span>
<span class="n">Req</span> <span class="n">C</span> <span class="n">uses</span> <span class="mi">70</span>    <span class="o">/</span> <span class="n">reserves</span> <span class="mi">8000</span>  <span class="err">→</span> <span class="n">extreme</span> <span class="n">waste</span>
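</code></pre></div></div> <p>That is roughly the situation. Anyone with a solid background in traditional software development should find it familiar: it maps directly onto paging in an OS, where the OS abstracts memory into pages and a page no longer has to sit at a contiguous location in physical memory. vLLM's Paged Attention follows the same idea, carving the allocated GPU memory into individual blocks to reduce fragmentation and wasted memory. (The fundamentals stay the same however much the surface changes. Don't stop at the surface: once you understand the underlying principle, you can reason your way back up from first principles in any new scenario instead of memorizing things by rote each time, and it is a lot more fun.) Our block management is one piece of this. The original layout was:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Req</span> <span class="n">A</span><span class="p">:</span>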
<span class="p">[....................</span> <span class="mi">8000</span> <span class="n">token</span> <span class="nb">buffer</span> <span class="p">....................]</span>

<span class="n">Req</span> <span class="n">B</span><span class="p">:</span>
<span class="p">[....................</span> <span class="mi">8000</span> <span class="n">token</span> <span class="nb">buffer</span> <span class="p">....................]</span>

<span class="n">Req</span> <span class="n">C</span><span class="p">:</span>
<span class="p">[....................</span> <span class="mi">8000</span> <span class="n">token</span> <span class="nb">buffer</span> <span class="p">....................]</span>
</code></pre></div></div> <p>With blocks, the request and GPU memory layout becomes:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Req</span> <span class="n">A</span><span class="p">:</span>
<span class="p">[</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">]</span>

<span class="n">Req</span> <span class="n">B</span><span class="p">:</span>
<span class="p">[</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">][</span><span class="n">block</span><span class="p">]</span>

<span class="n">Req</span> <span class="n">C</span><span class="p">:</span>
<span class="p">[</span><span class="n">block</span><span class="p">]</span>
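</code></pre></div></div> <p>Allocation now happens in fixed-size KV blocks (pages), and a request's blocks are returned to the pool when it finishes. GPU memory utilization improves immediately, and a request's allocation can grow dynamically, on demand.</p> <p>With the principle roughly in mind, let's look at the implementation.</p> <h1 id="实现">Implementation</h1> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">.</span>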
<span class="err">├──</span> <span class="n">benchmarks</span><span class="o">/</span>
<span class="err">│</span>   <span class="err">└──</span> <span class="n">benchmark_block_manager</span><span class="p">.</span><span class="n">py</span>      <span class="c1"># new: block KV cache fragmentation-rate benchmark
</span><span class="err">├──</span> <span class="n">src</span><span class="o">/</span>
<span class="err">│</span>   <span class="err">└──</span> <span class="n">nanollmserve</span><span class="o">/</span>
<span class="err">│</span>       <span class="err">├──</span> <span class="n">cache</span><span class="o">/</span>
<span class="err">│</span>       <span class="err">│</span>   <span class="err">├──</span> <span class="n">__init__</span><span class="p">.</span><span class="n">py</span>             <span class="c1"># exports the block manager types
</span><span class="err">│</span>       <span class="err">│</span>   <span class="err">└──</span> <span class="n">block_manager</span><span class="p">.</span><span class="n">py</span>        <span class="c1"># core: KVBlockManager / block table / usage metrics
</span><span class="err">│</span>       <span class="err">└──</span> <span class="n">engine</span><span class="o">/</span>
<span class="err">│</span>           <span class="err">└──</span> <span class="n">engine</span><span class="p">.</span><span class="n">py</span>               <span class="c1"># hooks allocate / append / release into the generation lifecycle
</span><span class="err">└──</span> <span class="n">tests</span><span class="o">/</span>
    <span class="err">├──</span> <span class="n">test_block_manager</span><span class="p">.</span><span class="n">py</span>           <span class="c1"># tests for block allocation, append, release, and over-allocation
</span>    <span class="err">├──</span> <span class="n">test_benchmark_block_manager</span><span class="p">.</span><span class="n">py</span> <span class="c1"># tests for the benchmark summary fields
</span>    <span class="err">└──</span> <span class="n">test_engine</span><span class="p">.</span><span class="n">py</span>                  <span class="c1"># tests for the block lifecycle during engine generation
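</span></code></pre></div></div> <p>These are the files touched by the change; the core addition is <code class="language-plaintext highlighter-rouge">KVBlockManager</code>.</p> <h2 id="kv-block">KV Block</h2> <p>First, a Block unit is introduced:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>

<span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>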
<span class="k">class</span> <span class="nc">KVBlock</span><span class="p">:</span>
    <span class="n">block_id</span><span class="p">:</span> <span class="nb">int</span>        <span class="c1"># index of this block in KVBlockManager.blocks
</span>    <span class="n">capacity_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="c1"># equals KVBlockManager.block_size
</span></code></pre></div></div> <p>A KVBlock is a fixed-size token block used to hold tokens.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">KVBlockManager</span><span class="p">:</span>
    <span class="n">total_blocks</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">16</span>

    <span class="k">def</span> <span class="nf">__post_init__</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">total_blocks</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">total_blocks must be at least 1</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">block_size</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">block_size must be at least 1</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">blocks</span> <span class="o">=</span> <span class="p">[</span><span class="nc">KVBlock</span><span class="p">(</span><span class="n">block_id</span><span class="o">=</span><span class="n">index</span><span class="p">,</span> <span class="n">capacity_tokens</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="n">block_size</span><span class="p">)</span> <span class="k">for</span> <span class="n">index</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">total_blocks</span><span class="p">)]</span>
        <span class="n">self</span><span class="p">.</span><span class="n">free_block_ids</span><span class="p">:</span> <span class="n">deque</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="nf">deque</span><span class="p">(</span><span class="n">block</span><span class="p">.</span><span class="n">block_id</span> <span class="k">for</span> <span class="n">block</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">blocks</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">RequestBlockTable</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
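</code></pre></div></div> <p>By default each KVBlock holds 16 tokens, and its id is simply the block's index in the <code class="language-plaintext highlighter-rouge">blocks</code> list.</p> <p>Next comes a table that maps a request to its blocks:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>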
<span class="k">class</span> <span class="nc">RequestBlockTable</span><span class="p">:</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">block_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">token_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div> <p>For example:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">req</span><span class="o">-</span><span class="n">a</span><span class="p">:</span>
  <span class="n">token_count</span> <span class="o">=</span> <span class="mi">33</span>
  <span class="n">block_ids</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
</code></pre></div></div> <p>This means the KV Cache of req-a is mapped onto 3 blocks holding 33 tokens in total.</p> <h2 id="kv-block-manager">KV Block Manager</h2> <p>With these two definitions in place, we can allocate blocks according to a request and its token count. This calls for a manager that owns the lifecycle of every block: assigning blocks to a request, reclaiming them, and destroying them for good. All of this is handled in <code class="language-plaintext highlighter-rouge">KVBlockManager</code>.</p> <p>Let's walk through the whole flow. First, when a request comes in, the following is called:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_allocate_prompt_blocks</span><span class="p">(</span>
    <span class="n">kv_block_manager</span><span class="p">:</span> <span class="n">KVBlockManager</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">prompt_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">kv_block_manager</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">kv_block_manager</span><span class="p">.</span><span class="nf">allocate</span><span class="p">(</span><span class="n">request_id</span><span class="p">,</span> <span class="n">prompt_tokens</span><span class="p">)</span>
</code></pre></div></div> <p>Internally, this is:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">KVBlockManager</span><span class="p">:</span>
    <span class="n">total_blocks</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">16</span>

    <span class="c1"># ...
</span>
    <span class="k">def</span> <span class="nf">allocate</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">token_count</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">RequestBlockTable</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">request_id</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">request already has allocated blocks: </span><span class="si">{</span><span class="n">request_id</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">token_count</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">token_count must be non-negative</span><span class="sh">"</span><span class="p">)</span>

        <span class="n">needed_blocks</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_blocks_for_tokens</span><span class="p">(</span><span class="n">token_count</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="nf">_ensure_free_blocks</span><span class="p">(</span><span class="n">needed_blocks</span><span class="p">)</span>
        <span class="n">table</span> <span class="o">=</span> <span class="nc">RequestBlockTable</span><span class="p">(</span>
            <span class="n">request_id</span><span class="o">=</span><span class="n">request_id</span><span class="p">,</span>
            <span class="n">block_ids</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="nf">_take_blocks</span><span class="p">(</span><span class="n">needed_blocks</span><span class="p">),</span>
            <span class="n">token_count</span><span class="o">=</span><span class="n">token_count</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">table</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">snapshot_request</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>
</code></pre></div></div> <p>It computes how many blocks are needed from the token count, rounding up to whole blocks:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">9</span> <span class="n">tokens</span>  <span class="o">-&gt;</span> <span class="nf">ceil</span><span class="p">(</span><span class="mi">9</span> <span class="o">/</span> <span class="mi">16</span><span class="p">)</span>  <span class="o">=</span> <span class="mi">1</span> <span class="n">block</span>
<span class="mi">17</span> <span class="n">tokens</span> <span class="o">-&gt;</span> <span class="nf">ceil</span><span class="p">(</span><span class="mi">17</span> <span class="o">/</span> <span class="mi">16</span><span class="p">)</span> <span class="o">=</span> <span class="mi">2</span> <span class="n">blocks</span>
<span class="mi">33</span> <span class="n">tokens</span> <span class="o">-&gt;</span> <span class="nf">ceil</span><span class="p">(</span><span class="mi">33</span> <span class="o">/</span> <span class="mi">16</span><span class="p">)</span> <span class="o">=</span> <span class="mi">3</span> <span class="n">blocks</span>
</code></pre></div></div> <p>It then checks the free pool for enough blocks (this is a very basic version: <code class="language-plaintext highlighter-rouge">total_blocks</code> blocks are created at initialization, and there is no later management such as growing the pool dynamically), assigns those blocks to the request, and records the mapping in the request's block table.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">free_block_ids</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="n">request_tables</span> <span class="o">=</span> <span class="p">{}</span>

<span class="c1"># becomes
</span>
<span class="n">free_block_ids</span> <span class="o">=</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
<span class="n">request_tables</span> <span class="o">=</span> <span class="p">{</span>
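  <span class="sh">"</span><span class="s">req-a</span><span class="sh">"</span><span class="p">:</span> <span class="nc">RequestBlockTable</span><span class="p">(</span><span class="n">request_id</span><span class="o">=</span><span class="sh">"</span><span class="s">req-a</span><span class="sh">"</span><span class="p">,</span> <span class="n">block_ids</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="n">token_count</span><span class="o">=</span><span class="mi">33</span><span class="p">),</span>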
<span class="p">}</span>
</code></pre></div></div> <p>Next, after each token generated during decode, the following is called:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_append_generated_block_token</span><span class="p">(</span>
    <span class="n">kv_block_manager</span><span class="p">:</span> <span class="n">KVBlockManager</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">kv_block_manager</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">kv_block_manager</span><span class="p">.</span><span class="nf">append_tokens</span><span class="p">(</span><span class="n">request_id</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div> <p>Internally, this is:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">KVBlockManager</span><span class="p">:</span>
    <span class="n">total_blocks</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">16</span>

    <span class="c1"># ...
</span>
    <span class="k">def</span> <span class="nf">append_tokens</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">token_count</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">RequestBlockTable</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">token_count</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">token_count must be non-negative</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">request_id</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">KeyError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">request has no allocated blocks: </span><span class="si">{</span><span class="n">request_id</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">token_count</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">snapshot_request</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>

        <span class="n">table</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
        <span class="n">old_blocks</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_blocks_for_tokens</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">token_count</span><span class="p">)</span>
        <span class="n">new_token_count</span> <span class="o">=</span> <span class="n">table</span><span class="p">.</span><span class="n">token_count</span> <span class="o">+</span> <span class="n">token_count</span>
        <span class="n">new_blocks</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_blocks_for_tokens</span><span class="p">(</span><span class="n">new_token_count</span><span class="p">)</span>
        <span class="n">additional_blocks</span> <span class="o">=</span> <span class="n">new_blocks</span> <span class="o">-</span> <span class="n">old_blocks</span>
        <span class="n">self</span><span class="p">.</span><span class="nf">_ensure_free_blocks</span><span class="p">(</span><span class="n">additional_blocks</span><span class="p">)</span>
        <span class="n">table</span><span class="p">.</span><span class="n">block_ids</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">_take_blocks</span><span class="p">(</span><span class="n">additional_blocks</span><span class="p">))</span>
        <span class="n">table</span><span class="p">.</span><span class="n">token_count</span> <span class="o">=</span> <span class="n">new_token_count</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">snapshot_request</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>
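</code></pre></div></div> <p>In short, this tells the Manager that the request has grown by <code class="language-plaintext highlighter-rouge">token_count</code> tokens. The Manager then checks whether the new total needs more blocks, and if it does, whether enough free blocks are still available before appending them to the request's table.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">req</span><span class="o">-</span><span class="n">a</span><span class="p">:</span> <span class="mi">48</span> <span class="n">tokens</span> <span class="o">-&gt;</span> <span class="mi">49</span> <span class="n">tokens</span>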

<span class="n">before</span><span class="p">:</span> <span class="mi">3</span> <span class="n">blocks</span>
<span class="n">now</span> <span class="n">needs</span><span class="p">:</span> <span class="mi">4</span> <span class="n">blocks</span>
<span class="n">append</span> <span class="mi">1</span> <span class="n">block</span>
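</code></pre></div></div> <p>The boundary check can be reproduced in a few lines. This is a minimal sketch, not the project's code: it assumes <code class="language-plaintext highlighter-rouge">_blocks_for_tokens</code> is a plain ceiling division (its body is not shown above), and the helper names <code class="language-plaintext highlighter-rouge">blocks_for_tokens</code> and <code class="language-plaintext highlighter-rouge">additional_blocks</code> are made up for illustration.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def blocks_for_tokens(token_count, block_size=16):
    # Ceiling division: a partially filled block still occupies a whole block.
    return (token_count + block_size - 1) // block_size


def additional_blocks(old_tokens, appended_tokens, block_size=16):
    # Number of extra blocks needed after appending tokens to a request.
    old_blocks = blocks_for_tokens(old_tokens, block_size)
    new_blocks = blocks_for_tokens(old_tokens + appended_tokens, block_size)
    return new_blocks - old_blocks


print(additional_blocks(48, 1))  # 48 to 49 tokens crosses a block boundary
print(additional_blocks(49, 1))  # 49 to 50 tokens stays inside the 4th block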
</code></pre></div></div> <p>Once a request completes, its blocks are released:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_release_blocks</span><span class="p">(</span><span class="n">kv_block_manager</span><span class="p">:</span> <span class="n">KVBlockManager</span> <span class="o">|</span> <span class="bp">None</span><span class="p">,</span> <span class="n">request_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">kv_block_manager</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="k">for</span> <span class="n">request_id</span> <span class="ow">in</span> <span class="nf">reversed</span><span class="p">(</span><span class="n">request_ids</span><span class="p">):</span>
        <span class="n">kv_block_manager</span><span class="p">.</span><span class="nf">release</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>
</code></pre></div></div> <p>This removes the request from the Block Table and returns its blocks to the Free Pool:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">KVBlockManager</span><span class="p">:</span>
    <span class="n">total_blocks</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">block_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">16</span>

    <span class="c1"># ...
</span>
    <span class="k">def</span> <span class="nf">release</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
        <span class="n">table</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">request_tables</span><span class="p">.</span><span class="nf">pop</span><span class="p">(</span><span class="n">request_id</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">table</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">KeyError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">request has no allocated blocks: </span><span class="si">{</span><span class="n">request_id</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">released</span> <span class="o">=</span> <span class="nf">list</span><span class="p">(</span><span class="n">table</span><span class="p">.</span><span class="n">block_ids</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">free_block_ids</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="n">released</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">released</span>
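</code></pre></div></div> <p>To see the Free Pool round trip end to end, here is a small self-contained sketch. <code class="language-plaintext highlighter-rouge">TinyBlockPool</code> is a hypothetical stand-in, not the real <code class="language-plaintext highlighter-rouge">KVBlockManager</code>: it keeps only a free list and a per-request table, and it assumes blocks are handed out from the front of the free list.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class TinyBlockPool:
    """Hypothetical stand-in for KVBlockManager, for illustration only."""

    def __init__(self, total_blocks):
        self.free_block_ids = list(range(total_blocks))
        self.request_tables = {}

    def allocate(self, request_id, num_blocks):
        shortfall = max(0, num_blocks - len(self.free_block_ids))
        if shortfall:
            raise MemoryError(f"need {shortfall} more free blocks")
        taken = [self.free_block_ids.pop(0) for _ in range(num_blocks)]
        self.request_tables[request_id] = taken
        return taken

    def release(self, request_id):
        # Drop the request's table and return its blocks to the free pool.
        released = self.request_tables.pop(request_id)
        self.free_block_ids.extend(released)
        return released


pool = TinyBlockPool(total_blocks=4)
pool.allocate("req-a", 3)
print(len(pool.free_block_ids))  # one block is still free
pool.release("req-a")
print(len(pool.free_block_ids))  # all four blocks are free again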
</code></pre></div></div> <p>That covers the full lifecycle of a KV Cache Block. From a high level, the flow looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tokenize</span> <span class="n">prompt</span>
<span class="o">-&gt;</span> <span class="n">allocate</span> <span class="n">prompt</span> <span class="n">blocks</span>
<span class="o">-&gt;</span> <span class="n">prefill</span>
<span class="o">-&gt;</span> <span class="n">sample</span> <span class="n">token</span>
<span class="o">-&gt;</span> <span class="n">append</span> <span class="n">generated</span> <span class="n">token</span> <span class="n">block</span>
<span class="o">-&gt;</span> <span class="n">decode</span> <span class="n">loop</span>
<span class="o">-&gt;</span> <span class="n">append</span> <span class="n">generated</span> <span class="n">token</span> <span class="n">block</span>
<span class="o">-&gt;</span> <span class="n">release</span> <span class="n">blocks</span> <span class="ow">in</span> <span class="k">finally</span>
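</code></pre></div></div> <p>The last step of the flow, releasing in <code class="language-plaintext highlighter-rouge">finally</code>, is what keeps blocks from leaking when generation fails. A minimal sketch of the pattern, assuming a pool that exposes <code class="language-plaintext highlighter-rouge">allocate</code> and <code class="language-plaintext highlighter-rouge">release</code>; <code class="language-plaintext highlighter-rouge">CountingPool</code> and <code class="language-plaintext highlighter-rouge">run_request</code> are hypothetical names, not part of the project:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class CountingPool:
    """Hypothetical pool that only tracks how many blocks are free."""

    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.held = {}

    def allocate(self, request_id, num_blocks):
        shortfall = max(0, num_blocks - self.free_blocks)
        if shortfall:
            raise MemoryError(f"short by {shortfall} blocks")
        self.free_blocks -= num_blocks
        self.held[request_id] = num_blocks

    def release(self, request_id):
        self.free_blocks += self.held.pop(request_id)


def run_request(pool, request_id, num_blocks, generate):
    pool.allocate(request_id, num_blocks)
    try:
        return generate()
    finally:
        # Runs whether generate() returns or raises, so blocks never leak.
        pool.release(request_id)


def failing_generate():
    raise RuntimeError("decode failed")


pool = CountingPool(total_blocks=8)
print(run_request(pool, "req-a", 3, lambda: "ok"), pool.free_blocks)
try:
    run_request(pool, "req-b", 5, failing_generate)
except RuntimeError:
    pass
print(pool.free_blocks)  # blocks were returned despite the error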
</code></pre></div></div> <p>With this mechanism in place, Continuous Batching now releases a request's blocks as soon as that request finishes, instead of waiting for the whole batch to end:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">step</span> <span class="mi">0</span><span class="p">:</span> <span class="n">short</span><span class="o">-</span><span class="mi">0</span> <span class="n">running</span>
<span class="n">step</span> <span class="mi">1</span><span class="p">:</span> <span class="n">short</span><span class="o">-</span><span class="mi">0</span> <span class="n">finished</span><span class="p">,</span> <span class="n">late</span><span class="o">-</span><span class="mi">1</span> <span class="n">running</span>
       <span class="o">-&gt;</span> <span class="n">release</span> <span class="n">short</span><span class="o">-</span><span class="mi">0</span> <span class="n">blocks</span> <span class="n">immediately</span>
<span class="n">step</span> <span class="mi">2</span><span class="p">:</span> <span class="n">late</span><span class="o">-</span><span class="mi">1</span> <span class="n">running</span>
<span class="bp">...</span>
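</code></pre></div></div> <p>The effect on block usage can be simulated with made-up numbers. In this sketch the helper <code class="language-plaintext highlighter-rouge">used_blocks_per_step</code> and the request tuples are hypothetical; it only counts how many blocks are held at each step when a finished request releases immediately versus when everything is held until the batch ends.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def used_blocks_per_step(requests, release_immediately):
    # requests: list of (start_step, end_step, blocks). A request holds its
    # blocks from start_step up to, but not including, end_step.
    last_step = max(end for _, end, _ in requests)
    timeline = []
    for step in range(last_step):
        used = 0
        for start, end, blocks in requests:
            hold_until = end if release_immediately else last_step
            if step in range(start, hold_until):
                used += blocks
        timeline.append(used)
    return timeline


# Hypothetical workload: short-0 runs for one step, late-1 joins at step 1.
requests = [(0, 1, 2), (1, 4, 3)]
print(used_blocks_per_step(requests, release_immediately=True))
print(used_blocks_per_step(requests, release_immediately=False))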
</code></pre></div></div> <h1 id="推理">Inference</h1> <p>As before, we use the bench to observe block usage. The metrics it reports mean the following:</p> <table> <thead> <tr> <th><strong>Field</strong></th> <th><strong>Meaning</strong></th> </tr> </thead> <tbody> <tr> <td>used_blocks</td> <td>Number of blocks currently held by requests</td> </tr> <tr> <td>free_blocks</td> <td>Number of blocks currently free</td> </tr> <tr> <td>allocated_tokens</td> <td>Number of tokens the requests actually need</td> </tr> <tr> <td>reserved_tokens</td> <td>Token capacity actually reserved by the blocks, i.e. used_blocks * block_size</td> </tr> <tr> <td>internal_fragmentation_tokens</td> <td>Token capacity wasted inside blocks, i.e. reserved_tokens - allocated_tokens</td> </tr> <tr> <td>block_utilization</td> <td>allocated_tokens / reserved_tokens</td> </tr> </tbody> </table> <p>For example:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">block_size</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">req</span><span class="o">-</span><span class="n">a</span> <span class="o">=</span> <span class="mi">9</span> <span class="n">tokens</span>
</code></pre></div></div> <p>This request needs one block, so:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">allocated_tokens</span> <span class="o">=</span> <span class="mi">9</span>
<span class="n">reserved_tokens</span> <span class="o">=</span> <span class="mi">16</span>
<span class="n">internal_fragmentation_tokens</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">utilization</span> <span class="o">=</span> <span class="mi">9</span> <span class="o">/</span> <span class="mi">16</span>
</code></pre></div></div> <p>Utilization here is only about 56%. Now let's look at the bench results:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-28-llm-infra-101-v0-5-kv-cache-block/1779949860_7-480.webp 480w,/assets/img/2026-05-28-llm-infra-101-v0-5-kv-cache-block/1779949860_7-800.webp 800w,/assets/img/2026-05-28-llm-infra-101-v0-5-kv-cache-block/1779949860_7-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-28-llm-infra-101-v0-5-kv-cache-block/1779949860_7.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="n">base</span><span class="p">)</span> <span class="n">gpu</span><span class="o">-</span><span class="n">A100</span><span class="o">-</span><span class="mi">05</span> <span class="n">nanoLLMServe</span> <span class="c1"># CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src /data/anaconda3/bin/python -m benchmarks.benchmark_block_manager \
</span>  <span class="o">--</span><span class="n">block</span><span class="o">-</span><span class="n">size</span> <span class="mi">16</span> \
  <span class="o">--</span><span class="n">total</span><span class="o">-</span><span class="n">blocks</span> <span class="mi">64</span> \
  <span class="o">--</span><span class="n">request</span><span class="o">-</span><span class="n">tokens</span> <span class="mi">9</span><span class="p">,</span><span class="mi">17</span><span class="p">,</span><span class="mi">33</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">41</span><span class="p">,</span><span class="mi">12</span>
<span class="p">{</span>
  <span class="sh">"</span><span class="s">block_size</span><span class="sh">"</span><span class="p">:</span> <span class="mi">16</span><span class="p">,</span>
  <span class="sh">"</span><span class="s">block_usage</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
    <span class="sh">"</span><span class="s">allocated_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="mi">117</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">block_utilization</span><span class="sh">"</span><span class="p">:</span> <span class="mf">0.6647727272727273</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">free_blocks</span><span class="sh">"</span><span class="p">:</span> <span class="mi">53</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">internal_fragmentation_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="mi">59</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">reserved_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="mi">176</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">used_blocks</span><span class="sh">"</span><span class="p">:</span> <span class="mi">11</span>
  <span class="p">},</span>
  <span class="sh">"</span><span class="s">contiguous_fixed_slot_baseline</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
    <span class="sh">"</span><span class="s">internal_fragmentation_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="mi">129</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">reserved_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="mi">246</span><span class="p">,</span>
    <span class="sh">"</span><span class="s">utilization</span><span class="sh">"</span><span class="p">:</span> <span class="mf">0.47560975609756095</span>
  <span class="p">},</span>
  <span class="sh">"</span><span class="s">fragmentation_tokens_saved_vs_contiguous</span><span class="sh">"</span><span class="p">:</span> <span class="mi">70</span><span class="p">,</span>
  <span class="sh">"</span><span class="s">request_tokens</span><span class="sh">"</span><span class="p">:</span> <span class="p">[</span>
    <span class="mi">9</span><span class="p">,</span>
    <span class="mi">17</span><span class="p">,</span>
    <span class="mi">33</span><span class="p">,</span>
    <span class="mi">5</span><span class="p">,</span>
    <span class="mi">41</span><span class="p">,</span>
    <span class="mi">12</span>
  <span class="p">]</span>
<span class="p">}</span>
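</code></pre></div></div> <p>The headline numbers in this output can be re-derived by hand. The sketch below is not the benchmark's actual code: the function names <code class="language-plaintext highlighter-rouge">paged_usage</code> and <code class="language-plaintext highlighter-rouge">contiguous_usage</code> are made up here, and it assumes each request rounds up to whole blocks while the baseline gives every request one slot sized for the longest request.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def paged_usage(request_tokens, block_size):
    # Each request rounds up to a whole number of blocks.
    used_blocks = sum((t + block_size - 1) // block_size for t in request_tokens)
    allocated = sum(request_tokens)
    reserved = used_blocks * block_size
    return {
        "used_blocks": used_blocks,
        "allocated_tokens": allocated,
        "reserved_tokens": reserved,
        "internal_fragmentation_tokens": reserved - allocated,
        "block_utilization": allocated / reserved,
    }


def contiguous_usage(request_tokens):
    # Baseline: every request gets one slot sized for the longest request.
    allocated = sum(request_tokens)
    reserved = max(request_tokens) * len(request_tokens)
    return {
        "reserved_tokens": reserved,
        "internal_fragmentation_tokens": reserved - allocated,
        "utilization": allocated / reserved,
    }


tokens = [9, 17, 33, 5, 41, 12]
paged = paged_usage(tokens, block_size=16)
fixed = contiguous_usage(tokens)
print(paged)
print(fixed)
# Fragmentation saved by paging compared with the contiguous baseline.
print(fixed["internal_fragmentation_tokens"] - paged["internal_fragmentation_tokens"])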
</code></pre></div></div> <p>This simulates 6 requests with 9, 17, 33, 5, 41 and 12 tokens respectively. With a block size of 16:</p> <table> <thead> <tr> <th><strong>Request</strong></th> <th><strong>Actual Tokens (allocated_tokens)</strong></th> <th><strong>Blocks Used (used_blocks)</strong></th> <th><strong>Reserved Capacity (reserved_tokens)</strong></th> <th><strong>Internal Fragmentation (internal_fragmentation_tokens)</strong></th> <th><strong>Utilization</strong></th> </tr> </thead> <tbody> <tr> <td>Req A</td> <td>9</td> <td>1</td> <td>16</td> <td>7</td> <td>56.3%</td> </tr> <tr> <td>Req B</td> <td>17</td> <td>2</td> <td>32</td> <td>15</td> <td>53.1%</td> </tr> <tr> <td>Req C</td> <td>33</td> <td>3</td> <td>48</td> <td>15</td> <td>68.8%</td> </tr> <tr> <td>Req D</td> <td>5</td> <td>1</td> <td>16</td> <td>11</td> <td>31.3%</td> </tr> <tr> <td>Req E</td> <td>41</td> <td>3</td> <td>48</td> <td>7</td> <td>85.4%</td> </tr> <tr> <td>Req F</td> <td>12</td> <td>1</td> <td>16</td> <td>4</td> <td>75.0%</td> </tr> </tbody> </table> <p>Summed over all requests:</p> <table> <thead> <tr> <th><strong>Metric</strong></th> <th><strong>Value</strong></th> </tr> </thead> <tbody> <tr> <td>Actual Tokens</td> <td>117</td> </tr> <tr> <td>Blocks Used</td> <td>11</td> </tr> <tr> <td>Block Size</td> <td>16</td> </tr> <tr> <td>Reserved Capacity</td> <td>176</td> </tr> <tr> <td>Internal Fragmentation</td> <td>59</td> </tr> <tr> <td>Overall Utilization</td> <td>66.5%</td> </tr> </tbody> </table> <p>The results also include a control group, <code class="language-plaintext highlighter-rouge">contiguous_fixed_slot_baseline</code>, which reserves a contiguous slot sized for the longest request (41 tokens) for every request. Let's work that out as well:</p> <table> <thead> <tr> <th><strong>Request</strong></th> <th><strong>Actual Tokens</strong></th> <th><strong>Fixed Reserved Capacity</strong></th> <th><strong>Internal Fragmentation</strong></th> <th><strong>Utilization</strong></th> </tr> </thead> <tbody> <tr> <td>Req A</td> <td>9</td> <td>41</td> <td>32</td> <td>22.0%</td> </tr> <tr> <td>Req B</td> <td>17</td> <td>41</td> <td>24</td> <td>41.5%</td> </tr> <tr> <td>Req C</td> <td>33</td> <td>41</td> <td>8</td> <td>80.5%</td> </tr> <tr> <td>Req D</td> <td>5</td> <td>41</td>
<td>36</td> <td>12.2%</td> </tr> <tr> <td>Req E</td> <td>41</td> <td>41</td> <td>0</td> <td>100%</td> </tr> <tr> <td>Req F</td> <td>12</td> <td>41</td> <td>29</td> <td>29.3%</td> </tr> </tbody> </table> <p>Overall:</p> <table> <thead> <tr> <th><strong>Metric</strong></th> <th><strong>Value</strong></th> </tr> </thead> <tbody> <tr> <td>Actual Tokens</td> <td>117</td> </tr> <tr> <td>Requests</td> <td>6</td> </tr> <tr> <td>Fixed Slot per Request</td> <td>41</td> </tr> <tr> <td>Total Reserved Capacity</td> <td>246</td> </tr> <tr> <td>Internal Fragmentation</td> <td>129</td> </tr> <tr> <td>Overall Utilization</td> <td>47.6%</td> </tr> </tbody> </table> <p>Putting both into one table for comparison:</p> <table> <thead> <tr> <th><strong>Metric</strong></th> <th><strong>Paged Blocks</strong></th> <th><strong>Fixed Contiguous</strong></th> <th><strong>Improvement</strong></th> </tr> </thead> <tbody> <tr> <td>Actual Tokens</td> <td>117</td> <td>117</td> <td>-</td> </tr> <tr> <td>Reserved Capacity</td> <td>176</td> <td>246</td> <td>↓ 70</td> </tr> <tr> <td>Internal Fragmentation</td> <td>59</td> <td>129</td> <td>↓ 70</td> </tr> <tr> <td>Utilization</td> <td>66.5%</td> <td>47.6%</td> <td>↑ 18.9 percentage points</td> </tr> </tbody> </table> <p>The comparison makes it clear that blocking (paging) improves overall GPU memory utilization: reserved capacity drops from 246 to 176 tokens, and utilization rises from 47.6% to 66.5%.</p> <h1 id="总结">Summary</h1> <p>Instead of reserving one large fixed slot of memory (GPU memory) per request, we now allocate in fixed-size blocks. Some space is still wasted inside a block, but the waste is confined to a single block, and that alone raises memory utilization. On top of that, memory is reclaimed as soon as a request finishes and handed to other requests, which raises utilization further.</p> <p>Finally, recall that back in v0.2 we built the KV Cache, using Prefill to populate it for a single request. In practice, different requests may share the same prefix and therefore the same prefix KV Cache, so next we will work on an optimization closer to production grade: the Prefix Cache.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="LLMInfra-101"/><summary type="html"><![CDATA[Part six of the series. Earlier parts:]]></summary></entry><entry><title type="html">LLM Infra 101 v0.4: Continuous Batching</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-4-continuous-batching/" rel="alternate" type="text/html" title="LLM Infra 101 v0.4: Continuous Batching"/><published>2026-05-25T00:00:00+00:00</published><updated>2026-05-25T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-v0-4-continuous-batching</id><content type="html"
xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-4-continuous-batching/"><![CDATA[<p>Part five of the series. Earlier parts:</p> <ol> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-model-inference/">LLM Infra 101 v0.0: Model Inference</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-1-openai-compatible-api/">LLM Infra 101 v0.1: API Calls</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-2-kv-cache-decode/">LLM Infra 101 v0.2: KV Cache</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-3-static-batching/">LLM Infra 101 v0.3: Static Batching</a></p> </li> </ol> <p>The code for this part is at <a href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.4.0">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.4.0</a></p> <p>The previous part got as far as Static Batching, and the problems it left behind are clear:</p> <ul> <li> <p>New requests cannot join mid-run</p> </li> <li> <p>After a request finishes, its slot in the batch stays occupied and cannot be freed</p> </li> <li> <p>The lifetime of the whole batch is held back by its slowest request</p> </li> </ul> <p>To address these problems, we need to introduce Continuous Batching.</p> <h1 id="实现">Implementation</h1> <p>The main changes this time are:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>
├── src/
│   └── nanollmserve/
│       └── engine/
│           ├── engine.py              <span class="c"># adds generate_continuous_batch: admission while running, active batch rebuilt every step, finished requests removed</span>
│           └── scheduler.py           <span class="c"># core scheduling structures: waiting/running/finished queues, RequestLifecycle, SchedulerStepStats</span>
└── tests/
    ├── test_benchmark_generate.py     <span class="c"># regression tests for continuous_batch benchmark summary fields: active batch size / request count / scheduler steps</span>
    └── test_engine.py                 <span class="c"># regression tests for continuous batching behavior: mid-run admission, removal on completion, max_batch_size backpressure</span>
</code></pre></div></div> <h2 id="scheduler">Scheduler</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/scheduler.py</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"""</span><span class="s">Teaching-scale continuous batching scheduler.</span><span class="sh">"""</span>

<span class="kn">from</span> <span class="n">__future__</span> <span class="kn">import</span> <span class="n">annotations</span>

<span class="kn">from</span> <span class="n">collections</span> <span class="kn">import</span> <span class="n">deque</span>
<span class="kn">from</span> <span class="n">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="n">enum</span> <span class="kn">import</span> <span class="n">Enum</span>


<span class="k">class</span> <span class="nc">RequestLifecycle</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="n">WAITING</span> <span class="o">=</span> <span class="sh">"</span><span class="s">waiting</span><span class="sh">"</span>
    <span class="n">RUNNING</span> <span class="o">=</span> <span class="sh">"</span><span class="s">running</span><span class="sh">"</span>
    <span class="n">FINISHED</span> <span class="o">=</span> <span class="sh">"</span><span class="s">finished</span><span class="sh">"</span>


<span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">ContinuousBatchRequest</span><span class="p">:</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">max_new_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">32</span>
    <span class="n">arrival_step</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>


<span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">SchedulerStepStats</span><span class="p">:</span>
    <span class="n">step</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">admitted_request_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">running_request_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">completed_request_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>
    <span class="n">active_batch_size</span><span class="p">:</span> <span class="nb">int</span>


<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">ScheduledRequestState</span><span class="p">:</span>
    <span class="n">request</span><span class="p">:</span> <span class="n">ContinuousBatchRequest</span>
    <span class="n">lifecycle</span><span class="p">:</span> <span class="n">RequestLifecycle</span> <span class="o">=</span> <span class="n">RequestLifecycle</span><span class="p">.</span><span class="n">WAITING</span>
    <span class="n">admitted_step</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">finished_step</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>


<span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">ContinuousBatchScheduler</span><span class="p">:</span>
    <span class="n">requests</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ContinuousBatchRequest</span><span class="p">]</span>
    <span class="n">max_batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">waiting</span><span class="p">:</span> <span class="n">deque</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">running</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">,</span> <span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">finished</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">,</span> <span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">__post_init__</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">max_batch_size must be at least 1</span><span class="sh">"</span><span class="p">)</span>

        <span class="n">seen</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nf">set</span><span class="p">()</span>
        <span class="n">indexed_states</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">int</span><span class="p">,</span> <span class="n">ScheduledRequestState</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">request</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">requests</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="ow">in</span> <span class="n">seen</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">duplicate request_id: </span><span class="si">{</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
            <span class="n">seen</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">request</span><span class="p">.</span><span class="n">arrival_step</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">arrival_step must be non-negative</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">request</span><span class="p">.</span><span class="n">max_new_tokens</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">max_new_tokens must be at least 1</span><span class="sh">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">request</span><span class="p">.</span><span class="n">prompt</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">prompt must not be empty</span><span class="sh">"</span><span class="p">)</span>
            <span class="n">indexed_states</span><span class="p">.</span><span class="nf">append</span><span class="p">((</span><span class="n">index</span><span class="p">,</span> <span class="nc">ScheduledRequestState</span><span class="p">(</span><span class="n">request</span><span class="o">=</span><span class="n">request</span><span class="p">)))</span>

        <span class="n">indexed_states</span><span class="p">.</span><span class="nf">sort</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">item</span><span class="p">:</span> <span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">request</span><span class="p">.</span><span class="n">arrival_step</span><span class="p">,</span> <span class="n">item</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
        <span class="n">self</span><span class="p">.</span><span class="n">waiting</span> <span class="o">=</span> <span class="nf">deque</span><span class="p">(</span><span class="n">state</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">indexed_states</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">has_work</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="k">return</span> <span class="nf">bool</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">waiting</span> <span class="ow">or</span> <span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">next_arrival_step</span><span class="p">(</span><span class="n">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">None</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">request</span><span class="p">.</span><span class="n">arrival_step</span>

    <span class="k">def</span> <span class="nf">admit</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]:</span>
        <span class="n">admitted</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">while</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span> <span class="ow">and</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">request</span><span class="p">.</span><span class="n">arrival_step</span> <span class="o">&lt;=</span> <span class="n">step</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nf">len</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span><span class="p">:</span>
                <span class="k">break</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">.</span><span class="nf">popleft</span><span class="p">()</span>
            <span class="n">state</span><span class="p">.</span><span class="n">lifecycle</span> <span class="o">=</span> <span class="n">RequestLifecycle</span><span class="p">.</span><span class="n">RUNNING</span>
            <span class="n">state</span><span class="p">.</span><span class="n">admitted_step</span> <span class="o">=</span> <span class="n">step</span>
            <span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
            <span class="n">admitted</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">admitted</span>

    <span class="k">def</span> <span class="nf">finish</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_ids</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]:</span>
        <span class="n">completed</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">still_running</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="ow">in</span> <span class="n">request_ids</span><span class="p">:</span>
                <span class="n">state</span><span class="p">.</span><span class="n">lifecycle</span> <span class="o">=</span> <span class="n">RequestLifecycle</span><span class="p">.</span><span class="n">FINISHED</span>
                <span class="n">state</span><span class="p">.</span><span class="n">finished_step</span> <span class="o">=</span> <span class="n">step</span>
                <span class="n">completed</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">still_running</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">running</span> <span class="o">=</span> <span class="n">still_running</span>
        <span class="n">self</span><span class="p">.</span><span class="n">finished</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="n">completed</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">completed</span>

    <span class="k">def</span> <span class="nf">record_step</span><span class="p">(</span>
        <span class="n">self</span><span class="p">,</span>
        <span class="o">*</span><span class="p">,</span>
        <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span>
        <span class="n">admitted</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">],</span>
        <span class="n">running_request_ids</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">completed</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">],</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">SchedulerStepStats</span><span class="p">:</span>
        <span class="k">return</span> <span class="nc">SchedulerStepStats</span><span class="p">(</span>
            <span class="n">step</span><span class="o">=</span><span class="n">step</span><span class="p">,</span>
            <span class="n">admitted_request_ids</span><span class="o">=</span><span class="p">[</span><span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">admitted</span><span class="p">],</span>
            <span class="n">running_request_ids</span><span class="o">=</span><span class="n">running_request_ids</span><span class="p">,</span>
            <span class="n">completed_request_ids</span><span class="o">=</span><span class="p">[</span><span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">completed</span><span class="p">],</span>
            <span class="n">active_batch_size</span><span class="o">=</span><span class="nf">len</span><span class="p">(</span><span class="n">running_request_ids</span><span class="p">),</span>
        <span class="p">)</span>

</code></pre></div></div> <p>This time a Scheduler is introduced to handle and schedule requests. The first step is to define the request lifecycle:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">RequestLifecycle</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="n">WAITING</span> <span class="o">=</span> <span class="sh">"</span><span class="s">waiting</span><span class="sh">"</span>
    <span class="n">RUNNING</span> <span class="o">=</span> <span class="sh">"</span><span class="s">running</span><span class="sh">"</span>
    <span class="n">FINISHED</span> <span class="o">=</span> <span class="sh">"</span><span class="s">finished</span><span class="sh">"</span>
</code></pre></div></div> <p>A request starts out in waiting. At some scheduler step it is admitted into the active batch and becomes running. Once generation ends and it is removed from the running set, it becomes finished.</p> <p>An incoming request now looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span><span class="p">(</span><span class="n">frozen</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">ContinuousBatchRequest</span><span class="p">:</span>
    <span class="n">request_id</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">max_new_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">32</span>
    <span class="n">arrival_step</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div> <p>Requests are constructed like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">ContinuousBatchRequest</span><span class="p">(</span><span class="sh">"</span><span class="s">req-0</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">hello</span><span class="sh">"</span><span class="p">,</span> <span class="n">arrival_step</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="nc">ContinuousBatchRequest</span><span class="p">(</span><span class="sh">"</span><span class="s">req-1</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">你好</span><span class="sh">"</span><span class="p">,</span> <span class="n">arrival_step</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
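# Sketch only (an assumption, not the post's actual code): one way a scheduler
# could order such requests into its waiting queue, earliest arrival first.
# DemoRequest and build_waiting_queue are hypothetical stand-ins.
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoRequest:
    request_id: str
    prompt: str
    arrival_step: int = 0


def build_waiting_queue(requests):
    # sorted() is stable, so requests with equal arrival_step keep input order.
    return deque(sorted(requests, key=lambda r: r.arrival_step))


demo_queue = build_waiting_queue(
    [
        DemoRequest("req-1", "你好", arrival_step=2),
        DemoRequest("req-0", "hello", arrival_step=0),
    ]
)
# The head of demo_queue is now the request with the smallest arrival_step.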
</code></pre></div></div> <p>These are two requests: req-0 arrives at step 0 and req-1 arrives at step 2.</p> <p>The scheduler defines two lists and one queue:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">@dataclass</span>
<span class="k">class</span> <span class="nc">ContinuousBatchScheduler</span><span class="p">:</span>
    <span class="n">requests</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ContinuousBatchRequest</span><span class="p">]</span>
    <span class="n">max_batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">waiting</span><span class="p">:</span> <span class="n">deque</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">running</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">,</span> <span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="n">finished</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="nf">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">,</span> <span class="n">init</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div> <p>They hold the requests in each of the three lifecycle stages described above. Two important methods go with them:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">admit</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]:</span>
    <span class="n">admitted</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">while</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span> <span class="ow">and</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">request</span><span class="p">.</span><span class="n">arrival_step</span> <span class="o">&lt;=</span> <span class="n">step</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="nf">len</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="n">self</span><span class="p">.</span><span class="n">max_batch_size</span><span class="p">:</span>
            <span class="k">break</span>
        <span class="n">state</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">waiting</span><span class="p">.</span><span class="nf">popleft</span><span class="p">()</span>
        <span class="n">state</span><span class="p">.</span><span class="n">lifecycle</span> <span class="o">=</span> <span class="n">RequestLifecycle</span><span class="p">.</span><span class="n">RUNNING</span>
        <span class="n">state</span><span class="p">.</span><span class="n">admitted_step</span> <span class="o">=</span> <span class="n">step</span>
        <span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="n">admitted</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">admitted</span>

<span class="k">def</span> <span class="nf">finish</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">request_ids</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span> <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]:</span>
    <span class="n">completed</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">still_running</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">ScheduledRequestState</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">running</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="ow">in</span> <span class="n">request_ids</span><span class="p">:</span>
            <span class="n">state</span><span class="p">.</span><span class="n">lifecycle</span> <span class="o">=</span> <span class="n">RequestLifecycle</span><span class="p">.</span><span class="n">FINISHED</span>
            <span class="n">state</span><span class="p">.</span><span class="n">finished_step</span> <span class="o">=</span> <span class="n">step</span>
            <span class="n">completed</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">still_running</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">state</span><span class="p">)</span>
    <span class="n">self</span><span class="p">.</span><span class="n">running</span> <span class="o">=</span> <span class="n">still_running</span>
    <span class="n">self</span><span class="p">.</span><span class="n">finished</span><span class="p">.</span><span class="nf">extend</span><span class="p">(</span><span class="n">completed</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">completed</span>
</code></pre></div></div> <p>admit moves requests that have already arrived and are waiting from waiting to running, stopping once max_batch_size is reached, while finish moves completed requests from running to finished.</p> <h2 id="engine">Engine</h2> <p>Continuous batching itself is handled by <code class="language-plaintext highlighter-rouge">generate_continuous_batch</code> in <code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/engine.py</code>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="nf">inference_mode</span><span class="p">():</span>
    <span class="k">while</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">has_work</span><span class="p">():</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">scheduler</span><span class="p">.</span><span class="n">running</span> <span class="ow">and</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">next_arrival_step</span><span class="p">()</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">step</span> <span class="o">=</span> <span class="nf">max</span><span class="p">(</span><span class="n">step</span><span class="p">,</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">next_arrival_step</span><span class="p">())</span>

        <span class="n">admitted</span> <span class="o">=</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">admit</span><span class="p">(</span><span class="n">step</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">scheduled</span> <span class="ow">in</span> <span class="n">admitted</span><span class="p">:</span>
            <span class="n">states</span><span class="p">[</span><span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">_state_from_prompt</span><span class="p">(</span>
                <span class="n">tokenizer</span><span class="p">,</span>
                <span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">prompt</span><span class="p">,</span>
                <span class="n">device</span><span class="p">,</span>
            <span class="p">)</span>
            <span class="n">admitted_at</span><span class="p">[</span><span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>

        <span class="n">running_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">scheduler</span><span class="p">.</span><span class="n">running</span><span class="p">]</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">running_ids</span><span class="p">:</span>
            <span class="k">continue</span>

        <span class="n">batch</span> <span class="o">=</span> <span class="nf">_continuous_batch_tensors</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">running_ids</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
        <span class="n">batch_start</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>
        <span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span>
            <span class="n">input_ids</span><span class="o">=</span><span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">input_ids</span><span class="sh">"</span><span class="p">],</span>
            <span class="n">attention_mask</span><span class="o">=</span><span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">attention_mask</span><span class="sh">"</span><span class="p">],</span>
            <span class="n">use_cache</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="n">batch_elapsed</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">batch_start</span>
        <span class="n">next_logits</span> <span class="o">=</span> <span class="nf">_select_last_token_logits</span><span class="p">(</span><span class="n">outputs</span><span class="p">.</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">attention_mask</span><span class="sh">"</span><span class="p">])</span>
        <span class="n">next_tokens</span> <span class="o">=</span> <span class="nf">_sample_from_logits</span><span class="p">(</span>
            <span class="n">next_logits</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
            <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">,</span>
        <span class="p">)</span>

        <span class="n">completed_ids</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nf">set</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">request_id</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">running_ids</span><span class="p">):</span>
            <span class="n">state</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
            <span class="n">request</span> <span class="o">=</span> <span class="n">request_by_id</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
            <span class="n">token_id</span> <span class="o">=</span> <span class="nf">int</span><span class="p">(</span><span class="n">next_tokens</span><span class="p">[</span><span class="n">index</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="nf">item</span><span class="p">())</span>
            <span class="n">state</span><span class="p">.</span><span class="n">generated_token_ids</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">token_id</span><span class="p">)</span>
            <span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">cat</span><span class="p">(</span>
                <span class="p">[</span>
                    <span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">,</span>
                    <span class="n">torch</span><span class="p">.</span><span class="nf">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
                <span class="p">],</span>
                <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span>
            <span class="p">)</span>
            <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">generated_tokens</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
                <span class="n">state</span><span class="p">.</span><span class="n">ttft_seconds</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">admitted_at</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
                <span class="n">state</span><span class="p">.</span><span class="n">prefill_seconds</span> <span class="o">+=</span> <span class="n">batch_elapsed</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">state</span><span class="p">.</span><span class="n">decode_seconds</span> <span class="o">+=</span> <span class="n">batch_elapsed</span>
            <span class="k">if</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">eos_token_ids</span> <span class="ow">or</span> <span class="n">state</span><span class="p">.</span><span class="n">generated_tokens</span> <span class="o">&gt;=</span> <span class="n">request</span><span class="p">.</span><span class="n">max_new_tokens</span><span class="p">:</span>
                <span class="n">state</span><span class="p">.</span><span class="n">finished</span> <span class="o">=</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">eos_token_ids</span>
                <span class="n">completed_ids</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>
                <span class="n">finished_at</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>

        <span class="n">completed</span> <span class="o">=</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">finish</span><span class="p">(</span><span class="n">completed_ids</span><span class="p">,</span> <span class="n">step</span><span class="p">)</span>
        <span class="n">scheduler_steps</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span>
            <span class="n">scheduler</span><span class="p">.</span><span class="nf">record_step</span><span class="p">(</span>
                <span class="n">step</span><span class="o">=</span><span class="n">step</span><span class="p">,</span>
                <span class="n">admitted</span><span class="o">=</span><span class="n">admitted</span><span class="p">,</span>
                <span class="n">running_request_ids</span><span class="o">=</span><span class="n">running_ids</span><span class="p">,</span>
                <span class="n">completed</span><span class="o">=</span><span class="n">completed</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="p">)</span>
        <span class="n">step</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div> <p>First, the scheduler admits the requests that are eligible at the current step into running:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">admitted</span> <span class="o">=</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">admit</span><span class="p">(</span><span class="n">step</span><span class="p">)</span>
<span class="k">for</span> <span class="n">scheduled</span> <span class="ow">in</span> <span class="n">admitted</span><span class="p">:</span>
    <span class="n">states</span><span class="p">[</span><span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">_state_from_prompt</span><span class="p">(</span>
        <span class="n">tokenizer</span><span class="p">,</span>
        <span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">prompt</span><span class="p">,</span>
        <span class="n">device</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">admitted_at</span><span class="p">[</span><span class="n">scheduled</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>
</code></pre></div></div> <p>Next, collect all the running_ids and rebuild the active batch from them:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">running_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">state</span><span class="p">.</span><span class="n">request</span><span class="p">.</span><span class="n">request_id</span> <span class="k">for</span> <span class="n">state</span> <span class="ow">in</span> <span class="n">scheduler</span><span class="p">.</span><span class="n">running</span><span class="p">]</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">running_ids</span><span class="p">:</span>
    <span class="k">continue</span>

<span class="n">batch</span> <span class="o">=</span> <span class="nf">_continuous_batch_tensors</span><span class="p">(</span><span class="n">states</span><span class="p">,</span> <span class="n">running_ids</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
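
# Padding sketch (an assumption: a plain-Python stand-in, since the real
# _continuous_batch_tensors is not shown here and may differ, for example in
# padding side or in returning tensors). Each running request contributes its
# prompt plus generated tokens; shorter rows are right-padded and masked out.
# pad_batch and pad_id are hypothetical names used only for illustration.
def pad_batch(rows, pad_id=0):
    width = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (width - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


demo_batch = pad_batch([[11, 12, 13], [21]])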
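</code></pre></div></div> <p>_continuous_batch_tensors packs the running requests into a single padded batch. As covered earlier, the requests in a batch have to be aligned to the same length.</p> <p>The rest works the same as before: get the logits and sample the next token.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_start</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>
<span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span>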
    <span class="n">input_ids</span><span class="o">=</span><span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">input_ids</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">attention_mask</span><span class="o">=</span><span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">attention_mask</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">use_cache</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">batch_elapsed</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">batch_start</span>
<span class="n">next_logits</span> <span class="o">=</span> <span class="nf">_select_last_token_logits</span><span class="p">(</span><span class="n">outputs</span><span class="p">.</span><span class="n">logits</span><span class="p">,</span> <span class="n">batch</span><span class="p">[</span><span class="sh">"</span><span class="s">attention_mask</span><span class="sh">"</span><span class="p">])</span>
<span class="n">next_tokens</span> <span class="o">=</span> <span class="nf">_sample_from_logits</span><span class="p">(</span>
    <span class="n">next_logits</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
    <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div> <p>Then walk through the requests in this forward pass and append each sampled token to that request's state, extending its attention mask by one position. Finally, collect the requests that have finished and report them back to the Scheduler.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">completed_ids</span><span class="p">:</span> <span class="nb">set</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="nf">set</span><span class="p">()</span>
<span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">request_id</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">running_ids</span><span class="p">):</span>
    <span class="n">state</span> <span class="o">=</span> <span class="n">states</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
    <span class="n">request</span> <span class="o">=</span> <span class="n">request_by_id</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
    <span class="n">token_id</span> <span class="o">=</span> <span class="nf">int</span><span class="p">(</span><span class="n">next_tokens</span><span class="p">[</span><span class="n">index</span><span class="p">,</span> <span class="mi">0</span><span class="p">].</span><span class="nf">item</span><span class="p">())</span>
    <span class="n">state</span><span class="p">.</span><span class="n">generated_token_ids</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">token_id</span><span class="p">)</span>
    <span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">cat</span><span class="p">(</span>
        <span class="p">[</span>
            <span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">,</span>
            <span class="n">torch</span><span class="p">.</span><span class="nf">ones</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
        <span class="p">],</span>
        <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">generated_tokens</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
        <span class="n">state</span><span class="p">.</span><span class="n">ttft_seconds</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">admitted_at</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span>
        <span class="n">state</span><span class="p">.</span><span class="n">prefill_seconds</span> <span class="o">+=</span> <span class="n">batch_elapsed</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">state</span><span class="p">.</span><span class="n">decode_seconds</span> <span class="o">+=</span> <span class="n">batch_elapsed</span>
    <span class="k">if</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">eos_token_ids</span> <span class="ow">or</span> <span class="n">state</span><span class="p">.</span><span class="n">generated_tokens</span> <span class="o">&gt;=</span> <span class="n">request</span><span class="p">.</span><span class="n">max_new_tokens</span><span class="p">:</span>
        <span class="n">state</span><span class="p">.</span><span class="n">finished</span> <span class="o">=</span> <span class="n">token_id</span> <span class="ow">in</span> <span class="n">eos_token_ids</span>
        <span class="n">completed_ids</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">request_id</span><span class="p">)</span>
        <span class="n">finished_at</span><span class="p">[</span><span class="n">request_id</span><span class="p">]</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>

<span class="n">completed</span> <span class="o">=</span> <span class="n">scheduler</span><span class="p">.</span><span class="nf">finish</span><span class="p">(</span><span class="n">completed_ids</span><span class="p">,</span> <span class="n">step</span><span class="p">)</span>
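</code></pre></div></div> <p>To make the admit / forward / finish cycle easier to follow, here is a toy simulation with no model at all. It is only a sketch under stated assumptions: <code class="language-plaintext highlighter-rouge">simulate_continuous_batching</code> and its arguments are hypothetical names, every request is assumed to generate at least one token, and each step stands for one forward pass that yields one token per running request.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import deque


def simulate_continuous_batching(output_lengths, max_batch_size):
    """Return the step at which each request finishes.

    output_lengths maps request_id to the number of tokens it will generate
    (assumed to be at least 1).
    """
    waiting = deque(output_lengths)  # request ids in arrival order
    remaining = {}                   # running request id, tokens still to generate
    finished_step = {}
    step = 0
    while waiting or remaining:
        # Admit waiting requests into any free slots before the forward pass.
        while waiting and len(remaining) != max_batch_size:
            request_id = waiting.popleft()
            remaining[request_id] = output_lengths[request_id]
        step += 1
        # One forward pass produces one token for every running request.
        for request_id in list(remaining):
            remaining[request_id] -= 1
            if remaining[request_id] == 0:
                # A finished request leaves at once and frees its slot.
                del remaining[request_id]
                finished_step[request_id] = step
    return finished_step


result = simulate_continuous_batching({"A": 2, "B": 5, "C": 3, "D": 2}, max_batch_size=3)
print(result)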
</code></pre></div></div> <p>That is the complete Continuous Batching loop. It still has problems, though: the granularity is too coarse and performance is far from optimal.</p> <p>Every call to <code class="language-plaintext highlighter-rouge">_continuous_batch_tensors</code> currently builds each sequence like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sequence</span> <span class="o">=</span> <span class="n">state</span><span class="p">.</span><span class="n">prompt_token_ids</span> <span class="o">+</span> <span class="n">state</span><span class="p">.</span><span class="n">generated_token_ids</span>
</code></pre></div></div> <p>This forwards the entire sequence again on every step (note the <code class="language-plaintext highlighter-rouge">use_cache=False</code> above), so the KV Cache is lost again: the KV Cache we built in v0.2 is not actually used here. That is what the upcoming paged-KV continuous batching will address.</p> <h1 id="推理">Inference</h1> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-25-llm-infra-101-v0-4-continuous-batching/1779683395_2-480.webp 480w,/assets/img/2026-05-25-llm-infra-101-v0-4-continuous-batching/1779683395_2-800.webp 800w,/assets/img/2026-05-25-llm-infra-101-v0-4-continuous-batching/1779683395_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-25-llm-infra-101-v0-4-continuous-batching/1779683395_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>For now this can only be observed through the benchmark; the metrics appear under <code class="language-plaintext highlighter-rouge">continuous_batch</code>.</p> <h1 id="总结">Summary</h1> <p>This release adds Continuous Batching: whoever finishes leaves, and whoever arrives takes the free slot, so a finished request no longer keeps occupying a GPU memory slot in the batch.</p> <p>As mentioned above, though, the current version mostly demonstrates the concept of Continuous Batching itself; it is not something that can run in production the way vllm/sglang do. More work is needed, and the next step is to use a Paged KV Cache (blocks) to support Paged-KV Continuous Batching.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="LLMInfra-101"/><summary type="html"><![CDATA[Episode 5 of the series. The earlier ones:]]></summary></entry><entry><title type="html">LLM Infra 101 v0.3: Static Batching</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-3-static-batching/" rel="alternate" type="text/html" title="LLM Infra 101 v0.3: Static Batching"/><published>2026-05-21T00:00:00+00:00</published><updated>2026-05-21T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-v0-3-static-batching</id><content type="html" 
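xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-3-static-batching/"><![CDATA[<p>This is the fourth episode of the series. The earlier ones:</p> <ol> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-model-inference/">LLM Infra 101 v0.0: Model Inference</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-1-openai-compatible-api/">LLM Infra 101 v0.1: API Calls</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-2-kv-cache-decode/">LLM Infra 101 v0.2: KV Cache</a></p> </li> </ol> <p>The code for this episode is at <a href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.3.1">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.3.1</a></p> <p>After the last episode we understand KV Cache and have an implementation of it. Now for the next problem: each incoming request is currently handled on its own, and so is the one after it. In production this leads to low throughput, high latency, and poor resource utilization.</p> <p>Handling one request at a time is effectively serial processing: the GPU sits idle much of the time, and every request runs its own prefill. The natural optimization is batching, which processes several requests in parallel, raises overall GPU utilization, and lets memory access and kernel scheduling be batched as well.</p> <p>There are two core flavors of batching:</p> <ul> <li> <p>Static Batching: the more traditional approach, but very helpful for understanding the underlying mechanism</p> </li> <li> <p>Continuous Batching: the technique used by today's mainstream inference infrastructure</p> </li> </ul> <p>This episode focuses on Static Batching; the next one moves on to Continuous Batching.</p> <p>Static Batching takes a fixed group of requests and forwards the whole group together, roughly like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>req1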
req2
req3
↓
form a fixed batch
↓
forward together
↓
wait until the whole batch has finished
↓
then process the next batch
</code></pre></div></div> <p>For comparison, it used to look like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Timeline</span> <span class="err">→</span>

<span class="n">Req</span> <span class="n">A</span> <span class="err">─────</span> <span class="n">GPU</span> <span class="n">forward</span> <span class="err">─────</span>
<span class="n">Req</span> <span class="n">B</span>                     <span class="err">─────</span> <span class="n">GPU</span> <span class="n">forward</span> <span class="err">─────</span>
<span class="n">Req</span> <span class="n">C</span>                                         <span class="err">─────</span> <span class="n">GPU</span> <span class="n">forward</span> <span class="err">─────</span>
</code></pre></div></div> <p>Now it looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              <span class="err">┌───────────────┐</span>
<span class="n">Req</span> <span class="n">A</span> <span class="err">───────▶│</span>               <span class="err">│</span>
<span class="n">Req</span> <span class="n">B</span> <span class="err">───────▶│</span>   <span class="n">Batch</span><span class="o">=</span><span class="mi">3</span>     <span class="err">│───▶</span> <span class="n">GPU</span> <span class="n">Forward</span>
<span class="n">Req</span> <span class="n">C</span> <span class="err">───────▶│</span>               <span class="err">│</span>
              <span class="err">└───────────────┘</span>
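</code></pre></div></div> <p>The flow above can be sketched in a few lines: cut the waiting queue into fixed-size groups in arrival order, then handle one group at a time. This is only an illustration; <code class="language-plaintext highlighter-rouge">static_batches</code> is a made-up helper, not something from nanoLLMServe.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def static_batches(requests, batch_size):
    """Split the waiting requests into fixed batches, in arrival order."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


queue = ["req1", "req2", "req3", "req4", "req5"]
batches = static_batches(queue, batch_size=3)
for batch in batches:
    # In a real server, the whole batch is forwarded together here, and the
    # next batch only starts after every request in this one has finished.
    print(batch)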
</code></pre></div></div> <h1 id="实现">Implementation</h1> <p>Files touched by this change:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>
├── src/
│   └── nanollmserve/
│       ├── api/
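│       │   ├── openai_server.py    <span class="c"># Responses subset alignment: extended service-layer endpoint behavior and route handling</span>
│       │   └── protocol.py         <span class="c"># Protocol model updates: fields for the Responses subset, consolidated request/response structures</span>
│       └── engine/
│           └── engine.py           <span class="c"># Adds the static batching path and scheduling on top of the existing generate_one</span>
└── tests/
    ├── test_benchmark_generate.py   <span class="c"># Benchmark summary fields plus regression coverage for the static-batch/Responses scenarios</span>
    ├── test_engine.py               <span class="c"># Engine behavior regressions (including the batching/state paths)</span>
    └── test_openai_server.py        <span class="c"># Endpoint regressions for the reworked OpenAI-compatible layer (including the Responses subset)</span>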
</code></pre></div></div> <h2 id="batching">Batching</h2> <p>Previously there was only <code class="language-plaintext highlighter-rouge">generate_one(model, tokenizer, prompt, ...)</code>; this release adds <code class="language-plaintext highlighter-rouge">generate_batch(model, tokenizer, prompts, ...)</code>, which takes a list of prompts:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="nf">generate_batch</span><span class="p">(</span>
    <span class="n">model</span><span class="p">,</span>
    <span class="n">tokenizer</span><span class="p">,</span>
    <span class="p">[</span><span class="sh">"</span><span class="s">hello</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">world</span><span class="sh">"</span><span class="p">],</span>
    <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
<span class="p">)</span>
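</code></pre></div></div> <p>We did not wire <code class="language-plaintext highlighter-rouge">generate_batch</code> into the API or the CLI, though: Continuous Batching is coming right up, so this is only a stepping stone.</p> <p>Because prompts now arrive as a group, their lengths may all differ:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A input_ids: [101, 102]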
B input_ids: [201, 202, 203, 204]
</code></pre></div></div> <p>So padding has to be enabled on the tokenizer:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">encoded</span> <span class="o">=</span> <span class="nf">tokenizer</span><span class="p">(</span>
    <span class="n">prompts</span><span class="p">,</span>
    <span class="n">return_tensors</span><span class="o">=</span><span class="sh">"</span><span class="s">pt</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div> <p>After padding, the batch looks something like this (with 0 standing in for the pad token id):</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A input_ids: [101, 102, 0, 0]
A attention_mask: [1, 1, 0, 0]

B input_ids: [201, 202, 203, 204]
B attention_mask: [1, 1, 1, 1]
</code></pre></div></div> <p>We already covered this once in the first chapter. Back then there was no batching mechanism, so the batch size was implicitly 1; now a batch holds multiple requests.</p> <h2 id="batch-prefill">Batch Prefill</h2> <p>We implemented prefill earlier, but only for a single request:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt -&gt; model -&gt; past_key_values
</code></pre></div></div> <p>Now it becomes a batch prefill:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[prompt A, prompt B, prompt C] -&gt; model -&gt; batch past_key_values
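</code></pre></div></div> <p>The first forward pass now runs on a batch. The forward call already had a batch dimension, so the mechanism was in place; only one small detail needs to change. The logits used to be read from position -1, but with the right padding shown above that position may now be a pad token, so the lookup has to be adjusted:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_select_last_token_logits</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">attention_mask</span><span class="p">):</span>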
    <span class="n">indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">clamp</span><span class="p">(</span><span class="n">attention_mask</span><span class="p">.</span><span class="nf">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">batch</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="nf">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="n">logits</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">logits</span><span class="p">[</span><span class="n">batch</span><span class="p">,</span> <span class="n">indices</span><span class="p">,</span> <span class="p">:]</span>
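</code></pre></div></div> <p>To see why the mask sum works, here is the same index arithmetic in plain Python, applied to the padded A/B example from earlier. It assumes right padding, as in that example; with left padding the last real token would already sit at position -1 and this arithmetic would not apply. <code class="language-plaintext highlighter-rouge">last_real_index</code> is a name made up for this sketch.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def last_real_index(mask_row):
    """Index of the last real token in a right-padded 0/1 attention mask row."""
    # Same idea as torch.clamp(attention_mask.sum(dim=1) - 1, min=0) for one row.
    return max(sum(mask_row) - 1, 0)


attention_mask = [
    [1, 1, 0, 0],  # request A: two real tokens, two pads
    [1, 1, 1, 1],  # request B: four real tokens
]
indices = [last_real_index(row) for row in attention_mask]
print(indices)  # [1, 3]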
</code></pre></div></div> <p>It now uses the attention_mask to locate the last real token of each row, since the mask records which positions are real.</p> <h2 id="batch-decode">Batch Decode</h2> <p>Decode works the same way; the batch dimension was already there:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_ids</span> <span class="o">=</span> <span class="p">[[</span><span class="n">last_token</span><span class="p">]]</span>
</code></pre></div></div> <p>Now it is:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_ids</span> <span class="o">=</span> <span class="p">[</span>
  <span class="p">[</span><span class="n">last_token_for_A</span><span class="p">],</span>
  <span class="p">[</span><span class="n">last_token_for_B</span><span class="p">],</span>
  <span class="p">[</span><span class="n">last_token_for_C</span><span class="p">],</span>
<span class="p">]</span>
</code></pre></div></div> <p>Each step now generates the next token for every request in the batch:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">step</span> <span class="mi">1</span><span class="p">:</span> A generates one token, B generates one token, C generates one token
<span class="n">step</span> <span class="mi">2</span><span class="p">:</span> A generates one token, B generates one token, C generates one token
<span class="n">step</span> <span class="mi">3</span><span class="p">:</span> <span class="bp">...</span>
</code></pre></div></div> <p>But the actual sequences have different lengths, so some requests finish earlier:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Req</span> <span class="n">A</span> <span class="err">→</span> <span class="mi">10</span> <span class="n">tokens</span>
<span class="n">Req</span> <span class="n">B</span> <span class="err">→</span> <span class="mi">500</span> <span class="n">tokens</span>
<span class="n">Req</span> <span class="n">C</span> <span class="err">→</span> <span class="mi">50</span> <span class="n">tokens</span>

<span class="n">Req</span> <span class="n">A</span><span class="p">:</span> <span class="n">finished</span>
<span class="n">Req</span> <span class="n">B</span><span class="p">:</span> <span class="n">running</span>
<span class="n">Req</span> <span class="n">C</span><span class="p">:</span> <span class="n">running</span>
</code></pre></div></div> <p>In Static Batching, a request that hits EOS first is marked finished, and no more tokens are appended to its generated_token_ids. But once some requests in the batch have finished, holes appear in the GPU's work:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch</span> <span class="o">=</span> <span class="p">[</span><span class="n">EMPTY</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">]</span>
<span class="p">[</span> <span class="n">_</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span> <span class="p">]</span>

<span class="n">batch</span> <span class="o">=</span> <span class="p">[</span><span class="n">EMPTY</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">EMPTY</span><span class="p">]</span>
<span class="p">[</span> <span class="n">_</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">_</span> <span class="p">]</span>
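</code></pre></div></div> <p>GPU compute utilization therefore keeps dropping as the batch drains, eventually degrading to a single request. Meanwhile the memory slots of the finished requests are not released, which wastes GPU memory:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># step1: GPU0 KV Memory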
</span><span class="err">┌────┬────┬────┐</span>
<span class="err">│</span> <span class="n">A</span>  <span class="err">│</span> <span class="n">B</span>  <span class="err">│</span> <span class="n">C</span>  <span class="err">│</span>
<span class="err">└────┴────┴────┘</span>

<span class="c1"># step2: GPU0 KV Memory
</span><span class="err">┌────┬────┬────┐</span>
<span class="err">│</span><span class="n">idle</span><span class="err">│</span> <span class="n">B</span>  <span class="err">│</span> <span class="n">C</span>  <span class="err">│</span>
<span class="err">└────┴────┴────┘</span>

<span class="c1"># step3: GPU0 KV Memory
</span><span class="err">┌────┬────┬────┐</span>
<span class="err">│</span><span class="n">idle</span><span class="err">│</span> <span class="n">B</span>  <span class="err">│</span><span class="n">idle</span><span class="err">│</span>
<span class="err">└────┴────┴────┘</span>
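</code></pre></div></div> <p>A quick back-of-the-envelope calculation puts a number on the waste, using the A/B/C lengths from the example above (10, 500 and 50 tokens). It is a simplification: it counts decode steps only, ignores prefill, and assumes every slot costs the same per step. <code class="language-plaintext highlighter-rouge">static_batch_utilization</code> is a hypothetical helper for this sketch.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def static_batch_utilization(output_lengths):
    """Fraction of (slot, step) pairs doing useful work in one static batch."""
    steps = max(output_lengths)                # the batch runs until its longest request ends
    total_slots = steps * len(output_lengths)  # every slot is held for every step
    useful_slots = sum(output_lengths)         # a slot is useful only until its request finishes
    return useful_slots / total_slots


utilization = static_batch_utilization([10, 500, 50])
print(round(utilization, 3))  # 560 useful out of 1500 slot-steps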
</code></pre></div></div> <p>This is exactly what Continuous Batching solves in the next chapter (the rough idea: whoever finishes leaves, and whoever arrives takes the free slot), so we won't go into it here.</p> <h1 id="推理">Inference</h1> <p>This time, too, we mostly just observe the results in the benchmark:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">(base)</span><span class="w"> </span><span class="err">gpu-A</span><span class="mi">100-05</span><span class="w"> </span><span class="err">nanoLLMServe</span><span class="w"> </span><span class="err">#</span><span class="w"> </span><span class="err">for</span><span class="w"> </span><span class="err">BS</span><span class="w"> </span><span class="err">in</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="mi">4</span><span class="w"> </span><span class="mi">8</span><span class="err">;</span><span class="w"> </span><span class="err">do</span><span class="w">
  </span><span class="err">CUDA_VISIBLE_DEVICES=</span><span class="mi">0</span><span class="w"> </span><span class="err">PYTHONPATH=src</span><span class="w"> </span><span class="err">/data/anaconda</span><span class="mi">3</span><span class="err">/bin/python</span><span class="w"> </span><span class="err">-m</span><span class="w"> </span><span class="err">benchmarks.benchmark_generate</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--model</span><span class="w"> </span><span class="err">/data</span><span class="mi">2</span><span class="err">/nanoLLMServe/models/Qwen</span><span class="mi">3-8</span><span class="err">B</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--prompt</span><span class="w"> </span><span class="s2">"Explain static batching in one sentence."</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--max-new-tokens</span><span class="w"> </span><span class="mi">64</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--runs</span><span class="w"> </span><span class="mi">5</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--warmup</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--batch-size</span><span class="w"> </span><span class="s2">"$BS"</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--device</span><span class="w"> </span><span class="err">cuda</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--dtype</span><span class="w"> </span><span class="err">bfloat</span><span class="mi">16</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--local-files-only</span><span class="w"> </span><span class="err">\</span><span class="w">
    </span><span class="err">--skip-naive-baseline</span><span class="w">
</span><span class="err">done</span><span class="w">
</span><span class="err">Loading</span><span class="w"> </span><span class="err">checkpoint</span><span class="w"> </span><span class="err">shards:</span><span class="w"> </span><span class="mi">100</span><span class="err">%|██████████████|</span><span class="w"> </span><span class="mi">5</span><span class="err">/</span><span class="mi">5</span><span class="w"> </span><span class="p">[</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="err">&lt;</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="p">,</span><span class="w"> </span><span class="mf">142.39</span><span class="err">it/s</span><span class="p">]</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
  </span><span class="nl">"device"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cuda"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"dtype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bfloat16"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"kv_cache_decode"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8507717087864877</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9322640344500541</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.030887942016124725</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">33.12221489162687</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.029377328710896627</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03308003842830658</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data2/nanoLLMServe/models/Qwen3-8B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
  </span><span class="nl">"runs"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
  </span><span class="nl">"warmup"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">Loading</span><span class="w"> </span><span class="err">checkpoint</span><span class="w"> </span><span class="err">shards:</span><span class="w"> </span><span class="mi">100</span><span class="err">%|██████████████|</span><span class="w"> </span><span class="mi">5</span><span class="err">/</span><span class="mi">5</span><span class="w"> </span><span class="p">[</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="err">&lt;</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="p">,</span><span class="w"> </span><span class="mf">139.83</span><span class="err">it/s</span><span class="p">]</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
  </span><span class="nl">"device"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cuda"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"dtype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bfloat16"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"kv_cache_decode"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8655244752764701</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9476028025150298</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03145704716444016</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">32.86332616672236</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.02961149960756302</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03360582888126373</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data2/nanoLLMServe/models/Qwen3-8B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
  </span><span class="nl">"runs"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
  </span><span class="nl">"static_batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_batch_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9898763984441756</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_batch_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">64.32565239699788</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9065125167369843</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03188993483781814</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03026210344026959</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.034097179770469666</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"warmup"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">Loading</span><span class="w"> </span><span class="err">checkpoint</span><span class="w"> </span><span class="err">shards:</span><span class="w"> </span><span class="mi">100</span><span class="err">%|██████████████|</span><span class="w"> </span><span class="mi">5</span><span class="err">/</span><span class="mi">5</span><span class="w"> </span><span class="p">[</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="err">&lt;</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="p">,</span><span class="w"> </span><span class="mf">141.49</span><span class="err">it/s</span><span class="p">]</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w">
  </span><span class="nl">"device"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cuda"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"dtype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bfloat16"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"kv_cache_decode"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8458494618535042</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9274740874767304</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.031040719151496886</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">33.204527220435416</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.029299197807198477</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.033210942149162294</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data2/nanoLLMServe/models/Qwen3-8B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
  </span><span class="nl">"runs"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
  </span><span class="nl">"static_batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_batch_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9469317257404328</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_batch_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">131.49626546592532</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8595415592193603</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.030785535275936127</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.029516532686021592</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.033043819665908816</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"warmup"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">Loading</span><span class="w"> </span><span class="err">checkpoint</span><span class="w"> </span><span class="err">shards:</span><span class="w"> </span><span class="mi">100</span><span class="err">%|██████████████|</span><span class="w"> </span><span class="mi">5</span><span class="err">/</span><span class="mi">5</span><span class="w"> </span><span class="p">[</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="err">&lt;</span><span class="mi">00</span><span class="err">:</span><span class="mi">00</span><span class="p">,</span><span class="w"> </span><span class="mf">141.55</span><span class="err">it/s</span><span class="p">]</span><span class="w">
</span><span class="p">{</span><span class="w">
  </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
  </span><span class="nl">"device"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cuda"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"dtype"</span><span class="p">:</span><span class="w"> </span><span class="s2">"bfloat16"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"kv_cache_decode"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8693151980638505</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.9508710712194444</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.031239084899425507</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">32.80601897231402</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.02967166981053731</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.0333852082490921</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data2/nanoLLMServe/models/Qwen3-8B"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
  </span><span class="nl">"runs"</span><span class="p">:</span><span class="w"> </span><span class="mi">5</span><span class="p">,</span><span class="w">
  </span><span class="nl">"static_batch"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"batch_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
    </span><span class="nl">"generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="p">,</span><span class="w">
      </span><span class="mi">64</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"mean_batch_elapsed_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.989836023747921</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_batch_tokens_per_second"</span><span class="p">:</span><span class="w"> </span><span class="mf">257.34328723459373</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_decode_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">1.8920533761382103</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_generated_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prefill_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.031170780956745147</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_prompt_tokens"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_tpot_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.030032593272035085</span><span class="p">,</span><span class="w">
    </span><span class="nl">"mean_ttft_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.03355635851621628</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"warmup"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="err">(base)</span><span class="w"> </span><span class="err">gpu-A</span><span class="mi">100-05</span><span class="w"> </span><span class="err">nanoLLMServe</span><span class="w"> </span><span class="err">#</span><span class="w">
</span></code></pre></div></div> <p>To interpret the results: the benchmark ran four times, with batch sizes of 1, 2, 4, and 8.</p> <table> <thead> <tr> <th><strong>Batch</strong></th> <th><strong>batch elapsed</strong></th> <th><strong>total tokens/s</strong></th> <th><strong>per-request equivalent tokens/s</strong></th> </tr> </thead> <tbody> <tr> <td>1</td> <td>~1.93s</td> <td>~33.1 tok/s</td> <td>~33.1 tok/s</td> </tr> <tr> <td>2</td> <td>~1.99s</td> <td>~64.3 tok/s</td> <td>~32.2 tok/s</td> </tr> <tr> <td>4</td> <td>~1.95s</td> <td>~131.5 tok/s</td> <td>~32.9 tok/s</td> </tr> <tr> <td>8</td> <td>~1.99s</td> <td>~257.3 tok/s</td> <td>~32.2 tok/s</td> </tr> </tbody> </table> <p>As the table shows, throughput goes up. A single request runs at about 33 tokens/s; with 8 requests in one batch, overall system throughput reaches about 257 tokens/s. Each request still gets roughly the same throughput, but running inference on them in parallel greatly increases the total throughput of the system. That is the gain batching delivers!</p> <h1 id="总结">Summary</h1> <p>This installment covered batching. The motivation is to raise overall system throughput and reduce API call latency from the infra side. Because requests are batched, GPU utilization improves as well.</p> <p>Static Batching does leave some problems behind, though. Because the batch is fixed, requests that have already finished keep getting carried through every forward pass, and the GPU memory holding their KV Cache cannot be released until every request in the batch has finished, which lowers memory utilization. The next chapter, Continuous Batching, addresses these problems.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="LLMInfra-101"/><summary type="html"><![CDATA[The next installment in the series; the earlier ones are:]]></summary></entry><entry><title type="html">System vs Goal</title><link href="https://ifuryst.github.io/blog/2026/system-vs-goal/" rel="alternate" type="text/html" title="System vs Goal"/><published>2026-05-21T00:00:00+00:00</published><updated>2026-05-21T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/system-vs-goal</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/system-vs-goal/"><![CDATA[<p>I have been reading Atomic Habits these past couple of days.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-21-system-vs-goal/1779340784_8-480.webp 480w,/assets/img/2026-05-21-system-vs-goal/1779340784_8-800.webp 800w,/assets/img/2026-05-21-system-vs-goal/1779340784_8-1400.webp 1400w," 
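sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-21-system-vs-goal/1779340784_8.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Its account of Goal versus System is very well put.</p> <p>It is like our years in exam-oriented education: what most people chase is the Goal. So once they pass the exam, move on to the next school, or graduate and enter society, many people throw everything away. Learning itself no longer matters, because learning was only ever a way to get grades, an admission ticket, a GPA, a degree.</p> <p>This is one reason I detest exam-oriented education. It distorts our perception of something beautiful: what should have been a good thing goes through this factory assembly line, and all that stays in the mind are the moments of endlessly tightening screws.</p> <p>But there are always people who can stand up to the entrenched system: some rebel, some see through it, some laugh it off. I think they have all probably built a System of their own. From then on, learning no longer serves that never-ending Goal; it serves their own System: building a personal framework, reaching self-consistency, and closing a positive feedback loop.</p> <p>After entering society, we start chasing career growth, power, and money. Perhaps this is yet another never-ending Goal; it often ends in a mess, with our own System in shambles. I think my father is a very typical example: he spent his whole life pursuing a Goal he could never reach, and neglected what mattered most.</p> <p>For some people, while they are building their own System, what others see as goals arrive naturally; the goal becomes a by-product. I think that is a far better attitude and way of living. To be critical for a moment, though: it is easier said than done, and in reality many actions get bent out of shape by external circumstances and by our own minds.</p> <p>I am working on building my System too: constantly learning, constantly taking in and putting out, constantly running RL on myself.</p>]]></content><author><name></name></author><category term="Opinion"/><category term="Opinion"/><summary type="html"><![CDATA[I have been reading Atomic Habits these past couple of days.]]></summary></entry><entry xml:lang="zh"><title type="html">LLM Infra 101 v0.2: KV Cache</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-2-kv-cache-decode/" rel="alternate" type="text/html" title="LLM Infra 101 v0.2: KV Cache"/><published>2026-05-19T00:00:00+00:00</published><updated>2026-05-19T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-v0-2-kv-cache-decode</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-2-kv-cache-decode/"><![CDATA[<p>The next installment in the series; the earlier ones are:</p> <ol> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-model-inference/">LLM Infra 101 v0.0: Model Inference</a></p> </li> <li> <p><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-v0-1-openai-compatible-api/">LLM Infra 101 v0.1: API Calls</a></p> </li> </ol> <p>The code for this installment is at <a 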
href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.2.0">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.2.0</a></p> <p>After the last installment, we can call the model through an API. This time we add KV Cache support. Back in the first installment we saw that every forward pass repeats computation:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt
-&gt; forward(prompt)
-&gt; sample token1
-&gt; forward(prompt + token1)
-&gt; sample token2
-&gt; forward(prompt + token1 + token2)
-&gt; ...
</code></pre></div></div> <p>Here the whole sequence fed in is recomputed from scratch at every step, and the expensive part of Transformer computation is Attention:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Q = xW_Q
K = xW_K
V = xW_V
Attention(Q, K, V)
</code></pre></div></div> <p>So without a KV Cache, the flow looks roughly like this:</p> <ol> <li> <p>forward(prompt)</p> </li> <li> <p>Compute Q/K/V for every token in the prompt</p> </li> <li> <p>Compute Attention within the prompt</p> </li> <li> <p>Sample token1</p> </li> <li> <p>forward(prompt+token1)</p> </li> <li> <p>Recompute Q/K/V for every token in prompt+token1</p> </li> <li> <p>Recompute Attention within prompt+token1</p> </li> <li> <p>Sample token2</p> </li> <li> <p>forward(prompt+token1+token2)</p> </li> <li> <p>Recompute Q/K/V for every token in prompt+token1+token2</p> </li> <li> <p>Recompute Attention within prompt+token1+token2</p> </li> <li> <p>Sample token3</p> </li> </ol> <p>With a KV Cache, the flow becomes:</p> <ol> <li> <p>forward(prompt)</p> </li> <li> <p>Compute Q/K/V for every token in the prompt</p> </li> <li> <p>Compute Attention within the prompt</p> </li> <li> <p>Save K/V to the KV Cache</p> </li> <li> <p>Sample token1</p> </li> <li> <p>forward(token1+past_kv (i.e. the prompt's))</p> </li> <li> <p>Compute Q/K/V for token1 only</p> </li> <li> <p>Read the prompt's K/V</p> </li> <li> <p>Compute Attention(Q_token1, K_prompt+token1, V_prompt+token1)</p> </li> <li> <p>Save token1's K/V</p> </li> <li> <p>Sample token2</p> </li> <li> <p>forward(token2+past_kv (i.e. that of prompt+token1))</p> </li> <li> <p>Compute Q/K/V for token2 only</p> </li> <li> <p>Read the K/V of prompt+token1</p> </li> <li> <p>Compute Attention(Q_token2, K_prompt+token1+token2, V_prompt+token1+token2)</p> </li> <li> <p>Save token2's K/V</p> </li> <li> <p>Sample token3</p> </li> </ol> <p>In essence, the KV Cache exists so that later computation can reuse earlier results. Let's look at one stage of an actual inference pass:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Token
 ↓
Attention (looks at the context)
 ↓
FFN (thinks on its own)
 ↓
next layer
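</code></pre></div></div> <p>As we can see, the K/V in Attention live in the Cache, while the FFN has no Cache at all. That is because Attention for the current token depends on the K/V already computed for earlier tokens, whereas the FFN computes (a non-linear transformation) on the current token alone.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>token3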
 ↓
Linear Up Projection (raise the dimension; a higher-dimensional space is more expressive)
 ↓
Activation (GELU / SwiGLU)
 ↓
Linear Down Projection (reduce the dimension)
 ↓
output
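</code></pre></div></div> <p>This process only involves computation on token3 itself. The output FFN(hidden3) is used once in the current layer and is never needed again by later tokens, so there is nothing to cache.</p> <p>Now that the principle is clear, let's look at the implementation.</p> <h1 id="实现">Implementation</h1> <p>The changed files are:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>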
├── benchmarks/
│   └── benchmark_generate.py      <span class="c"># add a KV cache vs v0.0 naive baseline comparison; report TTFT/TPOT</span>
├── src/
│   └── nanollmserve/
│       ├── cli/
│       │   └── generate.py        <span class="c"># show-stats now includes TTFT / TPOT</span>
│       └── engine/
│           ├── engine.py          <span class="c"># core change: prefill + decode + past_key_values reuse</span>
│           └── request.py         <span class="c"># new GenerationRequestState, holds per-request generation state</span>
└── tests/
    ├── test_engine.py             <span class="c"># verify the decode phase feeds a single token and reuses past_key_values</span>
    ├── test_request_state.py      <span class="c"># verify request state token counts and TPOT</span>
    └── test_benchmark_generate.py <span class="c"># verify benchmark summary fields and the speedup calculation</span>
</code></pre></div></div> <h2 id="prefill">Prefill</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/engine.py:160</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="nf">eval</span><span class="p">()</span>
<span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="nf">inference_mode</span><span class="p">():</span>
    <span class="n">prefill_start</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">,</span> <span class="n">use_cache</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">state</span><span class="p">.</span><span class="n">prefill_seconds</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span> <span class="o">-</span> <span class="n">prefill_start</span>
    <span class="n">state</span><span class="p">.</span><span class="n">past_key_values</span> <span class="o">=</span> <span class="nf">getattr</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="sh">"</span><span class="s">past_key_values</span><span class="sh">"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">past_key_values</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">RuntimeError</span><span class="p">(</span><span class="sh">"</span><span class="s">model did not return past_key_values; KV cache decode requires use_cache support</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div> <p>Here, model is the model object loaded via transformers.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loaded</span> <span class="o">=</span> <span class="nf">load_model_and_tokenizer</span><span class="p">(...)</span>
<span class="n">result</span> <span class="o">=</span> <span class="nf">generate_one</span><span class="p">(</span>
    <span class="n">loaded</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
    <span class="n">loaded</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">prompt</span><span class="p">,</span>
    <span class="bp">...</span>
<span class="p">)</span>
</code></pre></div></div> <p>Passing <code class="language-plaintext highlighter-rouge">use_cache=True</code> asks the model to return <code class="language-plaintext highlighter-rouge">past_key_values</code> after the forward pass; during decode we pass this KV Cache back in.</p> <p>What happens here is Prefill: put simply, the incoming prompt is processed in full once to build the KV Cache. After that we only need to compute Q/K/V for the new token and can reuse the earlier KV Cache for the Attention computation.</p> <h2 id="decode">Decode</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/engine.py:179</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">next_token</span> <span class="o">=</span> <span class="nf">_sample_from_outputs</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">)</span>
<span class="k">yield</span> <span class="nf">_record_step</span><span class="p">(</span>
    <span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">state</span><span class="p">,</span>
    <span class="n">next_token</span><span class="p">,</span>
    <span class="n">eos_token_ids</span><span class="o">=</span><span class="n">eos_token_ids</span><span class="p">,</span>
    <span class="n">start</span><span class="o">=</span><span class="n">start</span><span class="p">,</span>
    <span class="n">max_new_tokens</span><span class="o">=</span><span class="n">max_new_tokens</span><span class="p">,</span>
<span class="p">)</span>
<span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">finished</span><span class="p">:</span>
    <span class="k">return</span>

<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">max_new_tokens</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
    <span class="n">decode_start</span> <span class="o">=</span> <span class="nf">perf_counter</span><span class="p">()</span>
    <span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span>
        <span class="n">input_ids</span><span class="o">=</span><span class="n">next_token</span><span class="p">.</span><span class="nf">to</span><span class="p">(</span><span class="n">input_ids</span><span class="p">.</span><span class="n">device</span><span class="p">),</span>
        <span class="n">attention_mask</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">attention_mask</span><span class="p">,</span>
        <span class="n">past_key_values</span><span class="o">=</span><span class="n">state</span><span class="p">.</span><span class="n">past_key_values</span><span class="p">,</span>
        <span class="n">use_cache</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="n">state</span><span class="p">.</span><span class="n">past_key_values</span> <span class="o">=</span> <span class="nf">getattr</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="sh">"</span><span class="s">past_key_values</span><span class="sh">"</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">past_key_values</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">RuntimeError</span><span class="p">(</span><span class="sh">"</span><span class="s">model did not return past_key_values during decode</span><span class="sh">"</span><span class="p">)</span>

    <span class="n">next_token</span> <span class="o">=</span> <span class="nf">_sample_from_outputs</span><span class="p">(</span><span class="n">outputs</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">)</span>
    <span class="k">yield</span> <span class="nf">_record_step</span><span class="p">(</span>
        <span class="n">tokenizer</span><span class="p">,</span>
        <span class="n">state</span><span class="p">,</span>
        <span class="n">next_token</span><span class="p">,</span>
        <span class="n">eos_token_ids</span><span class="o">=</span><span class="n">eos_token_ids</span><span class="p">,</span>
        <span class="n">start</span><span class="o">=</span><span class="n">start</span><span class="p">,</span>
        <span class="n">max_new_tokens</span><span class="o">=</span><span class="n">max_new_tokens</span><span class="p">,</span>
        <span class="n">decode_start</span><span class="o">=</span><span class="n">decode_start</span><span class="p">,</span>
    <span class="p">)</span>
    <span class="k">if</span> <span class="n">state</span><span class="p">.</span><span class="n">finished</span><span class="p">:</span>
        <span class="k">break</span>
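</code></pre></div></div> <p>To make the prefill/decode split concrete, here is a toy sketch in plain Python: a single attention head with small fixed weight matrices, where prefill fills a cache with the prompt's K/V and each decode step only projects the newest token. Everything in it (<code class="language-plaintext highlighter-rouge">KVCache</code>, <code class="language-plaintext highlighter-rouge">W_Q</code>/<code class="language-plaintext highlighter-rouge">W_K</code>/<code class="language-plaintext highlighter-rouge">W_V</code>, the hand-picked embeddings) is made up for illustration and is not code from nanoLLMServe or transformers. It only checks that attending over the cache gives the same result as recomputing K/V for the whole sequence on every step.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Toy single-head attention with hand-picked 2x2 weights (illustrative only).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.5, 0.5]]

def matvec(w, x):
    # Multiply a small matrix by a vector.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def attend(q, keys, values):
    # Softmax-weighted sum of values, scored by scaled dot product with q.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

class KVCache:
    # Hypothetical stand-in for past_key_values: one K and one V per token.
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, x):
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))

def decode_with_cache(cache, x):
    # Project only the new token, then attend over everything cached so far.
    cache.append(x)
    return attend(matvec(W_Q, x), cache.keys, cache.values)

def decode_without_cache(xs):
    # Baseline: recompute K/V for the whole sequence on every step.
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # made-up token embeddings
new_tokens = [[0.5, -0.5], [2.0, 1.0]]          # stand-ins for sampled tokens

cache = KVCache()
for x in prompt:                                # prefill: build the cache once
    cache.append(x)

seq = list(prompt)
for x in new_tokens:                            # decode: one token per step
    seq.append(x)
    cached = decode_with_cache(cache, x)
    full = decode_without_cache(seq)
    assert all(math.isclose(a, b) for a, b in zip(cached, full))

print(len(cache.keys))                          # 5: three prompt tokens + two decoded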
</code></pre></div></div> <p>In the decode loop of the engine.py excerpt (the <code class="language-plaintext highlighter-rouge">for</code> loop that runs after the first sampled token), the model no longer receives an ever-growing input_ids. It receives only <code class="language-plaintext highlighter-rouge">next_token</code>, the token generated in the previous step, and <code class="language-plaintext highlighter-rouge">past_key_values=state.past_key_values,</code> carries the KV from all earlier steps.</p> <h1 id="推理">Inference</h1> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-19-llm-infra-101-v0-2-kv-cache-decode/1779196784_7-480.webp 480w,/assets/img/2026-05-19-llm-infra-101-v0-2-kv-cache-decode/1779196784_7-800.webp 800w,/assets/img/2026-05-19-llm-infra-101-v0-2-kv-cache-decode/1779196784_7-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-19-llm-infra-101-v0-2-kv-cache-decode/1779196784_7.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>This change is KV Cache reuse within a single request: prefill builds the cache and decode reuses it. It cannot hit the cache across requests, so there is no demo of that, but the bench does show a non-zero <code class="language-plaintext highlighter-rouge">kv_cache_decode.mean_prefill_seconds</code>:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="sh">"</span><span class="s">elapsed_speedup</span><span class="sh">"</span><span class="p">:</span> <span class="mf">1.066</span>
<span class="sh">"</span><span class="s">tpot_speedup</span><span class="sh">"</span><span class="p">:</span> <span class="mf">1.073</span>
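</code></pre></div></div> <p>Both total elapsed time and TPOT (Time per Output Token) are now faster, but because the input prompt is short the gap is not more pronounced.</p> <h1 id="总结">Summary</h1> <p>That is roughly what introducing the KV Cache changes. The code changes are small and fairly clean, because frameworks like transformers hide many of the implementation details for me.</p> <p>The KV Cache lives in GPU memory, and K/V have to be stored for <code class="language-plaintext highlighter-rouge">every layer</code> and <code class="language-plaintext highlighter-rouge">every token</code>, so its size is approximately:</p> <p><code class="language-plaintext highlighter-rouge">2*L*T*H*dtype</code></p> <table> <thead> <tr> <th><strong>Parameter</strong></th> <th><strong>Meaning</strong></th> </tr> </thead> <tbody> <tr> <td>L</td> <td>number of layers</td> </tr> <tr> <td>T</td> <td>sequence length</td> </tr> <tr> <td>H</td> <td>hidden size</td> </tr> <tr> <td>2</td> <td>K+V</td> </tr> <tr> <td>dtype</td> <td>bytes per element (2 for bf16)</td> </tr> </tbody> </table> <p>As a quick example, for Qwen3 32B with a 128k context (taking 128k as 128,000 tokens):</p> <p><code class="language-plaintext highlighter-rouge">2*64*128k*5120*2 bytes / 1024^3 ≈ 156.25GB</code></p> <p>In practice, though, Qwen3 uses GQA (Qwen3 32B has 64 attention heads, 8 KV heads, and head_dim 128), so each layer stores 8*128=1024 values per token for K and for V rather than 5120. Under the same convention that gives about 31.25GB (roughly 33.5GB in decimal units), which is exactly where techniques like GQA pay off.</p> <p>You can see that during inference a large share of GPU memory is taken up by the KV Cache. This is one of the important problems Infra has to solve. Many models now use techniques to shrink the KV Cache. To list a few, at the model level:</p> <ul> <li> <p>GQA (Grouped Query Attention): many Q heads share a small number of KV heads, which cuts the KV Cache substantially</p> </li> <li> <p>MQA (Multi-Query Attention): more aggressive than GQA, with all Q heads sharing a single set of KV, at a noticeable cost in quality</p> </li> <li> <p>MLA (Multi-head Latent Attention): a key direction for DeepSeek; instead of storing the full KV it stores a compressed latent (KV Compression) and reconstructs when needed</p> </li> <li> <p>Sliding Window Attention: attend only to a recent window, for example the last 4k tokens, rather than the full 1M context</p> </li> <li> <p>Sparse Attention: not every pair of tokens attends to each other (for example, attend only to nearby tokens, a few key tokens, and some summary tokens)</p> </li> </ul> <p>At the Inference Engine level:</p> <ul> <li> <p>PagedAttention: the headline feature of vllm, which pages the kv cache</p> </li> <li> <p>Prefix Cache: share the kv of prompts with the same prefix instead of repeating prefill</p> </li> <li> <p>KV Quantization: store KV as int8/int4 instead of bf16, at the cost of some precision loss from quantization</p> </li> <li> <p>Distributed KV Cache: spread the KV across multiple GPUs, sharded by head/layer/sequence</p> </li> <li> <p>PD disaggregation (Prefill-Decode Disaggregation): run Prefill and Decode on different machines, since the former is Compute-bound and the latter is Memory-bound, so each can be backed by different hardware</p> </li> </ul> <p>All of these address KV
Cache problems to some degree, just from different angles. We will run into some of them later in the series, and others that are worth writing about will get their own posts.</p>]]></content><author><name></name></author><category term="Blog"/><category term="Blog"/><category term="微信公众号"/><category term="Substack"/><summary type="html"><![CDATA[Episode 2 of the series. Earlier posts:]]></summary></entry><entry xml:lang="zh"><title type="html">LLM Infra 101 v0.1: API Calls</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-1-openai-compatible-api/" rel="alternate" type="text/html" title="LLM Infra 101 v0.1: API调用"/><published>2026-05-18T00:00:00+00:00</published><updated>2026-05-18T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-v0-1-openai-compatible-api</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-v0-1-openai-compatible-api/"><![CDATA[<p>Episode 2 of the series. Earlier posts:</p> <ol> <li><a href="https://www.ifuryst.com/blog/2026/llm-infra-101-model-inference/">LLM Infra 101 v0.0: Model Inference</a></li> </ol> <p>The code for this episode is at <a href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.1.0">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.1.0</a></p> <p>After the last episode we had something that can be invoked from the CLI. This episode adds a new feature: an OpenAI-compatible API, so that most existing sdks can plug in without changes. Roughly, it supports:</p> <ol> <li> <p>HTTP Server</p> </li> <li> <p>OpenAI-Compatible endpoint</p> </li> <li> <p>The stream parameter, so responses can be returned in full or streamed</p> </li> <li> <p>The chat endpoint, plus a text-only Responses endpoint</p> </li> </ol> <h1 id="实现">Implementation</h1> <p>The changes this time are small and mostly wrap an API around the engine. The affected files:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>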
└── src/
    └── nanollmserve/
        ├── api/
        │   ├── __init__.py        <span class="c"># API package entry point; placeholder exports</span>
        │   ├── openai_server.py   <span class="c"># OpenAI-compatible HTTP server: /v1/models, /v1/responses, /v1/chat/completions, /v1/completions</span>
        │   └── protocol.py        <span class="c"># OpenAI-compatible protocol models: request/response schemas, prompt conversion, usage/response construction</span>
        └── engine/
            └── engine.py          <span class="c"># adds streaming generation: GenerationStep, stream_generate_one; generate_one now reuses the streaming path</span>
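</code></pre></div></div> <p>The new endpoints are:</p> <table> <thead> <tr> <th><strong>Endpoint</strong></th> <th><strong>Purpose</strong></th> </tr> </thead> <tbody> <tr> <td>/v1/models</td> <td>Returns the list of models the server currently exposes</td> </tr> <tr> <td>/v1/responses</td> <td>OpenAI's newer recommended Responses API; supports text-only create/stream/retrieve</td> </tr> <tr> <td>/v1/chat/completions</td> <td>Compatible with the traditional chat messages format</td> </tr> <tr> <td>/v1/completions</td> <td>Compatible with the legacy prompt completion format</td> </tr> </tbody> </table> <p>The implementation focuses on the Chat and Responses APIs. One reason is that both are in mainstream use; the other is that they make a good contrast. For example, many AI agents have moved to the Responses API because it can take advantage of prefix caching, and on a cache hit both latency and cost improve.</p> <p>Overall this is just compatibility with the OpenAI API protocol. For example, Chat Completion:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">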
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"xxx"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"messages"</span><span class="p">:</span><span class="w"> </span><span class="p">[{</span><span class="w"> </span><span class="nl">"role"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user"</span><span class="p">,</span><span class="w"> </span><span class="nl">"content"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hello"</span><span class="w"> </span><span class="p">}]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>The Responses request is similar:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"xxx"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"input"</span><span class="p">:</span><span class="w"> </span><span class="s2">"hello"</span><span class="w">
</span><span class="p">}</span><span class="w">
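</span></code></pre></div></div> <p>Both request shapes carry the same user text in different wrappers. As a rough illustration of how a server might flatten them into one internal form, here is a small sketch. <code class="language-plaintext highlighter-rouge">to_messages</code> and <code class="language-plaintext highlighter-rouge">to_prompt</code> are hypothetical helpers, not the functions in <code class="language-plaintext highlighter-rouge">protocol.py</code>, and the sketch assumes a plain string <code class="language-plaintext highlighter-rouge">input</code> with an optional <code class="language-plaintext highlighter-rouge">instructions</code> field.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def to_messages(body):
    # Hypothetical helper: flatten either request shape into chat messages.
    if "messages" in body:                      # Chat Completions shape
        return [dict(m) for m in body["messages"]]
    messages = []
    if body.get("instructions"):                # Responses-style system text
        messages.append({"role": "system", "content": body["instructions"]})
    # Assumption: "input" is a plain string, as in the example request.
    messages.append({"role": "user", "content": body["input"]})
    return messages

def to_prompt(messages):
    # Naive plain-text rendering, for illustration only; a real server
    # would more likely apply the tokenizer's chat template here.
    return "\n".join(m["role"] + ": " + m["content"] for m in messages)

chat = {"model": "xxx", "messages": [{"role": "user", "content": "hello"}]}
resp = {"model": "xxx", "input": "hello"}

# Both wrappers normalize to the same message list.
assert to_messages(chat) == to_messages(resp)
print(to_prompt(to_messages(resp)))             # prints: user: hello<span class="w">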
</span></code></pre></div></div> <p>After that it is just data conversion into a format the Engine understands. There is not much else worth expanding on.</p> <h1 id="推理">Inference</h1> <p>Let's walk through the whole inference flow, starting with launching the HTTP server:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>0 <span class="nv">PYTHONPATH</span><span class="o">=</span>src /data/anaconda3/bin/python <span class="nt">-m</span> nanollmserve.api.openai_server <span class="se">\</span>
  <span class="nt">--model</span> /data2/nanoLLMServe/models/Qwen3-8B <span class="se">\</span>
  <span class="nt">--served-model-name</span> Qwen3-8B <span class="se">\</span>
  <span class="nt">--local-files-only</span> <span class="se">\</span>
  <span class="nt">--host</span> 127.0.0.1 <span class="se">\</span>
  <span class="nt">--port</span> 18080 <span class="se">\</span>
  <span class="nt">--device</span> cuda <span class="se">\</span>
  <span class="nt">--dtype</span> bfloat16
</code></pre></div></div> <p>Once it is running, list the models:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-s</span> http://127.0.0.1:18080/v1/models | jq <span class="nb">.</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_1-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_1-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_1-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Call the Responses API:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-sS</span> http://127.0.0.1:18080/v1/responses <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "model": "Qwen3-8B",
    "instructions": "Answer briefly.",
    "input": "Explain KV cache in one sentence.",
    "max_output_tokens": 100,
    "temperature": 0,
    "store": true
  }'</span> | jq <span class="nb">.</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_2-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_2-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115783_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Responses API with streaming enabled:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-sS</span> http://127.0.0.1:18080/v1/responses <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "model": "Qwen3-8B",
    "instructions": "Answer briefly.",
    "input": "Explain KV cache in one sentence.",
    "max_output_tokens": 100,
    "temperature": 0,
    "stream": true,
    "store": true
  }'</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_3-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_3-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_3-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Retrieve an earlier response by its resp id:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-sS</span> http://127.0.0.1:18080/v1/responses/resp-1770dd64b5d44d1bbd93fc7dc5857bda | jq <span class="nb">.</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_4-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_4-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_4-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Call the Chat Completion API:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-sS</span> http://127.0.0.1:18080/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "model": "Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Explain KV cache in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0
  }'</span> | jq <span class="nb">.</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_5-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_5-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_5-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115784_5.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Chat Completion API with streaming enabled:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-sS</span> http://127.0.0.1:18080/v1/chat/completions <span class="se">\</span>
  <span class="nt">-H</span> <span class="s1">'Content-Type: application/json'</span> <span class="se">\</span>
  <span class="nt">-d</span> <span class="s1">'{
    "model": "Qwen3-8B",
    "messages": [
      {"role": "user", "content": "Explain KV cache in one sentence."}
    ],
    "max_tokens": 100,
    "temperature": 0,
    "stream": true
  }'</span>
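</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115785_6-480.webp 480w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115785_6-800.webp 800w,/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115785_6-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-18-llm-infra-101-v0-1-openai-compatible-api/1779115785_6.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <h1 id="总结">Summary</h1> <p>This one was simple: wrap an API, and there is not much to explain. Overall, this kind of Infra is just an HTTP Server from the outside. It hides the model scheduling and inference details underneath, so callers do not need to know about them. The gradual move from the original stateless Chat Completion API to today's Responses API, which carries some state, also reflects how the industry's approach to model inference is changing.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="LLMInfra-101"/><summary type="html"><![CDATA[Episode 2 of the series. Earlier posts:]]></summary></entry><entry xml:lang="zh"><title type="html">Daily Harness</title><link href="https://ifuryst.github.io/blog/2026/daily-harness/" rel="alternate" type="text/html" title="日常Harness"/><published>2026-05-17T00:00:00+00:00</published><updated>2026-05-17T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/daily-harness</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/daily-harness/"><![CDATA[<p>It is time to write something about Harness. Until now I have posted about it in bits and pieces in various places, all fragmented. The feedback after going Public has also been positive, so I think this is valuable to a lot of people. I am giving up two hours of my time as a gift to whoever it finds.</p> <h1 id="解放思想">Free Your Thinking</h1> <p>My stuff is not new, and it may not even fit your workflow or product, but I am sure my ideas will give you something to take away, which is what I want this post to convey.</p> <p>I believe many people still do not accept that LLM capability already far exceeds what they use it for. One reason may be limited awareness, another not trying enough. The latter is driven by the former, so going to the root (the fancy term: first principles), awareness has to be solved first.</p> <p>Lately I keep telling the people around me to think boldly, to dare to challenge the boundary of what AI can do, and to try hard to touch the ceiling. After reading this post you may understand that sentence better.</p> <h1 id="harness-template">Harness Template</h1> 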
<p>Let's start from this open-source Repo. There are three related repos:</p> <ul> <li><a href="https://github.com/iFurySt/harness-template">https://github.com/iFurySt/harness-template</a></li> <li><a href="https://github.com/iFurySt/harness-template-cn">https://github.com/iFurySt/harness-template-cn</a></li> <li><a href="https://github.com/iFurySt/harness-cli">https://github.com/iFurySt/harness-cli</a></li> </ul> <p>The first two are the same apart from language (English and Chinese). I mostly use the Chinese one and use the English one for a few open-source projects. The third is a cli for quickly creating a new project; under the hood it uses the first two. Usage:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989811_1-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989811_1-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989811_1-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989811_1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>You can create a repository directly from the Template on GitHub. For the way I spin up new projects this is a needless extra step, so I usually just run the cli:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989811_2-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989811_2-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989811_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989811_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Then you can happily start dancing with AI.</p> <p>Before going deeper, a word about Harness itself. AI Agent development has gone through a few important stages:</p> <ol> <li>Prompt 
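Engineering</li> <li>Context Engineering</li> <li>Harness</li> </ol> <p>Thinking back, last year I was still writing the book CE101. It is painful to look back; frankly, when I read that material now I feel it should go straight into the trash. But I still think the time I spent writing it was worth it, even though I now have to eat my words.</p> <p>Over the past half year or more, Harness has become popular everywhere. I do not know where the term originated and I am not interested, but I think it is a brilliantly chosen word.</p> <p>We know models are non-deterministic. We keep writing Prompts and Skills to constrain the model's behavior and steer it in the direction we want. It is like riding a horse: the saddle, bridle, reins, and stirrups all exist to make the horse walk, run, or stop the way we want. But a horse is an animal that does not understand human speech and can only be constrained through training and these external tools. LLMs are the same: a model cannot produce deterministic results, and we cannot control it, only constrain it. That is why we say Harness rather than Control or something similar.</p> <p>Depending on the context, I think Harness can be split into two kinds:</p> <ol> <li>Harness for products/services: supporting tools, Memory mechanisms, Sandboxes and so on, which serve the model and let it perform better</li> <li>Harness for production processes such as R&amp;D and creation: supporting scaffolding, environment, context and so on, for better creative work</li> </ol> <p>For type-1 Harness there is not much to say. Clone the source of codex, claude code, openclaw, or hermes, analyze it with ai, and everything becomes clear. There is not much magic. People who claim their harness is amazing fall into one of these cases:</p> <ol> <li>Limited ability and perspective</li> <li>Bragging frauds</li> <li>An overly complex system (after all sorts of effort they build a product slightly above the SOTA baseline and are very pleased)</li> </ol> <p>These harness techniques have existed for a long time, so why did it not work like this two years ago? Look back at LLM capability two years ago. So at this stage, returning to "the model is the product" is the more objective and correct view. I will not expand on type-1 Harness; every team knows how good or bad its own product is and should know how to integrate.</p> <p>Now back to type-2 Harness, which is the main battlefield of my Harness Template and where I believe it brings real gains to individuals, teams, and organizations. Before expanding, here are a few earlier posts that are worth reading if you have time:</p> <ul> <li><a href="https://www.ifuryst.com/blog/2026/speedrunning-the-ai-era/"><strong>How We Speed Through the AI Era</strong></a></li> <li><a href="https://www.ifuryst.com/blog/2026/the-urge-to-solve/"><strong>The Primal Urge to Solve Problems</strong></a></li> <li><a href="https://www.ifuryst.com/blog/2026/open-browser-use/"><strong>Browser Use Explained</strong></a></li> </ul> <p>They all cover, to varying degrees, ideas and concrete practice from my exploration of Harness.</p> <p>The core of this methodology is that everything starts with AI and ends with AI: produced by AI, consumed by AI, and no information leaves the Repo. That is a bit abstract, so let's go step by step. First, a diagram:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989811_3-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989811_3-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989811_3-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989811_3.png" 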
class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>I wrote this one myself; I think these points express my thinking fairly well.</p> <h2 id="agentsmd">AGENTS.md</h2> <p>Next, the project layout:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989813_4-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989813_4-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989813_4-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989813_4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Very simple. The entry points are AGENTS.md and CLAUDE.md, which covers the mainstream Agents. AGENTS.md usually acts as a table of contents (TOC): the details are split out into docs so they can be loaded on demand, which reduces context usage.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989813_5-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989813_5-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989813_5-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989813_5.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>We usually split things by key checkpoints or domains, for example:</p> <ul> <li>Read before starting: usually collaboration conventions, a brief overview of the Repo, and some guiding material</li> <li>Read when work is done: usually wrap-up steps such as writing history records, test coverage, and acceptance</li> <li>Read before committing: usually things like running the full local test suite or branch operations</li> <li>Domain-specific reading: most commonly frontend versus backend; finer-grained setups read selectively by module or by DDD-style boundaries</li> </ul> 
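<p>All of this is basically one large natural-language Harness in action. In today's popular terms, it is a lot of non-standard Skills that get loaded on demand.</p> <h2 id="docs">docs</h2> <p>Next is docs, which is the critical part. In current terms, you can think of it as Karpathy's LLM Wiki idea; ideas simply spread further when they come from a big name (a top AI influencer).</p> <p>There is nothing special about the directory layout. I made it up off the top of my head, and you can adjust it to your own project's situation and needs. The rough ideas are:</p> <ul> <li>Standalone files split out of AGENTS.md can live here, either as single md files or as directories with finer-grained content inside</li> <li>histories: this one I came up with myself and consider a stroke of genius. With it, every Query and change in the repo since Day1 is on record. The benefit is that nobody has to write documentation anymore, since a newcomer onboarding with AI can learn everything that happened before. When a feature breaks, you can quickly trace back what changed, what caused it, and where it regressed. It also lets slower-moving people learn how the experts Vibe. In a sense it serves as the repo's memory</li> <li>milestone/feature planning: TODOs are generally tracked in the file system. Even today, with /goal (Ralph Loop), this remains very useful. Tracking TODOs in the context is asking for trouble, even more so for a large feat. Tracking in files supports long-running tasks and lets you change milestones and goals on the fly by editing the files while work is in progress (more on this later)</li> <li>Everything else: product definitions, design guidelines, reference files, release docs, and so on</li> </ul> <h2 id="其他">Other</h2> <p>Not much else; a few important ones:</p> <ul> <li>scripts: usually reusable scripts. AI can accumulate these as well, so they can keep being reused</li> <li>skills: I did not include these in the template, but in practice there will be many skills, such as operating a browser, logging in to a bastion host or dev machine, deployment guides, and so on</li> <li>Sensitive files: I usually keep sensitive files in a directory such as .harness or .agents, which is added to gitignore. Files in it can also be encrypted using environment variables so that keys are never written to disk in plaintext. When it needs the content, AI can run a script that reads the secret from env and decrypts on the fly.</li> </ul> <h1 id="用法">Usage</h1> <p>That is all of it. It may look unremarkable, but as with the quote in my earlier screenshot, <code class="language-plaintext highlighter-rouge">**Less is more**</code>, the magic shows up once you start using it. Let me explain with real examples.</p> <h2 id="open-computer-use">Open Computer Use</h2> <p>Start with <a href="https://github.com/iFurySt/open-codex-computer-use">Open Computer Use</a>. Before starting this project I first needed to analyze the existing mechanism. Analyzing with bare codex alone means much of the content never gets persisted, and many details of the analysis are lost across repeated context compacts. So I used the harness template mechanism to keep persisting what I learned into docs during the analysis.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989814_6-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989814_6-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989814_6-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989814_6.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" 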
data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Moreover, reverse engineering is an ongoing process that combines many different tools (<del>can't afford IDA Pro</del>), and newer information sometimes overturns earlier wrong conclusions, so a persistent system that <strong>compounds</strong> becomes essential.</p> <p>With this information in place, the later implementation can keep consulting it, and even after the official version is upgraded, new content can be added incrementally, which makes it sustainable.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989814_7-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989814_7-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989814_7-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989814_7.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <h2 id="aifi">AIFi</h2> <p>Next is the <a href="https://github.com/iFurySt/aifi">AIFi</a> project, a financial analysis project. The name is inspired by DeFi: as DeFi is to decentralized finance, AIFi is to AI finance. The project requires zero code; it is a product made of a large number of Skills, where the directory is the product.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989814_8-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989814_8-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989814_8-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989814_8.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> 
<p>Open source in the new era means open-sourcing Skills. Here, experts from traditional investing and personal finance are modeled as individual Skills so that AI can play different people for different jobs as needed. Here is the classic part: this time I do not use docs. I have AI persist the result of every research and analysis run under research in the root directory, so that research <strong>compounds</strong>.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989815_9-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989815_9-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989815_9-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989815_9.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>The architecture here is no longer software architecture but the architecture of the whole system.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989815_10-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989815_10-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989815_10-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989815_10.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Some analysis results:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989815_11-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989815_11-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989815_11-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img 
src="/assets/img/2026-05-17-daily-harness/1778989815_11.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Overall this is much like the deep research products of the past, but do we still need to build a dedicated deep research tool? Open codex/claude code, put a harness like this on it, and it can accumulate solid material in any domain.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989816_12-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989816_12-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989816_12-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989816_12.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Add a <a href="https://github.com/iFurySt/visual-html-gen-ui">Gen UI Skill</a> and it can produce visual output too.</p> <h1 id="解放认知">Freeing Your Thinking</h1> <p>Everything above is meant to illustrate a larger point through small examples and open up how you think. Whenever you hit a problem, try challenging the AI with it, keep pushing its boundaries, and find out where its ceiling is, because that will determine where your own ceiling is in this era.</p> <p>Another simple example: I have recently been working on a project called <a href="https://github.com/iFurySt/nanoLLMServe">Nano LLM Serve</a>. The goal is to build from scratch and use hands-on practice to deepen my understanding of models and Infra. Along the way I have been co-learning and co-working with ChatGPT/Codex. I am an unknown with nothing but a bachelor's degree, someone to whom formulas read like gibberish, and yet I can now discuss Speculative Decoding, Steering Vectors, and KV Cache Network with others. I can also explore the latest Interaction Model and Diffusion Language Model on my own. Start from applications, work back to theory, and together with AI keep breaking the boundaries your old thinking set for you. That is not just the survival rule of this era; it is one that every era calls for.</p> <p>"Survival rule" sounds a bit heavy, so maybe Just For Fun is a better angle. Just yesterday I sent a friend this message:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989816_13-480.webp 
480w,/assets/img/2026-05-17-daily-harness/1778989816_13-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989816_13-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989816_13.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-daily-harness/1778989817_14-480.webp 480w,/assets/img/2026-05-17-daily-harness/1778989817_14-800.webp 800w,/assets/img/2026-05-17-daily-harness/1778989817_14-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-daily-harness/1778989817_14.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Perhaps for people who love to tinker, the capability that AI amplifies is unlimited; the only limits are the fragility of a carbon-based body and the 24 hours in a day.</p> <h1 id="尾声">Epilogue</h1> <p>I believe that if you put this harness template to use, you will gradually start to see things you could not see before. I don't think it will necessarily suit you as is, but you can keep adjusting it into a harness that fits you.</p> <p>What has struck me most lately is still learning to dance with AI. The generation growing up in the AI era will naturally treat AI as a tool, just as we once did with PCs, the internet, and phones. It is only that, having grown used to things and gained experience with age, we slowly lose the ability to keep trying, keep making mistakes, and keep failing until we finally learn something. The times have always been changing; people just drift from change toward stability, and one short life cannot embrace that much change.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="thoughts"/><summary type="html"><![CDATA[It's time to write something about Harness. Until now I had only posted about it in scattered fragments here and there. The feedback since making it public has all been positive, so I think this is valuable to a lot of people. I'm giving up 2 hours of my time for whoever finds it useful.]]></summary></entry><entry xml:lang="zh"><title type="html">LLM Infra 101 v0.0: Model Inference</title><link href="https://ifuryst.github.io/blog/2026/llm-infra-101-model-inference/" rel="alternate" type="text/html" title="LLM Infra 101 v0.0: 
Model Inference"/><published>2026-05-17T00:00:00+00:00</published><updated>2026-05-17T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/llm-infra-101-model-inference</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/llm-infra-101-model-inference/"><![CDATA[<p>This is the first episode of the series, and its goal is simple: get a large model running.</p> <p>The code for this episode is at <a href="https://github.com/iFurySt/nanoLLMServe/tree/release/v0.0.0">https://github.com/iFurySt/nanoLLMServe/tree/release/v0.0.0</a></p> <p>Training a model produces a weights file, and that file is usually what gets open-sourced (a more complete open release also details the whole training process in a technical report or paper, so that anyone who reads it could reproduce it). The first job of Infra is to get that weights file running!</p> <p>Based on that, we plan a few simple steps:</p> <ol> <li> <p>Download the model weights from Hugging Face (hf from here on)</p> </li> <li> <p>Load the weights into GPU memory in code and run inference</p> </li> <li> <p>Take a prompt through a single, non-interactive CLI call and return the result</p> </li> </ol> <p>A very simple implementation.</p> <h1 id="模型选择">Choosing a Model</h1> <p>Infra generally has to support many models and be tested for inference on many kinds of GPUs, but to start we simply use what we have on hand. We focus on single-GPU inference first, so the parameter count can't be too large and we keep it under 10B. Since Qwen ships models at essentially every size and is, in practice, the default choice right now, we pick:</p> <ol> <li> <p><a href="https://huggingface.co/Qwen/Qwen3-0.6B">Qwen/Qwen3-0.6B</a></p> </li> <li> <p><a href="https://huggingface.co/Qwen/Qwen3-1.7B">Qwen/Qwen3-1.7B</a></p> </li> <li> <p><a href="https://huggingface.co/Qwen/Qwen3-4B">Qwen/Qwen3-4B</a></p> </li> <li> <p><a href="https://huggingface.co/Qwen/Qwen3-8B">Qwen/Qwen3-8B</a></p> </li> </ol> <p>These are the weights we will run inference on. Opening the Files tab of a model on hf shows these files:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-17-llm-infra-101-model-inference/1779013265_2-480.webp 480w,/assets/img/2026-05-17-llm-infra-101-model-inference/1779013265_2-800.webp 800w,/assets/img/2026-05-17-llm-infra-101-model-inference/1779013265_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-17-llm-infra-101-model-inference/1779013265_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; 
$('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <table> <thead> <tr> <th>File</th> <th>Purpose</th> </tr> </thead> <tbody> <tr> <td>.gitattributes</td> <td>git/hf file-management configuration</td> </tr> <tr> <td>LICENSE</td> <td>Model license</td> </tr> <tr> <td>README.md</td> <td>Model card: model overview, usage, limitations, sample code, etc.</td> </tr> <tr> <td>config.json</td> <td>Model architecture config, such as number of layers, hidden size, attention heads, vocabulary size, RoPE parameters, and dtype. Transformers reads this file first when loading a model</td> </tr> <tr> <td>generation_config.json</td> <td>Default generation parameters, such as temperature: 0.6, top_p: 0.95, top_k: 20, do_sample: true, and the EOS/PAD tokens</td> </tr> <tr> <td>tokenizer.json</td> <td>The tokenizer file: tokenization model, rules, special tokens, etc.</td> </tr> <tr> <td>tokenizer_config.json</td> <td>Extra tokenizer config, mainly the chat template, special tokens, and maximum length. Qwen's chat format mostly lives here</td> </tr> <tr> <td>vocab.json</td> <td>The BPE tokenizer vocabulary: the token-to-id mapping.</td> </tr> <tr> <td>merges.txt</td> <td>BPE merge rules, which decide how characters/subwords are merged step by step into tokens.</td> </tr> <tr> <td>model.safetensors</td> <td>Weights file of the 0.6B model, in safetensors format</td> </tr> <tr> <td>model-00001-of-00005.safetensors, etc.</td> <td>Weight shards of the 8B model; a file that is too large is split into multiple shards</td> </tr> <tr> <td>model.safetensors.index.json</td> <td>Only needed for sharded models. It records which .safetensors shard holds each weight tensor, and the loader uses it to reassemble the full model</td> </tr> </tbody> </table> <h1 id="tokenizer">Tokenizer</h1> <p>The tokenizer turns the natural-language text we type into tokens, for example:</p> <ol> <li> <p>Input: Hello world</p> </li> <li> <p>Split into [“Hello”, “ world”]</p> </li> <li> <p>Convert to token ids, e.g. [15496, 995]</p> </li> </ol> <p>The tokenizer is mainly defined in tokenizer.json; if that file is missing, the tokenizer is rebuilt from vocab.json and merges.txt.</p> <h1 id="实现">Implementation</h1> <p>With the rough principles in place, let's build a first version. Here is the file layout:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>
├── benchmarks/
│   └── benchmark_generate.py    <span class="c"># v0 naive single-request generation benchmark; reports latency and tokens/s</span>
├── src/
│   └── nanollmserve/
│       ├── __init__.py          <span class="c"># package entry point; exposes version/basic package info</span>
│       ├── api/                 <span class="c"># public API layer; will host the OpenAI-compatible HTTP interface later</span>
│       │   ├── __init__.py
│       │   ├── openai_server.py <span class="c"># v0.1 OpenAI-compatible HTTP server placeholder</span>
│       │   └── protocol.py      <span class="c"># OpenAI-compatible request/response protocol models placeholder</span>
│       ├── cache/               <span class="c"># data-structure boundary for KV cache and prefix cache</span>
│       │   ├── __init__.py
│       │   ├── block_manager.py <span class="c"># block-based KV cache allocator placeholder</span>
│       │   ├── kv_cache.py      <span class="c"># KV cache tensor/metadata management placeholder</span>
│       │   ├── prefix_cache.py  <span class="c"># prefix cache lookup and eviction policy placeholder</span>
│       │   └── radix_tree.py    <span class="c"># prefix cache radix tree index placeholder</span>
│       ├── cli/                 <span class="c"># command-line entry layer, kept as a thin wrapper</span>
│       │   ├── __init__.py
│       │   └── generate.py      <span class="c"># `nanollmserve-generate`-style single-prompt generation CLI</span>
│       ├── distributed/         <span class="c"># multi-process/multi-node coordination boundary</span>
│       │   ├── __init__.py
│       │   ├── router.py        <span class="c"># cross-worker request routing placeholder</span>
│       │   └── worker.py        <span class="c"># distributed worker process glue code placeholder</span>
│       ├── engine/              <span class="c"># core of request lifecycle and decode orchestration</span>
│       │   ├── __init__.py
│       │   ├── engine.py        <span class="c"># current core implementation: naive single-request decode loop</span>
│       │   ├── request.py       <span class="c"># request state/lifecycle contract placeholder</span>
│       │   └── scheduler.py     <span class="c"># batching/scheduling policy placeholder</span>
│       ├── metrics/             <span class="c"># runtime stats and metrics export boundary</span>
│       │   ├── __init__.py
│       │   ├── prometheus.py    <span class="c"># Prometheus exporter placeholder</span>
│       │   └── stats.py         <span class="c"># engine/scheduler/cache stats data structures placeholder</span>
│       ├── model/               <span class="c"># model loading and model execution boundary</span>
│       │   ├── __init__.py
│       │   └── hf_runner.py     <span class="c"># Hugging Face causal LM/tokenizer loading, device/dtype resolution</span>
│       ├── sampling/            <span class="c"># logits processing and token selection</span>
│       │   ├── __init__.py
│       │   ├── params.py        <span class="c"># sampling parameter contract placeholder</span>
│       │   └── sampler.py       <span class="c"># greedy and temperature sampling implementation</span>
│       ├── structured_output/   <span class="c"># schema/grammar constrained decoding boundary</span>
│       │   └── __init__.py
│       └── worker/              <span class="c"># local execution worker boundary</span>
│           ├── __init__.py
│           └── gpu_worker.py    <span class="c"># single-GPU worker execution placeholder</span>
└── tests/
    ├── test_cli.py              <span class="c"># tests for CLI argument parsing, main output, and stats behavior</span>
    ├── test_engine.py           <span class="c"># tests for generate_one decode, EOS, attention mask, argument validation</span>
    ├── test_hf_runner.py        <span class="c"># tests for device/dtype resolution and HF loading compatibility</span>
    └── test_sampling.py         <span class="c"># tests for greedy/temperature sampling and invalid input</span>
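</code></pre></div></div> <p>Because we have a long-term roadmap and want to iterate cleanly later, we added some placeholder files and directories. With those removed, what remains are the files this release actually touches:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">.</span>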
├── benchmarks/
│   └── benchmark_generate.py <span class="c"># single-request naive generation benchmark script; measures throughput, latency, tokens/s</span>
├── pyproject.toml            <span class="c"># package config, dependencies, test config, and CLI entry point</span>
├── README.md                 <span class="c"># current usage, v0 capability notes, and run examples</span>
├── src/
│   └── nanollmserve/
│       ├── __init__.py       <span class="c"># package version/top-level package info</span>
│       ├── cli/
│       │   ├── __init__.py
│       │   └── generate.py   <span class="c"># CLI generation entry: parse args, load model, call engine, print result/stats</span>
│       ├── engine/
│       │   ├── __init__.py   <span class="c"># engine public exports</span>
│       │   └── engine.py     <span class="c"># core naive decode loop: single prompt, autoregressive generation, EOS stop, timing stats</span>
│       ├── model/
│       │   ├── __init__.py   <span class="c"># model public exports</span>
│       │   └── hf_runner.py  <span class="c"># Hugging Face tokenizer/model loading, device/dtype resolution and compatibility handling</span>
│       └── sampling/
│           ├── __init__.py   <span class="c"># sampling public exports</span>
│           └── sampler.py    <span class="c"># token selection logic: greedy decoding and temperature sampling</span>
└── tests/
    ├── test_cli.py           <span class="c"># tests for CLI args, main call chain, stdout/stderr stats</span>
    ├── test_engine.py        <span class="c"># tests for the generation loop, EOS, max tokens, attention mask, input validation</span>
    ├── test_hf_runner.py     <span class="c"># tests for device/dtype resolution, HF loading compatibility fallback, optional-dependency isolation</span>
    └── test_sampling.py      <span class="c"># tests for greedy/temperature sampling and invalid logits/temperature</span>
</code></pre></div></div> <p>It's a very simple implementation that covers the minimal runnable path: CLI → model loading → Engine decode loop → sampling. Next, let's look at what the actual inference process looks like.</p> <h1 id="推理">Inference</h1> <p>This time only two things really need our attention:</p> <ol> <li> <p>The model</p> </li> <li> <p>The tokenizer</p> </li> </ol> <p>Let's walk through a single inference to see what happens end to end. The following command triggers one run:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">(</span>base<span class="o">)</span> gpu-A100-05 nanoLLMServe <span class="c"># export MODEL=/data2/nanoLLMServe/models/Qwen3-8B</span>
<span class="o">(</span>base<span class="o">)</span> gpu-A100-05 nanoLLMServe <span class="c"># CUDA_VISIBLE_DEVICES=0 PYTHONPATH=src /data/anaconda3/bin/python -m nanollmserve.cli.generate \</span>
  <span class="nt">--model</span> <span class="s2">"</span><span class="nv">$MODEL</span><span class="s2">"</span> <span class="se">\</span>
  <span class="nt">--local-files-only</span> <span class="se">\</span>
  <span class="nt">--prompt</span> <span class="s2">"Explain KV cache in one sentence."</span> <span class="se">\</span>
  <span class="nt">--max-new-tokens</span> 100 <span class="se">\</span>
  <span class="nt">--temperature</span> 0 <span class="se">\</span>
  <span class="nt">--device</span> cuda <span class="se">\</span>
  <span class="nt">--dtype</span> bfloat16 <span class="se">\</span>
  <span class="nt">--show-stats</span>
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████| 5/5 <span class="o">[</span>00:00&lt;00:00, 136.32it/s]
 KV cache is a technique used <span class="k">in </span>transformer models to store the keys and values of previous attention computations, allowing the model to efficiently process sequential data by reusing these cached values instead of recalculating them <span class="k">for </span>each new input token.
KV cache is a technique used <span class="k">in </span>transformer models to store the keys and values of previous attention computations, allowing the model to efficiently process sequential data by reusing these cached values instead of recalculating them <span class="k">for </span>each new input token.
Okay, I need to explain what a
<span class="nv">prompt_tokens</span><span class="o">=</span>8 <span class="nv">generated_tokens</span><span class="o">=</span>100 <span class="nv">elapsed_seconds</span><span class="o">=</span>3.932 <span class="nv">tokens_per_second</span><span class="o">=</span>25.43 <span class="nv">device</span><span class="o">=</span>cuda <span class="nv">dtype</span><span class="o">=</span>bfloat16
</code></pre></div></div> <p>A quick explanation of what each of these parameters means:</p> <table> <thead> <tr> <th>Parameter</th> <th>Meaning</th> </tr> </thead> <tbody> <tr> <td>CUDA_VISIBLE_DEVICES=0</td> <td>Expose only GPU 0 to the program; we are focusing on a single GPU for now</td> </tr> <tr> <td>PYTHONPATH=src</td> <td>Add the repo's src/ to the Python import path so the source can be run directly</td> </tr> <tr> <td>/data/anaconda3/bin/python</td> <td>Use conda's python</td> </tr> <tr> <td>-m nanollmserve.cli.generate</td> <td>Run the command-line entry point as a module</td> </tr> <tr> <td>--model "$MODEL"</td> <td>Model path or hf model name. The model files were downloaded ahead of time for offline use, which is why the path was set earlier</td> </tr> <tr> <td>--local-files-only</td> <td>Use only model files already on disk; no network download</td> </tr> <tr> <td>--prompt "xxxx"</td> <td>The prompt fed to the model. It is bare for now; in practice it would usually combine a system prompt and a user prompt</td> </tr> <tr> <td>--max-new-tokens 100</td> <td>Generate at most 100 new tokens. At this point no end token such as EOS stops the run, so generation continues until 100 tokens are produced; that is why the output keeps repeating until it reaches 100 tokens even after it should have ended</td> </tr> <tr> <td>--temperature 0</td> <td>Temperature 0, i.e. greedy decoding: pick the highest-probability token at every step (this also contributes to the repeated output)</td> </tr> <tr> <td>--device cuda</td> <td>Run the model on a CUDA GPU</td> </tr> <tr> <td>--dtype bfloat16</td> <td>Use bfloat16 precision (dtype = data type, the numeric format used to store and compute with the data)</td> </tr> <tr> <td>--show-stats</td> <td>Print generation stats such as token counts, elapsed time, tokens/s, device, and dtype</td> </tr> </tbody> </table> <p>The output is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> KV cache is a technique used in transformer models to store the keys and values of previous attention computations, allowing the model to efficiently process sequential data by reusing these cached values instead of recalculating them for each new input token.
KV cache is a technique used in transformer models to store the keys and values of previous attention computations, allowing the model to efficiently process sequential data by reusing these cached values instead of recalculating them for each new input token.
Okay, I need to explain what a
</code></pre></div></div> <p>That is the inference process. This is the rawest, barest model output, and it feels very different from what we are used to, because there is no instruction and no system prompt yet to harness the model's output. We will add those step by step later.</p> <p>As for the last line:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt_tokens=8 generated_tokens=100 elapsed_seconds=3.932 tokens_per_second=25.43 device=cuda dtype=bfloat16
</code></pre></div></div> <p>those are the run's stats: the input <code class="language-plaintext highlighter-rouge">Explain KV cache in one sentence.</code> was split by the tokenizer into 8 tokens; 100 new tokens were generated; the run took 3.932 seconds; throughput was 25.43 tokens per second (100 / 3.932).</p> <p>With the big picture in mind, let's look at what happens in each important stage.</p> <h2 id="1-cli入口">1. CLI Entry Point</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/cli/generate.py:31</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loaded</span> <span class="o">=</span> <span class="nf">load_model_and_tokenizer</span><span class="p">(</span>
    <span class="n">args</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
    <span class="n">device</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">device</span><span class="p">,</span>
    <span class="n">dtype</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">dtype</span><span class="p">,</span>
    <span class="n">local_files_only</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">local_files_only</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="nf">generate_one</span><span class="p">(</span>
    <span class="n">loaded</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
    <span class="n">loaded</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">args</span><span class="p">.</span><span class="n">prompt</span><span class="p">,</span>
    <span class="n">max_new_tokens</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">max_new_tokens</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">temperature</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="n">args</span><span class="p">.</span><span class="n">seed</span><span class="p">,</span>
<span class="p">)</span>
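
# --- Illustrative sketch (not from the nanoLLMServe repo) ---
# One way the args object used above could be produced with argparse.
# The flag names come from the command shown earlier; the defaults, the
# --seed flag spelling, and the function name build_parser are assumptions.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="nanollmserve-generate")
    parser.add_argument("--model", required=True)
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--max-new-tokens", type=int, default=64)
    parser.add_argument("--temperature", type=float, default=0.0)
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--device", default="auto")
    parser.add_argument("--dtype", default="auto")
    parser.add_argument("--local-files-only", action="store_true")
    parser.add_argument("--show-stats", action="store_true")
    return parser


# argparse maps --max-new-tokens to the attribute max_new_tokens.
example_args = build_parser().parse_args(
    ["--model", "Qwen/Qwen3-0.6B", "--prompt", "hi", "--temperature", "0"]
)
print(example_args.max_new_tokens, example_args.temperature)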
</code></pre></div></div> <p>This does just two things:</p> <ol> <li> <p>Load the model and tokenizer</p> </li> <li> <p>Hand the model and tokenizer to generate_one for a single inference run</p> </li> </ol> <h2 id="2-加载模型和tokenizer">2. Loading the Model and Tokenizer</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/model/hf_runner.py:49</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">resolved_device</span> <span class="o">=</span> <span class="nf">resolve_device</span><span class="p">(</span><span class="n">device</span><span class="p">)</span> <span class="c1"># resolve_device("cuda") -&gt; cuda
</span><span class="n">resolved_dtype</span> <span class="o">=</span> <span class="nf">resolve_dtype</span><span class="p">(</span><span class="n">dtype</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">resolved_device</span><span class="p">)</span> <span class="c1"># resolve_dtype("bfloat16") -&gt; torch.bfloat16
</span>
<span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span> <span class="c1"># load the tokenizer
</span>    <span class="n">model_path</span><span class="p">,</span> <span class="c1"># /data2/nanoLLMServe/models/Qwen3-8B
</span>    <span class="n">local_files_only</span><span class="o">=</span><span class="n">local_files_only</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="nf">from_pretrained</span><span class="p">(</span> <span class="c1"># load the model
</span>    <span class="n">model_path</span><span class="p">,</span>
    <span class="n">dtype</span><span class="o">=</span><span class="n">resolved_dtype</span><span class="p">,</span>
    <span class="n">local_files_only</span><span class="o">=</span><span class="n">local_files_only</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">model</span><span class="p">.</span><span class="nf">to</span><span class="p">(</span><span class="n">resolved_device</span><span class="p">)</span> <span class="c1"># model.to("cuda") moves the model weights into GPU 0's memory
</span><span class="n">model</span><span class="p">.</span><span class="nf">eval</span><span class="p">()</span> <span class="c1"># switch the model to inference (eval) mode; model.train() is the training mode
</span></code></pre></div></div> <p>We use hf's <a href="https://github.com/huggingface/transformers">transformers</a> library to load the model and tokenizer (AutoTokenizer and AutoModelForCausalLM). The tokenizer reads:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/data2/nanoLLMServe/models/Qwen3-8B/tokenizer_config.json
/data2/nanoLLMServe/models/Qwen3-8B/tokenizer.json
</code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">vocab.json</code> and <code class="language-plaintext highlighter-rouge">merges.txt</code> are read only when a fallback is needed; otherwise the two files above are enough.</p> <p>Model loading reads the model architecture config and the safetensors files:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>config.json
model.safetensors.index.json
model-00001-of-00005.safetensors
model-00002-of-00005.safetensors
model-00003-of-00005.safetensors
model-00004-of-00005.safetensors
model-00005-of-00005.safetensors
</code></pre></div></div> <p>In the terminal output we can see this line:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Loading checkpoint shards: 100%|███████| 5/5 [00:00&lt;00:00, 136.32it/s]
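</code></pre></div></div> <p>That is the 5 weight shards being read.</p> <p>We also explicitly chose bfloat16 here. Like float16 it takes 2 bytes per value, but it keeps float32's 8-bit exponent and has only 7 mantissa bits, so it trades some precision for the same dynamic range as float32 at half the memory, with little loss in output quality. From that we can roughly estimate the GPU memory needed for the weights (there is other overhead on top):</p> <p>8B × 2 bytes = 8 × 10^9 × 2 bytes = 1.6 × 10^10 bytes ≈ 16 GB (about 14.9 GiB after dividing by 1024^3)</p> <p>The last step is to move the model, already loaded into CPU memory, into GPU memory:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="nf">to</span><span class="p">(</span><span class="n">resolved_device</span><span class="p">)</span> <span class="c1"># model.to("cuda") moves the model weights into GPU 0's memory
</span><span class="n">model</span><span class="p">.</span><span class="nf">eval</span><span class="p">()</span> <span class="c1"># switch the model to inference (eval) mode; model.train() is the training mode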
</span></code></pre></div></div> <p>In other words, by the time <code class="language-plaintext highlighter-rouge">AutoModelForCausalLM.from_pretrained</code> returns, the model has been loaded from the weight files on disk into CPU memory and built according to the architecture and config. That is why these machines usually have a lot of system memory; otherwise just loading the model would be a problem. But CPU compute is not parallel enough and is far too slow, so the model still has to be moved into GPU memory and computed in parallel on the GPU's cores.</p> <table> <thead> <tr> <th><strong>Item</strong></th> <th><strong>8× A100 80GB (typical DGX/HGX A100)</strong></th> <th><strong>8× H100 80GB (typical DGX/HGX H100)</strong></th> </tr> </thead> <tbody> <tr> <td>GPU model</td> <td>NVIDIA A100 80GB SXM4</td> <td>NVIDIA H100 80GB SXM5</td> </tr> <tr> <td>GPU count</td> <td>8</td> <td>8</td> </tr> <tr> <td>Memory per GPU</td> <td>80GB HBM2e</td> <td>80GB HBM3</td> </tr> <tr> <td>Total GPU memory</td> <td>640GB</td> <td>640GB</td> </tr> <tr> <td>Memory bandwidth per GPU</td> <td>~2.0 TB/s</td> <td>~3.0 TB/s</td> </tr> <tr> <td>GPU peak power (TDP)</td> <td>~400W</td> <td>~700W</td> </tr> <tr> <td>GPU architecture</td> <td>Ampere</td> <td>Hopper</td> </tr> <tr> <td>Tensor Core</td> <td>3rd gen</td> <td>4th gen</td> </tr> <tr> <td>FP8 support</td> <td>No</td> <td>Yes (Transformer Engine)</td> </tr> <tr> <td>BF16 support</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>NVLink version</td> <td>NVLink 3</td> <td>NVLink 4</td> </tr> <tr> <td>NVLink bandwidth per GPU</td> <td>600 GB/s (bidirectional)</td> <td>900 GB/s (bidirectional)</td> </tr> <tr> <td>NVSwitch</td> <td>6× NVSwitch</td> <td>3rd-gen NVSwitch</td> </tr> <tr> <td>GPU topology</td> <td>Fully connected (all-to-all)</td> <td>Fully connected (all-to-all)</td> </tr> <tr> <td>GPU ↔ GPU communication</td> <td>NVSwitch Fabric</td> <td>NVSwitch Fabric</td> </tr> <tr> <td>PCIe generation</td> <td>PCIe Gen4</td> <td>PCIe Gen5</td> </tr> <tr> <td>PCIe x16 unidirectional bandwidth</td> <td>~32 GB/s</td> <td>~64 GB/s</td> </tr> <tr> <td>PCIe x16 bidirectional bandwidth</td> <td>~64 GB/s</td> <td>~128 GB/s</td> </tr> <tr> <td>CPU (typical official DGX)</td> <td>Dual-socket AMD EPYC 7742</td> <td>Dual-socket Intel Xeon Sapphire Rapids</td> </tr> <tr> <td>CPU cores</td> <td>64C ×2 = 128 cores</td> <td>~56–60C ×2</td> </tr> <tr> <td>CPU architecture codename</td> <td>Rome</td> <td>Sapphire Rapids</td> </tr> <tr> <td>System memory</td> <td>1TB–2TB DDR4</td> <td>2TB DDR5</td> </tr> <tr> <td>Memory bandwidth</td> <td>DDR4</td> 
<td>DDR5 (higher)</td> </tr> <tr> <td>Local NVMe</td> <td>Multiple NVMe SSDs</td> <td>Multiple Gen4/Gen5 NVMe drives</td> </tr> <tr> <td>Network</td> <td>Mellanox ConnectX-6</td> <td>ConnectX-7</td> </tr> <tr> <td>InfiniBand</td> <td>HDR 200Gbps</td> <td>NDR 400Gbps</td> </tr> <tr> <td>RDMA</td> <td>Supported</td> <td>Supported</td> </tr> <tr> <td>DPU</td> <td>Usually none</td> <td>BlueField-3</td> </tr> <tr> <td>Whole-system power per node</td> <td>~6.5–8 kW</td> <td>~10–12 kW</td> </tr> <tr> <td>Cooling</td> <td>High-pressure air cooling</td> <td>High-pressure air cooling/liquid cooling</td> </tr> <tr> <td>Typical use</td> <td>Training in the GPT-3/LLaMA1 era</td> <td>Training/inference in the GPT-4 era</td> </tr> <tr> <td>Typical bottleneck</td> <td>NVLink/memory bandwidth</td> <td>Power/cooling/cross-node communication</td> </tr> <tr> <td>Training profile</td> <td>More often compute-bound</td> <td>More clearly memory/communication-bound</td> </tr> <tr> <td>MoE support</td> <td>Possible, but communication-heavy</td> <td>Very well suited</td> </tr> <tr> <td>Tensor Parallel</td> <td>Strong</td> <td>Very strong</td> </tr> <tr> <td>Inference KV cache performance</td> <td>Fairly strong</td> <td>Very strong</td> </tr> <tr> <td>Typical price (whole system)</td> <td>~$120k–200k</td> <td>~$250k–500k+</td> </tr> </tbody> </table> <p>A quick look at this table shows system memory at the 1 TB-and-up level, which is not in the same league as the 32GB/64GB/128GB we associate with PCs or ordinary servers. Also worth noting is the speed gap between PCIe, NVLink, and HBM:</p> <table> <thead> <tr> <th><strong>Item</strong></th> <th><strong>NVIDIA A100 80GB SXM4</strong></th> <th><strong>NVIDIA H100 80GB SXM5</strong></th> </tr> </thead> <tbody> <tr> <td>HBM type</td> <td>HBM2e</td> <td>HBM3</td> </tr> <tr> <td>HBM bandwidth</td> <td>~2 TB/s</td> <td>~3 TB/s</td> </tr> <tr> <td>PCIe</td> <td>Gen4 x16</td> <td>Gen5 x16</td> </tr> <tr> <td>PCIe unidirectional bandwidth</td> <td>~32 GB/s</td> <td>~64 GB/s</td> </tr> <tr> <td>NVLink</td> <td>NVLink 3</td> <td>NVLink 4</td> </tr> <tr> <td>NVLink bandwidth</td> <td>600 GB/s</td> <td>900 GB/s</td> </tr> <tr> <td>GPU topology</td> <td>NVSwitch, fully connected</td> <td>NVSwitch, fully connected</td> </tr> <tr> <td>Cross-node network</td> <td>200G IB</td> <td>400G IB</td> </tr> </tbody> </table> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>         [ GPU Compute ]
                │
                │ very fast
                ▼
        HBM3 ~3000 GB/s
                │
                │
      ┌─────────┴─────────┐
      │                   │
      ▼                   ▼
 NVLink 900 GB/s     PCIe 64 GB/s
      │                   │
      ▼                   ▼
 Other GPUs           CPU RAM
</code></pre></div></div> <p>This matters especially in mature Infra, because communication is exactly what Infra has to solve. The biggest problem in model inference today is often not compute but data movement: much of the time the compute units are waiting on transfers, so utilization can't be pushed to the maximum, and lost efficiency means idle hardware. Part of NVIDIA's moat today is that it builds rack-scale systems such as <strong>NVL72</strong>, which let 72 GPUs work together as much like a single GPU as possible.</p> <table> <thead> <tr> <th><strong>Platform</strong></th> <th><strong>HBM</strong></th> <th><strong>NVLink</strong></th> </tr> </thead> <tbody> <tr> <td>A100</td> <td>2 TB/s</td> <td>600 GB/s</td> </tr> <tr> <td>H100</td> <td>3 TB/s</td> <td>900 GB/s</td> </tr> <tr> <td>GB200 NVL72</td> <td>8 TB/s</td> <td>1.8 TB/s</td> </tr> </tbody> </table> <p>That was a digression, but it helps to have the big picture early. Let's continue.</p> <h2 id="3-prompt编码并送到gpu">3. Encoding the Prompt and Moving It to the GPU</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/engine.py:68</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">encoded</span> <span class="o">=</span> <span class="nf">tokenizer</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">return_tensors</span><span class="o">=</span><span class="sh">"</span><span class="s">pt</span><span class="sh">"</span><span class="p">)</span>
<span class="n">encoded</span> <span class="o">=</span> <span class="nf">_move_batch_to_device</span><span class="p">(</span><span class="n">encoded</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
<span class="n">input_ids</span> <span class="o">=</span> <span class="n">encoded</span><span class="p">[</span><span class="sh">"</span><span class="s">input_ids</span><span class="sh">"</span><span class="p">]</span>
<span class="n">attention_mask</span> <span class="o">=</span> <span class="n">encoded</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">attention_mask</span><span class="sh">"</span><span class="p">)</span>
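
# --- Illustrative sketch (not from the nanoLLMServe repo) ---
# The body of _move_batch_to_device is not shown in this post. A plausible
# minimal version calls .to(device) on every value in the batch. The
# function name move_batch_to_device and the FakeTensor stand-in (used so
# the sketch runs without torch) are assumptions made for illustration.
class FakeTensor:
    def __init__(self, data, device="cpu"):
        self.data = data
        self.device = device

    def to(self, device):
        # Stand-in for torch.Tensor.to: return a copy placed on the target device.
        return FakeTensor(self.data, device)


def move_batch_to_device(batch, device):
    """Return a new dict with every tensor-like value moved to device."""
    return {name: value.to(device) for name, value in batch.items()}


batch = {
    "input_ids": FakeTensor([[1, 2, 3]]),
    "attention_mask": FakeTensor([[1, 1, 1]]),
}
moved = move_batch_to_device(batch, "cuda")
print({name: value.device for name, value in moved.items()})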
</code></pre></div></div> <p>This essentially runs the input prompt through the tokenizer and moves the result to the GPU, roughly like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Explain KV cache in one sentence."
  -&gt; tokenizer
  -&gt; input_ids: shape [1, 8]
  -&gt; attention_mask: shape [1, 8]
  -&gt; .to(cuda)
</code></pre></div></div> <p>Once on the GPU, the inputs can be used for inference together with the model parameters that were already moved into GPU memory.</p> <h2 id="4-推理">4. Inference</h2> <p><code class="language-plaintext highlighter-rouge">src/nanollmserve/engine/engine.py:87</code></p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="nf">inference_mode</span><span class="p">():</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="n">max_new_tokens</span><span class="p">):</span>
        <span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">)</span>
        <span class="n">next_logits</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
        <span class="n">next_token</span> <span class="o">=</span> <span class="nf">sample_next_token</span><span class="p">(</span>
            <span class="n">next_logits</span><span class="p">,</span>
            <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
            <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">,</span>
        <span class="p">)</span>
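
# --- Illustrative sketch (not from the nanoLLMServe repo) ---
# The excerpt above shows only the start of the loop. A naive decode loop
# typically also appends next_token to the sequence and feeds the longer
# sequence back into the model. The toy below assumes a fake model
# (toy_logits) and a made-up EOS id so it runs without torch; all names
# here are hypothetical.
EOS_ID = 0


def toy_logits(token_ids):
    """Fake model: favors (last token + 1) mod 5 as the next token."""
    favored = (token_ids[-1] + 1) % 5
    return [1.0 if i == favored else 0.0 for i in range(5)]


def greedy_decode(prompt_ids, max_new_tokens, stop_on_eos=True):
    token_ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        logits = toy_logits(token_ids)  # full re-forward each step, no KV cache
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        token_ids.append(next_token)  # autoregressive: the output becomes input
        generated.append(next_token)
        if stop_on_eos and next_token == EOS_ID:
            break
    return generated


print(greedy_decode([1, 2], max_new_tokens=10))  # [3, 4, 0]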
</code></pre></div></div> <p>That is the overall flow. Step 3 ends with the token ids, which are the input_ids for the first iteration here:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Explain KV cache in one sentence.
input_ids = [[849, 735, 6634, 304, 825, 11652, 13, 151645]] # example only; the real token ids differ
</code></pre></div></div> <p>At this point the shape is <code class="language-plaintext highlighter-rouge">[1, 8]</code>, which simply means one request containing 8 tokens (the same goes for attention_mask):</p> <ul> <li> <p><code class="language-plaintext highlighter-rouge">1 = batch size</code>: there is only one request for now (we will cover this in more detail when we get to batched inference)</p> </li> <li> <p><code class="language-plaintext highlighter-rouge">8 = sequence length</code>: the sentence was split into 8 tokens (by the way, a model's context length is also referred to as sequence length)</p> </li> </ul> <p>Let's also look at attention_mask here. It is a padding mask, not the causal mask used in the decoder. Suppose there are 2 requests to run:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. I love AI
2. Hello
</code></pre></div></div> <p>With batched inference, the sequences have to be padded to the same length so the GPU can process them in parallel:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. ["I", "love", "AI"]
2. ["Hello", "[PAD]", "[PAD]"]
</code></pre></div></div> <p>Now the second request, which had fewer tokens, has been padded to the same length. The padded part must not be attended to during inference, so a padding mask is needed to mark it:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">attention_mask</span> <span class="o">=</span> <span class="p">[</span>
  <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
  <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">]</span>
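# --- Illustrative sketch (not from nanollmserve): how a padding mask acts on
# attention scores. Assumption: the raw scores below are made-up numbers, and
# masked_softmax is a hypothetical helper written only for this example.
import math

def masked_softmax(scores, mask):
    """Softmax over `scores`, giving zero weight to positions where mask is 0."""
    # Masked positions become -inf, so exp(-inf) == 0.0 removes them from the sum.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Second request from above: ["Hello", "[PAD]", "[PAD]"] with mask [1, 0, 0].
weights = masked_softmax([2.0, 1.5, 0.3], [1, 0, 0])
print(weights)  # the two padded positions receive weight 0.0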
</code></pre></div></div> <p>During the computation, the attention scores at positions where the mask is 0 are set to negative infinity, so their probability after softmax is effectively 0 and those positions are ignored entirely. This is an algorithmic detail that I won't go deeper into for now.</p> <p>Moving on:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">outputs</span> <span class="o">=</span> <span class="nf">model</span><span class="p">(</span><span class="n">input_ids</span><span class="o">=</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">attention_mask</span><span class="o">=</span><span class="n">attention_mask</span><span class="p">)</span>
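</code></pre></div></div> <p>Here the token sequence and the padding mask are passed to the model together, and the forward pass outputs logits. A forward is one forward computation of the model, which roughly includes:</p> <ol> <li> <p>token→embedding</p> </li> <li> <p>passing through many layers (transformer layers)</p> </li> <li> <p>repeated attention/MLP computation</p> </li> <li> <p>finally producing logits</p> </li> </ol> <p>The resulting logits are the prediction scores for every token, with shape <code class="language-plaintext highlighter-rouge">[batch, seq_len, vocab_size]</code>. Assuming a vocabulary size (vocab_size) of 151936, what we get here is <code class="language-plaintext highlighter-rouge">[1, 8, 151936]</code>, a three-dimensional tensor. In plain terms: 1 sample, 8 token positions in that sample, and at each position a score for each of the 151936 tokens (every token in the vocabulary).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sample 0
  ├── position 0 -&gt; 151936 scores
  ├── position 1 -&gt; 151936 scores
  ├── position 2 -&gt; 151936 scores
  ...
  └── position 7 -&gt; 151936 scores


Position 0: what may come after "Explain"
Position 1: what may come after "Explain KV"
Position 2: what may come after "Explain KV cache"
...
Position 7: what may come after the full prompt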
</code></pre></div></div> <p>So even though we are using the 8 input tokens only to generate the 9th token, logits are still computed for all 8 positions. Producing the 9th token needs the full context, which is exactly what the Transformer attention mechanism provides.</p> <p>We only care about the last token, so we take only the logits at the last position:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">next_logits</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">logits</span><span class="p">[:,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="p">:]</span>
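# --- Illustrative sketch (not from nanollmserve): the same slice on nested
# lists, with a toy vocabulary of 5 instead of 151936. All values are made up.
batch, seq_len, vocab_size = 1, 8, 5
# logits[b][t][v] is the score of token v as the successor of position t.
logits = [[[float(t * vocab_size + v) for v in range(vocab_size)]
           for t in range(seq_len)] for b in range(batch)]

# Equivalent of logits[:, -1, :]: keep every sample, only the last position.
last_logits = [sample[-1] for sample in logits]
print(len(last_logits), len(last_logits[0]))  # 1 5, i.e. shape [batch, vocab]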
</code></pre></div></div> <p>Applied to the shape:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 8, 151936]
[:, -1, :]
</code></pre></div></div> <p>this gives:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[1, 151936]
</code></pre></div></div> <p>This is the prediction score of the last position (counting from 0, that is position 7) for every token in the vocabulary. With it we can obtain the predicted next token, which brings us to a stage called sampling:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">next_token</span> <span class="o">=</span> <span class="nf">sample_next_token</span><span class="p">(</span>
    <span class="n">next_logits</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span> <span class="c1"># 0
</span>    <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">,</span>
<span class="p">)</span>

<span class="k">def</span> <span class="nf">sample_next_token</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.0</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Return one next-token tensor from a `[batch, vocab]` logits tensor.

    `temperature &lt;= 0` means greedy decoding. Positive temperatures sample from
    the softmax distribution, matching the first serving concept this milestone
    needs without adding top-k/top-p policy surface yet.
    </span><span class="sh">"""</span>

    <span class="kn">import</span> <span class="n">math</span>
    <span class="kn">import</span> <span class="n">torch</span>

    <span class="k">if</span> <span class="n">logits</span><span class="p">.</span><span class="n">ndim</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">expected logits with shape [batch, vocab], got </span><span class="si">{</span><span class="nf">tuple</span><span class="p">(</span><span class="n">logits</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">math</span><span class="p">.</span><span class="nf">isfinite</span><span class="p">(</span><span class="n">temperature</span><span class="p">):</span>
        <span class="k">raise</span> <span class="nc">ValueError</span><span class="p">(</span><span class="sh">"</span><span class="s">temperature must be finite</span><span class="sh">"</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">temperature</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="nf">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="n">probs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">softmax</span><span class="p">(</span><span class="n">logits</span> <span class="o">/</span> <span class="n">temperature</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="nf">multinomial</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="n">num_samples</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">generator</span><span class="o">=</span><span class="n">generator</span><span class="p">)</span>
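# --- Illustrative sketch (not from nanollmserve): how temperature reshapes the
# distribution that multinomial sampling draws from. Pure Python, toy logits;
# softmax_with_temperature is a hypothetical helper written for this example.
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # assumed scores for a 3-token vocabulary
greedy = toy_logits.index(max(toy_logits))  # what argmax picks at temperature 0
sharp = softmax_with_temperature(toy_logits, 0.5)  # low temperature: more peaked
flat = softmax_with_temperature(toy_logits, 2.0)   # high temperature: flatter
print(greedy, sharp, flat)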
</code></pre></div></div> <p>Sampling means sampling the next token, that is, drawing one token from a probability distribution. There are many ways to do this. The most basic is top-1: take the token with the highest probability. Here, when temperature=0 we use greedy decoding: <code class="language-plaintext highlighter-rouge">argmax(logits)</code> picks the highest-scoring token (greedy means always taking the current maximum, with no regard for later steps or the long run). dim=-1 means the argmax is taken over the last dimension, the vocabulary dimension. keepdim=True keeps that dimension in the result: the incoming next_logits is <code class="language-plaintext highlighter-rouge">[1, 151936]</code>; argmax alone would return shape <code class="language-plaintext highlighter-rouge">[1]</code>, while with keepdim=True it is <code class="language-plaintext highlighter-rouge">[1, 1]</code>. That way the generated token can be concatenated directly onto input_ids:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_ids</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">cat</span><span class="p">([</span><span class="n">input_ids</span><span class="p">,</span> <span class="n">next_token</span><span class="p">.</span><span class="nf">to</span><span class="p">(</span><span class="n">input_ids</span><span class="p">.</span><span class="n">device</span><span class="p">)],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">attention_mask</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">cat</span><span class="p">([</span><span class="n">attention_mask</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="nf">ones_like</span><span class="p">(</span><span class="n">next_token</span><span class="p">)],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
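# --- Illustrative sketch (not from nanollmserve): shape bookkeeping for this
# loop, using plain lists instead of tensors. Token id 0 is a placeholder and
# prompt_len / max_new_tokens are the example values used in this article.
prompt_len = 8
max_new_tokens = 100

input_ids = [[0] * prompt_len]          # shape [1, 8]
positions_fed_to_forward = 0
for _ in range(max_new_tokens):
    positions_fed_to_forward += len(input_ids[0])  # a full forward sees every token
    next_token = [[0]]                  # shape [1, 1], stands in for the sampled id
    input_ids = [input_ids[0] + next_token[0]]     # concatenate along the last dim

print(len(input_ids[0]))         # 108: the 8 prompt tokens plus 100 generated ones
print(positions_fed_to_forward)  # 5750 = 8 + 9 + ... + 107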
</code></pre></div></div> <p>The original <code class="language-plaintext highlighter-rouge">[1, 8]</code>, after one token is generated with next_token shape <code class="language-plaintext highlighter-rouge">[1, 1]</code>, becomes <code class="language-plaintext highlighter-rouge">[1, 9]</code> after concatenation. The <code class="language-plaintext highlighter-rouge">next_token.to(input_ids.device)</code> call moves the token to the device (GPU) where input_ids lives. This is the core loop of LLM autoregressive generation.</p> <p>After that, the new input_ids is fed into another forward pass, and this repeats until 100 tokens have been generated. Notice that every forward here is a full forward over the whole sequence:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Round 1:
[1, 8]

Round 2:
[1, 9]

Round 3:
[1, 10]

Round 100:
[1, 107]
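</code></pre></div></div> <p>In other words, every round recomputes attention over the entire sequence. It is easy to see that skipping the computation already done earlier would save a great deal of compute and time. That is exactly where the KV cache comes from:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Pass 1: feed the full prompt, compute and store the KV cache
Pass 2: feed only the 1 newly generated token, reuse the existing KV cache
Pass 3: feed only the 1 newly generated token, keep reusing the KV cache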
...
</code></pre></div></div> <p>This is also what we will implement later.</p> <p>The whole inference process looks roughly like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt
  ↓
forward
  ↓
logits (a score for every token)
  ↓
argmax (a decoding strategy; argmax is only one of several)
  ↓
pick the next token
  ↓
append it to the input (input_ids)
  ↓
forward again
</code></pre></div></div> <p>In real-world LLM inference, a more mature inference engine follows a flow like this:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input_ids
    ↓
Transformer forward
    ↓
logits
    ↓
temperature scaling
    ↓
top-k / top-p filtering
    ↓
sampling / argmax
    ↓
next token
    ↓
append to KV cache + input_ids
    ↓
next decoding step
</code></pre></div></div> <p>As we keep implementing and iterating, we will gradually move in this direction.</p> <h1 id="总结">Summary</h1> <p>With this simple implementation we now have a version that barely runs. It is very basic and performs poorly, but it is enough to understand what an inference infrastructure has to do to help a model run inference, and it plants a few markers for where things can evolve later.</p> <p>Keep in mind that none of the major features in LLM infrastructure such as vLLM/SGLang were built for no reason. Each one is driven by a pain point or a need, with people who have both ideas and hands-on ability producing different implementations. In essence they all serve models with ever larger parameter counts running inference across multiple GPUs and multiple devices, while continually making inference faster, cheaper, and better.</p>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><category term="LLMInfra-101"/><summary type="html"><![CDATA[The first episode of the series. The goal for this one is simple: get a large model running.]]></summary></entry><entry xml:lang="zh"><title type="html">Browser Use Explained</title><link href="https://ifuryst.github.io/blog/2026/open-browser-use/" rel="alternate" type="text/html" title="Browser Use详解"/><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://ifuryst.github.io/blog/2026/open-browser-use</id><content type="html" xml:base="https://ifuryst.github.io/blog/2026/open-browser-use/"><![CDATA[<h1 id="缘起">Origin</h1> <p>It all goes back to <a href="https://github.com/iFurySt/open-codex-computer-use">Open-Computer-Use</a>, which we open-sourced earlier. The backstory is in <a href="https://www.ifuryst.com/blog/2026/the-urge-to-solve/"><strong>The Primal Urge to Solve Problems</strong></a>.</p> <p>This time the trigger was that OpenAI's Codex.app released a Browser Use capability.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327226_1-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327226_1-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327226_1-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327226_1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> 
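<p>Analyzing it taught me a lot and filled in some gaps in my understanding that I had never actively looked into, enough that I think it deserves an article. As before, the writing focuses on the process more than the result; the methodology, or the leap in thinking, is what matters most.</p> <h1 id="探索">Exploration</h1> <p>The exploration followed much the same path as the analysis in <a href="https://www.ifuryst.com/blog/2026/the-urge-to-solve/"><strong>The Primal Urge to Solve Problems</strong></a>. We again start by pulling the <a href="https://github.com/iFurySt/harness-template">Harness Template</a> (PS: I now use the more convenient <a href="https://github.com/iFurySt/harness-cli"><code class="language-plaintext highlighter-rouge">harness-cli</code></a>):</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜ harness-cli open-browser-use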
Select template language:
  1. English
  2. Chinese
Choice <span class="o">[</span>1]: 2
Using Chinese template from https://github.com/iFurySt/harness-template-cn.git
copy 53 file<span class="o">(</span>s<span class="o">)</span>
Initialized git repository
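</code></pre></div></div> <p>Then I started analyzing the official implementation. Working this way keeps a record of the whole exploration, so it can be searched and traced back whenever needed later.</p> <p>This time the starting point was:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>➜  <span class="nb">cd</span> ~/.codex/plugins/cache/openai-bundled/browser-use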
➜  browser-use tree <span class="nt">-I</span> <span class="s1">'node_modules'</span>
<span class="nb">.</span>
└── 0.1.0-alpha2
    ├── assets
    │   ├── browser.png
    │   └── composer-icon.png
    ├── docs
    │   └── capabilities
    │       ├── browser
    │       │   ├── viewport.md
    │       │   └── visibility.md
    │       └── tab
    ├── scripts
    │   └── browser-client.mjs
    └── skills
        └── browser
            ├── agents
            │   └── openai.yaml
            └── SKILL.md

11 directories, 7 files
</code></pre></div></div> <p>As you can see, it mainly consists of a <code class="language-plaintext highlighter-rouge">skill</code> plus the <code class="language-plaintext highlighter-rouge">browser-client.mjs</code> client, so that is a quick place to start the analysis. Without further ado, here is an architecture diagram:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327227_2-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327227_2-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327227_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327227_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Overall, iab (in-app browser) is a browser abstraction that Codex.app defines itself. It uses windowId+sessionId as the unique id, so in practice each window and session pair has exactly one browser window.</p> <p>Before going further, a few words about browsers.</p> <h2 id="chrome">Chrome</h2> <p>Looking from the bottom up, first there is <a href="https://www.chromium.org/chromium-projects/">Chromium</a>, an open-source browser project. Chrome is the commercial browser built on it, and many (AI) browsers on the market today are forks of it. From here on we treat them as one and simply say Chrome. A good understanding of Chrome makes building Browser Use on top of it far easier, and it also makes the differences between the various ways of driving a browser clear.</p> <p>First, a diagram of the overall architecture:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327227_3-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327227_3-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327227_3-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327227_3.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> 
</div> <p>There is a lot of detail in there; skim it if you are interested. The main thing to look at is the large outer boxes. With that rough picture in mind, let's walk through it:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327228_4-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327228_4-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327228_4-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327228_4.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>First, Chrome has a multi-process architecture, with different kinds of work handled by different processes. For example, opening a page involves processes such as these:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327229_5-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327229_5-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327229_5-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327229_5.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>The corresponding processes can be seen in Chrome's built-in task manager:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327229_6-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327229_6-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327229_6-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img 
src="/assets/img/2026-05-09-open-browser-use/1778327229_6.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>We can also count the current processes directly from the command line:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327229_7-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327229_7-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327229_7-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327229_7.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>The benefits are isolation and security. For example, if one tab blows up and its process dies, other tabs are not affected.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327229_8-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327229_8-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327229_8-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327229_8.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>On the security side, processes are used to provide sandboxing. For example, processes that mainly handle user-supplied input, such as the Renderer Process, are restricted from accessing system files, which improves security.</p> <p>What we care about most are the Browser Process and the Renderer Process. The Browser Process is the brain responsible for global (process) scheduling and can manage all other processes. The Renderer Process is responsible for rendering; normally each tab/iframe runs in its own process, which is what is called <a href="https://developer.chrome.com/blog/inside-browser-part1#site-isolation">Site Isolation</a>. As a result, renderers are both the most important and the most numerous processes. For example, a tab with one main frame and 2 iframes would have 3 Renderer Processes (in practice this depends on whether the frames are same-site).</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327229_9-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327229_9-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327229_9-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327229_9.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Beyond that, one more thing to understand is the Service Worker. Ordinary pages map to tabs, but a Service Worker lives outside any page. Today's Manifest V3 browser extensions are built on the Service Worker mechanism. Simplified:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Browser Process
 ├── Renderer Process <span class="o">(</span>web page<span class="o">)</span>
 │     ├── DOM
 │     ├── JS
 │     └── page logic
 │
 └── Service Worker Process
       ├── fetch interception
       ├── cache
       ├── push
       ├── background <span class="nb">sync</span>
       └── extension background logic
</code></pre></div></div> <p>We will come back to this in detail when we discuss browser extensions.</p> <p>By now we have a first understanding of how Chrome works overall. I don't plan to cover it exhaustively: there is a lot of material and it is not necessarily valuable to most readers. If you are interested, follow the links at the end to dig deeper.</p> <h2 id="codex-browser-use">Codex Browser Use</h2> <p>Back to the Browser Use capability of the Codex app itself. It consists mainly of Browser Use (iab, the in-app browser page) and Chrome (a browser extension). Here is the same architecture diagram again:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327227_2-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327227_2-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327227_2-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327227_2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Each has its pros and cons. Codex calls both through the wrapped <code class="language-plaintext highlighter-rouge">browser-client.mjs</code> abstraction, which hides the difference from the caller; which one is used is controlled by differences in the Skill.</p> <p>It is worth noting that Codex ships with a built-in Node runtime, so in practice it can compose calls like these:</p> <div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>await tab.goto<span class="o">(</span><span class="s1">'https://github.com/iFurySt/open-codex-computer-use/issues'</span><span class="o">)</span><span class="p">;</span>
await tab.playwright.waitForLoadState<span class="o">({</span> state: <span class="s1">'domcontentloaded'</span>, timeoutMs: 15000 <span class="o">})</span><span class="p">;</span>
const snap3 <span class="o">=</span> await tab.playwright.domSnapshot<span class="o">()</span><span class="p">;</span>
const relevant3 <span class="o">=</span> snap3.split<span class="o">(</span><span class="s1">'\n'</span><span class="o">)</span>.filter<span class="o">(</span>l <span class="o">=&gt;</span> /Open|Closed|Issues|issue|No results|open-codex-computer-use|Pull requests|Starred/.test<span class="o">(</span>l<span class="o">))</span><span class="p">;</span>
nodeRepl.write<span class="o">(</span>relevant3.slice<span class="o">(</span>0, 160<span class="o">)</span>.join<span class="o">(</span><span class="s1">'\n'</span><span class="o">))</span><span class="p">;</span>
</code></pre></div></div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327230_11-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327230_11-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327230_11-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327230_11.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>This puts the caller into a stateful context where it can keep operating without repeatedly fetching and locating things such as tabs and elements.</p> <p>Next, let's look at the two in turn.</p> <h3 id="iabin-app-browser">IAB(In-App Browser)</h3> <p>In Codex.app this shows up as the Browser Use plugin:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327230_12-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327230_12-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327230_12-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327230_12.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>It is the built-in web page in the right sidebar:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327230_13-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327230_13-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327230_13-1400.webp 1400w," sizes="95vw" 
type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327230_13.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>This approach works because Codex.app itself is written in Electron and therefore already has browser capabilities built in. It adds its own business-logic layer in between, mapping a single Codex.app window plus session uniquely to one browser page. That page corresponds to an Electron WebContents, and the details are hidden from the layers above.</p> <p>This ties back to what we covered in the Chrome section, and it is a useful reference for anyone who already has a similar product. I won't expand on it here; dig in yourself if you need to, and feel free to email me with needs or questions. I can share some of the insights from my deep dive.</p> <p>One advantage of iab is that it is relatively simple and smoother for the user: the browser being operated can be previewed right inside the app.</p> <p>But the drawbacks are just as clear:</p> <ul> <li>By design it can currently open only one page; opening another replaces the previous one</li> <li>Browser extensions cannot be installed in the built-in browser, which matters when an operation depends on a particular extension</li> <li>It cannot connect seamlessly to the user's own browser</li> </ul> <h3 id="chrome-extension">Chrome Extension</h3> <p>In Codex.app this sits under Google Chrome inside Computer Use (no idea why it was put there 🤡)</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327231_14-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327231_14-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327231_14-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327231_14.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327231_15-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327231_15-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327231_15-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img 
src="/assets/img/2026-05-09-open-browser-use/1778327231_15.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>It is used together with the <a href="https://chromewebstore.google.com/detail/codex/hehggadaopoacecdllhhajmbjkdcmajg">Chrome extension</a>:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327231_16-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327231_16-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327231_16-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327231_16.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>The browser-extension form is more general and more adaptable. It also works in other Chromium-based browsers, and it can simulate cursor operations inside the browser.</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327232_17-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327232_17-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327232_17-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327232_17.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>OpenAI's ingenuity here, or rather its product sense, deserves a mention:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> 
<figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327232_18-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327232_18-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327232_18-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327232_18.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>Tasks are aggregated into tab groups, which is a really nice touch. All tabs for a task stay inside its group, so when the task ends the whole group can be closed without polluting the user's tabs.</p> <p>During this time the tabs are all inactive, meaning the extension can operate in the background, the same as the background capability of Computer Use. A very smooth product experience!</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327232_19-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327232_19-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327232_19-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327232_19.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>If you look at a tab while it is being operated, you can see the same kind of hovering, moving mouse pointer as in Computer Use, which makes it easy to see what is going on:</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327232_20-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327232_20-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327232_20-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327232_20.png" 
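class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>When a task finishes, besides closing the group and the tabs inside it, a tab may also be handed over to the general Codex group. All of this is guidance written in the skill.</p> <h3 id="cdpchrome-devtools-protocol">CDP (Chrome DevTools Protocol)</h3> <p>This is the third approach, and I won't go into detail. In essence it connects to Chrome over the CDP protocol. It is the most technical option: Chrome has to be launched listening for Remote Debugging / CDP, which is nearly out of reach for ordinary users, though developers deal with it fairly often.</p> <p>Chrome now also provides an official <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp">MCP</a>, along with the companion <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp/tree/main/skills/chrome-devtools-cli">chrome-devtools-cli</a>. Puppeteer and Playwright (when driving Chromium) are essentially built on CDP, while Selenium is built primarily on the WebDriver protocol and gained CDP access in Selenium 4. CDP is also the most fundamental approach: many cloud sandboxes that wrap Chrome interact with it through CDP as well.</p> <p>At this point we have a high-level picture of all the browser-operation capabilities Codex has. I have skipped many technical details; if you are interested, ask an AI to walk you through them as needed.</p> <h1 id="open-browser-use">Open Browser Use</h1> <p>Why do we need an open-source alternative?</p> <ul> <li>Even Codex CLI cannot use these two Browser Use capabilities of Codex.app. We need a platform-neutral solution that any AI agent can use easily and any AI application can integrate easily.</li> <li>A technical implementation is not the same as a product. I will try to drive this open-source project from a product perspective. The CDP section above mentions several technical options, but their out-of-the-box usability for AI is too weak, or they were never positioned for AI in the first place. Chrome MCP is a little better, but it still has plenty of pain points.</li> </ul> <p>Open Browser Use follows the same implementation route as the Codex.app extension. I want to position it as a superset: it covers all the original capabilities and adds extra ones that let upper-layer applications work out of the box.</p> <p><a href="https://github.com/iFurySt/open-codex-browser-use">https://github.com/iFurySt/open-codex-browser-use</a></p> <p>For now it ships as a browser extension (the extension-store version is still under review, so for the moment install it directly from the <a href="https://github.com/iFurySt/open-codex-browser-use/releases">zip/crx</a>).</p> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327233_21-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327233_21-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327233_21-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img 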
src="/assets/img/2026-05-09-open-browser-use/1778327233_21.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <div class="row mt-3"> <div class="col-sm mt-0 mb-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/2026-05-09-open-browser-use/1778327233_22-480.webp 480w,/assets/img/2026-05-09-open-browser-use/1778327233_22-800.webp 800w,/assets/img/2026-05-09-open-browser-use/1778327233_22-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/2026-05-09-open-browser-use/1778327233_22.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> </div> </div> <p>See the GitHub repository for detailed usage; I won't repeat it here.</p> <h1 id="尾声">Epilogue</h1> <p>Honestly, I would call this post half-finished. Much of what I originally planned is not fully presented here, and my recent time and energy were not enough to bring it to a level I am happy with. But I did not want imperfection to stop me from finishing, so I decided to publish it anyway, because I believe it can still reach many people even if it is imperfect. The same holds for life and for work: getting things done should come first, and only those who can then go on to pursue perfection have a chance of becoming legends.</p> <p>Back to Codex and OAI: OpenAI has lost a lot of talent, yet that has not stopped it from being impressive. Maybe the organization is strong enough, or maybe the people still there are full of talent. Either way, it has kept delivering better products recently, which is great to see. Behind all this, it keeps prompting me to think about product sense.</p> <p><strong>Coding≠Engineering, Technology≠Product.</strong></p> <p>AI gives us a lot, but there is still much it cannot give us (for now). I still believe that sustained curiosity, the courage to try, and the drive to act immediately are the primal forces that keep us exploring the endless unknown.</p> <h1 id="references">References</h1> <p>To learn how modern browsers work, read these four posts from the Chrome team; they are easy to follow:</p> <ol> <li><a href="https://developer.chrome.com/blog/inside-browser-part1">Inside look at modern web browser (part 1)</a></li> <li><a href="https://developer.chrome.com/blog/inside-browser-part2">Inside look at modern web browser (part 2)</a></li> <li><a href="https://developer.chrome.com/blog/inside-browser-part3">Inside look at modern web browser (part 3)</a></li> <li><a href="https://developer.chrome.com/blog/inside-browser-part4">Inside 
look at modern web browser (part 4)</a></li> </ol>]]></content><author><name></name></author><category term="AI"/><category term="AI"/><summary type="html"><![CDATA[Background]]></summary></entry></feed>