TokenSpeed: A New Open-Source LLM Inference Engine Tailored for Agentic AI Workloads
Introduction
The rapid adoption of autonomous coding agents—such as Claude Code, Codex, and Cursor—has highlighted a critical bottleneck in modern AI deployment: inference efficiency. As these systems evolve from experimental tools to essential infrastructure for large-scale software development, the strain on underlying inference engines becomes immense. To address this, the LightSeek Foundation has introduced TokenSpeed, an open-source LLM inference engine released under the MIT license. Currently in preview, TokenSpeed is purpose-built for agentic workloads, promising performance comparable to NVIDIA's TensorRT-LLM while remaining freely accessible.

Why Agentic Inference Is a Different Problem
Traditional chatbot interactions typically involve short, isolated queries. Agentic systems, by contrast, handle long-running conversations with context windows exceeding 50,000 tokens and dozens of iterative turns. This fundamentally changes the performance requirements. Two metrics become critical:
- Per-GPU tokens per minute (TPM): Determines how many concurrent users a single GPU can serve.
- Per-user tokens per second (TPS): Measures the system's responsiveness from the individual user's perspective.
Most public benchmarks fail to capture this dual pressure. TokenSpeed is explicitly designed to maximize per-GPU TPM while maintaining a per-user floor of at least 70 TPS, and often more than 200 TPS in demanding scenarios. This balance is essential for agentic systems that must feel instantaneous even during prolonged multi-step reasoning.
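To see how the two metrics trade off, a quick back-of-the-envelope calculation helps. The figures below are illustrative assumptions, not published TokenSpeed numbers:

```cpp
#include <cstdio>

// Back-of-the-envelope capacity math for the TPM/TPS trade-off.
// All figures are illustrative assumptions, not TokenSpeed benchmarks.
int main() {
    const double gpu_tpm   = 600000.0;  // assumed per-GPU throughput: 600k tokens/minute
    const double tps_floor = 70.0;      // per-user responsiveness floor (tokens/second)

    // A GPU emitting gpu_tpm tokens per minute produces gpu_tpm / 60 tokens per second.
    const double gpu_tps = gpu_tpm / 60.0;

    // If every user must see at least tps_floor, the GPU can serve at most:
    const double max_users = gpu_tps / tps_floor;

    std::printf("Aggregate: %.0f tok/s -> at most %.0f concurrent users at %.0f tok/s each\n",
                gpu_tps, max_users, tps_floor);

    // Raising the floor to 200 tok/s for demanding sessions cuts capacity sharply:
    std::printf("At 200 tok/s per user: %.0f concurrent users\n", gpu_tps / 200.0);
}
```

The arithmetic makes the tension concrete: every point of per-user TPS guaranteed is aggregate throughput that cannot be spread across additional users, which is why an engine must optimize both axes at once rather than either in isolation.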
Architecture: Five Interlocking Subsystems
TokenSpeed's design rests on five foundational pillars, each addressing a specific aspect of agentic inference. The following sections detail these subsystems and how they cooperate to deliver high throughput and low latency.
Modeling Layer: SPMD for Simplified Parallelism
The modeling layer employs a local Single Program, Multiple Data (SPMD) approach. In SPMD, all processing units execute the same instructions on different data portions—a pattern well-suited for distributed deep learning. Traditionally, developers must manually implement communication logic between processes. TokenSpeed abstracts this complexity: engineers specify I/O placement annotations at module boundaries, and a lightweight static compiler automatically generates the required collective operations during model construction. This eliminates error-prone manual communication coding, allowing teams to focus on model logic rather than infrastructure plumbing.
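TokenSpeed's annotation API is not spelled out here, but a minimal sketch conveys the idea. All names below (Placement, ModuleSpec, insert_collectives) are hypothetical stand-ins for whatever the engine actually exposes:

```cpp
// Hypothetical sketch of I/O placement annotations at module boundaries.
// The idea: each module declares the layout it expects and produces, and a
// compile-time pass inserts collectives wherever adjacent layouts disagree.

#include <cstdio>
#include <string>
#include <vector>

enum class Placement { Replicated, Sharded };  // per-rank tensor layout

struct ModuleSpec {
    std::string name;
    Placement in;   // layout the module expects for its input
    Placement out;  // layout the module produces
};

// The "lightweight static compiler": whenever a producer's output layout
// differs from the consumer's expected input layout, emit the operation
// that converts between them (simplified to two cases for illustration).
void insert_collectives(const std::vector<ModuleSpec>& graph) {
    for (size_t i = 0; i + 1 < graph.size(); ++i) {
        const ModuleSpec& prod = graph[i];
        const ModuleSpec& cons = graph[i + 1];
        if (prod.out == cons.in) continue;  // layouts agree: no communication
        const char* op = (cons.in == Placement::Replicated)
                             ? "all_gather"    // sharded -> replicated
                             : "local_slice";  // replicated -> sharded, no comm
        std::printf("insert %s between %s and %s\n",
                    op, prod.name.c_str(), cons.name.c_str());
    }
}

int main() {
    insert_collectives({
        {"attention", Placement::Replicated, Placement::Sharded},
        {"mlp",       Placement::Sharded,    Placement::Sharded},
        {"lm_head",   Placement::Replicated, Placement::Replicated},
    });
    // Prints: insert all_gather between mlp and lm_head
}
```

The payoff of this style is that communication is derived from declared layouts rather than written by hand, so a layout change in one module cannot silently desynchronize the collectives elsewhere in the graph.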
Scheduler: Enforcing Correctness at Compile Time
The scheduler introduces a structural split between the control plane and the execution plane. The control plane is implemented in C++ as a finite-state machine (FSM) that leverages the type system to enforce safe resource management, including KV cache transfers and usage, at compile time rather than at runtime. Key aspects of the request lifecycle, KV cache resources, and overlap timing are encoded as explicit FSM transitions and ownership semantics, so correctness is checked by the compiler rather than by convention or runtime assertions. The result: fewer bugs, more predictable performance, and safer concurrent execution.
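The article does not show the scheduler's types, but the typestate pattern it describes can be sketched in a few lines of C++, assuming move semantics as the enforcement mechanism. All names here are hypothetical:

```cpp
// Minimal typestate sketch: each lifecycle state is a distinct type, and
// each transition is a &&-qualified method that exists only on its source
// state. Out-of-order transitions therefore fail to compile.

#include <utility>

struct KvBlocks { int first_block = -1; };  // opaque KV cache allocation

struct Scheduled;   // forward declarations of lifecycle states
struct Executing;

struct Admitted {
    Scheduled schedule(KvBlocks kv) &&;     // Admitted -> Scheduled
};

struct Scheduled {
    KvBlocks kv;
    Executing dispatch() &&;                // Scheduled -> Executing
};

struct Executing {
    KvBlocks kv;
    KvBlocks complete() && { return kv; }   // done: KV ownership released
};

Scheduled Admitted::schedule(KvBlocks kv) && { return Scheduled{kv}; }
Executing Scheduled::dispatch() && { return Executing{kv}; }

int main() {
    Admitted req;
    Executing running = std::move(req).schedule(KvBlocks{0}).dispatch();
    KvBlocks freed = std::move(running).complete();  // KV returned to pool
    (void)freed;
    // req.schedule(KvBlocks{0});   // compile error: && method needs std::move
    // std::move(req).dispatch();   // compile error: Admitted has no dispatch()
}
```

The design choice is that an illegal transition is not a bug to catch in testing; it is simply a program that does not build.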

Safe KV Resource Reuse
Agentic conversations with long contexts generate massive KV caches, and inefficient management of these caches leads to memory fragmentation or wasted GPU cycles. TokenSpeed enforces a safe KV reuse discipline that prevents ownership conflicts while maximizing memory reuse. By tying cache ownership to the scheduler's FSM, the engine can precisely allocate and deallocate memory regions without costly runtime garbage collection.
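As a sketch of what FSM-tied ownership buys, here is a minimal free-list pool with move-only leases. The data structure is an assumption; only the policy (exclusive ownership, deterministic release, no garbage collection) comes from the description above:

```cpp
// Sketch: KV blocks are leased exclusively to one request; the lease
// returns its blocks to the pool deterministically on destruction, so
// reuse needs no garbage collector and no two requests alias a block.

#include <cassert>
#include <cstdio>
#include <utility>
#include <vector>

class KvPool {
public:
    explicit KvPool(int num_blocks) {
        for (int b = num_blocks - 1; b >= 0; --b) free_.push_back(b);
    }

    class Lease {  // move-only ownership of a set of KV blocks
    public:
        Lease(KvPool& pool, std::vector<int> blocks)
            : pool_(&pool), blocks_(std::move(blocks)) {}
        Lease(const Lease&) = delete;             // no aliased ownership
        Lease& operator=(const Lease&) = delete;
        Lease(Lease&& o) noexcept : pool_(o.pool_), blocks_(std::move(o.blocks_)) {
            o.pool_ = nullptr;
        }
        ~Lease() {  // deterministic release: blocks go straight back
            if (pool_)
                for (int b : blocks_) pool_->free_.push_back(b);
        }
        const std::vector<int>& blocks() const { return blocks_; }
    private:
        KvPool* pool_;
        std::vector<int> blocks_;
    };

    Lease acquire(int n) {  // caller gets exclusive ownership of n blocks
        assert(static_cast<int>(free_.size()) >= n && "KV pool exhausted");
        std::vector<int> got(free_.end() - n, free_.end());
        free_.resize(free_.size() - n);
        return Lease(*this, std::move(got));
    }

private:
    std::vector<int> free_;
};

int main() {
    KvPool pool(8);
    {
        KvPool::Lease req_a = pool.acquire(3);  // request A holds 3 blocks
        std::printf("A holds %zu blocks\n", req_a.blocks().size());
    }  // A completes; its blocks are immediately reusable
    KvPool::Lease req_b = pool.acquire(8);      // B can now take the whole pool
    std::printf("B holds %zu blocks\n", req_b.blocks().size());
}
```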
Pluggable Layered Kernel System
To support diverse hardware, TokenSpeed includes a pluggable layered kernel system that abstracts accelerator-specific operations. This allows the engine to seamlessly run on different GPU architectures (e.g., NVIDIA, AMD) and potentially on emerging accelerators. Each kernel layer can be swapped or tuned independently, simplifying optimization for specific hardware without rewriting the entire engine.
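One common way to realize such a pluggable layer is a registry of per-backend implementations behind a shared interface. The sketch below illustrates that pattern; the interface and backend names are hypothetical, not TokenSpeed's actual API:

```cpp
// Sketch of a pluggable kernel layer: backends register factories under a
// name, and the engine resolves the right implementation at startup based
// on detected hardware. Swapping vendors means swapping a registration,
// not rewriting the engine.

#include <cstdio>
#include <functional>
#include <map>
#include <memory>
#include <string>

struct AttentionKernel {  // one layer of the kernel abstraction
    virtual ~AttentionKernel() = default;
    virtual void paged_attention(/* device buffers elided */) = 0;
};

struct CudaAttention : AttentionKernel {
    void paged_attention() override { std::puts("CUDA paged attention"); }
};

struct RocmAttention : AttentionKernel {
    void paged_attention() override { std::puts("ROCm paged attention"); }
};

std::map<std::string, std::function<std::unique_ptr<AttentionKernel>()>>&
registry() {
    static std::map<std::string,
                    std::function<std::unique_ptr<AttentionKernel>()>> r;
    return r;
}

int main() {
    registry()["cuda"] = [] { return std::make_unique<CudaAttention>(); };
    registry()["rocm"] = [] { return std::make_unique<RocmAttention>(); };

    auto kernel = registry().at("rocm")();  // e.g. detected an AMD GPU
    kernel->paged_attention();
}
```

Because each layer sits behind its own interface, tuning one backend's attention kernel leaves every other layer, and every other backend, untouched.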
SMG Integration for Low-Overhead Request Entry
Finally, TokenSpeed integrates with shared-memory gateways (SMG) to provide a low-overhead CPU-side entry point for incoming requests. This reduces the latency and CPU usage of receiving and routing requests, which is especially beneficial in high-throughput agentic systems where every millisecond matters.
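SMG's internals are not described here, but a single-producer/single-consumer ring buffer in shared memory is a typical mechanism for this kind of low-overhead intake. The sketch below shows the polling pattern; all structures are illustrative:

```cpp
// Sketch: the gateway process writes fixed-size request descriptors into a
// shared-memory ring; the engine polls them without sockets or syscalls on
// the hot path, which is where the CPU and latency savings come from.

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <optional>

struct RequestDesc {         // fixed-size descriptor placed in shared memory
    uint64_t request_id;
    uint32_t prompt_offset;  // where the prompt bytes live in the shm arena
    uint32_t prompt_len;
};

template <std::size_t N>  // N must be a power of two
class SpscRing {
public:
    bool push(const RequestDesc& d) {  // called by the gateway (producer)
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N) return false;
        slots_[head & (N - 1)] = d;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    std::optional<RequestDesc> pop() {  // called by the engine (consumer)
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;
        RequestDesc d = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return d;
    }
private:
    std::array<RequestDesc, N> slots_{};
    std::atomic<std::size_t> head_{0}, tail_{0};  // producer / consumer cursors
};

int main() {
    // In a real deployment the ring lives in a mapped shm segment shared
    // between processes; here both sides run in one process for brevity.
    SpscRing<8> ring;
    ring.push({42, 0, 128});
    if (auto d = ring.pop())
        std::printf("request %llu, %u prompt bytes\n",
                    (unsigned long long)d->request_id, d->prompt_len);
}
```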
Implications for the AI Ecosystem
By releasing TokenSpeed under an MIT license, the LightSeek Foundation ensures that developers and enterprises can adopt, modify, and contribute to the engine without licensing barriers. This open-source approach could accelerate innovation in agentic AI, particularly for organizations that want to customize inference behavior for their specific workloads. Combined with its TensorRT-LLM-level performance goals, TokenSpeed may become a go-to solution for deploying large language models in real-time, conversation-heavy applications.
Conclusion
TokenSpeed represents a thoughtful response to the unique demands of agentic AI inference. Its architecture—featuring SPMD-based modeling, a compile-time safety-enforcing scheduler, intelligent KV cache reuse, pluggable kernels, and low-latency request intake—directly tackles the bottlenecks that have hindered scaling of interactive coding assistants and similar tools. As the engine moves beyond preview, its open-source nature invites the community to shape its evolution. For teams building next-generation agentic systems, TokenSpeed is a development worth watching.