KVin

GPU-Native KV-Cache Compression Engine
Your LLM inference just got 3.5x cheaper.
No model changes. No retraining. Drop-in.
Talk Acquisition / Licensing
kevin@spw1.com · Watch the demo below first — meetings are for deal terms
3.67x
Compression (70B)
30.5 GB
Saved per session
89
Tests passing
8,114
Lines C++/CUDA

H100 Benchmarks — 5 Models, 13 Configurations

Tested on NVIDIA H100 80GB (RunPod Cloud) and locally via Ollama with real model tensors. Compression ratio is architecture-dependent: 3.03x for 32-layer models, 3.67x for 80-layer GQA models.

Model | Params | Context | Compression | Cosine Sim | Saved
Llama 3 70B | 70B | 128K | 3.67x | 0.876 (4K) | 30.5 GB
Llama 3 70B | 70B | 32K | 3.67x | 0.808 (16K) | 7.8 GB
Llama 3 70B | 70B | 4K | 3.67x | 0.876 | 980 MB
Qwen 72B | 72B | 128K | 3.67x | 0.875 (4K) | 30.5 GB
Qwen 72B | 72B | 32K | 3.67x | 0.738 | 7.8 GB
Qwen 72B | 72B | 4K | 3.67x | 0.875 | 980 MB
Qwen3 30B | 30B | 4K | 3.54x | 0.825 | 771 MB
Llama 3 8B | 8B | 128K | 3.03x | 0.880 (4K) | 11.2 GB
Llama 3 8B | 8B | 4K | 3.03x | 0.880 | 360 MB
Mistral 7B | 7.3B | 128K | 3.03x | 0.874 (4K) | 11.2 GB
Mistral 7B | 7.3B | 16K | 3.03x | 0.810 | 1.4 GB
Mistral 7B | 7.3B | 4K | 3.03x | 0.877 | 360 MB

Cosine similarity at production context lengths (4K) ranges from 0.83 (Qwen3 30B) to 0.88 across the tested models. Larger models benefit most — your most expensive GPUs get the biggest savings.
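The Saved column follows from the standard fp16 KV-cache size formula. A back-of-envelope sketch in C++, assuming Llama 3 70B's published dimensions (80 layers, 8 GQA KV heads, head_dim 128); the function names here are illustrative, not part of the KVin API:

```cpp
#include <cstdint>

// fp16 KV-cache size: 2 tensors (K and V) x layers x KV heads x head_dim
// x sequence length x 2 bytes per fp16 element.
uint64_t kv_cache_bytes(uint64_t layers, uint64_t kv_heads,
                        uint64_t head_dim, uint64_t seq_len) {
    return 2ull * layers * kv_heads * head_dim * seq_len * 2ull;
}

// Memory freed at a given compression ratio.
double bytes_saved(uint64_t total_bytes, double ratio) {
    return total_bytes * (1.0 - 1.0 / ratio);
}
```

kv_cache_bytes(80, 8, 128, 131072) gives ~42.9 GB of fp16 cache per 128K session; at 3.67x compression that frees roughly 31 GB, the same ballpark as the 30.5 GB headline figure (exact accounting depends on the fp16 LVC window and metadata overhead).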

Memory Savings by Model & Context Length

Compression Ratio by Architecture

80-Layer GQA (70B+)
3.67x
73% memory reduction
64-Layer (30B)
3.54x
72% memory reduction
32-Layer (7B/8B)
3.03x
67% memory reduction

Key Finding: Compression Scales With Model Size

80-layer GQA models (70B/72B) achieve 3.67x compression. 64-layer models hit 3.54x. 32-layer models reach 3.03x. Your largest, most capital-intensive models benefit most from KVin.

The 30 GB Number
30.5 GB
freed per Llama 70B session @ 128K
38%
of an H100's 80 GB VRAM
3.5x
more concurrent sessions

Watch the Demo

Pick your lens: same engine, two cuts. One for decision-makers, one for engineers.

Management overview: why KV-cache compression is the single biggest lever in AI infrastructure economics, and what KVin unlocks at fleet scale.

Dashboard — What Production Looks Like

KVin ships with a live monitoring dashboard. This is what operators see when the engine is running.

KVin GPU-Native KV-Cache Orchestration Engine v0.0.25
Engine Running
Tokens Stored
1,247,832
Compression Ratio
3.54x
Arena Utilization
27%
Prefix Cache Hits
94.2%
Per-Head Precision Distribution
2-bit
52%
K4/V2
29%
4-bit
18%
Request Throughput
229K KV pairs/sec

This is a static rendering of the live dashboard. In production, all metrics update in real-time via WebSocket.

Throughput & Performance

KVin is engineered for inference-speed operation. Every component on the hot path is designed for zero-allocation, constant-time performance. CPU-path numbers measured on Windows 11 / RTX 3060 12GB. GPU benchmarks on NVIDIA H100 80GB (RunPod Cloud).

229K
KV pairs / sec (CPU)
763 ms
128K token pipeline
1-2 ns
Position lookup
O(1)
Request cleanup
0
malloc() Calls on Hot Path
Arena allocator uses bump-pointer allocation. O(1) batch reset between requests. No fragmentation, no GC pauses, no allocation jitter during inference. The same pattern that powered sub-microsecond latency on 2001 hardware.
1-2 ns
O(1) Position Lookup
3-level radix tree resolves any position at any sequence length up to 128K tokens in 3 pointer dereferences. Hash maps: 30-50 ns. Linear scan: unbounded. This is the difference at scale.
1 kernel
Fused Decompress + Attend
fused_attend decompresses and computes attention in a single CUDA kernel launch. No intermediate fp16 staging buffer. No extra memory round-trip. Decompress and attend are one operation.
SPSC + MPMC
Lock-Free Queues
KVinSPSCQueue for single-producer pipelines, KVinMsgQueue for multi-producer dispatch. Both lock-free, both ported from a production platform where microsecond contention was unacceptable.
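For readers who want the shape of those queues, here is a minimal single-producer/single-consumer ring in the style described above. It is an illustrative sketch, not KVinSPSCQueue itself:

```cpp
#include <atomic>
#include <cstddef>

// Minimal SPSC lock-free ring queue: one thread calls push(), one
// calls pop(). The acquire/release pair on head/tail is the only
// synchronization; there is no mutex and no allocation.
template <typename T, size_t N>
struct SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T buf[N];
    std::atomic<size_t> head{0}, tail{0};

    bool push(const T& v) {                       // producer side
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;                         // queue full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                            // consumer side
        size_t h = head.load(std::memory_order_relaxed);
        if (tail.load(std::memory_order_acquire) == h)
            return false;                         // queue empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};
```

The power-of-two capacity turns the ring-index modulo into a mask, keeping both paths branch-light and constant-time.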

Operational Stack — Ships Complete

KVin is not an algorithm. It is a fully instrumented production system. Every component needed to deploy, monitor, alert, and operate ships out of the box. No assembly required.

API Server
FastAPI with metered endpoints, per-key usage tracking, tiered API keys, Stripe-ready billing. Health checks, versioned endpoints, CORS-ready. Self-serve or embed behind your existing API gateway.
Live Dashboard
Real-time dark-themed monitoring: tokens stored, compression ratio, arena utilization, prefix cache hit rate, per-head entropy distribution, request throughput. WebSocket live updates. No external dependencies.
Telemetry & Metrics
ScopedTimer instrumentation on every critical path. JSON stats API for Prometheus/Grafana scraping. MQTT telemetry for fleet-wide monitoring. Per-head, per-layer, per-request granularity. The same instrumentation depth that ran 24/7 in mission-critical production environments.
System Tray Controller
Desktop sidecar: green/yellow/red health indicator, hover tooltip with live stats (tokens, compression, arena %). Right-click: status, open dashboard, open logs. Alerts operators before arena pressure becomes a problem.
vLLM Drop-In Backend
One-line CacheEngine replacement: import kvin; kvin.activate(). Intercepts all KV cache operations transparently. Your serving layer doesn't know it's running. Zero changes to model code, prompts, or serving config.
Deployment Ready
GPU-ready Dockerfile with CUDA runtime. pip-installable Python package. C++/CUDA shared library (.so/.dll). Helm chart ready for Kubernetes GPU clusters. MQTT fleet telemetry for multi-node observability.
What "Production-Ready" Actually Means
89
C++ tests passing (100%)
8,114
lines C++/CUDA (53 files)
3
custom CUDA kernels
block_k4
4-bit key quantization, block amortized
kmeans_v2
2-bit values, shared memory k-means
fused_attend
decompress + attend, single kernel
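The idea behind fused_attend can be illustrated on the CPU: dequantize each 4-bit code in-register while accumulating the attention score, so no intermediate fp16 buffer is ever written. The function name and int4 packing scheme below are illustrative assumptions, not the actual kernel:

```cpp
#include <cstdint>

// CPU analogue of a fused decompress+attend inner loop. Keys are
// stored as packed signed 4-bit codes (two per byte) with one float
// scale per token; the dot product dequantizes inline instead of
// materializing an fp16 staging buffer first.
float fused_score(const uint8_t* packed_key, float scale,
                  const float* query, int head_dim) {
    float acc = 0.0f;
    for (int d = 0; d < head_dim; d += 2) {
        uint8_t byte = packed_key[d / 2];
        int lo = (byte & 0x0F) - 8;               // signed 4-bit code
        int hi = (byte >> 4) - 8;
        acc += query[d]     * (lo * scale);       // dequantize in-register
        acc += query[d + 1] * (hi * scale);
    }
    return acc;
}
```

On a GPU the same structure means one kernel launch and no extra memory round-trip: the decompressed values live only in registers.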

KVin ships with binary record/replay (KVinJournal) for deterministic debugging, NVMe offload for overflow to CPU/disk, and a pluggable quantizer interface (I_QuantizerInstance) so you can swap TurboQuant for KIVI, GearKV, or your own custom backend without touching the engine.

25 Years of Production Heritage

KVin is not derived from a research paper. The engineering foundation — zero-malloc arenas, lock-free queues, subject-based routing, ring-buffer caching — ran in mission-critical, high-throughput production environments for over two decades. That foundation was then extended with GPU-native optimizations: 3 custom CUDA kernels, per-head adaptive entropy, fused decompress+attend, speculative branch-and-prune, and NVMe offload — none of which existed in the original domain. Proven architecture, purpose-built extensions.

895
Heritage Source Files
The original production caching platform: framework, bus, cache engine, transport plugins, arena allocator, message queues, replication, capture/playback. All retained and available for inspection.
Direct+
Port, Plus GPU Optimizations
Every KVin component has a named, auditable ancestor in the heritage codebase. Then extended: 3 custom CUDA kernels, per-head adaptive entropy, fused decompress+attend, speculative branch-and-prune, NVMe offload tier. The production foundation is proven; the GPU extensions are net-new engineering.

Architecture

Inference Engine (vLLM / TensorRT-LLM / SGLang)
                     |
KVin AttentionBackend   ← drop-in, one import
                     |
                KVinEngine
        /        |         |         \
      Bus       LVC      Store      Arena
    (route)    (fp16)  (compressed) (zero-malloc)
                     |
Quantizer Plugins (TurboQuant / KIVI / GearKV / custom)
                     |
CUDA Kernels (block_k4, kmeans_v2, fused_attend)

Bus: Subject-based message router dispatches per-(layer, head) precision policies. Each attention head gets its own quantization strategy based on entropy.

LVC: Per-head ring buffer keeps the last 128 tokens at full fp16. Quality-critical recent attention is never degraded.

ContextStore: Compressed KV storage organized by layer/head hierarchy. Append-only, arena-allocated.

Arena: Bump-pointer allocator for both CPU and CUDA. Zero malloc on the hot path. O(1) cleanup between requests.
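A minimal sketch of the bump-pointer pattern (illustrative, not KVinArena's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bump-pointer arena: allocation is a pointer bump, reset is O(1),
// and nothing on the hot path calls malloc(). The backing buffer is
// reserved once, up front.
struct Arena {
    std::vector<uint8_t> buf;
    size_t off = 0;
    explicit Arena(size_t bytes) : buf(bytes) {}

    void* alloc(size_t n, size_t align = 64) {
        size_t aligned = (off + align - 1) & ~(align - 1);
        if (aligned + n > buf.size()) return nullptr;  // arena full
        off = aligned + n;
        return buf.data() + aligned;
    }
    void reset() { off = 0; }  // O(1) batch cleanup between requests
};
```

Because reset() only rewinds an offset, per-request cleanup cost is independent of how many allocations the request made.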

Adaptive precision: KVin measures per-head Shannon entropy in real time. Low-entropy heads (sharp attention) drop to 2-bit. High-entropy heads stay at 4-bit. 52% of heads in Qwen3 30B compress to 2-bit automatically — no configuration.
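A sketch of that policy, with made-up thresholds (the engine derives its cutoffs from measured entropy statistics, and also emits the mixed K4/V2 tier shown in the dashboard):

```cpp
#include <cmath>
#include <vector>

// Shannon entropy of one head's attention distribution, in bits.
// Maximum is log2(n) for a uniform distribution over n positions.
double attention_entropy(const std::vector<double>& probs) {
    double h = 0.0;
    for (double p : probs)
        if (p > 0.0) h -= p * std::log2(p);
    return h;
}

// Illustrative precision pick: sharp (low-entropy) heads tolerate
// aggressive 2-bit quantization; diffuse heads stay at 4-bit.
int pick_bits(const std::vector<double>& probs) {
    double h    = attention_entropy(probs);
    double hmax = std::log2(static_cast<double>(probs.size()));
    return (h < 0.5 * hmax) ? 2 : 4;  // threshold is an assumption
}
```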

Component Lineage

Every KVin component traces to a production ancestor — then extends it with GPU-native optimizations that didn't exist in the original domain.

Heritage (Production) | KVin (GPU) | Role
Framework::Bus | KVinBus | Subject-based routing + precision policies
LastValueCache | KVinLVC | Per-head fp16 ring buffer (recent tokens)
TasCache | KVinContextStore | Compressed sequence storage (layer/head)
TasTickMemory | KVinArena | Bump-pointer allocator (CPU + CUDA)
TasSeqMap (radix) | KVinPositionIndex | O(1) position lookup (1-2 ns @ 128K)
I_TransportInstance | I_QuantizerInstance | Pluggable compression backends
MsgQueue<T> | KVinMsgQueue | MPMC + SPSC lock-free queues
SDObject::Record | KVinRecord | Copy-on-write typed metadata carrier
Capture/Playback | KVinJournal | Binary record/replay for debugging
Multicast handler | KVinSpeculative | Branch-and-prune for speculative decoding

Every component has a direct lineage to a production-proven ancestor. No pattern was invented from a paper.
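As a concrete illustration of the KVinPositionIndex row above: a 3-level radix index with fan-out 64 per level covers 64^3 = 262,144 slots (enough for 128K positions) and resolves any lookup in exactly three pointer dereferences. This is a sketch of the pattern, not the production code:

```cpp
#include <cstdint>

// 3-level radix index, fan-out 64 per level. Lookup is three pointer
// dereferences regardless of sequence length: no hashing, no probing,
// no rehash pauses.
struct RadixIndex {
    static constexpr uint32_t FAN = 64, MASK = 63;
    struct Leaf { uint32_t slot[FAN]; };
    struct Mid  { Leaf* child[FAN]; };
    Mid* top[FAN] = {};

    void insert(uint32_t pos, uint32_t value) {
        Mid*& m = top[(pos >> 12) & MASK];
        if (!m) m = new Mid();            // lazy node allocation
        Leaf*& l = m->child[(pos >> 6) & MASK];
        if (!l) l = new Leaf();
        l->slot[pos & MASK] = value;
    }
    // Assumes the position was inserted: three dereferences, O(1).
    uint32_t lookup(uint32_t pos) const {
        return top[(pos >> 12) & MASK]
                 ->child[(pos >> 6) & MASK]
                 ->slot[pos & MASK];
    }
};
```

In a real engine the nodes would come out of the arena rather than new, keeping the index itself malloc-free on the hot path.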

Why Acquire Instead of Build

Factor | Build from Paper | Acquire KVin
Time to production | 6-12 months, 2-3 senior GPU engineers | Immediate (89 tests passing)
Engineering cost | $500K-$1.5M salary burn | Fraction of that
Memory management | malloc/free, fragmentation | Arena allocator, zero-malloc hot path
Position lookup | Linear scan or hash table | Radix tree, 1-2 ns O(1)
Adaptive precision | Not addressed in papers | Per-head entropy-driven, automatic
Cross-request reuse | Not addressed | Hash-based prefix cache + LRU
Speculative decoding | Separate concern | Integrated branch-and-prune
Observability | None | API + MQTT + live dashboard
Plugin interface | Hardcoded quantizer | Abstract interface, swap quantizers
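The cross-request reuse row refers to prefix caching: identical prompt prefixes across requests resolve to the same compressed KV blocks instead of being recomputed. A minimal sketch of hash-keyed lookup with LRU eviction (the key scheme and stored block type are simplifying assumptions):

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Prefix cache keyed by a hash of the prompt prefix, with LRU
// eviction. The "KV blocks" are stood in for by a string here.
struct PrefixCache {
    size_t capacity;
    std::list<uint64_t> lru;  // front = most recently used
    std::unordered_map<uint64_t,
        std::pair<std::list<uint64_t>::iterator, std::string>> map;

    explicit PrefixCache(size_t cap) : capacity(cap) {}

    void put(uint64_t prefix_hash, std::string kv_blocks) {
        auto it = map.find(prefix_hash);
        if (it != map.end()) {
            lru.erase(it->second.first);          // refresh existing
        } else if (map.size() == capacity) {
            map.erase(lru.back());                // evict LRU entry
            lru.pop_back();
        }
        lru.push_front(prefix_hash);
        map[prefix_hash] = {lru.begin(), std::move(kv_blocks)};
    }

    const std::string* get(uint64_t prefix_hash) {
        auto it = map.find(prefix_hash);
        if (it == map.end()) return nullptr;      // cache miss
        lru.splice(lru.begin(), lru, it->second.first);  // mark recent
        return &it->second.second;
    }
};
```

The splice-to-front trick makes hit bookkeeping O(1), which matters when hit rates run above 90% as in the dashboard figures.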

Integration

vLLM (Two Lines)

import kvin
kvin.activate()
# vLLM now uses KVin for all KV-cache operations.
# Dashboard: http://localhost:8099

Custom Inference Stacks (C API)

// 1. Initialize with model config
KVinEngine* engine = kvin_init(layers, heads, head_dim);

// 2. Store new KV tensors on each forward pass
kvin_store(engine, layer, key_tensor, value_tensor, seq_len);

// 3. Fused decompress + attend in one kernel launch
kvin_attend(engine, query, output);

// 4. O(1) cleanup between requests
kvin_reset(engine);

Integration timeline: 1-2 weeks for basic deployment. 4-6 weeks for full production hardening with custom precision policies.

ROI

A $40K GPU serving inference with KVin handles 3.5x the concurrent load.

Annual Savings
$120K
equivalent / GPU / year
$1.2M
10 GPU fleet / year
$12M
100 GPU fleet / year

You've Seen the Engine. Ready to Talk Terms?

Acquisition, licensing, or consulting integration. If you run inference at scale, this conversation saves you 6-12 months of engineering.

Start the Conversation kevin@spw1.com · spw1.com
Ask Kevin!