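Same base model weights, same prompts, same temperature. The intended difference is Hypernym's platform layer on top of llama.cpp, although the two runs also use different endpoints and max-token limits (256 vs. 64; see the configuration table). Modulum's attention conditioning produces a measurable decode speedup on qa3 (3-fact temporal reasoning) at every context length we tested: 32k, 64k, and 128k.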
All figures are median tokens per second from llama.cpp's timings block, captured natively by both endpoints. Modulum is the qa3 winner; qa2 is slightly slower (−2 % to −7 %); qa1 at 32k and 64k shows Modulum markedly slower, but that data was captured during a phase-1 endpoint-load incident (78 retries on HTTP 503), so the in-flight 2026-05-18 reruns should give us a clean qa1 baseline.
| Cell | Modulum decode | Vanilla decode | Speedup | Note |
|---|---|---|---|---|
| qa1 32k | 35.1 tok/s | 50.4 tok/s | −30.3 % | Phase-1 (2026-05-13) — captured during endpoint-load incident |
| qa1 64k | 33.6 tok/s | 41.5 tok/s | −18.9 % | Phase-1 — same caveat |
| qa1 128k | 37.1 tok/s | 35.9 tok/s | +3.2 % | Phase-1 + 503-storm retries |
| qa2 32k | 39.5 tok/s | 40.4 tok/s | −2.2 % | Phase-3 clean |
| qa2 64k | 35.1 tok/s | 37.6 tok/s | −6.8 % | Phase-3 clean |
| qa2 128k | 32.7 tok/s | 34.9 tok/s | −6.4 % | Phase-3 clean |
| qa3 32k | 49.5 tok/s | 40.7 tok/s | +21.6 % | Phase-3 clean — load-bearing |
| qa3 64k | 45.9 tok/s | 38.0 tok/s | +21.0 % | Phase-3 clean — load-bearing |
| qa3 128k | 40.2 tok/s | 34.4 tok/s | +16.9 % | Phase-3/5/10 clean — load-bearing |
The Speedup column is the decode-rate delta, (Modulum − Vanilla) ÷ Vanilla, computed per cell.
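As a minimal sketch of how the Speedup column follows from the two decode columns, the snippet below recomputes the relative delta for the qa3 cells from the rounded medians in the table. The function name `relative_delta` and the `cells` dictionary are illustrative and not part of any Hypernym tooling. Because the inputs here are already rounded to one decimal, a recomputed value can differ from the published column by a few tenths of a percentage point.

```python
def relative_delta(modulum_tps: float, vanilla_tps: float) -> float:
    """Return (Modulum - Vanilla) / Vanilla as a percentage."""
    if vanilla_tps <= 0:
        raise ValueError("vanilla_tps must be positive")
    return (modulum_tps - vanilla_tps) / vanilla_tps * 100.0


# Rounded medians copied from the results table: (Modulum, Vanilla) in tok/s.
cells = {
    "qa3 32k": (49.5, 40.7),
    "qa3 64k": (45.9, 38.0),
    "qa3 128k": (40.2, 34.4),
}

for name, (modulum, vanilla) in cells.items():
    print(f"{name}: {relative_delta(modulum, vanilla):+.1f} %")
```

A positive result means Modulum decodes faster than Vanilla for that cell; a negative result means it decodes slower.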
Hypothesis: Modulum's attention conditioning produces a tighter probability distribution over the next token on multi-fact reasoning. Tighter distributions concentrate decode-time compute on a smaller candidate set, shortening outputs and producing the +17–22 % qa3 speedup. On retrieval-style qa1, the conditioning doesn't help and may add a small overhead. We cannot yet confirm this hypothesis architecturally; for now it is only consistent with the observed pattern.
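One way to make "tighter probability distribution" concrete is to compare the Shannon entropy of the next-token distribution under each configuration: lower entropy means the probability mass sits on fewer candidates. The sketch below only illustrates that metric. Both distributions are invented, and nothing here shows that Modulum lowers entropy or that either endpoint exposes these probabilities.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Entropy in bits of a discrete distribution; probs must sum to 1."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability entries contribute nothing and are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Invented next-token distributions over four candidates (illustration only).
diffuse = [0.25, 0.25, 0.25, 0.25]
tight = [0.97, 0.01, 0.01, 0.01]

print(f"diffuse: {shannon_entropy(diffuse):.2f} bits")
print(f"tight:   {shannon_entropy(tight):.2f} bits")
```

Testing the hypothesis this way would require per-token probabilities from both runs on the same prompts, which the timing data reported here does not include.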
| Variable | Modulum | Vanilla |
|---|---|---|
| Base model | Gemma-4-31B-it | Gemma-4-31B-it (same) |
| Quantization | Q4_K_M | Q4_K_M (same) |
| Inference runtime | llama.cpp + Modulum platform | llama.cpp /completion (raw) |
| Endpoint | gemma4.hypernym.ai/v1/chat/completions | 35.192.66.207:9011/completion (Hypernym mirror) |
| Temperature | 0 | 0 |
| Max tokens | 256 | 64 |
| Dataset | RMT-team/babilong-1k-samples | RMT-team/babilong-1k-samples |
| Sample indexes (mask comparison) | idx 0..49 (50 prompts per cell) | idx 0..49 (same 50 prompts) |
| Timing source | llama.cpp timings.predicted_ms | llama.cpp timings.predicted_ms |
| Tokens/sec computed as | tokens_predicted · 1000 / predicted_ms | same |
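
The last two rows of the table can be sketched in a few lines. This example assumes each response has already been parsed into a dictionary with a top-level `tokens_predicted` count and a `timings.predicted_ms` duration, as the table describes. The helper names and the sample values are hypothetical and are not measurements from this benchmark.

```python
from statistics import median


def decode_tps(response: dict) -> float:
    """Tokens per second for one response: tokens_predicted * 1000 / predicted_ms."""
    tokens = response["tokens_predicted"]
    predicted_ms = response["timings"]["predicted_ms"]
    if predicted_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return tokens * 1000.0 / predicted_ms


def median_decode_tps(responses: list[dict]) -> float:
    """Median decode rate across the prompts of one cell."""
    return median(decode_tps(r) for r in responses)


# Hypothetical responses, made up for illustration only.
sample = [
    {"tokens_predicted": 64, "timings": {"predicted_ms": 1600.0}},
    {"tokens_predicted": 32, "timings": {"predicted_ms": 1000.0}},
    {"tokens_predicted": 48, "timings": {"predicted_ms": 960.0}},
]
print(f"{median_decode_tps(sample):.1f} tok/s")
```

Taking the median of per-prompt rates, rather than dividing total tokens by total time, keeps a single slow or retried request from dominating a cell's figure.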