A Single-Model Native Inference Engine for DeepSeek V4 Flash
Technical Report · Updated May 21, 2026
8d57664.
ds4.h). The core engine (ds4.c)
implements model loading, tensor layout, CPU kernels, and the GPU
graph driver. GPU backends (Metal, CUDA) are isolated in
separate files. The KV Store subsystem manages on-disk checkpoint
files independently.
DeepSeek V4 Flash occupies a unique position in the current landscape of locally-runnable language models. With 284B total parameters and only ~13B active per token via routed-MoE, it offers a combination of high knowledge capacity, fast inference, and efficient reasoning that is not replicated by dense models of comparable size. The model additionally features a compressed sparse attention mechanism supporting a 1M-token context, a KV cache that is sufficiently compact to persist on disk, and a thinking-mode behavior that produces reasoning traces proportional to problem complexity — often 1/5 the length of other reasoning models.
Despite these advantages, the local inference ecosystem is dominated by general-purpose runners (e.g., llama.cpp, vLLM) that must accommodate a wide range of architectures. For a specific model such as DeepSeek V4 Flash, this generality incurs complexity, validation overhead, and integration gaps, particularly around KV cache persistence, tool-calling, and agentic workflows. DS4 was created to close these gaps by taking a deliberate bet on one model and building a vertically-integrated engine around it.
The following constraints govern the DS4 design:
ds4.h) is backend-agnostic. GPU kernels are
isolated in ds4_metal.m (macOS) and
ds4_cuda.cu (Linux). A CPU path exists for
correctness validation.
The repository is organized around a single core source file
(ds4.c, ~808 kB) that implements the engine. Supporting
files include:
| File | Lines | Role |
|---|---|---|
| ds4.c | ~20,000 | Core engine: GGUF loading, tensor layout, CPU kernels, Metal graph driver, tokenizer, public API |
| ds4.h | 200 | Public API header: narrow interface for CLI, server, agent, and tests |
| ds4_metal.m | ~6,400 | Metal GPU kernels and graph builder (Objective-C++) |
| ds4_cuda.cu | ~4,700 | CUDA GPU kernels and graph builder |
| ds4_gpu.h | ~700 | Shared GPU header: buffer abstractions, graph primitives |
| ds4_server.c | ~14,700 | OpenAI-compatible HTTP server with KV store |
| ds4_agent.c | ~7,185 | Native TUI coding agent |
| ds4_kvstore.c | ~1,200 | On-disk KV cache directory manager |
| ds4_cli.c | ~1,468 | CLI interactive/one-shot chat driver |
| ds4_bench.c | ~370 | Context-frontier throughput benchmark |
| ds4_eval.c | ~3,000 | Quality and correctness evaluation harness |
The public API (ds4.h) defines two primary objects:
- ds4_engine: the loaded model, shared across sessions. Responsible for GGUF loading, tokenization, argmax generation, imatrix collection, and GPU graph testing.
- ds4_session: one mutable inference timeline. Owns the live KV cache and logits. Callers provide full token prefixes; ds4_session_sync() determines whether to reuse, extend, or rebuild the backend state based on prefix matching.
Key session operations include:
- ds4_session_sync(prompt): synchronize the session to a token prefix. If the current checkpoint is a prefix, only the suffix is evaluated; otherwise the backend state is refilled from scratch.
- ds4_session_argmax(), ds4_session_sample(...): extract the next token from the live logits (greedy or temperature/top-p/min-p sampling).
- ds4_session_eval(token): append one token to the graph and extend the KV cache.
- ds4_session_eval_speculative_argmax(...): MTP draft verification. The draft model proposes multiple tokens; the main model verifies them in a single forward pass.
- ds4_session_save_payload(fp), ds4_session_load_payload(fp, bytes): serialize/deserialize the GPU graph state to/from a file.
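The reuse/extend/rebuild decision described for ds4_session_sync() can be illustrated with a small prefix comparison. This is a minimal sketch, not the engine's code: the `sketch_*` names and the enum are hypothetical, tokens are modeled as plain `int` values, and a prompt that is shorter than the cached tokens is assumed to trigger a rebuild.

```c
#include <stddef.h>

/* Hypothetical outcome of comparing the cached tokens with a new prompt. */
typedef enum {
    SKETCH_SYNC_REUSE,   /* prompt equals the cached tokens: nothing to evaluate */
    SKETCH_SYNC_EXTEND,  /* cached tokens are a strict prefix: evaluate the suffix */
    SKETCH_SYNC_REBUILD  /* the sequences diverge: refill from scratch */
} sketch_sync_action;

/* Length of the common prefix of two token arrays. */
size_t sketch_common_prefix(const int *a, size_t na, const int *b, size_t nb) {
    size_t n = na < nb ? na : nb;
    size_t i = 0;
    while (i < n && a[i] == b[i]) i++;
    return i;
}

/* Decide the action and report the first prompt position that must be
 * evaluated (assumption: a shorter or diverging prompt forces a rebuild). */
sketch_sync_action sketch_sync_plan(const int *cached, size_t n_cached,
                                    const int *prompt, size_t n_prompt,
                                    size_t *eval_from) {
    size_t common = sketch_common_prefix(cached, n_cached, prompt, n_prompt);
    if (common == n_cached && n_cached == n_prompt) {
        *eval_from = n_prompt;
        return SKETCH_SYNC_REUSE;
    }
    if (common == n_cached && n_cached < n_prompt) {
        *eval_from = n_cached;
        return SKETCH_SYNC_EXTEND;
    }
    *eval_from = 0;
    return SKETCH_SYNC_REBUILD;
}
```

With cached tokens {1, 2, 3} and a prompt {1, 2, 3, 4, 5}, the plan is to extend and evaluate only positions 3 and 4.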
The Makefile supports four build configurations:
- ./ds4, ./ds4-server, ./ds4-bench, ./ds4-eval, ./ds4-agent.
- make cuda-spark (Linux): CUDA backend targeting DGX Spark / GB10.
- make cuda-generic (Linux): CUDA backend for other NVIDIA GPUs.
- make cpu (Linux): CPU-only diagnostics build. On macOS, the CPU path crashes the kernel due to a VM implementation bug.
Compilation uses -O3 -ffast-math -std=c99 with
-mcpu=native (macOS) or
-march=native (Linux). Metal sources are compiled with
-fobjc-arc; CUDA sources use nvcc with
--use_fast_math.
Loading is mmap-based. The loader parses the GGUF header, metadata
table, and tensor directory, validating that the file matches the
exact DeepSeek V4 Flash tensor layout. Tensor data remains in the
kernel page cache until inference touches it, or until Metal wraps
slices of the mapping as zero-copy MTLBuffers. Loading
is strict: every validation step is designed to fail early if the
GGUF does not match the expected layout.
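The fail-early idea can be illustrated on the first bytes of the file. The sketch below parses the fixed GGUF header (magic, version, tensor count, metadata count) from a byte buffer, which in an mmap-based loader would be the start of the mapping. The `sketch_*` names are hypothetical, and the checks DS4 actually performs on the tensor directory are not shown; rejecting versions below 2 is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size fields at the start of a GGUF file. */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} sketch_gguf_header;

/* Decode a little-endian integer byte by byte (host-endian agnostic). */
uint64_t sketch_read_le(const uint8_t *p, size_t nbytes) {
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++) v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Returns 0 on success, -1 if the buffer is not a plausible GGUF header. */
int sketch_gguf_parse_header(const uint8_t *buf, size_t len,
                             sketch_gguf_header *out) {
    if (len < 24) return -1;                     /* 4 + 4 + 8 + 8 bytes */
    if (memcmp(buf, "GGUF", 4) != 0) return -1;  /* magic mismatch */
    out->version = (uint32_t)sketch_read_le(buf + 4, 4);
    out->tensor_count = sketch_read_le(buf + 8, 8);
    out->metadata_kv_count = sketch_read_le(buf + 16, 8);
    if (out->version < 2) return -1;  /* assumption: 64-bit counts only */
    return 0;
}
```

A strict loader would continue in the same style, comparing each tensor name, type, and shape against the expected layout and returning an error at the first mismatch.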
DS4 uses a highly asymmetrical quantization strategy for the 2-bit variants:
| Component | Quantization |
|---|---|
| Routed-MoE up/gate projections | IQ2_XXS (2-bit) |
| Routed-MoE down projections | Q2_K (2-bit) |
| Shared experts, projections, routing | Unquantized (FP16/FP32) |
This asymmetry preserves quality by leaving critical components untouched, while aggressively quantizing the majority of the model's parameter count. Both imatrix-tuned and non-imatrix variants are available. The imatrix variants, calibrated with a routed-MoE-specific prompt corpus, are preferred.
Metal is the primary target. The GPU graph builder
(ds4_metal.m) constructs a fixed forward-pass graph for
the 43-layer DeepSeek V4 Flash architecture. The graph handles:
The Metal path is validated against official logits at various
context sizes through the ds4-eval harness.
The CUDA backend (ds4_cuda.cu) mirrors the Metal graph
structure with NVIDIA-specific optimizations. Special attention is
given to the DGX Spark / GB10 form factor, where memory bandwidth
and power constraints differ from desktop GPUs. The CUDA path
supports the same set of operations: chunked prefill, compressed KV
cache management, speculative decoding, and directional steering.
A CPU-only path exists for correctness checks and model/tokenizer
diagnostics. It uses hand-tuned CPU quant/dot-product kernels
adapted from GGML under MIT license. On macOS, this path is broken
by a kernel VM bug that crashes the system; Linux users can build
with make cpu.
DeepSeek V4 Flash uses compressed sparse attention, where each layer maintains a raw sliding-window cache (latest 128 tokens) and a set of compressed history rows. The compression ratios are layer-dependent:
| Layer Index (0-based) | Compression Ratio | Extra State |
|---|---|---|
| 0, 1 | None | Raw 128-token sliding window only |
| Even layers, 2+ | 4 | Compressed KV + indexer KV |
| Odd layers, 3+ | 128 | Compressed KV |
Because the compressed rows are very compact (one row per 128 tokens on the odd layers and one row per 4 tokens on the even layers, from layer 2 onward), the entire KV state for a session of 32k–128k tokens fits in a few hundred KB of GPU memory and, critically, in a few MB when serialized to disk. This makes on-disk persistence practical.
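The layer schedule in the table can be written down directly. The following sketch counts compressed history rows per layer for a given number of tokens. The function names are hypothetical, the 43-layer count is taken from this report, and the assumption that one row is emitted per full group of `ratio` tokens (integer division) is ours, not a statement about the engine.

```c
#include <stddef.h>

#define SKETCH_N_LAYERS 43

/* Compression ratio for a 0-based layer index; 0 means the layer keeps
 * only the raw sliding window and no compressed history. */
int sketch_compression_ratio(int layer) {
    if (layer < 2) return 0;
    return (layer % 2 == 0) ? 4 : 128;
}

/* Compressed rows held by one layer (assumes one row per full group). */
size_t sketch_compressed_rows(int layer, size_t n_tokens) {
    int ratio = sketch_compression_ratio(layer);
    return ratio == 0 ? 0 : n_tokens / (size_t)ratio;
}

/* Compressed rows summed over all layers. */
size_t sketch_total_compressed_rows(size_t n_tokens) {
    size_t total = 0;
    for (int layer = 0; layer < SKETCH_N_LAYERS; layer++)
        total += sketch_compressed_rows(layer, n_tokens);
    return total;
}
```

For 32768 tokens this gives 8192 rows on each ratio-4 layer and 256 rows on each ratio-128 layer.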
The ds4_kvstore subsystem
(ds4_kvstore.c/.h) manages a directory of
checkpoint files indexed by rendered text hash:
ds4_kvstore_entry_eviction_score) selects
files for eviction when the store exceeds its budget. The decay
half-life defaults to 6 hours, so older checkpoints with low hit
counts are evicted first. A protected SHA (the currently-live
session) is excluded from eviction.
DS4_KVSTORE_REASON_COLD.
DS4_KVSTORE_REASON_CONTINUED. The system can
suppress continued stores when the extension is small (below a
configurable threshold).
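A half-life based eviction score can be sketched as follows. This is an illustration under stated assumptions, not the logic of ds4_kvstore_entry_eviction_score: the score is assumed to be the hit count multiplied by a decay factor, the entry with the lowest score is evicted, and the decay halves exactly at each half-life with linear interpolation in between (a libm-free approximation of exponential decay). All `sketch_*` names and struct fields are hypothetical.

```c
#include <stddef.h>

typedef struct {
    double hits;       /* number of cache hits recorded for the checkpoint */
    double age_hours;  /* hours since the checkpoint was last used */
    int is_protected;  /* nonzero for the currently-live session */
} sketch_kv_entry;

/* Decay factor: 1.0 at age 0, halved at every full half-life, linearly
 * interpolated between those points. */
double sketch_decay(double age_hours, double half_life_hours) {
    double decay = 1.0;
    if (half_life_hours <= 0.0 || age_hours < 0.0) return 0.0;
    while (age_hours >= half_life_hours) {
        decay *= 0.5;
        age_hours -= half_life_hours;
    }
    return decay * (1.0 - 0.5 * (age_hours / half_life_hours));
}

double sketch_eviction_score(const sketch_kv_entry *e, double half_life_hours) {
    return e->hits * sketch_decay(e->age_hours, half_life_hours);
}

/* Index of the unprotected entry with the lowest score, or -1 if none. */
int sketch_pick_victim(const sketch_kv_entry *entries, size_t n,
                       double half_life_hours) {
    int victim = -1;
    double lowest = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (entries[i].is_protected) continue;
        double s = sketch_eviction_score(&entries[i], half_life_hours);
        if (victim < 0 || s < lowest) {
            victim = (int)i;
            lowest = s;
        }
    }
    return victim;
}
```

With a 6-hour half-life, an entry with 8 hits last used 12 hours ago scores 2.0 and is evicted before a fresh entry with 3 hits.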
DS4 implements three thinking modes that control whether and how the model engages in chain-of-thought reasoning before producing its final answer:
| Mode | Enum | Behavior |
|---|---|---|
| None | DS4_THINK_NONE | No thinking prefix. The model generates answers directly. |
| High | DS4_THINK_HIGH | Appends a "max effort" prefix that encourages reasoning before answering. The thinking section length scales with problem complexity and is typically 1/5 that of other reasoning models. |
| Max | DS4_THINK_MAX | Forces maximal thinking via a special prefix. Requires a minimum context size (ds4_think_max_min_context()) to leave room for the answer. |
The function
ds4_think_mode_for_context(mode, ctx_size) downgrades
the mode if the remaining context is insufficient for meaningful
thinking output. When context is very tight, it falls back to
DS4_THINK_NONE automatically.
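The downgrade behavior can be pictured as a two-step cascade. The sketch below is hypothetical throughout: the enum mirrors the three modes, but the two thresholds are placeholder values chosen for illustration, since the report does not state the limits the engine uses.

```c
/* Hypothetical mirror of the three thinking modes. */
typedef enum {
    SKETCH_THINK_NONE,
    SKETCH_THINK_HIGH,
    SKETCH_THINK_MAX
} sketch_think_mode;

/* Placeholder thresholds; the real limits are not given in this report. */
#define SKETCH_MAX_MIN_CTX  32768
#define SKETCH_HIGH_MIN_CTX 2048

/* Downgrade Max to High when the context is too small for maximal
 * thinking, then High to None when the context is very tight. */
sketch_think_mode sketch_think_mode_for_context(sketch_think_mode mode,
                                                int ctx_size) {
    if (mode == SKETCH_THINK_MAX && ctx_size < SKETCH_MAX_MIN_CTX)
        mode = SKETCH_THINK_HIGH;
    if (mode == SKETCH_THINK_HIGH && ctx_size < SKETCH_HIGH_MIN_CTX)
        mode = SKETCH_THINK_NONE;
    return mode;
}
```

Because the checks run in sequence, a Max request with a very small context falls through both steps and ends up with no thinking prefix.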
Directional steering is a runtime activation edit. A steering file
is a flat f32 matrix (43 layers × 4096 hidden
dimensions) containing one normalized direction vector per layer.
During inference, the engine applies the edit after FFN outputs,
attention outputs, or both:
y = y − α · d_ℓ · ⟨d_ℓ, y⟩
where α is a scale factor, d_ℓ is the direction vector for layer ℓ, and
⟨·,·⟩ is the dot product. Positive α removes the represented
direction; negative α amplifies it. With no steering file or zero
scales, inference follows the normal path.
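The steering edit is a rank-one projection and is short to write on the CPU. The sketch below applies it to one activation vector for one layer. The function name is hypothetical, the direction is assumed to be already normalized as the steering file format requires, and the GPU backends would perform the equivalent operation inside the graph rather than in a loop like this.

```c
#include <stddef.h>

/* Apply y = y - alpha * d * <d, y> in place.
 * y: activation vector of length n (for example, an FFN output row)
 * d: unit-length direction vector for this layer (assumed normalized)
 * alpha: scale; positive removes the direction, negative amplifies it */
void sketch_apply_steering(float *y, const float *d, size_t n, float alpha) {
    if (alpha == 0.0f) return;  /* zero scale: leave the normal path untouched */

    float dot = 0.0f;
    for (size_t i = 0; i < n; i++) dot += d[i] * y[i];

    float scale = alpha * dot;
    for (size_t i = 0; i < n; i++) y[i] -= scale * d[i];
}
```

With d = (1, 0) and y = (3, 4), α = 1 yields (0, 4), removing the component along d, while α = −1 yields (6, 4), doubling it.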
Steering is configured via engine options:
--dir-steering-file FILE loads the direction file;
--dir-steering-ffn F and
--dir-steering-attn F set the scale factors for FFN and
attention outputs respectively. FFN output is the recommended target
because it is late enough in each layer to capture behavioral and
stylistic signals. The dir-steering/ directory contains
an example that builds a style direction from 100 paired prompts.
The HTTP server (ds4-server) follows a single-worker
design:
ds4_session and all live KV cache state. By
centralizing graph mutations in one thread, the design avoids
race conditions on KV state and simplifies checkpointing.
Sessions are persisted to disk via the KV Store, which manages
checkpoint files indexed by rendered text hash.
The server exposes OpenAI-compatible chat completion endpoints
(/v1/chat/completions) with streaming and non-streaming
modes. It also supports Anthropic-style tool-calling via the
tools and tool_choice parameters. Tool
schemas are prepended to the rendered system prompt as part of the
chat rendering pipeline.
Each session checkpoint is managed by the
ds4_kvstore subsystem. When a request arrives for an
existing session, the server loads the matching checkpoint, the GPU
worker syncs the session to the request's prompt prefix, and
generation resumes from the cached state. If no checkpoint matches,
the session is filled from scratch.
The native agent (ds4-agent) is a single-process,
two-thread application:
There are no socket boundaries, IPC channels, or serialization layers between the UI and the inference engine. The session state is the live KV cache on disk; the agent reads and writes it directly.
ds4_session_progress_fn callbacks.
/list, /switch) with
zero prefill cost, because each session is a precomputed KV
cache checkpoint on disk.
The agent is currently alpha quality. When it reaches a stable shape, the plan is to split the server and client into a stateful session-based protocol that can recreate the agent experience in a client-server configuration.
ds4-bench measures instantaneous prefill and generation
throughput at context frontiers rather than reporting a single
whole-run average. The benchmark loads the model once, walks a fixed
token sequence to frontiers (2048, 4096, 6144, ...), and uses
incremental prefill so each row measures only the newly-added token
interval. After each frontier it saves the live KV state to memory,
generates a fixed greedy non-EOS probe, restores the memory
snapshot, and continues prefill.
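The difference between a frontier measurement and a whole-run average comes down to which token interval is divided by which time. The helpers below are a hypothetical sketch of that bookkeeping only; they do not run a model, and the names are not taken from ds4-bench.

```c
#include <stddef.h>

/* Fill `out` with frontiers step, 2*step, ... not exceeding max_ctx.
 * Returns the number of frontiers written (at most cap). */
size_t sketch_frontiers(size_t step, size_t max_ctx, size_t *out, size_t cap) {
    size_t n = 0;
    if (step == 0) return 0;
    for (size_t f = step; f <= max_ctx && n < cap; f += step) out[n++] = f;
    return n;
}

/* Throughput for the newly added interval (prev_frontier, cur_frontier]
 * only, given the seconds spent prefilling that interval. */
double sketch_interval_tps(size_t prev_frontier, size_t cur_frontier,
                           double seconds) {
    if (seconds <= 0.0 || cur_frontier <= prev_frontier) return 0.0;
    return (double)(cur_frontier - prev_frontier) / seconds;
}
```

If the interval from 2048 to 4096 tokens takes 4 seconds, the row for the 4096 frontier reports 512 t/s regardless of how fast the first 2048 tokens were processed.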
The following table reports single-run Metal CLI numbers with
--ctx 32768, --nothink, greedy decoding,
and -n 256. Short prompts use a small Italian story
prompt; long prompts exercise chunked prefill plus long-context
decode.
| Machine | Quant | Prompt | Prefill (t/s) | Generation (t/s) |
|---|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 | 26.68 |
| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 | 21.47 |
| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 | 36.86 |
| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 | 27.39 |
| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 | 35.50 |
| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 | 26.62 |
| DGX Spark GB10, 128 GB | q2 | 7047 tokens | 343.81 | 13.75 |
Prefill speeds scale with available memory bandwidth and GPU compute, reaching 468 t/s on the M3 Ultra with long prompts. Generation speeds vary less across configurations (13.75 to 36.86 t/s), reflecting the memory-bandwidth-bound nature of autoregressive decoding.
The test runner (ds4_test) validates the engine against
official DeepSeek V4 Flash continuation vectors at various context
sizes. Available test suites:
- --server: request parsing, chat rendering, streaming, tool-call parsing.
- --logprob-vectors: logprob accuracy against official continuations.
- --long-context: correctness at extended context windows.
- --tool-call-quality: tool-calling fidelity.
- --metal-kernels: individual Metal kernel correctness.
All changes affecting inference backends must be checked for speed
regressions. The only acceptable speed penalty is when an important
correctness bug is fixed. Benchmarks are collected with
ds4-bench and recorded in the
speed-bench/ directory as CSV files and graphs.
The gguf-tools/quality-testing/ directory provides a
framework for scoring local GGUFs against official DeepSeek V4 Flash
continuations. This ensures that quantization and imatrix tuning do
not introduce systematic quality degradation.
DS4 does not link against GGML, but it exists thanks to the path
opened by the llama.cpp project and the kernels, quantization
formats, GGUF ecosystem, and hard-won engineering knowledge
developed there. We are indebted to Georgi Gerganov and all
llama.cpp contributors. Some source-level pieces are retained or
adapted under the MIT license: GGUF quant layouts and tables, CPU
quant/dot logic, and certain kernels. The GGML authors' copyright
notice is preserved in the LICENSE file.
The project is developed with strong assistance from GPT 5.5, with humans leading the ideas, testing, and debugging. This is stated openly because it shaped how the project was built.
The code and GGUF files are of beta quality. Inference and model serving are complicated matters that take months to stabilize. The agent is alpha quality. The project is maintained in a usable state and is under active development.
The server now keeps the SSE connection alive during extended prefill
phases (f027269). This prevents client timeouts when the
model processes long prompts before generation begins — a critical fix
for OpenAI-compatible streaming clients that expect periodic
keepalive signals. Additionally, prefill errors that occur after
keepalive are handled gracefully (8d57664), ensuring
robust error recovery in streaming mode.
A new ds4-splice utility (93d9d96) enables
combining GGUF files with different quantization mixes. This allows
operators to create custom split-quantization models — for example,
pairing a q2-imatrix expert layer with a q4-KM shared layer — without
rebuilding the full GGUF from scratch. The tool validates tensor
layouts across source files to ensure architectural compatibility.
The native ds4-agent TUI has received substantial
refinement across rendering, session management, and interaction
design. Key improvements include:
- (9ff77a1) Live progress bars use Unicode glyphs instead of raw text, improving readability at small terminal widths.
- (8ba0c45) Color handling uses ANSI escape sequences that survive terminal reflow and dark-mode environments.
- (799dff4) The user's prompt remains visible during streamed output, enabling continuous reference while the agent generates.
- (2606543) The agent status bar updates in place without redrawing the entire screen, reducing visual flicker.
- (1aedee4) The TUI dynamically adjusts to terminal dimensions, handling resizes and narrow windows.
- (23cf510) Improved linenoise history folding prevents aggressive truncation of long command sequences.
- (1dc8bdb) Agents can save and list sessions during active generation without blocking.
- (8daa088) After applying file edits, the agent shows the changed context to the model, reducing follow-up correction rounds.
- (1e3c11f) Append-only queuing prevents prompt reordering under concurrent input.
A comprehensive evaluation framework (de5ec6d through
48c4d4d) now ships with DS4. ds4-eval
supports:
- (4441e56) The harness dynamically adjusts context length to match prompt requirements, avoiding wasted prefill.
- (336fbd6) Results are displayed in a clean terminal interface with pass/fail summaries and per-item breakdowns.
- (d630ca4) The CLI can compute perplexity over arbitrary text, providing a quality-independent correctness metric.
The eviction policy now supports hit-count decay
(b62292c), enabled by default (d0357ec).
Cache entries that have been hit frequently receive a decaying score
multiplier, preventing popular entries from permanently crowding out
newer context. This is especially beneficial for long-running
conversations where earlier turns remain relevant but shouldn't
dominate the cache budget.
Several server-side improvements have been merged:
- (ef0a490) ds4-server accepts --working-directory to resolve relative session paths.
- (312935e) The server exposes configurable CORS headers for web-based clients.
- (7b68234) Tool definitions are automatically prepended to the rendered system prompt, improving compliance on OpenAI-compatible tool calls.
- (38800bf, 0ca2e28) The server reports Anthropic and KV cache usage statistics via OpenAI-compatible API fields.
- (613e9b2) The server now defaults to min-p filtering for generation, reducing degenerate output without configuration.
- (5bc1e6d) Fixes to the CUDA Flash graph builder ensure correct attention computation under compressed prefill.
- (c9dd949) RoPE position computation corrected for compressed sparse attention on CUDA.
- (04b6fda) CUDA uses managed memory for KV cache allocations exceeding GPU VRAM, enabling 1M-token context on 48 GB cards.
- (4efd501) The engine gracefully skips long API vectors that don't match the active model's architecture, preventing load failures.
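For readers unfamiliar with the min-p filtering that the server now uses by default, the general technique is sketched below: tokens whose probability is below min_p times the probability of the most likely token are discarded, and the remainder is renormalized before sampling. The function name and the choice to operate on an array of probabilities are assumptions of this example; the report does not describe how DS4 implements the filter or which min_p value it uses.

```c
#include <stddef.h>

/* Zero out every probability below min_p * max(probs), renormalize the
 * survivors so they sum to 1, and return how many tokens were kept. */
size_t sketch_min_p_filter(double *probs, size_t n, double min_p) {
    double pmax = 0.0;
    for (size_t i = 0; i < n; i++)
        if (probs[i] > pmax) pmax = probs[i];

    double threshold = min_p * pmax;
    double sum = 0.0;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (probs[i] < threshold) {
            probs[i] = 0.0;
        } else {
            sum += probs[i];
            kept++;
        }
    }

    if (sum > 0.0)
        for (size_t i = 0; i < n; i++) probs[i] /= sum;
    return kept;
}
```

Because the threshold scales with the top probability, a confident distribution prunes aggressively while a flat one keeps most candidates, which is why the filter tends to suppress degenerate low-probability tokens without a fixed cutoff.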
⌀ Colophon
This document was created with ds4-agent — the native coding agent built into DwarfStar 4 — running against DeepSeek V4 Flash (q2-imatrix) on a MacBook Pro M5 Max, 128GB RAM.
The generation session used the following prompt sequence:
All prompts were processed by ds4-agent in a single interactive session, demonstrating zero-latency session switching, live prefill progress, and native tool-calling — each edit applied directly to the source file without serialization or API boundaries.
— Generated on May 20, 2025 · Updated May 21, 2026