DwarfStar 4 (DS4)

A Single-Model Native Inference Engine for DeepSeek V4 Flash
Technical Report · Updated May 21, 2026

Abstract — DwarfStar 4 (DS4) is a self-contained, single-model inference engine purpose-built for DeepSeek V4 Flash, a 284B-parameter routed-MoE language model with 1M-token context. Unlike general-purpose runners, DS4 owns the full stack — GGUF loading, fixed tensor layout validation, CPU reference kernels, Metal/CUDA GPU graph dispatch, tokenization, chat rendering, thinking-mode control, KV cache persistence to disk, OpenAI-compatible HTTP serving, and a native TUI coding agent. The engine is deliberately narrow: it targets one model at a time, validates against official logits, and integrates deeply with agentic workflows. This report describes the architecture, design rationale, performance characteristics, and component subsystems of DS4 as reflected in the repository at commit 8d57664.
Figure 1. System Architecture Overview. [Diagram: the front ends CLI (ds4_cli.c), Server (ds4_server.c), Agent (ds4_agent.c), and Bench / Eval sit on the Engine API (ds4.h: ds4_engine, ds4_session), which wraps the core engine (ds4.c, ~20k lines: GGUF loading, tensor layout, CPU kernels, tokenizer, graph builder). Below it are the Metal (ds4_metal.m), CUDA (ds4_cuda.cu), and diagnostics-only CPU backends, plus the KV Store (ds4_kvstore.c/.h) with its disk checkpoint files.] The CLI, server, agent, and benchmarks all share the engine API (ds4.h). The core engine (ds4.c) implements model loading, tensor layout, CPU kernels, and the GPU graph driver. The GPU backends (Metal, CUDA) are isolated in separate files, and the CPU path is used for diagnostics only. The KV Store subsystem manages on-disk checkpoint files independently.

1. Introduction

1.1 Motivation

DeepSeek V4 Flash occupies a unique position in the current landscape of locally-runnable language models. With 284B total parameters and only ~13B active per token via routed-MoE, it offers a combination of high knowledge capacity, fast inference, and efficient reasoning that is not replicated by dense models of comparable size. The model additionally features a compressed sparse attention mechanism supporting a 1M-token context, a KV cache that is sufficiently compact to persist on disk, and a thinking-mode behavior that produces reasoning traces proportional to problem complexity — often 1/5 the length of other reasoning models.

Despite these advantages, the local inference ecosystem is dominated by general-purpose runners (e.g., llama.cpp, vLLM) that must accommodate a wide range of architectures. For a specific model such as DeepSeek V4 Flash, this generality brings complexity, validation overhead, and integration gaps, particularly around KV cache persistence, tool-calling, and agentic workflows. DS4 was created to close these gaps by taking a deliberate bet on one model and building a vertically integrated engine around it.

1.2 Scope and Design Constraints

The following constraints govern the DS4 design:

  1. Single model. The engine loads only DeepSeek V4 Flash GGUFs with a known tensor layout, quantization mix, metadata, and optional MTP state. Arbitrary GGUF files are rejected at load time.
  2. No external dependencies. DS4 does not link against GGML, llama.cpp, or any other inference runtime. It exists because of the path opened by those projects — acknowledged and preserved under MIT license — but is built from scratch for this specific architecture.
  3. Disk-first KV cache. The compressed KV cache of DeepSeek V4 Flash (ratio-4 and ratio-128 rows, plus indexer states) is small enough that checkpointing to SSD is practical. The engine treats disk as a first-class KV cache medium, not a fallback.
  4. Backend portability. The core engine API (ds4.h) is backend-agnostic. GPU kernels are isolated in ds4_metal.m (macOS) and ds4_cuda.cu (Linux). A CPU path exists for correctness validation.
  5. Integrated agent. The coding agent is a single-process, two-thread application that owns the inference session directly — no socket boundaries, no serialization layer. This eliminates KV cache mismatches and provides sub-millisecond session switching.

2. Architecture

2.1 Codebase Structure

The repository is organized around a single core source file (ds4.c, ~808 kB) that implements the engine. Supporting files include:

File | Lines | Role
ds4.c | ~20,000 | Core engine: GGUF loading, tensor layout, CPU kernels, Metal graph driver, tokenizer, public API
ds4.h | 200 | Public API header: narrow interface for CLI, server, agent, and tests
ds4_metal.m | ~6,400 | Metal GPU kernels and graph builder (Objective-C)
ds4_cuda.cu | ~4,700 | CUDA GPU kernels and graph builder
ds4_gpu.h | ~700 | Shared GPU header: buffer abstractions, graph primitives
ds4_server.c | ~14,700 | OpenAI-compatible HTTP server with KV store
ds4_agent.c | ~7,185 | Native TUI coding agent
ds4_kvstore.c | ~1,200 | On-disk KV cache directory manager
ds4_cli.c | ~1,468 | CLI interactive/one-shot chat driver
ds4_bench.c | ~370 | Context-frontier throughput benchmark
ds4_eval.c | ~3,000 | Quality and correctness evaluation harness

2.2 Engine API

The public API (ds4.h) defines two primary objects, ds4_engine and ds4_session, both named in Figure 1.

Key session operations include:

2.3 Build System

The Makefile supports four build configurations:

Compilation uses -O3 -ffast-math -std=c99 with -mcpu=native (macOS) or -march=native (Linux). Metal sources are compiled with -fobjc-arc; CUDA sources use nvcc with --use_fast_math.

Figure 2. DeepSeek V4 Flash Model Architecture (43 layers). [Diagram: layers 0 and 1 use raw 128-token sliding-window attention only; even layers (2, 4, 6, ... 42) use ratio-4 compressed KV plus indexer KV, with 1 compressed row per 4 tokens and an indexer that selects the visible compressed rows; odd layers (3, 5, 7, ... 41) use ratio-128 compressed KV, with 1 compressed row per 128 tokens. In every layer, attention combines the raw sliding window (latest 128 tokens) with the compressed history, and a router then feeds the routed-MoE experts.] The 43 layers therefore fall into three attention regimes: raw sliding-window (layers 0–1), ratio-4 compressed (even layers 2+), and ratio-128 compressed (odd layers 3+). Each layer includes routed MoE with 64 experts, of which only a subset (~13B active parameters) is activated per token.

3. Model Loading and Validation

3.1 GGUF Loading

Loading is mmap-based. The loader parses the GGUF header, metadata table, and tensor directory, validating that the file matches the exact DeepSeek V4 Flash tensor layout. Tensor data is not copied at load time: it stays in the file mapping, backed by the kernel page cache, until inference touches it or until Metal wraps slices of the mapping as zero-copy MTLBuffers. Loading is strict: every validation step is designed to fail early if the GGUF does not match the expected layout.
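As a rough illustration of the first fail-early check, the sketch below validates the fixed-size GGUF preamble (magic bytes, version, tensor count, metadata count) from a mapped byte buffer. The field layout follows the public GGUF format; the function name, the struct, the error codes, and the choice to accept only version 3 are assumptions made for this example and are not taken from ds4.c, whose loader goes on to validate the full tensor layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct for this sketch; not the ds4.c loader's type. */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} sketch_gguf_header;

/* Read little-endian integers byte by byte so the code is endian-safe. */
static uint32_t rd_u32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd_u64(const uint8_t *p) {
    return (uint64_t)rd_u32(p) | (uint64_t)rd_u32(p + 4) << 32;
}

/* Returns 0 on success, or a negative code for the first failed check. */
int sketch_gguf_parse_header(const uint8_t *buf, size_t len,
                             sketch_gguf_header *out) {
    assert(buf && out);
    if (len < 24) return -1;                     /* 4 + 4 + 8 + 8 bytes */
    if (memcmp(buf, "GGUF", 4) != 0) return -2;  /* magic bytes */
    out->version = rd_u32(buf + 4);
    if (out->version != 3) return -3;            /* assumed: v3 only */
    out->tensor_count = rd_u64(buf + 8);
    out->metadata_kv_count = rd_u64(buf + 16);
    if (out->tensor_count == 0) return -4;       /* no tensors: reject */
    return 0;
}
```

Returning on the first mismatch mirrors the strict, fail-early behaviour described above: a file that is not a GGUF at all is rejected before any tensor data is touched.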

3.2 Supported Quantization

DS4 uses a highly asymmetrical quantization strategy for the 2-bit variants:

Component | Quantization
Routed-MoE up/gate projections | IQ2_XXS (2-bit)
Routed-MoE down projections | Q2_K (2-bit)
Shared experts, projections, routing | Unquantized (FP16/FP32)

This asymmetry preserves quality by leaving critical components untouched, while aggressively quantizing the majority of the model's parameter count. Both imatrix-tuned and non-imatrix variants are available. The imatrix variants, calibrated with a routed-MoE-specific prompt corpus, are preferred.

4. GPU Backends

4.1 Metal (macOS)

Metal is the primary target. The GPU graph builder (ds4_metal.m) constructs a fixed forward-pass graph for the 43-layer DeepSeek V4 Flash architecture. The graph handles:

The Metal path is validated against official logits at various context sizes through the ds4-eval harness.

4.2 CUDA (Linux)

The CUDA backend (ds4_cuda.cu) mirrors the Metal graph structure with NVIDIA-specific optimizations. Special attention is given to the DGX Spark / GB10 form factor, where memory bandwidth and power constraints differ from desktop GPUs. The CUDA path supports the same set of operations: chunked prefill, compressed KV cache management, speculative decoding, and directional steering.

4.3 CPU (Diagnostics)

A CPU-only path exists for correctness checks and model/tokenizer diagnostics. It uses hand-tuned CPU quant/dot-product kernels adapted from GGML under MIT license. On macOS, this path is broken by a kernel VM bug that crashes the system; Linux users can build with make cpu.

5. KV Cache Persistence

5.1 Motivation

DeepSeek V4 Flash uses compressed sparse attention, where each layer maintains a raw sliding-window cache (latest 128 tokens) and a set of compressed history rows. The compression ratios are layer-dependent:

Layer Index (0-based) | Compression Ratio | Extra State
0, 1 | None | Raw 128-token sliding window only
Even layers, 2+ | 4 | Compressed KV + indexer KV
Odd layers, 3+ | 128 | Compressed KV

Because the compressed rows are very compact (one row per 4 tokens on the 21 even layers and one row per 128 tokens on the 20 odd layers), the entire KV state for a session of 32k–128k tokens fits in a few hundred KB of GPU memory and, critically, in a few MB when serialized to disk. This makes on-disk persistence practical.
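The layer table above can be turned into a small row-count calculation. The following is a minimal sketch, assuming that a compressed row is only counted once its group of 4 or 128 tokens is complete; the function names are hypothetical, and the size of a row in bytes is not stated in this report, so only row counts are computed.

```c
#include <assert.h>
#include <stdint.h>

enum { SKETCH_N_LAYERS = 43 };

/* Compression ratio per 0-based layer, following the table above.
 * 0 means "raw sliding window only" (layers 0 and 1). */
int sketch_layer_ratio(int layer) {
    assert(layer >= 0 && layer < SKETCH_N_LAYERS);
    if (layer < 2) return 0;
    return (layer % 2 == 0) ? 4 : 128;
}

/* Compressed rows held by one layer after n_tokens.
 * Assumption: a partial group has not yet produced a row. */
uint64_t sketch_layer_rows(int layer, uint64_t n_tokens) {
    int ratio = sketch_layer_ratio(layer);
    return ratio ? n_tokens / (uint64_t)ratio : 0;
}

/* Total compressed rows across all 43 layers. */
uint64_t sketch_total_rows(uint64_t n_tokens) {
    uint64_t total = 0;
    for (int l = 0; l < SKETCH_N_LAYERS; l++)
        total += sketch_layer_rows(l, n_tokens);
    return total;
}
```

For a 131,072-token session this gives 32,768 rows on each ratio-4 layer and 1,024 rows on each ratio-128 layer, so the ratio-4 layers account for almost all of the compressed rows.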

5.2 KV Store Subsystem

The ds4_kvstore subsystem (ds4_kvstore.c/.h) manages a directory of checkpoint files indexed by rendered text hash:

Figure 3. KV Cache Checkpoint Lifecycle. [Diagram states: cold store (first save of a session) → checkpoint file SHA1(prefix).kvstore → load and resume (no prefill needed) → continued (extend and re-save, iteratively) → eviction (budget exceeded or stale). Eviction score: hit-count decay (half-life 6h) plus recency; the protected live session is excluded from eviction.] A session starts with a cold store to disk. Subsequent requests load the checkpoint and resume without prefill. Extended sessions are re-saved as continued checkpoints. Stale or over-budget entries are evicted using a hit-count decay score. The live session is always protected from eviction.

5.3 Checkpoint Lifecycle

  1. Cold store — a session's first save, written with reason DS4_KVSTORE_REASON_COLD.
  2. Continued store — after loading a checkpoint and extending the session, the updated state is saved with reason DS4_KVSTORE_REASON_CONTINUED. The system can suppress continued stores when the extension is small (below a configurable threshold).
  3. Load-and-resume — on a subsequent request, the store finds a matching SHA prefix, loads the payload, and resumes inference from exactly where the previous session left off — no prefill needed.
  4. Eviction — when the store exceeds its budget, entries with low eviction scores are removed.
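The choice between a cold store, a continued store, and a suppressed store can be sketched as a small decision function. This is an illustration only: the enum below is a local stand-in for the DS4_KVSTORE_REASON_* values, and the function name, its parameters, and the treatment of `loaded_tokens == 0` as "no prior checkpoint" are assumptions rather than the ds4_kvstore API.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for this sketch; the real DS4_KVSTORE_REASON_* values
 * are defined by the KV store and are not reproduced here. */
typedef enum {
    SKETCH_STORE_SKIP,       /* do not write a checkpoint */
    SKETCH_STORE_COLD,       /* first save of the session */
    SKETCH_STORE_CONTINUED   /* re-save after extending a loaded checkpoint */
} sketch_store_decision;

/* loaded_tokens:  length of the checkpoint the session resumed from (0 if none)
 * current_tokens: length of the session now
 * min_extension:  assumed configurable threshold for continued stores */
sketch_store_decision sketch_store_decide(size_t loaded_tokens,
                                          size_t current_tokens,
                                          size_t min_extension) {
    assert(current_tokens >= loaded_tokens);
    if (current_tokens == 0) return SKETCH_STORE_SKIP;  /* nothing to save */
    if (loaded_tokens == 0) return SKETCH_STORE_COLD;   /* no prior checkpoint */
    if (current_tokens - loaded_tokens < min_extension)
        return SKETCH_STORE_SKIP;                       /* extension too small */
    return SKETCH_STORE_CONTINUED;
}
```

Suppressing small extensions trades a little extra prefill on the next request for fewer checkpoint writes to the SSD.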

6. Thinking Modes

6.1 Overview

DS4 implements three thinking modes that control whether and how the model engages in chain-of-thought reasoning before producing its final answer:

Mode | Enum | Behavior
None | DS4_THINK_NONE | No thinking prefix. The model generates answers directly.
High | DS4_THINK_HIGH | Appends a "max effort" prefix that encourages reasoning before answering. The thinking section length scales with problem complexity and is typically 1/5 that of other reasoning models.
Max | DS4_THINK_MAX | Forces maximal thinking via a special prefix. Requires a minimum context size (ds4_think_max_min_context()) to leave room for the answer.

6.2 Automatic Mode Selection

The function ds4_think_mode_for_context(mode, ctx_size) downgrades the mode if the remaining context is insufficient for meaningful thinking output. When context is very tight, it falls back to DS4_THINK_NONE automatically.
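The downgrade logic might look roughly like the sketch below. The enum is a local stand-in for the DS4_THINK_* values, the two thresholds are invented for illustration (the real minimum for Max mode comes from ds4_think_max_min_context()), and stepping from Max to High before falling back to None is an assumption about the order of downgrades.

```c
#include <assert.h>

/* Local stand-ins for DS4_THINK_NONE / DS4_THINK_HIGH / DS4_THINK_MAX. */
typedef enum {
    SKETCH_THINK_NONE,
    SKETCH_THINK_HIGH,
    SKETCH_THINK_MAX
} sketch_think_mode;

/* Assumed thresholds, chosen only for illustration. */
enum { SKETCH_MAX_MIN_CTX = 32768, SKETCH_HIGH_MIN_CTX = 4096 };

/* Return the strongest mode, no stronger than the requested one,
 * that the given context size can accommodate. */
sketch_think_mode sketch_think_mode_for_context(sketch_think_mode mode,
                                                int ctx_size) {
    assert(ctx_size > 0);
    if (mode == SKETCH_THINK_MAX && ctx_size < SKETCH_MAX_MIN_CTX)
        mode = SKETCH_THINK_HIGH;   /* not enough room for maximal thinking */
    if (mode == SKETCH_THINK_HIGH && ctx_size < SKETCH_HIGH_MIN_CTX)
        mode = SKETCH_THINK_NONE;   /* very tight context: answer directly */
    return mode;
}
```

The function never upgrades a mode, so a caller that requested no thinking always gets no thinking regardless of context size.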

7. Directional Steering

7.1 Principle

Directional steering is a runtime activation edit. A steering file is a flat f32 matrix (43 layers × 4096 hidden dimensions) containing one normalized direction vector per layer. During inference, the engine applies the edit after FFN outputs, attention outputs, or both:

y ← y − α · d_ℓ · ⟨d_ℓ, y⟩

where y is the FFN or attention output being edited at layer ℓ, α is a scale factor, d_ℓ is the normalized direction vector for layer ℓ, and ⟨·,·⟩ is the dot product. Positive α attenuates the represented direction (α = 1 removes it entirely, since d_ℓ has unit norm); negative α amplifies it. With no steering file or zero scales, inference follows the normal path.
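The update above is a scaled projection along one direction, which takes only two passes over the vector. The sketch below applies it in place to a single output vector; the function name is hypothetical and the code is a CPU illustration of the formula, not the Metal or CUDA kernel used by the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Apply y <- y - alpha * d * <d, y> in place.
 * d is assumed to be unit-norm, as the steering file stores
 * normalized direction vectors. */
void sketch_steer(float *y, const float *d, size_t n, float alpha) {
    assert(y && d);
    if (alpha == 0.0f) return;          /* zero scale: normal path */

    float dot = 0.0f;                   /* <d, y> */
    for (size_t i = 0; i < n; i++) dot += d[i] * y[i];

    float k = alpha * dot;              /* amount of d to subtract */
    for (size_t i = 0; i < n; i++) y[i] -= k * d[i];
}
```

For a unit-norm direction, the component of the result along d equals (1 − α) times the original component, which is why α = 1 removes the direction and a negative α strengthens it.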

7.2 Usage

Steering is configured via engine options: --dir-steering-file FILE loads the direction file; --dir-steering-ffn F and --dir-steering-attn F set the scale factors for FFN and attention outputs respectively. FFN output is the recommended target because it is late enough in each layer to capture behavioral and stylistic signals. The dir-steering/ directory contains an example that builds a style direction from 100 paired prompts.
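Before passing a file to --dir-steering-file, it can be useful to sanity-check its layout. The sketch below assumes the flat f32 layout described in Section 7.1 (43 rows of 4096 values, one row per layer, 4-byte floats) and checks that each row is close to unit norm; the function names and the tolerance parameter are hypothetical and are not part of the DS4 tooling.

```c
#include <assert.h>
#include <stddef.h>

enum { SKETCH_LAYERS = 43, SKETCH_HIDDEN = 4096 };

/* Expected size of the steering matrix in bytes:
 * 43 * 4096 floats * 4 bytes = 704,512 (assuming 4-byte floats). */
size_t sketch_steering_file_bytes(void) {
    return (size_t)SKETCH_LAYERS * SKETCH_HIDDEN * sizeof(float);
}

/* Returns -1 if every row has squared norm within tol of 1,
 * otherwise the 0-based index of the first row that does not. */
int sketch_first_bad_row(const float *m, float tol) {
    assert(m && tol >= 0.0f);
    for (int l = 0; l < SKETCH_LAYERS; l++) {
        const float *row = m + (size_t)l * SKETCH_HIDDEN;
        double sq = 0.0;
        for (int i = 0; i < SKETCH_HIDDEN; i++)
            sq += (double)row[i] * (double)row[i];
        double err = sq - 1.0;
        if (err < 0.0) err = -err;
        if (err > tol) return l;
    }
    return -1;
}
```

Comparing the squared norm against 1 avoids a square root and is sufficient for a tolerance check on vectors that are supposed to be normalized.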

8. Server Subsystem

8.1 Architecture

The HTTP server (ds4-server) follows a single-worker design:

Figure 4. Server Architecture (Single-Worker Design). [Diagram: clients 1 to 3 → listener thread (parse + queue job) → job queue → GPU worker thread (owns the ds4_session and the live KV cache on the GPU) → KV Store (on-disk checkpoint files on SSD, session index mapping SHA → path).] Client connections are handled by a listener thread that parses requests and queues jobs. A single GPU worker owns the ds4_session and all live KV cache state, avoiding race conditions. Sessions are persisted to disk via the KV Store, which manages checkpoint files indexed by rendered text hash.

8.2 API Compatibility

The server exposes OpenAI-compatible chat completion endpoints (/v1/chat/completions) with streaming and non-streaming modes. It also supports Anthropic-style tool-calling via the tools and tool_choice parameters. Tool schemas are prepended to the rendered system prompt as part of the chat rendering pipeline.

8.3 KV Cache Integration

Each session checkpoint is managed by the ds4_kvstore subsystem. When a request arrives for an existing session, the server loads the matching checkpoint, the GPU worker syncs the session to the request's prompt prefix, and generation resumes from the cached state. If no checkpoint matches, the full prompt is prefilled from scratch.
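One way to picture the sync step is as a longest-common-prefix comparison between the tokens already held by the session and the tokens of the newly rendered prompt. The sketch below works at the token level, which is an assumption made for illustration (the KV store itself is described as indexed by rendered text hash); the function names and the int32_t token type are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the shared token prefix between the cached session
 * and the newly rendered prompt. */
size_t sketch_common_prefix(const int32_t *cached, size_t n_cached,
                            const int32_t *prompt, size_t n_prompt) {
    assert((cached || n_cached == 0) && (prompt || n_prompt == 0));
    size_t n = n_cached < n_prompt ? n_cached : n_prompt;
    size_t i = 0;
    while (i < n && cached[i] == prompt[i]) i++;
    return i;
}

/* Tokens that still need prefill once the shared prefix is reused. */
size_t sketch_tokens_to_prefill(const int32_t *cached, size_t n_cached,
                                const int32_t *prompt, size_t n_prompt) {
    return n_prompt -
           sketch_common_prefix(cached, n_cached, prompt, n_prompt);
}
```

When the whole cached sequence is a prefix of the new prompt, only the newly appended turn has to be prefilled; when nothing matches, the count equals the full prompt length, which corresponds to the from-scratch case.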

9. Agent Subsystem

9.1 Design

The native agent (ds4-agent) is a single-process, two-thread application:

There are no socket boundaries, IPC channels, or serialization layers between the UI and the inference engine. The session state is the live KV cache, checkpointed to disk; the agent reads and writes it directly.

Figure 5. Agent Architecture (Single-Process, Two-Thread). [Diagram, inside a single process boundary: the UI thread handles terminal input/output, Linenoise line editing, streaming text rendering, and the status bar / progress display; the worker thread handles ds4_session generation, tool call execution (bash, edit, etc.), session management (save / list / switch), and KV cache checkpointing. The two threads share the session tokens and KV state, exchanging queued prompts and streamed tokens.] The UI thread handles terminal I/O while the worker thread owns inference and tool execution. They communicate through shared session state, not sockets. This eliminates serialization overhead and makes KV cache mismatches impossible by construction.

9.2 Advantages

9.3 Current Status and Future Direction

The agent is currently alpha quality. When it reaches a stable shape, the plan is to split the server and client into a stateful session-based protocol that can recreate the agent experience in a client-server configuration.

10. Performance

10.1 Benchmark Methodology

ds4-bench measures instantaneous prefill and generation throughput at context frontiers rather than reporting a single whole-run average. The benchmark loads the model once, walks a fixed token sequence to frontiers (2048, 4096, 6144, ...), and uses incremental prefill so each row measures only the newly-added token interval. After each frontier it saves the live KV state to memory, generates a fixed greedy non-EOS probe, restores the memory snapshot, and continues prefill.
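The per-frontier measurement reduces to dividing the number of newly added tokens by the time spent on just that interval. The sketch below shows that arithmetic; the function names and the idea of passing pre-recorded per-interval timings are assumptions for illustration and do not reflect how ds4_bench.c is organized internally.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput for the interval (prev_frontier, frontier], given the
 * wall-clock seconds spent prefilling only those new tokens. */
double sketch_interval_tps(size_t prev_frontier, size_t frontier,
                           double seconds) {
    assert(frontier > prev_frontier && seconds > 0.0);
    return (double)(frontier - prev_frontier) / seconds;
}

/* Fill tps[i] for frontiers step, 2*step, ... from per-interval timings. */
void sketch_frontier_tps(size_t step, const double *seconds, size_t n,
                         double *tps) {
    assert(step > 0 && seconds && tps);
    for (size_t i = 0; i < n; i++)
        tps[i] = sketch_interval_tps(i * step, (i + 1) * step, seconds[i]);
}
```

With a 2048-token step and interval timings of 4 s and 8 s, the rows read 512 t/s and 256 t/s, whereas a whole-run average (4096 tokens in 12 s, about 341 t/s) would hide the slowdown at the deeper frontier.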

10.2 Measured Throughput

The following table reports single-run Metal CLI numbers with --ctx 32768, --nothink, greedy decoding, and -n 256. Short prompts use a small Italian story prompt; long prompts exercise chunked prefill plus long-context decode.

Machine | Quant | Prompt | Prefill (t/s) | Generation (t/s)
MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 | 26.68
MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 | 21.47
Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 | 36.86
Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 | 27.39
Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 | 35.50
Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 | 26.62
DGX Spark GB10, 128 GB | q2 | 7047 tokens | 343.81 | 13.75

Prefill speed rises sharply with prompt length and scales with the available GPU compute and memory bandwidth, reaching 468 t/s on the M3 Ultra with long prompts. Generation speeds span a narrower range (13.75 to 36.86 t/s), reflecting the memory-bandwidth-bound nature of autoregressive decoding.
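The table can be turned into a rough end-to-end latency estimate by adding the prefill time and the generation time. The helper below is a hypothetical illustration of that arithmetic; it treats both rates as constant over the request, which is a simplification because throughput changes with context depth.

```c
#include <assert.h>

/* Estimated wall-clock seconds for one request, assuming constant
 * prefill and generation rates taken from a benchmark row. */
double sketch_request_seconds(double prompt_tokens, double prefill_tps,
                              double gen_tokens, double gen_tps) {
    assert(prefill_tps > 0.0 && gen_tps > 0.0);
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps;
}
```

Using the M3 Max q2 long-prompt row, 11709 prompt tokens at 250.11 t/s take about 46.8 s and 256 generated tokens at 21.47 t/s take about 11.9 s, for roughly 58.7 s in total, so prefill dominates the latency of a cold long-prompt request.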

Figure 6. Prefill and Generation Throughput by Machine. [Bar chart, tokens per second, prefill / generation: M3 Max q2 58 / 27; M3 Ultra q2 84 / 37; M3 Ultra q4 79 / 35; DGX Spark q2 344 / 14.] The Mac bars show short-prompt values; long-prompt prefill on the Macs reaches 250–468 t/s (see table). The DGX Spark bar uses the 7047-token prompt, the only DGX Spark measurement in the table, so its 344 t/s prefill is not directly comparable with the short-prompt Mac bars and is below the 468 t/s long-prompt figure of the M3 Ultra. Prefill (blue) benefits from memory bandwidth and parallel compute; generation (green) is bound by memory bandwidth during decoding and varies less across machines.

11. Quality Assurance and Testing

11.1 Correctness Regression

The test runner (ds4_test) validates the engine against official DeepSeek V4 Flash continuation vectors at various context sizes. Available test suites:

11.2 Speed Regression

All changes affecting inference backends must be checked for speed regressions. The only acceptable speed penalty is when an important correctness bug is fixed. Benchmarks are collected with ds4-bench and recorded in the speed-bench/ directory as CSV files and graphs.

11.3 Quality Testing

The gguf-tools/quality-testing/ directory provides a framework for scoring local GGUFs against official DeepSeek V4 Flash continuations. This ensures that quantization and imatrix tuning do not introduce systematic quality degradation.

12. Acknowledgements

DS4 does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are indebted to Georgi Gerganov and all llama.cpp contributors. Some source-level pieces are retained or adapted under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain kernels. The GGML authors' copyright notice is preserved in the LICENSE file.

The project is developed with strong assistance from GPT 5.5, with humans leading the ideas, testing, and debugging. This is stated openly because it shaped how the project was built.

13. Status and Future Work

13.1 Current Status

The code and GGUF files are of beta quality. Inference and model serving are complicated matters that take months to stabilize. The agent is alpha quality. The project is maintained in a usable state and is under active development.

13.2 Known Limitations

13.3 Planned Work

14. Recent Developments

14.1 SSE Keepalive for Long Prefill

The server now keeps the SSE connection alive during extended prefill phases (f027269). This prevents client timeouts when the model processes long prompts before generation begins — a critical fix for OpenAI-compatible streaming clients that expect periodic keepalive signals. Additionally, prefill errors that occur after keepalive are handled gracefully (8d57664), ensuring robust error recovery in streaming mode.
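A common way to keep a Server-Sent Events stream alive is to emit a comment line, which starts with a colon and is ignored by conforming clients. The sketch below shows that technique; whether ds4-server uses a comment line or some other keepalive payload is not stated here, and the function name, the payload text, and the interval parameter are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* An SSE comment line: clients ignore it, but it keeps the
 * connection from looking idle. */
static const char SKETCH_KEEPALIVE[] = ": keepalive\n\n";

/* Writes a keepalive into out if at least interval_s seconds have
 * passed since the last write. Returns the number of bytes written,
 * or 0 if no keepalive is due or the buffer is too small. */
size_t sketch_sse_keepalive(double now_s, double last_write_s,
                            double interval_s, char *out, size_t cap) {
    assert(out && interval_s > 0.0);
    size_t len = sizeof SKETCH_KEEPALIVE - 1;   /* exclude trailing NUL */
    if (now_s - last_write_s < interval_s || cap < len) return 0;
    memcpy(out, SKETCH_KEEPALIVE, len);
    return len;
}
```

A worker could call such a helper between prefill chunks, so that a long prompt produces periodic traffic on the socket before the first generated token is available.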

14.2 Mixed GGUF Splicing Tool

A new ds4-splice utility (93d9d96) enables combining GGUF files with different quantization mixes. This allows operators to create custom split-quantization models — for example, pairing a q2-imatrix expert layer with a q4-KM shared layer — without rebuilding the full GGUF from scratch. The tool validates tensor layouts across source files to ensure architectural compatibility.

14.3 Agent TUI Maturation

The native ds4-agent TUI has received substantial refinement across rendering, session management, and interaction design. Key improvements include:

14.4 Benchmark Harness (ds4-eval)

A comprehensive evaluation framework (de5ec6d through 48c4d4d) now ships with DS4. ds4-eval supports:

14.5 KV Cache Hit Decay

The eviction policy now supports hit-count decay (b62292c), enabled by default (d0357ec). Cache entries that have been hit frequently receive a decaying score multiplier, preventing popular entries from permanently crowding out newer context. This is especially beneficial for long-running conversations where earlier turns remain relevant but shouldn't dominate the cache budget.
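To make the idea of a decaying hit count concrete, the sketch below scores an entry by its hit count multiplied by a factor that halves every 6 hours, the half-life shown in Figure 3. The actual scoring formula in ds4_kvstore.c is not given in this report, so everything here is an assumption: the function names, the omission of a separate recency term, and the piecewise-linear interpolation within each half-life, which stands in for a true exponential so that the example needs no math library.

```c
#include <assert.h>

/* Half-life from Figure 3: 6 hours, expressed in seconds. */
#define SKETCH_HALF_LIFE_S (6.0 * 3600.0)

/* Decay factor for a given age: halves once per elapsed half-life,
 * with linear interpolation inside the current half-life. It matches
 * 2^(-age / half_life) exactly at whole multiples of the half-life. */
double sketch_decay(double age_s) {
    assert(age_s >= 0.0);
    double f = 1.0;
    int halvings = 0;
    while (age_s >= SKETCH_HALF_LIFE_S && halvings < 64) {
        f *= 0.5;
        age_s -= SKETCH_HALF_LIFE_S;
        halvings++;
    }
    if (halvings == 64) return 0.0;   /* treat as fully decayed */
    return f * (1.0 - 0.5 * age_s / SKETCH_HALF_LIFE_S);
}

/* Higher score means more worth keeping. Hits lose weight as the
 * time since the last hit grows. */
double sketch_eviction_score(unsigned hits, double since_last_hit_s) {
    return (double)hits * sketch_decay(since_last_hit_s);
}
```

Under this scheme an entry with 8 hits that has been idle for 6 hours scores the same as a fresh entry with 4 hits, which is how older popular entries gradually yield the budget to newer context.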

14.6 Server Enhancements

Several server-side improvements have been merged:

14.7 CUDA and Metal Fixes

⌀ Colophon

This document was created with ds4-agent — the native coding agent built into DwarfStar 4 — running against DeepSeek V4 Flash (q2-imatrix) on a MacBook Pro M5 Max, 128GB RAM.

The generation session used the following prompt sequence:

  1. generate ds4.html explaining the what, why, how, when of ds4 as seen from this repo
  2. update ds4.html to read and look like a technical report
  3. add suitable diagrams to ds4.html
  4. update ~/pradeep-stellar.github.io/index.html with an entry about ~/pradeep-stellar.github.io/ds4.html
  5. add published date to ds4.html
  6. add colophon section — "created with ds4-agent". add the summary of the prompt using this session below that.

All prompts were processed by ds4-agent in a single interactive session, demonstrating zero-latency session switching, live prefill progress, and native tool-calling — each edit applied directly to the source file without serialization or API boundaries.

— Generated on May 20, 2025 · Updated May 21, 2026