DwarfStar 4 (DS4)

A Single-Model Native Inference Engine for DeepSeek V4 Flash
Technical Report · Updated June 7, 2026

Abstract — DwarfStar 4 (DS4) is a self-contained, single-model inference engine purpose-built for DeepSeek V4 Flash, a 284B-parameter routed-MoE language model with 1M-token context. Unlike general-purpose runners, DS4 owns the full stack — GGUF loading, fixed tensor layout validation, CPU reference kernels, Metal/CUDA GPU graph dispatch, tokenization, chat rendering, thinking-mode control, KV cache persistence to disk, OpenAI-compatible HTTP serving, and a native TUI coding agent. The engine is deliberately narrow: it targets one model at a time, validates against official logits, and integrates deeply with agentic workflows. This report describes the architecture, design rationale, performance characteristics, and component subsystems of DS4 as reflected in the repository at commit c463029.

Figure 1 — System Architecture Overview. The CLI, server, agent, and benchmarks all share the engine API (ds4.h). The core engine (ds4.c) implements model loading, tensor layout, CPU kernels, and the GPU graph driver. The GPU backends (Metal, CUDA) are isolated in separate files; the CPU reference path lives in the core engine. The KV Store subsystem manages on-disk checkpoint files independently.

1. Introduction

1.1 Motivation

DeepSeek V4 Flash occupies a unique position in the current landscape of locally-runnable language models. With 284B total parameters and only ~13B active per token via routed-MoE, it offers a combination of high knowledge capacity, fast inference, and efficient reasoning that dense models of comparable size do not replicate. The model additionally features a compressed sparse attention mechanism supporting a 1M-token context, a KV cache compact enough to persist on disk, and a thinking mode whose reasoning traces scale with problem complexity and are often about 1/5 the length of those produced by other reasoning models.

Despite these advantages, the local inference ecosystem is dominated by general-purpose runners (e.g., llama.cpp, vLLM) that must accommodate a wide range of architectures. For a specific model such as DeepSeek V4 Flash, this generality incurs complexity, validation overhead, and integration gaps, particularly around KV cache persistence, tool-calling, and agentic workflows. DS4 was created to close these gaps by taking a deliberate bet on one model and building a vertically-integrated engine around it.

1.2 Scope and Design Constraints

The following constraints govern the DS4 design:

Single model. The engine loads only DeepSeek V4 Flash GGUFs with a known tensor layout, quantization mix, metadata, and optional MTP state. Arbitrary GGUF files are rejected at load time.
No external dependencies. DS4 does not link against GGML, llama.cpp, or any other inference runtime. It exists because of the path opened by those projects — acknowledged and preserved under MIT license — but is built from scratch for this specific architecture.
Disk-first KV cache. The compressed KV cache of DeepSeek V4 Flash (ratio-4 and ratio-128 rows, plus indexer states) is small enough that checkpointing to SSD is practical. The engine treats disk as a first-class KV cache medium, not a fallback.
Backend portability. The core engine API (ds4.h) is backend-agnostic. GPU kernels are isolated in ds4_metal.m (macOS) and ds4_cuda.cu (Linux). A CPU path exists for correctness validation.
Integrated agent. The coding agent is a single-process, two-thread application that owns the inference session directly, with no socket boundaries and no serialization layer. This eliminates KV cache mismatches and lets sessions be switched without re-running prefill.

2. Architecture

2.1 Codebase Structure

The repository is organized around a single core source file (ds4.c, ~808 kB) that implements the engine. Supporting files include:

File	Lines	Role
`ds4.c`	~20,000	Core engine: GGUF loading, tensor layout, CPU kernels, Metal graph driver, tokenizer, public API
`ds4.h`	200	Public API header — narrow interface for CLI, server, agent, and tests
`ds4_metal.m`	~6,400	Metal GPU kernels and graph builder (Objective-C++)
`ds4_cuda.cu`	~4,700	CUDA GPU kernels and graph builder
`ds4_gpu.h`	~700	Shared GPU header — buffer abstractions, graph primitives
`ds4_server.c`	~14,700	OpenAI-compatible HTTP server with KV store
`ds4_agent.c`	~7,185	Native TUI coding agent
`ds4_kvstore.c`	~1,200	On-disk KV cache directory manager
`ds4_cli.c`	~1,468	CLI interactive/one-shot chat driver
`ds4_bench.c`	~370	Context-frontier throughput benchmark
`ds4_eval.c`	~3,000	Quality and correctness evaluation harness

2.2 Engine API

The public API (ds4.h) defines two primary objects:

ds4_engine — the loaded model, shared across sessions. Responsible for GGUF loading, tokenization, argmax generation, imatrix collection, and GPU graph testing.
ds4_session — one mutable inference timeline. Owns the live KV cache and logits. Callers provide full token prefixes; ds4_session_sync() determines whether to reuse, extend, or rebuild the backend state based on prefix matching.
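
The reuse, extend, or rebuild decision can be pictured as a longest-common-prefix comparison between the tokens already evaluated and the new prompt. The sketch below is illustrative only: `sync_action`, `common_prefix_len`, and `plan_sync` are hypothetical names that do not appear in ds4.h, and the real ds4_session_sync() may treat cases such as a shorter prompt differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of comparing a cached timeline with a new prompt. */
typedef enum {
    SYNC_REUSE,   /* prompt equals the cached tokens: nothing to evaluate   */
    SYNC_EXTEND,  /* cached tokens are a strict prefix: evaluate the suffix */
    SYNC_REBUILD  /* timelines diverge: refill the state from scratch       */
} sync_action;

/* Length of the longest common prefix of two token arrays. */
size_t common_prefix_len(const int *a, size_t na, const int *b, size_t nb) {
    size_t n = na < nb ? na : nb;
    size_t i = 0;
    while (i < n && a[i] == b[i]) i++;
    return i;
}

/* Decide what a sync would have to do. *first_new receives the index of
 * the first prompt token that still needs evaluation. */
sync_action plan_sync(const int *cached, size_t n_cached,
                      const int *prompt, size_t n_prompt,
                      size_t *first_new) {
    assert(first_new != NULL);
    size_t common = common_prefix_len(cached, n_cached, prompt, n_prompt);
    if (common == n_cached && n_cached == n_prompt) {
        *first_new = n_prompt;
        return SYNC_REUSE;
    }
    if (common == n_cached) {   /* cached is a strict prefix of the prompt */
        *first_new = n_cached;
        return SYNC_EXTEND;
    }
    *first_new = 0;             /* divergence, or prompt shorter than cache */
    return SYNC_REBUILD;
}
```

In this sketch a prompt that is shorter than the cached timeline always triggers a rebuild; that is a simplifying assumption, chosen so the three outcomes map directly onto the description above.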

Key session operations include:

ds4_session_sync(prompt) — synchronize the session to a token prefix. If the current checkpoint is a prefix, only the suffix is evaluated; otherwise the backend state is refilled from scratch.
ds4_session_argmax(), ds4_session_sample(...) — extract the next token from the live logits (greedy or temperature/top-p/min-p sampling).
ds4_session_eval(token) — append one token to the graph and extend the KV cache.
ds4_session_eval_speculative_argmax(...) — MTP draft verification: the draft model proposes multiple tokens; the main model verifies them in a single forward pass.
ds4_session_save_payload(fp), ds4_session_load_payload(fp, bytes) — serialize/deserialize the GPU graph state to/from a file.
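
For the speculative path, the acceptance rule usually used with greedy decoding is to keep draft tokens up to the first position where the main model disagrees. The helper below is a minimal sketch of that rule under the assumption that verification is greedy; `count_accepted_draft` is a hypothetical name, and the report does not describe how ds4_session_eval_speculative_argmax() represents its inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many draft tokens the main model agrees with.
 *   draft[i]    : token proposed by the draft (MTP) path at step i
 *   verified[i] : greedy (argmax) token of the main model at step i
 * Acceptance stops at the first disagreement. */
size_t count_accepted_draft(const int *draft, const int *verified, size_t n) {
    assert(n == 0 || (draft != NULL && verified != NULL));
    size_t accepted = 0;
    while (accepted < n && draft[accepted] == verified[accepted]) {
        accepted++;
    }
    return accepted;
}
```

When the count is smaller than n, the main model's own token at the mismatch position can still be emitted, so one verification pass yields at least one token.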

2.3 Build System

The Makefile supports four build configurations:

Default (macOS) — Metal backend. Produces ./ds4, ./ds4-server, ./ds4-bench, ./ds4-eval, ./ds4-agent.
make cuda-spark (Linux) — CUDA backend targeting DGX Spark / GB10.
make cuda-generic (Linux) — CUDA backend for other NVIDIA GPUs.
make cpu (Linux) — CPU-only diagnostics build. On macOS, the CPU path crashes the kernel due to a VM implementation bug.

Compilation uses -O3 -ffast-math -std=c99 with -mcpu=native (macOS) or -march=native (Linux). Metal sources are compiled with -fobjc-arc; CUDA sources use nvcc with --use_fast_math.

Figure 2 — DeepSeek V4 Flash Model Architecture. 43 layers with three attention regimes: raw sliding-window (layers 0–1), ratio-4 compressed (even layers 2+), and ratio-128 compressed (odd layers 3+). Each layer includes routed MoE with 64 experts, of which only a subset is activated per token, for about 13B active parameters in total.

3. Model Loading and Validation

3.1 GGUF Loading

Loading is mmap-based. The loader parses the GGUF header, metadata table, and tensor directory, validating that the file matches the exact DeepSeek V4 Flash tensor layout. Tensor data remains in the kernel page cache until inference touches it, or until Metal wraps slices of the mapping as zero-copy MTLBuffers. Loading is strict: every validation step is designed to fail early if the GGUF does not match the expected layout.
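
As an illustration of failing early, the first check a strict loader can make is on the fixed GGUF header: the 4-byte magic "GGUF", a 32-bit version, and two 64-bit counts, all little-endian. The sketch below parses that header from a byte buffer such as the start of an mmap'd file. `gguf_header` and `gguf_parse_header` are hypothetical names, and the DS4-specific tensor layout checks that follow are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { GGUF_HEADER_BYTES = 24 };  /* magic(4) + version(4) + two counts(8+8) */

typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header;

/* Read little-endian integers byte by byte so the code is endian-neutral. */
static uint32_t read_le32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t read_le64(const unsigned char *p) {
    return (uint64_t)read_le32(p) | ((uint64_t)read_le32(p + 4) << 32);
}

/* Returns 0 on success, -1 if the buffer is not a plausible GGUF v2+ header. */
int gguf_parse_header(const unsigned char *buf, size_t len, gguf_header *out) {
    assert(buf != NULL && out != NULL);
    if (len < GGUF_HEADER_BYTES) return -1;      /* truncated file          */
    if (memcmp(buf, "GGUF", 4) != 0) return -1;  /* wrong magic             */
    uint32_t version = read_le32(buf + 4);
    if (version < 2) return -1;                  /* v1 used 32-bit counts   */
    out->version = version;
    out->tensor_count = read_le64(buf + 8);
    out->metadata_kv_count = read_le64(buf + 16);
    return 0;
}
```

A loader in this style would go on to compare the tensor count and each tensor's name, type, and shape against the expected layout, rejecting the file at the first mismatch.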

3.2 Supported Quantization

DS4 uses a highly asymmetrical quantization strategy for the 2-bit variants:

Component	Quantization
Routed-MoE up/gate projections	IQ2_XXS (2-bit)
Routed-MoE down projections	Q2_K (2-bit)
Shared experts, projections, routing	Unquantized (FP16/FP32)

This asymmetry preserves quality by leaving critical components untouched, while aggressively quantizing the majority of the model's parameter count. Both imatrix-tuned and non-imatrix variants are available. The imatrix variants, calibrated with a routed-MoE-specific prompt corpus, are preferred.

4. GPU Backends

4.1 Metal (macOS)

Metal is the primary target. The GPU graph builder (ds4_metal.m) constructs a fixed forward-pass graph for the 43-layer DeepSeek V4 Flash architecture. The graph handles:

Routed-MoE expert dispatch — each token is routed to a subset of experts; the graph activates only those experts' weights.
Compressed sparse attention (CSA) — for each layer, the graph processes the raw sliding-window cache (latest 128 tokens) and, from layer 2 onward, the compressed history rows (ratio-4 or ratio-128 depending on layer index).
Indexer state — even layers (ratio-4) maintain an indexer that selects which compressed rows are visible to attention.
RoPE position embedding — applied during prefill and extended during generation.

The Metal path is validated against official logits at various context sizes through the ds4-eval harness.

4.2 CUDA (Linux)

The CUDA backend (ds4_cuda.cu) mirrors the Metal graph structure with NVIDIA-specific optimizations. Special attention is given to the DGX Spark / GB10 form factor, where memory bandwidth and power constraints differ from desktop GPUs. The CUDA path supports the same operations as the Metal path, including chunked prefill, compressed KV cache management, speculative decoding, and directional steering.

4.3 CPU (Diagnostics)

A CPU-only path exists for correctness checks and model/tokenizer diagnostics. It uses hand-tuned CPU quant/dot-product kernels adapted from GGML under MIT license. On macOS, this path is broken by a kernel VM bug that crashes the system; Linux users can build with make cpu.

5. KV Cache Persistence

5.1 Motivation

DeepSeek V4 Flash uses compressed sparse attention, where each layer maintains a raw sliding-window cache (latest 128 tokens) and a set of compressed history rows. The compression ratios are layer-dependent:

Layer Index (0-based)	Compression Ratio	Extra State
0, 1	None	Raw 128-token sliding window only
Even layers, 2+	4	Compressed KV + indexer KV
Odd layers, 3+	128	Compressed KV

Because the compressed rows are compact (one row per 4 tokens on ratio-4 layers and one row per 128 tokens on ratio-128 layers), the KV state for a session of 32k–128k tokens is much smaller than a conventional per-token cache, small enough to serialize to disk in full. This makes on-disk persistence practical.
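
The layer table translates into a small amount of arithmetic. The sketch below maps a 0-based layer index to its compression ratio and counts compressed rows for a given context length. It assumes one completed row per full group of tokens (integer division); the function names are hypothetical, and row widths in bytes are not covered because the report does not state them.

```c
#include <assert.h>
#include <stddef.h>

enum { LAYER_COUNT = 43 };

/* Compression ratio for a 0-based layer index; 0 means no compressed rows. */
int layer_compression_ratio(int layer) {
    assert(layer >= 0 && layer < LAYER_COUNT);
    if (layer < 2) return 0;                 /* raw sliding window only */
    return (layer % 2 == 0) ? 4 : 128;       /* even: ratio 4, odd: 128 */
}

/* Completed compressed rows in one layer after n_tokens (floor assumed). */
size_t layer_compressed_rows(int layer, size_t n_tokens) {
    int ratio = layer_compression_ratio(layer);
    return ratio == 0 ? 0 : n_tokens / (size_t)ratio;
}

/* Compressed rows summed over all layers. */
size_t total_compressed_rows(size_t n_tokens) {
    size_t total = 0;
    for (int layer = 0; layer < LAYER_COUNT; layer++) {
        total += layer_compressed_rows(layer, n_tokens);
    }
    return total;
}
```

With 43 layers there are 21 ratio-4 layers and 20 ratio-128 layers, so a 32768-token context gives 21 × 8192 + 20 × 256 = 177152 compressed rows under these assumptions, and the ratio-4 layers account for almost all of them.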

5.2 KV Store Subsystem

The ds4_kvstore subsystem (ds4_kvstore.c/.h) manages a directory of checkpoint files indexed by rendered text hash:

Naming. Each file is named by the SHA1 hash of the rendered byte prefix (not the token sequence). The fixed 48-byte header stores: SHA1 hash, quant bits, reason code (cold/continued/evict/shutdown/agent), token count, hit counter, context size, extension flags, creation time, last-used time, and payload byte count.
Payload. The body contains the serialized GPU graph state — compressed KV rows, indexer states, and any auxiliary buffers.
Trailer. An optional trailer can carry tool-call maps, visible thinking text, or visible responses, depending on extension flags.
Eviction. An LRU-like policy with hit-count decay (ds4_kvstore_entry_eviction_score) selects files for eviction when the store exceeds its budget. The decay half-life defaults to 6 hours, so older checkpoints with low hit counts are evicted first. A protected SHA (the currently-live session) is excluded from eviction.
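
One way to read the hit-count decay is as a score that halves with every half-life of idle time, so that the lowest-scoring file is evicted first. The sketch below is an assumption-laden illustration, not the formula inside ds4_kvstore_entry_eviction_score: the `hits + 1` weighting and the piecewise-linear decay (used here to avoid a libm dependency) are both invented for the example.

```c
#include <assert.h>

/* Approximates 2^(-age/half_life): exact at whole half-lives, linear
 * in between. Hypothetical helper, not taken from ds4_kvstore.c. */
double decay_factor(double age_s, double half_life_s) {
    assert(age_s >= 0.0 && half_life_s > 0.0);
    double f = 1.0;
    int steps = 0;
    while (age_s >= half_life_s && steps < 64) {
        f *= 0.5;
        age_s -= half_life_s;
        steps++;
    }
    if (age_s >= half_life_s) return 0.0;   /* older than 64 half-lives */
    return f * (1.0 - 0.5 * (age_s / half_life_s));
}

/* Higher score means more worth keeping; evict the lowest score first.
 * hits + 1 keeps never-hit entries ordered by age instead of all zero. */
double eviction_score(unsigned hits, double age_s, double half_life_s) {
    return ((double)hits + 1.0) * decay_factor(age_s, half_life_s);
}
```

With the default 6-hour half-life (21600 seconds), an entry with three hits that has been idle for six hours scores the same as a fresh entry with one hit in this model.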

Figure 3 — KV Cache Checkpoint Lifecycle. A session starts with a cold store to disk. Subsequent requests load the checkpoint and resume without prefill. Extended sessions are re-saved as continued checkpoints. Stale or over-budget entries are evicted using a hit-count decay score. The live session is always protected from eviction.

5.3 Checkpoint Lifecycle

Cold store — a session's first save, written with reason DS4_KVSTORE_REASON_COLD.
Continued store — after loading a checkpoint and extending the session, the updated state is saved with reason DS4_KVSTORE_REASON_CONTINUED. The system can suppress continued stores when the extension is small (below a configurable threshold).
Load-and-resume — on a subsequent request, the store finds a matching SHA prefix, loads the payload, and resumes inference from exactly where the previous session left off — no prefill needed.
Eviction — when the store exceeds its budget, entries with low eviction scores are removed.

6. Thinking Modes

6.1 Overview

DS4 implements three thinking modes that control whether and how the model engages in chain-of-thought reasoning before producing its final answer:

Mode	Enum	Behavior
None	`DS4_THINK_NONE`	No thinking prefix. The model generates answers directly.
High	`DS4_THINK_HIGH`	Appends a "max effort" prefix that encourages reasoning before answering. The thinking section length scales with problem complexity and is typically 1/5 that of other reasoning models.
Max	`DS4_THINK_MAX`	Forces maximal thinking via a special prefix. Requires a minimum context size (`ds4_think_max_min_context()`) to leave room for the answer.

6.2 Automatic Mode Selection

The function ds4_think_mode_for_context(mode, ctx_size) downgrades the mode if the remaining context is insufficient for meaningful thinking output. When context is very tight, it falls back to DS4_THINK_NONE automatically.
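
The downgrade logic can be sketched as a pair of threshold checks. Only the three DS4_THINK_* names come from the table above; the enum type name, the function name `sketch_think_mode_for_context`, and both threshold values are placeholders, since the report does not state the real limits or whether Max steps down to High before None.

```c
#include <assert.h>

typedef enum { DS4_THINK_NONE, DS4_THINK_HIGH, DS4_THINK_MAX } think_mode;

/* Placeholder thresholds, chosen only for illustration. */
enum {
    SKETCH_MAX_MIN_CTX  = 32768,  /* assumed minimum context for Max  */
    SKETCH_HIGH_MIN_CTX = 4096    /* assumed minimum context for High */
};

/* Step the requested mode down until the context can accommodate it. */
think_mode sketch_think_mode_for_context(think_mode mode, int ctx_size) {
    assert(ctx_size > 0);
    if (mode == DS4_THINK_MAX && ctx_size < SKETCH_MAX_MIN_CTX) {
        mode = DS4_THINK_HIGH;
    }
    if (mode == DS4_THINK_HIGH && ctx_size < SKETCH_HIGH_MIN_CTX) {
        mode = DS4_THINK_NONE;
    }
    return mode;
}
```

A request for Max with an 8192-token context would be served as High in this sketch, and any thinking mode with a 1024-token context would fall back to None.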

7. Directional Steering

7.1 Principle

Directional steering is a runtime activation edit. A steering file is a flat f32 matrix (43 layers × 4096 hidden dimensions) containing one normalized direction vector per layer. During inference, the engine applies the edit after FFN outputs, attention outputs, or both:

y = y − α · d_ℓ · ⟨d_ℓ, y⟩

where α is a scale factor, d_ℓ is the direction vector for layer ℓ, and ⟨·,·⟩ is the dot product. Positive α removes the represented direction; negative α amplifies it. With no steering file or zero scales, the inference follows the normal path.
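
The edit is a scaled projection removal and fits in a few lines. The following CPU sketch applies it in place to one activation vector and shows how a per-layer direction could be located in the flat matrix. It assumes a row-major layout (one 4096-float row per layer); `layer_direction` and `apply_direction` are hypothetical names, and the real engine applies the edit inside its GPU graph.

```c
#include <assert.h>
#include <stddef.h>

enum { STEER_LAYERS = 43, STEER_DIM = 4096 };

/* Direction vector for one layer, assuming a row-major 43 x 4096 matrix. */
const float *layer_direction(const float *matrix, int layer) {
    assert(matrix != NULL && layer >= 0 && layer < STEER_LAYERS);
    return matrix + (size_t)layer * STEER_DIM;
}

/* In-place edit: y = y - alpha * d * <d, y>.
 * d is expected to be unit length; alpha == 0 leaves y untouched. */
void apply_direction(float *y, const float *d, size_t n, float alpha) {
    assert(y != NULL && d != NULL);
    if (alpha == 0.0f) return;
    float dot = 0.0f;
    for (size_t i = 0; i < n; i++) dot += d[i] * y[i];
    float k = alpha * dot;
    for (size_t i = 0; i < n; i++) y[i] -= k * d[i];
}
```

For example, with d = (1, 0) and y = (3, 4), α = 1 yields (0, 4), removing the component along d, while α = −1 yields (6, 4), doubling it.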

7.2 Usage

Steering is configured via engine options: --dir-steering-file FILE loads the direction file; --dir-steering-ffn F and --dir-steering-attn F set the scale factors for FFN and attention outputs respectively. FFN output is the recommended target because it is late enough in each layer to capture behavioral and stylistic signals. The dir-steering/ directory contains an example that builds a style direction from 100 paired prompts.

8. Server Subsystem

8.1 Architecture

The HTTP server (ds4-server) follows a single-worker design:

Client threads — each connection is handled by a small blocking thread that parses one HTTP request and queues a job to the single GPU worker.
GPU worker — owns the ds4_session and all live KV cache state. By centralizing graph mutations in one thread, the design avoids race conditions on KV state and simplifies checkpointing.
Session management — sessions are identified by rendered text hash and persisted to disk. The server supports creating, listing, switching, and deleting sessions via REST endpoints.

Figure 4 — Server Architecture. Each client connection is handled by a small blocking thread that parses the request and queues a job. A single GPU worker owns the ds4_session and all live KV cache state, avoiding race conditions. Sessions are persisted to disk via the KV Store, which manages checkpoint files indexed by rendered text hash.

8.2 API Compatibility

The server exposes OpenAI-compatible chat completion endpoints (/v1/chat/completions) with streaming and non-streaming modes. It also supports Anthropic-style tool-calling via the tools and tool_choice parameters. Tool schemas are prepended to the rendered system prompt as part of the chat rendering pipeline.

8.3 KV Cache Integration

Each session checkpoint is managed by the ds4_kvstore subsystem. When a request arrives for an existing session, the server loads the matching checkpoint, the GPU worker syncs the session to the request's prompt prefix, and generation resumes from the cached state. If no checkpoint matches, the session is filled from scratch.

9. Agent Subsystem

9.1 Design

The native agent (ds4-agent) is a single-process, two-thread application:

UI thread — handles terminal input/output, linenoise line editing, and streaming output rendering.
Worker thread — owns the live DS4 session and KV state. It executes generation, tool calls, and session management.

There are no socket boundaries, IPC channels, or serialization layers between the UI and the inference engine. The session state is the live KV cache owned by the worker thread, backed by checkpoints on disk; the agent reads and writes it directly.

Figure 5 — Agent Architecture. The UI thread handles terminal I/O while the worker thread owns inference and tool execution. They communicate through shared session state, not sockets. This eliminates serialization overhead and makes KV cache mismatches impossible.

9.2 Advantages

Latency. No serialization or network overhead sits between the engine and the UI, so displaying generated text, starting tool calls, and switching sessions add no transport delay.
Progress. Live progress bars during prefill, driven by ds4_session_progress_fn callbacks.
Native tool-calling. The agent handles DSML tool calls natively — no conversion layer, no schema translation.
KV cache integrity. Mismatches are impossible by construction: the agent's state is always the truth.
Session switching. The agent can list and switch sessions (/list, /switch) with zero prefill cost, because each session is a precomputed KV cache checkpoint on disk.

9.3 Current Status and Future Direction

The agent is currently alpha quality. When it reaches a stable shape, the plan is to split the server and client into a stateful session-based protocol that can recreate the agent experience in a client-server configuration.

10. Performance

10.1 Benchmark Methodology

ds4-bench measures instantaneous prefill and generation throughput at context frontiers rather than reporting a single whole-run average. The benchmark loads the model once, walks a fixed token sequence to frontiers (2048, 4096, 6144, ...), and uses incremental prefill so each row measures only the newly-added token interval. After each frontier it saves the live KV state to memory, generates a fixed greedy non-EOS probe, restores the memory snapshot, and continues prefill.
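
The difference between a per-interval rate and a whole-run average is easy to show with arithmetic. The sketch below uses hypothetical function names and made-up timings; it is not taken from ds4_bench.c.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput of one interval: tokens added since the previous frontier,
 * divided by the time spent on that interval alone. */
double interval_tps(size_t prev_frontier, size_t frontier, double seconds) {
    assert(frontier > prev_frontier && seconds > 0.0);
    return (double)(frontier - prev_frontier) / seconds;
}

/* Whole-run average up to a frontier, shown only for comparison. */
double whole_run_tps(size_t frontier, double total_seconds) {
    assert(frontier > 0 && total_seconds > 0.0);
    return (double)frontier / total_seconds;
}
```

Suppose the first 2048 tokens take 4 s and the next 2048 take 8 s. The interval rates are 512 t/s and 256 t/s, while the whole-run average at the 4096 frontier is about 341 t/s, which hides the slowdown at the deeper frontier.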

10.2 Measured Throughput

The following table reports single-run Metal CLI numbers with --ctx 32768, --nothink, greedy decoding, and -n 256. Short prompts use a small Italian story prompt; long prompts exercise chunked prefill plus long-context decode.

Machine	Quant	Prompt	Prefill (t/s)	Generation (t/s)
MacBook Pro M3 Max, 128 GB	q2	short	58.52	26.68
MacBook Pro M3 Max, 128 GB	q2	11709 tokens	250.11	21.47
Mac Studio M3 Ultra, 512 GB	q2	short	84.43	36.86
Mac Studio M3 Ultra, 512 GB	q2	11709 tokens	468.03	27.39
Mac Studio M3 Ultra, 512 GB	q4	short	78.95	35.50
Mac Studio M3 Ultra, 512 GB	q4	12018 tokens	448.82	26.62
DGX Spark GB10, 128 GB	q2	7047 tokens	343.81	13.75

Prefill speeds scale with available memory bandwidth and GPU compute, reaching 468 t/s on the M3 Ultra with long prompts. Generation speeds vary less across configurations (13.75–36.86 t/s), reflecting the largely memory-bandwidth-bound nature of token-by-token decoding.

Figure 6 — Prefill and Generation Throughput. Throughput for representative configurations. Prefill (blue) benefits from memory bandwidth and parallel compute; generation (green) varies less across machines. In the table above, the highest prefill speed is 468 t/s (Mac Studio M3 Ultra, q2, 11709-token prompt); the DGX Spark reaches 344 t/s on its 7047-token prompt.

11. Quality Assurance and Testing

11.1 Correctness Regression

The test runner (ds4_test) validates the engine against official DeepSeek V4 Flash continuation vectors at various context sizes. Available test suites:

--server — request parsing, chat rendering, streaming, tool-call parsing.
--logprob-vectors — logprob accuracy against official continuations.
--long-context — correctness at extended context windows.
--tool-call-quality — tool-calling fidelity.
--metal-kernels — individual Metal kernel correctness.

11.2 Speed Regression

All changes affecting inference backends must be checked for speed regressions. The only acceptable speed penalty is when an important correctness bug is fixed. Benchmarks are collected with ds4-bench and recorded in the speed-bench/ directory as CSV files and graphs.

11.3 Quality Testing

The gguf-tools/quality-testing/ directory provides a framework for scoring local GGUFs against official DeepSeek V4 Flash continuations. This ensures that quantization and imatrix tuning do not introduce systematic quality degradation.

12. Acknowledgements

DS4 does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are indebted to Georgi Gerganov and all llama.cpp contributors. Some source-level pieces are retained or adapted under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain kernels. The GGML authors' copyright notice is preserved in the LICENSE file.

The project is developed with strong assistance from GPT 5.5, with humans leading the ideas, testing, and debugging. This is stated openly because it shaped how the project was built.

13. Status and Future Work

13.1 Current Status

The code and GGUF files are of beta quality. Inference and model serving are complicated matters that take months to stabilize. The agent is alpha quality. The project is maintained in a usable state and is under active development.

13.2 Known Limitations

macOS CPU inference crashes the kernel due to an apparent VM implementation bug. The system must be restarted after such a crash.
Speculative decoding (MTP) is experimental and provides at most a slight speedup.
The agent TUI, while functional, is alpha quality and lacks the polish of a production system.

13.3 Planned Work

Split the server and agent into a stateful session-based protocol, enabling client-server deployment of the agent experience.
Track future DeepSeek V4 Flash updates and adapt the engine as needed.
Stabilize the KV cache eviction policy and hit-count decay heuristics.
Improve speculative decoding throughput.

14. Recent Developments

14.1 SSE Keepalive for Long Prefill

The server now keeps the SSE connection alive during extended prefill phases (f027269). This prevents client timeouts when the model processes long prompts before generation begins — a critical fix for OpenAI-compatible streaming clients that expect periodic keepalive signals. Additionally, prefill errors that occur after keepalive are handled gracefully (8d57664), ensuring robust error recovery in streaming mode.
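
In the server-sent events format, a line that begins with a colon is a comment and is ignored by clients, which makes it a common choice for keepalive traffic. The sketch below formats such a frame and decides when one is due. Whether ds4-server uses a comment frame, and at what interval, is not stated in the report; the function names and the interval parameter are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Writes an SSE comment frame into buf (NUL-terminated).
 * Returns the frame length in bytes, or 0 if buf is too small. */
size_t sse_format_keepalive(char *buf, size_t cap) {
    static const char frame[] = ": keepalive\n\n";
    size_t n = sizeof frame - 1;          /* length without the NUL */
    assert(buf != NULL);
    if (cap < n + 1) return 0;
    memcpy(buf, frame, n + 1);
    return n;
}

/* Returns 1 when nothing has been written for at least interval_s. */
int sse_keepalive_due(double now_s, double last_write_s, double interval_s) {
    assert(interval_s > 0.0);
    return (now_s - last_write_s) >= interval_s;
}
```

A prefill loop could call sse_keepalive_due() from its progress callback and write the frame to the client socket whenever it returns 1, so the connection stays active until the first generated token is streamed.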

14.2 Mixed GGUF Splicing Tool

A new ds4-splice utility (93d9d96) enables combining GGUF files with different quantization mixes. This allows operators to create custom split-quantization models — for example, pairing a q2-imatrix expert layer with a q4-KM shared layer — without rebuilding the full GGUF from scratch. The tool validates tensor layouts across source files to ensure architectural compatibility.

14.3 Agent TUI Maturation

The native ds4-agent TUI has received substantial refinement across rendering, session management, and interaction design. Key improvements include:

Glyph-based prefill progress (9ff77a1) — live progress bars use Unicode glyphs instead of raw text, improving readability at small terminal widths.
Robust terminal colors (8ba0c45) — color handling uses ANSI escape sequences that survive terminal reflow and dark-mode environments.
Prompt preservation (799dff4) — the user's prompt remains visible during streamed output, enabling continuous reference while the agent generates.
Status bar refresh (2606543) — the agent status bar updates in-place without redrawing the entire screen, reducing visual flicker.
Adaptive terminal layout (1aedee4) — the TUI dynamically adjusts to terminal dimensions, handling resizes and narrow windows.
History replay (23cf510) — improved linenoise history folding prevents aggressive truncation of long command sequences.
Session management while busy (1dc8bdb) — agents can save and list sessions during active generation without blocking.
Post-edit context (8daa088) — after applying file edits, the agent shows the changed context to the model, reducing follow-up correction rounds.
Queued prompt handling (1e3c11f) — append-only queuing prevents prompt reordering under concurrent input.

14.4 Evaluation Harness (ds4-eval)

A comprehensive evaluation framework (de5ec6d through 48c4d4d) now ships with DS4. ds4-eval supports:

Automated benchmark runs — prompt sets are audited and localized for COMPSEC, reasoning, and general knowledge tasks.
Auto-sizing context (4441e56) — the harness dynamically adjusts context length to match prompt requirements, avoiding wasted prefill.
TUI reporting (336fbd6) — results are displayed in a clean terminal interface with pass/fail summaries and per-item breakdowns.
Perplexity scoring (d630ca4) — the CLI can compute perplexity over arbitrary text, providing a task-independent quality metric.

14.5 KV Cache Hit Decay

The eviction policy now supports hit-count decay (b62292c), enabled by default (d0357ec). Cache entries that have been hit frequently receive a decaying score multiplier, preventing popular entries from permanently crowding out newer context. This is especially beneficial for long-running conversations where earlier turns remain relevant but shouldn't dominate the cache budget.

14.6 Server Enhancements

Several server-side improvements have been merged:

Working directory option (ef0a490) — ds4-server accepts --working-directory to resolve relative session paths.
Opt-in CORS support (312935e) — the server exposes configurable CORS headers for web-based clients.
Tool schema prepend (7b68234) — tool definitions are automatically prepended to the rendered system prompt, improving compliance on OpenAI-compatible tool calls.
Cache usage reporting (38800bf, 0ca2e28) — the server reports Anthropic and KV cache usage statistics via OpenAI-compatible API fields.
Default min-p sampling (613e9b2) — the server now defaults to min-p filtering for generation, reducing degenerate output without configuration.
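
Min-p filtering keeps only tokens whose probability is at least a fraction min_p of the most likely token's probability, then renormalizes. The sketch below shows that rule on a plain probability array; it is a generic illustration with a hypothetical function name, and the server's actual default value and implementation are not described in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Zero out probabilities below min_p * max(probs), renormalize the rest.
 * Returns the number of tokens kept. probs must contain a positive entry. */
size_t min_p_filter(float *probs, size_t n, float min_p) {
    assert(probs != NULL && n > 0);
    assert(min_p >= 0.0f && min_p <= 1.0f);

    float pmax = probs[0];
    for (size_t i = 1; i < n; i++) {
        if (probs[i] > pmax) pmax = probs[i];
    }
    assert(pmax > 0.0f);

    float threshold = min_p * pmax;
    float sum = 0.0f;
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (probs[i] >= threshold) {
            sum += probs[i];
            kept++;
        } else {
            probs[i] = 0.0f;              /* token removed from sampling */
        }
    }
    for (size_t i = 0; i < n; i++) probs[i] /= sum;
    return kept;
}
```

With probabilities (0.5, 0.25, 0.125, 0.125) and min_p = 0.3, the threshold is 0.15, so two tokens survive and are renormalized to roughly (0.667, 0.333). Because the threshold scales with the top probability, the filter is strict when the model is confident and permissive when the distribution is flat.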

14.7 CUDA and Metal Fixes

Flash graph correctness (5bc1e6d) — fixes to the CUDA Flash graph builder ensure correct attention computation under compressed prefill.
Compressed prefill RoPE (c9dd949) — RoPE position computation corrected for compressed sparse attention on CUDA.
Managed KV cache for huge contexts (04b6fda) — CUDA uses managed memory for KV cache allocations exceeding GPU VRAM, enabling 1M-token context on 48 GB cards.
Mismatched API vector skip (4efd501) — the engine gracefully skips long API vectors that don't match the active model's architecture, preventing load failures.

14.8 Agent TUI Maturation (continued)

The ds4-agent continues to receive substantial refinement. Key additions since the last report include:

Browser-backed web tools (167dda5) — the agent can now open and interact with web pages via a browser-based tool, extracting rendered content and handling page navigation directly from agent prompts.
Web tool status messages (7ac436a) — live status indicators show when the agent is fetching, scrolling, or extracting pages, improving observability.
Cooperative interruption (42e5915) — the agent gracefully handles interruptions during generation and tool execution without corrupting session state.
System status styling (c463029) — status messages are styled with consistent colors and glyphs for better readability across terminal environments.
Greedy sampling display (02852fa) — the status bar shows when greedy sampling is active, making decoding mode visible at a glance.
Prefill speed display (56b25eb) — live tokens-per-second rate is shown during prefill, not just generation, giving early feedback on prompt processing.
Wrapped tool results in DSML (d6f2331) — tool call results are wrapped in structured DSML, improving parse reliability and model adherence.
Harden DSML parsing (a56519d) — the DSML parser is more robust against malformed tool call formats, reducing retry loops.
Edit tooling improvements (d447bdb, 02bae89) — anchored edit matching, reminder logging, and post-edit context display reduce correction rounds.
Session management refinements (bfe070a, 4a855d8) — session IDs are stabilized, listing and switching are faster, and power commands can be applied while the agent is busy.
Non-interactive mode (75f7858) — ds4-agent supports non-interactive (one-shot) mode for scripted or piped use cases.
Queued prompt delivery after tools (53e143b) — prompts queued during tool execution are delivered reliably after the tool result is processed.

14.9 Distributed Inference

DS4 now supports distributed inference across multiple machines (abdc807), enabling larger contexts and higher throughput by splitting the model across GPU nodes. Key components:

Distributed KV checkpoint support (4844d55) — KV cache checkpoints are shared across distributed workers, allowing session migration between nodes.
Distributed snapshot request IDs (5cd0739) — snapshot coordination uses unique request IDs to prevent duplicate or stale checkpoint loads.
CLI route wait (46419d6) — the distributed CLI waits for route establishment before executing one-shot commands, preventing race conditions on startup.
CUDA distributed PRO Q4 inference (3c3ad4c) — the CUDA backend supports distributed inference with PRO Q4 quantization, extending distributed inference to high-quality quantizations.

14.10 SSD Streaming

The engine now implements SSD streaming (9ba160a) — a mechanism that streams model weights from SSD into GPU memory on-demand during inference, rather than loading the full model into RAM upfront. This enables running DeepSeek V4 Flash on machines with limited system memory (e.g., 32–64 GB) while maintaining interactive generation speeds. The streaming layer is integrated with both Metal and CUDA backends, with expert cache hooks stubbed for future optimization (c47b15f).

14.11 Metal Neural Acceleration (NAX)

The Metal backend has been enhanced with Neural Acceleration (63ceed6) — a set of optimized compute kernels that leverage Apple's ANE and GPU co-processor architecture. NAX provides speedups for attention and MoE dispatch on M3/M4-series hardware. Subsequent cleanup and tuning (18c2d4b) improved NAX throughput and removed legacy Metal4 shader artifacts. Routed MoE TensorOps were disabled (d4fba7b) after stability analysis, with attention output TensorOps retained under guard (57ae485).

14.12 PRO Model Support

DS4 now supports the DeepSeek V4 PRO model variant (04f151d), a higher-quality quantization target with different tensor layouts. Key additions:

PRO official continuation collection (68cd2c0) — official logit continuations for PRO are included in the test suite for correctness validation.
PRO expert mapping fix (ad0209f) — routed MoE expert mapping was corrected for PRO's altered layer structure.
Hugging Face CLI downloads (477c0e8) — model downloads now use the Hugging Face CLI for reliable PRO and Flash model acquisition, replacing legacy download targets (690b659).
Documentation (5b95fa1, 297f750) — README.md and AGENT.md were updated to reflect the expanded model support.

14.13 CUDA Enhancements

The CUDA backend received several targeted optimizations:

DGX Spark / GB10 HBM-resident model (15f42aa) — the model can be loaded directly into GPU HBM on Spark systems, bypassing host memory for reduced latency.
Model map span API (caa60f2) — a span-based API for selective tensor loading enables partial model caching across distributed workers.
Layer-slice model caching (e16ead1) — individual layer slices can be cached in GPU memory, reducing PCIe transfers during distributed inference.
Batched Q4_K routed MoE (2791d27) — multiple tokens share expert-weight loads, reducing per-token overhead for the Q4_K quantization path.
Q4 routed MoE speedup (dc51d64) — the Q4 routed MoE kernel was optimized for higher compute occupancy.
Spark startup tensor cache (1704eca) — tensor cache restored on startup prevents redundant HBM re-initialization.
Top-k regression warm-up (bce69b0) — top-k sampling timing is warmed during model load to prevent first-token latency spikes.
Flash path compatibility (f7511c2) — the Flash graph builder was kept building after the PRO Q4 API changes.

14.14 Server and KV Cache Improvements

Disk KV cache compatibility hardening (e8e8779) — checkpoint files are validated more strictly on load, preventing silent corruption from version-skewed payloads.
Pre-store eviction improvement (ec6a82a) — the KV store evicts stale entries before writing new checkpoints, maintaining budget bounds proactively.
Failed prefill eviction (9450594) — entries that fail prefill are evicted from the disk cache rather than retained as dead weight.

⌀ Colophon

This document was created with ds4-agent — the native coding agent built into DwarfStar 4 — running against DeepSeek V4 Flash (q2-imatrix) on a MacBook Pro M5 Max, 128GB RAM.

The generation session used the following prompt sequence:

generate ds4.html explaining the what, why, how, when of ds4 as seen from this repo
update ds4.html to read and look like a technical report
add suitable diagrams to ds4.html
update ~/pradeep-stellar.github.io/index.html with an entry about ~/pradeep-stellar.github.io/ds4.html
add published date to ds4.html
add colophon section — "created with ds4-agent". add the summary of the prompt using this session below that.

All prompts were processed by ds4-agent in a single interactive session, demonstrating zero-latency session switching, live prefill progress, and native tool-calling — each edit applied directly to the source file without serialization or API boundaries.

— Generated on May 20, 2025 · Updated June 7, 2026