How is ds4 different from llama.cpp or Ollama?

ds4 is not a general GGUF loader. It targets DeepSeek V4 Flash only, with Metal/CUDA graph execution, on-disk KV and Agent API. Use llama.cpp or Ollama to swap models; choose ds4 when you need V4-class local capability wired into Cursor or opencode.

Can ds4 run on a Mac with only 64GB unified memory?

The official README treats 96GB as the production starting point. A 64GB machine struggles to load full q2 Flash weights plus long-context KV. Rent a 128GB cloud Mac to validate, or keep 16–24GB nodes for CI workloads per the pricing page.

When running ds4 on a cloud Mac, does conversation data pass through a third-party LLM API?

Inference runs on your dedicated instance via local ds4-server listening; it does not force traffic to Claude or GPT APIs. You must still comply with model licenses and egress policy. See the help center and order page for network and backup details.

Run DeepSeek V4 Locally in 2026? antirez's ds4 and Mac Cloud Rental: Top-Tier Inference Above the 96GB Wall

If you want DeepSeek V4–class models on your own hardware and have been watching antirez's newly open-sourced ds4 (DwarfStar) take over your timeline, May 2026 is not about whether local inference is possible. It is about the hardware bill that starts at 96GB unified memory, climbs to 256GB for Flash q4, and reaches 512GB for PRO. This article explains why ds4 crossed 11,000 GitHub stars in a week, how Metal plus on-disk KV changes the trade-offs, what each memory tier costs in real money, and why Apple Silicon remains the best consumer-grade platform for this workload. It closes with a six-step path to run ds4-server on KVMNODE 128GB / 512GB cloud Macs and wire Cursor or opencode to a private endpoint. Cross-read with storage and memory sizing, OpenClaw persistence, and six-region selection.

What ds4 is: Redis's author betting on single-model excellence for DeepSeek V4

Salvatore Sanfilippo (antirez), author of Redis, open-sourced ds4 (DwarfStar 4) in 2026: a local inference engine built only for DeepSeek V4 Flash and PRO, implemented in plain C, not wrapped around llama.cpp and not aimed at the general GGUF market. The README states the goal plainly: make local inference on a top-tier personal machine or Mac Studio trustworthy enough to replace everyday Claude or GPT calls, with official vector checks, long-context tests, and coding-agent integration.

Within days the project passed 11,000+ GitHub stars. Hacker News and community reviews converged on one point: a 284B-class MoE running offline on a MacBook with tool calling and six-figure context windows. That is a different conversation from the 2025 era of 7B toy models. ds4 moved the debate from "can it run" to "would you ship production code on it." For KVMNODE customers the hype itself signals a steepening demand curve for high-memory Macs; what blocks most teams is the memory wall in the next section.

Narrow and deep: only DeepSeek V4, which buys integrated Metal graphs, KV formats, and tool calling tuned together.

Self-contained: loading, prompt rendering, disk KV, ds4-server, and a built-in coding agent live in one repository.

Community validation: public benchmarks and third-party 18-task suites show some workflows already need fewer cloud tabs for comparison.

Not multi-tenant: serial request handling today, no batch serving; aimed at solo or small-team agent workflows.

KVMNODE intersection: cloud Mac SKUs deliver the unified memory tiers ds4 expects without buying a Mac Studio Ultra upfront.

Compliance note: weights must be downloaded under DeepSeek and project licenses; this article covers engine and hardware paths only.

Technical highlights: Metal first, million-token context, and on-disk KV

The ds4 story compresses to max out Apple Silicon, push long session state to SSD. Reported capabilities from the project and early reviewers include the following.

On a MacBook Pro M5 Max, ds4 has been quoted at roughly 463 token/s prefill and about 34 token/s generation (quantization and context length move the numbers). That places the engine in the first tier among consumer hardware. It advertises a context window up to about one million tokens, paired with DeepSeek V4's compressed KV design, so "entire repo plus long chat" becomes a planning scenario instead of a demo slide.

On-disk KV persistence is a differentiator: session KV can be written to fast Mac SSDs so restarts or task switches avoid a full prefill replay. That matters for laptops that sleep daily and for agents that must resume yesterday's thread. 2-bit asymmetric quantization compresses routing experts aggressively while keeping other layers precise, which is how Flash becomes viable on 128GB machines. ds4-server exposes OpenAI- and Anthropic-compatible endpoints, so Cursor, opencode, and Claude Code can treat the instance as a private model vendor.

shell

git clone https://github.com/antirez/ds4
cd ds4 && make
./ds4-server --ctx 100000 --host 127.0.0.1 --port 8080

The README also warns that on macOS the CPU inference path can trigger kernel virtual-memory bugs; production should use Metal (or CUDA on Linux). That belongs on your cloud Mac runbook alongside the health probes in the diagnostic ladder.

Hardware threshold table: Flash q2 from 96GB to PRO at 512GB

No matter how elegant the engine, unified memory capacity sets the ceiling. The table below merges official README guidance, community measurements, and typical retail pricing (USD approximations; configs and FX move the numbers). Use it for budget or rental decisions: separate "can run" from "runs comfortably."

Model / quant	Min unified memory	Typical hardware	Reference price (approx.)	Cloud rental angle
V4 Flash q2	96 GB	MacBook Pro M3/M4/M5 Max	$4,200+	128GB cloud Mac weekly or monthly trial
V4 Flash q4	256 GB	Mac Studio Ultra	$8,400+	Short Ultra-tier spike or staged quant experiments
V4 PRO q2	512 GB	Mac Studio M3 Ultra max config	$15,400+	Project-based 512GB instance, stop when done
CI only / 16–24GB	16–24 GB	M4 / M4 Pro cloud nodes	Not for ds4 production	Keep for Xcode / OpenClaw; separate ds4 pool

Software has proved local V4 is feasible; what blocks you is the unit price of unified memory, not whether the C code is elegant.

For teams the pragmatic split is validate ds4 separately from daily iOS CI: 16GB·256 or 24GB·512 for builds and OpenClaw, 128GB+ dedicated to ds4-server, so DerivedData and million-token KV never fight on one SKU. Sizing detail lives in storage and memory pairing.

Why ds4 puts Metal and Mac first: unified memory plus SSD as system coupling

Listing Metal as the primary macOS backend is not marketing. Apple Silicon's unified memory architecture (UMA) lets CPU, GPU, and Neural Engine share one physical pool, avoiding the PC split where "24GB VRAM plus 64GB RAM" caps what you can load. For large-model inference, one addressable space directly bounds quantized weights and KV size. M3/M4/M5 memory bandwidth pushes prefill throughput to the consumer ceiling.

macOS NVMe pairs with ds4's disk KV as a second coupling: long sessions need not sit entirely in RAM; cold starts can reload context blocks from SSD. A Linux plus CUDA path (including DGX Spark tuning) exists in the repository, but for developers who already own Macs and want offline coding, high-memory Mac is the best consumer platform for ds4 today, consistent with antirez's Hacker News comments.

Running ds4 in macOS VMs on non-Apple hardware or hackintosh setups sacrifices Metal stability and licensing. Cloud deployments should use real bare-metal Apple Silicon nodes, which is why KVMNODE delivers dedicated Mac Mini hardware rather than "Mac-like" virtual desktops.

Six steps to bring up ds4-server on a KVMNODE cloud Mac and hook Cursor / opencode

The steps below assume you ordered a cloud Mac with 128GB or more unified memory. Pick region by Git remote and weight download source per six-region guidance. Model files are large; align with object storage or a Hugging Face mirror in the same region to avoid trans-Pacific tail latency.

Choose tier and order: on the order page pick a package at 96GB+; spike by the day for trials, monthly baseline for long-running agents (see daily spike article).

First SSH login: confirm Xcode CLT, Homebrew, and git; store models and KV on local SSD paths, never inside iCloud-synced folders.

Build ds4: git clone https://github.com/antirez/ds4 && cd ds4 && make; verify ./ds4 and ./ds4-server; do not use CPU-only paths for production load tests.

Fetch weights: download DeepSeek V4 Flash weights per repo scripts; verify SHA and set a fixed MODEL_PATH.

Start service: ./ds4-server --ctx 100000 --host 0.0.0.0 --port 8080 on private network, or 127.0.0.1 plus SSH -L; use launchd or pm2 for persistence per OpenClaw persistence patterns.

Wire clients: point Cursor / opencode Base URL to http://127.0.0.1:8080/v1 (or tunnel URL); for team sharing expose read-only inference over Tailscale, never tokens on the public internet.

Privacy posture: inference stays on your dedicated instance; chat and code context are not forced through third-party LLM APIs. You remain responsible for model licenses and egress firewall rules. Network and backup notes are in the help center.

Three quotable facts, alternatives, and the Mac cloud rental conclusion

For technical reviews or procurement memos, cite these public figures (they move with upstream README updates): ① GitHub 11k+ stars reflecting May 2026 community momentum; ② community-reported ~463 t/s prefill and ~34 t/s generation on MacBook Pro M5 Max (quantization and context dependent); ③ official production floor at 96GB unified memory, with 128GB as the safer Flash long-context tier.

Lay alternatives side by side. Staying on Claude / GPT APIs only bills per token and sends code plus long context over the internet, which hurts IP-sensitive work. Buying a Mac Studio Ultra locks CapEx in the tens of thousands and freezes upgrade cycles. Forcing ds4 onto generic Linux GPU clouds leaves Metal optimizations unused and changes MoE memory topology requirements. Renting 128GB / 512GB cloud Macs from KVMNODE by the hour or month turns ds4's top-tier local inference into switchable OpEx: real Metal hardware, team sharing, and data that stays on a dedicated instance—ideal to validate before hardware purchase.

Teams running iOS CI, OpenClaw Gateway, and ds4 together should split pools physically or logically; do not merge 16GB build SKUs with 128GB inference SKUs. Tiers and pricing on the pricing page, ordering on the order page, runbooks in the help center.

Back to Blog Rent Now