Teams running OpenClaw on cloud Macs fight post-upgrade drift more often than first-install problems: the schema moved from gateway.token to gateway.auth.token, launchd may still launch an older global binary while your shell uses a newer one, and flipping gateway.bind to lan without matching auth makes the gateway refuse to listen. This runbook covers pre-upgrade snapshots, a disciplined order for openclaw doctor and gateway status --deep, split-brain repair, reproducible ssh -L and Tailscale patterns for port 18789, and the real scope of gateway install --force. Pair it with the install checklist and the 24/7 stability article for a complete operations loop.
01

Five 2026 post-upgrade failures: do not blame the model first

Recent OpenClaw builds tightened local gateway bind and auth guards. When gateway.bind leaves loopback without a valid gateway.auth.token, startup fails fast. When your interactive shell resolves a newer global openclaw but launchd still points at an older prefix, doctor reports clean yet RPC probes fail: the classic split brain. A third pattern keeps the deprecated gateway.token key, which the new schema ignores, so the dashboard loops on unauthorized responses. A fourth places state under enterprise sync or network home directories, so upgrades race writes against the daemon reader. A fifth opens port 18789 on a public interface before application-layer auth is complete, which amplifies scanner noise and tempts on-call engineers to chase phantom model timeouts.

The five checkpoints below force ordering: freeze versions and files before touching temperatures or API keys. If you have not finished first boot onboarding, complete the install checklist before returning here for upgrade-only paths.

01

Key drift: npm upgraded but gateway.auth.token never landed in the plist-visible profile, so dashboard and CLI read different fragments.

02

Dual binaries: which openclaw disagrees with the first ProgramArguments entry; new fields apply only on one side.

03

Incomplete bind plus auth pairs: lan without token makes the gateway exit; logs often show refusal-to-bind signatures.

04

Port table without owners: 18789 still held by a stray debug process so EADDRINUSE survives restarts until cleanup.

05

Raw internet exposure: listening on untrusted interfaces before tunnels or edge ACLs invites pointless retries.

Treat upgrades as an alignment problem across binaries, plist metadata, and config files; only then choose tunnel versus controlled bind strategies.
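Checkpoint 02 can be checked mechanically. The following is a minimal sketch, not an official tool: it assumes the first ProgramArguments entry in the gateway plist is the binary launchd runs, and the plist path shown is hypothetical (substitute the file your install actually created). PlistBuddy ships with macOS; the helper names are illustrative.

```shell
#!/bin/sh
# Compare the binary your shell resolves with the one launchd is told to run.

# Print "aligned" when both paths are identical, otherwise a split-brain line.
compare_binaries() {
  shell_bin=$1
  plist_bin=$2
  if [ -n "$shell_bin" ] && [ "$shell_bin" = "$plist_bin" ]; then
    echo "aligned"
  else
    echo "split-brain: shell=$shell_bin launchd=$plist_bin"
  fi
}

# Read the first ProgramArguments entry from a launchd plist (macOS only).
plist_first_arg() {
  /usr/libexec/PlistBuddy -c 'Print :ProgramArguments:0' "$1"
}

# PLIST is a hypothetical path; point it at the plist your install created.
PLIST="${PLIST:-${HOME:-}/Library/LaunchAgents/example.openclaw.gateway.plist}"
if [ -f "$PLIST" ] && [ -x /usr/libexec/PlistBuddy ]; then
  compare_binaries "$(command -v openclaw)" "$(plist_first_arg "$PLIST")"
fi
```

The comparison is a plain string match, so a symlinked global prefix can show up as a mismatch even when both sides run the same file; resolve symlinks first if your install uses them.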

Another subtle failure is environment inheritance: launchd jobs do not automatically import shell profile exports you added for convenience. If you generated a token in an interactive session and only echoed it into a shell rc file, the daemon will never see it. The durable pattern is to persist secrets where the service already reads configuration, or to use the documented env-file hooks if your packaging supports them, then restart and re-run doctor to confirm the service-side view. Finally, write the expected SHA or semver of the CLI next to the plist label in the change ticket so the next upgrade cannot silently widen the gap between human and supervised execution.
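The inheritance gap is easy to reproduce without touching the gateway. In this small demonstration, env -i stands in for a supervised job: it starts a child with an empty environment, which only approximates what launchd provides. The variable name DEMO_GATEWAY_TOKEN is hypothetical and exists only for the demo.

```shell
#!/bin/sh
# DEMO_GATEWAY_TOKEN is a hypothetical name used only for this demonstration.
export DEMO_GATEWAY_TOKEN="set-in-my-shell"

# A child of the interactive shell inherits the export.
inherited=$(sh -c 'echo "${DEMO_GATEWAY_TOKEN:-<unset>}"')

# env -i starts the child with an empty environment, which approximates
# a supervised job that never sourced your shell rc file.
clean=$(env -i /bin/sh -c 'echo "${DEMO_GATEWAY_TOKEN:-<unset>}"')

echo "interactive child: $inherited"
echo "clean environment: $clean"
```

The second line prints `<unset>`, which is the view the daemon has of anything you only exported in a shell profile.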

When multiple engineers share one cloud Mac for experiments, add a short rotation note to the runbook: who last ran force, which profile owns the gateway, and whether any temporary bind changes were reverted. Shared hosts amplify split brain because each person assumes their PATH is universal. A single authoritative table of binaries, plist labels, and listening addresses prevents Friday-night guesswork.

02

Comparison: loopback-only, LAN plus token, and tunneled dashboard access

On cloud Macs the default posture is loopback plus SSH local forwarding or a Tailnet with tight ACLs. That threat model differs from binding 18789 on a public interface. Put the table on page one of the runbook so on-call stops debating bind changes ad hoc.

| Mode | Best for | Prerequisites | Sharp edges |
| --- | --- | --- | --- |
| loopback | Single engineer SSH plus local browser | Simplest default; token optional | Health checks from the internet still fail without tunnels |
| lan + token | Internal probes on fixed RFC1918 ranges | Requires gateway.auth.token and minimal firewall holes | Token missing from plist environment leaves the service empty-handed |
| SSH -L / Tailscale | Cross-internet ops with zero-trust edges | SSH key rotation and MagicDNS planning | Local port clashes; reconnect scripts after sleep |

| Signal | Healthy meaning | Suspect first when bad |
| --- | --- | --- |
| openclaw doctor | Schema, token, port, supervisor consistency | Legacy keys, PATH, sync locks |
| openclaw gateway status | Runtime plus RPC probe summary | Process up but RPC dead, token drift |
| openclaw gateway status --deep | Duplicate installs and system versus user unit hints | Dual launchd jobs, stale plists |

Align who runs, which config file loads, and which address listens before you tunnel; otherwise you only broadcast the mismatch to more laptops.

Upstream troubleshooting starts with openclaw status, openclaw gateway status, openclaw logs --follow, and openclaw doctor. On cloud images also record whether Node and the global prefix inside the image match plist expectations after golden-image refreshes.

Tailscale and SSH tunnels differ in operational cost. Tailscale gives you DNS names and ACLs that scale to small teams without remembering per-host SSH flags, but it introduces another dependency chain you must patch and audit. SSH port forwarding is boring, which is an advantage when compliance wants a minimal moving part list. Some teams combine both: Tailscale for reachability, loopback bind on the gateway, and explicit deny rules so accidental LAN binds never appear during upgrades. Whatever you pick, document the failure mode when the tunnel is down: agents should degrade gracefully rather than hammering a dead socket.
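One way to make "degrade gracefully" concrete is a bounded probe with exponential backoff. The sketch below is an illustration with hypothetical helper names, not part of OpenClaw; it treats any HTTP answer, including a 401, as proof that the socket is alive, and leaves the decision about what to do when the gateway is down to the caller.

```shell
#!/bin/sh
# Probe the gateway a bounded number of times with exponential backoff,
# then give up instead of hammering a dead socket.

# One attempt; succeeds when anything answers HTTP within 3 seconds.
probe_once() {
  curl -s -o /dev/null --max-time 3 "$1"
}

# Usage: probe_with_backoff URL MAX_ATTEMPTS
# Prints "up" and returns 0 on success, prints "down" and returns 1 otherwise.
probe_with_backoff() {
  url=$1
  max=$2
  attempt=1
  delay=1
  while [ "$attempt" -le "$max" ]; do
    if probe_once "$url"; then
      echo "up"
      return 0
    fi
    # Wait 1s, 2s, 4s, ... between attempts; skip the wait after the last one.
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  echo "down"
  return 1
}
```

A job could call `probe_with_backoff http://127.0.0.1:18789/ 4` through the tunnel and pause its own work when the result is "down", rather than retrying in a tight loop.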

Capacity planning still matters: upgrading OpenClaw does not remove memory pressure from large toolchains. If logs show OOM adjacent to gateway restarts, capture that separately from auth failures. The storage-and-memory playbook on this site helps decide when M4 Pro headroom buys stability versus when you only need cleaner supervision metadata.

03

Token migration and split brain: make the supervisor read what you edited

When ~/.openclaw/openclaw.json already contains gateway.auth.token yet dashboards still return 401, the next step is not generating a tenth token. Verify the launchd working directory and any OPENCLAW_STATE_DIR match your interactive shell. Some cloud images ship separate homes for automation users versus login users; a plist aimed at the wrong account produces the illusion that cat shows a token while the service sees none.

If doctor reports a newer config touched by an older binary, fix PATH first and reinstall supervisor metadata from the intended install. Avoid rapid config set loops while an old binary is still supervised, or meta version guards will block mutations without clear UI feedback.
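Before generating another token, it helps to classify which key the CLI can actually see. This is a hedged sketch built only on openclaw config get: it assumes the command prints nothing or exits non-zero for a missing key, and that a legacy gateway.token key is still readable through it. Neither behavior is confirmed here, so verify against your build. The helper names are hypothetical, and the secret itself is never printed.

```shell
#!/bin/sh
# Report whether a config key resolves to a non-empty value.
# Assumption: `openclaw config get KEY` prints nothing or fails for a missing key.
key_present() {
  value=$(openclaw config get "$1" 2>/dev/null) || return 1
  [ -n "$value" ]
}

# Classify the token layout without printing the secret.
token_layout() {
  if key_present gateway.auth.token; then
    if key_present gateway.token; then
      echo "migrated, legacy key still present"
    else
      echo "migrated"
    fi
  elif key_present gateway.token; then
    echo "legacy only"
  else
    echo "no token"
  fi
}

# Only run against a real install when the CLI is on PATH.
if command -v openclaw >/dev/null 2>&1; then
  token_layout
fi
```

Run it once as your login user and once as the account the plist targets; differing answers point at the separate-homes problem rather than at the token.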

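Shell
openclaw doctor                          # schema, token, port, and supervisor consistency
openclaw config get gateway.auth.token   # confirm the new key is set (prints the secret; keep it out of chat)
openclaw gateway status --deep           # look for duplicate installs and stale plists
openclaw gateway install --force         # refresh supervisor metadata from the intended install
openclaw gateway restart                 # restart so the service reads the updated config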

Note: During the change window, snapshot openclaw.json, the plist label, and the openclaw --version output together; roll back all three as a unit rather than restoring a single file.
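The three-part snapshot can be captured in one step. A minimal sketch, assuming the config lives at a path you pass in and that you already know the plist label; the function name and output file names are hypothetical. Because the copied config contains the token, the snapshot is written with owner-only permissions.

```shell
#!/bin/sh
# Usage: snapshot_triple CONFIG_FILE PLIST_LABEL DEST_DIR
# Saves the config file, the plist label, and the CLI version side by side.
# Runs in a subshell so the restrictive umask does not leak to the caller.
snapshot_triple() (
  umask 077
  mkdir -p "$3" || exit 1
  cp "$1" "$3/openclaw.json" || exit 1
  printf '%s\n' "$2" > "$3/plist-label.txt" || exit 1
  openclaw --version > "$3/cli-version.txt" 2>&1
)
```

For example, `snapshot_triple "$HOME/.openclaw/openclaw.json" example.openclaw.gateway "$HOME/openclaw-snapshots/$(date +%Y%m%dT%H%M%S)"`, where the label is a placeholder for your own.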

For handoffs, paste the outputs of the commands above, with the token value redacted, next to the heartbeat section in the 24/7 stability article so night shift does not rely on chat folklore about port tweaks.

If you rely on CI to bake images, add an automated smoke that runs doctor non-interactively after image build and fails the pipeline when warnings cross a threshold you define. That moves upgrade regressions left where they are cheaper than production discovery. Keep the smoke output artifact next to the image version string so operators can diff what changed between Tuesday and Thursday builds without SSH archaeology.
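A smoke gate of this kind might look like the sketch below. It rests on two assumptions that should be checked against real output: that doctor marks warnings with the word "warn" in some case, and that redirecting stdin from /dev/null is enough to keep it non-interactive. The function names, the MAX_WARNINGS variable, and the doctor-smoke.txt artifact name are all hypothetical.

```shell
#!/bin/sh
# Count lines on stdin that look like warnings.
# Assumption: doctor marks them with "warn" in any case; adjust the pattern.
count_warnings() {
  grep -ci 'warn' || true
}

# Usage: smoke_gate MAX_WARNINGS < doctor-output.txt
# Prints a summary line and fails when the count exceeds the threshold.
smoke_gate() {
  max=$1
  found=$(count_warnings)
  echo "warnings=$found threshold=$max"
  [ "$found" -le "$max" ]
}

# In CI: keep the raw output as an artifact next to the image version string.
if command -v openclaw >/dev/null 2>&1; then
  openclaw doctor > doctor-smoke.txt 2>&1 </dev/null
  smoke_gate "${MAX_WARNINGS:-0}" < doctor-smoke.txt || exit 1
fi
```

Keeping the raw file rather than only the count is what makes the Tuesday-versus-Thursday diff possible.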

04

Six reproducible steps from loopback health to safe remote 18789

Assume loopback already works on the server; otherwise return to the install checklist for onboarding and the daemon. Only then add remote paths so network mistakes are not disguised as auth failures.

01

Local proof: on the host run curl against http://127.0.0.1:18789/ or the documented health path and record the HTTP code.

02

Pick a path: single laptop access favors ssh -L 18789:127.0.0.1:18789 user@cloud-mac; broader teams evaluate Tailscale with ACLs.

03

Resolve clashes: if local 18789 is taken, use ssh -L 19000:127.0.0.1:18789 and browse the high local port.

04

Align tokens: tunnels still present the gateway token to the browser; store values in a vault, not chat.

05

Plan disconnects: add autossh or equivalent so unattended jobs do not assume the gateway vanished.

06

Write the ticket: capture bind mode, tunnel command, and port owner with the same change record as region and SKU; order capacity through the audited purchase page.
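Steps 01 through 03 can be sketched as a small helper. This is an illustration under stated assumptions: the function names are hypothetical, user@cloud-mac is the placeholder from step 02, and the clash check relies on lsof being available on the laptop. The helper prints the ssh command instead of running it so the exact line can go into the ticket.

```shell
#!/bin/sh
GATEWAY_PORT=18789
FALLBACK_PORT=19000

# Step 01: record the HTTP status code from loopback (run on the cloud Mac).
local_proof() {
  curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "http://127.0.0.1:$GATEWAY_PORT/"
}

# True when something already listens on the given local TCP port.
port_in_use() {
  lsof -nP -iTCP:"$1" -sTCP:LISTEN >/dev/null 2>&1
}

# Steps 02-03: print the ssh command, moving to the fallback port on a clash.
# Usage: tunnel_command USER@HOST
tunnel_command() {
  local_port=$GATEWAY_PORT
  if port_in_use "$local_port"; then
    local_port=$FALLBACK_PORT
  fi
  echo "ssh -N -L $local_port:127.0.0.1:$GATEWAY_PORT $1"
}

tunnel_command user@cloud-mac
```

The -N flag tells ssh not to run a remote command, which suits a forward-only session. If the fallback port is chosen, browse the high local port as described in step 03.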

05

Three production gates and when gateway install --force is justified

A

Version stamps agree: meta.lastTouchedVersion and openclaw --version move together; intentional downgrades need a separate policy, not silent force abuse.

B

Single supervisor source: default to one user-level gateway unit per host; if deep scans show duplicates, disable extras before force.

C

Force boundaries: use when plist ports disagree with live config or doctor explicitly requests supervisor refresh; never as a substitute for secret rotation or security-group review.

Warning: hammering force before reading logs and doctor output can turn a simple port collision into a restart loop that widens the outage window.
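Gates A and B lend themselves to a pre-flight check that runs before anyone types force. The sketch below is hypothetical: the operator supplies the meta.lastTouchedVersion value, the openclaw --version output, and the number of gateway units found by the deep scan, since the exact output formats are not assumed here. It compares only the first x.y.z-style number in each string, so prerelease suffixes are ignored.

```shell
#!/bin/sh
# Extract the first x.y.z-looking token from arbitrary version text.
extract_semver() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Usage: force_gate TOUCHED_VERSION CLI_VERSION_OUTPUT GATEWAY_UNIT_COUNT
# Prints the first failing gate and returns 1, or confirms gates A and B.
force_gate() {
  touched=$(extract_semver "$1")
  cli=$(extract_semver "$2")
  units=$3
  if [ -z "$touched" ] || [ "$touched" != "$cli" ]; then
    echo "gate A: version stamps disagree ($touched vs $cli)"
    return 1
  fi
  if [ "$units" -ne 1 ]; then
    echo "gate B: expected one gateway unit, found $units"
    return 1
  fi
  echo "gates A and B pass; check gate C against doctor output"
}
```

Gate C stays a human decision: the script can say that nothing blocks force, but only logs and doctor output can say that force is the right fix.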

Compared with borrowing a teammate's laptop as the gateway host, pinning OpenClaw to a rentable KVMNODE cloud Mac makes it easier to keep binary paths, plist labels, tokens, and tunnel commands on one change record; laptops still lose time to sleep and OS upgrades. For teams that must document observability, upgrade rollback, and renewals in weekly reviews, dedicated bare metal with multi-metro choice is usually easier to operate than scattered hardware: rent short-term in the chosen region to validate upgrades and tunnels, then decide on M4 Pro and longer terms using the pricing page and Help Center instead of release-night chat requests.