I spent most of a Sunday convinced I was losing my mind. Same Python binary. Same .env. Same machine. Same network. Run it from a Terminal window and my Hermes-Agent gateway booted clean, the Home Assistant MCP server connected, the lights came on. Wrap the exact same command in a launchd plist as a LaunchAgent and the MCP server failed every time with three retries and a useless error. Nothing else changed.
Magic-trick bugs are the kind that make you doubt your own competence for a few hours: the diagnostic equivalent of “is this your card?”, with no obvious surface to debug through.
The reveal was that macOS Sequoia quietly shipped a new privacy enforcement layer that breaks a specific class of well-established Unix patterns, with no error message, no permission prompt, and no API to fix it. The workaround is to ask AppleScript to ask Terminal to ask the shell to start your daemon. I am not making that up.
Hermes-Agent is the daily-driver agent I’ve written about before. It’s a self-hosted gateway that listens on multiple inbound channels (TUI, Telegram, a couple of others) and connects out to a set of MCP servers that give it tools and devices. One of those is the Home Assistant connector, which is how I let the agent turn the kitchen lights off when I forget.
On Linux it works. On macOS as a foreground process it works. As a launchd LaunchAgent on macOS, the Home Assistant MCP server fails to connect every time. I’d moved it to a LaunchAgent because I wanted it running on boot without remembering to open a Terminal session. That decision cost me a Sunday.
The user-visible failure was a useless one-liner, “unhandled errors in a TaskGroup (1 sub-exception)”, followed by three retries and a permanent give-up. The real exception was wrapped and swallowed inside the MCP SDK’s task group machinery, which is a polite way of saying the framework was lying to me about what failed. Step one of a debug like this: stop trusting the error you’ve been handed.
I went through five plausible-but-wrong hypotheses before finding the real culprit. Each is the kind of thing a less paranoid engineer would have stopped at.
Tailscale routing. First reflex, because it’s a Mac and I run Tailscale across most of my infrastructure. Wrong. Home Assistant is on the LAN, no VPN hop, and the Tailscale daemon was identical in both contexts. Lesson: don’t anchor on the explanation that worked for last week’s bug.
Environment variable interpolation. The gateway log showed “Loaded environment variables from .env” firing multiple times per inbound message, suggesting subprocess contexts were re-loading env on each MCP launch. The MCP config used ${HASS_URL} for the Home Assistant address; maybe the LaunchAgent context wasn’t seeing the right value. I added a spawn-trace log line capturing the fully resolved URL at every launch. Both paths showed identical values. Falsified.
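A minimal sketch of what such a spawn-trace line might look like. The actual interpolation code in the gateway is not shown in this post, so `resolve_and_trace` is a hypothetical helper, and using `string.Template` for `${VAR}` expansion is an assumption about how the config is resolved.

```python
import logging
import os
from string import Template

log = logging.getLogger("spawn_trace")


def resolve_and_trace(raw, env=None):
    """Expand ${VAR} references in a config value and log the result.

    Hypothetical helper: the real gateway may interpolate differently.
    """
    env = os.environ if env is None else env
    # safe_substitute leaves unknown ${VAR} references untouched, so a
    # missing variable is visible in the log instead of raising KeyError.
    resolved = Template(raw).safe_substitute(env)
    # Logging the pid and parent pid makes it easy to tell the
    # launchd-spawned context apart from the Terminal-spawned one.
    log.info("pid=%d ppid=%d raw=%r resolved=%r",
             os.getpid(), os.getppid(), raw, resolved)
    return resolved
```

Logging both the raw and the resolved value is what lets the two launch paths be compared line by line; if they match, as they did here, interpolation is off the suspect list.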
DNS resolution. Checked because errno 65 can sometimes be downstream of a bad lookup. Resolved correctly, every time.
Python certificate chains. Because there always are. The HA endpoint is on https://, Python has its own cert bundle (certifi) separate from the system Keychain, and the LaunchAgent context could plausibly be reaching for a different one. It wasn’t. The error fired at the TCP layer, well before any TLS handshake had a chance to happen. The handshake is the better suspect roughly 90% of the time, which is why it’s worth ruling out properly even when the error code says it can’t be that.
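One way to separate the two layers is to make the TCP connect and the TLS handshake fail independently. The sketch below is illustrative and not the check used at the time; `failing_stage` is a hypothetical name, and it relies on Python's default trust store configuration via `ssl.create_default_context()`.

```python
import socket
import ssl


def failing_stage(host, port=443, timeout=5.0):
    """Return ("tcp" | "tls" | "ok", detail) for one HTTPS endpoint."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        # The connection failed before any TLS bytes were exchanged,
        # so certificates cannot be the cause.
        return "tcp", repr(exc)
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return "ok", tls.version()
    except OSError as exc:
        # ssl.SSLError (including certificate errors) subclasses OSError.
        return "tls", repr(exc)
    finally:
        sock.close()
```

A result of `"tcp"` from one launch context and `"ok"` from the other points below the certificate layer, which matches what the error turned out to be.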
Routing table and interface state. netstat -rn showed clean routes, ARP entries existed, nothing weird at the kernel level from a shell.
This is roughly where I wanted to throw my laptop into the canal. Each false trail had been narrowed properly and ruled out properly, and I was no closer to a culprit. “Everything looks identical” is the worst pattern in debugging, because it implies the difference is somewhere you haven’t thought to look.
The breakthrough came from doing what I should have done two hours earlier: I patched the MCP SDK to surface the wrapped exception instead of folding it into the TaskGroup. Out fell OSError(65, 'No route to host'), also known as EHOSTUNREACH. Errno 65 is a real network-layer error, not a higher-level wrapper. Something inside the kernel’s TCP path was refusing to make the connection.
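The SDK patch itself is not reproduced here. As a rough illustration of the idea, the sketch below flattens a nested exception group into its leaf exceptions so the real error can be logged. `leaf_exceptions` is a hypothetical helper, and it assumes only that group objects expose an `exceptions` sequence, as Python 3.11's built-in `ExceptionGroup` does.

```python
def leaf_exceptions(exc):
    """Flatten a (possibly nested) exception group into leaf exceptions."""
    # Exception groups carry their children in `.exceptions`;
    # ordinary exceptions have no such attribute.
    children = getattr(exc, "exceptions", None)
    if not children:
        return [exc]
    leaves = []
    for child in children:
        leaves.extend(leaf_exceptions(child))
    return leaves
```

Called inside an `except` block around the failing connect, with `repr()` logged for each leaf, this would print something like the `OSError(65, 'No route to host')` above instead of the TaskGroup summary.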
Then I wrote a small probe script and ran it from inside the LaunchAgent process and from a Terminal-launched process side by side, against four targets:
| Target | LaunchAgent | Terminal |
|---|---|---|
| Default gateway (192.168.x.1) | works | works |
| Any other LAN host | EHOSTUNREACH | works |
| Internet (8.8.8.8) | works | works |
| Tailscale CGNAT (100.x.x.x) | works | works |
That table is the diagnostic fingerprint. If you see it, you have what I had. From the LaunchAgent, only the default gateway and non-RFC1918 addresses succeed. Every other LAN host is silently blocked.
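The original probe script is not included in this post. A minimal version might look like the sketch below, which attempts one TCP connection per target and reports the errno name on failure. Targets are passed on the command line as `HOST:PORT`, so no addresses are assumed.

```python
import errno
import socket
import sys


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection; return "works" or the errno name."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "works"
    except (TimeoutError, socket.timeout):
        return "timeout"
    except OSError as exc:
        # e.g. EHOSTUNREACH (errno 65 on macOS) or ECONNREFUSED
        return errno.errorcode.get(exc.errno, repr(exc))


def main(targets):
    for target in targets:
        host, _, port = target.rpartition(":")
        print(f"{target}\t{probe(host, int(port))}")


if __name__ == "__main__":
    # Usage: python probe.py HOST:PORT [HOST:PORT ...]
    main(sys.argv[1:])
```

Run the same file once from a Terminal shell and once from inside the LaunchAgent, then compare the two outputs. Note that `ECONNREFUSED` still means the host was reachable; it is `EHOSTUNREACH` on LAN hosts only, in the LaunchAgent only, that makes the fingerprint.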
That looks like a routing problem. It isn’t. Routing problems don’t politely allow internet traffic and selectively drop your NAS. This was permission-shaped, and once you see that shape it stops being a mystery and starts being a search problem.
Search led me to Apple’s Tech Note TN3179 and a long, patient backlog of forum replies from Quinn “The Eskimo” of Apple DTS. The feature is called Local Network Privacy. It shipped in Sequoia. Its properties are deeply unhelpful for anyone running CLI infrastructure.
It is not TCC. tccutil doesn’t see it, can’t reset it, can’t grant it. Editing the TCC database is irrelevant. LNP is a separate enforcement layer, administered nowhere you’d think to look.
There is no programmatic API. No way to query the state, prompt for it, or grant it. No MDM payload. No configuration profile. Quinn has confirmed this on the developer forums repeatedly, and the gentle exhaustion in those replies is a thing to behold.
LaunchDaemons running as root are exempt. LaunchAgents in user context are not. This is the bit that bites.
Terminal-launched processes inherit Terminal’s exemption. Terminal.app is a system app, system apps are exempt, and any child process gets the parent’s “responsible code” attribution. That’s why the foreground case worked.
LaunchAgents are the worst case. They run in user context (subject to LNP) but their PPID is 1 with no GUI ancestor, so macOS can’t determine what code is responsible. It silently denies. No dialog. No log entry. No row in the Privacy & Security panel to toggle on.
Python makes it worse. You can’t sign Python code, so the system identifies the process by the interpreter binary’s UUID. That UUID changes when you rebuild your venv or upgrade Python. Every venv rebuild is, from LNP’s perspective, a brand-new program.
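To see whether a venv rebuild changed the interpreter's identity, one option is to read the Mach-O UUID with `dwarfdump --uuid`, which ships with Apple's developer tools. The sketch below is hedged on two points: the helper names are hypothetical, and it assumes the binary LNP attributes is the one `sys.executable` resolves to, which may not hold for every Python build.

```python
import os
import re
import subprocess
import sys

# dwarfdump prints lines such as: UUID: <uuid> (<arch>) <path>
UUID_LINE = re.compile(r"^UUID: ([0-9A-Fa-f-]{36}) \((\S+)\) (.+)$")


def parse_uuids(dwarfdump_output):
    """Map architecture -> UUID from `dwarfdump --uuid` output."""
    uuids = {}
    for line in dwarfdump_output.splitlines():
        match = UUID_LINE.match(line.strip())
        if match:
            uuids[match.group(2)] = match.group(1)
    return uuids


def interpreter_uuids():
    """macOS only: UUIDs of the real binary behind this interpreter."""
    # A venv's python is usually a symlink; resolve it first.
    binary = os.path.realpath(sys.executable)
    result = subprocess.run(["dwarfdump", "--uuid", binary],
                            capture_output=True, text=True, check=True)
    return binary, parse_uuids(result.stdout)
```

Recording the output of `interpreter_uuids()` before and after an upgrade shows whether the system would now see a different program.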
Now look at the table again. The default gateway works because the kernel needs to talk to it for any outbound traffic. Tailscale works because the CGNAT range (100.64/10) isn’t classified as “local network” by Apple’s definition. The internet works for the same reason: it leaves the network. Other LAN hosts get silently blocked, because they are precisely what LNP exists to gate.
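The pattern in the table can be written down as a small predicate. This encodes only the behaviour observed above (gateway passes, RFC1918 hosts fail, everything else passes); it is not Apple's actual rule, which may differ, for instance by depending on which subnets are directly attached. The function name and any addresses passed to it are hypothetical.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC1918 = [ipaddress.ip_network(net) for net in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def predicted_from_launch_agent(address, gateway):
    """Predict the probe result from a LaunchAgent, per the observed table."""
    ip = ipaddress.ip_address(address)
    if ip == ipaddress.ip_address(gateway):
        return "works"            # the default gateway is always reachable
    if any(ip in net for net in RFC1918):
        return "EHOSTUNREACH"     # other LAN hosts are silently blocked
    return "works"                # internet and CGNAT (100.64.0.0/10)
```

If a real probe run disagrees with this prediction, the cause is probably something other than Local Network Privacy.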
Apple’s documented fix is to ship your code as a signed .app bundle with NSLocalNetworkUsageDescription and AssociatedBundleIdentifiers wired up. Which requires a paid Apple Developer ID and substantial app-bundle scaffolding for what is, fundamentally, a CLI tool. The official path is unavailable to anyone shipping an open-source Python service.
If LaunchAgents have no GUI ancestor, and Terminal-launched processes inherit Terminal’s exemption, the workaround writes itself. The LaunchAgent’s only job becomes asking Terminal to do the actual work.
<key>ProgramArguments</key>
<array>
<string>/usr/bin/osascript</string>
<string>-e</string>
<string>tell application "Terminal" to do script "exec ~/.hermes/run.sh"</string>
</array>
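The fragment above is only the ProgramArguments portion. A complete plist could be generated with Python's `plistlib`, as in the sketch below. The label is hypothetical, and `RunAtLoad` is an assumption about the wider configuration based on the goal of starting without opening a Terminal session by hand.

```python
import plistlib

LABEL = "com.example.hermes-gateway"  # hypothetical label


def build_launch_agent(script="~/.hermes/run.sh"):
    """Build a LaunchAgent dict that asks Terminal to run the wrapper."""
    applescript = f'tell application "Terminal" to do script "exec {script}"'
    return {
        "Label": LABEL,
        "ProgramArguments": ["/usr/bin/osascript", "-e", applescript],
        "RunAtLoad": True,    # assumption: start when the agent is loaded
        "KeepAlive": False,   # restarts are left to the wrapper script
    }


def render(plist):
    """Serialize to the XML plist format launchd reads."""
    return plistlib.dumps(plist).decode("utf-8")
```

The rendered text would go in a file under `~/Library/LaunchAgents/`, named after the label.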
run.sh is a wrapper that loops on the actual gateway with while true; do sleep 5; ...; done, so a crash restarts the agent without re-spawning Terminal windows. KeepAlive would re-fire osascript and pile up windows; the loop in the wrapper avoids that. Trade-offs worth flagging honestly:
launchctl list shows the agent as exited / 0 immediately. That looks broken. It is correct: osascript’s job ends the moment Terminal accepts the request.
tmux. The tmux server forks, and whether the original audit token survives that fork is unclear from Apple’s docs. I’m not yet brave enough to find out the hard way.
It feels janky because it is janky. It is also structurally sound. The agent has been up continuously for a week.
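The wrapper described above is a shell script and is not shown in full. For readers who would rather keep the restart loop in Python, a hypothetical equivalent is sketched below; `supervise` and its parameters are illustrative names, and the five-second delay mirrors the `sleep 5` mentioned earlier.

```python
import subprocess
import time


def supervise(command, delay=5.0, max_runs=None):
    """Re-run `command` each time it exits; return the exit codes seen.

    With max_runs=None the loop never ends, which is the daemon case.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        # Blocks until the child exits, cleanly or by crashing.
        exit_codes.append(subprocess.call(command))
        # Pause so a crash loop does not spin the CPU.
        time.sleep(delay)
    return exit_codes
```

Because the loop runs inside the Terminal-launched process, every restart happens under the same Terminal ancestry, which is the point of the workaround.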
If you see a daemon on a Mac that talks to your gateway and your internet but nothing else on your LAN, you now know what it is. The fingerprint is specific enough to be conclusive.
The thing I should flag is that this was not a solo expedition. Claude was the pair on this one. The loop went: form a hypothesis, write the probe, read the output, rule it out, form the next one. It ran an order of magnitude faster than it would have done in 2020. The MCP SDK patch that surfaced the wrapped exception was a five-minute exercise. The Quinn-and-TN3179 archaeology that produced the LNP discovery would have been a half-day of forum-trawling on my own; with an agent that could read the docs alongside me it was forty minutes.
The unexpected upside was the framework itself. To get the wrapped exception out of the MCP SDK’s task group I had to read the gateway, the MCP client, the env loader, and the spawn pipeline carefully enough to add tracing without breaking anything. I now know Hermes’ guts the way you only know a piece of software you’ve had to debug seriously: where the channel adapters live, how env propagation actually works across the subprocess boundary, which retry policies are on someone else’s TODO list. None of that is in the docs in any useful form, and I wouldn’t have read the source that carefully if the bug hadn’t forced me to.
That side-effect is the real benefit of running self-hosted infrastructure in 2026. Not the open-source-purity bit. Not the no-vendor-lock-in bit. The bit where, when something goes wrong, the entire stack is legible, an AI pair can read it with you, and you come out understanding your own system in a way closed-cloud equivalents simply do not allow. Hermes was excellent before this episode. After it I trust it the way you trust a tool you’ve taken apart and put back together. Which is the way you ought to trust anything allowed to turn your lights off.