go-agent-reliability
Problem Long-running LLM-agent loops fail in ways ordinary services do not. They stall silently: a model call hangs and nothing makes progress for forty minutes, but no error is raised. They get stuck retrying the same failing step and burn tokens. They die mid-task and leave half-written state behind. A supervisor like systemd or Kubernetes restarts the process but cannot tell the agent what it needs to know on restart: was that a clean stop or a crash, and where exactly was I?
Approach Six standalone packages, standard library only, meant to
compose in a plain for loop rather than a framework:
watchdog (idle warn/alert tiers), stuck
(consecutive-failure detection), checkpoint (atomic
write-rename state store), runlock (a flock
whose leftover content is the crash signal), recovery (pure
classification of the previous run plus resume-context prose for the next
prompt), and lifecycle (signal-aware graceful shutdown with
LIFO hooks).
Outcome A go get-able library with a bundled demo
agent that survives Ctrl-C and kill -9 and explains, in
prose, how its previous run ended. Deterministic tests via injectable
clocks and signal notifiers; go test -race clean.
Limitations Unix-only (flock), single-node, last-write-wins checkpoints with no history. Durability over speed: it fsyncs on every checkpoint.
Stack Go 1.24, zero dependencies · github.com/Assylzhan-a/go-agent-reliability