01 · Section
Three layers, in order
Error tracking, structured logs, and metrics. In that order. Most teams skip ahead to fancy dashboards before they have basic exception capture working, then wonder why incidents take two hours to diagnose.
Sentry on day one. Structured JSON logs shipped to a managed log store (Axiom, or Better Stack, formerly Logtail) on day two. Application and infra metrics in month two.
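For the day-two step, here is a minimal sketch of structured JSON logging using only the Python standard library. It assumes your log shipper or managed store ingests one JSON object per line from stdout. The field names (`ts`, `level`, `logger`, `message`), the `context` key, and the `checkout` logger are illustrative choices, not a schema any particular vendor requires.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        # Caller-supplied fields arrive via `extra={"context": {...}}`
        # (the "context" key is a convention assumed for this sketch).
        payload = dict(getattr(record, "context", None) or {})
        # Core fields are written last so context can never overwrite them.
        payload.update(
            {
                "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger


log = build_logger("checkout", sys.stdout)
log.info("payment captured", extra={"context": {"order_id": "ord_123", "amount_cents": 4200}})
```

Writing to stdout keeps the application unaware of which log store sits behind it, so switching vendors later is a shipping-config change rather than a code change.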
02 · Section
Skip the expensive stuff until it pays back
A full Datadog rollout with APM, RUM and synthetics easily lands at $2–5K/month before you have customers. That money is better spent on engineers shipping features.
Self-hosted Grafana + Loki + Tempo can give you roughly 80% of the value at 10% of the cost, but only if someone owns the runbook for upgrades, storage, and outages of the stack itself. If no one will, pay for the managed service.
03 · Section
Alerting that does not melt your brain
Two channels: an "act now" channel that pages a human, and a "look at this" channel that is read async. Anything noisier than that gets ignored and you lose signal.
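The two-channel rule can be sketched as a small routing function. The criteria used here (an alert pages only when it is both actionable and urgent) are an assumption made for illustration; the `Alert` fields and channel names are hypothetical, and your own paging criteria may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    ACT_NOW = "act-now"            # pages a human
    LOOK_AT_THIS = "look-at-this"  # read asynchronously


@dataclass(frozen=True)
class Alert:
    name: str
    actionable: bool  # is there something a human can do about it?
    urgent: bool      # does it need doing now rather than tomorrow morning?


def route(alert: Alert) -> Channel:
    """Page only when the alert is both actionable and urgent."""
    if alert.actionable and alert.urgent:
        return Channel.ACT_NOW
    # Everything else goes to the async channel; there is no third option.
    return Channel.LOOK_AT_THIS
```

Because the function has only two possible return values, a new alert cannot quietly create a third, noisier channel.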
Tune aggressively for false-positive rate in the first month. A rotation that gets paged three times a night for a week stops being on-call in practice: engineers silence the alerts and miss the real one.
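One way to make the tuning concrete is to record, for every page, whether it required action, then compute a false-positive rate per alert rule. The sketch below assumes you keep such a record; the `Page` type, the rule names, and the 50% threshold and three-page minimum are hypothetical defaults, not recommended values.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Page(NamedTuple):
    rule: str         # name of the alert rule that fired
    actionable: bool  # did the on-call engineer actually have to act?


def false_positive_rates(pages: Iterable[Page]) -> dict[str, float]:
    """Fraction of pages per rule that needed no action."""
    total: Counter[str] = Counter()
    noise: Counter[str] = Counter()
    for page in pages:
        total[page.rule] += 1
        if not page.actionable:
            noise[page.rule] += 1
    return {rule: noise[rule] / total[rule] for rule in total}


def rules_to_tune(
    pages: Iterable[Page], max_rate: float = 0.5, min_pages: int = 3
) -> list[str]:
    """Rules that fired at least `min_pages` times with a rate above `max_rate`."""
    pages = list(pages)
    counts = Counter(page.rule for page in pages)
    rates = false_positive_rates(pages)
    return sorted(
        rule
        for rule, rate in rates.items()
        if counts[rule] >= min_pages and rate > max_rate
    )


week = [
    Page("disk_usage_high", False),
    Page("disk_usage_high", False),
    Page("disk_usage_high", True),
    Page("disk_usage_high", False),
    Page("checkout_5xx", True),
    Page("checkout_5xx", True),
]
print(rules_to_tune(week))  # ['disk_usage_high']
```

Running this against a week of pages gives a short list of rules to loosen, demote to the async channel, or delete.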
Key takeaways
- Errors → logs → metrics, in that order. Do not skip steps.
- Sentry + a managed log store covers most pre-Series-A needs.
- Self-host Grafana/Loki only if someone will own it; otherwise pay.
- Two alert channels max. Tune false positives weekly in the first month.
Written by
Sara Mahmood
7 min read · Posted in Cloud Computing