Cloud Monitoring Dashboard

01 - The Problem

Servers fail silently.

Without centralized monitoring, you find out about a server in trouble the worst possible way: from the people using it. A disk fills up overnight, a runaway process pins the CPU at 100%, memory leaks until the kernel starts killing things - and the first signal you get is an angry message, not a clean alert from your own tooling.

The problem compounds across providers. Dev runs on one box, production on another, and the fleet is spread across AWS, GCP, and Hetzner. Each cloud has its own console, its own login, its own way of showing graphs. There is no single pane of glass - just a dozen tabs that each tell you a fraction of the story.

I wanted the opposite of that: one place to open, where the health of every machine I run is visible at a glance, and where problems announce themselves before they ever reach a user.

02 - The Architecture

One monitor server. The whole fleet in view.

The design follows a classic, dependable pattern: a single dedicated monitoring server that pulls metrics from everything else. Every machine in the fleet runs Node Exporter, a lightweight agent that exposes hardware and OS metrics - CPU, memory, disk, network, load - on port :9100.

Prometheus, running on :9090 on the monitor server, scrapes every one of those targets on a fixed interval of a few seconds and stores the results as time-series data. Because Prometheus pulls metrics rather than waiting for them to be pushed, adding a new server means running Node Exporter on it and adding one more target line to the scrape config.
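
To make the "one more line" idea concrete, here is a minimal sketch in Python that renders a scrape config with one static job. The hostnames, the job name "node", and the 15-second interval are placeholders chosen for illustration, not values taken from this setup.

```python
from typing import Iterable


def render_scrape_config(targets: Iterable[str], interval: str = "15s") -> str:
    """Render a minimal prometheus.yml with a single static scrape job."""
    lines = [
        "global:",
        f"  scrape_interval: {interval}",  # assumed value; the real interval may differ
        "scrape_configs:",
        "  - job_name: node",  # hypothetical job name
        "    static_configs:",
        "      - targets:",
    ]
    # One line per host: adding a server to the fleet adds exactly one entry here.
    lines += [f'          - "{target}"' for target in targets]
    return "\n".join(lines) + "\n"


# Hypothetical fleet; localhost is the monitor server scraping itself.
fleet = [
    "localhost:9100",
    "dev.example.internal:9100",
    "prod.example.internal:9100",
]
config_text = render_scrape_config(fleet)
print(config_text)
```

Appending one more "host:9100" string to the list is the whole change needed on the Prometheus side; the new host still has to run Node Exporter and allow the monitor server through its firewall.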

Grafana, on :3000, reads from Prometheus as its data source and turns those raw series into live dashboards and alerts. The monitor server even monitors itself - it runs its own Node Exporter on localhost, so the watchdog never becomes a blind spot.

Figure: architecture diagram. Prometheus (:9090) scrapes Node Exporter on :9100 across the Dev, Prod, Monitor, AWS, GCP, and Hetzner servers, and Grafana (:3000) renders the live view.

03 - Security & Access

Metrics are useful internally - and dangerous if exposed.

A Node Exporter endpoint hands out a detailed map of a machine: its CPU and memory use, its load, its filesystems, its network interfaces. That is exactly what you want your monitor server to see, and exactly what you never want open to the public internet. So port discipline is part of the design, not an afterthought.

Firewall rules keep :9100 reachable only from the monitor server's address, never from anywhere else. Prometheus on :9090 stays fully internal - there is no reason for it to be reachable from outside. Only Grafana on :3000 is exposed, and only behind secure, authenticated access. SSH on :22 is used for setup and stays locked down to known keys.
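
The port policy above can be written down as a small decision function. This is only a model of the rules in Python, not the firewall configuration itself: the monitor address 10.0.0.5 is a made-up placeholder, "fully internal" is modelled as loopback-only (an assumption), and a single authenticated flag stands in for both the Grafana login and the SSH key check.

```python
import ipaddress

# Hypothetical address of the monitor server.
MONITOR_IP = ipaddress.ip_address("10.0.0.5")


def is_allowed(port: int, source: str, authenticated: bool = False) -> bool:
    """Return True if a connection to `port` from `source` fits the policy."""
    src = ipaddress.ip_address(source)
    if port == 9100:
        # Node Exporter: reachable from the monitor server only.
        return src == MONITOR_IP
    if port == 9090:
        # Prometheus: internal only (modelled here as loopback; an assumption).
        return src.is_loopback
    if port in (3000, 22):
        # Grafana and SSH: open, but only to authenticated users / known keys.
        return authenticated
    # Anything else is denied by default.
    return False


print(is_allowed(9100, "10.0.0.5"))      # monitor server scraping a host
print(is_allowed(9100, "203.0.113.7"))   # outside address probing metrics
print(is_allowed(3000, "203.0.113.7", authenticated=True))
```

The default-deny branch at the end carries the "hide the plumbing" idea: a port has to be named explicitly before anything can reach it.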

The rule of thumb is simple: expose the dashboard, hide the plumbing. Anything that reveals internal detail stays behind the firewall, and the one door that is open is the one with a lock on it.

04 - The Dashboards

Raw metrics are noise. Dashboards are answers.

The first dashboard is Node Exporter Full - a deep, per-server view for when you need to actually diagnose something. It breaks a single machine down into everything that matters: CPU busy and system load, RAM used and SWAP, root filesystem usage, plus live graphs of network traffic and disk space over time.
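
Numbers like "CPU busy" and "root filesystem usage" are derived from raw counters and gauges rather than reported directly. The sketch below shows one common way to compute them in Python from Node Exporter style values (the idle-mode total of node_cpu_seconds_total, node_filesystem_size_bytes, node_filesystem_avail_bytes). The exact queries behind the Node Exporter Full panels may differ, and the sample numbers are invented.

```python
def cpu_busy_percent(idle_start: float, idle_end: float,
                     elapsed_seconds: float, cores: int) -> float:
    """CPU busy % from the growth of the idle-seconds counter summed over cores."""
    # Each core can contribute at most `elapsed_seconds` of idle time.
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cores)
    return 100.0 * (1.0 - idle_fraction)


def filesystem_used_percent(size_bytes: float, avail_bytes: float) -> float:
    """Filesystem usage % from total size and bytes still available."""
    return 100.0 * (1.0 - avail_bytes / size_bytes)


# Invented sample: 2 cores, 60 s apart, idle counter grew by 90 s in total.
cpu = cpu_busy_percent(idle_start=1000.0, idle_end=1090.0,
                       elapsed_seconds=60.0, cores=2)
# Invented sample: 100 GB root filesystem with 25 GB still available.
disk = filesystem_used_percent(size_bytes=100e9, avail_bytes=25e9)
print(cpu, disk)  # 25.0 75.0
```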

Figure: the Node Exporter Full dashboard in Grafana, the detail view for diagnosing a single host, with CPU, RAM, and disk gauges alongside network traffic graphs.

The second dashboard is the one I keep open all day: an at-a-glance server status board. Instead of dense graphs, it leads with large UP / DOWN panels for each server, so the most important question - "is everything alive?" - is answered in a single glance. Below that sit compact CPU, memory, and disk gauges and per-server uptime, color-coded so anything unhealthy jumps out.
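
The UP / DOWN panels rest on a simple signal: Prometheus records an up series for every scrape target, with value 1 when the last scrape succeeded and 0 when it failed. As a rough sketch of the same idea outside Grafana, the Python below queries the Prometheus HTTP API for up and reduces the answer to a status per instance. The base URL you pass in and the sample payload are assumptions for illustration; the payload mirrors the shape of an instant-query response.

```python
import json
import urllib.parse
import urllib.request


def fetch_up(base_url: str) -> dict:
    """Run the instant query `up` against a Prometheus server."""
    query = urllib.parse.urlencode({"query": "up"})
    url = f"{base_url}/api/v1/query?{query}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def status_board(payload: dict) -> dict:
    """Map each scraped instance to "UP" or "DOWN"."""
    board = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        board[instance] = "UP" if value == "1" else "DOWN"
    return board


# Invented payload in the shape of an instant-query response.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "localhost:9100", "job": "node"},
             "value": [1700000000.0, "1"]},
            {"metric": {"instance": "prod.example.internal:9100", "job": "node"},
             "value": [1700000000.0, "0"]},
        ],
    },
}
board = status_board(sample_payload)
print(board)
```

Against a live server, status_board(fetch_up("http://localhost:9090")) would produce the same kind of mapping, assuming Prometheus is reachable at that address.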

Figure: the at-a-glance status board in Grafana, with big UP/DOWN panels, CPU, memory, and disk gauges, and uptime for the whole fleet.

05 - Alerting

A dashboard you have to watch isn't monitoring.

Dashboards are for investigating; alerts are for being told. Prometheus evaluates a set of alert rules against the same time-series it collects - CPU sustained above a threshold, free disk space dropping below a danger line, high memory pressure - and fires the moment a rule's condition holds true.
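
A "sustained" condition means the value has to stay over the threshold for some time before the alert fires, which keeps short spikes from paging anyone. The Python sketch below illustrates that logic with the three states Prometheus uses for alerts (inactive, pending, firing). The 80% threshold, the 120-second hold, and the sample readings are invented for the example; this is not how Prometheus is implemented internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedRule:
    threshold: float            # e.g. CPU busy percent
    hold_seconds: float         # how long the breach must last before firing
    breach_started: Optional[float] = None

    def evaluate(self, timestamp: float, value: float) -> str:
        """Return the alert state after seeing one new sample."""
        if value <= self.threshold:
            # Condition no longer holds: reset the timer.
            self.breach_started = None
            return "inactive"
        if self.breach_started is None:
            self.breach_started = timestamp
        if timestamp - self.breach_started >= self.hold_seconds:
            return "firing"
        return "pending"


# Invented readings: (seconds, CPU busy %).
samples = [(0, 95.0), (60, 97.0), (120, 96.0), (180, 40.0)]
rule = SustainedRule(threshold=80.0, hold_seconds=120.0)
states = [rule.evaluate(t, v) for t, v in samples]
print(states)  # ['pending', 'pending', 'firing', 'inactive']
```

Setting hold_seconds to 0 makes the rule fire on the first sample over the threshold, which suits conditions that should never be tolerated even briefly.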

Those alerts surface directly in Grafana with clear FIRING and RESOLVED states, so I see a problem the moment it starts trending the wrong way - not after it has cascaded into an outage. The goal is to catch the small thing before it becomes the big thing.

06 - The Result

One live view for the entire fleet.

The whole stack runs in Docker, which makes the setup reproducible: the monitoring server can be torn down and rebuilt from the same compose definition, and spinning up a fresh instance takes minutes, not a day of manual configuration. Coverage spans AWS, GCP, and Hetzner from a single dashboard, whatever console each provider ships.

What I value most is what monitoring quietly enables. Deploys get boring - I can watch the graphs hold steady instead of hoping nothing broke. Incident response gets fast, because the first place to look already has the answer. And capacity planning gets honest, because weeks of real history show exactly when a server is starting to run out of room.

Good monitoring doesn't feel exciting, and that's the point. It turns surprises into signals - and lets you trust that your servers are healthy before your users ever have to tell you they aren't.