
rawops.dev

P2

Prometheus Target Down — No Metrics Being Scraped

Fix a target stuck in `DOWN` on Prometheus's /targets page. Covers network reachability, /metrics format, relabeling mistakes, TLS/auth, and scrape timing.

15 min, 7 steps

Prometheus tells you exactly what broke. Don't guess before reading this.

# In the browser: http://prometheus:9090/targets
# Or via API:
curl -sS http://localhost:9090/api/v1/targets | python3 -m json.tool | head -80
Expected: `health: down` plus `lastError` — typically 'context deadline exceeded', 'connection refused', 'tls: handshake failure', or 'unsupported Content-Type'.
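
To pull only the failing targets out of that JSON, a small filter helps. This is a minimal sketch that assumes the `activeTargets` entries carry `labels`, `scrapeUrl`, `health`, and `lastError`; the `SAMPLE` payload and the `down_targets` helper are illustrative, not real output.

```python
# Hypothetical, trimmed-down payload in the shape of /api/v1/targets.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "10.0.0.5:9100"},
             "scrapeUrl": "http://10.0.0.5:9100/metrics",
             "health": "down",
             "lastError": "context deadline exceeded"},
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "scrapeUrl": "http://localhost:9090/metrics",
             "health": "up",
             "lastError": ""},
        ],
        "droppedTargets": [],
    },
}


def down_targets(payload):
    """Return (job, scrape_url, last_error) for every active target that is not up."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            rows.append((
                target.get("labels", {}).get("job", "?"),
                target.get("scrapeUrl", "?"),
                target.get("lastError", ""),
            ))
    return rows


for job, url, err in down_targets(SAMPLE):
    print(f"{job}\t{url}\t{err}")
```

For live data, swap `SAMPLE` for `json.load(sys.stdin)` and pipe the `curl` output into the script.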

Prometheus in Docker/Kubernetes may not resolve the same DNS/IPs you do. Always test from inside its network namespace.

# Native:
curl -vs http://<target>:<port>/metrics 2>&1 | head -30

# From Prometheus container:
docker exec prometheus wget -qO- http://<target>:<port>/metrics | head -20

# From Kubernetes:
kubectl exec -n monitoring prometheus-0 -- wget -qO- http://<svc>.<ns>:<port>/metrics | head -20
Expected: First 20 lines of /metrics text-format output. If you see HTML, the target isn't actually an exporter. If you see nothing, it's a firewall / service / network-policy issue.
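
The three outcomes above (metrics text, HTML, nothing) can be told apart mechanically. The sketch below is a rough heuristic, not a parser; `classify_body` is a hypothetical helper and the 200-character window is an arbitrary choice.

```python
def classify_body(body: str) -> str:
    """Roughly classify what a /metrics endpoint returned."""
    text = body.lstrip()
    if not text:
        # Nothing came back: think firewall, service, or network policy.
        return "empty"
    head = text[:200].lower()
    if head.startswith("<!doctype") or "<html" in head:
        # A web page, so this endpoint is probably not an exporter.
        return "html"
    first = text.splitlines()[0]
    if first.startswith("#") or first[0].isalpha() or first[0] in "_:":
        # Starts with a comment or something shaped like a metric name.
        return "metrics-like"
    return "unknown"


print(classify_body("# HELP up Target health\nup 1\n"))
```

A `metrics-like` result only says the body starts plausibly; it does not mean the whole payload parses.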

Prometheus is strict about the exposition format (Prometheus text format or OpenMetrics). A stray non-UTF-8 byte or a malformed label fails the whole scrape.

# Use promtool to parse the output:
curl -s http://<target>:<port>/metrics | promtool check metrics 2>&1 | head -20

# Or manually look for bad lines:
curl -s http://<target>:<port>/metrics | grep -nE '^[^#a-zA-Z_:]' | head
Expected: `promtool check metrics` printing nothing = clean. A parse error (e.g. 'invalid UTF-8', 'label name is invalid') pinpoints the bad metric. Naming-style lint warnings are worth fixing but usually aren't what fails a scrape.
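
If promtool isn't available where you're debugging, a rough line scanner can still narrow things down. This sketch checks a simplified version of the text format (metric name, optional label set, numeric value, optional timestamp). It is an approximation, `bad_lines` is a hypothetical helper, and promtool remains the authority.

```python
import re

# Simplified shape of one sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# One label pair: name="value", with backslash escapes allowed in the value.
LABEL_RE = re.compile(r'\s*([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"\s*,?')


def bad_lines(text):
    """Return (line_number, reason) for lines that fail the simplified checks."""
    problems = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments are skipped
        m = SAMPLE_RE.match(line)
        if not m:
            problems.append((n, "not a sample line"))
            continue
        try:
            float(m.group("value"))  # accepts NaN, +Inf, -Inf as well
        except ValueError:
            problems.append((n, "value is not a number"))
            continue
        labels = m.group("labels")
        # Whatever is left after removing well-formed pairs is malformed.
        if labels and LABEL_RE.sub("", labels).strip():
            problems.append((n, "malformed label set"))
    return problems


print(bad_lines('up 1\nm{bad-label="x"} 1\n'))
```

Python's `float()` is slightly more permissive than the real parser, so a clean result here is a hint, not proof.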

Most 'handshake failure' and '401' errors are auth misconfig, not network.

# Prometheus config sample for bearer auth:
# - job_name: api
#   authorization:
#     type: Bearer
#     credentials_file: /etc/prometheus/api.token
#   scheme: https
#   tls_config:
#     ca_file: /etc/ssl/certs/ca-certificates.crt

# Verify the token and TLS from prometheus:
curl -sS -H "Authorization: Bearer $(cat /etc/prometheus/api.token)" https://<target>/metrics --cacert /etc/ssl/certs/ca-certificates.crt | head
Expected: 200 OK + metrics means Prometheus's identity is fine. 401 = token missing/expired. TLS error = cert chain / SNI / SAN mismatch.
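
The same check can be run from Python when `curl` isn't present in the container. This is a sketch under assumptions: `bearer_headers` and `probe` are hypothetical helpers, and the paths passed to `probe` should be the same token and CA files your job references. Stripping the token mirrors what `$(cat ...)` does to the trailing newline in the shell command.

```python
import ssl
import urllib.error
import urllib.request


def bearer_headers(raw_token):
    """Build the Authorization header from the raw contents of a token file."""
    token = raw_token.strip()
    if not token:
        raise ValueError("token is empty")
    return {"Authorization": f"Bearer {token}"}


def probe(url, token_file, ca_file, timeout_s=10):
    """Return a one-line verdict: HTTP status, or the connection/TLS error."""
    with open(token_file, encoding="utf-8") as fh:
        headers = bearer_headers(fh.read())
    ctx = ssl.create_default_context(cafile=ca_file)  # verify against this CA bundle
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=timeout_s) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # 401, 403, ... (must precede URLError)
        return f"HTTP {exc.code}"
    except urllib.error.URLError as exc:   # DNS, refused, certificate problems
        return f"connection/TLS error: {exc.reason}"


print(bearer_headers("example-token\n"))
```

A call such as `probe("https://<target>/metrics", "/etc/prometheus/api.token", "/etc/ssl/certs/ca-certificates.crt")` then separates a 401 from a certificate failure in a single line of output.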

A `keep` action with the wrong regex silently removes targets from the scrape pool before they even show as DOWN — they just vanish.

# Compare what service discovery returns vs what ends up scraped:
# Quote the URLs: an unquoted `?` is a glob character and fails outright in zsh.
curl -s 'http://localhost:9090/api/v1/targets?state=active' | python3 -c 'import json,sys; d=json.load(sys.stdin); print("active:", len(d["data"]["activeTargets"]))'
curl -s 'http://localhost:9090/api/v1/targets?state=dropped' | python3 -c 'import json,sys; d=json.load(sys.stdin); print("dropped:", len(d["data"]["droppedTargets"]))'
Expected: If `dropped` includes your target, a relabel rule is filtering it. Check `__meta_*` labels on the dropped entry.
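
A common reason a `keep` rule drops everything is that Prometheus anchors the relabel regex at both ends, so a pattern meant as a substring match never matches. The sketch below imitates that behaviour with Python's `re.fullmatch`. `keep_survives` is a hypothetical helper, the label values are made up, and Python's regex dialect is not identical to the RE2 syntax Prometheus uses.

```python
import re


def keep_survives(labels, source_labels, regex, separator=";"):
    """Approximate a `keep` rule: join the source labels, then require a full match."""
    # Labels that are absent are treated as empty strings.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


# Hypothetical discovered labels for one pod.
discovered = {
    "__meta_kubernetes_namespace": "payments-prod",
    "__meta_kubernetes_pod_annotation_prometheus_io_scrape": "true",
}

# "prod" must match the whole value, so this target would be dropped.
print(keep_survives(discovered, ["__meta_kubernetes_namespace"], "prod"))
# ".*prod" matches the full string, so this target would be kept.
print(keep_survives(discovered, ["__meta_kubernetes_namespace"], ".*prod"))
```

Feeding in the `discoveredLabels` of a dropped entry together with the rule's `source_labels` and `regex` shows quickly whether the rule is the culprit.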

With the default `scrape_timeout` of 10s, a target that takes longer than 10s to answer times out. A scrape that takes longer than the interval also misbehaves.

# Measure /metrics response time:
time curl -sS -o /dev/null http://<target>:<port>/metrics

# Adjust in prometheus.yml:
# - job_name: slow-exporter
#   scrape_interval: 60s
#   scrape_timeout: 30s
Expected: `scrape_timeout` must not exceed `scrape_interval` (Prometheus rejects a config where it does). Rule of thumb: timeout = 50-80% of interval.
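
The timing rules can be checked against the number you measured with `time curl`. This is a minimal sketch: `timing_problems` is a hypothetical helper, and the 80% headroom threshold is an assumption chosen for illustration, not a Prometheus setting.

```python
def timing_problems(duration_s, timeout_s, interval_s):
    """List timing issues for a measured /metrics duration and a job's settings."""
    problems = []
    if timeout_s > interval_s:
        problems.append("scrape_timeout exceeds scrape_interval")
    if duration_s >= timeout_s:
        problems.append("target is slower than scrape_timeout")
    elif duration_s > 0.8 * timeout_s:
        # Assumed threshold: warn when little headroom is left.
        problems.append("target uses over 80% of scrape_timeout")
    return problems


# A 12s response against the default 10s timeout and a 15s interval.
print(timing_problems(12.0, 10.0, 15.0))
```

An empty list means the measured duration fits comfortably inside the configured timeout and interval.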

Prometheus rereads its config on SIGHUP, or on a POST to /-/reload if `--web.enable-lifecycle` is set.

curl -sS -X POST http://localhost:9090/-/reload && echo 'reloaded'
# Or: kill -HUP $(pidof prometheus)

# Confirm target went UP:
sleep 5 && curl -s http://localhost:9090/api/v1/targets | python3 -c 'import json,sys; [print(t["labels"].get("job"), t["health"]) for t in json.load(sys.stdin)["data"]["activeTargets"]]'
Expected: All jobs show `up`. The target starts accumulating samples; verify with a PromQL query like `up{job="<name>"}`.
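
A fixed `sleep 5` can be shorter than the job's scrape interval, in which case the target has not been scraped again yet. Polling a few times is more forgiving. In this sketch `wait_until_up` is a hypothetical helper and `fetch_targets` is any callable that returns the parsed `/api/v1/targets` JSON; it is injected so the loop can be exercised without a live server.

```python
import time


def wait_until_up(fetch_targets, job, attempts=6, delay_s=5.0, sleep=time.sleep):
    """Poll until every active target of `job` reports health 'up'."""
    for attempt in range(attempts):
        payload = fetch_targets()
        states = [
            t["health"]
            for t in payload["data"]["activeTargets"]
            if t.get("labels", {}).get("job") == job
        ]
        # Require at least one target, so a vanished job does not count as healthy.
        if states and all(s == "up" for s in states):
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Simulated responses: down on the first poll, up on the second.
responses = iter([
    {"data": {"activeTargets": [{"labels": {"job": "api"}, "health": "down"}]}},
    {"data": {"activeTargets": [{"labels": {"job": "api"}, "health": "up"}]}},
])
print(wait_until_up(lambda: next(responses), "api", attempts=3, sleep=lambda s: None))
```

Against a real server, `fetch_targets` could be a small function that requests `http://localhost:9090/api/v1/targets` and returns the result of `json.load`.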