Skip to content

Prometheus Operator

For clusters running the Prometheus Operator, kpod-metrics provides a ServiceMonitor and PrometheusRule.

Enable

serviceMonitor:
  enabled: true
  interval: 30s

prometheusRule:
  enabled: true

ServiceMonitor

When enabled, the Helm chart creates a ServiceMonitor that configures Prometheus to scrape kpod-metrics pods automatically. No manual scrape config needed.

Options:

serviceMonitor:
  enabled: true
  interval: 30s        # scrape interval
  scrapeTimeout: 10s   # per-scrape timeout
  labels: {}           # extra labels for the ServiceMonitor
  annotations: {}      # extra annotations

Alerting Rules

The PrometheusRule provisions 18 alerting rules:

Alert Description
High runqueue latency CPU scheduling delays above threshold
TCP retransmit rate Elevated retransmissions per pod
TCP drop rate Packets being dropped
Syscall error rate High syscall failure ratio
Filesystem full Filesystem usage above 90%
BPF map near capacity BPF map entries approaching limit
Container restart rate Frequent container restarts
Crash loop detection Containers in crash loop
Memory pressure High memory usage relative to limits
Collector skip rate Collectors being skipped too frequently
Fork/exec bomb Abnormal process creation rate
OOM kills OOM events detected
Collection timeout Collection cycle exceeding timeout
High disk I/O latency Block I/O latency above threshold
Network interface errors Interface-level errors
IRQ latency spike Interrupt handling delays
BPF program load failure BPF program failed to load
High collector error rate Collector failures above threshold

Recording Rules

17 recording rules for precomputed aggregations:

  • p50/p90/p99 for CPU runqueue latency
  • p50/p90/p99 for TCP RTT
  • p50/p90/p99 for syscall latency
  • p50/p90/p99 for disk I/O latency
  • p50/p90/p99 for IRQ latency
  • Rate aggregations for counters

These recording rules are used by the included Grafana dashboard for efficient rendering.