Skip to content

Changelog

All notable changes to this project will be documented in this file.

[1.11.0] - 2026-03-07

Added

  • Grafana dashboard: p90 percentile lines for CPU runqueue, TCP RTT, syscall, and disk I/O latency panels
  • Grafana dashboard: p50/p90 queries added to syscall latency and block I/O latency panels
  • Helm _helpers.tpl: centralized label/selector templates replacing 13 hardcoded blocks
  • .github/CODEOWNERS: automatic reviewer assignment
  • Dependabot: docker ecosystem for base image update tracking

Changed

  • All dashboard latency panels now use pre-computed recording rules for efficient rendering
  • Dockerfile: added root requirement comment and AS runtime stage alias

[1.10.0] - 2026-03-07

Added

  • p90 recording rules for CPU runqueue latency, TCP RTT, syscall latency, and disk I/O latency (recording rules: 13 → 17)
  • KpodHighForkRate alert: fork bomb detection (> 100 forks/s for 5m)
  • KpodHighExecRate alert: abnormal process spawning (> 50 execs/s for 5m)
  • Schema: cpuProfile and bpfMapStats collector entries in values.schema.json

Fixed

  • NOTES.txt and README: corrected stale PrometheusRule alert/recording rule counts
  • README: updated image tag reference from 1.6.0 to 1.10.0
  • E2E workflow: fixed label selector app=kpod-metricsapp.kubernetes.io/name=kpod-metrics
  • E2E workflow: expanded metric validation from 3 to 9 metrics with health status check

[1.9.0] - 2026-03-07

Added

  • Helm test: second container verifies /actuator/prometheus returns kpod metrics
  • emptyDir /tmp volume (tmpfs, 64Mi) for JVM temporary files, enabling readOnlyRootFilesystem
  • SECURITY.md: "Why Root Is Required" section, writable paths, optional network egress documentation
  • Spring context tests: conditional bean verification for BPF, OTLP, and discovery health indicator (5 tests across 2 classes)
  • CI: strict Helm lint, template dry-runs (default/OTLP/profiling/secrets), schema validation

Changed

  • CI helm-lint job upgraded from basic helm lint to --strict mode with 5 additional validation steps

[1.8.0] - 2026-03-07

Added

  • Unit tests for AnomalyEndpoint (4 tests: delegation, defaults, sensitivity validation)
  • Unit tests for RecommendEndpoint (8 tests: delegation, defaults, confidence clamping, time expression parsing, label validation)
  • README: OTLP Secret support documentation with existingSecret example
  • README: Analysis Endpoints section documenting /actuator/kpodAnomaly and /actuator/kpodRecommend with curl examples
  • Troubleshooting guide: OTLP export, analysis endpoints, NetworkPolicy sections
  • CONTRIBUTING.md: step for updating values.schema.json when modifying Helm values

Changed

  • README: updated test count (201 → 293), image tag (1.6.0 → 1.8.0)

[1.7.0] - 2026-03-07

Added

  • OTLP Secret support: otlp.existingSecret references a K8s Secret for OTLP headers instead of plaintext ConfigMap
  • Pyroscope Secret support: profiling.pyroscope.existingSecret references a Secret for auth tokens
  • NetworkPolicy conditional egress: OTLP (4317/4318) and Pyroscope (4040) egress rules auto-added when features are enabled
  • values.schema.json for Helm chart validation, IDE autocomplete, and helm lint enforcement
  • Validates profile enum, interval minimums, URI formats, and required fields

Fixed

  • config.pollInterval default mismatch: values.yaml now defaults to 30000ms (was 29000ms)

[1.6.0] - 2026-03-01

Added

  • README: OTLP export documentation with full configuration example
  • README: kpod.collector.skipped.total and kpod.bpf.program.load.duration in Self-Monitoring metrics table
  • README: OTLP properties (kpod.otlp.*) in Key Properties table
  • Grafana dashboard: Collector Skip Rate panel (time series, ID 104)
  • Grafana dashboard: BPF Program Load Duration panel (stat, ID 105)
  • PrometheusRule: kpod:collector_skip_rate:5m recording rule
  • PrometheusRule: KpodHighCollectorSkipRate alert (> 10% skip rate for 10m)
  • kpod.initial-delay exposed as initialDelay property in MetricsProperties

Changed

  • Fixed poll-interval default mismatch: application.yml now defaults to 30000ms (was 15000ms), matching MetricsProperties and Helm values
  • README: updated test count (140 → 201), image tag (1.3.0 → 1.6.0), architecture diagram
  • README: added MemoryCgroupCollector to architecture diagram
  • README: updated PrometheusRule counts (17 alerts + 13 recording rules)
  • Grafana dashboard synced between standalone and Helm copies (49 panels total)

[1.5.0] - 2026-03-01

Added

  • kpod.collector.skipped.total counter — tracks interval-based collector skips per collector
  • kpod.bpf.program.load.duration timer — measures BPF program load time per program
  • lastCollectorErrors in DiagnosticsEndpoint — shows last error message per collector
  • enabledCollectorCount in DiagnosticsEndpoint — runtime count of active collectors
  • Helm probe customization: probes.startup, probes.liveness, probes.readiness in values.yaml
  • Configurable periodSeconds, failureThreshold, initialDelaySeconds

Changed

  • MetricsCollectorService tracks last error per collector for diagnostics
  • BpfProgramManager.tryLoadProgram records load duration via Micrometer timer
  • DaemonSet template uses {{ .Values.probes.* }} instead of hardcoded probe intervals

[1.4.0] - 2026-03-01

Added

  • OpenTelemetry/OTLP metrics export via micrometer-registry-otlp
  • Configurable via kpod.otlp.enabled, kpod.otlp.endpoint, kpod.otlp.headers, kpod.otlp.step
  • Push metrics to any OTLP-compatible collector alongside Prometheus scraping
  • CollectorConfigHealthIndicator: health check reports DOWN when all collectors are disabled
  • getEnabledCollectorCount() on MetricsCollectorService for runtime inspection
  • Enhanced Helm NOTES.txt: shows active profile, poll/timeout settings, conditional sections for ServiceMonitor, PrometheusRule, Grafana dashboard, and OTLP export
  • Helm values: otlp.enabled, otlp.endpoint, otlp.headers, otlp.step

Changed

  • BpfAutoConfiguration conditionally creates OtlpMeterRegistry bean when kpod.otlp.enabled=true
  • Helm ConfigMap renders OTLP configuration under kpod.otlp when enabled

[1.3.0] - 2026-03-01

Added

  • Per-collector interval configuration (kpod.collector-intervals.<name>)
  • Heavy collectors (syscall, biolatency, hardirqs) can run at longer intervals
  • Default: all collectors run every cycle (backward compatible)
  • Tracks per-collector last-run timestamps; skips when interval hasn't elapsed
  • PrometheusRule alerts: KpodHighRestartRate (> 3 restarts/15min) and KpodPodCrashLooping (> 5 restarts/30min)
  • Helm DaemonSet: affinity and topologySpreadConstraints support
  • README: comprehensive configuration reference (40+ properties documented)
  • README: memory cgroup, pod lifecycle, and self-monitoring metric tables

Changed

  • MetricsCollectorService accepts CollectorIntervals for per-collector scheduling
  • BpfAutoConfiguration passes collectorIntervals and basePollInterval to service
  • Helm ConfigMap renders collector-intervals when configured
  • README Prometheus Operator section updated to reflect 15 alert rules + 12 recording rules

[1.2.0] - 2026-03-01

Added

  • Container restart tracking: kpod.container.restarts gauge per pod/container
  • Automatically updated from K8s container status restartCount
  • Cleaned up on pod deletion to prevent cardinality growth
  • Grafana dashboard: Container Restarts panel in Process Activity row
  • KubeletPodProvider now captures container restart counts

Changed

  • Dashboard histogram queries optimized to use precomputed recording rules
  • 6 queries replaced: runqueue latency p50/p99, TCP RTT p50/p99, syscall latency p99
  • Reduces Prometheus query load on large clusters
  • Helm dashboard ConfigMap synced with standalone dashboard (fixes v1.1.0 panel gap)
  • PodWatcher now accepts optional MeterRegistry for restart gauge registration

Fixed

  • Helm dashboards copy missing 5 panels added in v1.1.0 (interface errors/drops/packets, buffer dirty, fs available)

[1.1.0] - 2026-03-01

Added

  • Service always created (decoupled from ServiceMonitor) — standalone Prometheus can now scrape without Operator
  • DaemonSet updateStrategy with configurable maxUnavailable (default: 1) for controlled rollouts
  • extraEnv support in Helm values for JVM tuning, proxy settings, etc.
  • Grafana dashboard: Interface Errors, Interface Drops, Interface Packets panels (Network row)
  • Grafana dashboard: Buffer Dirty Rate panel (Memory & Cache row)
  • Grafana dashboard: Filesystem Available panel (Disk & Filesystem row)
  • 12 Prometheus recording rules for precomputed aggregations (p50/p99 latencies, error ratios, rates)
  • Chart.yaml metadata: home, sources, keywords, maintainers for chart discoverability

Changed

  • PrometheusRule now has two rule groups: kpod-metrics.recording and kpod-metrics (alerting)
  • Helm chart version and appVersion bumped to 1.1.0

[1.0.0] - 2026-03-01

Added

  • Spring graceful shutdown (server.shutdown=graceful, 30s timeout) for clean request draining
  • DiscoveryHealthIndicator: readiness check that reports DOWN if no pods discovered after 60s grace period
  • Initial collection delay (kpod.initial-delay, default 10s) to allow PodWatcher to discover pods before first cycle
  • Collection overlap guard: skips cycle with warning if previous cycle is still running
  • Helm test pod (helm test) for health endpoint validation
  • Dockerfile HEALTHCHECK for non-Kubernetes environments

Changed

  • MetricsCollectorService.collect() uses compareAndSet for atomic overlap detection
  • Helm ConfigMap includes graceful shutdown configuration
  • build.gradle.kts version set to 1.0.0

[0.9.0] - 2026-03-01

Added

  • Pod label propagation to metrics: configurable kpod.filter.include-labels whitelist
  • Pod labels (e.g., app=nginx) appear as label_app="nginx" metric tags
  • Labels filtered at PodCgroupMapper level to control cardinality
  • Metric staleness cleanup: Micrometer meters for deleted pods are automatically removed
  • Prevents cardinality growth from pod churn in long-running clusters
  • Cleans gauge stores in FilesystemCollector and MemoryCgroupCollector
  • Label selector filtering: kpod.filter.label-selector now functional
  • Supports key=value, key!=value, and key (exists) selectors
  • Comma-separated for multiple terms (AND logic)
  • PodCgroupTarget.tags() helper for consistent tag construction across cgroup collectors

Changed

  • PodCgroupMapper now accepts includeLabels to filter pod labels at discovery time
  • PodWatcher.shouldWatch() now evaluates label selectors alongside namespace filters
  • Cgroup collectors use target.tags() instead of manual tag construction
  • Helm ConfigMap renders filter config (namespaces, excludeNamespaces, labelSelector, includeLabels)

[0.8.0] - 2026-03-01

Added

  • Memory cgroup collector: kpod.mem.cgroup.usage.bytes, kpod.mem.cgroup.peak.bytes, kpod.mem.cgroup.cache.bytes, kpod.mem.cgroup.swap.bytes — supports cgroup v1 and v2
  • Multi-cluster label injection via kpod.cluster-name — adds cluster common tag to all metrics
  • node common tag automatically applied to all metrics via MeterRegistryCustomizer
  • Grafana dashboard: Memory (Cgroup) row with usage, peak, cache, swap panels
  • Grafana dashboard: Operational row with cgroup read error rate and collection timeout panels
  • PrometheusRule alerts: KpodCgroupReadErrors, KpodCollectionTimeouts, KpodMemoryPressure

Changed

  • Helm values: added config.clusterName option
  • Helm ConfigMap: renders cluster-name when set

[0.7.0] - 2026-03-01

Added

  • Per-target error handling in cgroup collectors (DiskIO, InterfaceNetwork, Filesystem)
  • kpod.cgroup.read.errors counter per collector — one pod's failure no longer blocks others
  • Graceful shutdown with drain timeout — waits for in-flight collection cycle to complete
  • /actuator/kpodDiagnostics endpoint: uptime, collector states, BPF program status, profile summary
  • Startup cardinality estimation with configurable warning threshold
  • Logs estimated metric series count per profile at boot
  • Warns if estimate exceeds 100k (configurable)

Changed

  • MetricsCollectorService.close() now drains in-flight cycles before shutting down executor
  • Cgroup collector tests updated for error counter registration at construction

[0.6.0] - 2026-03-01

Added

  • Collection cycle timeout (kpod.collection-timeout) with kpod.collection.timeouts.total counter
  • Per-collector enable/disable overrides (kpod.collectors.*) on top of profiles
  • E2E test CI workflow (.github/workflows/e2e.yml) with minikube, weekly schedule + manual dispatch

Changed

  • Collection cycle now wrapped in withTimeoutOrNull for bounded execution
  • Helm ConfigMap template renders collection-timeout and collectors config
  • Added kotlin("test") dependency for test assertions

[0.5.0] - 2026-03-01

Added

  • Helm chart: imagePullSecrets support for private registries
  • Helm chart: priorityClassName support for guaranteed scheduling
  • Helm chart: allowPrivilegeEscalation: false and drop: ALL capabilities hardening
  • Helm chart: optional seccompProfile configuration
  • Helm chart: NetworkPolicy template (networkPolicy.enabled)
  • Enhanced NOTES.txt with health check verification and feature status
  • Troubleshooting guide (docs/troubleshooting.md)
  • CONTRIBUTING.md with development setup and PR process
  • SECURITY.md with security model and vulnerability reporting

[0.4.0] - 2026-03-01

Added

  • Self-monitoring metrics for collection pipeline health:
  • kpod.collection.cycle.duration — timer for full collection cycle
  • kpod.collector.duration — timer per collector (tagged by collector name)
  • kpod.collector.errors.total — counter per collector for failure tracking
  • kpod.discovery.pods.total — gauge of discovered pods per cycle
  • kpod.bpf.programs.loaded / kpod.bpf.programs.failed — BPF program load status
  • Custom Spring Boot health indicators for Kubernetes probes:
  • BpfHealthIndicator — reports DOWN when BPF programs fail to load
  • CollectionHealthIndicator — reports DOWN when collection is stale (3x poll interval)
  • Per-program graceful BPF load failures (partial degradation instead of full failure)
  • Grafana dashboard Row 10: Collection Health (7 panels)
  • PrometheusRule alerts: KpodCollectorErrors, KpodNoBpfPrograms

[0.3.0] - 2026-02-28

Added

  • Multi-arch Docker image builds (linux/amd64 + linux/arm64) via buildx
  • Container image vulnerability scanning with Trivy (CRITICAL/HIGH)
  • Automated release workflow triggered on version tag push
  • Helm chart linting in CI pipeline

Changed

  • Dockerfile now auto-detects target architecture from buildx TARGETARCH
  • Publish workflow uses docker/build-push-action with multi-platform support
  • Trivy scan results uploaded to GitHub Security tab (SARIF format)

[0.2.0] - 2026-02-28

Added

  • Grafana dashboard with 9 rows and 29 panels covering all kpod-metrics collectors
  • Auto-provisioned via Grafana sidecar ConfigMap (Helm-managed)
  • Standalone JSON available at grafana/kpod-metrics-dashboard.json
  • Prometheus Operator integration (ServiceMonitor + PrometheusRule)
  • Headless Service for ServiceMonitor pod discovery
  • 8 production alerting rules (runqueue latency, TCP retransmits/drops, syscall errors, filesystem usage, BPF map capacity/errors, target down)
  • BCC tool collectors: BiolatencyCollector, CachestatCollector, TcpdropCollector, HardirqsCollector, SoftirqsCollector, ExecsnoopCollector

Fixed

  • Dockerfile compatibility with legacy Docker builder (non-BuildKit)

[0.1.0] - 2026-02-27

Added

  • Core eBPF collectors: CPU scheduling, network, memory, syscall
  • Cgroup collectors: disk I/O, interface network, filesystem
  • BPF map diagnostics collector
  • Dual kernel support: CO-RE (5.2+ with BTF) and legacy (4.18-5.1)
  • JNI bridge wrapping libbpf for BPF program lifecycle
  • Kotlin DSL code generation for eBPF programs (via kotlin-ebpf-dsl)
  • Spring Boot application with virtual threads and Micrometer/Prometheus export
  • Pod discovery via K8s informer or Kubelet API
  • Profile system: minimal, standard, comprehensive, custom
  • Helm chart with DaemonSet, RBAC, ConfigMap, PDB
  • Multi-stage Dockerfile (codegen, BPF compile, JNI build, app build, runtime)
  • GitHub Actions CI/CD (unit tests, image publish)
  • E2E and integration test scripts (minikube)