# Changelog

All notable changes to this project will be documented in this file.
## [1.11.0] - 2026-03-07

### Added

- Grafana dashboard: p90 percentile lines for CPU runqueue, TCP RTT, syscall, and disk I/O latency panels
- Grafana dashboard: p50/p90 queries added to syscall latency and block I/O latency panels
- Helm `_helpers.tpl`: centralized label/selector templates replacing 13 hardcoded blocks
- `.github/CODEOWNERS`: automatic reviewer assignment
- Dependabot: docker ecosystem for base image update tracking

### Changed

- All dashboard latency panels now use pre-computed recording rules for efficient rendering
- Dockerfile: added root requirement comment and `AS runtime` stage alias
## [1.10.0] - 2026-03-07

### Added

- p90 recording rules for CPU runqueue latency, TCP RTT, syscall latency, and disk I/O latency (recording rules: 13 → 17)
- `KpodHighForkRate` alert: fork bomb detection (> 100 forks/s for 5m)
- `KpodHighExecRate` alert: abnormal process spawning (> 50 execs/s for 5m)
- Schema: `cpuProfile` and `bpfMapStats` collector entries in `values.schema.json`

### Fixed

- NOTES.txt and README: corrected stale PrometheusRule alert/recording rule counts
- README: updated image tag reference from 1.6.0 to 1.10.0
- E2E workflow: fixed label selector `app=kpod-metrics` → `app.kubernetes.io/name=kpod-metrics`
- E2E workflow: expanded metric validation from 3 to 9 metrics with health status check
## [1.9.0] - 2026-03-07

### Added

- Helm test: second container verifies `/actuator/prometheus` returns kpod metrics
- emptyDir `/tmp` volume (tmpfs, 64Mi) for JVM temporary files, enabling `readOnlyRootFilesystem`
- SECURITY.md: "Why Root Is Required" section, writable paths, optional network egress documentation
- Spring context tests: conditional bean verification for BPF, OTLP, and discovery health indicator (5 tests across 2 classes)
- CI: strict Helm lint, template dry-runs (default/OTLP/profiling/secrets), schema validation

### Changed

- CI `helm-lint` job upgraded from basic `helm lint` to `--strict` mode with 5 additional validation steps
## [1.8.0] - 2026-03-07

### Added

- Unit tests for `AnomalyEndpoint` (4 tests: delegation, defaults, sensitivity validation)
- Unit tests for `RecommendEndpoint` (8 tests: delegation, defaults, confidence clamping, time expression parsing, label validation)
- README: OTLP Secret support documentation with `existingSecret` example
- README: Analysis Endpoints section documenting `/actuator/kpodAnomaly` and `/actuator/kpodRecommend` with curl examples
- Troubleshooting guide: OTLP export, analysis endpoints, NetworkPolicy sections
- CONTRIBUTING.md: step for updating `values.schema.json` when modifying Helm values

### Changed

- README: updated test count (201 → 293), image tag (1.6.0 → 1.8.0)
## [1.7.0] - 2026-03-07

### Added

- OTLP Secret support: `otlp.existingSecret` references a K8s Secret for OTLP headers instead of plaintext ConfigMap
- Pyroscope Secret support: `profiling.pyroscope.existingSecret` references a Secret for auth tokens
- NetworkPolicy conditional egress: OTLP (4317/4318) and Pyroscope (4040) egress rules auto-added when features are enabled
- `values.schema.json` for Helm chart validation, IDE autocomplete, and `helm lint` enforcement
  - Validates profile enum, interval minimums, URI formats, and required fields

### Fixed

- `config.pollInterval` default mismatch: values.yaml now defaults to 30000ms (was 29000ms)
## [1.6.0] - 2026-03-01

### Added

- README: OTLP export documentation with full configuration example
- README: `kpod.collector.skipped.total` and `kpod.bpf.program.load.duration` in Self-Monitoring metrics table
- README: OTLP properties (`kpod.otlp.*`) in Key Properties table
- Grafana dashboard: Collector Skip Rate panel (time series, ID 104)
- Grafana dashboard: BPF Program Load Duration panel (stat, ID 105)
- PrometheusRule: `kpod:collector_skip_rate:5m` recording rule
- PrometheusRule: `KpodHighCollectorSkipRate` alert (> 10% skip rate for 10m)
- `kpod.initial-delay` exposed as `initialDelay` property in MetricsProperties

### Changed

- Fixed `poll-interval` default mismatch: `application.yml` now defaults to 30000ms (was 15000ms), matching MetricsProperties and Helm values
- README: updated test count (140 → 201), image tag (1.3.0 → 1.6.0), architecture diagram
- README: added MemoryCgroupCollector to architecture diagram
- README: updated PrometheusRule counts (17 alerts + 13 recording rules)
- Grafana dashboard synced between standalone and Helm copies (49 panels total)
## [1.5.0] - 2026-03-01

### Added

- `kpod.collector.skipped.total` counter — tracks interval-based collector skips per collector
- `kpod.bpf.program.load.duration` timer — measures BPF program load time per program
- `lastCollectorErrors` in DiagnosticsEndpoint — shows last error message per collector
- `enabledCollectorCount` in DiagnosticsEndpoint — runtime count of active collectors
- Helm probe customization: `probes.startup`, `probes.liveness`, `probes.readiness` in values.yaml
  - Configurable `periodSeconds`, `failureThreshold`, `initialDelaySeconds`

### Changed

- `MetricsCollectorService` tracks last error per collector for diagnostics
- `BpfProgramManager.tryLoadProgram` records load duration via Micrometer timer
- DaemonSet template uses `{{ .Values.probes.* }}` instead of hardcoded probe intervals
## [1.4.0] - 2026-03-01

### Added

- OpenTelemetry/OTLP metrics export via `micrometer-registry-otlp`
  - Configurable via `kpod.otlp.enabled`, `kpod.otlp.endpoint`, `kpod.otlp.headers`, `kpod.otlp.step`
  - Push metrics to any OTLP-compatible collector alongside Prometheus scraping
- `CollectorConfigHealthIndicator`: health check reports DOWN when all collectors are disabled
- `getEnabledCollectorCount()` on MetricsCollectorService for runtime inspection
- Enhanced Helm NOTES.txt: shows active profile, poll/timeout settings, conditional sections for ServiceMonitor, PrometheusRule, Grafana dashboard, and OTLP export
- Helm values: `otlp.enabled`, `otlp.endpoint`, `otlp.headers`, `otlp.step`

### Changed

- `BpfAutoConfiguration` conditionally creates `OtlpMeterRegistry` bean when `kpod.otlp.enabled=true`
- Helm ConfigMap renders OTLP configuration under `kpod.otlp` when enabled
## [1.3.0] - 2026-03-01

### Added

- Per-collector interval configuration (`kpod.collector-intervals.<name>`)
  - Heavy collectors (syscall, biolatency, hardirqs) can run at longer intervals
  - Default: all collectors run every cycle (backward compatible)
  - Tracks per-collector last-run timestamps; skips when interval hasn't elapsed
- PrometheusRule alerts: `KpodHighRestartRate` (> 3 restarts/15min) and `KpodPodCrashLooping` (> 5 restarts/30min)
- Helm DaemonSet: `affinity` and `topologySpreadConstraints` support
- README: comprehensive configuration reference (40+ properties documented)
- README: memory cgroup, pod lifecycle, and self-monitoring metric tables
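The interval-based scheduling above can be sketched as follows. This is a minimal illustration, not the actual `MetricsCollectorService` internals; `IntervalGate` and its members are hypothetical names.

```kotlin
// Sketch of per-collector interval gating (illustrative names): each
// collector records when it last ran and is skipped until its configured
// interval has elapsed. Collectors without an override run every cycle.
class IntervalGate(private val intervalsMs: Map<String, Long>) {
    private val lastRunMs = mutableMapOf<String, Long>()

    /** Returns true (and records the run) if the collector is due this cycle. */
    fun shouldRun(collector: String, nowMs: Long): Boolean {
        val interval = intervalsMs[collector] ?: 0L // no override: always due
        val last = lastRunMs[collector]
        if (last != null && nowMs - last < interval) return false
        lastRunMs[collector] = nowMs
        return true
    }
}
```

For example, with a 30s base cycle and `syscall` configured at 60000ms, the syscall collector would run on every other cycle while the rest run each time.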
### Changed

- `MetricsCollectorService` accepts `CollectorIntervals` for per-collector scheduling
- `BpfAutoConfiguration` passes `collectorIntervals` and `basePollInterval` to service
- Helm ConfigMap renders `collector-intervals` when configured
- README Prometheus Operator section updated to reflect 15 alert rules + 12 recording rules
## [1.2.0] - 2026-03-01

### Added

- Container restart tracking: `kpod.container.restarts` gauge per pod/container
  - Automatically updated from K8s container status `restartCount`
  - Cleaned up on pod deletion to prevent cardinality growth
- Grafana dashboard: Container Restarts panel in Process Activity row
- KubeletPodProvider now captures container restart counts
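The deletion cleanup mentioned above can be illustrated with a small sketch, where a plain map stands in for the Micrometer gauge store. `RestartTracker` and its methods are hypothetical names, not the project's actual classes.

```kotlin
// Illustrative sketch of restart tracking with pod-deletion cleanup.
// Each (pod, container) pair is one metric series; removing entries on
// deletion is what keeps cardinality from growing with pod churn.
class RestartTracker {
    private val restarts = mutableMapOf<Pair<String, String>, Int>()

    /** Mirror the restartCount from the container status into the gauge store. */
    fun onPodStatus(pod: String, container: String, restartCount: Int) {
        restarts[pod to container] = restartCount
    }

    /** Drop every series belonging to a deleted pod. */
    fun onPodDeleted(pod: String) {
        restarts.keys.removeAll { it.first == pod }
    }

    fun seriesCount(): Int = restarts.size
}
```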
### Changed

- Dashboard histogram queries optimized to use precomputed recording rules
  - 6 queries replaced: runqueue latency p50/p99, TCP RTT p50/p99, syscall latency p99
  - Reduces Prometheus query load on large clusters
- Helm dashboard ConfigMap synced with standalone dashboard (fixes v1.1.0 panel gap)
- PodWatcher now accepts optional `MeterRegistry` for restart gauge registration

### Fixed

- Helm dashboard copy was missing 5 panels added in v1.1.0 (interface errors/drops/packets, buffer dirty, fs available)
## [1.1.0] - 2026-03-01

### Added

- Service always created (decoupled from ServiceMonitor) — standalone Prometheus can now scrape without Operator
- DaemonSet `updateStrategy` with configurable `maxUnavailable` (default: 1) for controlled rollouts
- `extraEnv` support in Helm values for JVM tuning, proxy settings, etc.
- Grafana dashboard: Interface Errors, Interface Drops, Interface Packets panels (Network row)
- Grafana dashboard: Buffer Dirty Rate panel (Memory & Cache row)
- Grafana dashboard: Filesystem Available panel (Disk & Filesystem row)
- 12 Prometheus recording rules for precomputed aggregations (p50/p99 latencies, error ratios, rates)
- Chart.yaml metadata: home, sources, keywords, maintainers for chart discoverability

### Changed

- PrometheusRule now has two rule groups: `kpod-metrics.recording` and `kpod-metrics` (alerting)
- Helm chart version and appVersion bumped to 1.1.0
## [1.0.0] - 2026-03-01

### Added

- Spring graceful shutdown (`server.shutdown=graceful`, 30s timeout) for clean request draining
- `DiscoveryHealthIndicator`: readiness check that reports DOWN if no pods discovered after 60s grace period
- Initial collection delay (`kpod.initial-delay`, default 10s) to allow PodWatcher to discover pods before first cycle
- Collection overlap guard: skips cycle with warning if previous cycle is still running
- Helm test pod (`helm test`) for health endpoint validation
- Dockerfile `HEALTHCHECK` for non-Kubernetes environments

### Changed

- `MetricsCollectorService.collect()` uses `compareAndSet` for atomic overlap detection
- Helm ConfigMap includes graceful shutdown configuration
- `build.gradle.kts` version set to `1.0.0`
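The `compareAndSet`-based overlap guard can be sketched like this. The class and field names are illustrative, not the actual `MetricsCollectorService.collect()` source; only the atomic claim-then-release pattern is taken from the entry above.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Sketch of an atomic overlap guard: compareAndSet(false, true) claims the
// cycle in one step, so a tick that fires while the previous cycle is still
// running becomes a counted no-op instead of a concurrent collection.
class OverlapGuard {
    private val running = AtomicBoolean(false)
    var skipped = 0
        private set

    fun collect(work: () -> Unit) {
        if (!running.compareAndSet(false, true)) {
            skipped++ // previous cycle still in flight: skip (and warn)
            return
        }
        try {
            work()
        } finally {
            running.set(false) // release the claim even if the cycle throws
        }
    }
}
```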
## [0.9.0] - 2026-03-01

### Added

- Pod label propagation to metrics: configurable `kpod.filter.include-labels` whitelist
  - Pod labels (e.g., `app=nginx`) appear as `label_app="nginx"` metric tags
  - Labels filtered at PodCgroupMapper level to control cardinality
- Metric staleness cleanup: Micrometer meters for deleted pods are automatically removed
  - Prevents cardinality growth from pod churn in long-running clusters
  - Cleans gauge stores in FilesystemCollector and MemoryCgroupCollector
- Label selector filtering: `kpod.filter.label-selector` now functional
  - Supports `key=value`, `key!=value`, and `key` (exists) selectors
  - Comma-separated for multiple terms (AND logic)
- `PodCgroupTarget.tags()` helper for consistent tag construction across cgroup collectors
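The selector semantics described above (equality, inequality, existence, comma-separated AND) can be sketched in a few lines. This is a hypothetical standalone function, not the project's actual PodWatcher parsing code.

```kotlin
// Sketch of label-selector evaluation: key=value, key!=value, and a bare
// key for existence; comma-separated terms are ANDed together.
fun matchesSelector(selector: String, labels: Map<String, String>): Boolean =
    selector.split(',').map { it.trim() }.filter { it.isNotEmpty() }.all { term ->
        when {
            "!=" in term -> {
                val (k, v) = term.split("!=", limit = 2)
                labels[k.trim()] != v.trim()   // inequality (also true if absent)
            }
            "=" in term -> {
                val (k, v) = term.split("=", limit = 2)
                labels[k.trim()] == v.trim()   // equality
            }
            else -> term in labels             // bare key: existence check
        }
    }
```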
### Changed

- `PodCgroupMapper` now accepts `includeLabels` to filter pod labels at discovery time
- `PodWatcher.shouldWatch()` now evaluates label selectors alongside namespace filters
- Cgroup collectors use `target.tags()` instead of manual tag construction
- Helm ConfigMap renders filter config (namespaces, excludeNamespaces, labelSelector, includeLabels)
## [0.8.0] - 2026-03-01

### Added

- Memory cgroup collector: `kpod.mem.cgroup.usage.bytes`, `kpod.mem.cgroup.peak.bytes`, `kpod.mem.cgroup.cache.bytes`, `kpod.mem.cgroup.swap.bytes` — supports cgroup v1 and v2
- Multi-cluster label injection via `kpod.cluster-name` — adds `cluster` common tag to all metrics
- `node` common tag automatically applied to all metrics via `MeterRegistryCustomizer`
- Grafana dashboard: Memory (Cgroup) row with usage, peak, cache, swap panels
- Grafana dashboard: Operational row with cgroup read error rate and collection timeout panels
- PrometheusRule alerts: `KpodCgroupReadErrors`, `KpodCollectionTimeouts`, `KpodMemoryPressure`

### Changed

- Helm values: added `config.clusterName` option
- Helm ConfigMap: renders `cluster-name` when set
## [0.7.0] - 2026-03-01

### Added

- Per-target error handling in cgroup collectors (DiskIO, InterfaceNetwork, Filesystem)
- `kpod.cgroup.read.errors` counter per collector — one pod's failure no longer blocks others
- Graceful shutdown with drain timeout — waits for in-flight collection cycle to complete
- `/actuator/kpodDiagnostics` endpoint: uptime, collector states, BPF program status, profile summary
- Startup cardinality estimation with configurable warning threshold
  - Logs estimated metric series count per profile at boot
  - Warns if estimate exceeds 100k (configurable)
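The estimation is back-of-envelope arithmetic: series per pod times pod count, compared against the warning threshold. The sketch below uses hypothetical per-collector series counts; only the 100k default threshold comes from the entry above.

```kotlin
// Illustrative cardinality estimate: total series scales linearly with the
// number of discovered pods. The per-collector counts here are made up.
fun estimateSeries(podCount: Int, seriesPerPod: Map<String, Int>): Int =
    podCount * seriesPerPod.values.sum()

// Returns a warning message when the estimate exceeds the (configurable)
// threshold, or null when the profile looks safe.
fun cardinalityWarning(estimate: Int, threshold: Int = 100_000): String? =
    if (estimate > threshold)
        "Estimated $estimate metric series exceeds warning threshold $threshold"
    else null
```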
### Changed

- `MetricsCollectorService.close()` now drains in-flight cycles before shutting down executor
- Cgroup collector tests updated for error counter registration at construction
## [0.6.0] - 2026-03-01

### Added

- Collection cycle timeout (`kpod.collection-timeout`) with `kpod.collection.timeouts.total` counter
- Per-collector enable/disable overrides (`kpod.collectors.*`) on top of profiles
- E2E test CI workflow (`.github/workflows/e2e.yml`) with minikube, weekly schedule + manual dispatch

### Changed

- Collection cycle now wrapped in `withTimeoutOrNull` for bounded execution
- Helm ConfigMap template renders `collection-timeout` and `collectors` config
- Added `kotlin("test")` dependency for test assertions
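The bounded-cycle idea can be approximated with the JDK alone. The project uses coroutines' `withTimeoutOrNull`; in this stdlib-only sketch a `Future` with a timed `get()` stands in, and a timeout bumps a counter (mirroring `kpod.collection.timeouts.total`) instead of stalling the scheduler. All names here are illustrative.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeoutException

// Stdlib approximation of a bounded collection cycle: run the cycle on a
// worker thread, wait at most timeoutMs, and abandon it on overrun.
class BoundedCycle(private val timeoutMs: Long) {
    var timeouts = 0
        private set
    private val executor = Executors.newSingleThreadExecutor()

    fun run(cycle: () -> Unit) {
        val future = executor.submit { cycle() }
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS)
        } catch (e: TimeoutException) {
            future.cancel(true) // interrupt the overrunning cycle
            timeouts++          // mirrors kpod.collection.timeouts.total
        }
    }

    fun close() = executor.shutdownNow()
}
```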
## [0.5.0] - 2026-03-01

### Added

- Helm chart: `imagePullSecrets` support for private registries
- Helm chart: `priorityClassName` support for guaranteed scheduling
- Helm chart: `allowPrivilegeEscalation: false` and `drop: ALL` capabilities hardening
- Helm chart: optional `seccompProfile` configuration
- Helm chart: NetworkPolicy template (`networkPolicy.enabled`)
- Enhanced NOTES.txt with health check verification and feature status
- Troubleshooting guide (`docs/troubleshooting.md`)
- CONTRIBUTING.md with development setup and PR process
- SECURITY.md with security model and vulnerability reporting
## [0.4.0] - 2026-03-01

### Added

- Self-monitoring metrics for collection pipeline health:
  - `kpod.collection.cycle.duration` — timer for full collection cycle
  - `kpod.collector.duration` — timer per collector (tagged by collector name)
  - `kpod.collector.errors.total` — counter per collector for failure tracking
  - `kpod.discovery.pods.total` — gauge of discovered pods per cycle
  - `kpod.bpf.programs.loaded` / `kpod.bpf.programs.failed` — BPF program load status
- Custom Spring Boot health indicators for Kubernetes probes:
  - `BpfHealthIndicator` — reports DOWN when BPF programs fail to load
  - `CollectionHealthIndicator` — reports DOWN when collection is stale (3x poll interval)
- Per-program graceful BPF load failures (partial degradation instead of full failure)
- Grafana dashboard Row 10: Collection Health (7 panels)
- PrometheusRule alerts: `KpodCollectorErrors`, `KpodNoBpfPrograms`
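The staleness rule behind the collection health check reduces to a timestamp comparison. The sketch below is illustrative (names and wiring are hypothetical, and the real indicator plugs into Spring Boot's `HealthIndicator` interface); only the 3x-poll-interval rule comes from the entry above.

```kotlin
// Sketch of the staleness check: record the time of each completed cycle
// and report unhealthy when no cycle has finished within 3x the poll interval.
class StalenessCheck(private val pollIntervalMs: Long) {
    private var lastCycleMs: Long? = null

    fun recordCycle(nowMs: Long) {
        lastCycleMs = nowMs
    }

    /** DOWN (false) when collection hasn't completed within 3x the interval. */
    fun isHealthy(nowMs: Long): Boolean {
        val last = lastCycleMs ?: return false // nothing collected yet
        return nowMs - last <= 3 * pollIntervalMs
    }
}
```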
## [0.3.0] - 2026-02-28

### Added

- Multi-arch Docker image builds (linux/amd64 + linux/arm64) via buildx
- Container image vulnerability scanning with Trivy (CRITICAL/HIGH)
- Automated release workflow triggered on version tag push
- Helm chart linting in CI pipeline

### Changed

- Dockerfile now auto-detects target architecture from the buildx `TARGETARCH` build argument
- Publish workflow uses docker/build-push-action with multi-platform support
- Trivy scan results uploaded to GitHub Security tab (SARIF format)
## [0.2.0] - 2026-02-28

### Added

- Grafana dashboard with 9 rows and 29 panels covering all kpod-metrics collectors
  - Auto-provisioned via Grafana sidecar ConfigMap (Helm-managed)
  - Standalone JSON available at `grafana/kpod-metrics-dashboard.json`
- Prometheus Operator integration (ServiceMonitor + PrometheusRule)
- Headless Service for ServiceMonitor pod discovery
- 8 production alerting rules (runqueue latency, TCP retransmits/drops, syscall errors, filesystem usage, BPF map capacity/errors, target down)
- BCC tool collectors: BiolatencyCollector, CachestatCollector, TcpdropCollector, HardirqsCollector, SoftirqsCollector, ExecsnoopCollector

### Fixed

- Dockerfile compatibility with legacy Docker builder (non-BuildKit)
## [0.1.0] - 2026-02-27

### Added

- Core eBPF collectors: CPU scheduling, network, memory, syscall
- Cgroup collectors: disk I/O, interface network, filesystem
- BPF map diagnostics collector
- Dual kernel support: CO-RE (5.2+ with BTF) and legacy (4.18-5.1)
- JNI bridge wrapping libbpf for BPF program lifecycle
- Kotlin DSL code generation for eBPF programs (via kotlin-ebpf-dsl)
- Spring Boot application with virtual threads and Micrometer/Prometheus export
- Pod discovery via K8s informer or Kubelet API
- Profile system: minimal, standard, comprehensive, custom
- Helm chart with DaemonSet, RBAC, ConfigMap, PDB
- Multi-stage Dockerfile (codegen, BPF compile, JNI build, app build, runtime)
- GitHub Actions CI/CD (unit tests, image publish)
- E2E and integration test scripts (minikube)