
Ask ten DevOps engineers what keeps them up at night, and nine will say “the stuff I can’t see.” Containers spin up and down in seconds. Nodes fail silently. Memory leaks hide under layers of abstraction. That’s where OpenShift monitoring comes in. It isn’t a luxury. It’s the only way to know if your cluster is healthy or just five minutes from collapse. OpenShift, Red Hat’s enterprise Kubernetes platform, gives you a lot out of the box. But raw data isn’t the same as useful intelligence. You need a real strategy.
This guide walks through what OpenShift actually is, why monitoring it matters, the key metrics you cannot ignore, and a step by step plan to get it right.
What Is OpenShift?
OpenShift is a Kubernetes distribution. But calling it “just Kubernetes” misses the point. Red Hat took upstream Kubernetes and added security policies, built in logging, automated operators, and a developer friendly interface. Companies run OpenShift on bare metal, virtual machines, or any major cloud provider. It handles container orchestration, scaling, and application deployment.
The platform includes several components. The control plane manages the cluster state. Worker nodes run your actual workloads. OpenShift adds an integrated image registry, networking with OpenShift SDN or OVN-Kubernetes, and a monitoring stack right from installation. You get Prometheus, Alertmanager, and Grafana dashboards enabled by default. For the complete list of installed components, see the official OpenShift monitoring platform overview. That last part matters. Many Kubernetes distributions force you to install your own monitoring tools. OpenShift ships with them.
Cluster administrators interact with OpenShift using the web console or the oc command line tool. Developers push code, and OpenShift builds images, deploys containers, and routes traffic. The platform also includes security context constraints (SCCs) that lock down permissions. Those SCCs break monitoring agents if you don’t configure them correctly.
So OpenShift monitoring isn’t just about watching CPU graphs. It’s about watching a secure, opinionated Kubernetes environment with its own quirks. Quirks like how the platform handles audit logs differently than vanilla Kubernetes. Or how cluster operators manage themselves. If you treat OpenShift like raw Kubernetes, you will miss half the story.
Benefits of OpenShift Monitoring
Why put in all this effort when you could just trust the platform to work? Running a cluster blind leads to slow performance that frustrates users, angry customers who leave bad reviews, and 3 AM pages that do not explain anything useful. Good monitoring flips that entire situation around in your favor.
Identify Problems Before Users Notice
A node reporting high disk latency today becomes a full outage tomorrow when the disk finally fails completely. OpenShift monitoring tracks trends over hours and days rather than just showing you what happened five seconds ago. When memory usage climbs five percent every hour, you get a warning alert long before the pod actually crashes. That is the difference between planned maintenance that happens during business hours and emergency firefighting at 2 AM on a Saturday with a half asleep engineer trying to figure out what broke. The platform gives you time to respond, investigate, and fix the root cause before any customer experiences a slowdown or an error message. Without that early warning system, you are always reacting to fires instead of preventing them.
Optimize Your Cloud and Infrastructure Costs
OpenShift clusters cost real money because each node adds compute resources, storage allocations, and licensing fees from Red Hat. Without proper monitoring, teams massively over provision resources “just in case” and waste thousands of dollars per month on capacity they never actually use. With monitoring, you see exactly which namespaces are hogging the CPU and which ones are barely sipping resources. One team might run a test environment that eats forty percent of the entire cluster’s CPU capacity while another team runs a production service on five percent. Monitoring reveals that problem immediately, and then you can set resource quotas, move those workloads to cheaper nodes, or ask the team to right size their requests. The savings go straight to your bottom line.
Improve Your Security Posture
OpenShift monitoring tracks every API server request coming into the cluster and makes that data searchable and alertable. Who is accessing what data from which IP address at 3 AM on a Sunday? Are there failed login attempts coming from an unknown network that has never talked to your cluster before? Does a container suddenly try to write to the host filesystem when it never did that in the previous six months of operation? All of that information shows up in your logs and metrics if you are collecting them properly. Without monitoring, a security breach goes undetected for weeks while attackers move laterally through your workloads and exfiltrate sensitive data. With monitoring, you see those anomalies in real time and can respond immediately by isolating the compromised workload or rotating credentials.
Improve The Troubleshooting Process
A microservice fails for unknown reasons during a routine deployment. Is the problem in the database connection pool? Did a network policy block the traffic between services after an update? Did the memory limit cause a silent OOM kill that the application did not log? OpenShift monitoring gives you the complete timeline of events leading up to the failure. You look at CPU throttling data, network error rates, and pod restart counts side by side on a single dashboard without jumping between five different tools. The root cause jumps out at you immediately because you see the exact moment when latency spiked or when the first restart occurred. Without that data, you start guessing randomly and checking things one by one. Guessing wastes hours of engineering time and delays customer fixes while your team runs in circles.
Support Your Service Level Agreements
If your company promises ninety nine point nine percent uptime to customers, you need hard proof that you actually delivered that number when they ask for it. OpenShift monitoring generates availability reports automatically without requiring someone to manually compile data from spreadsheets. It shows exactly when the API server responded slowly, when a node went completely offline for maintenance, or when a cluster operator degraded for a few minutes. Auditors want that data during compliance reviews, and your customers might demand it as part of their vendor assessment process. Having automated, auditable monitoring reports saves your compliance team hours of work and gives customers confidence that you know what is happening inside your infrastructure.
OpenShift Monitoring: 5 Key Metrics
Not all metrics are equal. Teams new to OpenShift often monitor everything and end up burying the signal in noise. Focus on these categories.
- Cluster level metrics. The control plane health comes first. Watch API server latency. If requests take over one second, something is wrong. Etcd database size and request duration matter too. Etcd stores the cluster state. A slow etcd breaks everything else. Also track scheduler latency. That tells you how long pods wait for a node;
- Node metrics. CPU, memory, disk, and network per node. But look at pressure metrics. Memory pressure means the node starts killing pods. Disk pressure means the kubelet evicts workloads. Node ready status is obvious. Less obvious is node network availability. Packets dropped at the virtual interface level often go unnoticed until applications fail;
- Pod and container metrics. Restart count is the first red flag. A pod restarting every few hours has a memory leak or liveness probe problem. CPU throttling shows if you set limits too low. OOM kills (out of memory) indicate bad resource requests. Container start time also matters. A container taking 30 seconds to start might be fine. Taking three minutes means your image is too large or your init script hangs;
- Kubernetes object metrics. How many unhealthy replicas in a deployment? Are persistent volume claims stuck in pending? Network policies blocking traffic? OpenShift monitoring should track these counts over time. A sudden spike in failed CRD (custom resource definition) registrations points to a broken operator;
- OpenShift specific metrics. Each cluster operator reports its own status, including the authentication, console, and ingress operators. If any operator reports degraded or unavailable, the cluster loses functionality. Also monitor route metrics. OpenShift routes are the ingress layer. Track route endpoint availability and request latency. Finally, watch certificate expiration. Many OpenShift components use internal certificates that expire after 30 or 90 days. Monitoring catches that before the cluster locks itself out. A short rule sketch after this list shows what a couple of these checks look like in PromQL.
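To make those categories concrete, here is a minimal sketch of what two of these checks look like as alerting rules. The rule names, namespace, and thresholds are placeholders, and OpenShift already ships its own default rules covering both; the point is just to show the underlying PromQL.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: key-metric-examples        # placeholder name
  namespace: openshift-monitoring  # placeholder; where a rule belongs depends on which Prometheus evaluates it
spec:
  groups:
  - name: key-metrics.example
    rules:
    # Cluster level: 99th percentile API server request latency stays above one second
    - alert: ExampleAPIServerLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le)) > 1
      for: 10m
      labels:
        severity: warning
    # OpenShift specific: any cluster operator reporting the Degraded condition
    - alert: ExampleClusterOperatorDegraded
      expr: cluster_operator_conditions{condition="Degraded"} == 1
      for: 10m
      labels:
        severity: critical
```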
How to Do OpenShift Monitoring: 5 Steps
You have the theory. Now here is the practical path. These five steps take you from zero to a working OpenShift monitoring setup. No fluff. No unnecessary detours.
Step 1: Understand What OpenShift Already Gives You
OpenShift ships with a monitoring stack. Not a toy stack. A real one. The Cluster Monitoring Operator installs and updates Prometheus, Thanos, Alertmanager, and Grafana. You get this immediately after installation. No extra YAML files required.
Log into the OpenShift web console. Navigate to Observe. You will see metrics, dashboards, and alerts. This built in system scrapes metrics from every component. The API server, etcd, kubelet, container runtime, and cluster operators. It retains metrics for 15 days by default, enough for recent troubleshooting but not for long term trends.
But the default setup has limits. It only looks at cluster components. Not your application pods. Not your databases running inside the cluster. The default dashboards also show raw numbers without much context. And alerting rules are conservative. They fire when a node is completely dead, not when disk space hits 80%.
So step one is simple. Spend an hour clicking through the built in dashboards. Look at the Alerts tab. See what is already firing. Check the Targets page in Prometheus to see which endpoints get scraped. Write down what is missing. Your own application metrics. Custom business logic. External dependencies.
The built in stack also consumes cluster resources. Prometheus uses memory. So does Thanos for long term storage. On a small cluster with three worker nodes, the monitoring stack might use 4 GB of RAM. That is normal. Do not try to lower it unless you really understand the tradeoffs.
Step 2: Deploy Service Monitors for Your Workloads
The default stack does not scrape your applications. You have to tell OpenShift what to look at. Service Monitors are the answer. A Service Monitor is a CRD that tells Prometheus which services to scrape and how often. Before writing your first ServiceMonitor, your cluster administrator must enable monitoring for user-defined projects. The enabling monitoring for user-defined projects guide walks through that prerequisite. You will also need the appropriate permissions (typically monitoring-rules-edit role) to create monitoring resources in your namespace.
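If you are the administrator doing that enablement, it comes down to one flag in a ConfigMap. A minimal sketch, with field names as they appear in recent OpenShift 4.x releases:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Starts a second Prometheus stack that scrapes user namespaces
    enableUserWorkload: true
```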
Create a simple Service Monitor for an example app. Assume your app’s Service exposes metrics at /metrics on a port named web (port 8080). Write a YAML file like this:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
    path: /metrics
    interval: 30s
```
Apply it with oc apply -f servicemonitor.yaml. OpenShift’s Prometheus automatically picks it up. Wait two minutes. Then check the Targets page again. Your app should appear.
Common mistakes here. Wrong namespace. The Service Monitor must be in the same namespace as the service it targets. Wrong label selector. Prometheus uses the selector to find services. If your service has app: my-app-v2 but the selector looks for app: my-app, it finds nothing. Also wrong port name. The endpoint section references a port name from the service definition, not the numeric port.
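For reference, here is a sketch of a Service the example ServiceMonitor would match, using the same assumed names. The labels and the named port are the two things that usually go wrong.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  labels:
    app: my-app          # matched by the ServiceMonitor's matchLabels
spec:
  selector:
    app: my-app
  ports:
  - name: web            # the ServiceMonitor endpoint references this name, not the number
    port: 8080
    targetPort: 8080
```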
Do this for every application that exports Prometheus metrics. Most modern frameworks do. Node.js with prom-client. Python with prometheus_client. Java with Micrometer. If your app doesn’t export metrics, add it. Exporting custom metrics like queue length, active sessions, or transaction count transforms monitoring from reactive to proactive.
Step 3: Configure Alertmanager to Stop Waking You Up at 3 AM
Alertmanager handles alerts. It deduplicates, groups, and routes them to the right place. OpenShift includes Alertmanager but uses a basic configuration. You need to customize it.
First, decide where alerts go. Slack, PagerDuty, email, or a webhook. Most teams use Slack for warnings and PagerDuty for critical issues. Have your webhook URLs ready. Then edit the Alertmanager configuration, which OpenShift stores as a secret in the openshift-monitoring namespace.
Go to OpenShift console > Administration > Cluster Settings > Configuration > Alertmanager. Click the YAML view. Add your receivers and routes. A minimal Slack setup looks like this:
```yaml
receivers:
- name: slack-warnings
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/REPLACE-ME'  # paste your Slack webhook URL here
    channel: '#alerts'
    title: 'OpenShift Alert'
    text: 'Alert: {{ .CommonLabels.alertname }} - {{ .CommonAnnotations.description }}'
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-warnings
```
The repeat_interval is important. Set it to 4 hours or longer. Otherwise you get the same alert every five minutes. Nobody wants that.
Second, tune the default alert rules. OpenShift ships with rules like CPUThrottlingHigh and KubeMemoryOvercommit. Some are too sensitive. Clone the default PrometheusRule objects into your own namespace and modify thresholds. Raise the CPU throttling threshold from 25% to 40% if your workloads are bursty. Lower the memory overcommit warning from 80% to 70% if you run memory intensive Java apps.
Third, add your own custom rules. Do you need an alert when a specific deployment runs zero replicas for 10 minutes? Write a PrometheusRule. When the average request latency crosses 500 milliseconds? Write another one. Custom rules turn generic monitoring into something that matches your actual business.
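As an example, the zero replica alert might look something like this. The deployment name, namespace, and severity are assumptions, and the rule leans on kube-state-metrics, which OpenShift already runs:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts   # placeholder name
  namespace: my-app     # requires user workload monitoring from step 2
spec:
  groups:
  - name: my-app.alerts
    rules:
    - alert: DeploymentHasNoReplicas
      expr: kube_deployment_status_replicas_available{namespace="my-app", deployment="my-app"} == 0
      for: 10m
      labels:
        severity: critical
      annotations:
        description: "Deployment my-app has had zero available replicas for 10 minutes."
```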
Step 4: Set Up Long Term Storage and Retention Policies
Prometheus does not scale forever. Its local storage fills up. OpenShift’s Cluster Monitoring Operator lets you configure retention. The default is 15 days. That works for small clusters. Production environments need longer.
Decide your retention period. Compliance might require six months. Troubleshooting might need three months of historical metrics. For anything beyond local retention, OpenShift supports Prometheus remote write, typically pointed at a Thanos receiver or another store backed by an S3 compatible bucket.
Edit the cluster-monitoring-config ConfigMap in the openshift-monitoring namespace. Add these lines:
```yaml
data:
  config.yaml: |
    thanosRuler:
      retention: 180d
    prometheusK8s:
      retention: 60d
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
```
This sets Prometheus local retention to 60 days and Thanos ruler retention to 180 days. The volume claim gives Prometheus 100 GB of persistent storage. Adjust the storage size based on your metric volume. A busy cluster with 200 pods generates about 5 GB of metrics per day. A quieter cluster does 1 GB. At 5 GB a day, 60 days of history works out to roughly 300 GB, so a busy cluster needs a much larger claim than the 100 Gi shown here.
Without long term storage, losing a Prometheus pod means losing all history. That breaks trend analysis. You cannot answer “was memory usage higher last month?” if the data is gone. So configure persistent volumes and remote storage early. Recovery after a failure becomes trivial.
Step 5: Create Actionable Dashboards and SLO Tracking
Dashboards are not decorations. They are diagnostic tools. OpenShift includes built in dashboards (a read only Grafana in older releases, the console’s Observe dashboards in newer ones). They are dense and you cannot edit them. Build your own focused dashboards in a Grafana instance you control.
Start with three dashboards. The first is a cluster health dashboard. Show node status, API server latency, and etcd leader elections. Add a table of pods with restart counts over 10. Add a gauge for cluster operator status. Green for all healthy, red for any degraded. Keep this dashboard simple. It fits on one screen.
The second dashboard is per namespace resource usage. Show CPU requests versus actual usage. Memory limits versus real consumption. Network traffic in and out. Many teams set limits too high. This dashboard shows waste. You see a namespace requesting 8 CPU cores but using 0.5. That is money sitting idle. Fix it by reducing requests.
The third dashboard tracks SLOs. Service level objectives. Pick three or four user facing metrics. API response time at 95th percentile. Error rate as percentage of total requests. Availability of critical routes. Plot these as burn rates over time. When the error budget burns faster than expected, the dashboard flashes red. Engineers know to stop feature work and fix reliability.
Build these dashboards using PromQL queries. PromQL takes practice. A basic query for pod CPU usage is sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod). For API server latency: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le)). Test each query in the OpenShift metrics explorer first. Then paste it into a Grafana panel.
Save each dashboard as a JSON file. Commit that file to git. Then you can recreate the dashboards if the cluster gets rebuilt. Share them across teams. Nothing worse than rebuilding a useful dashboard from memory after an outage.
Set up recording rules for expensive queries. A query summing CPU across 1,000 pods every 30 seconds adds load. Create a recording rule that precomputes that sum every 5 minutes. Then dashboards query the precomputed metric. Faster dashboards and less load on Prometheus.
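A recording rule is just another entry in a PrometheusRule object. A sketch, reusing the pod CPU query from above and assuming the same my-app namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-recording-rules   # placeholder name
  namespace: my-app
spec:
  groups:
  - name: my-app.recording
    interval: 5m                 # precompute every 5 minutes instead of at dashboard query time
    rules:
    - record: namespace_pod:container_cpu_usage_seconds:sum_rate5m
      expr: sum(rate(container_cpu_usage_seconds_total{namespace="my-app", container!=""}[5m])) by (pod)
```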
You now have a monitoring setup that works. It catches problems early, preserves history, and gives actionable dashboards. OpenShift monitoring is not a one time task. It evolves as your cluster grows. But these five steps build a foundation that scales.