Monitoring Inquiry (Cluster Admin)

Modified on: 2025-03-29 19:35

Cloud Z CP's monitoring service is built from the following open source components.
Prometheus: collects and stores metrics.
Alertmanager: processes the alerts generated by Prometheus.
Exporters (node-exporter, kube-state-metrics, blackbox-exporter, elasticsearch-exporter, etc.): expose third-party metrics so that Prometheus can collect them.
Grafana: visualizes the collected metrics using Prometheus queries and presents them as dashboards that are easy for users to understand.
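As a rough illustration of how a dashboard panel could obtain its data, the sketch below builds an instant query against the Prometheus HTTP API (`/api/v1/query`) and flattens the returned vector. The server address and the example query are assumptions made for illustration; the actual endpoint and queries used by Cloud Z CP are not documented here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload: dict) -> dict:
    """Map each series' label set to its float value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # the value arrives as a string
        series[labels] = float(value)
    return series


def instant_query(base_url: str, promql: str, timeout: float = 5.0) -> dict:
    """Run the query against a live Prometheus server (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


if __name__ == "__main__":
    # Hypothetical address; replace it with your own Prometheus endpoint.
    print(build_query_url("http://prometheus.example:9090", 'up{job="node-exporter"}'))
```

Grafana performs the equivalent request for each panel, so a query can be tried this way before it is placed in a dashboard.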
This section explains how to use Grafana's Dashboard and each item in the default Dashboard.
For more detailed information or instructions on how to use Grafana, please refer to the Grafana Docs.
To use the service, click Monitoring on the ZCP Console side menu.
As the version is updated, information that has been added, modified, or deleted in the Dashboard or Panel is displayed in the following legend.
Version, Content: Added Dashboard or Panel
Version, Content: Changed Dashboard or Panel
Version, Content: Deleted Dashboard or Panel
Go to Dashboard
Built-in Dashboard
Addon Dashboards
ElasticSearch
ZCP Services Status
Cluster Dashboards
Etcd Cluster
Kubernetes: Cluster Overview
Kubernetes: Performance Overview
Kubernetes: Resource Requests
Container Dashboards
Kubernetes: DaemonSet Overview
Kubernetes: Deployment Overview
Kubernetes: POD Overview
Kubernetes: StatefulSets Overview
System Dashboards
System Disk Space
System Usage Overview
System: Overview
Dashboard Writing Guide

Go to Dashboard

Select the Home menu at the top.
You can see the recently selected Dashboard (Recent) and the default Folders (4).
When you select one of the default Folders, the Dashboards belonging to that Folder will be displayed.
When you select a Dashboard, a screen composed of various panels is displayed.

Built-in Dashboard

This section describes the Dashboards provided by default in Cloud Z CP Public.

Addon Dashboards

ElasticSearch

Displays information about Elasticsearch (JVM, CPU, Memory, Documents, Indices, etc.)

Row	Panel	Description
KPI	Cluster health	Current status of the Elasticsearch cluster (N/A / Green / Yellow / Red)
	Tripped for breakers	Number of times circuit breakers have tripped in the cluster (average)
	CPU usage Avg.	CPU average usage
	JVM memory used Avg.	JVM memory average usage
	Nodes	Number of nodes in the cluster.
	Data nodes	Number of data nodes in the cluster.
	Pending tasks	Cluster level changes which have not yet been executed.
	Openfile descriptors per cluster	The total number of open files in elasticsearch
Shards	Active primary shards	The number of primary shards in your cluster. This is an aggregate total across all indices.
	Active shards	Aggregate total of all shards across all indices, which includes replica shards.
	Initializing shards	Count of shards that are being freshly created.
	Relocating shards	The number of shards that are currently moving from one node to another node.
	Delayed shards	Shards delayed to reduce reallocation overhead.
	Unassigned shards	The number of shards that exist in the cluster state, but cannot be found in the cluster itself.
JVM Garbage Collection	GC count	Number of garbage collection runs
JVM Garbage Collection	GC time	Time spent in garbage collection
CPU and Memory	Load average	Load average of Elasticsearch
	CPU usage	CPU usage of Elasticsearch
	JVM memory usage	JVM memory used by Elasticsearch
	JVM memory committed	JVM memory committed to Elasticsearch
Disk and Network	Disk usage	Disk usage of Elasticsearch
Disk and Network	Network usage	Network usage of Elasticsearch
Documents	Documents count on node	Number of documents stored in the data node
	Documents indexed rate	The rate at which documents are indexed
	Documents deleted rate	The rate at which documents are deleted
	Documents merged rate	The rate at which documents are merged
	Documents merged bytes	The size of merged documents (bytes)
Times	Query time	Query execution time
	Indexing time	Indexing execution time
	Merging time	Merging execution time
	Throttle time for index store	Time for which the index store was throttled
Indices: Count of documents and Total size	Count of documents with only primary shards	Number of documents in primary shards
	Total size of stored index data in bytes with only primary shards on all nodes	Total size of index data stored in primary shards
	Total size of stored index data in bytes with all shards on all nodes	Total size of index data stored in all shards
Indices: Index writer	Index writer with only primary shards on all nodes in bytes	Size (bytes) used by the index writer for primary shards
Indices: Index writer	Index writer with all shards on all nodes in bytes	Size (bytes) used by the index writer for all shards
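The Cluster health panel above reduces several samples to a single label. A minimal sketch of that reduction follows, assuming the exporter publishes one sample per colour with the value 1 for the current state (this is how elasticsearch-exporter is commonly described, but the exact query behind the panel is an assumption). The function name `cluster_health` is hypothetical.

```python
def cluster_health(color_samples: dict) -> str:
    """Return the colour whose sample equals 1, or "N/A" when none is set.

    color_samples maps a colour name ("green", "yellow", "red") to the
    latest sample value reported for that colour.
    """
    for color in ("green", "yellow", "red"):
        if color_samples.get(color) == 1:
            return color.capitalize()
    return "N/A"  # no data, for example when the exporter cannot be scraped


print(cluster_health({"green": 0, "yellow": 1, "red": 0}))
```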

ZCP Services Status

Health check of the zcp-system namespace (CPU usage, status values)

Panel	Description
Duration	Probe duration in seconds
Status : alertmanager	alertmanager health (UP / DOWN)
alertmanager Status Code	alertmanager status code
Status : grafana	grafana health (UP / DOWN)
grafana Status Code	grafana status code
Status : prometheus	prometheus health (UP / DOWN)
prometheus Status Code	prometheus status code
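The UP / DOWN values in the table above can be thought of as a simple mapping over probe results. The sketch below assumes the probes come from blackbox-exporter, whose metrics include `probe_success`, `probe_http_status_code`, and `probe_duration_seconds`; whether this dashboard uses exactly those metrics is an assumption, and the `ProbeResult` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    service: str
    success: float           # 1.0 when the probe succeeded, 0.0 otherwise
    status_code: int         # HTTP status code returned by the service
    duration_seconds: float  # how long the probe took


def status_label(probe: ProbeResult) -> str:
    """Translate the numeric probe result into the label shown on the panel."""
    return "UP" if probe.success == 1 else "DOWN"


def summarize(probes: List[ProbeResult]) -> List[str]:
    """Produce one human-readable line per probed service."""
    return [
        f"{p.service}: {status_label(p)} (HTTP {p.status_code}, {p.duration_seconds:.3f}s)"
        for p in probes
    ]


if __name__ == "__main__":
    sample = [
        ProbeResult("grafana", 1.0, 200, 0.012),
        ProbeResult("prometheus", 0.0, 503, 2.5),
    ]
    print("\n".join(summarize(sample)))
```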

Cluster Dashboards

Etcd Cluster

Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)

Panel	Description
Etcd has a leader?	Whether etcd has a leader (YES/NO)
The number of leader changes seen	Number of times the etcd leader has changed
The total number of failed proposals seen	Total number of failed proposals
RPC Rate	Rate of gRPC calls started or handled, measured over 5 minutes
Etcd DB Size	Total size of the etcd MVCC database in bytes
Etcd Disk Sync Duration	WAL fsync duration of the etcd disk over 5 minutes (99th percentile of the histogram)
Etcd Memory	Memory usage of the 'etcd' job
Etcd Client Traffic In	Total gRPC traffic received from etcd clients over 5 minutes
Etcd Client Traffic Out	Total gRPC traffic sent to etcd clients over 5 minutes
Etcd Peer Traffic In	Total traffic received from etcd peers over 5 minutes
Etcd Peer Traffic Out	Total traffic sent to etcd peers over 5 minutes
Etcd Proposals rate(Fail,Pending,commit,apply)	Rate of proposals handled by the etcd server over 5 minutes (failed, pending, committed, applied)
Etcd Disk operations(AVG)	Backend commits performed by the etcd disk over 2 minutes (average)
Network	Total gRPC traffic received from etcd clients over 2 minutes
Snapshot duration	Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable.
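The Etcd Disk Sync Duration panel reports a 99th percentile taken from a histogram. The sketch below shows one way such a percentile can be estimated from cumulative buckets by linear interpolation, in the spirit of the PromQL `histogram_quantile` function. It is a simplified illustration, not the exact server-side algorithm, and the bucket values in the usage example are invented.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The lowest bucket is assumed to start at 0, and observations are
    assumed to be spread evenly inside each bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Invented fsync durations in seconds: (bucket upper bound, cumulative count).
    sample = [(0.001, 0), (0.002, 50), (0.004, 90), (0.008, 100), (math.inf, 100)]
    print(histogram_quantile(0.99, sample))
```

Because the estimate is interpolated inside a bucket, its precision is limited by the bucket boundaries, which is worth keeping in mind when reading small changes on the panel.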

Kubernetes: Cluster Overview

Information about total/node average/cluster average resources (number of nodes/pods/containers, CPU/memory/network usage, etc.)

Row	Panel	Description
Resource Dashboard	Alertmanager Alerts Firing	Total number of Alerts
	Node Not Ready	Number of Nodes in 'Not Ready' state
	Node Unschedulable	Number of Nodes in 'Unschedulable' state
	Node Memory Pressure	Number of nodes in 'Memory Pressure' state
	Node Disk Pressure	Number of nodes in 'Disk Pressure' state
	Running Pod Total	The number of Pods currently in the 'Running' state.
	Running Pod Total by Node	The number of Pods currently in 'Running' state on each node.
	Running Container Total	The number of Containers currently in 'Running' state.
	Running Container Total by Node	The number of Containers currently in 'Running' state on each node.
Node Resource Usage	Number of Node	Total number of nodes in the current cluster
	Total CPU	Total CPU of nodes in the current cluster
	Used Memory	Memory usage of nodes in the current cluster
	Total Memory	Total Memory of nodes in the current cluster
	DIsk Usage	Disk usage of nodes in the current cluster
	DIsk Total	Total disks of nodes in the current cluster
	Avg CPU Usage	Average CPU usage of nodes in the current cluster
	Avg Memory Usage	Average memory usage of nodes in the current cluster
	Avg Disk Usage	Average disk usage of nodes in the current cluster
	Network Usage (Node NIC)	Network usage of nodes in the current cluster
Cluster Resource Usage	Cluster CPU Usage(Used/Total)	Current CPU usage (%) across all nodes in the cluster. The total CPU (Cores) and the amount used are displayed below it.
	Cluster Memory Usage(Used/Total)	Current memory usage (%) across all nodes in the cluster. The total memory (GiB) and the amount used are displayed below it.
	Cluster DIsk Usage(Used/Total)	Current disk usage (%) across all nodes in the cluster. The total disk capacity (GiB) and the amount used are displayed below it.
	Pod Count by namespace	Number of Pods registered in Kubernetes, by namespace
	Container Count by namespace	Number of containers registered in Kubernetes, by namespace
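The Used/Total panels above combine a percentage with absolute amounts. A minimal sketch of that arithmetic is shown below, assuming per-node used and total byte counts are already available; the function name and the sample figures are hypothetical.

```python
from typing import Dict, List, Tuple

GIB = 1024 ** 3


def cluster_usage(nodes: List[Tuple[int, int]]) -> Dict[str, float]:
    """Aggregate per-node (used_bytes, total_bytes) pairs into cluster totals."""
    used = sum(u for u, _ in nodes)
    total = sum(t for _, t in nodes)
    percent = 100 * used / total if total else 0.0
    return {
        "percent": round(percent, 1),
        "used_gib": round(used / GIB, 2),
        "total_gib": round(total / GIB, 2),
    }


if __name__ == "__main__":
    # Two invented nodes with 16 GiB of memory each.
    print(cluster_usage([(4 * GIB, 16 * GIB), (12 * GIB, 16 * GIB)]))
```

Summing before dividing matters: averaging the per-node percentages would give a different figure whenever the nodes differ in size.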

Kubernetes: Performance Overview

API Server requests/latency, Pod/Container running trends, creation rates, etc.

Panel	Description
APIServer Request Rate	Rate of requests handled by the APIServer, measured over 2 minutes
APIServer Latency	Average request latency of the APIServer
Kubelet POD Start Latency	Latency in microseconds for a single pod to go from pending to running. Broken down by pod name.
Running Pod Trands	Number of Pods in the 'running' state reported by the kubelet
Create Rate of Pods	Rate at which the kubelet created new Pods, measured over 2 minutes
Running Containers Trands	Number of Containers in the 'running' state reported by the kubelet
Create Rate of Containers	Rate at which the kubelet created new Containers, measured over 2 minutes
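Several panels above report a rate "over 2 minutes". The sketch below shows one simplified way a per-second rate can be derived from counter samples inside a trailing window, including handling of counter resets. It deliberately omits the extrapolation that the PromQL `rate()` function performs, so treat it as an approximation; the sample data is invented.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]], window_seconds: float) -> Optional[float]:
    """Per-second increase of a counter over the trailing window.

    samples is a list of (timestamp_seconds, counter_value) sorted by time.
    Returns None when fewer than two samples fall inside the window.
    """
    if not samples:
        return None
    end = samples[-1][0]
    in_window = [(t, v) for t, v in samples if t >= end - window_seconds]
    if len(in_window) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset, so it restarted from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Invented request counter scraped every 60 seconds.
    print(simple_rate([(0, 100), (60, 160), (120, 220)], window_seconds=120))
```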

Kubernetes: Resource Requests

Displays information about the CPU/Memory usage and Pod count of Nodes.

Panel	Description
Cluster CPU(Allocated/Request)	This represents the total [CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) in the cluster. For comparison the total [allocatable CPU cores](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
Cluster Memory(Allocated/Request)	This represents the total [memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) in the cluster. For comparison the total [allocatable memory](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
Cluster Pod(Allocated/Request)	This represents the total number of [Pods](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run) in the cluster. For comparison the total [allocatable Pod count](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown.
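To make the request-versus-allocatable comparison concrete, the sketch below sums CPU requests written as Kubernetes quantities (for example `500m` or `2`) and compares them with the allocatable cores of the nodes. It covers only the plain and milli-core notations, and the input lists are invented; it is not how the dashboard itself computes the values.

```python
from typing import List, Tuple


def cpu_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # milli-cores
    return float(quantity)


def request_ratio(requests: List[str], allocatable: List[str]) -> Tuple[float, float, float]:
    """Return (requested cores, allocatable cores, requested share in percent)."""
    requested = sum(cpu_cores(q) for q in requests)
    capacity = sum(cpu_cores(q) for q in allocatable)
    percent = 100 * requested / capacity if capacity else 0.0
    return requested, capacity, percent


if __name__ == "__main__":
    # Three invented container requests scheduled on two 2-core nodes.
    print(request_ratio(["500m", "250m", "1"], ["2", "2"]))
```

A requested share close to 100% means new Pods may fail to schedule even if the actual CPU usage shown on other dashboards is low.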

Container Dashboards

Kubernetes: DaemonSet Overview

Information about DaemonSets (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)

Panel	Description
Desired Replicas (v1.1.0, DESIRED)	The number of nodes that should be running the daemon Pod
CURRENT (v1.1.0)	The number of nodes that are running at least one daemon Pod and are supposed to run it
READY (v1.1.0)	The number of nodes whose daemon Pod is running and ready
Available Replicas (v1.1.0, AVAILABLE)	The number of nodes whose daemon Pod is running and available
Metadata Generation	Sequence number representing the current generation of the DaemonSet's desired state
DaemonSet Create Time	Creation time of the oldest DaemonSet
Total CPU	Sum of CPU (Cores) used by the containers created by the DaemonSet
Total Memory	Sum of memory (MiB) used by the containers created by the DaemonSet
Total Network	Sum of network usage (MBps) of the containers created by the DaemonSet
CPU Usage	CPU usage of the containers created by the DaemonSet
Memory Usage	Memory usage of the containers created by the DaemonSet
Filesystem Read/Write (v1.1.0)	Filesystem read/write usage of the containers created by the DaemonSet
Network TX/RX (v1.1.0)	Network transmit/receive usage of the containers created by the DaemonSet
Replicas Status	Status of the DaemonSet's replicas (Ready / Available / Unavailable / Misscheduled)

Kubernetes: Deployment Overview

Information about Deployments (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)

Panel	Description
Desired Replicas (v1.1.0, DESIRED)	Desired number of replicas for the Deployment
Available Replicas (v1.1.0, AVAILABLE)	Number of Deployment replicas that are available
Observed Generation	Generation observed by the Deployment controller
Metadata Generation	Sequence number representing the current generation of the Deployment's desired state
Deployment Create Time	Creation time of the oldest Deployment
AVG CPU (v1.1.0, Total CPU)	Average CPU (Cores) used by the containers created by the Deployment (v1.1.0: sum of CPU (Cores) used by all containers in the Pods created by the Deployment)
AVG Memory (v1.1.0, Total Memory)	Average memory (MiB) used by the containers created by the Deployment (v1.1.0: sum of memory (MiB) used by all containers in the Pods created by the Deployment)
AVG Network (v1.1.0, Total Network)	Average network usage (kBps) of the containers created by the Deployment (v1.1.0: sum of network usage (MiB) of all containers in the Pods created by the Deployment)
CPU Usage	CPU usage of the containers created by the Deployment
Memory Usage	Memory usage of the containers created by the Deployment
Filesystem Read/Write (v1.1.0)	Filesystem read/write usage of the containers created by the Deployment
Network TX/RX (v1.1.0)	Network transmit/receive usage of the containers created by the Deployment
Replicas Status	Status of the Deployment's replicas (Ready / Available / Unavailable / Misscheduled)
Spec	Spec of the Deployment's replicas (Replicas / Paused)

Kubernetes: POD Overview

Information about Pods (Pod status, restart count, and the CPU/Memory/Network/Volume (v1.1.0)/Filesystem (v1.1.0) used by the Pod, etc.)

Panel	Description
POD Count	Number of Pods in the selected Namespace
Pod Status	Status of the selected Namespace and Pod (Failed / Pending / Running / Succeeded / Unknown)
Pod Restart Count	Number of restarts for the selected Namespace and Pod
CPU Usage	CPU usage, and its trend, of the containers in the selected Namespace and Pod
Memory Usage	Memory usage, and its trend, of the containers in the selected Namespace and Pod
Volume Usage (v1.1.0)	Usage, and its trend, of the Persistent Volumes used by the containers in the selected Namespace and Pod
Filesystem Read/Write (v1.1.0)	Trend of filesystem read/write usage of the containers in the selected Namespace and Pod
Network TX/RX	Network transmit/receive usage, and its trend, of the containers in the selected Namespace and Pod

Kubernetes: StatefulSets Overview

Information about StatefulSets (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)

Panel	Description
Desired Replicas (v1.1.0, DESIRED)	Desired number of replicas for the StatefulSet
Available Replicas (v1.1.0, AVAILABLE)	Number of StatefulSet replicas that are available
Observed Generation	Generation observed by the StatefulSet controller
Metadata Generation	Sequence number representing the current generation of the StatefulSet's desired state
Statefulset Create Time	Creation time of the oldest StatefulSet
Total CPU	Sum of CPU (Cores) used by the containers created by the StatefulSet
Total Memory	Sum of memory (MiB) used by the containers created by the StatefulSet
Total Network	Sum of network usage (MBps) of the containers created by the StatefulSet
CPU Usage	CPU usage of the containers created by the StatefulSet
Memory Usage	Memory usage of the containers created by the StatefulSet
Filesystem Read/Write (v1.1.0)	Filesystem read/write usage of the containers created by the StatefulSet
Network TX/RX (v1.1.0)	Network transmit/receive usage of the containers created by the StatefulSet
Replicas Status	Status of the StatefulSet's replicas (Current / Available)

System Dashboards

System Disk Space

Disk usage trends for each Node

Panel	Description
Check Root Disk capacity	Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, a database volume, or a volume used for temporary space can cause downtime. Some storage may also perform worse when only a small amount of space is available.

System Usage Overview

Usage information for each Node (Idle CPU, DISK I/O, Network received/transmitted, Memory/Disk Usage, etc.)

Panel	Description
Idle by CPU Core	Average idle percentage of each CPU core on the selected Node over 5 minutes
System Load(1,5,15)	Load average of the selected Node (1 minute / 5 minutes / 15 minutes)
Memory Usage	Memory usage of the selected Node by type (memory used / memory buffers / memory cached / memory free)
Memory Usage	Total memory usage (%) of the selected Node
DIsk I/O	Disk I/O of the selected Node by type (read / written)
Disk Usage	Total disk usage (%) of the selected Node
Received(Byte) by Network Interface	Bytes received from the network by the selected Node over 5 minutes, per network interface
Transmitted(Byte) by Network Interface	Bytes sent to the network by the selected Node over 5 minutes, per network interface
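The memory breakdown above (used / buffers / cached / free) can be reproduced from `/proc/meminfo`-style data. The sketch below assumes the common convention that used memory is the total minus free, buffers, and cached memory; the dashboard's own formula is not documented here, and the sample text is invented.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into a mapping of field name to bytes."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        # Values followed by "kB" are kibibytes; unit-less fields are plain counts.
        multiplier = 1024 if len(fields) > 1 and fields[1] == "kB" else 1
        values[key.strip()] = int(fields[0]) * multiplier
    return values


def memory_breakdown(info: Dict[str, int]) -> Dict[str, float]:
    """Split total memory into used, buffers, cached, and free, plus a used percentage."""
    total = info["MemTotal"]
    free, buffers, cached = info["MemFree"], info["Buffers"], info["Cached"]
    used = total - free - buffers - cached
    return {
        "used": used,
        "buffers": buffers,
        "cached": cached,
        "free": free,
        "used_percent": 100 * used / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = (
        "MemTotal:       8000000 kB\n"
        "MemFree:        2000000 kB\n"
        "Buffers:         500000 kB\n"
        "Cached:         1500000 kB\n"
    )
    print(memory_breakdown(parse_meminfo(sample)))
```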

System: Overview

Summary information for each Node (Load Average, Swap, CPU/Memory/Network Usage, etc.)

Panel	Description
System Uptime	Uptime of the selected Node during the selected interval
Virtual CPU	Current virtual CPU allocation of the selected Node
RAM	Current memory allocation of the selected Node
Memory Available	Current memory usage ratio (%) of the selected Node
Load Average	Load average of the selected Node during the selected interval (min, max, and avg are displayed separately)
Memory	Memory (GiB) of the selected Node by type (Total / Used / Available) during the selected interval (min, max, and avg are displayed separately)
CPU Usage	CPU usage ratio (%) of the selected Node by mode (idle / user / system / steal / iowait / softirq / nice) during the selected interval (min, max, and avg are displayed separately)
Memory Distribution	Memory distribution (GiB) of the selected Node by type (Cached / Used / Free / Buffers) during the selected interval (min, max, and avg are displayed separately)
Network Traffic(KBps)	Network traffic (kBps) of the selected Node by type (Inbound / Outbound for each item) during the selected interval (min, max, and avg are displayed separately)
Network Utilization	Network utilization (MiB) of the selected Node by type (Sent / Received) during the selected interval (min, max, and avg are displayed separately)
Swap	Swap usage (B) of the selected Node by type (Used / Free) during the selected interval (min, max, and avg are displayed separately)
Swap Activity	Swap activity (Bps) of the selected Node by type (Swap In / Swap Out) during the selected interval (min, max, and avg are displayed separately)
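The CPU Usage panel above splits CPU time into modes. One way to picture this is as the difference between two snapshots of cumulative per-mode CPU seconds, with each mode's share expressed as a percentage of the total. The sketch below assumes that approach; the snapshot values are invented and the function name is hypothetical.

```python
from typing import Dict


def cpu_mode_percent(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Share of CPU time per mode between two cumulative snapshots, in percent."""
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        return {mode: 0.0 for mode in deltas}
    return {mode: 100 * delta / total for mode, delta in deltas.items()}


if __name__ == "__main__":
    # Invented cumulative CPU seconds at the start and end of an interval.
    start = {"idle": 100.0, "user": 20.0, "system": 10.0}
    end = {"idle": 180.0, "user": 30.0, "system": 20.0}
    print(cpu_mode_percent(start, end))
```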

Dashboard Writing Guide

http://docs.grafana.org/reference/templating/
