Cloud Z CP's monitoring service is built from the following open source components.
- Prometheus: collects and stores metrics
- Alertmanager: processes the alerts generated by Prometheus
- Exporters (node-exporter, kube-state-metrics, blackbox-exporter, elasticsearch-exporter, etc.): expose third-party metrics in various ways so that Prometheus can collect them
- Grafana: visualizes the collected metrics using Prometheus queries and presents them as dashboards that are easy for users to understand
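
To make the exporter model above more concrete: exporters publish their metrics as plain text over HTTP, and Prometheus periodically scrapes that text. The following is a minimal sketch of reading such text, not the parser Prometheus itself uses. The `SAMPLE` payload and the `parse_metrics` helper are hypothetical, the values are made up, and the parser deliberately ignores details such as escaped characters and timestamps.

```python
import re

# Hypothetical payload in the Prometheus text format; values are made up.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # simplified: silently ignore lines we cannot parse
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Each parsed sample corresponds to one time series that Prometheus would store and that a Grafana panel could then query.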
This section explains how to use Grafana's Dashboard and each item in the default Dashboard.
For more detailed information or instructions on how to use Grafana, please refer to the Grafana Docs.
To use the service, click Monitoring on the ZCP Console side menu.
As the version is updated, information that has been added, modified, or deleted in the Dashboard or Panel is displayed in the following legend.
Version, Content: Added Dashboard or Panel
Version, Content: Changed Dashboard or Panel
Version, Content: Deleted Dashboard or Panel
- Go to Dashboard
- Built-in Dashboard
- Dashboard write Guide
Go to Dashboard
- Select the Home menu at the top.
- You can see the recently selected Dashboard (Recent) and the default Folders (4).
- When you select one of the default Folders, the Dashboards belonging to that Folder will be displayed.
- When you select a Dashboard, you will see a screen composed of various panels.
Built-in Dashboard
This section describes the Dashboards provided by default in Cloud Z CP Public.
Addon Dashboards
ElasticSearch
Displays information about Elasticsearch (JVM, CPU, Memory, Documents, Indices, etc.)
Row | Panel | Description |
---|---|---|
KPI | Cluster health | Current status of the Elasticsearch cluster (N/A / Green / Yellow / Red) |
KPI | Tripped for breakers | Average number of times the circuit breakers in the cluster have tripped |
KPI | CPU usage Avg. | Average CPU usage |
KPI | JVM memory used Avg. | Average JVM memory usage |
KPI | Nodes | Number of nodes in the cluster |
KPI | Data nodes | Number of data nodes in the cluster |
KPI | Pending tasks | Cluster-level changes that have not yet been executed |
KPI | Openfile descriptors per cluster | Total number of open files in Elasticsearch |
Shards | Active primary shards | The number of primary shards in your cluster. This is an aggregate total across all indices. |
Shards | Active shards | Aggregate total of all shards across all indices, including replica shards |
Shards | Initializing shards | Number of shards that are being freshly created |
Shards | Relocating shards | Number of shards that are currently moving from one node to another |
Shards | Delayed shards | Shards whose allocation is delayed to reduce reallocation overhead |
Shards | Unassigned shards | Number of shards that exist in the cluster state but cannot be found in the cluster itself |
JVM Garbage Collection | GC count | Number of garbage collection runs |
JVM Garbage Collection | GC time | Time spent on garbage collection |
CPU and Memory | Load average | Load average of Elasticsearch |
CPU and Memory | CPU usage | CPU used by Elasticsearch |
CPU and Memory | JVM memory usage | JVM memory used by Elasticsearch |
CPU and Memory | JVM memory committed | JVM memory committed by Elasticsearch |
Disk and Network | Disk usage | Disk used by Elasticsearch |
Disk and Network | Network usage | Network used by Elasticsearch |
Documents | Documents count on node | Number of documents stored on the data node |
Documents | Documents indexed rate | Rate at which documents are indexed |
Documents | Documents deleted rate | Rate at which documents are deleted |
Documents | Documents merged rate | Rate at which documents are merged |
Documents | Documents merged bytes | Size of merged documents (bytes) |
Times | Query time | Query execution time |
Times | Indexing time | Indexing execution time |
Times | Merging time | Merging execution time |
Times | Throttle time for index store | Time spent throttled while storing the index |
Indices: Count of documents and Total size | Count of documents with only primary shards | Number of documents in primary shards |
Indices: Count of documents and Total size | Total size of stored index data in bytes with only primary shards on all nodes | Total size of index data stored in primary shards |
Indices: Count of documents and Total size | Total size of stored index data in bytes with all shards on all nodes | Total size of index data stored in all shards |
Indices: Index writer | Index writer with only primary shards on all nodes in bytes | Size in bytes used by the index writer for primary shards |
Indices: Index writer | Index writer with all shards on all nodes in bytes | Size in bytes used by the index writer for all shards |
ZCP Services Status
Health check of the zcp-system namespace (CPU usage, status values)
Panel | Description |
---|---|
Duration | Probe duration in seconds |
Status : alertmanager | alertmanager health (UP / DOWN) |
alertmanager Status Code | alertmanager status code |
Status : grafana | grafana health (UP / DOWN) |
grafana Status Code | grafana status code |
Status : prometheus | prometheus health (UP / DOWN) |
prometheus Status Code | prometheus status code |
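
As an illustration of how the panels above relate to raw probe results, the sketch below maps a set of probe samples to the UP / DOWN status, status code, and duration that a panel would show. It assumes the samples come from blackbox-exporter metrics named `probe_success`, `probe_http_status_code`, and `probe_duration_seconds`; whether the ZCP dashboard uses exactly these metrics is an assumption, and `summarize_probe` is a hypothetical helper, not part of any product API.

```python
def summarize_probe(target, samples):
    """Map raw probe samples (metric name -> value) to panel-style values.

    Assumption: `samples` uses blackbox-exporter style metric names.
    """
    # A probe counts as UP only when probe_success is exactly 1.
    status = "UP" if samples.get("probe_success") == 1 else "DOWN"
    return {
        "target": target,
        "status": status,
        # 0 is used here as a placeholder when no HTTP response was received.
        "status_code": int(samples.get("probe_http_status_code", 0)),
        "duration_seconds": samples.get("probe_duration_seconds"),
    }


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    print(summarize_probe("grafana", {
        "probe_success": 1,
        "probe_http_status_code": 200,
        "probe_duration_seconds": 0.05,
    }))
```

Treating a missing `probe_success` sample as DOWN is a design choice in this sketch: a target that cannot be probed at all is reported the same way as one that fails the probe.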
Cluster Dashboards
Etcd Cluster
Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)
Panel | Description |
---|---|
Etcd has a leader? | Check if etcd has a leader (YES/NO) |
The number of leader changes seen | Number of times Etcd leader changed |
The total number of failed proposals seen | Total number of failed proposals |
RPC Rate | Rate of gRPC requests started or handled over 5 minutes |
Etcd DB Size | Total size of the etcd MVCC database in bytes |
Etcd Disk Sync Duration | Duration of WAL fsyncs performed by the etcd disk over 5 minutes (99th percentile of the histogram) |
Etcd Memory | Memory usage of 'etcd' job |
Etcd Client Traffic In | etcd network client gRPC total traffic received in 5 minutes |
Etcd Client Traffic Out | etcd network client gRPC total traffic sent in 5 minutes |
Etcd Peer Traffic In | Total traffic received from etcd network peers in 5 minutes |
Etcd Peer Traffic Out | Total traffic sent to etcd network peers in 5 minutes |
Etcd Proposals rate(Fail,Pending,commit,apply) | Rate of failed, pending, committed, and applied proposals on the etcd server over 5 minutes |
Etcd Disk operations(AVG) | Total number of backend commits made by etcd disk in 2 minutes |
Network | etcd network client gRPC total traffic received in 2 minutes |
Snapshot duration | Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable. |
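
Several panels above show a value "in 5 minutes" or "in 2 minutes", which is typically derived from a monotonically increasing counter. The sketch below shows one simplified way to turn counter samples from such a window into a per-second rate, including handling of a counter reset. It is not the exact algorithm of the PromQL `rate()` function, which additionally extrapolates to the window boundaries; `per_second_rate` is a hypothetical helper and the sample values are made up.

```python
def per_second_rate(samples):
    """Compute the average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a process restart);
        # in that case count the new value as the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Three samples covering a 5-minute (300 s) window.
    window = [(0, 100), (150, 400), (300, 700)]
    print(per_second_rate(window))
```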
Kubernetes: Cluster Overview
Information about total/node average/cluster average resources (number of nodes/pods/containers, CPU/memory/network usage, etc.)
Row | Panel | Description |
---|---|---|
Resource Dashboard | Alertmanager Alerts Firing | Total number of alerts currently firing |
Resource Dashboard | Node Not Ready | Number of nodes in the 'Not Ready' state |
Resource Dashboard | Node Unschedulable | Number of nodes in the 'Unschedulable' state |
Resource Dashboard | Node Memory Pressure | Number of nodes in the 'Memory Pressure' state |
Resource Dashboard | Node Disk Pressure | Number of nodes in the 'Disk Pressure' state |
Resource Dashboard | Running Pod Total | Number of Pods currently in the 'Running' state |
Resource Dashboard | Running Pod Total by Node | Number of Pods currently in the 'Running' state on each node |
Resource Dashboard | Running Container Total | Number of Containers currently in the 'Running' state |
Resource Dashboard | Running Container Total by Node | Number of Containers currently in the 'Running' state on each node |
Node Resource Usage | Number of Node | Total number of nodes in the current cluster |
Node Resource Usage | Total CPU | Total CPU of the nodes in the current cluster |
Node Resource Usage | Used Memory | Memory used by the nodes in the current cluster |
Node Resource Usage | Total Memory | Total memory of the nodes in the current cluster |
Node Resource Usage | DIsk Usage | Disk used by the nodes in the current cluster |
Node Resource Usage | DIsk Total | Total disk of the nodes in the current cluster |
Node Resource Usage | Avg CPU Usage | Average CPU usage of the nodes in the current cluster |
Node Resource Usage | Avg Memory Usage | Average memory usage of the nodes in the current cluster |
Node Resource Usage | Avg Disk Usage | Average disk usage of the nodes in the current cluster |
Node Resource Usage | Network Usage (Node NIC) | Network usage of the nodes in the current cluster |
Cluster Resource Usage | Cluster CPU Usage(Used/Total) | Current CPU usage (%) of all nodes in the cluster. The total CPU amount (Cores) and the amount used are also displayed below. |
Cluster Resource Usage | Cluster Memory Usage(Used/Total) | Current memory usage (%) of all nodes in the cluster. The total memory amount (GiB) and the amount used are also displayed below. |
Cluster Resource Usage | Cluster DIsk Usage(Used/Total) | Current disk usage (%) of all nodes in the cluster. The total disk amount (GiB) and the amount used are also displayed below. |
Cluster Resource Usage | Pod Count by namespace | Number of Pods registered in Kubernetes, by namespace |
Cluster Resource Usage | Container Count by namespace | Number of Containers registered in Kubernetes, by namespace |
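
The "Used/Total" panels above combine a percentage with the absolute used and total amounts. A minimal sketch of that arithmetic follows; the helper names (`usage_percent`, `bytes_to_gib`, `format_usage`) and the input values are hypothetical, and the exact rounding used by the dashboard is an assumption.

```python
def usage_percent(used, total):
    """Return used/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * used / total, 1)


def bytes_to_gib(num_bytes):
    """Convert bytes to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def format_usage(used, total, unit):
    """Render a usage value the way a Used/Total panel might present it."""
    return f"{usage_percent(used, total)}% ({used} / {total} {unit})"


if __name__ == "__main__":
    # Made-up cluster totals for illustration only.
    print(format_usage(6, 16, "Cores"))
    print(format_usage(bytes_to_gib(24 * 2**30), bytes_to_gib(64 * 2**30), "GiB"))
```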
Kubernetes: Performance Overview
API server requests/latency, Pod/Container running trends, creation rate, etc.
Panel | Description |
---|---|
APIServer Request Rate | Total number of requests made to the APIServer every 2 minutes |
APIServer Latency | APIServer's average request latencies |
Kubelet POD Start Latency | Latency in microseconds for a single pod to go from pending to running. Broken down by podname. |
Running Pod Trands | Number of pods in 'running' state in kubelet |
Create Rate of Pods | Rate of newly created Pods in kubelet in 2 minutes |
Running Containers Trands | Number of Containers in 'running' state in kubelet |
Create Rate of Containers | Rate of newly created containers in kubelet in 2 minutes |
Kubernetes: Resource Requests
Displays information about CPU/Memory usages and Pod count of Node.
Container Dashboards
Kubernetes: DaemonSet Overview
Information about DaemonSets (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)
Panel | Description |
---|---|
Desired Replicas | Number of nodes that should be running the daemon pod |
CURRENT | Number of daemon pods currently scheduled |
READY | Number of daemon pods currently running and ready |
Available Replicas | Number of daemon pods currently running and available |
Metadata Generation | Generation number of the DaemonSet's desired state (metadata.generation) |
DaemonSet Create Time | Creation time of the oldest DaemonSet |
Total CPU | Sum of CPU (Cores) used by containers created by the DaemonSet |
Total Memory | Sum of memory (MiB) used by containers created by the DaemonSet |
Total Network | Sum of network usage (MBps) by containers created by the DaemonSet |
CPU Usage | CPU usage of containers created by the DaemonSet |
Memory Usage | Memory usage of containers created by the DaemonSet |
Filesystem Read/Write | Filesystem read/write usage of containers created by the DaemonSet |
Network TX/RX | Network transmit/receive usage of containers created by the DaemonSet |
Replicas Status | Status of the DaemonSet's replicas (Ready / Available / Unavailable / Misscheduled) |
Kubernetes: Deployment Overview
Information about Deployments (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)
Panel | Description |
---|---|
Desired Replicas | Number of replicas desired for the Deployment |
Available Replicas | Number of Deployment replicas available |
Observed Generation | Generation most recently observed by the Deployment controller |
Metadata Generation | Generation number of the Deployment's desired state (metadata.generation) |
Deployment Create Time | Creation time of the oldest Deployment |
AVG CPU | Average CPU (Core) used by containers created by the Deployment |
AVG Memory | Average memory (MiB) used by containers created by the Deployment |
AVG Network | Average network usage (kBps) by containers created by the Deployment |
CPU Usage | CPU usage of containers created by the Deployment |
Memory Usage | Memory usage of containers created by the Deployment |
Filesystem Read/Write | Filesystem read/write usage of containers created by the Deployment |
Network TX/RX | Network transmit/receive usage of containers created by the Deployment |
Replicas Status | Status of the replicas in the Deployment (Ready / Available / Unavailable / Misscheduled) |
Spec | Spec of the replicas in the Deployment (Replicas / Paused) |
Kubernetes: POD Overview
Information about Pods (Pod status, restart count, and the CPU/Memory/Network/Volume (v1.1.0)/Filesystem (v1.1.0) used by the Pod, etc.)
Panel | Description |
---|---|
POD Count | Number of Pods in the selected Namespace |
Pod Status | Status of the selected Namespace and Pod (Failed / Pending / Running / Succeeded / Unknown) |
Pod Restart Count | Number of restarts for the selected Namespace and Pod |
CPU Usage | CPU usage and trend of the containers in the selected Namespace and Pod |
Memory Usage | Memory usage and trend of the containers in the selected Namespace and Pod |
Volume Usage | Usage and trend of the Persistent Volumes used by the containers in the selected Namespace and Pod |
Filsystem Read/Write | Filesystem read/write usage trend of the containers in the selected Namespace and Pod |
Network TX/RX | Transmit/Receive usage and trends of the network used by the selected Namespace and Pod's Container |
Kubernetes: StatefulSets Overview
Information about StatefulSets (Replicas, CPU/Memory/Network/Filesystem (v1.1.0), etc.)
Panel | Description |
---|---|
Desired Replicas | Number of replicas desired for the StatefulSet |
Available Replicas | Number of StatefulSet replicas available |
Observed Generation | Generation most recently observed by the StatefulSet controller |
Metadata Generation | Generation number of the StatefulSet's desired state (metadata.generation) |
Statefulset Create Time | Creation time of the oldest StatefulSet |
Total CPU | Sum of CPU (Cores) used by containers created by the StatefulSet |
Total Memory | Sum of memory (MiB) used by containers created by the StatefulSet |
Total Network | Sum of network usage (MBps) by containers created by the StatefulSet |
CPU Usage | CPU usage of containers created by the StatefulSet |
Memory Usage | Memory usage of containers created by the StatefulSet |
Filesystem Read/Write | Filesystem read/write usage of containers created by the StatefulSet |
Network TX/RX | Network transmit/receive usage of containers created by the StatefulSet |
Replicas Status | Status of the replicas in the StatefulSet (Current / Available) |
System Dashboards
System Disk Space
Disk usage trends for each Node
Panel | Description |
---|---|
Check Root Disk capacity | Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, the database volume, or a volume used for temporary space can cause downtime. Some storage may also perform worse when only a small amount of space is available. |
System Usage Overview
Usage information for each Node (Idle CPU, DISK I/O, Network received/transmitted, Memory/Disk Usage, etc.)
Panel | Description |
---|---|
Idle by CPU Core | 5-minute average idle ratio of each CPU core in the selected Node |
System Load(1,5,15) | Load average of the selected Node (1 minute / 5 minutes / 15 minutes) |
Memory Usage | Memory usage by type on the selected Node (memory used / memory buffers / memory cached / memory free) |
Memory Usage | Total memory usage ratio (%) of the selected Node |
DIsk I/O | Disk I/O by type (read / written) on the selected Node |
Disk Usage | Total disk usage ratio (%) of the selected Node |
Received(Byte) by Network Interface | Bytes received from the network by each interface of the selected Node over 5 minutes |
Transmitted(Byte) by Network Interface | Bytes sent to the network by each interface of the selected Node over 5 minutes |
System: Overview
Summary information for each Node (Load Average, Swap, CPU/Memory/Network Usage, etc.)
Panel | Description |
---|---|
System Uptime | Uptime of the selected Node during the selected interval |
Virtual CPU | Current virtual CPU allocation of the selected Node |
RAM | Current memory allocation of the selected Node |
Memory Available | Current memory usage ratio (%) of the selected Node |
Load Average | Load average of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Memory | Memory usage (GiB) by type (Total / Used / Available) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
CPU Usage | CPU usage ratio (%) by mode (idle / user / system / steal / iowait / softirq / nice) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Memory Distribution | Memory distribution (GiB) by type (Cached / Used / Free / Buffers) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Network Traffic(KBps) | Network traffic (kBps) by type (Inbound / Outbound for each item) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Network Utilization | Network utilization (MiB) by type (Sent / Received) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Swap | Swap usage (B) by type (Used / Free) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
Swap Activity | Swap activity (Bps) by type (Swap In / Swap Out) of the selected Node during the selected interval (min, max, and avg are displayed separately) |
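
The CPU Usage panel above breaks usage down by mode (idle, user, system, and so on). Such a breakdown is commonly derived from cumulative per-mode CPU time counters sampled at the start and end of an interval. The sketch below illustrates that calculation under this assumption; `cpu_mode_percentages` is a hypothetical helper, the input values are made up, and the dashboard's actual query may differ.

```python
def cpu_mode_percentages(before, after):
    """Return the share (%) of CPU time spent in each mode over an interval.

    before/after: dict mapping mode name -> cumulative CPU seconds,
    sampled at the start and end of the interval.
    """
    deltas = {mode: after[mode] - before.get(mode, 0.0) for mode in after}
    total = sum(deltas.values())
    if total <= 0:
        # No CPU time elapsed between the samples; report zeros.
        return {mode: 0.0 for mode in deltas}
    return {mode: round(100.0 * delta / total, 1) for mode, delta in deltas.items()}


if __name__ == "__main__":
    # Made-up counter values for illustration only.
    start = {"idle": 1000.0, "user": 200.0, "system": 100.0, "iowait": 50.0}
    end = {"idle": 1080.0, "user": 210.0, "system": 106.0, "iowait": 54.0}
    print(cpu_mode_percentages(start, end))
```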
Dashboard write Guide
http://docs.grafana.org/reference/templating/