Monitoring Query (Namespace Admin)

Cloud Z CP's monitoring service is built from the following open source components.

  • Prometheus: collects and stores metrics
  • Alertmanager: processes the alerts generated by Prometheus
  • Exporters (node-exporter, kube-state-metrics, blackbox-exporter, elasticsearch-exporter, etc.): expose third-party metrics in various ways so that Prometheus can collect them
  • Grafana: visualizes the collected metrics using Prometheus queries and presents them as dashboards that are easy for users to understand
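
As a rough illustration of what "using Prometheus queries" means in practice, the sketch below builds the URL for an instant query against the Prometheus HTTP API (`/api/v1/query`). The host name `prometheus.example` and the helper name `build_instant_query_url` are assumptions made for this example; they are not part of Cloud Z CP, and Grafana issues such queries for you when it renders a panel.

```python
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the URL for a Prometheus instant query (GET /api/v1/query)."""
    # urlencode percent-encodes braces, quotes, and '=' in the PromQL text.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical address; replace it with your own Prometheus endpoint.
url = build_instant_query_url("http://prometheus.example:9090", 'up{job="prometheus"}')
print(url)
```

The function only builds the URL and does not send a request, so it can be tried without access to a cluster.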

This section explains how to use Grafana's Dashboard and each item in the default Dashboard.

For more detailed information or instructions on how to use Grafana, please refer to the Grafana Docs.

To use the service, click Monitoring on the ZCP Console side menu.

Dashboards and Panels that were added, changed, or deleted in a version update are marked using the following legend.

  • (plus) Version, Content: Added Dashboard or Panel
  • (warning) Version, Content: Changed Dashboard or Panel
  • (minus) Version, Content: Deleted Dashboard or Panel

Go to Dashboard

  1. Select the Home menu at the top.


  2. You can see the recently selected Dashboard (Recent) and the default Folders (4).

  3. When you select one of the default Folders, the Dashboards belonging to that Folder will be displayed.

  4. When you select a Dashboard, you will see a screen composed of various panels.

Built-in Dashboard

This section describes the Dashboards provided by default in Cloud Z CP Public.

Addon Dashboards

ElasticSearch

Displays information about Elasticsearch (JVM, CPU, Memory, Documents, Indices, etc.)

| Row | Panel | Description |
| --- | --- | --- |
| KPI | Cluster health | Current status of the Elasticsearch cluster (N/A / Green / Yellow / Red) |
| | Tripped for breakers | Average number of tripped circuit breakers in the cluster |
| | CPU usage Avg. | Average CPU usage |
| | JVM memory used Avg. | Average JVM memory usage |
| | Nodes | Number of nodes in the cluster |
| | Data nodes | Number of data nodes in the cluster |
| | Pending tasks | Cluster-level changes that have not yet been executed |
| | Open file descriptors per cluster | Total number of open files in Elasticsearch |
| Shards | Active primary shards | Number of primary shards in the cluster, aggregated across all indices |
| | Active shards | Aggregate total of all shards across all indices, including replica shards |
| | Initializing shards | Number of shards that are being freshly created |
| | Relocating shards | Number of shards that are currently moving from one node to another |
| | Delayed shards | Shards whose allocation is delayed to reduce reallocation overhead |
| | Unassigned shards | Number of shards that exist in the cluster state but cannot be found in the cluster itself |
| JVM Garbage Collection | GC count | Number of garbage collection runs |
| | GC time | Time spent on garbage collection |
| CPU and Memory | Load average | Load average of Elasticsearch |
| | CPU usage | CPU used by Elasticsearch |
| | JVM memory usage | JVM memory used by Elasticsearch |
| | JVM memory committed | JVM memory committed by Elasticsearch |
| Disk and Network | Disk usage | Disk used by Elasticsearch |
| | Network usage | Network used by Elasticsearch |
| Documents | Documents count on node | Number of documents stored on the data node |
| | Documents indexed rate | Rate at which documents are indexed |
| | Documents deleted rate | Rate at which documents are deleted |
| | Documents merged rate | Rate at which documents are merged |
| | Documents merged bytes | Size of merged documents (bytes) |
| Times | Query time | Query execution time |
| | Indexing time | Indexing execution time |
| | Merging time | Merging execution time |
| | Throttle time for index store | Throttle time for storing the index |
| Indices: Count of documents and Total size | Count of documents with only primary shards | Number of documents in primary shards |
| | Total size of stored index data in bytes with only primary shards on all nodes | Total size of index data stored in primary shards |
| | Total size of stored index data in bytes with all shards on all nodes | Total size of index data stored in all shards |
| Indices: Index writer | Index writer with only primary shards on all nodes in bytes | Size used by the index writer on primary shards (bytes) |
| | Index writer with all shards on all nodes in bytes | Size used by the index writer on all shards (bytes) |

ZCP Services Status

Health check of the zcp-system namespace (CPU usage, status values)

| Panel | Description |
| --- | --- |
| Duration | Probe duration in seconds |
| Status : alertmanager | alertmanager health (UP / DOWN) |
| alertmanager Status Code | alertmanager status code |
| Status : grafana | grafana health (UP / DOWN) |
| grafana Status Code | grafana status code |
| Status : prometheus | prometheus health (UP / DOWN) |
| prometheus Status Code | prometheus status code |

Cluster Dashboards

Etcd Cluster

Etcd status values (RPC Rate, DB Size, Disk Sync Duration, etc.)

| Panel | Description |
| --- | --- |
| Etcd has a leader? | Whether etcd has a leader (YES/NO) |
| The number of leader changes seen | Number of times the etcd leader changed |
| The total number of failed proposals seen | Total number of failed proposals |
| RPC Rate | Rate of gRPC calls started or handled over 5 minutes |
| Etcd DB Size | Total size of the etcd MVCC database, in bytes |
| Etcd Disk Sync Duration | WAL fsync duration on the etcd disk over 5 minutes (99th percentile of the histogram) |
| Etcd Memory | Memory usage of the 'etcd' job |
| Etcd Client Traffic In | Total gRPC traffic received from etcd network clients over 5 minutes |
| Etcd Client Traffic Out | Total gRPC traffic sent to etcd network clients over 5 minutes |
| Etcd Peer Traffic In | Total traffic received from etcd network peers over 5 minutes |
| Etcd Peer Traffic Out | Total traffic sent to etcd network peers over 5 minutes |
| Etcd Proposals rate(Fail,Pending,commit,apply) | Total number of proposals committed by the etcd server over 5 minutes |
| Etcd Disk operations(AVG) | Total number of backend commits made by the etcd disk over 2 minutes |
| Network | Total gRPC traffic received from etcd network clients over 2 minutes |
| Snapshot duration | Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable. |
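
Several panels in this table report a counter "over 5 minutes" or "over 2 minutes". The sketch below shows, under simplifying assumptions, how a per-second rate can be derived from raw counter samples in such a window. It is not the dashboard's actual query: Prometheus's own `rate()` function also extrapolates to the window boundaries, which this example leaves out, and the name `simple_rate` is made up for illustration.

```python
from typing import Sequence, Tuple


def simple_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Per-second increase of a counter over the sampled window.

    samples: (timestamp_in_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example by a restart),
        # so the new value is counted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


# Three samples spanning a 5-minute (300 s) window.
window = [(0.0, 100.0), (150.0, 400.0), (300.0, 700.0)]
print(simple_rate(window))
```

With the sample window above the counter grows by 600 over 300 seconds, so the function returns 2.0 per second.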

Kubernetes: Cluster Overview

Information about total/node average/cluster average resources (number of nodes/pods/containers, CPU/memory/network usage, etc.)

| Row | Panel | Description |
| --- | --- | --- |
| Resource Dashboard | Alertmanager Alerts Firing | Total number of firing alerts |
| | Node Not Ready | Number of Nodes in 'Not Ready' state |
| | Node Unschedulable | Number of Nodes in 'Unschedulable' state |
| | Node Memory Pressure | Number of Nodes in 'Memory Pressure' state |
| | Node Disk Pressure | Number of Nodes in 'Disk Pressure' state |
| | Running Pod Total | Number of Pods currently in 'Running' state |
| | Running Pod Total by Node | Number of Pods currently in 'Running' state on each Node |
| | Running Container Total | Number of Containers currently in 'Running' state |
| | Running Container Total by Node | Number of Containers currently in 'Running' state on each Node |
| Node Resource Usage | Number of Node | Total number of Nodes in the current cluster |
| | Total CPU | Total CPU of the Nodes in the current cluster |
| | Used Memory | Memory used by the Nodes in the current cluster |
| | Total Memory | Total memory of the Nodes in the current cluster |
| | Disk Usage | Disk used by the Nodes in the current cluster |
| | Disk Total | Total disk of the Nodes in the current cluster |
| | Avg CPU Usage | Average CPU usage of the Nodes in the current cluster |
| | Avg Memory Usage | Average memory usage of the Nodes in the current cluster |
| | Avg Disk Usage | Average disk usage of the Nodes in the current cluster |
| | Network Usage (Node NIC) | Network usage of the Nodes in the current cluster |
| Cluster Resource Usage | Cluster CPU Usage(Used/Total) | Current CPU usage (%) of all Nodes in the cluster. The total CPU (Cores) and the amount used are also displayed below. |
| | Cluster Memory Usage(Used/Total) | Current memory usage (%) of all Nodes in the cluster. The total memory (GiB) and the amount used are also displayed below. |
| | Cluster Disk Usage(Used/Total) | Current disk usage (%) of all Nodes in the cluster. The total disk (GiB) and the amount used are also displayed below. |
| | Pod Count by namespace | Number of Pods registered in Kubernetes, by namespace |
| | Container Count by namespace | Number of Containers registered in Kubernetes, by namespace |
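
The "Used/Total" panels show a percentage together with the absolute amounts. A minimal sketch of that arithmetic follows; the function names and the output format are assumptions for illustration only and do not reproduce the dashboard's actual queries.

```python
GIB = 2 ** 30  # number of bytes in one GiB


def usage_percent(used: float, total: float) -> float:
    """Return used/total as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total * 100.0


def format_used_total(used_bytes: float, total_bytes: float) -> str:
    """Render a 'percent (used / total)' label with amounts in GiB."""
    pct = usage_percent(used_bytes, total_bytes)
    return f"{pct:.1f}% ({used_bytes / GIB:.1f} GiB / {total_bytes / GIB:.1f} GiB)"


# Example: 6 GiB used out of 8 GiB.
print(format_used_total(6 * GIB, 8 * GIB))
```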

Kubernetes: Performance Overview

API Server requests/latency, running Pod/Container trends, creation rates, etc.

| Panel | Description |
| --- | --- |
| APIServer Request Rate | Total number of requests to the APIServer over 2 minutes |
| APIServer Latency | Average request latency of the APIServer |
| Kubelet POD Start Latency | Latency in microseconds for a single pod to go from pending to running. Broken down by pod name. |
| Running Pod Trands | Number of Pods in 'running' state in the kubelet |
| Create Rate of Pods | Rate of newly created Pods in the kubelet over 2 minutes |
| Running Containers Trands | Number of Containers in 'running' state in the kubelet |
| Create Rate of Containers | Rate of newly created Containers in the kubelet over 2 minutes |

Kubernetes: Resource Requests

Displays information about the CPU/Memory usage and Pod count of the Nodes.

| Panel | Description |
| --- | --- |
| Cluster CPU(Allocated/Request) | This represents the total [CPU resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu) in the cluster. For comparison the total [allocatable CPU cores](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
| Cluster Memory(Allocated/Request) | This represents the total [memory resource requests](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-memory) in the cluster. For comparison the total [allocatable memory](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
| Cluster Pod(Allocated/Request) | This represents the total number of [pods](https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#how-pods-with-resource-limits-are-run) in the cluster. For comparison the total [allocatable pods](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md) is also shown. |
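
To make the request/allocatable comparison concrete, the sketch below sums CPU requests written in Kubernetes quantity notation (for example "500m" for half a core) and relates the sum to the allocatable cores. The helper names and sample values are hypothetical, only the plain and millicore ("m") forms are handled, and the dashboard itself derives these figures from Prometheus metrics rather than from code like this.

```python
from typing import Iterable


def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one core.
        return int(quantity[:-1]) / 1000
    return float(quantity)


def request_ratio_percent(requests: Iterable[str], allocatable_cores: float) -> float:
    """Share of allocatable CPU that is covered by resource requests."""
    if allocatable_cores <= 0:
        raise ValueError("allocatable_cores must be positive")
    total_requested = sum(parse_cpu(q) for q in requests)
    return total_requested / allocatable_cores * 100


# Three containers requesting 0.5, 0.25 and 1 core on a 4-core cluster.
print(request_ratio_percent(["500m", "250m", "1"], 4.0))
```

In this example 1.75 of 4 allocatable cores are requested, which is 43.75%.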

Container Dashboards

Kubernetes: DaemonSet Overview

Information about DaemonSets (Replicas, CPU/Memory/Network/Filesystem ((plus) v1.1.0), etc.)

| Panel | Description |
| --- | --- |
| Desired Replicas ((warning) v1.1.0, DESIRED) | Number of daemon pods that should be scheduled, that is, the number of nodes that should be running the daemon pod |
| CURRENT ((plus) v1.1.0) | Number of daemon pods currently scheduled |
| READY ((plus) v1.1.0) | Number of daemon pods currently running and ready |
| Available Replicas ((warning) v1.1.0, AVAILABLE) | Number of daemon pods currently running and available |
| Metadata Generation | Sequence number (metadata generation) of the DaemonSet's desired state |
| DaemonSet Create Time | Creation time of the oldest DaemonSet |
| Total CPU | Sum of CPU (Cores) used by the Containers created by the DaemonSet |
| Total Memory | Sum of memory (MiB) used by the Containers created by the DaemonSet |
| Total Network | Sum of network usage (MBps) of the Containers created by the DaemonSet |
| CPU Usage | CPU usage of the Containers created by the DaemonSet |
| Memory Usage | Memory usage of the Containers created by the DaemonSet |
| Filesystem Read/Write ((plus) v1.1.0) | Filesystem read/write usage of the Containers created by the DaemonSet |
| Network TX/RX ((plus) v1.1.0) | Network transmit/receive usage of the Containers created by the DaemonSet |
| Replicas Status | Status of the DaemonSet's replicas (Ready / Available / Unavailable / Misscheduled) |

Kubernetes: Deployment Overview

Information about Deployments (Replicas, CPU/Memory/Network/Filesystem ((plus) v1.1.0), etc.)

| Panel | Description |
| --- | --- |
| Desired Replicas ((warning) v1.1.0, DESIRED) | Number of Deployment replicas that should be scheduled |
| Available Replicas ((warning) v1.1.0, AVAILABLE) | Number of Deployment replicas that are available |
| Observed Generation | Generation observed by the Deployment controller |
| Metadata Generation | Sequence number (metadata generation) of the Deployment's desired state |
| Deployment Create Time | Creation time of the oldest Deployment |
| AVG CPU ((warning) v1.1.0, Total CPU) | Average CPU (Cores) used by the Containers created by the Deployment ((warning) v1.1.0, sum of CPU (Cores) used by all Containers in the Pods created by the Deployment) |
| AVG Memory ((warning) v1.1.0, Total Memory) | Average memory (MiB) used by the Containers created by the Deployment ((warning) v1.1.0, sum of memory (MiB) used by all Containers in the Pods created by the Deployment) |
| AVG Network ((warning) v1.1.0, Total Network) | Average network (kBps) used by the Containers created by the Deployment ((warning) v1.1.0, sum of network (MiB) used by all Containers in the Pods created by the Deployment) |
| CPU Usage | CPU usage of the Containers created by the Deployment |
| Memory Usage | Memory usage of the Containers created by the Deployment |
| Filesystem Read/Write ((plus) v1.1.0) | Filesystem read/write usage of the Containers created by the Deployment |
| Network TX/RX ((plus) v1.1.0) | Network transmit/receive usage of the Containers created by the Deployment |
| Replicas Status | Status of the Deployment's replicas (Ready / Available / Unavailable / Misscheduled) |
| Spec | Spec of the Deployment's replicas (Replicas / Paused) |

Kubernetes: POD Overview

Information about Pods (Pod status, restart count, and the CPU/Memory/Network/Volume ((plus) v1.1.0)/Filesystem ((plus) v1.1.0) used by the Pod, etc.)

| Panel | Description |
| --- | --- |
| POD Count | Number of Pods in the selected Namespace |
| Pod Status | Status of the selected Namespace and Pod (Failed / Pending / Running / Succeeded / Unknown) |
| Pod Restart Count | Number of restarts for the selected Namespace and Pod |
| CPU Usage | CPU usage and trend of the Containers in the selected Namespace and Pod |
| Memory Usage | Memory usage and trend of the Containers in the selected Namespace and Pod |
| Volume Usage ((plus) v1.1.0) | Usage and trend of the Persistent Volumes used by the Containers in the selected Namespace and Pod |
| Filesystem Read/Write ((plus) v1.1.0) | Filesystem read/write usage and trend of the Containers in the selected Namespace and Pod |
| Network TX/RX | Network transmit/receive usage and trend of the Containers in the selected Namespace and Pod |

Kubernetes: StatefulSets Overview

Information about StatefulSets (Replicas, CPU/Memory/Network/Filesystem ((plus) v1.1.0), etc.)

| Panel | Description |
| --- | --- |
| Desired Replicas ((warning) v1.1.0, DESIRED) | Number of StatefulSet replicas that should be scheduled |
| Available Replicas ((warning) v1.1.0, AVAILABLE) | Number of StatefulSet replicas that are available |
| Observed Generation | Generation observed by the StatefulSet controller |
| Metadata Generation | Sequence number (metadata generation) of the StatefulSet's desired state |
| Statefulset Create Time | Creation time of the oldest StatefulSet |
| Total CPU | Sum of CPU (Cores) used by the Containers created by the StatefulSet |
| Total Memory | Sum of memory (MiB) used by the Containers created by the StatefulSet |
| Total Network | Sum of network (MBps) used by the Containers created by the StatefulSet |
| CPU Usage | CPU usage of the Containers created by the StatefulSet |
| Memory Usage | Memory usage of the Containers created by the StatefulSet |
| Filesystem Read/Write ((plus) v1.1.0) | Filesystem read/write usage of the Containers created by the StatefulSet |
| Network TX/RX ((plus) v1.1.0) | Network transmit/receive usage of the Containers created by the StatefulSet |
| Replicas Status | Status of the StatefulSet's replicas (Current / Available) |

System Dashboards

System Disk Space

Disk usage trends for each Node


| Panel | Description |
| --- | --- |
| Check Root Disk capacity | Amount of disk space used and available on various mount points. Running out of disk space on the OS volume, database volume, or a volume used for temporary space can cause downtime. Some storage may also have reduced performance when only a small amount of space is available. |

System Usage Overview

Usage information for each Node (Idle CPU, DISK I/O, Network received/transmitted, Memory/Disk Usage, etc.)


| Panel | Description |
| --- | --- |
| CPU Core star Idle | Average idle CPU of the selected Node over 5 minutes |
| System Load(1,5,15) | Load average of the selected Node (1 minute / 5 minutes / 15 minutes) |
| Memory Usage | Memory usage by type on the selected Node (memory used / memory buffers / memory cached / memory free) |
| Memory Usage | Total memory usage (%) of the selected Node |
| Disk I/O | Disk I/O by type (read / written) on the selected Node |
| Disk Usage | Total disk usage (%) of the selected Node |
| Network Interface star Received(Byte) | Bytes received from the network by the selected Node over 5 minutes |
| Network Interface star Transmitted(Byte) | Bytes sent to the network by the selected Node over 5 minutes |

System: Overview 

Summary information for each Node (Load Average, Swap, CPU/Memory/Network Usage, etc.)


| Panel | Description |
| --- | --- |
| System Uptime | Uptime of the selected Node during the selected interval |
| Virtual CPU | Virtual CPUs currently allocated to the selected Node |
| RAM | Memory currently allocated to the selected Node |
| Memory Available | Current memory usage ratio (%) of the selected Node |
| Load Average | Load average of the selected Node over the selected interval (min, max, avg displayed separately) |
| Memory | Memory usage (GiB) by type (Total / Used / Available) of the selected Node over the selected interval (min, max, avg displayed separately) |
| CPU Usage | CPU usage ratio (%) by type (idle / user / system / steal / iowait / softirq / nice) of the selected Node over the selected interval (min, max, avg displayed separately) |
| Memory Distribution | Memory distribution (GiB) by type (Cached / Used / Free / Buffers) of the selected Node over the selected interval (min, max, avg displayed separately) |
| Network Traffic(KBps) | Network traffic (kBps) by type (Inbound / Outbound for each item) of the selected Node over the selected interval (min, max, avg displayed separately) |
| Network Utilization | Network utilization (MiB) by type (Sent / Received) of the selected Node over the selected interval (min, max, avg displayed separately) |
| Swap | Swap usage (B) by type (Used / Free) of the selected Node over the selected interval (min, max, avg displayed separately) |
| Swap Activity | Swap activity (Bps) by type (Swap In / Swap Out) of the selected Node over the selected interval (min, max, avg displayed separately) |
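
Most panels on this dashboard show min, max, and avg for the selected interval. The sketch below illustrates what those three values mean for a series of samples; the function name and the sample data are invented for this example, and Grafana calculates these legend values itself.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize(series: Sequence[float]) -> Dict[str, float]:
    """Return the min, max and average of the samples in an interval."""
    if not series:
        raise ValueError("series must not be empty")
    return {"min": min(series), "max": max(series), "avg": mean(series)}


# Hypothetical load-average samples collected during the selected interval.
load_samples = [1.0, 2.0, 6.0]
print(summarize(load_samples))
```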

Dashboard Writing Guide

http://docs.grafana.org/reference/templating/
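
The linked guide covers Grafana templating, where a variable such as `$namespace` in a panel query is replaced by the value selected at the top of the Dashboard. The sketch below imitates that substitution with Python's `string.Template`, which happens to accept the same `$name` and `${name}` forms. It is only an analogy: Grafana performs its own interpolation with additional formatting options, and the query and variable value shown here are assumptions chosen for illustration.

```python
from string import Template


def expand_query(templated_query: str, **variables: str) -> str:
    """Replace $name / ${name} placeholders with the selected values."""
    return Template(templated_query).substitute(**variables)


# Hypothetical panel query that counts Pods in the selected namespace.
query = 'count(kube_pod_info{namespace="$namespace"})'
print(expand_query(query, namespace="zcp-system"))
```

Selecting a different value for the variable changes only the substituted text, so one templated panel can serve every namespace.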
