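Alert rules are defined in the Prometheus chart files.
They cannot be created from the Prometheus web console.
[ Workflow ]
1. Create the alert rules
2. Redeploy Prometheus
3. Confirm the rules are loaded in Prometheus
4. Confirm Prometheus is evaluating the rules
5. Confirm the Slack alert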
Alert guides
- https://godekdls.github.io/Prometheus/alerting.overview/
- https://samber.github.io/awesome-prometheus-alerts/rules.html
1. Prerequisite
Set up Slack delivery so that alerts raised in Prometheus are forwarded to Slack.
- How to configure Slack alerts: https://dongwook35.tistory.com/50
2. Where the Prometheus alert rules live
- Rule directory: prometheus/templates/prometheus/rules-1.14/
- Create an add-rule.yaml file in that directory and define the rules there.
3. Add the rules
- Source file: https://github.com/tingtomkim/devops/blob/main/monitoring/prometheus/templates/prometheus/rules-1.14/add-rule.yaml
cd prometheus/templates/prometheus/rules-1.14/
vi add-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: monitor
    prometheus-operator-validated: "true"
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 41.7.3
    chart: kube-prometheus-stack-41.7.3
    heritage: Helm
    release: prometheus
  name: add-rule
  namespace: monitor
spec:
  groups:
  - name: add-rule # group name
    rules:
    # - alert: test-KubePodNotReady # alert name
    #   annotations:
    #     description: 'Pod {{`{{`}} $labels.namespace {{`}}`}}/{{`{{`}} $labels.pod {{`}}`}} has been in a non-ready state for longer than 2 minutes.'
    #     runbook_url: 'https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready'
    #     summary: 'Pod has been in a non-ready state for more than 2 minutes.'
    #   expr: |- # alert condition
    #     sum by (namespace, pod, cluster) (
    #       max by(namespace, pod, cluster) (
    #         kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"}
    #       ) * on(namespace, pod, cluster) group_left(owner_kind) topk by(namespace, pod, cluster) (
    #         1, max by(namespace, pod, owner_kind, cluster) (kube_pod_owner{owner_kind!="Job"})
    #       )
    #     ) > 0
    #   #for: 2m
    #   labels:
    #     severity: warning # set any labels you need
    - alert: InstanceDown
      expr: up == 0
      for: 2m
      labels:
        severity: "critical"
      annotations:
        title: "Instance {{`{{`}} $labels.instance {{`}}`}} down"
        summary: "Endpoint {{`{{`}} $labels.instance {{`}}`}}"
        identifier: "{{`{{`}} $labels.instance {{`}}`}}"
        description: "{{`{{`}} $labels.instance {{`}}`}} of job {{`{{`}} $labels.job {{`}}`}} has been down for more than 2 minutes."
    - alert: HostOutOfMemory
      expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Host out of memory (instance {{`{{`}} $labels.instance {{`}}`}})"
        description: "Node memory is filling up (< 10% left)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS: {{`{{`}} $labels {{`}}`}}"
    - alert: HostMemoryUnderMemoryPressure
      expr: rate(node_vmstat_pgmajfault[1m]) > 1000
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Host memory under memory pressure (instance {{`{{`}} $labels.instance {{`}}`}})"
        description: "The node is under heavy memory pressure. High rate of major page faults\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS: {{`{{`}} $labels {{`}}`}}"
    # Please add ignored mountpoints in node_exporter parameters like
    # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
    # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
    - alert: HostOutOfDiskSpace
      expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Host out of disk space (instance {{`{{`}} $labels.instance {{`}}`}})"
        description: "Disk is almost full (< 10% left)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS: {{`{{`}} $labels {{`}}`}}"
    - alert: HostHighCpuLoad
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: "Host high CPU load (instance {{`{{`}} $labels.instance {{`}}`}})"
        description: "CPU load is > 80%\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS: {{`{{`}} $labels {{`}}`}}"
    - alert: HostHighCpuLoad1
      expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "CPU load is > 80%\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostCpuHighIowait
      expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Host CPU high iowait (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostOutOfMemory
      expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Host out of memory (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Node memory is filling up (< 20% left)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    # Please add ignored mountpoints in node_exporter parameters like
    # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
    # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
    - alert: HostOutOfDiskSpace
      expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 20 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Host out of disk space (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Disk is almost full (< 20% left)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostUnusualDiskReadRate
      expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host unusual disk read rate (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Disk is probably reading too much data (> 50 MB/s)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostUnusualDiskWriteRate
      expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Host unusual disk write rate (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Disk is probably writing too much data (> 50 MB/s)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostUnusualNetworkThroughputIn
      expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host unusual network throughput in (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: HostUnusualNetworkThroughputOut
      expr: (sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host unusual network throughput out (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: ContainerCpuUsage
      expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container CPU usage (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Container CPU usage is above 80%\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
    - alert: ContainerMemoryUsage
      expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container Memory usage (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Container Memory usage is above 80%\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: ContainerVolumeUsage
      expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container Volume usage (instance {{`{{`}} $labels.instance {{`}}`}})
        description: "Container Volume usage is above 80%\n VALUE = {{`{{`}} $value {{`}}`}}\n LABELS = {{`{{`}} $labels {{`}}`}}"
    - alert: KubernetesMemoryPressure
      expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes memory pressure (instance {{`{{`}} $labels.instance{{`}}`}})
        description: "{{`{{`}} $labels.node{{`}}`}} has MemoryPressure condition\n VALUE = {{`{{`}} $value{{`}}`}}\n LABELS = {{`{{`}} $labels{{`}}`}}"
    - alert: KubernetesDiskPressure
      expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes disk pressure (instance {{`{{`}} $labels.instance{{`}}`}})
        description: "{{`{{`}} $labels.node{{`}}`}} has DiskPressure condition\n VALUE = {{`{{`}} $value{{`}}`}}\n LABELS = {{`{{`}} $labels{{`}}`}}"
    - alert: KubernetesContainerOomKiller
      expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes container oom killer (instance {{`{{`}} $labels.instance{{`}}`}})
        description: "Container {{`{{`}} $labels.container{{`}}`}} in pod {{`{{`}} $labels.namespace{{`}}`}}/{{`{{`}} $labels.pod{{`}}`}} has been OOMKilled {{`{{`}} $value{{`}}`}} times in the last 10 minutes.\n VALUE = {{`{{`}} $value{{`}}`}}\n LABELS = {{`{{`}} $labels{{`}}`}}"
    - alert: KubernetesPersistentvolumeclaimPending
      expr: kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes PersistentVolumeClaim pending (instance {{`{{`}} $labels.instance{{`}}`}})
        description: "PersistentVolumeClaim {{`{{`}} $labels.namespace{{`}}`}}/{{`{{`}} $labels.persistentvolumeclaim{{`}}`}} is pending\n VALUE = {{`{{`}} $value{{`}}`}}\n LABELS = {{`{{`}} $labels{{`}}`}}"
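A note on the odd-looking braces in the annotations: the file sits under the chart's templates/ directory, so Helm renders it before Prometheus ever sees it. A template action that holds a backtick raw string prints that string literally, which is how the Prometheus placeholders survive rendering. The Python sketch below only imitates that single substitution with a regular expression so the rendered text is easy to see. It is not Helm, and render_raw_string_actions is a hypothetical helper name introduced for this illustration.

```python
import re

# Matches a template action whose whole body is one backtick raw string,
# e.g. the two escapes used around "$labels.instance" in add-rule.yaml.
_RAW_STRING_ACTION = re.compile(r"\{\{`([^`]*)`\}\}")


def render_raw_string_actions(text: str) -> str:
    """Replace each raw-string action with the literal text it contains.

    Illustration only: real Helm rendering handles far more than this.
    """
    return _RAW_STRING_ACTION.sub(lambda match: match.group(1), text)


# The title annotation of the InstanceDown rule, as written in the chart.
chart_line = "Instance {{`{{`}} $labels.instance {{`}}`}} down"

print(render_raw_string_actions(chart_line))
# -> Instance {{ $labels.instance }} down
```

The rendered form is what Prometheus evaluates as an annotation template when the alert fires.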
3. Check the rule configuration
- Log in to Prometheus
- Click Status -> Rules
- Confirm that the new rules were applied
- Under Rules, look for the group name "add-rule"
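The same check can be scripted against the Prometheus HTTP API, whose /api/v1/rules endpoint lists the loaded rule groups. This is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 (for example through a port-forward). That address and the function names are assumptions for illustration, not part of the setup above.

```python
import json
import urllib.request
from typing import List, Tuple


def fetch_rules(base_url: str) -> dict:
    """GET /api/v1/rules from a Prometheus server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=10) as resp:
        return json.load(resp)


def alert_states_in_group(payload: dict, group_name: str) -> List[Tuple[str, str]]:
    """Return (alert name, state) pairs for the alerting rules of one group.

    A list is used instead of a dict because a group may contain
    several rules that share the same alert name.
    """
    found = []
    for group in payload.get("data", {}).get("groups", []):
        if group.get("name") != group_name:
            continue
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                found.append((rule["name"], rule.get("state", "unknown")))
    return found


# Example usage against a live server (address is an assumption):
#   for name, state in alert_states_in_group(fetch_rules("http://localhost:9090"), "add-rule"):
#       print(name, state)
```

An empty result for "add-rule" would suggest the PrometheusRule object was not picked up, which is worth checking before looking at Slack.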
4. Check that the rule is evaluating metrics
- e.g. confirm that the InstanceDown rule is being evaluated normally
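5. How an alert reaches Slack
Alert lifecycle
- Inactive: no time series matches the alert condition (the normal state).
- Pending: at least one time series matches the alert condition.
- If the condition keeps holding for the "for" duration defined in the alert rule, the state changes to "Firing".
- Firing: once an alert is firing, Prometheus sends the alert to Alertmanager.
- Alertmanager forwards the alert message to the destination configured under "receivers" (Slack).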
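The lifecycle above can be traced with a small simulation. This is a simplified sketch, not Prometheus code: it assumes the rule is evaluated at a fixed interval (30 seconds in the example, an assumed value), tracks a single series, and ignores Alertmanager grouping and options such as keep_firing_for.

```python
from typing import List, Optional


def alert_states(condition_by_eval: List[bool], for_seconds: int,
                 interval_seconds: int) -> List[str]:
    """Return the alert state after each rule evaluation.

    condition_by_eval[i] says whether the alert expression matched
    at evaluation i, which happens at time i * interval_seconds.
    """
    states = []
    active_since: Optional[int] = None  # time the condition first matched
    for index, matched in enumerate(condition_by_eval):
        now = index * interval_seconds
        if not matched:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        # The alert fires once the condition has held for the "for" duration.
        if now - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# InstanceDown uses "for: 2m"; the target goes down at the second evaluation.
print(alert_states([False, True, True, True, True, True, False], 120, 30))
# -> ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing', 'inactive']
```

With "for: 0m", as in HostHighCpuLoad, the same function returns "firing" at the first matching evaluation, so there is no pending phase to absorb short spikes.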
6. Query metric state
- Click Graph
- Enter a PromQL expression to inspect the metric.
- up == 0 (instance down)
- up == 1 (instance up)
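The same queries can be sent to the Prometheus HTTP API through /api/v1/query instead of the Graph page. A minimal sketch follows, again assuming Prometheus answers at http://localhost:9090. The address and helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import List


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def run_query(base_url: str, promql: str) -> dict:
    """Send the instant query and decode the JSON response."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        return json.load(resp)


def instances_in_result(payload: dict) -> List[str]:
    """Collect the "instance" label of every series in a vector result."""
    series = payload.get("data", {}).get("result", [])
    return [item.get("metric", {}).get("instance", "") for item in series]


print(build_query_url("http://localhost:9090", "up == 0"))
# -> http://localhost:9090/api/v1/query?query=up+%3D%3D+0

# Example usage against a live server (address is an assumption):
#   print(instances_in_result(run_query("http://localhost:9090", "up == 0")))
```

An empty list for "up == 0" means no scrape target is currently reported as down.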
7. Silencing alerts
- Log in to Alertmanager
- Click Silences -> "New Silence"
Create a silence
- Duration: 999999h
- Matchers (the alerts to silence): fill in
- Creator and Comment: fill in
- Click Create
Check the silenced alerts
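A silence can also be created without the UI by posting to the Alertmanager v2 API at /api/v2/silences. The sketch below builds the same kind of silence as the form: a matcher, a duration in hours, a creator and a comment. It assumes Alertmanager is reachable at http://localhost:9093 and that the silence matches on the alertname label. Both are assumptions for illustration, as are the function names.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional

_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC


def build_silence(alertname: str, hours: int, created_by: str, comment: str,
                  now: Optional[datetime] = None) -> dict:
    """Build the JSON body for a silence that matches one alert name."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": "alertname", "value": alertname,
             "isRegex": False, "isEqual": True},
        ],
        "startsAt": start.strftime(_TIME_FORMAT),
        "endsAt": end.strftime(_TIME_FORMAT),
        "createdBy": created_by,
        "comment": comment,
    }


def post_silence(base_url: str, silence: dict) -> dict:
    """POST the silence to Alertmanager and return the decoded response."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(silence).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)


# Example usage against a live server (address and values are assumptions):
#   body = build_silence("InstanceDown", 999999, "ops", "known maintenance")
#   print(post_silence("http://localhost:9093", body))
```

A duration of 999999h is roughly 114 years, so such a silence is effectively permanent and has to be expired by hand when it is no longer wanted.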