Overview
In this post, OpenTelemetry is abbreviated as otel.
On AWS EKS, otel can collect logs, metrics, and traces from Pods.
1. Metrics targets collected by otel
Job | cAdvisor | node-exporter | kubelet | kubelet-probes | kube-state-metrics |
Path | /metrics/cadvisor | /metrics | /metrics | /metrics/probes | /metrics |
Port | 10250 | 9100 | 10250 | 10250 | 8080 |
Description | Container (Pod) status: container CPU, memory, disk, network, etc. | Node information: CPU, memory, disk, network, etc. | Node and Pod status and performance metrics | Probe information such as liveness | Kubernetes resource status: Deployments, Pods, Nodes, etc. |
2. Metrics collection architecture
Previous metrics collection
WorkerNode -> <kubelet API> -> prometheus -> Grafana
WorkerNode -> node-exporter -> prometheus -> Grafana
WorkerNode -> kube-state-metrics -> prometheus -> Grafana
Metrics collection with otel
WorkerNode -> <kubelet API> -> Otel -> mimir -> Grafana
WorkerNode -> node-exporter -> Otel -> mimir -> Grafana
WorkerNode -> kube-state-metrics -> Otel -> mimir -> Grafana
WorkerNode -> Otel -> mimir -> Grafana
The metrics store was changed from Prometheus to Mimir because Mimir keeps its data in S3
and runs as a set of separate components.
Mimir itself will be covered in detail in a separate post.
3. Pros and cons of otel
- Several agents (promtail, node-exporter, prometheus, etc.) can now be consolidated into and managed as a single otel.
- Every receiver in otel has to be configured by hand, so the barrier to entry is high.
- When otel is deployed as a DaemonSet, every instance scrapes the same targets, so metrics are collected in duplicate.
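One possible way around the duplicate-scrape problem is sketched here as an assumption, not something this post tested. When the collector runs as a DaemonSet, each instance can keep only the targets hosted on its own node. The fragment assumes the KUBE_NODE_NAME environment variable is injected from spec.nodeName (the Deployment manifest in section 4 already does this) and that the collector expands ${env:KUBE_NODE_NAME} before the prometheus receiver parses its config. The job name is hypothetical.

```yaml
# Sketch only: a per-node filter for an endpoints-based job in DaemonSet mode.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node-exporter-local   # hypothetical job name
          kubernetes_sd_configs:
            - role: endpoints
          relabel_configs:
            # Keep only endpoints hosted on the same node as this collector Pod.
            - source_labels: [__meta_kubernetes_endpoint_node_name]
              regex: ${env:KUBE_NODE_NAME}
              action: keep
```

With a rule like this, each collector scrapes only its local targets instead of every target in the cluster. Running a single-replica Deployment, as this post does, avoids the problem without any filter.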
4. Installing otel
otel reads its settings from a ConfigMap at startup.
ConfigMap configuration
kubectl apply -f otel_configmap_metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    prometheus.io/scrape: "true"
  name: otel-collector-config-metrics
  namespace: monitor
data:
  otel-collector-config.yaml: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 60s
            scrape_timeout: 10s
          scrape_configs:
            - job_name: cadvisor
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: true
              scrape_interval: 10s
              scrape_timeout: 10s
              follow_redirects: true
              enable_compression: true
              kubernetes_sd_configs:
                - role: endpoints
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              metrics_path: /metrics/cadvisor
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              metric_relabel_configs:
                - source_labels: [__name__]
                  separator: ;
                  regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_memory_(mapped_file|swap)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_(file_descriptors|tasks_state|threads_max)
                  replacement: $1
                  action: drop
                - source_labels: [__name__, scope]
                  separator: ;
                  regex: container_memory_failures_total;hierarchy
                  replacement: $1
                  action: drop
                - source_labels: [__name__, interface]
                  separator: ;
                  regex: container_network_.*;(cali|cilium|cni|lxc|nodelocaldns|tunl).*
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_spec.*
                  replacement: $1
                  action: drop
                - source_labels: [id, pod]
                  separator: ;
                  regex: .+;
                  replacement: $1
                  action: drop
            - job_name: Kubelet
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: https
              enable_compression: true
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              metric_relabel_configs:
                - source_labels: [__name__, le]
                  separator: ;
                  regex: (csi_operations|storage_operation_duration)_seconds_bucket;(0.25|2.5|15|25|120|600)(\.0)?
                  replacement: $1
                  action: drop
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            - job_name: kubelet-probes
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics/probes
              scheme: https
              enable_compression: true
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            - job_name: kube-state-metrics
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: http
              enable_compression: true
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kube-state-metrics);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: http
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: http
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            - job_name: node-exporter
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: http
              enable_compression: true
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (prometheus-node-exporter);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_jobLabel]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: http-metrics
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
    processors:
      memory_limiter:
        check_interval: 3s
        limit_percentage: 75
        spike_limit_percentage: 25
      batch:
        timeout: 5s
        send_batch_size: 256
        send_batch_max_size: 131072 # upper bound on items per batch (a count, not bytes)
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        filter:
          node: "this_node" # literal node name; "this_node" looks like a placeholder to replace
        extract:
          metadata:
            - "k8s.pod.name"
            - "k8s.namespace.name"
            - "k8s.node.name"
            - "k8s.container.name"
    exporters:
      prometheusremotewrite:
        endpoint: "http://mimir-distributor-headless.monitor.svc:8080/api/v1/push"
        tls:
          insecure: true
        headers:
          X-Scope-OrgID: "Mimir"
      prometheus:
        endpoint: "0.0.0.0:8889"
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          # memory_limiter goes first so it can push back before other processors do work
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [prometheusremotewrite, prometheus]
      # telemetry:
      #   logs:
      #     level: "debug"
      #     encoding: "console"
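When scraping misbehaves, the commented-out telemetry block at the end of the config can be enabled to get more verbose collector logs. A minimal sketch of just that part of the service section, using the same values as the commented lines; the pipelines definition stays as it is:

```yaml
# Sketch only: the telemetry block from the config, uncommented.
service:
  telemetry:
    logs:
      level: "debug"       # more verbose collector logs
      encoding: "console"  # plain-text log lines
  # pipelines: unchanged, as defined in the ConfigMap
```

After re-applying the ConfigMap and restarting the collector Pod, the logs can be read with kubectl logs -n monitor deploy/otel-collector-metrics. Debug output is noisy, so it is probably best commented out again once the problem is found.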
Deploying otel as a Deployment
kubectl apply -f otel.yaml -n monitor
apiVersion: apps/v1
#kind: DaemonSet
kind: Deployment
metadata:
  name: otel-collector-metrics
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: otel-collector-metrics
  template:
    metadata:
      labels:
        app: otel-collector-metrics
    spec:
      serviceAccountName: otel-collector
      priorityClassName: system-node-critical
      #hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-collector-config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config-metrics
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    # resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces"]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces", "events", "namespaces/status", "nodes/spec", "pods/status", "replicationcontrollers", "replicationcontrollers/status", "resourcequotas", "nodes/metrics", "nodes/stats"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "daemonsets", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitor
5. Testing otel
How to check that otel is scraping correctly
1. Open port 8889
The prometheus exporter in otel listens on port 8889, so querying it shows whether metrics are being collected.
curl "otel pod IP":8889/metrics
curl 10.10.10.10:8889/metrics
2. Analyze logs in debug mode
Enable the commented-out telemetry block in the ConfigMap and read the collector logs.
Checking the collected metrics
cAdvisor test
curl 10.10.10.10:8889/metrics | grep container_cpu_usage_seconds_total
curl 10.10.10.10:8889/metrics | grep container_memory_usage_bytes
curl 10.10.10.10:8889/metrics | grep container_network_receive_bytes_total
> Check for instance="10.10.10.x:10250", job="kubelet", metrics_path="/metrics/cadvisor"
Kubelet test
curl 10.10.10.10:8889/metrics | grep kubelet_volume_stats_used_bytes
curl 10.10.10.10:8889/metrics | grep kubelet_runtime_operations_duration_seconds
curl 10.10.10.10:8889/metrics | grep kubelet_network_plugin_operations_duration_seconds
> Check for instance="10.10.10.x:10250", job="kubelet", metrics_path="/metrics"
kubelet-probes test
curl 10.10.10.10:8889/metrics | grep prober_probe_duration_seconds_bucket
curl 10.10.10.10:8889/metrics | grep prober_probe_duration_seconds_count
curl 10.10.10.10:8889/metrics | grep prober_probe_total
curl 10.10.10.10:8889/metrics | grep process_start_time_seconds
> Check for instance="10.10.10.x:10250", job="kubelet", metrics_path="/metrics/probes"
kube-state-metrics test
curl 10.10.10.10:8889/metrics | grep kube_pod_status_phase
curl 10.10.10.10:8889/metrics | grep kube_deployment_status_replicas
> Check for instance="10.10.10.x:8080", job="kube-state-metrics"
node-exporter test
curl 10.10.10.10:8889/metrics | grep node_filesystem_
curl 10.10.10.10:8889/metrics | grep node_network_
> Check for instance="10.10.10.x:9100", job="node-exporter"
6. Notes
- How kube-state-metrics metrics are collected
  - Deploy kube-state-metrics, then discover its endpoint and scrape it.
- How node-exporter metrics are collected
  - Deploy node-exporter, then discover all of its endpoints and scrape them.
- How cAdvisor, kubelet, and kubelet-probes metrics are collected
  - cAdvisor, kubelet, and kubelet-probes are each scraped by discovering their endpoints.
  - These jobs need the endpoints of the nodes, but no such endpoints are defined, so nothing is collected.
  - The prometheus operator has to be deployed so that it creates the endpoints to scrape.
Collecting with role: node instead of role: endpoints might also solve this.
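The role: node idea can be sketched as follows. This is an untested assumption, not the configuration used in this post. With role: node, Prometheus service discovery creates one target per Node object and by default points it at that node's kubelet port, so no kubelet Service or Endpoints object should be needed. The job name is hypothetical; the TLS and token settings are copied from the jobs in the ConfigMap.

```yaml
# Sketch only: scrape cAdvisor through node discovery instead of endpoints.
- job_name: cadvisor-node            # hypothetical job name
  scheme: https
  metrics_path: /metrics/cadvisor
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node                     # one target per Node
  relabel_configs:
    # Copy the discovered node name into a "node" label.
    - source_labels: [__meta_kubernetes_node_name]
      target_label: node
      action: replace
```

The same pattern with metrics_path set to /metrics or /metrics/probes would presumably cover the kubelet and kubelet-probes jobs. The ClusterRole in section 4 already grants get, list, and watch on nodes and nodes/metrics, which a job like this relies on.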