[OpenTelemetry] Collecting Metrics
김붕어87
2025. 5. 29. 14:05
Overview
OpenTelemetry is abbreviated as otel throughout this post.
On AWS EKS, otel can collect Logs, Metrics, and Traces from Pods.
1. Metrics collection architecture
Previous metrics collection
WorkerNode -> node-exporter -> prometheus -> Grafana
Metrics collection with otel
WorkerNode -> Otel -> mimir -> Grafana
The metrics store was moved from Prometheus to Mimir because Mimir keeps its data in S3 and runs as multiple separate components.
The details will be covered in a separate post.
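The otel half of this flow maps onto the collector's pipeline model: a receiver scrapes the worker nodes, processors act on the data in between, and an exporter pushes it to Mimir, which Grafana then queries. A stripped-down sketch of that wiring, reusing the Mimir distributor address and tenant header from the full configuration in section 3 (the job name `example` is hypothetical, and the relabel rules that select real targets are omitted):

```yaml
# Minimal sketch of "WorkerNode -> Otel -> mimir"
receivers:
  prometheus:                  # "WorkerNode ->": scrape Prometheus-format endpoints
    config:
      scrape_configs:
        - job_name: example    # hypothetical; target-selection relabel_configs omitted
          kubernetes_sd_configs:
            - role: endpoints
processors:
  batch: {}                    # group data points before sending
exporters:
  prometheusremotewrite:       # "-> mimir": Prometheus remote-write push
    endpoint: "http://mimir-distributor-headless.monitor.svc:8080/api/v1/push"
    headers:
      X-Scope-OrgID: "Mimir"   # Mimir tenant ID
service:
  pipelines:
    metrics:                   # ties the three stages together
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```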
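2. Pros and cons of otel
- Pro: several separate agents can be replaced by a single otel collector and managed in one place.
- Con: every collection source (receiver) has to be configured by hand, so the learning curve is steep.
- Con: if otel is deployed as a DaemonSet with the configuration below, every collector Pod discovers and scrapes the same cluster-wide targets, so metrics are collected multiple times. This is why it is deployed as a Deployment in section 3.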
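The duplication comes from every job in the section 3 configuration discovering targets cluster-wide (`role: endpoints`), so each DaemonSet Pod finds the full target list. One common way around it, sketched here as an assumption rather than what this post deploys, is to let each DaemonSet Pod discover only its own node through a field selector, using the `KUBE_NODE_NAME` variable that the section 3 manifest already injects via the downward API. The job name `cadvisor-local-node` is hypothetical:

```yaml
# Hypothetical DaemonSet-mode job: each collector Pod scrapes only the kubelet
# (cAdvisor endpoint) of the node it runs on, so no target is scraped twice.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cadvisor-local-node
          scheme: https
          metrics_path: /metrics/cadvisor
          tls_config:
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node               # target address = the node's kubelet port
              selectors:
                - role: node
                  # Expanded by the collector from the Pod's environment.
                  field: metadata.name=${env:KUBE_NODE_NAME}
```

With `role: node` the scrape goes straight to the kubelet, so it relies on the `nodes/metrics` permission that the ClusterRole in section 3 already grants.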
3. Installing otel
otel starts up by loading its configuration from a ConfigMap.
ConfigMap configuration
kubectl apply -f otel_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    prometheus.io/scrape: "true"
  name: otel-collector-config
  namespace: monitor
data:
  otel-collector-config.yaml: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 60s
            scrape_timeout: 10s
          scrape_configs:
            # cAdvisor (container) metrics via the kubelet. Targets are the endpoints of a
            # Service labeled kubelet (such as the one prometheus-operator maintains).
            - job_name: cadvisor
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: true
              scrape_interval: 10s
              scrape_timeout: 10s
              follow_redirects: true
              enable_compression: true
              kubernetes_sd_configs:
                - role: endpoints
              scheme: https
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              metrics_path: /metrics/cadvisor
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                # Sharding rules: modulus 1 means a single shard, so every target is kept.
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              metric_relabel_configs:
                - source_labels: [__name__]
                  separator: ;
                  regex: container_cpu_(cfs_throttled_seconds_total|load_average_10s|system_seconds_total|user_seconds_total)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_fs_(io_current|io_time_seconds_total|io_time_weighted_seconds_total|reads_merged_total|sector_reads_total|sector_writes_total|writes_merged_total)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_memory_(mapped_file|swap)
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_(file_descriptors|tasks_state|threads_max)
                  replacement: $1
                  action: drop
                - source_labels: [__name__, scope]
                  separator: ;
                  regex: container_memory_failures_total;hierarchy
                  replacement: $1
                  action: drop
                - source_labels: [__name__, interface]
                  separator: ;
                  regex: container_network_.*;(cali|cilium|cni|lxc|nodelocaldns|tunl).*
                  replacement: $1
                  action: drop
                - source_labels: [__name__]
                  separator: ;
                  regex: container_spec.*
                  replacement: $1
                  action: drop
                - source_labels: [id, pod]
                  separator: ;
                  regex: .+;
                  replacement: $1
                  action: drop
            # kubelet's own metrics (same kubelet Service endpoints as above).
            - job_name: Kubelet
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: https
              enable_compression: true
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              metric_relabel_configs:
                - source_labels: [__name__, le]
                  separator: ;
                  regex: (csi_operations|storage_operation_duration)_seconds_bucket;(0.25|2.5|15|25|120|600)(\.0)?
                  replacement: $1
                  action: drop
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            # kubelet probe metrics (liveness/readiness/startup results).
            - job_name: kubelet-probes
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics/probes
              scheme: https
              enable_compression: true
              tls_config:
                insecure_skip_verify: true
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_k8s_app, __meta_kubernetes_service_labelpresent_k8s_app]
                  separator: ;
                  regex: (kubelet);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: https-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_k8s_app]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: https-metrics
                  action: replace
                - source_labels: [__metrics_path__]
                  separator: ;
                  target_label: metrics_path
                  replacement: $1
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            # kube-state-metrics from a Helm release whose instance label is "prometheus".
            - job_name: kube-state-metrics
              honor_labels: true
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: http
              enable_compression: true
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance, __meta_kubernetes_service_labelpresent_app_kubernetes_io_instance]
                  separator: ;
                  regex: (prometheus);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (kube-state-metrics);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: http
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: http
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
            # node-exporter from the same "prometheus" Helm release.
            - job_name: node-exporter
              honor_timestamps: true
              track_timestamps_staleness: false
              scrape_interval: 30s
              scrape_timeout: 10s
              metrics_path: /metrics
              scheme: http
              enable_compression: true
              follow_redirects: true
              relabel_configs:
                - source_labels: [job]
                  separator: ;
                  target_label: __tmp_prometheus_job_name
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance, __meta_kubernetes_service_labelpresent_app_kubernetes_io_instance]
                  separator: ;
                  regex: (prometheus);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name, __meta_kubernetes_service_labelpresent_app_kubernetes_io_name]
                  separator: ;
                  regex: (prometheus-node-exporter);true
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_port_name]
                  separator: ;
                  regex: http-metrics
                  replacement: $1
                  action: keep
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Node;(.*)
                  target_label: node
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
                  separator: ;
                  regex: Pod;(.*)
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_namespace]
                  separator: ;
                  target_label: namespace
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: service
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_name]
                  separator: ;
                  target_label: pod
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_container_name]
                  separator: ;
                  target_label: container
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_pod_phase]
                  separator: ;
                  regex: (Failed|Succeeded)
                  replacement: $1
                  action: drop
                - source_labels: [__meta_kubernetes_service_name]
                  separator: ;
                  target_label: job
                  replacement: $1
                  action: replace
                - source_labels: [__meta_kubernetes_service_label_jobLabel]
                  separator: ;
                  regex: (.+)
                  target_label: job
                  replacement: $1
                  action: replace
                - separator: ;
                  target_label: endpoint
                  replacement: http-metrics
                  action: replace
                - source_labels: [__address__, __tmp_hash]
                  separator: ;
                  regex: (.+);
                  target_label: __tmp_hash
                  replacement: $1
                  action: replace
                - source_labels: [__tmp_hash]
                  separator: ;
                  modulus: 1
                  target_label: __tmp_hash
                  replacement: $1
                  action: hashmod
                - source_labels: [__tmp_hash]
                  separator: ;
                  regex: "0"
                  replacement: $1
                  action: keep
              kubernetes_sd_configs:
                - role: endpoints
                  follow_redirects: true
    processors:
      memory_limiter:
        check_interval: 3s
        limit_percentage: 75
        spike_limit_percentage: 25
      batch:
        timeout: 2s
        send_batch_size: 256
        send_batch_max_size: 131072 # max items (metric data points) per batch, not bytes
      k8sattributes:
        auth_type: "serviceAccount"
        passthrough: false
        # No node filter: this single Deployment replica scrapes targets on every node.
        # (A DaemonSet would instead set filter.node_from_env_var: KUBE_NODE_NAME.)
        extract:
          metadata:
            - "k8s.pod.name"
            - "k8s.namespace.name"
            - "k8s.node.name"
            - "k8s.container.name"
    exporters:
      prometheusremotewrite:
        endpoint: "http://mimir-distributor-headless.monitor.svc:8080/api/v1/push"
        tls:
          insecure: true
        headers:
          X-Scope-OrgID: "Mimir" # Mimir tenant ID
      prometheus:
        endpoint: "0.0.0.0:8889" # local /metrics endpoint, used for the test in section 4
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          # memory_limiter goes first, as its documentation recommends
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [prometheusremotewrite, prometheus]
      # telemetry:
      #   logs:
      #     level: "debug"
      #     encoding: "console"
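The last three relabel rules in every job above are target-sharding rules: `__tmp_hash` receives the target's `__address__`, `hashmod` replaces it with a hash of that address modulo `modulus`, and `keep` retains only targets whose result is `"0"`. Since any number modulo 1 is 0, this configuration keeps every target. If the scrape load ever had to be split across two collectors without duplicates, the last two rules could be changed as sketched below; this is an assumption about a possible setup, not something this post deploys, and each collector would need its own ConfigMap with a different `regex`:

```yaml
# Hypothetical 2-shard variant of the last two rules in each job's relabel_configs.
- source_labels: [__tmp_hash]   # holds the target's __address__ (set by the previous rule)
  modulus: 2                    # number of collectors sharing the targets
  target_label: __tmp_hash
  action: hashmod               # hash of the address modulo 2 -> "0" or "1"
- source_labels: [__tmp_hash]
  regex: "0"                    # this collector keeps shard 0; the second one would use "1"
  action: keep
```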
Deploying otel as a Deployment
kubectl apply -f otel.yaml -n monitor
apiVersion: apps/v1
#kind: DaemonSet   # a DaemonSet would scrape every target once per node (duplicates)
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitor
spec:
  replicas: 1   # the default; more replicas would also scrape the same targets twice
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      priorityClassName: system-node-critical
      #hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-collector-config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            # Mounts the single key of the ConfigMap as the collector's config file.
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            # The hostPath mounts below are only needed for log collection;
            # the metrics pipeline does not read them.
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    # resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces"]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces", "events", "namespaces/status", "nodes/spec", "pods/status", "replicationcontrollers", "replicationcontrollers/status", "resourcequotas", "nodes/metrics", "nodes/stats"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "daemonsets", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitor
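The manifests above do not create a Service, which is why the test in section 4 goes through the Pod IP. If a stable address is preferred, a ClusterIP Service in front of the collector's prometheus exporter port (8889) would do. This is a sketch: the name `otel-collector-metrics` and the port name are hypothetical, and the selector reuses the `app: otel-collector` label from the Deployment:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector-metrics   # hypothetical name
  namespace: monitor
spec:
  selector:
    app: otel-collector          # matches the Deployment's Pod template label
  ports:
    - name: prom-exporter        # hypothetical port name
      port: 8889                 # Service port
      targetPort: 8889           # the prometheus exporter endpoint 0.0.0.0:8889
```

Once applied, the same check can be run from any Pod in the cluster with curl http://otel-collector-metrics.monitor.svc:8889/metrics.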
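4. Testing otel
To check that otel is scraping correctly, query the prometheus exporter port (8889) on the otel Pod
(the Pod IP is shown by kubectl get pod -n monitor -l app=otel-collector -o wide):
curl <otel Pod IP>:8889/metrics
e.g. curl 10.10.10.10:8889/metrics
Checking the collected metrics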
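To confirm that the metrics also reached Mimir (the "-> Grafana" end of the architecture in section 1), Grafana needs a Prometheus-type data source that points at Mimir's query path and sends the same tenant header as the exporter. A provisioning-file sketch, assuming a query-frontend Service named `mimir-query-frontend` on port 8080 (the Service name is a guess modeled on the distributor's; Mimir serves its Prometheus-compatible API under `/prometheus` by default):

```yaml
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus
    access: proxy
    url: http://mimir-query-frontend.monitor.svc:8080/prometheus   # assumed Service name
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: Mimir   # must match X-Scope-OrgID in the prometheusremotewrite exporter
```

A simple query such as `up` in Grafana Explore should then return series for the scraped targets.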