Overview
OpenTelemetry will be written as otel below.
otel can collect the Logs, Metrics, and Traces of Pods on AWS EKS.
[ Existing Logs collection methods ]
WorkerNode -> Fluent-Bit -> Loki -> Grafana
WorkerNode -> Promtail -> Loki -> Grafana
[ otel Logs collection method ]
WorkerNode -> Otel -> Loki -> Grafana
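Internally, the otel collector expresses this path as a pipeline of receivers, processors, and exporters. A minimal sketch of the logs pipeline used in this post, stripped down to its skeleton (the full, production-shaped config follows in section 2):

receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]  # tail Pod log files on the worker node
exporters:
  loki:
    endpoint: http://loki.monitor.svc.cluster.local:3100/loki/api/v1/push
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [loki]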
1. otel Pros and Cons
[ Pros ]
- Previously, separate agents were required: Prometheus collected metrics and Promtail collected logs. With otel, a single agent can collect all of this telemetry.
[ Cons ]
- To set up collection (receivers) in otel, everything must be configured by hand, so the barrier to entry is high.
- When otel ships logs in real time, Loki drops or rejects them according to its ingestion rate limits.
- If otel buffers logs up to a certain byte size before forwarding them to Loki, its memory usage grows.
- If logs keep arriving beyond a certain rate, otel itself drops logs.
- otel and Loki therefore need to be tuned carefully together; the Loki side of this is sketched below.
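For reference, the drop/reject behavior on the Loki side is governed by its limits_config. A minimal sketch with illustrative values (assumptions, not tuned recommendations):

limits_config:
  ingestion_rate_mb: 8              # average ingestion rate per tenant (MB/s); pushes beyond this are rejected
  ingestion_burst_size_mb: 16       # burst allowance on top of the average rate
  per_stream_rate_limit: 3MB        # per-stream rate limit; a stream exceeding it has its logs rejected
  per_stream_rate_limit_burst: 15MB # per-stream burst allowance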
2. otel Installation
otel runs by loading its settings from a ConfigMap.
ConfigMap setup
kubectl apply -f otel_configmap.yaml
vi otel_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitor
data:
  otel-collector-config.yaml: |
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude logs from all containers named otel-collector
          - /var/log/pods/*/otel-collector/*.log
        start_at: end
        include_file_path: true
        include_file_name: false
        retry_on_failure:
          enabled: true
        operators:
          # Find out which format is used by kubernetes
          - type: router
            id: get-format
            routes:
              - output: parser-docker
                expr: 'body matches "^\\{"'
              - output: parser-crio
                expr: 'body matches "^[^ Z]+ "'
              - output: parser-containerd
                expr: 'body matches "^[^ Z]+Z"'
          # Parse CRI-O format
          - type: regex_parser
            id: parser-crio
            regex:
              '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*)
              ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout_type: gotime
              layout: '2006-01-02T15:04:05.999999999Z07:00'
          # Parse CRI-Containerd format
          - type: regex_parser
            id: parser-containerd
            regex:
              '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*)
              ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Parse Docker format
          - type: json_parser
            id: parser-docker
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          - type: move
            from: attributes.log
            to: body
          # Extract metadata from file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
            cache:
              size: 128 # default maximum amount of Pods per Node is 110
          # Rename attributes
          - type: move
            from: attributes.stream
            to: attributes["log.iostream"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.restart_count
            to: resource["k8s.container.restart_count"]
          - type: move
            from: attributes.uid
            to: resource["k8s.pod.uid"]
    processors:
      batch:
        timeout: 2s
        send_batch_size: 256
        send_batch_max_size: 131072 # upper bound per batch, counted in log records (not bytes)
      k8sattributes:
        auth_type: serviceAccount
        passthrough: true
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.statefulset.name
            - k8s.daemonset.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time
            - k8s.cluster.uid
        pod_association:
          - sources:
              - from: resource_attribute
                name: k8s.pod.name
              - from: resource_attribute
                name: k8s.namespace.name
      resource:
        attributes:
          - action: insert
            key: loki.format
            value: raw
          - action: insert
            key: service.name
            from_attribute: k8s.deployment.name
          - action: insert
            key: service.name
            from_attribute: k8s.daemonset.name
          - action: insert
            key: service.name
            from_attribute: k8s.statefulset.name
          - action: insert
            key: loki.resource.labels
            value: k8s.container.name, k8s.namespace.name, k8s.pod.name, service.name
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
    exporters:
      loki:
        endpoint: http://loki.monitor.svc.cluster.local:3100/loki/api/v1/push
    service:
      pipelines:
        logs:
          receivers: [filelog]
          processors: [memory_limiter, k8sattributes, resource, batch] # memory_limiter should run first in the pipeline
          exporters: [loki]
      # telemetry:
      #   logs:
      #     level: "debug" # set to debug when troubleshooting the otel collector pod itself
For otel to collect metrics and log data from every worker node, it must be deployed as a DaemonSet.
otel DaemonSet deployment
kubectl apply -f otel_daemonset.yaml -n monitor
vi otel_daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      priorityClassName: system-node-critical
      #hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-collector-config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces", "events", "namespaces/status", "nodes/spec", "pods/status", "replicationcontrollers", "replicationcontrollers/status", "resourcequotas"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "daemonsets", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitor
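Once applied, you can check that one collector Pod is scheduled per worker node and watch the collector logs for exporter errors; for example:

kubectl get daemonset otel-collector -n monitor
kubectl get pods -n monitor -l app=otel-collector -o wide
kubectl logs -n monitor -l app=otel-collector --tail=50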
3. otel Option Explanations
receivers: # collection configuration
  filelog:
    include: # log file paths to collect
      - /var/log/pods/*/*/*.log
    exclude: # log file paths to skip (ignore otel's own logs)
      # Exclude logs from all containers named otel-collector
      - /var/log/pods/*/otel-collector/*.log
    start_at: end # when otel starts, begin reading from the end of each file
    include_file_path: true
    include_file_name: false
    retry_on_failure: # retry when sending logs fails
      enabled: true
processors:
  batch:
    timeout: 2s # buffer data for 2 seconds, then send
    send_batch_size: 256 # send as soon as 256 log records have accumulated
    send_batch_max_size: 131072 # hard upper bound per batch, counted in log records (not bytes)
  memory_limiter:
    check_interval: 5s # how often otel checks its own memory usage
    limit_percentage: 80 # hard limit: above 80% of available memory, data is refused (dropped) and GC is forced
    spike_limit_percentage: 25 # headroom below the hard limit; data is refused above the soft limit (80% - 25% = 55%)
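A rough worked example, assuming the 1Gi container memory limit from the DaemonSet above:

hard limit = 1Gi x 80%         = ~819 MiB # above this, data is refused and garbage collection is forced
soft limit = 1Gi x (80% - 25%) = ~563 MiB # above this, incoming data starts being refused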
4. otel Testing
How to verify that the exporter from otel to Loki is working.
List the labels stored in Loki:
curl -s "http://<pod IP>:3100/loki/api/v1/labels" | jq
List the values of the exporter label:
curl -s "http://<pod IP>:3100/loki/api/v1/label/exporter/values" | jq
Query log data by label key/value:
curl -G -s "http://<pod IP>:3100/loki/api/v1/query" --data-urlencode 'query={exporter="OTLP"}' | jq
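If the Loki Pod IP is not reachable from where you are testing, a port-forward to the loki Service (the same Service used in the exporter endpoint above) works as well:

kubectl port-forward -n monitor svc/loki 3100:3100
curl -G -s "http://localhost:3100/loki/api/v1/query" --data-urlencode 'query={exporter="OTLP"}' | jq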