
[OpenTelemetry] Log Collection

김붕어87 2025. 5. 26. 15:42

Overview

In this post, OpenTelemetry is abbreviated as otel.
otel can collect the Logs, Metrics, and Traces of Pods running on AWS EKS.

[ Existing log collection paths ]
WorkerNode -> Fluent-Bit -> Loki -> Grafana
WorkerNode -> Promtail -> Loki -> Grafana

[ otel log collection path ]
WorkerNode -> otel -> Loki -> Grafana

1. otel Pros and Cons

  • Previously, multiple agents were required: Prometheus collected metrics while Promtail collected logs.
    With otel, all of that telemetry can be consolidated into a single agent.
  • Setting up collection (receivers) in otel requires customizing almost everything, so the barrier to entry is high.
  • When otel streams logs in real time, Loki drops or rejects logs according to its ingestion rate limits.
  • When otel buffers a certain number of bytes before forwarding to Loki, its memory usage grows.
  • When logs keep arriving above a certain rate, otel drops them.
  • otel and Loki must both be tuned carefully to work together; see the Loki sketch below.
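
Loki's side of that tuning lives in its limits_config block. A minimal sketch, assuming a recent Loki; the keys are real Loki options, but the values are illustrative, not recommendations:

limits_config:
  ingestion_rate_mb: 8               # per-tenant average ingestion rate (MB/s)
  ingestion_burst_size_mb: 16        # per-tenant burst allowance (MB)
  per_stream_rate_limit: 3MB         # per-stream average rate
  per_stream_rate_limit_burst: 15MB  # per-stream burst

Raising these reduces the rate-limit (429) rejections described above, at the cost of more memory and storage pressure on Loki.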

 

 

2. Installing otel

otel loads its configuration values from a ConfigMap at startup.

ConfigMap setup

kubectl apply -f otel_configmap.yaml

vi otel_configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: monitor
data:
  otel-collector-config.yaml: |
    receivers:
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        exclude:
          # Exclude logs from all containers named otel-collector
          - /var/log/pods/*/otel-collector/*.log
        start_at: end
        include_file_path: true
        include_file_name: false
        retry_on_failure:
          enabled: true
        operators:
          # Find out which format is used by kubernetes
          - type: router
            id: get-format
            routes:
              - output: parser-docker
                expr: 'body matches "^\\{"'
              - output: parser-crio
                expr: 'body matches "^[^ Z]+ "'
              - output: parser-containerd
                expr: 'body matches "^[^ Z]+Z"'
          # Parse CRI-O format
          - type: regex_parser
            id: parser-crio
            regex:
              '^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*)
              ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout_type: gotime
              layout: '2006-01-02T15:04:05.999999999Z07:00'
          # Parse CRI-Containerd format
          - type: regex_parser
            id: parser-containerd
            regex:
              '^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*)
              ?(?P<log>.*)$'
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          # Parse Docker format
          - type: json_parser
            id: parser-docker
            output: extract_metadata_from_filepath
            timestamp:
              parse_from: attributes.time
              layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          - type: move
            from: attributes.log
            to: body
          # Extract metadata from file path
          - type: regex_parser
            id: extract_metadata_from_filepath
            regex: '^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]{36})\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$'
            parse_from: attributes["log.file.path"]
            cache:
              size: 128 # the default maximum number of Pods per node is 110
          # Rename attributes
          - type: move
            from: attributes.stream
            to: attributes["log.iostream"]
          - type: move
            from: attributes.container_name
            to: resource["k8s.container.name"]
          - type: move
            from: attributes.namespace
            to: resource["k8s.namespace.name"]
          - type: move
            from: attributes.pod_name
            to: resource["k8s.pod.name"]
          - type: move
            from: attributes.restart_count
            to: resource["k8s.container.restart_count"]
          - type: move
            from: attributes.uid
            to: resource["k8s.pod.uid"]
    processors:
      batch:
        timeout: 2s
        send_batch_size: 256
        send_batch_max_size: 131072  # upper limit per batch, in log records (not bytes); oversized batches are split
      k8sattributes:
        auth_type: serviceAccount
        passthrough: true
        extract:
          metadata:
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.deployment.name
            - k8s.statefulset.name
            - k8s.daemonset.name
            - k8s.namespace.name
            - k8s.node.name
            - k8s.pod.start_time
            - k8s.cluster.uid
        pod_association:
          - sources:
            - from: resource_attribute
              name: k8s.pod.name
            - from: resource_attribute
              name: k8s.namespace.name
      resource:
        attributes:
          - action: insert
            key: loki.format
            value: raw
          - action: insert
            key: service.name
            from_attribute: k8s.deployment.name
          - action: insert
            key: service.name
            from_attribute: k8s.daemonset.name
          - action: insert
            key: service.name
            from_attribute: k8s.statefulset.name
          - action: insert
            key: loki.resource.labels
            value: k8s.container.name, k8s.namespace.name, k8s.pod.name, service.name
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
    exporters:
      loki:
        endpoint: http://loki.monitor.svc.cluster.local:3100/loki/api/v1/push
    service:
      pipelines:
        logs:
          receivers: [filelog]
          processors: [memory_limiter, k8sattributes, resource, batch]  # memory_limiter first, so it can protect the rest of the pipeline
          exporters: [loki]
#      telemetry:
#        logs:
#          level: "debug"  # enable debug-level logging when troubleshooting the otel pod
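
Once applied, a quick sanity check that the ConfigMap exists and holds the config (plain kubectl, using the names from the manifest above):

kubectl get configmap otel-collector-config -n monitor
kubectl describe configmap otel-collector-config -n monitor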

For otel to collect the metrics and logs of the worker nodes, it must be deployed as a DaemonSet so that one collector runs on every node.

Deploying the otel DaemonSet

kubectl apply -f otel_daemonset.yaml -n monitor 

vi otel_daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      priorityClassName: system-node-critical
      #hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-collector-config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces", "events", "namespaces/status", "nodes/spec", "pods/status", "replicationcontrollers", "replicationcontrollers/status", "resourcequotas"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "daemonsets", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitor
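
After deploying, it is worth confirming that a collector pod is running on every worker node and that it started cleanly (standard kubectl; resource names follow the manifest above):

kubectl get daemonset otel-collector -n monitor
kubectl logs -n monitor daemonset/otel-collector --tail=50

The DESIRED and READY counts should match the number of worker nodes, and the logs should show the filelog receiver starting to watch files under /var/log/pods.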

3. otel Options Explained

    receivers: # receiver configuration
      filelog:
        include: # log files to collect
          - /var/log/pods/*/*/*.log
        exclude: # log files to exclude (skip otel's own logs)
          # Exclude logs from all containers named otel-collector
          - /var/log/pods/*/otel-collector/*.log
        start_at: end  # when otel starts, read from the end of each file
        include_file_path: true
        include_file_name: false
        retry_on_failure: # retry when delivery downstream fails
          enabled: true
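
For reference, the include glob above resolves to per-container files, and a containerd (CRI) log line looks like the sample below; this is what the parser-containerd regex in the ConfigMap matches (the path, pod name, and UID here are made up for illustration):

/var/log/pods/default_myapp-6d4b9_1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d/myapp/0.log
2025-05-26T06:42:00.123456789Z stdout F {"level":"info","msg":"hello"}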

    processors:
      batch:
        timeout: 2s   # flush whatever has accumulated every 2 seconds
        send_batch_size: 256  # flush as soon as 256 log records have accumulated
        send_batch_max_size: 131072  # hard cap per batch, in log records (not bytes); oversized batches are split
      memory_limiter:
        check_interval: 5s  # how often otel checks its own memory usage
        limit_percentage: 80 # hard limit: above 80% of the memory limit, data is dropped and GC is forced
        spike_limit_percentage: 25 # spike allowance: the soft limit becomes 80% - 25% = 55%, above which new data is refused
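
Plugging in the 1Gi container limit from the DaemonSet above, and going by the memory_limiter semantics as documented: the hard limit is 1024MiB x 0.80 ≈ 819MiB and the soft limit is 1024MiB x (0.80 - 0.25) ≈ 563MiB. Between 563MiB and 819MiB the collector refuses new data (the filelog receiver's retry_on_failure then retries it), and above 819MiB it forces garbage collection.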

4. Testing otel

How to verify that otel is exporting to Loki correctly:

List the labels Loki has ingested:
  curl -s "http://<pod IP>:3100/loki/api/v1/labels" | jq

List the values of the exporter label:
  curl -s "http://<pod IP>:3100/loki/api/v1/label/exporter/values" | jq

Query log data by label key/value:
  curl -G -s "http://<pod IP>:3100/loki/api/v1/query" --data-urlencode 'query={exporter="OTLP"}' | jq
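
The resource labels attached via the loki.resource.labels hint in the ConfigMap should also be queryable. A sample query, assuming the Loki exporter converts the dots in label names to underscores (the namespace value is just the one used in this post):

  curl -G -s "http://<pod IP>:3100/loki/api/v1/query_range" --data-urlencode 'query={k8s_namespace_name="monitor"}' | jq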
