
[Grafana Tempo] Collecting trace logs with Grafana Tempo

김붕어87 2025. 6. 20. 14:17
What is a Trace?
A trace log records the path a single request takes as it passes through
multiple services and components in a distributed (MSA) system.

Components
Trace ID : a unique ID for tracking a single request end to end
Span ID : an ID identifying each step (service, function, etc.)
Parent Span ID : expresses the call relationship (who called whom)
Start/end time, error status, tags/attributes, etc.

Trace ID: 1234abcd
[app-a pod] ---> [app-b pod] ---> [app-c pod]
  Span 1    --->   Span 2    --->   Span 3

With traces you can visually check
where latency occurred,
which service raised an error,
and how a user request flowed through the system.
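The relationship between Trace ID, Span ID, and Parent Span ID can be sketched with a toy model (plain Python for illustration only, not the OpenTelemetry SDK):

```python
import secrets

class Span:
    """Toy span: one unit of work inside a trace."""
    def __init__(self, name, parent=None):
        self.name = name
        # Every span in the same request shares one trace_id.
        self.trace_id = parent.trace_id if parent else secrets.token_hex(16)
        self.span_id = secrets.token_hex(8)                   # unique per step
        self.parent_id = parent.span_id if parent else None   # who called us

# app-a receives the request, then calls app-b, which calls app-c
span_a = Span("app-a")
span_b = Span("app-b", parent=span_a)
span_c = Span("app-c", parent=span_b)

# All three spans belong to the same trace ...
assert span_a.trace_id == span_b.trace_id == span_c.trace_id
# ... and parent_id encodes the call chain a -> b -> c
assert span_b.parent_id == span_a.span_id
assert span_c.parent_id == span_b.span_id
assert span_a.parent_id is None   # root span
```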

 

 

  • Checking traces in Grafana

 

  • Web browser -> F12 -> Network tab: check latency, etc.

 

 

 

Grafana Tempo is a backend that stores trace data and lets you query it.

Distributed tracing backend : stores trace logs
High performance, low cost : uses almost no indexing, so it is efficient
Grafana integration : visualizes the trace flow

Traces are collected with OpenTelemetry (Otel) and stored in Grafana Tempo.
pod -> Otel SDK -> Otel Collector -> Grafana Tempo (S3) -> Grafana

 

 

 

 

Steps

1. Create the S3 IAM Role

2. Create the S3 bucket

3. Install Grafana Tempo

4. Install the Otel Collector

5. Build the service Docker images

6. Deploy the service pods

7. Test

 

 

 

 

1. Create the S3 IAM Role

Create dev-monitor-tempo-policy

  • Replace the Resource ARNs with your newly created S3 bucket name.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::dev-monitor-tempo-bucket",
                "arn:aws:s3:::dev-monitor-tempo-bucket/*"
            ]
        }
    ]
}
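If you script the policy instead of pasting it, a small helper (hypothetical, no AWS SDK required) keeps the bucket ARN and the object ARN in sync:

```python
import json

def tempo_s3_policy(bucket: str) -> str:
    """Render the Tempo S3 IAM policy document for a given bucket name."""
    arn = f"arn:aws:s3:::{bucket}"
    doc = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject",
                       "s3:DeleteObject", "s3:ListBucket"],
            # ListBucket needs the bucket ARN; object actions need ARN/*
            "Resource": [arn, f"{arn}/*"],
        }],
    }
    return json.dumps(doc, indent=4)

policy = tempo_s3_policy("dev-monitor-tempo-bucket")
assert '"arn:aws:s3:::dev-monitor-tempo-bucket/*"' in policy
```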

 

 

Create dev-monitor-tempo-role

  • Attach dev-monitor-tempo-policy
  • Add the trust policy
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "",
			"Effect": "Allow",
			"Principal": {
				"Federated": "arn:aws:iam::xxx:oidc-provider/oidc.eks.ap-northeast-2.amazonaws.com/id/xxx"
			},
			"Action": "sts:AssumeRoleWithWebIdentity",
			"Condition": {
				"StringLike": {
					"oidc.eks.ap-northeast-2.amazonaws.com/id/xxx:aud": "sts.amazonaws.com",
					"oidc.eks.ap-northeast-2.amazonaws.com/id/xxx:sub": "system:serviceaccount:monitor:tempo"
				}
			}
		}
	]
}
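The only cluster-specific parts of the trust policy are the OIDC provider ARN and the condition block; the `sub` value must match the namespace and ServiceAccount name Tempo runs under (`monitor`/`tempo` here). A sketch of how the condition pair is formed:

```python
def irsa_conditions(oidc_issuer: str, namespace: str, sa: str) -> dict:
    """Build the StringLike condition block for an IRSA trust policy.

    oidc_issuer is the issuer host/path without the https:// prefix,
    e.g. "oidc.eks.ap-northeast-2.amazonaws.com/id/XXXX".
    """
    return {
        f"{oidc_issuer}:aud": "sts.amazonaws.com",
        # Must match the ServiceAccount the Tempo pod uses
        f"{oidc_issuer}:sub": f"system:serviceaccount:{namespace}:{sa}",
    }

cond = irsa_conditions("oidc.eks.ap-northeast-2.amazonaws.com/id/xxx",
                       "monitor", "tempo")
assert cond["oidc.eks.ap-northeast-2.amazonaws.com/id/xxx:sub"] \
    == "system:serviceaccount:monitor:tempo"
```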

 

 

 

 

2. Create the S3 bucket

Create the S3 storage that Grafana Tempo will use.

Create dev-monitor-tempo-bucket

 

 

 

3. Install Grafana Tempo

 

Download the Grafana Tempo Helm chart

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm search repo tempo
helm pull grafana/tempo --untar

 

 

Edit values.yaml

vi values.yaml

# Change the storage backend to S3
storage:
  trace:
    backend: s3
    s3:
      bucket: dev-monitor-tempo-bucket             # your S3 bucket name
      endpoint: s3.ap-northeast-2.amazonaws.com    # S3 endpoint for your region
    # backend: local   <- comment out the default local backend


# Configure the receivers
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: "0.0.0.0:4317"
        http:
          endpoint: "0.0.0.0:4318"

# IRSA (IAM Roles for Service Accounts) settings
serviceAccount:
  create: true
  name: tempo
  annotations:
    "eks.amazonaws.com/role-arn": "arn:aws:iam::xxx:role/dev-monitor-tempo-role"
  automountServiceAccountToken: true

 

 

Deploy Grafana Tempo

helm upgrade --install tempo ./

 

 

 

4. Install the Otel Collector

POD -> Otel -> Grafana Tempo

Otel Collector : receives the traces sent by each pod and exports them to Grafana Tempo.

 

Create otel_configmap_tempo.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config-tempo
  namespace: monitor
data:
  otel-collector-config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 10s
        send_batch_size: 256
        send_batch_max_size: 131072  # 128kb
      memory_limiter:
        check_interval: 5s
        limit_percentage: 80
        spike_limit_percentage: 25
    exporters:
      otlp:
        endpoint: "http://tempo.monitor.svc.cluster.local:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
#      telemetry:
#        logs:
#          level: "debug"
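The batch processor above flushes either when `timeout` elapses or when `send_batch_size` spans have accumulated. The size-triggered flush rule can be sketched with a toy buffer (illustrative only, not the collector's actual implementation; timeout handling is omitted):

```python
class ToyBatcher:
    """Flush the buffer whenever it reaches batch_size spans."""
    def __init__(self, batch_size=256):
        self.batch_size = batch_size
        self.buffer = []      # spans waiting to be exported
        self.flushed = []     # list of exported batches

    def add(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flushed.append(self.buffer)   # one exported batch
            self.buffer = []

b = ToyBatcher(batch_size=256)
for i in range(600):
    b.add(f"span-{i}")

assert len(b.flushed) == 2    # two full batches of 256 were exported
assert len(b.buffer) == 88    # 600 - 512 spans still waiting for the timeout
```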

 

 

Create service.yaml

apiVersion: v1
kind: Service
metadata:
  name: otel-collector-tempo
  labels:
    app: otel-collector-tempo
spec:
  selector:
    app: otel-collector-tempo
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
  type: ClusterIP

 

 

 

vi otel.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector-tempo
  namespace: monitor
spec:
  selector:
    matchLabels:
      app: otel-collector-tempo
  template:
    metadata:
      labels:
        app: otel-collector-tempo
    spec:
      serviceAccountName: otel-collector
      priorityClassName: system-node-critical
      tolerations:
      - effect: NoSchedule
        operator: Exists
      #hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otel-collector-config.yaml"]
          env:
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            limits:
              cpu: 1000m
              memory: 1Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: otel-collector-config-tempo
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector
rules:
  - apiGroups: [""]
#    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces"]
    resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "namespaces", "events", "namespaces/status", "nodes/spec", "pods/status", "replicationcontrollers", "replicationcontrollers/status", "resourcequotas", "nodes/metrics", "nodes/stats"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "daemonsets", "deployments", "statefulsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: monitor

 

 

Deploy the Otel Collector

kubectl apply -f otel_configmap_tempo.yaml 
kubectl apply -f service.yaml
kubectl apply -f otel.yaml

 

 

 

5. Build the service Docker images

Build the Docker images used to test tracing.

 

Create the Docker images

Create the app-a image

mkdir -p app-a app-b app-c
vi app-a/Dockerfile
FROM python:3.8-slim

WORKDIR /app
COPY . /app

RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/* && \
    pip install Flask requests
RUN pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-api opentelemetry-sdk
RUN opentelemetry-bootstrap -a install

CMD ["opentelemetry-instrument", "python", "app_a.py"]





vi app-a/app_a.py
from flask import Flask, jsonify
import requests
import os

app = Flask(__name__)

SERVICE_B_HOST = os.environ.get('SERVICE_B_HOST', 'service_b')
SERVICE_B_PORT = os.environ.get('SERVICE_B_PORT', '5001')

@app.route('/')
def call_app_b():
    url = f'http://{SERVICE_B_HOST}:{SERVICE_B_PORT}/'
    message = "I'm app-a"
    try:
        response = requests.get(url, timeout=3)
        response.raise_for_status()
        message += " " + response.text
    except Exception as e:
        print(f"Error while calling Service B: {e}")
        message += " Error calling Service B"
    return message
    
if __name__ == '__main__':
    PORT = os.environ.get('PORT', '5000')
    app.run(host='0.0.0.0', port=int(PORT))

 

 

Create the app-b image

vi app-b/Dockerfile
FROM python:3.8-slim
WORKDIR /app
COPY . /app
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/* && \
    pip install Flask requests
RUN pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask
RUN opentelemetry-bootstrap -a install
CMD ["opentelemetry-instrument", "python", "app_b.py"]




vi app-b/app_b.py
from flask import Flask, jsonify
import requests
import os
import time 
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

SERVICE_C_HOST = os.environ.get('SERVICE_C_HOST', 'service_c')
SERVICE_C_PORT = os.environ.get('SERVICE_C_PORT', '5002')

@app.route('/')
def call_app_c():
    time.sleep(2)
    url = f'http://{SERVICE_C_HOST}:{SERVICE_C_PORT}/'
    message = "I'm app-b"
    try:
        response = requests.get(url, timeout=3)
        response.raise_for_status()
        message += " " + response.text
    except Exception as e:
        print(f"Error while calling Service C: {e}")
        message += " Error calling Service C"
    return message
    
if __name__ == '__main__':
    PORT = os.environ.get('PORT', '5001')
    app.run(host='0.0.0.0', port=int(PORT))

 

 

Create the app-c image

vi app-c/Dockerfile
FROM python:3.8-slim
WORKDIR /app
COPY . /app
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/* && \
    pip install Flask requests
RUN pip install opentelemetry-distro opentelemetry-exporter-otlp opentelemetry-api opentelemetry-sdk opentelemetry-instrumentation-flask
RUN opentelemetry-bootstrap -a install

CMD ["opentelemetry-instrument", "python", "app_c.py"]





vi app-c/app_c.py
from flask import Flask, jsonify
import os
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/')
def home():
    message = "I'm app-c"
    return message
    
if __name__ == '__main__':
    PORT = os.environ.get('PORT', '5002')
    app.run(host='0.0.0.0', port=int(PORT))

 

 

 

Build the Docker images

# build the images
docker build -t app-a ./app-a
docker build -t app-b ./app-b
docker build -t app-c ./app-c

# log in to ECR
aws ecr get-login-password --region ap-northeast-2 | docker login --username AWS --password-stdin xxx.dkr.ecr.ap-northeast-2.amazonaws.com

# retag to match your ECR repository
docker tag app-a:latest xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-a
docker tag app-b:latest xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-b
docker tag app-c:latest xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-c

# push to ECR
docker push xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-a
docker push xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-b
docker push xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-c
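The retag step only renames the local image to the ECR naming scheme `<account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>`; a tiny helper (hypothetical, names are illustrative) shows the pattern used by the tag/push commands above:

```python
def ecr_image_uri(account: str, region: str, repo: str, tag: str) -> str:
    """Build the fully qualified ECR image reference for docker tag/push."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

# The three app images all land in the same "test" repository,
# distinguished only by tag.
uri = ecr_image_uri("xxx", "ap-northeast-2", "test", "app-a")
assert uri == "xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-a"
```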

 

 

6. Deploy the service pods

Deploy pods from the Docker images built above.

 

 

Create the deployment.yaml files

app-a.yaml

OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector-tempo.monitor.svc.cluster.local:4318/v1/traces opentelemetry-instrument python app_a.py

  • Each pod must send its traces to the Otel Collector.
  • Set the Otel Collector service address as the endpoint.
## app-a
apiVersion: v1
kind: Service
metadata:
  name: app-a
spec:
  selector:
    app: app-a
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-a
  template:
    metadata:
      labels:
        app: app-a
    spec:
      containers:
      - name: app-a
        image: xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-a
        command: ["/bin/sh"]
        args:
        - "-c"
        - |
          OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector-tempo.monitor.svc.cluster.local:4318/v1/traces opentelemetry-instrument python app_a.py
        ports:
        - containerPort: 5000
          name: http
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: SERVICE_B_HOST
          value: "app-b"
        - name: SERVICE_B_PORT
          value: "80"
        - name: PORT
          value: "5000"
        - name: OTEL_SERVICE_NAME
          value: "app-a"
        - name: OTEL_TRACES_EXPORTER
          value: "console,otlp"
        - name: OTEL_EXPORTER_OTLP_TRACES_HEADERS
          value: "api-key=key,other-config-value=value"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "1"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "http/protobuf"
        - name: OTEL_METRICS_EXPORTER
          value: "none"
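Note that the startup command sets the signal-specific variable `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT`, which the SDK uses verbatim (so it must include the `/v1/traces` path). If you set the generic `OTEL_EXPORTER_OTLP_ENDPOINT` instead, the http/protobuf exporter appends the signal path itself. A sketch of that resolution rule:

```python
def resolve_traces_endpoint(env: dict) -> str:
    """Mimic OTLP http/protobuf endpoint resolution for traces (sketch)."""
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific                        # used as-is, path included
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
    return base.rstrip("/") + "/v1/traces"     # signal path appended

collector = "http://otel-collector-tempo.monitor.svc.cluster.local:4318"

# Signal-specific variable: taken verbatim
assert resolve_traces_endpoint(
    {"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": collector + "/v1/traces"}
) == collector + "/v1/traces"

# Generic variable: /v1/traces is added automatically
assert resolve_traces_endpoint(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": collector}
) == collector + "/v1/traces"
```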

 

 

app-b.yaml

apiVersion: v1
kind: Service
metadata:
  name: app-b
spec:
  selector:
    app: app-b
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5001
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-b
  template:
    metadata:
      labels:
        app: app-b
    spec:
      containers:
      - name: app-b
        image: xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-b
        imagePullPolicy: Always
        command: ["/bin/sh"]
        args:
        - "-c"
        - |
          OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector-tempo.monitor.svc.cluster.local:4318/v1/traces opentelemetry-instrument python app_b.py
        ports:
        - containerPort: 5001
          name: http
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: PORT
          value: "5001"
        - name: SERVICE_C_HOST
          value: "app-c"
        - name: SERVICE_C_PORT
          value: "80"
        - name: OTEL_SERVICE_NAME
          value: "app-b"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "100"
        - name: OTEL_TRACES_EXPORTER
          value: "console,otlp"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "http/protobuf"
        - name: OTEL_METRICS_EXPORTER
          value: "none"

 

 

 

app-c.yaml

apiVersion: v1
kind: Service
metadata:
  name: app-c
spec:
  selector:
    app: app-c
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5002
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-c
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-c
  template:
    metadata:
      labels:
        app: app-c
    spec:
      containers:
      - name: app-c
        image: xxx.dkr.ecr.ap-northeast-2.amazonaws.com/test:app-c
        command: ["/bin/sh"]
        args:
        - "-c"
        - |
          OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector-tempo.monitor.svc.cluster.local:4318/v1/traces opentelemetry-instrument python app_c.py
        ports:
        - containerPort: 5002
          name: http
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: PORT
          value: "5002"
        - name: OTEL_SERVICE_NAME
          value: "app-c"
        - name: OTEL_EXPORTER_OTLP_TRACES_HEADERS
          value: "api-key=key,other-config-value=value"
        - name: OTEL_TRACES_SAMPLER_ARG
          value: "100"
        - name: OTEL_EXPORTER_OTLP_PROTOCOL
          value: "http/protobuf"
        - name: OTEL_METRICS_EXPORTER
          value: "none"
        - name: OTEL_TRACES_EXPORTER
          value: "console,otlp"

 

 

Deploy the pods

kubectl apply -f app-a.yaml
kubectl apply -f app-b.yaml
kubectl apply -f app-c.yaml

 

 

 

 

 

 

7. Test

curl test from an nginx pod

# create the nginx yaml
vi nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
          name: http
          protocol: TCP

# deploy nginx
kubectl apply -f nginx.yaml

 

 

curl test

# get a shell in the nginx pod (exec via the Deployment, since the pod name has a hash suffix)
kubectl exec -it deploy/nginx -- /bin/bash

# test with curl
curl app-a

# curl output
I'm app-a I'm app-b I'm app-c
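The response is the three services concatenating their messages along the call chain; the same flow, minus HTTP, is just nested function calls (toy sketch):

```python
def app_c() -> str:
    return "I'm app-c"

def app_b() -> str:
    # The real app-b sleeps 2 seconds here, which shows up as span latency
    return "I'm app-b " + app_c()

def app_a() -> str:
    return "I'm app-a " + app_b()

assert app_a() == "I'm app-a I'm app-b I'm app-c"
```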

 

 

 

Pod logs (app-a)

 * Serving Flask app 'app_a'
 * Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://10.110.8.246:5000
Press CTRL+C to quit
10.110.7.32 - - [20/Jun/2025 02:17:45] "GET / HTTP/1.1" 200 -
{
    "name": "GET",
    "context": {
        "trace_id": "0xd8c248015d0821ca3b6787bc7f3d2a88",
        "span_id": "0x1e6d2db0429b0baf",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": "0x2e578556b70947c3",
    "start_time": "2025-06-20T02:17:43.135980Z",
    "end_time": "2025-06-20T02:17:45.157717Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "http.method": "GET",
        "http.url": "http://app-b:80/",
        "http.status_code": 200
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.33.1",
            "service.name": "app-a",
            "telemetry.auto.version": "0.54b1"
        },
        "schema_url": ""
    }
}
{
    "name": "GET /",
    "context": {
        "trace_id": "0xd8c248015d0821ca3b6787bc7f3d2a88",
        "span_id": "0x2e578556b70947c3",
        "trace_state": "[]"
    },
    "kind": "SpanKind.SERVER",
    "parent_id": null,
    "start_time": "2025-06-20T02:17:43.133019Z",
    "end_time": "2025-06-20T02:17:45.158166Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "http.method": "GET",
        "http.server_name": "0.0.0.0",
        "http.scheme": "http",
        "net.host.name": "app-a",
        "http.host": "app-a",
        "net.host.port": 5000,
        "http.target": "/",
        "net.peer.ip": "10.110.7.32",
        "net.peer.port": 51468,
        "http.user_agent": "curl/7.88.1",
        "http.flavor": "1.1",
        "http.route": "/",
        "http.status_code": 200
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.33.1",
            "service.name": "app-a",
            "telemetry.auto.version": "0.54b1"
        },
        "schema_url": ""
    }
}

 

 

 

 

8. Check in Grafana

Register the Tempo datasource in Grafana

 

Trace search

  • Explore -> select the Tempo datasource -> Search -> Run Query

 

 

TraceQL

 

 

Trace dashboard - Traces

 

 

Trace dashboard - Table

 

 

 

 
