Installing LGTM + OTel Collector monitoring on NCP (Naver Cloud Platform) - dev

2026. 5. 28.

by. Daramu

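This post builds a monitoring setup with the LGTM stack.

It installs a simplified version of each component. If you want a production setup, see the following post.

Installing LGTM + OTel Collector monitoring on NCP (Naver Cloud Platform) - prod :: Daramu

LGTM stands for Loki, Grafana, Tempo, and Mimir, which handle logs, visualization (UI), traces, and metrics, respectively.

Suppose you have an app or service. Each app is configured to send its telemetry over OTLP (OpenTelemetry Protocol), and an OTC (OpenTelemetry Collector) is configured to receive that OTLP input.

The Collector forwards everything it receives to the matching backend (Loki, Tempo, or Mimir), which stores and processes the data, and Grafana visualizes the result.

The diagram below shows how the pieces fit together.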

App/Service
  └─ OTLP (gRPC:4317 / HTTP:4318)
       └─ OTel Collector (DaemonSet) ──► Loki   (logs)
                                    ──► Tempo  (traces)
                                    ──► Mimir  (metrics)
                                         └─ Grafana (visualization)

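As a figure, it looks like this.

If the backend is Spring Boot, the common approach is to attach a single JAR file (the OpenTelemetry Java agent) when starting the server, so the app speaks OTLP without code changes. The agent automatically detects the libraries inside Spring Boot (Web, JPA, Logback, etc.) and collects traces, metrics, and logs for export over OTLP.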
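As a rough sketch of what attaching the agent looks like outside Kubernetes, the helper below starts a Spring Boot JAR with the agent on the command line. The function name `run_with_otel_agent`, the file paths, the service name `demo-api`, and the Collector address are all assumptions for illustration. `OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `OTEL_EXPORTER_OTLP_PROTOCOL` are the standard OpenTelemetry SDK environment variables.

```shell
#!/usr/bin/env bash
# Start a Spring Boot app with the OpenTelemetry Java agent attached.
# Usage: run_with_otel_agent <agent.jar> <app.jar> [collector-endpoint]
run_with_otel_agent() {
  local agent_jar="$1"
  local app_jar="$2"
  # Assumed Collector address; replace it with wherever your Collector listens.
  local endpoint="${3:-http://localhost:4317}"

  # The variables are set only for this one java process.
  OTEL_SERVICE_NAME="${OTEL_SERVICE_NAME:-demo-api}" \
  OTEL_EXPORTER_OTLP_ENDPOINT="$endpoint" \
  OTEL_EXPORTER_OTLP_PROTOCOL="grpc" \
  java -javaagent:"$agent_jar" -jar "$app_jar"
}
```

For example, `run_with_otel_agent ./opentelemetry-javaagent.jar ./app.jar` would run the app against a Collector on localhost. On Kubernetes the Operator described next does the equivalent injection for you.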

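On Kubernetes, the OpenTelemetry Operator (OTel Operator) adds to this. It is a Kubernetes-specific automation tool that deploys and manages OTel Collectors and automatically injects the OTel SDK into Pods without touching application code. This post uses it.

As the figure shows, six components need to be installed as a baseline.

They are Mimir, Loki, Tempo, Grafana, the OTel Operator (automatic Pod instrumentation), and the OTel Collector (node-level collection).

Adding cert-manager for internal HTTPS communication brings the total to seven. In this post, six of them are installed from Helm charts, and the OTel Collector is created through the Operator as an OpenTelemetryCollector resource.

The OTel Operator requires an internal HTTPS certificate for its webhook, so cert-manager is hard to avoid.

For more on cert-manager, see the earlier post.

What is Cert Manager? :: Daramu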


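Grafana's Helm repository also offers an umbrella chart, grafana/lgtm-distributed, that brings up all LGTM components with a single command. That approach makes it harder to configure each component differently, so this post installs and configures each one separately.

Now for the installation.

# Register the Helm repositories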
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add jetstack https://charts.jetstack.io
helm repo update

# Create the namespace
kubectl create namespace monitoring

The OTel Operator's webhook requires a TLS certificate, so install cert-manager as well.

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set crds.enabled=true

Next, create a Secret for access to Object Storage.

The data for each backend will be stored in Object Storage. On AWS, this step would be the equivalent of configuring an S3 connection.

This post uses NCP (Naver Cloud Platform).

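# Key names become environment variable names as-is (case-sensitive), so they must
# match the ${ACCESS_KEY_ID} and ${SECRET_ACCESS_KEY} placeholders in the values files.
kubectl create secret generic ncp-objstore-secret \
  --namespace monitoring \
  --from-literal=ACCESS_KEY_ID=<NCP_ACCESS_KEY> \
  --from-literal=SECRET_ACCESS_KEY=<NCP_SECRET_KEY>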
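With `envFrom`, Kubernetes turns every key in a Secret into an environment variable with exactly that name. The key names therefore have to match the `${ACCESS_KEY_ID}` and `${SECRET_ACCESS_KEY}` placeholders used in the values files later in this post, including case. As a minimal sketch, a helper like the one below can confirm which keys a Secret actually holds. The function name `check_secret_keys` is made up for this example.

```shell
#!/usr/bin/env bash
# Report whether each expected key exists in a Secret.
# Usage: check_secret_keys <secret> <namespace> <key>...
# Returns 0 when every key is present, 1 otherwise.
check_secret_keys() {
  local secret="$1"
  local namespace="$2"
  shift 2
  local key value missing=0
  for key in "$@"; do
    # jsonpath prints the base64 value, or nothing when the key does not exist.
    value="$(kubectl get secret "$secret" -n "$namespace" -o "jsonpath={.data.$key}")"
    if [ -z "$value" ]; then
      echo "missing key: $key"
      missing=1
    else
      echo "found key: $key"
    fi
  done
  return "$missing"
}
```

Running `check_secret_keys ncp-objstore-secret monitoring ACCESS_KEY_ID SECRET_ACCESS_KEY` before installing the charts is a cheap way to catch a name mismatch early.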

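Now install Mimir, which handles metrics.

The access keys come from the Secret created above, but bucket_name must be set to the name of your actual Object Storage bucket. The endpoint depends on the platform: "kr.object.ncloudstorage.com" for the commercial platform (Korea region) and "kr.object.gov-ncloudstorage.com" for the public-sector platform.

Also, NCP NKS uses coredns as the in-cluster DNS.
Running "kubectl get service -n kube-system" shows a Service named coredns.

Mimir has to query DNS when it first starts, and by default it points at the legacy kube-dns Service, so the Mimir gateway may not work. In the global section, set dnsService, dnsNamespace, and clusterDomain. The example below is for NCP NKS.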

mimir:
  structuredConfig:
    common:
      storage:
        backend: s3
        s3:
          endpoint: kr.object.ncloudstorage.com
          region: kr-standard
          access_key_id: "${ACCESS_KEY_ID}"
          secret_access_key: "${SECRET_ACCESS_KEY}"
          insecure: false
    blocks_storage:
      s3:
        bucket_name: mimir-blocks-monitoring
    alertmanager_storage:
      s3:
        bucket_name: mimir-alertmanager-monitoring
    ruler_storage:
      s3:
        bucket_name: mimir-ruler-monitoring

global:
  extraEnvFrom:
    - secretRef:
        name: ncp-objstore-secret
  dnsService: "coredns"
  dnsNamespace: "kube-system"
  clusterDomain: "cluster.local"

minio:
  enabled: false

ingester:
  replicas: 1
  zoneAwareReplication:
    enabled: false
  resources:
    requests:
      cpu: "100m"
      memory: "512Mi"
    limits:
      cpu: "500m"
      memory: "1Gi"

store_gateway:
  replicas: 1
  zoneAwareReplication:
    enabled: false
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"

compactor:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"

distributor:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"

querier:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"

query_frontend:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"

ruler:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"

alertmanager:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"

gateway:
  enabled: true
  nginx:
    config:
      enableIPv6: false
  resources:
    requests:
      cpu: "50m"
      memory: "64Mi"
    limits:
      cpu: "200m"
      memory: "256Mi"

helm install mimir grafana/mimir-distributed \
  --namespace monitoring \
  -f mimir-values.yaml

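Note that many Pods may crash right after installation. Mimir uses Kafka internally, and Pods that start before the Kafka Pod is up fail with Error because they cannot reach Kafka. Once Kafka shows Running 1/1, the other Pods restart and move to Running after a while.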
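Rather than refreshing the Pod list by hand while this settles, one option is to wait on each workload's rollout. The helper below is a sketch under that idea: `wait_for_rollouts` is a made-up name, and the 600s default timeout is an arbitrary choice, not a recommendation from the chart.

```shell
#!/usr/bin/env bash
# Wait until every Deployment and StatefulSet in a namespace has finished rolling out.
# Usage: wait_for_rollouts <namespace> [timeout]
wait_for_rollouts() {
  local namespace="$1"
  local timeout="${2:-600s}"
  local workload
  # `-o name` prints entries such as deployment.apps/<name>, one per line.
  for workload in $(kubectl get deployments,statefulsets -n "$namespace" -o name); do
    kubectl rollout status "$workload" -n "$namespace" --timeout="$timeout" || return 1
  done
}
```

`wait_for_rollouts monitoring` returns once everything is up, or stops at the first workload that does not become ready within the timeout.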

Next, install Loki, which handles logs.

As before, create the Object Storage bucket first, then install.

deploymentMode: SingleBinary

loki:
  auth_enabled: false
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks
      admin: loki-chunks
      ruler: loki-chunks
    s3:
      endpoint: https://kr.object.ncloudstorage.com
      region: kr-standard
      s3ForcePathStyle: true
      accessKeyId: ${ACCESS_KEY_ID}
      secretAccessKey: ${SECRET_ACCESS_KEY}
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  limits_config:
    allow_structured_metadata: true
    otlp_config:
      resource_attributes:
        attributes_config:
          - action: index_label
            attributes:
              - k8s.namespace.name
              - k8s.pod.name
              - service.name

singleBinary:
  replicas: 1
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "1Gi"
  persistence:
    storageClass: nks-block-storage

backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

minio:
  enabled: false
  
chunksCache:
  enabled: false

resultsCache:
  enabled: false

gateway:
  enabled: true
  replicas: 1
  nginxConfig:
    enableIPv6: false

global:
  dnsService: "coredns"
  dnsNamespace: "kube-system"
  clusterDomain: "cluster.local"

helm install loki grafana/loki \
  --namespace monitoring \
  -f loki-values.yaml

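Next, install Tempo, which handles traces.

As before, create the Object Storage bucket first, then install.

In some cases the key values from the Secret are not picked up, and the keys then have to be hardcoded into the values file. Before hardcoding, it is worth checking that the Secret's key names match the variable names exactly (they are case-sensitive) and that environment variable expansion is enabled for the config file.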

# tempo-values.yaml (monolithic)
extraEnvFrom:
  - secretRef:
      name: ncp-objstore-secret

tempo:
  resources:
    requests:
      cpu: "100m"
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "1Gi"

  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: kr.object.ncloudstorage.com
        region: kr-standard
        access_key: ${ACCESS_KEY_ID}
        secret_key: ${SECRET_ACCESS_KEY}
        forcepathstyle: true
      wal:
        path: /var/tempo/wal
      local:
        path: /var/tempo/blocks

  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

  metricsGenerator:
    enabled: true
    remoteWriteUrl: "http://mimir-gateway.monitoring.svc:80/api/v1/push"

persistence:
  enabled: true
  storageClassName: nks-block-storage
  size: 10Gi

helm install tempo grafana/tempo \
  --namespace monitoring \
  -f tempo-values.yaml
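If the `${ACCESS_KEY_ID}` placeholders end up in the Tempo config unexpanded, one thing to try before hardcoding is an extra values file that mounts the Secret on the Tempo container and passes `-config.expand-env=true`. That flag is what makes Tempo substitute `${VAR}` references in its config. The sketch below assumes the grafana/tempo chart reads `tempo.extraEnvFrom` and `tempo.extraArgs`, so confirm this with `helm show values grafana/tempo` for your chart version. The function names and the file name `tempo-env-values.yaml` are made up for the example.

```shell
#!/usr/bin/env bash
# Write an extra values file that (a) exposes the Secret as env vars on the Tempo
# container and (b) enables ${VAR} expansion in the Tempo config file.
# ASSUMPTION: the chart honors tempo.extraEnvFrom and tempo.extraArgs.
write_tempo_env_values() {
  local out_file="${1:-tempo-env-values.yaml}"
  cat > "$out_file" <<'EOF'
tempo:
  extraEnvFrom:
    - secretRef:
        name: ncp-objstore-secret
  extraArgs:
    config.expand-env: "true"
EOF
  echo "$out_file"
}

# Re-apply the release with both files; values in later -f files take precedence.
upgrade_tempo_with_env() {
  local extra
  extra="$(write_tempo_env_values)"
  helm upgrade --install tempo grafana/tempo \
    --namespace monitoring \
    -f tempo-values.yaml \
    -f "$extra"
}
```

After `upgrade_tempo_with_env`, `kubectl describe pod` on the Tempo Pod should list the Secret under "Environment Variables from" if the chart picked the setting up.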

Next, install Grafana, which handles visualization.

Because this is a VPN environment, the annotations declare the load balancer as private. If you need a public IP, remove that annotation.

adminPassword: "your-secure-password"

persistence:
  enabled: true
  storageClassName: nks-block-storage
  size: 10Gi

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

service:
  type: NodePort

ingress:
  enabled: true
  annotations:
    alb.ingress.kubernetes.io/network-type: "private"
  ingressClassName: alb
  hosts:
    - grafana.yourdomain.com

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      # uid is set explicitly on every datasource because the others reference it via datasourceUid.
      - name: Mimir
        uid: Mimir
        type: prometheus
        url: http://mimir-gateway.monitoring.svc:80/prometheus
        isDefault: true
        jsonData:
          httpMethod: POST
          httpHeaderName1: "X-Scope-OrgID"
          prometheusType: Mimir
        secureJsonData:
          httpHeaderValue1: "anonymous"

      - name: Loki
        uid: Loki
        type: loki
        url: http://loki-gateway.monitoring.svc:80
        jsonData:
          derivedFields:
            - datasourceUid: Tempo
              matcherType: label
              matcherRegex: trace_id
              name: TraceID
              # "$$" escapes "$" so provisioning does not expand it as an environment variable.
              url: "$${__value.raw}"
              urlDisplayLabel: "View in Tempo"

      - name: Tempo
        type: tempo
        uid: Tempo
        url: http://tempo.monitoring.svc:3200
        jsonData:
          tracesToLogsV2:
            datasourceUid: Loki
            spanStartTimeShift: "-1m"
            spanEndTimeShift: "1m"
            filterByTraceID: true
            filterBySpanID: false
            customQuery: false
          tracesToMetrics:
            datasourceUid: Mimir
          serviceMap:
            datasourceUid: Mimir
          nodeGraph:
            enabled: true

helm install grafana grafana/grafana \
  --namespace monitoring \
  -f grafana-values.yaml

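Once the ingress (LB) has been created, it is worth updating the hosts entry and trying to connect.

Next, install the OTel Operator, which automatically injects the SDK into application Pods to collect telemetry from them.

With the "admissionWebhooks.certManager.enabled=true" option, the OTel Operator automatically uses cert-manager to create an Issuer and a Certificate, and the resulting Secret is generated and mounted for the webhook.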

helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace monitoring \
  --set "manager.collectorImage.repository=otel/opentelemetry-collector-contrib" \
  --set admissionWebhooks.certManager.enabled=true
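Installing the Operator does not instrument anything by itself: injection is driven by an Instrumentation resource plus an annotation on the workload. The sketch below shows that pairing for a Java app. The resource name `java-instrumentation`, the function names, and the Collector Service name `otel-daemonset-collector` are assumptions. The last one relies on the Operator naming the Service after the Collector resource created later in this post, so check it with `kubectl get svc -n monitoring`.

```shell
#!/usr/bin/env bash
# Create an Instrumentation resource in the application's namespace. It tells the
# Operator where injected SDKs should send OTLP data.
# Usage: apply_java_instrumentation <app-namespace>
apply_java_instrumentation() {
  local namespace="$1"
  cat <<EOF | kubectl apply -f -
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: ${namespace}
spec:
  exporter:
    # Assumed Service name for the Collector; 4318 is the OTLP/HTTP port.
    endpoint: http://otel-daemonset-collector.monitoring.svc:4318
  propagators:
    - tracecontext
    - baggage
EOF
}

# Opt a Deployment in by annotating its Pod template. Changing the template
# triggers a rollout, and the webhook injects the Java agent into the new Pods.
# Usage: enable_java_injection <deployment> <namespace>
enable_java_injection() {
  local deployment="$1"
  local namespace="$2"
  kubectl patch deployment "$deployment" -n "$namespace" --type merge -p \
    '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-java":"true"}}}}}'
}
```

With the annotation value "true", the Operator looks for an Instrumentation resource in the Pod's own namespace, which is why the first function takes the application namespace rather than `monitoring`.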

Finally, install the OTel Collector, which collects logs on each node and handles central aggregation and processing.

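First create the ServiceAccount and the cluster-wide read permissions the Collector needs. The manifest below includes the ServiceAccount, so no separate create command is required.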

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-role
rules:
  - apiGroups: [""]
    resources: ["nodes/stats", "nodes/proxy", "pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-role
subjects:
  - kind: ServiceAccount
    name: otel-collector-sa
    namespace: monitoring
EOF

# otelcol-daemonset.yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-daemonset
  namespace: monitoring
spec:
  mode: daemonset
  serviceAccount: otel-collector-sa
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "500m"
      memory: "400Mi"
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      kubeletstats:
        collection_interval: 30s
        auth_type: serviceAccount
        endpoint: "https://${env:K8S_NODE_IP}:10250"
        insecure_skip_verify: true
      filelog:
        include:
          - /var/log/pods/*/*/*.log
        include_file_path: true
        operators:
          - type: container
            id: container-parser

    processors:
      batch:
        timeout: 5s
        send_batch_size: 10000
      k8sattributes:
        passthrough: false
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.node.name
            - service.name
      resourcedetection:
        detectors: [env, k8snode]
        timeout: 5s
      memory_limiter:
        check_interval: 1s
        limit_mib: 400
        spike_limit_mib: 100

    connectors:
      spanmetrics:
        namespace: traces
        histogram:
          explicit:
            buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s]
        dimensions:
          - name: http.method
          - name: http.status_code

    exporters:
      otlphttp/loki:
        endpoint: http://loki-gateway.monitoring.svc:80/otlp
      otlp/tempo:
        endpoint: tempo.monitoring.svc:4317
        tls:
          insecure: true
      prometheusremotewrite/mimir:
        endpoint: http://mimir-gateway.monitoring.svc:80/api/v1/push
        headers:
          X-Scope-OrgID: anonymous

    service:
      pipelines:
        logs:
          receivers: [otlp, filelog]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlphttp/loki]
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp/tempo, spanmetrics]
        metrics:
          receivers: [otlp, kubeletstats, spanmetrics] 
          processors: [memory_limiter, k8sattributes, resourcedetection, batch]
          exporters: [prometheusremotewrite/mimir]

  volumeMounts:
    - name: varlogpods
      mountPath: /var/log/pods
      readOnly: true

  volumes:
    - name: varlogpods
      hostPath:
        path: /var/log/pods

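  env:
    - name: K8S_NODE_IP
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
    # The k8snode detector in resourcedetection reads the node name from K8S_NODE_NAME by default.
    - name: K8S_NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName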

kubectl apply -f otelcol-daemonset.yaml

If the installation finished correctly, "kubectl get pod -n monitoring" shows a number of Pods with none in Pending or Error.
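With this many Pods, filtering the list is easier than reading it. The sketch below prints only the Pods whose STATUS column is something other than Running or Completed. `list_unhealthy_pods` is a made-up helper name, and it relies on STATUS being the third column of the default `kubectl get pods` output.

```shell
#!/usr/bin/env bash
# Print "<pod> <status>" for every Pod that is neither Running nor Completed.
# Usage: list_unhealthy_pods [namespace]
list_unhealthy_pods() {
  local namespace="${1:-monitoring}"
  # Default columns are NAME READY STATUS RESTARTS AGE, so $3 is STATUS.
  kubectl get pods -n "$namespace" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Empty output means nothing is in Pending, Error, or CrashLoopBackOff. A Pod can still be Running without being ready, so the READY column is worth a glance as well.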

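If they are all Running, open the Grafana site and check the actual connections in the UI.

Grafana left sidebar -> Connections -> Data sources

Open each of Loki, Mimir, and Tempo there and use the "Test" button at the very bottom to check the connection.

If all three succeed, the installation is complete.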
