Grafana + LGTM

A complete Grafana tutorial covering the LGTM stack (Loki, Tempo, Mimir/Prometheus): exporters, dashboards, unified alerting, SSO, and backup.

This tutorial walks through deploying an observability stack on a Debian 13 server with Docker Compose. It covers metrics (Prometheus), logs (Loki), traces (Tempo), alerts (Alertmanager), and visualization (Grafana), integrated with Keycloak (SSO) and nginx-proxy + Let’s Encrypt.


1. Overview

Grafana Labs' “LGTM stack” (Loki, Grafana, Tempo, Mimir/Prometheus) is a combination of open-source tools that covers the three classic pillars of observability (metrics, logs, traces), plus alerting and a UI:

| Pillar | Tool | Model | Role |
|---|---|---|---|
| Metrics | Prometheus | Pull (HTTP scrape) | Collects numeric time series (counter, histogram, gauge) |
| Logs | Loki | Push (agent → ingester) | Stores logs, indexing only the labels (not the content) |
| Traces | Tempo | Push (OTLP) | Stores distributed spans with lookup by trace_id |
| Alerts | Alertmanager | Push (Prometheus → AM) | Routes, groups, and silences alerts (Discord, e-mail, PagerDuty) |
| UI | Grafana | Web | Dashboards, ad-hoc queries, unified alerting |

When to prefer LGTM over the Elastic Stack (ELK)

| Criterion | LGTM (Loki) | Elastic (ELK) |
|---|---|---|
| Log indexing | Labels only (cheap) | Full text of the whole document |
| Storage cost | Low (compressed chunks on S3/disk) | High (inverted index) |
| Free-text query | Slow (grep over the chunks via LogQL) | Very fast (Lucene) |
| Setup | Simpler, fewer components | Heavy (Elasticsearch + JVM heap) |
| RAM per node | A Loki ingester runs fine with ~512 MB | Elasticsearch needs several GB |
| Integration with metrics/traces | Native via Grafana | Needs Kibana + a separate APM |

Rule of thumb: if you rarely run free-text searches over logs (and almost always filter by labels such as service, level, container), Loki wins on cost and simplicity. For SOC / compliance work with complex text queries, Elastic is still worth it.
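
To make the trade-off concrete, here is a minimal Python sketch of label-only indexing. It is a toy model, not how Loki is implemented; the class name LabelIndexStore and the sample log lines are hypothetical.

```python
from collections import defaultdict


class LabelIndexStore:
    """Toy model of label-only indexing: the index maps a label set to raw lines."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines ("chunk")
        self.chunks = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        self.chunks[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        """Pick chunks by label (index lookup), then grep inside them (linear scan)."""
        hits = []
        for label_set, lines in self.chunks.items():
            if set(selector.items()) <= label_set:            # label filter
                hits.extend(l for l in lines if contains in l)  # brute-force text filter
        return hits


store = LabelIndexStore()
store.push({"service": "api-backend", "level": "ERROR"}, "timeout calling payment gateway")
store.push({"service": "api-backend", "level": "INFO"}, "request served in 12 ms")
store.push({"service": "web-spa", "level": "ERROR"}, "timeout loading bundle")

# Narrow label selector: only one small chunk is read.
print(store.query({"service": "api-backend", "level": "ERROR"}))
# Empty selector plus free text: every chunk has to be scanned.
print(store.query({}, contains="timeout"))
```

The second query is the case where a full-text inverted index pays off: without a label selector, every stored chunk is read.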


2. Architecture

flowchart LR
    subgraph apps["Applications"]
        SB1["api-backend<br/>(Spring Boot)"]
        SB2["api-backend-2<br/>(Spring Boot)"]
        VUE["web-spa<br/>(web vitals)"]
        PG[("Postgres 17")]
        NGX["nginx-proxy"]
        DOCKER["Docker engine"]
    end

    subgraph agents["Exporters / Agents"]
        NE["node_exporter"]
        CAD["cAdvisor"]
        PGE["postgres_exporter"]
        NGE["nginx-exporter"]
        BB["blackbox_exporter"]
        ALLOY["Grafana Alloy<br/>(logs + traces)"]
    end

    subgraph backends["Observability backends"]
        PROM["Prometheus<br/>:9090"]
        LOKI["Loki<br/>:3100"]
        TEMPO["Tempo<br/>:3200 / OTLP :4317"]
        AM["Alertmanager<br/>:9093"]
    end

    GRAF["Grafana<br/>:3000"]
    USER(("Operator"))
    DISCORD["Discord webhook"]
    EMAIL["SMTP"]

    SB1 -->|/actuator/prometheus| PROM
    SB2 -->|/actuator/prometheus| PROM
    PG --> PGE --> PROM
    NGX --> NGE --> PROM
    DOCKER --> CAD --> PROM
    DOCKER --> NE --> PROM
    BB --> PROM

    SB1 -.OTLP.-> TEMPO
    SB2 -.OTLP.-> TEMPO
    DOCKER -.logs.-> ALLOY --> LOKI
    VUE -.faro/OTLP.-> ALLOY

    PROM --> AM
    AM --> DISCORD
    AM --> EMAIL

    PROM --> GRAF
    LOKI --> GRAF
    TEMPO --> GRAF
    USER --> GRAF

Flow of a typical investigation:

  1. An alert fires in Prometheus (p95 latency > 1s) → Alertmanager → Discord.
  2. The operator opens the service dashboard in Grafana and spots the spike in http_server_requests_seconds.
  3. They click an exemplar on the panel → Grafana opens the trace in Tempo (same UI).
  4. From the span, they click “Logs for this span” → Grafana filters Loki by trace_id={X}.

3. Installing Prometheus

3.1 Directory layout on the server

/opt/docker/observability/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── backend.yml
├── alertmanager/
│   └── alertmanager.yml
├── loki/
│   └── loki-config.yml
├── tempo/
│   └── tempo.yml
├── alloy/
│   └── config.alloy
└── grafana/
    └── provisioning/
        ├── datasources/
        └── dashboards/

3.2 docker-compose.yml (Prometheus + Alertmanager block)

networks:
  proxy:
    external: true
  observability:
    driver: bridge

volumes:
  prometheus_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:v2.55.1
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"   # nobody
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=20GB"
      - "--web.enable-lifecycle"          # allows reload via HTTP POST /-/reload
      - "--web.enable-admin-api"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    expose:
      - "9090"
    networks:
      - observability
      - proxy
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3

  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    expose:
      - "9093"
    networks:
      - observability
      - proxy

Note that 9090 and 9093 appear only under expose: (not ports:); nginx-proxy is what exposes them to the internet, with TLS and a trusted-network allowlist.

3.3 prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: vps-prod
    environment: prod

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Host metrics
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

  # Container metrics
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]

  # Spring Boot apps (Micrometer)
  - job_name: spring-boot
    metrics_path: /actuator/prometheus
    scheme: http
    static_configs:
      - targets:
          - "api-backend:8080"
          - "api-backend-2:8080"
        labels:
          stack: spring-boot
    relabel_configs:
      - source_labels: [__address__]
        regex: "([^:]+):.*"
        target_label: service
        replacement: "$1"

  # Postgres
  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]

  # nginx
  - job_name: nginx
    static_configs:
      - targets: ["nginx-exporter:9113"]

  # External probes
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com
          - https://api.example.com
          - https://git.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
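
The relabel_configs above are easier to follow when traced by hand. The sketch below emulates a single Prometheus "replace" rule in Python, using the documented defaults (regex (.*), replacement $1, separator ;). It is an illustration only; the relabel function is hypothetical and not part of any Prometheus client library.

```python
import re


def relabel(labels: dict, source_labels: list, target_label: str,
            regex: str = "(.*)", replacement: str = "$1") -> dict:
    """Apply one Prometheus-style 'replace' relabel rule and return the new label set."""
    value = ";".join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)      # Prometheus anchors the regex on both ends
    if match is None:
        return dict(labels)                 # no match: labels are left untouched
    # Translate $1-style references into Python's \1 syntax before expanding.
    expanded = match.expand(re.sub(r"\$(\d+)", r"\\\1", replacement))
    return {**labels, target_label: expanded}


# spring-boot job: derive the "service" label from the host part of the address.
spring = relabel({"__address__": "api-backend:8080"},
                 ["__address__"], "service", regex="([^:]+):.*")
print(spring["service"])

# blackbox-http job: the three rules, applied in order.
probe = {"__address__": "https://app.example.com"}
probe = relabel(probe, ["__address__"], "__param_target")
probe = relabel(probe, ["__param_target"], "instance")
probe = relabel(probe, [], "__address__", replacement="blackbox-exporter:9115")
print(probe)
```

After the three blackbox rules, the probed URL survives as the instance label and the ?target= parameter, while the scrape itself goes to the exporter's address.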

3.4 Example alert rule (alerts/backend.yml)

groups:
  - name: backend-slo
    interval: 30s
    rules:
      - alert: BackendHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (
              rate(http_server_requests_seconds_bucket{service=~"api-.*"}[5m])
            )
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p95 latency > 1s on {{ $labels.service }}"
          description: "p95 = {{ $value | humanizeDuration }} over the last 5 min."

      - alert: BackendErrorRate5xx
        expr: |
          sum by (service) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "5xx rate > 5% on {{ $labels.service }}"
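
For intuition about what BackendHighLatencyP95 evaluates, here is a minimal Python sketch of how a quantile can be estimated from cumulative le buckets by linear interpolation, which is the approach histogram_quantile takes. The bucket counts are made-up sample data and the function is a simplification (it ignores edge cases such as negative bounds).

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)                 # ascending by upper bound; +Inf sorts last
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound             # rank falls in +Inf: report last finite bound
            if count == prev_count:
                return bound                  # empty bucket: nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for one service (bounds in seconds).
sample = [(0.05, 20), (0.1, 50), (0.2, 70), (0.5, 90), (1.0, 98), (2.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, sample)
print(f"p95 is roughly {p95:.2f}s, alert condition (> 1): {p95 > 1}")
```

Because the result is interpolated inside a bucket, its precision depends on where the bucket boundaries sit relative to the alert threshold.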

Reload without a restart after editing the YAML:

curl -X POST https://prometheus.example.com/-/reload
# or from inside the server:
docker exec prometheus wget -O- --post-data="" http://localhost:9090/-/reload

4. Exporters

Add to the same docker-compose.yml:

  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host
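    # Host networking is needed to read the host's interfaces. Note: in this mode the
    # container does not join the observability network, so the name node-exporter will
    # not resolve from Prometheus. Point the scrape target at the host instead (for
    # example host.docker.internal:9100, with extra_hosts "host.docker.internal:host-gateway"
    # on the prometheus service).
    network_mode: host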
    volumes:
      - /:/host:ro,rslave
    command:
      - "--path.rootfs=/host"
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    expose:
      - "8080"
    networks:
      - observability

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://prom_reader:${PG_PROM_PASSWORD}@postgres:5432/postgres?sslmode=disable"
      PG_EXPORTER_AUTO_DISCOVER_DATABASES: "true"
    expose:
      - "9187"
    networks:
      - observability
      - proxy        # needs to reach the postgres container

  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.3.0
    container_name: nginx-exporter
    restart: unless-stopped
    command:
      - "--nginx.scrape-uri=http://nginx-proxy:8080/stub_status"
    expose:
      - "9113"
    networks:
      - observability
      - proxy

  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    expose:
      - "9115"
    networks:
      - observability

Create the read-only user in Postgres first:

CREATE USER prom_reader WITH PASSWORD 'CHANGE_ME';
GRANT pg_monitor TO prom_reader;

And enable stub_status in nginx-proxy:

server {
    listen 8080;
    server_name _;
    location /stub_status {
        stub_status on;
        access_log off;
        allow 172.16.0.0/12;   # docker bridge ranges
        allow 127.0.0.1;
        deny all;
    }
}
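
The stub_status page is plain text, and the exporter's job is essentially to turn it into metrics. A minimal Python sketch of that parsing step follows; the sample page uses invented numbers, and the dictionary keys are modeled on the exporter's metric names and should be treated as assumptions.

```python
import re

# Example stub_status output (numbers are illustrative).
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Turn the plain-text stub_status page into a dict of gauges and counters."""
    active = int(re.search(r"Active connections:\s+(\d+)", text).group(1))
    accepts, handled, requests = map(int, re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE).groups())
    reading, writing, waiting = map(int, re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text).groups())
    return {
        "nginx_connections_active": active,
        "nginx_connections_accepted": accepts,
        "nginx_connections_handled": handled,
        "nginx_http_requests_total": requests,
        "nginx_connections_reading": reading,
        "nginx_connections_writing": writing,
        "nginx_connections_waiting": waiting,
    }


metrics = parse_stub_status(SAMPLE)
for name, value in metrics.items():
    print(name, value)
```

If accepted and handled diverge, nginx is dropping connections, which is usually worth an alert of its own.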

5. Spring Boot (Micrometer)

5.1 Dependencies (Gradle)

dependencies {
    implementation "org.springframework.boot:spring-boot-starter-actuator"
    implementation "io.micrometer:micrometer-registry-prometheus"

    // Tracing (section 7)
    implementation "io.micrometer:micrometer-tracing-bridge-otel"
    implementation "io.opentelemetry:opentelemetry-exporter-otlp"
}

5.2 application.properties

# Actuator
management.endpoints.web.exposure.include=health,info,prometheus
management.endpoint.health.probes.enabled=true
management.endpoint.prometheus.access=read_only

# Global tags: they become labels on every metric
management.metrics.tags.application=${spring.application.name}
management.metrics.tags.environment=prod

# HTTP latency histogram (required for histogram_quantile)
management.metrics.distribution.percentiles-histogram.http.server.requests=true
management.metrics.distribution.slo.http.server.requests=50ms,100ms,200ms,500ms,1s,2s

# Tracing
management.tracing.sampling.probability=0.1
management.otlp.tracing.endpoint=http://tempo:4318/v1/traces

5.3 Verify locally

curl -s http://localhost:8080/actuator/prometheus | head -20
# # HELP jvm_memory_used_bytes The amount of used memory
# # TYPE jvm_memory_used_bytes gauge
# jvm_memory_used_bytes{application="api-backend",area="heap",...} 1.23456E8

Prometheus must be on the same Docker network as the application (or both on proxy). The scrape config already points at api-backend:8080; Docker's internal DNS resolves the container name.
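
When checking the endpoint by hand, it helps to know how a sample line of the text exposition format breaks down into name, labels, and value. A small Python sketch follows; the regexes cover the common case only and the function name is hypothetical.

```python
import re

LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Parse one sample line; returns None for comments (# HELP / # TYPE) and blanks."""
    if line.startswith("#"):
        return None
    m = LINE.match(line)
    if m is None:
        return None
    labels = dict(LABEL.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


sample = parse_sample('jvm_memory_used_bytes{application="api-backend",area="heap"} 1.23456E8')
print(sample)
print(parse_sample("# TYPE jvm_memory_used_bytes gauge"))
```

The application label in the parsed output is the global tag configured through management.metrics.tags.application.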


6. Loki

6.1 docker-compose.yml (Loki + Alloy block)

volumes:
  loki_data:

services:
  loki:
    image: grafana/loki:3.3.2
    container_name: loki
    restart: unless-stopped
    user: "10001:10001"
    command: ["-config.file=/etc/loki/loki-config.yml"]
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    expose:
      - "3100"
    networks:
      - observability
      - proxy
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3100/ready"]
      interval: 30s

  alloy:
    image: grafana/alloy:v1.5.1
    container_name: alloy
    restart: unless-stopped
    command:
      - "run"
      - "--server.http.listen-addr=0.0.0.0:12345"
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log:ro
    expose:
      - "12345"
    networks:
      - observability
      - proxy

6.2 loki-config.yml (monolithic, filesystem)

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ingester:
  chunk_target_size: 1572864       # 1.5 MB
  chunk_idle_period: 30m
  max_chunk_age: 1h

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 720h           # 30 days
  max_query_parallelism: 32
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
  allow_structured_metadata: true

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-tmp
  alertmanager_url: http://alertmanager:9093
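
The three ingester settings interact: a chunk leaves memory when it is full, idle, or too old, whichever comes first. The Python sketch below models that decision with the values from the config above. It is a simplified mental model under that assumption, not Loki's actual flush logic; the Chunk class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    size_bytes: int   # size accumulated so far
    age_s: float      # seconds since the chunk was created
    idle_s: float     # seconds since the last log line was appended


# Values copied from the ingester block of loki-config.yml.
CHUNK_TARGET_SIZE = 1_572_864   # chunk_target_size (1.5 MB)
CHUNK_IDLE_PERIOD = 30 * 60     # chunk_idle_period (30m)
MAX_CHUNK_AGE = 60 * 60         # max_chunk_age (1h)


def flush_reason(chunk: Chunk):
    """Return why a chunk would be flushed to storage, or None if it stays in memory."""
    if chunk.size_bytes >= CHUNK_TARGET_SIZE:
        return "full"
    if chunk.idle_s >= CHUNK_IDLE_PERIOD:
        return "idle"
    if chunk.age_s >= MAX_CHUNK_AGE:
        return "max_age"
    return None


print(flush_reason(Chunk(size_bytes=2_000_000, age_s=60, idle_s=1)))
print(flush_reason(Chunk(size_bytes=1_000, age_s=2_400, idle_s=1_900)))
print(flush_reason(Chunk(size_bytes=1_000, age_s=3_700, idle_s=10)))
```

Low-volume streams tend to hit the idle or age limit long before the size target, which produces many small chunks; that is one reason to keep label cardinality low.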

6.3 Scaled mode with MinIO (optional)

If the environment already runs MinIO, replace filesystem: with s3: in common.storage and set object_store in schema_config to s3:

common:
  storage:
    s3:
      endpoint: minio:9000
      bucketnames: loki-chunks
      access_key_id: ${MINIO_ACCESS_KEY}
      secret_access_key: ${MINIO_SECRET_KEY}
      s3forcepathstyle: true
      insecure: true

And split the components into loki-read, loki-write, and loki-backend (the Helm “simple-scalable” mode). For the log volume of a small environment, monolithic mode is enough.

6.4 Alloy collector (config.alloy)

logging {
  level  = "info"
  format = "logfmt"
}

discovery.docker "containers" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "10s"
}

discovery.relabel "containers" {
  targets = discovery.docker.containers.targets

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_project"]
    target_label  = "compose_project"
  }
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.containers.output
  forward_to = [loki.process.app.receiver]
}

loki.process "app" {
  // Detects JSON and extracts level/message
  stage.json {
    expressions = {
      level   = "level",
      message = "message",
      logger  = "logger_name",
      trace   = "traceId",
    }
  }
  stage.labels {
    values = { level = "", logger = "" }
  }
  stage.structured_metadata {
    values = { trace_id = "trace" }
  }
  // Non-JSON Spring Boot logs: regex to extract the level
  stage.regex {
    expression = "^(?P<ts>\\S+)\\s+(?P<level>INFO|WARN|ERROR|DEBUG)\\s+"
  }
  stage.labels {
    values = { level = "" }
  }

  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
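
To see what the loki.process "app" pipeline does to a single line, the following Python sketch mimics its stages in order: JSON extraction, labels, structured metadata, then the plain-text regex fallback. It is an approximation for reasoning about the config, not Alloy's implementation; the sample lines are hypothetical.

```python
import json
import re

LEVEL_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>INFO|WARN|ERROR|DEBUG)\s+")


def process(line: str):
    """Return (labels, structured_metadata) for one log line."""
    labels, metadata = {}, {}
    try:
        doc = json.loads(line)                          # stage.json
    except ValueError:
        doc = None                                      # not JSON: later stages still run
    if isinstance(doc, dict):
        for label, field in (("level", "level"), ("logger", "logger_name")):
            if field in doc:
                labels[label] = str(doc[field])         # stage.labels
        if "traceId" in doc:
            metadata["trace_id"] = str(doc["traceId"])  # stage.structured_metadata
    match = LEVEL_RE.match(line)                        # stage.regex (plain-text fallback)
    if match:
        labels["level"] = match.group("level")          # second stage.labels
    return labels, metadata


json_line = ('{"level":"ERROR","logger_name":"c.e.PaymentService",'
             '"message":"boom","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}')
text_line = "2025-01-01T10:00:00.000Z WARN 1 --- [main] c.e.PaymentService : slow call"

print(process(json_line))
print(process(text_line))
```

The trace ID goes to structured metadata rather than a label because it is unique per request, and a label with that cardinality would create one stream per trace.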

Promtail still works (and is lighter), but Grafana has officially recommended Alloy since 2024: it replaces Promtail, Grafana Agent, and part of the OpenTelemetry Collector.


7. Tempo

7.1 docker-compose.yml (Tempo block)

volumes:
  tempo_data:

services:
  tempo:
    image: grafana/tempo:2.6.1
    container_name: tempo
    restart: unless-stopped
    user: "10001:10001"
    command: ["-config.file=/etc/tempo/tempo.yml"]
    volumes:
      - ./tempo/tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo_data:/var/tempo
    expose:
      - "3200"   # Tempo HTTP API
      - "4317"   # OTLP gRPC
      - "4318"   # OTLP HTTP
    networks:
      - observability
      - proxy

7.2 tempo.yml (monolithic, filesystem)

stream_over_http_enabled: true

server:
  http_listen_port: 3200
  log_level: info

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318

ingester:
  max_block_duration: 5m
  trace_idle_period: 10s

compactor:
  compaction:
    block_retention: 168h          # 7 days

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: vps-prod
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces

storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

metrics_generator is the neat trick here: Tempo derives RED metrics (Rate/Errors/Duration) and the service graph automatically and sends them to Prometheus via remote_write. For that to work, Prometheus needs the --web.enable-remote-write-receiver flag in its command.
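
As a rough illustration of what "deriving RED metrics from spans" means, the Python sketch below aggregates a handful of spans into per-service rate, error ratio, and duration. The span dictionaries and field names are hypothetical; Tempo itself emits counters and histograms that Prometheus then queries, so this is only the idea, not the mechanism.

```python
from collections import defaultdict


def red_metrics(spans: list, window_s: float) -> dict:
    """Aggregate spans into per-service Rate, Errors and Duration over one time window."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)
    result = {}
    for service, items in grouped.items():
        durations = [s["duration_s"] for s in items]
        result[service] = {
            "rate_per_s": len(items) / window_s,                         # Rate
            "error_ratio": sum(s["error"] for s in items) / len(items),  # Errors
            "avg_duration_s": sum(durations) / len(durations),           # Duration
            "max_duration_s": max(durations),
        }
    return result


spans = [
    {"service": "api-backend", "duration_s": 0.25, "error": False},
    {"service": "api-backend", "duration_s": 0.75, "error": True},
    {"service": "api-backend-2", "duration_s": 0.5, "error": False},
]
red = red_metrics(spans, window_s=60)
for service, values in red.items():
    print(service, values)
```

The practical benefit is that services get request-rate and latency panels even if they expose no Prometheus endpoint, as long as they emit traces.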

7.3 Enabling remote_write in Prometheus

Add to the Prometheus command: list:

- "--web.enable-remote-write-receiver"

7.4 Spring Boot sending spans

With the dependencies from section 5 and management.otlp.tracing.endpoint=http://tempo:4318/v1/traces, every HTTP request already becomes a span automatically (Spring MVC instrumentation). To add custom spans:

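import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;

// Inside a Spring-managed bean; pagamentoService is assumed to be injected as well.
@Autowired
Tracer tracer;

// Hypothetical wrapper method; the type of id is an assumption.
void processarComSpan(Long id) {
    Span span = tracer.nextSpan().name("processar-pagamento").start();
    // withSpan puts the span in scope, so nested calls and logs see its trace ID.
    try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
        pagamentoService.processar(id);
    } finally {
        span.end();   // always end the span, even if processing throws
    }
}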

The trace ID shows up in the Spring Boot logs once you add it to the Logback pattern:

<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level [%X{traceId:-},%X{spanId:-}] %logger - %msg%n</pattern>

With the JSON appender (recommended for Loki), traceId becomes a top-level field, which Alloy extracts as structured_metadata (section 6.4).


8. Installing Grafana

volumes:
  grafana_data:

services:
  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_SECURITY_DISABLE_GRAVATAR: "true"
      GF_SECURITY_COOKIE_SECURE: "true"
      GF_SECURITY_COOKIE_SAMESITE: "lax"

      GF_SERVER_DOMAIN: grafana.example.com
      GF_SERVER_ROOT_URL: "https://grafana.example.com/"
      GF_SERVER_ENFORCE_DOMAIN: "true"

      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_USERS_AUTO_ASSIGN_ORG_ROLE: Viewer

      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel,grafana-polystat-panel"

      GF_FEATURE_TOGGLES_ENABLE: "traceqlEditor traceToMetrics"

      # SMTP (e-mail alerting)
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: "smtp.gmail.com:587"
      GF_SMTP_USER: ${SMTP_USER}
      GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: "grafana@example.com"
      GF_SMTP_FROM_NAME: "Grafana"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    expose:
      - "3000"
    networks:
      - observability
      - proxy
    depends_on:
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy

.env (in the same directory):

GRAFANA_ADMIN_PASSWORD=CHANGE_ME
PG_PROM_PASSWORD=CHANGE_ME
SMTP_USER=CHANGE_ME
SMTP_PASSWORD=CHANGE_ME
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME

9. nginx reverse proxy

/opt/docker/nginx/conf.d/grafana.conf:

upstream grafana_upstream {
    server grafana:3000;
}

server {
    listen 443 ssl http2;
    server_name grafana.example.com;

    ssl_certificate     /etc/letsencrypt/live/grafana.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.example.com/privkey.pem;

    # Trusted-network allowlist: Grafana stays off the public internet
    allow 10.8.0.0/24;       # internal VPN network
    allow 203.0.113.10;      # the server itself
    deny all;

    client_max_body_size 50m;

    location / {
        proxy_pass http://grafana_upstream;
        proxy_set_header Host              $host;
        proxy_set_header X-Real-IP         $remote_addr;
        proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Grafana Live (WebSocket): needed for real-time dashboards and the unified alerting UI
    location /api/live/ {
        proxy_pass http://grafana_upstream;
        proxy_http_version 1.1;
        proxy_set_header Upgrade    $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host       $host;
    }
}

server {
    listen 80;
    server_name grafana.example.com;
    return 301 https://$host$request_uri;
}

Repeat the pattern (with a dedicated server_name) for prometheus.example.com and alertmanager.example.com if you want to expose the internal UIs, all behind the same trusted-network allowlist.


10. Auth / SSO with Keycloak

10.1 In Keycloak

  1. Realm corp → Clients → Create → grafana (Client type: OpenID Connect).
  2. Client authentication: ON. Standard flow: ON. Direct access grants: OFF.
  3. Valid redirect URIs: https://grafana.example.com/login/generic_oauth
  4. Web origins: https://grafana.example.com
  5. Credentials → copy the client secret.
  6. Client scopes → add roles to the defaults. In the client's “Mappers” tab → “Add predefined mapper” → choose realm roles (Token Claim Name: realm_access.roles, Add to userinfo: ON).

10.2 In Grafana (env vars)

environment:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_NAME: "Keycloak"
  GF_AUTH_GENERIC_OAUTH_ALLOW_SIGN_UP: "true"
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: grafana
  GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: ${KEYCLOAK_GRAFANA_SECRET}
  GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email offline_access roles"
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/auth"
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/token"
  GF_AUTH_GENERIC_OAUTH_API_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/userinfo"
  GF_AUTH_GENERIC_OAUTH_LOGIN_ATTRIBUTE_PATH: preferred_username
  GF_AUTH_GENERIC_OAUTH_NAME_ATTRIBUTE_PATH: name
  GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH: email
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(realm_access.roles[*], 'grafana-admin') && 'Admin' || contains(realm_access.roles[*], 'grafana-editor') && 'Editor' || 'Viewer'"
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_STRICT: "true"
  GF_AUTH_GENERIC_OAUTH_ALLOW_ASSIGN_GRAFANA_ADMIN: "true"
  GF_AUTH_OAUTH_AUTO_LOGIN: "false"
  GF_AUTH_SIGNOUT_REDIRECT_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/logout?post_logout_redirect_uri=https%3A%2F%2Fgrafana.example.com"

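ROLE_ATTRIBUTE_PATH is a JMESPath expression: it reads the realm_access.roles claim (an array) and returns Admin / Editor / Viewer depending on which realm role the user holds. Create the three roles (grafana-admin, grafana-editor, grafana-viewer) in the realm and assign them to users. Note that the expression falls back to Viewer for any authenticated user, so grafana-viewer is not actually checked.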
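
The && / || chain in the expression is a first-match-wins cascade. Written out as plain Python (a hand translation for illustration, not a JMESPath evaluator; the grafana_role function and the sample claims are hypothetical), the logic looks like this:

```python
def grafana_role(claims: dict) -> str:
    """Mirror the role attribute expression: the first matching realm role wins."""
    roles = claims.get("realm_access", {}).get("roles", [])
    if "grafana-admin" in roles:
        return "Admin"
    if "grafana-editor" in roles:
        return "Editor"
    return "Viewer"   # fallback branch of the expression


admin_claims = {"preferred_username": "alice",
                "realm_access": {"roles": ["grafana-admin", "grafana-editor"]}}
editor_claims = {"preferred_username": "bob",
                 "realm_access": {"roles": ["grafana-editor"]}}
plain_claims = {"preferred_username": "carol", "realm_access": {"roles": []}}

for claims in (admin_claims, editor_claims, plain_claims):
    print(claims["preferred_username"], "->", grafana_role(claims))
```

If the claim is missing from the token entirely (for example because the mapper was not added), every user lands in the fallback branch, which is a quick way to spot a mapper problem.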


11. Datasources (provisioning)

grafana/provisioning/datasources/datasources.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: 15s
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - name: trace_id
          # Case-tolerant so it matches both trace_id=... and the JSON field "traceId":"..."
          matcherRegex: "[tT]race[_-]?[iI]d[\"=:]+\\s*\"?([a-f0-9]+)"
          url: "$${__value.raw}"
          datasourceUid: tempo

  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: -5m
        spanEndTimeShift: 5m
        tags: [{ key: "service.name", value: "service" }]
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki

  - name: Postgres (app)
    uid: pg-app
    type: postgres
    access: proxy
    url: postgres:5432
    user: grafana_reader
    database: app
    secureJsonData:
      password: ${PG_GRAFANA_PASSWORD}
    jsonData:
      sslmode: disable
      postgresVersion: 1700
      timescaledb: false

Stable UIDs (prometheus, loki, tempo) are critical because versioned dashboards reference datasources by UID. If you let Grafana generate random UIDs, dashboards break on every redeploy.
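
Derived-field regexes are case-sensitive, so it is worth testing a matcherRegex against real log lines before relying on the Loki → Tempo link. The Python sketch below compares a lowercase-only pattern with a case-tolerant one; both patterns and the sample lines are illustrative assumptions, and Grafana evaluates the regex in JavaScript, which behaves the same way for these simple patterns.

```python
import re

LOWERCASE_ONLY = re.compile(r'trace[_-]?id["=:]+\s*"?([a-f0-9]+)')
CASE_TOLERANT = re.compile(r'[tT]race[_-]?[iI]d["=:]+\s*"?([a-f0-9]+)')


def extract_trace_id(pattern, line):
    """Return the first capture group, which becomes the derived field's value."""
    match = pattern.search(line)
    return match.group(1) if match else None


logfmt_line = 'level=ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 msg="payment failed"'
json_line = '{"level":"ERROR","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","message":"payment failed"}'

for name, pattern in (("lowercase-only", LOWERCASE_ONLY), ("case-tolerant", CASE_TOLERANT)):
    print(name, extract_trace_id(pattern, logfmt_line), extract_trace_id(pattern, json_line))
```

The lowercase-only pattern silently misses the camelCase JSON field, which shows up in Grafana as a log line without the trace link.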


12. Dashboards

12.1 Provisioning

grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""                # leave empty so foldersFromFilesStructure can derive folders from the subdirectories
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true

Create subfolders infra/, apps/, databases/ under grafana/provisioning/dashboards/json/ and drop the JSON files there.

12.2 Ready-made dashboards (grafana.com)

| ID | Name | Source |
|---|---|---|
| 1860 | Node Exporter Full | node-exporter |
| 14282 | cAdvisor Exporter | cadvisor |
| 9628 | PostgreSQL Database | postgres-exporter |
| 4701 | JVM (Micrometer) | spring-boot |
| 12708 | Nginx exporter | nginx-exporter |
| 7587 | Blackbox Exporter | blackbox |
| 13639 | Logs / Loki (App) | loki |
| 3662 | Prometheus 2.0 Stats | self |

Download and provision:

for id in 1860 14282 9628 4701 12708 7587 13639 3662; do
  curl -fsSL "https://grafana.com/api/dashboards/${id}/revisions/latest/download" \
    -o "/opt/docker/observability/grafana/provisioning/dashboards/json/infra/${id}.json"
done

Edit each JSON and replace ${DS_PROMETHEUS} with prometheus (the UID you pinned), or handle it via input mapping in the provisioning.
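
Editing a dozen JSON files by hand is error-prone, so the substitution can be scripted. A minimal Python sketch follows, assuming the placeholder appears as an exact string value somewhere in the dashboard JSON; the pin_datasource helper and the tiny sample dashboard are hypothetical.

```python
import json


def pin_datasource(node, placeholder: str = "${DS_PROMETHEUS}", uid: str = "prometheus"):
    """Recursively replace the datasource placeholder with a fixed UID; returns a new object."""
    if isinstance(node, dict):
        return {key: pin_datasource(value, placeholder, uid) for key, value in node.items()}
    if isinstance(node, list):
        return [pin_datasource(value, placeholder, uid) for value in node]
    if node == placeholder:
        return uid
    return node


dashboard = {
    "title": "Node Exporter Full",
    "panels": [
        {"datasource": "${DS_PROMETHEUS}", "targets": [{"expr": "up"}]},
        {"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}},
    ],
}
pinned = pin_datasource(dashboard)
print(json.dumps(pinned, indent=2))
```

In practice the same function would be applied to each downloaded file with json.load and json.dump; walking the structure instead of doing a raw text replace keeps queries that merely mention the placeholder untouched.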

12.3 Custom panel: 5xx rate per service (PromQL)

sum by (service) (
  rate(http_server_requests_seconds_count{status=~"5..", service=~"api-.*"}[5m])
)
/
sum by (service) (
  rate(http_server_requests_seconds_count{service=~"api-.*"}[5m])
)

12.4 Custom panel: top 10 ERROR logs by logger (LogQL)

topk(10,
  sum by (logger) (
    count_over_time({service=~"api-.*", level="ERROR"}[15m])
  )
)

12.5 Custom panel: p95 latency per endpoint (PromQL)

histogram_quantile(0.95,
  sum by (le, uri, service) (
    rate(http_server_requests_seconds_bucket{service="api-backend"}[5m])
  )
)

12.6 Web Vitals (frontends)

Use @grafana/faro-web-sdk in main.ts:

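import { getWebInstrumentations, initializeFaro } from '@grafana/faro-web-sdk';

initializeFaro({
  url: 'https://faro.example.com/collect',
  app: { name: 'web-spa', version: __APP_VERSION__ },
  // An empty array would disable all instrumentations; this enables the defaults
  // (web vitals, errors, console).
  instrumentations: [...getWebInstrumentations()],
});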

Faro sends to Alloy over HTTP, which forwards metrics (LCP, CLS, INP) to Prometheus via remote_write and logs to Loki. Ready-made dashboard: ID 19004 (Faro Web SDK).


13. Unified alerting

13.1 Contact points

Under Alerting → Contact points → New contact point:

  • Type: Discord
    • Webhook URL: https://discord.com/api/webhooks/XXX/YYY (create it under Server Settings → Integrations → Webhooks)
    • Message: {{ template "discord.message" . }}
  • Type: Email
    • Addresses: oncall@example.com

Or provision via YAML in grafana/provisioning/alerting/contact-points.yml:

apiVersion: 1
contactPoints:
  - orgId: 1
    name: discord-default
    receivers:
      - uid: discord-default
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          use_discord_username: true
          title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
          message: |
            **{{ .CommonAnnotations.summary }}**
            {{ range .Alerts }}
            - severity: {{ .Labels.severity }}
            - service: {{ .Labels.service }}
            - description: {{ .Annotations.description }}
            {{ end }}

13.2 Notification policy

grafana/provisioning/alerting/policies.yml:

apiVersion: 1
policies:
  - orgId: 1
    receiver: discord-default
    group_by: [alertname, service]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: email-oncall
        matchers:
          - severity = critical
        continue: true
      - receiver: discord-default
        matchers:
          - severity =~ "warning|info"
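
It is easy to misread what continue: true does in a policy tree like this one. The Python sketch below walks the two child routes in order, the way Alertmanager-style routing does; it is a simplified model (single matcher per route, no nesting), and it assumes an email-oncall contact point exists alongside discord-default.

```python
import re

ROUTES = [
    {"receiver": "email-oncall", "label": "severity", "regex": "critical", "continue": True},
    {"receiver": "discord-default", "label": "severity", "regex": "warning|info", "continue": False},
]
ROOT_RECEIVER = "discord-default"


def route(alert_labels: dict) -> list:
    """Return the receivers an alert is delivered to, walking child routes in order."""
    receivers = []
    for r in ROUTES:
        # Matchers are anchored: the whole label value has to match.
        if re.fullmatch(r["regex"], alert_labels.get(r["label"], "")):
            receivers.append(r["receiver"])
            if not r["continue"]:
                break
    return receivers or [ROOT_RECEIVER]   # nothing matched: the root policy handles it


print(route({"alertname": "BackendErrorRate5xx", "severity": "critical"}))
print(route({"alertname": "BackendHighLatencyP95", "severity": "warning"}))
print(route({"alertname": "NoSeverityLabel"}))
```

Under this model a critical alert reaches only email-oncall: continue: true lets evaluation move on to the next sibling, but that sibling only matches warning or info. If critical alerts should also land in Discord, the second matcher would need to include critical.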

13.3 Alert rule (UI or YAML)

Example: “backend p95 latency > 1s for 5 min”, via the UI:

  1. Alerting → Alert rules → New rule
  2. Grafana managed alert
  3. Query A (Prometheus): histogram_quantile(0.95, sum by (le,service) (rate(http_server_requests_seconds_bucket{service="api-backend"}[5m])))
  4. Expression B (Reduce → Last) → Expression C (Threshold > 1)
  5. Folder: Backend SLO / Group: latency / Eval interval: 1m / Pending period: 5m
  6. Labels: severity=warning, team=backend
  7. Annotations: summary="p95 backend > 1s", runbook_url="https://wiki.example.com/runbooks/p95"

Equivalent YAML in grafana/provisioning/alerting/rules.yml:

apiVersion: 1
groups:
  - orgId: 1
    name: backend-latency
    folder: Backend SLO
    interval: 1m
    rules:
      - uid: backend-p95-latency
        title: Backend p95 latency > 1s
        condition: C
        for: 5m
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange: { from: 300, to: 0 }
            model:
              expr: |
                histogram_quantile(0.95,
                  sum by (le,service) (rate(http_server_requests_seconds_bucket{service="api-backend"}[5m]))
                )
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [1] }
        labels:
          severity: warning
          team: backend
        annotations:
          summary: Backend p95 above 1s

13.4 Alerts on logs (Loki ruler)

Loki's ruler fires through the same Alertmanager:

# loki/rules/fake/errors.yml ("fake" is the tenant ID Loki uses when auth_enabled: false)
groups:
  - name: app-errors
    rules:
      - alert: TooManyErrorLogs
        expr: |
          sum by (service) (rate({level="ERROR"}[5m])) > 1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.service }} logging >1 ERROR/s"

14. Multi-tenant / Organizations

Grafana has two mechanisms:

Mechanism | When to use | Limitation
Organizations | Fully isolated tenants (different customers). Each org has its own datasources, dashboards, and users. | Dashboards cannot be shared across orgs; an admin has to switch orgs to edit.
Folders + Teams + RBAC | Same team, different areas (backend, infra, frontend). Users see only what matters to them. | All folders still share a single pool of datasources.

For a solo environment, stay in a single org and use folders (Infra, Apps, Databases, Frontends). Create teams if you ever add more people as Viewers. Granular RBAC (for example dashboards.write on folder X) is an Enterprise feature; in OSS you get a global Admin/Editor/Viewer role plus per-folder permissions.

Provisioning folders and permissions:

# grafana/provisioning/access-control/roles.yml: requires Enterprise
# In OSS, configure through the UI: Folder → Permissions
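
In OSS the same per-folder permission can also be set through the HTTP API instead of the UI. The sketch below only builds the request body; the function name folder_team_permission, the folder UID infra, and the team ID 3 are hypothetical, and $TOKEN / $GRAFANA_URL are assumed to hold a Service Account token and the Grafana base URL.

```shell
# Builds the JSON body for POST /api/folders/<uid>/permissions.
# Permission levels in the Grafana HTTP API: 1 = View, 2 = Edit, 4 = Admin.
folder_team_permission() {
  local team_id="$1" level="$2"
  printf '{"items":[{"teamId":%d,"permission":%d}]}\n' "$team_id" "$level"
}

# Example call (folder UID "infra" and team ID 3 are hypothetical):
# curl -s -X POST \
#   -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
#   -d "$(folder_team_permission 3 1)" \
#   "$GRAFANA_URL/api/folders/infra/permissions"
```

This endpoint replaces the folder's whole permission list, so every entry that should survive has to be included in the items array.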

15. Backup

15.1 Grafana volume (default SQLite)

# Snapshot while Grafana is stopped (consistent copy)
docker compose stop grafana
tar czf /backup/grafana-$(date +%F).tar.gz \
  -C /var/lib/docker/volumes/observability_grafana_data/_data .
docker compose start grafana
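
Restoring is the reverse operation. A minimal sketch, assuming the archive layout produced by the tar command above; the function name restore_grafana_archive is hypothetical. It lists the archive first so that a truncated or corrupted file is rejected before anything is unpacked.

```shell
# restore_grafana_archive ARCHIVE TARGET_DIR
# Verifies that the archive is readable, then unpacks it into TARGET_DIR.
restore_grafana_archive() {
  local archive="${1:?usage: restore_grafana_archive ARCHIVE TARGET_DIR}"
  local target="${2:?usage: restore_grafana_archive ARCHIVE TARGET_DIR}"
  # Listing the contents fails on truncated or corrupted archives.
  if ! tar tzf "$archive" > /dev/null 2>&1; then
    echo "unreadable archive: $archive" >&2
    return 1
  fi
  mkdir -p "$target"
  tar xzf "$archive" -C "$target"
}

# Example (run between "docker compose stop grafana" and "docker compose start grafana"):
# restore_grafana_archive /backup/grafana-2025-01-31.tar.gz \
#   /var/lib/docker/volumes/observability_grafana_data/_data
```

Files already present in the target directory that are not in the archive are left in place, so empty the directory first if an exact copy is needed.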

15.2 Migrate to Postgres (recommended in production)

environment:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: postgres:5432
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: grafana
  GF_DATABASE_PASSWORD: ${GF_DB_PASSWORD}
  GF_DATABASE_SSL_MODE: disable

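Grafana does not copy existing data from SQLite into Postgres, so export everything before switching (next step) and re-import it afterwards. SQLite is mainly at risk of corruption when the host or volume crashes mid-write; with Postgres you gain replication and backups via pg_dump.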
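
A pg_dump backup of the Grafana database could look like the sketch below. It assumes the Compose service is named postgres (inferred from GF_DATABASE_HOST) and that local connections inside that container need no password, as in the official image. The helper name grafana_pg_dump and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helper: dump the "grafana" database in pg_dump's custom format.
# Assumes a Compose service called "postgres"; set DRY_RUN=1 to only print the command.
grafana_pg_dump() {
  local out="${1:?usage: grafana_pg_dump OUTPUT_FILE}"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "docker compose exec -T postgres pg_dump -U grafana -Fc grafana > $out"
  else
    # -T disables the pseudo-TTY so the binary dump is not mangled.
    docker compose exec -T postgres pg_dump -U grafana -Fc grafana > "$out"
  fi
}

# Example: grafana_pg_dump "/backup/grafana-db-$(date +%F).dump"
```

The custom format (-Fc) is compressed and can be restored selectively with pg_restore.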

15.3 Export dashboards via the API

GRAFANA_URL=https://grafana.example.com
TOKEN="${GRAFANA_TOKEN:?export a Service Account token first}"  # create one under Administration → Service accounts

mkdir -p backup/dashboards
curl -s -H "Authorization: Bearer $TOKEN" \
  "$GRAFANA_URL/api/search?type=dash-db" \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -H "Authorization: Bearer $TOKEN" \
        "$GRAFANA_URL/api/dashboards/uid/$uid" \
        | jq '.dashboard' > "backup/dashboards/${uid}.json"
    done

Run it from a daily cron job and push the result to a dedicated backup Git repository.

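15.4 Prometheus backup

Use promtool tsdb dump or a TSDB snapshot. The snapshot endpoint is part of the admin API, so Prometheus must be started with --web.enable-admin-api:

curl -X POST http://prometheus.example.com/api/v1/admin/tsdb/snapshot
# the snapshot directory appears under /prometheus/snapshots/ inside the container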
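
The endpoint answers with a small JSON document whose data.name field is the snapshot directory name. A minimal sketch for extracting that name and copying the snapshot out of the container; the function name snapshot_name, the container name prometheus, and the /backup target are assumptions.

```shell
# Reads the JSON reply of the snapshot endpoint on stdin and prints data.name.
snapshot_name() {
  sed -n 's/.*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (container name "prometheus" is an assumption):
# name=$(curl -s -X POST http://prometheus.example.com/api/v1/admin/tsdb/snapshot | snapshot_name)
# docker cp "prometheus:/prometheus/snapshots/$name" "/backup/prometheus-$name"
```

Since jq is already used for the dashboard export, jq -r '.data.name' is an equally valid way to read the field.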

15.5 Loki/Tempo

For filesystem storage, tar the volume periodically. For S3/MinIO, MinIO itself can provide versioning and lifecycle rules.


16. Troubleshooting

Datasource “Failed to connect”

  • Check that Grafana and the backend are on the same Docker network (observability). docker network inspect observability_observability should list both containers.
  • Use the container name as the host (http://prometheus:9090), not localhost.
  • Test from inside the Grafana container: docker exec grafana wget -O- http://prometheus:9090/-/healthy

Dashboard shows “No data”

  • Compare the labels: in Prometheus, Status → Targets shows which labels each target actually carries (service, instance, job). If the dashboard filters on service=~"api-.*" but the exporter sends application=api-..., change the label.
  • Check the time range (top right): with the 30-day retention used here, Loki returns nothing for older ranges.
  • curl 'http://prometheus:9090/api/v1/query?query=up' should return a result.

Loki “too many outstanding requests”

limits_config:
  max_query_parallelism: 32
  split_queries_by_interval: 15m
  max_concurrent_tail_requests: 10
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

Alternatively, narrow the query time range in Grafana.

Tempo “MIGRATE schema” at boot

This happens after a version upgrade (e.g. 2.4 → 2.6). Let the container finish the migration, which shows up in the first log lines:

level=info msg="migrating schema from v2 to v3"

If it hangs, downgrade the image to the immediately preceding version, let it drain, then upgrade again.

Grafana behind a proxy: OAuth callback redirects to http://

  • GF_SERVER_ROOT_URL must be https://... (with the trailing slash).
  • In nginx, make sure proxy_set_header X-Forwarded-Proto $scheme; is set.
  • Add GF_SERVER_ENFORCE_DOMAIN=true to avoid host mismatches.
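
Put together, the Grafana side of these settings might look like the fragment below. It is a sketch that assumes the public hostname grafana.example.com used elsewhere in this tutorial; GF_SERVER_DOMAIN is added here because the enforce-domain check compares the request host against it.

```yaml
environment:
  GF_SERVER_DOMAIN: grafana.example.com
  GF_SERVER_ROOT_URL: https://grafana.example.com/   # https scheme, trailing slash
  GF_SERVER_ENFORCE_DOMAIN: "true"                   # redirect requests with a different Host
```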

Keycloak: “Invalid redirect URI”

The registered URI must be exactly https://grafana.example.com/login/generic_oauth (including /login/generic_oauth, not just /). Wildcards (/*) also work but are less secure.

Alertmanager does not post to Discord

  • Test the webhook directly: curl -X POST -H "Content-Type: application/json" -d '{"content":"test"}' "$DISCORD_WEBHOOK"
  • In Alerting → Contact points, click Test; Grafana sends a dummy alert.
  • Check the route matchers: a severity=critical matcher only catches alerts that carry exactly that label value.
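
The matcher behaviour from the last point can be illustrated with an Alertmanager routing fragment. This is a sketch only: the receiver names discord-default and discord-critical are hypothetical and are assumed to be defined in the receivers section of the same file.

```yaml
route:
  receiver: discord-default            # fallback when no child route matches
  routes:
    - matchers:
        - severity="critical"          # exact match: "Critical" or "crit" will not match
      receiver: discord-critical
    - matchers:
        - severity=~"warning|info"     # regex match covers several values
      receiver: discord-default
```

An alert that carries no severity label at all falls through to the top-level receiver, which is a common reason for notifications landing in an unexpected channel.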

Spring Boot: /actuator/prometheus returns 404

  • The micrometer-registry-prometheus dependency is missing.
  • Or the endpoint is not exposed: management.endpoints.web.exposure.include=prometheus.
  • Spring Boot 3.4+: endpoint access is set with management.endpoint.prometheus.access=read-only (this replaces the older enabled=true).

cAdvisor using 100% CPU

Restrict what it monitors:

command:
  - "--housekeeping_interval=30s"
  - "--docker_only=true"
  - "--disable_metrics=disk,diskIO,network,tcp,udp,percpu,sched,process"

17. References