Complete tutorial for bringing up an observability stack on a Debian 13 server with Docker Compose. It covers metrics (Prometheus), logs (Loki), traces (Tempo), alerts (Alertmanager) and visualization (Grafana), integrated with Keycloak (SSO) and nginx-proxy + Let’s Encrypt.
1. Overview
The Grafana Labs “LGTM stack” (Loki, Grafana, Tempo, Mimir) is a set of open-source tools covering the three classic pillars of observability. This tutorial uses Prometheus in the Mimir slot:
| Pillar | Tool | Model | Role |
|---|---|---|---|
| Metrics | Prometheus | Pull (HTTP scrape) | Collects numeric time series (counters, gauges, histograms) |
| Logs | Loki | Push (agent → ingester) | Stores logs, indexing only labels (not the content) |
| Traces | Tempo | Push (OTLP) | Stores distributed spans, looked up by trace_id |
| Alerts | Alertmanager | Push (Prometheus → AM) | Routes, groups and silences alerts (Discord, e-mail, PagerDuty) |
| UI | Grafana | Web | Dashboards, ad-hoc queries, unified alerting |
When to prefer LGTM over the Elastic Stack (ELK)
| Criterion | LGTM (Loki) | Elastic (ELK) |
|---|---|---|
| Log indexing | Labels only (cheap) | Full text of the whole document |
| Storage cost | Low (compressed chunks on S3/disk) | High (inverted index) |
| Free-text query | Slow (grep over chunks via LogQL) | Very fast (Lucene) |
| Setup | Simpler, fewer components | Heavy (Elasticsearch + JVM heap) |
| RAM per node | ~512 MB is fine for a Loki ingester | Elasticsearch needs several GB |
| Metrics/traces integration | Native through Grafana | Needs Kibana + separate APM |
Rule of thumb: if you rarely run free-text searches over logs (and almost always filter by labels such as service, level, container), Loki wins on cost and simplicity. For SOC / compliance work with complex text queries, Elastic is still worth it.
2. Architecture
flowchart LR
subgraph apps["Applications"]
SB1["api-backend<br/>(Spring Boot)"]
SB2["api-backend-2<br/>(Spring Boot)"]
VUE["web-spa<br/>(web vitals)"]
PG[("Postgres 17")]
NGX["nginx-proxy"]
DOCKER["Docker engine"]
end
subgraph agents["Exporters / Agents"]
NE["node_exporter"]
CAD["cAdvisor"]
PGE["postgres_exporter"]
NGE["nginx-exporter"]
BB["blackbox_exporter"]
ALLOY["Grafana Alloy<br/>(logs + traces)"]
end
subgraph backends["Observability backends"]
PROM["Prometheus<br/>:9090"]
LOKI["Loki<br/>:3100"]
TEMPO["Tempo<br/>:3200 / OTLP :4317"]
AM["Alertmanager<br/>:9093"]
end
GRAF["Grafana<br/>:3000"]
USER(("Operator"))
DISCORD["Discord webhook"]
EMAIL["SMTP"]
SB1 -->|/actuator/prometheus| PROM
SB2 -->|/actuator/prometheus| PROM
PG --> PGE --> PROM
NGX --> NGE --> PROM
DOCKER --> CAD --> PROM
DOCKER --> NE --> PROM
BB --> PROM
SB1 -.OTLP.-> TEMPO
SB2 -.OTLP.-> TEMPO
DOCKER -.logs.-> ALLOY --> LOKI
VUE -.faro/OTLP.-> ALLOY
PROM --> AM
AM --> DISCORD
AM --> EMAIL
PROM --> GRAF
LOKI --> GRAF
TEMPO --> GRAF
USER --> GRAF
Flow of a typical investigation:
- An alert fires in Prometheus (p95 latency > 1s) → Alertmanager → Discord.
- The operator opens the service dashboard in Grafana and spots the spike in http_server_requests_seconds.
- They click an example span (exemplar) → Grafana opens the trace in Tempo (same UI).
- From the span, they click “Logs for this span” → Grafana filters Loki by trace_id={X}.
3. Installing Prometheus
3.1 Layout on the server
/opt/docker/observability/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts/
│       └── backend.yml
├── alertmanager/
│   └── alertmanager.yml
├── blackbox/
│   └── blackbox.yml
├── loki/
│   └── loki-config.yml
├── tempo/
│   └── tempo.yml
├── alloy/
│   └── config.alloy
└── grafana/
    └── provisioning/
        ├── datasources/
        └── dashboards/
3.2 docker-compose.yml (Prometheus + Alertmanager block)
networks:
  proxy:
    external: true
  observability:
    driver: bridge
volumes:
  prometheus_data:
  alertmanager_data:
services:
  prometheus:
    image: prom/prometheus:v2.55.1
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534" # nobody
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=20GB"
      - "--web.enable-lifecycle" # allows reload via HTTP POST /-/reload
      - "--web.enable-admin-api" # required by the snapshot endpoint used in section 15.4
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
      - prometheus_data:/prometheus
    expose:
      - "9090"
    networks:
      - observability
      - proxy
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    expose:
      - "9093"
    networks:
      - observability
      - proxy
Note that 9090 and 9093 appear only under expose: (not ports:). The nginx-proxy is what exposes them to the internet, with TLS and an allowlist for the trusted network.
3.3 prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: vps-prod
    environment: prod
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
rule_files:
  - /etc/prometheus/alerts/*.yml
scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  # Host metrics
  # node-exporter runs with network_mode: host (section 4), so this name only
  # resolves if you map it to the host address or drop host networking there.
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
  # Container metrics
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
  # Spring Boot apps (Micrometer)
  - job_name: spring-boot
    metrics_path: /actuator/prometheus
    scheme: http
    static_configs:
      - targets:
          - "api-backend:8080"
          - "api-backend-2:8080"
        labels:
          stack: spring-boot
    relabel_configs:
      - source_labels: [__address__]
        regex: "([^:]+):.*"
        target_label: service
        replacement: "$1"
  # Postgres
  - job_name: postgres
    static_configs:
      - targets: ["postgres-exporter:9187"]
  # nginx
  - job_name: nginx
    static_configs:
      - targets: ["nginx-exporter:9113"]
  # External probes
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://app.example.com
          - https://api.example.com
          - https://git.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
3.4 Example alert rule (alerts/backend.yml)
groups:
  - name: backend-slo
    interval: 30s
    rules:
      - alert: BackendHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (
              rate(http_server_requests_seconds_bucket{service=~"api-.*"}[5m])
            )
          ) > 1
        for: 5m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "p95 latency > 1s on {{ $labels.service }}"
          description: "p95 = {{ $value | humanizeDuration }} over the last 5 min."
      - alert: BackendErrorRate5xx
        expr: |
          sum by (service) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "5xx rate > 5% on {{ $labels.service }}"
Reload without a restart after editing the YAML:
curl -X POST https://prometheus.example.com/-/reload
# or from inside the server:
docker exec prometheus wget -O- --post-data="" http://localhost:9090/-/reload
4. Exporters
Add to the same docker-compose.yml:
  node-exporter:
    image: prom/node-exporter:v1.8.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    # Needs the host network to read host interfaces. With host networking the
    # container is not attached to the observability network, so the name
    # node-exporter is not resolvable through Docker DNS.
    network_mode: host
    volumes:
      - /:/host:ro,rslave
    command:
      - "--path.rootfs=/host"
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    expose:
      - "8080"
    networks:
      - observability
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    container_name: postgres-exporter
    restart: unless-stopped
    environment:
      DATA_SOURCE_NAME: "postgresql://prom_reader:${PG_PROM_PASSWORD}@postgres:5432/postgres?sslmode=disable"
      PG_EXPORTER_AUTO_DISCOVER_DATABASES: "true"
    expose:
      - "9187"
    networks:
      - observability
      - proxy # needs to reach the postgres container
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:1.3.0
    container_name: nginx-exporter
    restart: unless-stopped
    command:
      - "--nginx.scrape-uri=http://nginx-proxy:8080/stub_status"
    expose:
      - "9113"
    networks:
      - observability
      - proxy
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    expose:
      - "9115"
    networks:
      - observability
Create the read-only user in Postgres first:
CREATE USER prom_reader WITH PASSWORD 'CHANGE_ME';
GRANT pg_monitor TO prom_reader;
And enable stub_status in nginx-proxy:
server {
    listen 8080;
    server_name _;
    location /stub_status {
        stub_status on;
        access_log off;
        allow 172.16.0.0/12; # docker bridge ranges
        allow 127.0.0.1;
        deny all;
    }
}
5. Spring Boot (Micrometer)
5.1 Dependencies (Gradle)
dependencies {
    implementation "org.springframework.boot:spring-boot-starter-actuator"
    implementation "io.micrometer:micrometer-registry-prometheus"
    // Tracing (section 7)
    implementation "io.micrometer:micrometer-tracing-bridge-otel"
    implementation "io.opentelemetry:opentelemetry-exporter-otlp"
}
5.2 application.properties
# Actuator
management.endpoints.web.exposure.include=health,info,prometheus
management.endpoint.health.probes.enabled=true
# Spring Boot 3.4+; on older 3.x releases use management.endpoint.prometheus.enabled=true
management.endpoint.prometheus.access=read_only
# Global tags: become labels on every metric
management.metrics.tags.application=${spring.application.name}
management.metrics.tags.environment=prod
# HTTP latency histogram (required for histogram_quantile)
management.metrics.distribution.percentiles-histogram.http.server.requests=true
management.metrics.distribution.slo.http.server.requests=50ms,100ms,200ms,500ms,1s,2s
# Tracing
management.tracing.sampling.probability=0.1
management.otlp.tracing.endpoint=http://tempo:4318/v1/traces
5.3 Verify locally
curl -s http://localhost:8080/actuator/prometheus | head -20
# # HELP jvm_memory_used_bytes The amount of used memory
# # TYPE jvm_memory_used_bytes gauge
# jvm_memory_used_bytes{application="api-backend",area="heap",...} 1.23456E8
Prometheus must be on the same Docker network as the application (or both on proxy). The scrape config already points at api-backend:8080, and Docker's internal DNS resolves the container name.
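Once the endpoint answers, the `_bucket` series it exposes are what histogram_quantile works on. The sketch below is a simplified Python rendition of that estimate (linear interpolation inside the bucket holding the target rank), applied to raw cumulative counts instead of rates. The sample exposition text, the `/pay` URI and the function names are made up for illustration, and it assumes all bucket lines belong to a single series.

```python
import math
import re

# Matches lines such as: http_server_requests_seconds_bucket{uri="/pay",le="0.5"} 92
BUCKET_LINE = re.compile(r'_bucket\{(?:[^}]*,)?le="([^"]+)"[^}]*\}\s+(\S+)')

# Hypothetical scrape output for one series (cumulative counts per upper bound).
SAMPLE = """\
http_server_requests_seconds_bucket{uri="/pay",le="0.05"} 40
http_server_requests_seconds_bucket{uri="/pay",le="0.1"} 70
http_server_requests_seconds_bucket{uri="/pay",le="0.2"} 85
http_server_requests_seconds_bucket{uri="/pay",le="0.5"} 92
http_server_requests_seconds_bucket{uri="/pay",le="1.0"} 96
http_server_requests_seconds_bucket{uri="/pay",le="2.0"} 99
http_server_requests_seconds_bucket{uri="/pay",le="+Inf"} 100
"""


def parse_buckets(exposition_text):
    """Collect cumulative (upper_bound, count) pairs from *_bucket lines."""
    buckets = []
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue
        match = BUCKET_LINE.search(line)
        if match:
            buckets.append((float(match.group(1)), float(match.group(2))))
    return sorted(buckets)


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls past the last finite bucket: report its bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    print(histogram_quantile(0.95, parse_buckets(SAMPLE)))
```

The estimate can never be more precise than the bucket layout, which is why the SLO boundaries in application.properties should bracket the alert threshold (1s here).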
6. Loki
6.1 docker-compose.yml (Loki + Alloy block)
volumes:
  loki_data:
services:
  loki:
    image: grafana/loki:3.3.2
    container_name: loki
    restart: unless-stopped
    user: "10001:10001"
    command: ["-config.file=/etc/loki/loki-config.yml"]
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    expose:
      - "3100"
    networks:
      - observability
      - proxy
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3100/ready"]
      interval: 30s
  alloy:
    image: grafana/alloy:v1.5.1
    container_name: alloy
    restart: unless-stopped
    command:
      - "run"
      - "--server.http.listen-addr=0.0.0.0:12345"
      - "--storage.path=/var/lib/alloy/data"
      - "/etc/alloy/config.alloy"
    volumes:
      - ./alloy/config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log:ro
    expose:
      - "12345"
    networks:
      - observability
      - proxy
6.2 loki-config.yml (monolithic, filesystem)
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: info
common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
ingester:
  chunk_target_size: 1572864 # 1.5 MB
  chunk_idle_period: 30m
  max_chunk_age: 1h
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 720h # 30 days
  max_query_parallelism: 32
  ingestion_rate_mb: 8
  ingestion_burst_size_mb: 16
  allow_structured_metadata: true
compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules
  rule_path: /loki/rules-tmp
  alertmanager_url: http://alertmanager:9093
6.3 Scaled mode with MinIO (optional)
If the environment already has MinIO, replace filesystem: with s3: in common.storage and set object_store in schema_config accordingly:
common:
  storage:
    s3:
      endpoint: minio:9000
      bucketnames: loki-chunks
      access_key_id: ${MINIO_ACCESS_KEY}
      secret_access_key: ${MINIO_SECRET_KEY}
      s3forcepathstyle: true
      insecure: true
Loki only expands ${...} references in its config file when started with -config.expand-env=true. From there you can split the components into loki-read, loki-write and loki-backend (the “simple-scalable” Helm mode), but for the log volume of a small environment, monolithic is enough.
6.4 Alloy collector (config.alloy)
logging {
  level  = "info"
  format = "logfmt"
}
discovery.docker "containers" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "10s"
}
discovery.relabel "containers" {
  targets = discovery.docker.containers.targets
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_project"]
    target_label  = "compose_project"
  }
}
loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.containers.output
  forward_to = [loki.process.app.receiver]
}
loki.process "app" {
  // Detects JSON and extracts level/message
  stage.json {
    expressions = {
      level   = "level",
      message = "message",
      logger  = "logger_name",
      trace   = "traceId",
    }
  }
  stage.labels {
    values = { level = "", logger = "" }
  }
  stage.structured_metadata {
    values = { trace_id = "trace" }
  }
  // Non-JSON Spring Boot output: regex to extract the level
  stage.regex {
    expression = "^(?P<ts>\\S+)\\s+(?P<level>INFO|WARN|ERROR|DEBUG)\\s+"
  }
  stage.labels {
    values = { level = "" }
  }
  forward_to = [loki.write.default.receiver]
}
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
Promtail still works (and is lighter), but Grafana has officially recommended Alloy since 2024: it replaces Promtail and Grafana Agent, and covers part of what the OpenTelemetry Collector does.
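To make the loki.process pipeline in config.alloy easier to reason about, here is a rough Python model of what its stages do to a single log line: try JSON first, fall back to the Spring Boot text regex, then split the result into labels and structured metadata. It is only an approximation of the behaviour configured above, not Alloy's implementation; the function name `process` and the sample lines are made up.

```python
import json
import re

# Same expression as stage.regex in config.alloy.
SPRING_LINE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>INFO|WARN|ERROR|DEBUG)\s+")


def process(line):
    """Return (labels, structured_metadata) for one log line."""
    extracted = {}

    # stage.json: only fills values when the line is a JSON object.
    try:
        doc = json.loads(line)
    except ValueError:
        doc = None
    if isinstance(doc, dict):
        extracted["level"] = doc.get("level")
        extracted["logger"] = doc.get("logger_name")
        extracted["trace"] = doc.get("traceId")

    # stage.regex: fallback for plain-text Spring Boot lines.
    match = SPRING_LINE.match(line)
    if match:
        extracted["level"] = match.group("level")

    # stage.labels: promote level/logger to indexed labels when present.
    labels = {key: extracted[key] for key in ("level", "logger") if extracted.get(key)}

    # stage.structured_metadata: keep the trace id out of the index.
    metadata = {"trace_id": extracted["trace"]} if extracted.get("trace") else {}
    return labels, metadata


if __name__ == "__main__":
    print(process('{"level":"ERROR","logger_name":"com.example.Pay","traceId":"abc123"}'))
    print(process("2025-01-01T10:00:00.000Z  WARN 1 --- [main] c.e.App : slow"))
```

Keeping trace_id as structured metadata rather than a label matters because every distinct label value creates a new stream in Loki, and trace IDs are unique per request.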
7. Tempo
7.1 docker-compose.yml (Tempo block)
volumes:
  tempo_data:
services:
  tempo:
    image: grafana/tempo:2.6.1
    container_name: tempo
    restart: unless-stopped
    user: "10001:10001"
    command: ["-config.file=/etc/tempo/tempo.yml"]
    volumes:
      - ./tempo/tempo.yml:/etc/tempo/tempo.yml:ro
      - tempo_data:/var/tempo
    expose:
      - "3200" # Tempo HTTP API
      - "4317" # OTLP gRPC
      - "4318" # OTLP HTTP
    networks:
      - observability
      - proxy
7.2 tempo.yml (monolithic, filesystem)
stream_over_http_enabled: true
server:
  http_listen_port: 3200
  log_level: info
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
ingester:
  max_block_duration: 5m
  trace_idle_period: 10s
compactor:
  compaction:
    block_retention: 168h # 7 days
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: vps-prod
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
metrics_generator is the key trick: Tempo derives RED metrics (Rate/Errors/Duration) and the service graph automatically and sends them to Prometheus via remote_write. For that, Prometheus needs the --web.enable-remote-write-receiver flag in its command.
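As a mental model of what "RED metrics derived from spans" means, the sketch below aggregates a list of spans into rate, error ratio and average duration per service. It is a conceptual illustration only: the `Span` type, its fields and `red_metrics` are hypothetical, and Tempo's span-metrics processor actually emits counters and histograms rather than pre-computed ratios.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, minimal view of a finished server span."""
    service: str
    duration_s: float
    error: bool


def red_metrics(spans, window_s):
    """Aggregate spans seen during window_s seconds into per-service RED values."""
    totals = {}
    for span in spans:
        agg = totals.setdefault(span.service, {"count": 0, "errors": 0, "duration_sum": 0.0})
        agg["count"] += 1
        agg["errors"] += int(span.error)
        agg["duration_sum"] += span.duration_s
    return {
        service: {
            "rate_per_s": agg["count"] / window_s,               # Rate
            "error_ratio": agg["errors"] / agg["count"],         # Errors
            "avg_duration_s": agg["duration_sum"] / agg["count"],  # Duration
        }
        for service, agg in totals.items()
    }


if __name__ == "__main__":
    spans = [
        Span("api-backend", 0.2, False),
        Span("api-backend", 0.4, True),
        Span("api-backend-2", 0.1, False),
    ]
    print(red_metrics(spans, window_s=60))
```

Because these values come from sampled traces, a sampling probability of 0.1 means the generated rates only reflect about a tenth of real traffic unless scaled accordingly.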
7.3 Enabling remote_write on Prometheus
Add to the Prometheus command: list:
- "--web.enable-remote-write-receiver"
7.4 Spring Boot sending spans
With the dependencies from section 5 and management.otlp.tracing.endpoint=http://tempo:4318/v1/traces, every HTTP request already becomes a span automatically (Spring MVC instrumentation). To create custom spans:
import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;

// Fragment of a Spring bean: the tracer is injected by Spring
@Autowired
Tracer tracer;

// Inside a method of that bean
Span span = tracer.nextSpan().name("processar-pagamento").start();
try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
    pagamentoService.processar(id); // work done while the span is in scope
} finally {
    span.end(); // always close the span, even on exceptions
}
trace_id shows up in the Spring Boot logs automatically once you add it to the Logback pattern:
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level [%X{traceId:-},%X{spanId:-}] %logger - %msg%n</pattern>
In the JSON appender (recommended for Loki), traceId becomes a top-level field, which Alloy extracts as structured_metadata (section 6.4).
8. Installing Grafana
volumes:
  grafana_data:
services:
  grafana:
    image: grafana/grafana:11.4.0
    container_name: grafana
    restart: unless-stopped
    user: "472:472"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
      GF_SECURITY_DISABLE_GRAVATAR: "true"
      GF_SECURITY_COOKIE_SECURE: "true"
      GF_SECURITY_COOKIE_SAMESITE: "lax"
      GF_SERVER_DOMAIN: grafana.example.com
      GF_SERVER_ROOT_URL: "https://grafana.example.com/"
      GF_SERVER_ENFORCE_DOMAIN: "true"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_USERS_AUTO_ASSIGN_ORG_ROLE: Viewer
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel,grafana-polystat-panel"
      GF_FEATURE_TOGGLES_ENABLE: "traceqlEditor traceToMetrics"
      # SMTP (e-mail alerting)
      GF_SMTP_ENABLED: "true"
      GF_SMTP_HOST: "smtp.gmail.com:587"
      GF_SMTP_USER: ${SMTP_USER}
      GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: "grafana@example.com"
      GF_SMTP_FROM_NAME: "Grafana"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    expose:
      - "3000"
    networks:
      - observability
      - proxy
    depends_on:
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy
.env (in the same directory):
GRAFANA_ADMIN_PASSWORD=CHANGE_ME
PG_PROM_PASSWORD=CHANGE_ME
SMTP_USER=CHANGE_ME
SMTP_PASSWORD=CHANGE_ME
MINIO_ACCESS_KEY=CHANGE_ME
MINIO_SECRET_KEY=CHANGE_ME
9. nginx reverse proxy
/opt/docker/nginx/conf.d/grafana.conf:
upstream grafana_upstream {
    server grafana:3000;
}
server {
    listen 443 ssl http2;
    server_name grafana.example.com;
    ssl_certificate /etc/letsencrypt/live/grafana.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.example.com/privkey.pem;
    # Trusted-network allowlist: Grafana stays off the public internet
    allow 10.8.0.0/24; # internal VPN network
    allow 203.0.113.10; # the server itself
    deny all;
    client_max_body_size 50m;
    location / {
        proxy_pass http://grafana_upstream;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    # Grafana Live (WebSocket): needed for real-time dashboards and the unified alerting UI
    location /api/live/ {
        proxy_pass http://grafana_upstream;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;
    }
}
server {
    listen 80;
    server_name grafana.example.com;
    return 301 https://$host$request_uri;
}
Repeat the pattern (with a dedicated server_name) for prometheus.example.com and alertmanager.example.com if you want to expose the internal UIs, all behind the same trusted-network allowlist.
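The allow/deny lines above are evaluated in order and the first matching rule decides. A small Python model of that evaluation, using the same three rules, can help when reasoning about which client addresses get through. The function name `is_allowed` and the test addresses are illustrative only.

```python
import ipaddress

# Same order as in grafana.conf: first match wins.
RULES = [
    ("allow", "10.8.0.0/24"),
    ("allow", "203.0.113.10"),
    ("deny", "all"),
]


def is_allowed(client_ip, rules=RULES):
    """Mimic nginx access rules: walk the list, stop at the first match."""
    address = ipaddress.ip_address(client_ip)
    for action, target in rules:
        if target == "all" or address in ipaddress.ip_network(target, strict=False):
            return action == "allow"
    return True  # with no matching rule, nginx lets the request through


if __name__ == "__main__":
    for ip in ("10.8.0.7", "203.0.113.10", "198.51.100.1"):
        print(ip, is_allowed(ip))
```

Note that behind another proxy or load balancer the address nginx sees may be the proxy's, so the allowlist only works as intended when nginx receives the real client address.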
10. Auth / SSO with Keycloak
10.1 In Keycloak
- Realm corp → Clients → Create → grafana (Client type: OpenID Connect).
- Client authentication: ON. Standard flow: ON. Direct access grants: OFF.
- Valid redirect URIs: https://grafana.example.com/login/generic_oauth
- Web origins: https://grafana.example.com
- Credentials → copy the client secret.
- Client scopes → add roles to the default scopes. In the client's “Mappers” tab → “Add predefined mapper” → choose realm roles (Token Claim Name: realm_access.roles, Add to userinfo: ON).
10.2 In Grafana (env vars)
environment:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_NAME: "Keycloak"
  GF_AUTH_GENERIC_OAUTH_ALLOW_SIGN_UP: "true"
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: grafana
  GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: ${KEYCLOAK_GRAFANA_SECRET}
  GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email offline_access roles"
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/auth"
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/token"
  GF_AUTH_GENERIC_OAUTH_API_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/userinfo"
  GF_AUTH_GENERIC_OAUTH_LOGIN_ATTRIBUTE_PATH: preferred_username
  GF_AUTH_GENERIC_OAUTH_NAME_ATTRIBUTE_PATH: name
  GF_AUTH_GENERIC_OAUTH_EMAIL_ATTRIBUTE_PATH: email
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(realm_access.roles[*], 'grafana-admin') && 'Admin' || contains(realm_access.roles[*], 'grafana-editor') && 'Editor' || 'Viewer'"
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_STRICT: "true"
  GF_AUTH_GENERIC_OAUTH_ALLOW_ASSIGN_GRAFANA_ADMIN: "true"
  GF_AUTH_OAUTH_AUTO_LOGIN: "false"
  GF_AUTH_SIGNOUT_REDIRECT_URL: "https://keycloak.example.com/realms/corp/protocol/openid-connect/logout?post_logout_redirect_uri=https%3A%2F%2Fgrafana.example.com"
ROLE_ATTRIBUTE_PATH is a JMESPath expression: it reads the realm_access.roles claim (an array) and returns Admin / Editor / Viewer depending on which role the user holds in the realm. Create the three roles (grafana-admin, grafana-editor, grafana-viewer) in the realm and assign them to users. Note that the expression falls back to 'Viewer', so any user who can authenticate gets at least Viewer even without the grafana-viewer role.
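The role mapping is easier to check outside Grafana. The following sketch reproduces the decision of the JMESPath expression above in plain Python, assuming the token or userinfo payload carries realm_access.roles as configured in the Keycloak mapper. The function name `grafana_role` and the sample claims are illustrative.

```python
def grafana_role(claims):
    """Map Keycloak realm roles to a Grafana role, first match wins."""
    roles = claims.get("realm_access", {}).get("roles", [])
    if "grafana-admin" in roles:
        return "Admin"
    if "grafana-editor" in roles:
        return "Editor"
    # Same fallback as the trailing || 'Viewer' in the expression.
    return "Viewer"


if __name__ == "__main__":
    print(grafana_role({"realm_access": {"roles": ["grafana-editor", "offline_access"]}}))
    print(grafana_role({"realm_access": {"roles": ["grafana-admin", "grafana-editor"]}}))
    print(grafana_role({"preferred_username": "guest"}))
```

If you want users without any grafana-* role to be rejected instead, the fallback has to be removed from the expression so that ROLE_ATTRIBUTE_STRICT has something to deny.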
11. Datasources (provisioning)
grafana/provisioning/datasources/datasources.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: 15s
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      maxLines: 1000
      derivedFields:
        - name: trace_id
          # [Ii] also matches the camelCase traceId field written by the JSON appender
          matcherRegex: "trace[_-]?[Ii]d[\"=:]+\\s*\"?([a-f0-9]+)"
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: -5m
        spanEndTimeShift: 5m
        tags: [{ key: "service.name", value: "service" }]
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{ key: "service.name", value: "service" }]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki
  - name: Postgres (app)
    uid: pg-app
    type: postgres
    access: proxy
    url: postgres:5432
    user: grafana_reader
    database: app
    secureJsonData:
      password: ${PG_GRAFANA_PASSWORD}
    jsonData:
      sslmode: disable
      postgresVersion: 1700
      timescaledb: false
Stable UIDs (prometheus, loki, tempo) are critical: versioned dashboards reference data sources by UID. If you let Grafana generate random UIDs, dashboards break on every redeploy.
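One way to catch UID drift before a deploy is to scan the dashboard JSON for data source references that are not among the pinned UIDs. The sketch below walks a parsed dashboard recursively; the helper name `unknown_datasource_uids` and the sample dashboard are hypothetical, and it assumes the common `"datasource": {"uid": ...}` object form used by recent Grafana exports.

```python
import json

# UIDs pinned in datasources.yml.
KNOWN_UIDS = {"prometheus", "loki", "tempo", "pg-app"}


def unknown_datasource_uids(node, known=KNOWN_UIDS):
    """Return every datasource uid in the dashboard that is not pinned."""
    found = set()
    if isinstance(node, dict):
        datasource = node.get("datasource")
        if isinstance(datasource, dict):
            uid = datasource.get("uid")
            # Built-ins such as "-- Grafana --" are not provisioned data sources.
            if isinstance(uid, str) and uid not in known and not uid.startswith("-- "):
                found.add(uid)
        for value in node.values():
            found |= unknown_datasource_uids(value, known)
    elif isinstance(node, list):
        for item in node:
            found |= unknown_datasource_uids(item, known)
    return found


if __name__ == "__main__":
    dashboard = json.loads("""
    {"panels": [
      {"datasource": {"type": "prometheus", "uid": "prometheus"}},
      {"datasource": {"uid": "${DS_PROMETHEUS}"},
       "targets": [{"datasource": {"uid": "P1234ABCD"}}]}
    ]}
    """)
    print(sorted(unknown_datasource_uids(dashboard)))
```

Dashboards that use a data source template variable (for example `${datasource}`) will also be reported, which is usually fine to review by hand.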
12. Dashboards
12.1 Provisioning
grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: "" # must stay empty when foldersFromFilesStructure is true
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards/json
      foldersFromFilesStructure: true
Create infra/, apps/ and databases/ subfolders under grafana/provisioning/dashboards/json/ and drop the JSON files there.
12.2 Ready-made dashboards (grafana.com)
| ID | Name | Source |
|---|---|---|
| 1860 | Node Exporter Full | node-exporter |
| 14282 | cAdvisor Exporter | cadvisor |
| 9628 | PostgreSQL Database | postgres-exporter |
| 4701 | JVM (Micrometer) | spring-boot |
| 12708 | Nginx exporter | nginx-exporter |
| 7587 | Blackbox Exporter | blackbox |
| 13639 | Logs / Loki (App) | loki |
| 3662 | Prometheus 2.0 Stats | self |
Download and provision:
for id in 1860 14282 9628 4701 12708 7587 13639 3662; do
  curl -fsSL "https://grafana.com/api/dashboards/${id}/revisions/latest/download" \
    -o "/opt/docker/observability/grafana/provisioning/dashboards/json/infra/${id}.json"
done
Edit each JSON file and replace ${DS_PROMETHEUS} with prometheus (the UID you pinned), or adjust it through input mapping in the provisioning.
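Editing eight JSON files by hand is error-prone, so the replacement can be scripted. This is a minimal sketch that substitutes the placeholder in the raw text and re-parses the result to make sure it is still valid JSON. The function name `pin_datasource` is made up, and dashboards that use a different placeholder name (other than DS_PROMETHEUS) would need the `placeholder` argument adjusted.

```python
import json


def pin_datasource(dashboard_json, placeholder="${DS_PROMETHEUS}", uid="prometheus"):
    """Replace the import placeholder with the pinned datasource UID."""
    replaced = dashboard_json.replace(placeholder, uid)
    # Parsing validates that the substitution did not break the document.
    dashboard = json.loads(replaced)
    return json.dumps(dashboard, indent=2)


if __name__ == "__main__":
    source = '{"panels": [{"datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}}]}'
    print(pin_datasource(source))
```

To apply it to the downloaded files, read each `*.json` under the provisioning directory, pass the text through `pin_datasource` and write it back.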
12.3 Custom panel: 5xx rate per service (PromQL)
sum by (service) (
  rate(http_server_requests_seconds_count{status=~"5..", service=~"api-.*"}[5m])
)
/
sum by (service) (
  rate(http_server_requests_seconds_count{service=~"api-.*"}[5m])
)
12.4 Custom panel: top 10 ERROR logs by logger (LogQL)
topk(10,
  sum by (logger) (
    count_over_time({service=~"api-.*", level="ERROR"}[15m])
  )
)
12.5 Custom panel: p95 latency per endpoint (PromQL)
histogram_quantile(0.95,
  sum by (le, uri, service) (
    rate(http_server_requests_seconds_bucket{service="api-backend"}[5m])
  )
)
12.6 Web Vitals (frontends)
Use @grafana/faro-web-sdk in main.ts:
import { getWebInstrumentations, initializeFaro } from '@grafana/faro-web-sdk';
initializeFaro({
  url: 'https://faro.example.com/collect',
  app: { name: 'web-spa', version: __APP_VERSION__ },
  // Default web instrumentations: web vitals, errors, console
  instrumentations: [...getWebInstrumentations()],
});
Faro sends to Alloy over HTTP, which forwards metrics (LCP, CLS, INP) to Prometheus via remote_write and logs to Loki. Ready-made dashboard: ID 19004 (Faro Web SDK).
13. Unified alerting
13.1 Contact points
In Alerting → Contact points → New contact point:
- Type: Discord
  - Webhook URL: https://discord.com/api/webhooks/XXX/YYY (create it in Server Settings → Integrations → Webhooks)
  - Message: {{ template "discord.message" . }}
- Type: Email
  - Addresses: oncall@example.com
Or provision via YAML in grafana/provisioning/alerting/contact-points.yml:
apiVersion: 1
contactPoints:
  - orgId: 1
    name: discord-default
    receivers:
      - uid: discord-default
        type: discord
        settings:
          url: ${DISCORD_WEBHOOK_URL}
          use_discord_username: true
          title: "[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}"
          message: |
            **{{ .CommonAnnotations.summary }}**
            {{ range .Alerts }}
            - severity: {{ .Labels.severity }}
            - service: {{ .Labels.service }}
            - description: {{ .Annotations.description }}
            {{ end }}
13.2 Notification policy
grafana/provisioning/alerting/policies.yml:
apiVersion: 1
policies:
  - orgId: 1
    receiver: discord-default
    group_by: [alertname, service]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
    routes:
      - receiver: email-oncall
        matchers:
          - severity = critical
        continue: true
      - receiver: discord-default
        matchers:
          - severity =~ "warning|info"
13.3 Alert rule (UI or YAML)
Example: “backend p95 latency > 1s for 5 min”, through the UI:
- Alerting → Alert rules → New rule
- Grafana managed alert
- Query A (Prometheus): histogram_quantile(0.95, sum by (le,service) (rate(http_server_requests_seconds_bucket{service="api-backend"}[5m])))
- Expression B (Reduce → Last) → Expression C (Threshold > 1)
- Folder: Backend SLO / Group: latency / Eval interval: 1m / Pending period: 5m
- Labels: severity=warning, team=backend
- Annotations: summary="p95 backend > 1s", runbook_url="https://wiki.example.com/runbooks/p95"
Equivalent YAML in grafana/provisioning/alerting/rules.yml:
apiVersion: 1
groups:
  - orgId: 1
    name: backend-latency
    folder: Backend SLO
    interval: 1m
    rules:
      - uid: backend-p95-latency
        title: Backend p95 latency > 1s
        condition: C
        for: 5m
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange: { from: 300, to: 0 }
            model:
              expr: |
                histogram_quantile(0.95,
                  sum by (le,service) (rate(http_server_requests_seconds_bucket{service="api-backend"}[5m]))
                )
          - refId: B
            datasourceUid: __expr__
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [1] }
        labels:
          severity: warning
          team: backend
        annotations:
          summary: backend p95 above 1s
13.4 Log alerts (Loki ruler)
The Loki ruler fires through the same Alertmanager:
# loki/rules/fake/errors.yml (with auth_enabled: false the tenant directory is "fake")
groups:
  - name: app-errors
    rules:
      - alert: TooManyErrorLogs
        expr: |
          sum by (service) (rate({level="ERROR"}[5m])) > 1
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.service }} logging more than 1 ERROR/s"
14. Multi-tenant / Organizations
Grafana has two mechanisms:
| Mechanism | When to use | Limitation |
|---|---|---|
| Organizations | Fully isolated tenants (different customers). Each org has its own data sources, dashboards and users. | Dashboards cannot be shared across orgs; an admin has to switch orgs to edit. |
| Folders + Teams + RBAC | Same team, different areas (backend, infra, frontend). Users only see what matters to them. | All teams still share a single pool of data sources. |
For a solo environment: stay in a single org and use folders (Infra, Apps, Databases, Frontends). Create teams if you ever add more people as Viewers. Granular RBAC (dashboards.write on folder X) is an Enterprise feature; in OSS, the role is a global Admin/Editor/Viewer plus per-folder permissions.
Provisioning folders + permissions:
# grafana/provisioning/access-control/roles.yml (requires Enterprise)
# In OSS, configure through the UI: Folder → Permissions
15. Backup
15.1 Grafana volume (default SQLite)
# Snapshot while Grafana is stopped (consistent)
docker compose stop grafana
tar czf /backup/grafana-$(date +%F).tar.gz \
  -C /var/lib/docker/volumes/observability_grafana_data/_data .
docker compose start grafana
15.2 Migrating to Postgres (recommended in production)
environment:
  GF_DATABASE_TYPE: postgres
  GF_DATABASE_HOST: postgres:5432
  GF_DATABASE_NAME: grafana
  GF_DATABASE_USER: grafana
  GF_DATABASE_PASSWORD: ${GF_DB_PASSWORD}
  GF_DATABASE_SSL_MODE: disable
Before switching, export everything (next step) and re-import it. SQLite is mostly at risk of corruption when the volume goes through a crash; with Postgres you gain replication and pg_dump backups.
15.3 Exporting dashboards via the API
GRAFANA_URL=https://grafana.example.com
# Use a Service Account token; GRAFANA_TOKEN is a placeholder variable name
TOKEN="${GRAFANA_TOKEN:?set a Grafana Service Account token}"
mkdir -p backup/dashboards
curl -s -H "Authorization: Bearer $TOKEN" \
  "$GRAFANA_URL/api/search?type=dash-db" \
  | jq -r '.[].uid' \
  | while read -r uid; do
      curl -s -H "Authorization: Bearer $TOKEN" \
        "$GRAFANA_URL/api/dashboards/uid/$uid" \
        | jq '.dashboard' > "backup/dashboards/${uid}.json"
    done
Run it as a daily cron job and push to a dedicated backup Git repository.
15.4 Prometheus backup
Use promtool tsdb dump or a snapshot (the snapshot endpoint requires --web.enable-admin-api, already set in section 3.2):
curl -X POST https://prometheus.example.com/api/v1/admin/tsdb/snapshot
# the snapshot appears under /prometheus/snapshots/ inside the container
15.5 Loki/Tempo
For filesystem storage: tar the volume periodically. For S3/MinIO: MinIO itself can handle bucket versioning + lifecycle rules.
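For the filesystem case, the periodic tar can also be scripted in Python, which makes the dated file name and the target directory explicit. This is a minimal sketch: the function name `backup_volume` and the paths in the usage example are assumptions, and for a consistent archive the container should be stopped (or at least idle) while it runs, as in section 15.1.

```python
import tarfile
from datetime import date
from pathlib import Path


def backup_volume(source_dir, backup_dir, name, today=None):
    """Write <backup_dir>/<name>-YYYY-MM-DD.tar.gz with the contents of source_dir."""
    today = today or date.today()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    archive = backup_dir / f"{name}-{today.isoformat()}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname="." stores paths relative to the volume root, like tar -C <dir> .
        tar.add(str(source_dir), arcname=".")
    return archive


if __name__ == "__main__":
    # Hypothetical volume path, following the naming used in section 15.1.
    volume = "/var/lib/docker/volumes/observability_loki_data/_data"
    if Path(volume).is_dir():
        print(backup_volume(volume, "/backup", "loki"))
```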
16. Troubleshooting
Datasource “Failed to connect”
- Check that Grafana and the backend are on the same Docker network (observability). docker network inspect observability_observability should list both containers.
- Use the container name as the host (http://prometheus:9090), not localhost.
- Test from inside the Grafana container: docker exec grafana wget -O- http://prometheus:9090/-/healthy
Dashboard shows “No data”
- Compare the labels: Prometheus → Status → Targets shows the actual service, instance and job values. If the dashboard filters on service=~"api-.*" but the exporter sends application=api-..., change the label.
- Check the time window (top right): in this setup Loki keeps only 30 days (retention_period: 720h).
- curl 'prometheus:9090/api/v1/query?query=up' should return a result.
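When comparing labels, it helps to remember where `service` comes from: the relabel rule in prometheus.yml (section 3.3) derives it from `__address__`. The sketch below models a single Prometheus `replace` relabel step in Python, assuming the default action and that the regex must match the whole joined value, as Prometheus anchors relabel expressions. The function name `apply_relabel` is made up and only `$1`-style references are handled.

```python
import re


def apply_relabel(labels, source_labels, target_label,
                  regex="(.*)", replacement="$1", separator=";"):
    """Model one relabel step with action: replace."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # anchored at both ends
    if match is None:
        return dict(labels)  # no match: labels stay untouched
    # Translate $1-style references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


if __name__ == "__main__":
    # spring-boot job: service is the host part of the target address.
    print(apply_relabel({"__address__": "api-backend:8080"},
                        ["__address__"], "service", regex="([^:]+):.*"))

    # blackbox-http job: the probed URL moves to ?target= and instance,
    # and the scrape itself is redirected to the exporter.
    labels = {"__address__": "https://app.example.com"}
    labels = apply_relabel(labels, ["__address__"], "__param_target")
    labels = apply_relabel(labels, ["__param_target"], "instance")
    labels = apply_relabel(labels, [], "__address__", replacement="blackbox-exporter:9115")
    print(labels)
```

So a dashboard filtering on service="api-backend" depends on the target being declared as api-backend:8080; a target declared by IP address would produce a different service value.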
Loki “too many outstanding requests”
limits_config:
  max_query_parallelism: 32
  split_queries_by_interval: 15m
  max_concurrent_tail_requests: 10
query_scheduler:
  max_outstanding_requests_per_tenant: 2048
Or narrow the query time range in Grafana.
Tempo “MIGRATE schema” on boot
This follows a version upgrade (e.g. 2.4 → 2.6). Let the container finish the migration reported in the first log lines:
level=info msg="migrating schema from v2 to v3"
If it hangs, downgrade the image to the immediately preceding version, let it drain, then bring the new version up again.
Grafana behind a proxy: OAuth callback redirects to http://
- GF_SERVER_ROOT_URL must be https://... (with the trailing slash).
- In nginx, make sure proxy_set_header X-Forwarded-Proto $scheme; is set.
- Add GF_SERVER_ENFORCE_DOMAIN=true to avoid a domain mismatch.
Keycloak: “Invalid redirect URI”
The registered URI must be literally https://grafana.example.com/login/generic_oauth (with /login/generic_oauth, not just /). Wildcards (/*) also work but are less secure.
Alertmanager does not fire to Discord
- Test the webhook directly: curl -X POST -H "Content-Type: application/json" -d '{"content":"test"}' "$DISCORD_WEBHOOK"
- In Alerting → Contact points, click Test; Grafana sends a dummy alert.
- Check the route matchers: severity=critical in a matcher only catches alerts carrying exactly that label value.
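To check which receiver an alert would reach, the routing tree from section 13.2 can be modelled in a few lines. This is a simplified sketch of Alertmanager-style routing (one level of child routes, regex matchers anchored at both ends, `continue` deciding whether later siblings are tried); the names `matches`, `receivers_for` and `POLICY` are made up, and grouping and timing are ignored.

```python
import re

# Mirror of the notification policy in section 13.2.
POLICY = {
    "receiver": "discord-default",
    "routes": [
        {"receiver": "email-oncall",
         "matchers": [("severity", "=", "critical")],
         "continue": True},
        {"receiver": "discord-default",
         "matchers": [("severity", "=~", "warning|info")]},
    ],
}


def matches(matcher, labels):
    """Evaluate one (label, operator, value) matcher against alert labels."""
    name, operator, value = matcher
    actual = labels.get(name, "")
    if operator == "=":
        return actual == value
    if operator == "!=":
        return actual != value
    if operator == "=~":
        return re.fullmatch(value, actual) is not None
    if operator == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator {operator!r}")


def receivers_for(labels, policy=POLICY):
    """Return the receivers notified for an alert with these labels."""
    hits = []
    for route in policy.get("routes", []):
        if all(matches(m, labels) for m in route["matchers"]):
            hits.append(route["receiver"])
            if not route.get("continue", False):
                break
    # The root receiver is only used when no child route matched.
    return hits or [policy["receiver"]]


if __name__ == "__main__":
    for labels in ({"severity": "critical"}, {"severity": "warning"}, {}):
        print(labels, "->", receivers_for(labels))
```

Under this model a critical alert reaches only email-oncall: `continue: true` lets evaluation move on to the second route, but that route only matches warning or info. If critical alerts should also land in Discord, the second matcher would need to include critical.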
Spring Boot: /actuator/prometheus returns 404
- You forgot micrometer-registry-prometheus.
- Or the endpoint is not exposed: management.endpoints.web.exposure.include=prometheus.
- Spring Boot 3.4+: the endpoint is controlled by management.endpoint.prometheus.access=read_only (previously enabled=true).
cAdvisor using 100% CPU
Filter what it monitors:
command:
  - "--housekeeping_interval=30s"
  - "--docker_only=true"
  - "--disable_metrics=disk,diskIO,network,tcp,udp,percpu,sched,process"
17. References
- Grafana docs: https://grafana.com/docs/grafana/latest/
- Grafana Docker install: https://grafana.com/docs/grafana/latest/setup-grafana/installation/docker/
- Prometheus: https://prometheus.io/docs/prometheus/latest/getting_started/
- Prometheus config: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
- Loki: https://grafana.com/docs/loki/latest/
- Loki monolithic: https://grafana.com/docs/loki/latest/setup/install/docker/
- Tempo: https://grafana.com/docs/tempo/latest/
- Tempo getting started: https://grafana.com/docs/tempo/latest/getting-started/docker-example/
- Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager/
- Generic OAuth (Keycloak): https://grafana.com/docs/grafana/latest/setup-grafana/configure-security/configure-authentication/generic-oauth/
- Provisioning: https://grafana.com/docs/grafana/latest/administration/provisioning/
- Alerting provisioning: https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/file-provisioning/
- Spring Boot Micrometer: https://docs.spring.io/spring-boot/reference/actuator/metrics.html
- Spring Boot tracing: https://docs.spring.io/spring-boot/reference/actuator/tracing.html
- Grafana Alloy: https://grafana.com/docs/alloy/latest/
- Faro Web SDK: https://grafana.com/docs/grafana-cloud/monitor-applications/frontend-observability/faro-web-sdk/
- Public dashboards: https://grafana.com/grafana/dashboards/