Version: 2.x (Latest)

Metrics & Monitoring

Authorizer exposes Prometheus-compatible metrics for monitoring authentication activity, API performance, security events, and infrastructure health.

Endpoints

/healthz - Liveness Probe

Returns storage health status. Use with Kubernetes liveness probes.

curl http://localhost:8080/healthz
# {"status":"ok"}

Returns 503 with {"status":"unhealthy","error":"storage unavailable"} when the database is unreachable (details are logged server-side only).

/readyz - Readiness Probe

Checks storage and memory store health. Use with Kubernetes readiness probes.

curl http://localhost:8080/readyz
# {"status":"ready"}

Returns 503 with {"status":"not ready","error":"storage unavailable"} when the system is not ready to serve traffic (details are logged server-side only).
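Outside Kubernetes, the same two endpoints can be polled from a script. The following is a minimal sketch, assuming only the response shapes documented above; the helper names `interpret_probe` and `probe` are hypothetical and are not part of Authorizer.

```python
import json
import urllib.error
import urllib.request


def interpret_probe(status_code: int, body: str) -> tuple[bool, str]:
    """Return (healthy, detail) for a /healthz or /readyz response."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    # Healthy only when the status code is 200 AND the body reports ok/ready.
    healthy = status_code == 200 and payload.get("status") in ("ok", "ready")
    detail = payload.get("error") or payload.get("status") or f"HTTP {status_code}"
    return healthy, str(detail)


def probe(url: str, timeout: float = 2.0) -> tuple[bool, str]:
    """Fetch a probe URL and classify the result; never raises on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_probe(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises on 503, but the JSON body is still readable.
        return interpret_probe(err.code, err.read().decode("utf-8"))
    except (urllib.error.URLError, OSError) as err:
        return False, f"unreachable: {err}"
```

For example, `probe("http://localhost:8080/readyz")` would return `(False, "storage unavailable")` for the 503 response shown above.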

/metrics - Prometheus Metrics

Serves all metrics in Prometheus exposition format.

/metrics is never served by the main HTTP server. A separate, minimal HTTP server runs in parallel with the main Gin app and serves /metrics on its own port. By default --http-port is 8080 and --metrics-port is 8081; the two must differ, or the process exits at startup.

The metrics listener is not reachable from other machines by default: --metrics-host defaults to 127.0.0.1, so only loopback can scrape unless you change it (see below).

curl http://127.0.0.1:8081/metrics
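The response is plain text in the Prometheus exposition format: one sample per line, with optional labels in braces. As a rough illustration of how to read it, here is a small parser sketch. It covers the common line shapes only (it is not a full implementation of the format), and `parse_metrics` is a hypothetical helper name.

```python
import re

# name{label="value",...} value [timestamp]
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# label="value", allowing backslash-escaped characters inside the value
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Parse exposition text into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

Feeding it the body returned by the curl command above yields one tuple per time series, which is convenient for ad hoc checks without a running Prometheus server.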

Bind address and security

  • Single host / node-exporter style: Keep defaults (127.0.0.1 + --metrics-port). Run Prometheus (or an agent) on the same host and scrape 127.0.0.1:8081, or use a reverse proxy that forwards from an internal network to that socket.
  • Docker / Kubernetes / another machine scrapes the pod: Set --metrics-host=0.0.0.0 (or the pod IP interface you use) so the metrics port accepts connections on the container network. Do not put the metrics port on a public ingress or internet-facing load balancer; use a ClusterIP Service (or internal Docker network) and scrape from inside the cluster only.

For Docker EXPOSE versus -p and Compose ports:, and for Kubernetes pod versus Service exposure, see Docker deployment and Kubernetes deployment.

Available Metrics

HTTP Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_http_requests_total | Counter | method, path, status | Total HTTP requests received |
| authorizer_http_request_duration_seconds | Histogram | method, path | HTTP request latency in seconds |

For routes that do not match a registered Gin pattern, path is recorded as unmatched (not the raw URL), to keep Prometheus cardinality bounded.
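The effect on cardinality can be sketched as follows. This is an illustration only: it assumes the router hands back an empty string when no route pattern matched (as Gin's `Context.FullPath()` does), and `path_label` and `count_by_path` are hypothetical names, not Authorizer functions.

```python
from collections import Counter


def path_label(matched_pattern: str) -> str:
    """Use the registered route pattern as the label, or "unmatched" if none."""
    return matched_pattern or "unmatched"


def count_by_path(requests) -> Counter:
    """Count requests per path label.

    `requests` is an iterable of (raw_url, matched_pattern) pairs; the raw URL
    is deliberately ignored so that arbitrary client input never becomes a label.
    """
    counts = Counter()
    for _raw_url, matched_pattern in requests:
        counts[path_label(matched_pattern)] += 1
    return counts
```

A scanner requesting a thousand distinct unknown URLs adds to a single `unmatched` series rather than creating a thousand new ones.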

Authentication Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_auth_events_total | Counter | event, status | Authentication event count |
| authorizer_active_sessions | Gauge | (none) | Approximate active session count |
| authorizer_api_operations_total | Counter | protocol, operation, status | API operations served, attributed to the protocol they came in on |

The protocol label on authorizer_api_operations_total is one of graphql, grpc, or rest (the rest value covers calls made through the grpc-gateway /v1/* surface). operation is the gRPC method / GraphQL operation name and status is ok or error. The same protocol is also recorded on each audit-log entry (under the protocol key of the entry's metadata field), so operations are attributable to their transport in both metrics and the audit trail.

Auth event values:

| Event | Description |
| --- | --- |
| login | User email/password login |
| signup | New user registration |
| logout | User session termination |
| forgot_password | Password reset request |
| reset_password | Password reset completion |
| verify_email | Email verification |
| verify_otp | OTP verification |
| magic_link_login | Magic link authentication |
| admin_login | Admin dashboard login |
| admin_logout | Admin dashboard logout |
| oauth_login | OAuth provider redirect |
| oauth_callback | OAuth provider callback |
| token_refresh | Token refresh |
| token_revoke | Token revocation |

Status values: success, failure

Security Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_security_events_total | Counter | event, reason | Security-sensitive events for alerting |
| authorizer_client_id_header_missing_total | Counter | (none) | Requests with no X-Authorizer-Client-ID header (allowed for some routes) |

Security event examples:

| Event | Reason | Trigger |
| --- | --- | --- |
| invalid_credentials | bad_password | Failed password comparison |
| invalid_credentials | user_not_found | Login with non-existent email |
| account_revoked | login_attempt | Login to a revoked account |
| invalid_admin_secret | admin_login | Failed admin authentication |

GraphQL Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_graphql_errors_total | Counter | operation | GraphQL responses containing errors (HTTP 200 with errors) |
| authorizer_graphql_request_duration_seconds | Histogram | operation | GraphQL operation latency |
| authorizer_graphql_limit_rejections_total | Counter | limit | Operations rejected for exceeding a configured query limit |

The operation label is anonymous for unnamed operations. For named operations it is op_ followed by a short SHA-256 prefix of the operation name, so that client-controlled names cannot create unbounded time series.
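To correlate a known operation name with its label value, the derivation can be reproduced offline. The sketch below is an assumption-laden illustration: the text above says only "a short SHA-256 prefix", so the 8-character length and the hex encoding used here are guesses, and `operation_label` is a hypothetical helper. Compare its output against a real `/metrics` scrape before relying on it.

```python
import hashlib
from typing import Optional


def operation_label(operation_name: Optional[str], prefix_len: int = 8) -> str:
    """Map a GraphQL operation name to a fixed-shape metric label.

    ASSUMPTION: hex digest truncated to `prefix_len` characters; the real
    prefix length and encoding are not stated in this document.
    """
    if not operation_name:
        return "anonymous"
    digest = hashlib.sha256(operation_name.encode("utf-8")).hexdigest()
    return "op_" + digest[:prefix_len]
```

Because the label is a digest rather than the raw name, its length and character set are fixed regardless of what the client sends.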

GraphQL APIs return HTTP 200 even when the response contains errors. These metrics capture those application-level errors that would otherwise be invisible to HTTP-level monitoring.

The limit label on authorizer_graphql_limit_rejections_total is one of:

| Value | What was exceeded | Tunable via |
| --- | --- | --- |
| depth | Selection-set nesting depth | --graphql-max-depth |
| complexity | Total complexity score | --graphql-max-complexity |
| alias | Total aliased fields per operation | --graphql-max-aliases |
| body_size | HTTP request body size | --graphql-max-body-bytes |

A sustained non-zero rate on any label usually means either an exploration attempt or a legitimate operation that needs the limit raised. Alert at the rate that distinguishes the two for your traffic profile. See GraphQL hardening for the limits themselves.

Authorization (FGA) Metrics

These metrics cover the embedded fine-grained authorization (OpenFGA) engine. They appear only once FGA is enabled and the corresponding operation has run at least once. See Authorization (FGA).

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_fga_checks_total | Counter | operation, result | Access decisions from check_permissions. The headline metric for adoption and denial/error alerting. |
| authorizer_fga_check_duration_seconds | Histogram | operation | Latency of the client-facing FGA engine reads. |
| authorizer_fga_operations_total | Counter | operation, result | Non-decision FGA operations (model/tuple management, enumeration, reset) by outcome. |

authorizer_fga_checks_total labels:

| Label | Values |
| --- | --- |
| operation | check_permissions (each supplied pair is counted individually) |
| result | allowed · denied · error (the engine call failed; checks fail closed, so the caller was denied) |

authorizer_fga_check_duration_seconds operation: check_permissions · list_permissions. The histogram's _count also gives you a call rate per operation for free.

authorizer_fga_operations_total operation: get_model · write_model · read_tuples · write_tuples · delete_tuples · list_users · expand · list_permissions · reset. result: success · error.

Useful queries:

# FGA denial rate (last 5 minutes)
sum(rate(authorizer_fga_checks_total{result="denied"}[5m]))

# FGA check error rate — should be ~0; a spike means the engine/store is failing closed
sum(rate(authorizer_fga_checks_total{result="error"}[5m]))

# Admin authorization changes (model/tuple writes, resets)
sum by (operation) (increase(authorizer_fga_operations_total{operation=~"write_model|write_tuples|delete_tuples|reset"}[1h]))

# p99 check latency
histogram_quantile(0.99, sum by (le, operation) (rate(authorizer_fga_check_duration_seconds_bucket[5m])))

A non-zero result="error" rate on authorizer_fga_checks_total is an operational signal — the engine or its datastore is failing, and checks are denying as a result. Page on it.
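The fail-closed behaviour behind the three result values can be pictured with a short sketch. It is a conceptual model only, not Authorizer's code: `decide` and the `engine_check` callable are hypothetical stand-ins for the real engine call.

```python
from typing import Callable


def decide(engine_check: Callable[[], bool]) -> tuple[bool, str]:
    """Run one permission check and return (access_granted, result_label).

    Any engine failure is treated as a denial (fail closed) but is labelled
    "error" so it can be told apart from a genuine "denied" decision.
    """
    try:
        allowed = bool(engine_check())
    except Exception:
        # The caller is denied, yet the metric records an engine failure.
        return False, "error"
    if allowed:
        return True, "allowed"
    return False, "denied"
```

From the caller's side, `denied` and `error` look identical, which is why the error rate deserves its own alert.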

Infrastructure Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| authorizer_db_health_check_total | Counter | status | Database health check outcomes (healthy/unhealthy) |

Prometheus Configuration

Add Authorizer as a scrape target in your prometheus.yml:

scrape_configs:
  - job_name: 'authorizer'
    scrape_interval: 15s
    static_configs:
      # In Docker/K8s, use --metrics-host=0.0.0.0 so the scraper can reach the pod/container; scrape via internal DNS/service.
      - targets: ['authorizer:8081'] # default --metrics-port; same host only: use 127.0.0.1:8081

For Kubernetes with service discovery:

scrape_configs:
  - job_name: 'authorizer'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: authorizer
        action: keep
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: '$1:8081' # metrics port; ensure deployment sets --metrics-host=0.0.0.0 for in-cluster scrape

Grafana Dashboard

Suggested Panels

Authentication Overview:

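# Login success rate (last 5 minutes)
# Aggregate both sides: without sum(), the division matches series by identical
# labels, so only the success series is divided by itself and the result is always 1.
sum(rate(authorizer_auth_events_total{event="login",status="success"}[5m]))
/ sum(rate(authorizer_auth_events_total{event="login"}[5m]))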

# Signup rate
rate(authorizer_auth_events_total{event="signup",status="success"}[5m])

# Active sessions
authorizer_active_sessions
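To sanity-check a login success panel by hand, the same ratio can be computed from two raw scrapes of authorizer_auth_events_total{event="login"}. The sketch below is illustrative: the function names are hypothetical, the input dictionaries map the status label (success, failure) to counter values, and the reset handling is a simplified version of what Prometheus does for counters.

```python
from typing import Mapping, Optional


def counter_increase(previous: float, current: float) -> float:
    """Increase of a counter between two scrapes, tolerating a restart.

    A counter only goes up; if it went down, the process restarted and the
    current value is everything counted since the reset.
    """
    if current < previous:
        return current
    return current - previous


def login_success_ratio(
    previous: Mapping[str, float], current: Mapping[str, float]
) -> Optional[float]:
    """Fraction of logins that succeeded between two scrapes, or None if idle."""
    success = counter_increase(previous.get("success", 0.0), current.get("success", 0.0))
    failure = counter_increase(previous.get("failure", 0.0), current.get("failure", 0.0))
    total = success + failure
    if total == 0:
        return None  # no logins in the window; a ratio would be undefined
    return success / total
```

The `None` case mirrors what a dashboard shows as "no data" when there were no logins in the window.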

Security Alerts:

# Failed login rate (alert if > 10/min); rate() is per second, so scale by 60
sum(rate(authorizer_security_events_total{event="invalid_credentials"}[1m])) * 60 > 10

# Failed admin login attempts
increase(authorizer_security_events_total{event="invalid_admin_secret"}[5m])

# Revoked account login attempts
increase(authorizer_security_events_total{event="account_revoked"}[5m])

API Performance:

# GraphQL p99 latency
histogram_quantile(0.99, rate(authorizer_graphql_request_duration_seconds_bucket[5m]))

# HTTP p95 latency by path
histogram_quantile(0.95, sum(rate(authorizer_http_request_duration_seconds_bucket[5m])) by (le, path))

# GraphQL error rate
rate(authorizer_graphql_errors_total[5m])

Infrastructure:

# Database health check failure rate
rate(authorizer_db_health_check_total{status="unhealthy"}[5m])

# Request rate by endpoint
sum(rate(authorizer_http_requests_total[5m])) by (path)

Alerting Rules

Example Prometheus alerting rules:

groups:
  - name: authorizer
    rules:
      - alert: HighLoginFailureRate
        # sum() combines the per-reason series into one total failure rate
        expr: sum(rate(authorizer_security_events_total{event="invalid_credentials"}[5m])) > 0.5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High login failure rate detected"
          description: "More than 0.5 failed logins/sec for 2 minutes; possible brute force."

      - alert: DatabaseUnhealthy
        # A counter never decreases, so alert on recent increases rather than on the raw total
        expr: increase(authorizer_db_health_check_total{status="unhealthy"}[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Authorizer database health check failing"

      - alert: HighGraphQLErrorRate
        expr: rate(authorizer_graphql_errors_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated GraphQL error rate"

Manual Testing

Start Authorizer and verify metrics are working:

# 1. Start Authorizer (dev mode with SQLite)
make dev

# 2. Check health endpoints
curl http://localhost:8080/healthz
curl http://localhost:8080/readyz

# 3. View raw metrics (default: loopback + metrics port)
curl http://127.0.0.1:8081/metrics

# 4. Generate some auth events via GraphQL
curl -X POST http://localhost:8080/graphql \
-H "Content-Type: application/json" \
-d '{"query":"mutation { login(params: {email: \"test@example.com\", password: \"wrong\"}) { message } }"}'

# 5. Check metrics again — look for auth and security counters
curl -s http://127.0.0.1:8081/metrics | grep authorizer_auth
curl -s http://127.0.0.1:8081/metrics | grep authorizer_security
curl -s http://127.0.0.1:8081/metrics | grep authorizer_graphql

# 6. Run integration tests
TEST_DBS="sqlite" go test -p 1 -v -run "TestMetrics|TestHealth|TestReady|TestAuthEvent|TestAdminLoginMetrics|TestGraphQLError" ./internal/integration_tests/
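Step 5 can also be scripted rather than checked by eye. The sketch below takes the text returned by the /metrics endpoint and reports which of the grepped prefixes have no samples yet; `missing_prefixes` is a hypothetical helper, and the default prefixes simply mirror the three grep patterns above.

```python
def missing_prefixes(
    exposition_text: str,
    prefixes=("authorizer_auth", "authorizer_security", "authorizer_graphql"),
) -> list[str]:
    """Return the prefixes for which no metric sample appears in the scrape."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments (# HELP / # TYPE) are not samples
        # The metric name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return [p for p in prefixes if not any(n.startswith(p) for n in names)]
```

An empty list means all three groups are present. A non-empty list usually means the corresponding event has not occurred yet, since some counters appear only after their first increment.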

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| --metrics-port | 8081 | Port for the dedicated Prometheus /metrics listener (must differ from --http-port) |
| --metrics-host | 127.0.0.1 | Bind address for that dedicated listener only (use 0.0.0.0 for in-cluster or cross-container scrape; never expose on the public internet without a proxy and auth) |

/healthz, /readyz, and /health stay on the main HTTP port (--host:--http-port). /metrics is only on the dedicated listener (--metrics-host:--metrics-port).