Observability — Traces, metrics, and structured logging
FraiseQL includes several features to help your API remain stable under load and recover gracefully from failures:
FraiseQL maintains a connection pool for each configured database. The pool reuses connections across requests, reducing connection overhead.
Configure pool settings per database in fraiseql.toml:
```toml
[databases.primary]
url = "${DATABASE_URL}"
type = "postgresql"
pool_size = 10
timeout_seconds = 30
```

| Setting | Default | Description |
|---|---|---|
| pool_size | 10 | Maximum connections in the pool |
| timeout_seconds | 30 | Connection and query timeout |
| Workload | Recommended Pool Size |
|---|---|
| Light (single user, development) | 5-10 |
| Medium (production API) | 10-20 |
| Heavy (high-throughput analytics) | 20-50 |
| Multi-database setup | 10-15 per database |
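As a rough starting point, pool size can be estimated from expected request rate and average query latency (an application of Little's law). The helper below is an illustrative sketch, not a FraiseQL API; the headroom factor is an assumption:

```python
import math

def estimate_pool_size(requests_per_second: float,
                       avg_query_seconds: float,
                       headroom: float = 1.5) -> int:
    """Estimate pool_size: concurrent queries = arrival rate x service time
    (Little's law), padded with headroom for bursts. Floor of 5 matches the
    light-workload recommendation above."""
    concurrent = requests_per_second * avg_query_seconds
    return max(5, math.ceil(concurrent * headroom))

# A 100 req/s API whose queries average 50 ms needs roughly 8 connections.
print(estimate_pool_size(100, 0.05))  # 8
# A heavier analytics workload lands in the 20-50 band.
print(estimate_pool_size(400, 0.08))  # 48
```

Treat the result as a starting point and tune against the pool metrics described later on this page.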
FraiseQL enforces timeouts on database queries to prevent slow queries from consuming pool connections indefinitely.
When a query exceeds the configured timeout_seconds, FraiseQL cancels the query, returns its connection to the pool, and responds with a QUERY_TIMEOUT error.
Example error response:
```json
{
  "errors": [{
    "message": "Query exceeded timeout of 30s",
    "extensions": {
      "code": "QUERY_TIMEOUT",
      "timeout_seconds": 30
    }
  }]
}
```

| Query Type | Recommended Timeout |
|---|---|
| Simple lookups (by ID) | 5-10 seconds |
| List queries with filters | 10-30 seconds |
| Complex aggregations | 30-60 seconds |
| Analytics/reporting | 60-300 seconds |
If you frequently hit timeout errors:

1. Check for missing database indexes:

   ```sql
   EXPLAIN ANALYZE SELECT * FROM v_user WHERE email = 'test@example.com';
   ```

2. Consider query optimization or materialized views for complex aggregations.
3. Increase pool_size if timeouts are caused by connection wait times.
FraiseQL exposes health endpoints on the same port as the GraphQL API:
| Endpoint | Purpose | Response |
|---|---|---|
| GET /health | Full status including database connectivity | JSON with detailed status |
| GET /health/live | Process is alive | 200 OK or 503 |
| GET /health/ready | Ready to serve traffic | 200 OK or 503 |
Configure readiness and liveness probes:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

Healthy state:
```json
{
  "status": "ok",
  "version": "0.9.1",
  "uptime_seconds": 3821,
  "databases": {
    "primary": { "status": "ok", "pool_idle": 18, "pool_active": 2 }
  }
}
```

Degraded state (database unavailable):

```json
{
  "status": "degraded",
  "databases": {
    "primary": { "status": "error", "error": "connection refused" }
  }
}
```

HTTP 503 is returned when status is "degraded" or "error", causing Kubernetes to stop routing traffic to the pod.
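Deployment tooling can apply the same readiness rule to these payloads. The function below is an illustrative sketch, not part of FraiseQL:

```python
def is_ready(health: dict) -> bool:
    """Treat the instance as ready only when the overall status and every
    configured database report "ok", mirroring the 200-vs-503 behaviour."""
    if health.get("status") != "ok":
        return False
    return all(db.get("status") == "ok"
               for db in health.get("databases", {}).values())

healthy = {"status": "ok", "databases": {"primary": {"status": "ok"}}}
degraded = {"status": "degraded",
            "databases": {"primary": {"status": "error",
                                      "error": "connection refused"}}}
print(is_ready(healthy))   # True
print(is_ready(degraded))  # False
```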
When FraiseQL receives SIGTERM (from Kubernetes, systemd, or kill), it enters a drain phase:

1. Stop accepting new connections — the load balancer health check returns 503 immediately, causing the LB to stop routing new traffic to this instance.
2. Drain in-flight requests — FraiseQL waits up to 30 seconds for active queries and mutations to complete.
3. Close database pools — all connections are cleanly closed after in-flight work finishes or the drain timeout expires.
4. Exit with code 0 — a clean exit allows Kubernetes to proceed with the rolling update.
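The drain phase above amounts to a polling loop with a deadline. The sketch below is a hypothetical helper with a simulated in-flight counter, not FraiseQL's internal shutdown code:

```python
import time

def drain(in_flight, drain_timeout: float = 30.0, poll: float = 0.05) -> bool:
    """Wait for in-flight work to finish, up to the drain timeout.

    Returns True on a clean drain, False if the timeout expired
    (after which pools would be closed regardless).
    """
    deadline = time.monotonic() + drain_timeout
    while in_flight() > 0:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True

# Simulated shutdown: pending requests complete one per poll.
pending = [2]

def fake_in_flight():
    if pending[0] > 0:
        pending[0] -= 1
    return pending[0]

print(drain(fake_in_flight, drain_timeout=1.0))  # True
```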
Ensure your deployment allows time for graceful shutdown:
```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60  # Must be > drain timeout
      containers:
        - name: fraiseql
          image: fraiseql/fraiseql:latest
```

FraiseQL returns structured GraphQL errors with consistent extension codes:
| Error Code | Meaning | Client Action |
|---|---|---|
| QUERY_TIMEOUT | Query exceeded timeout | Retry with smaller result set or optimize query |
| DATABASE_UNAVAILABLE | Cannot connect to database | Retry with exponential backoff |
| POOL_EXHAUSTED | All connections in use | Wait and retry, or increase pool_size |
| RATE_LIMITED | Request rate limit hit | Slow down request rate |
| INTERNAL_ERROR | Unexpected server error | Report to operations team |
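The table maps directly onto a retry policy. The helper below is a sketch; treating RATE_LIMITED as retryable after backing off is an interpretation of the table's advice, not a FraiseQL API:

```python
# Transient conditions worth retrying, per the error-code table.
RETRYABLE = {"DATABASE_UNAVAILABLE", "POOL_EXHAUSTED", "RATE_LIMITED"}

def should_retry(error: dict) -> bool:
    """Decide from a GraphQL error object whether a retry makes sense.
    QUERY_TIMEOUT and INTERNAL_ERROR need human attention, not retries."""
    return error.get("extensions", {}).get("code") in RETRYABLE

print(should_retry({"extensions": {"code": "POOL_EXHAUSTED"}}))  # True
print(should_retry({"extensions": {"code": "QUERY_TIMEOUT"}}))   # False
```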
Implement intelligent retries in your client:
```python
import time
import random

from fraiseql import FraiseQLClient

client = FraiseQLClient("http://localhost:8080")


def execute_with_retry(query, variables, max_attempts=3):
    """Execute query with exponential backoff for transient errors."""
    delay = 0.1

    for attempt in range(max_attempts):
        try:
            result = client.execute(query, variables=variables)

            if result.errors:
                codes = [e.get("extensions", {}).get("code") for e in result.errors]

                # Retry on transient errors
                if any(c in codes for c in ["DATABASE_UNAVAILABLE", "POOL_EXHAUSTED"]):
                    if attempt < max_attempts - 1:
                        time.sleep(delay + random.uniform(0, 0.1))
                        delay = min(delay * 2, 5.0)
                        continue

                # Don't retry timeouts — they need query optimization
                if "QUERY_TIMEOUT" in codes:
                    raise RuntimeError("Query timeout — optimize query or increase timeout")

            return result

        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(delay + random.uniform(0, 0.1))
            delay = min(delay * 2, 5.0)

    raise RuntimeError("Max retries exceeded")
```

```typescript
import { FraiseQLClient, FraiseQLError } from 'fraiseql';

const client = new FraiseQLClient('http://localhost:8080');

async function executeWithRetry<T>(
  query: string,
  variables: Record<string, unknown>,
  maxAttempts = 3
): Promise<T> {
  let delay = 100;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await client.execute<T>(query, variables);

      if (result.errors) {
        const codes = result.errors.map(e => e.extensions?.code);

        // Retry on transient errors
        if (codes.some(c => ['DATABASE_UNAVAILABLE', 'POOL_EXHAUSTED'].includes(c))) {
          if (attempt < maxAttempts - 1) {
            await sleep(delay + Math.random() * 100);
            delay = Math.min(delay * 2, 5000);
            continue;
          }
        }

        // Don't retry timeouts
        if (codes.includes('QUERY_TIMEOUT')) {
          throw new Error('Query timeout — optimize query or increase timeout');
        }
      }

      return result.data;
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      await sleep(delay + Math.random() * 100);
      delay = Math.min(delay * 2, 5000);
    }
  }

  throw new Error('Unreachable');
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

FraiseQL protects federation fan-out queries from cascading failures. When a federated database repeatedly fails, the circuit opens and subsequent requests to that database fail fast with HTTP 503 instead of waiting for timeouts.
| State | Behaviour |
|---|---|
| Closed | Requests flow normally. Failure count is tracked. |
| Open | All requests to this database fail immediately with HTTP 503 + Retry-After header. |
| Half-open | After recovery_timeout_secs, a probe request is allowed. On success the circuit closes; on failure it reopens. |
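The state machine in the table can be sketched in a few dozen lines. This is an illustration of the documented behaviour, not FraiseQL's internal implementation; the injectable clock exists only to make the example testable:

```python
import time

class CircuitBreaker:
    """Closed/open/half-open state machine following the table above."""

    def __init__(self, failure_threshold=5, recovery_timeout_secs=30,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_secs = recovery_timeout_secs
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout_secs:
                self.state = "half_open"  # let a probe request through
                self.successes = 0
                return True
            return False  # fail fast (HTTP 503 + Retry-After)
        return True

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

# Walk through one open/probe/close cycle with a fake clock.
now = [0.0]
cb = CircuitBreaker(failure_threshold=2, recovery_timeout_secs=30,
                    success_threshold=1, clock=lambda: now[0])
cb.record_failure()
cb.record_failure()
print(cb.state)            # open
print(cb.allow_request())  # False
now[0] = 31.0
print(cb.allow_request())  # True (half-open probe)
cb.record_success()
print(cb.state)            # closed
```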
```toml
[federation.circuit_breaker]
enabled = true
failure_threshold = 5       # Open after N consecutive failures
recovery_timeout_secs = 30  # Seconds to stay open before probing
success_threshold = 2       # Successful probes required to close

# Per-database override (array of tables)
[[federation.circuit_breaker.per_database]]
database = "orders_db"
failure_threshold = 3
recovery_timeout_secs = 60
```

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Enable circuit breaker protection |
| failure_threshold | integer | 5 | Consecutive failures before opening |
| recovery_timeout_secs | integer | 30 | Seconds in open state before probing |
| success_threshold | integer | 2 | Successful probes required to close |
When the circuit is open, FraiseQL returns:
```json
{
  "errors": [{
    "message": "Federation database 'orders_db' unavailable: circuit breaker open",
    "extensions": {
      "code": "SERVICE_UNAVAILABLE",
      "database": "orders_db"
    }
  }]
}
```

HTTP status is 503 Service Unavailable with a Retry-After: 30 header set to recovery_timeout_secs.
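Clients can honour the Retry-After header instead of guessing a backoff. The helper below is a hypothetical sketch operating on a status code and header dict; it handles only the numeric-seconds form of the header:

```python
def backoff_seconds(status: int, headers: dict, default: float = 1.0) -> float:
    """For a 503 from an open circuit, back off for the server-suggested
    Retry-After interval; otherwise fall back to a default delay."""
    if status == 503 and "Retry-After" in headers:
        try:
            return float(headers["Retry-After"])
        except ValueError:
            return default  # HTTP-date form not handled in this sketch
    return default

print(backoff_seconds(503, {"Retry-After": "30"}))  # 30.0
print(backoff_seconds(500, {}))                     # 1.0
```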
FraiseQL exposes operational metrics on GET /metrics in Prometheus exposition format:
| Metric | Labels | Description |
|---|---|---|
| fraiseql_requests_total | method, status | Total HTTP requests |
| fraiseql_request_duration_seconds | method | Request latency histogram |
| fraiseql_database_pool_active | database | Active connections per database |
| fraiseql_database_pool_idle | database | Idle connections per database |
| fraiseql_query_timeouts_total | database | Query timeout count |
| fraiseql_errors_total | type | Error count by type |
| fraiseql_federation_circuit_breaker_state | database | Circuit breaker state: 0=closed, 1=open, 2=half_open |
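A monitoring script can decode the circuit breaker gauge from a scrape. The parser below is a minimal sketch handling only the plain `name{labels} value` form, not the full Prometheus exposition format; the sample scrape text is illustrative:

```python
def parse_metrics(text: str) -> dict:
    """Parse simple exposition lines into {'name{labels}': value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blanks
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

# Gauge encoding from the table above.
STATES = {0: "closed", 1: "open", 2: "half_open"}

scrape = """\
# HELP fraiseql_federation_circuit_breaker_state Circuit state per database
fraiseql_federation_circuit_breaker_state{database="orders_db"} 1
fraiseql_database_pool_active{database="primary"} 2
"""
samples = parse_metrics(scrape)
state = samples['fraiseql_federation_circuit_breaker_state{database="orders_db"}']
print(STATES[int(state)])  # open
```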
Example Prometheus alerting rules for FraiseQL:
```yaml
# High error rate
- alert: FraiseQLHighErrorRate
  expr: rate(fraiseql_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate on {{ $labels.instance }}"

# Database connection issues
- alert: FraiseQLDatabaseUnavailable
  expr: fraiseql_database_pool_active == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Database appears unavailable"

# Query timeouts
- alert: FraiseQLQueryTimeouts
  expr: rate(fraiseql_query_timeouts_total[5m]) > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Frequent query timeouts — check for missing indexes"
```

Before deploying FraiseQL to production:

- Set pool_size for your workload
- Set timeout_seconds based on your slowest acceptable query
- Enable [federation.circuit_breaker] for federated deployments
- Monitor fraiseql_federation_circuit_breaker_state
- Set terminationGracePeriodSeconds > 30 in Kubernetes

The following features are planned for future releases:
“Pool exhausted” errors

Increase pool_size or reduce query concurrency:

```toml
[databases.primary]
pool_size = 20  # Increase from default 10
```

Frequent query timeouts

Identify slow queries:

```sql
SELECT query, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

Add indexes on frequently filtered columns.

Increase the timeout for legitimate long-running queries:

```toml
[databases.analytics]
timeout_seconds = 120
```

Health check failures

Check database connectivity:

```shell
curl http://localhost:8080/health
# Look for database.status: "error"
```

Verify DATABASE_URL is correct and the database is accessible from the FraiseQL pod.