Performance Issues

Solutions for FraiseQL performance problems.
Slow Queries (p95 > 200ms, p99 > 500ms)
Symptoms

- Requests taking > 200ms (p95)
- Inconsistent latency
- Occasionally slow responses mixed with fast ones
- Load hasn’t changed but latency increased
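A quick way to confirm these symptoms is to compute tail latencies from your own request timings; a minimal Python sketch (nearest-rank method, sample data invented):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank method
    return ordered[min(rank, len(ordered)) - 1]

# 100 simulated request durations in ms: mostly fast, with a slow tail
durations = [50.0] * 90 + [180.0] * 5 + [400.0] * 5

print(f"p95={percentile(durations, 95)}ms p99={percentile(durations, 99)}ms")
# p95=180.0ms p99=400.0ms
```

If p95 is fine but p99 is far worse, you are looking at the "occasionally slow responses mixed with fast ones" pattern rather than uniformly slow queries.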
Diagnosis Steps
- Check logs for slow queries:

```shell
LOG_LEVEL=debug fraiseql run
```

- Monitor during slow period:

```shell
docker stats fraiseql
curl http://localhost:9000/metrics | grep fraiseql_request_duration
```

- Check database for slow queries:

PostgreSQL:

```sql
SELECT mean_exec_time, calls, mean_exec_time * calls AS total_time, query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 5;

-- Show slowest queries by total impact
SELECT mean_exec_time, calls, query
FROM pg_stat_statements
ORDER BY mean_exec_time * calls DESC LIMIT 5;
```

MySQL:

```sql
SHOW FULL PROCESSLIST;

-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- Check slow query log
SELECT * FROM mysql.slow_log ORDER BY query_time DESC LIMIT 5;
```

SQLite:

```shell
# SQLite doesn't have built-in query statistics
# Use EXPLAIN QUERY PLAN with profiling
sqlite3 database.db ".timer on"
sqlite3 database.db "EXPLAIN QUERY PLAN SELECT ..."
```

SQL Server:

```sql
SELECT TOP 5 total_elapsed_time / execution_count AS avg_elapsed_time,
       execution_count, query_hash, text
FROM sys.dm_exec_query_stats stats
CROSS APPLY sys.dm_exec_sql_text(stats.sql_handle) text
ORDER BY total_elapsed_time DESC;

-- Show total impact (avg time × calls)
SELECT TOP 5 total_elapsed_time / execution_count AS avg_time,
       (total_elapsed_time / execution_count) * execution_count AS total_time,
       execution_count, text
FROM sys.dm_exec_query_stats stats
CROSS APPLY sys.dm_exec_sql_text(stats.sql_handle) text
ORDER BY total_time DESC;
```

- Analyze slow query:

PostgreSQL:

```sql
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM posts WHERE user_id = 123;
-- Look for: "Seq Scan" → Add index, "Hash Aggregate" → Check if necessary
```

MySQL:

```sql
EXPLAIN FORMAT=JSON SELECT * FROM posts WHERE user_id = 123\G
-- Look for: type='ALL' → full scan, Extra='Using where' without index
```

SQLite:

```sql
EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = 123;
-- Look for: SCAN TABLE posts → full table scan
```

SQL Server:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM posts WHERE user_id = 123;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Check I/O and time statistics in the Messages tab
```

Solutions
Add Missing Index (Most common):

```sql
-- Check EXPLAIN output for "Seq Scan" or "Full Table Scan"
EXPLAIN ANALYZE SELECT * FROM posts WHERE user_id = 123;

-- If it shows a full scan, add an index
CREATE INDEX idx_posts_user_id ON posts(user_id);
```

Optimize N+1 Queries:
FraiseQL should batch automatically. If you see 100 queries for 10 items, check the logs.

Query Complexity:

Deeply nested queries:

```graphql
users { posts { comments { author { posts { ... } } } } }
```

Solution: simplify query depth or split into multiple queries.

High Latency Spikes
Symptoms

- Usually fast (50ms)
- Occasionally very slow (2-5 seconds)
- Happens randomly or under load
- Correlates with database load
Common Causes
- Database Lock Contention
  - Multiple writers competing
  - Long transactions holding locks
  - Deadlocks being retried
- Connection Pool Saturation
  - All connections in use
  - New requests waiting for a free connection
  - Leads to cascading delays
- Memory Pressure
  - Swapping to disk
  - Garbage collection pauses
  - Buffer cache thrashing
- Network Latency
  - To the database
  - Load balancer/DNS lookup
  - Network packet loss
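The pool-saturation cause can be reproduced in miniature: once every connection is busy, new requests queue behind the pool and their latency multiplies. A toy asyncio sketch, not FraiseQL internals:

```python
import asyncio
import time

async def handle_request(pool: asyncio.Semaphore, query_time: float) -> float:
    """Total latency = time queued waiting for a connection + query time."""
    start = time.monotonic()
    async with pool:                      # acquire a "connection"
        await asyncio.sleep(query_time)   # the query itself
    return time.monotonic() - start

async def main() -> list[float]:
    pool = asyncio.Semaphore(2)           # pool of 2 "connections"
    # 6 concurrent requests, each holding a connection for 50ms:
    # they complete in 3 waves, so tail latency is ~3x the query time
    return await asyncio.gather(*(handle_request(pool, 0.05) for _ in range(6)))

latencies = asyncio.run(main())
print(sorted(round(l, 2) for l in latencies))
```

The query itself never got slower; the tail latency came entirely from waiting in the queue, which is exactly why spikes correlate with load.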
Solutions
Identify Root Cause:

```shell
# 1. Check database connection pool
curl http://localhost:9000/metrics | grep db_connections
# If near max: increase pool size
PGBOUNCER_MAX_POOL_SIZE=50

# 2. Check memory usage during spike
docker stats fraiseql --no-stream

# 3. Check for GC pauses (Python)
LOG_LEVEL=debug

# 4. Check database locks (PostgreSQL):
#    SELECT * FROM pg_locks WHERE NOT granted;

# 5. Check network latency
ping database.example.com
```

Fix Connection Pool:
```shell
# Current: 3 instances × 5 pool size = 15 connections
# If regularly at 100% utilization, increase:

# Option 1: Increase pool size
PGBOUNCER_MAX_POOL_SIZE=30

# Option 2: Add more instances
# Scale from 3 to 5 instances

# Option 3: Use read replicas
# Route reads to a replica instead of the primary
```

Fix Memory Pressure:

```yaml
# docker-compose.yml: increase container memory
fraiseql:
  deploy:
    resources:
      limits:
        memory: 2G  # Increase from 1G
```

```shell
# Or reduce batch size
BATCH_SIZE=100  # Smaller batches use less memory
```

High Database Load (CPU > 80%)
Symptoms

- Database CPU high even though the application isn't busy
- Database becoming bottleneck
- Slow queries affecting all users
- Auto-scaling of app doesn’t help
Root Causes
- Too Many Queries
  - N+1 problem
  - No query caching
  - Inefficient queries
- Missing Indexes
  - Full table scans
  - Unnecessary I/O
- Suboptimal Queries
  - Complex joins
  - Inefficient grouping
  - Unnecessary subqueries
- Hot Spots
  - One table/query causing the load
  - Contention on a popular record
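FraiseQL batches for you, but the fix for the N+1 cause above can be sketched generically: collect the keys from one resolution pass and issue a single batched lookup. `fetch_posts_for_users` is a hypothetical stand-in for a `WHERE user_id IN (...)` query:

```python
def fetch_posts_for_users(user_ids):
    """One batched query instead of one query per user."""
    fetch_posts_for_users.calls += 1          # count round-trips for the demo
    fake_db = {1: ["a"], 2: ["b", "c"], 3: []}
    return {uid: fake_db.get(uid, []) for uid in user_ids}

fetch_posts_for_users.calls = 0

def resolve_posts(user_ids):
    # N+1 version would loop: one query per user -> N round-trips
    # Batched version: deduplicate keys, one query total
    batched = fetch_posts_for_users(sorted(set(user_ids)))
    return [batched[uid] for uid in user_ids]

posts = resolve_posts([1, 2, 2, 3])
print(posts, "queries:", fetch_posts_for_users.calls)  # 1 query, not 4
```

If the logs show one query per item instead of one per key set, batching is not kicking in and that is the load to eliminate first.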
Solutions
Find Slow Query:

PostgreSQL:

```sql
SELECT mean_exec_time, calls, query
FROM pg_stat_statements
ORDER BY mean_exec_time * calls DESC LIMIT 5;
-- Shows: slowest × most-called = most impact
```

MySQL:

```sql
SELECT avg_timer_wait, count_star AS calls, schema_name, digest_text
FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC LIMIT 5;
```

SQLite:

```shell
# Use EXPLAIN QUERY PLAN for all slow queries
sqlite3 database.db "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = 123;"
```

SQL Server:

```sql
SELECT TOP 5 (total_elapsed_time / execution_count) AS avg_time,
       total_elapsed_time / 1000000 AS total_time_sec,
       execution_count, text
FROM sys.dm_exec_query_stats stats
CROSS APPLY sys.dm_exec_sql_text(stats.sql_handle) text
ORDER BY (total_elapsed_time / execution_count) DESC;
```

Optimize with EXPLAIN:

PostgreSQL:

```sql
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM posts WHERE user_id = 123;
-- Look for:
-- - "Seq Scan" → Add index
-- - "Hash Aggregate" → Check if necessary
-- - "Nested Loop" → Might need a better index
```

MySQL:

```sql
EXPLAIN FORMAT=JSON SELECT * FROM posts WHERE user_id = 123\G
-- Look for:
-- - "type": "ALL" → Full table scan, add index
-- - "possible_keys": null → No usable index
-- - "key": null → Not using any index
```

SQLite:

```sql
EXPLAIN QUERY PLAN SELECT * FROM posts WHERE user_id = 123;
-- Look for:
-- - SCAN TABLE posts → Full table scan, add index
-- - SEARCH TABLE posts USING INDEX → Good, using index
```

SQL Server:

```sql
-- Get execution plan
SET STATISTICS IO ON;
SELECT * FROM posts WHERE user_id = 123;
SET STATISTICS IO OFF;
-- Look for:
-- - Table Scan → Add index
-- - Index Seek → Good, using index
-- - High logical reads → Inefficient plan
```

Add Strategic Indexes:
```sql
-- On WHERE columns
CREATE INDEX idx_posts_user_id ON posts(user_id);

-- Composite index for common filters
CREATE INDEX idx_posts_user_published ON posts(user_id, published);

-- Partial index for specific values
CREATE INDEX idx_posts_published ON posts(id) WHERE published = true;
```

Cache Expensive Queries:

```python
@cached(ttl=3600)
@fraiseql.query
def user_stats(user_id: ID) -> UserStats:
    # Expensive aggregation computed once per hour
    pass
```

High Memory Usage
Symptoms

- Memory usage growing over time
- Container killed with OOM
- Memory usage > 90%
- Slowdowns due to swap/GC
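On the application side, steady growth like this can be localized with Python's standard-library tracemalloc module; a minimal sketch with a deliberately leaky list standing in for an unbounded cache:

```python
import tracemalloc

leak = []  # stands in for a cache that is never evicted

tracemalloc.start()
before = tracemalloc.take_snapshot()

n = 100
for _ in range(10_000):
    leak.append("x" * n)  # allocate on every request, never release

after = tracemalloc.take_snapshot()
# Biggest growth between the two snapshots, grouped by source line
top = after.compare_to(before, "lineno")[0]
print(top)  # points at the allocating line, with its size delta
tracemalloc.stop()
```

Take the two snapshots an hour apart in a real process; whichever line dominates the diff is the leak candidate.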
Root Causes
- Memory Leak
  - Objects not being garbage collected
  - Circular references
  - Large caches growing unbounded
- Large Query Results
  - Fetching massive datasets
  - No pagination
  - Loading an entire table into memory
- Caching Issues
  - Cache growing unbounded
  - Old cache entries not evicted
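The caching causes above are avoided by bounding both entry lifetime and entry count. A minimal sketch of such a cache; this is not FraiseQL's `@cached`, which already takes a `ttl`:

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """LRU-evicting cache whose entries also expire after ttl seconds."""

    def __init__(self, maxsize: int, ttl: float):
        self.maxsize, self.ttl = maxsize, ttl
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def set(self, key, value):
        self._data.pop(key, None)
        self._data[key] = (time.monotonic() + self.ttl, value)
        while len(self._data) > self.maxsize:  # bound memory
            self._data.popitem(last=False)     # evict least-recently used

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        expires_at, value = item
        if time.monotonic() >= expires_at:     # expired: evict now
            del self._data[key]
            return default
        self._data.move_to_end(key)            # mark as recently used
        return value

cache = BoundedTTLCache(maxsize=2, ttl=60)
cache.set("a", 1); cache.set("b", 2); cache.set("c", 3)  # "a" evicted
print(cache.get("a"), cache.get("c"))
```

With both bounds in place, memory use has a hard ceiling regardless of traffic.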
Solutions
Identify Leak:

```shell
# Monitor memory growth over 1 hour
watch -n 60 'docker stats fraiseql'
# If consistently growing, there's a leak

# Check memory per query
for i in {1..100}; do
  curl http://localhost:8000/graphql ...
done
docker stats fraiseql  # Memory increased?
```

Fix Large Query Results:
```python
# BAD: Fetches entire table
@fraiseql.query
def all_posts() -> list[Post]:
    pass

# GOOD: Paginate
@fraiseql.query
def posts(first: int = 10, after: str | None = None) -> PostConnection:
    pass
```

Usage:

```graphql
query {
  posts(first: 10) {        # Only 10 at a time
    edges { node { id title } }
    pageInfo { hasNextPage }
  }
}
```

Fix Unbounded Cache:
```python
# Set TTL to prevent indefinite growth
@cached(ttl=3600)  # Expires after 1 hour
def expensive_query():
    pass

# Or use an LRU cache with a max size
from functools import lru_cache

@lru_cache(maxsize=1000)
def expensive_computation():
    pass
```

Increase Memory Limit:
```shell
# Temporary: Increase container memory
docker update --memory 2G fraiseql
```

```yaml
# Permanent: Update docker-compose.yml
fraiseql:
  deploy:
    resources:
      limits:
        memory: 2G
```

Connection Pool Exhaustion
Symptoms

- Requests time out
- “Cannot get database connection” errors
- Gets worse under load
- Fixed by restarting
Root Causes
- Connections Not Released
  - Long transactions
  - Unhandled errors
  - Middleware not closing connections
- Pool Too Small
  - More concurrent requests than pool size
  - Each request holds a connection
- Connection Leak
  - Connections created but never returned
  - Growing over time
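The first cause, connections lost on error paths, is typically fixed by acquiring through a context manager so the release lives in a `finally` block; a toy pool sketch:

```python
from contextlib import contextmanager

class TinyPool:
    def __init__(self, size: int):
        self.free = list(range(size))  # connection ids

    @contextmanager
    def connection(self):
        if not self.free:
            raise RuntimeError("pool exhausted")
        conn = self.free.pop()
        try:
            yield conn
        finally:
            self.free.append(conn)  # returned even if the query raises

pool = TinyPool(size=2)
try:
    with pool.connection():
        raise ValueError("query failed")
except ValueError:
    pass
print(len(pool.free))  # still 2: the error path returned the connection
```

Code that calls `acquire()` and `release()` by hand leaks exactly when an exception skips the release; the `with` form makes that impossible.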
Solutions
Check Current State:

PostgreSQL:

```sql
-- Get connection count
SELECT COUNT(*) AS total_connections,
       COUNT(CASE WHEN state != 'idle' THEN 1 END) AS active
FROM pg_stat_activity;

-- Get current max
SHOW max_connections;
-- If near max (e.g., 100/100), the pool is exhausted
```

MySQL:

```sql
SHOW PROCESSLIST;

-- Get connection summary
SELECT COUNT(*) AS total_connections,
       COUNT(CASE WHEN COMMAND != 'Sleep' THEN 1 END) AS active
FROM INFORMATION_SCHEMA.PROCESSLIST;

-- Get max allowed
SHOW VARIABLES LIKE 'max_connections';
```

SQLite:

```shell
# SQLite uses a single writer connection; check for lock waits
lsof | grep database.db
ls -lh database.db-wal  # Check WAL size
```

SQL Server:

```sql
-- Get connection count
SELECT COUNT(*) AS total_connections,
       COUNT(CASE WHEN status = 'running' THEN 1 END) AS active
FROM master.dbo.sysprocesses
WHERE spid > 50;

-- Get max allowed
SELECT @@MAX_CONNECTIONS;
-- If near the limit, the pool is exhausted
```

Immediate Relief:
```shell
# Increase pool size
PGBOUNCER_MAX_POOL_SIZE=50  # From 20

# Restart application
docker-compose restart fraiseql
```

```sql
-- Kill idle connections (PostgreSQL)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND query_start < NOW() - INTERVAL '5 minutes';
```

Long-term Fix:
- Find why connections aren't returning:

```shell
# Check logs for exceptions that don't close connections
LOG_LEVEL=debug fraiseql run
```

- Reduce transaction time:

```python
# BAD: Long transaction
async def mutation(input):
    db.begin()
    save_to_db(input)
    send_email()  # 5 seconds
    db.commit()

# GOOD: Short transaction
async def mutation(input):
    db.begin()
    save_to_db(input)
    db.commit()
    await send_email()  # Outside the transaction
```

- Add connection retry: the application auto-retries if no connection is available; FraiseQL handles this.

- Scale with more instances:

If you need 100 concurrent connections: 5 instances × 20 pool = 100 connections.
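The sizing arithmetic from the last step generalizes to: instances needed = ceil(peak concurrent connections / per-instance pool size). As a one-function sketch:

```python
import math

def instances_needed(peak_connections: int, pool_size: int) -> int:
    """How many app instances so the combined pools cover peak demand."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return math.ceil(peak_connections / pool_size)

print(instances_needed(100, 20))  # 5 instances × 20 pool = 100 connections
```

The ceiling matters: 101 peak connections with a pool of 20 needs 6 instances, not 5.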
Query Timeout
Symptoms

- Requests return after the timeout period
- “Query execution timeout” errors
- Random, not specific to one query
- Gets worse under load
Root Causes
- Query Too Slow
  - Needs an index
  - Query complexity
- Database Overloaded
  - Can't start executing the query in time
  - Waiting for free resources
- Timeout Too Low
  - A 5-second timeout isn't enough
  - The query naturally takes 6 seconds
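When the timeout is enforced application-side, the underlying pattern is a wrapper that cancels the operation at a deadline; a generic asyncio sketch (the database-side `statement_timeout` is the equivalent at the other end):

```python
import asyncio

async def run_with_timeout(coro, seconds: float):
    """Cancel the awaited operation if it exceeds the deadline."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return {"errors": [{"message": "Query execution timeout"}]}

async def slow_query():
    await asyncio.sleep(0.2)  # simulated 200ms query
    return {"data": "ok"}

async def main():
    fast = await run_with_timeout(slow_query(), seconds=1.0)   # completes
    slow = await run_with_timeout(slow_query(), seconds=0.05)  # times out
    return fast, slow

fast, slow = asyncio.run(main())
print(fast, slow)
```

This is why "random" timeouts under load are consistent with cause 2: the query itself is unchanged, but its deadline starts ticking while it waits for resources.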
Solutions
Check Query Speed:

```shell
# Run the slow query directly
time psql -c "SELECT ..."
# If it takes 10 seconds, the timeout must be > 10 seconds
```

Increase Timeout:

```shell
# Application level
STATEMENT_TIMEOUT=30000  # 30 seconds

# Connection level (PostgreSQL)
DATABASE_URL=postgresql://...?statement_timeout=30000
```

Optimize Slow Query:

```sql
EXPLAIN ANALYZE SELECT ...;
-- Add missing indexes, simplify the query
```

Rate Limiting Issues
Symptoms

- Requests return 429 Too Many Requests
- Valid traffic getting rate limited
- Limits too strict for actual usage
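`RATE_LIMIT_REQUESTS` and `RATE_LIMIT_WINDOW_SECONDS` describe a windowed counter. A minimal fixed-window sketch of that idea, not FraiseQL's implementation:

```python
import time

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: float):
        self.limit, self.window = limit, window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:  # new window: reset counter
            self.window_start, self.count = now, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # caller should respond 429 Too Many Requests

limiter = FixedWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Seeing valid traffic rejected means the limit/window pair is smaller than the real request rate per window, so measure that rate before raising the limit.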
Solutions
Check Current Limits:

```shell
echo $RATE_LIMIT_REQUESTS
echo $RATE_LIMIT_WINDOW_SECONDS
```

Adjust Limits (if too strict):

```shell
# From: 100 requests per minute
# To: 1000 requests per minute
RATE_LIMIT_REQUESTS=1000
RATE_LIMIT_WINDOW_SECONDS=60

# Per-user limit (requires auth)
PER_USER_RATE_LIMIT=100
PER_USER_RATE_LIMIT_WINDOW_SECONDS=60
```

Monitor Rate Limit Usage:

```shell
# Check Prometheus metrics
curl http://localhost:9000/metrics | grep rate_limit
# If consistently hitting the limit, increase it
```

Caching Not Working
Symptoms

- Same query executed multiple times
- Cache hit rate < 50%
- Memory not being used efficiently
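Before tuning, it helps to measure the hit rate directly; a tiny wrapper that counts hits and misses around a dict-like cache (all names hypothetical):

```python
class InstrumentedCache:
    """Wraps a dict and tracks the hit ratio."""

    def __init__(self):
        self._data = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            return self._data[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._data[key] = value

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = InstrumentedCache()
cache.set("q1", "result")
for _ in range(4):
    cache.get("q1")   # hits
cache.get("q2")       # miss
print(f"hit ratio: {cache.hit_ratio:.0%}")  # 80%
```

A hit rate below 50% usually means cache keys vary more than expected (e.g., per-user data in the key) or the TTL is shorter than the re-request interval.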
Solutions
Check if Caching Enabled:

```shell
redis-cli ping
# Should return: PONG

# If error, Redis is not running
docker-compose logs redis
```

Increase TTL:

```python
# From 1 hour to 24 hours
@cached(ttl=86400)
def stable_data():
    pass
```

Check Cache Hit Rate:

```shell
# Prometheus metric
curl http://localhost:9000/metrics | grep cache_hit_ratio
# Should be > 80% for production
```

Monitor Cache Size:

```shell
# If too much data is cached, memory pressure follows
redis-cli INFO memory
# If used_memory > available memory:
# either increase memory or reduce cache size
```

Cascading Failures
Symptoms

- One slow request blocks others
- Timeouts cascade across system
- Gets worse under load
- System recovers when load drops
Root Causes
- Shared Resource Contention
  - Database connection pool is shared
  - One slow request holds connections
  - Other requests can't get connections
- Synchronous Calls
  - A long operation blocks the request
  - Holds a database connection
  - Other requests time out
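Both causes reduce to unbounded sharing of one resource. The bulkhead idea from the Solutions section can be sketched with one semaphore per workload class (sizes illustrative, names hypothetical):

```python
import asyncio

# One concurrency budget per workload class, so batch traffic
# cannot starve interactive user requests
BULKHEADS = {
    "user": asyncio.Semaphore(2),
    "batch": asyncio.Semaphore(1),
}

async def run_in_bulkhead(kind: str, coro):
    """Run coro under its class's budget; fail fast when the budget is spent."""
    sem = BULKHEADS[kind]
    if sem.locked():
        coro.close()  # avoid "coroutine was never awaited" warning
        raise RuntimeError(f"{kind} bulkhead full")
    async with sem:
        return await coro

async def slow_job():
    await asyncio.sleep(0.05)
    return "done"

async def main():
    # Fill the batch bulkhead, then try a second batch job: it fails fast,
    # while the user bulkhead is unaffected.
    first = asyncio.create_task(run_in_bulkhead("batch", slow_job()))
    await asyncio.sleep(0.01)  # let the first job acquire its semaphore
    try:
        await run_in_bulkhead("batch", slow_job())
        rejected = False
    except RuntimeError:
        rejected = True
    user_ok = await run_in_bulkhead("user", slow_job())
    return await first, rejected, user_ok

result = asyncio.run(main())
print(result)
```

Failing fast inside one bulkhead converts a cascading stall into an isolated, quickly reported error.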
Solutions
Make I/O Async:

```python
# BAD: Blocks the request
@fraiseql.mutation
def create_user(email: str) -> User:
    user = db.create(email)
    send_email(user.email)  # Blocks!
    return user

# GOOD: Async (don't wait for the email)
@fraiseql.mutation
async def create_user(email: str) -> User:
    user = await db.create(email)
    # Send email in the background
    asyncio.create_task(send_email(user.email))
    return user
```

Use Circuit Breaker:

```python
# Fail fast if a service is down
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
def call_external_service():
    # After 5 failures, stop trying for 60 seconds
    # Return an error immediately instead of waiting
    pass
```

Implement Bulkheads:

Separate connection pools:

- User pool: 20 connections
- Admin pool: 10 connections
- Batch pool: 10 connections

This prevents one workload type from exhausting all connections.

Performance Monitoring
Key Metrics to Track

Response Latency (p50, p95, p99)
├── p50: Should be < 100ms
├── p95: Should be < 200ms
└── p99: Should be < 500ms

Database Queries
├── Count: < 5 per request
├── Latency: < 50ms average
└── Connection pool: < 80% utilized

Memory
├── Usage: Should be stable
├── Growth: < 1MB/minute
└── GC: No long pause times

Error Rate
├── Target: < 0.1%
└── Alerts: > 1% for 5 minutes

Cache
├── Hit rate: > 80%
├── Size: < 80% of available memory
└── Evictions: Normal (TTL expiry)

Set Up Monitoring
Section titled “Set Up Monitoring”# Prometheus scrape configurationglobal: scrape_interval: 15s
scrape_configs: - job_name: 'fraiseql' static_configs: - targets: ['localhost:9000']Alerts to Configure
```yaml
# Alert when p95 > 500ms
- alert: HighLatency
  expr: fraiseql_request_duration_seconds{quantile="0.95"} > 0.5

# Alert when error rate > 1%
- alert: HighErrorRate
  expr: rate(fraiseql_errors_total[5m]) > 0.01

# Alert when connection pool > 90%
- alert: ConnectionPoolNearFull
  expr: fraiseql_db_connections_used / fraiseql_db_connections_max > 0.9

# Alert when cache hit rate < 70%
- alert: LowCacheHitRate
  expr: fraiseql_cache_hit_ratio < 0.7
```

Benchmarking Your Changes
Section titled “Benchmarking Your Changes”# Before optimizationcurl -s http://localhost:8000/graphql -d '{"query":"..."}' | jq '.timing'
# After optimization# Should see improvement in latency and query count
# Use load testing toolk6 run benchmark.js# Compare: p95 latency, error rate, throughput
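For a quick before/after comparison without external tooling, any request function can be timed directly in Python; `run_query` below is a stand-in for a real call to your GraphQL endpoint:

```python
import statistics
import time

def benchmark(run_query, iterations: int = 50) -> dict:
    """Time repeated calls and summarize median and tail latency."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000)  # ms
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

# Stand-in workload; replace with a request to your endpoint
result = benchmark(lambda: sum(range(10_000)))
print(result)
```

Run it once before and once after the change, and compare the p95 figures rather than single-shot timings, which are too noisy to trust.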