How to Diagnose and Fix Common Issues¶

Expanded troubleshooting guide for resolving setup, pipeline, authentication, browser, database, Docker, and Kubernetes problems.

Prerequisites¶

Access to the Quorvex AI server or development environment
Ability to run CLI commands and view logs

Step 1: Run Quick Diagnostics¶

Before investigating specific issues, gather system state:

make check-env          # Validate configuration
make health-check       # Hit all health endpoints
make prod-status        # Docker service status (production)

Check log files:

tail -f api.log         # Backend API logs (local dev)
tail -f web.log         # Frontend logs (local dev)
make prod-logs          # Docker production logs

Setup and Configuration Issues¶

"ANTHROPIC_AUTH_TOKEN not set"¶

Symptom: CLI or API fails with missing token error.

Fix:

make check-env
# Edit .env and set ANTHROPIC_AUTH_TOKEN=your-actual-token

If running in Docker, ensure .env.prod has the variable and restart:

make prod-restart
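
To confirm the variable is really present in the file you edited, you can parse the env file directly. This is a minimal sketch and not what make check-env does internally. The function names are hypothetical, and it assumes plain KEY=value lines with no export prefix or multi-line values.

```python
from pathlib import Path
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required: List[str]) -> List[str]:
    """Return the required keys that are absent or set to an empty value."""
    env = parse_env(text)
    return [key for key in required if not env.get(key)]


def check_env_file(path: str, required: List[str]) -> List[str]:
    """Read an env file from disk and report its missing keys."""
    return missing_keys(Path(path).read_text(), required)
```

Calling check_env_file(".env", ["ANTHROPIC_AUTH_TOKEN"]) returns an empty list when the token is set. Point it at .env.prod to check the Docker configuration.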

"ModuleNotFoundError: No module named 'orchestrator'"¶

Fix: Activate the virtual environment:

source venv/bin/activate
python orchestrator/cli.py specs/my-test.md

Or use make run SPEC=... which activates the venv automatically.

"venv not found" or Missing Dependencies¶

Fix:

make setup

Port 8001 or 3000 Already in Use¶

Fix:

make stop

If ports are still occupied:

lsof -ti :8001 | xargs kill -15
lsof -ti :3000 | xargs kill -15
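
To confirm the ports are free before restarting, a short check can attempt a TCP connection to each one. This is a sketch with hypothetical function names. It assumes the services listen on 127.0.0.1 and only detects listeners that accept TCP connections.

```python
import socket
from typing import List


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports: List[int]) -> List[int]:
    """Filter the given ports down to the ones that are occupied."""
    return [port for port in ports if port_in_use(port)]
```

busy_ports([8001, 3000]) returns an empty list once both ports have been released.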

Test Generation Issues¶

"No target URL found in spec"¶

Cause: The spec file does not contain a navigable URL.

Fix: Ensure your spec includes a URL starting with http:// or https://:

## Steps
1. Navigate to https://example.com

Generated Test Selectors Fail¶

Fix:

1. The healer automatically retries (up to 3 attempts).
2. Use hybrid mode for extended healing:

python orchestrator/cli.py specs/my-test.md --hybrid

3. Check if the target application requires authentication or changed its UI

Test Times Out on Complex Pages¶

Fix: Increase agent timeouts in .env:

AGENT_TIMEOUT_SECONDS=3600
GENERATOR_TIMEOUT_SECONDS=3600

Or use hybrid mode for more healing attempts.

SDK Cancel Scope Errors¶

Symptom: Error mentioning "cancel scope" in stderr.

Cause: Expected behavior -- the Claude Agent SDK throws cleanup errors during shutdown. These are handled automatically by the pipeline.

If you see this in custom code, apply the fix pattern:

result_text = ""
try:
    result_text = await runner.run(prompt)
except Exception as e:
    if "cancel scope" in str(e).lower():
        pass  # SDK cleanup -- result_text already captured
    else:
        raise
# Parse result AFTER the except block

Authentication Issues¶

"Account locked"¶

Symptom: Login returns HTTP 423.

Fix: Wait 15 minutes for the automatic unlock, or clear the lock manually:

-- PostgreSQL
UPDATE users SET failed_login_attempts = 0, locked_until = NULL WHERE email = 'user@example.com';

"Invalid token" Errors¶

Cause: Access token expired (15-minute lifetime) or JWT secret key changed.

Fix:

1. Refresh using POST /auth/refresh with your refresh token.
2. If the refresh token has expired (7 days), log in again.
3. If JWT_SECRET_KEY changed, all tokens are invalidated and users must log in again.
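
To tell an expired token apart from one invalidated by a secret change, you can inspect its expiry locally. This sketch assumes the access token is a standard three-segment JWT with an exp claim, which this guide does not confirm. It does not verify the signature, so use it for diagnostics only. The helper names are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    payload = parts[1]
    # base64url segments omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds until the exp claim (negative if expired), or None if absent."""
    exp = decode_jwt_payload(token).get("exp")
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

A negative result means the token has simply expired and a refresh should work. A positive result alongside an "Invalid token" error points toward a changed JWT_SECRET_KEY.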

Registration Disabled¶

Fix: Set ALLOW_REGISTRATION=true in .env and restart.

Browser Pool Issues¶

Browser Slots Exhausted¶

Symptom: Tests queue up and time out with "Could not acquire browser slot".

Diagnostics:

curl http://localhost:8001/api/browser-pool/status | python3 -m json.tool

Fix:

1. Wait for running operations to complete.
2. Increase the limit: set MAX_BROWSER_INSTANCES=10 in .env.
3. Force cleanup: curl -X POST http://localhost:8001/api/browser-pool/cleanup
4. Scale browser workers: make workers-up && make workers-scale N=8

Exploration Stops Early¶

Fix: Increase limits:

python orchestrator/cli.py --explore https://example.com --max-interactions 100 --timeout 60

Database Issues¶

"Database connection refused"¶

Fix (PostgreSQL in Docker):

docker compose up -d db
# or
make prod-up

Fix (SQLite fallback):

# In .env
DATABASE_URL=sqlite:///./test.db
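
Before restarting containers, it can help to confirm which host and port DATABASE_URL actually points to and whether that address is reachable. This is a sketch with hypothetical helper names. It assumes a URL-style connection string and falls back to the PostgreSQL default port 5432 when none is given.

```python
import socket
from typing import Optional, Tuple
from urllib.parse import urlparse


def db_host_port(database_url: str) -> Optional[Tuple[str, int]]:
    """Extract (host, port) from a database URL; None for file-based SQLite."""
    parsed = urlparse(database_url)
    if parsed.scheme.startswith("sqlite") or not parsed.hostname:
        return None
    # 5432 is the PostgreSQL default when the URL omits a port.
    return parsed.hostname, parsed.port or 5432


def can_connect(database_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the database host succeeds."""
    target = db_host_port(database_url)
    if target is None:
        return True  # file-based database: nothing to reach over the network
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False
```

If can_connect returns False for a PostgreSQL URL, the database container is probably stopped or the host name does not resolve from where the backend runs.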

Migration Errors¶

Fix:

# Check migration state
make db-history

# If schema already exists, stamp it
make db-stamp R=001

# Apply pending migrations
make db-upgrade

Auth Endpoints Return 500 After Restore¶

Cause: Missing database columns after Alembic restore.

Fix:

docker compose --env-file .env.prod -f docker-compose.prod.yml exec db \
  psql -U playwright -d playwright_agent -c "
    ALTER TABLE users ADD COLUMN IF NOT EXISTS last_login TIMESTAMP;
    ALTER TABLE refresh_tokens ADD COLUMN IF NOT EXISTS device_info VARCHAR;
    ALTER TABLE refresh_tokens ADD COLUMN IF NOT EXISTS ip_address VARCHAR;
  "
make prod-restart

Docker / Production Issues¶

Container OOM (Out of Memory)¶

Symptom: Container killed with exit code 137.

Fix:

1. Check usage: docker stats
2. Increase limits in docker-compose.prod.yml.
3. Ensure shm_size: 2gb for the backend.
4. Consider workers mode: make workers-up
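
Exit code 137 follows the shell convention of 128 plus the signal number: 137 - 128 = 9, which is SIGKILL, the signal the kernel OOM killer sends. The sketch below decodes any such code. describe_exit_code is a hypothetical helper and assumes a POSIX host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "exited normally" if code == 0 else f"exited with error status {code}"
```

SIGKILL can also come from a manual docker kill, so treat 137 as a strong hint, not proof. The OOMKilled field in docker inspect output confirms whether the kernel killed the container for memory.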

VNC Not Connecting¶

Fix:

1. Verify VNC_ENABLED=true in .env.prod.
2. Check supervisord status:

docker exec playwright-agent-backend-1 supervisorctl status

All processes (xvfb, fluxbox, x11vnc, websockify, uvicorn) should be RUNNING.

Backup Services Can't Connect¶

Symptom: DNS errors like lookup minio: no such host.

Fix: Ensure each backup service lists playwright-network under its networks: key in docker-compose.prod.yml so that it can resolve other services, such as minio, by name.

Redis Connection Failed¶

Fix:

docker ps | grep redis
docker compose restart redis
docker exec -it playwright-agent-redis-1 redis-cli ping

The application degrades gracefully: rate limiting uses in-memory storage and the agent queue falls back to direct execution.

Kubernetes Issues¶

Pods Stuck in Pending¶

Diagnostics:

kubectl describe pod <pod-name> -n playwright-agent
kubectl get pvc -n playwright-agent

HPA Not Scaling¶

Fix: Install metrics server if missing:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Browser Worker Crashes¶

Cause: Insufficient shared memory for Chromium.

Fix: Increase sizeLimit for the /dev/shm emptyDir in browser-worker-deployment.yaml.

Log File Locations¶

Environment	Log	Location
Local dev	Backend	`api.log` (project root)
Local dev	Frontend	`web.log` (project root)
Docker prod	All	`make prod-logs`
Kubernetes	Backend	`kubectl logs -l app=backend -n playwright-agent`
Kubernetes	Workers	`kubectl logs -l app=browser-worker -n playwright-agent`

Health Endpoints¶

Endpoint	Purpose
`GET /health`	Backend API status
`GET /health/storage`	Local + MinIO storage
`GET /health/backup`	Last backup info
`GET /health/alerts`	Active alerts
`GET /api/browser-pool/status`	Browser pool usage
`GET /api/agents/queue-status`	Agent queue status
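
The endpoints in the table can be probed in one pass with a short script. This is a minimal sketch and not the implementation of make health-check. It assumes the backend listens on http://localhost:8001 (the port used elsewhere in this guide) and that the endpoints answer unauthenticated GET requests. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List

# Paths taken from the table above.
HEALTH_PATHS = [
    "/health",
    "/health/storage",
    "/health/backup",
    "/health/alerts",
    "/api/browser-pool/status",
    "/api/agents/queue-status",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET request, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def probe(base_url: str, paths: Iterable[str] = HEALTH_PATHS,
          fetch: Callable[[str], int] = http_status) -> Dict[str, int]:
    """Map each path to the status code returned by fetch."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) for path in paths}


def failing(results: Dict[str, int]) -> List[str]:
    """Return the paths whose status is not in the 2xx range."""
    return [path for path, code in results.items() if not 200 <= code < 300]
```

failing(probe("http://localhost:8001")) lists the endpoints that need attention. A status of 0 means the request never reached the server.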

Verification¶

After fixing any issue:

Run make health-check to verify all services are healthy
Run a simple test spec to confirm end-to-end functionality
Check the dashboard loads and can list specs/runs

Related Guides¶

Getting Started -- initial setup
Deployment -- deployment modes and configuration
Disaster Recovery -- recovery from data loss
Authentication -- auth-specific issues
