Skip to main content
This runbook covers day-to-day operations, monitoring, and failure recovery for the AI content system.

Monitoring Views

Three views provide operational visibility without complex queries:
ViewPurpose
v_ai_queue_healthQueue depth by status, avg processing time, oldest queued job
v_ai_provider_usagePer-provider call counts, success rates, costs (today/month)
v_tenant_ai_budget_statusPer-tenant budget usage, remaining budget, percentage used

Common Tasks

Check queue health

SELECT * FROM v_ai_queue_health;

Check provider health

SELECT
  provider_key,
  is_enabled,
  circuit_state,
  consecutive_failures,
  last_success_at,
  last_failure_at
FROM ai_provider_health
ORDER BY priority;

Check tenant budgets

SELECT * FROM v_tenant_ai_budget_status
ORDER BY pct_used DESC;

Manually trigger narrative regeneration for a tenant

SELECT rpc_test_enqueue_narrative('<tenant_id>'::uuid);
Replace <tenant_id> with the actual tenant UUID.

Increase a tenant’s monthly budget

UPDATE ai_content_budgets
SET monthly_limit_usd = 50.00
WHERE tenant_id = '<tenant_id>';

Disable AI for a specific tenant

UPDATE ai_content_budgets
SET is_enabled = false
WHERE tenant_id = '<tenant_id>';
Re-enable by setting is_enabled = true.

Manually close an open circuit

UPDATE ai_provider_health
SET circuit_state = 'closed',
    consecutive_failures = 0,
    circuit_opened_at = NULL
WHERE provider_key = '<provider_key>';
Valid provider_key values: anthropic, openai, google, groq.

Manually open a circuit (disable a provider)

UPDATE ai_provider_health
SET circuit_state = 'open',
    circuit_opened_at = now()
WHERE provider_key = '<provider_key>';

View recent generation history

SELECT
  id, content_type, tenant_id, status,
  provider, fallback_used, cost_usd,
  created_at, completed_at, error_message
FROM ai_generation_queue
ORDER BY created_at DESC
LIMIT 20;

Check provider costs today

SELECT * FROM v_ai_provider_usage;

Failure Modes & Recovery

Worker not running

Symptom: Queue depth keeps growing, no jobs move to running. Diagnose:
-- Check if any jobs have been claimed recently
SELECT MAX(claimed_at) AS last_claimed FROM ai_generation_queue
WHERE status IN ('running', 'completed');
Fix: Verify the ai-content-worker cron job is active:
SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname = 'ai-content-worker';
If the job exists but isn’t running, check the Edge Function logs in the Supabase dashboard under Edge Functions > ai_content_worker > Logs.

Jobs stuck in running

Symptom: Jobs have status = 'running' for more than 10 minutes. Diagnose:
SELECT id, content_type, tenant_id, claimed_at
FROM ai_generation_queue
WHERE status = 'running'
  AND claimed_at < now() - interval '10 minutes';
Fix: The ai-reclaim-stuck-jobs cron should handle this automatically every 5 minutes. To manually reclaim:
UPDATE ai_generation_queue
SET status = 'queued',
    claimed_at = NULL,
    attempts = attempts + 1
WHERE status = 'running'
  AND claimed_at < now() - interval '10 minutes';

All providers down

Symptom: All circuits are open, jobs keep failing. Diagnose:
SELECT provider_key, circuit_state, consecutive_failures, last_failure_at
FROM ai_provider_health
WHERE circuit_state = 'open';
Fix:
  1. Check each provider’s status page for outages.
  2. If a provider has recovered, manually close its circuit:
UPDATE ai_provider_health
SET circuit_state = 'closed', consecutive_failures = 0, circuit_opened_at = NULL
WHERE provider_key = 'anthropic';
  1. If all providers are genuinely down, wait — the half-open probe will automatically test recovery every 5 minutes.

Tenant budget exceeded

Symptom: A tenant’s AI content stops updating. Jobs are enqueued but skipped. Diagnose:
SELECT * FROM v_tenant_ai_budget_status
WHERE tenant_id = '<tenant_id>';
Fix: Either wait for the monthly reset (1st of month) or increase the budget:
UPDATE ai_content_budgets
SET monthly_limit_usd = 25.00
WHERE tenant_id = '<tenant_id>';

Content hash dedup is wrong (content not regenerating)

Symptom: Data has changed but the narrative still shows old content. Diagnose:
-- Compare current hash vs stored hash
SELECT
  compute_gap_narrative_hash('<tenant_id>'::uuid) AS current_hash,
  content_hash AS stored_hash
FROM ai_gap_narratives
WHERE tenant_id = '<tenant_id>';
If the hashes match but content should differ, the hash function inputs may need updating. Fix: Force regeneration:
SELECT rpc_get_or_request_narrative('<tenant_id>'::uuid, true);
Or manually clear the stored hash to force a cache miss:
UPDATE ai_gap_narratives
SET content_hash = 'force-regen'
WHERE tenant_id = '<tenant_id>';

Alerts Reference

AlertTriggerCron
Circuit openAny provider circuit transitions to openai-alert-circuit-open (every 5 min)
Budget warningTenant reaches 80% of monthly budgetai-alert-budget-warnings (hourly)
Both alerts are sent to Slack via SLACK_WEBHOOK_URL (Edge Function secret). If the webhook is not configured, alerts are silently skipped.