Operations & Runbook

This runbook covers day-to-day operations, monitoring, and failure recovery for the AI content system.

Monitoring Views

Three views provide operational visibility without complex queries:

View	Purpose
`v_ai_queue_health`	Queue depth by status, avg processing time, oldest queued job
`v_ai_provider_usage`	Per-provider call counts, success rates, costs (today/month)
`v_tenant_ai_budget_status`	Per-tenant budget usage, remaining budget, percentage used

Common Tasks

Check queue health

SELECT * FROM v_ai_queue_health;

Check provider health

SELECT
  provider_key,
  is_enabled,
  circuit_state,
  consecutive_failures,
  last_success_at,
  last_failure_at
FROM ai_provider_health
ORDER BY priority;

Check tenant budgets

SELECT * FROM v_tenant_ai_budget_status
ORDER BY pct_used DESC;

Manually trigger narrative regeneration for a tenant

SELECT rpc_test_enqueue_narrative('<tenant_id>'::uuid);

Replace <tenant_id> with the actual tenant UUID.

Increase a tenant’s monthly budget

UPDATE ai_content_budgets
SET monthly_limit_usd = 50.00
WHERE tenant_id = '<tenant_id>';

Disable AI for a specific tenant

UPDATE ai_content_budgets
SET is_enabled = false
WHERE tenant_id = '<tenant_id>';

Re-enable by setting is_enabled = true.

Manually close an open circuit

UPDATE ai_provider_health
SET circuit_state = 'closed',
    consecutive_failures = 0,
    circuit_opened_at = NULL
WHERE provider_key = '<provider_key>';

Valid provider_key values: anthropic, openai, google, groq.

Manually open a circuit (disable a provider)

UPDATE ai_provider_health
SET circuit_state = 'open',
    circuit_opened_at = now()
WHERE provider_key = '<provider_key>';

View recent generation history

SELECT
  id, content_type, tenant_id, status,
  provider, fallback_used, cost_usd,
  created_at, completed_at, error_message
FROM ai_generation_queue
ORDER BY created_at DESC
LIMIT 20;

Check provider costs today

SELECT * FROM v_ai_provider_usage;

Failure Modes & Recovery

Worker not running

Symptom: Queue depth keeps growing, no jobs move to running. Diagnose:

-- Check if any jobs have been claimed recently
SELECT MAX(claimed_at) AS last_claimed FROM ai_generation_queue
WHERE status IN ('running', 'completed');

Fix: Verify the ai-content-worker cron job is active:

SELECT jobid, jobname, schedule, active
FROM cron.job
WHERE jobname = 'ai-content-worker';

If the job exists but isn’t running, check the Edge Function logs in the Supabase dashboard under Edge Functions > ai_content_worker > Logs.

Jobs stuck in `running`

Symptom: Jobs have status = 'running' for more than 10 minutes. Diagnose:

SELECT id, content_type, tenant_id, claimed_at
FROM ai_generation_queue
WHERE status = 'running'
  AND claimed_at < now() - interval '10 minutes';

Fix: The ai-reclaim-stuck-jobs cron should handle this automatically every 5 minutes. To manually reclaim:

UPDATE ai_generation_queue
SET status = 'queued',
    claimed_at = NULL,
    attempts = attempts + 1
WHERE status = 'running'
  AND claimed_at < now() - interval '10 minutes';

All providers down

Symptom: All circuits are open, jobs keep failing. Diagnose:

SELECT provider_key, circuit_state, consecutive_failures, last_failure_at
FROM ai_provider_health
WHERE circuit_state = 'open';

Fix:

Check each provider’s status page for outages.
If a provider has recovered, manually close its circuit:

UPDATE ai_provider_health
SET circuit_state = 'closed', consecutive_failures = 0, circuit_opened_at = NULL
WHERE provider_key = 'anthropic';

If all providers are genuinely down, wait — the half-open probe will automatically test recovery every 5 minutes.

Tenant budget exceeded

Symptom: A tenant’s AI content stops updating. Jobs are enqueued but skipped. Diagnose:

SELECT * FROM v_tenant_ai_budget_status
WHERE tenant_id = '<tenant_id>';

Fix: Either wait for the monthly reset (1st of month) or increase the budget:

UPDATE ai_content_budgets
SET monthly_limit_usd = 25.00
WHERE tenant_id = '<tenant_id>';

Content hash dedup is wrong (content not regenerating)

Symptom: Data has changed but the narrative still shows old content. Diagnose:

-- Compare current hash vs stored hash
SELECT
  compute_gap_narrative_hash('<tenant_id>'::uuid) AS current_hash,
  content_hash AS stored_hash
FROM ai_gap_narratives
WHERE tenant_id = '<tenant_id>';

If the hashes match but content should differ, the hash function inputs may need updating. Fix: Force regeneration:

SELECT rpc_get_or_request_narrative('<tenant_id>'::uuid, true);

Or manually clear the stored hash to force a cache miss:

UPDATE ai_gap_narratives
SET content_hash = 'force-regen'
WHERE tenant_id = '<tenant_id>';

Alerts Reference

Alert	Trigger	Cron
Circuit open	Any provider circuit transitions to `open`	`ai-alert-circuit-open` (every 5 min)
Budget warning	Tenant reaches 80% of monthly budget	`ai-alert-budget-warnings` (hourly)

Both alerts are sent to Slack via SLACK_WEBHOOK_URL (Edge Function secret). If the webhook is not configured, alerts are silently skipped.

API Reference

Cost Management

⌘I

Monitoring Views
Common Tasks
Check queue health
Check provider health
Check tenant budgets
Manually trigger narrative regeneration for a tenant
Increase a tenant’s monthly budget
Disable AI for a specific tenant
Manually close an open circuit
Manually open a circuit (disable a provider)
View recent generation history
Check provider costs today
Failure Modes & Recovery
Worker not running
Jobs stuck in running
All providers down
Tenant budget exceeded
Content hash dedup is wrong (content not regenerating)
Alerts Reference

Get Started

Core Concepts

Using SecurAtlas

Integrations

MSP & Partners

Account & Settings

Integration Setup Guides

Developer Reference

AI Content System

Security & Compliance

Operations & Runbook

Monitoring Views

Common Tasks

Check queue health

Check provider health

Check tenant budgets

Manually trigger narrative regeneration for a tenant

Increase a tenant’s monthly budget

Disable AI for a specific tenant

Manually close an open circuit

Manually open a circuit (disable a provider)

View recent generation history

Check provider costs today

Failure Modes & Recovery

Worker not running

Jobs stuck in `running`

All providers down

Tenant budget exceeded

Content hash dedup is wrong (content not regenerating)

Alerts Reference

Get Started

Core Concepts

Using SecurAtlas

Integrations

MSP & Partners

Account & Settings

Integration Setup Guides

Developer Reference

AI Content System

Security & Compliance

​Monitoring Views

​Common Tasks

​Check queue health

​Check provider health

​Check tenant budgets

​Manually trigger narrative regeneration for a tenant

​Increase a tenant’s monthly budget

​Disable AI for a specific tenant

​Manually close an open circuit

​Manually open a circuit (disable a provider)

​View recent generation history

​Check provider costs today

​Failure Modes & Recovery

​Worker not running

​Jobs stuck in running

​All providers down

​Tenant budget exceeded

​Content hash dedup is wrong (content not regenerating)

​Alerts Reference

Monitoring Views

Common Tasks

Check queue health

Check provider health

Check tenant budgets

Manually trigger narrative regeneration for a tenant

Increase a tenant’s monthly budget

Disable AI for a specific tenant

Manually close an open circuit

Manually open a circuit (disable a provider)

View recent generation history

Check provider costs today

Failure Modes & Recovery

Worker not running

Jobs stuck in `running`

All providers down

Tenant budget exceeded

Content hash dedup is wrong (content not regenerating)

Alerts Reference