# SFMagic Incident Response Runbook

**Last updated:** May 14, 2026
**Owner:** Tim Cox, principal (tim@timcox.co)
**Scope:** Operational steps for responding to a security incident affecting SFMagic. The commitment-level policy lives in `policies.md` §8; this is the playbook for executing it.

This runbook assumes a solo-operator response. Every step is something the principal does, or knows how to do, from a laptop with `.env.local` and admin access to the major platform consoles.

---

## 0. The 60-second triage

When a signal comes in (Sentry alert, customer email, anomalous deploy, weird metric):

1. **Is customer data potentially exposed?** If yes → SEV-1, start the clock.
2. **Is the service down or degraded?** If yes → SEV-1 for a full outage, SEV-2 for a partial outage or degraded performance.
3. **Is it confined to non-production (preview, local)?** If yes → SEV-3, normal-hours response.
4. **Is it noise (one-off transient error, third-party blip)?** If yes → close the loop, no incident.

The clock starts at the moment a credible signal is acknowledged. Note the time. This is `t_detect`.
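
The four questions can be read as a first-match decision, which is sketched below. This is only an illustration: the `TriageSignal` shape and its field names are hypothetical, the full/partial split follows the severity ladder in section 1, and the SEV-3 fallback for signals that match no question is an assumption.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

// Hypothetical signal shape; the field names are illustrative, not from the codebase.
interface TriageSignal {
  customerDataPossiblyExposed: boolean;
  serviceImpact: "none" | "partial" | "full";
  confinedToNonProduction: boolean;
  isNoise: boolean;
}

interface TriageResult {
  severity: Severity | null; // null means noise: close the loop, no incident
  tDetect: Date | null; // set only when an incident is opened
}

// Asks the four triage questions in order; the first "yes" decides.
function triage(signal: TriageSignal, acknowledgedAt: Date): TriageResult {
  let severity: Severity | null;
  if (signal.customerDataPossiblyExposed) {
    severity = "SEV-1";
  } else if (signal.serviceImpact !== "none") {
    // Per the severity ladder: full outage is SEV-1, partial or degraded is SEV-2.
    severity = signal.serviceImpact === "full" ? "SEV-1" : "SEV-2";
  } else if (signal.confinedToNonProduction) {
    severity = "SEV-3";
  } else if (signal.isNoise) {
    severity = null;
  } else {
    // Assumption: a signal that matches no question still gets a SEV-3 look.
    severity = "SEV-3";
  }
  return { severity, tDetect: severity === null ? null : acknowledgedAt };
}
```

Recording `tDetect` in the same step as the severity keeps the notification clock tied to the moment the signal was acknowledged.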

---

## 1. Severity ladder

| Severity | Examples | Response time | Customer notification |
| --- | --- | --- | --- |
| **SEV-1** | Customer data exposure, full outage, compromised production credential, ransomware/data theft, MCP endpoint serving unauthorized tenant | Within 1 hour | Within 72 hours of `t_detect` if personal data is involved |
| **SEV-2** | Partial outage, degraded performance, near-miss exposure (e.g. token leaked in logs but no evidence of use), failed credential rotation | Within 4 hours | If material to a customer, within 7 days |
| **SEV-3** | Minor error spike, single-customer issue, third-party advisory follow-up | Within 1 business day | Only if the customer asks |

If unsure, treat as one severity higher.
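
As a rough sketch, the ladder and the "one severity higher" rule could be encoded like this. The targets are copied from the table above; the `unsure` flag and the function names are hypothetical.

```typescript
type Severity = "SEV-1" | "SEV-2" | "SEV-3";

interface LadderEntry {
  respondWithin: string;
  customerNotification: string;
}

// Targets copied from the severity ladder table.
const LADDER: Record<Severity, LadderEntry> = {
  "SEV-1": {
    respondWithin: "1 hour",
    customerNotification: "within 72 hours of t_detect if personal data is involved",
  },
  "SEV-2": {
    respondWithin: "4 hours",
    customerNotification: "within 7 days if material to a customer",
  },
  "SEV-3": {
    respondWithin: "1 business day",
    customerNotification: "only if the customer asks",
  },
};

// "If unsure, treat as one severity higher." SEV-1 is already the top rung.
function escalate(severity: Severity): Severity {
  return severity === "SEV-3" ? "SEV-2" : "SEV-1";
}

// bestGuess is the operator's first read; unsure is a hypothetical flag for doubt.
function resolveSeverity(bestGuess: Severity, unsure: boolean): Severity {
  return unsure ? escalate(bestGuess) : bestGuess;
}
```

For example, `LADDER[resolveSeverity("SEV-2", true)].respondWithin` yields the SEV-1 response target.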

---

## 2. Detection sources

These are the channels that surface incidents. Each should be reachable from a mobile phone.

| Source | Where it lands | How it surfaces |
| --- | --- | --- |
| Sentry alerts | tim@timcox.co | Push notification on phone |
| Vercel deployment failures | tim@timcox.co | Push notification on phone |
| Resend bounce / complaint | tim@timcox.co | Inbox |
| Stripe payment failure | tim@timcox.co | Inbox |
| Customer report | tim@timcox.co | Inbox |
| Security disclosure | tim@timcox.co | Inbox |
| Weekly digest anomaly | tim@timcox.co (Mondays 09:00 ET) | Inbox |
| Vercel platform incident | status.vercel.com | Manual check |
| Neon platform incident | neonstatus.com | Manual check |

---

## 3. Contain — stop the bleeding first

Before investigating, stop the harm. Containment steps can be undone later; exposure that continues while you investigate cannot.

**Compromised credential.**

1. Rotate the secret in the source-of-truth console (Vercel, Stripe, Anthropic, Salesforce, Resend, Sentry, Neon, GitHub).
2. Update the value in Vercel with `vercel env add <KEY> production` (pipe it in with `printf '%s' "$VALUE" | …` or pass `--value "$VALUE" --yes`; never use `echo`, which appends a trailing newline that gets stored).
3. Trigger a redeploy (`vercel --prod` or push an empty commit to `main`).
4. Revoke the old credential at the provider.
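
The `echo` warning in step 2 is about invisible characters ending up inside the stored secret. A minimal sketch of a pre-flight check on the new value follows; the function name and the exact set of checks are assumptions, not an existing script in the repo.

```typescript
// Returns a list of problems found in a secret value before it is stored.
// The specific checks are illustrative assumptions, not a documented policy.
function secretValueProblems(value: string): string[] {
  const problems: string[] = [];
  if (value.length === 0) {
    problems.push("value is empty");
  }
  if (/[\r\n]$/.test(value)) {
    // This is what `echo "$VALUE" | ...` would produce.
    problems.push("value ends with a newline");
  } else if (value !== value.trim()) {
    problems.push("value has leading or trailing whitespace");
  }
  return problems;
}
```

An empty result means the value is safe to pipe in with `printf '%s'`.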

**Compromised database access.**

1. Rotate `DATABASE_URL` from Neon dashboard → reset role password.
2. Update Vercel env vars (`DATABASE_URL` plus the 16 `POSTGRES_*` / `PG*` aliases — `vercel env pull` to verify alignment).
3. Redeploy.
4. Audit Neon query logs for the suspect time window.
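
One way to check alignment after step 2 is to scan a pulled env file for any variable that still carries the old password. The sketch below assumes the pulled file uses plain `KEY=value` or `KEY="value"` lines; the helper names are hypothetical.

```typescript
// Parses the KEY=value or KEY="value" lines of a dotenv-style file.
function parseEnv(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // not a KEY=value line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    if (value.length >= 2 && value.startsWith('"') && value.endsWith('"')) {
      value = value.slice(1, -1); // strip surrounding double quotes
    }
    vars[key] = value;
  }
  return vars;
}

// Names of variables whose value still contains the old (rotated-out) secret.
function varsStillUsing(envText: string, oldSecret: string): string[] {
  if (oldSecret === "") throw new Error("oldSecret must not be empty");
  const vars = parseEnv(envText);
  return Object.keys(vars)
    .filter((key) => (vars[key] ?? "").includes(oldSecret))
    .sort();
}
```

An empty list suggests every alias was updated. Keep the old password out of shell history and discard it once the check passes.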

**Bad code in production.**

1. Open the Vercel dashboard for the sfmagic project.
2. Find the last-known-good deployment.
3. Click "Promote to Production."
4. Verify rollback took effect by checking the live response on a known endpoint.

**Compromised admin Salesforce session (a tenant's Connected App is misbehaving).**

1. In Salesforce Setup → Connected Apps OAuth Usage → SFMagic → Revoke.
2. Server-side detection kicks in automatically; if not, run `tsx scripts/delete-tenant.ts <slug>` after confirming with the customer.

**Microsoft Entra ID issue (e.g. unauthorized M365 tenant link).**

1. Delete the row from `m365_tenant_links` for the affected `tid`.
2. Rotate `AUTH_ENTRA_ID_SECRET` if the Entra app itself is suspected of compromise.
3. Audit `m365_user_sf_links` for any user linkages under that `tid` and delete them.

**Mass / unknown scope.**

1. If the blast radius is unclear, take the affected endpoint offline rather than guess. Set the route to a maintenance handler via a Vercel deployment, or use Vercel's deployment protection.
2. Then assess.

---

## 4. Assess — scope and impact

Once contained, answer:

- **What data is affected?** (Salesforce records via tokens? Entra identity claims? Stripe customer IDs? Passwords?)
- **How many customers?** (One tenant, a subset, all?)
- **What time window?** (When did exposure start, when was it contained?)
- **Is the data personal under GDPR / CCPA / state law?**
- **Is any of the data Microsoft-customer data covered by the M365 Agent Store terms?**

Tools for assessment:

- Neon query logs and the `usage_events` table.
- Vercel function logs (`vercel logs <deployment-url>`).
- Sentry event search filtered by tenant ID or user.
- Sub-processor logs (Stripe dashboard, Resend dashboard) if the incident touches those paths.

Document findings in a draft post-incident report at `docs/security/incidents/YYYY-MM-DD-slug.md` as you go. It's easier to write while it's fresh.
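
A small helper can build the report path so the naming stays consistent under pressure. This is a sketch under two assumptions: the date in the filename is the UTC date of `t_detect`, and the slug is a lowercased, hyphenated form of a short title.

```typescript
// Lowercases the title and collapses anything that is not a-z or 0-9 into hyphens.
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Builds docs/security/incidents/YYYY-MM-DD-slug.md.
// Assumption: the date is the UTC calendar date of t_detect.
function incidentReportPath(tDetect: Date, title: string): string {
  const day = tDetect.toISOString().slice(0, 10);
  return `docs/security/incidents/${day}-${slugify(title)}.md`;
}
```

For example, a detection on 2026-05-14 titled "Token leaked in logs" maps to `docs/security/incidents/2026-05-14-token-leaked-in-logs.md`.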

---

## 5. Notify — the 72-hour window

The clock for breach notification starts at `t_detect`.

**Required notifications.**

| Recipient | Trigger | Deadline | Channel |
| --- | --- | --- | --- |
| Affected customers | Personal data exposed | Within 72 hours of `t_detect` | Email to admin contact + dashboard banner |
| EU/UK supervisory authority (e.g. ICO, CNIL) | Personal data of EU/UK residents exposed AND likely to result in risk | Within 72 hours of `t_detect` (GDPR Art. 33) | Authority web form |
| California AG | Breach affecting more than 500 California residents | Without unreasonable delay | oag.ca.gov submission |
| Microsoft Partner Center | M365 customer data affected | Within 72 hours | Partner Center incident report |
| Other state regulators | Per state breach notification laws | Per applicable law | Varies |

If a notification deadline cannot be met because the investigation is ongoing, send a preliminary notification within the deadline and a follow-up when more is known. Late is worse than partial.
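
For the rows with a 72-hour window, the deadline arithmetic and the "preliminary beats late" rule can be sketched as follows. The `marginHours` buffer is a hypothetical value chosen for illustration, not something the policy specifies.

```typescript
const HOUR_MS = 60 * 60 * 1000;

// Deadline for notifications with a 72-hour window, measured from t_detect.
function notificationDeadline(tDetect: Date, windowHours: number = 72): Date {
  return new Date(tDetect.getTime() + windowHours * HOUR_MS);
}

// Hours left before the deadline; negative once it has passed.
function hoursRemaining(tDetect: Date, now: Date, windowHours: number = 72): number {
  return (notificationDeadline(tDetect, windowHours).getTime() - now.getTime()) / HOUR_MS;
}

type NoticePlan = "send-full" | "send-preliminary" | "keep-investigating";

// marginHours is a hypothetical safety buffer, not a value from the policy.
function planNotice(hoursLeft: number, investigationDone: boolean, marginHours: number = 12): NoticePlan {
  if (investigationDone) return "send-full";
  return hoursLeft <= marginHours ? "send-preliminary" : "keep-investigating";
}
```

Once `hoursRemaining` drops inside the margin with the investigation still open, the plan switches to a preliminary notice, with a follow-up when more is known.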

**What goes in a customer notification.**

1. What happened (plain language, no jargon).
2. What data was affected (specifically).
3. When it happened (time window).
4. What the customer should do (rotate Salesforce credentials, watch for phishing, etc.).
5. What SFMagic has done in response.
6. Direct contact for questions: tim@timcox.co.
7. Expected timeline for further updates.
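
The seven items above can double as a completeness check on a draft before it is sent. The sketch below is illustrative; the `CustomerNotice` type and its field names are hypothetical.

```typescript
// One field per item in the customer notification list.
interface CustomerNotice {
  whatHappened: string;
  dataAffected: string;
  timeWindow: string;
  customerActions: string;
  responseTaken: string;
  contact: string;
  nextUpdate: string;
}

const REQUIRED_FIELDS: Array<keyof CustomerNotice> = [
  "whatHappened",
  "dataAffected",
  "timeWindow",
  "customerActions",
  "responseTaken",
  "contact",
  "nextUpdate",
];

// Returns the fields that are absent or blank in a draft notice.
function missingFields(draft: Partial<CustomerNotice>): Array<keyof CustomerNotice> {
  return REQUIRED_FIELDS.filter((field) => {
    const value = draft[field];
    return value === undefined || value.trim() === "";
  });
}
```

A preliminary notice may legitimately leave some fields thin, but they should be stated as "not yet known" instead of left out.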

**What goes in a regulator notification.** Follow the regulator's form. Generally: nature of the breach, categories and approximate number of data subjects, likely consequences, measures taken or proposed, and contact details for the controller's data protection point of contact.

A draft template lives at `docs/security/templates/breach-notification.md` (create on first use).

---

## 6. Investigate — root cause

Once the immediate fire is out, do the postmortem work.

1. Build a timeline of events from logs (`t_root_cause` → `t_detect` → `t_contain` → `t_remediate`).
2. Identify the technical root cause and the process or design weakness that allowed it.
3. Identify any contributing factors.
4. Identify the person or principal accountable (for solo operations, almost always Tim — name it anyway, no false externalization).
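
Once the four timestamps from step 1 are known, the intervals between them are worth computing explicitly. This is a minimal sketch; the type and field names are hypothetical, and it assumes the markers occur in the order listed in step 1.

```typescript
interface IncidentTimeline {
  tRootCause: Date;
  tDetect: Date;
  tContain: Date;
  tRemediate: Date;
}

interface TimelineSummary {
  hoursUndetected: number; // t_root_cause to t_detect
  hoursToContain: number; // t_detect to t_contain
  hoursToRemediate: number; // t_detect to t_remediate
}

const HOUR_MS = 60 * 60 * 1000;

// Elapsed hours from start to end; throws if the two are out of order.
function hoursBetween(start: Date, end: Date): number {
  const hours = (end.getTime() - start.getTime()) / HOUR_MS;
  if (hours < 0) throw new Error("timeline is out of order");
  return hours;
}

function summarizeTimeline(t: IncidentTimeline): TimelineSummary {
  hoursBetween(t.tContain, t.tRemediate); // only checks that contain precedes remediate
  return {
    hoursUndetected: hoursBetween(t.tRootCause, t.tDetect),
    hoursToContain: hoursBetween(t.tDetect, t.tContain),
    hoursToRemediate: hoursBetween(t.tDetect, t.tRemediate),
  };
}
```

A large `hoursUndetected` points at a detection gap; a large `hoursToContain` points at the response process.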

---

## 7. Remediate — permanent fix

1. Ship the code or configuration change that prevents recurrence.
2. Add a regression test if applicable.
3. Update affected policies in `docs/security/policies.md` if the design assumption changed.
4. Update this runbook if the response process itself needs improving.

---

## 8. Document — the post-incident report

Filed at `docs/security/incidents/YYYY-MM-DD-slug.md`. Required sections:

1. **Summary** (1–2 sentences).
2. **Timeline** (`t_root_cause`, `t_detect`, `t_contain`, `t_notify`, `t_remediate`).
3. **What happened** (technical narrative).
4. **Impact** (data, customers, duration).
5. **Detection** (how it was found; could it have been found earlier).
6. **Containment** (what was done under §3, Contain).
7. **Root cause**.
8. **Remediation** (what shipped).
9. **Action items** (with dates and owners).
10. **Notifications sent** (who, when, link to communication).

The report stays in the repo permanently. SEV-1 reports are linked from `policies.md` §8.
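
To avoid forgetting a section, the report can start from a generated skeleton. The sketch below emits the ten required sections as headings; the heading format and the `TODO` placeholders are assumptions about layout, not a house style.

```typescript
// The ten required sections, in the order listed above.
const REPORT_SECTIONS: string[] = [
  "Summary",
  "Timeline",
  "What happened",
  "Impact",
  "Detection",
  "Containment",
  "Root cause",
  "Remediation",
  "Action items",
  "Notifications sent",
];

// Builds a markdown skeleton with one numbered heading per required section.
function reportSkeleton(title: string): string {
  const lines: string[] = [`# ${title}`, ""];
  REPORT_SECTIONS.forEach((section, index) => {
    lines.push(`## ${index + 1}. ${section}`, "", "TODO", "");
  });
  return lines.join("\n");
}
```

Writing the output to the incident file at `t_detect` gives the draft a place to accumulate notes during the assess step.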

---

## 9. Cleanup

1. Confirm all rotated credentials are propagated (Vercel envs aligned with provider, redeploys complete).
2. Confirm all logs and artifacts needed for the postmortem are archived (Vercel function logs age out after 30 days; Sentry events after 90).
3. Close the incident in any tracking issue.
4. Schedule a check-in 30 days out to verify the remediation held.
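
The dates implied by steps 2 and 4 can be computed up front and put on a calendar. This sketch takes the 30-day and 90-day retention figures from step 2 as given; they may differ for your plan. It also assumes the check-in is counted from the day the incident is closed.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

function addDays(date: Date, days: number): Date {
  return new Date(date.getTime() + days * DAY_MS);
}

interface CleanupDates {
  vercelLogsExpire: Date; // archive Vercel function logs before this
  sentryEventsExpire: Date; // archive Sentry events before this
  checkIn: Date; // verify the remediation held
}

// incidentStart is the earliest event you need logs for.
// Assumption: the 30-day check-in is counted from closedAt.
function cleanupDates(incidentStart: Date, closedAt: Date): CleanupDates {
  return {
    vercelLogsExpire: addDays(incidentStart, 30),
    sentryEventsExpire: addDays(incidentStart, 90),
    checkIn: addDays(closedAt, 30),
  };
}
```

If the root cause predates detection by weeks, the Vercel log deadline may already be close when the incident opens, so archive those logs first.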

---

## Quick command reference

```bash
# View current production env
vercel env ls production

# Add or update a secret (NEVER use echo — trailing newline gets stored)
printf '%s' "$VALUE" | vercel env add KEY production
# or
vercel env add KEY production --value "$VALUE" --yes

# Pull env vars to local (this pulls development; pass --environment=production to check the live values)
vercel env pull .env.local --environment=development --yes

# Roll back to the last-known-good deployment
# (use the Vercel dashboard — Promote to Production on the prior deployment)

# Tail production logs
vercel logs <deployment-url-or-alias>

# Find the recent dispatch errors for a tenant
psql "$DATABASE_URL" -c "SELECT * FROM usage_events WHERE tenant_id = '<id>' AND error IS NOT NULL ORDER BY ts DESC LIMIT 50;"

# Cascade delete a tenant
tsx scripts/delete-tenant.ts <slug>
```

---

## Contact

- **Principal on call:** Tim Cox, tim@timcox.co
- **Out-of-band channel** (if email is the issue): SMS to the phone number on file with the Vercel and Neon accounts.
