System Health
Auto-refreshes every 30 s. KPIs cover the last hour. Calls / hr is the total number of /api dispatches.
Error rate turns yellow at 1 % and red at 5 %.
p50 / p95 is response-time median and 95th percentile (ms) — if p95 climbs while p50 stays flat, the slow tail is getting worse.
Pool wait shows pg connections (idle / total / waiting). Any waiting count above 0 means the pool is jammed, which was the shape of the 2026-04-20 incident.
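As a rough illustration of how the latency and error-rate KPIs above could be derived, the sketch below computes nearest-rank percentiles and an error-rate color from a list of request samples. The `RequestSample` shape, the function names, the choice of nearest-rank, counting every status of 400 or above as an error, and treating the 1 % and 5 % thresholds as inclusive are all assumptions made for illustration; the dashboard's real calculation may differ.

```typescript
// Hypothetical shape of one logged /api call; not the dashboard's real schema.
interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

type Severity = "ok" | "yellow" | "red";

// Nearest-rank percentile: the smallest value with at least p % of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Yellow at 1 %, red at 5 % (inclusive thresholds are an assumption).
function errorRateSeverity(samples: RequestSample[]): Severity {
  if (samples.length === 0) return "ok";
  const errors = samples.filter((s) => s.status >= 400).length;
  const rate = errors / samples.length;
  if (rate >= 0.05) return "red";
  if (rate >= 0.01) return "yellow";
  return "ok";
}
```

A widening gap between `percentile(durations, 95)` and `percentile(durations, 50)` across refreshes is the "slow tail getting worse" signal described above.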
Top methods by p95 (last hour)
Slowest endpoints, ordered by 95th-percentile latency. If a method users hit a lot (getProfile, getSched) lands here, they'll feel it. New arrivals often mean a recently shipped query without a matching index.
| Method | Calls | p50 | p95 | Max | Errors |
|---|---|---|---|---|---|
Recent errors (last hour)
Every 4xx + 5xx response. 401s on protected methods are fine (unauthenticated probes). 5xx anywhere = a bug to investigate; the firm/user columns tell you whose flow hit it.
| When | Method | Firm | User | Status | ms |
|---|---|---|---|---|---|
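The triage rule for this table can be written down as a small classifier. This is only a sketch: the `ErrorRow` shape, the `protectedMethods` set, and the "review" bucket for other 4xx responses are hypothetical and not taken from the dashboard's code.

```typescript
// Hypothetical row shape, loosely following the table columns above.
interface ErrorRow {
  method: string;
  status: number;
  firm: string | null;
  user: string | null;
}

type Triage = "expected" | "investigate" | "review";

function triageError(row: ErrorRow, protectedMethods: Set<string>): Triage {
  // Any 5xx is treated as a bug, regardless of method.
  if (row.status >= 500) return "investigate";
  // 401 on a protected method is an unauthenticated probe.
  if (row.status === 401 && protectedMethods.has(row.method)) return "expected";
  // Remaining 4xx responses get a human look (an assumption, not a stated rule).
  return "review";
}
```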
Per-firm activity (last 24 h)
"Total ms" is wall-clock time the API spent on that firm's requests. A firm whose total ms grows wildly faster than its call count is the noisy-neighbor pattern — usually a runaway loop or a hammering integration.
| Firm | Calls | Total ms |
|---|---|---|
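One way to make "total ms grows wildly faster than call count" concrete is to compare two windows for the same firm. The sketch below is an assumption-laden illustration: the `FirmWindow` shape, the two-window comparison, and the default `factor` of 3 are all hypothetical choices, not the dashboard's logic.

```typescript
// Hypothetical per-firm totals for one 24 h window.
interface FirmWindow {
  firm: string;
  calls: number;
  totalMs: number;
}

// Flags a firm when its total ms grew more than `factor` times faster than its calls.
function isNoisyNeighbor(prev: FirmWindow, curr: FirmWindow, factor = 3): boolean {
  // Without a usable baseline there is nothing to compare against.
  if (prev.calls === 0 || prev.totalMs === 0 || curr.calls === 0) return false;
  const callGrowth = curr.calls / prev.calls;
  const msGrowth = curr.totalMs / prev.totalMs;
  return msGrowth > callGrowth * factor;
}
```

This is equivalent to asking whether the firm's mean ms per call grew by more than `factor`, which separates a firm that simply got busier from one whose requests became much more expensive.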
Connection pool
pg pool snapshot at the last cron tick (every 5 min). Healthy: total ≤ max, waiting = 0. waiting > 0 means requests are stuck queueing for a connection; this preceded the 2026-04-20 schedule pool jam.
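The health rule above is small enough to sketch directly. The `PoolSnapshot` field names mirror the counters that node-postgres exposes on a `Pool` (`totalCount`, `idleCount`, `waitingCount`); whether the cron job reads them this way, and the helper names, are assumptions.

```typescript
// Field names follow node-postgres Pool counters; the snapshot type itself is hypothetical.
interface PoolSnapshot {
  totalCount: number;
  idleCount: number;
  waitingCount: number;
}

// Healthy: total <= max and nobody queued for a connection.
function poolIsHealthy(snap: PoolSnapshot, max: number): boolean {
  return snap.totalCount <= max && snap.waitingCount === 0;
}

// Renders the snapshot in the idle / total / waiting order used by the Pool wait KPI.
function formatPool(snap: PoolSnapshot): string {
  return `${snap.idleCount} / ${snap.totalCount} / ${snap.waitingCount}`;
}
```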
Hot tables
Postgres' own scan + analyze stats for the tables we hit most. Last autoanalyze > 24 h is yellow, > 7 d is red — stale stats are what made the 2026-04-20 query plan flip to a seq scan. Big Dead counts mean autovacuum is behind on cleanup.
| Table | Seq | Idx | Live | Dead | Last autoanalyze |
|---|---|---|---|---|---|
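The coloring rule for the Last autoanalyze column could look like the sketch below. The function name and the decision to show a table that has never been autoanalyzed as red are assumptions; the 24 h and 7 d thresholds come from the description above.

```typescript
type Staleness = "ok" | "yellow" | "red";

const HOUR_MS = 60 * 60 * 1000;

// Colors a table by the age of its last autoanalyze.
function autoanalyzeStaleness(lastAutoanalyze: Date | null, now: Date): Staleness {
  // Never analyzed is treated as the worst case (an assumption).
  if (lastAutoanalyze === null) return "red";
  const ageMs = now.getTime() - lastAutoanalyze.getTime();
  if (ageMs > 7 * 24 * HOUR_MS) return "red";
  if (ageMs > 24 * HOUR_MS) return "yellow";
  return "ok";
}
```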
Top queries (pg_stat_statements)
Heaviest SQL queries by total exec time since last reset. A row with high Calls and a moderate Mean ms is what to optimize first: it runs so often that small savings add up. A new entry near the top usually means a missing index on a recently shipped feature.
| Calls | Mean ms | Total ms | Sample |
|---|---|---|---|
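To see why a frequently run, moderately slow query can outrank a rare, very slow one, the sketch below ranks rows by Calls × Mean ms. The `QueryStat` shape and the function name are hypothetical, and the sample SQL strings in the usage note are made up.

```typescript
// Hypothetical row shape, following the table columns above.
interface QueryStat {
  sample: string;
  calls: number;
  meanMs: number;
}

// Orders queries by total time spent (calls * mean), heaviest first.
function rankByTotalTime(stats: QueryStat[]): Array<QueryStat & { totalMs: number }> {
  return stats
    .map((s) => ({ ...s, totalMs: s.calls * s.meanMs }))
    .sort((a, b) => b.totalMs - a.totalMs);
}
```

For example, a query called 10,000 times at 5 ms (50,000 ms total) ranks above one called 3 times at 2,000 ms (6,000 ms total).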
Live activity
Queries actually running on Postgres at the last 5-min cron tick (idle sessions filtered out). Anything > 30 s is yellow and worth a look; > 5 min is probably stuck. Almost always empty when things are healthy.
| PID | State | Wait | Sec | Sample |
|---|---|---|---|---|
Schema validation health
Zod request-validation telemetry (from audit_events) cross-referenced with call traffic — the "why" behind 400s, plus which .passthrough() schemas are clean enough to tighten to .strict(). Mirrors npm run check:zod-drift.
Validation failures — methods rejecting real traffic (the "why" behind 400s)
| Method | Fails | Calls/14d | Top reasons (field=code) |
|---|---|---|---|
Promotion candidates — clean ≥14d + confirmed traffic → reviewed flip to .strict()
| Method | Calls/14d | Schema file |
|---|---|---|
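The promotion rule can be sketched as a filter over per-method telemetry. Everything here is illustrative: the `SchemaTelemetry` fields are hypothetical, and "clean" is assumed to mean that no rows with unmodeled keys were seen in the 14-day window. The flip to .strict() itself stays a reviewed, manual step.

```typescript
// Hypothetical per-method telemetry row; field names are illustrative only.
interface SchemaTelemetry {
  method: string;
  schemaFile: string;
  callsLast14d: number; // confirmed traffic in the window
  undocumentedRowsLast14d: number; // rows carrying keys the schema does not model
}

// A method qualifies when it saw real traffic and no unmodeled keys in the window.
function promotionCandidates(rows: SchemaTelemetry[], minCalls = 1): SchemaTelemetry[] {
  return rows
    .filter((r) => r.callsLast14d >= minCalls && r.undocumentedRowsLast14d === 0)
    .sort((a, b) => b.callsLast14d - a.callsLast14d);
}
```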
Undocumented fields — keys the schema doesn't model (add to schema + api-reference)
| Method | Rows | Fields (name:type) |
|---|---|---|
Plans & Billing
Firms list with tier, billing status, seat usage, and a 30-day jobs-created drift signal.
Click a row to manage that firm's tier, status, add-ons, and overrides.
Source of truth for prices/caps/policy is docs/references/tenant-policy.md.
Cancelled — purge pending
Firms in cancelled status. The "Purge" action becomes available 90 days after cancellation; it hard-deletes the firm and all firm-scoped data. Dry-run first to see row counts.
| # | Firm | Cancelled | Days left | Actions |
|---|---|---|---|---|
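The Days left column and the availability of "Purge" follow from the 90-day rule. A minimal sketch, assuming whole elapsed days are counted from the cancellation timestamp; the function names are hypothetical and the real rounding may differ.

```typescript
const PURGE_DELAY_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// Whole days remaining before the purge action unlocks; never negative.
function daysUntilPurge(cancelledAt: Date, now: Date): number {
  const elapsedDays = Math.floor((now.getTime() - cancelledAt.getTime()) / DAY_MS);
  return Math.max(0, PURGE_DELAY_DAYS - elapsedDays);
}

function canPurge(cancelledAt: Date, now: Date): boolean {
  return daysUntilPurge(cancelledAt, now) === 0;
}
```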
Firms
| # | Firm | Tier | Status | Seats (O/A/PM) | Add-ons | MRR | Jobs 30d | Last seen |
|---|---|---|---|---|---|---|---|---|
Firm
Summary
Tier
Status
Automated transitions (trialing → past_due → soft_blocked) happen via the QB poller. Use this for manual hard_block, cancel, or reactivate.
Add-ons
| Code | Name | Price | Started | Cancelled |
|---|---|---|---|---|
Plan overrides
Keys: seat_overage.<role>, seat_cap.<role>, comp_addon.<code>. Value is JSON (integer for seat counts, true for comp).
| Key | Value | Reason | Expires | Added |
|---|---|---|---|---|
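The key and value rules for overrides lend themselves to a small validator. This sketch is an illustration only: `parseOverride` and its types are hypothetical, rejecting negative seat counts is an assumption, and the role and add-on codes in the usage note are invented examples.

```typescript
type OverrideKind = "seat_overage" | "seat_cap" | "comp_addon";

interface ParsedOverride {
  kind: OverrideKind;
  target: string; // the <role> or <code> part of the key
  value: number | true;
}

// Returns the parsed override, or null when the key or JSON value is not acceptable.
function parseOverride(key: string, jsonValue: string): ParsedOverride | null {
  const dot = key.indexOf(".");
  if (dot <= 0 || dot === key.length - 1) return null;
  const kind = key.slice(0, dot);
  const target = key.slice(dot + 1);

  let value: unknown;
  try {
    value = JSON.parse(jsonValue);
  } catch {
    return null; // not valid JSON
  }

  if (kind === "seat_overage" || kind === "seat_cap") {
    // Seat counts must be integers; non-negative is an assumption.
    return typeof value === "number" && Number.isInteger(value) && value >= 0
      ? { kind, target, value }
      : null;
  }
  if (kind === "comp_addon") {
    // Comped add-ons carry the literal JSON value true.
    return value === true ? { kind, target, value } : null;
  }
  return null;
}
```

With a made-up role `pm`, `parseOverride("seat_cap.pm", "5")` yields a seat cap of 5, while `parseOverride("seat_cap.pm", "\"5\"")` is rejected because the JSON value is a string rather than an integer.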
DMARC
Aggregate DMARC reports parsed from keith@austintreeexperts.com daily at 01:30 UTC (daily-dmarc-ingest in Cloud Scheduler).
% aligned is the share of messages whose DKIM or SPF lines up with the From: domain; anything below ~99 % over a 30-day window warrants a look at the top source IPs.
% enforced is what receivers actually applied (quarantine + reject); it should track the published policy after a flip from p=none to p=quarantine or p=reject.
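The two percentages can be sketched as message-weighted shares over the rows of the aggregate reports. The `DmarcRow` shape and function name are hypothetical; the disposition values none, quarantine, and reject are the ones defined for DMARC aggregate reports. How the ingest job actually stores rows is not shown here.

```typescript
// Hypothetical flattened row from an aggregate report: one source with a message count.
interface DmarcRow {
  count: number;
  dkimAligned: boolean;
  spfAligned: boolean;
  disposition: "none" | "quarantine" | "reject";
}

// Message-weighted percentages; a message is aligned if DKIM or SPF aligns.
function dmarcSummary(rows: DmarcRow[]): { pctAligned: number; pctEnforced: number } {
  const total = rows.reduce((sum, r) => sum + r.count, 0);
  if (total === 0) return { pctAligned: 0, pctEnforced: 0 };
  const aligned = rows
    .filter((r) => r.dkimAligned || r.spfAligned)
    .reduce((sum, r) => sum + r.count, 0);
  const enforced = rows
    .filter((r) => r.disposition !== "none")
    .reduce((sum, r) => sum + r.count, 0);
  return { pctAligned: (aligned / total) * 100, pctEnforced: (enforced / total) * 100 };
}
```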
Recent reports
Most recent 100 reports in the window, newest first. Policy is what your DNS _dmarc record was advertising at the time the receiver sampled. Unaligned > 0 means at least some mail from that submitter failed both DKIM and SPF alignment; drill into the source-IP table below to see who.
| Submitter | Domain | Window | Policy | Total | Aligned | Unaligned |
|---|---|---|---|---|---|---|
Top source IPs
Source IPs sending mail on your behalf, ranked by message volume in the window. Aligned-only rows are the happy path (your own mail flows, plus aligned relays). Unaligned > 0 is the watch list — usually forwarded mail or a relay that hasn't been signed correctly; occasionally a real spoof.
| Source IP | Total | Aligned | Unaligned | Last seen |
|---|---|---|---|---|