System Health

—

Auto-refreshes every 30 s. KPIs cover the last hour. Calls / hr is the total number of /api dispatches. Error rate turns yellow at 1 % and red at 5 %. p50 / p95 is the median and 95th-percentile response time (ms); if p95 climbs while p50 stays flat, the slow tail is getting worse. Pool wait shows pg connections (idle / total / waiting). Any waiting count above 0 means requests are queueing for a connection and the pool is jammed, which is the shape of the 2026-04-20 incident.

Calls / hr

—

Errors / hr

—

Error rate

—

p50 / p95 (ms)

—

Pool wait

—

Revision

—
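As a rough illustration of the KPI rules described for this panel, the sketch below colors an error rate using the 1 % and 5 % thresholds and derives p50 / p95 from a list of response times. The function names are hypothetical, and nearest-rank is only one common percentile definition; the method the dashboard actually uses is not stated.

```typescript
// Sketch only: thresholds come from the panel text; names are hypothetical.
type Severity = "ok" | "yellow" | "red";

// Error rate as a percentage of calls. Zero calls is treated as 0 % (assumption).
function errorRatePct(errors: number, calls: number): number {
  return calls === 0 ? 0 : (100 * errors) / calls;
}

// Yellow at 1 %, red at 5 %.
function errorSeverity(ratePct: number): Severity {
  if (ratePct >= 5) return "red";
  if (ratePct >= 1) return "yellow";
  return "ok";
}

// Nearest-rank percentile over response times in ms (assumed definition).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}
```

A single slow outlier moves p95 while leaving p50 untouched, which is the "slow tail getting worse" signal.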

Top methods by p95 (last hour)

Slowest endpoints, ordered by 95th-percentile latency. If a method users hit a lot (getProfile, getSched) lands here, they'll feel it. New arrivals often mean a recently shipped query without a matching index.

Method	Calls	p50	p95	Max	Errors

Recent errors (last hour)

Every 4xx and 5xx response. 401s on protected methods are expected (unauthenticated probes). Any 5xx is a bug to investigate; the Firm and User columns show whose flow hit it.

When	Method	Firm	User	Status	ms

Per-firm activity (last 24 h)

"Total ms" is wall-clock time the API spent on that firm's requests. A firm whose total ms grows wildly faster than its call count is the noisy-neighbor pattern — usually a runaway loop or a hammering integration.

Firm	Calls	Total ms
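One way to make the noisy-neighbor pattern concrete is to compare average ms per call between two windows: if total ms grows much faster than the call count, ms per call rises. This is a sketch under assumptions. The two-window comparison, the `FirmWindow` shape, and the default factor of 3 are all hypothetical and not taken from the dashboard.

```typescript
// Sketch only: the window comparison and the factor are assumptions.
interface FirmWindow {
  calls: number;
  totalMs: number;
}

function msPerCall(w: FirmWindow): number {
  return w.calls === 0 ? 0 : w.totalMs / w.calls;
}

// True when ms per call grew by at least `factor` between the two windows,
// i.e. total ms grew much faster than the call count.
function isNoisyNeighbor(prev: FirmWindow, curr: FirmWindow, factor = 3): boolean {
  const before = msPerCall(prev);
  if (before === 0) return false; // no baseline to compare against
  return msPerCall(curr) / before >= factor;
}
```

A firm that doubles both its calls and its total ms is simply busier and is not flagged by this check.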

Connection pool

pg pool snapshot at the last cron tick (every 5 min). Healthy: total ≤ max, waiting = 0. waiting > 0 means requests are stuck queueing for a connection; this preceded the 2026-04-20 schedule pool jam.
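The health rule above (total ≤ max, waiting = 0) can be sketched as a small check over a snapshot. The `PoolSnapshot` shape and function name are hypothetical. If the pool is node-postgres (an assumption), the three counts would correspond to `pool.idleCount`, `pool.totalCount`, and `pool.waitingCount`.

```typescript
// Sketch only: snapshot shape and messages are hypothetical.
interface PoolSnapshot {
  idle: number;
  total: number;
  waiting: number;
}

// Returns a list of problems; an empty list means the snapshot looks healthy.
function poolIssues(snapshot: PoolSnapshot, max: number): string[] {
  const issues: string[] = [];
  if (snapshot.total > max) {
    issues.push(`total ${snapshot.total} exceeds max ${max}`);
  }
  if (snapshot.waiting > 0) {
    issues.push(`${snapshot.waiting} request(s) queueing for a connection`);
  }
  return issues;
}
```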

Hot tables

Postgres' own scan + analyze stats for the tables we hit most. Last autoanalyze > 24 h is yellow, > 7 d is red — stale stats are what made the 2026-04-20 query plan flip to a seq scan. Big Dead counts mean autovacuum is behind on cleanup.

Table	Seq	Idx	Live	Dead	Last autoanalyze
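The coloring rule for Last autoanalyze could look roughly like the sketch below. The function name is hypothetical, and treating a table that has never been autoanalyzed as red is an assumption, not something the panel states.

```typescript
// Sketch only: thresholds come from the panel text; the null case is an assumption.
type Staleness = "ok" | "yellow" | "red";

const HOUR_MS = 60 * 60 * 1000;

function autoanalyzeStaleness(lastAutoanalyze: Date | null, now: Date): Staleness {
  if (lastAutoanalyze === null) return "red"; // never analyzed: assumed worst case
  const ageMs = now.getTime() - lastAutoanalyze.getTime();
  if (ageMs > 7 * 24 * HOUR_MS) return "red"; // older than 7 d
  if (ageMs > 24 * HOUR_MS) return "yellow"; // older than 24 h
  return "ok";
}
```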

Top queries (pg_stat_statements)

Heaviest SQL queries by total exec time since the last stats reset. A row with high Calls and a moderate Mean ms is the first thing to optimize, because it runs so often. A new entry near the top usually means a missing index on a recently shipped feature.

Calls	Mean ms	Total ms	Sample
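To see why a frequently run, moderately slow query outranks a rare, very slow one, note that total time is approximately Calls × Mean ms. The sketch below ranks rows that way. The `StatementRow` shape and the sample data in the usage note are hypothetical.

```typescript
// Sketch only: row shape is hypothetical; total is approximated as calls × mean.
interface StatementRow {
  sample: string;
  calls: number;
  meanMs: number;
}

function rankByTotalMs(rows: StatementRow[]): Array<StatementRow & { totalMs: number }> {
  return rows
    .map((r) => ({ ...r, totalMs: r.calls * r.meanMs }))
    .sort((a, b) => b.totalMs - a.totalMs);
}
```

For example, a query called 100,000 times at 4 ms (400,000 ms total) ranks above one called 10 times at 9,000 ms (90,000 ms total).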

Live activity

Queries actually running on Postgres at the last 5-min cron tick (idle sessions filtered out). Anything > 30 s is yellow and worth a look; > 5 min is probably stuck. Almost always empty when things are healthy.

PID	State	Wait	Sec	Sample

Schema validation health

Zod request-validation telemetry (from audit_events) cross-referenced with call traffic — the "why" behind 400s, plus which .passthrough() schemas are clean enough to tighten to .strict(). Mirrors npm run check:zod-drift.

Validation failures — methods rejecting real traffic (the "why" behind 400s)

Method	Fails	Calls/14d	Top reasons (field=code)

Promotion candidates — clean ≥14d + confirmed traffic → reviewed flip to `.strict()`

Method	Calls/14d	Schema file

Undocumented fields — keys the schema doesn't model (add to schema + api-reference)

Method	Rows	Fields (name:type)
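The promotion-candidate rule (clean for at least 14 days, with confirmed traffic) can be sketched as a filter over per-method telemetry. The `SchemaTelemetry` shape is hypothetical, and "clean" is interpreted here as zero validation failures and zero undocumented-field rows in the window. That interpretation is an assumption; `npm run check:zod-drift` may define it differently.

```typescript
// Sketch only: field names and the definition of "clean" are assumptions.
interface SchemaTelemetry {
  method: string;
  schemaFile: string;
  mode: "passthrough" | "strict";
  calls14d: number;
  fails14d: number;
  undocumentedRows14d: number;
}

// Methods still on .passthrough() that saw real traffic and no drift in 14 days.
function promotionCandidates(rows: SchemaTelemetry[]): SchemaTelemetry[] {
  return rows.filter(
    (r) =>
      r.mode === "passthrough" &&
      r.calls14d > 0 &&
      r.fails14d === 0 &&
      r.undocumentedRows14d === 0,
  );
}
```

The filter only produces a review list; the flip to `.strict()` still goes through review, as the panel heading says.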

Plans & Billing

—

Firms list with tier, billing status, seat usage, and a 30-day jobs-created drift signal. Click a row to manage that firm's tier, status, add-ons, and overrides. Source of truth for prices/caps/policy is docs/references/tenant-policy.md.

Cancelled — purge pending

Firms in cancelled status. The "Purge" action becomes available 90 days after cancellation; it hard-deletes the firm and all firm-scoped data. Run a dry-run first to see row counts.

#	Firm	Cancelled	Days left	Actions
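The Days left column and the availability of "Purge" follow from the 90-day rule. A minimal sketch, with hypothetical function names and the assumption that partial days are not counted as elapsed:

```typescript
// Sketch only: rounding of partial days is an assumption.
const PURGE_DELAY_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Whole days remaining until the firm can be purged (never negative).
function purgeDaysLeft(cancelledAt: Date, now: Date): number {
  const elapsedDays = Math.floor((now.getTime() - cancelledAt.getTime()) / MS_PER_DAY);
  return Math.max(0, PURGE_DELAY_DAYS - elapsedDays);
}

function canPurge(cancelledAt: Date, now: Date): boolean {
  return purgeDaysLeft(cancelledAt, now) === 0;
}
```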

Firms

#	Firm	Tier	Status	Seats (O/A/PM)	Add-ons	MRR	Jobs 30d	Last seen

Firm

Summary

Tier

Current tier

Status

Billing status	Reason (audit)

Automated transitions (trialing → past_due → soft_blocked) happen via the QB poller. Use this for manual hard_block, cancel, or reactivate.

Add-ons

Code	Name	Price	Started	Cancelled

Subscribe add-on

Plan overrides

Keys: seat_overage.<role>, seat_cap.<role>, comp_addon.<code>. Value is JSON (integer for seat counts, true for comp).

Key	Value	Reason	Expires	Added

Key	Value (JSON)	Reason	Expires (optional)
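A sketch of how an override entry could be checked before saving, based on the key and value rules stated for this form. The function name and error messages are hypothetical, the role and add-on names in the usage note are placeholders, and the real list of valid roles and add-on codes is not given here.

```typescript
// Sketch only: validates the key prefix and the JSON value type, nothing more.
type OverrideCheck =
  | { ok: true; kind: string; target: string; value: number | true }
  | { ok: false; error: string };

function checkOverride(key: string, rawJson: string): OverrideCheck {
  const m = /^(seat_overage|seat_cap|comp_addon)\.([^.\s]+)$/.exec(key);
  if (!m) return { ok: false, error: "unknown key format" };

  let value: unknown;
  try {
    value = JSON.parse(rawJson);
  } catch {
    return { ok: false, error: "value is not valid JSON" };
  }

  const [, kind, target] = m;
  if (kind === "comp_addon") {
    // Comp add-ons take the JSON literal true.
    return value === true
      ? { ok: true, kind, target, value }
      : { ok: false, error: "comp_addon value must be true" };
  }
  // Seat keys take an integer.
  if (typeof value === "number" && Number.isInteger(value)) {
    return { ok: true, kind, target, value };
  }
  return { ok: false, error: "seat value must be an integer" };
}
```

For instance, `checkOverride("seat_cap.pm", "12")` passes, while `checkOverride("seat_cap.pm", "\"12\"")` fails because the value is a JSON string rather than an integer.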

DMARC

Window —

Aggregate DMARC reports parsed from keith@austintreeexperts.com daily at 01:30 UTC (daily-dmarc-ingest in Cloud Scheduler). % aligned is the share of messages whose DKIM or SPF lines up with the From: domain — anything below ~99 % over a 30-day window warrants a look at the top source IPs. % enforced is what receivers actually applied (quarantine + reject); it should track the published policy after a flip from p=none to p=quarantine or p=reject.

Messages

—

% aligned

—

% enforced

—

Quarantined / rejected

—

Reports

—

Last report

—
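The two percentages in this panel can be derived from aggregate report rows roughly as follows. The `DmarcRow` shape is a simplified, hypothetical view of a report record (message count, DKIM and SPF alignment results, applied disposition), and using total messages as the denominator for % enforced is an assumption.

```typescript
// Sketch only: simplified row shape; denominators are assumed to be total messages.
interface DmarcRow {
  count: number;
  dkim: "pass" | "fail";
  spf: "pass" | "fail";
  disposition: "none" | "quarantine" | "reject";
}

interface DmarcSummary {
  total: number;
  pctAligned: number;
  pctEnforced: number;
}

function summarizeDmarc(rows: DmarcRow[]): DmarcSummary {
  let total = 0;
  let aligned = 0;
  let enforced = 0;
  for (const r of rows) {
    total += r.count;
    // Aligned when either DKIM or SPF lines up with the From: domain.
    if (r.dkim === "pass" || r.spf === "pass") aligned += r.count;
    // Enforced when the receiver quarantined or rejected the message.
    if (r.disposition !== "none") enforced += r.count;
  }
  if (total === 0) return { total: 0, pctAligned: 0, pctEnforced: 0 };
  return {
    total,
    pctAligned: (100 * aligned) / total,
    pctEnforced: (100 * enforced) / total,
  };
}
```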

Recent reports

Most recent 100 reports in the window, newest first. Policy is what your DNS _dmarc record was advertising at the time the receiver sampled. Unaligned > 0 means at least some mail from that submitter failed both DKIM and SPF alignment — drill into the source-IP table below to see who.

Submitter	Domain	Window	Policy	Total	Aligned	Unaligned

Top source IPs

Source IPs sending mail on your behalf, ranked by message volume in the window. Aligned-only rows are the happy path (your own mail flows, plus aligned relays). Unaligned > 0 is the watch list — usually forwarded mail or a relay that hasn't been signed correctly; occasionally a real spoof.

Source IP	Total	Aligned	Unaligned	Last seen
