# AI CEO Operator Scorecard

A 5-minute self-audit. Score each dimension 0, 1, or 2. Total out of 20.

---

## How to score

- **0** — not built, broken, or missing entirely
- **1** — exists but fragile, undocumented, or you're not confident it works under pressure
- **2** — solid, tested, and you'd trust it running overnight without you watching

Score yourself honestly. The number that makes you wince is the one to fix first.

---

## Dimension 1: Identity

Does your agent know who it is, what it's optimizing for, and how to behave when no one is watching?

| Score | Criteria |
|-------|----------|
| 0 | No SOUL or system prompt. Agent behavior is inconsistent across sessions. |
| 1 | Has a system prompt but it was written once, never revised, and doesn't reflect your actual priorities. |
| 2 | Has a SOUL.md (or equivalent) that defines voice, values, and operating principles. You've revised it based on actual agent behavior. It's under version control. |

**Your score (0 / 1 / 2):** ___

---

## Dimension 2: Memory

Can your agent recover context after a session ends, a crash, a restart, or a week of silence?

| Score | Criteria |
|-------|----------|
| 0 | No persistent memory. Every session starts from zero. |
| 1 | Has some memory (files, DB, notes) but it's ad hoc. The agent doesn't know how to read or write it consistently. |
| 2 | Has a defined memory system: PRIORITIES.md, project notes, or a memory DB. The agent writes to it, reads from it, and you've seen it actually help across session gaps. |

**Your score (0 / 1 / 2):** ___
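
The score-2 bar (the agent writes to memory and reads from it consistently) can be as simple as an append-only JSONL file. A minimal sketch; the file layout and field names are assumptions, not a prescribed schema:

```python
import json
import time
from pathlib import Path

def remember(path: Path, kind: str, text: str) -> None:
    """Append one timestamped note so a later session can recover context."""
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {"ts": time.time(), "kind": kind, "text": text}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def recall(path: Path, limit: int = 20) -> list[dict]:
    """Read the most recent notes at session start (oldest first)."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-limit:]]
```

Append-only means a crash mid-write loses at most one note, and the file diffs cleanly under version control.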

---

## Dimension 3: Heartbeat

Does your agent check in and act on a real cadence — without you prompting it?

| Score | Criteria |
|-------|----------|
| 0 | Agent only runs when you message it. No autonomous scheduling. |
| 1 | Has a cron or scheduled task, but it's unreliable, badly timed, or the output isn't useful. |
| 2 | Has at least two working scheduled jobs (e.g. morning brief, nightly review). You receive them. They contain actionable content. You've checked the logs and they fire reliably. |

**Your score (0 / 1 / 2):** ___
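
The two-job score-2 setup can be two crontab lines. A sketch, assuming your agent exposes hypothetical CLI entry points `agent-brief` and `agent-review`; `flock -n` stops a slow run from overlapping the next one, and redirecting output gives you the logs to check:

```cron
# m   h   dom mon dow  command
0    7    *   *   *    flock -n /tmp/agent-brief.lock  /usr/local/bin/agent-brief  >> /var/log/agent/brief.log  2>&1
30   22   *   *   *    flock -n /tmp/agent-review.lock /usr/local/bin/agent-review >> /var/log/agent/review.log 2>&1
```

"You've checked the logs and they fire reliably" then becomes a ten-second scan of those two log files.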

---

## Dimension 4: Escalation

Does your agent know what it can handle alone versus what needs you?

| Score | Criteria |
|-------|----------|
| 0 | Agent either does everything autonomously (dangerous) or asks for permission constantly (useless). |
| 1 | Has some rules about what to escalate, but they're in a comment or a prompt line — not enforced. |
| 2 | Has a clear escalation boundary: defined actions it takes autonomously, defined actions it flags to you first, and a delivery path (Telegram, Slack, etc.) you actually see. The boundary has been tested. |

**Your score (0 / 1 / 2):** ___
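
One way to make the boundary enforced rather than aspirational is a default-deny classifier in front of every tool call. A sketch with illustrative action names and an assumed $25 cost threshold; the real lists belong in config, not code:

```python
from enum import Enum

class Action(Enum):
    AUTONOMOUS = "autonomous"   # agent proceeds on its own
    ESCALATE = "escalate"       # agent flags you first

# Illustrative categories, not a prescribed taxonomy.
AUTONOMOUS_ACTIONS = {"send_status_update", "draft_reply", "run_report"}
ALWAYS_ESCALATE = {"send_payment", "delete_data", "sign_contract"}

def classify(action: str, cost_usd: float = 0.0, limit_usd: float = 25.0) -> Action:
    """Decide whether an action crosses the escalation boundary."""
    if action in ALWAYS_ESCALATE:
        return Action.ESCALATE
    if action in AUTONOMOUS_ACTIONS and cost_usd <= limit_usd:
        return Action.AUTONOMOUS
    # Default-deny: anything unrecognized or expensive gets flagged.
    return Action.ESCALATE
```

The default-deny branch is the part that makes the boundary testable: an action nobody thought about gets escalated, not executed.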

---

## Dimension 5: Operations

Is there a real work queue, handoff process, and logging rhythm?

| Score | Criteria |
|-------|----------|
| 0 | No task queue. Work lives in your head or scattered across chat threads. |
| 1 | Has some task tracking (Notion, Linear, a file) but the agent can't read or update it reliably. |
| 2 | Agent can read the task queue, pick up work, log what it did, and hand off cleanly. You have PM2 (or equivalent) running with log rotation. You can answer "what did the agent do yesterday?" in under 30 seconds. |

**Your score (0 / 1 / 2):** ___

---

## Dimension 6: Revenue Readiness

Can the business side actually capture leads and collect money without you in the loop?

| Score | Criteria |
|-------|----------|
| 0 | No lead capture, no payment flow, or both are manual. |
| 1 | Has a landing page and/or payment link but the agent isn't connected to it. Leads fall into a spreadsheet no one checks. |
| 2 | Lead capture posts to a channel the agent monitors. Stripe (or equivalent) is connected and tested. You have received at least one real payment through the automated flow. The agent knows about new leads and can follow up. |

**Your score (0 / 1 / 2):** ___

---

## Dimension 7: Observability

Can you see what your agent did, and why, without asking it?

| Score | Criteria |
|-------|----------|
| 0 | No logs or traces. The agent's activity is a black box; you rely on the agent's self-report. |
| 1 | Logs exist (console, file, or cron output) but are scattered, unsearchable, or stale. You have to SSH and `tail` to learn anything. |
| 2 | Has a structured activity view (dashboard, queryable store, or daily digest) that shows what ran, when, and what the outcome was — accessible without touching the server. |

**Your score (0 / 1 / 2):** ___
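
A score-2 view doesn't require a dashboard; a daily digest built from a structured (JSONL) activity log already answers "what ran, when, and what was the outcome". A sketch, assuming each log line carries `ts`, `job`, and `outcome` fields (that schema is an assumption):

```python
import json
from collections import Counter
from pathlib import Path

def daily_digest(log_path: Path) -> str:
    """Summarize a JSONL activity log: outcome counts, then a run-by-run list."""
    outcomes = Counter()
    lines = []
    for raw in log_path.read_text(encoding="utf-8").splitlines():
        event = json.loads(raw)
        outcomes[event["outcome"]] += 1
        lines.append(f'{event["ts"]}  {event["job"]:<16} {event["outcome"]}')
    header = ", ".join(f"{n} {o}" for o, n in outcomes.most_common())
    return header + "\n" + "\n".join(lines)
```

Pipe the result into the same channel as your morning brief and you never have to SSH in to answer the question.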

---

## Dimension 8: Cost Control

Can your agent's spend go from $20/mo to $2,000/mo in a week without you noticing?

| Score | Criteria |
|-------|----------|
| 0 | No cost visibility. Agent runs on subscription or API and you'd find out about overruns from your bank, not your system. |
| 1 | You know roughly what it costs (subscription = flat rate) but there are no guardrails against runaway loops, accidental $5 → $500 jumps, or silent model upgrades. |
| 2 | Has budget caps, per-run cost visibility, and/or alerts when spend deviates from baseline. Subscription-backed with clear fallback behavior when you hit your quota. |

**Your score (0 / 1 / 2):** ___
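
The guardrail logic itself is small; the real work is wiring it in front of every run. A sketch with illustrative defaults (a $50/day hard cap and a 2x-over-baseline alert, both assumptions to tune):

```python
def check_spend(today_usd: float, baseline_usd: float,
                hard_cap_usd: float = 50.0, deviation: float = 2.0) -> str:
    """Return 'ok', 'alert', or 'halt' given today's spend vs. a rolling baseline."""
    if today_usd >= hard_cap_usd:
        return "halt"    # stop runs entirely; requires a human reset
    if baseline_usd > 0 and today_usd > deviation * baseline_usd:
        return "alert"   # notify the channel, keep running
    return "ok"
```

The two-tier design matters: the alert catches the $5 → $15 drift early, while the hard cap guarantees a runaway loop can't reach $2,000 before you notice.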

---

## Dimension 9: Recovery

If your agent crashed right now, how long until it's running again — and who would notice?

| Score | Criteria |
|-------|----------|
| 0 | No process supervisor. If the agent dies, it stays dead until you notice. No auto-restart, no crash alert. |
| 1 | Has a process manager (PM2, systemd, Docker) but failures happen silently. You've been surprised by downtime more than once. |
| 2 | Process supervised + crash alerts wired to a channel you read + a documented recovery path that you've actually executed. Uptime is measurable, not hoped for. |

**Your score (0 / 1 / 2):** ___

---

## Dimension 10: Output

What did your agent ship this week that someone besides you would pay for or use?

| Score | Criteria |
|-------|----------|
| 0 | Agent runs but doesn't produce user-facing deliverables. Lots of activity logs, no artifacts. |
| 1 | Ships occasional deliverables (content, code, reports) but cadence is erratic and quality varies. You still re-do most of it. |
| 2 | Consistently produces deliverables at a cadence you can point to (N/week). Someone besides you has used, read, or paid for them. The agent's output compounds. |

**Your score (0 / 1 / 2):** ___

---

## Your Total: ___ / 20

### Interpreting your score

**0–7 — Early.** Foundation gaps. Pick your lowest-scoring dimension and fix it this week. Start with Identity + Memory.

**8–13 — Shipping but fragile.** Operator runs but breaks under pressure. Observability + Recovery are usually the weak spots.

**14–17 — Production-grade.** Operator runs reliably and ships output. Cost Control or Escalation boundary is the likely remaining gap.

**18–20 — Operator-grade.** Rare. Share your score — others will learn from it.

---

## What to do with your score

Circle the dimension where you scored lowest. That's your bottleneck — not a new model, not a better prompt, not another framework.

Fix the bottleneck first. Everything else compounds on top of a solid foundation.

---

## Ready to implement?

**Phoenix Kit** is the full operating system: SOUL.md templates, memory architecture, cron configs, escalation logic, ops logging, and revenue integration — production-tested on a live AI CEO.

[Get Phoenix Kit at phoenixprime.me/kit](https://phoenixprime.me/kit)
