TL;DR. Most AI agent deployments at small businesses do not survive week two — not because the model is bad, but because four design questions were never asked before the contract was signed. This AI agent vendor checklist names the four: (1) what is the ONE workflow the agent replaces, (2) what dollar does each successful run produce or save, (3) where is the human approval gate on high-stakes steps, and (4) who gets paged when the agent's behavior drifts. A vendor who cannot answer all four in 60 seconds is not deployment-ready. The pattern is documented from devil's-advocate reviews of 52 agency offers analyzed March–May 2026, with cross-cutting evidence from DEPLOY-grade rows including ReportBot, the AI POD Listing Factory, AI Cold Email Agency, and the LLM-Ready Docs Converter. The same checklist works whether the agent is voice, SMS, or chat — the four questions are agent-class agnostic.
The boring rule of AI agents in 2026: most die by week two. The few that survive share four specific design traits, none of which are about model quality or budget size. They are buyer-side audit questions a non-technical founder can ask any vendor in a 30-minute discovery call. If the vendor hand-waves on even one, the agent will fail in production — and the buyer will eat the setup fee. This article documents the four questions, the failure pattern each one prevents, and how to score a vendor's answer.
What is an AI agent vendor checklist?
An AI agent vendor checklist is a short, repeatable set of buyer-side questions a small business owner asks an AI automation agency before signing a contract — designed to surface the deployment-design gaps that kill most agents in production. Unlike a technical AI agent evaluation framework (which measures model performance, tool use, and reasoning quality after the agent is built), a vendor checklist runs before purchase. It is a procurement filter, not a benchmarking tool.
The distinction matters because the failure modes that kill SMB AI deployments are not model failures. The model usually performs as advertised in the demo. The deployment fails because nobody pinned down what workflow the agent actually owns, what dollar each run produces, what happens on edge cases, and who notices when the agent starts misbehaving. A four-question checklist closes those four gaps before any money changes hands.
The 4 questions that predict whether your deployment survives
Each question maps to a specific design gap. A vendor who lacks a confident answer to even one of the four is not ready to ship into production.
1. What is the ONE workflow this agent replaces?
The agents that survive in production do exactly one job in one workflow. The AI receptionist answers inbound calls and books appointments. The lead-reactivation agent texts cold CRM contacts and re-qualifies them. The speed-to-lead agent calls web form-fills within five minutes. Each is narrow. Each has a single trigger and a single output.
The agents that die in week two were sold as "your AI employee." The vendor pitched autonomy across multiple workflows — inbound, outbound, scheduling, intake, reporting, escalation. The model handled each individual subtask fine in the demo. In production, the decision surface exceeded what the agent could manage reliably, edge cases compounded, and the operator pulled the plug.
The audit: ask the vendor to name the single workflow the agent replaces. Make them say it as one sentence. "The agent handles inbound voicemail-to-appointment for after-hours plumbing calls" is a deployment-ready answer. "The agent is your AI assistant for all customer-facing operations" is not.
Common stack on the survivors: Retell AI or Vapi for voice; n8n or Make for workflow orchestration; a single integration to the operator's existing calendar or CRM. Anything broader needs more discovery, not a faster signature.
2. What dollar does each successful agent run produce or save?
Every surviving agent ties to a specific dollar the operator can audit each month. Missed-call value × recovered call rate. Average booked appointment × bookings per month. Recovered cold lead × close rate. The dead agents promised "time saved" or "better customer experience" — phrases the buyer cannot put on a P&L. When the renewal conversation comes up, "we saved time" loses to "we closed three deals this month worth $48K."
The pattern is consistent across the DEPLOY-grade rows in our internal market-intel database: ReportBot tied to hours of analyst time replaced per report; the AI POD Listing Factory tied to listings published per week × average product margin; the AI Cold Email Agency tied to booked sales meetings × closed-deal rate; the LLM-Ready Docs Converter tied to engineer-hours saved per document standardization run. Every one was dollar-quantifiable from day zero.
The audit: ask the vendor to walk through the dollar math for one successful run. The answer should reference a known input from the buyer's business (their average ticket size, their typical conversion rate, their existing missed-call volume) and arrive at a clean per-run dollar. If the vendor cannot do this calculation in 60 seconds on a discovery call, the buyer's CFO will not approve renewal at month two.
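For concreteness, here is the shape of that math as a minimal Python sketch for a hypothetical missed-call recovery agent. Every number below is an illustrative assumption; a deployment-ready vendor replaces each one with the buyer's real inputs on the call:

```python
# Illustrative per-run dollar math for a hypothetical missed-call recovery agent.
# All inputs are assumptions the vendor should replace with the buyer's numbers.

avg_ticket = 450.00           # buyer's average job value, in dollars
close_rate = 0.35             # fraction of recovered calls that book a job
missed_calls_per_month = 60   # buyer's current after-hours missed-call volume
recovery_rate = 0.70          # fraction of missed calls the agent re-qualifies

# Dollar value of one successful agent run (one recovered call):
dollars_per_run = avg_ticket * close_rate
# 450.00 * 0.35 = $157.50 of expected booked revenue per recovered call

# Monthly value the operator can audit on a P&L:
monthly_value = missed_calls_per_month * recovery_rate * dollars_per_run
# 60 * 0.70 * 157.50 = $6,615.00 per month

print(f"Per-run value: ${dollars_per_run:,.2f}")
print(f"Monthly value: ${monthly_value:,.2f}")
```

The specific numbers do not matter; what matters is that each variable traces back to an input the buyer already knows, so the CFO can re-run the arithmetic at renewal.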
3. Where is the human approval gate on high-stakes steps?
Full autonomy fails on edge cases. The agents that survive have a clearly defined hand-off: the agent drafts, the human approves before the action goes external. The agent flags an unusual conversation; a human handles the call. The agent proposes an outbound message; the operator clicks approve in a Slack or Telegram bot before it sends.
The agents that died in week two promised end-to-end autonomy. The first edge case — an unusual question, an unfamiliar accent, a request the prompt did not anticipate — produced either a confident wrong answer or a non-response, and either one eroded customer trust within days. The buyer never recovered the lost first impression.
The audit: ask the vendor where the approval gate lives. The answer should name the trigger (low confidence score, unrecognized intent, dollar value above threshold, explicit customer escalation request), the human interface (Slack message, Telegram bot, email with one-tap approval), and the response SLA (callback within X hours). "The model is smart enough" is not a deployment-ready answer. "We approve every outbound communication above $X in implied dollar value through a Telegram bot, with a 30-minute human SLA" is.
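A minimal sketch of what the gate's trigger logic can look like, in Python. The triggers mirror the four named above; the thresholds, field names, and intent list are illustrative assumptions, not any vendor's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    """An action the agent wants to take externally, e.g. an outbound message."""
    intent: str                     # classified intent of the conversation
    confidence: float               # model's confidence in that classification, 0..1
    implied_dollars: float          # dollar value the action implies (quote, refund, booking)
    customer_requested_human: bool  # explicit escalation request

# Illustrative thresholds; in practice these come out of discovery, not defaults.
CONFIDENCE_FLOOR = 0.80
DOLLAR_CEILING = 500.00
KNOWN_INTENTS = {"book_appointment", "reschedule", "answer_hours_question"}

def needs_human_approval(action: ProposedAction) -> bool:
    """True if the action must stop at the gate and wait for one-tap approval.

    Mirrors the four triggers named above: low confidence, unrecognized
    intent, dollar value above threshold, explicit customer escalation.
    """
    return (
        action.confidence < CONFIDENCE_FLOOR
        or action.intent not in KNOWN_INTENTS
        or action.implied_dollars > DOLLAR_CEILING
        or action.customer_requested_human
    )
```

Whatever the implementation, the audit passes when the vendor can name each trigger, the channel the approval request travels through, and the SLA on the human response.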
This is the design choice that distinguishes a Day-0-reviewed deployment from a demo that crashes in production. The approval gate is not a sign of weakness — it is the operator's only insurance policy against the agent's hallucination tail.
4. Who gets paged when the agent's behavior drifts?
The fourth question separates the agents that retain operators from the ones that surprise them with a $3,000 invoice in week four. A surviving agent watches itself. Usage alerts at 50%, 75%, and 90% of the monthly minute or token budget. Cost caps that hard-stop runaway voice loops. Error-class tracking so the operator sees repeated failure patterns before they compound. Anomaly detection that catches "the agent is suddenly making 10× normal call volume at 3 AM" before the credit card bill arrives.
When any of these alerts fires, the operator gets a real-time ping — Slack, Telegram, SMS — with a specific action item. Not "something happened, log in to investigate." A specific incident: "Voice minutes at 85% of monthly cap; current trajectory will exceed by 40% next week."
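A minimal sketch of that alerting logic, assuming a simple minutes-based budget and a linear burn-rate projection; the thresholds and message format are illustrative:

```python
from datetime import date
import calendar

ALERT_THRESHOLDS = (0.50, 0.75, 0.90)  # fractions of the monthly budget

def usage_alert(minutes_used: float, monthly_cap: float, today: date) -> str | None:
    """Return an actionable alert string when usage crosses a threshold, else None.

    A minimal sketch: real monitoring would also dedupe so each threshold
    pages the operator only once per month.
    """
    used_fraction = minutes_used / monthly_cap
    if used_fraction < min(ALERT_THRESHOLDS):
        return None

    # Linear trajectory: project end-of-month usage from the current burn rate.
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = minutes_used / today.day * days_in_month
    overage_pct = (projected / monthly_cap - 1) * 100

    msg = f"Voice minutes at {used_fraction:.0%} of monthly cap"
    if overage_pct > 0:
        msg += f"; current trajectory will exceed the cap by {overage_pct:.0f}%"
    return msg

# Example: 850 of 1,000 minutes used on the 20th of a 30-day month produces
# "Voice minutes at 85% of monthly cap; current trajectory will exceed the cap by 28%"
```

The alert string is the whole point: it names the budget, the current position, and the projected overage, so the operator's next action is obvious without logging into a dashboard.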
The dead agents had none of this. The operator found out about the $3,000 voice-minute bill on the credit card statement. By then, trust was gone and the contract was on its way to cancellation.
The audit: ask the vendor who gets paged, on what triggers, through what channel, and with what specific action item. A deployment-ready vendor has all four answers without prompting. An unprepared vendor offers a vague "we monitor the system" or "you can log into the dashboard." Those vendors will not catch the drift before it costs the buyer money.
What unites the agents that pass all four
The cross-cutting pattern from the DEPLOY-grade rows is consistent. The agents that survive share four structural traits that map directly onto the four audit questions:
- Narrow scope. Single workflow, single trigger, single output. No "AI employee" framing.
- Dollar-quantifiable outcome. Every run produces a specific dollar the operator can verify on a P&L.
- Human-in-the-loop on high-stakes steps. Explicit approval gate, defined trigger, defined SLA. The agent drafts; the human approves on consequential actions.
- Self-monitoring with paging. Usage caps, cost alerts, error tracking, anomaly detection — with a real-time ping to the operator when any threshold fires.
A vendor whose pitch covers all four without prompting has built a deployment-ready product. A vendor who hand-waves on any of them has built a demo. The difference shows up in production within two weeks, and it is the single most reliable predictor of whether the deployment survives.
How to use this checklist on a 30-minute discovery call
The checklist is designed to fit inside a single vendor call. Run it in this order:
- Open with question 1. "Walk me through the single workflow this agent replaces." Listen for narrow scope. If the answer describes a multi-workflow system, end the call early and look elsewhere.
- Move to question 2. "What dollar does each successful run produce or save in my business specifically?" Listen for whether the vendor references your inputs (ticket size, conversion rate, current volume) or hides behind generic ROI claims.
- Probe question 3. "Show me where the human approval gate sits in the workflow." Ask the vendor to draw it. If they cannot point to a specific step, the agent is too autonomous for a Day-0 deployment.
- Close with question 4. "Who gets paged when this agent's behavior drifts? What triggers it? Through what channel? With what action item?" The answer should be specific on all four sub-points.
If the vendor passes all four, the agent is worth deeper diligence (references, sample contracts, sandbox demo). If they fail any one, the agent will fail in production — and the buyer will eat the setup fee. The checklist costs nothing to run and saves the cost of a failed deployment.
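The checklist is small enough to encode as a scorecard. A minimal sketch, with illustrative field names and an all-or-nothing pass rule matching the guidance above:

```python
# The four audit questions as a scorecard. Field names and the pass rule
# are illustrative, not a formal scoring standard.
CHECKLIST = {
    "single_workflow": "Can the vendor name the ONE workflow in one sentence?",
    "dollar_per_run": "Can the vendor do the per-run dollar math in 60 seconds?",
    "approval_gate": "Can the vendor point to the gate's trigger, interface, and SLA?",
    "paging": "Can the vendor name who is paged, on what trigger, via what channel?",
}

def deployment_ready(answers: dict[str, bool]) -> bool:
    """Pass/fail is all-or-nothing: hand-waving on any one question fails the vendor."""
    return all(answers.get(q, False) for q in CHECKLIST)

# Example: a vendor who nails three answers but hand-waves on paging still fails.
print(deployment_ready({
    "single_workflow": True,
    "dollar_per_run": True,
    "approval_gate": True,
    "paging": False,
}))  # -> False
```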
Frequently asked questions
What is the most important question to ask an AI agent vendor before signing?
The single most important question is "what is the ONE workflow this agent replaces?" Vendors who name a single, narrow workflow have built a deployment-ready product. Vendors who describe a multi-workflow autonomous "AI employee" have built a demo that will fail in production within two weeks. Narrow scope is the strongest structural predictor of survival across the agents documented in our market-intelligence database.
How long does an AI agent take to fail when it fails?
The dominant failure window is two to three weeks after deployment. Week one is honeymoon — volume is light, edge cases are rare, the operator is excited. Week two introduces the first unusual customer interaction, the first cost surprise, the first behavioral pushback from the team using the agent. By week three the operator is questioning the investment, and most projects that will fail have failed by then.
Is an AI agent vendor checklist different from an AI agent evaluation framework?
Yes. A vendor checklist is buyer-side and runs before purchase — it surfaces deployment-design gaps in 30 minutes of discovery. An AI agent evaluation framework is builder-side and runs after the agent is built — it measures model performance, tool use, reasoning quality, and task completion using benchmarks like the ones documented by Anthropic, AWS, IBM, and DeepEval. A small business owner needs the vendor checklist. The vendor's engineering team needs the evaluation framework.
What does a "human approval gate" mean in an AI agent deployment?
A human approval gate is a defined point in the agent's workflow where the agent stops, surfaces what it is about to do, and waits for a human to approve or reject the action before proceeding. The gate typically lives at high-stakes steps: outbound messages above a dollar threshold, escalations to a human caller, refunds, or any action that touches a regulated communication channel. Common interfaces are Slack, Telegram, or email with one-tap approve/reject buttons.
How do I tell if an AI agent vendor is hand-waving on the cost question?
A deployment-ready vendor walks through the dollar math in the buyer's specific business — referencing the buyer's average ticket size, conversion rate, and existing volume — and arrives at a clean per-run dollar value. A hand-waving vendor falls back on generic claims like "10× ROI" or "saves your team time" without referencing any buyer-specific input. The test is whether the vendor can produce the math in 60 seconds on a discovery call.
Do these four audit questions apply to text agents and voice agents equally?
Yes. The four questions are agent-class agnostic. Workflow scope, dollar quantifiability, human approval gating, and self-monitoring with paging all apply identically to voice, SMS, and chat agents. The specific failure modes differ (voice agents bleed margin through per-minute infrastructure costs; text agents bleed margin through token usage), but the four design questions surface the gap before either failure mode hits production.
What happens if a vendor refuses to answer one of the four questions?
A vendor who refuses or visibly struggles on any one of the four is not deployment-ready, regardless of how strong the demo looked. The four questions are the floor for production readiness. Hand-waving on the workflow question predicts scope creep; on the dollar question predicts a failed CFO review at month two; on the approval gate predicts the first-edge-case incident; on the paging question predicts the surprise invoice. Walk away and find a vendor who can answer all four in 60 seconds.
Sources and methodology
- Cross-cutting pattern documented from DEPLOY-grade verdicts in the Lead Flow Automation Agent Business Ideas market-intelligence database (52 agency offers analyzed March–May 2026), including rows ReportBot (recW72BjtlwyPnh4D), AI POD Listing Factory (rec8PkhjklBsTP3Ec), AI Cold Email Agency (reclT67BaSI5XekrQ), and LLM-Ready Docs Converter (reczpDFSQpaKWUJy9).
- Devil's-advocate reviews on outbound-contact and cost-scaling failure modes referenced as the empirical basis for the "$3K invoice surprise" pattern.
- First-hand operator experience with approval-gate and paging architecture from the agency's own production agent stack (Janus approval bot, Mnemosyne summary feed).
- SERP and intent data for the underlying "AI agent vendor checklist" topic pulled via DataforSEO (May 2026), confirming the buyer-side discovery pattern with content from Salesforce, MyAskAI, docket.io, and LinkedIn industry analysts.
- 2026 Reddit consensus on AI agent deployment failure rates ("most die by week two") referenced as industry-convention context, not a primary citation.
About the author
Gergely Zsigmond runs Lead Flow Automation, an AI-automation agency specializing in deployment-ready agent systems for service businesses. Previously built a production retrieval-augmented generation (RAG) chatbot for the engineering team of a $30B/yr multinational firm, in daily use across the organization. 10+ years in AI, 3 years dedicated to LLM-based software development. Runs devil's-advocate reviews on every agency offer before recommending it to a buyer. Based in Budapest; serves clients in the US, EU, and APAC.
Reach the agency at leadflowautomation.net.