Buyer's guide · 7 min read

How to Evaluate an AI Automation Agency

A practical framework for evaluating AI automation agencies — what to look for, what to ignore, and which questions actually filter signal from marketing.

Published April 23, 2026 · Darren Mullen · o1 Innovate

TL;DR

  • Most AI automation agencies are rebranded no-code shops or advisory firms. Both are legitimate, but neither is what you need if you want production systems running at scale.
  • The single highest-signal filter: ask them to show you a system they built that has been running in production for at least 6 months under real load.
  • Pricing that is measured in hours rather than outcomes is a soft signal the agency is thinking about itself, not you.
  • The 'AI' part is a small fraction of the work. The surrounding system — data, integrations, observability, compliance, ongoing operation — is what determines whether the deployment actually delivers value.

The AI automation category is young and noisy. Most companies selling 'AI automation' today are one of three things: no-code implementers using a familiar label, strategy consultants offering advisory hours, or genuinely capable engineering teams. The three look nearly identical on a homepage and cost roughly the same per hour.

Here is a framework for filtering, based on what actually separates agencies that ship production systems from ones that ship good-looking demos.

The three kinds of AI automation agency

  • No-code implementer. Typical output: Zapier/Make/n8n workflows wired to SaaS APIs. Real strength: fast turnaround on well-defined glue work. Where it fails: breaks at scale; hits a wall when real software is needed.
  • Strategy consultancy. Typical output: decks, process maps, implementation recommendations. Real strength: thinking through the business case and process changes. Where it fails: does not ship software; a hand-off to another vendor is always required.
  • Engineering-led agency. Typical output: production systems with code, data layers, integrations, and ongoing operation. Real strength: ships systems that run for years and scale with the business. Where it fails: higher upfront commitment; fewer vendors operate at this tier.

All three are legitimate categories. You need to know which one you are hiring and why.

Questions that actually filter

Show me a system you built that has been running for 6+ months under real load

This is the single highest-signal question. Anyone can build a demo. Shipping a system that runs unsupervised for months — handling real traffic, real failures, real edge cases — requires engineering discipline that shows up in the answer. Press for specifics: call volume, error rates, maintenance cadence, what broke in month three.

Who owns the code and the data at the end of the engagement

If the answer is anything other than 'you do,' you are buying a vendor lock-in product, not a custom system. Legitimate agencies ship code to your GitHub organization and data to your infrastructure.

Walk me through your observability for a production system

Production AI systems fail in non-obvious ways. Agencies that have actually operated systems at scale have strong answers about per-request logging, error alerting, cost monitoring, and regression detection. Agencies that have mostly demoed systems do not.
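To make that concrete, here is a minimal sketch of what per-request instrumentation can look like. Everything in it is illustrative, not any particular agency's code: `call_model` stands in for whatever model client the system actually uses, and the cost rate is a placeholder rather than a real vendor price.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")

# Placeholder rate for illustration only; real pricing is per-model, per-vendor.
COST_PER_1K_TOKENS = 0.002

def run_request(prompt: str, call_model) -> str:
    """Wrap a model call with per-request logging, latency, and cost tracking.

    `call_model` is a hypothetical callable returning (text, tokens_used);
    swap in whatever client the system actually uses.
    """
    request_id = uuid.uuid4().hex
    start = time.monotonic()
    try:
        text, tokens = call_model(prompt)
    except Exception:
        # Error alerting hook: in production this would page someone,
        # not just write a log line.
        logger.exception("request_failed id=%s", request_id)
        raise
    latency = time.monotonic() - start
    cost = tokens / 1000 * COST_PER_1K_TOKENS
    # One structured record per request: the raw material for cost
    # dashboards, alert thresholds, and regression detection.
    logger.info("request id=%s latency=%.2fs tokens=%d cost=$%.4f",
                request_id, latency, tokens, cost)
    return text
```

The specifics matter less than the shape: one structured record per request is what makes cost monitoring and regression detection tractable later, and it is the kind of detail an agency that has operated systems at scale will describe unprompted.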

What was the last thing you had to fix in production and how did you find out

Good answer: a specific incident with a specific root cause found via a specific alerting mechanism. Weak answer: vague reassurance that nothing has gone wrong. Systems at real scale always have incidents; the question is whether the team is set up to catch and fix them.

How do you think about human-in-the-loop review in AI pipelines

The quality of an AI pipeline is determined mostly by where humans are positioned in it, not by the model. Agencies with a real answer here will talk about specific review checkpoints, feedback loops, and eval harnesses. Agencies without one will talk about how good the model is.
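As a concrete illustration of one such checkpoint (a sketch under stated assumptions, not a prescribed design): outputs below a confidence threshold are diverted to a human review queue instead of shipping automatically. The threshold value, `auto_publish`, and `review_queue` are all hypothetical hooks.

```python
from dataclasses import dataclass

# Illustrative threshold; in practice it is tuned against eval data.
CONFIDENCE_THRESHOLD = 0.85

@dataclass
class PipelineOutput:
    text: str
    confidence: float  # e.g. a classifier score over the model's output

def route_output(output: PipelineOutput, auto_publish, review_queue) -> None:
    """Place the human at an explicit checkpoint: confident outputs ship,
    the rest wait for review.

    `auto_publish` and `review_queue` are hypothetical hooks into whatever
    system consumes the pipeline's output.
    """
    if output.confidence >= CONFIDENCE_THRESHOLD:
        auto_publish(output.text)
    else:
        # Reviewer decisions come back as labeled examples, feeding the
        # eval harness and closing the feedback loop described above.
        review_queue.put(output)
```

An agency with production experience will be able to say where its checkpoints sit, how the threshold was chosen, and what happens to the reviewed examples afterward.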

Signals you can see without asking

  • They can describe specific past engagements in detail — the problem, the architecture, the tradeoffs, the metrics. Vague descriptions with no numbers are a tell.
  • Their own stack is modern and sensible. If an agency building modern AI systems runs on 2015-era tooling themselves, be suspicious.
  • They push back on requirements when they should. Agencies that agree with everything you say in a first call are either selling you something you do not need or not thinking.
  • They are willing to tell you when you should not hire them. Trust that over any positive framing.

Pricing signals

Hourly pricing is a soft signal the agency is optimizing for hours billed rather than outcomes delivered. It is not disqualifying on its own — senior engineers sometimes work hourly — but it changes incentives in ways you should be aware of.

Fixed-scope project pricing aligns incentives better: the agency has an interest in shipping the scoped work efficiently. Value-based pricing (e.g., tied to production volume or revenue) aligns best but only works where the outcome is legible and attributable, which it often is not.

Retainer pricing for post-launch operation is appropriate and you should expect it. AI systems are not set-and-forget software; they need ongoing attention, and an agency that charges for that ongoing attention is behaving correctly.

Red flags

  • Heavy use of the word 'transformational' without specifics.
  • A case-study page with numbers like '10x productivity' but no information about what was actually built.
  • Inability to specify which models, which stack, or which architecture their systems use.
  • Reluctance to share code samples or architecture diagrams from past work under NDA.
  • Claims about AI capabilities that clearly exceed what the public state of the art supports. If it sounds like a magic box, it is a magic box.

A simple evaluation framework

  1. Define the system you actually need. Is it a workflow glue job (no-code shop), a strategy problem (consultancy), or a production engineering project (engineering-led agency)? Hire the right tier for the right problem.
  2. Ask for 1–2 production references. Talk to the client about what actually got built, how it is operating now, and what they would change.
  3. Scope a small first engagement before a large one. A focused 4–6 week build tells you most of what a 6-month commitment would, at roughly a tenth of the cost.
  4. Treat the ongoing operation as part of the scope from day one. Who runs the system in month 7 is as important as who builds it in month 1.

Want a second opinion on this for your situation?

Start a Project