A look at the five dimensions we use to evaluate every finance AI tool—accuracy, speed, ease of use, pricing, and compliance. Why we built this framework, how we apply it consistently across 12 tools, and what its limits are.
2026/05/19
Every AI finance tool claims to be the best. Faster, smarter, cheaper, more accurate. Without a consistent way to compare them, "best" is just marketing copy.
So we built one. It's called methodology v1, and you can find the full reference at /methodology. This post explains the thinking behind it—why these five dimensions, why these scoring anchors, and what this framework can and cannot tell you.
We looked at how founders, freelancers, and finance teams actually evaluate tools. Three patterns kept showing up:
Accuracy first, always. A bookkeeping tool that's 99% right but quietly misclassifies 1% of transactions is more dangerous than one that's 90% right and flags the rest for review. Finance work is unforgiving. Errors compound.
Speed is non-negotiable in production. A tax tool that takes 4 seconds to suggest a deduction is fine. One that takes 40 seconds breaks the flow. Real-world speed—including network calls, retries, and rate limits—matters more than benchmark numbers.
Pricing and ease of use are coupled. A $500/month tool with a polished onboarding is sometimes a better deal than a $50/month tool that takes a week to configure. We track them as separate dimensions but read them together.
The remaining dimension, compliance & security, rounds out the practical picture. A tool with SOC 2 Type II and GDPR data residency options is fundamentally different from one without, even if their feature lists look identical.
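The accuracy pattern from the list above can be made concrete with a little arithmetic. The sketch below assumes a hypothetical volume of 10,000 transactions and takes the two tools from that example at face value: one is 99% right and flags nothing, the other is 90% right and flags everything else for review. The function name, parameters, and volume are illustrative assumptions, not part of the methodology.

```python
def error_exposure(transactions: int, accuracy: float, flags_uncertain: bool) -> dict:
    """Split a tool's non-correct outputs into silent errors and review items.

    Simplifying assumption for illustration: a tool either flags every
    output it would get wrong (flags_uncertain=True) or flags none of them.
    """
    wrong = round(transactions * (1 - accuracy))
    return {
        "silent_errors": 0 if flags_uncertain else wrong,
        "flagged_for_review": wrong if flags_uncertain else 0,
    }


# Hypothetical volume: 10,000 transactions.
quiet_tool = error_exposure(10_000, accuracy=0.99, flags_uncertain=False)
cautious_tool = error_exposure(10_000, accuracy=0.90, flags_uncertain=True)

print(quiet_tool)     # 100 errors that nobody knows to look for
print(cautious_tool)  # 1,000 review items, no silent errors
```

Under these assumptions the "more accurate" tool leaves 100 misclassifications sitting unnoticed in the books, while the less accurate one creates more review work but keeps every doubtful item visible.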
Each tool gets a 0–5 score on five dimensions:
Accuracy—How often the tool produces the correct output, especially under edge cases (international transactions, ambiguous categorizations, complex tax rules).
Speed—End-to-end responsiveness in real workflows, not synthetic benchmarks.
Ease of Use—Onboarding friction, daily UX, learning curve. Includes documentation quality.
Pricing—Value per dollar at typical usage tier. Includes hidden costs (per-transaction fees, overage charges).
Compliance & Security—Certifications (SOC 2, ISO 27001), data residency options, audit logs, encryption standards.
Each dimension has a 0–5 anchor table—concrete descriptions of what a 1 vs a 3 vs a 5 looks like. This is the part that took longest to write. Without anchors, scoring becomes a vibe check.
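One way to picture the scorecard described above is as a small record that refuses anything outside the rubric. This is a minimal sketch, not the site's actual implementation: the `ToolScore` class, the dimension keys, and the example values are all hypothetical, and the anchor texts themselves are left out because they live at /methodology.

```python
from dataclasses import dataclass

# The five dimensions named in this post (key spellings are an assumption).
DIMENSIONS = ("accuracy", "speed", "ease_of_use", "pricing", "compliance_security")


@dataclass(frozen=True)
class ToolScore:
    tool: str
    methodology_version: str
    scores: dict  # dimension name -> score on the 0-5 scale

    def __post_init__(self):
        # Every tool must be scored on exactly the five dimensions.
        if set(self.scores) != set(DIMENSIONS):
            raise ValueError(f"expected dimensions {DIMENSIONS}, got {sorted(self.scores)}")
        # Every score must be a number on the 0-5 scale.
        for dimension, value in self.scores.items():
            if not isinstance(value, (int, float)) or not 0 <= value <= 5:
                raise ValueError(f"{dimension} score {value!r} is outside 0-5")


# Illustrative values only; these are not real scores for any tool.
example_scores = {
    "accuracy": 4,
    "speed": 3,
    "ease_of_use": 5,
    "pricing": 2,
    "compliance_security": 4,
}
card = ToolScore("ExampleTool", "v1", example_scores)
```

Rejecting a missing dimension or an out-of-range value at construction time is the code equivalent of the anchor tables: a score that cannot be tied back to the rubric never gets recorded.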
For each tool we evaluate, we do three things:
1. Hands-on testing. No tool gets reviewed from screenshots alone. We sign up, configure, and run real workflows for at least a week—typically more for complex tools like Pilot or Brex.
2. Documentation deep dive. We read the security docs, the API reference, and the pricing page, including all the asterisks. Anything that's documented but hidden behind a sales call counts as a friction signal.
3. Anchor-based scoring. For each dimension, we pick the anchor description that most closely matches what we observed. We don't average across reviewers: in our experience, single-evaluator scores are more internally consistent than averaged committee scores.
The scores are versioned. methodology v1 is what's running today. If we change a dimension definition or anchor table, we bump to v1.1 and re-score affected tools.
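The versioning rule above can be sketched as a small planning step: when a version bump changes some dimensions, list which tools still carry scores from an older version and which of their dimensions need another look. The `Scorecard` class, the `rescoring_plan` function, and the tool names are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    tool: str
    version: str  # methodology version the scores were assigned under
    scores: dict  # dimension name -> score


def rescoring_plan(scorecards, new_version, changed_dimensions):
    """Map each out-of-date tool to the dimensions that must be re-scored."""
    plan = {}
    for card in scorecards:
        if card.version == new_version:
            continue  # already scored under the new version
        stale = sorted(d for d in card.scores if d in changed_dimensions)
        if stale:
            plan[card.tool] = stale
    return plan


# Illustrative data: suppose v1.1 changed only the pricing anchors.
cards = [
    Scorecard("Tool A", "v1", {"accuracy": 4, "pricing": 3}),
    Scorecard("Tool B", "v1.1", {"accuracy": 5, "pricing": 4}),
]
plan = rescoring_plan(cards, new_version="v1.1", changed_dimensions={"pricing"})
print(plan)  # {'Tool A': ['pricing']}
```

Storing the version next to every score is what makes this possible: without it, there is no way to tell which numbers were assigned under the old anchors.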
A few honest limits:
It doesn't predict ROI. A 5/5 accuracy score doesn't mean the tool will pay for itself in your specific business. That depends on your transaction volume, complexity, and existing stack.
It's not a fit assessment. Two tools can both score 4.5 overall and still be completely wrong for you. A 5/5 Pricing score on Ramp doesn't help if you need multi-entity consolidation and Ramp doesn't support it.
It's a snapshot. Tools change. Mercury today is not Mercury 18 months ago. We re-score quarterly or when a major release lands.
It can be wrong. We've changed scores before based on user reports. If you've used a tool we cover and our score doesn't match your experience, tell us—we'll dig in.
Most review sites don't publish their scoring criteria. We do, for three reasons.
First, it forces internal consistency. If we can't write down what a "4 on accuracy" means, we shouldn't be giving 4s.
Second, it gives you a check on us. You can read the anchor for "3 on pricing" and decide if you agree with how we applied it.
Third, it makes the comparisons portable. A score on this site is comparable to any other score on this site, because they all used the same rubric. That's the whole point.
The full methodology—including all five anchor tables—lives at /methodology. We update it when our thinking changes, and we keep a version history so you can see what changed.
That's how we score. If you have feedback, we'd love to hear it.