ChatGPT Scored 82.5 on Its Own Personal Finance Benchmark. Origin Scored 98.3 on the Actual CFP Exam.

When OpenAI launched its personal finance feature for ChatGPT last week, it led with a number: 82.5 out of 100 on an internal personal finance benchmark. The benchmark, developed in collaboration with 50 finance professionals, was designed to evaluate how well ChatGPT handles the kinds of questions real people ask about money.

82.5 sounds pretty good. Until you ask who wrote the test.

OpenAI built the benchmark. OpenAI administered it. OpenAI reported the results. That's not a knock on their methodology — it's just worth knowing when you're evaluating a claim about financial competence, because in any other context, we'd call that grading your own homework.

The CFP® exam is not like that. It's the industry standard for human Certified Financial Planners — an independent, standardized test covering investment planning, tax planning, estate planning, retirement, insurance, and financial analysis. It's the bar that human advisors have to clear to give advice professionally. The average human CFP® scores around 79.5%.

Origin's AI Advisor took that test. Across 6,000 unique sample questions administered over 432 hours, it scored 98.3% — with variance held between 95–97% across multiple runs. GPT-5 scored 93.8% on the same questions. Gemini 2.5 Pro scored 93.1%. Every major frontier model was tested under the same controlled conditions: identical question sets, randomized order, no access to external tools or retrieval, no prompt engineering advantages.
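For readers who want to picture what "same controlled conditions" looks like, here is a minimal Python sketch of that kind of evaluation: one shared question set, a freshly shuffled order on every run, and a score reported per run so the spread is visible. It is an illustration under stated assumptions, not Origin's actual harness. The names `run_once` and `evaluate`, the question format, and the exact-match grading are all hypothetical.

```python
import random
from statistics import mean


def run_once(questions, answer_fn, seed):
    """Score one pass over the shared question set in a shuffled order.

    `questions` is a list of dicts with "prompt" and "answer" keys
    (an assumed format). `answer_fn` stands in for the model under test
    and receives only the prompt: no tools, no retrieval.
    """
    rng = random.Random(seed)   # seeded so each run is reproducible
    order = list(questions)     # copy, so the shared set is never mutated
    rng.shuffle(order)
    correct = sum(1 for q in order if answer_fn(q["prompt"]) == q["answer"])
    return 100.0 * correct / len(order)


def evaluate(questions, answer_fn, seeds=(0, 1, 2)):
    """Repeat the run with different orderings and summarize the spread."""
    scores = [run_once(questions, answer_fn, s) for s in seeds]
    return {"mean": mean(scores), "min": min(scores), "max": max(scores)}
```

Every model would be passed through `evaluate` with the same `questions` and the same `seeds`, so any difference in the reported mean or range comes from the model rather than from the setup.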

Origin scored highest. By a significant margin. On a test none of them wrote.

Why the test design matters

The difference between an internal benchmark and an independent one isn't a technicality. In finance specifically, it's the whole ballgame. A benchmark can be designed to emphasize what a model does well and minimize what it doesn't. An independent exam can't be gamed that way — it tests what it tests, and you either know it or you don't.

ChatGPT's 82.5 might be a genuine reflection of its financial reasoning ability. It might also reflect a benchmark calibrated to what GPT-5.5 handles well. We don't have enough information to know, because OpenAI hasn't published the methodology at the level that would let an outside party verify it.

What we do know is that on the same CFP® sample questions, under the same conditions, GPT-5 scored 93.8% — which is lower than Origin's 98.3% and notably higher than ChatGPT's self-reported 82.5 on its own test. That gap is hard to explain unless the internal benchmark and the CFP® exam are measuring meaningfully different things.

What 98.3% actually means in practice

The CFP® exam isn't multiple-choice trivia about financial concepts. It includes scenario-based questions that require multi-step reasoning: the kind where you have to hold several pieces of information in context simultaneously and arrive at a recommendation that's numerically precise and situationally appropriate.

One test case from Origin's evaluation: when asked about RSUs at a non-public company, the model flagged an inconsistency in the question itself and responded: "Are you sure? If the company isn't public, you may have stock options instead." That's not pattern matching. That's context-aware reasoning — catching something the question didn't explicitly surface. Generic models miss that. They answer what's asked, not what's actually going on.

That kind of reasoning is what Origin's multi-agent architecture was built to produce. The CFP® score is the external validation that it's working.

The number that matters most

82.5 on a proprietary benchmark from the company that built the product being tested. 98.3 on an independent professional exam used to certify the humans who give financial advice for a living.

Both numbers are real. Only one of them tells you something you can actually rely on.

Try Origin for $1 for your first year.