Fake vs. Real PMF: How Blue J Went from $2M to $25M ARR

Episode 84 · October 20, 2025

Bottom Line Up Front

Ben Alarie spent 8 years building Blue J with partial product market fit — real customers, real revenue, but no true pull. Then he bet the company on a full rebuild using RAG-based LLMs. Two years later, ARR jumped from $2M to $25M, NPS from 20 to 84, and Blue J raised a $122M Series D. This episode is essential for any founder who suspects they 'kind of' have PMF but can't break through.

Eight years of 'partial product market fit' — then everything changed in six months. Blue J founder Ben Alarie rebuilt his AI tax research platform from scratch, and the result was hypergrowth: 10 new firm customers every single day and a $122M raise.

Key Facts

ARR Growth (2 years): $2M to $25M ARR after V2 launch (Ben Alarie)
NPS Improvement: NPS rose from ~20 to 84 (Ben Alarie)
Customer Growth: Fewer than 100 firms to 3,400+ firms (Ben Alarie)
Latest Raise: $122M Series D, closed July 2025 (Ben Alarie)
Daily New Logos: ~10 new firms signing up every single day (Ben Alarie)

What 'Partial Product Market Fit' Actually Looks Like

Partial PMF means you have paying customers and functional technology, but usage is inconsistent. Users love the product in specific moments but abandon it when it doesn't cover their full workflow. Revenue grows slowly; nothing breaks out. It feels like progress — and that's exactly what makes it dangerous.

Blue J's V1 used supervised machine learning to predict how courts would resolve specific tax questions — things like whether a worker is an employee or independent contractor. The models were accurate, predicting outcomes with better than 90% accuracy. Customers trialed the product, validated the technology, and paid for access.

But the experience was fragmented. Users would log in for a question Blue J could answer, get great value, then return for a different issue and find no model existed for it. Over time, they stopped logging in. Ben Alarie describes it precisely: 'There were some power users who knew exactly what BlueJ did. They would come in all the time... but it wasn't enough of a consistent experience for all the users to really get this thing to take off.'

Blue J grew past $5M ARR over several years — enough to raise a Pre-Seed, Seed, Series A, and Series B. Enough to keep going. But not enough to break out. The trap of partial PMF is that it mimics success closely enough to justify continuing without forcing the deeper question: is this really working?

"What we learned was there was partial product market fit. The models that we built were actually functionally very useful, when you had one of those kinds of cases." — Ben Alarie

"It always kind of felt like we kind of had a partial vacuum, a partial seal and it was like, wasn't really tight and working in the way that you would really want it to work." — Ben Alarie

The Rebuild Bet: Going All-In on RAG and LLMs

When Ben Alarie saw early GPT-3 models in late 2022, he recognized retrieval augmented generation could finally deliver Blue J's original vision: answer any tax research question instantly. He put V1 into maintenance mode and gave his team six months to rebuild everything from scratch for the U.S. federal income tax market.

The technical breakthrough was retrieval-augmented generation (RAG). Instead of laboriously building a separate model for each issue, Blue J could now index a comprehensive corpus of authoritative tax content (legislation, case law, CRA and IRS rulings, academic commentary) and synthesize answers dynamically. Alarie explains: 'The brilliant thing about leveraging retrieval augmented generation is we can just curate the data and if we're confident about the currency of the tax database... the whole user interface, the user experience can be dramatically simplified.'
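For readers unfamiliar with the pattern, here is a minimal Python sketch of the retrieve-then-synthesize idea described above. It is not Blue J's implementation: the corpus entries are hypothetical placeholders, the keyword-overlap scoring stands in for whatever retrieval method a production system would use, and the final call to a language model is left out entirely. The names Document, retrieve, and build_prompt are illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # hypothetical citation label, not a real authority
    text: str    # placeholder passage text


# Hypothetical placeholder corpus. A real system would index curated,
# current, authoritative tax content instead.
CORPUS = [
    Document("Statute A (placeholder)",
             "rules for classifying a worker as an employee or independent contractor"),
    Document("Ruling B (placeholder)",
             "agency ruling on deductibility of home office expenses"),
    Document("Case C (placeholder)",
             "court decision on whether a worker was an independent contractor"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Return the k documents sharing the most words with the question.

    Word overlap is a toy relevance score chosen for illustration only.
    """
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda d: len(q & tokenize(d.text)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[Document]) -> str:
    """Assemble a prompt that asks a model to answer only from cited sources."""
    context = "\n".join(
        f"[{i + 1}] {d.source}: {d.text}" for i, d in enumerate(passages)
    )
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    question = "Is this worker an employee or an independent contractor?"
    top = retrieve(question, CORPUS)
    # The assembled prompt would be sent to a language model; that step is omitted.
    print(build_prompt(question, top))
```

The design point matches Alarie's framing: coverage comes from curating the corpus rather than from training a new model per issue, and numbering the sources in the prompt is one way to let users check an answer against the underlying authorities.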

The business decision was equally bold. Alarie told his team: put all existing tools into maintenance mode, keep servicing customers, but invest zero in new feature development. All new engineering went toward V2. It was a six-month sprint with an existential mandate.

By the end of June 2023, they had a working product — janky, 90-second response times, single-shot interactions, no follow-up questions. NPS sat at roughly 20. But they had proof the concept worked. Blue J ended 2023 at just over $2M ARR on the new product alone. Then came 2024: fully conversational, 15-second responses, GPT-4 integration, and ARR climbing to nearly $9M by year's end.
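To put the ARR figures quoted in this episode in perspective, the short Python sketch below computes the implied growth multiples. The inputs are rounded assumptions: "just over $2M" is treated as 2.0, "nearly $9M" as 9.0, and the climb to $25M is assumed to span exactly two years, so the results are approximations rather than reported metrics.

```python
def growth_multiple(start: float, end: float) -> float:
    """How many times larger `end` is than `start`."""
    return end / start


def annualized_multiple(start: float, end: float, years: float) -> float:
    """Constant yearly multiple that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years)


# ARR in millions of dollars, rounded from the figures quoted in the episode.
ARR_END_2023 = 2.0   # "just over $2M" (assumed 2.0)
ARR_END_2024 = 9.0   # "nearly $9M" (assumed 9.0, so a slight overestimate)
ARR_LATEST = 25.0    # "$25M" (assumed to be two years after the $2M mark)

if __name__ == "__main__":
    print(growth_multiple(ARR_END_2023, ARR_END_2024))                    # 4.5
    print(growth_multiple(ARR_END_2023, ARR_LATEST))                      # 12.5
    print(round(annualized_multiple(ARR_END_2023, ARR_LATEST, 2.0), 2))   # 3.54
```

Under these assumptions, ARR grew roughly 4.5x during 2024 and 12.5x overall, which works out to about 3.5x per year.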

"We're going to take the first six months of 2023 and we're going to focus all of our new development efforts in building this thing that can answer any tax research question in U.S. federal income tax law." — Ben Alarie

"We put all of our chips on large language models and see if we can get them to do what we set out to do originally back in 2015, when we started BlueJ." — Ben Alarie

V1: Supervised ML, issue-specific models, 90%+ prediction accuracy, narrow coverage.
V2: RAG-based LLMs, any tax question, fully conversational, 15-second responses.
Transition: V1 placed in maintenance mode; all dev resources redirected to V2.
Why U.S. market first: largest market, existing content partnership with TaxNotes.

Why Blue J Beats ChatGPT for Tax Research

General AI models like ChatGPT rely on open web sources that are often outdated and incomplete. Blue J accesses comprehensive, paywalled tax content from publishers like Thomson Reuters, CCH, and LexisNexis — sources ChatGPT cannot reach. For tax professionals who can't afford a 60%-right answer, purpose-built wins every time.

Alarie is direct about the competitive moat: 'We have copies of all the authoritative content necessary to produce really great answers. And we make it easy for our users to look at those authoritative sources and validate.' Tax law changes constantly: new cases, statutory amendments, agency rulings. Web-crawling models pick up stale guidance and present it as confidently as current law.

The customers are discerning professionals whose clients pay premium rates for research accuracy. As Alarie puts it: 'They do not want to waste time with an inferior product. When there's something superior, purpose-built that they can access, that's going to give them the best answer.' The bar isn't 'pretty good.' It's world-class or nothing.

This is also why the time-to-value dynamic is so powerful in this market. A tax researcher who just spent six hours on a question can watch Blue J produce the same answer in 20 seconds. That immediate, undeniable proof of value collapses the sales cycle and accelerates word-of-mouth.
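The size of that gap is easy to quantify. The Python sketch below works through the six-hours-versus-20-seconds example from the paragraph above. The function name speedup is illustrative, and the calculation assumes the two answers are of comparable quality, which is the claim made in the episode rather than something the arithmetic can show.

```python
def speedup(manual_seconds: float, tool_seconds: float) -> float:
    """Ratio of manual research time to tool response time."""
    return manual_seconds / tool_seconds


SIX_HOURS_IN_SECONDS = 6 * 60 * 60  # 21,600 seconds
TOOL_RESPONSE_SECONDS = 20          # response time quoted in the episode

if __name__ == "__main__":
    print(speedup(SIX_HOURS_IN_SECONDS, TOOL_RESPONSE_SECONDS))  # 1080.0
```

A roughly thousandfold reduction in time is what makes the value visible within a single trial question.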

"Our users are very picky, they're discerning users, they're tax professionals who are not content with using a ChatGPT." — Ben Alarie

"If they can accelerate that research with tools that cut through that task like a hot knife through butter. They want that hot knife through butter experience." — Ben Alarie

Time to Value: The Real Engine Behind Word-of-Mouth Growth

Time to value is the single biggest lever for word-of-mouth. When a user can get undeniable value in 20 seconds, recommending your product becomes socially low-risk. The recommender isn't asking for a time investment — they're offering an instant win. That asymmetry between effort and upside is what drives organic growth.

Pablo Srugo surfaces this insight mid-conversation, and Alarie runs with it. The social mechanics are precise: 'If I am confident that I make this recommendation to you and you can, at very low cost, go and validate that for yourself right away — it makes it socially far less risky for me to suggest this to you.'

The contrast with V1 is stark. Early Blue J required behavioral change, onboarding investment, and hope that the specific issue you needed would be covered. All of that friction killed referrals. V2 eliminates the friction: one question, one answer, 20 seconds. The recommender looks smart immediately.

Alarie frames it as asymmetry: 'The upside versus the investment to validate it.' When that ratio is extreme (massive upside, near-zero cost to verify), word-of-mouth compounds. Blue J is now adding roughly 10 new firm logos per day, a pace the episode attributes to strong organic referral loops.

"The consistency of the value, like I'm not asking you to spend to make a huge investment in your time and the upside gets so asymmetric. The upside versus the investment to validate it." — Ben Alarie

"The trial conversion, the word of mouth is so much fun to sell BlueJ now compared to V1, where the value was more difficult to unlock and it was less consistent." — Ben Alarie

The PMF Moment: A Two-Week Problem Solved in 20 Seconds

Ben Alarie knew Blue J had true PMF during a live demo at the Canada Revenue Agency. An audience member described a tax research problem that had taken his team two weeks to solve. Blue J produced the correct answer in seconds. The expert stood up, walked to the screen, and read it line by line: 'That's the answer we came up with.'

The setting mattered. It was a room of roughly 150 tax professionals at the CRA's Toronto Tax Services office. Alarie had no idea what question would come. The demo was live and unscripted, the highest-stakes test the product could face.

Alarie describes driving back afterward: 'I turned up the radio and I was just pretty pumped. Because I was like, okay, that was hugely successful. That was an ecologically valid test of BlueJ and we just hit it out of the park.' It wasn't a rehearsed example. It was a real problem from a real expert, solved in real time.

That moment encapsulates what separates real PMF from partial PMF. Real PMF survives the hardest possible test in front of the most skeptical possible audience. Partial PMF works in controlled conditions with friendly users. The CRA demo was the former.

"He said, this is a tricky problem. It took us two weeks to do this internally, can you try this? And he described the question and I push enter. I'm like, OK, let's see what BlueJ comes up with. And BlueJ started bringing in the answer. This guy stood up, walked up to the screen and he was reading it line by line. And he said, that's the answer we came up with." — Ben Alarie

One Piece of Advice: Be Ruthlessly Honest About Whether You Have PMF

The hardest part of finding PMF is admitting you don't have it yet. Founders are the easiest people to deceive — you hear encouragement, you see revenue, you assume it's working. True PMF means customers use your product aggressively every day, pay real money, and proactively tell others. Anything less is partial, and partial doesn't scale.

Alarie's advice is grounded in his own experience of muddling through for years: 'Be ruthlessly honest with yourself about whether you have product market fit. You can deceive yourself about this and you can have happy ears and listen to the folks who are trying to provide encouragement.'

The real test, he says, is behavioral: 'Are they using it aggressively and every day, and thrills. And telling everybody else that they know about this thing — that's happening, and they're paying real money for it. And that's when you know that you've got that product market fit.'

He also acknowledges the paradox of his own conviction: believing absolutely that the market would arrive kept him going for eight years. He wouldn't recommend it as a strategy, but he couldn't avoid it. The lesson for other founders isn't blind faith. It's an honest assessment of where you are today, paired with a clear vision of what real PMF would look like.

"Be ruthlessly honest with yourself about whether you have product market fit. You're probably, as a founder, the easiest one to deceive." — Ben Alarie

"I'm going to make this work or I'm going to die trying. I just felt so viscerally that this was going to work." — Ben Alarie

Blue J V1 vs. V2: Partial PMF vs. Real PMF

Dimension	V1 (Supervised ML)	V2 (RAG + LLMs)
Coverage	Specific issue-by-issue models	Any tax research question
Response Time	Not stated	90 seconds at launch (June 2023); ~15-20 seconds now
Interaction Model	One model per issue	Single-shot at launch; now fully conversational
NPS	Not stated	~20 at launch; 84 now
ARR	$5M+ (over several years)	$2M → $25M (2 years)
Customer Count	Hundreds of firms	3,400+ firms
Word-of-Mouth	Weak, inconsistent	~10 new firms/day organic

Frequently Asked Questions

How did Blue J know it was time to rebuild instead of iterating on V1?

Ben Alarie recognized that V1's fundamental constraint — issue-specific models — couldn't be solved by iteration. The original vision required answering any tax question, which demanded a different architecture entirely. When RAG-based LLMs became viable, the rebuild became necessary, not optional.

What is the difference between partial PMF and real PMF?

Partial PMF means some customers love the product in specific use cases, but usage is inconsistent and doesn't compound. Real PMF means customers use it aggressively every day, pay real money, and proactively refer others — without prompting. According to Ben Alarie, the behavioral difference is unmistakable.

Why can't tax professionals just use ChatGPT instead of Blue J?

General models rely on open web sources that are often outdated or incomplete. Blue J accesses paywalled, authoritative tax content from publishers like Thomson Reuters and CCH, sources ChatGPT cannot reach. For professionals whose clients pay premium rates for accuracy, that gap is not an acceptable risk.

What role did time to value play in Blue J's hypergrowth?

Instant time to value made word-of-mouth nearly risk-free. When a user can prove Blue J's value to a colleague in 20 seconds, recommending it carries little social risk and asks almost no time of the recipient. Ben Alarie calls this asymmetry (massive upside, near-zero verification cost) the engine behind organic growth.

Eight years of partial PMF. Six months to rebuild. Two years to hypergrowth. Ben Alarie's story is a masterclass in honest self-assessment, conviction, and knowing the difference between a product that works and a product that pulls. Hear the full conversation on The Product Market Fit Show.
