LEWIS C. LIN AMAZON BESTSELLING AUTHOR

The Data Moat: Who Wins Software in the Post-AI World

In 2008, a small team built TweetDeck on Twitter’s API. It was better than Twitter’s own product — genuinely, obviously better. Multi-column, real-time, powerful. Twitter acquired it in 2011 and killed it in 2023. They built on someone else’s foundation and got the outcome that always follows.

The same movie is running again. A generation of products is being built as wrappers around rented intelligence. Some are impressive. Most are TweetDeck.

One question separates durable from disposable: if the underlying model gets 10x better next year, does my product get more valuable or less? Less valuable means the model now does what you were doing. More valuable means a smarter model does more with your data. The first is a wrapper. The second is a moat.

The companies with moats have always had the same thing. Not a better product. Not a smarter team. Data that cannot be synthesized, collected through years of exclusive access, with proprietary logic built on top that has become the language their clients think in.

The LLM doesn’t create that moat. It makes the moat more valuable — and makes its absence more exposed.

The Three Requirements for a Real Data Moat

Three properties. All three. Two out of three is a business, not a moat.

01
The data cannot be synthesized
It required physical collection, exclusive agreements, or decades of accumulation. An LLM cannot generate it. A competitor cannot scrape it.
02
The value compounds over time
More history means better signal. Ten years of sensor readings from a factory line is worth more than one year. The moat widens as the asset grows.
03
The logic is embedded in client cognition
Clients don’t just use the data. They think in its language — its categories, its metrics, its benchmarks. Switching doesn’t mean finding a new vendor. It means rewiring how decisions get made.

Bloomberg is the clearest case. The Terminal costs $24,000 per seat, looks like 1987, and is notoriously hard to learn. None of that matters. The data took decades and billions to build. The methodology — how professionals think about yield curves, how credit analysts structure comparables — lives in Bloomberg’s language, not yours. Switching means retraining every analyst to think differently. Nobody does that.

An LLM on Bloomberg’s data doesn’t threaten Bloomberg. It makes Bloomberg more valuable. The asset was always there. The LLM removes the tax of learning its keyboard commands.

The Wrapper Problem

Cursor is a good product. It is also, at its core, an interface on top of Anthropic and OpenAI’s models. The moat is workflow integration, keyboard habits, brand trust — not owned data. When a base model handles multi-file code editing natively, Cursor’s edge compresses. That day may be two years away or five. It is coming.

Harvey is similar. The legal workflow is thoughtful. The law firm relationships are real. But the reasoning runs on OpenAI’s models, and the case law is largely the same corpus Westlaw has owned and annotated for decades. Harvey is ahead on product and distribution. Westlaw, building an LLM interface on its own corpus, is ahead on substrate.

The API era and the LLM era are the same movie. The companies that survived the API era owned the graph. The companies that will survive the LLM era own the data.

Zynga built FarmVille on Facebook’s social graph and grew faster than any game studio in history. Facebook changed its distribution rules in 2012. Zynga lost 70% of its value within two years. The mistake wasn’t building on Facebook. It was having nothing of their own when the platform changed its mind.

Evaluating the Landscape

The matrix below scores 16 data categories across six criteria: data irreplaceability, collection moat, proprietary logic layer, LLM amplification potential, switching cost, and market size. The scoring is deliberately unweighted — a starting point for analysis, not a final verdict.

Column key:
Irep — Data irreplaceability
Coll — Collection moat
Logic — Proprietary methodology layer
LLM↑ — LLM amplification potential
Lock — Switching cost
Mkt — Market size

Rating scale: 5 Excellent · 4 Strong · 3 Moderate · 2 Weak · 1 Poor

Score = unweighted average of the six criteria, out of 5.00.
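The scoring rule is simple enough to state in a few lines of code. The sketch below is illustrative only: the function name and the example row are hypothetical stand-ins, not the matrix's actual per-criterion values — though the example is constructed to land on 4.83, the published top score.

```python
# Illustrative sketch of the matrix's scoring rule: an unweighted
# average of six 1-5 ratings. Names and example values are hypothetical.

CRITERIA = ["Irep", "Coll", "Logic", "LLM_amp", "Lock", "Mkt"]

def moat_score(ratings: dict) -> float:
    """Unweighted average of the six 1-5 ratings, rounded to 2 decimals."""
    assert set(ratings) == set(CRITERIA), "rate all six criteria"
    assert all(1 <= v <= 5 for v in ratings.values()), "ratings are 1-5"
    return round(sum(ratings.values()) / len(CRITERIA), 2)

# A 4.83 requires five 5s and one 4 -- e.g. a plausible retail-POS row:
retail_pos = dict(Irep=5, Coll=5, Logic=5, LLM_amp=5, Lock=5, Mkt=4)
print(moat_score(retail_pos))  # 4.83
```

Because the average is unweighted, every criterion moves the score equally — which is exactly why the essay calls it a starting point for analysis rather than a verdict.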

The Top Cluster: Four Cases Worth Studying

The top four rows share one trait: data that took decades to accumulate, cannot be replicated by any model, and a logic layer that has become the cognitive framework clients use to make decisions. The LLM doesn’t threaten any of them. It makes each more accessible — which makes each more valuable.

01
Retail Scanner / POS Data
Circana, NielsenIQ — Score 4.83

The most instructive case because the moat looks replicable and isn’t. POS transaction records from grocery, drug, and mass retailers — assembled through exclusive agreements built over 40 years. You could theoretically replicate it. You’d need tens of thousands of retailer agreements and four decades to do it. Then you’d still need to build the analytical framework that CPG brand managers now use as their operating system.

02
Healthcare / Clinical Records
Epic, Cerner — Score 4.83

Epic is the Bloomberg of clinical data, with more regulatory friction and more moral weight attached to every decision. A third of Americans have their longitudinal health records in Epic’s system — labs, medications, physician notes, years of it. Legally protected, clinically sensitive, irreplaceable. No startup signs its way into that position.

03
Industrial IoT / Historian Data
OSIsoft PI, AVEVA — Score 4.50

Nobody talks about OSIsoft. They should. The PI System holds 20-year sensor histories from specific production lines at specific factories — data that exists once, reflecting the physical reality of one machine in one building over one span of time. The readings from Line 3 at a semiconductor fab in Hillsboro, Oregon are not interchangeable with anything.

04
Legal Case Law & Annotation
Westlaw, LexisNexis — Score 4.50

Harvey gets written about constantly. Westlaw doesn’t. Harvey has a smart product and good distribution. Westlaw has 150 years of annotated case law — editorial layers built by human researchers flagging which holdings were overturned, which precedents still hold. A model reasoning over that annotation layer is in a different position than one reasoning over raw public case law.

The Trap: High LLM Score, Low Moat

Two categories score 5 on LLM amplification but sit near the bottom: labor market data and government public records. Both are products. Neither is a moat.

Making government data queryable in plain English is genuinely useful. The data is also public, which means every competitor starts with the same asset. Low switching cost, no exclusivity, no moat.

Labor market data is marginally stronger — Lightcast (formerly Burning Glass) built a real methodology layer. But the underlying data comes from job postings, which are public, and employer surveys, which aren't exclusively theirs. That methodology layer is more replicable than 40 years of retailer agreements.

The One Question

If the base model improves 10x next year, does my product get more valuable or less?

The interface was never the product. The data always was. The LLM era just makes that more visible — and more consequential — than ever before.

This essay is part of a longer inquiry into software in the post-AI era — specifically the argument that most software products bundle data, logic, and a UI that exists solely because machines couldn’t understand language. Remove that constraint and the UI layer dissolves. What survives is the substrate.


— End —
