The Problem
A lawyer with ten years of case law experience can train a legal LLM on proprietary precedent and licensing agreements and sell model access at $1,200+ per month per client, with enterprise contracts hitting five figures. A software engineer with no domain standing writes better prompts than yesterday’s solution and watches the market absorb the improvement for free within six months. The same venture firms that funded Harvey to a $5 billion valuation in legal AI will not look at a solo developer who trained a model on healthcare claims data, no matter how good the model is, if that developer cannot credibly claim “I built this from knowledge I earned inside the domain.”
This is not theoretical. The domain-specific LLM market is projected to grow from $6.8 billion in 2025 to $52.4 billion by 2034 at a 25.4% compound annual growth rate, according to DataIntelo. That is real structural growth. But it is flowing almost entirely toward people and companies with pre-existing domain authority—established practitioners, institutional founders, teams with credentials baked into their founding story. Freelance prompt engineering roles increased 3x between 2024 and 2026 according to PE Collective job board data, a sign that execution-only AI work is flooding the market and racing to the bottom. Meanwhile, a developer who can credibly claim “I have domain-specific proprietary training data” can command pricing that looks like insurance pricing, not software pricing.
The gap is not about technical capacity. It is about credibility, data provenance, and institutional barriers to entry. A credential-less developer can build the model. What they cannot easily do is convince anyone that the data backing it is trustworthy, differentiated, or legally and ethically sound. That asymmetry is reshaping what durable developer income looks like right now.
Why This Is Happening
Three structural forces explain this collapse in the execution-only market and the retention of value in domain-specific data moats.
First, execution has become genuinely free. Prompt engineering as a skill has a half-life of months, not years. The techniques that commanded premium rates in 2024—chain-of-thought prompting, few-shot examples, token optimization—are now baked into baseline LLM capability and accessible to anyone with a CLI. The commodity market has done what commodity markets do: absorbed the knowledge, taught it to everyone, and driven the price toward the cost of distribution. Freelance rates for prompt engineering have compressed to $80–200 per hour on platforms like Toptal and Arc, while lower-tier managed services AI engineering roles are hitting $18 per hour. The demand is there, but so is the supply. Specialization in “I can write better prompts” does not hold against fifty thousand other people who can also write better prompts.
Second, domain-specific data and domain credibility are structurally different from execution. When a healthcare company trains an LLM on five years of proprietary claims data, they are not just selling access to a model—they are selling the credibility of the data itself, the legal right to use it, and the implicit insurance that the patterns it learned are clinically and ethically sound. A solo developer cannot credibly make that claim without institutional backing or years of domain practice. They can build the technical infrastructure; they cannot build the trust. That trust is what Harvey and similar domain-specific AI companies have monetized. Harvey’s $5 billion valuation is not evidence that “we built the most technically impressive legal LLM.” It is evidence that “we have the credibility to tell law firms this model will not hallucinate about case law, and we can back that up with legal liability.”
Third, and most important, there is no viable infrastructure for a credential-less developer to bootstrap into that domain-credibility model. Hugging Face Datasets exists, but it has no credibility or provenance layer. You can upload a dataset, but there is no mechanism that lets a lawyer reviewing the dataset know whether you actually worked in law for a decade or scraped it from the internet last night. Annotation platforms assume you already have a team or institutional backing. Data partnerships with domain practitioners require legal expertise most developers lack. Synthetic data generation tools like NVIDIA’s Synthetic Data Generation Hub can help, but they only work credibly when grounded in real-world domain examples—and for regulated domains like healthcare, finance, and law, synthetic data is still treated as a training aid, not a substitute for ground-truth data acquisition.
The infrastructure gap creates a Catch-22: You need domain credibility to acquire domain data at scale. You need domain data to build a moat. You cannot credibly claim domain credibility without either years of practice or institutional affiliation. And there is no marketplace, platform, or coordination mechanism that lets you bootstrap that without pre-existing standing.
What Developers Are Actually Doing
Some are pivoting to niches where domain expertise can be acquired without prior credentials. A developer can spend three months reading healthcare compliance documentation, regulatory filings, and clinical trial data, join domain-specific Slack communities, and reach a level of fluency where they can evaluate and curate datasets with real credibility. It is not the credibility of a cardiologist, but it is enough to say “I spent a quarter becoming an expert in this narrow problem space, and here is what I learned about the data that actually matters.” That is not nothing. It is harder than writing better prompts, and it takes longer, but it is tractable.
Others are explicitly working around the credibility barrier by partnering with domain practitioners as co-founders or advisors. This is not a bug; it is what Harvey and similar successful domain AI companies did from the start. The co-founder team of Harvey included legal domain experts from day one. A solo developer who wants to enter healthcare, finance, or legal AI is increasingly running into investors and customers who ask: “Who on your team has spent five years inside this domain?” The answer cannot be “I have.” But it can be “My co-founder has, and she curated the training data.”
Still others are targeting adjacent but less-credentialed domains—market research, content moderation, customer support classification, internal tooling—where the barrier to domain entry is lower. You do not need ten years of market research practice to build and sell a model that predicts survey respondent sentiment or categorizes open-ended feedback. The domain is learnable in weeks. The data is acquirable from public sources or customers willing to share it. The moat is still there—a model trained on your customer feedback is worth more than a generic model—but it is not locked behind a credibility wall.
A handful are attempting to build the infrastructure themselves: open marketplaces for domain data with versioning and provenance metadata, cooperative annotation models where multiple developers share the cost of labeling, and staged capital models where small ventures bootstrap by monetizing data incrementally and reinvesting revenue into data acquisition. None of these operates at scale yet, and none is a standard, off-the-shelf offering, but they are being attempted. The challenge is that they require coordination and trust across multiple parties, and the infrastructure to enable that trust does not yet exist.
The Build Opportunity
The specific gap is this: There is no platform that combines real-world domain data collection, credibility signaling for data providers, and monetization infrastructure in a way that lets a credential-less developer enter a domain, build a defensible dataset, and extract value from it.
Hugging Face Datasets gets you distribution. Annotation platforms like Label Studio or Prodigy get you labeling. API monetization platforms like Zuplo and Moesif get you payment infrastructure. But none of them solve the core problem: How does a domain expert, reviewing a dataset from someone they have never heard of, know whether that dataset is trustworthy?
A real solution would need:
Data provenance and version control that surfaces credibility signals. Not just git-style version tracking, but a layer that shows: Who collected this data? What is their track record? Have domain experts reviewed it? Is there a chain of custody? This already exists in academic and pharmaceutical research, where data repositories track contributor credentials, peer review, and institutional affiliation. The missing layer is making this usable and accessible for developers without institutional backing.
A credibility bootstrap pathway that does not require pre-existing domain credentials. What would let a developer signal “I have spent focused time becoming an expert in this narrow domain”? Not a certificate, which can look plausible without being credible. More likely it is a combination of signals: public evidence of domain engagement (curated reading lists, research summaries, public analysis of real-world problems in the domain), endorsement from practitioners already in the domain who have reviewed the work, and a staged data-collection process that starts small (50 examples) and scales as credibility builds. This is what Emergence Capital calls the “hack your way in” model, but it requires a platform that structures the hack and makes the credibility visible.
A coordination layer for data partnerships. If a developer wants to partner with a domain practitioner to co-curate data, there should be a standard framework for: data ownership and revenue sharing, liability and indemnification, IP agreements, and ongoing contribution models. Lawyers are expensive and slow. A template-based, developer-friendly version would dramatically lower the friction of saying “I want to partner with someone inside the domain.” This is half-solved by platforms like Gumroad and Stripe for revenue sharing, but it is not connected to data provenance or domain credibility.
Regulatory guidance and playbooks for specific domains. Healthcare data requires HIPAA compliance. Financial data requires regulatory disclosure. Legal data involves attorney-client privilege. A platform that includes domain-specific checklists and templates—“If you are building healthcare models, here is what you need to know about de-identification, here is a template for data-use agreements”—would democratize knowledge that currently lives in lawyers’ heads and venture firms’ playbooks.
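To make the first requirement above more concrete, here is a minimal sketch of what a provenance record with credibility signals might look like. It is an illustration under stated assumptions, not a description of any existing platform: the class names, field names, and example values are all hypothetical, and a real system would also need signed identities and verified credentials rather than self-reported strings.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Review:
    reviewer: str    # hypothetical: ID of a domain practitioner
    credential: str  # hypothetical: self-reported standing, unverified here
    verdict: str     # "approved" or "rejected"


@dataclass
class ProvenanceRecord:
    dataset_sha256: str                         # hash of the exact bytes described
    collector: str                              # who collected the data
    source_description: str                     # where the data came from
    parent_record_sha256: Optional[str] = None  # previous version, if any
    reviews: List[Review] = field(default_factory=list)

    def record_hash(self) -> str:
        """Hash of the record's current contents (including reviews)."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def approvals(self) -> int:
        """Number of reviews whose verdict is 'approved'."""
        return sum(1 for r in self.reviews if r.verdict == "approved")


# Version 1: a small seed dataset, reviewed by one practitioner.
data_v1 = b"claim_id,label\n1,denied\n2,paid\n"
v1 = ProvenanceRecord(
    dataset_sha256=hashlib.sha256(data_v1).hexdigest(),
    collector="dev@example.com",
    source_description="hand-labeled examples from public filings",
)
v1.reviews.append(Review("reviewer-01", "claims analyst (self-reported)", "approved"))

# Version 2 points back at version 1, forming a simple chain of custody.
data_v2 = data_v1 + b"3,paid\n"
v2 = ProvenanceRecord(
    dataset_sha256=hashlib.sha256(data_v2).hexdigest(),
    collector="dev@example.com",
    source_description="v1 plus one additional labeled example",
    parent_record_sha256=v1.record_hash(),
)
```

Because each version stores the hash of its parent record, a reviewer can detect whether an earlier record (including its reviews) was altered after the fact. The hash alone says nothing about whether the reviewer's credential is genuine; that is the institutional part of the problem, which software cannot settle on its own.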
The adjacent technical work is real. Synthetic data generation can reduce some of the ground-truth burden, but only if credibly grounded in real data. Federated learning and differential privacy can let you train on sensitive data without exposing raw records, but the tooling is still fragmented and requires deep ML expertise to deploy. Vector databases and retrieval-augmented generation can make domain data more useful at smaller scale, reducing the need for massive datasets—a real advantage for bootstrap scenarios.
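As a rough illustration of the differential privacy idea mentioned above, the sketch below releases a noisy count using the Laplace mechanism, which is the simplest textbook instance of the technique. It is not a training procedure: differentially private model training (for example DP-SGD) is considerably more involved. The function names and the sample records are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical claims records; only the noisy aggregate is released.
claims = [
    {"status": "denied"},
    {"status": "paid"},
    {"status": "denied"},
    {"status": "denied"},
]
rng = random.Random(0)
noisy_denied = private_count(claims, lambda c: c["status"] == "denied", 0.5, rng)
print(round(noisy_denied, 2))
```

Smaller values of epsilon add more noise and give stronger privacy; larger values give more accurate answers with weaker guarantees. Choosing epsilon for a regulated domain is a policy decision as much as a technical one.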
Open-source starting points exist. Label Studio is productized and open. Hugging Face’s hub infrastructure could be extended. Work on verifiable credentials and verifiable data registries exists in the blockchain and standards space but has not been adapted for ML data. A team starting here would not be building from scratch, but they would be doing genuine integration and layer-building work.
The hard problem is not technical. It is institutional: creating a credibility layer that developers trust, that domain practitioners trust, that customers trust. You cannot fake that in software alone. It requires community governance, and it might require some institutional backing (a university, a nonprofit, or early VC funding) to bootstrap the trust.