AI Does the Reviewing, Humans Take the Profit Share:
The Next Big “Creator Economy Wave” Is Coming to Scientific Research.
1. Problem and opportunity: What are we wasting?
Today’s large-model products are, in fact, “wasting creativity.”
Right now, all mainstream large-model products are essentially: pay-per-token conversational tools + a small plugin ecosystem. What users type into them every day is not only chat and Q&A, but also: half-formed research ideas, technical proposals, algorithmic ideas; product concepts, business model drafts, system architectures, experimental designs; all kinds of “immature theories” and “amateur research brainstorms.”
At present, these inputs have only two possible fates: they are treated as ordinary conversation logs for the next round of pre-training and crudely thrown into one big “stew”; or, constrained by privacy policies, they simply cannot be used for training at all and become “seen once and forgotten.”
We have no systematic mechanism to filter out the very small fraction of ideas that truly have research/IP-grade value, and we certainly do not have a closed loop to turn them into monetizable assets. In other words: today’s large-model products treat the “potential patent library / paper library / startup idea pool” that users continuously contribute as disposable consumables.
Whose “creativity” is being wasted the most?
The ones whose creativity is being wasted the most are actually not the “industry stars,” but three typical groups of “capable but under-resourced” people:
Research laborers: master’s and PhD students / postdocs / junior faculty / research engineers in industry. They can understand frontline papers, write code, set up experiments, and build systems. But they do not control their own funding, do not have independent research groups, and have no room to trial-and-error “wild ideas.” Even when they do have good ideas, they are often buried inside their advisor’s projects, KPI-driven tasks, and paper factories, and it is very hard for them to turn those ideas into IP or equity under their own names.
Long-term self-taught, non-formally-trained individuals: some of them have only an ordinary bachelor’s degree or even a vocational/associate degree, yet have spent years studying one vertical (cryptography, systems, control, quant, bioinformatics, etc.) in depth. They can read theory, follow derivations, and write code, but under the existing system:
They have no “formal identity” with which to submit projects or pull in collaborators;
Even if they write a structured theoretical draft, it is very hard for it to enter the formal peer-review pipeline.
When they throw their brainstorms at ChatGPT / Claude / Tongyi / Qwen today, the best-case outcome is a more smoothly rewritten draft, and the worst-case outcome is total evaporation.
Small-team engineers and people on the fringes of entrepreneurship: they can build prototype systems, make demos, and have real engineering execution capability; what they lack is investors or review committees who understand technology and can read code and architecture diagrams. Many of their ideas cannot give birth to papers, and cannot yet support a full company, so they drift in a gray zone: they receive neither academic recognition nor any serious investment.
For AI companies, these three groups share one common characteristic: their “average theoretical capability” may not be outstanding, but the density of their “high-value long-tail ideas” is extremely high. In other words, these people may not be the VIP users on your SaaS revenue curve, but they are very likely to be the main suppliers for your future IP warehouse.
This is the “self-media moment” for the research/IP world
From the perspective of industrial evolution, the current research/IP system looks a bit like the content industry before short video and self-media emerged. At that time, the only channels through which one could publish opinions and influence public discourse were: radio and television stations; newspapers and magazines; publishing houses.
Most ordinary people, even those with talent, could only be seen within small circles. Then self-media platforms appeared: anyone could publish content at low cost; platforms used algorithms for filtering and distribution; and a small number of top creators received ad revenue sharing, forming a new content industry chain.
Today, research / technological innovation / patent incubation are still stuck in a “broadcast-era” structure: formal publication, formal project approval, and patent planning still depend heavily on schools, titles, networks, and institutional endorsements. The vast majority of people who “have some capability but lack resources” cannot even enter the review stage, let alone obtain resources and revenue sharing.
An AI review + revenue-sharing system is, in essence, the following:
In the realm of research and technological innovation, it copies the structure of “self-media + algorithmic distribution + ads/revenue sharing,” but replaces “algorithmically recommended content” with “a strong review model that filters high-value theories/solutions,” and replaces “ad revenue sharing” with “long-term sharing in patents/IP/projects.”
The difference is: self-media platforms first “open the gates, let all content pour in, and then use algorithms to push down the noise”; what we propose is to “first use a cheap small model for initial screening, and then use a trillion-parameter-level strong review model for harsh filtering, allowing only a small number of high-value ideas into the incubation channel.”
For AI companies, this implies an opportunity on two levels:
Adding a completely different revenue/asset curve: no longer relying solely on subscription fees and API call fees, but also taking a long-term share of user-contributed IP / companies / products; the ceiling of this curve is determined by the volume of “long-tail creativity” across society, not by how many researchers you can hire yourself.
Seizing the high ground of the “research self-media platform”: whoever first builds this closed loop of “AI strong review + human investment committee + revenue-sharing agreements” will be the first to lock in that group of “capable but under-resourced people” as long-term creators and sources of data assets for their platform.
This is why this is not a matter of waiting a few years to see whether others make it work, but something worth investing in right now; the earlier you do it, the more it pays off as a new wave. Douyin (TikTok), for example, is now sweeping the globe, but even with equivalent or better technology it is almost impossible to replicate Douyin today, because user inertia has already formed.
2. Proposal: Build an “AI sovereignty zone + human sovereignty zone” research/IP pipeline
The core idea can be summarized in one sentence: hand the work of “judging truth vs falsehood and filtering for value” to AI, and hand the decisions that “bear legal responsibility and put up capital” to humans.
AI sovereignty zone: facing the massive volume of user brainstorms, this zone runs as a fully automatic multi-stage review pipeline: first, a cheap/small model handles safety filtering and coarse screening; then a specially trained 1T-scale “strong review model” performs the serious evaluation.
In this zone, AI is responsible for: cleaning; deduplication/plagiarism checking; logical consistency checking; literature checking / RAG; risk assessment; and preliminary commercial feasibility analysis.
In the end, it lets through only an extremely small fraction (for example, one in a thousand or even one in ten thousand) of “candidate gold,” each accompanied by a structured review report.
Human sovereignty zone: only when the strong review model deems something “worth investing resources in” does it enter the human process. Experts see only the AI-generated “Technical and Theoretical Evaluation Report,” “Risk Analysis,” and “Potential Commercial Value Assessment”; the human role is upgraded from “reviewer/grader” to “investment committee / technical project approval committee”; the company ultimately decides whether to file patents, whether to incubate projects, whether to invest in or sign with the user, and what revenue-sharing ratios to adopt; and all legal and compliance responsibilities remain fully concentrated on the human/company side.
What this system does is not “replace research,” but: from a vast ocean of unstructured ideas, filter out the small number of research/engineering concepts that deserve serious attention, and turn them into company-led IP, products, and equity assets.
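To make the handoff concrete, here is a minimal sketch of the structured report the AI sovereignty zone could emit for each surviving candidate, and of the committee decision that only humans produce. The class and field names are invented for illustration; the actual schema would be whatever the platform standardizes on.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewReport:
    """Output of the AI sovereignty zone for one surviving idea (illustrative fields)."""
    idea_id: str
    technical_evaluation: str      # technical and theoretical evaluation summary
    risk_analysis: str             # legal / ethical / scientific risk summary
    commercial_assessment: str     # potential commercial value assessment
    scores: dict = field(default_factory=dict)    # e.g. {"novelty": 4, "feasibility": 3}
    evidence: list = field(default_factory=list)  # citations / retrieved snippets

@dataclass
class CommitteeDecision:
    """Output of the human sovereignty zone: the only step that commits money and liability."""
    idea_id: str
    file_patent: bool
    incubate: bool
    revenue_share: float  # agreed share for the contributing user, e.g. 0.15

def committee_review(report: ReviewReport) -> CommitteeDecision:
    # Placeholder: in reality this is a human investment / technical committee,
    # not a function; the point is that it consumes only the structured report.
    raise NotImplementedError("decided by humans, who bear legal responsibility")
```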
What is even more critical is that this “AI sovereignty zone + human sovereignty zone” design precisely plugs the structural gaps of the existing human evaluation system. In reality, the small number of “top performers” who actually produce frontline research often have almost no time to carefully read huge volumes of submissions, and most of the review work is delegated to junior faculty, PhD students, and research staff, with highly uneven quality.
Take the AI field as an example. According to official statistics, ICLR 2024 received 7,404 submissions and accepted 2,260 papers; a single conference must digest several thousand papers within a few months. NeurIPS 2024 received more than 17,000 submissions and accepted about 4,400. Under such a workload, it is very hard to expect every paper to receive careful, consistently standardized review.
Worse still, top conferences have run “self-test experiments” on their own process: in NeurIPS’s famous 2014 repeat-review experiment, the same batch of papers was randomly assigned to two independent program committees. The result: for a substantial fraction of papers, one group recommended “accept,” while the other recommended “reject”; decisions depended heavily on luck and the particular combination of reviewers, which the official report described as exhibiting significant “randomness.”
Broader studies of academic peer review repeatedly point out that review decisions are not strongly correlated with the future academic impact of the paper. Subjective preferences, institutional affiliations, field “circles,” and even geographic and language backgrounds systematically influence conclusions. Under such a structure, truly non-consensus, cross-disciplinary new ideas are often not filtered out because they are “absolutely wrong,” but because reviewers have no time to dig in + incentives favor “safe topics,” so they get buried early on.
By contrast, a strong review large model built specifically for evaluation and IP identification has several inherent advantages:
Under unified rules, it can give structured scores for logical coherence, novelty, risk, and feasibility; perform multiple rounds of self-checking and literature verification; and avoid “spur-of-the-moment” one-shot decisions;
It will not give rushed scores because “today is too busy” or “I am in a bad mood lately.” Every sample goes through the same pipeline, and all intermediate reasons and evidence are traceable;
For borderline samples, it can invoke multi-round self-debate + RAG verification, making the “deep fact-checking” that human reviewers have no time for a routine part of the process.
In my design, AI is not meant to fully replace humans, but to create a clear division of labor. A specially constructed model will do a better job than most human reviewers on the question of “who truly deserves serious attention”: first, use the AI sovereignty zone to compress the ocean of noise into a small set of “high-value candidates + complete evaluation reports,” then hand them off to the human sovereignty zone to make the truly scarce decisions — how to lay out patents, how to split equity, who should lead execution, and whether the risks are acceptable.
Why is it worth doing “now” rather than waiting and seeing?
Technological maturity has reached an inflection point: Frontier large models already have sufficient capability in reasoning, coding, and understanding scientific text to handle “evaluation + attribution + explanation.” There is already plenty of public practice in LLM-as-a-judge, multi-agent debate, long context + RAG, and safety alignment that can be combined into a genuinely usable version.
Competitive dynamics are pressing: pure “chat products” are already highly commoditized, with a clear trend toward price wars. Everyone is searching for new business paradigms “on top of the model”: app stores, agent platforms, enterprise solutions, etc. Whoever first runs the pipeline of “user brainstorm → research/IP/equity” end to end will no longer just be selling compute and APIs, but will control the entry point for high-value innovation.
The regulatory window is still relatively friendly: done compliantly, a model in which “users voluntarily entrust their ideas to the platform, and the platform provides evaluation and incubation services” can relatively easily be fit into existing legal frameworks (with the platform acting as a service provider and potential partner). The earlier you pilot, the more time you have to work with regulators and legal teams to refine standard terms; and, as mentioned earlier, during this window companies find it easier to secure a first-mover advantage.
3. Technical feasibility and risk control: build a usable version with existing capabilities
Here we answer just one question: without retraining a new base model from scratch, can we use the existing tech stack to build a usable version?
The answer is: yes. And it is primarily a problem of “three-layer model stack + rules stack + process integration,” not of inventing new algorithms.
Three-layer model stack: Safety Gate → Preliminary Review → Strong Review
From an engineering perspective, the entire system is not a “giant black-box model,” but a chained three-layer model stack:
L1: Safety model
Responsibility: intercept clearly illegal, harmful, or high-compliance-risk content and perform coarse-grained risk grading.
Form: a multi-label classifier similar to Llama Guard / content-moderation models (obtainable by fine-tuning an existing mid-sized LLM with a classification head); it covers risk categories such as violence, self-harm, hate, crime, privacy leakage, and fraud/manipulation. It answers only the question of whether a submission can proceed to further review, and does not evaluate research quality.
Engineering implementation: it can be built directly on top of the company’s existing safety model with incremental fine-tuning on a small amount of labeled data; latency is on the millisecond scale, and it can be deployed in parallel at the gateway layer without consuming strong-model compute.
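As an illustration only (the checkpoint name, label set, and threshold are assumptions, not the company’s actual assets), the L1 gate can be wired up as a plain multi-label classifier in front of the review pipeline:

```python
from transformers import pipeline

# Hypothetical fine-tuned moderation checkpoint; any Llama Guard-style
# multi-label classifier with a comparable label set would slot in here.
safety_gate = pipeline("text-classification",
                       model="our-org/safety-gate-v1",  # assumed internal checkpoint
                       top_k=None)                       # return scores for all labels

RISK_LABELS = {"violence", "self_harm", "hate", "crime", "privacy", "fraud"}
BLOCK_THRESHOLD = 0.5  # tuning knob, not a recommendation

def l1_can_proceed(idea_text: str) -> bool:
    """Answer only one question: may this submission enter further review?"""
    scores = safety_gate([idea_text])[0]        # list of {"label": ..., "score": ...}
    flagged = {s["label"] for s in scores
               if s["label"] in RISK_LABELS and s["score"] >= BLOCK_THRESHOLD}
    return not flagged                          # research quality is NOT judged here
```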
L2: Preliminary review model
After the L1 safety model lets a submission pass, all research ideas first pass through a dedicated large preliminary-review model, which is responsible for structured extraction and coarse quality screening, and then decides whether to hand it over to the 1T strong review model.
Form and compute trade-off: adopt an MoE architecture of about 200B parameters (for example, a 16B dense trunk + more than 30 expert FFNs). At inference time, only a small subset of experts is activated, so the cost of a single forward pass is close to that of a 40B dense model. No re-pretraining is done: instead, one selects a tier from the company’s existing family of large models (e.g., a 70B–100B base) and performs MoE expansion + instruction-style SFT + LoRA/Adapter to obtain a usable model.
Responsibilities and outputs (all done in a single forward pass; a minimal sketch follows this list):
Unified structured representation: directly extract from the user’s free text a standard schema, including “problem background → core hypothesis → key reasoning steps → intended method (theoretical / simulation / experimental) → evidence types and summaries → testable predictions.” All back-end modules (strong review, human experts, statistical systems) operate on this schema.
Multi-dimensional coarse scoring + tiering: output a concise coarse evaluation vector and grade, for example: structural completeness (is there a basic “problem → hypothesis → method → evidence/prediction” chain?); testability (is there a verifiable line of argument? At this stage most submissions will be purely theoretical, but the theory should at least be testable or falsifiable); and strength of pseudoscientific rhetoric.
On this basis, assign a coarse-grained tier: trash / weak / educable / serious, used only to decide whether to send the idea to the strong review model.
Review-style evaluation, with a short review comment attached:
For the user: explain where the current problems are and what can be improved next;
For the back end: provide the strong review model with a high-level summary and key points of concern.
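The promised sketch of the single-pass L2 output. Field names are invented for illustration; the real schema would be whatever the preliminary-review model is fine-tuned to emit.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(str, Enum):
    TRASH = "trash"
    WEAK = "weak"
    EDUCABLE = "educable"
    SERIOUS = "serious"

@dataclass
class IdeaSchema:
    """Unified structured representation extracted from the user's free text."""
    problem_background: str
    core_hypothesis: str
    key_reasoning_steps: list[str]
    intended_method: str            # theoretical / simulation / experimental
    evidence_summary: list[str]
    testable_predictions: list[str]

@dataclass
class L2Result:
    """Everything L2 produces in one forward pass."""
    schema: IdeaSchema
    structural_completeness: int    # 1-5
    testability: int                # 1-5
    pseudoscience_rhetoric: int     # 1-5, higher = more rhetoric
    tier: Tier
    short_review: str               # teaching-style comment shown to the user

def forward_to_strong_review(result: L2Result) -> bool:
    """L2 decides only whether the idea deserves the expensive L3 pass."""
    return result.tier is not Tier.TRASH
```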
Rejection policy: relatively relaxed, preferring to let weak ideas through rather than wrongly kill good ones. The preliminary review model is not responsible for answering “who is worth investing resources in”; it answers only one question: “Is this fundamentally unreasonable; is it outright pseudoscience?”
Only when a set of extreme conditions is met in full will a submission be labeled as trash and blocked at this layer. For example: the text is severely logically chaotic (basic propositions cannot even be parsed); there is absolutely no “problem–hypothesis–method–evidence/prediction” structure; it is neither verifiable nor falsifiable; it contains a large amount of rhetoric about mysticism, energy fields, consciousness waves, and the like; and it flatly rejects foundational laws (energy conservation, the postulates of relativity, etc.) without proposing any alternative theoretical framework.
As long as an idea has a basic structure, even a very naive one; as long as it can clearly state what phenomenon it is trying to explain, even with no experimental resources behind it; as long as it retains some degree of scientific character, it is passed through to the strong review model (the gating rule is sketched below).
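A sketch of that gating rule, with flag names invented for illustration: an idea is labeled trash only when essentially all of the extreme conditions hold, which encodes the “rather let weak ideas through than kill a serious one” policy.

```python
from dataclasses import dataclass

@dataclass
class TrashFlags:
    # Each flag is filled in from the preliminary-review model's structured output.
    logically_unparseable: bool            # basic propositions cannot even be parsed
    no_phmep_structure: bool               # no problem-hypothesis-method-evidence/prediction chain
    not_testable_or_falsifiable: bool
    heavy_mystical_rhetoric: bool          # energy fields, consciousness waves, etc.
    denies_foundations_without_alternative: bool

def is_trash(f: TrashFlags) -> bool:
    # Deliberately conjunctive: a single redeeming property keeps the idea alive.
    return all([f.logically_unparseable,
                f.no_phmep_structure,
                f.not_testable_or_falsifiable,
                f.heavy_mystical_rhetoric,
                f.denies_foundations_without_alternative])
```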
Engineering implementation approach: on top of an existing large model, use multi-task SFT + a small amount of preference alignment so that the same ~200B MoE model, in a single forward pass, simultaneously performs: structured extraction (IdeaSchema) + multi-dimensional scoring + coarse tiering + review-style short comment. Only a small number of new parameters are trained (LoRA/Adapter + MoE routing). With appropriate routing and load-balancing losses, one can quickly obtain a usable model that blocks noise while almost never “killing” serious ideas by mistake.
L3: Strong review model: specialization with “minimal modification” on a trillion-parameter-scale MoE model
Only after the safety + preliminary review models have blocked the “clearly non-compliant / clearly unreliable” portion does it become the strong review model’s turn to step in.
Assume the company already has a fairly strong trillion-parameter-scale MoE model with powerful general capabilities. We do not re-pretrain it; we perform only “minimal incremental training”: the underlying architecture is untouched, no large-scale unsupervised pretraining is rerun, and all that is added is instruction-style SFT + preference/contrastive fine-tuning around “review and judgment.”
Parameter-efficient fine-tuning (PEFT): use LoRA/Adapter + MoE-routing-level PEFT (e.g., ideas like PERFT, MiLoRA, etc.); in practice, the newly trained parameters amount to only 1%–3% of the model, and review-specialized fine-tuning can be completed on a few dozen high-end GPUs.
Multi-task output head: simultaneously learn to score and explain “safety + factual correctness + research quality,” including:
Multi-dimensional research quality scoring: logical coherence, scientific soundness, novelty, testability, risk;
Structured review opinions: strengths / weaknesses / suggested revisions / whether incubation is recommended;
Safety and misinformation labels: legal risk, ethical risk, pseudoscience / incorrect citations, etc.
This transforms the strong review model from a “general-purpose AI tool” into a structured scorer and adjudicator, while still using the company’s existing base model — only with an extra specialized “review persona” layered on top.
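A minimal sketch of the “minimal incremental training” step, assuming a Hugging Face-style checkpoint. The model name and target module names are placeholders for the company’s actual architecture; MoE-routing-level PEFT (PERFT/MiLoRA-style) and the multi-task output heads would require custom code beyond this.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("our-org/1t-moe-base")  # assumed checkpoint

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only; experts stay frozen
    task_type="CAUSAL_LM",
)
reviewer = get_peft_model(base, lora_cfg)
reviewer.print_trainable_parameters()  # expect on the order of 1-3% of the full model

# Training itself is ordinary instruction-style SFT plus preference/contrastive
# fine-tuning on (idea, structured review) pairs; standard Trainer / TRL code, omitted here.
```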
Rules first: turning “gut-feel review” into auditable machine rules
We will not allow the strong review model to “freely improvise.” It must operate under a three-layer rules stack:
Safety and compliance rules: benchmarking against Llama Guard and various safety standards, first perform risk classification: violence, self-harm, hate, illegal activities, biosafety, privacy leaks, etc. High-risk samples are directly rejected or escalated to human safety review. The model must output a multi-dimensional safety vector (legality, fairness, privacy risk, etc.) and be able to point to the specific text segments that triggered the rules.
Factual and information quality rules: for all content that claims to be factual, the review process must attempt to obtain evidence via internal knowledge + literature/patent databases + web search. Content for which no evidence can be found is labeled an “unverified hypothesis” and may not be treated as established fact. For ideas that rely on conclusions that have since been overturned or on retracted papers, the research score must be explicitly reduced, and this must be clearly flagged in the report.
Research quality and value rules: for each idea, assign scores along dimensions such as logical coherence, scientific soundness, novelty, long-term value, and risk. Use unified rules to define exactly what “1 point” and “5 points” mean on each dimension, and specify which score combinations qualify for entry into the incubation channel and which receive only teaching-type feedback (a configuration sketch follows after this rule stack).
Multi-agent debate and multi-step reasoning all run atop this rule stack. The model may choose how to argue, but it may not redefine the scoring standards on its own.
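The configuration sketch promised above: one purely illustrative way to pin the scoring anchors and routing thresholds down as data rather than prompt prose, so they can be audited and versioned. Dimension names, anchor texts, and thresholds are placeholders.

```python
# rubric.py -- illustrative rule stack; dimensions and thresholds are placeholders.
RUBRIC = {
    "logical_coherence":    {1: "contains contradictions that break the core argument",
                             5: "every step follows from stated assumptions"},
    "scientific_soundness": {1: "conflicts with well-established results without justification",
                             5: "consistent with known results; caveats acknowledged"},
    "novelty":              {1: "restates textbook material",
                             5: "no close prior work found after retrieval"},
    "long_term_value":      {1: "no plausible application or follow-up",
                             5: "opens a clear line of patents, papers, or products"},
    "risk":                 {1: "severe legal/ethical/safety exposure",
                             5: "no significant risk identified"},
}

def route(scores: dict[str, int]) -> str:
    """Which score combinations go where; the model may argue, but not rewrite these rules."""
    if scores["risk"] <= 1:
        return "human_safety_review"
    if min(scores.values()) >= 3 and scores["novelty"] + scores["long_term_value"] >= 8:
        return "incubation_channel"
    return "teaching_feedback"
```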
Inference architecture: “Eigen-1-lite” with multi-agent debate + Monitor-RAG
In the inference process, we combine several mature techniques available today to build an “Eigen-1-lite” tailored to the review scenario:
Multi-agent debate (MAD): mitigating single-model stubbornness and hallucinations
Use the same model to play three roles:
A: author agent (tries to preserve the user’s idea and proactively repair it);
B: adversarial reviewer (systematically finds faults, focusing on key loopholes and risks);
C: editor/judge (manages the debate, decides whether to continue, and gives the final conclusion).
They conduct a tit-for-tat style debate within a limited number of rounds, and C provides a structured judgment and summary.
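A bare-bones sketch of that debate loop, assuming a generic `chat(system, messages)` wrapper around the strong review model; the prompts and the termination signal are placeholders.

```python
def chat(system: str, messages: list[dict]) -> str:
    """Thin wrapper around the strong review model's chat endpoint (assumed)."""
    raise NotImplementedError

def debate(idea: str, max_rounds: int = 3) -> str:
    transcript: list[dict] = [{"role": "user", "content": f"Idea under review:\n{idea}"}]
    for _ in range(max_rounds):
        defence = chat("You are the author agent: preserve and repair the idea.", transcript)
        transcript.append({"role": "assistant", "content": f"[AUTHOR] {defence}"})

        attack = chat("You are the adversarial reviewer: find key loopholes and risks.", transcript)
        transcript.append({"role": "assistant", "content": f"[REVIEWER] {attack}"})

        verdict = chat("You are the editor/judge: continue the debate or conclude. "
                       "If concluding, output a structured judgment.", transcript)
        transcript.append({"role": "assistant", "content": f"[JUDGE] {verdict}"})
        if "CONCLUDE" in verdict:        # judge signals termination in its own output
            break
    return transcript[-1]["content"]      # the judge's structured judgment and summary
```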
Monitor-based RAG: implicit evidence gathering, reducing tool-call cost
Attach a lightweight Monitor module that tracks uncertainty at the token level; only when the reasoning clearly depends on external knowledge is retrieval triggered, querying literature, patents, and internal technical documents; retrieval results are injected into the context as compressed “fact snippets,” without interrupting the reasoning flow.
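A sketch of the Monitor idea, with per-token log-probabilities standing in for “uncertainty”; the `generate_step` and `retrieve` functions and the threshold are assumptions, not a fixed interface.

```python
UNCERTAINTY_THRESHOLD = 2.5   # mean negative log-prob over recent tokens; a tuning knob

def token_uncertainty(logprobs: list[float]) -> float:
    """Average negative log-probability of the most recent tokens."""
    return -sum(logprobs) / max(len(logprobs), 1)

def monitored_step(generate_step, retrieve, context: str) -> str:
    """One reasoning step; retrieval fires only when the model looks unsure."""
    text, logprobs = generate_step(context)        # assumed: returns (text, per-token logprobs)
    if token_uncertainty(logprobs) > UNCERTAINTY_THRESHOLD:
        snippets = retrieve(text, k=3)              # literature / patents / internal docs
        facts = "\n".join(f"[FACT] {s}" for s in snippets)
        # Inject compressed fact snippets and let the model revise without restarting.
        text, _ = generate_step(context + "\n" + facts + "\n" + text)
    return text
```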
Quality-aware iterative refinement (QAIR idea): allocating compute according to value
For each idea, the first step is a single-judge evaluation that outputs an uncertainty estimate; only the small fraction of samples with high quality potential but also high uncertainty triggers multi-round debate + deep RAG; most clearly low-quality samples go through only 1–2 rounds of lightweight reasoning, after which the system directly produces a rejection or teaching-type feedback.
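A sketch of that compute-allocation policy; the score/uncertainty fields and thresholds are illustrative, and the four callables stand in for components described elsewhere in this section.

```python
def allocate_compute(idea: str, single_judge, debate, deep_rag_review, quick_feedback):
    """Spend strong-model compute only where value and uncertainty are both high."""
    first_pass = single_judge(idea)      # assumed to return {"score": 0-1, "uncertainty": 0-1}

    if first_pass["score"] < 0.3:
        # Clearly low quality: 1-2 cheap rounds, then rejection or teaching feedback.
        return quick_feedback(idea, first_pass)

    if first_pass["uncertainty"] > 0.5:
        # High potential but the judge is unsure: escalate to debate + deep RAG.
        debated = debate(idea)
        return deep_rag_review(idea, debated)

    # Confident single-judge verdict: accept it as-is.
    return first_pass
```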
Overall, this is an inference architecture with controllable average cost that still puts sufficient effort into high-value borderline samples.
At the engineering level, all components can be implemented by combining existing models and kernels; there is no need to build a new generation of infrastructure.
In one sentence: the safety model decides “whether it can be reviewed at all,” the preliminary review model decides “whether it deserves strong review,” and the strong review model decides “whether it deserves resource investment.”
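Putting the three layers together, the whole funnel becomes a short orchestration function. Every name here refers back to the illustrative sketches above (`l1_can_proceed`, `forward_to_strong_review`, `route`) or stands in for components not sketched (`preliminary_review`, `strong_review`); none of it is a fixed interface.

```python
def review_pipeline(idea_text: str):
    """End-to-end funnel: L1 decides reviewability, L2 decides whether strong review
    is warranted, L3 decides whether the idea deserves resource investment."""
    if not l1_can_proceed(idea_text):                  # L1: safety gate
        return {"status": "rejected_by_safety_gate"}

    l2 = preliminary_review(idea_text)                 # L2: schema + coarse scores + tier (assumed)
    if not forward_to_strong_review(l2):
        return {"status": "teaching_feedback", "comment": l2.short_review}

    report = strong_review(l2.schema)                  # L3: rules stack + debate + Monitor-RAG
    if route(report.scores) == "incubation_channel":
        return {"status": "handed_to_human_committee", "report": report}
    return {"status": "teaching_feedback", "report": report}
```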
4. Safety and compliance risks: they can be explicitly managed
The most important risks in this system fall into three categories: safety and compliance, IP ownership, and market/public opinion. The corresponding mitigation strategies are briefly as follows:
Safety and compliance: use the L1 safety gate + safety model + SafeMoE routing constraints to ensure harmful content cannot “slip through” due to fine-tuning or routing drift. Both the debate process and the review process are forced through the same safety rules, preventing “bypassing the guardrails.”
Intellectual property and contract structure: before submission, clearly state in the terms that the user retains personal rights to the original idea, and that after the platform invests resources (review + incubation + investment), it enjoys an agreed proportion of IP/revenue sharing on the resulting outcomes. For different project types (papers, patents, startups), design a few standard contract templates, reviewed by legal counsel.
Market and public opinion risk: control the messaging by emphasizing that “this is a serious research/technology evaluation and incubation pipeline.” Ensure transparency: after model review, provide users with explainable scores and feedback to avoid “results without explanations.” For complex borderline cases, retain channels for human re-review and appeals.
5. User incentives and model iteration: building a sustainable positive feedback loop
To avoid an extreme binary structure in which “only the few incubated projects generate returns, and everyone else gets nothing,” this system designs two incentive tiers after the strong review model, using “AI capability” rather than cash to align with the long-term contributions of more users:
Tier 1: Potential ideas (high score but not incubated for now)
For ideas that achieve relatively high research scores in the strong review model, are logically self-consistent, and have learning value, but for the moment do not reach the threshold for immediate patent filing or project initiation, the platform does not directly enter commercial incubation. Instead, it labels the author as a “high-potential research creator,” and:
Grants a period of access to a specialized research large model;
Provides more detailed revision suggestions and a “next-step learning/improvement roadmap,” encouraging them to keep iterating.
In other words: even if your idea cannot yet become a company asset, it can still be exchanged for better tools and a clearer growth trajectory.
Tier 2: Implementable ideas (enter and pass the human committee)
For the small number of projects that the strong review model judges as having “clear application prospects,” and that also pass the company’s internal technical committee / legal / business evaluation, the platform will:
Enter a formal incubation process (patent/project initiation/company formation), with equity or revenue sharing arranged according to pre-agreed terms;
At the same time, grant the author long-term access to a higher-tier research model (for example, the strongest internal research version) as a signature benefit of becoming a “platform partner-type creator” (a sketch of the tier logic follows below).
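A sketch of how the two incentive tiers could be encoded as entitlements. The model names, durations, thresholds, and revenue-share figure are placeholders, not product or contract decisions.

```python
from dataclasses import dataclass

@dataclass
class Entitlement:
    model_access: str      # which research model the author may use
    duration_days: int
    revenue_share: float   # 0.0 when no incubation takes place

def grant_tier(strong_review_score: float, committee_approved: bool) -> Entitlement:
    if committee_approved:
        # Tier 2: formal incubation plus long-term access to the top research model.
        return Entitlement(model_access="internal-research-top", duration_days=365,
                           revenue_share=0.15)   # ratio fixed by the signed agreement
    if strong_review_score >= 0.7:
        # Tier 1: high potential but not incubated yet; better tools, no cash.
        return Entitlement(model_access="research-specialized", duration_days=90,
                           revenue_share=0.0)
    return Entitlement(model_access="standard", duration_days=0, revenue_share=0.0)
```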
The benefits of this design are:
The vast majority of serious submitters will not walk away empty-handed simply because they are “not immediately incubated”; at a minimum, they obtain better research AI tools and clearer directions for improvement;
The small group who make it into Tier 2 gain both traditional economic returns and AI capability dividends, naturally guiding behavior to “upgrade toward serious research.”
The benefits for the platform/company are also very direct:
Locking in high-value users and increasing long-term stickiness: in essence, both Tier 1 and Tier 2 rewards are “stronger research models + deeper collaboration relationships.” This firmly retains a group of genuinely productive, high-potential creators on the platform rather than letting them “use it and leave.” For the company, these people are the core suppliers of future papers, patents, and startup projects; their stickiness and value are far higher than those of ordinary subscription users.
Using stronger models to “amplify” the output of this group: what the platform gives them are productivity tools, not one-off bonuses. The better the strong models are, the higher their efficiency in iterating ideas, writing code, checking literature, and running reasoning, and the more high-value ideas, patents, and projects they will generate in the future. The company’s opportunities to participate in revenue sharing/equity within these will grow accordingly.
Turning “medium-value” submissions into high-quality training data assets: for Tier 1 submissions that do not yet have realistic incubation conditions but have some value in theory or methodology, and with the user’s explicit consent, the platform can:
Incorporate them into the training/alignment data of the review model and research model, continuously improving the models’ judgment capabilities and research reasoning level;
Provide the authors with some form of reward based on contribution level, forming a loop of “you contribute abstract knowledge, and we, in return, give you stronger tools and certain incentives.”
In this way, even if a submission does not immediately become a commercial project, it is not treated as a disposable consumable. Instead, it is deposited as part of the company’s long-term accumulation of model capabilities + data assets. At the same time, authors can feel that “their work continues to generate value and feedback,” so the relationship between people and the platform becomes a stable, long-term research partnership, rather than a one-off “submission game.”
This system essentially upgrades AI into a research/IP channel open to everyone: AI is responsible for large-scale front-end filtering and evaluation, and humans only make final decisions and bear responsibility on the small number of high-value projects.
For the company, this brings new patent/IP assets, highly sticky high-value users, and a continuously rolling pipeline of data and revenue sharing.
From the perspective of implementation path, there is no need to rebuild a base model. It is enough to perform one round of “review specialization + workflow orchestration” engineering integration on top of the existing large-model system to get a usable model up and running.