AI Detection as Kingmaking

The following essay was produced by a Qwen3 0.6B Instruct model RL'd to write like Sam Altman. The prompt was "if humans and dinosaurs roamed the earth together." You might be able to tell I've been watching the new season of Primal!

After the essay, I discuss how this was created. This piece was originally envisioned to show the end of AI writing detection. I've since realized I was wrong. I've been writing it over the course of the last month, and the other night I extended it further after a discussion about model bias. The proliferation of AI writing will make detection more valuable even if the detection is routinely beaten.

The Essay

“If dinosaurs and humans had roamed the Earth together, we’d have a much more intricate world. Dinosaurs, long ago, came to Earth 66 million years ago, their once-dominant megaherbivores—herds of large herbivores like sauropods and theropods—made ecosystems more complex than we think. Dinosaurs existed and went extinct, so both dinosaurs and humans would have access to the same food sources, the same kinds of weather, and the same kind of sky and land—without the need to make new, planet-wide migrations.

Before human civilization arose, there was a mutual conflict between dinosaurs and humans that shaped early Earth. Dinosaurs predated humans, and large herbivores often avoided direct contact with large carnivores like lions and tigers. Some dinosaurs (and some of the smaller herbivores) preferred to hunt small birds instead of large mammals. Humans had no such restrictions. We often tell people that humans evolved in a way that is species-specific: we adapted humans to work on our feet instead of crawling and standing on toes. But I think the point is that there is a lot of mutual benefit here: both humans and dinosaurs survived, developed, and spread around the world, while not both of us have to face the same kinds of problems.

Dinosaurs formed an important part of Earth's history in many ways. They lived for a time (until 66 million years ago) and helped the evolution of various other animal families. Like many other primates, our ancestor began moving out of Africa to Eurasia within about 50,000 years. This expansion may have been influenced by global migrations of theropod dinosaurs. More recently, large herbivores like sauropods and elephant herds helped seed the flourishing of many plant families. Humans, in contrast, have largely been an exception, as we are bipedal, did not spend much time in the grassland, and are usually less inclined to work with large plants.

The survival of dinosaurs, if they had long lived and thrived, would probably have been even more difficult than that of humans. Human societies have so much complexity—tool use, language, political structures, cultures, and environmental cooperation—because we can’t rely on our species to be hunter-gatherers all the time. We are social, non-hunting, yet it’s clear from the way humans built houses, went cities, invented fire, and mass-produced so much that it seems we are something totally different.

If humans had many more chances to live with dinosaurs, they would have established at least some kind of symbiotic relationship with them. Maybe humans would have shared many things: perhaps living together in times when both species were on the move. Dinosaurs would have shared the same kind of cautionary stories, for example. I believe that this kind of relationship would have greatly improved human survival and development. Dinosaurs wouldn’t have grown so far up in Earth’s history if humans had not reached them in so many years.

Despite all these possibilities, I don’t believe it's so likely that humans would ever succeed in coexisting with dinosaurs. So instead, they coexisted. They conquered land, survived, and even reigned supreme.

So, in conclusion, if dinosaurs and humans had truly roamed the same planet together, we would have a different type of world. That world would be more like a two-million-year rock record of such a relationship, filled with interesting stories, lush ecosystems, and probably some surprising cooperation between species.”

How It’s Done

The quality of the essay is bad. That was not the purpose of this exercise. The goal was to create an RL environment that would allow us to train a model to write like a given personality. In that regard, we appear to have been successful. The validation during training looked good, and the model beats leading AI detectors like Pangram (pangram.com) despite its size.

At the paragraph level, the above essay passes (albeit with low confidence). I haven't seen another model at this size beat Pangram consistently. There was a fine-tuned 8B model trained to de-slop AI essays, but it only had a ~25% pass rate against Pangram.


Even with multiple paragraphs and em-dashes, the output passes as human-written.

The entire essay, however, is detected as AI. This full-essay limit may be an artifact of undertraining, limited model capacity, or the objective. The mimic is trained using my Single GPU DAPO implementation. You can find the specific project here: https://github.com/Matthew-agi/DAPO-Single-GPU/tree/main/projects/entity_personality. There is a similar environment on Prime-RL (https://app.primeintellect.ai/dashboard/environments/matthewagi/entity-personality-style), but I have not tested its compatibility extensively. Both implementations are currently bare bones and will be improved.

We reward the model when the topic adheres to the essay title, when sentence-level representations are close to the target sentence style, and when paragraph-level representations are close to the target paragraph style. Specifically, we use embeddings to limit topic divergence and classifier heads to reward style adherence. Ignoring the essay level may be what allowed detection, since it is out of scope for our objective.
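For intuition, here is a minimal sketch of that reward shape. The `embed`, `sentence_style_head`, and `paragraph_style_head` callables are hypothetical placeholders, not the project's actual API, and the equal weighting is an assumption; the real implementation lives in the DAPO repo linked above.

```python
# Hypothetical sketch of the multi-level style reward described above.
# embed() maps text to a vector; the *_style_head() callables return the
# probability that a span matches the target author's style.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def style_reward(essay, title, embed, sentence_style_head, paragraph_style_head):
    # Topic adherence: keep the essay embedding close to the title embedding.
    topic_score = cosine(embed(essay), embed(title))

    # Sentence-level style: mean classifier score per sentence
    # (naive "."-splitting is just for illustration).
    sentences = [s for s in essay.split(".") if s.strip()]
    sent_score = float(np.mean([sentence_style_head(embed(s)) for s in sentences]))

    # Paragraph-level style: same idea at paragraph granularity.
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]
    para_score = float(np.mean([paragraph_style_head(embed(p)) for p in paragraphs]))

    # Equal weighting is an assumption, not the trained objective's weights.
    return (topic_score + sent_score + para_score) / 3.0
```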

We also track the distance from our model's embedded centroid to the target writing centroid, along with the output of the initial classifier head fit on the first validation set. During training, we retrain the classifier, and because it remains competent at detecting the current policy's writing, it is difficult to read explicit progress off a moving target. To alleviate this, we track the centroid distance and the frozen original classifier's output. Together, the original baseline and the centroid distance let us track total improvement.
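A minimal sketch of those two stable progress metrics follows; again, `embed()` and `baseline_head()` are placeholder callables, not the project's real interface.

```python
# Two metrics that do not move during training: the distance between the
# policy's average embedding and the target author's, and the mean score
# from the classifier frozen at the first validation set.
import numpy as np

def progress_metrics(samples, target_samples, embed, baseline_head):
    model_centroid = np.mean([embed(s) for s in samples], axis=0)
    target_centroid = np.mean([embed(s) for s in target_samples], axis=0)

    # How far the policy's average representation sits from the target's.
    centroid_dist = float(np.linalg.norm(model_centroid - target_centroid))

    # Mean "AI-written" probability from the frozen baseline classifier;
    # unlike the continually retrained classifier, this never shifts.
    baseline_score = float(np.mean([baseline_head(embed(s)) for s in samples]))

    return centroid_dist, baseline_score
```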

Where We Go Now

The next step is to generalize this performance so we can supply any writer's style and have the model mimic it effectively. I'm currently using Project Gutenberg to train across many authors. There are a number of open questions related to increasing the author count.

The method would be similar to voice cloning, but in written text. The prompt would include an example of the author's writing, and training would proceed effectively as before but with additional authors. The largest difference is the need for an author scheduler. The naive approach, sketched below, is to randomly select an author for each essay. This should be effective enough, though a more sophisticated approach will likely improve mimicry and robustness.
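Here is what that naive scheduler amounts to; a curriculum (for example, weighting authors by current mimicry error) would be the obvious refinement, not shown here.

```python
# Naive author scheduler: sample a target author uniformly at random
# for each rollout. Author names below are illustrative.
import random

class RandomAuthorScheduler:
    def __init__(self, authors: list[str], seed: int = 0):
        self.authors = authors
        self.rng = random.Random(seed)

    def next_author(self) -> str:
        return self.rng.choice(self.authors)

scheduler = RandomAuthorScheduler(["Austen", "Twain", "Woolf"])
author = scheduler.next_author()  # target author for this essay's prompt
```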

The largest open questions regarding effectiveness concern the generalization of this process. I've run tests where style mimicry trained on a single author generalizes only weakly, if at all. How many distinct authors do we need for broader generalization? What is the scaling curve of mimicry over authors? Will we need contemporary authors, or will the mimicry skill generalize from publicly permissible work? These questions will be answered in the next set of experiments.

There might be a deployment limit because models do not converge to platonic ideals (https://arxiv.org/html/2507.12224v1 - thanks @ueaj). This implies that model biases learned during training are inherent, even at the optimizer level. The argument for detection is that this bias is large enough to be uncovered given enough deployment instances of a style-cloning model. I find it unlikely that a single instance could be identified, but I can't speak to the likelihood of detecting many styles emitted by a single model.

The argument is as follows. Consider a model trained with some fixed set of biases which is also trained to clone a writing style. The claim is that optimizer (and similar) biases from training will always contaminate the LLM's output, so in principle you can characterize those biases and detect the output. However, data and objective contribute biases too, and individual humans carry detectable biases of their own. To identify a single deployment instance, the LLM's standardized (optimizer) biases must be a stronger signal than the data and objective biases; if they are not, they can be misidentified as human-like bias, and the output is undetectable. Even then, there is a deployment threshold: you can denoise the signal from more style deployments, up to the detection limits of your system and environment (see the spectroscopy and detection-limit literature). These limits are currently underexplored. A single instance could in theory be detectable given luck within the bounds of those limits.
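A toy simulation makes the denoising claim concrete. All numbers here are illustrative assumptions, not measured values: a small constant model bias buried in per-deployment style noise can be recovered by averaging more deployments, with statistical error shrinking roughly as 1/sqrt(N) until a systematic floor dominates.

```python
# Toy model of bias detection across N deployment instances.
import numpy as np

rng = np.random.default_rng(0)
model_bias = 0.02          # standardized (optimizer-like) bias signal
style_noise_sd = 0.5       # per-deployment variation from data/objective/style
systematic_floor = 0.005   # environment limit that averaging cannot remove

for n in [1, 10, 100, 1_000, 10_000]:
    samples = model_bias + style_noise_sd * rng.standard_normal(n)
    estimate = samples.mean()
    statistical_error = style_noise_sd / np.sqrt(n)
    total_error = max(statistical_error, systematic_floor)
    detectable = abs(model_bias) > 2 * total_error  # crude 2-sigma criterion
    print(f"N={n:>6}  estimate={estimate:+.4f}  err~{total_error:.4f}  detectable={detectable}")
```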

The Abstracted View

The intent of the largest labs is to create a model which is the best writer. The winning style, if you will. This is a misguided goal. Style and quality are relative skills which often masquerade as absolute skills. Absolute skills have universally agreed grades; relative skills do not, and people are entitled to subjective views about them. Running speed is an absolute skill. Poetry writing is a relative skill: you have great poets (relatively ranked), but the greatest poet (absolutely ranked) is contentious.

When labs claim to want to create the best writer, they speak as if writing were an absolute skill. It is a relative skill, and relative skills benefit from a decentralized approach. For instance, the quality of a Twitter poster is correlated with their following, and those followings are unique! Two different niche accounts would not do as well in each other's niches because their quality is not universal. You can't assign universal grades, nor can you target universal posting and recreate the Twitter ecosystem. You need decentralized social construction.

The above statements on Twitter might sound at odds with current social media platforms, which reward global rankings. They are not. Even at the global ranking scale you get something akin to banner blindness across viral posts. Trends of virality ebb and flow because the populace builds up memetic antibodies. As a result, the flows are driven by new strategies which extend or combine prior successful strategies. The alpha of every new social strategy has inherent decay, much like financial alpha. This decay and drive to change is a universal aspect of decentralized systems.

The arts are a good representation of this decentralized-creation phenomenon. Each style has leaders, and these leaders benefit from the combinatorial nature of sharing and competing in a decentralized manner. Stylistic components of jazz bleed into various other music styles. Painting has distinct periods influenced by external events and happenstance, but across periods the skill is relative. Writing is not produced in a vacuum against some absolute "good". Instead, you need something that allows the art to flow and evolve. It needs to react in an environment.

“Good artists copy, great artists steal.”

As funny as it may seem, the skill of mimicry is needed for the creation of artistic AI. We will not have a universal model of good writing. Even if we did, it would decay in status as the style saturated public consciousness. The em-dash is poisoned by GPT-isms despite its former status as a signifier of quality. If a model can mimic a style, then it can mimic a combination of styles. If we can mimic and evolve styles via evolutionary mixing, then we can put these styles in the public square for feedback. Those which are good will garner acclaim. Those which are bad will be replaced.

The decentralized method of training our writing models will not have an absolute good. Instead, it is designed for the decentralized nature of the arts, adhering to the system instead of naively fighting for an all-knowing single entity. The social aspect of the arts necessitates a decentralized approach. As models scale, learning mimicry will get easier. The key to the environment is allowing the model to learn the skill of mimicry, keeping the output stable, and then implementing mixing. These are all achievable goals.

Detecting AI

The above abstracted view doesn't break the paradigm of AI detection. I used to think it did. However, we are in the first iteration of a cat-and-mouse game. The game laid before us is eternal and dynamic. People are already implicitly aware of it: it's almost exactly how slang works.

Slang is memetic language used to signal an in/out group. Once a term saturates the public zeitgeist, it loses this power. Sometimes usage decay is mitigated if the term also carries technical weight (like many professional slang terms which saturate niches and persist). More often than not, saturation means the collapse of the usage. Nobody says "this essay's the bee's knees" anymore. The future role of AI detection will be to accelerate the peak and decline of saturated AI slang.

When a specific writing style becomes vogue, the best writers of that style will benefit. To maximally preserve those writers, detection companies will need their models to be better at mimicry than the leading writing AI. Stated another way, they must be more attuned to the biases of the style than an LLM can render. As a writing style increases in popularity, its outputs will be filtered by these models, and as long as the detection limits allow, they will be able to single out the human writing.

To make this more concrete, detection systems will need to distinguish natural variation in writers (evolution, noise) from model-derived bias. As long as there is a historical precedent for the writer, this is a tractable problem up to the limits. There are limits, and in a decentralized system with varying forms of model bias, you will end up at a measurement saturation point. It's unlikely that a large lab will be able to produce a model that continuously defeats detection. It is likely that a single entity (potentially the author themselves) could create a single model that defeats detection.

Machine learning practitioners have a tendency to assume away a lot of bias. If you focus only on the model, you can ignore optimizer and dataset bias. This lets you assume that with infinite batch sizes you get infinite noise reduction (all signal at the limit). I come from spectroscopy, where detection limits are set by the detector and the environment. There are a number of techniques to improve system precision (including collecting more data), but these do not average out infinitely. You often see Allan-Werle plots showing measurement limits driven by non-white noise. At the environment level, other factors like real changes disallow infinite data collection and hinder measurements. The above is only a snapshot of precision difficulties, and it assumes perfect accuracy.
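For readers outside spectroscopy, here is a minimal sketch of the Allan deviation behind those plots, with illustrative noise and drift parameters: averaging suppresses white noise as the window grows, until slow drift dominates and longer averaging stops helping.

```python
# Allan deviation of a signal with white noise plus slow drift.
# The deviation falls with the averaging window m while white noise
# dominates, then flattens and rises once drift takes over.
import numpy as np

def allan_deviation(signal: np.ndarray, m: int) -> float:
    """Non-overlapping Allan deviation for averaging-window length m."""
    n_blocks = len(signal) // m
    block_means = signal[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
    diffs = np.diff(block_means)
    return float(np.sqrt(0.5 * np.mean(diffs**2)))

rng = np.random.default_rng(0)
t = np.arange(100_000)
signal = rng.standard_normal(len(t)) + 1e-5 * t  # white noise + slow drift

for m in [1, 10, 100, 1_000, 10_000]:
    print(f"tau={m:>6}  sigma_allan={allan_deviation(signal, m):.4f}")
```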

Even if you could measure model bias in theory, there are practical limits. If we game out this system, the most likely equilibrium is that specially trained models become indistinguishable from humans, while large lab models generate enough labelled data that a detector can pick up on their bias. There exists some transition point between mass deployment (trillions of tokens a day into the public zeitgeist) and specialty models where the detection becomes apparent.

In the long term, you don't dissuade AI writing; eventually, detection will fail. Instead, you create an incentive system for novelty and quality. Like slang, each new style has a usage peak. Detection systems decrease the height of each peak, so the incentive is to iterate on novelty. The iteration cycles will compress, much like traditional slang cycles have compressed thanks to social media. AI detection will trend towards a quality (and thus class) signifier.

The end-state implications of the above are more valuable than strict AI detection. Failure to detect AI is inevitable, but it will make AI detection more valuable as a service. The future business venture is primarily social signalling. The ability to detect cheap reproductions of a style is valuable. The ability to detect the cutting edge of style is valuable. The ability to do both also implies the ability to bring about new styles, which is doubly valuable. We are at the precipice of AI kingmaking across a greater expanse of the social sphere.

You see the value that social signalling brings across every single industry. Style is an incredible multiplier on value for the individual. The “in the know” wear certain clothes, drink certain wine and inhabit certain cliques. Even the anti-signalling of tech founders is in-group signalling. When you are what you consume, you will want to consume the best. You will consume only the best when you can detect what is cheap. AI detection will always be able to detect what is cheap even when it fails at the cutting edge during the transition of styles. The winner in detection transcends pure utility and will become a social kingmaker.

AI detection has been killed. Long live AI detection.