OpenAI Says It's Watermarking Its Output
And can rat out students who are using it. Also, those who aren't.
Last weekend, the Wall Street Journal dropped this article (gifted link [1]):
Here’s ChatGPT’s one paragraph summary of the article (that I asked it to generate):
OpenAI has developed a method to detect when ChatGPT is used to write essays or research papers but has not released it due to internal debates about transparency and user retention. The tool uses a watermarking technique that makes AI-generated text detectable without visible changes, claiming 99.9% effectiveness. Concerns include potential impacts on non-native English speakers and the possibility of evading detection through simple techniques. Teachers are eager for such technology to combat rising AI-assisted cheating. Despite internal support for its release, OpenAI is cautious due to potential backlash from users and the risk of false accusations. The company continues to weigh ethical considerations and technological effectiveness before deciding on the tool’s release.
Well then, do we think this tool can detect that the above text was generated by ChatGPT? If so, how?
There are a few hints we can work with. According to the article:
The anticheating tool under discussion at OpenAI would slightly change how the tokens are selected. Those changes would leave a pattern called a watermark.
That’s kind of curious phrasing: “would slightly change”. Does that mean the text isn’t being changed now? Or perhaps I’m reading too much into the verb tense.
The watermarks would be unnoticeable to the human eye but could be found with OpenAI’s detection technology.
Again, strange phrasing: “to the human eye” generally means you cannot see something; but clearly in this case you can see it, you just can’t recognize what you’re seeing.
The watermarks are 99.9% effective when enough new text is created by ChatGPT, according to the internal documents.
99.9% effective is a very precise number for a very imprecise metric. It doesn’t tell us whether the false positive rate is 0.1%, the false negative rate is 0.1%, or something else entirely.
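To make that concrete, here’s a toy illustration of how two very different failure modes can both be summarized as “99.9% effective.” The counts below are entirely made up; the article gives us nothing this specific.

```python
# Hypothetical counts, invented purely to show that "99.9% effective"
# could describe more than one metric.
ai_caught, ai_missed = 9_990, 10              # essays actually written with ChatGPT
humans_flagged, humans_cleared = 30, 29_970   # essays actually written by humans

true_positive_rate = ai_caught / (ai_caught + ai_missed)
false_positive_rate = humans_flagged / (humans_flagged + humans_cleared)

print(f"{true_positive_rate:.1%} of AI-written essays caught")              # 99.9%
print(f"{false_positive_rate:.1%} of human-written essays falsely flagged")  # 0.1%
```

Both numbers could plausibly be pitched as “99.9% effective,” and they mean very different things for students.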
What does this tell us about their technology?
The article is a bit vague, but the following are probably reasonable (though simplifying) assumptions:
They already have the ability to watermark the text. Whether they’re doing it or not in production is hard to say, but I wouldn’t assume the answer is no.
Because there’s no way to hide secret images or other invisible markers in plain text, the only thing they can work with is the actual text that’s being generated. Anything else could be defeated by a simple “Paste as unformatted text”.
When they say there’s a change in the way tokens are selected, I presume it means they’re picking less probable word choices more frequently than expected; see the toy sketch after this list for one way such a scheme can work. It is still possible, though perhaps a bit unusual, for a human to write the same sequence of words without GPT’s assistance.
Still, OpenAI cannot pick choices that are too obscure: for the watermark to be “unnoticeable to the human eye” it has to read as plausibly as un-watermarked output. They’re not going to start generating text in the style of Damon Runyon [2] by only using the present tense, avoiding contractions, and calling women “dames”.
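The article doesn’t say how OpenAI’s scheme works, but published academic watermarks for LLM text give a feel for the general idea: pseudorandomly split the vocabulary based on the preceding context, nudge the model toward one half, and have the detector count how often the text lands in that half. Here’s a toy sketch in Python; the vocabulary, bias, and scoring are all invented for illustration, and this is not a claim about OpenAI’s actual method.

```python
import hashlib
import random

# Toy sketch of a "green list" statistical watermark, one style proposed in
# the academic literature. Whether OpenAI's tool works this way is unknown.

VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "on", "under", "mat", "rug"]

def green_list(prev_token: str) -> set[str]:
    """Pseudorandomly pick half of the vocabulary, seeded by the previous
    token, so a detector that knows the scheme can recompute the same split."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    shuffled = sorted(VOCAB)
    rng.shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])

def generate_watermarked(n_tokens: int, bias: float = 0.9) -> list[str]:
    """'Generate' text by preferring green tokens with probability `bias`.
    A real model would tilt its learned token probabilities, not choose
    uniformly at random from a tiny toy vocabulary."""
    rng = random.Random()
    tokens = ["the"]
    for _ in range(n_tokens):
        greens = green_list(tokens[-1])
        pool = greens if rng.random() < bias else set(VOCAB) - greens
        tokens.append(rng.choice(sorted(pool)))
    return tokens

def green_fraction(tokens: list[str]) -> float:
    """Detector: how often does each token fall in the green list implied by
    its predecessor? Unwatermarked text should hover around 0.5."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

print(green_fraction(generate_watermarked(500)))                   # ~0.9: watermarked
print(green_fraction([random.choice(VOCAB) for _ in range(500)]))  # ~0.5: not
```

The detection is purely statistical: any single green-list word is unremarkable on its own, but a long passage that lands in the green list far more than half the time is hard to explain without the watermark.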
I assume their approach is that each time they see a watermark in a text, there’s only a low probability that the author wrote it naturally. They can then estimate the odds that the overall piece was generated by (or heavily uses) ChatGPT content by computing the joint probability that a human wrote all of the watermarked snippets in the essay, with something as simple as one minus the product of those per-snippet probabilities (the example below walks through it, and a code sketch follows).
An Example
Assume every time they see one of their watermarks in an essay, OpenAI thinks there is a 51% chance that it was generated by ChatGPT, and a 49% chance a human randomly wrote that. How many watermarks would they need to see in a document to be 99.9% certain it was generated by GPT? If they saw one, they’d only be 51% certain. For two, 76% certain, and for 10 …
…they'd hit their mark of being 99.9%+ certain. Or so you might think…
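A quick sanity check of those numbers. The function below is mine, and it bakes in the same assumptions the example does: every watermark hit is independent, and every hit carries the same 51/49 split.

```python
def prob_generated_by_gpt(human_prob_per_hit: float, num_hits: int) -> float:
    """Naive combination: P(GPT) = 1 - P(a human wrote every watermarked
    snippet), assuming the hits are independent and identically weighted."""
    return 1.0 - human_prob_per_hit ** num_hits

for n in (1, 2, 10):
    print(n, round(prob_generated_by_gpt(0.49, n), 4))
# 1 0.51
# 2 0.7599
# 10 0.9992
```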
Any problems with this?
Lots. The assumptions such a tool relies on are likely to be very problematic and end up falsely accusing a lot of people of using GPT when they haven’t.
There’s this quote from the article:
“It is more likely that the sun evaporates tomorrow than this term paper wasn’t watermarked,” said John Thickstun, a Stanford researcher who is part of a team that has developed a similar watermarking method for AI text.
OMG. That is either a misquote or breathtaking arrogance. There’s not a 0.1% chance of the sun going poof tomorrow. Not even close. Worse, 99.9% is a really shitty probability to use to start accusing people. Unfathomably shitty: even if their logic were perfect, one in a thousand essays would be mistakenly flagged as being generated by GPT. There are 17 million high school students in the USA, and if each writes just two essays a year, that’s 34 thousand falsely accused high schoolers per year. Not just falsely accused, but told that it’s certain they’re guilty. No need for further consideration.
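The back-of-the-envelope math, reading “99.9% effective” as a 0.1% false-positive rate on essays humans actually wrote (my reading, since the article doesn’t define the metric):

```python
students = 17_000_000        # rough count of US high school students
essays_per_year = 2
false_positive_rate = 0.001  # the generous reading of "99.9% effective"

# 34,000 essays falsely flagged per year
print(students * essays_per_year * false_positive_rate)
```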
But the odds of OpenAI being right are not even that good.
OpenAI will have calculated the probability of watermarked phrases appearing in the training data they have collected. But you know what isn’t in that training data? High school and college essays. They aren’t available in quantity on the web.
This means that OpenAI simply does not know how often their watermark phrases naturally occur in school essays. Not an effing clue. They are going to be just guessing and assuming the likelihood of the watermark occurring naturally in essays is the same as in all the other text they’ve seen. But kids tend to write things strangely for essays, artificially stilted and weirdly formal and, perhaps, more likely to generate a false positive than OpenAI expects. Perhaps less likely. No one knows.
Now, if they had a huge sampling of student essays they’d have reliable statistics. Hold that thought for a second...
Worse, even if people aren’t using GPT to write their essays, I expect that people will increasingly mimic GPT’s style, including its watermarking style. Why? Because (1) they will increasingly read output from GPT and start to internalize its biases, and (2) ironically, I would expect teachers to start using GPT to write their material. This means the material the students are taught from will start to have these watermark phrases in it. Should kids pick up that phraseology or style to try to mimic their teacher, their writing will start to resemble GPT output. That may mean that each instance of watermark-triggering text is not independent of the others [3], and so the probability may never get very close to 99.9% “effective”.
The simple fact is, OpenAI does not have enough data to accuse anyone of AI plagiarism. Even if they are usually right, they are not always right, and when this is done en masse it’s going to slander tens of thousands, if not hundreds of thousands, of students.
Of course, in a few years, they’ll have seen enough essays (that have been submitted for checking) that they can make more accurate predictions and learn to raise the threshold to get rid of most false positives … but only after screwing up the lives of lots of kids. And, sadly, the kids’ papers could end up being weaponized against them.
You Gotta Fight For Your Right To Essay!
So what happens if OpenAI goes through with this and students start getting accused of using GPT, many falsely?
The ones who are cheating will just switch to an alternative LLM to defeat the tests. Hello, Claude!
The first ones who are falsely accused will be crushed by the “more likely than the sun evaporating” bullshit, but eventually lawyers will realize that this is a goldmine, and teachers, schools, and OpenAI will find themselves in the courts for years. People will deeply regret believing OpenAI.
OpenAI will end up with a black eye, best case, and dead in the water, worst case.
Nobody wins. Except the lawyers. They always win.
Final Thoughts
I’m sure OpenAI believes its tool is much better than 99.9% accurate. But even if so, it will never be 100%, and that means false positives. OpenAI will disclaim its liability by stating the tool is only meant to suggest the possibility of plagiarism, not a certainty, and that teachers need to independently verify that the student didn’t write the paper. Like Tesla with Full Self-Driving, when it crashes and burns OpenAI will just say “you used it wrong,” leaving the users who relied on it (mistakenly) with all the liability. Were I a teacher being told to accuse students whose papers failed the check, I would demand indemnification from the school [4].
Let me be perfectly clear: using AI to convict people of misdeeds (even just scholastic misdeeds) is wrong, evil, and immoral. If we’re all lucky, OpenAI will keep their GPT-checker as a lab curiosity rather than let it become a loose cannon that cuts down thousands of innocents.
[1] Backup link in case the gifted one times out: https://archive.is/2024.08.04-093444/https://www.wsj.com/tech/ai/openai-tool-chatgpt-cheating-writing-135b755a
[3] If you flip a fair coin once, the odds of it coming up heads are 50%. If you flip it twice, the odds that it comes up heads both times are 25%; three times, 12.5%; and so on. But if the way you toss the coin the second and subsequent times causes it to land the same way as the first toss, then the odds of it coming up heads twice, or three times, are 50%, not 25% or 12.5%. That’s because the subsequent tosses are not independent of the first. If you think your tests are all independent but they are not, your predicted probabilities are going to be way, way off.
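A small simulation of that footnote, in case the point isn’t obvious; the “sticky” coin here is the extreme case where every toss after the first simply copies it.

```python
import random

def simulate(n_tosses: int, sticky: bool, trials: int = 100_000) -> float:
    """Fraction of trials where every toss comes up heads. With `sticky`,
    tosses 2..n just repeat the first toss (perfectly correlated)."""
    all_heads = 0
    for _ in range(trials):
        first = random.random() < 0.5
        if sticky:
            heads = [first] * n_tosses
        else:
            heads = [first] + [random.random() < 0.5 for _ in range(n_tosses - 1)]
        all_heads += all(heads)
    return all_heads / trials

print(simulate(3, sticky=False))  # ~0.125, the "independent" prediction
print(simulate(3, sticky=True))   # ~0.5, because the tosses aren't independent
```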
[4] Were I a student, I’d put copyright notices on my essays just in case. That way, if something goes wrong, there’s a bit more for my lawyer to bring to battle.