A plausible, scalable and slightly wrong black box: why large language models are a fascist technology that cannot be redeemed

Riker: So they could have done this review a dozen times already?

La Forge: A dozen? A hundred? It's impossible to tell.

When large language models (LLMs) get something factually wrong or make something ridiculous up, everyone makes fun of it online. Gemini told everyone to put glue on their pizza! (Hilarious!) A corporate chatbot invented a company policy that doesn’t exist! (Uh oh!) There’s gotta be about a million examples of LLMs spouting out nonsense that makes them look silly. Detractors use these as a “gotcha” for why LLMs aren’t ready for real-world use, and boosters defend them by saying that LLMs will only get better.

LLM “hallucinations,” more appropriately known by the technical philosophical term “bullshit” (text intended to persuade without regard for truth), are a well-known problem. LLMs bullshit or “hallucinate” because they do not actually build an internal model of the problem being solved, and they cannot reason toward a solution. LLMs are just statistical models that predict the next set of words to follow a prompt. They have been (fairly accurately) described as “spicy autocomplete” or “a fuzzy jpeg of the internet.” They work the same way that your phone has, for years, been able to guess that if you type “How is it,” the next word might be “going?”, just more so.
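
To make the “spicy autocomplete” description concrete, here is a deliberately toy sketch (my illustration, not anything from a real LLM codebase) of next-word prediction built from word-pair counts. Real LLMs are neural networks over billions of parameters rather than count tables, but the objective is the same in spirit: produce a plausible continuation, with no representation of whether it is true.

```python
# A toy "autocomplete" built from bigram counts over a tiny, made-up corpus.
# This illustrates the next-word-prediction objective, not how production
# LLMs are actually implemented.
from collections import Counter, defaultdict

corpus = "how is it going . how is it going . how is it possible . how is the weather ."
words = corpus.split()

# Count which word follows which.
bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen in the corpus -- the most
    *plausible* next word, with no concept of a *correct* one."""
    continuations = bigrams.get(word)
    return continuations.most_common(1)[0][0] if continuations else "?"

print(predict_next("it"))  # -> "going", because that was the commonest continuation
```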

Because of this basic underlying architecture, LLMs are optimized for plausibility, just like the autocomplete on your phone. They are not optimized for truth, or to connect to anything in reality. An LLM is trained only to produce the words most likely to follow from the previous ones, based on texts scraped from the internet. This is why we get bullshit/hallucinations, and there is no way to ever stop LLMs from producing them without completely scrapping the LLM project and rebuilding an artificial intelligence chatbot on a fundamentally different foundation.

This foundation also makes LLMs into a “black box”: the statistical model that produces the text is so complicated, containing so many variables, that there is no way to definitively explain how it generated any given answer to any given prompt. If you wrote a regular expression to pull all the numbers out of a text, you could look at the regular expression afterward and find out that it missed an instance of “three” because it was only looking for numerals (i.e. “3”) and not words. If you asked an LLM to pull all the numbers out of a text and it missed one, there is no way to ever know why, and even the “explanations” that newer generations of the models give are not real explanations; they are just more plausible-but-wrong text generated by the model.
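
For contrast, here is roughly what the auditable version of that example looks like (an illustrative snippet of my own, not code from any particular tool): the miss is visible in the pattern itself.

```python
import re

text = "We enrolled 120 patients across three sites over 2 years."

# The pattern only matches runs of digits, so it misses "three" -- and you
# can see exactly why by reading the pattern.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['120', '2']

# Ask an LLM to "extract all the numbers" and it may miss one too, but there
# is no pattern to inspect afterwards -- only the model's output.
```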

LLM boosters promise, of course, that the bullshit will somehow be fixed in the future. To date, feeding more data into the models has sometimes made the problem better and sometimes made it worse; there has been no clear trend of subsequent iterations of LLMs improving on this front. Boosters will always tell you, “This is the worst that LLMs will ever be; the technology will only get better in the future.” What they don’t specify is: better by what metric, better at what, and better for whom? As the decline of Google Search usability over the last decade has taught us, technology sometimes gets worse because that is more profitable for those who control it.

In what follows, I will argue that being plausible but slightly wrong and un-auditable, at scale, is the killer feature of LLMs, not a bug that will ever be meaningfully addressed, and that this combination of properties makes them an essentially fascist technology. By “fascist” in this context, I mean that the technology is well suited to centralizing authority, eliminating checks on that authority, and advancing an anti-science agenda. I will use the example case of medical systematic reviews to illustrate how it will be used to advance a fascist agenda, and gesture toward a few other likely areas of fascist application. I will conclude by arguing that LLMs can’t be “used for good,” accepted or even regulated, but must be resisted and rejected wholesale.

What LLM boosters and detractors both mostly miss is that, for certain use cases, a black box that returns a slightly wrong but very plausible answer is a much better offering than perfect accuracy. This is because there’s only one way to be perfectly accurate (providing the correct answer) but a million ways to be slightly off (providing an answer that misses the mark but is still mostly defensible). To paraphrase Tolstoy, “Accurate data retrieval is all alike; every LLM response is inaccurate in its own way.” And because LLM prompts can be repeated at industrial scale, an unscrupulous user can cherry-pick the plausible-but-slightly-wrong answers that favour their own agenda.

It’s the scaling up of LLMs that makes their plausible, black-boxed incorrectness so useful. If an LLM returns different, slightly incorrect answers depending on how a prompt is fine-tuned, then you can decide beforehand what answer you want from the aggregate analysis of a large corpus of data, and have the LLM analyze it over and over until it gives you the answer you want. Because the model is a black box, no one can be expected to explain exactly where the answer came from, and because it can be applied at scale, there is no realistic possibility of auditing it externally.
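
As a purely hypothetical sketch of that loop (the function names, prompts and canned outputs below are all invented stand-ins, not a real model API), this is all the “methodology” it takes:

```python
# Hypothetical sketch of "rerun the analysis until it says what I want".
# run_llm_analysis() stands in for a real model call over the whole corpus;
# its canned logic below only mimics the fact that outputs shift with wording.

def run_llm_analysis(prompt: str, corpus: list[str]) -> str:
    if "excluding small studies" in prompt:
        return "Drug A is clearly superior to drug B."
    return "The evidence comparing drug A and drug B is mixed."

def favours_my_agenda(answer: str) -> bool:
    return "clearly superior" in answer

corpus = ["study 1 ...", "study 2 ...", "study 3 ..."]
prompt_variants = [
    "Synthesize the evidence on drug A versus drug B.",
    "Synthesize the evidence on drug A versus drug B, focusing on recent trials.",
    "Synthesize the evidence on drug A versus drug B, excluding small studies.",
    # ... in practice, hundreds of automatically generated rewordings
]

for prompt in prompt_variants:
    answer = run_llm_analysis(prompt, corpus)
    if favours_my_agenda(answer):
        print(f"Report this run ({prompt!r}); quietly discard the rest.")
        break
```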

To illustrate this, I will use the example of a systematic review in the medical literature (my area of expertise), although there are many other areas where this strategy could be used. In insurance reimbursement, for example, a company could decide the exact dollar amount it wants to pay out, generate LLM responses to thousands of claims, and fine-tune the prompts until the justifications produced by the LLM add up, in aggregate, to the amount of money it wishes to pay.

LLMs are the perfect technology for manipulating the medical literature to say nearly anything you want via systematic review methods

Systematic reviews are an important part of the medical evidence hierarchy, sitting even above randomized clinical trials in their level of authority. For many medical questions that have been studied, there are multiple published clinical trials or other forms of evidence that provide slightly different, or even conflicting, answers. This is not necessarily because the methods used were flawed, but because human biology is complicated, and the answers to questions like “does drug A work better than drug B in population Y for condition Z?” are probabilistic ones like “it works 50% better on this metric, 75% of the time,” not categorical answers like “yes” or “no.”

Systematic review methodology is meant to provide a broad overview of the medical literature on a specific subject, excluding low-quality evidence and statistically aggregating the more trustworthy evidence into an even more accurate and trustworthy estimate. Reviews are “systematic” in the sense that they are meant to include all the evidence produced to date on the question at hand. This is typically done by first searching several medical databases to identify potential evidence sources, followed by human screening against inclusion criteria applied first to the title and abstract, and then to the full text. This is a work-intensive process, because prior to the advent of LLMs, selecting evidence at this step required human judgement.

LLMs can be deployed here to automate the screening of medical journal articles for inclusion in a systematic review, drastically reducing the human work required. This is a bad thing. Because the screening can be automated, and because LLM outputs are always slightly inaccurate, un-auditable and scalable, the process can be easily manipulated to return an answer of the reviewer’s choosing, and this intentionally introduced bias can be difficult or impossible to discern from the end result. Automation allows an unscrupulous reviewer to try an arbitrary number of LLM prompts for the screening criteria, repeating the screening until the set of included articles contains only the articles that the reviewer wants. This can be fine-tuned to the point where the bias is subtle, even when presented alongside the original LLM prompts.
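
Below is a purely hypothetical sketch of that screening manipulation. `llm_screen` stands in for a real model call (faked here with a trivial keyword rule so the snippet runs), and the article names, abstracts and criteria are all invented.

```python
# Hypothetical sketch: vary the "inclusion criteria" prompt until the LLM's
# screening decisions reproduce a predetermined set of articles.

def llm_screen(criteria: str, abstract: str) -> bool:
    """Stand-in for asking an LLM 'does this article meet the criteria?';
    faked with a keyword check purely so the sketch is runnable."""
    return all(word in abstract.lower() for word in criteria.lower().split())

candidates = {
    "Trial A 2019": "randomized adults drug-x favourable",
    "Trial B 2020": "randomized elderly drug-x unfavourable",
    "Trial C 2021": "randomized adults drug-x favourable",
}
target = {"Trial A 2019", "Trial C 2021"}  # the set the unscrupulous reviewer wants

criteria_variants = [
    "randomized elderly adults",
    "randomized adults",
    # ... in practice, many subtly different rewordings of the criteria
]

for criteria in criteria_variants:
    included = {name for name, abstract in candidates.items() if llm_screen(criteria, abstract)}
    if included == target:
        print(f"Publish these 'objective' inclusion criteria: {criteria!r}")
        break
```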

Similarly, LLMs can be deployed to extract data from medical journal articles. Because LLMs produce plausible answers that are slightly wrong (you could even measure and “validate” how well they perform against a “gold standard” of human data extractors), they can be gamed to produce nearly any outcome in the aggregate, in a manner that is very difficult or impossible to catch after the fact.
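
To see why per-item “validation” does not protect the aggregate, here is a toy simulation with invented numbers: two extractors make errors of the same magnitude, but one’s errors all lean the same way, and only the pooled estimate reveals the difference.

```python
# Toy simulation (invented numbers): small per-study extraction errors that
# would pass a head-to-head "validation" against human extractors can still
# shift the pooled estimate if they consistently lean one way.
import random

random.seed(0)
true_effects = [random.gauss(0.0, 0.2) for _ in range(200)]  # 200 studies, true mean near 0

honest = [x + random.gauss(0.0, 0.05) for x in true_effects]       # small, unbiased errors
gamed = [x + abs(random.gauss(0.0, 0.05)) for x in true_effects]   # same-sized errors, always favourable

def pooled(effects):
    """Naive unweighted pooling, purely for illustration."""
    return sum(effects) / len(effects)

print(f"true pooled effect: {pooled(true_effects):+.3f}")
print(f"honest extraction:  {pooled(honest):+.3f}")   # stays close to the truth
print(f"gamed extraction:   {pooled(gamed):+.3f}")    # drifts upward, one tiny 'error' at a time
```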

Couldn’t this be happening to the systematic review literature already, even without LLMs?

To be sure, an unscrupulous researcher can put their thumb on the scale at quite a number of points in the process of a systematic review, even without the use of LLMs. This happens, deliberately or accidentally, all the time. As someone who has published several systematic reviews and is often asked to peer-review this type of research, I am very cognizant of the ways researchers might be tempted to compromise their research integrity in order to get the answer they like.

That said, LLMs present a new challenge because they make it possible to perform many different fine-tuned iterations of a systematic review in a manner that can’t possibly be audited externally: the LLM is a black box, it can be scaled to the point where double-checking is impractical, and the whole thing can be done by a single person without any scrutiny from other researchers. Without an LLM, if a researcher wanted to redo data extraction while making fine adjustments to the inclusion criteria or the data extraction protocol, and the body of evidence was large enough, it would take a team of researchers a considerable amount of time to accomplish the task even once. Being asked to repeat it over and over with minor variations to the codebook would raise suspicions, and likely some push-back, from a team of humans asked to do so. The cooperation required to accomplish large data extraction tasks without an LLM implied some level of accountability: even if a researcher was willing to commit this kind of research fraud and had the resources to do so, someone else involved was likely to put on the brakes.

This brings us to why this technology isn’t just potential research fraud waiting to happen (although it is that too, and who are we kidding, it has definitely been used for research fraud already), but an essentially fascist tool: from the example of systematic review manipulation, it is easy to see how it centralizes control over medical evidence synthesis by eliminating a large proportion of the people involved, and with them their ability to check the agenda of an unscrupulous central authority.

This technology lends itself especially well to anti-science projects like the anti-vaccine movement, which could use it to produce distorted syntheses of the medical literature to legitimize the movement. I will not be surprised when it is used to legitimize scientific racism and anti-queer hate. And while I have focused on the dangers to medical evidence synthesis, the same technique can be applied in other industries. An insurance company, for example, can decide what level of payouts it wishes to have, and then adjust its justifications for claims decisions at scale until it reaches that level, regardless of the underlying validity of the claims themselves.

Let the police or the army use this technology, and you can use your imagination on where they would go with it.

What about “responsible” LLM use?

“Using AI responsibly” certainly has the aesthetics of being a “reasonable middle ground,” away from “extreme” positions like banning, boycotting or abstaining from use. However, where fascism is concerned, being moderate toward it is not a virtue.

I’m not going to say that every person who has used an LLM for any reason is a fascist, of course. There are ways that a reviewer can try to safeguard their own LLM use against the kinds of abuses I have described above. A researcher might attempt to thoroughly test the accuracy of an LLM at a data extraction task before employing it (good luck, though; the black-box nature of these models makes this a fraught enterprise). A researcher attempting to use LLMs in good faith might also pre-register their study so that they can’t alter their prompts later and cherry-pick the result. Good for them!

Unfortunately, even if you as a researcher do everything you can to use AI “responsibly,” there is no way for anyone else to distinguish your work from the irresponsible uses of AI. If you pre-registered a very detailed protocol for your systematic review before you did the work, there is no way for anyone else to know whether you already did your study before the pre-registration, except your own good word as a researcher. That’s the thing about fascist technologies—they are designed to remove accountability and centralize authority.

This vitiates the whole point of doing the study in the first place. If it all comes down to “I didn’t cheat, trust me,” and there’s literally no way for anyone else to double-check, then I don’t know what this is, but it sure isn’t science anymore.

What won’t help

1. First off, if you wish to do science in good faith, you absolutely cannot embrace LLMs for use in your own research.

“But LLMs are here to stay, we better get used to them!” says the person who’s not on OpenAI’s payroll but inexplicably wants to do their PR work for them.

Technologies are discarded, rejected or superseded all the time, even after they are touted as being “inevitable” or “here to stay so you better get used to it.” (Remember how cloning was “inevitable”? Remember how we all had to “just get used to NFTs because they’re not going anywhere?” Remember how the Metaverse was “here to stay?”)

If you do embrace LLMs, congrats, your work is now indistinguishable from all the grifters and fascists.

2. Expecting bad-faith, mass-produced and then cherry-picked systematic reviews to be debunked after they are published is a losing proposition. The naive response, that the answer to bad speech is good speech, doesn’t fly here, because we’re not just answering some instances of bad speech; we’re answering a machine that produces new bad speech on an industrial scale. Not just that, but we have to take into account Brandolini’s Law, the “bullshit asymmetry principle”: the amount of energy needed to refute bullshit is an order of magnitude greater than the energy needed to produce it. Further, as we learned from Wakefield et al. (1998), even if an incorrect medical idea is completely discredited, the paper is retracted, and the author is struck off the medical register for misconduct, the damage may already be permanently done.

3. A requirement from academic journals for pre-registration of research done with LLMs would be an ineffectual half-measure, neither adhered to by researchers nor enforced by journals, if the trends from clinical trial pre-registration continue. It is just so easy to “cheat,” and journal editors have a tendency to bend rules like these if there is any wiggle room at all, especially if the journal article has an exciting story to tell.

4. There is absolutely no way we can expect peer review to catch this sort of fraud. I have peer-reviewed many systematic reviews, and it is like pulling teeth to get anyone to pay attention to matters of basic research integrity. Ask a journal editor to insist that the data and analysis code for a study be made available, and watch the paper get accepted without them anyway.

What will help

1. Stop using LLMs in your own research completely. Doing so makes your work fundamentally untrustworthy, for the reasons I have outlined above.

2. Whenever you hear a colleague tout some brand-new study of the type I have described above, accomplished using an LLM, ask them about the kind of research fraud that is possible, and in fact very easy, as I have outlined here. Ask if they can provide any reason why anyone should believe they didn’t do exactly that kind of fraud. If this seems too adversarial, keep in mind that asking these questions is precisely your job as an academic, and that actual fraudsters, racists and anti-queer activists will, and sometimes do, hijack science for their own ends when no one asks the tough questions.

3. Recommend rejection for research accomplished with an LLM if you are asked to peer-review it, or, if this is too work-intensive, decline to review any research accomplished with an LLM, on ethical grounds.

4. Under no circumstances should you include money for LLM use in your grant budgets.

5. If you are in a position of authority, such as being a journal editor, you need to use all the authority you have to draw a hard line on LLM use.

There is no moderate or responsible way to use LLMs. They need to be rejected wholesale.

I still think LLMs are cool and want to use them in my research

If you are unconvinced by the above argument, there are many other reasons why you might still want to reject LLM use entirely. I won’t go into these in detail in this article, but:

1. LLM use makes you complicit in de facto racialized torture of the Kenyan workers who prepare the texts that are used as training data.

2. From hardware manufacturing to hyperscale data centre construction to the training process for LLMs, there is a massive negative environmental impact to LLM use.

3. LLMs and other forms of generative AI depend on training data that has, in many cases, been taken without consent or compensation from artists or other workers in an act that has been described as enclosure of the digital commons.

4. LLM use deskills you as an academic.

5. You will be left holding the bag when the LLM economic bubble bursts. The costs of producing and maintaining these models are not sustainable, and eventually the speculative funding will run out. When the bubble bursts, you will have built your career on methods that no longer exist, and you will have put results into the literature that are completely non-reproducible.

