Fake AI 


Edited by Frederike Kaltheuner


Meatspace Press (2021)

Book release: 14/12/2021




This book is an intervention - 





Chapter 1


AI Snake Oil, Pseudoscience and Hype


An interview with Arvind Narayanan


The term “snake oil” originates from the United States in the mid-19th century, when Chinese immigrants working on the railroads introduced their American counterparts to a traditional treatment for arthritis and bursitis made of oil derived from the Chinese water snake. The effectiveness of the oil, which is high in omega-3 fatty acids, and its subsequent popularity prompted some profiteers to get in on a lucrative market. These unscrupulous sellers peddled quack remedies containing inferior rattlesnake oil or completely arbitrary ingredients to an unsuspecting public. By the early 20th century, “snake oil” had taken on its modern, pejorative meaning, becoming a byword for fake miracle cures, groundless claims, and brazen falsehoods.

Much of what is sold commercially as AI is snake oil, says Arvind Narayanan, Associate Professor of Computer Science at Princeton University—we have no evidence that it works, and based on our scientific understanding, we have strong reasons to believe that it couldn’t possibly work. And yet, companies continue to market AI products that claim to predict anything from crime to job performance, sexual orientation or gender. What makes the public so susceptible to these claims is the fact that in recent years there has been genuine and impressive progress in some domains of AI research. How, then, did AI become attached to so many products and services of questionable or unverifiable quality, and of slim to non-existent usefulness?

Frederike Kaltheuner spoke to Arvind Narayanan via Zoom in January 2021: Frederike from lockdown in Berlin, and Arvind from his office in Princeton.

F: Your talk, How to Recognise AI Snake Oil, went viral in 2019. What inspired you to write about AI snake oil, and were you surprised by the amount of attention your talk received?

A: Throughout the last 15 years or so of my research, one of my regular motivations for getting straight into a research topic is when there is hype in the industry around something. That’s how I first got started on privacy research. My expertise, the desire for consumer protection, and the sense that industry hype got out of control all converged in the case of AI snake oil. The AI narrative had been getting somewhat unhinged from reality for years, but the last straw was seeing how prominent these AI-based hiring companies have become. How many customers they have, and how many millions of people have been put through these demeaning video interviews where AI would supposedly figure out someone’s job suitability based on how they talked and other irrelevant factors. That’s really what triggered me to feel “I have to say something here”.

I was very surprised by its reception. In addition to the attention on Twitter, I received something like 50 invitations for papers, books… That had never happened to me before. In retrospect I think many people suspected what was happening was snake oil but didn’t feel they had the expertise or authority to say anything. People were speaking up of course, but perhaps weren’t being taken as seriously because they didn’t have the “Professor of Computer Science” title. That we still put so much stock in credentials is, I think, unfortunate. So when I stood up and said this, I was seen as someone who had the authority. People really felt it was an important counter to the hype.

F: … and it is still important to counter the hype today, especially in policy circles. Just how much of what is usually referred to as AI falls under the category of AI snake oil? And how can we recognise it?

A: Much of what is sold commercially today as “AI” is what I call “snake oil”. We have no evidence that it works, and based on our scientific understanding of the relevant domains, we have strong reasons to believe that it couldn’t possibly work. Why does this happen? My educated guess is that it’s because “AI” is a very loose umbrella term. This happens with buzzwords in the tech industry (like “blockchain”): after a point, nobody really knows what they mean. Some of these systems are not snake oil; there has been genuinely remarkable scientific progress. But because of this, companies put all kinds of systems under the AI umbrella—including ones that 20 years ago you would have more accurately called regression, or statistics, except that statistics asks rigorous questions about whether something is working and how we can quantify it. Because of the hype, people have skipped that step, and the public and policymakers have bought into it.

Surveys show that the public largely seems to believe that Artificial General Intelligence (AGI) is right around the corner—which would be a turning point in the history of human civilisation! I don’t think that’s true at all, and most experts don’t either. The idea that our current progress with AI will lead to AGI is as absurd as trying to reach the moon by building a taller and taller ladder. There are fundamental differences between what we’re building now and what it would take to build AGI. Current systems are task-specific and AGI is not, which is part of why I think it will take something fundamentally new and different to get there.

F: To build on your metaphor—if the genuinely remarkable scientific progress under the AI umbrella is a ladder, then reaching AGI would take something else entirely. So AI companies are pointing at genuine progress to make claims that require a fundamentally different kind of progress?

A: Right. There’s this massive confusion around what AI is, which companies have exploited to create hype. Point number two is that some types of applications of so-called “AI” are fundamentally dubious. One important category is predicting the future, that is, predicting social outcomes. Which kids might drop out of school? Who might be arrested for a crime in the future? Who should we hire? These outcomes are contingent on an incredible array of factors that we still have trouble quantifying—and it’s not clear that we ever will.

A few scientific studies have looked rigorously at how good we are at predicting these future social outcomes and shown that it’s barely better than random. We can’t really do much better than simple regression models with a few variables. My favourite example is the “Fragile Families Challenge” led by my Princeton colleague Professor Matt Salganik, along with collaborators around the world. Hundreds of participants used state-of-the-art machine learning techniques and a phenomenal dataset that followed “at-risk” kids for over a decade, and tried to predict (based on a child’s circumstances today) what their outcomes might be six years in the future. The negative results are very telling. No team, on any of these social outcomes, could produce predictions that were significantly better than random prediction. This is a powerful statement about why trying to predict future social outcomes is a fundamentally different type of task from those that AI has excelled at. These things don’t work well and we shouldn’t expect them to.
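To make that intuition concrete, here is a minimal, purely illustrative sketch in Python. It uses synthetic data only (not the Fragile Families data or any team’s actual pipeline) and shows how a flexible machine learning model can fail to meaningfully beat a simple regression on a few variables when the outcome is dominated by factors the data does not capture:

```python
# Illustrative only: synthetic data, not the Fragile Families Challenge.
# When the outcome is mostly unpredictable noise, a flexible model barely
# improves on a simple few-variable regression.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 2000, 200                                 # many background variables
X = rng.normal(size=(n, p))
signal = 0.3 * X[:, 0] - 0.2 * X[:, 1]           # only a little real signal
y = signal + rng.normal(scale=1.0, size=n)       # outcome dominated by noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

simple = LinearRegression().fit(X_tr[:, :4], y_tr)           # few variables
fancy = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

print("simple R^2:", r2_score(y_te, simple.predict(X_te[:, :4])))
print("fancy  R^2:", r2_score(y_te, fancy.predict(X_te)))
# Both land at a similarly modest R^2; the extra machinery buys very little.
```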

F: Which domains seem to have a lot of snake oil in them and why?

A: My educated guess is that to understand the prevalence of AI snake oil it’s better to look at the consumers/buyers than the sellers. Companies will spring up around any type of technology for which high demand exists. So why are people willing to buy certain types of snake oil? That’s interesting.

I think it’s because certain domains (like hiring) are so broken that even an elaborate random-number generator (which is what I think some of these AI tools are), is an improvement over what people are doing today. And I don’t make this statement lightly. In a domain like hiring we—culturally as well as in business—have a hard time admitting that there is not much we can do to predict who’s going to be most productive in a job. The best we can do is have some basic tests of preparation, ability and competence, and beyond that just accept that it’s essentially a lottery. I think we’re not willing to accept that so much success in life is just randomness, and in our capitalistic economy there’s this constant push for more “rationality”, whether or not that makes sense.

So the way hiring works is a) fundamentally arbitrary, because these outcomes are hard to predict, and b) biased along all the axes that we know about. What these tools promise to do is cut down on bias that is relatively easy to statistically quantify, but it’s much harder to prove that these tools are actually selecting candidates who will do better than candidates who were not selected. The companies who are buying these tools are either okay with that or don’t want to know. Look at it from their perspective: they might have a thousand applications for two positions. It’s an enormous investment of time to read those applications and interview those candidates, and it’s frustrating not to be able to make decisions based on a clear candidate ranking. And against this backdrop emerges a tool that claims to be AI and has a veneer of scientific sophistication. It says it will cut down on bias and find the best candidates in a way that is much cheaper for the company than a traditional interview and hiring process. That seems like a great deal.

F: So what you’re saying is the domains in which snake oil is more prevalent are the ones where either the market is broken or where we have a desire for certainty that maybe doesn’t exist?

A: I hesitate to provide some sort of sweeping characterisation that explains where there is a lot of snake oil. My point is more that if we look at the individual domains, there seem to be some important reasons why there are buyers in that domain. We have to look at each specific domain and see what is specifically broken there. There’s also a lot of AI snake oil that’s being sold to governments. I think what’s going on there is that there’s not enough expertise in procurement departments to really make nuanced decisions about whether this algorithmic tool can do what it claims.

F: Do you think this problem is limited to products and services that are being sold or is this also something you observe within the scientific community?

A: A lot of my thinking evolved through the “Limits to Prediction” course that I co-taught with Professor Matt Salganik, whom I mentioned earlier. We wanted to get a better scientific understanding of when prediction is even possible, and the limits of its accuracy. One of the things that stuck out for me is that there’s also a lot of misguided research and activity around prediction where we have to ask: what is even the point?

One domain is political prediction. There’s a great book by Eitan Hersh which criticises the idea of politics, and even political activism, as a sport—a horse race that turns into a hobby or entertainment. What I find really compelling about this critique is what it implies about efforts like FiveThirtyEight that involve a lot of statistics and technology for predicting the outcomes of various elections. Why? That’s the big question to me. Of course, political candidates themselves might want to know where to focus their campaigning efforts. Political scientists might want to understand what drives people to vote—those are all great. But why as members of the public…?

Let me turn this inwards. I’m one of those people who refreshes the New York Times needle and FiveThirtyEight’s predictions. Why do I participate in this way? I was forced to turn that critique on myself, and I realised it’s because uncertainty is so uncomfortable. Anything that promises to quell the terror that uncertainty produces and tell us that “there’s an 84% chance this candidate will win” just fills a huge gap in our subconscious vulnerabilities. I think this is a real problem. It’s not just FiveThirtyEight. There’s a whole field of research to figure out how to predict elections. Why? The answer is not clear at all. So, it’s not just in the commercial sphere, there’s also a lot of other misguided activity around prediction. We’ve heard a lot about how these predictions have not been very successful, but we’ve heard less about why people are doing these predictions at all.

F: Words like “pseudoscience” and “snake oil” are often thrown around to denote anything from harmful AI, to poorly-done research, to scams, essentially. But you chose your words very carefully. Why “misguided research” rather than, let’s say, “pseudoscience”?

A: I think all these terms are distinct, at least somewhat. Snake oil describes commercial products that are sold as something that’s going to solve a problem. Pseudoscience is where scientific claims are being made, but they’re based on fundamentally shaky assumptions. The classic example is, of course, a paper on supposedly predicting criminality from facial images. When I say “misguided research”, a good example is electoral prediction by political scientists. This is very, very careful research conducted by very rigorous researchers. They know their statistics, I don’t think they’re engaged in pseudoscience. By “misguided” I mean they’re not asking the question of “who is this research helping?”

F: That’s really interesting. The question you’re asking then is epistemological. Why do you think this is the case and what do you see as the problems arising from not asking these questions?

A: That’s a different kind of critique. It’s not the same level of irresponsibility as some of the harmful AI we see in academia and out in the world. Once an academic community decides something is an important research direction, people stop asking those questions. It’s frankly difficult to ask that question of every paper that you write. But sometimes an entire community starts down a path that ultimately leads nowhere and is not going to help anybody. It might even have some harmful side-effects. There’s interesting research coming out suggesting that the false confidence people get from seeing these probability scores actually depresses turnout. This might be a weird thing to say right after an election that saw record levels of turnout, but we don’t know whether even more people might have voted had it not been for this entire industry of predicting elections and splashing those predictions on the front pages. This is why misguided research is, I think, a separate critique.

F: Moving on to a different theme, I have two questions on the limits of predictability. It seems like every other year a research paper tries to predict criminality. The other one that, for me, surprisingly refuses to die is a 2017 study by two Stanford researchers on predicting homosexuality from faces. There are many, many problems with this paper, but what still fascinates me is that the conversations with policymakers and journalists often revolved around “Well, maybe we can’t predict this now, but who knows if we will be able to predict it in future?”. In your talk you said that this is an incomplete categorisation of the tasks that AI can be used to solve—and I immediately thought of predicting identity. It’s futile, but the reason why ultimately lies somewhere else. It’s more a question of who we think has the ultimate authority to define who we are. It’s an ontological question rather than one about accuracy or biology. I am curious how you would refute the claim that AI will be able to predict these things in the future, and whether you would place an inherent limit on what can be predicted?

A: If we look at the authors of the paper on predicting sexual orientation, one of their main supposed justifications for writing it is that they claim to be doing this in the interest of the gay community. Since repressive governments want to identify people’s sexuality through photos and social media in order to come after them, the authors think it’s better for this research to be out there for everybody to see, so that people can take defensive measures.

I think that argument makes sense in some domains like computer security. It absolutely does not make sense in this domain. Doing this research is exactly the kind of activity that gives a veneer of legitimacy to an oppressive government who says “Look! There’s a peer-reviewed research paper and it says that this is scientifically accurate, and so we’re doing something that’s backed by science!” Papers like this give ammunition to people who might do such things for repressive ends. The other part is that if you find a vulnerability in a computer program, it’s very easy to fix—finding the vulnerability is the hard part. It’s very different in this case. If it is true (and of course it’s very doubtful) that it’s possible to accurately infer sexual orientation from people’s images on social media, what are these authors suggesting people do to protect themselves from oppressive governments other than disappear from the internet?

F: I think that the suggestion was “accept the death of privacy as a fact and adapt to social norms” which… yeah…

A: Right. I would find the motivations for doing this research in the first place to be very questionable. Similarly, predicting gender. One of the main applications is to put a camera in the back of a taxi that can infer the rider’s gender and show targeted advertisements on the little television screen. That’s one of the main commercial applications I’m seeing. Why? You know… I think we should push back on that application in the first place. And if none of these applications make sense, we should ask why people are even working on predicting gender from facial images.

F: So you would rephrase the question and not even engage in discussions about accuracy, and just ask whether we should be doing this in the first place?

A: That’s right. I think there are several kinds of critique for questionable uses of AI. There’s the bias critique, the accuracy critique, and the questionable-application critique. I think these critiques are separate (there’s often a tendency to confuse them), and what I tried to do in the AI Snake Oil talk is focus on one particular critique, the critique of accuracy. But that’s not necessarily the most relevant critique in all cases.

F: Let’s talk about AI and the current state of the world. I was moderately optimistic that there was less AI solutionism in response to Covid-19 than I feared. Could this be a positive indicator that the debate has matured in the past two years?

A: It’s hard to tell, but that’s a great question. It’s true that companies didn’t immediately start blowing the AI horn when Covid-19 happened, and that is good news. But it’s hard to tell if that’s because they just didn’t see enough commercial opportunity there or because the debate has in fact matured.

F: There could be various explanations for that…

A: Yeah. There is a lot of snake oil and misguided AI in the medical domain. You see a lot of cases where machine learning was tested on what is called a “retrospective test”: you collect data from a clinical setting, develop your algorithm on that data, and then just test the algorithm on a different portion of the same data. That is a very misleading type of test, because the data might have been collected from one hospital, and when you test the tool at a different hospital in a different region—with different cultural assumptions, different demographics—where the patterns are different, it totally fails. We have papers that look at what happens if you test these retrospectively developed tools in a prospective clinical setting: there’s a massive gap in accuracies. We know there’s a lot of this going on in medical machine learning, but whether the relative dearth of snake oil AI for Covid-19 is due to the debate maturing or some other factor, who can tell.
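For a sense of how that gap arises, here is a minimal, purely illustrative sketch with synthetic data and invented “hospitals” (no relation to any real clinical study): a model that looks accurate on a held-out slice of its own training site can do markedly worse on data from a site with a different patient mix.

```python
# Illustrative only: synthetic "hospital" data.
# A retrospective evaluation (held-out slice of the same site) can look far
# better than evaluation on a different site whose population has shifted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def make_site(n, shift):
    # Two crude "biomarkers"; the label depends on them plus a
    # site-specific offset that the features do not capture.
    X = rng.normal(loc=shift, size=(n, 2))
    logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.0 * shift
    y = (logits + rng.normal(size=n) > 0).astype(int)
    return X, y

X_a, y_a = make_site(4000, shift=0.0)   # hospital A: where the model is built
X_b, y_b = make_site(4000, shift=1.0)   # hospital B: different population

X_tr, X_te, y_tr, y_te = train_test_split(X_a, y_a, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

print("retrospective (held-out slice of hospital A):", model.score(X_te, y_te))
print("external site (hospital B):", model.score(X_b, y_b))
```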

F: One thing I was wondering… do you feel like you’ve made an impact?

A: (laughs)

F: As in, are you seeing less snake oil now than you did, say two years ago?

A: That’s hard to know. I think there is certainly more awareness among the people who’ve been doing critical AI work. I’m seeing less evidence that awareness is coming through in journalism, although I’m optimistic that that will change. I have a couple of wish-list items for journalists, who often unwittingly provide cover for overhyped claims. One is: please stop attributing agency to AI. I don’t understand why journalists do this (presumably it drives clicks?) but it’s such a blatantly irresponsible thing to do. Headlines like “AI discovered how to cure a type of cancer”. Of course it’s never AI that did this. It’s researchers, very hardworking researchers, who use machine learning like any other tool. When journalists attribute agency to AI it’s both demeaning to the researchers who did the work and creates massive confusion among the public. There’s no reason to do that, especially in headlines.

And number two is that it’s virtually never meaningful to provide an accuracy number, like “AI used to predict earthquakes is 93% accurate”. I see that all the time. It never makes sense in a headline, and most of the time it doesn’t make sense even in the body of the article. Here’s why: I can take any classifier and make it have just about any accuracy I want by changing the data distribution on which I do the test. I can give it arbitrarily easy instances to classify, or arbitrarily hard instances to classify. That choice is completely up to the researcher or the company that’s doing the test. In most cases there’s no agreed-upon standard, so unless you’re reporting accuracy on a widely used, agreed-upon benchmark dataset (which is virtually never the case; it’s usually the company deciding on its own how to do the test), it never makes sense to report an accuracy number like that without a lengthy explanation and many, many caveats. So don’t provide these oversimplified headline accuracy numbers. Provide the caveats and give qualitative descriptions of accuracy. What does this number mean? What are the implications if you were to deploy this in a commercial application? How often would there be false positives? Those are the kinds of things policymakers should know, not these oversimplified accuracy numbers.
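As a purely illustrative sketch of that point (synthetic data, not any real product evaluation), the snippet below trains one classifier and then reports its accuracy on an easy slice of the test set, a hard slice, and the full set; the headline number depends entirely on which slice you choose to report.

```python
# Illustrative only: the same trained classifier can be reported at very
# different "accuracies" depending on which test instances are chosen.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=10, flip_y=0.05,
                           class_sep=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# Split the test set by distance from the decision boundary.
margin = np.abs(clf.decision_function(X_te))
easy = margin > np.quantile(margin, 0.7)   # confident, far-from-boundary cases
hard = margin < np.quantile(margin, 0.3)   # ambiguous, near-boundary cases

print("accuracy on easy test instances:", clf.score(X_te[easy], y_te[easy]))
print("accuracy on hard test instances:", clf.score(X_te[hard], y_te[hard]))
print("accuracy on the full test set:  ", clf.score(X_te, y_te))
```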


Next: Chapter 2


by Abeba Birhane




