This book is an intervention.
Chapter 4
Who am I as data?
By Frederike Kaltheuner
“Resisting and protecting people from AI pseudoscience is about far more than making AI accountable, explainable, more transparent, and less biased. It is about defending and vigorously protecting what it means to be a person. It is about resisting ontological reduction. It is about allowing space for identities to be challenged and contested, not cemented by inscrutable automated systems.”
Who am I as data? The question has haunted me for years. It’s a ghoulish curiosity; wanting to know not just how much data companies routinely harvest about my behaviour, but also what this data might reveal about me, and who I appear to be.
It also stems from my fascination with a paradoxical societal shift. We are living at a time when the terms we use to describe identities are becoming more fluid, and boundaries more negotiable. While our language falls short of fully representing our social realities, it reflects the fundamental changes that are afoot—providing an echo of what is already the case. At the same time, another current is pulling in the opposite direction. Out of sight, we are increasingly surrounded by data-driven systems that automatically affix names to us and assign us an identity. From the gender binary that’s encoded into targeted online advertising, to facial recognition systems that claim to predict people’s gender or ethnicity, these systems are often completely inadequate to our social realities. Yet simply by existing, they classify and thereby mould the world around us.
In 2018, I asked an advertising company for all of my data.1 Quantcast, an AI company that is known for its cookie consent notices, tracks users across millions of websites and apps to create detailed profiles of people’s browsing histories, presumed identity, and predicted interests.
Staring into the abyss of one’s tracking data is uncanny. On the one hand, there were pages and pages of my own browsing history—not just top-level domains, but URLs that revealed exactly what I had read and clicked on, which restaurants I’d booked, and which words I’d translated into German. On the other, there were the predictions made about me: confidence intervals for my predicted gender, age, and income. Had I done the same experiment outside the European Union, where this kind of ethnicity profiling is unlawful, the data would also have included an ethnicity score.
Was this who I am? I expected to feel somewhat violated, but not so misunderstood. These rows and rows of data revealed a lot about me, my interests, and how I spend my time (every click is timestamped and carries other metadata) but the picture remained incomplete, shallow. Eerily accurate inferences were combined with seemingly random consumer categories: bagel shop frequenter, alcohol consumption at home.
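To make the shape of such a profile concrete, here is a minimal, purely illustrative sketch in Python of what one record of this kind might look like. Every field name and value is invented for illustration; it does not reproduce Quantcast’s actual data format.

```python
from datetime import datetime, timezone

# Hypothetical example of a behavioural profile record.
# Field names and values are invented for illustration; they do not
# reproduce any real company's data format.
profile = {
    "user_id": "a3f9c2e1",  # pseudonymous identifier set by a tracking cookie
    "clicks": [
        {
            "url": "https://example.com/restaurants/berlin/booking",
            "timestamp": datetime(2018, 3, 14, 19, 2, 11, tzinfo=timezone.utc),
            "referrer": "https://www.google.com/",
            "device": "mobile",
        },
        {
            "url": "https://example.com/dictionary/translate?word=uncanny",
            "timestamp": datetime(2018, 3, 15, 8, 47, 3, tzinfo=timezone.utc),
            "referrer": None,
            "device": "desktop",
        },
    ],
    # Inferred attributes, each with a model confidence score:
    # guesses presented as probabilities, not facts.
    "inferences": {
        "gender": {"label": "female", "confidence": 0.81},
        "age_range": {"label": "25-34", "confidence": 0.64},
        "income_band": {"label": "medium-high", "confidence": 0.58},
    },
    # Arbitrary consumer segments attached to the profile.
    "segments": ["bagel_shop_frequenter", "alcohol_consumption_at_home"],
}

print(profile["inferences"]["gender"])
```

Even in this toy form, the structure tells the story: the clicks are facts, timestamped and exact, while the person attached to them is a stack of guesses with confidence scores.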
AI-driven profiling for advertising may sound banal, yet it is the most profitable and prevalent use of AI to classify, infer, and detect people’s identity—be it gender, ethnicity or sexual orientation. These AI systems operate in the background, in ways that are fundamentally beyond people’s knowledge or control. Those who are classified and assessed by them frequently don’t know where, when or how these systems are used. These systems are deeply Orwellian when they happen to get it right, and Kafkaesque when they do not.
Just to be crystal clear: it is impossible to detect someone’s gender, ethnicity, or sexual orientation using AI techniques, and any attempt to do so falls squarely in the domain of AI pseudoscience. That is not because AI isn’t good enough yet, or because we need more and better data. It’s because identity isn’t something that can be straightforwardly detected, like the colour or shape of an object.
Identity is a fallacy, argues Kwame Anthony Appiah in The Lies That Bind: Rethinking Identity.2 The fallacy lies in assuming that there is some deep similarity that binds together people of a particular identity—ultimately, no such similarity exists—and yet identities matter deeply to people, precisely because belonging to something larger than oneself is a key human need. To acknowledge the ways identity is at once true and a lie is to reveal the tension at its heart. The negotiations between how we identify ourselves and how others identify us, between what we are identifying with and what we’re being identified as, are perpetual.
AI-driven identity prediction attempts to dissolve this tension by reducing complex questions of identity to what is automatically detectable from the outside, according to pre-conceived notions of what an identity is. At best, AI systems can offer a very incomplete guess of someone’s presumed identity—the data-driven and automated equivalent of the stereotyping that people regularly do to passing strangers in the street. At worst, AI systems that claim to detect identity give racism, transphobia and sexism a veneer of scientific advancement and objectivity. This has devastating consequences for those who are either targeted or rendered invisible by these systems. As Sasha Costanza-Chock describes in Design Justice, the gender binary that is encoded into so much infrastructure, such as airport security systems, amounts to ontological reduction: “As a nonbinary trans femme, I present a problem not easily resolved by the algorithm of the security protocol.”3
Broadly speaking, there are two ways in which AI is used commercially to assess, detect or predict identity. Let’s call them the pragmatic and the essentialist approaches. The former predominates in online advertising. Ultimately, it doesn’t matter to Quantcast, Google or Facebook whether I truly am a woman or a man (or identify as trans, an option that is often not available). What matters is whether my purchasing decisions and online behaviour resemble that of other people who are also classified as “women”. In other words, “woman” is not really a statement about identity, as John Cheney-Lippold writes in We Are Data.4 It is an arbitrary label that is applied to an arbitrary group of people who share an arbitrary set of online behaviours. What determines the label and the group is not whether it is true, but whether grouping people together increases whatever the advertising system is optimised for: clicks, time on site, purchases.
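To see the mechanism, consider the deliberately simplified sketch below. It uses synthetic data and off-the-shelf k-means clustering rather than any real ad platform’s system: users are grouped purely by behavioural similarity, and a human-readable segment name is bolted on afterwards. Nothing in the pipeline checks whether the label is true of anyone in the group.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical behavioural features per user: visit counts for a handful of
# site categories (news, fashion, sports, cooking). Entirely synthetic data.
rng = np.random.default_rng(0)
behaviour = rng.poisson(lam=[3, 5, 1, 4], size=(1000, 4))

# Group users purely by behavioural similarity; no identity is involved.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(behaviour)

# A human-readable label is attached afterwards. Nothing checks whether the
# label is *true* of the people in the cluster; it survives only if targeting
# this group lifts whatever metric the system optimises (clicks, purchases).
segment_labels = {0: "women_25_34", 1: "sports_enthusiasts", 2: "home_cooks",
                  3: "frequent_travellers", 4: "other"}

for cluster_id, label in segment_labels.items():
    print(label, "->", int((clusters == cluster_id).sum()), "users")
```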
By contrast, the essentialist approach positions automated predictions about identity as truth claims. This is where pseudoscience comes in. The idea behind security cameras that claim to detect Uyghur ethnicity is that they really can spot people of Uyghur descent in a crowd. When a widely criticised and now-infamous Stanford study claimed that AI can detect whether someone is gay or straight based only on a photo, the underlying premise was that sexual orientation is really, somehow, visually inscribed in a face. And when immigration authorities use voice recognition to detect accents in order to determine nationality, the unspoken assumption is that nationality, which is primarily a matter of legal jurisdiction, is expressed in and determined by an accent. These systems are designed to detect deviations from the mean, and in doing so they inevitably define what counts as ‘normal’ and ‘not-normal’, where anything that isn’t ‘normal’ becomes automatically suspicious.
Both approaches are problematic. The pragmatic (or rather, “just good enough”) approach to profiling may still lead to discrimination, because predictions can be both eerily accurate and wholly inaccurate or inadequate at the same time. Whether accurate or not, inferring someone’s ethnicity or gender without their knowledge or consent still means this “information” can be used in ways which discriminate or exclude. Inferences that are produced for contexts which make do with a pragmatic approach—for example, in advertising—may still end up in a context where accuracy and truth matter. It’s one thing to target an ad based on someone’s likely interests, whereabouts and ethnicity; it’s something entirely different to use the same data for immigration enforcement.
What’s so deeply troubling about the essentialist approach to AI identity prediction, by contrast, is not just how it operates in practice. These predictions come with significant margins of error, completely erasing those who don’t fit into whatever arbitrary classifications the designers of the system have chosen, and their harms disproportionately affect communities that are already marginalised. Equally disturbing is the insidious idea that automated classification systems can, and should, be the ultimate authority over who we are understood to be. Decisions of this magnitude can affect, curtail and even eliminate the ability to enjoy rights as an individual, a citizen, or a member of any other group. Yet these systems are intended to be, and are often treated as, more reliable, more objective and more trustworthy than the statements made by those they are assessing.
This is as troubling as it is absurd, for the essentialist approach ignores the fact that the underlying and unspoken assumptions, norms, and values behind every identity prediction system are categories and modes of classification that were chosen by the designers of those very systems. Take, for instance, the gay faces study mentioned earlier. The entire model was based on the assumption that people are either gay or straight, male or female. Such binary labels fail to capture the lived experience of a vast number of queer people—not to mention that the study only included white people.
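The same point can be made in a few lines of code. The toy classifier below is a hypothetical sketch, not the Stanford model: its label set is hard-coded at design time, so whatever input it is given, it must return one of the pre-chosen categories, with a confidence attached, because declining to answer is simply not in its output space.

```python
import numpy as np

# The label set is a design decision made before any data is seen.
LABELS = ["category_a", "category_b"]  # e.g. a binary chosen by the designers

def softmax(logits: np.ndarray) -> np.ndarray:
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def classify(image_features: np.ndarray, weights: np.ndarray) -> str:
    """Toy linear classifier: whatever the input, the output is always
    one of the pre-chosen labels, with some confidence attached."""
    probabilities = softmax(weights @ image_features)
    return LABELS[int(np.argmax(probabilities))]

# Random 'image features' and random weights: the classifier still returns
# an answer, because refusing to answer is not in its output space.
rng = np.random.default_rng(1)
print(classify(rng.normal(size=128), rng.normal(size=(2, 128))))
```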
Classifications have consequences and produce real-world effects. In the words of Geoffrey Bowker and Susan Leigh Star, “For any individual, group or situation, classifications and standards give advantage or they give suffering.”5 The idea that an automated system can detect identity risks transforming inherently political decisions—such as opting to use the gender binary—into a hard, yet invisible, infrastructure. This is what happens whenever any form of attribute recognition is built into computer vision systems such as facial recognition cameras.
AI pseudoscience does not happen in a vacuum. It is part of a much broader revival of certain ways of thinking. Amid the return of race science and the rise of DNA tests that designate people’s ancestry as “35 percent German” or “76 percent Finnish”, the use of AI to predict and detect identity needs to be seen as part of a wider revival of (biological) essentialism and determinism.6 Many companies that offer genetic predictions, for instance, sell much more than DNA tests.7 They are also benefiting from—and ultimately spreading—the dangerous, yet incredibly compelling, idea that who we are is determined by biology. There are now DNA companies that claim genetics can predict people’s ideal lifestyle, their intelligence, their personality and even their perfect romantic partner. A German start-up makes muesli based on people’s DNA, house-sharing platform SpareRoom trialled genetically matched roommates, and the music streaming service Spotify offers a playlist tailored to your DNA.
At the core of this return to biological determinism lies the idea that both people and categories are fixed, unchangeable and therefore predictable. The dark shadow this casts is the belief that some categories of identity, and by extension some people’s lives, are superior to others.
AI merely gives outmoded ways of thinking the veneer of objectivity and futurism.
In reality, categories like race, gender, and sexual orientation evolve over time. They remain subject to contestation. The idea of a criminal face is absurd, not least because our ideas of criminality are constantly changing. Does our face change whenever laws change, or when we move to a new country? What are DNA companies referring to when saying that someone is German? The Federal Republic of Germany? The Holy Roman Empire of the German Nation that existed from 1512 to 1806? Nazi Germany? The very idea of the nation state is a modern concept. And is someone who recently immigrated to Germany and has a German passport not German?
On the individual level, we are often different things to different people. We reveal different parts of ourselves in different settings. How we see ourselves can evolve or even radically change. This freedom to selectively disclose and manage who we are to whom, and the space we have to do it in, is drastically eroded by the increased ability of companies and governments to link and join previously distinct data points, both spatially and temporally, into a single, unified identity—an apparently definitive assessment of who each of us is as a person.
In 2020, Google announced that it would drop gender recognition from its Cloud Vision API, which is used by developers to analyse what’s in an image, and can identify anything from brand logos to faces to landmarks. That’s a good first step, but further action needs to be taken in industry more widely. For this, we need AI regulation that does more than simply set up the standards and guidelines that determine how AI systems can be used. What is needed are clear red lines that demarcate what AI cannot and should not be used for.
Resisting and protecting people from AI pseudoscience is about far more than making AI accountable, explainable, more transparent, and less biased. It is about defending and vigorously protecting what it means to be a person. It is about resisting ontological reduction. It is about allowing space for identities to be challenged and contested, not cemented by inscrutable automated systems.
Frederike Kaltheuner is a tech policy analyst and researcher. She is also the Director of the European AI Fund, a philanthropic initiative to strengthen civil society in Europe.
Notes
1. Kaltheuner, F. (2018). I asked an online tracking company for all of my data and here's what I found. Privacy International. https://privacyinternational.org/long-read/2433/i-asked-online-tracking-company-all-my-data-and-heres-what-i-found
2. Appiah, K. A. (2018). The lies that bind: Rethinking identity. London: Profile Books.
3. Costanza-Chock, S. (2018). Design justice: Towards an intersectional feminist framework for design theory and practice. Proceedings of the Design Research Society.
4. Cheney-Lippold, J. (2017). We are data. New York, NY: New York University Press.
5. Bowker, G. C., & Star, S. L. (2000). Sorting things out: Classification and its consequences. Cambridge, Mass: MIT Press.
6. Saini, A. (2019). Superior: The return of race science. Boston, Mass: Beacon Press.
7. Kaltheuner, F. (2020). Acknowledging the limits of our AI (and our DNA). Mozilla. https://foundation.mozilla.org/en/blog/acknowledging-limits-our-ai-and-our-dna/