The Ghost Stays in the Picture, Part 3: The Power of the Image

Eryk Salvaggio is a 2024 Flickr Foundation Research Fellow, diving into the relationships between images, their archives, and datasets through a creative research lens. This three-part series focuses on the ways archives such as Flickr can shape the outputs of generative AI in ways akin to a haunting. You can read part one and two.

“Definitions belong to the definers, not the defined.”
― Toni Morrison, Beloved

Generative Artificial Intelligence is sometimes described as a remix engine. It is one of the more easily graspable metaphors for understanding these images, but it’s also wrong. 

As a digital collage artist working before the rise of artificial intelligence, I was always remixing images. I would do a manual search of the public domain works available through the Internet Archive or Flickr Commons. I would download images into folders named for specific characteristics of various images. An orange would be added to the folder for fruits, but also round, and the color orange; cats could be found in both cats and animals.

I was organizing images solely on visual appearance. I was anticipating their retrieval whenever certain needs might emerge. If I needed something round to balance a particular composition, I could find it in the round folder, surrounded by other round things: fruits and stones and images of the sun, the globes of planets and human eyes.

Once in the folder, the images were shapes, and I could draw from them regardless of what they depicted. It didn’t matter where they came from. They were redefined according to their anticipated use. 

A Churning

This was remixing, but I look back on this practice with fresh eyes when I consider the metaphor as it is applied to diffusion models. My transformation of source material was not merely based on their shapes, but their meaning. New juxtapositions emerged, recontextualizing those images. They retained their original form, but engaged in new dialogues through virtual assemblages. 

As I explore AI images and the datasets that help produce them, I find myself moving away from the concept of the remix. The remix is a form of picking up a melody and evolving it, and it relies on human expression. It is a relationship, a gesture made in response to another gesture.

To believe we could “automate” remixing assumes too much of the systems that do this work. Remixes require an engagement with the source material. Generative AI systems do not have any relationship with the meanings embedded into the materials they reconfigure. In the absence of engagement, what machines do is better described as a churn, combining two senses of the word. Generative AI models churn images in that they dissolve the surface of these images. Then they churn out new images, that is, “to produce mechanically and in great volume.”

Of course, people can diffuse the surface meaning of images too. As a collagist, I could ignore the context of any image I liked. We can look at the stereogram below and see nothing but the moon. We don’t have to think about the tools used to make that image, or how it was circulated, or who profited from its production. But as a collagist, I could choose to engage with questions that were hidden by the surfaces of things. I could refrain from engagements with images, and their ghosts, that I did not want to disturb. 

Actions taken by a person can model actions taken by a machine. But the ability to automate a person’s actions does not suggest the right or the wisdom to automate those actions. I wonder if, in the case of diffusion models, we shouldn’t more closely scrutinize the act of prising meaning from an image and casting it aside. This is something humans do when they are granted, or demand, the power to do so. The automation of that power may be legal. But it also calls for thoughtful restraint. 

In this essay, I want to explore the power to inscribe into images. Traditionally, the power to extract images from a place has been granted to those with the means to do so. Over the years, the distribution and circulation of images has been balanced against those who hold little power to resist it. In the automation of image extraction for training generative artificial intelligence, I believe we are embedding this practice into a form of data colonialism. I suggest that power differentials haunt the images that are produced by AI, because they have molded the contents of the datasets and infrastructures that result in those images.

The Crying Child

Temi Odumosu has written about the “digital reproduction of enslaved and colonized subjects held in cultural heritage collections.” In The Crying Child, Odumosu looks at the role of the digital image as a means of extending the life of a photographic memory. But this process is fraught, and Odumosu dedicates the paper to “revisiting those breaches (in trust) and colonial hauntings that follow photographed Afro-diasporic subjects from moment of capture, through archive, into code” (S290). It does so by focusing on a single image, taken in St. Croix in 1910: 

“This photograph suspends in time a Black body, a series of compositional choices, actions, and a sound. It represents a child standing alone in a nondescript setting, barefoot with overpronation, in a dusty linen top too short to be a dress, and crying. Clearly in visible distress, with a running nose and copious tears rolling down its face, the child’s crinkled forehead gives a sense of concentrated energy exerted by all the emotion … Emotions that object to the circumstances of iconographic production.”

The image emerges from the Royal Danish Library. It was taken by Axel Ovesen, a military officer who operated a commercial photography business. The photograph was circulated as a postcard, and appears in a number of personal and commercial photo albums Odumosu found in the archive.

The unnamed crying child appeared to the Danish colonizers of the island as an amusement, and is labeled only as “the grumpy one” (in the sense of “uncooperative”). The contexts in which this image appeared and circulated were all oriented toward soothing and distancing the colonizers from the colonized. By reframing it as a humorous novelty, the power to apply and remove meaning is exercised on behalf of those who purchase the postcard and mail it to others for a laugh. What is literally depicted in these postcards is, Odumosu writes, “the means of production, rights of access, and dissemination” (S295).

I am describing this essay at length because the practice of categorizing this image in an archive is so neatly aligned with the collection and categorization of training data for algorithmic images. Too often, the images used for training are treated solely as data, and training defended as an act that leaves no traces. This is true. The digital copy remains intact.

But the image is degraded, literally, step by step until nothing remains but digital noise. The image is churned, the surface broken apart, and its traces stored as math tucked away in some vector space. It all seems very tidy, technical, and precise, if you treat the image as data. But to say so requires us to agree that the structures and patterns of the crying child in the archive — the shape of the child’s body, the details of the wrinkled skin around the child’s mouth — are somehow distinct from the meaning of the image. 

Because by diffusing these images into an AI model, and pairing existing text labels to it within the model, we extend the reach of Danish colonial power over the image. For centuries, archives have organized collections into assemblages shaped and informed by a vision of those with power over those whose power is held back. The colonizing eye sets the crying child into the category of amusements, where it lingers until unearthed and questioned.

If these images are diffused into new images — untraceable images, images that claim to be without context or lineage — how do we uncover the way that this power is wielded and infused into the datasets, the models, and the images ultimately produced by the assemblage? What obligations linger beneath the surfaces of things? 

Every Archive a Collage

Collage can be one path for people to access these images and evaluate their historical context. The human collage maker, the remixer, can assess and determine the appropriateness of the image for whatever use they have in mind. This can be an exercise of power, too, and it ought to be handled consciously. It has featured as a tool of Situationist detournement, a means of taking images from advertising and propaganda to reveal their contradictions and agendas. These are direct confrontations, artistic gestures that undermine the organization of the world that images impose on our sense of things. The collage can be used to exert power or challenge the status quo. 

Every archive is a collage, a way of asserting that there is a place for things within an emergent or imposed structure. The scholar and artist Beth Coleman’s work points to the reversal of this relationship, citing W.E.B. Du Bois’ exhibition at the 1900 Paris Exposition. M. Murphy writes,

“Du Bois’s use of [photographic] evidence disrupted racial kinds rather than ordered them … Du Bois’s exhibition was crucially not an exhibit of ‘facts’ and ‘data’ that made black people in Georgia knowable to study, but rather a portrait in variation and difference so antagonistic to racist sociology as to dislodge race as a coherent object of study” (71).

The imposed structures of algorithmically generated images rely on facts and data, defined a certain way. They struggle with context and difference. The images these tools produce are constrained to the central tendencies of the data they were trained on, an inherently conformist technology. 

To challenge these central tendencies means to engage with the structures it imposes on this data, and to critique this churn of images into data to begin with. Matthew Fuller and Eyal Weizman describe “hyper-aesthetic” images as not merely “part of a symbolic regime of representation, but actual traces and residues of material relations and of mediatic structures assembled to elicit them” (80). 

Consider the stereoscope. Once the most popular means of accessing photographs, the stereoscope relied on a trick of the eye, akin to the use of 3D glasses. It combined two views of the same scene, taken from slightly to the left and slightly to the right of each other. When viewed through a special viewing device, the human eye superimposes them, and the overlap creates the illusion of physical depth in a flat plane. We can find some examples of these on Flickr (including the Danish Film Museum) or at The Library of Congress’ Stereograph collection.

The time period in which this technology was popular happened to overlap with an era of brutal colonization, and the archival artifacts of this era contain traces of how images projected power. 

I was struck by stereoscopic images of American imperialism in the Philippines during the US occupation, starting in 1899. They aimed to “bring to life” images of Filipino men dying in fields and other images of war, using the spectacle of the stereoscopic image as a mechanism for propaganda. These were circulated as novelties to Americans on the mainland, a way of asserting a gaze of dominance over those they occupied.

In the long American tradition of infotainment, the stereogram fused a novel technological spectacle with the effort to assert military might, paired with captions describing the US cause as just and noble while severely diminishing the numbers of civilian casualties. In Body Parts of Empire: Visual Abjection, Filipino Images, and the American Archive, Nerissa Balce writes that

“The popularity of war photographs, stereoscope viewers, and illustrated journals can be read as the public’s support for American expansion. It can also be read as the fascination for what were then new imperial ‘technologies of vision’” (52).

The link between stereograms as a style of image and the gaze of colonizing power is now deeply entrenched into the vector spaces of image synthesis systems. Prompt Midjourney for the style of a stereogram, and this history haunts the images it returns. Many prompted images for “Stereograms, 1900” do not even render the expected, highly formulaic structure of a stereogram (two of the same images, side by side, at a slight angle). It does, however, conjure images of those occupied lands. We see a visual echo of the colonizing gaze.  

Images produced for the more generally used “stereoview,” even without the use of a date, still gravitate to a similar visual language. With “stereoview,” we are given the technical specifics of the medium. The content is more abstract: people are missing, but strongly suggested. These perhaps get me closest to the idea of a “haunted” image: a scene which suggests a history that I cannot directly access.

Perhaps there are two kinds of absences embedded in these systems. The people that colonizers want to erase, and then the evidence of the colonizers themselves. Crucially, this gaze haunts these images. 

Here are four sets of two pairs.

These styles are embedded into the prompt for the technology of image capture, the stereogram. The source material is inscribed with the gaze that controlled this apparatus. The method of that inscription — the stereogram — inscribes this material into the present images.  The history is loaded into the keyword and its neighboring associations in the vector space. History becomes part of the churn. These are new old images, built from the associations of a single word (stereoview) into its messy surroundings.

It’s important to remember that the images above are not documents of historical places or events. They’re “hallucinations,” that is, they are a sample of images from a spectrum of possible images that exists at the intersection of every image labeled “stereoview.” But “stereoview” as a category does not isolate the technology from how it was used. The technology of the stereogram, or the stereoviewer, was deeply integrated into regimes of war, racial hierarchies, and power. The gaze, and the subject, are both aggregated, diffused, and made to emerge through the churning of the model.

Technologies of Flattening

The stereoview and the diffusion models are both technologies of spectacle, and the affordance of power to those who control them is a similar one. They are technologies for flattening, containing, and re-contextualizing the world into a specific order. For viewers, the generated image is never merely the surfaces of photography churned into new, abstract forms that resemble our prompts. It is an activation of the model’s symbolic regime, which is derived from the corpus of images because it has the power to isolate images from their meaning.

AI has the power of finance, which enables computational resources that make obtaining 5 billion images for a dataset possible, regardless of its impact on local environments. It has the resources to train these images; the resources to recruit underpaid labor to annotate and sort these images. The critiques of AI infrastructure are numerous.

I am most interested here in one form of power that is the most invisible, which is the power of naturalizing and imposing an order of meaning through diffused imagery. The machine controls the way language becomes images. At the same time, it renders historical documentation meaningless — we can generate all kinds of historical footage now.

These images are reminders of the ways data colonialism has become embedded within not merely image generation but the infrastructures of machine learning. The scholar Tiara Roxanne has been investigating the haunting of AI systems long before me. In 2022 Roxanne noted that,

“in data colonialism, forms of technological hauntings are experienced when Indigenous peoples are marked as ‘other,’ and remain unseen and unacknowledged. In this way, Indigenous peoples, as circumscribed through the fundamental settler-colonial structures built within machine learning systems, are haunted and confronted by this external technological force. Here, technology performs as a colonial ghost, one that continues to harm and violate Indigenous perspectives, voices, and overall identities” (49).

AI can ignore “the traces and residues of material relations” (Fuller and Weizman) as it reduces the image to its surfaces instead of the constellations of power that structured the original material. These images are the product of imbalances of power in the archive, and whatever interests those archives protected are now protected by an impenetrable, uncontestable, automated set of decisions steered by the past.

The Abstracted Colonial Subject

What we see in the above images is an inscription by association. The generated image, as a type of machine learning system, matters not only because of how it structures history into the present. It matters because it is a visualization that reaches to something far greater about automated decision making and the power it exerts over others.

These striations of power in the archive or museum, in the census or the polling data, in the medical records or the migration records, determine what we see and what we do not. What we see in generated images must contort itself around what has been excluded from the archives. What is visible is shaped by the invisible. In the real world, this can manifest as families living on a street serving as an indication of those who could not live on that street. It could be that loans granted by an algorithmic assessment always contain an echo of loans that were not approved. 

The synthetic image visualizes these traces. They churn the surfaces, not the tangled reality beneath them. The images that emerge are glossy, professional, saturated. Hiding behind these products by and for the attention economy is the world of the not-seen. What are our obligations as viewers to the surfaces we churn when we prompt an image model? How do we reconcile our knowledge of context and history with the algorithmic detachment of these automated remixes?

The media scholar Roland Meyer writes that,

“[s]omewhere in the training data that feeds these models are photographs of real people, real places, and real events that have somehow, if only statistically, found their way into the image we are looking at. Historical reality is fundamentally absent from these images, but it haunts them nonetheless.”

In a seance, you raise spirits you have no right to speak to. The folly of it is the subject of countless warnings in stories, songs and folklore. 

What if we took the prompt as seriously? What if typing words to trigger an image was treated as a means of summoning a hidden and unsettled history? Because that is what the prompt does. It agitates the archives. Sometimes, by accident, it surfaces something many would not care to see. Boldly, knowing that I am acting from a place of privilege and power, I ask the system to return “the abstracted colonial subject of photography.” I know I am conjuring something I should not be.

My words are transmitted into the model within a data center, where they flow through a set of vectors, the in-between state of thousands of photographs. My words are broken apart into key words — “abstracted, colonial, colonial subject, subject, photography.” These are further sliced into numerical tokens to represent the mathematical coordinates of these ideas within the model. From there, these coordinates offer points of cohesion which are applied to find an image within a jpg of digital static. The machine removes the noise toward an image that exists in the overlapping space of these vectors.
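The process described above can be loosely illustrated with a toy sketch. It treats each keyword as a point in a small vector space, averages those points into a single target, and then starts from random static that is nudged toward the target step by step. Everything here is an assumption made for illustration: the three-dimensional coordinates in `KEYWORD_VECTORS`, the `prompt_target` and `denoise` functions, and the fixed step size are all hypothetical. Real diffusion models use learned neural networks and far larger vector spaces, so this is only a picture of the idea, not of any actual system.

```python
import random

# Hypothetical 3-dimensional "coordinates" for each keyword (invented for illustration).
KEYWORD_VECTORS = {
    "abstracted": [0.2, 0.9, 0.1],
    "colonial": [0.8, 0.1, 0.5],
    "subject": [0.4, 0.4, 0.4],
    "photography": [0.6, 0.2, 0.9],
}


def prompt_target(keywords):
    """Average the keyword vectors: a stand-in for the overlapping space of the prompt."""
    vectors = [KEYWORD_VECTORS[k] for k in keywords]
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def denoise(target, steps=50, rate=0.2, seed=0):
    """Start from random static and move a fraction of the way to the target each step."""
    rng = random.Random(seed)
    image = [rng.gauss(0.0, 1.0) for _ in target]  # the "digital static"
    for _ in range(steps):
        image = [x + rate * (t - x) for x, t in zip(image, target)]
    return image


target = prompt_target(["abstracted", "colonial", "subject", "photography"])
result = denoise(target)
```

The sketch makes one point visible: the static carries no history of its own, and whatever the result resembles comes entirely from the coordinates the words point to.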

Avery Gordon, whose book Ghostly Matters is a rich source of thinking for this research, writes:

“… if there is one thing to be learned from the investigation of ghostly matters, it is that you cannot encounter this kind of disappearance as a grand historical fact, as a mass of data adding up to an event, marking itself in straight empty time, settling the ground for a future cleansed of its spirit” (63).

If history is present in the archives, the images churned from the archive disrupt our access to the flow of history. It prevents us from relating to the image with empathy, because there is no single human behind the image or within it. It’s the abstracted colonial gaze of power applied as a styling tool. It’s a mass of data claiming to be history.

Human and Mechanical Readings

I hope you will indulge me as my eye wanders through the resulting image.

I am struck by the glossiness of it. Midjourney is fine-tuned toward an aesthetic dataset, leaning into images found visually appealing based on human feedback. I note the presence of palm trees, which brings me to the Caribbean island of St. Croix where The Crying Child photograph was taken. I see the presence of barbed wire, a signifier of a colonial presence.

The image is a double exposure. It reminds me of spirit photography, in which so-called psychic photographers would surreptitiously photograph a ghostly puppet before photographing a client. The image of the “ghost” was superimposed on the film to emerge in the resulting photo. These are associations that come to my mind as I glance at this image. I also wonder about what I don’t know how to read: the style of the dress, the patterns it contains, the haircut, the particulars of vegetation.

We can also look at the image as a machine does. Midjourney’s describe feature will tell us what words might create an image we show it. If I use it with the images it produces, it offers a kind of mirror-world insight into the relationship between the words I’ve used to summon that image and the categories of images from which it was drawn.

To be clear, both “readings” offer a loose, intuitive methodology, keeping in the spirit of the seance — a Ouija board of pixel values and text descriptors. They are a way in to the subject matter, offering paths for more rigorous documentation: multiple images for the same prompt, evaluated together to identify patterns and the prevalence of those patterns. That reveals something about the vector space. 
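One way the more rigorous documentation mentioned above might begin is by counting how often each descriptor recurs across many machine descriptions of images made from the same prompt. The following is a minimal sketch under stated assumptions: the `descriptor_prevalence` function is hypothetical, the sample descriptions are invented for illustration, and descriptions are assumed to arrive as comma-separated text collected by hand rather than through any particular tool's interface.

```python
from collections import Counter


def descriptor_prevalence(descriptions):
    """Return the share of descriptions in which each comma-separated descriptor appears."""
    if not descriptions:
        return {}
    counts = Counter()
    for text in descriptions:
        # Use a set so a descriptor is counted at most once per description.
        terms = {term.strip().lower() for term in text.split(",") if term.strip()}
        counts.update(terms)
    total = len(descriptions)
    return {term: n / total for term, n in counts.most_common()}


# Hypothetical descriptions, invented for illustration only.
sample = [
    "tall grass, tropical style, double exposure effect",
    "palms, tropical style, red and teal color scheme",
    "double exposure effect, tropical style, palms",
]
prevalence = descriptor_prevalence(sample)
```

A descriptor that appears in nearly every description suggests a strong association in the vector space; one that appears once may be noise. Neither reading is conclusive on its own.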

Here, I just want to see something, to compare the image as I see it to what the machine “sees.”

The image returned for the abstracted colonial subject of photography is described by Midjourney this way:

“There is a man standing in a field of tall grass, inverted colors, tropical style, female image in shadow, portrait of bald, azure and red tones, palms, double exposure effect, afrofuturist, camouflage made of love, in style of kar wai wong, red and teal color scheme, symmetrical realistic, yellow infrared, blurred and dreamy illustration.”

My words produced an image, and then those words disappeared from the image that was produced. “Colonized Subject” is adjacent to the words the machine does see: “tall grass,” “afrofuturism,” “tropical.” Other descriptions recur as I prompt the model over and over again to describe this image, such as “Indian.” I have to imagine that this idea of colonized subjects “haunts” these keywords. The idea of the colonial subject is recognized by the system, but shuffled off to nearest synonyms and euphemisms. Might this be a technical infrastructure through which the images are haunted? Could certain patterns of images be linked through unacknowledged, invisible categories the machine can only indirectly acknowledge? 

I can only speculate. That’s the trouble with hauntings. It’s the limit to drawing any conclusions from these observations. But I would draw the reader’s attention to an important distinction between my actions as a collage artist and the images made by Midjourney. The image will be interpreted by many of us, who will find different ways to see it, and a human artist may put those meanings into adjacency through conscious decisions. But to create this image, we rely solely on a tool for automated churning.

We often describe the power of images in terms of what impact an image can have on the world. Less often we discuss the power that impacts the image: the power to structure and give the image form, to pose or arrange photographic subjects. 

Every person interprets an image in different ways. A machine makes images for every person from a fixed set of coordinates, its variety constrained by the borders of its data. That concentrates power over images into the unknown coordination of a black box system. How might we intervene and challenge that power?  

The Indifferent Archivist 

We have no business conjuring ghosts if we don’t know how to speak to them. As a collage artist, “remixing” in 2016 meant creating new arrangements from old materials, suggesting new interpretations of archival images. I was able to step aside: as a white man in California, I would never use the images of colonized people for something as benign as “expressing myself.” I would know that I could not speak to that history. Best to leave that power to shift meanings and shape new narratives to those who could speak to it. Nonetheless, it is a power that can be wielded by those who have no rights to it.

Yes, by moving any accessible image from the online archive and transmuting it into training data, diffusion models assert this same power. But they are incapable of historic acknowledgement or obligation. The narratives of the source materials are blocked from view, in service to a technically embedded narrative that images are merely their surfaces and that surfaces are malleable. At its heart is the idea that the context of these images can be stripped and reduced into a molding clay, for anyone’s hands to shape to their own liking.

What matters is the power to determine the relationships our images have with the systems that include or exclude. It’s about the power to choose what becomes documented, and on what terms. Through directed attention, we may be able to work through the meanings of these gaps and traces. It is a useful antidote to the inattention of automated generalizations. To greet the ghosts in these archives presents an opportunity to intervene on behalf of complexity, nuance, and care.

That is the literal meaning of curation, at its Latin root: “curare,” to care. In this light, there is no such thing as automated curation.

Reclaiming Traceability

In 2021, Magda Tyzlik-Carver wrote that “the practice of curating data is also an epistemological practice that needs interventions to consider futures, but also account for the past. This can be done by asking where data comes from. The task in curating data is to reclaim their traceability and to account for their lineage.”

When I started the “Ghost Stays in the Picture” research project, I intended to make linkages between the images produced by these systems and the categories within their training data. It would be a means of surfacing the power embedded into the source of this algorithmic churning within the vector space. I had hoped to highlight and respond to these algorithmic imaginaries by revealing the technical apparatus beneath the surface of generated images. 

In 2024, no mainstream image generation tool offers the access necessary for us to gather any insights into its curatorial patterns. The image dataset I initially worked with for this project is gone. Images of power and domination were the reason: specifically, the Stanford Internet Observatory’s discovery of more than 3,000 images in the LAION 5B dataset depicting abused children. When I realized this, the churn of images became visceral, felt in the pit of my stomach. The traces of those images, the pain of any person in the dataset, linger in the models. Perhaps imperceptibly, they shape the structures and patterns of the images I see.

In gathering these images, there was no right to refuse, no intervention of care. Ghosts, Odumosu writes, “make their presences felt, precisely in those moments when the organizing structure has ruptured a caretaking contract; when the crime has not been sufficiently named or borne witness to; when someone is not paying attention” (S299). 

The training of Generative Artificial Intelligence systems has relied upon the power to automate indifference. And if synthetic images are structured in this way, it is merely a visualization of how “artificial intelligence systems” structure the material world when carelessly deployed in other contexts. The synthetic image offers us a glimpse of what that world would look like, if only we would look critically at the structures that inform its spectacle. If we can read algorithmic decision-making as a lapse in care, a disintegration of accountability, we might see that fresh pavement has been poured onto sacred land.

This regime of Artificial Intelligence is not an inevitability. It is not even a single ideology. It is a computer system, and computer systems, and norms of interaction and participation with those systems, are malleable. Even with training datasets locked away behind corporate walls, it might still be possible “to insist on care where there has historically been none” (Odumosu S297), and by extension, to identify and refuse the automated inscription of the colonizing ghost.

 

This post concludes my research work at the Flickr Foundation, but I am eager to continue it. I am seeking publishers of art books, or curators for art or photographic exhibitions, who may be interested in a longer set of essays or a curatorial project that explores this methodology for reading AI generated images. If you’re interested, please reach out to me directly: eryk.salvaggio@gmail.com.

The Ghost Stays in the Picture, Part 2: Data Casts Shadows

Eryk Salvaggio is a 2024 Flickr Foundation Research Fellow, diving into the relationships between images, their archives, and datasets through a creative research lens. This three-part series focuses on the ways archives such as Flickr can shape the outputs of generative AI in ways akin to a haunting. Read part one, or continue to part three.

“Today the photograph has transformed again.” – David A. Shamma, in a blog post announcing the YFCC100M dataset.

In part one of this series, I wrote about the differences between archives, datasets, and infrastructures. We explored the movement of images into archives through the simple act of sharing a photograph in an online showcase. We looked at the transmutation of archives into datasets: the ways those archives, composed of individual images, become a category unto themselves and are analyzed as an object of much larger scale. Once an archive becomes a dataset, seeing its contents as individual pieces, each with its own story and value, requires a special commitment to archival practices.

Flickr is an archive — a living and historical record of images taken by people living in the 21st century, a repository for visual culture and cultural heritage. It is also a dataset: the vast sum of this data, framed as an overwhelming challenge for organizing, sorting, and contextualizing what it contains. That data becomes AI infrastructure, as datasets made to aid the understanding of the archive become used in unexpected and unanticipated ways.  

In this post, I shift my analysis from image to archive to dataset, and trace the path of images as they become AI infrastructure — particularly in the field of data-driven machine learning and computer vision. I’ll again turn to the Flickr archive and datasets derived from it.

99.2 Million Rows

A key case study is a collection of millions of images shared in June 2014. That’s when Yahoo! Labs released the YFCC100M dataset, which contained 99.2 million rows of metadata describing photos by 578,268 Flickr members, all uploaded to Flickr between 2004 and 2014 and tagged with a CC license. The dataset contained information such as photo IDs, URLs, and a handful of metadata fields such as the title, tags, and description. I believe that the YFCC100M release was emblematic of a shift in the perception of the visual archive, by both the public and Silicon Valley, into the category of “image datasets.”

Certainly, it wasn’t the first image dataset. Digital images had been collected into digital databases for decades, usually for the task of training image recognition systems, whether for handwriting, faces, or object detection. Many of these assembled similar images, such as Stanford’s dogs dataset or NVIDIA’s collection of faces. Nor was it the first transition that a curated archive made into the language of “datasets.” For example, the Tate Modern introduced a dataset of 70,000 digitized artworks in 2013.  

What made YFCC100M interesting was that it was so big, but also diverse. That is, it wasn’t a pre-assembled dataset of specific categories; it was an assortment of styles, subject matter, and formats. Flickr was not a cultural heritage institution but a social media network with a user base that had uploaded far more images than the world’s largest libraries, archives, or museums. In terms of pure photography, no institution could compete on scale and community engagement.

The YFCC100M includes the description, tags, geotags, camera types, and links to 100 million source images. As a result, we see YFCC100M appear over and over again in papers about image recognition, and then image synthesis. It has been used to train, test, or calibrate countless machine vision projects, including high-rated image labeling systems at Google and OpenAI’s CLIP, which was essential to building DALL-E. Its influence in these systems rivals that of ImageNet, a dataset of 14 million images which was used as a benchmark for image recognition systems, though Nicolas Malevé notes that nearly half of ImageNet’s photos came from Flickr URLs. (ImageNet has been explored in-depth by Kate Crawford and Trevor Paglen.)

10,000 Images of San Francisco

It is always interesting to go in and look at the contents of a dataset, and I’m often surprised how rarely people do this. Whenever we dive into the actual content of datasets we discover interesting things. The YFCC100M dataset contains references to 200,000 images by photographer Andy Nystrom alone, a prolific street photographer who has posted nearly 8 million images to Flickr since creating their account in 2008. 

The dataset contains more than 10,000 images each of London, Paris, Tokyo, New York, San Francisco, and Hong Kong, which outnumber those of other cities. Note the gaps here: all are cities of the Northern Hemisphere. When I ask Midjourney for an image of a city, I see traces of these locations in the output.
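A tally like the one above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the figures in this post: the rows, the field names (`photo_id`, `tags`), and the list of cities are all invented for the example, and a real pass over the metadata would more likely rely on geotags than on free-text tags.

```python
from collections import Counter

# Hypothetical metadata rows: the field names and values are invented for
# illustration and are not copied from YFCC100M.
rows = [
    {"photo_id": "1", "tags": ["london", "street", "night"]},
    {"photo_id": "2", "tags": ["paris", "cafe"]},
    {"photo_id": "3", "tags": ["london", "bridge"]},
    {"photo_id": "4", "tags": ["lagos", "market"]},
    {"photo_id": "5", "tags": ["tokyo", "london"]},
]

# The cities we choose to look for. Any city left off this list is
# invisible to the tally, which is its own kind of absence.
CITIES = {"london", "paris", "tokyo", "lagos", "nairobi"}


def count_city_tags(rows, cities):
    """Count how many rows carry each city tag (a row counts once per city)."""
    counts = Counter({city: 0 for city in cities})
    for row in rows:
        row_tags = {tag.lower() for tag in row["tags"]}
        for city in cities & row_tags:
            counts[city] += 1
    return counts


city_counts = count_city_tags(rows, CITIES)

# Print the most common cities first, breaking ties alphabetically.
for city, n in sorted(city_counts.items(), key=lambda item: (-item[1], item[0])):
    print(f"{city}: {n}")
```

Even in this toy version, the cities with zero rows only show up because we thought to ask about them. The count cannot surface a gap that nobody names.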

Are these strange hybrids a result of the prevalence of Flickr in the calibration and testing of these systems? Are they a bias accumulated through the longevity of these datasets and their embeddedness into AI infrastructures? I’m not confident enough to say for sure. But missing from the images produced from the generic prompt “city” are traces of what Midjourney considers an African city. What emerges are not shiny, glistening postcard shots or images that would be plastered on posters by the tourist bureau. Instead, they seem to affirm the worst of the colonizing imagination: unpaved roads, cars broken down in the street. The images for “city” are full of windows reflecting streaks of sunlight; for “African city,” these are windows absent of glass. 

“A prompt about a ‘building in Dakar’ will likely return a deserted field with a dilapidated building while Dakar is a vibrant city with a rich architectural history,” notes the Senegalese curator Linda Dounia. She adds: “For a technology that was developed in our times, it feels like A.I. has missed an opportunity to learn from the fraught legacies that older industries are struggling to untangle themselves from.”

Beyond the training data, these legacies are also entangled in digital infrastructures. We know images from Flickr have come to shape the way computers represent the world, and how we define tests of AI-generated output as “realistic.” These definitions emerge from data, but also from infrastructures of AI. Here, one might ask if the process of calibrating images to places has been centered on the geographic regions where Flickr has access to ample images: 10,000 images each from cities of the Northern Hemisphere. These created categories for future assessment and comparison.

What we see in those images of an “African city” are what we don’t see in the dataset. What we see is what is missing from that infrastructure: 10,000 pictures of Lagos or Nairobi. When these images are absent from the training data, they influence the result. When they are absent from the classifiers and calibration tools, that absence is entrenched.

The sociologist Avery Gordon writes of ghosts, too. For Gordon, the ghost, or the haunting, is “the intermingling of fact, fiction and desire as it shapes the personal and social memory … what does the ghost say as it speaks, barely, in the interstices of the visible and invisible?” In these images, the ghost is the image not taken, the history not preserved, the gaps that haunt the archives. It’s clear these absences move into the data, too, and that the images of artificial intelligence are haunted by them, conjuring up images that reveal these gaps, if we can attune ourselves to see them.

There is a limit to this kind of visual infrastructural analysis of image generation tools — its reliance on intuition. There is always a distance between these representations of reality in the generated image and the reality represented in the datasets. Hence our language of the seance. It is a way of poking through the uncanny, to see if we can find its source, however remote the possibility may be.  

Representativeness

We do know a few things, in fact. We know this dataset was tested for representativeness, which was defined as how evenly it aligned with Flickr’s overall content, not the world at large. We know, then, that the dataset was meant to represent the broader content of Flickr as a whole, and that the biases of the dataset (such as the strong presence of these particular cities) are therefore the biases of Flickr. In 2024, an era where images have been scraped from the web wholesale for training data without warning or permission, we can ask if the YFCC100M dataset reflected the biases we see in tools like DALL-E and Midjourney. We can also ask if the dataset, in becoming a tool for measuring and calibrating these systems, may have shaped those biases as a piece of data infrastructure.

As biased data becomes a piece of automated infrastructure, we see biases come into play from factors beyond just the weights of the training data. They also come into play in the ways the system maps words to images, sorts out and rejects useful images, and more. One of the ways YFCC100M’s influence may shape these outcomes is through its role in training the OpenAI tool I mentioned earlier, called CLIP.

CLIP looks at patterns of pixels in an image and compares them to labels for similar sets of pixels. It’s a bridge that connects the descriptions of images to words of a user’s prompt. CLIP is a core connection point between words and images within generative AI. Recognizing whether an image resembles a set of words is how researchers decided what images to include in training datasets such as LAION 5B. 
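To make that comparison step concrete, here is a toy sketch in Python. It is not CLIP and does not use OpenAI’s code: the three-number “embeddings,” the captions, and the cutoff value are all invented for illustration. A real system learns its vectors from hundreds of millions of image and text pairs and works in far more dimensions. The only thing this sketch shares with the real thing is the basic move of scoring how closely an image vector and a text vector point in the same direction, then keeping or discarding the pair based on that score.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented stand-ins for learned embeddings (hypothetical values).
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a city street at dusk": [0.8, 0.2, 0.1],
    "a bowl of oranges": [0.1, 0.9, 0.3],
}

# Hypothetical cutoff: pairs scoring below it would be left out of a dataset.
THRESHOLD = 0.8

# Score every caption against the image.
scores = {
    caption: cosine_similarity(image_embedding, vector)
    for caption, vector in caption_embeddings.items()
}

# The caption the toy model considers the closest match.
best_caption = max(scores, key=scores.get)

# The image and caption pairs that would survive the filter.
kept = [caption for caption, score in scores.items() if score >= THRESHOLD]

print(best_caption)
print(kept)
```

Whoever sets the cutoff, and whatever data shaped the vectors, decides which images count as a match for which words. That decision then travels with every dataset filtered this way.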

Calibration

CLIP’s training and calibration dataset contained a subset of YFCC100M, about 15 million images out of CLIP’s 400 million total. But CLIP was calibrated with, and its results tested against, classifications using YFCC100M’s full set. By training and calibrating CLIP against YFCC100M, that dataset played a role in establishing the “ground truth” that shaped CLIP’s ability to link images to text. 

CLIP was assessed on its ability to scale the classifications produced by YFCC100M and MS-COCO, another dataset which consisted entirely of images downloaded from Flickr. The result is that the logic of Flickr users and tagging has become deeply embedded into the fabric of image synthesis. The captions created by Flickr members modeled — and then shaped — the ways images of all kinds would be labeled in the future. In turn, that structured the ways machines determined the accuracy of those labels. If we want to look at the infrastructural influences of these digital “ghosts in the machine,” then the age, ubiquity, and openness of the YFCC100M dataset suggests it has a subtle but important role to play in the way images are produced by diffusion models. 

We might ask about “dataset bias,” a form of bias that doesn’t refer to the dataset, or the archive, or the images they contain. Instead, it’s a bias introduced through the simple act of calling something a dataset, rather than acknowledging its constitutive pieces. This shift in focus shifts our relationship to these pieces, asking us to look at the whole. Might the idea of a “dataset” bias us from the outset toward ignoring context, and distract us from our obligation of care to the material it contains?  

From Drips Comes the Deluge

The YFCC100M dataset was paired with a paper, YFCC100M: The New Data in Multimedia Research, which focused on the needs of managing visual archives at scale. YFCC100M was structured as an index of the archive: a tool for generating insight about what the website held. The authors hoped it might be used to create tools for handling an exponential tide of visual information, rather than developing tools that contributed to the onslaught. 

The words “generative AI” never appear in the paper. It would have been difficult, in 2014, to anticipate that such datasets would be seen through a fundamental shift from “index” to “content” for image generation tools. That is a shift driven by the mindset of AI companies that rose to prominence years later.

In looking at the YFCC100M dataset and paper, I was struck by the difference between the problems it was established to address and the eventual, mainstream use of the dataset. Yahoo! released the paper in response to the problems of proprietary datasets, which they claimed were hampering replication across research efforts. The limits on the reuse of datasets also meant that researchers had to gather their own training data, which was a time consuming and expensive process. This is what made the data valuable enough to protect in the first place — an interesting historical counterpoint to today’s paradoxical claim by AI companies that image data is both rare and ubiquitous, essential but worth very little.  

Attribution

Creative Commons licensed pictures were selected for inclusion in order to facilitate the widest possible range of uses, with the authors noting that they were providing “a public dataset with clearly marked licenses that do not overly impose restrictions on how the data is used” (2). Only a third of the images in the dataset were marked as appropriate for commercial use, and 17% required only attribution. But, in accordance with the terms of the Creative Commons licenses used, every image in the dataset required attribution of some kind. When the dataset was shared with the public, it was assumed that researchers would use the dataset to determine how to use the images contained within it, picking images that complied with their own experiments.
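The kind of selection the authors seem to have had in mind can be sketched as a simple filter. This is a hedged illustration in Python, not the paper’s tooling: the rows and field names are hypothetical, and the short codes such as “BY-NC” are a labeling convention chosen for this example. The license logic itself follows the standard Creative Commons families, where every license includes the attribution (BY) element and the NonCommercial (NC) element rules out commercial use.

```python
def requires_attribution(code):
    """True if the license code includes the attribution (BY) element."""
    return "BY" in code.split("-")


def allows_commercial_use(code):
    """True if the license code lacks the NonCommercial (NC) element."""
    return "NC" not in code.split("-")


# Hypothetical rows, one per Creative Commons license family.
rows = [
    {"photo_id": "1", "license": "BY"},
    {"photo_id": "2", "license": "BY-NC-SA"},
    {"photo_id": "3", "license": "BY-NC-ND"},
    {"photo_id": "4", "license": "BY-SA"},
    {"photo_id": "5", "license": "BY-NC"},
    {"photo_id": "6", "license": "BY-ND"},
]

# Rows a commercial project could consider at all.
commercial_ok = [r for r in rows if allows_commercial_use(r["license"])]

# Rows whose only condition is attribution.
attribution_only = [r for r in rows if r["license"] == "BY"]

# Rows that must be credited, whatever else the license says.
needs_credit = [r for r in rows if requires_attribution(r["license"])]

print(len(commercial_ok), len(attribution_only), len(needs_credit))
```

The point of the sketch is how little it takes: the license sits in the same row as the URL. Respecting it is a matter of reading the row rather than only following the link.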

The authors of the paper acknowledge that archives are growing beyond our ability to parse them as archivists. But they also acknowledge Flickr as an archive, that is, a site of memory: 

“Beyond archived collections, the photostreams of individuals represent many facets of recorded visual information, from remembering moments and storytelling to social communication and self-identity [19]. This presents a grand challenge of sensemaking and understanding digital archives from non-homogeneous sources. Photographers and curators alike have contributed to the larger collection of Creative Commons images, yet little is known on how such archives will be navigated and retrieved, or how new information can be discovered therein.”

Despite this, there was a curious contradiction in the way Yahoo! Labs structured the release of the dataset. The least restrictive license in the dataset is CC-BY — images where the license requires attribution. Nearly 68 million out of the 100 million images in the dataset specifically stated there could be no commercial use of their images. Yet, the dataset itself was then released without any restrictions at all, described as “publicly and freely usable.”  

The dataset of YFCC100M wasn’t the images themselves. It was the list of images, a sample of the larger archive that was made referenceable as a way to encourage researchers to make sense of the scale of image hosting platforms. The strange disconnect between boldly declaring the contents as CC-licensed, while making them available to researchers to violate those licenses, is perhaps evident only in hindsight.

Publicly Available

It may not have been a deliberate violation of boundaries so much as it was a failure to grapple with the ways boundaries might be transgressed. The paper, then, serves as a unique time capsule for understanding the shift from the logic of datasets as descriptions of things to the understanding of datasets as the collection of things themselves. This was a logic that we can see carried out in the relationships that AI companies have to the data they use. These companies see the datasets as markedly different from the images that the data refers to, suggesting that they have the right to use datasets of images under “fair use” rules that apply to data, but not to intellectual property.

This breaks with the early days of datafication and machine learning, which made clearer distinctions between the description of an archive and the archive itself. When Stability AI used LAION 5B as a set of pointers to consumable content, this relationship between description and content collapsed. What was a list of image URLs and the text describing what would be found there became pointers to training data. The context was never considered. 

That collapse is the result of a fairly recent set of beliefs about the world which increasingly sees the “image” as an assemblage of color information paired with technical metadata. We hear echoes of this in the defense of AI companies, that their training data is “publicly available,” a term with no actual, specific meaning. OpenAI says that CLIP was trained on “text–image pairs that are already publicly available” in its white paper.

In releasing the dataset, Yahoo’s researchers may have contributed to a shift: from understanding online platforms through the lens of archives, into understanding them as data sources to be plundered. Luckily, it’s not too late to reassert this distinction. Visual culture, memory, and history can be preserved through a return to the original mission of data science and machine learning in the digital humanities. We need to make sense of a growing number of images, which means preserving and encouraging new contexts and relationships between images rather than replacing them with context-free abstractions produced by diffusion models. 

Generative AI is a product of datasets and machine learning and digital humanities research. But in the past ten years, data about images and the images themselves have become increasingly interchangeable. Datasets were built to preserve and study metadata about images. But now, the metadata is stripped away, aside from the URL, which is used to analyze an image. The image is translated into abstracted information, ignoring where these images came from and the meaning – and relationships of power – that are embedded into what they depict. In erasing these sources, we lose insight into what they mean and how they should be understood: whether an image of a city was taken by a tourism board or an aid agency, for example. The biases that result from these absences are made clear.

Correcting these biases requires care and attention. It requires pathways for intervention and critical thinking about where images are sourced. It means prioritizing context over convenience. Without attention to context, correcting the source biases is far more challenging.

Data Casts Shadows

In my fellowship with the Flickr Foundation, I am continuing my practice with AI, looking at the gaps between archives and data, and data and infrastructures, through the lens of an archivist. It is a creative research approach that examines how translations of translations shape the world. I am deliberately relying on the language of intuition — ghosts, hauntings, the ritual of the seance — to encourage a more human-scaled, intuitive relationship to this information. It’s a rebuttal of the idea that history, documentation, images and media can be reduced to objective data. 

That means examining the emerging infrastructure built on top of data, and returning to the archival view to see what was erased and what remains. What are the images in this dataset? What do they show us, and what do they mean? Malevé writes that to become AI infrastructure, a Flickr image is pulled from the context of its original circulation, losing currency. It is relabeled by machines, and even the associations of metadata itself become superfluous to the goal of image alignment. All that matters is what the machine sees and how it compares to similar images. The result is a calibration: the creation of a category. The original image is discarded, but the residue of whatever was learned lingers in the system.

While writing this piece, I became transfixed by shadows within synthetic images. Where does the shadow cast in an AI generated image come from? They don’t come from the sun, because there is no sunlight within the black box of the AI system. Despite the hype, these models do not understand the physics of light, but merely produce traces of light abstracted from other sources.

Unlike photographic evidence, synthetic photographs don’t rely on being present to the world of light bouncing from objects onto film or sensors. The shadows we see in an AI generated image are the shadows cast by other images. The generated image is itself a shadow of shadows, a distortion of a distortion of light. The world depicted in the synthetic image is always limited to the worlds pre-arranged by the eyes of countless photographers. Those arrangements are further extended and mediated as these data shadows stretch across datasets, calibration systems, engineering decisions, design choices and automated processes that ignore or obscure their presence.

Working Backward from the Ghost

When we don’t know the source of decisions made about the system, the result is unexplainable, mysterious, spooky. But image generation platforms are a series of systems stacked on top of one another, trained on hastily assembled stews of image data. The outcomes go through multiple steps of analysis and calibration, outputs of one machine fed into another. Most of these systems are built upon a subset of human decisions scaled to cover inhuman amounts of information. Once automated, these decisions become disembodied, influencing the results.

In part 3 – the conclusion of this series – I’ll examine a means of reading AI generated images through the lens of power, hoping to reveal the intricate entanglement of context, control, and shifting meanings within text and image pairs. Just as shadows move across the AI generated image, so too, I propose, does the gaze of power contained within the archives.

I’ll attempt to trace the flow of power and meaning through datasets and data infrastructures that produce these prompted images, working backwards from what is produced. Where do these training images come from? What stories and images do they contain, or lack? In some ways, it is impossible to parse, like a ghost whose message from the past is buried in cryptic riddles. A seance is rarely satisfying, and shadows disappear under a flashlight.

But it’s my hope that learning to read and uncover these relationships improves our literacy about so-called AI images, and how we relate to them beyond toys for computer art. Rather, I hope to show that these are systems that perpetuate power, through inclusion and exclusion, and the sorting logic of automated computation. The more we automate a system, the more the system is haunted by unseen decisions. I hope to excavate the context of decisions embedded within the system and examine the ways that power moves through it. Otherwise, the future of AI will be dictated by what can most easily be forgotten.  

Read part three here.

***

I would be remiss not to point to the excellent and abundant work on Flickr as a dataset that has been published by Katrina Sluis and Nicolas Malevé, whose work is cited here but merits a special thank you in shaping the thinking throughout this research project. I am also grateful to scholars such as Timnit Gebru, whose work on dataset auditing has deeply informed this work, and to Dr. Abeba Birhane, whose work on the content of the LAION 5B dataset has inspired this creative research practice. 

In the images accompanying this text, I’ve paired images created in Stable Diffusion 1.6 for the prompt “Flickr.com street shadows.” They’re paired with images from actual Flickr members. I did not train AI on these photos, nor did I reference the originals in my prompts. But by pairing the two, we can see the ways that the original Flickr photos might have formed the hazy structures of those generated by Stable Diffusion. 

The Ghost Stays in the Picture, Part 1: Archives, Datasets, and Infrastructures

“The Absence Becomes the Thing.”
– Rindon Johnson,
from The Law of Large Numbers

Every image generated by AI calls up a line of ghosts. They haunt the training data, where the contexts of photographs are reduced to the simplest of descriptions. They linger in the decisions of engineers and designers in what labels to use. The ghosts that haunt the generated image are hidden by design, but we can find them through their traces. We just need to know how to look.

As an artist, the images created by AI systems are rarely interesting to me solely as photographs. I find the absences that structure these images, and the stories told in the gaps, to be far more compelling. The images themselves recycle the tropes of their training data. By design, they lean into the most common patterns, changing the details like a lazy student changing the words of a plagiarized essay.

I don’t turn to generative AI for beautiful images. I look for evidence of ghosts.

What exactly is a ghost in an AI system? It’s a structure or decision that haunts information in barely discernible, even invisible, ways. Datasets are shaped by absences, and those absences shape the image. As a diffusion model seeks the path to an image, the absence of pathways constrains what is possible. We can read these paths by looking at AI images critically, addressing the negative space of what appears on our screens. Who are the people we don’t see? What are the stories these images cannot tell?

This can mean absences in representation. When we have thousands of photographs of white children tagged as “girls,” but few black children, black girls are absent from the images. Absence haunts the generated image, shaping it: we will see mostly white girls because black girls have been pushed to the edges. This is not just a glib example. The exact scenario is precisely what I found when I analyzed a dataset used for training image generation tools and automated surveillance systems in 2019. The pattern holds today. Victorian-era portraits of white girls are prevalent in the training data for generative AI systems such as Stable Diffusion. Black girls are absent, with highly sexualized images of adult women taking their place.

Infrastructure makes ghosts, too. We build complex systems one step at a time, like a set of intersecting hallways. Artificial Intelligence is, at its heart, a means of automating decisions. It carries decisions from the past into the future. Once we inscribe these decisions into code, the code becomes infrastructure, subsumed into a labyrinth made by assembling the code of others and code yet to be written. As we renovate these structures through new code or system upgrades, the logic of a particular path is lost. We may need to build new walls around it. But when we bury code, we bury decisions beneath a million lines of if/then statements, weights, and biases of machine learning. Unchallenged, the world that has slipped past us shapes the predictions of these systems in ways we cannot understand.

This is true of most data driven, automated systems, whether we are talking about resume filters or parole decisions. For the generated photograph, these decisions include how we test and calibrate image recognition systems, and how we iterate upon these systems with every new model launch and interface.

Diffusion models, at the core of image generation systems, are an entanglement of systems. They rely on one system to label images, examining how pixels are clustered and matching them with human descriptions. We relied on underpaid labor by humans to test these systems by comparing the results of that tool to what they saw themselves. These comparisons are recorded and integrated into the memory of the model. The actions of those people were fused into the infrastructure of the model, shaping decisions long after they stopped working on the dataset.

We tend to make up stories about synthetic images based on what’s depicted. That is the work of the human imagination: a way of making sense of media based on references we’ve seen before. That is a ghost story, too. But if we want to meet the ghosts that shape AI-generated images, we have to dig deeper into the systems that produce them. The AI ghost story is a story of the past reaching into the present, and to understand it, it helps to know the lineage of those decisions.

Image synthesis has a history, and that history lingers in black boxes of the neural nets as they shape noisy pixels into something recognizable. Part of that story is the datasets, but data is applied to a vast number of uses. One of those uses is training larger systems to sort and handle larger sums of data.

Data shapes data infrastructure. From small sets of data, patterns are found and applied to larger sums of data. These patterns are repeatedly invoked whenever we call upon these systems into the future. The source data is always an incomplete model of things. But nonetheless, it is applied to larger datasets, which inherit and amplify the gaps, absences, and decisions from the past.

This is part of my creative research work on the seance of the digital archive. It focuses not only on data, but the lineage of data and the decisions made using that data to shape larger systems. A key piece of this lineage, and one that merits deeper exploration, is Flickr.

The Archive and the Dataset

With the rise of generative AI, vast troves of visual, textual, and sonic cultural heritage data have been folded into models that create new images, text, even music. But images are host to a special kind of spectral illusion. Most images shared online were never intended to become “data,” and in many ways, this transformation into data is at odds with the real value at the heart of what these archives preserve.

What is the difference between an archive and a dataset? We are dealing with many levels of abstraction here: an archive consists of individual objects designed to serve some human purpose. These objects may then be curated into a collection. It may be a collection of pamphlets, political cartoons, or documentary photographs. It may be the work of an amateur photographer aiming to preserve a snapshot of a birthday party where their daughter and granddaughter celebrated alongside one another. Flickr, as a photo-sharing website, is host to all of these. The miracle of data, compression, and the world wide web is that the same infrastructures can be shared for moments important to “history” but also to the individual. It preserves images from cultural heritage institutions and family beach outings alike.

Flickr is three things at once: an archive and a dataset, most famously. But it is also a kind of data infrastructure. Let’s explore these one by one.

Flickr is an archive. It is a website that preserves history. It holds digital copies of historical artifacts for individual reflection and context. Flickr is a website for memories, stored in its copies of images, snapshots, aids to the remembrance of personal stories. These are assembled into an archive, a collective photo album. Flickr as an archive is a place where the context of an individual item is preserved. But we make sense of this archive socially. Meanings change as users sort these images, tag them, and reuse them (with permission) across the web. The archive is a collection of images with their own history beyond the website itself.

Flickr is a dataset. Flickr images can be described, at scale, in pure numbers. In 2011, the web site claimed to have 6 billion images and more recently boasted of having “tens of billions” of photos, with estimates of 25 million uploads per day. By contrast, the largest widely used image dataset in machine learning, LAION 5B, contains roughly 5.85 billion images. Flickr as a massive, expanding dataset poses a particular set of challenges in thinking about its future. One of these is the daunting task of sorting and understanding all of those images. The dataset, then, is really just the archive viewed through the abstraction of scale. Billions of images are now seen as one dataset, with each image merely a piece of the collective whole. As a dataset, we focus on the ways the entirety of that set can be preserved and understood.

But in shifting our lens of focus from an archive to a dataset, individual details become less important. In changing scales in this way, it’s important to move fluidly between them — much as we close one eye, then the other, as we look at the letters of the eye exam. If we want to tackle the myopia of design decisions, we must get used to shifting between these two views, rather than treating one as the sole way we see the world.

What does it mean for Flickr to be “infrastructure” for AI? It helps to define this slippery term, so I turn to a definition used by the Initiative for Public Digital Infrastructure at UMass Amherst:

“Infrastructures are fundamental systems that allow us to build other systems—new houses and businesses rely on the infrastructures of electric power lines, water mains, and roads—and infrastructures are often invisible so long as they work well.”

In the relationship to images in particular, Katrina Sluis describes the shift in meaning attributed to images as their context shifts from archive to data infrastructures:

“Photographic culture is now being sustained by a variety of agents who sit outside the traditional scope of cultural institutions. I’m thinking here of the computer scientist, web designer, Silicon Valley entrepreneur or Amazon Mechanical Turker. And of course, these are actors who are not engaged with photographic culture and the politics of representation, the history of photography or the inherent polysemy of the image. In the computer science lab, the photograph remains relatively uncomplicated – it is ultimately a blob of information – whether materialized as a “picture” or left latent as data.”

Flickr’s images played an important role in shaping image recognition systems at the outset, and in turn, image generation systems. As a result of this entrenchment of images into AI, many Flickr images have become a form of “accidental infrastructure” for AI. I should be clear that Flickr has not trained a generative AI model of its own for the production of new images, nor has it arranged, as of this writing, for the sale of images for use in image training.

When we examine Flickr as infrastructure, we will see that these two worlds — archive and dataset — have come to occupy the same space, complicating our understanding of them both. Flickr’s movement from archive to dataset in the eyes of AI training isn’t permanent. It reflects a historical shift in the ways people understand and relate to images. So it is worth exploring how that shift changes what we see, and how ghosts from the archive come to “haunt” the dataset. In establishing these two lenses of focus, we might find strategies of shifting between the two. This can help us better articulate the context of images that have built, and likely will continue to build, the foundations of generative AI systems and the images these systems produce.

Flickers in the Infrastructure

How did Flickr’s transition from archive to dataset allow it to become a piece of AI infrastructure?

It started with one of the first breakthroughs in AI image generation: StyleGAN 2. StyleGAN 2 could produce images of human faces that were nearly photorealistic. This was a direct result of the FFHQ dataset, which NVIDIA built from 70,000 portraits of faces found on Flickr. Notably, NVIDIA warned that the dataset would inherit Flickr’s biases. The FFHQ dataset went on to be used for image labeling and face recognition technologies, too.

We can easily trace the influence of that dataset on the faces StyleGAN 2 produced. In 2019, I did my own analysis of that dataset, looking image by image at the collection. In so doing, I examined the dataset through the lens of an archive. I looked at it as a collection of individual photographs, and individual people. I discovered that fewer than 3% of the faces I sampled from the dataset were those of black women. As a result, the image model was less likely to generate faces of black women. When it did, they were less photorealistic than other faces. The absences were shaping what we saw.

If datasets are haunted, then the synthetic image is a seance: a way of generating a specter from the datasets. The word specter refers to both the appearance of a spirit and the appearance of an image, deriving from the Latin spectrum. The synthetic image is a specter. It’s an image which appears from an unknown place. It is one slice from a spectrum of possible images associated with a prompt. Absences in the dataset constrained the output of possible images. With so few black women in the dataset, black women were largely missing from the images. This is one way absences can haunt the synthetic image.

But there is another case study worth exploring: the ways that Flickr haunts the infrastructures of AI. How did the dataset shape the automated decision-making processes that were then included in longer, more complex systems of image generation?

In part two of this blog post, we’ll look at YFCC100M, a dataset of 99.2 million photos released in June 2014. And we’ll look at the path it has taken as the world’s relationship to this collection of Flickr images moved from images, to an archive, to a dataset. Along the way, we’ll see how that dataset, by becoming a go-to reference for calibration and testing of image recognition and synthesis, became infused into the infrastructures of generated images.