From Desiderata to READMEs: The case for a C.A.R.E.-full Data Lifeboat Pt. II

by Fattori McKenna

This is the second installment of a two-part blog post where we detail our thinking around ethics and the Data Lifeboat README function. In this post, we’ll discuss the outcomes of the exercise set during our Mellon co-design workshops, where we asked participants to help design the README prompts with C.A.R.E./F.A.I.R. in mind.

As detailed in our previous blog post, a README is a file type commonly used in software development and distribution to inform users about the files contained in a directory. We have adapted this container for the purposes of the Data Lifeboat to add detail and description beyond what is local to the files on the Flickr.com platform.

 

A README for Anti-Accumulation and Conscious Selection Practices

The inclusion of a README encourages Data Lifeboat creators to slow down and practice conscious description. We are resolute that the Data Lifeboat shouldn’t become another bulk download tool—a means of large-scale data transfer from Flickr.com to other platforms or data banks (which would risk the images becoming detached from their original context or sharing settings). Instead, the README gives us the opportunity to add detail, nuance, and context by writing about the collection as a whole or the images it contains. This is particularly important for digital cultural heritage datasets, as images frequently arrive in archival collections without context, or are digitised and uploaded without adequate metadata (whether through a lack of information or a lack of staff resources).

In the current prototype, the prompts for the README are as follows:

  • Tell the future why you are making this Data Lifeboat.
  • Is there anything special you’d like future viewers to know about the contents? Anything to be careful about?

These questions are a good start for getting Data Lifeboat creators to think about the intentions and future reception of the contents or collection. We may wish, however, to add further structure to these questions to craft a sort of creation flow.

You can read more about the current working prototype in Alex’s recent blog-post.

Screenshots from the working prototype:

Learning from pre-existing G.L.A.M. frameworks

During our Mellon workshops, we asked our participants to support the co-design of the README creation flow, with the aim of prompting Data Lifeboat creators to think along the lines of C.A.R.E. and F.A.I.R. principles (detailed in our Part 1 blog post). Many of our workshop participants were already working within G.L.A.M. (Galleries, Libraries, Archives, Museums) institutions and had experience grappling with legal frameworks and ethical responsibilities related to their collections.

In preparation for the workshop, we asked representatives to bring copies of their organisations’ (Digital) Deposit Agreements. These are formal, legal documents that outline the terms and conditions under which an object or collection is temporarily placed in the custody of the museum by a depositor. The agreement governs the responsibilities, rights, and obligations of both the depositor and the museum during the period of the deposit. These agreements often include:

  • Description of contents
  • Purpose of deposit
  • Duration of deposit
  • Custodial requirements (e.g. care, storage, maintenance)
  • Liability
  • Rights & Restrictions (e.g. display, reproduction, transference)
  • Dispute Resolution

Digital Deposit Agreements, for the bequeathing of digital contents (such as scanned photographs, research datasets, email records, and digital artworks), may include specifications of format, versioning, metadata, digital access, technical infrastructure, and cybersecurity.

The Data Lifeboat README flow could be modelled on some of the categories that frequently arise in Digital Deposit Agreements. However, we also felt there are unique considerations around social media collecting (the intended content of the Data Lifeboat) that we ought to design for. In particular, how might we encourage Data Lifeboat creators to consider the implications of their collection across the four user groups:

  1. Flickr Members
  2. Data Lifeboat Creators
  3. Safe Harbor Dock Operators
  4. Photo Subjects

As well as a speculative fifth user, the Future Viewer.

Co-Designing the README Flow

After exploring the current benefits and limitations of existing Digital Deposit Agreements, we invited our workshop participants to engage in a hands-on exercise designed to refine the README feature for the Data Lifeboat. The prompt was:

What prompts or questions for Data Lifeboat creators could we include in the README to help them think about C.A.R.E. or F.A.I.R. principles? Try to map each question to a letter.

To support the exercise, we provided printouts of the C.A.R.E. and F.A.I.R. principles as reference material, encouraging participants to ground their questions in these frameworks. Each participant created a list of potential prompts individually before we reconvened for a group discussion. This collaborative step surfaced shared concerns and recurring themes, which we have organised below, along with some sample questions as they could appear in the README creation flow.

Possible README Questions

We have broken down the README questions into eight themes, which follow a logical and temporal flow from present considerations to future horizons.

  1. Purpose and Compilation

    Why it matters: clearly defining the purpose and methods for compiling the photos in the Data Lifeboat prompts creators to reflect on their motivations and intentions.

    Sample questions: 

    • What is the purpose of this Data Lifeboat?
    • Why are you creating it?
    • Why is it important that these photos are preserved?
    • What procedures were used to collect the photos in this Data Lifeboat?
    • Who was involved?
    • Is this Data Lifeboat a part of a larger dataset?

  2. Copyright and Ownership

    Why it matters: clear documentation of authorship and ownership (wherever possible) can protect creators’ rights, or at a minimum ensure attribution.

    Sample questions: 

    • Do you own all the rights to the images in this Data Lifeboat?
    • If you did not take these photographs, has the photographer given permission for them to be included in a Data Lifeboat?
    • Do you know where the photos came from before Flickr?
    • Would you be willing to relinquish your rights under any specific circumstances?

  3. Privacy and Consent

    Why it matters: respecting privacy and obtaining consent (where possible) are critical safeguards for the dignity and rights of the represented, particularly important for sensitive content or at-risk communities.

    Sample questions:

    • Is there any potentially personally identifiable data or sensitive information in this Data Lifeboat (either yours or someone else’s) that you wouldn’t want someone to see?
    • Tell us who is in the photos. Are they aware they are being included in this Data Lifeboat? Could they reasonably have given consent (if not, why not)?

  4. Context and Description

    Why it matters: providing rich, contextual information (which the free text input allows for) can help supplement existing collections with missing information, as well as helping to avoid misinterpretation or detachment from origins.

    Sample questions:

    • Can you add context or description to any image(s) in this Data Lifeboat?
    • Could you add context to comments, tags or groups within this Data Lifeboat?
    • Do the titles and descriptions accurately reflect the photos in this Data Lifeboat?
    • What do you think is important about the image(s) that you want a future viewer of the Data Lifeboat to understand?

  5. Ethics and Cultural Sensitivities

    Why it matters: we have the opportunity to append an ethical layer to historically unjust collections by giving space for Data Lifeboat creators to write how the images should be viewed, understood and used [more on previous interventions here].

    Sample questions:

    • Are you a member of the community this Data Lifeboat is depicting?
    • If not, have you thought about the implications for the community?
    • Are there potential sensitivities in the information stored in this Data Lifeboat that future viewers should be aware of?
    • Are there historical or current harms enacted in this material within the Data Lifeboat (if so, would you like to explain these to a future viewer)?
    • Could the content of this Data Lifeboat be used in harmful ways by bad actors? If so, should you still include it?

  6. Future Access and Use

    Why it matters: conditions or requests for future access and use of the Data Lifeboat collection cannot be enforced, but outlining them can at least serve as guardrails.

    Sample questions:

    • Is this Data Lifeboat just for you, or could it be public one day?
    • Who will you be sharing this Data Lifeboat with? Could you nominate a ‘trusted friend’ to also keep a copy of this Data Lifeboat?
    • Would you like to see the Data Lifeboat collection returned to the community (if so, who to)?
    • What should or should not be done with this Data Lifeboat and its contents?

  7. Storage and Safe Harbors

    Why it matters: recording (or suggesting) where the Data Lifeboat could end up prompts creators to think about future viewers or stakeholders. There may be the possibility, at point of creation, to designate a Safe Harbor location and its (desired) conditions for storage.

    Sample questions:

    • Where will this Data Lifeboat go once downloaded?
    • Who would you like to notify about the creation of this Data Lifeboat? Is there anyone you’d like to send a copy to?
    • Where would you ideally like this Data Lifeboat “docked”?

  8. Long-term Vision

    Why it matters: articulating (or attempting to articulate) a long-term vision can help ensure the Data Lifeboat remains meaningful and intact for as long as possible, even after the creator is no longer directly involved.

    Sample questions:

    • Have you included enough information about this Data Lifeboat so you would remember why you made it?
    • Would someone else know why you made it?
    • What/who might represent your interests in this Data Lifeboat after you are no longer around?
    • For whom are you saving these materials in the Data Lifeboat, and what do you want them to understand?

 

Conclusion: README as Cipher?

With these questions in mind, we have a unique opportunity to embed conscious curation, guided by C.A.R.E. and F.A.I.R. principles, into the long-term digital preservation of Flickr.com content. The README prompts Data Lifeboat creators to thoughtfully consider the ethical dimensions of their future collections.

This deliberate slowness or friction in the process is, we believe, a strength. However, it is equally important that writing a README is an enjoyable and engaging experience. Inventing a pleasant yet inquiring process will be central to the next phase of our work, where service design will play a key role. We will need to decide where to strike the balance between a README flow that is thought-provoking and poetic versus one that is instructive and immediately legible.

While the extent to which the README’s conditions can be made machine-readable remains an open question, there may be ways to encode conditions for use or distribution into the Data Lifeboat itself, or the Safe Harbor Network. It is essential to acknowledge, however, that a README is not a binding promise—it cannot guarantee the fulfilment of all wishes for all time. At the very least, it represents a thoughtful attempt to record the creator’s intentions.

The README serves as a cipher for future reception and reconstruction, a snapshot of the present, and a gift to future explorers.

From Desiderata to READMEs: The case for a C.A.R.E.-full Data Lifeboat Pt. I

By Fattori McKenna

This is the first of a two-part blog post where we detail our thinking around ethics and the Data Lifeboat README function. In this blog-post we reflect on the theoretical precursors and structural interventions that inform our approach. We specifically question how these dovetail with the dataset we are working with (i.e. images on Flickr.com) and the tool we’re developing, the Data Lifeboat. In part 2 (now available), we will detail the learnings from our ethics session at the Mellon co-design workshops and how we plan to embed these into the README feature.

Spencer Baird, the American naturalist and first curator of the Smithsonian Institution, instructed his collectors in ‘the field’ what to collect, how to describe it and how to preserve it until returning Eastwards, carts laden. His directions included:

Birds and mammalia larger than a rat should be skinned. For insects and bugs — the harder kinds may be put in liquor, but the vessels and bottles should not be very large. Fishes under six inches in length need not have the abdominal incision… Specimens with scales and fins perfect, should be selected and if convenient, stitched or pinned in bits of muslin to preserve the scales. Skulls of quadrupeds may be prepared by boiling in water for a few hours… A little potash or ley will facilitate the operation.

Baird’s 1848 General Directions for Collecting and Preserving Objects of Natural History is an example of a collecting guide, also known at the time as a desiderata (literally ‘desired things’). It is this archival architecture that Hannah Turner (2021) takes critical aim at in Cataloguing Culture: Legacies of Colonialism in Museum Documentation. According to Turner, Baird’s design “enabled collectors in the field and museum workers to slot objects into existing categories of knowledge”.

Whilst the desiderata prompted the diffuse and amateur spread of collecting in the 19th century, no doubt flooding burgeoning institutional collections with artefacts from the so-called ‘field’, the input and classification systems these collecting guides held came with their own risks. Baird’s 1848 desiderata shockingly includes human subjects—Indigenous people—perceived as extensions of the natural world and thus procurable materials in a concerted attempt to both Other and historicise. Later collecting guides would be issued for indigenous tribal artefacts, such as the Haíłzaqv-Haida Great Canoe – now in the American Museum of Natural History’s Northwest Coast Hall – as well as capturing intangible cultural artefacts – as documented in Kara Lewis’ study of the 1890 collection of Passamaquoddy wax recording cylinders used for tribal music and language. But Turner pivots our focus away from what has been collected, and instead towards how these objects were collected, explaining, “practices and technologies, embedded in catalogues, have ethical consequences”.

While many physical artefacts have been returned to Indigenous tribes through activist-turned-institutional measures (such as the repatriation of Iroquois Wampum belts from the National Museum of the American Indian or the Bååstede project returning Sami cultural heritage from Norway’s national museums), the logic of the collecting guides remains. Two centuries later, the nomenclature and classification systems from these collecting guides have been largely transposed into digital collection management systems (CMS), along with digital copies of the objects themselves. Despite noteworthy efforts to provide greater access and transparency through F.A.I.R. principles, or to rewrite and reclaim archival knowledge systems through Traditional Knowledge (T.K.) Labels and C.A.R.E. principles, Kara Lewis (2024) notes that “because these systems developed out of the classification structures before them, and regardless of how much more open and accessible they become, they continue to live with the colonial legacies ingrained within them”. The slowness of the Galleries, Libraries, Archives and Museums (G.L.A.M.) sector to adapt, Lewis continues, stems less from “an unwillingness to change, and more with budgets that do not prioritize CMS customizations”. Evidently a challenge lies in the rigidly programmed nature of rationalising cultural description for computational input.

In our own Content Mobility programme, the Data Lifeboat project, we propose that creators write a README. In our working prototype, the input is an open-text field, allowing creators to write as much or as little as they wish about their Data Lifeboat’s purpose, contents, and future intentions. However, considering Turner’s cautionary perspective, we face a modern parallel: today’s desiderata is data, and the field is the social web—deceptively public for users to browse and “Right-Click-Save” at will. We realised that in designing the input architecture for Data Lifeboats, we could inadvertently be creating a 21st century desiderata: a seemingly open and neutral digital collecting tool that beneath the surface risks perpetuating existing inequalities.

This blog post will introduce the theoretical and ethical underpinnings of the Data Lifeboat’s collecting guide, or README, that we want to design. The decades of remedy and reconciliatory work, driven tirelessly and primarily by Indigenous rights activists, in addressing the archival injustices first cemented by early collecting guides provide a robust starting point for embedding ethics into the Data Lifeboat. Indigenous cultural heritage inevitably exists within Flickr’s collections, particularly among our Flickr Commons members who are actively pursuing their own reconciliation initiatives. Yet the value of these interventions extends beyond Indigenous cultural heritage, serving as a foundation for ethical data practices that benefit all data subjects in the age of Big Data.

A Brief History of the C.A.R.E. Principles

Building on decades of Indigenous activism and scholarship in restitution and reconciliation, the C.A.R.E. principles emerged in 2018 from a robust lineage of interventions, such as the Native American Graves Protection and Repatriation Act (NAGPRA, 1990) and the United Nations Declaration on the Rights of Indigenous Peoples (UNDRIP, 2007), which sought to recognise and restore Indigenous sovereignty over tangible and intangible cultural heritage.

These earlier frameworks were primarily rooted in consultation processes with Indigenous communities, ensuring that their consent and governance shaped the management of artefacts and knowledge systems. For instance, NAGPRA enabled tribes to reclaim human remains and sacred objects through formalised dialogues and consultation sessions with museums. Similarly, Traditional Knowledge Labels (Local Contexts Initiative) were designed to identify Indigenous protocols for accessing and using knowledge within a museum’s existing collection; for instance, a tribal object may be reserved for viewing only by female tribal members. These methods worked effectively within the domain of physical collections but faltered when confronted with the scale and opaqueness of data in the digital age.

In this context, Indigenous governance of data emerged as essential, particularly for sensitive datasets such as health records, where documented misuse had perpetuated harm. As the Data Science field developed, it often prioritised the technical ideals of the F.A.I.R. principles (Findable, Accessible, Interoperable, Reusable), which advocate for improved usability and discoverability of data to counter increasingly opaque and privatised resources. Though valuable, the F.A.I.R. principles fell short on the ethical dimensions of data, particularly on how data is collected and used in ways that affect already at-risk communities (see also O’Neil 2016, Eubanks 2018, and Benjamin 2019). As the Global Indigenous Data Alliance argued:

“Mainstream values related to research and data are often inconsistent with Indigenous cultures and collective rights”

Recognising the challenges posed by Big Data and Machine Learning (ML)—from entrenched bias in data to the opacity of ML algorithms—Indigenous groups such as the Te Mana Raraunga Māori Data Sovereignty Network, the US Indigenous Data Sovereignty Network, and the Maiam nayri Wingara Aboriginal and Torres Strait Islander Data Sovereignty Collective led efforts to articulate frameworks for ethical data governance. These efforts culminated in a global, inter-tribal workshop in Gaborone, Botswana, in 2018, convened by Stephanie Russo Carroll and Maui Hudson in collaboration with the Research Data Alliance (RDA) International Indigenous Data Sovereignty Interest Group. The workshop formalised the C.A.R.E. principles, which were published by the Global Indigenous Data Alliance in September 2019 and proposed as a governance framework with people and purpose at its core.

The C.A.R.E. principles foreground the following four values around data:

  1. Collective Benefit: Data must enhance collective well-being and serve the communities to which it pertains.
  2. Authority to Control: Communities must retain governance over their data and decide how it is accessed, used, and shared.
  3. Responsibility: Data handlers must minimise harm and ensure alignment with community values.
  4. Ethics: Ethical considerations rooted in cultural values and collective rights must guide all stages of the data lifecycle.

C.A.R.E. in Data Lifeboats?

While the C.A.R.E. principles were initially developed to address historical data inequities and exploitation faced by Indigenous communities, they offer a framework that can benefit all data practices: as the Global Indigenous Data Alliance argues, “Being CARE-Full is a prerequisite for equitable data and data practices.”

We believe the principles are important for Data Lifeboat, as collecting networked images from Flickr poses the following complexities:

  • Data Lifeboat creators will be able to include images from Flickr Commons members (which may include images of culturally sensitive content)
  • Data Lifeboat creators may be able to include images from other Flickr members, besides themselves
  • Subjects of photographs in a Data Lifeboat may be from historically at-risk groups
  • Data Lifeboats are designed to last and therefore may be separated from their original owners, intents and contexts.

The Global Indigenous Data Alliance asserts that their principles must guide every stage of data governance “from collection to curation, from access to application, with implications and responsibilities for a broad range of entities from funders to data users.” The creation of a Data Lifeboat is an opportunity to create a new collection, and thus we have the opportunity to embed C.A.R.E. principles from the start. Although we cannot control how Data Lifeboats will be used or handled after their creation, we can attempt to establish an architecture that encourages C.A.R.E. to be deployed throughout the data lifecycle.

Enter: The README

Our ambition for the Data Lifeboat (and the ethos behind many of Flickr.org’s programmes) is the principle of “conscious collecting”. We aim to move away from the mindset of perpetual accumulation that plagues both museums and Big Tech alike—a mindset that advances a dangerous future, as cautioned by both anti-colonialist and environmentalist critiques. Conscious collecting allows us to better consider and care for what we already have.

One of the possible ways we can embed conscious collecting is through the inclusion of a README—a reflective, narrative-driven process for creating a Data Lifeboat.

READMEs are files traditionally used in software development and distribution to describe the files within a directory. They are usually plain text (.txt, .md) to maximise readability, frequently contain operating instructions, troubleshooting notes, credits, licensing and changelogs, and are intended to be read first. In the Data Lifeboat, we have adopted this container to supplement the files. Data Lifeboat creators are introduced to the README in the creation process and, in the present prototype, are met with the following prompts to assist writing:

  • Tell the future why you are making this Data Lifeboat.
  • Is there anything special you’d like future viewers to know about the contents? Anything to be careful about?

(These prompts are not fixed, as you’ll read in Part 2)

During our workshops, participants noted the positive (and rarely seen) experience of introducing friction to data preservation. This friction slows down the act of collecting and creates space to engage with the social and ethical dimensions of the content. As Christen & Anderson (2019) emphasise in their call for Slow Archives, “Slowing down creates a necessary space for emphasising how knowledge is produced, circulated, and exchanged through a series of relationships”. We hope that Data Lifeboat’s README will contribute to Christen & Anderson’s invocation for the “development of new methodologies that move toward archival justice that is reparative, reflective, accountable, and restorative”.

We propose three primary functions of the README in a Data Lifeboat:

  1. Telling the Story of an Archive

    Boast, Bravo, and Srinivasan (2007), reflecting on Inuit artefacts in an institutional museum collection, write that their transplantation results in the deprivation of the “richly situated life of objects in their communities and places of origin.” Once subsumed into a collection, artefacts often suffer the “loss of narrative and thick descriptions when transporting them to distant collections”.

    We are conscious that this could be the fate of many images once transplanted in a Data Lifeboat. Questions emerged in our workshops as to how to maintain the contextual world around the object, speaking of not only its social metadata (comments, tags, groups, albums) but also the more personal levers of choice, value and connection. A README resists the diminishment of narrative by creating opportunities to retain and reflect on the relational life of the materials.

    The README directly resists the archival instinct toward neutrality; by its very format it holds that such neutrality can never exist. Boden critiques the paucity of current content management systems: their highly structured input formats cannot meet our responsibilities to communities because they do not give space to fully citing how information came to be known and associated with an object, and on whose authority. Boden argues for “reflections on the knowledge production process”, which is what we intend the README to encourage the Data Lifeboat creator to do. The README prompts could suggest that the Data Lifeboat creator reflect on issues around ownership (e.g. is this your photo?), consent (e.g. were all photo subjects able to consent to inclusion in a Data Lifeboat?), and embedded power relations (e.g. are there any persecuted minorities in this Data Lifeboat?): acknowledging that the archive is never objective.

    More poetically, the README could prompt greater storytelling, serving as a canvas for both critical and emotional reflection on the content of a Data Lifeboat. Through guided prompts, creators could explore their personal connections to the images, share the stories behind their selection process, and document the emotional resonance of their collection. A README allows creators to capture and contextualise not only the images themselves, but to add layers of personal inscription and meaning, creating a richer, more distributed archive.

  2. Decentralised and Distributed Annotation

    The Data Lifeboat constitutes a new collecting format that intends to operate outside traditional archival systems’ rigid limitations and universalising classification schemes. The README encourages decentralised curation and annotation by enabling communities to directly contribute to selecting and contextualising archival and contemporary images, fostering what Huvila (2008) terms the ‘participatory archive’ [more on Data Lifeboat as a tool for decentralised curation here].

    User-generated descriptions such as comments, tags, groups, and galleries—known on Flickr as ‘social metadata’—serve as “ontological keys that unlock the doors to diverse, rich, and incommensurable knowledge communities” (Boast et al., 2007), embracing multiple ways of knowing the world. Together, these create ‘folksonomies’—socially-generated digital classification systems that David Sturz argues are particularly well-suited to “loosely-defined, developing fields,” such as photo subjects and themes often overlooked by the institutional canon. The Data Lifeboat captures the rich social media that is native to Flickr, preserving decades’ worth of user contributions.

    The success of community annotation projects has been well-documented. The Library of Congress’s own Flickr Pilot Project demonstrated how community input enhanced detail, correction, and enrichment. As Michelle Springer et al. (2008) note, “many of our old photos came to us with very little description and that additional description would be appreciated”. Within nine months of joining Flickr, committing to a hands-off approach, the Library of Congress accumulated 67,000 community-added tags. “The wealth of interaction and engagement that has taken place within the comments section has resulted in immediate benefits both for the Library and users of the collections,” continue Springer et al. After staff verification, these corrections and additions to captions and titles demonstrated how decentralised annotation could reshape the central archive itself. As Laura Farley (2014) observes, community annotation “challenges archivists to see their collections not as closely guarded property of the repository, but instead as records belonging to a society of users”.

    Beyond capturing existing metadata, the README enables Data Lifeboat creators to add free-form context, such as correcting erroneous tags or clarifying specific terminology that future viewers might misinterpret—like the Portals to Hell group. As Duff and Harris (2002) write, “the power to describe is the power to make and remake records and to determine how they will be used and remade in the future. Each story we tell about our records, each description we compile, changes the meaning of records and recreates them” — the README hands over the narrative power to describe.

  3. Data Restitution and Justice

    Thinking speculatively, the README could serve an even more transformative purpose as a tool for digital restitution. Through the Data Lifeboat framework, communities could reclaim contested archival materials and reintegrate them into their own digital ecosystems. This approach aligns with “Steal It Back” initiatives (Rivera-Carlisle, 2023) such as Looty, which creates digital twins of contested artefacts currently held in Western museums. By leveraging digital technologies, these initiatives counter the slow response of GLAM institutions to restitution calls. As Pavis and Wallace (2023) note, digital restitution offers the chance to “reverse existing power hierarchies and restore power with the peoples, communities, and countries of origin”. In essence, this offers a form of “platform exit” that carves an alternative avenue of control over content for original creators or communities, regardless of who initially uploaded the materials. In an age of encroaching data extractivism, the power to disengage, though a severe step, can for at-risk communities be the “reassertion of autonomy and agency in the face of pervasive connectivity” (Kaun and Treré, 2021).

    It is a well-documented challenge in digital archives that many of the original uploaders were not the original creators, which ought to prompt reflections around copyright and privacy. As Payal Arora (2019) has noted, our dominant frameworks largely ignore empirical realities of the Global South: “We need to open our purview to alternative meanings including paying heed to the desire for selective visibility, how privacy is often not a choice, and how the cost of privacy is deeply subjective”. Within the README, Data Lifeboat creators can establish terms for their collections, specifying viewing contexts, usage conditions, and other critical contextual information. They can also specify restrictions on where and how their images may be hosted or reused in the future (e.g. ‘I refuse to let these images be used in AI training data sets’). A README could thus allow Data Lifeboat creators to set out more fluid, culturally specific and context-dependent conditions for privacy and re-use.

    At best, these terms would allow Data Lifeboat creators to articulate their preferences for how their materials are accessed, interpreted and reused in the future, functioning as an ethical safeguard. While these terms may not always be enforceable, they provide a clear record of the creators’ intentions. Looking ahead, we could envision the possibility of making these terms machine-readable and executable, as sketched below. Sustaining these terms could potentially be incorporated into the governance framework of the Safe Harbor Network, our proposed decentralised storage system of cultural institutions that can hold Data Lifeboats for the long term.
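Purely as a speculative sketch of what machine-readable terms could look like (nothing like this exists in the current prototype, and every field name below is invented for illustration), the creator’s conditions might one day sit alongside the free-text README as a small structured file:

    {
      "terms": {
        "written_by": "the Data Lifeboat creator",
        "allow_public_display": true,
        "allow_ai_training": false,
        "preferred_safe_harbor": "a nominated institution or community",
        "free_text_conditions": "See the README for the creator's full wishes and context"
      }
    }

Software run by a Safe Harbor operator could then check a file like this before serving or copying a Data Lifeboat, while the free-text README would remain the authoritative statement of the creator’s intent.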

Discussion: README as a Datasheet for Networked Social Photography Data Sets?

In the long history of cataloging and annotating data, Timnit Gebru et al.’s (2018) Datasheets for Datasets stands out as an emerging best practice for the machine learning age. These datasheets provide “a structured approach to the description of datasets,” documenting provenance, purpose, and ethical considerations. By encouraging creators to critically reflect on the collection, composition, and application of datasets, datasheets foster transparency and accountability in an otherwise vast, opaque, and voraciously consuming sphere.

The Digital Cultural Heritage space has made similar calls for datasheets in archival contexts, as they too handle large volumes of often uncontextualised and culturally sensitive data. As Alkemade et al. (2023) note, cultural heritage data is unique: “They are extremely diverse by nature, biased by definition and hardly ever created or collected with computation in mind”. They argue, “In contrast to industrial or research datasets that are assembled to create knowledge… cultural heritage datasets may present knowledge as it was fabricated in earlier times, or community-based knowledge from lost local contexts”. Given this uniqueness, digital cultural heritage requires a tailored datasheet format that enables rich, detailed contextualization reflecting both the passage of time and potentially lost or inaccessible meanings. Just as datasheets have transformed technical datasets, the README has the potential to reshape how we collect, interpret, and preserve the networked social photography that is native to the Flickr.com platform — something we argue is part of our collective digital heritage.

There are, of course, limitations—neither datasheets nor READMEs will be a panacea for C.A.R.E.-full data practices. Gebru et al. acknowledge that “Dataset creators cannot anticipate every possible use of a database”. The descriptive approach also presents possible trade-offs: “identifying unwanted societal biases often requires additional labels indicating demographic information about individuals,” which may conflict with privacy or data protection. Gebru notes that the Datasheet “will necessarily impose overhead on dataset creator”—we recognise this friction as a positive, echoing Christen and Anderson’s call: “Slowing down is about focusing differently, listening carefully, and acting ethically”.

Conclusion

Our hope is that the README is both a reflective and instructive tool that prompts Data Lifeboat Creators to consider the needs and wishes of each of the four main user groups in the Data Lifeboat ecosystem:

  1. Flickr Members
  2. Data Lifeboat Creators
  3. Safe Harbor Dock Operators
  4. Subjects in the Photo

While we do not yet know precisely what form the README will take, we hope our iterative design process can offer flexibility to accommodate the needs of—and our responsibilities to—Data Lifeboat creators, photographic subjects and communities, and future viewers.

In our Mellon-funded Data Lifeboat workshops in October and November, we asked our participants to support us in co-designing a digital collecting tool with care in mind. We asked:

What prompts or questions for Data Lifeboat creators could we include in the README to help them think about C.A.R.E. or F.A.I.R. principles? Try to map each question to a letter.

The results of this exercise and what this means for Data Lifeboat development will be detailed in Part 2.

 

The photographs in this blog post come from the Smithsonian Institution’s Thomas Smillie Collection (Record Unit 95) – Thomas Smillie served as the first official photographer for the Smithsonian Institution from 1870 until his death in 1917. As head of the photography lab as well as its curator, he was responsible for photographing all of the exhibits, objects, and expeditions, leaving an informal record of early Smithsonian collections.

Bibliography

Alkemade, Henk, et al. “Datasheets for Digital Cultural Heritage Datasets.” Journal of Open Humanities Data, vol. 9, 2023, doi:10.5334/johd.124.

Arora, Payal. “Decolonizing Privacy Studies.” Television & New Media, vol. 20, no. 4, 26 Oct. 2018, pp. 366–378, doi:10.1177/1527476418806092.

Baird, Spencer. “General Directions for Collecting and Preserving Objects of Natural History”, c. 1848, Dickinson College Archives & Special Collections

Benjamin, Ruha. Race After Technology: Abolitionist Tools for the New Jim Code. Polity, 2019.

Boast, Robin, et al. “Return to Babel: Emergent Diversity, Digital Resources, and Local Knowledge.” The Information Society, vol. 23, no. 5, 27 Sept. 2007, pp. 395–403, doi:10.1080/01972240701575635.

Boden, Gertrud. “Whose Information? What Knowledge? Collaborative Work and a Plea for Referenced Collection Databases.” Collections: A Journal for Museum and Archives Professionals, vol. 18, no. 4, 12 Oct. 2022, pp. 479–505, doi:10.1177/15501906221130534.

Carroll, Stephanie Russo, et al. “The CARE Principles for Indigenous Data Governance.” Data Science Journal, vol. 19, 2020, doi:10.5334/dsj-2020-043.

Christen, Kimberly, and Jane Anderson. “Toward Slow Archives.” Archival Science, vol. 19, no. 2, 1 June 2019, pp. 87–116, doi:10.1007/s10502-019-09307-x.

“Digital Media Activism: A Situated, Historical, and Ecological Approach Beyond the Technological Sublime.” Digital Roots, by Emiliano Treré and Anne Kaun, De Gruyter Oldenbourg, 2021.

Duff, Wendy M., and Verne Harris. “Stories and Names: Archival Description as Narrating Records and Constructing Meanings.” Archival Science, vol. 2, no. 3–4, Sept. 2002, pp. 263–285, doi:10.1007/bf02435625.

Eubanks, Virginia. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. Picador, 2018.

Gebru, Timnit, et al. “Datasheets for Datasets.” Communications of the ACM, vol. 64, no. 12, 19 Nov. 2021, pp. 86–92, doi:10.1145/3458723.

Griffiths, Kalinda E., et al. “Indigenous and Tribal Peoples Data Governance in Health Research: A Systematic Review.” International Journal of Environmental Research and Public Health, vol. 18, no. 19, 10318, 30 Sep. 2021, doi:10.3390/ijerph181910318.

Lewis, Kara. “Toward Centering Indigenous Knowledge in Museum Collections Management Systems.” Collections: A Journal for Museum and Archives Professionals, vol. 20, no. 1, Mar. 2024, pp. 27–50, doi:10.1177/15501906241234046.

O’Neil, Cathy. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Penguin, 2017.

Pavis, Mathilde, and Andrea Wallace. “Recommendations on Digital Restitution and Intellectual Property Restitution.” SSRN Electronic Journal, 2023, doi:10.2139/ssrn.4323678.

Rivera-Carlisle, Joanna. “Contextualising the Contested: XR as Experimental Museology.” Herança, vol. 6, no. 1, 2023, doi:10.52152/heranca.v6i1.676.

Schaefer, Sibyl. “Energy, Digital Preservation, and the Climate: Proactively Planning for an Uncertain Future.” iPRES 2024 Papers – International Conference on Digital Preservation. 2024.

Shilton, Katie, and Ramesh Srinivasan. “Participatory Appraisal and Arrangement for Multicultural Archival Collections.” Archivaria, vol. 63, Spring 2007.

Springer, Michelle et al. “For the Common Good: The Library of Congress Flickr Pilot Project”. Library of Congress Collections, 2008.

Sturz, David N. “Communal Categorization: The Folksonomy”, INFO622: Content Representation, 2004.

Turner, Hannah. Cataloguing Culture: Legacies of Colonialism in Museum Documentation. University of British Columbia Press, 2022.

Progress Report: Creating a Data Lifeboat circa 2024

Alex Chan

In my previous post, I showed you our prototype Data Lifeboat creation workflow. At the end of the workflow, we’d created a Data Lifeboat we could download! Now I want to show you what you get inside the Data Lifeboat package.

Design goals

When we were designing the contents of a Data Lifeboat, we had the following principles in mind:

  • Self-contained – a Data Lifeboat should have no external dependencies, and not rely on any external services that might go away between it being created and opened.
  • Long lasting – a Data Lifeboat should be readable for a long time. It’s a bit optimistic to imagine anything digital we create today will last for 100 years, but we can aim for several decades at least!
  • Understandable – it should be easy for anybody to understand what’s in a Data Lifeboat, and why it might be worth exploring further.
  • Portable – a Data Lifeboat should be easy to move around, and slot into existing preservation systems and workflows without too much difficulty.

A lot of the time, when you export your data from a social media site, you get a folder full of opaque JSON files. That’s tricky to read, and it’s not obvious why you’d care about what’s inside – we wanted to do something better!

We decided to create a “viewer” that lives inside every Data Lifeboat package which gives you a more human-friendly way to browse the contents. The underlying data is still machine-readable, but you can see all the photos and metadata without needing to read a JSON file. This viewer is built as a static website. Building small websites with vanilla HTML and JavaScript gives us something lightweight, portable, and likely to last a long time.

This is inspired by services like Twitter and Instagram which also create static websites as part of their account export – but we’re going much smaller and simpler.

Folder structure

When you open a Data Lifeboat, here’s what’s inside:
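As a rough textual sketch (the individual file names below are illustrative, not the exact names the prototype writes), the top level looks something like this:

    README.html
    files/
        12345678_original.jpg
        12345678_thumbnail.jpg
        ...
    metadata/
        photos.js
        comments.js
        ...
    viewer/
        styles.css
        viewer.js
        ...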


The files folder contains all of the photo and video files – the JPEGs, GIFs, and so on. We currently store two sizes of each file: the high-resolution original file that was uploaded to Flickr, and a low-resolution thumbnail.

The metadata folder contains all of the metadata, in machine-readable JavaScript/JSON files. This includes the technical metadata (like the upload date or resolution) and the social metadata (like comments and favorites).
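As a hedged illustration of the kind of record the metadata folder might hold for a single photo (the field names here are assumptions made for the example, not the prototype’s actual schema):

    {
      "id": "12345678",
      "title": "Smoke over the Seine",
      "owner": "example_flickr_member",
      "date_uploaded": "2019-04-15",
      "original": "files/12345678_original.jpg",
      "thumbnail": "files/12345678_thumbnail.jpg",
      "tags": ["notredame", "paris", "fire"],
      "comments": [
        { "author": "another_member", "text": "I was standing just across the river." }
      ],
      "count_faves": 12
    }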

The viewer folder contains the code for our viewer. It’s a small number of hand-written HTML, CSS, and JavaScript files.

The README.html file is the entry point to the viewer, and the first file we want people to open. This name is a convention that comes from the software industry, but we hope that the meaning will be clear even if people are unfamiliar with it.

If you’re trying to put a Data Lifeboat into a preservation system that requires a fixed packaging format like BagIt or OCFL, you could make this the payload folder – but we didn’t want to require those tools in Data Lifeboat. Those structures are useful in large institutions, but less understandable to individuals. We think of this as progressive enhancement, but for data formats.
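For institutions that do use BagIt, the mapping might look roughly like this (a sketch based on BagIt’s convention of a data/ payload directory plus checksum manifests, not something the Data Lifeboat tooling produces for you):

    my-data-lifeboat/
        bagit.txt
        bag-info.txt
        manifest-sha256.txt
        data/
            README.html
            files/
            metadata/
            viewer/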

Inside the viewer

Let’s open the viewer and take a look inside.

When you open README.html, the first thing you see is a “cover sheet”. This is meant to be a quick overview of what’s in the Data Lifeboat – a bit like the cover sheet on a box of papers in a physical archive. It gives you some summary statistics and tells you why the creator thought these photos were worth keeping – this is what was written in the Data Lifeboat creation workflow. It also shows a small number of photos, from the most popular tags in the Data Lifeboat.

This cover sheet is designed to be a completely self-contained HTML file. Normally web pages load multiple external resources, like images or style sheets, but we plan for this file to carry everything it needs: styles will be inline, and images will be base64-encoded data URIs. This design choice makes it easy to create multiple copies of the cover sheet, independent of the rest of the Data Lifeboat, as a summary of the contents.
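In concrete terms, “self-contained” boils down to a page shaped something like the skeleton below (a sketch only; the real cover sheet’s markup, styles, and wording will differ):

    <!DOCTYPE html>
    <html>
      <head>
        <meta charset="utf-8">
        <title>Data Lifeboat: Notre Dame, April 2019</title>
        <style>
          /* All styles live inline in the page, with no external stylesheet */
          body { font-family: sans-serif; margin: 2em; }
        </style>
      </head>
      <body>
        <h1>Data Lifeboat: Notre Dame, April 2019</h1>
        <p>Created on 7 December 2024. Contains 120 photos from 14 Flickr members.</p>
        <p>From the creator’s README: “I wanted the street-level views of the fire to survive.”</p>
        <!-- Preview images are embedded as base64 data URIs rather than separate files -->
        <img alt="Smoke over the Seine" src="data:image/jpeg;base64,/9j/4AAQSkZJRg...">
      </body>
    </html>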

For example, if you had a large collection of Data Lifeboats, you could create an index from these cover sheets that a researcher could browse before deciding exactly which Data Lifeboat they wanted to download.

Now let’s look at a list of photos. If you click on any of the summary stats, or the “Photos” tab in the header, you see a list of photos.

This list shows you a preview thumbnail of each photo, and some metadata that can be used for filtering and sorting. For example, you can sort by photos with the most/least comments, or filter to photos uploaded by a particular Flickr member.

If you click on a photo, you can see an individual photo page. This shows you the original copy of the photo, and all the metadata we have about it:

Eventually you’ll be able to use the metadata on this page to find similar photos – for example, you’ll be able to click on a tag to find other photos with the same tag.

These pages still need a proper visual design, and this prototype is just meant to show the range of data we can capture. It’s already more understandable than a JSON file, but we think we can do even better!

Legible in the long term

The viewer will also contain documentation about both the idea of Data Lifeboat and the structure of this particular package. If a Data Lifeboat is opened in 50 years by somebody who doesn’t know about the project, we want them to understand what they’re looking at and how they can use it.

It will also contain the text and agreement date of any policies agreed upon by the creator of this particular Data Lifeboat.

For example, as we create the machine-readable metadata files, we’re starting to document their structure. This should make it easier for future users to extract the metadata programmatically, or even build alternative viewer applications.

Lo-fi and low-tech

The whole viewer is written in a deliberately low-tech way. All the HTML templates, CSS and JavaScript are written by hand, with no external dependencies or bloated frameworks. This keeps the footprint small, makes it easier for us to work on as a small team, and we believe gives the viewer a good chance of lasting for multiple decades. The technology behind the web has a lot of sticking power.

This is a work-in-progress – we have more ideas that we haven’t built yet, and lots of areas where we know where the viewer can be improved. Check back soon for updates as we continue to improve it, and look out for a public alpha next year where you’ll be able to create your own Data Lifeboats!

A Phoenix in Paris: Data Lifeboats for Citizen-Driven Histories

By Fattori McKenna & George Oates

This blog post discusses the value of social media photography in enhancing our understanding and emotional vocabulary around historic events. It makes the case for the Data Lifeboat as an effective collecting tool for these types of citizen-made galleries (and histories). It also recounts other Data Lifeboat themes recommended during the Mellon co-design workshops.

On Saturday, December 7th 2024, Notre Dame Cathedral reopened its iron-clad tracery doors, marking the end of more than five years of closure. The news coverage focused on the splendour — and occasional controversy — of its distinguished guests, contentious seating plans and retro light shows. The reopening inevitably brought back memories of the 2019 tragedy that befell the cathedral, ravaged by fire. Somehow the event prefigured the collective helplessness we would feel a year later under the Covid-19 lockdowns, as viewers could only watch in horror while the same images spread around news and social media. On reflection, the ubiquity and uniformity of the images is surprising, so often captured from the southeast end of the chancel: the flèche engulfed in flames like a tapered candle, and behind it, through the iridescent smoke, a pair of lancet windows seemed to peer back at the viewer, embodying a vision of Notre Dame des Larmes—Our Lady in tears.

There is an alternative history of the Notre Dame fire, captured not through mainstream media but in the dispersed archives of networked social photography—an often overlooked and underreported lens on the event. Among the pictures gathered by the Flickr Foundation is a remarkable collection of street photographs that offer a fresh perspective (also shared in this post). These images place the fire in context: smoke billows against the backdrop of the Eiffel Tower as seen from Trocadéro; a woman in a fur coat strolls nonchalantly past the boarded-up bouquinistes; a couple steal a glance from their moped; a man abandons his bicycle, supplicating on the banks of the Seine. User-generated social photography expands the event from its mass-reproduced, singular, fixed perspective into a multi-dimensional, multi-vocal narrative that unfolds longitudinally over time.

This is the incarnation of social photography at its best, so often dismissed for its sheer volume, producing images that are “unoriginal, obvious, boring.” Yet, as art historian Geoffrey Batchen counters, “There are no such things as banal photographs, only banal accounts of photography.” The true value lies not just in the images themselves, but in how we look at them. It is through this act of curation, contextualisation and interpretation that these photographs gain their depth.

 

A People’s History through the Lens

Embedded within the story of photography itself is a people’s history. From its inception, photography has centred the social subject, capturing the overlooked and hidden realities that traditional media refused to. In 19th-century Paris, early photographers such as Eugène Atget and Félix Nadar chronicled the changing urban landscapes, preserving scenes of working-class neighbourhoods, subject to Haussmann’s destruction, proletarian characters and everyday life. The camera’s portability, speed and (perceived) candidness made it suitable to the task of documenting the unseen.

The social subject in photography has long been intended to elicit sentiment and action. Social photographs compel viewers to respond emotionally and, ideally, to take action. Jacob Riis’s How the Other Half Lives (1890), aimed at middle-class audiences, used images of New York’s Lower East Side slums to generate empathy and drive charity. As Walter Benjamin later described it, the medium of photography has a “revolutionary use-value” in its ability to render visible hidden social structures. Contemporary documentary collections, such as Kris Graves’s A Bleak Reality, have harnessed this historic compulsion to catalyse social change.

There is, rightly, caution around social photography’s treatment of its subjects. As Maren Stange argues in her analysis of American documentary photographers including Riis (along with Hine and the famed Farm Security Administration collection), social photography has historically rested on assumptions about its subjects, instrumentalising them or reducing them to symbolic devices. Moreover, it often fails to acknowledge that the photographer inherently constructs the photograph’s reality. As David Peeler notes, each photograph holds “layers of accreted meanings” shaped by the photographer’s choices in composition, processing, and presentation. In the age of citizen-driven photography, with the distributed ability to manipulate images, these limitations become even more pronounced, requiring an explicit recognition of the constructed nature of the medium.

The construction of this fragment of reality in photography, however, is not inherently negative. When acknowledged, it can be a source of power, offering what Donna Haraway describes as situated knowledge—a rejection of objectivity in favor of partial, contextual perspectives. Citizen-driven collections, though subjective and, by their very nature incomplete, serve as an antidote to what Haraway calls the “conquering gaze from nowhere.” They counter dominant, institutional narratives with fragmented, personal views.

 

The Photograph Over Time

The value of a photograph often increases over time, as its meaning and significance evolve in relation to historical, cultural, and social contexts. Photographs gain depth through longitudinal perspectives, becoming more compelling at a distance as they reveal shifts, absences, and forgotten details. As Silke Helmerdig discusses in Fragments, Futures, Absence and the Past, in her treatment of German photography of the 2010s, we ought to see photography as speaking to us in the future subjunctive: it asks us, “what could be, if?”

With time and attention, photographs can be viewed in aggregate, and the future historian can pull from concurrent sources. Our contemporary photographic collecting tools, as in the case of Flickr’s Galleries and Albums, which allow curation of other people’s photographs, can come to resemble a sort of photomontage. Rosalind Krauss, writing on the photomontages of Hannah Höch and other Dadaists in The Optical Unconscious, argues that the medium forces a dialogue between images, creating unexpected connections and breaking the linearity of traditional visual narratives, thus opening space for political critique. The Notre Dame gallery disrupts the throughline of the ‘official’ imagery of the event, creating a space for discourse about other elements besides the central action (e.g. gender, capitalism, urbanism).

Securing the People’s Archive

Having discussed the value of the citizen-made collection, we are compelled to ask: what if our institutional archives began collecting more contemporaneously? We believe Data Lifeboat can help with this. The Notre Dame gallery is just one example of a potential collection for a Data Lifeboat: our tool for long-term preservation of networked social photography and citizen-made collections from Flickr.com.

Data Lifeboats could be deployed as a curatorial tool for networked social photography, providing institutions with a way to collect, catalogue and reflect on citizen-driven narratives. At present, there is not an archival standard for images on social media and archivists still struggle with the vastness and maintenance of those they’ve managed to collect [see our blog-post from iPres]. Data Lifeboat thus operates as a sort of packaging tool, flexible and light enough to adapt to collections of differing scales and purviews, but still maintaining the social context that makes networked images so valuable.

There are two potential approaches:

  1. Hands-on: Data Lifeboats could be commissioned by an institution around a certain topic. For example, the Museum of London could commission a group of Whitechapel teenagers to collect photos from Flickr.com of their neighbourhood spaces that are meaningful to them.
  2. Hands-off: Citizens create Data Lifeboats independently of a topic of their choosing. Institutions may choose to hold these bounded social media archives as a public good, for the benefit of our collective digital heritage.

In both cases, the institutions become holders of Data Lifeboats, which are subsumed into their digital collections management systems. Data Lifeboats become part of a process of Participatory Appraisal, extending and diversifying the ‘official archive’ and addressing the persistent gap of who gets to be represented. As we have discussed, there are also possibilities for distributing the load of Data Lifeboats; more on this in the Safe Harbor Network.

Other possible Data Lifeboats

During our Mellon-funded workshops, we asked participants to suggest Data Lifeboats they would like to see in their institutional collections, but also any they would create themselves for personal use.

At-risk Subjects

Collections focused on documenting vulnerable or ephemeral content that might disappear without active intervention. This includes both environmental changes and socio-political documentation that could be censored or lost.

e.g. glaciers over time, rapid response after a disaster, disappearing rural life across Europe, politically at-risk accounts

Subjects often overlooked

Collections that aim to preserve marginalised voices and underrepresented perspectives, helping to fill gaps in traditional institutional archives and ensure a more representative historical record.

e.g. a queer community coffee shop, Black astronauts, local street art, life in Communist Poland

Nostalgia for Web 1.0

As so much of Web 1.0 disappears (e.g. Geocities, MySpace music; see also ‘Digital Dark Age’), there is a desire to archive and begin critically reflecting on the early days of the web.

e.g. self-portraits from the early 2000s, vernacular photography from the 2010s, Flickr HQ, most viewed Flickr photos

Quirky Collections

Flickr is renowned as a home for serendipitous discovery on the web, sometimes lauded as a ‘digital shoebox of photographs’. There is an opportunity to replicate this ‘quirkiness’ with Data Lifeboats.

e.g. ghost signs, every Po’Boy in town, electricity pylons of the world

Personal collections

Data Lifeboats could serve as secure containers for digital family heirlooms. Built into Flickr.com are privacy controls (Friends, Family) that would carry over to Data Lifeboats, preserving privacy for the long term.

e.g. family archives, 365 challenges, a group of friends

 

Conclusion

The Notre Dame gallery exemplifies an ideal subject for a Data Lifeboat, both in its content and curatorial approach. The Data Lifeboat framework serves as an apt vessel, with its built-in capabilities:

  • Data Lifeboats can capture alternative viewpoints, situated knowledges and stories from below by tapping into the vast Flickr archive. We recognise that we can never capture, nor preserve, the archive in its entirety, so Data Lifeboats embrace the logic of the archival sliver.
  • Data Lifeboats can preserve citizen-driven communication through their unique storage of social metadata. This means that the conversations around the images are preserved with the images themselves, creating a holistic entity.
  • Data Lifeboats are purposely designed with posterity in mind. Technologically, their light-touch design means they are built to last. Furthermore, the README (link) nudges the Data Lifeboat creator toward conscious curation and commentary, providing value to future historians.

Can you think of any other Data Lifeboats? We’d love to hear about them.

Alex Chan

Progress Report: Creating a Data Lifeboat circa 2024

In October and November, we held two Data Lifeboat workshops, funded by the Mellon Foundation. We had four days of detailed discussions about how Data Lifeboat should work, talked through some of the big questions around ethics and care, and got a lot of useful input from our attendees.

As part of the workshops, we showed a demo of our current software, where we created and downloaded a prototype Data Lifeboat. This is an early prototype, with a lot of gaps and outstanding questions, but it was still very helpful to guide some of our conversations. Feedback from the workshops will influence what we are building, and we plan to release a public alpha next year.

We are sharing this work in progress as a snapshot of where we’ve got to. The prototype isn’t built for other people to use, but I can walk you through it with screenshots and explanations.

In this post, I’ll walk you through the creation workflow – the process of preparing a Data Lifeboat. In a follow-up post, I’ll show you what you get when you download a finished Data Lifeboat.

Step 1: Sign in to Flickr

To create a Data Lifeboat, you have to sign in to your Flickr account:

This gives us an authenticated source of identity for who is creating each Data Lifeboat. This means each Data Lifeboat will reflect the social position of its creator in Flickr.com. For example, after you log in, your Data Lifeboat could contain photos shared with you by friends and family, where those photos would not be accessible to other Flickr members who aren’t part of the same social graph.
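How the sign-in works isn’t shown in the screenshot, but Flickr’s API uses OAuth 1.0a for this. As a rough sketch (not the prototype’s actual code, and with placeholder credentials), the flow looks something like this in Python with the requests_oauthlib library:

from requests_oauthlib import OAuth1Session

# Placeholder credentials: a real tool registers its own API key with Flickr
API_KEY = "YOUR_FLICKR_API_KEY"
API_SECRET = "YOUR_FLICKR_API_SECRET"

flickr = OAuth1Session(API_KEY, client_secret=API_SECRET, callback_uri="oob")

# 1. Get a request token and send the person to Flickr to authorise it
flickr.fetch_request_token("https://www.flickr.com/services/oauth/request_token")
print("Please visit:", flickr.authorization_url("https://www.flickr.com/services/oauth/authorize"))

# 2. Swap the verification code they bring back for an access token
verifier = input("Verification code: ")
flickr.fetch_access_token("https://www.flickr.com/services/oauth/access_token", verifier=verifier)

# 3. Ask Flickr who just signed in: this is the identity recorded with the Data Lifeboat
me = flickr.get(
    "https://www.flickr.com/services/rest/",
    params={"method": "flickr.test.login", "format": "json", "nojsoncallback": 1},
).json()
print("Signed in as:", me["user"]["username"]["_content"])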

Step 2: Choose the photos you want to save

To choose photos, you enter a URL that points to photos on Flickr.com:

In the prototype, we can only accept a single URL, either a Gallery, Album, or Photostream for now. This is good enough for prototyping, but we know we’ll need something more flexible later – for example, we might allow you to enter multiple URLs, and then we’d get photos from all of them.
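As a toy illustration of what accepting that single URL involves, here is roughly how the three URL shapes might be told apart (an illustrative sketch rather than the prototype’s parser; real Flickr URLs come in many more variants):

import re

# Illustrative patterns for the three URL shapes the prototype accepts
PATTERNS = {
    "album":       re.compile(r"flickr\.com/photos/[^/]+/albums/(\d+)"),
    "gallery":     re.compile(r"flickr\.com/photos/[^/]+/galleries/(\d+)"),
    "photostream": re.compile(r"flickr\.com/photos/([^/]+)/?$"),
}

def classify_flickr_url(url):
    """Return a (kind, identifier) pair, or raise if the URL isn't recognised."""
    for kind, pattern in PATTERNS.items():
        match = pattern.search(url)
        if match:
            return kind, match.group(1)
    raise ValueError(f"Not a supported Flickr URL: {url}")

# e.g. classify_flickr_url("https://www.flickr.com/photos/example/")
# returns ("photostream", "example")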

Step 3: See a summary of the photos you’d be downloading

Once you’ve given us a URL, we fetch the first page of photos and show you a summary. This includes things like:

  • How many photos are there?
  • How many photos are public, semi-public, or private?
  • What license are the photos using?
  • What’s the safety level of the photos?
  • Have the owners disabled downloads for any of these photos?

Each of these controls affects what we are permitted to put in a Data Lifeboat, and the answers will be different for different people. Somebody creating their family archive may want all the photos, whereas somebody creating a Data Lifeboat for a museum might only want photos which are publicly visible and openly licensed.
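To make that concrete, the summary itself is just a tally over the photo records we fetch. Here is a minimal sketch, assuming each record carries the usual Flickr API fields (the exact field names are illustrative):

from collections import Counter

def summarise_photos(photos):
    """Tally visibility, licence, safety level and download settings for a set of photo records."""
    visibility = Counter()
    licences = Counter()
    safety = Counter()
    downloads_disabled = 0

    for photo in photos:
        if photo.get("ispublic"):
            visibility["public"] += 1
        elif photo.get("isfriend") or photo.get("isfamily"):
            visibility["semi-public"] += 1
        else:
            visibility["private"] += 1

        licences[photo.get("license", "unknown")] += 1
        safety[photo.get("safety_level", "unknown")] += 1

        if not photo.get("can_download", True):
            downloads_disabled += 1

    return {
        "total": len(photos),
        "visibility": dict(visibility),
        "licences": dict(licences),
        "safety": dict(safety),
        "downloads_disabled": downloads_disabled,
    }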

We want Data Lifeboat creators to make informed decisions about what goes in their Data Lifeboat, and we believe we can do better than showing them a series of toggle switches. The current design of this screen is to give us a sense of how these controls are used in practice. It exposes the raw mechanics of Flickr.com, and relies on a detailed understanding of how Flickr.com works. We know this won’t be the final design. We might, for example, build an interface that asks people where they intend to store the Data Lifeboat, and use that to inform which photos we include. This is still speculative, and we have a lot of ideas we haven’t tried yet.

The prototype only saves public, permissively-licensed photos, because we’re still working out the details of how we handle licensed and private photos.

Step 4: Write a README

This is a vital step – it’s where people give us more context. A single Data Lifeboat can only contain a sliver of Flickr, so we want to give the creator the opportunity to describe why they made this selection, and also to include any warnings about sensitive content so it’s easier to use the archive with care in future.

Tori will be writing up what happened at the workshops around how we could design this particular interface to encourage creators to think carefully here.

We like the idea of introducing ‘positive friction’ to this process, supporting people to write constructive and narrative notes to the future about why this particular sliver is important to keep.

Step 5: Agree to policies

When you create a Data Lifeboat, you need to agree to certain conditions around responsible use of the photos you’re downloading:

The “policies” in the current prototype are obviously placeholders. We know we will need to impose certain conditions, but we don’t know what they are yet.

One idea we’re developing is that these policies might adapt dynamically based on the contents of the Data Lifeboat. If you’re creating a Data Lifeboat that only contains your own public photos, that’s very different from one that contains private photos uploaded by other people.

Step 6: One moment please…

Creating or “baking” a Data Lifeboat can take a while – we need to download all the photos, their associated metadata, and construct a Data Lifeboat package.
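As a very rough sketch of what baking involves (assuming, purely for illustration, a plain folder of originals plus a metadata file; the real package format is described in the follow-up post):

import json, pathlib, requests

def bake_data_lifeboat(photos, readme_text, out_dir):
    """Download each photo, then write the images, their metadata, and the README into one package."""
    root = pathlib.Path(out_dir)
    (root / "photos").mkdir(parents=True, exist_ok=True)

    for photo in photos:
        # Assumes each record carries a direct URL to the image file
        response = requests.get(photo["original_url"], timeout=30)
        response.raise_for_status()
        (root / "photos" / f"{photo['id']}.jpg").write_bytes(response.content)

    # Keep the social metadata (titles, descriptions, tags, comments) with the images
    (root / "metadata.json").write_text(json.dumps(photos, indent=2))
    (root / "README.txt").write_text(readme_text)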

In the prototype we show you a holding page:

We imagine that in the future, we’d email you a notification when the Data Lifeboat has finished baking.

Step 7: Download the Data Lifeboat

We have a page where you can download your Data Lifeboat:

Here you see a list of all the Data Lifeboats that we’ve been prototyping, because we wanted people to share their ideas for Data Lifeboats at our co-design workshops. In the real tool, you’ll only be able to see and download Data Lifeboats that you created.

What’s next?

We still have plenty to get on with, but you can see the broad outline of where we’re going, and it’s been really helpful to have an end-to-end tool that shows the whole Data Lifeboat creation process.

Come back next week, and I’ll show you what you get inside the Data Lifeboat when you download it!

Our Data Lifeboat workshops are complete

Thanks to support from the Mellon Foundation, we have now completed our two international Data Lifeboat workshops. They were great! We have various blog posts planned to share what happened, and I’ll just start with a very quick summary.

As you may know, we had planned two workshops:

  1. Washington DC, at The Library of Congress, in October, and
  2. London, at the Garden Museum and Autograph Gallery, in November.

We were pleased to welcome a total of 32 people across the events, from libraries, archives, academic institutions, the freelance world, other like-minded nonprofits, Flickr.com, and Flickr.org.

Now we are doing the work of sifting through the bazillion post-its and absorbing the great conversations had as we worked through Tori’s fantastic program for the event. We were all very well-fed and organized too, thanks to Ewa’s superb project management. Thank you both.

Workshop aims

The aims of each workshop were the same:

  • Articulate the value of archiving social media, and Data Lifeboat
  • Detail where Data Lifeboat fits in current ecology of tools and practices
  • Detail where Data Lifeboat fits with curatorial approaches and content delivery
  • Plot (and recognise) the type and amount of work it would take to establish Data Lifeboat or similar in organisations

Workshop outline

We met these aims by organising the workshops into different sessions:

  1. Foundations of Long-Term Digital Preservation – Backward/forward horizons; understanding digital infrastructures; work happening in long-term digital preservation
  2. Data Lifeboat: What we’re thinking so far – Reporting on our NEH work to prototype software and policy, including a live demo; positioning a Data Lifeboat in emergency/not-emergency scenarios; curation needs or desires to use Data Lifeboats as selection/acquisition tool
  3. Consent and Care in Social Media Archiving – Ethics of care in digital archives; social context and care vs extractive data practices; mapping ethical rights, risks, responsibilities including copyright and data protection, and consent, and
  4. Characteristics of a Robust & Responsible Safe Harbor Network (our planned extension of the Data Lifeboat concept – think LOCKSS-ish) – The long history of safe harbor networks; logistics of such a network; Trust.

I’m not going to report on these now, but consider your appetite whetted for our further reporting back.

Background readings

Tori also prepared some grounding readings for the event, which we thought others may like to review:

Needless to say, we all enjoyed it very much, and heard the same from our attendees. Several follow-on chats have been arranged, and the community continues to wiggle towards each other.

Alex Chan

How do you preserve 50 billion photos? Reflections from iPres 2024

In September, Tori and I went to Belgium for iPres 2024. We were keen to chat about digital preservation and discuss some of our ideas for Data Lifeboat – and enjoy a few Belgian waffles, of course!

Photo by the Digital Preservation Coalition, used under CC BY-NC-SA 2.0

We ran a workshop called “How do you preserve 50 billion photos?” to talk about the challenges of archiving social media at scale. We had about 30 people join us for a lively discussion. Sadly we don’t have any photos of the workshop, but we did come away with a lot to think about, and we wanted to share some of the ideas that emerged.

Thanks to the National Endowment for the Humanities for supporting this trip as part of our Digital Humanities Advancement Grant.

How much can we preserve?

Some people think we should try to collect all of social media – accumulate as much data as possible, and sort through it later. That might be appealing in theory, because we won’t delete anything that’s important to future researchers, but it’s much harder in practice.

For Data Lifeboat, this is a problem of sheer scale. There are 50 billion photos on Flickr, and trillions of points that form the social graph that connects them. It’s simply too much for a single institution to collect as a whole.

At the conference we heard about other challenges that make it hard to archive social media, like constraints on staff time, limited resources, and a lack of cooperation from platform owners. Twitter/X came up repeatedly as an example of a site which has become much harder to archive after changes to its API.

There are also longer-term concerns. Sibyl Schaefer, who works at the University of California, San Diego, presented a paper about climate change, and how scarcity of oil and energy will affect our ability to do digital preservation. All of our digital services rely on a steady supply of equipment and electricity, which seem increasingly fraught as we look over the next 100 years. “Just keep everything” may not be a sustainable strategy.

This paper was especially resonant for us, because she encourages us to think about these problems now, before the climate crisis gets any worse. It’s better to make a decision when you have more options and things are (relatively) calm, than wait for things to get really bad and be forced to make a snap judgment. This matches our approach to rights, privacy, and legality with Data Lifeboat – we’re taking the time to consider the best approach while Flickr is still happy and healthy, and we’re not under time pressure to make a quick decision.

What should we keep?

We went to iPres believing that trying to keep everything is inappropriate for Flickr and social media, and the conversations we had only strengthened this view. There are definite benefits to keeping everything, but it requires an abundance of staffing and resources that simply doesn’t exist.

One thing we heard at our Birds of a Feather session is that if you can only choose a selection of photos from Flickr, large institutions don’t want to make that selection themselves. They want an intermediate curator to choose photos to go in a Data Lifeboat, and then bequeath that Data Lifeboat to their archive. That person decides what they think is worth keeping, not the institution.

Who chooses what to keep?

If you can only save some of a social media service, how do you decide which part to take? You might say “keep the most valuable material”, but who decides what’s valuable? This is a thorny question that came up again and again at iPres.

An institution could conceivably collect Data Lifeboats from many people, each of whom made a different selection. Several people pointed out that any selection process will introduce bias and inequality – and while it’s impossible to fix these completely, having many people involved can help mitigate some of the issues.

This ‘collective selection’ helps deal with the fact that social media is really big – there’s so much stuff to look at, and it’s not always obvious where the interesting parts are. Sharing that decision with different people creates a broader perspective of what’s on a platform, and what material might be worth keeping.

Why are we archiving social media?

The discussion around why we archive social media is still frustratingly speculative. We went to iPres hoping to hear some compelling use cases or examples, but we didn’t.

There are plenty of examples of people using material from physical archives to aid their research. For example, one person told the story of the Minutes of the Milk Marketing Board. Once thought of as a dry and uninteresting collection, it became very useful when there was an outbreak of foot-and-mouth disease in Britain. We didn’t hear any case studies like that for digital archives.

There are already lots of digital archives and archives of Internet material. It would be interesting to hear from historians and researchers who are using these existing collections, to hear what they find useful and valuable.

The Imaginary Future Researcher

A lot of discussion revolved around an imaginary future researcher or PhD student, who would dive into the mass of digital material and find something interesting – but these discussions were often frustratingly vague. The researcher would do something with the digital collections, but more specifics weren’t forthcoming.

As we design Data Lifeboat, we’ve found it useful to imagine specific scenarios:

  • The Museum of London works with schools across the city, engaging students to collect great pictures of their local area. A schoolgirl in Whitechapel selects 20 photos of Whitechapel she thinks are worth depositing in the Museum’s collection.
  • A botany student at California State looks across Flickr to find photographs of plant coverage in a specific area and gathers them as a longitudinal archive.
  • A curation student interning at Qtopia in Sydney wants to gather community documentation of Sydney’s Mardi Gras.

These only cover a handful of use cases, but they’ve helped ground our discussions and imagine how a future researcher might use the material we’re saving.

A Day Out in Antwerp

On my final day in Belgium, I got to visit a couple of local institutions in Antwerp. I got a tour of the Plantin-Moretus Museum, which focuses on sixteenth-century book printing. The museum is an old house with some gorgeous gardens:

And a collection of old printing machines:

There was even a demonstration on a replica printing machine, and I got to try a bit of printing myself – it takes a lot of force to push the metal letters and the paper together!

Then in the afternoon I went to the FelixArchief, the city archive of Antwerp, which is stored inside an old warehouse next to the Port of Antwerp:

We got a tour of their stores, including some of the original documents from the very earliest days of Antwerp:

And while those shelves may look like any other archive, there’s a twist – they have to fit around the interior shape of the original warehouse! The archivists have got rather good at playing tetris with their boxes, and fitting everything into tight spaces – this gets them an extra 2 kilometres of shelving space!

Our tour guide explained that this is all because the warehouse is a listed building – and if the archives were ever to move out, they’d need to remove all their storage and leave it in a pristine condition. No permanent modifications allowed!

Next steps for Data Lifeboat

We’re continuing to think about what we heard at iPres, and to bring it into the design and implementation of Data Lifeboat.

This month and next, we’re holding Data Lifeboat co-design workshops in Washington DC and London to continue these discussions.

by Fattori McKenna

Field Notes #01: Lughnasadh

Deep Reading in the Last Days of Summer

 

I joined the Foundation team in early August, with the long-term goal of better understanding future users of the Data Lifeboat project and Safe Harbor Network. Thanks to the Digital Humanities Advancement Grant we were awarded by the National Endowment for the Humanities, my first task was to get up to speed with the Data Lifeboat project, a concept that has been in the works since 2022 as part of Flickr.org’s Content Mobility Program and that recently gained a working prototype. I have the structured independence to design my own research plan and, as every researcher knows, being able to immerse oneself in the topic beforehand is a huge advantage. It allows us to frame the problem at hand, to be resolute with objectives, and to ground the research in what is known and current.

 

Stakeholder interviews

To understand what would be needed from the research plan, I first wanted to understand how we got to where we are with the Data Lifeboat project.

I spoke with Flickr.org’s tight-knit internal team to gather perspectives that emphasised varying approaches to the question of long-term digital preservation: ranging from the technological, to the speculative, to the communal. It was curious to see how different team members viewed the project, each speaking from their own specialty, with their wider ambitions and community in mind.

Branching out, I enlisted external stakeholders for half-hour chats, those who’ve had a hand in the Data Lifeboat project since it was in napkin-scribble format. The tool owes its present form to a cadre of digital preservation experts and enthusiasts, who do not work on the project full-time, but have generously given their hours to partake in workshops, coffees, Whereby calls, and a blissfully meandering Slack thread. Knowing these folks would be, themselves, a huge repository of knowledge, I wanted a way to capture this. Besides introductions to the Safe Harbor Network co-design workshops (as supported by the recent Mellon Foundation grant) and my new role, I centred our conversation around three key questions:

  1. What has your experience of the last six months of the Data Lifeboat project been like? How do you think we are doing? Any favourite moments, any concerns?
  2. What are the existing practices around digital acquisition, storage and maintenance in your organisation(s)? How would the Data Lifeboat and Safe Harbor Network differ from the existing practices?
  3. Where are the blind-spots that still exist for developing the Data Lifeboat project and Safe Harbor Network? What might we want to find out from the co-design workshops in October and November?

Here it was notable to learn what had stuck with them in the repose since the last Data Lifeboat project meet-up. For some the emphasis was on how the Data Lifeboat tool could connect institutions, for others it was how the technology can decentralise power and ownership of data. All were keen to see what shape the project would take next.

One point, however, remained amorphous to all stakeholders, and we ought to carry it forward into the research: what is the problem that the Data Lifeboat project is solving? Specifically in a non-emergency scenario (the emergency need is intuitive). How can we best articulate that problem to our imagined users?

As our prototype user group is likely to be institutional users of Flickr (Galleries, Libraries, Archives and Museums), it will be important to meet them where they are, which brought me onto my next August task: the mini-literature review.

 

Mini Literature Review

Next, I wanted to get up to date on the contemporary discourses around digital preservation. Whilst stakeholders have brought their understanding of these topics to shaping the Data Lifeboat project, it felt as if the project was missing its own bibliography or set of citations. I wanted to ask: what are the existing conversations that the Data Lifeboat project is speaking to?

It goes without saying that this is a huge topic and, despite my humble background in digital heritage research (almost always theoretical), cramming it all into one month would be impossible. Thus, I adopted the ethos of the archival ‘sliver’ that so informs the Data Lifeboat project, and took a snapshot of current literature. After reviewing the writing to date on the project (shout-out to Jenn’s reporting here and here), I landed on three guiding topics for the literature review:

 

The Status of Digital Preservation

  • What are the predominant tools and technologies of digital preservation?
  • What are recent reflections and learnings from web archiving experiments?
  • What are current institutional and corporate strategies for digital social collecting and long-term data storage?

Examples include:

Care & Ethics of Archives

  • What are the key ethical considerations among archivists today?
  • How are care practices being embedded into archives and archival practice?
  • What reflections and responses exist to previous ethical interventions?

Examples include:

Collaboration and Organisation in Archival Practice

  • What are the infrastructures (hard and soft) of archival practice?
  • What are the predominant organisational structures, considerations and difficulties in digital archives?
  • How does collaboration appear in archives? Who are the (visible and invisible) stakeholders?

Examples include:

 

Academic articles, blog posts and industry guidelines were selected as source materials (supplemented by crowdsourcing the Flickr.org team’s favourites). In reading these texts, I kept one question top of mind: ‘What does this mean for the Data Lifeboat project and the Safe Harbor Network?’ In more granular terms: ‘What can we learn from these investigations?’, ‘Where are we positioned in the wider ecosystem of digital preservation?’ and, finally, ‘What should we be thinking about that we aren’t yet?’

Naturally with more time, or with an academic audience in mind, a more rigorous methodology to discourse capture would be appropriate. For our purposes, however, this snapshot approach suffices – ultimately the data this research is grounded in comes not from textual problematising, but instead will emerge from our workshops with future users.

Having this resource is of huge benefit to meeting our session participants where they stand. Whilst there will inevitably be discourses, approaches and critiques I have missed, I will at least be able to speak the same language as our participants and get into the weeds of our problems in a complex, rather than baseline, manner. Furthermore, my ambition is for this bibliography to become an ongoing and open-source asset, expanding as the project develops.

These three headers (1. The Status of Digital Preservation, 2. Care & Ethics of Archives, 3. Collaboration and Organisation in Archival Practice) currently constitute placeholders for our workshop topics. It is likely, however, that these titles could evolve, splinter or coalesce as we come closer to a more refined and targeted series of questions for investigating with our participants.

 

Question Repository [in the works]

In parallel with these ongoing workstreams, I am building a repository, or long-list, of questions for our upcoming workshops. The aim is to first go broad, listing all possible questions, in an attempt to capture as many inquisitive voices as possible. These will then be refined and grouped under thematic headings, which will in turn structure the sub-points or provocations for our sessions. This iterative process reflects a ground-up methodology, derived from interviews, reading, and the collective knowledge of the Flickr.org community, to finally land on working session titles for our October and November Safe Harbor Network co-design workshops.

Looking ahead, there is an opportunity to test several of these provocations around Data Lifeboat at our Birds-of-a-Feather session, taking place at this year’s International Conference on Digital Preservation (iPres) in Ghent later this month. Here we might foresee which questions generate lively and engaged discussion; which features of the Data Lifeboat tool and project prompt anticipation or concern; and finally, which pathways we ought to explore further.

 

Other things I’ve been thinking about this month

Carl Öhman’s concept of the Neo-Natufians in The Afterlife of Data: What Happens to Your Information When you Die and Why You Should Care

Öhman proposes that the digital age has ushered in a major shift in how we interact with our deceased. Referencing the Natufians, the first non-nomadic peoples to keep the dead among their tribe (adorning skulls with seashells and placing them in the walls) instead of leaving them behind to the elements, he posits that our current position is equally seismic. The dead now live alongside us in the digital realm. In a profound shift from the family shoebox of photographs, the dead are accessible from virtually anywhere at any time, their (visible and invisible) data trail co-existing with ours. An inescapable provocation for the Data Lifeboat project to consider.

“The imago mask, printed not in wax but in ones and zeros”

The Shikinen Sengu Ritual at Ise Jingu, Japan

The Shikinen Sengu is a ritual held at the Ise Grand Shrine in Japan every 20 years, where the shrine is completely rebuilt and the sacred objects are transferred to the new structure. This practice has been ongoing for over a millennium, and it makes me think about the mobility of cultural heritage (analogue or digital), and how stasis, despite its intuitive appeal, can cause objects to perish. I am reminded of the oft-exalted quote from di Lampedusa’s Sicilian epic:

“If we want things to stay as they are, things will have to change.” The Leopard, by Giuseppe Tomasi di Lampedusa

Furthermore, Shikinen Sengu highlights the importance of ritual in sustaining objects, despite the wear-and-tear that handling over millennia may cause. What might our rituals around digital cultural data be? What practices could we generate (even if the original impetus gets lost)?

 

Background Ephemera

Currently Playing: Laura Misch Sample the Earth and Sample the Sky

Currently Reading: The Hearing Trumpet by Leonora Carrington

Currently Drinking: Clipper Green Tea

Welcome, Fattori!

Hello, world! I’m Fattori, Lead Researcher on the Data Lifeboat Project at the Flickr Foundation.

I first used Flickr in 2005; at that time, I was an angsty teen who needed a place to store grainy photos of Macclesfield, my post-industrial hometown, that I shot on an old Minolta camera. Since then, both my career and my academic research have focused on themes that are central to the aims of Flickr.org: images, databases, community, and the recording of human experiences.

In 2017 I began working as a researcher for strategic design studios based in New York, Helsinki, London and Mumbai. My research tried to address complex questions about humans’ experience of modern visual cultures by blending semiotics, ethnography and participatory methods. My commercial projects allowed me to explore women’s domestic needs in rural Vietnam, the future of work in America’s Rust Belt, and much in between.

As a postgraduate researcher at the University of Oxford’s Internet Institute, my work explores how blockchain experiments have shaped art and heritage sectors in the U.K. and Italy. At an Oxford Generative AI Summit I met the Flickr Foundation’s Co-Founder, George, and we hosted a workshop on Flickr’s 100-Year Plan with University and Bodleian academics, archivists, and students. I subsequently became more involved with Flickr.org when I contributed research to their generative AI time-capsule, A Generated Family of Man.

Now, as a Lead Researcher at Flickr.org, I’m developing a plan to help better understand future users of Data Lifeboat and the proposed Safe Harbor Network. We want to know how these tools might be implemented in real-world contexts, what problems they might solve, and how we can maintain the soft, collective infrastructure that keeps the Data Lifeboat afloat.

Beyond my professional life, I always have a jumper on my knitting needles (I can get quite nerdy about wool), I rush to a potter’s wheel whenever I can, and I’m writing a work of historical fiction about a mystic in the Balearic Islands. Like my 2005 self, I still snap the odd photo, these days on a Nikon L35AF.

Improving millions of files on Wikimedia Commons with Flickypedia Backfillr Bot

Last year, we built Flickypedia, a new tool for copying photos from Flickr to Wikimedia Commons. As part of our planning, we asked for feedback on Flickr2Commons and analysed other tools. We spotted two consistent themes in the community’s responses:

  • Write more structured data for Flickr photos
  • Do a better job of detecting duplicate files

We tried to tackle both of these in Flickypedia, and initially, we were just trying to make our uploader better. Only later did we realize that we could take our work a lot further, and retroactively apply it to improve the metadata of the millions of Flickr photos already on Wikimedia Commons. At that moment, Flickypedia Backfillr Bot was born. Last week, the bot completed its millionth update, and we guesstimate we will be able to operate on another 13 million files.

The main goals of the Backfillr Bot are to improve the structured data for Flickr photos on Wikimedia Commons and to make it easier to find out which photos have been copied across. In this post, I’ll talk about what the bot does, and how it came to be.

Write more structured data for Flickr photos

There are two ways to add metadata to a file on Wikimedia Commons: by writing Wikitext or by creating structured data statements.

When you write Wikitext, you write your metadata in a MediaWiki-specific markup language that gets rendered as HTML. This markup can be written and edited by people, and the rendered HTML is designed to be read by people as well. Here’s a small example, which adds some metadata to a file, linking it back to the original Flickr photo:

== {{int:filedesc}} ==
{{Information
|Description={{en|1=Red-whiskered Bulbul photographed in Karnataka, India.}}
|Source=https://www.flickr.com/photos/shivanayak/12448637/
|Author=[[:en:User:Shivanayak|Shiva shankar]]
|Date=2005-05-04
|Permission=
|other_versions=
}}

and here’s what that Wikitext looks like when rendered as HTML:

A table with four rows: Description (Red-whiskered Bulbul photographed in Karnataka, India), Date (4 May 2005), Source (a Flickr URL) and Author (Shiva shankar)

This syntax is convenient for humans, but it’s fiddly for computers – it can be tricky to extract key information from Wikitext, especially when things get more complicated.

In 2017, Wikimedia Commons added support for structured data. This allows editors to add metadata in a machine-readable format. This makes it much easier to edit metadata programmatically, and there’s a strong desire from the community for new tools to write high-quality structured metadata that other tools can use.

When you add structured data to a file, you create “statements” which are attached to properties. The list of properties is chosen by the volunteers in the Wikimedia community.

For example, there’s a property called “source of file” which is used to indicate where a file came from. The file in our example has a single statement for this property, which says the file is available on the Internet, and points to the original Flickr URL:

Structured data is exposed via an API, and you can retrieve this information in nice machine-readable XML or JSON:

$ curl 'https://commons.wikimedia.org/w/api.php?action=wbgetentities&sites=commonswiki&titles=File%3ARed-whiskered%20Bulbul-web.jpg&format=xml'
<?xml version="1.0"?>
<api success="1">
  …
  <P7482>
    …
    <P973>
      <_v snaktype="value" property="P973">
        <datavalue
          value="https://www.flickr.com/photos/shivanayak/12448637/"
          type="string"/>
      </_v>
    </P973>
    …
  </P7482>
</api>

(Here “P7482” means “source of file” and “P973” is “described at URL”.)

Part of being a good structured data citizen is following the community’s established patterns for writing structured data. Ideally every tool would create statements in the same way, so the data is consistent across files – this makes it easier to work with later.

We spent a long time discussing how Flickypedia should use structured data, and we got a lot of helpful community feedback. We’ve documented our current data model as part of our Wikimedia project page.

Do a better job of detecting duplicate files

If a photo has already been copied from Flickr onto Wikimedia Commons, nobody wants to copy it a second time.

This sounds simple – just check whether the photo is already on Commons, and don’t offer to copy it if it’s already there. In practice, it’s quite tricky to tell if a given Flickr photo is on Commons. There are two big challenges:

  1. Files on Wikimedia Commons aren’t consistent in where they record the URL of the original Flickr photo. Newer files put the URL in structured data; older files only put the URL in Wikitext or the revision descriptions. You have to look in multiple places.
  2. Files on Wikimedia Commons aren’t consistent about which form of the Flickr URL they use – with and without a trailing slash, with the user NSID or their path alias, or the myriad other URL patterns that have been used in Flickr’s twenty-year history.

Here’s a sample of just some of the different URLs we saw in Wikimedia Commons:

https://www.flickr.com/photos/joyoflife//44627174
https://farm5.staticflickr.com/4586/37767087695_bb4ecff5f4_o.jpg
www.flickr.com/photo_edit.gne?id=3435827496
https://www.flickr.com/photo.gne?short=2ouuqFT

There’s no easy way to query Wikimedia Commons and see if a Flickr photo is already there. You can’t, for example, do a search for the current Flickr URL and be sure you’ll find a match – it wouldn’t find any of the examples above. You can combine various approaches that will improve your chances of finding an existing duplicate, if there is one, but it’s a lot of work and you get varying results.
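To give a flavour of the gymnastics involved, here is an illustrative sketch (not our actual URL parsing library) that pulls a photo ID out of just the URL shapes above; even then, the short-URL form needs a separate decoding step:

import re

# Illustrative patterns for some of the URL shapes found on Wikimedia Commons
PHOTO_ID_PATTERNS = [
    re.compile(r"flickr\.com/photos/[^/]+/+(\d+)"),             # photo page URLs, even with doubled slashes
    re.compile(r"staticflickr\.com/\d+/(\d+)_[0-9a-f]+"),       # image files on the static/farm hosts
    re.compile(r"flickr\.com/photo(?:_edit)?\.gne\?id=(\d+)"),  # old .gne endpoints with an id parameter
]

def flickr_photo_id(url):
    """Return the numeric photo ID embedded in a Flickr URL, or None if we can't find one."""
    for pattern in PHOTO_ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None  # e.g. photo.gne?short=... needs the short code decoded before matching

# flickr_photo_id("https://www.flickr.com/photos/joyoflife//44627174")                       -> "44627174"
# flickr_photo_id("https://farm5.staticflickr.com/4586/37767087695_bb4ecff5f4_o.jpg")        -> "37767087695"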

For the first version of Flickypedia, we took a different approach. We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

This gave us a SQLite database that mapped Flickr photo IDs to Wikimedia Commons filenames. We could use this database to do fast queries to find copies of a Flickr photo that already exist on Commons. This proved the concept, but it had a couple of issues:

  • It was an incomplete list – we only looked in the structured data, and not the Wikitext. We estimate we were missing at least a million photos.
  • Nobody else can use this database; it only lives on the Flickypedia server. Theoretically somebody else could create it themselves – the snapshots are public, and the code is open source – but it seems unlikely.
  • This database is only as up-to-date as the latest snapshot we’ve downloaded – it could easily fall behind what’s on Wikimedia Commons.

We wanted to make this process easier – both for ourselves, and anybody else building Flickr–Wikimedia Commons integrations.

Adding the Flickr Photo ID property

Every photo on Flickr has a unique numeric ID, so we proposed a new Flickr photo ID property to add to structured data on Wikimedia Commons. This proposal was discussed and accepted by the Wikimedia Commons community, and gives us a better way to match files on Wikimedia Commons to photos on Flickr:

This is a single field that you can query, and there’s an unambiguous, canonical way that values should be stored in this field – you don’t need to worry about the different variants of Flickr URL.

We added this field to Flickypedia, so any files uploaded with our tool will get this new field, and we hope that other Flickr upload tools will consider adding this field as well. But what about the millions of Flickr photos already on Wikimedia Commons? This is where Flickypedia Backfillr Bot was born.

Updating millions of files

Flickypedia Backfillr Bot applies our structured data mapping to every Flickr photo it can find on Wikimedia Commons – whether or not it was uploaded with Flickypedia. For every photo which was copied from Flickr, it compares the structured data to the live Flickr metadata, and updates the structured data if the two don’t match. This includes the Flickr Photo ID.

It reuses code from our duplicate detector: it goes through a snapshot looking for any files that come from Flickr photos. Then it gets metadata from Flickr, checks if the structured data matches that metadata, and if not, it updates the file on Wikimedia Commons.

Here’s a brief sketch of the process:
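In simplified Python (the helper names are stand-ins for the bot’s real Flickr and Wikimedia plumbing, and the actual bot is more careful than this):

def backfill_file(file, get_flickr_metadata, build_statements,
                  write_statement, flag_for_manual_review):
    """Compare one Commons file's structured data against live Flickr metadata and update it."""
    photo_id = file.get("flickr_photo_id")   # found in structured data or Wikitext
    if photo_id is None:
        return                               # not a Flickr photo, nothing to do

    live = get_flickr_metadata(photo_id)     # current metadata from the Flickr API
    expected = build_statements(live)        # our structured data mapping

    for prop, value in expected.items():
        existing = file["statements"].get(prop)
        if existing is None:
            write_statement(file, prop, value)                   # missing: add it
        elif existing != value:
            flag_for_manual_review(file, prop, existing, value)  # conflict: a human decides
        # statements that already match are left untouched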

Most of the time this logic is fairly straightforward, but occasionally the bot will get confused – this is when the bot wants to write a structured data statement, but there’s already a statement with a different value. In this case, the bot will do nothing and flag it for manual review. There are edge cases and unusual files in Wikimedia Commons, and it’s better for the bot to do nothing than write incorrect or misleading data that will need to be reverted later.

Here are two examples:

  • Sometimes Wikimedia Commons has more specific metadata than Flickr. For example, this Flickr photo was posted by the Donostia Kultura account, and the description identifies Leire Cano as the photographer.

    Flickypedia Backfillr Bot wants to add a creator statement for “Donostia Kultura”, because it can’t understand the description – but when this file was copied to Wikimedia Commons, somebody added a more specific creator statement for “Leire Cano”.

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we’ve left the existing statement as-is.

  • Sometimes existing data on Wikimedia Commons has been mapped incorrectly. For example, this Flickr photo was taken “circa 1943”, but when it was copied to Wikimedia Commons somebody added an overly precise “date taken” statement claiming it was taken on “1 Jan 1943”.

    This bug probably occurred because of a misunderstanding of the Flickr API. The Flickr API always returns a complete timestamp in the “date” field, plus a separate granularity value telling you how accurate that timestamp is. If you ignore the granularity value, you end up asserting a more precise date than Flickr actually knows (see the sketch after this list).

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we made a manual edit to replace the statement with the correct date.
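For illustration, respecting that granularity value might look something like this sketch (the numeric codes follow Flickr’s date documentation: 0 for an exact timestamp, 4 for year-month, 6 for year, 8 for “circa”; the output shape here is purely illustrative):

def date_taken_claim(taken, granularity):
    """Turn Flickr's (timestamp, granularity) pair into a date claim of matching precision."""
    year = taken[:4]
    if granularity == 0:
        return {"date": taken[:10], "precision": "day"}
    if granularity == 4:
        return {"date": taken[:7], "precision": "month"}
    if granularity == 6:
        return {"date": year, "precision": "year"}
    # granularity 8: Flickr only knows the approximate ("circa") year
    return {"date": year, "precision": "year", "circa": True}

# date_taken_claim("1943-01-01 00:00:00", 8) -> {"date": "1943", "precision": "year", "circa": True}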

What next?

We’re going to keep going! There were a few teething problems when we started running the bot, but the Wikimedia community helped us fix our mistakes. It’s now been running for a month or so, and processed over a million files.

All the Flickypedia code is open source on GitHub, and a lot of it isn’t specific to Flickr – it’s general-purpose code for working with structured data on Wikimedia Commons, and could be adapted to build similar bots. We’ve already had conversations with a few people about other use cases, and we’ve got some sketches for how that code could be extracted into a standalone library.

We estimate that at least 14 million files on Wikimedia Commons are photos that were originally uploaded to Flickr – more than 10% of all the files on Commons. There’s plenty more to do. Onwards and upwards!