Data Lifeboat Update 2: More questions than answers

By Ewa Spohn

Thanks to the Digital Humanities Advancement Grant we were awarded by the National Endowment for the Humanities, our Data Lifeboat project (which is part of the Content Mobility Program) is now well and truly underway. The Data Lifeboat is our response to the challenge of archiving the 50 billion or so images currently on Flickr, should the service go down. It’s simply too big to archive as a whole, and we think that these shared histories should be available for the long term, so we’re exploring a decentralized approach. Find out more about the context for this work in our first blog post.

So, after our kick-off last month, we were left with a long list of open questions. That list became longer thanks to our first all-hands meeting that took place shortly afterwards! It grew again once we had met with the project user group – staff from the British Library, San Diego Air & Space Museum, and Congregation of Sisters of St Joseph – a small group representing the diversity of Flickr Commons members. Rather than being overwhelmed, we were buoyed by the obvious enthusiasm and encouragement across the group, all of whom agreed that this is very much an idea worth pursuing. 

As Mia Ridge from the British Library put it; “we need ephemeral collections to tell the story of now and give people who don’t currently think they have a role in preservation a different way of thinking about it”. And from Mary Grace of the Congregation of Sisters of St. Joseph in Canada, “we [the smaller institutions] don’t want to be the 3rd class passengers who drown first”. 

Software sketching

We’ve begun working on the software approach to create a Data Lifeboat, focussing on the data model and assessing existing protocols we may use to help package it. Alex and George started creating some small prototypes to test how we should include metadata, and have begun exploring what “social metadata” could be like – that’s the kind of metadata that can only be created on Flickr, and is therefore a required element in any Data Lifeboat (as you’ll see from the diagram below, it’s complex). 


Feb 2024: An early sketch of a Data Lifeboat’s metadata graph structure.

Thanks to our first set of tools, Flinumeratr and Flickypedia, we have robust, reusable code for getting photos and metadata from Flickr. We’ve done some experiments with JSON, XML, and METS as possible ways to store the metadata, and started to imagine what a small viewer that would be included in each Data Lifeboat might be like. 

Complexity of long-term licensing

Alongside the technical development we have started developing our understanding of the legal issues that a Data Lifeboat is going to have to navigate to avoid unintended consequences of long-term preservation colliding with licenses set in the present. We discussed how we could build care and informed participation into the infrastructure, and what the pitfalls might be. There are fiddly questions around creating a Data Lifeboat containing photos from other Flickr members. 

  • As the image creator, would you need to be notified if one of your images has been added to a Data Lifeboat? 
  • Conversely, how would you go about removing an image from a Data Lifeboat? 
  • What happens if there’s a copyright dispute regarding images in a Data Lifeboat that is docked somewhere else? 

We discussed which aspects of other legal and licensing models might apply to Data Lifeboats, given the need to maintain stewardship and access over the long term (100 years at least!), as well as the need for the software to remain usable over this kind of time horizon. This isn’t something that the world of software has ready answers for. 

  • Could Flickr.org offer this kind of service? 
  • How would we notify future users of the conditions of the license, let alone monitor the decay of licenses in existing Data Lifeboats over this kind of timescale? 

So many standards to choose from

We had planned to do a deep dive into the various digital asset management systems used by cultural institutions, but this turned out to be a trickier subject than we thought as there are simply too many approaches, tools, and cobbled-together hacks being used in cultural institutions. Everyone seems to be struggling with this, so it’s not clear (yet) how best to approach this. If you have any ideas, let us know!

This work is supported by the National Endowment for the Humanities.

NEH logo

Introducing Flickypedia, our first tool

Building a new bridge between Flickr and Wikimedia Commons

For the past four months, we’ve been working with the Culture & Heritage team at the Wikimedia Foundation — the non-profit that operates Wikipedia, Wikimedia Commons, and other Wikimedia free knowledge projects — to build Flickypedia, a new tool for bridging the gap between photos on Flickr and files on Wikimedia Commons. Wikimedia Commons is a free-to-use library of illustrations, photos, drawings, videos, and music. By contributing their photos to Wikimedia Commons, Flickr photographers help to illustrate Wikipedia, a free, collaborative encyclopedia written in over 300 languages. More than 1.7 billion unique devices visit Wikimedia projects every month.

We demoed the initial version at GLAM Wiki 2023 in Uruguay, and now that we’ve incorporated some useful feedback from the Wikimedia community, we’re ready to launch it. Flickypedia is now available at https://www.flickr.org/tools/flickypedia/, and we’re really pleased with the result. Our goal was to create higher quality records on Wikimedia Commons, with better connected data and descriptive information, and to make it easier for Flickr photographers to see how their photos are being used.

This project has achieved our original goals – and a couple of new ones we discovered along the way.

So what is Flickypedia?

An easy way to copy photos from Flickr to Wikimedia Commons

The original vision of Flickypedia was a new tool for copying photos from Flickr to Wikimedia Commons, a re-envisioning of the popular Flickr2Commons tool, which copied around 5.4M photos.

This new upload tool is what we built first, leveraging ideas from Flinumeratr, a toy we built for exploring Flickr photos. You start by entering a Flickr URL:

And then Flickypedia will find all photos at that URL, and show you the ones which are suitable for copying to Wikimedia Commons. You can choose which photos you want to upload:

Then you enter a title, a short description, and any categories you want to add to the photo(s):

Then you click “Upload”, and the photo(s) are copied to Wikimedia Commons. Once it’s done, you can leave a comment on the original Flickr photo, so the photographer can see the photo in its new home:

As well as the title and caption written by the uploader, we automatically populate a series of machine-readable metadata fields (“Structured Data on Commons” or “SDC”) based on the Flickr information – the original photographer, date taken, a link to the original, and so on. You can see the exact list of fields in our data modeling document. This should make it easier for Commons users to find the photos they need, and maintain the link to the original photo on Flickr.

This flow has a little more friction than some other Flickr uploading tools, which is by design. We want to enable high-quality descriptions and metadata for carefully selected photos; not just bulk copying for the sake of copying. Our goal is to get high quality photos on Wikimedia Commons, with rich metadata which enables them to be discovered and used – and that’s what Flickypedia enables.

Reducing risk and responsible licensing

Flickr photographers can choose from a variety of licenses, and only some of them can be used on Wikimedia Commons: CC0, Public Domain, CC BY and CC BY-SA. If it’s any other license, the photo shouldn’t be on Wikimedia Commons, according to its licensing policy.

As we were building the Flickypedia uploader, we took the opportunity to emphasize the need for responsible licensing – when you select your photographs, it checks the licenses, and doesn’t allow you to copy anything that doesn’t have a Commons-compatible license:

This helps to reduce risk for everyone involved with Flickr and Wikimedia Commons.

Better duplicate detection

When we looked at the feedback on existing Flickr upload tools, there was one bit of overwhelming feedback: people want better duplicate detection. There are already over 11 million Flickr photos on Wikimedia Commons, and if a photo has already been copied, it doesn’t need to be copied again.

Wikimedia Commons already has some duplicate detection. It’ll spot if you upload a byte-for-byte identical file, but it can’t detect duplicates if the photo has been subtly altered – say, converted to a different file format, or a small border cropped out.

It turns out that there’s no easy way to find out if a given Flickr photo is in Wikimedia Commons. Although most Flickr upload tools will embed that metadata somewhere, they’re not consistent about it. We found at least four ways to spot possible duplicates:

  • You could look for a Flickr URL in the structured data (the machine-readable metadata)
  • You could look for a Flickr URL in the Wikitext (the human-readable description)
  • You could look for a Flickr ID in the filename
  • Or Flickypedia could know that it had already uploaded the photo

And even looking for matching Flickr URLs can be difficult, because there are so many forms of Flickr URLs – here are just some of the varieties of Flickr URLs we found in the existing Wikimedia Commons data:

(And this is without some of the smaller variations, like trailing slashes and http/https.)

We’d already built a Flickr URL parser as part of Flinumeratr, so we were able to write code to recognise these URLs – but it’s a fairly complex component, and that only benefits Flickypedia. We wanted to make it easier for everyone.

So we did!

We proposed (and got accepted) a new Flickr Photo ID property. This is a new field in the machine-readable structured data, which can contain the numeric ID. This is a clean, unambiguous pointer to the original photo, and dramatically simplifies the process of looking for existing Flickr photos.

When Flickypedia uploads a new photo to Flickr, it adds this new property. This should make it easier for other tools to find Flickr photos uploaded with Flickypedia, and skip re-uploading them.

Backfillr Bot: Making Flickr metadata better for all Flickr photos on Commons

That’s great for new photos uploaded with Flickypedia – but what about photos uploaded with other tools, tools that don’t use this field? What about the 10M+ Flickr photos already on Wikimedia Commons? How do we find them?

To fix this problem, we created a new Wikimedia Commons bot: Flickypedia Backfillr Bot. It goes back and fills in structured data on Flickr photos on Commons, including the Flickr Photo ID property. It uses our URL parser to identify all the different forms of Flickr URLs.

This bot is still in a preliminary stage—waiting for approval from the Wikimedia Commons community—but once granted, we’ll be able to improve the metadata for every Flickr photo on Wikimedia Commons. And in addition, create a hook that other tools can use – either to fill in more metadata, or search for Flickr photos.

Sydney Harbour Bridge, from the Museums of History New South Wales. No known copyright restrictions.

Flickypedia started as a tool for copying photos from Flickr to Wikimedia Commons. From the very start, we had ideas about creating stronger links between the two – the “say thanks” feature, where uploaders could leave a comment for the original Flickr photographer – but that was only for new photos.

Along the way, we realized we could build a proper two-way bridge, and strengthen the connection between all Flickr photos on Wikimedia Commons, not just those uploaded with Flickypedia.

We think this ability to follow a photo around the web is really important – to see where it’s come from, and to see where it’s going. A Flickr photo isn’t just an image, it comes with a social context and history, and being uploaded to Wikimedia Commons is the next step in its journey. You can’t separate an image from its context.

As we start to focus on Data Lifeboat, we’ll spend even more time looking at how to preserve the history of a photo – and Flickypedia has given us plenty to think about.

If you want to use Flickypedia to upload some photos to Wikimedia Commons, visit www.flickr.org/tools/flickypedia.

If you want to look at the source code, go to github.com/Flickr-Foundation/flickypedia.

Data Lifeboat Update 2: More questions than answers

By Ewa Spohn

Thanks to the Digital Humanities Advancement Grant we were awarded by the National Endowment for the Humanities, our Data Lifeboat project (which is part of the Content Mobility Program) is now well and truly underway. The Data Lifeboat is our response to the challenge of archiving the 50 billion or so images currently on Flickr, should the service go down. It’s simply too big to archive as a whole, and we think that these shared histories should be available for the long term, so we’re exploring a decentralized approach. Find out more about the context for this work in our first blog post.

So, after our kick-off last month, we were left with a long list of open questions. That list became longer thanks to our first all-hands meeting that took place shortly afterwards! It grew again once we had met with the project user group – staff from the British Library, San Diego Air & Space Museum, and Congregation of Sisters of St Joseph – a small group representing the diversity of Flickr Commons members. Rather than being overwhelmed, we were buoyed by the obvious enthusiasm and encouragement across the group, all of whom agreed that this is very much an idea worth pursuing. 

As Mia Ridge from the British Library put it; “we need ephemeral collections to tell the story of now and give people who don’t currently think they have a role in preservation a different way of thinking about it”. And from Mary Grace of the Congregation of Sisters of St. Joseph in Canada, “we [the smaller institutions] don’t want to be the 3rd class passengers who drown first”. 

Software sketching

We’ve begun working on the software approach to create a Data Lifeboat, focussing on the data model and assessing existing protocols we may use to help package it. Alex and George started creating some small prototypes to test how we should include metadata, and have begun exploring what “social metadata” could be like – that’s the kind of metadata that can only be created on Flickr, and is therefore a required element in any Data Lifeboat (as you’ll see from the diagram below, it’s complex). 


Feb 2024: An early sketch of a Data Lifeboat’s metadata graph structure.

Thanks to our first set of tools, Flinumeratr and Flickypedia, we have robust, reusable code for getting photos and metadata from Flickr. We’ve done some experiments with JSON, XML, and METS as possible ways to store the metadata, and started to imagine what a small viewer that would be included in each Data Lifeboat might be like. 

Complexity of long-term licensing

Alongside the technical development we have started developing our understanding of the legal issues that a Data Lifeboat is going to have to navigate to avoid unintended consequences of long-term preservation colliding with licenses set in the present. We discussed how we could build care and informed participation into the infrastructure, and what the pitfalls might be. There are fiddly questions around creating a Data Lifeboat containing photos from other Flickr members. 

  • As the image creator, would you need to be notified if one of your images has been added to a Data Lifeboat? 
  • Conversely, how would you go about removing an image from a Data Lifeboat? 
  • What happens if there’s a copyright dispute regarding images in a Data Lifeboat that is docked somewhere else? 

We discussed which aspects of other legal and licensing models might apply to Data Lifeboats, given the need to maintain stewardship and access over the long term (100 years at least!), as well as the need for the software to remain usable over this kind of time horizon. This isn’t something that the world of software has ready answers for. 

  • Could Flickr.org offer this kind of service? 
  • How would we notify future users of the conditions of the license, let alone monitor the decay of licenses in existing Data Lifeboats over this kind of timescale? 

So many standards to choose from

We had planned to do a deep dive into the various digital asset management systems used by cultural institutions, but this turned out to be a trickier subject than we thought as there are simply too many approaches, tools, and cobbled-together hacks being used in cultural institutions. Everyone seems to be struggling with this, so it’s not clear (yet) how best to approach this. If you have any ideas, let us know!

This work is supported by the National Endowment for the Humanities.

NEH logo

Introducing flinumeratr, our first toy

by Alex

Today we’re pleased to release Flinumeratr, our first toy. You enter a Flickr URL, and it shows you a list of photos that you’d see at that URL:

This is the first engineering step towards what we’ll be building for the rest of this quarter: Flickypedia, a new tool for copying Creative Commons-licensed photos from Flickr to Wikimedia Commons.

As part of Flickypedia, we want to make it easy to select photos from Flickr that are suitable for Wikimedia Commons. You enter a Flickr URL, and Flickypedia will work out what photos are available. This “Flickr URL enumerator”, or “Flinumeratr”, is a proof-of-concept of that idea. It knows how to recognise a variety of URL types, including individual photos, albums, galleries, and a member’s photostream.

We call it a “toy” quite deliberately – it’s a quick thing, not a full-featured app. Keeping it small means we can experiment, try things quickly, and learn a lot in a short amount of time. We’ll build more toys as we have more ideas. Some of those ideas will be reused in bigger projects, and others will be dropped.

Flinumeratr is a playground for an idea for Flickypedia, but it’s also been a context for starting to develop our approach to software development. We’ve been able to move quickly – this is only my fourth day! – but starting a brand new project is always the easy bit. Maintaining that pace is the hard part.

We’re all learning how to work together, I’m dusting off my knowledge of the Flickr API, and we’re establishing some basic coding practices. Things like a test suite, documentation, checks on pull requests, and other guard rails that will help us keep moving. Setting those up now will be much easier than trying to retrofit them later. There’s plenty more we have to decide, but we’re off to a good start.

Under the hood, Flinumeratr is a Python web app written in Flask. We’re calling the Flickr API with the httpx library, and testing everything with pytest and vcrpy. The latter in particular has been so helpful – it “records” interactions with the Flickr API so I can replay them later in our test suite. If you’d like to see more, all our source code is on GitHub.

You can try Flinumeratr at https://flinumeratr.glitch.me. Please let us know what you think!