Progress Report: Inside a Data Lifeboat, circa 2024

Alex Chan

In my previous post, I showed you our prototype Data Lifeboat creation workflow. At the end of the workflow, we’d created a Data Lifeboat we could download! Now I want to show you what you get inside the Data Lifeboat package.

Design goals

When we were designing the contents of a Data Lifeboat, we had the following principles in mind:

  • Self-contained – a Data Lifeboat should have no external dependencies, and not rely on any external services that might go away between it being created and opened.
  • Long lasting – a Data Lifeboat should be readable for a long time. It’s a bit optimistic to imagine anything digital we create today will last for 100 years, but we can aim for several decades at least!
  • Understandable – it should be easy for anybody to understand what’s in a Data Lifeboat, and why it might be worth exploring further.
  • Portable – a Data Lifeboat should be easy to move around, and slot into existing preservation systems and workflows without too much difficulty.

A lot of the time, when you export your data from a social media site, you get a folder full of opaque JSON files. That’s tricky to read, and it’s not obvious why you’d care about what’s inside – we wanted to do something better!

We decided to create a “viewer” that lives inside every Data Lifeboat package which gives you a more human-friendly way to browse the contents. The underlying data is still machine-readable, but you can see all the photos and metadata without needing to read a JSON file. This viewer is built as a static website. Building small websites with vanilla HTML and JavaScript gives us something lightweight, portable, and likely to last a long time.

This is inspired by services like Twitter and Instagram which also create static websites as part of their account export – but we’re going much smaller and simpler.

Folder structure

When you open a Data Lifeboat, here’s what’s inside:


The files folder contains all of the photo and video files – the JPEGs, GIFs, and so on. We currently store two sizes of each file: the high-resolution original file that was uploaded to Flickr, and a low-resolution thumbnail.

The metadata folder contains all of the metadata, in machine-readable JavaScript/JSON files. This includes the technical metadata (like the upload date or resolution) and the social metadata (like comments and favorites).

The viewer folder contains the code for our viewer. It’s a small number of hand-written HTML, CSS, and JavaScript files.

The README.html file is the entry point to the viewer, and the first file we want people to open. This name is a convention that comes from the software industry, but we hope that the meaning will be clear even if people are unfamiliar with it.
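
Putting that together, the top-level layout of a Data Lifeboat looks roughly like this (a sketch based on the descriptions above; the exact contents of each folder vary between Data Lifeboats):

README.html     – the cover sheet and entry point to the viewer
files/          – original photos and videos, plus low-resolution thumbnails
metadata/       – machine-readable JavaScript/JSON metadata
viewer/         – hand-written HTML, CSS, and JavaScript for the viewer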

If you’re trying to put a Data Lifeboat into a preservation system that requires a fixed packaging format like BagIt or OCFL, you could make this the payload folder – but we didn’t want to require those tools in Data Lifeboat. Those structures are useful in large institutions, but less understandable to individuals. We think of this as progressive enhancement, but for data formats.

Inside the viewer

Let’s open the viewer and take a look inside.

When you open README.html, the first thing you see is a “cover sheet”. This is meant to be a quick overview of what’s in the Data Lifeboat – a bit like the cover sheet on a box of papers in a physical archive. It gives you some summary statistics and tells you why the creator thought these photos were worth keeping – this is what was written in the Data Lifeboat creation workflow. It also shows a small number of photos, from the most popular tags in the Data Lifeboat.

This cover sheet is designed to be a completely self-contained HTML file. Normally web pages load external resources like images and style sheets, but this file will need nothing else: styles will be inline, and images will be base64-encoded data URIs. This design choice makes it easy to create multiple copies of the cover sheet, independent of the rest of the Data Lifeboat, as a summary of the contents.

For example, if you had a large collection of Data Lifeboats, you could create an index from these cover sheets that a researcher could browse before deciding exactly which Data Lifeboat they wanted to download.

Now let’s look at a list of photos. If you click on any of the summary stats, or the “Photos” tab in the header, you see a list of photos.

This list shows you a preview thumbnail of each photo, and some metadata that can be used for filtering and sorting. For example, you can sort by photos with the most/least comments, or filter to photos uploaded by a particular Flickr member.

If you click on a photo, you can see an individual photo page. This shows you the original copy of the photo, and all the metadata we have about it:

Eventually you’ll be able to use the metadata on this page to find similar photos – for example, you’ll be able to click on a tag to find other photos with the same tag.

These pages still need a proper visual design, and this prototype is just meant to show the range of data we can capture. It’s already more understandable than a JSON file, but we think we can do even better!

Legible in the long term

The viewer will also contain documentation about both the Data Lifeboat idea in general and the structure of this particular package. If a Data Lifeboat is opened in 50 years by somebody who doesn’t know about the project, we want them to understand what they’re looking at and how they can use it.

It will also contain the text and agreement date of any policies agreed upon by the creator of this particular Data Lifeboat.

For example, as we create the machine-readable metadata files, we’re starting to document their structure. This should make it easier for future users to extract the metadata programmatically, or even build alternative viewer applications.

Lo-fi and low-tech

The whole viewer is written in a deliberately low-tech way. All the HTML templates, CSS and JavaScript are written by hand, with no external dependencies or bloated frameworks. This keeps the footprint small, makes it easier for us to work on as a small team, and we believe gives the viewer a good chance of lasting for multiple decades. The technology behind the web has a lot of sticking power.

This is a work-in-progress – we have more ideas that we haven’t built yet, and lots of areas where we know the viewer can be improved. Check back soon for updates as we continue to improve it, and look out for a public alpha next year where you’ll be able to create your own Data Lifeboats!

Progress Report: Creating a Data Lifeboat circa 2024

Alex Chan

In October and November, we held two Data Lifeboat workshops, funded by the Mellon Foundation. We had four days of detailed discussions about how Data Lifeboat should work, talked about some of the big questions around ethics and care, and got a lot of useful input from our attendees.

As part of the workshops, we showed a demo of our current software, where we created and downloaded a prototype Data Lifeboat. This is an early prototype, with a lot of gaps and outstanding questions, but it was still very helpful to guide some of our conversations. Feedback from the workshops will influence what we are building, and we plan to release a public alpha next year.

We are sharing this work in progress as a snapshot of where we’ve got to. The prototype isn’t built for other people to use, but I can walk you through it with screenshots and explanations.

In this post, I’ll walk you through the creation workflow – the process of preparing a Data Lifeboat. In a follow-up post, I’ll show you what you get when you download a finished Data Lifeboat.

Step 1: Sign in to Flickr

To create a Data Lifeboat, you have to sign in to your Flickr account:

This gives us an authenticated source of identity for who is creating each Data Lifeboat. This means each Data Lifeboat will reflect the social position of its creator on Flickr.com. For example, after you log in, your Data Lifeboat could contain photos shared with you by friends and family, where those photos would not be accessible to other Flickr members who aren’t part of the same social graph.

Step 2: Choose the photos you want to save

To choose photos, you enter a URL that points to photos on Flickr.com:

In the prototype, we only accept a single URL – a Gallery, Album, or Photostream – for now. This is good enough for prototyping, but we know we’ll need something more flexible later – for example, we might allow you to enter multiple URLs, and then we’d get photos from all of them.

Step 3: See a summary of the photos you’d be downloading

Once you’ve given us a URL, we fetch the first page of photos and show you a summary. This includes things like:

  • How many photos are there?
  • How many photos are public, semi-public, or private?
  • What license are the photos using?
  • What’s the safety level of the photos?
  • Have the owners disabled downloads for any of these photos?

Each of these controls affects what we are permitted to put in a Data Lifeboat, and the answers will be different for different people. Somebody creating their family archive may want all the photos, whereas somebody creating a Data Lifeboat for a museum might only want photos which are publicly visible and openly licensed.

We want Data Lifeboat creators to make informed decisions about what goes in their Data Lifeboat, and we believe we can do better than showing them a series of toggle switches. The current design of this screen is meant to give us a sense of how these controls are used in practice. It exposes the raw mechanics of Flickr.com, and relies on a detailed understanding of how the site works. We know this won’t be the final design. We might, for example, build an interface that asks people where they intend to store the Data Lifeboat, and use that to inform which photos we include. This is still speculative, and we have a lot of ideas we haven’t tried yet.

The prototype only saves public, permissively-licensed photos, because we’re still working out the details of how we handle more restrictively licensed and private photos.

Step 4: Write a README

This is a vital step – it’s where people give us more context. A single Data Lifeboat can only contain a sliver of Flickr, so we want to give the creator the opportunity to describe why they made this selection, and also to include any warnings about sensitive content so it’s easier to use the archive with care in future.

Tori will be writing up what happened at the workshops around how we could design this particular interface to encourage creators to think carefully here.

We like the idea of introducing ‘positive friction’ to this process, supporting people to write constructive and narrative notes to the future about why this particular sliver is important to keep.

Step 5: Agree to policies

When you create a Data Lifeboat, you need to agree to certain conditions around responsible use of the photos you’re downloading:

The “policies” in the current prototype are obviously placeholders. We know we will need to impose certain conditions, but we don’t know what they are yet.

One idea we’re developing is that these policies might adapt dynamically based on the contents of the Data Lifeboat. If you’re creating a Data Lifeboat that only contains your own public photos, that’s very different from one that contains private photos uploaded by other people.

Step 6: One moment please…

Creating or “baking” a Data Lifeboat can take a while – we need to download all the photos and their associated metadata, and construct the Data Lifeboat package.

In the prototype we show you a holding page:

We imagine that in the future, we’d email you a notification when the Data Lifeboat has finished baking.

Step 7: Download the Data Lifeboat

We have a page where you can download your Data Lifeboat:

Here you see a list of all the Data Lifeboats that we’ve been prototyping, because we wanted people to share their ideas for Data Lifeboats at our co-design workshops. In the real tool, you’ll only be able to see and download Data Lifeboats that you created.

What’s next?

We still have plenty to get on with, but you can see the broad outline of where we’re going, and it’s been really helpful to have an end-to-end tool that shows the whole Data Lifeboat creation process.

Come back next week, and I’ll show you what you get inside the Data Lifeboat when you download it!

How do you preserve 50 billion photos? Notes from iPres 2024

Alex Chan

In September, Tori and I went to Belgium for iPres 2024. We were keen to chat about digital preservation and discuss some of our ideas for Data Lifeboat – and enjoy a few Belgian waffles, of course!

Photo by the Digital Preservation Coalition, used under CC BY-NC-SA 2.0

We ran a workshop called “How do you preserve 50 billion photos?” to talk about the challenges of archiving social media at scale. We had about 30 people join us for a lively discussion. Sadly we don’t have any photos of the workshop, but we did come away with a lot to think about, and we wanted to share some of the ideas that emerged.

Thanks to the National Endowment for the Humanities for supporting this trip as part of our Digital Humanities Advancement Grant.

How much can we preserve?

Some people think we should try to collect all of social media – accumulate as much data as possible, and sort through it later. That might be appealing in theory, because we won’t delete anything that’s important to future researchers, but it’s much harder in practice.

For Data Lifeboat, this is a problem of sheer scale. There are 50 billion photos on Flickr, and trillions of points that form the social graph that connects them. It’s simply too much for a single institution to collect as a whole.

At the conference we heard about other challenges that make it hard to archive social media: constraints on staff time, limited resources, and a lack of cooperation from platform owners. Twitter/X came up repeatedly, as an example of a site which has become much harder to archive after changes to the API.

There are also longer-term concerns. Sibyl Schaefer, who works at the University of California, San Diego, presented a paper about climate change, and how scarcity of oil and energy will affect our ability to do digital preservation. All of our digital services rely on a steady supply of equipment and electricity, which seems increasingly fraught as we look ahead over the next 100 years. “Just keep everything” may not be a sustainable strategy.

This paper was especially resonant for us, because she encourages us to think about these problems now, before the climate crisis gets any worse. It’s better to make a decision when you have more options and things are (relatively) calm, than wait for things to get really bad and be forced to make a snap judgment. This matches our approach to rights, privacy, and legality with Data Lifeboat – we’re taking the time to consider the best approach while Flickr is still happy and healthy, and we’re not under time pressure to make a quick decision.

What should we keep?

We went to iPres believing that trying to keep everything is inappropriate for Flickr and social media, and the conversations we had only strengthened this view. There are definitely benefits to keeping everything, but it requires an abundance of staffing and resources that simply doesn’t exist.

One thing we heard at our Birds of a Feather session is that if you can only choose a selection of photos from Flickr, large institutions don’t want to make that selection themselves. They want an intermediate curator to choose photos to go in a Data Lifeboat, and then bequeath that Data Lifeboat to their archive. That person decides what they think is worth keeping, not the institution.

Who chooses what to keep?

If you can only save some of a social media service, how do you decide which part to take? You might say “keep the most valuable material”, but who decides what’s valuable? This is a thorny question that came up again and again at iPres.

An institution could conceivably collect Data Lifeboats from many people, each of whom made a different selection. Several people pointed out that any selection process will introduce bias and inequality – and while it’s impossible to fix these completely, having many people involved can help mitigate some of the issues.

This ‘collective selection’ helps deal with the fact that social media is really big – there’s so much stuff to look at, and it’s not always obvious where the interesting parts are. Sharing that decision with different people creates a broader perspective of what’s on a platform, and what material might be worth keeping.

Why are we archiving social media?

The discussion around why we archive social media is still frustratingly speculative. We went to iPres hoping to hear some compelling use cases or examples, but we didn’t.

There are plenty of examples of people using material from physical archives to aid their research. For example, one person told the story of the Minutes of the Milk Marketing Board. Once thought of as a dry and uninteresting collection, it became very useful when there was an outbreak of foot-and-mouth disease in Britain. We didn’t hear any case studies like that for digital archives.

There are already lots of digital archives and archives of Internet material. It would be interesting to hear from historians and researchers who are using these existing collections, to hear what they find useful and valuable.

The Imaginary Future Researcher

A lot of discussion revolved around an imaginary future researcher or PhD student, who would dive into the mass of digital material and find something interesting – but these discussions were often frustratingly vague. The researcher would do something with the digital collections, but more specifics weren’t forthcoming.

As we design Data Lifeboat, we’ve found it useful to imagine specific scenarios:

  • The Museum of London works with schools across the city, engaging students to collect great pictures of their local area. A schoolgirl in Whitechapel selects 20 photos of Whitechapel she thinks are worth depositing in the Museum’s collection.
  • A botany student at California State searches across Flickr for photographs of plant coverage in a specific area and gathers them into a longitudinal archive.
  • A curation student interning at Qtopia in Sydney wants to gather community documentation of Sydney’s Mardi Gras.

These only cover a handful of use cases, but they’ve helped ground our discussions and imagine how a future researcher might use the material we’re saving.

A Day Out in Antwerp

On my final day in Belgium, I got to visit a couple of local institutions in Antwerp. I got a tour of the Plantin-Moretus Museum, which focuses on sixteenth-century book printing. The museum is an old house with some gorgeous gardens:

And a collection of old printing machines:

There was even a demonstration on a replica printing machine, and I got to try a bit of printing myself – it takes a lot of force to push the metal letters and the paper together!

Then in the afternoon I went to the FelixArchief, the city archive of Antwerp, which is stored inside an old warehouse next to the Port of Antwerp:

We got a tour of their stores, including some of the original documents from the very earliest days of Antwerp:

And while those shelves may look like any other archive, there’s a twist – they have to fit around the interior shape of the original warehouse! The archivists have got rather good at playing Tetris with their boxes, and fitting everything into tight spaces – this gets them an extra 2 kilometres of shelving space!

Our tour guide explained that this is all because the warehouse is a listed building – and if the archives were ever to move out, they’d need to remove all their storage and leave it in a pristine condition. No permanent modifications allowed!

Next steps for Data Lifeboat

We’re continuing to think about what we heard at iPres, and to bring it into the design and implementation of Data Lifeboat.

This month and next, we’re holding Data Lifeboat co-design workshops in Washington DC and London to continue these discussions.

Adding maps to the Commons Explorer

Alex Chan

This week we added maps to our Commons Explorer, and it’s proving to be a fun new way to find photos.

There are over 50,000 photos in the Flickr Commons collection which have location information telling us where the photo was taken. We can plot those locations on a map of the world, so you can get a sense of the geographical spread:

This map is interactive, so you can zoom in and move around to focus on a specific place. As you do, we’ll show you a selection of photos from the area you’ve selected.

You can also filter the map, so you see photos from just a single Commons member. For smaller members the map points can tell a story in themselves, and give you a sense of where a collection is and what it’s about:

These maps are available now, and know about the location of every geotagged photo in Flickr Commons.

Give them a try!

How can you add a location to a Flickr Commons photo?

For the first version of this map, we use the geotag added by the photo’s owner.

If you’re a Flickr Commons member, you can add locations to your photos and they’ll automatically show up on this map. The Flickr Help Center has instructions for how to do that.

It’s possible for other Flickr members to add machine tags to photos, and there are already thousands of crowdsourced tags that have location-related information. We don’t show those on the map right now, but we’re thinking about how we might do that in future!

How does the map work?

There are three technologies that make these maps possible.

The first is SQLite, the database engine we use to power the Commons Explorer. We have a table which contains every photo in the Flickr Commons, and it includes any latitude and longitude information. SQLite is wicked fast and our collection is small potatoes, so it can get the data to draw these maps very quickly.

I’d love to tell you about some deeply nerdy piece of work to hyper-optimize our queries, but it wasn’t necessary. I wrote the naïve query, added a couple of column indexes, and that first attempt was plenty fast. (Tallying the locations for the entire Flickr Commons collection takes ~45ms; tallying the locations for an individual member is often under a millisecond.)
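
As a rough illustration, the underlying query is little more than a GROUP BY over the photos table. Here’s a minimal sketch using Python’s sqlite3 module – the database file, table, and column names are assumptions for the sketch, not our real schema:

import sqlite3

con = sqlite3.connect("commons.db")  # hypothetical database file

# Tally geotagged photos by rounded location, optionally for a single member.
# The table and column names (photos, owner_id, latitude, longitude) are
# assumptions for this sketch.
query = """
    SELECT round(latitude, 2) AS lat,
           round(longitude, 2) AS lon,
           count(*) AS photo_count
    FROM photos
    WHERE latitude IS NOT NULL
      AND (:owner_id IS NULL OR owner_id = :owner_id)
    GROUP BY lat, lon
"""

# Pass a member's ID to tally a single member, or None for the whole collection.
rows = con.execute(query, {"owner_id": None}).fetchall()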

The second is Leaflet.js, a JavaScript library for interactive maps. This is a popular and feature-rich library that made it easy for us to add a map to the site. Combined with a marker clustering plugin, we had a lot of options for configuring the map to behave exactly as we wanted, and to connect it to Flickr Commons data.

The third is OpenStreetMap. This is a world map maintained by a community of volunteers, and we use their map tiles as the backdrop for our map.

Plus ça Change

To help us track changes to the Commons Explorer, we’ve added another page: the changelog.

This is part of our broader goal of archiving the organization. Even in the six months since we launched the Explorer, it’s easy to forget what happened when, and new features quickly feel normal. The changelog is a place for us to remember what’s changed and what the site used to look like, as we continue to make changes and improvements.

Working with snapshots of structured data on Wikimedia Commons

In my previous post about Flickypedia Backfillr Bot, I mentioned that Flickypedia uses snapshots of structured data on Wikimedia Commons to spot possible duplicates:

We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

As we’ve been working on Flickypedia, we’ve developed a few tactics for working with these snapshots, which we thought might be useful for other people working with Wikimedia Commons data.

What are these snapshots?

Files on Wikimedia Commons can contain structured data—machine-readable metadata saying where the file came from, the license of the file, when it was created, and so on. For a longer explanation of structured data, read my previous post.

The structured data snapshots are JSON files that contain the structured data statements for all the files on Wikimedia Commons. (One of many public dumps of Wikimedia content.) These snapshots are extremely useful if you have a task that involves searching the database en masse – for example, finding all the Flickr photos on Commons.

All the snapshots we worked with are available for download from https://dumps.wikimedia.org/commonswiki/entities/, and new snapshots are typically created a few times a week.

Do you need snapshots?

Snapshots can be cumbersome, so if you need a quick answer, there may be better ways to get data out of Wikimedia Commons, like Special:MediaSearch and the Commons Query Service, which both support querying on structured data. But if you need to look at Wikimedia Commons as a whole, or run some sort of complex query or analysis that doesn’t fit into an existing tool, the structured snapshots can be very useful.

We’ve already found several use cases for them at the Flickr Foundation:

  • Finding every Flickr photo on Wikimedia Commons. As discussed in previous posts, the many variants of Flickr URL make it difficult to run a query for Flickr photos on Commons – but we can do this analysis easily with a snapshot. We can parse the data in the snapshot with our Flickr URL parser and store the normalised information in a new database.
  • Seeing how structured data is already being used. When we were designing the Flickypedia data model, part of our research involved looking at how structured data was already being used for Flickr photos. Using the snapshots, we could look for examples we could mimic, and compare our ideas to the existing data. Was our proposal following a popular, well-established approach, or was it novel and perhaps more controversial?
  • Verifying our assumptions about structured data. By doing an exhaustive search of the structured data, we could check if our assumptions were correct – and sometimes we’d find counterexamples that forced us to rethink our approach. For example, “every Wikimedia Commons file comes from zero or one Flickr photos”. Looking at the snapshots told us this was false – there are some files which link to multiple Flickr photos, because the same photo was uploaded to Flickr multiple times.

How do you download a snapshot?

The snapshots are fairly large: the latest snapshots are over 30GB, and that’s only getting bigger as more structured data is created. It takes me multiple hours to download a snapshot, and that can be annoying if the connection drops partway through.

Fortunately, Wikimedia Commons has a well-behaved HTTP server that supports resumable downloads. There are lots of download managers that can resume the download when it gets interrupted, so you can download a snapshot over multiple sessions. I like curl because it’s so ubiquitous – there’s a good chance it’s already installed on whatever computer I’m using.

This is an example of the curl command I run:

curl \
  --location \
  --remote-name \
  --continue-at - \
  "https://dumps.wikimedia.org/commonswiki/entities/20240617/commons-20240617-mediainfo.json.gz"

I usually have to run it multiple times to get a complete download, but it does eventually succeed. The important flag here is --continue-at -, which tells curl to resume a previous download.

Which format should you download?

The snapshots are available in two formats: bzip2-compressed JSON, and gzip-compressed JSON. They have identical contents, just compressed differently. Which should you pick?

I wasn’t sure which format was right, so when I was getting started, I downloaded both and ran some experiments to see which was a better fit for our use case. We iterate through every file in a snapshot as part of Flickypedia, so we wanted a format we could read quickly.

The file sizes are similar: 33.6GB for bzip2, 43.4GB for gzip. Both of these are manageable downloads for us, so file size wasn’t a deciding factor.

Then I ran a benchmark on my laptop to see how long it took to read each format. This command is just uncompressing each file, and measuring the time it takes:

$ time bzcat commons-20240617-mediainfo.json.bz2 >/dev/null
Executed in 113.48 mins

$ time gzcat commons-20240617-mediainfo.json.gz >/dev/null
Executed in 324.17 secs

That’s not a small difference: gzip is 21 times faster to uncompress than bzip2. Even accounting for the fairly unscientific test conditions, it was the clear winner. For Flickypedia, we use the gzip-compressed snapshots.

What’s inside a snapshot?

An uncompressed snapshot is big – the latest snapshot contains nearly 400GB of JSON.

The file contains a single, massive JSON array:

[
   { … data for the first file … },
   { … data for the second file … },
   …
   { … data for the last file … }
]

Aside from the opening and closing square brackets, each line has a JSON object that contains the data for a single file on Wikimedia Commons. This makes it fairly easy to stream data from this file, without trying to parse the entire snapshot at once.

If you’re curious about the structure of the data, we have some type definitions in Flickypedia: one for the top-level snapshot entries, one for the Wikidata data model which is used for structured data statements. Unfortunately I haven’t been able to find a lot of documentation for these types on Wikimedia Commons itself.

How to read snapshots

The one-file-per-line structure of the snapshot JSON allows us to write a streaming parser in Python. This function yields the data for one file at a time, which is far more memory-efficient than parsing the entire snapshot at once:

import gzip
import json


def get_entries_from_snapshot(path):
    with gzip.open(path) as uncompressed_file:
        for line in uncompressed_file:

            # Skip the square brackets at the beginning/end of the file
            # which open/close the JSON array
            if line.strip() in {b"[", b"]"}:
                continue

            # Strip the trailing comma at the end of each line
            line = line.rstrip(b",\n")

            # Parse the line as JSON, and yield it to the caller
            entry = json.loads(line)
            yield entry


path = "commons-20240617-mediainfo.json.gz"

for entry in get_entries_from_snapshot(path):
    print(entry)

# {'type': 'mediainfo', 'id': 'M76', … }
# …

This does take a while – on my machine, it takes around 45 minutes just to read the snapshot, with no other processing.

To avoid having to do this too often, my next step is to extend this script to extract the key information I want from the snapshot.

For example, for Flickypedia, we’re only really interested in P12120 (Flickr Photo ID) and P7482 (Source of File) when we’re looking for Flickr photos which are already on Commons. A script which extracts just those two fields can reduce the size of the data substantially, and give me a file that’s easier to work with.
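
Here’s a minimal sketch of what that extraction could look like, reusing the get_entries_from_snapshot() function from above. It assumes each entry stores its structured data under a "statements" key, keyed by property ID – check the type definitions linked earlier if the real layout differs:

import json

INTERESTING_PROPERTIES = {"P12120", "P7482"}  # Flickr Photo ID, Source of File

with open("flickr_statements.ndjson", "w") as out_file:
    for entry in get_entries_from_snapshot("commons-20240617-mediainfo.json.gz"):
        # Assumption: structured data statements live under a "statements" key,
        # as a dict keyed by property ID.
        statements = entry.get("statements", {})

        interesting = {
            property_id: statements[property_id]
            for property_id in INTERESTING_PROPERTIES
            if property_id in statements
        }

        # Only keep entries that have at least one of the properties we want
        if interesting:
            out_file.write(
                json.dumps({"id": entry["id"], "statements": interesting}) + "\n"
            )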

Flickypedia Backfillr Bot

Last year, we built Flickypedia, a new tool for copying photos from Flickr to Wikimedia Commons. As part of our planning, we asked for feedback on Flickr2Commons and analysed other tools. We spotted two consistent themes in the community’s responses:

  • Write more structured data for Flickr photos
  • Do a better job of detecting duplicate files

We tried to tackle both of these in Flickypedia, and initially, we were just trying to make our uploader better. Only later did we realize that we could take our work a lot further, and retroactively apply it to improve the metadata of the millions of Flickr photos already on Wikimedia Commons. At that moment, Flickypedia Backfillr Bot was born. Last week, the bot completed its millionth update, and we guesstimate we will be able to operate on another 13 million files.

The main goals of the Backfillr Bot are to improve the structured data for Flickr photos on Wikimedia Commons and to make it easier to find out which photos have been copied across. In this post, I’ll talk about what the bot does, and how it came to be.

Write more structured data for Flickr photos

There are two ways to add metadata to a file on Wikimedia Commons: by writing Wikitext or by creating structured data statements.

When you write Wikitext, you write your metadata in a MediaWiki-specific markup language that gets rendered as HTML. This markup can be written and edited by people, and the rendered HTML is designed to be read by people as well. Here’s a small example, which adds some metadata to a file, linking it back to the original Flickr photo:

== {{int:filedesc}} ==
{{Information
|Description={{en|1=Red-whiskered Bulbul photographed in Karnataka, India.}}
|Source=https://www.flickr.com/photos/shivanayak/12448637/
|Author=[[:en:User:Shivanayak|Shiva shankar]]
|Date=2005-05-04
|Permission=
|other_versions=
}}

and here’s what that Wikitext looks like when rendered as HTML:

A table with four rows: Description (Red-whiskered Bulbul photographed in Karnataka, India), Date (4 May 2005), Source (a Flickr URL) and Author (Shiva shankar)

This syntax is convenient for humans, but it’s fiddly for computers – it can be tricky to extract key information from Wikitext, especially when things get more complicated.

In 2017, Wikimedia Commons added support for structured data. This allows editors to add metadata in a machine-readable format. This makes it much easier to edit metadata programmatically, and there’s a strong desire from the community for new tools to write high-quality structured metadata that other tools can use.

When you add structured data to a file, you create “statements” which are attached to properties. The list of properties is chosen by the volunteers in the Wikimedia community.

For example, there’s a property called “source of file” which is used to indicate where a file came from. The file in our example has a single statement for this property, which says the file is available on the Internet, and points to the original Flickr URL:

Structured data is exposed via an API, and you can retrieve this information in nice machine-readable XML or JSON:

$ curl 'https://commons.wikimedia.org/w/api.php?action=wbgetentities&sites=commonswiki&titles=File%3ARed-whiskered%20Bulbul-web.jpg&format=xml'
<?xml version="1.0"?>
<api success="1">
  …
  <P7482>
    …
    <P973>
      <_v snaktype="value" property="P973">
        <datavalue
          value="https://www.flickr.com/photos/shivanayak/12448637/"
          type="string"/>
      </_v>
    </P973>
    …
  </P7482>
</api>

(Here “P7482” means “source of file” and “P973” is “described at URL”.)

Part of being a good structured data citizen is following the community’s established patterns for writing structured data. Ideally every tool would create statements in the same way, so the data is consistent across files – this makes it easier to work with later.

We spent a long time discussing how Flickypedia should use structured data, and we got a lot of helpful community feedback. We’ve documented our current data model as part of our Wikimedia project page.

Do a better job of detecting duplicate files

If a photo has already been copied from Flickr onto Wikimedia Commons, nobody wants to copy it a second time.

This sounds simple – just check whether the photo is already on Commons, and don’t offer to copy it if it’s already there. In practice, it’s quite tricky to tell if a given Flickr photo is on Commons. There are two big challenges:

  1. Files on Wikimedia Commons aren’t consistent in where they record the URL of the original Flickr photo. Newer files put the URL in structured data; older files only put the URL in Wikitext or the revision descriptions. You have to look in multiple places.
  2. Files on Wikimedia Commons aren’t consistent about which form of the Flickr URL they use – with and without a trailing slash, with the user NSID or their path alias, or the myriad other URL patterns that have been used in Flickr’s twenty-year history.

Here’s a sample of just some of the different URLs we saw in Wikimedia Commons:

https://www.flickr.com/photos/joyoflife//44627174
https://farm5.staticflickr.com/4586/37767087695_bb4ecff5f4_o.jpg
www.flickr.com/photo_edit.gne?id=3435827496
https://www.flickr.com/photo.gne?short=2ouuqFT

There’s no easy way to query Wikimedia Commons and see if a Flickr photo is already there. You can’t, for example, do a search for the current Flickr URL and be sure you’ll find a match – it wouldn’t find any of the examples above. You can combine various approaches that will improve your chances of finding an existing duplicate, if there is one, but it’s a lot of work and you get varying results.

For the first version of Flickypedia, we took a different approach. We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

This gave us a SQLite database that mapped Flickr photo IDs to Wikimedia Commons filenames. We could use this database to do fast queries to find copies of a Flickr photo that already exist on Commons. This proved the concept, but it had a couple of issues:

  • It was an incomplete list – we only looked in the structured data, and not the Wikitext. We estimate we were missing at least a million photos.
  • Nobody else can use this database; it only lives on the Flickypedia server. Theoretically somebody else could create it themselves – the snapshots are public, and the code is open source – but it seems unlikely.
  • This database is only as up-to-date as the latest snapshot we’ve downloaded – it could easily fall behind what’s on Wikimedia Commons.

We wanted to make this process easier – both for ourselves, and anybody else building Flickr–Wikimedia Commons integrations.

Adding the Flickr Photo ID property

Every photo on Flickr has a unique numeric ID, so we proposed a new Flickr photo ID property to add to structured data on Wikimedia Commons. This proposal was discussed and accepted by the Wikimedia Commons community, and gives us a better way to match files on Wikimedia Commons to photos on Flickr:

This is a single field that you can query, and there’s an unambiguous, canonical way that values should be stored in this field – you don’t need to worry about the different variants of Flickr URL.

We added this field to Flickypedia, so any files uploaded with our tool will get this new field, and we hope that other Flickr upload tools will consider adding this field as well. But what about the millions of Flickr photos already on Wikimedia Commons? This is where Flickypedia Backfillr Bot was born.

Updating millions of files

Flickypedia Backfillr Bot applies our structured data mapping to every Flickr photo it can find on Wikimedia Commons – whether or not it was uploaded with Flickypedia. For every photo which was copied from Flickr, it compares the structured data to the live Flickr metadata, and updates the structured data if the two don’t match. This includes the Flickr Photo ID.

It reuses code from our duplicate detector: it goes through a snapshot looking for any files that come from Flickr photos. Then it gets metadata from Flickr, checks if the structured data matches that metadata, and if not, it updates the file on Wikimedia Commons.

Here’s a brief sketch of the process:
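
In simplified Python, the decision logic for a single file looks something like this. This is a self-contained sketch, not the bot’s real code – the real bot compares full structured data statements rather than plain strings:

def sync_file(existing_statements, expected_statements):
    """
    Decide what to do for a single Wikimedia Commons file.

    `existing_statements` maps property IDs to the values currently on Commons;
    `expected_statements` maps property IDs to values computed from the live
    Flickr metadata.  Returns a list of (action, property_id) pairs.
    """
    actions = []

    for property_id, expected_value in expected_statements.items():
        existing_value = existing_statements.get(property_id)

        if existing_value is None:
            # Nothing on Commons yet: safe to write the new statement
            actions.append(("write", property_id))
        elif existing_value != expected_value:
            # Conflicting value: do nothing, flag it for manual review
            actions.append(("flag_for_review", property_id))
        # If the existing statement already matches, there's nothing to do

    return actions


# Placeholder values, just to show the three outcomes
print(sync_file(
    existing_statements={"P12120": "12448637"},
    expected_statements={"P12120": "12448637", "P7482": "<source of file>"},
))
# [('write', 'P7482')]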

Most of the time this logic is fairly straightforward, but occasionally the bot will get confused – this is when the bot wants to write a structured data statement, but there’s already a statement with a different value. In this case, the bot will do nothing and flag it for manual review. There are edge cases and unusual files in Wikimedia Commons, and it’s better for the bot to do nothing than write incorrect or misleading data that will need to be reverted later.

Here are two examples:

  • Sometimes Wikimedia Commons has more specific metadata than Flickr. For example, this Flickr photo was posted by the Donostia Kultura account, and the description identifies Leire Cano as the photographer.

    Flickypedia Backfillr Bot wants to add a creator statement for “Donostia Kultura”, because it can’t understand the description – but when this file was copied to Wikimedia Commons, somebody added a more specific creator statement for “Leire Cano”.

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we’ve left the existing statement as-is.

  • Sometimes existing data on Wikimedia Commons has been mapped incorrectly. For example, this Flickr photo was taken “circa 1943”, but when it was copied to Wikimedia Commons somebody added an overly precise “date taken” statement claiming it was taken on “1 Jan 1943”.

    This bug probably occurred because of a misunderstanding of the Flickr API. The Flickr API will always return a complete timestamp in the “date” field, and then return a separate granularity value telling you how accurate it is. If you ignored that granularity value, you’d create an incorrect statement of what the date is.

    The bot isn’t sure which statement is correct, so it does nothing and flags this for manual review – and in this case, we made a manual edit to replace the statement with the correct date.

What next?

We’re going to keep going! There were a few teething problems when we started running the bot, but the Wikimedia community helped us fix our mistakes. It’s now been running for a month or so, and processed over a million files.

All the Flickypedia code is open source on GitHub, and a lot of it isn’t specific to Flickr – it’s general-purpose code for working with structured data on Wikimedia Commons, and could be adapted to build similar bots. We’ve already had conversations with a few people about other use cases, and we’ve got some sketches for how that code could be extracted into a standalone library.

We estimate that at least 14 million files on Wikimedia Commons are photos that were originally uploaded to Flickr – more than 10% of all the files on Commons. There’s plenty more to do. Onwards and upwards!

The surprising utility of a Flickr URL parser

In my first week at the Flickr Foundation, we made a toy called Flinumeratr. This is a small web app that takes a Flickr URL as input, and shows you all the photos which are present at that URL.

As part of this toy, I made a Python library which parses Flickr URLs, and tells you what the URL points to – a single photo, an album, a gallery, and so on. Initially it just handled fairly common patterns, the sort of URLs that you’d encounter if you use Flickr today, but it’s grown to handle more complicated URLs.

$ flickr_url_parser "https://www.flickr.com/photos/sdasmarchives/50567413447"
{"type": "single_photo", "photo_id": "50567413447"}

$ flickr_url_parser "https://www.flickr.com/photos/aljazeeraenglish/albums/72157626164453131"
{"type": "album", "user_url": "https://www.flickr.com/photos/aljazeeraenglish", "album_id": "72157626164453131", "page": 1}

$ flickr_url_parser "https://www.flickr.com/photos/blueminds/page3"
{"type": "user", "user_url": "https://www.flickr.com/photos/blueminds"}

The implementation is fairly straightforward: I use the hyperlink library to parse the URL text into a structured object, then I compare that object to a list of known patterns. Does it look like this type of URL? Or this type of URL? Or this type of URL? And so on.
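
To give a flavour of the approach (a simplified sketch, not the real implementation), matching the single-photo pattern with hyperlink looks something like this:

from hyperlink import URL


def parse(text):
    u = URL.from_text(text)

    # Drop any empty path segments, e.g. from a trailing slash
    path = [segment for segment in u.path if segment]

    # Does it look like https://www.flickr.com/photos/{user}/{photo_id} ?
    if (
        u.host.endswith("flickr.com")
        and len(path) == 3
        and path[0] == "photos"
        and path[2].isdigit()
    ):
        return {"type": "single_photo", "photo_id": path[2]}

    # ...and so on, for every other known pattern
    raise ValueError(f"Not a recognised Flickr URL: {text}")


print(parse("https://www.flickr.com/photos/sdasmarchives/50567413447"))
# {'type': 'single_photo', 'photo_id': '50567413447'}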

You can run this library as a command-line tool, or call it from Python – there are instructions in the GitHub README.

There are lots of URL variants

In my second week and beyond, I started to discover more variants, which should probably be expected in 20-year-old software! I’ve been looking into collections of Flickr URLs that have been built up over multiple years, and although most of these URLs follow common patterns, there are lots of unusual variants in the long tail.

Some of these are pretty simple. For example, the URL to a user’s photostream can be formed using your Flickr user NSID or your path alias, so flickr.com/photos/197130754@N07/ and flickr.com/photos/flickrfoundation/ point to the same page.

Others are more complicated, and you can trace the history of Flickr through some of the older URLs. Some of my favorites include:

  • Raw JPEG files, on live.staticflickr.com, farm1.static.flickr.com, and several other subdomains.

  • Links with a .gne suffix, like www.flickr.com/photo_edit.gne?id=3435827496 (from Wikimedia Commons). This acronym stands for Game Neverending, the online game out of which Flickr was born.

  • A Flash video player called stewart.swf, which might be a reference to Stewart Butterfield, one of the cofounders of Flickr.

I’ve added support for every variant of Flickr URL to the parsing library – if you want to see a complete list, check out the tests. I need over a hundred tests to check all the variants are parsed correctly.
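
Those tests are essentially a long table of URLs and expected results. Here’s a hedged sketch of what one looks like, assuming a Python entry point along the lines of parse_flickr_url (check the README for the real name):

import pytest

from flickr_url_parser import parse_flickr_url  # assumed entry point


@pytest.mark.parametrize(
    "url, expected",
    [
        (
            "https://www.flickr.com/photos/sdasmarchives/50567413447",
            {"type": "single_photo", "photo_id": "50567413447"},
        ),
        (
            "https://www.flickr.com/photos/blueminds/page3",
            {"type": "user", "user_url": "https://www.flickr.com/photos/blueminds"},
        ),
    ],
)
def test_parse_flickr_url(url, expected):
    assert parse_flickr_url(url) == expected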

Where we’re using it

I’ve been able to reuse this parsing code in a bunch of different projects, including:

  • Building a similar “get photos at this URL” interface in Flickypedia.

  • Looking for Flickr photo URLs in Wikimedia Commons. This is for detecting Flickr photos which have already been uploaded to Commons, which I’ll describe more in another post.

  • Finding Flickr pages which have been captured in the Wayback Machine – I can get a list of saved Flickr URLs, and then see what sort of pages have actually been saved.

When I created the library, I wasn’t sure if this code was actually worth extracting as a standalone package – would I use it again, or was this a premature abstraction?

Now that I’ve seen more of the diversity of Flickr URLs and found more uses for this code, I’m much happier with the decision to abstract it into a standalone library. Now we only need to add support for each new URL variant once, and then all our projects can benefit.

If you want to try the Flickr URL parser yourself, all the code is open source on GitHub.

Data Lifeboat Update 4: What a service architecture could be like

We’re starting to write code for our Data Lifeboat, and that’s pushed us to decide what the technical architecture looks like. What are the different systems and pieces involved in creating a Data Lifeboat? In this article I’m going to outline what we imagine that might look like.

We’re still very early in the prototyping stage of this work. Our next step is going to be building an end-to-end prototype of this design, and seeing how well it works.

Here’s the diagram we drew on the whiteboard last week:

Let’s step through it in detail.

First somebody has to initiate the creation of a Data Lifeboat, and choose the photos they want to include. There could be a number of ways to start this process: a command-line tool, a graphical web app, a REST API.

We’re starting to think about what those interfaces will look like, and how they’ll work. When somebody creates a Data Lifeboat, we need more information than just a list of photos. We know we’re going to need things like legal agreements, permission statements, and a description of why the Lifeboat was created. All this information needs to be collected at this stage.

However these interfaces work, it all ends in the same way: with a request to create a Data Lifeboat for a list of photos and their metadata from Flickr.

To take a list of photos and create a Data Lifeboat, we’ll have a new Data Lifeboat Creator service. This will call the Flickr API to fetch all the data from Flickr.com, and package it up into a new file. This could take a long time, because we need to make a lot of API calls! (Minutes, if not hours.)

We already have the skeleton of this service in the Commons Explorer, and we expect to reuse that code for the Data Lifeboat.

We are also considering creating an index of all the Data Lifeboats we’ve created – for example, “Photo X was added to Data Lifeboat Y on date Z”. This would be a useful tool for people wanting to look up Flickr URLs if the site ever goes away. “I have a reference to photo X, where did that end up after Flickr?”

When all the API calls are done, this service will eventually produce a complete, standalone Data Lifeboat which is ready to be stored!

When we create the Data Lifeboat, we’re imagining we’ll keep it on some temporary storage owned by the Flickr Foundation. Once the packaging is complete, the person or organization who requested it can download it to their permanent storage. Then it becomes their responsibility to make sure it’s kept safely – for example, creating backups or storing it in multiple geographic locations.

The Flickr Foundation isn’t going to run a single, permanent store of all Data Lifeboats ever created. That would turn us into another Single Point of Failure, which is something we’re keen to avoid!

There are still lots of details to hammer out at every step of this process, but thinking about the broad shape of the Data Lifeboat service has already been useful. It’s helped us get a consistent understanding of what the steps are, and exposed more questions for us to ponder as we keep building.

How does the Commons Explorer work?

Last week we wrote an introductory post about our new Commons Explorer; today we’re diving into some of the technical details. How does it work under the hood?

When we were designing the Commons Explorer, we knew we wanted to look across the Commons collection – we love seeing a mix of photos from different members, not just one account at a time. We wanted to build more views that emphasize the breadth of the collection, and help people find more photos from more members.

We knew we’d need the Flickr API, but it wasn’t immediately obvious how to use it for this task. The API exposes a lot of data, but it can only query the data in certain ways.

For example, we wanted the homepage to show a list of recent uploads from every Flickr Commons member. You can make an API call to get the recent uploads for a single user, but there’s no way to get all the uploads for multiple users in a single API call. We could make an API call for every member, but with over 100 members we’d be making a lot of API calls just to render one component of one page!

It would be impractical to fetch data from the API every time we render a page – but we don’t need to. We know that there isn’t that much activity in Flickr Commons – it isn’t a social media network with thousands of updates a second – so rather than get data from the API every time somebody loads a page, we decided it’s good enough to get it once a day. We trade off a bit of “freshness” for a much faster and more reliable website.

We’ve built a Commons crawler that runs every night, and makes thousands of Flickr API calls (within the API’s limits) to populate a SQLite database with all the data we need to power the Commons Explorer. SQLite is a great fit for this sort of data – it’s easy to run, it gives us lots of flexibility in how we query the data, and it’s wicked fast with the size of our collection.
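
As a rough sketch, here’s what crawling one member’s photostream and writing it into SQLite might look like. The flickr.people.getPhotos call is a real Flickr API method (described further below); the API key, database file, and table schema are placeholders for illustration:

import sqlite3

import requests

API_URL = "https://api.flickr.com/services/rest/"
API_KEY = "..."  # placeholder


def crawl_member_photos(con, user_id):
    """Fetch every photo in one member's photostream, 500 at a time."""
    page = 1

    while True:
        resp = requests.get(API_URL, params={
            "method": "flickr.people.getPhotos",
            "api_key": API_KEY,
            "user_id": user_id,
            "per_page": 500,
            "page": page,
            "format": "json",
            "nojsoncallback": 1,
        }).json()

        for photo in resp["photos"]["photo"]:
            # "photos" is an assumed table name/schema for this sketch
            con.execute(
                "INSERT OR REPLACE INTO photos (id, owner, title) VALUES (?, ?, ?)",
                (photo["id"], user_id, photo["title"]),
            )

        if page >= resp["photos"]["pages"]:
            break
        page += 1


con = sqlite3.connect("commons.db")  # hypothetical database file
con.execute(
    "CREATE TABLE IF NOT EXISTS photos (id TEXT PRIMARY KEY, owner TEXT, title TEXT)"
)
crawl_member_photos(con, user_id="...")  # one member's NSID
con.commit()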

There are three main tables in the database:

  • The members
  • The photos uploaded by all the members
  • The comments on all those photos

We’re using a couple of different APIs to get this information:

  • The flickr.commons.getInstitutions API gives us a list of all the current Commons members. We combine this with the flickr.people.getInfo API to get more detailed information about each member (like their profile page description).
  • The flickr.people.getPhotos API gives us a list of all the photos in each member’s photostream. This takes quite a while to run – it returns up to 500 photos per call, but there are over 1.8 million photos in Flickr Commons.
  • The flickr.photos.comments.getList API gives us a list of all the comments on a single photo. To save us calling this 1.8 million times, we have some logic to check if there are any (new) comments since the last crawl – we don’t need to call this API if nothing has changed.

We can then write SQL queries to query this data in interesting ways, including searching photos and comments from every member at once.
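
For example, a single query can pull matching comments together with the photo and member they belong to – a sketch with assumed table and column names, not our exact schema:

import sqlite3

con = sqlite3.connect("commons.db")  # hypothetical database file

# Search comment text across every member at once, joining back to the
# photos and members tables described above.
results = con.execute(
    """
    SELECT members.name, photos.title, comments.text
    FROM comments
    JOIN photos ON comments.photo_id = photos.id
    JOIN members ON photos.owner_id = members.id
    WHERE comments.text LIKE '%lighthouse%'
    """
).fetchall()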

We have a lightweight Flask web app that queries the SQLite database and renders the results as nice HTML pages. This is what you see when you browse the website at https://commons.flickr.org/.

We have a couple of pages where we call the Flickr API to get the most up-to-date data (on individual member pages and the cross-Commons search), but most of the site is coming from the SQLite database. After fine-tuning the database with a couple of indexes, it’s now plenty fast, and gives us a bunch of exciting new ways to explore the Commons.

Having all the data in our own database also allows us to learn new stuff about the Flickr Commons collection that we can’t see on Flickr itself – like the fact that it has 1.8 million photos, or that together Flickr Commons as a whole has had 4.4 billion views.

This crawling code has been an interesting test bed for another project – we’ll be doing something very similar to populate a Data Lifeboat, but we’ll talk more about that in a separate post.

Data Lifeboat Update 2: More questions than answers

By Ewa Spohn

Thanks to the Digital Humanities Advancement Grant we were awarded by the National Endowment for the Humanities, our Data Lifeboat project (which is part of the Content Mobility Program) is now well and truly underway. The Data Lifeboat is our response to the challenge of archiving the 50 billion or so images currently on Flickr, should the service go down. It’s simply too big to archive as a whole, and we think that these shared histories should be available for the long term, so we’re exploring a decentralized approach. Find out more about the context for this work in our first blog post.

So, after our kick-off last month, we were left with a long list of open questions. That list became longer thanks to our first all-hands meeting that took place shortly afterwards! It grew again once we had met with the project user group – staff from the British Library, San Diego Air & Space Museum, and Congregation of Sisters of St Joseph – a small group representing the diversity of Flickr Commons members. Rather than being overwhelmed, we were buoyed by the obvious enthusiasm and encouragement across the group, all of whom agreed that this is very much an idea worth pursuing. 

As Mia Ridge from the British Library put it: “we need ephemeral collections to tell the story of now and give people who don’t currently think they have a role in preservation a different way of thinking about it”. And from Mary Grace of the Congregation of Sisters of St. Joseph in Canada: “we [the smaller institutions] don’t want to be the 3rd class passengers who drown first”.

Software sketching

We’ve begun working on the software approach to create a Data Lifeboat, focussing on the data model and assessing existing protocols we may use to help package it. Alex and George started creating some small prototypes to test how we should include metadata, and have begun exploring what “social metadata” could be like – that’s the kind of metadata that can only be created on Flickr, and is therefore a required element in any Data Lifeboat (as you’ll see from the diagram below, it’s complex). 


Feb 2024: An early sketch of a Data Lifeboat’s metadata graph structure.

Thanks to our first set of tools, Flinumeratr and Flickypedia, we have robust, reusable code for getting photos and metadata from Flickr. We’ve done some experiments with JSON, XML, and METS as possible ways to store the metadata, and started to imagine what a small viewer that would be included in each Data Lifeboat might be like. 

Complexity of long-term licensing

Alongside the technical development we have started developing our understanding of the legal issues that a Data Lifeboat is going to have to navigate to avoid unintended consequences of long-term preservation colliding with licenses set in the present. We discussed how we could build care and informed participation into the infrastructure, and what the pitfalls might be. There are fiddly questions around creating a Data Lifeboat containing photos from other Flickr members. 

  • As the image creator, would you need to be notified if one of your images has been added to a Data Lifeboat? 
  • Conversely, how would you go about removing an image from a Data Lifeboat? 
  • What happens if there’s a copyright dispute regarding images in a Data Lifeboat that is docked somewhere else? 

We discussed which aspects of other legal and licensing models might apply to Data Lifeboats, given the need to maintain stewardship and access over the long term (100 years at least!), as well as the need for the software to remain usable over this kind of time horizon. This isn’t something that the world of software has ready answers for. 

  • Could Flickr.org offer this kind of service? 
  • How would we notify future users of the conditions of the license, let alone monitor the decay of licenses in existing Data Lifeboats over this kind of timescale? 

So many standards to choose from

We had planned to do a deep dive into the various digital asset management systems used by cultural institutions, but this turned out to be a trickier subject than we thought, as there are simply too many approaches, tools, and cobbled-together hacks in use. Everyone seems to be struggling with this, so it’s not clear (yet) how best to proceed. If you have any ideas, let us know!

This work is supported by the National Endowment for the Humanities.
