Working with snapshots of structured data on Wikimedia Commons

In my previous post about Flickypedia Backfillr Bot, I mentioned that Flickypedia uses snapshots of structured data on Wikimedia Commons to spot possible duplicates:

We downloaded snapshots of the structured data for every file on Wikimedia Commons, and we built a database of all the links between files on Wikimedia Commons and Flickr photos. For every file in the snapshot, we looked at the structured data properties where we might find a Flickr URL. Then we tried to parse those URLs using our Flickr URL parsing library, and find out what Flickr photo they point at (if any).

As we’ve been working on Flickypedia, we’ve developed a few tactics for working with these snapshots, which we thought might be useful for other people working with Wikimedia Commons data.

What are these snapshots?

Files on Wikimedia Commons can contain structured data—machine-readable metadata saying where the file came from, the license of the file, when it was created, and so on. For a longer explanation of structured data, read my previous post.

The structured data snapshots are JSON files that contain the structured data statements for every file on Wikimedia Commons; they're one of many public dumps of Wikimedia content. These snapshots are extremely useful if you have a task that involves searching Wikimedia Commons en masse – for example, finding all the Flickr photos on Commons.

All the snapshots we worked with are available for download from https://dumps.wikimedia.org/commonswiki/entities/, and new snapshots are typically created a few times a week.

Do you need snapshots?

Snapshots can be cumbersome, so if you need a quick answer, there may be better ways to get data out of Wikimedia Commons, like Special:MediaSearch and the Commons Query Service, which both support querying on structured data. But if you need to look at Wikimedia Commons as a whole, or run some sort of complex query or analysis that doesn’t fit into an existing tool, the structured snapshots can be very useful.

We’ve already found several use cases for them at the Flickr Foundation:

  • Finding every Flickr photo on Wikimedia Commons. As discussed in previous posts, the many variants of Flickr URLs make it difficult to run a query for Flickr photos on Commons – but we can do this analysis easily with a snapshot. We can parse the data in the snapshot with our Flickr URL parser and store the normalised information in a new database.
  • Seeing how structured data is already being used. When we were designing the Flickypedia data model, part of our research involved looking at how structured data was already being used for Flickr photos. Using the snapshots, we could look for examples we could mimic, and compare our ideas to the existing data. Was our proposal following a popular, well-established approach, or was it novel and perhaps more controversial?
  • Verifying our assumptions about structured data. By doing an exhaustive search of the structured data, we could check if our assumptions were correct – and sometimes we’d find counterexamples that forced us to rethink our approach. For example, “every Wikimedia Commons file comes from zero or one Flickr photos”. Looking at the snapshots told us this was false – there are some files which link to multiple Flickr photos, because the same photo was uploaded to Flickr multiple times.

How do you download a snapshot?

The snapshots are fairly large: the latest snapshots are over 30GB, and that’s only getting bigger as more structured data is created. It takes me multiple hours to download a snapshot, and that can be annoying if the connection drops partway through.

Fortunately, Wikimedia Commons has a well-behaved HTTP server that supports resumable downloads. There are lots of download managers that can resume the download when it gets interrupted, so you can download a snapshot over multiple sessions. I like curl because it’s so ubiquitous – there’s a good chance it’s already installed on whatever computer I’m using.

This is an example of the curl command I run:

curl \
  --location \
  --remote-name \
  --continue-at - \
  "https://dumps.wikimedia.org/commonswiki/entities/20240617/commons-20240617-mediainfo.json.gz"

I usually have to run it multiple times to get a complete download, but it does eventually succeed. The important flag here is --continue-at -, which tells curl to resume a previous download.
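
If you'd rather script the retries than re-run curl by hand, the same resumable behaviour can be reproduced in Python with an HTTP Range request. This is a minimal sketch using the requests library – the URL is the same snapshot as in the curl example, and the function name and chunk size are just illustrative choices:

import os

import requests


def download_snapshot(url, path):
    """
    Download `url` to `path`, resuming a partial download if one exists.
    """
    # If we already have part of the file, ask the server for just the
    # bytes we're missing -- dumps.wikimedia.org supports Range requests.
    existing_size = os.path.getsize(path) if os.path.exists(path) else 0
    headers = {"Range": f"bytes={existing_size}-"} if existing_size else {}

    with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
        if resp.status_code == 206:
            mode = "ab"  # partial content: append to what we already have
        else:
            resp.raise_for_status()
            mode = "wb"  # the server sent the whole file: start from scratch

        with open(path, mode) as out_file:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                out_file.write(chunk)


download_snapshot(
    url="https://dumps.wikimedia.org/commonswiki/entities/20240617/commons-20240617-mediainfo.json.gz",
    path="commons-20240617-mediainfo.json.gz",
)

Note that if the local file is already complete, a Range request past the end of the file will typically get a 416 response, so you may want to handle that case before re-running the function.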

Which format should you download?

The snapshots are available in two formats: bzip2-compressed JSON, and gzip-compressed JSON. They have identical contents, just compressed differently. Which should you pick?

I wasn’t sure which format was right, so when I was getting started, I downloaded both and ran some experiments to see which was a better fit for our use case. We iterate through every file in a snapshot as part of Flickypedia, so we wanted a format we could read quickly.

The file sizes are similar: 33.6GB for bzip2, 43.4GB for gzip. Both of these are manageable downloads for us, so file size wasn’t a deciding factor.

Then I ran a benchmark on my laptop to see how long it took to read each format. These commands just uncompress each file and measure how long it takes:

$ time bzcat commons-20240617-mediainfo.json.bz2 >/dev/null
Executed in 113.48 mins

$ time gzcat commons-20240617-mediainfo.json.gz >/dev/null
Executed in 324.17 secs

That’s not a small difference: gzip is 21 times faster to uncompress than bzip2. Even accounting for the fairly unscientific test conditions, it was the clear winner. For Flickypedia, we use the gzip-compressed snapshots.

What’s inside a snapshot?

An uncompressed snapshot is big – the latest snapshot contains nearly 400GB of JSON.

The file contains a single, massive JSON array:

[
   { … data for the first file … },
   { … data for the second file … },
   …
   { … data for the last file … }
]

Aside from the opening and closing square brackets, each line has a JSON object that contains the data for a single file on Wikimedia Commons. This makes it fairly easy to stream data from this file, without trying to parse the entire snapshot at once.

If you’re curious about the structure of the data, we have some type definitions in Flickypedia: one for the top-level snapshot entries, and one for the Wikidata data model, which is used for structured data statements. Unfortunately, I haven’t been able to find much documentation for these types on Wikimedia Commons itself.
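
To give a rough sense of the shape of each entry, here's a sketch of the top-level structure as Python type hints. This is based on what we've seen in the snapshots rather than an official schema, and the SnapshotEntry name is mine – treat the field list as an approximation, and expect entries to carry extra fields beyond these:

import typing


class SnapshotEntry(typing.TypedDict):
    # e.g. "mediainfo"
    type: str

    # The MediaInfo ID of the file, e.g. "M76"
    id: str

    # Multilingual captions/descriptions, keyed by language code
    labels: dict[str, typing.Any]
    descriptions: dict[str, typing.Any]

    # Structured data statements, keyed by property ID (e.g. "P7482"),
    # each mapping to a list of statements in the Wikidata data model
    statements: dict[str, list[typing.Any]]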

How to read snapshots

The one-file-per-line structure of the snapshot JSON allows us to write a streaming parser in Python. This function yields one entry at a time, which is much more memory-efficient than trying to load the entire snapshot at once:

import gzip
import json


def get_entries_from_snapshot(path):
    with gzip.open(path) as uncompressed_file:
        for line in uncompressed_file:

            # Skip the square brackets at the beginning/end of the file
            # which open/close the JSON array
            if line.strip() in {b"[", b"]"}:
                continue

            # Strip the trailing comma at the end of each line
            line = line.rstrip(b",\n")

            # Parse the line as JSON, and yield it to the caller
            entry = json.loads(line)
            yield entry


path = "commons-20240617-mediainfo.json.gz"

for entry in get_entries_from_snapshot(path):
    print(entry)

# {'type': 'mediainfo', 'id': 'M76', … }
# …

This does take a while – on my machine, it takes around 45 minutes just to read the snapshot, with no other processing.

To avoid having to do this too often, my next step is to extend this script to extract the key information I want from the snapshot.

For example, for Flickypedia, we’re only really interested in P12120 (Flickr Photo ID) and P7482 (Source of File) when we’re looking for Flickr photos which are already on Commons. A script which extracts just those two fields can reduce the size of the data substantially, and give me a file that’s easier to work with.
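
Here's a rough sketch of what that extraction script might look like. It reuses get_entries_from_snapshot from above, and it assumes each entry's statements are keyed by property ID, as in the snapshots we've worked with; the PROPERTIES_OF_INTEREST set and the output filename are just illustrative:

import json

# The structured data properties we care about for Flickr matching
PROPERTIES_OF_INTEREST = {
    "P12120",  # Flickr Photo ID
    "P7482",   # Source of File
}

path = "commons-20240617-mediainfo.json.gz"

with open("flickr_statements.jsonl", "w") as out_file:
    for entry in get_entries_from_snapshot(path):
        statements = entry.get("statements", {})

        # Be defensive: only look at entries where the statements are
        # a dict of {property ID: list of statements}
        if not isinstance(statements, dict):
            continue

        # Keep just the properties we're interested in
        interesting = {
            property_id: value
            for property_id, value in statements.items()
            if property_id in PROPERTIES_OF_INTEREST
        }

        # Skip files with no Flickr-related statements
        if not interesting:
            continue

        # Write one JSON object per line, like the snapshot itself
        out_file.write(
            json.dumps({"id": entry["id"], "statements": interesting}) + "\n"
        )

The result is a much smaller file in the same one-object-per-line format, which is far quicker to re-read on subsequent runs.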