The surprising utility of a Flickr URL parser

In my first week at the Flickr Foundation, we made a toy called Flinumeratr. This is a small web app that takes a Flickr URL as input, and shows you all the photos which are present at that URL.

As part of this toy, I made a Python library which parses Flickr URLs, and tells you what the URL points to – a single photo, an album, a gallery, and so on. Initially it just handled fairly common patterns, the sort of URLs that you’d encounter if you use Flickr today, but it’s grown to handle more complicated URLs.

$ flickr_url_parser "https://www.flickr.com/photos/sdasmarchives/50567413447"
{"type": "single_photo", "photo_id": "50567413447"}

$ flickr_url_parser "https://www.flickr.com/photos/aljazeeraenglish/albums/72157626164453131"
{"type": "album", "user_url": "https://www.flickr.com/photos/aljazeeraenglish", "album_id": "72157626164453131", "page": 1}

$ flickr_url_parser "https://www.flickr.com/photos/blueminds/page3"
{"type": "user", "user_url": "https://www.flickr.com/photos/blueminds"}

The implementation is fairly straightforward: I use the hyperlink library to parse the URL text into a structured object, then I compare that object to a list of known patterns. Does it look like this type of URL? Or this type of URL? Or this type of URL? And so on.

You can run this library as a command-line tool, or call it from Python – there are instructions in the GitHub README.

There are lots of URL variants

In my second week and beyond, I started to discover more variants, which should probably be expected in 20-year old software! I’ve been looking into collections of Flickr URLs that have been built up over multiple years, and although most of these URLs follow common patterns, there are lots of unusual variants in the long tail.

Some of these are pretty simple. For example, the URL to a user’s photostream can be formed using your Flickr user NSID or your path alias, so flickr.com/photos/197130754@N07/ and flickr.com/photos/flickrfoundation/ point to the same page.

Others are more complicated, and you can trace the history of Flickr through some of the older URLs. Some of my favorites include:

  • Raw JPEG files, on live.staticflickr.com, farm1.static.flickr.com, and several other subdomains.

  • Links with a .gne suffix, like www.flickr.com/photo_edit.gne?id=3435827496 (from Wikimedia Commons). This acronym stands for Game Neverending, the online game out of which Flickr was born.

  • A Flash video player called stewart.swf, which might be a reference to Stewart Butterfield, one of the cofounders of Flickr.

I’ve added support for every variant of Flickr URL to the parsing library – if you want to see a complete list, check out the tests. I need over a hundred tests to check all the variants are parsed correctly.

Where we’re using it

I’ve been able to reuse this parsing code in a bunch of different projects, including:

  • Building a similar “get photos at this URL” interface in Flickypedia.

  • Looking for Flickr photo URLs in Wikimedia Commons. This is for detecting Flickr photos which have already been uploaded to Commons, which I’ll describe more in another post.

  • Finding Flickr pages which have been captured in the Wayback Machine – I can get a list of saved Flickr URLs, and then see what sort of pages have actually been saved.

When I created the library, I wasn’t sure if this code was actually worth extracting as a standalone package – would I use it again, or was this a premature abstraction?

Now that I’ve seen more of the diversity of Flickr URLs and found more uses for this code, I’m much happier with the decision to abstract it into a standalone library. Now we  only need to add support for each new URL variant once, and then all our projects can benefit.

If you want to try the Flickr URL parser yourself, all the code is open source on GitHub.