Data-​​Mining My Old Radio Playlists

In which I scrape all my old radio playlists off the web, cook them with Python, Google Refine and PostgreSQL, and discover that I played one heck of a long tail of songs. And J.J. Cale.

A friend from the Ann Arbor years was in town for a con­fer­ence, and it put me in mind of my radio days. After every show I used to post up a link to the playlist auto­mat­i­cally gen­er­ated by the WCBN server. Like this for example. I was won­dering: what could be done with all that DJ Hugonaut data?

I was won­dering that, and I also happen to have a lot of idle time at night while I’m on sleeping-​​baby-​​monitoring duty. So here we go.

Scraping the Webpages with Python

The first step was to actually get all that data from the web, prefer­ably in some more data-​​like format than raw HTML. This task is some­times called ‘scraping’, which sounds kind of nasty but there you are. There may be better lan­guages for web scraping, but since I started pro­gram­ming in Python I’ve been enjoying coding more than any other time in my life. So Python it is. If you like pressing the tab button, you too might enjoy pro­gram­ming in Python.

Python has built in capa­bil­i­ties for con­necting to a website and getting the HTML, but nav­i­gating the raw HTML tags for useful info sounds like a terrible idea. An HTML parser was needed. I chose Beautiful Soup on the strength of it’s name. Works great.

Here’s the Python code I wrote for the web scraping job. Please don’t look at it. It was largely written while dis­tracted and occa­sion­ally while drunk.

The script starts by con­necting to hugh​stimson​.org/​p​r​o​j​e​c​t​s​/​d​j​h​u​go/ and scanning for indi­vidual podcast episodes. It extracts a bunch of data about each episode, most espe­cially the link to the playlist on the WCBN server. It sucks down that webpage as well, and rummages through it looking for all the tracks I played that day with their times, names, artist, album, etc.

It orga­nizes all those tracks from all the episodes into a table, and spits it into a big gob of a .csv file.

De-​​Duping in Google Refine

DJs at WCBN key the name of the track they’re playing into the playlist database as they play it, as well as the artist/​album/​label. In theory. In practice you’re busy spilling a thermos of cold coffee into turntable 1, and you rarely have time to spell Dilaudid (Marrtronix Version) pre­cisely the same way you did last month, or look up the label that orig­i­nally released A John Waters Christmas.

So the data quality from my old playlists is not exactly pristine. Enter Google Refine.

Google pitches Refine mostly towards jour­nal­ists who have to deal with the crummy state of publicly released gov­ern­ment records (here’s a com­pelling example of that pitch). But it’s also useful for former DJs rem­i­niscing about the bril­liant sets of their younger years.

There are a lot of power options in Refine that I don’t know how to use, but the key tool is cluster. Import a file, click on a column, create a ‘facet’, and then when that facet appears press the Cluster button.

Here’s an example of  Cluster sug­gesting that Tragically Hip and The Tragically Hip might be the same band. This is fundamental.

Google Refine clustering band names

There are a number of cluster algo­rithms avail­able, and as I ran them against track names, artists names and album names I found they almost all dis­cov­ered a new set of dupli­cates. After accepting or rejecting a round of sug­ges­tions from each algo­rithm you can re-​​run the clus­tering on the results and see if anything new pops up.

There are a bunch of algo­rithms for clus­tering baked into Refine, and most have tweaking options avail­able.  Almost every algo­rithm I ran surfaced mis­la­belled entries in my tracks, albums and artists, and most algo­rithms seemed capable of flagging a dif­ferent set of mistakes. Once a cluster of entries has been sug­gested you can quickly pick one name to assign to the bunch of them. Press Merge Selected & Re-​​Cluster (a very sat­is­fying button) and see what new sug­ges­tions emerge. If there’s nothing new, tweak your algo­rithm or try a new one. Deploy all algo­rithms. That’s life advice.

De-​​duplicating of table entries can also be done using the Filter option in Excel, plus a whole lot of scanning up and down the filtered list with your hurting eyeballs. But Refine’s cluster tools saves time, and catches edge cases you likely would have missed if you were depending on alpha­bet­ical prox­imity of sister entries. Even better, the merge & recluster process induces a strong feeling of saving time. This is espe­cially impor­tant if you’re ignoring your actual work to play with a vanity project.

Here’s the output:


fullscreen version

Analysis in PostgreSQL

Clean data is fine, but what I wanted was sum­ma­riza­tion. Mostly I wanted to answer a question that nagged me throughout my tenure as a DJ: was I qeueing up the same 12 songs over and over? Because some­times when it was 6:40 in the morning and the request line wasn’t flashing and I was dropping the needle on MC5’s Sister Anne yet again it sure felt like I was.

Perhaps pivot tables in Excel could have addressed this question. But since I started using PostGIS to answer geo­graphic posers I’ve been much taken by the speed and flex­i­bility of typing out little SQL words to do magic. PostGIS runs on PostgreSQL so up into the PostgreSQL database went the Refine output.

Results, Finally

With the data loaded into the database, ques­tions can be put to it. Some of those ques­tions follow, with the specific SQL query I used to ask them, and the result.

First off, how many music shows did I do (at least that I remem­bered to post a playlist for)?

SELECT COUNT(DISTINCT(episodename)) FROM radio

Result: 52

Sounds about right. And just how many times did I press the red ‘play’ button during those 52 shows?

SELECT COUNT(title) FROM radio

Result: 1387

That’s a lot of songs. I repeated some for sure. How many totally dif­ferent tracks did I play?

SELECT COUNT(DISTINCT(title)) FROM radio

Result: 1118

Damn. Over a thousand dif­ferent songs. And I promise I only occa­sion­ally pulled an album at random off the shelf.

Speaking of which, just how many albums did I draw from?

SELECT COUNT(DISTINCT(album)) FROM radio

Result: 599

And how many dif­ferent artists did I play?

SELECT COUNT(DISTINCT(artist)) FROM radio

Result: 650

Also a lot. More artists than albums in fact. Presumably that’s because I couldn’t always be bothered to enter an album name, but usually got to the artist field, which came first. And some­times I was playing some odd bit of internet arti­fac­tery and I didn’t know what the hell to write for ‘album’ anyway.

The DJ Hugonaut Charts

Down to details. What were the favourites?

Most-​​played tracks

SELECT title, artist, COUNT(title) FROM radio
GROUP BY title, artist
HAVING COUNT(title) > 3
ORDER BY COUNT(title) DESC, title ASC
#1 Violet Stars Happy Hunting! Janelle Monae 6 plays
#2 Run DNA The Avalanches 5 plays
#3 Can’t Let Go Lucinda Williams 4 plays
#3 Canary in a Coalmine The Police 4 plays
#3 Chicken Soup for the Fuck You Shout Out Out Out 4 plays
#3 Many Moons Janelle Monae 4 plays
#3 Sexual Healing Hot 8 Brass Band 4 plays

Ah yes, Violet Stars Happy Hunting. They say that album did very well on college radio. I’m to blame.

Most-​​played artists

SELECT artist, COUNT(artist) FROM radio
GROUP BY artist
HAVING COUNT (artist) > 9
ORDER BY COUNT DESC, artist ASC
#1 J.J. Cale 20 plays
#2 Fred Eaglesmith 17 plays
#2 Janelle Monae 17 plays
#3 Kleptones 16 plays
#3 The Mountain Goats 16 plays
#4 Neil Young 15 plays
#5 Lucinda Williams 13 plays
#5 Mike Doughty 13 plays
#6 Merle Travis 12 plays
#6 Tom Waits 12 plays
#7 Danko Jones 11 plays
#7 Jonathan Richman 11 plays
#8 Bob Dylan 10 plays
#8 Go Home Productions 10 plays

How I miss broad­casting J.J. Cale to the Ann Arbor and greater Ypsilanti region.

Fred Eaglesmith, Lucinda Williams, Merle Travis — I’ve for­gotten how much country I played. I don’t like country I swear. Just the awesome parts. And Merle Travis I suppose.

Most-​​played albums

SELECT album, artist, COUNT(album) FROM radio
GROUP BY album, artist
HAVING COUNT (album) > 5
ORDER BY COUNT DESC, album ASC
#1 Metropolis Suite 1: The Chase Janelle Monae 17 plays
#2 Live’r Than You’ll Ever Be Kleptones 11 plays
#3 Rockity Roll Mike Doughty 9 plays
#4 Fred J. Eaglesmith Fred Eaglesmith 7 plays
#4 The Complete Bootlegs Go Home Productions 7 plays
#5 Car Wheels On A Gravel Road Lucinda Williams 6 plays
#5 Not Saying/​Just Saying Shout Out Out Out 6 plays
#5 Southern Roots Jerry Lee Lewis 6 plays
#5 The Essential Taj Mahal 6 plays
#5 Vampire Weekend Vampire Weekend 6 plays

Again with the Janelle Monae. Metropolis Suite 1: The Chase was no doubt the album I played the most from, but it only made it to the top of this par­tic­ular list thanks to Refine’s ability to resolve a lot of dif­ferent ways to spell the same album name while stashing vinyl in slip covers and qeueing the emer­gency broad­cast test.

Live’r Than You’ll Ever Be made the list because the tracks on that album bleed very sweetly into each other. Thus if you need to walk out of the studio for a fire alarm or smoke break you can count on a solid hour of self-​​transitioning music. There’s a tip for you.

The Long Tail

So those are the charts. What I find sur­prising about them is how little play those top plays got. For example: the top 10 albums col­lec­tively con­tributed only 6% of 1400 plays. And that includes The Essential Taj Mahal.

Here’s the fre­quency of track play frequencies:

SELECT titlecount, COUNT(titlecount)
FROM (SELECT COUNT(title) AS titlecount
      FROM radio
      GROUP BY title, artist)
      AS subquery
GROUP BY titlecount
ORDER BY titlecount DESC
# of plays # of tracks played that many times
6 1
5 1
4 5
3 37
2 152
1 941

And let’s see that data in tra­di­tional long-​​tail layout.

SELECT title, artist, COUNT(title) FROM radio
GROUP BY title, artist
ORDER BY COUNT(title) DESC

track play frequencies

Yes by god, that is a long tail.

Sure I had some crushes on a few tracks, but 68% of my airtime was made up of 941 tracks that I queued up once and never again. I guess I could have gotten away with a few more repeats after all. Next time: all J.J. Cale, all the time. That will be good radio.

8 comments:

SQL is fun (to some people), but there’s a much quicker way to analyze your data, par­tic­u­larly if it’s already in Google Refine.

1. Click Facet->Text Facet on the Artist column
2. Click Sort by: Count (default is sort by name) in the next text facet on left

Repeat for the track column and you’ve got your first two analyses done in four clicks with not a single SQL state­ment written.

This is true! Good point. But if there’s a way to get Refine to get past the first two analyses I’m not yet aware of it.

I was won­dering the converse: is there a way to do some of fuzzy clus­tering and/​or de-​​duping in postgres?

As someone that wants to learn to use Python to scrape sites for useful analysis, this is very inspiring.

Hugh,

Can I used your dataset and try to do the same analysis using google refine only. I’m pretty sure I can do the same ;-)

For sure! I’m emailing you with the .csv file produced by the scraper script. If it’s possible to do the “chart” sum­maries using Refine I would love to know how.

SQL is great, but if Refine can handle advanced rolling up of data that would make it super con­ve­nient. Particularly for some of my journo friends who I could possibly talk into using Refine, but wouldn’t touch PostgreSQL et al with a prover­bial ten foot barge pole.

Very inter­ested to see what you can do with it.

[…] week I wrote about using Python, Google Refine and PostgreSQL to datamine my old radio playlists. That […]

Voici la liste des albums de J.J. Cale. Le premier est sorti en 1972, le dernier en 2009. Pour chacun d’eux, une page avec la cou­ver­ture, les paroles des chansons.

pony

hugh​stimson​.org

leave a comment