Playlist Datamining 2: Doing it All in Google Refine

Last week I wrote about using Python, Google Refine and PostgreSQL to datamine my old radio playlists. That was fun.

Almost as soon as that post went up a commenter appeared to suggest that I wasn’t making full use of Google Refine‘s powers. Yes it can do data cleaning, but (they suggested) it can also do some of the data-to-knowledge transformation that I had jumped over to SQL for. I resisted that, and @MagdMartin, propietor of the Google refine basic blog, arrived and offered to prove it.

Prove it he did:

Data exploration tutorial with google refineGoogle refine basic blog

The full results can be had at the above link, include screen shots demonstrating the steps he took to reproduce my music charts without ever leaving Refine. E.g.:

Notice the Facet by choice counts button highlighted in red. I didn’t. It seems to be important.

He also started to look at distribution of albums among episodes, which isn’t something I got into. And made the point that I managed to play one Kleptones track twice in a single episode. I haven’t forgotten that incident, but thanks for the reminder.

I’m not likely to abandon PostgreSQL and SQL-based data mining completely in favour of Refine. Refine is of course not intended as a relational database manager, whereas Postgres can work it’s magic across many linked tables simultaneously. Even for working with a single spreadsheet it’s unlikely that Google Refine (or any GUI-based data mining application) will be able to match the utter data-mashing flexibility offered by SQL. Once you get good at composing SQL queries it can also save a bunch of time, particularly if you’re doing the same kind of querying repeatedly on multiple datasets or multiples facets of a given dataset. Tweaking a word or two in a SQL sentence and pasting it back into the console window is faster than re-clicking a bunch of buttons in the proper order.

On the other hand, there are real advantages to the all-Refine method proven above. Refine is relatively easy to install on your computer and import data into. PostgreSQL is a pain in the ass on both counts. Clicking buttons is intuitive. This is probably the only sentence on the internet containing the words SQL and intuitive. I can imagine some of my friends in journalism and the social sciences becoming very productive with Google Refine’s cleaning and manipulation methods (and they should!). PostgreSQL is probably a bridge too far for most people who don’t spend most of their lives up to their elbows in teh data.

Thanks MagdMartin for demo-ing the possibilities!

Data-Mining My Old Radio Playlists

In which I scrape all my old radio playlists off the web, cook them with Python, Google Refine and PostgreSQL, and discover that I played one heck of a long tail of songs. And J.J. Cale.

A friend from the Ann Arbor years was in town for a conference, and it put me in mind of my radio days. After every show I used to post up a link to the playlist automatically generated by the WCBN server. Like this for example. I was wondering: what could be done with all that DJ Hugonaut data?

I was wondering that, and I also happen to have a lot of idle time at night while I’m on sleeping-baby-monitoring duty. So here we go.

Scraping the Webpages with Python

The first step was to actually get all that data from the web, preferably in some more data-like format than raw HTML. This task is sometimes called ‘scraping‘, which sounds kind of nasty but there you are. There may be better languages for web scraping, but since I started programming in Python I’ve been enjoying coding more than any other time in my life. So Python it is. If you like pressing the tab button, you too might enjoy programming in Python.

Python has built in capabilities for connecting to a website and getting the HTML, but navigating the raw HTML tags for useful info sounds like a terrible idea. An HTML parser was needed. I chose Beautiful Soup on the strength of it’s name. Works great.

Here’s the Python code I wrote for the web scraping job. Please don’t look at it. It was largely written while distracted and occasionally while drunk.

The script starts by connecting to hughstimson.org/projects/djhugo/ and scanning for individual podcast episodes. It extracts a bunch of data about each episode, most especially the link to the playlist on the WCBN server. It sucks down that webpage as well, and rummages through it looking for all the tracks I played that day with their times, names, artist, album, etc.

It organizes all those tracks from all the episodes into a table, and spits it into a big gob of a .csv file.

De-Duping in Google Refine

DJs at WCBN key the name of the track they’re playing into the playlist database as they play it, as well as the artist/album/label. In theory. In practice you’re busy spilling a thermos of cold coffee into turntable 1, and you rarely have time to spell Dilaudid (Marrtronix Version) precisely the same way you did last month, or look up the label that originally released A John Waters Christmas.

So the data quality from my old playlists is not exactly pristine. Enter Google Refine.

Google pitches Refine mostly towards journalists who have to deal with the crummy state of publicly released government records (here’s a compelling example of that pitch). But it’s also useful for former DJs reminiscing about the brilliant sets of their younger years.

There are a lot of power options in Refine that I don’t know how to use, but the key tool is cluster. Import a file, click on a column, create a ‘facet’, and then when that facet appears press the Cluster button.

Here’s an example of  Cluster suggesting that Tragically Hip and The Tragically Hip might be the same band. This is fundamental.

Google Refine clustering band names

There are a number of cluster algorithms available, and as I ran them against track names, artists names and album names I found they almost all discovered a new set of duplicates. After accepting or rejecting a round of suggestions from each algorithm you can re-run the clustering on the results and see if anything new pops up.

There are a bunch of algorithms for clustering baked into Refine, and most have tweaking options available.  Almost every algorithm I ran surfaced mislabelled entries in my tracks, albums and artists, and most algorithms seemed capable of flagging a different set of mistakes. Once a cluster of entries has been suggested you can quickly pick one name to assign to the bunch of them. Press Merge Selected & Re-Cluster (a very satisfying button) and see what new suggestions emerge. If there’s nothing new, tweak your algorithm or try a new one. Deploy all algorithms. That’s life advice.

De-duplicating of table entries can also be done using the Filter option in Excel, plus a whole lot of scanning up and down the filtered list with your hurting eyeballs. But Refine’s cluster tools saves time, and catches edge cases you likely would have missed if you were depending on alphabetical proximity of sister entries. Even better, the merge & recluster process induces a strong feeling of saving time. This is especially important if you’re ignoring your actual work to play with a vanity project.

Here’s the output:


fullscreen version

Analysis in PostgreSQL

Clean data is fine, but what I wanted was summarization. Mostly I wanted to answer a question that nagged me throughout my tenure as a DJ: was I qeueing up the same 12 songs over and over? Because sometimes when it was 6:40 in the morning and the request line wasn’t flashing and I was dropping the needle on MC5’s Sister Anne yet again it sure felt like I was.

Perhaps pivot tables in Excel could have addressed this question. But since I started using PostGIS to answer geographic posers I’ve been much taken by the speed and flexibility of typing out little SQL words to do magic. PostGIS runs on PostgreSQL so up into the PostgreSQL database went the Refine output.

Results, Finally

With the data loaded into the database, questions can be put to it. Some of those questions follow, with the specific SQL query I used to ask them, and the result.

First off, how many music shows did I do (at least that I remembered to post a playlist for)?

SELECT COUNT(DISTINCT(episodename)) FROM radio

Result: 52

Sounds about right. And just how many times did I press the red ‘play’ button during those 52 shows?

SELECT COUNT(title) FROM radio

Result: 1387

That’s a lot of songs. I repeated some for sure. How many totally different tracks did I play?

SELECT COUNT(DISTINCT(title)) FROM radio

Result: 1118

Damn. Over a thousand different songs. And I promise I only occasionally pulled an album at random off the shelf.

Speaking of which, just how many albums did I draw from?

SELECT COUNT(DISTINCT(album)) FROM radio

Result: 599

And how many different artists did I play?

SELECT COUNT(DISTINCT(artist)) FROM radio

Result: 650

Also a lot. More artists than albums in fact. Presumably that’s because I couldn’t always be bothered to enter an album name, but usually got to the artist field, which came first. And sometimes I was playing some odd bit of internet artifactery and I didn’t know what the hell to write for ‘album’ anyway.

The DJ Hugonaut Charts

Down to details. What were the favourites?

Most-played tracks

SELECT title, artist, COUNT(title) FROM radio
GROUP BY title, artist
HAVING COUNT(title) > 3
ORDER BY COUNT(title) DESC, title ASC
#1 Violet Stars Happy Hunting! Janelle Monae 6 plays
#2 Run DNA The Avalanches 5 plays
#3 Can’t Let Go Lucinda Williams 4 plays
#3 Canary in a Coalmine The Police 4 plays
#3 Chicken Soup for the Fuck You Shout Out Out Out 4 plays
#3 Many Moons Janelle Monae 4 plays
#3 Sexual Healing Hot 8 Brass Band 4 plays

Ah yes, Violet Stars Happy Hunting. They say that album did very well on college radio. I’m to blame.

Most-played artists

SELECT artist, COUNT(artist) FROM radio
GROUP BY artist
HAVING COUNT (artist) > 9
ORDER BY COUNT DESC, artist ASC
#1 J.J. Cale 20 plays
#2 Fred Eaglesmith 17 plays
#2 Janelle Monae 17 plays
#3 Kleptones 16 plays
#3 The Mountain Goats 16 plays
#4 Neil Young 15 plays
#5 Lucinda Williams 13 plays
#5 Mike Doughty 13 plays
#6 Merle Travis 12 plays
#6 Tom Waits 12 plays
#7 Danko Jones 11 plays
#7 Jonathan Richman 11 plays
#8 Bob Dylan 10 plays
#8 Go Home Productions 10 plays

How I miss broadcasting J.J. Cale to the Ann Arbor and greater Ypsilanti region.

Fred Eaglesmith, Lucinda Williams, Merle Travis — I’ve forgotten how much country I played. I don’t like country I swear. Just the awesome parts. And Merle Travis I suppose.

Most-played albums

SELECT album, artist, COUNT(album) FROM radio
GROUP BY album, artist
HAVING COUNT (album) > 5
ORDER BY COUNT DESC, album ASC
#1 Metropolis Suite 1: The Chase Janelle Monae 17 plays
#2 Live’r Than You’ll Ever Be Kleptones 11 plays
#3 Rockity Roll Mike Doughty 9 plays
#4 Fred J. Eaglesmith Fred Eaglesmith 7 plays
#4 The Complete Bootlegs Go Home Productions 7 plays
#5 Car Wheels On A Gravel Road Lucinda Williams 6 plays
#5 Not Saying/Just Saying Shout Out Out Out 6 plays
#5 Southern Roots Jerry Lee Lewis 6 plays
#5 The Essential Taj Mahal 6 plays
#5 Vampire Weekend Vampire Weekend 6 plays

Again with the Janelle Monae. Metropolis Suite 1: The Chase was no doubt the album I played the most from, but it only made it to the top of this particular list thanks to Refine’s ability to resolve a lot of different ways to spell the same album name while stashing vinyl in slip covers and qeueing the emergency broadcast test.

Live’r Than You’ll Ever Be made the list because the tracks on that album bleed very sweetly into each other. Thus if you need to walk out of the studio for a fire alarm or smoke break you can count on a solid hour of self-transitioning music. There’s a tip for you.

The Long Tail

So those are the charts. What I find surprising about them is how little play those top plays got. For example: the top 10 albums collectively contributed only 6% of 1400 plays. And that includes The Essential Taj Mahal.

Here’s the frequency of track play frequencies:

SELECT titlecount, COUNT(titlecount)
FROM (SELECT COUNT(title) AS titlecount
      FROM radio
      GROUP BY title, artist)
      AS subquery
GROUP BY titlecount
ORDER BY titlecount DESC
# of plays # of tracks played that many times
6 1
5 1
4 5
3 37
2 152
1 941

And let’s see that data in traditional long-tail layout.

SELECT title, artist, COUNT(title) FROM radio
GROUP BY title, artist
ORDER BY COUNT(title) DESC

track play frequencies

Yes by god, that is a long tail.

Sure I had some crushes on a few tracks, but 68% of my airtime was made up of 941 tracks that I queued up once and never again. I guess I could have gotten away with a few more repeats after all. Next time: all J.J. Cale, all the time. That will be good radio.

Youtube: the Audio Library for Congolese and Other Music

“Trawling through Vincent’s collection we pulled out 10 contemporary and classic grooves straight from the streets of Kinshasa. Many of these records are released as limited pressings and finding them can be an arduous task. Our best advice is to try the specialist African music outlet Stern’s.”

Congo, where rumba meets r’n’b — Josh Surtees, The Guardian

The article goes on to describe the ten tracks, each with a Youtube link discreetly included for those who don’t have time to scour Stern’s.

I would never have predicted that a video site would become something like a rough-and-ready universal library of audio.

Here’s Wendo Kolosoy, described in the article as the grandfather of Congolese rhumba, performing Marie Louise:

When I was DJ-ing at WCBN there was some disagreement over the probity of playing Youtube clips over the airwaves. The Program Director felt, reasonably enough, that DJs should strive for highest audio quality and to showcase the extraordinary, vinyl-anchored WCBN music library. I’m not sure exactly how that discussion resolved itself, but I don’t doubt that people will still fire up Youtube when they catch a tricky request or just can’t find a special track in the stacks. Because they can.

Review: Mike Doughty at the Biltmore Cabaret

Just as surely as Zach Galifianakis has become the Bill Murray of our generation, so Mike Doughty is our Fred Eaglesmith.

Music Still Has the Right to Children

How is it possible that Boards of Canada‘s Music Has the Right to Children still sounds like it’s from the near-future, almost 12 years after it was released?

(I would like to embed a music player here so you could listen to the album, but it looks like imeem.com has disappeared into the horrible maw of myspace, and I can’t find another web service that allows full-length track streaming.)

The Boats That Rocked

Anyone who liked The Boat That Rocked might also like this wikipedia article about Radio Caroline — the ship-board pirate radio stations that (presumably) inspired that movie. It’s quite a remarkable history.

Note that although The Boat That Rocked was set in 1966, Radio Caroline was actually pop-centric station until they switched over to album rock in 1974.

You can still tune in an internet descendant of Radio Caroline populated by many of the marine-era djs, if you don’t mind your classic rock mixed up with ads for real estate agencies and gift shows.

Soundtrack for the Hiring Season

The treeplanting company I’ve lately been working for seems to have folded. Who knows, maybe that is the final nail in the coffin of my treeplanting career, which has been drifting off to sleep asymptotically for years now.

Regardless of my own summer destiny, I’ve been fielding the occasional email from rookies and vets who are bound for the block this summer, and are looking for references and advice during this, the hiring season. I don’t know how much advice I have, but I can at least offer “Gorilla Warfare“, LazzaGun Soundsystem’s latest treeplanting themed musical concoction, which arrived today in the email. Be warned, the language is as salty as the planting day is long.

Gorilla Warfare(MP3)  by  Lars Zergun

You may remember Lars from such bygone hits as 2007 Pounder Mix and Welcome 2 Treeplanting 2012.

Retro Computing Jam Session

Have you ever had that feeling that somewhere out there, people are jamming on a Vic-20, a PET and a Commodore 64, possibly in some kind of classroom setting?

The middle computer would be Petsnyth‘s first (I assume) public performance.

Broken Happiness Machines Are Go

A couple of weeks ago I mentioned Petsynth, Chiron Bramberger’s novel synthesizer software for the Commodore Pet. But Chiron doesn’t just write music on the Pet, he also blasts 8-bit rythmic weirdness from a pipe-organ arrangement of Amigas and Ataris and god knows what else.

And last week, Chiron pressed play on the Broken Happiness Machines website, from which he will distribute some of those grooves. (And hopefully that software!)


photo by Chiron Bramberger

PetSynth: A Superior Synthesizer for the Commodore Pet

Today I got to try PetSynth v0.006, by Chiron Bramberger. Chiron owns several Commodore Pet personal computers, and was dissapointed by the quality of the music-making software available for them, so he wrote his own. He has plans to release it to the Commodore community but currently it’s stored on a 5.25″ disk lodged in his dual disk drive.

In addition to producing lovely beeps, Chiron figured out how to drive distortion in the Commodore’s built-in music hardware, to produce vibrato, and to generate something very much like a drum tone. Hooked up to a pair of pot knobs for bending the signal, the system can produce some mean 8 bit grooves.

Chiron also builds guitar effects pedals from the recycled innards of modems, grafted into lovely hand-painted acrylic boxes with the shells of harddrives for backs, but that’s a seperate project.

older posts →