Playlist Datamining 2: Doing it All in Google Refine

Last week I wrote about using Python, Google Refine and PostgreSQL to datamine my old radio playlists. That was fun.

Almost as soon as that post went up a commenter appeared to suggest that I wasn’t making full use of Google Refine‘s powers. Yes it can do data cleaning, but (they suggested) it can also do some of the data-to-knowledge transformation that I had jumped over to SQL for. I resisted that, and @MagdMartin, propietor of the Google refine basic blog, arrived and offered to prove it.

Prove it he did:

Data exploration tutorial with google refineGoogle refine basic blog

The full results can be had at the above link, include screen shots demonstrating the steps he took to reproduce my music charts without ever leaving Refine. E.g.:

Notice the Facet by choice counts button highlighted in red. I didn’t. It seems to be important.

He also started to look at distribution of albums among episodes, which isn’t something I got into. And made the point that I managed to play one Kleptones track twice in a single episode. I haven’t forgotten that incident, but thanks for the reminder.

I’m not likely to abandon PostgreSQL and SQL-based data mining completely in favour of Refine. Refine is of course not intended as a relational database manager, whereas Postgres can work it’s magic across many linked tables simultaneously. Even for working with a single spreadsheet it’s unlikely that Google Refine (or any GUI-based data mining application) will be able to match the utter data-mashing flexibility offered by SQL. Once you get good at composing SQL queries it can also save a bunch of time, particularly if you’re doing the same kind of querying repeatedly on multiple datasets or multiples facets of a given dataset. Tweaking a word or two in a SQL sentence and pasting it back into the console window is faster than re-clicking a bunch of buttons in the proper order.

On the other hand, there are real advantages to the all-Refine method proven above. Refine is relatively easy to install on your computer and import data into. PostgreSQL is a pain in the ass on both counts. Clicking buttons is intuitive. This is probably the only sentence on the internet containing the words SQL and intuitive. I can imagine some of my friends in journalism and the social sciences becoming very productive with Google Refine’s cleaning and manipulation methods (and they should!). PostgreSQL is probably a bridge too far for most people who don’t spend most of their lives up to their elbows in teh data.

Thanks MagdMartin for demo-ing the possibilities!

1 comment:

leave a comment