Playlist Datamining 2: Doing it All in Google Refine

Last week I wrote about using Python, Google Refine and PostgreSQL to datamine my old radio playlists. That was fun.

Almost as soon as that post went up a com­menter appeared to suggest that I wasn’t making full use of Google Refine’s powers. Yes it can do data cleaning, but (they sug­gested) it can also do some of the data-​​to-​​knowledge trans­for­ma­tion that I had jumped over to SQL for. I resisted that, and @MagdMartin, propi­etor of the Google refine basic blog, arrived and offered to prove it.

Prove it he did:

Data explo­ration tutorial with google refineGoogle refine basic blog

The full results can be had at the above link, include screen shots demon­strating the steps he took to repro­duce my music charts without ever leaving Refine. E.g.:

Notice the Facet by choice counts button high­lighted in red. I didn’t. It seems to be important.

He also started to look at dis­tri­b­u­tion of albums among episodes, which isn’t some­thing I got into. And made the point that I managed to play one Kleptones track twice in a single episode. I haven’t for­gotten that incident, but thanks for the reminder.

I’m not likely to abandon PostgreSQL and SQL–based data mining com­pletely in favour of Refine. Refine is of course not intended as a rela­tional database manager, whereas Postgres can work it’s magic across many linked tables simul­ta­ne­ously. Even for working with a single spread­sheet it’s unlikely that Google Refine (or any GUI-​​based data mining appli­ca­tion) will be able to match the utter data-​​mashing flex­i­bility offered by SQL. Once you get good at com­posing SQL queries it can also save a bunch of time, par­tic­u­larly if you’re doing the same kind of querying repeat­edly on multiple datasets or mul­ti­ples facets of a given dataset. Tweaking a word or two in a SQL sentence and pasting it back into the console window is faster than re-​​clicking a bunch of buttons in the proper order.

On the other hand, there are real advan­tages to the all-​​Refine method proven above. Refine is rel­a­tively easy to install on your computer and import data into. PostgreSQL is a pain in the ass on both counts. Clicking buttons is intu­itive. This is probably the only sentence on the internet con­taining the words SQL and intu­itive. I can imagine some of my friends in jour­nalism and the social sciences becoming very pro­duc­tive with Google Refine’s cleaning and manip­u­la­tion methods (and they should!). PostgreSQL is probably a bridge too far for most people who don’t spend most of their lives up to their elbows in teh data.

Thanks MagdMartin for demo-​​ing the possibilities!

1 comment:



leave a comment