Monday, November 8, 2010

Mash-up: visualizing the top 500 authors in technology-enhanced learning

Some time ago I was asked to analyze the "Top authors in Computer Education" list by Microsoft Academic Search. In my view, there are problems with the list (and rankings in general), but in this post I don't want to discuss this, but highlight the powers of mash-ups.
Update: Tony Hirst suggest to user Google Fusion Tables, and he is right, this makes it even simpler. See at the bottom of the post for more info.

The power of visualization is that it makes information hidden in the data visible. For this list, I wanted to see it on a map, to see at one glance where the top ranked scientist come from (Pittsburgh by the way) and which are near me (not so many, sadly).
Here is the visualization:

View Larger Map
The amazing thing is that I could generate this map by reusing free services: Google Spreadsheets for extracting the information and generating a RSS feed, Yahoo! Pipes to manipulate the feed and YQL to add additional information. In detail:
Data extraction: the data in the table is plain html, which makes access difficult. In addition, the top 10 entries are formatted differently than the remaining entries. To convert the data into a more processing friendly format, I used Google Spreadsheets. This spreadsheet extracts the author names and their organizations using these two formulas:
=importXML("academic.research.microsoft.com/CSDirectory/author_category_23.htm"; "//td/a[contains(@href,'Author')]//text()")
=importXML("academic.research.microsoft.com/CSDirectory/author_category_23.htm"; "//a[contains(@href,'Organization')]")
importXML is a very powerful function that allows you to extract data using xpath.
I then shared this document as a Web page and copied the RSS link. This makes the list of authors available for further processing.
Yahoo! Pipes is an amazing service for mashing-up data. I created a pipe that takes my RSS output and adds geo-information, that is, it tries to find a location for the organization of each researcher. The pipes first cleans up the data from the spreadsheet and strips off unnecessary elements. It also limits the input to the first 500 items, as otherwise I had some problems with query limitations. 
The biggest problem is to add the geo-information. The table only contains the name of the organization, but no precise information on the location. Yahoo! Pipes offers the Location Builder module for converting text into locations, but it never worked for me. Luckily, hapdaniel has build a pipe that uses more intelligence to find a location from a piece of text. His pipe uses YQL, the Yahoo! Query Language, another great service from Yahoo! to process Web services (originally, I wanted to use YQL instead of Google Spreadsheet to extract the data, but I couldn't combine the results from the two queries necessary to extract author and organization into one feed). My pipe passes the organization name to his pipe and as a result gets back location information, when available. Obviously, this does not work every time, but often enough.
The pipe then adds the location information in the way it should be done for further processing and outputs the result: here is the result of the top authors in computer education as a KML file. You can load this link in Google Earth or input it into the search field of Google Maps to get the above map.
It is really amazing what you can do today with free services. Yahoo! has developed a suite of great tools, and I sincerely hope they will remain available.
Update: By importing the Google Spreadsheet into Google Fusion Tables, we get the geocoding for free, without having to do any RSS manipulation. Just import the data, that is all. Almost too easy! Here is the map (access the table containing the data about the researchers):