A simple perl script to interrogate the Technorati API

Technorati API perl query in action

Sometimes (for instance when I’m doing the research for the blogger typology) you need to get a whole load of Technorati data for a whole load of blogs.

This research can (of course) be done by hand. And (of course) for a long list of blogs this would take a great deal of time. Handily, Technorati provides developers with an API that lets you automate those queries. An API (for those of you who don’t know) is an Application Programming Interface – a toolkit provided by a service or application (in this case by Technorati) that lets other computer applications ask it questions and use the answers for their own purposes. It may be helpful to think of APIs as being like the knobs on top of a Lego brick that let you stick other Lego on to it without in any way changing the nature of the brick itself. On the other hand it may not be so helpful after all.

After much struggling with a Yahoo! Pipe to query the Technorati API for a list of blogs, I was forced to abandon my attempt. I would have liked to share that Pipe with the world (if you’re good with Yahoo! Pipes, do please take a look at it and see if you can help me!) [Tuesday January 6, 2009: Thanks to help and encouragement from Bob Briski, this now looks like it's on its way to working!]

Instead, I’ve written a perl script to do this. Perl isn’t as easy for people to use for themselves as Pipes, but if you are comfortable with a command prompt, then you’re half way there.

What this script does is take a list of blog urls, and for each item in the list queries Technorati for the following information:

  1. Blog title
  2. Inbound blogs (the number of unique external blogs linking to the blog over the past six months; this is also known as “Technorati Authority”)
  3. Inbound links (the total number of links into the site)
  4. Technorati Rank (a sort of overall score)

The script

[code lang="perl"]#!/usr/bin/perl
use strict;
use warnings;
# use modules
use LWP::Simple;
use XML::Simple;
# set up variables
open(my $infile, '<', $ARGV[0]) or die "Can't open list of blogs to read: $!";
my $apikey = 'enter your Technorati API key here';
# create object
my $xml = XML::Simple->new;
# read each line, and make the Technorati API call
while (my $blog = <$infile>) {
    # trim the line ending and any stray whitespace
    $blog =~ s/^\s+|\s+$//g;
    next unless length $blog;   # skip blank lines
    callTechnoratiAPI($blog);
}
close($infile);

sub callTechnoratiAPI {
    my ($blog) = @_;
    my $url = 'http://api.technorati.com/bloginfo?format=xml&key=' . $apikey . '&url=' . $blog;
    # get XML file from Technorati
    my $content = get($url);
    die "Can't get $url" unless defined $content;
    # read XML file
    my $data = $xml->XMLin($content);
    my $result = $data->{document}->{result};
    # access XML data and print TSV to screen
    # (you can fiddle with this as much
    # or as little as you like)
    print "\"$result->{weblog}->{name}\"\t";
    print "$result->{url}\t";
    print "$result->{weblog}->{inboundblogs}\t";
    print "$result->{weblog}->{inboundlinks}\t";
    print "$result->{weblog}->{rank}\n";
}
[/code]

How to use it

I can’t give you any real advice on how to run perl on your system. If you want to play around with it, Macs come with perl already installed; Windows users should download and install the free ActivePerl. But you’ll need to install the perl module XML::Simple, and I don’t know where to begin telling you how to do that if you don’t already know how perl and CPAN work. You see why I wanted to use Yahoo! Pipes?

If all of that doesn’t bother you, you’ll also need to sign up for a Technorati account (if you’re into this sort of thing, you should already have an account), and get your free API key. This key will let you make 500 queries in a 24-hour period, so you’ll need to plan how you use it.
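
One way to plan around that limit is to split a long list into chunks of no more than 500 urls and run one chunk a day. Here is a rough sketch of the idea. The helper name split_into_batches and the sample list are made up for illustration, and it assumes the allowance really is 500 calls per key per 24 hours, as described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the main script): split a list of
# blog urls into batches no bigger than the daily query allowance.
sub split_into_batches {
    my ($limit, @urls) = @_;
    my @batches;
    while (@urls) {
        # splice removes up to $limit items from the front of @urls
        push @batches, [ splice(@urls, 0, $limit) ];
    }
    return @batches;
}

# Made-up list of 1,200 urls, purely for illustration
my @urls = map { "http://example.com/blog$_" } 1 .. 1200;
my @batches = split_into_batches(500, @urls);

# Report how many daily runs the whole list would take
printf "%d blogs need %d daily runs\n", scalar @urls, scalar @batches;
printf "run %d: %d blogs\n", $_ + 1, scalar @{ $batches[$_] } for 0 .. $#batches;
```

Each batch could then be written out to its own file and fed to the script on a different day.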

The script as it’s listed above outputs tab-separated values to screen like this:
matm% ./parse_technorati.pl bloglist.txt
"Chris Gilmour's Diary Vol. 14" http://www.illandancient.blogspot.com 6 10 861604
"The Red Rocket: Technology, PR and social media marketing" www.theredrocket.co.uk 15 29 397843
"Going Underground's Blog" http://london-underground.blogspot.com 254 467 13332

The blog’s title and url are followed in order by the inbound blogs (authority) count, the inbound links count, and the Technorati rank.

I use tab-separated values because that makes it simple to cut-and-paste directly into Excel or Google Spreadsheets for further analysis.

Known bugs

Right now, the script occasionally throws out something like this:

matm% ./parse_technorati.pl bloglist.txt
"Lytham Villa" http://lythamvilla.blogspot.com/ HASH(0x8ff7a0) HASH(0x8ff7f4) 4978471
"KickTime || A Driftless Regional Webspace" http://kicktime.org HASH(0x908e0c) HASH(0x908db8) 1951828

I’ll work on this, but if anyone can point me in the right direction, I’ll be most grateful.
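
A possible lead, offered as a guess rather than a confirmed diagnosis: by default, XML::Simple's XMLin turns an empty element (one with no content and no attributes) into an empty hash reference, and printing a hash reference gives exactly this sort of HASH(0x...) text. If Technorati sends back empty inboundblogs and inboundlinks elements for some blogs, that would explain the output. XML::Simple documents a SuppressEmpty option for this case (passing SuppressEmpty => '' to XMLin should give empty strings instead). Another approach is to tidy each value just before printing it. The sketch below does that with a made-up helper, field_or_blank, and hand-built sample data standing in for a real API response.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a printable value for one field.
# Any reference (such as the empty hash used for an empty element)
# or an undefined value becomes an empty string.
sub field_or_blank {
    my ($value) = @_;
    return '' if !defined $value || ref $value;
    return $value;
}

# Hand-built stand-in for a parsed response with two empty elements
# (an assumption about what Technorati returns, not a captured reply)
my $weblog = {
    name         => 'Lytham Villa',
    inboundblogs => {},
    inboundlinks => {},
    rank         => 4978471,
};

# Print the fields tab-separated, blanking anything that is not a plain value
print join("\t", map { field_or_blank($weblog->{$_}) }
    qw(name inboundblogs inboundlinks rank)), "\n";
```

In the main script, the same helper could wrap each value in the print lines, so a missing count shows up as an empty column rather than a hash address.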