Blogger typology: quantitative analysis step 1
I’ve published the first dump of survey and “blog metrics” data from the blogger questionnaire as a spreadsheet on Google Docs. Many, many thanks to all of you who volunteered your information.
Please feel free to use this as you see fit for your own projects. I’ve anonymised this data (just because it’s best practice, not because I think any blogger would be mortally offended by having the world know what inspires them to blog!)
I’m slowly ploughing through the data and doing the quant analysis, but I thought I’d share a few bits and pieces first.
I’m not really in a position to offer anything other than very top-line results right now. Remember that the purpose of this exercise is to create a broad and commonsensical “blogger typology” that will help us with our planning and engagement programmes.
All I have here are “descriptive” statistics; that is, data about the data themselves. You can jump straight to the results should you so wish. The next bit is just a rumination on what we can measure (I’ve left out the “why” for now, but would be happy to discuss).
Blog metrics background
First of all, a quick aside about how we chose what we’re going to measure. This has been a source of great interest to us at Porter Novelli, and I think will continue to be so over the coming year.
There are any number of blog metrics out there to choose from. I’m rather interested in Stowe Boyd’s Conversation Index, which he defines as:
The ratio between posts and comments+trackbacks (posts/comments+trackbacks)
A healthy blog, he suggests, has a Conversation Index less than 1.0 — that is, there’s something more than simple broadcasting going on. To me there seem to be only two downsides to this metric:
- a poorly-moderated blog with dofollow links in the comments section will often have a lower Conversation Index than the norm, because spam comments inflate the comment count.
- it’s hard to generate Conversation Index data automatically for a large number of blogs.
This says nothing about using the Conversation Index as a performance indicator for one’s own blogs and posts of course (indeed, it’s to be recommended). It would probably work well for Twitter accounts, too; we’ll try to take a look at this when we’ve finally completed the Twitter eavesdropper.
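For anyone who wants to track this on their own blog, here is a minimal sketch of the calculation in Python. It reads Boyd’s definition as posts divided by the sum of comments and trackbacks; the function name, the sample counts, and the choice to return infinity for a blog with no responses are my own assumptions, not part of his definition.

```python
def conversation_index(posts: int, comments: int, trackbacks: int) -> float:
    """Posts divided by (comments + trackbacks), per the reading above."""
    responses = comments + trackbacks
    if responses == 0:
        # Assumption: no responses at all means pure broadcasting,
        # so the index is treated as infinitely high.
        return float("inf")
    return posts / responses


# Hypothetical counts, for example taken from a blog's last 10 posts.
index = conversation_index(posts=10, comments=25, trackbacks=5)
print(index < 1.0)  # True: more responses than posts
```

An index below 1.0 means the blog attracts more responses than it publishes posts, which is Boyd’s threshold for a “healthy” blog.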
But taking Stowe Boyd’s post as inspiration, here’s what I’d like to be able to collect and store from the blogs I’m looking at. These are, I think, the relevant data points that are common to nearly all blogs, and from which a more complex and meaningful set of metrics (like Boyd’s Conversation Index, Technorati’s Authority, or indeed Google’s Page Rank, AideRSS’s PostRank, and even Porter Novelli’s own network analysis metrics) can be constructed.
At this point it’s worth inserting a note of reality. As mentioned above, collecting this data is difficult. Since every blog is different, every blog’s schema (the way that its information is structured) differs. XML feeds like RSS and Atom go a long way to fixing this (hence the focus on “last 10 posts” for some of the metrics listed above) but even then it’s hard to automate.
So, right now I’ve been using the Mechanical Turk approach to getting the data. I mostly use the excellent and reliable Get A Freelancer, but given that I’m part of a global network I’m also looking at Amazon’s service (I don’t understand why this isn’t available outside the United States). I like the idea of paying per transaction; it offers more flexibility.
The big problem (as with any data entry project) is accuracy, and I’m still working on the processes for that (although I do have a couple of ideas.) I’d be interested to hear from anyone who has some experience in this area so that we can share ideas.
Early results
Our standard set of blog metrics is based on the well-known RFM metrics from CRM, adapted for blogs: recency, frequency, and tenure. Obviously we can’t get “spend” data (the monetary “M” in RFM), but we can substitute it with Technorati’s “Authority” data. This seems (and is) pretty arbitrary, but it’s relatively easy to automate, as demonstrated in yesterday’s post on using Perl to access Technorati’s API.
Let’s look at the data we’ve gathered.
Recency
We calculate “recency” as the number of days since the latest post (in Excel we subtract “date of last post” from “date of retrieval”.) When it comes to recency, the lower the score the better: it implies an active blogger.
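The same subtraction is easy to reproduce outside Excel. This is a minimal Python sketch rather than the workflow actually used here; the function name and the sample post date are made up, and the year of the retrieval date is inferred.

```python
from datetime import date


def recency_days(last_post: date, retrieved: date) -> int:
    """Days between the latest post and the day the data was retrieved."""
    return (retrieved - last_post).days


# Retrieval date of January 2nd (year assumed to be 2009);
# the date of the last post is a made-up example.
retrieved = date(2009, 1, 2)
print(recency_days(date(2008, 12, 29), retrieved))  # 4
```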
The first box plot is heavily skewed by one outlier who hasn’t posted for over three years.

Box plot describing posting recency of 62 bloggers who completed questionnaire
So in the following box-plot we’ve removed that outlier; none of the other numbers really changes of course, but we can display a better picture.

Box plot describing posting recency of 61 bloggers (one outlier removed)
You can see that half the bloggers have posted within the past four days. The retrieval date for this data was January 2nd, so this is really rather impressive. If you look at this review of corporate blogging programmes from public relations agency networks that we carried out for internal purposes early last year, you’ll see that there’s far less enthusiasm (the median recency score there was 12 days) and that wasn’t over a period where everyone was having a holiday and a party.
Frequency
To calculate frequency we look at the number of posts in the last complete month. In some studies, frequency of posting has shown a high correlation with Technorati Authority (of which more later.) I expect to see a wide range of frequency “behaviours” when we come to do the segmentation; people who are mostly link blogging will, I think, show higher frequency than people who are opinion blogging or announcement blogging (these terms are borrowed from current typologies, and have nothing to do with the final product!) It’s hard to say therefore that “higher frequency is better” but again, it’s an indicator of active and engaged blogging. Or possibly, of course, of a spam blog.
From the public relations engagement point of view, a blog with higher frequency will have a higher probability of carrying our story and (in all likelihood) a larger readership.

Box plot describing frequency of posting of 62 bloggers who completed questionnaire
There are no clear outliers here: half the respondents post more than seven times a month, so just under two posts a week. In case you hadn’t guessed, it’s probably worth pointing out that the blog with a frequency of 225 is a multi-authored blog.
The “complete month” we looked at for this exercise was December; numbers are likely to have been affected by the holidays (and affected differently for different segments.)
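Counting posts in the last complete month can be sketched as follows, assuming you already have a list of post dates for a blog (say, parsed from its feed). The function names and sample dates are hypothetical.

```python
from datetime import date


def last_complete_month(retrieved: date) -> tuple[int, int]:
    """Return (year, month) of the calendar month before the retrieval date."""
    if retrieved.month == 1:
        return retrieved.year - 1, 12
    return retrieved.year, retrieved.month - 1


def frequency(post_dates: list[date], retrieved: date) -> int:
    """Number of posts published in the last complete month."""
    target = last_complete_month(retrieved)
    return sum(1 for d in post_dates if (d.year, d.month) == target)


# Made-up post dates; with a January retrieval date the month counted is December.
posts = [date(2008, 11, 30), date(2008, 12, 1), date(2008, 12, 15), date(2009, 1, 1)]
print(frequency(posts, date(2009, 1, 2)))  # 2
```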
Active tenure
We define ‘active tenure’ as the number of days between the first and the last post. When taken together with recency and frequency, this is a good measure of commitment: someone may score well on recency and frequency (a recent last post and plenty of posts per month), but if their tenure is low, then we have no guarantee that this behaviour will continue into the future. Lots of us start blogs in a fit of enthusiasm only to find that work begins to get in the way, and that we have less time to post than we did! Occasionally we’ll find a blog that has relatively high active tenure but poor recency (a long gap since the last post); we’d read that as a bad sign.
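As a rough illustration, active tenure is just the span between the earliest and latest post dates. The sketch below assumes a list of post dates is available; the function name and the sample dates are invented, and returning 0 for a blog with no posts is my own convention.

```python
from datetime import date


def active_tenure_days(post_dates: list[date]) -> int:
    """Days between the first and the last post (0 if there are no posts)."""
    if not post_dates:
        return 0
    return (max(post_dates) - min(post_dates)).days


# Made-up example: first post on 1 January 2008, last on 31 December 2008.
posts = [date(2008, 6, 10), date(2008, 1, 1), date(2008, 12, 31)]
print(active_tenure_days(posts))  # 365 (2008 was a leap year)
```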

Box plot describing active tenure of 62 bloggers who completed questionnaire
Looking at the respondents, we see that half of them have been blogging for more than a year (the median is 401 days); this is a good sign for when we come to do the segmentation. Indeed the distribution looks fairly healthy overall, with a mid range from 149 to 952 days. I still can’t say that this is a representative sample at all, which is a shame. Does anyone know of any data that would help me frame this better?
Authority
Technorati’s “Authority” is the only metric that we don’t calculate ourselves. Indeed, as per yesterday’s post, I’ve just succeeded in automating this bit of the data gathering process.
Technorati has this to say:
Technorati Authority is the number of blogs linking to a website in the last six months. The higher the number, the more Technorati Authority the blog has … The best way to increase your Technorati Authority is to write things that are interesting to other bloggers so they’ll link to you.
Right now, Technorati’s Authority scores range from 0 to 28,378 (HuffPo, as retrieved on Sunday January 4, 2009.) Our respondents (as seen below) range from 0 to 356, with a mid range from 5 to 41. I don’t think that this really gets us where we want to be going (although it does guide our hand a little better when it comes to looking at the next rounds of data gathering.)

Box plot describing Technorati Authority of 62 bloggers who completed questionnaire
A word about the box plots
We use box plots as a simple way to represent and compare data. Simply plotting the average (arithmetic mean) of the data disguises more than it reveals; a few outliers (as we have seen) can drastically alter the picture and give an “unrepresentative” picture.
Each box plot shows the following information:

- Maximum (the highest value observed)
- Minimum (the lowest value observed)
- The Interquartile Range or IQR (where the middle 50% of the values fall: this is a more robust statistic than the full range, and is depicted as the “box” from which the box plot gets its name)
- The Median (the number separating the higher 50% of the sample from the lower 50%; its position within the box gives you an idea of which way the data skew)
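If you would like to compute these numbers yourself from the spreadsheet, the following Python sketch uses only the standard library. The sample values are made up, and the “inclusive” quartile method is an assumption: spreadsheets and statistics packages use slightly different quartile rules, so results may differ a little from the plots above.

```python
from statistics import mean, median, quantiles


def box_plot_summary(values):
    """Return the numbers a box plot displays, plus the IQR.

    Needs at least two values, because quantiles() requires that.
    """
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points; the middle
    # one is the median, which we compute separately for clarity.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "max": data[-1],
        "iqr": q3 - q1,
    }


# Made-up recency scores in days; one blogger has been silent for years.
sample = [1, 2, 3, 4, 1000]
summary = box_plot_summary(sample)
print("mean:", mean(sample))          # dragged upwards by the outlier
print("median:", summary["median"])   # unaffected by the outlier
```

With these made-up values the mean is 202 while the median is 3, which is the sort of distortion by a single outlier that the box plot is meant to avoid.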
Where can I get the latest data?
Here, as a Google Docs spreadsheet. If you do anything interesting with the data, do please let me know, and I’ll link to it from here. Feel free to share this spreadsheet as http://icanhaz.com/blogger_data
How were the respondents recruited?
To date, everyone has been self-selecting. I’ve used four main channels to promote the questionnaire: this blog, Twitter, a LinkedIn Q&A, and word-of-mouth. If you’d like to take the questionnaire yourself, please do; it’s never too late. Furthermore, I’d be most grateful if you’d pass the link along (please post it on your blog, or on Twitter, or send it via email as http://icanhaz.com/blogger_questions).
Here (courtesy of Google spreadsheets) is how the current set of respondents got to the questionnaire (WOM isn’t tracked here, just the seed link.)





Mat,
Though a few factors are subjective, and there could be different ways to nail it, I wanted to give my 2 cents for discussion’s sake:
1) Do you really think ‘Tenure’ matters as much? I mean if the guy has any good Technorati authority and good frequency, it negates the value of ‘tenure’.
2) Again, frequency and authority combined can mitigate the importance of recency. Technorati takes only the last 180 days of data for its authority, and we are using frequency per month as another metric. This implies that there is less chance that a blog with high authority and frequency will be dormant for a long period of time. This metric need not be automated, and in some circumstances could be checked by hand.
I know you’ve mentioned that there are more complex metrics (can we see them somewhere?) and that Page Rank could be deduced from the above metrics (or did I get it wrong?)
Thanks
Shalabh