Attractions and Saturations: Mapping the comments in diaries

by dmsilev

Friday, Jun. 30, 2006 at 7:00:34am PDT

Surgeon General's warning: This diary contains both meta and mathematics. Only consume in moderation.

The rich get richer. It's a theme that seems to describe modern America. It also seems to describe diaries here on dKos; once a diary starts getting comments, it seems to grab more and more, while some other poor diary languishes with a meagre 3 or 4 comments. That's the impression. But is it true, and can we quantify this tendency? Yes, and yes.

Let's start with the data, courtesy of jotter:

This is 3 months' worth of diaries, counting up how many diaries got a certain number of comments. As you can see, most diaries only get a few comments. In fact, the most common number of comments for a diary is a tie between two and three. There are some diaries with a dozen or so comments, a smaller number of diaries with a few dozen, and a tiny fraction with one hundred or more (off the edge of this graph).

So, let's try to model how people choose which diary to comment in. The simplest method is purely random. A person picks a diary, and every diary has an equal chance to be chosen. Then, the next person leaves a comment, and again every diary has an equal chance. And so forth. What happens? Well, this:

Focus for the moment on the green curve, labeled "bell-curve model". This is the result of assuming that comments go in completely randomly (mathspeak: flat probability distribution). As you can see from the graph, this gives rise to the familiar "bell curve" shape, and it's all wrong compared to the actual results. It predicts that the average diary will get 20 comments, and few diaries will get either 0 comments or more than 40. Anyone who reads dKos at all can tell you that this is completely and utterly wrong.
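For anyone who wants to try the purely random model at home, here is a minimal Python sketch. It is not the code behind the graph; the function name and the seed are made up for illustration, and the 150 diaries and 3000 comments are borrowed from the run parameters given later in this diary.

```python
import random
from collections import Counter

def simulate_uniform(n_diaries=150, n_comments=3000, seed=0):
    """Drop each comment into a diary chosen uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = [0] * n_diaries
    for _ in range(n_comments):
        counts[rng.randrange(n_diaries)] += 1
    return counts

counts = simulate_uniform()
mean = sum(counts) / len(counts)
# histogram maps "number of comments" -> "number of diaries with that many"
histogram = Counter(counts)

print("mean comments per diary:", mean)
print("fewest:", min(counts), "most:", max(counts))
```

Every diary ends up clustered around the mean of 3000/150 = 20 comments, which is the bell-curve behavior described above.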

So, we need a better model, which is the one shown in red labelled "attraction model". What, you ask, is attraction? Attraction is the tendency for commenters to be drawn to already-active diaries. It reflects people being more likely to click on a link that has a bigger number of comments attached to it, ongoing conversations inside a diary, and, at the high end, the increased visibility that comes from being on the Recommended list. As you can see from the red curve, this model does match the actual behavior pretty well.

We can also look at what goes on at the very high end of the spectrum, the multi-hundred comment diaries. There aren't many of them, but they account for a large number of comments. To do this, I've taken the second graph and replotted it on a log-log scale:

As you can see, the overall shape matches pretty well, but I need to tweak the parameters a bit more to get a perfect match.

Now for the nitty-gritty. People whose eyes have already glazed over should skip this section and go to the conclusions at the end (look for the bold 'conclusions' line).

The model works by taking a set of diaries, and sequentially placing a bunch of comments into these diaries. Like the bell-curve model mentioned above, the diary which receives a given comment is chosen randomly, but the key difference is that different diaries have different chances of being chosen. Here's what the cumulative probability distribution looked like at the end of one run:

How to read this graph: when picking the diary that gets the next comment, the random number generator picks a number between 0 and 1. We then look at that x value on the graph, and see what the corresponding y value is. That y value is the index number of the diary that gets the comment. On this particular graph, you can see that two diaries are sucking up all of the oxygen, taking roughly half of all new comments between them. The other 148 diaries in the set have to fight over the remaining half.
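The graph-reading procedure above is the standard trick of sampling from a cumulative distribution. Here is a small Python sketch of it; the helper names and the weights are hypothetical, chosen so that two diaries hold half the total weight and the other 148 split the rest, roughly like the run described above.

```python
import bisect
import random
from itertools import accumulate

def build_cumulative(weights):
    """Running sum of the weights, normalised so the last entry is 1.0."""
    total = sum(weights)
    return [c / total for c in accumulate(weights)]

def pick_diary(cumulative, u):
    """Map a random number u in [0, 1) to a diary index via the cumulative curve."""
    # bisect_right finds the first entry strictly greater than u.
    return min(bisect.bisect_right(cumulative, u), len(cumulative) - 1)

# Hypothetical weights: two dominant diaries that together hold half the
# total weight, plus 148 diaries with one unit each.
weights = [74.0, 74.0] + [1.0] * 148
cumulative = build_cumulative(weights)

rng = random.Random(0)
draws = [pick_diary(cumulative, rng.random()) for _ in range(10_000)]
top_two_share = sum(1 for d in draws if d < 2) / len(draws)
print("share of comments going to the two big diaries:", top_two_share)
```

A binary search keeps each lookup cheap, which matters when thousands of comments are being placed per run.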

So, where does this probability distribution come from? The generating equation is this (copied directly from my code):

attract_val = attraction*com_val+ attraction*(.8*com_val)^2
prob[j] = prob[j-1] + 1 + attract_val

So, every diary gets one unit of attractiveness for free. Beyond that, the attractiveness depends on the number of comments that that diary already has. 'attraction' is an adjustable parameter that I twiddled with to get a reasonable match with the data; it looks like attraction=5 produced the best results. Note, also, the quadratic term, which means that not only do the rich get richer, but the very rich get rich even faster than the merely well off. (The 0.8 is another tweakable parameter).
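For readers who think in Python, here is one possible translation of those two lines. It assumes that ^ means "raise to a power" and binds more tightly than multiplication, so the quadratic term is attraction times (0.8 * com_val) squared. The function names and the default arguments are illustrative, not taken from the original code.

```python
from itertools import accumulate

ATTRACTION = 5.0   # value reported above as the best fit
QUAD_WEIGHT = 0.8  # the 0.8 inside the quadratic term

def attract_val(com_val, attraction=ATTRACTION, quad_weight=QUAD_WEIGHT):
    """Linear plus quadratic attraction for a diary with com_val comments."""
    return attraction * com_val + attraction * (quad_weight * com_val) ** 2

def cumulative_weights(comment_counts):
    """prob[j] = prob[j-1] + 1 + attract_val: running total of diary weights."""
    return list(accumulate(1 + attract_val(c) for c in comment_counts))

# Three example diaries with 0, 3 and 30 comments.
prob = cumulative_weights([0, 3, 30])
for c in (0, 3, 30):
    print(c, "comments -> weight", 1 + attract_val(c))
print("cumulative:", prob)
```

Under this reading, going from 3 to 30 comments multiplies a diary's weight by far more than ten, which is the quadratic term at work.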

(Implementation note: To keep my computer from exploding, this probability distribution is only recalculated once every few hundred comments. Changing the recalculation frequency does change the parameters needed to fit the data, which means that the model isn't as robust as it really should be.)

In the diary title, I mentioned saturation. Saturation is a cap imposed on the attraction value. Basically, once a diary hits a certain threshold (250 comments works pretty well), its attractiveness doesn't increase with increasing number of comments. It's worth noting, however, that a diary that's hit the cap is still several thousand times more attractive than a 0-comment diary. If this sort of cap is not imposed, there's a tendency for one diary to suck up the majority of all comments.
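One simple way to impose such a cap is to clamp the comment count before it goes into the attraction formula. The Python sketch below does that; the function name is hypothetical and the clamping approach is an assumption about how the cap could be implemented, not a copy of the original code. Taken literally with attraction=5, the formula puts a capped diary at a weight of about 200,000 rather than the roughly 20,000 quoted with the run parameters, so the original code may have scaled something differently; treat the numbers here as illustrative.

```python
SATURATION_CAP = 250  # comments; the threshold reported in the text

def capped_weight(com_val, attraction=5.0, quad_weight=0.8, cap=SATURATION_CAP):
    """Weight of a diary, with the comment count clamped at the saturation cap."""
    effective = min(com_val, cap)  # comments beyond the cap add no attraction
    return 1 + attraction * effective + attraction * (quad_weight * effective) ** 2

for c in (0, 100, 250, 300, 1000):
    print(f"{c:5d} comments -> weight {capped_weight(c):,.0f}")
```

Past 250 comments the weight stops growing, so a 1000-comment diary pulls in new comments no faster than a 300-comment one.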

There's one other parameter worth noting, which is the number of diaries which the commenter can choose from. The best matches occurred when this was around 150 diaries. That's roughly 1/2 of a day's output. It's also more or less equal to the typical turn-around time on the Recommended list.

After allocating all the comments to the diaries, the diary distribution is binned into a histogram. The simulation is rerun many times to improve the statistics of the histogram.

For the results shown here, 150 diaries sharing 3000 comments were simulated, with the simulation run 125 times. An attraction value of 5 was used, with a quadratic weight of 0.8. A saturation cap of 250 comments, corresponding to an attractiveness of roughly 20,000, was imposed.
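Putting the pieces together, a run with these parameters might look something like the Python sketch below. It is a reconstruction from the description in this diary, not the original code. The function names, the seed, and the recalculation interval of 300 comments (standing in for "every few hundred") are assumptions, and because the fit is sensitive to that interval, the output should not be expected to reproduce the graphs exactly.

```python
import bisect
import random
from collections import Counter
from itertools import accumulate

def weight(comments, attraction, quad_weight, cap):
    """One free unit plus linear and quadratic attraction, clamped at the cap."""
    c = min(comments, cap)
    return 1 + attraction * c + attraction * (quad_weight * c) ** 2

def run_once(rng, n_diaries, n_comments, recalc_every, **params):
    """Place n_comments one at a time; return the comment count per diary."""
    counts = [0] * n_diaries
    cumulative = []
    for i in range(n_comments):
        # Rebuild the cumulative weights only every recalc_every comments.
        if i % recalc_every == 0:
            cumulative = list(accumulate(weight(c, **params) for c in counts))
        u = rng.random() * cumulative[-1]
        j = min(bisect.bisect_right(cumulative, u), n_diaries - 1)
        counts[j] += 1
    return counts

def simulate(n_runs=125, n_diaries=150, n_comments=3000, recalc_every=300,
             attraction=5.0, quad_weight=0.8, cap=250, seed=0):
    """Repeat the run n_runs times and bin all diaries into one histogram."""
    rng = random.Random(seed)
    histogram = Counter()  # comments in a diary -> number of such diaries
    for _ in range(n_runs):
        histogram.update(run_once(rng, n_diaries, n_comments, recalc_every,
                                  attraction=attraction,
                                  quad_weight=quad_weight, cap=cap))
    return histogram

histogram = simulate()
for k in sorted(histogram)[:10]:
    print(k, "comments:", histogram[k] / 125, "diaries per run on average")
print("largest diary seen:", max(histogram), "comments")
```

Setting attraction=0 turns this back into the purely random bell-curve model, which makes for a handy sanity check.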

Conclusions

The essential dynamic driving where people choose to leave comments is that they are drawn to already-popular diaries, making those diaries even more popular. This only works up to a certain point; above about a couple hundred comments, all diaries appear to be equally attractive. A 1000 comment diary isn't any more popular than a 300 comment diary. This "popularity contest" extends over roughly 1/2 day's worth of diaries, dominated by the high-traffic Recommended diaries.

Note, please, that this model says nothing about which diaries will become popular. That's impossible to predict. But, once a diary becomes popular, it acts as a magnet and pulls more comments in.

-dms