This is The Centerfielder here with an announcement about the dKosopedia. In addition to the wiki we all know and love -- or at the very least have heard of and tolerate -- the dKosopedia will soon contain some new features. I'll explain more about these at the end of this diary.
The thing I want to look at now, though, is tags. The idea of diary tags and tag clouds sounds great, but in practice it has been a disaster. As of this afternoon (Mon, Sep 18, 2006) there are 48613 tags. That's a lot of tags. How can we cut that down? Well, a lot of them are certainly spelling errors (such as "Afganistan" for "Afghanistan"). There are also many tags that were obviously intended to mean the same thing but are worded differently (such as "Cheap Labor", "Cheap Labor Conservatism", "Cheap Labor Conservatives", "cheap labor Republicans", and "Cheap-Labor Conservatives"), some where tags were not separated at all, and some where tags were separated by a semicolon or period instead of a comma. How can someone willing to fix tags easily find such degenerate cases?
Here's one way to look at the problem...
I parsed the alltags page and computed a Soundex code for each tag. A Soundex code maps a word (or words) to four characters: a letter followed by three digits. The Soundex code for "dKosopedia" is "D221" and for "Centerfielder" is "C536". Tags that map to the same Soundex code should sound similar (for some definition of "sound"), and we can use these crude mappings as a starting place for those who want to fix tags. At least that's the theory; whether it will ultimately be useful or not I don't know. Let's take a look...
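For anyone who wants to play along at home, here is a minimal Python sketch of a Soundex-style coder plus a grouping helper. It is not necessarily the script used for this diary. It assumes a simplified variant in which non-letters are ignored and H, W, and Y behave like vowels (the stricter archival rule treats H and W differently); that variant was picked only because it reproduces the codes quoted in this diary. The names soundex and group_by_soundex are made up for illustration.

```python
from collections import defaultdict

# Standard Soundex digit assignments for consonants.
_CODES = {}
for _digit, _letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(tag):
    """Return a 4-character Soundex-style code, or "" if the tag has no letters."""
    # Keep only A-Z, so spaces, digits and punctuation are skipped.
    letters = [ch for ch in tag.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Assumption: vowels, H, W and Y all reset `prev`, so a repeated
        # digit is kept whenever one of them sits in between.
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        prev = digit
    return code.ljust(4, "0")  # pad short codes with zeros


def group_by_soundex(tags):
    """Map each code to the list of tags that share it, in input order."""
    groups = defaultdict(list)
    for tag in tags:
        groups[soundex(tag)].append(tag)
    return groups


groups = group_by_soundex(["Afganistan", "Afghanistan", "Cheap Labor", "Civil Liberties"])
for code in sorted(groups):
    print(code, groups[code])
```

Any group with more than one tag in it is a candidate for a closer look. As "Cheap Labor" and "Civil Liberties" show, sharing a code does not mean two tags belong together.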
From the above examples, the following tags map to Soundex code C141 (the number after each tag is the number of diaries with that tag):
C141 [Cheap Labor] 15
C141 [Cheap Labor Conservatism] 1
C141 [Cheap Labor Conservatives] 8
C141 [cheap labor Republicans] 2
C141 [Cheap-Labor Conservatives] 1
C141 [civil lberties] 1
C141 [Civil Liberites] 1
C141 [civil libertarians] 2
C141 [Civil Liberties] 295
C141 [Civil Liberties (all tags)] 1
C141 [Civil Liberties and Public Policy Program] 2
C141 [Civil Liberties Oversight Board] 1
C141 [civil liberties vs safety] 1
C141 [civil libertties] 1
C141 [civil liberty] 4
C141 [Civil Partnerships] 2
C141 [Civil Procedure] 1
C141 [Cobell v Norton] 1
If some Trusted User wanted to go in and clean up the appropriate diaries, we could easily cut these 18 tags down to 10. Here's part of group A125:
A125 [Afaghanistan] 1
A125 [Afgan Poppies Are Us] 1
A125 [Afganistan] 3
A125 [Afghanastan] 1
A125 [Afghanistam] 1
A125 [Afghanistan] 594
A125 [Afghanistan debacle] 1
A125 [Afghanistan war] 4
A125 [Afghans 4 Tomorrow] 1
A125 [Afhganistan] 1
We can eliminate 5 misspelled tags right off the bat. How about other tags that include the word "Afghanistan"?
H212 [Hizb-i-Islami Afghanistan] 1
I621 [Iraq Afghanistan] 1
I625 [Iraq and Afghanistan Veterans of America] 6
O251 [osama bin laden muslim islam terrorism war iraq afghanistan] 1
S315 [Stephen Harper Canada Afghanistan] 1
T653 [the war on terrorism in Afghanistan] 1
V452 [violence in Afghanistan] 1
W262 [Waziristan pakistan Afghanistan Musharraf Karzai revolution Bush administration] 1
W651 [war in Afganistan] 1
Five of these are because of missing commas. How many other tags are improperly separated? Let's look first at tags with many spaces (such as the second to last example above). There are 1240 tags of 5 or more space-separated words (5 was arbitrarily chosen). Some of these are valid ("Department of Health and Human Services"), but a lot aren't. Eyeballing it I'd say it's near 50-50. Some examples:
[Bush 9/11 Reid Libby Fitzgerald Schumer] 1
[bush barf labor day speech] 1
[Bush Cheney terrorism Iraq WTC war politics Emerson 9-11] 1
[bush cheney torture iraq war] 1
[Bush coward 911 nyc lie] 1
[bush executive puke gop puke piece of shit puke] 1
[Bush GOP Lies Propaganda timeline Iraq ] 1
[democracy iraq bush WOT SCWOT] 1
[democrat democratic party 2006 2008 Dean unity] 1
[democrat values republican core 2006 2008 elections] 1
[Democratic Election Strategy Meta Platform Issues Victory] 1
[Democratic idea Iran nuke nuclear] 1
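A filter like the one described above (tags of 5 or more space-separated words) might look something like this in Python. This is only a sketch; the function name multiword_tags and the min_words parameter are invented for illustration, and the threshold of 5 is as arbitrary here as it was above.

```python
def multiword_tags(tags, min_words=5):
    """Return the tags made up of at least `min_words` whitespace-separated words."""
    # str.split() with no argument splits on any run of whitespace and
    # ignores leading/trailing spaces, so "Iraq " still counts as one word.
    return [tag for tag in tags if len(tag.split()) >= min_words]


sample = [
    "Department of Health and Human Services",
    "bush barf labor day speech",
    "Civil Liberties",
]
for tag in multiword_tags(sample):
    print("[%s]" % tag)
```

Both long tags in the sample get flagged, the valid one included, so a human still has to eyeball the results.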
Other invalid separators are periods:
[9-11. Europe] 1
[9-11. Path to 9-11] 1
[9-11. victims] 1
and semicolons:
[9-11; boycott] 1
[9-11; Path] 1
[Alexandra Wilhelmsen; George Weigel] 1
[american foreign policy; Middle East] 1
[American habits; worldview of America] 1
[Ben Shapiro; Midterm elections] 1
[Bob Dylan; song of war] 1
[Book Report; Election; GOTV; Youth] 1
[Bush; 9-11; terrorists; King] 1
[Bush; Iraq; New strategy ] 1
[bush; merkel; massage; american] 1
[bush; troops; staged] 1
[Cliff Robertson was right; Robert Redford NYT didn't publish] 1
[Coulter;Wilson;Plame;Senate] 1
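Tags like these can be flagged with a small regular expression. The Python sketch below assumes that a semicolon anywhere, or a period followed by whitespace, marks a probable bad separator. That is a heuristic, not a rule: it leaves "thepathsince911.com" alone but will also flag a legitimate tag such as "Hamdan v. Rumsfeld". The names suspect_tags and split_on_bad_separators are hypothetical.

```python
import re

# A semicolon anywhere, or a period followed by whitespace.
_BAD_SEPARATOR = re.compile(r";|\.\s")


def suspect_tags(tags):
    """Return the tags that appear to use ';' or '. ' where a comma was meant."""
    return [tag for tag in tags if _BAD_SEPARATOR.search(tag)]


def split_on_bad_separators(tag):
    """Split a suspect tag into the individual tags it probably should have been."""
    parts = [part.strip() for part in _BAD_SEPARATOR.split(tag)]
    return [part for part in parts if part]  # drop empty pieces


sample = ["9-11. Europe", "Coulter;Wilson;Plame;Senate", "thepathsince911.com", "Iraq"]
for tag in suspect_tags(sample):
    print("[%s] -> %s" % (tag, split_on_bad_separators(tag)))
```

The split is only a suggestion for whoever edits the diary; someone still has to decide whether the period really was a separator.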
Hmmm. Look at "9-11. Path to 9-11" and "9-11; Path". How many other tags reference the "Path to 9-11" show and the ABC boycott? Searching all the tags for "path" or "boycott" or "disney" or "ABC" and culling, we get:
A121 [ABC Path to 9-11 President Clinton] 1
A125 [abc movie path] 1
N313 [NYT path to 9-11] 1
P300 [9-11; Path] 1
P300 [Path] 5
P300 [Path 9-11] 1
P330 [9-11. Path to 9-11] 1
P330 [Path to 9-11] 79
P330 [Path to 911] 6
P330 [Pathway to 9-11] 1
P331 [Path to 9-11 ABC Disney] 1
P331 [Path to 9-11 frame ABC] 1
P331 [Path to 9-11 video media activism ABC Disney 9-11] 1
P332 [path; 9-11; Disney; ABC] 1
P333 [Path to 9-11 ad] 1
P333 [path to 9-11 disney abc] 1
P334 [path to 9-11 lies]
T132 [thepathsince911.com] 1
T133 [ The Path to 9-11] 1
T133 [ The Path to 9-11] 1
T133 [The Path to 9-11] 1
T133 [The Path to 9-11] 352
A121 [ABC boycot boycott Disney] 1
B230 [9-11; boycott] 1
B230 [boycott] 135
B231 [Boycott ABC] 2
B232 [Boycott Disney] 50
B232 [Boycott ESPN] 1
B232 [Boycott Scholastic] 1
B232 [boycotts] 13
D251 [disney boycott] 1
D526 [Democrats. Boycott] 1
A123 [ABC Disney 9-11] 1
A123 [ABC-Disney] 1
B232 [Boycott Disney] 50
C613 [Corrupt Disney] 1
D250 [Disney] 346
D251 [Disney-ABC] 2
D253 [Disney docudrama Clinton] 1
T623 [Trash Disney parties] 1
A120 [ ABC] 1
A120 [ABC] 606
A121 [ABC. PT911] 1
A123 [ABC advertising] 1
A123 [abc tv] 1
F212 [FUCK ABC] 1
I621 [Iraq. ABC] 1
T212 [Tags: ABC] 1
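The keyword search behind the lists above can be approximated with a case-insensitive substring match, sketched here in Python. The function name tags_mentioning is made up, and the tag "Empathy" is an invented example (not taken from the tag list) included to show why the culling step is needed.

```python
def tags_mentioning(tags, keywords):
    """Return tags containing any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [tag for tag in tags if any(k in tag.lower() for k in lowered)]


sample = ["Path to 911", "Boycott Disney", "abc tv", "Empathy", "Iraq"]
for tag in tags_mentioning(sample, ["path", "boycott", "disney", "ABC"]):
    print("[%s]" % tag)
```

Note that "Empathy" matches "path" too. A plain substring search casts a wide net, so the hits still have to be culled by hand.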
There are a couple dozen superfluous tags in this batch. Well, enough. One last thing. Look at this:
T133 [ The Path to 9-11] 1 (note: starts with 2 spaces)
T133 [ The Path to 9-11] 1 (note: starts with one space)
T133 [The Path to 9-11] 1
T133 [The Path to 9-11] 352
We can get rid of three of these tags just by correcting spacing. The following tags start with whitespace; it should be a simple matter to clean these up.
[ The Path to 9-11] 1
[ ABC] 1
[ American Patriotism] 1
[ CIA] 1
[ Electronic Voting Machines] 1
[ GBCW] 1
[ Hamdan v. Rumsfeld] 1
[ Iraq] 2
[ Israel] 2
[ Jon Tester] 1
[ Neal Bush] 1
[ oh] 1
[ Richard Clarke] 1
[ Samuel Alito] 1
[ terrorism] 1
[ The Path to 9-11] 1
[ troll diary] 2
[ Vietnam] 1
[ voter registration] 1
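Spotting these mechanically is straightforward. The Python sketch below flags tags with leading or trailing whitespace and shows how the diary counts would combine if such tags were merged. The helper names are hypothetical, and the normalize step also collapses runs of spaces inside a tag, which is an extra choice made for this sketch.

```python
from collections import Counter


def has_stray_whitespace(tag):
    """True if the tag starts or ends with whitespace."""
    return tag != tag.strip()


def normalize(tag):
    """Strip the ends and collapse internal runs of whitespace to one space."""
    return " ".join(tag.split())


def merge_counts(tag_counts):
    """Sum the diary counts of tags that are identical after normalizing."""
    merged = Counter()
    for tag, count in tag_counts:
        merged[normalize(tag)] += count
    return merged


counts = [("  The Path to 9-11", 1), (" The Path to 9-11", 1),
          ("The Path to 9-11", 352), (" Iraq", 2)]
print([tag for tag, _ in counts if has_stray_whitespace(tag)])
print(merge_counts(counts))
```

This only shows what the merged numbers would be; the tags themselves still have to be fixed on the diaries.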
So, that's a lot to chew on. I'm not sure how much the Soundex mappings help, but grouping the tags in this fashion, crude though it may be, at least reduces the problem to bite-size chunks. I'll shortly be putting lists of tags -- alphabetical, multiword, different separators -- on the dKosopedia at http://www.dkosopedia.com/... for others to download and study. Until that happens, though, please choose a tag or two and spend a few minutes cleaning up a couple of diaries. The Tag Editors Workspace on the dKosopedia was started a few months ago by SarahLee as a place to coordinate this work. It's an outdated page, but if some brave souls want to take this project over, it's an OK place to start.
Now, for those other things:
First, for those of us on Mac OS X, I'm working on a dKosopedia Dashboard widget, based on Sean Billig's Wikipedia widget. My modified widget works reasonably well, but has one or two bugs I hope to squash this week. Expect a beta release sometime next week. Second, the dKosopedia will start collecting metadata about dailykos.com, as well as indexing some of the more interesting "series" diaries, such as the History for Kossacks series.