I'm interested in natural language processing.
Where could I find a very large collection of natural language text ("corpus" in NLP speak)?
Hmm. Oh, right. Right here.
What kind of things can you do with such a corpus?
Lots! But for just a first exercise, I do really love the "text generators" based on word and phrase frequencies.
So, I counted all the bigrams in Daily Kos diaries published in 2012 that were used 10 or more times.
A bigram is just a two-word phrase, like, for instance, "Daily Kos".
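For the curious, counting bigrams and keeping only the frequent ones can be sketched in a few lines of Python. This is a minimal sketch, not the code actually used for the diaries: the function name `count_bigrams`, the crude tokenizer, and the stand-in `diaries` list are all assumptions made for illustration.

```python
import re
from collections import Counter

def count_bigrams(texts, min_count=10):
    """Count adjacent word pairs across texts, keeping pairs seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        # Pair each word with the one that follows it.
        counts.update(zip(words, words[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Stand-in data, not real diaries: the same two short sentences repeated five times.
diaries = ["Daily Kos is a site. Daily Kos has diaries."] * 5
frequent = count_bigrams(diaries, min_count=10)
print(frequent)
```

Note that this simple version ignores sentence boundaries, so it also counts pairs like "site daily" that straddle a period. A more careful tokenizer would split on sentences first.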
Once you know how often a word occurs in a Daily Kos diary, and how often that word is followed by each other word, you can generate "Daily Kos Text".
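Here is a rough sketch of how that generation step might look in Python. It picks a starting word in proportion to how often it occurs, then repeatedly picks the next word in proportion to how often it followed the current one. The names `build_model` and `generate` and the tiny `toy_words` list are hypothetical, made up for this illustration, and a real run would use counts from the whole corpus.

```python
import random
from collections import Counter, defaultdict

def build_model(words):
    """Return word counts and, for each word, counts of the words that follow it."""
    unigrams = Counter(words)
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return unigrams, followers

def generate(unigrams, followers, length=5, seed=None):
    """Chain up to `length` words, sampling each step by observed frequency."""
    rng = random.Random(seed)
    # Start with a word chosen in proportion to its overall frequency.
    word = rng.choices(list(unigrams), weights=list(unigrams.values()))[0]
    output = [word]
    while len(output) < length:
        options = followers.get(word)
        if not options:
            break  # nothing ever followed this word, so stop early
        # Choose the next word in proportion to how often it followed `word`.
        word = rng.choices(list(options), weights=list(options.values()))[0]
        output.append(word)
    return " ".join(output)

# Toy input (an assumption), far too small to produce interesting text.
toy_words = "the media attention grew while attention deficit claims spread".split()
unigrams, followers = build_model(toy_words)
print(generate(unigrams, followers, length=4))
```

Because each step looks only at the current word, the output can wander from one phrase into another that happens to share a word, which is where the fun comes from.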
My two personal favorites from the generated text:
- Media attention deficit.
- House oversight hearing aid kit.
Fun! Jump on over for more.