Searching Daily Kos with election tags (etags)
You can now search using election tags as follows.
etag=CT-SEN
Etags with embedded spaces should be quoted.
etag="CA-LT. GOV"
You can only use one etag per search.
Good election tags have a two-letter postal state abbreviation,
hyphen-separated from either a house district number or a statewide
office abbreviation (SEN, GOV, Lt.GOV, SOS, etc.).
There is one exception: etag="2006 ELECTION" also works.
Good etags are ones we already know about by finding them in
the tag list of an existing diary or by appearing in one of the
election roundup diaries.
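The quoting rule and tag format described above can be sketched in a few lines of Python. This is only an illustration: the function names, the regular expression, and the limit of two digits for a district number are assumptions made here, not part of the search facility itself.

```python
import re

# Hypothetical pattern derived from the description above: a two-letter
# state abbreviation, a hyphen, then either a district number or an
# office abbreviation such as SEN, GOV, or "LT. GOV".
ETAG_PATTERN = re.compile(r"^[A-Z]{2}-(\d{1,2}|[A-Z][A-Z. ]*)$")

# The one documented exception to the state-office format.
SPECIAL_CASES = {"2006 ELECTION"}


def looks_like_good_etag(tag: str) -> bool:
    """Return True if the tag matches the format described in the text."""
    tag = tag.strip().upper()
    return tag in SPECIAL_CASES or bool(ETAG_PATTERN.match(tag))


def format_etag_query(tag: str) -> str:
    """Build an etag search term, quoting tags with embedded spaces."""
    tag = tag.strip()
    if any(ch.isspace() for ch in tag):
        return f'etag="{tag}"'
    return f"etag={tag}"


if __name__ == "__main__":
    for example in ("CT-SEN", "CA-LT. GOV", "2006 ELECTION"):
        print(format_etag_query(example), looks_like_good_etag(example))
```

A tag that passes this format check is not necessarily a known etag; as noted above, good etags are ones already seen in a diary's tag list or in an election roundup diary.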
The etag is used to build a more complicated query that looks for
the tag in tags, title, or body text (stories/diaries) or in subject
and comment text (comments). Certain meta diaries are excluded, including
open threads, jotter's daily lists, sidinny's election roundup, and
the top comments diaries.
Software used. The search facility currently installed is
based on the swish-e search engine:
it is fast (compiled C code), open source, and flexible (you can
feed it documents out of a database using your own custom preprocessor).
It is intended for document collections numbering a million or
less, which will do for Daily Kos with under half a million documents
(excluding comments). In addition, swish-e offers many of the features
most desired, including search of phrases and search within
configurable document areas (title, author, tags, URLs).
What is indexed. All Daily Kos documents from the beginning
to within 5 minutes of the present moment are indexed. Older
documents are stored in static yearly indices while more recent
documents are indexed and reindexed, frequently at first, then less
frequently as they age, until they roll over into a growing, static,
permanent yearly index. The right mix of number of index files and
frequency of update to obtain the optimal combination of ease of
maintenance and speed of search may require further experimentation.
At present three different series of indices are maintained, corresponding
to (in order of increasing total number of documents) front page stories,
user diaries, and comments. For 2005 the number of stories, diaries, and comments was
5097, 88943, and 693224, respectively, containing 56993, 371578, and 2945916
distinct words.
Design Considerations. The utility of search to a community of users
is delivered not simply by the results of the text search, but most
critically by the ordering of the results such that highest "value"
hits are displayed first. (Most search researchers know that few
users look beyond the first handful of results, and almost no one
looks beyond the first page.) The key to arranging for the best to
come first is to understand, have access to, and use the meta data
associated with the text documents searched.
Meta Data. At Daily Kos, we have a number of sorts of
meta data that are relevant to documents. First among these is the
time of posting. What is the latest? When did they say that? Who
was first? That sort of thing. So the one thing search has to get
right is time, in the sense of being up to date, that is, including
the latest diaries, as well as in the sense of permitting searches
to be organized by reference to the present time.
Making it up to date. Because recommendations are only
open for 24 hours after a diary is published, it is important that
search indices be refreshed sufficiently often during that interval
to provide reasonably up to date information on total recommendations
and comments. Currently, diaries up to 6 hours old are refreshed every
5 minutes, and diaries 6 to 24 hours old are refreshed every 15
minutes.
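The refresh schedule just described can be written as a small lookup. This is a sketch only: the function name is hypothetical, the treatment of the exact 6 hour and 24 hour boundaries is an assumption, and the schedule for diaries older than 24 hours is not stated in the text, so the function returns None for them.

```python
from datetime import timedelta
from typing import Optional


def refresh_interval(age: timedelta) -> Optional[timedelta]:
    """Return how often a diary of the given age is reindexed.

    Returns None for diaries older than 24 hours, because the text does
    not give a refresh interval for them.
    """
    if age < timedelta(0):
        raise ValueError("age must not be negative")
    if age <= timedelta(hours=6):
        # Diaries up to 6 hours old: refreshed every 5 minutes.
        return timedelta(minutes=5)
    if age <= timedelta(hours=24):
        # Diaries 6 to 24 hours old: refreshed every 15 minutes.
        return timedelta(minutes=15)
    return None


if __name__ == "__main__":
    for hours in (1, 12, 48):
        print(hours, refresh_interval(timedelta(hours=hours)))
```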
More Meta Data. Other important meta data associated
with stories and diaries includes comments and recommendations.
The number of recommendations offers insight, imperfect, but
useful, into the degree of approval a diary has received from the
Daily Kos community.
The number of comments indicates the size of the discussion
engendered. The two numbers together give an interesting
insight into a diary's history at Daily Kos.
Some user diaries are promoted to the front page, and some
diarists ("front pagers") are allowed to post directly to the front
page. The latter are not open for recommendation. In order to
harmonize listings from searches which include both front page
stories and user diaries, a plausible value for number of recommendations
(83) is assigned to front page stories, but in lists, this number
is indicated by "*" to avoid placing too much emphasis on this made
up number.
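One way to picture the placeholder scheme is the following Python sketch. The Document class and its field names are hypothetical; only the value 83 and the "*" display come from the description above.

```python
from dataclasses import dataclass

# Placeholder recommendation count assigned to front page stories.
FRONT_PAGE_PLACEHOLDER_RECS = 83


@dataclass
class Document:
    """Hypothetical record for a story or diary in a result list."""
    title: str
    is_front_page_story: bool  # posted directly to the front page
    recommendations: int = 0   # real count; unused for front page stories


def effective_recommendations(doc: Document) -> int:
    """Value used when ordering results by recommendations."""
    if doc.is_front_page_story:
        return FRONT_PAGE_PLACEHOLDER_RECS
    return doc.recommendations


def display_recommendations(doc: Document) -> str:
    """Value shown in listings; the made-up number is shown as '*'."""
    return "*" if doc.is_front_page_story else str(doc.recommendations)


if __name__ == "__main__":
    docs = [
        Document("User diary A", False, 120),
        Document("Front page story", True),
        Document("User diary B", False, 40),
    ]
    for doc in sorted(docs, key=effective_recommendations, reverse=True):
        print(display_recommendations(doc).rjust(4), doc.title)
```

With the placeholder, a front page story sorts between diaries with more and fewer than 83 recommendations, while the listing itself never shows the made-up number.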
The search query. Search queries can contain individual
words, all of which must be found in a text document to constitute
a hit. Queries can also contain phrases, indicated by double quotes
around the phrase. All words are converted to lower case while
searching, so letter case is of no consequence. In addition, queries
can include boolean terms, such as "and", "or", and "not", which can
be used to construct more complex queries that may return more
informative results.
By default "and" is assumed between individual words in a query.
Finally, queries or parts of queries can be restricted to only
search within parts of a document or document metadata. The parts
available are shown above, along with examples.
Certain characters have special meaning to the search engine
(open and close paren, double quote, equals sign). Only a restricted
list of characters is expected to occur in a word.
For the indices here at Daily Kos, the word characters are as follows.
- The numerals 0-9
- The "standard" alphabet letters a-z (remember all letters are mapped to lower case)
- The "extra" alphabet characters ªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
All other characters (non word, non special) are silently replaced
by spaces while searching. For instance searching
with "Jerome a Paris", "Jerome.a.Paris", or "Jerome,a,Paris" all
give the same result.
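The character handling described above can be approximated in Python as follows. This is a sketch of the rule, not the engine's own code: the function name is hypothetical, collapsing runs of spaces is done here only to make results easy to compare, and Python's lower() also lowercases accented letters, which may differ from what the engine does.

```python
# The "extra" alphabet characters listed above.
EXTRA_LETTERS = (
    "ªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞß"
    "àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
)
WORD_CHARS = set("0123456789abcdefghijklmnopqrstuvwxyz") | set(EXTRA_LETTERS)

# Characters with special meaning to the search engine.
SPECIAL_CHARS = set('()"=')


def normalize_query(query: str) -> str:
    """Lowercase the query and replace every character that is neither a
    word character nor a special character with a space."""
    cleaned = []
    for ch in query.lower():
        if ch in WORD_CHARS or ch in SPECIAL_CHARS:
            cleaned.append(ch)
        else:
            cleaned.append(" ")
    # Collapse runs of spaces (an assumption made for easy comparison).
    return " ".join("".join(cleaned).split())


if __name__ == "__main__":
    for q in ("Jerome a Paris", "Jerome.a.Paris", "Jerome,a,Paris"):
        print(repr(normalize_query(q)))
```

All three spellings of the example reduce to the same words, which is why they give the same result.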
This is important in certain cases, for example searching for
the url http://riverbendblog.blogspot.com is accomplished with the query
site=("riverbendblog.blogspot.com").
The reasonable looking but incorrect query site=riverbendblog.blogspot.com
doesn't work as expected because in the absence of the double quotes and
parentheses, replacement of "." with " " and the default boolean
AND understood between search words makes it equivalent to
site=riverbendblog AND blogspot AND com, which
returns as hits only documents with riverbendblog in a url, and
blogspot and com in the body text.
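The difference between the two forms can be made concrete with a small Python sketch that spells out the query implied by each. The function name is hypothetical, only ASCII word characters are handled for brevity, and the handling of the quoted form is a simplification of how a phrase stays bound to its field.

```python
import re

# ASCII-only simplification of the word-character rule described earlier.
NON_WORD = re.compile(r"[^0-9a-z]+")


def implied_query(query: str) -> str:
    """Show the query the engine effectively sees for a simple
    field=terms query, with or without double quotes."""
    q = query.lower()
    field, sep, terms = q.partition("=")
    if sep:
        prefix = field + "="
    else:
        # No field restriction was given; the whole query is terms.
        prefix, terms = "", q
    # Non-word characters (including ".") become spaces.
    words = NON_WORD.sub(" ", terms).split()
    if '"' in terms:
        # Quoted: the words stay together as one phrase in the field.
        phrase = " ".join(words)
        return f'{prefix}("{phrase}")'
    # Unquoted: the default boolean AND joins the separate words, and
    # only the first word remains attached to the field.
    return prefix + " AND ".join(words)


if __name__ == "__main__":
    print(implied_query("site=riverbendblog.blogspot.com"))
    print(implied_query('site=("riverbendblog.blogspot.com")'))
```

The first call yields the AND form discussed above; the second keeps all three words inside the site restriction.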