Cross-posted from Overdetermined.net

In my last entry, I discussed matching lists when they did not share a common, persistent and unique identifier. Basic conclusion: challenging! In this week's entry, I'll share a common technique for making the job a little easier--one which has a number of uses beyond list-matching.

Way back in the first entry in Building a Voter File, I wrote that "If your name is John on one source of information, Johnny on another, and Jhon on yet a third, it’s a challenge to ensure that all this really refers to the same person." This is true, but it's not even the best example of what I was getting at. Consider the following addresses:

123 Main St., Apartment 2, Washington, DC 20010

123 Main Street, Apt. 2, Washington, District of Columbia, 20010

123 Main, #2, Washington DC, 20010

A human would recognize all of these as the same location. A computer has a much harder time, because to it they are simply three different strings of text. Matching them automatically is possible, but it requires a standard format for addresses and software that will put addresses into that format.

Luckily, we have both things. First, the standards. The U.S. Postal Service has an extremely detailed standard (PDF) for addresses on bulk mail. There are all kinds of rules, including specifics for apartments, rural routes, and any other quirk of addresses you can think of. For the moment, take a look at the examples here. The most important things to note are the standardized capitalization and spellings (e.g. always BLVD, never a mix of Boulevard, Blvd. and BLVD), and the use of 9-digit zip codes. These conventions were originally developed by the Post Office to make it easier to sort bulk mail, but they have the added benefit of ensuring that the same address is represented the same way on multiple files, assuming each file has been standardized.
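
To make the idea concrete, here is a rough Python sketch of the simplest part of standardization: uppercasing, stripping punctuation, and swapping in standard abbreviations. It is only an illustration, not the Postal Service's actual procedure. The function name `normalize_address`, the tiny abbreviation table, and the choice to treat "#" as "APT" are all assumptions made for this example; real software works from far larger tables and a database of known streets.

```python
import re

# Illustrative subset of abbreviations; the real tables are far longer.
ABBREVIATIONS = {
    "STREET": "ST",
    "BOULEVARD": "BLVD",
    "AVENUE": "AVE",
    "APARTMENT": "APT",
}

# Multi-word state names are replaced before the text is split into tokens.
STATE_NAMES = {"DISTRICT OF COLUMBIA": "DC"}


def normalize_address(raw: str) -> str:
    """Uppercase, drop punctuation, and abbreviate a few common words."""
    text = raw.upper()
    text = text.replace("#", " APT ")           # assumption: "#2" means "APT 2"
    text = re.sub(r"[^A-Z0-9\s-]", " ", text)   # remove periods, commas, etc.
    text = " ".join(text.split())               # collapse runs of whitespace
    for name, code in STATE_NAMES.items():
        text = text.replace(name, code)
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


a = normalize_address("123 Main St., Apartment 2, Washington, DC 20010")
b = normalize_address("123 Main Street, Apt. 2, Washington, District of Columbia, 20010")
c = normalize_address("123 Main, #2, Washington DC, 20010")
print(a)
print(b)
print(c)
```

The first two addresses come out identical. The third does not, because nothing in the text says that "Main" is a street rather than an avenue. Filling in that kind of gap takes a reference list of real addresses, which is why dedicated software exists for this job.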

So how are addresses standardized anyway? There are software packages that will do it for you, with varying rates of success. I frankly am not an expert in comparing these, but in general, the higher the share of addresses a package recognizes, the better. It's worth doing some research into the various packages, but depending on the quality of your source file, you should expect 95-99% of addresses to be standardized without errors. That rate depends heavily on the type of addresses you're feeding in--suburban areas are usually easiest (lots of regular streets with tons of numbers, but few apartments), while heavily urban or (especially) rural areas will produce more errors. Once the addresses on your files have been standardized, address matches between different versions become much more meaningful.
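
As a small illustration of why standardized addresses match better, the Python sketch below compares two versions of a file, first on the addresses as entered and then on a standardized column. The records, the field names `raw` and `std`, and the helper `shared_addresses` are all hypothetical; the `std` values stand in for the output of whatever standardization software you end up using.

```python
# Hypothetical records: "raw" is the address as entered, "std" is the
# assumed output of a standardization package run on the same file.
old_file = [
    {"name": "John Smith", "raw": "123 Main St., Apartment 2", "std": "123 MAIN ST APT 2"},
    {"name": "Ann Lee", "raw": "45 Oak Boulevard", "std": "45 OAK BLVD"},
]
new_file = [
    {"name": "Jhon Smith", "raw": "123 Main Street, Apt. 2", "std": "123 MAIN ST APT 2"},
    {"name": "Ann Lee", "raw": "45 Oak Blvd.", "std": "45 OAK BLVD"},
    {"name": "Bo Diaz", "raw": "9 Elm Ave", "std": "9 ELM AVE"},
]


def shared_addresses(old, new, field):
    """Return the set of values of `field` that appear in both files."""
    return {record[field] for record in old} & {record[field] for record in new}


raw_matches = shared_addresses(old_file, new_file, "raw")
std_matches = shared_addresses(old_file, new_file, "std")
print(len(raw_matches), "addresses match as entered")
print(len(std_matches), "addresses match after standardization")
```

In this made-up data no address matches as entered, while two match once standardized. An exact match on address still says nothing about whether "John" and "Jhon" are the same person, so this is one input to list-matching rather than the whole job.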

Originally posted to blueleader on Mon Dec 29, 2008 at 12:13 PM PST.


  •  Also, lose the commas (1+ / 0-)
    Recommended by:
    lgmcp

    which the postal system does with a double space between state and zip code.  As a tax hack, I long appreciated the way the IRS simplifies addresses.  I hate to say it, but all caps does simplify things.

    When I do it to my customers, if they notice, I tell them to wait until they get something from the IRS, and that's how it is addressed.  They think I'm a genius.  They are wrong!

  •  Good matching requires sophisticated (2+ / 0-)
    Recommended by:
    Garrett, Brooke In Seattle

    probabilistic analysis software, followed by human review of rejected and doubtful matches.

    Given our hodge-podge of voting rules and regulations, and the archaic nature of processes in many state elections boards, it's not surprising that this modern solution has not been widely implemented.

    So yes, the post office's standards offer a nice common model for how to make all our address databases interchangeable.  It's a good way to get clean data.  But the other is the way to deal with dirty data, and voters WILL continue to provide that, copiously.

    We need to demand implementation of common and defensible standards on probabilistic identity-matching software.  We are already doing it for medical records, though still in the immature or emerging stage in many cases, and we need to bring it to voting records.  EVERYWHERE.

    "The extinction of the human race will come from its inability to EMOTIONALLY comprehend the exponential function." -- Edward Teller

    by lgmcp on Mon Dec 29, 2008 at 12:36:58 PM PST

  •  I'm more interested in name matching (1+ / 0-)
    Recommended by:
    Garrett

    But that's for selfish, work-related, reasons.  Nothing to do with voter files.

    But it's really a fascinating subject.

    The way to win is not to move to the right wing; the way to win is to move to the right policy. -- Nameless Soldier

    by N in Seattle on Mon Dec 29, 2008 at 12:50:28 PM PST

    •  Name matching is the heart (0+ / 0-)

      of voter data, because the person/identity is far more central to the concept of voting than is the exactitude of address.

      Both name matching and address matching face the same conceptual and logistical challenges.  Both are common to a wide variety of people-tracking applications.  And both must be handled intelligently, either by software or by human evaluators, to keep from screwing things up (like your life when you present at the emergency room) left and right.

      "The extinction of the human race will come from its inability to EMOTIONALLY comprehend the exponential function." -- Edward Teller

      by lgmcp on Mon Dec 29, 2008 at 05:31:14 PM PST

  •  there is some free software (2+ / 0-)
    Recommended by:
    Garrett, lgmcp

    The US CDC has Link Plus and there's an open source project called febrl.  I haven't used either one but I have used address standardization software (CASS certified Zip Plus 4 is just $99) and have written some of my own implementations of standardization and also for probabilistic record linkage.  

    It is a tricky field in that it's harder to translate human reasoning into computer code than you might expect.  You always need some type of manual human  check of marginal matches after employing a record linkage algorithm.

    •  The address locator service in ArcGIS (0+ / 0-)

      is certainly a robust implementation for this.  I wouldn't be surprised if it was the gold standard, and it has the added bonus of accurately placing the residences on a corresponding map.

      "The extinction of the human race will come from its inability to EMOTIONALLY comprehend the exponential function." -- Edward Teller

      by lgmcp on Mon Dec 29, 2008 at 05:33:11 PM PST

    •  Thanks very much (0+ / 0-)

      I hadn't heard of those options, I'll take a look at them.  Appreciate the help!

  •  nice if the rules were followed (0+ / 0-)

    One of my co-workers routinely had mail misrouted, because the street name was 'housenumber - direction street kind street name' (which was also a number). His mail tended to be routed as 'housenumber - street name - street kind - unit number', and they'd ignore the zip code mismatch that resulted.
    The USPS wasn't going to help: they told him he had to identify the database that was the source of the error first. That's really nice when, for example, it's your mortgage bill that's being missent.

  •  Companies are even harder than people/addresses (1+ / 0-)
    Recommended by:
    lgmcp

    They change what they call themselves, and who owns who, all the time.

  •  I think we need one of those databases... (0+ / 0-)

    like LinkedIn, where it automatically updates your address when you move and registers you in the new district at the same time, provided you had a valid registration in the other state...

    Obama/Biden'08 Delivering Change he Promised

    by dvogel001 on Mon Dec 29, 2008 at 03:28:53 PM PST

  •  Just remember to have the database match (1+ / 0-)
    Recommended by:
    lgmcp

    the sleeping pit-bull in the yard
    to that apartment in front.

    signed, Eykan Viser

    ;~)

    Human reason is beautiful and invincible --Milosz, Incantation

    by juancito on Mon Dec 29, 2008 at 04:07:11 PM PST
