Cross-posted from Overdetermined.net
In my last entry, I discussed matching lists when they did not share a common, persistent and unique identifier. Basic conclusion: challenging! In this week's entry, I'll share a common technique for making the job a little easier--one which has a number of uses beyond list-matching.
Way back in the first entry in Building a Voter File, I wrote that "If your name is John on one source of information, Johnny on another, and Jhon on yet a third, it’s a challenge to ensure that all this really refers to the same person." This is true, but it's not even the best example of what I was getting at. Consider the following addresses:
123 Main St., Apartment 2, Washington, DC 20010
123 Main Street, Apt. 2, Washington, District of Columbia, 20010
123 Main, #2, Washington DC, 20010
A human would recognize these as all being the same location. A computer would have a much harder time, but it is possible. For a computer to recognize this automatically, we need a standard format for addresses, and software that will put addresses into that format.
Luckily, we have both things. First, the standards. The U.S. Postal Service has an extremely detailed standard (PDF) for addresses on bulk mail. There are all kinds of rules, including specifics for apartments, rural routes, and any other quirk of addresses you can think of. For the moment, take a look at the examples here. The most important things to note are the standardized capitalization and spellings (e.g. always BLVD, rather than a mix of Boulevard, Blvd., and BLVD) and the use of 9-digit ZIP codes. These conventions were originally invented by the Post Office to make it easier to sort bulk mail, but they have the added benefit of ensuring that the same address is represented the same way on multiple files, assuming each file has been standardized.
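To make the idea concrete, here is a toy sketch in Python of what "putting addresses into a standard format" means. It is not the USPS standard or any real package: the function name `normalize_address` and the tiny abbreviation tables are my own assumptions for illustration, and they cover only the three example addresses above.

```python
import re

# Illustrative, hand-picked abbreviations only (an assumption for this sketch).
# The real USPS tables are far longer.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD"}
UNITS = {"APARTMENT": "APT", "#": "APT"}
STATES = {"DISTRICT OF COLUMBIA": "DC"}


def normalize_address(raw: str) -> str:
    """Return an upper-cased, abbreviated, punctuation-free address string."""
    text = raw.upper()
    # Multi-word state names are replaced before the text is split into tokens.
    for long_name, abbreviation in STATES.items():
        text = text.replace(long_name, abbreviation)
    # Make "#2" split into "#" and "2", then drop periods and commas.
    text = text.replace("#", " # ")
    text = re.sub(r"[.,]", " ", text)
    tokens = []
    for token in text.split():
        token = UNITS.get(token, token)
        token = SUFFIXES.get(token, token)
        tokens.append(token)
    return " ".join(tokens)


examples = [
    "123 Main St., Apartment 2, Washington, DC 20010",
    "123 Main Street, Apt. 2, Washington, District of Columbia, 20010",
    "123 Main, #2, Washington DC, 20010",
]
for address in examples:
    print(normalize_address(address))
```

The first two addresses come out as the identical string `123 MAIN ST APT 2 WASHINGTON DC 20010`. The third comes out as `123 MAIN APT 2 WASHINGTON DC 20010`, because simple rewriting rules cannot know that "Main" was meant to be "Main St". Closing that kind of gap presumably requires checking against a reference list of valid addresses, which is one reason to use a dedicated package rather than rules like these.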
So how are addresses standardized anyway? There are software packages that will do it for you, with varying rates of success. I am frankly not an expert in comparing these, but in general, the higher the share of addresses a package recognizes, the better. It's worth doing some research into the various packages, but depending on the quality of your source file, you should expect 95-99% of addresses to be standardized without errors. That figure will depend heavily on the type of addresses you're feeding in: suburban areas are usually easiest (lots of regular streets with tons of numbers, but few apartments), while heavily urban or (especially) rural areas will produce more errors in standardizing. Once your file's addresses have been standardized, address matches between different versions will be much more meaningful.
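As a rough sketch of why standardized addresses make matching more meaningful: once both files use the same format, matching can be a plain exact comparison on the address string. The function name `match_on_address`, the record layout, and the sample records below are all made up for illustration, and the sketch assumes the addresses have already been standardized by some other tool.

```python
def match_on_address(list_a, list_b, key="address"):
    """Pair records from two lists whose standardized address strings are identical.

    Returns (matches, unmatched), where matches is a list of (a_record, b_record)
    pairs and unmatched holds the records from list_a with no partner in list_b.
    """
    # Index list_b by address so each lookup is a dictionary access.
    index = {}
    for record in list_b:
        index.setdefault(record[key], []).append(record)

    matches, unmatched = [], []
    for record in list_a:
        hits = index.get(record[key])
        if hits:
            matches.extend((record, hit) for hit in hits)
        else:
            unmatched.append(record)
    return matches, unmatched


# Hypothetical records, already in a standardized format.
voter_file = [
    {"name": "John Smith", "address": "123 MAIN ST APT 2 WASHINGTON DC 20010"},
    {"name": "Jane Doe", "address": "45 OAK AVE WASHINGTON DC 20011"},
]
mailing_list = [
    {"name": "Johnny Smith", "address": "123 MAIN ST APT 2 WASHINGTON DC 20010"},
]

matches, unmatched = match_on_address(voter_file, mailing_list)
match_rate = 1 - len(unmatched) / len(voter_file)
print(f"{len(matches)} matched pair(s), match rate {match_rate:.0%}")
```

Note that the address match pairs "John Smith" with "Johnny Smith" even though the names differ, so a standardized address can back up a fuzzy name comparison. An address match alone only says two records share a location, not that they are the same person.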