As a data miner, I am alarmed not only at the NSA's database, but also at the fear that the name of what I do conjures up. Data mining is neither inherently good nor evil: it is a tool. Just as a hammer can be used to build a house or to crush a cranium, data mining can be used for great good or great evil. But let's look at what the NSA is doing...
The NSA Database: A Data Miner's thoughts
First let me qualify. I own a small data mining company in California. I use off-the-shelf software and hardware that I built into a powerful data mining cluster. I apply considerable computing firepower to assist political candidates and PACs.
What it is:
Now, let's talk about data mining. I should begin with what it is not: Data mining is not magic, though the results can frequently resemble it. Data mining is math. Nothing more. It is math applied to seemingly unrelated or only tangentially related datasets that reveals patterns within the data that may not be evident to even the most rigorous scrutiny. Data mining has been used to find the genetic causes of disease, predict credit card fraud, understand global warming, and a host of other applications from the beneficial to the benign, from the unscrupulous to the malign. It is a tool, and like any other tool it can be used for good or for evil.
How it's done:
Data mining starts with taking all the data you have, stored in fact tables, and relating it to dimensions. Fact tables may contain a record of web transactions or cash register sales. They are essentially events. Dimensions are ways of measuring those events: over time, over geography, over gender, over race, over political affiliation, over economic status. You get the idea.
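To make the fact table and dimension idea concrete, here is a minimal sketch in Python. Every table, column, and value in it (`facts`, `customers`, `stores`, `rollup`) is made up for illustration; real warehouses do this in a database, not in dictionaries.

```python
from collections import defaultdict

# Hypothetical fact table: one row per event (a cash register sale).
facts = [
    {"sale_id": 1, "customer_id": "c1", "store_id": "s1", "amount": 10.0},
    {"sale_id": 2, "customer_id": "c2", "store_id": "s1", "amount": 20.0},
    {"sale_id": 3, "customer_id": "c1", "store_id": "s2", "amount": 5.0},
    {"sale_id": 4, "customer_id": "c3", "store_id": "s2", "amount": 15.0},
]

# Hypothetical dimension tables: ways of measuring those events.
customers = {"c1": {"gender": "F"}, "c2": {"gender": "M"}, "c3": {"gender": "F"}}
stores = {"s1": {"county": "North"}, "s2": {"county": "South"}}


def rollup(fact_rows, dimension, key, attribute):
    """Sum sale amounts grouped by one attribute of one dimension."""
    totals = defaultdict(float)
    for row in fact_rows:
        # Follow the key from the fact row into the dimension table.
        group = dimension[row[key]][attribute]
        totals[group] += row["amount"]
    return dict(totals)


sales_by_county = rollup(facts, stores, "store_id", "county")
sales_by_gender = rollup(facts, customers, "customer_id", "gender")
```

The same events get measured two different ways just by swapping the dimension. That is all "relating facts to dimensions" means.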
Next, the data miner extracts a subset of records that he knows to be highly accurate and divides that set in half. The first half of the subset is what the data miner creates the data model from. This data is analyzed in several ways to determine which characteristics are most predominant. The attributes with a high signal-to-noise ratio become our key indicators.
Following that, we select a target attribute, let's say "Political Affiliation", and we apply our data model to the second half of our subset (stuff we already know about) to see how accurately it can predict our target attribute. The better the results, the more confidence we have in our model.
Finally we apply our model to the data that we don't know much about. The software will give us a prediction as to our target attribute, along with a probability that the prediction is correct and a confidence score.
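The build, test, and apply steps above can be sketched in a few lines of Python. This is only an illustration: I am swapping in a tiny naive Bayes classifier as a stand-in for whatever the commercial software actually does, the attribute names are invented, and it returns only a probability, not the separate confidence score that real packages report.

```python
import random
from collections import Counter, defaultdict


def split_in_half(rows, seed=0):
    """Shuffle the known-good rows and divide them into two halves."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def train(rows, target):
    """Count how often each (attribute, value) pair occurs with each label."""
    label_counts = Counter(r[target] for r in rows)
    pair_counts = defaultdict(Counter)  # (attribute, value) -> label counts
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                pair_counts[(attr, value)][r[target]] += 1
    return label_counts, pair_counts


def predict(model, row, target):
    """Return (predicted label, probability) for one row."""
    label_counts, pair_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = n / total
        for attr, value in row.items():
            if attr != target:
                # Add-one smoothing; the "+ 2" assumes two values per attribute.
                score *= (pair_counts[(attr, value)][label] + 1) / (n + 2)
        scores[label] = score
    best = max(scores, key=scores.get)
    return best, scores[best] / sum(scores.values())


def accuracy(model, rows, target):
    """Fraction of held-out rows whose known label the model gets right."""
    hits = sum(predict(model, r, target)[0] == r[target] for r in rows)
    return hits / len(rows)


# Hypothetical known-good data: one invented attribute plus the target.
known = [{"union": "yes", "party": "Dem"}] * 4 + [{"union": "no", "party": "Rep"}] * 4
build_half, test_half = split_in_half(known)
model = train(build_half, "party")
holdout_accuracy = accuracy(model, test_half, "party")
```

Once `holdout_accuracy` looks good enough, `predict(model, unknown_row, "party")` is the final step: a guess at the target attribute for a record we know little about, with a probability attached.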
A little bit about the data itself:
We store data in our fact tables as long lists with everything that is known about each transaction. The transactions themselves are rows, while the attributes of those transactions are columns. Generally speaking, the more relevant attributes we can supply to the data mining software, the more accurate it will be. Ironically, the bigger the haystack, the easier it is to find a pile of needles in it. Finding a single needle is still nearly, but not quite, impossible.
Data Mining, Phone Records and the NSA:
According to what we learned last week, the only data that the phone companies gave to the NSA were lists that contained nothing but the calling number and the called number. I call bullshit. As a data miner, if you give me the same data, I can run it through the process and give you lots of metadata (data about the data), but since there are no dimensions to relate it to, there is no useful intelligence to be gleaned. To build anything useful from this data without the actual name and/or address of the account holder, I would want to have all of the following:
Calling number
Destination number
Start time
Stop time
Geometry (type of phone, geometry data)
The geometry data is a little hazy, since it's actually an embedded table within a table. The Type of Phone data (land, wireless, satellite) tells me whether I will have point geometry or line geometry. Since land lines don't move, they can be described by a single point. Wireless and satellite phones do move, and plotting where they've been requires line data. The phone companies know the geographic location of the end points of all of their land lines, and, thanks to E911, they can pinpoint the location of any cell phone that passes through their service with the accuracy of a cruise missile. (Isn't that something? They got us all to "chip" ourselves and we thought it was a "Good thing!")
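For the curious, here is roughly how one such record might be laid out. This is a sketch of my wish list above, not anyone's real schema: the class name, field names, and the fictional 555 numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass
class CallRecord:
    calling_number: str
    destination_number: str
    start_time: datetime
    stop_time: datetime
    phone_type: str        # "land", "wireless", or "satellite"
    geometry: List[Point]  # the embedded table: one point, or a path of points

    def geometry_kind(self) -> str:
        """Land lines don't move, so a point; everything else is a line."""
        return "point" if self.phone_type == "land" else "line"

    def duration_seconds(self) -> float:
        return (self.stop_time - self.start_time).total_seconds()


# Made-up example: a five-minute call from a land line.
example = CallRecord(
    calling_number="555-0100",
    destination_number="555-0101",
    start_time=datetime(2006, 5, 1, 9, 0, 0),
    stop_time=datetime(2006, 5, 1, 9, 5, 0),
    phone_type="land",
    geometry=[(37.0, -122.0)],
)
```

The `geometry` list is the "table within a table": a land line carries a single point, while a moving phone carries as many points as there are position fixes.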
Now I have data that I can relate to other data already in my possession: Geography. I can now match a phone to a physical address and look up in that data what is present at that address. A residence? I look that up in tax rolls and I know the property owner. (Be careful who you rent to!) A business establishment? (Be careful who you sell to!) I can track a cell phone over its entire habitual path. Over a period of five years, I can tell you an awful lot about the owner of the phone. Does the phone end up in a public school five times per week and stay there all day? Probably just a kid going to school. Does the phone go to a mosque on Fridays and flight school on Saturdays? Could be a terrorist. Does the phone show up at protest marches and opposition party committee meetings? Better watch this guy!
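Spotting a habitual path is nothing exotic. Once locations have been matched to places, it is just counting. A toy sketch, assuming the matching has already been done and using invented place labels and a hypothetical `habitual_places` helper:

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def habitual_places(sightings, min_visits=3):
    """Count (weekday, place) pairs and keep the ones that repeat.

    sightings: list of (date, place_label) tuples for one phone.
    """
    counts = Counter((WEEKDAYS[d.weekday()], place) for d, place in sightings)
    return {pair: n for pair, n in counts.items() if n >= min_visits}


# Made-up sightings for one phone.
sightings = [
    (date(2006, 5, 1), "school"),
    (date(2006, 5, 8), "school"),
    (date(2006, 5, 15), "school"),
    (date(2006, 5, 6), "park"),
]
habits = habitual_places(sightings)
```

Note that the one-off visit falls below the threshold and drops out. Routine shows up in the counts; randomness does not.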
Data quality:
The tough part about prediction using data mining is the quality of the data. You see, if you're someone who does not want to be noticed, there are things that you can do to hide from the data mine. Paying for everything in cash (shouldn't be a problem if you're a terrorist who is funded by some shadowy organization like SPECTRE or Al Qaida) will help hide you from the database, since your name is never entered. Your daily routine might still be there, though. Using disposable cell phones and prepaid phone cards can also help hide you from the data mine. But if you are a creature of habit, it does not matter how many times you change phones or calling cards: the geographic patterns of call origination or call destination will reveal you. Never making or receiving a call from the same place twice will make you nearly invisible to the data mine. That's tough to do. All of these tactics, and others, when consciously applied, can provide some degree of protection, but a skilled data miner might still find you.
What you are actually doing is reducing the quality of the data. Low-quality data yields low-quality results. The converse is that creatures of habit, who always travel the same routes, visit the same places, call the same people, and pay with credit, debit or check, generate high-quality data. What can be predicted about these people is far more accurate than what can be predicted about terrorists, because these are law-abiding people who go about their own business, not worrying about who is minding it besides themselves.
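Why doesn't changing phones help a creature of habit? Because the places stay the same even when the number changes. One simple way to illustrate the idea is Jaccard similarity between the sets of places two phones have been seen. This is my choice of measure for the example, not a claim about what anyone actually runs, and the place labels are invented.

```python
def jaccard(places_a, places_b):
    """Share of places two phones have in common (0.0 = none, 1.0 = identical)."""
    a, b = set(places_a), set(places_b)
    if not a and not b:
        return 0.0  # no sightings at all: nothing to compare
    return len(a & b) / len(a | b)


# Made-up place sets for three phones.
old_phone = {"home", "office", "gym", "cafe"}
new_phone = {"home", "office", "gym", "airport"}
stranger = {"school", "mall"}

same_owner_score = jaccard(old_phone, new_phone)
unrelated_score = jaccard(old_phone, stranger)
```

The replacement phone scores high against the old one because the routine carried over, while the stranger's phone scores zero. Someone who never calls from the same place twice gives the measure nothing to work with, which is exactly the low-quality data problem.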
Who are they really spying on?
All of this brings us to ask who the real targets of all of this spying are. In truth, it could be the terrorists. In order to identify them, you need to know an awful lot about those who are not terrorists. This helps to eliminate false positives. However, the data for terrorists is so sparse that even if a possible terrorist is identified, the algorithms used will rarely generate a high probability and a high confidence. In other words, little, if any, actionable intelligence. On the other hand, if you want to predict how a person will vote in a given election, you can get an amazingly accurate prediction from the high-quality data generated by Joe and Jane Sixpack.
Which brings us to the Big Question: Why?
We have wondered for years how the Rethugs can keep squeaking out wins in elections they should lose. We know that their data miner of choice, ChoicePoint, was the company that purged the Florida voter rolls in the 2000 election. And Lo! They pop up again in the NSA scandal. It does not take a data miner to see that these thugs don't ever intend to lose an election again.
Now all of this is intended to educate. I hope that I can dispel some of the fears about data mining. It really is a useful tool when applied properly. However, as I said earlier, it can be used to work great evil. I hope that when we look carefully at data mining in the context of creating legislation, the question we ask is "how do we restrain, not kill, this thing?"