For the past several weeks, we've been hearing about this "new" NSA program to track down and spy on suspected al Qaeda terrorists in the USA. I'm not setting this diary up to discuss the relative 4th Amendment issues, whether the Bush admin. was spying on people they shouldn't have, or anything other than why, from a strictly mathematical standpoint, this program could not work with any efficiency.
More after the flip
The initial information suggested that the program was broad in scope, intercepting potentially thousands of people's calls and e-mails. In response, the administration has suggested that the program is much more limited, focused on a much smaller number of individuals who are communicating with "known or suspected al Qaeda operatives" outside the US. Now come more revelations (from the WaPo) that the operation has engaged in direct monitoring (by an individual at NSA) of somewhere in the vicinity of 5,000 individuals over the past 4 years. Of these 5,000, the article points out that "fewer than 10 US citizens or residents a year" were considered a sufficient threat to warrant more aggressive observation (including domestic wiretapping and warrants for said activities). What is alluded to in the article, but nobody has the numbers for, is that those 5,000 people who were directly observed were sorted out from a much larger pool, potentially tens to hundreds of thousands of individuals, through computerized and some human monitoring.
I have been thinking about this program from the perspective of a person who routinely uses statistics for research purposes. In statistics, there are two types of errors that can be committed - and these equally apply to epidemiology and data mining. The first is called a "Type I" error, or a "False Positive" and the second is a "Type II" or "False Negative". In the case of the wiretapping/domestic surveillance program, these would be identified as either calling an innocent individual a terrorist (Type I) or misidentifying a terrorist as an innocent (Type II).
When epidemiologists design screening tests for risk analysis (for diseases, etc.), the problem that often arises is that the rates of these two errors are inversely correlated. So when a test is established to determine whether you have cancer, for example, the rate of Type II errors is set as low as possible (to avoid missing anyone with cancer). As a consequence, the rate of Type I errors rises (some people have, for example, electron-dense areas in their breasts or higher than "normal" levels of PSA in their system, suggesting that they have a cancer - but they don't). This is why most medical tests for things like cancer are not performed on otherwise healthy individuals - and why those who are tested (and test positive) are usually given follow-up tests to confirm the initial diagnosis. This helps avoid diagnosing someone with a disease he or she may not actually have, and also spares healthy individuals unnecessary secondary testing.
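The inverse tradeoff can be sketched numerically. The sketch below is a toy model (my own assumption, not from any real test): healthy people produce a "test score" distributed around 0, sick people around 2, and the test flags anyone above a threshold. Raising the threshold trades false positives for false negatives:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution of a normal(mu, sigma) at x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Toy assumption: healthy scores ~ N(0,1), sick scores ~ N(2,1).
# The test flags anyone whose score exceeds the threshold.
for threshold in (0.5, 1.0, 1.5, 2.0):
    false_pos = 1 - normal_cdf(threshold, mu=0.0)  # healthy, but flagged
    false_neg = normal_cdf(threshold, mu=2.0)      # sick, but missed
    print(f"threshold {threshold}: FP {false_pos:.1%}, FN {false_neg:.1%}")
```

As the threshold rises from 0.5 to 2.0, the false-positive rate falls (roughly 31% down to 2% in this toy model) while the false-negative rate climbs (roughly 7% up to 50%). You can move the threshold, but you can't push both error rates down at once.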
The other thing to consider in this is the relative frequency of incidence. With something like cancer (depending on the type), the rate can be relatively high. Other diseases are comparatively rare - especially in different parts of the world (very little risk of contracting a rickettsial infection in the developed world). This is used by physicians in "differential diagnosis", where certain diseases or maladies can be "ignored" due to their relative rareness in a population.
Why are these two digressions above important? To establish some analogies and set the stage for the analysis of this program.
First, let's establish some baseline assumptions. Based on the number of hijackers on September 11 (19 - or 20), we can assume that there is a relatively small population of other al Qaeda operatives in the US. There are two reasons to believe this is true: first, the "success" of the September 11 operation revealed the effectiveness of a small-unit operation, and second, the more operatives around, the higher the risk of exposure. For the sake of my "model", I'll assume that there are 100 "terrorists" in the US - ten times the number cited in the WaPo article as having been observed more tightly, and five times the number of hijackers on Sept 11. Even so, in a population of 300-400 million, the odds that any given individual is a "terrorist" are pretty low.
Now that we have our "sick" population (using the sick/healthy analogy of epidemiology testing), I'll make a guess at the number of "healthy" people in the observed population. Based on the 2000 census there are approximately 1.5 million "Asian Indian" (which I assume to mean descendants from the subcontinent - India and Pakistan), and 1.2 million "Asian Other" (which may represent the Middle East ???); of course there are over 15 million in the "Some Other Race" category (which may include North African, maybe Middle Eastern). So about 18 million people. Round it to 20 million and assume that one in 100 has been flagged for some reason (calling their family in Quetta, a trip to Syria, etc.) and we've got a sample of 200,000. So within this sample the condition we're testing for isn't all that rare: 100 terrorists among 200,000 people is about 1 in 2,000.
Now, let's discuss the test. Of course, we want to design the "best" test possible and want to miss none of the terrorists in the US. To that end, we set the "False Negative" rate as low as possible. In the real world, this number would be about 10%, but we'll assume that the NSA (as the largest employer of mathematicians in the US) can design a better algorithm, with a failure rate of 1%. Because the net is set so tight, we have to accept that more innocents are going to get ID'ed too. The consequence is that we end up with a "False Positive" rate of 5% - and even this number is generous: in standard tests, the "False Positive" rate is often closer to 15-20%, which is why not everybody gets every test.
Time to do the math.
If we have a 99% success rate, then the NSA (and FBI, etc) catch 99 of the 100 "terrorists" in the US. The remaining one is so demoralized (and without team support) that he quits or runs back to a cave in Afghanistan.
Unfortunately, we also have a 5% "False Positive" rate. Since the NSA had been observing 200,000 individuals, this results in the tracking and (possible) apprehension of 10,000 innocent people who are suspected of being terrorists.
So, we've caught 99 of the 100 terrorists inside the border, but also brought under suspicion an additional 10,000 innocent people. Based on this, the positive predictive value - the chance that a flagged person is actually a terrorist - is less than 1% (99/10,099). Each of these people (innocent and guilty) needs to be tracked by NSA sigint people, observed by FBI or local law enforcement, and (probably at some point) brought in for questioning or legal action. Given that we've only got limited human resources, this program makes no sense from a "cost-benefit" analysis.
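The whole calculation above fits in a few lines. All the inputs are the assumptions I made earlier in this diary (100 terrorists, a 200,000-person flagged sample, 1% false negatives, 5% false positives), not real data; note that applying the 5% rate to the 199,900 innocents gives 9,995, which I rounded to 10,000 above:

```python
# Back-of-the-envelope numbers from this diary - all assumptions, not real data.
population = 200_000        # flagged sample drawn from ~20 million people
terrorists = 100            # assumed terrorists hiding in that sample
false_negative_rate = 0.01  # terrorists the test misses
false_positive_rate = 0.05  # innocents the test flags

caught = terrorists * (1 - false_negative_rate)                       # 99
innocents_flagged = (population - terrorists) * false_positive_rate   # 9,995
ppv = caught / (caught + innocents_flagged)  # positive predictive value

print(f"terrorists caught: {caught:.0f}")
print(f"innocents flagged: {innocents_flagged:.0f}")
print(f"chance a flagged person is a terrorist: {ppv:.2%}")
```

Even with error rates far better than real-world screening tests achieve, fewer than 1 in 100 of the people flagged is an actual terrorist - the low base rate swamps everything else.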
I'm off to watch the game - will come back after the kids are in bed and add links to the NYT/WaPo stories, etc.