As diaried (update diary is here) by RiderOnTheStorm, Russell Tice, a former NSA analyst, has come forward and blown the whistle on the NSA's domestic spying programs. This diary attempts to put some of the technical issues into as close to layman's terms as I can, and to look at the ramifications of existing data mining and machine learning techniques for individual privacy.
Disclaimer: The following diary is based upon my professional background as a software developer with extensive experience in data mining and machine learning. The software techniques presented are fundamental to the field. However, my thoughts on how these techniques may be applied by the government agencies responsible for the electronic surveillance are speculative; I am not, nor have I ever been, a member of the "intelligence community". I have, however, worked on various intrusion detection and prevention systems over the years.
While the NSA and DHS have not to date publicly discussed the specifics of their surveillance programs, we now have a whistleblower confirming that they do indeed exist, and that they are using machine learning and data mining techniques.
The sort of automated machine learning techniques that appear to be utilized by the NSA are fairly common these days; on-line retailers use them to detect fraud, credit card companies use them to detect stolen cards, lenders (whether they actually use the results or not) use them to determine loan risks, and advertisers use them extensively to target ads.
Fundamentally, these programs attempt to classify data into sets. In the case of on-line fraud detection, the algorithm attempts to classify a given order placed with the on-line retailer as fraudulent or not. In the case of insurance companies, the software attempts to classify an applicant as a good or bad risk. Note that classification into a given solution set is often weighted; that is, there is a probability associated with the classification. So, again using on-line fraud detection as an example, a given order might be classified as having an 80% chance of being fraudulent and a 20% chance of being legitimate. It is up to the user of the algorithm to determine what threshold is used to consider a data point properly classified; in our example, if the threshold were 75%, the order would be considered fraudulent; if it were 90%, it would not.
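To make the threshold idea concrete, here is a minimal sketch in Python. The fraud probability is invented for illustration; in a real system it would come from a trained model.

```python
# Minimal sketch of threshold-based classification. The probability
# below is made up; in practice it would be produced by a model.

def label_order(fraud_probability, threshold):
    """Label an order fraudulent only if the model's probability
    meets or exceeds the chosen threshold."""
    return "fraudulent" if fraud_probability >= threshold else "legitimate"

order_score = 0.80  # model says 80% chance of fraud

# The same score yields different labels at different thresholds:
print(label_order(order_score, threshold=0.75))  # fraudulent
print(label_order(order_score, threshold=0.90))  # legitimate
```

Note that the model's output never changes here; only the operator's choice of threshold determines which orders get flagged.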
Such machine learning programs rarely process the entire set of available raw data directly to generate a classification. Rather, metadata is used. In some cases, metadata is available as part of the total data, as with email. In an email, the header gives information about the sender and the recipient, the route the mail took, the subject line, and so forth. The body of the email, however, falls into the category of raw data. This does not mean that metadata cannot be extracted from the body: there are numerous algorithms and techniques for taking large bodies of raw data and extracting descriptive metadata that a classifier can use. Such extracted metadata may not be human readable, but it allows the classifier to operate on the transformed raw data.
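As an illustration, here is a small Python sketch, using only the standard library, of pulling structured metadata out of an email header and deriving crude word-count features from the raw body. The message and the choice of features are invented for illustration.

```python
# Sketch: separating an email's header metadata from its raw body.
# The word-count features at the end are one simple (not human-
# friendly) way of turning raw text into classifier-usable metadata.
from email import message_from_string
from collections import Counter

raw = """\
From: alice@example.com
To: bob@example.com
Subject: meeting tomorrow

Let's meet tomorrow to discuss the project budget.
"""

msg = message_from_string(raw)

# Header metadata is directly available as structured fields:
metadata = {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}

# The body is raw data; a bag-of-words count is one crude way of
# extracting machine-usable metadata from it:
body_features = Counter(msg.get_payload().lower().split())

print(metadata["from"])           # alice@example.com
print(body_features["tomorrow"])  # 1
```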
Thus, it should be noted that the manner in which such classifiers work does not require the program to "understand" everything in the communication. Rather, certain critical data points are extracted or extrapolated from the overall data. A voice recognition component of such a system, for example, would not have to understand every word; it could instead look for certain words or combinations of words. Similarly, a text analyzer could look for certain key words, phrases, and constructions. A common, everyday example of such a technique, one most of us use regularly, is the Bayesian classifier used to filter spam.
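To show the idea, here is a toy Bayesian text classifier in the spirit of spam filters, written from scratch in Python. The training documents and labels are invented; a production filter would train on far more data, but the mechanism is the same: the program "understands" nothing, it just learns which words are more likely under each label.

```python
# Toy naive Bayes text classifier. Training data is invented.
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (label, text). Returns per-label word
    counts, per-label document counts, and the vocabulary."""
    counts = {}         # label -> Counter of words
    totals = Counter()  # label -> number of documents
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        totals[label] += 1
        vocab.update(words)
    return counts, totals, vocab

def classify(text, counts, totals, vocab):
    """Pick the label with the highest log-probability, with add-one
    smoothing so unseen words do not zero everything out."""
    n_docs = sum(totals.values())
    best_label, best_score = None, -math.inf
    for label, word_counts in counts.items():
        score = math.log(totals[label] / n_docs)  # label prior
        label_total = sum(word_counts.values())
        for word in text.lower().split():
            score += math.log((word_counts[word] + 1) /
                              (label_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ("spam", "free money win prize now"),
    ("spam", "win free prize offer"),
    ("ham",  "meeting agenda for tomorrow"),
    ("ham",  "project budget meeting notes"),
]
model = train(docs)
print(classify("free prize offer", *model))  # spam
print(classify("budget meeting", *model))    # ham
```

Notice that nothing here parses grammar or meaning; the classifier works purely from which words co-occur with which label in the training data.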
One fundamental principle of such machine learning and data mining programs, or classifiers, is that they require data points from each set that they will be classifying for. For example, when on-line retailers train their fraud detection software, that software must be provided with a data set of legitimate orders and a data set of fraudulent orders. (These data sets are drawn from previous orders that have been proven to be legitimate or fraudulent.) In general, a model is built from known data on the assumption that the known data is representative of the future data to be processed.
Unfortunately, given the way such classifiers work, it is not possible to train them with just a set of, in this example, non-fraudulent orders and simply call anything else fraudulent. This is a fundamental principle of machine learning: every classification the program will attempt to assign samples to must have been seen by the classifier during its training phase. This does not mean that every possible data point must have been seen; it does mean, however, that an adequate representative sample from each possible classification must have been seen by the classifier during training.
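A tiny sketch of why one-class training fails for this kind of classifier: a model that scores each known label and picks the best can only ever return a label it saw during training. The data and scoring function here are trivial placeholders, invented for illustration.

```python
# Sketch: a classifier trained on only one class can only ever
# answer with that class, no matter how anomalous the input.

def train_labels(labeled_examples):
    """Record which labels exist in the training data."""
    return {label for label, _ in labeled_examples}

def predict(example, known_labels, score):
    """Return the best-scoring label among those seen in training."""
    return max(known_labels, key=lambda label: score(label, example))

# Trained only on legitimate orders:
labels = train_labels([("legitimate", order) for order in ["o1", "o2"]])

# The only possible answer is the one label the model has ever seen:
print(predict("obviously-fraudulent-order", labels, lambda l, e: 0))
# legitimate
```

(One-class anomaly detectors do exist as a separate family of techniques, but the label-scoring classifiers discussed here genuinely need representative samples of every class.)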
In the case of any government wiretapping, and given the sheer number of such communications, it would be impossible to have human beings read and classify them all as either terrorist or non-terrorist. However, in order for an automated classifier to work, it would have to be trained with a set of both terrorist communications, and a set of non-terrorist communications.
And herein lies the first problem with the government wiretapping programs, even if the intended goal is to automate the entire system so that communications are not necessarily read by people. In order to train the automated system, a sample set of legitimate communications must be used to teach the program to recognize legitimate communications. And the only way to do this is to use known legitimate communications, which would have had to be selected manually. Furthermore, given the complexity of training such a system, the number of communications required would have to be very large. Additionally, as communication patterns change, the system would have to be continually retrained, meaning more manually examined communications. This means that no matter how well intentioned the program, no matter how many legal protections were in place, at some point known legitimate communications would have to be read by humans. There is no escaping this with existing technologies and techniques.
The second problem with even a fully automated system is that there are no effective protections against the government, or a misguided government employee, examining an individual's data. The recent whistleblower revelations concerning the abuses have proven that this is not a theoretical risk, but an actual one. Aside from the fact, noted above, that both legitimate and illegitimate data points must be manually provided to the system to train it, meaning that a large number of legitimate American communications were read by the NSA, the possibilities for abuse are staggering. This is of course exacerbated by the lack of FISA oversight during the development of the program; without public understanding of how such systems work, of the actual threat levels, and of the rewards such a system can realistically deliver, and without judicial oversight, abuse of such a system was almost guaranteed. And now, it appears that there is confirmation that such abuses did indeed occur.
Perhaps in the end, our society will determine that we need to sacrifice a certain amount of our privacy. But if we do make such a decision, it should be made with full knowledge of what is really being done with the data, what protections will be in place to protect individual privacy and to prevent the data from being further abused, and what rewards can be legitimately expected from these programs.
In closing, I realize that I have made several simplifications in this diary. However, it is my hope that it provides some insight into how such systems work, and into some of the risks inherent in the technologies themselves. It is my belief that this issue is too important to be addressed by our society without a good understanding of the actual software mechanisms used in such systems, and of the ramifications of those mechanisms.
Update: Russell Tice just used the magic words "Data Mining" on Keith Olbermann's show.
Update 2: PBnJ wrote an extremely interesting diary about the incredible progress made in database technology, entitled A Quintillion Tiny Snowflakes.