Two days ago, Keith Olbermann had Russell Tice, former NSA analyst, on his show. Tice came forward to discuss the NSA's domestic spying program. (See here and here for the lead diaries on this by RiderOnTheStorm.)
Yesterday, I wrote a diary detailing the technical aspects of such a system, and presented some of the ramifications of such a system.
Today, I present a set of questions that need to be answered by our nation and society before we should even let those in power decide to proceed with this program.
First, though, a synopsis of the technical background, the ramifications thereof, and the final conclusion: that even the process of building a terrorist communication detection system will, by the nature of the statistical and machine learning principles upon which such a system must be built, severely degrade, if not eliminate, any expectation of privacy in one's electronic communications. (If you want to skip the dry techie stuff and scroll down to the questions relating to the social ramifications of the NSA program, you can just take my word that even the building and use of such a system means that the concept of electronic private communication is completely toast.)
To surveil as much information as the NSA would need to, the system must be automated. This system would have to be capable of classifying communications as "terrorist" or "non-terrorist". How this system would be built is based upon the principles of statistics, machine learning, data mining and pattern recognition. While there is continual growth in these fields, they are still constrained by the fundamental, mathematically provable foundations that these disciplines are built upon.
In order to build such a system, one would have to take a large sample of communications known not to be terrorist, as well as communications known to be terrorist, in order to "train" the system to recognize the difference between the two. This identification of terrorist/non-terrorist labels for the training communications must be done manually, by humans. (You can't have the classifier you haven't trained yet provide the training classifications.) The fundamental principles upon which such systems are built require that, when a system is trained, it be given an adequate representation of the entire range of data upon which it will eventually be used.
For example, such classifier systems cannot learn to recognize the difference between the colors blue and green unless, during the training phase of their development, they have seen both blue and green. Just showing such a system the color green would not reliably allow it to tell the difference when presented with the new color, blue. This is a trivial example, but it illustrates one of the fundamental principles of machine learning and statistical analysis. And if you can find a way to get around that math, you'll be able to write your own ticket from that point on.
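The blue/green point can be made concrete with a toy nearest-centroid classifier. This is purely an illustration I'm adding here; the colors, labels, and numbers are invented, and no real system is this simple. The key behavior to notice: the model trained only on green has no choice but to call everything green.

```python
# Toy nearest-centroid classifier: illustrates why a system trained on
# only one class cannot distinguish a second, unseen class.
# All data and labels are illustrative, not from any real system.

def centroid(samples):
    """Average of a list of (r, g, b) tuples."""
    n = len(samples)
    return tuple(sum(s[i] for s in samples) / n for i in range(3))

def train(labeled_samples):
    """labeled_samples: dict mapping label -> list of (r, g, b) tuples."""
    return {label: centroid(pts) for label, pts in labeled_samples.items()}

def classify(model, point):
    """Return the label whose centroid is closest to the point."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], point))

greens = [(0, 200, 0), (30, 180, 40), (10, 220, 20)]
blues = [(0, 0, 200), (20, 40, 220), (10, 20, 180)]

# Trained only on green: even pure blue comes back labeled "green".
green_only = train({"green": greens})
print(classify(green_only, (0, 0, 255)))   # prints "green" -- wrong

# Trained on both classes: now it can tell them apart.
both = train({"green": greens, "blue": blues})
print(classify(both, (0, 0, 255)))         # prints "blue"
```

The only way the second model got it right is that humans supplied labeled examples of *both* classes up front, which is exactly the point about the terrorist/non-terrorist training set.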
The identification of "terrorist" and "non-terrorist" communications must therefore, at some point, involve humans reading the training communications in order to label them. This means that a huge number of communications, selected because they are written by innocent people about innocent things, would not only be read, but would have been selected for reading by virtue of the fact that there was absolutely nothing illicit in any way about them. Furthermore, in order to continue to help the system learn once it is in use, false positives need to be read by a human as well.
Even if one were to restrict the training set to a narrow population of communications (say, government workers), it wouldn't work. Why? Because when training a classifier, the range of the training data must span the range of the live data. Either the system would fail outright, or it would effectively have to retrain from scratch (doing it live) based on the feedback of the false positives - either way, you're back to square one as far as people reading the communications of innocent people.
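To illustrate the span requirement with invented numbers: below, a simple threshold rule is fit on a narrow training population, then run against live traffic drawn from a wider range. The scores and the "flag anything above the trained maximum" rule are hypothetical stand-ins for a real classifier, but the failure mode is the real one: every false positive is an innocent message routed to a human reader.

```python
# Hypothetical sketch: a rule fit on a narrow slice of training data
# misfires on live data outside that slice. Scores are invented.

def fit_threshold(benign_scores):
    # Flag anything scoring above the highest benign score seen in training.
    return max(benign_scores)

# Training set drawn only from one narrow population (scores 0.1-0.3).
narrow_training = [0.1, 0.15, 0.2, 0.3]
threshold = fit_threshold(narrow_training)

# Live traffic from the general population spans a wider range.
live_benign = [0.2, 0.5, 0.7, 0.9]
false_positives = [s for s in live_benign if s > threshold]
print(len(false_positives))  # prints 3: three of four innocent messages flagged
```

Each of those three flagged messages would have to be read by a human and fed back as a corrected label, which is the "retrain from scratch, doing it live" scenario described above.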
Nor can you build a test-data generation program without using real-world data to "teach" that program how to generate realistic data. It is the same chicken-and-egg problem as the requisite manual reading of the training communications: until your data simulator has been trained on real communications, it cannot generate realistic-looking ones.
Therefore, even building such a system is a violation of our Fourth Amendment rights and of our privacy in general. Just to build the system means that the NSA must do the equivalent of secretly taking the mail out of thousands of people's mailboxes, reading it, and recording it. And these people must, by the underlying principles of statistics and machine learning, be selected because they are innocent.
It should also be noted that such a system will never be complete. It would require constant retraining as communication technologies changed and the patterns of terrorist communications changed. This means that the human factor during training (in addition to the identification of false positives) will never go away. It's not as if a decision could be made to bite the privacy bullet once and be done with it. In addition, to maintain the system, the developers and database administrators must have access to individuals' private data. It is completely unrealistic to think that the maintainers of the system can do their jobs without having access to the data.
So, again, to summarize: the very building and existence of such a system means that you no longer have privacy in reference to your electronic communications. The very concept of private communication will become a thing of the past.
Which leads me to the promised questions:
What happens, as a society, to us when our expectation of privacy is degraded, knowing that there is at least a chance that any electronic communication will be viewed by a stranger, let alone the government? How does the change in expectation of privacy change how we view ourselves? Our other views on our expectations of privacy? Our other freedoms? Other people's freedoms?
What are the actual risks and rewards of such a system? Or to put it another way, how much actual value does such a system provide when compared to the risks presented by abuses and the cost of such a system? How can we as a nation debate this, when full disclosure about the true terrorist threat and terrorist capabilities might in fact truly increase the threat? Similarly, how do we as a nation hold our debate when full disclosure about how the system works would make it easier to evade detection by said system?
What happens to current business and software practices that rely upon the use of encryption? Do they become illegal? Does one have to provide one's encryption keys to the NSA? What are the penalties for intentionally evading the system for non-terrorist purposes?
With the technical and non-technical (e.g., political, organizational, people level security, etc.) considerations taken into account, how does one prevent blackmail situations? How does one prevent those who run the system from using the data for personal gain?
Who would have access to the data collected, or portions of the data? Does one have the right to view one's own data set? If so, can one ask that mistakes be corrected? Who is responsible for the correctness of the data? Would corporations be able, in any manner, to gain access to the data? Could one look at a family member's data? A friend's? A stranger's? Could the data be sold? Could the data be subpoenaed in non-terrorist criminal and civil proceedings? Could it be used as the basis for credit ratings? Could it be used by the IRS? By debt collectors? Could the military use it in the context of "don't ask, don't tell"?
How does one secure the system from hackers? And not just the actual data repositories, but also the collection taps and the communication between those taps and the repositories. Compromising such a system would become the Holy Grail of identity thieves. Such a system would also become the primary target in e-warfare; a centralized repository with that level of information about every single US citizen would be irresistible. Computer security is a never-ending competition in which the good guys can never lose a single round, but the bad guys only need to win once to achieve victory. Are we willing to take the risk of massive data theft inherent in having such a system?
What about the question of evidence of another crime being committed? When data mining for terrorists, could the information gathered be used to go after someone who mentions (s)he's smoked a joint at a party? What about a murder? A shoplifter? A child molester? What about someone relating an anecdote about someone else committing a non-terrorist criminal act (i.e., hearsay)? How are privileged communications between a lawyer and client handled in this context?
How do we prevent the next corrupt administration (and eventually, there will come another Bush-type administration) from using the system, a system that will have been developed even further by then, against its political enemies? How does one remove the temptation for even a good or great administration to use the system for political ends?
What mechanisms could be put in place to ensure ongoing review of how the data was being used? What mechanisms could be devised that would ensure public debate if there arose a need to change the parameters of the use of the system?
What balance between safety and privacy are people willing to accept? What balance should they accept?
Do we fundamentally have a right to privacy?
These are the questions that we as Americans should be asking and debating publicly. It may be that we are willing to accept the course of surrendering privacy, with all that entails, in the name of security. But if this is the course our nation and society takes, it should be taken with full understanding of all of the ramifications, and only after true public debate.