Data Killed Privacy and We're All Accomplices

by Welp

Monday, Jun. 17, 2013 at 4:39:28pm PDT

Much has been written over the past few days about the NSA's recently unveiled surveillance program. Details about the program are still sparse, and at the moment it remains unclear exactly what capabilities it grants an analyst. That said, even the most permissive interpretation generally involves giving NSA employees the ability to see content from individuals without authorization from FISC.

However, this is merely the tip of the iceberg of what is possible using technology that has largely already been developed and implemented on a smaller scale by casinos, the FBI, and even the Obama campaign. Yes, the problem is data. There's too much of it, and it can be used in surprising ways by law enforcement with equipment that already exists.

In this diary (my first one!), I'm going to give a brief overview of the state of Big Data and how relatively innocuous information could be (is?) used to predict whether a person might be engaged in illegal behavior - or whether they might be in the future. The conclusions are purely speculative, but the results are very much within the realm of technological, budgetary, and legal feasibility. More after the fold.

Your Data is Everywhere

You probably already know that virtually everything you do online can be tracked in some way. If you sign up for an account on many websites, there's a substantial chance that your information has been scraped or purchased by spambots and marketers. Google uses its own sophisticated algorithms to deliver relevant advertisements to you based on data ranging from the search terms you've used on Google to which links you click and how long you spend staring at each website. Without you having to actually tell them, Google can predict with a fair degree of certainty your gender, approximate age, and interests.

Google wasn't the first to come up with this idea. Marketers have been purchasing your consumer data for decades in order to determine where they should direct junk mail. Casinos use this too. By meticulously compiling the behavior of individual patrons and using the omnipresent security apparatus within their buildings, they have created profiles that allow them to predict with fair precision the amount of money they can extract from new gamblers who walk through the door based solely on demographics, betting behavior, or even their clothing. This technology has become even more advanced in recent times, allowing them to tell with reasonable confidence whether offering a couple of free drinks to a gambler might cause them to stay and try their luck a little longer. And let's not forget credit scores, which estimate how likely you are to default based on your borrowing and repayment history.

President Obama's re-election campaign was perhaps the first very large-scale implementation of "Big Data," and in many ways it eclipsed what the private sector had been doing. While data had been used prior to Obama (including casino-like profiling techniques called 'microtargeting'), the principal innovation was the effort to combine large numbers of disparate databases of public information. Before, the different departments of the campaign might use their own databases. Magazine subscription data might determine who would receive a mailer about gun control. Voter contact would work from the voter file databases provided by each individual state. OFA spent millions of dollars on an effort to combine all of those disparate databases into a single one and then use that to determine five absolutely critical pieces of information about every single American it could:

1. Where is the voter? How can we contact them?
2. Will they vote?
3. Who will they vote for?
4. Can they be persuaded to vote the other way?
5. If they can be persuaded, what messaging will work best?

Each piece of information would be thrown into an algorithm that would rank the voters with a "support score." Essentially, the score was a 1-100 indicator that would be used to target individuals. For example, a support score of 95 means that if you took every individual with that score and looked at their ballots on election day, 95% of them would have voted for Barack Obama. This is different from a poll because while a poll is a snapshot of presidential preference at one point in time, this is a predictor of behavior. In other words, the person may not yet know they will vote for Barack Obama, but OFA has a pretty good idea. These numbers proved substantially more accurate than polling on election day.
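To make that definition concrete, here is a minimal Python sketch of how one might check whether a support score behaves the way it is described: group voters by score and look at what share of each group actually backed the candidate. The function name and the sample data are invented for illustration; nothing here reflects OFA's actual tooling.

```python
from collections import defaultdict

def observed_support_by_score(voters):
    """Return, for each support score, the share of voters with that
    score who actually voted for the candidate.

    `voters` is an iterable of (support_score, voted_for_candidate) pairs.
    """
    totals = defaultdict(int)
    supporters = defaultdict(int)
    for score, voted_for_candidate in voters:
        totals[score] += 1
        if voted_for_candidate:
            supporters[score] += 1
    return {score: supporters[score] / totals[score] for score in totals}

# Invented data: 20 voters scored 95, of whom 19 backed the candidate,
# and 10 voters scored 50, of whom 5 did.
sample = (
    [(95, True)] * 19 + [(95, False)] * 1
    + [(50, True)] * 5 + [(50, False)] * 5
)

rates = observed_support_by_score(sample)
for score in sorted(rates):
    print(f"score {score}: {rates[score]:.0%} voted for the candidate")
```

If the score is working as described, the observed share in each group should land close to the score itself. Large gaps would suggest the model behind the score is off.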

Disparate Pieces of Information Interact in Strange Ways

Many data relationships are straightforward. For example, voting records can be used to determine someone's likelihood of voting. If they've voted in the past four elections, chances are they'll vote in the fifth. Other relationships are less obvious. Regression techniques can be used to identify relationships that predict a person's behavior. For example, suppose a person turns eighteen in October of 2012. Without a voting record, it's difficult to score that voter's likelihood to turn out. But through regression analysis, you might find that people turning eighteen in October are very unlikely to vote even though there is a high chance they'd rather see Obama win. Thus, it's likely a waste of resources to contact them. But then if you add their magazine subscriptions - say, Guns & Ammo - their turnout score is boosted but their support score drops dramatically. Other information - including ethnicity - can clarify that picture.
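As a rough illustration of the regression idea, the Python sketch below fits a small logistic regression for turnout by plain gradient descent. Everything in it is an assumption made for the example: the three features (share of the last four elections voted in, newly eligible, subscribes to a gun magazine), the seventeen invented voters, and the choice of logistic regression itself, since the text only says "regression techniques."

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features):
    """Predicted turnout probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)

def fit_logistic(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit logistic regression by batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            error = predict(weights, features) - label
            gradient[0] += error
            for j, value in enumerate(features):
                gradient[j + 1] += error * value
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights

# Invented voters: (past_vote_share, newly_eligible, gun_magazine), turned_out
data = [
    ((1.0, 0, 0), 1), ((1.0, 0, 0), 1), ((1.0, 0, 1), 1), ((1.0, 0, 0), 0),
    ((0.5, 0, 0), 1), ((0.5, 0, 0), 0),
    ((0.0, 0, 0), 0), ((0.0, 0, 0), 0), ((0.0, 0, 1), 1), ((0.0, 0, 0), 1),
    ((0.0, 1, 0), 0), ((0.0, 1, 0), 0), ((0.0, 1, 0), 0), ((0.0, 1, 0), 1),
    ((0.0, 1, 1), 1), ((0.0, 1, 1), 1), ((0.0, 1, 1), 0),
]
rows = [features for features, _ in data]
labels = [label for _, label in data]

weights = fit_logistic(rows, labels)

new_voter = predict(weights, (0.0, 1, 0))           # just turned eighteen
new_voter_with_mag = predict(weights, (0.0, 1, 1))  # same, plus the magazine
print(f"newly eligible: {new_voter:.2f}")
print(f"newly eligible with gun magazine: {new_voter_with_mag:.2f}")
```

On this toy data the fitted model gives the newly eligible voter with the magazine subscription a higher turnout probability than the one without it, which is the kind of boost the paragraph above describes. A separate model of the same shape, trained on candidate preference instead of turnout, would be needed to produce the support score.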

But that's the obvious data. One of the emerging trends inside campaign analytics is the ability to combine social media and voter file data to identify supporters or potential volunteers. And once you have access to a Facebook wall, you have huge amounts of data just waiting to be catalogued. You might find relationships between how often a person posts a picture on Instagram and their likelihood of voting on election day. Or the average age of their Facebook friends might be a strong indicator of their likelihood to vote Democratic.

The more points of data you can catalogue, the more accurate your predictions. What about the amount of jelly they put on their PB&J sandwich? It sounds ridiculous, but is it more ridiculous than the widely publicized fact that music genre preference strongly correlates with political support (even when you correct for age and ethnicity)? Now, the power isn't unlimited. Not everything is statistically significant or useful, and some data can actually be misleading if applied incorrectly. But a careful hand can differentiate these and lead to highly accurate results.

Big Data and Law Enforcement

At this point, I've belabored the point that with enough data you can predict political support. That extends to buying habits and even gambling habits. But what happens when the NSA or the FBI starts to compile thousands of disparate databases? With a budget many times larger than the entirety of the Obama campaign, virtually all public data in the United States could be combined and correlated with other parts of the data stream. This could include every Tweet and every blog post. It could include criminal records. And if you can predict a person's likelihood to vote, couldn't you also predict things like, say, a person's likelihood to possess heroin? The answer is - well, yes. Why not?

Once you've managed to bridge all this data and identify with fair certainty to whom each piece of data belongs, you could give every person in the United States a 'criminality score.' Since there is no legal protection against mass analysis of public records (as you have no expectation of privacy), this wouldn't run afoul of existing privacy protections. Years ago, some privacy advocates believed that the NSA was monitoring all communications in the world via the ECHELON system for certain keywords (like "bomb"), flagging matches for additional review by an analyst. This probably isn't exactly true, but a similar system could be used in a far more powerful way. What if, instead, the percentage of your words on Twitter that are considered 'aggressive' or your likelihood to share certain kinds of videos on YouTube could be used to predict how likely you are to take a violent criminal action? With enough processing power and data storage, a government agency might be able to find out. In fact, the NSA's much talked about data center in Utah could very well house a system capable of finding and applying those correlations.
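For a sense of how crude such an input can be, here is a hedged Python sketch of the "percentage of aggressive words" idea. The word list and function name are entirely made up for illustration; a real system, if one exists, would presumably be far more elaborate.

```python
import re

# Hypothetical lexicon, invented for this example only.
AGGRESSIVE_WORDS = {"hate", "destroy", "fight", "attack"}

def aggressive_share(text, lexicon=AGGRESSIVE_WORDS):
    """Return the fraction of words in `text` that appear in `lexicon`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in lexicon)
    return hits / len(words)

tweet = "I hate Mondays and I will fight traffic"
print(f"{aggressive_share(tweet):.0%} of words flagged")
```

A harmless complaint about Mondays gets a quarter of its words flagged here, which is exactly why a number like this would only be useful (if at all) as one input among many, checked against real outcomes for statistical significance.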

Now, the really scary Minority Report scenario would be if your criminality score could be used to obtain a search or surveillance warrant. That seems unlikely, but it's also not out of the realm of possibility. After all, a criminality score of 99 would imply far more certainty that you have committed some kind of crime than is currently required to issue a search warrant. On the other hand, there are also various constitutional protections against some kinds of profiling. Even if using the score for warrants weren't allowed, however, it's still a very powerful (potential) system.

Who Needs ECHELON, Anyway?

In many ways, we've brought this technology on ourselves. Americans are less concerned about privacy than ever. We've grown to tolerate tracking via cookies. Virtually every free service on the Internet somehow catalogues our behavior into some random database off in the nether. In many ways, services like Twitter, Facebook, and Instagram have encouraged us to widely and publicly share things we wouldn't have shared 20 years ago. Many people don't realize that potential employers are checking out their Facebook walls, or that what they publish on Twitter is public and searchable. In fact, there's increasing evidence that Americans don't actually care. The growth of services like Facebook has shown that Americans are not particularly bothered by the fact that they are being tracked. It's an increasingly accepted part of our lives, and the only way to avoid this seemingly dystopian future is a massive sea change in attitudes toward privacy. I'm not particularly hopeful on that point.