Using publicly available information about the storage capabilities of Google, Microsoft, and Amazon, and rumors about the capabilities of the NSA's new Utah data center due to come online later this year, I take a hard look at whether it's possible, and reasonable, for intelligence agencies and private companies to store incredible numbers of telephone calls in the era of Big Data.
Glenn Greenwald teased a new story he's been working on for The Guardian about new technology that allows the National Security Agency to "redirect" up to one billion cell phone calls per day to its servers, and then store them for an unknown length of time:
What we are really talking about here is a globalized system that prevents any form of electronic communication from taking place without its being stored and monitored by the National Security Agency. It doesn't mean they're listening to every call, it means they're storing every call and have the capability to listen to them at any time, and it does mean that they're collecting millions upon millions upon millions of our phone and email records.
That leaves a lot of room for interpretation and there's not much to go on until the story is published. These claims raise serious questions that need answering.
- Does this capability only pertain to foreign phone calls, or does it also include domestic communications?
- Do domestic and global telecommunications companies willingly participate in the gathering of this information, or is it being collected discreetly without their knowledge and/or consent?
- Are America's allies aware of this capability and are they a partner to it?
- Under what U.S. and international law do this administration and the NSA justify the program under which this capability operates?
- Are warrants from the Foreign Intelligence Surveillance Court used to listen to the content of stored telephone calls, and are such warrants used to gather that information in the first place?
- What physical and technical safeguards are in place to protect privacy?
- How long has the NSA had this capability?
- Has anyone within this administration or any other had reservations about its legality? Has anyone in Congress been briefed on this capability?
All good questions, and the public doesn't currently have any answers.
One good question, raised by several people who commented on the story I wrote Friday on this topic, is whether the NSA even has the technical ability to store that much information.
Only someone with an appropriate security clearance knows the data storage capacity of the NSA (this would make a good question for Edward Snowden, if anyone can get access) and large information providers like Google aren't always forthcoming about their storage solutions, either. But I believe we can make some educated guesses by looking at the capabilities of the private sector, which can then be extrapolated to government intelligence services worldwide.
Your entire life and then some
A paper published by Jeffrey Dean and Sanjay Ghemawat in 2008 revealed that Google was processing 20 petabytes (20,971,520 gigabytes) of data per day as of the fall of 2007. Processing data isn't the same thing as storing it, but I think that gets us in the ballpark, and we're talking about technical capabilities that are already six years old.
A note written by Peter Vajgel from the Facebook Engineering group in the spring of 2009 revealed that his company was storing 1.5 petabytes (1,572,864 gigabytes) worth of photos and was adding 25 terabytes (25,600 gigabytes) worth every week. He also shared that Facebook's photo storage servers were only using a series of 1 terabyte drives at that time, and we've now got 4 terabyte drives on the market for $179.99. And that's a consumer rate; we've got to assume that governments, like Facebook and Google, get discounts for buying in bulk.
A story by Sebastian Anthony on Extreme Tech found that Facebook's IPO documentation claimed storage of 100 petabytes of data between photos and video, Microsoft was over 100 petabytes for Hotmail (given its advantages, it's probably significantly more for Gmail), and even Dropbox is storing in excess of 40 petabytes of data.
Netflix is storing its video library on Amazon's S3 service. That library at one time consisted of one petabyte of data, and S3 was thought to store a total of 566 exabytes of data as of late 2011. 566 exabytes is 579,584 petabytes (607,737,872,384 gigabytes).
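The gigabyte figures quoted in this section can be reproduced with a few lines of Python. This is only a sketch, and it assumes binary units throughout (1 terabyte = 1024 gigabytes, and so on); the `to_gigabytes` helper and the labels are names made up for illustration.

```python
# Binary (power-of-1024) storage units, expressed in gigabytes.
# Assumption: every figure in this section uses binary units.
GB_PER_UNIT = {
    "TB": 1024,
    "PB": 1024 ** 2,
    "EB": 1024 ** 3,
}


def to_gigabytes(amount, unit):
    """Convert an amount given in TB, PB, or EB to gigabytes."""
    return amount * GB_PER_UNIT[unit]


figures = [
    ("Google, processed per day (2007)", 20, "PB"),
    ("Facebook photos (2009)", 1.5, "PB"),
    ("Facebook weekly photo growth (2009)", 25, "TB"),
    ("Amazon S3 estimate (late 2011)", 566, "EB"),
]

for label, amount, unit in figures:
    print(f"{label}: {to_gigabytes(amount, unit):,.0f} GB")
```

Drive makers usually quote decimal units instead (1 terabyte = 1,000 gigabytes), so retail capacities will come out slightly smaller than these binary figures suggest.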
Even if all of these estimates are off, you could cut them by large factors and still be talking about incredible amounts of data storage.
The government doesn't know the meaning of thrift
We'll probably never know the current and maximum storage capacity of the National Security Agency or Central Intelligence Agency (and even less will ever be known about their processing capabilities), but we do have some information to go on.
The NSA is constructing a 1.5 million square foot data center at Camp Williams, Utah, with a projected completion date of September of this year. That data center is rumored to have a storage capacity measured in yottabytes, with one yottabyte equaling 1,125,899,906,842,624 gigabytes.
Compared to some of the major information providers that I listed above, at one yottabyte, the NSA conceivably could store:
- Nearly 716 million times the amount of data that Facebook used in 2009 for photo storage.
- One billion copies of the Netflix library.
- Nearly 11 million copies of the entire Hotmail service.
- Every single bit of data that Google was processing in 2007 in a single day, 53 million times over.
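The comparisons in the list above are simple ratios, sketched here in Python. The sketch assumes binary units and takes the one-yottabyte rumor at face value; the dictionary keys are just labels for the figures cited earlier.

```python
# One yottabyte and one petabyte, both in gigabytes (binary units).
YOTTABYTE_GB = 1024 ** 5   # 1,125,899,906,842,624 GB
PETABYTE_GB = 1024 ** 2

# Private-sector figures cited earlier, in petabytes.
holdings_pb = {
    "Facebook photo storage (2009)": 1.5,
    "Netflix library": 1,
    "Hotmail": 100,
    "Google daily processing (2007)": 20,
}


def copies_per_yottabyte(petabytes):
    """How many times a data set of the given size fits in one yottabyte."""
    return YOTTABYTE_GB / (petabytes * PETABYTE_GB)


for name, size_pb in holdings_pb.items():
    millions = copies_per_yottabyte(size_pb) / 1e6
    print(f"{name}: about {millions:,.1f} million times over")
```

Because everything is a ratio of the same unit, the results do not change if all sizes are expressed in bytes rather than gigabytes.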
That's a lot of storage, and obviously the NSA won't be using it only to store telephone intelligence, but there's plenty of room for that when it comes to recording and storing the content of phone calls for an extended length of time.
Hours on a floppy disk
Many people are making assumptions about the storage requirements of audio based on their personal experience with the MP3 codec when used to store music. The frequency range of the human voice is much narrower than that of musical instruments, and speech can be clearly understood at a much lower quality level than we might prefer just to satisfy personal taste. MP3 would be significant overkill if the government's priority were efficiency and quality only mattered for discerning words.
My old Samsung Reality cellphone (2010) uses the QCELP codec developed by Qualcomm in 1994 to record voice notes. A quick test using that codec found that 32 seconds of audio only required 22.5 KiB of storage space, which works out to about 24.3 minutes of audio per megabyte of storage. The 4 GiB SD card in my phone could store nearly 1,660 hours of audio using QCELP, and that codec is ancient.
(Note: A second test of constant talking stored 30 seconds in 45.6 KiB of data. These codecs apparently are super efficient at not storing silence.)
Forget about five megabytes of storage for five minutes of audio with MP3. The NSA and any other intelligence agency have had the capability to store roughly 35 minutes of audio on a 1.44 MiB floppy disk for nearly two decades, if not more.
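For anyone who wants to rerun the QCELP arithmetic, here is a minimal Python sketch that scales the measured sample (32 seconds in 22.5 KiB) up to larger media. It assumes binary units (1 MiB = 1024 KiB) and that the sample's mix of speech and silence is typical; the function name is made up for illustration.

```python
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 ** 2


def seconds_stored(capacity_kib, sample_seconds, sample_kib):
    """Scale a measured sample up to a storage capacity given in KiB."""
    return capacity_kib * sample_seconds / sample_kib


# First test: 32 seconds of audio in 22.5 KiB (includes pauses).
sample = (32, 22.5)

minutes_per_mib = seconds_stored(KIB_PER_MIB, *sample) / 60
hours_on_sd_card = seconds_stored(4 * KIB_PER_GIB, *sample) / 3600
minutes_on_floppy = seconds_stored(1.44 * KIB_PER_MIB, *sample) / 60

print(f"Per MiB:         {minutes_per_mib:.1f} minutes")
print(f"4 GiB SD card:   {hours_on_sd_card:,.0f} hours")
print(f"1.44 MiB floppy: {minutes_on_floppy:.1f} minutes")
```

Swapping in the second test, `sample = (30, 45.6)`, roughly halves every figure, which gives a feel for how much the result depends on the amount of silence in the recording.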
QCELP has been replaced several times over the years. Selectable Mode Vocoder (which is in the process of being replaced by Enhanced Variable Rate Codec B) can store audio with a bitrate as low as 800 bits per second at its lowest quality. At that bitrate, an hour's worth of the human voice can be stored in as little as 351.5 KiB, or nearly three hours of speech per megabyte.
A single four terabyte hard drive from Newegg ($179.99) could store 11.1 million hours of audio using SMV. If rumors are accurate, then the NSA's new data center could conceivably store 383,347,862,637,820 years of audio, or about 40,297,527 trillion five minute phone calls in total, using SMV. With that kind of storage capacity possibly coming online within half a year, it's not unreasonable to believe that the NSA could already intercept and store one billion calls per day.
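The SMV figures follow from one constant: 800 bits per second is 100 bytes per second. The Python sketch below reruns them under a few assumptions that are mine rather than established facts: the retail drive is a decimal 4 terabytes (4 x 10^12 bytes), the rumored yottabyte is binary (2^80 bytes), and a year is 365 days.

```python
SMV_BITS_PER_SECOND = 800
BYTES_PER_SECOND = SMV_BITS_PER_SECOND // 8    # 100 bytes of audio per second
SECONDS_PER_YEAR = 365 * 24 * 3600


def seconds_of_audio(storage_bytes):
    """Seconds of lowest-rate SMV audio that fit in the given storage."""
    return storage_bytes / BYTES_PER_SECOND


# Size of one hour of audio, and hours of audio per binary megabyte.
kib_per_hour = 3600 * BYTES_PER_SECOND / 1024
hours_per_mib = seconds_of_audio(1024 ** 2) / 3600

# A retail "4 TB" drive, assumed to be decimal terabytes.
drive_hours = seconds_of_audio(4 * 10 ** 12) / 3600

# The rumored yottabyte, assumed to be binary (2**80 bytes).
yottabyte_seconds = seconds_of_audio(2 ** 80)
yottabyte_years = yottabyte_seconds / SECONDS_PER_YEAR
five_minute_calls = yottabyte_seconds / (5 * 60)

print(f"One hour of audio:  {kib_per_hour:.1f} KiB")
print(f"Hours per MiB:      {hours_per_mib:.2f}")
print(f"4 TB drive:         {drive_hours / 1e6:.1f} million hours")
print(f"One yottabyte:      {yottabyte_years:.4e} years")
print(f"Five-minute calls:  {five_minute_calls / 1e12:,.0f} trillion")
```

Raising `SMV_BITS_PER_SECOND` to 8000 models a codec at ten times the bitrate and divides every capacity figure by ten.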
With little more than rumors to work from, there are of course many caveats to these estimates. The NSA may not have anything close to that kind of storage in reality, and even if it does have that capacity once the Utah data center goes online, it may not be storing that much data in perpetuity. It's also likely that a great deal of that storage capacity would be lost to backups, live data redundancy, regular hardware loss, and an incentive not to fill the storage system to the brim.
It's also unlikely that the NSA would use the SMV codec, much less at its lowest quality setting. A higher quality bitrate setting could easily increase storage requirements by a factor of 10, SMV isn't the most advanced codec in existence for storing voice information, and the NSA may even have more advanced codecs than what the private sector has. Yet with such astronomical numbers to begin with, a 10 times reduction of 383,347,862,637,820 years worth of voice storage isn't very meaningful.
Another thing to consider is that since more than one phone call can be intercepted at a time, a hundred hours worth of calls might be recorded within a single hour of real-time, or even a few minutes, depending on how extensive the surveillance is. One million hours worth of calls could end up being captured within a surprisingly short amount of time.
Private sector spooks
The information providers that I mentioned earlier can once again give us an idea of what the NSA might reasonably be capable of storing today.
The servers storing Netflix's one petabyte of data could hold 357,020 years worth of audio with the SMV codec at its lowest bitrate, or 35,702 years at a much higher quality. Microsoft could store 35,702,051 years on the servers that make up Hotmail.
Amazon may well have the largest storage system in the world. If every human on Earth (call it seven billion) made a five minute phone call in a 24 hour period, that'd be 35 billion minutes worth of voice data. That's only 190.9 terabytes worth of data per day using the SMV codec, meaning Amazon's rumored 566 exabytes of storage capacity could hold 207,025,133,181 years worth of phone calls.
In other words, even using an audio quality at 10 times my estimates, Amazon could easily store every phone call in the world indefinitely.
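The worldwide estimate can be rerun the same way. This Python sketch assumes seven billion callers, one five-minute call each per day, 100 bytes per second of SMV audio, and binary units for the 566 exabyte figure; the variable names are made up for illustration. It carries full precision instead of the rounded 190.9 terabytes, so the totals land slightly below the ones quoted above.

```python
PEOPLE = 7_000_000_000
CALL_SECONDS = 5 * 60
BYTES_PER_SECOND = 100               # SMV at 800 bits per second
SECONDS_PER_YEAR = 365 * 24 * 3600

TIB = 2 ** 40
AMAZON_BYTES = 566 * 2 ** 60         # rumored 566 exabytes, binary units

# Storage needed for one day of worldwide calls.
daily_bytes = PEOPLE * CALL_SECONDS * BYTES_PER_SECOND
daily_tib = daily_bytes / TIB

# How many days of that traffic the rumored capacity could absorb.
days_of_world_calls = AMAZON_BYTES / daily_bytes

# Total audio duration that capacity represents, in years.
audio_years = AMAZON_BYTES / BYTES_PER_SECOND / SECONDS_PER_YEAR

print(f"One day of calls: {daily_tib:.2f} TiB")
print(f"Days of worldwide calls stored: {days_of_world_calls:,.0f}")
print(f"Years of worldwide calls stored: {days_of_world_calls / 365:,.0f}")
print(f"Total audio: {audio_years / 1e9:.1f} billion years")
```

The days-of-traffic figure is the more intuitive one: it says how long every person on Earth could make that daily call before the storage filled up, and dividing it by ten gives the higher-quality case.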
Even if the NSA falls well short of Amazon or the yottabyte mark, a start-up like Dropbox could store more than 14 million years worth of audio on the 40 petabytes cited above, and that's a company that's run on less than $260 million in funding since 2008, staffed by just 221 people as of January of 2013.
Given the current state of technology and what we already believe has been deployed in the private sector, it's very likely that the National Security Agency could store far more than one billion phone calls per day if it really wanted to.
A few notes about this story:
- I emailed Glenn Greenwald on Friday to see if he wanted to look over this story to be sure that my interpretation of his claims upon which this story is based was accurate, and didn't hear back. I didn't expect to, mind you, but I thought I'd offer him the opportunity anyway. If it turns out that my interpretation was wrong, I'll update this story to reflect that, but I believe I've done the best that I could with the available information.
- The math in this story isn't especially difficult, but it's prone to errors given all the conversions involved. SMV's lowest bitrate is 800 bits per second, while all of my calculations have been done in bytes, where 1024 bytes equals 1 kilobyte. Most calculations were done at the byte level, but not all; the final estimate of Amazon's storage capacity was done by dividing 593494016 (566 exabytes in terabytes) by 190.9, rather than 190.xxxxxxxx.
I invite anyone interested in the math to check my calculations and I'll happily update this story with any errors that you can find. This is the calculator that I used for this story, since it'll allow the manipulation of extremely large numbers without having the calculator visually convert them to a power.
- If anyone wants to nitpick over the SMV estimate of 800 bits/second of storage as practical, the SMV codec at all, or things of that nature, please feel free. I won't argue. Just keep in mind that I've already acknowledged that these are rough estimates based entirely on rumors. This story is not meant to be a definitive and exact listing. Even if you assume a speech codec using 10 times the bitrate that I have, the numbers are so astronomical that I don't believe that any error I've made or critical (and legitimate) nitpick would make a difference in the final conclusion that major storage players in the private sector, and the NSA, could store as much voice data as they could ever want.