The public now knows more than it did last year about how the NSA spies on all Americans. And the revelations keep on coming. But one of the more recent ones, the one the NSA has denied most vociferously, has been fairly obvious for years. They're not just collecting metadata on our phone calls. They're collecting the actual content of millions of phone calls. Not all of them -- the telephone network is more decentralized than the Internet -- but as many as they want. And they're able to search that content. So if your call was monitored and you said "one pepperoni and anchovy pizza, easy on the cheese", they could search for "pepperoni" and "anchovies" and find your call in their massive library.
How do I know this? Because the pieces came out one by one over time. One key piece became visible when its developers spun off a venture to commercialize the technology, and that venture is now a going concern. The company has changed its name twice and no longer offers the NSA-like product it started with, but for a while that product was a public web service. Not for listening to phone calls, of course, but for the next best thing: podcasts. Because if you can index and search podcasts, you can index and search stored phone calls.
This isn't trivial technology. Let's review.
Speech recognition by computer has been a difficult problem for years. In 1966, Star Trek assumed that it would be normal, and that was a reasonable supposition. But at that point in time it was almost as far from reality as warp drive.
By 1980 or so, commercial speech recognition systems were on the market. Not cheap, not PC stuff, but available for commercial applications. Not that there were many then... it's a hard problem! In that era, speech recognition generally came in two flavors. There was speaker-dependent continuous recognition, where the system was trained to a given user's voice but the user could speak actual sentences. And there was speaker-independent recognition of a limited vocabulary of isolated words. "Left". "Right". "Stop". But not "bring me the newspaper" spoken the way people normally do.
By the 1990s, continuous, speaker-independent recognition was getting closer. Processors were faster, memory was cheaper, and people had a better idea of how speech worked, in terms that a computer could handle. Dragon NaturallySpeaking became available on the PC.
By the 2000s, the best systems were even better. DARPA's GALE project, funded in 2005, had made progress against its goals:
The goal of the DARPA GALE program is to develop and apply computer software technologies to absorb, analyze and interpret huge volumes of speech and text in multiple languages. Automatic processing “engines” will convert and distill the data, delivering pertinent, consolidated information in easy-to-understand forms to military personnel and monolingual English-speaking analysts in response to direct or implicit requests.
GALE will consist of three major engines: Transcription, Translation and Distillation. The output of each engine is English text. The input to the transcription engine is speech and to the translation engine, text.
One of the companies working on GALE was a well-known DARPA contractor, BBN Technologies. BBN, based in Cambridge, MA, is sometimes called that city's "third university". It has always been a remarkable collection of smart people. It began as an acoustical consulting firm, but its most famous work was in computers. Back in the late 1960s and 1970s, they built and ran this little thing for ARPA, the ARPANET, which grew into the Internet. In the 1970s, its biggest shareholder was a little-known investor named George Soros. He sold his shares around 1980 to the French oil services giant Schlumberger. In 1997, the telecom conglomerate GTE (which also had defense-electronics holdings) bought it. Verizon soon bought GTE, but eventually spun off what was left of BBN (by then almost all military contracts), and Raytheon bought it.
BBN's business model has always been based on government contracting. They do R&D on seriously advanced stuff, on cost-plus contracts or for a modest profit. That has been reliable income for them. On many occasions, though, they've tried to commercialize their technology, usually by spinning it off.
Now given its background in computing, acoustics, and even psychology, speech processing was an obvious area of interest for BBN. And they did plenty of government work. In the 1990s, they spun out a company that still makes an automated telephone operator, which directs calls by spoken name. But that's relatively low-tech, since the vocabulary is limited. It was pre-GALE.
It was some years later, after the turn of the century, that they created, and later spun out, a subsidiary called Podzinger. It was portrayed as a search engine for podcasts. Unlike text-based web pages, podcasts weren't picked up by conventional search engines, since their content was audio, not searchable text. Podzinger crawled the web's podcasts and other audio content, converted it to text, indexed it, and made it available to anyone as a web search engine.
Continuous, speaker-independent, natural speech, indexed. How nice! Wait a second... if it works on podcasts, wouldn't it work on stored phone calls too?
That was the giveaway.
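To make the giveaway concrete, here is a minimal sketch of that kind of pipeline in Python. It assumes some speech-to-text engine exists; the transcribe() function below is a hypothetical stand-in, not anything Podzinger or the NSA published. Transcribe each recording, build an inverted index from words to recordings, then intersect the postings to answer a query.

    from collections import defaultdict

    def transcribe(audio_path):
        # Hypothetical stand-in for a continuous, speaker-independent
        # speech-to-text engine; returns the recording as plain text.
        raise NotImplementedError("plug a real recognizer in here")

    def build_index(recordings):
        # Map each word to the set of recording IDs it appears in.
        index = defaultdict(set)
        for rec_id, audio_path in recordings.items():
            for word in transcribe(audio_path).lower().split():
                index[word.strip('.,?!"')].add(rec_id)
        return index

    def search(index, *terms):
        # Return the recordings that contain every search term.
        postings = [index.get(term.lower(), set()) for term in terms]
        return set.intersection(*postings) if postings else set()

    # calls = {"call-0001": "calls/0001.amr", ...}
    # idx = build_index(calls)
    # hits = search(idx, "pepperoni", "anchovies")

Nothing in the indexing or searching cares whether the recordings are podcasts or stored phone calls; the hard part is the transcribe() step, and that was exactly what Podzinger demonstrated.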
Now let's talk about phone calls. A wireline call is transmitted at 64,000 bits per second. A cellular call, though, is usually compressed to about 1/8 of that. That's why the sound quality is so poor. But speech compression doesn't necessarily harm speech recognition, since advanced compression techniques model the sounds that make speech intelligible. It might even serve as a useful first stage of recognition. So if calls are all compressed down to cellular rates, the storage requirements aren't so bad: 1 Kbyte/second, or 3.6 Mbytes/hour. (Compressed speech doesn't store the silence, so this covers both directions.) As a short-term storage mechanism, that's tolerable. Convert it to text, though, and if people speak at an average of, say, 150 words/minute (divided between the two sides) at about six bytes/word, it's only 900 bytes/minute, or 54 Kbytes/hour. No effort at all to store or index! Especially with today's dense arrays of 3-terabyte drives.
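As a sanity check on that arithmetic, here's a quick back-of-the-envelope script using the same figures; the million-call total at the end is purely illustrative, not a number from any disclosure.

    # Figures from the paragraph above: 8 kbit/s compressed audio,
    # 150 words/minute across both speakers, ~6 bytes per word.
    AUDIO_BYTES_PER_SECOND = 8000 / 8      # 1 Kbyte/second
    WORDS_PER_MINUTE = 150
    BYTES_PER_WORD = 6

    def audio_bytes(seconds):
        return seconds * AUDIO_BYTES_PER_SECOND

    def text_bytes(seconds):
        return seconds / 60 * WORDS_PER_MINUTE * BYTES_PER_WORD

    HOUR = 3600
    print(audio_bytes(HOUR) / 1e6)             # 3.6  Mbytes of audio per hour
    print(text_bytes(HOUR) / 1e3)              # 54.0 Kbytes of text per hour

    # Illustrative scale: a million hour-long calls, once reduced to text,
    # come to about 54 GB -- a small corner of a single 3 TB drive.
    print(1_000_000 * text_bytes(HOUR) / 1e9)  # 54.0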
Podzinger has changed its name twice, and I'm not going to mention the new name, since it's a rather different company now. Their main product takes audio and video and does speech-to-text for its owners, so that they can use it for search engine optimization. They no longer run a public search engine, though they sell "enterprise search". It's all small-scale stuff compared to the NSA.
But the technology was revealed. Bulk spoken words can be turned to text and indexed for search. It has been possible for about a decade if not longer. And given the NSA's penchant for "total information awareness", they've probably got a lot of everyone's calls on file.