Why Metadata Matters

In response to the controversy around the NSA’s secret data collection, President Obama said:

…nobody is listening to your telephone calls. That’s not what this program’s about. As was indicated, what the intelligence community is doing is looking at phone numbers and durations of calls. They are not looking at people’s names, and they’re not looking at content. But by sifting through this so-called metadata, they may identify potential leads with respect to folks who might engage in terrorism.

James Clapper, the Director of National Intelligence, wrote in a letter:

The program does not allow the Government to listen in on anyone’s phone calls. The information acquired does not include the content of any communications or the identity of any subscriber. The only type of information acquired under the Court’s order is telephony metadata, such as telephone numbers dialed and length of calls.

“It’s called protecting America,” said Sen. Dianne Feinstein. “As you know, this is just metadata. There is no content involved. In other words, no content of a communication.”

The government’s position is mistaken or disingenuous. “Metadata” is often more revealing than “content.” The distinction usually doesn’t mean much.

I’ve been researching how metadata can reveal sensitive information. Metadata collection and analysis can do good things, like helping epidemiologists track the spread of disease or helping law enforcement catch terrorists. But the privacy implications are bigger than Obama, Clapper, and Feinstein admit. Metadata can uniquely identify you with only a few data points. The FBI used only metadata to track down former CIA Director David Petraeus’s love interest, Paula Broadwell. Metadata knows that you called an HIV clinic, your doctor, and your insurance company within the same hour.

I’ve annotated a collection of links on metadata.

After revelations that Canada was conducting its own secret surveillance program, the Canadian government protested that it was only collecting metadata, not content. Michael Geist responds that technologies exist to mine personal information from metadata and that legal structures have fallen behind:

The problem is that surveillance technologies (including the ability to data mine massive amounts of information) have moved far beyond laws that were crafted for a much different world. The geographic or content limitations placed on surveillance activities by organizations such as CSEC may have been effective years ago when such activities were largely confined to specific locations and the computing power needed to mine metadata was not readily available. That is clearly no longer the case with geography often a distinction without a difference and the value of metadata sometimes greater than the actual content of telephone conversations. If we genuinely believe in preserving some privacy in an environment where every cellphone call is tracked, we must be open to significant legislative reforms and increased oversight that better reflects the realities of modern-day communications surveillance.


The Guardian explains how metadata works. The guide shows how the FBI used metadata to track down the identity of Gen. Petraeus’s paramour. It lets users select the services they use—email, mobile phones, cameras, Facebook, Twitter, search engines, and web browsers—and shows the metadata they reveal to third parties.


Roger Pilon and Richard A. Epstein argued that the NSA surveillance program is analogous to the pen registers that were at issue in Smith v. Maryland, only “on a grander scale.” They argued that the program’s harm to individual privacy is “trivial” compared to the potential harms of terrorist attacks.

Julian Sanchez responds that Smith “has long been widely condemned as mistaken and incoherent by legal scholars” and that, even if we accept Smith’s logic:

…it makes a difference when the acquisition is on a “grand scale.” In a 1983 ruling [United States v. Knotts] that found no Fourth Amendment violation in the short-term tracking of a single automobile, Justice Rehnquist acknowledged in passing that “dragnet-type law enforcement practices” involving mass tracking might require the Court “to determine whether different constitutional principles may be applicable.”

He also cites Justice Alito’s concurrence in United States v. Jones, which raised concerns about the privacy implications of prolonged, fine-grained surveillance.


Paul Ford brings up the AOL search log scandal from 2006, when AOL released massive, anonymized logs of search data, from which the public was able to identify individual users and reveal embarrassing details about their online activity. The NSA can draw similar insights from call metadata, he argues. The NSA’s program is part of a broader trend in data analysis:

We are not just ourselves anymore. Each one of us has a new, statistical self living in databases around the world. It’s those selves, uniquely identified bundles of behavior, that marketers target and companies try to reach. These are remarkable, distributed portraits of what we read, what we eat, and where we sleep. When it comes to our statistical selves, the difference between the NSA and private companies such as Facebook or Google or Amazon.com (AMZN) lies in what the government can do with the data it collects. It’s building that giant index so that, if it needs to, it can actively cross the line between your statistical self and your real, physical self. It’s the difference between “would you like to receive local coupons for businesses you love?” and “why is there a van in front of our house?”

Do we have a choice? Not much of one, not yet. It’s possible but very burdensome to encrypt all of your data and become less snoopable. Americans, according to polls, just don’t care that much about this sort of privacy.


In 2006, Joe Biden warned of the privacy risks of the government’s collection of call records:

I don’t have to listen to your phone calls to know what you’re doing. If I know every single phone call you made, I’m able to determine every single person you talked to. I can get a pattern about your life that is very, very intrusive.


The investigations following the Boston bombing showed how law enforcement mines mobile phone data and social media:

Less than 24 hours after two explosions killed three people and injured dozens more at the April 15 Boston Marathon, the Federal Bureau of Investigation had compiled 10 terabytes of data in hopes of finding needles in haystacks of information that might lead to the suspects.

The tensest part of the ongoing investigation – the death of one suspect and the capture of the second – concluded four days later in part because the FBI-led investigation analyzed mountains of cell phone tower call logs, text messages, social media data, photographs and video surveillance footage to quickly pinpoint the suspects.

Since 2011, [DHS] has monitored public-facing social media networks, blogs and content aggregators. The monitoring has stirred up controversy among privacy advocates because of worries that DHS would be collecting personally identifiable information (PII) on social media users. The agency has stated that they’re not doing data mining for PII, but that they would use such information in exigent circumstances to rescue, say, an earthquake victim tweeting from under a pile of rubble or the victim of a terrorist attack trapped in a hotel. The National Operations Center at DHS “identifies and monitors only information needed to provide situational awareness and establish a common operating picture,” according to an April privacy impact assessment from DHS about the program.


This paper uses a Nokia dataset to show how mobile phone usage data can predict personal details of the user:

In this paper, we describe how we use the mobile phone usage of users to predict their demographic attributes. Using call log, visited GSM cells information, visited Bluetooth devices, visited Wireless LAN devices, accelerometer data, and so on, we predict the gender, age, marital status, job and number of people in household of users. The accuracy of developed classifiers for these classification problems ranges from 45-87% depending upon the particular classification problem.


Researchers analyzed Facebook data to show that a user’s pattern of “likes” can predict many personal traits:

Kosinski and his colleagues conducted their experiment over the course of several years, through their MyPersonality website and Facebook app. More than 8 million people took the MyPersonality survey, which asked participants about their personal details and also had them answer questions about personality traits. About half of the test-takers gave their OK for the researchers to match up their survey results with Facebook likes, on an anonymous basis. More than 58,000 of the volunteered profiles from U.S. respondents were selected for matching.

The results were analyzed to produce correlations in more than a dozen categories, including five widely accepted personality attributes (openness, conscientiousness, extraversion, agreeableness and emotional stability). Those are the attributes analyzed on the “You Are What You Like” website. The other categories included IQ, religion, politics, sexual orientation, age, gender, race, relationship status, alcohol and drug use, tobacco use, life satisfaction, number of friends — and even whether a Facebook user’s parents had separated by the time the user was 21.


This article investigates how companies use customer data to predict personal traits—including the infamous story of Target’s algorithms learning that a woman was pregnant before she had told her family:

About a year after Pole created his pregnancy-prediction model, a man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation.

“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”

The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.

On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”


Robert Siegel interviews J. Kirk Wiebe, former NSA analyst, on how the NSA collects phone records and uses the aggregated metadata:

SIEGEL: This is serious? You can really infer things from big patterns of calls?

WIEBE: Absolutely. There’s a science, if you will, of intelligence called traffic analysis. It’s concerned with communications and looking for patterns in communications that can be associated with meaning that’s useful to get some insight in what someone’s intentions are.

SIEGEL: But if I understand this properly, I mean, what the NSA seems to be doing here is they’re deciding that in order to find the one dangerous needle, they must have access to the entire haystack.

WIEBE: Absolutely.

SIEGEL: And there’s no way that you could do some kind of predetermination of what part of the haystack you should be looking at?

WIEBE: You know…

SIEGEL: You have to have access to everybody’s phone logs or else it doesn’t work.

WIEBE: That’s right.


The NSA has shifted its focus from eavesdropping to metadata analysis:

The agency’s ability to efficiently mine metadata, data about who is calling or e-mailing, has made wiretapping and eavesdropping on communications far less vital, according to data experts. That access to data from companies that Americans depend on daily raises troubling questions about privacy and civil liberties that officials in Washington, insistent on near-total secrecy, have yet to address.

“American laws and American policy view the content of communications as the most private and the most valuable, but that is backwards today,” said Marc Rotenberg, the executive director of the Electronic Privacy Information Center, a Washington group. “The information associated with communications today is often more significant than the communications itself, and the people who do the data mining know that.”


Intelligence agencies increasingly rely on metadata to sift huge amounts of information. In some cases, metadata can be more revealing than the content of the message itself:

Researchers at IDC have determined that the amount of global digital information, like e-mail, Twitter posts and digital photos, has risen from about 500 billion gigabytes in 2008 to almost four trillion gigabytes this year. By 2015, they estimate, there will be eight trillion gigabytes of material to go through, much of it from fast-growing countries with young populations, like China and Indonesia.

For some communications, metadata matters more than content. “A call to a suicide hot line, Alcoholics Anonymous, or a gay sex chat room at 2 a.m. are all more sensitive” than the actual message, said Christopher Soghoian, principal technologist at the American Civil Liberties Union. “You can text political donations. The metadata shows your political leanings, the content just shows the amount you gave. Calling a cell tower away from my house in the middle of the night indicates I’m not sleeping at home.”

“Metadata is the least protected form of communications information, and that is a shame,” he said. “You just have to say it’s important to an ongoing investigation.”


This article shows how companies, researchers, and government agencies draw insights from “deceptively innocuous” metadata:

“When you can get it all in one place and analyze the patterns, you can learn an enormous amount about the behavior of people,” said Daniel J. Weitzner, a principal research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory.

Analysts can gain clues to sleep patterns (when people are asleep, they send no e-mails and make no calls), religion (based on locations of calls made or the absence of communications on the Sabbath) or even social position (based on how often people get calls and e-mails and how quickly they receive responses).

In 2007, researchers at Columbia University were able to identify the senior-most company officers at the bankrupt Enron Corp. by studying individual e-mail volume and average response time in 620,000 company e-mails. The highest-ranking officers got the most e-mail and the quickest responses.

Similarly, federal agents use software and social-network analysis to map out terrorist cells and criminal groups. They look, for instance, at who calls whom most frequently, in a technique known as “link analysis.”
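
The “link analysis” and sleep-pattern inferences described above can be illustrated with a minimal Python sketch. The call records, the names, and the helper functions `strongest_links` and `quiet_hours` are all hypothetical, invented for illustration; this is not a description of any agency’s actual tooling. Note that the records contain no content at all, only who called whom and when.

```python
from collections import Counter

# Hypothetical call records: (caller, callee, hour_of_day). Metadata only, no content.
CALLS = [
    ("alice", "bob", 9), ("alice", "bob", 13), ("bob", "alice", 21),
    ("alice", "carol", 10), ("carol", "dave", 11), ("bob", "dave", 22),
    ("dave", "bob", 23), ("alice", "bob", 8),
]

def strongest_links(calls, top=3):
    """Count calls per unordered pair of people and return the busiest pairs."""
    pairs = Counter(tuple(sorted((a, b))) for a, b, _ in calls)
    return pairs.most_common(top)

def quiet_hours(calls, person):
    """Hours of the day in which `person` never placed a call (a crude sleep proxy)."""
    active = {hour for caller, _, hour in calls if caller == person}
    return [h for h in range(24) if h not in active]

if __name__ == "__main__":
    print(strongest_links(CALLS))
    print(quiet_hours(CALLS, "alice"))
```

Even this toy version surfaces the closest relationship in the data and a rough daily rhythm for one person. Real analyses work on far larger datasets and use more robust statistics.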


Epidemiologists are mining phone metadata to build predictive models of the spread of diseases:

When [Buckee] and her colleagues studied the data, she found that people making calls or sending text messages originating at the Kericho tower were making 16 times more trips away from the area than the regional average. What’s more, they were three times more likely to visit a region northeast of Lake Victoria that records from the health ministry identified as a malaria hot spot. The tower’s signal radius thus covered a significant waypoint for transmission of malaria, which can jump from human to human via mosquitoes. Satellite images revealed the likely culprit: a busy tea plantation that was probably full of migrant workers. The implication was clear, Buckee says. “There will be a ton of infected [people] there.”


As part of a contest to improve its recommendation algorithm, Netflix released an anonymized dataset of the ratings of 500,000 of its members. Researchers demonstrated an algorithm to de-anonymize the records:

We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.

As shown by our experiments below, it is possible to learn sensitive non-public information about a person from his or her movie viewing history. We assert that even if the vast majority of Netflix subscribers did not care about the privacy of their movie ratings (which is not obvious by any means), our analysis would still indicate serious privacy issues with the Netflix Prize dataset.
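
To make the idea concrete, here is a much-simplified Python sketch of matching a little background knowledge against “anonymized” rating records. It is loosely inspired by the approach the researchers describe and is not their algorithm. The records, the movie names, the rating tolerance, and the functions `rarity_weight` and `best_match` are assumptions made for illustration.

```python
import math

# Hypothetical anonymized records: record id -> {movie: rating}.
RECORDS = {
    "r1": {"blockbuster": 4, "obscure_doc": 5, "indie_drama": 2},
    "r2": {"blockbuster": 4, "romcom": 3},
    "r3": {"blockbuster": 5, "indie_drama": 2, "romcom": 1},
}

def rarity_weight(records, movie):
    """Weight a movie more heavily the fewer records contain it."""
    support = sum(1 for ratings in records.values() if movie in ratings)
    return 1.0 / math.log(1 + support) if support else 0.0

def best_match(records, known, tolerance=1):
    """Score each record against the adversary's approximate knowledge.

    Returns the best-scoring record id and its margin over the runner-up.
    A large margin suggests the match is not a coincidence.
    """
    scores = {}
    for rid, ratings in records.items():
        scores[rid] = sum(
            rarity_weight(records, movie)
            for movie, rating in known.items()
            if movie in ratings and abs(ratings[movie] - rating) <= tolerance
        )
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[0][0], ranked[0][1] - ranked[1][1]

if __name__ == "__main__":
    # The adversary roughly remembers two ratings the target posted publicly.
    print(best_match(RECORDS, {"blockbuster": 4, "obscure_doc": 4}))
```

The popular title barely narrows things down, while the single obscure title singles out one record, which is why a small amount of outside knowledge can go a long way.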


The Times editorial board responds to the NSA’s secret surveillance program:

The surreptitious collection of “metadata” — every bit of information about every phone call except the word-by-word content of conversations — fundamentally alters the relationship between individuals and their government.

Tracking whom Americans are calling, for how long they speak, and from where, can reveal deeply personal information about an individual. Using such data, the government can discover intimate details about a person’s lifestyle and beliefs — political leanings and associations, medical issues, sexual orientation, habits of religious worship, and even marital infidelities. Daniel Solove, a professor at George Washington University Law School and a privacy expert, likens this program to a Seurat painting. A single dot may seem like no big deal, but many together create a nuanced portrait.


German politician Malte Spitz sued Deutsche Telekom to make it turn over six months of his own cell phone data. He then gave the data to Die Zeit, which compiled a detailed interactive map of Spitz’s location and communications:

[Image: Spitz map]


Researchers show that four data points about a person’s location are enough to uniquely identify 95% of the individuals in a large mobility dataset:

We study fifteen months of human mobility data for one and a half million individuals and find that human mobility traces are highly unique. In fact, in a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier’s antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals. We coarsen the data spatially and temporally to find a formula for the uniqueness of human mobility traces given their resolution and the available outside information. This formula shows that the uniqueness of mobility traces decays approximately as the 1/10 power of their resolution. Hence, even coarse datasets provide little anonymity. These findings represent fundamental constraints to an individual’s privacy and have important implications for the design of frameworks and institutions dedicated to protect the privacy of individuals.
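
The notion of “uniqueness” in the abstract above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the study’s method: the three traces are invented, and instead of sampling random points as a large-scale study would, `fraction_unique` exhaustively checks every subset of `p` points. Both function names are hypothetical.

```python
from itertools import combinations

# Hypothetical traces: user -> set of (hour, antenna_id) observations.
TRACES = {
    "u1": {(8, "A"), (9, "B"), (18, "A"), (22, "C")},
    "u2": {(8, "A"), (9, "B"), (18, "D"), (22, "C")},
    "u3": {(8, "E"), (9, "B"), (18, "A"), (22, "F")},
}

def matching_users(traces, known_points):
    """Users whose trace contains every known spatio-temporal point."""
    known = set(known_points)
    return [user for user, trace in traces.items() if known <= trace]

def fraction_unique(traces, p):
    """Share of p-point samples that single out exactly the user they came from."""
    total = unique = 0
    for user, trace in traces.items():
        for subset in combinations(sorted(trace), p):
            total += 1
            if matching_users(traces, subset) == [user]:
                unique += 1
    return unique / total

if __name__ == "__main__":
    for p in (1, 2, 3, 4):
        print(p, fraction_unique(TRACES, p))
```

In this toy data a single point rarely identifies anyone, but the share of identifying samples climbs quickly as more points are known, which mirrors the qualitative finding quoted above.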

The authors of the study published an editorial in response to the NSA surveillance leak:

Many players in government characterize the NSA’s use of metadata as more or less benign. The agency gathers the phone records, detects worrisome patterns that might threaten America’s security, and only then asks for a search warrant to dig into the communications content of certain individuals.

But metadata is more powerful than most people realize. For instance, something as simple as recording Facebook “likes” and website clicks can reveal a person’s religious and political views, economic standing, sexual preference, personality, mental health, ethnicity, use of addictive substances, and more. The ability to characterize groups by these traits might tempt some in the government to cross the line from finding terrorists to targeting groups because of their political leanings.

Because of the scale and connectedness of data collection and the inability of today’s institutions to squarely face the privacy issues involved, we strongly back a new approach to data privacy that we’re working on here at MIT’s Media Lab. It puts individuals in control of their personal data, allowing them to determine who can possess their data, how it can be shared, redistributed, and disposed of.

Each citizen would have a personal data store, like an email inbox, that would let them see where data about them goes and how it is being used. The NSA could still get a court order allowing it to use a person’s metadata to track terrorists, but at least an individual could see that something is happening – rather like seeing a police cruiser patrolling the neighborhood. The big difference from now is that individuals could see which companies or government agencies were using data about them, and control these groups’ access to that data.


Sociologist Kieran Healy analyzes a dataset of 254 18th-century American colonists’ names and their group affiliations. His goal is to find “persons of interest.” His analysis of the metadata leads him to Paul Revere:

Once again, I remind you that I know nothing of Mr Revere, or his conversations, or his habits or beliefs, his writings (if he has any) or his personal life. All I know is this bit of metadata, based on membership in some organizations. And yet my analytical engine, on the basis of absolutely the most elementary of operations in Social Networke Analysis, seems to have picked him out of our 254 names as being of unusual interest.

Healy creates a social network map of the colonists:

From a table of membership in different groups we have gotten a picture of a kind of social network between individuals, a sense of the degree of connection between organizations, and some strong hints of who the key players are in this world. And all this—all of it!—from the merest sliver of metadata about a single modality of relationship between people.

In a follow-up post, Healy talks more about his methodology and about whether this data really is “metadata.”

The dataset and R code are on GitHub. Healy also links to a paper in which the author does a deeper social network analysis of the same dataset.
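
Healy’s own code is in R. As a rough Python illustration of the underlying idea, turning a table of group memberships into person-to-person ties, here is a minimal sketch. The people, groups, and memberships are invented toy data, not Healy’s dataset, and the simple tie count in `rank_by_ties` is a stand-in for the centrality measures a real social network analysis would use.

```python
# Toy membership table (hypothetical people and groups, not Healy's data).
MEMBERSHIP = {
    "p1": {"lodge", "caucus", "club", "committee"},
    "p2": {"lodge"},
    "p3": {"caucus", "committee"},
    "p4": {"club", "committee"},
    "p5": {"committee"},
}

def co_membership(membership):
    """Person-by-person counts of shared groups.

    Equivalent to the off-diagonal entries of the membership matrix
    multiplied by its own transpose.
    """
    people = sorted(membership)
    return {
        a: {b: len(membership[a] & membership[b]) for b in people if b != a}
        for a in people
    }

def rank_by_ties(membership):
    """Rank people by how many others they share at least one group with."""
    shared = co_membership(membership)
    scores = {a: sum(1 for n in row.values() if n > 0) for a, row in shared.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    print(co_membership(MEMBERSHIP)["p1"])
    print(rank_by_ties(MEMBERSHIP))
```

Nothing here looks at what anyone said or wrote. The person who bridges the most groups rises to the top from membership lists alone.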


This article quotes Gary Pruitt, president of the Associated Press, on the harms of aggregated metadata:

These records potentially reveal communications with confidential sources across all of the newsgathering activities undertaken by the AP during a two-month period, provide a road map to AP’s newsgathering operations, and disclose information about AP’s activities and operations that the government has no conceivable right to know.

It also quotes the court in United States v. Maynard on the mosaic theory:

A person who knows all of another’s travels can deduce whether he is a weekly church goer, a heavy drinker, a regular at the gym, an unfaithful husband, an outpatient receiving medical treatment, an associate of particular individuals or political groups and not just one such fact about a person, but all such facts.

Finally, it looks at how the NSA uses call records:

The primary purpose of large-scale databases such as the NSA’s call records is generally said to be data-mining: rather than examining individuals, algorithms are used to find patterns of unusual activity that may mark terrorism or criminal conspiracies.

However, collection and storage of this information gives government a power it’s previously lacked: easy and retroactive surveillance.

If authorities become interested in an individual at a later stage, and obtain their number, officials can look back through the data and gather their movements, social network, and more – possibly for several years (although the secret court order only allows for three months of data collection).

In essence, you’re being watched; the government just doesn’t know your name while it’s doing it.


A hacker at Co.Labs experiments with spying on himself:

He pulled 20 random phone numbers from his call history and marked whether they belonged to a man or a woman. Then he used all the calls from those 20 numbers as his test samples, including the time and duration of call. Google’s Prediction API gave his model a 67 percent confidence level in predicting the gender of a caller after training with those 861 test examples. Though by scientific terms, that’s not particularly accurate, Stein “found it surprisingly good at determining a caller’s gender.”


Jane Mayer interviewed Susan Landau about the dangers of metadata, as well as its usefulness for law enforcement:

“The public doesn’t understand,” she told me, speaking about so-called metadata. “It’s much more intrusive than content.” She explained that the government can learn immense amounts of proprietary information by studying “who you call, and who they call. If you can track that, you know exactly what is happening—you don’t need the content.”

For example, she said, in the world of business, a pattern of phone calls from key executives can reveal impending corporate takeovers. Personal phone calls can also reveal sensitive medical information: “You can see a call to a gynecologist, and then a call to an oncologist, and then a call to close family members.” And information from cell-phone towers can reveal the caller’s location. Metadata, she pointed out, can be so revelatory about whom reporters talk to in order to get sensitive stories that it can make more traditional tools in leak investigations, like search warrants and subpoenas, look quaint. “You can see the sources,” she said. When the F.B.I. obtains such records from news agencies, the Attorney General is required to sign off on each invasion of privacy. When the N.S.A. sweeps up millions of records a minute, it’s unclear if any such brakes are applied.

Metadata, Landau noted, can also reveal sensitive political information, showing, for instance, if opposition leaders are meeting, who is involved, where they gather, and for how long. Such data can reveal, too, who is romantically involved with whom, by tracking the locations of cell phones at night.

For the law-enforcement community, particularly the parts focussed on locating terrorists, metadata has led to breakthroughs. Khalid Sheikh Mohammed, the master planner of the September 11, 2001, attacks on New York and Washington, “got picked up by his cell phone,” Landau said. Many other criminal suspects have given themselves away through their metadata trails. In fact, Landau told me, metadata and other new surveillance tools have helped cut the average amount of time it takes the U.S. Marshals to capture a fugitive from forty-two days to two.

But with each technological breakthrough comes a break-in to realms previously thought private. “It’s really valuable for law enforcement, but we have to update the wiretap laws,” Landau said.


Cryptocat, an encrypted chat provider, explains why it doesn’t store metadata from its users’ conversations:

If you’ve been following the news at all for the past week, you’d have heard of the outrageous reports of Internet surveillance on behalf of the NSA. While those reports suggest that the NSA may not have complete access to content, they still allow the agency access to metadata. If we were talking about phone surveillance, for example, metadata would be the time you made calls, which numbers you called, how long your calls have lasted, and even where you placed your calls from. This circumstantial data can be collected en masse to paint very clear surveillance pictures about individuals or groups of individuals.


The EFF gives examples of meaningful phone metadata:

  • They know you rang a phone sex service at 2:24 am and spoke for 18 minutes. But they don’t know what you talked about.
  • They know you called the suicide prevention hotline from the Golden Gate Bridge. But the topic of the call remains a secret.
  • They know you spoke with an HIV testing service, then your doctor, then your health insurance company in the same hour. But they don’t know what was discussed.
  • They know you received a call from the local NRA office while it was having a campaign against gun legislation, and then called your senators and congressional representatives immediately after. But the content of those calls remains safe from government intrusion.
  • They know you called a gynecologist, spoke for a half hour, and then called the local Planned Parenthood’s number later that day. But nobody knows what you spoke about.
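
The pattern in these examples, several revealing calls inside a short window, is easy to detect mechanically. The following Python sketch assumes a hypothetical directory that maps phone numbers to the kind of organization behind them; the numbers, timestamps, and the function `labelled_bursts` are all invented for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical reverse directory: number -> type of organization.
DIRECTORY = {
    "555-0101": "HIV testing service",
    "555-0102": "doctor",
    "555-0103": "health insurer",
}

# Hypothetical call log for one subscriber: (timestamp, number dialed).
CALLS = [
    (datetime(2013, 6, 10, 14, 5), "555-0101"),
    (datetime(2013, 6, 10, 14, 20), "555-0102"),
    (datetime(2013, 6, 10, 14, 50), "555-0103"),
    (datetime(2013, 6, 11, 9, 0), "555-0199"),
]

def labelled_bursts(calls, directory, window=timedelta(hours=1), min_calls=3):
    """Find runs of calls to known organizations that fall inside one time window."""
    labelled = sorted((t, directory[n]) for t, n in calls if n in directory)
    bursts = []
    for i, (start, _) in enumerate(labelled):
        group = [label for t, label in labelled[i:] if t - start <= window]
        if len(group) >= min_calls:
            bursts.append((start, group))
    return bursts

if __name__ == "__main__":
    for start, labels in labelled_bursts(CALLS, DIRECTORY):
        print(start, " -> ".join(labels))
```

The output is a sequence of organization types in the order they were called. No call content is needed to guess what the conversations were about.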

