Apple’s ‘Differential Privacy’ Is About Collecting Your Data---But Not Your Data

At WWDC, Apple name-checked the statistical science of learning as much as possible about a group while learning as little as possible about any individual in it.
Senior vice president of software engineering Craig Federighi. Justin Kaneps for WIRED

Apple, like practically every mega-corporation, wants to know as much as possible about its customers. But it's also marketed itself as Silicon Valley's privacy champion, one that---unlike so many of its advertising-driven competitors---wants to know as little as possible about you. So perhaps it's no surprise that the company has now publicly boasted about its work in an obscure branch of mathematics that deals with exactly that paradox.

At the keynote address of Apple's Worldwide Developers Conference in San Francisco on Monday, the company's senior vice president of software engineering, Craig Federighi, gave his familiar nod to privacy, emphasizing that Apple doesn't assemble user profiles, that it end-to-end encrypts iMessage and FaceTime, and that it tries to keep as much of the computation involving your private information as possible on your personal device rather than on an Apple server. But Federighi also acknowledged the growing reality that collecting user information is crucial to making good software, especially in an age of big-data analysis and machine learning. The answer, he suggested rather cryptically, is "differential privacy."

"We believe you should have great features and great privacy," Federighi told the developer crowd. "Differential privacy is a research topic in the areas of statistics and data analytics that uses hashing, subsampling and noise injection to enable...crowdsourced learning while keeping the data of individual users completely private. Apple has been doing some super-important work in this area to enable differential privacy to be deployed at scale."

Differential privacy, translated from Apple-speak, is the statistical science of trying to learn as much as possible about a group while learning as little as possible about any individual in it. With differential privacy, Apple can collect and store its users’ data in a format that lets it glean useful notions about what people do, say, like and want. But it can't extract anything about a single, specific one of those people that might represent a privacy violation. And neither, in theory, could hackers or intelligence agencies.

"With a large dataset that consists of records of individuals, you might like to run a machine learning algorithm to derive statistical insights from the database as a whole, but you want to prevent some outside observer or attacker from learning anything specific about some [individual] in the data set," says Aaron Roth, a University of Pennsylvania computer science professor whom Apple's Federighi named in his keynote as having "written the book" on differential privacy. (That book, co-written with Microsoft researcher Cynthia Dwork, is the Algorithmic Foundations of Differential Privacy [PDF].) "Differential privacy lets you gain insights from large datasets, but with a mathematical proof that no one can learn about a single individual."

As Roth notes when he refers to a "mathematical proof," differential privacy doesn't merely try to obfuscate or "anonymize" users' data. That anonymization approach, he argues, tends to fail. In 2006, for instance, Netflix released a large collection of its viewers' film ratings as part of a competition to optimize its recommendations, removing people's names and other identifying details and publishing only their Netflix ratings. But researchers soon cross-referenced the Netflix data with public review data on IMDb, matching up similar patterns of ratings between the sites to add names back into Netflix's supposedly anonymous database.

That sort of de-anonymizing trick has countermeasures---say, removing the titles of the Netflix films and keeping only their genre. But there's never a guarantee that some other clever trick or cross-referenced data couldn't undo that obfuscation. "If you start to remove people’s names from data, it doesn’t stop people from doing clever cross-referencing," says Roth. "That’s the kind of thing that's provably prevented by differential privacy."

'It's Future Proof'

Differential privacy, Roth explains, seeks to mathematically prove that a certain form of data analysis can't reveal anything about an individual---that the output of an algorithm is statistically almost indistinguishable whether or not any given person's private data is included in the input. "You might do something more clever than the people before to anonymize your data set, but someone more clever than you might come around tomorrow and de-anonymize it," says Roth. "Differential privacy, because it has a provable guarantee, breaks that loop. It’s future proof."
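
For the mathematically inclined, the guarantee Roth describes has a standard formal statement in Dwork and Roth's textbook. The notation below (the randomized algorithm M, the neighboring datasets D and D′, and the privacy parameter ε) is the conventional textbook notation, not anything Apple has disclosed about its own system:

```latex
% Epsilon-differential privacy (Dwork & Roth): a randomized algorithm M is
% epsilon-differentially private if, for every pair of datasets D and D'
% that differ in the data of a single individual, and for every set S of
% possible outputs,
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S].
\]
% The smaller epsilon is, the closer the two output distributions are, and
% the less an observer can infer about whether any one person's data was
% included at all.
```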

Federighi's emphasis on differential privacy likely means Apple is actually sending more of your data than ever off your device to its servers for analysis, just as Google and Facebook and every other data-hungry tech firm does. But Federighi implies that Apple only transmits that data in a transformed, differentially private form. In fact, he named three of those transformations: hashing, a cryptographic function that irreversibly turns data into a short string of seemingly random characters; subsampling, or taking only a portion of the data; and noise injection, adding random data that obscures the real, sensitive personal information. (As an example of that last method, Microsoft's Dwork points to a survey technique for asking whether respondents have ever, say, broken a law. Before answering, each respondent flips a coin. If it comes up tails, they answer honestly. If it comes up heads, they flip again and answer "yes" for heads or "no" for tails. The random noise from the coin flips can be subtracted from the aggregate results with a bit of algebra, yet no individual "yes" can be held against anyone, since it may simply be the product of two coin flips.)
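
To make that coin-flip survey concrete, here is a minimal Python sketch of the general randomized-response technique Dwork describes; it is not Apple's code, and the 30 percent "true" rate, the population size and the function names are invented for the example. The debiasing step is the "bit of algebra" mentioned above: half of respondents answer at random, and half of those say yes, so the observed yes-rate is 0.5 × (true rate) + 0.25, which is easily inverted.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """One respondent's answer under the coin-flip scheme.

    First flip tails: answer honestly. First flip heads: flip again and
    report "yes" for heads, "no" for tails, regardless of the truth.
    """
    if random.random() < 0.5:        # first flip: tails -> honest answer
        return truthful_answer
    return random.random() < 0.5     # first flip: heads -> second flip decides

def estimate_true_rate(reported):
    """Subtract the coin noise: observed yes-rate = 0.5 * true_rate + 0.25."""
    observed = sum(reported) / len(reported)
    return 2 * (observed - 0.25)

# Hypothetical simulation: 100,000 respondents, 30 percent of whom really
# broke a law. Both numbers are invented for this illustration.
population = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomized_response(answer) for answer in population]
print(f"Estimated rate: {estimate_true_rate(reports):.3f}")  # close to 0.30
# No single report is incriminating: any "yes" may just be two coin flips.
```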

When WIRED asked for more detail on how Apple applies differential privacy, a company representative responded only by referring to the iOS 10 preview guide, which describes how the techniques will be used in the latest version of Apple's mobile operating system:

Starting with iOS 10, Apple is using Differential Privacy technology to help discover the usage patterns of a large number of users without compromising individual privacy. To obscure an individual’s identity, Differential Privacy adds mathematical noise to a small sample of the individual’s usage pattern. As more people share the same pattern, general patterns begin to emerge, which can inform and enhance the user experience. In iOS 10, this technology will help improve QuickType and emoji suggestions, Spotlight deep link suggestions and Lookup Hints in Notes.

Whether Apple is using differential privacy techniques with the rigor necessary to fully protect its customers' privacy, of course, is another question. In his keynote, Federighi said that Apple had given the University of Pennsylvania's Roth a "quick peek" at its implementation of those mathematical techniques. But Roth told WIRED he couldn't comment on anything specific that Apple's doing with differential privacy. Instead, much like the techniques he's helped to study and invent, Roth offered a general takeaway that successfully avoided revealing any details: "I think they’re doing it right."