Like any doctor, Jacques Fellay wants to give his patients the best care possible. But his instrument of choice is no scalpel or stethoscope; it is far more powerful than that. Hidden inside each of us are genetic markers that can tell doctors like Fellay which individuals are susceptible to diseases such as AIDS, hepatitis and more. If he can learn to read these clues, Fellay will have advance warning of who requires early treatment.
This could be life-saving. The trouble is, teasing out the relationships between genetic markers and diseases requires an awful lot of data, more than any one hospital has on its own. You might think hospitals could pool their information, but it isn’t so simple. Genetic data contains all sorts of sensitive details about people that could lead to embarrassment, discrimination or worse. Ethical worries of this sort are a serious roadblock for Fellay, who is based at Lausanne University Hospital in Switzerland. “We have the technology, we have the ideas,” he says. “But putting together a large enough data set is more often than not the limiting factor.”
Fellay’s concerns are a microcosm of one of the world’s biggest technological problems. The inability to safely share data hampers progress in all kinds of other spheres too, from detecting financial crime to responding to disasters and governing nations effectively. Now, a new kind of encryption is making it possible to wring the juice out of data without anyone ever actually seeing it. This could help end big data’s big privacy problem – and Fellay’s patients could be some of the first to benefit.
It was more than 15 years ago that we first heard that “data is the new oil”, a phrase coined by the British mathematician and marketing expert Clive Humby. Today, we are used to the idea that personal data is valuable. Companies like Meta, which owns Facebook, and Google’s owner Alphabet grew into multibillion-dollar behemoths by collecting information about us and using it to sell targeted advertising.
Data could do good for all of us too. Fellay’s work is one example of how medical data might be used to make us healthier. Plus, Meta shares anonymised user data with aid organisations to help plan responses to floods and wildfires, in a project called Disaster Maps. And in the US, around 1400 colleges analyse academic records to spot students who are likely to drop out and provide them with extra support. These are just a few examples out of many – data is a currency that helps make the modern world go around.
Getting such insights often means publishing or sharing the data. That way, more people can look at it and conduct analyses, potentially drawing out unforeseen conclusions. Those who collect the data often don’t have the skills or advanced AI tools to make the best use of it, either, so it pays to share it with firms or organisations that do. Even if no outside analysis is happening, the data has to be kept somewhere, which often means on a cloud storage server, owned by an external company.
You can’t share raw data unthinkingly. It will typically contain sensitive personal details, anything from names and addresses to voting records and medical information. There is an obligation to keep this information private, not just because it is the right thing to do, but because of stringent privacy laws, such as the European Union’s General Data Protection Regulation (GDPR). Breaches can incur big fines.
Over the past few decades, we have come up with ways of trying to preserve people’s privacy while sharing data. The traditional approach is to remove information that could identify someone or make these details less precise, says privacy expert Yves-Alexandre de Montjoye at Imperial College London. You might replace dates of birth with an age bracket, for example. But that is no longer enough. “It was OK in the 90s, but it doesn’t really work any more,” says de Montjoye. There is an enormous amount of information available about people online, so even seemingly insignificant nuggets can be cross-referenced with public information to identify individuals.
One significant case of reidentification from 2021 involves apparently anonymised data sold to a data broker by the dating app Grindr, which is used by gay people among others. A media outlet called The Pillar obtained it and correlated the location pings of a particular mobile phone represented in the data with the known movements of a high-ranking US priest, showing that the phone popped up regularly near his home and at the locations of multiple meetings he had attended. The implication was that this priest had used Grindr, and a scandal ensued because Catholic priests are required to abstain from sexual relationships and the church considers homosexual activity a sin.
A more sophisticated way of maintaining people’s privacy has emerged recently, called differential privacy. In this approach, the manager of a database never shares the whole thing. Instead, they allow people to ask questions about the statistical properties of the data – for example, “what proportion of people have cancer?” – and provide answers. Yet if enough clever questions are asked, this can still lead to private details being triangulated. So the database manager also uses statistical techniques to inject errors into the answers, for example recording the wrong cancer status for some people when totting up totals. Done carefully, this doesn’t affect the statistical validity of the data, but it does make it much harder to identify individuals. The US Census Bureau adopted this method when the time came to release statistics based on its 2020 census.
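The error-injection idea can be sketched with randomized response, one simple technique from the differential privacy family. This is an illustration only, not the method used by the US Census Bureau or any particular database. The data, the 10 per cent disease rate and the 75 per cent truth probability are all made-up assumptions. Each person's status is deliberately flipped some of the time, yet because the flipping rate is known, the overall proportion can still be estimated.

```python
import random

def randomized_response(true_bits, p_truth, rng):
    """Report each bit truthfully with probability p_truth, otherwise flip it."""
    return [b if rng.random() < p_truth else 1 - b for b in true_bits]

def estimate_proportion(noisy_bits, p_truth):
    """Correct for the known flipping rate to estimate the true proportion."""
    observed = sum(noisy_bits) / len(noisy_bits)
    return (observed - (1 - p_truth)) / (2 * p_truth - 1)

rng = random.Random(0)

# Hypothetical records: 100,000 people, roughly 10% of whom have the condition.
records = [1 if rng.random() < 0.10 else 0 for _ in range(100_000)]

# What the database manager works with: about a quarter of entries are wrong.
noisy = randomized_response(records, p_truth=0.75, rng=rng)

true_rate = sum(records) / len(records)
estimated_rate = estimate_proportion(noisy, p_truth=0.75)
print(f"true rate {true_rate:.3f}, estimate from noisy data {estimated_rate:.3f}")
```

No single noisy entry can be trusted, which protects the individual, while the aggregate estimate stays close to the truth. Lowering `p_truth` strengthens privacy but makes the estimate noisier.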
Trust no one
Still, differential privacy has its limits. It only provides statistical patterns and can’t flag up specific records – for instance to highlight someone at risk of disease, as Fellay would like to do. And while the idea is “beautiful”, says de Montjoye, getting it to work in practice is hard.
There is a completely different and more extreme solution, however, one with origins going back 40 years. What if you could encrypt and share data in such a way that others could analyse it and perform calculations on it, but never actually see it? It would be a bit like placing a precious gemstone in a glovebox, one of the sealed chambers used in labs for handling hazardous material. You could invite people to put their arms into the gloves and handle the gem. But they wouldn’t have free access and could never steal anything.
This was the thought that occurred to Ronald Rivest, Len Adleman and Michael Dertouzos at the Massachusetts Institute of Technology in 1978. They devised a theoretical way of making the equivalent of a secure glovebox to protect data. It rested on a mathematical idea called a homomorphism, which refers to the ability to map data from one form to another without changing its underlying structure. Much of this hinges on using algebra to represent the same numbers in different ways.
Imagine you want to share a database with an AI analytics company, but it contains private information. The AI firm won’t give you the algorithm it uses to analyse data because it is commercially sensitive. So, to get around this, you homomorphically encrypt the data and send it to the company. It has no key to decrypt the data. But the firm can analyse the data and get a result, which itself is encrypted. Although the firm has no idea what it means, it can send it back to you. Crucially, you can now simply decrypt the result and it will make total sense.
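The round trip described above can be sketched with a toy version of the Paillier cryptosystem, which supports only addition on encrypted numbers. It is a stand-in for illustration, not a fully homomorphic scheme and not the method of any firm mentioned here. The primes are tiny and insecure, and the readings and function names are hypothetical.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Build a toy key pair from two small primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)

def encrypt(n, message, rng):
    """Encrypt an integer 0 <= message < n using only the public value n."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq

def add_encrypted(n, c1, c2):
    """Multiplying two ciphertexts adds the numbers hidden inside them."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n

rng = random.Random(7)

# Data owner: generate keys and encrypt hypothetical private readings.
n, private_key = keygen()
readings = [120, 135, 150]
encrypted = [encrypt(n, m, rng) for m in readings]

# Analytics firm: sees only ciphertexts and n, never the private key.
encrypted_total = encrypted[0]
for c in encrypted[1:]:
    encrypted_total = add_encrypted(n, encrypted_total, c)

# Data owner: decrypt the result that was sent back.
total = decrypt(n, private_key, encrypted_total)
print(total)  # 405
```

The firm's work happens entirely in the middle step, where it handles numbers that are meaningless to it. Only the key holder can turn the result back into the sum.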
“The promise is massive,” says Tom Rondeau at the US Defense Advanced Research Projects Agency (DARPA), which is one of many organisations investigating the technology. “It’s almost hard to put a bound to what we can do if we have this kind of technology.”
In the 30 years since the method was proposed, researchers devised homomorphic encryption schemes that allowed them to carry out a restricted set of operations, for instance only additions or multiplications. Yet fully homomorphic encryption, or FHE, which would let you run any program on the encrypted data, remained elusive. “FHE was what we thought of as being the holy grail in those days,” says Marten van Dijk at CWI, the national research institute for mathematics and computer science in the Netherlands. “It was kind of unimaginable.”
One approach to homomorphic encryption at the time involved an idea called lattice cryptography. This encrypts ordinary numbers by mapping them onto a grid with many more dimensions than the standard two. It worked – but only up to a point. Each computation ended up adding randomness to the data. As a result, doing anything more than a simple computation led to so much randomness building up that the answer became unreadable.
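The build-up of randomness can be seen in a much simpler toy than a real lattice scheme. The sketch below hides a single bit inside a large integer, in the style of integer-based schemes from the research literature. It is not lattice cryptography and not secure, and every parameter is an assumption chosen for illustration. Each ciphertext carries a small random "noise" term, and multiplying ciphertexts multiplies their noise together.

```python
import random

P = (1 << 61) - 1  # secret key: a large odd integer (toy size, assumed)

def encrypt(bit, rng):
    """Hide a bit as P*q + 2*r + bit, where 2*r + bit is the small noise."""
    q = rng.randrange(1 << 70, 1 << 71)
    r = rng.randrange(1 << 7, 1 << 8)
    return P * q + 2 * r + bit

def noise(ciphertext):
    """Centred remainder modulo the secret key; equals the true noise while it is below P/2."""
    n = ciphertext % P
    return n - P if n > P // 2 else n

def decrypt(ciphertext):
    """The hidden bit is the parity of the noise, provided the noise has not overflowed."""
    return noise(ciphertext) % 2

rng = random.Random(1)
c = encrypt(1, rng)
history = []
for factors in range(2, 10):
    # Multiplying ciphertexts multiplies the hidden bits (1 * 1 = 1) and the noise.
    c = c * encrypt(1, rng)
    history.append((factors, abs(noise(c)).bit_length(), decrypt(c)))

for factors, noise_bits, decrypted in history:
    print(f"{factors} ciphertexts multiplied: noise ~{noise_bits} bits, decrypts to {decrypted}")
```

With these parameters, decryption is guaranteed correct only while the noise stays below half the key, which holds for products of up to six ciphertexts. Beyond that the noise wraps around the key and the decrypted bit is no longer reliable.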
In 2009, Craig Gentry, then a PhD student at Stanford University in California, made a breakthrough. His brilliant solution was to periodically remove this randomness by decrypting the data under a secondary covering of encryption. If that sounds paradoxical, imagine that glovebox with the gem inside. Gentry’s scheme was like putting one glovebox inside another, so that the first one could be opened while still encased in a layer of security. This provided a workable FHE scheme for the first time.
Workable, but still slow: computations on the FHE-encrypted data could take millions of times longer than identical ones on raw data. Gentry went on to work at IBM, and over the next decade, he and others toiled to make the process quicker by improving the underlying mathematics. But lately the focus has shifted, says Michael Osborne at IBM Research in Zurich, Switzerland. There is a growing realisation that massive speed enhancements can be achieved by optimising the way cryptography is applied for specific uses. “We’re getting orders of magnitudes improvements,” says Osborne.
IBM now has a suite of FHE tools that can run AI and other analyses on encrypted data. Its researchers have shown they can detect fraudulent transactions in encrypted credit card data using an artificial neural network that can crunch 4000 records per second. They also demonstrated that they could use the same kind of analysis to scour the encrypted CT scans of more than 1500 people’s lungs to detect signs of covid-19 infection.
Also in the works are real-world, proof-of-concept projects with a variety of customers. In 2020, IBM revealed the results of a pilot study conducted with the Brazilian bank Banco Bradesco. Privacy concerns and regulations often prevent banks from sharing sensitive data either internally or externally. But in the study, IBM showed it could use machine learning to analyse encrypted financial transactions from the bank’s customers to predict if they were likely to take out a loan. The system was able to make predictions for more than 16,500 customers in 10 seconds and it performed just as accurately as the same analysis performed on unencrypted data.
Other companies are keen on this extreme form of encryption too. Computer scientist Shafi Goldwasser, a co-founder of privacy technology start-up Duality, says the firm is achieving significantly faster speeds by helping customers better structure their data and tailoring tools to their problems. Duality’s encryption tech has already been integrated into the software systems that technology giant Oracle uses to detect financial crimes, where it is assisting banks in sharing data to detect suspicious activity.
Still, for most applications, FHE processing remains at least 100,000 times slower than processing unencrypted data, says Rondeau. This is why, in 2020, DARPA launched a programme called Data Protection in Virtual Environments to create specialised chips designed to run FHE. Lattice-encrypted data comes in much larger chunks than normal chips are used to dealing with. So several research teams involved in the project, including one led by Duality, are investigating ways to alter circuits to efficiently process, store and move this kind of data. The goal is to analyse any FHE-encrypted data just 10 times slower than usual, says Rondeau, who is managing the programme.
Even if it were lightning fast, FHE wouldn’t be flawless. Van Dijk says it doesn’t work well with certain kinds of program, such as those that contain branching logic made up of “if this, do that” operations. Meanwhile, information security researcher Martin Albrecht at Royal Holloway, University of London, points out that the justification for FHE is based on the need to share data so it can be analysed. But a lot of routine data analysis isn’t that complicated – doing it yourself might sometimes be simpler than getting to grips with FHE.
For his part, de Montjoye is a proponent of privacy engineering: not relying on one technology to protect people’s data, but combining several approaches in a defensive package. FHE is a great addition to that toolbox, he reckons, but not a standalone winner.
This is exactly the approach that Fellay and his colleagues have taken to smooth the sharing of medical data. Fellay worked with computer scientists at the Swiss Federal Institute of Technology in Lausanne who created a scheme combining FHE with another privacy-preserving tactic called secure multiparty computation (SMC). This sees the different organisations join up chunks of their data in such a way that none of the private details from any organisation can be retrieved.
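One common building block of secure multiparty computation is additive secret sharing, sketched below. It is a generic illustration rather than the Lausanne team's actual protocol, and the hospital names and patient counts are invented. Each hospital splits its private number into random-looking shares, and only sums of shares are ever published.

```python
import random

MODULUS = 2**61 - 1  # all arithmetic is done modulo this large number (assumed)

def make_shares(secret, n_parties, rng):
    """Split a secret into n random shares that add up to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

rng = random.Random(42)

# Hypothetical private inputs: patients with a given marker at each hospital.
private_counts = {"hospital_a": 120, "hospital_b": 75, "hospital_c": 210}
n = len(private_counts)

# Each hospital splits its count and sends one share to every participant.
all_shares = [make_shares(count, n, rng) for count in private_counts.values()]

# Participant j adds up the shares it received and publishes only that sum.
partial_sums = [sum(row[j] for row in all_shares) % MODULUS for j in range(n)]

# Combining the published sums reveals the total and nothing else.
total = sum(partial_sums) % MODULUS
print(total)  # 405
```

Any single share, or any single published partial sum, looks like a random number, so no participant learns another hospital's count. Only the combined total is revealed.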
In a paper published in October 2021, the team used a combination of FHE and SMC to securely pool data from multiple sources and use it to predict the efficacy of cancer treatments or identify specific variations in people’s genomes that predict the progression of HIV infection. The trial was so successful that the team has now deployed the technology to allow Switzerland’s five university hospitals to share patient data, both for medical research and to help doctors personalise treatments. “We’re implementing it in real life,” says Fellay, “making the data of the Swiss hospitals shareable to answer any research question as long as the data exists.”
If data is the new oil, then it seems the world’s thirst for it isn’t letting up. FHE could be akin to a new mining technology, one that will open up some of the most valuable but currently inaccessible deposits. Its slow speed may be a stumbling block. But, as Goldwasser says, comparing the technology with completely unencrypted processing makes no sense. “If you believe that security is not a plus, but it’s a must,” she says, “then in some sense there is no overhead.”