How to Train an AI within GDPR Limitations

We are living through the data big bang: the volume of data we collectively create each year is now measured in zettabytes. This is good, as data is the raw material for innovation, so long as we can harness, systematize, and analyze it. What makes mastering colossal data streams possible is artificial intelligence (AI). Hardly anything can digest these enormous piles of data and derive meaningful information from them as quickly as AI algorithms.

But what data are algorithms going to analyze? How much? What for? In 2016, the European Union adopted a regulation that answers these questions to some extent. In doing so, it also changed the rules for AI and machine learning (ML) development.

In this article, you’ll learn about:

What we can create by combining data and AI
Why the GDPR affects AI development
The main challenges which arise due to GDPR and machine learning limitations
How to develop GDPR-friendly artificial intelligence

What can we create by combining data and AI?

By leveraging the capabilities of AI and ML, tech companies and research institutions make new materials, discover drugs, detect fraud, protect crops, and so on. In our daily lives, we face AI algorithms, too — from email filters to personalized product offerings and music suggestions to digital assistants.

Another prominent result of AI and data synergy is smart city technology that aims at addressing common urban issues, such as optimizing traffic and parking, managing emergencies, preventing vandalism, and ensuring public safety. In other words, it aims to create social harmony through technology.

An example of smart city technology is the social credit scoring system being implemented in China. The country’s government uses an AI-powered system capable of comparing vast amounts of data with official databases and developing knowledge based on this analysis. Most of the data is gathered from traditional sources such as financial, criminal, and government records, registry offices, and third-party sources such as online credit platforms.

Systems like the Chinese social credit scoring system do bring benefits for citizens and make urban services more efficient. However, such vast access to data raises concerns about privacy, bias, and political interference. The alliance of AI and big data has brought data subjects’ privacy rights and freedoms to the table.

How does the GDPR impact AI and machine learning?

While China was expanding the use of personal data for urban governance, the European Union was passing regulations to limit data use.

The European Union adopted its Data Protection Directive in 1995, long before most people started to share their data online. After years of discussion and preparation, the European Parliament replaced this directive by adopting the General Data Protection Regulation (GDPR) in April 2016. With the GDPR, the EU aimed to harmonize data privacy laws across all its member countries, safeguard data being transferred abroad, and give individuals more control over their personal data. The GDPR applies to data that, either alone or in combination with other data, can identify a person.

The GDPR has been the starting point for data protection laws in countries outside the EU and in individual US states. For example, the California Consumer Privacy Act was signed into law in June 2018, becoming the first comprehensive consumer data privacy law passed by a US state.

The GDPR became enforceable on 25 May 2018, affecting Europe-based companies and all companies processing and holding the personal data of those residing in the EU. The tech industry took issue with the stringent rules, as the regulation touches on two main pillars of artificial intelligence and machine learning.

First, it tightens the rules on personal data, which AI depends on. The GDPR imposes strict obligations on companies that collect and process any personal data. Most AI-based systems require large volumes of information to train and learn from, and personal data is usually part of these training datasets. So the GDPR inevitably affects AI and machine learning development.

Second, the regulation explicitly addresses “automated individual decision-making” and profiling. According to Article 22, a person has the right not to be subject to a decision based solely on automated processing, including profiling, if that decision produces legal effects concerning him or her or similarly significantly affects him or her. Automated individual decision-making here covers decisions an AI makes without any human intervention. Profiling means the automated processing of personal data to evaluate certain things about the data subject. For instance, an AI system might analyze a user’s credit card history to identify the user’s spending patterns.

What challenges arise from GDPR limitations on AI?

The GDPR has six data protection principles at its core. According to a report by the Norwegian Data Protection Authority, artificial intelligence and data protection face four challenges associated with these principles.

Fairness and discrimination

The GDPR fairness principle addresses fair processing of personal data. In other words, data must be processed with respect for the data subject’s interests. The regulation also requires data controllers to take measures to prevent discriminatory effects on individuals. It’s no secret that many AI systems are trained on biased data or that their algorithmic models contain biases of their own. That’s why AI systems can end up discriminating on the basis of race, gender, health, religion, or ideology. To comply with the GDPR, companies have to learn how to detect and mitigate those biases in their AI systems.
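
One simple way to start looking for such bias is to compare how often a model produces a favorable outcome for different groups. The following is a minimal sketch under stated assumptions: the group labels, the sample decisions, and the helper names approval_rates and demographic_parity_gap are all hypothetical, and the demographic parity gap is only one of several fairness measures, not something the GDPR itself prescribes.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return the share of positive decisions per group.

    `decisions` is an iterable of (group, approved) pairs, where
    `approved` is True or False. Group labels are hypothetical.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in decisions:
        totals[group] += 1
        if approved:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(decisions):
    """Largest difference in approval rate between any two groups."""
    rates = approval_rates(decisions)
    return max(rates.values()) - min(rates.values())


# Made-up model outputs for two hypothetical groups, A and B.
sample = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
rates = approval_rates(sample)
gap = demographic_parity_gap(sample)
```

A large gap doesn’t prove discrimination by itself, but it’s a signal that the training data and the model deserve a closer look.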

Purpose limitation

The purpose limitation principle of the GDPR states that a data subject has to be informed about the purpose of data collection and processing. Only then can a person choose whether to consent to the processing. The interesting thing is that AI systems sometimes use information that’s a side product of the original data collection. For instance, an AI application can use social media data to calculate a user’s insurance rate. The GDPR states that data can be processed further if the further purpose is compatible with the original one. If it isn’t, the data controller should obtain additional consent from the data subject. But this principle has a few exceptions.

Further data processing is always compatible with the previous purpose if it’s connected to scientific, historical, or statistical research. Herein lies a problem, since there’s no clear definition of scientific research. This means that in some cases, AI development may be considered such research. The rule of thumb is that when the AI model is static and already deployed, the purpose of its data collection can’t be regarded as research.

Data minimization

The GDPR data minimization principle controls the degree of intervention into a data subject’s privacy. It ensures that the data collected fits the purpose of the project. Collected information should be adequate, relevant, and limited to what is necessary. These requirements encourage developers to think through the application of their AI models. Engineers have to determine what data is necessary for a project and how much of it. Sometimes, this can be a challenge, as it’s not always possible to predict how and what a model will learn from data. To fulfil the data minimization principle, developers should continuously reassess the type and minimum quantity of training data required.
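
In practice, data minimization can begin with something as plain as an allow-list applied before a record ever reaches the training pipeline. Here is a minimal sketch, assuming a hypothetical course recommendation use case; the field names, the ALLOWED_FOR_RECOMMENDATIONS set, and the minimize helper are illustrative and not taken from any real system.

```python
def minimize(record, allowed_fields):
    """Keep only the fields that the stated purpose needs."""
    return {key: value for key, value in record.items() if key in allowed_fields}


# Hypothetical purpose: recommending courses in a learning app.
ALLOWED_FOR_RECOMMENDATIONS = {"completed_courses", "preferred_language"}

# Made-up raw record as it might arrive from a sign-up form.
raw_record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "date_of_birth": "1990-01-01",
    "completed_courses": ["intro-to-sql", "python-basics"],
    "preferred_language": "en",
}

# Only the allow-listed fields are passed on for training.
training_row = minimize(raw_record, ALLOWED_FOR_RECOMMENDATIONS)
```

Keeping the allow-list next to the documented purpose makes it easier to revisit when the model or its purpose changes.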

Transparency and the right to information

The GDPR aims to ensure that individuals have the power to decide which of their information is used by third parties. This means that data controllers have to be open and transparent about their actions. They should provide a detailed description of what they’re doing with personal information to the owners of that information. Unfortunately, with AI systems, this may be hard to do.

That’s because many AI models are essentially black boxes. It’s not always clear how a model arrives at its decisions, which makes it hard to explain its complicated processes to an everyday user. Naturally, when AI is not entirely transparent, the question of liability arises.

Under the GDPR, a data subject is entitled to meaningful information about the logic involved in an automated decision. So data controllers have to figure out ways to explain such decisions.

How to develop GDPR-friendly artificial intelligence

Like it or not, IT companies have to ensure all their processes are compliant with the GDPR. Data processors and data controllers who violate this regulation will have to pay significant fines. Luckily, there are several ways of making AI compliant with the GDPR. Take a look at these GDPR-friendly methods of AI development.

We need to find a way to design and use machine learning algorithms in a way that is compliant with the GDPR, because they will generate value for both service providers and data subjects if done correctly.

Alessandro Guarino, Senior Information Security Professional at StudioAG

GANs (generative adversarial networks). Today, the trend in AI development is to use less data more efficiently rather than to accumulate lots of data. A GAN reduces the need for real training data by generating synthetic data that resembles it. Basically, with this method, we start from a set of real samples and learn to produce new samples that look as if they came from the same set. To achieve this, we need to train two neural networks: a generator and a discriminator.

The generator learns to turn random noise into data, such as an image, that resembles the real samples. The discriminator learns to tell the difference between real data and the data produced by the generator. The problem here is that GANs still require lots of data to be trained properly. So this method doesn’t eliminate the need for training data; it just allows us to reduce the amount of initial data and generate a lot of similar synthetic data. But if we start from a small initial dataset, we risk getting a biased AI model in the end. So GANs don’t solve these issues fully, though they do decrease the need for initial data.

Federated learning is another method of reducing the need for data in AI development. Remarkably, it doesn’t require collecting raw data centrally at all. In federated learning, personal data doesn’t leave the device or system that stores it, and it’s never uploaded to a central server. Instead, a copy of the AI model trains locally on each system with local data, and only the resulting model update is sent back and merged into the master model. But the problem is that a locally trained AI model is limited, since it’s personalized. And even if no data leaves the device, the model is still largely based on personal data, which can make it hard to satisfy the GDPR’s transparency principle.

The AI model is personalized on the user’s phone. All the training data remains on the device and is not uploaded to the cloud.
Source: Google AI Blog
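
The federated training loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not a production setup: it assumes a two-parameter linear model, three hypothetical devices with made-up data, and a merge step that averages client weights in proportion to their sample counts (commonly called federated averaging). The function names local_update, federated_average, and train are invented for this sketch.

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one device's private data.

    `weights` is (slope, intercept) for the model y = slope * x + intercept.
    `data` is a list of (x, y) pairs that is only ever read on the device.
    """
    slope, intercept = weights
    for _ in range(epochs):
        grad_slope = 0.0
        grad_intercept = 0.0
        for x, y in data:
            error = slope * x + intercept - y
            grad_slope += 2 * error * x / len(data)
            grad_intercept += 2 * error / len(data)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept


def federated_average(updates):
    """Merge client weights, weighting each client by its sample count."""
    total = sum(count for _, count in updates)
    slope = sum(w[0] * count for w, count in updates) / total
    intercept = sum(w[1] * count for w, count in updates) / total
    return slope, intercept


def train(clients, rounds=300):
    """Server loop: send out the master model, collect and merge updates."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        # Only weights and sample counts reach the server, never raw data.
        updates = [(local_update(global_weights, data), len(data)) for data in clients]
        global_weights = federated_average(updates)
    return global_weights


# Three hypothetical devices, each holding points on the line y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(1.5, 4.0), (2.5, 6.0)],
]
slope, intercept = train(clients)
```

In this toy setup the master model ends up close to the line that all devices share, even though the server never sees a single data point. Note that the weights sent to the server are still derived from personal data, which is exactly the concern raised above.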

Transfer learning is a method that enables the effective reuse of prior work and leads to the democratization of artificial intelligence. In this case, the AI model doesn’t train from scratch. Instead, developers take an existing model and retrain part of it for the current purpose. Since the new model builds on a pre-existing one, it takes significantly less computing resources and requires less data. But transfer learning works best when the previous model has been trained on a large dataset. Also, the previous model has to be reliable and not contain any biases. So transfer learning can minimize data use but doesn’t exclude the need for data fully.
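
To make the idea concrete, here is a toy sketch of one common form of transfer learning: keeping the earlier layers of an existing model frozen and training only a new output layer on a small dataset. Everything in it is an assumption made for illustration. PRETRAINED_WEIGHTS merely stands in for a model trained elsewhere on a large dataset, and the helper names extract_features, train_head, and predict are hypothetical.

```python
# Frozen weights, standing in for a model pretrained elsewhere on a large dataset.
PRETRAINED_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def extract_features(x):
    """Frozen layer: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in PRETRAINED_WEIGHTS]


def train_head(samples, lr=0.05, epochs=500):
    """Fit only the new output layer; the pretrained weights are never updated."""
    head = [0.0] * len(PRETRAINED_WEIGHTS)
    for _ in range(epochs):
        for x, target in samples:
            feats = extract_features(x)
            error = sum(h * f for h, f in zip(head, feats)) - target
            # Gradient step on the squared error, applied to the head only.
            head = [h - lr * error * f for h, f in zip(head, feats)]
    return head


def predict(head, x):
    """Run an input through the frozen layer and the newly trained head."""
    return sum(h * f for h, f in zip(head, extract_features(x)))


# A deliberately small, made-up dataset for the new task (target = x0 + x1).
small_dataset = [
    ([1.0, 0.0], 1.0),
    ([0.0, 1.0], 1.0),
    ([1.0, 1.0], 2.0),
    ([2.0, 1.0], 3.0),
]
head = train_head(small_dataset)
prediction = predict(head, [3.0, 2.0])
```

Four samples are enough here only because the frozen layer already provides useful features. If the pretrained part were biased or unreliable, the new model would inherit those flaws.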

The explainable AI (XAI) method helps to reduce the black box effect of artificial intelligence. The goal of explainable AI is to assist humans in understanding what’s happening under the hood of an AI system. With this method, an AI model can explain its decisions. It can also characterize its own abilities and give some insights about its future behavior. Explainable AI cannot directly reduce the need for data, but it allows us to understand exactly which data is required to enhance model accuracy. Researchers can then extend the training dataset with the required data only instead of adding a lot of meaningless data.

XAI concept
Source: DARPA
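
One widely used explainability technique that fits this goal is permutation importance: scramble one input feature at a time and measure how much the model’s error grows. The sketch below is a simplified illustration under stated assumptions. The scoring model, its coefficients, the feature names, and the data are all made up, and a fixed rotation of each column replaces the usual random shuffle so that the result is deterministic.

```python
def model(row):
    """Hypothetical credit-scoring model; it ignores the third feature."""
    income, debt, shoe_size = row
    return 0.5 * income - 0.8 * debt


def mean_squared_error(rows, targets):
    """Average squared difference between model output and target."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets):
    """Error increase when one feature column is scrambled across rows."""
    baseline = mean_squared_error(rows, targets)
    importances = []
    for col in range(len(rows[0])):
        column = [r[col] for r in rows]
        rotated = column[1:] + column[:1]  # deterministic stand-in for a shuffle
        scrambled = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, rotated)]
        importances.append(mean_squared_error(scrambled, targets) - baseline)
    return importances


# Made-up rows of (income, debt, shoe_size).
rows = [
    (40.0, 10.0, 42.0),
    (55.0, 5.0, 38.0),
    (30.0, 20.0, 44.0),
    (70.0, 15.0, 40.0),
]
targets = [model(r) for r in rows]  # made-up labels that the model fits exactly
importance = permutation_importance(rows, targets)
```

Here the importance of shoe_size comes out as zero, which suggests the feature could be dropped from data collection altogether. That links explainability back to data minimization.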

The simple truth is that all of these AI training methods we’ve mentioned are somewhat limited. They may comply with one GDPR principle but contradict another. This means that to train AI models properly and achieve great results, you’ll have to combine several methods.

Summing up

The synergy of AI/ML and data can be of tremendous benefit to individuals and society, but organizations using AI systems must address the risks for data privacy rights and freedoms. To do that, AI market players should reconcile their machine learning practices with the GDPR. Tech companies have to revise their data privacy and artificial intelligence policies. Data controllers have to ensure that their AI systems don’t violate the regulation. Luckily, there are several methods of making AI compliant with the GDPR. GANs, XAI, federated learning, and transfer learning, along with techniques such as differential privacy, can help you develop a GDPR-friendly artificial intelligence system.