AI Privacy

Many privacy regulations, including GDPR, mandate that organizations abide by certain privacy principles, such as data minimization and privacy by design.
Data minimization requires that only the data necessary to fulfill a specific purpose be collected. Privacy by design means that service providers are expected to design their systems to maintain privacy from the outset. In some cases, providers must also perform a privacy impact assessment before releasing a new service.

It’s often difficult, however, to comply with such privacy regulations when using AI techniques and approaches. Advanced machine learning algorithms, such as deep neural networks, tend to consume large amounts of data to generate predictions or classifications. These algorithms often result in a “black box” model, where it is difficult to determine exactly which data influenced a given decision and how.

We are currently researching new techniques to enable AI-based solutions to adhere to such privacy requirements. These techniques include:

  1. Data minimization for machine learning models – helps to reduce the amount and granularity of features used by machine learning algorithms to perform classification or prediction, by either removal (suppression) or generalization. This process is tailored to the machine learning model at hand, thus reducing the negative effect on model accuracy. Once the minimized feature set is determined, any newly collected data for analysis can be minimized before applying the model to it. This method is focused on preserving the privacy of individuals for whom predictions will be made by the model, i.e., on runtime data.
  2. Machine learning model anonymization – creates a model-based, tailored anonymization scheme to anonymize training data before using it to train an ML model. Using knowledge encoded within a model allows us to derive an anonymization that minimizes the effect on the model’s accuracy. Once the new model is trained on the anonymized dataset, it can be used, shared, and published freely. The focus is on enabling enterprises to share the analytics model while protecting the individuals whose data was used to train it, thus adhering to privacy regulations.
  3. Privacy risk assessment for machine learning models – enables comparing and choosing between different ML models based not only on accuracy but also on privacy risk. We are studying ways to assess and quantify the privacy risk of these models, and to reduce that risk by steering their development toward models that rely less on sensitive data. Several risk factors can be taken into account, such as privacy risks to the training data (e.g., membership inference and attribute inference attacks) and privacy risks to the general population (regardless of participation in the training set). The risk level also depends on the sensitivity of the features used to train the model. Our approach will take all of these factors into account and suggest pathways to mitigate these risks.
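To make the third item more concrete, the sketch below estimates membership inference risk with a simple confidence-threshold attack and then uses that estimate, together with accuracy, to choose between two models. It is an illustration under stated assumptions, not the assessment method under research: the function names `membership_advantage` and `pick_model`, the confidence values, and the 0.02 accuracy tolerance are all hypothetical.

```python
from typing import Dict, List, Sequence


def membership_advantage(member_conf: Sequence[float],
                         nonmember_conf: Sequence[float]) -> float:
    """Best (true positive rate - false positive rate) of a threshold attack.

    The attacker guesses "was in the training set" whenever the model's
    confidence on a record is at least the threshold. A value near 0 means
    the attack does no better than chance; a value near 1 means training
    records are easy to tell apart from unseen ones.
    """
    best = 0.0
    for t in sorted(set(member_conf) | set(nonmember_conf)):
        tpr = sum(c >= t for c in member_conf) / len(member_conf)
        fpr = sum(c >= t for c in nonmember_conf) / len(nonmember_conf)
        best = max(best, tpr - fpr)
    return best


def pick_model(candidates: List[Dict], max_accuracy_drop: float = 0.02) -> Dict:
    """Among models close to the best accuracy, pick the lowest-risk one."""
    best_acc = max(c["accuracy"] for c in candidates)
    eligible = [c for c in candidates
                if best_acc - c["accuracy"] <= max_accuracy_drop]
    return min(eligible, key=lambda c: c["advantage"])


# Made-up confidences on the true label for training (member) records
# and held-out (non-member) records, for two hypothetical models.
model_a = {
    "name": "A",
    "accuracy": 0.91,
    "advantage": membership_advantage([0.99, 0.98, 0.97, 0.95],
                                      [0.70, 0.65, 0.90, 0.60]),
}
model_b = {
    "name": "B",
    "accuracy": 0.90,
    "advantage": membership_advantage([0.80, 0.75, 0.85, 0.70],
                                      [0.78, 0.74, 0.86, 0.69]),
}

chosen = pick_model([model_a, model_b])
print(model_a["advantage"], model_b["advantage"], chosen["name"])
```

In this toy data, model A separates training records from unseen records perfectly, so model B is preferred despite its slightly lower accuracy. A fuller assessment would also weigh attribute inference and feature sensitivity, as described above.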

[Figure: high-level schematic depiction of the data minimization process]
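The data minimization process can also be sketched in code. The example below generalizes a numeric feature into buckets, checks how often a model's predictions on the generalized data match its predictions on the original data, and keeps the coarsest bucket width that preserves agreement. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a trained classifier, and `generalize`, `agreement`, and `coarsest_width` are illustrative helper names rather than part of any published tool.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[int, int]  # (age, income)


def generalize(value: float, width: int) -> float:
    """Replace a number with the midpoint of its width-sized bucket."""
    lower = (value // width) * width
    return lower + width / 2


def toy_model(age: float, income: float) -> int:
    """Hypothetical stand-in for a trained classifier."""
    return int(age >= 40 and income >= 50_000)


def agreement(records: Sequence[Record], age_width: int) -> float:
    """Fraction of records whose prediction survives generalizing age."""
    same = sum(
        toy_model(age, income) == toy_model(generalize(age, age_width), income)
        for age, income in records
    )
    return same / len(records)


def coarsest_width(records: Sequence[Record], widths: Sequence[int],
                   min_agreement: float = 1.0) -> Optional[int]:
    """Largest bucket width that keeps agreement at or above the target."""
    ok = [w for w in widths if agreement(records, w) >= min_agreement]
    return max(ok) if ok else None


records: List[Record] = [
    (23, 30_000), (37, 80_000), (41, 52_000),
    (45, 90_000), (58, 40_000), (62, 75_000),
]

width = coarsest_width(records, widths=[5, 10, 20, 30])
print("chosen age bucket width:", width)

# Newly collected runtime data is minimized before the model sees it.
new_age, new_income = 44, 61_000
print(toy_model(generalize(new_age, width), new_income))
```

With this toy data, 20-year buckets leave every prediction unchanged, while 30-year buckets flip one prediction, so the sketch settles on a width of 20. Suppression, the other option mentioned above, would correspond to dropping a feature entirely when the model's output does not depend on it.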

[Figure: high-level schematic depiction of the anonymization process]
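One way to picture model-guided anonymization in code is shown below. The sketch assumes that the "knowledge encoded within a model" is simply the label an initial model predicts for each training record: records are grouped by predicted label, and within each group a quasi-identifier (here, age) is generalized into ranges that cover at least k records. This k-anonymity-style grouping is an assumption chosen for illustration, not necessarily the scheme under research, and the names `chunk`, `anonymize_ages`, and `initial_model` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List


def chunk(items: List[Dict], k: int) -> List[List[Dict]]:
    """Split a sorted list into cells of size k; fold a short tail into the previous cell."""
    cells = [items[i:i + k] for i in range(0, len(items), k)]
    if len(cells) > 1 and len(cells[-1]) < k:
        tail = cells.pop()
        cells[-1].extend(tail)
    return cells


def anonymize_ages(records: List[Dict], predict: Callable[[Dict], int],
                   k: int) -> List[Dict]:
    """Generalize 'age' into ranges shared by at least k records.

    Records are first grouped by the initial model's prediction, so each
    range only mixes records the model already treats alike. Groups with
    fewer than k records have their age suppressed ("*").
    """
    by_label = defaultdict(list)
    for rec in records:
        by_label[predict(rec)].append(rec)

    anonymized = []
    for group in by_label.values():
        group.sort(key=lambda r: r["age"])
        for cell in chunk(group, k):
            if len(cell) < k:
                age_range = "*"
            else:
                age_range = f'{cell[0]["age"]}-{cell[-1]["age"]}'
            # Copy each record so the original training data is not modified.
            anonymized.extend({**r, "age": age_range} for r in cell)
    return anonymized


def initial_model(rec: Dict) -> int:
    """Hypothetical stand-in for a model trained on the raw data."""
    return int(rec["age"] >= 40)


ages = [23, 25, 31, 38, 41, 44, 47, 52, 60, 66]
training = [{"age": a, "income": 1_000 * a} for a in ages]

anonymized = anonymize_ages(training, initial_model, k=3)
print(Counter(r["age"] for r in anonymized))
```

A new model would then be trained on `anonymized` rather than on the raw records. Because the ranges never straddle the initial model's decision boundary in this toy example, the generalization removes detail that the model was not using to separate the classes.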


Abigail Goldsteen, IBM Research - Haifa