05 Feb 2021
Research
5 minute read

IBM researchers check AI bias with counterfactual text

From recruitment and credit-risk apps to uses in healthcare and the criminal justice system, AI unreliability has been giving researchers headaches for years.

We’ve tackled the issue by flipping it upside down.

Our team has developed an AI that verifies other AIs’ “fairness” by generating a set of counterfactual text samples and testing machine learning systems without supervision.

In our recent paper, “Generate Your Counterfactuals: Towards Controlled Counterfactual Generation for Text,” accepted to AAAI 2021, we describe how our software, dubbed GYC, generates test cases to check the reliability of AI models.1 GYC can evaluate other AIs for accuracy and gender bias, run a sensitivity analysis, and check a model’s adversarial robustness and its ability to cope with spurious correlations. It also verifies machine learning and natural language systems for trustworthiness.

What is counterfactual text, anyway?

Any text describes a scenario or a setting, and a counterfactual text is a synthetically generated variant of that text, altered in a targeted way so that an AI model is forced to treat it differently.

One well-known way to generate counterfactual texts is by using pre-defined templates and dictionaries; this is how, for instance, CheckList works.2 But in that case, the generated set of counterfactual samples is very rigid. GYC, instead, generates such samples in an unsupervised way, producing counterfactuals that are plausible, diverse, goal-oriented, and effective with respect to the user’s input.

To create the AI, we used different loss functions to make sure that the desired properties of counterfactuals are preserved during generation. To guide the generation, we used a GPT-2 decoder, a large pre-trained language model, steered by proximity and diversity constraints that change the input sentence in a variety of ways.
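
To give a rough sense of how such loss terms can work together, here is a minimal sketch in PyTorch. It is not the GYC implementation; all function names, tensor shapes and weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: combine proximity, condition, and diversity objectives
# into one loss that steers counterfactual generation.

def proximity_loss(orig_emb, cf_emb):
    # Keep each counterfactual close to the original sentence representation.
    # orig_emb: (dim,), cf_emb: (num_candidates, dim)
    return 1.0 - F.cosine_similarity(cf_emb, orig_emb.unsqueeze(0), dim=-1).mean()

def condition_loss(cf_logits, target_label):
    # Push the condition model (e.g. a sentiment classifier) toward the target label.
    return F.cross_entropy(cf_logits, target_label)

def diversity_loss(cf_emb):
    # Penalize counterfactual candidates that are too similar to one another.
    sims = F.cosine_similarity(cf_emb.unsqueeze(1), cf_emb.unsqueeze(0), dim=-1)
    off_diagonal = sims - torch.eye(sims.size(0))
    return off_diagonal.clamp(min=0).mean()

def total_loss(orig_emb, cf_emb, cf_logits, target_label,
               w_prox=1.0, w_cond=1.0, w_div=0.5):
    # Weighted sum of the three objectives; the weights are hypothetical.
    return (w_prox * proximity_loss(orig_emb, cf_emb)
            + w_cond * condition_loss(cf_logits, target_label)
            + w_div * diversity_loss(cf_emb))
```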

The main goal of our AI is to drive generation around a specific condition, such as sentiment. To enforce the condition, we assume that we have access to a function for sentiment that takes the text and returns the probability of the text being positive or negative. This function could either be available to us openly or as a black box, with hidden contents. We propose different ways to deal with “black-box” and “white-box” access to the condition model.
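
The distinction matters for how the condition can steer generation. A hedged sketch of the two access modes follows; the interfaces and names are assumptions made for illustration, not the paper’s API.

```python
import torch.nn.functional as F

def white_box_condition_loss(condition_model, counterfactual_embedding, target_label):
    # White-box access: the condition classifier is differentiable, so the
    # gradient of this loss can flow back into the generation process.
    logits = condition_model(counterfactual_embedding)
    return F.cross_entropy(logits, target_label)

def black_box_condition_reward(score_fn, counterfactual_text, target="positive"):
    # Black-box access: only the output probability is observable, so it is
    # used as a reward to rank or reweight candidate counterfactuals rather
    # than being differentiated through. score_fn is assumed to return a dict
    # mapping labels to probabilities.
    return score_fn(counterfactual_text)[target]
```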

So how does it work?

Consider the text “my boss is a man.” A counterfactual text could then be “my boss is a woman.” A hypothetical sentiment model might give a “positive” label to the first statement and a “negative” one to the second. By performing this intervention, GYC’s counterfactual text tests the reliability of the sentiment model under evaluation.

The system’s output indicates that after GYC changes “man” to “woman,” the sentiment exhibited by the model changes. Typically, for a condition model, changing some minimal part of the text — in this case, gender — shouldn’t impact the output sentiment label at all. But it does — and that’s where our AI comes in. Counterfactual samples can be fed as training data for data augmentation algorithms, and used to de-bias the underlying sentiment model.
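
To make the idea concrete, here is a small illustrative check, separate from GYC itself and assuming the Hugging Face transformers sentiment pipeline, that flags a label flip between an original sentence and its counterfactual:

```python
from transformers import pipeline

# Compare a sentiment model's predictions on an original sentence and on a
# counterfactual that changes only the protected attribute (here, gender).
sentiment = pipeline("sentiment-analysis")

original = "my boss is a man"
counterfactual = "my boss is a woman"

pred_orig = sentiment(original)[0]
pred_cf = sentiment(counterfactual)[0]

# A flip on such a minimal change suggests the model is sensitive to gender.
if pred_orig["label"] != pred_cf["label"]:
    print(f"Label flip: {pred_orig['label']} -> {pred_cf['label']}")
```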

Having run our experiments on three datasets, we observed that GYC achieves a high label-flip score, meaning that a large fraction of its counterfactuals belong to a different class than the input sentence. GYC does this while maintaining diversity and preserving the semantic content and syntactic structure of the input sentence.
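
The label-flip score itself is straightforward to compute once a set of counterfactuals is available. A minimal sketch, assuming a classifier with a pipeline-style interface that returns a list of label dictionaries:

```python
def label_flip_score(classifier, original, counterfactuals):
    # Fraction of counterfactuals whose predicted label differs from the
    # label the classifier assigns to the original sentence.
    original_label = classifier(original)[0]["label"]
    flips = sum(classifier(cf)[0]["label"] != original_label for cf in counterfactuals)
    return flips / len(counterfactuals)
```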

For example, consider this text input to a named-entity recognition (NER) model: “My friend lives in beautiful London.” GYC could then generate high-quality counterfactual samples, such as “My friend lives in majestic downtown Chicago,” “My friend lives in gorgeous London,” or “My friend lives in the city of New Orleans.”

This means that GYC can generate variations of the “location” tag by generating a diverse set of counterfactual samples. With these samples, it’s possible to check AI reliability by analyzing the difference in behavior of a given model on input and the counterfactual set. Samples like that could be training data for de-biasing any model that differentiates on the basis of location.
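
As an illustration of such a check, assuming the Hugging Face transformers NER pipeline, one could verify that the model still recognizes a location entity in every counterfactual. Entity label names such as "LOC" depend on the underlying model and are an assumption here.

```python
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")

original = "My friend lives in beautiful London."
counterfactuals = [
    "My friend lives in majestic downtown Chicago",
    "My friend lives in gorgeous London",
    "My friend lives in the city of New Orleans",
]

# A location entity should be detected in every minimally changed input;
# missing or changed entity types indicate that the model's behavior depends
# on the particular location mentioned.
for text in [original] + counterfactuals:
    entity_types = {e["entity_group"] for e in ner(text)}
    print(text, "->", "LOC" in entity_types)
```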

GYC is the first method that generates test cases by changing multiple elements in text without any rule-based cues. The research is still ongoing and we are now trying to improve the reconstruction step — currently expensive to run for sentences longer than about 15 words. We are also working on getting GYC to generate counterfactuals using multiple condition models, which should significantly improve the automatic counterfactual generation.

Where could GYC be used?

Our GYC model could help perform behavioral checks on natural language processing models. Such test cases complement the traditional test cases designed by software engineers, and become increasingly relevant as NLP algorithms see wider adoption. GYC can be easily adapted to test any classification model, even with only black-box access to it: one can plug in any score function and generate test cases around a specific condition, as sketched below.
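
A sketch of that interface, with names chosen purely for illustration: any function that maps text to a condition score can drive the selection of behavioral test cases.

```python
from typing import Callable, List

# Any callable mapping text to the probability of the condition of interest.
ScoreFn = Callable[[str], float]

def select_test_cases(original: str, candidates: List[str],
                      score_fn: ScoreFn, threshold: float = 0.5) -> List[str]:
    # Keep candidate rewrites whose condition score moves far from the
    # original's; these become behavioral test cases for the model under test.
    baseline = score_fn(original)
    return [c for c in candidates if abs(score_fn(c) - baseline) >= threshold]
```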

GYC could also be beneficial for data augmentation and counterfactual logit pairing algorithms, which require counterfactual versions of the training data to de-bias language models for sentiment. Such samples should satisfy a specific condition, for example the presence of a protected attribute like gender, age or race. While these techniques claim to be highly successful, in practice getting enough data for a given protected attribute is tricky. GYC could boost the performance of these algorithms by generating counterfactual text samples of high quality, as sketched below.
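
For instance, counterfactual logit pairing adds a penalty that keeps a model’s outputs close on an example and its counterfactual. A minimal sketch of such a penalty term, written here as an illustration rather than as GYC or the original algorithm’s code:

```python
import torch

def counterfactual_logit_pairing_penalty(logits_original, logits_counterfactual):
    # Penalize the gap between the model's logits on an example and on its
    # counterfactual, so predictions stay consistent when only the protected
    # attribute changes.
    return torch.mean(torch.abs(logits_original - logits_counterfactual))

# Typically added to the usual task loss with a weight, e.g.:
# loss = task_loss + clp_weight * counterfactual_logit_pairing_penalty(
#     model(x_original), model(x_counterfactual))
```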

Finally, models that require explainability could benefit from our research as well. Blindly following the decisions of AI models has triggered issues with AI fairness, reliability and privacy, leading to the emergence of explainability in AI. GYC could generate textual explanations for a given input and a given model, helping to identify and fix ethical issues in AI models.

IBM Research AI is proudly sponsoring AAAI 2021 as a Platinum Sponsor. We will present 40 main track papers, in addition to at least seven workshop papers, 10 demos, four IAAI papers, and one tutorial. IBM Research AI is also co-organizing three workshops. We hope you can join us from February 2-9 to learn more about our research. To view our full presence at AAAI 2021, visit here.

Learn more about:

Fairness, Accountability, Transparency: To increase the accountability of high-risk AI systems, we're developing technologies to increase their end-to-end transparency and fairness.

References

  1. Madaan, N., Padhi, I., Panwar, N., & Saha, D. (2021). Generate your counterfactuals: Towards controlled counterfactual generation for text. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15), 13516–13524.

  2. Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–4912.