Diversity in Faces Dataset

Diversity in Faces (DiF) is a large and diverse dataset that seeks to advance the study of fairness and accuracy in facial recognition technology. The first of its kind available to the global research community, DiF provides annotations of 1 million human facial images.

How do we measure and ensure diversity for human faces in AI systems?

We are familiar with how faces differ by age, gender, and skin tone, and how different faces can vary across some of these dimensions. But, as prior studies have shown, these dimensions are not adequate for characterizing the full diversity of human faces. Dimensions such as face symmetry, facial contrast, the pose of the face, and the length or width of facial attributes (eyes, nose, forehead, etc.) are also important.

For facial recognition systems to perform as desired, and for their outcomes to become increasingly accurate, training data must be diverse and offer breadth of coverage. For example, training datasets must be large enough and varied enough that the technology learns all the ways in which faces differ and can accurately recognize those differences in a variety of situations. The images must reflect the distribution of facial features we see in the world.
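To make "diverse" and "breadth of coverage" concrete, here is a minimal sketch of one common way to score how evenly a dataset covers a single dimension: Shannon entropy and its normalised form, evenness. The choice of measure, the function names, and the age-group labels are assumptions for illustration only; the text above does not say which measures were applied to DiF.

```python
import math
from collections import Counter


def shannon_diversity(labels):
    """Shannon entropy H = -sum(p_i * ln(p_i)) over the observed categories."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def shannon_evenness(labels):
    """Entropy divided by ln(number of observed categories).

    1.0 means every observed category is equally represented;
    values near 0 mean one category dominates.
    """
    k = len(set(labels))
    if k <= 1:
        return 0.0  # a single category (or no data) has no spread to measure
    return shannon_diversity(labels) / math.log(k)


# Hypothetical age-group labels for two toy datasets of 300 faces each.
balanced = ["young", "middle", "older"] * 100
skewed = ["young"] * 280 + ["middle"] * 15 + ["older"] * 5

print(round(shannon_evenness(balanced), 3))
print(round(shannon_evenness(skewed), 3))
```

The balanced set scores close to 1.0 and the skewed set scores much lower, even though both contain the same three categories. Evenness only describes the categories that are present, so it would need to be paired with a check that no expected category is missing entirely.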

To help accelerate the study of diversity and coverage of data for AI facial recognition systems, IBM Research has released a large and diverse dataset called Diversity in Faces (DiF) to advance the study of fairness and accuracy in facial recognition technology.

Dataset highlights

1 million images of human faces

1 million images of human faces from the publicly available YFCC-100M Creative Commons dataset.

Scientifically annotated facial features

The faces are annotated using 10 well-established and independent coding schemes from the scientific literature [1-10]. The coding schemes principally include objective measures of human faces, such as craniofacial features (e.g., head length, nose length, forehead height).
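As an illustration of what an objective craniofacial measure can look like in practice, the sketch below derives a nose length and a forehead height as distances between 2D facial landmarks. The landmark pairs, pixel coordinates, and function name are hypothetical and chosen only for illustration; they are not taken from the DiF coding schemes.

```python
import math


def landmark_distance(landmarks, a, b):
    """Euclidean distance in pixels between two named 2D landmarks."""
    (ax, ay), (bx, by) = landmarks[a], landmarks[b]
    return math.hypot(bx - ax, by - ay)


# Hypothetical landmark positions (x, y) in image pixels for one face.
landmarks = {
    "trichion": (100.0, 60.0),    # hairline midpoint
    "nasion": (100.0, 120.0),     # bridge of the nose
    "subnasale": (100.0, 160.0),  # base of the nose
}

# Assumed landmark pairs for each measure.
forehead_height = landmark_distance(landmarks, "trichion", "nasion")
nose_length = landmark_distance(landmarks, "nasion", "subnasale")

# A ratio of two distances does not depend on image scale,
# unlike the raw pixel distances themselves.
nose_to_forehead = nose_length / forehead_height

print(forehead_height, nose_length, round(nose_to_forehead, 3))
```

Raw pixel distances depend on image resolution and on how far the face is from the camera, so ratios between measures are usually easier to compare across images than the distances themselves.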

Advancing study of fairness and accuracy

Studying diversity in faces is complex. The dataset provides a jumping-off point for the global research community to further our collective knowledge.

How to access the DiF dataset

Our initial analysis has shown that the DiF dataset provides a more balanced distribution and broader coverage of facial images compared to previous datasets. Furthermore, the insights obtained from the statistical analysis of the 10 initial coding schemes on the DiF dataset have furthered our own understanding of what is important for characterizing human faces and enabled us to continue important research into ways to improve facial recognition technology.

The dataset is available today to the global research community upon request. IBM is proud to make this available, and our goal is to help further our collective research and contribute to creating AI systems that are fairer.

Steps to gain access

Step 1

Review the DiF Terms of Use and Privacy Notice.


Terms of use

DiF Privacy Notice

Step 2

Download and complete the questionnaire.


DiF Questionnaire(PDF)

Step 3

Email the completed questionnaire to IBM Research.


Step 4

IBM Research will provide further instructions by email once the application is approved.

Important notices

The IBM Research Diversity in Faces Dataset and any use of it are subject to the IBM Research DiF Dataset Terms of Use.

If you have any questions, comments, or concerns related to the IBM Research DiF Dataset or the project, contact us at:

Important documents for the DiF Dataset

Terms of use

DiF Privacy Notice

DiF Questionnaire